You can download this blog post’s source (implemented in Coq using the HoTT library). Learn more about HoTTSQL by visiting our website.
Relations
The basic datatype in SQL is a relation, which is a bag (i.e., multiset) of tuples with the same given schema. You can think of a tuple’s schema as being like a variable’s type in a programming language. We formalize a bag of some type A as a function that maps every element of A to a type. The type’s cardinality indicates how many times the element appears in the bag.
Definition Bag A := A -> Type.
Definition numbers : Bag nat :=
fun n => match n with
| 7 => Bool
| 42 => Unit
| _ => Empty
end.
Definition bagUnion {A} (r1 r2:Bag A) : Bag A :=
fun (a:A) => r1 a + r2 a.
Lemma bag_union_symm {A} (r1 r2 : Bag A) :
bagUnion r1 r2 = bagUnion r2 r1.
Proof.
unfold bagUnion.
by_extensionality a.
(* r1 a + r1 a = r2 a + r2 a *)
apply path_universe_uncurried.
(* r1 a + r1 a <~> r2 a + r2 a *)
apply equiv_sum_symm.
Qed.
Many SQL rewrite rules simplify to an expressions built using the operators of this semi-ring (e.g. r1 a + r1 a = r2 a + r2 a above), and could thus be potentially solved or simplified using a ring tactic (see). Unfortunately, Coq’s ring tactic is not yet ported to the HoTT library. Porting ring may dramatically simplify many of our proofs (Anyone interested in porting the ring tactic? Let us know).
It is reasonable to assume that SQL relations are bags that map tuples only to 0-truncated types (types with no higher homotopical information), because real-world databases’ input relations only contain tuples with finite multiplicity (Fin n is 0-truncated), and because SQL queries only use type operators that preserve 0-truncation. However, HoTTSQL does not requires this assumption, and as future work, it may be interesting to understand what the “cardinality” of a type with higher homotopical information means.
Previous work has modeled bags as lists (e.g., as done by Malecha et al.), where SQL queries are recursive functions over input lists, and two bags are equal iff their underlying lists are equal up to element reordering. Proving two queries equal thus requires induction on input lists (including coming up with induction hypothesis) and reasoning about list permutations. In contrast, by modeling bags as functions from tuples to types, proving two queries equal just requires proving the equality of two HoTT types.
In the database research community, prior work has modeled bags as functions to natural numbers (e.g., as done by Green et al.). Using this approach, one cannot define the potentially infinite sum ∑ a, r a that counts the number of elements in a bag r. This is important since a basic operation in SQL, projection, requires counting all tuples in a bag that match a certain predicate. In contrast, by modeling bags as functions from tuples to types, we can count the number of elements in a bag using the sigma type ∑, where the cardinality of the sigma type ∑ a, r a is equal to the sum of the cardinalities of r a for all a.
Schemas
Traditionally, a relation is modeled as a bag of n-ary tuples, and a relation’s schema both describes how many elements there are in each tuple (i.e., n), and the the type of each element. Thus, a schema is formalized as a list of types.
In HoTTSQL, a relation is modeled as a bag of nested pairs (nested binary-tuples), and a relation’s schema both describes the nesting of the pairs and the types of the leaf pairs. In HoTTSQL, a schema is thus formalized as a binary tree, where each node stores only its child nodes, and each leaf stores a type. Our formalization of schemas as trees and tuples as nested pairs is unconventional. We will see later how this choice simplifies reasoning.
Inductive Schema :=
| node (s1 s2 : Schema)
| leaf (T:Type)
.
We formalize a tuple as a function Tuple that takes a schema s and returns a nested pair which matches the tree structure and types of s.
Fixpoint Tuple (s:Schema) : Type :=
match s with
| node s1 s2 => Tuple s1 * Tuple s2
| leaf T => T
end.
Finally, we formalize a relation as a bag of tuples that match a given schema s.
Definition Relation (s:Schema) := Bag (Tuple s).
Queries
Recall that a SQL query maps one or more input relations to an output relation, and that we can implement SQL queries with operations on bags. In this section, we incrementally introduce various SQL queries, and describe their semantics in terms of bags.
Union and Selection
The following subset of the SQL language supports unioning relations, and selecting (i.e., filtering) tuples in a relation.
Inductive SQL : Schema -> Type :=
| union {s} : SQL s -> SQL s -> SQL s
| select {s} : Pred s -> SQL s -> SQL s
(* … *)
.
Fixpoint denoteSQL {s} (q : SQL s) : Relation s :=
match q with
| union _ q1 q2 => fun t => denoteSQL q1 t + denoteSQL q2 t
| select _ b q => fun t => denotePred b t * denoteSQL q t
(* … *)
end.
To syntactically resemble SQL, we write q1 UNION ALL q2 for union q1 q2, q WHERE b for select b q, and SELECT * FROM q for q. We write ⟦q⟧ for the denotation of a query denoteQuery q, and ⟦b⟧ for the denotation of a predicate denotePred b.
To prove that two SQL queries are equal, one has to prove that their two denotations are equal, i.e., that two bags returned by the two queries are equal, given any input relation(s). The following example shows how we can prove that selection distributes over union, by reducing it to showing the distributivity of * over + (lemma sum_distrib_l).
Lemma proj_union_distr s (q1 q2 : SQL s) (p:Pred s) :
⟦ SELECT * FROM (q1 UNION ALL q2) WHERE p ⟧ =
⟦ (SELECT * FROM q1 WHERE p) UNION ALL
(SELECT * FROM q2 WHERE p) ⟧.
Proof.
simpl.
by_extensionality t.
(* ⟦p⟧ t * (⟦q1⟧ t + ⟦q2⟧ t) = ⟦p⟧ t * ⟦q1⟧ t + ⟦p⟧ t * ⟦q2⟧ t *)
apply path_universe_uncurried.
apply sum_distrib_l.
Qed.
Duplicate Elimination, Products, and Projections
So far, we have seen the use of homotopy types to model SQL relations, and have seen the use of the univalence axiom to prove SQL rewrite rules. We now show the use of truncation to model the removal of duplicates in SQL relations. To show an example of duplicate removal in SQL, we first have to extend our semantics of the SQL language with more features.
Inductive Proj : Schema -> Schema -> Type :=
| left {s s’} : Proj (node s s’) s
| right {s s’} : Proj (node s’ s) s
(* … *)
.
Inductive SQL : Schema -> Type :=
(* … *)
| distinct {s} : SQL s -> SQL s
| product {s1 s2} : SQL s1 -> SQL s2 -> SQL (node s1 s2)
| project {s s’} : Proj s s’ -> SQL s -> SQL s’
(* … *)
.
Fixpoint denoteProj {s s’} (p : Proj s s’) : Tuple s ->
Tuple s’ :=
match p with
| left _ _ => fst
| right _ _ => snd
(* … *)
end.
Fixpoint denoteSQL {s} (q : SQL s) : Relation s :=
match q with
(* … *)
| distinct _ q => fun t => ║ denoteSQL q t ║
| product _ _ q1 q2 => fun t => denoteSQL q1 (fst t) *
denoteSQL q2 (snd t)
| project _ _ p q => fun t’ => ∑ t, denoteSQL q t *
(denoteProj p t = t’)
(* … *)
end.
The query product q1 q2 creates the cartesian product of q1 and q2, i.e., it returns a bag that maps every tuple consisting of two tuples t1 and t2 to the number of times t1 appears in q1 multiplied by the number of times t2 appears in q2.
The query project p q projects elements from each tuple contained in the query q. The projection is defined by p, and is denoted as a function that takes a tuple of some schema s and returns a new tuple of some schema s’. For example, left is the projection that takes a tuple and returns the tuple’s first element. We assume that tuples have no higher homotopical information, and that equality between tuples is thus (-1)-truncated.
Like before, we write DISTINCT q for distinct q, FROM q1, q2 for product q1 q2, and SELECT p q for project p q. We write ⟦p⟧ for the denotation of a projection denoteProj p.
Projection of products is the reason HoTTSQL must model schemas as nested pairs. If schemas were flat n-ary tuples, the left projection would not know which elements of the tuple formerly belonged to the left input relation of the product, and could thus not project them (feel free to contact us if you have ideas on how to better represent schemas)
Projection requires summing over all tuples in a bag, as multiple tuples may be merged into one. This sum is over an infinite domain (all tuples) and thus cannot generally be implemented with natural numbers. Implementing it using the ∑ (sigma) type is however trivial.
Equipped with these additional features, we can now prove the following rewrite rule.
Lemma self_join s (q : SQL s) :
⟦ DISTINCT SELECT left FROM q, q ⟧ =
⟦ DISTINCT SELECT * FROM q ⟧.
To prove the equivalence of any two (-1)-truncated types ║ q1 ║ and ║ q2 ║, it suffices to prove the bi-implication q1 <-> q2 (lemma equiv_iff_trunc). This is one of the cases where concepts from HoTT simplify formal reasoning in a big way. Instead of having to apply a series of equational rewriting rules (which is complicated by the fact that they need to be applied under the variable bindings of Σ), we can prove the goal using deductive reasoning.
Proof.
simpl.
by_extensionality t.
(* ║ ∑ t’, ⟦q⟧ (fst t’) * ⟦q⟧ (snd t’) * (fst t’ = t) ║ =
║ ⟦q⟧ t ║ *)
apply equiv_iff_trunc.
split.
– (* ∃ t’, ⟦q⟧ (fst t’) ∧ ⟦q⟧ (snd t’) ∧ (fst t’ = t) →
⟦q⟧ t *)
intros [[t1 t2] [[h1 h2] eq]].
destruct eq.
apply h1.
– (* ⟦q⟧ t →
∃ t’, ⟦q⟧ (fst t’) ∧ ⟦q⟧ (snd t’) ∧ (fst t’ = t) *)
intros h.
exists (t, t).
(* ⟦q⟧ t ∧ ⟦q⟧ t ∧ (t = t) *)
split; [split|].
+ apply h.
+ apply h.
+ reflexivity.
Restart.
conjuctiveQueryProof.
Qed.
Conclusion
We have shown how concepts from HoTT have enabled us to develop HoTTSQL, a SQL semantics that makes it easy to formally reason about SQL query rewrites.
We model bags of type A as a function A -> Type. Bags can be proven equal using the univalence axiom. In contrast to models of bags as list A, we require no inductive or permutation proofs. In contrast to models of bags as A -> nat, we can count the number of elements in any bag.
Duplicate elimination in SQL is implemented using (-1)-truncation, which leads to clean and easily automatable deductive proofs. Many of our proofs could be further simplified with a ring tactic for the 0-trucated type semi-ring.
Visit our website to access our source code, learn how we denote other advanced SQL features such as correlated subqueries, aggregation, advanced projections, etc, and how we proved complex rewrite rules (e.g., magic set rewrites).
Contact us if you have any question, feedback, or know how to improve HoTTSQL (e.g., you know how to use more concepts from HoTT to extend HoTTSQL).
This is an interesting application of HoTT!
It seems that for the current results it would be enough to consider a theory with a single univalent universe of sets (aka the Hofmann-Streicher groupoid model, which had been known before “HoTT” came up). Is that right? I am curious whether it is possible to get a significant advantage from unrestricted types. I hope we can read about it here when you find out 🙂
I found it indeed a bit surprising that you formalise a schema as a tree. I guess I would have tried vectors first (maybe implemented as maps from some finite type Fin n to the universe). Any (order-preserving) projection to a subset of the components would then just be given as a composition with an injective (and order-preserving) map between finite sets. This would probably make some proofs a bit harder but it would avoid the necessity of nesting ‘lefts’ and ‘rights’ (the nesting seems a bit arbitrary to me).
This seems like an interesting application. A quick comment about bags. Are you aware of the [work](http://link.springer.com/chapter/10.1007%2F978-3-642-32347-8_11) by Nils Anders Danielson?
This may be relevant:
Type theoretical databases
Henrik Forssell, Håkon Robbestad Gylterud, David I. Spivak
https://arxiv.org/abs/1406.6268
Is “r1 a + r1 a = r2 a + r2 a” in “bag_union_symm” a typo? (I think it should be “r1 a + r2 a = r2 a + r1 a”). It also appeared in the following paragraph.
Pingback: Cosette: An Automated SQL Solver | Artificia Intelligence
Pingback: How do the unit intervals compose when generating higher order homotopies? - PhotoLens