Relational Databases Quick Start
After going through this tutorial, one can expect to have gained the core practical understanding of how to construct Entity framework queries and execute them. Some more advanced material, subtleties and corner cases have been intentionally left out, to keep the tutorial of a reasonably small size and introductory level. However, most core query-building machinery is covered and illustrated with examples, which should be sufficient to get started.
Relational databases are an important and widely used means for information storage, processing and lookup. The associated technology has become very mature over the last decades, and modern relational database backends have exceptional capabilities for storing and processing structured data, from small to very large data volumes. It is important, therefore, to be able to seamlessly integrate relational databases into one's workflows, for doing work in a number of areas, from data science to web applications.
One important aspect of this is the ability to process even extremely large datasets on the database side, "out of core". Most common relational database backends have been heavily optimized and apply cutting-edge technology to be able to process such large datasets very efficiently.
In the context of the Wolfram Language, the ability to offload the computations on large datasets to a database engine is particularly important, since many such computations do not require the full computational power of the Wolfram Language, while the size of the resulting dataset can be considerably reduced, so that it can be further processed with ease within the Wolfram Language.
Direct integration of relational databases into one's work may get complicated for a number of reasons. First, many domain experts do not have the knowledge of SQL that is required for direct use of relational databases. Second, there are (often subtle) differences between SQL dialects and supported feature sets for different database backends in common use. Third, one has to build data pipelines manually, connecting the database to one's programming environment. For these reasons, tight integration of relational databases directly into the language is very beneficial. It allows the users to focus on solving their problems rather than on providing the technical infrastructure.
The very high-level symbolic nature of the Wolfram Language sets stiff requirements for the framework that integrates relational databases into the language. It has to be high level, symbolic and idiomatic to use, yet exposing enough structure to not limit the query building and other capabilities too much with respect to raw SQL. In particular, idiomatic Wolfram Language programming gravitates toward functional programming style, function composition and immutability—the features that the connectivity framework is therefore required to embrace and support.
As it turns out, the Entity framework does largely satisfy the requirements outlined. While there are no objects in the OOP sense in the Wolfram Language, the Entity framework for the Wolfram Language is, in the context of relational databases, in many ways similar to what object-relational mappers (ORMs) are for other languages. Somewhat oversimplifying, entities map to single rows of database tables, entity classes map to database tables (existing or virtual), and entity properties map to database columns.
As a result, for the vast number of cases when one needs to query a database, it is possible to stay within the Wolfram Language and communicate with the database via high-level Entity framework queries, constructed using Entity framework functional query language. The framework is then responsible for translating the query into an appropriate SQL dialect, executing it on the database side and returning the results back to the user in the appropriate format.
For a long time, the main tool to interact with relational databases in the Wolfram Language has been DatabaseLink. It is built on top of J/Link and JDBC and provides a full-featured toolbox to interact with relational databases.
It is important to understand the differences between the DatabaseLink toolbox and the technology discussed in this tutorial. The most important differences are in the the level of abstraction offered by either one of these tools, and the supported features for interacting with databases.
The toolbox provided by DatabaseLink is rather low level: while it has some features that allow the database queries to be constructed symbolically in simple cases, the majority of real-world queries would have to be written as SQL strings when used with DatabaseLink. This has a number of implications for the user, from the need for familiarity with SQL (and a particular SQL dialect that corresponds to a specific database backend used) to the inability to use the strong points of the Wolfram Language in terms of its symbolic capabilities and high-level abstractions. All this leads to an incomplete, low-level integration of the databases into the Wolfram Language.
The technology based on the Entity framework has been designed from the ground up to provide a high-level seamless integration of relational databases into the Wolfram Language and contains features (such as symbolic query language, relations, automatic SQL generation and backend specialization, built-in type-checking, etc.) that enable considerably more powerful and high-level workflows involving relational databases in the Wolfram Language.
On the other hand, DatabaseLink supports a number of core features that are required for common workflows involving databases, such as write operations (SQL INSERT / UPDATE / DELETE), transaction support, control over timeouts, connection pools, etc. It also provides low-level access to database result sets for efficiency purposes. All of these the Entity framework-based technology is currently lacking, so one can say that DatabaseLink is at present more feature-complete.
There is currently no deep interoperability between the two technologies, other than the fact that both are integrated into the Wolfram Language. The level of interoperability may become better in future versions of the Wolfram Language.
The second section reviews and illustrates key features of the Entity framework in the context of relational databases.
The third section goes through core query-building blocks that the Entity framework provides to build more complex queries.
The fourth section contains a brief coverage of some additional useful constructs and tools, which may be considered somewhat more advanced, but are extremely useful in practice.
The fifth section illustrates what kind of SQL can be typically generated for various query-building primitives, supported by the Entity framework.
The sixth section describes how one can programmatically generate queries and illustrates with examples why this may be beneficial.
The seventh section describes a few practical techniques of query construction.
The eighth section contains a brief coverage of error handling and typical errors one may encounter when constructing and running queries.
The final section contains several examples of more complex queries, illustrating how one can use different query building blocks together to construct more sophisticated queries.
Examples in this tutorial are based on sample public domain database Classic Models (in particular its SQLite version), which is a retailer of scale models of classic cars database.
- offices table has a 1 to N relationship with employees table: there are generally many employees working for the same office. The relationship is realized via the foreign key officeCode of the employees table corresponding to the primary key officeCode of the offices table.
- employees table has a 1 to N relationship with itself: in general, each employee may have 0, 1 or more other employees reporting to them. The relationship is realized via the foreign key reportsTo, linked to the primary key employeeNumber of the same employees table.
- employees table has 1 to N relationship with customers table: each employee may deal with 0, 1 or more customers, while each customer is always dealt with by at most one employee. The relationship is realized via the foreign key salesRepEmployeeNumber of the customers table, which corresponds to the primary key employeeNumber of the employees table.
- customers table has a 1 to N relationship with payments table: each customer may have made 0, 1 or more payments. The relationship is realized via the foreign key customerNumber of the payments table, which corresponds to the primary key customerNumber of the customers table.
- customers table has a 1 to N relationship with orders table: each customer may have placed 0, 1 or more orders. The relationship is realized via the foreign key customerNumber of the orders table, which corresponds to the primary key customerNumber of the customers table.
- orders table has a 1 to N relationship with orderdetails table: each order consists of 1 or more items, with each item of distinct product type having a single record in orderdetails table. The relationship is realized via the foreign key orderNumber of orderdetails table, which corresponds to the primary key orderNumber of the orders table.
- orderdetails table has an N to 1 relationship with products table: each product may have been ordered 0, 1 or more times. The relationship is realized via foreign key productCode of orderdetails table, which corresponds to primary key productCode of products table.
- products table has an N to 1 relationship with productlines table: each product line may have 1 or more products belonging to it. The relationship is realized via foreign key productLine of products table, which corresponds to primary key productLine of productlines table.
- Establish a connection to the database, using DatabaseReference.
- Create a RelationalDatabase object, using that connection (this operation inspects the database and retrieves database metadata about the database schema etc.).
- Register that EntityStore object with the Entity framework.
As described in the previous section, one of the steps required to create the database-backed EntityStore consists of creating the RelationalDatabase object. However, this object is also useful in its own right. It contains information about database schema (tables, columns, constraints, etc.) and can be used to both visually inspect and programmatically extract parts of this information that one is interested in.
The formatting for the RelationalDatabase object makes it easy to visually inspect the database structure, using hierarchical group openers, which can be expanded to inspect more details about a given table or column:
For more details about the various properties and methods, consult the reference page for RelationalDatabase object.
This section covers important core operations supported by the Entity framework, in the way they are used in the context of relational databases. It is not primarily focused on query construction, which has a separate section dedicated to it. Rather, it lays the groundwork and discusses other important aspects, many of which are prerequisites for effective work with relational databases through the Entity framework.
To use the Entity framework effectively in one's work with relational databases, it is important to have at least a basic understanding of how the concepts and constructs of the Entity framework correspond to the core concepts and constructs of relational databases. This topic is discussed first.
Query execution and the so-called "resolvers" (functions that are used to actually transform symbolically represented queries into specific results by initiating the process of query compilation and execution and returning the results) are discussed next.
Next, computed properties and EntityFunction are covered. These are very important building blocks that allow one to define and compute new properties on the fly, where such new properties may require complex computations performed on the database side.
In a number of cases, being able to work with single entities is important. This can be useful for various purposes, from better visualization and understanding of the results to query construction (for debugging and prototyping purposes, it is very valuable to be able to execute parts of the query on a single entity). A brief account of this topic is given next.
For relational databases, the notion of a primary key for a database table is a central one. The Entity framework counterpart for this is the CanonicalName property for single entities. The other part that, together with the canonical name, defines a single entity, is a type of that entity. These topics are considered important enough to merit a separate section.
While most modern workflows involving relational databases deal with database tables that do have a primary key, cases when a table either does not have one, or when it does but that is not enforced on the database schema level, are not unheard of. These are even more practically important in the context of the Entity framework, since for such tables/entity types, only a part of the Entity framework's functionality will work. This issue is discussed next.
While this tutorial does not focus on types and the type system implicitly involved in the interaction with relational databases, there is one type distinction that is very important to understand for effective work. Namely, entity types may contain properties that are entity valued and entity-class valued. In this tutorial, such properties are called relations and are added by the framework to the set of core properties (properties directly corresponding to database table columns) for each entity type.
Since the Entity framework is used in the Wolfram Language—in particular, to represent relational databases—there clearly has to be a mapping for the core concepts and constructs between the Entity framework and relational databases/SQL.
For many features of the Entity framework, such a mapping is rather direct. In some ways, however, the Entity framework represents a richer data model. For example, for the in-memory use of the Entity framework, entity properties can have arbitrary Wolfram Language expressions as their values—which, if used directly, would not comply even with the first normal form for relational databases (e.g. when entity properties are List valued). As another example, the Entity framework is capable of representing types with an infinite number of entities, something that relational databases cannot easily do. Also, the set of easily possible computations is more limited for relational databases w.r.t. the Wolfram Language, the latter naturally having much richer computational functionality available "out of the box".
Restrictions of the relational model have their advantages as well, such as the ability to provide strong guarantees for data consistency, transactions and other related useful features of ACID-compliant databases. These restrictions also mean similar restrictions imposed on the Entity framework for database-backed entity stores.
|Database (schema)||EntityStore||A collection of entity types (database tables)|
|Database table||Entity type||A (handle to a) collection of entities with a similar set of properties (table rows)|
|Database table row||Entity (single entity)||A (handle to a) set of properties/values, representing a single "thing" (table row) with a unique identity|
|Database table field (column name)||EntityProperty||Specific property, usually having a specific type|
|Primary key||CanonicalName||A property or set of properties, guaranteed to be unique (when taken together for a set of properties case) for a given entity type (db table)|
|Foreign key||Entity-valued property||Entity property whose value is a single entity (can be of the same or different entity type). For databases, a field that points to a unique row in a table (can be the same or different table).|
|(Derived) table||EntityClass||A (handle to a) collecttion of entities of the same type (either registered or defined by the query). On the database side, the derived table is a virtual table/subquery, used in a FROM clause.|
As explained previously, this mapping has a certain asymmetry. It makes it straightforward to represent existing relational databases as entity stores. However, not all existing in-memory entity stores can be easily mapped to relational databases without creating extra abstraction layers. In particular, the following are some of the restrictions that exist:
- The values of the entity properties can only be of a specific limited set of types supported by a given database backend, for entities of a given type to be easily mapped to a relational database table rows.
- Some entity properties have qualifiers. To support such properties on the database side, extra structures would be required (e.g. qualifier values might need a separate database table, etc.).
While these considerations are more important when one needs to back existing in-memory entity stores by a relational database (the functionality not currently supported by the Entity framework), it is still useful to keep these in mind for a better understanding of the overall picture.
There are a few restrictions that work in the other direction too. For example, it is technically possible to have database tables without a primary key and work with those on the database side. However, on the Entity framework side, it is imperative that each single entity have a CanonicalName that is unique within a given entity class. This means that single entities cannot be defined for such database tables, and therefore all single-entity-related functionality of the Entity framework cannot work for such database tables/types. In particular, functions like EntityList and EntityValue with certain modifiers (like e.g. "EntityPropertyAssociation") will not work in such cases. This issue is further discussed in this section.
The main and most commonly used one is EntityValue. It is typically used to execute a given query and get the results in a variety of forms. In the context of relational databases, queries typically represent (virtual) database tables, which in the Entity framework correspond to (virtual) entity types. The role of EntityValue in this context is to compile a given query to the suitable SQL dialect, execute it on the database, and extract values for a certain set of entity properties (resulting table columns in database terms) in various forms (lists or associations of values).
Following are a few simple examples of use of EntityValue.
In some cases, one may want to obtain a list of entities contained in a given entity type/class. This can be accomplished with EntityList.
Note that, with an exception of certain special cases to be discussed in subsequent sections, the ordering of entities in the result is not guaranteed to be the same for repeated calls to EntityList or EntityValue.
In addition to extracting existing properties, one can also extract computed properties—that is, properties created on the fly, which may require complex computations on the database side. Such properties must be expressed using EntityFunction. The semantics of EntityFunction in this context is similar to Function, with a few important differences to be covered in subsequent sections. One can think of EntityFunction as a construct similar to EntityProperty that (in the context of this section) takes a single entity and returns the result of some computation performed with that entity. It is important that at present EntityFunction can only return a scalar, but cannot return e.g. a list of values, an entity or an entity class.
Following is an example of the use of EntityFunction to retrieve a property that requires computation on the database side.
EntityFunction has the HoldAll attribute, just like Function, to prevent premature evaluation of its variables and body. The body of EntityFunction is an expression, where one can use a limited set of primitives that EntityFunction can understand and compile to SQL. The full list of such primitives can be found in EntityFunction documentation, while many more examples of the practical use of EntityFunction can be found in subsequent sections.
Single entities are an important concept and building block in the Entity framework. The ability to work with single entities adds interactivity to one's workflow and often allows one to get a better understanding of the data, even in cases when the end result should be a query that works on many entities (entity class).
It is important to understand that single entities are essentially handles (or references) to the actual data. They do not contain data, except for their type and canonical name. As such, they are lazy: every time one needs to extract data from an entity, one has to run a new query against it. This lazy nature of entities is a very useful feature of the Entity framework, since it allows one to work with entities in a more abstract fashion. However, it comes with a few caveats that one should be aware of. For example, should some properties of a particular entity change in between two successive executions of the same query, the returned results will generally be different too, reflecting those changes. It also may happen that an entity that was there in a given entity class at some point has been deleted and no longer exists. Of course, these complications can only occur for the data that is changing in time.
It is important to realize that one can extract computed properties for a single entity, using EntityFunction. This can be very useful for both quick lookups and prototyping more complex queries.
The last example also illustrates the property-like nature of EntityFunction.
In the Entity framework, each entity must be unique within its entity type/entity class. The other way to say it is that each entity must have a unique identity, and no entity class/entity type is allowed to contain duplicate entities.
One can extract the canonical name of an entity also using CanonicalName.
The entity type of an entity is always the first argument of Entity, but not always a string. In particular, in cases when the type is not the registered type, but a "run-time" type, implicitly defined by an Entity framework query, it will be that query itself.
- The entity belongs to the registered type, which corresponds to a database table with a composite primary key.
- The entity belongs to the "runtime" type, represented by CombinedEntityClass[...].
- The entity belongs to the "runtime" type, represented by AggregatedEntityClass[...], where entity groups for aggregation are formed based on more than one property.
The previous example belongs to the last category of cases, where the grouping properties are "city" and "country", while the earlier example with entities of "orderdetails" type belongs to the first category, with the primary key of "orderdetails" table being a composite one, made out of properties "orderNumber" and "productCode".
The case when certain database tables do not have the primary key constraints in the database schema might not be typical but is important in practice. Note that not in all such cases is a column with unique values actually absent from the table. However, at present there is no way to indicate such a column for the Entity framework to use as a primary key for such tables: all the information about the primary keys is currently obtained at database inspection time (when the RelationalDatabase object is constructed) and entirely from the existing database constraints.
What this means for the Entity-framework-based workflows is that for such tables, single entities can not be defined, since there is no clear way to attach a canonical name to a database row in such a table. This does not mean that one cannot do anything useful with such tables, but certain parts of the Entity framework's functionality will not work with those.
To illustrate the limitations existing for such tables in terms of Entity framework functionality, the Titanic database will be used, which is an SQLite database hosted in the Wolfram Cloud and based on the following Titanic dataset:
As can be seen from the structure of this dataset, there is no field or combination of fields there that would naturally work as a primary key. There is even no guarantee that there are no duplicate records in the dataset, since there are not even passenger names present in the data, but just the passenger's travel class, age, sex and whether or not they survived. Still, this dataset is a useful source of information.
The first step is to create a RelationalDatabase object:
The next step is to create and register the EntityStore object based on this database:
- Database tables with no primary key constraint in the database schema correspond to entity types with no CanonicalName defined. This means that one cannot define single entities for such types.
- It is possible to register such types in the EntityStore object; however, not all Entity framework functionality will work for them.
- While single entities cannot work for such types (and neither can functions such as EntityList and EntityValue for some modifiers), many forms of EntityValue still work, as well as all queries that do not involve single entities of that type.
Properties can be entity valued or entity-class valued. Such properties do not correspond to any of the database table columns of the corresponding database table, but have been added by the framework and represent related tables.
Note that such entity- and entity-class-valued properties have no direct database column origin, but have been generated by the framework based on the database schema (relations/foreign keys information).
One important observation to be made at this point is that entity-class-valued properties are not automatically resolved to lists of entities, so that EntityList has to be applied explicitly to their values if one wants to get the list of entities as a result.
Generated properties are further discussed in the section on relations.
For some entities, certain properties may not have values in the database table (values may be NULL). On the Wolfram Language side, this corresponds to Missing values.
Another type of input that results in Missing values is request values for invalid property names. In this case, however, the value will have a different reason for being missing: Missing["UnknownProperty",...].
First, property-like queries are covered. They are explained and illustrated with a number of examples concerning how one can significantly expand the set of available entity properties with computed properties, and how EntityFunction is used to express those computed properties.
The rest of this section is dedicated to various primitives that take entity classes (and other parameters) and return new (transformed) entity classes. The Entity framework provides such primitives for the following typical data-processing operations:
- Extending entities in an entity class with new properties, computed on the fly (SQL counterpart: SELECT list)
- Aggregating over entity class or groups of entities within it (SQL counterpart: aggregation/GROUP BY clause)
- Combining two entity classes together to form a new entity class, with entities having properties from both original entity classes (SQL counterpart: JOIN clause)
Property-like queries are Wolfram Language expressions used to build parts of Entity framework queries, which get compiled into SQL expressions in the process of query compilation and execution. The main construct used to construct property-like queries is EntityFunction.
Property-like queries are useful in a number of places, such as defining computed properties, predicates for filtering data, sorting criteria, aggregation and more. In this tutorial, some typical examples of them are considered.
To test for missing values, the MissingQ predicate can be used.
The operations described in the sections that follow, namely FilteredEntityClass, SortedEntityClass, SampledEntityClass, ExtendedEntityClass, AggregatedEntityClass and CombinedEntityClass, can all be called entity class transformers. They all take an entity class (or two in the case of CombinedEntityClass) and return a new entity class.
It is important to realize that these constructs are completely symbolic and inert. No actual work is performed when one of these is applied to an entity class argument. The resulting query stays symbolic, and one needs to call one of the resolver functions (EntityValue or EntityList) to actually perform the query.
When evaluated, the symbolic Entity framework query evaluates to itself, staying as an inert symbolic expression. One can still do a few useful things with it though, such as, for example, finding all entity properties for the entity class that this query represents (this does not require a call to the database).
One uses EntityValue to actually execute the query.
The symbolic and inert nature of Entity framework queries opens the way for programmatic query construction, since queries on their own are inert Wolfram Language expressions and can be constructed from smaller building blocks either manually or programmatically.
One of the most commonly needed operations is to filter database data according to certain criteria. In Entity framework, FilteredEntityClass is used for that purpose, with the filtering criteria expressed using EntityFunction and passed to FilteredEntityClass as a second argument.
The following examples illustrate typical uses of FilteredEntityClass.
Sorting entities (rows in the database table) according to some criteria is another commonly used operation. On the database side, this is achieved by using ORDER BY clause of SQL. On the Entity framework side, SortedEntityClass can be used to achieve that.
In the simplest case, one needs to sort with respect to the values of a single existing entity property (database column), in an ascending order. In this case, the string field name is passed to SortedEntityClass as a second argument.
One can sort by more than one property. In this case, values of the second property are used as tie-breakers for groups of items with identical value of the first property, and the same logic extends to the third etc. sorting property. One can attach the "Ascending" or "Descending" qualifiers for each sorting property separately, to control the sorting order within the (sub)groups of entities.
One can limit the number of results returned by SortedEntityClass, by utilizing its third argument.
In some cases, it is useful to pick only a certain number of entities from an entity class (or, in database terms, pick only a certain subset of rows from a table or virtual table). The way to do this in the Entity framework is to use SampledEntityClass.
Note that the ordering of results can only be guaranteed if SampledEntityClass is applied to SortedEntityClass. However, even in other cases it can still be quite useful, for example, when one wants to peek into a large dataset or prototype a query using just a small sample first.
It is possible to add new, computed properties to the set of available entity properties for a given type/entity class. The result is a new entity class. The newly added property can be used anywhere the original properties can—in EntityValue, further in the outer layers of the query, etc. The construct that creates such a new extended entity class is ExtendedEntityClass.
Aggregation is a very commonly needed operation. It is used to compute values, aggregated over more than one entity (table row in the database). Typical examples of aggregation include computing totals, averages, or minimal and maximal values of some properties or more complex expressions, over a range of entities. Within the Entity framework, the construct that can be used to perform aggregation is AggregatedEntityClass.
As its name indicates, the result of its application is a new entity class. If the aggregation is performed over the entire original entity class, the resulting entity class will contain just a single entity, having as its properties the computed aggregated properties. One can also first group entities based on values of a certain set of their properties, and then perform aggregation over each such group—in which case the new entity class will contain as many such new aggregated entities as there are such groups.
At present, there is a limited and small set of core aggregation functions that are supported by the Entity framework for database-backed entity stores: Total, Length, Min, Max, Mean, Variance and StandardDeviation. One can use standard arithmetic functions and other property-like query building blocks, however, to combine these core operations and compute more complex expressions.
It is important to note that EntityFunction for the case of aggregation is bound to an entire entity class being aggregated rather than individual entities of that class.
Following is a simple example of use of AggregatedEntityClass to compute a single aggregate value.
To understand the semantics of EntityFunction in this case, it is useful to compute the same thing on the top level, using the Wolfram Language.
An important remark on the last example is that for the current version of the Entity framework, this three-argument form of EntityValue, when used for aggregation in a top-level query, will perform the aggregation on the Wolfram Language side rather than the database side (this does not happen when this syntax is used in subqueries). In contrast, all aggregations performed with an explicit use of AggregatedEntityClass will always be performed on the database side.
One can compute aggregations over certain groups of entities rather than the entire entity class. In this case, the third argument of AggregatedEntityClass has to be used, to indicate a property or a list of properties whose values will define unique groups of entities for aggregation.
In relational databases, data is typically stored in several tables, connected to each other via foreign key constraints. Only in rare cases does one work with just a single table for the real-world queries one typically wants to write, and considering that the data, being normalized, is typically split and stored in a (possibly large) number of related tables.
The most common way to use data from more than one table in an SQL query is to use SQL JOIN operations. The Entity framework counterpart of SQL JOIN is the CombinedEntityClass construct. It takes two entity classes/types, the specification that tells how they are going to be combined, and optionally the type of JOIN. The result is a new entity class, with new entities containing properties from both of the original entity classes.
The following examples illustrate typical use cases for CombinedEntityClass.
The new entity classes resulting from application of CombinedEntityClass can be further used in more complex queries, just like any other entity classes.
Note that in the last example it was necessary to explicitly indicate EntityProperty["customers","customerNumber"] in the list of properties in EntityValue, since both types "customers" and "payments" have the property with this name, and one has to disambiguate exactly which property is requested.
It is possible and often desirable to combine a type with itself. In the context of relational database operations, this case corresponds to a self-join. In this case, more care is needed to disambiguate property names—one has to introduce a (string) alias for at least one of the same types being combined. This case is illustrated with the following example.
In some cases, the condition that is tested to decide whether a combination of the two entities from two entity classes being combined should be included into the resulting entity class is more complex than a simple equality between some (combinations of) properties of both entities. For such cases, one can use EntityFunction with two arguments, each one bound to an entity of the corresponding entity class being combined. The body of such an EntityFunction must return a Boolean value, but otherwise can be an arbitrarily complicated (compilable) expression.
In the last example, it is also necessary to prefix one of the combined "offices" types, to disambiguate properties, just like it was in the previous example of self-join. Finally, DeleteDuplicates is used as a third argument of EntityValue to remove the duplicates from the results (and as was noted before, it is important to remember that in cases like this, the third argument of EntityValue is currently executed on the Wolfram Language side, rather than on the database).
Subqueries are queries that are used as building blocks in more complex queries. They take the simplest form when the result of a subquery is a scalar. In this section, several examples of subqueries are used to illustrate their general uses.
As a starting point, consider a common task of obtaining a single aggregated property from some entity class. This will not yet be a subquery. The following examples illustrate various ways to do this.
It is important to note, however, that while the results are the same, the first (shorter) version does in fact perform the aggregation in the Wolfram Language, while the last two versions aggregate on the database side.
However, when such queries are used as subqueries in larger queries, in that case they are always executed on the database side, including the short three-argument form of EntityValue.
Subqueries like the one in the last example are called uncorrelated, since they do not refer to entities (table rows) of the outer layers of the query. One can also build so-called correlated subqueries, which do contain such references (and therefore, cannot be executed on their own without modifications).
Subqueries are a powerful tool, but they can be rather easily abused. Depending on the case, they may or may not be the optimal way to solve a given problem. Deeply correlated subqueries might result in an inferior performance, so one should use this powerful tool carefully.
The MemberQ predicate in the context of database-backed entity stores is used in Entity framework queries to expose the SQL IN operator. In the simplest form, it is used to test membership of a value in an explicit list of values.
One can also use MemberQ in conjunction with subqueries, in which case, the first argument of MemberQ is not a literal list, but a result of a subquery. In such cases, a subquery should return a column rather than a scalar, and typically can be written as class["property"] or, equivalently, EntityValue[class,"property"], where class is some entity class (registered or defined by an inner query).
In some cases, one needs to only keep those selected values or groups of values that are distinct. In SQL, a special DISTINCT keyword is used for this purpose. In the Entity framework, one can use DeleteDuplicates as a third argument of EntityValue to achieve a similar effect.
In the previous example, as was noted before, DeleteDuplicates is currently executed on the Wolfram Language side. However, when a query such as the preceding is used as a subquery, the removal of duplicates will happen on the database side.
In addition to the properties that correspond to existing table columns, new properties are created by the Entity framework for tables that are related to other tables. They constitute the mechanism to provide a higher-level (w.r.t. explicit joins or subqueries) way to use data from multiple related entity types (database tables) in queries.
Such properties have been already considered previously, but on the most basic level.
It is important to realize that there is nothing that can be done with relations that cannot be done with core primitives. However, using relations in many cases can lead to considerably more concise queries, which require a lot less effort to both construct and understand.
More examples of how relations can be used to build more complex queries are considered in the final section of this tutorial.
The purpose of this section is to give a basic idea about the kind of SQL code that is typically generated by the framework for various supported constructs. The generated SQL provided here is mostly intended for illustration purposes.
While many of the SQL queries presented in this section actually correspond to the SQL code generated by the current version of the Entity framework query compiler (for the SQLite backend), one should not assume that the generated SQL will always take that exact form. It may change from version to version. From the viewpoint of the Entity framework, the exact form of generated SQL is an internal implementation detail, as long as the end result is correct and has reasonable efficiency. So one should not rely in any way on the details of generated SQL code shown in this section.
You can start with some basic EntityValue call:
SELECT officeCode, city, state, country
SELECT REGEXP(state, '^C', 0) AS synthetic_prop_9
For the case of EntityList, the database call actually extracts the values of the properties that are required to compute the CanonicalName for the entities of a given type (primary key for the corresponding database table).
Computed properties are defined using EntityFunction. These will typically compile into SQL expressions. Here are a few examples.
orderdetails.priceEach * orderdetails.quantityOrdered
offices.state IS NULL
REGEXP(offices.city, 'a', 0) OR REGEXP(offices.city, 'o', 0) AND NOT (offices.state IS NULL)
(products.MSRP - products.buyPrice) / products.MSRP
power(CAST((power(CAST(products.MSRP AS REAL), 2) - power(CAST(products.buyPrice AS REAL), 2)) / (power(CAST(products.MSRP AS REAL), 2) + power(CAST(products.buyPrice" AS REAL), 2)) AS REAL), 0.5)
orderdetails.quantityOrdered > 30 AND orderdetails.priceEach >= 200
orderdetails.orderNumber % 100 = 0
sum(orderdetails.quantityOrdered * orderdetails.priceEach) / CAST(count(orderdetails.orderLineNumber) AS REAL)
WHEN (1 > length(employees.firstName)) THEN NULL
WHEN 1 THEN substr(employees.firstName, 1, 1)
END IN ('M', 'P', 'D')
This construct generally corresponds to the SQL WHERE clause, with the second argument (being the EntityFunction expression) corresponding to the WHERE clause SQL expression.
SELECT employees.firstName, employees.lastName, employees.jobTitle
WHERE employees.jobTitle != 'Sales Rep'
For example, the following query uses three nested FilteredEntityClass constructs to find all customers from USA, CA, who are located in one of the cities: San Francisco, Los Angeles or San Jose:
WHERE customers.country = 'USA' AND customers.state = 'CA' AND customers.city IN ('San Francisco', 'Los Angeles', 'San Jose')
SELECT employees.employeeNumber, employees.lastName, employees.officeCode
ORDER BY employees.officeCode DESC
One notable difference between SQL and the Entity framework sorting facilities is that not all SQL backends directly support expressions in the ORDER BY clause, while for SortedEntityClass the properties with respect to which one sorts can be either simple entity properties or EntityFunction expressions.
SELECT "T308"."employeeNumber", "T308"."firstName", "T308"."lastName"
"employees_T306"."employeeNumber" AS "employeeNumber",
"employees_T306"."firstName" AS "firstName",
"employees_T306"."lastName" AS "lastName",
length("employees_T306"."firstName") AS synthetic_prop_17
FROM employees AS "employees_T306"
) AS "T308"
ORDER BY "T308".synthetic_prop_17
The most direct SQL counterparts for this construct are LIMIT and OFFSET SQL keywords. However, the actual strategy and internal implementation for subsetting may be different, since in certain cases other strategies may be more efficient to obtain the same result.
"payments_T316"."customerNumber" AS "customerNumber",
"payments_T316".amount AS amount
FROM payments AS "payments_T316"
LIMIT 10 OFFSET 10
On the SQL side, this construct corresponds to the use of SQL expressions to define fields in the SELECT list, whenever one wants to compute properties more complex than just the original fields present in the database table or query.
"employees_T328"."employeeNumber" AS "employeeNumber",
("employees_T328"."firstName" || ' ') || "employees_T328"."lastName" AS "fullName"
FROM employees AS "employees_T328"
Or they can be complex and contain e.g. correlated subqueries like, for example, for the following query, which uses relations:
"employees_T352"."employeeNumber" AS "employeeNumber",
"employees_T352"."firstName" AS "firstName",
"employees_T352"."lastName" AS "lastName",
SELECT "employees_T355"."firstName" AS "firstName_1"
FROM employees AS "employees_T355"
WHERE "employees_T355"."employeeNumber" = "employees_T352"."reportsTo"
) AS "managerFirstName",
SELECT "employees_T358"."lastName" AS "lastName_1"
FROM employees AS "employees_T358"
WHERE "employees_T358"."employeeNumber" = "employees_T352"."reportsTo"
) AS "managerLastName"
FROM employees AS "employees_T352"
When used without the third argument, AggregatedEntityClass represents aggregation performed over the entire first argument (entity class), and corresponds to SQL aggregate queries. In this case, there is typically no special SQL-side keyword that would correspond to the AggregatedEntityClass[...], but the fields in the SELECT list must all use the SQL aggregation functions.
max("orderdetails_T403"."quantityOrdered") AS "maxOrdered",
min("orderdetails_T403"."quantityOrdered") AS "minOrdered",
avg("orderdetails_T403"."quantityOrdered") AS "avgOrdered"
FROM orderdetails AS "orderdetails_T403"
When the third argument of AggregatedEntityClass is used, this corresponds to the SQL GROUP BY clause.
"customers_T434".city AS city,
"customers_T434".country AS country,
count("customers_T434"."customerNumber") AS "customerCount"
FROM customers AS "customers_T434"
GROUP BY "customers_T434".city, "customers_T434".country
It is important to note that for the operations that one may want to further perform on the AggregatedEntityClass, often the generated SQL will have an extra layer of SELECT.
"customers_T495".city AS city,
"customers_T495".country AS country,
count("customers_T495"."customerNumber") AS "customerCount"
FROM customers AS "customers_T495"
GROUP BY "customers_T495".city, "customers_T495".country
) AS "T497"
ORDER BY "T497"."customerCount" DESC
"employees_T529"."employeeNumber" AS "employeeNumber",
"employees_T529"."firstName" AS "firstName",
"employees_T529"."lastName" AS "lastName",
FROM employees AS "employees_T529"
"offices_T532".city AS city,
"offices_T532".country AS country,
"offices_T532"."officeCode" AS "officeCode"
FROM offices AS "offices_T532"
) AS "T534"
ON "employees_T529"."officeCode" = "T534"."officeCode"
Which will translate into SQL similar to this (note that currently DeleteDuplicates in the top-level EntityValue is executed on the Wolfram Language side, so there is no DISTINCT keyword in the SQL query):
SELECT "offices_T571".country AS country
FROM offices AS "offices_T571"
"offices_T574".country AS country_1,
"offices_T574"."officeCode" AS "officeCode"
FROM offices AS "offices_T574"
) AS "T576"
ON "offices_T571".country = "T576".country_1 AND "offices_T571"."officeCode" != "T576"."officeCode"
In this tutorial, subqueries normally means parts of the larger query, which can be expressed using EntityValue. On the SQL side, most of the time this corresponds to the inner SELECT statement with a single field, typically returning a scalar (either because there is a single row or because aggregation is being performed), but sometimes also returning a column (to be used in the IN clause, which would correspond to the use of MemberQ on the Entity framework side).
"products_T583"."productName" AS "productName",
"products_T583"."MSRP" AS "MSRP"
FROM products AS "products_T583"
WHERE "products_T583"."MSRP" >= 0.9 * (
FROM products AS "products_T586"
Where the subquery inside EntityFunction that defines the "closelyPricedProductsCount" extended property is a correlated subquery, since the filtering of products now depends on the price of the current product, which has to be referenced from that filtering/aggregation subquery.
This results in a query containing a correlated subquery, where the correlated subquery is in the SELECT list of the inner query and is aliased by "closelyPricedProductsCount", becoming a new extended property:
SELECT count("products_T638"."productCode") AS count
FROM products AS "products_T638"
WHERE abs("products_T638"."MSRP" - "products_T635"."MSRP") <= 15
) AS "T640"
) AS "closelyPricedProductsCount",
"products_T635"."MSRP" AS "MSRP",
"products_T635"."productName" AS "productName"
FROM products AS "products_T635"
) AS "T637"
ORDER BY "T637"."closelyPricedProductsCount" DESC, "T637"."MSRP" DESC
There is currently no automatic optimization of the generated SQL code involving subqueries performed by the Entity framework query compiler (such as attempts to convert them into JOINs etc.). One should be aware of the performance implications of the use of correlated subqueries in the Entity framework, which are similar to those for SQL.
Relations provide a high-level way to perform lookups for properties in entity classes/types related to a given one (related database tables), without performing explicit joins. Their internal implementation can utilize different tools to achieve this goal, such as subqueries and/or joins, but these are hidden from the user.
For an example of the type of SQL that can be generated from queries that use relations, consider the following query that computes, for each office, the maximal number of customers that a single employee in that office deals with (it is also considered later in the last section of this tutorial):
"offices_T652"."officeCode" AS "officeCode",
SELECT max("T657"."customerCount") AS synthetic_prop_20
SELECT count("customers_T658"."customerNumber") AS synthetic_prop_21
FROM customers AS "customers_T658"
WHERE "employees_T655"."employeeNumber" = "customers_T658"."salesRepEmployeeNumber"
) AS "T660"
) AS "customerCount"
FROM employees AS "employees_T655"
WHERE "offices_T652"."officeCode" = "employees_T655"."officeCode"
) AS "T657"
) AS "T662"
) AS "maxEmployeeCustomerCount"
FROM offices AS "offices_T652"
One important point to remember is that EntityFunction is HoldAll, and therefore, if one needs to use some part of the query stored in a variable, in the body of EntityFunction, one has to use With (or a similar construct) to inject that query part into the body of EntityFunction.
Reusing inert query blocks in several larger queries can be a common need in practice. Since queries are inert, one can easily do so. In particular, one can store smaller parts of a query in variables and use those variables in larger queries.
In some cases, one may need to reuse the same query more than once in one and the same larger query. In SQL, the WITH keyword is used to achieve that. While the Entity framework might acquire native support for this construct in the future, one can emulate it rather easily.
As an example, consider operator forms for common query building blocks. At present, for various reasons there is no direct support for those, for the core Entity framework query building blocks. Yet users often prefer this style of query writing, and it can in some cases make queries more readable by reducing the amount of nesting in the query.
One of the impediments to building complex queries is that the process may look and feel somewhat similar to writing code in compiled languages, where it is hard to test intermediate results or code pieces, and one has to come up with some complete function or program to compile and test it.
This section is intended to show that this is not really the case for Entity framework queries, and that with proper techniques, one can make the query-building process as interactive as most other activities in the Wolfram Language.
The query to illustrate the technique described in this section should do the following: select the five top-paying customers and return their customer number, customer name and total amount they have paid, in the order of decreasing total amount.
Note that in the last query, in the property list it is important to use the full EntityProperty["customers","customerNumber"], rather than just "customerNumber", since CombinedEntityClass[...] now contains two "customerNumber" properties, and it is necessary to disambiguate which one is really requested.
It is often useful to prototype queries using single entities of a given entity type, before actually constructing them for the entire entity type. This section illustrates this process using a concrete example.
The next step is to construct a query that computes the number of customers for each employee. As a first step, one can hard-code the specific employee number. For example, from the preceding it follows that employee with the number 1165 should have six customers.
Where in the last query, an extra layer of EntityFunction is introduced, which allowed the specific employee entity to be passed into the query rather than its number being hard-coded inside of it.
Where the previously constructed query has been used as an inner building block, and also the relation o["employees"] has been used, just as previously inside the EntityList.
Note how using single entities, one can incrementally build nested complex queries, making sure that things work at each level. This particular query technically contains a doubly nested subquery, one level of which is correlated.
This logic also works in reverse: given a query that does not function properly and contains complicated inner structure (involving nested EntityFunction expressions etc.), one can break it into pieces and test inner parts of the query on single entities, to quickly locate and fix the problematic place.
In which case, Missing["UnknownProperty",…] values are returned for nonexistent property.
In which case, Missing values are also returned.
One example of such an error is using an invalid property in EntityFunction.
Query compilation fails whenever the body of EntityFunction contains uncompilable parts.
Another common source of hard errors is when one uses expressions of the wrong type in EntityFunction.
Some operations that take EntityFunction as an argument require specific return types. For example, when used in FilteredEntityClass or CombinedEntityClass, EntityFunction should have a Boolean return type.
Expressions used to compute aggregated properties in EntityFunction for the case of aggregation must use one of the aggregation functions. It is technically possible to construct queries that do not, in which case they will not compile.
Operational errors are errors that are happening on the database side. While the Entity framework makes an effort to intercept most syntactic, type-related and other errors in the queries early, in some cases this either has not yet been done or might not be easily possible to do.
Most interesting real-world questions require queries that combine several query-building primitives in nontrivial ways. This section illustrates how one can achieve this, with a number of more interesting examples.
The same can be achieved, using instead CombinedEntityClass and aggregation with grouping (notice that here one has to add those properties of the "customers" type that one wants to be available after aggregation to the list of properties one uses for grouping. Even though there is a single value of "customerName" that corresponds to specific value of "customerNumber", the query does not know that, unless we tell it this explicitly).
One way of dealing with this type of problem is to use CombinedEntityClass to combine the types together and then aggregate over the combined type, grouping over some of the properties.
Note that in the last query, it was important to use the longer form EntityProperty["offices","officeCode"] instead of just "officeCode" to disambiguate the property, since CombinedEntityClass["offices","employees","officeCode"] contains two properties with the short name "officeCode": EntityProperty["offices","officeCode"] and EntityProperty["employees","officeCode"] (in this case, either one could be used).
"offices_T652"."officeCode" AS "officeCode",
SELECT max("T657"."customerCount") AS synthetic_prop_20
SELECT count("customers_T658"."customerNumber") AS synthetic_prop_21
FROM customers AS "customers_T658"
WHERE "employees_T655"."employeeNumber" = "customers_T658"."salesRepEmployeeNumber"
) AS "T660"
) AS "customerCount"
FROM employees AS "employees_T655"
WHERE "offices_T652"."officeCode" = "employees_T655"."officeCode"
) AS "T657"
) AS "T662"
) AS "maxEmployeeCustomerCount"
FROM offices AS "offices_T652"
This is another showcase to illustrate the power of relations. The task is to select customers who still owe money—in other words, those for whom the amount of total orders exceeds the amount of total payments to date.
To compute total amount to pay, one can start with all the orders for a given customer, which is given by the relation c["orders"] and is an entity class of type "orders". Since each order may contain multiple items, and each item may be ordered in multiple quantities, the next step is to extend this class with a property "totalToPay", which computes the amount to be paid for each order by reaching to all "orderdetails" entities related to this order and computing the total of the product of item price and ordered quantity. Note that here, relation o["orderdetails"] is used for each order. Then one can do the second aggregation, this time over all the orders for a given customer, to obtain the total amount they should pay.
The computation of the total amount the customer already paid is considerably simpler, since it requires a single relational lookup c["payments"]. The small cutoff 0.00001 is introduced to avoid roundoff errors.
The With in the last example is used for readability and is not required; one could have used a single large query. Note however that With is used with := initialization for scoped variables, so that they get injected into the body of the main EntityFunction without evaluation.
The goal for this example is to compute for each employee the total amount to be paid by the five top-paying customers served by that employee, and sort employees in the order of descending total amount (which can be used as a measure of success for a given employee).
In the above example, as also in the example before it, With is used mostly for readability.