MSQL Reference – Querying

`SELECT` statement

The most important (and most commonly used) MSQL statement is the SELECT statement, which is used for retrieving data from all the components of the Mireo SpaceTime Cluster, processing it in various ways and producing meaningful results.

The following is the supported syntax of the SELECT statement:

[ WITH cte AS (query), ... ]
SELECT
    expr [ [AS] alias ], ...
    [ FROM (tableName | subQuery [[AS] alias])
        [
        ([ LEFT | RIGHT ] [ INNER | OUTER ] JOIN (TABLE | subQuery) [ [AS] tbl ]
            (ON condition | USING (columnName, ...)) ], ... )
        | ( (CROSS | OUTER) APPLY tableFunction(args...) [ [AS] tbl ] )
        ]
    [ WHERE condition ]
    [ GROUP BY expr, ... ]
    [ HAVING condition ]

[ UNION ALL query ... ]

[ ORDER BY expr [(ASC | DESC)] [(NULLS FIRST | NULLS LAST)], ... ]
[ LIMIT m ]
[ OFFSET n ]

Projecting

The only required part of the SELECT statement is the select-list, i.e. a list of value expressions coming immediately after the SELECT keyword. These expressions are evaluated for each input row (e.g.) and their results become the column values of the output row (this operation is often called projection). Each expression in the list can be given a name (i.e. an alias, specified by the AS name construct) to make it easier to refer to it later.

Aliased expression names aren't visible inside the WHERE and JOIN clauses, because those clauses are evaluated before the query select-list.

A special kind of projection expression is * (e.g. in SELECT * FROM st.segment), which signifies that all the columns of the input table should simply be copied to the output row.

Two examples of valid MSQL queries using projection:

SELECT to_char(1611577608::timestamp)::text;  -- Note: FROM clause is not mandatory in MSQL queries
  --> '2021-01-25 12:26:48.000000'

SELECT t.vid AS vehicle_id, t.* FROM st.trip AS t;
-- Selects all fields from the st.trip table (aliased as `t`) and also duplicates the `vid` field, aliasing it as `vehicle_id`

Filtering

To return only those rows that satisfy a specified filtering condition, an (optional) WHERE clause can be added to the SELECT statement. This filtering is always done before projections and aggregations, so as to process only the rows which will enter into the final result.

A filtering condition must be a boolean-valued expression (i.e. its result must be of a bool type) and can refer to any of the input columns, but it cannot refer to items from the select-list, as they aren't yet evaluated at the moment when the filtering is done.

Only the rows for which the filtering condition evaluates to true will be retained in the result. Rows for which the condition evaluates to NULL will be dropped just as if it evaluated to false. For example, SELECT COUNT(*) FROM (VALUES 15, NULL::int64) AS t(x) WHERE x > 10 will return 1 because the second row (where x = NULL) is dropped.

The MSQL query planner can convert certain filtering conditions into index constraints (e.g. for SpaceTime range queries), which can dramatically increase the query performance. Whether this happened or not can be seen from the query execution plan.

An example of a filter which can be converted to SpaceTime range query constraints:

SELECT sum(len) FROM st.trip
    WHERE vid = 3267
    	AND t[0] BETWEEN timestamp '2020-07-01 00:00:00' AND timestamp '2020-07-31 23:59:59'
    	AND ST_Intersects(point '(x0, y0)', ST_ToGeom(geogpolygon '((16.2598 48.1487, 16.5015 48.1487, 16.5015 48.2750, 16.2598 48.2750, 16.2598 48.1487))'));

An example of a filter which cannot be so converted:

SELECT count(*) FROM st.segment WHERE flags & 4 = 1;  -- Bitwise operation which cannot be converted into SpaceTime range query constraint

Ordering

The results of a query are typically returned in an unspecified order, i.e. the ordering of result rows can be arbitrary and even differ between two runs of the same query. But an exact ordering of the results can be accomplished by specifying the ORDER BY clause, in which case the result rows will be sorted before being returned from the query.

The ORDER BY clause specifies a list of columns or expressions by which to sort the output rows. Instead of column names, column indices (starting at 1) can also be specified, e.g. ORDER BY 1 will order the rows by the value of the first column.

Each column/expression in the ORDER BY list can have a sorting direction, either ASC (ascending) or DESC (descending), with ascending being the default if none is specified. Likewise, each column/expression can also have an individual policy for treating NULL values: NULLS FIRST which places them before anything else (the default when sorting is descending) or NULLS LAST which places them after everything else (default when sorting is ascending).

An example of a query which specifies an ordering:

SELECT *
    FROM (VALUES (1, 10), (5, 7), (NULL::int64, 17), (5, 11), (5, NULL::int64)) AS t(x, y)
    ORDER BY x, y DESC;

--> +------+------+
--> |    x |    y |
--> +------+------+
--> |    1 |   10 |
--> |    5 | NULL |  -- (y = NULL) before other values
--> |    5 |   11 |
--> |    5 |    7 |
--> | NULL |   17 |  -- (x = NULL) after other values
--> +------+------+

A query declaring null-handling policies for ordering (the specified policies are the opposite of the defaults):

SELECT *
    FROM (VALUES (1, 10), (5, 7), (NULL::int64, 17), (5, 11), (5, NULL::int64)) AS t(x, y)
    ORDER BY x NULLS FIRST, y DESC NULLS LAST;

--> +------+------+
--> |    x |    y |
--> +------+------+
--> | NULL |   17 |  -- (x = NULL) before other values
--> |    1 |   10 |
--> |    5 |   11 |
--> |    5 |    7 |
--> |    5 | NULL |  -- (y = NULL) after other values
--> +------+------+

Specifying a column index instead of a column name:

SELECT x % 3, x
    FROM (VALUES 5, 4, 6, 10) AS t(x)
    ORDER BY 1;  -- Ordering by the first column, i.e. x % 3

--> +-------+----+
--> | x % 3 |  x |
--> +-------+----+
--> |     0 |  6 |
--> |     1 |  4 |
--> |     1 | 10 |
--> |     2 |  5 |
--> +-------+----+

Limiting

The maximum number of rows that the query should return can be specified using the LIMIT clause. The argument of the LIMIT clause must be an unsigned integer specifying the maximum number of result rows. If there are fewer rows than the amount specified, all rows will be returned.

Likewise, a starting offset can be specified with the OFFSET clause, which also takes an unsigned integer argument specifying the number of rows to skip before starting to return results. If this is larger than the total number of rows, the result will be empty.

When both LIMIT num and OFFSET idx are specified, the query will return the next num rows starting from the idx-th row. The rows skipped by the OFFSET clause will still have to be computed, therefore a large OFFSET might result in a slow query.

The limiting and offsetting operations are logically applied after result ordering, so they can be used to e.g. select top 10 rows according to a certain criterion.

If an ORDER BY is not specified before the LIMIT and OFFSET clauses, the results of the query will be non-deterministic, because in that case the LIMIT and OFFSET clause will be applied on input rows which will come in an unspecified order.

The execution plan resulting from adding a LIMIT clause to a query involving joins or aggregates might not be what a first-time user will expect. There is no way the MSQL query planner can know how many rows to fetch from the database in order to produce a maximum of e.g. 100 result rows, if the rows will first be aggregated or joined (and dropped if they don't have a join match). Therefore, queries such as SELECT SUM(len) FROM st.trip LIMIT 100 will actually process all the rows of the st.trip table, and the query will be slow.

A basic example of a limiting query:

SELECT * FROM (VALUES 1, 2, 3, 4) AS t(x) ORDER BY x LIMIT 2;

--> +---+
--> | x |
--> +---+
--> | 1 |
--> | 2 |
--> +---+

A case of fewer rows than the amount specified in LIMIT (all rows are returned):

SELECT * FROM (VALUES 1, 2, 3, 4) AS t(x) ORDER BY x LIMIT 10;

--> +---+
--> | x |
--> +---+
--> | 1 |
--> | 2 |
--> | 3 |
--> | 4 |
--> +---+

A query specifying both LIMIT and OFFSET:

SELECT * FROM (VALUES 1, 2, 3, 4) AS t(x) ORDER BY x LIMIT 2 OFFSET 1;

--> +---+
--> | x |
--> +---+
--> | 2 |
--> | 3 |
--> +---+

An offset larger than the total amount of rows (the query returns nothing):

SELECT * FROM (VALUES 1, 2, 3, 4) AS t(x) ORDER BY x OFFSET 10;
-- Returns an empty resultset

Joining

Data from several tables can be queried and joined together by specifying the JOIN clause after the FROM.

Join condition can be specified in two ways, either as JOIN table ON condition, where condition is a boolean-valued expression, or as JOIN table USING (colName1 [, colName2, ...]), where colName1..N are names of the columns (they must be the same in both tables) to be compared for equality. USING (vid) is mostly just a shortcut for ON t1.vid = t2.vid, with the only difference being that the columns specified in USING won't be duplicated in the result, while in the case of ON they will be (e.g. in the example above there will be two vid columns, one from t1 and the other from t2).

Type	Description
`INNER`	Only rows which have a match will end up in the result. The default.
`LEFT (OUTER)`	Rows from the left* side that don't have a match will also end up in the result*
`RIGHT (OUTER)`	Rows from the right* side that don't have a match will also end up in the result*
`FULL (OUTER)`	Rows from both* sides that don't have a match will also end up in the result*
`CROSS`	Returns a Cartesian product* of two (or more) tables*

NATURAL JOINs aren't supported in the MSQL language. You should instead list the join columns explicitly using JOIN ... USING (col1, col2, ...) syntax.

It is important to note that a single row from either side can be matched with multiple rows from the opposite side, if they all satisfy the given condition.

Two or more tables can also be joined together implicitly, by listing them all in the FROM clause, and putting the join condition inside WHERE (e.g. SELECT * FROM st.segment, ndb.vehicle WHERE vid = id). The MSQL query planner should always be able to convert this into proper joins.

MSQL also supports a similar syntax, the CROSS/OUTER APPLY, which is logically equivalent to INNER/LEFT JOIN, only it usually results in a correlated subquery. This syntax is most commonly used when invoking table-valued functions, e.g. SELECT formatted_address FROM st.trip CROSS APPLY map.rev_geocode(x0, y0).

A basic example of a query involving JOIN (i.e. an inner join, which is the default):

WITH
    t1(x) AS (VALUES 1, 2, 3, 4),
    t2(y) AS (VALUES 1, 2, 5, 6)
SELECT * FROM t1 JOIN t2 ON x = y
    ORDER BY 1, 2;

--> +---+---+
--> | x | y |
--> +---+---+
--> | 1 | 1 |
--> | 2 | 2 |
--> +---+---+

Join which produces multiple matches for each row from the left/right (i.e. it isn't an equi-join):

WITH
    t1(x) AS (VALUES 1, 2, 3, 4),
    t2(y) AS (VALUES 1, 2, 5, 6)
SELECT * FROM t1 JOIN t2 ON x > y
    ORDER BY 1, 2;

--> +---+---+
--> | x | y |
--> +---+---+
--> | 2 | 1 |
--> | 3 | 1 |
--> | 3 | 2 |
--> | 4 | 1 |
--> | 4 | 2 |
--> +---+---+

Specifying the join condition via USING (note there is just one x column, not two as in the previous example):

WITH
    t1(x) AS (VALUES 1, 2, 3, 4),
    t2(x) AS (VALUES 1, 2, 5, 6)
SELECT * FROM t1 JOIN t2 USING(x)
    ORDER BY x;

--> +---+
--> | x |
--> +---+
--> | 1 |
--> | 2 |
--> +---+

Left (outer) join (note that all rows from the left side are returned, but some results have NULL in the y column):

WITH
    t1(x) AS (VALUES 1, 2, 3, 4),
    t2(y) AS (VALUES 1, 2, 5, 6)
SELECT * FROM t1 LEFT OUTER JOIN t2 ON x = y  -- OUTER is optional, LEFT JOIN would work the same
    ORDER BY 1, 2;

--> +---+------+
--> | x |    y |
--> +---+------+
--> | 1 |    1 |
--> | 2 |    2 |
--> | 3 | NULL |
--> | 4 | NULL |
--> +---+------+

Right (outer) join (note that all rows from the right side are returned, but some results have NULL in the x column):

WITH
    t1(x) AS (VALUES 1, 2, 3, 4),
    t2(y) AS (VALUES 1, 2, 5, 6)
SELECT * FROM t1 RIGHT OUTER JOIN t2 ON x = y  -- OUTER is optional
    ORDER BY 1, 2;

--> +------+---+
--> |    x | y |
--> +------+---+
--> |    1 | 1 |
--> |    2 | 2 |
--> | NULL | 5 |
--> | NULL | 6 |
--> +------+---+

Full (outer) join (all rows are returned from both sides, but some results have NULL in either x or y column):

WITH
    t1(x) AS (VALUES 1, 2, 3, 4),
    t2(y) AS (VALUES 1, 2, 5, 6)
SELECT * FROM t1 FULL OUTER JOIN t2 ON x = y  -- OUTER is optional
    ORDER BY 1, 2;

--> +------+------+
--> |    x |    y |
--> +------+------+
--> |    1 |    1 |
--> |    2 |    2 |
--> |    3 | NULL |
--> |    4 | NULL |
--> | NULL |    5 |
--> | NULL |    6 |
--> +------+------+

Cross join (Cartesian product of rows):

WITH
    t1(x) AS (VALUES 1, 2, 3),
    t2(y) AS (VALUES 4, 5, 6)
SELECT * FROM t1 CROSS JOIN t2
    ORDER BY 1, 2;

--> +---+---+
--> | x | y |
--> +---+---+
--> | 1 | 4 |
--> | 1 | 5 |
--> | 1 | 6 |
--> | 2 | 4 |
--> | 2 | 5 |
--> | 2 | 6 |
--> | 3 | 4 |
--> | 3 | 5 |
--> | 3 | 6 |
--> +---+---+

Aggregating

Rows of the result can be grouped and aggregated together by specifying the GROUP BY clause. The GROUP BY clause defines a list of columns or expressions which should be used in grouping. Rows having the same values of the specified columns/expressions will then end up in the same group. Instead of column names, column indices (starting at 1) can also be specified, e.g. GROUP BY 1 will group the rows by the value of the first column (i.e. first expression specified in the select-list).

In a select-list of an aggregating query, the only expressions allowed are either aggregate expressions or references to expressions being used for grouping (e.g. those listed in the GROUP BY).

If you want to select some value in an aggregating query without grouping by it (and it doesn't matter which precise value it is from within a group), you can use the ANY_VALUE aggregate function for this purpose.

If an aggregate expression appears in a select-list but there is no GROUP BY clause, all the rows of the result will be aggregated together in a single resulting row.

Grouping without any aggregate expressions specified in the select-list effectively calculates the set of distinct values. This can be useful because MSQL currently doesn't support SELECT DISTINCT, so grouping without aggregates can be a workaround.

Related to the GROUP BY clause is the HAVING clause, which specifies a boolean condition for filtering the resulting aggregate rows. The difference between WHERE and HAVING clauses is that the WHERE clause is evaluated on the input rows (i.e. it determines whether a given row will be aggregated or not) and the HAVING clause is evaluated on the rows of the aggregate result (i.e. it determines which result rows will be kept and which dropped).

A simple example of an aggregate query:

WITH t(id, val) AS (VALUES (1, 32), (1, 45), (2, 17), (4, 22))
SELECT id, sum(val) AS sum FROM t
    GROUP BY id
    ORDER BY id;

--> +----+-----+
--> | id | sum |
--> +----+-----+
--> |  1 |  77 |
--> |  2 |  17 |
--> |  4 |  22 |
--> +----+-----+

Grouping by more than one column (columns specified by index):

WITH t(x, y, z) AS (VALUES (1, 2, 3), (1, 1, 5), (2, 4, 1), (1, 2, 5))
SELECT x, min(z) AS min, y FROM t
    GROUP BY 1, 3
    ORDER BY 1, 3;

--> +---+-----+---+
--> | x | min | y |
--> +---+-----+---+
--> | 1 |   5 | 1 |
--> | 1 |   3 | 2 |
--> | 2 |   1 | 4 |
--> +---+-----+---+

Doing an aggregation over all rows (no grouping):

SELECT avg(x) FROM (VALUES 1, 4, 5, 9, -2, 10) AS t(x);
  --> 4.5

Doing an aggregation over an empty rowset:

SELECT avg(x) FROM (VALUES 1, 2, 3) AS t(x) WHERE x > 10;
  --> NULL

Specifying a HAVING clause:

WITH t(id, val) AS (VALUES (1, 17), (2, 31), (1, 41), (3, 12), (4, 8), (2, 55))
SELECT id, sum(val) AS sum FROM t
    GROUP BY id
    HAVING sum > 20
    ORDER BY sum DESC;

--> +----+-----+
--> | id | sum |
--> +----+-----+
--> |  2 |  86 |
--> |  1 |  58 |
--> +----+-----+

Using aggregation to produce a set of distinct values:

WITH t(id, val) AS (VALUES (1, 17), (2, 31), (1, 41), (3, 12), (4, 8), (2, 55))
SELECT id FROM t
    GROUP BY 1
    ORDER BY 1;

--> +----+
--> | id |
--> +----+
--> |  1 |
--> |  2 |
--> |  3 |
--> |  4 |
--> +----+

Set Operations

Individual results of several queries can be combined into a single resultset using the set operations, of which currently only UNION ALL is supported. The UNION ALL set operation effectively concatenates the result rows of two or more queries into a single resultset, but it doesn't remove duplicate rows.

UNION DISTINCT (which does remove duplicates) isn't currently supported in the MSQL language.

The syntax for the UNION ALL operation is:

query1
UNION ALL
query2
[UNION ALL ...]

The queries being combined must all return the same number of columns, and corresponding columns must have compatible data types that can be converted to a common result type.

An example of combining queries with UNION ALL:

WITH
    t1(a, b, c) AS (SELECT 1, 'abc', true),
    t2 AS (SELECT 3.14, NULL::text, false)
(SELECT * FROM t1) UNION ALL (SELECT * FROM t2);

--> +------+-------+-------+
--> | a    | b     | c     |
--> +------+-------+-------+
--> |  1.0 | 'abc' |  true |
--> | 3.14 |  NULL | false |
--> +------+-------+-------+

Subqueries

MSQL offers full support for writing subqueries. Subqueries can be used in place of any scalar or table-valued expression, e.g. in the select-list, WHERE clause, HAVING clause (see scalar subqueries), as well as in FROM and JOIN/APPLY clauses (see table expressions).

Most of the subqueries are converted into other operations (such as joins) by the MSQL query planner, but a special kind of subqueries called correlated subqueries often cannot be so converted. They are rather executed as a separate query being invoked for each row of the parent query, which might carry some performance penalties.

There also exists a special syntax for joining with the results of a correlated subquery, CROSS/OUTER APPLY. It causes the right side of the CROSS/OUTER APPLY to be executed for each row from the left side, and combines the results the same as in JOIN. The CROSS APPLY construct is equivalent to INNER JOIN, as it will drop those rows from the left for which the right side didn't return any results, while the OUTER APPLY is equivalent to the LEFT JOIN as it will keep those rows from the left (and set right-side columns to NULL).

An example of CROSS APPLY:

WITH places(name) AS (VALUES 'Colloseum', 'Stadio San Siro', 'Pyramid of Giza')
SELECT name AS search_name, city FROM places
    CROSS APPLY (SELECT * FROM map.geocode_pts(name || ', Italy') LIMIT 1)
    ORDER BY name;

--> +-----------------------+----------+
--> |           search_name |     city |
--> +-----------------------+----------+
--> |           'Colloseum' |   'Roma' |
--> |     'Stadio San Siro' | 'Milano' |
--> +-----------------------+----------+

An example of OUTER APPLY (note the first row which didn't match anything):

WITH places(name) AS (VALUES 'Colloseum', 'Stadio San Siro', 'Pyramid of Giza')
SELECT name AS search_name, city FROM places
    OUTER APPLY (SELECT * FROM map.geocode_pts(name || ', Italy') LIMIT 1)
    ORDER BY name;

--> +-----------------------+----------+
--> |           search_name |     city |
--> +-----------------------+----------+
--> |           'Colloseum' |   'Roma' |
--> |     'Pyramid of Giza' |     NULL |
--> |     'Stadio San Siro' | 'Milano' |
--> +-----------------------+----------+

Constant Value Tables

Constant-value tables can be generated inside queries using the VALUES keyword. These tables are only logical (i.e. the data is never materialized on disk), but they can be used in queries just as any other table. Such tables can be very useful, e.g. for defining and naming constant categories, generating sequences of numbers, converting numeric codes to textual descriptions etc.

The syntax of the VALUE construct is:

VALUES
    (r1_c1, r1_c2, ...),
    (r2_c1, r2_c2, ...),
    ...

but a shorter version VALUES v1, v2, v3, ... is also supported when there is only one column per row.

By default, the columns of the VALUES result have an auto-generated name such as EXPR$0, but it is always advised to assign your own name to them by specifying an alias (e.g. SELECT x FROM (VALUES 1, 2, 3) AS t(x)), to be able to refer to the results in the rest of the query.

The VALUES (1, 'abc'), (2, 'def'), ... is effectively equivalent to

(SELECT 1, 'abc')
UNION ALL
(SELECT 2, 'def')
UNION ALL
...

There is also a special table-valued function CONST.SEQUENCE, which can be used to generate value tables containing a sequence of integers.

An example of a query using VALUES:

SELECT x, name
    FROM (VALUES 1, 2, 3, 2) AS t(x)
    JOIN (VALUES (1, 'one'), (2, 'two'), (3, 'three')) AS n(x, name) USING(x)
    ORDER BY x;

--> +---+---------+
--> | x |    name |
--> +---+---------+
--> | 1 |   'one' |
--> | 2 |   'two' |
--> | 2 |   'two' |
--> | 3 | 'three' |
--> +---+---------+

An example of a query using const.sequence:

SELECT num, num*num AS num2 FROM const.sequence(1, 10);

--> +---+----+
--> | x | x2 |
--> +---+----+
--> | 1 |  1 |
--> | 2 |  4 |
--> | 3 |  9 |
--> | 4 | 16 |
--> | 5 | 25 |
--> | 6 | 36 |
--> | 7 | 49 |
--> | 8 | 64 |
--> | 9 | 81 |
--> +---+----+

Common Table Expressions

MSQL query language also supports Common Table Expressions (CTE), which are auxiliary statements that define temporary tables for use in larger queries. These temporary tables are only logical (i.e. they are never physically materialized) and they exist just for this one query.

Common Table Expressions are specified using the keyword WITH, via the following syntax:

WITH t1 AS (query1),
    t2(col1, col2, ...) AS (query2),  -- Note that column names can be specified next to the CTE name
    ...
SELECT ... -- The rest of query which can use `t1` and `t2` as tables in FROM, JOIN, subqueries etc.

Only regular SELECT statements are supported inside Common Table Expressions, i.e. DDL statements, DESCRIBE or EXPLAIN aren't allowed.

Recursive CTEs aren't supported in the MSQL language.

CTEs are very useful in the context of larger queries, where they allow for reusing a part of the query instead of copying it around, they can help break deeply nested queries into shorter and more shallow ones, and in general they can substantially increase readability and intelligibility of queries.

An example of a query using CTEs:

WITH
    data(id, val) AS (VALUES (1, 43), (2, 57), (3, 32), (4, 50), (5, 22)),
    totals AS (SELECT avg(val) AS mean, stddev_samp(val) AS sd FROM data)
SELECT
    id, val, mean, val - mean AS delta
    FROM data, totals
    WHERE abs(val - mean) > sd;

--> +----+-----+------+-------+
--> | id | val | mean | delta |
--> +----+-----+------+-------+
--> |  2 |  57 | 40.8 |  16.2 |
--> |  5 |  22 | 40.8 | -18.8 |
--> +----+-----+------+-------+

SELECT statement

Projecting

Filtering

Ordering

Limiting

Joining

Aggregating

Set Operations

Subqueries

Constant Value Tables

Common Table Expressions

`SELECT` statement