MSQL Reference – Lexical Structure

A basic unit of MSQL input is a statement. There are several kinds of MSQL statements: SELECT statement, EXPLAIN statement, DESCRIBE statement and SET statement, and others will likely be added in future releases.

Each MSQL statement consists of a sequence of tokens, which are the smallest building blocks of MSQL query language. If there is more than one statement in the input, all but the last statement must be terminated with a semicolon (;).

A token can be a keyword (e.g. SELECT), an identifier (len), a quoted identifier (type), a constant literal ('abc', 127) or a special character symbol (,, ||, *).

Tokens are most often separated by some amount of whitespace (spaces, tabs and newlines), but the whitespace is optional if one of the tokens is a special character symbol
(e.g. SELECT*FROM st.segment WHERE vid>0 is a valid query).

Keywords

Keywords such as SELECT, FROM and WHERE have a predefined meaning in the MSQL language. They consist only of letters and are case-insensitive (e.g. SELECT and seLEct are treated the same).

Keywords cannot appear in an arbitrary position in a statement, but have a reserved place for them in the surrounding context (e.g. a FROM keyword must always come after SELECT, WHERE must not come before GROUP BY etc).

There are two kinds of keywords: reserved keywords and non-reserved keywords. The difference between them is that an identifier cannot be named the same as one of the reserved keywords, but it can have the same name as one of the non-reserved keywords.
For example, SELECT x AS where FROM st.segment is an invalid MSQL statement because where is a reserved keyword, while SELECT t[0] AS left FROM st.segment is a valid statement because left is a non-reserved keyword.

The following are the lists of all the reserved and non-reserved keywords in the MSQL language:

Reserved keywords:

`ALL`	`DISCARD`	`IF`	`OFFSET`	`SOME`
`AND`	`DISJOINT`	`IMMEDIATELY`	`ON`	`SUCCEEDS`
`ANY`	`DISTINCT`	`IN`	`OPTIONS`	`TABLE`
`APPLY`	`DROP`	`INNER`	`OR`	`TEMP`
`AS`	`ELSE`	`INTERSECT`	`ORDER`	`TEMPORARY`
`ASC`	`EMPTY`	`INTERSECTS`	`ORDINALITY`	`THEN`
`BEGIN`	`END`	`IS`	`OUTER`	`TO`
`BETWEEN`	`EQUALS`	`JOIN`	`OVERLAPS`	`TOUCHES`
`BY`	`ESCAPE`	`LAST`	`PHYSICAL`	`TRANSACTION`
`CASE`	`ESTIMATED`	`LEFT`	`PLAN`	`TRUE`
`CONTAINS`	`EXCEPT`	`LIKE`	`PLANS`	`TYPE`
`COUNT`	`EXISTS`	`LIMIT`	`PRECEDES`	`UNION`
`COVEREDBY`	`EXPLAIN`	`LOGICAL`	`PREDICT`	`UNNEST`
`COVERS`	`FALSE`	`MINUS`	`REPLACE`	`USING`
`CREATE`	`FIRST`	`MOD`	`RIGHT`	`VALUES`
`CROSS`	`FOR`	`MODEL`	`ROW`	`WHEN`
`CROSSES`	`FROM`	`NAMESPACE`	`SELECT`	`WHERE`
`DATABASE`	`FULL`	`NOT`	`SEQUENCES`	`WITH`
`DESC`	`GROUP`	`NULL`	`SET`	`WITHIN`
`DESCRIBE`	`HAVING`	`NULLS`	`SIMILAR`	`WORK`

Non-reserved keywords:

CAST IF REPLACE TYPE

Identifiers

Identifiers such as x, vid and st.segment identify tables, columns or intermediate expression values in the context of a query. Like keywords, identifiers are also case-insensitive (e.g. vid and VID refer to the same thing), with the exception of quoted identifiers which are treated below.

For an identifier to be valid, it must start with a letter, contain only letters, digits and underscores (_) and must not equal one of the reserved MSQL keywords. All UTF-8 letter and digit characters are allowed in identifiers, not only the ASCII-compatible ones.

Quoted Identifiers

Quoted identifiers such as `where`, `Sum total` and ndb.`polygon` are a special kind of identifiers which have fewer restrictions on validity. They are bounded by backticks (`), and they can contain whitespace, reserved keywords and arbitrary UTF-8 character sequences. If a quoted identifier contains backticks, they must be escaped as double backticks (``).

Constant Literals

There are four kinds of implicitly-typed constant literals in MSQL:

Numeric literals (int64 and float64 numbers)
String literals (character and byte-string literals)
Temporal literals (representing timestamps and timeperiods)
Spatial literals (representing spatial objects)

There are also three special constants identified by keywords TRUE, FALSE, NULL.

Literals of other data types (e.g. int8, uint16 or float32) can be produced by casting one of the available literals to the desired data type. Such literal casts will be evaluated at query-compile-time, and in the compiled query they'll be replaced with the actual literal of the desired type.

To find out which casts are available for literal construction, take a look at the list of all the type casts supported in MSQL.

Numeric Literals

Numeric literals can be either integral or floating-point. Integer literals can be specified in decimal (default), binary, octal and hexadecimal bases:

decimal: [0-9]+                 -> e.g. 123
binary:  0[bB][01]+             -> e.g. 0b101 = 0B101 = 5 (decimal)
octal:   0[oO][0-7]+            -> e.g. 0o73 = 0O73 = 59 (decimal)
hexadecimal: 0[xX][0-9A-Fa-f]+  -> e.g. 0x1fff = 0X1FFF = 8191 (decimal)

Valid floating-point literals must have one of the following forms:

[0-9]+\.[0-9]+([eE][+-]?[0-9]+)?  -> e.g. '5.72e-12'
\.[0-9]+([eE][+-]?[0-9]+)?        -> e.g. '.5E7'
[0-9]+\.?([eE][+-]?[0-9]+)?       -> e.g. '1e9' or '2.E-4'

Note that:

If there is a decimal point present, at least one digit must be before or after it.
If there is an exponent marker present, at least one digit must always follow it.
No spaces or other characters can be embedded in the literal.
Any leading plus or minus sign is not considered a part of the literal; it is an operator applied to the literal. This holds for integral literals as well.

By default, integral literals are of a int64 type, and floating-point literals are of a float64 type. But depending on the context, integer literals might also be upcasted to float64 type, e.g. when passed as arguments to functions accepting float64 values, or in arithmetic operations where the other operand is already a float64 value.

String Literals

A string literal is an arbitrary sequence of UTF-8 characters bounded by single quotes ('). All string literals are of a text data type.

If a string literal contains single-quote characters, they must be escaped as double single-quotes (''), e.g. 'Hello, I''ts me' (Note that this is different from double-quote character ").

A string literal can also contain arbitrary 4-(hex)digit Unicode escape sequences prefixed by a backslash (\), e.g. '\044f\0431\043b\043e\043a\043e' (яблоко, the word for an apple in Russian Cyrillic script). If a string literal contains a literal backslash (\), it must be escaped as a double-backslash (\\).

C-style character escape sequences (\n, \t, etc.) aren't currently supported in the MSQL. Use the Unicode escape sequences instead (e.g. \000A, \0009).

Two string literals which are separated only by whitespace are concatenated and treated as one single literal.

For example:

SELECT 'abc'
    'def';

is equivalent to

SELECT 'abcdef';

Byte-string Literals

A byte-string literal is a string consisting of an even number of hexadecimal digits representing a sequence of raw bytes. Each byte of a sequence is represented as two consecutive hexadecimal digits, e.g. 7f. All byte string literals are of a binary data type. Byte string literals must be prefixed by a x or X character to distinguish them from ordinary string literals.

Same as with ordinary string literals, two-byte string literals which are separated only by whitespace are concatenated and treated as one single literal.

For example:

SELECT x'abc'
    '123';

is equivalent to

SELECT x'abc123';

Note that only the first part of the literal must be prefixed by x or X.

Temporal Literals

A temporal literal starts with a name of the temporal type (timestamp | ts, timeperiod | tp, unixepoch | epoch, tsrange or epochrange) it represents, followed by a string-literal representation of its temporal value.

Valid timestamp | ts and unixepoch | epoch literals must have the form:

(timestamp | ts) ('yyyy-mm-dd HH:MM:SS.fff' | NULL)    -- for timestamp literals
(unixepoch | epoch) ('yyyy-mm-dd HH:MM:SS.fff' | NULL)  -- for unixepoch literals

Not all parts of the string representation need to be present, only a valid prefix is required. In that case, the unspecified fields are truncated to the lowest value (1 for months and days, 0 for hours, minutes, seconds and milliseconds).

Valid tsrange and epochrange literals must have the form:

(rtimestamp | rts | tsrange | timestamprange) ('[yyyy-mm-dd HH:MM:SS.fff, yyyy-mm-dd HH:MM:SS.fff]' | NULL)
(runixepoch | repoch | epochrange | unixepochrange) ('[yyyy-mm-dd HH:MM:SS.fff, yyyy-mm-dd HH:MM:SS.fff]' | NULL)

As with timestamp | ts and unixepoch | epoch literals, not all parts of a datetime string need to be present but only a valid prefix.

A valid timeperiod | tp literal must have the form:

(timeperiod | tp) ('[-]N1 unit1 [N2 unit2 ...]' | NULL)

where N_i are integers, and unit_i are names/shortcut names of time units.

A few examples:

timestamp '2022-03-01 15:37'
unixepoch '2022-03-01 15:37' WITH TIMEZONE 'Europe/Rome'
timeperiod '3d 7h 13min'
tp '77 days'

For more details, see the built-in PARSE_TS and PARSE_TP functions, which are internally used to parse the string-literal representation of the temporal value.

Spatial Literals

A spatial literal starts with a name of the spatial type it represents, followed by a string-literal representation of its spatial value.

A valid spatial literal must have the following form:

(point | geogpoint) 'point-WKT'
(linestring | geoglinestring) ('linestring-WKT' | EMPTY | NULL)
(polygon | geogpolygon) ('polygon-WKT' | EMPTY | NULL)
(multipoint | geogmultipoint) ('multipoint-WKT' | EMPTY | NULL)
(multilinestring | geogmultilinestring) ('multilinestring-WKT' | EMPTY | NULL)
(multipolygon | geogmultipolygon) ('polygon-WKT' | EMPTY | NULL)

where the string-literal contents should be a valid WKT (Well-Known Text) representation of the spatial value (without the type identifier).

A few examples:

point '(10 20)'
multipolygon EMPTY
geoglinestring '(10 10, 11.5 11.5)'
geogmultipolygon NULL

Special Constants

Keywords TRUE, FALSE and NULL identify special constants built into the MSQL language itself. TRUE and FALSE keywords identify Boolean true and false values (stored in the bool type), and the NULL keyword identifies a special, unknown/missing value constant which can be of any type (including bool).

In some cases, the exact type of NULL constants is deduced from the surrounding context, but when this isn't possible, NULL constant must be given an explicit type by casting (e.g. NULL::int64 or CAST(NULL AS text)). For example, in the statement SELECT 1 + NULL the NULL constant will be automatically assigned a type int64 (the type of the first operand), but the statement SELECT NULL + NULL will cause an error (Illegal use of NULL) because the exact type of NULL constants cannot be deduced.

There also exist special operators for testing if some value is equal to one of the special constants: IS [NOT] TRUE, IS [NOT] FALSE and IS [NOT] NULL.

Comments

Both single-line and multi-line comments are supported in MSQL, using the following syntax:

-- A single-line comment
SELECT 1;  -- Note it need not start at the beginning of the line

/* Multi-line
comment */
SELECT 2, /* But it can also span just one line or a part of it */ 3, 4;

Comments aren't actually tokens, rather they are treated as whitespace by the MSQL parser.

Value Expressions

Value expressions in the MSQL language are formulae for calculating new values from constants and columns of input rows. They are used in a variety of contexts, most often in SELECT-lists for computing column values of the resultset and in WHERE- and ON-clauses for checking various conditions (i.e. computing their truth-value). There are several kinds of expressions supported, listed below.

Column Reference

A column reference expression returns the value of a specified column for each input row. If a column name is not unique across all the tables visible in the surrounding context, the column name must be qualified by prefixing it with the name of the table (or its alias), e.g. tbl.id or st.segment.x. Otherwise, the table name can be omitted and just the column name can be specified, e.g. vid.

Operator Invocation

There are three possible syntaxes for invoking built-in operators:

infix operators: operand OPERATOR operand (e.g. +, /, || operators).
prefix operators: OPERATOR operand (e.g. unary -, ~, NOT operators).
postfix operators: operand OPERATOR (e.g. IS NULL, IS TRUE operators).

Several operators require a special syntax which is not one of the above:

CASE: CASE WHEN c1 THEN r1 WHEN c2 THEN r2 ... END or CASE x WHEN v1 THEN r1 WHEN v2 THEN r2 ... END
[NOT] BETWEEN: x [NOT] BETWEEN y AND z
[] operator: x[0]

The order of evaluation of operands is not explicitly specified and can be changed by the query planner, except in cases of the CASE, AND and OR operators, which have specific evaluation rules (i.e. short-circuiting) and provide guarantees that in some cases certain operands won't even be evaluated.

In many cases, the operands are automatically upcasted to a supertype, e.g. smaller integer types to int64, or integers to floating-point numbers.

Function Invocation

Built-in functions can be invoked in expressions by specifying the function name, followed by a parenthesized list of arguments (which can be arbitrary expressions), e.g. POWER(SQRT(2), 3). If a function doesn't take any arguments, the parentheses are still required, for example CURRENT_TS().

The order of evaluation of function arguments is not explicitly specified and can be changed by the query planner, except in the case of functions which are internally rewritten as CASE expressions (e.g. IF, NULLIF, COALESCE), which then follow the CASE operand evaluation rules.

In many cases, function arguments are automatically upcasted to a supertype, e.g. smaller integer types to int64, or integers to floating-point numbers.

Aggregate Expressions

An aggregate expression represents a computation of a single aggregate value over multiple input rows. The aggregate value is computed by an aggregate function, such as SUM or STDDEV_POP. An aggregate expression can be in one of the following forms:

aggregate_fn ([DISTINCT] expr [, ...] [ORDER BY ...])
aggregate_fn (*)

where aggregate_fn is a name of a built-in aggregate function, and expr is any valid MSQL expression that does not contain an aggregate expression or a window function call.

The first form of the aggregate expression invokes the specified aggregate function once for each input row (when there is no DISTINCT keyword present), or once for each distinct value of the given expression (when DISTINCT is specified). Some aggregate functions (e.g. STRING_AGG) accept or even require that an ordering of their arguments be specified, which can be done using the ORDER BY clause, whose syntax is the same as in the SELECT statement (see the Ordering section below).

The ORDER BY clause, where permitted/required, must come after all the arguments of an aggregate function, e.g. STRING_AGG(x, ', ' ORDER BY t). Even though STRING_AGG(x ORDER BY t, ', ') is also syntactically valid, it has a different meaning as it represents a call of a single-argument aggregate function with two ORDER BY keys (the second one being a constant).

The second form also invokes the aggregate function once for each row of input, but as no input expression is specified, it is used only for the COUNT(*) aggregate expression. The difference between COUNT(*) and COUNT(c) is that the first one will return the total number of input rows, whereas the second will return the number of rows in which the column c is not NULL.

Aggregate expressions with DISTINCT are currently not supported.

Aggregate expressions can only appear in the select-list or in the HAVING clause of a SELECT statement. They must not appear in other clauses (e.g. in WHERE) because they are evaluated before the computation of the aggregate values.

If an aggregate expression appears in a subquery, it is normally evaluated over the rows of the subquery. But if the arguments of the aggregate function contain only variables from the outer query, it is evaluated over the rows of the outer query.

Scalar Subqueries

A scalar subquery is a parenthesized SELECT query that returns exactly one row having only one column. The subquery can refer to variables from the surrounding query, which act as constants in the subquery execution. The result of the subquery (a single column of a single row) can be used the same as the result of any other value expression. A scalar subquery that returns more than one row or more than one column will cause an error, but a query which returns no rows won't cause an error, rather its result will be declared to be NULL.

The following is an example of a query which calculates a delta from the average of a set of values utilizing scalar subqueries (and also a common table expression):

WITH t(x) AS (VALUES 3, 4, 1, 5)
SELECT
    (SELECT avg(x) FROM t) AS avg,
    x - (SELECT avg(x) FROM t) AS delta
FROM t;

--> +------+-------+
--> |  avg | delta |
--> +------+-------+
--> | 3.25 | -0.25 |
--> | 3.25 |  0.75 |
--> | 3.25 | -2.25 |
--> | 3.25 |  1.75 |
--> +------+-------+

Table Expressions

Table expressions in the MSQL language are expressions which return not a single value but a set of rows (i.e. a table). In general, table expressions are used in the FROM and JOIN/APPLY clauses of the SELECT statement, where they define a source of data to be used in the rest of the query.

Beside actual tables and subqueries returning tables, MSQL also offers support for table-valued functions. Table-valued functions behave the same as regular tables, the only difference being that they require arguments next to their name, and have a special join syntax (CROSS/OUTER APPLY) which internally most often results in correlated subqueries.

For example, the following query finds all the streets matching the search term 'Ilica' in the city of Zagreb by invoking the table-valued function map.geocode:

SELECT formatted_address FROM map.geocode_lines('Ilica, Zagreb');

--> +----------------------------------------+
--> |                      FORMATTED_ADDRESS |
--> +----------------------------------------+
--> |                  'Ilica, 10000 Zagreb' |
--> | 'Luke Ilića Oriovčanina, 10000 Zagreb' |
--> +----------------------------------------+

The list of all table-valued functions available in MSQL can be found here.