A basic unit of MSQL input is a statement. There are several kinds of MSQL statements: SELECT
statement, EXPLAIN
statement, DESCRIBE
statement and SET
statement, and others will likely be added in future releases.
Each MSQL statement consists of a sequence of tokens, which are the smallest building blocks of MSQL query language.
If there is more than one statement in the input, all but the last statement must be terminated with a semicolon (;
).
A token can be a keyword (e.g. SELECT
), an identifier (len
), a quoted identifier (type
), a constant literal ('abc'
, 127
) or a special character symbol (,
, ||
, *
).
Tokens are most often separated by some amount of whitespace (spaces, tabs and newlines), but the whitespace is optional if one of the tokens is a special character symbol
(e.g. SELECT*FROM st.segment WHERE vid>0
is a valid query).
Keywords such as SELECT
, FROM
and WHERE
have a predefined meaning in the MSQL language. They consist only of letters and are case-insensitive (e.g. SELECT
and seLEct
are treated the same).
Keywords cannot appear in an arbitrary position in a statement, but have a reserved place for them in the surrounding context (e.g. a FROM
keyword must always come after SELECT
, WHERE
must not come before GROUP BY
etc).
There are two kinds of keywords: reserved keywords and non-reserved keywords. The difference between them is that an identifier cannot be named the same as one of the reserved keywords, but it can have the same name as one of the non-reserved keywords.
For example, SELECT x AS where FROM st.segment
is an invalid MSQL statement because where
is a reserved keyword, while SELECT t[0] AS left FROM st.segment
is a valid statement because left
is a non-reserved keyword.
The following are the lists of all the reserved and non-reserved keywords in the MSQL language:
ALL | DISCARD | IF | OFFSET | SOME |
AND | DISJOINT | IMMEDIATELY | ON | SUCCEEDS |
ANY | DISTINCT | IN | OPTIONS | TABLE |
APPLY | DROP | INNER | OR | TEMP |
AS | ELSE | INTERSECT | ORDER | TEMPORARY |
ASC | EMPTY | INTERSECTS | ORDINALITY | THEN |
BEGIN | END | IS | OUTER | TO |
BETWEEN | EQUALS | JOIN | OVERLAPS | TOUCHES |
BY | ESCAPE | LAST | PHYSICAL | TRANSACTION |
CASE | ESTIMATED | LEFT | PLAN | TRUE |
CONTAINS | EXCEPT | LIKE | PLANS | TYPE |
COUNT | EXISTS | LIMIT | PRECEDES | UNION |
COVEREDBY | EXPLAIN | LOGICAL | PREDICT | UNNEST |
COVERS | FALSE | MINUS | REPLACE | USING |
CREATE | FIRST | MOD | RIGHT | VALUES |
CROSS | FOR | MODEL | ROW | WHEN |
CROSSES | FROM | NAMESPACE | SELECT | WHERE |
DATABASE | FULL | NOT | SEQUENCES | WITH |
DESC | GROUP | NULL | SET | WITHIN |
DESCRIBE | HAVING | NULLS | SIMILAR | WORK |
CAST | IF | REPLACE | TYPE |
Identifiers such as x
, vid
and st.segment
identify tables, columns or intermediate expression values in the context of a query. Like keywords, identifiers are also case-insensitive (e.g. vid
and VID
refer to the same thing), with the exception of quoted identifiers which are treated below.
For an identifier to be valid, it must start with a letter, contain only letters, digits and underscores (_
) and must not equal one of the reserved MSQL keywords. All UTF-8 letter and digit characters are allowed in identifiers, not only the ASCII-compatible ones.
Quoted identifiers such as `where`
, `Sum total`
and ndb.`polygon`
are a special kind of identifiers which have fewer restrictions on validity. They are bounded by backticks (`
), and they can contain whitespace, reserved keywords and arbitrary UTF-8 character sequences. If a quoted identifier contains backticks, they must be escaped as double backticks (``
).
There are four kinds of implicitly-typed constant literals in MSQL:
int64
and float64
numbers)There are also three special constants identified by keywords TRUE
, FALSE
, NULL
.
Literals of other data types (e.g. int8
, uint16
or float32
) can be produced by casting one of the available literals to the desired data type.
Such literal casts will be evaluated at query-compile-time, and in the compiled query they'll be replaced with the actual literal of the desired type.
To find out which casts are available for literal construction, take a look at the list of all the type casts supported in MSQL.
Numeric literals can be either integral or floating-point. Integer literals can be specified in decimal (default), binary, octal and hexadecimal bases:
decimal: [0-9]+ -> e.g. 123
binary: 0[bB][01]+ -> e.g. 0b101 = 0B101 = 5 (decimal)
octal: 0[oO][0-7]+ -> e.g. 0o73 = 0O73 = 59 (decimal)
hexadecimal: 0[xX][0-9A-Fa-f]+ -> e.g. 0x1fff = 0X1FFF = 8191 (decimal)
Valid floating-point literals must have one of the following forms:
[0-9]+\.[0-9]+([eE][+-]?[0-9]+)? -> e.g. '5.72e-12'
\.[0-9]+([eE][+-]?[0-9]+)? -> e.g. '.5E7'
[0-9]+\.?([eE][+-]?[0-9]+)? -> e.g. '1e9' or '2.E-4'
Note that:
By default, integral literals are of a int64
type, and floating-point literals are of a float64
type. But depending on the context, integer literals might also be upcasted to float64
type, e.g. when passed as arguments to functions accepting float64
values, or in arithmetic operations where the other operand is already a float64
value.
A string literal is an arbitrary sequence of UTF-8 characters bounded by single quotes ('
). All string literals are of a text
data type.
If a string literal contains single-quote characters, they must be escaped as double single-quotes (''
), e.g. 'Hello, I''ts me'
(Note that this is different from double-quote character "
).
A string literal can also contain arbitrary 4-(hex)digit Unicode escape sequences prefixed by a backslash (\
), e.g. '\044f\0431\043b\043e\043a\043e'
(яблоко, the word for an apple in Russian Cyrillic script). If a string literal contains a literal backslash (\
), it must be escaped as a double-backslash (\\
).
C-style character escape sequences (\n
, \t
, etc.) aren't currently supported in the MSQL. Use the Unicode escape sequences instead (e.g. \000A
, \0009
).
Two string literals which are separated only by whitespace are concatenated and treated as one single literal.
For example:
SELECT 'abc'
'def';
is equivalent to
SELECT 'abcdef';
A byte-string literal is a string consisting of an even number of hexadecimal digits representing a sequence of raw bytes. Each byte of a sequence is represented as two consecutive hexadecimal digits, e.g. 7f
. All byte string literals are of a binary
data type.
Byte string literals must be prefixed by a x
or X
character to distinguish them from ordinary string literals.
Same as with ordinary string literals, two-byte string literals which are separated only by whitespace are concatenated and treated as one single literal.
For example:
SELECT x'abc'
'123';
is equivalent to
SELECT x'abc123';
Note that only the first part of the literal must be prefixed by x
or X
.
A temporal literal starts with a name of the temporal type (timestamp | ts, timeperiod | tp, unixepoch | epoch, tsrange or epochrange) it represents, followed by a string-literal representation of its temporal value.
Valid timestamp | ts and unixepoch | epoch literals must have the form:
(timestamp | ts) ('yyyy-mm-dd HH:MM:SS.fff' | NULL) -- for timestamp literals
(unixepoch | epoch) ('yyyy-mm-dd HH:MM:SS.fff' | NULL) -- for unixepoch literals
Not all parts of the string representation need to be present, only a valid prefix is required. In that case, the unspecified fields are truncated to the lowest value (1
for months and days, 0
for hours, minutes, seconds and milliseconds).
Valid tsrange and epochrange literals must have the form:
(rtimestamp | rts | tsrange | timestamprange) ('[yyyy-mm-dd HH:MM:SS.fff, yyyy-mm-dd HH:MM:SS.fff]' | NULL)
(runixepoch | repoch | epochrange | unixepochrange) ('[yyyy-mm-dd HH:MM:SS.fff, yyyy-mm-dd HH:MM:SS.fff]' | NULL)
As with timestamp | ts and unixepoch | epoch literals, not all parts of a datetime string need to be present but only a valid prefix.
A valid timeperiod | tp literal must have the form:
(timeperiod | tp) ('[-]N1 unit1 [N2 unit2 ...]' | NULL)
where N_i
are integers, and unit_i
are names/shortcut names of time units.
A few examples:
timestamp '2022-03-01 15:37'
unixepoch '2022-03-01 15:37' WITH TIMEZONE 'Europe/Rome'
timeperiod '3d 7h 13min'
tp '77 days'
For more details, see the built-in PARSE_TS and PARSE_TP functions, which are internally used to parse the string-literal representation of the temporal value.
A spatial literal starts with a name of the spatial type it represents, followed by a string-literal representation of its spatial value.
A valid spatial literal must have the following form:
(point | geogpoint) 'point-WKT'
(linestring | geoglinestring) ('linestring-WKT' | EMPTY | NULL)
(polygon | geogpolygon) ('polygon-WKT' | EMPTY | NULL)
(multipoint | geogmultipoint) ('multipoint-WKT' | EMPTY | NULL)
(multilinestring | geogmultilinestring) ('multilinestring-WKT' | EMPTY | NULL)
(multipolygon | geogmultipolygon) ('polygon-WKT' | EMPTY | NULL)
where the string-literal contents should be a valid WKT (Well-Known Text) representation of the spatial value (without the type identifier).
A few examples:
point '(10 20)'
multipolygon EMPTY
geoglinestring '(10 10, 11.5 11.5)'
geogmultipolygon NULL
Keywords TRUE
, FALSE
and NULL
identify special constants built into the MSQL language itself. TRUE
and FALSE
keywords identify Boolean true
and false
values (stored in the bool type), and the NULL
keyword identifies a special, unknown/missing value constant which can be of any type (including bool
).
In some cases, the exact type of NULL
constants is deduced from the surrounding context, but when this isn't possible, NULL
constant must be given an explicit type by casting (e.g. NULL::int64
or CAST(NULL AS text)
). For example, in the statement SELECT 1 + NULL
the NULL
constant will be automatically assigned a type int64
(the type of the first operand), but the statement SELECT NULL + NULL
will cause an error (Illegal use of NULL) because the exact type of NULL
constants cannot be deduced.
There also exist special operators for testing if some value is equal to one of the special constants: IS [NOT] TRUE, IS [NOT] FALSE and IS [NOT] NULL.
Both single-line and multi-line comments are supported in MSQL, using the following syntax:
-- A single-line comment
SELECT 1; -- Note it need not start at the beginning of the line
/* Multi-line
comment */
SELECT 2, /* But it can also span just one line or a part of it */ 3, 4;
Comments aren't actually tokens, rather they are treated as whitespace by the MSQL parser.
Value expressions in the MSQL language are formulae for calculating new values from constants and columns of input rows. They are used in a variety of contexts, most often in SELECT
-lists for computing column values of the resultset and in WHERE
- and ON
-clauses for checking various conditions (i.e. computing their truth-value). There are several kinds of expressions supported, listed below.
A column reference expression returns the value of a specified column for each input row. If a column name is not unique across all the tables visible in the surrounding context, the column name must be qualified by prefixing it with the name of the table (or its alias), e.g. tbl.id
or st.segment.x
. Otherwise, the table name can be omitted and just the column name can be specified, e.g. vid
.
There are three possible syntaxes for invoking built-in operators:
operand OPERATOR operand
(e.g. +
, /
, ||
operators).OPERATOR operand
(e.g. unary -
, ~
, NOT
operators).operand OPERATOR
(e.g. IS NULL
, IS TRUE
operators).Several operators require a special syntax which is not one of the above:
CASE WHEN c1 THEN r1 WHEN c2 THEN r2 ... END
or CASE x WHEN v1 THEN r1 WHEN v2 THEN r2 ... END
x [NOT] BETWEEN y AND z
x[0]
The order of evaluation of operands is not explicitly specified and can be changed by the query planner, except in cases of the CASE, AND and OR operators, which have specific evaluation rules (i.e. short-circuiting) and provide guarantees that in some cases certain operands won't even be evaluated.
In many cases, the operands are automatically upcasted to a supertype, e.g. smaller integer types to int64
, or integers to floating-point numbers.
Built-in functions can be invoked in expressions by specifying the function name, followed by a parenthesized list of arguments (which can be arbitrary expressions), e.g. POWER(SQRT(2), 3)
. If a function doesn't take any arguments, the parentheses are still required, for example CURRENT_TS()
.
The order of evaluation of function arguments is not explicitly specified and can be changed by the query planner, except in the case of functions which are internally rewritten as CASE expressions (e.g. IF, NULLIF, COALESCE), which then follow the CASE operand evaluation rules.
In many cases, function arguments are automatically upcasted to a supertype, e.g. smaller integer types to int64
, or integers to floating-point numbers.
An aggregate expression represents a computation of a single aggregate value over multiple input rows. The aggregate value is computed by an aggregate function, such as SUM or STDDEV_POP. An aggregate expression can be in one of the following forms:
aggregate_fn ([DISTINCT] expr [, ...] [ORDER BY ...])
aggregate_fn (*)
where aggregate_fn
is a name of a built-in aggregate function, and expr
is any valid MSQL expression that does not contain an aggregate expression or a window function call.
The first form of the aggregate expression invokes the specified aggregate function once for each input row (when there is no DISTINCT
keyword present), or once for each distinct value of the given expression (when DISTINCT
is specified). Some aggregate functions (e.g. STRING_AGG) accept or even require that an ordering of their arguments be specified, which can be done using the ORDER BY
clause, whose syntax is the same as in the SELECT
statement (see the Ordering section below).
The ORDER BY
clause, where permitted/required, must come after all the arguments of an aggregate function, e.g. STRING_AGG(x, ', ' ORDER BY t)
. Even though STRING_AGG(x ORDER BY t, ', ')
is also syntactically valid, it has a different meaning as it represents a call of a single-argument aggregate function with two ORDER BY
keys (the second one being a constant).
The second form also invokes the aggregate function once for each row of input, but as no input expression is specified, it is used only for the COUNT(*)
aggregate expression. The difference between COUNT(*)
and COUNT(c)
is that the first one will return the total number of input rows, whereas the second will return the number of rows in which the column c
is not NULL
.
Aggregate expressions with DISTINCT
are currently not supported.
Aggregate expressions can only appear in the select-list or in the HAVING
clause of a SELECT
statement. They must not appear in other clauses (e.g. in WHERE
) because they are evaluated before the computation of the aggregate values.
If an aggregate expression appears in a subquery, it is normally evaluated over the rows of the subquery. But if the arguments of the aggregate function contain only variables from the outer query, it is evaluated over the rows of the outer query.
A scalar subquery is a parenthesized SELECT
query that returns exactly one row having only one column. The subquery can refer to variables from the surrounding query, which act as constants in the subquery execution. The result of the subquery (a single column of a single row) can be used the same as the result of any other value expression. A scalar subquery that returns more than one row or more than one column will cause an error, but a query which returns no rows won't cause an error, rather its result will be declared to be NULL
.
The following is an example of a query which calculates a delta from the average of a set of values utilizing scalar subqueries (and also a common table expression):
WITH t(x) AS (VALUES 3, 4, 1, 5)
SELECT
(SELECT avg(x) FROM t) AS avg,
x - (SELECT avg(x) FROM t) AS delta
FROM t;
--> +------+-------+
--> | avg | delta |
--> +------+-------+
--> | 3.25 | -0.25 |
--> | 3.25 | 0.75 |
--> | 3.25 | -2.25 |
--> | 3.25 | 1.75 |
--> +------+-------+
Table expressions in the MSQL language are expressions which return not a single value but a set of rows (i.e. a table). In general, table expressions are used in the FROM
and JOIN/APPLY
clauses of the SELECT statement, where they define a source of data to be used in the rest of the query.
Beside actual tables and subqueries returning tables, MSQL also offers support for table-valued functions. Table-valued functions behave the same as regular tables, the only difference being that they require arguments next to their name, and have a special join syntax (CROSS/OUTER APPLY
) which internally most often results in correlated subqueries.
For example, the following query finds all the streets matching the search term 'Ilica'
in the city of Zagreb by invoking the table-valued function map.geocode
:
SELECT formatted_address FROM map.geocode_lines('Ilica, Zagreb');
--> +----------------------------------------+
--> | FORMATTED_ADDRESS |
--> +----------------------------------------+
--> | 'Ilica, 10000 Zagreb' |
--> | 'Luke Ilića Oriovčanina, 10000 Zagreb' |
--> +----------------------------------------+
The list of all table-valued functions available in MSQL can be found here.