MSQL Reference – Getting Started with MSQL

The following is a hands-on introduction to working with MSQL query language and using it to query the data stored in the Mireo SpaceTime Cluster.
It assumes no prior knowledge about the MSQL query language or SpaceTime Cluster, but it does assume at least a basic familiarity with database systems and SQL language(s) in general.

All the queries assume an that there is an existing SpaceTime Cluster instance available and that it already has some data in it, namely the following (non-empty) tables: ndb.vehicle, st.dsensor, st.segment and st.trip.
For more info about these tables, see docs on SpaceTime Cluster Components.

1. Getting a list of available tables

As a first step in working with a new database, you can get a list of all the tables (and other entities) available for querying by running the DESCRIBE DATABASE command.

DESCRIBE DATABASE;

This command will return a list of all available tables (or entities), for example:

--> +-------+-------------------+
--> |   dbs |             table |
--> +-------+-------------------+
--> | 'ndb' | 'const_road_type' |
--> | 'ndb' |             'poi' |
--> | 'ndb' |         'polygon' |
--> | 'ndb' |         'vehicle' |
--> |  'st' |           'alarm' |
--> |  'st' |         'csensor' |
--> |  'st' |         'dsensor' |
--> |  'st' |         'segment' |
--> |  'st' |            'trip' |
--> +-------+-------------------+

The second column of the response contains a name of a table/entity, and the first column specifies the namespace (also known as DBS, short for DataBase System) to which the table/entity belongs.

To refer to a table in a query, you must always specify its full name, e.g. ndb.vehicle or st.trip.
Otherwise, you will get an error complaining about referencing an unknown object.

If you are interested only in tables/entities belonging to a particular namespace/DBS, you can get a list of them using the DESCRIBE NAMESPACE command.

2. Inspecting the layout of a table

When working with a particular table, you can inspect its layout using the DESCRIBE TABLE command, like this:

DESCRIBE TABLE st.trip;

This command will return info about the table columns, for example:

--> +-----------+--------------+----------+-------+
--> |      name |         type | nullable | extra |
--> +-----------+--------------+----------+-------+
--> |     'vid' |     'uint32' |    false |       |
--> |       't' | 'runixepoch' |    false |       |
--> |      'x0' |      'int32' |    false |       |
--> |      'y0' |      'int32' |    false |       |
--> |      'x1' |      'int32' |    false |       |
--> |      'y1' |      'int32' |    false |       |
--> |     'len' |     'uint32' |    false |       |
--> | 'stopdur' |     'uint32' |    false |       |
--> +-----------+--------------+----------+-------+

You can find out more information about column data types in the section on MSQL data types.

3. Selecting some rows from a table

To actually get some rows from a table, you should use the SELECT command. The SELECT command is the most commonly used command in the MSQL language, and the one with the most options and variations.

Here is an example of selecting any 10 rows from the st.trip table, to get a feel on how the data stored in the table really looks like:

SELECT * FROM st.trip LIMIT 10;

If you accidentally try to select all rows from a table (e.g. by forgetting LIMIT N), unless the table is very small, you will most likely get an error: Memory quota exceeded. This is a safety mechanism to prevent excessive data being read and transferred from the database, as the number of rows stored in a single table can easily be several billions or tens of billions.

The query might return a response like this:

--> +------------+--------------------------+------------+------------+------------+------------+------------+------------+
--> |        vid |                        t |         x0 |         y0 |         x1 |         y1 |        len |    stopdur |
--> +------------+--------------------------+------------+------------+------------+------------+------------+------------+
--> |       6263 | [1530875511, 1530876382] |    5003046 |   35963151 |    4971795 |   35917932 |       8319 |      11964 |
--> |       8264 | [1530878467, 1530879096] |    5857489 |   37341153 |    5854240 |   37333359 |       1522 |       7024 |
--> |       8264 | [1530886120, 1530886646] |    5854240 |   37333359 |    5857475 |   37341270 |       1938 |       7642 |
--> |       6263 | [1530888346, 1530888866] |    4971795 |   35917932 |    4991671 |   35930890 |       3065 |        754 |
--> |       6263 | [1530889620, 1530889766] |    4991671 |   35930890 |    4989529 |   35931141 |        284 |         98 |
--> |       6263 | [1530889864, 1530890411] |    4989529 |   35931141 |    5003003 |   35963168 |       5995 |      74899 |
--> |       8264 | [1530894288, 1530895003] |    5857475 |   37341270 |    5851713 |   37331401 |       1860 |       1019 |
--> |       8264 | [1530896022, 1530896308] |    5851713 |   37331401 |    5855096 |   37333282 |        906 |        270 |
--> |       8264 | [1530896578, 1530897399] |    5855096 |   37333282 |    5857483 |   37341256 |       2239 |      60501 |
--> |       8263 | [1530897827, 1530897958] |    6396232 |   37572004 |    6387519 |   37574794 |       1157 |      47459 |
--> +------------+--------------------------+------------+------------+------------+------------+------------+------------+

Note that in this particular table (st.trip), the column t is different from all others - it's a range, or more precisely, a epochrange range comprising of two Unix epoch timestamps representing the time span during which a single trip took place.

4. Analyzing query execution plan

Now, how exactly does the MSQL query engine executes the given query? To get the info about the steps it takes to produce the query result, you can run the EXPLAIN command. This command is also very useful in understanding query performance, and can help more advanced users to optimize their queries for better performance.

To see the execution plan of the the previous query (SELECT * FROM st.trip LIMIT 10), just put the keyword EXPLAIN in front of it, like this:

EXPLAIN SELECT * FROM st.trip LIMIT 10;

--> +----+-----------+--------------------+---------------+
--> | id | parent_id |          operation |       details |
--> +----+-----------+--------------------+---------------+
--> |  1 |      NULL |   'Collect (MSQL)' |            '' |
--> |  2 |         1 | 'Gather Sort (ST)' |   'limit: 10' |
--> |  3 |         2 |  'Table Scan (ST)' | 'table: trip' |
--> +----+-----------+--------------------+---------------+

The result of the query is a step-by-step explanation of the execution plan, which in this case has only two steps: 1. reading the rows from the st.trip table (a.k.a. scanning), and 2. taking only the first 10 rows (limiting).

5. Retrieving a total row count

To get an aggregate statistics about a table, you can specify an aggregate function in the SELECT statement. For example, to get a total number of rows in the st.trip table, you can use the COUNT aggregate function like this:

SELECT count(*) FROM st.trip;

--> 16129563

This query can take a long time to finish if you try to count all the rows from a very large table. It is advised not to do this unless you are sure the table is relatively small.

6. Row count in a single month

If you're only interested in a subset of the data, you can filter the rows by specifying the WHERE clause on the SELECT statement. For example, to only count the trips from January 2019, write the following query:

SELECT count(*) FROM st.trip
    WHERE t[0] between epoch '2019-01-01' and epoch '2019-02-01';

--> 825723

As was noted before, the st.trips.t column is a range of unixepoch values, so the bounds on it can be specified using temporal literals. More precisely, this query counts only those trips whose start was between 00:00:00, Jan 1st, 2019 and 00:00:00, Feb 1st, 2019.

From the MSQL version 22.04.1, the query above can be further shortened to: select count(*) from st.trip where t[0] within epochrange '[2019-01-01, 2019-02-01]'.

The conversion between dates and actual timestamps (which are timezone-independent) is done according to the session-default timezone, which is usually GMT. This means that the preceeding query is actually equivalent to this one (where datetimes are converted to unixepoch values):

SELECT count(*) FROM st.trip
    WHERE t[0] >= 1546300800::unixepoch and t[0] <= 1548979200::unixepoch;

--> 825723

But individual temporal literals can have different timezones specified, like so (note the small difference in the result):

SELECT count(*) FROM st.trip
    WHERE t[0] between
        epoch '2019-01-01' WITH timezone 'America/Los_Angeles' and
        epoch '2019-02-01' WITH timezone 'America/Los_Angeles';

--> 834637

It is also possible to specify a different session-default timezone for the remainder of the query, by writing the SET session.timezone = 'TZ name'; before the rest of the query, like so:

SET session.timezone = 'America/Los_Angeles';
SELECT count(*) FROM st.trip
    WHERE t[0] between epoch '2019-01-01' and epoch '2019-02-01';

--> 834637

7. Mileage by vehicle in a month

Aggregate statistics can also be computed for arbitrary groups of rows using the GROUP BY clause. For example, you can calculate total mileage driven by each vehicle in January 2019 using the following query:

SELECT vid, sum(len) / 1e3 AS mileage_km
    FROM st.trip
    WHERE t[0] between epoch '2019-01-01' and epoch '2019-02-01'
    GROUP BY vid;

The query could return results like these:

--> +------------+------------------+
--> |        vid |       mileage_km |
--> +------------+------------------+
--> |        ... |              ... |
--> |       4250 |          647.006 |
--> |       6618 |           75.565 |
--> |       9411 |         1015.365 |
--> |       8071 |          2386.26 |
--> |       9407 |         3211.232 |
--> |       6812 |          856.738 |
--> |      10537 |          2488.57 |
--> |       5925 |         1347.611 |
--> |        ... |              ... |
--> +------------+------------------+

Note that the query results are unordered. To add an ordering, you should specifiy the ORDER BY clause, optionally with a LIMIT to select only the top N results. In our case, to select only the top 10 vehicles with regard to the driven mileage, write the following query:

SELECT vid, sum(len) / 1e3 AS mileage_km
    FROM st.trip
    WHERE t[0] between epoch '2019-01-01' and epoch '2019-02-01'
    GROUP BY vid
    ORDER BY 2 DESC LIMIT 10;

--> +------------+-----------------+
--> |        vid |      mileage_km |
--> +------------+-----------------+
--> |       6519 |        23940.02 |
--> |       6515 |        23564.35 |
--> |       6528 |        21984.14 |
--> |       6529 |        19877.73 |
--> |       6521 |        19571.18 |
--> |       6513 |        18813.94 |
--> |       6514 |        18466.29 |
--> |       6522 |        17991.63 |
--> |       6518 |        17647.84 |
--> |       4452 |        17420.92 |
--> +------------+-----------------+

You can also select only those groups which have the value of computed aggregate above a certain threshold using the HAVING clause. For example, the query to obtain only vehicles that drove more than 20,000km in January 2019 is:

SELECT vid, sum(len) / 1e3 AS mileage_km
    FROM st.trip
    WHERE t[0] between epoch '2019-01-01' and epoch '2019-02-01'
    GROUP BY vid
    HAVING mileage_km >= 20000
    ORDER BY 2 DESC;

--> +------------+-----------------+
--> |        vid |      mileage_km |
--> +------------+-----------------+
--> |       6519 |        23940.02 |
--> |       6515 |        23564.35 |
--> |       6528 |        21984.14 |
--> +------------+-----------------+

8. Mileage by vehicle in a month - weekday/weekend

You might also like to calculate some more interesting aggregates, such as how many kilometers did each vehicle drive on weekdays, and how many on weekends. The query to do so is this:

SELECT vid,
        sum(IF(isodow(t[0]) between 1 and 5, len, 0)) / 1e3 AS weekday_km,
        sum(IF(isodow(t[0]) between 6 and 7, len, 0)) / 1e3 AS weekend_km 
    FROM st.trip
    WHERE t[0] between epoch '2019-01-01' and epoch '2019-02-01'
    GROUP BY vid;

And the results could be similar to the following:

--> +------------+------------------+------------------+
--> |        vid |       weekday_km |       weekend_km |
--> +------------+------------------+------------------+
--> |        ... |              ... |              ... |
--> |       7070 |         3151.215 |           47.446 |
--> |       8072 |         2461.197 |          149.536 |
--> |       6781 |         1859.853 |           68.814 |
--> |       7999 |            8.627 |              0.0 |
--> |       6833 |         1023.189 |          117.966 |
--> |       7514 |            0.291 |              0.0 |
--> |        ... |              ... |              ... |
--> +------------+------------------+------------------+

9. Vehicle mileage during the day

You can also make some interesting analyses about an individual vehicle, for example a breakdown of mileage during a single day. The (naive) query to do so would be:

SELECT hour(t[0]) h, sum(st_length(x[0], y[0], x[1], y[1])) / 1e3 AS mileage_km
    FROM st.segment
    WHERE vid = 3652 and t[0] between epoch '2019-10-15' and epoch '2019-10-16'
    GROUP BY 1
    ORDER BY 1;

--> +------------------+------------------+
--> |                h |       mileage_km |
--> +------------------+------------------+
--> |                3 | 56.4771007954673 |
--> |                4 | 30.7708292415218 |
--> |                5 | 80.1512585243729 |
--> |                6 | 42.7871458744136 |
--> |                7 | 29.1702128185770 |
--> |                9 | 33.6257630266579 |
--> |               10 | 12.1536966921203 |
--> |               11 | 76.5248896784146 |
--> |               12 | 57.7569693565398 |
--> +------------------+------------------+

As you probably noticed, the problem with this query is that if a vehicle didn't drive during a certain hour of the day, there will be no entry in the result for that hour. To fix this, we can introduce a left join with a constant-value table representing all the hours of a day:

SELECT hrs.h, mileage_km ?? 0
    FROM (
        VALUES 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23
    ) AS hrs(h) LEFT JOIN (
        SELECT hour(t[0]) h, sum(st_length(x[0], y[0], x[1], y[1])) / 1e3 AS mileage_km
        FROM st.segment
        WHERE vid = 3652 and t[0] between epoch '2019-10-15' and epoch '2019-10-16'
        GROUP BY 1
    ) s ON s.h = hrs.h
    ORDER BY 1;

--> +------------------+------------------+
--> |                h |       mileage_km |
--> +------------------+------------------+
--> |                0 |              0.0 |
--> |                1 |              0.0 |
--> |                2 |              0.0 |
--> |                3 | 56.4771007954673 |
--> |                4 | 30.7708292415218 |
--> |                5 | 80.1512585243729 |
--> |                6 | 42.7871458744136 |
--> |                7 | 29.1702128185770 |
--> |                8 |              0.0 |
--> |                9 | 33.6257630266579 |
--> |               10 | 12.1536966921203 |
--> |               11 | 76.5248896784146 |
--> |               12 | 57.7569693565398 |
--> |               13 |              0.0 |
--> |               14 |              0.0 |
--> |               15 |              0.0 |
--> |               16 |              0.0 |
--> |               17 |              0.0 |
--> |               18 |              0.0 |
--> |               19 |              0.0 |
--> |               20 |              0.0 |
--> |               21 |              0.0 |
--> |               22 |              0.0 |
--> |               23 |              0.0 |
--> +------------------+------------------+

There are several other ways this query can be improved. First, the VALUES part of the query can be extracted in a WITH clause, like this:

WITH hrs(h) AS (
    VALUES 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23
) SELECT hrs.h, mileage_km ?? 0
    FROM hrs LEFT JOIN (
        SELECT hour(t[0]) h, sum(st_length(x[0], y[0], x[1], y[1])) / 1e3 AS mileage_km
        FROM st.segment
        WHERE vid = 3652 and t[0] between epoch '2019-10-15' and epoch '2019-10-16'
        GROUP BY 1
    ) s ON s.h = hrs.h
    ORDER BY 1;

The same can be done with the other subquery:

WITH hrs(h) AS (
    VALUES 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23
), seg_stat(h, mileage_km) AS (
    SELECT hour(t[0]) h, sum(st_length(x[0], y[0], x[1], y[1])) / 1e3 AS mileage_km
        FROM st.segment
        WHERE vid = 3652 and t[0] between epoch '2019-10-15' and epoch '2019-10-16'
        GROUP BY 1
)
 SELECT hrs.h, mileage_km ?? 0
    FROM hrs
    LEFT JOIN seg_stats ON hrs.h = seg_stats.h
    ORDER BY 1;

And finally, the VALUES list can be replaced with a call to the CONST.SEQUENCE table-valued function:

WITH hrs(h) AS (SELECT * FROM const.sequence(24)),
    seg_stat(h, mileage_km) AS (
        SELECT hour(t[0]) h, sum(st_length(x[0], y[0], x[1], y[1])) / 1e3 AS mileage_km
            FROM st.segment
            WHERE vid = 3652 and t[0] between epoch '2019-10-15' and epoch '2019-10-16'
            GROUP BY 1
    )
 SELECT hrs.h, mileage_km ?? 0
    FROM hrs
    LEFT JOIN seg_stats ON hrs.h = seg_stats.h
    ORDER BY 1;