The following is a hands-on introduction to working with MSQL query language and using it to query the data stored in the Mireo SpaceTime Cluster.
It assumes no prior knowledge about the MSQL query language or SpaceTime Cluster, but it does assume at least a basic familiarity with database systems and SQL language(s) in general.
All the queries assume an that there is an existing SpaceTime Cluster instance available and that it already has some data in it, namely the following (non-empty) tables: ndb.vehicle
, st.dsensor
, st.segment
and st.trip
.
For more info about these tables, see docs on SpaceTime Cluster Components.
As a first step in working with a new database, you can get a list of all the tables (and other entities) available for querying by running the DESCRIBE DATABASE
command.
DESCRIBE DATABASE;
This command will return a list of all available tables (or entities), for example:
--> +-------+-------------------+
--> | dbs | table |
--> +-------+-------------------+
--> | 'ndb' | 'const_road_type' |
--> | 'ndb' | 'poi' |
--> | 'ndb' | 'polygon' |
--> | 'ndb' | 'vehicle' |
--> | 'st' | 'alarm' |
--> | 'st' | 'csensor' |
--> | 'st' | 'dsensor' |
--> | 'st' | 'segment' |
--> | 'st' | 'trip' |
--> +-------+-------------------+
The second column of the response contains a name of a table/entity, and the first column specifies the namespace (also known as DBS, short for DataBase System) to which the table/entity belongs.
To refer to a table in a query, you must always specify its full name, e.g. ndb.vehicle
or st.trip
.
Otherwise, you will get an error complaining about referencing an unknown object.
If you are interested only in tables/entities belonging to a particular namespace/DBS, you can get a list of them using the DESCRIBE NAMESPACE
command.
When working with a particular table, you can inspect its layout using the DESCRIBE TABLE
command, like this:
DESCRIBE TABLE st.trip;
This command will return info about the table columns, for example:
--> +-----------+--------------+----------+-------+
--> | name | type | nullable | extra |
--> +-----------+--------------+----------+-------+
--> | 'vid' | 'uint32' | false | |
--> | 't' | 'runixepoch' | false | |
--> | 'x0' | 'int32' | false | |
--> | 'y0' | 'int32' | false | |
--> | 'x1' | 'int32' | false | |
--> | 'y1' | 'int32' | false | |
--> | 'len' | 'uint32' | false | |
--> | 'stopdur' | 'uint32' | false | |
--> +-----------+--------------+----------+-------+
You can find out more information about column data types in the section on MSQL data types.
To actually get some rows from a table, you should use the SELECT
command. The SELECT
command is the most commonly used command in the MSQL language, and the one with the most options and variations.
Here is an example of selecting any 10 rows from the st.trip
table, to get a feel on how the data stored in the table really looks like:
SELECT * FROM st.trip LIMIT 10;
If you accidentally try to select all rows from a table (e.g. by forgetting LIMIT N
), unless the table is very small, you will most likely get an error: Memory quota exceeded. This is a safety mechanism to prevent excessive data being read and transferred from the database, as the number of rows stored in a single table can easily be several billions or tens of billions.
The query might return a response like this:
--> +------------+--------------------------+------------+------------+------------+------------+------------+------------+
--> | vid | t | x0 | y0 | x1 | y1 | len | stopdur |
--> +------------+--------------------------+------------+------------+------------+------------+------------+------------+
--> | 6263 | [1530875511, 1530876382] | 5003046 | 35963151 | 4971795 | 35917932 | 8319 | 11964 |
--> | 8264 | [1530878467, 1530879096] | 5857489 | 37341153 | 5854240 | 37333359 | 1522 | 7024 |
--> | 8264 | [1530886120, 1530886646] | 5854240 | 37333359 | 5857475 | 37341270 | 1938 | 7642 |
--> | 6263 | [1530888346, 1530888866] | 4971795 | 35917932 | 4991671 | 35930890 | 3065 | 754 |
--> | 6263 | [1530889620, 1530889766] | 4991671 | 35930890 | 4989529 | 35931141 | 284 | 98 |
--> | 6263 | [1530889864, 1530890411] | 4989529 | 35931141 | 5003003 | 35963168 | 5995 | 74899 |
--> | 8264 | [1530894288, 1530895003] | 5857475 | 37341270 | 5851713 | 37331401 | 1860 | 1019 |
--> | 8264 | [1530896022, 1530896308] | 5851713 | 37331401 | 5855096 | 37333282 | 906 | 270 |
--> | 8264 | [1530896578, 1530897399] | 5855096 | 37333282 | 5857483 | 37341256 | 2239 | 60501 |
--> | 8263 | [1530897827, 1530897958] | 6396232 | 37572004 | 6387519 | 37574794 | 1157 | 47459 |
--> +------------+--------------------------+------------+------------+------------+------------+------------+------------+
Note that in this particular table (st.trip
), the column t
is different from all others - it's a range, or more precisely, a epochrange range comprising of two Unix epoch timestamps representing the time span during which a single trip took place.
Now, how exactly does the MSQL query engine executes the given query? To get the info about the steps it takes to produce the query result, you can run the EXPLAIN
command. This command is also very useful in understanding query performance, and can help more advanced users to optimize their queries for better performance.
To see the execution plan of the the previous query (SELECT * FROM st.trip LIMIT 10
), just put the keyword EXPLAIN
in front of it, like this:
EXPLAIN SELECT * FROM st.trip LIMIT 10;
--> +----+-----------+--------------------+---------------+
--> | id | parent_id | operation | details |
--> +----+-----------+--------------------+---------------+
--> | 1 | NULL | 'Collect (MSQL)' | '' |
--> | 2 | 1 | 'Gather Sort (ST)' | 'limit: 10' |
--> | 3 | 2 | 'Table Scan (ST)' | 'table: trip' |
--> +----+-----------+--------------------+---------------+
The result of the query is a step-by-step explanation of the execution plan, which in this case has only two steps: 1. reading the rows from the st.trip
table (a.k.a. scanning), and 2. taking only the first 10 rows (limiting).
To get an aggregate statistics about a table, you can specify an aggregate function in the SELECT
statement. For example, to get a total number of rows in the st.trip
table, you can use the COUNT aggregate function like this:
SELECT count(*) FROM st.trip;
--> 16129563
This query can take a long time to finish if you try to count all the rows from a very large table. It is advised not to do this unless you are sure the table is relatively small.
If you're only interested in a subset of the data, you can filter the rows by specifying the WHERE
clause on the SELECT
statement. For example, to only count the trips from January 2019, write the following query:
SELECT count(*) FROM st.trip
WHERE t[0] between epoch '2019-01-01' and epoch '2019-02-01';
--> 825723
As was noted before, the st.trips.t
column is a range of unixepoch
values, so the bounds on it can be specified using temporal literals. More precisely, this query counts only those trips whose start was between 00:00:00, Jan 1st, 2019
and 00:00:00, Feb 1st, 2019
.
From the MSQL version 22.04.1
, the query above can be further shortened to: select count(*) from st.trip where t[0] within epochrange '[2019-01-01, 2019-02-01]'
.
The conversion between dates and actual timestamps (which are timezone-independent) is done according to the session-default timezone, which is usually GMT
.
This means that the preceeding query is actually equivalent to this one (where datetimes are converted to unixepoch values):
SELECT count(*) FROM st.trip
WHERE t[0] >= 1546300800::unixepoch and t[0] <= 1548979200::unixepoch;
--> 825723
But individual temporal literals can have different timezones specified, like so (note the small difference in the result):
SELECT count(*) FROM st.trip
WHERE t[0] between
epoch '2019-01-01' WITH timezone 'America/Los_Angeles' and
epoch '2019-02-01' WITH timezone 'America/Los_Angeles';
--> 834637
It is also possible to specify a different session-default timezone for the remainder of the query, by writing the SET session.timezone = 'TZ name';
before the rest of the query, like so:
SET session.timezone = 'America/Los_Angeles';
SELECT count(*) FROM st.trip
WHERE t[0] between epoch '2019-01-01' and epoch '2019-02-01';
--> 834637
Aggregate statistics can also be computed for arbitrary groups of rows using the GROUP BY
clause. For example, you can calculate total mileage driven by each vehicle in January 2019
using the following query:
SELECT vid, sum(len) / 1e3 AS mileage_km
FROM st.trip
WHERE t[0] between epoch '2019-01-01' and epoch '2019-02-01'
GROUP BY vid;
The query could return results like these:
--> +------------+------------------+
--> | vid | mileage_km |
--> +------------+------------------+
--> | ... | ... |
--> | 4250 | 647.006 |
--> | 6618 | 75.565 |
--> | 9411 | 1015.365 |
--> | 8071 | 2386.26 |
--> | 9407 | 3211.232 |
--> | 6812 | 856.738 |
--> | 10537 | 2488.57 |
--> | 5925 | 1347.611 |
--> | ... | ... |
--> +------------+------------------+
Note that the query results are unordered. To add an ordering, you should specifiy the ORDER BY
clause, optionally with a LIMIT
to select only the top N results. In our case, to select only the top 10 vehicles with regard to the driven mileage, write the following query:
SELECT vid, sum(len) / 1e3 AS mileage_km
FROM st.trip
WHERE t[0] between epoch '2019-01-01' and epoch '2019-02-01'
GROUP BY vid
ORDER BY 2 DESC LIMIT 10;
--> +------------+-----------------+
--> | vid | mileage_km |
--> +------------+-----------------+
--> | 6519 | 23940.02 |
--> | 6515 | 23564.35 |
--> | 6528 | 21984.14 |
--> | 6529 | 19877.73 |
--> | 6521 | 19571.18 |
--> | 6513 | 18813.94 |
--> | 6514 | 18466.29 |
--> | 6522 | 17991.63 |
--> | 6518 | 17647.84 |
--> | 4452 | 17420.92 |
--> +------------+-----------------+
You can also select only those groups which have the value of computed aggregate above a certain threshold using the HAVING
clause. For example, the query to obtain only vehicles that drove more than 20,000km
in January 2019 is:
SELECT vid, sum(len) / 1e3 AS mileage_km
FROM st.trip
WHERE t[0] between epoch '2019-01-01' and epoch '2019-02-01'
GROUP BY vid
HAVING mileage_km >= 20000
ORDER BY 2 DESC;
--> +------------+-----------------+
--> | vid | mileage_km |
--> +------------+-----------------+
--> | 6519 | 23940.02 |
--> | 6515 | 23564.35 |
--> | 6528 | 21984.14 |
--> +------------+-----------------+
You might also like to calculate some more interesting aggregates, such as how many kilometers did each vehicle drive on weekdays, and how many on weekends. The query to do so is this:
SELECT vid,
sum(IF(isodow(t[0]) between 1 and 5, len, 0)) / 1e3 AS weekday_km,
sum(IF(isodow(t[0]) between 6 and 7, len, 0)) / 1e3 AS weekend_km
FROM st.trip
WHERE t[0] between epoch '2019-01-01' and epoch '2019-02-01'
GROUP BY vid;
And the results could be similar to the following:
--> +------------+------------------+------------------+
--> | vid | weekday_km | weekend_km |
--> +------------+------------------+------------------+
--> | ... | ... | ... |
--> | 7070 | 3151.215 | 47.446 |
--> | 8072 | 2461.197 | 149.536 |
--> | 6781 | 1859.853 | 68.814 |
--> | 7999 | 8.627 | 0.0 |
--> | 6833 | 1023.189 | 117.966 |
--> | 7514 | 0.291 | 0.0 |
--> | ... | ... | ... |
--> +------------+------------------+------------------+
You can also make some interesting analyses about an individual vehicle, for example a breakdown of mileage during a single day. The (naive) query to do so would be:
SELECT hour(t[0]) h, sum(st_length(x[0], y[0], x[1], y[1])) / 1e3 AS mileage_km
FROM st.segment
WHERE vid = 3652 and t[0] between epoch '2019-10-15' and epoch '2019-10-16'
GROUP BY 1
ORDER BY 1;
--> +------------------+------------------+
--> | h | mileage_km |
--> +------------------+------------------+
--> | 3 | 56.4771007954673 |
--> | 4 | 30.7708292415218 |
--> | 5 | 80.1512585243729 |
--> | 6 | 42.7871458744136 |
--> | 7 | 29.1702128185770 |
--> | 9 | 33.6257630266579 |
--> | 10 | 12.1536966921203 |
--> | 11 | 76.5248896784146 |
--> | 12 | 57.7569693565398 |
--> +------------------+------------------+
As you probably noticed, the problem with this query is that if a vehicle didn't drive during a certain hour of the day, there will be no entry in the result for that hour. To fix this, we can introduce a left join with a constant-value table representing all the hours of a day:
SELECT hrs.h, mileage_km ?? 0
FROM (
VALUES 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23
) AS hrs(h) LEFT JOIN (
SELECT hour(t[0]) h, sum(st_length(x[0], y[0], x[1], y[1])) / 1e3 AS mileage_km
FROM st.segment
WHERE vid = 3652 and t[0] between epoch '2019-10-15' and epoch '2019-10-16'
GROUP BY 1
) s ON s.h = hrs.h
ORDER BY 1;
--> +------------------+------------------+
--> | h | mileage_km |
--> +------------------+------------------+
--> | 0 | 0.0 |
--> | 1 | 0.0 |
--> | 2 | 0.0 |
--> | 3 | 56.4771007954673 |
--> | 4 | 30.7708292415218 |
--> | 5 | 80.1512585243729 |
--> | 6 | 42.7871458744136 |
--> | 7 | 29.1702128185770 |
--> | 8 | 0.0 |
--> | 9 | 33.6257630266579 |
--> | 10 | 12.1536966921203 |
--> | 11 | 76.5248896784146 |
--> | 12 | 57.7569693565398 |
--> | 13 | 0.0 |
--> | 14 | 0.0 |
--> | 15 | 0.0 |
--> | 16 | 0.0 |
--> | 17 | 0.0 |
--> | 18 | 0.0 |
--> | 19 | 0.0 |
--> | 20 | 0.0 |
--> | 21 | 0.0 |
--> | 22 | 0.0 |
--> | 23 | 0.0 |
--> +------------------+------------------+
There are several other ways this query can be improved. First, the VALUES
part of the query can be extracted in a WITH
clause, like this:
WITH hrs(h) AS (
VALUES 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23
) SELECT hrs.h, mileage_km ?? 0
FROM hrs LEFT JOIN (
SELECT hour(t[0]) h, sum(st_length(x[0], y[0], x[1], y[1])) / 1e3 AS mileage_km
FROM st.segment
WHERE vid = 3652 and t[0] between epoch '2019-10-15' and epoch '2019-10-16'
GROUP BY 1
) s ON s.h = hrs.h
ORDER BY 1;
The same can be done with the other subquery:
WITH hrs(h) AS (
VALUES 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23
), seg_stat(h, mileage_km) AS (
SELECT hour(t[0]) h, sum(st_length(x[0], y[0], x[1], y[1])) / 1e3 AS mileage_km
FROM st.segment
WHERE vid = 3652 and t[0] between epoch '2019-10-15' and epoch '2019-10-16'
GROUP BY 1
)
SELECT hrs.h, mileage_km ?? 0
FROM hrs
LEFT JOIN seg_stats ON hrs.h = seg_stats.h
ORDER BY 1;
And finally, the VALUES
list can be replaced with a call to the CONST.SEQUENCE table-valued function:
WITH hrs(h) AS (SELECT * FROM const.sequence(24)),
seg_stat(h, mileage_km) AS (
SELECT hour(t[0]) h, sum(st_length(x[0], y[0], x[1], y[1])) / 1e3 AS mileage_km
FROM st.segment
WHERE vid = 3652 and t[0] between epoch '2019-10-15' and epoch '2019-10-16'
GROUP BY 1
)
SELECT hrs.h, mileage_km ?? 0
FROM hrs
LEFT JOIN seg_stats ON hrs.h = seg_stats.h
ORDER BY 1;