Your first JOIN queries - SpatiaLite Cookbook

2011 January 28

You already know the basic foundations about simple SQL queries.

Any previous example encountered since now simply queried a single table:

anyway SQL has no imposed limits, so you can query an arbitrary number of tables at the same time. But in order to do this you must understand how to correctly handle a JOIN.

SELECT *

FROM counties, regions;

county_id county_name car_plate_code region_id region_id region_name

1 TORINO TO 1 1 PIEMONTE

1 TORINO TO 1 2 VALLE D'AOSTA

1 TORINO TO 1 3 LOMBARDIA

1 TORINO TO 1 4 TRENTINO-ALTO ADIGE

1 TORINO TO 1 5 VENETO

... ... ... ... ... ...

Apparently this query immediately works;

but once you get a quick glance at the result-set you'll immediately discover something really puzzling: an unexpected huge number of rows has been returned.

and each single County seems to be related with any possible Region.

Every time SQL queries two different tables at the same time, the Cartesian Product of both datasets is calculated.

i.e. each row coming from the first dataset is JOINed with any possible row coming from the second dataset.

This one is a blind combinatorial process, so it very difficultly can produce useful results.

And this process can easily generate a really huge result-set: this must absolutely be avoided, because: a very long (very, very long) time may be required to complete the operation.

you can easily exhaust operating system resources before completion.

All this said, it's quite obvious that some appropriate JOIN condition has to be set in order to maintain under control the Cartesian Product, so to actually return only meaningful rows.

SELECT *

FROM counties, regions

WHERE counties.region_id = regions.region_id;

This query is exactly the same of the previous one: but this time we introduced an appropriate JOIN condition. Some points to be noted:

we must use fully qualified names to avoid any possible ambiguity:

e.g. counties.region_id identifies the region_id column belonging to the counties table, in an absolutely

unambiguous way.

defining the WHERE counties.region_id = regions.region_id clause we impose an appropriate JOIN

condition.

After this the Cartesian Product will be accordingly filtered, so to insert into the result-set only the rows actually satisfying the imposed condition, ignoring any other.

SELECT c.county_id AS county_id, c.county_name AS county_name, c.car_plate_code AS car_plate_code, r.region_id AS region_id, r.region_name AS region_name FROM counties AS c, regions AS r

WHERE c.region_id = r.region_id;

county_id county_name car_plate_code region_id region_name

1 TORINO TO 1 PIEMONTE 2 VERCELLI VC 1 PIEMONTE 3 NOVARA NO 1 PIEMONTE 4 CUNEO CN 1 PIEMONTE 5 ASTI AT 1 PIEMONTE 6 ALESSANDRIA AL 1 PIEMONTE ... ... ... ... ...

And this one always is the same as above, simply written adopting a most polite syntax:

using extensively the AS clause so to define alias names for both columns and tables make JOIN queries to be

much more concise and readable, and easiest to understand.

SELECT lc.lc_id AS lc_id, lc.lc_name AS lc_name, lc.population AS population, c.county_id AS county_id, c.county_name AS county_name, c.car_plate_code AS car_plate_code, r.region_id AS region_id, r.region_name AS region_name FROM local_councils AS lc, counties AS c, regions AS r

WHERE lc.county_id = c.county_id AND c.region_id = r.region_id;

lc_id lc_name population county_id county_name car_plate_code region_id region_name

1001 AGLIE' 2574 1 TORINO TO 1 PIEMONTE

1002 AIRASCA 3554 1 TORINO TO 1 PIEMONTE

1003 ALA DI STURA 479 1 TORINO TO 1 PIEMONTE

... ... ... ... ... ... ... ...

you simply have to apply any required JOIN condition as appropriate.

Performance considerations

Executing complex queries involving many different tables may easily run in a very slow and sluggish mode. This will most easily noticed when such tables contain a huge number of rows.

Explaining all this isn't at all difficult: in order to calculate the Cartesian Product the SQL engine has to access many and many times each table involved in the query.

The basic behavior is the one to perform a full table scan each time: and obviously scanning a long table many and many times requires a long time.

So the main key-point in order optimize your queries is the one to avoid using full table scans as much as possible.

All this is fully supported, and it's easy to be implemented.

Each time the SQL-planner (an internal component of the SQL-engine) detects that an appropriate INDEX is

available, there is no need at all to perform full table scans, because each single row can now be immediately accessed using this Index.

And this one will obviously be a much faster process.

Any column (or group of columns) frequently used in JOIN clauses is a good candidate for a corresponding INDEX.

Anyway, creating an Index implies several negative consequences:

the storage allocation required by the DB-file will increase (sometimes will dramatically increase). performing INSERT, UPDATE and/or DELETE ops will require a longer time, because the Index has to be

accordingly updated.

And this obviously imposes a further overhead.

So (not surprisingly) it's a trade-off process: you must evaluate carefully when an INDEX is absolutely required,

and attempt to reach a well balanced mix.

i.e a compromise between contrasting requirements, under various conditions and in different users-cases. In other words there is no absolute rule: you must find your optimal case-by-case solution performing several practical tests, until you get the optimal solution fulfilling your requirements.

In document SpatiaLite Cookbook (Page 47-50)