Advanced SQL
Queries for the EMF
Susan McCusker, MARAMA
2
Online EMF Resources: SQL
EMF User’s Guide
https://www.cmascenter.org/emf/internal/guide.html EMF SQL Reference Guide
https://www.cmascenter.org/emf/internal/sql_basics.
3
Roadmap
Previously: SQL snippets, query components,
syntax
Today:
Review – Anatomy of a SQL SELECT query and EMF-specific syntax
New – Query enhancements: data from more than one table – join clause and its variations
Goal: Build a query to summarize CAP emissions for a state by SCC with SCC descriptions
4
Where is SQL used in the EMF?
Filtering
Use where clause snippets to filter records; for example
poll = 'CO' ann_value > 300
Row Filter when viewing and editing raw data Data Value Filter in Advanced Dataset Search Row Filter when exporting datasets and QA step
results
5
Where is SQL used in the EMF?
SQL-based QA steps
Write your own selectqueries to create reports and summaries of datasets
Sum or average emissions by region, SCC, plant, etc.
Include reference information like county names, lat/lon coordinates, or pollutant description
SQL in the EMF
6
Basic SQL Query Components
Which table(s) should we get the data from? Do we want to select all the data in the table
or only certain rows or columns?
Example: select certain rows to get one state Example: select certain columns if don't need all
the data from the original inventory
How should the output be grouped and sorted? Example: by SCC? by FIPS? by state? by
pollutant?
7
SQL Syntax
Basic query to extract/aggregate data from a single dataset table:
select columns (i.e., data fields) from table ($TABLE[1] e in EMF)
where filtering criteria used to pick rows
group by columns to aggregate over (e.g., sum) order by columns to use to sort results
SQL Review
8
SQL Syntax
Query to extract/aggregate data from multiple dataset tables:
select columns (i.e., datafields) from table ($TABLE[1] e in EMF) join table on matching criteria
where filtering criteria used to pick rows
group by columns to aggregate over (e.g., sum) order by columns to use to sort results
9
EMF-Specific Syntax
SQL database stores data in tables
Table names must be unique so the EMF uses
the dataset name and random ID to name tables
Dataset name =
nonpt_2011NEIv2_NONPOINT_20141108_11nov20 14_v1.csv
Underlying data table name =
emissions.ds_nonpt_2011NEIv2_NONPOINT_2014 1108_11nov2014_v1_csv_214226751
EMF-Specific Syntax
Basic query
Choose:
FIPS, scc, pollutant, annual emissions for NY (36)
selectregion_cd, scc, poll, ann_value from
Emissions.DS_nonpt_2011NEIv2_NONPOINT_20141108_11nov2014_v1_csv_ 214226751
whereregion_cd like '36%'
# records: 83,718
10
Basic query
Choose:
FIPS, scc, pollutant, annual emissions for NY (36)
selectregion_cd, scc, poll, ann_value from
Emissions.DS_nonpt_2011NEIv2_NONPOINT_20141108_11nov2014_v1_csv_ 214226751
whereregion_cd like '36%'
# records: 83,718
11
SQL Review
Poll #1
The highlighted part of this where clause can be used directly as a row filter when viewing or exporting data: where region_cd like '36%' • TRUE
• FALSE
13
EMF-Specific Syntax
Instead of directly using the table name in the
from statement, use the special syntax $TABLE[1] e
e is a single character table alias
Can use the alias throughout the query instead of the table name
EMF-Specific Syntax
Make it generic with SQL-specific
syntax
selectregion_cd, scc, poll, ann_value
from$TABLE[1] e
whereregion_cd like '36%'
# records: 83,718
14
15
EMF-Specific Syntax
$DATASET_TABLE["dataset name", 1] a Refers to a different dataset
Uses the default version of the data
Ex:$DATASET_TABLE["nonpt_2011NEIv2_NONPOINT
_20141108_11nov2014_v1.csv", 1] a
Other EMF-specific options let you refer to
specific versions of datasets or output of QA steps (covered in reference guide)
EMF-Specific Syntax
Only interested in CAPs, not HAPs
FIPS, SCC, pollutant, annual CAP emissions for NY (state FIPS = 36)
selecte.region_cd, e.scc, e.poll, e.ann_value from$TABLE[1] e
wheree.region_cd like '36%' and substring(e.poll,1,1) not in ('1', '2', '3',
'4', '5', '6', '7', '8', '9')
# records: 20,187
16
Poll #2
We want to exclude HAPS & type the condition e.poll not in ('1', '2', '3', '4', '5', '6', '7', '8', '9'). a. This condition excludes HAPs
b. This condition results in an error
c. This condition does not exclude HAPs and does not result in an error
17
Summarize by FIPS, SCC, pollutant
Use the aggregate function sum for ann_value
select e.region_cd, e.scc, e.poll, sum(e.ann_value)
from $TABLE[1] e
where e.region_cd like '36%' and substring(e.poll,1,1)
not in ('1', '2', '3', '4', '5', '6', '7', '8', '9')
Error: need group by clause for select sum
18
Summarize by FIPS, SCC, pollutant
Use the aggregate function sum for ann_value
selecte.region_cd, e.scc, e.poll, sum(e.ann_value) from$TABLE[1] e
wheree.region_cd like '36%' and substring(e.poll,1,1) not in ('1', '2', '3', '4', '5',
'6', '7', '8', '9')
group by region_cd, scc, poll
# records: 20,187
19
SQL Review
Summarize by State,SCC, pollutant
Group together by 1st two digits of FIPS code:
selectsubstring(e.region_cd,1,2), e.scc, e.poll, sum(e.ann_value) from$TABLE[1] e
wheree.region_cd like '36%' and substring(e.poll,1,1) not in ('1', '2', '3', '4', '5',
'6', '7', '8', '9')
group by substring(e.region_cd,1,2), e.scc, e.poll
# records: 340
20
21
Multiple Tables
Instead of extracting or aggregating data from
just one table, we can use join to combine data from multiple tables
Example reference table: scc
Query Enhancements: JOIN scc sector scc_description
2401001000 Nonpoint "Solvent Utilization;Surface Coating;Architectural Coatings;Total: All Solvent Types"
2610000500 Nonpoint "Waste Disposal, Treatment, and Recovery;Open Burning;All Categories;Land Clearing Debris (use 28-10-005-000 for Logging Debris Burning)"
Poll #3
The EMF-specific syntax $TABLE[1] e:
a. Refers to the data table for the dataset to which the QA step is attached
b. Is generic, i.e., it can be copied to QA steps for other datasets of the same type/with the same columns c. Assigns a single-character table alias "e" that can be
used in referring to columns from the dataset d. All of the above
23
JOIN Syntax
For a reference table: from $TABLE[1] e
left join reference.scc on e.scc = scc.scc
Query Enhancements: JOIN
After JOIN keyword is the name of the table to join
For a dataset table: from $TABLE[1] e left join
$DATASET_TABLE["dataset name", 1] a
Refers to a different dataset Uses the default version of the data
Less basic query – add a join
select … scc.scc_description … from $TABLE[1] e
left join reference.scc on e.scc = scc.scc
where
group by … scc.scc_description …
24
Summarize by State, SCC, pollutant –
with SCC descriptions
selectsubstring(region_cd,1,2) as FIPS_State, scc,
scc.scc_description, poll, sum(ann_value)
from$TABLE[1] e
left join reference.scc on e.scc = scc.scc
whereregion_cd like '36%' and substring(poll,1,1) not in ('1', '2',
'3', '4', '5', '6', '7', '8', '9')
group by substring(region_cd,1,2), scc,
scc.scc_description,poll
Failed to run…"scc" is ambiguous
25
Query Enhancements: JOIN
Summarize by State, SCC, pollutant –
with SCC descriptions
selectsubstring(e.region_cd,1,2) as FIPS_State, e.scc, scc.scc_description,
e.poll, sum(e.ann_value)
from$TABLE[1] e
left join reference.scc on e.scc = scc.scc
wheree.region_cd like '36%' and substring(e.poll,1,1) not in ('1', '2', '3', '4', '5',
'6', '7', '8', '9')
group by substring(e.region_cd,1,2), e.scc, scc.scc_description, e.poll
26
EMF Reference Table Examples
https://www.cmascenter.org/emf/inte rnal/sql_basics.html
27
EMF Reference Tables - Example
Summarize by State, SCC, pollutant –
with SCC level descriptions
selectsubstring(e.region_cd,1,2) as FIPS_State, e.scc, scc_codes.scc_l1,
e.poll, sum(e.ann_value)
from$TABLE[1] e
left join reference.scc_codes on e.scc = scc_codes.scc
wheree.region_cd like '36%' and substring(e.poll,1,1) not in ('1', '2', '3', '4', '5',
'6', '7', '8', '9')
group by substring(e.region_cd,1,2), e.scc, scc_codes.scc_l1, e.poll
28
29
JOIN Syntax: multiple
criteria,multiple joins
... from $TABLE e left join reference.fips on e.county = fips.county and e.state = fips.st left joinreference.scc on e.scc = e.scc
on clause defines
how the two tables relate to each other
county and state names must match Can have
Different column names
Multiple joins
Query Enhancements: JOIN
30
JOIN Options
31
LEFT JOIN Semantics
To include all the records from
nonpt_2011NEIv2_NONPOINT_20141108_11nov2014_v1.csv
whether or not there’s a matching record in
reference.scc, we use a left join instead of just join
left join is also called left outer join
Query Enhancements: JOIN
32
LEFT JOIN Oil & Gas Example
33
LEFT JOIN Details
Table order is important when using left
join ...
from $TABLE[1] e
left join reference.scc ...
All the records in the “left” table ($TABLE[1]) are
returned in the output
What if the order is reversed?
Query Enhancements: JOIN
Poll #4
What results would we expect from reversing the order of the left join in the previous query, i.e., from reference.scc
left join $TABLE[1] e
a. Only records common to both tables are returned b. All records in either table are returned
c. All records in the reference.scc table are returned d. All records in the $TABLE[1] e are returned
35
Dealing with NULL Values
NULL in SQL is a state (unknown) and not a
value: data value does not exist in the database
Used when data is unknown Unknown ≠ "value of zero"
NULL values can cause unexpected output
when combined with other values
'Fish' || NULL || 'Chips' = NULL
NULL /0 = NULL
NULL Values
What happens if there's no match in the
other table?
Coalesce function
select e.scc, coalesce(scc.scc_description, 'An unspecified description') as scc_description,
e.poll, coalesce(sum(e.ann_value),0) as ann_emis from $TABLE[1] e left join reference.scc on
scc.scc = e.scc
group by e.scc, scc.scc_description, e.poll order by e.scc
36
37
COALESCE Function
coalesce(value1, value2, ...) function
returns the first value in the list that is not null
Replace select scc.scc_descriptionwith select coalesce(scc.scc_description, 'An unspecified description')
If there is no description for the SCC (i.e., scc.scc_description is NULL), 'An upspecified description' is returned.
NULL Values
38
Putting It All Together
selecte.scc,
coalesce(s.scc_description, 'Unspecified description') as scc_description, e.poll,
coalesce(p.descrptn, 'Unspecified description') as
pollutant_code_desc,
coalesce(sum(ann_value), 0) as ann_emis from $TABLE[1] e
left join reference.invtable p on p.cas = e.poll left join reference.scc s on e.scc = s.scc
group by e.scc, e.poll, p.descrptn, s.scc_description,
p.name
39
-- From EMF main window:
Manage Datasets
-- Show Datasets of Type:
Select one from drop-down menu: ex, nonpt_2011NEIv2_NONPOINT_20141108_11nov2014_v1.csv *Use or search button or to narrow down results displayed*
-- Check box to select: ex, nonpt_2011NEIv2_NONPOINT_20141108_11nov2014_v1.csv
-- Click Edit Propertiesat bottom of the screen -- Click QAtab at top
-- Click Add Customat bottom of screen
-- Enter Name (ex, “test”, “temp”, or something descriptive for future reminder) -- In Program drop-down box choose SQL
-- In Arguments box type in or cut & paste SQL query -- Click OK
-- Back in Dataset Properties Editor, check the box next to the QA Step that you added and click Editat
bottom of screen
-- In Edit QA Step window, check box for Download result file to local machine?if want to export results
-- Click Run
-- Click View Resultsto see results in EMF
* If there are multiple files in a dataset type, make sure to keep track of which file your query is applied to. Query results are for that particular file only.
* Dumping run logs in status windows to trashcan before run starts will reduce clustering and make the log easier to read
Exporting to Google Earth
Option to export data in
KMZ format used by Google Earth
For QA step results that
include latitude and longitude coordinates
Export from View QA Step
Results window
40
Exporting to Google Earth
41
QA Step Results: Mapping
Formatting Note: Quotes
42
"smart quotes" cause a syntax error in the EMF