Target. Hunt. Disrupt.
SQRRL
PRODUCT TRAINING
Querying sources
• Sqrrl Query Language (SqrrlQL) is a query language designed specifically to retrieve Sqrrl data
• SELECT queries are used to retrieve rows of data from one or more Sqrrl sources, models or from instances of a single Sqrrl entity
• Sqrrl also includes a SqrrlQL API for using SqrrlQL queries in custom development
SELECT query
• Allows you to retrieve rows of data from sources or models
• SQL-like language with extensions to handle hierarchical data and Lucene-style searches
•
You can issue SELECT queries from the
• Sqrrl Enterprise web application
• Sqrrl shell using the select command • SqrrlClient API
• SELECT query results are always in the form of a table unless you explicitly request the results in JSON format
Select overview
• SELECT selectExpressionList
• selectExpressionList Identifies the data that you want to return
• Can request values from a single column or contain more complex expressions including aggregated expressions
• FROM dataList
• Identifies the location to return the data from (1) list of sources (2) list of models (3) single model entity
• Is optional if you are already in the context of a specific source, model or entity SELECTselectExpressionList FROMdataList [ WHERE whereExpression ] [ GROUP BY
groupingExpressionList ] [ ORDER BY orderExpression ] [ LIMIT limitExpression ] [OFFSET offsetExpression]
Select overview
• WHERE whereExpression (optional) • Filters the returned data
• GROUP BY groupingExpressionList (optional)
• If the SELECT clause includes aggregated expressions then the GROUP BY clause determines how to group the returned results
SELECTselectExpressionList FROM dataList [ WHERE whereExpression ] [ GROUP BY groupingExpressionList ] [ ORDER BY orderExpression ] [ LIMIT limitExpression ] [OFFSET offsetExpression]
Select overview
• ORDER BY orderExpression (optional) • Controls the order of the returned results
• LIMIT limitExpression (optional)
• Limits the total number of results to return from the query
• OFFSET offsetExpression (optional)
• Indicates where to start the matching results by skipping the specified number of rows from the beginning of the original results
• Can only be used when there is an ORDER BY clause
SELECT selectExpressionList FROMdataList [ WHERE whereExpression ] [ GROUP BY groupingExpressionList ] [ ORDER BY orderExpression ] [ LIMITlimitExpression ] [OFFSET offsetExpression]
Select clause
• Identifies the data that you want to return and consists of a list of query expressions
• Each expression in the list becomes a column in the SELECT query results
• SELECT clause Expressions support requests to • Return basic source field or entity feature values • Specify visibility labels to include
• Aggregate values using grouping values provided in the GROUP BY clause • Use mathematical operations to adjust or combine values
Select clause
• SqrrlQL keywords can be all uppercase like 'SELECT' or all lowercase like 'select' but not mixed-case
# returns everything from the source CounterOps SELECT json() FROM CounterOps
select json() from CounterOps # not allowed
X Select json() from CounterOps
• FROM clause is optional if you first execute dataset CounterOps in the shell prior to running the query
• SELECT query results are always in the form of a table unless you request the results in JSON format
SELECT … LIMIT
• Determines the total number of SELECT query results to return
# only fetch 5 results
SELECT json() FROM CounterOps LIMIT 5 SELECT JSON() FROM CounterOps LIMIT 5 SELECT json() FROM Sqrrl_Netflow LIMIT 5
Selecting Specific Fields
# select only specific fields
SELECT hostname FROM CounterOps LIMIT 5
SELECT hostname, ipAddress FROM CounterOps LIMIT 5
SELECT path('hostname'), path(ipAddress) FROM CounterOps LIMIT 5 • Any field reference in a query expression that is not present in a
matching record is interpreted as a null value
# 'foo' is not a field in hostname so it will be null for each record
Selecting Array Elements
SELECT hostname, path('secondaryIds[0]') FROM CounterOps LIMIT 5 SELECT hostname, path('secondaryIds[1]') FROM CounterOps LIMIT 5 SELECT json('secondaryIds') FROM CounterOps LIMIT 5
Selecting Hierarchical Data
• Hierarchical fields are delimited with forward slashes and must be wrapped in the path() function to avoid SQL parsing errors
Shell Options for SELECT Queries
• -np,--no-paginationSELECT hostname, ipAddress FROM CounterOps–np • --batch-size (default is 1000)
SELECT hostname, ipAddress FROM CounterOps-np --batch-size 10
• --format=table (default)
SELECT hostname, ipAddress FROM CounterOps LIMIT 10 --format=table
• --format=simple
SELECT hostname, ipAddress FROM CounterOps LIMIT 10 --format=simple
• --format=csv [--delimiter <delimiter>] (default is tab-delimited)
SELECT hostname, ipAddress FROM CounterOps LIMIT 10 --format=csv
SELECT hostname, ipAddress FROM CounterOps LIMIT 10 --format=csv --delimiter ,
• --format=null
Running Queries in the Background
• SELECT and NEXT commands can take an optional trailing & which indicates to run the search in the background
• To send a running search to the background press the z key
• To monitor the progress of a background query use the
WATCHPROGRESS, CHECKPROGRESS, and PEEK commands • To rejoin the query use the NEXT command
• SELECT and NEXT commands also take an optional -w or --watch-progress argument
Running Queries in the Background
• Once the batch is returned to the shell a message is printed on the console
• You can use NEXT without the “&” to print the background results to the screen
• Typing NEXT without the “&” rejoins the search and blocks until the current batch is returned
• While waiting for search results
• Type "z” to return to the prompt while the query continues in the background • Type "c” to cancel the query
Select Using as
• SELECT query supports using AS to alias a SELECT clause expression
• After you define an alias you can use that alias in subsequent expressions in the query. (You cannot use an alias in a SELECT query WHERE clause)
• Also used as the column heading in the query results
• Aliases are often used for grouped or ordered queries where you may need to refer to a SELECT expression in the GROUP BY or ORDER BY clause
Aliases
• AS keyword is optional
SELECT hostname AS Name FROM CounterOps LIMIT 5 SELECT hostname Name FROM CounterOps LIMIT 5
© 2019, Amazon Web Services, Inc. or its Affiliates.
FROM clause
• Identifies the location of the SELECT query data
• Comma-separated list of source name (FROM source1,source2,source3)
• Sqrrl looks for matching data in the source records
• Comma-separated list of model names (FROM model1,model2,model3)
• Sqrrl looks for matching data in the entity instances for those models
• A single entity in the format modelName.entityName (FROM model1.entity)
Where clause
• Allows you to filter the returned query results based on expression values
• Supports the full SQL query language except for aggregate expressions
• You cannot use an alias from the SELECT clause in the WHERE clause
• Supports regular expressions to check for specific content
• When you specify sources or models in the FROM clause but don't provide a WHERE clause the results list includes rows from *ALL* of the sources and *ALL* all of the entities from the model
Comparison Operators (=, <, >, <=, >=, <>)
• All comparison operators with the exception of <> can take advantage of indexes as follows
• Numeric, DATETIME, and INET_ADDR queries can exploit VALUE_BY_FIELD indexes
• STRING queries perform best with the VALUE_BY_FIELD index but fall back to the TERMS_BY_FIELD or TERMS index if either is available
• Matching aggregate field values can exploit AGGREGATE indexes
• The <> operator can never utilize indexes and requires a full table scan which can be SLOW !
Comparison Operators: =, <, >, <=, >=, <>
SELECT hostname FROM CounterOps WHERE hostname = 'cam' SELECT hostname FROM CounterOps WHERE hostname < 'cam' SELECT hostname FROM CounterOps WHERE hostname >= 'cam' SELECT hostname FROM CounterOps WHERE hostname <> 'cam'
SELECT SourceAddress Src, DestinationAddress Dst, Protoco FROM Netflow WHERE Protocol = 'ICMP'
SELECT SourceAddress Src, DestinationAddress Dst, Protocol FROM Netflow WHERE Protocol <> 'ICMP'
Where Like clause
• Use LIKE operator to compare values to a specified pattern
• If you just provide a value without any pattern wildcards the string must match exactly
Pattern Description
% Matches zero or more characters.
For example, "b%ad" would match "bad", "brad", and "broad”. _ Matches any single character.
For example, "b_ad" would match "brad" or "bead". It would not match "bad" or "broad”. […] Matches any character in the brackets. Also supports inclusive ranges of values ( [a-z]).
For example, "b[a-f]ad" would match "bead", but not "brad”
[^…] Matches any character or range of characters other than the ones specified. For example, "b[^a-f]d would match "bird", but not "bead”
WHERE … LIKE
SELECT hostname FROM CounterOps WHERE hostname LIKE 'cam' SELECT hostname FROM CounterOps WHERE hostname LIKE 'cam_' SELECT hostname FROM CounterOps WHERE hostname LIKE 'cam[23]_' SELECT hostname FROM CounterOps WHERE hostname LIKE 'cam[^23]_'
# range search - lower bound is inclusive, upper bound is exclusive
SELECT hostname FROM CounterOps WHERE hostname LIKE '[b-c]%'
# these queries always require a table scan
SELECT hostname FROM CounterOpsWHERE hostname LIKE '%cam'
SELECT hostname FROM CounterOpsWHERE hostname LIKE 'cam%'
WHERE … IN
• Use IN operator to check whether a value matches one of a list of expression values
<value> IN (<match1>, <match2>, ...)
SELECT hostname FROM CounterOps WHERE hostname IN ('athena','cam') SELECT SourceAddress Src, DestinationAddress Dst, Protocol,
WHERE … BETWEEN
• BETWEEN operator checks whether a value falls within an inclusive range
<value> BETWEEN <lowerBound> AND <upperBound>
• Range search: lower bound is inclusive, upper bound is exclusive
• String queries can exploit VALUE_BY_FIELD indexes.
• Numeric queries can exploit VALUE_BY_FIELD or AGGREGATE indexes
SELECT hostname FROM CounterOps WHERE hostname BETWEEN 'b' AND 'd' SELECT hostname, path('secondaryIds[0]') AS id0,
path('secondaryIds[1]') AS id1 FROM CounterOps WHERE path('secondaryIds[1]') BETWEEN 'b' AND 'd'
WHERE … NOT
SELECT hostname FROM CounterOps WHERE hostname NOT LIKE 'a%' SELECT hostname FROM CounterOps WHERE hostname NOT IN
('athena','cam')
SELECT hostname FROM CounterOps WHERE hostname NOT BETWEEN 'b' AND
'd'
• Use the NOT operator to search for records that do not match a given expression
• NOT operator cannot use indexes as a result some queries may be very SLOW !
WHERE … IS [NOT] NULL
# Note that the training sources do not contain any null fields
SELECT hostname FROM CounterOps WHERE hostname IS NULL
SELECT hostname FROM CounterOps WHERE hostname IS NOT NULL
• Use the NULL keyword to find NULL values for an expression
• Can NEVER utilize indexes requiring a full table scan which can be very SLOW !
WHERE … AND | OR
• WHERE clause can include multiple expressions using AND or OR • Expressions combined using AND must match all WHERE expressions • Expressions combined using OR must match at least one of WHERE
expressions
WHERE … AND | OR
SELECT SourceAddress Src, DestinationAddress Dst, Protocol, DestinationPort DstPort FROM Netflow WHERE Protocol='TCP' AND DestinationPort=88
SELECT SourceAddress Src, DestinationAddress Dst, Protocol, DestinationPort DstPort FROM Netflow WHERE Protocol IN('TCP','UDP') AND DestinationPort=88 SELECT InBytes, InPackets, SourceAddress Src, DestinationAddress Dst,
Protocol, DestinationPort DstPort FROM Netflow WHERE InBytes>100000000 AND InPackets>100000
SELECT InBytes, InPackets, SourceAddress Src, DestinationAddress Dst,
Protocol, DestinationPort DstPort FROM Netflow WHERE InBytes > 100000000 OR InPackets>100000
SELECT InBytes, InPackets, SourceAddress Src, DestinationAddress Dst,
Protocol, DestinationPort DstPort FROM Netflow WHERE (InBytes>100000000 OR InPackets>100000) AND DestinationPort=139
lucene
• The lucene() function in WHERE clauses supports the full Lucene 4.0 syntax except for fuzzy searches and the NOT operator
• Lucene() function DOES SUPPORT
• Fielded or unfielded queries Leading or trailing wildcards
• AND, OR String and numeric range searches
• Phrase searches Full regular expression searches
• Lucene() function DOES NOT SUPPORT
• Negations ('NOT','-') • Fuzzy searches ('~') • Proximity searches ('“”~')
WHERE Lucene()
• Use cases Lucene queries are well-suited for
• Tokenization of individual field data into more granular Terms that can be indexed & queried including
– email headers, senders, recipients, subjects, & message bodies – structured or unstructured textual content in general
• Full-Text searches that can match either terms or phrases anywhere in a record regardless of which field contains them
• Fielded searches that can match either terms or phrases anywhere in a particular field
• Regular Expression-based queries
• In all other cases prefer '=' or 'LIKE' instead of Lucene queries for better performance!
WHERE Lucene()
• A Lucene search pattern consists of terms and operators
• There are two types of Lucene terms
• A single term is a single word such as 'test' or 'hello'.
• A phrase is a group of words surrounded by double quotes such as "hello dolly”
Lucene Tokenization & Term Indexing
• The Lucene Standard Tokenizer tokenizes field values to more granular terms as follows
• Words are split at punctuation characters removing punctuation
• A dot that is not followed by whitespace is considered part of a token
• Email addresses and internet host & domain names are recognized as as single tokens
• A TERMS_BY_FIELD index indexes individual terms per field
• A TERMS index indexes individual terms across multiple document fields
Unfielded Lucene Queries
• An unfielded Lucene query string consists of a search pattern to match against without restricting the search to any particular document field(s), thereby providing Full-Text Search capability
lucene('pattern') - True if any field(s) in a record match the search pattern otherwise false
Fielded Lucene Queries
• A fielded Lucene query string additionally specifies one or more document fields which restricts the search to those fields
lucene('fieldname:pattern') - True if any records contain a field whose
name matches the specified fieldname (which may contain wildcards) and whose data matches the search pattern otherwise false
Lucene Phrase Searches
• Specifies a literal quoted phrase as the search pattern
• Search evaluates to true if the exact phrase is found anywhere in a document field irrespective of term tokenization boundaries
• Can exploit TERMS_BY_FIELD or TERMS indexes
• Phrase search patterns are always treated as literal values
• Unlike term searches they cannot take advantage of search pattern modifiers such as ranges, wildcards, and regular expressions
Lucene Phrase Searches
# search for the phrase "cam" in any field
SELECT hostname FROM CounterOps WHERE lucene('"cam"') # search for the phrase "192.168.1.108" in any field
SELECT SourceAddress Src, DestinationAddress Dst FROM Netflow WHERE lucene('"192.168.1.108"')
# search for the phrase "cam" in the CounterOps.hostname field SELECT hostname FROM CounterOps WHERE lucene('hostname:"cam"')
# search for the phrase "192.168.1.108" in the Netflow.SourceAddress field SELECT SourceAddress Src, DestinationAddress Dst FROM Netflow WHERE
Lucene Term Searches
• Lucene term search specifies an (unquoted) Lucene single term as the search pattern which must match a tokenized term in a document field for the search to evaluate to true
• Term searches can exploit TERMS_BY_FIELD or TERMS indexes
• Lucene term searches can also take advantage of search pattern modifiers such as ranges, wildcards, and regular
Lucene Term Searches
# search for the exact term 'cam' in any field
SELECT hostname FROM CounterOps WHERE lucene('cam')
# search for the exact term '192.168.1.108' in any field
SELECT SourceAddress Src, DestinationAddress Dst FROM Netflow WHERE
lucene('192.168.1.108')
# search for the exact term 'cam' in the CounterOps.hostname field
SELECT hostname FROM CounterOps WHERE lucene('hostname:cam')
# search for the exact term '192.168.1.108' in the Netflow.SourceAddress field
SELECT SourceAddress Src, DestinationAddress Dst FROM Netflow WHERE
Lucene Term Searches
# matches any hostname containing the hyphen-delimited term 'sqrrl' # does not match single term 'sqrrl_wk301'
SELECT hostname FROM CounterOps WHERE lucene('hostname:sqrrl')
# matches any hostname containing the hyphen-delimited term 'wk301' # does not match single term 'sqrrl_wk301'
SELECT hostname FROM CounterOps WHERE lucene('hostname:wk301')
# these queries return no rows because the search pattern matches no term in any hostname field
SELECT hostname FROM CounterOps WHERE lucene('hostname:qrrl-wk') SELECT hostname FROM CounterOps WHERE lucene('hostname:sqrrl-wk301')
# matches single term 'sqrrl_wk301'
Fields in the Lucene Query String
• Lucene Field Selectors & Visibility Labels
• Lucene queries do not require visibility labels with field names
• A specified field name matches all visible fields with the same base name in accordance with the user's visibility roles
• Hierarchical fields are specified in a flattened format of elements separated by forward slashes
ARRAYS in Lucene Field Selectors
# exact term search on secondaryIds[0]
SELECT json() FROM CounterOps WHERE lucene('secondaryIds[0]:cam')
# exact term search on secondaryIds[1]
SELECT json() FROM CounterOps WHERE lucene('secondaryIds[1]:"cam.sqrrl.com"')
# use path() to select a particular array element
SELECT path('secondaryIds[1]') FROM CounterOps WHERE lucene('secondaryIds[0]:cam')
# use json() to return the entire array
SELECT json('secondaryIds') FROM CounterOps WHERE lucene('secondaryIds[0]:cam')
ARRAYS in Lucene Field Selectors
• All array members are searched if either • the last element of a path is an array • the array index is '[]' or '[*]'
# the following equivalent queries all search all members of the array
SELECT json() FROM CounterOps WHERE lucene('secondaryIds:cam') SELECT json() FROM CounterOps WHERE lucene('secondaryIds[]:cam') SELECT json() FROM CounterOps WHERE lucene('secondaryIds[*]:cam')
Wildcards In Lucene Field Selectors
• The following wildcards can be used in Lucene field selectors
• One asterisk * matches any path element or portion of a path element but * does not span multiple levels of a hierarchical path
• Two asterisks ** match any number of intervening levels in a hierarchical path
SELECT hostname FROM CounterOps WHERE lucene('host*:cam') SELECT hostname FROM CounterOps WHERE lucene('host*ame:cam') SELECT hostname FROM CounterOps WHERE lucene('*ostname:cam') SELECT hostname FROM CounterOps WHERE lucene('host**:cam') SELECT hostname FROM CounterOps WHERE lucene('host**ame:cam') SELECT hostname FROM CounterOps WHERE lucene('**ostname:cam')
Searching Multiple Sources
• It can sometimes be useful to search for a term or phrase across multiple or even all Sqrrl sources
# queries all fields in multiple named sources:
SELECT dataset(), json() FROM CounterOps, Netflow WHERE lucene('192.168.1.108')
# you can optionally supply a field spec if the sources have similarly named
fields
SELECT dataset(), json() FROM CounterOps, Netflow WHERE lucene('*Address:192.168.1.108')
# no FROM clause - queries all fields in all visible datasets
# (including both sources & models)
Lucene Wildcards
• Single term search patterns (but not phrase patterns) can contain wildcards
• Trailing wildcard searches match against tokenized Terms in addition
to the entire field value
• They can exploit TERMS_BY_FIELD or TERMS indexes
# trailing wildcards search individual terms in addition to the entire field
SELECT hostname FROM CounterOps WHERE lucene('hostname:cam?')
# also returns 'sqrrl-cam002' & 'sqrrl-cam002-2', but not 'b1lobbycam'
SELECT hostname FROM CounterOps WHERE lucene('hostname:cam*')
# WHERE NOT with a lucene wildcard query
Lucene Wildcards
• Non-trailing wildcard searches match against the entire field value not against tokenized terms
• They require full table scans and therefore can be much slower than other search patterns
Lucene Wildcards
# non-trailing wildcards cannot exploit any indexes so require a full table scan) # non-trailing wildcards match against entire fields, not terms
SELECT hostname FROM CounterOps WHERE lucene('hostname:cam*1') SELECT hostname FROM CounterOps WHERE lucene('hostname:cam?1')
# these queries match both field values 'sqrrl_wk301' and 'sqrrl-wk301' SELECT hostname FROM CounterOps WHERE lucene('hostname:sqrrl?wk301') SELECT hostname FROM CounterOps WHERE lucene('hostname:sqrrl*wk301')
# leading wildcards are permitted
# like other non-trailing wildcards they match against entire fields, not terms SELECT hostname FROM CounterOps WHERE lucene('hostname:*qrrl-wk301')
Lucene String Range Searches
• String range searches match against tokenized terms in addition to the entire field value
• Can exploit TERMS_BY_FIELD or TERMS indexes.
• Range searches allow one to match documents whose field values are between the (inclusive) lower bound and the (exclusive) upper bound specified by the range search
• Both of these queries return 11 more rows than their 'LIKE' counterparts do which are of the form /[a-zA-Z]*\-[b-d].*/:
SELECT hostname FROM CounterOps WHERE lucene('hostname:[b TO d]*') SELECT path('secondaryIds[1]') id1 FROM CounterOps WHERE
Lucene Regular Expressions
• Match against the entire field value not against tokenized terms
• Require full table scans and therefore can be much slower than other search patterns
• Must be delimited with double slashes
SELECT hostname FROM CounterOps WHERE lucene('hostname://cam//') SELECT hostname FROM CounterOps WHERE lucene('hostname://^cam//') SELECT hostname FROM CounterOps WHERE lucene('hostname://cam$//') SELECT hostname FROM CounterOps WHERE lucene('hostname://^cam$//') SELECT hostname FROM CounterOps WHERE lucene('hostname://^cam.$//') SELECT hostname FROM CounterOps WHERE lucene('hostname://^cam\d+$//') SELECT hostname FROM CounterOps WHERE lucene('hostname://^cam.*1$//') SELECT hostname FROM CounterOps WHERE lucene('hostname://^[b-c]//') SELECT hostname FROM CounterOps WHERE lucene('hostname://^cam[23].$//') SELECT hostname FROM CounterOps WHERE lucene('hostname://^cam[^23].$//')
© 2019, Amazon Web Services, Inc. or its Affiliates.
Lucene Compound Queries
• Search pattern can combine single terms and phrases with the
Boolean operators AND / OR, or their respective counterparts && and ||
• OR is the default conjunction operator
• Links two terms and finds a matching document if either of the terms exist in a document - equivalent to a set union
• AND operator matches documents where both terms exist anywhere in the text of a single document - equivalent to a set intersection.
• Supports using parentheses to construct compound queries based on sub-queries
SELECT SourceAddress Src, DestinationAddress Dst, Protocol, DestinationPort DstPort FROM Netflow WHERE lucene('(Protocol:TCP OR Protocol:UDP) AND
Escaping Lucine Special Characters
• Lucene supports escaping the following special characters that are part of its query syntax
+ - && || ! ( ) { } [ ] ^ " ~ * ? : \ /
• To escape these characters precede them with “\” (to search for (1+1):2 use the query)
ORDER BY – Sorting Query Results
• Used to control the sort order for SELECT query results
• Can include columns or expressions that are not part of the SELECT clause
ORDER BY a Single Expression
SELECT SourceAddress Src, DestinationAddress Dst, Protocol FROM Netflow WHERE Protocol='ICMP' ORDER BY Src ASC
SELECT SourceAddress Src, DestinationAddress Dst, Protocol FROM Netflow WHERE Protocol='ICMP' ORDER BY Src
SELECT SourceAddress Src, DestinationAddress Dst, Protocol FROM Netflow WHERE Protocol='ICMP' ORDER BY Src DESC
SELECT SourceAddress Src, DestinationAddress Dst, Protocol FROM Netflow WHERE Protocol='ICMP' ORDER BY InBytes DESC
ORDER BY Multiple Expressions
SELECT SourceAddress Src, DestinationAddress Dst, Protocol FROM Netflow WHERE Protocol='ICMP' ORDER BY Src, Dst
SELECT SourceAddress Src, DestinationAddress Dst, Protocol FROM Netflow WHERE Protocol='ICMP' ORDER BY Dst, Src
SELECT SourceAddress Src, DestinationAddress Dst, Protocol FROM Netflow WHERE Protocol='ICMP' ORDER BY Src ASC, Dst DESC
ORDER BY Column Numbers
SELECT SourceAddress Src, DestinationAddress Dst, Protocol FROM Netflow WHERE Protocol='ICMP' ORDER BY 1, 2
ORDER BY … LIMIT … OFFSET
• You can only use OFFSET following both an ORDER BY clause and a LIMIT clause
SELECT SourceAddress Src, DestinationAddress Dst, Protocol FROM Netflow WHERE Protocol='ICMP' ORDER BY Src, Dst LIMIT 5
SELECT SourceAddress Src, DestinationAddress Dst, Protocol FROM Netflow WHERE Protocol='ICMP' ORDER BY Src, Dst LIMIT 5 OFFSET 1
Aggregate Functions
• COUNT, MAX, MIN, SUM, AVG
• You can use aggregate functions only in the following clauses • SELECT
• GROUP BY
Counting Source Records
• COUNT(fieldname) returns the total number of non-null entries in a given field or path across all matching documents
SELECT COUNT(hostname) FROM CounterOps
• You can apply a WHERE clause as usual to restrict the result set
SELECT COUNT(hostname) FROM CounterOps WHERE hostname LIKE 'cam%'
• Use IS NULL to count matching records with null values in a particular field
# NOTE THAT THE TRAINING SOURCES DO NOT CONTAIN ANY NULL FIELDS SELECT COUNT(hostname) FROM CounterOps WHERE hostname IS NULL
# These two queries are equivalent
SELECT COUNT(hostname) FROM CounterOps WHERE hostname IS NOT NULL SELECT COUNT(hostname) FROM CounterOps
Counting Source Records
• COUNT(*) returns the total number of matching records and is more efficient than counting a particular field
SELECT COUNT(*) FROM CounterOps
SELECT COUNT(*) Ct FROM Netflow WHERE Protocol='TCP'
SELECT COUNT(*) Ct FROM Netflow WHERE Protocol='TCP' AND
Counting Unique Values
• COUNT(DISTINCT fieldname) is the preferred way to estimate the number of unique values in a field
• COUNT(DISTINCT) uses an advanced algorithm (HyperLogLog++) to estimate the number of unique values in any particular field, typically within 5% of the true count
• COUNT(DISTINCT) performs much better and consumes far fewer resources than an exact unique field value count
• The DISTINCT keyword is unique to the COUNT() function and cannot be used in any other context
Counting Unique Values
# estimate the number of unique values in a field - the fast but approximate method SELECT COUNT(DISTINCT hostname) FROM CounterOps
SELECT COUNT(DISTINCT hostname) FROM CounterOps WHERE hostname LIKE 'cam%' # not supported
X SELECT DISTINCT hostname FROM CounterOps
• If you require an exact count of unique values in a field despite the
performance and resource penalties you can execute a query with the
following form
# count unique values in a field - the slow but exact method SELECT hostname, COUNT(*) FROM CounterOps GROUP BY hostname
Other Aggregate Functions
SELECT MIN(InBytes), MAX(InBytes), SUM(InBytes), AVG(InBytes) FROM Netflow
SELECT MIN(InBytes) AS MinBytes, MAX(InBytes) AS MaxBytes,
SUM(InBytes) AS SumBytes, round(AVG(InBytes)) AS AvgBytes from
Netflow
SELECT MIN(TimeDuration), MAX(TimeDuration), AVG(TimeDuration),
SUM(TimeDuration) FROM Netflow
SELECT round(MIN(TimeDuration)) MinDuration,
round(MAX(TimeDuration)) MaxDuration, round(AVG(TimeDuration)) AvgDuration, round(SUM(TimeDuration)) SumDuration FROM Netflow
GROUP BY - Aggregating SELECT Query Results
• Allows you to group the returned results of an aggregated expression in the SELECT clause
• If the SELECT clause includes an aggregated expression then you can use a GROUP BY clause to group the returned results
• GROUP BY value must include all of the non-aggregated expressions from the SELECT clause
• Aliases are commonly used in both grouped and ordered queries for
GROUP BY a Single Expression
• GROUP BY with no aggregates
# returns the unique values in the Netflow.Protocol field w/out any aggregation
SELECT Protocol FROM Netflow GROUP BY Protocol
• If you group by a single expression then the aggregated values are grouped by every value of the GROUP BY expression that is in the matching data
# outputs the COUNT(*) aggregate for each unique value in the Netflow.Protocol field
SELECT Protocol, COUNT(*) Ct FROM Netflow GROUP BY Protocol • You cannot group by an aggregated expression
# not allowed
GROUP BY Multiple Expressions
• If you group by multiple expressions the results are grouped by all combinations of values in the grouping expression
• The grouping order reflects the order of the GROUP BY expressions
SELECT Protocol, DestinationPort Port, COUNT(*) Ct FROM Netflow GROUP BY
Protocol, Port LIMIT 15
SELECT Protocol, SourceAddress Src, DestinationPort Port, COUNT(*) Ct FROM Netflow GROUP BY Protocol, Src, Port
SELECT Protocol, SourceAddress Src, DestinationAddress Dst, DestinationPort Port, COUNT(*) Ct FROM Netflow GROUP BY Protocol, Src, Dst, Port
GROUP BY … WHERE
• You can apply a WHERE clause as usual to restrict the result set
SELECT Protocol, COUNT(*) Ct FROM Netflow WHERE Protocol IN
('TCP','UDP') GROUP BY Protocol
SELECT Protocol, DestinationPort Dst, COUNT(*) Ct FROM Netflow WHERE
Protocol NOT IN ('TCP','UDP') GROUP BY Protocol, Dst
SELECT Protocol, SourceAddress Src, DestinationPort Port, COUNT(*) Ct FROM Netflow WHERE Protocol NOT IN ('TCP','UDP') GROUP BY
Order by
• Used to control the sort order for SELECT query results
• Can include columns that are not part of the SELECT clause
GROUP BY … ORDER BY
• Order by Aggregated expressions
SELECT Protocol, COUNT(*) Ct FROM Netflow GROUP BY Protocol ORDER BY Ct DESC
SELECT DestinationAddress Dst, COUNT(*) Ct FROM Netflow GROUP BY Dst ORDER BY Ct DESC LIMIT 15
• Order by any expressions from the GROUP BY clause
SELECT Protocol, COUNT(*) Ct FROM Netflow GROUP BY Protocol ORDER BY Protocol
SELECT Protocol, DestinationPort DstPort, COUNT(*) Ct FROM Netflow GROUP BY Protocol, DstPort ORDER BY Protocol, DstPort
SELECT DestinationPort DstPort, Protocol, COUNT(*) Ct FROM Netflow GROUP BY DstPort, Protocol ORDER BY DstPort, Protocol
• Order by a combination of both
SELECT hostname, COUNT(*) AS Ct FROM CounterOps GROUP BY hostname ORDER BY Ct DESC, hostname LIMIT 15
SELECT Protocol, DestinationPort DstPort, COUNT(*) Ct FROM Netflow GROUP BY DstPort, Protocol ORDER
Arithmetic Operators in Queries
• Sqrrl supports using the following basic comparison operators to compare expressions in a SELECT query WHERE clause
• =, <, >, <=, >=, <>
• FLOAT operators: + - * / • INTEGER operators: div mod • STRING concatenation: +\
SELECT TimeStart, InBytes+OutBytes TotalBytes, SourceAddress Src, DestinationAddress Dst, Protocol + ':' + DestinationPort AS
"Protocol:Port" FROM Netflow WHERE InBytes+OutBytes >= 10000000 ORDER BY TotalBytes DESC
Query Functions – Already Covered
• json([fieldPath1, ...])
• Returns a JSON object representing the fields of this document
• Optionally, the results can only show the tree under the specified paths • dataset()
• Returns the dataset containing the matching document • Useful for grouping multi-dataset queries
• Not valid in WHERE clauses
Query Functions – Already Covered
• path(fieldPath)
• Returns the value of a field at the specified path
• Must be used instead of a bare column reference for slash-delimited paths or for fields with reserved keyword names
• round(x, [precision])
• Rounds the given number to precision decimal places • If precision is omitted defaults to 0
Query Functions – Record Identifiers
• uuid()
• Returns the uuid of a document/record • event_time()
• Returns the event time of a source record
• Can only be used in queries against sources
SELECT uuid(), event_time(), TimeStart, TimeEnd FROM Netflow ORDER BY TimeStart LIMIT 10
Query Functions – Date & Time Functions
• DATETIME Type Conversions
• A DATETIME value converts to a STRING as an ISO 8601 formatted timestamp rendered in the appropriate time zone
• A STRING value converts to a DATETIME as a prefix of an ISO 8601 formatted timestamp with all remaining fields set to zero. If the time zone is not provided explicitly then UTC is assumed
• The to_date() and format_date() functions offer more control of DATETIME format patterns and timezones
Query Functions – Date & Time Functions
• DATETIME Type Conversions
• DATETIMEs are converted to and from numbers as milliseconds since the epoch 1970-01-01 00:00Z
• Aggregate functions such as MIN(), MAX(), and AVG() therefore interpret DATETIME values as milliseconds since the epoch and the result can be explicitly cast back to a DATETIME
• DATETIME Arithmetic
• DATETIME+/- numeric - Adds or subtracts a number of days from the DATETIME
Query Functions – Date & Time Functions
• to_date(expr, [format])
• Attempts to parse expr as a DATETIME according to the specified format pattern or to the ISO 8601 format if omitted. If an integer is provided as the single argument interprets the value as milliseconds since the epoch
SELECT TimeStart, TimeEnd, SourceAddress Src, DestinationAddress Dst FROM Netflow WHERE TimeStart BETWEEN to_date('2014-11-01') AND to_date('2014-11-03') ORDER BY TimeStart SELECT TimeStart, TimeEnd, SourceAddress Src, DestinationAddress Dst FROM Netflow WHERE
TimeStart BETWEEN to_date('2014-11-03')-1 AND to_date('2014-11-03') ORDER BY TimeStart SELECT to_date(MIN(TimeStart)) MinTime, to_date(AVG(TimeStart)) AvgTime,
to_date(MAX(TimeStart)) MaxTime FROM Netflow
Query Functions – Date & Time Functions
• format_date(datetime, format, [timezone]
• Formats the specified DATETIME with a format pattern
SELECT format_date(MIN(TimeStart), 'MM/dd/yyyy hh:mm:ss') MinTime FROM Netflow
• now()
Query Functions – Date & Time Functions
• timestampdiff(field, datetime1, datetime2)
• Returns the difference between two DATETIMEs in the specified units (one of YEAR, MONTH, DAY, HOUR, MINUTE, SECOND). In SELECT queries, field is a keyword and should not be enclosed in quotes
SELECT timestampdiff(DAY, to_date(MAX(TimeStart)), now()) FROM Netflow SELECT timestampdiff(HOUR, to_date(MAX(TimeStart)), now()) FROM Netflow SELECT DestinationAddress Dst, COUNT(*) Ct, format_date(MIN(event_time()),
'yyyy/MM/dd hh:mm:ss') MinTime, format_date(MAX(event_time()), 'yyyy/MM/dd hh:mm:ss') MaxTime, timestampdiff(HOUR, to_date(MIN(event_time())),
to_date(MAX(event_time())))+'H' TimeSpan FROM Netflow GROUP BY Dst ORDER BY Ct DESC LIMIT 20
Query Functions – Processing Inet ADDR
• inet(ipaddress)
• Converts a STRING into an INET_ADDR
SELECT to_date(MIN(event_time())) MinTime, to_date(MAX(event_time())) MaxTime FROM Netflow WHERE DestinationAddress=inet('192.168.0.120')
SELECT ipAddress FROM CounterOps ORDER BY ipAddress
SELECT ipAddress FROM CounterOps WHERE ipAddress < inet('172.16.23.10') ORDER BY ipAddress
SELECT ipAddress FROM CounterOps WHERE ipAddress BETWEEN
Query Functions – Processing Inet ADDR
• contains(subnet, inet_address)
• Returns true if the specified subnet contains the specified INET_ADDR otherwise false
SELECT ipAddress FROM CounterOps WHERE contains('65.18.0.0/8', ipAddress) ORDER BY ipAddress
SELECT COUNT(*) FROM CounterOps WHERE contains('192.168.0.0/24', ipAddress) SELECT COUNT(*) FROM CounterOps WHERE contains('192.168.0.0/255.255.255.0',
ipAddress)
SELECT COUNT(*) Ct, timestampdiff(HOUR, to_date(MIN(event_time())),
to_date(MAX(event_time())))+'H' TimeSpan FROM Netflow WHERE
Query Functions – String functions
concat datalength len locate lower
ltrim regex replace rtrim str
Query Functions – Math Functions
abs acos asin atan atan2
ceiling cos exp floor int
ln log log2 log10 round
Visibility Labels & SELECT Queries
• When a source field has visibility labels they become part of the field name so to query those fields you must also specify the visibility label
SELECT path('field@[visibility]')
• If you do not specify the visibility label for a field that requires one then no results are returned for that field
• If the --strip-visibilities query option is enabled Sqrrl returns all matching data allowed by any of the user's visibility roles
Visibility Labels & SELECT Queries
• If you do include visibility labels when --strip-visibilities is enabled then no results are returned for those fields
• If two fields have the same name but different visibility labels AND the user has access to both visibility labels then the resulting value is not
Escaping Reserved Keywords & Special Characters
• A source, field, model, entity, feature, or relationship name that is an invalid SQL identifier or a reserved SQL keyword must be enclosed in double quotes
SELECT field FROM "3"
SELECT username FROM SecurityAnalysis."User"
• Generally use the backslash (\) to escape special characters
SELECT json() FROM CounterOps WHERE hostname = 'name\+123'
• Except to escape a single quote, use another single quote
Sqrrl Indexes & SELECT … WHERE Expressions
• VALUE_BY_FIELD• (field, type, value) → Record • Equality, Range, Prefix
• All non-aggregate types
• AGGREGATE
• field → (value, record)
• Numeric equality and range • Aggregates only
Sqrrl Indexes & SELECT … WHERE Expressions
• TERMS_BY_FIELD• (field, term) → record
• Exact-field Lucene term queries
• Can be used for string equality at a cost
• All non-aggregate types, but will be compared as strings
• TERMS
• term → (field, record)
• Unfielded or field pattern Lucene term queries
• Can be used for fielded Lucene or string equality at a cost • All non-aggregate types, but will be compared as strings
Explain plan
• Explain() method on the query operations object returned by the createQuery() or createGqlQuery() API call returns a JSON tree representing the iterator hierarchy that will be used to execute the query
• Result is *implementation-dependent* and is not guaranteed to be consistent across different versions of Sqrrl but can be used to
optimize WHERE clause formulation to help avoid inefficient full table scans
Explain plan
• Queries with a relativeSpeed of 3 take advantage of a specifically optimized index and go directly to matching records
• Queries with a relativeSpeed of 2 use indexes but may degrade given particular characteristics of the data
• Range searches whose result sizes exceed
sqrrl.server.rangesearch.maxResortingEntries fail over to scanning every document
• Queries with a relativeSpeed of 1 indicate queries whose running time scales linearly with the total size of the data
SELECT queries in the Sqrrl Web UI
• You can also execute queries in the Sqrrl Web UI
SELECT hostname FROM CounterOps
• Web UI supports SELECT clauses of the form SELECT * which returns all of the available columns for the matching source records
• To avoid a mix of columns from different items include an explicit FROM clause with SELECT *
• First column displayed contains the event time for each source record • Remaining columns are then displayed in alphabetical order by name
SELECT queries in the Sqrrl Web UI
• Provide a WHERE clause to limit the matching results from SELECT *
SELECT * FROM CounterOps WHERE hostname LIKE 'cam%'
• From the results list, you can:
• Sort by a specific column by clicking the column heading
• change the column order by dragging and dropping a column to a new position
• Hide and show individual columns by right-clicking the column name in the table heading then checking or unchecking the checkboxes in the column list
• You cannot use SELECT * outside of the Sqrrl Web UI • Executing SELECT * in the Sqrrl shell returns an error
© 2019, Amazon Web Services, Inc. or its Affiliates.