5 Data Types5 Data Types
5.3 PostgreSQL Data Types PostgreSQL Data Types
5.3.12 Network Address Types Network Address Types
5.3.12 Network Address Types
PostgreSQL includes support for bothcidr cidr ,inet inet , andmacaddr macaddr data types. Again, those types are bundled with indexing support and advanced functions and operator support.
The PostgreSQL documentation chapters entitledNetwork Address Types andNetwork Address Functions and Operators cover network address types.
Web servers logs are a classic source of data to process where we find network address types andThe Honeynet Project has some free samples for us to play with. This time we're using theScan 34Scan 34 entry. Here's how to load the sample data set, once cleaned into a proper CSV file:
begin begin;
drop
drop tabletableifif existsexists access_log;
create
createtabletable access_log (
ip inet, ts timestamptz,
request text, status integer );
\copycopy access_log fromfrom 'access.csv' withwith csv delimiter ';'
commit commit;
The script used to cleanse the srcinal data into a CSV that PostgreSQL is happy about implements a pretty simple transformation from
211.141.115.145 - - [13/Mar/2005:04:10:18 -0500] "GET / HTTP/1.1" 403 2898 "-" "Mozilla/4.0 (compatible; MSIE 5.5;
into
"211.141.115.145";" 2005-05-13 04:10:18 -0500";"GET / HTTP/1.1";"403"
Being mostly interested into network address types, the transformation from the Apache access log format to CSV is lossy here, we keep only some of the fields we might be interested into.
One of the things that's possible to implement thanks to the PostgreSQLinet inet data type is an analysis of /24 /24 networks that are to be found in the logs.
To enable that analysis, we can use theset_masklen()set_masklen() function which allows us to transforms an IP address into an arbitrary CIDR network address:
select
select distinctdistinctonon (ip) ip,
set_masklen(ip, 24) asas inet_24, set_masklen(ip::cidr, 24) asas cidr_24
from
from access_log
limit limit 10;
And we can see that if we keep the data type asinet inet , we still get the full IP address with the 24
24 network notation added. To have the.0/24.0/24 notation we need to be usingcidr cidr :
ip │ inet_24 │ cidr_24
═══════════════╪═══ ═══════════════╪══ ═══════════════
4.35.221.243 │ 4.35.221.243/24 │ 4.35.221.0/24 4.152.207.126 │ 4.152.207.126/24 │ 4.152.207.0/24 4.152.207.238 │ 4.152.207.238/24 │ 4.152.207.0/24 4.249.111.162 │ 4.249.111.162/24 │ 4.249.111.0/24 12.1.223.132 │ 12.1.223.132/24 │ 12.1.223.0/24 12.8.192.60 │ 12.8.192.60/24 │ 12.8.192.0/24 12.33.114.7 │ 12.33.114.7/24 │ 12.33.114.0/24 12.47.120.130 │ 12.47.120.130/24 │ 12.47.120.0/24 12.172.137.4 │ 12.172.137.4/24 │ 12.172.137.0/24 18.194.1.122 │ 18.194.1.122/24 │ 18.194.1.0/24 (10 rows)
Of course, note that you could be analyzing other networks than /24 /24:
select
select distinctdistinctonon (ip) ip,
set_masklen(ip::cidr, 27) asas cidr_27,
set_masklen(ip::cidr, 28) asas cidr_28
from
from access_log
limit limit 10;
This computes for us the proper starting ip addresses for our CIDR notation for us, of course.
After all, what's the point of using proper data types if not for advanced processing?
ip │ cidr_27 │ cidr_28
═══════════════╪═══ ═══════════════╪══ ════════════════
4.35.221.243 │ 4.35.221.224/27 │ 4.35.221.240/28 4.152.207.126 │ 4.152.207.96/27 │ 4.152.207.112/28 4.152.207.238 │ 4.152.207.224/27 │ 4.152.207.224/28 4.249.111.162 │ 4.249.111.160/27 │ 4.249.111.160/28 12.1.223.132 │ 12.1.223.128/27 │ 12.1.223.128/28 12.8.192.60 │ 12.8.192.32/27 │ 12.8.192.48/28 12.33.114.7 │ 12.33.114.0/27 │ 12.33.114.0/28 12.47.120.130 │ 12.47.120.128/27 │ 12.47.120.128/28 12.172.137.4 │ 12.172.137.0/27 │ 12.172.137.0/28 18.194.1.122 │ 18.194.1.96/27 │ 18.194.1.112/28 (10 rows)
Equipped with this set_masklen()set_masklen() function, it's now easy to analyze our access logs using arbitrary CIDR network definitions.
select
select set_masklen(ip::cidr, 24) asas networknetwork, count(*) asas requests,
array_length(array_agg( distinctdistinct ip), 1) asas ipcount
from
from access_log
group
group bybynetworknetwork having
having array_length(arr ay_agg(distinctdistinct ip), 1) > 1
order
order byby requests descdesc, ipcount descdesc;
In our case, we get the following result:
network │ requests │ ipcount
══════════════════╪══════════╪═════════
(12 rows)
5.3.13 Ranges 5.3.13 Ranges
Range types are a unique feature of PostgreSQL, managing two dimensions of data in a single column, and allowing advanced processing. The main example is thedaterangedaterange data type, which stores as a single value a lower and an upper bound of the range as a single value. This allows PostgreSQL to implement a concurrent safe check againstoverlappingoverlapping ranges, as we're going to see in the next example.
As usual, read the PostgreSQL documentation chapters with the titlesRange Types and Range Functions and Operators for complete information.
TheInternational Monetary Fund publishesexchange rate archives by month for lots of currencies. An exchange rate is relevant from its publication until the next rate is published, which makes a very good use case for our PostgreSQL range types.
The following SQL script is the main part of theELELT T script that has been used for this book.
Only missing from this book's pages is the transformation script that pivots the availabletstsv v file into the more interesting format we use here:
begin begin;
create
createschemaschema ifif notnot existsexists raw;
-- Must be run as a Super User in your database instance -- create extension if not exists btree_gist;
drop
drop tabletableifif existsexists raw.rates, rates;
create
createtabletable raw.rates (
currency text, date date, rate numeric );
\copycopy raw.rates fromfrom 'rates.csv' withwith csv delimiter ';'
create
createtabletable rates (
currency text, validity daterange,
rate numeric,
exclude usingusing gist (currency withwith =, validity withwith &&) );
insert
insertintointo rates(currency, validity, rate)
select
select currency, daterange(date,
lead(date) overover(partitionpartitionbyby currency
order
orderbyby date), '[)'
)
as
as validity, rate
from
from raw.rates
order orderbyby date;
commit commit;
In this SQL script, we first create a target table for loading the CSV file. The file contains lines with a currency name, a date of publication, and a rate as anumericnumeric value. Once the data is loaded into this table, we can transform it into something more interesting to work with from an application, theratesrates table.
Theratesrates table registers the rate value for a currency and avalidity validity period, and uses an exclusion constraint that guarantees non-overlapping validity validity periods for any given currency
currency :
exclude usingusing gist (currency withwith =, validity withwith &&)
This expression reads: exclude any tuple where the currency is== to an existing currency in our table AND AND where thevalidity validity is overlapping with (&&&&) any existing validity in our table.
This exclusion constraint is implemented in PostgreSQL using aGiST GiST index.
By default,GiST GiST in PostgreSQL doesn't support one-dimensional data types that are meant to be covered by B-treeB-tree indexes. With exclusion constraints though, it's very interesting to extend GiST GiST support for one-dimensional data types, and so we install thebtree_gisbtree_gis extension, provided in PostgreSQL contrib package.
The script then fills in theratesrates table from theraw.ratesraw.rates we'd been importing in the previous
step. The query uses thelead()lead() window function to implement the specification spelled out in English earlier:an exchange rate is relevant from its publication until the next rate isan exchange rate is relevant from its publication until the next rate is published
published .
Here's how the data looks, with the following query targeting Euro rates:
select
select currency, validity, rate
from from rates
where
where currency = 'Euro'
order
order byby validity
limit limit 10;
We can see that the validity is a range of dates, and the standard output for this type is a closed range which includes the first entry and excludes the second one:
currency │ validity │ rate
══════════╪═════════════════════════╪══════════
Euro │ [2017-05-02,2017 -05-03) │ 1.254600 Euro │ [2017-05-03,2017 -05-04) │ 1.254030 Euro │ [2017-05-04,2017 -05-05) │ 1.252780 Euro │ [2017-05-05,2017 -05-08) │ 1.250510 Euro │ [2017-05-08,2017 -05-09) │ 1.252880 Euro │ [2017-05-09,2017 -05-10) │ 1.255280 Euro │ [2017-05-10,2017 -05-11) │ 1.255300 Euro │ [2017-05-11,2017 -05-12) │ 1.257320 Euro │ [2017-05-12,2017 -05-15) │ 1.255530 Euro │ [2017-05-15,2017 -05-16) │ 1.248960 (10 rows)
Having this data set with the exclusion constraint means that we know we have at most a single rate available at any point in time, which allows an application needing the rate for a specific time to write the following query:
select select rate
from from rates
where
where currency = 'Euro'
and
and validity @> date '2017-05-18';
The operator@>@> readscontainscontains, and PostgreSQL uses the exclusion constraint's index to solve that query efficiently:
rate
══════════
1.240740 (1 row)