• No results found

Network Address Types Network Address Types

5 Data Types5 Data Types

5.3 PostgreSQL Data Types PostgreSQL Data Types

5.3.12 Network Address Types Network Address Types

5.3.12 Network Address Types

PostgreSQL includes support for bothcidr cidr ,inet inet , andmacaddr macaddr data types. Again, those types are bundled with indexing support and advanced functions and operator support.

The PostgreSQL documentation chapters entitledNetwork Address Types andNetwork Address Functions and Operators cover network address types.

Web servers logs are a classic source of data to process where we find network address types andThe Honeynet Project has some free samples for us to play with. This time we're using theScan 34Scan 34 entry. Here's how to load the sample data set, once cleaned into a proper CSV file:

begin begin;

drop

drop tabletableifif existsexists access_log;

create

createtabletable access_log (

ip inet, ts timestamptz,

request text, status integer );

\copycopy access_log fromfrom 'access.csv' withwith csv delimiter ';'

commit commit;

The script used to cleanse the srcinal data into a CSV that PostgreSQL is happy about implements a pretty simple transformation from

211.141.115.145 - - [13/Mar/2005:04:10:18 -0500] "GET / HTTP/1.1" 403 2898 "-" "Mozilla/4.0 (compatible; MSIE 5.5;

into

"211.141.115.145";" 2005-05-13 04:10:18 -0500";"GET / HTTP/1.1";"403"

Being mostly interested into network address types, the transformation from the Apache access log format to CSV is lossy here, we keep only some of the fields we might be interested into.

One of the things that's possible to implement thanks to the PostgreSQLinet inet data type is an analysis of /24 /24 networks that are to be found in the logs.

To enable that analysis, we can use theset_masklen()set_masklen() function which allows us to transforms an IP address into an arbitrary CIDR network address:

select

select distinctdistinctonon (ip) ip,

set_masklen(ip, 24) asas inet_24, set_masklen(ip::cidr, 24) asas cidr_24

from

from access_log

limit limit 10;

And we can see that if we keep the data type asinet inet , we still get the full IP address with the 24

24 network notation added. To have the.0/24.0/24 notation we need to be usingcidr cidr :

ip inet_24 cidr_24

═══════════════╪═══ ═══════════════╪══ ═══════════════

4.35.221.243 │ 4.35.221.243/24 │ 4.35.221.0/24 4.152.207.126 │ 4.152.207.126/24 │ 4.152.207.0/24 4.152.207.238 │ 4.152.207.238/24 │ 4.152.207.0/24 4.249.111.162 │ 4.249.111.162/24 │ 4.249.111.0/24 12.1.223.132 │ 12.1.223.132/24 │ 12.1.223.0/24 12.8.192.60 │ 12.8.192.60/24 │ 12.8.192.0/24 12.33.114.7 │ 12.33.114.7/24 │ 12.33.114.0/24 12.47.120.130 │ 12.47.120.130/24 │ 12.47.120.0/24 12.172.137.4 │ 12.172.137.4/24 │ 12.172.137.0/24 18.194.1.122 │ 18.194.1.122/24 │ 18.194.1.0/24 (10 rows)

Of course, note that you could be analyzing other networks than /24 /24:

select

select distinctdistinctonon (ip) ip,

set_masklen(ip::cidr, 27) asas cidr_27,

set_masklen(ip::cidr, 28) asas cidr_28

from

from access_log

limit limit 10;

This computes for us the proper starting ip addresses for our CIDR notation for us, of course.

After all, what's the point of using proper data types if not for advanced processing?

ip cidr_27 cidr_28

═══════════════╪═══ ═══════════════╪══ ════════════════

4.35.221.243 │ 4.35.221.224/27 │ 4.35.221.240/28 4.152.207.126 │ 4.152.207.96/27 │ 4.152.207.112/28 4.152.207.238 │ 4.152.207.224/27 │ 4.152.207.224/28 4.249.111.162 │ 4.249.111.160/27 │ 4.249.111.160/28 12.1.223.132 │ 12.1.223.128/27 │ 12.1.223.128/28 12.8.192.60 │ 12.8.192.32/27 │ 12.8.192.48/28 12.33.114.7 │ 12.33.114.0/27 │ 12.33.114.0/28 12.47.120.130 │ 12.47.120.128/27 │ 12.47.120.128/28 12.172.137.4 │ 12.172.137.0/27 │ 12.172.137.0/28 18.194.1.122 │ 18.194.1.96/27 │ 18.194.1.112/28 (10 rows)

Equipped with this set_masklen()set_masklen() function, it's now easy to analyze our access logs using arbitrary CIDR network definitions.

select

select set_masklen(ip::cidr, 24) asas networknetwork, count(*) asas requests,

array_length(array_agg( distinctdistinct ip), 1) asas ipcount

from

from access_log

group

group bybynetworknetwork having

having array_length(arr ay_agg(distinctdistinct ip), 1) > 1

order

order byby requests descdesc, ipcount descdesc;

In our case, we get the following result:

network │ requests │ ipcount

══════════════════╪══════════╪═════════

(12 rows)

5.3.13 Ranges 5.3.13 Ranges

Range types are a unique feature of PostgreSQL, managing two dimensions of data in a single column, and allowing advanced processing. The main example is thedaterangedaterange data type, which stores as a single value a lower and an upper bound of the range as a single value. This allows PostgreSQL to implement a concurrent safe check againstoverlappingoverlapping ranges, as we're going to see in the next example.

As usual, read the PostgreSQL documentation chapters with the titlesRange Types and Range Functions and Operators for complete information.

TheInternational Monetary Fund publishesexchange rate archives by month for lots of currencies. An exchange rate is relevant from its publication until the next rate is published, which makes a very good use case for our PostgreSQL range types.

The following SQL script is the main part of theELELT T script that has been used for this book.

Only missing from this book's pages is the transformation script that pivots the availabletstsv v file into the more interesting format we use here:

begin begin;

create

createschemaschema ifif notnot existsexists raw;

-- Must be run as a Super User in your database instance -- create extension if not exists btree_gist;

drop

drop tabletableifif existsexists raw.rates, rates;

create

createtabletable raw.rates (

currency text, date date, rate numeric );

\copycopy raw.rates fromfrom 'rates.csv' withwith csv delimiter ';'

create

createtabletable rates (

currency text, validity daterange,

rate numeric,

exclude usingusing gist (currency withwith =, validity withwith &&) );

insert

insertintointo rates(currency, validity, rate)

select

select currency, daterange(date,

lead(date) overover(partitionpartitionbyby currency

order

orderbyby date), '[)'

)

as

as validity, rate

from

from raw.rates

order orderbyby date;

commit commit;

In this SQL script, we first create a target table for loading the CSV file. The file contains lines with a currency name, a date of publication, and a rate as anumericnumeric value. Once the data is loaded into this table, we can transform it into something more interesting to work with from an application, theratesrates table.

Theratesrates table registers the rate value for a currency and avalidity validity period, and uses an exclusion constraint that guarantees non-overlapping validity validity periods for any given currency

currency :

exclude usingusing gist (currency withwith =, validity withwith &&)

This expression reads: exclude any tuple where the currency is== to an existing currency in our table AND AND where thevalidity validity is overlapping with (&&&&) any existing validity in our table.

This exclusion constraint is implemented in PostgreSQL using aGiST GiST index.

By default,GiST GiST in PostgreSQL doesn't support one-dimensional data types that are meant to be covered by B-treeB-tree indexes. With exclusion constraints though, it's very interesting to extend GiST GiST support for one-dimensional data types, and so we install thebtree_gisbtree_gis extension, provided in PostgreSQL contrib package.

The script then fills in theratesrates table from theraw.ratesraw.rates we'd been importing in the previous

step. The query uses thelead()lead() window function to implement the specification spelled out in English earlier:an exchange rate is relevant from its publication until the next rate isan exchange rate is relevant from its publication until the next rate is published

published .

Here's how the data looks, with the following query targeting Euro rates:

select

select currency, validity, rate

from from rates

where

where currency = 'Euro'

order

order byby validity

limit limit 10;

We can see that the validity is a range of dates, and the standard output for this type is a closed range which includes the first entry and excludes the second one:

currency │ validity rate

══════════╪═════════════════════════╪══════════

Euro │ [2017-05-02,2017 -05-03) │ 1.254600 Euro │ [2017-05-03,2017 -05-04) │ 1.254030 Euro │ [2017-05-04,2017 -05-05) │ 1.252780 Euro │ [2017-05-05,2017 -05-08) │ 1.250510 Euro │ [2017-05-08,2017 -05-09) │ 1.252880 Euro │ [2017-05-09,2017 -05-10) │ 1.255280 Euro │ [2017-05-10,2017 -05-11) │ 1.255300 Euro │ [2017-05-11,2017 -05-12) │ 1.257320 Euro │ [2017-05-12,2017 -05-15) │ 1.255530 Euro │ [2017-05-15,2017 -05-16) │ 1.248960 (10 rows)

Having this data set with the exclusion constraint means that we know we have at most a single rate available at any point in time, which allows an application needing the rate for a specific time to write the following query:

select select rate

from from rates

where

where currency = 'Euro'

and

and validity @> date '2017-05-18';

The operator@>@> readscontainscontains, and PostgreSQL uses the exclusion constraint's index to solve that query efficiently:

rate

══════════

1.240740 (1 row)