In-database funnel implementation - Combining usage and profile data for retrospectively analyz

5.5 Funnels

5.5.5 In-database funnel implementation

The second implementation focuses on reducing the traffic between the database and application by keeping intermediary results in the database. This way queries don’t need to return huge lists of identifiers back to the application, but only the final numbers required to display the funnel. This implementation has been made to combat the weak points of the first implementation, which is the scalability.

Data structure

Instead of storing lists of identifiers for each step in PHP as done in the first implementation, this implementation keeps the identifiers in PostgreSQL. The identifier mappings are stored in a temporary table. Listing 20 shows the query to create this temporary table. The temporary table will have a column for the

baseidentifier, and each secondary identifier. For the base identifier a constraint

is added to make sure it is neverNULL, so that mistakes while creating the funnel are noticed. Secondary identifiers can be null, because not every user in the funnel might reach the step where a second identifier is introduced.

The create table query has a special clause at the end ON COMMIT DROP which deletes the temporary table when the transaction end. This requires that at the start of the funnel theBEGIN TRANSACTIONcommand is used, and at the end the COMMMIT command. Using this structure it is ensured that the temporary table

is deleted when the funnel results are calculated.

Listing 20: Query for creating the mapping table

1 CREATE TEMP TABLE map ( 2 base TEXT NOT NULL, 3 user TEXT DEFAULT NULL 4 ) ON COMMIT DROP;

Two more temporary tables are used, one to store the identifiers of the current step, and one for the identifiers of the previous step. Listing 21 shows the query used for creating both of these tables (where the table of the previous step used the name previous, and the current one uses current). Usage of these tables will be made clear while going through the funnel steps.

Listing 21: Query for creating the table for the current step

1 CREATE TEMP TABLE current ( 2 base TEXT NOT NULL 3 ) ON COMMIT DROP;

Step 1: Which emails have been sent?

Listing 22 shows the query used to get the first results. The select query is exactly the same as in the first implementation, but instead of returning the results to PHP it inserts them in the temporary table.

Listing 22: Query to get the sent email events and put them into the temporary table

1 insert into current ( 2 select

3 data#>>’{email,id}’ as base 4 from events

5 where

6 data->>’type’ = ’email’

7 and data#>>’{email,type}’ = ’sent’

8 and data->>’at’ > ’$funnel->start’ and data->>’at’ <

’$funnel->end’

9 );

Because this is the first step, the map table also needs to be updated, this is shown in Listing 23. Now the map contains all the base identifiers that will be followed during this funnel.

Listing 23: Copy the result of the first step into the map table

1 insert into map 2 select * 3 from current;

The resulting value for the current step can be retrieved using a simple query that counts the number of rows in thecurrent table, like shown in Listing 24.

Listing 24: Get the result count for a step

1 select count(*) 2 from current;

Since this is the first step, it does not need to be compared to a previous step to get the continued/stopped/rejoined numbers. Before moving on to the next step the result of this step needs to move to theprevious table, for which two queries are listed in Listing 25

Listing 25: Move results of the current step to the previous table

1 insert into previous 2 select * from current; 3

4 truncate current;

Step 2: Which emails have been read?

The query to get the results of the second step has some changes compared to the one of the first implementation. Listing 26 shows the resulting query. It selects the email read events from the table, and joins them with the map table to limit the results to only the identifiers that are present in that table. The resulting identifiers are again inserted into thecurrenttable.

Listing 26: Query to get the results for the emails read step

1 insert into current ( 2 select

3 events.data#>>’{email,id}’ as base 4 from events

5 inner join map on

6 events.data#>>’{email,id}’ = map.base 7 where

8 events.data->>’type’ = ’email’

9 and events.data#>>’{email,type}’ = ’read’ 10 );

Getting the count for the second step is the same as the first step, simply counting the number of rows in the current table. For getting the number of people that continued from the first to the second step the query in Listing 27 is used. This selects the rows from the current step, that directly came from the previous step (to ensure rejoined people are excluded). With this continued number the stopped number of the previous step is simply subtracting it from the result of the previous step.

Listing 27: Get the number of people that continued from the previous step to this one

1 select count(*) 2 from current

3 left join previous on previous.base = current.base;

One result is still missing, the number of people that did not do the previous step, but did do the current one, which is the rejoined statistic. Listing 28 shows the query used to calculate it. It counts the number of identifiers in current

that do not appear inprevious.

Listing 28: Get the number of people that rejoind

1 select count(*) 2 from current

3 left join previous on previous.base = current.base 4 where previous.base = null;

Again the data for this step is moved to the previous one, and thecurrenttable

is cleared, as shown in Listing 25.

Step 3: Which emails lead to website visits?

This step looks like the previous one, the only difference is the query that gets the results, which is shown below in Listing 29.

Listing 29: Get the results for website visits

1 insert into current ( 2 select

3 events.data#>>’{website,params,email_id}’ as base 4 from events

5 inner join map on

6 events.data#>>’{website,params,email_id}’ = map.base 7 where

8 events.data->>’type’ = ’website’

9 and events.data#>>’{website,type}’ = ’visit’ 10 );

Pros and Cons

This system solves some of the problems in the first implementation, but also has some negative sides of its own. Below the positive and negative points of this implementation are listed.

1. Con: The query used to select the identifiers for each step needs to be wrapped with an insert query, adding complexity (this can however auto- matically be added by the application).

2. Con: Temporary tables need to be created and managed, adding complexity.

3. Pro: Small query size, it is not necessary to insert all identifiers into the query, reducing query parsing time.

4. Pro: Improved performance, since indexes could be added to the temporary table and speed up usage.

In document Combining usage and profile data for retrospectively analyzing usability of applications with funnels (Page 52-55)