System architecture - Combining usage and profile data for retrospectively analyzing usability

Keeping in mind the requirements of the previous section, an abstract system architecture has been created, visualized in Figure 17. This architecture shows what kind of system most companies that use analytics have, either as an in-house solution or provided by one of the companies investigated in Chapter A.2. Having a clear idea of how an analytics tool is organized technically is critical for developing a solution that can combine the usage and profile data, because it allows me to ensure that the requirements of an analytics tool are still met in the final system. The architecture starts at the top, where the events are generated, after which they are stored, processed by a real-time filter or historic scanner, and finally the results (reports, charts, etc.) are stored. These components are explained in detail below.

Figure 17: Event collection and processing system

B.2.1 Input

Figure 17 starts at the top with the input: the applications, such as websites, apps and servers, that generate events. An app can send data about its usage (startup, screen views, encountered problems), a web server can send data about its operation (response time, uptime, encountered problems) and a notification server can track notification usage (sent, read, discarded, unable to send). Data from these different sources should have some common properties, like the source and time of the event, but all other data depends on the context and should be flexible. For example, a web server should report its name, IP address and data center, while apps should report the logged-in user, device type (brand, model, size, etc.) and currently visible screen. This part fulfills R1.
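The split between common and context-dependent properties can be illustrated with a minimal Python sketch. The field names (`source`, `timestamp`, `name`, `properties`), the helper `make_event` and all example values are assumptions made for illustration; they are not the event format of any system described in this document.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Any, Dict, Optional


@dataclass
class Event:
    # Common properties that every source is expected to provide.
    source: str
    timestamp: str  # ISO 8601 string in UTC
    name: str  # hypothetical event type, e.g. "screen_view"
    # Context-dependent data; the keys differ per kind of source.
    properties: Dict[str, Any] = field(default_factory=dict)

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)


def make_event(source: str, name: str, when: Optional[datetime] = None,
               **properties: Any) -> Event:
    """Build an event; all extra keyword arguments become flexible properties."""
    when = when or datetime.now(timezone.utc)
    return Event(source=source, timestamp=when.isoformat(), name=name,
                 properties=properties)


fixed_time = datetime(2015, 6, 1, 12, 0, tzinfo=timezone.utc)

# A web server reports its name, IP address and data center.
server_event = make_event("web-server", "request_handled", when=fixed_time,
                          server_name="web-01", ip_address="192.0.2.10",
                          data_center="dc-1", response_time_ms=42)

# An app reports the logged-in user, device type and visible screen.
app_event = make_event("mobile-app", "screen_view", when=fixed_time,
                       user_id="user-123",
                       device={"brand": "ExampleBrand", "model": "X1"},
                       screen="overview")
```

Both events share the same top-level structure, so the storage layer only has to understand the common properties, while the `properties` dictionary stays free-form.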

B.2.2 Storage

After events are generated by a server or client, they need to be stored. The pictured load balancer spreads the intake over multiple servers and data centers, fulfilling R2. Behind the load balancer a database needs to store the events, for which there are many options. The most used database implementations are listed below per type, based on their current popularity as calculated by DB-Engines [6] and on a comparison of NoSQL databases [24].

Relational [14]: Oracle, MySQL, Microsoft SQL Server, PostgreSQL, DB2, Microsoft Access, SQLite, RedShift (Amazon), MariaDB, Teradata.

Key-value stores [12]: Redis, Memcached, Riak, Voldemort (LinkedIn), BerkeleyDB (Oracle).

Document databases [7]: MongoDB (BSON), Dynamo (Amazon), Couchbase, CouchDB (JSON).

Wide column stores [16]: Bigtable (Google), Hypertable, Cassandra (Facebook; used by Digg, Twitter), SimpleDB (Amazon), DynamoDB (Amazon), HBase (by Apache).

Graph databases [11]: Neo4j, InfoGrid, Sones GraphDB, AllegroGraph, InfiniteGraph.

Not all of these databases are suitable for storing analytics events. For example, Memcached and Redis are primarily in-memory stores, which does not fit the required persistence and capacity. Graph databases also seem unsuitable, since event storage has no strong need for relationships and therefore does not fit the purpose of these databases. Some databases do not scale to multiple servers and/or data centers out of the box (as required by R3). Examples are MySQL (needs MySQL Cluster [13]), SQLite (limited to a single file) and PostgreSQL (needs something like CitusDB [4], as used by CloudFlare [5]).

B.2.3 Analysis

Now that the events are stored, they can be analyzed. There are two main aspects of event processing: real-time processing as the events come in, and processing of historic data. For real-time processing it makes sense to put a filter at the event storage intake, which forwards relevant events to an application that processes them. This could, for example, be used to monitor the number of errors coming from the web server and to send an email notification if there are more than X per hour. This system implements R4.
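The error-monitoring example can be sketched as a small sliding-window counter. This is only an illustration under stated assumptions: events are dictionaries with hypothetical `source`, `name` and `time` keys, they arrive in time order, and the `notify` callback stands in for sending an email.

```python
from collections import deque
from datetime import datetime, timedelta, timezone


class ErrorRateMonitor:
    """Real-time filter: notifies when more than `threshold` web server
    errors are seen within `window` (one hour by default)."""

    def __init__(self, threshold, window=timedelta(hours=1), notify=print):
        self.threshold = threshold
        self.window = window
        self.notify = notify  # placeholder for an email sender
        self.alerting = False
        self._error_times = deque()

    def accepts(self, event):
        # The filter at the intake: only web server errors are relevant.
        return event.get("source") == "web-server" and event.get("name") == "error"

    def handle(self, event):
        if not self.accepts(event):
            return
        now = event["time"]
        self._error_times.append(now)
        # Drop errors that have fallen out of the time window.
        while now - self._error_times[0] >= self.window:
            self._error_times.popleft()
        over_limit = len(self._error_times) > self.threshold
        # Notify only once when the limit is first exceeded.
        if over_limit and not self.alerting:
            self.notify(f"{len(self._error_times)} web server errors within {self.window}")
        self.alerting = over_limit


alerts = []
monitor = ErrorRateMonitor(threshold=3, notify=alerts.append)
start = datetime(2015, 6, 1, 12, 0, tzinfo=timezone.utc)

# Four errors within one hour exceed the threshold of 3.
for minutes in (0, 10, 20, 30):
    monitor.handle({"source": "web-server", "name": "error",
                    "time": start + timedelta(minutes=minutes)})
# Events that do not pass the filter are ignored.
monitor.handle({"source": "web-server", "name": "page_view",
                "time": start + timedelta(minutes=35)})
```

Keeping only the timestamps of recent errors means the monitor needs little memory, which matters when the filter sits directly at the storage intake.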

Historic processing could consist of monthly app usage reports, long-term usage statistics or results of A/B testing. This can be organized by searching through all events within a certain time period (assuming events can easily be filtered by time, which makes sense for analytics systems). This processing does not need to be very quick, but it does need to scale, for example to be able to generate yearly reports. This system implements R5.
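A historic scan over a time period can be sketched as follows. The sketch assumes the events are held in a list sorted by time, so that a binary search can select the period; a real event store would use its own time index instead. The event keys and names are hypothetical.

```python
from bisect import bisect_left
from collections import Counter
from datetime import datetime, timezone


def events_between(events, start, end):
    """Return the events with start <= time < end.
    Assumes `events` is sorted by its "time" key."""
    times = [event["time"] for event in events]
    return events[bisect_left(times, start):bisect_left(times, end)]


def monthly_counts(events, name):
    """Count the events with the given name per calendar month."""
    counts = Counter()
    for event in events:
        if event["name"] == name:
            counts[event["time"].strftime("%Y-%m")] += 1
    return dict(counts)


def utc(year, month, day):
    return datetime(year, month, day, tzinfo=timezone.utc)


stored_events = [
    {"name": "app_start", "time": utc(2015, 1, 15)},
    {"name": "app_start", "time": utc(2015, 1, 20)},
    {"name": "app_start", "time": utc(2015, 2, 3)},
    {"name": "screen_view", "time": utc(2015, 2, 10)},
    {"name": "app_start", "time": utc(2015, 3, 1)},
    {"name": "app_start", "time": utc(2016, 1, 5)},
]

# Yearly report for 2015, broken down per month.
report = monthly_counts(events_between(stored_events, utc(2015, 1, 1), utc(2016, 1, 1)),
                        "app_start")
```

Because the scan only touches events inside the requested period, the same functions serve monthly and yearly reports; only the size of the selected range changes.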

In this project the focus will be on historic data processing, to limit the scope of the project. Real-time processing is kept in mind as a future feature in the design of the system, but will not be tested or benchmarked.

C Staying analytics solution

This appendix describes the analytics solution Staying currently uses, to give an idea of the starting situation at Staying.

C.1 Architecture

Figure 18: Staying analytics overview

First an event is generated in one of the applications, as seen at the top of Figure 18. An event is a piece of JSON data containing information about what happened. The website backend events are generated by the web server when a web page is requested; the website frontend events are generated in JavaScript on the device of the visitor. The portal events are generated in JavaScript and bundled together before they are sent to the backend, which unpacks them and sends them to the event collection server. There are two apps: one used by guests to see their stays and use the chat, and one for the venue owners to answer the chat messages. These apps are deployed on Android and iOS, and both send bundles of events to the backend, like the portal does. The events are sent to events.staying.nl, where they are collected and stored in a database.
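The bundling and unpacking step can be illustrated with a short Python sketch. The bundle layout (a JSON object with an `events` list), the event fields and the `forward` callback are assumptions made for illustration; the actual Staying format and transport are not specified here.

```python
import json


def bundle_events(events):
    """Client side (portal or app): pack several events into one JSON message."""
    return json.dumps({"events": events})


def unpack_and_forward(bundle_json, forward):
    """Backend side: unpack a bundle and pass each event on individually.
    `forward` stands in for sending the event to the collection server."""
    bundle = json.loads(bundle_json)
    for event in bundle["events"]:
        forward(event)
    return len(bundle["events"])


# Hypothetical events queued by the guest app before sending.
queued = [
    {"source": "guest-app", "name": "app_start", "time": "2015-06-01T12:00:00Z"},
    {"source": "guest-app", "name": "chat_opened", "time": "2015-06-01T12:00:05Z"},
]

collected = []
forwarded_count = unpack_and_forward(bundle_events(queued), collected.append)
```

Bundling reduces the number of requests a device has to make, while the collection server still receives one event at a time.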
