Business Intelligence and Process Modelling
F.W. Takes Universiteit Leiden
Where are we?
Business Intelligence: anything that aims at providing actionable information that can be used to support business decision making
Business Analysis, Business Analytics
Visual Analytics, Descriptive Analytics, Predictive Analytics
Network Intelligence: Network Science in a BI context
Process Modelling
Data
→
Network Science (recap)
Data, Data Analysis, Data Mining, Data Science, Big Data
Network science: analyzing “big” structured data consisting of objects connected via certain relationships, in short: networks
Interest from: mathematics, computer science, physics, biology, public administration, social sciences, . . .
Notation (recap)
Concept                     Symbol
Network (graph)             G = (V, E)
Objects (nodes/vertices)    V
Relations (links/edges)     E
  Directed                  E ⊆ V × V
  Undirected
Number of nodes |V|         n
Number of edges |E|         m
Small World Networks (recap)
1. Sparse networks (density)
2. Fat-tailed power-law degree distribution (degree)
3. Giant component (components)
4. Low pairwise node-to-node distances (distance)
Many real-world networks: communication networks, citation networks, collaboration networks (Erdős, Kevin Bacon), protein interaction networks, information networks (Wikipedia), webgraphs, financial networks (Bitcoin), ...
Topics
Graph Representation and Structure
Paths and Distances
Graph Evolution, Link Prediction ←
Spidering and Sampling
Centrality
Visualization Algorithms and Tools
Graph Compression
Community Detection
Contagion, Gossipping and Virality
Privacy, Anonymity and Ethics
Network evolution
Graphs evolve over time
Social networks: users join the network and create new friendships
Webgraphs: new pages and links to pages appear on the internet
Scientific networks: new papers are being co-authored and new citations are made in these papers
Interesting: small world properties emerge and are preserved during evolution!
Evolving graphs
Graph G_t = (V_t, E_t)
Time window 0 ≤ t ≤ T − 1
Usually at t = 0, either
V_0 = ∅ and a new edge may bring new nodes, or
V_0 = V_{T−1} and only edges are added at each timestamp
Timestamp on node v ∈ V:
t(v) ∈ [0, T − 1]
Timestamp on edge e ∈ E:
t(e) ∈ [0, T − 1], or as common input format:
e = (u, v, t(u, v)) with u, v ∈ V and t(u, v) ∈ [0, T − 1]
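A minimal sketch of how a snapshot of an evolving graph could be built from the (u, v, t) input format. It assumes the first convention above (the node set starts empty and nodes appear with their first edge); the function name `snapshot` and the toy edge list are hypothetical.

```python
def snapshot(edges, t):
    """Return (V_t, E_t): nodes and edges whose timestamp is at most t."""
    nodes, links = set(), set()
    for u, v, ts in edges:
        if ts <= t:
            nodes.update((u, v))   # a new edge may bring new nodes
            links.add((u, v))
    return nodes, links

# Hypothetical toy input: one (u, v, t(u, v)) triple per edge
edges = [("a", "b", 0), ("b", "c", 1), ("a", "c", 3), ("c", "d", 3)]

V1, E1 = snapshot(edges, 1)
V3, E3 = snapshot(edges, 3)
new_links = E3 - E1   # edges that appeared after t = 1
```

The set difference of two snapshots gives exactly the edges that formed between the two timestamps.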
LIACS collaboration network (v2012)
Two schools
Synthetic graphs (model-driven)
Model or algorithm to generate graphs from scratch
Tune parameters to obtain a graph similar to an observed network
Statistical analysis
Real-world graphs (data-driven)
Obtain data from an actual network
Compute and derive properties and determine similarity with other networks
Apple collaboration network
http://www.kenedict.com/apples-internal-innovation-network-unraveled/
Link prediction
Link prediction problem: given a network G_t = (V_t, E_t), denoting the network at time t, predict the newly formed links in the evolved network G_{t'} = (V_{t'}, E_{t'}) at time t' > t, i.e., predict the contents of E_{t'} \ E_t.
Applicable to weighted and unweighted, directed and undirected networks
Supervised learning problem
Features based on the structure of the network
Train on first 95%, test on last 5% (randomized)
Validate result using AUROC
J.E. van Engelen, H.D. Boekhout and F.W. Takes, Explainable and Efficient Link Prediction in Real-World Networks (working paper), 2016.
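As an illustration of the validation step, AUROC can be read as the probability that a randomly chosen true link receives a higher score than a randomly chosen non-link (ties count as half). The sketch below computes it by brute force over all pairs; the function name `auroc` and the toy scores are assumptions, and this is not the evaluation code of the cited working paper.

```python
def auroc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > q else 0.5 if p == q else 0.0
        for p in pos
        for q in neg
    )
    return wins / (len(pos) * len(neg))

# Hypothetical classifier scores for four candidate edges (1 = link formed)
scores = [0.9, 0.8, 0.3, 0.2]
labels = [1, 0, 1, 0]
result = auroc(scores, labels)
```

The pairwise loop is quadratic, which is fine for a toy example; for real candidate sets a rank-based routine such as scikit-learn's `roc_auc_score` would be the usual choice.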
Feature set goals
efficient in terms of time complexity;
accurate in its future link predictions;
explainable in its performance based on simple features;
consistent in its accuracy relative to larger feature sets across networks;
Link prediction features
Compute features for each possible future edge (i, j) ∉ E_t
Node features: degree, volume (total weight)
Neighborhood features: neighbor count, common neighbor count, transitive common neighborhood, Jaccard coefficient, preferential attachment, and others
Path features: shortest path length, number of shortest paths, restricted Katz measure, and others
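A sketch of some of the neighbourhood features for one candidate edge (i, j), assuming an undirected, unweighted graph stored as a dictionary of neighbour sets. The formulas are the standard textbook definitions (Adamic/Adar is included as one of the "others") and may differ in detail from the variants used in the cited working paper; the function name and toy graph are hypothetical.

```python
import math

def neighbourhood_features(adj, i, j):
    """Standard neighbourhood-based scores for the candidate edge (i, j)."""
    ni, nj = adj[i], adj[j]
    common = ni & nj
    union = ni | nj
    return {
        "common_neighbours": len(common),
        "jaccard": len(common) / len(union) if union else 0.0,
        "preferential_attachment": len(ni) * len(nj),
        # Adamic/Adar: common neighbours weighted by 1 / log(degree)
        "adamic_adar": sum(
            1.0 / math.log(len(adj[z])) for z in common if len(adj[z]) > 1
        ),
    }

# Hypothetical toy graph: edges a-b, a-c, b-c, b-d, c-d
adj = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d"},
    "d": {"b", "c"},
}
features = neighbourhood_features(adj, "a", "d")
```

For the non-edge (a, d) both endpoints have the same two neighbours, so the common neighbour count is 2 and the Jaccard coefficient is 1.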
Efficient Feature Set (EFS)
Large number of features
Black box type of approach
Cover individual, local and global properties
Explainable result
Node features
Feature          Variant  Complexity  EFS
Degree (source)  -        O(1)        X
Degree (source)  din      O(1)
Degree (source)  dout     O(1)
Degree (target)  -        O(1)        X
Degree (target)  din      O(1)
Degree (target)  dout     O(1)
Volume (source)  -        O(m/n)
Volume (source)  din      O(m/n)      X
Volume (source)  dout     O(m/n)      X
Volume (target)  -        O(m/n)
Volume (target)  din      O(m/n)      X
Volume (target)  dout     O(m/n)      X
Neighbourhood features
Feature                  Variant  Complexity  EFS
Total neighbours         -        O(m/n)
Total neighbours         Γin      O(m/n)
Total neighbours         Γout     O(m/n)
Common neighbours        -        O(m/n)      X
Common neighbours        Γin      O(m/n)
Common neighbours        Γout     O(m/n)
Transitive comm. neigh.  -        O(m/n)
Jaccard Coeff.           -        O(m/n)      X
Jaccard Coeff.           Γin      O(m/n)
Jaccard Coeff.           Γout     O(m/n)
Transitive Jacc. Coeff.  -        O(m/n)      X
Adamic/Adar              -        O(m/n)
Preferential attachment  -        O(1)
Preferential attachment  Γin      O(1)
Preferential attachment  Γout     O(1)
Opposite direction link  -        O(1)        X
Path features
Feature                  Variant               Complexity  EFS
Shortest path length     -                     O(m+n)      X
Num. shortest paths      ℓmax = 3              O(m+n)
Restricted Katz measure  ℓmax = 3, β = 0.05    O(m+n)
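A sketch of the three path features on an unweighted, undirected toy graph. Shortest path length and count come from a single BFS; the restricted Katz score is taken here in its usual form, the sum over lengths 1..ℓmax of β^length times the number of walks of that length, with ℓmax = 3 and β = 0.05 as in the table. The exact definitions in the cited working paper may differ, and the function names and graph are hypothetical.

```python
from collections import deque

def shortest_paths(adj, source, target):
    """BFS returning (length, number of shortest paths), or (None, 0)."""
    dist = {source: 0}
    count = {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:                 # first time w is reached
                dist[w] = dist[u] + 1
                count[w] = count[u]
                queue.append(w)
            elif dist[w] == dist[u] + 1:      # another shortest route to w
                count[w] += count[u]
    return dist.get(target), count.get(target, 0)

def restricted_katz(adj, source, target, l_max=3, beta=0.05):
    """Sum of beta**length * (number of walks of that length), length <= l_max."""
    walks = {source: 1}                       # walks of the current length, per end node
    score = 0.0
    for length in range(1, l_max + 1):
        extended = {}
        for u, c in walks.items():
            for w in adj[u]:
                extended[w] = extended.get(w, 0) + c
        walks = extended
        score += beta ** length * walks.get(target, 0)
    return score

# Hypothetical toy graph: edges a-b, a-c, b-c, b-d, c-d
adj = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d"},
    "d": {"b", "c"},
}
length, n_paths = shortest_paths(adj, "a", "d")
katz = restricted_katz(adj, "a", "d")
```

Capping the walk length keeps the Katz computation local to the neighbourhood of the source node instead of requiring a matrix inversion over the whole graph.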
Datasets
Table: Characteristics of network data sets used for testing
Data set    Nodes    Links    CC    Type  Dist  3N
digg        30,398   86,404   0.01  + D   4.68  45%
fb-links    63,731   817,035  0.22  - U   4.31  88%
fb-wall     46,952   274,086  0.11  + D   5.71  61%
infectious  410      2,765    0.46  + U   3.57  83%
liacs       1,036    4,650    0.84  + U   3.86  100%
lkml-reply  27,927   242,976  0.30  + D   5.19  99%
slashdot    51,083   131,175  0.02  + D   4.59  75%
topology    34,761   107,720  0.29  + U   3.78  97%
ucsocial    1,899    20,296   0.11  + D   3.07  99%
wikipedia   100,312  746,114  0.21  - D   3.83  89%
Experiments
Large candidate set of size |V| × (|V| − 1) − |E|
Restrict based on maximum distance a new edge bridges
Class imbalance
Randomly leave out edges in training to get to 9 : 1 ratio
Measure result using AUROC
Determine difference between All features, Node features, Neighborhood features and EFS
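To get a feeling for the numbers, the sketch below evaluates the candidate set size n(n − 1) − m for the directed ucsocial network (1,899 nodes, 20,296 links) and shows one possible reading of the 9 : 1 rebalancing step: randomly keep at most nine non-links per link in the training data. Which class is the "9" is an assumption here, as are the function names and the fixed seed.

```python
import random

def candidate_count(n, m):
    """Number of ordered node pairs that are not yet an edge."""
    return n * (n - 1) - m

def downsample(negatives, positives, ratio=9, seed=0):
    """Randomly keep at most ratio * len(positives) negative examples."""
    rng = random.Random(seed)            # fixed seed only for reproducibility
    keep = min(len(negatives), ratio * len(positives))
    return rng.sample(negatives, keep)

# ucsocial: 1,899 nodes and 20,296 directed links
ucsocial_candidates = candidate_count(1899, 20296)

# Hypothetical toy training data: 100 non-links, 3 links
kept = downsample(list(range(100)), ["e1", "e2", "e3"])
```

Even for this small network the candidate set holds millions of pairs, which is why it is restricted by distance before features are computed.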
Results
Features         digg   fb-links  fb-wall  infectious  liacs  lkml   slashdot  topology  ucsocial  wikipedia
All              0.830  0.933     0.887    0.967       0.997  0.975  0.928     0.967     0.913     0.970
Node             0.827  0.700     0.710    0.955       0.969  0.971  0.922     0.949     0.911     0.941
Neighbourhood    0.761  0.911     0.866    0.794       0.986  0.974  0.920     0.961     0.920     0.926
Path             0.632  0.897     0.819    0.579       0.979  0.925  0.777     0.940     0.673     0.827
EFS              0.825  0.930     0.876    0.958       0.995  0.973  0.921     0.965     0.910     0.967
EFS Performance  99.4%  99.6%     98.8%    99.1%       99.8%  99.8%  99.2%     99.8%     99.7%     99.7%
Table: AUROC for each network and each set of features. EFS Performance lists performance of EFS relative to All features.
Conclusions
Network science treats data as an annotated set of objects and relationships
The structure of the network provides new insights in the data
Centrality measures are able to identify prominent actors in the network solely based on its structure
Community detection algorithms reveal groups and clusters based on the network structure
Process Modelling
Recap
Business Intelligence
Process Modelling
Business process modelling
Modelling languages
Process discovery
Business Process Management (recap)
Process: a set of related actions and transactions to achieve a certain objective
Business process: a sequence of activities aimed at producing something of value for the business (Morgan02)
Management processes Operational processes Supporting processes
Business Process Management: the discipline that combines knowledge from information technology and knowledge from management sciences and applies this to operational business processes (v.d. Aalst)
Extension of WorkFlow Management (WFM)
Business Process Modelling (recap)
Business Process Model: abstract representation of business processes, functionality is:
Descriptive: what is actually happening?
Prescriptive: what should be happening?
Explanatory: why is the process designed this way?
In practice: formalize and visualize business processes
Process Discovery: derive the process from a description of activities
Process Mining: the task of converting event data into process models (discovery, conformance, enhancement)
Why Model Processes? (recap)
Process Mining (recap)
Business Process. . . Intelligence?
M. Castellanos et al., Business process intelligence, Handbook
Process Modelling
Informal models: used for discussion and documentation (process descriptions)
Formal models: used for analysis or enactment
Petri Nets (PN): today
Business Process Model and Notation (BPMN): later
Petri Nets
Event logs (1)
Case ID  Event ID  dd-mm-yyyy:hh.mm  Activity            Resource  Costs
1        35654423  30-12-2010:11.02  register request    Pete      50
1        35654424  31-12-2010:10.06  examine thoroughly  Sue       400
1        35654425  05-01-2011:15.12  check ticket        Mike      100
1        35654426  06-01-2011:11.18  decide              Sara      200
1        35654427  07-01-2011:14.24  reject request      Pete      200
2        35654483  30-12-2010:11.32  register request    Mike      50
2        35654485  30-12-2010:12.12  check ticket        Mike      100
2        35654487  30-12-2010:14.16  examine casually    Sean      400
2        35654488  05-01-2011:11.22  decide              Sara      200
2        35654489  08-01-2011:12.05  pay compensation    Ellen     200
3        35654521  30-12-2010:14.32  register request    Pete      50
3        35654522  30-12-2010:15.06  examine casually    Mike      400
3        35654524  30-12-2010:16.34  check ticket        Ellen     100
3        35654525  06-01-2011:09.18  decide              Sara      200
3        35654526  06-01-2011:12.18  reinitiate request  Sara      200
3        35654527  06-01-2011:13.06  examine thoroughly  Sean      400
3        35654530  08-01-2011:11.43  check ticket        Pete      100
3        35654531  09-01-2011:09.55  decide              Sara      200
3        35654533  15-01-2011:10.45  pay compensation    Ellen     200
4        35654641  06-01-2011:15.02  register request    Pete      50
4        35654643  07-01-2011:12.06  check ticket        Mike      100
4        35654644  08-01-2011:14.43  examine thoroughly  Sean      400
4        35654645  09-01-2011:12.02  decide              Sara      200
4        35654647  12-01-2011:15.44  reject request      Ellen     200
...
Event logs (2)
Case ID  Event ID  dd-mm-yyyy:hh.mm  Activity            Resource  Costs
...
5        35654711  06-01-2011:09.02  register request    Ellen     50
5        35654712  07-01-2011:10.16  examine casually    Mike      400
5        35654714  08-01-2011:11.22  check ticket        Pete      100
5        35654715  10-01-2011:13.28  decide              Sara      200
5        35654716  11-01-2011:16.18  reinitiate request  Sara      200
5        35654718  14-01-2011:14.33  check ticket        Ellen     100
5        35654719  16-01-2011:15.50  examine casually    Mike      400
5        35654720  19-01-2011:11.18  decide              Sara      200
5        35654721  20-01-2011:12.48  reinitiate request  Sara      200
5        35654722  21-01-2011:09.06  examine casually    Sue       400
5        35654724  21-01-2011:11.34  check ticket        Pete      100
5        35654725  23-01-2011:13.12  decide              Sara      200
5        35654726  24-01-2011:14.56  reject request      Mike      200
6        35654871  06-01-2011:15.02  register request    Mike      50
6        35654873  06-01-2011:16.06  examine casually    Ellen     400
6        35654874  07-01-2011:16.22  check ticket        Mike      100
6        35654875  07-01-2011:16.52  decide              Sara      200
6        35654877  16-01-2011:11.47  pay compensation    Mike      200
Table: Event logs of a support desk handling customer compensations
Simplified event log
Case ID  Trace
1        ⟨a, b, d, e, h⟩
2        ⟨a, d, c, e, g⟩
3        ⟨a, c, d, e, f, b, d, e, g⟩
4        ⟨a, d, b, e, h⟩
5        ⟨a, c, d, e, f, d, c, e, f, c, d, e, h⟩
6        ⟨a, c, d, e, g⟩
Table: Simplified event log of a support desk handling customer compensations (a = register request, b = examine thoroughly, c = examine casually, d = check ticket, e = decide, f = reinitiate request, g = pay compensation, h = reject request)
In short: {⟨a, b, d, e, h⟩, ⟨a, d, c, e, g⟩, ⟨a, c, d, e, f, b, d, e, g⟩, ⟨a, d, b, e, h⟩, ⟨a, c, d, e, f, d, c, e, f, c, d, e, h⟩, ⟨a, c, d, e, g⟩}
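The step from the full event log to the simplified one can be sketched as follows: group events by case, order them by timestamp, and replace each activity by its letter. The rows below are cases 1 and 2 from the event log table; the function name `to_traces` and the tuple layout are assumptions made for this illustration.

```python
from collections import defaultdict
from datetime import datetime

# Activity-to-letter mapping from the simplified event log caption
LABELS = {
    "register request": "a", "examine thoroughly": "b",
    "examine casually": "c", "check ticket": "d", "decide": "e",
    "reinitiate request": "f", "pay compensation": "g",
    "reject request": "h",
}

# (case id, timestamp, activity) for cases 1 and 2 of the event log
events = [
    (1, "30-12-2010:11.02", "register request"),
    (1, "31-12-2010:10.06", "examine thoroughly"),
    (1, "05-01-2011:15.12", "check ticket"),
    (1, "06-01-2011:11.18", "decide"),
    (1, "07-01-2011:14.24", "reject request"),
    (2, "30-12-2010:11.32", "register request"),
    (2, "30-12-2010:12.12", "check ticket"),
    (2, "30-12-2010:14.16", "examine casually"),
    (2, "05-01-2011:11.22", "decide"),
    (2, "08-01-2011:12.05", "pay compensation"),
]

def to_traces(events):
    """Map each case id to its activity letters in timestamp order."""
    by_case = defaultdict(list)
    for case_id, stamp, activity in events:
        when = datetime.strptime(stamp, "%d-%m-%Y:%H.%M")
        by_case[case_id].append((when, LABELS[activity]))
    return {
        case: [label for _, label in sorted(rows)]
        for case, rows in by_case.items()
    }

traces = to_traces(events)
```

Resource and cost columns are dropped on purpose: the simplified log only keeps the control-flow perspective.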
Example (1)
Case ID  Trace
1        ⟨a, b, d, e, h⟩
2        ⟨a, d, c, e, g⟩
3        ⟨a, c, d, e, f, b, d, e, g⟩
4        ⟨a, d, b, e, h⟩
5        ⟨a, c, d, e, f, d, c, e, f, c, d, e, h⟩
6        ⟨a, c, d, e, g⟩
Example (2)
Figure: Petri net based on event log {⟨a, b, d, e, h⟩, ⟨a, d, b, e, h⟩}
Play out
Replay
Connecting models to real events is crucial
Possible uses
Conformance checking
Repairing models
Extending the model with frequencies and temporal information
Constructing predictive models
Operational support (prediction, recommendation, etc.)
Automata (remember?)
Finite automaton FA = (Q, Σ, q_0, A, δ)
Q is a finite set of states
Σ is a finite alphabet of input symbols
q_0 ∈ Q is the initial state
A ⊆ Q is the set of accepting states
δ: Q × Σ → Q is the transition function
Figure: Deterministic Finite Automaton for the function x mod 3
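One way to write down a DFA for x mod 3 as the five-tuple above. The figure itself is not reproduced here, so the details are assumptions: x is read as a binary string, most significant bit first, each state is the remainder so far, and the accepting set is taken to be {0} (x divisible by 3).

```python
Q = {0, 1, 2}            # states: remainder of the bits read so far
SIGMA = {"0", "1"}       # input alphabet (binary digits, an assumption)
Q0 = 0                   # initial state
A = {0}                  # accepting states: x mod 3 == 0 (an assumption)

# Transition function: appending bit b turns value v into 2*v + b
DELTA = {(q, b): (2 * q + int(b)) % 3 for q in Q for b in SIGMA}

def run(word):
    """Return the state reached after reading the whole word."""
    state = Q0
    for symbol in word:
        state = DELTA[(state, symbol)]
    return state

def accepts(word):
    return run(word) in A

remainder_of_5 = run("101")     # 5 in binary
six_accepted = accepts("110")   # 6 in binary
```

Because the state after reading the input equals x mod 3, the same automaton answers both "what is the remainder?" and "is x divisible by 3?".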
Petri Nets
Petri net N = (P, T, F)
P is a finite set of places
T is a finite set of transitions
F ⊆ (P × T) ∪ (T × P) is a finite set of directed arcs called the flow relation
Labeled Petri Nets
Petri net N = (P, T, F, A, ℓ)
P is a finite set of places
T is a finite set of transitions
F ⊆ (P × T) ∪ (T × P) is a finite set of directed arcs called the flow relation
A is a set of activity labels
ℓ: T → A is a labeling function
Enabling
A transition is enabled if each of its input places contains at least one token
Firing
An enabled transition can fire (i.e., it occurs), consuming a token from each input place and producing a token for each output place
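The enabling and firing rules can be sketched directly on a marking, stored here as a token count per place. The transition used in the example (one input place p1, two output places p2 and p3) is hypothetical, as are the function names.

```python
from collections import Counter

def is_enabled(marking, pre):
    """Enabled if every input place holds at least one token."""
    return all(marking[p] >= 1 for p in pre)

def fire(marking, pre, post):
    """Consume one token per input place, produce one per output place."""
    if not is_enabled(marking, pre):
        raise ValueError("transition is not enabled")
    new = Counter(marking)       # copy, so firing has no side effects
    for p in pre:
        new[p] -= 1
    for p in post:
        new[p] += 1
    return new

# Hypothetical transition with input place p1 and output places p2, p3
marking = Counter({"p1": 1})
after = fire(marking, pre=["p1"], post=["p2", "p3"])
total_before = sum(marking.values())
total_after = sum(after.values())
```

With one input place and two output places, the total number of tokens grows from 1 to 2, and the transition is no longer enabled afterwards.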
Petri Nets
Connections are directed
No connections between two places or two transitions
Places may hold zero or more tokens
At most one arc between nodes (for now)
Firing is atomic
Multiple transitions may be enabled, but only one fires at a time
During execution, the number of tokens may vary if there are transitions for which the number of input places is not equal to the number of output places
The network is static
Example (1)
Petri net for a traffic light
States: red, orange and green
Transitions from red to green, green to orange, and orange to red
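A sketch of the traffic light as a Petri net with three places and three transitions. The transition names and the initial marking (one token in red) are assumptions; the slide only specifies the states and the order of the transitions.

```python
# transition name -> (input places, output places); names are hypothetical
TRANSITIONS = {
    "red_to_green": (["red"], ["green"]),
    "green_to_orange": (["green"], ["orange"]),
    "orange_to_red": (["orange"], ["red"]),
}

def enabled(marking):
    """List the transitions whose input places all hold a token."""
    return [
        t for t, (pre, _) in TRANSITIONS.items()
        if all(marking.get(p, 0) >= 1 for p in pre)
    ]

def fire(marking, t):
    """Fire transition t and return the new marking."""
    pre, post = TRANSITIONS[t]
    new = dict(marking)
    for p in pre:
        new[p] -= 1
    for p in post:
        new[p] = new.get(p, 0) + 1
    return new

# Assumed initial marking: the light starts at red
marking = {"red": 1, "green": 0, "orange": 0}
sequence = []
for _ in range(3):
    t = enabled(marking)[0]      # exactly one transition is enabled here
    sequence.append(t)
    marking = fire(marking, t)
```

Every transition has one input and one output place, so the single token just moves around the cycle and the light is always in exactly one state.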
Example (2)
Petri net for 2 traffic lights
Example (3)
Petri net for
Lab session
Continue with Assignment 2
Do the pandas, scikit-learn and Algorithmia tutorials
Create features
Machine learning
Implement (a small part of) your data mining algorithm on Algorithmia, and add it to your dashboard
Write the (scientific!) report for the assignment
Start reading relevant book chapters
...
Credits
Lecture based on slides belonging to the course book
W. van der Aalst, Process Mining: Discovery, Conformance and