Summary - Predictive trend mining for social network analysis

This chapter has presented an overview of work related to the general concepts of KDD and DM, Association Rules and FPM, temporal and spatial DM, clustering and trend cluster analysis, social network mining, and visualisation in DM. The related work on DM techniques, such as FPM, provided several insights for the proposed module for identifying temporal frequent patterns and trends. TFP was selected as the foundation algorithm and was extended to suit the nature of sequences of social network data. As noted in many FPM experiments, large numbers of patterns are typically discovered, which tends to hinder the user's interpretation of DM results. The use of clustering techniques and visualisation tools is therefore proposed. Trend analysis is aimed at investigating the temporal changes that occur in collections of frequent pattern trends. In the work described in this thesis, prediction modelling is also proposed. The next chapter introduces the “social network” datasets used for evaluating the proposed framework in this thesis.

Chapter 3

Social Network Datasets

This chapter describes the “social network” datasets used for evaluating the algorithms in this thesis. The datasets were extracted from: (i) the GB cattle movement database, (ii) an insurance company (Deeside Insurance Ltd) customer database describing requests for insurance quotes, and (iii) the Malaysian Armed Forces logistic cargo distribution database. These datasets are exemplars of business community social networks, representing the entities that form part of the organisations' communities and the traffic/communication between these entities. As noted in Chapter 2, in this thesis the definition of the term “social network” is extended beyond the “tight” definition used by some authors, namely that social networks represent users of Internet sites such as Facebook and LinkedIn and the communication between those users. This thesis takes a much wider view: social networks may include business communities, file sharing systems, co-authoring frameworks and so on. The selected datasets consist of attributes which are viewed as network nodes (for example farms, customers and camps), while the movement, communication or traffic between these nodes is treated as the edges of the networks.
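The node and edge view described above can be illustrated with a minimal sketch. The record layout `(time stamp, sender, receiver)` and the names `records`, `build_snapshots` and `nodes_of` are assumptions made for illustration only; they are not the data schema used in the thesis, which is presented in Section 3.4.

```python
from collections import defaultdict

# Hypothetical traffic records: (time stamp, sender node, receiver node).
# The node names are invented for illustration.
records = [
    (1, "farm_A", "market_X"),
    (1, "farm_B", "market_X"),
    (2, "farm_A", "farm_C"),
]


def build_snapshots(records):
    """Group records into one set of (sender, receiver) edges per time stamp."""
    snapshots = defaultdict(set)
    for time_stamp, sender, receiver in records:
        snapshots[time_stamp].add((sender, receiver))
    return dict(snapshots)


def nodes_of(edges):
    """Return the set of nodes that appear in a collection of edges."""
    return {node for edge in edges for node in edge}


snapshots = build_snapshots(records)
```

Each entry of `snapshots` is then one network in the sequence, with the entities as nodes and the traffic between them as edges.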

This chapter introduces the three datasets used for the evaluation described in later chapters. Before the social network datasets could be used with the systems described in this thesis, they first had to be preprocessed and appropriately formatted. This chapter therefore also explains the discretisation and normalisation processes that were applied to the datasets to produce the required binary valued format.

With respect to the social networks used to evaluate the proposed mechanisms, two specific “types” of social network can be identified: star networks and complex star networks. The generic nature of these two types of network is indicated (in a stylised form) by the two “network snap shots” given in Figures 3.1 and 3.2. With reference to Figure 3.1, a star network is characterised by a single “star shape” with all nodes communicating with one super-node. Note that, as shown in the figure, not all network nodes will necessarily be communicating (linking) with the super-node at any given time stamp. The generic network snap shot given in Figure 3.2 is a more complex version of that given in Figure 3.1, and is thus referred to as a complex star network. The network is characterised by a number of disconnected “star” sub-networks of varying size. Again, not all network nodes (with respect to the snap shot time stamp) are necessarily communicating (linking) with any of the other nodes. Note also that some of the “star” sub-networks comprise only two nodes. In general the proposed PTMF may be applied to any type of social network, as the adopted “discretisation and normalisation” processes will serve to convert the datasets into a standard tabular format of the form required for input to the framework. As long as the network conforms to the formalism presented in Section 4.1 of the thesis, it does not matter what “type of network” is being considered. Additional material on discretisation and normalisation, with respect to the above, is included in Section 3.4.
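The distinction between the two network types can be made concrete with a short sketch. This is an illustration under stated assumptions, not a procedure from the thesis: a snap shot is assumed to be a list of `(sender, receiver)` pairs, edge direction is ignored, and a sub-network is treated as a “star” when it is connected, has one edge fewer than it has nodes, and contains one node linked to all the others. The function names are hypothetical.

```python
from collections import defaultdict


def components(edges):
    """Return the connected components of a snap shot and its adjacency map."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    seen, comps = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack, comp = [start], set()
        while stack:
            node = stack.pop()
            comp.add(node)
            for neighbour in adj[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        comps.append(comp)
    return comps, adj


def is_star(comp, adj):
    """A star has n - 1 undirected edges and one hub linked to every other node."""
    n = len(comp)
    edge_count = sum(len(adj[v]) for v in comp) // 2
    has_hub = any(len(adj[v]) == n - 1 for v in comp)
    return n >= 2 and edge_count == n - 1 and has_hub


def classify_snapshot(edges):
    """Label a snap shot as 'star', 'complex star' or 'other'."""
    comps, adj = components(edges)
    if not comps or not all(is_star(c, adj) for c in comps):
        return "other"
    return "star" if len(comps) == 1 else "complex star"
```

Nodes that are not communicating at the given time stamp do not appear in the edge list, so they are simply ignored, which matches the observation that not all nodes link to a super-node in every snap shot. A two-node sub-network passes the star test, as in Figure 3.2.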

Figure 3.1: (Stylised) Simple Star Network

Figure 3.2: (Stylised) Complex Star Network

The rest of this chapter is organised as follows. Section 3.1 describes the GB cattle movement dataset, Section 3.2 the insurance quotation dataset and Section 3.3 the logistic cargo distribution dataset. The discretisation and normalisation process is then presented in Section 3.4, where the data schema for the pre-processed datasets is also explained. Lastly, Section 3.5 briefly summarises this chapter.

3.1 GB Cattle Movement Database

The GB Cattle Tracing System (CTS) database records all the movements of cattle registered within or imported into Great Britain. The database is maintained by the Department for Environment, Food and Rural Affairs (DEFRA). Cattle movements can be “one-off” movements to final destinations, or movements between intermediate locations. Movement types include: (i) cattle imports, (ii) movements between locations, (iii) “movements” in terms of births and (iv) “movements” in terms of deaths. The CTS was introduced in September 1998, and updated in 2001 to support disease control activities. Currently (2012), the CTS database holds some 155 Gb of data.

The CTS database comprises a number of tables, the most significant of which are the animal, location and movement tables. For the analysis reported in the thesis, the data from 2003 to 2006 was extracted to form 4 episodes, each comprising 12 one-month time stamps, presented as a sequence of 48 “complex” networks. The data was stored in a single data warehouse such that each record represented a single cattle movement instance associated with a particular year (episode) and month (time stamp). The number of CTS records represented in each episode was about 400,000. Each record in the warehouse comprised: (i) a time stamp (month and year), (ii) the number of cattle moved, (iii) the breed, (iv) the sender's location in terms of easting and northing grid values, (v) the “type” of the sender's location, (vi) the receiver's location in terms of easting and northing grid values, (vii) the “type” of the receiver's location, and (viii) the sender's and receiver's Parish Testing Interval (PTI). If two different breeds of cattle were moved at the same time from the same sender location to the same receiver location, this would generate two records in the warehouse. The maximum number of cattle of the same breed moved between any pair of locations for a single time stamp was approximately 40 animals. The spatial magnitude of movement between farms or animal holding areas can be derived from the location grid values. The easting and northing values of sender and receiver locations were divided into k kilometre sub-ranges to produce k sized grid squares. Experiments using k = 50 and k = 100 were conducted; these are described in Chapter 5. The effect of this ranging was to sub-divide the geographic area covered by the CTS database into a grid of k × k kilometre squares. These grid squares were given unique ID numbers which were also recorded in the dataset.
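The episode and time stamp indexing, and the ranging of easting and northing values into grid squares, can be sketched as follows. This is a hedged illustration only: the thesis does not state how grid square IDs were assigned, so the row-major numbering, the assumption that coordinates are given in kilometres from a zero origin, and the names `sequence_index`, `grid_cell_id` and `cells_per_row` are all hypothetical.

```python
def episode_and_time_stamp(year, month, first_year=2003):
    """Map a calendar (year, month) to a 1-based (episode, time stamp) pair."""
    return year - first_year + 1, month


def sequence_index(year, month, first_year=2003):
    """1-based position in the sequence of monthly networks (1..48 for 2003-2006)."""
    return (year - first_year) * 12 + month


def grid_cell_id(easting_km, northing_km, k, cells_per_row):
    """Assign an ID to the k km grid square containing a location.

    Assumes kilometre coordinates measured from a zero origin and a
    row-major numbering with `cells_per_row` squares per row; both are
    illustrative choices, not the scheme used in the thesis.
    """
    col = int(easting_km // k)
    row = int(northing_km // k)
    return row * cells_per_row + col
```

For instance, under these assumptions a location at easting 120 km and northing 260 km falls in column 2, row 5 when k = 50, and in column 1, row 2 when k = 100, so a coarser k merges neighbouring squares into one.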
