Cyber Security Through Visualization
Kwan-Liu Ma
Department of Computer Science, University of California at Davis
Email: [email protected]
Networked computers are subject to attack, misuse, and abuse. Organizations and individuals are making every effort to build and maintain trustworthy computing systems. The main strategy is to closely monitor and inspect network activities by collecting and analyzing data about the network traffic and the trails of system usage. The analysis usually requires large amounts of finely detailed, high-dimensional data to enable analysts to uncover hidden threats and make calculated predictions in a timely fashion. Traditional signature-based and statistical methods are limited in their capability to cope with the large, evolving data and the dynamic nature of the Internet. Visualization has proven effective in aiding the understanding of large, high-dimensional data commonly found in many demanding applications such as large-scale scientific simulations and biomedicine. There is thus a growing interest in the development of visualization methods as alternative or complementary solutions to pressing cyber security problems (Brodley, Chan, Lippmann & Yurcik 2004, Ma, North & Yurcik 2005). The challenge is to develop new visual representations, layout methods, user interfaces, and interaction techniques that can effectively facilitate visual interrogation and communication of the vast amounts of cyber security information.
Visual data analysis is inherently an iterative process, where each iteration provides more insight into the data being shown. A typical example of this process occurs with any type of overview plus detail visualization. Patterns in the overview tend to direct what the user chooses to view in more detail, and the detailed view can provide insight on regions of the overview. This drill-down process, starting at a high semantic level and progressing to more detailed views, creates a feedback loop as shown in Figure 1, which lends itself well to visualizing the relationships between large numbers of objects, such as port and network scans. In most cases, different visual representations are needed for constructing these different views. In particular, each specific region of interest may be defined in a space of arbitrary dimensions. The challenge is thus to seek the best space, and the best visual representation in that space, for each type of analysis task. I show with three different tasks how visualization can assist in the analysis of computer network activities for detecting anomalies using the drill-down process.
Copyright © 2006, Australian Computer Society, Inc. This paper appeared at Asia Pacific Symposium on Information Visualization (APVIS 2006), Tokyo, Japan. Conferences in Research and Practice in Information Technology, Vol. 60. K. Misue, K. Sugiyama and J. Tanaka, Ed. Reproduction for academic, not-for-profit purposes permitted provided this text is included.
Figure 1: The process of visual data analysis is inherently iterative to see both summary and details in context.
Analysis of Internet Routing Data
The Internet can be considered as a set of sub-networks, each of which represents an organization’s network. The problem of packet routing can be simplified to routing data between these larger entities, referred to as Autonomous Systems, according to the Border Gateway Protocol (BGP). For data packets to arrive at the correct destination, these Autonomous Systems exchange network reachability information in the form of BGP path announcements. Studying and understanding the dynamics of BGP routing changes is thus crucial to ensuring robust network performance. A drill-down process of analysis (Teoh, Jankun-Kelly, Ma & Wu 2004) can start by looking at the aggregate information about routing changes over a complete period of time, followed by examining routing update messages over a selected period of time and their corresponding statistical values; next, particular instances of instability can be visualized in detail. Figure 2 shows a two-level aggregate data browser which displays the distribution of the BGP announcements over a 1-year period (bottom), and allows the analyst to select a focused period of several minutes. Figure 3 shows the color-coded text visualization of individual BGP path announcements as well as statistical measures of the routing updates during the focused period. This joint visualization allows for verification of the visual and statistical information for anomaly detection (Teoh, Zhang, Tseng, Ma & Wu 2004). After instability events are identified, it is possible to see the distribution of different events, their severity, duration, and frequency all in one single visualization (Teoh, Ma & Wu 2003), as shown in Figure 4.
Figure 2: A two-level aggregate data browser can cover a wide range of granularity. Bottom: At the overview level, the analyst first looks at the aggregate information of the entire time period (typically one year) and specifies a period to focus on. Top: At the next level, the focus can be further narrowed down to a period of several minutes.
Figure 3: Left: Text visualization of a sequence of Autonomous System (AS) path announcements. Each unique AS path is shown in a different color, which effectively forms visual patterns of the updates such as oscillations, repeats, or slow convergences. Right: Statistical measures corresponding to each announcement can be used to help verify any detected anomaly.
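The color coding of AS path announcements can be illustrated with a small sketch. It is only an illustration under stated assumptions, not the tool described in the cited work: the function names `color_index_paths` and `count_flips`, the tuple representation of an AS path, and the AS numbers in the example are all hypothetical. Each unique path receives the next free color index, and a crude A-B-A count hints at the oscillation pattern that the colors would make visible.

```python
from typing import Dict, List, Tuple

ASPath = Tuple[int, ...]  # assumption: an AS path is a tuple of AS numbers


def color_index_paths(announcements: List[ASPath]) -> List[int]:
    """Give each unique AS path a color index, in first-seen order."""
    palette: Dict[ASPath, int] = {}
    indices: List[int] = []
    for path in announcements:
        if path not in palette:
            palette[path] = len(palette)  # next unused color
        indices.append(palette[path])
    return indices


def count_flips(indices: List[int]) -> int:
    """Count A-B-A alternations, a rough indicator of route oscillation."""
    return sum(
        1
        for i in range(2, len(indices))
        if indices[i] == indices[i - 2] and indices[i] != indices[i - 1]
    )


# Hypothetical announcement sequence for one prefix (made-up AS numbers).
updates = [(1, 2, 3), (1, 4, 3), (1, 2, 3), (1, 4, 3), (1, 5, 3)]
colors = color_index_paths(updates)  # [0, 1, 0, 1, 2]
flips = count_flips(colors)          # 2
```

A run of alternating indices such as 0, 1, 0, 1 corresponds to the oscillation pattern mentioned in the Figure 3 caption; a long run of distinct indices before settling would correspond to slow convergence.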
Port Data Visualization
Scanning a network is a very common first step in a network intrusion attempt. Crackers frequently scan entire ranges of ports, looking for open ports that can be exploited to gain access to a system. Worms and viruses often target specific ports in an attempt to locate systems that are vulnerable to the mechanisms they use to spread. These attacks are all recorded in security logs, but these logs are time-consuming for administrators to analyze by hand. One way to understand the collected security logs is to produce images of network traffic by choosing axes that correspond to important features of the data, creating a grid based on these axes, and then assigning each cell of the grid a visual property such as color to represent the network activity there.
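The grid construction just described can be sketched in a few lines. This is a minimal illustration under assumptions, not the implementation behind the figures: the names `build_grid` and `to_gray`, the choice of hour of day and port bucket as axes, and the sample log records are hypothetical. Each record increments one cell, and cell counts are then scaled to a gray level.

```python
from typing import Iterable, List, Tuple


def build_grid(points: Iterable[Tuple[int, int]], width: int, height: int) -> List[List[int]]:
    """Count how many (x, y) points fall into each cell of a width x height grid."""
    grid = [[0] * width for _ in range(height)]
    for x, y in points:
        if 0 <= x < width and 0 <= y < height:  # ignore out-of-range records
            grid[y][x] += 1
    return grid


def to_gray(grid: List[List[int]]) -> List[List[int]]:
    """Scale counts linearly to gray levels 0..255 (0 = no activity)."""
    peak = max(max(row) for row in grid)
    if peak == 0:
        return [[0] * len(row) for row in grid]
    return [[round(255 * value / peak) for value in row] for row in grid]


# Hypothetical log records: (hour of day, destination port).
records = [(0, 80), (0, 80), (0, 40000), (1, 443), (1, 80), (1, 80), (1, 80), (1, 80)]
# Axes chosen for this example: x = hour, y = port bucket of 8192 ports each.
points = [(hour, port // 8192) for hour, port in records]
grid = build_grid(points, width=24, height=8)
image = to_gray(grid)
```

A linear gray ramp is the simplest possible mapping; any other visual property or transfer function could be substituted in `to_gray` without changing the binning step.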
Figure 4: Visualization of instability events for a particular IP prefix. Each event is represented by a circle and a base. The area of the circle is proportional to the number of announcements and each color segment indicates a specific type of instability. The triangular base shows the duration of the event. The position of each circle is placed to avoid occlusion. Nevertheless, tall events suggest that there are many events at that time.
The drill-down process of analyzing port data begins with an overview that presents a time-ordered view of the entire data set. The goal here is to choose a particular range of time to zoom into. In the following steps, detailed views are created to eventually reach an atomic unit of network security interest, which may represent a port scan, an intrusive attack, the activities of spyware, the flocking of employees to Web news sites after a major news event, or any other discrete feature that can be identified in the data. Figure 5 shows, from left to right, a 3-tier process for studying TCP port data (Muelder, Ma & Bartoletti 2005a), in which the leftmost image displays highly condensed port data over a period of time such as a week or a month. Each row of the visualization represents one unit (generally an hour) of time. Each pixel corresponds to a range of ports and is colored according to the level of activity on the ports during the time unit. Figure 6 shows that different levels of enhancement can bring out port scans; furthermore, using different data metrics can reveal different patterns in the data, leading to new discoveries (Muelder et al. 2005a). The grid visualization, shown in the middle of Figure 5, depicts the activity during a given time unit. It consists of a dot on the grid for each of the 65,536 ports. The user can select a port to see detailed information about that port and its surrounding ports, as shown in the upper right image in Figure 5. The bottom right picture shows a single port over the entire time range, for each of the metrics of interest including, from top to bottom, session count, destination count, source count, the ratio of source and destination, and country count. Such a visualization is useful for finding relationships between metrics, as well as showing periodic structures in the data such as the change in web traffic throughout the day. Figure 7 presents some distinct patterns of activity.
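The two coarser tiers described above, a timeline whose pixels each cover a range of ports and a grid with one dot per port, can be sketched as follows. The 256 x 256 layout, the function names, and the sample session records are assumptions made for illustration; the source only states that the grid holds one dot for each of the 65,536 ports.

```python
from typing import List, Tuple

PORT_COUNT = 65536
GRID_SIDE = 256  # assumption: 65,536 ports arranged as a 256 x 256 square


def port_to_cell(port: int) -> Tuple[int, int]:
    """Map a port number to a (row, column) cell of the per-port grid."""
    if not 0 <= port < PORT_COUNT:
        raise ValueError(f"port out of range: {port}")
    return divmod(port, GRID_SIDE)


def build_timeline(sessions: List[Tuple[int, int]], time_units: int,
                   pixels_per_row: int) -> List[List[int]]:
    """One row per time unit; each pixel counts sessions on a range of ports."""
    ports_per_pixel = -(-PORT_COUNT // pixels_per_row)  # ceiling division
    rows = [[0] * pixels_per_row for _ in range(time_units)]
    for unit, port in sessions:
        if 0 <= unit < time_units and 0 <= port < PORT_COUNT:
            rows[unit][port // ports_per_pixel] += 1
    return rows


# Hypothetical session records: (time unit index, destination port).
sessions = [(0, 80), (0, 81), (1, 46011)]
timeline = build_timeline(sessions, time_units=2, pixels_per_row=512)
cell = port_to_cell(27374)  # (106, 238)
```

With 512 pixels per row, each pixel aggregates 128 consecutive ports, so ports 80 and 81 share a pixel in the timeline but occupy separate cells in the per-port grid.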
Scan Characterization
In order to obfuscate an attack, an attacker frequently alters identifying features like source IP addresses. Thus, in order to identify an attacker, some more immutable aspect of the attack must be considered, such as packet arrival timing, which is dependent on several factors that are difficult to alter, such as hardware or operating system limitations or router delays. However, due to the chaotic nature of packet arrival times, one must analyze a large quantity of packets. Network scans provide a good source of such timing information. It is thus beneficial to take network scan timing information and use visualization techniques to characterize the scans.
Figure 5: A drill-down process for finding network scans, applied to a 24-hour-long dataset at 10-minute resolution. Starting at the timeline on the left, a spike is found on a high port that crosses several hours. One of these hours is then selected for viewing in the grid-based visualization shown in the middle. In it, there is exactly one port with unusually large values in the range of ports that correspond to the spike. The range around this port is zoomed into, which reveals in the bottom-right image that an abnormally large number of destinations are being connected to by a small number of sources, which means that this is likely a network scan.
An underlying premise of this approach to the network scan characterization problem is that humans are creatures of habit (and autonomous software applications even more so). Having created an environment of attack tools with which they have become familiar, they tend to reuse those tools and support systems in future activities. The particular settings employed would likely remain consistent as well, both out of familiarity and a desire to properly compare recent results with earlier findings. Additionally, other software processes that may be running concurrently in support of analysis or ancillary activities compete with and impact the performance of their network activities in reliably consistent ways, imposing uniquely identifiable characteristics in the sequence and packet arrival times of high packet count interactions.
Individual scans can be shown with a grid-based visualization, where the axes are the third and fourth octets of the destination IP addresses, and the color is based on a metric derived from timing information. Metrics range from simple ones, such as arrival time of the first probe or number of probes for each address, to complex ones, such as deviation from a linear expected arrival time for each destination. Patterns that are nearly identical in one metric can be distinct in others. Figure 8 shows such scan visualizations.
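Two of the metrics named above can be sketched as follows. This is an illustration under assumptions rather than the method of the cited work: the source does not define the "linear expected arrival time", so the sketch assumes a scan that sweeps addresses in numeric order at a constant rate between its first and last first-probe times. The function names and the probe records (using documentation-range addresses) are hypothetical.

```python
from typing import Dict, List, Tuple

Probe = Tuple[float, str]   # (arrival time in seconds, destination IPv4 address)
Cell = Tuple[int, int]      # (third octet, fourth octet)


def first_arrival_grid(probes: List[Probe]) -> Dict[Cell, float]:
    """Earliest arrival time per destination, keyed by third and fourth octets."""
    grid: Dict[Cell, float] = {}
    for arrival, address in probes:
        octets = address.split(".")
        cell = (int(octets[2]), int(octets[3]))
        if cell not in grid or arrival < grid[cell]:
            grid[cell] = arrival
    return grid


def linear_deviation(grid: Dict[Cell, float]) -> Dict[Cell, float]:
    """Observed first arrival minus the time a constant-rate sweep would predict.

    Assumption: addresses are visited in numeric order at a constant rate.
    """
    cells = sorted(grid)
    if len(cells) < 2:
        return {cell: 0.0 for cell in cells}
    start, end = min(grid.values()), max(grid.values())
    step = (end - start) / (len(cells) - 1)
    return {cell: grid[cell] - (start + i * step) for i, cell in enumerate(cells)}


# Hypothetical probes; the repeat visit to 192.0.2.2 is ignored by the first-arrival metric.
probes = [(0.0, "192.0.2.1"), (1.0, "192.0.2.2"), (2.5, "192.0.2.3"),
          (3.0, "192.0.2.4"), (5.0, "192.0.2.2")]
arrivals = first_arrival_grid(probes)
deviation = linear_deviation(arrivals)  # 192.0.2.3 arrives 0.5 s later than predicted
```

Either dictionary can be passed to a color mapping to produce an image like those in Figure 8; cells absent from the dictionary correspond to addresses that received no connection attempt.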
Figure 6: Visualization of the entire time range with enhancement to bring out port scans corresponding to a particular spike using the activity-level histogram and gradient editor shown on the bottom.
Figure 7: Plots of metrics versus time for individual ports. In each example, the session count (the first metric) is highlighted. The other four metrics shown are destination address count, source address count, unique source and destination pair count, and country count. The usage of Port 80 (upper left) is very periodic while Port 46011 (upper right) has a fairly constant level of activity, with a few spikes. Port 27374 (lower left) is more erratic, though, interestingly, its usage drops noticeably as time goes on. Port 34816 (lower right) has one of the most suspicious usage graphs; it is only used a few times, but it is used fairly heavily during those times.
Figure 8: Visualization of individual scans showing patterns that can easily be compared by eye. These images show the arrival time of the first connection attempt to each address, with blue being early in the scan, red being late in the scan, and black being an address that had no connection attempt.
The relationships between network scans may be understood by statistically comparing pairs of scans, and it is also possible to get a quantitative measure of how well they match. However, because the scans are too chaotic to compare directly, frequency analysis, such as Fourier transforms or wavelet transforms, can be used to convert the scan patterns into scalograms, which can then be systematically compared (Muelder, Ma & Bartoletti 2005b). Although network scan patterns can exhibit a periodic or quasi-periodic structure, they often contain gaps, aperiodic aberrations, and regions where the relative phase of the periodic structures has shifted, all of which Fourier analysis has been found to handle poorly. Wavelets, on the other hand, are relatively resistant to phase shifts and noise. Figure 9 shows that dissimilar patterns result in different scalograms.
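The idea of reducing a scan pattern to a compact wavelet summary and comparing summaries can be sketched as follows. This is a deliberately simplified stand-in, not the procedure of the cited work: the source does not say which wavelet family or comparison measure was used, so the Haar wavelet, the per-scale energy summary (coarser than a full scalogram), and cosine similarity are all assumptions, as are the function names and test signals.

```python
import math
from typing import List


def haar_scale_energies(signal: List[float]) -> List[float]:
    """Energy of the Haar detail coefficients at each scale, finest first.

    The signal length must be a power of two (at least 2).
    """
    n = len(signal)
    if n < 2 or n & (n - 1):
        raise ValueError("signal length must be a power of two")
    approx = list(signal)
    energies: List[float] = []
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        detail = [(a - b) / math.sqrt(2) for a, b in pairs]
        approx = [(a + b) / math.sqrt(2) for a, b in pairs]
        energies.append(sum(d * d for d in detail))
    return energies


def similarity(fp_a: List[float], fp_b: List[float]) -> float:
    """Cosine similarity of two energy summaries; lies in [0, 1] for energies."""
    dot = sum(a * b for a, b in zip(fp_a, fp_b))
    norm = math.sqrt(sum(a * a for a in fp_a)) * math.sqrt(sum(b * b for b in fp_b))
    return dot / norm if norm else 0.0


# Hypothetical metric sequences: a fast alternation, the same pattern shifted
# by one sample, and a slow step.
fast = [0, 1, 0, 1, 0, 1, 0, 1]
fast_shifted = [1, 0, 1, 0, 1, 0, 1, 0]
slow = [0, 0, 0, 0, 1, 1, 1, 1]

same = similarity(haar_scale_energies(fast), haar_scale_energies(fast_shifted))
different = similarity(haar_scale_energies(fast), haar_scale_energies(slow))
```

In this toy case the shifted copy scores close to 1 against the original, while the slow step, whose energy sits entirely at the coarsest scale, scores 0. Real scan data would call for a proper wavelet library and a two-dimensional scalogram rather than this per-scale summary.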
At this point, the scans can be directly visualized individually, but when dealing with large numbers of scans, this is infeasible. So, once the scans are isolated, fingerprints are generated so that the scans can be compared automatically, and these fingerprints are fed into an overview visualization. This overview of the relationships between the scans and the detailed view of individual pairs of scans for comparison purposes compose an overview plus detail feedback loop. As described before, the overview allows the analyst to drill down into certain areas by showing them in the detailed view.
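One simple way to organize such an overview, sketched here under assumptions, is to treat each scan as a node, link pairs of scans whose fingerprints are sufficiently similar, and read off the connected groups. The function name `cluster_scans`, the similarity scores, and the threshold value are hypothetical; the sketch assumes a pairwise score in [0, 1] has already been computed and says nothing about layout or rendering.

```python
from typing import Dict, List, Tuple


def cluster_scans(scan_count: int,
                  similarities: Dict[Tuple[int, int], float],
                  threshold: float) -> List[List[int]]:
    """Group scans connected by links whose similarity is at least `threshold`."""
    parent = list(range(scan_count))  # union-find forest, one tree per cluster

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving keeps trees shallow
            i = parent[i]
        return i

    for (a, b), weight in similarities.items():
        if weight >= threshold:            # weaker links are dropped
            parent[find(a)] = find(b)

    groups: Dict[int, List[int]] = {}
    for scan in range(scan_count):
        groups.setdefault(find(scan), []).append(scan)
    return sorted(groups.values())


# Hypothetical similarity scores between five scans.
scores = {(0, 1): 0.9, (1, 2): 0.8, (2, 3): 0.2, (3, 4): 0.95, (0, 4): 0.1}
clusters = cluster_scans(5, scores, threshold=0.5)  # [[0, 1, 2], [3, 4]]
```

Raising or lowering the threshold is the kind of adjustment an analyst might make after inspecting individual pairs in the detail view: a higher value splits loose groups, a lower one merges them.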
The overview can be provided by a graph visualization of the relationship between scans. Each node is a scan, and each edge represents the relationship between the two scans it connects. The edge weights are derived from the wavelet analysis of the scans, and they range from 0, which means the scans are completely different, to 1, which means they are identical. The graph can be displayed with a force-directed layout algorithm, and edges below a particular threshold may be dropped for clarity. The resulting image then shows clusters of scans that are similar, as shown in Figure 10. The user can start with such an overview graph and then drill down to detailed characteristics of the scans in the same cluster. In this way, a large number of scans can be rapidly compared and subsequently identified. Furthermore, it is very helpful to allow the analyst to bias the overview in a manner reflecting the cognitive insight gained from looking at the detail view (Muelder et al. 2005b). This enhances the feedback loop by allowing information gained by viewing the details of network scans to be reflected back in the overview.
Concluding Remarks
Visualization leverages the extraordinary human ability to detect patterns in images; nevertheless, by itself visualization does not answer all the questions the analyst has. Visualization is best for guiding a complex data analysis process since it is particularly good at showing an overview of the data, which can direct the analyst’s attention to the aspects of the data that require further investigation, as demonstrated by the three examples I have given. The ability to show details in context with visualization is also very powerful. To fight against the increasingly creative and malicious attacks on network security, I believe a promising approach is to add intelligent reasoning with learning capabilities into the visualization-directed analysis.
Figure 9: Wavelet scalograms reduce large, complex patterns to smaller, simpler vectors that can be compared. This example was made using a metric based on the number of visits per unique address, with a black-to-red gradient, where black is no probes and red is the maximum for that scan. Top: Similar scans have similar wavelet scalograms. Bottom: Dissimilar scans have very different scalograms.
Figure 10: Graph visualization of network scans. Clusters contain scans with a general pattern. A representative example from each selected cluster is shown.
Cyber security is a much broader topic than ensuring the normal operation of a computer network. The task of analyzing network security data represents only a small subset of the greater security problem that we are faced with today. For example, data collected for intelligence information analysis are more heterogeneous, including text, measurements from sensors, imagery, video and audio from diverse sources. What visual representations should we use to study such heterogeneous data in a unified manner? What would be the meaningful linkages among the disparate data to facilitate cross-exploration? For the field of information visualization, this new class of problems presents many challenges and open research questions. We have only addressed a very small fraction of these challenges. Please join me in this research endeavor.
References
Brodley, C., Chan, P., Lippmann, R. & Yurcik, B., eds (2004), Proceedings of the 2004 ACM Workshop on Visualization and Data Mining for Computer Security (VizSEC/DMSEC 2004), ACM.
Ma, K.-L., North, S. & Yurcik, B., eds (2005), Proceedings of the IEEE Workshop on Visualization for Computer Security 2005 (VizSEC 2005), IEEE Computer Society.
Muelder, C., Ma, K.-L. & Bartoletti, T. (2005a), Interactive visualization for network and port scan detection, in ‘Proceedings of the Eighth International Symposium on Recent Advances in Intrusion Detection (RAID 2005)’.
Muelder, C., Ma, K.-L. & Bartoletti, T. (2005b), A visualization methodology for characterization of network scans, in ‘Proceedings of the Workshop on Visualization for Computer Security (VizSEC 2005)’, pp. 29–38.
Teoh, S. T., Jankun-Kelly, T. J., Ma, K.-L. & Wu, S. F. (2004), ‘Visual data analysis for detecting flaws and intruders in computer network systems’, IEEE Computer Graphics and Applications (special issue on Visual Analytics) 24(5), 27–35.
Teoh, S. T., Ma, K.-L. & Wu, S. F. (2003), Visual exploration process for the analysis of internet routing data, in ‘Proceedings of the IEEE Visualization 2003 Conference’, pp. 523–530.
Teoh, S. T., Zhang, K., Tseng, S.-M., Ma, K.-L. & Wu, S. F. (2004), Combining visual and automated data mining for near-real-time anomaly detection and analysis in BGP, in ‘Proceedings of the 2004 ACM Workshop on Visualization and Data Mining for Computer Security (VizSEC/DMSEC 2004)’, pp. 35–44.