LINK GRAPHS

Representing Data


Figure 3-20 A three-node configuration visualizing the traffic between source and destination machines with an additional predicate node that identifies the type of communication

Encoding additional data dimensions in a link graph can be done by using color, shape, and edge thickness. One interesting way of color-coding IP address nodes is to color internal addresses (the ones assigned to the machines to be protected) with one color and all the other nodes with another color. This helps to immediately distinguish internal communications from ones involving external machines. Figure 3-21 shows this idea, where we own the 111.0.0.0/8 address space. All these address nodes are colored in dark gray, and the external addresses are colored in light gray. We can immediately see that only one internal machine is talking to other internal machines. All the other machines are being contacted by external machines. In addition, two connections show increased activity, indicated by the thicker edges.

Figure 3-21 Link graph showing machines communicating on a network. Internal nodes are colored in light gray, and external machines are encoded in dark gray.
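The color and thickness encoding just described can be sketched in a few lines of Python. This is a minimal illustration, not the tool used to draw Figure 3-21: it assumes the 111.0.0.0/8 address space from the text, emits Graphviz DOT as one possible output format, and uses placeholder color names and a made-up rule for turning connection counts into edge widths.

```python
import ipaddress
from collections import Counter

# Assumption: the protected address space is 111.0.0.0/8, as in the text.
INTERNAL_NET = ipaddress.ip_network("111.0.0.0/8")
INTERNAL_COLOR = "gray40"  # placeholder color for internal nodes
EXTERNAL_COLOR = "gray85"  # placeholder color for external nodes

def node_color(address):
    """Pick the fill color depending on whether the address is internal."""
    if ipaddress.ip_address(address) in INTERNAL_NET:
        return INTERNAL_COLOR
    return EXTERNAL_COLOR

def to_dot(connections):
    """Turn (source, destination) pairs into a Graphviz DOT description."""
    counts = Counter(connections)
    nodes = sorted({ip for pair in counts for ip in pair})
    lines = ["digraph traffic {"]
    for ip in nodes:
        lines.append(f'  "{ip}" [style=filled, fillcolor="{node_color(ip)}"];')
    for (src, dst), n in sorted(counts.items()):
        width = min(1 + n // 2, 5)  # illustrative rule: busier edges are thicker
        lines.append(f'  "{src}" -> "{dst}" [penwidth={width}];')
    lines.append("}")
    return "\n".join(lines)

if __name__ == "__main__":
    sample = [("111.222.195.60", "217.80.46.58")] * 4
    sample.append(("12.236.216.149", "111.222.192.109"))
    print(to_dot(sample))
```

The resulting text can be handed to any DOT renderer; swapping the layout algorithm then changes the picture without touching the encoding.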

One of the biggest challenges in generating link graphs is the layout, or arrangement, of nodes in a graph. In graph theory, the problem is formulated as follows: Given a set of nodes with a set of edges (relations), calculate the position of the nodes and the connection to be drawn for each edge. Two criteria for a well laid-out graph are that edges do not overlap and that similar nodes are grouped together. Overlapping edges make a graph hard to read. Multiple algorithms exist to lay out the nodes, each using a different approach to arrive at an optimal node placement. I refer the interested reader to a paper written by Ivan Herman et al., called “Graph Visualization and Navigation in Information Visualization: A Survey.” Figure 3-22, Figure 3-23, and Figure 3-24 show how three different layout algorithms arrange the same set of nodes. All three graphs display the same underlying data; only the layout algorithm has been changed!


Figure 3-22 This is the first of three graphs that display the same data. Each graph uses a different layout algorithm to lay out the nodes. This graph uses a circular layout algorithm.

Another layout option is shown in Figure 3-23, where you see the same data, this time using a “spring”-based layout algorithm. The approach is to use a spring model (see “An Algorithm for Drawing General Undirected Graphs,” by Kamada and Kawai, Information Processing Letters 31:1, April 1989). The edges between nodes are treated as springs. When you place a new edge in the graph, the graph seeks equilibrium. That means that the nodes the edge is attached to move closer to each other. However, the old edges that were already attached to those nodes pull them back.
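To make the spring idea concrete, here is a simplified sketch. It borrows the Kamada and Kawai energy model (a spring between every pair of nodes whose natural length is the number of hops between them) but relaxes the system with plain gradient descent rather than the node-by-node Newton-Raphson iteration of the paper. The function names, starting positions, step count, and step size are all my own illustrative choices.

```python
import math
from collections import deque

def hop_distances(nodes, edges):
    """All-pairs hop counts via breadth-first search (assumes a connected graph)."""
    adj = {n: set() for n in nodes}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    dist = {}
    for start in nodes:
        seen = {start: 0}
        queue = deque([start])
        while queue:
            cur = queue.popleft()
            for nxt in adj[cur]:
                if nxt not in seen:
                    seen[nxt] = seen[cur] + 1
                    queue.append(nxt)
        dist[start] = seen
    return dist

def start_positions(nodes, radius=0.1):
    """Deterministic start: all nodes bunched together on a small circle."""
    n = len(nodes)
    return {v: (radius * math.cos(2 * math.pi * i / n),
                radius * math.sin(2 * math.pi * i / n))
            for i, v in enumerate(nodes)}

def energy(pos, dist):
    """Spring energy 0.5 * k * (length - rest)^2 over all pairs, k = 1 / rest^2."""
    nodes = list(pos)
    total = 0.0
    for i, a in enumerate(nodes):
        for b in nodes[i + 1:]:
            rest = dist[a][b]
            total += 0.5 * (math.dist(pos[a], pos[b]) - rest) ** 2 / rest ** 2
    return total

def spring_layout(nodes, edges, steps=500, rate=0.05):
    """Move every node a small step down the energy gradient, repeatedly."""
    dist = hop_distances(nodes, edges)
    pos = start_positions(nodes)
    for _ in range(steps):
        moved = {}
        for a in nodes:
            gx = gy = 0.0
            for b in nodes:
                if a == b:
                    continue
                dx, dy = pos[a][0] - pos[b][0], pos[a][1] - pos[b][1]
                length = math.hypot(dx, dy) or 1e-9  # avoid division by zero
                rest = dist[a][b]
                pull = (length - rest) / (rest ** 2 * length)
                gx += pull * dx
                gy += pull * dy
            moved[a] = (pos[a][0] - rate * gx, pos[a][1] - rate * gy)
        pos = moved
    return pos

if __name__ == "__main__":
    layout = spring_layout(["a", "b", "c"], [("a", "b"), ("b", "c"), ("a", "c")])
    for node, (x, y) in layout.items():
        print(node, round(x, 3), round(y, 3))
```

Stretched springs pull their endpoints together and compressed ones push them apart, so the nodes drift toward the equilibrium the text describes. For larger graphs the result depends on the starting positions and may be only a local optimum.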

Figure 3-24 shows the same data again, this time using another heuristic to lay out the nodes: a hierarchical layout. Hierarchical layouts place the nodes in a tree, so the figure shows how the same nodes that appear in Figure 3-22 and Figure 3-23 are drawn as a hierarchy. Hierarchical layouts are best suited for data that does not create very wide trees, meaning that no node has a lot of neighbors. Otherwise, the graphs will end up very flat and illegible.
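A bare-bones way to see why wide trees become flat is to compute the layers yourself. The following sketch assumes the data already forms a tree with a known root and simply assigns each node a row by its depth and a column by its order within that row. Real hierarchical layout algorithms do considerably more (for example, reordering nodes to reduce edge crossings), so treat this only as an illustration; the function name is hypothetical.

```python
from collections import deque

def hierarchical_layout(root, edges):
    """Place each node at (column, -depth) using a breadth-first traversal."""
    children = {}
    for parent, child in edges:
        children.setdefault(parent, []).append(child)
    depth = {root: 0}
    levels = [[root]]
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            if child not in depth:
                depth[child] = depth[node] + 1
                if depth[child] == len(levels):
                    levels.append([])
                levels[depth[child]].append(child)
                queue.append(child)
    pos = {}
    for row, level in enumerate(levels):
        for col, node in enumerate(level):
            # Center every level around x = 0.
            pos[node] = (col - (len(level) - 1) / 2, -row)
    return pos, levels

if __name__ == "__main__":
    pos, levels = hierarchical_layout("root", [("root", "a"), ("root", "b"), ("a", "c")])
    print(pos)
    print("widest level:", max(len(level) for level in levels))
```

The width of the widest level is a quick indicator of legibility: a node with hundreds of neighbors produces one very long row and therefore a very flat drawing.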


Figure 3-22 uses a circular approach for placing the nodes. In simplified terms, the algorithm analyzes the connectivity structure of the nodes and arranges the clusters it finds as separate circles. The circles themselves are created by placing the nodes of each cluster on concentric circles.
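The following sketch shows the basic idea in a deliberately reduced form. It assumes that a "cluster" is simply a connected component and puts each cluster on a single circle, whereas the algorithm behind Figure 3-22 uses a more refined connectivity analysis and concentric circles. The function names and the spacing parameters are illustrative.

```python
import math

def connected_components(nodes, edges):
    """Group nodes into clusters of mutually reachable nodes."""
    adj = {n: set() for n in nodes}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    seen = set()
    clusters = []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        cluster = []
        stack = [start]
        while stack:
            cur = stack.pop()
            cluster.append(cur)
            for nxt in sorted(adj[cur]):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        clusters.append(cluster)
    return clusters

def circular_layout(nodes, edges, radius=1.0, gap=3.0):
    """Place every cluster on its own circle; circles are lined up left to right."""
    pos = {}
    for index, cluster in enumerate(connected_components(nodes, edges)):
        center_x = index * gap
        for i, node in enumerate(cluster):
            angle = 2 * math.pi * i / len(cluster)
            pos[node] = (center_x + radius * math.cos(angle),
                         radius * math.sin(angle))
    return pos

if __name__ == "__main__":
    layout = circular_layout(["a", "b", "c", "d", "e"],
                             [("a", "b"), ("b", "c"), ("d", "e")])
    for node, (x, y) in layout.items():
        print(node, round(x, 2), round(y, 2))
```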

Figure 3-23 This graph uses a spring model to do the graph layout.

Figure 3-24 This graph uses a hierarchical layout algorithm to place the nodes.

Probably the biggest limitation of link graphs is the number of nodes that can be visualized simultaneously. Too many nodes make the graph illegible. However, it is often necessary to visualize hundreds or thousands of nodes. A solution to this problem is to combine nodes into groups and represent each group as one individual node. This is also known as aggregation. The disadvantage of aggregation is that information is lost. Intelligent aggregation is nevertheless a good trade-off for visualizing thousands of nodes. By intelligent aggregation, I mean nothing more than identifying nodes for which we are not interested in the exact value, but only in the presence of a node of that type. Assume you are visualizing IP addresses as one of the nodes. Often you are not interested in the specific value of a node; it suffices to know what subnet the node belongs to. In that case, you can aggregate your nodes based on Class C, B, or A masks. Figure 3-25 shows a sample graph. The left side shows all the individual nodes. The right side shows what happens if all the nodes are aggregated per Class A network. Comparing these two graphs, you can see that quite a bit of detail is missing in the aggregated graph. For certain analysis tasks, however, this is sufficient.


Figure 3-25 A simple graph in which the source nodes are aggregated by their Class A membership.
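Aggregating by Class A, B, or C mask amounts to truncating each address to its first one, two, or three octets. A minimal sketch using Python's standard ipaddress module follows; the helper names and the sample edges are made up for illustration, and only the source node is aggregated, as in Figure 3-25.

```python
import ipaddress
from collections import Counter

# Prefix lengths that correspond to the classful masks named in the text.
PREFIX = {"A": 8, "B": 16, "C": 24}

def aggregate(address, cls="A"):
    """Replace an IP address with the classful network it belongs to."""
    network = ipaddress.ip_network(f"{address}/{PREFIX[cls]}", strict=False)
    return str(network)

def aggregate_edges(edges, cls="A"):
    """Aggregate the source node of every (source, destination) edge and
    count how many original edges were folded into each aggregated edge."""
    return Counter((aggregate(src, cls), dst) for src, dst in edges)

if __name__ == "__main__":
    edges = [("10.0.0.5", "web"), ("10.0.0.14", "web"), ("66.35.250.150", "web")]
    for (src, dst), count in aggregate_edges(edges).items():
        print(src, "->", dst, f"({count} connections)")
```

Keeping the count of folded edges recovers a little of the information that aggregation throws away; it can be encoded as edge thickness, for example.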

MAPS

Some of the dimensions in our data have a close relationship to a physical location. IP addresses, for example, are associated with machines, which have a physical location. Various degrees of granularity can be applied to map addresses to countries, cities, buildings, desks, and so forth. Visually communicating location is often an effective way of analyzing data. Maps are a general bucket for graphs that display data relative to its physical location. You can use world maps, city maps, building plans, and even rack layouts to plot data atop them.

When using maps to plot data, you have to make two decisions:

• What data dimensions determine the location? For example:
  • The source address only, as shown in Figure 3-26.
  • The source and destination address, as in Figure 3-27.
  • The reporting device’s location, as in Figure 3-28.
• How is the data going to be represented? For example:
  • Encode data with color inside of the map, which yields a choropleth map (see Figure 3-26).
  • Use a link graph that connects the entities in the graph (see Figure 3-27).
  • Draw a chart at the corresponding location on the map (see Figure 3-28).

Often, this type of mapping requires some data preprocessing. Adding the geo-location for IP addresses is one example. If you are going to map your data onto, for example, a building plan, you must include the coordinates relative to the building plan in your data.
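The preprocessing step can be sketched as follows. The lookup table here is entirely hypothetical (it uses reserved documentation address ranges and made-up region codes); in practice you would query a geo-location database instead. The sketch attaches a region to each alert and keeps the highest severity per region, which is the value a choropleth map would encode as color.

```python
import ipaddress

# Hypothetical lookup table standing in for a real geo-location database.
GEO_TABLE = [
    (ipaddress.ip_network("192.0.2.0/24"), "CA"),
    (ipaddress.ip_network("198.51.100.0/24"), "NY"),
]

def locate(address):
    """Return the region for an IP address, or None if it is unknown."""
    ip = ipaddress.ip_address(address)
    for network, region in GEO_TABLE:
        if ip in network:
            return region
    return None

def severity_by_region(alerts):
    """alerts is an iterable of (source_ip, severity) pairs.
    Returns the highest severity seen in each region."""
    worst = {}
    for address, severity in alerts:
        region = locate(address)
        if region is not None:
            worst[region] = max(severity, worst.get(region, 0))
    return worst

if __name__ == "__main__":
    alerts = [("192.0.2.7", 3), ("192.0.2.9", 8), ("198.51.100.1", 2), ("203.0.113.5", 9)]
    print(severity_by_region(alerts))
```

Addresses that cannot be located are silently dropped in this sketch; depending on the analysis, you may prefer to count them in a separate "unknown" bucket.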

Figure 3-26 Data is mapped onto a map of the United States. Color is used to encode the severity of an alert. The darker the color, the more severe the alert represented.


Figure 3-27 A globe where event information is encoded as links between the source and destination location. Color on edges is used to express the severity of the event reported. (This figure appears in the full-color insert located in the middle of the book.)

Maps are useful for examining spatially distributed data. For example, if you are interested in where people are located when they access your services, the originator’s IP address can be mapped to a location on the globe and then plotted on a world map. Figure 3-28 shows another example where you have a distributed network and sensors deployed in geographically disparate locations. You can plot the number and priority of events on a map based on the location of the sensors. This type of mapping helps to immediately identify which branch office, site, or network segment has the most problems and which ones are running smoothly. Maps are a great example of graphics that are used for situational-awareness displays, which we discuss a little later in this book. Maps are also a great tool to communicate information to nontechnical people because they are easy to understand.

Figure 3-28 A globe where the events are mapped to the location from which they were reported. Stacked cubes are used to represent each event. Color coding indicates the severity of the event reported.

Maps are often overused. They are not always the best way to display information. For example, if your goal is to show relationships between machines and not their geographical location, you should use a link graph. One of the main problems with maps is data density. Suppose you are mapping IP addresses to a map. In some areas, such as metropolitan areas (e.g., Manhattan), there will be a lot of data points. In other areas, such as Montana, the data will be sparse. This uneven density leaves large areas of the graph unused, while other areas are so crowded that values are hard to decode.

TREEMAPS

Treemaps are another alternative for visualizing multidimensional, hierarchical data. Even though they were invented during the 1990s,⁴ the computer security community has not really caught on to using them for analyzing data. Almost certainly, treemaps will be seen more often in the future, as soon as people realize how easy it is to analyze their security data with treemaps.

Treemaps provide a visual representation of hierarchical structures in data. Hierarchical or tree structures are created by piecing together multiple data dimensions (for example, by taking the action, the source IP address, and the destination IP address). If we were to take the following log file and arrange it in a tree, we would end up with a tree structure as shown in Figure 3-29:

Feb 18 13:39:26.454036 rule 47/0(match): pass in on xl0: 211.71.102.170.2063 > 212.254.109.27 [Financial System]
Feb 18 13:39:26.889746 rule 71/0(match): block in on xl0: 192.27.249.139.63270 > 212.254.109.27 [Financial System]
Feb 18 13:39:27.046530 rule 47/0(match): pass in on xl0: 192.27.249.139.63271 > 212.254.110.10 [Financial System]
Feb 18 13:39:27.277447 rule 71/0(match): block in on xl0: 192.27.249.139.63277 > 212.254.110.99 [Mail Server]
Feb 18 13:39:27.278849 rule 71/0(match): block in on xl0: 192.27.249.139.63278 > 212.254.110.97 [Mail Server]

Figure 3-29 A tree structure, displaying the log file as a tree.
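Arranging the log in a tree can be sketched with a regular expression and a nested dictionary. The pattern below is written only against the five sample lines shown above, and the hierarchy (system type, then destination address, then action) is my assumption about how the dimensions are stacked; any other order of dimensions works the same way.

```python
import re
from collections import defaultdict

LOG = """\
Feb 18 13:39:26.454036 rule 47/0(match): pass in on xl0: 211.71.102.170.2063 > 212.254.109.27 [Financial System]
Feb 18 13:39:26.889746 rule 71/0(match): block in on xl0: 192.27.249.139.63270 > 212.254.109.27 [Financial System]
Feb 18 13:39:27.046530 rule 47/0(match): pass in on xl0: 192.27.249.139.63271 > 212.254.110.10 [Financial System]
Feb 18 13:39:27.277447 rule 71/0(match): block in on xl0: 192.27.249.139.63277 > 212.254.110.99 [Mail Server]
Feb 18 13:39:27.278849 rule 71/0(match): block in on xl0: 192.27.249.139.63278 > 212.254.110.97 [Mail Server]
"""

# Pattern fitted to the sample lines only: action, source address and port,
# destination address, and the system label in square brackets.
LINE = re.compile(
    r"rule \d+/\d+\(match\): (?P<action>pass|block) in on \S+ "
    r"(?P<src>\d+\.\d+\.\d+\.\d+)\.(?P<sport>\d+) > "
    r"(?P<dst>\d+\.\d+\.\d+\.\d+) \[(?P<system>[^\]]+)\]"
)

def build_tree(lines):
    """Return a nested mapping: system -> destination -> action -> count."""
    tree = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
    for line in lines:
        match = LINE.search(line)
        if match:
            tree[match["system"]][match["dst"]][match["action"]] += 1
    return tree

if __name__ == "__main__":
    tree = build_tree(LOG.splitlines())
    for system, destinations in tree.items():
        print(system)
        for dst, actions in destinations.items():
            print("   ", dst, dict(actions))
```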

Treemaps use size and color to encode specific properties of the leaf nodes (the gray nodes in Figure 3-29)—the destination IP addresses in our case. Leaf nodes in a tree are ones that have no further children. The root node is the one at the very top of the tree. Treemaps are useful when comparing nodes and subtrees at varying depths in the tree to find patterns and exceptions.

Figure 3-30 shows how the log file is mapped to a treemap. The size of the individual boxes is determined by the number of times a certain block or pass occurred targeting one of our machines. For color coding, I chose to use dark gray for blocked traffic and light gray for passed traffic. Any other dimension could have been chosen to assign the color. By choosing the action, I can readily show which traffic was blocked and which traffic actually passed the firewall.

Figure 3-30 A treemap that provides a visual representation of the simple log example discussed earlier.
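How box sizes follow from counts can be illustrated with the simplest treemap algorithm, slice-and-dice, which splits the available rectangle among the children in proportion to their weight and alternates the split direction at every level. This is a sketch under assumptions: the text does not say which layout algorithm produced Figure 3-30, the nested dictionary below is my hand-built encoding of the sample log (system, destination, action, count), and the color names are placeholders.

```python
# Hand-built tree for the sample log: system -> destination -> action -> count.
TREE = {
    "Financial System": {
        "212.254.109.27": {"block": 1, "pass": 1},
        "212.254.110.10": {"pass": 1},
    },
    "Mail Server": {
        "212.254.110.97": {"block": 1},
        "212.254.110.99": {"block": 1},
    },
}

# Color encodes the action, as in the text: dark for blocked, light for passed.
COLOR = {"block": "darkgray", "pass": "lightgray"}

def weight(node):
    """A leaf weighs its count; an inner node weighs the sum of its children."""
    if isinstance(node, dict):
        return sum(weight(child) for child in node.values())
    return node

def slice_and_dice(node, x, y, w, h, depth=0, path=()):
    """Return one (path, x, y, width, height) tuple per leaf.
    Assumes every count is positive."""
    if not isinstance(node, dict):
        return [(path, x, y, w, h)]
    total = weight(node)
    rects = []
    offset = 0.0
    for name, child in node.items():
        share = weight(child) / total
        if depth % 2 == 0:  # even depth: split the width
            rects += slice_and_dice(child, x + offset * w, y, share * w, h,
                                    depth + 1, path + (name,))
        else:               # odd depth: split the height
            rects += slice_and_dice(child, x, y + offset * h, w, share * h,
                                    depth + 1, path + (name,))
        offset += share
    return rects

if __name__ == "__main__":
    for path, x, y, w, h in slice_and_dice(TREE, 0, 0, 100, 60):
        print(" / ".join(path), COLOR[path[-1]],
              f"x={x:.0f} y={y:.0f} w={w:.0f} h={h:.0f}")
```

Because every leaf receives an area proportional to its count, the rectangles always tile the canvas exactly. Slice-and-dice tends to produce thin slivers on large data sets, which is why other treemap layouts exist.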

In Figure 3-30, you can see that two financial systems are monitored. For one of them, half of the connections that were attempted were blocked. The other financial system had no traffic blocked by the firewall. The other systems protected by this firewall are two mail servers. No traffic was being passed through the firewall that was targeted at either of the mail servers. This visualization gives the user a quick overview of which types of systems are being protected by the firewall and which systems the firewall had to block traffic for. Various use-cases can be addressed by this view. It could be used for troubleshooting purposes by, for example, checking whether certain systems are being blocked by the firewall or for gaining an overview of which systems get hit with the most traffic that the firewall has to block. It is possible to reconfigure the hierarchy of the treemap to highlight other properties of the data.

Figure 3-31 shows an example where a lot of data, approximately 10,000 records, was mapped into a treemap. This figure really shows the strength of treemaps: a large amount of data can be mapped into a single graph. You can quickly identify the sources and the destination ports they accessed. Color in this case indicates whether the traffic was blocked or passed by the firewall. Immediately, you can see that almost all traffic to some ports was blocked, whereas other traffic was let through.


Figure 3-31 This treemap encodes approximately 10,000 records in a small space.

Treemaps have unique properties that turn out to be advantages over other types of graphs. The first advantage is that treemaps can show relationships based on hierarchies, which makes it easy to compare different data dimensions with each other. Second, treemaps are great at visualizing more than just three dimensions. Color and size in particular make it possible to visualize multiple data dimensions simultaneously. The third advantage is that clusters are easily detectable. You will see how all of these advantages play out during various use-cases later in this book.
