Dynamic action selection - Common architecture for distributed probabilistic Internet fault diagnosis

Agents in CAPRI diagnose failures dynamically based on the information in a request, service descriptions, and dependency knowledge. This general, dynamic message processing procedure allows agents to automatically take into account new dependency knowledge and new services when they become available. Figure 9-2 illustrates the diagnostic requests that regional agents received and the requests that they made to specialist agents for 27,641 diagnostic requests from March 6 to March 21, 2007. Table 9.1 presents this data in tabular form. Given a Firefox error code and a combination of specialist agents, the table indicates the number of diagnoses for which a regional agent contacted all of those specialist agents to diagnose a failure of that type. The figure shows that regional agents are able to dynamically decide what actions are appropriate in a non-domain-specific way, based only on the information in the request, service descriptions, and dependency knowledge.

[Figure 9-2 diagram: diagnostic requests grouped by Firefox error code (left: Server not found, Request canceled by user, Connection refused, Connection timed out, Connection reset, Other), with arrows to the specialist requests they trigger (right: DNS Lookup Test Requests 8508, Web Server History Test Requests 12086, AS Path Test Requests 2952). Total specialist requests: 23546. Total regional agent diagnoses: 27641.]

Figure 9-2: Regional agents dynamically select specialist agents to contact.

Diagnostic requests at the left of the figure are categorized by the error code reported by the Firefox web browser. The arrows indicate the total number of diagnostic requests regional agents made to each type of specialist agent. For clarity, arrows with fewer than 100 requests are not shown.

Certain types of failures are easier to diagnose than others. For example, regional agents respond to diagnostic requests for which the Firefox error code is “server not found” using only probabilistic dependency knowledge, without any additional diagnostic tests. Such failures are, with very high probability, DNS lookup failures. This ability to diagnose failures accurately from incomplete information, in a general way, is one important advantage of probabilistic diagnosis. For other types of failures, such as “request canceled by user”, a regional agent may need to request additional tests from multiple specialist agents.

Diagnoses using indicated specialists
(W = Web Server Test, D = DNS Lookup Test, A = AS Path Test)

Firefox error code     none     W     D    WD    A   WA   DA   WDA  Total
Canceled by user       1020  1265   866  3717   56   79   48   449   7500
Connection refused     1378  2250    21    26   27    2    9    38   3751
Connection timed out    665   506    64   422  175  276  186  1486   3780
Connection reset        474   467   134   846    6    9    6    51   1993
Server not found       9997     1     2     0    0    2    0     0  10002
Other                   383    80    25    80    5   10    8    24    615
Total                 13917  4569  1112  5091  269  378  257  2048  27641

Table 9.1: Distribution of specialist agent requests from regional agents by Firefox error code
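The per-specialist request counts can be recovered from Table 9.1 by summing every column whose label includes that specialist. The following minimal sketch does this arithmetic; the variable and function names are illustrative and not part of CAPRI.

```python
# Columns of Table 9.1: which specialists were contacted for a diagnosis.
COMBOS = ["none", "W", "D", "WD", "A", "WA", "DA", "WDA"]

# Rows of Table 9.1, copied from the table (Total column omitted).
TABLE = {
    "Canceled by user":     [1020, 1265, 866, 3717, 56, 79, 48, 449],
    "Connection refused":   [1378, 2250, 21, 26, 27, 2, 9, 38],
    "Connection timed out": [665, 506, 64, 422, 175, 276, 186, 1486],
    "Connection reset":     [474, 467, 134, 846, 6, 9, 6, 51],
    "Server not found":     [9997, 1, 2, 0, 0, 2, 0, 0],
    "Other":                [383, 80, 25, 80, 5, 10, 8, 24],
}


def requests_to(specialist):
    """Count diagnoses in which the given specialist (W, D, or A) was contacted."""
    return sum(
        count
        for row in TABLE.values()
        for combo, count in zip(COMBOS, row)
        if specialist in combo  # "none" contains no upper-case W, D, or A
    )


totals = {s: requests_to(s) for s in "WDA"}
total_requests = sum(totals.values())
total_diagnoses = sum(sum(row) for row in TABLE.values())

print(totals)           # {'W': 12086, 'D': 8508, 'A': 2952}
print(total_requests)   # 23546
print(total_diagnoses)  # 27641
```

These sums agree with the web server history, DNS lookup, and AS path request counts and with the two totals reported alongside Figure 9-2.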

Agents also take into account the costs of different tests when selecting actions. In my prototype implementation, some services have greater costs than others. Web server history tests have the lowest cost, followed by verify DNS lookup tests, while AS path tests have the greatest cost. For this reason, using the dynamic procedure for action selection provided by CAPRI, agents will typically request the additional diagnostic tests that are useful for diagnosis in order of increasing cost.
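As a rough illustration of cost-ordered selection, the sketch below keeps only actions whose value is positive and sorts them by increasing cost. It is a simplification under stated assumptions, not the CAPRI procedure itself: the `Action` type, the numeric costs and values, and the action names are all hypothetical, chosen only to respect the cost ordering described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    name: str
    cost: float   # advertised cost (hypothetical units)
    value: float  # estimated value of information; 0 means nothing new


def order_actions(actions):
    """Return the useful actions (value > 0) sorted by increasing cost."""
    useful = [a for a in actions if a.value > 0]
    return sorted(useful, key=lambda a: a.cost)


# Hypothetical candidates; only the relative cost ordering follows the text.
actions = [
    Action("as_path_test", cost=20000, value=0.1),
    Action("web_server_history_test", cost=5000, value=0.1),
    Action("verify_dns_lookup_test", cost=10000, value=0.1),
    Action("already_known_test", cost=1000, value=0.0),
]

plan = [a.name for a in order_actions(actions)]
print(plan)
```

The zero-value action is dropped even though it is the cheapest, since a test that adds no information is not worth requesting at any cost.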

This figure also illustrates the ability of agents to automatically decide what actions are possible based on the inputs specified in service descriptions. In certain cases, agents do not have the necessary inputs to request certain diagnostic tests. For example, when a “server not found” error occurs, the user agent does not have an IP address for the destination web server and so cannot conduct a web server history test or an AS path test.
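One way to picture this check is as a set comparison between the inputs a service description requires and the information the agent currently holds. The sketch below assumes hypothetical service and input names (`hostname`, `src_ip`, `dest_ip`); the actual service descriptions in CAPRI may specify different inputs.

```python
# Hypothetical service descriptions: service name -> required inputs.
SERVICES = {
    "verify_dns_lookup_test": {"hostname"},
    "web_server_history_test": {"dest_ip"},
    "as_path_test": {"src_ip", "dest_ip"},
}


def possible_actions(known_inputs, services=SERVICES):
    """Return the services whose required inputs are all available."""
    known = set(known_inputs)
    return sorted(name for name, required in services.items() if required <= known)


# "Server not found": no IP address for the destination web server is known.
without_ip = possible_actions({"hostname", "src_ip"})

# Other failures: the destination IP address is available.
with_ip = possible_actions({"hostname", "src_ip", "dest_ip"})

print(without_ip)
print(with_ip)
```

Without a destination IP address, the web server history test and the AS path test drop out of the list of possible actions without any rule specific to the error code.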

CAPRI also allows agents to preferentially handle requests from nearby agents by advertising lower costs. Three of the 14 regional agents in my experiments can request DNS lookup tests from specialist agents located in the same AS. Because DNS lookup specialist agents advertise a lower cost of diagnosis to regional agents within the same AS (9000 instead of 10000), these regional agents request DNS lookup tests from the specialist agents in their own AS. In this experiment I found that 2801 of the 8508 DNS lookup requests (33%) were handled by DNS specialist agents in the same AS as the regional agent.
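The effect of locality-based pricing can be sketched as follows. The costs 9000 and 10000 come from the text; the function names, specialist names, and AS numbers (taken from the range reserved for documentation) are hypothetical.

```python
def advertised_cost(specialist_as, requester_as, base=10000, same_as=9000):
    """Cost a DNS lookup specialist advertises to a given requester."""
    return same_as if specialist_as == requester_as else base


def cheapest_specialist(specialists, requester_as):
    """Pick the specialist advertising the lowest cost to this requester.

    specialists maps a specialist name to its AS number. On a tie, min()
    returns the first specialist in iteration order.
    """
    return min(
        specialists,
        key=lambda name: advertised_cost(specialists[name], requester_as),
    )


# Hypothetical specialists and AS numbers.
specialists = {"dns-a": 64500, "dns-b": 64501}

local_choice = cheapest_specialist(specialists, requester_as=64501)
remote_choice = cheapest_specialist(specialists, requester_as=64510)

print(local_choice)
print(remote_choice)
```

A regional agent in AS 64501 selects `dns-b` because of the lower advertised cost, while an agent with no local specialist sees equal costs and falls back to the first one listed.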

The procedure that agents use for action selection also enables an agent to automatically work around failures in other agents simply by considering the value of the information that actions provide. For example, at one point in my experiments a DNS lookup agent at the University of Oregon became unreachable due to a PlanetLab software upgrade. Because multiple DNS lookup specialists offer the same service, a regional agent has multiple DNS lookup test request actions in its list of possible next actions, all with the same value. A regional agent that fails to connect to the University of Oregon DNS lookup agent selects another DNS lookup test action with the same value, if one is available. Once a regional agent successfully obtains DNS lookup test information from a specialist, however, the value of all other DNS lookup test actions becomes zero because they do not provide any new information.
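This failover behavior can be sketched as a loop over equivalent actions. The sketch is only an approximation of the idea under stated assumptions: the function names, specialist names, and value 0.4 are hypothetical, and the unreachable agent is simulated with a `ConnectionError`.

```python
def diagnose_with_failover(actions, request):
    """Try equivalent test actions until one succeeds.

    actions is a list of (specialist, value) pairs offering the same test.
    request(specialist) returns a result or raises ConnectionError.
    """
    remaining = dict(actions)
    attempted = []
    result = None
    while True:
        candidates = [s for s, value in remaining.items() if value > 0]
        if not candidates:
            break
        specialist = candidates[0]
        attempted.append(specialist)
        try:
            result = request(specialist)
        except ConnectionError:
            # Unreachable agent: drop only this action and try the next one.
            del remaining[specialist]
            continue
        # Success: the other equivalent actions provide no new information.
        remaining = {s: 0.0 for s in remaining}
    return result, attempted


def fake_request(specialist):
    """Simulated transport in which one specialist is unreachable."""
    if specialist == "dns_oregon":
        raise ConnectionError("unreachable")
    return "dns lookup test result from " + specialist


result, attempted = diagnose_with_failover(
    [("dns_oregon", 0.4), ("dns_b", 0.4), ("dns_c", 0.4)],
    fake_request,
)
print(result)
print(attempted)
```

The third specialist is never contacted: once `dns_b` answers, every remaining DNS lookup action has zero value, so the loop ends without any explicit failover logic.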

These experimental results illustrate some of the benefits of dynamic message processing and action selection based on action value and cost. The action selection procedure that CAPRI agents use automatically distinguishes between possible and impossible actions based on the information in an agent’s component graph, and can take into account the different costs of actions to preferentially request lower cost tests. This procedure allows an agent to automatically work around failures in other agents and estimate the expected gain in accuracy of available diagnostic actions without domain-specific knowledge.
