Layer 1a: Perceptron Module - Design Decisions

Chapter 6: Design Decisions

6.4 Architecture

6.4.1 Layer 1a: Perceptron Module

In Layer 1a of Figure 6-2, a single perceptron is used to learn and predict spam zombie server identification values. Figure 6-3 presents a high-level flow chart of our perceptron module, which is described in detail below. The module is divided into three main areas. The first, ‘loading’, is responsible for creating a new perceptron object, loading its weight values and generating its inputs. The second, ‘decision making’, is where the perceptron decides, based on the given inputs, whether to accept or reject the current SMTP transaction. The third, ‘training’, is responsible for training the perceptron to learn good and bad server identification values. We note that the training module is purposely positioned after the decision module so that we can continuously re-train our perceptron in real time, based upon the outcome of the Spamhaus list [6] query and the other filtering layers on the filtering server. Each of the three areas is described in more detail in the following sub-sections.

6.4.1.1 Loading

The purpose of the loading portion of the algorithm is to create a new perceptron object, initialize its weight values and convert the provided input into a corresponding numeric array reference that can be used by the learn and run functions. Specifically, in Step 1 of Figure 6-3, the algorithm is provided with the sending server’s identification value as its input and parses it into sub-domain terms, using ‘.’ as the delimiter. For example, a server identification value of static.mail.example.com would be parsed into four separate terms: static, mail, example and com. In Step 2, each sub-domain term is stored in an index file called crunches on the filtering server and assigned a corresponding numeric value. For example, the terms of static.mail.example.com could be indexed as 1, 2, 3 and 4.
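Steps 1 and 2 can be illustrated with a minimal Python sketch. The function name index_server_id, the use of an in-memory dictionary in place of the crunches file, the lower-casing of terms and the rule that an unseen term receives the next unused integer are all assumptions made for illustration; the text only specifies that terms are split on ‘.’ and assigned numeric values.

```python
def index_server_id(server_id, crunches):
    """Split a server identification value on '.' and map each term to a number.

    `crunches` is a plain dict standing in for the index file of the same
    name (an assumption); unseen terms get the next unused integer, from 1.
    """
    indices = []
    for term in server_id.lower().split("."):
        if not term:
            continue  # ignore empty terms such as those produced by "a..b"
        if term not in crunches:
            crunches[term] = len(crunches) + 1
        indices.append(crunches[term])
    return indices


crunches = {}
print(index_server_id("static.mail.example.com", crunches))  # [1, 2, 3, 4]
print(index_server_id("smtp.example.com", crunches))         # [5, 3, 4]
```

The second call shows that terms already seen (example, com) keep their earlier values, while a new term (smtp) is given a new one.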

Next, in Step 3, we create a new perceptron object and initialize it with pre-defined weights, a learning rate and a maximum iteration count. The pre-defined weights for each input of the perceptron are stored in a file called weights on the filtering server and are the most recent weights used by the last perceptron object. The learning rate was defined as the parameter α in Equation 6-2. The maximum iteration count prevents our perceptron from spending a long period of time training on a given set of inputs, which could affect the performance of our filtering technique. We note that in order to perform on-line learning, we must create a new perceptron object for each SMTP transaction and preserve the weight values of the most recently learned training example. Once we have completed Step 3 of Figure 6-3, our perceptron is ready to either learn or make a decision. We describe both options in the subsequent sub-sections.
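The per-transaction creation of a perceptron that inherits the previous weights might look like the following sketch. The Perceptron class, the load_perceptron and save_weights helpers, the JSON file format, the zero initial weights and the default values of 0.1 and 100 are hypothetical; the text states only that the weights are kept in a file called weights and reloaded for each new object.

```python
import json
import os
import tempfile
from dataclasses import dataclass


@dataclass
class Perceptron:
    weights: list          # one weight per input
    learning_rate: float   # alpha in Equation 6-2 (value below is assumed)
    max_iterations: int    # cap on training passes per transaction (assumed)


def load_perceptron(path, n_inputs, learning_rate=0.1, max_iterations=100):
    """Create a fresh perceptron, reusing the last saved weights if present."""
    if os.path.exists(path):
        with open(path) as fh:
            weights = json.load(fh)
    else:
        weights = [0.0] * n_inputs  # first run: assumed zero initial weights
    return Perceptron(weights, learning_rate, max_iterations)


def save_weights(path, perceptron):
    """Persist the weights so the next transaction's perceptron can load them."""
    with open(path, "w") as fh:
        json.dump(perceptron.weights, fh)


# Two consecutive "transactions" sharing state through the weights file.
path = os.path.join(tempfile.mkdtemp(), "weights")
first = load_perceptron(path, n_inputs=4)
first.weights[0] = 0.25            # pretend training changed a weight
save_weights(path, first)
second = load_perceptron(path, n_inputs=4)
print(second.weights)              # [0.25, 0.0, 0.0, 0.0]
```

Keeping the state in a file rather than in a long-lived object matches the description above: each SMTP transaction builds its own perceptron, and only the weights survive between transactions.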

6.4.1.2 Decision Making

The decision making portion of our algorithm is called after a new perceptron is created. In Step 4 of Figure 6-3, we apply the numeric values of each sub-domain term from the server identification value to the inputs of the perceptron.

In Step 6, the algorithm calculates the net-sum for the perceptron and then applies a sigmoid activation function to the net-sum to calculate the output. This output value determines whether the server identification value is accepted. Specifically, if the output value is below the decision threshold of the sigmoid activation function (i.e. 0.5), the SMTP transaction passes our perceptron filtering and is subsequently verified against Layer 1b, the Reverse DNS check layer. If, however, the output value is above the decision threshold, the SMTP transaction is blocked.
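The decision rule can be sketched as follows. The function names are hypothetical, the net-sum is assumed to be a plain weighted sum without a bias term, and an output exactly equal to 0.5 is treated here as a pass, a case the text does not specify.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def decide(inputs, weights, threshold=0.5):
    """Return (output, verdict) for one server identification value."""
    net_sum = sum(w * x for w, x in zip(weights, inputs))
    output = sigmoid(net_sum)
    # Above the threshold: block. Otherwise hand over to the Reverse DNS layer.
    verdict = "block" if output > threshold else "pass"
    return output, verdict


inputs = [1, 2, 3, 4]  # e.g. indices for static.mail.example.com
print(decide(inputs, [1.0, 0.0, 0.0, 0.0])[1])   # block
print(decide(inputs, [-1.0, 0.0, 0.0, 0.0])[1])  # pass
```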

6.4.1.3 Training

The training portion is executed after the creation of a new perceptron and after the perceptron’s decision module and the Reverse DNS module have decided that the current sending server provided a legitimate server identification value.

The training algorithm begins by determining whether the sending server’s IP address is listed in the Spamhaus list. As previously mentioned, the Spamhaus list is designed to list the IP addresses that are sending spam through illegal third-party exploits, in other words, spam zombies. In Step 6, if the sending server’s IP address is on the Spamhaus list, we set the expected output of our perceptron to ‘1’ to indicate that the sending server is a known malicious host. If, however, the sending server is not listed in the Spamhaus list, we proceed to the final two layers of filtering on the filtering server. If the current SMTP transaction passes these final two layers and the sending server’s IP address was not listed in the Spamhaus list, then in Step 6 we set the expected output of our perceptron to ‘0’ to indicate that the sending server is likely to be trusted.
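The labelling rule above reduces to a small function, sketched here under stated assumptions. The function name and its two boolean parameters are hypothetical stand-ins for the Spamhaus query result and the outcome of the final two filtering layers. The text does not say what happens when a server is not listed but the transaction fails the later layers, so this sketch returns None (no training label) in that case.

```python
def expected_output(listed_in_spamhaus, passed_final_layers):
    """Choose the training target for the current SMTP transaction."""
    if listed_in_spamhaus:
        return 1      # known malicious host (spam zombie)
    if passed_final_layers:
        return 0      # likely to be trusted
    return None       # assumption: no label, so no training for this case


print(expected_output(True, False))   # 1
print(expected_output(False, True))   # 0
print(expected_output(False, False))  # None
```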

To train the perceptron, we run through Steps 7 and 8 of Figure 6-3. Step 9 indicates that we repeat Steps 7 and 8 until a stopping criterion is met, at which point training is complete. Our stopping criterion is met when the perceptron output equals the expected output or when the maximum number of iterations has elapsed.
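The training loop and its two stopping criteria can be sketched as below. Because Equation 6-2 is not reproduced in this section, the update used here is the common delta rule, w_i ← w_i + α(t − y)x_i, which is an assumption and not necessarily the exact form of that equation. Comparing the thresholded output (at 0.5) with the expected output is likewise an assumption, as are the function names and default parameter values.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def train(weights, inputs, expected, learning_rate=0.1, max_iterations=100):
    """Repeat compute-and-update until the output matches or the cap is hit.

    Returns the new weights and the number of weight updates performed.
    """
    weights = list(weights)  # leave the caller's list untouched
    updates = 0
    while updates < max_iterations:
        output = sigmoid(sum(w * x for w, x in zip(weights, inputs)))
        predicted = 1 if output > 0.5 else 0
        if predicted == expected:
            break  # first stopping criterion: output equals expected output
        error = expected - output
        weights = [w + learning_rate * error * x
                   for w, x in zip(weights, inputs)]
        updates += 1  # second criterion is enforced by the loop condition
    return weights, updates


new_weights, updates = train([0.0, 0.0, 0.0, 0.0], [1, 2, 3, 4], expected=1)
print(updates)  # 1
```

Starting from zero weights, the output is 0.5, which does not count as a ‘1’, so one update is applied; the next pass gives an output above 0.5 and the loop stops.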
