
Source Code Viruses

4.4 How Do Viruses Operate?

4.4.5 Source Code Viruses

Source code viruses fall into a rather different category from the preceding ones. They nevertheless constitute an infection mode in their own right.

The principle is again quite simple. As a first stage, a virus or a worm in executable form duplicates its own code. However, unlike the four modes described above, the target is the source code of a program, and the duplicated code is the source code of the virus. The infected program must therefore be recompiled in order to produce a valid executable. The duplication of the code corresponds to Quine programs, which were presented in Section 2.2.4. Figure 4.9 illustrates the way such viruses work.

[Figure labels: Virus (Binary code); Target (Source code); Compiling; Infected host; Target (Source code); Target (Binary code)]

Fig. 4.9. Source Code Infection

However, duplicating the code is not sufficient to write such a virus. A rapid analysis of the infected source code could easily betray the presence of the virus (even though in practice, users are unlikely to perform such a long and tedious analysis in the case of large programs). An appropriate solution would be to use more sophisticated mechanisms when duplicating the source code in question.

The main advantage of source code viruses stems from the fact that the produced executable is perfectly homogeneous; this feature sharply differentiates them from the other infection modes, in which binary code is modified from the outside. Another advantage of such viruses is that they are able to totally bypass all the known antiviral techniques, including integrity checks. This has been demonstrated by a number of experiments.

An additional advantage is that they can infect computers even when the attacker has no information about the environment in use (in particular, the type of operating system). Such viruses may therefore also be effective in new and unknown environments. The attacker has only to assume that the victim uses a compiler which complies with current standards (such as the ANSI standard for the C language).

One may argue that integrity codes enable source code to be protected, especially code downloaded from the Internet: any file modification will then be detected whenever the integrity code is recalculated and verified. The MD5 hash function [128] is indeed the most widely used function for this purpose (even though, in most cases, no integrity code is used at all). However, one can express some doubts about the effectiveness of such functions, especially as regards the MD5 hash function:

• on the one hand, if the attacker manages to infect a source file (either by breaking into the operating system or via an infection process launched by an unaware victim), producing a new hash value and replacing the old one²¹ with the new one will be child's play for him;

• on the other hand, the security of some integrity functions can be questioned. The widely used MD5 hash function was called into question by H. Dobbertin [53] in 1996. He told the author in 1998 that the complete cryptanalysis of the MD5 function seemed about to succeed. He explained that his technique, designed for the operational cryptanalysis of the MD4 hash function, could be applied successfully to the MD5 hash function, whose design is very close to that of MD4. This assertion was confirmed in August 2004: collisions on MD5 were published, as well as collisions for other well-known hash functions such as HAVAL-128 and RIPEMD [158]. This essential result demonstrates that source code infection is quite possible, given the wide use of MD5 as an integrity tool²². What about the security of other hash functions? Unfortunately, since the technique used in [158] has not been published yet, nothing can be said.

²¹ This is far more difficult when the hash values are encrypted or, more generally, protected. However, viral techniques make it possible to bypass any such protection, especially when combined viruses are used (see Chapter 13 for more details).

As a general rule, a source code virus operates in the following way:

1. first, the virus creates a virus.h file (it goes without saying that a real-life virus will require a more subtle name; see below) which includes the source code of the virus. This file is made of two parts: the virus code which will have to be compiled, and the same code stored in an array of characters (e.g. of unsigned char type). A fine example worth mentioning in this respect is the Quine code written by Daniel Martin (see also Section 2.2.4). Note that the string literal must fit on a single line of source code, even where the page layout wraps it across several lines.

#include<stdio.h>
char a[] = "\";\nmain() {char *b=a;printf(\"#include<stdio.h>\\nchar a[] = \\\"\");\nfor(;*b;b++) {switch(*b){case '\\n': printf(\"\\\\n\"); break;\ncase '\\\\': case '\\\"': putchar('\\\\'); default: putchar(*b);}} printf(a);}\n";
main() {char *b=a;printf("#include<stdio.h>\nchar a[] = \"");
for(;*b;b++) {switch(*b){case '\n': printf("\\n"); break;
case '\\': case '\"': putchar('\\'); default: putchar(*b);}} printf(a);}

Creating a self-replicating program of the Quine type becomes all the more complex as the size of the program grows (beyond a few dozen bytes), and source code viruses are precisely rather large. M. Ludwig [105, chap. 13] has developed a rather efficient method to solve this problem. Once the file has been created in this way, it must of course be carefully hidden in order to avoid detection during an ordinary directory listing;

2. then, the virus infects the target source files in the following two steps:

a) the virus inserts an inclusion directive such as #include "virus.h". An interesting technique consists, for instance under Linux, in creating a viral source file named .stdio.h (a hidden file), and then replacing the #include <stdio.h> directive with #include ".stdio.h". This solution, which can be greatly optimized, is much more difficult to detect (since users do not pay much attention to inclusion directives or program headers when reading source code). Far more sophisticated techniques exist;

b) then the virus inserts one or several instructions into the source code of the target program so that the virus may be called (the best way consists in hiding these instructions in comments, but many other tricks may be used);

3. incidentally, the virus itself may compile the infected source file in order to produce an infected binary directly. On its own or with the help of other viruses (this is the case of combined viruses), it will also be able to handle integrity codes and/or change the times of last modification or access.

²² It is rather surprising that users go on using this function in spite of the first results obtained by H. Dobbertin. It is not unusual for more questionable cryptanalyses to discredit cipher systems which, all things considered, provided a fairly good security level.
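From the defender's point of view, the header substitution of step 2a suggests a simple review heuristic, sketched below. It assumes that a quoted inclusion directive whose file name begins with a dot deserves a closer look; the function name is_suspicious_include is hypothetical. The heuristic only covers the .stdio.h trick described above, and is easily defeated by the more sophisticated techniques mentioned in the text.

```c
#include <ctype.h>
#include <string.h>

/* Returns 1 if `line` is a quoted #include directive whose file name
   (last path component) starts with a dot, e.g. #include ".stdio.h";
   returns 0 for anything else. A review heuristic, not a detector. */
int is_suspicious_include(const char *line)
{
    while (isspace((unsigned char)*line)) line++;
    if (*line != '#') return 0;
    line++;
    while (isspace((unsigned char)*line)) line++;
    if (strncmp(line, "include", 7) != 0) return 0;
    line += 7;
    while (isspace((unsigned char)*line)) line++;
    if (*line != '"') return 0;            /* only the quoted form is checked */
    line++;

    const char *end = strchr(line, '"');   /* closing quote of the file name */
    if (end == NULL) return 0;

    const char *name = line;               /* start of the last path component */
    for (const char *p = line; p < end; p++)
        if (*p == '/') name = p + 1;

    return name < end && *name == '.';
}
```

Applied line by line to a source file, the function flags #include ".stdio.h" but neither #include <stdio.h> nor an ordinary project header such as #include "../util.h". A plainly named viral header such as virus.h is not flagged, which illustrates how limited such surface checks are.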

The interested reader will find in [57] an example of a virus written in source code. It must be stressed that having source code at one's disposal (an argument often put forward where free software is concerned) is not a guarantee of security. Who is ready to read code which contains thousands or tens of thousands of lines and whose readability is mediocre (please refer to www.ioccc.org to see what this can look like when the C language is used)? Besides, the compiler itself may be directly responsible for the infection and trigger it (K. Thompson's excellent paper [152] is very enlightening on this issue). The only way of obtaining almost all the desired guarantees is to control the compiler binaries as well. In other words, we must have a way to produce them from reliable source code using a compiler binary that can be trusted. We say "almost all" the desired guarantees because all the functionalities and tricks described in K. Thompson's paper may be implemented directly at the processor level.