Java Obfuscation Salah Malik BSc Computer Science 2001/2002

Summary

Java has become a popular language in both academia and industry. Its strength lies in the "Write Once Run Anywhere" paradigm, which is achieved by compiling the source code into "byte code" for the Java Virtual Machine (JVM). Unfortunately, this byte code can very easily be reverse engineered, that is, changed from byte code back into the original source code.

The problem of decompilation has been addressed through the use of "obfuscation". The byte code is altered in such a way as to render the source code difficult for humans to read after decompilation. The less human-readable the code is, the more successful the obfuscation can be considered.

This report includes an outline of the way the JVM operates and the class file format into which Java source programs are transformed. There is also a description of the current known obfuscation techniques and how they affect Java programs.

The main aim of this report is to investigate the effects of obfuscation on Java byte codes. This includes an evaluation of currently available obfuscation tools and an investigation into the possibility and technical problems of obfuscation via unconditional jump statements. Initially the development of an obfuscator for private/public methods, variables and class names was considered, but this was later rejected for reasons that are made clear in this report.

An extensive strategy was implemented for the background research, focusing on the implications of code protection and reverse engineering, the structure of the JVM, the class file format and the byte code instruction set, the principles and techniques of obfuscation and the actual services offered by currently available obfuscators. Most of the reference material is web based as Java obfuscation is “on the cutting-edge” of software security.

The initial and revised project schedules are available in Appendix A. The initial plan aimed to finish the initial research by February 2002, the evaluation of obfuscation software by March 2002, the development of a name obfuscator by April 2002 and the investigation of jump statement obfuscation by May 2002. This, however, did not take into account the writing of this report and the

Acknowledgements

I would like to thank my supervisor Chris Gillespie and my project assessor Dr. Nick Efford for the invaluable advice they have given me throughout this project. I would also like to thank Dr. Sara Fores, the project administrator, and Mr. Martyn Clark for their advice.

Summary
Acknowledgements
Table of Contents
1. Introduction to Code Protection
1.1. Introduction
1.2. Server Side Execution
1.3. Encryption
1.4. Signed Native Code Execution
1.5. Code Obfuscation
1.6. Decompilation
1.7. Deobfuscation
1.8. Why Java?

2. Java Virtual Machine
2.1. Introduction to the Java Virtual Machine
2.2. Introduction to Java Architecture
2.3. Introduction to Java Byte Codes
2.4. Class Loader
2.5. Byte Code Verifier
2.6. Supported Data Types
2.7. JVM Registers
2.8. Method Area
2.9. Java Stack
2.10. Garbage-Collected Heap
2.11. Frames
2.12. Java Instruction Set

3. Class File Format
3.1. Introduction to Class File Format
3.2. Magic Number
3.3. Major Number, Minor Number
3.4. Constant Pool
3.5. Access Flags
3.6. This Class
3.7. Super Class
3.8. Interfaces
3.9. Fields
3.10. Methods
3.11. Attributes
3.12. Class File Descriptors
3.12.1. Field Descriptors
3.12.2. Method Descriptors

4. Code Obfuscation Techniques
4.1. Obfuscation Transformation
4.1.1. Definition
4.1.2. Quality
4.1.3. Potency
4.1.3.1. Definition
4.1.3.2. Measure Scale
4.1.4. Resilience
4.1.4.1. Definition
4.1.4.2. Measure Scale
4.1.5. Stealth
4.1.5.1. Definition
4.1.5.2. Measure Scale
4.1.6. Cost
4.1.6.1. Definition
4.1.6.2. Measure Scale
4.2. Types of Obfuscation
4.2.1. Layout Transformations
4.2.1.1. Change Formatting
4.2.1.2. Scrambling Identifier Names
4.2.1.3. Remove Comments
4.2.2. Data Transformations
4.2.2.1. Data Storage
4.2.2.2. Data Encoding
4.2.2.3. Data Aggregation
4.2.2.4. Data Ordering
4.2.3. Control Transformations
4.2.3.1. Opaque Constructs
4.2.3.1.1. Definition
4.2.3.1.2. Trivial Constructs
4.2.3.1.3. Weak Constructs
4.2.3.2. Control Aggregation
4.2.3.3. Control Ordering
4.2.3.4. Control Computations
4.2.4. Preventive Transformations
4.2.4.1. Inherent Preventive Transformations
4.2.4.2. Targeted Preventive Transformations

5. Evaluation of Currently Available Tools
5.1. Purpose of Evaluation
5.2. Web Resources
5.3. Comparison of Obfuscation Reviews Available
5.4. Evaluation Criteria
5.4.1. Cost
5.4.2. Availability
5.4.3. Range of Transformations Offered
5.4.4. Potency
5.4.5. Resilience
5.4.6. Stealth
5.4.7. Execution Cost
5.4.8. Effectiveness against Decompilation
5.4.9. Usability
5.4.10. Documentation
5.4.11. Test Data
5.5. Obfuscation Software Tools
5.5.1. Zelix KlassMaster
5.5.2. Jshrink
5.5.3. DashO-Pro
5.5.4. File Obfuscator
5.5.5. RetroGuard
5.5.6. 1stBarrier
5.5.7. 2LKitObfuscator
5.5.8. Aubjex
5.5.9. CafeBabe
5.5.10. CodeShield
5.5.11. Condensity
5.5.12. Crema
5.5.13. Elixir
5.5.14. Excelsior Jet
5.5.15. Helseth JObfuscator
5.5.16. Jammer
5.5.17. JCloak
5.5.18. Jopt
5.5.19. Mocha Source Obfuscator
5.5.20. Marvin Obfuscator
5.5.21. Obfuscate
5.5.22. ShroudIt!
5.5.23. WingGuard
5.5.24. SmokeScreen
5.5.25. JMangle
5.5.26. JODE
5.5.27. JOBE
5.5.28. HashJava
5.5.29. SourceGuard
5.6. Evaluation Results
5.7. Conclusion

6. Investigation Goto Statement Obfuscation
6.1. Purpose of Investigation
6.2. Justification for Choice of Transformation
6.3. Investigation Method
6.4. Test Data
6.5. Test Results
6.5.1. Statement Rearranging
6.5.2. For Loop
6.5.3. For Loop with Opaque Constructs
6.5.4. Method Parameters
6.6. Evaluation of Test Results
6.7. Implementation
6.7.1. Byte Code Engineering Library
6.7.2. jclasslib
6.7.3. Kawa
6.8. Evaluation of Obfuscator
6.9. Difficulties in Development
6.10. Investigation Conclusion

7. Conclusion
7.1. Obfuscation Evaluation
7.2. Goto Statement Obfuscation

Bibliography

Appendix A - Reflection

Appendix B - Software Metrics Table

Appendix C - Evaluation Table and Results

Appendix D - Obfuscated Program Code
D.1. Test Data
D.2. Disassembled Class Files
D.3. Decompiled Class Files
D.4. Zelix KlassMaster
D.5. Jshrink
D.6. DashO-Pro
D.7. File Obfuscator
D.8. RetroGuard
D.9. 1stBarrier
D.10. 2LKitObfuscator
D.11. Aubjex
D.12. CafeBabe
D.13. JCloak
D.14. Jopt
D.15. Mocha Source Obfuscator
D.16. ShroudIt!
D.17. WingGuard
D.18. SmokeScreen
D.19. JMangle
D.20. SourceGuard

1. Introduction to Code Protection

This section introduces the main techniques for protecting programs written in network-friendly formats and explains why obfuscation is the preferred solution. It also covers decompilation and deobfuscation and how they affect obfuscation.

1.1. Introduction

The advent of network-independent programs has seen the need for code protection increase. Prior to this, programs were compiled for specific hardware and operating systems [34]. During compilation, information such as variable names and references to library routines was removed, producing hardware-dependent machine code that was large in size and had low portability, i.e. it was executable only on computers of the same hardware specification as the computer on which the original source code was compiled ([34], [8] p.1, [41], [42], [12]). As the machine code was stored in binary files, it proved difficult to read.

This has changed with the introduction of hardware-independent specifications, which partially compile the code into a format that can be run on a separate software implementation ([8] p.1). This allows programs to be run on different platforms (i.e. they have high portability). In addition, because the code does not contain any hardware-specific library routine calls, the programs are considerably smaller, which makes them easier to transfer over networks ([8] p.1).

There is, however, one major drawback: because there are no hardware/operating system specific code constructs, the programs have proven easy to decompile into the original code. This gives software pirates and rival developers ample opportunity to obtain vital algorithms and data structures contained within the code ([8] p.1).

A software developer can take legal action to protect their code. Software artefacts are covered under copyright law; however, legal action can prove expensive for a small software house taking on a larger and more powerful corporation ([8] p.3).

In response to this problem, a number of technical code protection techniques have been drawn up.

1.2. Server Side Execution

The most secure approach is for users to connect to a web site set up by the software developer to run the program remotely, paying a small amount of electronic money each time ([8] p.3). The program is executed on the developer’s server and input/output is via the web. The reverse engineer never gains physical access to the application and so is unable to decompile the code ([8] p.3).

Figure 1.1: Protection by Server-Side Execution ([33] p.6)

However, due to limits on network bandwidth and latency, the application will not perform as well as it would if it were run locally ([8] p.3). A way to get round this is to implement partial server side execution, where the application is broken into two parts: one part runs on the user’s site and the other part (containing the code to be protected) is run remotely ([8] p.3, [34]).

Figure 1.2: Protection by Partial Server-Side Execution ([33] p.6)

1.3. Encryption

The software developer could encrypt the code and then send this encrypted code to users. This would guarantee protection against software attacks, but it has two problems. The first is that it only works if the entire encryption/decryption process takes place in hardware, because most encryption methods involve running a tamper-proofed environment (on a separate machine) to encrypt the code ([68] p.2, [11], [36], [69]). Compiled Java code, for example, is run on a software implementation of a machine, and as the tamper-proofed environment runs processor-specific code, the use of encryption methods is much more difficult, if not impossible. The second is that specialised hardware tends to limit the portability of programs.

Figure 1.3: Protection by Encryption ([33] p.7)

1.4. Signed Native Code Execution

The software developer can use just-in-time compilers to create an executable for all popular architectures ([8] p.3). A just-in-time (JIT) compiler is a program that turns code into instructions that can be sent directly to the processor [67]. “When downloading the application, the user’s site would have to identify the architecture/operating system combination it is running, and the corresponding version would be transmitted” ([8] p.3). As the native code is processor-specific, it will prove harder for the reverse engineer to decompile ([8] p.3).

There is still one drawback to this approach: “native codes cannot be run with complete security on the user’s machine” ([8] p.3). To ensure that the code is safe to run on the user’s system, digital signatures would be required. A digital signature can be thought of as a digital equivalent to a handwritten signature ([45] p.613), and is appended to a message (as extra data) to identify and authenticate the sender and message data using public-key encryption [23]. In public-key encryption, “each person gets a pair of keys, called the public key and the private key. Each person's public key is published while the private key is kept secret. Messages are encrypted using the intended recipient's public key and can only be decrypted using his private key” [24]. “The sender uses a one-way hash function [(see [25])] to generate a hash-code of about 32 bits from the message data. He then encrypts the hash-code with his private key. The receiver recomputes the hash-code from the data and decrypts the received hash with the sender's public key. If the two hash-codes are equal, the receiver can be sure that data has not been corrupted and that it came from the given sender” [23].

Signed native code execution also increases software maintenance effort, as different versions of the application would have to be made for each of the different hardware specifications ([8] p.3).
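The sign-and-verify exchange described above can be sketched with the standard java.security API. This is only an illustration under stated assumptions: the class name SignatureSketch, the 2048-bit RSA key size and the "SHA256withRSA" algorithm are choices made for this sketch and are not taken from the cited sources, which describe a much shorter hash-code.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.PublicKey;
import java.security.Signature;

// Hypothetical helper class for illustration only.
final class SignatureSketch {

    // Creates a public/private key pair (RSA, 2048 bits: an assumption of this sketch).
    static KeyPair newKeyPair() {
        try {
            KeyPairGenerator generator = KeyPairGenerator.getInstance("RSA");
            generator.initialize(2048);
            return generator.generateKeyPair();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // Sender side: hash the data and sign the hash with the private key.
    static byte[] sign(byte[] data, KeyPair senderKeys) {
        try {
            Signature signer = Signature.getInstance("SHA256withRSA");
            signer.initSign(senderKeys.getPrivate());
            signer.update(data);
            return signer.sign();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // Receiver side: recompute the hash and check it against the signature
    // using the sender's public key.
    static boolean verify(byte[] data, byte[] signature, PublicKey senderPublicKey) {
        try {
            Signature verifier = Signature.getInstance("SHA256withRSA");
            verifier.initVerify(senderPublicKey);
            verifier.update(data);
            return verifier.verify(signature);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        KeyPair keys = newKeyPair();
        byte[] code = "native code image".getBytes(StandardCharsets.UTF_8);
        byte[] signature = sign(code, keys);
        System.out.println("untampered data accepted: " + verify(code, signature, keys.getPublic()));
    }
}
```

If the data is altered after signing, verify is expected to return false, which is how the receiver detects corruption or a forged sender.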

1.5. Code Obfuscation

The software developer could use an obfuscator to obfuscate the program. The process of obfuscation “transforms a program [by adding or removing code] so that it is more difficult to understand, yet is functionally identical to the original” ([34], [26]). The program still provides the same functionality as the original application, except that it may be larger, run slower and have side effects such as creating files ([34], [8] p.3). Obfuscation does not try to prevent someone from gaining access to the source code; instead it makes the task of using the data structures and algorithms within it that much more difficult, as the code will be harder for a human reverse engineer to read ([8] p.3). Although deobfuscators have been developed to counter this, a good obfuscation technique would prove to be effective even against this sort of application.
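As a minimal illustration of "functionally identical but harder to understand", the sketch below places a readable method next to a hand-obfuscated equivalent. Both methods and the class name ObfuscationSketch are invented for this example; they are not the output of any obfuscator evaluated in this report.

```java
// Hypothetical example class for illustration only.
final class ObfuscationSketch {

    // Original, readable version: sums the integers 1..n.
    static int sumUpTo(int n) {
        int total = 0;
        for (int i = 1; i <= n; i++) {
            total += i;
        }
        return total;
    }

    // Hand-obfuscated version: meaningless identifiers, a restructured loop,
    // and a branch guarded by a condition that is never true, because the
    // product of two consecutive integers is always even.
    static int a(int b) {
        int c = 0;
        int d = b;
        while (d > 0) {
            if ((d * (d + 1)) % 2 != 0) {
                c -= d; // dead code: never executed
            }
            c += d--;
        }
        return c;
    }

    public static void main(String[] args) {
        System.out.println(sumUpTo(10) == a(10));
    }
}
```

The second method is slightly larger and slower, but it returns the same value as the first for every input, which is the property an obfuscating transformation has to preserve.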

1.6. Decompilation

“A decompiler is a program that reads a program written in a machine language – the source language – and translates it into an equivalent program in a high-level language – the target language. A decompiler, or reverse compiler, attempts to reverse the process of a compiler which translates a high-level program into a binary or executable program” ([5] p.1). Decompilation is usually used for software maintenance and security ([5] p.15). However, it is also utilised by reverse engineers to gain access to key data structures and control constructs of other developers’ software ([8] p.2, [34]). Obfuscation aims to counter this by altering the object code so that the decompiled code resembles the original source code as little as possible, making decompilation a futile exercise. See [4], [5], [6], [15], [16], [17], [19], [20], [27], [38], [44], [46], [47], [48] and [49] for more details.

1.7. Deobfuscation

Deobfuscation attempts to undo the transformations of an obfuscator on the program code ([8] p.3), using techniques such as static analysis, data dependency analysis [29] and program slicing ([8] p.24). Reverse engineers can use it to recover readable source code from obfuscated code. Because of this, obfuscation techniques must be able to withstand deobfuscation attacks.

1.8. Why Java?

This report examines Java code obfuscation in particular for three reasons. First, the Java byte code format into which Java programs are compiled is designed to retain as much symbolic information about the original program as possible. Second, the byte codes contain commands that are run on a virtual machine, enabling them to be run on any hardware/operating system. Third, the instruction set of the byte code is designed to be as small as possible. These properties not only make Java programs easier to transmit over networks, but also easier to decompile.

Programs written in other languages, such as C++, are compiled into machine code, which is specific to the processor of the machine on which the program was compiled, so the program can only run on that type of machine. While C++ executables can be decompiled (see Section 1.6, [6] p.2), the decompiler must be specific to the hardware/operating system for which the program was compiled.

2. Java Virtual Machine

This section describes the basic structure of the Java Virtual Machine architecture.

2.1. Introduction to the Java Virtual Machine

“The Java Virtual Machine, or JVM, is an abstract computer that runs compiled Java programs. The JVM is "virtual" because it is generally implemented in software on top of a "real" hardware platform and operating system. All Java programs are compiled for the JVM. Therefore, the JVM must be implemented on a particular platform before compiled Java programs will run on that platform” [51].

“The JVM plays a central role in making Java portable. It provides a layer of abstraction between the compiled Java program and the underlying hardware platform and operating system. The JVM is central to Java's portability because compiled Java programs run on the JVM, independent of whatever may be underneath a particular JVM implementation.

The JVM is small when implemented in software. It was designed to be small so that it can fit in as many places as possible -- places like TV sets, cell phones, and personal computers. The JVM wants to be everywhere, and its success is indicated by the extent to which programs written in Java will run everywhere” [51].

2.2. Introduction to Java Architecture

“At the heart of Java technology lies the Java virtual machine. Although the name "Java" is generally used to refer to the Java programming language, there is more to Java than the language. The Java virtual machine, Java API, and Java class file work together with the language to make Java programs run. The components of the Java architecture are the JVM, the class file, API, and language. It gives an overview of Java's architecture, discusses why Java is important, and looks at Java's pros and cons” [64].

2.3. Introduction to Java Byte Codes

“Java byte codes can be thought of as the machine language of the JVM. The Java compiler reads Java language source (.java) files, translates the source into Java byte codes, and places the byte codes into class (.class) files. The compiler generates one class file per class in the source” [51].

“To the JVM, a stream of byte codes is a sequence of instructions. Each instruction consists of a one-byte opcode and zero or more operands. The opcode tells the JVM what action to take. If the JVM requires more information to perform the action than just the opcode, the required information immediately follows the opcode as operands. A mnemonic is defined for each byte code instruction. The mnemonics can be thought of as an assembly language for the JVM. For example, there is an instruction that will cause the JVM to push a zero onto the stack. The mnemonic for this instruction is iconst_0, and its byte code value is 60 hex [in the instruction set defined in [32], iconst_0 is opcode 03 hex; 60 hex is the opcode of iadd]. This instruction takes no operands” [51].
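The opcode/operand layout described above can be sketched as a tiny disassembler. The class OpcodeSketch is hypothetical and covers only five instructions; the opcode values used (iconst_0 = 0x03, iconst_1 = 0x04, bipush = 0x10, iadd = 0x60, ireturn = 0xAC) are those of the JVM instruction set.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical disassembler for a very small subset of the JVM instruction set.
final class OpcodeSketch {

    static List<String> disassemble(byte[] code) {
        List<String> mnemonics = new ArrayList<>();
        int pc = 0;
        while (pc < code.length) {
            int opcode = code[pc++] & 0xFF; // opcodes are unsigned one-byte values
            switch (opcode) {
                case 0x03: mnemonics.add("iconst_0"); break;             // no operands
                case 0x04: mnemonics.add("iconst_1"); break;             // no operands
                case 0x10: mnemonics.add("bipush " + code[pc++]); break; // one signed operand byte
                case 0x60: mnemonics.add("iadd"); break;                 // no operands
                case 0xAC: mnemonics.add("ireturn"); break;              // no operands
                default:
                    throw new IllegalArgumentException(
                            "opcode not covered by this sketch: 0x" + Integer.toHexString(opcode));
            }
        }
        return mnemonics;
    }

    public static void main(String[] args) {
        byte[] code = {0x03, 0x10, 42, 0x60, (byte) 0xAC};
        System.out.println(disassemble(code));
    }
}
```

Because the operand bytes follow their opcode directly, the decoder has to know how many operands each opcode takes before it can find the start of the next instruction.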

2.4. Class Loader

“Class files are then loaded by the class loader, either locally or through a network. The Java class libraries required are also loaded at this stage. Before the class files are executed, they must be checked by the Java verifier. If no verification errors occur, the classes are executed by the JVM” [33]. “To execute a Java program, the interpreter is given the name of the main class in the program. This byte code class file is then searched for in the file system. For each class A that is loaded in to memory, the class loader determines the classes that are used by A. If these classes are not already present in memory, they must also be loaded into memory. This action is performed recursively until all the classes used by a program are present in memory. The classes are then checked by the byte code verifier” [33].
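The recursive loading step described above amounts to computing the set of classes reachable from the main class. The following sketch assumes the "uses" relation is already available as a map from class name to used class names; the class LoadOrderSketch and that map are hypothetical and do not correspond to any real class loader API.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical model of "load every class used by the program".
final class LoadOrderSketch {

    // uses: for each class name, the names of the classes it refers to (assumed input).
    static Set<String> classesToLoad(String mainClass, Map<String, List<String>> uses) {
        Set<String> loaded = new LinkedHashSet<>();
        Deque<String> pending = new ArrayDeque<>();
        pending.push(mainClass);
        while (!pending.isEmpty()) {
            String name = pending.pop();
            if (!loaded.add(name)) {
                continue; // already present in memory, nothing to do
            }
            for (String used : uses.getOrDefault(name, Collections.<String>emptyList())) {
                pending.push(used);
            }
        }
        return loaded;
    }
}
```

The check on loaded.add is what stops the process when classes refer to each other in a cycle.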

2.5. Byte Code Verifier

“The problem with distributing programs across a network, such as the Internet, is that the recipient may not be able to trust the program. The program may corrupt the user’s system, either accidentally through poor programming, or deliberately, in the case of viruses. To stop this, Java byte codes are checked by the verifier before they are executed.

The "virtual hardware" of the Java Virtual Machine can be divided into four basic parts: the registers, the stack, the garbage-collected heap, and the method area. These parts are abstract, just like the machine they compose, but they must exist in some form in every JVM implementation.

The size of an address in the JVM is 32 bits. The JVM can, therefore, address up to 4 gigabytes (2 to the power of 32) of memory, with each memory location containing one byte. Each register in the JVM stores one 32-bit address. The stack, the garbage-collected heap, and the method area reside somewhere within the 4 gigabytes of addressable memory. The exact location of these memory areas is a decision of the implementor of each particular JVM. The method area, because it contains byte codes, is aligned on byte boundaries. The stack and garbage-collected heap are aligned on word (32-bit) boundaries” [33].

2.6. Supported Data Types

The JVM operates on two kinds of types: primitive types and reference types [32]. “There are, correspondingly, two kinds of values that can be stored in variables, passed as arguments, returned by methods, and operated upon: primitive values and reference values” [32]. Nearly all type checking is expected to have been done before run time, typically by a compiler (e.g. javac), rather than by the JVM itself [32]. “Instead, the instruction set of the Java virtual machine distinguishes its operand types using instructions intended to operate on values of specific types”, e.g. iadd, ladd, fadd and dadd are all JVM instructions that add two numeric values and produce numeric results, but each is specialized for its operand type: int, long, float, and double, respectively [32].
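The typed add instructions can be observed by compiling a few one-line methods and disassembling the result, for example with javap -c. The class TypedAddSketch below is a hypothetical example; the instruction sequences in the comments are what a typical javac is expected to emit for these static methods and may differ between compilers.

```java
// Hypothetical example: the same source operator "+" maps to a different
// JVM instruction depending on the operand type.
final class TypedAddSketch {

    static int addInts(int a, int b) {
        return a + b; // iload_0, iload_1, iadd, ireturn
    }

    static long addLongs(long a, long b) {
        return a + b; // lload_0, lload_2, ladd, lreturn (a long occupies two local slots)
    }

    static float addFloats(float a, float b) {
        return a + b; // fload_0, fload_1, fadd, freturn
    }

    static double addDoubles(double a, double b) {
        return a + b; // dload_0, dload_2, dadd, dreturn
    }
}
```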

Objects are either dynamically allocated class instances or arrays. A reference to an object is considered to have a JVM type of reference and can be thought of as a pointer to the object [32]. “More than one reference to an object may exist. Objects are always operated on, passed, and tested via values of type reference” [32].

2.7. JVM Registers

“The JVM has a program counter and three registers that manage the stack. It has few registers because the byte code instructions of the JVM operate primarily on the stack. This stack-oriented design helps keep the JVM's instruction set and implementation small. The JVM uses the program counter, or pc register, to keep track of where in memory it should be executing instructions. The other three registers -- optop register, frame register, and vars register -- point to various parts of the stack frame of the currently executing method. The stack frame of an executing method holds the state (local variables, intermediate results of calculations, etc.) for a particular invocation of the method” [51].

2.8. Method Area

“The method area is where the byte codes reside. The program counter always points to (contains the address of) some byte in the method area. The program counter is used to keep track of the thread of execution. After a byte code instruction has been executed, the program counter will contain the address of the next instruction to execute. After execution of an instruction, the JVM sets the program counter to the address of the instruction that immediately follows the previous one, unless the previous one specifically demanded a jump” [51].

2.9. Java Stack

“The Java stack is used to store parameters for and results of byte code instructions, to pass parameters to and return values from methods, and to keep the state of each method invocation. The state of a method invocation is called its stack frame. The vars, frame, and optop registers point to different parts of the current stack frame. There are three sections in a Java stack frame: the local variables, the execution environment, and the operand stack. The local variables section contains all the local variables being used by the current method invocation. It is pointed to by the vars register. The execution environment section is used to maintain the operations of the stack itself. It is pointed to by the frame register. The operand stack is used as a work space by byte code instructions. It is here that the parameters for byte code instructions are placed, and results of byte code instructions are found. The top of the operand stack is pointed to by the optop register. The execution environment is usually sandwiched between the local variables and the operand stack. The operand stack of the currently executing method is always the topmost stack section, and the optop register therefore always points to the top of the entire Java stack” [51].
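How the local variables, the operand stack and the optop register cooperate can be sketched with a toy interpreter for a single frame. FrameSketch is hypothetical: it models only four instructions (iload_0 = 0x1A, iload_1 = 0x1B, iadd = 0x60, ireturn = 0xAC), uses a fixed-size operand stack, omits the execution environment section, and assumes the code ends with ireturn.

```java
// Hypothetical single-frame interpreter for illustration only.
final class FrameSketch {

    static int run(byte[] code, int[] localVariables) {
        int[] operandStack = new int[8]; // fixed size chosen for this sketch
        int optop = 0;                   // index of the next free operand stack slot
        int pc = 0;                      // program counter into the method's byte codes
        while (true) {
            int opcode = code[pc++] & 0xFF;
            switch (opcode) {
                case 0x1A: // iload_0: push local variable 0
                    operandStack[optop++] = localVariables[0];
                    break;
                case 0x1B: // iload_1: push local variable 1
                    operandStack[optop++] = localVariables[1];
                    break;
                case 0x60: { // iadd: pop two ints, push their sum
                    int right = operandStack[--optop];
                    int left = operandStack[--optop];
                    operandStack[optop++] = left + right;
                    break;
                }
                case 0xAC: // ireturn: pop the result and leave the frame
                    return operandStack[--optop];
                default:
                    throw new IllegalArgumentException(
                            "opcode not covered by this sketch: 0x" + Integer.toHexString(opcode));
            }
        }
    }

    public static void main(String[] args) {
        byte[] addTwoLocals = {0x1A, 0x1B, 0x60, (byte) 0xAC};
        System.out.println(run(addTwoLocals, new int[] {2, 3}));
    }
}
```

Parameters arrive in the local variables, intermediate values live only on the operand stack, and the result is whatever is on top of that stack when the method returns.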

2.10. Garbage-Collected Heap

The heap is where the objects of a Java program reside [51]. Programmers can allocate memory to an object using the new operator [51]. “The Java language doesn't allow you to free allocated memory directly. Instead, the runtime environment keeps track of the references to each object on the heap, and automatically frees the memory occupied by objects that are no longer referenced -- a process called garbage collection” [51]. See [53] and [66] for more details.

2.11. Frames

A frame is where data and partial results are stored [32]. A new frame is created each time a method is invoked and is destroyed when its method invocation completes [32].

2.12. Java Instruction Set

For details about the JVM instruction set, see [32] Chapter 6 p.171, [50] and [53].

For more details about the JVM, see [2], [3], [37], [50], [51], [52], [53], [54], [55], [56], [57], [58], [59], [60], [61], [62], [63], [64], [65] and [66].

3. Class File Format

This section describes the class file format.

3.1. Introduction to Class File Format

The class file format defines the structure of files that a Java Virtual Machine (JVM) can run. It “contains everything a JVM needs to know about one Java class or interface” [52]. It “consists of a stream of 8-bit bytes. All 16-bit, 32-bit, and 64-bit quantities are constructed by reading in two, four, and eight consecutive 8-bit bytes, respectively. Multibyte data items are always stored in big-endian order, where the high bytes come first” ([32] p.93).

The length of a class file cannot be predicted before loading, as each class or interface contains a variable number of fields and methods [52]. The class file format handles this by “prefacing the actual information by its size or length. This way, when the class is being loaded by the JVM, the size of variable-length information is read first. Once the JVM knows the size, it can correctly read in the actual information” [52]. Information about the many parts of the class file is generally written to the class file with no space or padding between consecutive pieces of information, keeping the size of the file down to a minimum so that class files can travel across networks more easily [52].

“The order of class file components is strictly defined so JVM’s can know what to expect, and where to expect it, when loading a class file” [52]. The major components of a class file are (in order of appearance): magic number, minor and major version numbers, constant pool count, constant pool, access flags, this class, super class, interfaces, fields, methods, and attributes [52], [2]. The constant pool, interfaces, fields, methods and attributes components also have a count of the structures that they detail preceding them [32].

3.2. Magic Number

The first four bytes make up the magic number, whose value is always 0xCAFEBABE [52]. The magic number identifies the file as conforming to the class file format ([32] p.94).

3.3. Major Number, Minor Number

“The second four bytes of the class file contain the major and minor version numbers. These numbers identify the version of the class file format to which a particular class file adheres and allow the JVM to verify that the class file is loadable” [52].
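Reading the first eight bytes of a class file can be sketched with java.io.DataInputStream, whose readInt and readUnsignedShort methods read big-endian values. The class ClassHeaderSketch and its method name are hypothetical; the order of the two version fields (minor first, then major) follows the component order given in Section 3.1.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical reader for the magic number and version fields of a class file.
final class ClassHeaderSketch {

    // Returns {minorVersion, majorVersion}; rejects data without the magic number.
    static int[] readVersion(byte[] classFileBytes) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(classFileBytes))) {
            int magic = in.readInt(); // four bytes, high byte first
            if (magic != 0xCAFEBABE) {
                throw new IllegalArgumentException("not a class file: bad magic number");
            }
            int minorVersion = in.readUnsignedShort(); // two bytes
            int majorVersion = in.readUnsignedShort(); // two bytes
            return new int[] {minorVersion, majorVersion};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] header = {(byte) 0xCA, (byte) 0xFE, (byte) 0xBA, (byte) 0xBE, 0, 3, 0, 45};
        int[] version = readVersion(header);
        System.out.println("major " + version[1] + ", minor " + version[0]);
    }
}
```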

3.4. Constant Pool

After the constant pool count is the constant pool itself. This is a table of structures representing various string constants, class and interface names, final variable values, variable names and types, method names and signatures, and other constants that are referred to within the class file structure and its substructures ([32] p.95, [52]), not unlike a symbol table in compiler design terminology ([1] p.60-62). “A method signature is its return type and set of argument types” [52].

“The constant pool is organized as an array of variable-length elements. Each constant occupies one element in the array. Throughout the class file, constants are referred to by the integer index that indicates their position in the array. The initial constant has an index of one, the second constant has an index of two, etc. The constant pool array is preceded by its array size, so [the JVM] will know how many constants to expect when loading the class file. Each element of the constant pool starts with a one-byte tag specifying the type of constant at that position in the array. Once a JVM grabs and interprets this tag, it knows what follows the tag. For example, if a tag indicates the constant is a string, the JVM expects the next two bytes to be the string length. Following this two-byte length, the JVM expects to find length number of bytes, which make up the characters of the string” [52].
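The tag-then-length-then-bytes layout of a string constant can be sketched as follows. The class ConstantPoolSketch is hypothetical and handles only the string entry type (tag value 1, CONSTANT_Utf8 in [32]); it also assumes the characters are plain ASCII, for which the class file's modified UTF-8 encoding and standard UTF-8 coincide.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical reader for a single string entry of the constant pool.
final class ConstantPoolSketch {

    static String readStringEntry(byte[] entry) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(entry))) {
            int tag = in.readUnsignedByte(); // one-byte tag identifies the entry type
            if (tag != 1) {
                throw new IllegalArgumentException("not a string constant, tag = " + tag);
            }
            int length = in.readUnsignedShort(); // two-byte length prefix
            byte[] characters = new byte[length];
            in.readFully(characters);            // exactly 'length' bytes of character data
            return new String(characters, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] entry = {1, 0, 4, 'C', 'o', 'd', 'e'};
        System.out.println(readStringEntry(entry));
    }
}
```

Since the length is read before the data, the reader knows exactly where the next constant pool entry begins without any padding or terminator.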

3.5. Access Flags

The next two bytes represent the access flags, which “indicate whether or not this file defines a class or an interface, whether the class or interface is public or abstract, and (if it's a class and not an interface) whether the class is final” [52].
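The access flags are a bit mask, so decoding them is a matter of testing individual bits. In the sketch below the class AccessFlagsSketch is hypothetical; the mask values (public = 0x0001, final = 0x0010, interface = 0x0200, abstract = 0x0400) are those defined for class access flags in the JVM specification, and other bits are simply ignored.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical decoder for the class-level access flags.
final class AccessFlagsSketch {

    static List<String> describe(int accessFlags) {
        List<String> words = new ArrayList<>();
        if ((accessFlags & 0x0001) != 0) {
            words.add("public");
        }
        if ((accessFlags & 0x0010) != 0) {
            words.add("final");
        }
        if ((accessFlags & 0x0400) != 0) {
            words.add("abstract");
        }
        // The interface bit decides whether the file defines a class or an interface.
        words.add((accessFlags & 0x0200) != 0 ? "interface" : "class");
        return words;
    }

    public static void main(String[] args) {
        System.out.println(describe(0x0601));
    }
}
```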

3.6. This Class

The next two bytes represent the this class component, an index into the constant pool [52]. The constant pool entry at this index has two parts: a one-byte tag (which indicates that this element contains information about a class or interface) and a two-byte name index (which refers to a string constant containing the name of the class or interface) [52].

3.7. Super Class

The next two bytes represent the super class component [52]. For a class, the super class can be either zero or an index into the constant pool ([32] p.97). If it is zero, “then this class file must represent the class Object, the only class or interface without a direct superclass” ([32] p.97). Otherwise, the constant pool entry at this index is the name of the super class from which this class descends [52]. For an interface, the super class is an index into the constant pool, and the entry at this index must represent the class Object [32].

3.8. Interfaces

After the two-byte interfaces count is the interfaces component, an array structure whose entries are indexes into the constant pool. The entries at these indexes represent the interfaces implemented by the class [52].

3.9. Fields

After the fields count is the fields component. This is “an array of variable-length structures, one for each field. Each structure reveals information about one field such as the field's name, type, and, if it is a final variable, its constant value. Some information is contained in the structure itself, and some is contained in constant pool locations pointed to by the structure. The only fields that appear in the list are those that were declared by the class or interface defined in the file; no fields inherited from super classes or superinterfaces appear in the list” [52].

3.10. Methods

Following the method count is the methods component, which is an array of variable-length structures, one for each method. “The structure for each method contains several pieces of information about the method, including the method descriptor (its return type and argument list), the number of stack words required for the method's local variables, the maximum number of stack words required for the method's operand stack, a table of exceptions caught by the method, the byte code sequence, and a line number table” [52].

3.11. Attributes

Following the attributes count is the attributes component, which is an array of variable-length structures, one for each attribute. These attributes “give general information about the particular class or interface defined by the file. The JVM will silently ignore any attributes that it does not recognise” [52].

3.12. Class File Descriptors

“A descriptor is a string representing the type of a field or method. Descriptors are represented in the class file format using UTF-8 strings ([32] p.110, 111) and thus may be drawn, where not further constrained, from the entire Unicode character set” ([32] p.99).

3.12.1. Field Descriptors

A field descriptor represents the type of a class, instance, or local variable. It is a series of characters generated by the grammar in Figure 4.1.

MethodDescriptor ::= ( ParameterDescriptor* ) ReturnDescriptor
ParameterDescriptor ::= FieldType
ReturnDescriptor ::= FieldType | V
FieldDescriptor ::= FieldType
ComponentType ::= FieldType
FieldType ::= BaseType | ObjectType | ArrayType
BaseType ::= B | C | D | F | I | J | S | Z
ObjectType ::= L <classname> ;
ArrayType ::= [ ComponentType

Terminals = {B, C, D, F, I, J, S, Z, L, V, ;, [, (, )}

Non-Terminals = {MethodDescriptor, ParameterDescriptor, ReturnDescriptor, FieldDescriptor, ComponentType, FieldType, BaseType, ObjectType, ArrayType}

Figure 4.1: BNF grammar for class file descriptors ([32] p.102)

“The characters of BaseType, the L and ; of ObjectType, and the [ of ArrayType are all ASCII characters. The <classname> represents a fully qualified class or interface name. For historical reasons it is encoded in internal form” ([32] p.101). “In this internal form, the ASCII periods ('.') that normally separate the identifiers that make up the fully qualified name are replaced by ASCII forward slashes ('/'). For example, the normal fully qualified name of class Thread is java.lang.Thread. In the form used in descriptors in the class file format, a reference to the name of class Thread is implemented using a CONSTANT_Utf8_info structure representing the string "java/lang/Thread"” ([32] p.99).

BaseType Character   Type        Interpretation
B                    byte        signed byte
C                    char        Unicode character
D                    double      double-precision floating-point value
F                    float       single-precision floating-point value
I                    int         integer
J                    long        long integer
L<classname>;        reference   an instance of class <classname>
S                    short       signed short
Z                    boolean     true or false
V                    void        void (methods only)
[                    reference   one array dimension
[[                   reference   two array dimensions

Table 4.1. The interpretation of the class file descriptor types ([32] p.101)

“For example, the descriptor of an instance variable of type int is simply I. The descriptor of an instance variable of type Object is Ljava/lang/Object;” [32]. The descriptor of an instance variable that is a multidimensional int array (int array[][][]) is [[[I ([32] p.101).

3.12.2. Method Descriptors

“A method descriptor represents the parameters that the method takes and the value that it returns” ([32] p.102). It consists of a parameter descriptor and return descriptor. “A parameter descriptor represents a parameter passed to a method. A return descriptor represents the type of the value returned from a method” ([32] p.102). Method descriptors are generated by the grammar in Figure 4.1.

“For example, the method descriptor for the method “Object method(int i, double d, Thread t)” is “(IDLjava/lang/Thread;)Ljava/lang/Object;”. Note that internal forms of the fully qualified names of Thread and Object are used in the method descriptor” ([32] p.102).
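The descriptor grammar can be exercised with a small sketch that derives field and method descriptors from Java reflection objects. This is only an illustration of the rules in Figure 4.1; the class DescriptorSketch and its method names are hypothetical and are not part of any tool discussed in this report.

```java
class DescriptorSketch {
    /** Builds the field descriptor for a type, following the FieldType rules. */
    static String fieldDescriptor(Class<?> type) {
        if (type.isArray()) {
            // ArrayType ::= [ ComponentType
            return "[" + fieldDescriptor(type.getComponentType());
        }
        if (type.isPrimitive()) {
            if (type == byte.class) return "B";
            if (type == char.class) return "C";
            if (type == double.class) return "D";
            if (type == float.class) return "F";
            if (type == int.class) return "I";
            if (type == long.class) return "J";
            if (type == short.class) return "S";
            if (type == boolean.class) return "Z";
            return "V"; // void, which is valid only as a return descriptor
        }
        // ObjectType ::= L <classname> ; with the name in internal form
        return "L" + type.getName().replace('.', '/') + ";";
    }

    /** Builds a method descriptor from a return type and parameter types. */
    static String methodDescriptor(Class<?> returnType, Class<?>... params) {
        StringBuilder sb = new StringBuilder("(");
        for (Class<?> p : params) {
            sb.append(fieldDescriptor(p));
        }
        return sb.append(')').append(fieldDescriptor(returnType)).toString();
    }

    public static void main(String[] args) {
        System.out.println(fieldDescriptor(int[][][].class)); // [[[I
        // Object method(int i, double d, Thread t)
        System.out.println(methodDescriptor(Object.class, int.class, double.class, Thread.class));
    }
}
```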

4. Code Obfuscation Techniques

This section describes the techniques implemented by obfuscators. It defines exactly what an obfuscating transformation is, how its quality is measured and how transformations are classified by the type of data that they target.

4.1. Obfuscation Transformation

An obfuscation transformation is a transformation that alters the program code (usually the byte codes of a class file in the case of Java obfuscation) by removing or adding code with the aim of making the process of decompiling (and deobfuscation) more difficult ([8] p.3, [34]).

4.1.1. Definition

In [8], an obfuscation transformation is defined as one that changes a program P into P’, such that both P and P’ have the same observable behaviour regardless of any side effects such as more memory usage or degraded performance ([8] p.6). Furthermore, “the following conditions must hold:

• If P fails to terminate or terminates with an error condition, then P’ may or may not terminate.

• Otherwise, P’ must terminate and produce the same output as P.” ([8] p.7)

4.1.2. Quality

The quality of an obfuscation transformation is defined as a combination of four measures, viz. potency, resilience, stealth and cost:

“Tqual(P), the quality of a transformation T, is defined as the combination of the potency, resilience, [stealth] and cost of T”:

Tqual(P) = (Tpot(P), Tres(P), Tste(P), Tcost(P)) ([8] p.9)

A transformation of high quality will aim to have high potency, resilience and stealth and low cost.

4.1.3. Potency

The potency of an obfuscation transformation is a measure of how much more complex (or how different) the obfuscated program is compared with the original program.

4.1.3.1. Definition

“Let T be a [behaviour-conserving] transformation, such that P → P’ transforms P into a target program P’. Let E(P) be the complexity of P. Tpot(P), the potency of T with respect to a program P, is a measure of the extent to which T changes the complexity of P. It is defined as

Tpot(P) = E(P’)/E(P) – 1.

T is a potent obfuscating transformation if Tpot(P) > 0. ” ([8] p.7)
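The definition can be made concrete with a short sketch. It assumes that the complexity E has already been measured by some metric (for example program length); the values used below are invented for illustration, and the names PotencySketch, potency and isPotent are hypothetical.

```java
class PotencySketch {
    /** Tpot(P) = E(P')/E(P) - 1, given the two complexity values. */
    static double potency(double complexityOriginal, double complexityObfuscated) {
        if (complexityOriginal <= 0) {
            throw new IllegalArgumentException("E(P) must be positive");
        }
        return complexityObfuscated / complexityOriginal - 1.0;
    }

    /** T is a potent obfuscating transformation if Tpot(P) > 0. */
    static boolean isPotent(double complexityOriginal, double complexityObfuscated) {
        return potency(complexityOriginal, complexityObfuscated) > 0;
    }

    public static void main(String[] args) {
        // Invented figures: E(P) = 120 and E(P') = 180 under some metric.
        System.out.println(potency(120, 180));  // 0.5
        System.out.println(isPotent(120, 180)); // true
    }
}
```

A transformation that leaves the measured complexity unchanged has a potency of zero and is therefore not potent under this definition.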

4.1.3.2. Measure Scale

In the Software Complexity Metrics branch of Software Engineering, measures of complexity have been derived from theoretical and empirical studies of programs ([8] p.7). These measures support “statements such as if programs P and P’ are identical except that P’ contains more of property q than P, then P’ is more complex than P. Given such a statement, we can attempt to construct a transformation which adds more of the q-property to a program, knowing that this is likely to increase its obscurity.” ([8] p.7). The table in Appendix B provides an overview of some of the more popular software complexity measures ([8] p.8, [12]).

In [5], potency is measured on the following scale:

Scale, from low potency to high potency: low, medium, high.

Figure 5.1: Potency scale of an obfuscation transformation

“In order for T to be a potent obfuscating transformation, it should

• Increase overall program size (µ1) and introduce new classes and methods (µ7a).

• Introduce new predicates (µ2) and increase the nesting level of conditional and looping constructs (µ3).

• Increase the number of method arguments (µ5) and inter-class instance variable dependencies (µ7d).

• Increase the height of the inheritance tree (µ7b, µ7c).

• Increase long-range variable dependencies (µ4).” ([8] p.7).

4.1.4. Resilience

The resilience of an obfuscation transformation is a measure of how difficult it is for someone to undo the obfuscation, in terms of the amount of time required for a programmer to construct a deobfuscator (programmer effort) and the execution time and space required by the deobfuscator to effectively reduce the potency of the transformation (deobfuscator effort) ([8] p.8). The main difference between the potency and the resilience of a transformation is that the potency attempts to confuse a human reader, whereas the resilience attempts to confuse a deobfuscator ([8] p.9).

4.1.4.1. Definition

“Let T be a [behaviour-conserving] transformation, such that P → P’ transforms P into a target program P’. Let E(P) be the complexity of P. Tres(P) is the resilience of T with respect to a program P.

Tres(P)=one-way if information is removed from P such that P cannot be reconstructed from P’.

Otherwise,

Tres(P) = Resilience(TDeobfuscator effort, TProgrammer effort)

Where Resilience is the function defined in the matrix in [the diagram below]. ” ([8] p.9)

In [8], resilience is measured on the following scale:

Scale, from low resilience to high resilience: trivial, weak, strong, full, one-way.

Figure 5.2: Resilience scale of an obfuscation transformation ([8] p.9)

Transformations that are the most resilient are described as one-way, i.e. they can never be undone ([8] p.9). “This is typically because they remove information from the program that was useful to the human programmer, but which is not necessary in order to execute the program correctly. Other transformations typically add useless information to the program that does not change its observable behaviour, but which increases the “information load” on a human reader. These transformations can be undone with varying degrees of difficulty” ([8] p.9).

Figure 5.3 shows that deobfuscator effort is classified as either polynomial time or exponential time, whereas programmer effort is measured as a function of the scope of the transformation ([8] p.8). “This is based on the intuition that it is easier to construct counter-measures against an obfuscating transformation that only affects a small part of a procedure, than against one that may affect an entire program” ([8] p.8).

“The scope of a transformation is defined using terminology borrowed from code optimisation theory: T is a local transformation if it affects a single basic block ([1] p.528) of a control flow graph (CFG) ([1] p.532), it is global if it affects an entire CFG, it is inter-procedural if it affects the flow of information between procedures, and it is an inter-process transformation if it affects the interaction between independently executing threads of control” ([8] p.9).

                     Deobfuscator effort
Programmer effort    Polynomial time    Exponential time
Inter-process        full               full
Inter-procedural     strong             full
Global               weak               strong
Local                trivial            weak

Figure 5.3: Resilience scale of an obfuscation transformation in terms of programmer effort and deobfuscator effort ([8] p.9)
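The matrix in Figure 5.3 amounts to a simple table lookup, which the following sketch encodes. The enum and method names are hypothetical; the one-way rating is included in the result type for completeness but is never produced by the lookup, since it applies only when information has been removed from the program.

```java
class ResilienceSketch {
    enum Scope { LOCAL, GLOBAL, INTER_PROCEDURAL, INTER_PROCESS }
    enum DeobfuscatorEffort { POLYNOMIAL_TIME, EXPONENTIAL_TIME }
    enum Resilience { TRIVIAL, WEAK, STRONG, FULL, ONE_WAY }

    // Rows follow Scope order, columns follow DeobfuscatorEffort order.
    private static final Resilience[][] MATRIX = {
        {Resilience.TRIVIAL, Resilience.WEAK},   // local
        {Resilience.WEAK, Resilience.STRONG},    // global
        {Resilience.STRONG, Resilience.FULL},    // inter-procedural
        {Resilience.FULL, Resilience.FULL},      // inter-process
    };

    /** Looks up the resilience for a transformation that can be undone. */
    static Resilience resilience(Scope scope, DeobfuscatorEffort effort) {
        return MATRIX[scope.ordinal()][effort.ordinal()];
    }

    public static void main(String[] args) {
        System.out.println(resilience(Scope.GLOBAL, DeobfuscatorEffort.EXPONENTIAL_TIME)); // STRONG
    }
}
```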

4.1.5. Stealth

The stealth of an obfuscation transformation is a measure of how well hidden the changes to the code are after obfuscation. Obfuscated code that blends in well with the original code is difficult for a reverse engineer to find and therefore difficult to deobfuscate ([9] p.4). However, if the transformation “introduces new code that differs wildly from what is in the original program it will be easy to spot for a reverse engineer” ([10] p.3).

“Let T be a [behaviour-conserving] transformation and Q be a program. Ps(Q) is the set of language features used by Q, while Ps(T) is the set of language features introduced by T. Tste(Q) is the stealth of [T with respect to Q]:

$$T_{ste}(Q) = \begin{cases} 1.0, & \text{if } |P_s(T)| = 0 \\ 1.0 - \dfrac{|P_s(T) \setminus P_s(Q)|}{|P_s(T)|}, & \text{otherwise} \end{cases}$$

” ([33] p.23)

In [33], stealth is measured on the following scale:

Scale, from low stealth to high stealth: unstealthy, moderate, stealthy.

Figure 5.4: Stealth scale of an obfuscation transformation

“If Tste(Q) is close to 1, then T is considered to be stealthy. Conversely if Tste(Q) is close to 0, then T is unstealthy.” ([33] p.23)
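The stealth measure can be computed directly from the two feature sets. The sketch below assumes that language features are represented as plain strings; the feature names and the class StealthSketch are invented for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class StealthSketch {
    /**
     * Tste(Q): 1.0 if T introduces no features, otherwise one minus the
     * fraction of introduced features that Q does not already use.
     */
    static double stealth(Set<String> usedByProgram, Set<String> introducedByT) {
        if (introducedByT.isEmpty()) {
            return 1.0;
        }
        Set<String> foreign = new HashSet<>(introducedByT);
        foreign.removeAll(usedByProgram); // Ps(T) \ Ps(Q)
        return 1.0 - (double) foreign.size() / introducedByT.size();
    }

    public static void main(String[] args) {
        Set<String> q = new HashSet<>(Arrays.asList("if", "while", "int-arithmetic", "method-call"));
        Set<String> t = new HashSet<>(Arrays.asList("if", "int-arithmetic", "goto", "xor"));
        // Two of the four introduced features are foreign to Q.
        System.out.println(stealth(q, t)); // 0.5
    }
}
```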

4.1.6. Cost

The cost of an obfuscation transformation is a measure of the execution time/space overhead that it incurs on an obfuscated application ([8] p.9). This includes any changes in the file size or any degradation in performance (e.g. the obfuscated program runs slower than the original program) ([8] p.9).

“Let T be a [behaviour-conserving] transformation, such that P → P’ transforms P into a target program P’. Tcost(P) is the extra execution time/space of P’ compared to P.

$$T_{cost}(P) = \begin{cases} \text{dear}, & \text{if executing } P' \text{ requires exponentially more resources than } P \\ \text{costly}, & \text{if executing } P' \text{ requires } O(n^p),\ p > 1, \text{ more resources than } P \\ \text{cheap}, & \text{if executing } P' \text{ requires } O(n) \text{ more resources than } P \\ \text{free}, & \text{if executing } P' \text{ requires } O(1) \text{ more resources than } P \end{cases}$$

” ([8] p.9)

In [8], the cost is measured on the following scale:

Scale, from low cost to high cost: free, cheap, costly, dear.

Figure 5.5: Cost scale of an obfuscation transformation

4.2. Types of Obfuscation

In [8], obfuscation transformations are classified by the types of source code objects that they target. There are four basic types: layout, data, control and preventive transformations.

4.2.1. Layout Transformations

Layout transformations affect the layout of the program code. “Information that is unnecessary to the execution of the program, such as identifier names and comments, is altered” [34].

4.2.1.1. Changing Formatting

“The first transformation removes the source code formatting information sometimes available in Java class files. This is a one-way transformation because once the original formatting is gone it cannot be recovered; it is a transformation with low potency, because there is very little semantic content in formatting, and no great confusion is introduced when that information is removed; finally, this is a free transformation since the space and time complexity of the application is not affected.” ([8] p.10)

4.2.1.2. Scrambling Identifier Names

“Scrambling identifier names is also a one-way and free transformation” ([8] p.10). However, it has medium potency as identifiers contain a great deal of pragmatic information ([8] p.10).
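A minimal sketch of identifier scrambling is given below. It maps each distinct identifier to the next short name in the sequence a, b, ..., z, aa, ab, and so on. This is an assumption about how such a mapping could be generated, not a description of any particular obfuscator; a real tool would also have to preserve names such as main and any methods that override library methods. The names NameScramblerSketch, shortName and scramble are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class NameScramblerSketch {
    /** Returns the n-th short name: a, b, ..., z, aa, ab, ... (n starts at 0). */
    static String shortName(int n) {
        StringBuilder sb = new StringBuilder();
        n++;
        while (n > 0) {
            n--;
            sb.insert(0, (char) ('a' + n % 26));
            n /= 26;
        }
        return sb.toString();
    }

    /** Maps every distinct identifier to a meaningless short name. */
    static Map<String, String> scramble(List<String> identifiers) {
        Map<String, String> mapping = new LinkedHashMap<>();
        for (String id : identifiers) {
            if (!mapping.containsKey(id)) {
                mapping.put(id, shortName(mapping.size()));
            }
        }
        return mapping;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("accountBalance", "computeInterest", "accountBalance");
        System.out.println(scramble(names)); // {accountBalance=a, computeInterest=b}
    }
}
```

The mapping is one-way in the sense described above: once only the short names remain in the class file, the original names cannot be recovered from it.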

4.2.1.3. Removing Comments

Removing comments is also one-way and free, but it has high potency, as the comments contain information that greatly eases the understanding of the code; without comments the code will be much harder to understand ([8] p.30).

4.2.2. Data Transformations

Data transformations change the data structures of the program code ([34], [8] p.17).

4.2.2.1. Data Storage

There is usually a “natural” way to store a particular data item in a program, e.g. a local integer variable would be preferable as an iteration variable for iteration through the elements of an array ([8] p.17). While other variable types are possible, they would be less natural and probably less efficient ([8] p.17). “Data storage obfuscation affects how data is stored in memory. For example a local variable can be converted into a global one” [34].

“There are a number of simple storage transformations that promote variables from a specialised storage class to a more general class. Their potency and resilience are generally low, but used in conjunction with other transformations they can be quite effective” ([8] p.18).

Another data storage obfuscation technique is to convert static data into procedural data, as static data contains much pragmatic information that is useful to a reverse engineer ([8] p.18). “A simple way of obfuscating a static string is to convert it into a program that produces the string” ([8] p.18). The potency, resilience and cost of this type of transformation depend on the complexity of the string generation function ([8] p.30).
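The idea of replacing a static string by code that produces it can be sketched as follows. Note that this example substitutes a very simple XOR encoding for the string generation function; it is an assumption made for brevity and would offer little resilience in practice. The class name, the key and the encoded values are all hypothetical.

```java
class StringProducerSketch {
    private static final int KEY = 0x2A; // hypothetical key chosen for this sketch

    /** Obfuscation-time step: turn a literal into an array of encoded values. */
    static int[] encode(String literal) {
        int[] out = new int[literal.length()];
        for (int i = 0; i < out.length; i++) {
            out[i] = literal.charAt(i) ^ KEY;
        }
        return out;
    }

    /** Run-time step: rebuild the string from the encoded values. */
    static String produce(int[] encoded) {
        StringBuilder sb = new StringBuilder();
        for (int value : encoded) {
            sb.append((char) (value ^ KEY));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The class file would contain only these numbers, not the literal "Hi!".
        int[] encoded = {98, 67, 11};
        System.out.println(produce(encoded)); // Hi!
    }
}
```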

4.2.2.2. Data Encoding

“Data encoding obfuscation affects how the stored data is interpreted” [34] by selecting an unnatural encoding for common data types ([8] p.17). Figure 5.6 gives an example in which an integer variable is replaced by a simple encoding of its value.

Before                     After
int i = 1;                 int i = 11;
while (i < 1000) {         while (i < 8003) {
    ... A[i] ...;              ... A[(i-3)/8] ...;
    i++;                       i += 8;
}                          }

Figure 5.6: Data encoding obfuscation in which i is replaced by 8 * i + 3 [34]

“There will be a trade-off between resilience and potency on one hand and cost on the other. A simple encoding function such as the one above will add little extra execution time but can be deobfuscated using common compiler analysis techniques” ([8] p.18).
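The equivalence of the two loops in Figure 5.6 can be checked with a runnable version. The sketch below sums the array elements visited by each loop; the summation and the names DataEncodingSketch, sumOriginal and sumEncoded are additions made only so that the two variants can be compared.

```java
class DataEncodingSketch {
    /** Original loop: i runs from 1 to 999. */
    static long sumOriginal(int[] a) {
        long sum = 0;
        int i = 1;
        while (i < 1000) {
            sum += a[i];
            i++;
        }
        return sum;
    }

    /** Encoded loop: the variable holds 8*i + 3 instead of i. */
    static long sumEncoded(int[] a) {
        long sum = 0;
        int i = 11;                 // 8*1 + 3
        while (i < 8003) {          // 8*1000 + 3
            sum += a[(i - 3) / 8];  // decode before use as an index
            i += 8;                 // one original step is eight encoded steps
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] a = new int[1000];
        for (int k = 0; k < a.length; k++) {
            a[k] = k;
        }
        System.out.println(sumOriginal(a)); // 499500
        System.out.println(sumEncoded(a));  // 499500
    }
}
```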

“Boolean variables and other variables of restricted range can be split into two or more variables. The potency, resilience and cost of this transformation all grow with the number of variables into which the original variable is split” ([8] p.18). The resilience can be further enhanced via the implementation of algorithms in the obfuscated application that construct the run-time look-up tables ([8] p.18).
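Variable splitting can be illustrated with a boolean that is stored as two one-bit values p and q, combined here into a single code 2p + q. The particular encoding (codes 0 and 3 for false, 1 and 2 for true) and the look-up table for logical AND are assumptions made for this sketch; the class BooleanSplitSketch is hypothetical.

```java
class BooleanSplitSketch {
    /** Decodes a split boolean: the value is true exactly when p and q differ. */
    static boolean decode(int code) {
        int p = code >> 1;
        int q = code & 1;
        return (p ^ q) == 1;
    }

    // Look-up table for logical AND over codes. Each entry is one of the two
    // codes that represent the correct result, so the same logical value
    // does not always appear as the same number.
    static final int[][] AND = {
        {3, 0, 3, 0},
        {0, 1, 2, 3},
        {3, 2, 1, 0},
        {0, 3, 0, 3},
    };

    static int and(int a, int b) {
        return AND[a][b];
    }

    public static void main(String[] args) {
        int t = 2; // one encoding of true
        int f = 3; // one encoding of false
        System.out.println(decode(and(t, 1))); // true
        System.out.println(decode(and(t, f))); // false
    }
}
```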

4.2.2.3. Data Aggregation

“Data aggregation obfuscation alters how data is grouped together” [34].

Some aggregation transformations merge two or more scalar variables into one variable ([8] p.19). This transformation has weak resilience as a deobfuscator only needs to examine the set of arithmetic operations being applied to a particular variable in order to guess that it actually consists of two merged variables ([8] p.19). The transformation also has low potency and free cost ([8] p.30).
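As a sketch of scalar merging, two 32-bit int variables x and y can be packed into one 64-bit long, with y in the upper half and x in the lower half. The packing scheme and the names below are assumptions made for illustration. Arithmetic on y can be performed directly on the merged variable, whereas arithmetic on x would need extra care because a carry out of the lower half would corrupt y.

```java
class ScalarMergeSketch {
    /** Packs y into the upper 32 bits and x into the lower 32 bits. */
    static long merge(int x, int y) {
        return ((long) y << 32) | (x & 0xFFFFFFFFL);
    }

    static int x(long z) {
        return (int) z;
    }

    static int y(long z) {
        return (int) (z >> 32);
    }

    /** Performs y += n on the merged variable without unpacking it. */
    static long addToY(long z, int n) {
        return z + ((long) n << 32);
    }

    public static void main(String[] args) {
        long z = merge(7, -3);
        z = addToY(z, 5);
        System.out.println(x(z)); // 7
        System.out.println(y(z)); // 2
    }
}
```

The weak resilience noted above is visible here: the shift by 32 and the mask are strong hints to a deobfuscator that the long holds two separate values.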

Other transformations restructure arrays by splitting an array into several sub-arrays, merging two or more arrays into one array, folding an array (increasing the number of dimensions) or flattening an array (decreasing the number of dimensions) ([8] p.20). The potency of these transformations depends on the extent to which the arrays in question are transformed ([8] p.21). However, they have weak resilience and free cost (except for folding which has cheap cost as this transformation is more complicated than the others) ([8] p.30).

According to metrics µ7b and µ7c (Appendix B, [8] p.8), the complexity of a class grows with its depth in the inheritance hierarchy and the number of its direct descendants ([8] p.21). The complexity of a class can be increased either by splitting up the class or by inserting a new bogus class. Splitting up a class has low resilience as a deobfuscator can simply merge the classes together to get the original one. The resilience of bogus class insertion depends on the number of new classes and the increase in the depth of the inheritance hierarchy tree ([8] p.30). Both transformations have medium potency and free cost.

4.2.2.4. Data Ordering

“Data ordering obfuscation changes how data is ordered” [34]. “Programmers tend to organise their source code to maximise its locality. The idea is that a program is easier to read and understand if two items that are logically related are also physically close in the source text. This kind of locality works on every level of the source … [all] kinds of spatial locality can provide useful clues to a reverse engineer” ([8] p.16). It is therefore useful to randomise the order of declarations in the source application, particularly the order of methods and instance variables within classes and formal parameters within methods ([8] p.21).

“In many cases it will also be possible to reorder the elements within an array. Simply put, we provide an opaque encoding function f(i) which maps the ith element in the original array into its new position of the reordered array” ([8] p.21).
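Array reordering can be sketched with a simple index mapping. The function f used below, f(i) = (3i + 1) mod n, is only a stand-in chosen for illustration: it is a permutation whenever 3 and n have no common factor, but it is not opaque in the sense used in this report. All names are hypothetical.

```java
import java.util.Arrays;

class ArrayReorderSketch {
    static final int MULTIPLIER = 3; // must be coprime with the array length

    /** Maps the original index i to its position in the reordered array. */
    static int f(int i, int n) {
        return (i * MULTIPLIER + 1) % n;
    }

    /** Obfuscation-time step: builds the reordered array. */
    static int[] reorder(int[] original) {
        int n = original.length;
        int[] out = new int[n];
        for (int i = 0; i < n; i++) {
            out[f(i, n)] = original[i];
        }
        return out;
    }

    /** Run-time access: every use of A[i] becomes reordered[f(i)]. */
    static int get(int[] reordered, int i) {
        return reordered[f(i, reordered.length)];
    }

    public static void main(String[] args) {
        int[] original = {10, 20, 30, 40, 50, 60, 70};
        int[] reordered = reorder(original);
        System.out.println(Arrays.toString(reordered)); // [30, 10, 60, 40, 20, 70, 50]
        System.out.println(get(reordered, 4));          // 50
    }
}
```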

These types of transformations have low potency and free cost ([8] p.30). The resilience depends on the type of data structures they target (reordering methods and instance variables has one-way resilience, whereas reordering arrays has weak resilience) ([8] p.30).

4.2.3. Control Transformations

Control transformations alter the control constructs of the program code. “The idea here is to disguise the real control flow in a program” [34]. With these transformations a certain amount of computational overhead will be unavoidable, meaning that there is a trade-off between efficiency and obscurity ([8] p.10).

4.2.3.1. Opaque Constructs

An opaque variable is a variable that has some property q which is known a priori to the obfuscator, but which is difficult for the deobfuscator to deduce ([8] p.10). Similarly, an opaque predicate is a boolean expression for which a deobfuscator can deduce its outcome only with great difficulty, while this outcome is well known to the obfuscator ([8] p.10). Opaque constructs are the key to highly resilient control transformations ([8] p.10).

4.2.3.1.1. Definition

“A variable V is opaque at a point p in a program, if V has a property q at p which is known at obfuscation time. We write this as $V^q_p$ or $V^q$ if p is clear from context.

A predicate P is opaque at p if its outcome is known at obfuscation time. We write $P^F_p$ ($P^T_p$) if P always evaluates to False (True) at p, and $P^?_p$ if P sometimes evaluates to True and sometimes to False” ([8] p.10).

Figure 5.7: Different types of opaque predicates. Solid lines indicate paths that may sometimes be taken, dashed lines paths that will never be taken ([8] p.10).
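A small example may help to show how an opaquely true predicate is used to guard real code and to introduce a bogus branch. The predicate chosen here relies on the fact that the product of two consecutive integers is always even, which also holds under Java int overflow. It is an assumption made for illustration and would be easy for an analysis tool to recognise; the class and method names are hypothetical.

```java
class OpaquePredicateSketch {
    /** Always true: x * (x + 1) is even for every int x. */
    static boolean opaquelyTrue(int x) {
        return (x * (x + 1)) % 2 == 0;
    }

    /** Computes 2 * input + 1, hidden behind an opaque predicate. */
    static int compute(int input) {
        int result = input * 2;
        if (opaquelyTrue(input)) {
            result += 1;              // real code, always executed
        } else {
            result = result * 31 - 7; // bogus code, never executed
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(compute(20)); // 41
    }
}
```

A human reader or a decompiler sees two plausible branches, whereas the obfuscator knows that only the first can ever be taken.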

4.2.3.1.2. Trivial Constructs

“An opaque construct is trivial if a deobfuscator can crack it (deduce its value) by a static local analysis. An analysis is local if it is restricted to a single basic block of a control flow graph” ([8] p.11).

{
    int v, a=5; b=6;
    v = a + b;
    if (b > 5)^T …
    if (random(1,5) < 0)^F …
}

Figure 5.8: An example of a trivial opaque construct. The superscripts T and F mark predicates that always evaluate to True and False respectively ([8] p.12)

4.2.3.1.3. Weak Constructs

“An opaque construct is weak if a deobfuscator can crack it by a static global analysis. An analysis is global if it is restricted to a single control flow graph” ([8] p.11).

{
    int v, a=5; b=6;
    if (…) …
    … (b is unchanged)
    if (b < 7)^T a++;
    v = (a > 5) ? v=b*b : v=b
}

Figure 5.9: An example of a weak opaque construct ([8] p.12)

4.2.3.2. Control Aggregation

“Control aggregation obfuscation changes the way in which program statements are grouped together” [34]. These transformations break up computations that logically belong together and merge computations that do not ([8] p.10). In other words, “(1) code which the programmer aggregated into a method (presumably because it logically belonged together) should be broken up and scattered over the program and (2) code which seems not to belong together should be aggregated into one method” ([8] p.14).

Method inlining removes procedural abstractions from the program ([8] p.14). It is “a highly resilient transformation, since once a procedure call has been replaced with the body of the called procedure and the procedure itself has been removed, there is no trace of the abstraction left in the code” ([8] p.14). Method outlining involves turning a sequence of statements into a subroutine and is a very useful companion to inlining ([8] p.14). Both transformations have medium potency and free cost, but outlining is strongly resilient.

Method interleaving transformations merge the bodies and parameters of two or more methods declared in the same class and add an extra opaque variable to discriminate between calls to the individual methods ([8] p.15). Method cloning transformations involve generating methods (within the same class) that appear different to each other but have identical behaviour, with an opaque predicate to select the correct method ([8] p.15). The quality of both types of transformations depends on the quality of the opaque predicate used.
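Method interleaving can be sketched with two small methods that are merged into one, with an extra parameter that selects between the original behaviours. In this illustration the selector is a plain integer tested for parity; a real obfuscator would use an opaque variable instead. The methods area, perimeter and merged are hypothetical.

```java
class InterleaveSketch {
    // The two original methods.
    static int area(int w, int h) {
        return w * h;
    }

    static int perimeter(int w, int h) {
        return 2 * (w + h);
    }

    /**
     * Interleaved replacement: both bodies are mixed together and the
     * selector decides which result is returned (even: area, odd: perimeter).
     */
    static int merged(int w, int h, int selector) {
        int t = w + h;
        int u = w * h;
        return (selector % 2 == 0) ? u : 2 * t;
    }

    public static void main(String[] args) {
        // A call to area(3, 4) becomes merged(3, 4, 4),
        // and a call to perimeter(3, 4) becomes merged(3, 4, 7).
        System.out.println(merged(3, 4, 4)); // 12
        System.out.println(merged(3, 4, 7)); // 14
    }
}
```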

Looping transformations affect the control flow in loop constructs. This includes loop blocking, in which the loop is “decomposed into blocks of a constant blocking factor (the step is multiplied by this factor)” [39], loop unrolling, in which the body of the loop is replicated a number of times and loop fission, in which a loop with a compound body is turned into several loops with the same iteration space ([8] p.16). All of these transformations have low potency, weak resilience and free cost, except for loop unrolling, which has cheap cost ([8] p.30).
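Two of these loop transformations can be shown side by side with their originals. The sketch below unrolls a summation loop by a factor of two and applies fission to a loop with a compound body. The method names are hypothetical and the loops are deliberately simple.

```java
import java.util.Arrays;

class LoopTransformSketch {
    static long sumOriginal(int[] a) {
        long sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i];
        }
        return sum;
    }

    /** Loop unrolling: the body is replicated twice per iteration. */
    static long sumUnrolled(int[] a) {
        long sum = 0;
        int i = 0;
        for (; i + 1 < a.length; i += 2) {
            sum += a[i];
            sum += a[i + 1];
        }
        for (; i < a.length; i++) { // leftover element for odd lengths
            sum += a[i];
        }
        return sum;
    }

    /** Original loop with a compound body. */
    static int[][] fillOriginal(int[] a) {
        int[] b = new int[a.length];
        int[] c = new int[a.length];
        for (int i = 0; i < a.length; i++) {
            b[i] = a[i] + 1;
            c[i] = a[i] * 2;
        }
        return new int[][] {b, c};
    }

    /** Loop fission: one loop per statement, same iteration space. */
    static int[][] fillFissioned(int[] a) {
        int[] b = new int[a.length];
        int[] c = new int[a.length];
        for (int i = 0; i < a.length; i++) {
            b[i] = a[i] + 1;
        }
        for (int i = 0; i < a.length; i++) {
            c[i] = a[i] * 2;
        }
        return new int[][] {b, c};
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 3, 4, 5};
        System.out.println(sumOriginal(a) == sumUnrolled(a)); // true
        System.out.println(Arrays.deepEquals(fillOriginal(a), fillFissioned(a))); // true
    }
}
```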

4.2.3.3. Control Ordering

“Control ordering obfuscation alters the order in which statements are executed. For example, loops can be made to iterate backwards instead of forwards” [34]. This follows on from the same principles of locality outlined in Section 4.2.2.4. “There is locality among terms within expressions, statements within basic blocks, basic blocks within methods, methods within classes, classes within files, etc. For some types of items (methods within classes, for example) this is trivial. In other cases (such as statements within basic blocks) a data dependency analysis will have to be performed to determine which reorderings are legal” ([8] p.16,17). Data dependency analysis “involves the determination of what variables depend on what other variables” [29].

“These transformations have low potency (they do not add much obscurity to the program) but their resilience is high, in many cases one-way. For example, when the placement of statements within a basic block has been randomised, there will be no traces of the original order left in the resulting code” ([8] p.17). Control ordering transformations also have low cost ([8] p.30).

4.2.3.4. Control Computations

Computation transformations make algorithmic changes to the source application. This involves hiding the real control flow behind irrelevant statements that do not contribute to the actual computations, introducing code sequences at the object code level for which there exist no corresponding high-level language constructs, removing real control-flow abstractions and adding bogus control-flow abstractions. For these transformations, the quality depends on the quality of the opaque predicate and the nesting depth at which the construct is inserted ([8] p.30).

4.2.4. Preventive Transformations

The main goal of preventive transformations is not to alter any particular type of program code, but rather to cause a deobfuscator or decompiler to crash, or to stop it from successfully undoing the transformations ([8] p.24). There are two types: inherent and targeted.

4.2.4.1. Inherent Preventive Transformations

Inherent preventive transformations attempt to make known deobfuscation techniques harder to employ ([33] p.16), e.g. by reversing an iterating control construct and inserting bogus data dependencies to prevent the deobfuscator from undoing the transformation ([8] p.24). They have medium potency, weak resilience and free cost ([8] p.31).

4.2.4.2. Targeted Preventive Transformations

Targeted preventive transformations are designed to counter specific analysis tools, e.g. by inserting code so as to cause the deobfuscator to crash ([8] p.24). They have free cost, but as they may be susceptible to attack from other deobfuscators, they also have low potency and trivial resilience ([8] p.31).

5. Evaluation of Currently Available Tools

This section discusses the evaluation of the obfuscation tools currently available.

5.1. Purpose of Evaluation

The purpose of this evaluation is to review what the currently available obfuscation tools have to offer to the software developer of today. This will be in terms of both the quality of the obfuscation transformations on offer and the quality of the obfuscator as a piece of software.

5.2. Web Resources

All of the tools are available on the World Wide Web [28]. Most of the initial interest seems to date from around 1997, when the main research papers on obfuscation were released [8] [9] [10], and there does not seem to be as much interest in obfuscation now as there was then. This is probably because most of the present research appears to be conducted in industry research and development departments rather than in academia [34].

5.3. Comparison of Obfuscation Reviews Available

In [31], a comparative survey of Java obfuscators available on the Internet is given. It examines 13 obfuscators by running two “benchmark” programs and a “Logic program”. The first benchmark program tests for layout obfuscation (see Section 4.2.1.) and data obfuscation (see Section 4.2.2.), whereas the second benchmark program tests for control obfuscation (see Section 4.2.3.). The logic program is an application from another department, which had requested that it be obfuscated. An obfuscation metrics table is used to rate the power of the obfuscations provided, based on whether layout, data and control obfuscation are offered and on the effect on the programs after obfuscation (including recompilation of the decompiled source code).

Using two programs to test the transformations offered was deemed too complicated for this report; a single program that tests for all types of transformation was considered preferable.

The survey makes no mention of the effects on classes and inheritance. It seems to focus exclusively on the power of the transformations (save for the fact that software that fails to download, install or run scores zero in all categories). This report aims to also examine the obfuscators as software applications and how usable they are. It also examines 29 obfuscators as opposed to just 13.

[56] also examines whether the decompiled source code can be recompiled and run as before. This report is more concerned with the effect on the class file and how the source code looks after decompilation. The point is to obscure the original code; it does not matter whether the decompiled obfuscated code can be compiled and still works, as long as the obfuscated class file itself still works. The obfuscators are given a mark of either 1 or 0 for each criterion. While this probably made the software easier to assess, it does not really give an insight into the quality of the transformations offered.

[35] is a general article that demonstrates the use of the Mocha decompiler [43] and how the Crema obfuscator [99] can protect Java class files from Mocha.

5.4. Evaluation Criteria

The evaluation mark scheme for the obfuscators is as follows. For each obfuscator, a mark between 0 and 5 inclusive is given for each of the ten criteria below. The marks are then added up to give a total out of 50, which is then multiplied by 2 to give a mark out of 100. This will give a recognisable indication of the quality of the software. The criteria are as follows: cost, availability, range of transformations offered, potency, resilience, stealth, execution cost, effectiveness against decompilation, usability and documentation. For obfuscators that provide more than one type of transformation, the potency, resilience, stealth and execution cost ratings will be based on the transformation that gives the highest rating possible (unless the performance of the application is heavily degraded, in which case the execution cost rating will be poor).
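The arithmetic of the mark scheme can be written down as a short sketch. It assumes ten criteria, one for each of Sections 5.4.1 to 5.4.10, each marked from 0 to 5; the marks in the example are invented and the class name is hypothetical.

```java
class EvaluationScoreSketch {
    static final int CRITERIA = 10;
    static final int MAX_MARK = 5;

    /** Adds the ten marks (total out of 50) and doubles the total (out of 100). */
    static int totalMark(int[] marks) {
        if (marks.length != CRITERIA) {
            throw new IllegalArgumentException("expected " + CRITERIA + " marks");
        }
        int sum = 0;
        for (int mark : marks) {
            if (mark < 0 || mark > MAX_MARK) {
                throw new IllegalArgumentException("mark out of range: " + mark);
            }
            sum += mark;
        }
        return sum * 2;
    }

    public static void main(String[] args) {
        // Invented marks for a fictional obfuscator, in the order of Section 5.4.
        int[] marks = {3, 4, 2, 5, 1, 0, 4, 3, 2, 1};
        System.out.println(totalMark(marks)); // 50
    }
}
```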

5.4.1. Cost

The monetary cost of the obfuscator and whether it is appropriate for its intended audience. Cheaper or free software will score higher than more expensive software.

5.4.2. Availability

How easily available the software is: whether it is freely available from the web site or whether permission from the owners is required. Software that is easy and quick to download will score higher than software that is more difficult to get hold of (or even unavailable).

5.4.3. Range of Transformations Offered

Obfuscators that provide a wide range of obfuscation transformations will score higher in this category than those that provide only one type of transformation (including preventive) or do not really obfuscate at all.

5.4.4. Potency

This is a measure of how different the obfuscated application is from the original application ([8] p.7) (see Section 4.1.3.). Transformations that have low potency will score lower than those with high potency.

5.4.5. Resilience

This is a measure of how resilient the transformations are to decompilation and deobfuscation techniques ([8] p.8) (see Section 4.1.4.). Transformations that have trivial resilience will score lower than those with one-way resilience.

5.4.6. Stealth

This is a measure of how well the obfuscated code blends in with the rest of the code ([9] p.4) (see Section 4.1.5.). Transformations that are unstealthy will score lower than those that are very stealthy.

5.4.7. Execution Cost

This is a measure of the extra time/space complexity added to the program after obfuscation ([8] p.9) (see Section 4.1.6.). Transformations that have dear cost will score lower than those with free cost.

5.4.8. Effectiveness against Decompilation

The obfuscated code will be passed to a decompiler to see if the transformations are resistant to decompilation. Transformations which prove to be difficult for the decompiler to undo will gain higher marks than those which prove to be easily undone. A comparison will be made between the decompiled obfuscated code and the decompiled code of the original program.

The decompiler that is used is DJ Java Decompiler [40], which is freely available for both Windows and Unix operating systems.

5.4.9. Usability

This will rate the usability of the software. This includes how easy it is to install and run, whether complicated command line arguments and classpath settings are required, and whether a high-quality Graphical User Interface (GUI) is provided.

5.4.10. Documentation

How good is the documentation provided? Documentation that is straightforward to follow will score higher marks than documentation that is confusing, hard to follow or non-existent. Also, simple "readme" files will score less than full documentation in HTML format.

5.4.11. Test Data

The test data will consist of the class files of three Java program files: two files (File.java and File2.java) that represent two separate classes, one of which extends the other, and one file (Test.java) that contains the main program that uses the classes. Where Java archive (JAR) files are required, the class files will be compressed into a JAR file, which will be used as input.


The test data will contain the following types of data structures and control constructs:

• int, double, float, array, String and char variables
• constants
• class objects
• private, public and protected variables/methods
• if then else statements
• while do statements
• case of statements
• for do statements

The program listings of the source code of the test data can be found in Appendix D.1. Appendix D.2 shows the disassembled class files as streams of byte code (see Section 2.5). Appendix D.3 shows the source code decompiled from the class files in Appendix D.2.
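Where an obfuscator accepts only a JAR file as input, the class files can be packed with the standard java.util.jar API. The following is a minimal sketch under stated assumptions: the class name JarPacker and the method pack are hypothetical and are not part of the test data, and the entries are placed at the root of the archive because the test classes are assumed to be in the default package.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.jar.Attributes;
import java.util.jar.JarEntry;
import java.util.jar.JarOutputStream;
import java.util.jar.Manifest;

// Hypothetical helper: packs a list of class files into a single JAR file.
final class JarPacker {

    static void pack(Path jarFile, List<Path> classFiles) throws IOException {
        // A manifest is only written if it carries a version attribute.
        Manifest manifest = new Manifest();
        manifest.getMainAttributes().put(Attributes.Name.MANIFEST_VERSION, "1.0");

        try (OutputStream out = Files.newOutputStream(jarFile);
             JarOutputStream jar = new JarOutputStream(out, manifest)) {
            for (Path classFile : classFiles) {
                // Assumption: default package, so the entry name is the file name.
                jar.putNextEntry(new JarEntry(classFile.getFileName().toString()));
                Files.copy(classFile, jar);
                jar.closeEntry();
            }
        }
    }
}
```

For the test data described above, the list passed to pack would hold the paths of File.class, File2.class and Test.class.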

5.5. Obfuscation Software Tools

These are the obfuscation software tools evaluated in this report. For each of these, examples of obfuscation transformations are given using the addition program shown below.

public class Addition {

    public static void main(String argv[]) {
        int a = 2;
        int b = 5;
        int result;

        System.out.println("\nA Simple Addition Java Program\n");
        System.out.println("First number: " + a);
        System.out.println("Second number: " + b + "\n");

        result = a + b;
        System.out.println("Result: " + result + "\n");
        return;
    }
}

5.5.1. Zelix KlassMaster

Zelix KlassMaster [101] provides layout obfuscation (changing class, field and method names and line number scrambling), data obfuscation (string encryption) and control obfuscation (selection and looping statements altered such that there is no direct Java source code equivalent). It also provides code optimisation.


import java.io.PrintStream;

public class a {

    public a() {
    }

    public static void main(String args[]) {
        boolean flag = a;
        byte byte0 = 2;
        byte byte1 = 5;
        System.out.println(zkmToString("Fu56Y!Dy\000\020\rPq\fD%[{Ez- BtE`>[r\027Q!>"));
        System.out.println(zkmToString("\n]g\026DlZ`\bR)F/E") + byte0);
        System.out.println(zkmToString("\037Qv\n^(\024{\020].Qg_\020") + byte1 + "\n");
        int i = byte0 + byte1;
        System.out.println(zkmToString("\036Qf\020\\8\0165") + i + "\n");
        if(flag) {
            int j = b.a;
            b.a = ++j;
        }
    }

    private static String zkmToString(String s) {
        char ac[] = s.toCharArray();
        int i = ac.length;
        int k;
        for(int j = 0; j < i; j++) {
            switch(j % 5) {
            case 0: // '\0'
                k = 0x4c;
                break;
            case 1: // '\001'
                k = 52;
                break;
            case 2: // '\002'
                k = 21;
                break;
            case 3: // '\003'
                k = 101;
                break;
            default:
                k = 48;
                break;
            }
            ac[j] ^= k;
        }
        return new String(ac);
    }

    public static boolean a;
}
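The zkmToString method in the decompiled listing reverses the string encryption by XOR-ing each character with one of five key values, chosen by the character's position modulo 5. Because XOR is its own inverse, the same routine both encrypts and decrypts. The following is a minimal sketch that reuses the key values from the listing to recover one of the encrypted literals; the class name XorStringDemo and the method name xor are hypothetical, and the sketch illustrates the principle rather than reproducing the tool's internals.

```java
// Hypothetical demo class: recovers a string literal from the decompiled listing.
final class XorStringDemo {

    // Key values copied from the switch statement in zkmToString.
    private static final int[] KEY = {0x4c, 52, 21, 101, 48};

    // XORs each character with the key value selected by its position modulo 5.
    // Applying the method twice returns the original string.
    static String xor(String s) {
        char[] chars = s.toCharArray();
        for (int j = 0; j < chars.length; j++) {
            chars[j] ^= KEY[j % KEY.length];
        }
        return new String(chars);
    }

    public static void main(String[] args) {
        // Last encrypted literal in the obfuscated main method.
        String encrypted = "\036Qf\020\\8\0165";
        String plain = xor(encrypted);
        System.out.println(plain);                        // prints "Result: "
        System.out.println(xor(plain).equals(encrypted)); // prints "true"
    }
}
```

Since the key and the decoding routine both ship inside the class file, this form of string encryption hides literals from a casual reader of the decompiled source but can be undone by anyone who runs the routine.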

The web site provides an FAQ and a support contact email address to answer queries. A record of releases and the new features they introduce is available. Documentation is available on-line and also available for downloading to the user’s site (in HTML format). Official versions are available for US $399.00 (£273.62), and US $199.00 (£136.47). Upgrade licences and discounts for each separate licence requested are also provided. A free 30-day evaluation version is available for download. All versions run perfectly on Unix and Windows platforms. The transformation proves difficult for the decompiler to undo. See Appendix D.4 for the results on the test data.

5.5.2. Jshrink

Jshrink [79] provides layout obfuscation (name changing) and code optimisation. It is available as an evaluation jar file, the licence for which is obtained via email and lasts for 10 days. The official release is available for US $95.00 (£65.15). This report evaluates an evaluation copy of a previous release, obtained via email, because the latest version was not released until after this obfuscator had been evaluated.

public class File {

    public File() {
        $c = 0.012F;
        $d = 8D;
    }

    public static int Sum(int i, int j) {
        int k = i + j;
        return k;
    }

    int $;
    int $b;
    private float $c;
    protected double $d;
}

The obfuscation did not work for the addition program, but did work for the test data. See Appendix D.5 for details.
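The name changing visible in the listing above, where fields receive short meaningless names while the public class and method names are left alone, can be illustrated with a small renaming table. The following is a minimal sketch under stated assumptions: the class NameMapSketch, its methods and the identifiers in the usage note are hypothetical, and the naming scheme (a, b, ..., z, aa, ab, ...) is chosen for illustration and is not Jshrink's actual algorithm.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a layout-obfuscation renaming table.
final class NameMapSketch {

    // Produces "a", "b", ..., "z", "aa", "ab", ... for index 0, 1, 2, ...
    static String shortName(int index) {
        StringBuilder sb = new StringBuilder();
        int n = index;
        do {
            sb.insert(0, (char) ('a' + n % 26));
            n = n / 26 - 1;
        } while (n >= 0);
        return sb.toString();
    }

    // Maps each distinct identifier to the next unused short name.
    // The caller decides which identifiers are safe to rename,
    // for example by excluding public names that other code depends on.
    static Map<String, String> buildMap(List<String> identifiers) {
        Map<String, String> renamed = new LinkedHashMap<>();
        for (String id : identifiers) {
            if (!renamed.containsKey(id)) {
                renamed.put(id, shortName(renamed.size()));
            }
        }
        return renamed;
    }
}
```

For example, passing the hypothetical identifiers count, total and count to buildMap maps count to a and total to b; the repeated identifier reuses its existing entry so that every reference is renamed consistently.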