DATA OBFUSCATION
What is data obfuscation?
Data obfuscation targets the data structures used in a program and encrypts its literals. Techniques include modifying inheritance relations, restructuring arrays, and similar changes. Because these transformations thoroughly alter a program's data structures, the obfuscated code can become so complicated that recreating the original source code is impractical.
Data storage obfuscations change the type of storage used for variables. One example is converting a local variable into a global variable. The obfuscator ensures that different methods may use the variable at different times, but never at the same time.
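As a rough sketch of this storage change, the two functions below each used to own a local accumulator and now share one global. The name g_scratch and both functions are hypothetical, chosen only for illustration. The sketch is safe only under the assumption stated in the text: the two functions never use the variable at the same time (no nesting, no threads).

```c
/* Hypothetical shared global that replaces two unrelated locals. */
static int g_scratch;

/* Before obfuscation: int acc = 0; ... return acc; */
int sum_to(int n)
{
    g_scratch = 0;
    for (int i = 1; i <= n; i++)
        g_scratch += i;
    return g_scratch;
}

/* Before obfuscation: int prod = 1; ... return prod; */
int product_to(int n)
{
    g_scratch = 1;
    for (int i = 1; i <= n; i++)
        g_scratch *= i;
    return g_scratch;
}
```

Each function re-initializes the global before using it, so the results match the versions with local variables while a reader can no longer tell at a glance which values belong to which computation.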
A data encoding obfuscation changes the way a program interprets stored data. For example, an index variable i can be stored in encoded form as 8*i+3: every value assigned to the index is replaced by that expression, and wherever the code needs the real index, the obfuscator inserts the expression (i-3)/8, where i now denotes the stored value. Instead of incrementing the variable by one, the code adds eight. In other words, the obfuscation scales and offsets the index and computes the real index only when it is about to be used.
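The encoding can be sketched in C as follows. The scale 8 and offset 3 come from the text; the helper names encode_index and decode_index, and the summing loop itself, are assumptions made for illustration.

```c
/* Hypothetical helpers: the stored value is 8*i+3, the real index (e-3)/8. */
static int encode_index(int i) { return 8 * i + 3; }
static int decode_index(int e) { return (e - 3) / 8; }

/* Plain loop: sums the first n elements of a. */
int sum_plain(const int *a, int n)
{
    int total = 0;
    for (int i = 0; i < n; i++)
        total += a[i];
    return total;
}

/* Obfuscated loop: e holds the encoded index and advances by 8, not 1.
   The real index is recovered only at the point of use. */
int sum_encoded(const int *a, int n)
{
    int total = 0;
    for (int e = encode_index(0); e < encode_index(n); e += 8)
        total += a[decode_index(e)];
    return total;
}
```

Both loops visit the same elements in the same order, so sum_encoded returns the same result as sum_plain for any array.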
A data aggregation obfuscation alters how data is grouped together in memory. An example is turning a 2D array into a 1D array or vice versa. The basic idea is to change the familiar conceptual mapping to a less common, in-memory representation so that it's more difficult for a person to understand your algorithms. For example, a chessboard is often modeled in a program as a matrix, but changing it to a one-dimensional array works just as well for the CPU.
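A minimal sketch of the chessboard case, assuming an 8x8 board stored row by row in a one-dimensional array. The names BOARD_SIZE, flat_index, set_piece and get_piece are hypothetical.

```c
#define BOARD_SIZE 8

/* Maps the conceptual (row, col) position to a slot in the 1D array. */
int flat_index(int row, int col)
{
    return row * BOARD_SIZE + col;
}

/* board must point to BOARD_SIZE * BOARD_SIZE characters. */
void set_piece(char *board, int row, int col, char piece)
{
    board[flat_index(row, col)] = piece;
}

char get_piece(const char *board, int row, int col)
{
    return board[flat_index(row, col)];
}
```

An obfuscator would typically inline the index arithmetic at every access, so the two-dimensional structure no longer appears anywhere in the code.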
A data ordering obfuscation changes how data is ordered. In C-based languages, it is common to see the ith element of a collection of data accessed by indexing to position i in an array. A data ordering obfuscation would determine the index in the array of the data by calling some function f(i). Again, this simply rearranges the storage of information in a way that less closely models the normal conceptual model.
Understanding a simple algorithm, such as sorting the elements of an array, is easy. Applying a simple data transformation to such an algorithm can make the code hard to understand. We will apply data transformations to the following piece of code:
for(i=0;i<10;i++)
    for(j=i;j<10;j++)
        if(a[j]>a[i]) swap(a[i],a[j]);
Aggregation
The first data transformation we would like to discuss is restructuring arrays. Arrays can be split, merged, folded or flattened. We will merge two or more arrays into one by interleaving their elements.
Applying this transformation to our example forces an attacker to evaluate the details of the algorithm in order to understand it. The test and swap line is transformed into the following code, assuming that a occupies the odd indices of the interleaved array.
if(a[2*j+1]>a[2*i+1]) swap(a[2*j+1],a[2*i+1]);
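For context, here is a hedged sketch of the whole merged sort. It assumes two arrays of ten integers, with a stored on the odd indices and a second, unrelated array b on the even indices. The names interleave, sort_a_in_merged and swap_int are hypothetical.

```c
#define N 10

static void swap_int(int *x, int *y)
{
    int t = *x;
    *x = *y;
    *y = t;
}

/* Merges a and b into one array of 2*N elements:
   b goes to the even indices, a to the odd indices. */
void interleave(const int *a, const int *b, int *merged)
{
    for (int k = 0; k < N; k++) {
        merged[2 * k] = b[k];
        merged[2 * k + 1] = a[k];
    }
}

/* Same sort as the original example (descending order), but every
   access to a[k] has become merged[2*k+1]. The elements of b on the
   even indices are never touched. */
void sort_a_in_merged(int *merged)
{
    for (int i = 0; i < N; i++)
        for (int j = i; j < N; j++)
            if (merged[2 * j + 1] > merged[2 * i + 1])
                swap_int(&merged[2 * j + 1], &merged[2 * i + 1]);
}
```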
Finding similar transformations for arrays is not hard, and neither is implementing them in the right tool. However, because it is already difficult to obtain type information in TXL, this data transformation is impossible to apply safely there. For example, modifying a data structure requires the location of every instance of that data structure. On a parse tree this is non-trivial, as the same name might be used in different scopes for different data structures. While the parse tree does contain sufficient information to deduce the type of a data structure, it is more straightforward to perform this on an intermediate representation that contains a symbol table.
Ordering
An obfuscating transformation that reorders arrays is not difficult in SUIF either. A symbol table is at our disposal, so each pointer to the array is known, which makes finding all accesses to the array straightforward. The indices used to access the array can be changed by a function f that maps the original position i to its new position in the reordered array. The test and swap line of our example is changed into the following code, which no longer orders the array in memory as the original program did. However, because every index in the program is changed in the same way, the resulting code stays functionally equivalent to the original.
if(a[f(j)]>a[f(i)]) swap(a[f(i)],a[f(j)]);
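The text does not say which mapping f is used. The sketch below assumes f(i) = (7*i + 3) mod 10, which is a valid choice because 7 and 10 share no common factor, so every position from 0 to 9 is hit exactly once. The names sort_reordered and swap_int are hypothetical.

```c
#define N 10

static void swap_int(int *x, int *y)
{
    int t = *x;
    *x = *y;
    *y = t;
}

/* Assumed index mapping: a permutation of 0..N-1. */
int f(int i)
{
    return (7 * i + 3) % N;
}

/* Same sort as the original example, with every index routed through f.
   Logical element k lives in a[f(k)], so the array is sorted in
   descending order only when read back through f. */
void sort_reordered(int *a)
{
    for (int i = 0; i < N; i++)
        for (int j = i; j < N; j++)
            if (a[f(j)] > a[f(i)])
                swap_int(&a[f(i)], &a[f(j)]);
}
```

Reading the array in plain memory order after the sort gives a scrambled sequence; only code that also applies f sees the sorted data.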
Storage and encoding
Data flow optimizations such as common subexpression elimination and constant propagation are able to undo very trivial data obfuscations. For example, when the constant 10 is split into the subexpression 2+8, constant propagation will undo the transformation. Non-trivial data obfuscations such as those shown above survive the compilation process because they change the context of the program. Because a compiler only has optimizing transformations at its disposal, it is unable to undo such context-changing data transformations. On the other hand, variable splitting is a deoptimizing transformation, and applying it should take into account the optimizations performed by the compiler.
We had a look at binary obfuscators and found that none implemented non-trivial data transformations. Only trivial data transformations such as constant splitting are implemented at the binary level, and without further obfuscation a subsequent optimization run could remove them. It is not surprising that binary obfuscators contain only trivial data transformations, as the types of data structures are lost during compilation.
Passing extra information to perform such transformations at the binary level is feasible, but labor-intensive and rather artificial when the transformations can be applied at the source code level and then survive the compiler optimizations.
Why would you want to merely obfuscate data, rather than use a strong encryption algorithm?
A good example would be an audit report on a medical system. This report may be generated for an external auditor, and contain sensitive information. The auditor will be examining the report for information that indicates possible cases of fraud or abuse.
Assume that management has required that names, Social Security Numbers and other personal information should not be available to the auditor except on an as-needed basis.
The data needs to be presented to the auditor, but in a way that allows the examination of all data, so that patterns in the data may be detected.
Encryption would be a poor choice in this case, as the data would be rendered into byte values outside the range of printable ASCII characters. The result would be impossible to read.
A better choice might be to obfuscate the data with a simple substitution cipher. While this is not considered encryption, it may be suitable for this situation.
When the auditor finds a possible case of abuse, he will need the real name and SSN of the party involved. He could obtain this by calling a customer service representative at the insurance company that supplied the report and asking for the real information.
The obfuscated data is read to the customer service rep, who then inputs it into an application that supplies the real data.
The importance of using pronounceable characters becomes very clear. Strong encryption would render this impossible.
Here’s some simple example code to do the obfuscation:

create or replace package obfs
is
   function obfs( clear_text_in in varchar2 ) return varchar2;
   pragma restrict_references( obfs, WNPS, WNDS );

   function unobfs( obfs_text_in in varchar2 ) return varchar2;
   pragma restrict_references( unobfs, WNPS, WNDS );
end;
/

create or replace package body obfs
is
   -- xlate_to is the xlate_from alphabet rotated, so each character
   -- maps to exactly one other character and the mapping is reversible
   xlate_from varchar2(62) :=
      '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz';
   xlate_to varchar2(62) :=
      'nopqrstuvwxyz0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklm';

   function obfs ( clear_text_in varchar2 ) return varchar2
   is
   begin
      return translate( clear_text_in, xlate_from, xlate_to );
   end;

   function unobfs ( obfs_text_in varchar2 ) return varchar2
   is
   begin
      return translate( obfs_text_in, xlate_to, xlate_from );
   end;
end;
/
Here is some sample output:

SSN        OBFS SSN
---------  ---------
540407786  srnrnuuvt
542800170  srpvnnoun
542802063  srpvnpntq
541466830  srorttvqn
As you can see, it wouldn’t be very difficult to decipher this scheme given enough data. A somewhat more effective method involves chopping the text into segments and rearranging it as well as obfuscating it. Below is some sample output from this algorithm.
SSN        OBFS SSN
---------  ---------
540407786  &24B23B&Z
542800170  -4B*23&&&
542802063  -4Z&23-&_
541466830  *2_423ZZ&
While this is still not encryption, this data would be more difficult to decipher without the key. Source code for this in PL/SQL is available at the URL provided at the end of this article.
Another way to hide sensitive data is through masking. This is different from the previous example in that the clear text cannot be reconstructed from the displayed data.
This is useful in situations where it is only necessary to display a portion of the data. A good case for this method is the receipts printed at gas stations and convenience stores. When a purchase is made with a credit card, the last few digits of the credit card number are often displayed as clear text, while the rest of the number is masked with a series of X’s.
Slop n Slurp 1 Stop Shop
5/25/2000 8:53 P.M.

Football Burrito     1      2.49     2.49
Premium Gasoline     12.5   1.699   21.24
                                    =====
                                    23.73

AMEX 2/02 XXXX-XXXXXX-65498
This method can also be used for reports where the person reading the report requires only a portion of the sensitive data. It is also commonly used for the account numbers on printed transactions from ATMs.
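Masking of this kind can be sketched in C as shown below. The function name mask_number and its parameters are hypothetical, and the sketch assumes the number arrives as a string in which separators such as dashes should be preserved.

```c
#include <ctype.h>
#include <string.h>

/* Copies src into dst, replacing every digit with 'X' except the last
   `visible` digits. Non-digit characters such as '-' are copied as-is.
   dst must have room for strlen(src) + 1 characters. */
void mask_number(const char *src, char *dst, int visible)
{
    size_t len = strlen(src);
    int digits_seen = 0;

    dst[len] = '\0';
    /* Walk from the end so the trailing digits are the ones kept. */
    for (size_t k = len; k-- > 0; ) {
        unsigned char c = (unsigned char)src[k];
        if (isdigit(c)) {
            dst[k] = (digits_seen < visible) ? (char)c : 'X';
            digits_seen++;
        } else {
            dst[k] = (char)c;
        }
    }
}
```

Unlike the substitution approach, nothing in the output allows the masked digits to be recovered, so the full number has to be kept elsewhere if it is ever needed again.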