Dependent types and their application in memory-safe low-level programming

UNIVERSIDADE FEDERAL DO RIO GRANDE DO SUL INSTITUTO DE INFORMÁTICA

PROGRAMA DE PÓS-GRADUAÇÃO EM COMPUTAÇÃO

VÍTOR BUJÉS UBATUBA DE ARAÚJO

Trabalho Individual I TI-123

Advisor: Prof. Dr. Álvaro Freitas Moreira

Coadvisor: Prof. Dr. Rodrigo Machado

Porto Alegre October 2014

UNIVERSIDADE FEDERAL DO RIO GRANDE DO SUL

Rector: Prof. Carlos Alexandre Netto

Vice-Rector: Prof. Rui Vicente Oppermann

Pro-Rector for Graduate Studies: Prof. Vladimir Pinheiro do Nascimento

Director of the Instituto de Informática: Prof. Luis da Cunha Lamb

PPGC Coordinator: Prof. Luigi Carro

ABSTRACT

Dependent types allow the construction of types that depend on values. For instance, one might declare a type of arrays of a specific size. This allows specification of richer and more precise types to programs, thus making it possible to check a wider variety of program properties mechanically. A particularly interesting application of dependent types is in low-level and systems programming, where dependent types can be used to regain safety guarantees usually not present in low-level programming languages, such as ensuring that memory past the bounds of an array or buffer will not be accessed.

This work aims to present the fundamentals of dependent type theory, as well as some examples of dependently-typed programming languages and what can be achieved by them. Special focus will be given to dependently-typed languages aimed at low-level and systems programming, such as Deputy, and how they can help in avoiding the usual memory-safety problems encountered in low-level programming. Other solutions to the problem of memory-safe low-level programming not employing dependent types will also be examined, for the sake of comparison.

RESUMO

Dependent types and their application in memory-safe low-level programming

Dependent types allow the construction of types that depend on values. For example, one can declare a type of arrays of a specific size. This allows richer and more precise types to be assigned to programs, thus allowing a wider variety of program properties to be verified mechanically. A particularly interesting application of dependent types is in low-level and systems programming, where dependent types can be used to recover safety guarantees normally absent from low-level programming languages, such as guaranteeing that memory beyond the bounds of an array or buffer will not be accessed.

This work aims to present the fundamentals of dependent type theory, as well as some examples of dependently-typed programming languages aimed at low-level and systems programming, such as Deputy, and how they can help avoid the memory-access safety problems encountered in low-level programming. Other solutions to the problem of memory-safe low-level programming that do not involve dependent types will also be examined, for the sake of comparison.

1 INTRODUCTION

Higher-level programming languages, such as Java, C#, Haskell and Scheme, enforce memory safety: they guarantee that a program never accesses a region of memory that is not allocated to it. This is usually achieved through a combination of mechanisms, such as a strong type system, automatic memory management through garbage collection, and array bounds checking. These mechanisms have static (compile-time) components, such as a type system that rejects programs that would lead to memory safety violations if executed, and dynamic (run-time) components, such as runtime checks to ensure that an object is accessed within its bounds. Dynamic checks require that the program keep enough metadata at runtime to allow checking the validity of memory accesses. For instance, an array may be stored in memory as an integer representing its length followed by the actual elements of the array; this way, when a position of the array is accessed, the position index can be compared against the stored length of the array to ensure that the index is valid.

By contrast, lower-level programming languages, such as C and C++, do not provide such strong memory safety guarantees, relying instead on the programmer to keep track of memory allocations and to ensure that the program performs no invalid memory access at runtime. On the one hand, this gives the programmer greater control: the lack of implicit metadata in arrays and other data allows finer control of the memory layout of data structures, and the lack of implicit checks allows precise control of when checks are performed and enables elimination of redundant checks by the programmer. This is especially important in systems programming, i.e., programming of low-level system components such as operating system kernels and programming language runtimes, where precise control over data structure layout may be required. On the other hand, this is a frequent source of bugs in C and C++ programs, since it is very easy to leave out or incorrectly perform such checks, leading to a memory-unsafe program, with consequences varying from crashes to silent data corruption to security vulnerabilities. A large number of security vulnerabilities found in real-world software are caused by buffer overflows and overreads, i.e., by exploiting the absence or incorrectness of bounds checking of some program buffer to gain access to a region of memory that should not be accessible, thus either obtaining information that should not be revealed, or altering the subsequent behavior of the program, potentially enabling execution of arbitrary code of the attacker's choice. A recent example of this is the Heartbleed bug discovered in the widely-deployed OpenSSL library for secure communication between clients and servers on the Internet, in which the absence of a bounds check allowed an attacker to obtain the contents of arbitrary regions of the server's memory, potentially revealing sensitive data such as usernames, passwords, and the private keys of SSL security certificates.
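The kind of overread described above can be illustrated with a minimal C sketch. This is not OpenSSL's code: the function echo_payload and all of its parameters are hypothetical names chosen for illustration. It assumes a routine that echoes back a payload whose length is claimed by the peer, and shows the single comparison whose absence would let the copy run past the data actually received.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical echo routine (illustrative only, not a real library API).
 * `payload` holds `actual_len` bytes that really arrived; `claimed_len`
 * is the length the peer says it sent. Returns 0 on success and -1 if
 * the request is rejected. */
int echo_payload(const unsigned char *payload, size_t actual_len,
                 size_t claimed_len, unsigned char *out, size_t out_cap)
{
    /* The bounds check: a peer-supplied length is never trusted.
     * Without this test, memcpy would read past the end of `payload`
     * whenever claimed_len > actual_len. */
    if (claimed_len > actual_len || claimed_len > out_cap)
        return -1;

    memcpy(out, payload, claimed_len);
    return 0;
}
```

Nothing in the C language ties claimed_len to the size of payload, so the compiler cannot notice when the check is missing; that relationship exists only in the programmer's head.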

Several solutions have been proposed to improve memory safety in systems programming. Many of them involve keeping implicit metadata on bounds and other allocation information at runtime, as is done in higher-level programming languages. This has a number of drawbacks. If this metadata is kept together with the data it refers to, it changes data representation, which is undesirable in systems programming where control over memory layout is necessary, and also introduces interoperability problems with external code, such as libraries, which do not expect such metadata to be present in data structures. Also, unlike higher-level programming languages, C and C++ have pointers which can point anywhere in the middle of an object; therefore, it is not enough to store the length of the array together with its elements as one might do in a higher-level programming language, because it is not in general possible to reach the beginning of the array from an arbitrary pointer to it in C and C++. Rather, every pointer must have associated bounds metadata, which incurs considerable memory overhead.
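A minimal sketch of such per-pointer metadata, often called a "fat pointer", is given below. The type fat_ptr and its helper functions are hypothetical and are not taken from any of the systems surveyed here; the sketch only illustrates why bounds must travel with every pointer once pointers may refer to the middle of an array, and why this multiplies the size of each pointer.

```c
#include <stddef.h>

/* Hypothetical fat pointer: the bounds of the underlying array travel
 * with every pointer value, so a pointer costs three words, not one. */
struct fat_ptr {
    int *base;        /* first element of the underlying array */
    size_t length;    /* number of elements in that array */
    ptrdiff_t index;  /* current position, relative to base */
};

struct fat_ptr fat_from_array(int *array, size_t length)
{
    struct fat_ptr p = { array, length, 0 };
    return p;
}

/* Pointer arithmetic moves the position but preserves the bounds, so an
 * interior pointer can still be checked against the whole array. */
struct fat_ptr fat_add(struct fat_ptr p, ptrdiff_t offset)
{
    p.index += offset;
    return p;
}

/* Checked dereference: returns 0 and stores the element in *out, or -1
 * if the pointer is outside the array. */
int fat_read(struct fat_ptr p, int *out)
{
    if (p.index < 0 || (size_t)p.index >= p.length)
        return -1;
    *out = p.base[p.index];
    return 0;
}
```

Because struct fat_ptr does not have the representation of a plain int *, values of this type cannot be passed directly to external code that expects ordinary pointers, which is the interoperability problem mentioned above.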

A different approach, based on dependent types, relies on the observation that a correct low-level program is already memory-safe: programmers employ a variety of idioms to keep track of bounds information, such as storing an array and its length in a single data structure (as in POSIX C's iovec structure), or passing the array together with its length as arguments to array-manipulating functions (such as C's memcpy function). However, since this information is kept in an ad-hoc way by the programmer without language support, the compiler cannot verify the correctness of its usage. Dependent types allow the specification of such relationships between data and metadata explicitly, thus allowing them to be mechanically verified either statically or dynamically. For instance, in Deputy, a dependent type system for C, the signature of a function taking an array and its length can be annotated as:

int f(int * count(length) array, int length)

indicating that array is a pointer to a region of memory containing length elements. Given this annotation, the compiler can verify that accesses to this array are within its bounds without keeping separate metadata; rather, the metadata already present in the C program can be reused, thus avoiding a memory overhead.
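The following sketch shows how such an annotated signature can coexist with ordinary C. It is written under stated assumptions: the count annotation is defined here as a macro that expands to nothing, so that a standard C compiler accepts the code, and the functions sum and get_or are hypothetical examples. The explicit test in get_or is written by hand to stand in for the kind of check a Deputy-like tool would have to insert when it cannot prove an access safe; it does not reproduce Deputy's actual output or its failure behavior.

```c
#include <stddef.h>

/* Assumption for this sketch: the annotation expands to nothing, so an
 * ordinary C compiler sees a plain `int *`. A checker that understands
 * the annotation would read it instead of discarding it. */
#define count(n)

/* Every access below satisfies 0 <= i < length, so a checker could
 * accept this loop statically, with no runtime test. */
int sum(int * count(length) array, int length)
{
    int total = 0;
    for (int i = 0; i < length; i++)
        total += array[i];
    return total;
}

/* Here `index` comes from the caller and nothing proves it is in
 * bounds. The hand-written test stands in for an inserted check. */
int get_or(int * count(length) array, int length, int index, int fallback)
{
    if (index < 0 || index >= length)
        return fallback;
    return array[index];
}
```

In both functions the bound is the length parameter that the program already passes around, so no additional metadata is stored alongside the array or the pointer.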

This work presents an overview of memory safety, dependent types, and how dependent types can be used to bring stronger memory safety guarantees to systems programming, together with a comparison with other approaches to memory-safe systems programming. Special focus will be given to the Deputy type system, in preparation for the Master's dissertation work to which this Individual Study is associated.

1.1 Outline

Chapter 2 presents the concept of memory safety as it is defined by different authors, and the different degrees of memory safety provided by programming languages. Chapter 3 presents dependent types and dependently-typed programming languages. Chapter 4 presents the Deputy type system and how it applies dependent types to provide memory safety in systems programming. Chapter 5 presents other approaches to memory-safe systems programming, both dependently-typed and otherwise. Chapter 6 concludes.

2 MEMORY SAFETY

This chapter introduces the concept of memory safety, discusses how and to what extent it is enforced in higher- and lower-level programming languages, and introduces different approaches to improving the memory safety of lower-level programming.

2.1 Introduction

There is no single definition of memory safety among authors. Broadly speaking:

Broad definition. A program is said to be memory-safe if it only accesses regions of memory allocated to it. A language is said to be memory-safe if its semantics guarantee that valid programs written in it are memory-safe.

This is a very lax definition, however. For instance, by this definition, if two arrays of ten integers each are allocated contiguously in memory, a program attempting to access the eleventh position of the first array would still be considered memory-safe, because, although the access is out of the bounds of the first array, it still falls within a region of memory allocated to the program (specifically, to the second array). Such a definition is useful in the context of guaranteeing that multiple programs sharing the same memory space do not step over each other's memory (?), but it does not help in preventing buffer overflows and other memory corruption within a single program. Usually, we are interested in a stricter definition of memory safety which accounts for these situations. It is harder to make such a definition without talking about concepts specific to each programming language, but one might generalize definitions such as the one in (?) as:

Stricter definition. A program is said to be memory-safe if every access to memory happens through a reference to a previously allocated object, the object has not been deallocated, and the region of memory accessed through such a reference has been allocated to that specific object.

Such a definition is still open to interpretation. For instance, two contiguous arrays belonging to a single data structure might be considered part of a single object, and therefore out-of-bounds accesses to the first array which still fall within the region allocated to the object as a whole might not be considered a violation of memory safety. Some works take this definition (?), but others rule out such out-of-bounds accesses.
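The two readings of the stricter definition can be made concrete with a small C sketch. The structure pair and the function classify_first_access are hypothetical names introduced for illustration. The function performs no memory access at all; it only compares byte offsets to decide whether an access to the first array would fall inside that array, inside the enclosing object only, or outside the object altogether.

```c
#include <stddef.h>

#define PAIR_LEN 10

/* Two arrays of ten integers each, belonging to a single object. */
struct pair {
    int first[PAIR_LEN];
    int second[PAIR_LEN];
};

enum verdict { WITHIN_ARRAY, WITHIN_OBJECT_ONLY, OUTSIDE_OBJECT };

/* Classifies a would-be access to first[index] using offsets alone. */
enum verdict classify_first_access(size_t index)
{
    size_t start = offsetof(struct pair, first) + index * sizeof(int);
    size_t stop = start + sizeof(int);

    if (index < PAIR_LEN)
        return WITHIN_ARRAY;          /* safe under every reading */
    if (stop <= sizeof(struct pair))
        return WITHIN_OBJECT_ONLY;    /* safe only if the whole struct
                                         counts as "the object" */
    return OUTSIDE_OBJECT;            /* unsafe under every reading */
}
```

For index 10, the eleventh position, the result is WITHIN_OBJECT_ONLY: a definition that treats the whole structure as one object accepts the access, whereas a definition that treats each array as its own object rejects it.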

2.2 Spatial and temporal memory safety

Memory safety has a spatial and a temporal aspect. Spatial memory safety refers to ensuring that no out-of-bounds access to memory is performed (with what is considered out of bounds varying with the definition of memory safety used), whereas temporal memory safety refers to ensuring that no access is performed to memory (or to an object) which has already been deallocated or has not been allocated yet. The mechanisms for ensuring each form of memory safety are distinct. For instance, guaranteeing spatial memory safety might involve storing bounds information for arrays and adding runtime checks to ensure that indices are within bounds, while guaranteeing temporal memory safety might involve employing automatic memory management mechanisms, such as reference counting or garbage collection, which ensure that an object in memory is only deallocated after no references to it remain.

Higher-level programming languages usually enforce both kinds of memory safety, whereas the solutions for memory-safe low-level programming surveyed in this work vary in what they provide. Some, such as CCured (NECULA et al., 2005) and Cyclone (JIM et al., 2002), attempt to provide both. Others, such as Deputy (CONDIT et al., 2007), provide only spatial memory safety, and can be used together with a complementary solution for temporal memory safety, such as a conservative garbage collector (BOEHM; WEISER, 1988).

2.3 Mechanisms for ensuring memory safety

2.3.1 Temporal memory safety

Temporal memory safety violations arise from the use of references to objects that are not present in memory, either because they have never been allocated, or because they have already been deallocated. The first case surfaces in the form of uninitialized pointers; this can be avoided by ensuring, when a pointer is created, that it is initialized either with the address of a valid object in memory, or with a special null value, and that attempts to dereference a null pointer will be detected and handled in some way by the environment; languages like Java do precisely this. The second case poses a greater difficulty. In C and C++, the region of memory allocated to an object may be released (by explicit request from the programmer) even though references to it remain in the program; such references are known as dangling pointers. Solutions to this problem ensure that no memory access through dangling references is performed, either by taking over control of memory deallocation to ensure that memory is not released while references to it remain, or by constraining the creation of new references to allocated objects.
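As an illustration of the first strategy, taking over control of deallocation, the sketch below implements a minimal reference-counted buffer in C. All names (rc_buf, rc_new, rc_retain, rc_release) are hypothetical. In this sketch the counting discipline is still followed by hand, whereas a language with built-in reference counting would apply it automatically whenever a reference is created or destroyed.

```c
#include <stdlib.h>

/* Hypothetical reference-counted buffer. */
struct rc_buf {
    size_t refs;    /* number of live references to this buffer */
    size_t length;  /* number of elements in data */
    int *data;
};

/* Allocates a zero-filled buffer holding one reference, or returns
 * NULL if allocation fails. */
struct rc_buf *rc_new(size_t length)
{
    struct rc_buf *b = malloc(sizeof *b);
    if (b == NULL)
        return NULL;
    b->data = calloc(length, sizeof *b->data);
    if (b->data == NULL) {
        free(b);
        return NULL;
    }
    b->refs = 1;
    b->length = length;
    return b;
}

/* Registers one more reference to the buffer. */
struct rc_buf *rc_retain(struct rc_buf *b)
{
    b->refs++;
    return b;
}

/* Drops one reference. The storage is freed only when the last
 * reference goes away. Returns 1 if this call freed it, 0 otherwise. */
int rc_release(struct rc_buf *b)
{
    if (--b->refs > 0)
        return 0;
    free(b->data);
    free(b);
    return 1;
}
```

As long as every holder of a reference calls rc_retain when it obtains the reference and rc_release when it is done with it, the buffer cannot be freed while another reference is still in use, so no dangling pointer arises.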

2.3.1.1 Reference counting

2.3.1.2 Garbage collection

2.3.1.3 Region inference

2.3.1.4 Ownership

2.3.1.5 Linear types

2.3.2 Spatial memory safety

Spatial memory safety violations arise from accessing a region of memory outside the bounds of an allocated object (or, more strictly, accessing through a reference to an object a region of memory not pertaining to it). The most common case of this is in the indexing of arrays, but it can happen in other situations where a reference may point to objects of variable size, such as unions of differently-sized objects, or in object-oriented languages which allow downcasting an object to a subtype with more fields.

2.3.2.1 Dynamic checking

One way to solve this problem for arrays is to store the array in memory as a header, containing its length and possibly other metadata, and a payload, consisting of the elements themselves. When an access is performed to the array, runtime checks are performed to ensure that the index of the requested position is a valid index in the array by comparing it with the stored length; if this turns out to be false, a runtime exception is signaled. Languages like Java, Python and Scheme work like this. If implemented naively, this incurs an overhead on every array access. However, it is often possible to prove statically that an access is within bounds, and avoid the runtime check in those cases. For example, given a loop bounded by the length of the vector, like

for (i = 0; i < vector.length; i++) {
    vector[i] = i * i;
}

a compiler can easily verify that the variable i only assumes valid indices to vector, and therefore a runtime check is not necessary.
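The header-plus-payload representation described above can be sketched in C as follows. The names checked_array, ca_new, ca_get and ca_set are hypothetical. Since C has no exceptions, an out-of-bounds access is reported through a return code here, rather than by signaling a runtime exception as the languages mentioned above do.

```c
#include <stdlib.h>

/* Header (the length) followed by the payload (the elements), stored
 * in a single allocation. */
struct checked_array {
    size_t length;
    int elems[];   /* C99 flexible array member */
};

/* Allocates a zero-filled array of `length` elements, or returns NULL
 * if allocation fails. */
struct checked_array *ca_new(size_t length)
{
    struct checked_array *a =
        calloc(1, sizeof *a + length * sizeof a->elems[0]);
    if (a != NULL)
        a->length = length;
    return a;
}

/* Checked read: returns 0 and stores the element in *out, or -1 if
 * the index is out of bounds. */
int ca_get(const struct checked_array *a, size_t index, int *out)
{
    if (index >= a->length)
        return -1;
    *out = a->elems[index];
    return 0;
}

/* Checked write: returns 0 on success, -1 if the index is out of
 * bounds. */
int ca_set(struct checked_array *a, size_t index, int value)
{
    if (index >= a->length)
        return -1;
    a->elems[index] = value;
    return 0;
}
```

Every call to ca_get or ca_set pays for one comparison. This is the per-access cost that a compiler can eliminate when it is able to prove, as in the loop above, that the index never leaves the valid range.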

Similarly, unions may be stored with a tag indicating which element of the union is active, and the tag can be checked at runtime before the contents of the union are accessed. Downcasts can likewise be checked for validity.
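A tagged union with a runtime check can be sketched in C as shown below. The type shape and its accessor shape_radius are hypothetical examples. In plain C nothing forces the programmer to go through the checked accessor; the sketch only shows the discipline that a runtime-checked language would enforce on every access.

```c
enum shape_tag { SHAPE_CIRCLE, SHAPE_RECT };

/* A union of differently-sized alternatives, stored together with a
 * tag recording which alternative is currently active. */
struct shape {
    enum shape_tag tag;
    union {
        struct { double radius; } circle;
        struct { double width, height; } rect;
    } as;
};

struct shape make_circle(double radius)
{
    struct shape s;
    s.tag = SHAPE_CIRCLE;
    s.as.circle.radius = radius;
    return s;
}

struct shape make_rect(double width, double height)
{
    struct shape s;
    s.tag = SHAPE_RECT;
    s.as.rect.width = width;
    s.as.rect.height = height;
    return s;
}

/* Checked access: the tag is inspected before the union is read.
 * Returns 0 and stores the radius in *out, or -1 if the shape is not
 * a circle. */
int shape_radius(const struct shape *s, double *out)
{
    if (s->tag != SHAPE_CIRCLE)
        return -1;
    *out = s->as.circle.radius;
    return 0;
}
```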

2.3.2.2 Static enforcement

Another way to solve the problem is to ensure that code that might violate memory safety does not get compiled or run, or to design language constructs that make it impossible to express memory-unsafe operations. For instance, tagged unions in ML and Haskell allow the definition of unions of differently-sized types, but the contents of such values can only be accessed by pattern-matching against the tag of the value; these languages provide no facility for accessing the fields of an alternative other than the one indicated by the tag. More generally, strongly-typed languages guarantee that the memory used by a value of one type cannot be reinterpreted as a value of another type, as is possible, for instance, in C.

More interestingly, one might use dependent types to ensure that only valid indices are used to index an array. Purely static dependent type systems, such as in Idris and ATS, effectively ensure that the programmer performs a bounds check to verify that an index is valid before trying to access an array, and reject programs when they cannot prove that this is the case. Systems which mix static and dynamic checking, such as Deputy, accept programs when they cannot either prove or disprove that an access is valid, but insert runtime checks in such cases to ensure that an out-of-bounds access is not performed. A more detailed account of dependent types will be presented in the rest of this work.

2.4 Low-level and systems programming

Whereas higher-level programming languages enforce both spatial and temporal memory safety, lower-level languages like C and C++ do not provide such strong memory safety guarantees, which makes programs in those languages prone to bugs such as buffer overflows and dangling pointers. The flip side of this lack of safety is flexibility: C gives the programmer great control over the layout of data structures in memory, since no metadata is kept by the runtime. Also, memory management is manually performed by the programmer, which gives precise control over when memory will be deallocated and obviates the need for a garbage collector. This kind of flexibility is especially important in systems programming, i.e., the programming of foundational system components such as operating system kernels and implementations of higher-level language runtimes and virtual machines, where precise control of memory layout may be necessary and using a garbage collector may be infeasible. C and C++ are also often used in real-time applications, where the variability in execution time introduced by a garbage collector may not be tolerable. For these reasons, C and C++ remain widely used, even though bugs due to memory safety violations are very common and often found in deployed software. Such bugs often lead to crashes, silent data corruption, and security vulnerabilities.

Many solutions have been proposed to improve the situation of memory safety in systems programming. The next chapters survey some of them, with a special emphasis on solutions based on dependent type systems.

3 DEPUTY: A DEPENDENT TYPE SYSTEM FOR C

3.1 Introduction

3.2 Design

3.3 Implementation

3.4 Limitations

3.5 Related work

4 OTHER APPROACHES TO MEMORY-SAFE SYSTEMS PROGRAMMING

4.1 ATS

4.2 Idris

4.3 Rust

4.4 Huh?

REFERENCES

BOEHM, H.-J.; WEISER, M. Garbage Collection in an Uncooperative Environment. Software: Practice & Experience, New York, NY, USA, v.18, n.9, p.807–820, Sept. 1988.

CONDIT, J. et al. Dependent types for low-level programming. In: Programming Languages and Systems. [S.l.]: Springer, 2007. p.520–535.

JIM, T. et al. Cyclone: a safe dialect of C. In: USENIX ANNUAL TECHNICAL CONFERENCE, GENERAL TRACK, 2002. Proceedings... [S.l.: s.n.], 2002. p.275–288.

NECULA, G. C. et al. CCured: type-safe retrofitting of legacy software. ACM Transactions on Programming Languages and Systems (TOPLAS), [S.l.], v.27, n.3, p.477–526, 2005.
