
4.2 Compiling Ensemble Applications

4.2.3 Dependencies

Although actors do not share state, they do share types. This is true in both the language and the runtime, where it is necessary to have actors create user data types or invoke procedures. While this sharing does not lead to inconsistent state in the language, it does generate dependencies between the class files which represent these types at runtime. This is necessary for actor adaptation (see Section 4.4.3).

To address this, the compiler determines the dependencies between the different Ensemble entities before generating the Java source code. When creating the Java classes that represent the Ensemble types, the compiler generates the custom @dependency annotation for the class being generated. This annotation contains a list of classes upon which the class currently being generated depends. The linker uses this annotation, and embeds both the number and names of the dependencies into the class file for the particular type. During any action which would potentially require a new data type to be present in a stage, such as a spawn or migrate, the runtime can use this information to determine any unmet dependencies. In the future, it would be better to use runtime analysis of the class file’s symbolic references to determine dependencies, rather than store these directly in the class file, thus reducing the size of class files.
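The check for unmet dependencies can be illustrated with a minimal Java sketch. It assumes a hypothetical `Dependency` annotation that carries class names and is readable through reflection; in Ensemble itself the linker embeds this information in the class file for the custom VM, so the annotation shape, the `Sensor` and `Reading` types, and the `unmetDependencies` helper are all illustrative names rather than the thesis's actual implementation.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for the custom @dependency annotation; the real
// annotation's fields and retention policy are not given in the text.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface Dependency {
    String[] value(); // names of the classes this class depends on
}

class DependencyCheck {
    // Illustrative generated types: Sensor depends on Reading.
    static class Reading {}

    @Dependency({"Reading"})
    static class Sensor {}

    // Returns the declared dependencies of `type` that are absent from the stage.
    static List<String> unmetDependencies(Class<?> type, Set<String> presentOnStage) {
        List<String> unmet = new ArrayList<>();
        Dependency declared = type.getAnnotation(Dependency.class);
        if (declared == null) {
            return unmet; // no annotation: nothing is required
        }
        for (String name : declared.value()) {
            if (!presentOnStage.contains(name)) {
                unmet.add(name);
            }
        }
        return unmet;
    }

    public static void main(String[] args) {
        // Before a spawn or migrate, the target stage only holds Sensor.
        System.out.println(unmetDependencies(Sensor.class, Set.of("Sensor")));            // [Reading]
        // Once Reading has been transferred, nothing is missing.
        System.out.println(unmetDependencies(Sensor.class, Set.of("Sensor", "Reading"))); // []
    }
}
```

Any names returned by such a check would be the classes that need to be sent to the stage before the spawn or migration can complete.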

As well as explicit inter-class dependencies, the compiler also optimises for subroutine usage. Procedures or queries which are defined either outwith or within an actor only have Java code generated if they are invoked within an actor. This means that space is not consumed for subroutines which are never invoked. However, this does mean that each actor will have its own copy of any invoked subroutine. Although this may lead to multiple copies of the same function, it is useful for reducing inter-class dependencies, as well as simplifying the spawn and migration process discussed in Section 4.4.3.
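A rough sketch of this selection, in Java, is given below. It assumes the compiler has a call graph available and that procedures called by an invoked procedure are also emitted into the actor; that transitive step, along with the `proceduresToEmit` helper and the procedure names, is an assumption made for illustration and is not stated in the text.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class SubroutineSelection {
    // Returns the procedures to copy into one actor's class: everything
    // reachable from the procedures the actor invokes directly.
    static Set<String> proceduresToEmit(Set<String> directCalls,
                                        Map<String, List<String>> callees) {
        Set<String> emitted = new TreeSet<>();
        Deque<String> pending = new ArrayDeque<>(directCalls);
        while (!pending.isEmpty()) {
            String procedure = pending.pop();
            if (emitted.add(procedure)) { // first visit: follow its own calls
                pending.addAll(callees.getOrDefault(procedure, List.of()));
            }
        }
        return emitted;
    }

    public static void main(String[] args) {
        // Hypothetical program: `unused` is defined but never invoked by any actor.
        Map<String, List<String>> callees = Map.of(
                "average", List.of("sum"),
                "sum", List.of(),
                "unused", List.of("sum"));

        // Each actor receives its own copy of what it needs, so `sum` appears twice.
        System.out.println(proceduresToEmit(Set.of("average"), callees)); // [average, sum]
        System.out.println(proceduresToEmit(Set.of("sum"), callees));     // [sum]
    }
}
```

In this sketch no code is produced for `unused`, while `sum` is duplicated across both actors, which mirrors the trade-off between space and inter-class dependencies described above.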

Class File Format

Java class files contain member definitions (fields and methods), metadata, and constant pools. Past analyses of Java programs show that, on average, class files can be as little as 33% method definitions [114], and only 20% bytecode [115]. However, this may not be representative of Ensemble applications, and in particular the standard library, which contains mainly class and native method definitions.

| Data                   | Standard library (33 class files) | Programs (20 class files) |
| Constant pool          | 66.5%                             | 67.5%                     |
| Class metadata used    | 0.2%                              | 0.2%                      |
| Class metadata unused  | 5.2%                              | 7.6%                      |
| Field metadata used    | 0.2%                              | 0.2%                      |
| Field metadata unused  | 7.4%                              | 1.5%                      |
| Method metadata used   | 2.3%                              | 2.7%                      |
| Bytecode               | 3.5%                              | 7.3%                      |
| Method metadata unused | 14.6%                             | 13.0%                     |
| Method total           | 20.4%                             | 23.0%                     |
| Total used             | 72.8%                             | 78.0%                     |
| Total unused           | 27.2%                             | 22.0%                     |

Table 4.1: Average percentage composition of class files. ‘Used’ and ‘unused’ indicate whether information is present in the modified class file.

The class files from the standard library and various Ensemble programs have been analysed to find how much of their data can be discarded. The results are shown in Table 4.1. Most of the unused data is related to linking. The rest is mainly metadata related to Java features unsupported by the VM, or which is encoded in the VM’s new instructions (e.g. the size of fields, and whether methods are native). Appendix B fully describes the new class file format.

Inter-Class References

Currently, the class files representing specific Ensemble entities are symbolically referenced by name. If all Ensemble applications were compiled from a single source file, the compiler would be able to ensure that no two types could have the same name, hence this referencing approach would be safe. However, as Ensemble applications are designed to be able to work together when compiled independently, this approach does not guarantee uniqueness: two distinct types may possess the same name.

To solve this problem using a decentralised approach, a unique naming scheme is adopted. By taking an MD5 hash [116] of an Ensemble type’s class file at compile time, a 128-bit identifier is produced to identify a class in place of a literal name. The Java Universally Unique ID (UUID) library is used to generate this number. Using a hash of the post-linked class file has the advantage that if two identical actors are compiled independently, they will possess the same UUID. The generation and use of UUIDs is not visible to either the language or the user.
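As an illustration, the sketch below derives an identifier from the bytes of a class file using `java.util.UUID.nameUUIDFromBytes`, which computes an MD5-based (version 3) UUID. The text does not say which method of the UUID library the linker calls, so the choice of this method, the `idFor` helper, and the byte strings standing in for class files are assumptions.

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;

class ClassFileId {
    // Derives an identifier from the bytes of a (post-linked) class file.
    // Assumption: the MD5-based name UUID from the standard library is used.
    static UUID idFor(byte[] classFileBytes) {
        return UUID.nameUUIDFromBytes(classFileBytes);
    }

    public static void main(String[] args) {
        // Stand-ins for class file contents; real input would be the linked class file.
        byte[] first = "identical actor".getBytes(StandardCharsets.UTF_8);
        byte[] second = "identical actor".getBytes(StandardCharsets.UTF_8);
        byte[] other = "different actor".getBytes(StandardCharsets.UTF_8);

        // Identical bytes compiled independently give the same identifier.
        System.out.println(idFor(first).equals(idFor(second))); // true
        // Different contents are expected to give different identifiers.
        System.out.println(idFor(first).equals(idFor(other)));
        System.out.println(idFor(first).version()); // 3, the MD5 name-based variant
    }
}
```

Because the identifier depends only on the bytes, no coordination between independently compiled applications is needed to agree on it.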

As a UUID is represented as a finite number, it does not guarantee a truly unique number: it is possible that two different classes will hash to the same UUID. There are a number of points to consider.

Firstly, by inspecting the number of bits used in the UUID and the approach used in the library, there are 2^126 potential values which the UUID may take. This is sufficiently large for the purposes of this work, making a collision extremely improbable. Should this not be sufficient, it may be possible to increase the number of bits used to represent the key. Also, metadata may be used to add context-specific information. Examples of such metadata include the literal name of the type, or the string encoding used for the type.

Secondly, as the UUID is generated from the class file, it is possible that two actors which have been named distinctly may be treated as equivalent. This is essentially structural matching. Unlike the case discussed in Section 3.4.3, this will not lead to unexpected logical errors. For raw types, there is no issue in choosing one type over another if they are structurally the same. For actor classes, the UUID includes the actual implementation of the behaviour clause in addition to the data types used. This means that both the actor’s state and logic are used to generate the UUID, hence the UUID is generated from a unique representation of the actor.

Encoding

Although the compiler ensures that only valid Ensemble applications will compile successfully, the presence of runtime discovery, reconfiguration, distribution, and the any type requires that there exist some encoding of an entity’s type at runtime.

Encodings fall into two categories: those which represent primitive types, and those which represent aggregate types, such as structs and actors. Table B.1 in Appendix B describes the mapping between types and encodings. This string-based encoding was chosen as it was simple, both to implement and to perform type comparisons at runtime. In the future, using the hash of an encoding, rather than the encoding itself, may be more space and time efficient.
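To make the idea of a string-based encoding concrete, the following Java sketch builds encodings for primitive and aggregate types and compares them by string equality. The actual mapping is defined in Table B.1 of Appendix B and is not reproduced here, so the single-letter codes, the brace syntax for structs, and all method names are purely hypothetical.

```java
import java.util.List;

class TypeEncoding {
    // Hypothetical primitive encodings; the real mapping is in Table B.1.
    static String primitive(String typeName) {
        switch (typeName) {
            case "integer": return "i";
            case "real":    return "r";
            case "boolean": return "b";
            default: throw new IllegalArgumentException("unknown primitive: " + typeName);
        }
    }

    // Hypothetical aggregate encoding: field encodings, in order, inside braces.
    static String struct(List<String> fieldEncodings) {
        return "{" + String.join("", fieldEncodings) + "}";
    }

    // With string encodings, a runtime type comparison is a string comparison.
    static boolean sameType(String left, String right) {
        return left.equals(right);
    }

    public static void main(String[] args) {
        String point = struct(List.of(primitive("integer"), primitive("integer")));
        String sample = struct(List.of(primitive("integer"), primitive("real")));
        String nested = struct(List.of(point, primitive("boolean")));

        System.out.println(point);                   // {ii}
        System.out.println(nested);                  // {{ii}b}
        System.out.println(sameType(point, sample)); // false
    }
}
```

Under these assumptions, replacing the string with a hash of the string, as suggested for future work, would turn the comparison into a fixed-size integer comparison.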

Similarly to the inter-class dependencies discussed in Section 4.2.3, the compiler will determine an encoding for an entity’s type at compile time, and annotate the class file generated for this type with the custom @encoding annotation containing the string encoding. The linker will then encode this information into the class file. Also, an encoding of the data which a channel conveys is supplied to a channel when created. This is necessary for communication across remote channels (see Section 4.4.2). As runtime type information is only required in the distributed case, its use has been kept to a minimum.

Optimisations

Although Ensemble applications are translated to Java source code, and then compiled using the Java compiler, they do not use all the features of Java. As well as not implementing a number of Java features, the linker performs a number of optimisations to reduce the size of the class file.

‘Static Only’ Classes. In Java, all methods and fields must be part of a class; there is no such thing as a ‘top-level method’ or a global variable. However, some classes within the standard library contain only static methods and static fields, and are never instantiated, never extended, and never used as the type of a variable. As this only applies to classes which are integral to the runtime and are always present on each stage, there is no need to include a class definition for them; only the methods and fields themselves are included. A class suitable for this treatment is marked with the custom @static only annotation by the compiler, which is detected by the linker. All wrapper classes for system actors are marked as static only, giving significant space savings.

Empty Methods. Java object construction occurs in two stages. The new instruction allocates the memory required for an object, and then the object’s constructor is called. All Java classes are required to have a constructor. javac generates default constructors where necessary, which do nothing but call the superclass’s constructor, to ensure this rule is satisfied. This leads to long chains of calls to methods that do no useful work at all.

The linker detects these methods and removes them from the generated class files. All calls to these methods are also removed. This continues iteratively until all constructors and static methods which do nothing, transitively, have been removed. Some modification of the bytecode around the removed call sites is necessary to ensure correct execution. In particular, method arguments which are pushed to the operand stack before a call must now be disposed of as the call no longer happens. If an argument was pushed immediately before the call, the pushing instruction is removed; otherwise, an appropriate number of pop instructions are inserted.

Virtual methods are not removed even if they do nothing, so that virtual calls continue to work as expected.
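The iterative removal described above amounts to a fixed-point computation, sketched below in Java. The model is deliberately simplified: each method is reduced to the list of methods it calls plus a flag for any other useful work, and the operand-stack adjustments at removed call sites are not modelled. The `Method` class, the `removable` helper, and the example method names are assumptions for illustration, not the linker's actual data structures.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class EmptyMethodElimination {
    // Simplified view of a constructor or method body.
    static final class Method {
        final boolean isVirtual;     // virtual methods are always kept
        final boolean doesOtherWork; // any instruction besides calls and return
        final List<String> calls;    // names of the methods invoked

        Method(boolean isVirtual, boolean doesOtherWork, List<String> calls) {
            this.isVirtual = isVirtual;
            this.doesOtherWork = doesOtherWork;
            this.calls = calls;
        }
    }

    // Repeats until no further method becomes removable: a method can go once
    // every call it makes is to a method that has already been removed.
    static Set<String> removable(Map<String, Method> methods) {
        Set<String> removed = new TreeSet<>();
        boolean changed = true;
        while (changed) {
            changed = false;
            for (Map.Entry<String, Method> entry : methods.entrySet()) {
                Method m = entry.getValue();
                if (removed.contains(entry.getKey()) || m.isVirtual || m.doesOtherWork) {
                    continue;
                }
                if (removed.containsAll(m.calls)) {
                    removed.add(entry.getKey());
                    changed = true;
                }
            }
        }
        return removed;
    }

    public static void main(String[] args) {
        Map<String, Method> methods = new LinkedHashMap<>();
        // Default constructors that only chain to the superclass constructor.
        methods.put("Base.<init>", new Method(false, false, List.of("Object.<init>")));
        methods.put("Object.<init>", new Method(false, false, List.of()));
        // A constructor that initialises a field does real work and is kept.
        methods.put("Derived.<init>", new Method(false, true, List.of("Base.<init>")));
        // An empty virtual method is kept so that virtual dispatch still works.
        methods.put("Widget.draw", new Method(true, false, List.of()));

        System.out.println(removable(methods)); // [Base.<init>, Object.<init>]
    }
}
```

In this example `Base.<init>` only becomes removable on the second pass, after `Object.<init>` has been removed, which is why a single pass is not sufficient.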

Native Methods. The invokenative instruction refers to native methods by IDs assigned by the linker, made available to the VM through a C header file. Hence no information about native methods is stored.

Direct Bytecode Generation. Ensemble applications are currently compiled to Java, and then to bytecode using javac, before being modified by the linker. There would be several advantages in compiling Ensemble directly to augmented bytecode:

• Reduction in code size – a number of the classes and methods in the standard library exist only for compatibility with the code generated by javac. Certain methods in the ‘primitive classes’ such as Integer, and classes to support exceptions, are not strictly necessary, but are required to link with classes generated by javac.

• Optimisation of bytecode – the code generated by javac might not represent Ensemble idioms in the most efficient way. Optimising at the compilation stage, with knowledge of the changes to the instruction set made by the VM, might be easier than attempting to optimise javac-generated bytecode in the linker.

• Variable and stack usage – the Ensemble VM uses bitmaps to record which local variables and stack slots contain objects, so that garbage collection works correctly. These must be maintained at runtime because javac-generated code reuses slots for different types throughout the lifetime of a method call. By contrast, in Darjeeling (Section 2.2.2) a frame has separate variables and stacks for objects and primitives, so that the information needed for garbage collection is available statically [49]. This is possible because of the extensive bytecode rewriting Darjeeling performs during the linking stage. A similar process could be adopted in Ensemble.

Although there are clear advantages, time constraints prevented the use of direct bytecode generation in this work.
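The runtime bitmap mentioned under variable and stack usage can be pictured with the following Java sketch. It assumes one bit per slot in a single 64-bit word, and represents object references as integer handles; the `FrameBitmap` class and its methods are hypothetical and written in Java only for consistency with the rest of this section, not as a description of the VM's own code.

```java
import java.util.ArrayList;
import java.util.List;

class FrameBitmap {
    private final int[] slots; // raw slot contents: primitives or object handles
    private long objectBits;   // bit i set means slot i currently holds an object

    FrameBitmap(int slotCount) {
        if (slotCount > 64) {
            throw new IllegalArgumentException("this sketch supports at most 64 slots");
        }
        this.slots = new int[slotCount];
    }

    // Storing a primitive clears the slot's bit, since javac-generated code may
    // reuse a slot that previously held an object.
    void storePrimitive(int slot, int value) {
        slots[slot] = value;
        objectBits &= ~(1L << slot);
    }

    // Storing a reference sets the slot's bit so the collector will see it.
    void storeReference(int slot, int handle) {
        slots[slot] = handle;
        objectBits |= (1L << slot);
    }

    // The collector treats only the slots whose bit is set as roots.
    List<Integer> gcRoots() {
        List<Integer> roots = new ArrayList<>();
        for (int i = 0; i < slots.length; i++) {
            if ((objectBits & (1L << i)) != 0) {
                roots.add(slots[i]);
            }
        }
        return roots;
    }

    public static void main(String[] args) {
        FrameBitmap frame = new FrameBitmap(4);
        frame.storeReference(0, 100); // slot 0 holds an object handle
        frame.storePrimitive(1, 100); // same bit pattern, but a primitive
        System.out.println(frame.gcRoots()); // [100]

        frame.storePrimitive(0, 7);   // slot 0 reused for a primitive
        System.out.println(frame.gcRoots()); // []
    }
}
```

With separate object and primitive slots, as in Darjeeling, the bit updates on every store would be unnecessary because the kind of each slot would be known statically.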

In document A linguistic approach to concurrent, distributed, and adaptive programming across heterogeneous platforms (Page 99-103)