The AppFabric platform itself is a complex and distributed system. Therefore, before delving into the details of the implementation, let us first discuss some of the design challenges and how we address them in our implementation. Note that most of these challenges are generic to the design of most distributed systems; it is the solution that is system specific. Also
there could be many alternate solutions to each of these problems, however, we will try to motivate that our design choices are valid and acceptable for the case in point.
• Enforcing modularity: Modularity is a common goal with almost all system designs. It is basically a way to factor the system into a set of constituent components. In general, modularity helps achieve the following system properties:
– Readability and maintainability: Code readability and maintainability is of- ten an important goal, especially for large software projects. Modularity allows parts of a code to be developed and maintained separately and in parallel. It allows the code to be more easily maintained by restricting the e↵ect of code changes to modules and more deterministically manage these changes.
– De-coupling fate: A stronger notion of modularity is to address the problems of fate-sharing across modules. This requires that if a module fails, it should not be able to take down the whole system. Also, a malfunctioning module should not be able to directly corrupt the state of another module.
– Concurrency: Modularity also introduces the idea of concurrency where each module may independently execute concurrently. This allows the system designer to make more efficient systems by partitioning his code into concurrently exe- cutable modules such that the system as a whole can perform many di↵erent tasks at the same time leading to better utilization of the hardware resources and ensuring that the system as a whole makes progress all the time.
The most basic form of modularity is obtained through procedure (or function) calls. Procedures are modules within the same process. However, this is quite a weak notion of modularity that neither supports fate de-coupling nor concurrency. The procedure call graph is sequential and all procedures share the whole process address space. Therefore, a single malformed function can very easily corrupt the address space and the whole program can be brought down. Often a more stronger notion of modularity is required for most systems. Threads provide a stronger notion of modularity where each module runs in a separate thread within the same process. Threads share the process heap but each thread has its own stack. Therefore, threads support concurrent execution while not completely achieving fate-decoupling. The next level of modularity may be achieved by running each module in a separate process. This allows the system
to achieve all the three modularity goals. However, processes in the same host share fate in the sense that a rogue process with the right privileges may corrupt the hosts operating environment. Also, concurrency in this case is largely notional since each independent module still share the same hardware resources. Distributing the modules across di↵erent processes on di↵erent hosts provides a stronger notion of both fate- decoupling and concurrency.
Any sufficiently complex system employs all these di↵erent techniques for modularity and AppFabric is no di↵erent. In modularizing AppFabric, we made the following design choices:
– Run platform modules and application services in separate address spaces: The first level of modularity is enforced by always running platform modules and application services in separate processes. Therefore application services are designed as external modules that need to connect to the platform through an external communication interface. This is in contrast to an alter- nate implementation where the platform is provided as a special communication library that runs in the address space of the service and provides an intelligent communication substrate. Separating the platform and the application service allows the platform to independently handle failed application services without a↵ecting other services.
– May run platform modules in the same host within a single address space: In each host, platform modules may run within the same address space (same process). Note the use of the word ”may” in the preceding sentence. In our current implementation, all the sPort and pPort instances run in the same process whereas all the tPort instances run in a separate process. This is not a design requirement and is the case simply because we want to use the kernels network stack to interpose between message-level and packet-level communica- tions. We could very well write our own version of the network stack and avoided this. Anyways, for platform modules running in the same process, concurrency is provided by running them in separate threads. Therefore, each port instance runs as a separate thread. This is to avoid the overhead of enforcing process level modularity; especially since it is not required within trusted modules. This also allows the platform to have some explicit control over scheduling the modules
(the ports in this case) instead of delegating it to the operating system. How- ever, in our current implementation we do not do any explicit scheduling of the threads. These threads share the process heap and it provides threads with a cheap way to communicate with each other. However, in-order to avoid fate- sharing completely, the only way for threads to communicate inside the platform process is through messaging. Messaging is a less efficient communication mech- anism between modules than shared memory but has excellent concurrency and fate-decoupling properties.
– No assumption can be made whether application services will be de-
ployed on the same host or on separate hosts: The platform requires
that each application service be run as a separate process. But beyond that, it may not make any assumption whether they will be run on the same host or on separate hosts. When run on the same host, inter-process communication is the cheapest way to communicate whereas when run on di↵erent hosts the only way to communicate is through the network transport. However, this decision is taken dynamically during deployment time and hence the platform needs to provide a common communication interface to the service and automatically map it to the appropriate transport during deployment. This is exactly what the AppFabric Service Conduit (ASC) abstraction provides.
• Synchronous vs. asynchronous bootstrap: There are two options of bootstrap- ping distributed systems - synchronous and asynchronous. In a synchronous system, a strict bootstrapping order is maintained where one module is started only after another has started and so on. In an asynchronous bootstrap, each module may be started at any point in time. For a large distributed system like AppFabric involving multiple hosts in multiple datacenters, it would be highly inefficient to have a synchronous boot- strap. Also, an asynchronous bootstrap mechanism helps in making the system more dynamic. AppFabric implements asynchronous bootstrapping through a simple re-try mechanism for each connection request. If a node tries to connect to another node but fails, it re-tries after a certain interval. The interval may be fixed or adaptive. In-fact, it makes more sense to decrease the interval across successive re-trials since during the bootstrap stage the probability of the node being ready to accept the connection request becomes higher as more time passes.
• Synchronous vs. asynchronous request/response messaging: Another design choice that is often encountered by distributed system designers is whether to imple- ment synchronous or asynchronous request/response messaging. In synchronous mes- saging each request is lock stepped with a corresponding response. In asynchronous messaging, the request response pairs are not lock stepped, thus allowing the system to handle many requests concurrently. AppFabric implements asynchronous request response messaging. Each request message is given an unique ID that is replayed in the response. This allows the sender to match a request to a response in case the request-response pairing discipline needs to be maintained. Also, each recipient main- tains a message queue to queue the messages and process them in the order in which it received them.
• Two-step vs. three-step transactions: This is another important design choice that depends on the type of the transaction between two modules. As is the general rule, AppFabric applies three-step transaction process whenever two modules are nego- tiating certain parameters that requires either synchronizing or updating critical state information. As a very simple example; suppose that the central controller sends a request to a datacenter controller to start a service on any host on any port in the datacenter. The datacenter controller starts the service and sends back the host ID, host IP and port number for the service to the central controller. The central con- troller then acknowledges receiving this information so that the datacenter controller can delete this service-host mapping from its database since this mapping is of use only to the central controller (that keeps record of service deployments). For most other transactions that does not involve parameter negotiations or update of critical state, two-step transactions are enough.
• Lazy updates: The AppFabric control plane has two conflicting goals: scalability and dynamism. It needs to dynamically adapt the system to the needs of the application while at the same time be scalable to handle large application deployments. These goals are conflicting because to be dynamic the control plane has to keep an updated view of the whole system which involves either polling (from top down) or reporting (from bottom up) the state of each individual data plane node at very frequent inter- vals. Having a hierarchical control plane design helps to a certain extent. The other mechanism we apply is lazy updates. The way it works is that instead of reporting
system state at regular intervals, either the control plane queries the system before it needs to make a critical decision such as allocating more nodes, or the data plane nodes report an event based on a configuration; such as when the load crosses a particular pre-set threshold.