How to Tolerate Half Less One Byzantine Nodes in Practical Distributed Systems
University of Lisboa Faculty of Sciences LASIGE – Navigators Group
http://www.navigators.di.fc.ul.pt/
Miguel Correia, Nuno F. Neves, Paulo Veríssimo
Navigators FCUL
Summary
• We present:
  – An intrusion-tolerant (Byzantine-tolerant) service
  – Based on the state machine approach
  – That tolerates malicious replicas
• Features:
  – Improves, for the first time, the maximum resilience in practical distributed systems: from f = ⌊(n−1)/3⌋ to f = ⌊(n−1)/2⌋ (2f+1 replicas instead of 3f+1)
  – Circumvents FLP without synchrony assumptions on the asynchronous part of the system (all synchrony is in a kind of oracle)
Outline
1. Motivation
2. System Model and TTCB
3. 2f+1 Atomic Multicast
4. State Machine Replication
5. Conclusions and Future Work
Motivation (I)
• Intrusions are hard to prevent entirely =>
intrusion tolerance
• State machine approach is a generic solution
for fault/intrusion-tolerant systems but:
  – Distributed systems are typically modeled with the asynchronous model in this context
  – Maximum resilience for asynchronous systems is f = ⌊(n−1)/3⌋ (at least 3f+1 replicas)
Motivation (II)
• Each replica has two costs:
  – Hardware and software
  – Design, since replicas have to fail independently
    – Cannot be attacked simultaneously (N-version programming)
• Using a kind of oracle, we manage to improve the resilience to f = ⌊(n−1)/2⌋ (2f+1 replicas)
  – A reduction of 25% to 33% in the number of replicas
  – Also circumvents the impossibility of consensus in asynchronous systems
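The 25% to 33% range can be checked with a few lines of arithmetic. The sketch below assumes the comparison is between 3f+1 replicas (the usual bound for asynchronous Byzantine systems) and 2f+1 replicas; the function name `replica_savings` is made up for this illustration.

```python
from fractions import Fraction

def replica_savings(f: int) -> Fraction:
    """Fraction of replicas saved when going from 3f+1 to 2f+1 replicas."""
    if f < 1:
        raise ValueError("f must be at least 1")
    before = 3 * f + 1   # assumed baseline: asynchronous Byzantine bound
    after = 2 * f + 1    # replicas needed with the oracle
    return Fraction(before - after, before)

# Columns: f, replicas before, replicas after, fraction saved
for f in (1, 2, 10, 1000):
    print(f, 3 * f + 1, 2 * f + 1, float(replica_savings(f)))
```

With f = 1 the saving is exactly 1/4 (4 replicas down to 3), and it approaches 1/3 from below as f grows.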
System Model and TTCB
Architectural Hybridization
• Most of the system is asynchronous and subject to Byzantine faults
• There is a subsystem built in such a way that it
is secure: the TTCB
[Figure: hosts 1 to n, each with processes, an OS, and a local TTCB; the hosts are connected by the payload network and the local TTCBs by the TTCB control channel]
TTCB and Intrusion Tolerance
• The TTCB serves to support the execution of
intrusion-tolerant protocols/applications
  – They run mostly in the payload system, which can be attacked
  – They use the TTCB to execute some critical steps securely
TTCB Design Principles
• Interposition: it must be interposed between
vital resources and any attempt to interact
with them
• Shielding: it must be protected from faults
affecting security
• Validation: its implementation must be
verifiable
TTCB Services
• The design principles require that the TTCB is
simple => limited set of services
• Trusted Multicast Ordering Service (TMO)
  – The objective is to support the execution of an intrusion-tolerant atomic multicast protocol
  – It is not affected by malicious faults since it is a TTCB service
System Model
• The system is implemented by a set of
processes:
  – They run in the payload system (outside the TTCB)
  – They communicate using the payload network (outside the TTCB) over reliable channels:
    – Messages are eventually delivered
    – Message integrity is guaranteed
  – They call the TTCB TMO service to execute some steps securely and efficiently
2f+1 Atomic Multicast
Atomic Multicast
• Simplified definition:
  – All correct processes deliver the same messages
  – If the sender is correct, then the correct processes deliver the message
  – All messages are delivered in the same order by all correct processes
The Protocol
• Uses a cryptographic hash function H to give
the TTCB a unique identifier of the message
(e.g., SHA-1)
  – The output of H has fixed length
  – It is computationally infeasible to find two different inputs that hash to the same output
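As a small illustration of the fixed-length identifier, the sketch below computes H(M) with Python's standard `hashlib`, using SHA-1 only because it is the example named above. The message contents are invented for the example.

```python
import hashlib

def H(message: bytes) -> bytes:
    """Fixed-length identifier of a message (SHA-1, as in the example above)."""
    return hashlib.sha1(message).digest()

# Hypothetical messages, for illustration only
m1 = b"REQUEST: transfer 10 from A to B"
m2 = b"REQUEST: transfer 10 from A to C"

# The digest length does not depend on the input length
print(len(H(m1)), len(H(b"x" * 10_000)))  # both 20 bytes

# Different messages give different identifiers
print(H(m1) != H(m2))
```

Only the short digest needs to be handed to the TTCB, while the full message travels over the payload network.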
Protocol execution
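[Figure: protocol execution with processes P0, P1, P2 and the TTCB (n=3, f=1). The sender of M1 calls the TMO service with "sent H(M1)" and the receivers with "received H(M1)"; once f+1 processes have M1, the TTCB assigns order = 1 and the processes obtain "deliver H(M1), 1".]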
TMO Service
• Is the core of the solution
  – Decides when a message can be delivered
    – If f+1 processes show that they have the message, then at least one correct process has the message
  – Defines a sequential order for the messages
  – Provides reliable information since the TTCB is secure
• Implementation of the service
  – It is a kind of atomic multicast protocol...
  – ...but executed in a benign environment: the TTCB
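The two decisions made by the TMO service (when a message becomes deliverable, and in which order) can be sketched with a toy, single-process stand-in. This is not the TTCB implementation or its API: the class `ToyTMO` and its `report` method are hypothetical names, and the distributed, secure execution inside the TTCB is left out entirely.

```python
class ToyTMO:
    """Toy stand-in for the TMO service (illustration only, not the real API)."""

    def __init__(self, f: int):
        self.f = f
        self.holders = {}     # message hash -> set of process ids that showed it
        self.order = {}       # message hash -> assigned order number
        self.next_order = 1   # order numbers start at 1

    def report(self, process_id: int, msg_hash: bytes):
        """Record that process_id has the message; return its order if assigned."""
        self.holders.setdefault(msg_hash, set()).add(process_id)
        enough = len(self.holders[msg_hash]) >= self.f + 1
        if enough and msg_hash not in self.order:
            # f+1 distinct processes have it, so at least one correct one does
            self.order[msg_hash] = self.next_order
            self.next_order += 1
        return self.order.get(msg_hash)   # None means: not deliverable yet


tmo = ToyTMO(f=1)
print(tmo.report(0, b"H(M1)"))  # None: only one process has shown the message
print(tmo.report(1, b"H(M1)"))  # 1: f+1 processes have it, order assigned
print(tmo.report(2, b"H(M1)"))  # 1: later callers learn the same order
```

Using a set per hash means a single process cannot reach the f+1 threshold by reporting the same message repeatedly.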
System Architecture
[Figure: clients c1 to cm and servers s1 to sn, each on a host with an OS and a local TTCB; the hosts are connected by the payload network and the local TTCBs by the TTCB control channel]
State Machine Approach
• Servers are state machines:
  – State variables, commands
• All correct servers follow the same history of
states iff:
  – Initial state: all servers start in the same state
  – Agreement: all servers execute the same commands
  – Total order: all servers execute the commands in the same order
  – Determinism: the same command executed in the same initial state generates the same final state
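The four conditions above can be seen on a toy example. The sketch below uses an invented state machine, `Accumulator`, whose state is a single integer; it is only meant to show that identical command sequences give identical states, and that changing the order breaks this.

```python
class Accumulator:
    """A tiny deterministic state machine: the state is one integer."""

    def __init__(self, initial: int = 0):
        self.state = initial

    def execute(self, command):
        op, arg = command
        if op == "add":
            self.state += arg
        elif op == "mul":
            self.state *= arg
        else:
            raise ValueError(f"unknown command: {op}")
        return self.state


commands = [("add", 2), ("mul", 5), ("add", 1)]

# Same initial state, same commands, same order: all replicas agree
replicas = [Accumulator() for _ in range(3)]
for replica in replicas:
    for command in commands:
        replica.execute(command)
states = [replica.state for replica in replicas]
print(states)  # [11, 11, 11]

# Same commands in a different order: the state diverges
reordered = Accumulator()
for command in [("mul", 5), ("add", 2), ("add", 1)]:
    reordered.execute(command)
print(reordered.state)  # 3
```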
Clients
• Clients:
  – Any number of clients can fail
  – They have local (unreliable) clocks
• Protocol:
  – Send a REQUEST message to one server
  – Wait for f+1 identical REPLY messages from different servers
  – If the replies have not been received Tresend after the REQUEST message was sent, send the REQUEST message to f additional servers
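The client's acceptance rule (f+1 identical replies from different servers) can be sketched as a small vote count. The function name `accepted_reply` and the representation of replies as a dictionary from server id to reply value are assumptions made for this example; networking and the Tresend timer are omitted.

```python
from collections import Counter

def accepted_reply(replies: dict, f: int):
    """Return the reply value sent by at least f+1 distinct servers, or None.

    `replies` maps a server id to the reply value received from that server,
    so each server is counted at most once.
    """
    counts = Counter(replies.values())
    for value, votes in counts.items():
        if votes >= f + 1:
            return value
    return None


f = 1
print(accepted_reply({1: "ok"}, f))                     # None
print(accepted_reply({1: "ok", 2: "bad"}, f))           # None
print(accepted_reply({1: "ok", 2: "bad", 3: "ok"}, f))  # ok
```

Since at most f servers are faulty, a value backed by f+1 different servers is backed by at least one correct server.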
Servers
• At most f = ⌊(n−1)/2⌋ servers can fail
• Simplified protocol:
  – When a server receives a REQUEST message, it atomically multicasts it to all the servers
  – When the atomic multicast protocol delivers a message, if the same request has not been previously delivered, then execute the command
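The duplicate check in the second step matters because the same request can be multicast by more than one server, for example after a client resends it. A minimal sketch of the delivery handler, assuming each request carries a unique identifier (the names `ToyServer`, `on_deliver`, and `request_id` are hypothetical):

```python
class ToyServer:
    """Executes each delivered request at most once (illustration only)."""

    def __init__(self):
        self.balance = 0          # example state variable
        self.delivered = set()    # identifiers of requests already executed

    def on_deliver(self, request_id: str, amount: int):
        """Called when the atomic multicast protocol delivers a request."""
        if request_id in self.delivered:
            return None           # duplicate delivery: do not execute again
        self.delivered.add(request_id)
        self.balance += amount    # execute the command
        return self.balance


server = ToyServer()
print(server.on_deliver("client7:1", 10))  # 10
print(server.on_deliver("client7:1", 10))  # None (same request again)
print(server.on_deliver("client7:2", 5))   # 15
```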
Handling Requests’ Corruption
• A malicious server might corrupt a request
before atomically multicasting it
• Solution: the REQUEST message has a
vector of MACs, one per server
  – Obtained with keys shared by the client with each of the servers
  – Each server can use one of the MACs to verify if the message was corrupted
  – If its MAC is wrong, a correct server does not give H(M) to the TTCB, i.e., it does not contribute to the f+1 threshold
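The vector of MACs can be sketched with Python's standard `hmac` module. The slides do not name a MAC algorithm, so HMAC-SHA256 is an assumption here, as are the function names and the example keys.

```python
import hashlib
import hmac

def make_mac_vector(request: bytes, keys: dict) -> dict:
    """Client side: one MAC per server, using the key shared with that server."""
    return {
        server_id: hmac.new(key, request, hashlib.sha256).digest()
        for server_id, key in keys.items()
    }

def server_accepts(server_id, key: bytes, request: bytes, mac_vector: dict) -> bool:
    """Server side: check only the MAC addressed to this server."""
    expected = hmac.new(key, request, hashlib.sha256).digest()
    return hmac.compare_digest(expected, mac_vector.get(server_id, b""))


# Hypothetical keys shared between one client and servers 0, 1, 2
keys = {0: b"key-server-0", 1: b"key-server-1", 2: b"key-server-2"}
request = b"REQUEST: add 10"
vector = make_mac_vector(request, keys)

print(server_accepts(1, keys[1], request, vector))             # True
print(server_accepts(1, keys[1], b"REQUEST: add 99", vector))  # False
```

A server for which the check fails would simply not report H(M) to the TTCB, so a request corrupted by a malicious server cannot collect f+1 reports from correct servers.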
Performance
• Without batching
• Time complexity:
  – Two asynchronous rounds +
  – One round of TMO executions
• Message complexity:
Conclusions and Future Work
Conclusions
• First solution for intrusion-tolerant state-machine
replication in practical distributed systems with
only 2f+1 replicas
• Interesting impact since each additional replica
has a considerable cost
• Circumvents FLP without synchrony assumptions
on the asynchronous part of the system
  – All synchrony is encompassed in the TTCB
• The performance is promising:
Future Work
• Is it possible to simplify the protocol?
• What is the minimum TTCB service that can be
used to solve atomic multicast?
• Ongoing implementation of a new TTCB-like
asynchronous component (Perseus/Fiasco)
  – Performance evaluation
  – Better understanding of the tradeoffs involved
• New TTCB-like components:
  – LIDS, hardware (PC/104 appliances), WANs
• New applications for TTCB-like components