Optimizations for High Performance ORBs Douglas C. Schmidt (www.cs.wustl.edu/
schmidt)
Aniruddha S. Gokhale (www.cs.wustl.edu/
gokhale) Washington University, St. Louis,
USA.
1
Motivation
Typical state of aairs today is the \Distri- bution Crisis"
{
Computers and networks get faster and cheaper
{
Communication software gets slower, buggier, more expensive
Accidental complexity is one source of prob- lems, e.g.,
{
Incompatible software infrastructures
{
Continuous rediscovery and reinvention of core con- cepts and components
Inherent complexity is another source of prob- lems
{
e.g., latency, partial failures, partitioning, causal ordering, etc.
2
Candidate Solution: CORBA
DII
DII ORB ORB
INTERFACE INTERFACE
OBJECT OBJECT REQUEST BROKER REQUEST BROKER
op() op()
IDL
STUBS IDL
STUBS OBJECT OBJECT
ADAPTER ADAPTER
IDL
SKELETON IDL
SKELETON DSI DSI
in args in args out args + return value out args + return value
OBJECT OBJECT IMPLEMENTATION IMPLEMENTATION CLIENT
CLIENT
Goals
1. Simplify development of distributed applications 2. Provide exible foundation for higher-level services
Observations
CORBA is well-suited for certain commu- nication requirements and certain network environments
{
e.g., request/response or oneway messaging over low-speed Ethernet or Token Ring
However, current CORBA implementations exhibit high overhead for other types of re- quirements and environments
{
e.g., bandwidth-intensive and delay-sensitive ap- plications over high-speed networks
Performance limitations will ultimately im-
pede adoption of CORBA
Performance Problems with Existing ORBs
Existing ORBs lack certain features (e.g., end-to-end QoS)
Reasons for poor performance
{
Inecient server demultiplexing techniques
{
Long chains of intra-ORB function calls
{
Excessive presentation layer conversions and data copying
{
Non-optimized buering algorithms used for net- work reads and writes
{
Improper choice of underlying operating system in- terfaces
{
Lack of integration with advanced "OS" and net- work features
5
Key Research Questions
\Can CORBA be used for performance-sensitive applications on high-speed networks?"
{
Goal is to determine this empirically
\What are the strategic optimizations for Gigabit CORBA"?
{
Goal is to maintain strict CORBA compliance
\What changes are required to provide Real- time CORBA?"
{
Goal is to provide end-to-end QoS guarantees
6
Pinpointing CORBA Overhead
Presentation layer overhead
{
e.g., typed and untyped data
Data manipulation and data copying over- head
{
e.g., message management
Demultiplexing and operation dispatching over- head
{
e.g., layered and de-layered demultiplexing
OS/network/protocol integration
{
e.g., ATM/host adapters, resource reservation and scheduling
General Path of CORBA Requests
STUBS
(
SII)
OR DII REQUESTCLIENT SERVER
FRAMING
,
ERROR CHECKING AND INTEROPERABILITYMARSHALLING
OS LEVEL
INTERFACE METHOD IMPLEMENTATION
FRAMING
,
ERROR CHECKING AND INTEROPERABILITY DEMARSHALLING ANDDEMULTIPLEXING
OS LEVEL
NETWORK
Experiments Performed
Throughput measurements
{
i.e., using static and dynamic invocation policies
Receiver-side request demultiplexing
Latency measurements
Scalability
{
i.e., under varying number of objects per server
9
Network/Host Environment
BAY NETWORKS BAY NETWORKS
LATTISCELL LATTISCELL ATM SWITCH ATM SWITCH
(16
(16 PORT PORT ,, OC3 OC3 155
155 MBPS MBPS // PORT PORT ,, 9,180
9,180 MTU MTU )) ULTRA
ULTRA SPARC SPARC 22
(( ENI ATM ENI ATM ADAPTORS ADAPTORS AND ETHERNET AND ETHERNET ))
10
TTCP Conguration for CORBA Implementation
ATM ATM SWITCH SWITCH
2: forward 2: forward
TTCP TTCP Impl Impl
3: send(buf) 3: send(buf)
4: ack 4: ack
Sender
Sender 1: send(buf) 1: send(buf)
TTCP TTCP Skel Skel TTCP
TTCP Stub Stub
Experimental Setup for Throughput Measurements
Enhanced version of TTCP
{
TTCP measures end-to-end data/request transfer
{
Enhanced version compares Orbix 2.1 and Visi- Broker 2.0
Parameters varied
{
64MB of typed data
.
Types included sequences of scalars and structs
{
Sender buer sizes ranged from 1K to 1024K
{
Socket queues were 8k (default) and 64k (maxi- mum)
{
Network was 155 Mbps ATM and \loopback"
Throughput Measurements using Static Invocation Interface
www.cs.wustl.edu/
schmidt/SIGCOMM
,96.ps.gz
13
Orbix SII 'Blackbox' Performance
0 20 40 60 80 100 120
0 20 40 60 80 100 120 140
Throughput in Mbps
Sender Buffer size in KBytes
short long double char octet struct
14
VisiBroker SII 'Blackbox' Performance
0 20 40 60 80 100 120
0 20 40 60 80 100 120 140
Throughput in Mbps
Sender Buffer size in KBytes
short long double char octet struct
Orbix High-Cost Functions (sender side)
Orbix sequences for 128K user buer
Test Time %Exec Name
(msec)
--- scalars 6,091 86.00 write
895 13.00 memcpy struct 32,713 72.00 write
1,177 2.58 NullCoder::codeLongArray
978 2.15 Request::op<<(double&)
952 2.09 BinStruct::encodeOp
952 2.09 Request::encodeLongArray
922 2.02 Request::insertOctet
922 2.02 Request::op<<(short&)
922 2.02 Request::op<<(long&)
922 2.02 Request::op<<(char&)
838 1.84 NullCoder::codeDouble
755 1.66 NullCoder::codeLong
671 1.47 memcpy
Orbix Whitebox Analysis of Throughput using SII
read
OS Kernel ttcp_sequence_i::
sendStructSequence
CORBA::Request::
send
write
OS Kernel
NETWORK
CLIENT SERVER
ttcp_sequence::
sendStructSequence
CORBA::Request::
invoke
FRFInterface::send
OrbixChannel::
sendMsg
OrbixTCPChannel::
transmitBuffer
send OrbixTCPChannel::
receiveBuffer OrbixChannel::
readMsg Channel::
processIncomingMsg FRRInterface::dispatch
ContextClassS::
dispatch ttcp_sequence_dispatch::
dispatch
OS LEVEL INTERFACEMETHODIMPLEMENTATION
OS LEVEL
0.01%
15%
0.01%
75-80%
CLIENT SEND
0.01%
MARSHALLING
20-25%
0.01%
72%
FRAMING, ERROR CHECKING (NO IIOP INTEROPERABILITY) FRAMING,ERROR CHECKING(NO IIOPINTEROPERABILITY) DEMARSHALLINGDEMULTIPLEXING
17
VisiBroker High-Cost Functions (sender side)
VisiBroker sequences for 128K user buer
Test Time %Exec Name
(msec)
--- scalars 6,214 99.00 write
struct 37,960 72.00 write
3,831 7.28 op<<(NCostream&,BinStruct&) 3,594 6.83 memcpy
979 1.86 PMCIIOPStream::op<<(double) 951 1.81 PMCIIOPStream::put
951 1.81 PMCIIOPStream::op<<(long) 666 1.27 PMCIIOPStream::op<<(short)
18
VisiBroker Whitebox Analysis of Throughput using SII
PMCIIOPstream::
_read PMCIIOPStream::
receiveMessage
read
OS Kernel ttcp_sequence_i::
sendStructSequence
PMCSkelInfo::
execute
PMCBOAClient::
request
PMCBOAClient::
ProcessMessasge
PMCIIOPstream::
_write CORBA::Object::
_send
PMCStubInfo::
send
PMCIIOPStream::
sendMessage
writev
OS Kernel
NETWORK
CLIENT SERVER
_sk_ttcp_sequence::
_sendStructSequence
ttcp_sequence::
sendStructSequence
INTERFACE
METHOD
IMPLEMENTATION FRAMING,ERROR CHECKINGINTEROPERABILITY OS LEVEL DEMARSHALLINGDEMULTIPLEXING
0.01%
12%
11%
70-75%
CLIENT SENDMARSHALLING
FRAMING, ERROR CHECKING INTEROPERABILITYOS LEVEL
0.01%
15-20%
5%
72%
Throughput Measurements using Dynamic Invocation and Dynamic
Skeleton Interface
www.cs.wustl.edu/
schmidt/GLOBECOM
,96.ps.gz
Orbix DII 'Blackbox' Performance
0 20 40 60 80
0 20 40 60 80 100 120 140
Throughput in Mbps
Sender Buffer size in KBytes
short long double char octet struct
21
VisiBroker DII 'Blackbox' Performance
0 20 40 60 80
0 20 40 60 80 100 120 140
Throughput in Mbps
Sender Buffer size in KBytes
short long double char octet struct
22
VisiBroker DII Client Datapath
CORBA_Request::
send_oneway
CORBA_Request::
_invoke
CORBA_Request::
_marshal_out
CORBA_Object::
_send
CORBA_Any::write_value
_write_value(NCostream&, TypeCode&) _write_value(NCostream&, TypeCode&,MarshalInBuf)
Marshal individual fields of the struct
memcpy
PMCStubInfo::send
PMCIIOPStream::
sendMessage
PMCIIOPstream::_writev
CORBA::MarshalInBuffer::
get
memcpy writev
OS Kernel
Orbix High-Cost Functions
Orbix sequences for 128K user buer
Test Time %Exec #Calls Name
(msec)
--- scalars 27,864 98.14 513 _libc_write
352 1.24 2,560 memcpy
struct 25,740 12.38 44,085,834 cleanfree 18,893 9.09 46,190,107 _free_unlocked 14,328 6.89 44,092,964 realfree
9,710 4.67 4,195,328 count 7,522 3.62 46,190,159 malloc 6,461 3.11 46,190,158 new
4,855 2.34 2,097,664 CORBA::typeCode
::putValue
VisiBroker High-Cost Functions
VisiBroker sequences for 128K user buer
Time %Exec #Calls Name
(msec)
--- octets/chars
3,630 90.50 512 writev
353 8.81 8,195 memcpy
doubles
1,413 27.81 8,414,220 memcpy
1,229 24.21 512 writev
1,173 23.09 8,388,608 CORBA::MarshallInBuffer ::operator>>(double&) 880 17.32 8,389,632 CORBA::MarshallInBuffer
::get(char*)
352 6.93 512 CORBA::MarshallInBuffer ::get(double*)
25
VisiBroker High-Cost Functions (cont,d.)
Time %Exec #Calls Name
(msec)
--- longs
2,346 29.41 16,777,216 CORBA::MarshallInBuffer ::operator>>(long&) 1,883 23.60 16,817,163 memcpy
1,760 22.06 16,777,728 CORBA::MarshallInBuffer ::get(char*)
1,249 15.65 512 writev
704 8.82 512 CORBA::MarshallInBuffer ::get(long*)
shorts
4,693 33.87 33,554,422 CORBA::MarshallInBuffer operator>>(short&) 3,520 25.40 33,554,944 CORBA::MarshallInBuffer
::get(char*) 2,824 20.38 33,627,659 memcpy
1,408 10.16 512 CORBA::MarshallInBuffer ::get(short*)
1,367 9.86 512 writev
26
VisiBroker High-Cost Functions (cont,d.)
Time %Exec #Calls Name
(msec)
--- structs
7,279 25.96 9,216 writev
5,200 20.00 2,796,032 various marshalling functions
4,080 14.55 39,321,438 memcpy
3,188 11.37 16,776,192 write_value(NCOstream&, CORBA:::TypeCode*, CORBA::MarshallInBuffer) 2,053 7.32 19,572,736 CORBA::MarshalInBuffer
::get(char*) 1,702 6.07 2,796,032 _write_struct_value
(NCOstream, TypeCode) 1,466 5.23 13,980,160 CORBA::TypeCode::
member_type 1,408 5.02 16,776,192 CORBA::TypeCode::
member_count
VisiBroker DSI 'Blackbox' Performance
0 5 10 15 20 25 30
0 20 40 60 80 100 120 140
Throughput in Mbps
Sender Buffer size in KBytes
short
long
double
char
octet
struct
Measuring Server-side Request Demultiplexing Overhead
An interface with N (N is large, say 100) methods dened
Method names chosen arbitrarily
In each iteration, client invokes the nal method M times (M is large, say 100)
Repeat the experiment 100, 500, 1,000 times
29
Client-side Latency in seconds
Version Iterations
1 100 500 1,000
---
Orbix 0.032 5.23 31.86 64.21
VisiBroker 0.010 4.53 24.44 50.06 ---
30
Demultiplexing in ORBs
2: DEMUX TO
I/O HANDLE
OBJECTOBJECT N::N:: METHODMETHOD KK
OBJECTOBJECT N::N:: METHODMETHOD 11
OBJECTOBJECT 1::1:: METHODMETHOD KK
OBJECTOBJECT 1::1:: METHODMETHOD 22
...
...
... ... ... ... ... ... ...
OBJECTOBJECT 1::1:: METHODMETHOD 11
DE
DE -- LAYERED REQUEST LAYERED REQUEST DEMULTIPLEXER DEMULTIPLEXER
METHODMETHOD KK
METHODMETHOD 22
...
...
METHODMETHOD 11 ...
...
...
...
...
...
...
IDL IDL SKEL
SKEL 11 SKEL SKEL IDL IDL 22 SKEL SKEL IDL IDL M M
OBJECT ADAPTER
OBJECT
OBJECT 11 OBJECT OBJECT 22 OBJECT OBJECT N N 4: DEMUX TO
SKELETON
5: DEMUX TO METHOD
3: DEMUX TO METHOD
1: DEMUX THRU PROTOCOL STACK
3: DEMUX TO OBJECT
OBJECT ADAPTER LAYERED
DEMULTIPLEXING
DE
DE -- LAYERED LAYERED DEMULTIPLEXING DEMULTIPLEXING
2: DEMUX TO
I/O HANDLE
1: DEMUX THRU PROTOCOL STACK
Demultiplexing and Dispatching Overhead in ORBs
Both Orbix and VisiBroker pass the method name in the request
Passing a string adds to the amount of in- formation carried in the request header
Orbix uses strcmp and linear search on the server side to demultiplex incoming request
For very large interfaces, this is undesirable
Hand-crafted Optimizations
For both the ORBs, hash the method name to a numeric value and pass this information in the request header
For Orbix, use numeric comparisons using the hashed value for method name lookup
33
Client-side Latency in seconds
Version Iterations
1 100 500 1,000
---
Orbix 0.032 5.23 31.86 64.21
Optimized Orbix 0.028 4.21 29.97 60.67
VisiBroker 0.010 4.53 24.44 50.06
Optimized VisiBroker 0.010 3.86 24.04 49.36 ---
34
Latency Measurements over ATM
Latency measurements for twoway methods
Methods are of two types:
1. Parameterless methods
2. Methods that send a sequence of octets and structs of primitive data types
Sequence parameter takes range of buer sizes from 1 to 1,024 units of the specic data type
Latency for Parameterless Method Invocation
0.0 0.5 1.0 1.5
Latency in msec
CORBA (Orbix) CORBA (Visigenic) TCP/IP (ACE)
Latency for Methods Sending Sequences
0.0 200.0 400.0 600.0 800.0 1000.0
Units sent (1 struct unit = 32 bytes, 1 octet unit = 1 byte) 0.0
1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0 11.0 12.0 13.0 14.0 15.0
Latency in msec
CORBA (Orbix) structs CORBA (Visigenic) structs TCP/IP (ACE) structs CORBA (Orbix) octets CORBA (Visigenic) octets TCP/IP (ACE) octets
37
Optimizations for High Performance ORBs
Lack of integration with advanced OS and network features:
{
Provide hooks to utilize OS features such as Real- Time scheduling, Multi-threading, and zero-copy buer management
Inecient server demultiplexing techniques:
{
Use delayered demultiplexing and ecient packet lters
Long chains of intra-ORB function calls:
{
Use Integrated Layer Processing and Inlining
{
Use compiler techniques to automate this
38
Optimizations for High Performance ORBs
Excessive presentation layer conversions and data copying:
{
Use ecient stub compilers
{
Achieve an optimal tradeo between compiled ver- sus interpreted stubs
Non-optimized buering algorithms used for network reads and writes:
{
Use optimal buer sizes and ow control
Gigabit CORBA Optimizations
REQUEST DEMULTIPLEXING
OPTIMIZATIONS COMMUNICATION
PROTOCOL OPTIMIZATIONS PRESENTATION
LAYER OPTIMIZATIONS
NETWORK ADAPTER OPTIMIZATIONS
DII
DII ORB ORB
INTERFACE INTERFACE
OBJECT OBJECT REQUEST BROKER REQUEST BROKER
op() op()
IDL IDL
STUBS STUBSOBJECT OBJECT ADAPTER ADAPTER
IDL
IDL
SKELETON SKELETONDSI DSI
in argsin args out args + return value out args + return value
OBJECT OBJECT IMPLEMENTATION IMPLEMENTATION CLIENT
CLIENT
DATA COPYING OPTIMIZATIONS
OS I OS I
//
OO SUBSYSTEM SUBSYSTEMOS I OS I
//
O SUBSYSTEMNETWORK
NETWORK
NETWORKNETWORK ADAPTER ADAPTER
NETWORK NETWORK ADAPTER ADAPTER
I/O
SUBSYSTEM OPTIMIZATIONSReal-time CORBA
...
...
... ... ... ... ... ... ...
GIGABIT I/O SUBSYSTEM GIOP/IIOP END-TO-END TRANSPORT
OBJECT ADAPTER
OBJECTOBJECT N::N:: METHODMETHOD KK
OBJECTOBJECT N::N:: METHODMETHOD 11
OBJECTOBJECT 1::1:: METHODMETHOD KK
OBJECTOBJECT 1::1:: METHODMETHOD 22
OBJECTOBJECT 1::1:: METHODMETHOD 11
DE
DE -- LAYERED REQUEST LAYERED REQUEST DEMULTIPLEXER DEMULTIPLEXER
ZERO ZERO COPY COPY BUFFERS BUFFERS
APPLICATION SPECIFIC
CODE
41
Integrated View of CORBA Optimizations
REAL-TIME REQUEST SCHEDULING
QUEUES REAL
REAL--TIME THREADSTIME THREADS AND UPCALLS AND UPCALLS
ZERO ZERO COPY COPY BUFFERS BUFFERS
REAL-TIME OBJECT ADAPTER
ATM PORT INTERFACE CONTROLLER
(APIC)
GIGABIT I/O SUBSYSTEM
VC
VC11 VCVC22 VCVC33 VCVC44 VCVC55
...
OBJECTOBJECTN::N::METHODMETHOD11
OBJECTOBJECT1::1::METHODMETHOD22
OBJECTOBJECT1::1::METHODMETHOD11
DE
DE--LAYERED REQUESTLAYERED REQUEST DEMULTIPLEXER DEMULTIPLEXER
APPLICATION SPECIFIC
& CORBA CODE SERVICES
REAL-TIME GIOP/IIOP PROTOCOLS
TCP/IP ATM RTP
ATM NETWORK
...
...
... ... ... ...
OBJECTOBJECTN::N::METHODMETHODKK
OBJECTOBJECT1::1::METHODMETHODKK
PRESENTATION LAYER PRESENTATION LAYER Q
QOOSS SPECIFICATIONS SPECIFICATIONS
42
Concluding Remarks
CORBA is a promising architecture for dis- tributed computing
Conventional CORBA implementations are not tuned for high-performance or real-time systems
{
Note, low-speed networks often hide performance overhead
Ultimately, an integrated approach is the best solution
Optimizations must be applied at multiple layers
{