
Optimizations for High Performance ORBs

Douglas C. Schmidt (www.cs.wustl.edu/~schmidt)
Aniruddha S. Gokhale (www.cs.wustl.edu/~gokhale)
Washington University, St. Louis, USA


Motivation

* Typical state of affairs today is the "Distribution Crisis"
  - Computers and networks get faster and cheaper
  - Communication software gets slower, buggier, more expensive

* Accidental complexity is one source of problems, e.g.,
  - Incompatible software infrastructures
  - Continuous rediscovery and reinvention of core concepts and components

* Inherent complexity is another source of problems
  - e.g., latency, partial failures, partitioning, causal ordering, etc.


Candidate Solution: CORBA

[Figure: CORBA architecture. A CLIENT invokes op() on an OBJECT IMPLEMENTATION through the OBJECT REQUEST BROKER; in args flow to the object, out args + return value flow back. Components shown: DII, IDL STUBS, ORB INTERFACE, IDL SKELETON, DSI, OBJECT ADAPTER.]

Goals

1. Simplify development of distributed applications
2. Provide flexible foundation for higher-level services

Observations

* CORBA is well-suited for certain communication requirements and certain network environments
  - e.g., request/response or oneway messaging over low-speed Ethernet or Token Ring

* However, current CORBA implementations exhibit high overhead for other types of requirements and environments
  - e.g., bandwidth-intensive and delay-sensitive applications over high-speed networks

* Performance limitations will ultimately impede adoption of CORBA


Performance Problems with Existing ORBs

* Existing ORBs lack certain features (e.g., end-to-end QoS)

* Reasons for poor performance
  - Inefficient server demultiplexing techniques
  - Long chains of intra-ORB function calls
  - Excessive presentation layer conversions and data copying
  - Non-optimized buffering algorithms used for network reads and writes
  - Improper choice of underlying operating system interfaces
  - Lack of integration with advanced OS and network features


Key Research Questions

* "Can CORBA be used for performance-sensitive applications on high-speed networks?"
  - Goal is to determine this empirically

* "What are the strategic optimizations for Gigabit CORBA?"
  - Goal is to maintain strict CORBA compliance

* "What changes are required to provide Real-time CORBA?"
  - Goal is to provide end-to-end QoS guarantees


Pinpointing CORBA Overhead

* Presentation layer overhead
  - e.g., typed and untyped data

* Data manipulation and data copying overhead
  - e.g., message management

* Demultiplexing and operation dispatching overhead
  - e.g., layered and de-layered demultiplexing

* OS/network/protocol integration
  - e.g., ATM/host adapters, resource reservation and scheduling

General Path of CORBA Requests

[Figure: general path of a CORBA request. CLIENT side: STUBS (SII) OR DII REQUEST; MARSHALLING; FRAMING, ERROR CHECKING AND INTEROPERABILITY; OS LEVEL. The request then crosses the NETWORK. SERVER side: OS LEVEL; FRAMING, ERROR CHECKING AND INTEROPERABILITY; DEMARSHALLING AND DEMULTIPLEXING; INTERFACE METHOD IMPLEMENTATION.]


Experiments Performed

* Throughput measurements
  - i.e., using static and dynamic invocation policies

* Receiver-side request demultiplexing

* Latency measurements

* Scalability
  - i.e., under varying number of objects per server


Network/Host Environment

[Figure: network/host environment. BAY NETWORKS LATTISCELL ATM SWITCH (16 PORT, OC3 155 MBPS/PORT, 9,180 MTU); ULTRA SPARC 2 (ENI ATM ADAPTORS AND ETHERNET).]


TTCP Configuration for CORBA Implementation

[Figure: TTCP configuration across the ATM SWITCH. Labeled steps: 1: send(buf) from the Sender to the TTCP Stub; 2: forward; 3: send(buf) from the TTCP Skel to the TTCP Impl; 4: ack.]

Experimental Setup for Throughput Measurements

* Enhanced version of TTCP
  - TTCP measures end-to-end data/request transfer
  - Enhanced version compares Orbix 2.1 and VisiBroker 2.0

* Parameters varied
  - 64 MB of typed data
    . Types included sequences of scalars and structs
  - Sender buffer sizes ranged from 1K to 1024K
  - Socket queues were 8K (default) and 64K (maximum)
  - Network was 155 Mbps ATM and "loopback"
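
The socket queue sizes and buffer sizes above can be related to ordinary socket calls. The following is a minimal Python sketch, assuming a BSD-sockets platform; the helper names (`make_sender_socket`, `granted_queue_sizes`, `sends_needed`) are hypothetical and are not taken from the TTCP sources. The kernel may round or cap the queue size it actually grants.

```python
import socket

SOCKET_QUEUE_BYTES = 64 * 1024      # the "maximum" socket queue used in the tests
TOTAL_BYTES = 64 * 1024 * 1024      # 64 MB of typed data per run


def make_sender_socket(queue_bytes=SOCKET_QUEUE_BYTES):
    """Create a TCP socket and request the given send/receive queue sizes.

    Hypothetical helper for illustration only.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, queue_bytes)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, queue_bytes)
    return sock


def granted_queue_sizes(sock):
    """Return the (send, receive) queue sizes the kernel actually granted."""
    snd = sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
    rcv = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
    return snd, rcv


def sends_needed(total_bytes, buffer_bytes):
    """Number of sender buffers needed to move total_bytes (ceiling division)."""
    return -(-total_bytes // buffer_bytes)
```

For a 128K sender buffer, `sends_needed(TOTAL_BYTES, 128 * 1024)` gives 512 requests, which is in line with the 512 to 513 write/writev calls reported in the profiles for scalar sequences.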


Throughput Measurements using Static Invocation Interface

www.cs.wustl.edu/~schmidt/SIGCOMM-96.ps.gz


Orbix SII 'Blackbox' Performance

[Plot: throughput in Mbps versus sender buffer size in KBytes, with one curve each for short, long, double, char, octet, and struct.]


VisiBroker SII 'Blackbox' Performance

[Plot: throughput in Mbps versus sender buffer size in KBytes, with one curve each for short, long, double, char, octet, and struct.]

Orbix High-Cost Functions (sender side)

Orbix sequences for 128K user buffer

Test      Time (msec)   %Exec   Name
-----------------------------------------------------------
scalars         6,091   86.00   write
                  895   13.00   memcpy
struct         32,713   72.00   write
                1,177    2.58   NullCoder::codeLongArray
                  978    2.15   Request::op<<(double&)
                  952    2.09   BinStruct::encodeOp
                  952    2.09   Request::encodeLongArray
                  922    2.02   Request::insertOctet
                  922    2.02   Request::op<<(short&)
                  922    2.02   Request::op<<(long&)
                  922    2.02   Request::op<<(char&)
                  838    1.84   NullCoder::codeDouble
                  755    1.66   NullCoder::codeLong
                  671    1.47   memcpy


Orbix Whitebox Analysis of Throughput using SII

[Figure: Orbix request path for SII, CLIENT and SERVER connected by the NETWORK.
 Client-side functions named: ttcp_sequence::sendStructSequence, CORBA::Request::invoke, CORBA::Request::send, FRFInterface::send, OrbixChannel::sendMsg, OrbixTCPChannel::transmitBuffer, write (OS kernel).
 Server-side functions named: read (OS kernel), OrbixTCPChannel::receiveBuffer, OrbixChannel::readMsg, Channel::processIncomingMsg, FRRInterface::dispatch, ContextClassS::dispatch, ttcp_sequence_dispatch::dispatch, ttcp_sequence_i::sendStructSequence.
 Layer labels: CLIENT SEND; MARSHALLING; FRAMING, ERROR CHECKING (NO IIOP INTEROPERABILITY); OS LEVEL; DEMARSHALLING, DEMULTIPLEXING; INTERFACE METHOD IMPLEMENTATION.
 Percentage labels as extracted: 0.01%, 15%, 0.01%, 75-80%, 0.01%, 20-25%, 0.01%, 72%.]


VisiBroker High-Cost Functions (sender side)

VisiBroker sequences for 128K user buffer

Test      Time (msec)   %Exec   Name
-----------------------------------------------------------
scalars         6,214   99.00   write
struct         37,960   72.00   write
                3,831    7.28   op<<(NCostream&,BinStruct&)
                3,594    6.83   memcpy
                  979    1.86   PMCIIOPStream::op<<(double)
                  951    1.81   PMCIIOPStream::put
                  951    1.81   PMCIIOPStream::op<<(long)
                  666    1.27   PMCIIOPStream::op<<(short)


VisiBroker Whitebox Analysis of Throughput using SII

[Figure: VisiBroker request path for SII, CLIENT and SERVER connected by the NETWORK.
 Client-side functions named: ttcp_sequence::sendStructSequence, CORBA::Object::_send, PMCStubInfo::send, PMCIIOPStream::sendMessage, PMCIIOPstream::_write, writev (OS kernel).
 Server-side functions named: read (OS kernel), PMCIIOPstream::_read, PMCIIOPStream::receiveMessage, PMCBOAClient::ProcessMessasge, PMCBOAClient::request, PMCSkelInfo::execute, _sk_ttcp_sequence::_sendStructSequence, ttcp_sequence_i::sendStructSequence.
 Layer labels: CLIENT SEND; MARSHALLING; FRAMING, ERROR CHECKING, INTEROPERABILITY; OS LEVEL; DEMARSHALLING, DEMULTIPLEXING; INTERFACE METHOD IMPLEMENTATION.
 Percentage labels as extracted: 0.01%, 12%, 11%, 70-75%, 0.01%, 15-20%, 5%, 72%.]

Throughput Measurements using Dynamic Invocation and Dynamic Skeleton Interface

www.cs.wustl.edu/~schmidt/GLOBECOM-96.ps.gz


Orbix DII 'Blackbox' Performance

[Plot: throughput in Mbps versus sender buffer size in KBytes, with one curve each for short, long, double, char, octet, and struct.]


VisiBroker DII 'Blackbox' Performance

[Plot: throughput in Mbps versus sender buffer size in KBytes, with one curve each for short, long, double, char, octet, and struct.]


VisiBroker DII Client Datapath

[Figure: VisiBroker DII client datapath, ending in the OS kernel. Functions named: CORBA_Request::send_oneway, CORBA_Request::_invoke, CORBA_Request::_marshal_out, CORBA_Any::write_value, _write_value(NCostream&, TypeCode&), _write_value(NCostream&, TypeCode&, MarshalInBuf) ("Marshal individual fields of the struct", memcpy), CORBA_Object::_send, PMCStubInfo::send, PMCIIOPStream::sendMessage, PMCIIOPstream::_writev, CORBA::MarshalInBuffer::get (memcpy), writev.]

Orbix High-Cost Functions

Orbix sequences for 128K user buffer

Test      Time (msec)   %Exec   #Calls       Name
-----------------------------------------------------------------------
scalars        27,864   98.14          513   _libc_write
                  352    1.24        2,560   memcpy
struct         25,740   12.38   44,085,834   cleanfree
               18,893    9.09   46,190,107   _free_unlocked
               14,328    6.89   44,092,964   realfree
                9,710    4.67    4,195,328   count
                7,522    3.62   46,190,159   malloc
                6,461    3.11   46,190,158   new
                4,855    2.34    2,097,664   CORBA::typeCode::putValue
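
One way to read the Orbix struct rows is to compare the call counts with the amount of data sent. The Python sketch below is arithmetic on the reported numbers only, not a statement about Orbix internals. It assumes 64 MB of data, 128K sender buffers, and a 32-byte struct (the struct size quoted on the latency plot); the function names are illustrative.

```python
# Call counts copied from the Orbix DII struct rows.
CALLS = {
    "malloc": 46_190_159,
    "new": 46_190_158,
    "_free_unlocked": 46_190_107,
    "CORBA::typeCode::putValue": 2_097_664,
}

TOTAL_BYTES = 64 * 1024 * 1024   # 64 MB of typed data per test
BUFFER_BYTES = 128 * 1024        # 128K user buffer
STRUCT_BYTES = 32                # assumption: struct size from the latency plot


def requests_sent():
    """Number of 128K requests needed to move 64 MB."""
    return TOTAL_BYTES // BUFFER_BYTES


def structs_sent():
    """Number of 32-byte structs contained in 64 MB."""
    return TOTAL_BYTES // STRUCT_BYTES


def allocations_per_put_value():
    """Ratio of operator new calls to CORBA::typeCode::putValue calls."""
    return CALLS["new"] / CALLS["CORBA::typeCode::putValue"]
```

Under these assumptions, `structs_sent() + requests_sent()` equals the reported putValue count (2,097,664), which suggests one putValue call per struct plus one per request, and the ratio of `new` calls to putValue calls is roughly 22. That is consistent with dynamic memory management dominating the DII struct case.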


VisiBroker High-Cost Functions

VisiBroker sequences for 128K user buffer

Test           Time (msec)   %Exec   #Calls      Name
-----------------------------------------------------------------------------------
octets/chars         3,630   90.50         512   writev
                       353    8.81       8,195   memcpy
doubles              1,413   27.81   8,414,220   memcpy
                     1,229   24.21         512   writev
                     1,173   23.09   8,388,608   CORBA::MarshallInBuffer::operator>>(double&)
                       880   17.32   8,389,632   CORBA::MarshallInBuffer::get(char*)
                       352    6.93         512   CORBA::MarshallInBuffer::get(double*)


VisiBroker High-Cost Functions (cont'd)

Test      Time (msec)   %Exec   #Calls       Name
-----------------------------------------------------------------------------------
longs           2,346   29.41   16,777,216   CORBA::MarshallInBuffer::operator>>(long&)
                1,883   23.60   16,817,163   memcpy
                1,760   22.06   16,777,728   CORBA::MarshallInBuffer::get(char*)
                1,249   15.65          512   writev
                  704    8.82          512   CORBA::MarshallInBuffer::get(long*)
shorts          4,693   33.87   33,554,422   CORBA::MarshallInBuffer::operator>>(short&)
                3,520   25.40   33,554,944   CORBA::MarshallInBuffer::get(char*)
                2,824   20.38   33,627,659   memcpy
                1,408   10.16          512   CORBA::MarshallInBuffer::get(short*)
                1,367    9.86          512   writev


VisiBroker High-Cost Functions (cont'd)

Test      Time (msec)   %Exec   #Calls       Name
-----------------------------------------------------------------------------------
structs         7,279   25.96        9,216   writev
                5,200   20.00    2,796,032   various marshalling functions
                4,080   14.55   39,321,438   memcpy
                3,188   11.37   16,776,192   write_value(NCOstream&, CORBA::TypeCode*, CORBA::MarshallInBuffer)
                2,053    7.32   19,572,736   CORBA::MarshalInBuffer::get(char*)
                1,702    6.07    2,796,032   _write_struct_value(NCOstream, TypeCode)
                1,466    5.23   13,980,160   CORBA::TypeCode::member_type
                1,408    5.02   16,776,192   CORBA::TypeCode::member_count
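
The struct rows can be normalized by the number of `_write_struct_value` calls to see how much interpretive work accompanies each struct. The Python sketch below only divides the reported call counts. Reading the result as "per struct" assumes `_write_struct_value` runs once per struct marshalled, which the slides do not state; the helper name is hypothetical.

```python
# Call counts copied from the VisiBroker DII struct rows.
CALLS = {
    "_write_struct_value": 2_796_032,
    "write_value": 16_776_192,
    "CORBA::TypeCode::member_count": 16_776_192,
    "CORBA::TypeCode::member_type": 13_980_160,
    "CORBA::MarshalInBuffer::get(char*)": 19_572_736,
    "memcpy": 39_321_438,
}


def calls_per_struct_value(name):
    """Calls to `name` for every call to _write_struct_value."""
    return CALLS[name] / CALLS["_write_struct_value"]


def summary():
    """Per-_write_struct_value ratios for every profiled function."""
    return {name: calls_per_struct_value(name) for name in CALLS}
```

The ratios come out to exactly 5 `member_type` calls and exactly 6 `member_count` and `write_value` calls per `_write_struct_value` call, plus about 14 `memcpy` calls. This pattern is what one would expect from a TypeCode interpreter that walks the struct field by field.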

VisiBroker DSI 'Blackbox' Performance

[Plot: throughput in Mbps versus sender buffer size in KBytes, with one curve each for short, long, double, char, octet, and struct.]


Measuring Server-side Request Demultiplexing Overhead

* An interface with N methods is defined (N is large, say 100)

* Method names are chosen arbitrarily

* In each iteration, the client invokes the final method M times (M is large, say 100)

* The experiment is repeated 100, 500, and 1,000 times


Client-side Latency in seconds

Version       Iterations
              1        100      500      1,000
----------------------------------------------
Orbix         0.032    5.23     31.86    64.21
VisiBroker    0.010    4.53     24.44    50.06
----------------------------------------------


Demultiplexing in ORBs

[Figure: demultiplexing in ORBs, two panels.
 LAYERED DEMULTIPLEXING: 1: DEMUX THRU PROTOCOL STACK; 2: DEMUX TO I/O HANDLE; 3: DEMUX TO OBJECT (OBJECT ADAPTER, OBJECT 1 ... OBJECT N); 4: DEMUX TO SKELETON (IDL SKEL 1 ... IDL SKEL M); 5: DEMUX TO METHOD (METHOD 1 ... METHOD K).
 DE-LAYERED DEMULTIPLEXING: 1: DEMUX THRU PROTOCOL STACK; 2: DEMUX TO I/O HANDLE; 3: DEMUX TO METHOD, through a DE-LAYERED REQUEST DEMULTIPLEXER indexed by OBJECT 1::METHOD 1 ... OBJECT N::METHOD K.]

Demultiplexing and Dispatching Overhead in ORBs

* Both Orbix and VisiBroker pass the method name in the request

* Passing a string adds to the amount of information carried in the request header

* Orbix uses strcmp and linear search on the server side to demultiplex incoming requests

* For very large interfaces, this is undesirable
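
To make the cost concrete, here is a minimal Python sketch of string-based linear-search dispatching. It is an illustration of the technique described above, not Orbix code; the table layout, method names, and function names are hypothetical.

```python
def make_method_table(n):
    """Build a skeleton-style table of (operation name, handler) pairs.

    The names and handlers are hypothetical stand-ins for IDL operations.
    """
    return [(f"method_{i}", (lambda i=i: i)) for i in range(n)]


def linear_dispatch(table, operation):
    """Find a handler by comparing the request's operation string against
    each table entry in turn, counting the string comparisons performed."""
    comparisons = 0
    for name, handler in table:
        comparisons += 1
        if name == operation:          # plays the role of strcmp(...) == 0
            return handler(), comparisons
    raise LookupError(f"unknown operation: {operation}")
```

With a 100-method table, `linear_dispatch(make_method_table(100), "method_99")` needs 100 string comparisons to reach the final method, which is the worst case exercised by the demultiplexing experiment.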


Hand-crafted Optimizations

* For both ORBs, hash the method name to a numeric value and pass this information in the request header

* For Orbix, use numeric comparisons on the hashed value for method name lookup
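
The optimization above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' modification: the slides do not say which hash function was used, so CRC-32 is an assumption, and a dictionary lookup stands in for the numeric comparison. All function and field names are hypothetical.

```python
import zlib


def op_hash(name):
    """Map an operation name to a 32-bit integer (CRC-32 is an assumption)."""
    return zlib.crc32(name.encode("ascii"))


def build_dispatch_table(names):
    """Server side: precompute hash -> handler once, rejecting collisions."""
    table = {}
    for index, name in enumerate(names):
        key = op_hash(name)
        if key in table:
            raise ValueError(f"hash collision on {name!r}")
        table[key] = (lambda index=index: index)
    return table


def make_request_header(operation):
    """Client side: carry the numeric hash instead of the operation string."""
    return {"op_hash": op_hash(operation)}


def dispatch(table, header):
    """Server side: one integer-keyed lookup replaces the strcmp chain."""
    return table[header["op_hash"]]()
```

Because the hash is computed once on the client and the table once on the server, the per-request work no longer grows with the length of the method name or its position in the interface.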


Client-side Latency in seconds

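Version                 Iterations
                        1        100      500      1,000
--------------------------------------------------------
Orbix                   0.032    5.23     31.86    64.21
Optimized Orbix         0.028    4.21     29.97    60.67
VisiBroker              0.010    4.53     24.44    50.06
Optimized VisiBroker    0.010    3.86     24.04    49.36
--------------------------------------------------------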
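
The relative gain from the hand-crafted optimizations can be computed directly from the latency figures. The Python sketch below uses only the numbers in the table; the function names are illustrative.

```python
# iterations -> (original seconds, optimized seconds), from the latency table
LATENCY_SECONDS = {
    "Orbix": {1: (0.032, 0.028), 100: (5.23, 4.21),
              500: (31.86, 29.97), 1000: (64.21, 60.67)},
    "VisiBroker": {1: (0.010, 0.010), 100: (4.53, 3.86),
                   500: (24.44, 24.04), 1000: (50.06, 49.36)},
}


def improvement_percent(original, optimized):
    """Latency reduction as a percentage of the original latency."""
    return 100.0 * (original - optimized) / original


def summarize():
    """Percentage improvement per ORB and iteration count, to one decimal."""
    return {
        orb: {n: round(improvement_percent(*pair), 1) for n, pair in rows.items()}
        for orb, rows in LATENCY_SECONDS.items()
    }
```

At 1,000 iterations the improvement works out to about 5.5% for Orbix and about 1.4% for VisiBroker, so hashing the method name helps but does not remove most of the demultiplexing and dispatching overhead.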


Latency Measurements over ATM

* Latency measurements for twoway methods

* Methods are of two types:
  1. Parameterless methods
  2. Methods that send a sequence of octets and structs of primitive data types

* Sequence parameter takes a range of buffer sizes from 1 to 1,024 units of the specific data type

Latency for Parameterless Method Invocation

[Plot: latency in msec (axis from 0.0 to 1.5) for a parameterless method invocation, comparing CORBA (Orbix), CORBA (Visigenic), and TCP/IP (ACE).]


Latency for Methods Sending Sequences

[Plot: latency in msec (axis from 0.0 to 15.0) versus units sent (axis from 0 to 1,000; 1 struct unit = 32 bytes, 1 octet unit = 1 byte). Curves: CORBA (Orbix) structs, CORBA (Visigenic) structs, TCP/IP (ACE) structs, CORBA (Orbix) octets, CORBA (Visigenic) octets, TCP/IP (ACE) octets.]


Optimizations for High Performance ORBs

* Lack of integration with advanced OS and network features:
  - Provide hooks to utilize OS features such as real-time scheduling, multi-threading, and zero-copy buffer management

* Inefficient server demultiplexing techniques:
  - Use delayered demultiplexing and efficient packet filters

* Long chains of intra-ORB function calls:
  - Use Integrated Layer Processing and inlining
  - Use compiler techniques to automate this
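
The difference between layered and delayered demultiplexing can be sketched in a few lines of Python. This is a simplified model, assuming that object keys and operation names are plain strings; the class names are hypothetical and do not correspond to any ORB's API.

```python
class LayeredDemux:
    """Two lookups per request: object key -> skeleton, then operation -> method."""

    def __init__(self):
        self.skeletons = {}      # object key -> {operation name -> handler}

    def register(self, object_key, operation, handler):
        self.skeletons.setdefault(object_key, {})[operation] = handler

    def dispatch(self, object_key, operation):
        skeleton = self.skeletons[object_key]    # demux to object/skeleton
        return skeleton[operation]()             # demux to method


class DelayeredDemux:
    """One lookup per request on the combined (object key, operation) pair."""

    def __init__(self):
        self.table = {}          # (object key, operation name) -> handler

    def register(self, object_key, operation, handler):
        self.table[(object_key, operation)] = handler

    def dispatch(self, object_key, operation):
        return self.table[(object_key, operation)]()
```

Both classes deliver a request to the same handler; the delayered variant does so with a single table lookup, which is the property the optimization aims for.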


Optimizations for High Performance ORBs

* Excessive presentation layer conversions and data copying:
  - Use efficient stub compilers
  - Achieve an optimal tradeoff between compiled versus interpreted stubs

* Non-optimized buffering algorithms used for network reads and writes:
  - Use optimal buffer sizes and flow control
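
The compiled versus interpreted tradeoff can be illustrated with Python's `struct` module. This is a sketch only: the field layout is a hypothetical stand-in for the test struct, and the encoding is plain big-endian packing without the alignment padding that a real CORBA marshaller would apply.

```python
import struct

# Hypothetical struct layout: (field name, struct-module format code).
BIN_STRUCT_FIELDS = [("s", "h"), ("l", "i"), ("d", "d"), ("c", "c"), ("o", "B")]


def marshal_interpreted(record, fields=BIN_STRUCT_FIELDS):
    """Interpreted style: walk the type description at run time and
    marshal one field per step, as a TypeCode-driven marshaller would."""
    chunks = []
    for name, code in fields:
        chunks.append(struct.pack(">" + code, record[name]))
    return b"".join(chunks)


# Compiled style: the layout is fixed ahead of time, so the whole
# struct is packed by a single precompiled call.
_COMPILED = struct.Struct(">" + "".join(code for _, code in BIN_STRUCT_FIELDS))


def marshal_compiled(record):
    """Pack every field in one call using the precompiled layout."""
    return _COMPILED.pack(*(record[name] for name, _ in BIN_STRUCT_FIELDS))
```

Both functions produce the same bytes for a record such as `{"s": 1, "l": 2, "d": 3.0, "c": b"x", "o": 4}`. The interpreted version makes one call per field and is generic over layouts, while the compiled version trades that generality for a single call per struct.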

Gigabit CORBA Optimizations

[Figure: the CORBA architecture (CLIENT, OBJECT IMPLEMENTATION, DII, IDL STUBS, ORB INTERFACE, IDL SKELETON, DSI, OBJECT ADAPTER, OBJECT REQUEST BROKER) layered over an OS I/O SUBSYSTEM and NETWORK ADAPTER on each side of the NETWORK, annotated with the optimization areas: REQUEST DEMULTIPLEXING OPTIMIZATIONS, COMMUNICATION PROTOCOL OPTIMIZATIONS, PRESENTATION LAYER OPTIMIZATIONS, DATA COPYING OPTIMIZATIONS, I/O SUBSYSTEM OPTIMIZATIONS, NETWORK ADAPTER OPTIMIZATIONS.]


Real-time CORBA

[Figure: Real-time CORBA endsystem. Labels: APPLICATION SPECIFIC CODE; OBJECT ADAPTER with a DE-LAYERED REQUEST DEMULTIPLEXER (OBJECT 1::METHOD 1 ... OBJECT N::METHOD K); ZERO COPY BUFFERS; GIOP/IIOP END-TO-END TRANSPORT; GIGABIT I/O SUBSYSTEM.]


Integrated View of CORBA Optimizations

[Figure: integrated view of CORBA optimizations. Labels: APPLICATION SPECIFIC CODE & CORBA SERVICES; PRESENTATION LAYER; QOS SPECIFICATIONS; REAL-TIME OBJECT ADAPTER with a DE-LAYERED REQUEST DEMULTIPLEXER (OBJECT 1::METHOD 1 ... OBJECT N::METHOD K); REAL-TIME REQUEST SCHEDULING QUEUES; REAL-TIME THREADS AND UPCALLS; ZERO COPY BUFFERS; REAL-TIME GIOP/IIOP PROTOCOLS (TCP/IP, ATM, RTP); GIGABIT I/O SUBSYSTEM; ATM PORT INTERFACE CONTROLLER (APIC) with VC1 ... VC5; ATM NETWORK.]
