John Leis
http://www.usq.edu.au/users/leis/
October10,2002
FacultyofEngineering&SurveyingTechnicalReports
ISSN 1446-1846
ReportTR-2002-01
ISBN 1877078018
FacultyofEngineering&Surveying
UniversityofSouthernQueensland
ToowoombaQld4350Australia
http://www.usq.edu.au/
Purpose
The Faculty of Engineering and Surveying Technical Reports serve as amechanism for disseminating results from
certainofits research anddevelopment activities. The scope of reports inthis seriesincludes (butisnot restricted
to): literature reviews, designs, analyses,scientic andtechnicalndings,commissionedresearch,and descriptions
ofsoftwareandhardwareproducedbystaoftheFaculty ofEngineering andSurveying.
LimitationsofUse
TheCounciloftheUniversityofSouthernQueensland,itsFacultyofEngineeringandsurveying,andthestaofthe
UniversityofSouthernQueensland: (1)donotmakeanywarrantyorrepresentation, expressorimplied,withrespect
totheaccuracy,completeness,orusefulnessoftheinformationcontainedinthesereports; or(2)donotassumeany
Abstract
Client-serverprotocolssuchasHTTPoveraTCPtransportlayermaybeimplementedinarelatively
straightforwardmanner,asisoftengivenin\textbook"examples. Manyoftheseexamplesdonotscale
wellto a full implementationacross anything buta localhost connection withminimal datapayload.
Furthermore,suchimplementationsareinvariablylessthanoptimalintermsofperformance,robustness
todieringremoteclientcongurations,processortimeutilization,andbandwidthusage.Itissuggested
that the examplesoften given intextsare overlysimplied,to thepoint of beingmisleading. Inthis
note,weexaminesomeoftheseissues,resultingfromtheimplementationofHTTPclientsandservers
using Unix (Berkeley)sockets, Windows NT (Winsock), and Java. Issues addressed include eÆcient
send/receive buering to minimizing kernel and user data copying, and robustness to variability in
recorddelimiters.
1 Introduction
Client-serverdatatransmissionoverIPnormallyusestheservicesprovidedbytheTCPlayer. Thestandard
APIforaccessingTCPservicesistheso-called\socket"library,whichisuniversallyavailableonUnix-based
Operating Systems(whereit istermed\BerkeleySockets")and Windows(termed \Winsock"). The type
ofclient-serversoftwarewhichoriginallymotivatedthedevelopmentofthisreportistypiedbytheHTTP
protocol,which isasimplesynchronousrequest-responseprotocol.
Reliableoperationacrossvaryingunderlyingnetworksisagoalwhichhardlyneedstobestated. Thisideal
mustextendto varyingnetwork conditions,such ascongestion anddelay(whichwill notbeknownin
ad-vance),andideallywouldextendtovarioussystemplatforms(whichwillpresumablybeknowninadvance).
Becauseoftheinteractionbetweenthevariousprotocolstacklayers,networkpeersandintermediaterouters,
fundamental performance problemsare often diÆcultto determine,and may evengo unnoticed. The
re-mainder of this report discusses somecommon problems in network coding, and suggestssome workable
solutions. Codefragmentsarepresentedtoillustrate thesolutionmethods. Thelistingsof codefragments
inthisreportomitscheckingcodeforsystemcallreturnvaluesintheinterestsofclarityandfocusingonthe
solutionoutline. Systemcall returnvaluesshould alwaysbecheckedfor returnvalues;this isparticularly
soin networkprogramming,wheremyriaderrorsonthelocalclient,theremotesystem,orthenetworkin
betweencouldpotentiallyoccur.
2 Generic Design Goals
The\RobustnessPrinciple"asenunciatedbyBraden[1] isasfollows:
Ateverylayeroftheprotocols,there isageneralrulewhose applicationcanleadto enormous
benetsinrobustnessandinteroperability:
\Beliberalin whatyouaccept,andconservativein whatyousend"
Softwareshouldbewrittentodealwitheveryconceivableerror,nomatterhowunlikely;sooner
or later a packet will come in with that particular combination of errors and attributes, and
unlessthesoftwareisprepared,chaoscanensue.
Goodsoftwaredesignprinciplesdictatethatweshouldstriveforasolutionwhichhasthefollowingattributes
(see[2]):
Portable MustbeabletorunonWindows,Unixorotherplatforms,andideallybeeasilyportedtoother
Robust Mustberobusttodieringnetworkpeers. Theclientandserverlibrariesmustcorrectlyparseall
headers inallcases,andmustworkinallpossibleconnectionscenarios.
Reliable Thesystemmustusetimeoutsasappropriate,andgracefullyhandlefailures.
EÆcient The system must minimize the number of system calls, theamount of copying to/from kernel
memory,andthenumberofTCPprotocol writesgenerated.
Correct Thesystemshouldnothaveanyside-eects,suchasstraycharactersleftintheTCPinputbuer.
3 Fundamental Design Issues
Figure1illustrates, insimpleterms,theoveralldataowsin apeer-to-peertransactionintermsofbuers
present in the senderor receiverprotocol stack, and in transit within thenetwork itself. SinceTCP is a
stream-basedprotocol,anyrecordblocking|suchastheheaderanddataillustratedinthegure|must
beperformedby the application layer. The data itself may then be split into fragmentsfor transmission
acrossthenetwork. This isgovernedbyvarious factorssuch astheMaximumTransmission Unit (MTU)
oftheunderlying network(s),theMaximumReceiveUnit(MRU) ofthereceiver,and theadvertised TCP
receivewindow at thetime. The receiverprocess must then readthe data viareadorrecv. This is not
synchronous with the sending process, and must not be assumed to occur atomically. In addition, the
receivermust becognisantofdelayscaused bythenetworkormultitaskingat either end. Itmust assume
thatthetransmissionmaybreak atanystage,afteranyamountofdatahasbeenreceived. TheTCPlayer
willdoitsbest toguaranteedelivery,but\best-eortdelivery"doesnotimply\perfectdelivery"[3].
hdr data
receivedfromsocketlayer fragmentsintransit
senttosocketlayer
Figure1: Flowofdatafromthesendingapplication,throughthenetwork,andbacktothepeerapplication.
Thedatamaybeblockedintoarbitrarily-sizeddatasegmentsatanystage;thediagramdoesnotintendto
implythatanyparticularblockingwilloccurwithanypredictability.
4 The Send Mechanism
Anaveapproachwouldbetogeneratethedataandbuildthecorrespondingheader,andqueueeachofthese
componentsseparatelytotheTCPlayerviasendorwrite. ThismasksapotentialunderlyingineÆciency:
if the header is queued for transmission separately to the data portion via separate calls to send, the
underlying transportmechanismmayin fact send theheaderin aseparatepacket before thedata issent.
This resultsin anadditionalpacket beingintroduced into thenetwork,withcorresponding overhead. For
example,aHTTPresponse headercouldbeaslittleas50bytes orless. ConsideringthattheTCPandIP
headerstypicallyaccountfor20byteseach,theeÆciency(datapayloadoutofentiretransmission)israther
poor. Theproblemis exacerbatedbytheNaglealgorithm,which delayssendingsmall packets in orderto
theeÆciency problem | by givingapparentlyfaster write operations| byagreater amountofnetwork
traÆc. Inalightly-loadednetworkthismaynotbeaproblem,butifscalabilityisdesired(asitshouldbe),
thenthisobvioussolutionturnsoutto beapoorone.
Thus,itisdesirablethat thedatatobesent(headerpluscontent)iscoalescedintoone(or veryfew)send
operations. Astraightforwardsolutionwouldbeto haveoneheader buerwhere theheaderis composed,
asecond buerwhere the data portion is composed,and an application-layersend buer, to which both
header anddataarecopied. Thisbueristhen passedto thesocketslayer,whereit islikelycopiedagain
from user space into kernel space. At best, this results in copyingthe entire contents of the datapacket
at least once in theapplication. This copyoperation may be dispensed with, by writingthe header and
datasequentiallyintothesendbuer. Thisrequiresmaintainingapointertothenextbyteavailablein the
send buer. One potentialargumentagainstthis is that sincethelength ofthe send datais written into
theHTTPheaderbymeansof theContent-Lengthelement,it isimpossibletoknowwheretheendof the
header will be until thepacket data hasbeen assembled. However,if thecontentsis (forexample) ale,
thenthelesizemaybeeasilydeterminedusingthestatlibraryfunction.
If it is impractical to read an entire le into amemory buer at once, a sensible compromisehas to be
reached. Forexample, a 1M le, even ifread into amalloc'dbuer, would notbesentas asingle TCP
segment. Onthe other hand, disk reads are often arrangedto be sector-sized chunks (for example,1024
bytes) for eÆciency reasons. A data chunk of this size may be passedto the socket-layersend function
withoutlossofeÆciency. Listing1illustrates this.
Listing1: CallingtheWriteMessageBlockfunction.
stat(imgFileName,&statbuf );
lesize =statbuf.stsize;
nleft = lesize ;
strcpy(sendbuf, "HTTP200OKnrnn");
strcat(sendbuf, "Content type:text/html nrnn");
sprintf(tmpbuf ,"Content length:% dnrnn",lesize);
strcat(sendbuf, tmpbuf );
strcat(sendbuf, "nrnn"); //blank line terminator
nhdr=strlen(sendbuf);
imgFile=fopen( imgFileName,"rb");
nread=fread(&sendbuf[nhdr],1,MAXSENDBYTES nhdr,imgFile);
if( nread>0)
f
nleft =nread;
nwrite=nread+nhdr;
WriteMessageBlock( ConnectionSocket,sendbuf ,nwrite,WRITETIMEOUT);
g
while(nleft >0)
f
nread=fread(sendbuf,1, MAXSENDBYTES ,imgFile);
if( nread>0)
f
nleft =nread;
nwrite=nread;
WriteMessageBlock(Socket,sendbuf ,nwrite,WRITETIMEOUT);
g
else
return ;
g
fclose(imgFile);
ThefunctionWriteMessageBlockencapsulatesthefunctionsof:
1. Makingthesocketnon-blocking(fcntlorioctlsocket).
2. Waiting forthesocketto bereadyforwriting(select).
3. Actuallywritingthedatato thesendqueue(send).
An outline of the code for WriteMessageBlockfor the Berkeley sockets API is shown in Listing 2. For
Winsock, thefcntlcall ischangedtoioctlsocket, andthe processingpertaining to interruptedsystem
calls(EINTR) isremoved. Furtherdetailsonthelattermaybefoundin [3].
Note that the send operationmay block due to various reasons, but typically blocking occurs due to full
send buers. In that case,the processshould block onsend. Since theapplication should fail gracefully,
thesendisprecededbyacalltoselecttosetatimeout|selecteitherreturnswhenthesocketisready
forwriting,orthetimeoutperiodexpires. Thereadandwriteselectorspassedtoselectindicatereadiness
forreadingorwriting,withselectreturningthenumberofsocketswhich satisfythecondition.
Itshouldalsobenotedthatasuccessfulsendcalldoesnotmeanthatthedatawassuccessfullydeliveredto
theremotepeer,oreventhatthedatawassentat all;itsimplymeansthat itwasqueuedfortransmission
bytheTCP/IPstack.
Listing2: TheWriteMessageBlockfunction.
fcntl(fd, F SETFL,ONONBLOCK);
writeTimeout.tvsec=timeout;
writeTimeout.tvusec=0;
FD ZERO(&replyFDSet );
FD SET( fd,&replyFDSet );
//write buer, onesend() at atime
nleft =buen;
pbuf =buf ;
while(nleft >0)
f
rc= select(FDSETSIZE,(fdset)NULL ,&replyFDSet,(fdset)NULL,&writeTimeout);
if( rc<= 0)
return 1;
//write as manybytes aspossible, upto theremainingbuer size
nwritten=send( fd, pbuf, nleft ,0);
if( nwritten <0) //send() error
f
if( errno==EINTR)//interruptedsystemcall
continue;
return 1;
g
if( nwritten== 0)
continue ;
nleft =nwritten;
pbuf+= nwritten;
g
5 The Receive Mechanism
Aswiththesendprocess,thereceiveprocessshouldnotassumethatdatawillbeavailableinanyparticular
blocksize,orthatitwillarriveatall. Theonlyguaranteeisthatifdataisavailable,itwillbeerror-checked
(to theextent possible bytransport and link layers), and will bein thecorrect byte-order. The function
ReadMessageBlockshowninListing3encapsulatesthefunctionsof:
1. Makingthesocketnon-blocking(fcntlorioctlsocket).
2. Waiting forthesocketto bereadyforreading(select).
3. Actuallyreadingthedatafromthereceivequeue,intotheapplicationbuer(recv).
Listing3: TheReadMessageBlockfunction.
rc= fcntl(fd, FSETFL, ONONBLOCK);
requestTimeout.tvsec=timeout;
requestTimeout.tvusec=0;
FD ZERO(&requestFDSet);
FD SET( fd,&requestFDSet );
nleft =nbytes;
pbuf =buf ;
while(nleft >0)
f
rc= select(FDSETSIZE,&requestFDSet,(fdset )NULL,(fdset )NULL,&requestTimeout);
if( rc<= 0)
return 1;
//read as manybytesasare available, uptothe remainingbuer size
nread=recv(fd,pbuf, nleft ,0);
if( nread<0) //recv() error
f
//checkfor interrupted system call
if( errno==EINTR)
continue;
return 1;
g
if( nread==0) //read() at eof
f
returnnbytes nleft; // returnactualnumberofbytes inbuer
g
nleft =nread;
pbuf+= nread;
g
returnnbytes;
peer onlysends ashort message, orindeed thelast remaining bytes (less thanonebuer-full) of alarger
message. Inthiscase,theTCPlayersendsaFINpacket,andaftertheappropriatehandshaking,thesocket
layerrecvfunction returnszero. Notethat areturnof zerodoesnot imply that bytes delayedin transit
couldstillberead|theselectcall blocksuntileitherdataisavailable,atimeoutoccurs,orthepeerhas
closedtheconnection. Checkingthereturncodes fromselectandrecvensuresthatthese conditionsare
properlydealtwith. FromtheWinsockmanpage:
Forconnection-orientedsockets,readabilitycan alsoindicate that aclose requesthas been
received from the peer. If the virtual circuit was closed gracefully, then a recv will return
immediately with zero bytes read. If the virtual circuit was reset, then a recv will complete
immediatelywith anerrorcode,suchasWSAECONNRESET.
Themultiple-readapproachisdocumentedinvarioussources(forexample,[3,5,6]).
Thenextproblemisthecorrectdetermination oftheextentoftheheader,inprotocols(suchasHTTP[7])
where the header may be of variable length. The HTTP header is delineated by a \line termination"
code. Some operating systems (such as Macintosh) use a carriage return (CR, hex 0D) to signify this;
somesystems(Unix)usealinefeed(LF,hex0A),andsome(Windows) useCR-LF.Althoughthestandard
speciesCR-LFforend ofline, andapairofCR-LF characters(thatis, CR-LF-CR-LF)toterminate the
header,good practiceindicates that weshould accepteither CR-CR,LF-LF, orCR-LF-CR-LFasavalid
headertermination.
This problem, if notdealt with correctly, may manifest itself in various wayswhich are diÆcult to trace.
Supposedly ready-to-use libraries may not handle this correctly. For example, in the Javaenvironment,
thereadLine methodsdo notcorrectlyterminate ifthe client givesanend-of-lineterminator dierentto
linefeed,orwhatitwasexpecting(seeforexample[8],pages107-109). Themethodsoutlinedherearerobust
toanyterminatorsequence: CR,LForCR-LF.
Failuretocorrectlydealwithheaderterminationmayleadto:
1. A client or serverwhich waits for a header-terminating sequence which it never receives(sent CR,
expectingLForvice-versa);or
2. A clientorserverprematurelyterminatingtheheader.
SimplycheckingfortwoCR'sisnotenough: therstcasewillbemissed,andthesecondcasewillresultin
oneLFbyte remaininginthebuer.
Onepossible,thoughsuboptimal,approachistousetheagsargumenttotherecvcall. Theagargument
maybesetto\peek"atthemessage:
nread = recv(socket, pbuffer, 1, MSG_PEEK);
It is temptingto use the\peek method" likethis, to check forheader termination. However,this is very
ineÆcient,since each systemcallinvolvesconsiderable overhead in termsofkernelcontextswitching, and
thecopyingtouserspaceofoneortwobytes. Notonlyiseachbytecopied,itiscopiedacrosssystemlevel
boundaries(usertokernelspace). Eectively,itmeansthateachbyteoftheheaderisreadseparately,and
thustheperformancepenaltiesinusingsuchanapproachcouldbesevere.
Thefunction ReadMessageHeadershown in Listing4accomplishes aneÆcientread,copyand checkfor a
terminatingsequence. Multiple TCPread operationsareusedto accumulate thereceiveblock. Eachread
operation iscalled using thebytes remainingin the buer,so asnot to overowthe buer. The scanfor
Listing4: TheReadMessageHeaderfunction.
ndata=0;//returnedvalue
fcntl(fd, F SETFL,ONONBLOCK);
requestTimeout.tvsec=timeout;
requestTimeout.tvusec=0;
FD ZERO(&requestFDSet);
FD SET( fd,&requestFDSet );
//accumulatebuer, onerecv() at a time
nleft =buen;
pbuf =buf ;
nbuf=0;
nCR=0;
nLF =0;
while(nleft >0)
f
rc= select(FDSETSIZE,&requestFDSet,(fdset )NULL,(fdset )NULL,&requestTimeout);
if( rc<= 0)
return 1;
//read as manybytesasare available, uptothe remainingbuer size
nread=recv(fd,pbuf, nleft ,0);
if( nread<0) //read() error
f
//checkfor interrupted system call
if( errno==EINTR)
continue;
return 1;
g
if( nread==0) //streamclosed bypeer
return 1;
n=nbuf; // start scanningat last read position
nbuf+=nread; //endof buer so far
while(n<nbuf)
f
if( buf[n]==LF)
f
if(++nLF ==2)
f
// replace secondLFwithn0( LF LF orCR LFCR LF)
buf[n ]= 'n0';
nhdr=n+1;
ndata=nbuf nhdr; // additional, non headerbytesreadin
returnnhdr; //returnbytesin header,including last null
g
if( buf[n]==CR)
f
if(++nCR== 2)
f
if( nLF==0)
f
//replace secondCRwithn0(CR CR)
buf[n]= 'n0';
nhdr=n+1;
ndata=nbuf nhdr;//additional,non headerbytesreadin
returnnhdr; //returnbytesin header,including last null
g
g
g
if((buf[n]!= CR)&& ( buf[n]!=LF ))
f
nLF=0;
nCR=0;
g
n++;
g
nleft =nread;
pbuf+= nread;
g
// buer full butnoterminatorsequencefound
return 1;
Note that in readingaxed-sizeblockfor eÆciency, databeyond theend ofthe headermaybestored in
thesuppliedbuer. Thisstateinformationisreturnedas:
1. Thenumberof bytes intheheader(returnedvaluenhdr).
2. Thelengthofanyadditionaldatabeyondthis (variable *ndata).
Thus, aprocessing sequencesuchasshown in Listing5is requiredafterthe callto read theheader. This
holdingofstateinformationmeansthatpointers mustbepassedinCsothat areturnvalueissaved. This
is ideallysuitedto anobject-orientedimplementation, where theobjectcanmaintain thecurrentstateof
thereadoperationwithinitself, hiddenfromthecallerapplication.
Listing5: CallingtheReadMessageHeaderfunction.
if( nhdr>0)
f
//checkheader
...
if( ndata>0)
f
//point to ndatabytesof data
pdata=&buf [nhdr];
...
g
g
6 Further Issues
Besidesthefundamental operationsofsendingand receivingformatted dataonaTCPvirtualcircuit,the
implementationofclientandserversoftwarerequiresattentiontoanumberofotherissues. Innoparticular
order,theseinclude:
Handlingsimultaneousconnections usingmultiplesocketsandmultiple threadsorprocesses.
Securityaspects includingreverseaddressresolutionforservers,transactionanderrorlogginginafailsafe
manner,serverlesystemprotection,client/serverauthentication,andtransactionencryption.
Server load balancing usingpre-forkedprocessesorpre-spawnedthreadsontheserver,andDNSrotary
IPselectionbytheclient.
7 Conclusion
Thisreporthashighlightedsomefundamental problemsinusingapplication-layerprotocolsoverTCP,and
presentedsomesolutionswhichhavebeenfoundtoworkwellinpractice. TheseincludeeÆcientscheduling
of thesend operation in the presenceof Nagle's algorithm,and multiple non-overlapping read operations
which,ifrequired,caneÆcientlydelineatetheprotocolheaderfromsubsequentbinaryornon-binarydata.
All of the above methods havebeen implemented in C on Unix and Windows NT platforms, and using
JavaontheJavaVirtualMachine.ItisnotedthattheJavaimplementationprovidesaparticularlyelegant
waytomaintainthecurrentstateofthetransactionwithinaninstanceofanobject(aMessageReaderand
aMessageWriterclass),and that aC++implementation oftheC code currentlyused should beableto
providesuchadvantagesin tandemwiththeeÆciencyderivedfromcompiled C++code.
References
[1] R. Braden, Requirements for Internet Hosts { Application and Support, Internet Engineering Task
Force,Oct 1989, ftp://ftp.isi.edu/in-notes/rfc1123.txtcurrentSept2002.
[2] Allan Vermeulen, Scott W. Ambler, Greg Bumgardner, Eldon Metz, TrevorMisfeldt Jim Schur, and
PatrickThompson, The Elementsof JavaStyle, CambridgeUniversityPress,2000.
[3] JonC.Snader,EectiveTCP/IPProgramming-44TipstoImproveYourNetworkPrograms,
Addison-Wesley,2000.
[4] JohnNagle, CongestionControl inIP/TCPInternetworks, InternetEngineeringTaskForce,Jan1984,
ftp://ftp.isi.edu/in-notes/rfc896.txtcurrentSept2002.
[5] Douglas E. Comer and Thomas Narten, \TCP/IP", in Unix Networking, Stephen G. Kochan and
PatrickW. Woods,Eds.,chapter3.HaydenBooks,1989.
[6] W. Richard Stevens, TCP/IP Illustrated- Volume3: TCP for Transactions, HTTP, NNTP and the
Unix DomainProtocols, Addison-Wesley,1996.
[7] T.Berners-Lee,R. Fielding,andH. Frystyk, HypertextTransfer Protocol {HTTP/1.0, Internet
Engi-neeringTaskForce,May1996, ftp://ftp.isi.edu/in-notes/rfc1945.txtcurrentApril2002.
[8] Elliotte RustyHarold, Java NetworkProgramming, O'Reilly,2000.
[9] R. Fielding, J. Gettys, H. Frystyk, L. Masinter, P. Leach, and T. Berners-Lee, Hypertext Transfer
Protocol{HTTP/1.1, InternetEngineeringTask Force,Jun1999,