• No results found

Buffering issues in TCP-based services

N/A
N/A
Protected

Academic year: 2019

Share "Buffering issues in TCP-based services"

Copied!
10
0
0

Loading.... (view fulltext now)

Full text

(1)

John Leis

[email protected]

http://www.usq.edu.au/users/leis/

October10,2002

FacultyofEngineering&SurveyingTechnicalReports

ISSN 1446-1846

ReportTR-2002-01

ISBN 1877078018

FacultyofEngineering&Surveying

UniversityofSouthernQueensland

ToowoombaQld4350Australia

http://www.usq.edu.au/

Purpose

The Faculty of Engineering and Surveying Technical Reports serve as amechanism for disseminating results from

certainofits research anddevelopment activities. The scope of reports inthis seriesincludes (butisnot restricted

to): literature reviews, designs, analyses,scientic andtechnicalndings,commissionedresearch,and descriptions

ofsoftwareandhardwareproducedbystaoftheFaculty ofEngineering andSurveying.

LimitationsofUse

TheCounciloftheUniversityofSouthernQueensland,itsFacultyofEngineeringandsurveying,andthestaofthe

UniversityofSouthernQueensland: (1)donotmakeanywarrantyorrepresentation, expressorimplied,withrespect

totheaccuracy,completeness,orusefulnessoftheinformationcontainedinthesereports; or(2)donotassumeany

(2)

Abstract

Client-serverprotocolssuchasHTTPoveraTCPtransportlayermaybeimplementedinarelatively

straightforwardmanner,asisoftengivenin\textbook"examples. Manyoftheseexamplesdonotscale

wellto a full implementationacross anything buta localhost connection withminimal datapayload.

Furthermore,suchimplementationsareinvariablylessthanoptimalintermsofperformance,robustness

todieringremoteclientcongurations,processortimeutilization,andbandwidthusage.Itissuggested

that the examplesoften given intextsare overlysimplied,to thepoint of beingmisleading. Inthis

note,weexaminesomeoftheseissues,resultingfromtheimplementationofHTTPclientsandservers

using Unix (Berkeley)sockets, Windows NT (Winsock), and Java. Issues addressed include eÆcient

send/receive buering to minimizing kernel and user data copying, and robustness to variability in

recorddelimiters.

1 Introduction

Client-serverdatatransmissionoverIPnormallyusestheservicesprovidedbytheTCPlayer. Thestandard

APIforaccessingTCPservicesistheso-called\socket"library,whichisuniversallyavailableonUnix-based

Operating Systems(whereit istermed\BerkeleySockets")and Windows(termed \Winsock"). The type

ofclient-serversoftwarewhichoriginallymotivatedthedevelopmentofthisreportistypiedbytheHTTP

protocol,which isasimplesynchronousrequest-responseprotocol.

Reliableoperationacrossvaryingunderlyingnetworksisagoalwhichhardlyneedstobestated. Thisideal

mustextendto varyingnetwork conditions,such ascongestion anddelay(whichwill notbeknownin

ad-vance),andideallywouldextendtovarioussystemplatforms(whichwillpresumablybeknowninadvance).

Becauseoftheinteractionbetweenthevariousprotocolstacklayers,networkpeersandintermediaterouters,

fundamental performance problemsare often diÆcultto determine,and may evengo unnoticed. The

re-mainder of this report discusses somecommon problems in network coding, and suggestssome workable

solutions. Codefragmentsarepresentedtoillustrate thesolutionmethods. Thelistingsof codefragments

inthisreportomitscheckingcodeforsystemcallreturnvaluesintheinterestsofclarityandfocusingonthe

solutionoutline. Systemcall returnvaluesshould alwaysbecheckedfor returnvalues;this isparticularly

soin networkprogramming,wheremyriaderrorsonthelocalclient,theremotesystem,orthenetworkin

betweencouldpotentiallyoccur.

2 Generic Design Goals

The\RobustnessPrinciple"asenunciatedbyBraden[1] isasfollows:

Ateverylayeroftheprotocols,there isageneralrulewhose applicationcanleadto enormous

benetsinrobustnessandinteroperability:

\Beliberalin whatyouaccept,andconservativein whatyousend"

Softwareshouldbewrittentodealwitheveryconceivableerror,nomatterhowunlikely;sooner

or later a packet will come in with that particular combination of errors and attributes, and

unlessthesoftwareisprepared,chaoscanensue.

Goodsoftwaredesignprinciplesdictatethatweshouldstriveforasolutionwhichhasthefollowingattributes

(see[2]):

Portable MustbeabletorunonWindows,Unixorotherplatforms,andideallybeeasilyportedtoother

(3)

Robust Mustberobusttodieringnetworkpeers. Theclientandserverlibrariesmustcorrectlyparseall

headers inallcases,andmustworkinallpossibleconnectionscenarios.

Reliable Thesystemmustusetimeoutsasappropriate,andgracefullyhandlefailures.

EÆcient The system must minimize the number of system calls, theamount of copying to/from kernel

memory,andthenumberofTCPprotocol writesgenerated.

Correct Thesystemshouldnothaveanyside-eects,suchasstraycharactersleftintheTCPinputbuer.

3 Fundamental Design Issues

Figure1illustrates, insimpleterms,theoveralldataowsin apeer-to-peertransactionintermsofbuers

present in the senderor receiverprotocol stack, and in transit within thenetwork itself. SinceTCP is a

stream-basedprotocol,anyrecordblocking|suchastheheaderanddataillustratedinthegure|must

beperformedby the application layer. The data itself may then be split into fragmentsfor transmission

acrossthenetwork. This isgovernedbyvarious factorssuch astheMaximumTransmission Unit (MTU)

oftheunderlying network(s),theMaximumReceiveUnit(MRU) ofthereceiver,and theadvertised TCP

receivewindow at thetime. The receiverprocess must then readthe data viareadorrecv. This is not

synchronous with the sending process, and must not be assumed to occur atomically. In addition, the

receivermust becognisantofdelayscaused bythenetworkormultitaskingat either end. Itmust assume

thatthetransmissionmaybreak atanystage,afteranyamountofdatahasbeenreceived. TheTCPlayer

willdoitsbest toguaranteedelivery,but\best-eortdelivery"doesnotimply\perfectdelivery"[3].

hdr data

receivedfromsocketlayer fragmentsintransit

senttosocketlayer

Figure1: Flowofdatafromthesendingapplication,throughthenetwork,andbacktothepeerapplication.

Thedatamaybeblockedintoarbitrarily-sizeddatasegmentsatanystage;thediagramdoesnotintendto

implythatanyparticularblockingwilloccurwithanypredictability.

4 The Send Mechanism

Anaveapproachwouldbetogeneratethedataandbuildthecorrespondingheader,andqueueeachofthese

componentsseparatelytotheTCPlayerviasendorwrite. ThismasksapotentialunderlyingineÆciency:

if the header is queued for transmission separately to the data portion via separate calls to send, the

underlying transportmechanismmayin fact send theheaderin aseparatepacket before thedata issent.

This resultsin anadditionalpacket beingintroduced into thenetwork,withcorresponding overhead. For

example,aHTTPresponse headercouldbeaslittleas50bytes orless. ConsideringthattheTCPandIP

headerstypicallyaccountfor20byteseach,theeÆciency(datapayloadoutofentiretransmission)israther

poor. Theproblemis exacerbatedbytheNaglealgorithm,which delayssendingsmall packets in orderto

(4)

theeÆciency problem | by givingapparentlyfaster write operations| byagreater amountofnetwork

traÆc. Inalightly-loadednetworkthismaynotbeaproblem,butifscalabilityisdesired(asitshouldbe),

thenthisobvioussolutionturnsoutto beapoorone.

Thus,itisdesirablethat thedatatobesent(headerpluscontent)iscoalescedintoone(or veryfew)send

operations. Astraightforwardsolutionwouldbeto haveoneheader buerwhere theheaderis composed,

asecond buerwhere the data portion is composed,and an application-layersend buer, to which both

header anddataarecopied. Thisbueristhen passedto thesocketslayer,whereit islikelycopiedagain

from user space into kernel space. At best, this results in copyingthe entire contents of the datapacket

at least once in theapplication. This copyoperation may be dispensed with, by writingthe header and

datasequentiallyintothesendbuer. Thisrequiresmaintainingapointertothenextbyteavailablein the

send buer. One potentialargumentagainstthis is that sincethelength ofthe send datais written into

theHTTPheaderbymeansof theContent-Lengthelement,it isimpossibletoknowwheretheendof the

header will be until thepacket data hasbeen assembled. However,if thecontentsis (forexample) ale,

thenthelesizemaybeeasilydeterminedusingthestatlibraryfunction.

If it is impractical to read an entire le into amemory buer at once, a sensible compromisehas to be

reached. Forexample, a 1M le, even ifread into amalloc'dbuer, would notbesentas asingle TCP

segment. Onthe other hand, disk reads are often arrangedto be sector-sized chunks (for example,1024

bytes) for eÆciency reasons. A data chunk of this size may be passedto the socket-layersend function

withoutlossofeÆciency. Listing1illustrates this.

Listing1: CallingtheWriteMessageBlockfunction.

stat(imgFileName,&statbuf );

lesize =statbuf.stsize;

nleft = lesize ;

strcpy(sendbuf, "HTTP200OKnrnn");

strcat(sendbuf, "Content type:text/html nrnn");

sprintf(tmpbuf ,"Content length:% dnrnn",lesize);

strcat(sendbuf, tmpbuf );

strcat(sendbuf, "nrnn"); //blank line terminator

nhdr=strlen(sendbuf);

imgFile=fopen( imgFileName,"rb");

nread=fread(&sendbuf[nhdr],1,MAXSENDBYTES nhdr,imgFile);

if( nread>0)

f

nleft =nread;

nwrite=nread+nhdr;

WriteMessageBlock( ConnectionSocket,sendbuf ,nwrite,WRITETIMEOUT);

g

while(nleft >0)

f

nread=fread(sendbuf,1, MAXSENDBYTES ,imgFile);

if( nread>0)

f

nleft =nread;

nwrite=nread;

WriteMessageBlock(Socket,sendbuf ,nwrite,WRITETIMEOUT);

g

else

return ;

g

fclose(imgFile);

(5)

ThefunctionWriteMessageBlockencapsulatesthefunctionsof:

1. Makingthesocketnon-blocking(fcntlorioctlsocket).

2. Waiting forthesocketto bereadyforwriting(select).

3. Actuallywritingthedatato thesendqueue(send).

An outline of the code for WriteMessageBlockfor the Berkeley sockets API is shown in Listing 2. For

Winsock, thefcntlcall ischangedtoioctlsocket, andthe processingpertaining to interruptedsystem

calls(EINTR) isremoved. Furtherdetailsonthelattermaybefoundin [3].

Note that the send operationmay block due to various reasons, but typically blocking occurs due to full

send buers. In that case,the processshould block onsend. Since theapplication should fail gracefully,

thesendisprecededbyacalltoselecttosetatimeout|selecteitherreturnswhenthesocketisready

forwriting,orthetimeoutperiodexpires. Thereadandwriteselectorspassedtoselectindicatereadiness

forreadingorwriting,withselectreturningthenumberofsocketswhich satisfythecondition.

Itshouldalsobenotedthatasuccessfulsendcalldoesnotmeanthatthedatawassuccessfullydeliveredto

theremotepeer,oreventhatthedatawassentat all;itsimplymeansthat itwasqueuedfortransmission

bytheTCP/IPstack.

Listing2: TheWriteMessageBlockfunction.

fcntl(fd, F SETFL,ONONBLOCK);

writeTimeout.tvsec=timeout;

writeTimeout.tvusec=0;

FD ZERO(&replyFDSet );

FD SET( fd,&replyFDSet );

//write buer, onesend() at atime

nleft =buen;

pbuf =buf ;

while(nleft >0)

f

rc= select(FDSETSIZE,(fdset)NULL ,&replyFDSet,(fdset)NULL,&writeTimeout);

if( rc<= 0)

return 1;

//write as manybytes aspossible, upto theremainingbuer size

nwritten=send( fd, pbuf, nleft ,0);

if( nwritten <0) //send() error

f

if( errno==EINTR)//interruptedsystemcall

continue;

return 1;

g

if( nwritten== 0)

continue ;

nleft =nwritten;

pbuf+= nwritten;

g

(6)

5 The Receive Mechanism

Aswiththesendprocess,thereceiveprocessshouldnotassumethatdatawillbeavailableinanyparticular

blocksize,orthatitwillarriveatall. Theonlyguaranteeisthatifdataisavailable,itwillbeerror-checked

(to theextent possible bytransport and link layers), and will bein thecorrect byte-order. The function

ReadMessageBlockshowninListing3encapsulatesthefunctionsof:

1. Makingthesocketnon-blocking(fcntlorioctlsocket).

2. Waiting forthesocketto bereadyforreading(select).

3. Actuallyreadingthedatafromthereceivequeue,intotheapplicationbuer(recv).

Listing3: TheReadMessageBlockfunction.

rc= fcntl(fd, FSETFL, ONONBLOCK);

requestTimeout.tvsec=timeout;

requestTimeout.tvusec=0;

FD ZERO(&requestFDSet);

FD SET( fd,&requestFDSet );

nleft =nbytes;

pbuf =buf ;

while(nleft >0)

f

rc= select(FDSETSIZE,&requestFDSet,(fdset )NULL,(fdset )NULL,&requestTimeout);

if( rc<= 0)

return 1;

//read as manybytesasare available, uptothe remainingbuer size

nread=recv(fd,pbuf, nleft ,0);

if( nread<0) //recv() error

f

//checkfor interrupted system call

if( errno==EINTR)

continue;

return 1;

g

if( nread==0) //read() at eof

f

returnnbytes nleft; // returnactualnumberofbytes inbuer

g

nleft =nread;

pbuf+= nread;

g

returnnbytes;

(7)

peer onlysends ashort message, orindeed thelast remaining bytes (less thanonebuer-full) of alarger

message. Inthiscase,theTCPlayersendsaFINpacket,andaftertheappropriatehandshaking,thesocket

layerrecvfunction returnszero. Notethat areturnof zerodoesnot imply that bytes delayedin transit

couldstillberead|theselectcall blocksuntileitherdataisavailable,atimeoutoccurs,orthepeerhas

closedtheconnection. Checkingthereturncodes fromselectandrecvensuresthatthese conditionsare

properlydealtwith. FromtheWinsockmanpage:

Forconnection-orientedsockets,readabilitycan alsoindicate that aclose requesthas been

received from the peer. If the virtual circuit was closed gracefully, then a recv will return

immediately with zero bytes read. If the virtual circuit was reset, then a recv will complete

immediatelywith anerrorcode,suchasWSAECONNRESET.

Themultiple-readapproachisdocumentedinvarioussources(forexample,[3,5,6]).

Thenextproblemisthecorrectdetermination oftheextentoftheheader,inprotocols(suchasHTTP[7])

where the header may be of variable length. The HTTP header is delineated by a \line termination"

code. Some operating systems (such as Macintosh) use a carriage return (CR, hex 0D) to signify this;

somesystems(Unix)usealinefeed(LF,hex0A),andsome(Windows) useCR-LF.Althoughthestandard

speciesCR-LFforend ofline, andapairofCR-LF characters(thatis, CR-LF-CR-LF)toterminate the

header,good practiceindicates that weshould accepteither CR-CR,LF-LF, orCR-LF-CR-LFasavalid

headertermination.

This problem, if notdealt with correctly, may manifest itself in various wayswhich are diÆcult to trace.

Supposedly ready-to-use libraries may not handle this correctly. For example, in the Javaenvironment,

thereadLine methodsdo notcorrectlyterminate ifthe client givesanend-of-lineterminator dierentto

linefeed,orwhatitwasexpecting(seeforexample[8],pages107-109). Themethodsoutlinedherearerobust

toanyterminatorsequence: CR,LForCR-LF.

Failuretocorrectlydealwithheaderterminationmayleadto:

1. A client or serverwhich waits for a header-terminating sequence which it never receives(sent CR,

expectingLForvice-versa);or

2. A clientorserverprematurelyterminatingtheheader.

SimplycheckingfortwoCR'sisnotenough: therstcasewillbemissed,andthesecondcasewillresultin

oneLFbyte remaininginthebuer.

Onepossible,thoughsuboptimal,approachistousetheagsargumenttotherecvcall. Theagargument

maybesetto\peek"atthemessage:

nread = recv(socket, pbuffer, 1, MSG_PEEK);

It is temptingto use the\peek method" likethis, to check forheader termination. However,this is very

ineÆcient,since each systemcallinvolvesconsiderable overhead in termsofkernelcontextswitching, and

thecopyingtouserspaceofoneortwobytes. Notonlyiseachbytecopied,itiscopiedacrosssystemlevel

boundaries(usertokernelspace). Eectively,itmeansthateachbyteoftheheaderisreadseparately,and

thustheperformancepenaltiesinusingsuchanapproachcouldbesevere.

Thefunction ReadMessageHeadershown in Listing4accomplishes aneÆcientread,copyand checkfor a

terminatingsequence. Multiple TCPread operationsareusedto accumulate thereceiveblock. Eachread

operation iscalled using thebytes remainingin the buer,so asnot to overowthe buer. The scanfor

(8)

Listing4: TheReadMessageHeaderfunction.

ndata=0;//returnedvalue

fcntl(fd, F SETFL,ONONBLOCK);

requestTimeout.tvsec=timeout;

requestTimeout.tvusec=0;

FD ZERO(&requestFDSet);

FD SET( fd,&requestFDSet );

//accumulatebuer, onerecv() at a time

nleft =buen;

pbuf =buf ;

nbuf=0;

nCR=0;

nLF =0;

while(nleft >0)

f

rc= select(FDSETSIZE,&requestFDSet,(fdset )NULL,(fdset )NULL,&requestTimeout);

if( rc<= 0)

return 1;

//read as manybytesasare available, uptothe remainingbuer size

nread=recv(fd,pbuf, nleft ,0);

if( nread<0) //read() error

f

//checkfor interrupted system call

if( errno==EINTR)

continue;

return 1;

g

if( nread==0) //streamclosed bypeer

return 1;

n=nbuf; // start scanningat last read position

nbuf+=nread; //endof buer so far

while(n<nbuf)

f

if( buf[n]==LF)

f

if(++nLF ==2)

f

// replace secondLFwithn0( LF LF orCR LFCR LF)

buf[n ]= 'n0';

nhdr=n+1;

ndata=nbuf nhdr; // additional, non headerbytesreadin

returnnhdr; //returnbytesin header,including last null

g

(9)

if( buf[n]==CR)

f

if(++nCR== 2)

f

if( nLF==0)

f

//replace secondCRwithn0(CR CR)

buf[n]= 'n0';

nhdr=n+1;

ndata=nbuf nhdr;//additional,non headerbytesreadin

returnnhdr; //returnbytesin header,including last null

g

g

g

if((buf[n]!= CR)&& ( buf[n]!=LF ))

f

nLF=0;

nCR=0;

g

n++;

g

nleft =nread;

pbuf+= nread;

g

// buer full butnoterminatorsequencefound

return 1;

Note that in readingaxed-sizeblockfor eÆciency, databeyond theend ofthe headermaybestored in

thesuppliedbuer. Thisstateinformationisreturnedas:

1. Thenumberof bytes intheheader(returnedvaluenhdr).

2. Thelengthofanyadditionaldatabeyondthis (variable *ndata).

Thus, aprocessing sequencesuchasshown in Listing5is requiredafterthe callto read theheader. This

holdingofstateinformationmeansthatpointers mustbepassedinCsothat areturnvalueissaved. This

is ideallysuitedto anobject-orientedimplementation, where theobjectcanmaintain thecurrentstateof

thereadoperationwithinitself, hiddenfromthecallerapplication.

Listing5: CallingtheReadMessageHeaderfunction.

if( nhdr>0)

f

//checkheader

...

if( ndata>0)

f

//point to ndatabytesof data

pdata=&buf [nhdr];

...

g

g

(10)

6 Further Issues

Besidesthefundamental operationsofsendingand receivingformatted dataonaTCPvirtualcircuit,the

implementationofclientandserversoftwarerequiresattentiontoanumberofotherissues. Innoparticular

order,theseinclude:

Handlingsimultaneousconnections usingmultiplesocketsandmultiple threadsorprocesses.

Securityaspects includingreverseaddressresolutionforservers,transactionanderrorlogginginafailsafe

manner,serverlesystemprotection,client/serverauthentication,andtransactionencryption.

Server load balancing usingpre-forkedprocessesorpre-spawnedthreadsontheserver,andDNSrotary

IPselectionbytheclient.

7 Conclusion

Thisreporthashighlightedsomefundamental problemsinusingapplication-layerprotocolsoverTCP,and

presentedsomesolutionswhichhavebeenfoundtoworkwellinpractice. TheseincludeeÆcientscheduling

of thesend operation in the presenceof Nagle's algorithm,and multiple non-overlapping read operations

which,ifrequired,caneÆcientlydelineatetheprotocolheaderfromsubsequentbinaryornon-binarydata.

All of the above methods havebeen implemented in C on Unix and Windows NT platforms, and using

JavaontheJavaVirtualMachine.ItisnotedthattheJavaimplementationprovidesaparticularlyelegant

waytomaintainthecurrentstateofthetransactionwithinaninstanceofanobject(aMessageReaderand

aMessageWriterclass),and that aC++implementation oftheC code currentlyused should beableto

providesuchadvantagesin tandemwiththeeÆciencyderivedfromcompiled C++code.

References

[1] R. Braden, Requirements for Internet Hosts { Application and Support, Internet Engineering Task

Force,Oct 1989, ftp://ftp.isi.edu/in-notes/rfc1123.txtcurrentSept2002.

[2] Allan Vermeulen, Scott W. Ambler, Greg Bumgardner, Eldon Metz, TrevorMisfeldt Jim Schur, and

PatrickThompson, The Elementsof JavaStyle, CambridgeUniversityPress,2000.

[3] JonC.Snader,EectiveTCP/IPProgramming-44TipstoImproveYourNetworkPrograms,

Addison-Wesley,2000.

[4] JohnNagle, CongestionControl inIP/TCPInternetworks, InternetEngineeringTaskForce,Jan1984,

ftp://ftp.isi.edu/in-notes/rfc896.txtcurrentSept2002.

[5] Douglas E. Comer and Thomas Narten, \TCP/IP", in Unix Networking, Stephen G. Kochan and

PatrickW. Woods,Eds.,chapter3.HaydenBooks,1989.

[6] W. Richard Stevens, TCP/IP Illustrated- Volume3: TCP for Transactions, HTTP, NNTP and the

Unix DomainProtocols, Addison-Wesley,1996.

[7] T.Berners-Lee,R. Fielding,andH. Frystyk, HypertextTransfer Protocol {HTTP/1.0, Internet

Engi-neeringTaskForce,May1996, ftp://ftp.isi.edu/in-notes/rfc1945.txtcurrentApril2002.

[8] Elliotte RustyHarold, Java NetworkProgramming, O'Reilly,2000.

[9] R. Fielding, J. Gettys, H. Frystyk, L. Masinter, P. Leach, and T. Berners-Lee, Hypertext Transfer

Protocol{HTTP/1.1, InternetEngineeringTask Force,Jun1999,

References

Related documents

Newby indicated that he had no problem with the Department’s proposed language change.. O’Malley indicated that the language reflects the Department’s policy for a number

The core framework has three levels of enablers as depicted in Figure 2.2 : (i) Service level(SL): Mechanisms to translate service requirements, build and exploit real world

The large number of associated attributes (22 − 29) has helped in decreasing the percentage of instances with missing values in the remaining attributes post the attribute

9 Within this multifaceted context, the aims of this review paper are: (1) to discuss the inconsistencies in the definitions and terminology of coaching used in the literature

Open, universal strong authentication is intended to provide all key constituencies (device manufacturers, identity management vendors, security service providers, and

Secondly (and perhaps more worrying from the ATO’s point of view), the uncertainty could give rise to a situation where knowledgeable investors (including ‘sophisticated investors’

The TOM TAILOR Group comprises TOM TAILOR Holding AG as the ultimate parent company and the subsidiaries listed in the notes to the consolidated financial statements to 31

In addition, optimization of affinity during subsequent stages of drug discovery typically leads to a further increase in molecular weight (MW) [ 19 ].. Moreover, affinity is