Before I built a wall I'd ask to know What I was walling in or walling out,
INTERFACES CHAPTER
int csvnfield(void)
    returns number of fields on last line read by csvgetline.
    behavior undefined if called before csvgetline is called.
This specification still leaves open questions. For example, what values should be returned by csvfield and csvnfield if they are called after csvgetline has encountered EOF? How should ill-formed fields be handled? Nailing down all such puzzles is difficult even for a tiny system, and very challenging for a large one, though it is important to try. One often doesn't discover oversights and omissions until implementation is underway.
The rest of this section contains a new implementation of csvgetline that matches the specification. The library is broken into two files, a header csv.h that contains the function declarations that represent the public part of the interface, and an implementation file csv.c that contains the code. Users include csv.h in their source code and link their compiled code with the compiled version of csv.c; the source need never be visible. Here is the header file:
    /* csv.h: interface for csv library */

    extern char *csvgetline(FILE *f);  /* read next input line */
    extern char *csvfield(int n);      /* return field n */
    extern int csvnfield(void);        /* return number of fields */

The internal variables that store text and the internal functions like split are declared static so they are visible only within the file that contains them. This is the simplest way to hide information in a C program.
    enum { NOMEM = -2 };           /* out of memory signal */

    static char *line    = NULL;   /* input chars */
    static char *sline   = NULL;   /* line copy used by split */
    static int  maxline  = 0;      /* size of line[] and sline[] */
    static char **field  = NULL;   /* field pointers */
    static int  maxfield = 0;      /* size of field[] */
    static int  nfield   = 0;      /* number of fields in field[] */

    static char fieldsep[] = ",";  /* field separator chars */
The variables are initialized statically as well. These initial values are used to test whether to create or grow arrays.
These declarations describe a simple data structure. The line array holds the input line; the sline array is created by copying characters from line and terminating each field. The field array points to entries in sline. This diagram shows the state of these three arrays after the input line ab,"cd","e""f",,"g,h" has been processed. Shaded elements in sline are not part of any field.
SECTION 4.3 A LIBRARY FOR OTHERS
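[Diagram: the line, sline, and field arrays; field[0] through field[4] point to the starts of the five fields stored in sline.]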
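The relationship among the three arrays can be sketched in a few lines of C. This is a simplified illustration, not the library's split: the function name split_simple, the caller-supplied buffers, and the fixed field limit MAXF are all assumptions made for the example, and quoted fields are not handled at all.

```c
#include <string.h>

enum { MAXF = 8 };  /* hypothetical fixed limit, for illustration only */

/* split_simple: copy src into dst, terminate each comma-separated
   field in dst, and store a pointer to each field in fields[].
   Returns the number of fields found (at most maxf).
   Hypothetical helper: it ignores quotes entirely. */
int split_simple(const char *src, char *dst, char *fields[], int maxf)
{
    int n = 0;
    char *p;

    strcpy(dst, src);           /* src itself is left untouched */
    p = dst;
    while (n < maxf) {
        fields[n++] = p;        /* field starts here, inside dst */
        p = strchr(p, ',');
        if (p == NULL)
            break;              /* last field on the line */
        *p++ = '\0';            /* terminate field, step past comma */
    }
    return n;
}
```

Given the input ab,cd,,gh, the sketch would leave the original line intact and produce four pointers into the copy, with the third pointing at an empty string.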
Here is the function csvgetline itself:

    /* csvgetline: get one line, grow as needed */
    char *csvgetline(FILE *fin)
    {
        int i, c;
        char *newl, *news;

        if (line == NULL) {           /* allocate on first call */
            maxline = maxfield = 1;
            line = (char *) malloc(maxline);
            sline = (char *) malloc(maxline);
            field = (char **) malloc(maxfield*sizeof(field[0]));
            if (line == NULL || sline == NULL || field == NULL) {
                reset();
                return NULL;          /* out of memory */
            }
        }
        for (i=0; (c=getc(fin))!=EOF && !endofline(fin,c); i++) {
            if (i >= maxline-1) {     /* grow line */
                maxline *= 2;         /* double current size */
                newl = (char *) realloc(line, maxline);
                news = (char *) realloc(sline, maxline);
                if (newl == NULL || news == NULL) {
                    reset();
                    return NULL;      /* out of memory */
                }
                line = newl;
                sline = news;
            }
            line[i] = c;
        }
        line[i] = '\0';
        if (split() == NOMEM) {
            reset();
            return NULL;              /* out of memory */
        }
        return (c == EOF && i == 0) ? NULL : line;
    }
An incoming line is accumulated in line, which is grown as necessary by a call to realloc; the size is doubled on each growth, as in Section 2.6. The sline array is kept the same size as line; csvgetline calls split to create the field pointers in a separate array field, which is also grown as needed.
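The grow-by-doubling idea can be isolated in a small sketch. The Buf type and the buf_add function are hypothetical names introduced only for this example; the library itself keeps its buffers in static variables rather than in a struct. The sketch assigns the result of realloc to a temporary first, so the old buffer remains valid if allocation fails.

```c
#include <stdlib.h>

/* Hypothetical growable buffer; names are illustrative only. */
typedef struct {
    char *data;  /* NULL until the first character is added */
    int   len;   /* characters stored */
    int   max;   /* allocated size; 0 means not yet allocated */
} Buf;

/* buf_add: append c to b, doubling the allocation when it is full.
   Returns 1 on success, 0 if out of memory (b is left unchanged). */
int buf_add(Buf *b, int c)
{
    if (b->len >= b->max) {
        int newmax = (b->max == 0) ? 1 : 2 * b->max;
        /* realloc(NULL, n) behaves like malloc(n) */
        char *p = realloc(b->data, newmax);
        if (p == NULL)
            return 0;
        b->data = p;
        b->max = newmax;
    }
    b->data[b->len++] = (char) c;
    return 1;
}
```

Starting from a size of 1 means that adding five characters already forces four reallocations (sizes 1, 2, 4, 8), so the growth path is exercised by even the smallest test.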
As is our custom, we start the arrays very small and grow them on demand, to guarantee that the array-growing code is exercised. If allocation fails, we call reset to restore the globals to their starting state, so a subsequent call to csvgetline has a chance of succeeding:

    /* reset: set variables back to starting values */
    static void reset(void)
    {
        free(line);    /* free(NULL) permitted by ANSI C */
        free(sline);
        free(field);
        line = NULL;
        sline = NULL;
        field = NULL;
        maxline = maxfield = nfield = 0;
    }
The endofline function handles the problem that an input line may be terminated by a carriage return, a newline, both, or even EOF:

    /* endofline: check for and consume \r, \n, \r\n, or EOF */
    static int endofline(FILE *fin, int c)
    {
        int eol;

        eol = (c=='\r' || c=='\n');
        if (c == '\r') {
            c = getc(fin);
            if (c != '\n' && c != EOF)
                ungetc(c, fin);    /* read too far; put c back */
        }
        return eol;
    }
A separate function is necessary, since the standard input functions do not handle the rich variety of perverse formats encountered in real inputs.
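As a rough illustration of what this buys, the sketch below counts the lines in a stream whose terminators are mixed. The helper at_eol repeats the endofline logic under a different name so that the example stands alone, and count_lines is a hypothetical function written only for this demonstration.

```c
#include <stdio.h>

/* at_eol: same logic as endofline, copied here under a hypothetical
   name so the sketch is self-contained. */
int at_eol(FILE *fin, int c)
{
    int eol = (c == '\r' || c == '\n');

    if (c == '\r') {
        c = getc(fin);
        if (c != '\n' && c != EOF)
            ungetc(c, fin);     /* read too far; put c back */
    }
    return eol;
}

/* count_lines: count lines ended by \r, \n, \r\n, or EOF. */
int count_lines(FILE *fin)
{
    int c, n = 0;
    int pending = 0;            /* 1 if chars seen since last terminator */

    while ((c = getc(fin)) != EOF) {
        if (at_eol(fin, c)) {
            n++;
            pending = 0;
        } else
            pending = 1;
    }
    return n + pending;         /* a final unterminated line counts too */
}
```

For a stream containing a, carriage return, b, newline, c, carriage return plus newline, and d with no terminator, the sketch reports four lines, treating the \r\n pair as a single terminator.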
Our prototype used strtok to find the next token by searching for a separator character, normally a comma, but this made it impossible to handle quoted commas. A major change in the implementation of split is necessary, though its interface need not change. Consider these input lines:
Each line has three empty fields. Making sure that split parses them and other odd inputs correctly complicates it significantly, an example of how special cases and boundary conditions can come to dominate a program.
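A short experiment shows why strtok alone cannot cope with such inputs. The helper count_tokens below is hypothetical, written only for this illustration; it counts how many tokens strtok reports for a comma-separated string. Because strtok treats a run of adjacent separators as one and knows nothing about quotes, it loses empty fields and splits a quoted field at its embedded comma.

```c
#include <string.h>

/* count_tokens: count the tokens strtok finds in a copy of s, using
   a comma as the only separator. Hypothetical helper for illustration;
   inputs longer than 127 characters are truncated. */
int count_tokens(const char *s)
{
    char buf[128];
    char *p;
    int n = 0;

    strncpy(buf, s, sizeof buf - 1);
    buf[sizeof buf - 1] = '\0';     /* strtok modifies its argument */
    for (p = strtok(buf, ","); p != NULL; p = strtok(NULL, ","))
        n++;
    return n;
}
```

With this sketch, an input such as ,, (three empty fields) yields no tokens at all, a,,b yields two tokens instead of three fields, and the single quoted field "g,h" is reported as two tokens.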