Awk lets you convert variables from one type to another on the fly. For example, to convert an integer to a string, you simply use it as a string. String construction can be done with concatenation, which is often very convenient.
These principles are used in awkcast:
UNIX> echo "4 Jim" | awkcast
Word 1: as a number: 4, as a string: 4.
0 appended: number: 40, string 40
Word 2: as a number: 0, as a string: Jim.
0 appended: number: 0, string Jim0
UNIX>
Casting a string to an integer gives it its atoi() value, so a string with no leading digits, such as Jim, becomes 0.
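The conversions described above can be tried directly from the shell. The sketch below is not the awkcast script itself (its source is not shown here); cast_demo is a hypothetical name, and the function only illustrates that adding 0 forces a numeric value while concatenation forces a string.

```shell
# Hypothetical helper, not the original awkcast: shows each conversion.
cast_demo() {
  echo "4 Jim" | awk '{
    n1 = $1 + 0        # numeric context: "4" becomes 4
    s1 = $1 "0"        # string context: concatenation gives "40"
    n2 = $2 + 0        # "Jim" has no numeric prefix, so it becomes 0
    s2 = $2 "0"        # concatenation gives "Jim0"
    print n1, s1, s1 + 1, n2, s2
  }'
}
cast_demo    # prints: 4 40 41 0 Jim0
```

Note that the string "40" built by concatenation converts back to a number as soon as it is used in arithmetic, which is why s1 + 1 prints 41.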
BEGIN and END
There are two special patterns, BEGIN and END, which cause the corresponding actions to be executed before and after any lines are processed respectively. Therefore, the following program (awkwc) counts the number of lines and words in the input file.
UNIX> cat awkwc
#!/bin/awk -f
BEGIN { nl = 0; nw = 0 }
{ nl++ ; nw += NF }
END { print "Lines:", nl, "words:", nw }
UNIX> awkwc awkwc
Awk tries to process each statement on each line. Unlike sed, there is no ``hold space.'' Instead, each statement is processed on the original version of each line. Two special commands in awk are next and exit. The next command stops processing the current input line and goes directly to the next one, skipping all the rest of the statements. The exit command stops reading input and makes awk exit immediately (any END action still runs first).
Here are some simple examples. awkpo prints out only the odd-numbered lines (note that this is an awkward way to do it, but it works):
UNIX> cat awkpo
3 Bill Clinton
4 George Bush
5 Ronald Reagan
6 Jimmy Carter
7 Sylvester Stallone
UNIX> cat -n input | awkpo
1 Which of these lines doesn't belong:
3 Bill Clinton
5 Ronald Reagan
7 Sylvester Stallone
UNIX>
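One way to print only the odd-numbered lines with next is sketched below. This is a guess at the general approach, not the text of the awkpo script; the function name print_odd and the counter n are made up for illustration.

```shell
# A sketch only, not the original awkpo script.
print_odd() {
  awk 'BEGIN { n = 0 }
       { n++ }                  # count every input line
       n % 2 == 0 { next }      # even-numbered line: skip the remaining statements
       { print $0 }'            # only odd-numbered lines get this far
}
printf 'one\ntwo\nthree\nfour\nfive\n' | print_odd
# prints:
# one
# three
# five
```

A simpler way to get the same result is the single pattern NR % 2 == 1, which uses awk's built-in line counter.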
awkptR prints out all lines until it reaches a line with a capital R:
UNIX> cat awkptR
#!/bin/awk -f
/R/ { exit }
{ print $0 }
UNIX> awkptR input
Which of these lines doesn't belong:
Bill Clinton
George Bush
UNIX>
Arrays
Arrays in awk are a little odd. First, you don't have to malloc() any storage -- just use it and there it is. Second, arrays can have any indices -- integers, floating point numbers or strings. This is called ``associative'' indexing, and can be very convenient. You cannot have multidimensional arrays or arrays of arrays, though. To simulate multidimensional arrays, you can just concatenate the indices.
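Here is a small sketch of these array properties. The array names count and grid and the function name array_demo are invented for the example; the comma used to join the two indices is just one possible separator.

```shell
# Illustrative only: names and values are made up.
array_demo() {
  awk 'BEGIN {
    count["apple"] = 3              # string index
    count[7] = "seven"              # integer index
    count[2.5] = "two and a half"   # floating point index
    grid[1 "," 2] = "x"             # simulated 2-D index: the string "1,2"
    print count["apple"], count[7], count[2.5], grid["1,2"]
    print "[" count["pear"] "]"     # an entry never assigned is the null string
  }'
}
array_demo
# prints:
# 3 seven two and a half x
# []
```

No storage is declared anywhere: each entry comes into existence when it is first used.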
Take a look at awkgolf. This is typical of quick-and-dirty awk programs that you sometimes write to look at data. This one processes golf scores. Suppose you have some score files, as in the files usopen, masters, kemper and memorial. These files first have the name of the tournament in all caps, and then scores for a bunch of golfers. Suppose you'd like to see all the golfers with scores for each tournament in a readable form. This is what awkgolf does. Let's break it into its four parts.
The first part is the BEGIN line:
BEGIN { nt = 0 ; np = 0 }
This simply initializes two variables: nt is the number of tournaments, and np is the number of players.
The next line looks a little cryptic:
/^[A-Z]*$/ { this = $0; tourn[nt] = $0 ; nt++; next }
This action applies only to lines that consist entirely of capital letters (strictly speaking, the * means the pattern also matches an empty line). These are the lines that identify tournaments. On these lines, it does the following:
Sets the this variable to be the tournament name.
Puts the tournament's name into the tourn array.
Increments the nt variable.
Skips the rest of the program and goes on to the next line.
The next part works on all lines that contain the pattern '--'. These are the lines with golfers' scores:
/--/ { golfer = $1
       for (i = 2; $i != "--" ; i++) golfer = golfer" "$i
       if (isgolfer[golfer] != "yes") {
         isgolfer[golfer] = "yes"
The first two lines of this action set the golfer variable to be the golfer's name.
Note that you can do string comparison in awk using standard boolean operators, unlike in C where you would have to use strcmp().
The next five lines use awk's associative arrays: the array isgolfer is checked to see if it contains the string ``yes'' under the golfer's name. If so, we have processed this golfer before. If not, we set the golfer's entry in isgolfer to ``yes,'' set the np-th entry of the array g to be the golfer, and increment np.
Finally, we set the golfer's score for the tournament in the score array. Note that we don't use double-indirection. Instead, we simply concatenate the golfer's name and the tournament's name, and use that as the index for the array.
The last part of the program does the final formatting:
END { printf("%-25s", " ");
for (j = 0; j < nt; j++) printf("%9s", tourn[j])
printf("\n")
printf("\n") }
}
The first three lines print out 25 spaces, and then the names of the tournaments as held in the tourn array. Then we loop through each golfer, and print the golfer's name, padded to 25 characters, and then his score in each tournament. Note that if the golfer didn't play in the tournament, that entry of the score array will be the null string. This is quite convenient, because we don't have to test for whether the golfer played the tournament -- we can just use awk's default values.
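Putting the four parts together, the whole program might look like the sketch below. It is a reconstruction from the description above, not the actual awkgolf file: the lines that add the golfer to g, the score assignment, and the per-golfer loop in END are filled in from the prose, and the assumption that the score is the field right after "--" is a guess about the input format. The function name golf_sketch and the sample data are invented.

```shell
# A reconstruction under stated assumptions, not the original awkgolf.
golf_sketch() {
  awk '
  BEGIN { nt = 0 ; np = 0 }
  /^[A-Z]*$/ { this = $0; tourn[nt] = $0 ; nt++; next }
  /--/ { golfer = $1
         for (i = 2; $i != "--" ; i++) golfer = golfer" "$i
         if (isgolfer[golfer] != "yes") {
           isgolfer[golfer] = "yes"
           g[np] = golfer
           np++
         }
         score[golfer this] = $(i+1)    # assumed: the score follows "--"
       }
  END { printf("%-25s", " ")
        for (j = 0; j < nt; j++) printf("%9s", tourn[j])
        printf("\n")
        for (i = 0; i < np; i++) {
          printf("%-25s", g[i])
          # a tournament the golfer skipped prints as blanks (null string)
          for (j = 0; j < nt; j++) printf("%9s", score[g[i] tourn[j]])
          printf("\n")
        }
      }' "$@"
}

# Made-up sample data in the assumed format.
printf 'OPENA\nAnn Lee -- 70\nBo Diaz -- missed\nOPENB\nBo Diaz -- -3\n' | golf_sketch
```

In this sample, Ann Lee has no OPENB score, so her row simply ends in nine blanks; no test for a missing entry is needed.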
OK, let's try awkgolf:
UNIX> awkgolf kemper # Note that the output is only sorted because it's
# sorted in the input file
KEMPER
UNIX> cat masters usopen kemper memorial | awkgolf
MASTERS USOPEN KEMPER MEMORIAL
Tiger Woods 281 6 5
Tommy Tolles 283 2 -11
Tom Watson 284 16 0
Paul Stankowski 285 6 -5 -3
Fred Couples 286 13 missed
Davis Love III 286 5 -3 -7
Justin Leonard 286 9 -10 0
Steve Elkington 287 7
Tom Lehman 287 -2 0 -3
Ernie Els 288 -4 missed -1
Vijay Singh 288 21 0 -14
Jesper Parnevik 289 11 missed -4
Lee Westwood 291 6
Nick Price 291 6 -7
Lee Janzen 292 13 -4 -11
Jim Furyk 293 2 -12
Mark O'Meara 294 9 5 -2
Scott McCarron 294 3 missed missed
Scott Hoch 298 3 -11
Jumbo Ozaki 300 missed
Frank Nobilo 303 9 -10
Bob Tway missed 2 -7
Brad Faxon missed 17 2
David Duval missed 11 -5
Greg Norman missed missed -7 -12
Loren Roberts missed 4 -6
Nick Faldo missed 11 -7
Phil Mickelson missed 10 -4
Steve Jones missed 15 2 3
Steve Stricker missed 9 missed -1
Jay Haas 2 -5 -4
Billy Andrade 4 -7
Hal Sutton 6 missed -1
Kirk Triplett 1 -2
Don Pooley missed -4
UNIX>
File indirection
You can specify that the output of print and printf go to a file with indirection.
For example, to copy standard input to the file f1 you could do:
UNIX> awk '{print $0 > "f1"}' < input
UNIX> cat f1
Which of these lines doesn't belong:
Bill Clinton
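Because the file name is just a string expression, it can be computed from the input. The sketch below sends the second field of each line to a file named after the first field. The function name split_demo, the directory argument, and the sample data are all hypothetical.

```shell
# Illustrative only: writes one output file per distinct first field.
split_demo() {
  dir=$1
  printf 'a 1\nb 2\na 3\n' |
    awk -v dir="$dir" '{ print $2 > (dir "/" $1 ".txt") }'
}
d=$(mktemp -d) && split_demo "$d" && cat "$d/a.txt"
# prints:
# 1
# 3
```

Within a single run, > truncates a file only the first time it is opened, so later prints to the same name are added after the earlier ones. The parentheses around the file name expression keep the concatenation from being parsed ambiguously.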
Multiline awk programs in the Bourne shell
The Bourne shell lets you define multiline strings simply by putting newlines in the string (within single or double quotes, of course). This means that you can embed simple multiline awk scripts in a sh program without having to use cumbersome backslashes, or intermediate files. For example, shwc works just like awkwc, but works as a shell script rather than an awk program.
UNIX> shwc awkwc
Lines: 5 words: 26
UNIX> shwc < awkwc
Lines: 5 words: 26
UNIX> shwc awkwc awkwc
usage: shwc [ file ]
UNIX>
UNIX>
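A script with this behavior could be written roughly as follows. This is a sketch and not the shwc file itself: it is written as a shell function so it can be pasted into a shell, and sending the usage message to standard error with a status of 1 is an assumption.

```shell
# A sketch of shwc as a function; the real shwc is a script and may differ.
shwc() {
  if [ $# -gt 1 ]; then
    echo "usage: shwc [ file ]" >&2
    return 1
  fi
  # The awk program is one multiline single-quoted string.
  # With no file argument, "$@" expands to nothing and awk reads standard input.
  awk 'BEGIN { nl = 0; nw = 0 }
             { nl++ ; nw += NF }
       END   { print "Lines:", nl, "words:", nw }' "$@"
}
printf 'a b\nc\n' | shwc    # prints: Lines: 2 words: 3
```

Single quotes are the safer choice for the awk program, since inside double quotes the shell would try to expand things like $0.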
Awk's limitations
Awk is useful for simple data processing. It is not useful when things get more complex, for a few reasons. First, if your data file is huge, you'll do better to write a C program (using, for example, the fields library from CS302/360), because it will be more efficient, sometimes by a factor of 60 or more. Second, once you start writing procedure calls in awk, it seems to me you may as well be writing C code. Third, you will often find awk's lack of double indirection and its string processing cumbersome and inefficient.
Awk is not a good language for string processing. Irritatingly, it doesn't let you get at string elements with array operations. For example, the following will fail:
UNIX> cat sp.awk
{ s = $1 ; s[0] = 'a' ; print s }
UNIX> awk -f sp.awk input
awk: syntax error near line 1
awk: illegal statement near line 1
UNIX>
Of course, sed is ideal for string processing, so often you can get what you want with a combination of sed and awk.
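If you do need to change one character inside awk, the usual workaround is to rebuild the string with substr() and concatenation. The sketch below is one way to do what sp.awk was attempting; the function name set_char and its two arguments are hypothetical.

```shell
# Hypothetical helper: replace character K of the first field with C.
set_char() {   # usage: set_char K C
  awk -v k="$1" -v c="$2" '{ s = $1
       # substr() counts characters from 1, not 0
       s = substr(s, 1, k - 1) c substr(s, k + 1)
       print s }'
}
echo "Bill Clinton" | set_char 1 a    # prints: aill
```

This works, but it copies the whole string for every change, which is part of why awk's string processing is cumbersome.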