AWK

AWK is a general-purpose programming language designed for processing text-based data, either in files or in data streams. It was created at Bell Labs in the 1970s.

I noticed that Eric Wendelin wrote an article, "awk is a beautiful tool". In this article he said that it was best to introduce Awk with practical examples. I totally agree with Eric.

Eric Pement's Awk one-liner collection consists of five sections: 1. File spacing, 2. Numbering and calculations, 3. Text conversion and substitution, 4. Selective printing of certain lines, 5. Selective deleting of certain lines. The first part of the article will explain the first two sections: File spacing and Numbering and calculations. The second part will explain Text conversion and substitution, and the last part Selective printing/deleting of certain lines.

These one-liners work with all versions of awk, such as nawk (AT&T's new awk), gawk (GNU's awk), mawk (Michael Brennan's awk) and oawk (old awk).

Let's start!

1. Line Spacing

1. Double-space a file.

    awk '1; { print "" }' filename.ext

So how does it work? A one-liner is an Awk program, and every Awk program consists of a sequence of pattern-action statements of the form pattern { action statements }. In this case there are two statements, 1 and { print "" }. In a pattern-action statement either the pattern or the action may be missing. If the pattern is missing, the action is applied to every single line of input. A missing action is equivalent to { print }. Thus, this one-liner translates to:

    awk '1 { print } { print "" }' filename.ext

An action is applied only if the pattern matches, i.e., the pattern is true. Since 1 is always true, this one-liner translates further into two print statements:

    awk '{ print } { print "" }' filename.ext

Every print statement in Awk is silently followed by ORS, the Output Record Separator variable, which is a newline by default. The first print statement with no arguments is equivalent to print $0, where $0 is a variable holding the entire line. The second print statement prints the empty string, but since each print statement is followed by ORS, it actually prints a newline.
So there we have it, each line gets double-spaced.

2. Another way to double-space a file.

    awk 'BEGIN { ORS="\n\n" }; 1' filename.ext

BEGIN is a special kind of pattern which is not tested against the input. It is executed before any input is read. This one-liner double-spaces the file by setting the ORS variable to two newlines. As I mentioned previously, the statement 1 gets translated to { print }, and every print statement gets terminated with the value of the ORS variable.

3. Double-space a file so that no more than one blank line appears between lines of text.

    awk 'NF { print $0 "\n" }' filename.ext

The one-liner uses another special variable called NF, the Number of Fields. It contains the number of fields the current line was split into. For example, the line "this is a test" splits into four pieces and NF gets set to 4. An empty line does not split into any pieces and NF gets set to 0. Using NF as a pattern therefore filters out empty lines. This one-liner says: if there are any fields, print the whole line followed by a newline.

4. Triple-space a file.

    awk '1; { print "\n" }' filename.ext

This one-liner is very similar to the previous ones. 1 gets translated into { print } and the resulting Awk program is:

    awk '{ print; print "\n" }' filename.ext

It prints the line, then prints a newline followed by the terminating ORS, which is a newline by default.

2. Numbering and Calculations

5. Number lines in each file separately.

    awk '{ print FNR "\t" $0 }' filename.ext

This Awk program prepends the predefined variable FNR (File Line Number) and a tab (\t) to each line. FNR contains the current line number within each file separately. For example, if this one-liner was called on two files, one containing 10 lines and the other 12, it would number lines in the first file from 1 to 10, and then restart from 1 for the second file and number its lines from 1 to 12. FNR gets reset from file to file.

6. Number lines for all files together.

    awk '{ print NR "\t" $0 }' filename.ext

This one works the same as #5 except that it uses the NR (Line Number) variable, which does not get reset from file to file. It counts the input lines seen so far. For example, if it was called on the same two files with 10 and 12 lines, it would number the lines from 1 to 22 (10 + 12).

7. Number lines in a fancy manner.

    awk '{ printf("%5d : %s\n", NR, $0) }' filename.ext

This one-liner uses the printf() function to number lines in a custom format. It takes a format parameter just like the regular C printf() function. Note that ORS does not get appended at the end of printf(), so we have to print the newline (\n) character explicitly. This one right-aligns the line numbers, followed by a space, a colon, and the line.

8. Number only non-blank lines in files.

    awk 'NF { $0=++a " :" $0 }; { print }' filename.ext

Awk variables are dynamic; they come into existence when they are first used. This one-liner pre-increments the variable a each time the line is non-empty, prepends the value of this variable to the line, and prints it out. Blank lines are printed unchanged.

9. Count lines in files (emulates wc -l).

    awk 'END { print NR }' filename.ext

END is another special kind of pattern which is not tested against the input. It is executed when all the input has been exhausted. This one-liner outputs the value of the NR special variable after all the input has been consumed. NR contains the total number of lines seen (= number of lines in the file).

10. Print the sum of fields in every line.

    awk '{ s = 0; for (i = 1; i <= NF; i++) s = s+$i; print s }' filename.ext

Awk has some features of the C language, like the for (;;) { } loop. This one-liner loops over all fields in a line (there are NF fields in a line) and accumulates their sum in the variable s. Then it prints the result and proceeds to the next line.

11. Print the sum of fields in all lines.

    awk '{ for (i = 1; i <= NF; i++) s = s+$i }; END { print s }' filename.ext

This one-liner is basically the same as #10, except that it prints the sum of all fields in the whole input. Notice that it does not initialize the variable s to 0. That is not necessary, as variables come into existence dynamically.

12. Replace every field by its absolute value.

    awk '{ for (i = 1; i <= NF; i++) if ($i < 0) $i = -$i; print }' filename.ext

This one-liner uses two other features of the C language, namely the if () { } statement and the omission of curly braces. It loops over all fields in a line and checks whether each field is less than 0. If a field is less than 0, the one-liner negates it to make it positive. Fields can be addressed indirectly by a variable. For example, i = 5; $i = "hello" sets field number 5 to the string hello.
Here is the same one-liner rewritten with curly braces for clarity. The print statement gets executed after all the fields in the line have been replaced by their absolute values.

    awk '{
      for (i = 1; i <= NF; i++) {
        if ($i < 0) {
          $i = -$i;
        }
      }
      print
    }' filename.ext

13. Count the total number of fields (words) in a file.

    awk '{ total = total + NF }; END { print total }' filename.ext

This one-liner matches all the lines and keeps adding up the number of fields in each line. The number of fields seen so far is kept in a variable named total. Once the input has been processed, the special pattern END { } is executed, which prints the total number of fields.

14. Print the total number of lines containing the word Beth.

    awk '/Beth/ { n++ }; END { print n+0 }' filename.ext

This one-liner has two pattern-action statements. The first one is /Beth/ { n++ }. A pattern between two slashes is a regular expression. It matches all lines containing the pattern Beth (not necessarily the word Beth; it could just as well be Bethe or theBeth333). When a line matches, the variable n gets incremented by one. The second pattern-action statement is END { print n+0 }. It is executed when the file has been processed. Note the +0 in the print n+0 statement. It forces 0 to be printed in case there were no matches (n was undefined). Had we not put +0 there, an empty line would have been printed.

15. Find the line containing the largest (numeric) first field.

    awk '$1 > max { max=$1; maxline=$0 }; END { print max, maxline }' filename.ext

This one-liner keeps track of the largest number in the first field (in the variable max) and the corresponding line (in the variable maxline). Once it has looped over all lines, it prints them out.

16. Print the number of fields in each line, followed by the line.

    awk '{ print NF ":" $0 } ' filename.ext

This one-liner just prints the predefined variable NF (Number of Fields), which contains the number of fields in the line, followed by a colon and the line itself.

17. Print the last field of each line.

    awk '{ print $NF }' filename.ext

Fields in Awk need not be referenced by constants. For example, code like f = 3; print $f would print the 3rd field. This one-liner prints the field whose number is the value of NF, so $NF is the last field in the line.

18. Print the last field of the last line.

    awk '{ field = $NF }; END { print field }' filename.ext

This one-liner keeps track of the last field in the variable field. Once it has looped over all the lines, the variable field contains the last field of the last line, and the one-liner just prints it out.

19. Print every line with more than 4 fields.

    awk 'NF > 4' filename.ext

This one-liner omits the action statement. As I noted in one-liner #1, a missing action statement is equivalent to { print }.

20. Print every line where the value of the last field is greater than 4.

    awk '$NF > 4' filename.ext

This one-liner is similar to #17. It references the last field through the NF variable. If the last field is greater than 4, it prints the line.
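As a rough way to check two of the points above (FNR versus NR in #5 and #6, and the n+0 idiom in #14), the following sketch runs them on two small files. The file names and contents are made up for illustration and are not part of the original collection.

```shell
# Create two small sample files in a temporary directory (hypothetical data).
tmp=$(mktemp -d)
printf 'alpha\nbeta\n' > "$tmp/a.txt"
printf 'gamma\ndelta\nepsilon\n' > "$tmp/b.txt"

# FNR restarts at 1 for each file; NR keeps counting across files.
numbered=$(awk '{ print FNR, NR, $0 }' "$tmp/a.txt" "$tmp/b.txt")
printf '%s\n' "$numbered"
# 1 1 alpha
# 2 2 beta
# 1 3 gamma
# 2 4 delta
# 3 5 epsilon

# No line contains Beth, so n is never set; n+0 still prints 0.
count=$(awk '/Beth/ { n++ }; END { print n+0 }' "$tmp/a.txt")
printf '%s\n' "$count"
# 0

rm -r "$tmp"
```

The third output line is where the two counters diverge: the first line of the second file has FNR 1 but NR 3.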
AWK is a programming language that is designed for processing text-based data, either in files or data streams, and was created at Bell Labs in the 1970s[1]. The name AWK is derived from the family names of its authors: Alfred Aho, Peter Weinberger, and Brian Kernighan. However, it is not commonly pronounced as a string of separate letters but rather to sound the same as the name of the bird, auk (which acts as an emblem of the language, such as on the cover of the book The AWK Programming Language). awk, when written in all lowercase letters, refers to the Unix or Plan 9 program that runs other programs written in the AWK programming language.

"AWK is a language for processing files of text. A file is treated as a sequence of records, and by default each line is a record. Each line is broken up into a sequence of fields, so we can think of the first word in a line as the first field, the second word as the second field, and so on. An AWK program is of a sequence of pattern-action statements. AWK reads the input a line at a time. A line is scanned for each pattern in the program, and for each pattern that matches, the associated action is executed." - Alfred V. Aho[2]

AWK is an example of a programming language that extensively uses the string datatype, associative arrays (that is, arrays indexed by key strings), and regular expressions. The power, terseness, and limitations of AWK programs and sed scripts inspired Larry Wall to write Perl. Because of their dense notation, all these languages are often used for writing one-liner programs.

AWK is one of the early tools to appear in Version 7 Unix and gained popularity as a way to add computational features to a Unix pipeline. A version of the AWK language is a standard feature of nearly every modern Unix-like operating system available today. AWK is mentioned in the Single UNIX Specification as one of the mandatory utilities of a Unix operating system.
Besides the Bourne shell, AWK is the only other scripting language available in a standard Unix environment[3]. Implementations of AWK exist as installed software for almost all other operating systems.

Structure of AWK programs

An AWK program is a series of pattern-action pairs, written as:

    pattern { action }

where pattern is typically an expression and action is a series of commands. Each line of input is tested against all the patterns in turn, and the action is executed for each expression that is true. Either the pattern or the action may be omitted. The pattern defaults to matching every line of input. The default action is to print the line of input.

In addition to a simple AWK expression, the pattern can be BEGIN or END, causing the action to be executed before or after all lines of input have been read, or pattern1, pattern2, which matches the range of input lines starting with a line that matches pattern1 up to and including the line that matches pattern2, before again trying to match against pattern1 on future lines.

In addition to normal arithmetic and logical operators, AWK expressions include the tilde operator, ~, which matches a regular expression against a string. As handy syntactic sugar, /regexp/ without the tilde operator matches against the current line of input.

AWK commands

AWK commands are the statements that are substituted for action in the examples above. AWK commands can include function calls, variable assignments, calculations, or any combination thereof. AWK contains built-in support for many functions; many more are provided by the various flavors of AWK. Also, some flavors support the inclusion of dynamically linked libraries, which can provide still more functions.

For brevity, the enclosing curly braces ( { } ) will be omitted from these examples.

The print command

The print command is used to output text. The output text is always terminated with a predefined string called the output record separator (ORS), whose default value is a newline. The simplest form of this command is:

    print

This displays the contents of the current line.
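As a rough illustration of the pattern forms described in this section, the sketch below feeds three made-up lines to a program that uses BEGIN, an expression with no action, a bare /regexp/, the ~ operator, and END. The input and the printed labels are assumptions chosen for the example.

```shell
# Three hypothetical input lines: apple, banana, cherry.
result=$(printf 'apple\nbanana\ncherry\n' | awk '
    BEGIN      { print "start" }      # runs before any input is read
    NR == 2                           # expression pattern, default action prints the line
    /^a/       { print "a-line" }     # bare regexp is matched against the current line
    $0 ~ /y$/  { print "ends in y" }  # the same kind of match written with the ~ operator
    END        { print "end" }        # runs after all input has been read
')
printf '%s\n' "$result"
# start
# a-line
# banana
# ends in y
# end
```

Every input line is tested against every pattern in order, which is why the output interleaves results from different rules.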
In AWK, lines are broken down into fields, and these can be displayed separately:

    print $1

displays the first field of the current line, and

    print $1, $3

displays the first and third fields of the current line, separated by a predefined string called the output field separator (OFS), whose default value is a single space character.

Although these fields ($X) may bear resemblance to variables (the $ symbol indicates variables in Perl), they actually refer to the fields of the current line. A special case, $0, refers to the entire line. In fact, the commands "print" and "print $0" are identical in functionality.

The print command can also display the results of calculations and/or function calls:

    print 3+2
    print foobar(3)
    print foobar(variable)
    print sin(3-2)

Output may be sent to a file:

    print "expression" > "file name"

or through a pipe:

    print "expression" | "command"

Variables and syntax

Variable names can use any of the characters [A-Za-z0-9_] (though not a digit as the first character), with the exception of language keywords. The operators + - * / represent addition, subtraction, multiplication, and division, respectively. For string concatenation, simply place two variables (or string constants) next to each other. A space in between is optional if string constants are involved, but two variable names cannot be placed adjacent to each other without a space in between.
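The difference between a comma in print (which inserts OFS) and plain adjacency (which concatenates) can be seen in a small sketch. The sample line and the shell variable names are made up for illustration.

```shell
line='alice 30 london'

# A comma in print inserts OFS (a single space by default) between items.
with_comma=$(printf '%s\n' "$line" | awk '{ print $1, $3 }')

# Adjacent expressions are concatenated with nothing in between.
concatenated=$(printf '%s\n' "$line" | awk '{ print $1 $3 }')

# Setting OFS changes what the comma inserts.
custom_ofs=$(printf '%s\n' "$line" | awk 'BEGIN { OFS = "-" } { print $1, $3 }')

printf '%s\n' "$with_comma" "$concatenated" "$custom_ofs"
# alice london
# alicelondon
# alice-london
```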
String constants are delimited by double quotes. Statements need not end with semicolons. Finally, comments can be added to programs with #; everything from the # to the end of the line is ignored.

User-defined functions

In a format similar to C, function definitions consist of the keyword function, the function name, argument names and the function body. Here is an example of a function.

    function add_three (number,    temp) {
        temp = number + 3
        return temp
    }

This function can be invoked as follows:

    print add_three(36)     # Outputs 39

Functions can have variables that are in the local scope. The names of these are added to the end of the argument list, though values for these should be omitted when calling the function. It is convention to add some whitespace in the argument list before the local variables, in order to indicate where the parameters end and the local variables begin. One or more spaces may exist between the function name and the opening parenthesis in the function definition, but no space at all is allowed in the function call.

Sample applications

Hello World

Here is the ubiquitous "Hello world" program written in AWK:

    BEGIN { print "Hello, world!" }

Note that you do not need an explicit exit statement here; since the only pattern is BEGIN, no command-line arguments are processed.

Print lines longer than 80 characters

Print all lines longer than 80 characters. Note that the default action is to print the current line.

    length($0) > 80

Print a count of words

Count words in the input, and print lines, words, and characters (like wc):

    {
        w += NF
        c += length + 1
    }
    END { print NR, w, c }

As there is no pattern for the first line of the program, every line of input matches by default, so the increment actions are executed for every line. Note that w += NF is shorthand for w = w + NF.
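To see the word-count program at work, here is a minimal sketch that runs it on a two-line sample file. The file contents are made up, and length($0) is written out in full instead of the bare length used above, which should behave the same way.

```shell
# Hypothetical two-line sample file.
sample=$(mktemp)
printf 'one two three\nfour five\n' > "$sample"

# Accumulate fields (words) and characters per line; add 1 per line for the newline.
counts=$(awk '
    { w += NF; c += length($0) + 1 }
    END { print NR, w, c }
' "$sample")
printf '%s\n' "$counts"
# 2 5 24

rm "$sample"
```

The first line contributes 13 characters plus a newline and the second 9 plus a newline, giving 24, which is what wc reports for the same file.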
Sum last word

    { s += $NF }
    END { print s + 0 }

s is incremented by the numeric value of $NF, which is the last word on the line as defined by AWK's field separator, by default white-space. NF is the number of fields in the current line, e.g. 4. Since $4 is the value of the fourth field, $NF is the value of the last field in the line regardless of how many fields this line has, or whether it has more or fewer fields than surrounding lines. $ is actually a unary operator with the highest operator precedence. (If the line has no fields then NF is 0, and $0 is the whole line, which in this case is empty apart from possible white-space, and so has the numeric value 0.)

At the end of the input the END pattern matches, so s is printed. However, since there may have been no lines of input at all, in which case no value has ever been assigned to s, it will by default be an empty string. Adding zero to a variable is an AWK idiom for coercing it from a string to a numeric value. (Concatenating an empty string coerces from a number to a string, e.g. s "". Note that there is no operator to concatenate strings; they are just placed adjacently.) With the coercion the program prints 0 on an empty input; without it an empty line is printed.

Match a range of input lines

    $ yes Wikipedia | awk 'NR % 4 == 1, NR % 4 == 3 { printf "%6d %s\n", NR, $0 }' | sed 7q
         1 Wikipedia
         2 Wikipedia
         3 Wikipedia
         5 Wikipedia
         6 Wikipedia
         7 Wikipedia
         9 Wikipedia
    $

The yes command repeatedly prints its argument (by default the letter "y") on a line. In this case, we tell the command to print the word "Wikipedia". The action statement prints each line numbered. The printf function emulates the standard C printf, and works similarly to the print command described above. The pattern to match, however, works as follows: NR is the number of records, typically lines of input, that AWK has read so far, i.e. the current line number, starting at 1 for the first line of input. % is the modulo operator. NR % 4 == 1 is true for the first, fifth, ninth, etc., lines of input. Likewise, NR % 4 == 3 is true for the third, seventh, eleventh, etc., lines of input. The range pattern is false until the first part matches, on line 1, and then remains true up to and including when the second part matches, on line 3. It then stays false until the first part matches again on line 5. The sed command is used to print the first 7 lines, to prevent yes from running forever. It is equivalent to head -7 if the head command is available.

The first part of a range pattern being constantly true, e.g. 1, can be used to start the range at the beginning of input. Similarly, if the second part is constantly false, e.g. 0, the range continues until the end of input:

    /^--cut here--$/, 0

prints lines of input from the first line matching the regular expression ^--cut here--$, that is, a line containing only the phrase "--cut here--", to the end.
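Both kinds of range pattern described above can be tried on finite input. In this sketch the eight single-letter lines and the four-line "cut here" sample are made up for illustration, and print is used in place of printf to keep the output easy to compare.

```shell
# Range that opens on lines 1, 5, ... and closes on lines 3, 7, ...
ranged=$(printf 'a\nb\nc\nd\ne\nf\ng\nh\n' |
    awk 'NR % 4 == 1, NR % 4 == 3 { print NR, $0 }')
printf '%s\n' "$ranged"
# 1 a
# 2 b
# 3 c
# 5 e
# 6 f
# 7 g

# Second part constantly false (0): the range never closes once it has opened.
tail_part=$(printf 'header\n--cut here--\nbody 1\nbody 2\n' |
    awk '/^--cut here--$/, 0')
printf '%s\n' "$tail_part"
# --cut here--
# body 1
# body 2
```

Lines 4 and 8 are skipped in the first run because the range has closed and its first part has not matched again yet.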
Calculate word frequencies

Word frequency, using associative arrays:

    BEGIN {
        FS="[^a-zA-Z]+"
    }
    {
        for (i=1; i<=NF; i++)
            words[tolower($i)]++
    }
    END {
        for (i in words)
            print i, words[i]
    }

The BEGIN block sets the field separator to any sequence of non-alphabetic characters. Note that separators can be regular expressions. After that, we get to a bare action, which is performed on every input line. In this case, for every field on the line, we add one to the number of times that word, first converted to lowercase, appears. Finally, in the END block, we print the words with their frequencies. The line

    for (i in words)

creates a loop that goes through the array words, setting i to each subscript of the array. This is different from most languages, where such a loop goes through each value in the array. It means that each word can be printed together with its count in a simple way. tolower was an addition to the One True awk (see below) made after the book was published.

Match pattern from command line

This program can be represented in several ways. The first one uses the Bourne shell to make a shell script that does everything. It is the shortest of these methods:

    $ cat grepinawk
    pattern=$1
    shift
    awk '/'$pattern'/ { print FILENAME ":" $0 }' $*
    $

The $pattern in the awk command is not protected by quotes, so the shell expands it before awk sees the program. A pattern by itself checks, in the usual way, whether the whole line ($0) matches. FILENAME contains the current filename. awk has no explicit concatenation operator; two adjacent strings are concatenated. $0 expands to the original unchanged input line.

There are alternate ways of writing this. This shell script accesses the environment directly from within awk:

    $ cat grepinawk
    pattern=$1
    export pattern
    shift
    awk '$0 ~ ENVIRON["pattern"] { print FILENAME ":" $0 }' $*
    $

This is a shell script that uses ENVIRON, an array introduced in a newer version of the One True awk after the book was published. The subscript of ENVIRON is the name of an environment variable; its result is the variable's value. This is like the getenv function in various standard libraries and POSIX. The shell script makes an environment variable pattern containing the first argument (the export line is what places it in the environment), then drops that argument and has awk look for the pattern in each file.

~ checks to see if its left operand matches its right operand; !~ is its inverse. Note that a regular expression is just a string and can be stored in variables.
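The ENVIRON lookup and the ~ and !~ operators can be exercised without a script file. In this sketch the regular expression and the four input words are made up for illustration; the variable has to be exported for awk to see it in ENVIRON.

```shell
# Store a regular expression in an exported variable (hypothetical value).
pattern='^b.*a$'
export pattern

# ~ selects lines that match the string held in ENVIRON["pattern"].
matches=$(printf 'banana\nbandit\nbeta\napple\n' |
    awk '$0 ~ ENVIRON["pattern"]')

# !~ selects the lines that do not match.
non_matches=$(printf 'banana\nbandit\nbeta\napple\n' |
    awk '$0 !~ ENVIRON["pattern"]')

printf '%s\n' "$matches"
# banana
# beta
printf '%s\n' "$non_matches"
# bandit
# apple
```

Because the regular expression reaches awk as a string value and is not spliced into the program text, characters such as slashes or quotes in it do not interfere with awk's own syntax.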
The next way uses command-line variable assignment, in which an argument to awk can be seen as an assignment to a variable:

    $ cat grepinawk
    pattern=$1
    shift
    awk '$0 ~ pattern { print FILENAME ":" $0 }' "pattern=$pattern" $*
    $

Finally, this is written in pure awk, without help from a shell and without the need to know too much about the implementation of the awk script (as the command-line variable assignment version does), but it is a bit lengthy:

    BEGIN {
        pattern = ARGV[1]
        for (i = 1; i < ARGC; i++) # remove first argument
            ARGV[i] = ARGV[i + 1]
        ARGC--
        if (ARGC == 1) { # the pattern was the only thing, so force read from standard input (used by book)
            ARGC = 2
            ARGV[1] = "-"
        }
    }
    $0 ~ pattern { print FILENAME ":" $0 }

The BEGIN is necessary not only to extract the first argument, but also to prevent it from being interpreted as a filename after the BEGIN block ends. ARGC, the number of arguments, is always guaranteed to be at least 1, as ARGV[0] is the name of the command that executed the script, most often the string "awk". Also note that ARGV[ARGC] is the empty string, "". # initiates a comment that extends to the end of the line.

Note the if block. awk only checks whether it should read from standard input before it runs the command. This means that

    awk 'prog'

only works because the fact that there are no filenames is only checked before prog is run! If you explicitly set ARGC to 1 so that there are no arguments, awk will simply quit because it concludes there are no more input files. Therefore, you need to explicitly say to read from standard input with the special filename -.

Self-contained AWK scripts

As with many other programming languages, self-contained AWK scripts can be constructed using the so-called "shebang" syntax. For example, a UNIX command called hello.awk that prints the string "Hello, world!" may be built by creating a file named hello.awk containing the following lines:

    #!/usr/bin/awk -f
    BEGIN { print "Hello, world!" }

The -f tells awk that the argument that follows is the file to read the awk program from; that argument is placed there by the shell when the script is run.

AWK versions and implementations

AWK was originally written in 1977, and distributed with Version 7 Unix. In 1985 its authors started expanding the language, most significantly by adding user-defined functions. The language is described in the book The AWK Programming Language, published in 1988, and its implementation was made available in releases of UNIX System V. To avoid confusion with the incompatible older version, this version was sometimes known as "new awk" or nawk.
This implementation was released under a free software license in 1996, and is still maintained by Brian Kernighan.

BWK awk refers to the version by Brian W. Kernighan. It has been dubbed the "One True AWK" because of the use of the term in association with the book[4] that originally described the language, and the fact that Kernighan was one of the original authors of awk. FreeBSD refers to this version as one-true-awk[5]. This version also has features not in the book, such as tolower and ENVIRON, which are explained above; see the FIXES file in the source archive for details.

gawk (GNU awk) is another free software implementation and the only implementation that made serious attempts at implementing i18n. It also allows the user to extend the functionality of the program via user-written shared libraries. It was written before the original implementation became freely available, and is still widely used. Many Linux distributions come with a recent version of gawk, and gawk is widely recognized as the de facto standard implementation in the Linux world. gawk version 3.0 was included as awk in FreeBSD prior to version 5.0. Subsequent versions of FreeBSD use BWK awk in order to avoid[6] the GPL, a more restrictive license than the BSD license (in the sense that GPL-licensed code cannot be modified to become proprietary software).[7]

xgawk is a SourceForge project[8] based on gawk. It extends gawk with dynamically loadable libraries.

mawk is a very fast AWK implementation by Mike Brennan based on a byte code interpreter.

Old versions of Unix, such as UNIX/32V, included awkcc, which converted AWK to C. Kernighan wrote a program to turn awk into C++; its state is not known.[9]

awka (whose front end is written on top of the mawk program) is another translator of awk scripts into C code. When compiled, statically including the author's libawka.a, the resulting executables are considerably sped up and, according to the author's tests, compare very well with other versions of awk, perl or tcl. Small scripts will turn into programs of 160-170 kB.

Thompson AWK or TAWK is an AWK compiler for Solaris, DOS, OS/2, and Windows, previously sold by Thompson Automation Software (which has ceased its activities).

Jawk is a SourceForge project[10] to implement AWK in Java. Extensions to the language are added to provide access to Java features within AWK scripts (e.g., Java threads, sockets, Collections, etc.).

BusyBox includes a sparsely documented AWK implementation that appears to be complete, written by Dmitry Zakharov. This is a very small implementation suitable for embedded systems.
