gAWK: record separator bugged?



GerritHiemstra
March 28th, 2012, 09:19 PM
Hello,

I was wondering if anyone has an explanation for a problem I ran into with AWK...

I will use the following simple awk script to explain myself:


BEGIN{RS=""}
END{print "num_rows: " NR}

I assumed that by using an empty record separator, AWK would consider all incoming fields to be in one record.

Using the following example text, all seems good:



<abc>
<def>
</def>
</abc>
Output: num_rows: 1

However, using the following text (where there are two or more newline characters in a row):




<abc>
<def>



</def>
</abc>
Output: num_rows: 2

How could it be?

Do I have my definitions wrong? (What precisely IS a record separator?)

Many thanks in advance!

r-senior
March 28th, 2012, 09:25 PM
The manual page for gawk explains "Records":


Records
Normally, records are separated by newline characters. You can control
how records are separated by assigning values to the built-in variable
RS. If RS is any single character, that character separates records.
Otherwise, RS is a regular expression. Text in the input that matches
this regular expression separates the record. However, in compatibility
mode, only the first character of its string value is used for
separating records. If RS is set to the null string, then records are
separated by blank lines. When RS is set to the null string, the
newline character always acts as a field separator, in addition to
whatever value FS may have.


Use 'man gawk' for further information.
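
To see the "separated by blank lines" rule in action, here is a small sketch that feeds both sample texts from the first post to the same script. The `count_records` function name and the `printf` input are just stand-ins for illustration; the original input files aren't shown in the thread.

```shell
# count_records: count records on stdin with RS set to the null string
# (hypothetical helper name, wrapping the script from the first post)
count_records() {
  awk 'BEGIN { RS = "" } END { print "num_rows: " NR }'
}

# Sample 1: no blank line inside the text, so everything is one record
printf '<abc>\n<def>\n</def>\n</abc>\n' | count_records

# Sample 2: blank lines in the middle, so the text is split there
printf '<abc>\n<def>\n\n\n\n</def>\n</abc>\n' | count_records
```

The first call should report one record and the second two, matching the outputs in the question.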

GerritHiemstra
March 28th, 2012, 09:27 PM
Thank you sir!

GerritHiemstra
March 28th, 2012, 09:37 PM
Hold up, I was a bit too quick on that one...

If I use 10 consecutive blank lines in my example text, the output is still "num_rows: 2".

r-senior
March 28th, 2012, 09:49 PM
It means the separator is one or more blank lines, not that each record is separated by a single blank line. The man page is slightly ambiguous, I'd agree, but the complete manual is quite clear:

http://www.gnu.org/software/gawk/manual/html_node/Multiple-Line.html
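
A quick way to check the "one or more blank lines" behaviour yourself is sketched below. The `count_with_gap` helper is made up for this example; it puts a given number of blank lines between two lines of text and counts the records awk sees.

```shell
# count_with_gap N: emit two text lines with N blank lines between them,
# then count records with RS set to the null string (hypothetical helper)
count_with_gap() {
  {
    printf '<def>\n'
    i=0
    while [ "$i" -lt "$1" ]; do
      printf '\n'          # one blank line per iteration
      i=$((i + 1))
    done
    printf '</def>\n'
  } | awk 'BEGIN { RS = "" } END { print "num_rows: " NR }'
}

count_with_gap 1     # a single blank line
count_with_gap 10    # ten blank lines in a row
```

Both calls should give the same count, since a run of blank lines of any length acts as a single separator.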

GerritHiemstra
March 29th, 2012, 06:51 PM
Thanks!

My XML pretty-print script is working again; I used "/a" as the record separator, as I need the text to be read as one record :p
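
For anyone wanting the same "everything in one record" effect, one possible sketch is to set RS to a single character that never occurs in the input. The control character (octal 001) and the `slurp_count` name below are assumptions chosen for illustration, not the separator used above, and the trick only works if the input really never contains that character.

```shell
# slurp_count: treat all of stdin as one record by using a separator
# character (octal 001) that is assumed never to occur in the input
slurp_count() {
  awk 'BEGIN { RS = "\001" } END { print "num_rows: " NR }'
}

# The text with consecutive blank lines from the first post
printf '<abc>\n<def>\n\n\n\n</def>\n</abc>\n' | slurp_count
```

With this separator the blank lines no longer split anything, so the whole text should be counted as one record.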