[SOLVED] match() in GAWK and MAWK



mouffon
February 28th, 2011, 09:30 PM
Dear All,

Running Ubuntu 10.10, I observed an apparent incompatibility between the match() functions in mawk and gawk. As match() is one of the basic features of awk, this looks strange to me. Does anyone have an idea what I am doing wrong? Many thanks for your help.

-----------

This is the AWK script "match.awk", which is meant to match acronyms (sequences of capital letters) and print them:

{
    # Print each field from the start of its first run of two or more capitals.
    for (i = 1; i <= NF; i++) {
        if (match($i, /[A-Z][A-Z]+/)) { print substr($i, RSTART) }
    }
}
-------------

This is the test data "matchtest.txt":

ABC chains of meta
chains of meta
-------------

"mawk -f match.awk matchtest.txt" produces (as desired):
ABC
-------------

"gawk -f match.awk matchtest.txt" produces:
ABC
chains
of
meta
chains
of
meta
-------------

gmargo
February 28th, 2011, 11:11 PM
Confirmed. That is really weird.

Another data point: If you change the pattern to "[[:upper:]][[:upper:]]+", then gawk works but mawk does not.

Another data point: If you print out RLENGTH, which is the length of the match, it doesn't make sense either. For instance, the word "chains" is matched, but with a length of 2, not 6 (or zero). "meta" gets a length of 3, not 4.
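
A small diagnostic wrapper is one way to look at the RSTART and RLENGTH values mentioned above. This is only a sketch: the function name show_matches is made up for illustration, and it runs whichever awk is first on the PATH, so its output depends on the awk implementation and the current locale.

```shell
# Hypothetical helper (not from the thread): print every field that
# match() accepts, together with RSTART and RLENGTH.
# Reads standard input, or the files given as arguments.
show_matches() {
    awk '{
        for (i = 1; i <= NF; i++) {
            if (match($i, /[A-Z][A-Z]+/)) {
                printf "%s RSTART=%d RLENGTH=%d\n", $i, RSTART, RLENGTH
            }
        }
    }' "$@"
}
```

Running "show_matches matchtest.txt" on an affected setup should show the odd lengths described above. A length shorter than the word means the pattern stopped at a character it did not accept, rather than matching the whole word.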

gmargo
February 28th, 2011, 11:57 PM
It's locale related. This gives me the correct result:


LC_ALL=C gawk -f match.awk matchtest.txt
-or-
LC_COLLATE=C gawk -f match.awk matchtest.txt
From the gawk user manual:
http://www.gnu.org/software/gawk/manual/gawk.html#Case_002dsensitivity


As of gawk 3.1.4, the case equivalences are fully locale-aware. They are based on the C <ctype.h> facilities, such as isalpha() and toupper().


Update: the gawk documentation actually mentions this problem: http://www.gnu.org/software/gawk/manual/gawk.html#Locales
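
If changing the locale is not an option, one possible workaround is to avoid both the A-Z range and the [[:upper:]] class and list the capitals explicitly. This is not from the thread, so treat it as an assumption. The function name print_acronyms is hypothetical. The sketch also passes RLENGTH to substr(), so it prints only the matched capitals, which differs slightly from the original script.

```shell
# Hypothetical helper (not from the thread): print runs of two or more
# ASCII capitals without relying on a range or a POSIX character class.
# Reads standard input, or the files given as arguments.
print_acronyms() {
    awk '{
        for (i = 1; i <= NF; i++) {
            if (match($i, /[ABCDEFGHIJKLMNOPQRSTUVWXYZ][ABCDEFGHIJKLMNOPQRSTUVWXYZ]+/)) {
                # RLENGTH limits the output to the matched capitals only.
                print substr($i, RSTART, RLENGTH)
            }
        }
    }' "$@"
}
```

With the test data above, "print_acronyms matchtest.txt" is expected to print only ABC. The pattern names each letter, so it does not depend on how the locale orders characters.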

mouffon
March 1st, 2011, 06:47 PM
Problem solved. Many thanks.
(And sorry I did not find the solution myself.)