
Thread: Puzzling egrep behavior. Possibly a bug?

  1. #1
    Join Date
    Dec 2007
    Beans
    102

    Puzzling egrep behavior. Possibly a bug?

    I am trying to extract a string from the output of a DICOM directory
    parser. The string I'm trying to match looks similar to this:

    (0004,1500) CS #36 [DICOM\ST000000\SE000000\CR000000.DCM] Referenced File ID

    "s" is a much larger string in which the target string is embedded.

    From the console, this command produces no matches:
    echo $s|egrep -o " \(0004,1500\) CS #[0-9]{2} \[.{,50}\] Referenced File ID "

    However, any of these produce the expected result:
    echo $s|egrep -o " \(0004,1500\) CS #[0-9]{2} \[.{,50}\] Reference. File ID "
    echo $s|egrep -o " \(0004,1500\) CS #[0-9]{2} \[.{,50}\] Reference[d] File ID "
    echo $s|egrep -o " \(0004,1500\) CS #[0-9]{2} \[.{1,50}\] Referenced File ID "

    A typical output:
    (0004,1500) CS #36 [DICOM\ST000000\SE000000\CR000000.DCM] Referenced File ID
    (0004,1500) CS #36 [DICOM\ST000000\SE000000\CR000001.DCM] Referenced File ID
    (0004,1500) CS #36 [DICOM\ST000000\SE000001\CR000000.DCM] Referenced File ID
    (0004,1500) CS #36 [DICOM\ST000000\SE000002\CR000000.DCM] Referenced File ID
    (0004,1500) CS #36 [DICOM\ST000000\SE000003\CR000000.DCM] Referenced File ID
    (0004,1500) CS #36 [DICOM\ST000000\SE000004\CR000000.DCM] Referenced File ID
    (0004,1500) CS #36 [DICOM\ST000000\SE000005\CR000000.DCM] Referenced File ID
    (0004,1500) CS #36 [DICOM\ST000001\SE000000\CR000000.DCM] Referenced File ID
    (0004,1500) CS #36 [DICOM\ST000001\SE000001\CR000000.DCM] Referenced File ID
    (0004,1500) CS #36 [DICOM\ST000001\SE000002\CR000001.DCM] Referenced File ID
    (0004,1500) CS #36 [DICOM\ST000001\SE000002\CR000000.DCM] Referenced File ID
    (0004,1500) CS #36 [DICOM\ST000001\SE000003\CR000000.DCM] Referenced File ID


    In summary, if I use the form ".{,50}" to specify the variable part of
    the string, egrep won't recognize a match between "Referenced" and
    "Referenced", but will recognize "Reference." and "Reference[d]". If I
    use ".{1,50}", then "Referenced" matches "Referenced", as expected.

    This would seem to be a bug, unless there is something about the regular
    expression that I'm missing.

    I'm using GNU bash, version 4.1.5(1), and GNU grep, version 2.5.4
    OS: Ubuntu 10.04.4 LTS

    Any advice would be appreciated.

    manthony121
    Last edited by manthony121; March 18th, 2013 at 04:30 AM.

  2. #2
    Join Date
    Feb 2013
    Beans
    Hidden!

    Re: Puzzling egrep behavior. Possibly a bug?

    Hmm, the {,n} interval is obviously GNU egrep-specific. The POSIX standard defines only {m}, {m,n} and {m,}. Interestingly, even GNU grep treats the basic-regex form \{,n\} as an error.

    Looks like a bug in egrep. Compare the output of
    Code:
    echo x|egrep -o 'x?'
    echo x|egrep -o 'x{0,1}'
    echo x|egrep -o 'x{,1}'
    echo 'x{,1}'|egrep -o 'x{,1}'
    The third form finds no match, but the fourth matches x.

    Update.
    The above was tested with cygwin grep on Windows. Now, on Debian with grep 2.12-2 / libc6 2.13-38 all examples above match x. I guess it was fixed in glibc at some point.
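
    For anyone who wants to check their own system, here is a minimal sketch of the same comparison as a small script. It uses grep -E, which is equivalent to egrep; the probe function name is made up for this example. The first two spellings are standard and should print x, while the {,1} line depends on the grep build, so no result is assumed for it.

```shell
# Probe how the local grep handles each spelling of "zero or one x".
# probe is a hypothetical helper, not part of grep.
probe() {
    # $1 = ERE pattern; prints the matched text, or "<no match>".
    # "|| true" keeps the script going when grep finds nothing or rejects the pattern.
    result=$(printf 'x\n' | grep -E -o -e "$1" 2>/dev/null || true)
    printf '%s -> %s\n' "$1" "${result:-<no match>}"
}

probe 'x?'       # standard
probe 'x{0,1}'   # standard, explicit lower bound
probe 'x{,1}'    # the form in question; result varies by grep version
```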
    Last edited by schragge; March 15th, 2013 at 11:27 AM.

  3. #3
    Join Date
    Feb 2009
    Location
    Dallas, TX
    Beans
    7,790
    Distro
    Ubuntu 16.04 Xenial Xerus

    Re: Puzzling egrep behavior. Possibly a bug?

    Hi manthony121.

    The expression:
    Code:
    \[ \]
    is not having the effect you want, so this:
    Code:
    .{,50}
    matches anything up to 50 characters, which happens to run up to "...Reference".

    If you replace it with the following expression, it should work:
    Code:
    [[] []]
    Try this:
    Code:
    egrep  "\(0004,1500\) CS #[0-9]{2} [[].{,50}[]] Referenced File ID"  yourfile.txt
    Hope it helps. Let us know how it goes.
    Regards.

  4. #4
    Join Date
    Dec 2007
    Beans
    102

    Re: Puzzling egrep behavior. Possibly a bug?

    schragge: thank you for the reply. I didn't know that the {,m} form was non-standard. The man page lists it as acceptable.

    papibe: The expression as you wrote it works perfectly. On re-reading the man page, I realized that nowhere does it specify that you can include a literal '[' in a string by preceding it with a backslash, as I had assumed. The only place that specifies how to include a literal '[' is in the section on bracket expressions, where, as you point out, you can specify a literal '[' by placing it first within a bracket expression.

    Thank you for showing how to write the regex properly.

    However, this still looks like buggy behavior. The expression fails on the "d" in "Referenced" regardless of the number used in the ".{,50}" expression, and changing ".{,50}" to ".{1,50}" produces the behavior that would be expected if "\[" were equivalent to "[[]".

    As I play with it some more, it does seem to be a problem with the ".{,50}" idiom. Eliminating the square brackets completely does not affect the result:

    egrep -o " \(0004,1500\) CS #[0-9]{2} .{,80} Referenced File ID " doesn't match.
    egrep -o " \(0004,1500\) CS #[0-9]{2} .{1,80} Referenced File ID " does match.
    egrep -o " \(0004,1500\) CS #[0-9]{2} .{,80} Reference. File ID " also matches.

    I think the moral of the story is to avoid the "{,m}" idiom!
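
    One way to avoid the idiom without changing its meaning is to spell out the lower bound, since {0,m} is a standard interval. A minimal sketch, assuming a made-up sample string wrapped around a line in the format from the first post. The closing bracket is left unescaped because ] is an ordinary character outside a bracket expression.

```shell
# Sample input: the "junk" padding is invented; the middle is the format from post #1.
s='junk (0004,1500) CS #36 [DICOM\ST000000\SE000000\CR000000.DCM] Referenced File ID junk'

# Same pattern as in the first post, but with {0,50} in place of {,50}.
pattern=' \(0004,1500\) CS #[0-9]{2} \[.{0,50}] Referenced File ID '

# printf '%s\n' passes the backslashes in $s through unchanged.
match=$(printf '%s\n' "$s" | grep -E -o -e "$pattern")

# Square brackets around the output make the leading and trailing spaces visible.
printf '[%s]\n' "$match"
```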

    Thanks again.

  5. #5
    Join Date
    Feb 2009
    Location
    Dallas, TX
    Beans
    7,790
    Distro
    Ubuntu 16.04 Xenial Xerus

    Re: Puzzling egrep behavior. Possibly a bug?

    Quote, originally posted by manthony121:
    I think the moral of the story is to avoid the "{,m}" idiom!
    I think you are right.

    That form is no longer documented (and, I guess, no longer supported) in grep 2.10 (Ubuntu 12.04).

    Regards.

    EDIT: don't get me wrong, these are still supported:
    Code:
          {n}     The preceding item is matched exactly n times.
          {n,}    The preceding item is matched n or more times.
          {n,m}   The  preceding  item  is  matched at least n times, but not more
                  than m times.
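
    A quick way to see the three supported forms in action is to count how many of a few test lines match as a whole. This is only a sketch; count_matches is a hypothetical helper, and grep -E is used as the equivalent of egrep.

```shell
# Feed the lines a, aa, aaa, aaaa to grep and count whole-line matches.
# -x requires the pattern to match the entire line, -c prints the count.
count_matches() {
    printf '%s\n' a aa aaa aaaa | grep -E -c -x -e "$1"
}

count_matches 'a{2}'     # prints 1 (only "aa")
count_matches 'a{2,}'    # prints 3 ("aa", "aaa", "aaaa")
count_matches 'a{2,3}'   # prints 2 ("aa", "aaa")
```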
    Last edited by papibe; March 15th, 2013 at 01:50 AM. Reason: added supported expressions

  6. #6
    Join Date
    Feb 2013
    Beans
    Hidden!

    Re: Puzzling egrep behavior. Possibly a bug?

    Quote, originally posted by manthony121:
    On re-reading the man page, I realized that nowhere does it specify that you can include a literal '[' in a string by preceding it with a backslash, as I had assumed.
    Actually, your assumption is a reasonable one. At least, this is how I read the standard (specifically, 9.4.1-9.4.3). This is also how the implementations I tested work, i.e. libc6 2.13, libpcre 8.30, and grep/egrep in Solaris 8.
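
    A small check along these lines, as a sketch: both spellings of a literal [ should pull the same bracketed field out of a sample line. The line variable and the extract helper are invented for this example, and grep -E stands in for egrep.

```shell
# Shortened sample in the style of the DICOM output from the first post.
line='CS #36 [DICOM\ST000000] Referenced File ID'

# extract is a hypothetical helper: prints the part of $line matched by pattern $1.
extract() {
    printf '%s\n' "$line" | grep -E -o -e "$1"
}

extract '\[.*]'      # backslash-escaped opening bracket
extract '[[].*[]]'   # brackets written as bracket expressions
```

    Neither pattern uses an interval, so this only compares the two ways of writing the bracket.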

  7. #7
    Join Date
    Dec 2007
    Beans
    102

    Re: Puzzling egrep behavior. Possibly a bug?

    It seems that "\[" does have the desired effect of matching a literal "[". However, when used with the cursed ".{,50}" construct, it does not behave quite the same way as "[[]". So, I will use "[[]" and avoid ".{,m}" in my future regexes.
