
Thread: Need help with regular expression

  1. #1

    Need help with regular expression

    What regular expression would I use to extract <body> tags from a file?

    These tags aren't always plain <body>; sometimes it's < body>, < body > (note the spaces), or < body someattributes >. What regexp(s) would I use in order to snag all opening body tags, regardless of how they're formatted?

    Thanks.

    PS: I'm not sure if this matters, but I'll be using grep and sed.

  2. #2
    Join Date
    Jun 2006
    Location
    CT, USA
    Beans
    5,267
    Distro
    Ubuntu 6.10 Edgy

    Re: Need help with regular expression

    Parsing HTML by hand is, IMHO, a sucker's game in general. If you have some simple, static, well-formed HTML, as in the case of <body>, you may have luck, but in general it is better to use an HTML parser: parsers have already solved most of the tricky corner cases for you.

    I would use Python and the ElementTree parser, but then I use Python everywhere I can get away with it. YMMV.
    Last edited by pmasiar; February 4th, 2008 at 02:50 PM.

  3. #3
    Join Date
    Sep 2006
    Beans
    2,914

    Re: Need help with regular expression

    Quote Originally Posted by Interestedinthepenguin View Post
    What regular expression would I use to extract <body> tags from a file?

    These tags aren't always plain <body>; sometimes it's < body>, < body > (note the spaces), or < body someattributes >. What regexp(s) would I use in order to snag all opening body tags, regardless of how they're formatted?

    Thanks.

    PS: I'm not sure if this matters, but I'll be using grep and sed.
    How about providing a sample for interested people to play with, and describing what you expect the final result to be?

  4. #4
    Join Date
    Dec 2006
    Location
    Malaysia
    Beans
    1,570
    Distro
    Ubuntu 12.10 Quantal Quetzal

    Re: Need help with regular expression

    Try...
    Code:
    <\s*body[^>]*>

  5. #5
    Join Date
    Jun 2006
    Location
    CT, USA
    Beans
    5,267
    Distro
    Ubuntu 6.10 Edgy

    Re: Need help with regular expression

    To get a feeling for what you might have volunteered for if you want to parse HTML by hand, read about why dot-star is sometimes pronounced death-star: see "Death to Dot Star!" at PerlMonks.

  6. #6
    Join Date
    Dec 2006
    Location
    Malaysia
    Beans
    1,570
    Distro
    Ubuntu 12.10 Quantal Quetzal

    Re: Need help with regular expression

    I still stand by my regex! =O

  7. #7
    Join Date
    May 2007
    Location
    Paris, France
    Beans
    927
    Distro
    Kubuntu 7.04 Feisty Fawn

    Re: Need help with regular expression

    Quote Originally Posted by hyperair View Post
    <\s*body[^>]*>
    With that regex, <bodywhatever> is matched too (which is incorrect).

    This fixes that specific flaw:
    Code:
    /<\s*body(\s*>|\s+[^>]+>)/i
    (note the ending i so that the regex is case insensitive)


    Handling of incorrectly formed HTML with a plain > inside an attribute value (it should be the &gt; entity, but all browsers will happily parse it anyway*) is left as an exercise for the reader.

    Example: <body attr="hello > world">
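To make the difference between the two patterns concrete, here is a minimal sketch that runs both against the tag variants mentioned in this thread. It uses Python's `re` module rather than grep or sed, purely for illustration; the names `LOOSE`, `STRICT`, `SAMPLES` and `matched_text` are made up for this example.

```python
import re

# hyperair's pattern: matches any tag whose name merely starts with "body".
LOOSE = re.compile(r"<\s*body[^>]*>")

# aks44's pattern: "body" must be followed by ">" or by whitespace,
# and the match is case-insensitive (the trailing /i in the post).
STRICT = re.compile(r"<\s*body(\s*>|\s+[^>]+>)", re.IGNORECASE)

# Tag variants taken from the posts above.
SAMPLES = [
    "<body>",
    "< body>",
    "< body >",
    "< body someattributes >",
    "<BODY>",
    "<bodywhatever>",
    '<body attr="hello > world">',
]


def matched_text(pattern, text):
    """Return the substring the pattern matched, or None if it did not match."""
    match = pattern.search(text)
    return match.group(0) if match else None


if __name__ == "__main__":
    for sample in SAMPLES:
        print(repr(sample), "loose:", matched_text(LOOSE, sample),
              "strict:", matched_text(STRICT, sample))
```

On the last sample both patterns stop at the first `>`, so the match is cut off in the middle of the attribute value. That is the limitation aks44 leaves as an exercise.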


    Handling of correctly formed XHTML that (for some obscure reason) uses a namespace prefix is left, too, as an exercise for the reader.

    Example of such a case:
    Code:
    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE html 
         PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
        "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
    <foo:html xmlns:foo="http://www.w3.org/1999/xhtml" xml:lang="en" foo:lang="en">
      <foo:head>
        <foo:title>Tricky case</foo:title>
      </foo:head>
      <foo:body>
        Document contents...
      </foo:body>
    </foo:html>
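For well-formed XHTML like this, a namespace-aware XML parser sidesteps the prefix problem, because it identifies elements by namespace URI rather than by the literal prefix. The following is only a sketch using ElementTree from the Python standard library on a shortened version of the document above (the XML declaration and DOCTYPE are left out to keep it small); `XHTML_NS`, `SAMPLE` and `find_body` are illustrative names.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Shortened version of the namespaced document from the post.
SAMPLE = (
    '<foo:html xmlns:foo="http://www.w3.org/1999/xhtml">'
    "<foo:head><foo:title>Tricky case</foo:title></foo:head>"
    "<foo:body>Document contents...</foo:body>"
    "</foo:html>"
)


def find_body(xml_text):
    """Return the XHTML body element, whatever prefix the document uses."""
    root = ET.fromstring(xml_text)
    # ElementTree spells qualified names as "{namespace-uri}localname".
    return root.find("{%s}body" % XHTML_NS)


if __name__ == "__main__":
    body = find_body(SAMPLE)
    print(body.tag)
    print(body.text)
```

This only works on input that is well-formed XML; it is not a substitute for an HTML parser on sloppy markup.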

    These are definitely the kind of "tricky corner cases" that pmasiar was alluding to, and the very reason why you should use a dedicated HTML parser (not an XML parser, mind you, as my first example would then not be parsed correctly despite the fact that all browsers will accept it*).

    (*): if the browser thinks it's dealing with plain old HTML; of course in XHTML mode it will fail to validate.
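As a rough illustration of the "dedicated HTML parser" route, here is a sketch built on `html.parser.HTMLParser` from the Python standard library. It is one possible choice, not something the posters prescribed, and the names `BodyTagFinder` and `find_body_tags` are invented for this example. It collects every opening body tag together with its attributes.

```python
from html.parser import HTMLParser


class BodyTagFinder(HTMLParser):
    """Collect the raw text and attributes of each opening <body> tag."""

    def __init__(self):
        super().__init__()
        self.body_tags = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <BODY> arrives here as "body".
        if tag == "body":
            self.body_tags.append((self.get_starttag_text(), dict(attrs)))


def find_body_tags(html_text):
    """Return a list of (raw_tag_text, attributes) pairs for opening body tags."""
    finder = BodyTagFinder()
    finder.feed(html_text)
    finder.close()
    return finder.body_tags


if __name__ == "__main__":
    page = '<html><body attr="hello > world">Document contents...</body></html>'
    for raw_tag, attributes in find_body_tags(page):
        print(raw_tag, attributes)
```

Because the parser tracks quoting, the `>` inside the attribute value does not end the tag, and a tag named `bodywhatever` is reported under its own name instead of being mistaken for `body`.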



    Quote Originally Posted by hyperair View Post
    I still stand by my regex! =O
    Still standing?
    Last edited by aks44; February 4th, 2008 at 06:02 PM. Reason: typo + clarification
    Not even tinfoil can save us now...

  8. #8
    Join Date
    Oct 2006
    Location
    Austin, Texas
    Beans
    2,712
    Distro
    Ubuntu 7.10 Gutsy Gibbon

    Re: Need help with regular expression

    Quote Originally Posted by hyperair View Post
    I still stand by my regex! =O
    RE is great for some situations; it just doesn't do well with hierarchical data like XML. You can parse simple things out of XML using RE, but XML is meant to be represented in a hierarchical, tree-like structure. I would use an XML parser as pmasiar suggested; however, for real-world HTML I would probably use BeautifulSoup, since it can handle badly formed HTML (like most of the internet).

  9. #9
    Join Date
    Dec 2006
    Location
    Malaysia
    Beans
    1,570
    Distro
    Ubuntu 12.10 Quantal Quetzal

    Re: Need help with regular expression

    T_T I give up.

  10. #10
    Join Date
    May 2007
    Location
    Paris, France
    Beans
    927
    Distro
    Kubuntu 7.04 Feisty Fawn

    Re: Need help with regular expression

    While we're at it, I'll push the point a little further...


    It's quite easy to write a regex that checks if a string is a well-formed e-mail address, right?

    Here's the beast, have fun!
    http://code.iamcal.com/php/rfc822/full_regexp.txt




    EDIT: this is a PHP / PCRE regex (Perl Compatible Regular Expression), other regex engines may require a different one.
    Last edited by aks44; February 4th, 2008 at 03:50 PM.
    Not even tinfoil can save us now...
