Need help with regular expression

**Interestedinthepenguin** · February 4th, 2008

What regular expression would I use to extract <body> tags from a file?

These tags aren't always plain <body>; sometimes it's < body>, < body > (note the spaces), or < body someattributes >. What regexp(s) would I use in order to snag all opening body tags, regardless of how they're formatted?

Thanks.

PS: I'm not sure if this matters, but I'll be using grep and sed.

**pmasiar** · February 4th, 2008

Parsing HTML by hand is, IMHO, a sucker's game in general. If you have some simple, static, well-formed HTML, as in the case of <body>, you may have luck, but in general it is better to use an HTML parser; they have solved most of the tricky corner cases for you.

I would use Python and the ElementTree parser, but then I use Python everywhere I can get away with it.

YMMV

**ghostdog74** · February 4th, 2008

Originally Posted by Interestedinthepenguin

What regular expression would I use to extract <body> tags from a file?

These tags aren't always plain <body>; sometimes it's < body>, < body > (note the spaces), or < body someattributes >. What regexp(s) would I use in order to snag all opening body tags, regardless of how they're formatted?

Thanks.

PS: I'm not sure if this matters, but I'll be using grep and sed.

How about providing a sample for interested people to play with, and describing what you expect the final result to be?

**hyperair** · February 4th, 2008

Try...

Code:

<\s*body[^>]*>

**pmasiar** · February 4th, 2008

To get a feeling for what you might have volunteered for if you want to parse HTML by hand, read about why dot-star is sometimes pronounced death-star: see "Death to Dot Star!" at PerlMonks.

**hyperair** · February 4th, 2008

I still stand by my regex! =O

**aks44** · February 4th, 2008

Originally Posted by hyperair

<\s*body[^>]*>

With that regex, <bodywhatever> is matched too (which is incorrect).

This fixes that specific flaw:

Code:

/<\s*body(\s*>|\s+[^>]+>)/i

(note the ending i so that the regex is case insensitive)
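For anyone who wants to try the two patterns side by side, here is a minimal sketch in Python (my choice of language; the thread itself targets grep and sed). The sample tag list and the names `loose` and `strict` are just illustrative assumptions. Python has no `/.../i` syntax, so the case-insensitive flag is passed as `re.IGNORECASE` instead.

```python
import re

# hyperair's pattern, as posted (no case-insensitive flag)
loose = re.compile(r"<\s*body[^>]*>")

# aks44's pattern: "body" must be followed either by ">" (optionally after
# whitespace) or by whitespace and then attributes; the trailing /i becomes
# re.IGNORECASE
strict = re.compile(r"<\s*body(\s*>|\s+[^>]+>)", re.IGNORECASE)

# Hypothetical test inputs based on the variants named in the question
samples = [
    "<body>",
    "< body>",
    "< body >",
    '< body class="x" >',
    "<BODY>",
    "<bodywhatever>",
]

for tag in samples:
    print(f"{tag!r}: loose={bool(loose.search(tag))} strict={bool(strict.search(tag))}")
```

The last sample is the interesting one: the loose pattern accepts `<bodywhatever>`, while the strict one rejects it.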

Handling of incorrectly formed HTML with a plain > inside an attribute value (it should be the &gt; entity, but all browsers will happily parse it anyway*) is left as an exercise for the reader.

Example: <body attr="hello > world">
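To see what goes wrong here, the sketch below (Python again, as an assumption about tooling) runs the pattern above against that example. It also tries one possible quote-aware variant; `quote_aware` is a hypothetical pattern of my own, not something posted in this thread, and it still ignores comments, scripts, and other corner cases.

```python
import re

strict = re.compile(r"<\s*body(\s*>|\s+[^>]+>)", re.IGNORECASE)

tag = '<body attr="hello > world">'

# [^>]+ stops at the first ">", which here sits inside the attribute value,
# so the match is cut short.
match = strict.search(tag)
print(match.group(0))

# Hypothetical variant: inside the tag, accept either a non-quote,
# non-">" character or a complete quoted string (which may contain ">").
quote_aware = re.compile(
    r"""<\s*body(\s+(?:[^>"']|"[^"]*"|'[^']*')*)?\s*>""",
    re.IGNORECASE,
)
print(quote_aware.search(tag).group(0))
```

The first pattern stops inside the attribute value; the second consumes the quoted string as a unit and so reaches the real end of the tag.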

Handling of correctly formed XHTML that (for some obscure reason) uses a namespace prefix is left, too, as an exercise for the reader.

Example of such a case:

Code:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html 
     PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<foo:html xmlns:foo="http://www.w3.org/1999/xhtml" xml:lang="en" foo:lang="en">
  <foo:head>
    <foo:title>Tricky case</foo:title>
  </foo:head>
  <foo:body>
    Document contents...
  </foo:body>
</foo:html>

These are definitely the kind of "tricky corner cases" that pmasiar was evoking, and the very reason why you should use a dedicated HTML parser (not an XML parser, mind you, as my first example would then not be parsed correctly despite the fact that all browsers will accept it*).

(*): if the browser thinks it's dealing with plain old HTML; of course in XHTML mode it will fail to validate.
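As a rough illustration of the "use a dedicated HTML parser" advice, here is a minimal sketch using `html.parser` from the Python standard library. The class name `BodyTagFinder` and the sample document are assumptions made up for this example; other parsers would work along similar lines.

```python
from html.parser import HTMLParser


class BodyTagFinder(HTMLParser):
    """Collects every opening <body> tag the parser encounters."""

    def __init__(self):
        super().__init__()
        self.body_tags = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag names, so "BODY" also arrives as "body".
        if tag == "body":
            # Store the raw tag text together with its parsed attributes.
            self.body_tags.append((self.get_starttag_text(), dict(attrs)))


# Hypothetical input containing the ">"-inside-an-attribute case
document = (
    "<html><head><title>t</title></head>"
    '<body attr="hello > world"><p>hi</p></body></html>'
)

finder = BodyTagFinder()
finder.feed(document)
finder.close()

for raw, attrs in finder.body_tags:
    print(raw, attrs)
```

The quoted `>` is handled by the parser's own tokenizer, so no special-casing is needed. One caveat: `html.parser` only recognizes a tag when a letter immediately follows `<`, so the `< body>` spellings from the original question would still need separate handling.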

Originally Posted by hyperair

I still stand by my regex! =O

Still standing?

**Wybiral** · February 4th, 2008

Originally Posted by hyperair

I still stand by my regex! =O

RE is great for some situations; it just doesn't do well with hierarchical data like XML. You can parse simple things out of XML using RE, but XML is meant to be represented in a hierarchical / tree-like structure. I would use an XML parser as pmasiar suggested; however, for true HTML I would probably use BeautifulSoup, since it can handle badly formed HTML (like most of the internet).

**hyperair** · February 4th, 2008

T_T I give up.

**aks44** · February 4th, 2008

While we're at it, I'll push the point a little farther...

It's quite easy to write a regex that checks if a string is a well-formed e-mail address, right?

Here's the beast, have fun!
http://code.iamcal.com/php/rfc822/full_regexp.txt

EDIT: this is a PHP / PCRE regex (Perl Compatible Regular Expression), other regex engines may require a different one.
