Originally Posted by hyperair
<\s*body[^>]*>
With that regex, <bodywhatever> is matched too (which is incorrect).
This fixes that specific flaw:
Code:
/<\s*body(\s*>|\s+[^>]+>)/i
(note the ending i so that the regex is case insensitive)
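To see the difference between the two patterns, here is a quick sketch in Python (my choice of language, the thread does not name one). The trailing i flag becomes re.IGNORECASE, and the matches() helper and the sample strings are made up for illustration.

```python
import re

# The two patterns quoted above; the /i flag is expressed as re.IGNORECASE.
ORIGINAL = re.compile(r"<\s*body[^>]*>", re.IGNORECASE)
FIXED = re.compile(r"<\s*body(\s*>|\s+[^>]+>)", re.IGNORECASE)


def matches(pattern, text):
    """Return True if the pattern matches anywhere in text (hypothetical helper)."""
    return pattern.search(text) is not None


# Illustrative samples: three real body tags and one look-alike.
samples = ["<body>", "<BODY class='x'>", "<body >", "<bodywhatever>"]
for sample in samples:
    print(sample, matches(ORIGINAL, sample), matches(FIXED, sample))
```

Both patterns accept the three real body tags; only the original one also accepts <bodywhatever>.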
Handling of incorrectly formed HTML with a plain > inside an attribute value (it should be the &gt; entity, but all browsers will happily parse it anyway*) is left as an exercise for the reader.
Example: <body attr="hello > world">
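One possible way to attempt that exercise, sketched in Python: treat each quoted attribute value as a single unit, so a > inside quotes no longer ends the tag. This is my own extension of the pattern, not something from the thread. BODY_TAG is a hypothetical name, and the sketch assumes the quotes are balanced.

```python
import re

# Hypothetical extension of the pattern above: quoted values are consumed
# whole, so a '>' inside quotes does not terminate the tag.
BODY_TAG = re.compile(
    r"""<\s*body            # tag name, whitespace after the bracket allowed as before
        (?:\s+              # attributes must be separated from the name by whitespace
           (?:"[^"]*"       # double-quoted value, may contain a closing bracket
             |'[^']*'       # single-quoted value, may contain a closing bracket
             |[^>"']        # any other character except a closing bracket or a quote
           )*
        )?
        \s*>                # end of the opening tag
    """,
    re.IGNORECASE | re.VERBOSE,
)

match = BODY_TAG.search('<html><body attr="hello > world">text</body></html>')
print(match.group(0) if match else None)
```

It still rejects <bodywhatever>, but an unbalanced quote or any other sloppy markup will defeat it, so it only moves the problem one corner case further.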
Handling of correctly formed XHTML that (for some obscure reason) uses a namespace prefix is also left as an exercise for the reader.
Example of such a case:
Code:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<foo:html xmlns:foo="http://www.w3.org/1999/xhtml" xml:lang="en" foo:lang="en">
<foo:head>
<foo:title>Tricky case</foo:title>
</foo:head>
<foo:body>
Document contents...
</foo:body>
</foo:html>
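For this well-formed XHTML case only, a namespace-aware XML parser resolves the prefix for you. Below is a minimal sketch using Python's standard xml.etree.ElementTree, assuming the document is held in memory as bytes. DOC and XHTML_NS are names I made up, and this approach will not cope with sloppy real-world HTML.

```python
import xml.etree.ElementTree as ET

# Namespace URI taken from the xmlns:foo declaration in the example document.
XHTML_NS = "http://www.w3.org/1999/xhtml"

# The example document from above, as bytes so the encoding declaration is honoured.
DOC = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<foo:html xmlns:foo="http://www.w3.org/1999/xhtml" xml:lang="en" foo:lang="en">
<foo:head>
<foo:title>Tricky case</foo:title>
</foo:head>
<foo:body>
Document contents...
</foo:body>
</foo:html>
"""

root = ET.fromstring(DOC)
# ElementTree rewrites prefixed names to "{namespace-uri}localname",
# so the lookup does not depend on the prefix being "foo".
body = root.find(f"{{{XHTML_NS}}}body")
print(body.tag)
print(body.text.strip())
```

The lookup goes through the namespace URI, so it keeps working whatever prefix the author picked.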
These are exactly the kind of "tricky corner cases" that pmasiar was alluding to, and the very reason why you should use a dedicated HTML parser (not an XML parser, mind you, as my first example would then not be parsed correctly despite the fact that all browsers will accept it*).
(*): if the browser thinks it's dealing with plain old HTML; of course, in XHTML mode it will fail to validate.
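As a rough illustration of the parser route, here is a sketch using html.parser from Python's standard library. BodyFinder and find_bodies are hypothetical names of mine, and any other lenient HTML parser would serve the same purpose.

```python
from html.parser import HTMLParser


class BodyFinder(HTMLParser):
    """Collects the attributes of every <body> start tag seen (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.bodies = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag names before calling this hook,
        # so the comparison is effectively case insensitive.
        if tag == "body":
            self.bodies.append(dict(attrs))


def find_bodies(markup):
    """Return one attribute dict per <body> start tag found in markup."""
    finder = BodyFinder()
    finder.feed(markup)
    finder.close()
    return finder.bodies


print(find_bodies('<html><BODY attr="hello > world">text</body></html>'))
print(find_bodies("<bodywhatever>"))
```

The quoted > comes through as part of the attribute value, and <bodywhatever> is reported as a different tag rather than as a body. As far as I know html.parser does not resolve namespace prefixes, so the foo:body case would still need separate handling.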
Originally Posted by hyperair
I still stand by my regex! =O
Still standing?