[python] Get comments/data between comments from HTML


Nekiruhs
September 9th, 2007, 12:34 PM
I'm in a bit of a pickle here. I've gotten the HTML code for a website in Python. My only remaining step is extracting the content that I want. The trouble is, the content is only delimited from the ads by comments, for example:
ADS
<!-- Begin Content -->
Multiple lines of content that I want to extract
<!-- End Content -->
ADS
I don't think HTMLParser.HTMLParser can handle comments, even when it is subclassed and handle_starttag() is overridden, unless I'm mistaken. Is a regex the only way to do this? I don't have to worry about improper formatting, as I'm dealing with only one site that uses a standard layout for its content. Any help for me?

pmasiar
September 9th, 2007, 01:27 PM
# python code: keep it simple!
trash, keep = fultext.split('<!-- Begin Content -->')
keep, trash = keep.split('<!-- End Content -->')

Nekiruhs
September 9th, 2007, 01:29 PM
# python code
trash, keep = fultext.split('<!-- Begin Content -->')
keep, trash = keep.split('<!-- End Content -->')

/jawdrop
... I knew about the split function but wow, that didn't even occur to me.
:guitar:

EDIT: It raises the error "ValueError: need more than 1 value to unpack"

pmasiar
September 9th, 2007, 01:30 PM
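Yup, the tricky part is that you can assign to multiple variables (a tuple) in one statement. Neat!

You also need to be careful that you have **exactly one instance** of the split substring. With zero instances split() returns a one-element list, and with two or more it returns three or more elements; either way the two-name unpacking raises ValueError.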
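
If you can't guarantee exactly one instance, a more forgiving variant is possible. This is only a sketch: extract_between() and the sample page string are made up for illustration, and it relies on str.partition(), which splits on the first occurrence only and never raises when the separator is missing.

```python
def extract_between(text, begin, end):
    """Return the text between the first `begin` and the next `end`, or None."""
    # partition() always returns a 3-tuple; the middle item is empty
    # when the separator was not found.
    _, found_begin, rest = text.partition(begin)
    if not found_begin:
        return None
    content, found_end, _ = rest.partition(end)
    if not found_end:
        return None
    return content


# Hypothetical sample input shaped like the page described in the question.
page = "ADS\n<!-- Begin Content -->\nwanted\n<!-- End Content -->\nADS"
result = extract_between(page, "<!-- Begin Content -->", "<!-- End Content -->")
print(repr(result))
```

Returning None instead of raising leaves it to the caller to decide what a missing marker means; raising your own exception there would work just as well.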

ansi
September 9th, 2007, 02:52 PM
Hello,


I don't think that the HTMLParser.HTMLParser, even when specially subclassed and the handle_starttag() function is overridden can handle comments.
...
Any help for me?


According to the output of "pydoc HTMLParser", class HTMLParser has a special method for comments handling:
| handle_comment(self, data)
| # Overridable -- handle comment
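
A rough sketch of how handle_comment() could be used for this, under a few assumptions: it is written for Python 3, where the class lives in html.parser (in Python 2 the import would be "from HTMLParser import HTMLParser"), the class name and the sample page are made up, and it keeps only the text between the two marker comments, so any tags inside the content are dropped.

```python
from html.parser import HTMLParser


class ContentBetweenComments(HTMLParser):
    """Collect the text that appears between two marker comments."""

    def __init__(self, begin_marker="Begin Content", end_marker="End Content"):
        super().__init__()
        self.begin_marker = begin_marker
        self.end_marker = end_marker
        self.in_content = False
        self.chunks = []

    def handle_comment(self, data):
        # `data` is the comment body without the <!-- and --> delimiters.
        marker = data.strip()
        if marker == self.begin_marker:
            self.in_content = True
        elif marker == self.end_marker:
            self.in_content = False

    def handle_data(self, data):
        # Only keep text seen while we are between the two markers.
        if self.in_content:
            self.chunks.append(data)


# Hypothetical sample input shaped like the page described in the question.
page = "ADS\n<!-- Begin Content -->\n<p>wanted text</p>\n<!-- End Content -->\nADS"
parser = ContentBetweenComments()
parser.feed(page)
parser.close()
content = "".join(parser.chunks)
print(content.strip())
```

If the raw HTML between the markers is needed rather than just its text, the plain string split is probably the simpler route.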


Regards,