Nekiruhs
September 9th, 2007, 12:34 PM
I'm in a bit of a pickle here. I've fetched the HTML code for a website in Python. The only remaining step is extracting the content that I want. The trouble is that the content is only delimited from the ads by comments, for example:
ADS
<!-- Begin Content -->
Multiple lines of content that I want to extract
<!-- End Content -->
ADS
I don't think that HTMLParser.HTMLParser can handle comments, even when it is subclassed and handle_starttag() is overridden, unless I'm mistaken. Is the only way to do this with regexes? I don't have to worry about improper formatting, as I am dealing with only one site that uses a standard layout for its content. Can anyone help?
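For reference, here is a rough sketch of the two approaches the question raises. It assumes the markers are exactly the "Begin Content" and "End Content" comments shown above, and the names extract_with_regex, ContentExtractor and extract_with_parser are made up for illustration. The parser does provide a handle_comment() hook, so subclassing is an option. The sketch is written for Python 3, where the class lives in html.parser; in Python 2 it is HTMLParser.HTMLParser.

```python
import re
from html.parser import HTMLParser  # Python 2: from HTMLParser import HTMLParser

# Approach 1: a non-greedy regex between the two comment markers.
# re.DOTALL lets "." match newlines so multi-line content is captured.
CONTENT_RE = re.compile(
    r"<!--\s*Begin Content\s*-->(.*?)<!--\s*End Content\s*-->",
    re.DOTALL,
)

def extract_with_regex(html):
    """Return each delimited block (markup included), stripped of whitespace."""
    return [block.strip() for block in CONTENT_RE.findall(html)]

# Approach 2: subclass the parser and toggle a flag in handle_comment().
# Only text nodes are collected here; tags inside the content are dropped.
class ContentExtractor(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.in_content = False
        self.chunks = []

    def handle_comment(self, data):
        marker = data.strip()
        if marker == "Begin Content":
            self.in_content = True
        elif marker == "End Content":
            self.in_content = False

    def handle_data(self, data):
        if self.in_content:
            self.chunks.append(data)

def extract_with_parser(html):
    """Return the text found between the markers as one stripped string."""
    parser = ContentExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks).strip()

page = """ADS
<!-- Begin Content -->
Multiple lines of content that I want to extract
<!-- End Content -->
ADS"""

print(extract_with_regex(page))
print(extract_with_parser(page))
```

The regex version keeps any markup inside the block, while the parser version as written returns text only, so which one fits depends on whether the tags inside the content are needed.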