How do I read just the text from HTML using Python?



agim
January 8th, 2009, 06:09 AM
I would like to write a program that reads my bookmarks file and downloads the HTML of each bookmarked page so I have a copy of them.
Ideally, I would like to scrape just the title, which isn't too difficult, and the text from the HTML.

I cannot figure out a decent way to scrape the text. I have looked into Beautiful Soup; if it does this, then I am looking in the wrong place.

Whether the answer is here or somewhere else, hopefully someone here can help.

Thanks
Andy

Ahadiel
January 8th, 2009, 06:11 AM
This should probably be moved to the Programming forum.

mssever
January 8th, 2009, 06:38 AM
Beautiful Soup is the ideal tool for this job.

agim
January 8th, 2009, 06:47 AM
Great, it seemed like it. What exactly do I have to do? Each site uses its tags differently. Is there something like a 'scrapeText' function?

mssever
January 8th, 2009, 08:06 AM
The best thing is to read the documentation. I've never used Beautiful Soup myself, so I don't know the details of how to use it. Just go to its website, and you'll find documentation.
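
For reference, here is a minimal sketch of the kind of 'scrapeText' step asked about above. It does not use Beautiful Soup; it swaps in `html.parser` from the Python standard library so that it runs without installing anything. The names `TextExtractor` and `extract_text` are made up for this illustration, and the whitespace handling is deliberately crude (every text fragment is stripped and joined with a single space).

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the <title> and the visible text, skipping script/style bodies."""

    SKIP = {"script", "style"}  # tags whose contents are not visible text

    def __init__(self):
        super().__init__()
        self.parts = []        # visible text fragments, in document order
        self.title = ""
        self._skip_depth = 0   # > 0 while inside <script> or <style>
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this method.
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1
        elif tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._skip_depth:
            return
        if self._in_title:
            self.title += data
        elif data.strip():
            self.parts.append(data.strip())


def extract_text(html):
    """Return (title, text) for an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.title.strip(), " ".join(parser.parts)


if __name__ == "__main__":
    page = "<html><head><title>Hi</title></head><body><p>Hello <b>world</b></p></body></html>"
    print(extract_text(page))
```

Because it only reacts to `script`, `style`, and `title`, the sketch does not care how each site arranges the rest of its tags. Real pages with malformed markup are where a more forgiving parser such as Beautiful Soup is likely to do better.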

ghostdog74
January 8th, 2009, 08:28 AM
You can use simple string manipulation just fine.
Another way is to use a regular expression. Just a snippet:


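import re

# Print (url, title) pairs from each bookmark line of the file.
with open("file") as f:
    for line in f:
        if "<A HREF" in line:
            # Stop the URL at its closing quote, then skip the remaining attributes.
            print(re.findall(r'<A HREF="(.*?)"\s+ADD_DATE.*?>(.*?)</A>', line))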
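
The regular expression above works line by line and expects `ADD_DATE` to follow `HREF`. As a hedged alternative, the sketch below pulls the same (url, title) pairs out with the standard library's `html.parser`, which does not depend on attribute order or letter case. The class name `BookmarkParser` and the sample bookmark line are assumptions made for illustration; the sample follows the common Netscape-style bookmarks layout.

```python
from html.parser import HTMLParser


class BookmarkParser(HTMLParser):
    """Collect (url, title) pairs from the <A HREF=...> entries of a bookmarks file."""

    def __init__(self):
        super().__init__()
        self.bookmarks = []   # list of (url, title) tuples
        self._href = None     # URL of the link currently open, if any
        self._title = []      # text fragments seen inside that link

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased, so "A HREF" matches "a"/"href".
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._title = []

    def handle_data(self, data):
        if self._href is not None:
            self._title.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.bookmarks.append((self._href, "".join(self._title).strip()))
            self._href = None


if __name__ == "__main__":
    # Hypothetical sample line; a real script would feed the file contents instead.
    sample = '<DT><A HREF="http://example.com/" ADD_DATE="1231400000">Example</A>'
    parser = BookmarkParser()
    parser.feed(sample)
    parser.close()
    print(parser.bookmarks)
```

Links without an `href` attribute are ignored, so folder headings and other markup in the file are skipped.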