[SOLVED] Web scraping



pokerbirch
December 29th, 2008, 07:03 PM
Before I discovered Ubuntu about 6 months ago, Windows XP was my sole OS. Now it's the other way round. :)

I'm a hobby programmer and many of my code concoctions involve scraping data off various websites. Windows has the WinHTTP library which is the backbone of Internet Explorer, so it was an ideal starting block for some web scraping. In Python, I'm kinda struggling to find an equivalent. My requirements are:

* automatic cookie handling (essential for login sessions)
* ok with http AND https
* supports both GET and POST requests
* browser-like request headers (particularly 'User-Agent')
* automatic gzip handling

I've played around with urllib and urllib2 but I'm not sure that they can give me what I need. At the end of the day, all I require is the HTML, but of course it's not always that simple.
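
For what it's worth, the five requirements listed above can probably be covered by the standard library alone. What follows is a minimal sketch in Python 3 (where urllib2 became urllib.request), not a tested scraper. The USER_AGENT string and the helper names make_opener, decode_body and fetch are made up for illustration.

```python
import gzip
import http.cookiejar
import urllib.parse
import urllib.request

# Hypothetical browser-like User-Agent; substitute whatever the target site expects.
USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64) ExampleScraper/0.1"


def make_opener():
    """Build an opener that stores cookies and sends browser-like headers."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    # These headers are sent with every request made through this opener.
    opener.addheaders = [("User-Agent", USER_AGENT), ("Accept-Encoding", "gzip")]
    return opener, jar


def decode_body(raw, content_encoding):
    """Undo gzip compression if the server reported it in Content-Encoding."""
    if content_encoding and content_encoding.lower() == "gzip":
        return gzip.decompress(raw)
    return raw


def fetch(opener, url, form=None):
    """GET the URL, or POST to it when a dict of form fields is given."""
    data = urllib.parse.urlencode(form).encode("ascii") if form else None
    with opener.open(url, data=data, timeout=30) as resp:
        return decode_body(resp.read(), resp.headers.get("Content-Encoding"))
```

Reusing the same opener for the login POST and the later GETs is what keeps the session cookie alive. Both http and https URLs go through the same fetch call, provided Python was built with SSL support.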

While on the subject, what is the best way to parse the data? The HTML often contains lots of JavaScript, from which much of my data is collected. In the past I handled it with string functions and delimiters. I'm not convinced that Beautiful Soup will be of use to me (due to the JavaScript?), but perhaps some kind of tokenizer lib would make parsing less cumbersome?
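
One possible alternative to hand-rolled string splitting, sketched here with the Python 3 standard library only: collect the text of every script element with html.parser, then let the json module decode a data literal assigned to a variable. It assumes the literal happens to be valid JSON (double-quoted keys, no trailing commas), which is often but not always true of real pages. The names ScriptCollector and extract_json_var are invented for this example.

```python
import json
import re
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collect the raw text found inside each <script> element."""

    def __init__(self):
        super().__init__()
        self._in_script = False
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Script text may arrive in several chunks, so append to the current entry.
        if self._in_script:
            self.scripts[-1] += data


def extract_json_var(html, name):
    """Return the decoded value of `var <name> = <JSON literal>`, or None."""
    collector = ScriptCollector()
    collector.feed(html)
    collector.close()
    assignment = re.compile(r"\bvar\s+" + re.escape(name) + r"\s*=\s*")
    decoder = json.JSONDecoder()
    for js in collector.scripts:
        match = assignment.search(js)
        if not match:
            continue
        try:
            # raw_decode parses one JSON value and ignores whatever follows it.
            value, _end = decoder.raw_decode(js, match.end())
        except ValueError:
            continue  # the literal was not valid JSON; try the next script
        return value
    return None
```

Using raw_decode rather than a regex for the value means nested braces and strings containing "};" do not need special handling.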

Suggestions please?

slavik
December 29th, 2008, 07:15 PM
libcurl is what you are looking for I believe.

pokerbirch
December 29th, 2008, 07:44 PM
Indeed, PyCurl looks like it might do the job nicely.

Any suggestions for the token style parsing?

Also, I notice you're a Perl lover, and my long-debated decision was a 50/50 choice between Python and Perl. Python appears to be more popular, yet for lots and lots of reasons that would throw this thread off topic, Perl seems to be much more 'complete' than Python is.

So the big question is: how would you achieve the above in Perl?

pmasiar
December 29th, 2008, 08:39 PM
pokerbirch> Perl seems to be much more 'complete' than Python is.

Not sure what you mean. Perl has regex parsing as part of the language, Python uses external library - that's about all the difference. With Python, you have cleaner, easier to read code (I did enough Perl to know and hate its quirks) and I am not aware of any huge gap in library coverage. And Python's popularity is growing, while Perl's is going down (according to TIOBE), so IMNSHO your time is much better and more effectively spent on Python.

pokerbirch
December 29th, 2008, 09:08 PM
[RE: Perl/Python]
Perhaps it's just the specific areas that I've been searching for, but whenever I Google for help, I always seem to find significantly more solutions in Perl than I do for Python. Although I've not actually programmed anything in Perl, I've seen enough code snippets to realise that the Python syntax is nicer. With regards to Perl being 'more complete', what I meant was that I keep finding oddities and/or incomplete code within the Python libraries I'm playing with. Documentation of 3rd party libraries can be quite poor in Python...which I was comparing to Perl's CPAN.

My longer term goal is to be able to build cross-platform, stand-alone apps from my code and Python does seem to have projects working towards that. I know I have Java as an option, but I'm a bit old fashioned and really hate the bloat. I know that cpu time, hard drives and ram are no longer premium space, but does that REALLY give us a good reason to be more sloppy?


[Back on topic]
I'm about to Google for PyCurl examples. If anyone knows of some good ones, I'd appreciate seeing them here....
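
A rough idea of what such an example might look like, as a sketch only: it assumes the third-party pycurl package is installed, and the function names make_handle, curl_fetch and encode_form as well as the User-Agent string are hypothetical. pycurl is imported inside the functions so the module itself loads even where pycurl is missing.

```python
import io
import urllib.parse


def encode_form(form):
    """URL-encode a dict of form fields for a POST body."""
    return urllib.parse.urlencode(form)


def make_handle(user_agent="Mozilla/5.0 (X11; Linux x86_64) ExampleScraper/0.1"):
    """Create a reusable curl handle with cookies and gzip switched on."""
    import pycurl  # third-party; assumed to be installed

    handle = pycurl.Curl()
    handle.setopt(pycurl.USERAGENT, user_agent)
    handle.setopt(pycurl.COOKIEFILE, "")    # empty name: in-memory cookie engine
    handle.setopt(pycurl.ENCODING, "gzip")  # request gzip, decompress automatically
    handle.setopt(pycurl.FOLLOWLOCATION, 1)
    return handle


def curl_fetch(handle, url, form=None):
    """GET the URL, or POST the form fields when a dict is given."""
    import pycurl

    body = io.BytesIO()
    handle.setopt(pycurl.URL, url)
    handle.setopt(pycurl.WRITEFUNCTION, body.write)
    if form:
        handle.setopt(pycurl.POSTFIELDS, encode_form(form))
    else:
        handle.setopt(pycurl.HTTPGET, 1)  # switch back to GET after an earlier POST
    handle.perform()
    return body.getvalue()
```

Keeping one handle for the whole session lets libcurl carry the login cookies from request to request; call handle.close() when finished.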

slavik
December 29th, 2008, 09:38 PM
Perl has Curl, too, afaik ...

pp.
December 29th, 2008, 09:50 PM
I've done some web scraping, but I was more interested in the text than in the contents of the scripts. However, I thought Prolog was a very suitable tool for that purpose. The implementation I used already carries an HTML parser, and parsing scripts and things in Prolog is a straightforward job.

The only caveat is that there is an unexpected number of sites which not only send poor HTML but not even well formed HTML at all.

Java's API documentation @ sun comes to mind. HTML obviously generated by programs, still malformed.

pmasiar
December 29th, 2008, 11:07 PM
pokerbirch> whenever I Google for help, I always seem to find significantly more solutions in Perl than I do for Python.

Could be, Perl was (and might be even now, in LOC) more widely used than Python. Also, because Perl programmers notoriously cannot read each other's code (and sometimes even their own :-) ) there are more Perl solutions - because it is often easier to start from scratch than to add to existing code. It might not be a feature but a bug :-)

pokerbirch> Documentation of 3rd party libraries can be quite poor in Python...which I was comparing to Perl's CPAN.

CPAN is huge but the quality of its modules' docs is also variable. Core Python libraries (included by default) are quite decent. The rest depends - but with a growing community, quality will improve, because sharing Python code is much simpler than sharing Perl, due to standardized whitespace.

pokerbirch> I know that cpu time, hard drives and ram are no longer premium space, but does that REALLY give us a good reason to be more sloppy?

It's not about being sloppy, but about burning CPU cycles of your 97% idling CPU on something, while improving your productivity. Read "The Hundred-Year Language" - http://www.paulgraham.com/hundred.html

pp.
December 29th, 2008, 11:14 PM
pmasiar> it is often easier to start from scratch than to add to existing code. It might not be a feature but a bug

But then, to insiders the language might appear so clear and concise that it's cheaper and faster to formulate your own solution anew than to search for existing ones.

It's that way for me with natural languages. Unless I want to say something very commonplace like 'good afternoon' I build my sentences all by myself instead of perusing reams and reams of phrase books. With mixed results, as can be seen.

ghostdog74
December 30th, 2008, 01:01 AM
pokerbirch> Suggestions please?

You can try PHP with its curl library.

pokerbirch
December 30th, 2008, 01:53 AM
Thank you all, libcurl is indeed the canine's testicles. :D

Wybiral
December 30th, 2008, 02:03 AM
Why do you think handling the JavaScript will be an issue with BeautifulSoup? It will put it in the text of the appropriate script elements if you need to parse it (it's designed to handle improper, real-world HTML; JS isn't going to break it).
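
To illustrate the point, here is a small sketch assuming the third-party bs4 package (the current Beautiful Soup release, newer than what was available when this thread was written). The helper names script_sources and string_assignments are hypothetical, and the regex only covers the simplest `var name = "value";` form.

```python
import re


def script_sources(html):
    """Return the inline JavaScript of every <script> element as strings."""
    from bs4 import BeautifulSoup  # third-party; assumed to be installed

    soup = BeautifulSoup(html, "html.parser")
    # tag.string is None for scripts that only reference an external src.
    return [str(tag.string) for tag in soup.find_all("script") if tag.string]


def string_assignments(js):
    """Return {name: value} for simple `var name = "value";` statements."""
    pattern = re.compile(r'\bvar\s+(\w+)\s*=\s*"([^"\\]*)"\s*;')
    return dict(pattern.findall(js))
```

Beautiful Soup finds the script elements; pulling values out of the JavaScript itself still needs a regex or a proper tokenizer.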