pokerbirch
December 31st, 2008, 02:45 AM
I'm learning Python. Most of my programs download data from various websites, so I'm building a simple, reusable class that I can use to retrieve the HTML of a given URL. There are times when I need high performance, so for that reason I've chosen PyCurl.
Documentation for PyCurl is poor, and I've been reading the curl/libcurl sites to try and work out how to use the damn library. It seems fairly straightforward once you work out how to set the options:
import pycurl
USER_AGENT = 'Mozilla/5.0 (X11; U; Linux i686; en-GB; rv:1.9.0.5) Gecko/2008121622 Ubuntu/8.10 (intrepid) Firefox/3.0.5'
c = pycurl.Curl()
c.setopt(pycurl.VERBOSE, 1) # show request info
c.setopt(pycurl.COOKIEFILE, '') # enable automatic cookie handling
c.setopt(pycurl.ENCODING, 'gzip, deflate') # accept compressed responses
c.setopt(pycurl.USERAGENT, USER_AGENT)
c.setopt(pycurl.CONNECTTIMEOUT, 5) # seconds allowed for connecting
c.setopt(pycurl.TIMEOUT, 5) # seconds allowed for the whole request
c.setopt(pycurl.URL, 'http://whatsmyuseragent.com')
c.perform()
c.close()
Now, there may be mistakes in the above, which is why I'm asking for a little help. First of all, is the COOKIEFILE parameter correct? The libcurl docs say the cookie parser can be enabled by specifying a non-existent file; does an empty string qualify as a non-existent file? Secondly, the page source is printed out to the console (which shows me that it's working OK), but I actually don't want it to do that. I want the data returned as a string so that I can feed it into a parsing routine. As much as I've Googled, I just can't find out how to get the response as a string.
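From what I can gather in the libcurl docs, the WRITEFUNCTION option might be what I'm after: point it at a buffer's write method and the body should land in the buffer instead of on stdout. Here's a rough sketch of what I mean (untested, so the details may well be wrong):

import io
import pycurl

buf = io.BytesIO() # collects the response body
c = pycurl.Curl()
c.setopt(pycurl.URL, 'http://whatsmyuseragent.com')
c.setopt(pycurl.WRITEFUNCTION, buf.write) # send the body to the buffer, not stdout
c.perform()
c.close()
html = buf.getvalue() # the page source as a byte string

Is that the right approach, or is there a cleaner way?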
It's probably a one-liner solution, which will just make me feel more stupid. :P