Results 1 to 10 of 15

Thread: urllib2 problem

  1. #1
    Join Date
    Feb 2010
    Location
    Slovakia
    Beans
    43
    Distro
    Ubuntu 10.04 Lucid Lynx

    urllib2 problem

    Hello,

    I'm having trouble downloading the source code of a website. When I use urllib2, Python raises an error suggesting the page blocked the connection. The page is http://pokec.sk . Is it possible to download the source? Thanks.

  2. #2
    Join Date
    Oct 2007
    Beans
    1,914
    Distro
    Lubuntu 12.10 Quantal Quetzal

    Re: urllib2 problem

    Test if downloading the site using wget works. For this, in the terminal, try something like:
    Code:
    cd /tmp
    wget <yoursite>
    Do you get an error? If yes, the owner of that site doesn't want it processed by anything other than a browser, which you should respect. If not, please copy and paste the exact error message from your Python program here.

  3. #3
    Join Date
    Feb 2010
    Location
    Slovakia
    Beans
    43
    Distro
    Ubuntu 10.04 Lucid Lynx

    Re: urllib2 problem

    I tried wget and it works fine. I will post the error tomorrow because the program is on my other PC and I don't have much time now. Thank you.

  4. #4
    Join Date
    Aug 2007
    Beans
    949

    Re: urllib2 problem

    If wget works, then it's probably something wrong with your call to urllib2 rather than the website discriminating based on your user agent (which, by the way, feel free to ignore; the Internet is a public place).

  5. #5
    Join Date
    Feb 2010
    Location
    Slovakia
    Beans
    43
    Distro
    Ubuntu 10.04 Lucid Lynx

    Re: urllib2 problem

    OK, here is the error:
    Code:
    /usr/bin/python -u  "/home/remixus/Plocha/a.py"
    Traceback (most recent call last):
      File "/home/remixus/Plocha/a.py", line 4, in <module>
        s=urllib2.urlopen('http://www.azet.sk')
      File "/usr/lib/python2.6/urllib2.py", line 126, in urlopen
        return _opener.open(url, data, timeout)
      File "/usr/lib/python2.6/urllib2.py", line 391, in open
        response = self._open(req, data)
      File "/usr/lib/python2.6/urllib2.py", line 409, in _open
        '_open', req)
      File "/usr/lib/python2.6/urllib2.py", line 369, in _call_chain
        result = func(*args)
      File "/usr/lib/python2.6/urllib2.py", line 1161, in http_open
        return self.do_open(httplib.HTTPConnection, req)
      File "/usr/lib/python2.6/urllib2.py", line 1136, in do_open
        raise URLError(err)
    urllib2.URLError: <urlopen error [Errno 104] Connection reset by peer>
    and the program
    Code:
    import urllib2
    s=urllib2.urlopen('http://www.pokec.sk')
    print s
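    A traceback like the one above can be turned into a short message by catching the error around the call. The following is only a minimal sketch: fetch and its opener parameter are hypothetical names made up for illustration, and the try/except import is there so the same code runs with urllib2 on Python 2 and with urllib.request on Python 3.

```python
try:  # Python 3 module names
    from urllib.request import urlopen
    from urllib.error import URLError
except ImportError:  # Python 2 (urllib2), as used in this thread
    from urllib2 import urlopen, URLError


def fetch(url, opener=urlopen):
    """Return (body, error); exactly one of the two is None."""
    try:
        response = opener(url)
        try:
            return response.read(), None
        finally:
            response.close()
    except (URLError, IOError) as exc:
        # URLError keeps the low-level cause in .reason; other I/O errors are shown as-is
        reason = getattr(exc, "reason", exc)
        return None, "could not fetch %s: %s" % (url, reason)
```

    Calling body, error = fetch("http://www.pokec.sk") would then give either the page source or a one-line description of the failure instead of a full traceback.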

  6. #6
    Join Date
    Aug 2007
    Location
    UK
    Beans
    427
    Distro
    Ubuntu UNR

    Re: urllib2 problem

    I can confirm that the server rejects the default urllib2 user agent, but this works:
    Code:
    import urllib2
    # Send a User-Agent that differs from urllib2's default one
    r = urllib2.Request("http://www.pokec.sk", headers={"User-Agent": "Python-urlli~"})
    urllib2.urlopen(r).read()
    Replace the ~ with a b and it fails again.
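    To check which User-Agent a request will carry without contacting the server, the request object can be built and inspected first. This is a small sketch, not a definitive recipe: build_request is a hypothetical helper name, and the import fallback lets it run on Python 3 (urllib.request) as well as Python 2 (urllib2).

```python
try:  # Python 3 module name
    from urllib.request import Request
except ImportError:  # Python 2 (urllib2), as used in this thread
    from urllib2 import Request


def build_request(url, user_agent):
    """Build (but do not send) a request with a custom User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


req = build_request("http://www.pokec.sk", "Python-urlli~")
# Header names are stored capitalised, so the lookup key is "User-agent"
print(req.get_header("User-agent"))  # prints: Python-urlli~
```

    Passing req to urlopen would then send it exactly as in the snippet above.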

  7. #7
    Join Date
    Feb 2010
    Location
    Slovakia
    Beans
    43
    Distro
    Ubuntu 10.04 Lucid Lynx

    Re: urllib2 problem

    Thank you so much !!

  8. #8
    Join Date
    Feb 2010
    Location
    Slovakia
    Beans
    43
    Distro
    Ubuntu 10.04 Lucid Lynx

    Re: urllib2 problem

    I'd like to ask something else. If you open http://pokec.sk in a browser, there is a login form. How can I log in using Python? Thanks for any reply.

  9. #9
    Join Date
    Aug 2007
    Location
    UK
    Beans
    427
    Distro
    Ubuntu UNR

    Re: urllib2 problem

    I just looked through the simpler login page and found the form tag has the action:
    Code:
    https://prihlasenie.azet.sk/overenie?uri=http%3A%2F%2Fpokec.azet.sk&isWap=0
    This is where to send your request; note the https (the s is for secure). The form also has method="post", so it's a POST request, and the names of the text and password inputs are "form[username]" and "form[password]".

    This page is interesting as it explains how the form data is encoded. For a POST request the encoded form data goes in the request body rather than being tacked onto the URL, and urllib.urlencode can build it for you. Then prepare to receive a session cookie, I guess.
    http://en.wikipedia.org/wiki/POST_%28HTTP%29
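    For what it's worth, the encoding step on its own might look like the sketch below. The field names are the ones found in the form; the values are placeholders, and the import fallback is only there so it runs on both Python 2 (urllib) and Python 3 (urllib.parse).

```python
try:  # Python 3 module name
    from urllib.parse import urlencode
except ImportError:  # Python 2, as used in this thread
    from urllib import urlencode

# A list of pairs keeps the field order stable; the values are placeholders
fields = [("form[username]", "me"), ("form[password]", "unknown")]
body = urlencode(fields)
print(body)  # prints: form%5Busername%5D=me&form%5Bpassword%5D=unknown
```

    The square brackets are percent-encoded as %5B and %5D, and the resulting string is what gets sent as the body of the POST request.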

    Edit:
    Here is my best guess, as I don't have an account on this site and can't read the language well enough to set one up.
    Code:
    import urllib
    import urllib2
    import os
    
    username = "me"
    password = "unknown"
    filename = "sitedump.html"
    
    # POST the form fields to the login URL with a non-default User-Agent
    url = "https://prihlasenie.azet.sk/overenie?uri=http%3A%2F%2Fpokec.azet.sk&isWap=0"
    data = urllib.urlencode({"form[username]": username, "form[password]": password})
    r = urllib2.Request(url, data, {"User-Agent": "Python-urlli~"})
    # Save the response to a file and open it in Firefox
    with open(filename, "w") as f:
        f.write(urllib2.urlopen(r).read())
    os.system("firefox file://" + os.path.join(os.getcwd(), filename))
    This gives me the login page with the supplied username correctly filled in, which is very encouraging. Session cookies are a separate matter: the fourth parameter of urllib2.Request, "origin_req_host", only records which host originated the request, and plain urllib2.urlopen does not store cookies between calls. To keep a session you need an opener built with urllib2.HTTPCookieProcessor and a cookielib.CookieJar.
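    An opener that remembers cookies between requests could be sketched as follows. This is untested against the site itself; make_session_opener is a hypothetical helper name, and the import fallback lets it run on Python 2 (urllib2, cookielib) and Python 3 (urllib.request, http.cookiejar).

```python
try:  # Python 3 module names
    from http.cookiejar import CookieJar
    from urllib.request import build_opener, HTTPCookieProcessor
except ImportError:  # Python 2, as used in this thread
    from cookielib import CookieJar
    from urllib2 import build_opener, HTTPCookieProcessor


def make_session_opener(user_agent="Python-urlli~"):
    """Return (opener, jar); cookies the server sets are kept in jar."""
    jar = CookieJar()
    opener = build_opener(HTTPCookieProcessor(jar))
    # Headers listed here are added to every request sent through this opener
    opener.addheaders = [("User-Agent", user_agent)]
    return opener, jar


opener, jar = make_session_opener()
# opener.open(login_request) would store any cookies from the response in jar,
# and later opener.open(...) calls would send them back to the site.
```

    The idea is to call opener.open(r) in place of urllib2.urlopen(r) for the login request and for every page fetched afterwards.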
    Last edited by StephenF; July 19th, 2010 at 09:06 PM. Reason: Implementation added

  10. #10
    Join Date
    Feb 2010
    Location
    Slovakia
    Beans
    43
    Distro
    Ubuntu 10.04 Lucid Lynx

    Re: urllib2 problem

    Thanks again! One more question: how can I search Google using Python? For example, I type a query, Google returns the best-matching web pages, and the program shows me their addresses. Sorry for my bad English; I hope you understand.
