
Thread: Question for python regex: matching double backslashes '\\'

  1. #1
    Join Date
    Jun 2008
    Beans
    79

    Question for python regex: matching double backslashes '\\'

    Hi,
    I am writing a script to scrape URLs from a website. The following is the string I'm trying to match (it appears in the page source returned by urllib2.urlopen(webpage).read()):
    Code:
    "stream_h264_url":"http:\\/\\/www.dailymotion.com\\/cdn\\/H264-512x384\\/video\\/xu41s3.mp4?auth=1349806412-337e4c35a8590a1dabc2761376070386"
    The regex search I do in python is:
    Code:
    re.search('"stream_h264_url":"http:[-\\/a-zA-Z0-9?=.]+"',html)
    where html is the page source of the webpage I'm interested in.

    But I get an error saying: unexpected end of regular expression.
    If I change the regex from,

    '"stream_h264_url":"http:[-\\/a-zA-Z0-9?=.]+"'

    to

    '"stream_h264_url":"http:[-\\\\/a-zA-Z0-9?=.]+"'

    everything matches perfectly. But I don't understand why I have to match two backslashes as opposed to a single literal backslash. Shouldn't a literal backslash ('\\') match every single backslash in the page source?

    Any help is appreciated.

  2. #2
    Join Date
    Jun 2009
    Location
    Land of Paranoia and Guns
    Beans
    194
    Distro
    Ubuntu 12.10 Quantal Quetzal

    Re: Question for python regex: matching double backslashes '\\'

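    Python string literals also use the backslash as an escape character, and that expansion happens before the regex engine ever sees the pattern. In a normal string, "\\" becomes "\" and "\\\\" becomes "\\". The regex engine then applies its own escaping on top, so it needs two backslashes in the pattern it receives to match one literal backslash. To avoid needing ridiculous amounts of backslashes, you can use a raw string: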
    Code:
    r'"stream_h264_url":"http:[-\\/a-zA-Z0-9?=.]+"'
    The 'r' prefix makes Python keep the backslashes exactly as written instead of processing escape sequences, so the regex engine receives \\ and matches one literal backslash.
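
    To see the two layers of escaping separately, here is a small sketch using only the re module. The html value is the sample line from the first post; the variable names (pattern_plain, pattern_raw, pattern_short) are just illustrative and not from the original script.

```python
import re

# Sample line from the first post. The raw string keeps every backslash,
# so this text really contains two backslashes before each slash.
html = r'"stream_h264_url":"http:\\/\\/www.dailymotion.com\\/cdn\\/H264-512x384\\/video\\/xu41s3.mp4?auth=1349806412-337e4c35a8590a1dabc2761376070386"'

# Layer 1: the Python string literal.
one_backslash = '\\'       # normal literal: two typed, one stored
two_backslashes = '\\\\'   # normal literal: four typed, two stored
raw_two = r'\\'            # raw literal: two typed, two stored

# Layer 2: the regex engine needs to receive \\ to match one backslash.
pattern_plain = '"stream_h264_url":"http:[-\\\\/a-zA-Z0-9?=.]+"'
pattern_raw = r'"stream_h264_url":"http:[-\\/a-zA-Z0-9?=.]+"'

# Two typed backslashes in a normal literal: the engine only receives \/
pattern_short = '"stream_h264_url":"http:[-\\/a-zA-Z0-9?=.]+"'

match = re.search(pattern_raw, html)
short_match = re.search(pattern_short, html)

print(len(one_backslash), len(two_backslashes), len(raw_two))
print(pattern_plain == pattern_raw)
print(match is not None, short_match is not None)
```

    pattern_plain and pattern_raw are the same string once Python has read them, which is why both match. With pattern_short the regex engine receives [-\/...], where \/ is only an escaped slash, so the character class contains no backslash at all and the search finds nothing in this sample.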

  3. #3
    Join Date
    Feb 2008
    Beans
    251
    Distro
    Ubuntu 12.04 Precise Pangolin

    Re: Question for python regex: matching double backslashes '\\'

    I would (almost always) recommend BeautifulSoup for parsing HTML: http://www.crummy.com/software/BeautifulSoup/bs4/doc/

    Something like the following would be appropriate for extracting all href attributes from the <a> tags on the page:

    Code:
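    #!/usr/bin/python
    # Python 2 script (urllib2); requires the bs4 package
    import urllib2
    from bs4 import BeautifulSoup
    
    url  = 'http://ubuntu.com'
    html = urllib2.urlopen(url).read()
    # name the parser explicitly so bs4 does not have to guess one
    soup = BeautifulSoup(html, 'html.parser')
    
    # find all <a> tags
    for a in soup.find_all('a'):
    
      # extract the href attribute (None if the tag has no href)
      link = a.get('href')
    
      # only print full URLs; the None check avoids a TypeError on tags without href
      if link and link.startswith("http://"):
        print(link)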
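
    If bs4 is not installed, roughly the same filtering can be sketched with the standard library's html.parser module (Python 3). This is a swapped-in technique, not BeautifulSoup, and the LinkCollector class and the sample markup are made up for illustration:

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects href values of <a> tags that are full http:// URLs."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # attrs is a list of (name, value) pairs; href is None if absent
        href = dict(attrs).get("href")
        if href and href.startswith("http://"):
            self.links.append(href)


# Hypothetical markup: one full URL, one anchor without href, one relative link
sample = ('<a href="http://ubuntu.com/download">Get</a>'
          '<a name="top">Top</a>'
          '<a href="/relative">Rel</a>')

collector = LinkCollector()
collector.feed(sample)
print(collector.links)
```

    Only the first link is kept. The anchor without an href and the relative link are skipped, which is the same case the None check guards against.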

    Hope it helps!

    g
