Question for python regex: matching double backslashes '\\'

October 7th, 2012, 07:28 PM
I am writing a script to scrape urls from a website. The following is the string I'm trying to match (it is found in the page source read by urllib2.urlopen(webpage).read()):


The regex search I do in python is:


where html is the page source of the webpage I'm interested in.

But I get an error saying: unexpected end of regular expression.
If I change the regex from,




everything matches perfectly. But I don't understand why I have to match two backslashes as opposed to a single literal backslash. Shouldn't a literal backslash ('\\') match every single backslash in the page source?

Any help is appreciated.

October 7th, 2012, 09:55 PM
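Python also uses the backslash as an escape character in string literals, so two backslashes collapse to one: "\\" becomes "\" and "\\\\" becomes "\\". The regex engine then applies its own escaping on top of that, which is why matching a single literal backslash takes four of them in a normal string. To avoid needing ridiculous amounts of backslashes, you can use a raw string: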

The 'r' before the string tells Python to keep the backslashes exactly as written instead of processing escape sequences, so the regex engine sees the pattern you typed.
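Here is a rough sketch of the two levels of escaping. The sample string is made up (the actual page source from the question isn't shown in the thread), so treat it as an illustration only:

```python
import re

# Hypothetical sample text containing two single literal backslashes.
sample = r'C:\temp\file.txt'

# Normal literal: Python turns '\\\\' into the two characters \\,
# which the regex engine then reads as one escaped backslash.
plain_pattern = '\\\\'

# Raw literal: r'\\' is already the two characters the regex engine needs.
raw_pattern = r'\\'

# Both spellings produce the same two-character pattern.
same_pattern = (plain_pattern == raw_pattern)

# Find every literal backslash in the sample.
matches = re.findall(raw_pattern, sample)

# '\\' in a normal literal is ONE backslash, which the regex engine
# treats as an unfinished escape, so compiling it raises re.error.
try:
    re.compile('\\')
    lone_ok = True
except re.error:
    lone_ok = False
```

So a pattern written as '\\' in a normal string reaches the regex engine as a lone backslash, which is an incomplete escape there rather than a literal backslash.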

October 8th, 2012, 12:07 PM
I would (almost always) recommend BeautifulSoup for parsing HTML: http://www.crummy.com/software/BeautifulSoup/bs4/doc/

Something like the following would be appropriate for extracting all href attributes from the <a> tags on the page:

import urllib2
from bs4 import BeautifulSoup

url = 'http://ubuntu.com'
html = urllib2.urlopen(url).read()
# name the parser explicitly so bs4 doesn't have to guess one
soup = BeautifulSoup(html, 'html.parser')

# find all <a> tags
for a in soup.find_all('a'):

    # extract the href attribute (None if the tag has no href)
    link = a.get('href')

    # only print full URLs
    if link and link.startswith('http://'):
        print link

Hope it helps!