Regex help (python)
matmatmat
June 11th, 2009, 11:51 AM
Can anyone give me a regex to get the URL of a webpage from HTML like this:
<a href="test.html">click</a>
<a href='test.html'>click</a>
It needs to work with the Python re module. I'm not sure whether I should check for a match and then split the line, or do something else.
leblancmeneses
June 11th, 2009, 02:20 PM
Regex.Match(input, @"href\s*=\s*""(?<urlOfInterest>[^""]+)""", RegexOptions.IgnoreCase | RegexOptions.Multiline)
What I've done here is ignore the tags and look only for the attribute of interest. (This is C#; inside an @"..." verbatim string a double quote is written as "".)
\s* means zero or more whitespace characters. \s matches newlines whether or not Multiline is set; that option only changes how ^ and $ behave.
[^"]+ is a negated character class: it matches one or more characters that are not a double quote.
For perfect matching I would actually write a grammar rule using my library:
npeg.codeplex.com
Another way people do this is to build a SAX parser and apply the regex only to the attributes section.
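Since the question asked for the Python re module, here is a rough Python sketch of the same idea, extended to accept either quote style. The names HREF_RE and find_hrefs are made up for illustration, and the pattern assumes the attribute value is quoted and contains no newline; it is not a substitute for real HTML parsing.

```python
import re

# href, optional whitespace around '=', then a quoted value.
# (["']) captures the opening quote and \1 requires the same closing quote,
# so both href="..." and href='...' are handled.
HREF_RE = re.compile(r"""href\s*=\s*(["'])(?P<url>.*?)\1""", re.IGNORECASE)


def find_hrefs(html):
    """Return every quoted href value found in the given HTML text."""
    return [m.group("url") for m in HREF_RE.finditer(html)]


if __name__ == "__main__":
    sample = """<a href="test.html">click</a>
<a href='other.html'>click</a>"""
    print(find_hrefs(sample))
```

Using finditer over the whole text avoids the "match, then split the line" step: the named group already holds just the URL.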
ghostdog74
June 11th, 2009, 09:02 PM
It's better to use an HTML parser.
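As one possible illustration of the parser route, here is a minimal sketch using only html.parser from the Python 3 standard library. The class LinkCollector and the helper extract_links are hypothetical names, not part of any library.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> start tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def extract_links(html):
    """Parse the HTML text and return the list of href values from <a> tags."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


if __name__ == "__main__":
    sample = "<a href=\"test.html\">click</a>\n<a href='test.html'>click</a>"
    print(extract_links(sample))
```

Unlike a regex, the parser deals with quoting styles, unquoted values and attribute order on its own, and it only reports href attributes that actually belong to an a tag.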
myrtle1908
June 11th, 2009, 09:52 PM
BeautifulSoup is excellent and very easy to use ... http://www.crummy.com/software/BeautifulSoup/