Yes you can, but if the page is a bit big it gets wasteful. Why parse the whole page when you can stop on the first hit? My code above is a simplification of the one I provided for the very general case of unsorted URLs, but it can be even simpler:
Code:
import re

# downloadsPage is assumed to hold the path of the saved page
urlPattern = r'http://sourceforge\.net/projects/ndiswrapper/files/stable/ndiswrapper-\d+\.\d+\.tar\.gz/download'
with open(downloadsPage) as f:
    wholePage = f.read()
latestMatch = re.search(urlPattern, wholePage)
latestUrl = latestMatch.group() if latestMatch else None
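To see why re.search is enough here, a toy illustration (made-up example.org URLs, not the real SourceForge page): when two matching links appear in the page, re.search returns the earliest one and never scans further.

```python
import re

# A fake page fragment containing two download links; re.search stops at the first.
page = ('<a href="http://example.org/files/tool-1.2.tar.gz/download">old</a>'
        '<a href="http://example.org/files/tool-1.3.tar.gz/download">new</a>')
pattern = r'http://example\.org/files/tool-\d+\.\d+\.tar\.gz/download'

first = re.search(pattern, page)
print(first.group())  # -> http://example.org/files/tool-1.2.tar.gz/download
```

So this only finds the *latest* release if the links are already sorted newest-first, which is the assumption made above for the stable/ folder.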
or, with network retrieval:
Code:
import urllib2, re

downloadPage = 'http://sourceforge.net/projects/ndiswrapper/files/stable/'
downloadPattern = r'http://sourceforge\.net/projects/ndiswrapper/files/stable/ndiswrapper-\d+\.\d+\.tar\.gz/download'

try:
    connection = urllib2.urlopen(downloadPage)
    page = connection.read()
except urllib2.HTTPError, e:
    print 'Cannot retrieve URL: HTTP Error %d' % e.code
    exit(1)
except urllib2.URLError, e:
    print 'Cannot retrieve URL: %s' % e.reason
    exit(1)

firstMatch = re.search(downloadPattern, page)
if firstMatch:
    print 'Best download URL: %s' % firstMatch.group()
else:
    print 'No download URL found, bad page URL?'
(note that on SourceForge a bad path doesn't give you a 404 error, just some other page higher up in the hierarchy, so the pattern match failing is the real sign that the page URL is wrong)
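If you later want to compare the found release against an installed one, the same pattern with capturing groups turns the version into a comparable tuple. The URL below is just a made-up example of the shape matched above:

```python
import re

# Hypothetical matched URL, same shape as the downloadPattern above.
url = 'http://sourceforge.net/projects/ndiswrapper/files/stable/ndiswrapper-1.59.tar.gz/download'

# Capture the major and minor components instead of the whole URL.
m = re.search(r'ndiswrapper-(\d+)\.(\d+)\.tar\.gz', url)
if m:
    version = tuple(int(part) for part in m.groups())
    print(version)  # -> (1, 59)
```

Tuples of ints compare element by element, so `(1, 59) > (1, 9)` behaves correctly where plain string comparison would not.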