[h2o]
April 17th, 2008, 10:07 AM
I need to parse Wikipedia links to extract the data (URL/article name) using Python.
The links are of these types.
1) Link without title: [[My Article]]
2) Link with title [[My Article|Description of My Article]]
3) Hypertext link [http://example.com]
4) Hypertext link with title [http://example.com Link to example.com]
I am primarily interested in the URL part, not the description (although getting that as well doesn't hurt, but it's really not needed).
To make it clearer, here is some sample code showing the result I am after.
>>> import re
>>> s = "[[Main Page]], [[An Article|This is a title]] [http://example.com Our website] but you can also visit [http://google.com]"
>>> m = re.findall(REGEX, s)
>>> print m
[('Main Page', ''), ('An Article', 'This is a title'), ('http://example.com', 'Our website'), ('http://google.com', '')]
I have found regexes that solve (1) and (2), but none that handle the mixed content.
Anyone with more regex-skills than me who is up to the challenge? :)
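For reference, one possible approach is sketched below. It is not a definitive solution: it assumes hypertext links always start with http:// or https://, and it ignores nested brackets, templates and other wiki markup. The names LINK_RE and extract_links are made up for illustration. Because a single alternation gives four groups per match, it uses finditer and picks the pair that matched instead of calling re.findall directly. It is written for Python 3, so print is a function.

```python
import re

# First alternative:  [[target]] or [[target|title]]  -> groups 1 and 2
# Second alternative: [url] or [url title]            -> groups 3 and 4
# Assumption: external URLs begin with http:// or https://
LINK_RE = re.compile(
    r"\[\[([^\]|]+)(?:\|([^\]]*))?\]\]"
    r"|\[(https?://[^\s\]]+)(?:\s+([^\]]*))?\]"
)


def extract_links(text):
    """Return a list of (target, title) tuples; title is '' when absent."""
    links = []
    for match in LINK_RE.finditer(text):
        # Groups that did not take part in the match come back as ''.
        wiki_target, wiki_title, url, url_title = match.groups(default="")
        if wiki_target:
            links.append((wiki_target, wiki_title))
        else:
            links.append((url, url_title))
    return links


s = ("[[Main Page]], [[An Article|This is a title]] "
     "[http://example.com Our website] but you can also visit [http://google.com]")
print(extract_links(s))
```

If only the URL or article name is needed, taking the first element of each tuple is enough.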