View Full Version : looking for an app or script to parse text from html
soonindallas
January 25th, 2006, 07:36 PM
I'm a job seeker. I want to make a mailmerge file from text that I could get from a website's pages by parsing stuff out of their html.
On the site I'm looking at, each company has a profile page containing information I want: the company's mailing address and the name of the guy I want to write to behind a predictable title.
If I can also parse some information out of links in an "index" page I can automate the generation of urls to pass to the parser.
This information appears in the html source when I ask firefox to display it.
Would anyone with experience of this kind of task please give me some recommendations ?
stream303
January 26th, 2006, 09:07 AM
I often use Lynx to parse out html.
You could use Lynx to actually navigate to the site/page, and then issue the "print" command, which will prompt you to save to a local file. This "printed" file is stripped of html.
I'm sure there are other ways, but Lynx has always been one of my favorite browser/tools on the command line..
soonindallas
January 26th, 2006, 07:21 PM
thanks stream. I used Lynx as you suggested. It does help strip the tags and I went ahead with it.
I was looking for a way to filter the text and automate the task for 100 or so webpages at predictable urls.
I'm about half way done. If anyone can chip in with another idea I would appreciate it.
sapo
January 26th, 2006, 07:39 PM
This sounds like spamming stuff :rolleyes:
Kerberos
January 26th, 2006, 08:01 PM
PHP can do it pretty well. I made a script for fetching wow items and parsing the info (similar thing).
$remotefilename = http://wow.allakhazam.com/db/item.html?witem=' . $itemno;
$externalinfo = file2string ( $remotefilename );
Then just use string functions on $externalinfo to filter out the bits you want. Write it to MySQL if your feeling exotic. :)
Kerberos
January 26th, 2006, 08:03 PM
double post. are these forums running on windows server or something?
soonindallas
January 27th, 2006, 09:41 AM
kerebos: thanks for the suggestion. I'll look into that for fetching the pages.
yesterday I manually fetched 40 or so web pages and saved the text with Lynx. Then with grep -f keywords * >> foobar compiled a file foobar with eveything I need.
should have thought of grep before...
Powered by vBulletin® Version 4.2.2 Copyright © 2024 vBulletin Solutions, Inc. All rights reserved.