Cat-like command to dump text from START to END for Bush Video Archive?
kentbye
May 10th, 2007, 05:11 AM
I'm working on screen scraping the WhiteHouse.gov for a website that I'm helping out with, and I have a question.
Is there a command-line program, or a set of grep options, that will display the text between two strings -- like the text between "<!-- BEGIN -->" & "<!-- END -->" ?
I'm trying to isolate the transcript text that appears between these two flags.
cat will dump the text
grep will also search for strings on a line.
But is there anything out there that will output everything in between into a file?
We're downloading the public domain streaming videos from WhiteHouse.gov, and transcoding them so that they're more accessible, and we wanted to also match up these videos with the transcripts.
Here's a sample page; look at the page source to see the "<!-- BEGIN -->" & "<!-- END -->" tags:
http://www.whitehouse.gov/news/releases/2007/01/20070131-1.html
It will be integrated into a full-blown George W. Bush video archival collection, and this will make it a lot easier to search for video quotes for people to remix.
Here's a sampling of some of the videos so far:
http://political-videos.blip.tv/
Thanks,
-Kent.
jamescox84
May 10th, 2007, 11:18 AM
Here's a script I just wrote. Could a Python script like this do what you want?
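#!/usr/bin/env python
# file: gettrans
# Python 2 script: uses the print statement and urllib.urlopen.
import sys
import urllib

if __name__ == '__main__':
    if len(sys.argv) != 3:
        print 'usage: %s <url> <outputfile>' % sys.argv[0]
        sys.exit(1)

    # Download the page source.
    f = urllib.urlopen(sys.argv[1])
    page_source = f.read()
    f.close()

    start_tag = '<!-- BEGIN -->'
    end_tag = '<!-- END -->'

    # find() returns -1 when a marker is missing, so check before slicing.
    start = page_source.find(start_tag)
    end = page_source.find(end_tag, start + len(start_tag))
    if start == -1 or end == -1:
        print 'Could not find both markers in %s' % sys.argv[1]
        sys.exit(1)
    transcript = page_source[start + len(start_tag):end]

    # Write everything between the markers to the output file.
    try:
        f = open(sys.argv[2], 'w')
        f.write(transcript)
        f.close()
    except IOError:
        print 'Could not open %s' % sys.argv[2]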
Use like this:
./gettrans http://www.whitehouse.gov/news/releases/2007/01/20070131-1.html transcript.html
or
python gettrans http://www.whitehouse.gov/news/releases/2007/01/20070131-1.html transcript.html
gettrans will need execute permission for the first form to work.
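For anyone trying this on Python 3, where urllib.urlopen no longer exists under that name, the core of the script can be sketched as a plain function that works on a string you have already downloaded. This is only a sketch: the name extract_between and the choice to return None when a marker is missing are assumptions for illustration, not part of the script above.

```python
def extract_between(text, start_tag, end_tag):
    """Return the text between the first start_tag and the next end_tag.

    Returns None if either marker is missing (an assumed convention).
    """
    start = text.find(start_tag)
    if start == -1:
        return None
    start += len(start_tag)
    # Search for the end marker only after the start marker.
    end = text.find(end_tag, start)
    if end == -1:
        return None
    return text[start:end]


if __name__ == '__main__':
    sample = 'header <!-- BEGIN -->transcript text<!-- END --> footer'
    print(extract_between(sample, '<!-- BEGIN -->', '<!-- END -->'))
```

Checking for -1 matters because a missing marker would otherwise produce a slice of the wrong part of the page rather than an error.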
cwaldbieser
May 10th, 2007, 12:44 PM
Assuming the file is "test.txt":
$ sed -n '/<!-- START -->/,/<!-- END -->/p' test.txt | grep -v -e '<!-- START -->' -e '<!-- END -->'
The sed command only prints lines between the markers (inclusive). The grep -v filters out the marker lines.
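For comparison, the same line-oriented filtering can be sketched in Python. This is a rough equivalent under some assumptions, not a drop-in replacement: it uses plain substring matching instead of sed's regular expressions, and the function name lines_between is made up for illustration. Like the sed and grep pipeline, it works on whole lines, so any transcript text sharing a line with a marker is dropped.

```python
def lines_between(lines, start_marker, end_marker):
    """Yield lines that fall between marker lines, excluding the marker lines."""
    inside = False
    for line in lines:
        if not inside:
            # Like sed's first address: a line containing start_marker opens the range.
            if start_marker in line:
                inside = True
        elif end_marker in line:
            # Like sed's second address: a line containing end_marker closes the range.
            inside = False
        elif start_marker not in line:
            # Like grep -v: skip any marker line that shows up inside the range.
            yield line


if __name__ == '__main__':
    page = ['<html>', '<!-- BEGIN -->', 'first line', 'second line', '<!-- END -->', '</html>']
    for line in lines_between(page, '<!-- BEGIN -->', '<!-- END -->'):
        print(line)
```

If the start marker never appears in the input, nothing is printed, which is also how the sed version behaves.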
kentbye
May 11th, 2007, 12:24 AM
Great! Thanks.
I got it to work with the python script.
And I extended the code to make the start_tag and end_tag input arguments (shown down below).
And for some reason the sed command didn't do anything.
Is it supposed to output to a file?
I did the following:
axel http://www.whitehouse.gov/news/releases/2007/01/20070131-1.html;
sed -n '/<!-- START -->/,/<!-- END -->/p' 20070131-1.html | grep -v -e '<!-- START -->' -e '<!-- END -->';
And there is nothing that is output.
I renamed the "gettrans" file to "parse", and did a
chmod +x parse
And here is the code for inputting the start and end tags:
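#!/usr/bin/env python
# file: parse
# Python 2 script: uses the print statement and urllib.urlopen.
import sys
import urllib

if __name__ == '__main__':
    if len(sys.argv) != 5:
        print 'usage: %s <url> <start_tag> <end_tag> <outputfile>' % sys.argv[0]
        sys.exit(1)

    # Read the page source from the given URL or local file name.
    f = urllib.urlopen(sys.argv[1])
    page_source = f.read()
    f.close()

    start_tag = sys.argv[2]
    end_tag = sys.argv[3]

    # find() returns -1 when a marker is missing, so check before slicing.
    start = page_source.find(start_tag)
    end = page_source.find(end_tag, start + len(start_tag))
    if start == -1 or end == -1:
        print 'Could not find both markers in %s' % sys.argv[1]
        sys.exit(1)
    transcript = page_source[start + len(start_tag):end]

    # Write everything between the markers to the output file.
    try:
        f = open(sys.argv[4], 'w')
        f.write(transcript)
        f.close()
    except IOError:
        print 'Could not open %s' % sys.argv[4]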
Here's what I entered to make it run:
parse 20070430-2.html '<!-- BEGIN -->' '<!-- END -->' 20070430-2.txt
Thanks again,
-Kent.
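Note that the command above passes a local file name where the usage line says <url>. That works in Python 2 because urllib.urlopen opens a local file when the argument has no URL scheme. On Python 3, urllib.request.urlopen rejects such an argument, so a port would have to make the choice itself. A minimal sketch follows; the function name read_source, the list of accepted schemes, and the UTF-8 default are all assumptions for illustration.

```python
import urllib.request
from urllib.parse import urlparse


def read_source(location, encoding='utf-8'):
    """Read page source from a URL, or from a local file if there is no URL scheme."""
    if urlparse(location).scheme in ('http', 'https', 'ftp', 'file'):
        with urllib.request.urlopen(location) as response:
            # The encoding is an assumption; undecodable bytes are replaced.
            return response.read().decode(encoding, errors='replace')
    # No recognised scheme: treat the argument as a local path.
    with open(location, encoding=encoding, errors='replace') as f:
        return f.read()
```

With this, both a downloaded file such as 20070430-2.html and a full http:// address could be passed as the first argument.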
jamescox84
May 11th, 2007, 09:35 AM
Oh, this makes my heart swell with joy when I see such things. The Free Software thing really working. Thanks for making use of my script, and for giving back your modifications. I know the program is probably only useful to you, but who knows.
cwaldbieser
May 11th, 2007, 12:20 PM
Oops. It should have been:
$ sed -n '/<!-- BEGIN -->/,/<!-- END -->/p' 20070131-1.html | grep -v -e '<!-- BEGIN -->' -e '<!-- END -->';
I typed "START" instead of "BEGIN" without thinking.
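A mistyped marker like this fails silently, because sed simply never finds a line to start printing from. One way to catch it early is to count how often each marker occurs in the file before extracting anything. The helper below is a small sketch; the name marker_counts is made up for illustration.

```python
def marker_counts(text, markers):
    """Map each marker string to the number of times it occurs in text."""
    return {marker: text.count(marker) for marker in markers}


if __name__ == '__main__':
    page = 'intro\n<!-- BEGIN -->\ntranscript\n<!-- END -->\n'
    for marker, count in marker_counts(page, ['<!-- START -->', '<!-- BEGIN -->', '<!-- END -->']).items():
        print('%s: %d' % (marker, count))
```

A count of zero for a marker means the extraction will come back empty for that file.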