View Full Version : need help with grep or similar
pedrotuga
September 1st, 2006, 03:44 PM
Hi,
I want to catch all the urls from an html source.
I went and red about grep and its regulat expression but i am facing to basic problems:
-grep is designed to work on a line by line basis, that screws it up becauase oftem there is huge pieces of html without an endl
-there might be, but so far i havent figured out how to include spaces on the wildcards.
any alternative or hints?
a little code snippet would be wellcome as well of course ;)
ciscosurfer
September 1st, 2006, 04:45 PM
How 'bout this:
cat your_html_source_file | grep http
or
cat your_html_source_file | grep href
ifokkema
September 1st, 2006, 04:50 PM
That would do exactly what he described what he doesn't want :)
I'm not currently working on Linux (shame on me) but I know grep has the option to show only the text that has matched the pattern. Then you would need to use a full URL pattern, and you'll have your results.
HTH
Ivo
ciscosurfer
September 1st, 2006, 05:01 PM
Check 'man grep' for info on regular expressions. By default, grep matches by line.
You can also try (for visual clarity): cat your_html_source_file | grep --color=always href
Stone123
September 1st, 2006, 05:04 PM
Maby this can help , just search for some regular expression.
Perl:
http://www.troubleshooters.com/codecorn/littperl/perlreg.htm
Since i don't know Perl i would use Php and regular expression.
Create one input box button , place the page on apache with php. Use eregi to get String from http + rest till " "
pedrotuga
September 1st, 2006, 05:58 PM
I thought about doing a php script instead... but thats not what i am looking for... i want to be able to catch the links without leaving the console... this can also be practical and reused in other shell scripts.
i am confused.
Are the regular expressions the exact same in perl, php? what about the ones used in grep??? are they the same as the ones mentioned before?
u gave me a link to perl regular expressions... can i use those on grep?
what about php's?
ifokkema
September 1st, 2006, 09:24 PM
PHP has a set of 'Perl compatible' regular expressions, for which the syntax is almost exactly the same as Perl's. The native PHP regular expressions are a bit easier to use, but to be honest I can't even remember how they are used since I stick to the much faster Perl compatible ones.
Grep has different 'levels' of pattern matching. I believe the -E options allows grep to process regular expressions like you can use in Perl/PHP as well. I think regular expression patterns are just so common that you can easily adopt them to different languages / executables.
This is what I would use in PHP:
/^(ht|f)tps?:\/\/([0-9a-z][-0-9a-z]*[0-9a-z]\.)+[a-z]{2,4}\/?[%&=#0-9a-z\/._+-]*\??.*$/i
to match a absolute URL. It can easily be adopted for usage with grep.
pedrotuga
September 1st, 2006, 10:59 PM
-E enables regular expressions, the problem is that grep always returns a line and i want the match only...
any other program i can use besides grep that returns the regular expression matches?
if not i will use php... the problem is that i have PHP installed
as an apache module :( wich means i have to donload it an recompile it in order to be able to run php in scripts in the shell.
ifokkema, thanks a million for the regular expression.
Ragazzo
September 2nd, 2006, 02:41 AM
I wrote this script in Python. It finds all links defined with an A element.
#!/usr/bin/env python
import re
import sys
p1 = re.compile(r'<\s*a.*?>', re.DOTALL | re.I) #A-tag
p2 = re.compile(r'href="(.*?)"', re.DOTALL | re.I)
sys.argv.pop(0)
for filename in sys.argv:
file = open(filename)
text = file.read()
file.close()
for tag in p1.findall(text):
print p2.search(tag).group(1)
ifokkema
September 2nd, 2006, 08:28 AM
-E enables regular expressions, the problem is that grep always returns a line and i want the match only...
According to the manual:
-o, --only-matching
Show only the part of a matching line that matches PATTERN.
if not i will use php... the problem is that i have PHP installed as an apache module :( wich means i have to donload it an recompile it in order to be able to run php in scripts in the shell.
No, you don't :D
sudo apt-get install php4-cli
(or php5-cli, if you'd like)
Then, you can put
#!/usr/bin/php
on top of your script and you're off writing your PHP app :)
ifokkema, thanks a million for the regular expression.
You're welcome. Ragazzo has a good point too, matching the <A href=""> tags. It saves you a enourmous regular expression, but it depends on your source file if it's usable. Also, using that technique will catch relative URLs, not sure if you need those, too.
Powered by vBulletin® Version 4.2.2 Copyright © 2024 vBulletin Solutions, Inc. All rights reserved.