[SOLVED] sed reserved characters?

April 7th, 2008, 09:18 PM
I'm trying to grab URLs from a text file and have found a few suggestions.

One from the IETF itself that looks promising is

^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?

but I'm having no luck converting this into a form sed will accept. I've tried all sorts.

Can anyone suggest the characters that must be escaped and whether I need to opt for some regex extension, or ssed etc to be able to do regex as complex as that?

Having not used sed before, I had hoped for something simple like just escaping the ()s might work..

sed -n -e 's@^\(\([^:/?#]+\):\)?\(//\([^/?#]*\)\)?\([^?#]*\)\(\?\([^#]*\)\)?\(#\(.*\)\)?@\5@p' text

April 8th, 2008, 02:04 AM
What does your text file with URLs look like?

April 8th, 2008, 12:48 PM
For now, as a first step, it's just a list of URLs, each on its own line. So ^ should match the start well enough.

Later I might use delimiters to grab from HTML source, but having the regex available to grab the different elements is what I'm looking for initially.

I think that \5 should be the domain host for instance..

April 9th, 2008, 04:43 AM
It seems that you need to backslash-escape the + and ? quantifiers.
I don't know why that should be necessary. But it sure helps.

$ sed -n -e 's@^\(\([^:/?#]\+\):\)\?\(//\([^/?#]*\)\)\?\([^?#]*\)\(\?\([^#]*\)\)\?\(#\(.*\)\)\?@\5 \6 \7@p' text
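
On the "why": by default sed uses POSIX basic regular expressions, where a bare + or ? is an ordinary character, and GNU sed offers \+ and \? as extensions. As an alternative sketch, assuming a sed that accepts -E (GNU sed also spells it -r), extended syntax lets the IETF pattern be used almost verbatim. The parse_url name, the labels in the output, and the sample URL are only my own illustration.

```shell
# Sketch only: parse_url and the sample URL are hypothetical, for illustration.
# -E selects extended regular expressions, so ( ) + ? work as operators
# without backslashes, and \? matches a literal question mark.
parse_url() {
    printf '%s\n' "$1" |
        sed -n -E 's@^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?@scheme=\2 host=\4 path=\5 query=\7 fragment=\9@p'
}

parse_url 'http://www.ics.uci.edu/pub/ietf/uri/?search#Related'
# prints: scheme=http host=www.ics.uci.edu path=/pub/ietf/uri/ query=search fragment=Related
```

Groups that do not take part in the match (for example the query, when the URL has no ?) are substituted as empty strings.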

April 9th, 2008, 01:15 PM
Great - Thank you, that works!

For the record, then: for an example URL, the above gives the following subexpression matches:

$1 = http:
$2 = http
$3 = //www.ics.uci.edu
$4 = www.ics.uci.edu
$5 = /pub/ietf/uri/
$6 = ?search
$7 = search
$8 = #Related
$9 = Related