perl oneliner to extract first letters of regex search [Archive]

thaitang

September 24th, 2011, 02:22 AM

Can someone help me with a Perl one-liner that takes the text matched by a regex search pattern, extracts the first letter of each word in that match, and joins all of those first letters together into a single string?

For example, in sed I might use something like:

sed -ri 's/<\/p><center>([A-Za-z])([A-Za-z@:,]{0,20})[ ]{0,1}([A-Za-z]{0,1})[ ]{0,1}([A-Za-z@:,]{0,20})[ ]{0,1}([A-Za-z]{0,1})[ ]{0,1}([A-Za-z@:,]{0,20})[ ]{0,1}([A-Za-z]{0,1})(.*)<\/b><\/center>/<a href="#\1\3\5\7Toc" name="\1\3\5\7Txt" style="text-decoration:none">\1 \2 \3 \4 \5 \6 \7 \8<\/a><\/font> /' 1.html

which I realize needs some nurturing to work as I want, but really I want to start moving toward doing this kind of stuff using Perl. So the before might look like:

<center>Part I: Introduction to Beyond Good and Evil</center>

and the desired after like:

<a href="#PIItBGaEToc" name="PIItBGaETxt" style="text-decoration:none">Part I: Introduction to Beyond Good and Evil</a>

I know the HTML could use some tweaking as well, but one thing at a time. :) If anyone can help with a Perl one-liner example of how to do the above I would really appreciate it.

cheers, tt

dethorpe

September 26th, 2011, 11:06 AM

If you're going to be parsing and manipulating HTML like this, I'd recommend using an HTML parsing module from CPAN rather than trying to do it using regexes.

e.g. http://search.cpan.org/~gaas/HTML-Parser/ or http://search.cpan.org/dist/HTML-Parser/lib/HTML/PullParser.pm depending on how you want to do it.

myrtle1908

September 26th, 2011, 11:26 AM

Does it have to be a one-liner? I would first strip the HTML, then use a regex with a word boundary:

use strict;
use warnings;

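my $s = '<center>Part I: Introduction to Beyond Good and Evil</center>';
$s =~ s/<.*?>//g;            # strip the tags
my @m = ($s =~ m/\b(\w)/g);  # first character after each word boundary
print "@m";                  # interpolated list elements are separated by spaces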

Yields

P I I t B G a E
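
To get from that list of letters to the single string asked for in the first post, the matches can be passed to join. This is only a minimal sketch: it assumes the Toc/Txt suffixes from the original example are wanted, and the variable names $title, $abbr and $link are illustrative, not from anyone's post.

```perl
use strict;
use warnings;

my $s = '<center>Part I: Introduction to Beyond Good and Evil</center>';

# Strip the tags, keeping only the title text.
(my $title = $s) =~ s/<.*?>//g;

# In list context m//g returns every captured first letter; join glues them together.
my $abbr = join '', $title =~ m/\b(\w)/g;

# Build the anchor in the shape shown in the first post.
my $link = sprintf '<a href="#%sToc" name="%sTxt" style="text-decoration:none">%s</a>',
    $abbr, $abbr, $title;

print "$link\n";
```

For the sample title this gives the abbreviation PIItBGaE, the same letters as above without the spaces.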

stylishpants

September 27th, 2011, 12:32 AM

Here is one way to do it.

Since you're interested in learning how to build one-liners, I'm showing all the steps I took to construct this.
You don't need to run the earlier steps; the last one does everything.

# File of test data:
bob@cob:/tmp$ cat file.html
<center>Part I: Introduction to Beyond Good and Evil</center>
<center>Part II: The Wrath of Khan</center>
<center>Part III: Return Of The Jedi </center>

# Extract title from surrounding text.
# Take everything between the tags, but not the tags themselves.
bob@cob:/tmp$ cat file.html | perl -nle '/<center>(.*)<\/center>/; print $1'
Part I: Introduction to Beyond Good and Evil
Part II: The Wrath of Khan
Part III: Return Of The Jedi

# strip out all tags within the title
bob@cob:/tmp$ cat file.html | perl -nle '/<center>(.*)<\/center>/; print $1' | perl -pe 's/<.*?>//g'
Part I: Introduction to Beyond Good and Evil
Part II: The Wrath of Khan
Part III: Return Of The Jedi

# use the match operator to capture the first char in every sequence of consecutive non-whitespace chars
# in list context the match returns the list of captures, which print emits as one group of letters
# "-l" flag adds a newline after each print, but it also chomps each input line, so beware if you keep that
bob@cob:/tmp$ cat file.html | pcregrep -o '.*' | perl -nle 's/<.*?>//g; print m/(\S)\S*/g '
PIItBGaE
PITWoK
PIROTJ
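
The context behaviour this step relies on can be seen in isolation in a short script. This is a sketch for illustration; the variable names are made up here and are not part of the one-liner.

```perl
use strict;
use warnings;

my $title = 'Part II: The Wrath of Khan';

# List context: m//g returns the captures from every match.
my @letters = $title =~ m/(\S)\S*/g;

# Scalar context: the match only reports whether it succeeded.
my $found = $title =~ m/(\S)\S*/;

print "letters: @letters\n";                  # interpolation puts spaces between elements
print "joined:  ", join('', @letters), "\n";  # join with '' gives one unbroken string
print "found:   $found\n";
```

print with a bare list writes the elements with no separator (unless $, has been set), which is why the one-liner above prints the letters run together.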

# Save the abbreviation and the original string (including its inner tags) into variables
bob@cob:/tmp$ cat file.html | perl -nle '/<center>(.*)<\/center>/; print $1' | perl -ne 'chomp; $s=$_; s/<.*?>//g; $a=join("",m/(\S)\S*/g); printf(qq[%s: %s\n],$a,$s)'
PIItBGaE: Part I: Introduction to Beyond Good and Evil
PITWoK: Part II: The Wrath of Khan
PIROTJ: Part III: Return Of The Jedi

# Use the (broken) HTML you supplied in the original post as a template and substitute in the variables
bob@cob:/tmp$ cat file.html | perl -nle '/<center>(.*)<\/center>/; print $1' | perl -ne 'chomp; $s=$_; s/<.*?>//g; $t=join("",m/(\S)\S*/g); printf(qq[<a href="#%s" name="%s" style="text-decoration:none">%s</a> \n],$t,$t,$s) '
<a href="#PIItBGaE" name="PIItBGaE" style="text-decoration:none">Part I: Introduction to Beyond Good and Evil</a> 
<a href="#PITWoK" name="PITWoK" style="text-decoration:none">Part II: The Wrath of Khan</a> 
<a href="#PIROTJ" name="PIROTJ" style="text-decoration:none">Part III: Return Of The Jedi </a> 

# Merge all of that into one single perl instance for brevity.
# (readability / maintainability is another story. I would not want to maintain code like this.)
bob@cob:/tmp$ perl -ne '/<center>(.*)<\/center>/; $s=$1; s/<.*?>//g; $t=join("",m/(\S)\S*/g); printf(qq[<a href="#%s" name="%s" style="text-decoration:none">%s</a> \n],$t,$t,$s) ' file.html
<a href="#PIItBGaE" name="PIItBGaE" style="text-decoration:none">Part I: Introduction to Beyond Good and Evil</a> 
<a href="#PITWoK" name="PITWoK" style="text-decoration:none">Part II: The Wrath of Khan</a> 
<a href="#PIROTJ" name="PIROTJ" style="text-decoration:none">Part III: Return Of The Jedi </a>
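
If maintainability matters more than brevity, the same steps could be written out as a small script. This is a rough sketch, not a drop-in replacement: the helper name anchor_for is hypothetical, the sample lines stand in for file.html, and it assumes each title sits on one line inside a <center> element.

```perl
use strict;
use warnings;

# Hypothetical helper: builds the anchor for one line of input.
# Returns nothing when the line has no <center>...</center> title.
sub anchor_for {
    my ($line) = @_;
    my ($title) = $line =~ m{<center>(.*?)</center>} or return;
    (my $text = $title) =~ s/<.*?>//g;          # drop any tags inside the title
    my $abbr = join '', $text =~ m/(\S)\S*/g;   # first char of each word
    return sprintf '<a href="#%s" name="%s" style="text-decoration:none">%s</a>',
        $abbr, $abbr, $title;
}

# Sample input standing in for file.html.
my @lines = (
    '<center>Part I: Introduction to Beyond Good and Evil</center>',
    '<p>not a title</p>',
    '<center>Part II: The Wrath of Khan</center>',
);

for my $line (@lines) {
    my $anchor = anchor_for($line);
    print "$anchor\n" if defined $anchor;   # skip lines without a title
}
```

Unlike the one-liner, lines that do not contain a title are skipped rather than printed with an empty or stale match. To run it over a real file, the loop over @lines could be swapped for while (my $line = <>) and the file name passed on the command line.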