Parse websearch output with Perl (RegEx)



hansoffate
July 8th, 2008, 12:02 AM
I am trying to query the yeast genome database (http://www.yeastgenome.org/) to search for ORFs and get the FASTA results. Basically, it is just a simple query, get the output from the query, then strip out what I need.


An example of an output that I would be working with is here: http://db.yeastgenome.org/cgi-bin/getSeq?seq=YMR056C&flankl=0&flankr=0&map=p3map
I want to take the ORF ID (YMR056C) and get the FASTA output which is:
>YMR056C Chr 13 reverse complement
MSHTETQTQQSHFGVDFLMGGVSAAIAKTGAAPIERVKLLMQNQEEMLKQ GSLDTRYKGI
LDCFKRTATHEGIVSFWRGNTANVLRYFPTQALNFAFKDKIKSLLSYDRE RDGYAKWFAG
NLFSGGAAGGLSLLFVYSLDYARTRLAADARGSKSTSQRQFNGLLDVYKK TLKTDGLLGL
YRGFVPSVLGIIVYRGLYFGLYDSFKPVLLTGALEGSFVASFLLGWVITM GASTASYPLD
TVRRRMMMTSGQTIKYDGALDCLRKIVQKEGAYSLFKGCGANIFRGVAAA GVISLYDQLQ
LIMFGKKFK*


This is the code I tried to write, based on another Perl script I had. I can fetch the HTML, and the results I want to pull out are contained within <pre>> TEXT HERE </pre>. I am close, but I think my regular expression isn't written correctly.

Any ideas?

Thanks,
Hans


#!/usr/bin/perl

use warnings;
use LWP::Simple;

while (<>) {
    chomp;
    $html = get("http://db.yeastgenome.org/cgi-bin/getSeq?seq=$_&flankl=0&flankr=0&map=p3map");
    unless (length($html)) {
        warn "Unable to load page for '$_'\n";
        next;
    }

    @found = ();
    foreach $line (split("\n", $html)) {
        next unless (($fasta) = $line =~ m#<pre>>([.<]+)</pre>#i);
        push(@found, $fasta);
    }
    print "$_: ", join(' ', @found), "\n";
}
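
A note on the pattern above, as a minimal sketch rather than a definitive diagnosis: inside a character class a dot is a literal dot, so `[.<]+` only matches runs of `.` and `<`, and splitting on newlines means no single line is likely to hold both `<pre>` and `</pre>`. The sample HTML string below is an assumption about the page layout, not the real server output.

```perl
use strict;
use warnings;

# Hypothetical sample; the real page layout is an assumption.
my $html = "<html><body><pre>>YMR056C Chr 13 reverse complement\n"
         . "MSHTETQTQQ\nLIMFGKKFK*\n</pre></body></html>";

# Inside [...] a dot is literal, so [.<]+ matches only '.' and '<' characters.
my $class_matches_text = ("abc" =~ /^[.<]+$/) ? 1 : 0;
my $class_matches_dots = ("..<" =~ /^[.<]+$/) ? 1 : 0;

# After split on "\n", <pre> and </pre> end up on different lines.
my @lines_with_both = grep { /<pre>/i && /<\/pre>/i } split /\n/, $html;

print "class matches 'abc': $class_matches_text\n";
print "class matches '..<': $class_matches_dots\n";
print "lines containing both tags: ", scalar(@lines_with_both), "\n";
```

If that reading is right, the fix is to match against the whole page as one string instead of line by line.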

skeeterbug
July 8th, 2008, 12:44 AM
I wouldn't use regex to parse html.

http://search.cpan.org/~petek/HTML-Tree-3.23/lib/HTML/TreeBuilder.pm

I am sure there are others.

myrtle1908
July 8th, 2008, 01:26 AM
So you want to get everything in between '<pre>' and '</pre>'?

First join your html into one big string then (assuming there will only ever be one set of pre tags):-

my ($pre) = $html =~ /.*?<pre>(.*?)<\/pre>/i;

slavik
July 8th, 2008, 02:15 AM
So you want to get everything in between '<pre>' and '</pre>'?

First join your html into one big string then (assuming there will only ever be one set of pre tags):-

my ($pre) = $html =~ /.*?<pre>(.*?)<\/pre>/i;
or you can do:



my @pre_arr = $html =~ /.*?<pre>(.*?)<\/pre>/gi;

hansoffate
July 8th, 2008, 11:37 PM
I tried both regexes in place of the code below, and neither worked. I also tried to write the join, but that didn't work either.

Could either of you provide a bit more help?

Thanks,
Hans



@found = ();
foreach $line (split("\n", $html)) {
    next unless (($fasta) = $line =~ m#<pre>>([.<]+)</pre>#i);
    push(@found, $fasta);
}

myrtle1908
July 9th, 2008, 01:25 AM
This works but you will need to do your own error handling etc.



use strict;
use warnings;
use LWP::Simple;

my $html = get("http://db.yeastgenome.org/cgi-bin/getSeq?seq=YMR056C&flankl=0&flankr=0&map=p3map");
my ($pre) = $html =~ /.*?<pre>(.*?)<\/pre>/gsi;
print $pre;
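
The difference from the earlier one-liners appears to be the `/s` modifier, which lets `.` match newlines so the capture can span several lines. Below is a small sketch of that, using a made-up sample string instead of a live request; `extract_pre` is a hypothetical helper name, and allowing attributes on the `<pre>` tag is an assumption about the page.

```perl
use strict;
use warnings;

# Return the text of the first <pre>...</pre> block, or undef if there is none.
sub extract_pre {
    my ($html) = @_;
    return unless defined $html;
    # /s lets '.' match newlines; /i ignores tag case; .*? stops at the first </pre>.
    my ($pre) = $html =~ m{<pre\b[^>]*>(.*?)</pre>}si;
    return $pre;
}

# Hypothetical sample; the real page layout is an assumption.
my $sample = "<p>header</p>\n<pre>>YMR056C Chr 13 reverse complement\n"
           . "MSHTETQTQQ\nLIMFGKKFK*\n</pre><hr>";

# Without /s the dot cannot cross "\n", so this match fails and leaves undef.
my ($without_s) = $sample =~ m{<pre>(.*?)</pre>}i;
my $with_s = extract_pre($sample);

print defined $without_s ? "without /s: matched\n" : "without /s: no match\n";
print "with /s:\n$with_s" if defined $with_s;
```

Because the helper returns undef when nothing matches (or when `get` failed and returned undef), the caller can warn and move on to the next ORF ID instead of printing an undefined value.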

ghostdog74
July 9th, 2008, 03:12 AM
Imagine you already have the page downloaded:


#!/usr/bin/perl
$s = "";
$f = 0;
while (<>) {
    s/.*<pre>// and $f = 1 if (/<pre>/);
    $s = $s . $_ if $f;
    $f = 0 if (/<\/pre>/);
}
print $s;

output:


# ./test.pl htmlfile
>YMR056C Chr 13 reverse complement
MSHTETQTQQSHFGVDFLMGGVSAAIAKTGAAPIERVKLLMQNQEEMLKQ GSLDTRYKGI
LDCFKRTATHEGIVSFWRGNTANVLRYFPTQALNFAFKDKIKSLLSYDRE RDGYAKWFAG
NLFSGGAAGGLSLLFVYSLDYARTRLAADARGSKSTSQRQFNGLLDVYKK TLKTDGLLGL
YRGFVPSVLGIIVYRGLYFGLYDSFKPVLLTGALEGSFVASFLLGWVITM GASTASYPLD
TVRRRMMMTSGQTIKYDGALDCLRKIVQKEGAYSLFKGCGANIFRGVAAA GVISLYDQLQ
LIMFGKKFK*
</pre><hr size="2" width="75%">
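
As the last line of the output shows, the line holding the closing tag is still printed. A possible variant of the same flag-based approach that cuts the text off at `</pre>` is sketched below; `pre_block` is a hypothetical helper name and the sample lines are made up.

```perl
use strict;
use warnings;

# Collect the text between <pre> and </pre> from a list of lines,
# dropping the tags themselves and anything after the closing tag.
sub pre_block {
    my @lines = @_;
    my ($inside, $text) = (0, '');
    for my $line (@lines) {
        if (!$inside) {
            # Skip lines until one contains <pre>; strip up to and including the tag.
            next unless $line =~ s/.*?<pre>//i;
            $inside = 1;
        }
        # On the closing tag, cut it and the rest of the line, then stop.
        if ($line =~ s{</pre>.*}{}si) {
            $text .= $line;
            last;
        }
        $text .= $line;
    }
    return $text;
}

# Hypothetical sample lines; the real page layout is an assumption.
my @page = (
    "<p>intro</p>\n",
    "<pre>>YMR056C Chr 13 reverse complement\n",
    "MSHTETQTQQ\n",
    "LIMFGKKFK*\n",
    "</pre><hr size=\"2\" width=\"75%\">\n",
);
print pre_block(@page);
```

To run it against a saved page as in the post above, `print pre_block(<>);` would read the lines from the file named on the command line.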