[SOLVED] How can I read lots of files and find the last post?



hopelessone
March 27th, 2008, 03:00 PM
Hi,

I come from VB6, so I don't know much about this.

I have a folder with 1300 files; they were posts in a forum.

In each file I want to find the last of these:

posted whatever date whatever time
e.g.
posted 04-30-2000 03:17 AM

<TITLE>Whatever - The Archives</title>

remove the " - The Archives" part

<FONT SIZE="2" face="Verdana, Arial"><B>Username</B>

copy the name of the file 003247.html
add ../Forum6/HTML/

and add them all in 1 file to look like:

<TR>
<TD bgcolor="#F7F7F7">
<IMG SRC="../images/closed.gif" BORDER=0></td>
<TD bgcolor="#F7F7F7"><FONT SIZE="2" FACE="Verdana, Arial">
<A HREF="../Forum6/HTML/003247.html">Whatever</A>
</FONT>
</td>
<td bgcolor="#DEDFDF">
<FONT SIZE="2" FACE="Verdana, Arial">Username</FONT>
</td>
<td align=center bgcolor="#F7F7F7">
<FONT SIZE="2" FACE="Verdana, Arial">1</FONT>
</td>
<td NOWRAP bgcolor="#DEDFDF">
<FONT SIZE="2" FACE="Verdana, Arial">04-30-2000 <FONT SIZE="2" FACE="Verdana, Arial" COLOR="#800080">03:10 AM</FONT></FONT>
</td></tr>

<TR>
<TD bgcolor="#F7F7F7">
Etc....
</td></tr>

<TR>
<TD bgcolor="#F7F7F7">
Etc....
</td></tr>

How can I achieve this?

thanks for reading.. :)

ghostdog74
March 27th, 2008, 03:38 PM
not tested, since you didn't provide samples.


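awk 'BEGIN{ IGNORECASE=1; onefile="newfile.html"}   # IGNORECASE only works in gawk
# remember the date and time from the most recent "posted" line
/posted/{
date=$2
time=$3
}
# strip the tags and the "- The Archives" suffix, keep the rest of the line as the title
/<TITLE>/{
gsub(/<TITLE>|<\/TITLE>|- The Archives/,"")
title=$0
}

# END runs once, after the last line of input
END {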
print "<TR>" > onefile
print "<TD bgcolor=\"#F7F7F7\">" > onefile
print "<IMG SRC=\"../images/closed.gif\" BORDER=0></td>" > onefile
print "<TD bgcolor=\"#F7F7F7\"><FONT SIZE=\"2\" FACE=\"Verdana, Arial\">" > onefile
print "<A HREF=\"../Forum6/HTML/" FILENAME "\">" title "</A>" > onefile
print "</FONT>" > onefile
print "</td>" > onefile
print "<td bgcolor=\"#DEDFDF\">" > onefile
print "<FONT SIZE=\"2\" FACE=\"Verdana, Arial\">Username</FONT>" > onefile
print "</td>" > onefile
print "<td align=center bgcolor=\"#F7F7F7\">" > onefile
print "<FONT SIZE=\"2\" FACE=\"Verdana, Arial\">1</FONT>" > onefile
print "</td>" > onefile
print "<td NOWRAP bgcolor=\"#DEDFDF\">" > onefile
print "<FONT SIZE=\"2\" FACE=\"Verdana, Arial\">" date "<FONT SIZE=\"2\" FACE=\"Verdana, Arial\" COLOR=\"#800080\">" time "</FONT></FONT>" > onefile
print "</td></tr>" > onefile

}' file

tgalati4
March 27th, 2008, 03:51 PM
>man grep
>man perl
>man awk

>grep "Whatever I'm Looking For" *.html > mylist.txt

hopelessone
March 28th, 2008, 02:02 PM
ghostdog74 - I really appreciate what you did for me...(sorry about the no sample file)

the line

/<TITLE>/{
gsub(/<TITLE>|<\/TITLE>|- The Archives/,"")
title=$0

Original file:
<TITLE>Whatever - The Archives</title><META HTTP-EQUIV="Pragma" CONTENT="no-cache">

generated file:
<A HREF="../Forum6/HTML/003247.html">Whatever - The Archives<META HTTP-EQUIV="Pragma" CONTENT="no-cache"></A>

Would like:
<A HREF="../Forum6/HTML/003247.html">Whatever</A>

Also, I just noticed the files sometimes say "The BB" instead of "The Archives".

thanks for trying to help me on this...

ghostdog74
March 29th, 2008, 04:08 AM
provide your sample file!

hopelessone
March 29th, 2008, 04:24 AM
It's the third line

sometimes says:

- The Archives or - The BB

in this case says - The BB

Thanks for helpin..!!

ghostdog74
March 29th, 2008, 04:40 AM
this is not the original file, right?
it has


<TITLE>Whatever - the BB</title><META HTTP-EQUIV="Pragma" CONTENT="no-cache">


if your original file has


<TITLE>Whatever - The Archives</title><META HTTP-EQUIV="Pragma" CONTENT="no-cache">

then this part

gsub(/<TITLE>|<\/TITLE>|- The Archives/,"")
will remove the TITLE tags and the text "- The Archives" from the line, leaving "Whatever" (plus anything else that was on that line) as the final title.

Because this line does not contain "- The Archives":


<TITLE>Whatever - the BB</title><META HTTP-EQUIV="Pragma" CONTENT="no-cache">

that part of the pattern never matches, so "- the BB" stays in the title.

If you don't need "- the BB" either, just add it to the gsub:


gsub(/<TITLE>|<\/TITLE>|- The Archives|- the BB/,"")

hopelessone
March 29th, 2008, 04:51 AM
Wow thats great was trying to get that working...

still gives me <META HTTP-EQUIV="Pragma" CONTENT="no-cache"> after it though...

any way to trim this off?

thanks..:KS

ghostdog74
March 29th, 2008, 05:16 AM
gsub(/<TITLE>|<\/TITLE>|- The Archives|- the BB|<META.*\"no-cache\">/,"")
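
The patterns above remove each unwanted piece by name. A minimal sketch of another approach, assuming the title always sits on one line between <TITLE> and </title>: cut out only the text between the two tags, then drop a known suffix. The helper name clean_title is made up for illustration, and the suffix list (Archives, BB) comes only from the samples in this thread. It uses tolower() instead of IGNORECASE, so it should not depend on gawk.

```shell
# clean_title: hypothetical helper; reads HTML on stdin, prints the bare title.
clean_title() {
  awk '
  # Only look at lines that contain a title tag, in any letter case.
  tolower($0) ~ /<title>/ {
    line = $0
    lower = tolower(line)
    start = index(lower, "<title>") + 7      # first character after <TITLE>
    stop = index(lower, "</title>")          # position of the closing tag
    if (stop == 0) stop = length(line) + 1   # no closing tag on this line
    title = substr(line, start, stop - start)
    sub(/ +- [Tt]he (Archives|BB) *$/, "", title)   # assumed suffixes
    print title
    exit                                     # one title per file is enough
  }'
}

sample='<TITLE>Whatever - the BB</title><META HTTP-EQUIV="Pragma" CONTENT="no-cache">'
result=$(printf '%s\n' "$sample" | clean_title)
echo "$result"
```

Because only the text between the tags is kept, anything after </title> on the same line (such as the META tag) never reaches the title.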

hopelessone
March 29th, 2008, 05:25 AM
Ahhh i see how it works now...

Thanks...

I'm trying to figure out how to count how many times the word "posted" appears and subtract 1 from it...

so far i got:

/posted/{
date=$6
time=$7
am=$8
twords += wc +1

gives me the total posts not -1

thank Q for your patience..

ghostdog74
March 29th, 2008, 05:43 AM
just increment a counter.



/posted/ {
...
..
p++
}

then print out at the END block



END {
...
...
...
print "How many times posted appears : " (p - 1)
}



You can learn some shell and awk here (http://www.grymoire.com/Unix/).

hopelessone
March 29th, 2008, 05:57 AM
Thanks !!!!

:KS :KS :KS

hopelessone
March 29th, 2008, 06:07 AM
Doh !!!

i just tried to do it for all 1300 files in the folder..

just gives one posting...and not:

<TR>
<TD bgcolor="#F7F7F7">
Etc....
</td></tr>

<TR>
<TD bgcolor="#F7F7F7">
Etc....
</td></tr>

I changed the last line from }' 001345.html to:

}' *.html

sniff sniff...

Ideas?

ghostdog74
March 29th, 2008, 06:11 AM
change the ">" to ">>"
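
A note on redirection inside awk, since it behaves differently from the shell: within a single awk run, print > file truncates the file only when it is first opened, and later prints to the same name are added after it. So ">>" mainly matters when awk is started several times. A small sketch to check this (the temporary directory and file names are arbitrary):

```shell
tmp=$(mktemp -d)

# Two prints with ">" in one awk run: truncated once, then written in sequence.
awk -v out="$tmp/one.html" 'BEGIN { print "first" > out; print "second" > out }'
lines_single=$(awk 'END { print NR }' "$tmp/one.html")

# The same ">" program run twice: the second run truncates the file again.
awk -v out="$tmp/two.html" 'BEGIN { print "row" > out }'
awk -v out="$tmp/two.html" 'BEGIN { print "row" > out }'
lines_trunc=$(awk 'END { print NR }' "$tmp/two.html")

# With ">>" each run appends, so rows from earlier runs are kept.
awk -v out="$tmp/three.html" 'BEGIN { print "row" >> out }'
awk -v out="$tmp/three.html" 'BEGIN { print "row" >> out }'
lines_append=$(awk 'END { print NR }' "$tmp/three.html")

echo "$lines_single $lines_trunc $lines_append"
rm -rf "$tmp"
```

When all 1300 files are passed to one awk run, the redirection is probably not why only one row appears: the END block runs a single time, after the last file, so it prints one row built from whatever values were stored last.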

hopelessone
March 29th, 2008, 06:31 AM
Hi,

I get the same thing twice, but with a total of 121210 posts from p++... how do I reset this after each file?

You meant changing print "<TR>" > onefile to print "<TR>" >> onefile, right?

thanks

ghostdog74
March 29th, 2008, 08:27 AM
To get a separate count for each file, store it in an associative array keyed by FILENAME:


/posted/ {
...
store[FILENAME]+=1
..
}

then at the END block


END {
...
....
for ( i in store ) {
print "total number of posted for file " i " is " store[i]
}
}



Yes, I meant changing print "<TR>" > onefile to print "<TR>" >> onefile. Also, since you want to output to only 1 file, you can remove all the "> onefile" and redirect to the new file on the command line instead:


# ./script.sh > onefile


you should spend some time going over the materials i gave you.
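
A minimal sketch of the per-file count, run against two made-up sample files (a.html and b.html are placeholders, not files from the real forum). It counts the lines containing "posted" in each file and prints the count minus one, on the assumption that the first "posted" line is the opening post and the rest are replies.

```shell
tmp=$(mktemp -d)
printf 'posted 04-30-2000 03:10 AM\nposted 04-30-2000 03:17 AM\n' > "$tmp/a.html"
printf 'posted 05-01-2000 09:00 PM\n' > "$tmp/b.html"

replies=$(cd "$tmp" && awk '
  /posted/ { store[FILENAME] += 1 }        # one counter per input file
  END {
    for (f in store)                       # order of "for in" is unspecified
      print f, store[f] - 1
  }' a.html b.html | sort)

echo "$replies"
rm -rf "$tmp"
```

The pipe through sort is only there to get a stable order, since awk does not promise any particular order when looping over an array.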

hopelessone
March 30th, 2008, 07:46 AM
thanks..

i have to do it to all of them so i don't keep getting the same thing twice...

title[FILENAME]=$0 <-----is this correct?

END {
for ( i in title,store ) {

I'm getting a syntax error on the comma line title,store? whats the separator?

thanks..:popcorn:

ghostdog74
March 30th, 2008, 08:59 AM
Yes, title[FILENAME]=$0 is correct.

for ( i in title,store ) is a syntax error because a for ( i in array ) loop takes a single array name; you can't use it like that. Check the awk link I gave you for the for loop syntax. Alternatively, you can go here. (http://www.gnu.org/software/gawk/manual/gawk.html)
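
A minimal sketch of the usual workaround: key every array by the same value (FILENAME here) and loop over just one of them, using the key to look up the others. The sample file contents, the user array and the way the username is picked out of a <B> line are illustrative guesses, not taken from the real forum files.

```shell
tmp=$(mktemp -d)
printf '<TITLE>First thread</title>\n<B>alice</B>\nposted 04-30-2000 03:10 AM\n' > "$tmp/000001.html"
printf '<TITLE>Second thread</title>\n<B>bob</B>\nposted 05-02-2000 11:45 PM\n' > "$tmp/000002.html"

rows=$(cd "$tmp" && awk '
  /<TITLE>/ { t = $0; gsub(/<TITLE>|<\/title>/, "", t); title[FILENAME] = t }
  /<B>/     { u = $0; gsub(/<B>|<\/B>/, "", u); user[FILENAME] = u }
  /posted/  { date[FILENAME] = $2 }        # later matches overwrite earlier ones
  END {
    # Loop over one array; the same key indexes the others.
    for (f in title)
      print "<A HREF=\"../Forum6/HTML/" f "\">" title[f] "</A> " user[f] " " date[f]
  }' 000001.html 000002.html | sort)

echo "$rows"
rm -rf "$tmp"
```

Note that the link uses the loop key f rather than FILENAME: inside END, FILENAME only holds the name of the last file read.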

hopelessone
March 30th, 2008, 09:44 AM
Hi,

in the links it goes:

for (i in u_count) {
if (i != "") {
print u_size[i], u_count[i], i;
}
}
for (i in g_count) {
if (i != "") {
print g_size[i], g_count[i], i;
}
}
for (i in ug_count) {
if (i != "") {
print ug_size[i], ug_count[i], i;
}
}
for (i in all_count) {
if (i != "") {
print all_size[i], all_count[i], i;
}
}

If I follow these examples I get each line 1300 times, not all together...

I know it has to do with where I put the for command. If I put it at the top I can get one field only; if I put one for command inside the other I get a huge file, e.g.

for ( i in title ) {
for ( i in user ) {
print "<TR>"
print "<TD bgcolor=\"#F7F7F7\">"
print "<IMG SRC=\"../images/closed.gif\" BORDER=0></td>"
print "<TD bgcolor=\"#F7F7F7\"><FONT SIZE=\"2\" FACE=\"Verdana, Arial\">"
print "<A HREF=\"../Forum6/HTML/" FILENAME "\">" title[i] "</A>"
print "</FONT>"
print "</td>"
print "<td bgcolor=\"#DEDFDF\">"
print "<FONT SIZE=\"2\" FACE=\"Verdana, Arial\">" user[i] "</FONT>"
print "</td>"
print "<td align=center bgcolor=\"#F7F7F7\">"
etc...
}
}

so where would be the correct place to put all of them? :confused:

thanks..

hopelessone
March 30th, 2008, 09:52 AM
What about a IF statement within the FOR statement?

for ( i in title ) {
if(i=i) {
print "<A HREF=\"../Forum6/HTML/" FILENAME "\">" title[i] "</A>"
etc...
title[i]=""
}
}

Worked !!!

hopelessone
March 30th, 2008, 10:01 AM
next i will figure out how to sort from most recent date and time...(any hints?)

thanks ghostdog74 :KS:KS:KS

ghostdog74
March 30th, 2008, 10:18 AM
pls read the manual (http://www.gnu.org/manual/gawk/html_node/Array-Sorting.html)!

hopelessone
March 30th, 2008, 12:30 PM
so far i got:

END {
n = asorti(date, time)
for (i = 1; i <= n; i++) {
for ( i in title ) {

if (i=i) {
print "<TR>"
etc...

doesn't seem to work...workin on it... :popcorn:

hopelessone
March 30th, 2008, 02:47 PM
hi,

does awk get all the values for all the files first then write to file? or each file then write to file?

thanks
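
In short: awk reads the files one after another, runs the matching pattern blocks on each line as it goes, and runs END once after the last line of the last file. Nothing is written until a print executes, so where the print sits decides when the output appears. A small trace to illustrate (x.txt and y.txt are throwaway sample files):

```shell
tmp=$(mktemp -d)
printf 'one\ntwo\n' > "$tmp/x.txt"
printf 'three\n' > "$tmp/y.txt"

trace=$(cd "$tmp" && awk '
  FNR == 1 { print "start " FILENAME }     # FNR restarts at 1 for every file
           { print "  line " $0 }
  END      { print "end after " NR " lines" }
  ' x.txt y.txt)

echo "$trace"
rm -rf "$tmp"
```

So values kept in plain variables are overwritten file after file, and a print in END only sees what was stored last unless the values were saved per file, for example in arrays keyed by FILENAME.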

hopelessone
March 31st, 2008, 01:03 PM
i still don't get why this wont work?

dates are either:
12-25-1999
12-25-99

END {
n = asort(date)
for (i = 1; i <= n; i++)

for ( i in title ) {

if (i=i) {

etc...
print "<FONT SIZE=\"2\" FACE=\"Verdana, Arial\">" date[i] " <FONT SIZE=\"2\" FACE=\"Verdana, Arial\" COLOR=\"#800080\">" time[i] " " am[i] "</FONT></FONT>"
etc...
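
One likely reason this does not work: in gawk, asort(date) sorts the values as text and renumbers the array from 1 to n, so the filename keys are lost, and MM-DD-YYYY strings do not sort chronologically as text anyway. A possible workaround, sketched under assumptions and not tested on the real files: turn each date into a YYYYMMDD key, print the key next to the filename, and let sort order the rows. The helper name sortable_dates is made up, the rule that two-digit years below 70 mean 20xx is an arbitrary choice for illustration, and the time of day is ignored.

```shell
# sortable_dates: hypothetical helper; reads "MM-DD-YY[YY] name" lines on stdin
# and prints the names, most recent date first.
sortable_dates() {
  awk '{
    split($1, d, "-")                        # d[1]=month, d[2]=day, d[3]=year
    year = d[3] + 0
    if (year < 70) year += 2000              # assumed pivot for 2-digit years
    else if (year < 100) year += 1900
    printf "%04d%02d%02d %s\n", year, d[1], d[2], $2
  }' | sort -r | awk '{ print $2 }'
}

newest_first=$(printf '%s\n' '12-25-99 a.html' '04-30-2000 b.html' '12-25-1999 c.html' | sortable_dates)
echo "$newest_first"
```

In the real script the END block could print the key in front of each finished table row instead of the filename, and the key could be cut off again after sorting.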

hopelessone
April 1st, 2008, 03:37 AM
Given up...will do the rest by hand...

anyway thanks... learned a lot...