[SOLVED] How can I read lots of files and find the last post?
hopelessone
March 27th, 2008, 03:00 PM
Hi,
I come from VB6, so I don't know much about this.
I have a folder with 1300 files; they were posts in a forum.
I want to search each file for the last line of this form:
posted whatever date whatever time
e.g.
posted 04-30-2000 03:17 AM
Then, from the title line
<TITLE>Whatever - The Archives</title>
remove the " - The Archives" part, and take the username from
<FONT SIZE="2" face="Verdana, Arial"><B>Username</B>
Copy the name of the file, e.g. 003247.html,
add ../Forum6/HTML/ in front of it,
and put them all in one file that looks like:
<TR>
<TD bgcolor="#F7F7F7">
<IMG SRC="../images/closed.gif" BORDER=0></td>
<TD bgcolor="#F7F7F7"><FONT SIZE="2" FACE="Verdana, Arial">
<A HREF="../Forum6/HTML/003247.html">Whatever</A>
</FONT>
</td>
<td bgcolor="#DEDFDF">
<FONT SIZE="2" FACE="Verdana, Arial">Username</FONT>
</td>
<td align=center bgcolor="#F7F7F7">
<FONT SIZE="2" FACE="Verdana, Arial">1</FONT>
</td>
<td NOWRAP bgcolor="#DEDFDF">
<FONT SIZE="2" FACE="Verdana, Arial">04-30-2000 <FONT SIZE="2" FACE="Verdana, Arial" COLOR="#800080">03:10 AM</FONT></FONT>
</td></tr>
<TR>
<TD bgcolor="#F7F7F7">
Etc....
</td></tr>
<TR>
<TD bgcolor="#F7F7F7">
Etc....
</td></tr>
How can I achieve this?
Thanks for reading. :)
ghostdog74
March 27th, 2008, 03:38 PM
Not tested, since you didn't provide samples.
awk 'BEGIN { IGNORECASE = 1; onefile = "newfile.html" }  # IGNORECASE needs gawk
# remember the date and time from the most recent "posted" line
/posted/ {
    date = $2
    time = $3
}
# strip the TITLE tags and the "- The Archives" suffix from the title line
/<TITLE>/ {
    gsub(/<TITLE>|<\/TITLE>|- The Archives/, "")
    title = $0
}
# after all input has been read, write one table row
END {
    print "<TR>" > onefile
    print "<TD bgcolor=\"#F7F7F7\">" > onefile
    print "<IMG SRC=\"../images/closed.gif\" BORDER=0></td>" > onefile
    print "<TD bgcolor=\"#F7F7F7\"><FONT SIZE=\"2\" FACE=\"Verdana, Arial\">" > onefile
    print "<A HREF=\"../Forum6/HTML/" FILENAME "\">" title "</A>" > onefile
    print "</FONT>" > onefile
    print "</td>" > onefile
    print "<td bgcolor=\"#DEDFDF\">" > onefile
    print "<FONT SIZE=\"2\" FACE=\"Verdana, Arial\">Username</FONT>" > onefile
    print "</td>" > onefile
    print "<td align=center bgcolor=\"#F7F7F7\">" > onefile
    print "<FONT SIZE=\"2\" FACE=\"Verdana, Arial\">1</FONT>" > onefile
    print "</td>" > onefile
    print "<td NOWRAP bgcolor=\"#DEDFDF\">" > onefile
    print "<FONT SIZE=\"2\" FACE=\"Verdana, Arial\">" date " <FONT SIZE=\"2\" FACE=\"Verdana, Arial\" COLOR=\"#800080\">" time "</FONT></FONT>" > onefile
    print "</td></tr>" > onefile
}' file
tgalati4
March 27th, 2008, 03:51 PM
>man grep
>man perl
>man awk
>grep "Whatever I'm Looking For" *.html > mylist.txt
hopelessone
March 28th, 2008, 02:02 PM
ghostdog74 - I really appreciate what you did for me. (Sorry about the missing sample file.)
With these lines:
/<TITLE>/{
gsub(/<TITLE>|<\/TITLE>|- The Archives/,"")
title=$0
Original file:
<TITLE>Whatever - The Archives</title><META HTTP-EQUIV="Pragma" CONTENT="no-cache">
generated file:
<A HREF="../Forum6/HTML/003247.html">Whatever - The Archives<META HTTP-EQUIV="Pragma" CONTENT="no-cache"></A>
Would like:
<A HREF="../Forum6/HTML/003247.html">Whatever</A>
Also, I just noticed the files sometimes say "The BB" instead of "The Archives".
Thanks for trying to help me with this.
ghostdog74
March 29th, 2008, 04:08 AM
provide your sample file!
hopelessone
March 29th, 2008, 04:24 AM
It's the third line.
It sometimes says:
- The Archives or - The BB
In this case it says - The BB
Thanks for helping!!
ghostdog74
March 29th, 2008, 04:40 AM
This is not the original file, right?
It has
<TITLE>Whatever - the BB</title><META HTTP-EQUIV="Pragma" CONTENT="no-cache">
If your original file has
<TITLE>Whatever - The Archives</title><META HTTP-EQUIV="Pragma" CONTENT="no-cache">
then this part
gsub(/<TITLE>|<\/TITLE>|- The Archives/,"")
will remove the TITLE tags and the text "- The Archives" from the line, leaving "Whatever" as the final title.
Because you do not have "- The Archives" in
<TITLE>Whatever - the BB</title><META HTTP-EQUIV="Pragma" CONTENT="no-cache">
what you get is the title text itself, unchanged.
If "- the BB" is also something you don't need, just add it to the gsub:
gsub(/<TITLE>|<\/TITLE>|- The Archives|- the BB/,"")
hopelessone
March 29th, 2008, 04:51 AM
Wow, that's great; I was trying to get that working.
It still gives me <META HTTP-EQUIV="Pragma" CONTENT="no-cache"> after the title, though.
Is there any way to trim this off?
Thanks. :KS
ghostdog74
March 29th, 2008, 05:16 AM
gsub(/<TITLE>|<\/TITLE>|- The Archives|<META.*\"no-cache\">/,"")
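A rough sketch of a cleanup that does not depend on listing every possible leftover: keep only the text between the title tags, then strip whichever " - The ..." suffix is present. The name clean_title is made up for illustration, and the sketch assumes the whole title sits on one line, as in the samples in this thread. It uses tolower() instead of IGNORECASE, so it should also run on awks other than gawk.

```shell
# clean_title: hypothetical helper; reads HTML on stdin, prints the bare title.
clean_title() {
    awk '
    tolower($0) ~ /<title>/ {
        line = $0
        low = tolower(line)
        # keep only the text between <title> and </title>
        start = index(low, "<title>") + length("<title>")
        stop = index(low, "</title>")
        if (stop == 0) stop = length(line) + 1
        title = substr(line, start, stop - start)
        # drop a trailing " - The Archives" or " - The BB", in any letter case
        low = tolower(title)
        if (match(low, / - the (archives|bb) *$/))
            title = substr(title, 1, RSTART - 1)
        print title
        exit
    }'
}

# demo on the sample line quoted earlier in the thread
printf '%s\n' '<TITLE>Whatever - the BB</title><META HTTP-EQUIV="Pragma" CONTENT="no-cache">' | clean_title
```

Because everything after the closing title tag is cut off first, the META tag never needs its own pattern.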
hopelessone
March 29th, 2008, 05:25 AM
Ahhh, I see how it works now.
Thanks.
I'm trying to figure out how to count how many times the word "posted" appears and subtract 1 from it.
So far I've got:
/posted/{
date=$6
time=$7
am=$8
twords += wc +1
}
This gives me the total number of posts, not the total minus 1.
Thank you for your patience.
ghostdog74
March 29th, 2008, 05:43 AM
Just increment a counter.
/posted/ {
...
..
p++
}
Then print it out in the END block:
END {
...
...
...
print "How many times posted appears : " (p - 1)
}
Learn some shell and awk (http://www.grymoire.com/Unix/) here.
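A minimal, self-contained version of the counter idea for a single file, as a sketch only. The helper name count_replies is hypothetical, and it assumes every post in the file contributes exactly one line containing "posted".

```shell
# count_replies: hypothetical helper; prints (lines containing "posted") - 1 for one file.
count_replies() {
    awk '
    tolower($0) ~ /posted/ { p++ }        # one more post seen
    END { print (p > 0 ? p - 1 : 0) }     # never go below zero
    ' "$1"
}

# demo on a throwaway file with three "posted" lines
demo=$(mktemp)
printf 'posted 04-30-2000 03:10 AM\nposted 04-30-2000 03:15 AM\nposted 04-30-2000 03:17 AM\n' > "$demo"
count_replies "$demo"
rm -f "$demo"
```

The parentheses around the arithmetic keep it from being mixed up with string concatenation or with output redirection.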
hopelessone
March 29th, 2008, 05:57 AM
Thanks !!!!
:KS :KS :KS
hopelessone
March 29th, 2008, 06:07 AM
Doh !!!
I just tried to do it for all 1300 files in the folder.
It just gives one posting, and not:
<TR>
<TD bgcolor="#F7F7F7">
Etc....
</td></tr>
<TR>
<TD bgcolor="#F7F7F7">
Etc....
</td></tr>
I changed the last line from }' 001345.html to:
}' *.html
sniff sniff...
Ideas?
ghostdog74
March 29th, 2008, 06:11 AM
change the ">" to ">>"
hopelessone
March 29th, 2008, 06:31 AM
Hi,
I get the same thing twice, but with a total of 121210 posts from p++. How do I reset this after each file?
You meant this, right?
Change print "<TR>" > onefile to print "<TR>" >> onefile?
Thanks
ghostdog74
March 29th, 2008, 08:27 AM
To get a separate count for each file instead of one running total, store it in an associative array keyed by FILENAME:
/posted/ {
...
store[FILENAME]+=1
..
}
then at the END block
END {
...
....
for ( i in store ) {
print "total number of posted for file " i " is " store[i]
}
}
Yes, change print "<TR>" > onefile to print "<TR>" >> onefile.
Also, since you want to output to only one file, you can remove all the "> onefile" parts and redirect to the new file on the command line instead:
# ./script.sh > onefile
You should spend some time going over the materials I gave you.
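Put together as something runnable, the per-file count might look like this. It is a sketch: posts_per_file is a made-up name, the match on "posted" is as loose as in the snippets above, and the output goes to stdout so it can be redirected on the command line.

```shell
# posts_per_file: hypothetical helper; prints "file count" for every file given.
posts_per_file() {
    awk '
    FNR == 1 { store[FILENAME] += 0 }              # make sure every file gets an entry
    tolower($0) ~ /posted/ { store[FILENAME]++ }   # count per file, not overall
    END {
        for (f in store)                           # key order is unspecified
            print f, store[f]
    }' "$@" | sort
}

# demo with two throwaway files
dir=$(mktemp -d)
printf 'posted 01-02-2000 10:00 AM\nposted 01-03-2000 11:00 AM\n' > "$dir/a.html"
printf 'posted 02-05-2000 09:30 PM\n' > "$dir/b.html"
posts_per_file "$dir"/*.html
rm -rf "$dir"
```

The trailing sort is there only because awk does not promise any particular order for a for-in loop.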
hopelessone
March 30th, 2008, 07:46 AM
Thanks.
I have to do it for all of them so I don't keep getting the same thing twice.
title[FILENAME]=$0 <----- is this correct?
END {
for ( i in title,store ) {
I'm getting a syntax error on the line with the comma (title,store). What's the separator?
Thanks. :popcorn:
ghostdog74
March 30th, 2008, 08:59 AM
title[FILENAME]=$0 <----- is this correct?
Yes.
END {
for ( i in title,store ) {
You can't use it like that; a for-in loop walks the indices of a single array. Check the awk link I gave you for the for loop syntax. Alternatively, you can go here. (http://www.gnu.org/software/gawk/manual/gawk.html)
hopelessone
March 30th, 2008, 09:44 AM
Hi,
in the links it goes:
for (i in u_count) {
if (i != "") {
print u_size[i], u_count[i], i;
}
}
for (i in g_count) {
if (i != "") {
print g_size[i], g_count[i], i;
}
}
for (i in ug_count) {
if (i != "") {
print ug_size[i], ug_count[i], i;
}
}
for (i in all_count) {
if (i != "") {
print all_size[i], all_count[i], i;
}
}
If I follow these examples, I get each line 1300 times instead of everything together.
I know it has to do with where I put the for statement. If I put it at the top, I get only one field. If I put one for loop inside the other, I get a huge file, e.g.
for ( i in title ) {
for ( i in user ) {
print "<TR>"
print "<TD bgcolor=\"#F7F7F7\">"
print "<IMG SRC=\"../images/closed.gif\" BORDER=0></td>"
print "<TD bgcolor=\"#F7F7F7\"><FONT SIZE=\"2\" FACE=\"Verdana, Arial\">"
print "<A HREF=\"../Forum6/HTML/" FILENAME "\">" title[i] "</A>"
print "</FONT>"
print "</td>"
print "<td bgcolor=\"#DEDFDF\">"
print "<FONT SIZE=\"2\" FACE=\"Verdana, Arial\">" user[i] "</FONT>"
print "</td>"
print "<td align=center bgcolor=\"#F7F7F7\">"
etc...
}
}
So where would be the correct place to put all of them? :confused:
thanks..
hopelessone
March 30th, 2008, 09:52 AM
What about an IF statement within the FOR statement?
for ( i in title ) {
if(i=i) {
print "<A HREF=\"../Forum6/HTML/" FILENAME "\">" title[i] "</A>"
etc...
title[i]=""
}
}
Worked !!!
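For what it's worth, if (i=i) is an assignment rather than a comparison, so it is true for any non-empty file name; what actually helps is going back to a single loop whose key indexes every array. Also, inside END the FILENAME variable holds the name of the last file read, so the link is better built from the loop key. A sketch of the whole thing under some assumptions: make_rows is a made-up name, the row markup is shortened, the user name is assumed to sit in <B>...</B>, and the post line is assumed to look like "posted MM-DD-YYYY HH:MM AM".

```shell
# make_rows: hypothetical helper; prints one (shortened) table row per input file.
make_rows() {
    awk '
    FNR == 1 { order[++n] = FILENAME }                 # remember files in the order read
    tolower($0) ~ /<title>/ {
        t = $0
        sub(/^.*<[Tt][Ii][Tt][Ll][Ee]>/, "", t)        # drop everything up to <TITLE>
        sub(/<\/[Tt][Ii][Tt][Ll][Ee]>.*$/, "", t)      # drop </title> and the rest
        sub(/ - [Tt]he (Archives|BB) *$/, "", t)       # drop the forum suffix
        title[FILENAME] = t
    }
    match($0, /<B>[^<]*<\/B>/) {                       # assumed user-name markup
        user[FILENAME] = substr($0, RSTART + 3, RLENGTH - 7)
    }
    match($0, /posted [0-9]+-[0-9]+-[0-9]+ [0-9]+:[0-9]+ [AP]M/) {
        split(substr($0, RSTART, RLENGTH), w, " ")
        date[FILENAME] = w[2]                          # later matches overwrite earlier ones
        time[FILENAME] = w[3] " " w[4]
        posts[FILENAME]++
    }
    END {
        for (k = 1; k <= n; k++) {
            f = order[k]                               # one key indexes every array
            name = f
            sub(/^.*\//, "", name)                     # strip any directory part
            print "<TR>"
            print "<TD><A HREF=\"../Forum6/HTML/" name "\">" title[f] "</A></TD>"
            print "<TD>" user[f] "</TD>"
            print "<TD>" (posts[f] > 0 ? posts[f] - 1 : 0) "</TD>"
            print "<TD>" date[f] " " time[f] "</TD>"
            print "</TR>"
        }
    }' "$@"
}

# demo with one throwaway file
dir=$(mktemp -d)
printf '%s\n' '<TITLE>Whatever - The Archives</title><META HTTP-EQUIV="Pragma" CONTENT="no-cache">' \
    '<FONT SIZE="2" face="Verdana, Arial"><B>Username</B>' \
    'posted 04-30-2000 03:17 AM' > "$dir/003247.html"
make_rows "$dir"/*.html
rm -rf "$dir"
```

A plain for (f in title) loop would work the same way; the order[] array is only there so the rows come out in command-line order rather than in whatever order awk picks.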
hopelessone
March 30th, 2008, 10:01 AM
Next I will figure out how to sort from the most recent date and time (any hints?).
thanks ghostdog74 :KS:KS:KS
ghostdog74
March 30th, 2008, 10:18 AM
Please read the manual (http://www.gnu.org/manual/gawk/html_node/Array-Sorting.html)!
hopelessone
March 30th, 2008, 12:30 PM
So far I've got:
END {
n = asorti(date, time)
for (i = 1; i <= n; i++) {
for ( i in title ) {
if (i=i) {
print "<TR>"
etc...
It doesn't seem to work. Working on it... :popcorn:
hopelessone
March 30th, 2008, 02:47 PM
Hi,
Does awk collect the values from all the files first and then write the output, or does it process each file and write as it goes?
Thanks
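As far as standard awk behaviour goes: it reads every file named on the command line in turn, runs the pattern rules on each line as it is read, and runs END once after the last line of the last file. Nothing is written per file unless a rule prints it. So a plain variable such as title or date holds only the most recent value by the time END runs, and anything that has to survive per file needs an array keyed by FILENAME. A tiny sketch that makes the order visible (show_flow is a made-up name):

```shell
# show_flow: hypothetical helper; shows when rules and END run across several files.
show_flow() {
    awk '
    FNR == 1 { print "reading " FILENAME }              # FNR restarts at 1 for each file
    END { print "END after " NR " lines in total" }     # NR keeps counting across files
    ' "$@"
}

# demo with two throwaway files
dir=$(mktemp -d)
printf 'one\ntwo\n' > "$dir/a.html"
printf 'three\n' > "$dir/b.html"
show_flow "$dir/a.html" "$dir/b.html"
rm -rf "$dir"
```

The "reading" line appears once per file, while the END line appears only once at the very end.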
hopelessone
March 31st, 2008, 01:03 PM
I still don't get why this won't work.
The dates are either:
12-25-1999
12-25-99
END {
n = asort(date)
for (i = 1; i <= n; i++)
for ( i in title ) {
if (i=i) {
etc...
print "<FONT SIZE=\"2\" FACE=\"Verdana, Arial\">" date[i] " <FONT SIZE=\"2\" FACE=\"Verdana, Arial\" COLOR=\"#800080\">" time[i] " " am[i] "</FONT></FONT>"
etc...
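Two things probably get in the way here. In gawk, asort(date) sorts the values and renumbers the array from 1 to n, so the file-name keys are gone and date[i] no longer lines up with title[i]. And a string like 12-25-1999 compares month first, so even a correct sort would not be chronological. One possible way around both, sketched under assumptions: build a YYYYMMDDHHMM key for each row and let the sort command do the ordering. The name newest_first, the input layout ("file date time AM/PM" per line) and the rule that two-digit years of 70 and above mean 19xx are all assumptions for illustration.

```shell
# newest_first: hypothetical helper; reads "file MM-DD-YY[YY] HH:MM AM|PM" lines
# on stdin and prints them with the most recent first.
newest_first() {
    awk '
    # build a sortable YYYYMMDDHHMM key; extra parameters are local variables
    function sort_key(d, t, ap,    p, q, year, hour) {
        split(d, p, "-")
        year = p[3] + 0
        if (year < 100) year += (year >= 70 ? 1900 : 2000)   # assumed century rule
        split(t, q, ":")
        hour = q[1] % 12                                     # 12 AM becomes hour 0
        if (ap == "PM") hour += 12
        return sprintf("%04d%02d%02d%02d%02d", year, p[1], p[2], hour, q[2])
    }
    { print sort_key($2, $3, $4) "\t" $0 }                   # prefix each line with its key
    ' | sort -r | cut -f2-                                   # newest first, then drop the key
}

# demo with three made-up entries
printf '%s\n' '000012.html 12-25-99 11:05 PM' \
    '003247.html 04-30-2000 03:17 AM' \
    '000500.html 04-30-2000 12:10 AM' | newest_first
```

In a real script each input line could be the finished table row for one file, with the key in front so the rows travel through sort together with their dates.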
hopelessone
April 1st, 2008, 03:37 AM
I've given up; I will do the rest by hand.
Anyway, thanks. I learned a lot.