[SOLVED] How can i read lots of files and find the last post? [Archive]

hopelessone

March 27th, 2008, 03:00 PM

Hi,

I come from VB6, so I don't know much about this.

I have a folder with 1300 files. They are posts from a forum.

In each file I want to find the last line like this:

posted whatever date whatever time
e.g.
posted 04-30-2000 03:17 AM

Then take the title:

<TITLE>Whatever - The Archives</title>

and remove the " - The Archives" part.

Then the username:

Username

Then copy the name of the file, e.g. 003247.html, and add ../Forum6/HTML/ in front of it.

Finally, put them all in one file that looks like:

<TR>
<TD bgcolor="#F7F7F7">
<IMG SRC="../images/closed.gif" BORDER=0></td>
<TD bgcolor="#F7F7F7">
<A HREF="../Forum6/HTML/003247.html">Whatever</A>

</td>
<td bgcolor="#DEDFDF">
Username
</td>
<td align=center bgcolor="#F7F7F7">
1
</td>
<td NOWRAP bgcolor="#DEDFDF">
04-30-2000 03:10 AM
</td></tr>

<TR>
<TD bgcolor="#F7F7F7">
Etc....
</td></tr>

<TR>
<TD bgcolor="#F7F7F7">
Etc....
</td></tr>

How can I achieve this?

thanks for reading.. :)

ghostdog74

March 27th, 2008, 03:38 PM

not tested, since you didn't provide samples.

awk 'BEGIN{ IGNORECASE=1; onefile="newfile.html"}
# IGNORECASE only works in gawk
# remember the date and time of the most recent "posted" line
/posted/{
date=$2
time=$3
}
# strip the tags and the archive suffix, keep the rest as the title
/<TITLE>/{
gsub(/<TITLE>|<\/TITLE>|- The Archives/,"")
title=$0
}

# END runs once, after the last line of input
END {
print "<TR>" > onefile
print "<TD bgcolor=\"#F7F7F7\">" > onefile
print "<IMG SRC=\"../images/closed.gif\" BORDER=0></td>" > onefile
print "<TD bgcolor=\"#F7F7F7\">" > onefile
print "<A HREF=\"../Forum6/HTML/" FILENAME "\">" title "</A>" > onefile
print "" > onefile
print "</td>" > onefile
print "<td bgcolor=\"#DEDFDF\">" > onefile
print "Username" > onefile
print "</td>" > onefile
print "<td align=center bgcolor=\"#F7F7F7\">" > onefile
print "1" > onefile
print "</td>" > onefile
print "<td NOWRAP bgcolor=\"#DEDFDF\">" > onefile
print date " " time > onefile
print "</td></tr>" > onefile

}' file

tgalati4

March 27th, 2008, 03:51 PM

>man grep
>man perl
>man awk

>grep "Whatever I'm Looking For" *.html > mylist.txt

hopelessone

March 28th, 2008, 02:02 PM

ghostdog74 - I really appreciate what you did for me... (sorry about the missing sample file)

This part doesn't give me quite what I want:

/<TITLE>/{
gsub(/<TITLE>|<\/TITLE>|- The Archives/,"")
title=$0

Original file:
<TITLE>Whatever - The Archives</title><META HTTP-EQUIV="Pragma" CONTENT="no-cache">

generated file:
<A HREF="../Forum6/HTML/003247.html">Whatever - The Archives<META HTTP-EQUIV="Pragma" CONTENT="no-cache"></A>

Would like:
<A HREF="../Forum6/HTML/003247.html">Whatever</A>

+ just noticed the files sometimes change from "The Archives" to "The BB"

thanks for trying to help me on this...

ghostdog74

March 29th, 2008, 04:08 AM

provide your sample file!

hopelessone

March 29th, 2008, 04:24 AM

It's the third line

sometimes says:

- The Archives or - The BB

in this case says - The BB

Thanks for helpin..!!

ghostdog74

March 29th, 2008, 04:40 AM

this is not the original file, right?
it has

<TITLE>Whatever - the BB</title><META HTTP-EQUIV="Pragma" CONTENT="no-cache">

if your original file has

<TITLE>Whatever - The Archives</title><META HTTP-EQUIV="Pragma" CONTENT="no-cache">

then this part

gsub(/<TITLE>|<\/TITLE>|- The Archives/,"")
will remove the TITLE tags and the text "- The Archives" from the line, leaving "Whatever" as the title.

because you do not have "- The Archives" in

<TITLE>Whatever - the BB</title><META HTTP-EQUIV="Pragma" CONTENT="no-cache">

the "- the BB" part stays in the title.

If "- the BB" is also something you don't need, just add it to the gsub

gsub(/<TITLE>|<\/TITLE>|- The Archives|- the BB/,"")
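
Since the two suffixes differ only in the last word, they can also be folded into one group. A minimal sketch, assuming the title text looks like the samples quoted above. The strip_suffix name and the test lines are made up for illustration, and the [Tt] class is used instead of IGNORECASE so it also works outside gawk.

```shell
# strip_suffix: hypothetical helper. Reads title text on stdin and removes a
# trailing " - The Archives" or " - The BB" ("the" may be lower case).
strip_suffix() {
    awk '{ sub(/ - [Tt]he (Archives|BB)$/, ""); print }'
}

printf '%s\n' 'Whatever - The Archives' 'Whatever - the BB' | strip_suffix
```

Lines without either suffix pass through unchanged.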

hopelessone

March 29th, 2008, 04:51 AM

Wow, that's great, I was trying to get that working...

It still gives me <META HTTP-EQUIV="Pragma" CONTENT="no-cache"> after the title though...

Any way to trim this off?

thanks..:KS

ghostdog74

March 29th, 2008, 05:16 AM

gsub(/<TITLE>|<\/TITLE>|- The Archives|- the BB|<META.*\"no-cache\">/,"")
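
Another way to get the same result is to keep only what sits between the two tags instead of deleting everything around it, so whatever follows </title> no longer matters. This is only a sketch under the assumption that both tags are on one line, as in the sample. title_of is an invented name.

```shell
# title_of: hypothetical helper. Prints the text between <title> and </title>
# from the first line that has both, minus the " - The ..." suffix.
title_of() {
    awk '{
        low = tolower($0)                  # lower-case copy, used only for searching
        start = index(low, "<title>")
        stop  = index(low, "</title>")
        if (start && stop > start) {
            t = substr($0, start + 7, stop - start - 7)   # 7 = length("<title>")
            sub(/ - [Tt]he (Archives|BB)$/, "", t)
            print t
            exit                           # one title per input is enough
        }
    }' "$@"
}

printf '%s\n' '<TITLE>Whatever - the BB</title><META HTTP-EQUIV="Pragma" CONTENT="no-cache">' | title_of
```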

hopelessone

March 29th, 2008, 05:25 AM

Ahhh i see how it works now...

Thanks...

i'm trying to figure out how to count how many times the word "posted" appears and subtract 1 from it...

so far i got:

/posted/{
date=$6
time=$7
am=$8
twords += wc +1

gives me the total posts not -1

thank Q for your patience..

ghostdog74

March 29th, 2008, 05:43 AM

just increment a counter.

/posted/ {
...
..
p++
}

then print out at the END block

END {
...
...
...
print "How many times posted appears : " (p - 1)
}

learn some shell, awk (http://www.grymoire.com/Unix/) here.
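
Put together as something that can be run, the counter idea might look like the sketch below. It counts lines containing "posted", not individual occurrences, which is the same thing as long as each post has one such line (an assumption). count_replies is a made-up name.

```shell
# count_replies: hypothetical helper. Counts the lines that contain "posted"
# (any capitalisation) and prints that count minus one, never below zero.
count_replies() {
    awk 'tolower($0) ~ /posted/ { p++ }
         END { print (p > 0 ? p - 1 : 0) }' "$@"
}

printf '%s\n' 'posted 04-30-2000 03:10 AM' 'some text' 'Posted 04-30-2000 03:17 AM' | count_replies
```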

hopelessone

March 29th, 2008, 05:57 AM

Thanks !!!!

:KS :KS :KS

hopelessone

March 29th, 2008, 06:07 AM

Doh !!!

i just tried to do it for all 1300 files in the folder..

just gives one posting...and not:

<TR>
<TD bgcolor="#F7F7F7">
Etc....
</td></tr>

<TR>
<TD bgcolor="#F7F7F7">
Etc....
</td></tr>

I changed the last line from }' 001345.html to:

}' *.html

sniff sniff...

Ideas?

ghostdog74

March 29th, 2008, 06:11 AM

change the ">" to ">>"
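
One thing worth checking before switching operators: inside awk, ">" does not behave like the shell's. The file is truncated only the first time the program opens it, and later prints to the same name in the same run are appended. So ">" versus ">>" only decides whether output from an earlier run is kept. A small sketch (the file names are made up):

```shell
tmp=$(mktemp -d)

# One run, two prints with ">" to the same file: both lines survive,
# because awk truncates the file only when it first opens it.
awk -v out="$tmp/gt.txt" 'BEGIN { print "one" > out; print "two" > out }'
first=$(cat "$tmp/gt.txt")

# A second run with ">" truncates again, so only this run is left.
awk -v out="$tmp/gt.txt" 'BEGIN { print "three" > out }'
second=$(cat "$tmp/gt.txt")

# With ">>" a second run adds to what the first run wrote.
awk -v out="$tmp/gtgt.txt" 'BEGIN { print "one" >> out }'
awk -v out="$tmp/gtgt.txt" 'BEGIN { print "two" >> out }'
appended=$(cat "$tmp/gtgt.txt")

rm -rf "$tmp"
```

In the script earlier in the thread every print sits in the END block, and END runs once after the last file, so the output holds a single row whichever operator is used.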

hopelessone

March 29th, 2008, 06:31 AM

Hi,

I get the same thing twice, but with a total of 121210 posts with p++.. << how to reset this after each file

you meant this right?

from this print "<TR>" > onefile to print "<TR>" >> onefile?

thanks

ghostdog74

March 29th, 2008, 08:27 AM

[QUOTE=hopelessone]I get the same thing twice, but with a total of 121210 posts with p++.. << how to reset this after each file[/QUOTE]

store it in an associative array

/posted/ {
...
store[FILENAME]+=1
..
}

then at the END block

END {
...
....
for ( i in store ) {
print "total number of posted for file " i " is " store[i]
}
}

[QUOTE=hopelessone]you meant this right?

from this print "<TR>" > onefile to print "<TR>" >> onefile?[/QUOTE]

yes. also, since you want to output to only 1 file, you can remove all the "> onefile" and redirect to the new file on the command line instead

# ./script.sh > onefile

you should spend some time going over the materials i gave you.
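
A runnable sketch of the two suggestions above, with the caveat that per_file_counts and counts.txt are invented names. Keying the counter on FILENAME means nothing has to be reset between files, and the script only writes to standard output, so the destination is chosen on the command line.

```shell
# per_file_counts: hypothetical helper. Prints "<file> <count>" for each file,
# where count is the number of lines containing "posted".
per_file_counts() {
    awk 'FNR == 1 { names[++n] = FILENAME }            # remember files in command-line order
         tolower($0) ~ /posted/ { store[FILENAME]++ }  # one counter per file
         END {
             for (k = 1; k <= n; k++)
                 print names[k], store[names[k]] + 0   # + 0 prints 0 for files with no match
         }' "$@"
}

# Redirecting here replaces every "> onefile" inside the script:
# per_file_counts *.html > counts.txt
```

The names array is there because "for (i in store)" visits the keys in no particular order.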

hopelessone

March 30th, 2008, 07:46 AM

thanks..

i have to do it to all of them so i don't keep getting the same thing twice...

title[FILENAME]=$0 <-----is this correct?

END {
for ( i in title,store ) {

I'm getting a syntax error on the comma line title,store? whats the separator?

thanks..:popcorn:

ghostdog74

March 30th, 2008, 08:59 AM

[QUOTE=hopelessone]i have to do it to all of them so i don't keep getting the same thing twice...

title[FILENAME]=$0 <-----is this correct?[/QUOTE]

yes

[QUOTE=hopelessone]END {
for ( i in title,store ) {

I'm getting a syntax error on the comma line title,store? whats the separator?[/QUOTE]

you can't use it like that. Check the awk link i gave you for the for loop syntax. Alternatively, you can go here. (http://www.gnu.org/software/gawk/manual/gawk.html)
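
The usual pattern here is a single loop: when every array is filled with the same key (FILENAME), looping over one of them gives a key that can index all the others. A minimal sketch, assuming the "posted" line starts with the word itself so the date, time and AM/PM are fields 2 to 4 (in the real files the field numbers may differ). rows is a made-up name.

```shell
# rows: hypothetical helper. Prints one "file|title line|last posted date"
# line per input file.
rows() {
    awk '{ low = tolower($0) }
         low ~ /<title>/ { title[FILENAME] = $0 }
         low ~ /posted/  { date[FILENAME] = $2 " " $3 " " $4 }   # later matches overwrite earlier ones
         END {
             # "for (i in a, b)" is not valid awk. Loop over one array and
             # use the same key to index the others.
             for (f in title)
                 print f "|" title[f] "|" date[f]
         }' "$@"
}
```

Inside END the key f is the file name, which is safer than FILENAME: by then FILENAME only holds the last file read.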

hopelessone

March 30th, 2008, 09:44 AM

Hi,

in the links it goes:

for (i in u_count) {
if (i != "") {
print u_size[i], u_count[i], i;
}
}
for (i in g_count) {
if (i != "") {
print g_size[i], g_count[i], i;
}
}
for (i in ug_count) {
if (i != "") {
print ug_size[i], ug_count[i], i;
}
}
for (i in all_count) {
if (i != "") {
print all_size[i], all_count[i], i;
}
}

if i follow these examples i get each line x 1300 not all together...

I know it's to do with where i put the for command...if i put it at the top i can get one field only...if i put the for command within the other for command i get a huge file...e.g.

for ( i in title ) {
for ( i in user ) {
print "<TR>"
print "<TD bgcolor=\"#F7F7F7\">"
print "<IMG SRC=\"../images/closed.gif\" BORDER=0></td>"
print "<TD bgcolor=\"#F7F7F7\">"
print "<A HREF=\"../Forum6/HTML/" FILENAME "\">" title[i] "</A>"
print ""
print "</td>"
print "<td bgcolor=\"#DEDFDF\">"
print "" user[i] ""
print "</td>"
print "<td align=center bgcolor=\"#F7F7F7\">"
etc...
}
}

so where would be the correct place to put all of them? :confused:

thanks..

hopelessone

March 30th, 2008, 09:52 AM

What about a IF statement within the FOR statement?

for ( i in title ) {
if(i=i) {
print "<A HREF=\"../Forum6/HTML/" FILENAME "\">" title[i] "</A>"
etc...
title[i]=""
}
}

Worked !!!

hopelessone

March 30th, 2008, 10:01 AM

next i will figure out how to sort from most recent date and time...(any hints?)

thanks ghostdog74 :KS:KS:KS

ghostdog74

March 30th, 2008, 10:18 AM

[QUOTE=hopelessone]next i will figure out how to sort from most recent date and time...(any hints?)

thanks ghostdog74 :KS:KS:KS[/QUOTE]

pls read the manual (http://www.gnu.org/manual/gawk/html_node/Array-Sorting.html)!

hopelessone

March 30th, 2008, 12:30 PM

so far i got:

END {
n = asorti(date, time)
for (i = 1; i <= n; i++) {
for ( i in title ) {

if (i=i) {
print "<TR>"
etc...

doesn't seem to work...workin on it... :popcorn:

hopelessone

March 30th, 2008, 02:47 PM

hi,

does awk get all the values for all the files first then write to file? or each file then write to file?

thanks
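
For reference, awk reads the input one line at a time, file after file, and runs the pattern rules on each line as it goes. BEGIN runs once before any input and END once after the last line of the last file. Anything printed from a pattern rule is written as that line is processed, while anything printed from END waits until every file has been read. The sketch below writes a line at each file boundary instead of saving everything for END. FNR == 1 marks the first line of a new file. line_counts and report are made-up names.

```shell
# line_counts: hypothetical helper. Prints "<file>: <n> lines" for each file,
# as soon as the next file starts.
line_counts() {
    awk 'function report() { if (cur != "") print cur ": " n " lines" }
         # first line of a new file: report the previous file, then start over
         FNR == 1 { report(); cur = FILENAME; n = 0 }
         { n++ }
         # the last file has no following file, so report it here
         END { report() }' "$@"
}
```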

hopelessone

March 31st, 2008, 01:03 PM

i still don't get why this wont work?

dates are either:
12-25-1999
12-25-99

END {
n = asort(date)
for (i = 1; i <= n; i++)

for ( i in title ) {

if (i=i) {

etc...
print "" date[i] " " time[i] " " am[i] ""
etc...
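
Two things get in the way here. gawk's asort() sorts the values and renumbers the array 1..n, so the link back to the file name is lost, and MM-DD-YYYY strings do not sort chronologically anyway. One way around both, sketched with made-up names: build a sortable key (YYYYMMDDHHMM) in front of each line and let sort(1) do the ordering. It assumes input lines of the form "file date time AM|PM", dates as MM-DD-YY or MM-DD-YYYY, 12-hour times, and that two-digit years below 70 mean 20xx.

```shell
# newest_first: hypothetical helper. Reads "file date time AM|PM" lines and
# prints them newest first.
newest_first() {
    awk '{
        split($2, d, "-")                    # d[1]=month d[2]=day d[3]=year
        split($3, t, ":")                    # t[1]=hour  t[2]=minute
        year = d[3] + 0
        if (year < 100) year += (year < 70 ? 2000 : 1900)   # assumed pivot for 2-digit years
        hour = t[1] % 12                     # 12 AM becomes 0, 12 PM becomes 0 + 12
        if (toupper($4) == "PM") hour += 12
        printf "%04d%02d%02d%02d%02d %s\n", year, d[1], d[2], hour, t[2], $0
    }' "$@" | sort -r | cut -d" " -f2-
}

printf '%s\n' 'a.html 12-25-99 11:59 PM' 'b.html 04-30-2000 03:17 AM' | newest_first
```

The key is fixed width, so a plain reverse text sort puts the most recent line first, and cut drops the key again.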

hopelessone

April 1st, 2008, 03:37 AM

Given up...will do the rest by hand...

anyway thanks ...learned a lot...