Bashscript - fetch www.site.com/something.asp?page=?



daller
February 26th, 2008, 11:52 PM
Hi there,

I have to fetch a lot of products for a website.

Basically I have to fetch these pages (and pictures on these pages!):

www.site.com/something.asp?page=?

...where the last "?" is all numbers between 1 and 100000.

How do I accomplish this?

yabbadabbadont
February 27th, 2008, 12:15 AM
http://www.httrack.com/

Martin Witte
February 27th, 2008, 12:17 AM
I don't know if this is possible in a shell script; in Python you can open URLs with urllib (see http://docs.python.org/lib/node577.html for some examples).

ghostdog74
February 27th, 2008, 01:42 AM
First, use wget to fetch one page. If it works, you can just use a for loop:


for number in `seq 1 100000`
do
    wget "http://www.site.com/something.asp?page=$number"
done

nhandler
February 27th, 2008, 01:51 AM
One thing to keep in mind is that if you are trying to set up an offline version of a site, the wget method will still be linking to the online version. httrack has options to modify the links and images to reference the offline copy instead.
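
For reference, a minimal httrack invocation looks something like the sketch below; the URL is the placeholder from the original post, and the exact options are worth double-checking against httrack's own documentation.


# Sketch only: mirror the site into ./site-mirror -- see `man httrack` for details.
httrack "http://www.site.com/" -O ./site-mirror


By default the saved copy is adjusted so it can be browsed offline.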

daller
February 27th, 2008, 12:25 PM
First, use wget to fetch one page. If it works, you can just use a for loop


I only need the content, and no external links...

...I also need the images, though! How do I set up wget to download the images on the pages without following other links? (wget -r works, but it crawls the whole site :))

...and can I slow it down a little? It seems to kill the server after some 1000 wgets!

There are also a lot of numbers that are not valid (they display an almost empty template). How do I insert an if-statement that deletes the downloaded file if it doesn't have the string "DKK" in it? (I know this string is in the valid pages, and not in the invalid ones!)

Thank you for your help!

Mr. C.
February 28th, 2008, 04:45 AM
Use either --wait seconds to wait between retrievals, or --random-wait


--wait seconds
--random-wait

if you are doing a single wget. If you are looping over multiple wgets, place a sleep 1 call after your wget.

The --page-requisites option will pick up images required for pages to render (along with other stuff). Use it without recursion set. Since you know the name of all the pages, this is likely what you want.

You can also specify the types of files you want to accept:


--accept acclist

where acclist is a comma-separated list of file suffixes (e.g. .jpg,.gif)
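
Putting those together with the loop suggested earlier, a rough sketch might look like the following; the URL pattern is just the placeholder from the original post, and an --accept list could be added if you want to restrict which requisites are kept:


#!/bin/bash
# Rough sketch only -- adjust the URL pattern to the real site.
for number in `seq 1 100000`
do
    # --page-requisites pulls in the images (and CSS etc.) each page needs;
    # --wait=1 pauses between the retrievals made by a single wget call;
    # sleep 1 spaces out the wget calls themselves, to go easy on the server.
    wget --page-requisites --wait=1 "http://www.site.com/something.asp?page=$number"
    sleep 1
done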

There are plenty of options available to wget; the man page takes some time to review, but it's worth it:


man wget

MrC

Mr. C.
February 28th, 2008, 04:50 AM
There are also a lot of numbers that are not valid (they display an almost empty template). How do I insert an if-statement that deletes the downloaded file if it doesn't have the string "DKK" in it? (I know this string is in the valid pages, and not in the invalid ones!)

Do this part at the end, outside the loop:


$ find . -type f | xargs grep -L DKK | xargs rm

which finds all files under the current directory (be sure you're in the top of the download directory first), sends that list to grep, which prints only the names of the files that do not contain DKK, and passes that list to rm.
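
If any of the saved filenames contain spaces or other awkward characters, a null-delimited variant of the same idea is a little safer (this assumes GNU find, grep and xargs):


$ find . -type f -print0 | xargs -0 grep -L -Z DKK | xargs -0 rm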

MrC

tg3793
December 8th, 2010, 09:55 AM
One thing to keep in mind is that if you are trying to set up an offline version of a site, the wget method will still be linking to the online version. httrack has options to modify the links and images to reference the offline copy instead.

This is an old thread, I know, but since I'm looking at it, perhaps another newbie who wouldn't know better is looking at it as well. So, for the edification of all, I submit the following reply.

You can use wget's --convert-links option, which enables wget, after downloading all of the pages, to rewrite the links within the downloaded pages so they point at your new local copy.

Here is a great tutorial (http://dalelane.co.uk/blog/?p=233) on it. Even though the tutorial applies to a wiki, the same principles apply for any offline copy of a website.
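
As a quick sketch of what that looks like for a single page, using the placeholder URL from the original post together with the --page-requisites option mentioned earlier in the thread:


# Fetch one page plus its images/CSS, then rewrite the links in the saved
# page so they point at the local copies instead of the online site.
wget --convert-links --page-requisites "http://www.site.com/something.asp?page=1"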

ssam
December 8th, 2010, 01:59 PM
curl (similar to wget) can also do this.
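
For example, curl's URL globbing can cover the whole range in a single command; the #1 in the output name is replaced with the current number (URL pattern assumed from the original post):


# Fetch pages 1..100000, saving each one as page_<number>.html
curl "http://www.site.com/something.asp?page=[1-100000]" -o "page_#1.html"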