
[ubuntu] Want to copy website content using wget



nos09
February 12th, 2010, 07:01 PM
Hey, I want to copy a website's whole contents, or the whole site if possible.

I've used wget in the terminal and it copied the website, but I can't find the things I want.

Let's say I want to copy this site for my personal use: "http://lyrics.astraweb.com/". The site is for lyrics, as you can see.
I tried these commands:
"wget -r -l0 --no-parent http://lyrics.astraweb.com"
and
"wget -r -l0 --no-parent http://lyrics.astraweb.com/browse/band/33/Greenday.html"
but I was only able to copy the lyrics pages I had already visited online, not all of them. So please help me copy all the contents.

Thanks in advance for the help, and sorry for the English :(
I'm crazy though. :guitar:

Enigmapond
February 12th, 2010, 07:12 PM
That would be much easier to do by just installing httrack. It will copy a whole website, or rip just the things you want. It works very well.
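
On Ubuntu it's in the repositories. Roughly something like this should do it (the output folder name here is just an example, pick whatever you like):

# install httrack from the Ubuntu repositories
sudo apt-get install httrack
# mirror the site into a local folder for offline browsing
httrack "http://lyrics.astraweb.com/" -O ~/lyrics-mirror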

nos09
February 12th, 2010, 07:17 PM
What is httrack?
And are you sure I will be able to copy the lyrics as well? I don't mind if they are in HTML form. I think you should visit the site first just to confirm what I'm saying. Thanks for the quick reply.

Enigmapond
February 12th, 2010, 07:25 PM
It copies the entire site and makes it browsable offline...copies everything on it. I just tried it and it works...it makes a mirror of the site. You can copy everything or just specific things depending on how you set it up...

nos09
February 12th, 2010, 07:31 PM
One last question:

Is there any option to make it continue in the next session? I mean, if the site is a little too big to copy at once, can I continue copying it the next day?

Enigmapond
February 12th, 2010, 07:38 PM
I'm not sure... I've never had to do it that way, but you can filter out what you don't want and download only certain things, for example just the HTML pages, ignoring all the ads and images... the best thing I can tell you is to install it along with the documentation and play with it...
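
For example, if I remember the filter syntax right (the + and - patterns are include/exclude rules, so treat this as a rough sketch):

# mirror the site but skip the common image types; add more -patterns (or +patterns) to narrow it further
httrack "http://lyrics.astraweb.com/" -O ~/lyrics-mirror "-*.jpg" "-*.gif" "-*.png"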

kellemes
February 12th, 2010, 07:46 PM
One last question:

Is there any option to make it continue in the next session? I mean, if the site is a little too big to copy at once, can I continue copying it the next day?

From "step 3" in the httrack-webscreen you can choose "Continue interrupted download" as action.
I guess this is what you want?
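
If you use the command-line httrack instead of the web interface, I believe the equivalent is something like this, run from the project folder it created (don't hold me to the exact flag, check "man httrack"):

# resume an interrupted mirror using the cache in the current project folder
cd ~/lyrics-mirror
httrack --continue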

nos09
February 12th, 2010, 10:03 PM
Now I tried to copy the URL: http://lyrics.astraweb.com/browse/bands/33/Greenday.html

It copied without any problem, but when I go offline and open the HTML page, only that one particular page works. I want the links to work too. I mean, when I click on one of the links it shows a connection error, which means I'm offline,

but that's the whole point: I want to see the pages offline. Anyone who wants to help me, please try it yourself first. Try to copy all the contents and make the links work.

please.....

Satoru-san
February 12th, 2010, 10:10 PM
There is a feature that will copy recursively, but you have to be careful. If you try to download a forum, you are looking at infinite downloading, not just because of the threads but because of the calendar. I tried downloading a forum one time and let it go for a few hours, and when I came back I noticed it had downloaded something like 2 GB of calendars. You also have to watch out for off-site links, especially if they link to Google, or it will never stop. I haven't used this program in a long time, though, and you may be able to restrict downloading to only that domain. Don't quote me on any of that; I am only speculating.
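
With wget, at least, something roughly like this should keep it on one host and cap how much it grabs (the depth and quota numbers are just examples):

# recursive download, don't climb above the start page, stay on the lyrics host,
# limit recursion depth to 3 and stop after roughly 200 MB
wget -r -l 3 --no-parent --domains=lyrics.astraweb.com -Q 200m "http://lyrics.astraweb.com/"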

nos09
February 13th, 2010, 02:14 AM
Found a way....
I can't copy the whole site, but I can copy the artist I want.
For example, if I want to copy all the lyrics from the Greenday band, I go to the page "http://lyrics.astraweb.com/browse/bands/33/Greenday.html", copy the URL, then open the terminal and type


wget -mkE "http://lyrics.astraweb.com/browse/bands/33/Greenday.html"

It downloads all the lyrics as individual HTML pages. The problem is it won't stop until it has downloaded the whole site, so keep an eye on the terminal and kill the process as soon as it moves on to other links.
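
One thing that might help (I haven't fully tested it) is to drop -m and limit the recursion depth instead, so wget stops on its own; the depth here is just a guess:

# recurse only two levels from the band page so it stops by itself,
# convert links for offline use (-k) and save pages with .html extensions (-E)
wget -r -l 2 -k -E "http://lyrics.astraweb.com/browse/bands/33/Greenday.html"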

Any other solutions are most welcome; this wasn't good enough, but it can do the trick temporarily.