
Thread: dump text from pages of google search

  1. #1
    Join Date
    Apr 2007
    Beans
    12

    dump text from pages of google search

    hi all,

    I am studying for a medical exam and frequently need to consult multiple pages to find obscure facts. I am looking for a way to run a Google search, visit each of the top 5-10 results, dump the text from those pages to a file, and highlight the search term within the dumped text.

    I figure you can dump text directly from a lynx search, but I can't work out how to get lynx to follow each of the Google links without doing it manually.

    Ideally, I would like to write a script that prompts for the search terms and the dump file name, and then creates the file.

    Any ideas? Thanks

  2. #2
    Join Date
    Mar 2011
    Location
    U.K.
    Beans
    828
    Distro
    Ubuntu 12.04 Precise Pangolin

    Re: dump text from pages of google search

    If you're researching a corpus of medical references I would recommend installing

    TextSTAT
    http://neon.niederlandistik.fu-berlin.de/en/textstat/

    You need to install python-tk, for example through Synaptic Package Manager.

    Then run the command "python TextSTAT.pyw" in the TextSTAT folder.

    I just run the command:

    cd /home/myusername/Applications/TextSTAT/ && python TextSTAT.pyw

    You can then add either web sites or local files to your corpus to search for key phrases.


    also try ..

    AntConc3.2 and AntWord Profiler
    http://www.antlab.sci.waseda.ac.jp/software.html

    ...

    another tip is to install the Zotero add-on for Firefox

    this is a citation manager

  3. #3
    Join Date
    Apr 2007
    Beans
    12

    Re: dump text from pages of google search

    thanks, i'll give those a try.

    still looking for a scripted command line option, if somebody is willing to help out

  4. #4
    Join Date
    May 2009
    Location
    Courtenay, BC, Canada
    Beans
    1,583

    Re: dump text from pages of google search

    You can use wget to do this, but you need to specify a user agent
    Code:
    wget -U "Firefox/3.0.15" "http://www.google.com/search?q=wget+google+query+to+file" -O file.html
    otherwise you get a 403/forbidden

    http://isaksen.biz/blog/?p=470
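
A minimal sketch of how the command above could be wrapped so the search terms are not hard-coded. The function names `build_search_url` and `fetch_results` are made up for illustration, and joining words with `+` is only a rough stand-in for proper URL encoding: it is enough for plain words, not for terms containing `&`, `#` or similar characters.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; they are not part of wget.

# Build a Google search URL from the words given as arguments.
# Assumption: the terms are plain words. Runs of spaces become a single '+',
# and nothing else is encoded.
build_search_url() {
    query=$(printf '%s' "$*" | tr -s ' ' '+')
    printf 'http://www.google.com/search?q=%s\n' "$query"
}

# Save the results page for the given terms.
# $1 = output file, remaining arguments = search terms
fetch_results() {
    outfile=$1
    shift
    # -U sets the user agent, as in the command above; -q keeps wget quiet.
    wget -q -U "Firefox/3.0.15" -O "$outfile" "$(build_search_url "$@")"
}
```

Something like `fetch_results results.html acute pancreatitis causes` would then save the results page to `results.html`. Prompting for the terms and the file name, as asked in the first post, could be done with `read` before calling it.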

  5. #5
    Join Date
    May 2009
    Location
    Courtenay, BC, Canada
    Beans
    1,583

    Re: dump text from pages of google search

    use the --mirror option to follow links such as:
    Code:
     wget -U "Firefox/3.0.15" --mirror "http://www.google.com/search?q=wget+google+query+to+file" -O file.html
    it actually seems that using the -O option prevents it from getting multiple pages, so do it without it, like:
    Code:
     wget -U "Firefox/3.0.15" --mirror "http://www.google.com/search?q=wget+google+query+to+file"
    after 300 or so results you will start getting service unavailable
    Last edited by HiImTye; March 19th, 2012 at 12:45 AM.
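
A note on why `--mirror` may not do what is wanted here: it turns on unlimited-depth recursion, and by default wget does not leave the starting host, so it keeps following Google's own links instead of the result sites. An alternative worth trying is to save the results page once, pull out just the result links, and fetch only the first few. The sketch below assumes the saved page wraps each result as `href="/url?q=<target>&amp;..."`, which may not match what Google serves today; `extract_result_links` and `fetch_pages` are hypothetical names.

```shell
#!/bin/sh
# Hypothetical helpers; the /url?q= link format is an assumption about the saved page.

# Print up to N result URLs found in a saved results page.
# $1 = saved HTML file, $2 = maximum number of links (default 10)
extract_result_links() {
    grep -o 'href="/url?q=[^"&]*' "$1" |
        sed 's|^href="/url?q=||' |
        head -n "${2:-10}"
}

# Download each URL read from stdin to page_1.html, page_2.html, ...
# Note: URLs are used exactly as they appear in the page, so any
# percent-encoded characters stay encoded.
fetch_pages() {
    n=0
    while IFS= read -r url; do
        n=$((n + 1))
        wget -q -U "Firefox/3.0.15" -O "page_$n.html" "$url"
    done
}
```

With a results page already saved as `results.html`, `extract_result_links results.html 5 | fetch_pages` would fetch at most five pages and then stop, rather than running until the process is killed.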

  6. #6
    Join Date
    Apr 2007
    Beans
    12

    Re: dump text from pages of google search

    thanks, looks very promising.
    1. At the moment it just seems to be getting the Google search page rather than opening the links and saving those pages.
    2. Is there a way to limit or define the number of responses? At the moment the command just keeps adding pages until I kill the process.
    3. Is there a way to dump to a text file rather than HTML? I guess I could just pipe the output to an HTML-to-text converter.
    Last edited by edfromballarat; March 19th, 2012 at 02:04 AM.
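
For point 3 and the highlighting asked for in the first post, one possible approach is to let lynx render each saved page as plain text and then mark the search term. In the sketch below, `page_to_text` and `highlight_term` are hypothetical names, the `>>>`/`<<<` markers are an arbitrary choice for a plain text file, and the match is a literal, case-insensitive substring match rather than anything smarter.

```shell
#!/bin/sh
# Hypothetical helpers for illustration.

# Render a saved HTML file (or a URL) as plain text.
# Assumes lynx is installed; -dump prints the rendered page to stdout
# and -nolist leaves out the numbered list of links at the end.
page_to_text() {
    lynx -dump -nolist "$1"
}

# Wrap every case-insensitive occurrence of a literal term in >>> <<<.
# $1 = search term; text is read from stdin and written to stdout.
# Assumption: the term contains no backslashes (awk -v would interpret them).
highlight_term() {
    awk -v term="$1" '
    BEGIN { lterm = tolower(term); n = length(term) }
    {
        line = $0; out = ""
        if (n > 0) {
            # Consume the line left to right, marking each match as it is found.
            while ((i = index(tolower(line), lterm)) > 0) {
                out = out substr(line, 1, i - 1) ">>>" substr(line, i, n) "<<<"
                line = substr(line, i + n)
            }
        }
        print out line
    }'
}
```

A line such as `page_to_text page_1.html | highlight_term pancreatitis >> dump.txt` would then append the marked-up text of one page to the dump file.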
