
Thread: Bash Scripting and the Web

  1. #1
    Join Date
    Feb 2012
    Beans
    38

    Bash Scripting and the Web

    Hello Veterans and Gurus,

    I have been learning bash for the past couple of weeks, and need help with an idea that I had. I basically want to write a script to download the first ten pages of a Google search for a keyword of my choice.
    The idea, basically, is to have a program that does a very primitive form of web crawling for me. That way I could download content from the web and do my data mining on it later.

    That is the idea in very broad strokes. It is still not clear enough in my own head to present as a detailed question, but in essence, as a start, I want to at least be able to do what I mentioned above.

    Any idea how bash scripting could help with that? Is there any reading material that would educate me on commands specifically for interfacing with the web and its content? Is a bash script the way to go in the first place?

    I realize these are broad questions, so feel free to respond with broad answers. I would be happy to go in any direction this thread heads.

    Also feel free to throw in any terminal commands that you think might be relevant or helpful. My list of terminal commands has no limit.

  2. #2
    Join Date
    Sep 2006
    Beans
    8,627
    Distro
    Ubuntu 14.04 Trusty Tahr

    Re: Bash Scripting and the Web

    You could read up on wget, but it is likely that you'll need to parse the HTML to get at the link for the next page. To do that you'll need an actual HTML parser, such as Perl's HTML::Parser or HTML::LinkExtor.

  3. #3
    Join Date
    Feb 2012
    Beans
    38

    Re: Bash Scripting and the Web

    wget is actually a great place to start! I started reading up on it and it is definitely going to be useful. Thanks.

    I hope more people chime in with their expertise.

  4. #4
    Join Date
    Feb 2008
    Beans
    251
    Distro
    Ubuntu 12.04 Precise Pangolin

    Re: Bash Scripting and the Web

    Hi,

    I would actually be tempted to do this in the scripting language Python[1], rather than Bash.

    I find that Python is easy to understand and maintain, and it has some really useful libraries, including:

    • urllib2[2] for opening connections and downloading webpages
    • BeautifulSoup[3] for HTML parsing


    In addition there are excellent tutorials for getting you started.
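
    For what it's worth, here is a rough, untested sketch of how those two libraries might fit together (assumes Python 2 with BeautifulSoup 4 installed; the URL and User-Agent are only placeholders):

    Code:
    #!/usr/bin/env python
    
    # Rough sketch: download one page with urllib2 and list its links with
    # BeautifulSoup. The URL and User-Agent below are just placeholders.
    
    import urllib2
    
    from bs4 import BeautifulSoup
    
    url = 'http://www.example.com/'
    request = urllib2.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    html = urllib2.urlopen(request).read()
    
    soup = BeautifulSoup(html, 'html.parser')
    for a in soup.find_all('a', href=True):
        print a['href']
    Point it at whatever page you want to mine and save the output however you like.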

    good luck!


    [1] http://www.python.org/doc/
    [2] http://docs.python.org/library/urllib2.html
    [3] http://www.crummy.com/software/BeautifulSoup/bs4/doc/

  5. #5
    Join Date
    Jan 2012
    Beans
    342

    Re: Bash Scripting and the Web

    LinuXofArabiA ...

    I'm not sure Google provides scrapable URLs (except perhaps ads and internal Google services). See for yourself:

    url_scrape.py

    Code:
    #!/usr/bin/env python
    
    '''Returns a list of URLs that are found in standard input.
    
    These URLs must be between quotes ("" or '') and must start with http://
    
    Modified from Python Recipe 30270 by Yuriy Tkachenko:
    http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/302700
    
    '''
    
    __RCS__ = '$Id: url_scrape.py,v 1.4 2005/06/19 18:18:34 darren Exp darren $'
    __version__ = '$Revision: 1.4 $'
    __initialdate__ = 'June 2005'
    
    __author__ = 'Darren Paul Griffith, http://www.madphilosopher.ca/'
    
    import re
    import sys
    
    
    if __name__ == '__main__':
    
        # Pattern for fully-qualified URLs:
        url_pattern = re.compile('''["']http://[^+]*?['"]''')
    
        # build list of all URLs found in standard input
        s = sys.stdin.read()
        all = url_pattern.findall(s)
    
        # output all the URLs
        for i in all:
            print i.strip('"').strip("'")
    Wrap it together with bash/wget (wgeturl.sh) ...

    Code:
    #!/bin/bash
    
    if [[ ${#} -eq 0 ]]; then
        echo "No search string provided!"
        echo "Usage: $(basename "${0}") <search string>"
        exit 1
    fi
    
    # join the arguments into a single query string, with '+' between words
    query="${*// /+}"
    
    wget -U mozilla -q -O - 'http://www.google.com/search?q='"${query}" | url_scrape.py >> urls.txt
    Then run 'wgeturl.sh <search string>'.

    I may be wrong, but I think they obfuscate URLs in order to "not be evil".

    best ... khay
    Last edited by Khayyam; February 26th, 2012 at 09:46 PM.

  6. #6
    Join Date
    Sep 2006
    Beans
    8,627
    Distro
    Ubuntu 14.04 Trusty Tahr

    Google's search API

    Google's search API might or might not be useful here too:

    https://code.google.com/apis/customsearch/
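
    If you go that route, a rough sketch of a query against the Custom Search JSON API might look something like this (untested; YOUR_API_KEY and YOUR_CX are placeholders for the key and search engine ID you get when you sign up):

    Code:
    #!/usr/bin/env python
    
    # Rough sketch: query the Custom Search JSON API and print the result links.
    # YOUR_API_KEY and YOUR_CX are placeholders, not real credentials.
    
    import json
    import urllib
    import urllib2
    
    params = urllib.urlencode({
        'key': 'YOUR_API_KEY',   # API key from the Google APIs console
        'cx':  'YOUR_CX',        # custom search engine ID
        'q':   'bash scripting',
    })
    
    response = urllib2.urlopen('https://www.googleapis.com/customsearch/v1?' + params)
    results = json.load(response)
    
    for item in results.get('items', []):
        print item['link']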
