Thread: extract links from webpage

  1. #1
    Join Date
    Mar 2013
    Beans
    20

    extract links from webpage

    I have a "source.txt" file which contains a list of URLs. For example:


    Code:
    source.txt:    
    http://www.amazon.com/gp/product/B007OZNZG0/ref=s9_pop_gw_g349_ir05/176-5131847-6150405?pf_rd_m=ATVPDKIKX0DER&pf_rd_s=center-2&pf_rd_r=02R1PYSDAPM8P0XF7HXW&pf_rd_t=101&pf_rd_p=1263340922&pf_rd_i=507846
    http://www.amazon.com/gp/product/B0083PWAPW/ref=s9_pop_gw_g424_ir04/176-5131847-6150405?pf_rd_m=ATVPDKIKX0DER&pf_rd_s=center-2&pf_rd_r=02R1PYSDAPM8P0XF7HXW&pf_rd_t=101&pf_rd_p=1263340922&pf_rd_i=507846
    I want to extract all links from the above URLs which contain "/gp/product" and then store them in an "extracted.txt" file, which would be:


    Code:
    extracted.txt:
    http://www.amazon.com/gp/product/B008GFRB9E/ref=fs_j
    http://www.amazon.com/gp/product/B008GFUA4C/ref=fs_2
    I am using Cygwin on Windows 7 (64 bit).

    Thank you for your help.
    Last edited by Si1414; April 2nd, 2013 at 06:36 PM. Reason: Solved!

  2. #2
    Join Date
    Feb 2007
    Location
    Romania
    Beans
    Hidden!
    Distro
    Ubuntu Development Release

    Re: extract links from webpage

    Assuming that each URL is on a separate line in the source file, you could use grep to do the job. Do you have grep installed in your Cygwin environment?

  3. #3
    Join Date
    Mar 2013
    Beans
    20

    Re: extract links from webpage

    Thank you. Yes, I have it installed on Cygwin.

  4. #4
    Join Date
    Mar 2013
    Beans
    20

    Re: extract links from webpage

    I have grep installed on Cygwin. However, I want to retrieve each link listed in "source.txt", search through the downloaded HTML for "/gp/product", and store the matching links in "extracted.txt".
    Any suggestions for this?

  5. #5
    Join Date
    Feb 2013
    Beans
    Hidden!

    Re: extract links from webpage

    Do you mean something like
    Code:
    wget -q -i source.txt -O-|grep '/gp/product'
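    wget reads the URLs from source.txt (-i), downloads the HTML page for each one, and writes them all to standard output (-O-), which is piped to grep. grep then prints every line that contains /gp/product.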
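
    Since grep prints whole lines of HTML, you may want to cut out just the URLs and save them. A rough sketch, assuming the links appear as double-quoted href="..." attributes in the pages (I have not checked Amazon's markup); the function name extract_product_links is made up for this example, and relative links are left as they are:

```shell
# Hypothetical helper: reads HTML on stdin and prints each distinct
# link that contains /gp/product/, one per line.
extract_product_links() {
    grep -o 'href="[^"]*/gp/product/[^"]*"' |   # keep only matching href attributes
        sed -e 's/^href="//' -e 's/"$//' |      # strip the href="..." wrapper
        sort -u                                 # drop duplicate links
}

# Usage (needs network access):
# wget -q -i source.txt -O- | extract_product_links > extracted.txt
```

    The -o option makes grep print only the matched part of each line instead of the whole line, so a line of HTML with several links yields several output lines.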

  6. #6
    Join Date
    Mar 2013
    Beans
    20

    Re: extract links from webpage

    Exactly! Great Help again schragge. I really appreciate that.
