
Thread: Manipulate a txt file

  1. #1
    Join Date
    Mar 2013
    Beans
    20

    Question Manipulate a txt file

    I have the following txt file:

    Code:
    source.txt:
    <a href="/gp/product/bold.jsp?tp=&add=B007OZNZG0"><a href="https://www.amazon.com/gp/product/utility/edit-one-click-pref.html?ie=UTF8&query=*entries*%3D0%2C*Version*%3D1&returnPath=%2Fgp%2Fproduct%2FB007OZNZG0" id="oneClickSignInLinkID">Sign in</a> to turn on 1-Click ordering.
    <tr><a href="/gp/product/bold.jsp?tp=&add=B007OZAJSDH"><td align="right" style="font-weight:....
    ...
    It is essentially a bunch of junk with some links that contain "bold.jsp".

    What I want to do is to:
    1- extract the links that contain bold.jsp and write them to a new txt file (each on a separate line)
    2- add "http://www.amazon.com" to the beginning of each line

    So the output would be:

    Code:
    output.txt:
    http://www.amazon.com/gp/product/bold.jsp?tp=&add=B007OZNZG0
    http://www.amazon.com/gp/product/bold.jsp?tp=&add=B007OZAJSDH
    
    
    I am using Cygwin on Windows.

    Thank you for your help.
    Last edited by Si1414; April 2nd, 2013 at 06:36 PM. Reason: Solved!

  2. #2
    Join Date
    Feb 2013
    Beans
    Hidden!

    Re: Manipulate a txt file

    Code:
    awk -F'"' -vRS='<a[ \t\n]+href="' '$1~/bold\.jsp/{sub(/^\//,"http://www.amazon.com/");print $1}' source.txt
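
    For anyone reading along, here is a commented sketch of how the one-liner above works. It assumes an awk that treats a multi-character RS as a regular expression (gawk and mawk do). The sample source.txt contents and the extract_links function name are made up for illustration:

```shell
# Hypothetical sample input, modeled on the two lines quoted in the first post.
cat > source.txt <<'EOF'
<a href="/gp/product/bold.jsp?tp=&add=B007OZNZG0"><a href="https://www.amazon.com/gp/product/utility/edit-one-click-pref.html" id="oneClickSignInLinkID">Sign in</a> to turn on 1-Click ordering.
<tr><a href="/gp/product/bold.jsp?tp=&add=B007OZAJSDH"><td align="right">
EOF

# extract_links FILE: print every bold.jsp link in FILE as an absolute URL.
extract_links() {
    # RS: a new record starts right after each '<a href="' (regex RS, gawk/mawk).
    # -F'"': the first field ends at the closing quote, so $1 is the link target.
    awk -F'"' -v RS='<a[ \t\n]+href="' '
        $1 ~ /bold\.jsp/ {
            # sub() without a target edits $0; awk re-splits it, so $1 changes too.
            sub(/^\//, "http://www.amazon.com/")
            print $1
        }
    ' "$1"
}

extract_links source.txt
```

    With the sample file this should print the two bold.jsp links, one per line, each prefixed with http://www.amazon.com, and skip the edit-one-click-pref.html link.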
    Last edited by schragge; April 2nd, 2013 at 04:48 PM.

  3. #3
    Join Date
    Mar 2013
    Beans
    20

    Re: Manipulate a txt file

    Thank you, that's a great help.
    How do I put each link on a separate line?
    When I use the following code, it writes all the links one after another instead of on separate lines:

    Code:
    (awk -F'"' -vRS='<a[ \t\n]+href="' 'NR>1{sub(/^\//,"http://www.amazon.com/"); print $1}' source.txt) >output.txt
    Last edited by Si1414; April 2nd, 2013 at 04:49 PM.

  4. #4
    Join Date
    Feb 2013
    Beans
    Hidden!

    Re: Manipulate a txt file

    Strange. On my system, it puts each link on its own line. Which awk implementation does Cygwin use? gawk? You could try redirecting the output from inside awk:
    Code:
    awk -F'"' -vRS='<a[ \t\n]+href="' '$1~/bold\.jsp/{sub(/^\//,"http://www.amazon.com/"); print$1>"output.txt"}' source.txt
    or explicitly set the ORS variable:
    Code:
    awk -F'"' -vRS='<a[ \t\n]+href="' -vORS='\r\n' '$1~/bold\.jsp/{sub(/^\//,"http://www.amazon.com/");print$1}' source.txt
    or both.

    TBH, I suspect it does print the links on separate lines, but ends each line with LF alone (Unix convention) rather than CR+LF (Windows convention), hence ORS='\r\n' above. Or you can open output.txt in an editor that understands the Unix convention, such as Notepad++. Alternatively, convert the file with unix2dos.
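
    To check which convention a file actually uses, counting the lines that end in a carriage return is a quick test. A minimal sketch, assuming an awk that understands \r in a regex (gawk and mawk do); the file names and the count_crlf helper are hypothetical:

```shell
# Two hypothetical sample files: same text, different line endings.
printf 'line1\nline2\n'     > unix.txt   # LF only (Unix convention)
printf 'line1\r\nline2\r\n' > dos.txt    # CR+LF (Windows convention)

# count_crlf FILE: print how many lines in FILE end with a carriage return.
count_crlf() {
    awk '/\r$/ { n++ } END { print n + 0 }' "$1"
}

count_crlf unix.txt   # prints 0
count_crlf dos.txt    # prints 2

# Convert LF to CR+LF with awk alone, as an alternative to unix2dos.
awk -v ORS='\r\n' '{ print }' unix.txt > converted.txt
count_crlf converted.txt   # prints 2
```

    Note that the conversion step is meant for LF-only input; running it on a file that already ends its lines with CR+LF would add a second CR to each line.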

    BTW, I've edited my post and changed the code to select only the links that contain bold.jsp.
    Last edited by schragge; April 2nd, 2013 at 05:11 PM.

  5. #5
    Join Date
    Mar 2013
    Beans
    20

    Re: Manipulate a txt file

    Thanks again. The second code works perfectly!
