Page 1 of 4
Results 1 to 10 of 33

Thread: Python vs Bash Challenge

  1. #1
    Join Date
    Dec 2007
    Beans
    18

    Python vs Bash Challenge

    Hi Guys and Girls,

    I have written some bash to solve a little issue I have, but would be interested in the performance difference of doing it in Python. Hopefully a few people here will think this is quite fun and a little challenge!

    I have a huge text list of over 1 million URLs (generated from my own site's XML). They are in a file called "allurls.txt", in this format:

    Code:
    http://www.mydomain.com/5560000043-Some_Words
    http://www.mydomain.com/6263403333-Different_things_here
    http://www.mydomain.com/5562344287-Not_even_in_English
    http://www.mydomain.com/9962458778-Why
    And so on...
    (Format is:
    Code:
    everything up to the slash after the domain is always the same, then a 10-digit number, then a hyphen and some text
    )

    What I need to do is to create a CSV called "data.csv" that looks like this...

    Code:
    "5560000043","556000-0043","http://www.mydomain.com/5560000043-Some_Words","http://www.myNEWdomain.com/5560000043"
    "6263403333","626340-3333","http://www.mydomain.com/6263403333-Different_things_here","http://www.myNEWdomain.com/6263403333"
    "5562344287","556234-4287","http://www.mydomain.com/5562344287-Not_even_in_English","http://www.myNEWdomain.com/5562344287"
    and so on and so on...
    (Format of data.csv should be:
    Code:
    "the number after the domain's slash but before the first hyphen","that number again, but split with a hyphen after the first six digits","the original URL","the new URL with just the number after the slash"
    )
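The format above can be sketched for a single URL as a small POSIX shell function. This is only an illustration, not the script from this thread: `make_row` and its variable names are hypothetical, and it assumes every URL has the shape scheme://host/NUMBER-text with a 10-digit number.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): print one data.csv row for one URL.
# Assumes the URL looks like scheme://host/NNNNNNNNNN-some_text.
make_row() {
    url=$1
    rest=${url#*://*/}       # strip scheme and host, e.g. "5560000043-Some_Words"
    num=${rest%%-*}          # text before the first hyphen, e.g. "5560000043"
    last4=${num#??????}      # everything after the first six characters
    first6=${num%"$last4"}   # the first six characters
    printf '"%s","%s-%s","%s","http://www.myNEWdomain.com/%s"\n' \
        "$num" "$first6" "$last4" "$url" "$num"
}
```

Called as `make_row 'http://www.mydomain.com/5560000043-Some_Words'`, it should print a row in the same shape as the first example line of data.csv above.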

    I have written the following bash script, called createCSV.sh, which works and does exactly what it should, except that it takes over 3 hours to create the CSV from the 1 million plus URLs.

    Code:
    #!/bin/sh
    # createCSV.sh
    # Author: Me
    
    rm data.csv
    fname="allurls.txt"
    PS1=$(date '+%Y%m%d_%H%M%S')
    while read a;
    do
    comma=","
    hyphon="-"
    quote='"'
    orgnumber=`expr substr $a 22 10`
    orgnumberfirsthalf=`expr substr $orgnumber 1 6`
    orgnumbersecondhalf=`expr substr $orgnumber 7 4`
    orgwithhypon="$orgnumberfirsthalf$hyphon$orgnumbersecondhalf"
    oldurl="$a"
    newurl1stpart="http://www.myNEWdomain.com/"
    newurl2ndpart="$orgnumber"
    newurl="$newurl1stpart$newurl2ndpart"
    total="$quote$orgnumber$quote$comma$quote$orgwithhypon$quote$comma$quote$oldurl$quote$comma$quote$newurl$quote"
    echo "$total" >> data.csv
    done < "$fname"
    PS2=$(date '+%Y%m%d_%H%M%S')
    echo "$PS1" >> data.csv
    echo "$PS2" >> data.csv
    Beautiful eh?

    So the challenge is: can anyone write a Python script that will beat my great bash script's time of over three hours to make the CSV?

    Append a timestamp for when it starts and a timestamp for when it ends to the end of the CSV, as I have, and I'll run all suggestions and we can see who is the fastest!

    Good luck

  2. #2
    Join Date
    Sep 2006
    Beans
    2,914

    Re: Python vs Bash Challenge

    Are you sure your bash script gives the correct output?

    Here's an awk solution:

    Code:
    awk 'BEGIN{
        FS="[/-]"
        OFS=","
        q="\042"
    }
    {
        org=q $0 q
        o = q $4 q
        newdom = q "http://www.myNEWdomain.com/"$4 q
        $4 = q substr($4,1,6)"-"substr($4,7) q
        print o,$4,org,newdom    
    }' file

  3. #3
    Join Date
    Dec 2007
    Beans
    18

    Re: Python vs Bash Challenge

    That is immense...

    The bash script I wrote with my general lack of knowledge (but it did work) took over 3 hours...

    Ghostdog74, powered by awk... 9 seconds.

    Just shows what you can do when you know what to do...

    Here's the completed code:

    Code:
    #!/bin/sh
    # createCSVawk.sh
    # Author: GhostDog74
    
    awk 'BEGIN{
        FS="[/-]"
        OFS=","
        q="\042"
    }
    {
        org=q $0 q
        o = q $4 q
        newdom = q "http://www.myNEWdomain.com/"$4 q
        $4 = q substr($4,1,6)"-"substr($4,7) q
        print o,$4,org,newdom >> "data.csv"  
    }' allurls.txt
    9 seconds. Unbelievable. I'm going for lunch. Thanks man!

  4. #4
    Join Date
    Sep 2006
    Beans
    2,914

    Re: Python vs Bash Challenge

    For 9 seconds, I am guessing you have a VERY big file.

  5. #5
    Join Date
    Dec 2007
    Beans
    18

    Re: Python vs Bash Challenge

    Quote Originally Posted by ghostdog74 View Post
    For 9 seconds, I am guessing you have a VERY big file.

    The output CSV is 256 MB.

  6. #6
    Join Date
    Sep 2006
    Beans
    2,914

    Re: Python vs Bash Challenge

    Word of advice: don't use bash's while read loop to parse big files.

  7. #7
    Join Date
    May 2007
    Beans
    125

    Re: Python vs Bash Challenge

    So, is this challenge closed (awk FTW?), or are you still looking for a Python version for comparison?

  8. #8
    Join Date
    Apr 2006
    Beans
    1,273

    Re: Python vs Bash Challenge

    This is a good example of using the right tool for the job.

  9. #9
    Join Date
    Feb 2008
    Location
    Cape Town, South Africa
    Beans
    Hidden!
    Distro
    Ubuntu 8.04 Hardy Heron

    Re: Python vs Bash Challenge

    Quote Originally Posted by Paul Miller View Post
    So, is this challenge closed (awk FTW?), or are you still looking for a Python version for comparison?
    There is more than one way to skin a cat. If you want to give a Python example, I don't think anyone viewing this thread later would mind. I also don't think the OP would mind seeing alternative solutions.

  10. #10
    Join Date
    Mar 2008
    Beans
    4,714
    Distro
    Ubuntu 9.10 Karmic Koala

    Re: Python vs Bash Challenge

    Quote Originally Posted by ghostdog74 View Post
    Word of advice: don't use bash's while read loop to parse big files.
    What is the correct bash idiom for reading large files line by line?

    What is "while read" doing that makes it so slow?
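One likely factor, going by the script in post #1: its loop runs three `expr` commands in backticks for every line, and each of those starts a separate process, so a million lines means millions of process launches. It also reopens data.csv with `>>` on every iteration. A minimal sketch of the same loop without those costs follows. `convert_urls` is a hypothetical name, the URL shape is assumed to be scheme://host/NUMBER-text, and no timing is claimed for it.

```shell
#!/bin/sh
# Sketch only: read URLs on stdin, write CSV rows to stdout.
# No command substitution inside the loop, so no extra process per line.
convert_urls() {
    # IFS= and -r keep each line exactly as it appears in the input.
    while IFS= read -r url; do
        rest=${url#*://*/}       # strip scheme and host
        num=${rest%%-*}          # text before the first hyphen
        last4=${num#??????}      # everything after the first six characters
        first6=${num%"$last4"}   # the first six characters
        printf '"%s","%s-%s","%s","http://www.myNEWdomain.com/%s"\n' \
            "$num" "$first6" "$last4" "$url" "$num"
    done
}
```

A run would look like `convert_urls < allurls.txt > data.csv`, which opens the output file once for the whole loop instead of once per line.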
