
Thread: need a more complicated cut command

  1. #1
    Join Date
    Jan 2010
    Location
    Wheeling WV USA
    Beans
    1,033
    Distro
    Ubuntu 16.04 Xenial Xerus

    need a more complicated cut command

    i have a big file (millions of lines) that i need to cut 3 columns from. these columns are separated by a variety of different delimiters plus i need them to be in a different order than the order they are in the big file. this is beyond the capability of the cut command. any suggestions?

    Ubuntu 16.04.6 here.
    What do you call someone who speaks three languages? Trilingual. Two languages? Bilingual. One language? American.

  2. #2
    Join Date
    Jan 2007
    Beans
    709
    Distro
    Ubuntu 18.04 Bionic Beaver

    Re: need a more complicated cut command

    awk (mawk) is what I use for more complex file processing. Or something like Perl or Python.
    Current 'buntu systems: Server 18.04.2 LTS, Mythbuntu 16.04 LTS, Ubuntu 16.04.1 LTS / Retired: 14.04 LTS, 10.04 LTS, 8.04 LTS
    Been using ubuntu since 6.04 (13 years!)

  3. #3
    Join Date
    Aug 2016
    Location
    Wandering
    Beans
    Hidden!
    Distro
    Xubuntu Development Release

    Re: need a more complicated cut command

    With realization of one's own potential and self-confidence in one's ability, one can build a better world.
    Dalai Lama>>
    Code Tags

  4. #4
    Join Date
    Jan 2010
    Location
    Wheeling WV USA
    Beans
    1,033
    Distro
    Ubuntu 16.04 Xenial Xerus

    Re: need a more complicated cut command

    the geeksforgeeks site is badly programmed. i'm getting ads overlying parts of the article text and code. i came up with an idea for something like cut or awk, but driven by a small language that is more sophisticated than cut's options and simpler than awk code. i'll make a prototype in Python and, if people like it (or i need more speed), a version in C.

  5. #5
    Join Date
    Mar 2010
    Location
    Squidbilly-Land
    Beans
    16,531
    Distro
    Ubuntu Mate 16.04 Xenial Xerus

    Re: need a more complicated cut command

    I'd use perl - set it up as a filter so stdin and stdout are used.

    Code:
    #!/usr/bin/perl
    use strict;
    use warnings;
    
    # read from stdin, 1 line at a time
    while (<>){
        chomp;
    
        # this splits on a : delimiter; to match several delimiters use a
        # regex character class, e.g. split(/[:,;]/, $_)
        my @larray = split(/:/, $_);
    
        # print the columns you want to keep to stdout
        print "$larray[0], $larray[5], $larray[6]\n";
    }
    This can parse an /etc/passwd file. Run it with ./parse.pl /etc/passwd, then redirect or pipe the output where you need it.

    Take the output from this and sort using any tool you like. I'd use sort unless there is a good reason not to. You could throw that output into a DBMS and have it do the sorting. Sky is the limit. If it is less than 2M records and each record is under 1K, I'd probably just add the columns into an array, then have perl output them in the desired sort order. This is the kind of stuff perl was created to handle. Sorting a column in perl is a 1-liner. Lots of examples on google. If you need to run it more than once on huge files, I'd use the quicksort variant. The built-in sort can do that.
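
    A minimal sketch of the in-memory approach described above, written in Python rather than perl: collect the kept columns in a list, then emit them ordered by one column. The name sort_rows and the sample rows are assumptions made up for illustration.

```python
def sort_rows(rows, key_index=0):
    """Return the rows (lists of column strings) ordered by one column."""
    return sorted(rows, key=lambda row: row[key_index])


# Hypothetical sample: three columns already cut from each input line.
sample = [
    ["root", "/root", "/bin/bash"],
    ["daemon", "/usr/sbin", "/usr/sbin/nologin"],
    ["bin", "/bin", "/usr/sbin/nologin"],
]

for row in sort_rows(sample, key_index=0):
    print(", ".join(row))
```

    The comparison here is plain string order; a numeric column would need a key such as int(row[key_index]).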

    Bet python, ruby and others can easily do the same thing. I just had the perl code lying around.
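
    For comparison, a rough Python sketch of the same filter, this time splitting on several delimiters at once. The delimiter set [:,;] and the names pick_columns and filter_lines are assumptions for illustration, not something given in the thread.

```python
import re

# Assumed delimiter set, for illustration only: colon, comma or semicolon.
DELIMS = re.compile(r"[:,;]")


def pick_columns(line, indexes=(0, 5, 6)):
    """Split one line on any delimiter; return the chosen fields in the given order."""
    fields = DELIMS.split(line.rstrip("\n"))
    return [fields[i] for i in indexes]


def filter_lines(lines, indexes=(0, 5, 6)):
    """Yield one output line per input line, skipping lines with too few fields."""
    for line in lines:
        try:
            yield ", ".join(pick_columns(line, indexes))
        except IndexError:
            continue
```

    To use it as a filter, loop over filter_lines(sys.stdin) and print each result. Passing something like indexes=(6, 0, 5) changes the order of the output columns.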

  6. #6
    Join Date
    Jan 2010
    Location
    Wheeling WV USA
    Beans
    1,033
    Distro
    Ubuntu 16.04 Xenial Xerus

    Re: need a more complicated cut command

    the way i'm thinking about this is a command where the expression of what to do fits on the command line in what i think are typical cases. the "language" of this expression is a new thing i thought up. users of it would need to read up on how it is done much like reading up on cut or find or other commands unless they use them a lot.
    What do you call someone who speaks three languages? Trilingual. Two languages? Bilingual. One language? American.

  7. #7
    Join Date
    Dec 2006
    Location
    Hamburg, Germany
    Beans
    115
    Distro
    Lubuntu 19.04 Disco Dingo

    Re: need a more complicated cut command

    Perl has historically been perfect for such things. The beast can process files of any size. But in recent years processors have become so fast that Python would also be appropriate.

    Back to reality: if you want to code in Perl, as an ex-Perl monk I recommend disabling the input record separator (which is newline "\n" on UNIX-like systems). In that case the speed-up is incredible:
    Code:
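    # Slurp mode: with $/ undefined, one read returns the whole file
    $HoldTerm = $/;
    $/ = undef;
    open SYSIN, '<', "file.txt" or die "cannot open file.txt: $!";
    $fc = <SYSIN>;
    close SYSIN;
    $/ = $HoldTerm;    # restore the normal line-by-line separator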
    .. then process your text in $fc.

    Disabling the line-by-line ("\n"-separated) file read speeds up I/O dramatically. You slurp the file in one shot. The regex then runs in RAM (because $fc is stored in RAM) without waiting for I/O, because the file read has already happened.
    With this technique you can process very big files, around 500 MB, within 1 minute.
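
    The same slurp-then-regex idea can be sketched in Python. This is only an illustration: the function names and the sample regex (the text before the first ":" on each line) are assumptions, and the whole file still has to fit in RAM.

```python
import re


def slurp(path):
    """Read the whole file into one string, like undefining $/ in Perl."""
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        return handle.read()


def first_fields(text):
    """Example processing step: the text before the first ':' on every line."""
    return re.findall(r"^([^:\n]*):", text, flags=re.MULTILINE)
```

    A call such as first_fields(slurp("file.txt")) then does one read followed by one regex pass over the in-memory string.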

    I am fed up with Perl. Python is the future.

    HTH.
    Computer: Dell Latitude E7440 (since December 2017); CPU: Intel Core i7-4600U, 2 cores, 4 threads; HDD: 256 GB SSD; RAM: 8 GB DDR3L; Graphics: Intel HD 4400 (integrated in the CPU)
    Registered Linux User #454206

  8. #8
    Join Date
    Jan 2010
    Location
    Wheeling WV USA
    Beans
    1,033
    Distro
    Ubuntu 16.04 Xenial Xerus

    Re: need a more complicated cut command

    i want to get the expression to reformat in one line, preferably quite short. expressing such things in Perl or Python would be a struggle getting it into one line, and beyond 2 or 3 data items i don't see how that would even be possible. so that's why i came up with my own little way to express how to cut up columns for output. an example is (in single quotes like on a shell command) ':>:>6]^,>5[,>20@'
    which would do 3 columns, the 1st between two : and output it 6 characters wide right justified, the 2nd between two , found from the start of each line 5 characters wide left justified, and the data up to the next , placed at column 20 in the output. this would be the only command argument if reading stdin and writing stdout.
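
    For comparison, here is what that example might look like written out by hand in Python. This is only one reading of the description above: the function name reformat is made up, "column 20" is assumed to mean a 0-based offset in the output line, and a line missing a delimiter simply raises ValueError.

```python
def reformat(line):
    """Hand-written equivalent of one reading of ':>:>6]^,>5[,>20@'."""
    line = line.rstrip("\n")

    # 1st column: text between the first two ':' characters.
    c1 = line.index(":")
    c2 = line.index(":", c1 + 1)
    first = line[c1 + 1:c2]

    # 2nd column: text between the first two ',' searching from the line start.
    k1 = line.index(",")
    k2 = line.index(",", k1 + 1)
    second = line[k1 + 1:k2]

    # 3rd column: text from there up to the next ',' (or the end of the line).
    k3 = line.find(",", k2 + 1)
    third = line[k2 + 1:] if k3 == -1 else line[k2 + 1:k3]

    # Right-justify in 6, left-justify in 5, then pad so the 3rd column
    # starts at output offset 20 (0-based, an assumption).
    return (first.rjust(6) + second.ljust(5)).ljust(20) + third
```

    That is roughly a dozen statements for three columns, which is the verbosity the one-line expression is meant to avoid.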

  9. #9
    Join Date
    Mar 2010
    Location
    Squidbilly-Land
    Beans
    16,531
    Distro
    Ubuntu Mate 16.04 Xenial Xerus

    Re: need a more complicated cut command

    Quote: Originally posted by Skaperen
    i want to get the expression to reformat in one line, preferably quite short. expressing such things in Perl or Python would be a struggle getting it into one line, and beyond 2 or 3 data items i don't see how that would even be possible. so that's why i came up with my own little way to express how to cut up columns for output. an example is (in single quotes like on a shell command) ':>:>6]^,>5[,>20@'
    which would do 3 columns, the 1st between two : and output it 6 characters wide right justified, the 2nd between two , found from the start of each line 5 characters wide left justified, and the data up to the next , placed at column 20 in the output. this would be the only command argument if reading stdin and writing stdout.
    Those are new requirements, not mentioned in the OP.

    Seems like the definition of write-only code. In 5 yrs, when you come back, will you remember this syntax? Will the next guy? OTOH, bash has functions you could add to your shell. That would be 1 line to invoke. Writing maintainable code is an important goal.

  10. #10
    Join Date
    Jan 2010
    Location
    Wheeling WV USA
    Beans
    1,033
    Distro
    Ubuntu 16.04 Xenial Xerus

    Re: need a more complicated cut command

    i am not trying to implement a command for each specific case, but rather, just one (or up to four) that can handle a variety of cases, more than cut can. the flexibility of (g)awk, perl, python, or any programming language is not the goal.

    the new "language" (that this command gets as an argument) is still being developed, it can change again. here are the notes i made today:
    Code:
    input actions:
    
    \ next character is only data
    ' everything to next ' is data but \ still works
    " everything to next " is data but \ still works
    + move specified number of columns to right
    - move specified number of columns to left
    ^ set position to front of input line
    $ set position to end of input line
    > find data to right
    < find data to left
    { get input from specified number of columns on right
    } get input from specified number of columns on left
    
    data action:
    
    _ match one or more white spaces (spaces, tabs) unless given as \_ or quoted
    
    output actions:
    
    ! output from data
    | output from input
    ] output is padded to specified width, justified right
    [ output is padded to specified width, justified left
    @ output is placed at specified position in output
    it's designed with the idea of interpreting it character by character. regular characters get collected into a string with _ being recorded in a special code (unless as \_ or quoted). some actions use the data as a number and some others use it as a string to either be searched or to be output. each time an action sets a position that position is pushed on a stack. when an output action uses input, it pops 2 positions from the stack, pushes a copy of the newest one back onto the stack, maybe flips them to have them in order, and indexes the input line to get the input that it will output. this is done on each line.
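
    The position-stack step described above (pop two positions, push the newest back, put them in order, index the input line) could be sketched like this in Python. The name take_span is hypothetical, and positions are assumed to be 0-based offsets into the line.

```python
def take_span(stack, line):
    """Pop two positions, push the newest back, return the input text between them."""
    newest = stack.pop()
    older = stack.pop()
    stack.append(newest)                  # a copy of the newest position stays on the stack
    start, end = sorted((older, newest))  # flip them if they are out of order
    return line[start:end]
```

    Because the newest position stays on the stack, the next output action only needs one more position pushed before it can take another span.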

    i want to also implement one that "compiles" the string ahead of time to see if it performs better. the first implementation is being done in Python. then maybe i'll do one in C.
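
    One possible shape for the "compile ahead of time" variant, as a rough Python sketch: walk the expression once and turn it into a list of (data, action) steps that can then be replayed on every input line. The names compile_expr and WS are assumptions, as is using "\x00" as the special code for an unquoted _. Any trailing data with no action after it is dropped in this sketch.

```python
ACTIONS = set("+-^$><{}!|][@")
WS = "\x00"  # assumed special code standing for "one or more white spaces"


def compile_expr(expr):
    """Turn the expression into a list of (data, action) steps, once, up front."""
    steps, data, quote = [], [], None
    chars = iter(expr)
    for ch in chars:
        if ch == "\\":
            nxt = next(chars, None)   # escaped character is always plain data
            if nxt is not None:
                data.append(nxt)
        elif quote:
            if ch == quote:
                quote = None          # closing quote ends the quoted run
            else:
                data.append(ch)
        elif ch in "'\"":
            quote = ch
        elif ch == "_":
            data.append(WS)
        elif ch in ACTIONS:
            steps.append(("".join(data), ch))
            data = []
        else:
            data.append(ch)
    return steps
```

    With this sketch, compile_expr(':>:>6]^,>5[,>20@') yields eight steps, starting with (':', '>') and ending with ('20', '@').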
