
Thread: what input for UTF-8/Unicode in bash?

  1. #1
    Join Date
    Jan 2010
    Location
    Wheeling WV USA
    Beans
    2,023
    Distro
    Xubuntu 20.04 Focal Fossa

    what input for UTF-8/Unicode in bash?

    i want to type in, or assign to a variable to be expanded in a command, some kind of sequence that gets expanded to the bytes of a UTF-8 sequence or a Unicode value. i have a file that has UTF-8 in its name. a list of file names will be output. i want to filter these names with one of the grep commands. can this be done?
    Mask wearer, Social distancer, System Administrator, Programmer, Linux advocate, Command Line user, Ham radio operator (KA9WGN/8, tech), Photographer (hobby), occasional tweetXer

  2. #2
    Join Date
    Mar 2011
    Location
    U.K.
    Beans
    Hidden!
    Distro
    Ubuntu 22.04 Jammy Jellyfish

    Re: what input for UTF-8/Unicode in bash?

    To search for strings inside files I use ripgrep.

    P.S. Here is a ripgrep wrapper for various file types.
    Last edited by dragonfly41; December 28th, 2019 at 07:52 PM. Reason: added link to wrapper

  3. #3
    Join Date
    Jan 2010
    Location
    Wheeling WV USA
    Beans
    2,023
    Distro
    Xubuntu 20.04 Focal Fossa

    Re: what input for UTF-8/Unicode in bash?

    filtering file names is just one example of the use of inputting unicode and/or utf-8 in bash. sorry i didn't make that clear. how to do the input is my goal, from knowing the numeric codes for these characters, such as searching for "Владимир Путин" but not being able to type or paste these characters in.

  4. #4
    Join Date
    Aug 2011
    Location
    52.5° N 6.4° E
    Beans
    6,821
    Distro
    Xubuntu 22.04 Jammy Jellyfish

    Re: what input for UTF-8/Unicode in bash?

    If you have the numeric codes of the characters and want to type them in a terminal (or many other applications), use ctrl+shift+u, [numeric code in hexadecimal], <space>. What shell or other application is running in the terminal is irrelevant (most of the time), as the input is handled by the terminal, which converts it to UTF-8 encoded strings, which are passed to the application running inside.

    Most naïve applications, tools, and library functions designed to handle ASCII can handle UTF-8 without modification. That's how UTF-8 was designed. The only real change is that counting characters is no longer the same as counting bytes.
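
    The difference between characters and bytes is easy to check in bash. A minimal sketch, assuming bash and an installed C.UTF-8 locale (as on Ubuntu); the variable names are only for illustration:

```shell
# Assumption: bash, with the C.UTF-8 locale available (it is on Ubuntu).
# Drop the export if your session already uses a UTF-8 locale.
export LC_ALL=C.UTF-8

word='Καλημέρα'

# ${#word} counts characters in the current locale.
chars=${#word}

# wc -c counts bytes; each of these Greek letters takes two bytes in UTF-8.
bytes=$(printf '%s' "$word" | wc -c)

echo "characters: $chars, bytes: $((bytes))"
```

    In a non-UTF-8 locale such as plain C, bash would report the byte count for both, which is one quick way to spot a locale problem.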

    Instead of learning all numeric codes by heart, you can often use the compose key (for symbols, accented characters or characters derived from ordinary letters) or switch keyboard layout (for a completely different alphabet).

  5. #5
    Join Date
    Nov 2007
    Location
    London, England
    Beans
    7,701

    Re: what input for UTF-8/Unicode in bash?

    Also:
    You don't need to type the leading zeros in Ctrl+Shift+U [hex-number] <space>.
    There is a fascinating app called Character Map (binary name gucharmap) that shows all unicode characters, and you can double-click characters to put them in a copyable text field at the bottom.

  6. #6
    Join Date
    Jan 2010
    Location
    Wheeling WV USA
    Beans
    2,023
    Distro
    Xubuntu 20.04 Focal Fossa

    Re: what input for UTF-8/Unicode in bash?

    i think i'm going to set a keyboard shortcut (hot key) to start gucharmap.

    i just tried ctrl+shift+u with a code and just got the unconverted ascii character, starting with the u in lower case. this was with xfce4-terminal.

    my big interest is what backslash-style or similar sequences can do. and i don't mean a means to emulate having the proper keyboard by entering the character code and having the terminal pass the raw code in the width (8 bit, 16 bit, 32 bit) that the terminal is emulating. i mean sequences that are passed to applications as the characters actually typed, and that encode unicode once they are eventually processed. i want to pass an encoded form like this in command arguments or in piped input to a shell.

  7. #7
    Join Date
    Aug 2011
    Location
    52.5° N 6.4° E
    Beans
    6,821
    Distro
    Xubuntu 22.04 Jammy Jellyfish

    Re: what input for UTF-8/Unicode in bash?

    I'm not sure what you mean by "the unconverted ascii character, starting with the u in lower case". Also not sure what you mean by "the raw code in the width (8 bit, 16 bit, 32 bit) that the terminal is emulating". The ctrl+shift+u trick works in xfce4-terminal, in firefox, nearly everywhere.

    If you want to pass the actual escape sequence to the application and have the application process it, just type the escape sequence, making sure it isn't already processed by the shell and that your application actually supports it. Many do, as it's a simple feature, but support varies by sequence: for example, /bin/echo doesn't handle \u escapes, while the bash builtin echo (with -e) does.

    As a few examples, when I want my terminal to wish me a good day in Greek, I can type:
    Code:
    # By switching keyboard layout with the <menu> key:
    # Actual keypresses:
    # e c h o <space> <menu> shift+k a l h m ; e r a <menu> <enter>
    # Gives this command line:
    $ echo Καλημέρα
    
    # By having the terminal interpret escape sequences:
    # Actual keypresses:
    # e c h o <space> ctrl+shift+u 3 9 a <space> ctrl+shift+u 3 b 1 <space> ctrl+shift+u 3 b b <space> ctrl+shift+u 3 b 7 <space> ctrl+shift+u 3 b c <space> ctrl+shift+u 3 a d <space> ctrl+shift+u 3 c 1 <space> ctrl+shift+u 3 b 1 <space> <enter>
    # Gives the command line:
    $ echo Καλημέρα
    
    # By having bash interpret escape sequences:
    # Just type this at the prompt:
    $ echo $'\u39a\u3b1\u3bb\u3b7\u3bc\u3ad\u3c1\u3b1'
    # Maybe that is what you mean?
    
    # By having the application interpret the escape sequences. Use quotes to prevent the shell from processing the backslashes.
    # Note that that doesn't really happen here, as echo is a shell builtin and the actual /bin/echo program
    # doesn't handle escape sequences for multibyte characters.
    # Type at the prompt:
    $ echo -e '\u39a\u3b1\u3bb\u3b7\u3bc\u3ad\u3c1\u3b1'
    In all cases the characters I actually typed are sent to the process interpreting the escape sequences, which then converts them to UTF-8. In Linux nearly all text is in UTF-8.
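
    For the original use case of filtering a list of file names, bash can build the grep pattern from the code points. A minimal sketch, assuming bash 4.2 or later and a UTF-8 locale; the file names and variable names are made up for illustration:

```shell
# Assumption: bash >= 4.2 and the C.UTF-8 locale (present on Ubuntu).
# Drop the export if your session already uses a UTF-8 locale.
export LC_ALL=C.UTF-8

# "Владимир" as code points, decoded by bash itself via $'...' quoting.
pattern=$'\u0412\u043b\u0430\u0434\u0438\u043c\u0438\u0440'

# The same text arriving as plain ASCII (say, from an argument or a pipe),
# decoded at run time by the printf builtin's %b conversion.
escaped='\u0412\u043b\u0430\u0434\u0438\u043c\u0438\u0440'
decoded=$(printf '%b' "$escaped")

# Hypothetical file names standing in for the output of ls or find.
# grep -F treats the pattern as a fixed string, not a regular expression.
matches=$(printf '%s\n' 'notes.txt' "report-${pattern}.pdf" 'todo.md' | grep -F -- "$decoded")
echo "$matches"
```

    The single-quoted form stays pure ASCII until printf decodes it, so it can be stored in a file or passed through tools that can't handle the real characters.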

  8. #8
    Join Date
    Jan 2010
    Location
    Wheeling WV USA
    Beans
    2,023
    Distro
    Xubuntu 20.04 Focal Fossa

    Re: what input for UTF-8/Unicode in bash?

    basically, it's not happening for me. no conversion took place. i just got to see the literal characters i typed. while holding down shift and ctrl when i typed the "u", i saw a "u". then i typed 0412, the Unicode number for the Russian letter "В". but i didn't get "В". all i got was "u0412" ("u" was in lower case).

    the bash interpreted backslash example worked. i got "Καλημέρα", though when googling that it seems google misunderstands the codes. 8 bytes in for 8 Greek letters, must be an ISO8859 coding.

  9. #9
    Join Date
    Jan 2010
    Location
    Wheeling WV USA
    Beans
    2,023
    Distro
    Xubuntu 20.04 Focal Fossa

    Re: what input for UTF-8/Unicode in bash?

    here is my list of Unicode to UTF-8 mappings. one of the many benefits of UTF-8 coding is that it sorts correctly: comparing the encoded bytes gives the same order as comparing the code points. the list is about 1.2 MB compressed and uncompresses to about 36 MB.
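
    the sorting property can be checked with a byte-wise sort. a minimal sketch, assuming a POSIX sort; the four sample characters are picked only for illustration:

```shell
# Four characters listed out of order: U+20AC, U+0412, U+007A, U+00DF.
# LC_ALL=C makes sort compare raw bytes instead of using locale collation.
sorted=$(printf '%s\n' '€' 'В' 'z' 'ß' | LC_ALL=C sort)

# The byte-wise result comes out in code point order: z, ß, В, €.
echo "$sorted"
```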

  10. #10
    Join Date
    Aug 2011
    Location
    52.5° N 6.4° E
    Beans
    6,821
    Distro
    Xubuntu 22.04 Jammy Jellyfish

    Re: what input for UTF-8/Unicode in bash?

    I thought you didn't need anything special to enable the ctrl+shift+u input. I certainly didn't. When I hit ctrl+shift+u, I get an underlined u, then it shows the hexadecimal code as I type it, and when I terminate with space, the whole u0412 is converted into В. (BTW, no need to type the leading 0, but it does no harm).

    When I use output redirect to echo Καλημέρα to a file, I get 17 bytes (including a terminating newline). I think your web browser does an encoding conversion before sending things to google, or google does something weird.
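
    The byte count is easy to verify, and it helps tell UTF-8 apart from a single-byte ISO 8859 encoding. A minimal sketch, assuming od and iconv are installed and that iconv supports ISO-8859-7 (the Greek single-byte encoding):

```shell
word='Καλημέρα'

# Bytes written by echo: 8 letters x 2 bytes in UTF-8, plus the newline.
utf8_len=$(echo "$word" | wc -c)

# The same text converted to ISO-8859-7: 1 byte per letter, plus the newline.
iso_len=$(echo "$word" | iconv -f UTF-8 -t ISO-8859-7 | wc -c)

echo "UTF-8: $((utf8_len)) bytes, ISO-8859-7: $((iso_len)) bytes"

# Hex dump of the UTF-8 form; every letter shows up as a pair of bytes.
echo "$word" | od -An -tx1
```

    If the first count is 17, the shell produced UTF-8, and any 8-byte form seen elsewhere was created by a later conversion.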

    BTW, terminology is a bit fuzzy here. What Unicode does is define a mapping between character names and numbers; what UTF-8 does is define a mapping between those numbers and byte sequences. In principle, these don't have to be linked, but in practice, they are.
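
    The two mappings can be looked at separately in bash. A minimal sketch, assuming bash in a UTF-8 locale and using the letter В from earlier in the thread; the variable names are only for illustration:

```shell
# Assumption: bash, with the C.UTF-8 locale available (it is on Ubuntu).
export LC_ALL=C.UTF-8
char='В'

# Unicode: character -> number. A leading quote in the argument makes
# printf print the numeric value of the character that follows it.
codepoint=$(printf 'U+%04X' "'$char")

# UTF-8: number -> bytes. od shows the bytes that encode the character.
bytes=$(printf '%s' "$char" | od -An -tx1 | tr -d ' \n')

echo "$char is $codepoint, stored as bytes $bytes"
```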
