what input for UTF-8/Unicode in bash?

**Skaperen** · December 28th, 2019

i want to type in, or assign to a variable to be expanded in a command, some kind of sequence that gets expanded to the bytes of a UTF-8 sequence or a Unicode value. i have a file that has UTF-8 in its name. a list of file names will be output. i want to filter these names with one of the grep commands. can this be done?

**dragonfly41** · December 28th, 2019

To search for strings inside files I use ripgrep.

P.S. Here is a ripgrep wrapper for various file types.

**Skaperen** · December 30th, 2019

filtering file names is just one example of the use of inputting unicode and/or utf-8 in bash. sorry i didn't make that clear. how to do the input is my goal, from knowing the numeric codes for these characters, such as searching for "Владимир Путин" but not being able to type or paste these characters in.

**Impavidus** · December 30th, 2019

If you have the numeric codes of the characters and want to type them in a terminal (or many other applications), use ctrl+shift+u, [numeric code in hexadecimal], <space>. What shell or other application is running in the terminal is irrelevant (most of the time), as the input is handled by the terminal, which converts it to UTF-8 encoded strings, which are passed to the application running inside.

Most naïve applications/tools/library functions designed to handle ASCII can handle UTF-8 without modification. That's how UTF-8 was designed. The only real change is that counting characters has become different from counting bytes.
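As a rough illustration of the byte/character difference (a sketch, not from the original post): every UTF-8 continuation byte has the form 10xxxxxx (0x80 to 0xBF), so deleting those bytes and counting what is left gives the number of code points without depending on the locale. The variable names are only for illustration, and the script is assumed to be saved as UTF-8.

```bash
# Count bytes vs. code points in a UTF-8 string without relying on the locale.
s='Καλημέρα'

# wc -c always counts bytes.
bytes=$(printf '%s' "$s" | wc -c)

# Continuation bytes are 0x80-0xBF (octal 200-277); dropping them leaves
# exactly one byte per code point.
chars=$(printf '%s' "$s" | LC_ALL=C tr -d '\200-\277' | wc -c)

echo "bytes=$((bytes)) chars=$((chars))"
# bytes=16 chars=8
```

An ASCII-only tool such as `wc -c` sees 16 units here, while a human reader sees 8 letters.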

Instead of learning all numeric codes by heart, you can often use the compose key (for symbols, accented characters or characters derived from ordinary letters) or switch keyboard layout (for a completely different alphabet).

**The Cog** · December 30th, 2019

Also:
You don't need to type the leading zeros in Ctrl+Shift+U [hex-number] <space>.
There is a fascinating app called Character Map (binary name gucharmap) that shows all unicode characters, and you can double-click characters to put them in a copyable text field at the bottom.

**Skaperen** · December 31st, 2019

i think i'm going to set a keyboard shortcut (hot key) to start gucharmap.

i just tried ctrl+shift+u with a code and just got the unconverted ascii character, starting with the u in lower case. this was with xfce4-terminal.

my big interest is what backslash style or similar sequences can do. and i don't mean a means to emulate having the proper keyboard by entering the character code and the terminal passing the raw code in the width (8 bit, 16 bit, 32 bit) that the terminal is emulating. i mean sequences passed to applications as the characters actually typed, where these can encode unicode when the sequences are, eventually, processed for the encoding. i want to pass an encoded form like this when passing command arguments or piped input to a shell.

**Impavidus** · December 31st, 2019

I'm not sure what you mean by "the unconverted ascii character, starting with the u in lower case". Also not sure what you mean by "the raw code in the width (8 bit, 16 bit, 32 bit) that the terminal is emulating". The ctrl+shift+u trick works in xfce4-terminal, in firefox, nearly everywhere.

If you want to pass the actual escape sequence to the application and have the application process it, just type an escape sequence, making sure it's not already processed by the shell, and making sure your application actually supports escape sequences. Many do, as it's a simple feature, but for example /bin/echo doesn't. The bash shell builtin echo does.

As a few examples, when I want my terminal to wish me a good day in greek, I can type:

Code:

# By switching keyboard layout with the <menu> key:
# Actual keypresses:
# e c h o <space> <menu> shift+k a l h m ; e r a <menu> <enter>
# Gives this command line:
$ echo Καλημέρα

# By having the terminal interpret escape sequences:
# Actual keypresses:
# e c h o <space> ctrl+shift+u 3 9 a <space> ctrl+shift+u 3 b 1 <space> ctrl+shift+u 3 b b <space> ctrl+shift+u 3 b 7 <space> ctrl+shift+u 3 b c <space> ctrl+shift+u 3 a d <space> ctrl+shift+u 3 c 1 <space> ctrl+shift+u 3 b 1 <space> <enter>
# Gives the command line:
$ echo Καλημέρα

# By having bash interpret escape sequences:
# Just type this at the prompt:
$ echo $'\u39a\u3b1\u3bb\u3b7\u3bc\u3ad\u3c1\u3b1'
# Maybe that is what you mean?

# By having the application interpret the escape sequences. Use quotes to prevent the shell from processing the backslashes.
# Note that that doesn't really happen here, as echo is a shell builtin and the actual /bin/echo program
# doesn't handle escape sequences for multibyte characters.
# Type at the prompt:
$ echo -e '\u39a\u3b1\u3bb\u3b7\u3bc\u3ad\u3c1\u3b1'

In all cases the characters I actually typed are sent to the process interpreting the escape sequences, then they are converted to UTF-8. In Linux nearly all text is in UTF-8.
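To tie this back to the original question, here is a minimal sketch of storing such a sequence in a variable and using it to filter a list of file names with grep. It assumes bash 4.2 or later (where `\uXXXX` is understood inside `$'...'`) and a UTF-8 locale. The locale name `C.UTF-8` and the sample file names are assumptions for illustration only.

```bash
# Assumption: a UTF-8 locale named C.UTF-8 exists; any UTF-8 locale will do.
# Bash encodes \uXXXX according to the current locale, so set it first.
export LC_ALL=C.UTF-8

# "Владимир" given by code points only.
name=$'\u0412\u043b\u0430\u0434\u0438\u043c\u0438\u0440'

# Stand-in for real output of ls or find: three hypothetical file names.
list=$(printf '%s\n' 'notes.txt' "${name}_report.pdf" 'readme.md')

# -F treats the pattern as a fixed string; -- guards against a leading dash.
matches=$(printf '%s\n' "$list" | grep -F -- "$name")
printf '%s\n' "$matches"
```

In a non-UTF-8 locale bash may leave the `\u` escapes unconverted, which is why the locale is set before the assignment.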

**Skaperen** · January 1st, 2020

basically, it's not happening for me. no conversion took place. i just got to see the literal characters i typed. while holding down shift and ctrl when i typed the "u", i saw a "u". then i typed 0412, the Unicode number for Russian letter "В". but i didn't get "В". all i got was "u0412" ("u" was in lower case).

the bash interpreted backslash example worked. i got "Καλημέρα", though when googling that it seems google misunderstands the codes. 8 bytes in for 8 Greek letters, must be an ISO8859 coding.

**Skaperen** · January 1st, 2020

here is my list of Unicode to UTF-8 mapping. one of the many benefits of UTF-8 coding is that it sorts correctly: a plain byte-wise sort puts strings in code point order. the list is about 1.2 MB compressed and uncompresses to about 36 MB.
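The sorting property can be checked with a small sketch (not part of the original post): comparing UTF-8 strings byte by byte gives the same order as comparing their code points. `LC_ALL=C` forces `sort` to compare raw bytes; the sample characters are arbitrary.

```bash
# Four characters listed out of order: U+20AC, U+0041, U+044F, U+039A.
sorted=$(printf '%s\n' '€' 'A' 'я' 'Κ' | LC_ALL=C sort)
printf '%s\n' "$sorted"
# Byte-wise order matches code point order:
# A (U+0041), Κ (U+039A), я (U+044F), € (U+20AC)
```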

**Impavidus** · January 1st, 2020

I thought you didn't need anything special to enable the ctrl+shift+u input. I certainly didn't. When I hit ctrl+shift+u, I get an underlined u, then it shows the hexadecimal code as I type it, and when I terminate with space, the whole u0412 is converted into В. (BTW, no need to type the leading 0, but it does no harm).

When I use output redirect to echo Καλημέρα to a file, I get 17 bytes (including a terminating newline). I think your web browser does an encoding conversion before sending things to google, or google does something weird.
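One way to check the byte count directly, as a sketch: `od` dumps the bytes that `echo` writes, whatever a browser or search engine later does with them. This assumes the text is entered or saved as UTF-8.

```bash
# Hex-dump what echo writes; xargs only collapses od's padding and line breaks.
hex=$(echo 'Καλημέρα' | od -An -tx1 | xargs)
echo "$hex"
# ce 9a ce b1 ce bb ce b7 ce bc ce ad cf 81 ce b1 0a

# Each letter takes two bytes, and the final 0a is the newline from echo,
# so counting the hex pairs gives 17.
echo "$hex" | wc -w
```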

BTW, terminology is a bit fuzzy here. What Unicode does is that it defines a mapping between character names and numbers; what UTF-8 does is that it defines a mapping between numbers and byte sequences. In principle, these don't have to be linked, but in practice, they are.
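To make the second mapping concrete, here is a minimal sketch of the number-to-bytes step done by hand with shell arithmetic. The function name `utf8_hex` is hypothetical, and the sketch does no validation (for example, it does not reject surrogates or values above U+10FFFF).

```bash
# Print the UTF-8 bytes (as hex) for one code point given in hexadecimal.
utf8_hex() {
    local cp=$((0x$1))
    if [ "$cp" -lt 128 ]; then        # 1 byte:  0xxxxxxx
        printf '%02x\n' "$cp"
    elif [ "$cp" -lt 2048 ]; then     # 2 bytes: 110xxxxx 10xxxxxx
        printf '%02x %02x\n' \
            $((0xC0 | (cp >> 6))) $((0x80 | (cp & 0x3F)))
    elif [ "$cp" -lt 65536 ]; then    # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        printf '%02x %02x %02x\n' \
            $((0xE0 | (cp >> 12))) $((0x80 | ((cp >> 6) & 0x3F))) \
            $((0x80 | (cp & 0x3F)))
    else                              # 4 bytes: 11110xxx plus three 10xxxxxx
        printf '%02x %02x %02x %02x\n' \
            $((0xF0 | (cp >> 18))) $((0x80 | ((cp >> 12) & 0x3F))) \
            $((0x80 | ((cp >> 6) & 0x3F))) $((0x80 | (cp & 0x3F)))
    fi
}

utf8_hex 412     # d0 92        (В)
utf8_hex 3ad     # ce ad        (έ)
utf8_hex 20ac    # e2 82 ac     (€)
utf8_hex 1f600   # f0 9f 98 80
```

Leading zeros are optional here too, since the argument is just a hexadecimal number.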
