
Thread: what input for UTF-8/Unicode in bash?

  1. #11
    Join Date
    Nov 2007
    Location
    London, England
    Beans
    6,922
    Distro
    Xubuntu 19.10 Eoan Ermine

    Re: what input for UTF-8/Unicode in bash?

    I know it's not a lot of help, but I use Xubuntu, and the Ctrl-Shift-U input has worked for me for years (currently on 19.10). So I think there may be something unusual in your setup. I've not been able to find a setting anywhere that looks like a good suspect.
    Just in case it provides a clue, the output of locale in a terminal for me is:
    Code:
    steve@StevesPC:~$ locale
    LANG=en_GB.UTF-8
    LANGUAGE=en_GB:en
    LC_CTYPE="en_GB.UTF-8"
    LC_NUMERIC=en_GB.UTF-8
    LC_TIME=en_GB.UTF-8
    LC_COLLATE="en_GB.UTF-8"
    LC_MONETARY=en_GB.UTF-8
    LC_MESSAGES="en_GB.UTF-8"
    LC_PAPER=en_GB.UTF-8
    LC_NAME=en_GB.UTF-8
    LC_ADDRESS=en_GB.UTF-8
    LC_TELEPHONE=en_GB.UTF-8
    LC_MEASUREMENT=en_GB.UTF-8
    LC_IDENTIFICATION=en_GB.UTF-8
    LC_ALL=
    and in Settings -> Language Support, my "Keyboard input method system" is "none".

  2. #12
    Join Date
    Jan 2010
    Location
    Wheeling WV USA
    Beans
    1,674
    Distro
    Xubuntu 18.04 Bionic Beaver

    Re: what input for UTF-8/Unicode in bash?

    I had previously used a keyboard input method for 8-bit codes that did not show the number as it was typed; it fed the byte as input as soon as the 3 decimal digits were entered. I assumed Ctrl+Shift+U worked the same way and jumped to the conclusion that the feature was not active. The one I used long ago was a loadable kernel module. The kernel can do this before the echo happens, and that is apparently what that module did. As soon as I saw "u0412" I just backspaced. Had I gone ahead with another character, I would have seen the converted character. This conversion also works in Firefox. So is it being done in the kernel (I'm guessing no, since the kernel would not try to echo the digits on the X display), in X (probably not, for similar reasons), or in the app (xfce4-terminal and Firefox)?

    The 8 bytes I was referring to were the 8 backslash octal codes. I'm guessing something translated ISO 8859 coding to Unicode, which was then encoded as UTF-8. BTW, I have written UTF-8 encoding and decoding software in both C and Python 3; I do know what it is. I was looking for the kinds of coded data entry so I can check how some scripts I'm creating might need to deal with user input that they get before any encoding happens. I can enter such codes in Python 3 string literals in source code.
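
As a rough illustration of that guess (not a diagnosis of what actually happened here), the following Python 3 sketch shows how 4 bytes of UTF-8 turn into 8 bytes when something treats each byte as an ISO-8859-1 character and re-encodes the result as UTF-8. The sample input (U+0412 followed by U+0413) and the helper names are assumptions made up for the example.

```python
def latin1_to_utf8(raw: bytes) -> bytes:
    """Treat each input byte as an ISO-8859-1 character and re-encode as UTF-8."""
    return raw.decode("iso-8859-1").encode("utf-8")


def octal_escapes(data: bytes) -> str:
    """Render bytes as backslash octal codes, one 3-digit code per byte."""
    return "".join(f"\\{b:03o}" for b in data)


# Hypothetical input: two Cyrillic letters already encoded as UTF-8 (4 bytes).
original = "\u0412\u0413".encode("utf-8")

# The same bytes after an unwanted ISO-8859-1 -> UTF-8 conversion (8 bytes).
doubled = latin1_to_utf8(original)

print(len(original), octal_escapes(original))
print(len(doubled), octal_escapes(doubled))
```

Every byte in the range 0x80 to 0xFF becomes two bytes in UTF-8, which is why the count doubles. The conversion is reversible: decoding `doubled` as UTF-8 and encoding it as ISO-8859-1 gives back `original`.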
    Social distancer, System Administrator, Programmer, Linux advocate, Command Line user, Ham radio operator (KA9WGN/8, tech), Photographer (hobby), occasional tweeter

  3. #13
    Join Date
    Aug 2011
    Location
    51.8° N 5.8° E
    Beans
    5,016
    Distro
    Xubuntu 19.10 Eoan Ermine

    Re: what input for UTF-8/Unicode in bash?

    As the codes have variable length, you need some kind of terminator. Apparently, this conversion of Ctrl+Shift+U codes is done in GTK. I guess Qt has its own methods of Unicode input handling, but I don't have any Qt applications installed to test.

    I've written some UTF-8 encoding, decoding and validating functions in C, but never in Python. As long as your input comes from a terminal, or a not-too-ancient text file created on a Linux system, or you use something like GTK, you can safely assume it's UTF-8 encoded Unicode, and as long as your output is to a terminal, to a text file, or something like GTK, you can safely output UTF-8 encoded Unicode. When you want to handle your own input method or text rendering, or if you care about compatibility with old files in legacy 8-bit encodings (ISO-8859-n etc.) or with Windows (where UTF-16 encoded Unicode is common), you have to pay attention to the encoding used.
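
For the compatibility case, a minimal Python 3 sketch of "assume UTF-8, but pay attention to the exceptions" might look like this. The function name `decode_text` and the choice of ISO-8859-1 as the legacy fallback are assumptions for illustration; a real script would pick the fallback that matches its own old files.

```python
import codecs


def decode_text(raw: bytes, legacy: str = "iso-8859-1") -> str:
    """Decode bytes as UTF-16 (if a BOM is present), else UTF-8, else a legacy 8-bit encoding."""
    # UTF-16 text written on Windows commonly starts with a byte order mark.
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return raw.decode("utf-16")
    try:
        # Strict decoding: invalid UTF-8 raises instead of producing garbage.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumed fallback; ISO-8859-1 accepts any byte sequence.
        return raw.decode(legacy)


print(decode_text("\u0412".encode("utf-8")))
print(decode_text(b"caf\xe9"))
print(decode_text("hi".encode("utf-16")))
```

The fallback is only a heuristic: a legacy 8-bit file whose bytes happen to form valid UTF-8 will be decoded as UTF-8, and UTF-16 without a byte order mark is not detected.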

  4. #14
    Join Date
    Jan 2010
    Location
    Wheeling WV USA
    Beans
    1,674
    Distro
    Xubuntu 18.04 Bionic Beaver

    Re: what input for UTF-8/Unicode in bash?

    In Python string literals, Unicode escape sequences are written "\u0123" (exactly 4 hexadecimal digits) and "\U00012345" (exactly 8 hexadecimal digits, with a maximum value of 0010FFFF). For user input, one way I thought of would be "\u" or "\U" to start a run of hexadecimal digits, where any non-hex-digit ends the sequence and "\ " ends it without being an added character (so "\u12345\ foo" has a length of 4 Unicode characters, avoiding the "f" being interpreted as a digit). If you really want a space before "foo", then use "\u12345 foo".
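
A minimal Python 3 sketch of the variable-length scheme described above, in case it helps to try it out. The function name `expand_escapes` and the decision to reject values above 0x10FFFF with a ValueError are assumptions of this sketch, not part of the proposal.

```python
import re

# "\u" or "\U", then one or more hex digits, then an optional "\ " terminator.
_ESCAPE = re.compile(r"\\[uU]([0-9A-Fa-f]+)(\\ )?")


def expand_escapes(text: str) -> str:
    """Replace variable-length \\u / \\U hex escapes with the characters they name."""

    def replace(match: re.Match) -> str:
        code = int(match.group(1), 16)
        if code > 0x10FFFF:
            # Assumption: out-of-range values are an error rather than kept literally.
            raise ValueError(f"code point out of range: {match.group(0)!r}")
        # The optional "\ " terminator is part of the match, so it is dropped here.
        return chr(code)

    return _ESCAPE.sub(replace, text)


# Raw strings keep the backslashes as typed, like text read from user input.
print(len(expand_escapes(r"\u12345\ foo")))   # 4
print(len(expand_escapes(r"\u12345 foo")))    # 5
print(expand_escapes(r"\u41\ bc"))            # Abc
```

The last line shows why the terminator matters: without it, "\u41bc" would be read as the single code point U+41BC instead of "A" followed by "bc".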

    Input as UTF-8 is usually a safe bet. With Ctrl+Shift+U that should work for the entire Unicode code space. But if this is done in toolkits such as GTK and Qt, it might not be universal enough to totally dismiss getting input with escape sequences. Then the issue is whether you want to support that in your app. If we can get Ctrl+Shift+U in more places (X, Wayland, etc.), it would be universal enough to deprecate all terminal input backslashes for anything but backslashes.
