Thread: Unlocking the mysteries of UTF-8

  1. #1
    Join Date
    Aug 2009
    Beans
    110

    Unlocking the mysteries of UTF-8

    This question arose because of a recent need to put some Hangeul (Korean) characters in a document.
    I have a file consisting of the following text string: ampersand, pound sign, "51060", semicolon (that is, the HTML numeric character reference for code point 51060).
    Sending this file to my browser (Firefox v. 12.0 for Ubuntu, with character encoding set to UTF-8), produces
    the Hangeul representation of the syllable "i" (the number 2 in Korean).

    When I use cut-and-paste to copy the browser display to a text file, the Hangeul script is still visible (for example, using "cat"), but the contents of the file look like

    Code:
    $ od -x i.txt
    0000000 9dec 0ab4
    0000004
    Try as I may, I can't figure out how the sixteen-bit number 51060 (0xC774) in the HTML file translates to the thirty-two bits in the text file. Guessing that the "0a" byte is an end-of-file marker, but beyond that, am lost.

    OS is Ubuntu 10.04 LTS. The /etc/defaults/ file looks like
    Code:
    LANG="en_US.UTF-8"
    Thanks for any clues.

  2. #2
    Join Date
    Jul 2007
    Location
    Poland
    Beans
    4,499
    Distro
    Ubuntu 14.04 Trusty Tahr

    Re: Unlocking the mysteries of UTF-8

    https://en.wikipedia.org/wiki/UTF-8#Description
    To avoid collisions between characters of different byte lengths, UTF-8 adds marker bits that indicate how many bytes each character occupies (otherwise you wouldn't be able to tell the single value [51060]/[c774] apart from the two values [199][116]/[c7][74]).

    51060 is in the 3-byte range, so it is padded with 8 marker bits in total: 1110 on the first byte and 10 on each of the two continuation bytes.
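
    For anyone who wants to check which range a code point falls in, here is a rough sketch of the range test. The function name utf8_len is made up for illustration (it is not a standard utility), and the boundaries are the ones from the Wikipedia table linked above.

```shell
# Hypothetical helper: print how many UTF-8 bytes a code point needs.
# Argument is the code point in decimal.
utf8_len() {
    cp=$1
    if   [ "$cp" -lt 128 ];   then echo 1   # U+0000..U+007F,   no marker bits beyond the leading 0
    elif [ "$cp" -lt 2048 ];  then echo 2   # U+0080..U+07FF,   110xxxxx 10xxxxxx
    elif [ "$cp" -lt 65536 ]; then echo 3   # U+0800..U+FFFF,   1110xxxx 10xxxxxx 10xxxxxx
    else                           echo 4   # U+10000..U+10FFFF, 11110xxx plus three 10xxxxxx
    fi
}

utf8_len 51060
```

    For 51060 this prints 3, which matches the three bytes seen in the file before the newline.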

    Code:
    $ echo "obase=2; 51060" | bc
    1100011101110100
    According to the UTF-8 spec, these 16 bits are split 4/6/6 (1100 | 011101 | 110100) and prefixed with 1110, 10 and 10, which gives
    111011001001110110110100
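
    The same padding can be done with shell arithmetic instead of by hand. This is only a sketch for the 3-byte range (2048 to 65535); utf8_encode3 is a name invented for this example.

```shell
# Hypothetical helper: encode a code point from the 3-byte range as UTF-8
# and print the three bytes in hex. Argument is the code point in decimal.
utf8_encode3() {
    cp=$1
    b1=$(( 224 | (cp >> 12) ))         # 1110xxxx: marker 1110 + top 4 bits
    b2=$(( 128 | ((cp >> 6) & 63) ))   # 10xxxxxx: marker 10 + middle 6 bits
    b3=$(( 128 | (cp & 63) ))          # 10xxxxxx: marker 10 + low 6 bits
    printf '%02X %02X %02X\n' "$b1" "$b2" "$b3"
}

utf8_encode3 51060
```

    This prints EC 9D B4, the same three bytes that the bc conversion arrives at.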

    Code:
    $ echo "ibase=2; 111011001001110110110100" | bc
    15506868
    $ echo "obase=16; 15506868" | bc
    EC9DB4
    0000000 [9d][ec] [0a][b4]
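    The bytes are printed in a different order because od -x groups them into 16-bit words and shows each word in the machine's byte order; on a little-endian PC every pair appears swapped, so the file really holds ec 9d b4 0a. The 0a is a newline (if you run od with -c it will be clearly visible).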
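
    To see the bytes in file order without the word swapping, od can be told to print one byte at a time. A small sketch, assuming an od that supports the POSIX -A and -t options; the file name hangul.txt is just the one used in this post.

```shell
# Recreate the four bytes from the thread: EC 9D B4 followed by a newline.
# Octal escapes in the printf format string are portable across shells.
printf '\354\235\264\n' > hangul.txt

# -An drops the offset column, -tx1 prints single bytes in hex, in file order.
od -An -tx1 hangul.txt
```

    The output lists ec, 9d, b4 and 0a in that order, so no mental unswapping is needed.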

    Code:
    $ od -c hangul.txt
    0000000 354 235 264  \n
    0000004
    od -c shows non-printable bytes in octal: 354 is 236 decimal/EC hex, 235 is 157/9D, 264 is 180/B4.
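
    If you don't feel like converting octal in your head, printf can do it, since it reads a numeric argument with a leading 0 as octal. A quick sketch:

```shell
# The leading 0 makes printf parse each argument as an octal constant
# when it is consumed by %d or %X; %s prints it unchanged.
for o in 0354 0235 0264; do
    printf '%s octal = %d decimal = %X hex\n' "$o" "$o" "$o"
done
```

    The first line of output is "0354 octal = 236 decimal = EC hex", and the other two follow the same pattern.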
    Last edited by Vaphell; May 7th, 2012 at 11:01 PM.

  3. #3
    Join Date
    Aug 2009
    Beans
    110

    Re: Unlocking the mysteries of UTF-8

    Thank you, Vaphell, for a clear explanation.
    Thanks also for the Wikipedia link.
    (Don't know why my own web search didn't find it.)

    Have marked this thread solved.
