https://en.wikipedia.org/wiki/UTF-8#Description
To avoid collisions between characters of different byte lengths, UTF-8 adds marker bits that indicate how many bytes each character occupies (otherwise you wouldn't be able to tell the single code point [51060]/[c774] apart from the two separate bytes [199][116]/[c7][74]).
51060 is in the 3-byte range (U+0800 to U+FFFF), so its 16 bits get 8 marker bits added in total, giving 24 bits.
Code:
$ echo "obase=2; 51060" | bc
1100011101110100
According to the UTF-8 spec, these 16 bits are split into groups of 4, 6 and 6 bits and prefixed with 1110, 10 and 10 respectively, which gives
111011001001110110110100
Code:
$ echo "ibase=2; 111011001001110110110100" | bc
15506868
$ echo "obase=16; 15506868" | bc
EC9DB4
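The same bit shuffling can be done with shell arithmetic instead of by hand. This is only a rough sketch: utf8_3byte is a made-up helper name, and it assumes the code point really is in the 3-byte range (no range check is done).

```shell
# Hypothetical helper: encode a code point from the 3-byte range
# (U+0800..U+FFFF) as UTF-8 and print the three bytes as hex.
# Assumption: the caller passes a decimal code point in that range.
utf8_3byte() {
    cp=$1
    b1=$(( 0xE0 | (cp >> 12) ))          # 1110xxxx: top 4 bits
    b2=$(( 0x80 | ((cp >> 6) & 0x3F) ))  # 10xxxxxx: middle 6 bits
    b3=$(( 0x80 | (cp & 0x3F) ))         # 10xxxxxx: low 6 bits
    printf '%02X%02X%02X\n' "$b1" "$b2" "$b3"
}

utf8_3byte 51060   # prints EC9DB4
```

The three constants are just the marker bits from the padding step: 0xE0 is 11100000 and 0x80 is 10000000.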
0000000 [9d][ec] [0a][b4]
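The bytes are printed in a different order here, presumably because od groups them into 2-byte words by default and a little-endian machine shows each pair swapped. 0a is the trailing newline (if you run od with -c it will be clearly visible).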
Code:
$ od -c hangul.txt
0000000 354 235 264 \n
0000004
354 is octal for decimal 236 / hex EC; likewise 235 = 157/9D and 264 = 180/B4.
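To sidestep the byte-order confusion, od can be told to print one byte at a time in hex. A small sketch, assuming a POSIX od and printf; hexbytes is a made-up helper name, and the four bytes are recreated from the octal values above rather than read from hangul.txt.

```shell
# Hypothetical helper: dump stdin as hex, one byte per unit, in file order.
# -An drops the offset column, -tx1 selects 1-byte hex units;
# tr strips the spacing so the result is one continuous string.
hexbytes() { od -An -tx1 | tr -d ' \n'; }

# Recreate the file contents from the octal escapes and dump them.
printf '\354\235\264\n' | hexbytes; echo   # prints ec9db40a
```

With 1-byte units there is nothing to swap, so the bytes come out as EC 9D B4 followed by the newline 0A.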