cat Japanese text file

**ntzrmtthihu777** · November 18th, 2012

When I ls the contents of a directory with Japanese filenames it works perfectly fine,

Code:

ls
$read            oto.ini      あ_wav.frq    え.wav
Mattithyahu.bmp  readme.txt   い.wav        え_wav.frq
New Project.ust  あ.wav        い.wav.uspec  ゆ.wav
character.txt    あ.wav.uspec  い_wav.frq    ゆ_wav.frq

but when I cat a text file with Japanese characters (the very same ones) it gives me these replacement characters.

Code:

cat oto.ini
��.wav=a,0,0,0,0,0
��.wav=i,0,0,0,0,0
��.wav=e,0,0,0,0,0
��.wav=yu,0,0,0,0,0

I tried setting the language to Japanese (have to do this with a few of my programs) but it still gives me the same results.

Code:

LANG=ja_JP.utf8 cat oto.ini
��.wav=a,0,0,0,0,0
��.wav=i,0,0,0,0,0
��.wav=e,0,0,0,0,0
��.wav=yu,0,0,0,0,0

It *should* read as

Code:

あ.wav=a,0,0,0,0,0
い.wav=i,0,0,0,0,0
え.wav=e,0,0,0,0,0
ゆ.wav=yu,0,0,0,0,0

Does anyone know how to fix this issue?

**The Cog** · November 18th, 2012

Do you know what character encoding the text files are using?

It might help to see the actual file contents, just a snippet with this command would help:

Code:

cat oto.ini | hd | head

P.S.
That character is apparently HIRAGANA LETTER A, Unicode code point U+3042, which in UTF-8 encoding would be the three bytes 0xE3 0x81 0x82.

**Vaphell** · November 18th, 2012

The question is, does the .ini file actually use UTF-8 encoding?

Code:

$ echo "い"  | od -x
0000000 81e3 0a84
0000004
$ echo "い" > jp.txt
$ cat jp.txt
い
$ cat jp.txt | od -x
0000000 81e3 0a84
0000004
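A note on reading that od output: `od -x` prints 16-bit words in the host's byte order, so on a little-endian machine (assumed here) each pair of bytes appears swapped. A minimal Python 3 sketch of the same view, assuming Python 3.8 or newer for `bytes.hex(" ")`:

```python
import struct

# The bytes that echo writes for "い" plus the trailing newline.
data = "い\n".encode("utf-8")
print(data.hex(" "))            # e3 81 84 0a

# od -x reads two bytes at a time as one 16-bit word; with little-endian
# byte order the second byte of each pair becomes the high half.
words = struct.unpack("<2H", data)
od_style = " ".join(f"{w:04x}" for w in words)
print(od_style)                 # 81e3 0a84
```

So `81e3 0a84` really means the byte sequence e3 81 84 0a, which is the UTF-8 encoding of い followed by a newline.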

**ntzrmtthihu777** · November 18th, 2012

Originally Posted by The Cog

Do you know what character encoding the text files are using?

It might help to see the actual file contents, just a snippet with this command would help:
cat oto.ini | hd | head

First off, thank you for this interesting new trick you taught me! Hehe, it would be very useful to a project I had on the back burner.

Second, the encoding is Japanese (SHIFT_JIS). Can I set it to this with LANG= or something similar?

Code:

cat oto.ini | hd | head
00000000  82 a0 2e 77 61 76 3d 61  2c 30 2c 30 2c 30 2c 30  |...wav=a,0,0,0,0|
00000010  2c 30 0d 0a 82 a2 2e 77  61 76 3d 69 2c 30 2c 30  |,0.....wav=i,0,0|
00000020  2c 30 2c 30 2c 30 0d 0a  82 a6 2e 77 61 76 3d 65  |,0,0,0.....wav=e|
00000030  2c 30 2c 30 2c 30 2c 30  2c 30 0d 0a 82 e4 2e 77  |,0,0,0,0,0.....w|
00000040  61 76 3d 79 75 2c 30 2c  30 2c 30 2c 30 2c 30 0d  |av=yu,0,0,0,0,0.|
00000050  0a                                                |.|
00000051
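To check that these bytes really are Shift_JIS, the first line of the dump can be decoded both ways. This is only a sketch in Python 3; the byte string is copied from the hex dump above, and the variable names are made up for illustration.

```python
# First line of oto.ini, copied byte for byte from the hex dump.
raw = bytes.fromhex(
    "82 a0 2e 77 61 76 3d 61 2c 30 2c 30 2c 30 2c 30 2c 30 0d 0a"
)

# Decoded as Shift_JIS, the two bytes 82 a0 become one character.
as_sjis = raw.decode("shift_jis")
print(repr(as_sjis))   # 'あ.wav=a,0,0,0,0,0\r\n'

# Decoded as UTF-8, 0x82 and 0xa0 are both invalid start bytes, so each
# one is replaced by U+FFFD, the replacement character.
as_utf8 = raw.decode("utf-8", errors="replace")
print(repr(as_utf8))
```

The two replacement characters match the �� seen in the cat output. The `0d 0a` at the end of each line is a Windows-style CRLF line ending. As far as I can tell, cat just copies the bytes through and the terminal decodes them as UTF-8, which would explain why changing LANG for cat makes no difference.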

**ntzrmtthihu777** · November 18th, 2012

Originally Posted by The Cog

P.S.
That character is apparently HIRAGANA LETTER A, unicode value 0x3042 which in UTF8 encoding would be three bytes 0xE3 0x81 0x82.

Yes, that is a hiragana 'a'. How did you tell? By searching for あ, or did you figure it out from the apparently useless �� characters? And can I still use this info in a shell script? What I am actually trying to achieve is a "Voicebank Aliaser" for Ubuntu/Linux UTAU users.
Basically, oto.ini contains info on all the wav files for a certain voicebank, and also any "aliases" they may have; in my example あ.wav is aliased to a.
What I eventually want is a simple script that will look at
a.wav=,#,#,#,#,#
recognize that a = あ, and fill in the alias, like
a.wav=あ,#,#,#,#,#
and also the reverse: given
あ.wav=,#,#,#,#,#
recognize that あ = a and produce
あ.wav=a,#,#,#,#,#
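The aliasing described above could be sketched roughly like this in Python 3. The lookup table and the function name `fill_alias` are hypothetical: the table only covers the four kana that appear in this thread, and a real voicebank would need a complete romaji/kana table.

```python
# Hypothetical romaji-to-kana table; only the four kana from this thread.
ROMAJI_TO_KANA = {"a": "あ", "i": "い", "e": "え", "yu": "ゆ"}
KANA_TO_ROMAJI = {kana: romaji for romaji, kana in ROMAJI_TO_KANA.items()}


def fill_alias(line):
    """Fill an empty alias field in one 'name.wav=alias,...' line."""
    name, sep, rest = line.partition("=")
    if not sep or not name.endswith(".wav"):
        return line                      # not an entry line, leave alone
    alias, comma, params = rest.partition(",")
    if alias:
        return line                      # already has an alias
    stem = name[: -len(".wav")]
    # Try romaji -> kana first, then kana -> romaji.
    new_alias = ROMAJI_TO_KANA.get(stem) or KANA_TO_ROMAJI.get(stem)
    if new_alias is None:
        return line                      # unknown name, leave alone
    return f"{name}={new_alias}{comma}{params}"


print(fill_alias("a.wav=,0,0,0,0,0"))    # a.wav=あ,0,0,0,0,0
print(fill_alias("あ.wav=,0,0,0,0,0"))   # あ.wav=a,0,0,0,0,0
```

Lines that already have an alias, or whose name is not in the table, are returned unchanged. To run this over a real oto.ini, the file would have to be read and written with the encoding it actually uses.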

**Vaphell** · November 18th, 2012

you can use iconv -f SHIFT_JIS -t UTF-8 to convert the file.

**ntzrmtthihu777** · November 18th, 2012

Originally Posted by Vaphell

you can use iconv -f SHIFT_JIS -t UTF-8 to convert the file.

Very interesting! Does this actually change the file itself or just allows its contents to be displayed properly in terminal? Because the program that uses the oto.ini file may require the SHIFT_JIS encoding.

**The Cog** · November 18th, 2012

Originally Posted by ntzrmtthihu777

Yes that is a hiragana 'a', how did you tell?

I did this command:

Code:

echo あ | hd

(copy/paste the character from the browser to the command prompt). That tells me the character is encoded as e3 81 82.
Equally, I could start python and paste the quoted character there (my typing follows the $ and >>> prompts):

Code:

$ python
Python 2.7.3 (default, Sep 26 2012, 21:53:58) 
[GCC 4.7.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> 'あ'
'\xe3\x81\x82'

To get the unicode value from that, I tell python to decode the utf8 bytes (2 ways shown):

Code:

>>> 'あ'.decode("utf8")
u'\u3042'
>>> "\xe3\x81\x82".decode("utf8")
u'\u3042'

Then I used the Character Map program supplied with Ubuntu. It's under Accessories in the Xubuntu start menu. I just looked the character up in there.

I don't know how to convert easily, so I will bow to Vaphell on that subject.
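The session above is Python 2. For anyone on Python 3, where strings are already Unicode, the same lookup (including the Character Map step) might look like the sketch below. It uses only the standard `unicodedata` module and assumes Python 3.8 or newer for `bytes.hex(" ")`.

```python
import unicodedata

ch = "あ"

code_point = f"U+{ord(ch):04X}"            # numeric code point
name = unicodedata.name(ch)                # official Unicode character name
utf8_hex = ch.encode("utf-8").hex(" ")     # bytes as written to a UTF-8 file

print(code_point, name, utf8_hex)
# U+3042 HIRAGANA LETTER A e3 81 82
```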

**ntzrmtthihu777** · November 18th, 2012

Very nice trick, that. For the most part I don't need it for what I do, have a very nice Japanese input method set up (anthy and iBus), but still a very cool trick.

**Vaphell** · November 18th, 2012

Originally Posted by ntzrmtthihu777

Very interesting! Does this actually change the file itself or just allows its contents to be displayed properly in terminal? Because the program that uses the oto.ini file may require the SHIFT_JIS encoding.

Not that I have ever converted anything with that tool, but a short test shows that it merely prints the converted content to standard output; the original file stays intact.
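With iconv, keeping a converted copy is a matter of redirecting the output, e.g. `iconv -f SHIFT_JIS -t UTF-8 oto.ini > oto.utf8.ini`. The same non-destructive conversion can be sketched in Python 3. The helper name `convert_copy` and the file name `oto.utf8.ini` are made up for this example, and the demo runs on a throwaway file rather than a real voicebank.

```python
import tempfile
from pathlib import Path


def convert_copy(src, dst, src_enc="shift_jis", dst_enc="utf-8"):
    """Write a re-encoded copy of src to dst; src itself is not modified."""
    data = Path(src).read_bytes()
    # Working on bytes keeps the CRLF line endings exactly as they were.
    Path(dst).write_bytes(data.decode(src_enc).encode(dst_enc))


# Demo on a temporary file that mimics the first line of oto.ini.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "oto.ini"
    dst = Path(tmp) / "oto.utf8.ini"
    original = "あ.wav=a,0,0,0,0,0\r\n".encode("shift_jis")
    src.write_bytes(original)

    convert_copy(src, dst)

    unchanged = src.read_bytes() == original
    converted = dst.read_bytes().decode("utf-8")
    print(unchanged, repr(converted))
```

Swapping the two encoding arguments converts an edited UTF-8 copy back to Shift_JIS, which matters if the program reading oto.ini only accepts that encoding.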

So what exactly do you want to achieve with these synonyms? Do you want to create symlinks pointing to the real .wav files, or what?
