
Thread: cat Japanese text file

  1. #1
    Join Date
    May 2012
    Location
    ザ・ワ&
    Beans
    152
    Distro
    Xubuntu 12.04 Precise Pangolin

    cat Japanese text file

    When I ls the contents of a directory with Japanese filenames it works perfectly fine,
    Code:
    ls
    $read            oto.ini      あ_wav.frq    え.wav
    Mattithyahu.bmp  readme.txt   い.wav        え_wav.frq
    New Project.ust  あ.wav        い.wav.uspec  ゆ.wav
    character.txt    あ.wav.uspec  い_wav.frq    ゆ_wav.frq
    but when I cat a text file with Japanese characters (the very same ones) it gives me these replacement characters.
    Code:
    cat oto.ini
    ��.wav=a,0,0,0,0,0
    ��.wav=i,0,0,0,0,0
    ��.wav=e,0,0,0,0,0
    ��.wav=yu,0,0,0,0,0
    I tried setting the language to Japanese (have to do this with a few of my programs) but it still gives me the same results.
    Code:
    LANG=ja_JP.utf8 cat oto.ini
    ��.wav=a,0,0,0,0,0
    ��.wav=i,0,0,0,0,0
    ��.wav=e,0,0,0,0,0
    ��.wav=yu,0,0,0,0,0
    It *should* read as
    Code:
    あ.wav=a,0,0,0,0,0
    い.wav=i,0,0,0,0,0
    え.wav=e,0,0,0,0,0
    ゆ.wav=yu,0,0,0,0,0
    Does anyone know how to fix this issue?

    Last edited by ntzrmtthihu777; November 18th, 2012 at 09:39 PM. Reason: misplaced [CODE][/CODE] tags

  2. #2
    Join Date
    Nov 2007
    Location
    London, England
    Beans
    7,701

    Re: cat Japanese text file

    Do you know what character encoding the text files are using?

    It might help to see the actual file contents, just a snippet with this command would help:
    Code:
    cat oto.ini | hd | head
    P.S.
    That character is apparently HIRAGANA LETTER A, Unicode code point U+3042, which in UTF-8 is encoded as the three bytes 0xE3 0x81 0x82.
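
As a rough illustration of where those three bytes come from, this Python 3 sketch packs the code point U+3042 into the UTF-8 three-byte layout by hand and compares the result with Python's own encoder. The function name `utf8_three_byte` is made up for this example, and it only handles code points in the range U+0800 to U+FFFF (it does not check for surrogates).

```python
def utf8_three_byte(code_point: int) -> bytes:
    """Encode a code point in the range U+0800..U+FFFF as three UTF-8 bytes."""
    if not 0x0800 <= code_point <= 0xFFFF:
        raise ValueError("only the three-byte range is handled in this sketch")
    return bytes([
        0xE0 | (code_point >> 12),          # 1110xxxx: top 4 bits
        0x80 | ((code_point >> 6) & 0x3F),  # 10xxxxxx: middle 6 bits
        0x80 | (code_point & 0x3F),         # 10xxxxxx: low 6 bits
    ])


encoded = utf8_three_byte(0x3042)           # HIRAGANA LETTER A
print(encoded.hex(" "))                     # e3 81 82
print(encoded == "\u3042".encode("utf-8"))  # True
```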
    Last edited by The Cog; November 18th, 2012 at 09:52 PM.

  3. #3
    Join Date
    Jul 2007
    Location
    Poland
    Beans
    4,499
    Distro
    Ubuntu 14.04 Trusty Tahr

    Re: cat Japanese text file

    The question is: is the .ini file UTF-8 encoded?

    Code:
    $ echo "い"  | od -x
    0000000 81e3 0a84
    0000004
    $ echo "い" > jp.txt
    $ cat jp.txt
    い
    $ cat jp.txt | od -x
    0000000 81e3 0a84
    0000004
    if your question is answered, mark the thread as [SOLVED]. Thx.
    To post code or command output, use [code] tags.
    Check your bash script here // BashFAQ // BashPitfalls

  4. #4
    Join Date
    May 2012
    Location
    ザ・ワ&
    Beans
    152
    Distro
    Xubuntu 12.04 Precise Pangolin

    Re: cat Japanese text file

    Quote Originally Posted by The Cog
    Do you know what character encoding the text files are using?

    It might help to see the actual file contents, just a snippet with this command would help:
    cat oto.ini | hd | head
    First off, thank you for this interesting new trick you taught me! Hehe, it will be very useful for a project I had on the back burner.

    Second, the encoding is Japanese (SHIFT_JIS). Can I set it to this with LANG= or something similar?


    Code:
    cat oto.ini | hd | head
    00000000  82 a0 2e 77 61 76 3d 61  2c 30 2c 30 2c 30 2c 30  |...wav=a,0,0,0,0|
    00000010  2c 30 0d 0a 82 a2 2e 77  61 76 3d 69 2c 30 2c 30  |,0.....wav=i,0,0|
    00000020  2c 30 2c 30 2c 30 0d 0a  82 a6 2e 77 61 76 3d 65  |,0,0,0.....wav=e|
    00000030  2c 30 2c 30 2c 30 2c 30  2c 30 0d 0a 82 e4 2e 77  |,0,0,0,0,0.....w|
    00000040  61 76 3d 79 75 2c 30 2c  30 2c 30 2c 30 2c 30 0d  |av=yu,0,0,0,0,0.|
    00000050  0a                                                |.|
    00000051
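
For anyone following along, here is a small Python 3 sketch that takes the first two lines of the dump above and decodes them both ways. It suggests why the terminal shows two replacement characters per kana: `cat` just copies bytes, and 82 a0 is not valid UTF-8, so a UTF-8 terminal cannot render it regardless of LANG. The variable names are only for illustration.

```python
# First two lines of oto.ini, copied from the hex dump (CRLF line endings).
raw = bytes.fromhex(
    "82 a0 2e 77 61 76 3d 61 2c 30 2c 30 2c 30 2c 30 2c 30 0d 0a "
    "82 a2 2e 77 61 76 3d 69 2c 30 2c 30 2c 30 2c 30 2c 30 0d 0a"
)

# Interpreted as Shift_JIS, the bytes give the expected kana.
as_sjis = raw.decode("shift_jis")
print(as_sjis, end="")

# Interpreted as UTF-8, each of the two kana bytes is invalid on its own,
# so each one becomes U+FFFD (the replacement character).
as_utf8 = raw.decode("utf-8", errors="replace")
print(as_utf8, end="")
```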
    Last edited by ntzrmtthihu777; November 18th, 2012 at 10:12 PM.

  5. #5
    Join Date
    May 2012
    Location
    ザ・ワ&
    Beans
    152
    Distro
    Xubuntu 12.04 Precise Pangolin

    Re: cat Japanese text file

    Quote Originally Posted by The Cog
    P.S.
    That character is apparently HIRAGANA LETTER A, unicode value 0x3042 which in UTF8 encoding would be three bytes 0xE3 0x81 0x82.
    Yes, that is a hiragana 'a'. How did you tell? By searching for あ, or did you figure it out from the apparently useless �� characters? And can I still use this info in a shell script? What I am actually trying to do is create a "Voicebank Aliaser" for Ubuntu/Linux UTAU users.
    Basically, oto.ini contains info on all the wav files for a certain voicebank, and also any "aliases" they may have. In my example, あ.wav is aliased to a.wav.
    What I eventually want is a simple script that will look at
    a.wav=,#,#,#,#,# and recognize that a = あ, and alias it to that, like
    a.wav=あ,#,#,#,#,# and also the reverse:
    あ.wav=,#,#,#,#,# (あ = a)
    あ.wav=a,#,#,#,#,#
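
A minimal Python 3 sketch of the aliasing step described above might look like this. It assumes each entry has the form name.wav=alias,... as in the examples in this thread. The four-entry romaji table and the function name `add_alias` are hypothetical; a real tool would need the full syllabary and would have to decode the file from Shift_JIS first.

```python
# Hypothetical lookup table: only the four kana that appear in this thread.
ROMAJI_TO_KANA = {"a": "あ", "i": "い", "e": "え", "yu": "ゆ"}
KANA_TO_ROMAJI = {kana: romaji for romaji, kana in ROMAJI_TO_KANA.items()}


def add_alias(line: str) -> str:
    """Fill in an empty alias field on one 'name.wav=alias,...' line."""
    name, sep, rest = line.partition("=")
    if not sep:
        return line  # not an entry line, leave it untouched
    alias, comma, params = rest.partition(",")
    if alias:
        return line  # an alias is already set, do not overwrite it
    stem = name[:-4] if name.lower().endswith(".wav") else name
    # Try romaji -> kana first, then kana -> romaji; unknown stems stay empty.
    new_alias = ROMAJI_TO_KANA.get(stem) or KANA_TO_ROMAJI.get(stem) or ""
    return f"{name}={new_alias}{comma}{params}"


print(add_alias("a.wav=,1,2,3,4,5"))
print(add_alias("あ.wav=,1,2,3,4,5"))
```

Lines that already carry an alias, or whose name is not in the table, come back unchanged.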

  6. #6
    Join Date
    Jul 2007
    Location
    Poland
    Beans
    4,499
    Distro
    Ubuntu 14.04 Trusty Tahr

    Re: cat Japanese text file

    You can use iconv -f SHIFT_JIS -t UTF-8 to convert the file.
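
If the conversion has to happen inside a script, the same re-encoding can be sketched in Python 3 as below. This is not what iconv does internally, just an equivalent for this case. The function name `sjis_to_utf8` is made up here, and `shift_jis` is an assumption about the exact codec: for files written on Windows, Python's `cp932` codec may be the closer match.

```python
def sjis_to_utf8(data: bytes) -> bytes:
    """Re-encode Shift_JIS bytes as UTF-8; the input bytes are not modified."""
    return data.decode("shift_jis").encode("utf-8")


# Last entry of oto.ini as it appears in the hex dump earlier in the thread.
sample = b"\x82\xe4.wav=yu,0,0,0,0,0\r\n"
converted = sjis_to_utf8(sample)
print(converted.decode("utf-8"), end="")
```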

  7. #7
    Join Date
    May 2012
    Location
    ザ・ワ&
    Beans
    152
    Distro
    Xubuntu 12.04 Precise Pangolin

    Re: cat Japanese text file

    Quote Originally Posted by Vaphell
    you can use iconv -f SHIFT_JIS -t UTF-8 to convert the file.
    Very interesting! Does this actually change the file itself, or does it just allow its contents to be displayed properly in the terminal? I ask because the program that uses the oto.ini file may require the SHIFT_JIS encoding.

  8. #8
    Join Date
    Nov 2007
    Location
    London, England
    Beans
    7,701

    Re: cat Japanese text file

    Quote Originally Posted by ntzrmtthihu777
    Yes that is a hiragana 'a', how did you tell?
    I did this command:
    Code:
    echo あ | hd
    (copy/paste the character from the browser to the command prompt). That tells me the character is encoded as e3 81 82.
    Equally, I could start Python and paste the quoted character there (the lines I typed start with $ or >>>):
    Code:
    $ python
    Python 2.7.3 (default, Sep 26 2012, 21:53:58) 
    [GCC 4.7.2] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> 'あ'
    '\xe3\x81\x82'
    To get the unicode value from that, I tell python to decode the utf8 bytes (2 ways shown):
    Code:
    >>> 'あ'.decode("utf8")
    u'\u3042'
    >>> "\xe3\x81\x82".decode("utf8")
    u'\u3042'
    Then I used the Character Map program supplied with Ubuntu (it's under Accessories in the Xubuntu start menu) and just looked it up in there.
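
The session above is Python 2, where a quoted string is a byte string. A rough Python 3 equivalent is sketched below; here `str` is already Unicode, so the steps run in the other direction, and `unicodedata.name` can stand in for the Character Map lookup.

```python
import unicodedata

ch = "あ"

utf8_bytes = ch.encode("utf-8")       # the bytes a UTF-8 terminal sends
code_point = f"U+{ord(ch):04X}"       # the Unicode code point
char_name = unicodedata.name(ch)      # the official character name

print(utf8_bytes)    # b'\xe3\x81\x82'
print(code_point)    # U+3042
print(char_name)     # HIRAGANA LETTER A

# Going back from raw bytes to the character, as in the second decode above.
print(b"\xe3\x81\x82".decode("utf-8") == ch)   # True
```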

    I don't know how to convert easily, so I will bow to Vaphell on that subject.

  9. #9
    Join Date
    May 2012
    Location
    ザ・ワ&
    Beans
    152
    Distro
    Xubuntu 12.04 Precise Pangolin

    Re: cat Japanese text file

    Very nice trick, that. For the most part I don't need it for what I do, since I have a very nice Japanese input method set up (anthy and iBus), but it's still a very cool trick.

  10. #10
    Join Date
    Jul 2007
    Location
    Poland
    Beans
    4,499
    Distro
    Ubuntu 14.04 Trusty Tahr

    Re: cat Japanese text file

    Quote Originally Posted by ntzrmtthihu777
    Very interesting! Does this actually change the file itself or just allows its contents to be displayed properly in terminal? Because the program that uses the oto.ini file may require the SHIFT_JIS encoding.
    Not that I have ever converted anything with that tool, but a short test shows that it merely prints out the converted content and the original file stays intact.
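
If the program that reads oto.ini does need Shift_JIS, one option is to do the editing on decoded text and write the result back in the original encoding. The Python 3 sketch below assumes that approach. The helper name `edit_sjis_file` and the sample edit are hypothetical, and it works on a throwaway copy in a temporary directory rather than on a real voicebank.

```python
import tempfile
from pathlib import Path


def edit_sjis_file(path: Path, transform) -> None:
    """Decode a Shift_JIS file, apply transform to the text, write it back as Shift_JIS."""
    # Reading and writing raw bytes keeps the CRLF line endings untouched.
    text = path.read_bytes().decode("shift_jis")
    path.write_bytes(transform(text).encode("shift_jis"))


with tempfile.TemporaryDirectory() as tmp:
    oto = Path(tmp) / "oto.ini"
    # A one-line stand-in for the real file, with an empty alias field.
    oto.write_bytes("あ.wav=,0,0,0,0,0\r\n".encode("shift_jis"))

    edit_sjis_file(oto, lambda text: text.replace("=,", "=a,"))
    result = oto.read_bytes()

print(result.hex(" "))
```

The file on disk stays in Shift_JIS throughout; only the in-memory text is Unicode.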

    So what exactly do you want to achieve with these synonyms? Do you want to create symlinks pointing to the real .wav files, or what?
