
Thread: Bash script to convert romaji to hiragana

  1. #1
    Join Date
    Jan 2012
    Location
    /a/
    Beans
    753
    Distro
    Kubuntu 13.04 Raring Ringtail

    Bash script to convert romaji to hiragana

    I started learning Japanese not too long ago, and I'm tired of Anki, so I'm making my own simple flash card program. I want the option to read the Japanese in romaji (e.g. "mizu" instead of "water"), or have it automatically converted to hiragana (e.g. "みず" instead of "mizu"). Is there a way for me to use bash to convert the romaji to hiragana (or possibly katakana)?

    Basically, how hiragana works is that each character represents a syllable, such as "da" (だ), with a single vowel, usually preceded by a single consonant. So any word in romaji must be broken down at the consonants: "anata" must be separated into "a na ta", with each break at the start of a consonant, and then turned into あなた. However, there are a few problems that make this more difficult. Hiragana like "shi" (し) have two consonant letters in romaji but only one consonant sound, and the bash script must be able to recognize that. Even worse, in some situations the spelling depends on context: "wa" is normally わ, but it is spelled は ("ha") when it is used as a particle. So the "wa" in the sentence "anata wa shin setsu desu" (あなたはしんせつです) is spelled with は instead of わ, as if it were "anata ha shin setsu desu".
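
    A minimal sketch of the splitting rule just described, in Python rather than bash to keep it short. It tries the longest romaji unit first, so "shi" is taken as one unit before "s" is considered on its own. The KANA table is a small hypothetical subset, not a complete chart, and the particle problem is deliberately ignored here:

```python
# Small, incomplete kana table (an assumption for illustration only).
KANA = {
    "a": "あ", "i": "い", "u": "う", "e": "え", "o": "お",
    "ka": "か", "ki": "き", "ku": "く", "ke": "け", "ko": "こ",
    "sa": "さ", "shi": "し", "su": "す", "se": "せ", "so": "そ",
    "ta": "た", "chi": "ち", "tsu": "つ", "te": "て", "to": "と",
    "na": "な", "ni": "に", "nu": "ぬ", "ne": "ね", "no": "の",
    "ha": "は", "hi": "ひ", "fu": "ふ", "he": "へ", "ho": "ほ",
    "ma": "ま", "mi": "み", "mu": "む", "me": "め", "mo": "も",
    "ya": "や", "yu": "ゆ", "yo": "よ",
    "ra": "ら", "ri": "り", "ru": "る", "re": "れ", "ro": "ろ",
    "wa": "わ", "wo": "を", "n": "ん",
    "da": "だ", "de": "で", "do": "ど", "zu": "ず", "ba": "ば",
}


def split_syllables(word):
    """Greedily split a romaji word into the longest units found in KANA."""
    word = word.lower()
    units = []
    pos = 0
    while pos < len(word):
        for size in (3, 2, 1):
            chunk = word[pos:pos + size]
            if chunk in KANA:
                units.append(chunk)
                pos += len(chunk)
                break
        else:
            # Unknown letter: keep it as-is so the gap in the table is visible.
            units.append(word[pos])
            pos += 1
    return units


def to_hiragana(word):
    """Literal conversion: every unit is mapped on its own, with no context."""
    return "".join(KANA.get(unit, unit) for unit in split_syllables(word))


print(split_syllables("anata"))  # ['a', 'na', 'ta']
print(to_hiragana("mizu"))       # みず
```

    Because every unit is mapped in isolation, this writes "wa" as わ everywhere, which is exactly the limitation raised above.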

    Is there any (relatively easy) way to use bash to do this, or is it just too complicated for a simple script and I should just suck it up and create a third array for hiragana?

    Code:
    #!/bin/bash
    
    # ローマ字array
    nihongo[1]='Itadakimasu'
    nihongo[2]='Oishii'
    nihongo[3]='Mama'
    nihongo[4]='Natto'
    nihongo[5]='Kanpai'
    nihongo[6]='Hajimemasite'
    nihongo[7]='Hajimeru'
    # (etc...)
    
    # English array
    english[1]='Always used before eating, roughly means "I am about to receive".'
    english[2]='Delicious'
    english[3]='I don'\''t like this food (it'\''s so-so).'
    english[4]='Fermented soy beans with a very strong smell.'
    english[5]='Equivalent to the English "Cheers!" said before a drink.'
    english[6]='Roughly "Nice to meet you." (lit. "for the first time").'
    english[7]='To begin'
    # (etc...)
    
    # There must be the same number of Japanese and English translations
    if [ ${#nihongo[*]} -ne ${#english[*]} ]; then
        echo 'Error, length of array variables "nihongo" and "english" do not match.'
        exit 1
    fi
    
    correct=0
    incorrect=0
    
    # Display stats and exit
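    function finished() {
        # Only show stats if at least one card was answered
        if (( correct + incorrect > 0 )); then
            clear
            total=$(( correct + incorrect ))
            # Multiply before dividing so bc's scale=2 does not truncate the ratio first
            percent=$(echo "scale=2; $correct * 100 / $total" | bc)
            printf 'Correct: %s\nIncorrect: %s\nTotal: %s\nPercent: %s%%\n' "$correct" "$incorrect" "$total" "$percent"
        else
            echo
        fi
        tput cnorm  # restore the cursor (cnorm takes no argument)
        exit 0
    }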
    trap finished SIGHUP SIGINT SIGTERM
    
    while true; do
        clear
        last_rand=$rand
        # Make sure the same card isn't done twice in a row
        until [ "$rand" != "$last_rand" ]; do
            rand=$(shuf -i 1-${#nihongo[*]} -n 1)
        done
        echo "Nihongo: ${nihongo[$rand]}"
        read -rs # -s turns off echo, so pressing enter does not print a new line and no cursor workaround is needed
        echo "English: ${english[$rand]}"
        unset yn
        until [[ $yn == y || $yn == n ]]; do
            echo
            read -r -p "Did you get it right? [y/n] " yn
            if [ "$yn" = y ]; then
                ((correct++))
                echo 'Good job!'
                sleep 1
            elif [ "$yn" = n ]; then
                ((incorrect++))
                echo 'Oh well, maybe next time.'
                sleep 1
            fi
        done
    done
    Last edited by Stonecold1995; March 21st, 2013 at 07:49 AM.
    The whole thing is so patently infantile, so foreign to reality, that to anyone with a friendly attitude to humanity it is painful to think that the great majority of mortals will never be able to rise above this view of life.
    ~Sigmund Freud

  2. #2
    Join Date
    Feb 2007
    Location
    Romania
    Beans
    Hidden!
    Distro
    Ubuntu Development Release

    Re: Bash script to convert romaji to hiragana

    Thread moved to Programming Talk.

    There is a perl module which can do this: http://search.cpan.org/~dankogai/Lin...gua/JA/Kana.pm

  3. #3
    Join Date
    Jul 2007
    Location
    Poland
    Beans
    4,160
    Distro
    Ubuntu 10.04 Lucid Lynx

    Re: Bash script to convert romaji to hiragana

    if python is more to your liking
    romkan

    https://pypi.python.org/pypi/romkan

    Code:
    sudo pip install romkan
    there is also a ruby version of romkan, directly in the repos

    you can easily inline it in your bash script
    Code:
    $ txt="anata wa shin setsu desu"
    $ python -c "import sys; import romkan; print( romkan.to_hiragana(sys.argv[1]))" "$txt"
    あなた わ しん せつ です
    or create a small py script that translates the string

    Code:
    #!/usr/bin/env python
    
    import sys
    import romkan
    print( romkan.to_hiragana( sys.argv[1] ).encode('utf8') ) # had to use encode because it complained
    btw, you should put your dictionary in a separate file in some easily parseable format where data is separated by some obscure char, eg
    Code:
    Itadakimasu|Always used before eating, roughly means "I am about to receive"|commentary
    Oishii|Delicious|or something else
    Code:
    #!/bin/bash
    
    while IFS="|" read -r jp en comment
    do
      nihongo+=( "$jp" )
      english+=( "$en" )
      ...
    done < dict.txt
    
    for(( i=0; i<${#nihongo[@]}; i++ ))
    do
      printf "%s (%s): %s\n" "${nihongo[i]}" "$( python romaji2hiragana "${nihongo[i]}" )" "${english[i]}"
    done
    Code:
    $ ./jp_dict.sh 
    Itadakimasu (いただきます): Always used before eating, roughly means "I am about to receive"
    Oishii (おいしい): Delicious
    as a bonus you can have multiple easily customizable dicts that can be read by the parametrized script with no effort.
    Last edited by Vaphell; March 22nd, 2013 at 07:52 AM.
    if your question is answered, mark the thread as [SOLVED]. Thx.
    To post code or command output, use [code] tags.
    Check your bash script here // BashFAQ // BashPitfalls

  4. #4
    Join Date
    Jan 2012
    Location
    /a/
    Beans
    753
    Distro
    Kubuntu 13.04 Raring Ringtail

    Re: Bash script to convert romaji to hiragana

    Quote Originally Posted by Vaphell View Post
    if python is more to your liking
    romkan

    https://pypi.python.org/pypi/romkan
    Yes thank you.

    Quote Originally Posted by Vaphell View Post
    btw, you should put your dictionary in a separate file in some easily parseable format where data is separated by some obscure char, eg
    Code:
    Itadakimasu|Always used before eating, roughly means "I am about to receive"|commentary
    Oishii|Delicious|or something else
    Code:
    #!/bin/bash
    
    while IFS="|" read -r jp en comment
    do
      nihongo+=( "$jp" )
      english+=( "$en" )
      ...
    done < dict.txt
    
    for(( i=0; i<${#nihongo[@]}; i++ ))
    do
      printf "%s (%s): %s\n" "${nihongo[i]}" "$( python romaji2hiragana "${nihongo[i]}" )" "${english[i]}"
    done
    Yeah, I know hard-coding is bad practice, and I'm planning to use a separate file as you say. The problem is that I tend to spend more time on the script than I do actually using it to learn, so if the vocabulary is in the same file I'll see it more often, which helps me remember it. Because putting it in a separate file is so easy, I'm leaving that until the very end, when the script has all the features I want.

  5. #5
    Join Date
    Jan 2012
    Location
    /a/
    Beans
    753
    Distro
    Kubuntu 13.04 Raring Ringtail

    Re: Bash script to convert romaji to hiragana

    Hm, I just tried romkan and it's not working correctly. As I said in the OP, it can't just blindly translate, it has to use at least a little contextual information to tell what to do.

    Code:
    alex@kubuntu:~$ txt='Sore wa dosa shimasen'
    alex@kubuntu:~$ python -c "import sys; import romkan; print(romkan.to_hiragana(sys.argv[1]))" "$txt"
    それ わ どさ しません
    The output should be this:
    Code:
    それはどさしません
    Hiragana has no spaces, and the "wa" (わ) should have been written as "ha" (は) instead, because it's being used alone as a particle rather than part of a different word.

    It seems romkan only does a very literal (and very erroneous) translation. That's something I could have easily done in bash alone. My problem is that hiragana follows a few rules that aren't as easy to handle in a bash script.

  6. #6
    Join Date
    Jul 2007
    Location
    Poland
    Beans
    4,160
    Distro
    Ubuntu 10.04 Lucid Lynx

    Re: Bash script to convert romaji to hiragana

    are there many such exceptions?
    you could pass the string through some regex that would replace standalone 'wa' to 'ha' and then do the dumb translation eg

    Code:
    $ txt='wa wawa' #some junk
    $ temp=$( echo "$txt" | sed -r 's/\<wa\>/ha/g' )
    $ python -c "import sys; import romkan; print( romkan.to_hiragana(sys.argv[1]).encode('utf8'))" "$temp"
    は わわ
    
    $ txt='Sore wa dosa shimasen'
    $ temp=$( echo "$txt" | sed -r 's/\<wa\>/ha/g' )
    $ python -c "import sys; import romkan; print( romkan.to_hiragana(sys.argv[1]).encode('utf8'))" "$temp"
    それ は どさ しません
    you could compile a file with sed expressions and feed it to sed via -f switch
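
    The same rules-file idea can also live on the Python side, right next to the romkan call. This is only a sketch: the RULES list and the preprocess name are made up for illustration, and the single rule is the standalone "wa" replacement from the sed command above. Python's \b plays the role of sed's \< and \> here:

```python
import re

# Ordered (pattern, replacement) pairs applied to the romaji before conversion.
# Only the standalone "wa" rule comes from this thread; the list is a placeholder
# that further rules could be appended to.
RULES = [
    (re.compile(r"\bwa\b"), "ha"),
]


def preprocess(romaji, rules=RULES):
    """Apply each substitution rule in order and return the rewritten romaji."""
    text = romaji
    for pattern, replacement in rules:
        text = pattern.sub(replacement, text)
    return text


print(preprocess("Sore wa dosa shimasen"))  # Sore ha dosa shimasen
print(preprocess("wa wawa"))                # ha wawa
```

    The result would then be handed to the converter in place of the original string.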

    if preprocessing with a bunch of easy rules is not viable you could always have an additional array only for exceptions and use it when it's not empty
    Code:
    $ nihongo='wa wawa'
    $ exception=''
    $ [ -z "$exception" ] && python -c "import sys; import romkan; print( romkan.to_hiragana(sys.argv[1]).encode('utf8'))" "$nihongo" || echo "$exception"
    わ わわ
    $ nihongo='wa wawa'
    $ exception='は わわ'
    $ [ -z "$exception" ] && python -c "import sys; import romkan; print( romkan.to_hiragana(sys.argv[1]).encode('utf8'))" "$nihongo" || echo "$exception"
    は わわ
    the advantage of data separated from code shows up in this case. Now, to extend the script, you have to not only add the code but also add additional arrays, with much more unwieldy syntax than a simple, well-formatted external file where relevant pieces of data sit next to each other. No off-by-one errors and other crap plaguing programmers.
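
    A rough sketch of how the exception could sit in the data file itself. It assumes a hypothetical layout of romaji|english|override, where the third column is left empty unless the automatic conversion is known to be wrong. The function names are illustrative, and str.lower only stands in for a real converter such as romkan.to_hiragana:

```python
def parse_cards(lines):
    """Parse 'romaji|english|override' lines into a list of dicts."""
    cards = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # Pad with empty strings so the trailing columns may be omitted.
        romaji, english, override = (line.split("|", 2) + ["", ""])[:3]
        cards.append({"romaji": romaji, "english": english, "override": override})
    return cards


def display_kana(card, convert):
    """Prefer the hand-written override; otherwise fall back to the converter."""
    return card["override"] or convert(card["romaji"])


sample = [
    "Oishii|Delicious|",
    "Konbanwa|Good evening|こんばんは",
]
for card in parse_cards(sample):
    # str.lower is a stand-in converter so the sketch runs without romkan.
    print(card["romaji"], "->", display_kana(card, str.lower))
```

    Each exception then sits on the same line as the word it belongs to, so there is no parallel array to keep in sync.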
    Last edited by Vaphell; March 22nd, 2013 at 10:00 PM.

  7. #7
    Join Date
    Jan 2012
    Location
    /a/
    Beans
    753
    Distro
    Kubuntu 13.04 Raring Ringtail

    Re: Bash script to convert romaji to hiragana

    Quote Originally Posted by Vaphell View Post
    are there many such exceptions?
    you could pass the string through some regex that would replace standalone 'wa' to 'ha' and then do the dumb translation eg

    Code:
    $ txt='wa wawa' #some junk
    $ temp=$( echo "$txt" | sed -r 's/\<wa\>/ha/g' )
    $ python -c "import sys; import romkan; print( romkan.to_hiragana(sys.argv[1]).encode('utf8'))" "$temp"
    は わわ
    
    $ txt='Sore wa dosa shimasen'
    $ temp=$( echo "$txt" | sed -r 's/\<wa\>/ha/g' )
    $ python -c "import sys; import romkan; print( romkan.to_hiragana(sys.argv[1]).encode('utf8'))" "$temp"
    それ は どさ しません
    I don't think there are too many exceptions, but more context is needed than "is it alone or part of a word". While that's a good heuristic, I think there are other times when "wa" is alone but is NOT spelled as "ha" because it is not functioning as a particle in the same way. So "wa" is only written as は if it's being used as a topic marker, but if it's used at the end of a sentence to show an emotional connection, it's written as わ.
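
    As a rough illustration of that distinction, the "standalone" heuristic can be narrowed so that a "wa" which is the last word of the sentence is left alone. This is a sketch under a strong assumption (one sentence per call, words separated by spaces, no trailing punctuation), and the function name is made up. It is not a real grammatical analysis:

```python
def mark_topic_wa(sentence):
    """Rewrite standalone 'wa' as 'ha' unless it is the last word of the sentence."""
    words = sentence.split()
    result = []
    for index, word in enumerate(words):
        is_last = index == len(words) - 1
        if word.lower() == "wa" and not is_last:
            # Mid-sentence standalone "wa": assume it is the topic marker.
            result.append("ha")
        else:
            # Sentence-final "wa", or any other word: leave untouched.
            result.append(word)
    return " ".join(result)


print(mark_topic_wa("anata wa shinsetsu desu"))  # anata ha shinsetsu desu
print(mark_topic_wa("iku wa"))                   # iku wa
```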

    However, lots of romaji will not add spaces between it and the word it relates to. For example, "konbanwa" is spelled こんばんは (ko n ba n ha), because the "wa" at the end is actually a topic marker, but it's not often spelled "konban wa" in romaji, so the script wouldn't change the "wa" to "ha" as it should.
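
    For glued forms like this, one option is a small whole-word lookup that is checked before any generic conversion. The sketch below assumes a hand-maintained table (only two greetings are listed, and the names are hypothetical) and takes the generic converter as a parameter so it does not depend on any particular library:

```python
# Whole words whose final "wa" is written は. Hand-maintained and incomplete.
WHOLE_WORD_KANA = {
    "konbanwa": "こんばんは",
    "konnichiwa": "こんにちは",
}


def convert_word(word, fallback):
    """Return the known spelling if the word is listed, else use the fallback converter."""
    known = WHOLE_WORD_KANA.get(word.lower())
    if known is not None:
        return known
    return fallback(word)


# str.upper is only a stand-in for a real converter such as romkan.to_hiragana.
print(convert_word("Konbanwa", str.upper))  # こんばんは
print(convert_word("mizu", str.upper))      # MIZU
```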

    So I don't think simple rules would work. Something like romkan would be good if only it actually used some kind of Japanese dictionary.

  8. #8
    Join Date
    Jul 2007
    Location
    Poland
    Beans
    4,160
    Distro
    Ubuntu 10.04 Lucid Lynx

    Re: Bash script to convert romaji to hiragana

    such ambiguities are nasty to program around; it's hard to do anything smart when context matters that much.
    Having an accurate list of exceptions would be useful to estimate viability of possible solutions.

    are there cases where wa becomes ha even at the end of the sentence?
    are there many such cases where wa is glued to the end of the word and becomes ha either way?

  9. #9

    Re: Bash script to convert romaji to hiragana

    You keep saying "I think ... I don't think ... I think". That's too vague. You must define the problem before you can expect to find a solution. I don't know any Japanese, so I couldn't say how hard this problem is going to be. Maybe you could find out whether it's possible to exhaustively list all the exceptions, or determine from syntax alone whether they apply -- those are questions for a Japanese language expert, not for a programmer.

    Can you list the rules and the exceptions completely and concisely in English? That would be a good place to start. If not, well, you're out of luck anyway.

    Have you tried the Perl module? Does it do the same thing as the Python one?

  10. #10
    Join Date
    Jan 2012
    Location
    /a/
    Beans
    753
    Distro
    Kubuntu 13.04 Raring Ringtail

    Re: Bash script to convert romaji to hiragana

    Quote Originally Posted by Vaphell View Post
    are there cases where wa becomes ha even at the end of the sentence?
    Yes, with words like "konbanwa".

    Quote Originally Posted by Vaphell View Post
    are there many such cases where wa is glued to the end of the word and becomes ha either way?
    There are multiple types of romaji, some glue many things together, some don't (e.g. "konbanwa" is almost always one word in romaji, but "watashi wa" is not as commonly spelled "watashiwa").


    Quote Originally Posted by trent.josephsen View Post
    Can you list the rules and the exceptions completely and concisely in English? That would be a good place to start. If not, well, you're out of luck anyway.
    I couldn't, but there may be a dictionary somewhere that could. There must be, because Google Translate is able to do it with good accuracy (although for most other things Google Translate is terrible at languages, especially Japanese).

    Quote Originally Posted by trent.josephsen View Post
    Have you tried the Perl module? Does it do the same thing as the Python one?
    Ah, I forgot. I'll try that. If that doesn't work, are there any other such ways to do this?

    I'll rephrase my original question a little... I don't care if it's in pure bash, for all I care it could use Google Translate (although I heard they shut down their API).

    I'm not able to list too many exceptions; I am still fairly new to Japanese after all. So if I can't provide much more information, I guess I'll just look around more forums. Maybe I'll find something I could run under Wine. Perhaps there's a dictionary somewhere I can find.
