Bash script to convert romaji to hiragana

**Stonecold1995** · March 21st, 2013

I started learning Japanese not too long ago, and I'm tired of Anki, so I'm making my own simple flash card program. I want the option to read the Japanese in romaji (e.g. "mizu" instead of "water"), or have it automatically converted to hiragana (e.g. "みず" instead of "mizu"). Is there a way to use bash to convert the romaji to hiragana (or possibly katakana)?

Basically, hiragana works like this: each character represents a syllable, such as "da" (だ), made up of a single vowel usually preceded by a single consonant. So any word in romaji has to be broken down at the consonants: "anata" must be split into "a na ta", with each break falling at the start of a consonant, and then turned into あなた. However, a few problems make this more difficult. Hiragana like "shi" (し) are written with two consonant letters in romaji but have only one consonant sound, and the bash script must be able to recognize that. Even worse, in some situations the spelling depends on context: "wa" (わ) is written as "ha" (は) when it is used as a particle. For example, the "wa" in the sentence "anata wa shin setsu desu" (あなたはしんせつです) is spelled with は instead of わ, as if it were "anata ha shin setsu desu".
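As a rough illustration of the splitting described above, here is a minimal sketch in Python rather than bash. It assumes a lookup table and a greedy longest-match scan, so that "shi" is tried before shorter pieces. The names `KANA` and `to_hiragana` are made up for this example, the table is deliberately partial, and nothing here handles the particle problem.

```python
# Minimal sketch: greedy longest-match romaji -> hiragana conversion.
# KANA is a partial, illustrative table (Hepburn-style spellings), not a full chart.
KANA = {
    "a": "あ", "i": "い", "u": "う", "e": "え", "o": "お",
    "ka": "か", "ki": "き", "ku": "く", "ke": "け", "ko": "こ",
    "sa": "さ", "shi": "し", "su": "す", "se": "せ", "so": "そ",
    "ta": "た", "chi": "ち", "tsu": "つ", "te": "て", "to": "と",
    "na": "な", "ni": "に", "nu": "ぬ", "ne": "ね", "no": "の",
    "ha": "は", "hi": "ひ", "fu": "ふ", "he": "へ", "ho": "ほ",
    "ma": "ま", "mi": "み", "mu": "む", "me": "め", "mo": "も",
    "ya": "や", "yu": "ゆ", "yo": "よ",
    "ra": "ら", "ri": "り", "ru": "る", "re": "れ", "ro": "ろ",
    "wa": "わ", "wo": "を", "n": "ん",
    "da": "だ", "de": "で", "do": "ど",
    "zu": "ず", "ji": "じ", "ba": "ば",
}

def to_hiragana(romaji):
    """Convert romaji by always taking the longest syllable found in KANA."""
    text = romaji.lower()
    out = []
    i = 0
    while i < len(text):
        for size in (3, 2, 1):          # try "shi" before "s", "na" before "n"
            chunk = text[i:i + size]
            if chunk in KANA:
                out.append(KANA[chunk])
                i += size
                break
        else:
            out.append(text[i])         # unknown character: pass through unchanged
            i += 1
    return "".join(out)

print(to_hiragana("anata"))      # あなた
print(to_hiragana("shinsetsu"))  # しんせつ
```

A purely mechanical scan like this always turns "wa" into わ, so it does nothing for the particle case.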

Is there any (relatively easy) way to use bash to do this, or is it just too complicated for a simple script and I should just suck it up and create a third array for hiragana?

Code:

#!/bin/bash

# ローマ字array
nihongo[1]='Itadakimasu'
nihongo[2]='Oishii'
nihongo[3]='Mama'
nihongo[4]='Natto'
nihongo[5]='Kanpai'
nihongo[6]='Hajimemashite'
nihongo[7]='Hajimeru'
# (etc...)

# English array
english[1]='Always used before eating, roughly means "I am about to receive".'
english[2]='Delicious'
english[3]='I don'\''t like this food (it'\''s so-so).'
english[4]='Fermented soy beans with a very strong smell.'
english[5]='Equivalent to the English "Cheers!" said before a drink.'
english[6]='Roughly "Nice to meet you." (lit. "for the first time").'
english[7]='To begin'
# (etc...)

# There must be the same number of Japanese and English translations
if [ ${#nihongo[*]} -ne ${#english[*]} ]; then
    echo 'Error, length of array variables "nihongo" and "english" do not match.'
    exit 1
fi

correct=0
incorrect=0

# Display stats and exit
function finished() {
    if (( correct + incorrect > 0 )); then
        clear
        total=$((correct + incorrect))
        # Multiply before dividing so bc's scale=2 doesn't truncate the ratio first
        percent=$(echo "scale=2; $correct*100/$total" | bc)
        echo -e "Correct: $correct\nIncorrect: $incorrect\nTotal: $total\nPercent: $percent%"
    else
        echo
    fi
    tput cnorm # restore the normal cursor (cnorm takes no argument)
    exit 0
}
trap finished SIGHUP SIGINT SIGTERM

while true; do
    clear
    last_rand=$rand
    # Make sure the same card isn't done twice in a row
    until [ "$rand" != "$last_rand" ]; do
        rand=$(shuf -i 1-${#nihongo[*]} -n 1)
    done
    echo "Nihongo: ${nihongo[$rand]}"
    read
    tput cup 1 0 # Dirty workaround: move back to row 1, column 0, because pressing enter echoes a new line
    echo "English: ${english[$rand]}"
    unset yn
    until [ "$yn" = y -o "$yn" = n ]; do
        echo
        read -p "Did you get it right? [y/n] " yn
        if [ "$yn" = y ]; then
            ((correct++))
            echo 'Good job!'
            sleep 1
        elif [ "$yn" = n ]; then
            ((incorrect++))
            echo 'Oh well, maybe next time.'
            sleep 1
        fi
    done
done

**sisco311** · March 21st, 2013

Thread moved to Programming Talk.

There is a perl module which can do this: http://search.cpan.org/~dankogai/Lin...gua/JA/Kana.pm

**Vaphell** · March 22nd, 2013

if python is more to your liking
romkan

https://pypi.python.org/pypi/romkan

Code:

sudo pip install romkan

there is also ruby version of romkan, directly in the repos

you can easily inline it in your bash script

Code:

$ txt="anata wa shin setsu desu"
$ python -c "import sys; import romkan; print( romkan.to_hiragana(sys.argv[1]))" "$txt"
あなた わ しん せつ です

or create a small py script that translates the string

Code:

#!/usr/bin/env python

import sys
import romkan
print( romkan.to_hiragana( sys.argv[1] ).encode('utf8') ) # had to use encode because it complained

btw, you should put your dictionary in a separate file in some easily parseable format where data is separated by some obscure char, eg

Code:

Itadakimasu|Always used before eating, roughly means "I am about to receive"|commentary
Oishii|Delicious|or something else

Code:

#!/bin/bash

while IFS="|" read -r jp en comment
do
  nihongo+=( "$jp" )
  english+=( "$en" )
  # ...
done < dict.txt

for(( i=0; i<${#nihongo[@]}; i++ ))
do
  printf "%s (%s): %s\n" "${nihongo[i]}" "$( python romaji2hiragana "${nihongo[i]}" )" "${english[i]}"
done

Code:

$ ./jp_dict.sh 
Itadakimasu (いただきます): Always used before eating, roughly means "I am about to receive"
Oishii (おいしい): Delicious

as a bonus you can have multiple easily customizable dicts that can be read by the parametrized script with no effort.
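For comparison, the same pipe-separated format can be read in Python. This is only a sketch: the function name `load_cards` and the field names are hypothetical, and the sample lines are the two from the example above.

```python
import io

def load_cards(lines):
    """Parse 'romaji|english|comment' lines into a list of dicts."""
    cards = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue                         # skip blank lines
        # maxsplit=2 keeps any further "|" characters inside the comment field
        fields = line.split("|", 2)
        fields += [""] * (3 - len(fields))   # pad missing trailing fields
        romaji, english, comment = fields
        cards.append({"romaji": romaji, "english": english, "comment": comment})
    return cards

# io.StringIO stands in for an open file such as dict.txt
sample = io.StringIO(
    'Itadakimasu|Always used before eating, roughly means "I am about to receive"|commentary\n'
    "Oishii|Delicious|or something else\n"
)
for card in load_cards(sample):
    print(card["romaji"], "=", card["english"])
```

With a real file, `load_cards(open("dict.txt", encoding="utf-8"))` would take the place of the `sample` object.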

**Stonecold1995** · March 22nd, 2013

Originally Posted by Vaphell

if python is more to your liking
romkan

https://pypi.python.org/pypi/romkan

Yes thank you.

Originally Posted by Vaphell

btw, you should put your dictionary in a separate file in some easily parseable format where data is separated by some obscure char, eg

Code:

Itadakimasu|Always used before eating, roughly means "I am about to receive"|commentary
Oishii|Delicious|or something else

Code:

#!/bin/bash

while IFS="|" read -r jp en comment
do
  nihongo+=( "$jp" )
  english+=( "$en" )
  ...
done < dict.txt

for(( i=0; i<${#nihongo[@]}; i++ ))
do
  printf "%s (%s): %s\n" "${nihongo[i]}" "$( python romaji2hiragana "${nihongo[i]}" )" "${english[i]}"
done

Yeah, I know hard-coding is bad practice, and I'm planning to use a separate file as you say. The problem is that I tend to spend more time on the script than I do actually using it to learn, so if the vocabulary is in the same file it'll help me remember more easily because I'll see it more often. Because putting it in a separate file is so easy, I'm planning on leaving that until the very end, when the rest of the script has all the features I want it to have.

**Stonecold1995** · March 22nd, 2013

Hm, I just tried romkan and it's not working correctly. As I said in the OP, it can't just blindly translate; it has to use at least a little contextual information to tell what to do.

Code:

alex@kubuntu:~$ txt='Sore wa dosa shimasen'
alex@kubuntu:~$ python -c "import sys; import romkan; print(romkan.to_hiragana(sys.argv[1]))" "$txt"
それ わ どさ しません

The output should be this:

Code:

それはどさしません

Hiragana has no spaces, and the "wa" (わ) should have been written as "ha" (は) instead, because it's being used alone as a particle rather than part of a different word.

It seems romkan only does a very literal (and very erroneous) translation. That's something I could have easily done in bash alone. My problem is that hiragana follows a few rules that aren't as easy to handle in a bash script.

**Vaphell** · March 22nd, 2013

are there many such exceptions?
you could pass the string through some regex that would replace standalone 'wa' to 'ha' and then do the dumb translation eg

Code:

$ txt='wa wawa' #some junk
$ temp=$( echo "$txt" | sed -r 's/\<wa\>/ha/g' )
$ python -c "import sys; import romkan; print( romkan.to_hiragana(sys.argv[1]).encode('utf8'))" "$temp"
は わわ

$ txt='Sore wa dosa shimasen'
$ temp=$( echo "$txt" | sed -r 's/\<wa\>/ha/g' )
$ python -c "import sys; import romkan; print( romkan.to_hiragana(sys.argv[1]).encode('utf8'))" "$temp"
それ は どさ しません

you could compile a file with sed expressions and feed it to sed via -f switch
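The same idea of an ordered rule list can be sketched in Python with the `re` module, where `\b` plays the role of `\<` and `\>` in the sed expression. The names `RULES` and `preprocess` are hypothetical, and the single rule is just the one from the example above.

```python
import re

# Illustrative rules only: (pattern, replacement) pairs applied in order.
# \b marks a word boundary, so "wa" inside "wawa" is left alone.
RULES = [
    (re.compile(r"\bwa\b"), "ha"),   # standalone "wa" treated as the particle
]

def preprocess(romaji):
    """Lowercase the text and apply every rewrite rule before conversion."""
    text = romaji.lower()
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text

print(preprocess("Sore wa dosa shimasen"))  # sore ha dosa shimasen
print(preprocess("wa wawa"))                # ha wawa
```

The result would then be handed to the converter, for example `romkan.to_hiragana`.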

if preprocessing with a bunch of easy rules is not viable you could always have an additional array only for exceptions and use it when it's not empty

Code:

$ nihongo='wa wawa'
$ exception=''
$ [ -z "$exception" ] && python -c "import sys; import romkan; print( romkan.to_hiragana(sys.argv[1]).encode('utf8'))" "$nihongo" || echo "$exception"
わ わわ
$ nihongo='wa wawa'
$ exception='は わわ'
$ [ -z "$exception" ] && python -c "import sys; import romkan; print( romkan.to_hiragana(sys.argv[1]).encode('utf8'))" "$nihongo" || echo "$exception"
は わわ

the advantage of data separated from code shows up in this case. Now to extend the script you have to not only add the code but also add additional arrays, with much more unwieldy syntax than in the case of a simple well formatted external file where relevant pieces of data are next to each other. No off-by-one errors and other crap plaguing programmers.
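The per-entry exception idea could look roughly like this in Python, assuming each card is a dict with an optional hand-written `kana` field. The field names, `kana_for` and the stand-in converter are all hypothetical; a real script would pass something like `romkan.to_hiragana` instead.

```python
def kana_for(card, convert):
    """Return the card's hand-written kana if present, else convert its romaji."""
    return card.get("kana") or convert(card["romaji"])

# Stand-in converter for this demo only, so the example runs without romkan
def fake_convert(romaji):
    return "<" + romaji + ">"

cards = [
    {"romaji": "wa wawa", "kana": ""},           # no exception: use the converter
    {"romaji": "wa wawa", "kana": "は わわ"},    # exception supplied by hand
]
for card in cards:
    print(kana_for(card, fake_convert))
```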

**Stonecold1995** · March 23rd, 2013

Originally Posted by Vaphell

are there many such exceptions?
you could pass the string through some regex that would replace standalone 'wa' to 'ha' and then do the dumb translation eg

Code:

$ txt='wa wawa' #some junk
$ temp=$( echo "$txt" | sed -r 's/\<wa\>/ha/g' )
$ python -c "import sys; import romkan; print( romkan.to_hiragana(sys.argv[1]).encode('utf8'))" "$temp"
は わわ

$ txt='Sore wa dosa shimasen'
$ temp=$( echo "$txt" | sed -r 's/\<wa\>/ha/g' )
$ python -c "import sys; import romkan; print( romkan.to_hiragana(sys.argv[1]).encode('utf8'))" "$temp"
それ は どさ しません

I don't think there are too many exceptions, but more context is needed than "is it alone or part of a word". While that's a good heuristic, I think there are other times when "wa" is alone but is NOT spelled as "ha" because it is not functioning as a particle in the same way. So "wa" is only written as は if it's being used as a topic marker, but if it's used at the end of a sentence to show an emotional connection, it's written as わ.

However, lots of romaji will not add a space between the particle and the word it relates to. For example, "konbanwa" is spelled こんばんは (ko nn ba nn ha), because the "wa" at the end is actually a topic marker, but it's not often spelled "konban wa" in romaji, so the script wouldn't change the "wa" to "ha" as it should.

So I don't think simple rules would work. Something like romkan would be good if only it actually used some kind of Japanese dictionary.
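A word-level lookup that runs before the mechanical conversion might be sketched like this in Python. It is only an assumption about how such a dictionary could be wired in: `LEXICON` and `convert_sentence` are made-up names, the single entry comes from the example above, and the fallback converter is passed in by the caller.

```python
# Hypothetical hand-maintained word list, checked before any mechanical conversion.
LEXICON = {
    "konbanwa": "こんばんは",
}

def convert_sentence(romaji, fallback):
    """Convert word by word: known words come from LEXICON, the rest from fallback."""
    pieces = []
    for word in romaji.lower().split():
        if word in LEXICON:
            pieces.append(LEXICON[word])
        else:
            pieces.append(fallback(word))
    return "".join(pieces)              # hiragana is written without spaces

# str.upper is only a placeholder fallback so the example runs on its own
print(convert_sentence("Konbanwa", str.upper))       # こんばんは
print(convert_sentence("konbanwa mizu", str.upper))  # こんばんはMIZU
```

How well this works depends entirely on how complete the word list is.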

**Vaphell** · March 23rd, 2013

such ambiguities are nasty to program around, it's hard to do anything smart when context matters that much.
Having an accurate list of exceptions would be useful to estimate viability of possible solutions.

are there cases where wa becomes ha even at the end of the sentence?
are there many such cases where wa is glued to the end of the word and becomes ha either way?

**trent.josephsen** · March 23rd, 2013

You keep saying "I think ... I don't think ... I think". That's too vague. You must define the problem before you can expect to find a solution. I don't know any Japanese, so I couldn't say how hard this problem is going to be. Maybe you could find out whether it's possible to exhaustively list all the exceptions, or determine from syntax alone whether they apply -- those are questions for a Japanese language expert, not for a programmer.

Can you list the rules and the exceptions completely and concisely in English? That would be a good place to start. If not, well, you're out of luck anyway.

Have you tried the Perl module? Does it do the same thing as the Python one?

**Stonecold1995** · March 23rd, 2013

Originally Posted by Vaphell

are there cases where wa becomes ha even at the end of the sentence?

Yes, with words like "konbanwa".

Originally Posted by Vaphell

are there many such cases where wa is glued to the end of the word and becomes ha either way?

There are multiple types of romaji, some glue many things together, some don't (e.g. "konbanwa" is almost always one word in romaji, but "watashi wa" is not as commonly spelled "watashiwa").

Originally Posted by trent.josephsen

Can you list the rules and the exceptions completely and concisely in English? That would be a good place to start. If not, well, you're out of luck anyway.

I couldn't, but there may be a dictionary somewhere that could. There must be, because Google Translate is able to do it with good accuracy (but for most other things Google Translate is terrible at languages, especially Japanese).

Originally Posted by trent.josephsen

Have you tried the Perl module? Does it do the same thing as the Python one?

Ah, I forgot. I'll try that. If that doesn't work, are there any other such ways to do this?

I'll rephrase my original question a little... I don't care if it's in pure bash, for all I care it could use Google Translate (although I heard they shut down their API).

I'm not able to list too many exceptions, since I am still fairly new to Japanese after all. If I can't provide much more information, I guess I'll just look around more forums. Maybe I'll find something I could run under Wine, or perhaps there's a dictionary somewhere I can find.
