cat Japanese text file

**ntzrmtthihu777** · November 18th, 2012

Originally Posted by Vaphell

not that i converted anything in my life using that tool, but short test shows that it merely prints out the converted content and the original file stays intact.

so what do you want to achieve exactly with these synonyms? you want to create symlinks pointing to real .wav files or what?

Basically the way it works is you place notes, either hiragana or romaji (Roman letters), on the screen, and tell it what voicebank (a collection of *.wav files and an oto.ini) to use. If you place hiragana notes but use a voicebank with romaji *.wav files, a properly aliased oto.ini should point あ to a.wav (and likewise, for romaji notes and a hiragana voicebank, it points a to あ.wav), so there is no need to create Linux symlinks.

The reason I needed to figure this out properly is that I want to create an automatic aliaser script, probably using sed. See, a good voicebank can have upwards of 300 lines, and manually aliasing them all is quite tedious. There are already Windows batch files that do this, but I am developing tools for UTAU users on Ubuntu. I wrote a tutorial on how to install UTAU (a Japanese Windows .exe) on a non-Japanese Ubuntu install, and how to properly unzip voicebanks with Japanese folder and/or file names (using the default file-roller or archive mounter gives gibberish names).

**Vaphell** · November 18th, 2012

so what's the input for that automatic aliaser script and what's the expected output?

**ntzrmtthihu777** · November 18th, 2012

Input 1 (there will be two scripts, I think; one to do this)

Code:

あ.wav=,0,0,0,0,0
い.wav=,0,0,0,0,0
う.wav=,0,0,0,0,0
え.wav=,0,0,0,0,0
お.wav=,0,0,0,0,0

Output 1

Code:

あ.wav=a,0,0,0,0,0
い.wav=i,0,0,0,0,0
う.wav=u,0,0,0,0,0
え.wav=e,0,0,0,0,0
お.wav=o,0,0,0,0,0

Input 2 (and one to do this)

Code:

a.wav=,0,0,0,0,0
i.wav=,0,0,0,0,0
u.wav=,0,0,0,0,0
e.wav=,0,0,0,0,0
o.wav=,0,0,0,0,0

Output 2

Code:

a.wav=あ,0,0,0,0,0
i.wav=い,0,0,0,0,0
u.wav=う,0,0,0,0,0
e.wav=え,0,0,0,0,0
o.wav=お,0,0,0,0,0

I think

Code:

sed 's/a.wav=/a.wav=あ/'

and so on would do the trick. Making a sedscript with all the equivalents and running

Code:

sed -f sedscript oto.ini

would be what is called for, yes?

This may not seem to be too much work to do manually, but I am only showing a small portion of what a full oto.ini file would contain. As I said, a good bank would have at least 120-ish just to cover Japanese syllables, and a few are multilingual, so lists of over 300 are not uncommon.

**Vaphell** · November 18th, 2012

but i don't see from where the script should get the info that a=あ, i=い, etc
if you had a tidy list of synonyms, it would be rather easy to generate these files.

**ntzrmtthihu777** · November 18th, 2012

Originally Posted by Vaphell

but i don't see from where the script should get the info that a=あ, i=い, etc
if you had a tidy list of synonyms, it would be rather easy to generate these files.

I was thinking along these lines, have 2 files:

hira_roma containing:

Code:

s/あ.wav=/あ.wav=a/g
s/い.wav=/い.wav=i/g
s/う.wav=/う.wav=u/g
s/え.wav=/え.wav=e/g
s/お.wav=/お.wav=o/g
...

roma_hira containing:

Code:

s/a.wav=/a.wav=あ/g
s/i.wav=/i.wav=い/g
s/u.wav=/u.wav=う/g
s/e.wav=/e.wav=え/g
s/o.wav=/o.wav=お/g
...

And running

Code:

sed -f roma_hira oto.ini

or

Code:

sed -f hira_roma oto.ini

as needed, assuming oto.ini is in the same dir.
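One thing to keep in mind: `sed -f` only prints the result to standard output, so oto.ini itself stays unchanged. A minimal sketch of applying the script in place while keeping a backup, assuming GNU sed (the default on Ubuntu) and UTF-8 files. The scratch directory, the sample data, and the `.bak` suffix are just illustrative choices:

```shell
#!/bin/bash
# Work in a scratch directory so no real voicebank is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Sample oto.ini (UTF-8 here; a real one may need converting first).
printf '%s\n' 'a.wav=,0,0,0,0,0' 'i.wav=,0,0,0,0,0' > oto.ini

# Two rules from the roma_hira sed script.
printf '%s\n' 's/a.wav=/a.wav=あ/g' 's/i.wav=/i.wav=い/g' > roma_hira

# -i.bak edits oto.ini in place and saves the original as oto.ini.bak.
sed -i.bak -f roma_hira oto.ini

cat oto.ini
```

If something goes wrong, copying oto.ini.bak back over oto.ini restores the original.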

**Vaphell** · November 19th, 2012

awk would be much better

consider this example:

Code:

$ cat syn.txt 
a あ
i い
u う
e え
o お
$ awk '{ printf("%s.wav=%s,0,0,0,0,0\n", $1, $2); }' syn.txt
a.wav=あ,0,0,0,0,0
i.wav=い,0,0,0,0,0
u.wav=う,0,0,0,0,0
e.wav=え,0,0,0,0,0
o.wav=お,0,0,0,0,0
$ awk '{ printf("%s.wav=%s,0,0,0,0,0\n", $2, $1); }' syn.txt
あ.wav=a,0,0,0,0,0
い.wav=i,0,0,0,0,0
う.wav=u,0,0,0,0,0
え.wav=e,0,0,0,0,0
お.wav=o,0,0,0,0,0

pipe to iconv to convert to SHIFT_JIS, dump the result to a file and it's done
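A rough sketch of that last step, assuming syn.txt is saved as UTF-8 and that the target program expects Shift_JIS. The output name oto_new.ini is made up for the example:

```shell
#!/bin/bash
# Work in a scratch directory so nothing real is overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Synonym list: romaji in column 1, hiragana in column 2 (UTF-8).
printf '%s\n' 'a あ' 'i い' 'u う' > syn.txt

# Generate the lines, then convert the stream from UTF-8 to Shift_JIS.
awk '{ printf("%s.wav=%s,0,0,0,0,0\n", $1, $2) }' syn.txt |
  iconv -f UTF-8 -t SHIFT_JIS > oto_new.ini

# Convert back to UTF-8 just to check the result on a UTF-8 terminal.
iconv -f SHIFT_JIS -t UTF-8 oto_new.ini
```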

even pure bash can do it:

Code:

$ while read -r a b; do echo "$a.wav=$b,0,0,0,0,0"; done < syn.txt
a.wav=あ,0,0,0,0,0
i.wav=い,0,0,0,0,0
u.wav=う,0,0,0,0,0
e.wav=え,0,0,0,0,0
o.wav=お,0,0,0,0,0
$ while read -r a b; do echo "$b.wav=$a,0,0,0,0,0"; done < syn.txt
あ.wav=a,0,0,0,0,0
い.wav=i,0,0,0,0,0
う.wav=u,0,0,0,0,0
え.wav=e,0,0,0,0,0
お.wav=o,0,0,0,0,0

**ntzrmtthihu777** · November 19th, 2012

Very interesting... I have used awk for a few personal projects, very nice use here. But, I have just considered what may be a hitch using my old scheme and was about to post it, but then I saw yours and it may have a similar problem...

Suppose this oto.ini is already partially aliased, say:

Code:

 
a.wav=あ,0,0,0,0,0
i.wav=,0,0,0,0,0
u.wav=う,0,0,0,0,0
e.wav=,0,0,0,0,0
o.wav=お,0,0,0,0,0

Wouldn't using either of our scripts give us

Code:

 
a.wav=ああ,0,0,0,0,0
i.wav=い,0,0,0,0,0
u.wav=うう,0,0,0,0,0
e.wav=え,0,0,0,0,0
o.wav=おお,0,0,0,0,0

?
I was thinking of extending the sed to:

Code:

s/a.wav=*,/a.wav=あ,/g

unless you know of a better solution.
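One caveat worth checking before settling on a pattern: in sed, `*` repeats the preceding character rather than matching "anything", and a pattern without `^` can match in the middle of a longer file name. The sketch below compares an unanchored rule with one that is anchored with `^` and uses `[^,]*` for "whatever is currently in the alias field". The name ka.wav is a made-up example of a longer file name, and the data is assumed to be UTF-8:

```shell
#!/bin/bash
# Sample lines: one already aliased, one with a longer (hypothetical) name.
sample=$(printf '%s\n' 'a.wav=あ,0,0,0,0,0' 'ka.wav=,0,0,0,0,0')

# Unanchored rule: doubles the existing alias and also matches the tail
# of "ka.wav=", so the wrong line gets aliased too.
loose=$(printf '%s\n' "$sample" | sed 's/a.wav=/a.wav=あ/')

# Anchored rule: only a line starting with "a.wav=" matches, and [^,]*
# swallows any existing alias so it is replaced instead of doubled.
strict=$(printf '%s\n' "$sample" | sed 's/^a[.]wav=[^,]*,/a.wav=あ,/')

printf 'loose:\n%s\nstrict:\n%s\n' "$loose" "$strict"
```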

Also, an actual oto.ini would have numbers other than 0 depending on the frequency of the sound, length of the consonant or vowel, and other info, so merely creating them out of thin air with the awk example would only be useful when creating a brand new bank, which defaults to ,0,0,0,0,0 or ,,,,, anyway. Each bank has its own more or less unique oto.ini. Creating the initial one is done by the program itself; it's just the aliasing that takes a while. I am looking to modify an existing oto.ini, sorry if I was not 100% clear on that from the start.

**Vaphell** · November 19th, 2012

my code created the output from scratch using template line
PUT_STUFF_HERE.wav=PUT_STUFF_HERE,0,0,0,0,0
but those changing numbers you speak of make that approach go out the window

duplicated symbols you mentioned are easy to fix - simply strip anything between = and , before doing substitutions.
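That stripping step could look roughly like this. It is only a sketch, assuming every entry has the form name.wav=alias,numbers and that file names contain no `=`:

```shell
#!/bin/bash
# Partially aliased sample, in the shape of the example posted above.
sample=$(printf '%s\n' 'a.wav=あ,0,0,0,0,0' 'i.wav=,0,0,0,0,0' 'u.wav=う,0,0,0,0,0')

# Keep everything up to and including "=", drop whatever sits in the
# alias field, and keep the first comma and the numbers after it.
stripped=$(printf '%s\n' "$sample" | sed 's/^\([^=]*=\)[^,]*,/\1,/')

printf '%s\n' "$stripped"
```

After this pass every alias field is empty, so a simple substitution can no longer double an existing alias.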

can you give a few example lines of real data (or even a full oto.ini) so i can get the full picture?

**ntzrmtthihu777** · November 19th, 2012

Code:

a.wav=あ,54,105,348,36,17
ad.wav=,8,133,155,79,39 
ah.wav=,80,97,318,39,14 
ai.wav=,64,132,284,57,31 
al.wav=,303,109,152,33,10 
all.wav=,270,118,168,32,18 
am.wav=,27,74,19,23,11 
an.wav=,63,75,307,26,10 
and.wav=,209,80,297,33,13 
ang.wav=,110,74,352,24,11

A few of these will not have a hiragana equivalent, but again some of these banks are designed to be multilingual.

**Vaphell** · November 19th, 2012

i think sed -f is a good approach but i'd generate these sed files too, based on the clean list to make it easier to introduce changes, should the need arise.

Code:

#!/bin/bash

while read -r a b
do
  echo "s/^$a[.]wav=[^,]*,/$a.wav=$b,/"
done < syn.txt > sed1.txt

while read -r a b
do
  echo "s/^$a[.]wav=[^,]*,/$b.wav=$a,/"
done < syn.txt > sed2.txt

echo
echo "Sed #1"
sed -f sed1.txt oto.txt # | iconv -f UTF-8 -t SHIFT_JIS > output1.txt
echo "Sed #2"
sed -f sed2.txt oto.txt # | iconv -f UTF-8 -t SHIFT_JIS > output2.txt

example, using trash data

Code:

$ cat syn.txt
a XoXo
ad !!!
ah ###
ai ===
al ---
all @@@
am FFUU-
an -_-
and o.O
ang >_<
$ ./jp.sh 
Sed #1
a.wav=XoXo,54,105,348,36,17 
ad.wav=!!!,8,133,155,79,39 
ah.wav=###,80,97,318,39,14 
ai.wav====,64,132,284,57,31 
al.wav=---,303,109,152,33,10 
all.wav=@@@,270,118,168,32,18 
am.wav=FFUU-,27,74,19,23,11 
an.wav=-_-,63,75,307,26,10 
and.wav=o.O,209,80,297,33,13 
ang.wav=>_<,110,74,352,24,11
Sed #2
XoXo.wav=a,54,105,348,36,17 
!!!.wav=ad,8,133,155,79,39 
###.wav=ah,80,97,318,39,14 
===.wav=ai,64,132,284,57,31 
---.wav=al,303,109,152,33,10 
@@@.wav=all,270,118,168,32,18 
FFUU-.wav=am,27,74,19,23,11 
-_-.wav=an,63,75,307,26,10 
o.O.wav=and,209,80,297,33,13 
>_<.wav=ang,110,74,352,24,11
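For what it's worth, the same aliasing can also be sketched as a single awk pass that loads syn.txt into a lookup table. This sidesteps escaping problems when a name or alias contains characters that are special to sed (such as `/` or `&`), and it leaves lines without a synonym untouched. It is an alternative under the same assumptions as above (UTF-8 files, one "name alias" pair per line in syn.txt), not something tested against a real voicebank. The sample data and the output name oto_aliased.txt are made up:

```shell
#!/bin/bash
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical test data: the second alias contains "&", which sed would
# treat specially in a replacement.
printf '%s\n' 'a あ' 'ad a&d' > syn.txt
printf '%s\n' 'a.wav=,54,105,348,36,17' \
              'ad.wav=old,8,133,155,79,39' \
              'ah.wav=,80,97,318,39,14' > oto.txt

awk '
  # First file (syn.txt): remember the alias for each name.
  NR == FNR { alias[$1] = $2; next }
  # Second file (oto.txt): rewrite the alias field of known names only.
  {
    eq     = index($0, "=")
    name   = substr($0, 1, eq - 1)
    rest   = substr($0, eq + 1)
    comma  = index(rest, ",")
    is_wav = sub(/[.]wav$/, "", name)
    if (is_wav && comma > 0 && (name in alias))
      print name ".wav=" alias[name] substr(rest, comma)
    else
      print
  }
' syn.txt oto.txt > oto_aliased.txt

cat oto_aliased.txt
```

Swapping `$1` and `$2` in the first rule would give the lookup for the opposite direction.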
