January 14th, 2008   #10
bit4
Re: Babylon

I managed to convert Babylon_English_Hebrew.BGL into a valid dictd-format dictionary and use it with dictd. Here is how:

1 - Run this command to convert the BGL file into a pair of index+dict files:
Code:
     $ dictconv Babylon_English_Hebrew.BGL -o Babylon_English_Hebrew.index >Babylon_English_Hebrew.dict
This produces a defective index+dict pair, but all the information is in there, so the files can be fixed as follows:

2 - Save the Python program at the end of this post into a file named fix_babylon_heb_dict.py

3 - Run the fix program to create a well-behaved index+dict pair:
Code:
     $ python fix_babylon_heb_dict.py Babylon_English_Hebrew
The results are named Babylon_English_Hebrew.index.new and Babylon_English_Hebrew.dict.new.

4 - Compress the dict file:
Code:
    $ dictzip Babylon_English_Hebrew.dict.new
5 - Copy the pair to where dictd looks for dictionaries by default:
Code:
    $ sudo cp Babylon_English_Hebrew.dict.new.dz /usr/share/dictd/Babylon_English_Hebrew.dict.dz
    $ sudo cp Babylon_English_Hebrew.index.new /usr/share/dictd/Babylon_English_Hebrew.index
6 - Run dictdconfig to regenerate the dictd config file from the list of available dictionary files:
Code:
    $ sudo dictdconfig -w
7 - Restart dictd so it will use the new config file and load the new dictionary:
Code:
    $ sudo /etc/init.d/dictd restart
That's it!
I am still hoping to find a similar procedure for converting old *.DIC files as well.
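
If you want to sanity-check the fixed pair before installing it, something like the following could be run after step 3 and before step 4, while the .dict.new file is still uncompressed. This is only a rough sketch in Python 3 (the fix program itself is Python 2). The names decode_number and lookup are made up for this illustration, and it assumes the index lines have exactly three tab-separated fields (word, offset, length), as written by the fix program.

```python
B64 = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/'

def decode_number(s):
    """Decode an index number (base64 digits, most significant digit first)."""
    n = 0
    for c in s:
        n = n * 64 + B64.index(c)
    return n

def lookup(index_path, dict_path, headword):
    """Return every definition stored for headword, decoded as UTF-8 text."""
    results = []
    with open(index_path, 'rb') as idx, open(dict_path, 'rb') as dct:
        for raw in idx:
            fields = raw.decode('utf-8', 'replace').rstrip('\n').split('\t')
            # Skip lines that do not have the assumed word/offset/length shape.
            if len(fields) != 3 or fields[0] != headword:
                continue
            offset, length = decode_number(fields[1]), decode_number(fields[2])
            dct.seek(offset)            # offsets are byte positions in the dict file
            results.append(dct.read(length).decode('utf-8', 'replace'))
    return results
```

For example, lookup('Babylon_English_Hebrew.index.new', 'Babylon_English_Hebrew.dict.new', 'house') should return a list of definitions containing real Hebrew letters, assuming 'house' is a headword in the glossary. An empty list means the word was not found in the index.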

Here is the fix_babylon_heb_dict.py program:
Code:
#
# Read the *.dict and *.index files produced from a Babylon Hebrew dictionary
# by dictconv and convert them into a valid pair of files using utf-8 encoding.
# Written by bit4, Jan 2008.
#
# The *.dict file from Babylon represents Hebrew using a range of accented
# characters starting at 0x00e0 instead of the real Hebrew range that starts
# at 0x05d0. In addition, the letter Nun is represented by 0x011f. Also,
# the index file contains funny $nnnn$ suffixes in most of the words.
# This program fixes all of the above problems.
# The input file name is given on the command line (without the suffix)
# and the output is saved into a pair of files with the same names and the
# word '.new' appended.
#
# The full process of converting and installing a BGL dictionary:
#   $ dictconv Babylon_English_Hebrew.BGL -o Babylon_English_Hebrew.index >Babylon_English_Hebrew.dict
#   $ python fix_babylon_heb_dict.py Babylon_English_Hebrew
#   $ dictzip Babylon_English_Hebrew.dict.new
#   $ sudo cp Babylon_English_Hebrew.dict.new.dz /usr/share/dictd/Babylon_English_Hebrew.dict.dz
#   $ sudo cp Babylon_English_Hebrew.index.new /usr/share/dictd/Babylon_English_Hebrew.index
#   $ sudo dictdconfig -w
#   $ sudo /etc/init.d/dictd restart
#

import re
import sys

abc='ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/'
def s2i(s):
    """Return the number represented by the base64 string s."""
    n = 0
    for c in s:
        n = n * 64 + abc.index(c)
    return n

def i2s(n):
    """Return the base64 string representing the number n."""
    s = ''
    while n:
        s = abc[n % 64] + s
        n //= 64  # explicit integer division
    return s or 'A'

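def acc2heb(ch):
    """Map an accented stand-in character to the Hebrew letter it represents."""
    if ch == u'\u011f': return u'\u05e0'    # Nun is the one special case
    if u'\xe0' <= ch <= u'\xff': return unichr(ord(ch) - 0xe0 + 0x05d0)
    return ch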

def fix_def(definition):
    """Return the fixed definition: decode the utf-8 bytes, remap the accented
    stand-in characters to real Hebrew letters, and re-encode as utf-8."""
    return ''.join(map(acc2heb, definition.decode('utf8','replace'))).encode('utf8','replace')

def writedef(out_idxfile, out_dictfile, word, definition, oldpos):
    """Append the given word+def to the output dictionary and return the next position"""
    deflen = len(definition)
    out_idxfile.write('%s\t%s\t%s\n' % (word, i2s(oldpos), i2s(deflen)))
    out_dictfile.write(definition)
    return oldpos + deflen

def main():
    if len(sys.argv) != 2:
        print >>sys.stderr, "Usage: %s name" % sys.argv[0]
        print >>sys.stderr, "Input is name.index + name.dict, output is name.index.new + name.dict.new"
        sys.exit(1)
    fname = sys.argv[1]

    # Binary mode: the offsets and lengths in the index are byte counts.
    idxfile = file(fname + '.index', 'rb')
    dictfile = file(fname + '.dict', 'rb')
    out_idxfile = file(fname + '.index.new', 'wb')
    out_dictfile = file(fname + '.dict.new', 'wb')
    outpos = 0

    outpos = writedef(out_idxfile, out_dictfile, '00-encoding', 'utf-8', outpos)
    for line in idxfile:
        try:
            word,pos,leng = line.strip().split('\t')
            pos,leng = s2i(pos),s2i(leng)
        except ValueError:
            # Ignore malformed lines (wrong number of fields or a bad base64 digit).
            continue
        # Remove the weird $number$ suffix that many entries have after using dictconv on
        # Babylon glossaries.
        mo = re.match(r'(.*)\$\d+\$$', word)
        if mo:
            word = mo.group(1)
        #
        dictfile.seek(pos)
        definition = fix_def(dictfile.read(leng))
        #
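        outpos = writedef(out_idxfile, out_dictfile, word, definition, outpos)

    # Close everything explicitly so the output is fully written before dictzip runs.
    for f in (idxfile, dictfile, out_idxfile, out_dictfile):
        f.close()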

if __name__ == '__main__':
    main()
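
For anyone who wants to try the two text fixes on their own without running the whole program, here is a rough Python 3 sketch of just those parts (the program above is Python 2). It follows the same rules described in the header comment: 0x011f stands for Nun, the range 0x00e0 to 0x00ff is shifted to start at 0x05d0, and a trailing $nnnn$ is dropped from index words. The names fix_text and strip_suffix are made up for this example and do not appear in the program.

```python
import re

def acc2heb(ch):
    """Map one accented stand-in character to the Hebrew letter it represents."""
    if ch == '\u011f':                   # special case: stands for Nun
        return '\u05e0'
    if '\xe0' <= ch <= '\xff':           # accented range -> Hebrew range
        return chr(ord(ch) - 0xe0 + 0x05d0)
    return ch                            # everything else is left alone

def fix_text(text):
    """Apply acc2heb to every character of an already-decoded string."""
    return ''.join(acc2heb(ch) for ch in text)

def strip_suffix(word):
    """Drop a trailing $digits$ marker from an index word, if there is one."""
    return re.sub(r'\$\d+\$$', '', word)
```

With these definitions, fix_text('\xf9\xec\xe5\xed') returns '\u05e9\u05dc\u05d5\u05dd' (the word shalom), and strip_suffix('house$123$') returns 'house'. Plain Latin text passes through fix_text unchanged.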