I managed to convert Babylon_English_Hebrew.BGL into valid dict format and use it with dictd. Here is how:
1 - Run this command to convert the BGL file into a pair of index+dict files:
Code:
$ dictconv Babylon_English_Hebrew.BGL -o Babylon_English_Hebrew.index >Babylon_English_Hebrew.dict
The resulting index and dict files are defective, but they contain all the information, so they can be fixed as follows:
2 - Save the Python program at the end of this post into a file named fix_babylon_heb_dict.py
3 - Run the fix program to create a well-behaved index+dict pair of files:
Code:
$ python fix_babylon_heb_dict.py Babylon_English_Hebrew
The results are named Babylon_English_Hebrew.index.new and Babylon_English_Hebrew.dict.new.
4 - Compress the dict file:
Code:
$ dictzip Babylon_English_Hebrew.dict.new
5 - Copy the pair to where dictd looks for dictionaries by default:
Code:
$ sudo cp Babylon_English_Hebrew.dict.new.dz /usr/share/dictd/Babylon_English_Hebrew.dict.dz
$ sudo cp Babylon_English_Hebrew.index.new /usr/share/dictd/Babylon_English_Hebrew.index
6 - Use an automated tool to recreate the dictd config file from the list of available dictionary files:
Code:
$ sudo dictdconfig -w
7 - Restart dictd so it will use the new config file and load the new dictionary:
Code:
$ sudo /etc/init.d/dictd restart
That's it!
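If dictd returns truncated or garbled definitions, it can help to check the fixed pair before compressing it in step 4. The following is only a sketch, not part of the original procedure: it is written for Python 3, the names b64_to_int and check_pair are made up for illustration, and it assumes the index layout that the fix program writes (word, offset and length separated by tabs, with the numbers in base64).

```python
import os

B64 = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/'

def b64_to_int(s):
    """Decode a base64 number as used in the index (most significant digit first)."""
    n = 0
    for c in s:
        n = n * 64 + B64.index(c)  # raises ValueError on a non-base64 character
    return n

def check_pair(index_path, dict_path):
    """Return a list of problems found; an empty list means every entry fits."""
    problems = []
    dict_size = os.path.getsize(dict_path)
    with open(index_path, 'rb') as idx:
        for lineno, raw in enumerate(idx, 1):
            fields = raw.rstrip(b'\r\n').split(b'\t')
            if len(fields) != 3:
                problems.append('line %d: expected 3 tab-separated fields' % lineno)
                continue
            try:
                offset = b64_to_int(fields[1].decode('ascii'))
                length = b64_to_int(fields[2].decode('ascii'))
            except (ValueError, UnicodeDecodeError):
                problems.append('line %d: offset or length is not base64' % lineno)
                continue
            # Each entry must point at a range that lies inside the dict file.
            if offset + length > dict_size:
                problems.append('line %d: entry runs past the end of the dict file' % lineno)
    return problems
```

Calling check_pair('Babylon_English_Hebrew.index.new', 'Babylon_English_Hebrew.dict.new') on the uncompressed files should return an empty list if every index entry points inside the dict file.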
I am still hoping to find a similar procedure for converting old *.DIC files as well.
Here is the fix_babylon_heb_dict.py program:
Code:
#
# Read the *.dict and *.index files produced from a Babylon Hebrew dictionary
# by dictconv and convert them into a valid pair of files using utf-8 encoding.
# Written by bit4, Jan 2008.
#
# The *.dict file from Babylon represents Hebrew using a range of accented
# characters starting at 0x00e0 instead of the real Hebrew range that starts
# at 0x05d0. In addition, the letter Nun is represented by 0x011f. Also,
# the index file contains funny $nnnn$ suffixes in most of the words.
# This program fixes all of the above problems.
# The input file name is given on the command line (without the suffix)
# and the output is saved into a pair of files with the same names and the
# word '.new' appended.
#
# The full process of converting and installing a BGL dictionary:
# $ dictconv Babylon_English_Hebrew.BGL -o Babylon_English_Hebrew.index >Babylon_English_Hebrew.dict
# $ python fix_babylon_heb_dict.py Babylon_English_Hebrew
# $ dictzip Babylon_English_Hebrew.dict.new
# $ sudo cp Babylon_English_Hebrew.dict.new.dz /usr/share/dictd/Babylon_English_Hebrew.dict.dz
# $ sudo cp Babylon_English_Hebrew.index.new /usr/share/dictd/Babylon_English_Hebrew.index
# $ sudo dictdconfig -w
# $ sudo /etc/init.d/dictd restart
#
import re
import sys
abc='ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/'
def s2i(s):
    """Return the number represented by the base64 string s."""
    n = 0
    for c in s:
        n = n * 64 + abc.index(c)
    return n
def i2s(n):
    """Return the base64 string representing the number n."""
    s = ''
    while n:
        s = abc[n % 64] + s
        n //= 64
    return s or 'A'
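def acc2heb(ch):
    """Map one accented character to the Hebrew letter it stands for."""
    if ch == u'\u011f':
        # Nun is the one letter that falls outside the 0x00e0..0x00ff range.
        return u'\u05e0'
    if u'\xe0' <= ch <= u'\xff':
        return unichr(ord(ch) - 0xe0 + 0x05d0)
    return ch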
def fix_def(definition):
    """Return the fixed definition: decode the utf-8 bytes, map the accented
    characters to real Hebrew letters and encode the result as utf-8 again."""
    text = definition.decode('utf8', 'replace')
    return ''.join(map(acc2heb, text)).encode('utf8', 'replace')
def writedef(out_idxfile, out_dictfile, word, definition, oldpos):
    """Append the given word+def to the output dictionary and return the next position"""
    deflen = len(definition)
    out_idxfile.write('%s\t%s\t%s\n' % (word, i2s(oldpos), i2s(deflen)))
    out_dictfile.write(definition)
    return oldpos + deflen
def main():
    if len(sys.argv) != 2:
        print >>sys.stderr, "Usage: %s name" % sys.argv[0]
        print >>sys.stderr, "Input is name.index + name.dict, output is name.index.new + name.dict.new"
        sys.exit(1)
    fname = sys.argv[1]
    # Binary mode keeps the byte offsets stored in the index exact.
    idxfile = open(fname + '.index', 'rb')
    dictfile = open(fname + '.dict', 'rb')
    out_idxfile = open(fname + '.index.new', 'wb')
    out_dictfile = open(fname + '.dict.new', 'wb')
    outpos = 0
    outpos = writedef(out_idxfile, out_dictfile, '00-encoding', 'utf-8', outpos)
    for line in idxfile:
        word, pos, leng = line.strip().split('\t')
        pos, leng = s2i(pos), s2i(leng)  # todo: add try/except and ignore lines with errors
        # Remove the weird $number$ suffix that many entries have after using dictconv on
        # Babylon glossaries.
        mo = re.match(r'(.*)\$\d+\$$', word)
        if mo:
            word = mo.group(1)
        # Read the old definition from its recorded position and fix it.
        dictfile.seek(pos)
        definition = fix_def(dictfile.read(leng))
        # Append the entry at the new position.
        outpos = writedef(out_idxfile, out_dictfile, word, definition, outpos)
    # Close the outputs so everything is flushed to disk before dictzip runs.
    out_idxfile.close()
    out_dictfile.close()

if __name__ == '__main__':
    main()
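The program above is written for Python 2. For anyone who only has Python 3, the two text fixes described in the header comment (the accented-to-Hebrew mapping and the $nnnn$ suffix removal) could be sketched roughly like this. The names HEB_TABLE, fix_text and strip_suffix are made up for illustration; the code works on already-decoded strings and does not replace the index rewriting in the full program.

```python
import re

# Shift U+00E0..U+00FF onto U+05D0..U+05EF, mirroring acc2heb above.
HEB_TABLE = {cp: cp - 0xE0 + 0x05D0 for cp in range(0xE0, 0x100)}
# Nun arrives as U+011F instead of falling inside the accented range.
HEB_TABLE[0x011F] = 0x05E0

# A trailing $digits$ marker, as left in index words by dictconv.
SUFFIX_RE = re.compile(r'\$\d+\$$')

def fix_text(text):
    """Return text with the accented stand-ins replaced by Hebrew letters."""
    return text.translate(HEB_TABLE)

def strip_suffix(word):
    """Return word without a trailing $digits$ marker, if it has one."""
    return SUFFIX_RE.sub('', word)
```

With this table, fix_text applied to the four characters U+00F9 U+00EC U+00E5 U+00ED yields U+05E9 U+05DC U+05D5 U+05DD, and strip_suffix('hello$1234$') gives 'hello'. Using str.translate with a prebuilt table avoids calling a Python function for every character, which matters a little on a large glossary.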