
Thread: Similarity Comparison within words (Python)

  1. #1
    Join Date
    Apr 2008
    Location
    Netherlands
    Beans
    223
    Distro
    Ubuntu 13.04 Raring Ringtail

    Similarity Comparison within words (Python)

    Hello,

    I am attempting an ambitious project built around etymology within the Germanic languages.
    Coming from Holland, I speak a Germanic language, and because of my obsession with the ancient past I decided to try to find the words that bind all Germanic languages together.
    The trouble is that many words have similarities, but in different places.
    For example, the Dutch word for travelling is 'reizen', the German equivalent is 'Reise', and Norwegian also has 'reise'. Other words are spelled differently but share similarities in other places.

    What I ultimately want to do is type in a word or a list of words and have the program look up the corresponding words in other languages and display them correctly; later I can work on tracing them back to a common root.
    The trouble is getting the program to detect the commonalities correctly.

    Does anyone have any ideas on how I could accomplish this detection?
    The travelling example is an easy case.
    Others are more difficult. For instance, the Dutch for sailing is 'varen', which isn't similar to the German equivalent, but the German for driving is 'fahren', an obvious common descendant.
    My ultimate goal is to detect cases like that.

    Does anyone have any pointers?
    Thanks
    ----------------------------------------
    Don't fear the terminal, it may look like a dragon, you just need to learn to ride it.

  2. #2
    Join Date
    Sep 2013
    Beans
    14

    Re: Similarity Comparison within words (Python)

    I don't want to disappoint you, but I'm afraid this is going to be more complex than you'd like. It will require writing complex word analysers for each language, and the words will likely need to be accompanied by some metadata. Natural language processing has never been an easy task.

    That said, there is an easy (but imprecise) way to do similarity analysis: you can use SequenceMatcher from Python's difflib, or the fuzzywuzzy library.

    http://docs.python.org/2/library/difflib.html

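    It will tell you how similar two strings are. Note, however, that this is not etymological analysis: it only compares the characters. SequenceMatcher scores how much of the two strings lines up in matching blocks, while edit-distance measures such as Levenshtein count how many modifications to string A are needed to get string B.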
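
    A minimal sketch of the difflib approach, using the word pairs from the first post. The helper name similarity is just for illustration, and lowercasing the input is an assumption made so that 'Reise' and 'reise' compare equal.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return difflib's ratio (0.0 to 1.0) for two words, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Word pairs taken from the thread.
pairs = [("reizen", "Reise"), ("reizen", "reise"), ("varen", "fahren")]
for left, right in pairs:
    print(left, right, round(similarity(left, right), 2))
```

    A score close to 1.0 means the spellings are nearly identical. Where to put the cut-off for "probably related" is something you would have to tune by hand.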
    Last edited by Nil_Pointer; September 8th, 2013 at 09:28 PM.

  3. #3
    Join Date
    Apr 2009
    Location
    Germany
    Beans
    2,134
    Distro
    Ubuntu Development Release

    Re: Similarity Comparison within words (Python)

    sklearn (scikit-learn) might be the best tool to tackle the problem.
    It already has a lot of components centred around text similarity analysis (which is, as already said, a complicated topic; I recommend reading some papers about the basics before starting).

    See this for an interesting tutorial on how to use it:
    http://pyevolve.sourceforge.net/wordpress/?p=1589
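
    To give a rough idea of what vector-based similarity looks like at the word level, here is a sketch that compares words by their character bigrams using cosine similarity. It deliberately uses only the standard library and plain counts instead of TF-IDF weights, so it is a simplification and not sklearn itself; sklearn's TfidfVectorizer can build similar character n-gram features via analyzer="char". The function names and the space padding are assumptions for illustration.

```python
from collections import Counter
from math import sqrt


def char_ngrams(word, n=2):
    """Count the character n-grams of a word, padded with spaces at both ends."""
    padded = f" {word.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine_similarity(a, b, n=2):
    """Cosine similarity between the n-gram count vectors of two words."""
    va, vb = char_ngrams(a, n), char_ngrams(b, n)
    dot = sum(va[g] * vb[g] for g in va.keys() & vb.keys())
    norm = sqrt(sum(c * c for c in va.values())) * sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0


for left, right in [("reizen", "Reise"), ("varen", "fahren")]:
    print(left, right, round(cosine_similarity(left, right), 2))
```

    Like difflib, this only looks at spelling, so it will not know that 'v' and 'f' can stand for related sounds.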
    Last edited by MadCow108; September 8th, 2013 at 11:26 PM.

  4. #4
    Join Date
    Aug 2013
    Beans
    1

    Re: Similarity Comparison within words (Python)

    How about creating your database using phonetic dictionaries rather than the actual spellings? I suspect this would be more likely to pick up similar or identical sounding words in different languages, even if they're spelled differently.

    A quick Google search leads me to believe that there are phonetic dictionaries available to download.
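
    As a stand-in until you have real phonetic data, here is a very rough sketch of the idea: reduce each spelling to a crude "sound key" and compare the keys instead of the words. The rules below are hypothetical and hand-picked just to make the varen/fahren example from the first post line up; they are not a real phonetic transcription, and the name rough_key is made up.

```python
import re

# Hypothetical, hand-picked rules for illustration only. A real project
# would use proper transcriptions (e.g. IPA) from a phonetic dictionary.
RULES = [
    (r"v", "f"),              # treat 'v' and 'f' alike (varen / fahren)
    (r"z", "s"),              # treat 'z' and 's' alike (reizen / Reise)
    (r"([aeiou])h", r"\1"),   # drop an 'h' that directly follows a vowel
    (r"([aeiou])\1", r"\1"),  # collapse a doubled vowel into one
]


def rough_key(word):
    """Apply the rules in order to get a crude sound key for a word."""
    key = word.lower()
    for pattern, replacement in RULES:
        key = re.sub(pattern, replacement, key)
    return key


for word in ["varen", "fahren", "reizen", "Reise"]:
    print(word, "->", rough_key(word))
```

    The keys could then be fed into a string comparison such as difflib, so that words which sound alike score higher than their spellings alone would suggest.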
