PYTHON: Unicode and RE



jesuisbenjamin
June 2nd, 2010, 12:36 AM
Hi there,

Just when I thought I had figured out how to handle Unicode strings in Python, I am facing a similar problem again, this time with regular expressions (RE). The information I found online did not help, perhaps because I am using the re.split() function.

Here is the code:


# -*- coding: utf-8 -*-
import re

def cut(string):
    # split the string wherever the compiled pattern (knife) matches
    cut_string = re.split(knife, string)
    print cut_string

nagari = "prātipadau saṃṣṭhitau vai drūvyam"
knife = '([^[au]|[ai]|[o]|[e]|[ā]|[ū]|[ī]|[u]|[i]|[a]]*[[o]|[e]|[ā]|[ū]|[ī]|[u]|[i]|[a]])'
knife = re.compile(knife, re.UNICODE)
cut(nagari)

The result is:

>>>[u'pr\u0101tipadau sa\u1e43\u1e63\u1e6dhitau vai dr\u016bvyam']
while I expect:

>>>['prā', 'ti', 'pa', 'dau', ' ', 'sa', 'ṃṣṭhi', 'tau', 'vai', 'drū', 'vya', 'm']

I tried several things (adding u or ur in front of my strings, calling decode('utf-8'), etc.), but I only get escaped output and the split still fails.

I think I need help with this one once more :(
Thanks
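
For reference, here is a minimal sketch of one way the expected split might be obtained. It is not taken from the code above and rests on several assumptions: Python 3 (where every str is Unicode), re.findall() instead of re.split(), and a rewritten pattern, since character classes cannot be nested and a | inside [...] is just a literal character. The names VOWELS, PATTERN and SYLLABLE are made up for illustration. The text is normalised to NFC first so that ā, ū and ī are single code points. Unlike the expected list above, this version returns a ' ' piece at every word boundary.

```python
import re
import unicodedata

# Simple vowels of the romanised text; ā, ī, ū are written as escapes so the
# pattern does not depend on how this file is encoded (assumed vowel set).
VOWELS = "aiueo\u0101\u012b\u016b"

PATTERN = (
    r"[^\s{v}]*(?:au|ai|[{v}])"  # optional consonants, then a diphthong or one vowel
    r"|[^\s{v}]+"                # leftover consonants with no vowel after them
    r"|\s+"                      # whitespace kept as its own piece
).format(v=VOWELS)
SYLLABLE = re.compile(PATTERN)


def cut(text):
    """Return the consecutive pieces of text matched by SYLLABLE."""
    # NFC turns 'a' + combining macron into the single character ā, etc.
    text = unicodedata.normalize("NFC", text)
    return SYLLABLE.findall(text)


nagari = "prātipadau saṃṣṭhitau vai drūvyam"
print(cut(nagari))
```

Every character falls into exactly one of the three alternatives, so joining the pieces gives back the normalised input. Dropping the whitespace pieces, if they are not wanted, would be a simple filter on the returned list.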