[ubuntu] copy non-english characters from pdf

February 26th, 2011, 07:31 PM
I need to copy Danish text from PDF files and work with it in other programs, such as the OpenOffice word processor. The problem is that the OCR doesn't seem to recognize or copy non-English characters, such as æ, ø, and å. It tends to mistake them for a, o, ee. Has anyone come across a solution for this?

February 28th, 2011, 04:56 AM
I'm not an expert, but since this is a day old with 0 replies, I'll take a shot.

Several possibilities come to mind:

1- Many, I suspect most, OCR apps are trainable. If you are doing enough of this work, especially if a lot of it is in the same font, training would be worth a try if, for some reason, you MUST use OCR. Consult the manual for your app. You may need to copy and paste characters from one of the exotic character picker apps, switch in and out of a Danish keyboard setting, or learn the alternate way of inputting characters with the control key and the num pad.

2- Use one of the many automated translation services and go explore some Danish tech fora. You can translate some key words into Danish and enter them in Google, then ask for advice on one of those boards. Presumably they do OCR in Denmark and must have some way to scan their funny squiggles.

3- Why use OCR at all? Depending on the document, you may be able to simply highlight the text, copy it, and paste it into the program of your choice. Even if the PDF is locked down tight, there are still apps out there that can convert it to text. Try scroogling "pdf to text", for instance.
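For what it's worth, here is a rough sketch of options 1 and 3 in Python. It is not from anyone's tested setup. It assumes the `pdftotext` tool (from the poppler-utils package) and, for scanned pages, `tesseract` with its Danish language data (`dan`) are installed. The function names and file names are made up for illustration. The idea is to ask for UTF-8 output, then check whether æ, ø, and å actually survived.

```python
import subprocess
import unicodedata

# The three Danish letters (both cases) that tend to get lost.
DANISH_LETTERS = set("æøåÆØÅ")


def pdftotext_command(pdf_path, txt_path):
    """Build a pdftotext call (assumes poppler-utils is installed).

    -enc UTF-8 requests UTF-8 output; -layout tries to keep the page layout.
    """
    return ["pdftotext", "-enc", "UTF-8", "-layout", pdf_path, txt_path]


def tesseract_command(image_path, out_base):
    """Build a tesseract call for a scanned page image.

    -l dan selects the Danish language data, which must be installed separately.
    Tesseract writes its result to out_base + ".txt".
    """
    return ["tesseract", image_path, out_base, "-l", "dan"]


def clean(text):
    """Compose sequences such as 'a' + combining ring into a single 'å' (NFC)."""
    return unicodedata.normalize("NFC", text)


def danish_letters_found(text):
    """Return the Danish letters present in text, sorted by code point."""
    return sorted(DANISH_LETTERS & set(clean(text)))


def extract_text(pdf_path, txt_path):
    """Run pdftotext and return the extracted text (hypothetical helper)."""
    subprocess.run(pdftotext_command(pdf_path, txt_path), check=True)
    with open(txt_path, encoding="utf-8") as handle:
        return clean(handle.read())
```

To try it, something like `text = extract_text("document.pdf", "document.txt")` followed by `print(danish_letters_found(text))` should show which of the letters came through. An empty list on a document you know contains them suggests the PDF is a scan with no real text layer, in which case OCR with the Danish language data is the fallback.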