Thread: Unable to read Cyrillic/Russian file names mounting with ext2

  1. #11
    Join Date
    May 2007
    Beans
    22

    Re: Unable to read Cyrillic/Russian file names mounting with ext2

    You're amazing!

    I'll try first thing when I'm at home.

  2. #12
    Join Date
    May 2007
    Beans
    22

    Re: Unable to read Cyrillic/Russian file names mounting with ext2

    Back again:
    ls Ni*astori | od -c -td1
    0000000   N   i 361   a       P   a   s   t   o   r   i   :  \n   (   2
             78 105 -15  97  32  80  97 115 116 111 114 105  58  10  40  50
    0000020   0   0   2   )       M   a   r 355   a  \n  \n   N   i   n   a
             48  48  50  41  32  77  97 114 -19  97  10  10  78 105 110  97
    0000040       P   a   s   t   o   r   i   :  \n   N   o   .   H   a   y
             32  80  97 115 116 111 114 105  58  10  78 111  46  72  97 121
    0000060       Q   u   i   n   t   o       M   a   l   o  \n
             32  81 117 105 110 116 111  32  77  97 108 111  10
    0000075

    What am I looking at?
    Nautilus shows (invalid encoding) under dir/filename.

  3. #13
    Join Date
    May 2007
    Beans
    22

    Re: Unable to read Cyrillic/Russian file names mounting with ext2

    convmv -f cp1251 -t utf-8 -r --exec "echo #1 should be renamed to #2" .
    convmv -f ascii -t utf-8 -r --exec "echo #1 should be renamed to #2" .
    convmv -f iso-8859-1 -t utf-8 -r --exec "echo #1 should be renamed to #2" .

    None of these seems to work; each dry run prints something like:
    echo Ni\�a\ Pastori should be renamed to Ni\�\�a\ Pastori

    I do get this message:
    Your Perl version has fleas #37757 #49830, but Perl is up-to-date.

  4. #14
    Join Date
    May 2007
    Beans
    22

    Re: Unable to read Cyrillic/Russian file names mounting with ext2

    I guess the Cyrillic filenames can't be recovered. I found this thread with people who had the same problem:
    http://forum.dsmg600.info/t651-probl...rom-linux.html
    This was confirmed when I accidentally stumbled upon a recently created Russian document that displayed its Cyrillic characters just fine.

    But the "invalid encoding" problem with filenames like "Niña Pastori" remains. Any clue what the original codepage could be?

  5. #15
    Join Date
    May 2007
    Beans
    22

    Re: Unable to read Cyrillic/Russian file names mounting with ext2

    What I found:
    N i 361 a
    78 105 -15 97


    The number 361 matches the octal code for ñ in Windows Code Page 1252:
    Microsoft Windows Code Page 1252
    char  dec  col/row  oct  hex  description
    [ñ]   241  15/01    361  F1   SMALL LETTER n WITH TILDE


    But even though the following command used the wrong code page, it should have produced the Russian "с" (byte F1 in cp1251) instead of another �:
    convmv -f cp1251 -t utf-8 -r --exec "echo #1 should be renamed to #2" .
    echo Ni\�a\ Pastori should be renamed to Ni\�\�a\ Pastori


    And if I manually enter a ñ in the terminal, it shows the character properly.
    If I rename the file manually, it's also shown properly in the terminal.

    What am I missing? And what's up with the decimal value [-15] for [ñ] (oct 361)?
    Last edited by Bl4deRunner; September 10th, 2009 at 09:36 AM.

  6. #16
    Join Date
    Apr 2006
    Location
    London
    Beans
    212
    Distro
    Ubuntu

    Re: Unable to read Cyrillic/Russian file names mounting with ext2

    This is good news actually. The key next step is to work out the character encoding used for your Cyrillic filenames.

    As you have worked out, the Niña Pastori filename is encoded in ISO-8859-1 (or something very close: the Windows variant, cp1252, only adds a few characters you would never use in a filename):

    0000000 N i 361 a P a s t o r i : \n ( 2

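    See position F1 in http://en.wikipedia.org/wiki/ISO/IEC...odepage_layout - this is the ñ character. (The 361 in the od dump is octal, from the -c option; it equals 241 decimal or F1 hex. The -td1 option prints the same byte as a signed decimal, hence the -15.)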
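    As an illustration (not part of the original post), here is a minimal Python sketch of how the numbers in the od dump relate to each other. The helper name describe_byte is made up for this example; it assumes the byte is written in octal, the way od -c shows it.

```python
def describe_byte(octal_text: str) -> dict:
    """Interpret one byte written the way od -c shows it (octal digits)."""
    value = int(octal_text, 8)                       # "361" -> 241
    signed = value - 256 if value > 127 else value   # how od -td1 prints it
    raw = bytes([value])
    return {
        "unsigned": value,
        "signed": signed,
        "hex": format(value, "02X"),
        # ISO-8859-1 maps every byte 1:1 onto U+0000..U+00FF
        "latin1": raw.decode("iso-8859-1"),
        # cp1252 leaves a few bytes undefined, so replace instead of failing
        "cp1252": raw.decode("cp1252", errors="replace"),
    }

info = describe_byte("361")
print(info["unsigned"], info["signed"], info["hex"])
```

    The "latin1" and "cp1252" entries both hold ñ for this byte, which is why the dump alone cannot tell the two encodings apart.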

    convmv appears to be translating this filename into UTF-8 correctly, i.e. the ñ goes from 1 byte to 2 bytes, as seen in its output. So please redirect the output of convmv into a file for (1) this file and (2) a file with some Cyrillic characters in its name (and please tell me what they are meant to be, if possible).
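    The 1-byte to 2-byte change can be checked outside convmv. This is only a sketch of the transcoding idea, not of how convmv works internally; the byte string is the filename from the od dump earlier in the thread.

```python
# Filename bytes as they appear in the od dump in this thread.
name_on_disk = b"Ni\xf1a Pastori"

# Interpret the bytes with the assumed source encoding, then re-encode.
text = name_on_disk.decode("iso-8859-1")
utf8_name = text.encode("utf-8")

# Byte lengths before and after: the single ñ byte becomes two bytes.
print(len(name_on_disk), len(utf8_name))
print(utf8_name)

# With the wrong source encoding the same byte becomes a different letter:
# in cp1251, byte F1 is the Cyrillic small letter es (U+0441).
wrong = name_on_disk.decode("cp1251")
print(ascii(wrong[2]))
```

    Picking the wrong value for -f therefore does not fail; it silently produces a different, equally valid-looking name.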

    Then you can look at this with gedit, or Firefox on your local system (change the Firefox character encoding to utf8 and others to see what they say).

    Please also paste the convmv output file as follows:

    - use pastehtml.com via this script: http://code.google.com/p/pastehtml/ - do something like "convmv .... | sh pastehtml.sh"

    - it gives you a URL which you can check with Firefox (be sure to set your character encoding as appropriate, e.g. ISO 8859-1 for the Nina name; an example I just did is http://pastehtml.com/view/090910saJ08OSD.txt which contains Niña in ISO-8859-1)

    - paste the URL here along with the exact command that generated the output

    This avoids the whole issue of the forum software translating character encodings.

    Are you absolutely sure that your locale and fonts in the Terminal program support UTF-8? Type 'locale' in the terminal, and also try saving an HTML file from Firefox that you know has Cyrillic utf8 characters (yahoo.ru or something) and then viewing that with lynx (apt-get install lynx if needed).

    The pastehtml test above will help a lot, as it's much easier to tell Firefox to use a specific character encoding than with some other programs.

    Also I hear that ROXfiler is good at showing filenames in non-UTF8 character sets - apt-get install rox-filer and give it a go. See http://ubuntuforums.org/showthread.php?t=144297 for tips on this, convmv, etc. Even though the statement that NTFS uses CP850 is wrong (NTFS uses Unicode according to Microsoft's internationalisation guru: http://blogs.msdn.com/michkap/archiv...24/769540.aspx), there are many good tips on this thread.

    This is quite painful to debug, but once you are on UTF-8 it's MUCH easier... everything remains in a single character set including Cyrillic characters, Spanish accented characters, etc.
    Last edited by Cato2; September 10th, 2009 at 01:17 PM.

  7. #17
    Join Date
    Apr 2006
    Location
    London
    Beans
    212
    Distro
    Ubuntu

    Re: Unable to read Cyrillic/Russian file names mounting with ext2

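    Responding to your question about the '-15' for the Nina file: od's -td1 prints each byte as a signed 8-bit decimal, so the byte F1 hex (octal 361, unsigned 241) shows up as 241 - 256 = -15. It is the same ñ byte, not a different encoding. The underlying problem is that you have multiple character encodings for the filenames on your disk. Anyone using the filename for a file must know the character encoding (e.g. ISO-8859-1) to make sense of it.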

    Not quite sure how this happened since Samba was apparently using utf8, but I suspect that Samba wasn't doing utf8 at all, and was simply passing through the unmodified 8-bit character sets used by the client PCs (e.g. ISO-8859-1 or -15 for the Nina filenames, and Windows-1251 [or something else] for the Cyrillic filenames).

    Anyway - the hard part is not doing the conversion, it's identifying which character sets are in use. We know that it's ISO-8859-1 for the Nina type files (presumably MP3 files?) - note that ISO-8859-15 is almost the same, replacing a handful of characters, most notably to add the Euro sign. What we don't know is the character set (i.e. the encoding, basically) for the Cyrillic filenames.

    Once you are using UTF-8, at least all this will be in the past, but the trick is to identify which character encodings were used for which files and do a selective convmv to rename them.
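    A first pass at "which files need converting at all" can be sketched in Python. This is an assumption-laden illustration, not something from the thread: the function name classify is made up, and it only separates names that are already fine from names in some unknown 8-bit encoding.

```python
def classify(name: bytes) -> str:
    """Return 'ascii', 'utf-8' or 'legacy' for a raw filename."""
    try:
        name.decode("ascii")
        return "ascii"          # identical in every encoding discussed here
    except UnicodeDecodeError:
        pass
    try:
        name.decode("utf-8")
        return "utf-8"          # already valid UTF-8, leave it alone
    except UnicodeDecodeError:
        return "legacy"         # some 8-bit encoding; which one is still unknown

# Sample names: two taken from the od dump, one already converted.
samples = [b"No.Hay Quinto Malo", b"Ni\xc3\xb1a Pastori", b"Ni\xf1a Pastori"]
for raw in samples:
    print(classify(raw), raw)
```

    On a real disk, the raw names could be obtained with os.listdir(b"."), which returns bytes when given a bytes path. Only the names reported as 'legacy' would be candidates for a selective convmv, and the source encoding still has to be worked out per group of files.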

    If you have any accented/Cyrillic filenames in archive files (backups, .tar.gz files, etc) you would also need to unpack those, convmv to rename, and repack them.

    This is somewhat error prone hence my encouragements to do a backup first, using an image backup (CloneZilla) so that you can reconstruct the exact state of the filesystem. If the filenames are messed up badly, it's almost as bad as losing the files...

    Also, a JBOD system is quite vulnerable to disk errors anyway - RAID-1 or simply 2 separate filesystems would be better. If you really want 1 big filesystem, look into LVM - however, disabling write caching (hdparm -W0 /dev/sdX) is strongly recommended for integrity with LVM and ext3, and possibly ext2.

  8. #18
    Join Date
    May 2007
    Beans
    22

    Re: Unable to read Cyrillic/Russian file names mounting with ext2

    Well...
    I've found this, and the Nautilus-script is a big help.
    http://ubuntuforums.org/showthread.php?t=144297
    BUT!
    You really have to double-check the encodings before applying the scripts.

    Moreover, I found out that the --exec "echo #1 #2" of convmv doesn't show what the end-result will be. For example:

    convmv -f cp1252 -t utf-8 -r --exec "echo #1 should be renamed to #2" .
    gave:
    echo Ni\�a\ Pastori should be renamed to Ni\�\�a\ Pastori

    But

    convmv -f cp1252 -t utf-8 -r --notest .
    translated the files properly to:
    Niña Pastori

    But now I've found other files with different encodings.
    What's the easiest way to find the codepage if you know the character & character code?

    For the record:
    ubuntu@ubuntu:~$ locale
    LANG=en_US.UTF-8
    LC_CTYPE="en_US.UTF-8"
    LC_NUMERIC="en_US.UTF-8"
    LC_TIME="en_US.UTF-8"
    LC_COLLATE="en_US.UTF-8"
    LC_MONETARY="en_US.UTF-8"
    LC_MESSAGES="en_US.UTF-8"
    LC_PAPER="en_US.UTF-8"
    LC_NAME="en_US.UTF-8"
    LC_ADDRESS="en_US.UTF-8"
    LC_TELEPHONE="en_US.UTF-8"
    LC_MEASUREMENT="en_US.UTF-8"
    LC_IDENTIFICATION="en_US.UTF-8"
    LC_ALL=

  9. #19
    Join Date
    Apr 2009
    Beans
    1,173

    Re: Unable to read Cyrillic/Russian file names mounting with ext2

    Quote: Originally posted by Bl4deRunner
    _______ _______.doc
    ~$????? ???????.doc
    ~$_____ _______.doc
    ??????? ???????.doc
    One possibility you should consider is whether these files were ever migrated from FAT or NTFS, or some other FS. If so then it is possible that the partition has been correctly mounted already.

  10. #20
    Join Date
    Apr 2006
    Location
    London
    Beans
    212
    Distro
    Ubuntu

    Re: Unable to read Cyrillic/Russian file names mounting with ext2

    Quote: Originally posted by Bl4deRunner
    Well...
    ...
    You really have to double-check the encodings before applying the scripts.

    Moreover, I found out that the --exec "echo #1 #2" of convmv doesn't show what the end-result will be. For example:

    convmv -f cp1252 -t utf-8 -r --exec "echo #1 should be renamed to #2" .
    gave:
    echo Ni\�a\ Pastori should be renamed to Ni\�\�a\ Pastori

    But

    convmv -f cp1252 -t utf-8 -r --notest .
    translated the files properly to:
    Niña Pastori

    But now I've found other files with different encodings.
    What's the easiest way to find the codepage if you know the character & character code?
    The reason convmv doesn't appear to work is that your console UTF8 setup is incorrect - either it's not using UTF-8, or more likely the Unicode font used in console doesn't include Cyrillic characters. Try Arial Unicode MS, which is available in Windows/Office and can be installed in Ubuntu if you don't have it.

    If you had uploaded the convmv output to pastehtml.com I could tell you this definitively, and this is also the ONLY way to figure out this character set.

    Once we have the actual character values in a file, it's a lot easier to use Firefox to try different character encodings until one of them makes sense, as mentioned above, or to look in web pages documenting character sets (much more work).
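    The "try encodings until one makes sense" step can also be done locally. A rough Python sketch follows; the candidate list is just a guess at common Cyrillic encodings, and the sample bytes are a hypothetical word encoded as cp1251, not a filename from this thread.

```python
# Candidate Cyrillic encodings (an assumed shortlist, extend as needed).
CANDIDATES = ["cp1251", "koi8_r", "cp866", "iso8859_5", "mac_cyrillic"]

def try_decodings(raw: bytes, encodings=CANDIDATES) -> dict:
    """Decode the same bytes under each candidate; None where decoding fails."""
    results = {}
    for enc in encodings:
        try:
            results[enc] = raw.decode(enc)
        except UnicodeDecodeError:
            results[enc] = None
    return results

# Hypothetical sample: a Russian word stored as cp1251 bytes.
sample = "\u041f\u0440\u0438\u0432\u0435\u0442".encode("cp1251")

for enc, text in try_decodings(sample).items():
    print(f"{enc:14}{text}")
```

    Only one of the printed lines should read as real Russian; the others come out as plausible-looking but meaningless letters. A person who knows the intended text still has to make that call.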

    This will go much quicker if you respond to my previous post specifically, i.e. run
    Code:
    convmv -f cp1252 -t utf-8 -r --exec "echo #1 should be renamed to #2" . >convmv.txt
    Then upload this file to http://pastehtml.com by installing the pastehtml.sh file as mentioned, and then running
    Code:
    cat convmv.txt | sh pastehtml.sh
    This is so I can see the exact character set values. I don't have time to type in the reasons for these requests but they are designed to help identify the source character set, after which the conversion is quite trivial.

    Until we figure out the current character set of the Cyrillic characters, it's like trying to find a lost key in a dark cellar, as we have no idea what value to set for the -f parameter.

    There are apparently 18 commonly used Cyrillic encodings and some specific applications use their own variants... see http://www.unicodecharacter.com/charsets/cyrillic.html. Also, see http://www.cs.tut.fi/~jkorpela/chars.html#problems for common issues when the character sets don't match.
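    If you know at least one byte value and the character it is supposed to be, the list of candidates can be narrowed mechanically. A hedged sketch: matching_codecs is a made-up helper, the encoding list is an arbitrary selection, and the two known pairs are the ñ (octal 361) and í (octal 355) bytes from the od dump earlier in the thread.

```python
def matching_codecs(pairs, encodings):
    """Return the encodings in which every (byte_value, character) pair agrees."""
    hits = []
    for enc in encodings:
        try:
            if all(bytes([b]).decode(enc) == ch for b, ch in pairs):
                hits.append(enc)
        except UnicodeDecodeError:
            continue    # byte undefined in this encoding, so it cannot match
    return hits

# An arbitrary selection of Western and Cyrillic encodings to test against.
ENCODINGS = ["iso8859_1", "iso8859_15", "cp1252", "cp850", "cp437",
             "cp1251", "koi8_r", "cp866", "iso8859_5", "mac_roman"]

# Known pairs from the dump: 0o361 should be ñ, 0o355 should be í.
known = [(0o361, "\u00f1"), (0o355, "\u00ed")]
print(matching_codecs(known, ENCODINGS))
```

    Several closely related encodings can survive this test, since they differ only in a few positions. For the Cyrillic names, the same approach needs at least one filename whose intended spelling is known.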

    Try running rox-filer as well, it's much easier to try different character sets with this by simply doing "CHARSET=cp850 rox-filer" etc, and it has special support for viewing filenames in multiple character sets as well as UTF-8. See http://ubuntuforums.org/showpost.php...98&postcount=5
    Last edited by Cato2; September 11th, 2009 at 07:54 AM.
