Idea for handling encodings in a programming language
I've been designing a language named Wolf in my head for a while, and what with the big deal everyone's been making about character encodings due to the Python 2/3 switch, I think I came up with a good way to handle encodings.
There are two kinds of strings in Wolf: String and Bytestring. Their components are Character and Byte. A String is a sequence of Unicode characters (each Character is a code point), while a Bytestring is a plain sequence of bytes (so a Byte is just a byte). A String is always represented internally as Unicode code points, independent of any byte encoding.
Changing a String to a Bytestring is encoding, and changing a Bytestring to a String is decoding. You do this with String's 'encode' method (which returns a Bytestring containing the raw sequence of bytes for the chosen encoding) and Bytestring's 'decode' method (which returns a String of the Unicode equivalents for each character in that encoding).
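Wolf doesn't exist yet, so here is a rough sketch of the same round trip in Python 3, whose str and bytes types are close analogues of String and Bytestring. The sample text and the choice of UTF-8 are only for illustration; nothing here is meant as Wolf syntax.

```python
# Python 3's str plays the role of Wolf's String (a sequence of code points),
# and bytes plays the role of Bytestring (a sequence of raw bytes).
text = "naïve café"

# Encoding: String -> Bytestring
raw = text.encode("utf-8")

# Decoding: Bytestring -> String
round_trip = raw.decode("utf-8")

# The String is counted in code points, the Bytestring in bytes.
code_point_count = len(text)
byte_count = len(raw)
```

The two lengths differ because "ï" and "é" each take two bytes in UTF-8, which is exactly the distinction between Character and Byte.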
To do all this transparently, Strings and Bytestrings have encoding "tags". Whenever a tagged String is encoded and no encoding is specified, the tagged encoding is used to encode it. Whenever a tagged Bytestring is decoded and no encoding is specified, the tagged encoding is used to decode it. And, of course, the 'encode' and 'decode' operations always set the encoding tag on their return value.
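To make the tag rules concrete, here is a minimal Python sketch. The class names TaggedString and TaggedBytes, the 'tag' field, and the choice to raise an error when neither a tag nor an explicit encoding is available are all my assumptions for illustration; the design above doesn't yet say what an untagged value should do.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TaggedBytes:
    """Hypothetical stand-in for Wolf's Bytestring."""
    data: bytes
    tag: Optional[str] = None

    def decode(self, encoding: Optional[str] = None) -> "TaggedString":
        # An explicit encoding wins; otherwise fall back to the tag.
        chosen = encoding or self.tag
        if chosen is None:
            # Assumption: untagged data with no encoding given is an error.
            raise ValueError("untagged Bytestring: an encoding must be given")
        # The result carries the encoding that was actually used.
        return TaggedString(self.data.decode(chosen), tag=chosen)


@dataclass(frozen=True)
class TaggedString:
    """Hypothetical stand-in for Wolf's String."""
    text: str
    tag: Optional[str] = None

    def encode(self, encoding: Optional[str] = None) -> TaggedBytes:
        # Same rule in the other direction.
        chosen = encoding or self.tag
        if chosen is None:
            raise ValueError("untagged String: an encoding must be given")
        return TaggedBytes(self.text.encode(chosen), tag=chosen)
```

With this, TaggedString("héllo", tag="latin-1").encode() uses Latin-1 without being told, while .encode("utf-8") overrides the tag and returns a Bytestring tagged "utf-8".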
Bytestring literals used in the code automatically receive the "ascii" tag, and String literals in the code are tagged with whatever encoding the source file is in. Any data received from an external source must be tagged and decoded manually, though file handles let you specify a particular encoding to tag read data with.
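The file-handle part could look roughly like this in Python. The names Tagged, read_tagged, and decode are hypothetical helpers I made up for the sketch, and treating an untagged, unspecified decode as an error is again my assumption.

```python
import os
import tempfile
from typing import NamedTuple, Optional


class Tagged(NamedTuple):
    """Raw bytes plus an optional encoding tag (hypothetical helper type)."""
    data: bytes
    tag: Optional[str]


def read_tagged(path: str, encoding: Optional[str] = None) -> Tagged:
    # Read raw bytes and attach whatever encoding the caller asked for.
    # With no encoding given, the data stays untagged.
    with open(path, "rb") as handle:
        return Tagged(handle.read(), encoding)


def decode(chunk: Tagged, encoding: Optional[str] = None) -> str:
    # An explicit encoding wins; otherwise use the tag set at read time.
    chosen = encoding or chunk.tag
    if chosen is None:
        # Assumption: external data must be tagged before it can be decoded.
        raise ValueError("untagged data: specify an encoding")
    return chunk.data.decode(chosen)


# Usage: write a Latin-1 file, then read it back with and without a tag.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    with open(path, "wb") as out:
        out.write("grüß".encode("latin-1"))
    chunk = read_tagged(path, encoding="latin-1")
    text = decode(chunk)          # decoded using the tag
    untagged = read_tagged(path)  # no tag: must be decoded manually
```

The point of the design is that the encoding is stated once, where the data enters the program, and every later decode can rely on the tag.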
Do you think this is a good way to handle character encodings? I know character encoding in general is a rather complicated issue, but this is the best way I could think of.
Regards, PacSci