
On character encoding, the dark arts surrounding it, and how it relates to Python and JavaScript

1) Computers don’t manipulate characters, they manipulate bytes.

2) In order to assign meaning to bytes, we come up with conventions for what they mean. The simplest convention is called ASCII (128 byte values, of which 95 are printable characters).

3) With 8 bits (one byte), you can represent 256 different values. At first, we only filled in the first 128 with English letters, punctuation, and other such oddities. The other 128 possibilities were filled with characters from non-English languages. The problem is that nobody did this in a standard way until the ISO conventions came along. Even that wasn’t enough, so along came Unicode, with room for 1.1M characters, of which only a fraction have been assigned so far (about 110K at the time of writing).
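To make those numbers concrete, here is a small sketch. It assumes Python 3 (any recent version should do) and uses only built-ins: ord() gives a character’s number (its code point), chr() goes the other way, and sys.maxunicode is the highest code point Python accepts. The variable names are just for illustration.

```python
import sys

# Every character has a number, called its code point.
code_point_a = ord("A")          # inside the ASCII range (0-127)
code_point_p = ord("\u2119")     # 0x2119, far too big for a single byte

# chr() maps a code point back to its character.
round_trip = chr(code_point_p)

# The Unicode code space runs from 0 to 0x10FFFF: about 1.1M slots.
total_code_points = sys.maxunicode + 1

print(code_point_a, hex(code_point_p), total_code_points)
```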

4) UTF-8: the problem with Unicode is that it is easier and more efficient to send data across the wire in 8-bit bytes (rather than the 21 bits a Unicode character can require). To do that, UTF-8 uses some byte values the way we use the “shift” key on a keyboard: they signal that the bytes that follow belong to the same character, so we can pack more things into less bandwidth. ASCII characters have exactly the same byte representation in UTF-8 as in ASCII.

5) For Python 2: text can be represented by two different data types, byte strings (str) and Unicode strings (unicode).

Example of string: my_string = "Hello"

Example of unicode: my_string = u"Hello \u2119\u01b4"

To convert between the two types, you call my_string.encode('utf-8') to go from unicode to bytes, and my_bytes.decode('utf-8') to go from bytes back to unicode. Other encodings, such as 'ascii', work the same way.
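Here is a rough sketch of that round trip. It is written for Python 3, where the text type is called str and the byte type is called bytes, so the type names differ from the Python 2 ones above. The sample string and variable names are made up for illustration. It also checks the UTF-8 claim that ASCII characters keep their one-byte form while other characters take more bytes.

```python
text = "Hello \u2119\u01b4"          # text: a sequence of Unicode characters

# encode: text -> bytes
data = text.encode("utf-8")

# decode: bytes -> text (you have to know the bytes were UTF-8)
restored = data.decode("utf-8")

# ASCII characters keep their single-byte form in UTF-8 ...
ascii_same = "Hello".encode("utf-8") == "Hello".encode("ascii")

# ... while other characters take two to four bytes each.
samples = ("H", "\u01b4", "\u2119", "\U0001F600")
sizes = {ch: len(ch.encode("utf-8")) for ch in samples}

print(data)
print(sizes)
```

Note that len(text) counts characters while len(data) counts bytes, so the two numbers differ as soon as a non-ASCII character is involved.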

6) Sometimes .encode() will fail. If you try to encode a Unicode character that doesn’t exist in ASCII to ASCII, what do you think will happen? The second argument of .encode() tells Python what to do in those cases. By default it is:

my_string.encode('ascii', 'strict'), which will raise an exception (UnicodeEncodeError)

But it can be: “replace” (the character that can’t be converted becomes ?), “ignore” (the character is thrown away), or “xmlcharrefreplace” (the character becomes its XML numeric character reference)
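The four handlers can be compared side by side. This is a minimal sketch for Python 3; the sample string is made up, and the variable names are only for illustration.

```python
text = "Hello \u2119\u01b4"   # two characters that ASCII cannot represent

# "strict" (the default) raises UnicodeEncodeError.
try:
    text.encode("ascii", "strict")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

replaced = text.encode("ascii", "replace")            # b"Hello ??"
ignored = text.encode("ascii", "ignore")              # b"Hello "
as_xml = text.encode("ascii", "xmlcharrefreplace")    # b"Hello &#8473;&#436;"

print(strict_failed, replaced, ignored, as_xml)
```

Only “strict” tells you that something went wrong. The other three quietly change or drop data, so they are better saved for output you can afford to degrade.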

7) Python 2 automatically converts between the two types, especially when you are trying to, for example, concatenate a str with a unicode. By default it assumes ASCII when it does this, so it works until the first non-ASCII byte shows up and then fails. The best way to deal with this problem is to keep the two types separate and always know which one you are dealing with.

8) Python 3 does not convert automatically. Its strings (str) are Unicode by default. It also has a “bytes” data type that stores raw bytes: b"hello" != "hello" (the second hello is Unicode, therefore different)
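A quick way to see the difference in Python 3 (a sketch; the variable names are made up):

```python
raw = b"hello"      # bytes: what comes off a file or a socket
text = "hello"      # str: Unicode text

# bytes and str never compare equal, even when they "look" the same.
same_without_decoding = (raw == text)

# After an explicit decode, both sides are text and the comparison works.
same_after_decoding = (raw.decode("ascii") == text)

print(type(raw).__name__, type(text).__name__)
print(same_without_decoding, same_after_decoding)
```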

9) If you try to concatenate values of the two different types in Python 3, it will just raise an error (TypeError). So the strategy is: deal with encodings at the very beginning of your input, and at the very end of your output. In other words:

– As data comes in: decode it to Unicode. As it goes out: encode it to bytes.

– Inside your program: make sure you are always dealing with Unicode.

– If you use plugins, know what they are sending you (Unicode or bytes), and use the rules above. Keep in mind that just looking at a string of bytes won’t tell you what encoding it is in.
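The rules above can be sketched as follows in Python 3. The functions shout() and handle() are hypothetical names invented for this example, and UTF-8 is only an assumed default: in real code you need to find out what encoding your input actually uses.

```python
def shout(text: str) -> str:
    """Hypothetical core logic: works on Unicode text only."""
    return text.upper() + "!"


def handle(raw_input: bytes, encoding: str = "utf-8") -> bytes:
    """Hypothetical entry point: bytes in, bytes out, Unicode in between."""
    text = raw_input.decode(encoding)     # edge in: bytes -> Unicode
    result = shout(text)                  # middle: Unicode only
    return result.encode(encoding)        # edge out: Unicode -> bytes


# Mixing the two groups fails loudly in Python 3.
try:
    "caf\u00e9" + b"!"
    mixed_ok = True
except TypeError:
    mixed_ok = False

output = handle("caf\u00e9".encode("utf-8"))
print(output, mixed_ok)
```

The point of the design is that decode() and encode() each appear exactly once, at the edges, so the code in the middle never has to wonder which type it is holding.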