.. _encodings:

String encoding problems
========================

This document is meant to help avoid ``UnicodeError`` exceptions (``UnicodeEncodeError``, ``UnicodeDecodeError``) and to set a baseline that spares users some headaches.

First approach
--------------
One of the things that makes problems hard to track down in Python 2 is the implicit conversion between ``str`` and ``unicode``.

Look at this code::

    >>> u'ä'.encode('utf-8')
    '\xc3\xa4'
    >>>
    >>> u'ä'.decode('utf-8')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
        return codecs.utf_8_decode(input, errors, True)
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 0: ordinal not in range(128)

A situation like this can happen if the user confuses one type for another. ``encode`` is meant to be called on ``unicode`` objects and ``decode`` on ``str`` objects. Since you call ``decode`` on a ``unicode`` object, Python "knows" how to convert from ``unicode`` to ``str`` and then calls the ``decode`` method; *that* conversion is made with the safe default "ascii" codec, which raises an exception.

We need to know which type we are using **every time**. A possible way to avoid mistakes is to use ``leap_assert_type`` at the beginning of each method that has a ``str``/``unicode`` parameter.

The best approach is to use ``unicode`` internally and to convert to or from bytes (``str``) only when we read, write, or transmit data.
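
The rule above (``unicode`` inside, bytes at the edges) can be sketched as follows. This is only a minimal illustration, not code from the project: the function names ``decode_input``, ``greet`` and ``encode_output`` are hypothetical, and 'utf-8' is an assumed wire encoding. It is written so that it runs on both Python 2 and Python 3.

```python
# Minimal sketch of "unicode inside, bytes outside".
# All names here are hypothetical; 'utf-8' is an assumed encoding.


def decode_input(raw, encoding="utf-8"):
    """Decode bytes from the outside world as early as possible."""
    return raw.decode(encoding)


def greet(name):
    """Internal logic: works only with text (``unicode`` on Python 2)."""
    return u"Hello, " + name


def encode_output(text, encoding="utf-8"):
    """Encode text to bytes as late as possible, right before output."""
    return text.encode(encoding)


incoming = b"\xc3\xa4"                  # UTF-8 bytes for u'\xe4'
name = decode_input(incoming)           # text from here on
outgoing = encode_output(greet(name))   # bytes again at the boundary
```

Every conversion here is explicit, so the implicit "ascii" conversion shown earlier never gets a chance to run.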

Examples of problems found
--------------------------

* **Logging data**: ``logger.debug("some string {0}".format(some_data))`` may fail if ``some_data`` is a ``unicode`` object, because of the implicit conversion needed to output it.
  We need to use ``repr(some_data)`` to avoid encoding problems when sending data to stdout. An easy way to do it is: ``logger.debug("some string {0!r}".format(some_data))``

* **Path encoding**: helpers should always return ``unicode`` values, and we should encode them only when we need to.
  The stdlib correctly handles ``unicode`` path parameters.
  If we want to do something else with the paths, we need to convert them manually using the system encoding.

  Regarding the encoding, using a hardcoded value may be wrong.
  Instead of encoding/decoding with, for instance, 'utf-8', we should use ``sys.getfilesystemencoding()``.
  For data stored in a database (or anything otherwise isolated from the system), we may want to choose 'utf-8' explicitly.
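
A rough sketch of both points, assuming nothing about the project's real helpers: ``path_to_bytes``, ``path_to_text`` and ``describe`` are hypothetical names, and the fallback to 'utf-8' when ``sys.getfilesystemencoding()`` returns ``None`` (which the Python 2 documentation allows) is an assumption of this example. The code runs on both Python 2 and Python 3.

```python
import sys

text_type = type(u"")    # ``unicode`` on Python 2, ``str`` on Python 3
bytes_type = type(b"")   # ``str`` on Python 2, ``bytes`` on Python 3

# Assumption: fall back to 'utf-8' if the interpreter reports no encoding.
FS_ENCODING = sys.getfilesystemencoding() or "utf-8"


def path_to_bytes(path):
    """Encode a text path with the filesystem encoding (hypothetical helper)."""
    if isinstance(path, bytes_type):
        return path
    return path.encode(FS_ENCODING)


def path_to_text(path):
    """Decode a bytes path with the filesystem encoding (hypothetical helper)."""
    if isinstance(path, text_type):
        return path
    return path.decode(FS_ENCODING)


def describe(value):
    """Build a log message with ``!r`` so no implicit encoding is needed."""
    return "some string {0!r}".format(value)


message = describe(u"caf\xe9")
raw_path = path_to_bytes(u"notes.txt")
```

``path_to_text(raw_path)`` gives back the original text path, and ``message`` can be passed to ``logger.debug`` without triggering an implicit conversion in ``format``.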

Steps to improve code
---------------------

#. From now on, keep in mind the difference between ``str`` and ``unicode`` and write code accordingly.
#. In each method, we can add a ``leap_assert_type(parameter_name, unicode)`` (or ``str``) check to catch type problems.
#. Whenever possible, move towards the unicode 'frontier' (``unicode`` inside, ``str`` (bytes) outside).
#. When possible, update the method parameters so that we can be certain of the data types we are handling.
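
As an illustration of the type check in the steps above, here is a small sketch. ``assert_type`` is a hypothetical stand-in written for this example, not the actual ``leap_assert_type`` implementation, and ``save_nickname`` is an invented method; the code runs on both Python 2 and Python 3.

```python
text_type = type(u"")    # ``unicode`` on Python 2, ``str`` on Python 3


def assert_type(value, expected_type):
    """Hypothetical stand-in for ``leap_assert_type``: fail fast on a wrong type."""
    assert isinstance(value, expected_type), (
        "expected %s, got %s" % (expected_type.__name__, type(value).__name__))


def save_nickname(nickname):
    """Invented example method: text comes in, bytes go out."""
    assert_type(nickname, text_type)     # catch bytes passed in by mistake
    return nickname.encode("utf-8")      # explicit encoding at the frontier


stored = save_nickname(u"\xe4")
```

Passing bytes, for example ``save_nickname(b"abc")``, fails immediately with an ``AssertionError`` at the top of the method, instead of surfacing later as a ``UnicodeError`` somewhere else.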

Recommended info
----------------

* PyCon 2012 talk: https://www.youtube.com/watch?v=sgHbC6udIqc
* Article and transcript of the talk above: http://nedbatchelder.com/text/unipain.html
* PyConAr 2012 (Spanish): http://www.youtube.com/watch?v=pQJ0emlYv50
* Overcoming frustrations: http://pythonhosted.org/kitchen/unicode-frustrations.html
* Python's Unicode howto: http://docs.python.org/2/howto/unicode.html
* An encoding primer: http://www.danielmiessler.com/study/encoding/