author | Micah Anderson <micah@riseup.net> | 2014-11-11 11:53:55 -0500 |
---|---|---|
committer | Micah Anderson <micah@riseup.net> | 2014-11-11 11:53:55 -0500 |
commit | 7d5c3dcd969161322deed6c43f8a6a3cb92c3369 (patch) | |
tree | 109b05c88c7252d7609ef324d62ef9dd7f06123f /docs/source/unicode.rst | |
parent | 44be832c5708baadd146cb954befbc3dcad8d463 (diff) |
upgrade to 14.4.1 (upstream/14.4.1)
Diffstat (limited to 'docs/source/unicode.rst')
-rw-r--r-- | docs/source/unicode.rst | 188 |
1 files changed, 188 insertions, 0 deletions
diff --git a/docs/source/unicode.rst b/docs/source/unicode.rst
new file mode 100644
index 0000000..a0c7878
--- /dev/null
+++ b/docs/source/unicode.rst
@@ -0,0 +1,188 @@
+.. PyZMQ Unicode doc, by Min Ragan-Kelley, 2010
+
+.. _unicode:
+
+PyZMQ and Unicode
+=================
+
+PyZMQ is built with an eye towards an easy transition to Python 3, and part of
+that is dealing with unicode strings. This is an overview of some of what we
+found, and what it means for PyZMQ.
+
+First, Unicode in Python 2 and 3
+********************************
+
+In Python < 3, a ``str`` object is really a C string with some sugar - a
+specific series of bytes with some fun methods like ``endswith()`` and
+``split()``. In Python 2.0, the ``unicode`` object was added, which handles different
+methods of encoding. In Python 3, however, the meaning of ``str`` changes. A
+``str`` in Python 3 is a full unicode object, with encoding and everything. If
+you want a C string with some sugar, there is a new object called ``bytes``,
+that behaves much like the 2.x ``str``. The idea is that for a user, a string is
+a series of *characters*, not a series of bytes. For simple ASCII, the two are
+interchangeable, but if you consider accents and non-Latin characters, then the
+character meaning of byte sequences can be ambiguous, since it depends on the
+encoding scheme. The Python developers chose to avoid the ambiguity by forcing
+users who want the actual bytes to specify the encoding every time they convert a
+string to bytes. That way, users are aware of the difference between a series of
+bytes and a collection of characters, and don't confuse the two, as happens in
+Python 2.x.
+
+The problems (on both sides) come from the fact that regardless of the language
+design, users are mostly going to use ``str`` objects to represent collections
+of characters, and the behavior of that object is dramatically different in
+certain aspects between the 2.x ``bytes`` approach and the 3.x ``unicode``
+approach. The ``unicode`` approach has the advantage of removing byte ambiguity:
+it's a list of characters, not bytes. However, if you really do want the
+bytes, it's very inefficient to get them. The ``bytes`` approach has the
+advantage of efficiency. A ``bytes`` object really is just a char* pointer with
+some methods to be used on it, so interacting with C
+code is highly efficient and straightforward. However, understanding a
+bytes object as a string with extended characters introduces ambiguity and
+possibly confusion.
+
+To avoid ambiguity, hereafter we will refer to encoded C arrays as 'bytes' and
+abstract unicode objects as 'strings'.
+
+Unicode Buffers
+---------------
+
+Since unicode objects have a wide range of representations, they are not stored
+as the bytes according to their encoding, but rather in a format called UCS (an
+older fixed-width Unicode format). On some platforms (OS X, Windows), the storage
+is UCS-2, which is 2 bytes per character. On most \*ix systems, it is UCS-4, or
+4 bytes per character. The contents of the *buffer* of a ``unicode`` object are
+not encoding dependent (always UCS-2 or UCS-4), but they are *platform*
+dependent. As a result of this, and the further insistence on not interpreting
+``unicode`` objects as bytes without specifying encoding, ``str`` objects in
+Python 3 don't even provide the buffer interface. You simply cannot get the raw
+bytes of a ``unicode`` object without specifying the encoding for the bytes. In
+Python 2.x, you can get to the raw buffer, but the platform dependence and the
+fact that the encoding of the buffer is not the encoding of the object make it
+very confusing, so this is probably a good move.
+
+The efficiency problem here comes from the fact that simple ASCII strings are 4x
+as big in memory as they need to be on most Linux systems (2x on other platforms).
+Also, to translate to/from C code that works with char*, you always have to copy
+data and encode/decode the bytes. This really is horribly inefficient from a
+memory standpoint. Essentially, where memory efficiency matters to you, you
+should never ever use strings; use bytes. The problem is that users will almost
+always use ``str``, and in 2.x they are efficient, but in 3.x they are not. We
+want to make sure that we don't help the user make this mistake, so we ensure
+that zmq methods don't try to hide what strings really are.
+
+What This Means for PyZMQ
+*************************
+
+PyZMQ is a wrapper for a C library, so it really should use bytes, since a
+string is not a simple wrapper for ``char *`` like it used to be, but an
+abstract sequence of characters. The representations of bytes in Python are
+either the ``bytes`` object itself, or any object that provides the buffer
+interface (aka memoryview). In Python 2.x, unicode objects do provide the buffer
+interface, but as they do not in Python 3, where pyzmq requires bytes, we
+specifically reject unicode objects.
+
+The relevant methods here are ``socket.send/recv``, ``socket.get/setsockopt``,
+``socket.bind/connect``. The important consideration for send/recv and
+set/getsockopt is that when you put in something, you really should get the same
+object back with its partner method. We can easily coerce unicode objects to
+bytes with send/setsockopt, but the problem is that the pair method of
+recv/getsockopt will always be bytes, and there should be symmetry. We certainly
+shouldn't try to always decode on the retrieval side, because if users just want
+bytes, then we are potentially using up enormous amounts of excess memory
+unnecessarily, due to copying and the larger memory footprint of unicode strings.
+
+Still, we recognize the fact that users will quite frequently have unicode
+strings that they want to send, so we have added ``socket.<method>_string()``
+wrappers. These methods simply wrap their bytes counterpart by encoding
+to/decoding from bytes around them, and they all take an `encoding` keyword
+argument that defaults to utf-8. Since encoding and decoding are necessary to
+translate between unicode and bytes, it is impossible to perform non-copying
+actions with these wrappers.
+
+``socket.bind/connect`` methods are different from these, in that they are
+strictly setters and there is no corresponding getter method. As a result, we
+feel that we can safely coerce unicode objects to bytes (always to utf-8) in
+these methods.
+
+.. note::
+
+    For cross-language symmetry (including Python 3), the ``_unicode`` methods
+    are now ``_string``. Many languages have a notion of native strings, and
+    the use of ``_unicode`` was wedded too closely to the name of such objects
+    in Python 2. For the time being, anywhere you see ``_string``, ``_unicode``
+    also works, and is the only option in pyzmq ≤ 2.1.11.
+
+
+The Methods
+-----------
+
+Overview of the relevant methods:
+
+.. py:function:: socket.bind(self, addr)
+
+    `addr` is ``bytes`` or ``unicode``. If ``unicode``,
+    encoded to utf-8 ``bytes``
+
+.. py:function:: socket.connect(self, addr)
+
+    `addr` is ``bytes`` or ``unicode``. If ``unicode``,
+    encoded to utf-8 ``bytes``
+
+.. py:function:: socket.send(self, object obj, flags=0, copy=True)
+
+    `obj` is ``bytes`` or provides buffer interface.
+
+    if `obj` is ``unicode``, raise ``TypeError``
+
+.. py:function:: socket.recv(self, flags=0, copy=True)
+
+    returns ``bytes`` if `copy=True`
+
+    returns ``zmq.Message`` if `copy=False`:
+
+        `message.buffer` is a buffer view of the ``bytes``
+
+        `str(message)` provides the ``bytes``
+
+        `unicode(message)` decodes `message.buffer` with utf-8
+
+.. py:function:: socket.send_string(self, unicode s, flags=0, encoding='utf-8')
+
+    takes a ``unicode`` string `s`, and sends the ``bytes``
+    after encoding without an extra copy, via:
+
+    `socket.send(s.encode(encoding), flags, copy=False)`
+
+.. py:function:: socket.recv_string(self, flags=0, encoding='utf-8')
+
+    always returns ``unicode`` string
+
+    there will be a ``UnicodeError`` if it cannot decode the buffer
+
+    performs non-copying `recv`, and decodes the buffer with `encoding`
+
+.. py:function:: socket.setsockopt(self, opt, optval)
+
+    only accepts ``bytes`` for `optval` (or ``int``, depending on `opt`)
+
+    ``TypeError`` if ``unicode`` or anything else
+
+.. py:function:: socket.getsockopt(self, opt)
+
+    returns ``bytes`` (or ``int``), never ``unicode``
+
+.. py:function:: socket.setsockopt_string(self, opt, unicode optval, encoding='utf-8')
+
+    accepts ``unicode`` string for `optval`
+
+    encodes `optval` with `encoding` before passing the ``bytes`` to
+    `setsockopt`
+
+.. py:function:: socket.getsockopt_string(self, opt, encoding='utf-8')
+
+    always returns ``unicode`` string, after decoding with `encoding`
+
+    note that `zmq.IDENTITY` is the only `sockopt` with a string value
+    that can be queried with `getsockopt`
+
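
The ``socket.<method>_string()`` wrappers described above encode on the way in and decode on the way out, around a core that only handles bytes. The following is a minimal sketch of that pattern in Python 3, using only the standard library. ``BytesOnlyChannel`` is a hypothetical in-memory stand-in invented for illustration; it is not part of pyzmq and does not reproduce pyzmq's implementation.

```python
from collections import deque


class BytesOnlyChannel:
    """Hypothetical in-memory stand-in for a socket (NOT a pyzmq class)."""

    def __init__(self):
        self._queue = deque()

    def send(self, data):
        # Text is rejected outright; bytes-like objects are accepted.
        if isinstance(data, str):
            raise TypeError("text not allowed here, use send_string")
        self._queue.append(bytes(data))

    def recv(self):
        # The bytes-level getter always hands back bytes.
        return self._queue.popleft()

    def send_string(self, s, encoding="utf-8"):
        # Wrapper: encode the text, then delegate to the bytes method.
        if not isinstance(s, str):
            raise TypeError("send_string expects text")
        self.send(s.encode(encoding))

    def recv_string(self, encoding="utf-8"):
        # Wrapper: receive bytes, then decode them back into text.
        return self.recv().decode(encoding)


channel = BytesOnlyChannel()

# Text goes in through the wrapper, but the bytes-level getter sees bytes.
channel.send_string("naïve café")
raw = channel.recv()

# Sending those bytes again and using the paired wrapper restores the text.
channel.send(raw)
text = channel.recv_string()

# Passing text to the bytes-level method fails instead of being coerced.
try:
    channel.send("text")
    rejected = False
except TypeError:
    rejected = True
```

With a real pyzmq socket, the corresponding calls would be ``socket.send_string(s)`` and ``socket.recv_string()``, both of which take an `encoding` keyword that defaults to utf-8, as listed above.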