From cce638a8adf4e045ca5505afea4bda57753c31dd Mon Sep 17 00:00:00 2001
From: Micah Anderson <micah@riseup.net>
Date: Mon, 11 Aug 2014 16:33:29 -0400
Subject: initial import of debian package

---
 docs/source/unicode.rst | 188 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 188 insertions(+)
 create mode 100644 docs/source/unicode.rst

(limited to 'docs/source/unicode.rst')

diff --git a/docs/source/unicode.rst b/docs/source/unicode.rst
new file mode 100644
index 0000000..a0c7878
--- /dev/null
+++ b/docs/source/unicode.rst
@@ -0,0 +1,188 @@
+.. PyZMQ Unicode doc, by Min Ragan-Kelley, 2010
+
+.. _unicode:
+
+PyZMQ and Unicode
+=================
+
+PyZMQ is built with an eye towards an easy transition to Python 3, and part of
+that is dealing with unicode strings. This is an overview of some of what we
+found, and what it means for PyZMQ.
+
+First, Unicode in Python 2 and 3
+********************************
+
+In Python < 3, a ``str`` object is really a C string with some sugar - a
+specific series of bytes with some fun methods like ``endswith()`` and
+``split()``. In 2.0, the ``unicode`` object was added, which handles different
+methods of encoding. In Python 3, however, the meaning of ``str`` changes. A
+``str`` in Python 3 is a full unicode object, with encoding and everything. If
+you want a C string with some sugar, there is a new object called ``bytes``,
+that behaves much like the 2.x ``str``. The idea is that for a user, a string is
+a series of *characters*, not a series of bytes. For simple ascii, the two are
+interchangeable, but if you consider accents and non-Latin characters, then the
+character meaning of byte sequences can be ambiguous, since it depends on the
+encoding scheme. The Python developers decided to avoid the ambiguity by forcing
+users who want the actual bytes to specify the encoding every time they convert
+a string to bytes. That way, users are aware of the difference between a series
+of bytes and a collection of characters, and don't confuse the two, as happens
+in Python 2.x.
+
+The problems (on both sides) come from the fact that regardless of the language
+design, users are mostly going to use ``str`` objects to represent collections
+of characters, and the behavior of that object is dramatically different in
+certain aspects between the 2.x ``bytes`` approach and the 3.x ``unicode``
+approach. The ``unicode`` approach has the advantage of removing byte ambiguity
+- it's a list of characters, not bytes. However, if you really do want the
+bytes, it's very inefficient to get them. The ``bytes`` approach has the
+advantage of efficiency. A ``bytes`` object really is just a char* pointer with
+some methods to be used on it, so interacting with C code, etc. is highly
+efficient and straightforward. However, understanding a
+bytes object as a string with extended characters introduces ambiguity and
+possibly confusion.
+
+To avoid ambiguity, hereafter we will refer to encoded C arrays as 'bytes' and
+abstract unicode objects as 'strings'.
+
+Unicode Buffers
+---------------
+
+Since unicode objects have a wide range of representations, they are not stored
+as the bytes according to their encoding, but rather in a format called UCS (an
+older fixed-width Unicode format). On some platforms (OS X, Windows), the storage
+is UCS-2, which is 2 bytes per character. On most \*ix systems, it is UCS-4, or
+4 bytes per character. The contents of the *buffer* of a ``unicode`` object are
+not encoding dependent (always UCS-2 or UCS-4), but they are *platform*
+dependent. As a result of this, and the further insistence on not interpreting
+``unicode`` objects as bytes without specifying encoding, ``str`` objects in
+Python 3 don't even provide the buffer interface. You simply cannot get the raw
+bytes of a ``unicode`` object without specifying the encoding for the bytes. In
+Python 2.x, you can get to the raw buffer, but the platform dependence and the
+fact that the encoding of the buffer is not the encoding of the object makes it
+very confusing, so this is probably a good move.
+
+The efficiency problem here comes from the fact that simple ascii strings are 4x
+as big in memory as they need to be (on most Linux, 2x on other platforms).
+Also, to translate to/from C code that works with char*, you always have to copy
+data and encode/decode the bytes. This really is horribly inefficient from a
+memory standpoint. Essentially, where memory efficiency matters to you, you
+should never ever use strings; use bytes. The problem is that users will almost
+always use ``str``, and in 2.x they are efficient, but in 3.x they are not. We
+want to make sure that we don't help the user make this mistake, so we ensure
+that zmq methods don't try to hide what strings really are.
+
+What This Means for PyZMQ
+*************************
+
+PyZMQ is a wrapper for a C library, so it really should use bytes, since a
+string is not a simple wrapper for ``char *`` like it used to be, but an
+abstract sequence of characters. The representations of bytes in Python are
+either the ``bytes`` object itself, or any object that provides the buffer
+interface (aka memoryview). In Python 2.x, unicode objects do provide the buffer
+interface, but in Python 3 they do not, so wherever pyzmq requires bytes, we
+specifically reject unicode objects.
+
+The relevant methods here are ``socket.send/recv``, ``socket.get/setsockopt``,
+``socket.bind/connect``. The important consideration for send/recv and
+set/getsockopt is that when you put in something, you really should get the same
+object back with its partner method. We can easily coerce unicode objects to
+bytes with send/setsockopt, but the problem is that the pair method of
+recv/getsockopt will always be bytes, and there should be symmetry. We certainly
+shouldn't try to always decode on the retrieval side, because if users just want
+bytes, then we are potentially using up enormous amounts of memory
+unnecessarily, due to copying and the larger memory footprint of unicode strings.
+
+Still, we recognize the fact that users will quite frequently have unicode
+strings that they want to send, so we have added ``socket.<method>_string()``
+wrappers. These methods simply wrap their bytes counterpart by encoding
+to/decoding from bytes around them, and they all take an `encoding` keyword
+argument that defaults to utf-8. Since encoding and decoding are necessary to
+translate between unicode and bytes, it is impossible to perform non-copying
+actions with these wrappers.
+
+``socket.bind/connect`` methods are different from these, in that they are
+strictly setters and there is no corresponding getter method. As a result, we
+feel that we can safely coerce unicode objects to bytes (always to utf-8) in
+these methods.
+
+.. note::
+
+    For cross-language symmetry (including Python 3), the ``_unicode`` methods
+    are now ``_string``. Many languages have a notion of native strings, and
+    the use of ``_unicode`` was wedded too closely to the name of such objects
+    in Python 2.  For the time being, anywhere you see ``_string``, ``_unicode``
+    also works, and is the only option in pyzmq ≤ 2.1.11.
+
+
+The Methods
+-----------
+
+Overview of the relevant methods:
+
+.. py:function::    socket.bind(self, addr)
+    
+        `addr` is ``bytes`` or ``unicode``. If ``unicode``, 
+        encoded to utf-8 ``bytes``
+
+.. py:function::    socket.connect(self, addr)
+
+        `addr` is ``bytes`` or ``unicode``. If ``unicode``, 
+        encoded to utf-8 ``bytes``
+
+.. py:function::    socket.send(self, object obj, flags=0, copy=True)
+
+        `obj` is ``bytes`` or provides buffer interface. 
+        
+        if `obj` is ``unicode``, raise ``TypeError``
+
+.. py:function::    socket.recv(self, flags=0, copy=True)
+
+        returns ``bytes`` if `copy=True`
+        
+        returns ``zmq.Message`` if `copy=False`:
+        
+            `message.buffer` is a buffer view of the ``bytes``
+            
+            `str(message)` provides the ``bytes``
+            
+            `unicode(message)` decodes `message.buffer` with utf-8
+    
+.. py:function::    socket.send_string(self, unicode s, flags=0, encoding='utf-8')
+
+        takes a ``unicode`` string `s`, and sends the ``bytes`` 
+        after encoding without an extra copy, via:
+        
+        `socket.send(s.encode(encoding), flags, copy=False)`
+    
+.. py:function::    socket.recv_string(self, flags=0, encoding='utf-8')
+
+        always returns ``unicode`` string
+        
+        there will be a ``UnicodeError`` if it cannot decode the buffer
+        
+        performs non-copying `recv`, and decodes the buffer with `encoding`
+    
+.. py:function::    socket.setsockopt(self, opt, optval)
+
+        only accepts ``bytes``  for `optval` (or ``int``, depending on `opt`)
+        
+        ``TypeError`` if ``unicode`` or anything else
+    
+.. py:function::    socket.getsockopt(self, opt)
+
+        returns ``bytes`` (or ``int``), never ``unicode``
+    
+.. py:function::    socket.setsockopt_string(self, opt, unicode optval, encoding='utf-8')
+
+        accepts ``unicode`` string for `optval`
+        
+        encodes `optval` with `encoding` before passing the ``bytes`` to 
+        `setsockopt`
+    
+.. py:function::    socket.getsockopt_string(self, opt, encoding='utf-8')
+
+        always returns ``unicode`` string, after decoding with `encoding`
+        
+        note that `zmq.IDENTITY` is the only `sockopt` with a string value 
+        that can be queried with `getsockopt`
+
-- 
cgit v1.2.3