1 files changed, 152 insertions, 0 deletions
diff --git a/src/ext/libcharsetdetect/README.md b/src/ext/libcharsetdetect/README.md
new file mode 100644
index 0000000..12e368e
--- /dev/null
+++ b/src/ext/libcharsetdetect/README.md
@@ -0,0 +1,152 @@
+# Universal Character Set Detector (UCSD)
+
+A library exposing a C interface and dependency-free interface to the Mozilla C++ UCSD library.
+
+This library provides a highly accurate set of heuristics that attempt to determine the character set used to encode some input text.
+This is extremely useful when your program has to handle an input file which is supplied without any encoding metadata.
+
+Pulls together:
+
+  * A NSPR emulation library (see `nspr-emu/README.md`)
+  * Code written by Colin Snover to provide a command line interface to the library
+  * The UCSD library itself from the Mozilla seamonkey source tree
+
+The UCSD version provided is that present in the Mozilla public repo as of 31/10/2010.
+
+## Building
+
+We have a build system based on CMake, so you will need that installed. That done, simply do this incantation:
+
+    ./configure
+    make
+    sudo make install
+
+This will install the header file `charsetdetect.h` and the UCSD shared library, which you should link against in your compiler.
+
+## API documentation
+
+The library provides an opaque type of character set detectors:
+
+    typedef void* csd_t;
+
+The first thing a client should do is create one of these:
+
+    csd_t csd_open(void);
+
+A `csd_t` created in this fashion must be freed by `csd_close`. If creation fails, `csd_open` returns `(csd_t)-1`.
+
+Now you need to feed some data to the detector:
+
+    int csd_consider(csd_t csd, const char *data, int length);
+
+The meaning of the return code is as follows:
+
+  * Returns 0 if more data is needed to come to a conclusion
+  * Returns a positive number if enough data has been received to detect the character set
+  * Returns a negative number if there is an error
+
+Finally, close the detector to find out what the character set is:
+
+    const char *csd_close(csd_t csd);
+
+The detected character set name is returned as an ASCII string. This function returns `NULL` if detection failed because there was not
+enough data. It is safe to call `csd_close` at any point from creation by `csd_open` to the first call of `csd_close` on that character
+set detector.
+
+## Full example
+
+This is a complete C program that shows how the library can be used to build a simple command-line character set detector:
+
+    #include "charsetdetect.h"
+    #include "stdio.h"
+
+    #define BUFFER_SIZE 4096
+
+    int main(int argc, const char * argv[]) {
+        csd_t csd = csd_open();
+        if (csd == (csd_t)-1) {
+            printf("csd_open failed\n");
+            return 1;
+        }
+    
+        int size;
+        char buf[BUFFER_SIZE] = {0};
+
+        while ((size = fread(buf, 1, sizeof(buf), stdin)) != 0) {
+            int result = csd_consider(csd, buf, size);
+            if (result < 0) {
+                printf("csd_consider failed\n");
+                return 3;
+            } else if (result > 0) {
+                // Already have enough data
+                break;
+            }
+        }
+    
+        const char *result = csd_close(csd);
+        if (result == NULL) {
+            printf("Unknown character set\n");
+            return 2;
+        } else {
+            printf("%s\n", result);
+            return 0;
+        }
+    }
+
+You can compile it and try it (on platforms with GCC) as follows:
+
+    gcc example.c -lcharsetdetect
+    ./a.out < my_test_file.txt
+
+## Known character sets
+
+The list of possible character sets that can be returned from the library as of the most recent update are:
+
+    Big5
+    EUC-JP
+    EUC-KR
+    GB18030
+    gb18030
+    HZ-GB-2312
+    IBM855
+    IBM866
+    ISO-2022-CN
+    ISO-2022-JP
+    ISO-2022-KR
+    ISO-8859-2
+    ISO-8859-5
+    ISO-8859-7
+    ISO-8859-8
+    KOI8-R
+    Shift_JIS
+    TIS-620
+    UTF-8
+    UTF-16BE
+    UTF-16LE
+    UTF-32BE
+    UTF-32LE
+    windows-1250
+    windows-1251
+    windows-1252
+    windows-1253
+    windows-1255
+    x-euc-tw
+    X-ISO-10646-UCS-4-2143
+    X-ISO-10646-UCS-4-3412
+    x-mac-cyrillic
+
+We believe this list to be exhaustive. Future updates to the UCSD library may add more alternatives, but we will endeavour to keep
+this list current.
+
+Notice that you may get both capitalisations of `GB18030`. For this reason (and to be future-proof against any future behaviour
+like this for newly-added character sets) we recommend that you compare character set names case insensitively.
+
+## Licensing
+
+The files `libcharsetdetect.{cpp,h}` are (c) 2010 Colin Snover and released under an MIT license.
+
+The UCSD is (c) mozilla.org and tri-licensed under MPL 1.1/GPL 2.0/LGPL 2.1.
+
+We incorporate header files from the NSPR emulation library, which is LGPL licensed.
+
+Thus the resulting artifact is LGPL licensed (I think).
+\ No newline at end of file