Diffstat (limited to 'src/ext/libcharsetdetect/README.md')
-rw-r--r-- src/ext/libcharsetdetect/README.md 152
1 files changed, 152 insertions, 0 deletions
diff --git a/src/ext/libcharsetdetect/README.md b/src/ext/libcharsetdetect/README.md
new file mode 100644
index 0000000..12e368e
--- /dev/null
+++ b/src/ext/libcharsetdetect/README.md
@@ -0,0 +1,152 @@
+# Universal Character Set Detector (UCSD)
+
+A library exposing a dependency-free C interface to the Mozilla C++ UCSD library.
+
+This library provides a highly accurate set of heuristics that attempt to determine the character set used to encode some input text.
+This is extremely useful when your program has to handle an input file which is supplied without any encoding metadata.
+
+Pulls together:
+
+ * An NSPR emulation library (see `nspr-emu/README.md`)
+ * Code written by Colin Snover to provide a command line interface to the library
+ * The UCSD library itself from the Mozilla SeaMonkey source tree
+
+The UCSD version provided is the one present in the Mozilla public repo as of 31 October 2010.
+
+## Building
+
+The build system is based on CMake, so you will need CMake installed. Then run:
+
+    ./configure
+    make
+    sudo make install
+
+This installs the header file `charsetdetect.h` and the UCSD shared library, which you should link your program against.
+
+## API documentation
+
+The library provides an opaque type representing a character set detector:
+
+    typedef void* csd_t;
+
+The first thing a client should do is create one of these:
+
+    csd_t csd_open(void);
+
+A `csd_t` created in this fashion must be freed by `csd_close`. If creation fails, `csd_open` returns `(csd_t)-1`.
+
+Now you need to feed some data to the detector:
+
+    int csd_consider(csd_t csd, const char *data, int length);
+
+The meaning of the return code is as follows:
+
+ * Returns 0 if more data is needed to come to a conclusion
+ * Returns a positive number if enough data has been received to detect the character set
+ * Returns a negative number if there is an error
+
+Finally, close the detector to find out what the character set is:
+
+    const char *csd_close(csd_t csd);
+
+The detected character set name is returned as an ASCII string. This function returns `NULL` if detection failed because there was not
+enough data. It is safe to call `csd_close` at any point after the detector is created by `csd_open`, but only once per detector:
+after the first call, the detector must not be passed to `csd_close` again.
+
+## Full example
+
+This is a complete C program that shows how the library can be used to build a simple command-line character set detector:
+
+    #include "charsetdetect.h"
+    #include <stdio.h>
+
+    #define BUFFER_SIZE 4096
+
+    int main(void) {
+        csd_t csd = csd_open();
+        if (csd == (csd_t)-1) {
+            fprintf(stderr, "csd_open failed\n");
+            return 1;
+        }
+
+        size_t size;
+        char buf[BUFFER_SIZE];
+
+        // Feed stdin to the detector one buffer at a time
+        while ((size = fread(buf, 1, sizeof(buf), stdin)) != 0) {
+            int result = csd_consider(csd, buf, (int)size);
+            if (result < 0) {
+                fprintf(stderr, "csd_consider failed\n");
+                csd_close(csd); // Free the detector before bailing out
+                return 3;
+            } else if (result > 0) {
+                // Already have enough data
+                break;
+            }
+        }
+
+        // Closing the detector frees it and returns the detected name
+        const char *result = csd_close(csd);
+        if (result == NULL) {
+            printf("Unknown character set\n");
+            return 2;
+        } else {
+            printf("%s\n", result);
+            return 0;
+        }
+    }
+
+You can compile it and try it (on platforms with GCC) as follows:
+
+    gcc example.c -lcharsetdetect
+    ./a.out < my_test_file.txt
+
+## Known character sets
+
+As of the most recent update, the library can return the following character sets:
+
+    Big5
+    EUC-JP
+    EUC-KR
+    GB18030
+    gb18030
+    HZ-GB-2312
+    IBM855
+    IBM866
+    ISO-2022-CN
+    ISO-2022-JP
+    ISO-2022-KR
+    ISO-8859-2
+    ISO-8859-5
+    ISO-8859-7
+    ISO-8859-8
+    KOI8-R
+    Shift_JIS
+    TIS-620
+    UTF-8
+    UTF-16BE
+    UTF-16LE
+    UTF-32BE
+    UTF-32LE
+    windows-1250
+    windows-1251
+    windows-1252
+    windows-1253
+    windows-1255
+    x-euc-tw
+    X-ISO-10646-UCS-4-2143
+    X-ISO-10646-UCS-4-3412
+    x-mac-cyrillic
+
+We believe this list to be exhaustive. Future updates to the UCSD library may add more alternatives, but we will endeavour to keep
+this list current.
+
+Notice that you may get both capitalisations of `GB18030`. For this reason (and to guard against similar behaviour
+in newly-added character sets) we recommend that you compare character set names case-insensitively.
+
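The case-insensitive comparison recommended above could be sketched as follows. This is only an illustration: `charset_name_equals` is a hypothetical helper, not part of `charsetdetect.h`, and it assumes the names are plain ASCII strings, as `csd_close` is documented to return.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper (not provided by the library): returns 1 if the two
 * character set names are equal ignoring case, 0 otherwise. A NULL name,
 * such as the result of a failed detection, never matches anything. */
int charset_name_equals(const char *a, const char *b) {
    if (a == NULL || b == NULL) {
        return 0;
    }
    while (*a != '\0' && *b != '\0') {
        /* Cast to unsigned char: passing a negative char to tolower() is undefined */
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b)) {
            return 0;
        }
        a++;
        b++;
    }
    /* Equal only if both names ended at the same position */
    return *a == '\0' && *b == '\0';
}
```

With a helper like this, a check such as `charset_name_equals(name, "gb18030")` accepts either capitalisation in the list above, where `name` is the string returned by `csd_close`.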
+## Licensing
+
+The files `libcharsetdetect.{cpp,h}` are (c) 2010 Colin Snover and released under an MIT license.
+
+The UCSD is (c) mozilla.org and tri-licensed under MPL 1.1/GPL 2.0/LGPL 2.1.
+
+We incorporate header files from the NSPR emulation library, which is LGPL licensed.
+
+Thus the resulting artifact is LGPL licensed (I think).
\ No newline at end of file