python-cjson package and non-ASCII data

In my current AJAX project I use mod-python on the server, and data is transferred using JSON encoding. There are several Python modules you can choose from for JSON serialization. Some of them are reviewed here. After some reading and testing, I picked python-cjson since I found it fast and reliable. I didn’t have any problem with it until I started sending and receiving data that was not 7-bit ASCII but belonged to the extended ASCII set. Strange characters started to appear in the database and in the browser interface.
I tracked the problem down to the fact that python-cjson expects its input to be either 7-bit ASCII or Python's internal Unicode representation (a unicode object).
This means that if you receive from the net a string encoded in UTF-8 that contains characters outside the 7-bit ASCII range, you get either an error or a wrong character translation. The same thing happens if you read a string from, say, a database and try to encode it.
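The "wrong character translation" can be reproduced without cjson at all. The following is a minimal sketch of the general mechanism, not of cjson's internals: it is written for Python 3, where bytes and text are separate types (the project described here ran on Python 2), and the sample word is made up for illustration.

```python
# Hypothetical sample: a word containing one non-ASCII character.
text = u"caf\u00e9"

# On the wire, UTF-8 encodes the accented letter as two bytes.
raw = text.encode("utf-8")

# Reading those bytes with a single-byte codec yields one character per
# byte, which is one way strange characters can end up in a database.
garbled = raw.decode("latin-1")

# Decoding with the codec that was actually used restores the text.
restored = raw.decode("utf-8")

print(repr(raw), repr(garbled), repr(restored))
```

The garbled string is one character longer than the original, because the two UTF-8 bytes of the accented letter were each treated as a separate character.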

After some tests and some mail exchanges with python-cjson’s author, here is what I have learned:

  • decoding: before calling cjson.decode(), you must convert your data to Python's internal Unicode representation. For instance, if you expect your data to be UTF-8, this is what you should do (your_data being the raw string you received):
    cjson.decode(your_data.decode('utf-8'))
  • encoding: the approach is similar to the first one; the only problem is that you generally don’t have a simple string to encode but a complex structure. Clearly, converting every string of the structure before feeding it to python-cjson is not a good solution. A better one is to convert your data to Python unicode as you read it into your program, for instance from a database or a file. In my case, where data is read from a PostgreSQL database using psycopg2, it was not difficult at all. Psycopg has an option to convert character representations when reading from or writing to a database. It consists of 2 lines of code as I found out here:
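    import psycopg2.extensions
    # return text columns as Python unicode objects
    psycopg2.extensions.register_type(psycopg2.extensions.UNICODE)
    # connection is an already open psycopg2 connection
    connection.set_client_encoding('UTF8')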

    assuming that your db is UTF-8 encoded.

    This way cjson.encode() will be happy and serialize correctly.
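The two points above amount to one rule: convert bytes to text at the edges of the program, and hand the serializer only text. Here is a minimal sketch of that round trip. It assumes Python 3 and uses the standard library json module as a stand-in for cjson; the function names and the sample payload are hypothetical.

```python
import json  # stand-in for cjson in this sketch


def decode_request(raw_bytes, encoding="utf-8"):
    """Convert incoming bytes to text first, then parse the JSON."""
    return json.loads(raw_bytes.decode(encoding))


def encode_response(data, encoding="utf-8"):
    """Serialize a structure whose strings are already text,
    then encode the result for the wire."""
    return json.dumps(data).encode(encoding)


# Hypothetical payload as it might arrive from the browser.
incoming = u'{"city": "Z\u00fcrich"}'.encode("utf-8")

record = decode_request(incoming)
outgoing = encode_response(record)

# The non-ASCII character survives the full round trip.
print(decode_request(outgoing) == record)
```

Because the conversion happens once on the way in and once on the way out, no code in between has to walk the structure and fix individual strings.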
