How to raise UnicodeDecodeError in Python 3

For debugging work, I needed to manually raise UnicodeDecodeError in CPython 3(.4). Its constructor requires 5 arguments:

>>> raise UnicodeDecodeError()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: function takes exactly 5 arguments (0 given)

The docs specify which properties an already instantiated UnicodeError (the base class) has: https://docs.python.org/3/library/exceptions.html#UnicodeError

These are five properties. It is obvious that the constructor needs these five pieces of information. However, setting them requires knowing the expected order, since the constructor does not take keyword arguments. The signature also is not documented when invoking help(UnicodeDecodeError), which suggests that this interface is implemented in C (which makes sense, as it affects the bowels of CPython’s text processing).

So, the only true reference for finding out the expected order of arguments is the C code implementing the constructor. It is defined by the function UnicodeDecodeError_init in the file Objects/exceptions.c in the CPython repository. The essential part are these lines:

if (!PyArg_ParseTuple(args, "O!OnnO!",
     &PyUnicode_Type, &ude->encoding,
     &ude->object,
     &ude->start,
     &ude->end,
     &PyUnicode_Type, &ude->reason)) {
         ude->encoding = ude->object = ude->reason = NULL;
         return -1;
}

That is, the order is the following:

  1. encoding (unicode object, i.e. type str)
  2. object that was attempted to be decoded (here a bytes object makes sense, whereas in fact the requirement just is that this object must provide the Buffer interface)
  3. start (integer)
  4. end (integer)
  5. reason (type str)

Hence, now we know how to artificially raise a UnicodeDecodeError:

>>> o = b'\x00\x00'
>>> raise UnicodeDecodeError('funnycodec', o, 1, 2, 'This is just a fake reason!')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'funnycodec' codec can't decode byte 0x00 in position 1: This is just a fake reason!

Leave a Reply

Your email address will not be published. Required fields are marked *

Human? Please fill this out: * Time limit is exhausted. Please reload CAPTCHA.