Python 2 on Windows: how to read command line arguments containing Unicode code points

While in the Unix world UTF-8 is the de-facto standard for terminal input and output encoding, the situation on Windows is a bit more complex. In general, Windows is even a step ahead compared to Unix systems: Unicode code points in command line arguments are supported natively when using cmd.exe or the Powershell. The Win 32 API has corresponding functions for retrieving such strings as native Unicode data types.

Python 2(.7), however, does not make use of these functions. Instead, it tries to read arguments as byte sequences. Characters not included in the 7-bit ASCII range end up as ? in the byte strings in sys.argv.

Another issue might be that by default Python does not use UTF-8 for encoding characters in the stdout stream (for me, the default stdout encoding is the more limited code page cp437).

I don’t want to lose too many words now, there are quite reliable workarounds for both issues. Stdout encoding can be enforced with the PYTHONIOENCODING environment variable. chcp 65001 sets the console code page to an UTF-8-alike encoding, so that special characters can be used as command line arguments in an UTF-8-encoded batch file, such as this test.bat:

@chcp 65001 > nul
@set PYTHONIOENCODING=utf-8
python test.py ☺

This is the Python script test.py for printing information about the retrieved command line arguments:

import sys
sys.argv = win32_unicode_argv()
print repr(sys.argv)
for a in sys.argv:
    print(a.encode(sys.stdout.encoding))

Open a terminal (cmd.exe) and execute

c:\> test.bat > out

Then have a look into the file out in which we just redirected the stdout stream of the Python script (tell your editor/file viewer to decode the file using UTF-8 and use a proper font having special glyphs!):

c:\> python test.py ☺ 
[u'test.py', u'\u263a']
test.py
☺

As you can see, the items in argv are unicode strings. This is the magic performed by the function win32_unicode_argv() which I will show below. When encoding these unicode strings to sys.stdout.encoding (which, in fact, is UTF-8 as of the environment variable PYTHONIOENCODING), the special Unicode code point ☺ becomes properly encoded.

All in all, using chcp 65001 + PYTHONIOENCODING="utf-8" + win32_unicode_argv(), we got a well-behaved information stream from the UTF-8-encoded input file test.bat to the UTF-8-encoded output file out.

This is win32_unicode_argv() which is making use of the ctypes module for using the Win 32 API functions that are provided by Windows for retrieving command line arguments as native Win 32 Unicode strings:

import sys
def win32_unicode_argv():
    # Solution copied from http://stackoverflow.com/a/846931/145400
 
    from ctypes import POINTER, byref, cdll, c_int, windll
    from ctypes.wintypes import LPCWSTR, LPWSTR
 
    GetCommandLineW = cdll.kernel32.GetCommandLineW
    GetCommandLineW.argtypes = []
    GetCommandLineW.restype = LPCWSTR
 
    CommandLineToArgvW = windll.shell32.CommandLineToArgvW
    CommandLineToArgvW.argtypes = [LPCWSTR, POINTER(c_int)]
    CommandLineToArgvW.restype = POINTER(LPWSTR)
 
    cmd = GetCommandLineW()
    argc = c_int(0)
    argv = CommandLineToArgvW(cmd, byref(argc))
    if argc.value > 0:
        # Remove Python executable and commands if present
        start = argc.value - len(sys.argv)
        return [argv[i] for i in
                xrange(start, argc.value)]

Kudos to http://stackoverflow.com/a/846931/145400.

Leave a Reply

Your email address will not be published. Required fields are marked *

Human? Please fill this out: * Time limit is exhausted. Please reload CAPTCHA.

  1. Augusto Esteves Avatar
    Augusto Esteves

    I am struggling to correctly print file names passed on through sys.argv

    My bat file (added to shell:sendto):

    @echo off
    @chcp 65001 > nul
    @set PYTHONIOENCODING=utf-8
    python “D:DropboxPythonprint_file_name.py” %1
    pause

    If the file name is ‘canção.pdf’ it manages to print it correctly with your example, but not, e.g., ‘조선.pdf’ nor ‘मान.pdf’