While in the Unix world UTF-8 is the de facto standard for terminal input and output encoding, the situation on Windows is a bit more complex. In one respect, Windows is even a step ahead of Unix systems: Unicode code points in command line arguments are supported natively when using cmd.exe or PowerShell. The Win32 API has corresponding functions for retrieving such strings as native Unicode data types.
Python 2(.7), however, does not make use of these functions. Instead, it reads the arguments as byte sequences. Characters that cannot be represented in the system's ANSI code page (7-bit ASCII is always safe, anything beyond that depends on the machine) typically end up as ? in the byte strings in sys.argv.
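The effect can be imitated on any platform. This is only a minimal sketch and not what the C runtime literally does: it assumes cp1252 as the ANSI code page (yours may differ) and uses Python's "replace" error handler as a stand-in for the lossy conversion. The helper name to_ansi_bytes is made up for this illustration.

```python
from __future__ import print_function


def to_ansi_bytes(text, codepage="cp1252"):
    # Hypothetical helper: encode a unicode string to a legacy code page,
    # substituting "?" for every character the code page cannot represent.
    return text.encode(codepage, "replace")


# U+00E9 (e with acute accent) exists in cp1252 and survives as one byte.
print(repr(to_ansi_bytes(u"caf\u00e9")))

# U+263A (the smiley) has no cp1252 byte and is replaced by "?".
print(repr(to_ansi_bytes(u"\u263a")))
```

Once the argument has been reduced to ? there is no way to recover the original character from sys.argv, which is why the arguments have to be fetched through a different route.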
Another issue might be that by default Python does not use UTF-8 for encoding characters in the stdout stream (for me, the default stdout encoding is the more limited code page cp437).
I won't go into much detail here, because there are quite reliable workarounds for both issues. Stdout encoding can be enforced with the PYTHONIOENCODING environment variable. chcp 65001 sets the console code page to a UTF-8-like encoding, so that special characters can be used as command line arguments in a UTF-8-encoded batch file, such as this test.bat:
@chcp 65001 > nul
@set PYTHONIOENCODING=utf-8
python test.py ☺
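Whether PYTHONIOENCODING is actually picked up can be checked separately from the batch file. The following sketch starts a child interpreter with the variable set and asks it for its sys.stdout.encoding. The function name child_stdout_encoding is an assumption for this example. The code is written to run under both Python 2.7 and Python 3.

```python
from __future__ import print_function
import os
import subprocess
import sys


def child_stdout_encoding(io_encoding):
    # Copy the current environment and override only PYTHONIOENCODING.
    env = dict(os.environ)
    env["PYTHONIOENCODING"] = io_encoding
    # The child writes the name of its stdout encoding to the pipe.
    code = "import sys; sys.stdout.write(str(sys.stdout.encoding))"
    out = subprocess.check_output([sys.executable, "-c", code], env=env)
    return out.decode("ascii").strip()


print(child_stdout_encoding("utf-8"))
```

The child's stdout is a pipe here, which is comparable to the redirection into a file used further down: in both cases there is no console whose code page Python could consult, so the environment variable decides.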
This is the Python script test.py for printing information about the retrieved command line arguments:
import sys

# win32_unicode_argv() must be defined in this file; it is shown further below.
sys.argv = win32_unicode_argv()
print repr(sys.argv)
for a in sys.argv:
    print(a.encode(sys.stdout.encoding))
Open a terminal (cmd.exe) and execute
c:\> test.bat > out
Then have a look at the file out, to which we just redirected the stdout stream of the Python script (tell your editor/file viewer to decode the file using UTF-8 and use a font that contains the special glyphs!):
c:\> python test.py ☺
[u'test.py', u'\u263a']
test.py
☺
As you can see, the items in argv are unicode strings. This is the magic performed by the function win32_unicode_argv(), which I will show below. When these unicode strings are encoded to sys.stdout.encoding (which, in fact, is UTF-8 because of the environment variable PYTHONIOENCODING), the special Unicode code point ☺ is encoded properly.
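To make the encoding step concrete, here is a small sketch of what happens to the smiley (U+263A) on its way to stdout. It compares UTF-8 with Python's cp437 codec, the default I mentioned above. The variable names are chosen just for this illustration.

```python
from __future__ import print_function

smiley = u"\u263a"  # WHITE SMILING FACE

# UTF-8 can encode every code point; U+263A becomes a three-byte sequence.
utf8_bytes = smiley.encode("utf-8")
print(repr(utf8_bytes))

# Python's cp437 codec has no byte for U+263A, so a strict encode fails.
try:
    smiley.encode("cp437")
    fits_cp437 = True
except UnicodeEncodeError:
    fits_cp437 = False
print(fits_cp437)
```

The three bytes E2 98 BA are what a hex viewer should show for the last line of the file out.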
All in all, using chcp 65001 + PYTHONIOENCODING="utf-8" + win32_unicode_argv(), we got a well-behaved information stream from the UTF-8-encoded input file test.bat to the UTF-8-encoded output file out.
This is win32_unicode_argv(), which uses the ctypes module to call the Win32 API functions that Windows provides for retrieving command line arguments as native Win32 Unicode strings:
import sys

def win32_unicode_argv():
    # Solution copied from http://stackoverflow.com/a/846931/145400
    from ctypes import POINTER, byref, cdll, c_int, windll
    from ctypes.wintypes import LPCWSTR, LPWSTR

    GetCommandLineW = cdll.kernel32.GetCommandLineW
    GetCommandLineW.argtypes = []
    GetCommandLineW.restype = LPCWSTR

    CommandLineToArgvW = windll.shell32.CommandLineToArgvW
    CommandLineToArgvW.argtypes = [LPCWSTR, POINTER(c_int)]
    CommandLineToArgvW.restype = POINTER(LPWSTR)

    cmd = GetCommandLineW()
    argc = c_int(0)
    argv = CommandLineToArgvW(cmd, byref(argc))
    if argc.value > 0:
        # Remove Python executable and commands if present
        start = argc.value - len(sys.argv)
        return [argv[i] for i in xrange(start, argc.value)]
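The only non-obvious part is the start index. Windows reports the complete command line, including the interpreter and any interpreter options, while sys.argv begins at the script name. Keeping the last len(sys.argv) items therefore drops exactly the leading interpreter part. A platform-independent sketch of that arithmetic follows. The function name strip_interpreter_args and the sample command line are made up for the example.

```python
from __future__ import print_function


def strip_interpreter_args(full_argv, script_argc):
    # full_argv: all arguments as the operating system reports them,
    #            e.g. including "python" and interpreter options.
    # script_argc: len(sys.argv) as seen by the running script.
    start = len(full_argv) - script_argc
    return full_argv[start:]


# Hypothetical command line: python -u test.py <smiley>
full = [u"python", u"-u", u"test.py", u"\u263a"]
print(strip_interpreter_args(full, 2) == [u"test.py", u"\u263a"])
```

This relies on Python not adding or removing items after the script name, which holds for the plain "python script.py args" case used in this post.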
Kudos to http://stackoverflow.com/a/846931/145400.