Category Archives: System programming download with wget

Downloading from with premium credentials through the command line is possible using standard tools such as wget or curl. However, there is no official API and the exact method required depends on the mechanism implemented by the website. Finding these implementation details requires a little amount reverse engineering.

Here I share a small shell script that should work on all POSIX-compliant platforms (e.g. Mac or Linux). The method is based on current behavior of the website. There are no special tools involved, just wget, grep, sed, mktemp.

(The solutions I found on the web did not work (anymore) and/or were suspiciously wrong.)


Copy the script content below, define username and password, and save the script as, for instance, Then, invoke the script like so:

$ /bin/sh urls.txt

The file urls.txt should contain one URL per line, such as in this example:


This paragraph is just for the curious ones. The script first POSTs your credentials to and stores the resulting authentication cookie in a file. This authentication cookie is then used for retrieving the website corresponding to an file. That website contains a temporarily valid download URL corresponding to the file. Using grep and sed, the HTML code is filtered for this URL. The payload data transfer is triggered by firing a POST request with empty body against this URL (cookie not needed). Files are downloaded to the current working directory. All intermediate data is stored in a temporary directory. That directory is automatically deleted upon script exit (no data is leaked, unless the script is terminated with SIGKILL).

The script

# Copyright 2015 Jan-Philip Gehrcke,
# See
if [ "$#" -ne 1 ]; then
    echo "Missing argument: URLs file (containing one URL per line)." >&2
    exit 1
if [ ! -r "${URLSFILE}" ]; then
    echo "Cannot read URLs file ${URLSFILE}. Exit." >&2
    exit 1
if [ ! -s "${URLSFILE}" ]; then
    echo "URLs file is empty. Exit." >&2
    exit 1
TMPDIR="$(mktemp -d)"
# Install trap that removes the temporary directory recursively
# upon exit (except for when this program retrieves SIGKILL).
trap 'rm -rf "$TMPDIR"' EXIT
echo "Temporary directory: ${TMPDIR}"
echo "Log in via POST request to ${LOGINURL}, save cookies."
wget --save-cookies=${COOKIESFILE} --server-response \
    --output-document ${LOGINRESPFILE} \
    --post-data="id=${USERNAME}&pw=${PASSWORD}" \
# Status code is 200 even if login failed.
# Uploaded sends a '{"err":"User and password do not match!"}'-like response
# body in case of error.
echo "Verify that login response is empty."
# Response is more than 0 bytes in case of login error.
if [ -s "${LOGINRESPFILE}" ]; then
    echo "Login response larger than 0 bytes. Print response and exit." >&2
    cat "${LOGINRESPFILE}"
    exit 1
# Zero response size does not necessarily imply successful login.
# Wget adds three commented lines to the cookies file by default, so
# set cookies should result in more than three lines in this file.
echo "${COOKIESFILELINES} lines in cookies file found."
if [ "${COOKIESFILELINES}" -lt "4" ]; then
    echo "Expected >3 lines in cookies file. Exit.". >&2
    exit 1
echo "Process URLs."
# Assume that login worked. Iterate through URLs.
while read CURRENTURL; do
    if [ "x$CURRENTURL" = "x" ]; then
        # Skip empty lines.
    echo -e "\n\n"
    TMPFILE="$(mktemp --tmpdir=${TMPDIR} response.html.XXXX)"
    echo "GET ${CURRENTURL} (use auth cookie), store response."
    wget --no-verbose --load-cookies=${COOKIESFILE} \
        --output-document ${TMPFILE} ${CURRENTURL}
    if [ ! -s "${TMPFILE}" ]; then
        echo "No HTML response: ${TMPFILE} is zero size. Skip processing."
    # Extract (temporarily valid) download URL from HTML.
    LINEOFINTEREST="$(grep post ${TMPFILE} | grep action | grep uploaded)"
    # Match entire line, include space after action="bla" , replace
    # entire line with first group, which is bla.
    DLURL=$(echo $LINEOFINTEREST | sed 's/.*action="\(.\+\)" .*/\1/')
    echo "Extracted download URL: ${DLURL}"
    # This file contains account details, so delete as soon as not required
    # anymore.
    rm -f "${TMPFILE}"
    echo "POST to URL w/o data. Response is file. Get filename from header."
    # --content-disposition should extract the proper filename.
    wget --content-disposition --post-data='' "${DLURL}"
done < "${URLSFILE}"

Python 2 on Windows: how to read command line arguments containing Unicode code points

While in the Unix world UTF-8 is the de-facto standard for terminal input and output encoding, the situation on Windows is a bit more complex. In general, Windows is even a step ahead compared to Unix systems: Unicode code points in command line arguments are supported natively when using cmd.exe or the Powershell. The Win 32 API has corresponding functions for retrieving such strings as native Unicode data types.

Python 2(.7), however, does not make use of these functions. Instead, it tries to read arguments as byte sequences. Characters not included in the 7-bit ASCII range end up as ? in the byte strings in sys.argv.

Another issue might be that by default Python does not use UTF-8 for encoding characters in the stdout stream (for me, the default stdout encoding is the more limited code page cp437).

I don’t want to lose too many words now, there are quite reliable workarounds for both issues. Stdout encoding can be enforced with the PYTHONIOENCODING environment variable. chcp 65001 sets the console code page to an UTF-8-alike encoding, so that special characters can be used as command line arguments in an UTF-8-encoded batch file, such as this test.bat:

@chcp 65001 > nul
python ☺

This is the Python script for printing information about the retrieved command line arguments:

import sys
sys.argv = win32_unicode_argv()
print repr(sys.argv)
for a in sys.argv:

Open a terminal (cmd.exe) and execute

c:\> test.bat > out

Then have a look into the file out in which we just redirected the stdout stream of the Python script (tell your editor/file viewer to decode the file using UTF-8 and use a proper font having special glyphs!):

c:\> python ☺ 
[u'', u'\u263a']

As you can see, the items in argv are unicode strings. This is the magic performed by the function win32_unicode_argv() which I will show below. When encoding these unicode strings to sys.stdout.encoding (which, in fact, is UTF-8 as of the environment variable PYTHONIOENCODING), the special Unicode code point ☺ becomes properly encoded.

All in all, using chcp 65001 + PYTHONIOENCODING="utf-8" + win32_unicode_argv(), we got a well-behaved information stream from the UTF-8-encoded input file test.bat to the UTF-8-encoded output file out.

This is win32_unicode_argv() which is making use of the ctypes module for using the Win 32 API functions that are provided by Windows for retrieving command line arguments as native Win 32 Unicode strings:

import sys
def win32_unicode_argv():
    # Solution copied from
    from ctypes import POINTER, byref, cdll, c_int, windll
    from ctypes.wintypes import LPCWSTR, LPWSTR
    GetCommandLineW = cdll.kernel32.GetCommandLineW
    GetCommandLineW.argtypes = []
    GetCommandLineW.restype = LPCWSTR
    CommandLineToArgvW = windll.shell32.CommandLineToArgvW
    CommandLineToArgvW.argtypes = [LPCWSTR, POINTER(c_int)]
    CommandLineToArgvW.restype = POINTER(LPWSTR)
    cmd = GetCommandLineW()
    argc = c_int(0)
    argv = CommandLineToArgvW(cmd, byref(argc))
    if argc.value > 0:
        # Remove Python executable and commands if present
        start = argc.value - len(sys.argv)
        return [argv[i] for i in
                xrange(start, argc.value)]

Kudos to

A command line argument is raw binary data. It comes with limitations and needs interpretation.

How are command line arguments interpreted? Can arbitrary data be exchanged between the calling program on the one hand and the called program on the other hand? Some might ask: are command line interfaces “unicode-aware”?

Command line interfaces comprise an essential class of interfaces used in system architecture and the above questions deserve precise answers. In this article I try to clarify why command line arguments are nothing but raw byte sequences that deserve proper interpretation in the receiving program. To some degree, the article dives into the topic of character encoding. Towards the end, I provide simple and expressive code examples based on bash and Python. Please note that the article only applies to Unix-like systems. Certain concepts are also true for Windows, but the main difference is that Windows actually has a fully unicode-aware command line argument API (but not all programs make use of it, such as Python 2) while Unix-like systems don’t.

Program invocation: behind the scenes it is always execve

On (largely) POSIX-compliant operating systems (e.g. Mac OS, Linuxes including your Android phone, all BSDs), all program invocation scenarios have one system call in common. Eventually, the system call execve() is the entry point for all program invocations on these platforms. It instructs the operating system to run the loader, which prepares the new program for execution and eventually brings it into running state (and leaves it to itself). One argument provided to execve() is a character string — a file system path pointing to the executable file on disk. One task of the loader is to read the program’s machine code from that file and place it into memory.

argv: nothing but raw binary data

Another essential step the loader performs before triggering the actual invocation of the program is to copy “the command line arguments” on the stack of the new program. These arguments were provided to the loader via the argv argument to execve() — argv means argument vector. Simply spoken, this is a set of strings. More precisely, each of these strings is a null-terminated C char array.

One could say that each element in a C char array is a character. A character, however, has quite an abstract meaning. The greek Σ is a character, right? In times of real character abstractions, the Unicode code points, we should call each element in a C char array what it is: it is a byte of raw data. Each element in such an array stores one byte of information. In a null-terminated char array, each byte may assume all values between 0000 0001 (x01) and 1111 1111 (xFF). The first byte with the value 0000 0000 (x00) terminates the “string”.

In essence, the data in argv (which by itself is a pointer to an array of pointers to type char arrays) as provided to execve() so to say takes a long journey through the kernel, the launcher, and finally ends up as second argument to the main() function in the new program (the first argument is the argument count). That’s why, when you write a C program, you usually use the following signature for the main function: main(argc, argv).

An argument may contain arbitrary binary data, with certain limitations

  • A command line argument is nothing but a sequence of bytes. These bytes are raw binary data that may mean anything. It is up to the retrieving program to make sense of these bytes (to decode them into something meaningful).
  • Any kind of binary data can be provided within the byte sequence of a command line argument. However, there is one important exception: the x00 byte cannot be included in such a sequence. It always terminates the byte sequence. If x00 is the first or the only byte in the sequence, then the sequence is considered empty.
  • Since argv data is initially and entirely written to the stack of the new program, the total amount of data that may be provided is limited by the operating system. These limits are defined in the system header files. xargs --show-limits can be used to evaluate these limits in a convenient way:
    $ xargs --show-limits
    Your environment variables take up 4478 bytes
    POSIX upper limit on argument length (this system): 2090626
    POSIX smallest allowable upper limit on argument length (all systems): 4096
    Maximum length of command we could actually use: 2086148
    Size of command buffer we are actually using: 131072

    The value “Maximum length of command we could actually use” is about 2 MB (this holds true for my machine and operating system and may be entirely different in other environments).

Provide argument data to execve() and read it in the new program: in theory

The minimal example for demonstrating the behavior explained above would require two compiled C programs. The first, the recipient, would have a main(argc, argv) function which evaluates the contents of argv and prints them in some human readable form to stdout (in Hex representation, for example). The second program, the sender, would

  • 1) prepare the arguments by setting up certain byte sequences (pointers to type char arrays).
  • 2) call one of the exec*() system calls (that actually wrap the execve() system call). It would provide the path to the compiled recipient program, and argv — a pointer to an array of pointers to type char arrays: the arguments.
  • 3) upon execution of the execve() system call the calling process (the sender) becomes replaced by the new program, which is now the receiver. The operating system launcher takes care of copying the argv data to the stack of the new program.
  • 4) The receiver, compiled against (g)libc, goes through the _start() function (provided by (g)libc) and eventually executes its main(argc, argv) function, which evaluates the byte sequences that we call command line arguments.

argv programming interface? We want an actual command line interface!

Knowing how the above works is useful in order to understand that, internally, command line arguments are just a portion of memory copied to the stack of the main thread of the new program. You might have noticed that in this picture, however, there is no actual command line involved.

So far, we have discussed a programming interface provided by the operating system that enables us to use argv in order to bootstrap a newly invoked program with some input data that is not coming from any file, socket, or other data sources.

The concept of argv quickly translates to the concept of command line arguments. A command line interface is something that enables us to call programs in the established program arg1 arg2 arg3 ... fashion. Right, that is one of the main features a shell provides! This is what happens behind the scenes: the shell translates the arguments provided on the command line to a set of C char arrays, spawns a child process and eventually calls the new program via execve().

In other words, the shell program takes parts of the user input on the command line and translates these parts to C char arrays that it later provides to execve(). Hence, it does all the things that our hypothetical sender program from above would have done (and more).

Provide command line arguments and read them in the new program: simple practice, yei!

An example for a shell is bash. An example for a ‘new program’ is Python (2, in this case). Python 2 is a useful tool in this case, because (in contrast to Python 3) it provides raw access to the byte sequences provided via argv. Consider this example program

import sys
for a in sys.argv[1:]:
    print repr(a)

We type python b in our command line interface provided by the shell. It makes the shell spawn the Python executable in a child process. This program consumes the first command line argument, which is the path to our little Python script. The remaining arguments (one in this case) are left for our script. They are accessible in the sys.argv list. Each item in sys.argv is of Python 2 type ‘str’. This type carries raw byte sequences. This is the output:

$ python b

The single quotes ‘ are added by Python’s repr() function (it tries to reveal the true content of the variable a in the Python code — the quotes show where the bytestring starts and ends). The fact that b in the input translates to b in the output seems normal. I don’t want to go into all the details here, but you need to appreciate that the process starting with the keystroke on your keyboard that leads to “b” being displayed on the command line, and ending with “b” being displayed in your terminal as the output of our Python script, involves several encoding and decoding steps, i.e. data interpretation steps. These interpretation steps do not always do the thing you would expect. The common ground of most of these encoders and decoders is the 7-bit ASCII character set (a 2-way translation table between byte values and characters). That is why for simple characters such as “b” things seem to be simple and ‘work’ out of the box. As you will see below, it is not always that simple and often times you need to understand the details of the data interpretation steps involved.

Provide command line arguments and read them in the new program: binary data

From an ASCII character table like this we can infer that the letter b corresponds to the byte value 0110 0010, or x62 in hexadecimal representation. Let us now try to explicitly use this byte value as command line argument to our little Python script.

There is one difficulty: on the command line — how do you construct arbitrary binary data? Having the extended 8-bit ASCII character set in mind (i.e. all characters and their corresponding byte values) is not an option :-).

There are a couple of possibilities. I like one of them particularly: in bash, the $'...' notation (discussed here) is allowed to be used together with \x00-like escape sequences for constructing arbitrary byte sequences from the hexadecimal notation. Let us create the same output as before, but with a more interesting input:

$ python  $'\x62'

This worked as expected. The input chain is clear: this command line explicitly instructs the shell to create a certain byte sequence (1 byte long in this case) and provide this as first argument to our script. I guess that the shell internally actually terminates this byte sequence properly with a null byte before calling execve(). sys.argv in our Python script has the same contents as before. Therefore, it does not surprise that the output is the same as before. This example again suggests that there is some magic happening between stdout of the Python script and our terminal. Some decoder expected to retrieve ASCII (or UTF-8, of which ASCII is a subset) as input and correspondingly interpreted this byte as ‘b’ — our terminal displays it as such.

Let us now provide two arguments in explicit binary form. We expect one to translate to “b”, the other to “c” (according to ASCII):

$ python  $'\x62' $'\x63'

Cool. Now, I mentioned the null-termination of arguments. Difficult to create with the keyboard, right? Straight-forward with the hex notation:

$ python  $'\x00' $'\x63\x00' $'\x63\x00\x63'

That proves that a null byte actually terminates an argument byte sequence. The first one arrives as an empty byte sequence, because it only contains a null byte. The second and the third one arrives as single byte \x63 (“c” according to ASCII), because the next byte in the input is a null byte.

More fun? For a fact, the Unicode character ☺ (a smiley) is encoded with the byte sequence \xe2\x98\xba in UTF-8. Send it:

$ python  $'\xe2\x98\xba'

Python’s repr() prints every single byte in this byte sequence in hex notation. It’s just a fallback to a readable representation when a certain byte is not representable as ASCII character. None of these three bytes has a character correspondence in the 7-bit ASCII table. The fact that both ‘strings’ look the same is because the hex notation for defining the input is the same as the hex notation for representing the output. We could have defined the input with a different notation representing the same byte sequence and would have gotten the same output.

It is clear: to our little Python script these three bytes just look like random binary data. It cannot make sense of it without us defining how to interpret this data. As I said earlier, these three bytes are the UTF-8 encoded form of a smiley. In order to make sense of this data, the Python script needs to decode it. The modified version of the script:

import sys
for a in sys.argv[1:]:
    print repr(a)
    da = a.decode("utf-8")
    print repr(da)
    print da

This is the output:

$ python  $'\xe2\x98\xba'

It first prints the raw representation of the byte string via repr() (the same as before). Secondly, it decodes the data using the explicitly defined codec UTF-8. This leads to a unicode data type da containing a certain code point representing a character. repr(da) tells us the number of this code point. See the 263a? This may not ring a bell for you, but it actually is the abstract and unambiguous description of our character here: print da then actually makes us see the smiley in the terminal. The fact that this works involves Python being aware of the terminal’s expected character encoding. So when Python prints this unicode data type, it actually encodes it in the encoding as expected by the terminal. The terminal then decodes it again and displays the character (if the terminal font has a glyph for it).

I hope the article made clear that command line arguments are nothing but byte sequences (with certain limitations) that deserve proper interpretation in the receiving program. I intend to report more about the details of Python’s behavior when starting programs with the subprocess module which also allows passing command line arguments from within Python. At this point, Python 2 and 3 behave quite differently.