JavaScript in a browser: create a binary blob from a base64-encoded string, and decode it using UTF-8.

For an AngularJS-based web application I am currently working on, I want to put arbitrary text information into a URL, send it to the client, and have it decoded by the browser. With the right tools this should be a trivial process, shouldn’t it?

The basic idea is:

  1. Start with the original text, which is a sequence of Unicode code points.
  2. Encode the original text into binary data, using the UTF-8 codec.
  3. base64-encode the resulting binary data (and replace the URL-unsafe characters + and / with, for instance, - and _).
  4. The result is a URL-safe string. Send it. Let’s assume it ends up in a browser (in window.location, for instance).
  5. Invert the entire procedure in the browser:
    1. The data arrives as a DOMString type (Unicode text, so to speak).
    2. Transform the URL-safe base64 representation to canonical base64 (replace the - and _ characters with + and /).
    3. Decode the base64 string into a real binary data type (Uint8Array, for instance).
    4. Decode the binary blob into a DOMString containing the original text, using the UTF-8 codec.

Unfortunately, there are so far no official, established ways of performing steps 5.3 and 5.4 in a browser environment. There is no obvious way to obtain a binary blob from base64-encoded data. Further below, I’ll show three different methods for executing this step. Proceeding from there, I realized that there is also no established way for decoding a binary blob into a DOMString using a given codec (UTF-8 in this case). I’ll show two different methods for performing this task.

The original text and its URL-safe representation

I’ll start with a Python snippet defining the original text and creating its URL-safe base64 representation:

# -*- coding: utf-8 -*-
 
from __future__ import unicode_literals
from base64 import urlsafe_b64encode
 
text = """
«küßЌύБЇ”ﻈﻉﻌﻍﻎ㌀㌁㌂❶❷❸⍝⍞⍟⍠ோௌ«ταБЬℓσ»:
n20%of٩(-̮̮̃-̃)۶٩(●̮̮̃•̃)۶٩(͡๏̯͡๏)۶٩(-̮̮̃•̃)!॒॑॓ஙசປຜἑἔℇ∆∇▆▇█
✆✇✈✉✌✍うぇえぉおㆅㆇ㉫㉬㍍㍎㍏沈拾若掠略亮兩凉梁糧
ﯗﯘﯙﯚﯝﯠﯡﯢカキクケコサシス'kosme':"κόσμε"/?#+퟿
"""
 
data = text.encode("utf-8")
datab64 = urlsafe_b64encode(data)
 
print("text length: %s" % len(text))
print("urlsafe base64 representation of the binary data:\n\n%s" % datab64)

The original text (the variable named text) is meant to contain code points from many different Unicode character blocks; various Unicode sample-text resources helped me assemble it.

As you can see, the text is first encoded using the UTF-8 codec. The resulting binary data is then passed to urlsafe_b64encode(), yielding a URL-safe byte sequence. Executing this Python script yields the following URL-safe representation of the original text:

CsKra8O8w5_DrsK74oCc0IzPjdCR0IfigJ3vu4jvu4nvu4zvu43vu47jjIDjjIHjjILinbbinbfinbjijZ3ijZ7ijZ_ijaDgr4vgr4zCq8-EzrHQkdCs4oSTz4PCuzoKbjIwJW9m2akoLcyuzK7Mgy3MgynbttmpKOKXj8yuzK7Mg-KAosyDKdu22akozaHguY_Mr82h4LmPKdu22akoLcyuzK7Mg-KAosyDKSHgpZHgpZLgpZPgrpngrprgupvgupzhvJHhvJTihIfiiIbiiIfilobilofilogK4pyG4pyH4pyI4pyJ4pyM4pyN44GG44GH44GI44GJ44GK44aF44aH44mr44ms442N442O442P76Wy76Wz76W076W176W276W376W476W576W676W7Cu-vl--vmO-vme-vmu-vne-voO-voe-vou-9tu-9t--9uO-9ue-9uu-9u--9vO-9vSdrb3NtZSc6Is664b25z4POvM61Ii8_Iyvtn78K

The Python script also tells you that the text is 167 characters long, which is a useful reference for later comparison.
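As a quick sanity check of the round trip (before involving the browser), the inverse procedure can also be performed on the Python side; this snippet assumes the text and datab64 variables from the script above:

from base64 import urlsafe_b64decode

# Invert the procedure: base64-decode, then UTF-8-decode.
recovered = urlsafe_b64decode(datab64).decode("utf-8")
assert recovered == text
print("recovered text length: %s" % len(recovered))  # 167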

Decoding the URL-safe representation in the browser

Here is the test document that implements four different methods for obtaining the original text back from the URL-safe representation (it just doesn’t show anything yet!): http://gehrcke.de/files/perm/blog/2015/jsdecode/test.html

Remember, the text we want to transport is encoded in a URL-safe way. Just for fun, this small demonstration makes use of that fact and communicates the information via the URL. To that end, the test document executes JavaScript code that extracts a string from the anchor/hash part of the URL:

// Get URL-safe text representation from the address bar.
// This is a DOMString type, as indicated by the text_ variable name prefix.
var text_urlsafe_b64data = window.location.hash.substring(1);

Okay, let’s put the test data into the URL (the output from the Python script above):

http://gehrcke.de/files/perm/blog/2015/jsdecode/test.html#CsKra8O8w5_DrsK74oCc0IzPjdCR0IfigJ3vu4jvu4nvu4zvu43vu47jjIDjjIHjjILinbbinbfinbjijZ3ijZ7ijZ_ijaDgr4vgr4zCq8-EzrHQkdCs4oSTz4PCuzoKbjIwJW9m2akoLcyuzK7Mgy3MgynbttmpKOKXj8yuzK7Mg-KAosyDKdu22akozaHguY_Mr82h4LmPKdu22akoLcyuzK7Mg-KAosyDKSHgpZHgpZLgpZPgrpngrprgupvgupzhvJHhvJTihIfiiIbiiIfilobilofilogK4pyG4pyH4pyI4pyJ4pyM4pyN44GG44GH44GI44GJ44GK44aF44aH44mr44ms442N442O442P76Wy76Wz76W076W176W276W376W476W576W676W7Cu-vl--vmO-vme-vmu-vne-voO-voe-vou-9tu-9t--9uO-9ue-9uu-9u--9vO-9vSdrb3NtZSc6Is664b25z4POvM61Ii8_Iyvtn78K

This is a long URL now, indeed. But it is not too long ;-).

When you access the URL above, an HTML document should show up. It should have four panels, each of which shows the exact same text as initially defined in the Python script above, including all newlines. Each of the individual panels is the result of a different decoding implementation. I recommend looking at the source code of this HTML document; I have commented the different methods sufficiently. I will now quickly go through the different methods.

The first step common to all four methods is to convert from URL-safe base64 encoding to canonical base64 encoding:

function urlsafeb64_to_b64(s) {
  // Replace - with + and _ with /
  return s.replace(/-/g, '+').replace(/_/g, '/');
}
// Create canonical base64 data from whatever Python's urlsafe_b64encode() produced.
var text_b64data = urlsafeb64_to_b64(text_urlsafe_b64data);

Method 1

Starting from there, the ugliest way to obtain the original text is what is designated “Method 1” in the source of the test document:

var text_original_1 = decodeURIComponent(escape(window.atob(text_b64data)));

This is a hacky all-in-one solution. It uses the deprecated escape() function and implicitly performs the UTF-8 decoding. Who the hell can explain why this really works? Monsur can: monsur.hossa.in/2012/07/20/utf-8-in-javascript.html. In short: atob() yields a “binary string” with one character per byte, escape() percent-encodes those bytes as %XX sequences, and decodeURIComponent() interprets the whole as percent-encoded UTF-8. Still, this really is a black magic approach with ugly semantics, and the tools involved were never designed for this purpose. There is no specification that guarantees proper behavior. I recommend not using this method, especially for its really bad semantics and its use of a now-deprecated function. However, if you love to confuse your peers with cryptic one-liners, then this is the way to go.
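If an analogy helps demystify the one-liner: escape() percent-encodes the raw bytes much like urllib’s quote(), and decodeURIComponent() interprets the percent sequence as UTF-8, like unquote(). A Python 3 sketch, purely illustrative:

# Illustration of method 1's mechanics, in Python 3 terms.
from urllib.parse import quote, unquote

utf8_bytes = "küß".encode("utf-8")  # the bytes that atob() hands over
percent = quote(utf8_bytes)         # 'k%C3%BC%C3%9F' -- escape()'s role
print(unquote(percent))             # 'küß' -- decodeURIComponent()'s role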

Method 2

The Mozilla Knowledge Base article on base64 states that there is “a better, more faithful and less expensive solution” involving native binary data types. In my opinion, this distinct two-step process is easy to understand and has quite clear semantics. So, my favorite decoding scheme is what is designated “Method 2” in the source of the test document:

// Step 1: decode the base64-encoded data into a binary blob (a Uint8Array).
var binary_utf8data_1 = base64DecToArr(text_b64data);
// Step 2: decode the binary data into a DOMString. Use a custom UTF-8 decoder.
var text_original_2 = UTF8ArrToStr(binary_utf8data_1);

The functions base64DecToArr() and UTF8ArrToStr() are lightweight custom implementations, taken from the Mozilla Knowledge Base. They should work in old as well as modern browsers, and should have a decent performance. The custom functions are not really lengthy and can be shipped with your application. Just look at the source of test.html.

Method 3

The custom UTF8ArrToStr() function used in method 2 can at some point be replaced by a TextDecoder()-based method, which is part of the WHATWG Encoding Standard. This is a living standard, and the feature is still labeled experimental. Nevertheless, it is already available in modern Firefox and Chrome versions, and there is also a promising polyfill project on GitHub. Prior to using TextDecoder(), the base64-encoded data (a DOMString) must still be decoded into binary data, so the first step is the same as in method 2:

var binary_utf8data_1 = base64DecToArr(text_b64data);
var text_original_3 = new TextDecoder("utf-8").decode(binary_utf8data_1);

Method 4

The fourth method shown here uses an alternative to base64DecToArr(), i.e. another way of decoding the base64-encoded data (a DOMString) into binary data (a Uint8Array). It is shorter and easier to understand than base64DecToArr(), but presumably also slower. Let’s look at base64_to_uint8array() (based on this answer on StackOverflow):

function base64_to_uint8array(s) {
  var byteChars = atob(s);
  var l = byteChars.length;
  var byteNumbers = new Array(l);
  for (var i = 0; i < l; i++) {
    byteNumbers[i] = byteChars.charCodeAt(i);
  }
  return new Uint8Array(byteNumbers);
}

Let’s combine it with the already introduced UTF8ArrToStr() (see method 2):

var binary_utf8data_2 = base64_to_uint8array(text_b64data);
var text_original_4 = UTF8ArrToStr(binary_utf8data_2);

Final words

By carefully looking at the rendered test document, one can verify that all four methods work for the test data used here. In my application scenario I am currently using method 4, since the conversion I am doing there is not performance-critical (otherwise I would use method 2). A disadvantage of method 4 is its use of the atob() function, which is not available in IE 8 and 9. If this were a core component of an application, I’d probably use the TextDecoder()-based method with a polyfill for older browsers. The disadvantage there is that the polyfill itself is quite a heavy dependency.

I hope these examples are of use, let me know what you think.

In-memory SQLite database and Flask: a threading trap

In a Flask development environment, people love to use Flask-SQLAlchemy with Python’s built-in SQLite backend. When configured with

app.config['SQLALCHEMY_DATABASE_URI'] = "sqlite://"

the database, created via db = SQLAlchemy(app), is stored in memory instead of being persisted to disk. This is a nice feature for development and testing.

Once some model classes are defined, at some point the tables should actually be created within the database, which is what the db.create_all() call is usually used for. But where should this be invoked, and what difference does it make? Let’s look at two possible places:

  • In the app’s bootstrap code, before app.run() is called (such as in the __init__.py file of your application package).
  • After the development server has been started via app.run(), in a route handler.

What difference does it make? A big one, in certain cases: the call happens in different threads (you can easily convince yourself of this by calling threading.current_thread() in the two places; the results differ).

You might think this should be an implementation detail of how Flask’s development server works under the hood. I agree, and usually this detail is not important to be aware of. However, in the case of an in-memory SQLite database this implementation detail makes a significant difference: the two threads see independent databases. And the nasty part is: the same db object represents different databases, depending on the thread from which it is being used.

Example: say you call db.create_all() and pre-populate the database in the main thread, before invoking app.run(). When you then access the database via the same db object (same id()) from within a route handler (which does not run in the main thread), the interaction takes place with a different database, which of course is still empty: the tables have not been created there. A simple read/query then yields unexpected results, as the following sketch demonstrates.
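Here is a minimal sketch of that scenario (the model name and route are made up, and it assumes the Flask-SQLAlchemy extension with its 2015-era API; whether the handler really runs outside the main thread depends on how the development server is run):

import threading
from flask import Flask
from flask_sqlalchemy import SQLAlchemy

app = Flask(__name__)
app.config['SQLALCHEMY_DATABASE_URI'] = "sqlite://"
db = SQLAlchemy(app)

class Item(db.Model):
    id = db.Column(db.Integer, primary_key=True)

@app.route('/')
def index():
    # Typically not the main thread.
    print(threading.current_thread())
    # May raise OperationalError: no such table: item, because
    # create_all() below ran against a *different* in-memory database.
    return "%d items" % Item.query.count()

if __name__ == '__main__':
    print(threading.current_thread())  # MainThread
    db.create_all()  # Creates the table, but only in this thread's database.
    app.run()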

Again, in other words: although you pass around and interact with the same db object throughout your modules, you might still access different databases, depending on the thread from which the interaction takes place.

This is not well-documented and leads to nasty errors. I guess most often people run into OperationalError: (OperationalError) no such table:, although they think they have already created the corresponding table.

There are two reasonable solutions:

  • Bootstrap the database from within a route handler. All route handlers run within the same thread, so all route handlers interact with the same database (a sketch follows below).
  • Do not use the in-memory feature. Then db represents the same physical database across threads.
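A sketch of the first option, continuing the example above (I would use the before_first_request hook here, available in the Flask versions of that era; creating the tables from an ordinary route handler works as well):

# Bootstrap the database lazily, from within the request-handling thread.
@app.before_first_request
def bootstrap_database():
    db.create_all()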

People have been bitten by this, as can be seen in several StackOverflow threads.

This has been observed with Python 2.7.5 and Flask 0.10.1.

Uploaded.to download with wget

Downloading from uploaded.to/.net with premium credentials through the command line is possible using standard tools such as wget or curl. However, there is no official API, and the exact method required depends on the mechanism implemented by the uploaded.to/.net website. Finding these implementation details requires a small amount of reverse engineering.

Here I share a small shell script that should work on all POSIX-compliant platforms (e.g. Mac or Linux). The method is based on the current behavior of the uploaded.to website. There are no special tools involved, just wget, grep, sed, and mktemp.

(The solutions I found on the web did not work (anymore) and/or were suspiciously wrong.)

Usage

Copy the script content below, define username and password, and save the script as, for instance, download.sh. Then, invoke the script like so:

$ /bin/sh download.sh urls.txt

The file urls.txt should contain one uploaded.to/.net URL per line, such as in this example:

http://uploaded.net/file/98389123/foo.rar
http://uploaded.net/file/bxmsdkfm/bar.rar
http://uploaded.net/file/72asjh98/not.zip

Method

This paragraph is just for the curious ones. The script first POSTs your credentials to http://uploaded.net/io/login and stores the resulting authentication cookie in a file. This authentication cookie is then used for retrieving the website corresponding to an uploaded.to file. That website contains a temporarily valid download URL corresponding to the file. Using grep and sed, the HTML code is filtered for this URL. The payload data transfer is triggered by firing a POST request with empty body against this URL (cookie not needed). Files are downloaded to the current working directory. All intermediate data is stored in a temporary directory. That directory is automatically deleted upon script exit (no data is leaked, unless the script is terminated with SIGKILL).
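For illustration only, here is the same flow sketched in Python (this assumes the third-party requests library; the regex is just as fragile as the grep/sed pipeline below, and the filename is naively derived from the URL instead of from the Content-Disposition header):

import re
import sys
import requests

session = requests.Session()
login = session.post("http://uploaded.net/io/login",
                     data={"id": "user", "pw": "password"})
# Uploaded responds with a non-empty error body upon login failure.
assert not login.text, "Login failed: %s" % login.text

with open(sys.argv[1]) as f:
    urls = [line.strip() for line in f if line.strip()]

for url in urls:
    html = session.get(url).text  # Uses the authentication cookie.
    # Extract the temporarily valid download URL from the form's action.
    match = re.search(r'action="(http[^"]*uploaded[^"]*)"', html)
    if match is None:
        print("No download URL found for %s" % url)
        continue
    # Empty-body POST triggers the payload transfer (cookie not needed).
    response = requests.post(match.group(1), stream=True)
    filename = url.rstrip("/").rsplit("/", 1)[-1]
    with open(filename, "wb") as out:
        for chunk in response.iter_content(2 ** 20):
            out.write(chunk)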

The script

#!/bin/sh
# Copyright 2015 Jan-Philip Gehrcke, http://gehrcke.de
# See http://gehrcke.de/2015/03/uploaded-to-download-with-wget/
 
 
USERNAME="user"
PASSWORD="password"
 
 
if [ "$#" -ne 1 ]; then
    echo "Missing argument: URLs file (containing one URL per line)." >&2
    exit 1
fi
 
 
URLSFILE="${1}"
if [ ! -r "${URLSFILE}" ]; then
    echo "Cannot read URLs file ${URLSFILE}. Exit." >&2
    exit 1
fi
if [ ! -s "${URLSFILE}" ]; then
    echo "URLs file is empty. Exit." >&2
    exit 1
fi
 
 
TMPDIR="$(mktemp -d)"
# Install trap that removes the temporary directory recursively
# upon exit (except for when this program receives SIGKILL).
trap 'rm -rf "$TMPDIR"' EXIT
 
 
LOGINRESPFILE="${TMPDIR}/login.response"
LOGINOUTPUTFILE="${TMPDIR}/login.outerr"
COOKIESFILE="${TMPDIR}/login.cookies"
LOGINURL="http://uploaded.net/io/login"
 
 
echo "Temporary directory: ${TMPDIR}"
echo "Log in via POST request to ${LOGINURL}, save cookies."
wget --save-cookies=${COOKIESFILE} --server-response \
    --output-document ${LOGINRESPFILE} \
    --post-data="id=${USERNAME}&pw=${PASSWORD}" \
    ${LOGINURL} > ${LOGINOUTPUTFILE} 2>&1
 
# Status code is 200 even if login failed.
# Uploaded sends a '{"err":"User and password do not match!"}'-like response
# body in case of error.
 
echo "Verify that login response is empty."
# Response is more than 0 bytes in case of login error.
if [ -s "${LOGINRESPFILE}" ]; then
    echo "Login response larger than 0 bytes. Print response and exit." >&2
    cat "${LOGINRESPFILE}"
    exit 1
fi
 
# Zero response size does not necessarily imply successful login.
# Wget adds three commented lines to the cookies file by default, so
# set cookies should result in more than three lines in this file.
COOKIESFILELINES="$(cat ${COOKIESFILE} | wc -l)"
echo "${COOKIESFILELINES} lines in cookies file found."
if [ "${COOKIESFILELINES}" -lt "4" ]; then
    echo "Expected >3 lines in cookies file. Exit.". >&2
    exit 1
fi
 
echo "Process URLs."
# Assume that login worked. Iterate through URLs.
while read -r CURRENTURL; do
    if [ "x$CURRENTURL" = "x" ]; then
        # Skip empty lines.
        continue
    fi
    printf '\n\n\n'
    TMPFILE="$(mktemp "${TMPDIR}/response.html.XXXX")"
    echo "GET ${CURRENTURL} (use auth cookie), store response."
    wget --no-verbose --load-cookies=${COOKIESFILE} \
        --output-document ${TMPFILE} ${CURRENTURL}
 
    if [ ! -s "${TMPFILE}" ]; then
        echo "No HTML response: ${TMPFILE} is zero size. Skip processing."
        continue
    fi
 
    # Extract (temporarily valid) download URL from HTML.
    LINEOFINTEREST="$(grep post ${TMPFILE} | grep action | grep uploaded)"
    # Match entire line, include space after action="bla" , replace
    # entire line with first group, which is bla.
    DLURL=$(echo $LINEOFINTEREST | sed 's/.*action="\(.\+\)" .*/\1/')
    echo "Extracted download URL: ${DLURL}"
    # This file contains account details, so delete as soon as not required
    # anymore.
    rm -f "${TMPFILE}"
    echo "POST to URL w/o data. Response is file. Get filename from header."
    # --content-disposition should extract the proper filename.
    wget --content-disposition --post-data='' "${DLURL}"
done < "${URLSFILE}"

Structured data for Google: how to add the ‘updated’ hentry field

This is a WordPress-specific post. I am using a modified TwentyTwelve theme, and Google Webmaster Tools reports missing structured data for all of my posts:

Missing "updated" field in the microformats hatom markup

Missing “updated” field in the microformats hatom markup.

In particular, it is the updated hentry field that seems to be missing. TwentyTwelve, like many themes, uses the microformats approach to communicate structured data to Google (also to others; it’s just that Google is a popular and important consumer of this data). How does one correctly present date/time information to Google? A quote from their microdata docs:

To specify dates and times unambiguously, use the time element with the datetime attribute. […] The value in the datetime attribute is specified using the ISO date format.

And a quote from their microformats docs:

In general, microformats use the class attribute in HTML tags

It appears that we can combine the two approaches. This might be dirty, but it works. So, what you might want to have in the HTML source of your blog post looks like this:

<time class="updated" datetime="2015-02-28T18:09:49+00:00" pubdate>
February 28, 2015
</time>

The value of the datetime attribute is in ISO 8601 format and not shown to the user. It should contain the point in time the article/blog post was last modified (updated). It is parsed by Google as the updated property, because of the class="updated" attribute. The string content of the time tag is what is displayed to your users (February 28, 2015 in this case). There, you usually want to display the point in time when the article was first published.

So, how do you get this into the HTML source code of all of your blog posts? A simple solution is to create a custom “byline” (that is what the author and date information string is often called in the context of WordPress themes), for instance with a PHP function like this:

function modbyline() {
    $datecreated = esc_html(get_the_date());
    $author = esc_html(get_the_author());
    $datemodifiedISO = esc_html(get_the_modified_time("c"));
    echo '<div class="bylinemod"><time class="entry-date updated" datetime="'.$datemodifiedISO.'" pubdate>'.$datecreated.'</time> &mdash; by '.$author.'</div>';
}

This creates HTML code for a custom byline, in my case rendered like so:

<div class="bylinemod">
    <time class="entry-date updated" datetime="2015-02-28T18:09:49+00:00" pubdate>
        February 28, 2015
    </time>
    &mdash; by Jan-Philip Gehrcke
</div>

The user-visible date is the article publication date, and the machine-readable datetime attribute encodes the modification time of the article. Note that WordPress’ get_the_modified_time() by default returns a date string in a human-readable format. To make it machine-readable per the ISO 8601 standard, you need to provide the "c" format specifier argument (as done in the function above).

You want to define this custom byline function in your (child) theme’s functions.php. It should be called from within content.php.

After inclusion, use Google’s structured data testing tool to validate the approach. It should show an updated entry containing the correct date.

Google authorship feature deactivated

I just realized that the Google authorship feature (by which web content could be related to a Google+ profile) was disabled in summer 2014. The feature had been introduced not long before that, and the web ecosystem followed with enthusiasm: content management systems like WordPress offered support (at least via plugins), and the SEO media response was positive. Many articles were published on the importance and usage of this feature.

And then, suddenly, a posting on Google+:

[…] With this in mind, we’ve made the difficult decision to stop showing authorship in search results.

Another posting from John Mueller:

Edit: In the meantime, we’ve decided to remove authorship completely

What is left is the URL http://plus.google.com/authorship which redirects to https://support.google.com/webmasters/answer/6083347, which shows nothing but:

Authorship markup is no longer supported in web search.
To learn about what markup you can use to improve search results, visit rich snippets.

What’s left are many websites containing wasteful markup. Garbage, and it will remain for years, probably. I just deactivated my Google Author Link WordPress plugin. What a waste of time, for so many people. For the interested ones, the removal of this feature is discussed in some depth in this article.