Insights into SoundCloud culture

SoundCloud (cf. Wikipedia) is a young Berlin-based company “under the laws of England & Wales”, with Swedish origins. Six years ago, in 2009, it secured its first big funding. Since then, the company has experienced tremendous growth and has regularly raised investment capital. To date, SoundCloud has worked itself into a unique position with a convincing product (which I use, and pay for, myself), but it can also be considered a competitor of big brands such as Spotify and Beats Music. In fact, according to the Wall Street Journal, SoundCloud can be expected to join the club of billion-dollar IT companies quite soon:

SoundCloud, a popular music and audio-sharing service, is in discussions to raise about $150 million in new financing at a valuation that is expected to top $1.2 billion, according to two people with knowledge of the negotiations.

With these facts in mind, it is impressive to hear that SoundCloud still employs only about 300 people, in just a handful of offices around the world. Just like me, you might be curious about details of this kind, and about the SoundCloud story itself. So, I was really eager to listen to Episode 17 of “Hipster & Hack“, featuring an interview with David Noël (Twitter profile, LinkedIn profile). David has been with SoundCloud for six years now and currently leads Internal Communications. He clearly is in a position to provide authoritative information about how SoundCloud’s vision was translated into reality over time, and about how culture and communication within SoundCloud evolved. The latter is what he mainly talks about in the interview, providing insights into the structure and tools applied for defining a culture, for keeping it on track, and for communicating it to employees from the very first moment until even after they have left the company. David defines culture as the living manifestation of core values and arrives at insightful statements such as

Living your values is your culture at any moment in time.

In the interview, we learn that one of SoundCloud’s core values in the context of internal communications is being open. The culture and communication topic clearly has a high priority in the company, judging by practices such as the “all-hands” meeting that David refers to in the interview. Personally, I cannot overstate how much I value this, coming from classical research, where such elements are often simply neglected.

So, if that piques your interest, I recommend listening to these three quite likable guys (minutes 4 to 29 suffice; the rest is enjoyable overhead ;-)):

JavaScript in a browser: create a binary blob from a base64-encoded string, and decode it using UTF-8.

For an AngularJS-based web application I am currently working on, I want to put arbitrary text information into a URL, send it to the client, and have it decoded by the browser. With the right tools this should be a trivial process, shouldn’t it?

The basic idea is:

  1. Start with the original text, which is a sequence of Unicode code points.
  2. Encode the original text into binary data, using the UTF-8 codec.
  3. base64-encode the resulting binary data (and replace the URL-unsafe characters + and / with, for instance, - and _).
  4. The result is a URL-safe string. Send it. Let’s assume it ends up in a browser (in window.location, for instance).
  5. Invert the entire procedure in the browser:
    1. The data arrives as a DOMString type (Unicode text, so to speak).
    2. Transform the URL-safe base64 representation to canonical base64 (replace - with + and _ with /).
    3. Decode the base64 string into a real binary data type (Uint8Array, for instance).
    4. Decode the binary blob into a DOMString containing the original text, using the UTF-8 codec.
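The steps above can be sketched end to end in Python using only the standard library (the test string here is my own stand-in, not the post's original test text):

```python
import base64

def encode_for_url(text):
    # Steps 1-3: UTF-8-encode the text, then URL-safe base64-encode the bytes
    # (urlsafe_b64encode substitutes - for + and _ for /).
    return base64.urlsafe_b64encode(text.encode("utf-8")).decode("ascii")

def decode_from_url(urlsafe):
    # Steps 5.2-5.4: back to canonical base64, then to bytes, then to text.
    canonical = urlsafe.replace("-", "+").replace("_", "/")
    return base64.b64decode(canonical).decode("utf-8")

roundtripped = decode_from_url(encode_for_url("Ünïcödé text ♥"))
print(roundtripped)  # Ünïcödé text ♥
```

The browser-side challenge discussed below is that steps 5.3 and 5.4, trivial one-liners in Python, have no single established equivalent in JavaScript.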

Unfortunately, there are so far no official, established ways to perform steps 5.3 and 5.4 in a browser environment. There is no obvious way to obtain a binary blob from base64-encoded data. Below, I will show three different methods for executing this step. Proceeding from there, I realized that there is also no established way to decode a binary blob into a DOMString using a given codec (UTF-8 in this case). I will show two different methods for performing this task.

The original text and its URL-safe representation

I’ll start with a Python snippet defining the original text and creating its URL-safe base64 representation:

# -*- coding: utf-8 -*-
from __future__ import unicode_literals
from base64 import urlsafe_b64encode
text = """..."""  # placeholder: the original multi-line Unicode test text is omitted here
data = text.encode("utf-8")
datab64 = urlsafe_b64encode(data)
print("text length: %s" % len(text))
print("urlsafe base64 representation of the binary data:\n\n%s" % datab64)

The original text (the variable named text) is meant to contain code points from many different Unicode character blocks. The following resources helped me assemble this test text:

As you can see, the text is first encoded using the UTF-8 codec. The resulting binary data then is put into urlsafe_b64encode(), yielding a URL-safe byte sequence. So, execution of the named Python script yields the following URL-safe representation of the original text:


The Python script also tells you that the text is 167 characters long, which is a useful reference for later comparison.
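Note that this count refers to Unicode code points, not UTF-8 bytes; a quick illustration of the difference (my own example string):

```python
# Character count (code points) and UTF-8 byte count differ for non-ASCII text.
text = "naïve ♥"
data = text.encode("utf-8")
print(len(text))  # 7  (code points)
print(len(data))  # 10 (UTF-8 bytes: ï takes 2 bytes, ♥ takes 3)
```

The 167-character reference below is therefore a code-point count, which is exactly what a correct browser-side decoding must reproduce.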

Decoding the URL-safe representation in the browser

Here is the test document that implements four different methods for obtaining the original text back from the URL-safe representation (it just doesn’t show anything yet!):

Remember, the text we want to transport is encoded in a URL-safe way, so just for fun I want to make use of this fact in this small demonstration here, and communicate the information via the URL. To that end, the test document executes JavaScript code that extracts a string from the anchor/hash part of the URL:

// Get URL-safe text representation from the address bar.
// This is a DOMString type, as indicated by the text_ variable name prefix.
var text_urlsafe_b64data = window.location.hash.substring(1);

Okay, let’s put the test data into the URL (the output from the Python script above):–vmO-vme-vmu-vne-voO-voe-vou-9tu-9t–9uO-9ue-9uu-9u–9vO-9vSdrb3NtZSc6Is664b25z4POvM61Ii8_Iyvtn78K

This is a long URL now, indeed. But it is not too long ;-).

When you access the URL above, an HTML document should show up. It has four panels, each of which should show the exact same text as initially defined in the Python script above, including all newlines. Each panel is the result of a different decoding implementation. I recommend looking at the source code of this HTML document; I have commented the different methods sufficiently. I will now quickly go through them.

The first step common to all four methods is to convert from URL-safe base64 encoding to canonical base64 encoding:

function urlsafeb64_to_b64(s) {
  // Replace - with + and _ with /
  return s.replace(/-/g, '+').replace(/_/g, '/');
}

// Create canonical base64 data from whatever Python's urlsafe_b64encode() produced.
var text_b46data = urlsafeb64_to_b64(text_urlsafe_b64data);

Method 1

Starting from there, the ugliest way to obtain the original text is what is designated “Method 1” in the source of the test document:

var text_original_1 = decodeURIComponent(escape(window.atob(text_b46data)));

This is a hacky all-in-one solution. It uses the deprecated escape() function and implicitly performs the UTF-8 decoding. Who the hell can explain why this really works? Monsur can: However, this really is a black-magic approach with ugly semantics, and the tools involved were never designed for this purpose. No specification guarantees proper behavior. I recommend not using this method, especially because of its bad semantics and its use of a now-deprecated function. However, if you love to confuse your peers with cryptic one-liners, this is the way to go.
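For the curious, the mechanics of the one-liner can be re-enacted in Python (my own illustration): atob() yields one "character" per byte, escape() percent-encodes those code points byte-wise, and decodeURIComponent() percent-decodes while interpreting the bytes as UTF-8. Here, quote() and unquote() stand in for escape() and decodeURIComponent():

```python
import base64
from urllib.parse import quote, unquote

original = "Ünïcödé ♥"
b64 = base64.b64encode(original.encode("utf-8")).decode("ascii")

# atob() analog: base64-decode, then map each byte to one code point (Latin-1).
byte_string = base64.b64decode(b64).decode("latin-1")

# escape() analog (byte-wise percent-encoding), then decodeURIComponent()
# analog (percent-decoding with UTF-8 interpretation):
roundtripped = unquote(quote(byte_string, safe="", encoding="latin-1"),
                       encoding="utf-8")
print(roundtripped)  # Ünïcödé ♥
```

This makes the "black magic" visible: the UTF-8 decoding happens implicitly in the final percent-decoding step.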

Method 2

This article states that there is “a better, more faithful and less expensive solution” involving native binary data types. In my opinion, this distinct two-step process is easy to understand and has quite clear semantics. So, my favorite decoding scheme is what is designated “Method 2” in the source of the test document:

// Step 1: decode the base64-encoded data into a binary blob (a Uint8Array).
var binary_utf8data_1 = base64DecToArr(text_b46data);
// Step 2: decode the binary data into a DOMString. Use a custom UTF-8 decoder.
var text_original_2 = UTF8ArrToStr(binary_utf8data_1);

The functions base64DecToArr() and UTF8ArrToStr() are lightweight custom implementations, taken from the Mozilla Knowledge Base. They should work in old as well as modern browsers, and should have a decent performance. The custom functions are not really lengthy and can be shipped with your application. Just look at the source of test.html.

Method 3

The custom UTF8ArrToStr() function used in method 2 can at some point be replaced by a TextDecoder()-based method, which is part of the Encoding Standard. This is a WHATWG living standard, still labeled experimental. Nevertheless, it is already available in modern Firefox and Chrome versions, and there is also a promising polyfill project on GitHub. Prior to using TextDecoder(), the base64-encoded data (a DOMString) must still be decoded into binary data, so the first part is the same as in method 2:

var binary_utf8data_1 = base64DecToArr(text_b46data);
var text_original_3 = new TextDecoder("utf-8").decode(binary_utf8data_1);

Method 4

The fourth method I am showing here uses an alternative approach for base64DecToArr(), i.e. for decoding the base64-encoded data (DOMString) into binary data (Uint8Array). It is shorter and easier to understand than base64DecToArr(), but presumably also of lower performance. Let’s look at base64_to_uint8array() (based on this answer on StackOverflow):

function base64_to_uint8array(s) {
  var byteChars = atob(s);
  var l = byteChars.length;
  var byteNumbers = new Array(l);
  for (var i = 0; i < l; i++) {
    byteNumbers[i] = byteChars.charCodeAt(i);
  }
  return new Uint8Array(byteNumbers);
}

Let’s combine it with the already introduced UTF8ArrToStr() (see method 2):

var binary_utf8data_2 = base64_to_uint8array(text_b46data);
var text_original_4 = UTF8ArrToStr(binary_utf8data_2);
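The core idea of base64_to_uint8array(), mapping each character of the atob() output back to its byte value, can be mirrored in Python (the base64 string "w7w=" is my example; it encodes the two UTF-8 bytes of the character ü):

```python
import base64

# atob() analog: base64-decode, then view the result as a "binary string"
# with one character per byte (Latin-1 maps bytes 0-255 to code points 0-255).
binary_string = base64.b64decode("w7w=").decode("latin-1")

# base64_to_uint8array() analog: map each character back to its byte value,
# just as charCodeAt() does in the JavaScript loop.
byte_values = bytes(ord(c) for c in binary_string)

# UTF8ArrToStr() analog: decode the byte sequence as UTF-8.
text = byte_values.decode("utf-8")
print(text)  # ü
```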

Final words

By carefully looking at the rendered test document, one can infer that all four methods work for the test data used here. In my application scenario I am currently using method 4, since the conversion there is not performance-critical (otherwise I would use method 2). A disadvantage of methods 1 and 4 is their use of the atob() function, which is not available in IE 8 and 9; method 2 avoids it entirely. If this were a core component of an application, I would probably use the TextDecoder()-based method with a polyfill for older browsers. The disadvantage there is that the polyfill itself is quite a heavy dependency.

I hope these examples are of use, let me know what you think.

In-memory SQLite database and Flask: a threading trap

In a Flask development environment people love to use SQLAlchemy with Python’s built-in sqlite backend. When configured with

app.config['SQLALCHEMY_DATABASE_URI'] = "sqlite://"

the database, created via db = SQLAlchemy(app), is stored in memory instead of being persisted to disk. This is a nice feature for development and testing.

With some model classes defined, at some point the tables should actually be created within the database, which is what the db.create_all() call is usually used for. But where should it be invoked, and what difference does it make? Let’s look at two possible places:

  • In the app’s bootstrap code, before app.run() is called (for instance in the module of your application package where the app object is created).
  • After the development server has been started via app.run(), in a route handler.

What difference does it make? A big one, in certain cases: the two calls happen in different threads (you can easily convince yourself of this by calling threading.current_thread() in both places; the results differ).

You might think this is an implementation detail of how Flask’s development server works under the hood. I agree, and usually this detail is not important to be aware of. In the case of an in-memory SQLite database, however, it makes a significant difference: the two threads see independent databases. And the tricky part is: the same db object represents different databases, depending on the thread from which it is used.

Example: say you call db.create_all() and pre-populate the database in the main thread, before invoking app.run(). When you then access the database via the same db object (equivalent id()) from within a route handler (which does not run in the main thread), the interaction takes place with a different database, which of course is still empty: the tables have not been created there. A simple read/query yields unexpected results.

Again, in other words: although you pass around and interact with the same db object in your modules, you might still access different databases, depending on the thread from which the interaction takes place.

This is not well documented and leads to nasty errors. I guess most people run into OperationalError: (OperationalError) no such table:, although they think they have already created the corresponding table.
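The effect boils down to a property of SQLite itself, which a few lines of plain sqlite3 (no Flask, no SQLAlchemy) can demonstrate: every connection to ":memory:" opens its own private database, just as each thread's connection does in the scenario above.

```python
import sqlite3

# Two connections to ":memory:" open two independent databases.
conn_a = sqlite3.connect(":memory:")
conn_b = sqlite3.connect(":memory:")

conn_a.execute("CREATE TABLE users (name TEXT)")
conn_a.execute("INSERT INTO users VALUES ('alice')")

# conn_b sees a different, still-empty database:
try:
    conn_b.execute("SELECT * FROM users")
    error = None
except sqlite3.OperationalError as exc:
    error = str(exc)
print(error)  # no such table: users
```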

There are two reasonable solutions:

  • Bootstrap the database from within a route handler. All route handlers run within the same thread, so all route handlers interact with the same database.
  • Do not use the in-memory feature. Then db represents the same physical database across threads.

People have been bitten by this, as can be seen in these StackOverflow threads:

This has been observed with Python 2.7.5 and Flask 0.10.1.

Uploaded: download with wget

Downloading from Uploaded with premium credentials through the command line is possible using standard tools such as wget or curl. However, there is no official API, and the exact method required depends on the mechanism implemented by the website. Finding these implementation details requires a small amount of reverse engineering.

Here I share a small shell script that should work on all POSIX-compliant platforms (e.g. Mac or Linux). The method is based on the current behavior of the website. No special tools are involved: just wget, grep, sed, and mktemp.

(The solutions I found on the web did not work (anymore) and/or were suspiciously wrong.)


Copy the script content below, define username and password within it, and save the script as, for instance, upldl.sh. Then invoke the script like so:

$ /bin/sh upldl.sh urls.txt

The file urls.txt should contain one URL per line, such as in this example:


This paragraph is just for the curious ones. The script first POSTs your credentials to the login endpoint and stores the resulting authentication cookie in a file. This authentication cookie is then used for retrieving the web page corresponding to a file URL. That page contains a temporarily valid download URL for the file. Using grep and sed, the HTML code is filtered for this URL. The payload data transfer is triggered by firing a POST request with an empty body against this URL (no cookie needed). Files are downloaded to the current working directory. All intermediate data is stored in a temporary directory, which is automatically deleted upon script exit (no data is leaked, unless the script is terminated with SIGKILL).
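For illustration, the same flow can be sketched in Python with only the standard library. The login endpoint URL in this sketch is a placeholder (the real one depends on the site), while the "id"/"pw" form field names match those used by the shell script below:

```python
import re
import urllib.parse
import urllib.request
import http.cookiejar

def extract_download_url(html):
    # Same job as the grep/sed pipeline: pull the form's action URL.
    match = re.search(r'<form[^>]+action="([^"]+)"', html)
    return match.group(1) if match else None

def download_one(login_url, file_url, username, password):
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar))
    # Step 1: POST credentials; the auth cookie ends up in the jar.
    creds = urllib.parse.urlencode({"id": username, "pw": password}).encode()
    opener.open(login_url, data=creds)
    # Step 2: GET the file page (the cookie is sent automatically).
    html = opener.open(file_url).read().decode("utf-8", "replace")
    # Step 3: extract the temporarily valid download URL from the HTML.
    dl_url = extract_download_url(html)
    # Step 4: a POST with an empty body triggers the payload transfer.
    return opener.open(dl_url, data=b"").read()
```

The shell script below is the variant I actually use; the sketch only mirrors its request sequence.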

The script

# Copyright 2015 Jan-Philip Gehrcke,
# See

if [ "$#" -ne 1 ]; then
    echo "Missing argument: URLs file (containing one URL per line)." >&2
    exit 1
fi

URLSFILE="$1"
if [ ! -r "${URLSFILE}" ]; then
    echo "Cannot read URLs file ${URLSFILE}. Exit." >&2
    exit 1
fi
if [ ! -s "${URLSFILE}" ]; then
    echo "URLs file is empty. Exit." >&2
    exit 1
fi

# Define your premium credentials here.
USERNAME="your-username"
PASSWORD="your-password"

# The site's login endpoint.
LOGINURL="..."

TMPDIR="$(mktemp -d)"
COOKIESFILE="${TMPDIR}/cookies.txt"
LOGINRESPFILE="${TMPDIR}/loginresponse.txt"

# Install trap that removes the temporary directory recursively
# upon exit (except for when this program receives SIGKILL).
trap 'rm -rf "$TMPDIR"' EXIT

echo "Temporary directory: ${TMPDIR}"
echo "Log in via POST request to ${LOGINURL}, save cookies."
wget --save-cookies=${COOKIESFILE} --server-response \
    --output-document ${LOGINRESPFILE} \
    --post-data="id=${USERNAME}&pw=${PASSWORD}" \
    ${LOGINURL}

# Status code is 200 even if login failed.
# Uploaded sends a '{"err":"User and password do not match!"}'-like response
# body in case of error.
echo "Verify that login response is empty."
# Response is more than 0 bytes in case of login error.
if [ -s "${LOGINRESPFILE}" ]; then
    echo "Login response larger than 0 bytes. Print response and exit." >&2
    cat "${LOGINRESPFILE}"
    exit 1
fi

# Zero response size does not necessarily imply successful login.
# Wget adds three commented lines to the cookies file by default, so
# set cookies should result in more than three lines in this file.
COOKIESFILELINES="$(wc -l < "${COOKIESFILE}")"
echo "${COOKIESFILELINES} lines in cookies file found."
if [ "${COOKIESFILELINES}" -lt "4" ]; then
    echo "Expected >3 lines in cookies file. Exit." >&2
    exit 1
fi

echo "Process URLs."
# Assume that login worked. Iterate through URLs.
while read CURRENTURL; do
    if [ "x$CURRENTURL" = "x" ]; then
        # Skip empty lines.
        continue
    fi
    printf "\n\n"
    TMPFILE="$(mktemp --tmpdir=${TMPDIR} response.html.XXXX)"
    echo "GET ${CURRENTURL} (use auth cookie), store response."
    wget --no-verbose --load-cookies=${COOKIESFILE} \
        --output-document ${TMPFILE} ${CURRENTURL}
    if [ ! -s "${TMPFILE}" ]; then
        echo "No HTML response: ${TMPFILE} is zero size. Skip processing."
        continue
    fi
    # Extract (temporarily valid) download URL from HTML.
    LINEOFINTEREST="$(grep post ${TMPFILE} | grep action | grep uploaded)"
    # Match entire line, include space after action="bla", replace
    # entire line with first group, which is bla.
    DLURL=$(echo $LINEOFINTEREST | sed 's/.*action="\(.\+\)" .*/\1/')
    echo "Extracted download URL: ${DLURL}"
    # This file contains account details, so delete it as soon as it is
    # not required anymore.
    rm -f "${TMPFILE}"
    echo "POST to URL w/o data. Response is file. Get filename from header."
    # --content-disposition should extract the proper filename.
    wget --content-disposition --post-data='' "${DLURL}"
done < "${URLSFILE}"

Structured data for Google: how to add the ‘updated’ hentry field

This is a WordPress-specific post. I am using a modified TwentyTwelve theme, and Google Webmaster Tools reports missing structured data for all of my posts:

Missing "updated" field in the microformats hatom markup

In particular, it is the updated hentry field that seems to be missing. TwentyTwelve, like many themes, uses the microformats approach to communicate structured data to Google (and to others; it is just that Google is a popular and important consumer of this data). How do we correctly present date/time information to Google? A quote from their microdata docs:

To specify dates and times unambiguously, use the time element with the datetime attribute. […] The value in the datetime attribute is specified using the ISO date format.

And a quote from their microformats docs:

In general, microformats use the class attribute in HTML tags

It appears that we can combine the two approaches. It might be dirty, but it works. So, what you want in the HTML source of your blog post looks like this:

<time class="updated" datetime="2015-02-28T18:09:49+00:00" pubdate>
February 28, 2015
</time>

The value of the datetime attribute is in ISO 8601 format and not shown to the user. It should contain the point in time the article/blog post was last modified (updated). It is parsed by Google as the updated property, because of the class="updated" attribute. The string content of the time tag is what is displayed to your users (February 28, 2015 in this case). There, you usually want to display the point in time when the article was first published.

So, how do you get this into the HTML source code of all of your blog posts? A simple solution is to create a custom “byline” (that is what the author and date information string is often called in the context of WordPress themes), for instance with a PHP function like this:

function modbyline() {
    $datecreated = esc_html(get_the_date());
    $author = esc_html(get_the_author());
    $datemodifiedISO = esc_html(get_the_modified_time("c"));
    echo '<div class="bylinemod"><time class="entry-date updated" datetime="'.$datemodifiedISO.'" pubdate>'.$datecreated.'</time> &mdash; by '.$author.'</div>';
}

This creates HTML code for a custom byline, in my case rendered like so:

<div class="bylinemod">
    <time class="entry-date updated" datetime="2015-02-28T18:09:49+00:00" pubdate>
        February 28, 2015
    </time>
    &mdash; by Jan-Philip Gehrcke
</div>

The user-visible date is the article publication date, and the machine-readable datetime attribute encodes the modification time of the article. Note that WordPress’ get_the_modified_time() by default returns a date string in a human-readable format. To make it machine-readable according to the ISO 8601 standard, you need to pass it the "c" format specifier argument (as done in the function above).
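For reference, the datetime value used above is plain ISO 8601, the same shape that Python's datetime.isoformat() produces:

```python
from datetime import datetime, timezone

# The timestamp from the example markup above, as a timezone-aware datetime.
modified = datetime(2015, 2, 28, 18, 9, 49, tzinfo=timezone.utc)
iso = modified.isoformat()
print(iso)  # 2015-02-28T18:09:49+00:00
```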

You want to define this custom byline function in your (child) theme’s functions.php. It should be called from within content.php.

After inclusion, use Google’s structured data testing tool to validate the approach. It should show an updated entry containing the correct date.