Category Archives: Python

JavaScript in a browser: create a binary blob from a base64-encoded string, and decode it using UTF-8.

For an AngularJS-based web application I am currently working on, I want to put arbitrary text information into a URL, send it to the client, and have it decoded by the browser. With the right tools this should be a trivial process, shouldn’t it?

The basic idea is:

  1. Start with the original text, which is a sequence of Unicode code points.
  2. Encode the original text into binary data, using the UTF-8 codec.
  3. base64-encode the resulting binary data (and replace the URL-unsafe characters + and / with, for instance, - and _, respectively).
  4. The result is a URL-safe string. Send it. Let’s assume it ends up in a browser (in window.location, for instance).
  5. Invert the entire procedure in the browser:
    1. The data arrives as a DOMString type (Unicode text, so to speak).
    2. Transform the URL-safe base64 representation to canonical base64 (replace the - and _ characters with + and /).
    3. Decode the base64 string into a real binary data type (Uint8Array, for instance).
    4. Decode the binary blob into a DOMString containing the original text, using the UTF-8 codec.

Unfortunately, so far there are no official, established ways of performing steps 5.3 and 5.4 in a browser environment. There is no obvious way to obtain a binary blob from base64-encoded data. Further below, I’ll show three different methods for executing this step. Proceeding from there, I realized that there is also no established way to decode a binary blob into a DOMString using a given codec (UTF-8 in this case). I’ll show two different methods for performing this task.

The original text and its URL-safe representation

I’ll start with a Python snippet defining the original text and creating its URL-safe base64 representation:

# -*- coding: utf-8 -*-
 
from __future__ import unicode_literals
from base64 import urlsafe_b64encode
 
text = """
«küßЌύБЇ”ﻈﻉﻌﻍﻎ㌀㌁㌂❶❷❸⍝⍞⍟⍠ோௌ«ταБЬℓσ»:
n20%of٩(-̮̮̃-̃)۶٩(●̮̮̃•̃)۶٩(͡๏̯͡๏)۶٩(-̮̮̃•̃)!॒॑॓ஙசປຜἑἔℇ∆∇▆▇█
✆✇✈✉✌✍うぇえぉおㆅㆇ㉫㉬㍍㍎㍏沈拾若掠略亮兩凉梁糧
ﯗﯘﯙﯚﯝﯠﯡﯢカキクケコサシス'kosme':"κόσμε"/?#+퟿
"""
 
data = text.encode("utf-8")
datab64 = urlsafe_b64encode(data)
 
print("text length: %s" % len(text))
print("urlsafe base64 representation of the binary data:\n\n%s" % datab64)

The original text (the variable named text) is meant to contain code points from many different Unicode character blocks; various Unicode sample-text resources helped me assemble it.

As you can see, the text is first encoded using the UTF-8 codec. The resulting binary data is then put through urlsafe_b64encode(), yielding a URL-safe byte sequence. Executing this Python script yields the following URL-safe representation of the original text:

CsKra8O8w5_DrsK74oCc0IzPjdCR0IfigJ3vu4jvu4nvu4zvu43vu47jjIDjjIHjjILinbbinbfinbjijZ3ijZ7ijZ_ijaDgr4vgr4zCq8-EzrHQkdCs4oSTz4PCuzoKbjIwJW9m2akoLcyuzK7Mgy3MgynbttmpKOKXj8yuzK7Mg-KAosyDKdu22akozaHguY_Mr82h4LmPKdu22akoLcyuzK7Mg-KAosyDKSHgpZHgpZLgpZPgrpngrprgupvgupzhvJHhvJTihIfiiIbiiIfilobilofilogK4pyG4pyH4pyI4pyJ4pyM4pyN44GG44GH44GI44GJ44GK44aF44aH44mr44ms442N442O442P76Wy76Wz76W076W176W276W376W476W576W676W7Cu-vl--vmO-vme-vmu-vne-voO-voe-vou-9tu-9t--9uO-9ue-9uu-9u--9vO-9vSdrb3NtZSc6Is664b25z4POvM61Ii8_Iyvtn78K

The Python script also tells you that the text is 167 characters long, which is a useful reference for later comparison.
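
For reference: inverting the procedure within Python itself is a one-liner (continuing the snippet above), and it is exactly this inversion that the browser-side code below needs to replicate:

from base64 import urlsafe_b64decode

# Decode base64, then decode the binary data with the UTF-8 codec.
assert urlsafe_b64decode(datab64).decode("utf-8") == text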

Decoding the URL-safe representation in the browser

Here is the test document that implements four different methods for obtaining the original text back from the URL-safe representation (it just doesn’t show anything yet!): http://gehrcke.de/files/perm/blog/2015/jsdecode/test.html

Remember, the text we want to transport is encoded in a URL-safe way, so just for fun I want to make use of this fact in this small demonstration here, and communicate the information via the URL. To that end, the test document executes JavaScript code that extracts a string from the anchor/hash part of the URL:

// Get URL-safe text representation from the address bar.
// This is a DOMString type, as indicated by the text_ variable name prefix.
var text_urlsafe_b64data = window.location.hash.substring(1);

Okay, let’s put the test data into the URL (the output from the Python script above):

http://gehrcke.de/files/perm/blog/2015/jsdecode/test.html#CsKra8O8w5_DrsK74oCc0IzPjdCR0IfigJ3vu4jvu4nvu4zvu43vu47jjIDjjIHjjILinbbinbfinbjijZ3ijZ7ijZ_ijaDgr4vgr4zCq8-EzrHQkdCs4oSTz4PCuzoKbjIwJW9m2akoLcyuzK7Mgy3MgynbttmpKOKXj8yuzK7Mg-KAosyDKdu22akozaHguY_Mr82h4LmPKdu22akoLcyuzK7Mg-KAosyDKSHgpZHgpZLgpZPgrpngrprgupvgupzhvJHhvJTihIfiiIbiiIfilobilofilogK4pyG4pyH4pyI4pyJ4pyM4pyN44GG44GH44GI44GJ44GK44aF44aH44mr44ms442N442O442P76Wy76Wz76W076W176W276W376W476W576W676W7Cu-vl--vmO-vme-vmu-vne-voO-voe-vou-9tu-9t--9uO-9ue-9uu-9u--9vO-9vSdrb3NtZSc6Is664b25z4POvM61Ii8_Iyvtn78K

This is a long URL now, indeed. But it is not too long ;-).

When you access the URL above, an HTML document should show up, with four panels, each of which should show the exact same text as initially defined in the Python script above, including all newlines. Each of the individual panels is the result of a different decoding implementation. I recommend looking at the source code of this HTML document; I have commented the different methods sufficiently. I will now quickly go through the different methods.

The first step common to all four methods is to convert from URL-safe base64 encoding to canonical base64 encoding:

function urlsafeb64_to_b64(s) {
  // Replace - with + and _ with /
  return s.replace(/-/g, '+').replace(/_/g, '/');
}
// Create canonical base64 data from whatever Python's urlsafe_b64encode() produced.
var text_b64data = urlsafeb64_to_b64(text_urlsafe_b64data);

Method 1

Starting from there, the ugliest way to obtain the original text is what is designated “Method 1” in the source of the test document:

var text_original_1 = decodeURIComponent(escape(window.atob(text_b64data)));

This is a hacky all-in-one solution. It uses the deprecated escape() function and implicitly performs the UTF-8 decoding. Who the hell can explain why this really works? Monsur can: monsur.hossa.in/2012/07/20/utf-8-in-javascript.html. However, this really is a black-magic approach with ugly semantics, and the tools involved were never designed for this purpose. There is no specification that guarantees proper behavior. I recommend not using this method, especially because of its bad semantics and its use of a now-deprecated function. However, if you love to confuse your peers with cryptic one-liners, then this is the way to go.

Method 2

This article states that there is “a better, more faithful and less expensive solution” involving native binary data types. In my opinion, this distinct two-step process is easy to understand and has quite clear semantics. So, my favorite decoding scheme is what is designated “Method 2” in the source of the test document:

// Step 1: decode the base64-encoded data into a binary blob (a Uint8Array).
var binary_utf8data_1 = base64DecToArr(text_b64data);
// Step 2: decode the binary data into a DOMString. Use a custom UTF-8 decoder.
var text_original_2 = UTF8ArrToStr(binary_utf8data_1);

The functions base64DecToArr() and UTF8ArrToStr() are lightweight custom implementations, taken from the Mozilla Knowledge Base. They should work in old as well as modern browsers, and should have decent performance. The custom functions are not really lengthy and can be shipped with your application. Just look at the source of test.html.

Method 3

The custom UTF8ArrToStr() function used in method 2 can at some point be replaced by a TextDecoder()-based method, which is part of the so-called Encoding Standard. This is a WHATWG living standard, still labeled experimental. Nevertheless, it is already available in modern Firefox and Chrome versions, and there is also a promising polyfill project on GitHub. Prior to using TextDecoder(), the base64-encoded data (a DOMString) must still be decoded into binary data, so the first part is the same as in method 2:

var binary_utf8data_1 = base64DecToArr(text_b64data);
var text_original_3 = new TextDecoder("utf-8").decode(binary_utf8data_1);

Method 4

The fourth method I am showing here uses an alternative approach for base64DecToArr(), i.e. for decoding the base64-encoded data (DOMString) into binary data (Uint8Array). It is shorter and easier to understand than base64DecToArr(), but presumably also slower. Let’s look at base64_to_uint8array() (based on this answer on StackOverflow):

function base64_to_uint8array(s) {
  var byteChars = atob(s);
  var l = byteChars.length;
  var byteNumbers = new Array(l);
  for (var i = 0; i < l; i++) {
    byteNumbers[i] = byteChars.charCodeAt(i);
  }
  return new Uint8Array(byteNumbers);
}

Let’s combine it with the already introduced UTF8ArrToStr() (see method 2):

var binary_utf8data_2 = base64_to_uint8array(text_b64data);
var text_original_4 = UTF8ArrToStr(binary_utf8data_2);

Final words

By carefully looking at the rendered test document, one can infer that all four methods work for the test data used here. In my application scenario I am currently using method 4, since the conversion I am doing there is not performance-critical (otherwise I would use method 2). A disadvantage of method 4 is its use of the atob() function, which is not available in IE 8 and 9. If this were a core component of an application, I’d probably start using the TextDecoder()-based method with a polyfill for older browsers. The disadvantage there is that the polyfill itself is quite a heavy dependency.

I hope these examples are of use, let me know what you think.

In-memory SQLite database and Flask: a threading trap

In a Flask development environment people love to use SQLAlchemy with Python’s built-in sqlite backend. When configured with

app.config['SQLALCHEMY_DATABASE_URI'] = "sqlite://"

the database, created via db = SQLAlchemy(app), is stored in memory instead of being persisted to disk. This is a nice feature for development and testing.

Once some model classes have been defined, at some point the tables should actually be created within the database, which is what the db.create_all() call is usually used for. But where should this be invoked, and what difference does it make? Let’s look at two possible places:

  • In the app’s bootstrap code, before app.run() is called (such as in the __init__.py file of your application package).
  • After the development server has been started via app.run(), in a route handler.

What difference does it make? A big one, in certain cases: the call happens in different threads (you can easily convince yourself of this by calling threading.current_thread() in the two places; the results differ).

You might think this should be an implementation detail of how Flask’s development server works under the hood. I agree, and usually this detail is not important to be aware of. However, in the case of an in-memory SQLite database this implementation detail makes a significant difference: the two threads see independent databases (under the hood, SQLAlchemy maintains a separate SQLite connection per thread, and every fresh connection to an in-memory database starts out with a brand-new, empty database). And the mean thing is: the same db object represents different databases, depending on the thread from which it is being used.

Example: say you call db.create_all() and pre-populate the database in the main thread, before invoking app.run(). When you then access the database via the same db object (equivalent id()) from within a route handler (which does not run in the main thread), the interaction takes place with a different database, which of course is still empty: the tables have not been created there. A simple read/query would yield unexpected results.
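
A minimal, self-contained sketch that reproduces the trap (the model and route names are made up for illustration; Flask-SQLAlchemy must be installed):

from flask import Flask
from flask_sqlalchemy import SQLAlchemy

app = Flask(__name__)
app.config['SQLALCHEMY_DATABASE_URI'] = "sqlite://"
db = SQLAlchemy(app)


class Item(db.Model):
    id = db.Column(db.Integer, primary_key=True)


@app.route('/')
def index():
    # This does not run in the main thread and therefore sees a
    # different (still empty) in-memory database: expect
    # OperationalError: no such table: item.
    return str(Item.query.count())


if __name__ == '__main__':
    # Executed in the main thread: the table ends up in *this*
    # thread's in-memory database only.
    db.create_all()
    app.run()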

Again, in other words: although you push around and interact with the same db object in your modules, you might still access different databases, depending on the thread from which the interaction takes place.

This is not well-documented and leads to nasty errors. I guess most often people run into OperationalError: (OperationalError) no such table:, although they think they have already created the corresponding table.

There are two reasonable solutions:

  • Bootstrap the database from within a route handler or an equivalent hook. All route handlers run within the same thread, so all route handlers interact with the same database (see the sketch after this list).
  • Do not use the in-memory feature. Then db represents the same physical database across threads.
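
A sketch of the first approach, using Flask’s before_first_request hook (which runs in the serving thread, right before the first request is dispatched; the function name is made up):

@app.before_first_request
def bootstrap_database():
    # Runs in the same thread context as the route handlers, so the
    # tables are created in the database the handlers actually see.
    db.create_all()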

People have been bitten by this, as can be seen in several StackOverflow threads.

This has been observed with Python 2.7.5 and Flask 0.10.1.

How to set up a 64 bit version of NumPy on Windows

A short note on a complex topic. Feel free to shoot questions at me in the comments.

There are no official NumPy 64 bit builds available for Windows. In fact, 64 bit Windows is not officially supported by NumPy. So, if you are serious about your project, you need to either consider building on top of Unix-like platforms and inherit external quality assurance, or (on Windows) you need to anticipate issues of various kinds, and do extensive testing on your own. One of the reasons is that there is no adequate (open source, reliable, feature-rich) tool chain for creating proper 64 bit builds of NumPy on Windows (further references: numpy mailing list thread, Intel forums). Nevertheless, in many cases the non-official builds provided by Christoph Gohlke, created with Intel’s commercial compiler suite, are a working solution. It is up to you to understand the license implications and whether you want or can use these builds. I love to use these builds.

The following steps show a very simple way to get NumPy binaries for the AMD64 architecture installed on top of CPython 3(.4). These instructions are valid only for Python installed with an official CPython installer, obtained from python.org.

1) Install CPython for AMD64 arch

Download a 64 bit MSI installer file from python.org. The crucial step is to get an installer for the AMD64 (x86-64) architecture, usually called “Windows x86-64 MSI installer”. I have chosen python-3.4.2.amd64.msi. Run the setup.

2) Upgrade pip

Recent versions of Python 3 ship with pip, but you should use the newest version for proper wheel support. Open cmd.exe, and run

C:\> pip install pip --upgrade

Verify:

C:\> pip --version
pip 6.0.8 from C:\Python34\lib\site-packages (python 3.4)

The latter verifies that this pip i) is up-to-date, and ii) belongs to our target CPython version (multiple versions of CPython can be installed on any given system, and the correspondence between pip and a certain Python build is sometimes not obvious).

Note: The CPython installer should properly adjust your PATH environment variable so that python as well as pip entered at the command line correspond to what has been installed by the installer. It is, however, possible that you have somehow lost control of your environment by installing too many different things in an unreasonable order. In that case, you might have to manually adjust your PATH so that it prioritizes the executables in C:\Python34\Scripts (or wherever you have installed your 64 bit Python version to).
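
To check which executables actually take precedence on your PATH, the where command helps (the output shown here is what one would hope for; actual paths depend on your installation):

C:\> where python
C:\Python34\python.exe

C:\> where pip
C:\Python34\Scripts\pip.exe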

3) Download wheel of NumPy build for AMD64 on Windows

Navigate to lfd.uci.edu/~gohlke/pythonlibs/#numpy and select a build for your Python version and for AMD64. I chose numpy-1.9.2rc1+mkl-cp34-none-win_amd64.whl.

4) Install the wheel via pip

On the command line, navigate to the directory where you have downloaded the wheel file to. Install it:

C:\Users\user\Desktop>pip install "numpy-1.9.2rc1+mkl-cp34-none-win_amd64.whl"
Processing c:\users\user\desktop\numpy-1.9.2rc1+mkl-cp34-none-win_amd64.whl
Installing collected packages: numpy
 
Successfully installed numpy-1.9.2rc1

The simplicity of this approach is kind of new. Actually, this simplicity is why wheels were designed in the first place! Installing pre-built binaries with pip was not possible with the “old” egg package format. So, older tutorials/descriptions of this kind might point to MSI installers or dubious self-extracting installers. These times are over now, and this is also the main reason why I am writing this blog post.

5) Verify

>>> import numpy
>>> numpy.__version__
'1.9.2rc1'

Great.
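
To additionally confirm that both the interpreter and the NumPy build really are 64 bit, a quick sanity check along these lines should work (pointer-sized integers are 8 bytes wide on AMD64):

>>> import sys
>>> sys.maxsize > 2**32   # True on a 64 bit CPython
True
>>> import numpy
>>> numpy.dtype(numpy.intp).itemsize   # 8 on a 64 bit NumPy build
8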

Third-party Python distributions

I do not want to leave unmentioned that there are very nice third-party Python distributions out there (i.e. not provided by the Python Software Foundation) that include commercially supported and properly tested NumPy/SciPy builds for 64 bit Windows platforms. Most of these third-party vendors have a commercial background, and dictate their own licenses with respect to the usage of their Python distribution. For non-commercial purposes, most of them can be used for free. The following distributions provide a working solution:

  • Anaconda (by Continuum Analytics)
  • Canopy (by Enthought)
  • ActivePython (by ActiveState)

All three distributions are recommendable from a technical point of view (I cannot tell whether their license models/restrictions are an issue for you or not). They all come as 64 bit builds. I am not entirely sure whether Enthought and ActiveState build NumPy against Intel’s Math Kernel Library. In the case of Anaconda, this definitely is not included in the free version; it can be obtained explicitly, for $29 (as the “MKL Optimizations” package).

Songkick events for Google’s Knowledge Graph

Google can display upcoming concert events in the Knowledge Graph of musical artists (as announced in March 2014). This is a great feature, and probably many people in the field of music marketing, especially record labels, aim to get this kind of data into the Knowledge Graph for their artists. However, Google does not magically find this data on its own. It needs to be informed, by means of a special data structure (in the recently standardized JSON-LD format) embedded within the artist’s website.

While of great interest to record labels, finding a proper technical solution for creating and providing this data to Google can still be a challenge. I have prepared a web service that greatly simplifies the process of generating the required data structure. It pulls concert data from Songkick and translates it into the JSON-LD representation required by Google. In the next section I explain the process by means of an example.

Web service usage example

The concert data of the band Milky Chance is published and maintained via Songkick, a service that many artists use. The following page shows, among other things, all upcoming events of Milky Chance: http://www.songkick.com/artists/6395144-milky-chance. My web service translates the data held by Songkick into the data structure that Google requires in order to make this concert data appear in their Knowledge Graph. This is the corresponding service URL that needs to be called to retrieve the data:

https://jsonld-events.appspot.com/api/songkick/artist?skid=6395144&name=Milky+Chance&weburl=http%3A%2F%2Fmilkychanceofficial.com

That URL is made up of the base URL of the web service, the Songkick ID of the artist (6395144 in this case), the artist name, and the artist website URL. Try accessing that service URL in your browser. It currently yields this:

[
  {
    "@context": "http://schema.org", 
    "@type": "MusicEvent", 
    "name": "Milky Chance", 
    "startDate": "2014-12-12", 
    "url": "http://www.songkick.com/concerts/21926613-milky-chance-at-max-nachttheater?utm_source=30793&utm_medium=partner", 
    "location": {
      "address": {
        "addressLocality": "Kiel", 
        "postalCode": "24116", 
        "streetAddress": "Eichhofstra\u00dfe 1", 
 
[ ... SNIP ~ 1000 lines of data ... ]
 
    "performer": {
      "sameAs": "http://milkychanceofficial.com", 
      "@type": "MusicGroup", 
      "name": "Milky Chance"
    }
  }
]

This piece of data needs to be included in the HTML source code of the artist website. Google then automatically finds this data and eventually displays the concert data in the Knowledge Graph (within a couple of days). That’s it — pretty simple, right? The good thing is that this method does not require layout changes to your website. This data can literally be included in any website, right now.
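
For completeness: the way to embed such a block is a script element of type application/ld+json anywhere within the page, which is the mechanism Google’s structured data documentation prescribes. Schematically (the event data is abbreviated here):

<script type="application/ld+json">
[
  {
    "@context": "http://schema.org",
    "@type": "MusicEvent",
    ...
  }
]
</script>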

That is what happened in the case of Milky Chance: some time ago, the data created by the web service was fed into the Milky Chance website. Consequently, their concert data is displayed in their Knowledge Graph. See for yourself: access https://www.google.com/search?q=milky+chance and look out for upcoming events on the right-hand side. Screenshot:


Google Knowledge Graph generated for Milky Chance. Note the upcoming events section: for this to appear, Google needs to find the event data in special markup within the artist’s website.

So, in summary, when would you want to use this web service?

  • You have an interest in presenting the concert data of an artist in Google’s Knowledge Graph (you are a record label or are otherwise interested in improved marketing and user experience).
  • You have access to the artist website or know someone who has access.
  • The artist’s concert data is already present on Songkick or will be present in the future.

Then all you need is a specialized service URL, which you can generate with a small form I have prepared for you here: http://gehrcke.de/google-jsonld-events

Background: why Songkick?

Of course, the event data shown in the Knowledge Graph should be up to date and in sync with presentations of the same data in other places (bands usually display their concert data in many places: on Facebook, on their website, within third-party services, …). Fortunately, a lot of bands actually do manage this data in a central place (any other solution would be tedious). This central place/platform/service often is Songkick, because Songkick did a really nice job of providing people with what they need. My web service reflects recent changes made within Songkick.

Technical detail

The core of the web service is a piece of software that translates the data provided by Songkick into the JSON-LD data as required and specified by Google. The Songkick data is retrieved via Songkick’s JSON API (I applied for and got a Songkick API key). Large parts of this software deal with the unfortunate business of data format translation while handling certain edge cases.
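
Conceptually, the translation step is simple. Here is a heavily stripped-down sketch (the Songkick field names used here are illustrative assumptions, and the real code handles many more edge cases, especially around the venue address):

def event_to_jsonld(event, venue, artist_name, artist_weburl):
    # `event` and `venue` are dictionaries as obtained from the Songkick
    # JSON API; the field names used here are assumptions for
    # illustration, not a verified reproduction of the service's code.
    return {
        "@context": "http://schema.org",
        "@type": "MusicEvent",
        "name": artist_name,
        "startDate": event["start"]["date"],
        "url": event["uri"],
        "location": {
            "@type": "Place",
            "address": {
                "@type": "PostalAddress",
                "addressLocality": venue["city"]["displayName"],
                "postalCode": venue.get("zip", ""),
                "streetAddress": venue.get("street", ""),
            },
        },
        "performer": {
            "@type": "MusicGroup",
            "name": artist_name,
            "sameAs": artist_weburl,
        },
    }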

The service is implemented in Python and hosted on Google App Engine. Its architecture is quite well thought through (for instance, it uses memcache and asynchronous urlfetch wherever possible). It is ready to scale, so to speak. Some technical highlights:

  • The web service enforces transport encryption (HTTPS).
  • Songkick back-end is queried via HTTPS only.
  • Songkick back-end is queried concurrently whenever possible.
  • Songkick responses are cached for several hours in order to reduce the load on their service (see the sketch after this list).
  • Responses of this web service are cached for several hours. These are served within milliseconds.
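
To illustrate the caching approach: on App Engine, a cached HTTPS fetch against the Songkick API can look roughly like this (a simplified sketch; the function name, key naming, error handling, and cache lifetime are illustrative):

from google.appengine.api import memcache, urlfetch


def fetch_songkick_json(url, cache_seconds=6 * 3600):
    # Serve the raw response body from memcache when possible; otherwise
    # fetch it from the Songkick API via HTTPS and cache it.
    body = memcache.get(url)
    if body is None:
        result = urlfetch.fetch(url, validate_certificate=True)
        if result.status_code != 200:
            raise IOError("Songkick API returned %s" % result.status_code)
        body = result.content
        memcache.set(url, body, time=cache_seconds)
    return body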

This is an overview of the data flow:

  1. Incoming request, specifying Songkick artist ID, artist name, and artist website.
  2. Using the Songkick API (SKA), all upcoming events are queried for this artist (one or more SKA requests, depending on the number of events).
  3. For each event, the venue ID is extracted, if possible.
  4. All venues are queried for further details (this entails as many SKA requests as there are venue IDs).
  5. A JSON-LD representation of an event is constructed from a combination of
    • event data
    • venue data
    • user-given data (artist name and artist website)
  6. All event representations are combined and returned.

Some notable points in this context:

  • A single request to this web service might entail many requests to the Songkick API. This is why SKA responses are aggressively cached:
    • An example artist with 54 upcoming events requires 2 upcoming-events API requests (two pages, which cannot be requested concurrently) and roughly 50 venue API requests (which can be requested concurrently). Summed up, this means that my web service cannot respond faster than three SKA round trips take.
    • If none of the SKA responses has been cached before, the retrieval of about 2 + 50 SKA responses might easily take about 2 seconds.
    • This web service cannot be faster than Songkick delivers.
  • This web service applies graceful degradation when extracting data from Songkick (many special cases are handled, which is especially relevant for the venue address).

Generate your service URL

This blog post is just an introduction, and sheds some light on the implementation and decision-making. For general reference, I have prepared this document to get you started:

http://gehrcke.de/google-jsonld-events

It contains a web form where you can enter the (currently) three input parameters required for using the service. It returns a service URL for you. This URL points to my application hosted on Google App Engine. Using this URL, the service returns the JSON data that is to be included in an artist’s website. That’s all; it’s really pretty simple.

So, please go ahead and use this tool. I’d love to receive some feedback. Look closely at the data it returns, and keep your eyes open for subtle bugs. If you see something weird, please report it. I am very open to suggestions, and also interested in your questions regarding future plans, release cycle, etc. Also, if you need support for (dynamically) including this kind of data in your artist’s website, feel free to contact me.

gipc 0.5.0 released

I just released gipc 0.5.0. It contains a backwards-incompatible change: the SIGPIPE signal action is not automatically reset to the default action anymore in child processes. This will hopefully satisfy developers expecting the SIGPIPE signal to be ignored, resulting in a proper Python exception when attempting to write to a pipe that has been closed on the read end (if interested, you can follow the related discussion here). Furthermore, this release improves the performance when sending large messages through gipc pipes on Linux. This release also introduces workarounds for two corner case issues present in Mac OS X and FreeBSD.
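
For readers who have not used gipc before, the basic pattern of sending a message from a parent to a child process through a gipc pipe looks like this (closely following the usage example from the gipc documentation):

import gipc


def child(reader):
    # Blocks gevent-cooperatively until a message arrives.
    print(reader.get())


if __name__ == "__main__":
    with gipc.pipe() as (reader, writer):
        p = gipc.start_process(target=child, args=(reader,))
        writer.put("hello from the parent")
        p.join()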

This is the full changelog:

  • Improve large message throughput on Linux (see issue #13).
  • Work around read(2) system call flaw on Mac OS X (see issue #13).
  • Work around signal.NSIG-related problem on FreeBSD (see issue #10).
  • Do not alter SIGPIPE action during child bootstrap (breaking change, see issue #12).

Thanks to Dustin Oprea, bra, John Ricklefs, Heungsub Lee, Miguel Turner, and Alex Besogonov for contributing. As usual, the release is available via PyPI (https://pypi.python.org/pypi/gipc). The documentation and further detail are available at http://gehrcke.de/gipc.