For an AngularJS-based web application I am currently working on, I want to put arbitrary text information into a URL, send it to the client, and have it decoded by the browser. With the right tools this should be a trivial process, shouldn’t it?
The basic idea is:
1. Start with the original text, which is a sequence of Unicode code points.
2. Encode the original text into binary data with the UTF-8 codec.
3. base64-encode the resulting binary data (and replace the URL-unsafe characters + and / with, for instance, - and _).
4. The result is a URL-safe string. Send it. Let’s assume it ends up in a browser (in window.location, for instance).
5. Invert the entire procedure in the browser:
   1. The data arrives as DOMString type (Unicode text, so to speak).
   2. Transform the URL-safe base64 representation to canonical base64 (replace the - and _ characters).
   3. Decode the base64 string into a real binary data type (Uint8Array, for instance).
   4. Decode the binary blob into a DOMString containing the original text, using the UTF-8 codec.
Unfortunately, there is so far neither an official nor an established way to perform steps 5.3 and 5.4 in a browser environment. There is no obvious way to obtain a binary blob from base64-encoded data; further below, I’ll show three different methods for this step. Proceeding from there, I realized that there is also still no established way to decode a binary blob into a DOMString using a given codec (UTF-8 in this case). I’ll show two different methods for that task.
The original text and its URL-safe representation
I’ll start with a Python snippet defining the original text and creating its URL-safe base64 representation:
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
from base64 import urlsafe_b64encode

text = """
«küßЌύБЇ”ﻈﻉﻌﻍﻎ㌀㌁㌂❶❷❸⍝⍞⍟⍠ோௌ«ταБЬℓσ»:
n20%of٩(-̮̮̃-̃)۶٩(●̮̮̃•̃)۶٩(͡๏̯͡๏)۶٩(-̮̮̃•̃)!॒॑॓ஙசປຜἑἔℇ∆∇▆▇█
✆✇✈✉✌✍うぇえぉおㆅㆇ㉫㉬㍍㍎㍏沈拾若掠略亮兩凉梁糧
ﯗﯘﯙﯚﯝﯠﯡﯢカキクケコサシス'kosme':"κόσμε"/?#+
"""

data = text.encode("utf-8")
datab64 = urlsafe_b64encode(data)
print("text length: %s" % len(text))
print("urlsafe base64 representation of the binary data:\n\n%s" % datab64)
The original text (the variable named text) is meant to contain code points from many different Unicode character blocks. The following resources helped me assemble this test text:
- http://stackoverflow.com/q/1343223/145400
- http://www.ltg.ed.ac.uk/~richard/unicode-sample.html
- http://nedbatchelder.com/blog/200310/unicode_test_strings.html
- http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt
As you can see, the text is first encoded using the UTF-8 codec. The resulting binary data is then passed to urlsafe_b64encode(), yielding a URL-safe byte sequence. Running the Python script yields the following URL-safe representation of the original text:
CsKra8O8w5_DrsK74oCc0IzPjdCR0IfigJ3vu4jvu4nvu4zvu43vu47jjIDjjIHjjILinbbinbfinbjijZ3ijZ7ijZ_ijaDgr4vgr4zCq8-EzrHQkdCs4oSTz4PCuzoKbjIwJW9m2akoLcyuzK7Mgy3MgynbttmpKOKXj8yuzK7Mg-KAosyDKdu22akozaHguY_Mr82h4LmPKdu22akoLcyuzK7Mg-KAosyDKSHgpZHgpZLgpZPgrpngrprgupvgupzhvJHhvJTihIfiiIbiiIfilobilofilogK4pyG4pyH4pyI4pyJ4pyM4pyN44GG44GH44GI44GJ44GK44aF44aH44mr44ms442N442O442P76Wy76Wz76W076W176W276W376W476W576W676W7Cu-vl--vmO-vme-vmu-vne-voO-voe-vou-9tu-9t--9uO-9ue-9uu-9u--9vO-9vSdrb3NtZSc6Is664b25z4POvM61Ii8_Iyvtn78K
The Python script also tells you that the text is 167 characters long, which is a useful reference for later comparison.
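For comparison, the encoding side can also be sketched in JavaScript. This is only an illustration under assumptions: the helper names textToUrlSafeB64 and countCodePoints are made up for this example, and it relies on TextEncoder and btoa() being available, which is not guaranteed in older browsers. It also shows that a JavaScript string's .length counts UTF-16 code units, so for characters outside the Basic Multilingual Plane it differs from the number of code points.

```javascript
// Hypothetical helper: text -> UTF-8 bytes -> base64 -> URL-safe base64.
function textToUrlSafeB64(text) {
  var bytes = new TextEncoder().encode(text); // UTF-8 encoded Uint8Array
  var byteString = "";
  for (var i = 0; i < bytes.length; i++) {
    // btoa() expects a "binary string": one character per byte.
    byteString += String.fromCharCode(bytes[i]);
  }
  // Same replacements as Python's urlsafe_b64encode(): + -> - and / -> _
  return btoa(byteString).replace(/\+/g, "-").replace(/\//g, "_");
}

// Hypothetical helper: count code points instead of UTF-16 code units.
function countCodePoints(text) {
  return Array.from(text).length;
}

console.log(textToUrlSafeB64("k\u00fc\u00df")); // input is "küß"; prints a8O8w58=
console.log(textToUrlSafeB64("???"));           // prints Pz8_
console.log("\u{1F600}".length);                // prints 2 (UTF-16 code units)
console.log(countCodePoints("\u{1F600}"));      // prints 1 (code points)
```

Like the Python function, this sketch keeps the trailing = padding.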
Decoding the URL-safe representation in the browser
Here is the test document that implements four different methods for obtaining the original text back from the URL-safe representation (it just doesn’t show anything yet!): http://gehrcke.de/files/perm/blog/2015/jsdecode/test.html
Remember, the text we want to transport is encoded in a URL-safe way, so just for fun I want to make use of this fact in this small demonstration here, and communicate the information via the URL. To that end, the test document executes JavaScript code that extracts a string from the anchor/hash part of the URL:
// Get URL-safe text representation from the address bar.
// This is a DOMString type, as indicated by the text_ variable name prefix.
var text_urlsafe_b64data = window.location.hash.substring(1);
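As a small illustration of where the fragment sits in a URL, the following sketch parses a URL string with the standard URL class instead of reading window.location, so it can also run outside a browser. The function name fragmentOf and the example.invalid address are placeholders invented for this example.

```javascript
// Hypothetical helper: return the fragment of a URL without the leading "#".
function fragmentOf(href) {
  var hash = new URL(href).hash; // "" when there is no fragment, else "#..."
  return hash.substring(1);
}

console.log(fragmentOf("http://example.invalid/test.html#a8O8w58=")); // prints a8O8w58=
console.log(fragmentOf("http://example.invalid/test.html") === "");   // prints true
```

In this sketch the characters of the URL-safe base64 alphabet (letters, digits, -, _ and the = padding) come back unchanged.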
Okay, let’s put the test data into the URL (the output from the Python script above):
This is a long URL now, indeed. But it is not too long ;-).
When you access the URL above, an HTML document should show up. It should have four panels, and each panel should show the exact same text as initially defined in the Python script above, including all newlines. Each panel is the result of a different decoding implementation. I recommend looking at the source code of this HTML document; I have commented the different methods thoroughly. I will now quickly go through them.
The first step common to all four methods is to convert from URL-safe base64 encoding to canonical base64 encoding:
function urlsafeb64_to_b64(s) {
  // Replace - with + and _ with /
  return s.replace(/-/g, '+').replace(/_/g, '/');
}

// Create canonical base64 data from whatever Python's urlsafe_b64encode() produced.
var text_b46data = urlsafeb64_to_b64(text_urlsafe_b64data);
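Python's urlsafe_b64encode() keeps the trailing = padding. If that padding gets stripped somewhere along the way (an assumption for this sketch, not something the test document deals with), it can be restored before decoding. The function name urlsafeB64ToPaddedB64 is made up for this example.

```javascript
// Hypothetical helper: URL-safe base64 -> canonical base64 with padding restored.
function urlsafeB64ToPaddedB64(s) {
  var b64 = s.replace(/-/g, "+").replace(/_/g, "/");
  var remainder = b64.length % 4;
  if (remainder === 1) {
    // A single leftover character can never be valid base64.
    throw new Error("invalid base64 length");
  }
  // remainder 2 -> append "==", remainder 3 -> append "=", remainder 0 -> nothing
  return remainder === 0 ? b64 : b64 + "====".substring(remainder);
}

console.log(urlsafeB64ToPaddedB64("a8O8w58")); // prints a8O8w58=
console.log(urlsafeB64ToPaddedB64("Pz8_"));    // prints Pz8/
```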
Method 1
Starting from there, the ugliest way to obtain the original text is what is designated “Method 1” in the source of the test document:
var text_original_1 = decodeURIComponent(escape(window.atob(text_b46data)));
This is a hacky all-in-one solution. It uses the deprecated escape() function and implicitly performs the UTF-8 decoding. Who the hell can explain why this really works? Monsur can: monsur.hossa.in/2012/07/20/utf-8-in-javascript.html. However, this really is a black-magic approach with ugly semantics, and the tools involved were never designed for this purpose. There is no specification that guarantees proper behavior. I recommend not using this method, especially because of its really bad semantics and its use of a now-deprecated function. However, if you love to confuse your peers with cryptic one-liners, then this is the way to go.
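To make the one-liner a little less mysterious, here is the same chain split into its three steps, using the short input a8O8w58= (the base64 form of the UTF-8 bytes of "küß"). This is just a sketch for illustration; the intermediate variable names are made up.

```javascript
// Step 1: atob() yields a "binary string" with one character per byte.
var byteString = atob("a8O8w58=");

// Step 2: escape() percent-encodes (among others) every byte above 0x7F as %XX.
var percentEncoded = escape(byteString);

// Step 3: decodeURIComponent() interprets the %XX sequences as UTF-8.
var decodedText = decodeURIComponent(percentEncoded);

console.log(byteString.length); // prints 5 (bytes: 6B C3 BC C3 9F)
console.log(percentEncoded);    // prints k%C3%BC%C3%9F
console.log(decodedText);       // prints küß
```

So the UTF-8 decoding is really done by decodeURIComponent(); escape() only serves to bring the bytes into the percent-encoded form it expects.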
Method 2
This article states that there is “a better, more faithful and less expensive solution” involving native binary data types. In my opinion, this distinct two-step process is easy to understand and has quite clear semantics. So, my favorite decoding scheme is what is designated “Method 2” in the source of the test document:
// Step 1: decode the base64-encoded data into a binary blob (a Uint8Array).
var binary_utf8data_1 = base64DecToArr(text_b46data);

// Step 2: decode the binary data into a DOMString. Use a custom UTF-8 decoder.
var text_original_2 = UTF8ArrToStr(binary_utf8data_1);
The functions base64DecToArr() and UTF8ArrToStr() are lightweight custom implementations, taken from the Mozilla Knowledge Base. They should work in old as well as modern browsers, and should perform decently. The custom functions are not really lengthy and can be shipped with your application. Just look at the source of test.html.
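To give an idea of what such a custom UTF-8 decoder has to do, here is a minimal sketch. It is not the Mozilla implementation of UTF8ArrToStr(): the name utf8BytesToString is invented, it relies on String.fromCodePoint() (ES2015), and it only checks the lead and continuation byte patterns, so overlong encodings and encoded surrogates are not rejected.

```javascript
// Hypothetical minimal UTF-8 decoder (assumes mostly well-formed input).
function utf8BytesToString(bytes) {
  var out = "";
  var i = 0;
  while (i < bytes.length) {
    var lead = bytes[i];
    var codePoint;
    var extra; // number of continuation bytes that follow the lead byte
    if (lead < 0x80) {
      codePoint = lead; extra = 0;        // 0xxxxxxx
    } else if ((lead & 0xe0) === 0xc0) {
      codePoint = lead & 0x1f; extra = 1; // 110xxxxx
    } else if ((lead & 0xf0) === 0xe0) {
      codePoint = lead & 0x0f; extra = 2; // 1110xxxx
    } else if ((lead & 0xf8) === 0xf0) {
      codePoint = lead & 0x07; extra = 3; // 11110xxx
    } else {
      throw new Error("invalid UTF-8 lead byte at index " + i);
    }
    for (var k = 1; k <= extra; k++) {
      var next = bytes[i + k]; // undefined past the end, which fails the check
      if ((next & 0xc0) !== 0x80) {
        throw new Error("invalid or missing continuation byte at index " + (i + k));
      }
      codePoint = (codePoint << 6) | (next & 0x3f); // append 6 payload bits
    }
    out += String.fromCodePoint(codePoint);
    i += extra + 1;
  }
  return out;
}

console.log(utf8BytesToString(new Uint8Array([0x6b, 0xc3, 0xbc, 0xc3, 0x9f]))); // prints küß
console.log(utf8BytesToString(new Uint8Array([0xe2, 0x9c, 0x88])));             // prints ✈
```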
Method 3
The custom UTF8ArrToStr() function used in method 2 can at some point be replaced by a TextDecoder()-based method, which is part of the Encoding standard. This is a WHATWG living standard and is still labeled as an experimental feature. Nevertheless, it is already available in modern Firefox and Chrome versions, and there is also a promising polyfill project on GitHub. Prior to using TextDecoder(), the base64-encoded data (a DOMString) must still be decoded into binary data, so the first part is the same as in method 2:
var binary_utf8data_1 = base64DecToArr(text_b46data);
var text_original_3 = new TextDecoder("utf-8").decode(binary_utf8data_1);
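One detail worth knowing about TextDecoder is how it treats malformed input. The following sketch, with made-up variable names, contrasts the default behavior with the fatal option defined by the Encoding standard. It assumes an environment where TextDecoder is available.

```javascript
var validBytes = new Uint8Array([0x6b, 0xc3, 0xbc, 0xc3, 0x9f]); // UTF-8 for "küß"
var brokenBytes = new Uint8Array([0xff]);                        // never valid in UTF-8

var lenientDecoder = new TextDecoder("utf-8");
var strictDecoder = new TextDecoder("utf-8", { fatal: true });

console.log(lenientDecoder.decode(validBytes)); // prints küß

// Default mode: malformed input is replaced by U+FFFD (the replacement character).
console.log(lenientDecoder.decode(brokenBytes) === "\ufffd"); // prints true

// Fatal mode: malformed input raises a TypeError instead.
try {
  strictDecoder.decode(brokenBytes);
} catch (e) {
  console.log(e instanceof TypeError); // prints true
}
```

For data that went through a URL, the fatal mode may be the safer choice, because a truncated or mangled payload then fails loudly instead of silently producing replacement characters.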
Method 4
The fourth method I am showing here uses an alternative to base64DecToArr(), i.e. for decoding the base64-encoded data (DOMString) into binary data (Uint8Array). It is shorter and easier to understand than base64DecToArr(), but presumably also slower. Let’s look at base64_to_uint8array() (based on this answer on StackOverflow):
function base64_to_uint8array(s) {
  var byteChars = atob(s);
  var l = byteChars.length;
  var byteNumbers = new Array(l);
  for (var i = 0; i < l; i++) {
    byteNumbers[i] = byteChars.charCodeAt(i);
  }
  return new Uint8Array(byteNumbers);
}
Let’s combine it with the already introduced UTF8ArrToStr() (see method 2):
var binary_utf8data_2 = base64_to_uint8array(text_b46data);
var text_original_4 = UTF8ArrToStr(binary_utf8data_2);
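In environments that support the ES2015 typed-array helpers, the loop in base64_to_uint8array() can be written more compactly. The following is an alternative sketch, not the code from the test document; base64ToBytes is a made-up name, and atob() is still required.

```javascript
// Hypothetical compact variant of base64_to_uint8array().
function base64ToBytes(s) {
  // atob() returns one character per byte; map each character to its byte value.
  return Uint8Array.from(atob(s), function (c) {
    return c.charCodeAt(0);
  });
}

var demoBytes = base64ToBytes("a8O8w58=");
console.log(Array.from(demoBytes).join(",")); // prints 107,195,188,195,159
```

This avoids the intermediate plain Array, but whether it is actually faster would have to be measured.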
Final words
By carefully looking at the rendered test document, one can infer that all four methods work for the test data used here. In my application scenario I am currently using method 4, since the conversion I am doing there is not performance-critical (if it were, I would use method 2). A disadvantage of method 4 is its use of the atob() function, which is not available in IE 8 and 9. If this were a core component of an application, I’d probably start using the TextDecoder()-based method with a polyfill for older browsers. The disadvantage there is that the polyfill itself is quite a heavy dependency.
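The trade-offs above can be folded into a single helper that prefers TextDecoder when it exists and otherwise falls back to the escape() trick from method 1. This is only a sketch under assumptions: the name decodeUrlSafeText and the forceFallback parameter (there purely to exercise the fallback path) are invented, and atob() must be available, so IE 8 and 9 would still need a substitute such as base64DecToArr().

```javascript
// Hypothetical all-in-one decoder: URL-safe base64 -> original text.
function decodeUrlSafeText(urlSafeB64, forceFallback) {
  // URL-safe base64 -> canonical base64 -> "binary string" (one char per byte)
  var b64 = urlSafeB64.replace(/-/g, "+").replace(/_/g, "/");
  var byteString = atob(b64);

  if (!forceFallback && typeof TextDecoder === "function") {
    // Preferred path: real binary data, decoded by the Encoding standard API.
    var bytes = new Uint8Array(byteString.length);
    for (var i = 0; i < byteString.length; i++) {
      bytes[i] = byteString.charCodeAt(i);
    }
    return new TextDecoder("utf-8").decode(bytes);
  }

  // Fallback path: the method 1 trick, relying on the deprecated escape().
  return decodeURIComponent(escape(byteString));
}

console.log(decodeUrlSafeText("a8O8w58="));       // prints küß
console.log(decodeUrlSafeText("a8O8w58=", true)); // prints küß
```

Note that the two paths differ for malformed input: the TextDecoder path yields replacement characters, while decodeURIComponent() throws a URIError.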
I hope these examples are of use. Let me know what you think.