Category Archives: Companies & Products

Download article as PDF file from Elsevier’s ScienceDirect via command line (curl)

When we are not in the office, we often cannot access scientific literature directly, because access control is usually based on IP addresses. However, we usually have SSH access to the university network. Logged in to a machine in the university network, we should (in theory) be able to access a given article. Most of the time it is the PDF file that we are interested in, not the “web page” corresponding to the article. So, can’t we just run $ curl http://whatever.com/article.pdf to get that file? Most of the time this does not work, because access to journal articles usually happens through rather complex web sites, such as Elsevier’s ScienceDirect:

ScienceDirect is a leading full-text scientific database offering journal articles and book chapters from nearly 2,500 journals and 26,000 books.

Such web sites add a considerable amount of complexity to the technical task of downloading a file. The problem usually starts with obtaining the direct URL to the PDF file. HTTP redirection and cookies are usually involved as well. Often, the only solution people see is to set up a VPN, use a fully fledged browser through that VPN, and let the browser deal with the complexity.

However, I prefer to get back to the basics and always strive to somehow find a direct URL to the PDF file to then download it via curl or wget.

This is my solution for Elsevier’s ScienceDirect:

Say, for instance, you wish to download the PDF version of this article: http://www.sciencedirect.com/science/article/pii/S0169433215012131

Then all you need is that URL and the following commands executed on a common Linux system:

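export SDURL="http://www.sciencedirect.com/science/article/pii/S0169433215012131"
# Fetch the article page, store its cookies (-c), and extract the PDF URL from the HTML.
curl -Lc cookiejar "${SDURL}" | grep pdfurl | perl -pe 's|.* pdfurl=\"(.*?)\".*|$1|' > pdfurl
# Send the stored cookies back (-b) when requesting the PDF itself.
curl -L -b cookiejar -c cookiejar "$(cat pdfurl)" > article.pdf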

The method first parses the HTML source code of the main page corresponding to the article and extracts a URL to the PDF file. At the same time, it stores the HTTP cookie(s) set by the web server when accessing that page. These cookies are then re-used when accessing the PDF file directly. This has reproducibly worked for me.

If it does not work for you, I recommend having a look at the file pdfurl to see whether that part of the process has led to a meaningful result. Obviously, the second step can only succeed after a proper URL to the PDF file has been obtained.

This snippet should not be treated as a black box. Please execute it in an empty directory. Also note that this snippet only works subject to the condition that ScienceDirect keeps functioning the way it does right now (which most likely is the case for the next couple of months or years).
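For readers who prefer a script over a shell pipeline, the same two-step idea can be sketched in Python using only the standard library. This is a rough sketch, not a tested replacement: it assumes, like the commands above, that the article page contains an attribute of the form pdfurl="...". The function names are made up for illustration, and unlike the perl expression it picks the first match in the page rather than the last one on a line.

```python
import re
import urllib.request
from http.cookiejar import CookieJar

# Assumption: the article page contains an attribute of the form pdfurl="...",
# which is what the grep/perl pipeline relies on as well.
PDFURL_PATTERN = re.compile(r'pdfurl="(.*?)"')


def extract_pdf_url(html):
    """Return the first pdfurl="..." value found in the HTML, or None."""
    match = PDFURL_PATTERN.search(html)
    return match.group(1) if match else None


def download_pdf(article_url, target_path):
    """Fetch the article page, then the PDF, sharing cookies between both requests."""
    # One opener with one cookie jar plays the role of the cookiejar file.
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(CookieJar())
    )
    with opener.open(article_url) as response:
        html = response.read().decode("utf-8", errors="replace")
    pdf_url = extract_pdf_url(html)
    if pdf_url is None:
        raise RuntimeError("no pdfurl attribute found in the article page")
    with opener.open(pdf_url) as response, open(target_path, "wb") as out:
        out.write(response.read())
```

Calling download_pdf(article_url, "article.pdf") on a machine inside the university network would correspond to the two curl invocations. If it raises the RuntimeError, the extraction step is the part to inspect.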

Don’t hesitate to get back to me if you have any questions!

Insights into SoundCloud culture

SoundCloud (cf. Wikipedia) is a young Berlin-based company “under the laws of England & Wales”, with Swedish origins. Six years ago, in 2009, they secured their first major funding. Since then, they have experienced tremendous growth and have been able to raise investment capital regularly. To date, SoundCloud has put itself into a unique position with a convincing product (which I use and pay for myself), but it can also be considered a competitor of big brands such as Spotify and Beats Music. In fact, according to the Wall Street Journal, SoundCloud can be expected to join the party of billion-dollar IT companies quite soon:

SoundCloud, a popular music and audio-sharing service, is in discussions to raise about $150 million in new financing at a valuation that is expected to top $1.2 billion, according to two people with knowledge of the negotiations.

With these facts in mind, it is impressive to hear that SoundCloud still employs only 300 people, in just a handful of offices around the world. Just like me, you might be curious about details of this kind, and about the SoundCloud story itself. So, I was really eager to listen to Episode 17 of “Hipster & Hack”, featuring an interview with David Noël (Twitter profile, LinkedIn profile). David has been with SoundCloud for six years now and currently leads Internal Communications. He clearly is in a position to provide authoritative information about how SoundCloud’s vision was translated into reality over time, and also about how culture and communication within SoundCloud evolved. The latter is what he mainly talks about in the interview, providing insights into the structure and tools applied for defining a culture, for keeping it under control, and for communicating it to employees from the very first moment until even after they have left the company. David defines culture as the living manifestation of core values and arrives at insightful statements such as

Living your values is your culture at any moment in time.

In the interview, we learn that one of SoundCloud’s core values is being open, in the context of internal communications. The culture and communication topic really seems to have a high priority in the company, judging by methods like the “all-hands” meeting that David refers to in the interview. Personally, I cannot overstate how much I value this, coming from classical research where such elements are often simply neglected.

So, if that raises your interest, I recommend listening to three quite likable guys here (listening to minutes 4 to 29 suffices, the rest is enjoyable overhead ;-)):

Google authorship feature deactivated

I just realized that the Google authorship feature (by which web content could be related to a Google+ profile) had been disabled in summer 2014. The feature was introduced not long before that and the web ecosystem followed with enthusiasm: content management systems like WordPress offered support (at least via plugins), and the SEO media response was positive. Many articles were published on the importance and usage of this feature, such as:

And then, suddenly, a posting on Google+:

[…] With this in mind, we’ve made the difficult decision to stop showing authorship in search results.

Another posting from John Mueller:

Edit: In the meantime, we’ve decided to remove authorship completely

What is left is the URL http://plus.google.com/authorship which redirects to https://support.google.com/webmasters/answer/6083347, which shows nothing but:

Authorship markup is no longer supported in web search.
To learn about what markup you can use to improve search results, visit rich snippets.

What’s left are many websites containing useless markup. Garbage, and it will probably remain for years. I just deactivated my Google Author Link WordPress plugin. What a waste of time, for so many people. For those interested, the removal of this feature is discussed in some depth in this article.

Songkick events for Google’s Knowledge Graph

Google can display upcoming concert events in the Knowledge Graph of musical artists (as announced in March 2014). This is a great feature, and probably many people in the field of music marketing and especially record labels aim to get this kind of data into the Knowledge Graph for their artists. However, Google does not magically find this data on its own. It needs to be informed, with a special kind of data structure (in the recently standardized JSON-LD format) contained within the artist’s website.

While this is of great interest to record labels, finding a proper technical solution to create and provide this data to Google might still be a challenge. I have prepared a web service that greatly simplifies the process of generating the required data structure. It pulls concert data from Songkick and translates it into the JSON-LD representation required by Google. In the next section I explain the process by means of an example.

Web service usage example

The concert data of the band Milky Chance is published and maintained via Songkick, a service that many artists use. The following website shows, among other things, all upcoming events of Milky Chance: http://www.songkick.com/artists/6395144-milky-chance. My web service translates the data held by Songkick into the data structure that Google requires in order to make this concert data appear in their Knowledge Graph. This is the corresponding service URL that needs to be called to retrieve the data:

https://jsonld-events.appspot.com/api/songkick/artist?skid=6395144&name=Milky+Chance&weburl=http%3A%2F%2Fmilkychanceofficial.com

That URL is made up of the base URL of the web service, the Songkick ID of the artist (6395144 in this case), the artist name, and the artist website URL. Try accessing this service URL in your browser. It currently yields this:

[
  {
    "@context": "http://schema.org", 
    "@type": "MusicEvent", 
    "name": "Milky Chance", 
    "startDate": "2014-12-12", 
    "url": "http://www.songkick.com/concerts/21926613-milky-chance-at-max-nachttheater?utm_source=30793&utm_medium=partner", 
    "location": {
      "address": {
        "addressLocality": "Kiel", 
        "postalCode": "24116", 
        "streetAddress": "Eichhofstra\u00dfe 1", 
 
[ ... SNIP ~ 1000 lines of data ... ]
 
    "performer": {
      "sameAs": "http://milkychanceofficial.com", 
      "@type": "MusicGroup", 
      "name": "Milky Chance"
    }
  }
]

This piece of data needs to be included in the HTML source code of the artist website. Google then automatically finds this data and eventually displays the concert data in the Knowledge Graph (within a couple of days). That’s it — pretty simple, right? The good thing is that this method does not require layout changes to your website. This data can literally be included in any website, right now.
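How exactly the data ends up in the page depends on the website, but JSON-LD is conventionally embedded in a script element of type application/ld+json. The following Python sketch shows one way to produce such an element from the service output. The function name is made up for illustration, and the escaping of "</" is a precaution I am adding here, not something the service itself is known to do.

```python
import json


def jsonld_script_tag(events):
    """Wrap JSON-LD data (a list of event dicts) in a script element for HTML embedding."""
    payload = json.dumps(events, ensure_ascii=False, indent=2)
    # Escape "</" so that a value containing "</script>" cannot end the element early.
    # "\/" is a valid JSON escape for "/", so the data itself is unchanged.
    payload = payload.replace("</", "<\\/")
    return '<script type="application/ld+json">\n' + payload + "\n</script>"
```

The returned string can be pasted into the HTML source of the artist website, for instance near the end of the head element, without touching the visible layout.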

That is what happened in case of Milky Chance: some time ago, the data created by the web service was fed into the Milky Chance website. Consequently, their concert data is displayed in their Knowledge Graph. See for yourself: access https://www.google.com/search?q=milky+chance and look out for upcoming events on the right hand side. Screenshot:


Google Knowledge Graph generated for Milky Chance. Note the upcoming events section: for this to appear, Google needs to find the event data in a special markup within the artist’s website.

So, in summary, when would you want to use this web service?

  • You have an interest in presenting the concert data of an artist in Google’s Knowledge Graph (you are a record label or otherwise interested in improved marketing and user experience).
  • You have access to the artist website or know someone who has access.
  • The artist concert data already is present on Songkick or will be present in the future.

Then all you need is a specialized service URL, which you can generate with a small form I have prepared for you here: http://gehrcke.de/google-jsonld-events
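If you would rather build the service URL in a script than with the form, the following Python sketch assembles it from the three inputs. The base URL and the parameter names skid, name, and weburl are taken from the Milky Chance example URL; the function name is made up for illustration.

```python
from urllib.parse import urlencode

# Base URL as it appears in the Milky Chance example.
SERVICE_BASE = "https://jsonld-events.appspot.com/api/songkick/artist"


def service_url(songkick_id, artist_name, artist_website):
    """Build the service URL from Songkick artist ID, artist name, and website URL."""
    query = urlencode(
        {"skid": songkick_id, "name": artist_name, "weburl": artist_website}
    )
    return SERVICE_BASE + "?" + query
```

Calling service_url(6395144, "Milky Chance", "http://milkychanceofficial.com") reproduces the example URL, including the percent-encoding of the website address.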

Background: why Songkick?

Of course, the event data shown in the Knowledge Graph should be up to date and in sync with presentations of the same data in other places (bands usually display their concert data in many places: on Facebook, on their website, within third-party services, …). Fortunately, a lot of bands actually do manage this data in a central place (any other solution would be tedious). This central place/platform/service often is Songkick, because Songkick really did a nice job of providing people with what they need. My web service reflects recent changes made within Songkick.

Technical detail

The core of the web service is a piece of software that translates the data provided by Songkick into the JSON-LD data as required and specified by Google. The Songkick data is retrieved via Songkick’s JSON API (I applied for and got a Songkick API key). Large parts of this software deal with the unfortunate business of data format translation while handling certain edge cases.

The service is implemented in Python and hosted on Google App Engine. Its architecture is quite well thought-through (for instance, it uses memcache and asynchronous urlfetch wherever possible). It is ready to scale, so to say. Some technical highlights:

  • The web service enforces transport encryption (HTTPS).
  • Songkick back-end is queried via HTTPS only.
  • Songkick back-end is queried concurrently whenever possible.
  • Songkick responses are cached for several hours in order to reduce load on their service.
  • Responses of this web service are cached for several hours. These are served within milliseconds.
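To illustrate the caching idea from the list above, here is a minimal time-to-live cache in Python. It is only a sketch of the principle: the real service uses App Engine's memcache, whereas this example keeps entries in an in-process dictionary, and the class and method names are made up for illustration.

```python
import time


class TTLCache:
    """Tiny in-memory cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so that expiry can be tested without waiting
        self._store = {}  # key -> (expiry timestamp, value)

    def get_or_fetch(self, key, fetch):
        """Return the cached value for key, calling fetch(key) only on a miss or after expiry."""
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]
        value = fetch(key)
        self._store[key] = (now + self._ttl, value)
        return value
```

With a lifetime of several hours, for example TTLCache(ttl_seconds=4 * 3600), repeated requests for the same artist would reach Songkick only once per cache period.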

This is an overview of the data flow:

  1. Incoming request, specifying Songkick artist ID, artist name, and artist website.
  2. Using the Songkick API (SKA), all upcoming events are queried for this artist (one or more SKA requests, depending on number of events).
  3. For each event, the venue ID is extracted, if possible.
  4. All venues are queried for further details (this entails as many SKA requests as venue IDs extracted).
  5. A JSON-LD representation of an event is constructed from a combination of
    • event data
    • venue data
    • user-given data (artist name and artist website)
  6. All event representations are combined and returned.
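The steps above can be sketched in Python roughly as follows. This is not the actual service code. The input field names (date, url, venue_id, city, zip, street) are hypothetical stand-ins for whatever the Songkick API really returns, the two fetch functions are passed in as placeholders for the API calls, and the Place and PostalAddress types are my assumption about the snipped part of the example output.

```python
def build_event(event, venue, artist_name, artist_url):
    """Step 5: merge event data, venue data, and user-given data into one JSON-LD dict."""
    location = {"@type": "Place"}
    if venue:  # venue details may be missing; degrade gracefully
        location["name"] = venue.get("name", "")
        location["address"] = {
            "@type": "PostalAddress",
            "addressLocality": venue.get("city", ""),
            "postalCode": venue.get("zip", ""),
            "streetAddress": venue.get("street", ""),
        }
    return {
        "@context": "http://schema.org",
        "@type": "MusicEvent",
        "name": artist_name,
        "startDate": event["date"],
        "url": event["url"],
        "location": location,
        "performer": {"@type": "MusicGroup", "name": artist_name, "sameAs": artist_url},
    }


def translate(artist_id, artist_name, artist_url, fetch_events, fetch_venue):
    """Steps 2 to 6, with the Songkick API calls injected as functions."""
    events = fetch_events(artist_id)  # step 2: all upcoming events
    venue_ids = {e["venue_id"] for e in events if e.get("venue_id") is not None}  # step 3
    venues = {vid: fetch_venue(vid) for vid in venue_ids}  # step 4: one lookup per venue
    return [
        build_event(e, venues.get(e.get("venue_id")), artist_name, artist_url)
        for e in events
    ]  # steps 5 and 6: build each representation and combine them
```

Passing the fetch functions in as arguments keeps the translation logic independent of the network layer, which makes it easy to exercise with canned data.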

Some notable points in this context:

  • A single request to this web service might entail many requests to the Songkick API. This is why SKA responses are aggressively cached:
    • An example artist with 54 upcoming events requires 2 upcoming-events API requests (two pages, which cannot be requested concurrently) and roughly 50 venue API requests (which can be requested concurrently). In total, this means that my web service cannot respond faster than three SKA round-trip times.
    • If none of the SKA responses has been cached before, the retrieval of about 2 + 50 SKA responses might easily take about 2 seconds.
    • This web service cannot be faster than Songkick delivers.
  • This web service applies graceful degradation when extracting data from Songkick (many special cases are handled, which is especially relevant for the venue address).
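The round-trip reasoning above can be made concrete with a small Python sketch. Note that it swaps in a thread pool from the standard library for the concurrent part, whereas the service itself relies on App Engine's asynchronous urlfetch. The function names are made up for illustration, and fetch_venue stands in for a single venue request to the Songkick API.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_venues_concurrently(venue_ids, fetch_venue, max_workers=16):
    """Look up all distinct venues in parallel and return a dict of id -> venue data."""
    ids = list(dict.fromkeys(venue_ids))  # drop duplicates, keep first-seen order
    if not ids:
        return {}
    with ThreadPoolExecutor(max_workers=min(max_workers, len(ids))) as pool:
        return dict(zip(ids, pool.map(fetch_venue, ids)))


def min_round_trips(event_pages, venue_count):
    """Lower bound on sequential round trips: pages one after another, venues in one batch."""
    return event_pages + (1 if venue_count > 0 else 0)
```

For the example artist, min_round_trips(2, 50) gives 3, matching the estimate in the list: two sequential page requests plus one concurrent batch of venue requests.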

Generate your service URL

This blog post is just an introduction, and sheds some light on the implementation and decision-making. For general reference, I have prepared this document to get you started:

http://gehrcke.de/google-jsonld-events

It contains a web form where you can enter the (currently) three input parameters required for using the service. It returns a service URL for you. This URL points to my application hosted on Google App Engine. Using this URL, the service returns the JSON data that is to be included in an artist’s website. That’s all, it’s really pretty simple.

So, please go ahead and use this tool. I’d love to receive some feedback. Look closely at the data it returns, and keep your eyes open for subtle bugs. If you see something weird, please report it. I am very open to suggestions, and also interested in your questions regarding future plans, release cycle etc. Also, if you need support for (dynamically) including this kind of data in your artist’s website, feel free to contact me.

You should let ‘SamKnows’ know.

For a couple of years now I have had a network device in my place, called “Whitebox” by SamKnows. It is part of the following endeavor:

Together, the European Commission and SamKnows aim to provide Europe with reliable and accurate statistics of broadband performance across Europe.

Volunteers will receive a purpose-built broadband measurement unit which can be plugged into the existing modem/router. This is called the SamKnows Whitebox.

The network device periodically performs measurements without interfering with actual payload traffic in your network. It monitors Ethernet as well as wireless LAN. The data is uploaded, aggregated, and nicely visualized in a personalized dashboard.

If you are paranoid, you probably do not want to have such a device in your home and you might argue that you can measure for yourself. However, I think that this project is trustworthy and that the statistics obtained by SamKnows are essential for evaluating and developing broadband Internet access in Europe. Furthermore, the data obtained by my Whitebox has already been useful multiple times for convincing my ISP that it is not holding up its end of the contract.

I recommend signing up: https://www.samknows.eu/sign-up/