Download article as PDF file from Elsevier’s ScienceDirect via command line (curl)

When not in the office, we often cannot directly access scientific literature, because access control is usually based on IP addresses. However, we usually have SSH access to the university network. Logged in to a machine in the university network, we should, in theory, be able to access a given article. Most of the time it is the PDF file that we are interested in, not the “web page” corresponding to the article. So, can’t we just $ curl http://whatever.com/article.pdf to get that file? Most of the time this does not work, because access to journal articles usually happens through rather complex web sites, such as Elsevier’s ScienceDirect:

ScienceDirect is a leading full-text scientific database offering journal articles and book chapters from nearly 2,500 journals and 26,000 books.

Such web sites add a considerable amount of complexity to the technical task of downloading a file. The problem usually starts with obtaining the direct URL to the PDF file. HTTP redirection and cookies are usually involved as well. Often, the only solution people see is to set up a VPN, use a fully fledged browser through that VPN, and let the browser deal with the complexity.

However, I prefer to get back to the basics and always strive to somehow find a direct URL to the PDF file to then download it via curl or wget.

This is my solution for Elsevier’s ScienceDirect:

Say, for instance, you wish to download the PDF version of this article: http://www.sciencedirect.com/science/article/pii/S0169433215012131

Then all you need is that URL and the following commands executed on a common Linux system:

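export SDURL="http://www.sciencedirect.com/science/article/pii/S0169433215012131"
# Fetch the article page, store its cookies, and extract the value of pdfurl="...".
curl -Lc cookiejar "${SDURL}" | grep pdfurl | perl -pe 's|.* pdfurl="(.*?)".*|$1|' > pdfurl
# Fetch the PDF: -b sends the stored cookies, -c keeps recording new ones.
curl -L -b cookiejar -c cookiejar "$(cat pdfurl)" > article.pdf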

The method first parses the HTML source code of the main page corresponding to the article and extracts a URL to the PDF file. At the same time, it stores the HTTP cookie(s) set by the web server when that page is accessed. These cookies are then re-used when accessing the PDF file directly. This has reproducibly worked for me.
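The two steps described above can also be wrapped in a small shell function. This is only a sketch under the same assumption as the snippet, namely that the article page contains a pdfurl="..." attribute. The names extract_pdfurl and fetch_sd_pdf are made up for illustration, sed stands in for the grep/perl pair, and the second request passes -b explicitly so that curl reads the cookies stored by the first one.

```shell
# Sketch only: assumes the article page still contains a pdfurl="..." attribute.

# Print the value of the first pdfurl="..." attribute found on stdin.
extract_pdfurl() {
    sed -n 's/.* pdfurl="\([^"]*\)".*/\1/p' | head -n 1
}

# fetch_sd_pdf ARTICLE_URL OUTPUT_FILE   (hypothetical helper name)
fetch_sd_pdf() {
    page_url=$1
    out_file=$2
    # Temporary cookie jar, so nothing is left behind in the working directory.
    jar=$(mktemp) || return 1

    # Step 1: fetch the article page, store cookies, extract the PDF URL.
    pdf_url=$(curl -sL -c "$jar" "$page_url" | extract_pdfurl)
    if [ -z "$pdf_url" ]; then
        echo "no pdfurl attribute found on $page_url" >&2
        rm -f "$jar"
        return 1
    fi

    # Step 2: fetch the PDF, sending the stored cookies (-b) and
    # recording any new ones (-c).
    curl -sL -b "$jar" -c "$jar" "$pdf_url" > "$out_file"
    status=$?
    rm -f "$jar"
    return "$status"
}
```

It would be called as fetch_sd_pdf "${SDURL}" article.pdf. The function fails early with a message when no PDF URL could be extracted, instead of writing an empty or meaningless article.pdf.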

If it does not work for you, I recommend having a look at the file pdfurl to see whether that part of the process has led to a meaningful result. Obviously, the second step can only succeed after a proper URL to the PDF file has been obtained.
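Both checks can be scripted. The following sketch uses two hypothetical helper functions, looks_like_url and looks_like_pdf, which are not part of the original snippet. The first tests whether a file holds a single http(s) URL on its first line. The second tests whether a file starts with the magic bytes %PDF, which tells a real PDF apart from an HTML login or error page saved under the same name.

```shell
# Return success if the first line of FILE is a single http(s) URL.
looks_like_url() {
    head -n 1 "$1" 2>/dev/null | grep -Eq '^https?://[^[:space:]]+$'
}

# Return success if FILE starts with the PDF magic bytes "%PDF".
looks_like_pdf() {
    [ "$(head -c 4 "$1" 2>/dev/null)" = "%PDF" ]
}
```

After running the snippet, looks_like_url pdfurl || echo "URL extraction failed" and looks_like_pdf article.pdf || echo "download is not a PDF" narrow down which of the two steps went wrong.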

This snippet should not be treated as a black box. Please execute it in an empty directory. Also note that this snippet only works subject to the condition that ScienceDirect keeps functioning the way it does right now (which most likely is the case for the next couple of months or years).

Don’t hesitate to get back to me if you have any questions!

Jan-Philip Gehrcke, PhD
