Sci-Hub succeeds where publishers fail (open and closed)

Sci-Hub use in the Netherlands is not limited to the academic research cities. Harlingen is a small harbor town where at best a doctor lives, and one or two students who visit their parents on the weekend. The nature of the top downloaded paper suggests it is not a doctor :) Data from Bohannon and Elbakyan.
Knowledge dissemination is a thing. It's not easy. In fact, it's a major challenge. Traditional routes are no longer efficient, where they were 200 years ago. The world has moved on; the publishing industry has not. I have written plenty in this blog about how the publishers could catch up, and while this is happening, progress is (too) slow.

The changes are not only technical, but also social. Several publishers still believe we live in an industrial era, whereas the world has moved on into a knowledge era. More people are mining and servicing data than there are making physical things (think about that!). Access to knowledge matters, and dealing with data and knowledge stopped being something specific to academic and other research institutes many, many years ago. Arguments that knowledge is only for the highly educated are simply contradictory and bluntly ignore our modern civilization.

This makes access to knowledge a mix of technological and social evolution, and on both ends many publishers fail, fail hard, fail repeatedly. I would even argue that the new publishers are improving things, but are failing to really innovate in knowledge dissemination. And not just the publishing industry, also many scientists. Preprint servers are helpful, but they are really not the end goal. If you really care about speeding up knowledge dissemination, stop worrying about things like text mining and preprints, and start making knowledge machine readable (sorry, scientists) and release that along with or before your article. Yes, that is harder, but just realize you are getting well paid for doing your job.

So, the success of Sci-Hub is by no means unexpected. It is not really the end goal I have in mind, and in many ways it contradicts what I want. But the research community clearly thinks differently. Oh wait, not just the research community, but the current civilization. The results of the Bohannon analysis of the Sci-Hub access logs I just linked to clearly show this. There are so many aspects, and so many interpretations and remaining questions. The article rightfully asks: is it need or convenience? I argued recently that the latter is likely an important reason at western universities, and that it is nothing new.

This article is a must read if you care about the future of civilization. Bonus points for a citable data set!

Bohannon, J. Who's downloading pirated papers? Everyone. Science 352, 508-512 (2016). URL
Elbakyan, A. & Bohannon, J. Data from: Who's downloading pirated papers? Everyone. (2016). URL

Programming in the Life Sciences #22: jsFiddle

My son pointed me to jsFiddle, which allows you to edit JavaScript snippets and run them. I had heard of it before, but never really found time for it. But I'm genuinely impressed with the stuff he is doing, and finally wanted to try sharing JavaScript snippets online, particularly because I had to update the course description of Programming in the Life Sciences. In this course the students work with JavaScript and there are a number of examples, but they carry a lot of HTML boilerplate code.

So, here's the first of those examples, stripped of most of the things you don't need, and with some extra documentation as comments:
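For readers without access to the fiddle, a minimal snippet in the same spirit looks like this (note: the data here is made up for illustration; the actual course examples query live web services):

```javascript
// A tiny, boilerplate-free snippet: process a small set of
// compound measurements and report the average pKa, the kind
// of data handling students do in the course.
var measurements = [
  { compound: "acetic acid", pKa: 4.76 },
  { compound: "formic acid", pKa: 3.75 },
  { compound: "benzoic acid", pKa: 4.20 }
];

// Sum all pKa values, then divide by the number of measurements.
var sum = measurements.reduce(function (total, m) {
  return total + m.pKa;
}, 0);
var average = sum / measurements.length;

console.log("Average pKa: " + average.toFixed(2));
```

In jsFiddle you paste only this logic; the surrounding HTML page is provided by the site itself.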

Splitting up Bioclipse Groovy scripts

Source: Wikipedia, CC-BY-SA 3.0
... without writing additional script managers (see doi:10.1186/1471-2105-10-397). That was what I was after. I found that by using evaluate() you can load additional code. The only requirements: you wrap stuff in a class, and the file name needs to match the class name. So, you put stuff in a class SomeName and save that in a Bioclipse project (e.g. SomeProject/) with the name SomeName.groovy.

That is, I have this set up:

  SomeProject/
    SomeName.groovy
    aScript.groovy
Then, in this aScript.groovy you can include the following code to load that class and make use of the content:

  someClass = evaluate(
    new File("SomeProject/SomeName.groovy")
  )
Maybe there are even better ways, but this works for me. I tried the regular Groovy way of instantiating a class defined like this, but because the Bioclipse Groovy environment does not have a working directory, I could not get that to work.

Still a draft: The Amsterdam Call for Action on Open Science

It was on the agenda: "Presenting the Amsterdam Call for Action". However, a day of hard work by some 300 participants of the Dutch Presidency's meeting on Open Science did not allow for the draft to be finalized today. Instead, the editors will work the next 24 hours (a bit less by now) on a new draft that will be sent around to the participants, who will then have about a week to send in further comments.

There was indeed plenty of feedback on the draft, and followers of my blog and Twitter account know how much they already got from just me. It will be a busy 24 hours for the editors. I am really looking forward to the next draft they come up with. BTW, it is not clear yet if I will be able to share the draft that I get tomorrow. We'll see. At least I will tweet about whether or not my main points got addressed.

Meanwhile, the Dutch VSNU sent out a press release saying that "[they are] pleased with European action plan Open Science". Given that the plan was in fact not yet released, this suggests a few things:

  1. they anticipated the draft was a done deal (which aligns with the lack of openness around the draft);
  2. automated sending out of press releases is a bad idea.

The Amsterdam Call for Action on Open Science #2: 10% Open Science

Some of the discussions here are about data sensitivity, privacy, etc. Excellent points! One confusion that should be put aside is the idea that you cannot be an open scientist if you do not release everything. That is nonsense, FUD perhaps.

The Amsterdam Call for Action on Open Science may very well set some goals; I do not think it currently does. It could set a goal of 10% by the end of this year (probably already met), 20% at the end of 2018, and 30% at the end of H2020. You can read this in many ways, and here too, things are very vague.

After yesterday I applaud this approach! Would it not be brilliant if by the end of this year all scholars released 10% of their research output as Open Science?! Go for it!

The Amsterdam Call for Action on Open Science

First, it is very regrettable that the participants of #EU2016NL #openaccess did not get access to The Amsterdam Call for Action on Open Science document (read it in this Trello) until minutes before the break-out sessions. So, I am still reading it (it is 21 pages) while in the Innovation session, which will only discuss actions #8 and #9.

Here's the first page:

All participants split up in various break out sessions, and I ended up in the Innovation session. However, I find the description of Open Science insufficient for me to understand the proposed twelve actions. It does not define the core values of Open Science:

You can see my comments already. And really, this is critical: all the actions make assumptions and do not define things, which causes the problem that nothing is actionable, because you can basically do anything and still comply with the action. (In fact, another issue is that several actions are already being undertaken, but that's for later.)

My recommendation is to rephrase the first sentence into:

"Open science is an umbrella term for a technology and data driven systemic change in how knowledge dissemination works and how researchers work, collaborate, share ideas and disseminate results, by adopting the core values that knowledge should be reusable, modifiable and redistributable. This allows us to address the increasing demand in society to tackle the societal challenges of our time."

These are the core values implemented by Open Data and Open Source. Sadly, they are not commonly implemented by Open Access, causing a lot of confusion in the latter area, which has been very clear at this meeting too.

Migrating pKa data from DrugMet to Wikidata

In 2010 Samuel Lampa and I started a pet project collecting pKa data: he was working on an RDF extension of MediaWiki and I like consuming RDF data. We started DrugMet. When you read this post, this MediaWiki installation may already be down, which is why I am migrating the data to Wikidata. Why? Because data curation takes effort, I like to play with Wikidata (see this H2020 proposal by Daniel Mietchen et al.), I like Open Data, and it is still much needed.

We opted for pages with a minimal amount of information, to maximize the speed at which we could add data. However, when it came to semantics, we tried to be as explicit as possible and, for example, used the CHEMINF ontology. So, each page collected:
  1. InChIKey (used to show images)
  2. the paper it was collected from (identified by a DOI)
  3. the value, and where possible, the experimental error
A page typically looks something like this:

While not used on all pages, at some point I even started using templates, and I used these two, for molecules and papers:



These templates, as well as the above screenshot, already contain a spoiler, but more about that later. Using MediaWiki functionality it was now easy to make lists, e.g. for all pKa data (more spoilers):

I find a database like this very important. It does not capture all the information it should be capturing, though, as is clear from the proposal some of us worked on a while back. However, this project got put on hold; I do not have time for it anymore, and it is not close enough to our department's core to spend time writing grant proposals for it.

But I still do not want this data to get lost. Wikidata is something I have started using, as it is a machine readable CCZero database with an increasing amount of scientific knowledge. More and more people are working on it, and you must absolutely read this paper about this very topic (by a great team you should track, anyway). I am using it myself as a source of identifier mappings and more. So, migrating the previously collected data to Wikidata makes perfect sense to me:

  1. if a compound is missing, I can easily create a new one using Bioclipse
  2. if a paper is missing, I can easily create a new one using Magnus Manske's QuickStatements
  3. Wikidata has a pretty decent provenance model
I can annotate data with the data source (paper) it came from and also experimental conditions:

In fact, you'll note that the book is a separate Wikidata entry in itself. Better even, it is an 'edition' of the book. This is the whole point we make in the above linked H2020 proposal: Wikidata is not a database specific to one domain; it works for any (scholarly) domain, and seamlessly links all those domains.

Now, to keep track of what data I have migrated, I am annotating DrugMet entries with links to Wikidata: everything with a Wikidata Q-code has already been migrated. The above pKa table already shows Q-identifiers, but I also created them for all data sources I have used (three of them lack a DOI: two books and one old paper):

I have still quite a number of entries to do, but all the protocols are set up now.

On the downstream side, Wikidata is also great because of its SPARQL endpoint. Something that I did not get to work some weeks ago, I did manage yesterday (after some encouragement from @arthursmith): list all pKa statements, including the literature source if available:
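The query I used was along these lines (a sketch: P235 is the InChIKey property; I am assuming P1117 for pKa here, so double-check the property identifiers on Wikidata before reusing this):

```sparql
# List all pKa statements, with the literature source where available.
# The wdt/p/ps/pr/prov prefixes are predefined on the Wikidata endpoint.
SELECT ?compound ?key ?pKa ?source WHERE {
  ?compound wdt:P235 ?key ;         # compounds with an InChIKey
            p:P1117 ?statement .    # full pKa statement node
  ?statement ps:P1117 ?pKa .        # the pKa value itself
  OPTIONAL {                        # literature source, if stated
    ?statement prov:wasDerivedFrom/pr:P248 ?source .
  }
}
```

Going through the statement node (p:/ps:) rather than the direct wdt: property is what makes the provenance (the "stated in" reference) reachable.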

If you run that query on the Wikidata endpoint, you get a table like this:

Here we see experimental data from two papers: 10.1021/ja01489a008 and 10.1021/ed050p510. This can all be displayed a lot more fancily, with histograms, tables with 2D drawings of the chemical structures, etc., but I leave that to the reader.

Re: How should we add citations inside software?

Common practice is that many cite webpages for software, sometimes even just listing the name. I do not understand why scholars do not en masse look up the research papers that are associated with the software. As a reviewer of research papers I often have to advise authors to revise their manuscript accordingly, but I think this is something that should be caught by the journal itself. Fact is, not all reviewers seem to check this.

In some future, if publishers also take this seriously, we will have citation metrics for software like we have for research papers and increasingly for data (see also this brief idea). You can support this by assigning DOIs to software releases, e.g. using ZENODO. This list on our research group's webpage shows some of the software releases:

My advice for citing software thus goes a bit beyond what is traditionally requested of authors:

  1. cite the journal article(s) for the software that you use
  2. cite the specific software release version using ZENODO (or compatible) DOIs

This tweet gives some advice about citing software, and triggered this blog post:
Citations inside software
Daniel Katz takes it a step further and asks how we should add citations inside software. After all, software reuses knowledge too, stands on algorithmic shoulders, and this can be a lot. This is something I can relate to: if you write a cheminformatics software library, you use a ton of algorithms, all of which are written up somewhere. Joerg Wegner did this in his JOELib, and we adopted the idea for the Chemistry Development Kit.

So, the output looks something like:

(Yes, I spot the missing page information. But rather than missing information, it is more that this was an online-only journal and the renderer cannot handle that well. BTW, here you can find this paper; it was my first first-author paper.)

However, at a Java source code level it looks quite different:

The build process takes advantage of the JavaDoc taglet API and uses a BibTeXML file with the literature details. The taglet renders it to full HTML, as we saw above.
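To sketch the idea (the class name and citation keys below are illustrative, not actual CDK code): a JavaDoc tag carries the key of an entry in the BibTeXML file, and the custom taglet expands it into a full reference during the documentation build:

```java
/**
 * Example of a class-level literature citation: the tag below points
 * at a key in the project's BibTeXML file, and a custom taglet renders
 * the full reference in the generated HTML documentation.
 *
 * @cdk.cite WEGNER2006
 */
class AlgorithmExample {

    /**
     * Method-level citation for the algorithm implemented here.
     *
     * @cdk.cite MORGAN1965
     */
    int[] canonicalNumbers() {
        return new int[0]; // placeholder; the cited algorithm goes here
    }
}
```

Because the tag lives in a comment, the code compiles unchanged; only the documentation build needs the taglet on its classpath.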

Bioclipse does not use this in the source code, but does have the equivalent of a CITATION file: the managers that extend the Python, JavaScript, and Groovy scripting environments with domain-specific functionality (well, read the paper!). You can ask in any of these scripting languages for citation information:

    > doi bridgedb

This will open the webpage of the cited article (which sometimes opens in Bioclipse, sometimes in an external browser, depending on how it is configured).

At a source code level, this looks like:

So, here are my few cents. Software citation is important!

Adding disclosures to Wikidata with Bioclipse

Last week the huge, biannual ACS meeting took place (#ACSSanDiego), during which new drug (leads) are commonly disclosed. This time too, like this one tweeted by Bethany Halford:

Because getting this information out in the open is important, I think it's a good idea to add them to Wikidata (see doi:10.3897/rio.1.e7573). So, with Bioclipse (doi:10.1186/1471-2105-8-59) I redrew the structure:

I previously blogged about how to add chemicals to Wikidata, but I realized that I wanted to use Bioclipse to automate this process a bit. So, I wrote this script to generate the SMILES, InChI, and InChIKey, double check that the compound is not already in Wikidata (using the Wikidata SPARQL endpoint), and look up the PubChem compound identifier (example SMILES).

smiles = "CCCC"

mol = cdk.fromSMILES(smiles)

inchiObj = inchi.generate(mol)
inchiShort = inchiObj.value.substring(6) // InChI without the "InChI=" prefix
key = inchiObj.key // key = "GDGXJFJBRMKYDL-FYWRMAATSA-N"

sparql = """
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT ?compound WHERE {
  ?compound wdt:P235 "$key" .
}
"""

if (bioclipse.isOnline()) {
  results = rdf.sparqlRemote(
    "https://query.wikidata.org/sparql", sparql
  )
  missing = results.rowCount == 0
} else {
  missing = true
}

formula = cdk.molecularFormula(mol)

// Create the Wikidata QuickStatement,
// see Magnus Manske's QuickStatements tool

item = "LAST" // set to Qxxxx if you need to append info,
              // e.g. item = "Q22579236"

pubchemLine = ""
if (bioclipse.isOnline()) {
  pcResults = pubchem.search(key) // look up the PubChem CID by InChIKey
  if (pcResults.size == 1) {
    cid = pcResults[0]
    pubchemLine = "$item\tP662\t\"$cid\""
  }
}

if (!missing) {
  println "===================="
  println "Already in Wikidata as " + results.get(1,"compound")
  println "===================="
} else {
  statement = """
    $item\tDen\t\"chemical compound\"
    $item\tP233\t\"$smiles\"
    $item\tP274\t\"$formula\"
    $item\tP234\t\"$inchiShort\"
  """
  if (pubchemLine != "") statement += pubchemLine

  println "===================="
  println statement
  println "===================="
}
The output of this script is a QuickStatement for Magnus Manske's tool, which is a list of commands to create and edit entries in Wikidata. (IMPORTANT: it is not meant to automate editing Wikidata! I only automate creating the input, which I carefully check, e.g. checking that all stereochemistry is defined. Bioclipse also opens the structure in a viewer.) You need to enable the tool first, but if you have an account, this is not too hard. Of course, the advantage is that it is a lot quicker. I have a similar script to create QuickStatements starting with only a ChEMBL identifier.

The QuickStatement for GDC-0853 looks like:

    LAST Den "chemical compound"
    LAST P233 "O=C1C(=CC(=CN1C)c2ccnc(c2CO)N4C(=O)c3cc5c(n3CC4)CC(C)(C)C5)Nc6ncc(cc6)N7CCN(C[C@@H]7C)C8COC8"
    LAST P274 "C37H44N8O4"
    LAST P234 "1S/C37H44N8O4/c1-23-18-42(27-21-49-22-27)9-10-43(23)26-5-6-33(39-17-26)40-30-13-25(19-41(4)35(30)47)28-7-8-38-34(29(28)20-46)45-12-11-44-31(36(45)48)14-24-15-37(2,3)16-32(24)44/h5-8,13-14,17,19,23,27,46H,9-12,15-16,18,20-22H2,1-4H3,(H,39,40)/t23-/m0/s1"
    LAST P662 "86567195"

The first line creates a new Wikidata item, while the next ones add information about this compound. GDC-0853 is now also Q23304817. I added the label manually afterwards. Note how the Bioclipse script found the PubChem identifier using the InChIKey. I also use this approach to add compounds to Wikidata that we have in WikiPathways.

Sci-Hub: a sign on the wall, but not a new sign

Logo of Sci-Hub. Fair use,
via Wikipedia.
It is hard to have missed Sci-Hub, as it even showed up in the Dutch Volkskrant. It is now sharing PDFs of many closed and open access papers (yes, there are plenty of OA PDFs shared). Opinions about it vary, but it is important to realize that, for the closed access papers, it violates social agreements made between scientists and publishers. There are valid arguments but also FUD, e.g. by the Association of American Publishers (AAP; speaking for some publishers, surely), which writes that "publishers work to ensure that their publications create an accurate and correct scientific record, (i.e., publishing revisions to correct or update data)". Yes, we all know how well that works out :/

As always, Björn Brembs has a great overview, a must read, though I think I disagree with the Sci-Hub As Necessary, Effective Civil Disobedience title. The argument from the developer of Sci-Hub convinces me more. Wikipedia writes that Alexandra Elbakyan has cited Article 27 of the UN Declaration of Human Rights: "to share in scientific advancement and its benefits".

I am not a lawyer and have no idea if this really applies, but as a scholar, it reflects exactly what I care about: research should benefit society, directly or indirectly. We have established many routes, often involving some sustainability model (whatever that is). But at a very basic level, there is nothing sustainable about knowledge that is hard to get. Scarcity increases value, but knowledge must be cheap, and unlike the AAP suggests, cheap is not the same as low quality (if only they had gotten that press release peer-reviewed...). If it were, then a simple smartphone would be of lower quality than a fax machine.

But civil disobedience is not the solution. However, the point I want to make here is that the Sci-Hub practices are not really new: scholars have been sharing papers for free for many, many years. As a student I learned to ask collaborators for a copy of some paper important to the research you were doing. And to ask the author of the paper for a copy. This civil disobedience was and is common sense. And publishers were happy with this for many, many years. That is, I have never heard of any of them going after universities for these practices of their scholars.

Sci-Hub just makes this common activity easier. It only makes it easier to look up scientific knowledge. It is just that publishers do not seem to want that. It has to take effort (which is a sign of high quality research, right?).

So, here is my howto on how to get access to research output in a way that publishers have been happy with for at least twenty years. First, what not to do:
  1. do not use ILL or go to a nearby university that has access to the paper (do not make a list of papers you can get there, do not go there once a month, and while there, do not meet up with other scholars to discuss science)
  2. do not ask collaborators or friends at other institutes if they have access
Instead:
  1. email the corresponding author of the paper (after all, the author decided on a closed access license)
  2. try to ask for a reprint via their publication list provider, such as ResearchGate (the author will love your request via such systems!)
  3. email them again if you have not received an answer in about a week
  4. repeat this step a few times; they are busy people and may have missed your earlier communication
  5. email other authors of the paper, or possibly ask the dean of the faculty to ask the author to send you a reprint
  6. once you have a copy of the paper, email the authors to say you have a copy of their paper (they may not even have access themselves, which may be the reason why they did not provide a reprint)
Really, this "reprint" concept has been quite formalized, and some of you may still remember the term. Several publishers used to have specific URLs that allowed authors to share up to X copies (normally quite sufficient for most papers), which authors could send back in reply to one of your above emails. Of course, it would be silly to just make the first X PDF downloads free anyway.

Oh, and some have asked the Open Access community to indicate its position on the Sci-Hub practices. Well, as is clear from the above, the Open Access community responded long ago to the problem of access to literature: by creating the gold Open Access movement.

So, that's why I try to ensure my important work is Open Access. Because then I don't get bothered with tens of reprint requests, which gives me time to write this personal </rant>. In no way does this post reflect the position of my employer.