Talk: "Making open science a reality, from a researcher perspective"

Slide from the presentations with a screenshot of the Woordenboek Organische Chemie.
Last week I was in Paris (wonderful, but like London, a city that makes you understand Ankh-Morpork) for the AgreenSkills+ annual meeting. AgreenSkills+ is a program for postdoc funding in France, and the postdocs presented their work. Wednesday (#agreenskills) was a day to learn about Open Science, with other talks from Nancy Potinka and Ivo Grigorov from Foster Open Science, Martin Donnelly from the Edinburgh Digital Curation Centre about data management and the DMPonline tool, and Michael Witt of Purdue University about digital repositories and DataCite (which I should really make time to blog about too).

I was asked to talk about my experiences from a researcher perspective (which started with the Woordenboek Organische Chemie). Here are my slides:

Making open science a reality, from a researcher perspective from Egon Willighagen

Open Science is already a thing in The Netherlands

It has been hard to miss: the Dutch National Plan Open Science (doi:10.4233/uuid:9e9fa82e-06c1-4d0d-9e20-5620259a6c65). It sets out an important step forward: it goes beyond Open Access publishing, which has become a tainted topic. After all, green Open Access does not provide enough rights. For example, teachers still cannot easily share green Open Access publications with their students.

I am happy I have been able to give feedback on a draft version, and hope it helped. During the weeks before the release I also looked at how the Open Science working group of the Open Knowledge International foundation(?) is doing, and I am happy that at least the Dutch mailing list is still in action. Things are a bit in flux, as the OKI is undergoing a migration to a new platform. Maybe more about that later.

But one of my main comments was that there already is a lot of Open Science ongoing in The Netherlands. And then I am not talking about all those scientists who already publish part of their work as (gold) Open Access, but about the many researchers who already share Open Data, Open Source, or other Open research outputs. In fact, I started a public (CCZero) spreadsheet with GitHub repositories of Dutch research groups, which now also covers many educational groups, at our universities and "hogescholen". This now includes some forty(!) git repositories, mostly on GitHub but also on GitLab. Wageningen even have their own public git website!

Mind you, I had to educate myself a bit in the exact history of the term Open Science. It actually seems to go back to the USA Open Source community (see these references and particularly this article). And that's actually where I also knew it from, in particular from Dan Gezelter, founding author of the well-known Jmol viewer for small molecules and protein structures, and host of the domain.

Wikidata-powered citation lists with citation.js

I don't get as much time with the kids as I would like, but if your son is doing interesting coding projects, that makes it a lot easier. One project he is working on is citation.js, a JavaScript library to edit bibliographies. It has become really powerful and totally awesome! We all hate formatting bibliographies, and the fact that every journal has its own format. LaTeX and Citation Style Language have done wonders here, but it should all be even simpler. As an author I want to be able to just give a DOI and that should be enough.

Or a Wikidata entity identifier.

And citation.js makes that last thing possible, and I spent some time with Lars to implement this for my homepage:

This is more or less what I had before too, but then with everything hard-coded. The citation.js way allows me to give just a list of two entity IDs (Q27062312 and Q27062639) and citation.js outputs the above. I just have this snippet in the HTML:

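      <ul class="cite" id="cite1" />
      <script class="code" type="text/javascript">
        // `opt` (the output options) is defined elsewhere on the page
        var wikidata = new Cite()
        wikidata.set( [ "Q27062312", "Q27062639" ] )
        var htmlOutput = wikidata.get( opt )
        // undo the escaping of angle brackets so the markup in the output is treated as HTML
        htmlOutput = htmlOutput.replace( /&(lt|#60);/g, '<' )
                               .replace( /&(gt|#62);/g, '>' )
      </script>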

The formatting is actually mostly done with a CSL template (which needs a hack to get it to output HTML), adapted to also output the DOI hyperlink and Altmetric icon (you can find the customized CSL in the HTML source code as CC-BY-SA 3.0). The citation.js library fetches the data from Wikidata and actually has to deal with the structure there, which includes a mixture of 'author' and 'author name string' fields for author information. Well done!
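To illustrate why that mixture of author fields needs some care, here is a minimal sketch of merging the two kinds of statements into one ordered author list. This is not how citation.js itself does it; the function name and the simplified `claims` and `labels` shapes are made up for this example. It only relies on the Wikidata properties P50 ("author", pointing to an item), P2093 ("author name string", plain text) and the P1545 ("series ordinal") qualifier that gives the position in the author list.

```javascript
// Hypothetical, simplified input: for each property a list of
// { value, ordinal } objects, where `ordinal` is the P1545 series ordinal.
// `labels` maps the item IDs used in P50 statements to human-readable names.
function orderedAuthorNames(claims, labels) {
  const toOrdinal = (c) =>
    c.ordinal === undefined ? Infinity : Number(c.ordinal);

  // P50: the author is a Wikidata item, so look up its label
  // (fall back to the raw ID if no label is known).
  const linked = (claims.P50 || []).map((c) => ({
    name: labels[c.value] || c.value,
    ordinal: toOrdinal(c),
  }));

  // P2093: the author is only given as a plain string.
  const plain = (claims.P2093 || []).map((c) => ({
    name: c.value,
    ordinal: toOrdinal(c),
  }));

  // Merge both lists and sort by series ordinal; authors without an
  // ordinal end up last, in their original order.
  return [...linked, ...plain]
    .sort((a, b) => (a.ordinal > b.ordinal) - (a.ordinal < b.ordinal))
    .map((a) => a.name);
}
```

The point of the sketch is that neither field alone gives the full author list: only after merging and sorting on the ordinal do you get the order in which the authors appear on the paper.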

If you like this, make sure to check out Wikicite, OpenCitations, and Scholia, projects that enabled and triggered some of the ideas behind the above citation.js use!

"10 everyday things on the web the EU Commission wants to make illegal" #04

The fourth example is harder than the third and I hope I translated Julia Reda's example well. The starting point is simple enough: bookmarking things where an image is used. However, I am less sure to what extent we use this in online science.

04. Pinning a photo to an online shopping list

Well, you can see how much trouble I had with finding a good equivalent here. So, what is a science shopping list? The above example shows a Google+ post by Björn Brembs. Now, G+ is not really a shopping list, but then again, literature is what researchers buy. Literally. We pay millions and millions for it. Second, we do have dedicated shopping lists for these products, but they do not always support images. Of course, these shopping lists are our CiteULike, Mendeley, ResearchGate, etc. accounts.

A second limitation of this example is that we would not consider most of our literature to be of a journalistic nature. Hence the above example. Blogs are typically a mixture of science writing and a kind of journalism. It's a grey area. Now, under the new laws, Björn would have to ask my permission, and worse, G+ needs to install a monitoring system to see if Björn got a proper license so as not to break my copyright.

So, back to the likes of ResearchGate and ScienceOpen. Under the current proposal, any system of this kind with some commercial model in mind (both are set up by SMEs) will have to install this monitoring system (after all, we also happily bookmark Nature News articles). The cost of that investment will have to come from somewhere, so this has an enormous impact on their sustainability.

Even worse, in the wording of the proposal I have seen so far, and to the extent I understand Julia's worries, there are no limitations set on this; few or no words on allowed behavior. So, what about dissemination systems in general? I think later examples (we still have six to go!) will shed more light on that.

(And make sure to read the original article by Julia Reda!)

EPA CompTox Dashboard IDs in Wikidata

After Antony Williams left the ChemSpider team, he moved on to the EPA. Since then, he has set up the EPA CompTox Dashboard (see also doi:10.1007/s00216-016-0139-z [€]). And in August he was kind enough to upload mappings between InChIKeys (doi:10.1186/s13321-015-0068-4) and their identifiers on Figshare (doi:10.6084/m9.figshare.3578313.v1) as a tab-separated values (TSV) file. Because this database is of interest to our pathway and systems biology work, I realized I wanted ID-ID mappings in our BridgeDb identifier mappings files (doi:10.1186/1471-2105-11-5). As I wrote earlier, I have adopted Wikidata (doi:10.3897/rio.1.e7573) as data source. So, entering these new identifiers in Wikidata is helpful.

Somewhere in the past few months I proposed the needed Wikidata property, P3117 ("DSSTOX substance identifier"), which was approved some time later. For entering the mappings, I have opted to write a Bioclipse script (doi:10.1186/1471-2105-10-397) that uses the Wikidata SPARQL endpoint to get about 150 thousand Wikidata item identifiers (Q-codes) and their InChIKeys. It then parses the lines in the TSV file from Figshare and creates input for Wikidata for each match, based on exact InChIKey string equivalence.

This output is formatted as instructions for QuickStatements, a great tool set up by Magnus Manske. Each line looks like this (here for N6-methyl-deoxy-adenosine-5'-monophosphate, aka Q27456455):

Q27456455 P3117 "DTXSID30678817" S248 Q28061352

The P248 ("stated in") property is used to link the source information as reference (hence: S248), which points to the Q28061352 item, which is for the Figshare entry for Tony's mapping data. The result in this Wikidata item looks like:

I entered about 36 thousand such statements into Wikidata. Thus, the yield is about 5%, calculating from the CompTox Dashboard as starting point with about 720 thousand identifiers. From a Wikidata perspective, the yield is higher. There are about 150 thousand items with an InChIKey, so 24% could be mapped.
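The matching step and the yield numbers can be sketched in a few lines. This is not the Bioclipse script that was actually used, just a minimal illustration in JavaScript. The function names and the TSV column names (`inchikey`, `dtxsid`) are assumptions made for this example, and so is the tab as separator in the output lines.

```javascript
// Parse TSV text with a header row into an array of row objects.
function parseTsv(text) {
  const [header, ...rows] = text.trim().split(/\r?\n/);
  const columns = header.split("\t");
  return rows.map((row) => {
    const cells = row.split("\t");
    return Object.fromEntries(columns.map((name, i) => [name, cells[i]]));
  });
}

// Create one QuickStatements line for every row whose InChIKey is
// exactly equal to the InChIKey of a Wikidata item.
// `wikidataByInchikey` is a Map from InChIKey to Q-code (for example
// filled from the SPARQL endpoint); `sourceItem` is the Q-code of the
// item describing the mapping data, used for the S248 reference.
function toQuickStatements(rows, wikidataByInchikey, sourceItem) {
  const lines = [];
  for (const { inchikey, dtxsid } of rows) {
    const qid = wikidataByInchikey.get(inchikey);
    if (qid === undefined) continue; // no exact match: skip this row
    lines.push([qid, "P3117", `"${dtxsid}"`, "S248", sourceItem].join("\t"));
  }
  return lines;
}

// Percentage of `total` that could be mapped.
function yieldPercent(matched, total) {
  return (100 * matched) / total;
}
```

With the rounded numbers from above, `yieldPercent(36000, 720000)` gives the 5% from the CompTox Dashboard perspective and `yieldPercent(36000, 150000)` the 24% from the Wikidata perspective.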

Based on properties of the property, Wikidata does some automatic validation. For example, it is specified that any Wikidata item can only have one DSSTOX substance identifier, because it can only have one InChIKey too. Similarly, there cannot be two Wikidata items with the same DSSTOX identifier. Normally, that is, because of how Wikidata works, there can be isolated exceptions. With fewer than 25 constraint violations, the quality of the process turned out pretty high (>99.9%).
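Wikidata runs these checks itself, but the two rules are simple enough to sketch. The following is a minimal, hypothetical JavaScript version (the function name and the input shape are made up for this example) that reports items with more than one identifier (Wikidata calls this a single value constraint) and identifiers used on more than one item (a distinct values constraint).

```javascript
// `statements` is a list of { item, dsstox } pairs, one for each
// P3117 statement (hypothetical, simplified shape).
function findConstraintViolations(statements) {
  const idsByItem = new Map(); // Q-code -> Set of DSSTOX identifiers
  const itemsById = new Map(); // DSSTOX identifier -> Set of Q-codes

  for (const { item, dsstox } of statements) {
    if (!idsByItem.has(item)) idsByItem.set(item, new Set());
    idsByItem.get(item).add(dsstox);
    if (!itemsById.has(dsstox)) itemsById.set(dsstox, new Set());
    itemsById.get(dsstox).add(item);
  }

  // Keys of the map that have more than one distinct value.
  const keysWithMany = (map) =>
    [...map].filter(([, values]) => values.size > 1).map(([key]) => key);

  return {
    singleValue: keysWithMany(idsByItem), // items with 2+ identifiers
    distinctValues: keysWithMany(itemsById), // identifiers on 2+ items
  };
}
```

As for the quality number: 25 violations on about 36 thousand statements is roughly 0.07%, which is where the >99.9% comes from.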

Some of the issues have been manually inspected. Causes vary. One issue was that the Wikidata item in fact had more than one InChIKey. A possible reason for that is that it does not distinguish between various forms of a compound. Two Wikidata items have been split up accordingly. Other problems are due to features of the CompTox Dashboard, and some issues have been tweeted to the Dashboard team.

This mashup of these two resources, as anticipated in our H2020 proposal (doi:10.3897/rio.1.e7573), makes it possible to easily make slices of data. For example, we can query for experimental data for compounds in the EPA CompTox Dashboard with a SPARQL query, like this one for the dipole moment:

Importantly, this query shows the source where this data comes from, one of the advantages of Wikidata.
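For readers who want to try something like this themselves, here is a sketch of what such a query could look like and how to send it to the Wikidata SPARQL endpoint from JavaScript. It is not necessarily the exact query used for this post. P3117 and P248 are the properties discussed above; that P2201 is the property for the electric dipole moment is an assumption you should verify, and the helper names are made up for this example.

```javascript
// Compounds with a DSSTOX substance identifier (P3117) and a dipole
// moment statement (assumed here to be P2201), together with the
// "stated in" (P248) source given as reference on that statement.
const DIPOLE_QUERY = `
SELECT ?compound ?compoundLabel ?dsstox ?dipole ?source WHERE {
  ?compound wdt:P3117 ?dsstox ;
            p:P2201 ?statement .
  ?statement ps:P2201 ?dipole ;
             prov:wasDerivedFrom/pr:P248 ?source .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}`;

const ENDPOINT = "https://query.wikidata.org/sparql";

// Build the GET URL for a query, asking for JSON results.
function queryUrl(sparql) {
  return `${ENDPOINT}?format=json&query=${encodeURIComponent(sparql)}`;
}

// Run the query and return the result rows (needs network access and
// a runtime with a global fetch, such as a browser or a recent Node.js).
async function runQuery(sparql) {
  const response = await fetch(queryUrl(sparql), {
    headers: { Accept: "application/sparql-results+json" },
  });
  if (!response.ok) {
    throw new Error(`SPARQL endpoint returned ${response.status}`);
  }
  const json = await response.json();
  return json.results.bindings;
}
```

The reference part (`prov:wasDerivedFrom/pr:P248`) is what gives the source of each value; statements without such a reference will not show up in the results of this sketch.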

"10 everyday things on the web the EU Commission wants to make illegal" #03

The third example in this series is not too hard to explain.

03. Posting a blog post to social media

Because many of you are familiar with blogging and many of you blog yourself, you know what this one is about. The way I understood it, it will be legal, and you just have to figure out if and how much I would need to pay Kerstin to share this wonderful story about clinical trials in Wiki{pedia|data} on Google+:

As with all original examples, Julia's post provides a lot of legal detail, which I reshare here for this item, because you may initially think this is just about news from newspapers, but here too, wording matters:

And while I have argued a long time ago, that there are many kinds of blogging (it's just a medium, like paper), many can certainly be considered of journalistic nature. In fact, some even use their blog for getting press tickets for scientific conferences (but that's another story ;).

Well, if you are still reading this series, maybe it is time to head over to the website.

"10 everyday things on the web the EU Commission wants to make illegal" #01

OK, after moving to the second example, I realized the subtle difference with the first: I got examples 01 and 02 mixed up, and the previous post was really discussing Julia's second example. Example 01 is really about snippets of publications, like quotes. Now, before you argue that quoting is legal, realize that this depends on specifics in various jurisdictions, and, as Julia writes:

"[..] in many EU countries, sharing an extract without further commenting on its substance is not covered by that exception".

So, I hope this post provides enough commenting and substance. But that clearly does not apply to the modern way of disseminating science via Twitter.

01. Sharing what happened 20 years ago

Anyway, now I got a kickstart for the first example too: both tweets were actually about news from close to twenty years ago, as both publications are about 20 years old! So, take the first tweet with the title of the Nature News article, but now with a quote.

This will be illegal for commercial entities, and possibly for me too: there is no significant commenting. It practically means that covering the news of the past will be illegal, or very hard at least, for some of us, where "some" is ill-defined because the proposal is very unclear about who can and who cannot.

Oh, and if you're not already freaked out: it's retroactive. That is, happy cleaning up the past 20 years of dissemination you did and figure out where this example applied. Nice excuse to not do research!

"10 everyday things on the web the EU Commission wants to make illegal" #02

OK, let me first say that I hope Julia Reda does not consider herself a publisher. If you did not get the above joke, then continue reading.

This post is an attempt to translate the proposed copyright laws to, well, research. The choice of words is critical here. I deliberately did not say: academic, university, scholarly. I hope my hesitation will become more clear after reading this post too. I am not a lawyer (IANAL), but it is also important to realize that the exact meaning of law often only becomes clear when tried in court, where judges will create de facto examples of what really is allowed and not, following the intention of the law. I am a researcher and a teacher. I implement this by being a strong proponent of Open Science.

This post is about the proposed clean-up of the European copyright situation, or so it was meant to be. The practice shows differently, unfortunately. The problem is what I see happen around me (and wrote about and speak about). I see a huge gap between how the previous generation of scholars thinks about research dissemination and copyright, and how modern society sees this. And having read more about it than I should as a scientist, I cannot unsee all the contradictions there are in there.

This post will, therefore, take the 10 example activities that will soon be illegal, if we vote badly in upcoming elections and don't follow Julia's knowledge, as a starting point to highlight some of the problems I expect to happen, based on observations in doing research in the European Community.

Before I start off, one more disclaimer. The proposal is not hard to read, but like other legal works, uses a specific language. Words that have a common or scientific meaning may have a different meaning in law. So, when this post talks about a "hyperlink", I may get the legal meaning wrong. I strongly rely here on more legally knowledgeable people, like Julia. But as she also indicated in her #33C3 talk, legal definitions can be tweaked by newer laws. Two terms that are critical here which are not well-defined (IMHO), but central in the proposal are: commercial (see e.g. Breaking News: CC-NC only for personal use!) and publisher. But that's part of the problem with this proposal.

Finally, what is critical, we must not let ourselves be deluded: law only exists as a formal way to agree on things. Increasingly, very sadly, it is being used to force people into criminality.

02.a Tweeting a creative news headline

I will actually split this up into two examples, one which will be illegal, the other also, but depending on how far the term "publisher" extends. That is, are press outlets the only intended copyright holders here, or also scientific journal publishers?

This tweet reposts a news item from Nature News of about 18 years ago. This will be illegal for commercial websites. So, how does that affect me as scholar? If I do this on my personal behalf (my social accounts are not Maastricht University accounts), it probably still affects me. As Julia points out, first of all, Twitter is commercial, and they may or may not pay Springer+Nature for being allowed to tweet this....

WTF? Ho, ho, ho... you're not saying that tweeting the title of an article is illegal???

Actually, yes, that's exactly what this proposal is saying. So, let me continue. If Twitter does not pay Springer+Nature for the right to tweet this, I may have to. May, because it depends on a court to formally decide if I am commercial or not, if I ever get challenged.

It's weird, isn't it? I'm making free advertisement here, and I may need to pay money to have the right to advertise that.

However, and this is also critical, commercial entities need to comply. Some argue that some universities are commercial, so what about SMEs? What about H2020 projects, where often a significant part of the partners are SMEs? Are they commercial? Can a project like eNanoMapper still make such tweets, or would that be illegal? Who knows, but even if probably not, will they take the risk? Can they afford to? How much will it cost to make a decision? They will likely not bother and just not do it, inhibiting scholarly dissemination.

02.b Tweeting a creative news headline

Well, OK, I cannot copy/paste the title of the article and still do the advertisement. But I stress that this practice is very common among scholars; it's one of the foundations of #altmetrics.

Now, the above example used Nature News, but what about Nature itself? Or Cell by Elsevier? This is where my legal knowledge fails. At this moment I am not sure if scholarly journals are the rights owners this proposal has in mind, but I currently doubt that the owners and legal departments of the big scientific publishers will say otherwise.

So, will the next tweet still be legal?

I honestly do not know, but my current guess is this will be illegal for commercial entities.

OK, to not make this post too long, I will save the next example for a next post. To be continued!

Facts, Data and Open Data

Source, CCZero.
I was recently asked about my experiences around data sharing, and in particular the legal aspects of it. Because whether we like it or not (I think "we" generally do not like it and I see many scholars ignore it), society has an impact on scholarly research. Particularly, copyright and intellectual property (IP) laws make research increasingly expensive. I wrote up the following aspects related to that discussion. I am not a lawyer, and these laws are different in each country (think about facts, governmental output, etc). Your mileage may vary.

#1 Don't give away your copyright to any single other party

Scholars are used to this. For a very long time we would freely give our research IP to publishers. By selling that IP, publishers would fund the knowledge dissemination (often with huge profits). But institutes are starting to think about this, and are backtracking on it. Bottom line: do not give away your copyright.

The importance of this is that you will otherwise lose all control over the data. You will no longer be able to give your data to others, because it is no longer yours. Also, you can never repurpose the data anymore, because it is no longer yours. Instead, give others rights to work with the data, by removing copyright or by giving people a suitable license (see the next point).

#2 The three pillars of Open: the rights to (re)use, modify, and redistribute

Really, these three points are critical: they give anyone the rights to work with the data.

(Re)use is clear.

The right to modify is critical because it is needed for changing the format in which the data is shared (e.g. create ISATab-Nano) but also for data curation!

Redistribute is the right that anyone needs to make your data available to others. In fact, all those EULAs (end-user license agreements) that all of us sign when creating an online account give Google, Facebook, etc, etc the right to reshare (some of) the data you share with them. Clearly, without this right, ECHA, eNanoMapper, CEINT, etc cannot reshare the data with others.

#3 Copyright

Copyright law around data is very complex. For example, there are huge differences between law in European countries and in the USA. The latter, for example, has the concept of "public domain" that many European countries do not have (though we still happily use that term here too). In Europe, databases have database rights. Facts are excluded, but I have yet to find a clear statement of what a "fact" is. But a collection of facts is the outcome of a creative process (like any EC FP7 or H2020 project) and hence has copyright.

For starting projects, the consortium agreement (CA) defines how this is dealt with. And like you can give the copyright of a research paper to a publisher, a CA can define that all partners of a project have shared IP. That ensures they can all use it, but it also means it becomes really hard to share it outside the consortium. Instead, my recommendation is to keep the IP with the data creator, and make it available within the consortium with a license. Or just waive the copyright. Copyright with one legal department can already be complicated, and if you have multiple legal departments discussing IP, it certainly does not become easier.

Of course, consensus among all partners is best. I also stress that laws are just tools. Any partner can give others more rights without problems. They cannot hide behind laws. Ideally, the writing of each project proposal starts with a formal consensus on how data will be made available. Solve that before you get the money. But I will write more about that later during these holidays.

#4 Licenses and waivers

The open source community realized these issues decades ago. First with source code, leading to Open Source Initiative (OSI)-approved licenses, providing the aforementioned rights. For source code, there are also so-called waivers. The difference between the two is that a license gives you specific rights, while a waiver "waves" away any rights any law (from any jurisdiction) might automatically give. For the three "pillars" the outcome is the same: you will have those three rights. In case of a waiver, you just get any right you can think of too, whereas a license is limited to those rights specified in the license.

Now, these ideas developed in the open source community found their way to the "Open Access" (OA, for documents) community and the "Open Data" community in the last 10 years. Some lobbying forces managed to clutter the definition of Open Access, which is why the community talks about green OA and gold OA. The first is not really Open and does not give you all three rights. Gold Open Access does. A green OA article you cannot reshare.

For data there are basically two options:

  • licenses: Creative Commons (CC) licenses
  • waiver: CCZero (not a license)

For the first option, the licenses, the CC licenses come in various flavors, and this is implemented with "clauses". For example, there is an "attribution" clause. This creates the CC-BY license as you know from gold Open Access journals. This clause gives you the three rights, but also requires you to cite where you got the data.

A second CC clause is the ND (No Derivative) clause, which defines that no one can make derived products. Effectively, it removes one of the three rights. It exists with the idea that some things are not meant to change. Think for example about the JRCNMxxxx codes for nanomaterials. No one should be changing them, because it would defeat the purpose of the definition of those codes.

A third CC clause is the NC (Non-Commercial) clause. This clause specifies that you can only use that data for non-commercial purposes. Some publishers use that in their implementation of "Open Access", and it basically says that only some people get the three basic rights. Now, who "some" is, is not clearly defined. Not legally, not practically. No one really knows when something is commercial and when not. Some legal experts have argued that some American universities are commercial enterprises (source needed). In Europe, SMEs are clearly commercial entities.

A final CC clause is the SA (Share Alike) clause, which requires that people redistributing your data also make it available under the same license. This is in the open source community referred to as "copylefting" and has upsides and downsides.

I stress that in case of licenses, no IP is reassigned and the producers of the data remain owner of the IP.
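If it helps to see the clauses side by side, here is a small JavaScript sketch that encodes the description above as a data structure. It is a simplification of my own summary, not legal advice, and the names (`PILLARS`, `describeLicense`, `CCZERO`) are made up for this example.

```javascript
// The three pillars of Open.
const PILLARS = ["use", "modify", "redistribute"];

// Translate a list of CC clauses, e.g. ["BY", "NC"], into the rights
// you get and the conditions attached to them.
function describeLicense(clauses) {
  const set = new Set(clauses.map((clause) => clause.toUpperCase()));
  return {
    // ND removes the right to modify; the other two pillars remain
    rights: PILLARS.filter((right) => !(right === "modify" && set.has("ND"))),
    mustAttribute: set.has("BY"), // cite where you got the data
    nonCommercialOnly: set.has("NC"), // only "some" people get the rights
    mustShareAlike: set.has("SA"), // redistribute under the same license
  };
}

// CCZero is a waiver, not a license: all rights, no conditions.
const CCZERO = {
  rights: [...PILLARS],
  mustAttribute: false,
  nonCommercialOnly: false,
  mustShareAlike: false,
};
```

For example, `describeLicense(["BY"])` keeps all three pillars but sets `mustAttribute`, while `describeLicense(["BY", "ND"])` drops "modify" from the list of rights.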

At a recent NanoSafety Cluster meeting I gave a presentation about these matters and the slides are available here.

The SWAT4LS poster about eNanoMapper

SWAT4LS was once again a great meeting. I doubt I will find time soon enough to write up notes, but at least I can post the eNanoMapper poster I presented, which is available from F1000Research:

Willighagen E, Rautenberg M, Gebele D et al. Answering scientific questions with linked European nanosafety data [v1; not peer reviewed]. F1000Research 2016, 5:2848 (poster) (doi: 10.7490/f1000research.1113520.1)