New Paper: "Using the Semantic Web for Rapid Integration of WikiPathways with Other Biological Online Data Resources"


Andra Waagmeester published a paper on his work on a semantic web version of the WikiPathways (doi:10.1371/journal.pcbi.1004989). The paper outlines the design decisions, shows the SPARQL endpoint, and several examples SPARQL queries. These include federates queries, like a mashup with DisGeNET (doi:10.1093/database/bav028) and EMBL-EBI's Expression Atlas. That results in nice visualisations like this:


If you have the relevant information in the pathway, these pathways can help a lot in helping understanding of what is biologically going on. And, of course, used for exactly that a lot.

Press release
Because press releases have become an interesting tool in knowledge dissemination, I wanted to learn what it involved to get one out. This involved the people as PLOS Computational Biology and the press offices of the Gladstone Institutes and our Maastricht University (press release 1, press release 2 EN/NL). There is already one thing I learned in retrospect, and I am pissed with myself that I did not think of this: you should always have a graphics supporting your story. I have been doing this for a long time in my blog now (sometimes I still forget), but did not think of that in the press release. The press release was picked up by three outlets, though all basically as we presented it to them (thanks to Altmetric.com):


SPARQL
But what makes me appreciate this piece of work, and WikiPathways itself, is how it creates a central hub of biological knowledge. Pathway databases capture knowledge not easily embedded an generally structured (relational) databases. As such, expression this in the RDF format seems simple enough. The thing I really love about this approach, is that your queries become machine readable stories, particularly when you start using human readable variants of SPARQL for this. And you can share these queries with the online scientific community with, for example, myExperiment.

There are two applications how I have used SPARQL on WikiPathways data for metabolomics: 1. curation; 2. statistics. Data analysis is harder, because in the RDF world resources scientific lenses are needed to accommodate for the chemical structural-temporal complexity of metabolites. For curation, we have long used SPARQL for unit tests to support the curation of WikiPathways. Moreover, I have manually used the SPARQL end point to find curation tasks. But now that the paper is out, I can blog about this more. For now, many examples SPARQL queries can be found in the WikiPathways wiki. It features several queries showing statistics, but also some for curation. This is an example query I use to improve the interoperability of WikiPathways with Wikidata (also for BridgeDb):

SELECT DISTINCT ?metabolite WHERE {
  ?metabolite a wp:Metabolite .
  OPTIONAL { ?metabolite wp:bdbWikidata ?wikidata . }
  FILTER (!BOUND(?wikidata))
}

Feel free to give this query a go at sparql.wikipathways.org!

Triptych
This papers completes a nice triptych of three papers about WikiPathways in the past 6 months. Thanks to whole community and the very many contributors! All three papers are linked below.

Waagmeester, A., Kutmon, M., Riutta, A., Miller, R., Willighagen, E. L., Evelo, C. T., Pico, A. R., Jun. 2016. Using the semantic web for rapid integration of WikiPathways with other biological online data resources. PLoS Comput Biol 12 (6), e1004989+. http://dx.doi.org/10.1371/journal.pcbi.1004989
Bohler, A., Wu, G., Kutmon, M., Pradhana, L. A., Coort, S. L., Hanspers, K., Haw, R., Pico, A. R., Evelo, C. T., May 2016. Reactome from a WikiPathways perspective. PLoS Comput Biol 12 (5), e1004941+. http://dx.doi.org/10.1371/journal.pcbi.1004941
Kutmon, M., Riutta, A., Nunes, N., Hanspers, K., Willighagen, E. L., Bohler, A., Mélius, J., Waagmeester, A., Sinha, S. R., Miller, R., Coort, S. L., Cirillo, E., Smeets, B., Evelo, C. T., Pico, A. R., Jan. 2016. WikiPathways: capturing the full diversity of pathway knowledge. Nucleic Acids Research 44 (D1), D488-D494. http://dx.doi.org/10.1093/nar/gkv1024

Wikidata showing chemical properties with references

As you have seen in my blog, I'm a fan of Wikidata. Because of the Open nature, it's creating an enormous eco-system, in which many scientists are involved and with innovative visualizations. Data comes from many trusted databases, but the complexity of it all requires some hard decisions now and then. However, unlike many other databases, Wikidata has data provenance high on the agenda: all statements can be complemented with primary literature references, which I have been using when porting the pKa data.

SQID page for aspirin in Wikidata.
A new visualization of the data is provided by SQID Markus Krötzsch et al. And this interface propagates the references for each bit of fact, though by default hidden behind an arrow icon at the top right of the fact. Clicking that will show the provenance, though that is currently still often a database, rather than primary literature.


Section of the SQID page for aspirin, with references given for solubility, mass,
and a hazardous chemical exposure.
I really like where this is going! Why have publishers not been able to do something like this in the past 20 years?? This is knowledge dissemination as we want to see it.

Comparing sets of identifiers: the Bioclipse implementation

Source: Wikipedia
The problem
That sounds easy: take two collection of identifiers, put them in sets, determine the intersection, done. Sadly, each collection uses identifiers from different databases. Worse, within one set identifiers from multiple databases. Mind you, I'm not going full monty, though some chemistry will be involved at some point. Instead, this post is really based on identifiers.

The example
Data set 1:

Data set 2: all metabolites from WikiPathways. This set has many different data sources, and seven provide more than 100 unique identifiers. The full list of metabolite identifiers is here.

The goal
Determine the interaction of two collections of identifiers from arbitrary databases, ultimately using scientific lenses. I will develop at least two solutions: one based on Bioclipse (this post) and one based on R (later).

Needs
First of all, we need something that links IDs in the first place. Not surprisingly, I will be using BridgeDb (doi:10.1186/1471-2105-11-5) for that, but for small molecules alternatives exist, like the Open PHACTS IMS based on BridgeDb, the Chemical Translation Service (doi:10.1093/bioinformatics/btq476) or UniChem (doi:10.1186/s13321-014-0043-5, doi:10.1186/1758-2946-5-3).

The Bioclipse implementation
The first thing we need to do is read the files. I have them saved as CSV even though it is a tab-separated file. Bioclipse will now open it in it's matrix editor (yes, I think .tsv needs to be linked to that editor, which does not seem to be the case yet). Reading the human metabolites from WikiPathways is done with this code (using Groovy as scripting language):

file1 = new File(
  bioclipse.fullPath(
    "/Compare Identifiers/human_metabolite_identifiers.csv"
  )
)
set1 = new java.util.HashSet();
file1.eachLine { line ->
  fields = line.split(/\t/)
  def syscode;
  def id;
  if (fields.size() >= 2) {
    (syscode, id) = line.split(/\t/)
  }
  if (syscode != "syscode") { // ok, not the first line
    set1.add(bridgedb.xref(id, syscode))
  }
}

You can see that I am using the BridgeDb functionality already, to create Xref objects. The code skips the first line (or any line with "column headers"). The BridgeDb Xref object's equals() method ensures I only have unique cross references in the resulting set.

Reading the other identifier set is a bit trickier. First, I manually changed the second column, to use the BridgeDb system codes. The list is short, and saves me from making mappings in the source code. One thing I decide to do in the source code is normalize the ChEBI identifiers (something that many of you will recognize):

file2 = new File(
  bioclipse.fullPath("/Compare Identifiers/set.csv")
)
set2 = new java.util.HashSet();
file2.eachLine { line ->
  fields = line.split(/\t/)
  def name;
  def syscode;
  def id;
  if (fields.size() >= 3) {
    (name, syscode, id) = line.split(/\t/)
  }
  if (syscode != "syscode") { // ok, not the first line
    if (syscode == "Ce") {
      if (!id.startsWith("CHEBI:")) {
        id = "CHEBI:" + id
      } 
    }
    set2.add(bridgedb.xref(id, syscode))
  }
}

Then, the naive approach that does not take into account identifier equivalence makes it easy to list the number of identifiers in both sets:

intersection = new java.util.HashSet();
intersection.addAll(set1);
intersection.retainAll(set2)

println "set1: " + set1.size()
println "set2: " + set2.size()
println "intersection: " + intersection.size()

This reports:

set1: 2584
set2: 6
intersection: 3

With the following identifiers in common:

[Ce:CHEBI:30089, Ce:CHEBI:15904, Ca:25513-46-6]

Of course, we want to use the identifier mapping itself. So, we first compare identifiers directly, and if not matching, use BridgeDb and an metabolite identifier mapping database (get one here):

mbMapper = bridgedb.loadRelationalDatabase(
  bioclipse.fullPath(
    "/VOC/hmdb_chebi_wikidata_metabolites.bridge"
  )
)

intersection = new java.util.HashSet();
for (id2 in set2) {
  if (set1.contains(id2)) {
    // OK, direct match
    intersection.add(id2)
  } else {
    mappings = bridgedb.map(mbMapper, id2)
    for (mapped in mappings) {
      if (set1.contains(mapped)) {
        // OK, direct match
        intersection.add(id2)
      }
    }
  }
}

This gives five matches:

[Ch:HMDB00042, Cs:5775, Ce:CHEBI:15904, Ca:25513-46-6, Ce:CHEBI:30089]

The only metabolite it did not find in any pathway is the KEGG identified metabolite, homocystine. I just added this compound to Wikidata. That means that in the next metabolite mapping database, it will recognize this compound too.

The R and JavaScript implementations
I will soon write up the R version in a follow up post (but got to finish grading student reports first).

Sci-Hub succeeds where publishers fail (open and closed)

Sci-Hub use in The Netherlands is not limited to
the academic research cities. Harlingen is a small
harbor town where at best a doctor lives and
one or two students who visit parents in the
weekend. The nature of the top downloaded
paper suggests it is not a doctor :)
Data from Bohannon and Elbakyan.
Knowledge dissemination is a thing. It's not easy. In fact, it's a major challenge. Traditional routes are not efficient anymore, where they were 200 years ago. The world has moved on; the publishing industry has not. I have written plenty in this blog about how the publishers could catch up, and while this is happening, progress is (too) slow.

The changes are not only technical, but also social. Several publishers still believe we live in a industrial area, where the world has moved on into a knowledge era. More people are mining and servicing data than there are making physical things (think about that!). Access to knowledge matters, and dealing with data and knowledge stopped being something specific for academic and other research institutes many, many years ago. Arguments that knowledge is only for the highly educated is simply contradicting and bluntly ignore our modern civilization.

This makes access to knowledge a mix of technological and social evolution, and on both end many publishers fail, fail hard, fail repeatedly. I would even argue that all the new publishers are improving things, but are failing to really innovate in knowledge dissemination. And not just the publishing industry, also many scientists. Preprint servers are helpful, but this is really not the end goal. If you really care about speeding up knowledge dissemination, stop worrying about things like text mining, preprints, but you have to start making knowledge machine readable (sorry, scientist) and release that along or before your article. Yes, that is harder, but just realize you are getting well-paid for doing your job.

So, by no means the success of Sci-Hub is unexpected. It is not really the end goal I have in mind, and in many ways contradicting what I want. But the research community thinks differently, clearly. Oh wait, not just the research community, but the current civilization. The results of the Bohannon analysis of the Sci-Hub access logs I just linked to clearly shows this. There are so many aspects, and so many interpretations and remaining questions. The article rightfully asks, is it need or convenience. I argued recently the latter is likely an important reason at western universities, and that it is nothing new.

This article is a must read if you care about the future of civilization. Bonus points for a citable data set!

Bohannon, J. Who's downloading pirated papers? everyone. Science 352, 508-512 (2016). URL http://dx.doi.org/10.1126/science.352.6285.508.
Elbakyan, A. & Bohannon, J. Data from: Who's downloading pirated papers? everyone. (2016). URL http://dx.doi.org/10.5061/dryad.q447c.

Programming in the Life Sciences #22: jsFiddle

My son pointed me to jsFiddle which allows you to edit JavaScript snippets and run them. I have heard of them before, but never really got time for it. But I'm genuinely impressed with the stuff he is doing, and finally wanted to try sharing JavaScript snippets online, particularly, because I had to update the course description of Programming in the Life Sciences. In this course the students work with JavaScript and there are a number of example, but that has a lot of HTML boiler plate code.

So, here's the first of those examples, but then stripped from most of the things you don't need, and with some extra documentation as comments:

Splitting up Bioclipse Groovy scripts

Source: Wikipedia, CC-BY-SA 3.0
... without writing additional script managers (see doi:10.1186/1471-2105-10-397). That was what I was after. I found that by using evaluate() you could load additional code. Only requirements, you wrap stuff in a class, and the filename need to match the class name. So, you put stuff in a class SomeName and safe that in a Bioclipse project (e.g. SomeProject/) with the name SomeName.groovy.

That is, I have this set up:

  SomeProject/
    SomeClass.groovy
    aScript.groovy

Then, in this aScript.groovy you can include the following code to load that class and make use of the content:

  someClass = evaluate(
    new File(
      bioclipse.fullPath("/SomeProject/SomeClass.groovy")
    )
  )

Maybe there are even better ways, but this works for me. I tried the regular Groovy way of instantiating a class defined like this, but because the Bioclipse Groovy environment does not have a working directory, I could not get that to work.

Still a draft: The Amsterdam Call for Action on Open Science

It was on the agenda: "Presenting the Amsterdam Call for Action". However, a day of hard work by some 300 participants of the Dutch Presidency's meeting on Open Science did not allow for the draft to be finalized today. Instead, the editors will work the next 24h (a bit less by now) on a new draft that will be send around to the participants which will then have about a week to send in further comments.

There was enough feedback given on the draft indeed, and followers of my blog and twitter account know how much they already got from just me. It will be a busy 24 hours for the editors. I am really looking forward with the next draft they come up with. BTW, it is not clear yet if I will be able to share the draft that I get tomorrow. We'll see. At least I will tweet about whether or not my main points got addressed.

Meanwhile, the Dutch VSNU sent out a press release that "[they are] pleased with European action plan Open Science". Given that it was in fact not released yet, suggests a few things:

  1. they anticipated the draft was a done deal (which aligns with the lack of openness around the draft);
  2. automated sending out of press releases is a bad idea.

The Amsterdam Call for Action on Open Science #2: 10% Open Science

Some of the discussions here are about data sensitivity, privacy, etc. Excellent points! One confusion that should be put aside is that you cannot be an open scientist if you do not release everything. That is nonsense, FUD perhaps.

The The Amsterdam Call for Action on Open Science may very well set some goals; I do not think it currently does. It may set a goal of 10% by the end of this year (already made, probably), 20% at the end of 2018, and 30% at the end of H2020. You can read this in many ways, and here too, things are very vague.

After yesterday I herald this approach! Would it not be brilliant if at the end of this year all scholars output 10% of the research as Open Science?! Go for it!

The Amsterdam Call for Action on Open Science

First, it is very regrettable the participants of #EU2016NL #openaccess did not get access to The Amsterdam Call for Action on Open Science document (read it in this Trello) until minutes before the break out sessions. So, I am still reading it (it's 21 pages), while in the Innovation session, which will only discuss actions #8 and #9.

Here's the first page:


All participants split up in various break out sessions, and I ended up in the Innovation session. However, I find the description of Open Science insufficient for me to understand the proposed twelve actions. It does not define the core values of Open Science:


You can see my comments already. And really, this is critical: all the actions make assumptions, do not define things, which causes the problem no nothing is actionable, because you can basically do anything and still comply to the action. (In fact, another issue is that several actions are already being undertaken, but that's for later).

My recommendation is to rephrase the first sentence into:

"Open science is an umbrella term for a technology and data driven systemic change in how knowledge dissemination works and how researchers work, collaborate, share ideas and disseminate results, by adopting the core values that knowledge should be reusable, modifiable and redistributable. This allows us address the increasing demand in society to address societal challenges of our time."

These are the cover values implemented by Open Data and Open Source. Sadly, not commonly by Open Access, causing a lot of confusion in the latter area, which have been very clear at this meeting too.

Migrating pKa data from DrugMet to Wikidata

In 2010 Samuel Lampa and I started a pet project: collecting pKa data: he was working on RDF extension of MediaWiki and I like consuming RDF data. We started DrugMet. When you read this post, this MediaWiki installation may already be down, which is why I am migrating the data to Wikidata. Why? Because data curation takes effort, I like to play with Wikidata (see this H2020 proposal by Daniel Mietchen et al.), I like Open Data (see ), and it still much needed.

We opted for a page with the minimal amount of information. To maximize the speed at which we could add information. However, when it came to semantics, we tried to be as explicit as possible, and, e.g. use the CHEMINF ontology. So, it collected:
  1. InChIKey (used to show images)
  2. the paper it was collected from (identified by a DOI)
  3. the value, and where possible, the experimental error
A page typically looks something like this:


While not used on all pages, at some point I even started using templates, and I used these two, for molecules and papers:

    {{Molecule
      |Name=
      |InChIKey=
      |DOI=
      |Wikidata=
    }}

    {{Paper
      |DOI=
      |Year=
      |Wikidata=
    }}

These templates, as well as the above screenshot, already contain a spoiler, but more about that later. Using MediaWiki functionality it was now easy to make lists, e.g. for all pKa data (more spoilers):


I find a database like this very important. It does not capture all the information it should be capturing, though, as is clear from the proposal some of use worked on a while back. However, this project got on hold; I don't have time for it anymore, and it is not core to our department enough to spend time on write grant proposals for it.

But I still do not want to get this data get lost. Wikidata is something I have started using, as it is a machine readable CCZero database with an increasing amount of scientific knowledge. More and more people are working on it, and you must absolutely read this paper about this very topic (by a great team you should track, anyway). I am using it myself as source of identifier mappings and more. So, migrating the previously collected data to Wikidata makes perfect sense to me:

  1. if a compound is missing, I can easily create a new one using Bioclipse
  2. if a paper is missing, I can easily create a new one using Magnus Manske's QuickStatements
  3. Wikidata has a pretty decent provenance model
I can annotate data with the data source (paper) it came from and also experimental conditions:



In fact, you'll note that the the book is a separate Wikidata entry in itself. Better even, it's an 'edition' of the book. This is the whole point we make in the above linked H2020 proposal: Wikidata is not a database specific for one domain, it works for any (scholarly) domain, and seamlessly links all those domains.

Now, to keep track of what data I have migrated, I am annotating DrugMet entries with links to Wikidata: everything with a Wikidata Q-code is already migrated. The above pKa table already shows Q-identifiers, but I also created them for all data sources I have used (three of them are two books and one old paper without a DOI):


I have still quite a number of entries to do, but all the protocols are set up now.

On the downstream side, Wikidata is also great because of their SPARQL end point. Something that I did not get worked out some weeks ago, I did manage yesterday (after some encouragement from @arthursmith): list all pKstatements, including literature source if available:

If you run that query on the Wikidata endpoint, you get a table like this:


We here see experimental data from two papers: 10.1021/ja01489a008 and 10.1021/ed050p510. This can all be displayed a lot fancier, like make histograms, tables with 2D drawings of the chemical structures, etc, but I leave that to the reader.