New paper: "RDFIO: extending Semantic MediaWiki for interoperable biomedical data management"

Figure 10 from the article showing what the DrugMet wiki
with the pKa data looked like. CC-BY.
When I was still doing research at Uppsala University, I had an internship student, Samuel Lampa, who did wonderful work on knowledge representation and logic (check his thesis). In that same period he started RDFIO, a Semantic MediaWiki extension that provides a SPARQL endpoint and some clever features to import and export RDF. As I was already using RDF in my research, and wikis are a great way to explore how to model domain data, particularly when extracted from diverse literature, I was quite interested. Together we worked on capturing pKa data, and Samuel put DrugMet online. Extracting pKa values from primary literature is a lot of laborious work, and crowdsourcing did not pick up. This data was migrated to Wikidata about a year ago.

I also used the RDFIO extension when I started capturing nanosafety data from literature when I worked at Karolinska Institutet. I will soon write up this work, as the NanoWiki (check out these FigShare data releases) was a seminal data set in eNanoMapper, a project during which I continued adding data to test new AMBIT features.

Earlier this week Samuel's write-up of his RDFIO project was published, to which I contributed the pKa use case (doi:10.1186/s13326-017-0136-y). There are various ways to install the software, as described on the RDFIO project site. The DrugMet data, as well as the OrphaNet data from the other example use case, can also be downloaded from that site.

Lampa, S., Willighagen, E., Kohonen, P., King, A., Vrandečić, D., Grafström, R., & Spjuth, O. (2017). RDFIO: extending semantic MediaWiki for interoperable biomedical data management. Journal of Biomedical Semantics, 8 (1). http://dx.doi.org/10.1186/s13326-017-0136-y

DataCite: the PubMed for data and software

We have services like PubMed, Europe PMC, and Google Scholar to make a list of literature. Scholia/Wikidata and ORCID are upcoming services, but for data and software there are fewer options. One notable exception is DataCite (two past blogs where I mentioned it). Plenty of caution is needed when interpreting the results, because of versioning and the fact that preprints, posters, etc. are also hosted by the supported repositories (e.g. Figshare, Zenodo), but it seems the faceted browsing based on metadata is really improving.

This is what my recent "DataCite" history looks like:


And it gets even more exciting when you realize that DataCite integrates with ORCID so that you can have it all listed on your ORCID profile.

Updated HMDB identifier scheme

I have not found further details about it yet, but noticed half an hour ago that the Human Metabolome Database (doi:10.1093/nar/gks1065) seems to have changed all their identifiers: they added extra zeros. The screenshot for D-fructose on the right shows how the old identifiers are now secondary identifiers. We will face a period of a few years where one resource uses the old identifiers (archives, supplementary information, other databases, etc) while another uses the new ones.

This change has huge implications, including that mere string matching of identifiers becomes really difficult: we need to know if it uses the old scheme or the new scheme. Of course, we can see this simply from the identifier length, but we likely need a bit of software ("artificial intelligence") in our software.

I ran into the change just now, because I was working on the next BridgeDb metabolite identifier mapping database. The release of this weekend will not have the new identifiers for sure: I first need more info, more detail.

For now, if you use HMDB identifiers in your database, get prepared! Using old identifiers to link to the HMDB website seems to work fine, as they have a redirect working at the server level. Starting to think about internally updating your identifiers (by adding two zeros) is likely something to put on the agenda.
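Comparing old-style and new-style identifiers could be done by normalizing both to the longer form first. The following is only a minimal Python sketch, assuming that old identifiers are "HMDB" followed by five digits, that new ones are "HMDB" followed by seven digits, and that the two extra zeros are padded on the left of the number. The function names are made up for this illustration and are not part of HMDB or BridgeDb.

```python
import re

# Assumption: old scheme = "HMDB" + 5 digits, new scheme = "HMDB" + 7 digits.
_HMDB_ID = re.compile(r"HMDB(\d{5}|\d{7})")


def normalize_hmdb_id(identifier: str) -> str:
    """Return the seven-digit form of an old or new style HMDB identifier."""
    match = _HMDB_ID.fullmatch(identifier.strip())
    if match is None:
        raise ValueError(f"Not a recognized HMDB identifier: {identifier!r}")
    # Left-pad the numeric part with zeros up to seven digits.
    return "HMDB" + match.group(1).zfill(7)


def same_hmdb_entry(first: str, second: str) -> bool:
    """Compare two identifiers regardless of which scheme each one uses."""
    return normalize_hmdb_id(first) == normalize_hmdb_id(second)


if __name__ == "__main__":
    print(normalize_hmdb_id("HMDB00001"))
    print(same_hmdb_entry("HMDB00001", "HMDB0000001"))
```

Normalizing at the point of comparison keeps the stored identifiers untouched, which may be convenient while archives and supplementary files still use the old form.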

What about postprint servers?

Various article version types, including pre and post.
Source: SHERPA/ROMEO.
Now that preprint servers are picking up speed, let's talk about postprint servers. Sure, we have plenty of places to place and find discussions about the content of articles (e.g. PubPeer, PubMed Commons, ...), and sure we have retractions and corrections.

But what if we could just make revisions of articles?

And I'm not only talking about typo-fixes, but also clarifications that show up during post-publication peer-review. Not about full revisions; if a paper is wrong, then this is not the method of choice. They should not happen frequently either, but sometimes it is just convenient. Maybe to fix broken website URLs?

One point is, ResearchGate, Academia, Mendeley, and the like allow you to host versions, but we need to track the fixes and versioned DOIs. That metadata is essential: it is the FAIRness of the post-publication lifetime of a publication.

Text mining literature that mentions JRC representative nanomaterials

The week before a short holiday in France (nature, cycling, hiking, touristic CERN visit; thanks to Philippe for the ViaRhone tip!), I did some further work on content mining literature that mentions the JRC representative nanomaterials. One important reason was that I could play with the tools developed by Lars in his fellowship with The ContentMine.

I had about one day, as there always is work left over to finish in your first week of holiday, and had several OS upgrades to do too (happily running the latest 64-bit Debian!). But, as a good practice, I kept an Open Notebook Science record, and the initial run of the workflow turned out quite satisfactory:


What we see here is content mined from literature searched with "titanium dioxide" with the getpapers tool. AMI then extracted the nanomaterials and species information. Tools developed by Lars aggregated all information into a single JSON, which I converted into input for cytoscape.js with a simple Groovy script. Yeah, click on the image, and you get the live network.
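The conversion step could look roughly like the sketch below. It is written in Python rather than Groovy and is not the actual script. The input layout (a mapping from an article identifier to the list of terms found in that article) is purely an assumption for illustration, as the real aggregated JSON may be structured differently. The output follows the flat elements list that cytoscape.js accepts, where nodes carry a data.id and edges carry data.source and data.target.

```python
import json
from typing import Dict, List


def to_cytoscape_elements(mined: Dict[str, List[str]]) -> List[dict]:
    """Turn {article: [terms]} into a flat list of cytoscape.js elements."""
    elements = []
    seen_ids = set()

    def add_node(node_id: str, kind: str) -> None:
        # Each node is added once, even if many articles mention the term.
        if node_id not in seen_ids:
            seen_ids.add(node_id)
            elements.append({"group": "nodes", "data": {"id": node_id, "kind": kind}})

    for article, terms in mined.items():
        add_node(article, "article")
        for term in terms:
            add_node(term, "term")
            edge_id = f"{article}->{term}"
            # Skip duplicate mentions of the same term in the same article.
            if edge_id not in seen_ids:
                seen_ids.add(edge_id)
                elements.append(
                    {"group": "edges",
                     "data": {"id": edge_id, "source": article, "target": term}}
                )
    return elements


if __name__ == "__main__":
    # Hypothetical input, not real mining output.
    example = {
        "article-1": ["titanium dioxide", "Homo sapiens"],
        "article-2": ["titanium dioxide"],
    }
    print(json.dumps(to_cytoscape_elements(example), indent=2))
```

The printed JSON can be passed as the elements option when creating a cytoscape.js instance.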

So, if I find a bit of time before I get back to work, I'll convert this output also to eNanoMapper RDF for loading into data.enanomapper.net. Of course, then I will run this on other EuropePMC searches too, for other nanomaterials.

Wikidata visualizes SMILES strings with John Mayfield's CDK Depict


SVG depiction of D-ribulose.
Wikidata is building up a curated collection of information about chemicals. A lot of data originates from Wikipedia, but active users are augmenting this information. Of particular interest, in this respect, is Sebastian's PubChem ID curation work (he can use a few helping hands!). Followers of my blog know that I am using Wikidata as source of compound ID mapping data for BridgeDb.

Each chemical can have one or two associated SMILES strings: a canonical SMILES, which excludes any chirality, and an isomeric SMILES, which does include chirality. Because statement values can be linked to a formatter URL, Wikidata often has values associated with a link. For example, for the EPA CompTox Dashboard identifiers it links to that database. Kopiersperre used this approach to link to John Mayfield's CDK Depict.

Until two weeks ago, the formatter URL for both the canonical and isomeric SMILES was the same. I changed that, so that when an isomeric SMILES is depicted, it shows the perceived R,S (CIP) annotation as well. That should help further curation of Wikidata and Wikipedia content.
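The formatter URL mechanism boils down to substituting the statement value for the $1 placeholder in a URL template. Below is a minimal Python sketch of that idea. The two templates use a made-up example.org host and made-up parameters, so they are not the real CDK Depict URLs, and percent-encoding the SMILES before substitution is my assumption about what is needed to keep characters like # and = from breaking the link.

```python
from urllib.parse import quote

# Hypothetical templates; the real formatter URLs on Wikidata differ.
CANONICAL_TEMPLATE = "https://depict.example.org/svg?smi=$1"
ISOMERIC_TEMPLATE = "https://depict.example.org/svg?smi=$1&annotate=cip"


def apply_formatter_url(template: str, value: str) -> str:
    """Fill the $1 placeholder with the percent-encoded statement value."""
    # safe="" also encodes "/" which occurs in SMILES for double bond geometry.
    return template.replace("$1", quote(value, safe=""))


if __name__ == "__main__":
    smiles = "C[C@H](N)C(=O)O"
    print(apply_formatter_url(CANONICAL_TEMPLATE, smiles))
    print(apply_formatter_url(ISOMERIC_TEMPLATE, smiles))
```

Having a separate template per property is what makes it possible to request the extra stereochemistry annotation only for the isomeric SMILES.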

new paper: "A transcriptomics data-driven gene space accurately predicts liver cytopathology and drug-induced liver injury"

Figure from the article. CC-BY.
One of the projects I worked on at Karolinska Institutet with Prof. Grafström was the idea of combining transcriptomics data with dose-response data, because we wanted to know if there was a relation between the structures of chemicals (drugs, toxicants, etc) and how biological systems react to them. Basically, testing the whole idea behind quantitative structure-activity relationship (QSAR) modeling.

Using data from the Connectivity Map (Cmap, doi:10.1126/science.1132939) and NCI60, we set out to do just that. My role in this work was to explore the actual structure-activity relationship. The Chemistry Development Kit (doi:10.1186/s13321-017-0220-4) was used to calculate molecular descriptors, and we used various machine learning approaches to explore possible regression models. The bottom line was that it was not possible to correlate the chemical structures with the biological activities. We explored the reason and ascribe this to the high diversity of the chemical structures in the Cmap data set. In fact, they selected the chemicals in that study based on chemical diversity. All the details can be found in this new paper.

It's important to note that these findings do not invalidate the QSAR concept; they only show that the compounds were selected very unfortunately, making exploration of this idea impossible by design.

However, using the transcriptomics data and a method developed by Juuso Parkkinen, we were able to find multivariate patterns. In fact, what we saw is more than is presented in this paper, as we have not been able to back up further findings with supporting evidence yet. This paper, however, presents experimental confirmation that predictions based on this component model, coined the Predictive Toxicogenomics Gene Space, actually make sense. Biological interpretation is presented using a variety of bioinformatics analyses. But a full mechanistic description of the components is yet to be developed. My expectation is that we will be able to link these components to key events in biological responses to exposure to toxicants.

 Kohonen, P., Parkkinen, J. A., Willighagen, E. L., Ceder, R., Wennerberg, K., Kaski, S., Grafström, R. C., Jul. 2017. A transcriptomics data-driven gene space accurately predicts liver cytopathology and drug-induced liver injury. Nature Communications 8. 
https://doi.org/10.1038/ncomms15932

The Elsevier-SciHub story

I blogged earlier today why I try to publish all my work gold Open Access. My ImpactStory profile shows I score 93% and notes that 10% of scientists in general score in that range. But then again, some publishers do make it hard for us to publish gold Open Access. And then if the STM industry spreads FUD for their, and only their, good ("Sci-Hub does not add any value to the scholarly community.", doi:10.1038/nature.2017.22196), I get annoyed. Particularly, as the system makes young scientists believe that transferring copyright to a publisher (for free, in most cases) is a normal thing to do.

As said, I have no doubt that under current copyright law it was to be expected that Sci-Hub was going to be judged to violate that law. I also blogged previously that I believe copyright is not doing our society a favor (mind you, all my literature is copyrighted, and much of it I license to readers allowing them to read my work, copy it (e.g. share it with colleagues and students), and even modify it, e.g. allowing journals to change their website layout without having to ask me). About copyright, I still highly recommend Free Culture by Prof. Lessig (who unfortunately did not run for presidency).

To get a better understanding of Sci-Hub and its popularity (I believe gold Open Access is the real solution), I looked at what literature was in Wikidata, using Scholia (wonderful work by Finn Nielsen, see arXiv). I added a few papers and annotated papers with their main subjects. I guess there must be more literature about Sci-Hub, but this is the "co-occurring topics graph" provided by Scholia at the time of writing:


It's a growing story.

As a PhD student, I was often confronted with Closed Access.

It sounds like a problem not so common in western Europe, but it was when I was a fresh student (around 1994). The Radboud's University Library certainly did not have all journals and for one journal I had to go to a research department and sit in their coffee room. Not a problem at all. Big Package deals improved access, but created a vendor lock-in. And we're paying Big Time for these deals now, with insane year-over-year inflation of the prices.

But even then, I was repeatedly confronted with not having access to literature I wanted to read. Not just me, btw, for PhD students this was very common too. In fact, they regularly visited other universities, just to make some copies there. An article basically cost a PhD student a train trip and a euro or two in copying costs (besides the package deal cost for the visited university, of course). Nothing much has changed, despite the fact that in this electronic age the cost should have gone down significantly, instead of up.

That Elsevier sues Sci-Hub (about Sci-Hub, see this and this), I can understand. It's good to have a court decide what is more important: Elsevier's profit or the human right of access to literature (doi:10.1038/nature.2017.22196). This is extremely important: how does our society want to continue? Do we want a fact-based society, where dissemination of knowledge is essential, or do we want a society where power and money decide who benefits from knowledge?

But the STM industry claiming that Sci-Hub does not contribute to the scholarly community is plain outright FUD. In fact, it's outright lies. The fact that Nature does not call out those lies in their write up is very disappointing, indeed.

I do not know if it is the ultimate solution, but I strongly believe in a knowledge dissemination system where knowledge can be freely read, modified, and redistributed. Whether Open Science, or gold Open Access.

Therefore, I am proud to be one of the 10 Open Access proponents at Maastricht University. And a huge thank you to our library for continuing to push Open Access in Maastricht.


You are what you do, or how people got to see me as an engineer

Source, Wikicommons, CC-BY-SA.
Over the past 20 years I have had endless discussions about what the research is that I do. Many see my work as engineering, but I vigorously disagree. But some days it's just too easy to give up instead of explaining things yet again. The question came up several times again in the past few months, and it was suggested that I make a choice. That's modern academia for you: you have to excel in something tiny, and a complex and hard to explain ambition is losing out to the system based on funding, buzz words, "impact", and such. So, again, I am trying to make up my defense as to why my research is not engineering. You know what is ironic? It's all the fault of Open Science! Darn Open Science.

In case you missed it (no worries, many of the people I talk to in depth about these things do, IMHO), my research is of a theoretical nature (I tried bench chemistry, but my back is not strong enough for that): I am interested in how to digitally represent chemical knowledge. I get excited about Shannon entropy and books from Hofstadter. I do not get excited about "deep learning" (boring! In fact, the only fun I get out of that is pointing you to this). So, arguably, I am in the wrong field of science. One could argue I am not a biologist or chemist, but a computer scientist, or maybe a philosopher (mind you, I have a degree in philosophy).

And that's actually where it starts getting annoying. Because I do stuff on a computer, people associate me with software. And software is generally seen as something that Microsoft does... hello, engineering. The fact that I publish papers on software (think CDK, Bioclipse, Jmol) does not help, of course.

That's where that darn Open Science comes in. Because I have a varied set of skills, I actually know how to instruct a computer to do something for me. It's like writing English, just to a different person, um, thingy. Because of Open Science, I can build the machines that I need to do my science.

But a true scientist does not make their own tools; they buy them (of course, that's an exaggeration, but just realize how well we value data and software citations at this time). They get loads of money to do so, just so that they don't have to make machines. And just because I don't ask for loads of money, or ask for a bit of money to actually make the tools I need, I am tagged as an engineer. And I, I got tricked by Open Science into fixing things, adding things. What was I thinking??

Does this resonate with experience from others? Also upset about it? What can we do about this?

(So, one of my next blog posts will be about the new scientific knowledge I have discovered. I have to say, not as much as I wanted, mostly because we did not have the right tools yet, which I have to build first, but that's what this post is about...)