New paper: "A transcriptomics data-driven gene space accurately predicts liver cytopathology and drug-induced liver injury"

Figure from the article. CC-BY.
One of the projects I worked on at Karolinska Institutet with Prof. Grafström was the idea of combining transcriptomics data with dose-response data, because we wanted to know if there was a relation between the structures of chemicals (drugs, toxicants, etc.) and how biological systems react to them. Basically, testing the whole idea behind quantitative structure-activity relationship (QSAR) modeling.

Using data from the Connectivity Map (Cmap, doi:10.1126/science.1132939) and NCI60, we set out to do just that. My role in this work was to explore the actual structure-activity relationship. The Chemistry Development Kit (doi:10.1186/s13321-017-0220-4) was used to calculate molecular descriptors, and we used various machine learning approaches to explore possible regression models. The bottom line was that it was not possible to correlate the chemical structures with the biological activities. We explored the reason and ascribe this to the high diversity of the chemical structures in the Cmap data set. In fact, the chemicals in that study were selected based on chemical diversity. All the details can be found in this new paper.

It's important to note that these findings do not invalidate the QSAR concept. They only show that the compounds were selected very unfortunately, making exploration of this idea impossible, by design.

However, using the transcriptomics data and a method developed by Juuso Parkkinen, it is possible to find multivariate patterns. In fact, what we saw is more than is presented in this paper, as we have not yet been able to back up the further findings with supporting evidence. This paper, however, presents experimental confirmation that predictions based on this component model, coined the Predictive Toxicogenomics Gene Space, actually make sense. Biological interpretation is presented using a variety of bioinformatics analyses. But a full mechanistic description of the components is yet to be developed. My expectation is that we will be able to link these components to key events in biological responses to exposure to toxicants.

 Kohonen, P., Parkkinen, J. A., Willighagen, E. L., Ceder, R., Wennerberg, K., Kaski, S., Grafström, R. C., Jul. 2017. A transcriptomics data-driven gene space accurately predicts liver cytopathology and drug-induced liver injury. Nature Communications 8. 
https://doi.org/10.1038/ncomms15932

The Elsevier-SciHub story

I blogged earlier today why I try to publish all my work gold Open Access. My ImpactStory profile shows I score 93%, and notes that 10% of scientists in general score in that range. But then again, some publishers do make it hard for us to publish gold Open Access. And then if the STM industry spreads FUD for its own and only its own good ("Sci-Hub does not add any value to the scholarly community.", doi:10.1038/nature.2017.22196), I get annoyed. Particularly, as the system makes young scientists believe that transferring copyright to a publisher (for free, in most cases) is a normal thing to do.

As said, I have no doubt that under current copyright law it was to be expected that Sci-Hub was going to be judged to violate that law. I also blogged previously that I believe copyright is not doing our society a favor (mind you, all my literature is copyrighted, and much of it I license to readers, allowing them to read my work, copy it (e.g. share it with colleagues and students), and even modify it, e.g. allowing journals to change their website layout without having to ask me). About copyright, I still highly recommend Free Culture by Prof. Lessig (who unfortunately did not run for president).

To get a better understanding of Sci-Hub and its popularity (I believe gold Open Access is the real solution), I looked at what literature was in Wikidata, using Scholia (wonderful work by Finn Nielsen, see arXiv). I added a few papers and annotated papers with their main subjects. I guess there must be more literature about Sci-Hub, but this is the "co-occurring topics graph" provided by Scholia at the time of writing:


It's a growing story.

As a PhD student, I was often confronted with Closed Access.

It sounds like a problem not so common in western Europe, but it was when I was a fresh student (around 1994). The Radboud University library certainly did not have all journals, and for one journal I had to go to a research department and sit in their coffee room. Not a problem at all. Big package deals improved access, but created a vendor lock-in. And we're paying big time for these deals now, with insane year-over-year inflation of the prices.

But even then, I was repeatedly confronted with not having access to literature I wanted to read. Not just me, btw, for PhD students this was very common too. In fact, they regularly visited other universities, just to make some copies there. An article basically cost a PhD student a train trip and a euro or two in copying costs (besides the package deal cost for the visited university, of course). Nothing much has changed, despite the fact that in this electronic age the cost should have gone down significantly, instead of up.

That Elsevier sues Sci-Hub (about Sci-Hub, see this and this), I can understand. It's good to have a court decide what is more important: Elsevier's profit or the human right of access to literature (doi:10.1038/nature.2017.22196). This is extremely important: how does our society want to continue? Do we want a fact-based society, where dissemination of knowledge is essential, or do we want a society where power and money decide who benefits from knowledge?

But the STM industry claiming that Sci-Hub does not contribute to the scholarly community is plain outright FUD. In fact, it's outright lies. The fact that Nature does not call out those lies in their write up is very disappointing, indeed.

I do not know if it is the ultimate solution, but I strongly believe in a knowledge dissemination system where knowledge can be freely read, modified, and redistributed. Whether Open Science, or gold Open Access.

Therefore, I am proud to be one of the 10 Open Access proponents at Maastricht University. And a huge thank you to our library for continuing to push Open Access in Maastricht.


You are what you do, or how people got to see me as an engineer

Source, Wikicommons, CC-BY-SA.
Over the past 20 years I have had endless discussions about what research it is that I do. Many see my work as engineering, but I vigorously disagree. But some days it's just too easy to give up rather than explain things yet again. The question came up several times again over the past few months, and it was suggested that I make a choice. That's modern academia for you: you have to excel in something tiny, and complex, hard-to-explain ambition is losing out to a system based on funding, buzz words, "impact", and such. So, again, I am trying to make up my defense as to why my research is not engineering. You know what is ironic? It's all the fault of Open Science! Darn Open Science.

In case you missed it (no worries, many of the people I talk to in depth about these things do, IMHO), my research is of a theoretical nature (I tried bench chemistry, but my back is not strong enough for that): I am interested in how to digitally represent chemical knowledge. I get excited about Shannon entropy and books from Hofstadter. I do not get excited about "deep learning" (boring! In fact, the only fun I get out of that is pointing you to this). So, arguably, I am in the wrong field of science. One could argue I am not a biologist or chemist, but a computer scientist, or maybe a philosopher (mind you, I have a degree in philosophy).

And that's actually where it starts getting annoying. Because I do stuff on a computer, people associate me with software. And software is generally seen as something that Microsoft does... hello, engineering. The fact that I publish papers on software (think CDK, Bioclipse, Jmol) does not help, of course.

That's where that darn Open Science comes in. Because I have a varied set of skills, I actually know how to instruct a computer to do something for me. It's like writing English, just to a different person, um, thingy. Because of Open Science, I can build the machines that I need to do my science.

But a true scientist does not make their own tools; they buy them (of course, that's an exaggeration, but just realize how well we value data and software citations at this time). They get loads of money to do so, just so that they don't have to make machines. And just because I don't ask for loads of money, or ask for a bit of money to actually make the tools I need, I am tagged as an engineer. And I, I got tricked by Open Science into fixing things, adding things. What was I thinking??

Does this resonate with experience from others? Also upset about it? What can we do about this?

(So, one of my next blog posts will be about the new scientific knowledge I have discovered. I have to say, not as much as I wanted, mostly because we did not have the right tools yet, which I have to build first, but that's what this post is about...)

New paper: "The Chemistry Development Kit (CDK) v2.0: atom typing, depiction, molecular formulas, and substructure searching"

This paper was long overdue. But software papers are not easy to write, particularly not follow-up papers. That actually seems a lot easier for databases. Moreover, we already publish too much. However, the scholarly community does not track software citations (nor data citations, but there seems to be a bit more momentum there; larger user group?). So, we need these kinds of papers, and just a versioned, archived software release (e.g. on Zenodo) is not enough. But, there it is, the third CDK paper (doi:10.1186/s13321-017-0220-4). Fifth, if you include the two papers about ring finding, which also describe CDK functionality.


Of course, I could detail what this paper has to offer, but let's not spoil the article. It's CC-BY so everyone can readily access it. You don't even need Sci-Hub (but you are allowed to use it for this paper; it's legal!).

A huge thanks to all co-authors, to John for his work as release manager and for great performance improvements as well as code improvement, code clean up, etc, and to all developers who are not co-authors on this paper but contributed bigger or smaller patches over time (plz check the full AUTHOR list!). That list does not include the companies that have been supporting the project in kind, tho. Also huge thanks to all the users, particularly those who have used the CDK in downstream projects, many of which are listed in the introduction of the paper.

And, make sure to follow John Mayfield's blog with tons of good stuff.

May 29, Delft, The Netherlands: "Open Science: the National Plan and you"

In less than ten days, a first national meeting is organized in Delft, The Netherlands, where researchers can meet researchers to talk about Open Science. Mind you, researcher is very broad: it is anyone doing research, at home (e.g. citizen science, or as a hobby), at work (company or research institute), or in an educational setting (university, HBOs, ...). After all, everyone benefits from Open Science (at least from that by others! "Standing on the shoulders of Open Science, ...")

The meeting is part of the National Plan Open Science (see also Open Science is already a thing in The Netherlands), which is a direct result of the Open Science meeting in Amsterdam during the Dutch presidency which resulted in the Amsterdam Call for action on Open Science.

The program for the #npos2017 meeting is very interactive. It starts with obligatory introductions, explaining how Open Science fits into the national future research landscape, but quickly moves to practical experiences from researchers, a Knowledge Commons session where everyone can show and discuss their Open Science works (with a free lunch: yes, #OpenScience and free lunches are compatible), a number of breakout sessions where the "but how" can be discussed and answered (topics in the image below), and a wrap up panel to wrap up the break out sessions, and a free drink afterwards.

During the Knowledge Commons I will join Andra Waagmeester (Micelio) and Yaroslav Blanter (Delft University) to show Wikidata, and how I have been using this for data interoperability for the WikiPathways metabolism pathways (via BridgeDb).

The meeting is free and you can sign up here. Looking forward to meeting you there!


GenX spill, national coverage, but where is the data

First (I have never blogged much about risk and hazard), I am not a toxicological expert nor a regulator. I have the deepest respect for both, as their studies are among the most complex ones I am aware of. It makes rocket science look dull. However, I have quite some experience in relating chemical structure to properties and with knowledge integration, which is a prerequisite for understanding that relation. Nothing I do says what the right course of action is. Any new piece of knowledge (or technology) has pros and cons. It is science that provides the evidence to support finding the right balance. It is science I focus on.

The case
The AD national newspaper reported that a compound with the name GenX was spilled into the environment and reached drinking water. This was picked up by other newspapers, like de VK. The chemistry news outlet C2W commented on the latter on Twitter:


Translated, the tweet reports that we do not know if the compound is dangerous. Now, to me, there are then two things: first, any spilling should not happen (I know this is controversial, as people are more than happy to repeatedly pollute the environment, just because of self-interest and/or laziness); second, what do we know about the compound? In fact, what is GenX even? It certainly won't be "generation X", though we don't actually know the hazard of that either. (We have IUPAC names, but just like with the ACS disclosures, companies like to make up cryptic names.)

But having worked on predictive toxicology and data integration projects around toxicology, and just out of chemical interest, I started searching for what we know about this compound.

Of course, I need an open notebook for my science, but I tend to be sloppy and mix up blog posts like this, with source code repositories, and public repositories. For new chemicals, as you could read earlier this weekend, Wikidata is one of my favorites (see also doi:10.3897/rio.1.e7573). Using the same approach as for the disclosures, I checked if Wikidata had entries for the ammonium salt and the "active" ingredient FRD-903 (to be fair, chemically they are different, and so may be their hazard and risk profiles). Neither existed, so I added them using Bioclipse and QuickStatements (a wonderful tool by Magnus Manske): GenX and FRD-903. So, a seed of knowledge was planted.
    A side topic... if you have not looked at hypothes.is yet, please do. It allows you to annotate (yes, there are more tools that allow that, but I like this one), which I have done for the VK article:


I had a look around on the web for information, and there is not a lot. A Wikidata page with further identifiers then helps with tracking your steps. Antony Williams, previously of ChemSpider fame, now working on the EPA CompTox Dashboard, added the DTX substance IDs, but the entries in the dashboard will not show up for another bit of time. For FRD-903 I found growth inhibition data in ChEMBL.

But Nina Jeliazkova pointed me to her LRI AMBIT database (poster abstract doi:10.1016/j.toxlet.2016.06.1469, PDF links) that makes (public) data from ECHA available from REACH dossiers in a machine readable way (see this ECHA press release), using their AMBIT software (doi:10.1186/1758-2946-3-18). (BTW, this makes the legal hassle Hartung had last year even more interesting, see doi:10.1038/nature.2016.19365). After creation of a free login, you can find a full (public) dossier with information about the toxicology of the compound (toxicity, ecotoxicity, environmental fate, and more):


I show this slide, as the worry seems to be about drinking water, so oral toxicity seems appropriate (note, this is only acute toxicity). The LD50 is the median lethal dose, but is only measured for mouse and rat (these are models for human toxicity, but only models, as humans are just not rats; well, not literally, anyway). Also, >1 gram per kilogram body weight ("kg bw"; assumption) seems pretty high. In my naive understanding, the rat may be the canary in the coal mine. But let me refrain from drawing any conclusions. I leave that to the experts on risk management!

Experts like those from the Dutch RIVM, which wrote up this report. One piece of information they say is missing is that of biodistribution: "waar het zich ophoopt", or in English, where the compound accumulates.

The ACS Spring disclosures of 2017 #2: some history

Bethany Halford adds some history about the sessions (see part #1):
    I believe Stu Borman was the first to cover the Division of Medicinal Chemistry’s First Time Disclosures symposium for C&EN, but it was Carmen Drahl who began the practice of hand-drawing and tweeting the clinical candidates as they were disclosed in real time. This seems like an oddball practice to folks who aren’t at the meeting. Why not just take a picture of the relevant slide? Well, that’s against the rules: There are signs all over the ACS National Meeting stating that photos, video, and audio recording of presentations are strictly prohibited. In San Francisco, symposium organizer Jacob Schwarz repeatedly reminded attendees that this was the case. Carmen’s brilliant idea to get around this rule was to simply draw the structures as they were presented, snap a photo, and then tweet it out.

    I’ve inherited the task since Carmen left the magazine a couple of years ago. I find it incredibly stressful. For an event that’s billed as a disclosure, the actual disclosing is fairly fleeting. The structures are often not on the screen for very long, and I’m never confident that I’ve got it 100% right. Last year in San Diego I tweeted out one structure and I heard the following day from Anthony Melvin Crasto, a chemist in India, that based on the patent literature he thought I had an atom wrong. I was certain that I had written this structure correctly, so I contacted the presenting scientist. He had disclosed the wrong structure!

    I agree that there should be some sort of database established afterwards, and I think you all have done great work on that front. I think you’ll find the pharmaceutical companies reluctant to help you out in any way. They guard these compounds so fiercely that it often makes me wonder why we have this symposium to begin with.

The ACS Spring disclosures of 2017 #1

At the American Chemical Society meetings drug companies disclose recent new drugs to the world. Normally, the chemical structures are already out in the open, often as part of patents. But because these patents commonly discuss many compounds, the disclosures are a big thing.

Now, these disclosure meetings are weird. You will not get InChIKeys (see doi:10.1186/s13321-015-0068-4) or something similar. No, people sit down with paper, manually redraw the structure. Like Carmen Drahl has done in the past. And Bethany Halford has taken over that role at some point. Great work from both! The Chemical & Engineering News has aggregated the tweets into this overview.

Of course, a drug structure disclosure is not complete if it does not lead to deposition in databases. The first thing is to convert the drawings into something machine readable. And thanks to the great work from John May on the Chemistry Development Kit and the OpenSMILES team, I'm happy with this being SMILES. So, we (Chris Southan and me) started a Google Spreadsheet with CCZero data:



I drew the structures in Bioclipse 2.6.2 (which has CDK 1.5.13) and copy-pasted the SMILES and InChIKey into the spreadsheet. Of course, it is essential to get the stereochemistry right. The stereochemistry of the compounds was discussed on Twitter, and we think we got it right. But we cannot be 100% sure. For that, it would have been hugely helpful if the disclosures included the InChIKeys!

As I wrote before, I see Wikidata as a central resource in a web of linked chemical data. So, using the same code I used previously to add disclosures to Wikidata, I created Wikidata items for these compounds, except for one that was already in the database (see the right image). The code also fetches PubChem compound IDs, which are also listed in this spreadsheet.

The Wikidata IDs link to the SQID interface, giving a friendly GUI, one that I actually brought up before too. That said, until people add more information, it may be a bit sparsely populated:


But others are working on this series of disclosures too, and keep an eye on this blog post, as others may follow up with further information!

Closed access book chapters, Bookmetrix, and job creations

Enjoying my Saturday morning (you can actually track down that I write more blog posts then than at any other time of the week) with a coffee (no, not beer, Christoph). Wanted to complete my Scholia profile (great work by Finn, arxiv:1703.04222, happy to have contributed ideas and small patches) a bit more (or perhaps that of the Journal of Cheminformatics), as that relaxes me, and nicely complements rerunning some Bioclipse scripts to add metabolite/compound data to Wikidata (e.g. this post). Because this afternoon I want to do some serious work, like write up outlines for a few cool grant applications. And if lucky, I may be able to do a bit of work on this below-the-radar project.

So, I started updating the "full work available at" statement for a peer-reviewed IEEE paper (doi:10.1109/BIBM.2014.6999367), as it is not gold Open Access, and I have to rely on green Open Access. Then I headed over to my ImpactStory profile and ran into a closed access book chapter with Tony, Sean, and Ola (doi:10.1007/978-1-62703-050-2_10). But I have no idea if I can put online a green Open Access version of this book chapter.

Now, why I am blogging this (and meanwhile, adding four new DTXSIDs to Wikidata) comes down to two observations. First, I had not blogged about Bookmetrix yet, a cool project that reports the impact of book chapters. The ROI on writing book chapters I always considered as not so high, but then I saw the #altmetrics for this chapter:


Five citations is not a lot, but then I do not cite book chapters much either. But look at that number of downloads, 2.39 thousand! Wow!

But there is another angle to that. We regularly report our societal impact, nowadays. It's part of the Dutch Standard Evaluation Protocol, or at least selected by our research institute as something to assess researchers on. Hang on, no, citations are not part of that category. But this is: the chapter is sold for about 50 euro. Seriously? Yes, seriously. And apparently 2.39K people bought this chapter. I am not sure if I need to assume that this is mostly people buying the full book, which means the chapter is a lot cheaper. But the full book reports download numbers of above 50 thousand, so it seems not. Now, let's assume that a good part of the bought copies is via package deals and the average payment is half. That may sound high, but we ignore the 50k downloads for the full book to compensate for that.

Doing that math means that our joint book chapter contributed 60k euro to the European market. That's a full job the four of us created with this single book chapter. I'm impressed.
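For those who want to check that math, here is a minimal sketch of the back-of-the-envelope estimate. The function name is made up for illustration, and the inputs are just the rough numbers from this post: 2.39 thousand downloads, about 50 euro per chapter, and the assumption that the average payment is half the list price.

```python
def chapter_revenue_eur(downloads: int, list_price_eur: float, paid_fraction: float) -> float:
    """Estimate revenue as downloads x list price x average fraction of the price paid."""
    return downloads * list_price_eur * paid_fraction


downloads = 2390       # "2.39 thousand" downloads reported by Bookmetrix
list_price_eur = 50.0  # "about 50 euro" per chapter
paid_fraction = 0.5    # assumption from this post: average payment is half

estimate = chapter_revenue_eur(downloads, list_price_eur, paid_fraction)
print(f"{estimate:,.0f} euro")  # prints "59,750 euro"
```

So the "60k euro" is the rounded outcome, and it scales linearly with the assumed fraction: change that guess and the estimate changes with it.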