Databasing nanomaterials: substance APIs

Cell uptake of gold nanoparticles in human cells. Source. CC-BY 4.0.
Nanomaterials are quite interesting from a science perspective: first, they are materials and, as such, not so well defined. They can best be described as a distribution of similar nanoparticles. That is, unlike small compounds, which we commonly describe as pure materials, nanomaterials have a size distribution, surface differences, etc. But akin to the QSAR paradigm, because the particles are similar enough, we can expect similar interaction effects, and thus treat them as the same. A nanomaterial is basically a large collection of similar nanoparticles.

Until they start interacting, of course. Cell membrane penetration is studied at the single nanoparticle level, and that makes for interesting pictures (see top left). The same holds when we do computation: then too, we typically study a single material. On the other hand, many nanosafety studies work with the material, at a certain dosage: they study cell death, transcriptional changes, etc., when the material is brought into contact with some biosample.

The synthesis is equally interesting. Because of the nature of many manufacturing processes (and the literature on synthesizing new materials is enormous), it is typically not well understood what the nanomaterial or even the nanoparticle looks like. This is overcome by studying the bulk properties and reporting some physicochemical properties, like the size distribution, with methods like DLS and TEM. The field simply lacks the equivalent of what NMR is for (small) organic compounds.

Now, try capturing this in a unified database. That's exactly what eNanoMapper is doing, and with a modern approach: it's a database project, not a website project. We develop APIs and test all aspects of the database extensively using test data. Of course, using the API we can easily create websites (there currently are JavaScript and R client libraries), and we have done so. It's great to be working with so many great domain specialists who get things done!
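To give an impression of the API, here is a minimal sketch in R; the host (data.enanomapper.net), the /substance search endpoint, and the JSON layout are assumptions on my side, so check the API documentation:

library(httr)
library(jsonlite)

# search substances (nanomaterials) by name
resp <- GET("https://data.enanomapper.net/substance",
  query=list(search="titanium"), accept("application/json"))
substances <- fromJSON(content(resp, as="text"))
substances$substance$name  # names of the matching nanomaterials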

There is a lot to write and discuss about this, but I will end now by just pointing you to our recent paper outlining much of the cheminformatics of this new nanosafety database solution.

Of course, in our group we study nanosafety and nanoresponse (think nanomedicine) at a systems biology level. So, here's the obligatory screenshot of the work of one of our interns (Stan van Roij). Not fully integrated with the database yet, though.

Jeliazkova, N., Chomenidis, C., Doganis, P., Fadeel, B., Grafström, R., Hardy, B., Hastings, J., Hegi, M., Jeliazkov, V., Kochev, N., Kohonen, P., Munteanu, C. R., Sarimveis, H., Smeets, B., Sopasakis, P., Tsiliki, G., Vorgrimmler, D., Willighagen, E., Jul. 2015. The eNanoMapper database for nanomaterial safety information. Beilstein Journal of Nanotechnology 6, 1609-1634.

Twitter at conferences

I have been happily tweeting the BioMedBridges meeting in Hinxton last week using the #lifesciencedata hashtag, along with more than 100 others, though a small subset was really active. A lot has been published about using Twitter at conferences, like the recent paper by Ekins et al. (doi:10.1371/journal.pcbi.1003789).

The backchannel discussions only get better when more and more people join, and when complementary information is passed around. For example, I tend to tweet links to papers that appear on slides, chemical/protein structure mentioned, etc. I have also started tweeting ORCID identifiers of the speakers if I can find them, in addition to adding them to a Lanyrd page.

Like at most meetings, people ask me about this tweeting. Why do I do it? Doesn't it distract you from the presentation? I understand these questions.

First, I started recording my notes of meetings electronically during my PhD, because I needed to write a summary of each meeting for my funder. So, when Twitter came along, and after I had already built up some experience blogging summaries of meetings, I realized that I might as well tweet my notes. And since I was looking up DOIs of papers anyway, the step was not big. The effect, however, was huge. People started replying, some at the conference itself, some not. This resulted in a lot of meetings with people at the conference. Tweetups do not regularly happen anymore, but it's a great first line for people, "hey, aren't you doing all that blogging", and before you know it, you are talking science.

Second, no, it does not significantly distract me from the talk. First, like listening to a radio while studying, it keeps me focused. Yes, I very much think this differs from person to person, and I am not implying that it generally is not distracting. But it keeps me busy, which is very useful during some talks, when people in the audience otherwise start reading email. If I look up details (papers, project websites, etc) from the talk, I doubt I am more distracted than some others.

Third: what about keeping up? Yes, that's a hard one, and I was beaten in coverage speed by others during this meeting. That was new to me, but I liked it. Sadly, some of the most active people left the meeting after the first day. So, I was wondering how I could speed up my tweeting, or, alternatively, how it could take me less time so that I can read more of the other tweets. Obvious candidates are the tweets with additional information, like papers, etc.

So, I started working on some R code to help me tweet faster, and using the great collection of rOpenSci packages, I have come up with the first two helper methods below. In both examples, I am using an #example hashtag.

Tweeting papers
This makes use of the rcrossref package to fetch the name of the first author and title of the paper.

Tweeting speakers
Or perhaps, tweeting people. This bit of code makes use of the rorcid package.

Of course, you are more interested in the code than in the screenshots, so here it is (public domain; I may make a package out of this):


library(twitteR)
library(rcrossref)
library(rorcid)

setup_twitter_oauth("YOURINFO", "YOURINFO")

tweetAuthor = function(orcid=NULL, hashtag=NULL) {
  person = as.orcid(orcid)
  firstName = person[[1]]$"orcid-bio"$`personal-details`$`given-names`$value
  surname = person[[1]]$"orcid-bio"$`personal-details`$`family-name`$value
  orcidURL = person[[1]]$"orcid-identifier"$uri
  # compose and send the tweet
  tweet(
    paste(firstName, " ", surname, " orcid:",
      orcid, " ", orcidURL, " #", hashtag, sep="")
  )
}

tweetPaper = function(doi=NULL, hashtag=NULL) {
  info = cr_cn(dois=doi, format="citeproc-json")
  tweet(
    paste(
      info$author[[1]]$family, " et al. \"",
      substr(info$title, 0, 60), "...\" ",
      "https://doi.org/", info$DOI, " #", hashtag, sep=""
    )
  )
}

Getting your twitteR to work (the authentication, that is) may be the hardest part. I do plan to add further methods like: tweetCompound(), tweetProtein(), tweetGene(), etc...

Got access to literature?

Got access to literature? Only yesterday I discovered that resolving some Nature Publishing Group DOIs does not necessarily lead to useful information. High quality metadata about literature is critical for the future of science. Elsevier just showed how creative publishers can be in interpreting laws and licenses (doi:10.1038/527413f).
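You can check for yourself what machine readable metadata a DOI gives you; a minimal sketch with the rcrossref package also used elsewhere in this post (the DOI below is just an example):

library(rcrossref)

# fetch the citeproc JSON metadata behind a DOI
info <- cr_cn(dois="10.1038/527413f", format="citeproc-json")
info$title  # does this look like useful information?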

So, it may be interesting to regularly check your machine readable Open Access metadata. ImpactStory helps here with their Open Access Badge. New to me was what Daniel pointed me to: dissemin (@disseminOA). Just pass your ORCID and you end up with a nice overview of what the world knows about the open/closed status of your output.

I would not say my report is flawless, but that nicely shows how important it is to get this flow of metadata right! For example, there are some data sets and abstracts detected as publications; in fairness, I think this is to a good extent due to my inability to annotate them properly in my ORCID profile.

WikiPathways: capturing the full diversity of pathway knowledge

Figure from the new NAR paper.
Biology is a complex matter. Biological matter indeed involves many different chemicals in very many temporospatial forms: small compounds may be present in different charge states (proteins too, of course), tautomers, etc. Proteins may exhibit isoforms, various post-translational modifications, etc. Genes show structures we are only now starting to see: the complex structures in the nucleus were invisible to mankind until some time ago. Likewise, biological processes, encoded as pathways, cover an equal amount of complexity.

WikiPathways is a community-run pathway database, similar to KEGG, Reactome, and many others. One striking difference is the community approach of WikiPathways: anyone can work on or extend the content of the database. This is what makes WikiPathways exciting to me: it encodes very different bits of biological knowledge, and it is a key reason why I joined Chris Evelo's team almost four years ago. Importantly, this community is supported by a lively and reasonably sized (>10 people and growing) curation team, primarily located at Maastricht University and the Gladstone Institutes.

The newest paper in NAR (doi:10.1093/nar/gkv1024) outlines some recent developments and the growth of the database. There is still so much to do, and given the current speed at which we learn new biological patterns, this will not become less any time soon.

Want to help? Sign up, and add your ORCID! Need ideas for what you can do? Why not take a recent paper you published (or read), take a new biological insight from it, look up an appropriate pathway, and add that paper. If a published paper presents a novel pathway or an important new biological insight, why not convert the figure from that paper into a machine-readable pathway?
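Finding a pathway to update can also be done programmatically; here is a minimal R sketch against the WikiPathways web service (the findPathwaysByText method exists, but the exact JSON layout assumed below is my guess; see the web service documentation):

library(httr)
library(jsonlite)

# search pathways by free text
resp <- GET("https://webservice.wikipathways.org/findPathwaysByText",
  query=list(query="apoptosis", format="json"))
hits <- fromJSON(content(resp, as="text"))
# candidate pathways to look at; column names assumed
hits$result[, c("id", "name", "species")]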

Kutmon, M., Riutta, A., Nunes, N., Hanspers, K., Willighagen, E. L., Bohler, A., Mélius, J., Waagmeester, A., Sinha, S. R., Miller, R., Coort, S. L., Cirillo, E., Smeets, B., Evelo, C. T., Pico, A. R., Oct. 2015. WikiPathways: capturing the full diversity of pathway knowledge. Nucleic Acids Research.

RRegrs: exploring the space of possible regression models

Machine learning is a field of science that focuses on mathematically describing patterns in data. Chemometrics does this for chemical data. Examples are (nano)QSAR, where structural information is related to biological activity. During my PhD studies I looked at the interplay between statistics and machine learning on the one hand, and how you computationally (numerically) represent the question on the other. The right combination is not obvious, and it has become common to try various modelling methods; support vector machines (SVM/SVR) and, more recently, neural networks (deep learning) have become popular. A simpler model, however, has its benefits too, and is frequently not significantly worse than more complex models. That said, exploring all machine learning methods manually takes a lot of time, as each comes with its own parameters which need varying.

Georgia Tsiliki (NTUA partner in eNanoMapper), Cristian Munteanu (former postdoc in our group), and others developed RRegrs, an R package to explore the various models and automatically calculate a number of statistics that allow comparing them (doi:10.1186/s13321-015-0094-2). That said, following my thesis, you should never blindly rely on performance statistics, but the output of RRegrs may help you explore the full set of models.
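Running it is meant to be simple; here is a sketch of an invocation, where the CSV-driven parameter names (DataFileName, PathDataSet) are assumptions based on the package documentation, so do verify them:

library(RRegrs)

# dataset.csv: dependent variable in the first column, descriptors after;
# parameter names are assumptions, check ?RRegrs
results <- RRegrs(DataFileName="dataset.csv", PathDataSet="DataResults")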

Tsiliki, G., Munteanu, C. R., Seoane, J. A., Fernandez-Lozano, C., Sarimveis, H., Willighagen, E. L., Sep. 2015. RRegrs: an R package for computer-aided model selection with multiple regression models. Journal of Cheminformatics 7 (1), 46.

So, now you have SMILES that are faulty... visualize them?

So, you validated your list of SMILES in the paper you were planning to use (or about to submit), and you found a shortlist of SMILES strings that do not look right. Well, let's visualize them.

We all used to use the Daylight Depict tool, but this is no longer online. I previously blogged about using AMBIT for SMILES depiction (it uses various tools for depiction; doi:10.1186/1758-2946-3-18), but now John May has released a CDK-only tool, called CDK Depict. The download section offers a jar file and a war file for easy deployment in a Tomcat environment. But for the impatient, there is also this online host where you can give it a try (it may go offline at some point?).
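If your shortlist is long, you can also construct depiction URLs directly; a sketch in R, assuming a local deployment of the war file and the depict/bow/svg?smi=... URL pattern (verify the pattern against the instance you use):

# hypothetical base URL of your own CDK Depict deployment
baseURL <- "http://localhost:8080/cdkdepict/depict/bow/svg"
shortlist <- c("COC", "BrC(I)(F)Cl")
urls <- paste0(baseURL, "?smi=", URLencode(shortlist, reserved=TRUE))
urls  # open these in a browser to inspect the structures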

Just copy/paste your shortlist there, and visually see what is wrong with them :) Big HT to John for doing all these awesome things!

How to test SMILES strings in Supplementary Information

When you stumble upon a nice paper describing a new predictive or explanatory model for a property or a class of compounds that has your interest, the first thing you do is test the training data. For example, validating SMILES (or OpenSMILES) strings in such data files is now easy with the many Open Source tools that can parse SMILES strings: the Chemistry Toolkit Rosetta provides many pointers for parsing SMILES strings. I previously blogged about a CDK/Groovy approach.

Cheminformatics toolkits need to understand what the input is, in order to correctly calculate descriptors. So, let's start there. It does not matter so much which toolkit you use and I will use the Chemistry Development Kit (doi:10.1021/ci025584y) here to illustrate the approach.

Let's assume we have a tab-separated values file, with the compound identifier in the first column and the SMILES in the second column. That can easily be parsed in Groovy. For each SMILES we parse it and determine the CDK atom types. For validation of the supplementary information we only want to report the fails, but let's first show all atom types:

import org.openscience.cdk.smiles.SmilesParser;
import org.openscience.cdk.silent.SilentChemObjectBuilder;
import org.openscience.cdk.atomtype.CDKAtomTypeMatcher;

parser = new SmilesParser(SilentChemObjectBuilder.getInstance())
matcher = CDKAtomTypeMatcher.getInstance(SilentChemObjectBuilder.getInstance())

new File("suppinfo.tsv").eachLine { line ->
  fields = line.split(/\t/)
  id = fields[0]
  smiles = fields[1]
  if (smiles != "SMILES") { // skip the header line
    mol = parser.parseSmiles(smiles)
    println "$id -> $smiles";

    // check CDK atom types
    types = matcher.findMatchingAtomTypes(mol);
    types.each { type ->
      if (type == null) {
        println "  no CDK atom type"
      } else {
        println "  atom type: " + type.atomTypeName
      }
    }
  }
}

This gives output like:

mol1 -> COC
  atom type: C.sp3
  atom type: O.sp3
  atom type: C.sp3

If we rather only report the errors, we make some small modifications and do something like:

new File("suppinfo.tsv").eachLine { line ->
  fields = line.split(/\t/)
  id = fields[0]
  smiles = fields[1]
  if (smiles != "SMILES") {
    mol = parser.parseSmiles(smiles)
    errors = 0
    report = ""

    // check CDK atom types
    types = matcher.findMatchingAtomTypes(mol);
    types.each { type ->
      if (type == null) {
        errors += 1;
        report += "  no CDK atom type\n"

    // report
    if (errors > 0) {
      println "$id -> $smiles";
      print report;

Alternatively, you can use the InChI library to do such checking. And here too, we will use the CDK and the CDK-InChI integration (doi:10.1186/1758-2946-5-14).

import org.openscience.cdk.inchi.InChIGeneratorFactory;
import net.sf.jniinchi.INCHI_RET;

factory = InChIGeneratorFactory.getInstance();

new File("suppinfo.tsv").eachLine { line ->
  fields = line.split(/\t/)
  id = fields[0]
  smiles = fields[1]
  if (smiles != "SMILES") {
    mol = parser.parseSmiles(smiles)

    // report InChI generation warnings and errors
    generator = factory.getInChIGenerator(mol);
    if (generator.returnStatus != INCHI_RET.OKAY) {
      println "$id -> $smiles";
      println generator.message;
    }
  }
}

The advantage of doing this is that it will also give warnings about stereochemistry, like:

mol2 -> BrC(I)(F)Cl
  Omitted undefined stereo

I hope this gives you some ideas on what to do with content in supplementary information of QSAR papers. Of course, this works just as well for MDL molfiles. What kind of validation do you normally do?

Coding an OWL ontology in HTML5 and RDFa

There are many fancy tools to edit ontologies. I like simple editors, like nano. And like any hacker, I can hack OWL ontologies in nano. Calling it hacking implies OWL was never meant to be edited in a simple text editor; I am not sure that is really true. Anyway, HTML5 and RDFa will do fine, and here is a brief write-up. This post will not cover the basics of RDFa and does assume you already know how triples work. If not, read this RDFa primer first.

The BridgeDb DataSource Ontology
This example uses the BridgeDb DataSource Ontology, created by BridgeDb developers from Manchester University (Christian, Stian, and Alasdair). The ontology covers the description of data sources of identifiers, a technology outlined in the BridgeDb paper by Martijn (see below), as well as terms from the Open PHACTS specification on Dataset Descriptions for the Open Pharmacological Space by Alasdair et al.

I needed to put this online for Open PHACTS (BTW, the project won a big award!) because our previous solution did not work well enough anymore. You may want to see the HTML of the result first, and perhaps verify it really is HTML: here is the HTML5 validation report. Also, you may be interested in what the ontology looks like in RDF: here is the extracted RDF for the ontology. Now follow the HTML+RDFa snippets. First, the ontology details (actually, I have it split up):

<div about=""
<h1>The <span property="rdfs:label">BridgeDb DataSource Ontology</span>
(version <span property="owl:versionInfo">2.1.0</span>)</h1>
This page describes the BridgeDb ontology. Make sure to visit our
<a property="rdfs:seeAlso" href="">homepage</a> too!
<p about="">
The OWL ontology can be extracted
<a property="owl:versionIRI"
The Open PHACTS specification on
<a property="rdf:seeAlso"
>Dataset Descriptions</a> is also useful.

This is the last time I spell out the markup in this detail, but for a first time it is useful. The predicates work as follows: @about indicates a new resource is started, @typeof defines the rdf:type, and @property indicates all other predicates; literal values come from the element content, and object resources from the @href. If you work this out, you get this OWL code (more or less):

bridgedb: a owl:Ontology;
rdfs:label "BridgeDb DataSource Ontology"@en;
rdfs:seeAlso <>;
owl:versionInfo "2.1.0"@en .

An OWL class
Defining an OWL class uses the same approach: define the resource it is @about, define the @typeof, and give it properties. BTW, note that I added an @id so that ontology terms can be looked up using the HTML # functionality. For example:

<div id="DataSource"
<h3 property="rdfs:label">Data Source</h3>
<p property="dc:description">A resource that defines
identifiers for some biological entity, like a gene,
protein, or metabolite.</p>

An OWL object property
Defining an OWL object property is pretty much the same, but note that we can arbitrarily add additional things, making use of <span>, <div>, and <p> elements. The following example also defines the rdfs:domain and rdfs:range:

<div id="aboutOrganism"
<h3 property="rdfs:label">About Organism</h3>
<p><span property="dc:description">Organism for all entities
with identifiers from this datasource.</span>
This property has
<a property="rdfs:domain"
as domain and
<a property="rdfs:range"
as range.</p>

So, now anyone can host an OWL ontology with dereferenceable terms. To remove confusion, I have used the full URLs of the terms in the @about attributes.

Van Iersel, M. P., Pico, A. R., Kelder, T., Gao, J., Ho, I., Hanspers, K., Conklin, B. R., Evelo, C. T., Jan. 2010. The BridgeDb framework: standardized access to gene, protein and metabolite identifier mapping services. BMC Bioinformatics 11 (1), 5+.

#Altmetrics on CiteULike entries in R

I wanted to know when a set of publications I was aggregating on CiteULike was published; the number of publications per year, for example. A quick Google did not turn up an R client for the CiteULike API, and because I wanted to play with JSON in R anyway, I created the citeuliker package. Because I'm a liker of CiteULike (see these posts). Well, to me that makes sense.

citeuliker uses jsonlite, plyr, and curl (and testthat for testing). The first converts the JSON returned by the API to an R data structure. The package unfolds the "published" field, so that I can more easily plot things by year. I use this code for that:
    data[,"year"] <- laply(data[,"published"], function(x) {
      if (length(x) < 1) return(NA) else return(x[1])
    })
The laply() method comes from the plyr package. For example, if I want to see when the publications I collected in my CiteULike library were published, I type something like:
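    data <- citeuliker::getData(user="egonw")
    # sketch: any per-year count plot will do here
    plot(table(data[,"year"]), type="h", xlab="year", ylab="papers")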
That then looks like the plot in the top-right of this post. And, yes, I have a publication from 1777 in my library :) See the reference at the bottom of this page.

Getting all the DOIs from my library is trivial too now:
    data <- citeuliker::getData(user="egonw")
    doi <- as.vector(na.omit(data[,"doi"]))
I guess the as.vector() to remove attributes can be done more efficiently; suggestions welcome.

Now, this makes it really easy to aggregate #altmetrics, because the rOpenSci people provide the rAltmetric package, and I can simply do (continuing from the above):
    library(rAltmetric)
    acuna <- altmetrics(doi=dois[6]);
    acuna_data <- altmetric_data(acuna);

And then I get something like this:

Following the tutorial, I can easily get #altmetrics for all my DOIs, and plot a histogram of my Altmetric scores (make sure you have the plyr library loaded):
    raw_metrics <- lapply(dois, function(x) altmetrics(doi = x))
    metric_data <- ldply(raw_metrics, altmetric_data
    hist(metric_data$score, main="Altmetric scores", xlab="score")
That gives me the following distribution:

The percentile statistics are also useful to me. After all, there is a clear pressure to have impact with your research. Getting your research known is a first step there. That's why we submit abstracts for orals and posters too. Advertisement. Anyway, there is enough to be said about how useful #altmetrics are, and my main interest is in using them to see what people say about the work, but I don't have time now to do anything with that (it's about time for dinner and Dr. Who).

But, as a last plot, and happy that my online presence is useful for something, here is a plot of the percentile of my papers within their journal against the percentile within the full corpus:

      # sketch: the percentile column names below are assumptions; check
      # names(metric_data) for the exact ones returned by rAltmetric
      plot(metric_data$context.all.pct, metric_data$context.journal.pct,
        xlab="pct all", ylab="pct journal")
This is the result:

This figure shows that my social campaign puts many of my publications in the top 10%. That's a start. Of course, these numbers do not link one-to-one to citations, which many value more, even though citation counts do not reflect true impact well either. Sadly, scientists commonly ignore that the citation count also includes cito:disagreesWith and cito:citesAsAuthority citations.

Anyways... I think I need other R packages for getting citation counts from Google Scholar, Web of Science, and Scopus.

Scheele, C. W., 1777. Chemische Abhandlung von der Luft und dem Feuer.
Mietchen, D., Others, M., Anonymous, Hagedorn, G., Jan. 2015. Enabling open science: Wikidata for research.

Pimped website: HTML5, still with RDFa, restructuring and a slidebar!

My son did some HTML, CSS, JavaScript, and jQuery courses at Codecademy recently. Good for me: he pimped my personal website:

Of course, he used GitHub and pull requests (he had been using git for a few years already). His work:

  • fixed the columns to properly resize
  • added a section with my latest tweets
  • added menus for easier navigating the information
  • made sections fold and unfold (most are now folded by default)
  • added a slide bar, which I use to highlight some recent output
Myself, I upgraded the website to HTML5. It used to be XHTML, but it seems XHTML+RDFa is not really established yet; or, at least, there is no good validator. So, it's now HTML5+RDFa (validation report; currently one bug). Furthermore, I updated the content and gave the first few collaborators ORCID iDs, which are now linked as owl:sameAs in the RDF to the foaf:Person (RDF triples extracted from this page).