Adding chemical compounds to Wikidata

(write up in progress)

Adding chemical compounds to Wikidata is not difficult. You can store the chemical formula (P274), (canonical) SMILES (P233), InChIKey (P235) (and InChI (P234), of course), as well as various database identifiers (see what I wrote about that here). Wikidata also allows storing the provenance, and has predicates for that too.

So, to enter a new structure for a compound, you should enter the compound information into Wikidata. Of course, make sure to create the needed accounts, particularly one for Wikidata (create account) (not sure if the next steps need a more general Wikimedia account too).

Entering the research paper
Magnus Manske pointed me to this helper tool. If you have the DOI of the paper, it is easy to add a new paper. This is what the tool shows for doi:10.1128/AAC.01148-08 (but no longer when you try!):

You need permission to run this script; the tool will alert you about that and give instructions on how to get permission. After I clicked Open in QuickStatements, I got this output, showing me that an entry in Wikidata was created for this paper:

Later, I can use the new Q-code (Q22309806) as the source for statements about the compound (formula, etc).

Draw your compound and get an InChIKey
The next step is to draw a compound and get an InChIKey. This can be done with many tools, including Bioclipse. Rajarshi opted for alternatives:

Then check if the compound is not already in Wikidata. You can use this SPARQL query for that using the InChIKey of the compound (it's for acetic acid, so it will be found):

For convenience, here is the copy/pastable SPARQL:
    PREFIX wdt: <http://www.wikidata.org/prop/direct/>
    SELECT ?compound WHERE {
      ?compound wdt:P235 "QTBSBXVTEAMEQO-UHFFFAOYSA-N" .
    }
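The same lookup can be scripted. Below is a minimal Python sketch under a few assumptions: the endpoint URL is that of the public Wikidata Query Service (the text above does not name one), and the helper names `build_inchikey_query` and `find_compounds` are made up for this example. It checks the shape of the InChIKey before building the query, so a typo is caught before anything is sent.

```python
import json
import re
import urllib.parse
import urllib.request

# Assumed endpoint of the Wikidata Query Service; not given in the text above.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

# Standard InChIKey layout: 14 letters, 10 letters, 1 letter, hyphen separated.
INCHIKEY_PATTERN = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")


def build_inchikey_query(inchikey):
    """Return a SPARQL query that finds items with the given InChIKey (P235)."""
    if not INCHIKEY_PATTERN.match(inchikey):
        raise ValueError("not a well-formed InChIKey: %r" % inchikey)
    return (
        "PREFIX wdt: <http://www.wikidata.org/prop/direct/>\n"
        "SELECT ?compound WHERE {\n"
        '  ?compound wdt:P235 "%s" .\n'
        "}" % inchikey
    )


def find_compounds(inchikey, endpoint=WDQS_ENDPOINT):
    """Run the query and return the matching item URIs (needs network access)."""
    url = endpoint + "?" + urllib.parse.urlencode(
        {"query": build_inchikey_query(inchikey), "format": "json"}
    )
    request = urllib.request.Request(
        url, headers={"User-Agent": "inchikey-lookup-sketch/0.1"}
    )
    with urllib.request.urlopen(request) as response:
        results = json.load(response)
    return [row["compound"]["value"] for row in results["results"]["bindings"]]
```

An empty list from `find_compounds` would suggest the compound still needs to be added; for the acetic acid InChIKey used above, at least one item should come back.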
Entering the compound
If the compound is not already in Wikidata, it is time to add it. The minimal information you should provide is the following:
  • mark the new entry as 'instance of' (P) 'chemical compound' (Q)
  • the chemical formula and SMILES (use as reference the paper)
    • add the reference to the paper you entered above
  • add the InChIKey and/or InChI
The first step is to create a new Wikidata entry. The Create new item menu in the left side panel can be used, showing a page like this:

As a label you can use the name used in the paper for the compound, even if a code, and as description 'chemical compound' will do for now; it can be changed later.
    Feel free to add as much information about the compound as you can find. There are some chemically rich entries in Wikidata, such as that for acetic acid (Q47512).
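The minimal statements listed above could also be prepared as a QuickStatements batch, the same tool used for the paper. The Python sketch below rests on assumptions that are not in the text: it uses the tab-separated version 1 command syntax (CREATE, LAST, and S-prefixed source properties), it takes 'instance of' to be P31, 'chemical compound' to be Q11173, and 'stated in' to be P248, and the function name is hypothetical.

```python
def compound_quickstatements(label, formula, smiles, inchikey, paper_qid):
    """Build QuickStatements (version 1 syntax) lines for a new compound item."""
    def quoted(text):
        # String values are wrapped in double quotes in QuickStatements.
        return '"%s"' % text

    rows = [
        ["CREATE"],
        ["LAST", "Len", quoted(label)],                # English label
        ["LAST", "Den", quoted("chemical compound")],  # English description
        ["LAST", "P31", "Q11173"],                     # assumed: instance of chemical compound
        # Formula and SMILES cite the paper via 'stated in' (assumed S248).
        ["LAST", "P274", quoted(formula), "S248", paper_qid],
        ["LAST", "P233", quoted(smiles), "S248", paper_qid],
        ["LAST", "P235", quoted(inchikey)],
    ]
    return "\n".join("\t".join(row) for row in rows)


# Shape of a call only: acetic acid already exists in Wikidata (Q47512),
# so this particular batch should not actually be submitted.
batch = compound_quickstatements(
    "acetic acid", "C2H4O2", "CC(=O)O",
    "QTBSBXVTEAMEQO-UHFFFAOYSA-N", "Q22309806")
print(batch)
```

Generating the batch as text first leaves room to review every line before pasting it into the tool.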

    Publishing H2020 Proposals

    Figure from the RIO paper.
    Over a year ago Daniel Mietchen invited me to join writing an H2020 proposal around Open Science. Well, that combines two of my current worlds, so interesting indeed. But there was more: Daniel wanted to do the writing openly, and that was certainly new to me. But since I see piles of benefits in open science, this is sort of the next step. Not obvious, perhaps, but certainly a step I wanted to try.

    The proposal that resulted from this was "Enabling Open Science: Wikidata for Research (Wiki4R)", as said, led by Daniel Mietchen. It was drafted fully in the open, and we got a lot of feedback from people not involved in the anticipated consortium. Of course, we did not get it; you would have heard me about it earlier if we had.

    Part of the open writing is, of course, an open license, to ensure everyone who participates has equal IP on the proposal. (Some seem to forget that an Open Access license does not give away your IP; you're just licensing it!) The final proposal was posted on ZENODO (see below) just after submission. More recently, some weeks back, Daniel also submitted it to the Research Ideas and Outcomes journal (ISSN 2367-7163), which, of course, the Open license allows too! This is a new journal which covers not just the end product of some research (a research paper), but also other things, including project proposals (full reference below). Mind you, not everything in this "journal" is peer-reviewed pre-publication, and the proposal is indeed not reviewed. Post-review is most welcome, BTW! Just head off to PubPeer or Publons and start ranting about the proposal ;)

    Now, the journal seems to have blogged about this H2020 proposal publication (Daniel is involved in setting up the journal) and sent it out as a press release-like thing, which is actually being picked up by news outlets :) That's new to me too.

    All in all, it's an interesting experiment, and I am grateful to Daniel for having been able to be part of this. Writing H2020 proposals openly is a new phenomenon, and I cannot commit myself to use this approach for all my proposals, but I think I may do this more often in the future.

    Mietchen, D., Hagedorn, G., Willighagen, E., Rico, M., Gomez-Perez, A., Aibar, E., Rafes, K., Germain, C., Dunning, A., Pintscher, L., Kinzler, D., Anonymous, Jan. 2015. Enabling open science: Wikidata for research.
    Mietchen, D., Hagedorn, G., Willighagen, E., Rico, M., Gómez-Pérez, A., Aibar, E., Rafes, K., Germain, C., Dunning, A., Pintscher, L., Kinzler, D., Dec. 2015. Enabling open science: Wikidata for research (Wiki4R). Research Ideas and Outcomes 1, e7573+.

    ELIXIR is setting up a Tools and Data Services Registry

    ELIXIR is setting up a Tools and Data Services Registry. Recently, they organized a workshop in Amsterdam that I attended and where I learned how to add tools and services to their database. I played with the entry for WikiPathways, and one of the nice things is that it inherits from past European registry projects and allows the encoding of the input and output formats, for tools and services alike. Here's what it gives for WikiPathways now:

    The record editing facility is pretty straightforward and uses a number of tabs where you can add information.

    A summary:

    The publications:


    Where documentation is found:

    And information that is not really supplementary, such as the license terms:

    Here, the collections are of particular interest. During the meeting, a few people from the Dutch Techcenter for Life Sciences decided to use an ELIXIR-NL group for all Dutch services that benefit the full ELIXIR network. Furthermore, the BIGCAT-UM collection was set up to indicate all services by our research group, which may eventually serve as a folder towards supporting the Dutch ZonMW Enabling Technologies Hotels calls.

    Mind you, the registry can distinguish various services. The above entry is for the web interface, not for the web services. That entry in the registry is not that well populated yet, and that's for a reason. (Actually, more than one, one being that I did not create that entry and cannot change it).

    But the WikiPathways Webservices are nicely exposed via a Swagger configuration file. Moreover, the registry supports JSON too, for export and import. The format is pretty simple and we only need to create a Swagger 2.0 config file converter. I just need to find a bit of time to finish my draft implementation.
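To give an idea of what such a converter involves, here is a rough Python sketch. The input side follows the Swagger 2.0 layout (`info`, `host`, `basePath`, `paths`, `produces`). The output side is an assumption: the field names `name`, `base_url`, `functions`, and so on are hypothetical placeholders and not the actual registry schema, and this is not the draft implementation mentioned above.

```python
HTTP_METHODS = ("get", "put", "post", "delete", "options", "head", "patch")


def swagger_to_registry(swagger):
    """Summarise a Swagger 2.0 document as a registry-style record."""
    if swagger.get("swagger") != "2.0":
        raise ValueError("expected a Swagger 2.0 document")
    info = swagger.get("info", {})
    base_url = swagger.get("host", "") + swagger.get("basePath", "")
    functions = []
    for path, item in sorted(swagger.get("paths", {}).items()):
        for method in HTTP_METHODS:
            operation = item.get(method)
            if operation is None:
                continue
            functions.append({
                "operation": method.upper() + " " + path,
                "summary": operation.get("summary", ""),
                # Output formats fall back to the document-wide 'produces' list.
                "output_formats": operation.get(
                    "produces", swagger.get("produces", [])),
            })
    # Hypothetical output layout; the real registry JSON will differ.
    return {
        "name": info.get("title", ""),
        "description": info.get("description", ""),
        "version": info.get("version", ""),
        "base_url": base_url,
        "functions": functions,
    }
```

The returned dictionary can be serialised with `json.dumps` once the field names have been mapped onto whatever the registry actually expects.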

    Open Spectral Database

    Stuart Chalk wrote on the CHMINF-L mailing list about the Open Spectral Database (OSDB). This new database is more of an idea than something with critical mass yet. But the idea seems right: it has a CCZero waiver for the data, is Open Source, and has an API. The web interface looks good too:

    It supports various spectral types and maybe it can be seeded with data from one of the Massbank instances. That said, it does seem popular enough to already attract some spamming in the collections corner; that also means it needs curators who keep an eye on what enters. Perhaps registration via ORCID may be an option to fight spam, but I do not have experience with setting that up. Another feature request I can think of is links out to Wikidata, in addition to the existing three databases.

    Now I really have a good reason to dig out my past NMRShiftDB contributions and submit that here (see also these past blog posts about NMRShiftDB).

    Project "Chemical Safety Library", aka redistributable MSDS data

    Public Domain, Wikipedia.
    The Pistoia Alliance has an interesting project proposal:
      This project consists of two distinct phases:
      1. development of a collaborative system for sharing information about known laboratory hazard based on an existing system developed and implemented by a member of the Pistoia Alliance 
      2. working with chemical suppliers and publishers to define and adopt standards for hazard information making MSDS and handbooks more accessible and more easily used
    One reason this has never happened is that in certain jurisdictions (re)distributing such information introduces a liability for the person or organization doing that (re)distribution, I was told (IANAL). Looking forward to this project, whether it will be open, how they will handle redistribution (needed if you want to have it show up in ELNs), etc

    The quality of SMILES strings in Wikidata

    Russian Wikipedia on tungsten hexacarbonyl.
    One thing that machine readability adds is all sorts of machine processing. Validation of data consistency is one. For SMILES strings, one of the things you can do is test if the string parses at all. Wikidata is machine readable and, in fact, easier to parse than Wikipedia, for which the SMILES strings were validated recently in a J. Cheminformatics paper by Ertl et al. (doi:10.1186/s13321-015-0061-y).

    Because I was wondering about the quality of the SMILES strings (and because people ask me about these things), I made some time today to run a test:
    1. SPARQL for all SMILES strings
    2. process each one of them with the CDK SMILES parser
    I can do both easily in Bioclipse with an integrated script:

    identifier = "P233" // SMILES
    type = "smiles"

    // fetch every item that has a SMILES (P233) statement
    sparql = """
    PREFIX wdt: <http://www.wikidata.org/prop/direct/>
    SELECT ?compound ?smiles WHERE {
      ?compound wdt:P233 ?smiles .
    }
    """
    mappings = rdf.sparqlRemote("", sparql)

    outFilename = "/Wikidata/badWikidataSMILES.txt"
    if (ui.fileExists(outFilename)) ui.remove(outFilename)
    fileContent = ""
    for (i=1; i<=mappings.rowCount; i++) {
      try {
        wdID = mappings.get(i, "compound")
        smiles = mappings.get(i, "smiles")
        mol = cdk.fromSMILES(smiles) // throws if the CDK cannot parse it
      } catch (Throwable exception) {
        fileContent += (wdID + "," + smiles + ": " +
                       exception.message + "\n")
      }
      if (i % 1000 == 0) js.say("" + i) // progress report
    }
    ui.append(outFilename, fileContent)

    It turns out that out of the more than 16 thousand SMILES strings in Wikidata, only 42 could not be parsed. That does not mean the others are correct, but it does mean these 42 are wrong. Many of them turned out to be imported from the Russian Wikipedia, which is nice, as it gives me the opportunity to work in that Wikipedia instance too :)
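The same two-step check can be sketched outside Bioclipse. In the Python version below, the parser is pluggable: in real use `parse` would wrap the SMILES parser of a cheminformatics toolkit (this post uses the CDK). The `balanced_brackets` function is only a toy stand-in that catches unbalanced brackets, both function names are made up for this sketch, and the rows would come from the SPARQL query.

```python
def balanced_brackets(smiles):
    """Toy stand-in for a real SMILES parser: only checks () and [] nesting."""
    closers = {")": "(", "]": "["}
    stack = []
    for char in smiles:
        if char in "([":
            stack.append(char)
        elif char in closers:
            if not stack or stack.pop() != closers[char]:
                raise ValueError("unexpected '%s'" % char)
    if stack:
        raise ValueError("unclosed '%s'" % stack[-1])


def find_unparseable(rows, parse=balanced_brackets):
    """Return (item, smiles, message) for every SMILES that `parse` rejects."""
    failures = []
    for item, smiles in rows:
        try:
            parse(smiles)
        except Exception as error:
            failures.append((item, smiles, str(error)))
    return failures
```

As with the script above, a string that passes is not necessarily correct; the check only flags strings that are certainly broken.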

    At this moment, some 19 SMILES still need fixing (the list will change over time, so by the time you read this...):

    New Edition! Getting CAS registry numbers out of WikiData

    Source: Wikipedia. CC-BY-SA

    April this year I blogged about an important SPARQL query for many chemists: getting CAS registry numbers from Wikidata. This is relevant for two reasons:
    1. CAS works together with Wikimedia on a large, free CAS-to-structure database
    2. Wikidata is CCZero
    The original effort validated about eight thousand registry numbers, made available via Wikipedia and the Common Chemistry website. However, the effort did not stop there, and Wikipedia now contains many more CAS registry numbers. In fact, Wikidata picked up many of these and now lists almost twenty thousand CAS numbers. That well exceeds what databases are allowed to aggregate and make available.

    Since the post in April, Wikidata put online a new SPARQL end point and created "direct" property links. This way, you lose the provenance information, but the query becomes simpler:
      PREFIX wdt: <http://www.wikidata.org/prop/direct/>
      SELECT ?compound ?id WHERE {
        ?compound wdt:P231 ?id .
      }
    The other thing that changed since April is that others and I requested the creation of more compound identifiers, and here's an overview along with the current number of such identifiers in Wikidata:
    Clearly, some identifiers are not well populated yet. This is what bots are for, like those used by the Andrew Su team.
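Before a bot adds CAS registry numbers, a cheap sanity check is the check digit that every registry number carries as its last digit. The Python sketch below is my own illustration, not part of any of the bots mentioned, and the function name is made up. It only catches typos; it cannot tell whether a number belongs to the compound it is attached to.

```python
import re

# Two to seven digits, two digits, and one check digit, hyphen separated.
CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def is_valid_cas(cas):
    """Check the layout and check digit of a CAS registry number."""
    match = CAS_PATTERN.match(cas)
    if match is None:
        return False
    digits = match.group(1) + match.group(2)
    # Weight the digits 1, 2, 3, ... starting from the rightmost one.
    total = sum(weight * int(digit)
                for weight, digit in enumerate(reversed(digits), start=1))
    return total % 10 == int(match.group(3))
```

For acetic acid, 64-19-7, the weighted sum is 9 + 2 + 12 + 24 = 47, and 47 modulo 10 gives the check digit 7.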

    Because there is also a predicate for SMILES, we can also create a query that puts the CAS registry number alongside the SMILES (or any other identifier):
      PREFIX wdt: <http://www.wikidata.org/prop/direct/>
      SELECT ?compound ?id ?smiles WHERE {
        ?compound wdt:P231 ?id ;
                  wdt:P233 ?smiles .
      }
    Of course, then the question is, are these SMILES strings valid... And, importantly, this is nothing compared to the number of chemical compounds we know about, which currently is in the order of 100 million, of which a quarter can be readily purchased:

    Willighagen, E., 2015. Getting CAS registry numbers out of WikiData. The Winnower.

    Using the WikiPathways API in R

    Colored pathways created with the new R package.
    Earlier this week there was a question on the WikiPathways mailing list about the webservices. There are older SOAP webservices and newer REST-like webservices, which come with this nice Swagger webfront set up by Nuno. Of course, both approaches are pretty standard and you can use them from basically any environment. Still, some people prefer to not see technical issues: "why should I know how a car engine works". I do not think any scholar is allowed to use this argument, but alas...

    Of course, hiding those details is not so much of an issue, and since I have made so many R packages in the past, I decided to pick up the request to create an R package for WikiPathways: rWikiPathways. It is not feature complete yet, and not extensively tested in daily use yet (there is a test suite). But here are some code examples. Listing pathways and organisms in the wiki is done with:
      organisms = listOrganisms()
      pathways = listPathways()
      humanPathways = listPathways(organism="Homo sapiens")
    For the technology-oriented users, you have access to the GPML source file of each pathway:
      gpml = getPathway(pathway="WP4")
      gpml = getPathway(pathway="WP4", revision=83654)
    However, most use will likely be via database identifiers for genes, proteins, and metabolites, called Xrefs (also check out the R package for BridgeDb):
      xrefs = getXrefList(pathway="WP2338", systemCode="S")
      pathways = findPathwaysByXref("HMDB00001", "Ch")
      pathways = findPathwaysByXref(identifier="HMDB00001", systemCode="Ch")
      pathways = findPathwaysByXref(
        identifier=c("HMDB00001", "HMDB00002"),
        systemCode=c("Ch", "Ch")
      )
    Of course, these are just the basics, and the question was about colored pathways. The SOAP code was a bit more elaborate, and this is the version with this new package (the result at the top of this post):
      svg = getColoredPathway(pathway="WP1842", graphId=c("dd68a","a2c17"),
      color=c("FF0000", "00FF00"));
      writeLines(svg, "pathway.svg")
    If you use this package in your research, please cite the below WikiPathways paper. If you have feature requests, please post them in the issue tracker.

    Kutmon, M., Riutta, A., Nunes, N., Hanspers, K., Willighagen, E. L., Bohler, A., Mélius, J., Waagmeester, A., Sinha, S. R., Miller, R., Coort, S. L., Cirillo, E., Smeets, B., Evelo, C. T., Pico, A. R., Oct. 2015. WikiPathways: capturing the full diversity of pathway knowledge. Nucleic Acids Research.

    SWAT4LS in Cambridge

    Wordle of the #swat4ls tweets.
    Last week the BiGCaT team was present with three people (Linda, Ryan, and me) at the Semantic Web Applications and Tools 4 Life Sciences meeting in Cambridge (#swat4ls). It's a great meeting, particularly because of the workshops and hackathon. Previously, I attended the meeting in Amsterdam (gave this presentation) and Paris (which I apparently did not blog about).

    I have mixed feelings about missing half of the workshops on Monday for a visit of one of our Open PHACTS partners, but do not regret that meeting at all; I just wish I could have done both. During the visit we spoke particularly about WikiPathways and our collaboration in this area.

    The Monday morning workshops were cool. First, Evan Bolton and Gang Fu gave an overview of their PubChemRDF work. I have been involved in that in the past, and I greatly enjoyed seeing the progress they have made, and a rich overview of the 250GB of data they make available on their FTP site (sorry, the rights info has not gotten any clearer over the years, but it is generally considered "open"). The RDF now covers, for example, the biosystems module too, so that I can query PubChem for all compounds in WikiPathways (and compare that against internal efforts).

    The second workshop I attended was by Andra and others about Wikidata. The room, about 50 people, all started editing Wikidata, in exchange for a chocolate letter:

    The editing was about the prevalence of two diseases. Both topics continued during the hackathon, see below. Slides of this presentation are online. But I missed the DisGeNET workshop, unfortunately :(

    The conference itself (in the new part of Clare College, even the conference dinner) started on the second day, and all presentations are backed by a paper, linked from the program. Not having attended a semantic web conference in the past 2~ish years, it was nice to see the progress in the field. Some papers I found interesting:
    But the rest is most worthwhile checking out too! Webulous I was able to get going with some help (I was not paying enough attention to the GUI) for eNanoMapper:

    A Google Spreadsheet where I restricted the content of a set of cells to only subclasses of the "nanomaterial" class in the eNanoMapper ontology (see doi:10.1186/s13326-015-0005-5).
    The conference ended with a panel discussion, and despite the efforts of me and the other panel members (Frank Gibson – Royal Society of Chemistry, Harold Solbrig – Mayo Clinic, Jun Zhao – University of Oxford), it took long before the conference audience really started joining in. Partly this was because the conference organization asked the community for questions, and the questions clearly did not resonate with the audience. It was not until we started discussing publishing that it became more lively. My point there was that I believe the semantic web applications and tools are not really a rate limiting factor anymore, and if we really want to make a difference, we really must start changing the publishing industry. This has been said by me and others for many years already, but the pace at which things change is too low. Someone mentioned a chicken-and-egg situation, but I really believe it is all just a choice we make and an easy solution: pick up a knife, kill the chicken, and have a nice dinner. It is annoying to see all the great efforts at this conference, but much of it limited because our writing style makes nice stories and yields few machine readable facts.

    The hackathon was held at the EBI in Hinxton (south/elixir building) and during the meeting I had a hard time deciding what to hack on: there just were too many interesting technologies to work on, but I ended up working on PubChem/HDT (long) and Wikidata (short). The timings are based on the amount of help I needed to bootstrap things and how much I can figure out at home (which is a lot for Wikidata).

    HDT (header, dictionary, triples) is a not-so-new-but-under-the-radar technology for storing triples in binary form in a file-based store. The specification outlines this binary format as well as the index. That means that you can share triple data compressed and indexed. That opens up new possibilities. One thing I am interested in is using this approach for sharing link sets (doi:10.1007/978-3-319-11964-9_7) for BridgeDb, our identifier mapping platform. But there is much more, of course: share life science databases on your laptop.
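To make the dictionary-plus-triples idea concrete, here is a toy Python sketch. It is only an illustration of the principle and not the real HDT binary layout, which is considerably more elaborate; the class and method names are made up. Every distinct term is stored once in a dictionary and replaced by a small integer, and the triples are kept as sorted integer tuples, so a lookup by subject is a binary search instead of a scan.

```python
from bisect import bisect_left


class TinyTripleStore:
    """Toy dictionary-plus-triples store; not the real HDT binary layout."""

    def __init__(self, triples):
        # Dictionary: every distinct term gets a small integer ID.
        terms = sorted({term for triple in triples for term in triple})
        self.terms = terms
        self.ids = {term: index for index, term in enumerate(terms)}
        # Triples: stored as sorted ID tuples, so lookups by subject can bisect.
        self.triples = sorted(
            tuple(self.ids[term] for term in triple) for triple in set(triples))

    def by_subject(self, subject):
        """Return all (subject, predicate, object) triples for one subject."""
        subject_id = self.ids.get(subject)
        if subject_id is None:
            return []
        # (subject_id,) sorts before every triple that starts with subject_id.
        start = bisect_left(self.triples, (subject_id,))
        matches = []
        for s, p, o in self.triples[start:]:
            if s != subject_id:
                break
            matches.append((self.terms[s], self.terms[p], self.terms[o]))
        return matches
```

Long IRIs that occur in many triples are stored only once, which is where much of the space saving in such a layout comes from.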

    This hack was triggered by a beer with Evan Bolton and Arto Bendiken. Yes, there is a Java library, hdt-java, and for me the easiest way to work out how to use a Java API is to write a Bioclipse plugin. Writing the plugin is trivial (though setting up a Bioclipse development environment is less so): the New Wizard does the hard work in seconds. But then the dependency hacking started. The Jena version it depended on is incompatible with the version in Bioclipse right now, but that is not a big deal for Eclipse, and the outcome is that we have both versions on the classpath :) That, however, did require me to introduce a new plugin, net.bioclipse.rdf.core with the IRDFStore, something I wanted to do for a long time, because that is also needed if one wants to use Sesame/OpenRDF instead of Jena.

    So, after lunch I was done with the code cleanup, and I got to the HDT manager again. Soon, I could open a HDT file. I first had the API method to read it into memory, but that's not what I wanted, because I want to open large HDT files. Because it uses Jena, it conveniently provides a Jena Model object, so adding SPARQL-ing support was easy; I cannot use the old SPARQL-ing code, because then I would start mixing Jena versions, but since all is Open Source, I just copied/pasted the code (which is written by me in the first place, doi:10.1186/2041-1480-2-s1-s6, interestingly, work that originates from my previous SWAT4LS talk :). Then, I could do this:
    It is file based, which is different from a full triple store server. So, questions arise about performance. Creating an index takes time and memory (1GB of heap space, for example). However, the index file can be shared (downloaded) and then an HDT file "opens" in a second in Bioclipse. Of course, the opening does not do anything special, like loading into memory, and should be compared to connecting to a relational database. The querying is what takes the time. Here are some numbers for the Wiktionary data that the RDFHDT team provides as example data set:
    However, I am not entirely sure what to compare this against. I will have to experiment with, for example, ChEMBL-RDF (maybe update the Uppsala version, see doi:10.1186/1758-2946-5-23). The advantage would be that ChEMBL data could easily be distributed along with Bioclipse to service the decision support features, because the typical query asks for data for a particular compound, not for all compounds. If that works in less than 0.1 seconds, then this may give a nice user experience.

    But before I reach that, it needs a bit more hacking:
    1. take the approach I took with BridgeDb mapping databases for sharing HDT files (which has the advantage that you get a decent automatic updating system, etc)
    2. ensure I can query over more than one HDT file
    And probably a bit more.

    Wikidata and WikiPathways
    After the coffee break I joined the Wikidata people, and sat down to learn about the bots. However, Andra wanted to finish something else first, where I could help out. Considering I can probably manage to hack up a bot anyway, we worked on the following. Multiple databases about genes, proteins, and metabolites like to link these biological entities to pathways in WikiPathways (doi:10.1093/nar/gkv1024). Of course, we love to collaborate with all the projects that integrate WikiPathways into their systems, but I personally would rather use a solution that services all needs. If only because then people can do this integration without needing our time. Of course, this is an idea we pitched about a year ago in the Enabling Open Science: WikiData for Research proposal (doi:10.5281/zenodo.13906).

    That is, would it not be nice if people could just pull the links between the biological entities and WikiPathways from Wikidata, using one of the many APIs they have (SPARQL, REST), supporting multiple formats (XML, JSON, RDF)? I think so, as you might have guessed. So does Andra, and he asked me if I could start the discussions in the Wikidata community, which I happily did. I'm not sure about the outcome, because having links like these is not of prime interest to them. They did not like the idea of links to the Crystallography Open Database much yet, with the argument that it is a one-to-many relation, though this is exactly what the PDB identifier is too, and that is accepted. So, it's a matter of notability again. But this is what the current proposal looks like:

    Let's see how the discussion unfolds. Please feel free to chime in and show your support, comments, questions, or opposition, so that we can together get this right.

    Chemistry Development Kit
    There is undoubtedly a lot more, but I have been summarizing the meeting for about three hours now, getting notes together etc. A last thing I want to mention now is the CDK. Cheminformatics is, after all, a critical feature of life science data, and I spoke with a few people about the CDK. And I visited NextMove Software on Friday, where John May works nowadays, who did a lot of work on the CDK recently (we also spoke about WikiPathways and eNanoMapper). NextMove is doing great stuff (thanks for the invitation), and so did John during his PhD in Chris Steinbeck's group at the EBI. But during the conference I also spoke with others about the CDK, and I am following up on these conversations.

    Databasing nanomaterials: substance APIs

    Cell uptake of gold nanoparticles in human cells. Source. CC-BY 4.0
    Nanomaterials are quite interesting from a science perspective: first, they are materials and not so well-defined as such. They can best be described as a distribution of similar nanoparticles. That is unlike small compounds, which we commonly describe as pure materials. Nanomaterials have a size distribution, surface differences, etc. But akin to the QSAR paradigm, because they are similar enough, we can expect similar interaction effects, and thus treat them as the same. A nanomaterial is basically a large collection of similar nanoparticles.

    Until they start interacting, of course. Cell membrane penetration is studied at the single nanoparticle level, and they make interesting pictures of that (see top left). Or when we do computation: then too, we typically study a single material. On the other hand, many nanosafety studies work with the materials at a certain dosage. They study cell death, transcriptional changes, etc., when the material is brought into contact with some biosample.

    The synthesis is equally interesting. Because of the nature of many manufacturing processes (and the literature synthesizing new materials is enormous), it is typically not well understood what the nanomaterial or even nanoparticle looks like. This is overcome by studying the bulk properties and reporting some physicochemical properties, like the size distribution, with methods like DLS and TEM. The field just lacks the equivalent of what NMR is for (small) (organic) compounds.

    Now, try capturing this in a unified database. That's exactly what eNanoMapper is doing. And with a modern approach. It's a database project, not a website project. We develop APIs and test all aspects of the database extensively using test data. Of course, using the API we can easily create websites (there currently are JavaScript and R client libraries), and we have done so. It's great to be working with so many great domain specialists who get things done!

    There is a lot to write and discuss about this, but I end now by just pointing you to our recent paper outlining much of the cheminformatics of this new nanosafety database solution.

    Of course, in our group we study the nanosafety and nanoresponse (think nanomedicine) at a systems biology level. So, here's the obligatory screenshot of work of one of our interns (Stan van Roij). Not fully integrated with the database yet, though.

    Jeliazkova, N., Chomenidis, C., Doganis, P., Fadeel, B., Grafström, R., Hardy, B., Hastings, J., Hegi, M., Jeliazkov, V., Kochev, N., Kohonen, P., Munteanu, C. R., Sarimveis, H., Smeets, B., Sopasakis, P., Tsiliki, G., Vorgrimmler, D., Willighagen, E., Jul. 2015. The eNanoMapper database for nanomaterial safety information. Beilstein Journal of Nanotechnology 6, 1609-1634.