Coding an OWL ontology in HTML5 and RDFa

There are many fancy tools to edit ontologies. I like simple editors, like nano, and like any hacker, I can hack OWL ontologies in nano. That may suggest OWL was never meant to be edited in a plain text editor; I am not sure that is really true. Anyway, HTML5 and RDFa will do fine, and here is a brief write-up. This post will not cover the basics of RDFa and assumes you already know how triples work. If not, read this RDFa primer first.

The BridgeDb DataSource Ontology
This example uses the BridgeDb DataSource Ontology, created by BridgeDb developers from Manchester University (Christian, Stian, and Alasdair). The ontology describes data sources of identifiers, a technology outlined in the BridgeDb paper by Martijn (see below), as well as terms from the Open PHACTS Dataset Descriptions for the Open Pharmacological Space by Alasdair et al.

I needed to put this online for Open PHACTS (BTW, the project won a big award!) because our previous solution did not work well enough anymore. You may want to see the HTML of the result first, or verify that it really is HTML: here is the HTML5 validation report. You may also be interested in what the ontology looks like in RDF: here is the extracted RDF for the ontology. Now follow the HTML+RDFa snippets. First, the ontology details (I actually have it split up):

<div about="" typeof="owl:Ontology">
<h1>The <span property="rdfs:label">BridgeDb DataSource Ontology</span>
(version <span property="owl:versionInfo">2.1.0</span>)</h1>
This page describes the BridgeDb ontology. Make sure to visit our
<a property="rdfs:seeAlso" href="">homepage</a> too!
<p about="">
The OWL ontology can be extracted:
<a property="owl:versionIRI"
   href="...">the RDF for version 2.1.0</a>.
The Open PHACTS specification on
<a property="rdfs:seeAlso"
   href="...">Dataset Descriptions</a> is also useful.
</p>
</div>

In the original post this markup is color coded; here it suffices to know the roles of the attributes: @about indicates that a new resource is started, @typeof defines the rdf:type, and @property carries all other predicates. Literals and object resources are the text content and the link targets, respectively. If you work this out, you get this OWL code (more or less):

bridgedb: a owl:Ontology;
rdfs:label "BridgeDb DataSource Ontology"@en;
rdfs:seeAlso <>;
owl:versionInfo "2.1.0"@en .

An OWL class
Defining OWL classes uses the same approach: define the resource with @about, define the @typeof, and give it properties. BTW, note that I added an @id so that ontology terms can be looked up using the HTML # (fragment) functionality. For example:

<div id="DataSource" about="..."
     typeof="owl:Class">
<h3 property="rdfs:label">Data Source</h3>
<p property="dc:description">A resource that defines
identifiers for some biological entity, like a gene,
protein, or metabolite.</p>
</div>
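Worked out the same way as the ontology header, this snippet yields roughly the following triples (the full term URI is elided, as in the markup above; the owl:Class type is an assumption matching the section heading):

```turtle
<...#DataSource> a owl:Class ;
  rdfs:label "Data Source"@en ;
  dc:description "A resource that defines identifiers for some biological entity, like a gene, protein, or metabolite."@en .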

An OWL object property
Defining an OWL object property is pretty much the same, but note that we can arbitrarily add additional content, making use of <span>, <div>, and <p> elements. The following example also defines the rdfs:domain and rdfs:range:

<div id="aboutOrganism" about="..."
     typeof="owl:ObjectProperty">
<h3 property="rdfs:label">About Organism</h3>
<p><span property="dc:description">Organism for all entities
with identifiers from this datasource.</span>
This property has
<a property="rdfs:domain"
   href="...">...</a>
as domain and
<a property="rdfs:range"
   href="...">...</a>
as range.</p>
</div>
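And the triples extracted for this property look roughly like this (term URIs elided; the owl:ObjectProperty type is an assumption, matching the heading):

```turtle
<...#aboutOrganism> a owl:ObjectProperty ;
  rdfs:label "About Organism"@en ;
  dc:description "Organism for all entities with identifiers from this datasource."@en ;
  rdfs:domain <...> ;
  rdfs:range <...> .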

So, now anyone can host an OWL ontology with dereferenceable terms: to avoid confusion, I have used the full URLs of the terms in the @about attributes.

 Van Iersel, M. P., Pico, A. R., Kelder, T., Gao, J., Ho, I., Hanspers, K., Conklin, B. R., Evelo, C. T., Jan. 2010. The BridgeDb framework: standardized access to gene, protein and metabolite identifier mapping services. BMC Bioinformatics 11 (1), 5+.

#Altmetrics on CiteULike entries in R

I wanted to know when a set of publications I was aggregating on CiteULike was published: the number of publications per year, for example. A quick Google search did not turn up an R client package for the CiteULike API, and because I wanted to play with JSON in R anyway, I created a citeuliker package. Because I'm a liker of CiteULike (see these posts). Well, to me that makes sense.

citeuliker uses jsonlite, plyr, and curl (and testthat for testing). The first converts the JSON returned by the API into an R data structure. The package unfolds the "published" field, so that I can more easily plot things by year. I use this code for that:
    data[,"year"] <- laply(data[,"published"], function(x) {
      if (length(x) < 1) return(NA) else return(x[1])
    })
The laply() function comes from the plyr package. With this year column in place, I can plot when the publications I collected in my CiteULike library were published.
That then looks like the plot in the top-right of this post. And, yes, I have a publication from 1777 in my library :) See the reference at the bottom of this page.
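A minimal sketch of such a per-year plot, using a mock data frame in place of the citeuliker::getData() output (the real call would be data <- citeuliker::getData(user="egonw")):

```r
# mock of the data frame returned by citeuliker::getData() (assumption)
data <- data.frame(year = c(1777, 2010, 2010, 2012))
counts <- table(data$year)  # publications per year
barplot(counts, main = "Publications per year", xlab = "year", ylab = "count")
```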

Getting all the DOIs from my library is trivial too now:
    data <- citeuliker::getData(user="egonw")
    dois <- as.vector(na.omit(data[,"doi"]))
I guess the as.vector() to remove attributes can be done more efficiently; suggestions welcome.

Now, this makes it really easy to aggregate #altmetrics, because the rOpenSci people provide the rAltmetric package, and I can simply do (continuing from the above):
    library(rAltmetric)
    acuna <- altmetrics(doi=dois[6])
    acuna_data <- altmetric_data(acuna)

And then I get something like this:

Following the tutorial, I can easily get #altmetrics for all my DOIs, and plot a histogram of my Altmetric scores (make sure you have the plyr library loaded):
    raw_metrics <- lapply(dois, function(x) altmetrics(doi = x))
    metric_data <- ldply(raw_metrics, altmetric_data)
    hist(metric_data$score, main="Altmetric scores", xlab="score")
That gives me the following distribution:

The percentile statistics are also useful to me. After all, there is clear pressure to have impact with your research, and getting your research known is a first step; that's why we submit abstracts for orals and posters too. Advertisement. Anyway, there is enough to be said about how useful #altmetrics are; my main interest is in using them to see what people say about papers, but I don't have time now to do anything with that (it's about time for dinner and Doctor Who).

But, as a last plot, and happy that my online presence is useful for something, here is a plot of the percentile of my papers within the journal they were published in and within the full corpus:
      xlab="pct all", ylab="pct journal"
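The full call was along these lines (a sketch with mock data; the percentile column names returned by altmetric_data() are assumptions here, so check names(metric_data) for the real ones):

```r
# mock percentile columns; the real column names are assumptions
metric_data <- data.frame(pct_all = c(55, 78, 97), pct_journal = c(60, 81, 99))
plot(metric_data$pct_all, metric_data$pct_journal,
     xlab="pct all", ylab="pct journal")
```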
This is the result:

This figure shows that my social campaign puts many of my publications in the top 10. That's a start. Of course, these percentiles do not map one-to-one to citations, which many value more, even though citation counts do not reflect true impact well either. Sadly, scientists commonly ignore that a citation count also includes cito:disagreesWith and cito:citesAsAuthority citations.

Anyways... I think I need other R packages for getting citation counts from Google Scholar, Web of Science, and Scopus.

Scheele, C. W., 1777. Chemische Abhandlung von der Luft und dem Feuer.
Mietchen, D., Others, M., Anonymous, Hagedorn, G., Jan. 2015. Enabling open science: Wikidata for research.

Pimped website: HTML5, still with RDFa, restructuring and a slidebar!

My son did some HTML, CSS, JavaScript, and jQuery courses at Codecademy recently. Good for me: he pimped my personal website:

Of course, he used GitHub and pull requests (he had been using git for a few years already). His work:

  • fixed the columns to properly resize
  • added a section with my latest tweets
  • added menus for easier navigating the information
  • made sections fold and unfold (most are now folded by default)
  • added a slide bar, which I use to highlight some recent output
Myself, I upgraded the website to HTML5. It used to be XHTML, but it seems XHTML+RDFa is not really established yet; or, at least, there is no good validator. So, it's now HTML5+RDFa (validation report; currently one bug). Furthermore, I updated the content and gave the first few collaborators ORCID iDs, which are now linked with owl:sameAs in the RDF to the foaf:Person (RDF triples extracted from this page).
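In RDFa, such an owl:sameAs link can look like this (a hypothetical snippet, not the site's actual markup; the ORCID iD is a placeholder):

```html
<span about="#collaborator" typeof="foaf:Person">
  <a property="owl:sameAs"
     href="">ORCID profile</a>
</span>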

Linking papers to databases to papers: PubMed Commons and Ferret

I argued earlier this year (doi:10.5281/zenodo.17892) in the Journal of Brief Ideas that measuring reuse of data and/or results in databases is a good measure of the impact of that research. Who knows, it may even beat the citation count, which does not measure quality or correctness of data (e.g. you may cite a paper because you disagree with its content; I have long advocated, and still advocate, the Citation Typing Ontology).

But making the link between databases and papers does not only benefit measuring reuse; it is also just critical for doing research. Without clear links, finding answers is hard. I experience that myself frequently, and so do others, like Christopher Southan, and it puzzles me that so few people worry about this. Of course, databases do a good part of the linking, but unless they expose an API (still rare, but upcoming), it is hard to use these links. PubMed Commons can be used to link to machine-readable versions of data in a paper. See, for example, these four comments by me.

Better is when the database provides an API, and that is what Ferret uses. I have no idea where this project is going; it does not seem to be Open Source, and I am not entirely sure how they implemented the history, but the idea is interesting. It is not novel, as UtopiaDocs does a similar thing. The difference is that Ferret is not a PDF reader but works directly in your Chrome browser. That makes it more powerful, but also more scary, which is why it is critical that they send a clear message about any involvement of Ferret servers, or whether everything is done locally (otherwise they can forget about (pharma) company uptake, and they'd have a hard time restoring trust). That said, their privacy policy document is already quite informative!

Last week, I asked them about their tool and whether it was hard to add databases, as that is one thing Ferret does: if you open it for a paper, it shows the databases that cite that paper (and thus likely contain information or data from that paper, e.g. supplementary information). Here's an example:

This screenshot shows the results for a nanotoxicity paper, and we see it picked up "titanium oxide" (accurately picking up actual nanomaterials or nanoparticles is an unsolved text-mining issue). We get some impact statistics, but if you have read my blog and my brief idea about capturing reuse, you know I think they got "impact" wrong. Anyway, they do have a knowledge graph section, which has the paper-database links, and Ferret found this paper cited in UniProt.

Thus, when I asked them whether it would be hard to add new databases to that section, mentioning Open PHACTS and WikiPathways, they replied quickly. In fact, within hours they told me they had found the WikiPathways SPARQL endpoint that Andra started, which they find easier to use than the WikiPathways web services :)  They asked me for a webpage to point users to, and while I was thinking about that, they found another WikiPathways trick I did not know about: you can browse for WP2371 OR WP2059. Tina then replied that, given a PubMed ID, there was an even nicer way: just browse for all pathways with a particular PubMed ID.
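Against that SPARQL endpoint, a query along these lines could list pathways for a given PubMed record (a sketch only: the predicate for literature references is an assumption, so check the WikiPathways RDF schema; the PubMed ID is a placeholder):

```sparql
PREFIX wp:      <>
PREFIX dcterms: <>
SELECT ?pathway WHERE {
  ?pathway a wp:Pathway ;
           dcterms:references <> .
}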

Well, a bit later, they released Ferret 0.4.2 with WikiPathways support. The screenshot below shows the output for a paper (doi:10.2174/1389200214666131118234138) by Rianne (who did internships in our group, and now does her PhD in toxicology):

The Ferret infobar shows seventeen WikiPathways pathways linked to this paper, which happens to be the collection that Rianne made during her internship leading to this paper, and uploaded to WikiPathways some months ago. Earlier this year we sat down with her, Freddie, and Linda to make them more machine readable. This is what this list looks like in the browse functionality:

Ferret version 0.4.2 did not work for me, but they fixed the issue within a day, and the above screenshot was made with version 0.4.3. So, besides being a bunch of good hackers, they also seem to listen to their customers. So, what databases do you feel they should add? Leave a comment here, or tweet them at @getferret (pls cc me).

Willighagen, E., Capturing reuse in altmetrics. J. Brief Ideas. May 2015. URL
Fijten, R. R. R., Jennen, D. G. J., Delft, Dec. 2013. Pathways for ligand activated nuclear receptors to unravel the genomic responses induced by hepatotoxicants. Current Drug Metabolism, 1022-1028.

Journal of Brief Ideas: an excellent idea!

Journals, in the past, published what researchers wanted to talk about; that is what dissemination is about, of course. Like everything, over time the process has become more restricted and more bureaucratic. All for quality, of course. To accommodate and formalize the diversity of scientific communication, many journals have different article types: Letters to the Editor, Brief Communications, etc. Posting a brief idea, however, is for many journals not of enough interest.

Hence, a niche for the Journal of Brief Ideas. It's a project in beta and may never find sustainability, but it is worth a try:

I can see why this may work:
  • it teamed up with ZENODO to provide DOIs
  • you log in with your ORCID
  • it is Open Access (CC-BY)
  • it fills a niche: ideas you will never test can still see the light of day (so, this journal will contribute to more efficient scholarly communication)
I can also see why it may not work:
  • it is too easy to post an idea, leading to too much noise
  • it will not be indexed and therefore not fulfill a key requirement for many scientists (WoS, etc)
  • you cannot add references like with papers
I can also see some features I would love to see:
  • bookmarking buttons for CiteULike, Mendeley, etc
  • #altmetrics output on this site
  • provide #altmetrics from this site (view statistics, etc)
  • integrate with peer review sites (for post-publication peer review)
  • allow annotation of entities in papers (like PDB, gene, protein codes, metabolite identifiers, etc; and whatever else for other scholarly domains)
Things I am not sure about:
  • allow a single ToC-like graphic (as it would give papers more coverage and more impact)
Anyway, what it needs now is momentum. It also needs a business model, even if the turnover can be kept low by good technology choices. I am looking forward to seeing where the team is going, and how the community will pick up this idea. (For example, although I know that some ideas are tweeted, I haven't found a donut for one of the idea DOIs yet.)

For my readers: please give it a try. You know you have that idea you would like to get some feedback on, but you know you will not have funding for it, and it does not really match your general research plans. It would be a shame to let that idea rot on the shelf. Get it out, get cited!

I tried it too; see below my brief idea as found on ZENODO (where they automatically get deposited). My experiences are a bit mixed: I like the idea, but it takes getting used to. The number of words is limited, and I really find it awkward not to cite prior art, the things I built on. The above points reflect a good deal of my reservations.

Internet-aided serendipity in science (was: How the Internet can help chemists with serendipity)

The ACS Central Science RSS feed in Feedly.
Finding new or useful knowledge to solve your scientific problem, question, etc., is key to research. It also struck me as a university student (mid-nineties) as badly organized. In fact, technologically there was no issue, so why were scientists not using these technologies?? This question is still relevant, and readers of this blog know this is a toy research area to me: I have previously experimented with a lot of technologies to see how they can support research and, well, basically, serendipity. Hence, internet-aided serendipity.

This happened to be the topic of an article by Prof. Bertozzi (@CarolynBertozzi), editor-in-chief of the gold Open Access ACS Central Science: How the Internet can help chemists with serendipity, part of the journal's website. I left a comment, which is currently awaiting moderation, but to keep the discussion on Twitter going, here is what I left (the comment on the article may turn out to have lost the formatting still present here):
    Dear Prof Bertozzi,

    the browsing of TOCs is not a lost art, and neither has the Internet solved everything. While I fully agree that Twitter and other social media have filled a niche in finding interesting literature, they are basically a kind of majority vote and do not really find you the papers relevant to your research. This extends, of course, to #altmetrics, which capture the attention on social media and allow creating TOCs on the fly, as do (good) paper bookmarking services like CiteULike. Similarly, people developed tools to find science in blog posts, like the no longer existing, continued/forked as Chemical blogspace (though consider that this code has not been updated in the past 2-3 years). So, creating cross-journal TOCs is still a daily habit for many of us. (BTW, will ACS Central Science fully adopt #altmetrics, both as a data provider and by showing #altmetrics on the website?)

    Returning to the single-journal TOCs. Here, RSS feeds have proven critical, and I was happy to find an RSS feed for ACS Central Science. It is good to see that the journal's RSS feed for the ASAP papers contains for each paper the title, authors, the TOC image, and the DOI (possibly, it could also include the abstract and the ORCIDs of the authors). Better, it should adopt CMLRSS and include InChIs, MDL molfiles, or SMILES of the chemical compounds discussed in each paper (see this ACS JCIM paper). With proper adoption of CMLRSS, chemists could define substructures and be alerted when papers are published containing chemicals with that substructure (and it does not have to stop there, as cheminformatically it is trivial to extend this to chemical reactions, or any other chemistry). After all, we don't want to miss the chemistry that sparks our inspiration!

    I personally keep track of a number of journals via RSS feeds, which I aggregate in Feedly, which filled the gap after Google Reader was closed down. Feedly does not support CMLRSS (unfortunately, but I have other tools for that), and there are a few alternatives.

    So, I hope the ACS Central Science journal will pick up your challenge and continue to support modern (well, CMLRSS was published in 2004) technologies that support your past workflows! For example, make the link to the ACS Central Science RSS feed more prominent, and write an editorial about how to use it with, for example, Feedly.

    Maastricht University
    The Netherlands
Of course, there is a lot more. It should not surprise you that the adoption of PDF and ReadCube is killing internet-aided serendipity, whereas HTML+RDF, microformats,, etc. would in fact enable serendipity. Chemistry publishers do not particularly have a track record of enabling the kind of serendipity Prof. Bertozzi is looking for. The good thing is that, as editor-in-chief of an ACS journal, she can restore this serendipity, and I kindly invite her to the Blue Obelisk community to discuss how all the technologies developed in the past 15 years can help chemists. Because we have plenty of ideas. (And where is that website again aggregating chemistry journal RSS feeds...?)

Or just browse the posts in this blog, where I have frequently written about innovation by publishers (in general; some do better than others).

Update: Other perspectives

WikiPathways and two estrone-x,y-quinones added to Wikidata

WikiPathways does a lot of curation, with a team growing in size. A number of regular jobs are performed weekly by one of a group of some 15-20 curators. On top of that, some curators do much more than this weekly task, e.g. Kristina Hanspers. Since I joined the BiGCaT team of Chris Evelo in Maastricht, I have been looking into the metabolites and other small molecules, and did quite a bit of work to make that information machine readable. See, for example, these open notebook science posts.

This curation is partly supported by tools, e.g. bots and tests. Tests are, among others, run nightly on a Jenkins instance (in various configurations). One of the bots creates this report, which Martina Kutmon recently reminded me of. Starting at the end of that report, I browsed it for unrecognized metabolites (flagged for various reasons). My eyes fell on two compounds in the estrogen metabolism pathway, originally created by Pieter Giesbertz: estrone-2,3-quinone and estrone-3,4-quinone (in green):

The website was not showing mappings to other databases for the cross-references from PubChem. A quick check confirmed that HMDB, KEGG, and ChEBI did not have these compounds. HMDB has an entry for one of the compounds, judging from the name, but the chemical graph has undefined stereochemistry. That certainly explains why it did not map to the PubChem compound ID. And, indeed, PubChem does have the HMDB entry as a substance, but not linked to a compound. So, I added them to Wikidata: Q20739847 and Q20742851.

Then, when I make the next metabolite ID mapping database for BridgeDb, it will have mappings between the cross-references in WikiPathways for these two compounds and, at the time of writing, ChemSpider, plus the CAS registry number for one of the two. Please also note that Wikidata allowed me to store the information source.

Thus, for me, Wikidata is the place to add new mappings, and I applaud the work by Andra Waagmeester, Andrew Su, and others to use Wikidata for this kind of purpose. If you agree, you can add your support here.

PubChemRDF: semantic web access to PubChem data

Gang Fu and Evan Bolton have blogged about it previously, but their PubChemRDF paper is out now (doi:10.1186/s13321-015-0084-4). It very likely describes the largest collection of RDF triples using the CHEMINF ontology, and I congratulate the authors on an increasingly powerful PubChem database.

With this major provider of Linked Open Data for chemistry now published, I should soon see where my Isbjørn stands. The release of this publication is also very timely with respect to the CHEMINF ontology, as last week I finished a transition from Google Code to GitHub, moving over the important wiki pages, including the one about "Where is the CHEMINF ontology used?". I have already added Gang's paper. A big thanks and congratulations to the PubChem team, and my sincere thanks for having been able to contribute to this paper.

CDK Literature #9

Visualization of functional groups.
Public domain from Wikipedia.
In the past 50 years we have been trying to understand why certain classes of compounds show the same behavior. Quantum chemical calculations are getting cheaper and easier (though I cannot point you to a review of recent advances), but they have not replaced other approaches, as is visible in the number of QSAR/descriptor applications of the CDK.

Functional Group Ontology
Sankar et al. have developed an ontology for functional groups (doi:10.1016/j.jmgm.2013.04.003). One popular thought is that subgroups of atoms matter more than the molecule as a whole. Much of our cheminformatics is based on this idea, and it matches what we observe experimentally: if we add a hydroxyl or an acid group, the molecule becomes more hydrophilic. Semantically encoding this clearly important information seems useful, though intuitively I would have left this to the cheminformatics tools. This paper, and a few papers it cites, however, show how far you can take this. It organizes more than 200 functional groups, but I am not sure where the ontology can be downloaded.

Sankar, P., Krief, A., Vijayasarathi, D., Jun. 2013. A conceptual basis to encode and detect organic functional groups in XML. Journal of Molecular Graphics and Modelling 43, 1-10. URL

Linking biological to chemical similarities
If we step away from our concept of "functional group", we can also just look at whatever is similar between molecules. Michael Kuhn et al. (of STITCH and SIDER) looked into the role of individual proteins in side effects (doi:10.1038/msb.2013.10). They find that many drug side effects are mediated by a small set of individual proteins. The study uses a drug-target interaction data set, and to reduce the chance of bias towards compound classes that are more extensively studied (more data), they removed too-similar compounds from the data set, using the CDK's Tanimoto stack.

Kuhn, M., Al Banchaabouchi, M., Campillos, M., Jensen, L. J., Gross, C., Gavin, A. C., Bork, P., Apr. 2014. Systematic identification of proteins that elicit drug side effects. Molecular Systems Biology 9 (1), 663. URL

Drug-induced liver injury
These approaches can also be used to study whether there are structural reasons why drug-induced liver injury (DILI) occurs. This was studied in this paper by Zhu et al., where the CDK is used to calculate topological descriptors (doi:10.1002/jat.2879). They compared explanatory models that correlate descriptors with the measured endpoint, and a combination with hepatocyte imaging assay technology (HIAT) descriptors. These descriptors capture phenotypes such as nuclei count, nuclei area, reactive oxygen species intensity, tetramethyl rhodamine methyl ester intensity, lipid intensity, and glutathione content. It doesn't cite any of the CDK papers, so I left a comment with PubMed Commons.

Zhu, X.-W., Sedykh, A., Liu, S.-S., Mar. 2014. Hybrid in silico models for drug-induced liver injury using chemical descriptors and in vitro cell-imaging information. Journal of Applied Toxicology 34 (3), 281-288. URL

PubMed Commons: comments, pointers, questions, etc

I could have sworn I had blogged about this already, but I cannot find it in my blog archives. If you do not know PubMed Commons yet, check it out! As the banner on the right shows, it is in pilot mode (yeah, why stick to alpha/beta release tagging), and it has already found several uses, as explained in this blog post. Journal clubs are one of them, which they introduced at the end of last year. The pilot started out by giving access to PubMed authors, but since many of us are, that was never really a reason not to give it a try. Comments on PubMed Commons automatically get picked up by other platforms, like PubPeer, and commenters get a profile page; this is mine.

Like the use cases people have adopted - see the above linked blog post - I have found a number of use cases:

  1. additional information:
    1. missing citations (1)
    2. where data can be downloaded (1)
  2. where data from that paper was deposited:
    1. paper figures available in WikiPathways (1,2,3,4)
    2. authors uploaded data/figures to FigShare but the paper does not link it (1)
    3. authors uploaded data/figures to DataDryad but the paper does not link it (1)
  3. me too:
    1. CDK can help (1)
  4. commenting (1) and questions (2)
  5. a closed paper was made gold Open Access (1)
  6. the source code behind that paper moved
    1. from Google Code to GitHub (1)
So, get your account today, and start updating your papers whose links changed location. Because we all know the bit rot of website locations in papers. Show PubMed how you would like to improve scientific communication via the publishing platform!