The Web - What is the issue?

From Wikipedia.
Last week I gave an invited presentation in the nice library of the Royal Society of Chemistry, at the What's in a Name? The Unsung Heroes of Open Innovation: Nomenclature and Terminology meeting. I was asked to speak about HTML in this context, something I have worked with as channel for communication of scientific knowledge and data for almost 20 years know. Mostly in the area of small molecules, starting with the Dictionary of Organic Chemistry, which is interesting because I presented the web technologies behind this project also in London, October 10 years ago!

As a spoiler, the bottom line of my presentation is that we're not even using 10% of what the web technologies have to offer us. Slowly we are getting there, but too slow in my opinion. For some weird behavioral law, the larger the organization the less innovation gets done (some pointers).

Anyway, I only had 20 minutes, and in that time you cannot do justice to the web technologies.

Papers that I mention in these slides are given below.
Wiener, H. Structural determination of paraffin boiling points. Journal of the American Chemical Society 69, 17-20 (1947). URL
Murray-Rust, P., Rzepa, H. S., Williamson, M. J. & Willighagen, E. L. Chemical markup, XML, and the world wide web. 5. applications of chemical metadata in RSS aggregators. J Chem Inf Comput Sci 44, 462-469 (2004). URL
Rzepa, H. S., Murray-Rust, P. & Whitaker, B. J. The application of chemical multipurpose internet mail extensions (chemical MIME) internet standards to electronic mail and world wide web information exchange. J. Chem. Inf. Comput. Sci. 38, 976-982 (1998). URL
Willighagen, E. et al. Userscripts for the life sciences. BMC Bioinformatics 8, 487+ (2007). URL
Willighagen, E. L. & Brändle, M. P. Resource description framework technologies in chemistry. Journal of cheminformatics 3, 15+ (2011). URL

The history of the Woordenboek Organische Chemie

Chemistry students at the Radboud University in Nijmegen (then called the Catholic University of Nijmegen) got internet access in spring 1994. BTW, the catholic part only was reflected in the curriculum in that philosophy was an obligatory course. The internet access part meant a few things:
  1. xblast
  2. HTML and web servers
  3. email
Our university also had a campus-wide IT group that experimented with new technologies. So, many students had internet access via cable early on (though I do not remember when that got introduced).

During these years I was studying organic chemistry, and I started something to help me learn name reactions and trivial names. I realized that the knowledge base I had built up would be useful to others too, and hence I started the Woordenboek Organische Chemie (WOC). This project no longer exists, and is largely redundant with Wikipedia and other resources. The first public version goes back to 1996, but most of the history is lost, sadly.

Here are a few screenshots I have been able to dig up from the Internet Archive. A pretty recent version is from 2003 and this is what it looked like in those days:

The oldest version I have been able to dig up with from January 1998:

Originally, I started with specific HTML pages, but then quickly realized the importance of separating content from display. The first data format was a custom format which looks an awful lot like JSON but we later moved to the easier to work with XML. The sources are still available from SourceForge where we uploaded the data once we realized the importance of proper data licensing. This screenshot also shows that the website won Ralf Claessen's Chemistry Index award. That was in December 1997.

Unfortunately, I never published the website, which I should have because I realize each day how nice the technologies were we played with, but at least I got it mentioned in two papers. The first time was in the 2000 JChemPaint paper (doi:10.3390/50100093). JChemPaint at the time had functionality to download 2D chemical diagrams from the WOC using CAS registry numbers. The second time was in the CMLRSS paper where the WOC was one of the providers of a CMLRSS feed.

In 2004 I gave a presentation about which HTML technologies were being used in the WOC, also in London, almost 10 years ago! Darn, I should have thought of that, so that I could've mentioned that in my presentation this week! Here are the slides of back then:

Krause, S., Willighagen, E. L. & Steinbeck, C. JChemPaint - using the collaborative forces of the internet to develop a free editor for 2D chemical structures. Molecules 5, 93-98 (2000).
Murray-Rust, P., Rzepa, H. S., Williamson, M. J. & Willighagen, E. L. Chemical markup, XML, and the world wide web. 5. applications of chemical metadata in RSS aggregators. J Chem Inf Comput Sci 44, 462-469 (2004).

Jenkins-CI: automating lab processes

Our group organizes public Science Cafes where people from Maastricht University can see the research it is involved in. Yesterday it was my turn again, and I gave a presentation showing the BiGCaT and eNanoMapper Jenkins-CI installations (set up by Nuno) which I have been using for a variety of processes which Jenkins conveniently runs based on input it gets.

For example, I have it compile and run test suits for a variety of software projects (like the CDK, NanoJava), but also have it build R packages, and even daily run Andra Waagmeester's code to create RDF for WikiPathways. And the use of Jenkins-CI is not limited to dry lab processes: Ioannis Moutsatsos recently showed nice work at Novartis that uses Jenkins for high-throughput screening and data/image analysis.

Slides at the Open PHACTS community workshop (June 26)

First MSP graduates.
It seems had not posted my slides yet of the presentation at the 6th Open PHACTS community workshop. At this meeting I gave an overview of the Programming in the Life Sciences course we give to 2nd and 3rd year students of the Maastricht Science Programme (MSP; some participants graduated this summer, see the photo on the right side).

This course will again be given this year, starting in about a month from now, and I am looking forward to all the cool apps the students come up with! Given that the Open PHACTS API has been extended with pathways and disease information, they will likely be even cooler than last year.

OpenTox Europe 2014 presentation: "Open PHACTS: solutions and the foundation"

CC-BY 2.0 by Dmitry Valberg.
Where the OpenTox Europe 2013 presentation focused on the technical layers of Open PHACTS, this presentation addressed a key knowledge management solution to scientific questions and the Open PHACTS Foundation. I stress here too, as in the slides, that the presentation is on behalf of the full consortium!

For the knowledge management, I think Open PHACTS did really interested work in the field of "identity" and am happy to have been involved in this [Brenninkmeijer2012]. The platform implementation is, furthermore, based on the BridgeDb platform, that originated in our group [VanIersel2010]. The slides outline the scientific issues addressed by this solution:

PS, many more Open PHACTS presentations are found here.

Brenninkmeijer, C. et al. Scientific lenses over linked data: An approach to support task specific views of the data. a vision. In Linked Science 2012 - Tackling Big Data (2012). URL

van Iersel, M. et al. The BridgeDb framework: standardized access to gene, protein and metabolite identifier mapping services. BMC Bioinformatics 11, 5+ (2010). URL

Do a postdoc with eNanoMapper

CC-BY-SA from Zherebetskyy @ WP.
Details will still have to follow as they are being worked out, but with Cristian Munteanu having accepted an associate professorship, I need a new postdoc to fill his place, and I am reopening the position I had almost a year ago. Do you like to works in a systems biology group (BiGCaT), are pro Open Science, like to work on tools for safe-by-design nanomaterials, and have skills in one or more of bioinformatics, chemoinformatics, statistics, coding, ontologies, then this position may be something for you.

The primary project for this position is eNanoMapper and you will be working within the large European NanoSafety Cluster network, though interactions are not limited to the EU.

If you have interest and cannot wait until the details of the position come out, please send me an email. first.lastname @ General questions about eNanoMapper and our BiGCaT solutions for nanosafety are also welcome in the comments.

CDK: Element and Isotope information

When reading files the format in one way or another has implicit information you may need for some algorithms. Element and isotope information is a key example. Typically, the element symbol is provided in the file, but not the mass number or isotope implied. You would need to read the format specification what properties are implicitly meant. The idea here is that information about elements and isotopes is pretty standardized by other organizations such as the IUPAC. Such default element and isotope properties are exposed in the CDK by the classes Elements and Isotopes. I am extending my Groovy Cheminformatics with the CDK with these bits.

The Elements class provides information about the element's atomic number, symbol, periodic table group and period, covalent radius and Van der Waals radius, and Pauling electronegativity (Groovy code):

Elements lithium = Elements.Lithium
println "atomic number: " + lithium.number()
println "symbol: " + lithium.symbol()
println "periodic group: " +
println "periodic period: " + lithium.period()
println "covalent radius: " + lithium.covalentRadius()
println "Vanderwaals radius: " + lithium.vdwRadius()
println "electronegativity: " + lithium.electronegativity()

For example, for lithium this gives:

atomic number: 3
symbol: Li
periodic group: 1
periodic period: 2
covalent radius: 1.34
Vanderwaals radius: 2.2
electronegativity: 0.98

Similarly, there is the Isotopes class to help you look up isotope information. For example, you can get all isotopes for an element or just the major isotope:

isofac = Isotopes.getInstance();
isotopes = isofac.getIsotopes("H");
majorIsotope = isofac.getMajorIsotope("H")
for (isotope in isotopes) {
  print "${isotope.massNumber}${isotope.symbol}: " +
    "${isotope.exactMass} ${isotope.naturalAbundance}%"
  if (majorIsotope.massNumber == isotope.massNumber)
    print " (major isotope)"
  println ""

For hydrogen this gives:

1H: 1.007825032 99.9885% (major isotope)
2H: 2.014101778 0.0115%
3H: 3.016049278 0.0%
4H: 4.02781 0.0%
5H: 5.03531 0.0%
6H: 6.04494 0.0%
7H: 7.05275 0.0%

This class is also used by the getMajorIsotopeMass() method in the MolecularFormulaManipulator class to calculate the monoisotopic mass of a molecule:

molFormula = MolecularFormulaManipulator
println "Monoisotopic mass: " +

The output for ethanol looks like:

Monoisotopic mass: 46.041864812

CDK 1.5.8, Zenodo, GitHub, and DOIs

Screenshot from John blog post.
John released CDK 1.5.8, which has a few nice goodies, like a new renderer. The full changelog is available. Interesting aspect of this release is, that it uses one ZENODO to make the release citable with a DOI. And that is relevant because it simplifies (making it a lot cheaper!) to track the impact of it, e.g. with #altmetrics. And that matters too, because no one really has a clue on how to decide which scientist is better than another, and which scientist should and should not get funding. Where we know peer review of literature is severely limited, we happily accept it to determine career future.

Anyways, so, we have a DOI now for a CDK release. So, everything using the CDK in research can cite this specific CDK release in their papers with this DOI. Of course, most publishers still don't support providing reference lists as a list of DOIs and often do not show the DOI there, but all this is a step forward. John listed the DOI with a nicely ZENODO-provided icon in the release post.

If you follow the DOI you go to the ZENODO website (they effectively act as a publishing platform). It is this page that I want to continue talking about, and in particular about the list of authors. The webpage provides two alternatives. The first is the most prominent one if you visit the page first:

This looks pretty good, I think. It seems to have picked up a list of authors, and looking at the list, not from the standard AUTHORS file, but from the commit messages. However, that is unsuited for the CDK, with a repository history in CVS, via SVN, to Git, but only the latter show up. The list seems sorted by the amount of contributions, but note that Christoph is missing. His work predates the Git era.

The second list of "authors" is given in the bottom right of the page, as "Cite As":

This suggestion is different, though it seems reasonable to assume the et al. (missing dot) refers to the rest of the authors of the first list. In the BibTeX export the full author list shows up again, supporting that idea.

Correct citation?
This makes me wonder: whom are the authors of this release? Clearly, this version includes code from all authors in some sort of way. Code from some original authors may have been long replaced with newer code. And we noted the problem of missing authors, because of the right version control history.

An alternative is to consider this release as a product of those people who have contributed patches since the previous release. In fact, this is something we noted as important in the past and now always report when making a release. For the 1.5.8 release that list looks like:

That is, this approach basically takes an accepted approach in publishing: papers describing updates of running projects involve only the people that contributed to that released work.

Therefore, I think the proper citation for this CDK 1.5.8 release should be:
    John May, Egon Willighagen, Mark Vine, Oliver Stücker, Andy Howlett, Mark Williamson, Sambit Gaan, Alison Choy (2014). cdk: CDK Release 1.5.8. ZENODO. 10.5281/zenodo.11681
Also note the correct spelling of the author names, though one can argue that they should have correctly spelled their names in the Git commit messages. Here are some challenges for GitHub in adopting the ORCID, I guess.

The question is, however, how do we get ZENODO to do this they way we want it to do? I think the above citation makes much more sense, but others may have good reasons why the current practice is better. What should ZENODO pick up to get the author provenance from?

Open knowledge dissemination (with @IFTTT)

An important part of science is communication. That is why we publish. New insights are useless if they sit on some desk. Instead, reuse counts. This communication is not just about the facts, but also a means to establish research networks. Efficient research requires this: you cannot be an expert in everything or at least not be experienced with everything. That is, for most things you do, there is another researcher that can do it faster. This is probably one of the reasons why many Open Science projects actually work, despite limited funding: they are very efficient.

Readers of my blog know a bit about my research, and know how important data exchange is to me. But similarly, allowing people to know what I do. You can see that in my literature: I strive towards knowledge integration (think UserScripts, think CMLRSS, think Linked Open Drug Data) and efficient methods for data exchange. Just because I need this to get statistically significant patterns. After all, my background is chemometrics primarily. Cheminformatics was my hobby, explaining the mashup.

FriendFeed was a brilliant platform for disseminating research and also for exchange of data. Actually, it is a brilliant platform, but when they sold themselves to FaceBook, it got a lot quieter there. And, as said, communication needs a community, and without listeners it is just not the same. Scientists just moved to different social platforms, and it is no surprise FriendFeed didn't show up in Richard van Noorden's recent analysis. A lot of good things happened on FriendFeed, but one was that it used RSS feeds and users could indicate which information sources they liked to show up there. Try my FriendFeed account. Better even was that listeners could select which of my information sources they do not want to listen to. For example, if you were not interested in Flickr images of Person X, but the others sources were interesting, you just silenced that source. Brilliant!

But this feature of using RSS to aggregate dissemination channels is not repeated by other networks, and If This Then That fills that gap. Unlike FriendFeed it does not aggregate it, but send the items to external social networks (and many other systems), including FaceBook and Twitter. It does a lot more than RSS feeds (e.g. check out the Android app), but that is an important one for me and the point of this blog post.

Actually, they have taking the idea of sending around events to a next level, allowing you to tune how it shows up. An example of that is what you see in the screenshot: I played with how news items from the RSS with changes I made to WikiPathways are shown. After a few iterations, I ended up with this "recipe":

The grey boxes is information from the RSS feed. The first iteration (see bottom tweet in the screenshot) only contained a custom perfix "Wikipathways edit:" followed by the {{EntryTitle}} and {{EntryUrl}}. I realized that would the commit message it would not be fun, and added the {{EntryContent}} (one bot last tweet in screenshot). Then I realized that having "Pathway" twice (once from my prefix, once from the {{EntryTitle}}, was not nice to look at either, and I ended up with the above "Wiki{{EntryTitle}} edit:", as visible in the top tweet in the screenshot. Too bad the RSS feed of WikiPathways doesn't have graphics :(

At this moment I am using a few outlets: I use Twitter to send about anything, like I did with FriendFeed. Sadly, Twitter doesn't have the same power to select which tweets you like to listen to. Not interested in the changes I make to WikiPathways? Sorry, you'll have to live with it. Well, you can also try my FaceBook account, where I route fewer things. But there you can like and comment, but I will not respond.

Anyway, my message is, give IFTTT a try!

First steps in Open Notebook Science

Scheme 2 from this Beilstein Journal of Organic
Chemistry paper
by Frank Hahn et al.
I blogged a few weeks back I blogged about my first Open Notebook Science entry. The post suggest I will look at a few ONS service providers, but, honestly, Open Notebook Science Network serves my needs well.

What I have in mind, and will soon advocate, is that the total synthesis approach from organic chemistry fits chem- and bioinformatics research. It may not be perfect, and perhaps somewhat artificial (no pun intended), but I like the idea.

Compound to Compound
Basically, a lab notebook entry should be a step of something larger. You don't write Bioclipse from scratch. You don't do a metabolomics pathway enrichment analysis in one step, either. It's steps, each one taking you from one state to another. Ah, another nice analogy (see automata theory)! In terms of organic chemistry, from one compound to another. The importance here is that the analogy shows that there is no step you should not report. The same applies to cheminformatics: you cannot report a QSAR model without explaining how your cleaned up that SDF file you got from paper X (which still commonly is practised).

Methods Sections
Organic chemistry literature has well-defined templates on how to report the method for a reaction, including minimal reporting standards for the experimental results. For example, you must report chemical shifts, an elemental composition. In cheminformatics we do not have such templates, but there is no reason not too. Another feature that must be reported is the yield.

Reaction yield
The analogy with organic chemistry continues: each step has a yield. We must report this. I am not sure how, and this is one of the things I am exploring and will be part of my argument. In fact, the point of keeping track of variance introduced is something I have been advocating for longer. I think it really matters. We, as a research field, now publish a lot of cheminformatics and chemometrics work, without taking into account the yield of methods, though, for obvious reasons, very much more in chemometrics than in cheminformatics. I won't go into that now, but there is indeed a good part of benchmark work, but the point is, any cheminformatics "reaction" step should be benchmarked.

Total synthesis
The final aspect is, is that by taking this analogy, there is a clear protocol how cheminformatics, or bioinformatics, work must be reported: as a sequence of detailed small steps. It also means that intermediate "products" can be continued with in multiple ways: you get a directed graph of methods you applied and results you got.

You get something like this:

Created with Graphviz Workspace.

The EWx codes refer to entries in my lab notebook:
  1. EW4: Finding nodes in Anopheles gambiae pathways with IUPAC names
  2. EW5: Finding nodes in Homo sapiens pathways with IUPAC names
  3. EW6: Finding nodes in Rattus norvegicus pathways with IUPAC names
  4. EW7: converting metabolite Labels into DataNodes in WikiPathways GPML

Open Notebook Science
Of course, the above applies also if you do not do Open Notebook Science (ONS). In fact, the above outline is not really different from how I did my research before. However, I see value in using the ONS approach here. By having it Open, it

  1. requires me to be as detailed as possible
  2. allows others to repeat it
Combine this with the advantage of the total synthesis analogy:
  1. "reactions" can be performed in reasonable time
  2. easy branching of the synthesis
  3. clear methodology that can be repeated for other "compounds
  4. step towards minimal reporting standards for cheminformatics methods
  5. clear reporting structure that is compatible with journal requirements
OK, that is more or less the paper I want to write up and submit to the Jean-Claude Bradley Memorial Issue in the Journal of Cheminformatics and Chemistry Central. It is an idea, something that helps me, and I hope more people find useful bits in this approach.