New: DOI hyperlinks in the CDK JavaDoc

Apparently I never extended the cdk.cite JavaDoc Taglet to use DOIs from the bibliographic database to create hyperlinks in the JavaDoc. But fear no more! I have submitted a simple patch today to add these to the JavaDoc, and I assume it will be part of the next CDK release from the master branch.
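
To give an idea of what such a link boils down to, here is a minimal sketch in Python. It is not the actual Java patch. The function names are made up for illustration, and I am assuming the links should go through the public https://doi.org/ resolver.

```python
from html import escape
from urllib.parse import quote

# Assumption: hyperlinks point at the public DOI resolver.
DOI_RESOLVER = "https://doi.org/"


def normalize_doi(doi: str) -> str:
    """Strip whitespace and an optional 'doi:' prefix."""
    doi = doi.strip()
    if doi.lower().startswith("doi:"):
        doi = doi[4:]
    return doi


def doi_to_url(doi: str) -> str:
    """Return a resolver URL for the given DOI."""
    # Keep the '/' between DOI prefix and suffix; percent-encode unsafe characters.
    return DOI_RESOLVER + quote(normalize_doi(doi), safe="/")


def doi_to_html_link(doi: str) -> str:
    """Return an HTML anchor that a generated documentation page could embed."""
    href = escape(doi_to_url(doi), quote=True)
    label = escape("doi:" + normalize_doi(doi))
    return '<a href="{}">{}</a>'.format(href, label)


print(doi_to_html_link("doi:10.1038/520585a"))
```

Entries without a DOI in the bibliographic database would simply keep their plain-text citation.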

Unfortunately, many papers in this bibliographic database (i.e. this cheminf.bibx file) do not have a DOI :/

Of course, you can help out here! The only thing you need is a web browser and some knowledge of how to look up DOIs for papers. Just check this blog post (from Step 4 onwards) and line 260 in cheminf.bibx to see what a DOI addition to a BibTeXML entry should look like.

Re: "Thank you for sharing"

CC-BY 4.0, from Roche et al. via Wikipedia.
Nature wrote a piece on data sharing (doi:10.1038/520585a). It remains a tricky area to write about, particularly terms like public access. Researchers are still a bit shy in sharing data, in some fields more than in others. And for valid reasons. Data sharing is a choice: it is something you do to get something in return. The return you get on your investment can vary, for example:
  1. goodwill (e.g. from your employer or funder)
  2. others will donate data to the same resource to benefit your research (research needs some critical mass)
  3. it can be enjoyable
  4. the repository where you contribute your data adds value (e.g. by linking to other resources)
  5. others can find your data more easily, leading to more citations of your publications
  6. after using Open Data yourself, you like to return a favor
I probably miss a few. On the other hand, you may miss out on other opportunities. For example, your data could have been part of an IP-based business model. Or you could have been the only one able to use that data to answer certain questions.

As said, there are many good and valid reasons for either option. It is an option, it is a choice.

The Nature News article has this lead that misled me:
    Initiatives to make genetic and medical data publicly available could improve diagnostics — but they lose value if they do not share with other projects.
The article, however, then discusses a few mechanisms used for data sharing, but I could not spot one that had anything to do with "publicly available". So, I left this comment with the editorial and with PubMed Commons:
    Like Open Access, "sharing" is a meaningless term if it is not linked to meaningful rights. The problems outlined in this paper result from the fact that there may be a wish to share data but only if it allows you to take back the data. Private, custom data licenses do just that. There is nothing wrong with this kind of sharing, but it must not be confused with Open Data. It must not be confounded with terms like "publicly available", because if it needs a signature, it's not publicly available. That makes the lead of this article quite misleading.
    For public or open data, three basic rights are part of the social agreement between the data owner (yes, fact in many countries; database rights, etc) and data user. These rights are: 1. make a copy, 2. make modifications, and 3. reshare (under the same conditions). By using a license (or waiver) that gives these rights automatically to the receiver, there is no need for signatures. It also allows anyone to make the mappings that are required to convert one format into another.
BTW, the image I used in this post is from a paper by Roche et al. from about a year ago (doi:10.1371/journal.pbio.1001779). I have not read that one yet, but it looks like an interesting read too, just like the Nature editorial.

Apr. 2015. Thank you for sharing. Nature 520 (7549), 585.

Roche, D. G., Lanfear, R., Binning, S. A., Haff, T. M., Schwanz, L. E., Cain, K. E., Kokko, H., Jennions, M. D., Kruuk, L. E. B., Jan. 2014. Troubleshooting public data archiving: Suggestions to increase participation. PLoS Biol 12 (1), e1001779+.

Bioclipse 2.6.2 with recent hacks #2: reading content from online Google Spreadsheets

Similar to the previous post in this new series, this post will outline how to make use of the Google Spreadsheet functionality in Bioclipse 2.6.2. But before I provide the steps needed to install the functionality, first consider this Bioclipse JavaScript:

    "your.account", "16charpassword"
    "ORCID @ Maastricht University"
  data = google.loadWorksheet(
    "ORCID @ Maastricht University",
    "with works"
  )

Because that's what this functionality does: it reads data from Google Spreadsheets. That opens up an integration of Google Spreadsheets with your regular data analysis workflows. I am not sure if Bioclipse is the only tool that embeds the Google client code to access these services, and I can imagine similar functionality is available from R, Taverna, and KNIME.

Getting your credentials
The first call to the google manager requires your login details. But don't use your regular password: you need an application password. This specific, sixteen-character password needs to be manually created using your web browser, following this link. Create a new App password ("Other (Customized name)") and use this password in Bioclipse.

Installing Bioclipse 2.6.2 and the Google Spreadsheet functionality
The first thing you need to do (unless you already did that, of course) is install Bioclipse 2.6.2 (the beta) and enable the advanced mode. This is outlined in my previous post up to Step 1. The update site, obviously, is different, and in Step 2 in that post you should use:

  1. Name: Open Notebook Science Update Site
  2. Location:
Yes, the links only seem to get longer and longer. Just continue to the next step and install the Google Feature:

That's it, have fun!

Oh, and this hack is not so recent. The first version of the plugin and matching manager, as used in the above code, dates back to January 2011, when I had just started at the Karolinska Institutet. But the code to download data from spreadsheets is even older, and goes back to 2008 when I worked with Cameron Neylon and Pierre Lindenbaum on creating RDF for data being collected by Jean-Claude Bradley. If you're interested, check the repository history and this book chapter.

CDK Literature #6

Originally a series I started in the CDK News, later for some issues part of this blog, and then for some time on Google+, CDK Literature is now returning to my blog. BTW, I created a poll about whether CDK News should be picked up again. The reason why we stopped was that we were not getting enough submissions anymore.

For those who are not familiar with the CDK Literature series, the posts discuss recent literature that cites one of the two CDK papers (the first one is now Open Access). A short description explains what the paper is about and why the CDK is cited. For that I am using the CiTO, of which the data is available from CiteULike. That allows me to keep track of how people are using the CDK, resulting, for example, in these wordles.

I will try to pick up this series again, but may be a bit more selective. The number of CDK citing papers has grown extensively, resulting in at least one new paper each week (indeed, not even close to the citation rate of DAVID). I aim at covering ~5 papers each week.

Ring perception
Ring perception has evolved in the CDK. Originally, there was an implementation of the Figueras algorithm (doi:10.1021/ci960013p), which was improved by Berger et al. (doi:10.1007/s00453-004-1098-x). Now, John May (the CDK release manager) has reworked the ring perception in the CDK, also introducing a new API which I covered recently. Also check John's blog.

May, J. W., Steinbeck, C., Jan. 2014. Efficient ring perception for the chemistry development kit. Journal of Cheminformatics 6 (1), 3+.

Screening Assistant 2
A bit longer ago, Vincent Le Guilloux published the second version of their Screening Assistant tool for mining large sets of compounds. The CDK is used for various purposes. The paper is already from 2012 (I am that much behind with this series) and the source code on SourceForge does not seem to have changed much recently.

Figure 2 of the paper (CC-BY) shows an overview of the Screening Assistant GUI.
Guilloux, V. L., Arrault, A., Colliandre, L., Bourg, S., Vayer, P., Morin-Allory, L., Aug. 2012. Mining collections of compounds with screening assistant 2. Journal of Cheminformatics 4 (1), 20+.

Similarity and enrichment
Using fingerprints for compound enrichment, i.e. finding the actives in a set of compounds, is a common cheminformatics application. This paper by Avram et al. introduces a new metric (eROCE). I will not go into details, which are best explained by the paper, but note that the CDK is used via PaDEL and that various descriptors and fingerprints are used. The data set they used to show the performance is one of close to 50 thousand inhibitors of ALDH1A1.

Avram, S. I., Crisan, L., Bora, A., Pacureanu, L. M., Avram, S., Kurunczi, L., Mar. 2013. Retrospective group fusion similarity search based on eROCE evaluation metric. Bioorganic & Medicinal Chemistry 21 (5), 1268-1278.

The International Chemical Identifier
It is only because Antony Williams advocated the importance of the InChI in these excellent slides that I list this paper again: I covered it here in more detail already. The paper describes work by Sam Adams to wrap the InChI library into a Java library, how it is integrated in the CDK, and how Bioclipse uses it. It does not formally cite the CDK, which now feels silly. Perhaps I did not add it because of fear of self-citation? Who knows. Anyway, you find this paper cited on slide 30 in the aforementioned presentation from Tony.

Spjuth, O., Berg, A., Adams, S., Willighagen, E., 2013. Applications of the InChI in cheminformatics with the CDK and bioclipse. Journal of Cheminformatics 5 (1), 14+.

Predictive toxicology
Cheminformatics is a key tool in predictive toxicology. It starts with the assumption that compounds of similar structure behave similarly when coming in contact with biological systems. This is a long-standing paradigm which turns out to be quite hard to use, but has not been shown to be incorrect either. This paper proposes a new approach using Pareto points and used the CDK to calculate logP values for compounds. However, I cannot find which algorithm it is using to do so.

Palczewska, A., Neagu, D., Ridley, M., Mar. 2013. Using pareto points for model identification in predictive toxicology. Journal of Cheminformatics 5 (1), 16+.

Cheminformatics in Python
ChemoPy is a tool to do cheminformatics in Python. This paper cites the CDK just as one of the tools available for cheminformatics. The tool is available from Google Code. It has not been migrated yet, but they still have about half a year to do so. Then again, given that there does not seem to have been activity since 2013, I recommend looking at Cinfony instead (doi:10.1186/1752-153X-2-24): it exposes the CDK and is still maintained.

Cao, D.-S., Xu, Q.-S., Hu, Q.-N., Liang, Y.-Z., Apr. 2013. ChemoPy: freely available python package for computational biology and chemoinformatics. Bioinformatics 29 (8), 1092-1094.

Groovy Cheminformatics with the CDK - 11th edition

It's been a while since I blogged about a release of my "Groovy Cheminformatics with the CDK" book, but not too long ago I made another release, 1.5.10-0. This was also the first one with white paper, and updated for the latest CDK development release.

There are two versions (and always check the special deals, e.g. today you can use UNPLUG10 to get an additional 10% off the below prices):
  1. paperback, for $25
  2. eBook, for $15, a PDF version
Compared to the 8th edition, this version offers this new material:
  • Chapter 1: Cheminformatics
  • Section 13.3: Ring counts (though it is not updated for John's ring perception work, doi:10.1186/1758-2946-6-3)
  • Section 14.1: Element and Isotope information
  • Section 16.4: SMARTS matching
  • Chapter 20: four more Chemistry Toolkit Rosetta solutions
  • Section 24.1: CDK 1.4 to 1.6 (see also this series)
This version of the book has 204 Groovy scripts, all of which have been tested against CDK 1.5.10.

Pathways as summaries: Nature Review Disease Primers and Open Source Malaria

A P. falciparum isoprenoid biosynthesis pathway (WP2918).
Event 1
The Nature Publishing Group (NPG) has launched a new journal, which you probably did not miss. There is a founding editorial titled From mechanisms to management (doi:10.1038/nrdp.2015.1), which sets out the goal of the journal. Very noble and very needed, indeed! They write:
Each Primer article includes the same major sections: epidemiology, mechanisms, pathophysiology, diagnosis, screening, prevention, management and patient quality of life.
They complement the articles with PrimeViews and even animations:
Together, we hope that the Primer and PrimeView will provide readily accessible introductions to each topic for readers from all disciplines.
Very exciting! The mechanistic diagrams in the papers are perhaps even better, but, it wouldn't be a proper chem-bla-ics post had I not something to bitch about. And I do; read on.

Event 2
This weekend Christopher Southan asked if the Plasmodium falciparum pathway for isoprenoid biosynthesis was to be found in WikiPathways (related to this blog post about MMV008138). It was not at the time. But other resources did have it, including literature (of course), Wikipedia, and the excellent Malaria Parasite Metabolic Pathways resource.

In related news, about a year ago, Patricia Zaandam worked in our group on pathway analysis related to malaria. At the time, we selected human data from ArrayExpress because of the abundance of human pathways in WikiPathways (>600 now, of which the Curated Collection and Reactome Approved are subsets). So, on a weekend where I really needed a break from working and with some time free, I decided to make that pathway. One of the first observations was that you cannot create Plasmodium pathways on WikiPathways yet. Second, we also do not have a BridgeDb gene identifier mapping database for this organism either. But that is not needed for drawing the pathway.

So, I am digitizing the pathway from the various sources that I can find, added MMV008138, and will probably add more malaria drugs and drug leads along the way. The project of Patricia last year was indeed about possible drug targets. This resulted in this current outcome (with MMV008138 highlighted in red):

The new NPG journal realized we need high quality summaries, and they are correct. This is why the periodic table of elements has been so useful, and the purpose of physical laws expressed as mathematical equations: it puts emphasis on what we think matters. This is also why I believe WikiPathways is so important.

But that's about where the parallel between WikiPathways and NatRevDiseasePrimers ends. The goal of WikiPathways is not just to summarize the knowledge, but to make it manageable. We are talking about data management here. I don't care that much about nice graphics; if we really want to get science and industry moving forward, then we cannot hide behind a knowledge publishing system that doesn't scale and that doesn't integrate. That is not the kind of management we need.

New readers of my blog - welcome! - can browse my past writings to read what the publishing industry should have done. I have explored many different solutions, and only a few of them are being picked up. The Nature Publishing Group has repeatedly experimented with new technologies to make the flood of knowledge manageable, and I find it rather disappointing that this editorial does not manage to go beyond nice graphics. I hope the journal will quickly pick up speed, and add the missing machine readability and APIs. Because a new journal is for years, and we really cannot wait another 15 years.

I am not claiming this new journal is not useful, but it could have been so much more.

"Open Data in Science"

Recently, I got invited to a meeting of Eindhoven's Social Media Club, which has interesting meetings in the knowledge city capital of The Netherlands [ref]. This month's topic was Open Data and I was asked to present Open Data in research, which I eagerly accepted. I quite liked the title too: The great wide Open Data.

I very much enjoyed the other presentation too, mostly by Allard Couwenberg, who gave an excellent introduction into Open Data, which simplified my presentation, allowing me to focus on the role of Open Data in research and possibly at universities. For example, I discussed that I think we can improve the quality of our education if we improve the access to knowledge for our students. I got great questions from the audience, mostly consisting of people outside the scholarly community, and including a few people working with Open Data a lot. A full storify is available.

I have uploaded my slides to SpeakerDeck:

But I only sent the slides around today, because I just spent time (for the first time ever) annotating my slides with source information (the last two slides).

Also, for the first time, I really felt I could have spoken for much longer. While I was able to mention a number of Open Data initiatives, like the Open Knowledge Foundation and its Dutch Open Science working group, Wikidata and Wikidata4Research, the Blue Obelisk movement, the Open Notebook Science Challenge, Open Source Malaria, and crowdsourcing initiatives like Mark2Cure, I realized there is so much around nowadays that this can no longer be covered in a single presentation.

Congrats to the scholarly Open Data community!

Bioclipse 2.6.2 with recent hacks #1: Wikidata & Linked Data Fragments

Bioclipse dialog to upload chemical structures to an OpenTox repository.
Us chem- and bioinformaticians have it easy when it comes to Open Science. Sure, writing documentation, doing unit testing, etc, takes a lot of time, but testing some new idea is done easily. Yes, people got used to that, so trying to explain that doing it properly actually takes long (documentation, unit testing) can be rather hard.

Important for this is a platform that allows you to easily experiment. For many biologists this environment is R or Python. To me, with most of the libraries important to me written in Java, this is Groovy (e.g. see my Groovy Cheminformatics book) and Bioclipse (doi:10.1186/1471-2105-8-59). Sometimes these hacks grow to be full papers, like what started with OpenTox support (doi:10.1186/1756-0500-4-487), which even paved (for me) the way to the eNanoMapper project!

But often these hacks are just for me personally, or at least initially. However, I have no excuse to not make them available to a wider audience too. Of course, the source code is easy, and I normally have even the smallest Bioclipse hack available somewhere on GitHub (look for bioclipse.* repositories). But it is getting even better, now that Arvid Berg (Bioclipse team) gave me the pointers to ensure you can install those hacks, taking advantage of Uppsala's build system.

So, from now on, I will blog how to install Bioclipse hacks I deem useful for a wider audience, starting with this post on my Wikidata/Linked Data Fragments hack I used to get more CAS registry number mappings to other identifiers.

Install Bioclipse 2.6.2
The first thing you need is Bioclipse 2.6.2. That's the beta release of Bioclipse, and required for my hacks. From this link you can download binary nightly builds for GNU/Linux, MS-Windows, and OS/X. For the first two, 32 and 64 bit builds are available. You may need to install Java; version 1.7 should do fine. Unpack the archive, and then start the Bioclipse executable. For example, on GNU/Linux:

  $ tar zxvf Bioclipse.2.6.2-beta.linux.gtk.x86.tar.gz
  $ cd Bioclipse.2.6.2-beta/
  $ ./bioclipse

Install the Linked Data Fragments manager
The default update site already has a lot of goodies you can play with. Just go to Install → New Feature.... That will give you a nice dialog like this one (which allows you to install the aforementioned Bioclipse-OpenTox feature):

But that update site doesn't normally have my quick hacks. This is where Arvid's pointers come in, which I hope to carefully reproduce here so that my readers can install other Bioclipse extensions too.

Step 1: enable the 'Advanced Mode'
The first step is to enable the 'Advanced Mode'; that is, unless you are advanced, forget about this. Fortunately, the fact that you haven't given up on reading my blog yet is a good indication you are advanced. Go to the Window → Preferences menu and enable the 'Advanced Mode' in the dialog, as shown here:

When done, click Apply and close the dialog with OK.

Step 2: add an update site from the Uppsala build system
The first step enables you to add arbitrary new update sites, like update sites available from the Uppsala build system, by adding a new menu option. To add new update sites, use this new menu option and select Install → Software from update site...:

By clicking the Add button, you get this dialog where you should enter the update site information:

This dialog will become a recurrent thing in this series, though the content may change from time to time. The information you need to enter is (the name is not too important and can be something else that makes sense to you):

  1. Name: Bioclipse RDF Update Site
  2. Location:

After clicking OK in the above dialog, you will return to the Available Software dialog (shown earlier).

Step 3: installing the Linked Data Fragments Feature
The Available Software dialog will now show a list of features available from the just added update site:

You can see the Linked Data Fragments Feature is now listed which you can select with the checkbox in front of the name (as shown above). The Next button will walk you through a few more pages in this dialog, providing information about dependencies and a page that requires you to accept the Open Source licenses involved. And at the end of these steps, it may require you to reboot Bioclipse.

Step 4: opening the JavaScript Console and verify the new extension is installed
Because the Linked Data Fragments Feature extends Bioclipse with a new, so-called manager (see doi:10.1186/1471-2105-10-397), we need to use the JavaScript Console (or Groovy Console, or Python Console, if you prefer those languages). Make sure the JavaScript Console is open (if it is not, open it via the menu Window → Show View → JavaScript Console) and type man ldf in the console view, which should result in something like this:

You can also type man ldf.createStore to get a brief description of the method I used to get a Linked Data Fragments wrapper for Wikidata in my previous post, which is what you should reread next.

Have fun! I am looking forward to hearing how you use Linked Data Fragments with Bioclipse!

Chemistry Central and the ORCID identifier

If you are a scientist you have heard about the ORCID identifier by now. If not, you have been focusing on groundbreaking research and isolated yourself from the rest of the world, just to make it perfect and get that Nobel prize next year. If you have been working on impactful research, Nobel prize-worthy, and have been blogging and tweeting about your progress, as a good Open Scholar, you know ORCID is the DOI for "research contributors" and you already have one yourself, and probably also that T-shirt with your own identifier. Mine is 0000-0001-7542-0286, and almost 1.3M other authors got one too. The list of ORCIDs on Wikipedia (and Wikidata) is growing, thanks to Andy Mabbett, who also made it possible to add your ORCID on WikiPathways.

Anyway, what I was pleased to see today is that you can now log in with your ORCID identifier with the Chemistry Central article submission system (notice the green icon):
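
As an aside for those who like to tinker: the last character of an ORCID identifier is a check character, computed with the ISO 7064 MOD 11-2 scheme. The Python sketch below shows how that check can be verified. The function names are mine, chosen for illustration, and the code only tests the checksum, not whether the identifier is actually registered.

```python
def orcid_check_character(base_digits: str) -> str:
    """Compute the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    # A result of 10 is written as 'X'.
    return "X" if result == 10 else str(result)


def has_valid_orcid_checksum(orcid: str) -> bool:
    """Check the format (16 characters, hyphens ignored) and the check character."""
    compact = orcid.replace("-", "")
    if len(compact) != 16:
        return False
    base, check = compact[:15], compact[15].upper()
    if not (base.isascii() and base.isdigit()):
        return False
    return orcid_check_character(base) == check


print(has_valid_orcid_checksum("0000-0001-7542-0286"))
```

Changing any single digit of the identifier makes the check fail, which is handy for catching typos before looking anyone up.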

Many other publishers allow logging in with your ORCID too, which benefits many:

  1. authors who just enter a list of ORCID identifiers, instead of a long list of author names and affiliations
  2. publishers, which have a simpler submission system and get more accurate information about submitters
  3. funding agencies which can more easily track what is done with the research funding
  4. research institutes which can more easily track what their employees are studying
Don't have one yet? Get your very own ORCID here.

CC-BY with the ACS Author Choice: CDK and Blue Obelisk papers liberated

Screenshot of an old CDK-based JChemPaint, from the first CDK paper. CC-BY :)
Already a while ago, the American Chemical Society (ACS) decided to allow the Creative Commons Attribution license (version 4.0) to be used on their papers, via their Author Choice program. ACS members pay $1500, which is low for a traditional publisher. While I would even rather see them move to gold Open Access journals, it is a very welcome option! For the ACS business model it means a guaranteed sale of some 40 copies of this paper (at about $35 each), because it will not immediately affect the sale of the full journal (much). Some papers may have sold more than that had they remained closed access, but for many papers this sounds like a smart move money-wise. Of course, they also buy themselves some goodwill, and green Open Access is just around the corner anyway.
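
The "some 40 copies" is just back-of-the-envelope arithmetic. Spelled out as a small Python sketch, using only the two (approximate) numbers from the paragraph above:

```python
import math

AUTHOR_CHOICE_FEE = 1500  # USD, the ACS member price mentioned above
PRICE_PER_COPY = 35       # USD, approximate price of a single article

# Income from selling 40 single copies of the paper.
income_from_40_copies = 40 * PRICE_PER_COPY

# Number of whole copies that must be sold to match the Author Choice fee.
break_even_copies = math.ceil(AUTHOR_CHOICE_FEE / PRICE_PER_COPY)

print(income_from_40_copies)  # 1400
print(break_even_copies)      # 43
```

So the fee is worth roughly 40 to 43 single-article sales, depending on how you round.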

Better, perhaps, is that you can also use this option to make a past paper Open Access under a CC-BY license! And that is exactly what Christoph Steinbeck did with five of his papers, including two on which I am co-author. And these are not the least of his papers either. The first is the first CDK paper from 2003 (doi:10.1021/ci025584y), which featured the screenshot of JChemPaint shown above. Note that in those days, the print journal was still the target, so the screenshot is in gray scale :) BTW, given that this paper is cited 329 times (according to ImpactStory), maybe the ACS could have sold more than 40 copies. But for me, it means that finally people can read this paper about Open Science in chemistry, even after so many years. BTW, there is little chance the second CDK paper will be freed in a similar way.

The second paper that was liberated this way, is the first Blue Obelisk paper (doi:10.1021/ci050400b), which was cited 276 times (see ImpactStory):

This screenshot nicely shows how readers can see the CC-BY license for this paper. Note that it also lists that the copyright is with the ACS, which is correct, because in those days you commonly gave away your copyright to the publisher (I have stopped doing this, bar some unfortunate recent exceptions).

So, head over to your email client and email the ACS to let them know you also want your JCICS/JCIM paper available under a CC-BY license! No excuse anymore to not make your seminal work in cheminformatics available as gold Open Access!

Of course, submitting your new work to the Journal of Cheminformatics is cheaper and has the advantage that all papers are Open Access!