#metsoc2015 Converting SMILES annotation into InChIKey annotation

One of the questions I had in the hackathon today is how to use the CDK to convert SMILES strings into InChIs and InChIKeys (see doi:10.1186/1758-2946-5-14). So, here goes. This is the Groovy variant, though you can access the CDK just as well from other programming languages (Python, Java, JavaScript). We'll use the binary jar for CDK 1.5.10. We can then run code, say test.groovy, using the CDK with:

groovy -cp cdk-1.5.10.jar test.groovy

With that out of the way, let's look at the code. Let's assume we start with a text file, say test.smi, with one SMILES string on each line. We then parse this file with:

new File("test.smi").eachLine { line ->
  mol = parser.parseSmiles(line)
}

This already parses the SMILES string into a chemical graph. If we pass this to the generator to create an InChIKey, we may get an error, so we do an extra check:

gen = factory.getInChIGenerator(mol)
if (gen.getReturnStatus() == INCHI_RET.OKAY) {
  println gen.inchiKey;
} else {
  println "# error: " + gen.message
}

If we combine these two bits, we get a full test.groovy program:

import org.openscience.cdk.silent.*
import org.openscience.cdk.smiles.*
import org.openscience.cdk.inchi.*
import net.sf.jniinchi.INCHI_RET

parser = new SmilesParser(SilentChemObjectBuilder.instance)
factory = InChIGeneratorFactory.instance

new File("test.smi").eachLine { line ->
  mol = parser.parseSmiles(line)
  gen = factory.getInChIGenerator(mol)
  if (gen.getReturnStatus() == INCHI_RET.OKAY) {
    println gen.inchiKey;
  } else {
    println "# error: " + gen.message
  }
}

Update: John May suggested an update, which I quite like. If the result is not 100% okay but the InChI library gave a warning, it still yields an InChIKey, which we can output along with the warning message. For this, replace the if-else statement with this code:

if (gen.returnStatus == INCHI_RET.OKAY) {
  println gen.inchiKey;
} else if (gen.returnStatus == INCHI_RET.WARNING) {
  println gen.inchiKey + " # warning: " + gen.message;
} else {
  println "# error: " + gen.message
}


CDK Literature #7

CC-BY-SA from WikiMedia.
Despite evidence that it does not make sense to set such goals, I did it again: I aimed at discussing some five CDK-citing papers each week. That was three weeks ago, and I don't really have time today either. But let me cover a few, so that I do not get even further behind.

Subset selection in QSAR modeling
We (intuitively) know that negative data is important for statistical pattern recognition and modelling. We also know that the literature is not exactly riddled with such data. This paper studies the effect of using sets of inactive compounds in modelling, and particularly how to select which compounds go into the training set. As with the active compounds, the results of this paper show that the selection method matters. The CDK is used to calculate fingerprints.

Smusz, S., Kurczab, R., Bojarski, A. J., Apr. 2013. The influence of the inactives subset generation on the performance of machine learning methods. Journal of Cheminformatics 5 (1), 17+. URL http://dx.doi.org/10.1186/1758-2946-5-17

Using fingerprints to create clustering trees

I need to read this paper by Palacios-Bejarano et al. in more detail, because it seems quite interesting. They use fingerprints to build clustering trees which, if I understand it correctly, can be used to calculate similarities between molecules. That is used in QSAR modeling of ClogP, and the results suggest that while MCS works better, this approach is more robust. This paper too uses the CDK for fingerprint calculation.

Palacios-Bejarano, B., Cerruela Garcia, G., Luque Ruiz, I., Gómez-Nieto, M., Jun. 2013. An algorithm for pattern extraction in fingerprints. Chemometrics and Intelligent Laboratory Systems 125, 87-100. URL http://dx.doi.org/10.1016/j.chemolab.2013.04.003

Dr. J. Alvarsson: Bioclipse 2, signature fingerprints, and chemometrics

Last Friday I attended the PhD defense of, now, Dr. Jonathan Alvarsson (Dept. Pharmaceutical Biosciences, Uppsala University), who defended his thesis Ligand-based Methods for Data Management and Modelling (PDF). Key papers resulting from his work (see the list below) include one about Bioclipse 2, particularly covering his work on pluggable managers that enrich scripting languages (JavaScript, Python, Groovy) with domain-specific functionality, which I make frequent use of (doi:10.1186/1471-2105-10-397); a paper about Brunn, a LIMS system for microplates based on Bioclipse 2 (doi:10.1186/1471-2105-12-179); and a set of chemometrics papers looking at scaling up pattern recognition in QSAR model building (e.g. doi:10.1021/ci500361u). He is also an author on several other papers, on a number of which we collaborated, so you will find his name in several more publications. Check his Google Scholar profile.

In Sweden there is one key opponent, though further questions can be asked by a few other committee members. John Overington (formerly of ChEMBL) was the opponent and he asked Jonathan questions for at least an hour, going through the thesis. Of course, I don't remember most of it, but there were a few that I remember and want to bring up. One issue was about the uptake of Bioclipse by the community, and, for example, how large the community is. The answer is that this is hard to answer; there are download statistics and there is actual use.

Download statistics of the Bioclipse 2.6.1 release.
Like citation statistics (the Bioclipse 1 paper was cited close to 100 times, Bioclipse 2 is approaching 40 citations), download statistics reflect this uptake, but are hardly direct measurements. When I first learned about Bioclipse, I realized that it could be a game changer. But it did not become one. I still don't quite understand why not. It looks good, is very powerful, very easy to extend (which I still make a lot of use of), fairly easy to install (download 2.6.1 or the 2.6.2 beta), etc. And it resulted in a large set of applications; just check the list of papers.

One argument could be that it is yet another tool to install, and that developers are turning to web-based solutions. Moreover, the cheminformatics community has many alternatives, and users seem to prefer smaller, more dedicated tools: a file format converter like Open Babel, or a dedicated descriptor calculator like PaDEL. Simpler messages seem more successful; this is expected in politics, but I guess science is more like politics than we like to believe.

A second question I remember was about what Jonathan would like to see changed in ChEMBL, the database Overington has worked so long on. As a data analyst you are in a different mindset: rather than thinking about single molecules, you think about classes of compounds, and rather than thinking about the specific interaction of a drug with a protein, you think about the general underlying chemical phenomenon. A question like this one requires a different kind of thinking: it needs one to think like an analytical chemist, who worries about the underlying experiments. Obvious, but easy to overlook once thinking at a higher (different) level. That experimental error information in ChEMBL can actually support modelling is something we showed using Bayesian statistics (well, Martin Eklund particularly) in Linking the Resource Description Framework to cheminformatics and proteochemometrics (doi:10.1186/2041-1480-2-S1-S6), by including the assay confidence assigned by the ChEMBL curation team. If John had asked me, I would have said I wanted ChEMBL to capture as much of the experimental details as possible.

Integration of RDF technologies in Bioclipse. Alvarsson worked on the integration of the RDF editor in Bioclipse.
The screenshot shows that if you click an RDF resource reflecting a molecule, it will show the structure (if there is a
predicate providing the SMILES) and information from the predicates in general.
The last question I want to discuss was about the number of rotatable bonds in paracetamol. If you look at this structure, you would identify four purely σ bonds (BTW, can you have π bonds without σ bonds?). So, four could be the expected answer. You can argue that the peptide bond should not be considered rotatable and should be excluded, and thus the answer would be two. Now, the CDK answers two, as shown in an example of descriptor calculation in the thesis. I raised my eyebrows, and thought: "I surely hope this is not a bug!". (Well, my thoughts used some more words, which I will not repeat here.)

But thinking about that, I valued the idea of Open Source: I could simply check. I took my tablet from my bag, opened up a browser, went to GitHub, and looked up the source code. It turned out it was not a bug! Phew. No, in fact, it turned out that the default parameters of this descriptor exclude the terminal rotatable bonds:

So, problem solved. Two follow-up questions, though: 1. can you look up source code during a thesis defense? Jonathan had his laptop right in front of him; I only thought of that yesterday, when I was back home, having dinner with the family. 2. I wonder if I should discuss the idea of parameterizable descriptors more; what do you think? There is a lot of confusion about this design decision in the CDK. For example, it is not uncommon to read that the CDK only calculates some two hundred descriptor values, whereas tool X calculates more than a thousand. Mmmm, that always makes me question the quality of that paper in general, but, oh well...

There also was a nice discussion about chemometrics. Jonathan argues in his thesis that a fast modeling method may, at this moment, be a better way forward than more powerful statistical methods. He presented results with LIBLINEAR and signature fingerprints, comparing them to other approaches. The fingerprints were compared with industry standards, like ECFP (which Clark and Ekins implemented for the CDK and have been using in Bayesian statistics on the mobile phone), and for LIBLINEAR Jonathan showed that it can handle more data than regular SVM libraries, and that using more training data still improves the model more than a "better" statistical method does (which quite matches my own experiences). And with SVMs, finding the right parameters typically is an issue; using an RBF kernel only adds one more. Since Jonathan also indicated that the Tanimoto distance measure for fingerprints is still a more than sufficient approach, I wonder if the chemometrics models should not be using a Tanimoto kernel instead of an RBF kernel (though doi:10.1021/ci800441c suggests RBF may really do better for some tasks, at the expense of more parameter optimization).
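As an aside, the Tanimoto measure on bit-vector fingerprints is simple to state: the number of on-bits two fingerprints share, divided by the number of on-bits they have in total. The sketch below is my own dependency-free Java illustration of that formula, not the CDK's implementation (the CDK ships one in org.openscience.cdk.similarity.Tanimoto):

```java
import java.util.BitSet;

public class TanimotoDemo {

    // Tanimoto coefficient of two bit-vector fingerprints:
    // |A AND B| / |A OR B|
    public static double tanimoto(BitSet a, BitSet b) {
        BitSet intersection = (BitSet) a.clone();
        intersection.and(b);
        BitSet union = (BitSet) a.clone();
        union.or(b);
        // define two empty fingerprints as identical
        return union.cardinality() == 0
            ? 1.0
            : (double) intersection.cardinality() / union.cardinality();
    }

    public static void main(String[] args) {
        BitSet fp1 = new BitSet();
        fp1.set(0); fp1.set(2); fp1.set(4);
        BitSet fp2 = new BitSet();
        fp2.set(0); fp2.set(2); fp2.set(5);
        // 2 shared bits out of 4 set bits in total: 0.5
        System.out.println(tanimoto(fp1, fp2));
    }
}
```

A Tanimoto kernel would plug exactly this value in as k(x, y); it is known to be positive semi-definite for bit vectors, so it is a valid SVM kernel.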

To wrap up, I really enjoyed working with Jonathan and I think he did excellent multidisciplinary work. I am also happy that I was able to attend his defense and the events around it. In no way does this post do justice to or fully reflect the defense; it merely reflects how relevant I think his research is, and highlights some of my thoughts during (and after) the defense.

Jonathan, congratulations!

Spjuth, O., Alvarsson, J., Berg, A., Eklund, M., Kuhn, S., Mäsak, C., Torrance, G., Wagener, J., Willighagen, E. L., Steinbeck, C., Wikberg, J. E., Dec. 2009. Bioclipse 2: A scriptable integration platform for the life sciences. BMC Bioinformatics 10 (1), 397+.
Alvarsson, J., Andersson, C., Spjuth, O., Larsson, R., Wikberg, J. E. S., May 2011. Brunn: An open source laboratory information system for microplates with a graphical plate layout design process. BMC Bioinformatics 12 (1), 179+.
Alvarsson, J., Eklund, M., Engkvist, O., Spjuth, O., Carlsson, L., Wikberg, J. E. S., Noeske, T., Oct. 2014. Ligand-Based target prediction with signature fingerprints. J. Chem. Inf. Model. 54 (10), 2647-2653.

Bioclipse 2.6.2 with recent hacks #3: using functionality from the OWLAPI

Update: if you had problems installing this feature, please try again. Two annoying issues have been fixed now.

Third in this series is this post about the Bioclipse plugin I wrote for the OWLAPI library. I wrote this manager to learn how the OWLAPI works. The OWLAPI feature is available from the same update site as the Linked Data Fragments feature, so you can just follow the steps outlined here (if you had not already).

Using the manager
The manager has various methods, for example, for loading an OWL ontology:

ontology = owlapi.load(
  "/eNanoMapper/enanomapper.owl", null
)

If your ontology imports other ontologies, you may need to tell the OWLAPI first where to find those, by defining mappings. For example, before making the above call, I could do:

mapper = null; // initially no mapper
// the arguments below were truncated in the original post; the
// ontology IRI and local path shown here are placeholders
mapper = owlapi.addMapping(mapper,
  "http://www.enanomapper.net/ontologies/" + "external.owl",
  "/eNanoMapper/external.owl"
)

I can list which ontologies have been imported with:

imported = owlapi.getImportedOntologies(ontology)
for (var i = 0; i < imported.size(); i++) {
  // the loop body was lost from the original post; printing each
  // imported ontology is one option
  js.say("" + imported.get(i))
}

When the ontology is successfully loaded, I can list the classes and various types of properties:

classes = owlapi.getClasses(ontology)
annotProps = owlapi.getAnnotationProperties(ontology)
declaredProps = owlapi.getPropertyDeclarationAxioms(ontology)

Some further functionality likely needs adding, and I would love to hear which methods you would like to see added.

New: DOI hyperlinks in the CDK JavaDoc

Apparently I never extended the cdk.cite JavaDoc Taglet to use DOIs from the bibliographic database to create hyperlinks in the JavaDoc. But fear no more! I have submitted a simple patch today to add these to the JavaDoc, and I assume it will be part of the next CDK release from the master branch.

Of course, this bibliographic database (i.e. this cheminf.bibx file) does not have DOIs for all papers yet :/

Of course, you can help out here! The only thing you need is a web browser and some knowledge of how to look up DOIs for papers. Just check this blog post (from Step 4 onwards) and line 260 in cheminf.bibx to see what a DOI addition to a BibTeXML entry should look like.
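For readers unfamiliar with BibTeXML, such an entry looks roughly like the sketch below. Note that the element names and namespace prefix here are my assumptions; the authoritative convention is the one in cheminf.bibx itself:

```xml
<!-- hypothetical sketch; element names are assumptions, the actual
     convention is the one used in cheminf.bibx -->
<bibtex:entry id="May2014">
  <bibtex:article>
    <bibtex:author>May, J. W. and Steinbeck, C.</bibtex:author>
    <bibtex:title>Efficient ring perception for the Chemistry Development Kit</bibtex:title>
    <bibtex:journal>Journal of Cheminformatics</bibtex:journal>
    <bibtex:year>2014</bibtex:year>
    <bibtex:doi>10.1186/1758-2946-6-3</bibtex:doi>
  </bibtex:article>
</bibtex:entry>
```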

Re: "Thank you for sharing"

CC-BY 4.0, from Roche et al. via Wikipedia.
Nature wrote a piece on data sharing (doi:10.1038/520585a). It remains a tricky area to write about, particularly around terms like public access. Researchers are still a bit shy about sharing data, in some fields more than in others, and for valid reasons. Data sharing is a choice; it is something you do to get something in return. The return you get on your investment can vary, for example:
  1. goodwill (e.g. from your employer or funder)
  2. others will donate data to the same resource to benefit your research (a resource needs some critical mass)
  3. it can be enjoyable
  4. the repository where you contribute your data adds value (e.g. by linking to other resources)
  5. others can find your data more easily, leading to more citations of your publications
  6. after using Open Data yourself (e.g. pdb.org), you would like to return the favor
I probably missed a few. On the other hand, you may miss out on other opportunities. For example, your data could have been part of an IP-based business model, or you could have been the only one able to use that data to solve or answer certain questions.

As said, there are many good and valid reasons for either option. It is an option, it is a choice.

The Nature News article has this lead that misled me:
    Initiatives to make genetic and medical data publicly available could improve diagnostics — but they lose value if they do not share with other projects.
The article, however, then discusses a few mechanisms used for data sharing, but I could not spot one that had anything to do with "publicly available". So, I left this comment with the editorial and with PubMed Commons:
    Like Open Access, "sharing" is a meaningless term if it is not linked to meaningful rights. The problems outlined in this paper result from the fact that there may be a wish to share data, but only if it allows you to take back the data. Private, custom data licenses do just that. There is nothing wrong with this kind of sharing, but it must not be confused with Open Data. It must not be confounded with terms like "publicly available", because if it needs a signature, it's not publicly available. That makes the lead of this article quite misleading.
    For public or open data, three basic rights are part of the social agreement between the data owner (yes, a fact in many countries; database rights, etc.) and the data user. These rights are: 1. make a copy, 2. make modifications, and 3. reshare (under the same conditions). By using a license (or waiver) that gives these rights automatically to the receiver, there is no need for signatures. It also allows anyone to make the mappings that are required to convert one format into another.
BTW, the image I used in this post is from a paper by Roche et al. of about a year ago (doi:10.1371/journal.pbio.1001779). I have not read that one yet, but it looks like an interesting read too, just like the Nature editorial.

Anonymous, Apr. 2015. Thank you for sharing. Nature 520 (7549), 585. URL http://dx.doi.org/10.1038/520585a

Roche, D. G., Lanfear, R., Binning, S. A., Haff, T. M., Schwanz, L. E., Cain, K. E., Kokko, H., Jennions, M. D., Kruuk, L. E. B., Jan. 2014. Troubleshooting public data archiving: Suggestions to increase participation. PLoS Biol 12 (1), e1001779+. URL http://dx.doi.org/10.1371/journal.pbio.1001779

Bioclipse 2.6.2 with recent hacks #2: reading content from online Google Spreadsheets

Update 2015-06-04: the authentication with the Google Drive has changed; I need to update the code and am afraid I missed the point, so that the below code is not working right now :(

Similar to the previous post in this new series, this post will outline how to make use of the Google Spreadsheet functionality in Bioclipse 2.6.2. But before I provide the steps needed to install the functionality, first consider this Bioclipse JavaScript:

  // the method calls in the first lines were truncated in the original
  // post; the names below are assumptions
  google.setUserCredentials(
    "your.account", "16charpassword"
  )
  google.listWorksheets(
    "ORCID @ Maastricht University"
  )
  data = google.loadWorksheet(
    "ORCID @ Maastricht University",
    "with works"
  )

Because that's what this functionality does: read data from Google Spreadsheets. That opens up an integration of Google Spreadsheets with your regular data analysis workflows. I am not sure if Bioclipse is the only tool that embeds the Google client code to access these services; I can imagine similar functionality is available from R, Taverna, and KNIME.

Getting your credentials
The first call to the google manager requires your login details. But don't use your regular password: you need an application password. This specific, sixteen-character password needs to be created manually in your web browser, following this link. Create a new App password ("Other (Customized name)") and use this password in Bioclipse.

Installing Bioclipse 2.6.2 and the Google Spreadsheet functionality
The first thing you need to do (unless you already did that, of course) is install Bioclipse 2.6.2 (the beta) and enable the advanced mode. This is outlined in my previous post, up to Step 1. The update site, obviously, is different, and in Step 2 of that post you should use:

  1. Name: Open Notebook Science Update Site
  2. Location: http://pele.farmbio.uu.se/jenkins/job/Bioclipse.ons/lastSuccessfulBuild/artifact/buckminster.output/net.bioclipse.ons_site_1.0.0-eclipse.feature/site.p2/
Yes, the links only seem to get longer and longer. Just continue to the next step and install the Google Feature:

That's it, have fun!

Oh, and this hack is not so recent. The first version of the net.bioclipse.google plugin and matching manager, as used in the above code, dates back to January 2011, when I had just started at the Karolinska Institutet. But the code to download data from spreadsheets is even older, and goes back to 2008, when I worked with Cameron Neylon and Pierre Lindenbaum on creating RDF for data being collected by Jean-Claude Bradley. If you're interested, check the repository history and this book chapter.

CDK Literature #6

Originally a series I started in the CDK News, later for some issues part of this blog, and then for some time on Google+, CDK Literature is now returning to my blog. BTW, I created a poll about whether CDK News should be picked up again. The reason why we stopped was that we were not getting enough submissions anymore.

For those who are not familiar with the CDK Literature series: the posts discuss recent literature that cites one of the two CDK papers (the first one is now Open Access). A short description explains what the paper is about and why the CDK is cited. For that I am using the CiTO, of which the data is available from CiteULike. That allows me to keep track of how people are using the CDK, resulting, for example, in these wordles.

I will try to pick up this series again, but may be a bit more selective. The number of CDK citing papers has grown extensively, resulting in at least one new paper each week (indeed, not even close to the citation rate of DAVID). I aim at covering ~5 papers each week.

Ring perception
Ring perception has evolved in the CDK. Originally, there was the Figueras algorithm implementation (doi:10.1021/ci960013p), which was improved by Berger et al. (doi:10.1007/s00453-004-1098-x). Now, John May (the CDK release manager) has reworked the ring perception in the CDK, also introducing a new API, which I covered recently. Also check John's blog.

May, J. W., Steinbeck, C., Jan. 2014. Efficient ring perception for the chemistry development kit. Journal of Cheminformatics 6 (1), 3+. URL http://dx.doi.org/10.1186/1758-2946-6-3

Screening Assistant 2
A bit longer ago, Vincent Le Guilloux published the second version of their Screening Assistant tool for mining large sets of compounds. The CDK is used for various purposes. The paper is already from 2012 (I am that much behind with this series) and the source code on SourceForge does not seem to have changed much recently.

Figure 2 of the paper (CC-BY) shows an overview of the Screening Assistant GUI.
Guilloux, V. L., Arrault, A., Colliandre, L., Bourg, S., Vayer, P., Morin-Allory, L., Aug. 2012. Mining collections of compounds with screening assistant 2. Journal of Cheminformatics 4 (1), 20+. URL http://dx.doi.org/10.1186/1758-2946-4-20

Similarity and enrichment
Using fingerprints for compound enrichment, i.e. finding the actives in a set of compounds, is a common cheminformatics application. This paper by Avram et al. introduces a new metric (eROCE). I will not go into details, which are best explained by the paper, but note that the CDK is used via PaDEL and that various descriptors and fingerprints are used. The data set used to show the performance is one of close to 50 thousand inhibitors of ALDH1A1.

Avram, S. I., Crisan, L., Bora, A., Pacureanu, L. M., Avram, S., Kurunczi, L., Mar. 2013. Retrospective group fusion similarity search based on eROCE evaluation metric. Bioorganic & Medicinal Chemistry 21 (5), 1268-1278. URL http://dx.doi.org/10.1016/j.bmc.2012.12.041

The International Chemical Identifier
It is only because Antony Williams advocated the importance of the InChI in these excellent slides that I list this paper again: I covered it here in more detail already. The paper describes work by Sam Adams to wrap the InChI library into a Java library, how it is integrated in the CDK, and how Bioclipse uses it. It does not formally cite the CDK, which now feels silly. Perhaps I did not add the citation for fear of self-citation? Who knows. Anyway, you find this paper cited on slide 30 of the aforementioned presentation by Tony.

Spjuth, O., Berg, A., Adams, S., Willighagen, E., 2013. Applications of the InChI in cheminformatics with the CDK and bioclipse. Journal of Cheminformatics 5 (1), 14+. URL http://dx.doi.org/10.1186/1758-2946-5-14

Predictive toxicology
Cheminformatics is a key tool in predictive toxicology. It starts with the assumption that compounds of similar structure behave similarly when coming in contact with biological systems. This is a long-standing paradigm which turns out to be quite hard to use, but which has not been shown to be incorrect either. This paper proposes a new approach using Pareto points and uses the CDK to calculate logP values for compounds. However, I cannot find which algorithm it uses to do so.

Palczewska, A., Neagu, D., Ridley, M., Mar. 2013. Using pareto points for model identification in predictive toxicology. Journal of Cheminformatics 5 (1), 16+. URL http://dx.doi.org/10.1186/1758-2946-5-16

Cheminformatics in Python
ChemoPy is a tool to do cheminformatics in Python. This paper cites the CDK just as one of the tools available for cheminformatics. The tool is available from Google Code. It has not been migrated yet, but they still have about half a year to do so. Then again, given that there does not seem to have been any activity since 2013, I recommend looking at Cinfony instead (doi:10.1186/1752-153X-2-24): it exposes the CDK and is still maintained.

Cao, D.-S., Xu, Q.-S., Hu, Q.-N., Liang, Y.-Z., Apr. 2013. ChemoPy: freely available python package for computational biology and chemoinformatics. Bioinformatics 29 (8), 1092-1094. URL http://dx.doi.org/10.1093/bioinformatics/btt105

Groovy Cheminformatics with the CDK - 11th edition

It's been a while since I blogged about a release of my "Groovy Cheminformatics with the CDK" book, but not too long ago I made another release, 1.5.10-0. This was also the first one printed on white paper, and it is updated for the latest CDK development release.

There are two versions (and always check the special deals, e.g. today you can use UNPLUG10 to get an additional 10% off the below prices):
  1. paperback, for $25
  2. eBook, for $15, a PDF version
Compared to the 8th edition, this version offers this new material:
  • Chapter 1: Cheminformatics
  • Section 13.3: Ring counts (though it is not updated for John's ring perception work, doi:10.1186/1758-2946-6-3)
  • Section 14.1: Element and Isotope information
  • Section 16.4: SMARTS matching
  • Chapter 20: four more Chemistry Toolkit Rosetta solutions
  • Section 24.1: CDK 1.4 to 1.6 (see also this series)
This version of the book has 204 Groovy scripts, all of which have been tested against CDK 1.5.10.

Pathways as summaries: Nature Review Disease Primers and Open Source Malaria

A P.falciparum isoprenoid
biosynthesis pathway (WP2918).
Event 1
The Nature Publishing Group (NPG) has launched a new journal, which you probably did not miss. There is a founding editorial, titled From mechanisms to management (doi:10.1038/nrdp.2015.1), describing the goal of the journal. Very noble and very needed, indeed! They write:
Each Primer article includes the same major sections: epidemiology, mechanisms, pathophysiology, diagnosis, screening, prevention, management and patient quality of life.
They complement the articles with PrimeViews and even animations:
Together, we hope that the Primer and PrimeView will provide readily accessible introductions to each topic for readers from all disciplines.
Very exciting! The mechanistic diagrams in the papers are perhaps even better. But it wouldn't be a proper chem-bla-ics post had I not something to bitch about. And I do; read on.

Event 2
This weekend Christopher Southan asked if the Plasmodium falciparum pathway for isoprenoid biosynthesis was to be found in WikiPathways (related to this blog post about MMV008138). It was not, at the time. But other resources did have it, including the literature (of course), Wikipedia, and the excellent Malaria Parasite Metabolic Pathways resource.

In related news: about a year ago, Patricia Zaandam worked in our group on pathway analysis related to malaria. At the time, we selected human data from ArrayExpress because of the abundance of human pathways in WikiPathways (>600 now, of which the Curated Collection and Reactome Approved are subsets). So, on a weekend where I really needed a break from work and had some time free, I decided to make that pathway. One of the first observations was that you cannot create Plasmodium pathways on WikiPathways yet. Second, we do not have a BridgeDb gene identifier mapping database for this organism either. But that is not needed for drawing the pathway.

So, I am digitizing the pathway from the various sources that I can find; I added MMV008138, and will probably add more malaria drugs and drug leads along the way. The idea of Patricia's project last year was indeed possible drug targets. This resulted in the current outcome (with MMV008138 highlighted in red):

The new NPG journal realized we need high-quality summaries, and they are correct. This is why the periodic table of the elements has been so useful, and why physical laws are expressed as mathematical equations: it puts emphasis on what we think matters. This is also why I believe WikiPathways is so important.

But that's about where the parallel between WikiPathways and NatRevDiseasePrimers ends. The goal of WikiPathways is not just to summarize the knowledge, but to make it manageable. We are talking about data management here. I don't care that much about nice graphics; if we really want to move the science and the industry forward, then we cannot hide behind a knowledge publishing system that doesn't scale and doesn't integrate. That is not the kind of management we need.

New readers of my blog - welcome! - can browse my past writings to read what the publishing industry should have done. I have explored many different solutions, and only a few of them are being picked up. The Nature Publishing Group has repeatedly experimented with new technologies to make the flood of knowledge manageable, and I find it rather disappointing that this editorial does not manage to go beyond nice graphics. I hope the journal will quickly pick up speed, and add the missing machine readability and APIs. Because a new journal is for years, and we really cannot wait another 15 years.

I am not claiming this new journal is not useful, but it could have been so much more.