"What You're Doing Is Rather Desperate"

Self-portrait by Gustave Courbet.
Source: Wikimedia Commons.
One blog that I have been (occasionally but repeatedly) reading for a long time is the What You're Doing Is Rather Desperate blog by Neil Saunders. HT to WoW!ter for pointing me to this nice post where Saunders shows how to calculate the number of entries in PubMed marked as retracted (in two separate ways). The spoiler (you should check his R scripts anyway!): about 3900 retractions.

I have been quite interested in this idea of retractions, and in recent discussions with Chris (Christopher: mostly offline this time, and even some over a beer :) about whether retractions are good or bad (BTW, check his inaugural speech on YouTube). He points out that retractions are not always made for the right reasons and, probably worse, have unintended consequences. An example he gave is a senior researcher with two positions: in one lab someone misbehaved, and that lab did not see any misconduct by this senior researcher; his other affiliation, however, did not agree and fired him.

Five years ago I would have said that any senior researcher on a paper should still understand things in detail, and that if misconduct was found, the senior authors are to blame too. I still believe this is the case; that's what your co-researchers are for. But feeling the pressure of publishing enough and just not having enough time, I do realize I cannot personally reproduce all results my post-docs and collaborators produce. We all know that publishing has made a wrong turn and, yes, I am trying to make it return to a better default.

Why I am interested in retractions
But that is not why I wanted to blog about and discuss Saunders' post. Instead, I want to explain why I am interested in retractions and in another blog you should check out, Retraction Watch. In fact, I am very much looking forward to their database! However, this is not because of the blame game. Not at all.

Instead, I am interested in noise in knowledge. Obviously, because this greatly affects my chemblaics research. In particular, I like to reduce noise, or at the very least take appropriate measures when doing statistics, as we have plenty of means to deal with noise (like cancelling it out). Now, are retractions then an appropriate means to find incorrect, incomplete, or just outright false knowledge? No. But there is nothing better.

There are better approaches: I have long advocated, and still advocate, the Citation Typing Ontology, though I have to admit I am not up to date with David Shotton's work. CiTO allows you to annotate whether a citing paper disagrees with the cited paper, or agrees with it and perhaps even uses its knowledge. It can also annotate a citation as merely being included because the cited paper has some authority (expect many of those for Nature and Science papers).

But we have a long way to go before using CiTO becomes a reality. If interested, please check out the CiteULike support and Shotton's OpenCitations.

What does this mean for databases?
Research depends on data: some you measure, some you get from the literature and, increasingly, from databases. The latter, for example, to compare your own results with other findings. It is indeed helpful that databases provide these two functions:
  1. provide means to find similar experiments
  2. provide a gold standard of true knowledge
These databases take two different approaches: the first will present the data as reported, or even better as raw data (as it came from the machine, unprocessed, though increasingly the machines already do processing to the best of their knowledge); the second will filter out true facts, possibly normalizing the data along the way, e.g. by correcting obvious typing and drawing errors.

Indeed, databases can combine these features, just like PubChem and ChemSpider do for small compounds. PubChem has the explicit approach of providing both the raw input from sources (the substances, with SIDs) and the normalized result (the compounds, with CIDs).

But what if the original paper turns out to be wrong? There are (at least) two questions:
  1. can we detect when a paper turns out wrong?
  2. can we propagate this knowledge into databases?
The first clearly reflects my interest in CiTO and retractions. We must develop means to filter out all the reported facts that turn out to be incorrect. And can we efficiently keep our thousands of databases clean (there are many valid approaches!)? Do retractions matter here? Yes, because research in so-called higher-impact journals is also seeing more retractions (whatever the reason for that correlation is); see this post by Bjoern Brembs.

Where we must be heading
What the community needs to develop in the next few years are approaches for the propagation of knowledge about correct and incorrect knowledge. That is, high-impact knowledge must enter databases quickly, e.g. the exercise myokine irisin, but so must the fact that it was recently shown to very likely not exist, or at least that the original paper most likely measured something else (doi:10.1038/srep08889). Now, this is clearly a high-profile "retraction" of facts that few of us will have missed. Right? And that's where the problem is: the tail of disproven knowledge is very long, and we cannot rely on such facts propagating quickly if we do not use tools to help us. This is one reason why my attention turned to semantic technologies, so that contradictions can be found more easily.

But I am getting rather desperate about all the badly annotated knowledge in databases, and I am also getting desperate about being able to make a change. The research I'm doing may turn out rather desperate.

Ambit.js: JavaScript client library for the eNanoMapper API technical preview

eNanoMapper is past its first year, and an interesting year it has been! I enjoyed the collaboration with all partners very much, and also the Open character of the project. Just check our GitHub repository or our Jenkins build server.

Just this month, the NanoWiki data on Figshare was released, and I just got around to uploading ambit.js to GitHub. This library is still in development and should likewise be considered a technical preview. This JavaScript client library, inspired by Ian Dunlop's ops.js for Open PHACTS, allows visualization of data in the repository in arbitrary web pages, using jQuery and d3.js.

The visualization on the right shows the distribution of nanomaterial types in the preview server at http://data.enanomapper.net/ (based on AMBIT and the OpenTox API), containing various data sets (others by IDEA and NTUA), including the above-mentioned NanoWiki knowledge base that I started in Prof. Fadeel's group at Karolinska Institutet. This set makes up about three quarters of all data, effectively excluding the orange 'nanoparticle' and the blue section due north. You can see I collected mostly data for metal oxides and some carbon nanotubes (though I did not digitize a lot of biological data for those).

But pie charts only work on pi days, so let's quickly look at another application: summarizing the size distribution:

Or what about the zeta potentials? Yes, the suggestion to make it a scatter plot against the pH at which the potential was measured is already under consideration :)

Do you want to learn more? Get in contact!

Google Code closing down: semanticchemistry and chemojava

Google Code is closing down. I had a few projects running there and participated in semanticchemistry, which hosts the CHEMINF ontology (doi:10.1371/journal.pone.0025513). There is also ChemoJava, an extension of the CDK with GPL-licensed bits.

Fortunately, they have an exporter which automates the migration of a project to GitHub, and this is now in progress. Because the exporter is seeing a heavy load, they warn about export times of up to twelve hours! The "0% complete" is promising, however :)

For the semanticchemistry project, I have asked the other people involved where we want to host it, as GitHub is just one of the options. A copy does not hurt anyway, but the one I am currently making may very well not be the new project location.

PubMed Commons
When you migrate your own projects and you have published work referring to the Google Code project page, please also leave a comment on PubMed Commons pointing readers to the new project page!

Migrating to CDK 1.5: SSSR and two other ring sets

I arrived a bit early at the year 1 eNanoMapper meeting so that I could sit down with Nina Jeliazkova to work on migrating AMBIT2 to CDK 1.5. One of the big new things is the detection of ring systems, which John gave a major overhaul. Part of that is that the SSSRFinder is now retired to the legacy module. I haven't even gone into all the new goodies of this myself, but one of the things we needed to look at is how to replace the use of SSSRFinder for finding the SSSR set. So, I checked whether I understood the replacement code by taking the SSSRFinderTest and replacing the code in those tests with the new code, and all seemed to work as I expected.

So, here are a few examples of code snippets you need to replace. Previously, your code would have something like:

IRingSet ringSet = new SSSRFinder(someAtomContainer).findSSSR();

This code is replaced with:

IRingSet ringSet = Cycles.sssr(someAtomContainer).toRingSet();

Similarly, you could have had something like:

IRingSet ringSetEssential =
  new SSSRFinder(buckyball).findEssentialRings();
IRingSet ringSetRelevant =
  new SSSRFinder(buckyball).findRelevantRings();

This is changed to:

IRingSet ringSetEssential = Cycles.essential(buckyball).toRingSet();
IRingSet ringSetRelevant = Cycles.relevant(buckyball).toRingSet();

Programming in the Life Sciences #21: 2014 Screenshots #1

December saw the end of this year's PRA3006 course (aka #mcspils). Time to blog some screenshots of the student projects. Like last year, the aim was to use the Open PHACTS API to collect data with ops.js, which should then be visualized in an HTML page, preferably with d3.js. This year, all projects reached that goal.

ACE inhibitors
The first team (Mischa-Alexander and Hamza) focused on the ACE inhibitors (type:"drug class") and WP554 from WikiPathways. They use a tree structure to list inhibitors along with their activity:

The source code for this project is available under an MIT license.

The second team (Catherine and Moritz) looked at compounds hitting diabetes mellitus targets. They take advantage of the new disease API methods and first ask for all targets for the disease, and then query for all compounds. Mind you, the compounds are not filtered by activity, so it mostly shows interactions rather than confirmed targets.

This project too is available under the MIT license.

The third project (Nadia and Loic) also goes from disease to targets; they looked at tuberculosis.

Asynchronous calls
If you know ops.js, d3.js, and JavaScript a bit, you know that these projects are not trivial. The remote web service calls are made in an asynchronous manner: each call comes with a callback function that gets called when the server returns an answer, at some future point in time. Therefore, if you want to visualize, for example, compounds with activities against targets for a particular disease, you need two web service calls, with the second made in the callback function of the first call. Now, try to globally collect the data from that with JavaScript and HTML, and make sure to call the visualization code only when all information is collected! But even without that, the students need to convert the returned web service answer into a format that d3.js can handle. In short: quite a challenge for six practical days!
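To make the pattern concrete, here is a minimal sketch of that nested-callback structure. The two service functions are stand-ins that return canned data immediately: a real project would make the corresponding ops.js calls, so all names below are illustrative, not the actual API.

```javascript
// Stand-in for the first web service call (disease -> targets);
// a real implementation would do an asynchronous HTTP request.
function getTargetsForDisease(disease, callback) {
  callback(["targetA", "targetB"]);
}

// Stand-in for the second call (target -> compounds).
function getCompoundsForTarget(target, callback) {
  callback([target + "-compound1", target + "-compound2"]);
}

var collected = []; // globally collected data
var pending = 0;    // calls still waiting for an answer

function visualize(data) {
  // this is where the d3.js drawing code would go
  console.log("visualizing " + data.length + " compounds");
}

getTargetsForDisease("diabetes mellitus", function (targets) {
  pending = targets.length;
  targets.forEach(function (target) {
    // the second call is made inside the callback of the first
    getCompoundsForTarget(target, function (compounds) {
      collected = collected.concat(compounds);
      pending = pending - 1;
      if (pending === 0) {
        // only visualize once every callback has returned
        visualize(collected);
      }
    });
  });
});
```

The pending counter is the key trick: it makes sure visualize() runs only after the last answer has arrived, regardless of the order in which the callbacks fire.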

"Royal Society of Chemistry grants journals access to Wikipedia Editors"

The Royal Society of Chemistry and Wikipedia just released an interesting press release:
    "The Royal Society of Chemistry has announced that it is donating 100 “RSC Gold” accounts – the complete portfolio of their journals and databases – to be used by Wikipedia editors who write about chemistry. The partnership is part of a wider collaboration between the Society’s members and staff, Wikimedia UK and the Wikimedia community. The collaboration is working to improve the coverage of chemistry-related topics on Wikipedia and its sister projects."
This leaves me with a lot of questions. I asked these in a comment awaiting moderation:
    Can you elaborate on the conditions? Is it limited to wikipedia.org or does it extend to other Wikimedia projects, like Wikidata? Does the agreement allow manual lookup of information only, or does it allow text mining on the literature as well as on the database? How should I put this in perspective with the UK law that allows text mining, and, in particular, can UK Wikipedia editors use text mining anyway, or is that restricted? Is there an overview of the details of what is allowed and not allowed, or a list of restrictions otherwise?
Details on how to apply for access can be found here.

Data sharing

When you want to share data with others, the first thing you need to decide is under what terms. Do you want others to be able to just look at the data, or also to change the format (Excel to CSV, for example)? Do you want them to be allowed to share the data within their research group, or perhaps with collaborators?

Very often this is arranged informally, in good faith, or by some consortium, confidentiality, or non-disclosure agreement. However, these approaches do not scale very well. When this matters, data licenses are an alternative approach (not necessarily a better one!).

Madeleine Gray has written some words on her EUDAT blog about the EUDAT License Wizard. This wizard talks you through the things you would like to agree on, and in the end suggests a possible data license. It seems pretty well done, and the first few questions focus on an important aspect: do you even have the rights to change the license (i.e., are you the copyright owner)?

Mind you, there are huge differences between countries around data copyright.

Nature publications and the ReadCube: see but no touch. Or...

Altmetric.com score for the news
item by Van Noorden (see text).
The big (non-science) news this week was the announcement that papers from a wide selection of Nature Publishing Group (NPG) journals can now be shared, allowing others without a subscription to read the paper (press release, news item). That is not Open Access to me (that requires the right to modify and redistribute modifications), but it does remove the pay-wall and therefore speeds up dissemination. It depends on your perspective whether this news is good or bad. I would rather see more NPG journals go Open Access, like Nature Communications. But I have gotten used to the publishing industry moving slowly.

Thanks to Altmetric.com (a sister company of ReadCube) and the DOI we can easily find all blog discussion around the news item by Van Noorden (doi:10.1038/nature.2014.16460). From the point of view of the ideal of free, open dissemination of scientific knowledge, the product is limited:
Of course, it is a free boon, and from that perspective this is a welcome move:
And it is a welcome move! We have all been arguing that there are so many people unable to access the closed-access papers, including politicians, SMEs, people from third-world countries, and people with a deadly illness. These people can now access these Nature papers. Sort of. Because you still need a subscription to get one of these ReadCube links to read the paper without a pay-wall. It lowers the barrier, but the barrier is not entirely gone yet.

Unless people with a subscription start sharing these links, as they are invited to. At this moment it is not at all clear when and how these links may be shared. Now, this is an interesting approach: in most jurisdictions you are allowed to link to copyrighted material, but the user agreement (UA) can impose additional restrictions. When and how this UA applies is unclear.

For example, Ross Mounce suggested to use PubMed Commons. Triggered by that idea, I experimented with the new offering and added a link on CiteULike. I asked if this is acceptable use, and it turned out to be unclear at this moment:
This tweet also shows the state of things, and I congratulate NPG on this experiment. Few established publishing companies seem willing to make these kinds of significant steps! Changing the nature of your business is hard, and NPG trying to find a route to Open Science that doesn't kill the company is something I welcome, even if the road is still long!

A nice example of how long this road is, is the claim about "read-only". I think this is bogus. ReadCube has done a marvelous job at rewriting the PDF viewer and manages to render a paper fully in JavaScript. That's worth congratulations! It depends on strong hardware and a modern browser; older browsers with older JavaScript engines will not be able to do the job (try opening a page with KDE's Konqueror). But clearly it is your browser that does the rendering. Therefore:
  1. you do download the paper
  2. you can copy the content (and locally save it)
  3. you can print the content
Because the content is sent to your browser. It only takes basic HTML/JavaScript skills to notice this. And the comment that you cannot print it?? Duh, never heard of screenshots?! Oh, and you will also notice a good amount of tracking information. I noted previously that it knows at least which institute originally shared the ReadCube link, possibly in more detail. Others have concerns too.

Importantly, it is only a matter of time before someone clever with too much time on their hands (fortunately for ReadCube, scientists are too busy writing grant applications) uses the techniques this ReadCube application uses and renders the paper into something else, say a PDF, to make the printing process a bit easier.

Of course, the question is: will that be allowed? And that is what matters. In the next weeks we will learn what NPG allows us to do and, therefore, what their current position on Open Science is. These are exciting times!

Programming in the Life Sciences #20: extracting data from JSON

I previously wrote about the JavaScript Object Notation (JSON) which has become a de facto standard for sharing data by web services. I personally still prefer something using the Resource Description Framework (RDF) because of its clear link to ontologies, but perhaps JSON-LD combines the best of both worlds.

The Open PHACTS API supports various formats, and JSON is the default format used by the ops.js library. However, the information returned by the Open PHACTS cache is complex and generally includes more than you want to use in the next step. Therefore, you need to extract data from the JSON document, which was not covered in posts #10 and #11.

Let's start with the example JSON given in that post, and let's consider this is the value of a variable with the name jsonData:

    "id": 1,
    "name": "Foo",
    "price": 123,
    "tags": [ "Bar", "Eek" ],
    "stock": {
        "warehouse": 300,
        "retail": 20

We can see that this JSON value starts with a map-like structure. We can also see that there is a list embedded, and another map. I guess that one of the reasons why JSON has taken off is how well it integrates with the JavaScript language: selecting content can be done in terms of core language features, unlike, for example, the XPath statements needed for XML or SPARQL for RDF content. This is because the notation just follows the core data types of JavaScript, and data is stored as native data types and objects.

For example, to get the price value from the above JSON code, we use:

var price = jsonData.price;

Or, if we want to get the first value in the Bar-Eek list, we use:

var tag = jsonData.tags[0];

Or, if we want to inspect the warehouse stock:

var inStock = jsonData.stock.warehouse;
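Putting these snippets together gives a small self-contained example; the jsonData literal simply repeats the example document from above:

```javascript
// the example JSON document, assigned to a variable
var jsonData = {
  "id": 1,
  "name": "Foo",
  "price": 123,
  "tags": [ "Bar", "Eek" ],
  "stock": {
    "warehouse": 300,
    "retail": 20
  }
};

var price = jsonData.price;              // 123
var tag = jsonData.tags[0];              // "Bar"
var inStock = jsonData.stock.warehouse;  // 300

// lists and maps combine with core language features, for
// example summing the total stock over the embedded map:
var totalStock = jsonData.stock.warehouse + jsonData.stock.retail;
```

Paste this into the browser console mentioned later in this post and you can explore the object interactively.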

Now, the JSON returned by the Open PHACTS API has a lot more information. This is why the online, interactive documentation is so helpful: it shows the JSON. In fact, given that JSON is so widely used, there are many online tools that help you, such as jsoneditoronline.org (yes, it will show error messages if the syntax is wrong):

BTW, I also recommend installing a JSON viewer extension for Chrome or for Firefox. Once you have installed this extension, you can not only read the JSON on Open PHACTS' interactive documentation page, but also open the Request URL in a separate browser window. Just copy/paste the URL from this output:

And with a JSON viewing extension, opening this https://beta.openphacts.org/1.3/pathways/... URL in your browser window will look something like:

And because these extensions typically use syntax highlighting, it is easier to understand how to access information from within your JavaScript code. For example, if we want the number of pathways in which the compound testosterone (the link is the ConceptWiki URL in the above example) is found, we can use this code:

var pathwayCount = jsonData.result.primaryTopic.pathway_count;
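Such deeply nested paths fail with a TypeError as soon as an intermediate field is missing, so it can pay off to guard the access. The literal below is a stripped-down stand-in for the response; the pathway_count value of 12 is made up for illustration:

```javascript
// a stand-in for the Open PHACTS answer; the real response
// contains many more fields, and 12 is an invented value
var jsonData = {
  "result": {
    "primaryTopic": {
      "pathway_count": 12
    }
  }
};

var pathwayCount = 0; // sensible default when the field is absent
if (jsonData.result && jsonData.result.primaryTopic) {
  pathwayCount = jsonData.result.primaryTopic.pathway_count;
}
```

Without the if-guard, a missing result or primaryTopic field would abort the script with exactly the kind of error discussed in the debugging post below.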

Programming in the Life Sciences #19: debugging

Debugging is the process of finding and removing a fault in your code (the etymology goes further back than the moth story, I learned today). Being able to debug is an essential programming skill, and being able to program flawlessly is not enough; the bug can be outside your own code. (... there is much that can be written up about module interactions, APIs, documentation, etc., that lead to malfunctioning code ...)

While there are full debugging tools, the task of finding where the bug is can often be achieved with simpler means:

  1. take notice of error messages
  2. add debug statements in your code
Error messages
Paying attention to error messages is the first starting point. This skill is almost an art: it requires having seen enough of them to understand how to interpret them. I guess error messages are among the worst-developed aspects of programming languages, and I do not frequently see programming language tutorials that discuss error messages. The field can certainly improve here.

However, error messages in general at least give an indication of where the problem occurs, often by a line number, though this number is not always accurate. An underlying cause is that if there is a problem in the code, it is not always clear what the problem is. For example, if there is a closing (or opening) bracket missing somewhere, how can the compiler decide what the author of the code meant? Web browsers like Firefox/Iceweasel and Chrome (Ctrl-Shift-J) have a console that displays compiler errors and warnings:

Another issue is that error messages can be cryptic and misleading. For example, the above error message "TypeError: searcher.bytag is not a function example1.html:73" is confusing for a starting programmer. Surely the source code calls searcher.bytag(), which definitely is a function. So why does the compiler say it is not?? The bug here, of course, is that the function called in the source code is not found: it should be byTag().

But this bug can at least be detected during interpretation and execution of the code. That is, it is clear to the compiler that it doesn't know how to handle the code. Another common problem is the situation where the code looks fine (to the compiler), but the data it handles makes the code break down. For example, a variable doesn't have the expected value, leading to errors (e.g. null pointer-style). Therefore, understanding the variable values at a particular point in your code can be of great use.

Console output
A simple way to inspect the content of a variable is to use the console visible in the above screenshot. Many programming languages have their own call to send output there: Java has System.out.println() and JavaScript has console.log().

Thus, if you have some complex bit of code with multiple for-loops, if-else statements, etc, this can be used to see if some part of your code that you expect to be called really is:

console.log("He, I'm here!");

This can be very useful when using asynchronous web service calls! Similarly, see what the value of some variable is:

var label = jsonResponse.items[i].prefLabel;
console.log("label: " + label);

Also, because JavaScript is not a strongly typed programming language, I frequently find myself inspecting the data type of a variable:

var label = jsonResponse.items[i].prefLabel;
console.log("typeof label: " + typeof(label));

These tools are very useful for finding the location of a bug. And this matters. Yesterday I was trying to use the histogram code in example6.html to visualize a set of values with negative numbers (zeta potentials of nanomaterials, to be precise) and I was debugging the issue, trying to find where my code went wrong. I used the above approaches, and the array of values looked in order, though different from the original example. But still the histogram was not showing up. Well, after hours, having asked someone else to look at the code too, and having ruled out many alternatives, she pointed out that the problem was not in the JavaScript part of the code, but in the HTML: I was mixing up how default JavaScript and the d3.js library add SVG content to the HTML document model. That is, I was using <div id="chart">, which works with document.getElementById("chart").innerHTML, but needed to use <div class="chart"> with the d3.select(".chart").innerHTML code I was using later.

OK, that bug was mine. However, it still was not working: I did see a histogram, but it didn't look good. Again debugging, and again after much too long, I found out that this was a bug in the d3.js code that makes it impossible to use their histogram example code for negative values. Again, once I knew where the bug was, I could Google it and quickly found the solution on StackOverflow.

So, at the top level, the workflow of debugging looks like this:
  1. find where the problem is
  2. try to solve the problem

Happy debugging!