Pimped website: HTML5, still with RDFa, restructuring and a slidebar!

My son did some HTML, CSS, JavaScript, and jQuery courses at Codecademy recently. Good for me: he pimped my personal website:


Of course, he used GitHub and pull requests (he had been using git for a few years already). His work:

  • fixed the columns to properly resize
  • added a section with my latest tweets
  • added menus for easier navigation of the information
  • made sections fold and unfold (most are now folded by default)
  • added a slide bar, which I use to highlight some recent output
Myself, I upgraded the website to HTML5. It used to be XHTML, but it seems XHTML+RDFa is not really established yet; or, at least, there is no good validator. So, it's now HTML5+RDFa (validation report; currently one bug). Furthermore, I updated the content and gave the first few collaborators ORCID ids, which are now linked as owl:sameAs in the RDF to the foaf:Person (RDF triples extracted from this page).

Linking papers to database to papers: PubMed Commons and Ferret.ai

I argued earlier this year (doi:10.5281/zenodo.17892) in the Journal of Brief Ideas that measuring reuse of data and/or results in databases is a good measure of impact of that research. Who knows, it may even beat the citation count, which does not measure quality or correctness of data (e.g. you may cite a paper because you disagree with the content; I have long advocated, and still advocate, the Citation Typing Ontology).

But making the link between databases and papers does not only benefit measuring reuse, it is also just critical for doing research. Without clear links, finding answers is hard. I experience that myself frequently, and so do others, like Christopher Southan, and it puzzles me that so few people worry about this. Of course, databases do a good part of the linking, but unless they expose an API (still rare, but upcoming), it is hard to use these links. PubMed Commons can be used to link to a (machine-readable) version of the data in a paper. See, for example, these four comments by me.

Better is when the database provides an API. And that is used by Ferret. I have no idea where this project is going; it does not seem to be Open Source, and I am not entirely sure how they implemented the history, but the idea is interesting. Not novel, as UtopiaDocs does a similar thing. The difference is that Ferret is not a PDF reader, but works directly in your Chrome browser. That makes it more powerful, but also more scary, which is why it is critical they send a clear message about any involvement of Ferret servers, or whether everything is done locally (otherwise they can forget about (pharma) company uptake, and they'd have a hard time restoring trust). That said, their privacy policy document is already quite informative!

Last week, I asked them about their tool and if it was hard to add databases, as that is one thing Ferret does: if you open it up for a paper, it will show the databases that cite that paper (and thus likely have information or data from that paper, e.g. supplementary information). Here's an example:


This screenshot shows the results for a nanotoxicity paper and we see it picked up "titanium oxide" (accurately picking up actual nanomaterials or nanoparticles is an unsolved text mining issue). We get some impact statistics, but if you read my blog and my brief idea about capturing reuse, I think they got "impact" wrong. Anyway, they do have a knowledge graph section, which has the paper-database links, and Ferret found this paper cited in UniProt.

Thus, when I asked them if it would be hard to add new databases to that section, and I mentioned Open PHACTS and WikiPathways, they replied. In fact, within hours they told me they had found the WikiPathways SPARQL endpoint that Andra started, which they find easier to use than the WikiPathways webservices :) They asked me for a webpage to point users to, and while I was thinking about that, they found another WikiPathways trick I did not know about: you can browse for WP2371 OR WP2059. Tina then replied that, given a PubMed ID, there was an even nicer way: just browse for all pathways with a particular PubMed ID.
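As a rough sketch of what a PubMed-ID-based lookup against a SPARQL endpoint could look like, the Python snippet below only builds the query text and the request URL; it does not contact any server. The endpoint URL, the wp: and dcterms: terms, the identifiers.org IRI pattern, and the helper names are assumptions made for illustration and should be checked against the current WikiPathways RDF documentation. The PubMed ID is a placeholder.

```python
from urllib.parse import urlencode

# Assumed endpoint location; verify before use.
ENDPOINT = "http://sparql.wikipathways.org/"


def pathways_for_pubmed_query(pmid: str) -> str:
    """Build a SPARQL query listing pathways that cite the given PubMed ID.

    The vocabulary used here (wp:Pathway, dcterms:isPartOf, dc:title) is an
    assumption about how the WikiPathways RDF models literature references.
    """
    if not pmid.isdigit():
        raise ValueError("PubMed IDs are expected to be numeric")
    return f"""
PREFIX wp: <http://vocabularies.wikipathways.org/wp#>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX dc: <http://purl.org/dc/elements/1.1/>

SELECT DISTINCT ?pathway ?title WHERE {{
  <http://identifiers.org/pubmed/{pmid}> dcterms:isPartOf ?pathway .
  ?pathway a wp:Pathway ;
           dc:title ?title .
}}
""".strip()


def request_url(pmid: str) -> str:
    """Return a GET URL that would submit the query to the assumed endpoint."""
    return ENDPOINT + "?" + urlencode({"query": pathways_for_pubmed_query(pmid)})


query = pathways_for_pubmed_query("12345678")  # placeholder PubMed ID
print(query)
```

Keeping query construction separate from the HTTP call makes it easy to inspect the query by hand before sending it anywhere.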

Well, a bit later, they released Ferret 0.4.2 with WikiPathways support. The screenshot below shows the output for a paper (doi:10.2174/1389200214666131118234138) by Rianne (who did internships in our group, and now does her PhD in toxicology):


The Ferret infobar shows seventeen WikiPathways that are linked to this paper, which happens to be the collection that Rianne made during her internship leading to this paper, and uploaded to WikiPathways some months ago. Earlier this year we sat down with her, Freddie, and Linda to make them more machine readable. This is what this list looks like in the browse functionality:


Ferret version 0.4.2 did not work for me, but they fixed the issue within a day, and the above screenshot was made with version 0.4.3. So, besides being a bunch of good hackers, they also seem to listen to their customers. So, what databases do you feel they should add? Leave a comment here, or tweet them at @getferret (pls cc me).

Willighagen, E., Capturing reuse in altmetrics. J. Brief Ideas. May 2015. URL http://dx.doi.org/10.5281/zenodo.17892
Fijten, R. R. R., Jennen, D. G. J., Delft, Dec. 2013. Pathways for ligand activated nuclear receptors to unravel the genomic responses induced by hepatotoxicants. Current Drug Metabolism, 1022-1028. URL http://dx.doi.org/10.2174/1389200214666131118234138

Journal of Brief Ideas: an excellent idea!

Journals, in the past, published what researchers wanted to talk about. That is what dissemination is about, of course. Like everything, over time, the process becomes more restricted and more bureaucratic. All for quality, of course. To accommodate and formalize the diversity of scientific communication, many journals have different article types. Letters to the Editor, Brief Communications, etc. Posting a brief idea, however, is for many journals not of enough interest.

Hence, a niche for the Journal of Brief Ideas. It's a project in beta, and may never find sustainability, but it is worth a try:


I can see why this may work:
  • they teamed up with ZENODO to provide DOIs
  • you log in with your ORCID
  • it is Open Access (CC-BY)
  • it fills a niche: ideas you will not test otherwise never see the light of day (so, this journal will contribute to more efficient scholarly communication)
I can also see why it may not work:
  • it is too easy to post an idea, leading to too much noise
  • it will not be indexed and therefore not fulfill a key requirement for many scientists (WoS, etc)
  • you cannot add references like with papers
I can also see some features I would love to see:
  • bookmarking buttons for CiteULike, Mendeley, etc
  • #altmetrics output on this site
  • provide #altmetrics from this site (view statistics, etc)
  • integrate with peer review sites (for post-publication peer review)
  • allow annotation of entities in papers (like PDB, gene, protein codes, metabolite identifiers, etc; and whatever else for other scholarly domains)
Things I am not sure about:
  • allow a single ToC-like graphic (as they will give papers more coverage and more impact)
Anyway, what it needs now is momentum. It needs a business model, even if the turnover can be kept low because of good choices of technology. I am looking forward to seeing where the team is going, and how the community will pick up this idea. (For example, although I know that some ideas are tweeted, I haven't found a donut from Altmetric.com for one of the idea DOIs yet.)

For my readers, please give it a try. You know you have that idea you would like to get some feedback on, but you know you will not have funding for it, and it does not really match your general research plans. It would be a shame to let that idea rot on the shelf. Get it out, get cited!

I tried it too, see below my brief idea as found on ZENODO (where they automatically get deposited), and my experiences are a bit mixed. I like the idea, but it also takes getting used to. The number of words is limited, and I really find it awkward not to cite prior art, the things I built on. The above points reflect a good deal of my reservations.


Internet-aided serendipity in science (was: How the Internet can help chemists with serendipity)

The ACS Central Science RSS feed in Feedly.
Finding new or useful knowledge to solve your scientific problem, question, etc, is key to research. It also is what struck me as a university student as so badly organized (mid-nineties). In fact, technologically there was no issue, so why are scientists not using these technologies then?? This question is still relevant, and readers of this blog know this is a toy research area to me, and I have previously experimented with a lot of technologies to see how they can support research, and, well, basically, serendipity. Hence, internet-aided serendipity.

This happened to be the topic of an article by Prof. Bertozzi (@CarolynBertozzi), editor-in-chief of the gold Open Access ACS Central Science: How the Internet can help chemists with serendipity, part of the internet.cen.org website. I left a comment, which is currently awaiting moderation, but to keep the discussion on Twitter going, here is what I left (the comment on the article may turn out to have lost the formatting still present here):
    Dear Prof Bertozzi,

    the browsing of TOCs is not a lost art, and neither has the Internet solved everything. Where I fully agree that Twitter and other social media have filled a niche in finding interesting literature, it is basically kind of a majority vote and does not really find you the papers interesting to your research. This has to extend, of course, to #altmetrics, which capture the attention on social media and allows creating TOCs on the fly, as do (good) paper bookmarking services like CiteULike (see http://www.citeulike.org/citegeist?days=7). Similarly, people developed tools to find science in blog posts, like the no longer existing Postgenomic.com, continued/forked as Chemical blogspace (see http://cb.openmolecules.net/inchis.php, but consider this code has not been updated in the past 2-3 years). So, creating cross-journal TOCs is a daily habit for many of us still. (BTW, will ACS Central Science fully adopt #altmetrics, as data provider as well as showing #altmetrics on the website?)

    Returning to the single journal TOCs. Here, RSS feeds have shown to be critical, happy to find a RSS feed for ACS Central Science (http://feeds.feedburner.com/acs/acscii). It is good to see that the journal's RSS feed for the ASAP papers contains for each paper the title, authors, the TOC image, and the DOI (possibly, it could also include the abstract and ORCIDs of the authors). Better, it should adopt CMLRSS and include InChIs, MDL molfiles, or SMILES of the chemical compounds discussed in that paper (see this ACS JCIM paper: http://dx.doi.org/10.1021/ci034244p). With proper adoption of CMLRSS, chemists could define substructures and be alerted when papers would be published containing chemicals with that substructure (and it does not have to stop there, as cheminformatically it is trivial to extend this to chemical reactions, or any other chemistry). After all, we don't want to miss the chemistry that sparks our inspiration!

    I personally keep track of a number of journals via RSS feeds which I aggregate in Feedly, which filled the gap after GoogleReader was closed down. Feedly does not support CMLRSS (unfortunately, but I have other tools for that) and there are a few alternatives.

    So, I hope the ACS Central Science journal will pick up your challenge and continue to support modern (well, CMLRSS was published in 2004) technologies to support your past workflows! For example, make the link to the ACS Central Science RSS feed more prominent, and write an editorial about how to use it with, for example, Feedly.

    Egon
    Maastricht University
    The Netherlands
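The feed-based alerting described in the comment can be sketched in a few lines of Python. This is a minimal illustration only: the feed content below is made up, the DOIs are placeholders, and the alert is a plain keyword match on titles. A real CMLRSS setup would carry chemical structures in the feed and would need a cheminformatics toolkit for substructure matching, which is not attempted here.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 content, for illustration only (not a real journal feed).
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example Journal ASAP</title>
    <item>
      <title>An example paper on quinones</title>
      <link>http://dx.doi.org/10.9999/example.0001</link>
    </item>
    <item>
      <title>Another example paper</title>
      <link>http://dx.doi.org/10.9999/example.0002</link>
    </item>
  </channel>
</rss>"""

# Loose pattern for picking a DOI out of a link; good enough for a sketch.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/\S+")


def feed_entries(xml_text):
    """Return a list of {'title', 'doi'} dicts, one per <item> in the feed."""
    root = ET.fromstring(xml_text)
    entries = []
    for item in root.iter("item"):
        title = item.findtext("title", default="").strip()
        link = item.findtext("link", default="").strip()
        match = DOI_PATTERN.search(link)
        entries.append({"title": title, "doi": match.group(0) if match else None})
    return entries


def alerts(entries, keyword):
    """Keep the entries whose title mentions the keyword (case-insensitive)."""
    return [e for e in entries if keyword.lower() in e["title"].lower()]


entries = feed_entries(SAMPLE_FEED)
for entry in alerts(entries, "quinone"):
    print(entry["doi"], entry["title"])
```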
Of course, there is a lot more. It should not surprise you that I see the adoption of PDF and ReadCube as killing internet-aided serendipity, where HTML+RDF, microformats, schema.org, etc would in fact enable serendipity. Chemistry publishers do not particularly have a track record in enabling the kind of serendipity Prof. Bertozzi is looking for. Good thing is that as editor-in-chief of an ACS journal, she can restore this serendipity and I kindly invite her to the Blue Obelisk community to discuss how all the technologies that have been developed in the past 15 years can help chemists. Because we have plenty of ideas. (And where is that website again aggregating chemistry journal RSS feeds...?)

Or, just browse the posts in this blog, where I have frequently written about innovation with publishers (in general; some do better than others).

Update: Other perspectives

WikiPathways and two estrone-x,y-quinones added to Wikidata

WikiPathways does a lot of curation, with a team growing in size. A number of regular jobs is performed weekly by one of a group of some 15-20 curators. On top of that, some curators do much more than this weekly task, e.g. Kristina Haspers. Since I joined the BiGCaT team of Chris Evelo in Maastricht, I have been looking into the metabolites and other small molecules, and did quite a bit of work to make that information machine readable. See, for example, these open notebook science posts.

This curation is partly supported by tools, e.g. bots and tests. Tests are, among others, being run nightly on a Jenkins instance (in various configurations). One of the bots creates this report, which Martina Kutmon recently reminded me of. Starting at the end of that, I started browsing it for unrecognized metabolites (for various reasons). My eyes fell on two compounds in the estrogen metabolism pathway, originally created by Pieter Giesbertz: estrone-2,3-quinone and estrone-3,4-quinone (in green):


The website was not showing mappings to other databases for the cross-references from PubChem. A quick check confirmed that HMDB, KEGG and ChEBI did not have this compound. HMDB has an entry for one of the compounds, given the name, but the chemical graph has undefined stereochemistry. That certainly explains why it did not map to the PubChem compound ID. And, indeed, PubChem does have the HMDB entry as a substance, but not linked to a compound. So, I added them to Wikidata: Q20739847 and Q20742851.


Then, when I make the next metabolite ID mapping database for BridgeDb, it will have mappings between the cross-references in WikiPathways for these two compounds to, at the time of writing, ChemSpider, and to the CAS registry number of one of the two. Please also note that Wikidata allowed me to store the information source.

Thus, for me, Wikidata is the place to add new mappings, and I herald work by Andra Waagmeester, Andrew Su, and others to use Wikidata for this kind of purpose. If you agree, you can add your support here.

PubChemRDF: semantic web access to PubChem data

Gang Fu and Evan Bolton have blogged about it previously, but their PubChemRDF paper is out now (doi:10.1186/s13321-015-0084-4). It very likely defines the largest collection of RDF triples using the CHEMINF ontology and I congratulate the authors on an increasingly powerful PubChem database.

With this major provider of Linked Open Data for chemistry now published, I should soon see where my Isbjørn stands. The release of this publication is also very timely with respect to the CHEMINF ontology, as I last week finished a transition from Google to GitHub, by moving the important wiki pages, including one about "Where is the CHEMINF ontology used?". I already added Gang's paper. A big thanks and congratulations to the PubChem team and my sincere thanks for having been able to contribute to this paper.

CDK Literature #9

Visualization of functional groups.
Public domain from Wikipedia.
In the past 50 years we have been trying to understand why certain classes of compounds show the same behavior. Quantum chemical calculations are still getting cheaper and easier (though, I cannot point you to a review of recent advances), but they have not replaced other approaches, as is visible in the number of QSAR/descriptor applications of the CDK.

Functional Group Ontology
Sankar et al. have developed an ontology for functional groups (doi:10.1016/j.jmgm.2013.04.003). One popular thought is that subgroups of atoms are more important than the molecule as a whole. Much of our cheminformatics is based on this idea. And it matches what we experimentally observe. If we add a hydroxyl or an acid group, the molecule becomes more hydrophilic. Semantically encoding this clearly important information seems important, though intuitively I would have left this to the cheminformatics tools. This paper and a few cited papers, however, show how far you can take this. It organizes more than 200 functional groups, but I am not sure where the ontology can be downloaded.

Sankar, P., Krief, A., Vijayasarathi, D., Jun. 2013. A conceptual basis to encode and detect organic functional groups in XML. Journal of Molecular Graphics and Modelling 43, 1-10. URL http://dx.doi.org/10.1016/j.jmgm.2013.04.003

Linking biological to chemical similarities
If we step aside from our concept of "functional group", we can also just look at whatever is similar between molecules. Michael Kuhn et al. (of STITCH and SIDER) looked into the role of individual proteins in side effects (doi:10.1038/msb.2013.10). They find that many drug side effects are mediated by a selection of individual proteins. The study uses a drug-target interaction data set, and to reduce the chance of bias due to some compound classes being more extensively studied (more data), they removed too similar compounds from the data set, using the CDK's Tanimoto stack.

Kuhn, M., Al Banchaabouchi, M., Campillos, M., Jensen, L. J., Gross, C., Gavin, A. C., Bork, P., Apr. 2014. Systematic identification of proteins that elicit drug side effects. Molecular Systems Biology 9 (1), 663. URL http://dx.doi.org/10.1038/msb.2013.10
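To make the "remove too similar compounds" step concrete, here is a small Python sketch of the Tanimoto coefficient, |A ∩ B| / |A ∪ B|, applied to fingerprints represented as sets of on-bits. The fingerprints, the 0.7 threshold, and the greedy keep-first strategy are assumptions chosen for illustration; they are not the settings or the procedure used in the paper, and the CDK provides its own fingerprint and Tanimoto implementations.

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto coefficient of two fingerprints given as sets of on-bits."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(a | b)


def remove_similar(fingerprints: dict, threshold: float) -> list:
    """Greedily keep compounds that are below the threshold to all kept ones."""
    kept = []
    for name, bits in fingerprints.items():
        if all(tanimoto(bits, fingerprints[k]) < threshold for k in kept):
            kept.append(name)
    return kept


# Made-up fingerprints: drug_b shares 4 of 5 bits with drug_a (Tanimoto 0.8).
fps = {
    "drug_a": {1, 4, 7, 9},
    "drug_b": {1, 4, 7, 9, 12},
    "drug_c": {2, 3, 15},
}

kept = remove_similar(fps, threshold=0.7)
print(kept)
```

With these inputs, drug_b is dropped because it is too close to drug_a, while the unrelated drug_c is kept.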

Drug-induced liver injury
These approaches can also be used to study if there are structural reasons why drug-induced liver injury (DILI) occurs. This was studied by Zhu et al., where the CDK is used to calculate topological descriptors (doi:10.1002/jat.2879). They compared explanatory models that correlate descriptors with the measured endpoint and a combination with hepatocyte imaging assay technology (HIAT) descriptors. These descriptors capture phenotypes such as nuclei count, nuclei area, reactive oxygen species intensity, tetramethyl rhodamine methyl ester, lipid intensity, and glutathione. It doesn't cite any of the CDK papers, so I left a comment with PubMed Commons.

Zhu, X.-W., Sedykh, A., Liu, S.-S., Mar. 2014. Hybrid in silico models for drug-induced liver injury using chemical descriptors and in vitro cell-imaging information. Journal of Applied Toxicology 34 (3), 281-288. URL http://dx.doi.org/10.1002/jat.2879

PubMed Commons: comments, pointers, questions, etc

I could have sworn I had blogged about this already, but cannot find it in my blog archives. If you do not know PubMed Commons yet, check it out! As the banner on the right shows, they're in Pilot mode (yeah, why stick to alpha/beta release tagging), and it has already found several uses, as explained in this blog post. Journal clubs are one of them, which they introduced at the end of last year. The pilot started out with giving access to PubMed authors, but since many of us are, that was never really a reason not to give it a try. Comments on PubMed Commons automatically get picked up by other platforms, like PubPeer, and commentators get a profile page; this is mine.

Like the use cases people have adopted - see the above linked blog post - I have found a number of use cases:

  1. additional information:
    1. missing citations (1)
    2. where data can be downloaded (1)
  2. where data from that paper was deposited:
    1. paper figures available in WikiPathways (1,2,3,4)
    2. authors uploaded data/figures to FigShare but the paper does not link it (1)
    3. authors uploaded data/figures to DataDryad but the paper does not link it (1)
  3. me too:
    1. CDK can help (1)
  4. commenting (1) and questions (2)
  5. a closed paper was made gold Open Access (1)
  6. the source code behind that paper moved
    1. from Google Code to GitHub (1)
So, get your account today, and start updating your papers which changed locations. Because we all know the bit rot in website locations in papers. Show PubMed how you like to improve scientific communication via the publishing platform!

CDK Literature #8

Tool validation
The first paper this week is a QSAR paper. In fact, it does some interesting benchmarking of a few tools with a data set of about 6000 compounds. It includes looking into the applicability domain, and studies the error of prediction for compounds inside and outside the chemical space defined by the training set. The paper indirectly uses the CDK descriptor calculation corner, by using EPA's T.E.S.T. toolkit (at least one author, Todd Martin, contributed to the CDK).

Callahan, A., Cruz-Toledo, J., Dumontier, M., Apr. 2013. Ontology-Based querying with Bio2RDF's linked open data. Journal of Biomedical Semantics 4 (Suppl 1), S1+. URL http://dx.doi.org/10.1186/2041-1480-4-s1-s1

Tetranortriterpenoid
Arvind et al. study tetranortriterpenoids using a QSAR approach involving CoMFA and the CPSA descriptor (green OA PDF). The latter CDK descriptor is calculated using Bioclipse. The study finds that using compound classes can improve the regression.

Arvind, K., Anand Solomon, K., Rajan, S. S., Apr. 2013. QSAR studies of tetranortriterpenoid: An analysis through CoMFA and CPSA parameters. Letters in Drug Design & Discovery 10 (5), 427-436. URL http://dx.doi.org/10.2174/1570180811310050010

Accurate monoisotopic masses
Another useful application of the CDK is the Java wrapping of the isotope data in the Blue Obelisk Data Repository (BODR). Mareile Niesser et al. use Rajarshi's rcdk package for R to calculate the differences in accurate monoisotopic masses. They do not cite the CDK directly, but do mention it by name in the text.

Niesser, M., Harder, U., Koletzko, B., Peissner, W., Jun. 2013. Quantification of urinary folate catabolites using liquid chromatography–tandem mass spectrometry. Journal of Chromatography B 929, 116-124. URL http://dx.doi.org/10.1016/j.jchromb.2013.04.008
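For readers unfamiliar with the term, a monoisotopic mass is simply the sum of the masses of the most abundant isotope of each atom in a formula. The Python sketch below illustrates that arithmetic with a tiny hard-coded table of rounded isotope masses. It is not the rcdk or CDK API, the table covers only four elements, and water and glycine are arbitrary examples rather than analytes from the paper.

```python
# Rounded masses (in u) of the most abundant isotopes: 1H, 12C, 14N, 16O.
MONOISOTOPIC_MASS = {
    "H": 1.00782503,
    "C": 12.0,
    "N": 14.00307400,
    "O": 15.99491462,
}


def monoisotopic_mass(formula: dict) -> float:
    """Sum isotope masses for a formula given as {element: atom count}."""
    return sum(MONOISOTOPIC_MASS[element] * count
               for element, count in formula.items())


def mass_difference(formula_a: dict, formula_b: dict) -> float:
    """Difference in monoisotopic mass between two formulas (a minus b)."""
    return monoisotopic_mass(formula_a) - monoisotopic_mass(formula_b)


water = {"H": 2, "O": 1}
glycine = {"C": 2, "H": 5, "N": 1, "O": 2}

print(round(monoisotopic_mass(water), 4))
print(round(monoisotopic_mass(glycine), 4))
print(round(mass_difference(glycine, water), 4))
```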

#metsoc2015 Converting SMILES annotation into InChIKey annotation

One of the questions I had in the hackathon today was about how to use the CDK to convert SMILES strings into InChIs and InChIKeys (see doi:10.1186/1758-2946-5-14). So, here goes. This is the Groovy variant, though you can access the CDK just as well in other programming languages (Python, Java, JavaScript). We'll use the binary jar for CDK 1.5.10. We can then run code, say test.groovy, using the CDK with:

groovy -cp cdk-1.5.10.jar test.groovy

With that out of the way, let's look at the code. Let's assume we start with a text file with one SMILES string on each line, say test.smi, then we parse this file with:

new File("test.smi").eachLine { line ->
  mol = parser.parseSmiles(line)
}

This already parses the SMILES string into a chemical graph. If we pass this to the generator to create an InChIKey, we may get an error, so we do an extra check:

gen = factory.getInChIGenerator(mol)
if (gen.getReturnStatus() == INCHI_RET.OKAY) {
  println gen.inchiKey;
} else {
  println "# error: " + gen.message
}

If we combine these two bits, we get a full test.groovy program:

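// CDK classes for building molecules, parsing SMILES, and generating InChIs
import org.openscience.cdk.silent.*
import org.openscience.cdk.smiles.*
import org.openscience.cdk.inchi.*
import net.sf.jniinchi.INCHI_RET

// set up the SMILES parser and the InChI generator factory once
parser = new SmilesParser(
  SilentChemObjectBuilder.instance
)
factory = InChIGeneratorFactory.instance

// read one SMILES string per line and print its InChIKey
new File("test.smi").eachLine { line ->
  mol = parser.parseSmiles(line)
  gen = factory.getInChIGenerator(mol)
  if (gen.getReturnStatus() == INCHI_RET.OKAY) {
    println gen.inchiKey
  } else {
    // report the problem as a comment line instead of a key
    println "# error: " + gen.message
  }
}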

Update: John May suggested an update, which I quite like. If the result is not 100% okay, but the InChI library gave a warning, it still yields an InChIKey which we can output, along with the warning message. For this, replace the if-else statement with this code:

if (gen.returnStatus == INCHI_RET.OKAY) {
  println gen.inchiKey
} else if (gen.returnStatus == INCHI_RET.WARNING) {
  println gen.inchiKey + " # warning: " + gen.message
} else {
  println "# error: " + gen.message
}
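If the output of the script above is to be processed further, the three line shapes it prints (a bare InChIKey, a key followed by " # warning: ...", or "# error: ...") can be told apart with a few lines of Python. The sketch below is one possible way to do that; the helper name is hypothetical and the warning and error texts in the sample are illustrative, not actual output from a CDK run. The check relies on the standard InChIKey layout of 14, 10, and 1 uppercase letters separated by hyphens.

```python
import re

# Standard InChIKey layout: 14 letters, hyphen, 10 letters, hyphen, 1 letter.
INCHIKEY = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")


def parse_line(line: str) -> dict:
    """Classify one output line of the Groovy script (hypothetical helper)."""
    line = line.strip()
    if line.startswith("# error:"):
        message = line[len("# error:"):].strip()
        return {"key": None, "status": "error", "message": message}
    # Lines with a warning carry the key first, then the warning text.
    key, separator, message = line.partition(" # warning: ")
    if not INCHIKEY.match(key):
        raise ValueError("not an InChIKey: " + key)
    status = "warning" if separator else "okay"
    return {"key": key, "status": status, "message": message or None}


# Illustrative sample lines; the messages are made up for this sketch.
sample_output = [
    "LFQSCWFLJHTTHZ-UHFFFAOYSA-N",
    "LFQSCWFLJHTTHZ-UHFFFAOYSA-N # warning: example warning text",
    "# error: example error text",
]

results = [parse_line(line) for line in sample_output]
for result in results:
    print(result["status"], result["key"])
```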