Saturday, February 9, 2008

InChIKey Resolver

Tony Williams ('Chemspiderman') posted an interesting article on his weblog at http://www.chemspider.com/blog/we-need-an-inchikey-resolver-and-we-need-it-now.html dealing with the 'translation' of an InChIKey back to a structural diagram via a 'lookup-service'.
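Because an InChIKey is a hash, it cannot be computed back into a structure; the only way back is a table that remembers which structure produced which key. A minimal sketch of such a lookup, assuming a tiny in-memory index (the `KEY_TO_STRUCTURE` table and the `resolve` function are hypothetical names used for illustration, not part of any existing service):

```python
from typing import Optional

# Hypothetical in-memory index: InChIKey -> structure (here a SMILES string).
# A real service would back this with a database of millions of entries.
KEY_TO_STRUCTURE = {
    "BSYNRYMUTXBXSQ-UHFFFAOYSA-N": "CC(=O)Oc1ccccc1C(=O)O",  # aspirin
    "LFQSCWFLJHTTHZ-UHFFFAOYSA-N": "CCO",  # ethanol
}


def resolve(inchikey: str) -> Optional[str]:
    """Return the stored structure for an InChIKey, or None if it is not stored."""
    return KEY_TO_STRUCTURE.get(inchikey.strip().upper())


print(resolve("LFQSCWFLJHTTHZ-UHFFFAOYSA-N"))
print(resolve("AAAAAAAAAAAAAA-UHFFFAOYSA-N"))
```

The second call returns `None`, which only means that the key is not in this particular table.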


I like Tony's idea of an InChIKey resolver and I would like to support it. My only questions and remarks concern the efficiency of such a process in our world of 'parallel systems'.


A few facts first:


At the moment approx. 33 million organic compounds are known. Chemspider holds approx. 21M and PUBCHEM-Compounds approx. 18M structures, which represents about 2/3 of known chemistry. I know that within CHEMSPIDER structure correction is an ongoing process, as it is, e.g., within my own CSEARCH-project. NMRShiftDB has been considerably improved over the last months, etc.


Now all these systems exchange structures ..... 'A' gets 10,000 structures from 'B', 'A' does some corrections and gives its structures to 'C'. 'B' doesn't know about A's corrections and also gives its structures to 'C'. Now 'C' has 2 "versions" of the same structure - in principle you can ignore that for an InChIKey resolver, but the situation is much more complicated, because CHEMSPIDER, PUBCHEM, etc. have dozens of contributors.
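The exchange described above can be played through in a few lines. This is only a toy sketch: the repositories are plain dictionaries, and the record id and structure strings are illustrative placeholders, not real data from any of the systems mentioned.

```python
# Each repository maps a record id to a structure description.
# All ids and structure strings are illustrative placeholders.
repo_b = {"rec-1": "structure-as-drawn-by-B"}

# 'A' imports B's record and corrects it; 'B' never hears about the correction.
repo_a = dict(repo_b)
repo_a["rec-1"] = "structure-corrected-by-A"


def deposit(target, source_name, source):
    """Copy every record from source into target, tagged with its origin."""
    for rec_id, structure in source.items():
        target.setdefault(rec_id, set()).add((source_name, structure))


# Both 'A' and 'B' deposit into 'C'.
repo_c = {}
deposit(repo_c, "A", repo_a)
deposit(repo_c, "B", repo_b)

# Records for which 'C' now holds more than one distinct structure.
conflicts = {
    rec_id: versions
    for rec_id, versions in repo_c.items()
    if len({structure for _, structure in versions}) > 1
}
print(conflicts)
```

With only two depositors the conflict is easy to spot; with dozens of contributors per repository, every record can arrive in several slightly different versions.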


I definitely understand data curation, and I know that it is sometimes work worthy of Sherlock Holmes, because the experimental parts of publications (and especially NMR assignments) tend to be cryptic. We have a lot of systems in parallel, with everybody doing his/her job seriously and spending a lot of time on data curation. What we need is not 10, 20, maybe 100 structure repositories, each of them incomplete (see above). What we really need is ONE SINGLE STRUCTURE REPOSITORY (we live on only ONE PLANET!) - now I also put my kevlar vest on and put 300 feet of landmines around my house - we have it, it's CAS! Sorry to say so, but this is the most complete one. When you are interested in a specific structure and you don't find it in Chemspider, emolecules, Pubchem, etc., what does this answer tell you? It simply tells you that it is not stored - it DOES NOT TELL you that IT DOES NOT EXIST! I am quite sure I will be (hopefully only virtually) beaten by the community for this statement, but please keep in mind the relationship between 'new things' (algorithms, data, new procedures, etc.) and 'data curation' when hosting a large database. The 'curation effort' doesn't increase linearly with the size of the database - it's at least a quadratic relationship.
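One possible way to picture the quadratic growth, purely as an assumption for illustration (no derivation is given here): if every record may have to be checked against every other record for duplicates or conflicts, the number of pairs grows with the square of the database size. The helper `pairwise_checks` is a hypothetical name.

```python
def pairwise_checks(n: int) -> int:
    """Number of distinct record pairs among n records: n * (n - 1) / 2."""
    return n * (n - 1) // 2


# Assumption: curation effort is proportional to the number of pairs.
for size in (1_000, 10_000, 100_000):
    print(f"{size:>7} records -> {pairwise_checks(size):>13,} pairs")
```

Under that assumption, doubling the number of records roughly quadruples the number of pairs to look at.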


What we need is ONE, CENTRALIZED place for structures and 'retrieval functionality' (including this InChIKey resolver), which covers the COMPLETE KNOWN CHEMISTRY and NOT hundreds of incomplete and severely overlapping installations. Let me know when I can take off my kevlar vest ;-))


An example to show that this (highly desirable) curation process leads to a lot of confusion:


Globostellatic acid F was drawn with C-O-O-H (hydroperoxide) instead of a carboxyl group in NMRSHIFTDB -> the data went to PUBCHEM (CID: 15938977 / original NMRSHIFTDB-number was 22047)


Within NMRShiftDB this entry has been corrected (NMRSHIFTDB-number 20093989) and went to PUBCHEM again: CID=11526176


Do a search on PUBCHEM for the name 'globostellatic' - you end up with 2 'globostellatic acid F' structures; one is correct, the other is a hydroperoxide instead of an acid. It's simply applied error propagation ...... like in school, when you peek at your neighbor's work. When you copy it perfectly, you are consistent, but your exam might also be completely wrong when your neighbor fails. In chemistry we have a more technical term - it's called 'citation'.
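A check for this kind of duplicate could look roughly like the sketch below. It only uses the two CIDs quoted above; the record layout, the `group` field, and the function name are hypothetical and chosen for illustration, and nothing here queries PUBCHEM itself.

```python
from collections import defaultdict

# The two PUBCHEM entries from the example above. The 'group' field is a
# hand-written stand-in for the real structure, used only for illustration.
records = [
    {"cid": 15938977, "name": "globostellatic acid F", "group": "hydroperoxide"},
    {"cid": 11526176, "name": "globostellatic acid F", "group": "carboxylic acid"},
]


def names_with_conflicting_structures(entries):
    """Map each name that has more than one distinct structure to its CIDs."""
    by_name = defaultdict(list)
    for entry in entries:
        by_name[entry["name"]].append(entry)
    return {
        name: sorted(e["cid"] for e in group)
        for name, group in by_name.items()
        if len({e["group"] for e in group}) > 1
    }


print(names_with_conflicting_structures(records))
```

Such a check can only flag the conflict; deciding which of the two entries is the correct one still needs a curator.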

3 comments:

ChemSpiderMan said...

Wolfgang, Thanks for the comments…I am copying this from my blog for others to read.

You said: “I know that within chemspider structure correction is an ongoing process as it is within my own CSEARCH-project.” Yes, there is a growing effort now around curation and comments. We have just rolled out an enhanced system this week and I will blog about it shortly once the manual is written. See http://www.chemspider.com/feedbackcurated.aspx

You said “NMRshiftdb has been severly improved over the last months, etc. - Now all these systems exchange structures ….. ‘A’ gets 10,000 structures from ‘B’, ‘A’ does some corrections and gives its structures to ‘C’. ‘B’ doesnt know A’s corrections and gives also its structures to ‘C’. Now ‘C’ has 2 “versions” of the same structure - in principle you can ignore that for an Inchikey-resolver, but the situation is much more complicated, because CHEMSPIDER, PUBCHEM, etc. have dozens of contributors.” Yes, this is VERY complex. We are also making lots of edits to the PubChem dataset and they are not finding their way back. We end up making edits in ChemSpider, redepositing to PubChem and withdrawing structures. We tend NOT to remove any structures from the database but to annotate them with information.

You said “I have definitely understanding of data-curation and I know that data-curation is sometimes a work like ‘Sherlock Holmes’ has done, because experimental parts of publications (and especially NMR-assignments) tends to be cryptic. ” Absolutely yes!!!! There are examples which have taken a couple of hours to work through.

You said “We have a lot of systems in parallel, everybody doing his/her job seriously, spends a lot of time on data-curation. What we need is not 10, 20, maybe 100 structure repositories - each of it is incomplete (see above). What we really need is ONE SINGLE STRUCTURE REPOSITORY ( we live on only ONE PLANET !) - now I also put my kevlar vest on and put 300 feet landmines around my house - we have it, its CAS ! Sorry to say so, but this is the most complete one.” Yes, I agree. It is the highest quality repository available and certainly the largest collection of quality data. I agree. But, like many chemists, I don’t have access to it. Also, the system is not indexing online materials which are not published. They have started hosting spectral data as you know but these are limited to commercial collections. There is no way to deposit data either. Also, for the purpose of this discussion, they do not support InChI.

“When you are interested in a specific structure and you dont find it in Chemspider, emolecules, Pubchem,etc. - what does this tell you. It simply tells you, it is not stored - it DOES NOT TELL: IT DOES NOT EXIST ! ” This is also a true statement….not all chemicals studied to date are in the registry. Not all the chemicals in PubChem are in the registry (I think it’s between 1/2 and 2/3). Not all chemicals for sale in the marketplace are in the registry. Until recently prophetic compounds from patents were not supported. Nevertheless, I believe CAS IS the curated standard.

You said “I am quite sure I will be (hopefully only virtually) beaten by the community for this statement, but please keep in mind the relationship between ‘new things’ (algorithm, data, new procedures, etc.) and ‘data-curation’ when hosting a large database. The ‘curation-effort’ doesnt linearly increase with the size of the database - its at least a quadratic relationship. What we need is ONE, CENTRALIZED place for structures and ‘retrieval functionality’ (including this inchikey-resolver), which covers the COMPLETE KNOWN CHEMISTRY and NOT hundreds of incomplete and severly overlapping installations.” So, what I think you are suggesting is that CAS hosts the InChIKey Resolver. Now that’s a novel idea. A number of CAS people are registered on this blog and are likely reading it. But, there’s never been a response from anyone at CAS and ACS and I expect that will not change with this discussion. But..I like the idea…

One comment about curation: you might want to check out what I posted tonight: http://www.chemspider.com/blog/a-users-guide-to-the-process-of-curating-identifiers-on-chemspider.html

You said “Let me know, when I can put off my kevlar vest ;-))” I have bulk-purchase pricing if you need another one…

Wolfgang Robien said...

I recommend to follow the discussion on http://www.chemspider.com/blog/we-need-an-inchikey-resolver-and-we-need-it-now.html#comments

Markus Sitzmann said...

We at NCI have developed an InChIKey resolver, too. It has actually been working since Nov. 2008, but we didn't announce it much until recently. It currently holds 65 million unique chemical structures and their respective InChIKeys. It covers most of PubChem's structures, including what is available from ChemSpider at PubChem. In addition, it includes the structures available from ChemNavigator's iResearch library. The NCI/CADD Chemical Identifier Resolver is available at http://cactus.nci.nih.gov/chemical/structure.
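For readers who want to try the resolver from a script, a request URL might be built roughly as follows. Only the base address comes from the comment above; the `/identifier/representation` path layout is an assumption based on the service's public documentation, and `resolver_url` and `fetch_structure` are hypothetical helper names.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Base address as given in the comment above.
BASE_URL = "http://cactus.nci.nih.gov/chemical/structure"


def resolver_url(identifier: str, representation: str = "smiles") -> str:
    """Build a request URL; the path layout is an assumption, see above."""
    return f"{BASE_URL}/{quote(identifier, safe='')}/{representation}"


def fetch_structure(identifier: str, representation: str = "smiles") -> str:
    """Ask the service for a structure; needs network access, may raise on errors."""
    with urlopen(resolver_url(identifier, representation), timeout=30) as response:
        return response.read().decode("utf-8").strip()


# InChIKey of aspirin, used here only as a well-known example.
print(resolver_url("BSYNRYMUTXBXSQ-UHFFFAOYSA-N"))
```

Only the URL is printed here; calling `fetch_structure` would send the actual request.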