CSEARCH NMR-Database

Saturday, March 29, 2008

Revision of Assignment

On my webpage http://nmrpredict.orc.univie.ac.at/csearchlite/NMRSHIFTDB_March_2008.html I have proposed a reassignment of 10 signals (out of 16 !) solely based on spectrum prediction using CSEARCH, although the original assignment had been made by means of HH-COSY, HMQC and HMBC. I am glad that this proposal has been fully integrated into NMRShiftDB, obviously after extensive verification by another program according to the protocols of my web-server.

The statement 'This was possible, because the data are open' is definitely wrong - within a more professional system such a wrong entry would never be able to step from the 'purgatory database' into the 'production database.' The detailed analysis can be found on the webpage given above.

Tuesday, March 25, 2008

My New Office

About 2 weeks ago I had to move into my new office, which will be the "Home of CSEARCH" for approximately one year.

Stop by (at least virtually) and enjoy !

http://nmrpredict.orc.univie.ac.at/csearchlite/Home_of_CSEARCH.html

Your comments are welcome !

Tuesday, March 4, 2008

Basic Misinterpretations of NMR-Data

I have created a series of pages on my webserver dealing with misinterpretations, typos and any other type of error within NMR-data. I am definitely not talking about errors below 10ppm - I am talking about errors that can easily be detected by applying appropriate computer algorithms using a few seconds of CPU-time.

At the moment 2 examples are online - I promise 'More to come' ! - Stay tuned, check back !

http://nmrpredict.orc.univie.ac.at/csearchlite/NMR_misinterpretation.html

In order to do a serious job I have to cite every erroneous paper I find during my daily work - BUT I don't want to blame anybody personally. On the other hand I think it's necessary to analyze the quality of available NMR-data, because this is the basis for solving future structure elucidation problems ! Keep in mind what is necessary to perform this task: State-of-the-art algorithms for automatic data-checking with an underlying database of highly verified spectra AND the largest CNMR-database available (despite its size of more than half a million C-spectra it is still incomplete)

Wednesday, February 27, 2008

Proton Prediction

A nice article on proton prediction can be found in the latest issue of Spectroscopy Europe

http://www.spectroscopyeurope.com/TD_20_1.pdf

A few more links summarizing where this new development has already been integrated can be found on

http://nmrpredict.orc.univie.ac.at/

Saturday, February 9, 2008

InChIKey Resolver

Tony Williams ('Chemspiderman') posted an interesting article on his weblog at http://www.chemspider.com/blog/we-need-an-inchikey-resolver-and-we-need-it-now.html

dealing with the 'translation' of an InChIKey back to a structural diagram via a 'lookup-service'.

I like Tony's idea of an InChIKey-resolver and I would like to support it. The only questions/remarks I have deal with the efficiency of such a process in our world of 'parallel systems'.
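Because an InChIKey is a hash, it cannot be converted back into a structure by computation; a resolver has to look it up in a table of known structures. A minimal sketch of such a lookup-service follows. The table, the function name and the single ethanol entry are illustrative assumptions of mine and are not taken from Tony's proposal.

```python
from typing import Dict, Optional

# Hypothetical lookup table: InChIKey -> structure, stored here as SMILES.
# A real resolver would sit on top of a database with millions of entries.
LOOKUP: Dict[str, str] = {
    "LFQSCWFLJHTTHZ-UHFFFAOYSA-N": "CCO",  # ethanol, example entry only
}


def resolve(inchikey: str, table: Dict[str, str] = LOOKUP) -> Optional[str]:
    """Return the stored structure for an InChIKey, or None if it is not stored.

    None only means 'not in this table'; it says nothing about whether
    the compound is known elsewhere.
    """
    return table.get(inchikey.strip().upper())


if __name__ == "__main__":
    print(resolve("LFQSCWFLJHTTHZ-UHFFFAOYSA-N"))
    print(resolve("AAAAAAAAAAAAAA-AAAAAAAAAA-A"))
```

The quality of the answer depends entirely on how complete the table behind the resolver is.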

A few facts first:

At the moment approx. 33 million organic compounds are known. Chemspider holds approx. 21M, PUBCHEM-Compounds approx. 18M structures, which represents 2/3 of known chemistry. I know that within CHEMSPIDER structure correction is an ongoing process, as it is e.g. within my own CSEARCH-project. NMRshiftdb has been severely improved over the last months, etc.

Now all these systems exchange structures ..... 'A' gets 10,000 structures from 'B', 'A' does some corrections and gives its structures to 'C'. 'B' doesn't know A's corrections and also gives its structures to 'C'. Now 'C' has 2 "versions" of the same structure - in principle you can ignore that for an InChIKey-resolver, but the situation is much more complicated, because CHEMSPIDER, PUBCHEM, etc. have dozens of contributors.

I definitely understand data-curation, and I know that data-curation is sometimes work of the kind 'Sherlock Holmes' did, because the experimental parts of publications (and especially NMR-assignments) tend to be cryptic. We have a lot of systems in parallel, and everybody doing his/her job seriously spends a lot of time on data-curation. What we need is not 10, 20, maybe 100 structure repositories - each of them is incomplete (see above). What we really need is ONE SINGLE STRUCTURE REPOSITORY ( we live on only ONE PLANET ! ) - now I also put my kevlar vest on and put 300 feet of landmines around my house - we have it, it's CAS ! Sorry to say so, but this is the most complete one. When you are interested in a specific structure and you don't find it in Chemspider, emolecules, Pubchem, etc. - what does this answer tell you ? It simply tells you that it is not stored - it DOES NOT TELL you that IT DOES NOT EXIST ! I am quite sure I will be (hopefully only virtually) beaten by the community for this statement, but please keep in mind the relationship between 'new things' (algorithms, data, new procedures, etc.) and 'data-curation' when hosting a large database. The 'curation-effort' doesn't increase linearly with the size of the database - it's at least a quadratic relationship.

What we need is ONE, CENTRALIZED place for structures and 'retrieval functionality' (including this InChIKey-resolver), which covers the COMPLETE KNOWN CHEMISTRY and NOT hundreds of incomplete and severely overlapping installations. Let me know when I can take off my kevlar vest ;-))

An example to show that this (highly desirable) curation-process leads to a lot of confusion:

Globostellatic acid F: was drawn with C-O-O-H (hydroperoxide) instead of a carboxyl group in NMRSHIFTDB -> the data went to PUBCHEM ( CID: 15938977 / original NMRSHIFTDB-number was 22047)

Within NMRShiftDB this entry has been corrected: NMRSHIFTDB-number 20093989 and went again to PUBCHEM: CID=11526176

Do a search on PUBCHEM for the name 'globostellatic' - you end up with 2 'globostellatic acid F' structures, one is correct, the other is a hydroperoxide instead of an acid. It's simply applied error-propagation ...... like in school, when you peek at your neighbor's work. When you copy it perfectly, you are consistent, but your examination might also be completely wrong when your neighbor fails. In chemistry we have a more technical term - it's called 'citation'.

Tuesday, February 5, 2008

Spectral Searching on PUBCHEM-Structures

About 2 years ago, a spectral search system based on 16M PUBCHEM structures (approx. 5M unique) was built and made available on http://nmrpredict.orc.univie.ac.at/identify . It went online during May 2006.

Some background information:

C-NMR spectra have been calculated using CSEARCH-NN-technology
Spectral search technique is based on SAHO as implemented into CSEARCH
The main intention of this system is to get some feeling about the compound class for an unknown. It must be clearly stated that a database of 5M unique structures is definitely too small to cover known organic chemistry (approx. 33M at 02/2007). When taking into account the possible structures for a given molecular formula, 5M structures represent a negligible part of possible organic chemistry !

In the meantime there were massive updates on PUBCHEM - this was the reason for rerunning the predictions and implementing another (much faster) search technique - the principle is still based on Wolfgang Bremser's SAHO-technique - the speed has been increased to allow searching of 1 billion (10**9) CNMR-spectra within less than 3 seconds on a single CPU. At the moment the system is only partly installed and allows searching of 405,704,611 spectral patterns (usually in 1.2-1.6 seconds).
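As a quick plausibility check, the figures quoted above can be turned into a throughput and extrapolated to the full 1 billion patterns. The sketch below assumes that search time grows linearly with the number of patterns, which is my simplification and not a stated property of the search technique.

```python
INSTALLED_PATTERNS = 405_704_611   # patterns searchable at the moment
TARGET_PATTERNS = 10**9            # design target quoted above


def patterns_per_second(n_patterns: int, seconds: float) -> float:
    """Throughput for a search over n_patterns that takes the given time."""
    return n_patterns / seconds


def extrapolated_seconds(observed_seconds: float) -> float:
    """Scale an observed search time linearly up to the target size (assumption)."""
    return observed_seconds * TARGET_PATTERNS / INSTALLED_PATTERNS


if __name__ == "__main__":
    for t in (1.2, 1.6):
        rate = patterns_per_second(INSTALLED_PATTERNS, t)
        print(f"{t:.1f} s -> {rate:.3g} patterns/s, "
              f"extrapolated {extrapolated_seconds(t):.2f} s for 10**9")
```

Under that linear assumption, the observed 1.2-1.6 seconds would correspond to roughly 3 to 4 seconds for the full billion, i.e. the same order of magnitude as the quoted target.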

Key features:

PREDICTED CNMR-spectra for approx. 23M unique structures downloaded from PUBCHEM using CSEARCH-NN-technology
Structures deposited from CHEMSPIDER are already included
Intention is again to give some flavour of possible compound classes for an unknown

A detailed description of the search-technique will be given soon - stay tuned !

Another nice feature of this system: Whenever an experimental set of NMR-data is available within CSEARCH / SPECINFO / NMRPredict / NMRShiftDB / CHEMGATE - this information is automatically included in the final result table of structures !

Feel free to test it ! The URL is

http://nmrpredict.orc.univie.ac.at/case/propose.php

Your feedback is highly appreciated - use the comment section !

Thursday, January 31, 2008

NMRPredict as robot-referee

As is well known within the NMR-community, NMRPredict uses CSEARCH-technology for predicting and searching X-nuclei spectra. The databases behind it consist of the combined collections of CSEARCH and SPECINFO.

One out of many possible applications of a program like NMRPredict is the field of structure-verification. An excellent example comes from the debate on 1,7-Diaza[12]annulenes, which have been shown by Manfred Christl to be well-known pyridinium salts. A simple spectral similarity search using NMRPredict - either applied by the authors of the 2 papers (Angew.Chem. & Org.Lett.) or by the referees - would have shown that these spectral data have been known since 1980. A detailed analysis including screen dumps can be found on:

http://nmrpredict.orc.univie.ac.at/csearchlite/Annulenes_or_Pyridines.html

Wednesday, January 30, 2008

Prediction of H1-NMR Spectra

The prediction of H1-NMR spectra within MODGRAPH's NMRPredict-program is based on the algorithms developed by Ray Abraham's and Ernö Pretsch's groups. Both techniques have excellent performance on their own, but a combination of these methods gives superior results.

I am proud that I could apply the 'Best'-technology, which has already been successfully implemented into the HOSE-code and NN-based prediction engines for C13, to the H1-prediction module, giving an average deviation of 0.18ppm on a testset of 90,000 well-assigned proton-spectra provided by Wiley. It was a great pleasure for me to work together with Ernö and Ray on this subject. We all know that there is room for further improvements - the corresponding concepts are already there and are waiting for implementation and subsequent testing.
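For readers who want to run the same kind of comparison on their own data, here is a minimal sketch of how an average deviation between predicted and experimental shifts can be computed. It assumes that 'average deviation' means the mean absolute difference in ppm; the function name and the three shift values are invented for illustration and have nothing to do with the Wiley testset.

```python
from typing import Sequence


def mean_absolute_deviation(predicted: Sequence[float],
                            observed: Sequence[float]) -> float:
    """Mean of |predicted - observed| over paired chemical shifts (ppm)."""
    if len(predicted) != len(observed) or len(predicted) == 0:
        raise ValueError("need two non-empty shift lists of equal length")
    total = sum(abs(p - o) for p, o in zip(predicted, observed))
    return total / len(predicted)


if __name__ == "__main__":
    # Invented example shifts in ppm, not real measurements.
    predicted = [7.30, 3.50, 1.20]
    observed = [7.26, 3.70, 1.20]
    print(f"{mean_absolute_deviation(predicted, observed):.2f} ppm")
```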

For detailed information have a look into:
http://www.modgraph.co.uk/best_proton_press_release.htm

Real-world structure verification examples can be found on the MESTRELAB RESEARCH webpage:
http://www.mestrec.com/recursos.php?idr=54&i18n=1
http://www.mestrec.com/recursos.php?idr=55&i18n=1

Tuesday, January 8, 2008

NMR-Spectral Data and InChIKeys

Within my ongoing project to create a portal for existing NMR-spectral data, another set of some 90,000 structures has been made available to me. They will be processed within the next 2 weeks and my collection of links will be updated accordingly. Afterwards more than 500,000 spectra from some 400,000 different structures will be available. Feel free to generate requests automatically using utilities like 'curl' or 'wget'. Please be so kind as to restrict requests to fewer than 100 per day !
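If you script your requests, a small counter is enough to stay below the limit. The following sketch shows one way to do it; the class and function names are my own invention, the URL you pass in is up to you, and the limit of 99 simply encodes 'fewer than 100 per day'.

```python
import datetime
import urllib.request


class DailyQuota:
    """Counts requests per calendar day and refuses any beyond the limit."""

    def __init__(self, limit=99, today=datetime.date.today):
        self.limit = limit          # 99 keeps us at fewer than 100 per day
        self._today = today         # injectable so the logic can be tested
        self._day = today()
        self._used = 0

    def try_acquire(self) -> bool:
        day = self._today()
        if day != self._day:        # a new day has started: reset the counter
            self._day = day
            self._used = 0
        if self._used >= self.limit:
            return False
        self._used += 1
        return True


def fetch(url, quota, opener=urllib.request.urlopen):
    """Download url if the quota allows it; return None once the quota is used up."""
    if not quota.try_acquire():
        return None
    with opener(url) as response:
        return response.read()
```

Create one DailyQuota object per script run and pass it to every fetch call; once it returns None, stop and continue the next day.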

Sunday, January 6, 2008

Could someone explain to me

I have found an article about "Open Data in Science" written by Peter Murray-Rust (the article can be downloaded from http://www.dspace.cam.ac.uk/handle/1810/194890 ), in which 'OSCAR-3' (a tool for extracting data from the chemical literature) is mentioned in the context of C-NMR spectroscopy.

I am now surprised that the NMRShiftDB-collection ( http://nmrshiftdb.org/ ) increased by only 8 structures within 7 weeks ( from Nov 18th, 2007 to Jan 6th, 2008 ), while OSCAR-3, which allows automatic extraction of NMR-data from articles, is around ?! For legal reasons only the automatic extraction of data from OA-journals seems to be possible, which reduces the amount of available data. Therefore I simply want to see ONE, SINGLE FULLY ASSIGNED C-NMR spectrum, which has been AUTOMATICALLY EXTRACTED by OSCAR-3 from the chemical literature.

A corresponding question has been deposited at Peter Murray-Rust's Weblog - I hope to get an answer. Check back, I'll keep you up-to-date.

My questions can be found on
http://wwmm.ch.cam.ac.uk/blogs/murrayrust/?p=916#comments

Just for your comparison:
The increase of spectra within CSEARCH can be found here - without OSCAR-3 support ;-))