Sunday, January 6, 2008

Could someone explain to me

I have found an article on "Open Data in Science" by Peter Murray-Rust (the article can be downloaded from ), which mentions 'OSCAR-3' (a tool for extracting data from the chemical literature) in the context of C-NMR spectroscopy.

I am now surprised that the NMRShiftDB collection ( ) grew by only 8 structures within 7 weeks (from Nov 18th, 2007 to Jan 6th, 2008) while OSCAR-3, which allows automatic extraction of NMR data from articles, is around?! For legal reasons, automatic extraction seems to be possible only from OA journals, which reduces the amount of available data. Therefore I simply want to see ONE SINGLE FULLY ASSIGNED C-NMR spectrum which has been AUTOMATICALLY EXTRACTED by OSCAR-3 from the chemical literature.

A corresponding question has been posted on Peter Murray-Rust's weblog - I hope to get an answer. Check back, I'll keep you up to date.

My questions can be found on

Just for your comparison:
The increase of spectra within CSEARCH can be found here - without OSCAR-3 support ;-))


Wolfgang Robien said...

One response has been posted by Egon Willighagen - see URL above

Wolfgang Robien said...

A few more comments have been posted - in my opinion, the situation has been clarified, although my questions haven't been answered explicitly.

Egon Willighagen said...

Hi Wolfgang,

those are impressive increases, really. That's a major effort, and very important.

It might be worthwhile to go back to one of the reasons why I (and others) think Open Data is important.

Let me make clear this is about money. The problem is really not that those who have to enter and curate the data should do this for free. Of course not, that would be stupid.

However, the current money flow is the problem. Peter has blogged about this in the early days, and you might find those items interesting as further reading.

With proprietary databases the money flow is at first as it should be: money goes from the user to the author who enters the data. However, soon after the authors have stopped entering new structures, money still flows from the users, but normally no longer to the authors; it goes to some software company or publisher, and not back into science.

And that's what is worrying me. If that money were spent on further helping science, it would be much less of a problem, I think. If the money flow, when you stop working on the CSEARCH database, would still aid research, that is, go to research instead of managers or account holders...

The second problem is simply being able to verify what things are doing. I don't like black boxes, which proprietary databases are. I find that rather unscientific, as a *user*, not as a database maintainer.

Anyway, *really* impressive numbers!

Areas where I think OSCAR3 is helpful:

- it can indicate things that are obviously wrong
- it can automate drawing of the structure, at least a good draft (stereochemistry excluded)

PS. The chemical structure is extracted from the experimental section by converting the structure name to a structure. No MDL molfile etc. is involved.
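The first point above, flagging things that are obviously wrong, can be illustrated with a small plausibility check on an extracted peak list. This is only a sketch under stated assumptions and not a description of how OSCAR3 works: the function name `check_c13_peaks`, the 0 to 220 ppm window, and the rule that a C-NMR peak list should not contain more signals than the compound has carbon atoms are all illustrative choices.

```python
def check_c13_peaks(shifts_ppm, n_carbons, low=0.0, high=220.0):
    """Return a list of warnings for a 13C peak list.

    Assumptions (illustrative, not taken from OSCAR3):
    - typical 13C shifts fall between `low` and `high` ppm
    - a compound shows at most one signal per carbon atom
    """
    warnings = []

    # Flag shifts outside the assumed window, e.g. a misplaced decimal point.
    for shift in shifts_ppm:
        if not (low <= shift <= high):
            warnings.append(f"shift {shift} ppm outside {low}-{high} ppm")

    # Flag peak lists that are longer than the carbon count allows.
    if len(shifts_ppm) > n_carbons:
        warnings.append(
            f"{len(shifts_ppm)} signals reported for {n_carbons} carbon atoms"
        )

    return warnings
```

A peak list such as `[170.1, 1280.5, 21.0]` for a two-carbon compound would trigger both warnings. Real spectra can legitimately break either rule (for example through coupling to other nuclei), so such a check can only point a human curator at suspicious entries.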

Wolfgang Robien said...

Thanks for commenting on the open issues, I think the situation has been clarified. My personal summary: There is good software support to speed up data extraction, but a few decisive tasks are still missing. No wonder: the complete workflow is extremely complex and differs between journals. I agree that some standardization in this field is highly desirable.

The picture of money-flow is very much simplified - you can negotiate that, as long as you stay within the legal boundaries. Be sure, the license fees for CSEARCH-algorithms and CSEARCH-data go back 100% into the project and therefore into science.

atlas245 said...

I thought the post made some good points on extracting data. For simple stuff I use Python to get or simplify data. Data extraction can be a time-consuming process, but for larger projects like files, the web, or documents I tried which worked great; they build quick custom screen scrapers, data extraction, and data parsing programs.
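As an illustration of the "simple stuff" kind of extraction with Python, here is a minimal sketch that pulls a list of C-NMR shifts out of one sentence. The sample sentence, the function name `extract_c13_shifts`, and the regular expression are assumptions made up for this example; real experimental sections vary a lot between journals, and this does not assign shifts to atoms, which is the hard part discussed in the post.

```python
import re

# Hypothetical sentence in the style of an experimental section.
# "\u03b4" is the Greek letter delta commonly used before shift values.
TEXT = "13C NMR (CDCl3): \u03b4 170.1, 128.5, 60.4, 21.0."

# Look for "13C NMR", skip everything up to the delta sign, then capture
# the run of digits, dots, commas and whitespace that follows it.
PATTERN = re.compile(r"13C\s*NMR[^\u03b4]*\u03b4\s*([\d.,\s]+)")


def extract_c13_shifts(text):
    """Return the shift values (ppm) found after the delta sign, as floats."""
    match = PATTERN.search(text)
    if match is None:
        return []
    # Pick out the individual numbers from the captured run.
    return [float(tok) for tok in re.findall(r"\d+(?:\.\d+)?", match.group(1))]


shifts = extract_c13_shifts(TEXT)
```

For the sample sentence, `shifts` holds the four values listed after the delta sign; text without a matching C-NMR phrase yields an empty list.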