Score, Significance and Quality in the context of an automatic verification system

As explained in previous articles, a verification system relies on several elements which can be summarized or grouped into:

Elements which are responsible for processing and analyzing the experimental data.
Elements which are responsible for predicting what the experimental data should look like, starting generally from the structure (not always).
Tests which compare the experimental with the predicted data in a way which is meaningful and relevant to the compound identity question we are asking.
A scoring system which combines the results of all these tests to give a final result for a dataset.

Once this is understood (we will write more about what the different tests do and how they work in a further article), a common question is: what is the difference between the Score, Significance and Quality values that Mnova Verify uses to report how well the experimental data correlate with the proposed molecular structure?

Score:

The score for a given test is an evaluation of how well the proposed dataset has performed: how does the hypothesis that my experimental data may correspond to my proposed structure score on the basis of that specific test? Our scoring system scores between -1.0 and +1.0. Therefore, a structure which is completely compatible with the experimental data in a given test will score close to +1.0 for that test, and a structure which is totally impossible in the light of the experimental data will score close to -1.0.

Significance:

The significance of a given test in a given case is an evaluation of how reliable a Score is, in the light of the data and of the historical performance of that test. It indicates how much we trust that test in that specific case, and therefore how much weight should be given to that test and to its Score when combining all the test scores in the Scoring System to arrive at a total Score for the dataset. It is a weighting value. Significance can range from 0 (very low significance) to any positive value: a totally deterministic test could in principle have an infinite significance, although typical values are 0-2 (low), 2-5 (medium) and >5 (high).
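The typical ranges quoted above can be turned into a small helper for labelling results. This is only a sketch: the function name is hypothetical, and the treatment of the boundary values 2 and 5, which the quoted ranges leave ambiguous, is an assumption rather than anything Mnova Verify documents.

```python
def significance_band(significance: float) -> str:
    """Label a significance value using the typical ranges quoted in the text.

    Assumption: the boundary values 2 and 5 are placed in the lower band.
    """
    if significance < 0:
        raise ValueError("significance cannot be negative")
    if significance <= 2:
        return "low"
    if significance <= 5:
        return "medium"
    return "high"  # includes float("inf") for a fully deterministic test


# Significances quoted later in this article for the correct structure
for value in (1.01, 4.23, 5.42):
    print(value, significance_band(value))
```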

Quality:

Quality is just a combination of the Score and the Significance to arrive at a single value. It is simply a mathematical operation carried out on the Score and the Significance to bring them together and represent both. The purpose of the Quality is to allow users of the system, especially those who are using it in batch mode and looking at very large volumes of data, to decide which data to review, or to sort the data on the basis of how well they have performed in verification. Clearly, it is much easier to sort a large volume of data based on 1 value than it is to sort them based on 2.
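As an illustration of why a single value helps in batch mode, the sketch below sorts a handful of results by Quality so that the weakest datasets come up for review first. The record layout, the dataset names and the numbers are all hypothetical; they are not an Mnova Verify export format.

```python
from typing import NamedTuple


class Result(NamedTuple):
    dataset: str    # hypothetical dataset identifier
    quality: float  # single value combining Score and Significance


def review_order(results: list[Result]) -> list[Result]:
    """Return results sorted from lowest to highest Quality."""
    return sorted(results, key=lambda r: r.quality)


# Hypothetical batch of three verification results
results = [
    Result("sample_A", 0.68),
    Result("sample_B", -0.45),
    Result("sample_C", 0.27),
]

for r in review_order(results):
    print(f"{r.dataset}: {r.quality:+.2f}")
```

Sorting on two separate columns would need a rule for trading Score against Significance; the Quality value is that rule applied once, up front.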

A useful analogy could be a review panel. Imagine that I have a review panel with 3 people. I present them with the same problem the software is looking at, i.e., a structure confirmation problem to be resolved by using some analytical data. My 3 panelists are:

An analytical expert with 30 years' experience who has shown near infallibility when making these types of evaluation in the past.
An analytical chemist with 3 years' experience and a tendency to be too optimistic when passing structures in verification problems.
A first-year undergraduate student who has taken a 4-hour course in analytical chemistry and is evaluating his/her first real case.

Each of my panelists will give a score to the data, and it is quite possible that the score may be similar for all of them. They may well agree that the data is correct or incorrect. However, I will not give the same weight or significance to their opinions. My first panelist will get a lot more weight (say 10) than the second (4), who will get a lot more weight than the third (1). If the 3 of them agree on a high score, for example 0.8, my conclusion will be that the score for this dataset is 0.8 and the significance will be higher than 10, since the weight of the first panelist is reinforced by the agreement of the other 2. If they disagree, then I will get a combined score lower than 0.8, with a significance which is also lower than the expert's. The overall result may well still align with the expert, unless both of the other 2 disagree with the expert, in which case I would get a fairly neutral result. Furthermore, for the second panelist, given his/her tendency to be optimistic, I may give higher significance to a negative result than I would to a positive one. The quality for these similar scores will therefore differ, because it takes account of the different significances.
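The panel analogy can be sketched numerically. The combination rule below is an assumption chosen only to reproduce the behaviour described above (agreement reinforces the significance, disagreement pulls both score and significance down); it is not the scoring system Mnova Verify actually uses, and the names `Opinion` and `combine` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Opinion:
    score: float         # between -1.0 and +1.0
    significance: float  # weight given to this opinion, >= 0


def combine(opinions: list[Opinion]) -> tuple[float, float]:
    """Combine opinions into (score, significance).

    Assumed rule: the score is the significance-weighted mean, and the
    combined significance is the total weight scaled by how far the
    weighted opinions point in the same direction (1 = full agreement).
    """
    total_weight = sum(o.significance for o in opinions)
    if total_weight == 0:
        return 0.0, 0.0
    weighted_sum = sum(o.significance * o.score for o in opinions)
    weighted_abs = sum(o.significance * abs(o.score) for o in opinions)
    score = weighted_sum / total_weight
    agreement = abs(weighted_sum) / weighted_abs if weighted_abs else 0.0
    return score, total_weight * agreement


# All three panelists agree on 0.8 (weights 10, 4 and 1 as in the text)
agree = combine([Opinion(0.8, 10), Opinion(0.8, 4), Opinion(0.8, 1)])

# The expert says 0.8, the other two say -0.8
disagree = combine([Opinion(0.8, 10), Opinion(-0.8, 4), Opinion(-0.8, 1)])

print("agree:", agree)
print("disagree:", disagree)
```

Under this assumed rule, full agreement keeps the score at 0.8 with a significance above the expert's 10, while the split panel lands on a small positive score with a significance below 10, which matches the "fairly neutral" outcome described in the analogy.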

Let's now see this with a couple of examples specific to the Automatic Structure Verification problem:

Examples

Example 1:

In this example, I am going to use an experimental dataset acquired by Lee Griffiths et al. [1-3] for their series of papers on automatic verification.

Figure 1: An experimental spectrum, with the correct structure on the left and a structure manually modified in Mnova on the right

Let's now look at the scores for my 2 structures:

Structure 1 (correct structure) gets the following scores and significances for the different tests:

Whilst structure 2 (incorrect structure) gets the following:

We will start our analysis with the first test, the '1H Nuclides Count' test. This is positive in both cases, as both structures have the same number of nuclides and the distribution of the nuclides across the different multiplets is also equivalent. The relaxation time for this dataset is long, there is no overlapping, phase correction is good and relatively simple, and the agreement of the integrals with the structure is excellent (1.00, 0.99, 0.99 and 2.93 for the 3 single protons and the methyl group). We therefore get a very high score of 0.98 for the correct structure and a lower one (0.63) for the incorrect structure (the reasons for this lower score will be explained in a further article about the specific tests and are not relevant here). However, the significance of these scores is relatively low (1.01 and 1.48). The reason is that '1H Nuclides Count' is a test which, when positive, is not very deterministic in indicating that the data and the structure are a good match: many molecules, even ones very different from the one proposed, would show the same total and relative integrals, as well as the same number of multiplets and peaks. Therefore, we have high scores (0.98, 0.63) and low significances (1.01, 1.48), which combine to give a positive but low Quality (0.33, 0.27).

The second test, '1H Prediction Bounds', determines whether the peaks we find fit within the error bounds allowed for each of the predicted multiplets. When we look at the results, we see that they are both positive, which indicates that the multiplets do indeed fit within the allowed ranges.

Figure 2: Experimental (bottom, red line) and predicted (left hand correct molecule in centre, green line; right hand incorrect molecule at top, blue line) spectra for this dataset.

Looking at the scores (1.00 for the good molecule, 0.89 for the bad one), it is clear that the match is better for the correct molecule, but that both are perfectly acceptable. In this case, the significances are higher than for the 1H Nuclides Count test (4.23 and 4.32 as compared to 1.01 and 1.48). This is because a fit in the 1H Prediction Bounds test is much more significant than a fit in the 1H Nuclides Count test: it takes into consideration not only the integrals and multiplets but also their frequency, and is therefore more indicative of a good match between data and structure, and less prone to passing incorrect, and very different, molecules. The Quality for both structures is also much higher than for the previous test, as it is just a combination of the Score and the Significance, and the higher significance results in a higher Quality (0.68 and 0.61 respectively).
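The article does not state how Quality is calculated. One simple expression that happens to reproduce the four Quality values quoted so far is Score × Significance / (Significance + 2). The sketch below checks this against those values; the formula and its constant 2 are an empirical fit to these four data points and should be read as an assumption, not as the documented Mnova Verify calculation.

```python
def quality(score: float, significance: float, k: float = 2.0) -> float:
    """Assumed Quality formula: the score damped by significance / (significance + k).

    k = 2.0 is an empirical constant fitted to the values in this article.
    """
    if significance < 0:
        raise ValueError("significance cannot be negative")
    return score * significance / (significance + k)


# (test, structure, score, significance, Quality quoted in the article)
quoted = [
    ("1H Nuclides Count", "correct", 0.98, 1.01, 0.33),
    ("1H Nuclides Count", "incorrect", 0.63, 1.48, 0.27),
    ("1H Prediction Bounds", "correct", 1.00, 4.23, 0.68),
    ("1H Prediction Bounds", "incorrect", 0.89, 4.32, 0.61),
]

for test, structure, score, significance, reported in quoted:
    computed = round(quality(score, significance), 2)
    print(f"{test} ({structure}): computed {computed:.2f}, reported {reported:.2f}")
```

Whatever the exact formula, the shape is what matters here: a low significance pulls even a near-perfect score towards zero, while a high significance lets the Quality approach the score itself.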

Now let's look at the assignments test. This is the test that really differentiates between the two structures. The assignment test uses not only chemical shift and integration information, but also multiplicities, common splittings between multiplets which are expected to be coupled, etc. It is therefore a very good test for differentiating between positional isomers such as these, which may both pass on other tests. In this test, the correct structure gets a very good score (0.92) and the incorrect one gets a very low, and negative, one (-0.32). This means that it has been possible to find a very good matching assignment for the correct structure and that no reasonable assignments were found for the incorrect structure. In this case, we can see from Figure 2 that this is due to the multiplicities. For example, we are expecting to see a triplet for atom 3, but there is no triplet in the experimental data. So, although atom 3 fitted in the 1H Prediction Bounds test, as there was a multiplet within the right chemical shift range, the multiplet shape of that peak is incorrect and it cannot be assigned to atom 3 (this is also indicated by the red mark on atom 3 on the incorrect structure). The significances are both quite high (5.42 and 6.36), telling us that these test scores will be given particular weight when deciding whether these structures are correct in the light of the data.

We have therefore seen the difference between several tests and how they may get different scores, but also different significances, and what this means. Let us now illustrate this further by going back to the 1H Nuclides Count test and looking at a different example.

Example 2:

In this example we will use the same dataset as before, and again look at the 1H Nuclides Count test, but we will use 2 different incorrect molecules (one of which is not an isomer).

Figure 3: A second example with one 1H spectrum and 3 proposed structures (Structure 1 is the correct structure, structure 2 is a functional isomer of the correct structure and structure 3 is a different, but similar, structure).

What would happen to the 1H Nuclides Count test with these 3 structures?

Structure 1 test results:

Structure 2 test results:

Structure 3 test results:

As we would expect, the scores and significances for the first 2 structures are the same as before. Structure 3 gets a negative score in the 1H Nuclides Count test, as it has an additional proton when compared to the experimental data, and that proton cannot be accounted for. The score is therefore very clearly negative (-0.86) and the significance of this negative result is much higher than the significance of the positive results had been. As explained above, this is because a pass on the 1H Nuclides Count test is indicative of potential compatibility but could pass many wrong structures, whereas a negative score on the same test is much more strongly indicative of incompatibility between data and structure. This is therefore a case where the same test, on the same data, has very different significances depending on whether the score is positive or negative. An analogous case would be a Mass Spectrometry mass test: finding the right m/z value is not very significant or indicative of having the right structure, as many others may have the same molecular weight, whilst failing to find the right m/z value is very highly indicative of not having the right structure.
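This sign-dependent behaviour can be sketched as a test that carries two significances, one for a pass and one for a fail. The class name, the zero threshold between pass and fail, and the two significance values are hypothetical, chosen only to illustrate the asymmetry; in the examples above the significance also varies from case to case rather than being a fixed pair.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AsymmetricTest:
    name: str
    pass_significance: float  # weight when the score is positive
    fail_significance: float  # weight when the score is negative

    def significance_for(self, score: float) -> float:
        """Pick the significance according to the sign of the score.

        Assumption: a score of exactly 0 is treated as a pass.
        """
        return self.pass_significance if score >= 0 else self.fail_significance


# Hypothetical values: a pass says little, a fail says a lot
nuclides_count = AsymmetricTest("1H Nuclides Count", 1.0, 6.0)

for score in (0.98, -0.86):
    print(score, nuclides_count.significance_for(score))
```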

As an aside, we can also see that the 1H Assignment test for Structure 3 has a much lower score than for Structure 2, and much higher significance. Once we cannot find that additional proton in the experimental data, it becomes totally impossible to even get an assignment close to the data, and the significance of this becomes greatly increased, as we even have protons we cannot account for.

This is why it is so important for the algorithm to correctly identify labile protons: they can show quite unusual behaviour in the NMR spectrum, and the significance of failing to account for one will be quite different from that of a CH proton.

Conclusion

Both score and significance are critical parameters in an automatic verification system. We need to know not only how well a structure matches a spectrum in a specific test, but also whether that evaluation is reliable, and therefore how much importance must be given to that test when compared to the results of other tests. A Score without a Significance has no relevance, because it cannot be compared or evaluated in combination with other complementary, or competing, tests.

[1] Griffiths, L. (2000), Towards the automatic analysis of 1H NMR spectra. Magn. Reson. Chem., 38: 444–451. doi: 10.1002/1097-458X(200006)38:63.0.CO;2-Z
[2] Griffiths, L. and Horton, R. (2006), Towards the automatic analysis of NMR spectra: Part 6. Confirmation of chemical structure employing both 1H and 13C NMR spectra. Magn. Reson. Chem., 44: 139–145. doi: 10.1002/mrc.1736
[3] Griffiths, L., Beeley, H. H. and Horton, R. (2008), Towards the automatic analysis of NMR spectra: Part 7. Assignment of 1H by employing both 1H and 1H/13C correlation spectra. Magn. Reson. Chem., 46: 818–827. doi: 10.1002/mrc.2257
