Identification of 1H-NMR Spectra of Xyloglucan Oligosaccharides: A Comparative Study of Artificial Neural Networks and Bayesian Classification Using Nonparametric Density Estimation
Title:
Identification of 1H-NMR Spectra of Xyloglucan Oligosaccharides: A Comparative Study of Artificial Neural Networks and Bayesian Classification Using Nonparametric Density Estimation.
Authors:
Faramarz Valafar, Homayoun Valafar, William S. York
Conference:
IC-AI '99
Paper identification number:
Faramarz Valafar, University of Georgia, CCRC, Athens, GA 30602. Email: [email protected]
Homayoun Valafar, University of Georgia, CCRC, Athens, GA 30602. Email: [email protected]
William S. York, University of Georgia, CCRC, Athens, GA 30602. Email: [email protected]
Keywords: Chemical structural identification, neural networks, Bayesian classifier, xyloglucan, 1H-NMR.
Abstract:
Proton nuclear magnetic resonance (1H-NMR) is a widely used tool for chemical structural analysis. However, [...] 1H-NMR spectrum of these structures with reasonable signal-to-noise ratio, recorded on any 500 MHz NMR instrument. The system uses Artificial Neural Network (ANN) technology and is insensitive to instrument- and environment-dependent variations in 1H-NMR spectroscopy. In this paper, comparative results of the ANN engine versus a multidimensional Bayes' classifier are also presented.
1. Introduction: 1H-NMR spectroscopy.
NMR (nuclear magnetic resonance) spectroscopy is a widely used tool for chemical analysis. It is used to identify materials, to determine the chemical structure of organic compounds, and to quantify chemical substituents or the components of chemical mixtures. The proton (1H) is the nuclide that is most frequently observed by NMR. When a sample is placed in a strong magnetic field, the magnetically active nuclei become aligned. The resulting sample magnetization can be manipulated by applying a very brief magnetic field pulse that oscillates at radio frequency (RF). Such RF pulses perturb the sample magnetization, which can be observed via its induction of a current in a detector coil as the magnetization relaxes back to its equilibrium state. The resulting "free induction decay" (FID) contains information regarding the chemical environment of nuclei within the chemical sample, and thus can be used to identify and quantitate individual chemical components of the sample. The FID consists of a mixture of sinusoidal oscillations in the time domain with decaying amplitudes. The time-domain signal is normally transformed (usually using the Fourier transform) into the frequency domain. Figure 1 (located at the end of this article) illustrates two examples of a frequency-domain signal (spectrum) of a xyloglucan oligosaccharide.

1H-NMR spectra, in general, suffer from environmental, instrumental, and other types of variations that manifest themselves in a variety of aberrations. Low signal-to-noise ratio [1, 2, 4], baseline drifts [3, 4, 7], frequency shifts due to temperature variations, line broadening and negative peaks due to phasing problems, and malformed (or overlapped) peaks due to inaccurate shimming are among the most prominent and common aberrations. For example, Figure 1 shows two 1H-NMR spectra of a complex carbohydrate. The spectrum labeled (B) in this figure suffers from a variety of the above-mentioned aberrations, and from contamination by lactate, which is frequently introduced by touching laboratory glassware with bare hands. It is important to realize that this spectrum by no means represents a worst-case scenario, and it does not represent the level of complexity present in the problem of instrument-independent identification of 1H-NMR spectra of xyloglucan oligosaccharides. Spectrum (B) is merely a demonstration of some types of possible aberrations.

For the purpose of automated identification of these spectra, elimination of the above-mentioned aberrations becomes essential, as they can lead to erroneous identification [1-7]. A variety of signal processing techniques have been applied to "clean up" 1H-NMR spectra. For instance, signal averaging [4] and apodization [4] have become standard ways of improving the signal-to-noise ratio. (In signal averaging, a spectrum is recorded several times. Each recorded signal is referred to as a "transient," and the final spectrum is the arithmetic average of all the transients. The hope is that by doing so the zero-mean components of the noise present in the signal will be averaged out. Apodization is a type of low-pass (or high-pass) filtering performed in the time domain. It is performed by speeding up, or slowing down, the rate of decay of the time-domain exponential functions, which is accomplished by multiplying the time-domain signal by another function. This technique allows the improvement of signal-to-noise ratio in exchange for a reduction of signal resolution, or vice versa.) To correct baseline problems, a number of techniques have been used, such as parametric modeling using a priori knowledge [3, 5], optimal associative memory (OAM) [5], and spectral derivatives [6]. Other mathematical techniques have also been introduced to address each specific type of aberration encountered in 1H-NMR spectra.
Although many of these signal processing techniques have enjoyed success in specific applications, they remain solutions to specific types of aberrations. In order to produce an overall "clean" spectrum, one needs to use several of these methods to eliminate the aberrations present in a real spectrum. Furthermore, most of these techniques produce side effects that are magnified when improperly processed by a second signal processing algorithm. Moreover, after the initial signal processing steps have been taken, the task of identifying the processed spectrum remains. This is not a trivial task, as the quality of the processed spectrum often remains poor, requiring a sophisticated identification system.

In this paper we show that instead of eliminating all the aberrations present with a signal processing procedure used as a preprocessor, it is possible to eliminate some of them in the processing step and some in the actual identification step. Here, we show that an adaptive identification system can learn to effectively ignore some forms of aberrations.
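The signal chain described in this section (FID, signal averaging, apodization, Fourier transform) can be illustrated with a minimal Python/NumPy sketch. It simulates transients of a single decaying sinusoid; the resonance frequency, decay constant, dwell time, noise level, line-broadening value, and all function names are illustrative assumptions and are not taken from this study.

```python
import numpy as np

def simulate_fid(freq_hz, t2_s, t, noise_std, rng):
    """One transient: a decaying complex sinusoid plus Gaussian noise (assumed model)."""
    clean = np.exp(2j * np.pi * freq_hz * t) * np.exp(-t / t2_s)
    noise = (rng.normal(scale=noise_std, size=t.size)
             + 1j * rng.normal(scale=noise_std, size=t.size))
    return clean + noise

def average_transients(transients):
    """Signal averaging: arithmetic mean of all recorded transients."""
    return np.mean(transients, axis=0)

def apodize(fid, t, lb_hz):
    """Exponential apodization: multiplying by exp(-pi*lb*t) speeds up the decay,
    trading resolution (broader lines) for signal-to-noise ratio."""
    return fid * np.exp(-np.pi * lb_hz * t)

def spectrum(fid, dt):
    """Fourier transform to the frequency domain; returns (frequency axis, real part)."""
    freqs = np.fft.fftshift(np.fft.fftfreq(fid.size, d=dt))
    spec = np.fft.fftshift(np.fft.fft(fid))
    return freqs, spec.real

rng = np.random.default_rng(0)
dt = 1e-3                          # hypothetical dwell time in seconds
t = np.arange(2048) * dt
transients = [simulate_fid(100.0, 0.2, t, 0.5, rng) for _ in range(64)]
fid_avg = average_transients(transients)
freqs, spec = spectrum(apodize(fid_avg, t, lb_hz=2.0), dt)
peak_hz = freqs[np.argmax(spec)]
print(f"peak found near {peak_hz:.1f} Hz")
```

Increasing `lb_hz` in this sketch suppresses more of the late, noise-dominated part of the FID at the cost of a broader peak, which is the trade-off described above.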
Xyloglucan Oligosaccharides.
Complex carbohydrates are important biomolecules that play a role in many biological functions, such as providing physical strength (connective tissue in animals and woody tissue in plants) and serving as a source of energy reserves (glycogen in animals and starch in plants). These molecules are also known to be directly and widely involved in biological recognition and regulatory processes in normal growth and development as well as in disease processes. The recent discovery of the role of complex carbohydrates in disease processes, and therefore in drug development, has triggered a large number of studies aimed at better understanding the role of abnormal (structurally altered) complex carbohydrates in disease development. For this reason, an automated identification system for complex carbohydrates can eliminate the many man-hours wasted in duplicated efforts in structural characterization of known carbohydrates.

A specific group of these molecules from the plant cell wall are called xyloglucan oligosaccharides. The 1H-NMR spectra of these molecules resemble each other to a great degree, and the experiments in developing an automated identification system for these spectra are a good indicator for the success of future projects for automated identification of other molecules.
2. Method:
Two pattern recognition techniques were studied in this project, namely Bayesian classification [8] and artificial neural networks [9]. Multidimensional Parzen density estimation [10] was used to estimate the a priori probability density functions required for Bayes classification. For the ANN experiments, a feed-forward, 2-stage network trained with the back-propagation learning algorithm was used to produce an identification system. Both identification systems were built using 30 spectra representing 30 unique xyloglucan oligosaccharides (training set), and tested with 30 newly recorded spectra of the same oligosaccharides in addition to 45 1H-NMR spectra of complex carbohydrates other than xyloglucans. Each spectrum contained 5000 points representing the region between 1.0 and 5.5 ppm (parts per million). Five percent normally distributed noise was dynamically added to the spectra at the beginning of each ANN training epoch to prevent memorization. The same amount of noise was introduced to build the large database of spectra required by Parzen density estimation in order to accurately estimate the multidimensional densities. The optimal network configuration for the ANN was found to be 5000 input, 12 hidden, and 30 output neurons.

A preprocessing step was also implemented and kept constant for both identification techniques. This preprocessing step comprises several signal processing techniques that are intended to eliminate certain, but not all, aberrations present in 1H-NMR spectra of xyloglucan oligosaccharides. The preprocessing step includes interpolation, a running-window low-pass filter for high-frequency noise reduction, a scaling mechanism based on bin analysis for reducing the effects of sample concentration on signal strength, and a piecewise linear baseline correction routine.
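The ANN training scheme described above, with fresh noise added at the start of every epoch, can be sketched as follows. This is a scaled-down illustration under stated assumptions, not the implementation used in this study: the templates are synthetic single-peak "spectra", a 40-6-3 network stands in for the 5000-12-30 configuration, sigmoid units with a squared-error loss are assumed, and "5% noise" is interpreted as Gaussian noise whose standard deviation is 5% of each spectrum's maximum. All function names and hyperparameters are hypothetical.

```python
import numpy as np

def add_noise(spectra, percent, rng):
    """Add zero-mean Gaussian noise; std = percent of each spectrum's maximum (assumption)."""
    scale = (percent / 100.0) * np.abs(spectra).max(axis=1, keepdims=True)
    return spectra + rng.normal(size=spectra.shape) * scale

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, params):
    """Feed-forward pass through the hidden and output layers."""
    w1, b1, w2, b2 = params
    hidden = sigmoid(x @ w1 + b1)
    return hidden, sigmoid(hidden @ w2 + b2)

def train(templates, n_hidden, epochs, lr, noise_percent, rng):
    """Back-propagation on one template per class, re-perturbed every epoch."""
    n_class, n_in = templates.shape
    params = [rng.normal(scale=0.5, size=(n_in, n_hidden)), np.zeros(n_hidden),
              rng.normal(scale=0.5, size=(n_hidden, n_class)), np.zeros(n_class)]
    targets = np.eye(n_class)              # one output neuron per class
    losses = []
    for _ in range(epochs):
        x = add_noise(templates, noise_percent, rng)   # new noise each epoch
        hidden, out = forward(x, params)
        losses.append(float(np.mean((out - targets) ** 2)))
        # Error terms for the output and hidden layers (sigmoid derivatives)
        d_out = (out - targets) * out * (1.0 - out)
        d_hid = (d_out @ params[2].T) * hidden * (1.0 - hidden)
        params[2] -= lr * hidden.T @ d_out
        params[3] -= lr * d_out.sum(axis=0)
        params[0] -= lr * x.T @ d_hid
        params[1] -= lr * d_hid.sum(axis=0)
    return params, losses

rng = np.random.default_rng(0)
axis = np.arange(40)
templates = np.stack([np.exp(-0.5 * ((axis - c) / 2.0) ** 2) for c in (8, 20, 32)])
params, losses = train(templates, n_hidden=6, epochs=3000, lr=0.5,
                       noise_percent=5.0, rng=rng)
_, outputs = forward(templates, params)
predicted = outputs.argmax(axis=1)
print("predicted classes:", predicted)
```

Because the network never sees exactly the same input twice, it cannot simply memorize the 30 training spectra, which is the motivation given above for injecting noise during training.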
3. Results:
The performance of the ANN was compared to that of a multidimensional Bayesian classifier. Table 1 shows the results of the first set of experiments. As can be seen, the performance of both methods was good during training, although the Bayes' classifier misclassified one of the 30 xyloglucans from the training set. The two methods were tested with two testing sets. Testing set 1 included 30 new spectra of the same xyloglucans. Testing set 2 included 45 spectra of carbohydrates other than xyloglucans, and was specifically designed to test the models for false positive errors. For testing set 2, the correct classification was considered to be a "no hit". As can be seen from Table 1, the ANN model performed better on all three data sets. However, both models need improvement to avoid false positives: the ANN model reported 4 false positives, while the Bayes' classifier reported 9.
Table 1. Number of correctly identified complex carbohydrates by the two methods.

Classification Method                                  Training Set   Testing Set 1   Testing Set 2
Artificial Neural Network                              30             30              41
Parzen density estimation / Bayesian Classification    29             28              36

A second set of experiments was conducted to test both models' noise tolerance. Three new testing sets were prepared from the original testing set 1. Each of the new sets contained the original testing spectra perturbed with 5%, 10%, and 15% white noise, respectively. As can be seen from Table 2, the performance of neither model degraded when 5% noise was added. The performance of the Bayes' classifier even improved slightly. We hypothesize that this is due to the fact that both models were built with spectra that were 5% perturbed, and we propose that the models have learned to filter out that level of noise. However, when the noise level was increased, the performance of both models degraded. This was especially evident for the Bayes' classifier. With 15% white noise, the performance of this model degraded to 18 correct identifications out of 30 carbohydrates.
Table 2. Number of correctly identified complex carbohydrates in the presence of white noise.

Classification Method                                  Testing Set 1 + 5% white noise   Testing Set 1 + 10% white noise   Testing Set 1 + 15% white noise
Artificial Neural Network                              30                               28                                23
Parzen density estimation / Bayesian Classification    29                               27                                18
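The Parzen/Bayes side of the noise-tolerance experiment can be sketched in Python as follows. This is an illustration under assumptions, not the code used in this study: the class templates are synthetic, a Gaussian kernel with a hand-picked width `h` is used, equal prior probabilities are assumed, and "p% noise" is taken to mean Gaussian noise whose standard deviation is p% of each spectrum's maximum. The database for each class is built by perturbing its single template with 5% noise, following the description in the Method section. All names are hypothetical.

```python
import numpy as np

def add_noise(spectra, percent, rng):
    """Add zero-mean Gaussian noise; std = percent of each spectrum's maximum (assumption)."""
    scale = (percent / 100.0) * np.abs(spectra).max(axis=1, keepdims=True)
    return spectra + rng.normal(size=spectra.shape) * scale

def parzen_log_density(x, samples, h):
    """Log of the Gaussian-kernel Parzen estimate of p(x | class) from samples (n, d)."""
    d = samples.shape[1]
    exponents = -np.sum((samples - x) ** 2, axis=1) / (2.0 * h * h)
    top = exponents.max()                  # log-sum-exp trick for numerical stability
    log_mean_kernel = top + np.log(np.mean(np.exp(exponents - top)))
    return log_mean_kernel - d * np.log(h * np.sqrt(2.0 * np.pi))

def classify(x, databases, h):
    """Bayes decision with equal priors: the class with the largest estimated density."""
    return int(np.argmax([parzen_log_density(x, db, h) for db in databases]))

rng = np.random.default_rng(1)
axis = np.arange(60)
templates = np.stack([np.exp(-0.5 * ((axis - c) / 2.0) ** 2) for c in (15, 30, 45)])
# 200 noisy copies of each template form that class's Parzen database
databases = [add_noise(np.tile(tpl, (200, 1)), 5.0, rng) for tpl in templates]

correct = {}
for percent in (5.0, 10.0, 15.0):
    test = add_noise(templates, percent, rng)
    predictions = [classify(x, databases, h=0.1) for x in test]
    correct[percent] = sum(int(p == k) for k, p in enumerate(predictions))
    print(f"{percent:.0f}% noise: {correct[percent]} of {len(test)} correct")
```

The three synthetic classes here are far better separated than real xyloglucan spectra, so this sketch only shows the evaluation procedure and is not expected to reproduce the degradation reported in Table 2. It also implements no "no hit" decision, since the rejection rule is not specified in the text.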
4. New Aspects:
Separation of xyloglucan oligosaccharides based on their
5. Conclusions:
Xyloglucan oligosaccharides are a group of closely related plant cell wall complex carbohydrates whose spectra resemble each other to a great degree. The lack of a large number of clean 1H-NMR spectra of these structures has prevented the building of an accurate statistical model to identify them. For instance, Bayes' classification in combination with multidimensional Parzen density estimation did not perform well, mainly because of a very sparse input space and the resulting failure to accurately estimate the distribution functions. We have developed an artificial neural network system that can successfully distinguish between the 1H-NMR spectra of these molecules. Furthermore, this model has not exhibited any instrument-dependent sensitivity.
6. Literature Cited:
1. Van Huffel, S. 1993. Enhanced Resolution Based on Minimum Variance Estimation and Exponential Data Modeling. Signal Processing, 333-355.
2. Van den Boogaarth, A., F. A. Howe, L. M. Rodriges, M. Stubbs, and J. R. Griffiths. 1995. In Vivo 31P MRS: Absolute Concentrations, Signal-to-Noise and Prior Knowledge. NMR in Biomedicine, 87-93.
3. Blumler, P., M. Greferath, B. Blumich, and H. W. Spiess. 1993. NMR Imaging of Objects Containing Similar Substructures. Journal of Magnetic Resonance, Series A, 142-150.
4. Angelidis, P. A. 1996. Spectrum Estimation and the Fourier Transform in Imaging and Spectroscopy. Concepts in Magnetic Resonance.
5. [...] Analytical Chemistry, 2047-2051.
6. Goodacre, R., E. M. Timmins, A. Jones, D. B. Kell, J. Maddock, M. Heginbothom, and J. T. Magee. 1997. On Mass Spectrometer Instrument Standardization and Interlaboratory Calibration Transfer Using Neural Networks. Analytica Chimica Acta, 511-532.
7. Wabuyele, B. W., and P. Harrington. 1995. Quantitative Comparison of Bidirectional Optimal Associative Memories for Background Prediction of Spectra. Chemometrics and Intelligent Laboratory Systems, 51-61.
8. Fukunaga, K. 1990. Introduction to Statistical Pattern Recognition, Second Edition. Academic Press, Boston, San Diego, New York. Chapter Three: Hypothesis Testing, 51-123.
9. Rumelhart, D. E., and J. L. McClelland. 1988. Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vols. 1 and 2. MIT Press, Cambridge, MA.
10. Fukunaga, K. 1990. Introduction to Statistical Pattern Recognition, Second Edition. Academic Press, Boston, San Diego, New York. Chapter Six: Nonparametric Density Estimation, 254-300.
Figure 1. (A) A high-quality 1H-NMR spectrum of a xyloglucan. (B) A poor-quality 1H-NMR spectrum of the same oligosaccharide, with baseline drift, noise, negative signals, and large contaminant and standard signals.
[Figure 1 labels: Standard, Lactate, HDO (Water); panels (A) and (B).]