Aligning Subjective Ratings in Clinical Decision Making

Abstract

In addition to objective indicators (e.g. laboratory values), clinical data often contain subjective evaluations by experts (e.g. disease severity assessments). While objective indicators are more transparent and robust, the subjective evaluation contains a wealth of expert knowledge and intuition. In this work, we demonstrate the potential of pairwise ranking methods to align the subjective evaluation with objective indicators, creating a new score that combines their advantages and facilitates diagnosis. In a case study on patients at risk for developing Psoriatic Arthritis, we illustrate that the resulting score (1) increases classification accuracy when detecting disease presence/absence, (2) is sparse and (3) provides a nuanced assessment of severity for subsequent analysis.


arXiv preprint [stat.AP]

Annika Pick, Sebastian Ginzel, Stefan Rüping, Jil Sander, Ann Christina Foldenauer, and Michaela Köhm

Fraunhofer IAIS, Sankt Augustin, Germany
Fraunhofer IME-TMP, Frankfurt, Germany
firstname.lastname@{iais,ime}.fraunhofer.de
Fraunhofer Cluster of Excellence Immune-Mediated Diseases CIMD, Frankfurt, Germany
Fraunhofer Center for Machine Learning, Sankt Augustin, Germany


Keywords: Clinical Data · Ranking SVM · Data Integration.

In data obtained from clinical studies, it is often challenging to determine the disease status of a patient. There are usually multiple ways for a disease to manifest, and the different manifestations are detected by different examination methods. While the assessment of, e.g., swelling of a specific joint is relatively objective and can also be performed by a non-specialist, determining which symptoms are directly caused by the disease in question, and assembling a complete symptomatic picture, is no trivial task.

In general, an overall assessment is approximated by a numerical disease activity (DA) rating specified by the physician. Although such a rating implicitly contains valuable domain expertise when provided by an experienced specialist, the absence of strong diagnostic criteria makes it highly subjective. For example, the well-established Visual Analog Scale (VAS) is a standard measurement instrument for DA assessment of arthritic diseases, but it has previously been shown that VAS differences of as much as 15 % fall within the expected variance [13].


In contrast to the DA rating, the variables describing individual symptoms are more objective; however, it takes domain knowledge to weight them correctly and assess the resulting score. Our goal is to align the DA rating with symptom variables to combine the advantages of the two scores within a single rating. We implement this by learning to predict relative DA rankings from individual symptoms, using a Ranking SVM [4]. A general challenge is that ratings of the same disease activity may vary widely although the symptoms remain similar. Only patients with significantly larger disease activity express more relevant disease signs leading to substantially different ratings. We address this issue in our method in order to reduce noise.

To evaluate the resulting model, we test the correlation of the new score to the original DA rating and also its ability to predict the most reliable binary examination result (presence or absence of the disease); if our new score captures the severity of disease more accurately than the raw DA rating, it should also distinguish better between absence and presence of disease. Furthermore, we expect a meaningful model to be sparse.

We denote the normalized variables of the dataset describing the clinical features ($m$ symptoms of $n$ patients) as $X = \{x_1, \dots, x_n\}$, $x_i \in \mathbb{R}^m$, and the label (physicians' ratings of DA for all patients) as $y \in \mathbb{R}^n$. We extend the concept of the Ranking SVM [4], which learns pairwise rankings of data points based on pairwise differences. In order to account for inaccuracies in $y$, we adapt the method by training only on pairs where $y$ differs by at least $\delta \in \mathbb{R}$:

$$x^{\mathrm{paired}}_p = x_i - x_j, \qquad y^{\mathrm{paired}}_p = \operatorname{sign}(y_i - y_j) \qquad \text{for } p \in \{(i, j) \mid i < j \text{ and } |y_i - y_j| \geq \delta\}.$$

After training a regular SVM on the set of new data points, we obtain a weighting vector $w$ and the decision function

$$i \text{ has lower DA than } j \;\Leftrightarrow\; w^\top (x_i - x_j) < 0 \;\Leftrightarrow\; w^\top x_i < w^\top x_j.$$

Thus, for a set of new patients $I$ and their respective symptoms $x_i$, $i \in I$, we can calculate $w^\top x_i$ for each of them and use this as a new score that maintains an approximate order according to disease activity, as illustrated by the equation above. The approximation is that only pairs of patients with sufficiently different DA are used for training the weights $w$ of the SVM.

The original Ranking SVM [4] was developed to solve ordinal regression, i.e., regression on fixed categories that have an order. There, data points are paired and an SVM learns to arrange them according to the order of their categories. In comparison, our method can also handle a continuous scale without fixed categories by learning only on clearly distinguishable points.

Another possibility to reflect different levels of confidence regarding the order of different pairs was introduced by Kotsiantis et al. [7], where the pairs are weighted according to their label difference. We may evaluate this approach in future work, although level of confidence and label difference are not necessarily linearly dependent in our use case.

The Ranking SVM has also been used in image recognition to predict the age of humans based on pictures [1]. This problem exhibits similarities to the challenge of DA ratings, since large age differences can easily be identified, but slight differences are hard to detect.
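The thresholded pair construction and the resulting score $w^\top x_i$ described in the method above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names `make_pairs` and `fit_ranking_svm`, the choice of scikit-learn's `LinearSVC` with an L1 penalty and without an intercept, the value of `C`, and the synthetic data are all hypothetical.

```python
import numpy as np
from sklearn.svm import LinearSVC


def make_pairs(X, y, delta):
    """Return difference vectors and order labels for all pairs (i, j),
    i < j, whose ratings differ by at least delta (assumes delta > 0)."""
    i_idx, j_idx = np.triu_indices(len(y), k=1)   # every pair with i < j
    keep = np.abs(y[i_idx] - y[j_idx]) >= delta   # discard pairs with similar ratings
    i_idx, j_idx = i_idx[keep], j_idx[keep]
    X_paired = X[i_idx] - X[j_idx]                # x_i - x_j
    y_paired = np.sign(y[i_idx] - y[j_idx])       # +1 if i is rated higher, else -1
    return X_paired, y_paired


def fit_ranking_svm(X, y, delta, C=1.0):
    """Fit a linear SVM on the paired data and return the weight vector w."""
    X_paired, y_paired = make_pairs(X, y, delta)
    # Assumptions: L1 penalty for sparsity; no intercept, matching the
    # decision rule w^T (x_i - x_j) < 0.
    svm = LinearSVC(penalty="l1", dual=False, fit_intercept=False,
                    C=C, max_iter=10_000)
    svm.fit(X_paired, y_paired)
    return svm.coef_.ravel()


# Synthetic stand-in for clinical data (hypothetical, for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))                      # 60 patients, 5 normalized symptoms
y = 50 + 10 * (2 * X[:, 0] - X[:, 1]) + rng.normal(scale=5, size=60)  # DA-like ratings

w = fit_ranking_svm(X, y, delta=15)
score = X @ w                                     # new score: higher means higher DA
print("nonzero coefficients:", np.count_nonzero(w))
print("correlation with rating:", np.corrcoef(score, y)[0, 1])
```

Because only the sign of $w^\top (x_i - x_j)$ matters, the score is meaningful as an ordering of patients rather than as a value on the original rating scale.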

We evaluated our method in the context of the skin condition psoriasis and the risk of developing Psoriatic Arthritis (PsA). Although psoriasis is closely related to PsA (as 30 % of psoriasis patients will develop PsA over time), a common problem of PsA is the lack of a clear correlation between disease duration, phenotype of skin psoriasis and PsA development. Non-expert physicians frequently misdiagnose PsA due to the multitude of different clinical manifestations and symptoms [11]. If diagnosis is late, patients can develop irreversible musculoskeletal damage [2].

When PsA first emerged, disease activity scores were often derived from those developed for Rheumatoid Arthritis (RA), e.g. DAS28 [3]. However, it has become evident today that despite some similarities, the development and the type of manifestation of RA and PsA are very different and therefore they cannot be evaluated by the same scoring systems. This motivates the development of more accurate assessment scores or tools for PsA.

In a prospective study [6], 391 eligible patients diagnosed with psoriasis vulgaris and at risk of developing PsA were included. About 35 % of them were diagnosed with PsA during the examinations as part of the study.

The presence of PsA is indicated by a binary examination result (PsA detected in physical examination), a rating of disease activity (DA) of PsA on a Visual Analog Scale by the physician, and by various symptom assessments (swollen/tender joints, lab values, etc.). Our goal is to align the subjective disease activity rating with the more objective symptom assessments. Therefore, we measure the correlation between the new score and the original disease activity rating. Additionally, we test the ROC-AUC when using the new score to predict whether the disease is present or not, as indicated by the binary examination result.

We compare the Ranking SVM to two baseline models for mirroring the DA: simple Linear Regression and Support Vector Regression (SVR).
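The two evaluation criteria described above (correlation with the DA rating, and ROC-AUC against the binary examination result) can be computed as in the following sketch. The function name `evaluate_score` and the toy values are hypothetical, and the use of the Pearson correlation is an assumption, since the text does not state which correlation coefficient was used.

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def evaluate_score(score, da_rating, psa_present):
    """Return (correlation with the DA rating, ROC-AUC for disease presence)."""
    corr = np.corrcoef(score, da_rating)[0, 1]    # Pearson correlation (assumed)
    auc = roc_auc_score(psa_present, score)       # uses the ranking only, no threshold
    return corr, auc


# Toy values for four hypothetical patients (not data from the study).
score = np.array([0.1, 0.4, 0.35, 0.8])           # e.g. w^T x_i from a ranking model
da_rating = np.array([10.0, 40.0, 35.0, 80.0])    # physician's rating on a 0-100 scale
psa_present = np.array([0, 0, 1, 1])              # binary examination result

corr, auc = evaluate_score(score, da_rating, psa_present)
baseline_auc = roc_auc_score(psa_present, da_rating)  # raw DA rating used as the score
print(corr, auc, baseline_auc)
```

Passing the raw DA rating to `roc_auc_score` in the same way gives the baseline against which a learned score can be compared.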
When testing the ability to predict the presence/absence of disease, we compare the Ranking SVM to the same two baselines as well as to the original, raw DA rating provided by the physician. All models are $L_1$-regularized (optimized separately) and linear in order to obtain a sparse and interpretable solution, especially because of the medical context. We calculate the DA predictions (via Linear Regression and SVR) and the ranking score (via Ranking SVM) for every patient by using 5-fold cross validation (CV).

[Figure 1, plot not reproduced. Panels: Correlation with DA (RankSVM, Regr., SVR); ROC-AUC (RankSVM, Regr., SVR, Rating); Nonzero Coefficients (RankSVM, SVR).]

Fig. 1. Left: Similar correlation of all models with DA. Center: Scores from Ranking SVM and SVR can predict the binary label better than raw DA ratings; Linear Regression performs worse than raw rating. Right: Ranking SVM needs fewer nonzero coefficients than SVR.

In the Ranking SVM, we set $\delta$ to 15 (DA ratings range from 0 to 100), since this is the minimal clinically important improvement in physician global assessment (absolute value) as determined in [13].

Stability tests (plots omitted due to space limitations) for $\delta$ ranging between 10 and 40 imply that the mean correlation of the new score with DA drops with higher $\delta$, but only by 0.07 in the tested range. The mean ROC-AUC of detecting disease presence is stable (variation < 0.03) and the number of coefficients drops from 31 to 6 when increasing $\delta$ from 10 to 40. Accordingly, we conclude that the choice is not critical, and we chose 15 as the value for $\delta$ due to the semantic reasons explained above.

Figure 1 shows the results of 100 runs with different CV splits, since the inhomogeneity of the limited data set led to high fluctuations in performance. All models correlate similarly with DA, but the Ranking SVM is the most stable. With regard to detecting disease presence, both Ranking SVM and SVR improve on the raw DA rating. However, the Ranking SVM performs slightly better and needs 20 % fewer nonzero coefficients than the SVR, which is highly relevant since obtaining medical information is costly and time-consuming. A pure classifier SVM was trained to predict the binary label as well (plots omitted due to space limitations); although the mean ROC-AUC was 0.04 higher than the ROC-AUC obtained by using the Ranking SVM, the average correlation with DA was lower by 0.17.

We have created a score to reflect disease activity which integrates subjective expert knowledge with (more) objective symptom descriptions and which is able to detect the presence of PsA better than the expert rating alone. It needs fewer than 25 nonzero coefficients on average; compare this to the well-established DAS28 score [3], which needs 58 attributes (assessment of 28 joints for swelling and tenderness, i.e. 56 joint assessments, plus lab value plus physician's rating of DA).

However, this is a work in progress. In future work, we aim to improve the Ranking SVM in several ways. First, by finding a way to integrate pairs below the distance threshold. Second, by improving the evaluation methodology, since the current large number of cross-validations and the limited data set make it impossible to set aside a data set of similar size for fitting the sparsity parameters. Third, we aim to extend the method to other use cases. One of them is that physicians and the patients themselves often rate disease activity differently. For example, Lebwohl et al. [9] show that, in the opinion of psoriasis patients, itching is the factor contributing most to high DA, whereas dermatologists put the highest emphasis on the size and location of skin lesions. The idea is to see this reflected in the weights of the Ranking SVM. Besides that, the alignment of disease activity ratings with symptoms has implications for other complex diseases with an activity rating as well, e.g. multiple sclerosis or schizophrenia.

This publication is a joint work between the Fraunhofer Cluster of Excellence for Immune-Mediated Diseases and the Fraunhofer Center for Machine Learning within the Fraunhofer Cluster for Cognitive Internet Technologies. It has also been partially funded by the Federal Ministry of Education and Research of Germany as part of the competence center for machine learning ML2R (01IS18038B).

We thank the anonymous reviewers for their valuable comments and suggestions.

References

1. Cao, D., Lei, Z., Zhang, Z., Feng, J., Li, S.Z.: Human age estimation using ranking svm. In: Chinese Conference on Biometric Recognition. pp. 324–331. Springer (2012)
2. Haroon, M., Gallagher, P., FitzGerald, O.: Diagnostic delay of more than 6 months contributes to poor radiographic and functional outcome in psoriatic arthritis. Annals of the Rheumatic Diseases 74(6), 1045–1050 (2015). https://doi.org/10.1136/annrheumdis-2013-204858, https://ard.bmj.com/content/74/6/1045
3. Van der Heijde, D., van't Hof, M.A., Van Riel, P., Theunisse, L., Lubberts, E.W., van Leeuwen, M.A., van Rijswijk, M.H., Van de Putte, L.: Judging disease activity in clinical practice in rheumatoid arthritis: first step in the development of a disease activity score. Annals of the rheumatic diseases (11), 916–920 (1990)
4. Herbrich, R., Graepel, T., Obermayer, K.: Support Vector Learning for Ordinal Regression A Risk Formulation for Ordinal Regression. In: Proceedings of the Ninth International Conference on Artificial Neural Networks. pp. 97–102. Edinburgh (1999)
5. Hunter, J.D.: Matplotlib: A 2d graphics environment. Computing in Science & Engineering (3), 90–95 (2007). https://doi.org/10.1109/MCSE.2007.55
6. Koehm, M., Rossmanith, T., Langer, H.E., Backhaus, M., Kaesser, U., Kneitz, C., Wassenberg, S., Burkhardt, H., Behrens, F.: Sat0574 detection of subclinical signs of musculoskeletal inflammation by use of fluorescence-optical imaging technique in patients with psoriasis–data of the first interims analysis of the xciting study (2015)
7. Kotsiantis, S.B., Pintelas, P.E.: A cost sensitive technique for ordinal classification problems. In: Hellenic Conference on Artificial Intelligence. pp. 220–229. Springer (2004)
8. Laasonen, L., Lindqvist, U., Iversen, L., Ejstrup, L., Jonmundsson, T., Ståhle, M., Gudbjornsson, B.: Radiographic scoring systems for psoriatic arthritis are insufficient for psoriatic arthritis mutilans: results from the nordic pam study. Acta Radiologica Open (4), 2058460120920797 (2020)
9. Lebwohl, M.G., Bachelez, H., Barker, J., Girolomoni, G., Kavanaugh, A., Langley, R.G., Paul, C.F., Puig, L., Reich, K., van de Kerkhof, P.C.: Patient perspectives in the management of psoriasis: results from the population-based multinational assessment of psoriasis and psoriatic arthritis survey. Journal of the American Academy of Dermatology (5), 871–881 (2014)
10. McKinney, W., et al.: Data structures for statistical computing in python. In: Proceedings of the 9th Python in Science Conference. vol. 445, pp. 51–56. Austin, TX (2010)
11. Ogdie, A., Nowell, W.B., Applegate, E., Gavigan, K., Venkatachalam, S., de la Cruz, M., Flood, E., Schwartz, E.J., Romero, B., Hur, P.: Patient perspectives on the pathway to psoriatic arthritis diagnosis: results from a web-based survey of patients in the united states. BMC Rheumatology (1), 2 (2020)
12. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al.: Scikit-learn: Machine learning in python. the Journal of machine Learning research, 2825–2830 (2011)
13. Tubach, F., Ravaud, P., Martin-Mola, E., Awada, H., Bellamy, N., Bombardier, C., Felson, D., Hajjaj-Hassouni, N., Hochberg, M., Logeart, I., et al.: Minimum clinically important improvement and patient acceptable symptom state in pain and function in rheumatoid arthritis, ankylosing spondylitis, chronic back pain, hand osteoarthritis, and hip and knee osteoarthritis: results from a prospective multinational study. Arthritis care & research 64