Development of a natural language processing pipeline for assessment of cardiovascular risk in myeloproliferative neoplasms

August 8, 2024

Andrea Duminuco, Joshua Au Yeung, Raj Vaghela, Sukhraj Virdee, Claire Woodley, Susan Asirvatham, Natalia Curto-Garcia, Priya Sriskandarajah, Jennifer O’Sullivan, Hugues de Lavallade, et al.

A central feature of myeloproliferative neoplasms (MPN) is an increased risk of cardiovascular thrombotic complications, and this is the primary determinant for the introduction of cytoreductive therapy.1 The landmark ECLAP study in patients with polycythemia vera (PV) showed that cardiovascular mortality accounted for 45% of all deaths, with a thrombosis incidence rate of 1.7 per 100 persons per year and a cumulative incidence of 4.5% over a median follow-up of 2.8 years.2

Natural language processing (NLP) is a branch of machine learning involving computational interpretation and analysis of human language. CogStack (https://github.com/CogStack) is an open-source software ecosystem that retrieves structured and unstructured components of electronic health records (EHR). The Medical Concept Annotation Toolkit (MedCAT), the NLP component of CogStack, structures clinical free text by disambiguating and capturing synonyms, acronyms, and contextual details, such as negation, subject, and grammatical tense, and by mapping text to Systematized Nomenclature of Medicine–Clinical Terms (SNOMED-CT) concepts. This technique is known as “named entity recognition and linkage” (NER+L). MedCAT has previously been used and validated in many studies to structure EHR data across a range of medical specialties for auditing, observational studies, de-identifying patient records, operational insights, disease modeling, and prediction.38

We employed our NLP pipeline, CogStack and MedCAT, to determine the prevalence and impact of cardiovascular risk factors upon thrombotic events during follow-up. We used CogStack to retrieve outpatient hematology clinic letters and hematology discharge letters. MedCAT was then used for NER+L of relevant clinical free text to the respective SNOMED-CT codes, which were determined by two hematology specialists. The base MedCAT model was trained unsupervised on >18 million EHR documents, and this was further fine-tuned using an 80:20 train:test split with 600 clinician-annotated MPN-specific documents. In this annotation process, hematology specialists read through clinical documents and manually highlight correct words or phrases detected by MedCAT that correspond to the SNOMED-CT concept of interest. Total SNOMED-CT code counts were aggregated and grouped by individual patient, and a unique threshold count was then applied to “infer” the presence of the respective SNOMED-CT code.
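The aggregation and thresholding step can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `infer_concepts`, the plain-text concept labels (standing in for SNOMED-CT codes), and the threshold values are all hypothetical, and the sketch assumes that the threshold may differ per concept.

```python
from collections import Counter, defaultdict

# Hypothetical thresholds: minimum number of detected mentions across a
# patient's documents before a concept is inferred as present.
THRESHOLDS = {"hypertension": 2, "diabetes": 3}
DEFAULT_THRESHOLD = 2  # assumed fallback for concepts without an entry


def infer_concepts(annotations, thresholds=THRESHOLDS, default=DEFAULT_THRESHOLD):
    """Group mention counts by patient and keep concepts meeting their threshold.

    annotations: iterable of (patient_id, concept) pairs, one per mention
    detected by the NER+L step (labels stand in for SNOMED-CT codes).
    """
    counts = defaultdict(Counter)
    for patient_id, concept in annotations:
        counts[patient_id][concept] += 1

    return {
        patient_id: {
            concept
            for concept, n in concept_counts.items()
            if n >= thresholds.get(concept, default)
        }
        for patient_id, concept_counts in counts.items()
    }


# Toy input: one tuple per detected mention.
annotations = [
    ("p1", "hypertension"), ("p1", "hypertension"), ("p1", "diabetes"),
    ("p2", "diabetes"), ("p2", "diabetes"), ("p2", "diabetes"),
]
result = infer_concepts(annotations)
print(result)
```

In this toy input, the single mention of diabetes for patient `p1` falls below its threshold and is not inferred, which is the kind of one-off or incidental mention that a count threshold is meant to filter out.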
