Preprint Article, Version 1. This version is not peer-reviewed.

Exploring the Potential of Neural Machine Translation for Cross-Language Clinical NLP Resource Generation through Annotation Projection

Version 1: Received: 2 August 2024 / Approved: 8 August 2024 / Online: 8 August 2024 (16:50:40 CEST)

How to cite: Rodriguez Miret, J.; Farre Maduell, E.; Lima Lopez, S.; Vigil Gimenez, L.; Briva-Iglesias, V.; Krallinger, M. Exploring the Potential of Neural Machine Translation for Cross-Language Clinical NLP Resource Generation through Annotation Projection. Preprints 2024, 2024080616. https://doi.org/10.20944/preprints202408.0616.v1

Abstract

Recent advances in neural machine translation (NMT) offer promising potential for generating cross-language clinical natural language processing (NLP) resources. There is a pressing need for clinical NLP tools that extract key clinical entities in a comparable way across a multitude of medical application scenarios, but their development is hindered by the lack of multilingual annotated data. This study explores the efficacy of combining NMT and annotation projection techniques with expert-in-the-loop validation to develop named entity recognition (NER) systems for an under-resourced target language (Catalan) by leveraging Spanish clinical corpora annotated by domain experts. We employed a state-of-the-art NMT system to translate three clinical case corpora. The translated annotations were then projected onto the target-language texts and subsequently validated and corrected by clinical domain experts. The efficacy of the resulting NER systems was evaluated against manually annotated test sets in the target language. Our findings indicate that this approach facilitates the generation of high-quality training data for Catalan and could be extended to other languages, thereby enhancing multilingual clinical NLP resource development. The generated corpora and components are publicly accessible, providing a potentially valuable resource for further research and application in multilingual clinical settings: https://zenodo.org/doi/10.5281/zenodo.13133124.

Keywords

machine translation; annotation projection; clinical NLP; named entity recognition

Subject

Computer Science and Mathematics, Other
