Rodriguez Miret, J.; Farre Maduell, E.; Lima Lopez, S.; Vigil Gimenez, L.; Briva-Iglesias, V.; Krallinger, M. Exploring the Potential of Neural Machine Translation for Cross-Language Clinical NLP Resource Generation through Annotation Projection. Preprints 2024, 2024080616. https://doi.org/10.20944/preprints202408.0616.v1
APA Style
Rodriguez Miret, J., Farre Maduell, E., Lima Lopez, S., Vigil Gimenez, L., Briva-Iglesias, V., & Krallinger, M. (2024). Exploring the Potential of Neural Machine Translation for Cross-Language Clinical NLP Resource Generation through Annotation Projection. Preprints. https://doi.org/10.20944/preprints202408.0616.v1
Chicago/Turabian Style
Rodriguez Miret, J., Vicent Briva-Iglesias and Martin Krallinger. 2024. "Exploring the Potential of Neural Machine Translation for Cross-Language Clinical NLP Resource Generation through Annotation Projection." Preprints. https://doi.org/10.20944/preprints202408.0616.v1
Abstract
Recent advances in neural machine translation (NMT) offer promising potential for generating cross-language clinical natural language processing (NLP) resources. There is a pressing need for clinical NLP tools that extract key clinical entities in a comparable way across a multitude of medical application scenarios, but their development is hindered by the lack of multilingual annotated data. This study explores the efficacy of using NMT and annotation projection techniques, with expert-in-the-loop validation, to develop named entity recognition (NER) systems for an under-resourced target language (Catalan) by leveraging Spanish clinical corpora annotated by domain experts. We employed a state-of-the-art NMT system to translate three clinical case corpora. The translated annotations were then projected onto the target-language texts and subsequently validated and corrected by clinical domain experts. The efficacy of the resulting NER systems was evaluated against manually annotated test sets in the target language. Our findings indicate that this approach not only facilitates the generation of high-quality training data for the target language (Catalan) but also has the potential to be extended to other languages, thereby enhancing multilingual clinical NLP resource development. The generated corpora and components are publicly accessible, potentially providing a valuable resource for further research and application in multilingual clinical settings: https://zenodo.org/doi/10.5281/zenodo.13133124.
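The abstract does not say how annotations are projected, so the following is only a minimal sketch of one common strategy, not the authors' method. It assumes each annotated source mention is translated on its own and then located in the translated document by case-insensitive string matching. The names `Span`, `project_annotations`, and `translate_mention` are hypothetical, and the toy lexicon merely stands in for a call to an NMT system.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Span:
    """Character-offset entity annotation (end is exclusive)."""
    start: int
    end: int
    label: str


def project_annotations(
    source_text: str,
    source_spans: List[Span],
    target_text: str,
    translate_mention: Callable[[str], str],  # hypothetical stand-in for NMT
) -> Tuple[List[Span], List[Span]]:
    """Project source spans onto target_text by matching translated mentions.

    Returns (projected target spans, source spans that could not be located).
    Unlocated spans are the ones a human reviewer would need to handle.
    """
    projected: List[Span] = []
    unresolved: List[Span] = []
    # Per-mention search cursor so repeated mentions map to successive matches.
    cursor: Dict[str, int] = {}

    for span in sorted(source_spans, key=lambda s: s.start):
        mention = source_text[span.start:span.end]
        candidate = translate_mention(mention)
        if not candidate:
            unresolved.append(span)
            continue
        key = candidate.lower()
        pattern = re.compile(re.escape(candidate), re.IGNORECASE)
        match = pattern.search(target_text, cursor.get(key, 0))
        if match is None:
            unresolved.append(span)
            continue
        projected.append(Span(match.start(), match.end(), span.label))
        cursor[key] = match.end()

    return projected, unresolved


# Toy example: the lexicon is illustrative, not real NMT output.
source = "Paciente con neumonía y fiebre."
target = "Pacient amb pneumònia i febre."
lexicon = {"neumonía": "pneumònia", "fiebre": "febre"}


def make_span(text: str, mention: str, label: str) -> Span:
    start = text.index(mention)
    return Span(start, start + len(mention), label)


spans = [make_span(source, "neumonía", "DISEASE"),
         make_span(source, "fiebre", "SYMPTOM")]
projected, unresolved = project_annotations(
    source, spans, target, lambda m: lexicon.get(m, m))
for s in projected:
    print(s.label, repr(target[s.start:s.end]))
```

Spans that end up in `unresolved` are the cases where the translated mention does not appear verbatim in the translated document (for example because of inflection or reordering); in a workflow like the one described above, these are natural candidates for expert validation and correction.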
Keywords
machine translation; annotation projection; clinical NLP; named entity recognition
Subject
Computer Science and Mathematics, Other
Copyright:
This is an open access article distributed under the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.