Assessing models for de-identification of Electronic Discharge Summary Using Machine Learning tools

Garira, Winston; Netshikweta, Rendani; Mudau, Tshilisanani
2024-09-30; 2024-09-30; 2024-09-06
Mudau, T. 2024. Assessing models for de-identification of Electronic Discharge Summary Using Machine Learning tools.
https://univendspace.univen.ac.za/handle/11602/2674
M.Sc. (e-Science)
Department of Mathematical and Computational Sciences

Background: De-identification is a technique that removes identifying information from clinical records in order to protect individual privacy. It reduces the chance that personal information which is collected, processed, distributed, and published can be used to identify the person. Including Machine Learning techniques in the de-identification process substantially improved it over previous methods.

Research Problem: The Electronic Discharge Summary (EDS) has evolved into a significantly improved way of providing discharge summaries, but it contains Protected Health Information (PHI), which poses a risk to patients’ privacy. This makes de-identification mandatory. Several Machine Learning approaches to de-identifying data have been proposed recently. This study applies Machine Learning techniques to determine which model can best de-identify a data set.

Methods: The open source data set from Harvard Medical School was used. This data set contains 889 Electronic Health Records (EHR), 669 for training and 220 for testing. The Conditional Random Fields (CRF), Long Short Term Memory (LSTM) and Random Forest models were used, and the performance of each model was assessed.

Findings: Each model’s performance was assessed at token level by comparing F-measure, Recall and Precision to determine which Machine Learning model performed best.
The Long Short Term Memory model was found to outperform both Conditional Random Fields and Random Forest, with micro-averaged F-measure, Recall and Precision of 99%, and macro-averaged F-measure of 77%, Recall of 73% and Precision of 90%.

1 online resource (ix, 48 leaves)
en
University of Venda
UCTD
Dissertation
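
The Findings report micro-averaged scores of 99% next to much lower macro-averaged scores. A minimal sketch of how such a gap can arise at token level is given below. It is not the dissertation's evaluation code: the function name `token_level_scores`, the labels `O` and `PATIENT`, and the toy data are all hypothetical, and it assumes that non-PHI tokens (`O`) are counted as a class, which the abstract does not state.

```python
from collections import Counter


def token_level_scores(gold, pred):
    """Return micro- and macro-averaged (precision, recall, F1) over token labels."""
    labels = sorted(set(gold) | set(pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1   # token labelled correctly
        else:
            fp[p] += 1   # predicted label was wrong for this token
            fn[g] += 1   # gold label was missed for this token

    def prf(t, f_pos, f_neg):
        precision = t / (t + f_pos) if t + f_pos else 0.0
        recall = t / (t + f_neg) if t + f_neg else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        return precision, recall, f1

    # Micro average: pool the counts of all labels, then compute the scores once.
    micro = prf(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro average: compute the scores per label, then take the unweighted mean.
    per_label = [prf(tp[l], fp[l], fn[l]) for l in labels]
    macro = tuple(sum(s[i] for s in per_label) / len(labels) for i in range(3))
    return micro, macro


# Hypothetical toy data: 96 non-PHI tokens and 4 PHI tokens, 2 of which are missed.
gold = ["O"] * 96 + ["PATIENT"] * 4
pred = ["O"] * 96 + ["PATIENT"] * 2 + ["O"] * 2

micro, macro = token_level_scores(gold, pred)
print("micro P/R/F:", [round(x, 3) for x in micro])
print("macro P/R/F:", [round(x, 3) for x in macro])
```

In this toy case the frequent `O` class dominates the pooled counts, so the micro average stays high, while the macro average gives the rare `PATIENT` class equal weight and drops once half of its tokens are missed.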