Data Mining for Associations in Biomedical Literature
A biomedical research paper can typically be reproduced from a thorough understanding of four of its rhetorical elements: Data, Methods, Software and Findings (DMSF). Although a scientific article is not necessarily organized around these four subtopics, the DMSF information is essential for recreating an accurate summary of a paper's results properly and efficiently. In a machine learning approach to extracting these elements from a medical paper, the natural first step is to classify the sentences in the body of the article into the four categories. Once sentences are classified and grouped by type, further semantic analysis and data mining become easier to perform. For example, summarization is likely to be more effective when sentences of the same type are aggregated and each rhetorical category is summarized separately, rather than when a single summary is attempted for the entire text body.
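The classify-then-group step described above can be illustrated with a minimal sketch. It is not the project's actual classifier: a simple cue-word overlap score stands in for a trained machine learning model, and the cue words, the function names (`classify_sentence`, `group_by_category`) and the example sentences are all hypothetical, chosen only to show how sentences might be routed into the four DMSF categories before any per-category summarization.

```python
import re
from collections import defaultdict

# Hypothetical cue words for each DMSF category. A real system would
# learn these signals from labeled sentences instead of hard-coding them.
CUES = {
    "Data": {"dataset", "samples", "patients", "cohort", "records"},
    "Methods": {"method", "trained", "regression", "analysis", "measured"},
    "Software": {"software", "package", "python", "version", "implemented"},
    "Findings": {"found", "results", "significant", "showed", "increased"},
}


def classify_sentence(sentence):
    """Return the category with the most cue-word matches, or None if no cue matches."""
    tokens = set(re.findall(r"[a-z]+", sentence.lower()))
    scores = {label: len(tokens & cues) for label, cues in CUES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def group_by_category(sentences):
    """Group sentences by predicted category, skipping unclassified ones."""
    groups = defaultdict(list)
    for sentence in sentences:
        label = classify_sentence(sentence)
        if label is not None:
            groups[label].append(sentence)
    return dict(groups)


# Made-up example sentences, one per category.
sentences = [
    "The cohort included 120 patients from two hospitals.",
    "We trained a logistic regression model on the extracted features.",
    "All models were implemented in Python using an open-source package.",
    "We found a significant association between the two genes.",
]

grouped = group_by_category(sentences)
for label, members in grouped.items():
    print(label, "->", members)
```

Each list in `grouped` could then be passed to a summarizer on its own, which is the per-category aggregation the paragraph above motivates.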
The GitHub repository for the project, containing the relevant scripts, reports and presentation materials, can be found here.