Symbolic Methods project: Domain Coverage and SemRep Evaluation for Knowledge Integration in Pediatric Oncology
This GitHub repository contains the code and data for our project.
The Java-based API used to run SemRep is not included in this repo but can be downloaded and installed here
Note that access to this tool requires a UMLS Terminology Services (UTS) account.
These Python files split the data into multiple files to make it compatible with SemRep (an input file cannot exceed 10,000 characters per request).
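The splitting step can be sketched as follows. This is a minimal illustration, not the code in this repo: the function name `split_text` and the choice to break at line boundaries are assumptions; only the 10,000-character limit comes from the description above.

```python
def split_text(text, max_chars=10_000):
    """Split text into chunks of at most max_chars characters.

    Hypothetical helper: breaks at line boundaries when possible and cuts
    mid-line only when a single line is longer than the limit.
    """
    chunks, current = [], ""
    for line in text.splitlines(keepends=True):
        # A single line longer than the limit has to be cut mid-line.
        while len(line) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(line[:max_chars])
            line = line[max_chars:]
        # Start a new chunk if this line would push the current one over the limit.
        if len(current) + len(line) > max_chars:
            chunks.append(current)
            current = ""
        current += line
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk can then be written to its own file and submitted as a separate request.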
This text file contains all the code used to run the SemRep WebAPI. The steps involved in this process include:
- Remove non-ASCII characters
- Run SemRep WebAPI (compile first if changes were made)
- Triple extraction part 1: keep rows with "relation"
- Triple extraction part 2: run get_triples.py script to get triples
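The cleaning step and the first triple-extraction step above can be sketched in Python. This is an illustrative equivalent, not the commands stored in semrep_command.txt; the function names are hypothetical, and the sketch assumes that a plain substring test for "relation" is enough to identify the rows to keep.

```python
def strip_non_ascii(text):
    """Drop every character outside the 7-bit ASCII range."""
    return text.encode("ascii", errors="ignore").decode("ascii")


def keep_relation_rows(lines):
    """Keep only the output rows that contain the word 'relation'.

    Assumption: a substring match is sufficient to select these rows.
    """
    return [line for line in lines if "relation" in line]
```

Dropping characters rather than transliterating them is the simplest option; it can leave gaps in words that contained accented letters or Greek symbols.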
This Python script extracts triples from the files created by running the blocks of code in semrep_command.txt. Before running it, manually specify the input files inside the Python file.
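As a rough illustration of what triple extraction involves, the sketch below pulls (subject, predicate, object) tuples out of delimited rows. It is not the code in get_triples.py: the function name, the pipe delimiter, and the column positions are all assumptions, and the column indices are left as parameters because the real output layout should be checked against the SemRep documentation.

```python
def extract_triples(rows, subj_col, pred_col, obj_col, sep="|"):
    """Return (subject, predicate, object) tuples from delimited rows.

    Hypothetical helper: the separator and column indices are supplied by
    the caller and are not taken from SemRep's documented format.
    """
    triples = []
    needed = max(subj_col, pred_col, obj_col)
    for row in rows:
        fields = row.rstrip("\n").split(sep)
        if needed >= len(fields):
            continue  # skip rows that are too short to hold all three fields
        triples.append((fields[subj_col], fields[pred_col], fields[obj_col]))
    return triples
```

For example, with an invented row `"1|relation|Vincristine|TREATS|Leukemia"`, calling `extract_triples(rows, 2, 3, 4)` yields the tuple `("Vincristine", "TREATS", "Leukemia")`.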
This Python script creates all the plots reported in our final paper and final presentation using the data provided in this GitHub repository. Running it writes the plots to the plots directory.
These files show MetaMap output accuracy by highlighting mapped phrases with the following color scheme:
- Yellow: correct match with best concept
- Green: partial match
- Pink: incorrect
- Purple: not mapped
This file contains SemRep and MetaMap evaluation results.
The different worksheets in this file contain the following information:
- PubMed MetaMap Matches, NCT MetaMap Matches, and Textbook MetaMap Matches: which parts of the gold standard concepts were mapped by MetaMap (shown bolded)
- MetaMap Comparison:
  - Summary statistics for MetaMap matchings of gold standard concepts (from PubMed MetaMap Matches, NCT MetaMap Matches, and Textbook MetaMap Matches)
  - Summary statistics for accuracy of MetaMap output (data from Metamap folder)
- SemRep Internal Eval: each automatically extracted SemRep triple is assigned to one of the following categories:
  - True and useful (T+)
  - True but not useful (T-)
  - False (F)
- SemRep Comparison: summary statistics for SemRep vs. gold standard (used to calculate precision and recall)
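For reference, precision and recall against a gold standard can be computed as in the sketch below. This is a generic illustration under the assumption that an extracted triple counts as correct only when it exactly matches a gold standard triple; the worksheet's own matching criteria may differ, and the function name is hypothetical.

```python
def precision_recall(predicted, gold):
    """Compute precision and recall of predicted triples against gold triples.

    Assumption: triples are compared by exact equality.
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    # Precision: share of predicted triples that appear in the gold standard.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of gold standard triples that were predicted.
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall
```

Returning 0.0 for an empty set avoids a division by zero; whether that is the right convention depends on how empty inputs should be reported.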
These files contain triples automatically extracted via the SemRep WebAPI from all the data.
These files contain triples automatically extracted via the SemRep WebAPI from the subset of data that was also used to create the gold standard triples.
These files contain triples manually extracted from the subset of data.
This file contains examples of desired ontologic predications for pediatric ALL, used to make an ideal semantic network.