This post walks through the steps to create a demo web application, https://drug-portal.appspot.com, built around an entity extraction model trained with the Python spacy library to reduce the time manufacturers spend coding adverse event reports.
Drug Adverse Event Background
Drug adverse events are side effects experienced by a consumer of a drug. When a drug consumer experiences complications from one or more drugs, healthcare professionals, consumers, and manufacturers submit reports as unstructured notes to the FDA or the drug manufacturer. The drug manufacturer must then code the report for side effects, and may then update the drug product label for consumer use.
Product labels of drugs on the market, available from the FDA, provide an enormous opportunity for further drug development. This real-world data can enable researchers to
determine new purposes for a drug (i.e. new indications)
predict the performance of new drugs with similar chemicals
identify drug interactions
identify post-market adverse reactions not found in clinical trials
identify causal relationships between drugs and adverse reactions
In this post we build an entity extraction service that identifies adverse events in FDA drug labels. We walk through engineering a training dataset from a database of tagged drug labels, training a Python spacy named-entity-recognition (NER) pipeline, and deploying a web service that predicts adverse event entities in free text.
Building an Entity Extraction Model with Spacy
Demner-Fushman et al. (2018) provide an exhaustive labeled dataset of adverse reactions for 200 labels of the most recently approved drugs. Their dataset is intended as a benchmark for other models, with benchmarking resources provided here. In this demo, however, we use their database to train our model.
For each drug label they provided a structured XML document annotated with entity labels, e.g. Severity or AdverseReaction. For example, for the AdreView drug label a sample of the label and annotations look like the following:
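An illustrative sketch of the annotation format (not the actual AdreView excerpt): each document pairs section text with offset-based `Mention` tags. The drug name, section text, and offsets below are made up for illustration.

```xml
<Label drug="adreview">
  <Text>
    <Section id="S1">Severe nausea and vomiting were observed.</Section>
  </Text>
  <Mentions>
    <Mention section="S1" type="Severity" start="0" len="6" str="Severe"/>
    <Mention section="S1" type="AdverseReaction" start="7" len="6" str="nausea"/>
    <Mention section="S1" type="AdverseReaction" start="18" len="8" str="vomiting"/>
  </Mentions>
</Label>
```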
Download the full labeled dataset here.
Transforming Annotated Documents
First I transformed the labeled dataset into a format suitable for training a Python spacy pipeline. For the example XML above, the training data is formatted as follows:
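spacy's annotation scheme pairs each text with character-offset entity spans. A sketch with illustrative text and spans:

```python
# spacy's offset annotation scheme:
# (text, {"entities": [(start_char, end_char, label), ...]})
# The sentence and spans here are illustrative.
TRAIN_DATA = [
    (
        "Severe nausea and vomiting were observed.",
        {
            "entities": [
                (0, 6, "Severity"),
                (7, 13, "AdverseReaction"),
                (18, 26, "AdverseReaction"),
            ]
        },
    ),
]

# Sanity check: slice each span back out of the text
text, annotations = TRAIN_DATA[0]
for start, end, label in annotations["entities"]:
    print(f"{text[start:end]!r} -> {label}")
```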
I uploaded the annotated XML documents to Google Cloud Storage. Next I loaded the annotated documents in the following way
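One simple way to pull the documents down from Cloud Storage is with `gsutil` (the bucket name and prefix below are illustrative):

```shell
# Copy the annotated XML documents down from a GCS bucket
gsutil -m cp "gs://my-bucket/annotated-labels/*.xml" ./data/
```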
I parsed the XML into the spacy annotated offset scheme.
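A sketch of the transformation using only the standard library. The inline XML is a simplified stand-in: the real documents use a similar `Mention` element with `start`/`len` offsets, but also contain multiple sections and discontinuous spans that need extra handling.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for one annotated drug label document
SAMPLE_XML = """<Label drug="example">
  <Text>
    <Section id="S1">Severe nausea and vomiting were observed.</Section>
  </Text>
  <Mentions>
    <Mention section="S1" type="Severity" start="0" len="6"/>
    <Mention section="S1" type="AdverseReaction" start="7" len="6"/>
    <Mention section="S1" type="AdverseReaction" start="18" len="8"/>
  </Mentions>
</Label>"""

def xml_to_offsets(xml_string):
    """Convert one annotated label into spacy's (text, {"entities": ...}) scheme."""
    root = ET.fromstring(xml_string)
    sections = {s.get("id"): s.text for s in root.find("Text")}
    data = []
    for sec_id, text in sections.items():
        entities = [
            (int(m.get("start")),
             int(m.get("start")) + int(m.get("len")),
             m.get("type"))
            for m in root.find("Mentions")
            if m.get("section") == sec_id
        ]
        data.append((text, {"entities": entities}))
    return data

DATA = xml_to_offsets(SAMPLE_XML)
```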
Finally, I split the data into training and test sets.
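A simple shuffled split works here; the 80/20 ratio is an assumption, and `DATA` is shown with placeholder items so the snippet runs standalone:

```python
import random

# DATA holds (text, annotations) pairs produced from the parsed XML;
# placeholder items stand in for the real parsed documents here.
DATA = [(f"document {i}", {"entities": []}) for i in range(10)]

random.seed(42)                 # make the split reproducible
random.shuffle(DATA)
split = int(len(DATA) * 0.8)    # 80/20 train/test split (assumed ratio)
train_data, test_data = DATA[:split], DATA[split:]
```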
Training the Entity Extraction Model
Next, I trained a new NER pipeline in spacy, passing in the training data and entity labels. Note that I was not concerned with the entities in the default spacy model, so I did not include training examples from it; for more info, read about the catastrophic forgetting problem.
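A minimal training sketch using the spaCy 3.x API (the original post likely used an earlier version, whose training loop differs); the example sentences, label names, epoch count, and output path are illustrative:

```python
import random
import spacy
from spacy.training import Example  # spaCy 3.x API

# Illustrative training examples in the offset scheme
TRAIN_DATA = [
    ("Severe nausea and vomiting were observed.",
     {"entities": [(0, 6, "Severity"),
                   (7, 13, "AdverseReaction"),
                   (18, 26, "AdverseReaction")]}),
    ("Patients reported headache after the first dose.",
     {"entities": [(18, 26, "AdverseReaction")]}),
]

# Start from a blank English pipeline: we only care about our own entity
# types, which sidesteps the catastrophic forgetting problem entirely.
nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")
for _, ann in TRAIN_DATA:
    for _, _, label in ann["entities"]:
        ner.add_label(label)

optimizer = nlp.initialize()
for _ in range(20):             # epochs; tune for the real dataset
    random.shuffle(TRAIN_DATA)
    losses = {}
    for text, ann in TRAIN_DATA:
        example = Example.from_dict(nlp.make_doc(text), ann)
        nlp.update([example], sgd=optimizer, losses=losses)

nlp.to_disk("adverse_event_model")  # saved for deployment later
```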
I saved the model to disk to be deployed later.
Testing the Model
I evaluated the efficacy of the model against the held-out test data.
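One simple way to score a held-out set is exact-span precision and recall over predicted versus gold entities (a sketch; spacy also ships its own Scorer). The spans below are hypothetical:

```python
def ner_precision_recall(predicted, gold):
    """Exact-span precision/recall over (start, end, label) tuples."""
    pred_set, gold_set = set(predicted), set(gold)
    tp = len(pred_set & gold_set)          # spans matched exactly
    precision = tp / len(pred_set) if pred_set else 0.0
    recall = tp / len(gold_set) if gold_set else 0.0
    return precision, recall

# Hypothetical predicted vs. gold spans for one document: every prediction
# is correct (precision 1.0) but one gold span is missed (recall 2/3).
gold = [(0, 6, "Severity"), (7, 13, "AdverseReaction"), (18, 26, "AdverseReaction")]
pred = [(0, 6, "Severity"), (7, 13, "AdverseReaction")]

p, r = ner_precision_recall(pred, gold)
print(p, r)
```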
We find a recall of 96% and a precision of 100%. The model performs well, however a precision of 100% is suspicious. Obviously I need to further investigate the test/train data or the evaluation method.
Deploying the Model
Docker makes it easier to distribute the model, which is quite large, on the order of a few hundred megabytes. We can pull the docker image and copy the model out.
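A sketch of pulling the image and extracting the model files (the image name and in-image path are illustrative):

```shell
# Pull the model image and copy the model out of a stopped container
docker pull gcr.io/my-project/adverse-event-model
docker create --name tmp-model gcr.io/my-project/adverse-event-model
docker cp tmp-model:/model ./adverse_event_model
docker rm tmp-model
```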
The Dockerfile for the model image is simple. It includes steps to copy a model from the host to the docker image.
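A minimal sketch of such a Dockerfile (base image and paths are illustrative):

```dockerfile
# The model image only needs to carry the trained model files
FROM alpine:3
# Copy the trained spacy model from the host into the image
COPY adverse_event_model /model
```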
Next build the docker image with the model and push it to Google Container Registry (GCR).
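The build-and-push step looks roughly like this (the project id and image name are illustrative):

```shell
# Build the model image and push it to Google Container Registry
docker build -t gcr.io/my-project/adverse-event-model .
docker push gcr.io/my-project/adverse-event-model
```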
The model is now available to incorporate into a runtime image. You can find the public repo here.
Building and Deploying a Web App for Entity Extraction
In this demo I serve the app using Google Cloud App Engine. The repo is located here. The app is a lightweight Flask app running in a custom docker container.
The App Engine configuration app.yaml looks like the following:
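A sketch of an app.yaml for a custom container on the App Engine flexible environment (the resource values are illustrative):

```yaml
runtime: custom
env: flex

# More CPU and memory than the default, since the NER model
# is loaded directly into app memory
resources:
  cpu: 2
  memory_gb: 4
```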
Note I specified the CPU and memory to be higher than the default since the app directly loads the NER model into memory.
I built the docker image locally and pushed the image to GCR.
Finally I deployed the app to App Engine specifying the docker image to use:
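The deploy command looks roughly like this (the project id and image name are illustrative):

```shell
# Deploy to App Engine using the image already pushed to GCR
gcloud app deploy app.yaml --image-url=gcr.io/my-project/drug-portal
```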
We can now provide free text for the engine to predict adverse event entity labels. The app is deployed to the following URL:
Example Text for Entity Extraction
Submitting the following text from a drug label:
Predicts the following entities:
You can also retrieve entities programmatically through a simple JSON API at https://drug-portal.appspot.com/ner/drug.json. Submit a request like the following:
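Assuming the endpoint accepts a POST with a JSON `text` field (the request shape and example sentence are illustrative):

```shell
# Submit free text and receive predicted adverse event entities back
curl -X POST "https://drug-portal.appspot.com/ner/drug.json" \
  -H "Content-Type: application/json" \
  -d '{"text": "Severe nausea and vomiting were observed."}'
```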
This post walked through the steps to create a demo web application using an entity extraction model. I trained a Python spacy named-entity-recognition pipeline on an annotated drug label dataset, then deployed a simple web app to label free text with adverse event labels.
This demo application shows the potential for automatically identifying adverse events in free text.