
Text Preprocessing and Entity Extraction
Chao Liu
2025-09-24
Source:vignettes/Text_Preprocessing.Rmd
This vignette explains the text preprocessing and entity extraction capabilities of the LBDiscover package, which are fundamental steps in the literature-based discovery process.
Introduction
Before applying discovery models, we need to preprocess the text data and extract the entities of interest. These steps transform raw text into structured information that can be used for discovering relationships between biomedical concepts.
Loading the Package
library(LBDiscover)
#> Loading LBDiscover package
Data Retrieval
First, let’s retrieve some sample articles:
# Search for articles about migraines
migraine_articles <- pubmed_search(
query = "migraine pathophysiology",
max_results = 100
)
#> Created pubmed_cache environment for result caching
#> Searching PubMed for: migraine pathophysiology
#> Found 11828 results, retrieving 100 records
#> Fetching batch 1 of 1 (records 1-100)
#> Processing 100 articles
#> Processing article 100 of 100
#> Cached search results for future use
# View the PMID and title of the first three articles
head(migraine_articles[, c("pmid", "title")], 3)
#> pmid
#> 1 40979183
#> 2 40977107
#> 3 40973555
#> title
#> 1 New daily persistent headache with May-Thurner physiology and spinal epidural venous congestion: treatment with ascending lumbar vein embolization.
#> 2 Exploring the Link Between Inflammatory Biomarkers (SII, SIRI, PLR, NLR, LMR) and Migraine in Young and Early Middle-Aged US Adults: Evidence From NHANES 1999-2004 and Machine Learning Models.
#> 3 Migraine: advances in treatment.
Basic Text Preprocessing
The first step is to preprocess the text data to extract meaningful terms:
# Preprocess the abstracts
preprocessed_data <- preprocess_text(
migraine_articles,
text_column = "abstract",
remove_stopwords = TRUE,
custom_stopwords = c("study", "patient", "result", "conclusion"),
min_word_length = 3,
max_word_length = 25
)
#> Tokenizing text...
# View terms extracted from the first document
head(preprocessed_data$terms[[1]], 10)
#> word count
#> 1 additionally 1
#> 2 alv 3
#> 3 artery 1
#> 4 ascending 1
#> 5 associated 3
#> 6 based 1
#> 7 can 1
#> 8 case 2
#> 9 clinical 2
#> 10 common 2
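To make the steps behind a term table like the one above concrete, here is a minimal base R sketch of tokenization, length filtering, stopword removal, and counting. It is not the implementation of `preprocess_text()`; the helper name `count_terms()`, its short stopword list, and the toy sentence are assumptions made for illustration only.

```r
# Hypothetical helper (not part of LBDiscover): tokenize one document
# and count the terms that survive filtering.
count_terms <- function(text,
                        stopwords = c("the", "and", "with", "for"),
                        min_len = 3, max_len = 25) {
  # Lowercase, then split on any run of non-letter characters
  tokens <- unlist(strsplit(tolower(text), "[^a-z]+"))
  # Keep tokens within the allowed length range (also drops empty strings)
  tokens <- tokens[nchar(tokens) >= min_len & nchar(tokens) <= max_len]
  # Drop stopwords
  tokens <- tokens[!tokens %in% stopwords]
  # Count occurrences; table() orders the words alphabetically
  counts <- table(tokens)
  data.frame(word = names(counts), count = as.integer(counts),
             stringsAsFactors = FALSE)
}

toy_terms <- count_terms(
  "Migraine and the brain: migraine pathophysiology of the brain stem."
)
toy_terms
```

The result has the same two columns, `word` and `count`, as the term table shown above.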
Optimized Preprocessing for Large Datasets
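For larger datasets, we can use the vectorized preprocessing function, which processes the documents in chunks. On the 100-article sample used here both functions take about the same time (roughly 0.06 seconds elapsed in the timings below), so any benefit should only be expected on larger corpora: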
# Use optimized vectorized preprocessing
opt_preprocessed_data <- vec_preprocess(
migraine_articles,
text_column = "abstract",
remove_stopwords = TRUE,
min_word_length = 3,
chunk_size = 50 # Process in chunks of 50 documents
)
#> Processing text in 2 chunks...
# Compare processing times
system.time({
preprocess_text(
migraine_articles,
text_column = "abstract",
remove_stopwords = TRUE
)
})
#> Tokenizing text...
#> user system elapsed
#> 0.064 0.000 0.064
system.time({
vec_preprocess(
migraine_articles,
text_column = "abstract",
remove_stopwords = TRUE,
chunk_size = 50
)
})
#> Processing text in 2 chunks...
#> user system elapsed
#> 0.064 0.001 0.065
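The chunking idea can be sketched in base R as follows. This only illustrates how 100 documents with a chunk size of 50 end up in 2 chunks, as reported in the log above; `make_chunks()` is a hypothetical helper and does not reflect how `vec_preprocess()` is implemented internally.

```r
# Hypothetical helper (not part of LBDiscover): split document indices
# into consecutive chunks of at most `chunk_size` documents.
make_chunks <- function(n_docs, chunk_size) {
  idx <- seq_len(n_docs)
  split(idx, ceiling(idx / chunk_size))
}

chunks <- make_chunks(n_docs = 100, chunk_size = 50)
length(chunks)   # number of chunks
lengths(chunks)  # number of documents in each chunk

# Each chunk can then be processed independently and the results combined,
# for example by counting characters per document chunk by chunk.
toy_docs <- rep(c("short text", "a somewhat longer text"), 50)
chars_per_doc <- unlist(lapply(chunks, function(i) nchar(toy_docs[i])),
                        use.names = FALSE)
```

Processing one chunk at a time keeps the intermediate objects small, which is the usual motivation for chunking when a corpus is large.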
Advanced Text Analysis
N-gram Extraction
We can extract n-grams (sequences of n words) to capture multi-word concepts:
# Extract bigrams (2-word sequences)
bigrams <- extract_ngrams(
migraine_articles$abstract,
n = 2,
min_freq = 2
)
# View the most frequent bigrams
head(bigrams, 10)
#> ngram frequency
#> 8000 in the 135
#> 11049 of the 82
#> 11445 p 0 82
#> 7909 in migraine 69
#> 10916 of migraine 69
#> 11819 patients with 58
#> 2825 associated with 44
#> 9707 migraine and 44
#> 17388 with migraine 44
#> 8629 is a 42
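For readers who want to see how n-gram counting works, the following base R sketch slides a window of `n` words over each text and tabulates the results. It is an illustration under stated assumptions, not the code behind `extract_ngrams()`; the helper `count_ngrams()` and the two toy sentences are hypothetical.

```r
# Hypothetical helper (not part of LBDiscover): count word n-grams
# across a character vector of texts.
count_ngrams <- function(texts, n = 2, min_freq = 1) {
  texts <- texts[!is.na(texts)]
  grams <- unlist(lapply(texts, function(txt) {
    # Lowercase and split on runs of non-alphanumeric characters
    words <- unlist(strsplit(tolower(txt), "[^a-z0-9]+"))
    words <- words[nzchar(words)]
    if (length(words) < n) return(character(0))
    # Slide a window of n words over the token sequence
    starts <- seq_len(length(words) - n + 1)
    vapply(starts,
           function(i) paste(words[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  counts <- sort(table(grams), decreasing = TRUE)
  counts <- counts[counts >= min_freq]
  data.frame(ngram = names(counts), frequency = as.integer(counts),
             stringsAsFactors = FALSE)
}

toy_bigrams <- count_ngrams(
  c("Patients with migraine report aura.",
    "Aura is common in patients with migraine."),
  n = 2, min_freq = 2
)
toy_bigrams
```

As the output above shows, raw bigram counts tend to be dominated by function-word pairs such as "in the" and "of the", so filtering stopwords before or after counting is usually worthwhile.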
Sentence Segmentation
Segmenting text into sentences can be useful for more granular analysis:
# Extract sentences from the first abstract
abstracts <- migraine_articles$abstract
first_abstract <- abstracts[1]
# Make sure we have a valid abstract
if (length(first_abstract) == 0 || is.na(first_abstract) || nchar(first_abstract) == 0) {
# Find the first non-empty abstract
valid_idx <- which(!is.na(abstracts) & nchar(abstracts) > 0)
if(length(valid_idx) > 0) {
first_abstract <- abstracts[valid_idx[1]]
cat("First abstract was empty, using abstract #", valid_idx[1], "instead.\n")
} else {
# Create a sample abstract for demonstration
first_abstract <- "This is a sample abstract for demonstration. It contains multiple sentences. Each sentence will be extracted separately."
cat("No valid abstracts found. Using a sample abstract for demonstration.\n")
}
}
# Now segment the valid abstract
sentences <- segment_sentences(first_abstract)
# Check if sentences list has elements before trying to access them
if(length(sentences) > 0 && length(sentences[[1]]) > 0) {
# View the first few sentences
head(sentences[[1]], min(3, length(sentences[[1]])))
} else {
cat("No sentences could be extracted. The abstract might be too short or formatted incorrectly.\n")
}
#> [1] "May-Thurner physiology (MTP) can lead to various congestion syndromes due to compression of the left common iliac vein (LCIV) by the right common iliac artery (RCIA)."
#> [2] "This compression may result in venous reflux through the lumbar vein, leading to congestion of the spinal epidural venous plexus (EVP), which could contribute to refractory headaches."
#> [3] "This case report details the clinical course of a patient with severe refractory new daily persistent headache associated with MTP who underwent ascending lumbar vein (ALV) embolization."
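A simple rule-based splitter shows the basic idea behind sentence segmentation. This is only a sketch: `split_sentences()` is a hypothetical helper, and the rule it uses (break after `.`, `!` or `?` when followed by whitespace and an uppercase letter) is an assumption, not the method used by `segment_sentences()`.

```r
# Hypothetical helper (not part of LBDiscover): naive sentence splitter.
split_sentences <- function(text) {
  # Split on whitespace that follows ., ! or ? and precedes an uppercase letter
  parts <- strsplit(text, "(?<=[.!?])\\s+(?=[A-Z])", perl = TRUE)[[1]]
  trimws(parts)
}

toy_sentences <- split_sentences(
  "This is a sample abstract. It contains multiple sentences. Each is extracted separately."
)
toy_sentences
```

Rules of this kind break on abbreviations followed by a capitalized word. A sentence in the entity extraction output later in this vignette ends at "(e.g.", which suggests that abbreviation handling deserves a check whenever sentence boundaries matter for the analysis.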
Language Detection
When working with multilingual corpora, we can detect the language of each document:
# Take the first five abstracts and drop missing values
abstracts <- migraine_articles$abstract[1:5]
valid_abstracts <- abstracts[!is.na(abstracts)]
# Apply language detection to valid abstracts
if (length(valid_abstracts) > 0) {
  # USE.NAMES = FALSE stops sapply() from using each full abstract as a row name
  languages <- sapply(valid_abstracts, detect_lang, USE.NAMES = FALSE)
  # View results
  data.frame(
    abstract_id = which(!is.na(abstracts)),
    language = languages
  )
} else {
  message("No valid abstracts found for language detection")
}
#> abstract_id language
#> 1 1 en
#> 2 2 en
#> 3 3 en
#> 4 4 en
#> 5 5 en
Entity Extraction
After preprocessing, the next step is to extract biomedical entities from the text.
Loading Entity Dictionaries
First, let’s load entity dictionaries that will be used for entity recognition:
# Load a disease dictionary
disease_dict <- load_dictionary(
dictionary_type = "disease",
source = "mesh"
)
#> Searching MeSH database for: disease[MeSH]
#> Found 193984 PubMed records with matching MeSH terms
#> Error getting MeSH links: Must specify either (not both) 'id' or 'web_history' arguments
#> Trying direct MeSH database search...
#> Processing batch 1 of 5
#> Extracted 20 unique terms from MeSH text format
#> Processing batch 2 of 5
#> Extracted 21 unique terms from MeSH text format
#> Processing batch 3 of 5
#> Extracted 25 unique terms from MeSH text format
#> Processing batch 4 of 5
#> Extracted 19 unique terms from MeSH text format
#> Processing batch 5 of 5
#> Extracted 21 unique terms from MeSH text format
#> Retrieved 102 unique terms from MeSH
#> Sanitizing dictionary with 102 terms...
#> Removed 56 terms that did not match their claimed entity types
#> Removed 38 terms with non-alphanumeric characters (final cleanup)
#> Sanitization complete. 8 terms remaining (7.8% of original)
# Load a drug dictionary
drug_dict <- load_dictionary(
dictionary_type = "drug",
source = "mesh"
)
#> Searching MeSH database for: pharmaceutical preparations[MeSH]
#> Found 999301 PubMed records with matching MeSH terms
#> Error getting MeSH links: Must specify either (not both) 'id' or 'web_history' arguments
#> Trying direct MeSH database search...
#> Processing batch 1 of 2
#> Extracted 23 unique terms from MeSH text format
#> Processing batch 2 of 2
#> Extracted 3 unique terms from MeSH text format
#> Retrieved 26 unique terms from MeSH
#> Sanitizing dictionary with 26 terms...
#> Removed 26 terms that did not match their claimed entity types
#> Sanitization complete. 0 terms remaining (0% of original)
# View a sample of each dictionary
head(disease_dict, 3)
#> term id type source
#> 10 Lobomycosis MESH_10 disease mesh_text
#> 20 Disease MESH_ENTRY_19 disease mesh_text
#> 48 Osteochondrosis MESH_7 disease mesh_text
head(drug_dict, 3)
#> [1] term id type source
#> <0 rows> (or 0-length row.names)
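Because a dictionary can come back empty after sanitization, as `drug_dict` does above, it can help to check for this before combining dictionaries. The sketch below uses toy data frames with the same four columns (`term`, `id`, `type`, `source`) seen in the output; `combine_dictionaries()` and the `TOY_` identifiers are hypothetical and not part of LBDiscover.

```r
# Toy dictionary with the same four columns as the output above
toy_disease <- data.frame(
  term = c("Lobomycosis", "Osteochondrosis"),
  id = c("TOY_1", "TOY_2"),
  type = "disease",
  source = "toy",
  stringsAsFactors = FALSE
)
# Zero-row dictionary with the same columns, mimicking an empty result
toy_drug <- toy_disease[0, ]

# Hypothetical helper: combine dictionaries, warn about empty ones,
# and drop duplicate terms (compared case-insensitively).
combine_dictionaries <- function(...) {
  dicts <- list(...)
  empty <- vapply(dicts, function(d) nrow(d) == 0, logical(1))
  if (any(empty)) {
    warning(sum(empty), " of ", length(dicts), " dictionaries contained no terms")
  }
  if (all(empty)) return(dicts[[1]])
  combined <- do.call(rbind, dicts[!empty])
  combined[!duplicated(tolower(combined$term)), , drop = FALSE]
}

toy_dict <- suppressWarnings(combine_dictionaries(toy_disease, toy_drug))
toy_dict
```

Surfacing the empty dictionary as a warning makes it obvious early on that one entity type cannot produce any matches.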
Basic Entity Extraction
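Now we can extract entities from the text using these dictionaries. Because the drug dictionary is empty after sanitization, only disease terms can be matched in this run: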
# Extract disease and drug entities
entities <- extract_entities(
preprocessed_data,
text_column = "abstract",
dictionary = rbind(disease_dict, drug_dict),
case_sensitive = FALSE,
overlap_strategy = "priority"
)
#> Sanitizing dictionary with 8 terms...
#> Sanitization complete. 8 terms remaining (100% of original)
#> Extracting entities from 100 documents...
#> Processing document 100 of 100
#> Extracted 27 entity mentions:
#> disease: 27
# View some extracted entities
head(entities[, c("doc_id", "entity", "entity_type", "sentence")], 10)
#> doc_id entity entity_type
#> 1 5 Disease disease
#> 2 7 Disease disease
#> 3 12 Disease disease
#> 4 16 Disease disease
#> 5 28 Disease disease
#> 6 31 Disease disease
#> 7 36 Disease disease
#> 8 36 Disease disease
#> 9 38 Disease disease
#> 10 38 Disease disease
#> sentence
#> 1 The lack of efficacy of pharmacologic therapies is a major clinical challenge that requires alternative strategies, including neuromodulation and exploration of new targets to improve disease management.
#> 2 These criteria may also increase attention to this population's disease burden to help advocate for them as a specific migraine subgroup.
#> 3 Cognitive performance is modulated by disease severity, chronification, hormonal fluctuations, psychiatric comorbidities, sleep disturbances and medication use.
#> 4 CONCLUSIONS: These findings indicate the importance of considering individual differences in VM research and may offer insights for precise diagnosis and individualized treatment of the disease.
#> 5 Increasing evidence points to an overlap between migraine and cerebral small vessel disease (SVD), implicating vascular dysfunction in HM pathophysiology.
#> 6 in cerebral autosomal dominant arteriopathy with subcortical infarcts and leukoencephalopathy (CADASIL) or epileptic seizures, or treatment of a primary independent chronic disease (e.g.
#> 7 Disease presentation varies in symptoms and duration, including vertigo, hearing loss, visual disturbances, migraine-like headaches, and central nervous system dysfunction.
#> 8 The current classification of SuS remains unclear due to a lack of pathophysiology and many hypotheses have been suggested, such as genetic predisposition and/or previous immune challenge causing SuS as a secondary disease.
#> 9 RESULTS: Four studies met the inclusion criteria, with an additional eight found through citation analysis that analyzed tens of thousands of patients with migraine disease overall.
#> 10 Six studies demonstrated an association between temperature or temperature changes and migraine disease.
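Dictionary-based matching of the kind shown above can be sketched in a few lines of base R. The example performs case-insensitive, whole-word matching of each dictionary term against each sentence. It is an illustration only: `match_entities()`, the toy sentences, and the toy dictionary are assumptions, and the sketch does not implement the `overlap_strategy` option of `extract_entities()`.

```r
# Hypothetical helper (not part of LBDiscover): find dictionary terms
# in sentences using case-insensitive whole-word matching.
match_entities <- function(sentences, dictionary) {
  hits <- lapply(seq_len(nrow(dictionary)), function(i) {
    # \\Q...\\E quotes the term so regex metacharacters are matched literally;
    # \\b requires a word boundary on both sides of the term
    pattern <- paste0("\\b\\Q", dictionary$term[i], "\\E\\b")
    found <- grepl(pattern, sentences, ignore.case = TRUE, perl = TRUE)
    data.frame(
      sentence_id = which(found),
      entity = rep(dictionary$term[i], sum(found)),
      entity_type = rep(dictionary$type[i], sum(found)),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, hits)
}

toy_sentences <- c(
  "Migraine disease burden is high.",
  "CGRP levels rise during attacks.",
  "Diseased tissue was excluded."
)
toy_dictionary <- data.frame(
  term = c("disease", "CGRP"),
  type = c("disease", "protein"),
  stringsAsFactors = FALSE
)

toy_hits <- match_entities(toy_sentences, toy_dictionary)
toy_hits
```

Matches are only as informative as the dictionary: in the output above, the only term that matched is the generic "Disease", so a more specific dictionary is needed to recover concepts such as individual disorders.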
Complete Entity Extraction Workflow
For a more comprehensive approach, we can use the complete entity extraction workflow:
# Check if running in R CMD check environment.
# Sys.getenv() returns "" (never NULL) for unset variables, so test the values directly.
is_check <- !interactive() &&
  identical(Sys.getenv("R_CHECK_RUNNING"), "true")
# More robust check for testing environment
if (!is_check && nzchar(Sys.getenv("_R_CHECK_LIMIT_CORES_"))) {
  is_check <- TRUE
}
# Set number of cores based on environment
num_cores_to_use <- if (is_check) 1 else 4
# Extract entities using the complete workflow
entities_workflow <- extract_entities_workflow(
preprocessed_data,
text_column = "abstract",
entity_types = c("disease", "drug", "gene", "protein", "pathway"),
dictionary_sources = c("local", "mesh"),
sanitize = TRUE,
parallel = !is_check, # Disable parallel in check environment
num_cores = num_cores_to_use # Use 1 core in check environment
)
#> Running in R CMD check environment. Disabling parallel processing.
#> Creating dictionaries for entity extraction...
#> Loading dictionaries sequentially...
#> Package not installed or dictionary not found. Using example dictionaries.
#> Creating dummy dictionary for disease
#> Added 20 terms from disease (local)
#> Package not installed or dictionary not found. Using example dictionaries.
#> Creating dummy dictionary for drug
#> Added 20 terms from drug (local)
#> Package not installed or dictionary not found. Using example dictionaries.
#> Creating dummy dictionary for gene
#> Added 20 terms from gene (local)
#> Searching MeSH database for: proteins[MeSH]
#> Found 7656074 PubMed records with matching MeSH terms
#> Error getting MeSH links: Must specify either (not both) 'id' or 'web_history' arguments
#> Trying direct MeSH database search...
#> Processing batch 1 of 5
#> Extracted 24 unique terms from MeSH text format
#> Processing batch 2 of 5
#> Extracted 23 unique terms from MeSH text format
#> Processing batch 3 of 5
#> Extracted 19 unique terms from MeSH text format
#> Processing batch 4 of 5
#> Extracted 20 unique terms from MeSH text format
#> Processing batch 5 of 5
#> Extracted 19 unique terms from MeSH text format
#> Retrieved 105 unique terms from MeSH
#> Added 105 terms from protein (mesh)
#> Searching MeSH database for: metabolic networks and pathways[MeSH]
#> Found 188790 PubMed records with matching MeSH terms
#> Error getting MeSH links: Must specify either (not both) 'id' or 'web_history' arguments
#> Trying direct MeSH database search...
#> Processing batch 1 of 1
#> Extracted 6 unique terms from MeSH text format
#> Retrieved 6 unique terms from MeSH
#> Added 6 terms from pathway (mesh)
#> Created combined dictionary with 171 unique terms
#> Sanitizing dictionary with 171 terms...
#> Removed 8 terms with numbers followed by special characters
#> Correcting type for 'headache' from 'disease' to 'symptom'
#> Correcting type for 'fatigue' from 'disease' to 'symptom'
#> Applied 2 type corrections for commonly misclassified terms
#> Removed 107 terms that did not match their claimed entity types
#> Removed 42 terms with non-alphanumeric characters (final cleanup)
#> Sanitization complete. 14 terms remaining (8.2% of original)
#> Extracting entities from 100 documents...
#> Processing batch 1/1
#> Extracting entities from 100 documents...
#> Processing document 100 of 100
#> Extracted 702 entity mentions:
#> disease: 530
#> protein: 17
#> symptom: 155
#> Extracted 702 entity mentions in 0.15 minutes
#> disease: 530
#> protein: 17
#> symptom: 155
# View summary of entity types
table(entities_workflow$entity_type)
#>
#> disease protein symptom
#>     530      17     155
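To put these counts in proportion, the short base R sketch below converts them to percentage shares. The counts are copied by hand from the table above into a named vector (`entity_counts` is a name chosen here for illustration), so it runs without the package or a PubMed connection.

```r
# Counts copied from the table() output above.
entity_counts <- c(disease = 530, protein = 17, symptom = 155)

# Share of each entity type as a percentage of all mentions.
entity_shares <- round(100 * prop.table(entity_counts), 1)
entity_shares
```

Disease mentions dominate this run. No drug, gene, or pathway mentions appear even though those dictionaries were loaded, which is consistent with the log line reporting that only 14 dictionary terms remained after sanitization.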
Customizing Entity Extraction
We can customize entity extraction by supplying additional MeSH queries, a custom dictionary, or both:
# Define custom MeSH queries for different entity types
mesh_queries <- list(
  "disease" = "migraine disorders[MeSH] OR headache disorders[MeSH]",
  "drug" = "analgesics[MeSH] OR serotonin agonists[MeSH] OR anticonvulsants[MeSH]",
  "gene" = "genes[MeSH] OR channelopathy[MeSH]"
)
# Create a custom dictionary
custom_dict <- data.frame(
  term = c("CGRP", "trigeminal nerve", "cortical spreading depression"),
  type = c("protein", "anatomy", "biological_process"),
  id = c("CUSTOM_1", "CUSTOM_2", "CUSTOM_3"),
  source = rep("custom", 3),
  stringsAsFactors = FALSE
)
# Extract entities with custom settings
custom_entities <- extract_entities_workflow(
  preprocessed_data,
  text_column = "abstract",
  entity_types = c("disease", "drug", "gene", "protein", "pathway"),
  dictionary_sources = c("local", "mesh"),
  additional_mesh_queries = mesh_queries,
  custom_dictionary = custom_dict,
  sanitize = TRUE
)
#> Running in R CMD check environment. Disabling parallel processing.
#> Creating dictionaries for entity extraction...
#> Adding 3 terms from custom dictionary
#> Loading dictionaries sequentially...
#> Using cached dictionary for disease (local)
#> Using cached dictionary for drug (local)
#> Using cached dictionary for gene (local)
#> Using cached dictionary for protein (mesh)
#> Using cached dictionary for pathway (mesh)
#> Created combined dictionary with 174 unique terms
#> Sanitizing dictionary with 171 terms...
#> Removed 8 terms with numbers followed by special characters
#> Correcting type for 'headache' from 'disease' to 'symptom'
#> Correcting type for 'fatigue' from 'disease' to 'symptom'
#> Applied 2 type corrections for commonly misclassified terms
#> Removed 107 terms that did not match their claimed entity types
#> Removed 42 terms with non-alphanumeric characters (final cleanup)
#> Sanitization complete. 14 terms remaining (8.2% of original)
#> Extracting entities from 100 documents...
#> Processing batch 1/1
#> Extracting entities from 100 documents...
#> Processing document 100 of 100
#> Extracted 775 entity mentions:
#> anatomy: 1
#> biological_process: 9
#> disease: 530
#> protein: 80
#> symptom: 155
#> Extracted 775 entity mentions in 0.01 minutes
#> anatomy: 1
#> biological_process: 9
#> disease: 530
#> protein: 80
#> symptom: 155
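# Try to view entities that came from the custom dictionary.
# Note: the returned data frame has no `source` column (see the column
# names printed below), so this filter matches no rows.
custom_entities[custom_entities$source == "custom", ]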
#> [1] entity entity_type doc_id start_pos end_pos sentence
#> [7] frequency
#> <0 rows> (or 0-length row.names)
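The empty result above does not mean the custom terms went unmatched: the log reports `anatomy` and `biological_process` mentions, and those types exist only in the custom dictionary. The printed column names show that the result carries no `source` column, so filtering on it selects nothing. One alternative is to filter on the matched term itself. The sketch below uses a small hand-made stand-in for the workflow result (the `entities` data frame and its rows are invented for illustration) and assumes the `entity` column holds the matched term text.

```r
# Toy stand-in for the workflow result; only the columns needed here.
entities <- data.frame(
  entity = c("migraine", "CGRP", "headache", "cortical spreading depression"),
  entity_type = c("disease", "protein", "symptom", "biological_process"),
  doc_id = c(1L, 1L, 2L, 3L),
  stringsAsFactors = FALSE
)

# Terms from the custom dictionary defined earlier in this section.
custom_terms <- c("CGRP", "trigeminal nerve", "cortical spreading depression")

# Case-insensitive membership test, since the casing of extracted
# entities is not guaranteed to match the dictionary.
is_custom <- tolower(entities$entity) %in% tolower(custom_terms)
custom_hits <- entities[is_custom, ]
custom_hits
```

With the real result, the same filter would be `custom_entities[tolower(custom_entities$entity) %in% tolower(custom_dict$term), ]`.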
Dictionary Sanitization
Extraction quality depends heavily on dictionary quality. We can sanitize a dictionary to remove problematic entries before extraction:
# Create a raw dictionary with some problematic entries.
# Note: "NA" below is a literal string used as a placeholder type,
# not a missing value (NA).
raw_dict <- data.frame(
  term = c("migraine", "5-HT", "headache", "the", "and", "patient", "inflammation", "study"),
  type = c("disease", "chemical", "symptom", "NA", "NA", "NA", "biological_process", "NA"),
  id = paste0("ID_", 1:8),
  source = rep("example", 8),
  stringsAsFactors = FALSE
)
# Sanitize the dictionary
sanitized_dict <- sanitize_dictionary(
  raw_dict,
  term_column = "term",
  type_column = "type",
  validate_types = TRUE,
  verbose = TRUE
)
#> Sanitizing dictionary with 8 terms...
#> Removed 1 terms with numbers followed by special characters
#> Removed 1 common non-medical terms, conjunctive adverbs, and general terms
#> Sanitization complete. 6 terms remaining (75% of original)
# View the sanitized dictionary
sanitized_dict
#>           term               type   id  source
#> 1     migraine            disease ID_1 example
#> 3     headache            symptom ID_3 example
#> 4          the                 NA ID_4 example
#> 5          and                 NA ID_5 example
#> 7 inflammation biological_process ID_7 example
#> 8        study                 NA ID_8 example
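In this run, `5-HT` and `patient` were removed, but `the`, `and`, and `study` remain with the placeholder type `"NA"`. If such entries are unwanted, a plain base R filter can drop them afterwards. The sketch below is not part of LBDiscover: it rebuilds the sanitized table by hand from the printed output, and `extra_stopwords` is a hypothetical name introduced for illustration.

```r
# Sanitized dictionary, re-created from the output printed above.
sanitized <- data.frame(
  term = c("migraine", "headache", "the", "and", "inflammation", "study"),
  type = c("disease", "symptom", "NA", "NA", "biological_process", "NA"),
  stringsAsFactors = FALSE
)

# Hypothetical extra stopword list supplied by the user.
extra_stopwords <- c("the", "and", "study")

# Keep rows that have a real type and are not in the stopword list.
keep <- sanitized$type != "NA" &
  !(tolower(sanitized$term) %in% extra_stopwords)
cleaned <- sanitized[keep, ]
cleaned
```

Filtering on both the type and the term is deliberately redundant here: either condition alone removes the three leftover rows, but a real dictionary may contain untyped medical terms or typed stopwords.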
Mapping Terms to Biomedical Ontologies
We can map extracted terms to standard biomedical ontologies like MeSH or UMLS:
# Extract terms to map
terms_to_map <- c("migraine", "headache", "CGRP", "serotonin")
# Map to MeSH
mesh_mappings <- map_ontology(
  terms_to_map,
  ontology = "mesh",
  fuzzy_match = TRUE,
  similarity_threshold = 0.8
)
#> Searching MeSH database for: disease[MeSH]
#> Found 193984 PubMed records with matching MeSH terms
#> Error getting MeSH links: Must specify either (not both) 'id' or 'web_history' arguments
#> Trying direct MeSH database search...
#> Processing batch 1 of 5
#> Extracted 20 unique terms from MeSH text format
#> Processing batch 2 of 5
#> Extracted 21 unique terms from MeSH text format
#> Processing batch 3 of 5
#> Extracted 25 unique terms from MeSH text format
#> Processing batch 4 of 5
#> Extracted 19 unique terms from MeSH text format
#> Processing batch 5 of 5
#> Extracted 21 unique terms from MeSH text format
#> Retrieved 102 unique terms from MeSH
#> Sanitizing dictionary with 102 terms...
#> Removed 56 terms that did not match their claimed entity types
#> Removed 38 terms with non-alphanumeric characters (final cleanup)
#> Sanitization complete. 8 terms remaining (7.8% of original)
#> No matches found for the input terms in the mesh ontology
# View MeSH mappings
mesh_mappings
#> [1] term ontology_id ontology_term match_type
#> <0 rows> (or 0-length row.names)
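The call above returned no mappings; the log shows that only 8 candidate MeSH terms remained after sanitization and none matched the input. To make the role of a similarity threshold concrete, the sketch below scores terms against a small vocabulary using edit distance from base R's `utils::adist()`. This is an illustration only: whether `map_ontology()` computes similarity this way is an assumption, and `string_similarity`, `query_terms`, and `vocabulary` are hypothetical names with made-up contents.

```r
# Normalized similarity in [0, 1]:
# 1 - (edit distance / length of the longer string), case-insensitive.
string_similarity <- function(a, b) {
  a <- tolower(a)
  b <- tolower(b)
  d <- utils::adist(a, b)                      # length(a) x length(b) matrix
  longest <- outer(nchar(a), nchar(b), pmax)   # pairwise longer length
  1 - d / longest
}

query_terms <- c("migrane", "headache", "serotonin")   # first term is misspelled
vocabulary <- c("Migraine", "Headache", "Serotonin", "Calcitonin")

sim <- string_similarity(query_terms, vocabulary)
dimnames(sim) <- list(query_terms, vocabulary)

# For each query term, take the best-scoring vocabulary entry and
# accept it only if it reaches the threshold.
threshold <- 0.8
best <- apply(sim, 1, which.max)
best_sim <- sim[cbind(seq_along(query_terms), best)]
matches <- data.frame(
  term = query_terms,
  match = ifelse(best_sim >= threshold, vocabulary[best], NA_character_),
  similarity = round(best_sim, 3),
  stringsAsFactors = FALSE
)
matches
```

Under this scoring, the misspelled `migrane` is one edit away from an 8-character term, giving a similarity of 0.875, which clears a threshold of 0.8. A stricter threshold such as 0.9 would reject it.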
Topic Modeling
We can also apply topic modeling to discover the main themes in the corpus:
# Extract topics from the corpus
topics <- extract_topics(
  migraine_articles,
  text_column = "abstract",
  n_topics = 5,
  max_terms = 10
)
#> Tokenizing text...
# View top terms for each topic
topics$topics
#> $`Topic 1`
#> term weight
#> receptor receptor 1.9657181
#> receptors receptors 1.9253816
#> cgrp cgrp 1.1020609
#> family family 0.9929711
#> calcitonin calcitonin 0.9924144
#> atogepant atogepant 0.9653846
#> ubrogepant ubrogepant 0.9550745
#> rat rat 0.9461572
#> data data 0.6556991
#> human human 0.6190749
#>
#> $`Topic 2`
#> term weight
#> headache headache 3.176157
#> sleep sleep 2.414030
#> between between 2.409866
#> index index 2.167505
#> quality quality 2.133990
#> function function 1.530117
#> dti dti 1.508792
#> alps alps 1.508792
#> glymphatic glymphatic 1.508792
#> patients patients 1.437619
#>
#> $`Topic 3`
#> term weight
#> light light 0.9386107
#> hypersensitivity hypersensitivity 0.8763295
#> isdn isdn 0.8286756
#> mechanical mechanical 0.8252680
#> cephalic cephalic 0.7710377
#> migraine migraine 0.7508415
#> induced induced 0.5407903
#> both both 0.5389988
#> treatment treatment 0.4254265
#> sexes sexes 0.4178799
#>
#> $`Topic 4`
#> term weight
#> symptoms symptoms 60.43327
#> patients patients 41.65392
#> cyst cyst 35.33360
#> colloid colloid 30.91690
#> aura aura 29.69046
#> neuropsychiatric neuropsychiatric 22.08350
#> clinical clinical 17.85787
#> diagnosis diagnosis 17.16393
#> reported reported 17.04440
#> psychiatric psychiatric 16.86371
#>
#> $`Topic 5`
#> term weight
#> migraine migraine 514.22370
#> pain pain 103.60331
#> chronic chronic 88.83474
#> headache headache 82.94794
#> studies studies 78.83151
#> patients patients 76.21948
#> pathophysiology pathophysiology 57.05345
#> potential potential 53.91623
#> may may 48.62655
#> levels levels 45.15590
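The printed topics include generic words such as `between`, `both`, and `may`, and the weights differ widely in scale between topics (compare Topic 3 with Topic 5). A small post-processing step can make the topics easier to compare. The sketch below assumes, based on the printed output, that `topics$topics` is a named list of data frames with `term` and `weight` columns. It uses a hand-made two-topic excerpt of the values above rather than the real object, and `generic_words` is a hypothetical list chosen for illustration.

```r
# Hand-made excerpt of the printed topics (three terms each).
topic_list <- list(
  "Topic 1" = data.frame(
    term = c("receptor", "cgrp", "data"),
    weight = c(1.9657181, 1.1020609, 0.6556991),
    stringsAsFactors = FALSE
  ),
  "Topic 2" = data.frame(
    term = c("headache", "sleep", "between"),
    weight = c(3.176157, 2.414030, 2.409866),
    stringsAsFactors = FALSE
  )
)

# Stack the per-topic tables into one long table with a `topic` column.
long <- do.call(rbind, lapply(names(topic_list), function(nm) {
  data.frame(topic = nm, topic_list[[nm]], stringsAsFactors = FALSE)
}))
rownames(long) <- NULL

# Rescale weights to shares within each topic so topics are comparable.
long$share <- ave(long$weight, long$topic, FUN = function(w) w / sum(w))

# Drop generic words that carry little topical meaning.
generic_words <- c("between", "both", "may", "data")
filtered <- long[!(long$term %in% generic_words), ]
filtered
```

The shares are computed before filtering, so they describe each term's weight relative to the full topic rather than to the filtered subset.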