
Text Preprocessing and Entity Extraction
Chao Liu
2025-10-05
Source: vignettes/Text_Preprocessing.Rmd
This vignette explains the text preprocessing and entity extraction capabilities of the LBDiscover package. Both are fundamental steps in the literature-based discovery process.
Introduction
Before applying discovery models, we need to preprocess the text data and extract the entities of interest. These steps transform raw text into structured information that can be used for discovering relationships between biomedical concepts.
Loading the Package
library(LBDiscover)
#> Loading LBDiscover package
Data Retrieval
First, let’s retrieve some sample articles:
# Search for articles about migraines
migraine_articles <- pubmed_search(
  query = "migraine pathophysiology",
  max_results = 100
)
#> Created pubmed_cache environment for result caching
#> Searching PubMed for: migraine pathophysiology
#> Found 11848 results, retrieving 100 records
#> Fetching batch 1 of 1 (records 1-100)
#> Processing 100 articles
#> Processing article 100 of 100
#> Cached search results for future use
# View the first three articles
head(migraine_articles[, c("pmid", "title")], 3)
#> pmid
#> 1 41044859
#> 2 41044301
#> 3 41039799
#> title
#> 1 Can atogepant be a preventive treatment for cluster headache?-Insights from a case series.
#> 2 Morphotype-based risk stratification in patients with patent foramen ovale using computational fluid dynamics.
#> 3 The Spectrum of Headaches in Moyamoya Angiopathy: From Mechanisms to Management Strategies-A Consensus Review From the NEUROVASC Working Group.
Basic Text Preprocessing
The first step is to preprocess the text data to extract meaningful terms:
# Preprocess the abstracts
preprocessed_data <- preprocess_text(
  migraine_articles,
  text_column = "abstract",
  remove_stopwords = TRUE,
  custom_stopwords = c("study", "patient", "result", "conclusion"),
  min_word_length = 3,
  max_word_length = 25
)
#> Tokenizing text...
# View terms extracted from the first document
head(preprocessed_data$terms[[1]], 10)
#>          word count
#> 1         300     1
#> 2     adverse     1
#> 3  antagonist     1
#> 4        anti     1
#> 5  antibodies     1
#> 6    approved     1
#> 7   atogepant     4
#> 8     attacks     2
#> 9  calcitonin     1
#> 10       case     3
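To make the preprocessing step concrete, the following is a minimal base-R sketch of tokenizing a text, filtering tokens by length and stop words, and counting the remaining terms. The helper `count_terms()` and its arguments are hypothetical and only mirror the parameter names used above; this is not the implementation behind `preprocess_text()`.

```r
# Hypothetical helper: a base-R approximation of tokenizing and counting terms.
# It is NOT the code used by preprocess_text().
count_terms <- function(text, stopwords = character(0),
                        min_word_length = 3, max_word_length = 25) {
  # Lowercase, then split on anything that is not a letter or digit
  tokens <- strsplit(tolower(text), "[^a-z0-9]+")[[1]]
  # Drop empty strings, tokens outside the length bounds, and stop words
  keep <- nzchar(tokens) &
    nchar(tokens) >= min_word_length &
    nchar(tokens) <= max_word_length &
    !(tokens %in% stopwords)
  # Count the surviving tokens (table() sorts the words alphabetically)
  counts <- table(tokens[keep])
  data.frame(
    word = names(counts),
    count = as.integer(counts),
    stringsAsFactors = FALSE
  )
}

toy_terms <- count_terms(
  "Atogepant reduced attacks in this case; atogepant was well tolerated.",
  stopwords = c("this", "was", "the")
)
toy_terms
```

In this toy example "in" is removed by the minimum length of 3, "this" and "was" are removed as stop words, and "atogepant" is counted twice.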
Optimized Preprocessing for Large Datasets
For larger datasets, we can use the optimized vectorized preprocessing function:
# Use optimized vectorized preprocessing
opt_preprocessed_data <- vec_preprocess(
  migraine_articles,
  text_column = "abstract",
  remove_stopwords = TRUE,
  min_word_length = 3,
  chunk_size = 50 # Process in chunks of 50 documents
)
#> Processing text in 2 chunks...
#>   |======================================================================| 100%
# Compare processing times
system.time({
  preprocess_text(
    migraine_articles,
    text_column = "abstract",
    remove_stopwords = TRUE
  )
})
#> Tokenizing text...
#>    user  system elapsed
#>   0.068   0.000   0.068
system.time({
  vec_preprocess(
    migraine_articles,
    text_column = "abstract",
    remove_stopwords = TRUE,
    chunk_size = 50
  )
})
#> Processing text in 2 chunks...
#>   |======================================================================| 100%
#>    user  system elapsed
#>   0.063   0.000   0.063
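The chunking idea can be sketched in base R as follows. The helper `process_in_chunks()` is hypothetical and only illustrates how 100 documents with a chunk size of 50 end up in 2 chunks; how `vec_preprocess()` organizes its work internally is not shown in this vignette.

```r
# Hypothetical helper illustrating chunked processing; not the package's code.
process_in_chunks <- function(texts, chunk_size, fun) {
  n <- length(texts)
  # Assign each document index to a chunk: 1, 1, ..., 2, 2, ...
  chunk_id <- ceiling(seq_len(n) / chunk_size)
  chunks <- split(seq_len(n), chunk_id)
  results <- vector("list", n)
  for (idx in chunks) {
    # Process one chunk at a time and store results at the original positions
    results[idx] <- lapply(texts[idx], fun)
  }
  results
}

texts <- rep("migraine aura", 100)
chunk_id <- ceiling(seq_along(texts) / 50)
length(unique(chunk_id))  # number of chunks for 100 documents and chunk size 50
out <- process_in_chunks(texts, chunk_size = 50, fun = toupper)
```

Processing in chunks bounds how much intermediate data is held at once, which matters more for large corpora than for the 100 abstracts used here.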
Advanced Text Analysis
N-gram Extraction
We can extract n-grams (sequences of n words) to capture multi-word concepts:
# Extract bigrams (2-word sequences)
bigrams <- extract_ngrams(
  migraine_articles$abstract,
  n = 2,
  min_freq = 2
)
# View the most frequent bigrams
head(bigrams, 10)
#>                   ngram frequency
#> 8215             in the       124
#> 11229            of the        88
#> 11629               p 0        82
#> 11100       of migraine        72
#> 8119        in migraine        68
#> 12020     patients with        59
#> 8794               is a        52
#> 2981    associated with        50
#> 9884       migraine and        43
#> 9980  migraine patients        43
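The most frequent bigrams above are dominated by function words such as "in the" and "of the". One possible remedy is to discard n-grams that contain a stop word before counting. The sketch below does this in base R; `count_ngrams()` and its `stopwords` argument are hypothetical, and whether `extract_ngrams()` offers a comparable option is not covered here.

```r
# Hypothetical helper: base-R n-gram counting; not extract_ngrams() itself.
count_ngrams <- function(texts, n = 2, min_freq = 1, stopwords = character(0)) {
  texts <- texts[!is.na(texts)]
  grams <- unlist(lapply(texts, function(txt) {
    tokens <- strsplit(tolower(txt), "[^a-z0-9]+")[[1]]
    tokens <- tokens[nzchar(tokens)]
    if (length(tokens) < n) return(character(0))
    # Slide a window of width n over the token vector
    starts <- seq_len(length(tokens) - n + 1)
    windows <- lapply(starts, function(i) tokens[i:(i + n - 1)])
    # Drop n-grams that contain any stop word
    windows <- Filter(function(w) !any(w %in% stopwords), windows)
    vapply(windows, paste, character(1), collapse = " ")
  }))
  if (length(grams) == 0) {
    return(data.frame(ngram = character(0), frequency = integer(0)))
  }
  counts <- table(grams)
  out <- data.frame(
    ngram = names(counts),
    frequency = as.vector(counts),
    stringsAsFactors = FALSE
  )
  # Keep frequent n-grams and sort by descending frequency
  out <- out[out$frequency >= min_freq, ]
  out <- out[order(-out$frequency), ]
  rownames(out) <- NULL
  out
}

toy <- c("migraine patients with aura", "patients with migraine", "migraine patients")
res_all <- count_ngrams(toy, n = 2, min_freq = 2)
res_filtered <- count_ngrams(toy, n = 2, min_freq = 2, stopwords = "with")
```

Here `res_all` keeps both "migraine patients" and "patients with", while `res_filtered` keeps only "migraine patients".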
Sentence Segmentation
Segmenting text into sentences can be useful for more granular analysis:
# Extract sentences from the first abstract
abstracts <- migraine_articles$abstract
first_abstract <- abstracts[1]
# Make sure we have a valid abstract. The length check comes first so that
# is.na() and nchar() are only evaluated on a length-one value.
if (length(first_abstract) == 0 || is.na(first_abstract) || nchar(first_abstract) == 0) {
  # Find the first non-empty abstract
  valid_idx <- which(!is.na(abstracts) & nchar(abstracts) > 0)
  if (length(valid_idx) > 0) {
    first_abstract <- abstracts[valid_idx[1]]
    cat("First abstract was empty, using abstract #", valid_idx[1], "instead.\n")
  } else {
    # Create a sample abstract for demonstration
    first_abstract <- "This is a sample abstract for demonstration. It contains multiple sentences. Each sentence will be extracted separately."
    cat("No valid abstracts found. Using a sample abstract for demonstration.\n")
  }
}
# Now segment the valid abstract
sentences <- segment_sentences(first_abstract)
# Check if sentences list has elements before trying to access them
if (length(sentences) > 0 && length(sentences[[1]]) > 0) {
  # View the first few sentences
  head(sentences[[1]], min(3, length(sentences[[1]])))
} else {
  cat("No sentences could be extracted. The abstract might be too short or formatted incorrectly.\n")
}
#> [1] "Cluster headache (CH) is a disabling primary headache disorder with limited therapeutic options."
#> [2] "Calcitonin gene-related peptide (CGRP) is known to be involved in CH pathophysiology; however, except for galcanezumab (300 mg) in episodic CH, anti-CGRP monoclonal antibodies did not reduce CH attacks in randomized clinical trials."
#> [3] "Atogepant is an oral, small-molecule, CGRP receptor antagonist, which is approved for the preventive treatment of migraine. Here, we describe four case reports of CH (two episodic CH and two chronic CH), unresponsive to previous prophylactic treatments, who responded to daily atogepant (60 mg)."
# View the first few sentences
head(sentences[[1]], 3)
#> [1] "Cluster headache (CH) is a disabling primary headache disorder with limited therapeutic options."
#> [2] "Calcitonin gene-related peptide (CGRP) is known to be involved in CH pathophysiology; however, except for galcanezumab (300 mg) in episodic CH, anti-CGRP monoclonal antibodies did not reduce CH attacks in randomized clinical trials."
#> [3] "Atogepant is an oral, small-molecule, CGRP receptor antagonist, which is approved for the preventive treatment of migraine. Here, we describe four case reports of CH (two episodic CH and two chronic CH), unresponsive to previous prophylactic treatments, who responded to daily atogepant (60 mg)."
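For intuition, sentence segmentation can be approximated with a single regular expression. The sketch below is a naive base-R version; `split_sentences()` is a hypothetical helper and does not reproduce the rules used by `segment_sentences()`.

```r
# Hypothetical helper: naive regex-based sentence splitting in base R.
split_sentences <- function(text) {
  # Split on whitespace that follows ., ! or ? and precedes an uppercase letter
  parts <- strsplit(text, "(?<=[.!?])\\s+(?=[A-Z])", perl = TRUE)[[1]]
  trimws(parts)
}

toy_sentences <- split_sentences(
  "A first sentence. A second one! Is this the third?"
)
toy_sentences
```

A rule this simple will also split after abbreviations that are followed by a capitalized word (for example "vs. Placebo"), so real segmenters usually protect known abbreviations before splitting.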
Language Detection
For dealing with multilingual corpora, we can detect the language of each document:
# Take the first five abstracts and drop missing values
abstracts <- migraine_articles$abstract[1:5]
valid_abstracts <- abstracts[!is.na(abstracts)]
# Apply language detection to valid abstracts
if (length(valid_abstracts) > 0) {
  # USE.NAMES = FALSE stops sapply() from naming each result with the full
  # abstract text, which would otherwise become the data frame's row names
  languages <- sapply(valid_abstracts, detect_lang, USE.NAMES = FALSE)
  # View results
  data.frame(
    abstract_id = which(!is.na(abstracts)),
    language = languages
  )
} else {
  message("No valid abstracts found for language detection")
}
#>   abstract_id language
#> 1           1       en
#> 2           2       en
#> 3           3       en
#> 4           4       en
#> 5           5       en
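One simple way to think about language detection is to count how many common function words of each language occur in a text. The sketch below illustrates that heuristic in base R. The helper `guess_language()` and its tiny word lists are hypothetical and chosen only for illustration; the method used by `detect_lang()` is not described in this vignette.

```r
# Hypothetical helper: guess the language by counting common function words.
guess_language <- function(text) {
  # Very small, illustrative word lists (assumptions, not a real lexicon)
  common_words <- list(
    en = c("the", "and", "of", "in", "is", "to", "with"),
    es = c("el", "la", "de", "que", "y", "en", "los"),
    fr = c("le", "la", "de", "et", "les", "des", "est"),
    de = c("der", "die", "und", "das", "ist", "mit", "von")
  )
  tokens <- strsplit(tolower(text), "[^[:alpha:]]+")[[1]]
  # Score each language by the number of tokens found in its word list
  scores <- vapply(common_words, function(w) sum(tokens %in% w), numeric(1))
  if (all(scores == 0)) return(NA_character_)
  names(scores)[which.max(scores)]
}

lang_en <- guess_language("The role of CGRP in the pathophysiology of migraine is well known")
lang_de <- guess_language("Der Kopfschmerz ist mit der Migraene verbunden und das ist bekannt")
```

Short texts and texts that mix languages give unreliable scores with this approach, so it is best treated as a rough filter.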
Entity Extraction
After preprocessing, the next step is to extract biomedical entities from the text.
Loading Entity Dictionaries
First, let’s load entity dictionaries that will be used for entity recognition:
# Load a disease dictionary
disease_dict <- load_dictionary(
  dictionary_type = "disease",
  source = "mesh"
)
#> Searching MeSH database for: disease[MeSH]
#> Found 194026 PubMed records with matching MeSH terms
#> Error getting MeSH links: Must specify either (not both) 'id' or 'web_history' arguments
#> Trying direct MeSH database search...
#> Processing batch 1 of 5
#> Extracted 20 unique terms from MeSH text format
#> Processing batch 2 of 5
#> Extracted 21 unique terms from MeSH text format
#> Processing batch 3 of 5
#> Extracted 25 unique terms from MeSH text format
#> Processing batch 4 of 5
#> Extracted 19 unique terms from MeSH text format
#> Processing batch 5 of 5
#> Extracted 21 unique terms from MeSH text format
#> Retrieved 102 unique terms from MeSH
#> Sanitizing dictionary with 102 terms...
#> Removed 56 terms that did not match their claimed entity types
#> Removed 38 terms with non-alphanumeric characters (final cleanup)
#> Sanitization complete. 8 terms remaining (7.8% of original)
# Load a drug dictionary
drug_dict <- load_dictionary(
  dictionary_type = "drug",
  source = "mesh"
)
#> Searching MeSH database for: pharmaceutical preparations[MeSH]
#> Found 1000447 PubMed records with matching MeSH terms
#> Error getting MeSH links: Must specify either (not both) 'id' or 'web_history' arguments
#> Trying direct MeSH database search...
#> Processing batch 1 of 2
#> Extracted 23 unique terms from MeSH text format
#> Processing batch 2 of 2
#> Extracted 3 unique terms from MeSH text format
#> Retrieved 26 unique terms from MeSH
#> Sanitizing dictionary with 26 terms...
#> Removed 26 terms that did not match their claimed entity types
#> Sanitization complete. 0 terms remaining (0% of original)
# View a sample of each dictionary
head(disease_dict, 3)
#>               term            id    type    source
#> 10     Lobomycosis       MESH_10 disease mesh_text
#> 20         Disease MESH_ENTRY_19 disease mesh_text
#> 48 Osteochondrosis        MESH_7 disease mesh_text
head(drug_dict, 3)
#> [1] term   id     type   source
#> <0 rows> (or 0-length row.names)
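In the run shown above, the drug dictionary is empty after sanitization. One possible workaround is to build a small dictionary by hand with the same four columns (`term`, `id`, `type`, `source`). The sketch below is an assumption-laden example: the identifiers and the `"custom"` source label are made up for illustration, and whether `extract_entities()` accepts such values unchanged has not been verified here.

```r
# Hypothetical fallback: a small hand-built drug dictionary using the same
# four columns shown in the output above (term, id, type, source).
custom_drug_dict <- data.frame(
  term = c("atogepant", "galcanezumab", "aspirin", "lasmiditan"),
  id = paste0("CUSTOM_", 1:4),  # made-up identifiers for illustration only
  type = "drug",
  source = "custom",            # assumed label, not a value defined by the package
  stringsAsFactors = FALSE
)
custom_drug_dict

# It could then be combined with the disease dictionary in the usual way:
# combined_dict <- rbind(disease_dict, custom_drug_dict)
```

The four drug names are taken from the abstracts retrieved earlier in this vignette.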
Basic Entity Extraction
Now we can extract entities from the text using these dictionaries. Because the drug dictionary came back empty in this run, only disease terms can be matched:
# Extract disease and drug entities
entities <- extract_entities(
  preprocessed_data,
  text_column = "abstract",
  dictionary = rbind(disease_dict, drug_dict),
  case_sensitive = FALSE,
  overlap_strategy = "priority"
)
#> Sanitizing dictionary with 8 terms...
#> Sanitization complete. 8 terms remaining (100% of original)
#> Extracting entities from 100 documents...
#> Processing document 100 of 100
#> Extracted 32 entity mentions:
#> disease: 32
# View some extracted entities
head(entities[, c("doc_id", "entity", "entity_type", "sentence")], 10)
#>    doc_id  entity entity_type
#> 1       5 Disease     disease
#> 2       5 Disease     disease
#> 3       9 Disease     disease
#> 4       9 Disease     disease
#> 5       9 Disease     disease
#> 6       9 Disease     disease
#> 7       9 Disease     disease
#> 8      16 Disease     disease
#> 9      21 Disease     disease
#> 10     23 Disease     disease
#> sentence
#> 1 It concludes by describing the disease burden and advocating for significant improvements in healthcare systems, and promoting health equity, as well as reducing stigma.
#> 2 Despite advancements in our understanding of this complex disease, CH continues to pose considerable social, economic, and psychological challenges.
#> 3 Preliminary studies have shown that detailed medical history, audiological, and vestibular function evaluation are effective methods to distinguish Meniere's disease and vestibular migraine diseases.
#> 4 This study retrospectively included 503 patients with vestibular migraine and 1,125 patients with Meniere's disease.
#> 5 In our study, we found that patients with Meniere's disease often exhibit abnormal unilateral weakness, while those with vestibular migraine show changes in vestibular function characterized by labyrinthine hyperactivity.
#> 6 Compared to vestibular migraine patients, Ménière's disease patients are more likely to experience unilateral full-frequency hearing loss.
#> 7 This study systematically compared the vestibular function and audiological characteristics of vestibular migraine and Ménière's disease, providing an evidence-based foundation for clinical differential diagnosis.
#> 8 BACKGROUND: Migraine is a neurovascular disease associated with significant morbidity and disability, but its underlying pathophysiology remains elusive.
#> 9 The lack of efficacy of pharmacologic therapies is a major clinical challenge that requires alternative strategies, including neuromodulation and exploration of new targets to improve disease management.
#> 10 These criteria may also increase attention to this population's disease burden to help advocate for them as a specific migraine subgroup.
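Dictionary-based extraction of this kind can be illustrated with case-insensitive, whole-word matching. The following base-R sketch uses a hypothetical helper, `match_dictionary()`, which is not the algorithm behind `extract_entities()` and ignores options such as `overlap_strategy`.

```r
# Hypothetical helper: case-insensitive, whole-word dictionary matching.
match_dictionary <- function(sentences, dictionary) {
  hits <- lapply(seq_len(nrow(dictionary)), function(i) {
    # Treat the term as a literal string (\Q...\E) bounded by word breaks
    pattern <- paste0("\\b\\Q", dictionary$term[i], "\\E\\b")
    found <- grepl(pattern, sentences, ignore.case = TRUE, perl = TRUE)
    if (!any(found)) return(NULL)
    data.frame(
      entity = dictionary$term[i],
      entity_type = dictionary$type[i],
      sentence = sentences[found],
      stringsAsFactors = FALSE
    )
  })
  # Drop terms without matches, then stack the remaining results
  hits <- Filter(Negate(is.null), hits)
  if (length(hits) == 0) return(NULL)
  do.call(rbind, hits)
}

toy_dict <- data.frame(
  term = c("Disease", "Atogepant"),
  type = c("disease", "drug"),
  stringsAsFactors = FALSE
)
toy_sents <- c(
  "Migraine is a neurovascular disease.",
  "Daily atogepant (60 mg) was effective.",
  "Diseased tissue was absent."
)
toy_hits <- match_dictionary(toy_sents, toy_dict)
toy_hits
```

The word boundaries keep "Diseased" from matching the term "Disease", while `ignore.case = TRUE` lets "Disease" match "disease".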
Complete Entity Extraction Workflow
For a more comprehensive approach, we can use the complete entity extraction workflow:
# Detect whether the code is running under R CMD check.
# Sys.getenv() returns "" (never NULL) for unset variables, so test with
# nzchar() or a direct comparison rather than is.null().
is_check <- !interactive() &&
  identical(Sys.getenv("R_CHECK_RUNNING"), "true")
# More robust check for testing environment
if (!is_check && nzchar(Sys.getenv("_R_CHECK_LIMIT_CORES_"))) {
  is_check <- TRUE
}
# Set number of cores based on environment
num_cores_to_use <- if (is_check) 1 else 4
# Extract entities using the complete workflow
entities_workflow <- extract_entities_workflow(
  preprocessed_data,
  text_column = "abstract",
  entity_types = c("disease", "drug", "gene", "protein", "pathway"),
  dictionary_sources = c("local", "mesh"),
  sanitize = TRUE,
  parallel = !is_check, # Disable parallel in check environment
  num_cores = num_cores_to_use # Use 1 core in check environment
)
#> Running in R CMD check environment. Disabling parallel processing.
#> Creating dictionaries for entity extraction...
#> Loading dictionaries sequentially...
#> Package not installed or dictionary not found. Using example dictionaries.
#> Creating dummy dictionary for disease
#> Added 8 terms from disease (local)
#> Package not installed or dictionary not found. Using example dictionaries.
#> Creating dummy dictionary for drug
#> Added 8 terms from drug (local)
#> Package not installed or dictionary not found. Using example dictionaries.
#> Creating dummy dictionary for gene
#> Added 8 terms from gene (local)
#> Searching MeSH database for: proteins[MeSH]
#> Found 7662924 PubMed records with matching MeSH terms
#> Error getting MeSH links: Must specify either (not both) 'id' or 'web_history' arguments
#> Trying direct MeSH database search...
#> Processing batch 1 of 5
#> Extracted 24 unique terms from MeSH text format
#> Processing batch 2 of 5
#> Extracted 23 unique terms from MeSH text format
#> Processing batch 3 of 5
#> Extracted 19 unique terms from MeSH text format
#> Processing batch 4 of 5
#> Extracted 20 unique terms from MeSH text format
#> Processing batch 5 of 5
#> Extracted 19 unique terms from MeSH text format
#> Retrieved 105 unique terms from MeSH
#> Added 105 terms from protein (mesh)
#> Searching MeSH database for: metabolic networks and pathways[MeSH]
#> Found 189110 PubMed records with matching MeSH terms
#> Error getting MeSH links: Must specify either (not both) 'id' or 'web_history' arguments
#> Trying direct MeSH database search...
#> Processing batch 1 of 1
#> Extracted 6 unique terms from MeSH text format
#> Retrieved 6 unique terms from MeSH
#> Added 6 terms from pathway (mesh)
#> Created combined dictionary with 135 unique terms
#> Sanitizing dictionary with 135 terms...
#> Removed 8 terms with numbers followed by special characters
#> Removed 78 terms that did not match their claimed entity types
#> Removed 42 terms with non-alphanumeric characters (final cleanup)
#> Sanitization complete. 7 terms remaining (5.2% of original)
#> Extracting entities from 100 documents...
#> Processing batch 1/1
#> Extracting entities from 100 documents...
#> Processing document 100 of 100
#> Extracted 559 entity mentions:
#> disease: 543
#> protein: 16
#> Extracted 559 entity mentions in 0.13 minutes
#> disease: 543
#> protein: 16
# View summary of entity types
table(entities_workflow$entity_type)
#>
#> disease protein
#> 543 16
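The core count chosen before the workflow call can also be derived from the machine rather than hard-coded. The sketch below is not part of LBDiscover: pick_num_cores() is a hypothetical helper, and it assumes that a non-empty _R_CHECK_LIMIT_CORES_ variable signals an R CMD check run. Outside of a check run it caps the requested count at what parallel::detectCores() reports.

```r
# Hypothetical helper (not an LBDiscover function): choose a safe core count.
pick_num_cores <- function(requested = 4L) {
  # Sys.getenv() returns "" for unset variables, so nzchar() detects "set".
  in_check <- nzchar(Sys.getenv("_R_CHECK_LIMIT_CORES_"))
  if (in_check) {
    return(1L)
  }
  # detectCores() can return NA on some platforms; fall back to 1 in that case.
  available <- parallel::detectCores()
  if (is.na(available)) {
    available <- 1L
  }
  max(1L, min(as.integer(requested), as.integer(available)))
}

num_cores_to_use <- pick_num_cores(4L)
```

The result can then be passed as num_cores, with parallel set to num_cores_to_use > 1.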
Customizing Entity Extraction
We can customize the extraction by supplying additional MeSH queries, a custom dictionary, or both:
# Define custom MeSH queries for different entity types
mesh_queries <- list(
"disease" = "migraine disorders[MeSH] OR headache disorders[MeSH]",
"drug" = "analgesics[MeSH] OR serotonin agonists[MeSH] OR anticonvulsants[MeSH]",
"gene" = "genes[MeSH] OR channelopathy[MeSH]"
)
# Create a custom dictionary
custom_dict <- data.frame(
term = c("CGRP", "trigeminal nerve", "cortical spreading depression"),
type = c("protein", "anatomy", "biological_process"),
id = c("CUSTOM_1", "CUSTOM_2", "CUSTOM_3"),
source = rep("custom", 3),
stringsAsFactors = FALSE
)
# Extract entities with custom settings
custom_entities <- extract_entities_workflow(
preprocessed_data,
text_column = "abstract",
entity_types = c("disease", "drug", "gene", "protein", "pathway"),
dictionary_sources = c("local", "mesh"),
additional_mesh_queries = mesh_queries,
custom_dictionary = custom_dict,
sanitize = TRUE
)
#> Running in R CMD check environment. Disabling parallel processing.
#> Creating dictionaries for entity extraction...
#> Adding 3 terms from custom dictionary
#> Loading dictionaries sequentially...
#> Using cached dictionary for disease (local)
#> Using cached dictionary for drug (local)
#> Using cached dictionary for gene (local)
#> Using cached dictionary for protein (mesh)
#> Using cached dictionary for pathway (mesh)
#> Created combined dictionary with 138 unique terms
#> Sanitizing dictionary with 135 terms...
#> Removed 8 terms with numbers followed by special characters
#> Removed 78 terms that did not match their claimed entity types
#> Removed 42 terms with non-alphanumeric characters (final cleanup)
#> Sanitization complete. 7 terms remaining (5.2% of original)
#> Extracting entities from 100 documents...
#> Processing batch 1/1
#> Extracting entities from 100 documents...
#> Processing document 100 of 100
#> Extracted 650 entity mentions:
#> anatomy: 1
#> biological_process: 9
#> disease: 543
#> protein: 97
#> Extracted 650 entity mentions in 0.01 minutes
#> anatomy: 1
#> biological_process: 9
#> disease: 543
#> protein: 97
# View custom entities
custom_entities[custom_entities$source == "custom", ]
#> [1] entity entity_type doc_id start_pos end_pos sentence
#> [7] frequency
#> <0 rows> (or 0-length row.names)
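The empty result is consistent with the columns printed above: the returned data frame has no source column, so custom_entities$source is NULL and the comparison selects zero rows. A minimal sketch of an alternative, assuming the entity column holds the matched term, is to filter on the custom dictionary's terms instead. The mock_entities data frame below is made-up illustrative data with the same column names, not output from LBDiscover.

```r
# Trimmed copy of the custom dictionary defined above
custom_dict <- data.frame(
  term = c("CGRP", "trigeminal nerve", "cortical spreading depression"),
  type = c("protein", "anatomy", "biological_process"),
  stringsAsFactors = FALSE
)

# Made-up stand-in for the workflow result (illustrative values only)
mock_entities <- data.frame(
  entity = c("migraine", "cgrp", "Trigeminal Nerve", "headache"),
  entity_type = c("disease", "protein", "anatomy", "disease"),
  doc_id = c(1L, 1L, 2L, 3L),
  stringsAsFactors = FALSE
)

# A missing column is NULL, so this comparison yields zero rows
no_rows <- mock_entities[mock_entities$source == "custom", ]

# Case-insensitive match against the custom dictionary terms
is_custom <- tolower(mock_entities$entity) %in% tolower(custom_dict$term)
custom_hits <- mock_entities[is_custom, ]
custom_hits
```

Matching on the term rather than on entity_type avoids picking up entries of the same type that came from other dictionaries.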
Dictionary Sanitization
Extraction quality depends heavily on dictionary quality. sanitize_dictionary() filters out entries that are likely to cause spurious matches, such as common non-medical words:
# Create a raw dictionary with some problematic entries
raw_dict <- data.frame(
term = c("migraine", "5-HT", "headache", "the", "and", "patient", "inflammation", "study"),
type = c("disease", "chemical", "symptom", "NA", "NA", "NA", "biological_process", "NA"),
id = paste0("ID_", 1:8),
source = rep("example", 8),
stringsAsFactors = FALSE
)
# Sanitize the dictionary
sanitized_dict <- sanitize_dictionary(
raw_dict,
term_column = "term",
type_column = "type",
validate_types = TRUE,
verbose = TRUE
)
#> Sanitizing dictionary with 8 terms...
#> Removed 1 terms with numbers followed by special characters
#> Removed 3 common non-medical terms, conjunctive adverbs, and general terms
#> Sanitization complete. 4 terms remaining (50% of original)
# View the sanitized dictionary
sanitized_dict
#> term type id source
#> 1 migraine disease ID_1 example
#> 3 headache symptom ID_3 example
#> 6 patient NA ID_6 example
#> 7 inflammation biological_process ID_7 example
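Note that "NA" in raw_dict$type is a literal string, not a missing value, and the patient row carrying that type is still present in the output above. If such untyped rows are unwanted, one option is a follow-up filter in base R. This is a sketch that operates on a hand-built copy of the printed result and does not assume any additional LBDiscover behavior.

```r
# Hand-built copy of the sanitized dictionary printed above
sanitized_dict <- data.frame(
  term = c("migraine", "headache", "patient", "inflammation"),
  type = c("disease", "symptom", "NA", "biological_process"),
  id = c("ID_1", "ID_3", "ID_6", "ID_7"),
  source = rep("example", 4),
  stringsAsFactors = FALSE
)

# Treat both real missing values and the literal string "NA" as untyped
untyped <- is.na(sanitized_dict$type) | sanitized_dict$type == "NA"
typed_dict <- sanitized_dict[!untyped, ]
typed_dict$term
```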
Mapping Terms to Biomedical Ontologies
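We can map extracted terms to standard biomedical ontologies such as MeSH or UMLS. In the run shown here, only 8 of the 102 retrieved MeSH terms survived sanitization and none of them matched the input terms, so the result is empty: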
# Extract terms to map
terms_to_map <- c("migraine", "headache", "CGRP", "serotonin")
# Map to MeSH
mesh_mappings <- map_ontology(
terms_to_map,
ontology = "mesh",
fuzzy_match = TRUE,
similarity_threshold = 0.8
)
#> Searching MeSH database for: disease[MeSH]
#> Found 194026 PubMed records with matching MeSH terms
#> Error getting MeSH links: Must specify either (not both) 'id' or 'web_history' arguments
#> Trying direct MeSH database search...
#> Processing batch 1 of 5
#> Extracted 20 unique terms from MeSH text format
#> Processing batch 2 of 5
#> Extracted 21 unique terms from MeSH text format
#> Processing batch 3 of 5
#> Extracted 25 unique terms from MeSH text format
#> Processing batch 4 of 5
#> Extracted 19 unique terms from MeSH text format
#> Processing batch 5 of 5
#> Extracted 21 unique terms from MeSH text format
#> Retrieved 102 unique terms from MeSH
#> Sanitizing dictionary with 102 terms...
#> Removed 56 terms that did not match their claimed entity types
#> Removed 38 terms with non-alphanumeric characters (final cleanup)
#> Sanitization complete. 8 terms remaining (7.8% of original)
#> No matches found for the input terms in the mesh ontology
# View MeSH mappings
mesh_mappings
#> [1] term ontology_id ontology_term match_type
#> <0 rows> (or 0-length row.names)
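The fuzzy_match and similarity_threshold arguments suggest approximate string matching. How map_ontology() scores similarity internally is not shown in this vignette, so the following is only an illustration of one common approach: a normalized edit-distance similarity computed with base R's adist(). The helper name fuzzy_lookup() and the three-entry vocabulary are hypothetical.

```r
# Hypothetical helper: best match per term by normalized edit similarity
fuzzy_lookup <- function(terms, vocabulary, threshold = 0.8) {
  # Edit distances between every term (rows) and vocabulary entry (columns)
  d <- adist(tolower(terms), tolower(vocabulary))
  # Normalize by the longer string so similarity lies in [0, 1]
  longest <- outer(nchar(terms), nchar(vocabulary), pmax)
  sim <- 1 - d / longest
  # Index and score of the most similar vocabulary entry for each term
  best <- apply(sim, 1, which.max)
  best_sim <- sim[cbind(seq_along(terms), best)]
  data.frame(
    term = terms,
    match = ifelse(best_sim >= threshold, vocabulary[best], NA_character_),
    similarity = round(best_sim, 2),
    stringsAsFactors = FALSE
  )
}

vocabulary <- c("Migraine Disorders", "Headache", "Serotonin")
fuzzy_lookup(c("migraine", "headache", "serotonine"), vocabulary)
```

With a threshold of 0.8, exact and near-exact spellings match. "migraine" does not match "Migraine Disorders" under this measure, because the ten extra characters dominate the normalized distance. This is one reason a short query term can miss a longer ontology heading.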
Topic Modeling
We can also apply topic modeling to discover the main themes in the corpus:
# Extract topics from the corpus
topics <- extract_topics(
migraine_articles,
text_column = "abstract",
n_topics = 5,
max_terms = 10
)
#> Tokenizing text...
# View top terms for each topic
topics$topics
#> $`Topic 1`
#> term weight
#> migraine migraine 28.897259
#> patients patients 28.260375
#> group group 14.819735
#> acute acute 11.180952
#> brimonidine brimonidine 10.937912
#> control control 9.898169
#> study study 9.800481
#> after after 9.282002
#> attacks attacks 6.993491
#> tartrate tartrate 6.836195
#>
#> $`Topic 2`
#> term weight
#> headache headache 155.33718
#> type type 61.70740
#> tension tension 58.53668
#> management management 54.34699
#> migraine migraine 52.47353
#> pharmacological pharmacological 45.09822
#> treatment treatment 44.42223
#> patient patient 42.08885
#> between between 37.64652
#> evidence evidence 35.47875
#>
#> $`Topic 3`
#> term weight
#> migraine migraine 123.70781
#> chronic chronic 25.14440
#> studies studies 24.97209
#> treatment treatment 20.89579
#> sgb sgb 19.47861
#> aura aura 18.01701
#> these these 16.17306
#> mir mir 14.85247
#> results results 14.58542
#> using using 14.54412
#>
#> $`Topic 4`
#> term weight
#> pain pain 2.3169622
#> migraine migraine 1.8345946
#> catastrophizing catastrophizing 1.5532020
#> disability disability 1.4712272
#> baseline baseline 1.3463989
#> frequency frequency 0.9190715
#> related related 0.9132389
#> follow follow 0.9098953
#> headache headache 0.9078959
#> children children 0.8907298
#>
#> $`Topic 5`
#> term weight
#> kif1a kif1a 0.36321417
#> cgrp cgrp 0.35736002
#> creb creb 0.29948346
#> migraine migraine 0.15809715
#> knockdown knockdown 0.08634776
#> expression expression 0.06956658
#> induced induced 0.06823267
#> hypersensitivity hypersensitivity 0.06541924
#> axis axis 0.06461942
#> forskolin forskolin 0.06409662
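Several of the top terms above ("after", "between", "these", "using") are generic words rather than themes. One way to make the topics easier to read is to filter such words after the fact. The sketch below assumes that topics$topics is a named list of data frames with term and weight columns, as the printed output suggests. The mock_topics object reuses a few of the printed values, and the generic_words list is an arbitrary choice.

```r
# Small stand-in for topics$topics, using values printed above
mock_topics <- list(
  "Topic 1" = data.frame(
    term = c("migraine", "patients", "after", "brimonidine"),
    weight = c(28.897259, 28.260375, 9.282002, 10.937912),
    stringsAsFactors = FALSE
  ),
  "Topic 3" = data.frame(
    term = c("migraine", "these", "aura", "using"),
    weight = c(123.70781, 16.17306, 18.01701, 14.54412),
    stringsAsFactors = FALSE
  )
)

# Arbitrary list of words to treat as uninformative
generic_words <- c("after", "between", "these", "using", "study", "patients")

# Drop generic words, then keep the top n terms by weight in each topic
top_terms <- function(topic_list, drop = character(0), n = 3) {
  lapply(topic_list, function(tbl) {
    tbl <- tbl[!tbl$term %in% drop, , drop = FALSE]
    tbl <- tbl[order(tbl$weight, decreasing = TRUE), , drop = FALSE]
    head(tbl$term, n)
  })
}

top_terms(mock_topics, drop = generic_words)
```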