
Text Preprocessing and Entity Extraction
Chao Liu
2025-05-15
Source:vignettes/Text_Preprocessing.Rmd
This vignette explains the text preprocessing and entity extraction capabilities of the LBDiscover package, which are fundamental steps in the literature-based discovery process.
Introduction
Before applying discovery models, we need to preprocess the text data and extract the entities of interest. These steps transform raw text into structured information that can be used for discovering relationships between biomedical concepts.
Loading the Package
library(LBDiscover)
#> Loading LBDiscover package
Data Retrieval
First, let’s retrieve some sample articles:
# Search for articles about migraines
migraine_articles <- pubmed_search(
query = "migraine pathophysiology",
max_results = 100
)
#> Created pubmed_cache environment for result caching
#> Searching PubMed for: migraine pathophysiology
#> Found 11643 results, retrieving 100 records
#> Fetching batch 1 of 1 (records 1-100)
#> Processing 100 articles
#> Processing article 100 of 100
#> Cached search results for future use
# View the first three articles
head(migraine_articles[, c("pmid", "title")], 3)
#> pmid
#> 1 40371864
#> 2 40357069
#> 3 40354987
#> title
#> 1 [Morning headaches in patients with obstructive sleep apnea syndrome: Pathogenesis, differential diagnosis, and treatment].
#> 2 Third Occipital Nerve Block and Cooled Radiofrequency Ablation for Managing Hemicrania Continua: A Case Report.
#> 3 Hypothalamic connectivity strength is decreasing with polygenic risk in migraine without aura patients.
Basic Text Preprocessing
The first step is to preprocess the text data to extract meaningful terms:
# Preprocess the abstracts
preprocessed_data <- preprocess_text(
migraine_articles,
text_column = "abstract",
remove_stopwords = TRUE,
custom_stopwords = c("study", "patient", "result", "conclusion"),
min_word_length = 3,
max_word_length = 25
)
#> Tokenizing text...
# View terms extracted from the first document
head(preprocessed_data$terms[[1]], 10)
#> word count
#> 1 abdominal 1
#> 2 addressed 1
#> 3 airway 1
#> 4 algic 1
#> 5 among 1
#> 6 anatomical 1
#> 7 apnea 3
#> 8 associated 6
#> 9 attention 1
#> 10 back 1
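To make the roles of `remove_stopwords`, `custom_stopwords`, `min_word_length`, and `max_word_length` more concrete, here is a minimal base-R sketch of the same kind of term counting. It is an illustration under simple assumptions (lower-casing, splitting on non-letters); `count_terms()` is a hypothetical helper and not how `preprocess_text()` is implemented.

```r
# Hypothetical helper: count terms in one text after basic filtering.
count_terms <- function(text, stopwords = character(0),
                        min_len = 3, max_len = 25) {
  # Lower-case the text and split on anything that is not a letter
  tokens <- unlist(strsplit(tolower(text), "[^a-z]+"))
  # Keep tokens within the length bounds (this also drops empty strings)
  tokens <- tokens[nchar(tokens) >= min_len & nchar(tokens) <= max_len]
  # Remove stopwords
  tokens <- tokens[!tokens %in% stopwords]
  # Tabulate the remaining tokens (table() sorts the words alphabetically)
  counts <- table(tokens)
  data.frame(
    word = as.character(names(counts)),
    count = as.integer(counts),
    stringsAsFactors = FALSE
  )
}

example_terms <- count_terms(
  "Sleep apnea and sleep disorders.",
  stopwords = c("and")
)
example_terms
```

The result has one row per distinct word, which mirrors the `word`/`count` layout shown above.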
Optimized Preprocessing for Large Datasets
For larger datasets, we can use the optimized vectorized preprocessing function:
# Use optimized vectorized preprocessing
opt_preprocessed_data <- vec_preprocess(
migraine_articles,
text_column = "abstract",
remove_stopwords = TRUE,
min_word_length = 3,
chunk_size = 50 # Process in chunks of 50 documents
)
#> Processing text in 2 chunks...
# Compare processing times (with only 100 abstracts the two functions take about the same time)
system.time({
preprocess_text(
migraine_articles,
text_column = "abstract",
remove_stopwords = TRUE
)
})
#> Tokenizing text...
#> user system elapsed
#> 0.065 0.000 0.064
system.time({
vec_preprocess(
migraine_articles,
text_column = "abstract",
remove_stopwords = TRUE,
chunk_size = 50
)
})
#> Processing text in 2 chunks...
#> user system elapsed
#> 0.06 0.00 0.06
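The idea behind `chunk_size` can be sketched in base R: split the documents into groups of at most `chunk_size` and apply a vectorized function to each group. `process_in_chunks()` is a hypothetical helper written for this illustration; the internals of `vec_preprocess()` may differ.

```r
# Hypothetical helper: apply `fun` to successive chunks of `x`.
process_in_chunks <- function(x, chunk_size, fun) {
  # Assign each element to a chunk: 1, 1, ..., 2, 2, ...
  chunk_id <- ceiling(seq_along(x) / chunk_size)
  chunks <- split(x, chunk_id)
  # Process each chunk and concatenate the results in the original order
  unlist(lapply(chunks, fun), use.names = FALSE)
}

docs <- c("a b", "c", "d e f", "g h", "i")

# Count whitespace-separated tokens per document, two documents at a time
# (3 chunks: documents 1-2, 3-4, and 5)
n_tokens <- process_in_chunks(
  docs,
  chunk_size = 2,
  fun = function(chunk) lengths(strsplit(chunk, " ", fixed = TRUE))
)
n_tokens
```

Chunking mainly bounds memory use per step, so its benefit is expected to show on corpora much larger than the 100 abstracts timed above.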
Advanced Text Analysis
N-gram Extraction
We can extract n-grams (sequences of n words) to capture multi-word concepts:
# Extract bigrams (2-word sequences)
bigrams <- extract_ngrams(
migraine_articles$abstract,
n = 2,
min_freq = 2
)
# View the most frequent bigrams
head(bigrams, 10)
#> ngram frequency
#> 7529 in the 161
#> 10138 of migraine 79
#> 10239 of the 79
#> 10631 p 0 79
#> 10987 patients with 68
#> 16211 with migraine 58
#> 7452 in migraine 51
#> 2828 associated with 48
#> 14627 this study 47
#> 14908 to the 44
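For readers who want to see what bigram counting involves, the following base-R sketch pairs each token with its successor and tabulates the pairs. `extract_bigrams()` is a hypothetical helper, and its tokenizer (lower-case, split on non-alphanumerics) is an assumption rather than the behavior of `extract_ngrams()`. A tokenizer that drops punctuation would also explain entries such as `p 0` in the output above (from text like `p = 0.028`), though that is an inference.

```r
# Hypothetical helper: count bigrams across a character vector of texts.
extract_bigrams <- function(texts, min_freq = 1) {
  grams <- as.character(unlist(lapply(texts, function(txt) {
    tokens <- unlist(strsplit(tolower(txt), "[^a-z0-9]+"))
    tokens <- tokens[nzchar(tokens)]
    if (length(tokens) < 2) return(character(0))
    # Pair every token with the token that follows it
    paste(head(tokens, -1), tail(tokens, -1))
  })))
  freq <- sort(table(grams), decreasing = TRUE)
  freq <- freq[freq >= min_freq]
  data.frame(
    ngram = as.character(names(freq)),
    frequency = as.integer(freq),
    stringsAsFactors = FALSE
  )
}

example_bigrams <- extract_bigrams(
  c("patients with migraine", "patients with aura", "with migraine"),
  min_freq = 2
)
example_bigrams
```

Because stopwords are not removed here, function-word pairs dominate the counts, as in the output above; filtering stopwords before pairing is one way to surface multi-word concepts instead.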
Sentence Segmentation
Segmenting text into sentences can be useful for more granular analysis:
# Extract sentences from the first abstract
abstracts <- migraine_articles$abstract
first_abstract <- abstracts[1]
# Make sure we have a valid abstract (check the length first so that || never sees a zero-length value)
if (length(first_abstract) == 0 || is.na(first_abstract) || nchar(first_abstract) == 0) {
# Find the first non-empty abstract
valid_idx <- which(!is.na(abstracts) & nchar(abstracts) > 0)
if(length(valid_idx) > 0) {
first_abstract <- abstracts[valid_idx[1]]
cat("First abstract was empty, using abstract #", valid_idx[1], "instead.\n")
} else {
# Create a sample abstract for demonstration
first_abstract <- "This is a sample abstract for demonstration. It contains multiple sentences. Each sentence will be extracted separately."
cat("No valid abstracts found. Using a sample abstract for demonstration.\n")
}
}
# Now segment the valid abstract
sentences <- segment_sentences(first_abstract)
# Check if sentences list has elements before trying to access them
if(length(sentences) > 0 && length(sentences[[1]]) > 0) {
# View the first few sentences
head(sentences[[1]], min(3, length(sentences[[1]])))
} else {
cat("No sentences could be extracted. The abstract might be too short or formatted incorrectly.\n")
}
#> [1] "Sleep disorders are often associated with painful ones, especially in patients with chronic syndromes."
#> [2] "Cephalgia occupies an important place among such algic forms as fibromyalgia, back pain, and abdominal and joint pain."
#> [3] "Headaches and sleep disturbances may be independent, deri.e. from a single pathogenetic factor, or their relationship may be bidi.e.tional."
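In the output above, `deri.e.` and `bidi.e.tional` look like damaged forms of "derived" and "bidirectional". One possible cause, stated here as an assumption and not confirmed against the package source, is an abbreviation such as `i.e.` being used as a regular expression, where each dot matches any character. The sketch below shows the effect and a literal-matching alternative; `split_sentences()` is a hypothetical helper, not `segment_sentences()`.

```r
# As a regex, "i.e." matches "ived"; with fixed = TRUE it is matched literally.
regex_result <- gsub("i.e.", "<ABBR>", "derived")                 # "der<ABBR>"
fixed_result <- gsub("i.e.", "<ABBR>", "derived", fixed = TRUE)   # "derived"

# Hypothetical helper: split text into sentences while protecting abbreviations.
split_sentences <- function(text, abbreviations = c("i.e.", "e.g.")) {
  placeholders <- sprintf("<ABBR%d>", seq_along(abbreviations))
  # Hide abbreviations so their dots are not treated as sentence ends
  for (i in seq_along(abbreviations)) {
    text <- gsub(abbreviations[i], placeholders[i], text, fixed = TRUE)
  }
  # Split on whitespace that follows ., ! or ?
  sentences <- unlist(strsplit(text, "(?<=[.!?])\\s+", perl = TRUE))
  # Restore the abbreviations
  for (i in seq_along(abbreviations)) {
    sentences <- gsub(placeholders[i], abbreviations[i], sentences, fixed = TRUE)
  }
  sentences
}

example_sentences <- split_sentences(
  "Pain may be derived from one factor, i.e. a shared one. It may be bidirectional."
)
example_sentences
```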
Language Detection
When working with multilingual corpora, we can detect the language of each document:
# Filter out NA values from abstracts and detect language
abstracts <- migraine_articles$abstract[1:5]
valid_abstracts <- abstracts[!is.na(abstracts)]
# Apply language detection to valid abstracts
if (length(valid_abstracts) > 0) {
  # USE.NAMES = FALSE keeps the full abstract text out of the row names
  languages <- sapply(valid_abstracts, detect_lang, USE.NAMES = FALSE)
  # View results
  data.frame(
    abstract_id = which(!is.na(abstracts)),
    language = languages
  )
} else {
  message("No valid abstracts found for language detection")
}
#>   abstract_id language
#> 1           1       en
#> 2           2       en
#> 3           3       en
#> 4           4       en
#> 5           5       en
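As a rough intuition for how language detection can work, the sketch below scores a text against small lists of common function words and picks the best-scoring language. This is only one simple heuristic; `guess_language()` and the word lists are hypothetical, and nothing here is claimed about how `detect_lang()` works internally.

```r
# Hypothetical helper: guess a language from function-word overlap.
guess_language <- function(text, profiles) {
  tokens <- unlist(strsplit(tolower(text), "[^[:alpha:]]+"))
  # Score each language by how many tokens appear in its word list
  scores <- vapply(profiles, function(words) sum(tokens %in% words), numeric(1))
  if (all(scores == 0)) return(NA_character_)
  names(scores)[which.max(scores)]
}

# Tiny illustrative word lists (far too small for real use)
profiles <- list(
  en = c("the", "and", "of", "is", "with"),
  de = c("der", "und", "die", "ist", "mit"),
  fr = c("le", "et", "les", "est", "avec")
)

guess_language("Migraine is associated with the trigeminal nerve", profiles)
guess_language("Die Migraene ist mit dem Nerv verbunden", profiles)
```

Short or highly technical texts contain few function words, so a heuristic like this returns `NA` when nothing matches instead of guessing.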
Entity Extraction
After preprocessing, the next step is to extract biomedical entities from the text.
Loading Entity Dictionaries
First, let’s load entity dictionaries that will be used for entity recognition:
# Load a disease dictionary
disease_dict <- load_dictionary(
dictionary_type = "disease",
source = "mesh"
)
#> Searching MeSH database for: disease[MeSH]
#> Found 193563 PubMed records with matching MeSH terms
#> Error getting MeSH links: Must specify either (not both) 'id' or 'web_history' arguments
#> Trying direct MeSH database search...
#> Processing batch 1 of 5
#> Extracted 20 unique terms from MeSH text format
#> Processing batch 2 of 5
#> Extracted 21 unique terms from MeSH text format
#> Processing batch 3 of 5
#> Extracted 25 unique terms from MeSH text format
#> Processing batch 4 of 5
#> Extracted 19 unique terms from MeSH text format
#> Processing batch 5 of 5
#> Extracted 21 unique terms from MeSH text format
#> Retrieved 102 unique terms from MeSH
#> Sanitizing dictionary with 102 terms...
#> Removed 56 terms that did not match their claimed entity types
#> Removed 38 terms with non-alphanumeric characters (final cleanup)
#> Sanitization complete. 8 terms remaining (7.8% of original)
# Load a drug dictionary
drug_dict <- load_dictionary(
dictionary_type = "drug",
source = "mesh"
)
#> Searching MeSH database for: pharmaceutical preparations[MeSH]
#> Found 985543 PubMed records with matching MeSH terms
#> Error getting MeSH links: Must specify either (not both) 'id' or 'web_history' arguments
#> Trying direct MeSH database search...
#> Processing batch 1 of 2
#> Extracted 23 unique terms from MeSH text format
#> Processing batch 2 of 2
#> Extracted 3 unique terms from MeSH text format
#> Retrieved 26 unique terms from MeSH
#> Sanitizing dictionary with 26 terms...
#> Removed 26 terms that did not match their claimed entity types
#> Sanitization complete. 0 terms remaining (0% of original)
# View a sample of each dictionary
head(disease_dict, 3)
#> term id type source
#> 10 Lobomycosis MESH_10 disease mesh_text
#> 20 Disease MESH_ENTRY_19 disease mesh_text
#> 48 Osteochondrosis MESH_7 disease mesh_text
head(drug_dict, 3)
#> [1] term id type source
#> <0 rows> (or 0-length row.names)
Basic Entity Extraction
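Now we can extract entities from the text using these dictionaries. In the run shown above, the drug dictionary is empty after sanitization, so only disease terms can be matched: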
# Extract disease and drug entities
entities <- extract_entities(
preprocessed_data,
text_column = "abstract",
dictionary = rbind(disease_dict, drug_dict),
case_sensitive = FALSE,
overlap_strategy = "priority"
)
#> Sanitizing dictionary with 8 terms...
#> Sanitization complete. 8 terms remaining (100% of original)
#> Extracting entities from 98 documents...
#> Extracted 36 entity mentions:
#> disease: 36
# View some extracted entities
head(entities[, c("doc_id", "entity", "entity_type", "sentence")], 10)
#> doc_id entity entity_type
#> 1 5 Disease disease
#> 2 10 Disease disease
#> 3 22 Disease disease
#> 4 22 Disease disease
#> 5 31 Disease disease
#> 6 41 Disease disease
#> 7 44 Disease disease
#> 8 45 Disease disease
#> 9 47 Disease disease
#> 10 54 Disease disease
#> sentence
#> 1 The effect of hormones, electrolytes (magnesium, calcium, sodium), vitamins (vitamin D, B12), and other biologically active molecules (melatonin, L-carnitine, L-tryptophan) on the course of the disease is considered.
#> 2 The the prevalence of aura (p = 0.028), age (p = 0.001) and mean disease duration (p < 0.001, t=-4.257) were significantly higher in migraine patients with WMH than those without WMH.
#> 3 Neuroimaging has contributed to a better understanding of VSS disease mechanism.
#> 4 Given the complexity of its disease state, multidisciplinary therapeutic approaches appear to be required for more effective symptom management.
#> 5 Ménière's disease (4%) and vestibular neuritis/labyrinthitis (3.9%) were associated with younger patients and unilateral or asymmetrical hearing loss.
#> 6 Therefore, although migraine with and without aura are considered two types of the same disease, more research should focus on their differences, thus finally enabling better specific treatment options for both types of migraine.
#> 7 Common comorbidities were hypertension, diabetes, and polycystic ovarian disease.
#> 8 CONCLUSIONS: Our study shows that UFs share substantial genetic basis with traits related to BP, obesity, diabetes, and migraine, a predominantly female vascular disease.
#> 9 BACKGROUND: Coronavirus disease 2019 (COVID-19), caused by the SARS-CoV-2 virus, placed unprecedented pressure on public health systems due to its mortality and global panic-and later due to long COVID challenges.
#> 10 Central nervous system (CNS) disorders, such as Alzheimer's disease (AD), Parkinson's disease (PD), multiple sclerosis (MS), and migraines, rank among the most prevalent and concerning conditions worldwide.
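Dictionary-based extraction can be pictured as a whole-word, case-insensitive search for every dictionary term in every document. The base-R sketch below follows that idea; `match_dictionary()` is a hypothetical helper with its own simplified output, and it does not reproduce options of `extract_entities()` such as `overlap_strategy`.

```r
# Hypothetical helper: find dictionary terms in a character vector of texts.
match_dictionary <- function(texts, dictionary) {
  hits <- list()
  for (doc_id in seq_along(texts)) {
    if (is.na(texts[doc_id])) next
    for (j in seq_len(nrow(dictionary))) {
      # \Q...\E quotes the term so regex metacharacters are taken literally;
      # \b restricts matches to whole words
      pattern <- paste0("\\b\\Q", dictionary$term[j], "\\E\\b")
      m <- gregexpr(pattern, texts[doc_id], perl = TRUE, ignore.case = TRUE)[[1]]
      if (m[1] != -1) {
        hits[[length(hits) + 1]] <- data.frame(
          doc_id = doc_id,
          entity = dictionary$term[j],
          entity_type = dictionary$type[j],
          start_pos = as.integer(m),
          end_pos = as.integer(m) + attr(m, "match.length") - 1L,
          stringsAsFactors = FALSE
        )
      }
    }
  }
  if (length(hits) == 0) {
    return(data.frame(doc_id = integer(0), entity = character(0),
                      entity_type = character(0), start_pos = integer(0),
                      end_pos = integer(0), stringsAsFactors = FALSE))
  }
  do.call(rbind, hits)
}

example_dict <- data.frame(
  term = c("migraine", "headache"),
  type = c("disease", "symptom"),
  stringsAsFactors = FALSE
)
example_texts <- c(
  "Migraine and headache often co-occur.",
  "No entities here.",
  "Chronic migraine differs from episodic migraine."
)
example_hits <- match_dictionary(example_texts, example_dict)
example_hits
```

With a dictionary as small as the eight sanitized terms above, most mentions come from a single generic term, which is why every row of the output shows `Disease`.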
Complete Entity Extraction Workflow
For a more comprehensive approach, we can use the complete entity extraction workflow:
# Detect an R CMD check environment.
# Sys.getenv() returns "" (never NULL) for unset variables, so test the value itself.
is_check <- !interactive() &&
  identical(Sys.getenv("R_CHECK_RUNNING"), "true")
# Also treat a non-empty _R_CHECK_LIMIT_CORES_ as a check environment
if (!is_check && nzchar(Sys.getenv("_R_CHECK_LIMIT_CORES_"))) {
  is_check <- TRUE
}
# Set number of cores based on environment
num_cores_to_use <- if (is_check) 1 else 4
# Extract entities using the complete workflow
entities_workflow <- extract_entities_workflow(
preprocessed_data,
text_column = "abstract",
entity_types = c("disease", "drug", "gene", "protein", "pathway"),
dictionary_sources = c("local", "mesh"),
sanitize = TRUE,
parallel = !is_check, # Disable parallel in check environment
num_cores = num_cores_to_use # Use 1 core in check environment
)
#> Running in R CMD check environment. Disabling parallel processing.
#> Creating dictionaries for entity extraction...
#> Loading dictionaries sequentially...
#> Package not installed or dictionary not found. Using example dictionaries.
#> Creating dummy dictionary for disease
#> Added 20 terms from disease (local)
#> Package not installed or dictionary not found. Using example dictionaries.
#> Creating dummy dictionary for drug
#> Added 20 terms from drug (local)
#> Package not installed or dictionary not found. Using example dictionaries.
#> Creating dummy dictionary for gene
#> Added 20 terms from gene (local)
#> Searching MeSH database for: proteins[MeSH]
#> Found 7579035 PubMed records with matching MeSH terms
#> Error getting MeSH links: Must specify either (not both) 'id' or 'web_history' arguments
#> Trying direct MeSH database search...
#> Processing batch 1 of 5
#> Extracted 24 unique terms from MeSH text format
#> Processing batch 2 of 5
#> Extracted 23 unique terms from MeSH text format
#> Processing batch 3 of 5
#> Extracted 19 unique terms from MeSH text format
#> Processing batch 4 of 5
#> Extracted 20 unique terms from MeSH text format
#> Processing batch 5 of 5
#> Extracted 19 unique terms from MeSH text format
#> Retrieved 105 unique terms from MeSH
#> Added 105 terms from protein (mesh)
#> Searching MeSH database for: metabolic networks and pathways[MeSH]
#> Found 184983 PubMed records with matching MeSH terms
#> Error getting MeSH links: Must specify either (not both) 'id' or 'web_history' arguments
#> Trying direct MeSH database search...
#> Processing batch 1 of 1
#> Extracted 6 unique terms from MeSH text format
#> Retrieved 6 unique terms from MeSH
#> Added 6 terms from pathway (mesh)
#> Created combined dictionary with 171 unique terms
#> Sanitizing dictionary with 171 terms...
#> Removed 8 terms with numbers followed by special characters
#> Correcting type for 'headache' from 'disease' to 'symptom'
#> Correcting type for 'fatigue' from 'disease' to 'symptom'
#> Applied 2 type corrections for commonly misclassified terms
#> Removed 107 terms that did not match their claimed entity types
#> Removed 42 terms with non-alphanumeric characters (final cleanup)
#> Sanitization complete. 14 terms remaining (8.2% of original)
#> Extracting entities from 98 documents...
#> Processing batch 1/1
#> Extracting entities from 98 documents...
#> Extracted 662 entity mentions:
#> disease: 533
#> protein: 13
#> symptom: 116
#> Extracted 662 entity mentions in 0.14 minutes
#> disease: 533
#> protein: 13
#> symptom: 116
# View summary of entity types
table(entities_workflow$entity_type)
#>
#> disease protein symptom
#> 533 13 116
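When the number of cores is chosen from environment variables, it helps to remember that `Sys.getenv()` returns an empty string, not `NULL`, for a variable that is not set. The sketch below wraps that check and caps the core count by what the machine reports. `env_flag_set()`, `pick_cores()`, and the variable name `LBD_DEMO_FLAG` are hypothetical names used only for this illustration.

```r
# Hypothetical helper: TRUE if an environment variable is set to a non-empty value.
env_flag_set <- function(name) {
  nzchar(Sys.getenv(name, unset = ""))
}

# Hypothetical helper: choose a core count, leaving one core free when possible.
pick_cores <- function(in_check, requested = 4L) {
  if (in_check) return(1L)
  available <- parallel::detectCores()
  if (is.na(available)) return(1L)  # detectCores() can return NA
  max(1L, min(requested, available - 1L))
}

Sys.unsetenv("LBD_DEMO_FLAG")
unset_is_null <- is.null(Sys.getenv("LBD_DEMO_FLAG"))  # FALSE: the value is ""
flag_before <- env_flag_set("LBD_DEMO_FLAG")           # FALSE

Sys.setenv(LBD_DEMO_FLAG = "true")
flag_after <- env_flag_set("LBD_DEMO_FLAG")            # TRUE

cores <- pick_cores(in_check = flag_after)
cores
```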
Customizing Entity Extraction
We can customize the entity extraction process by providing additional MeSH queries or custom dictionaries:
# Define custom MeSH queries for different entity types
mesh_queries <- list(
"disease" = "migraine disorders[MeSH] OR headache disorders[MeSH]",
"drug" = "analgesics[MeSH] OR serotonin agonists[MeSH] OR anticonvulsants[MeSH]",
"gene" = "genes[MeSH] OR channelopathy[MeSH]"
)
# Create a custom dictionary
custom_dict <- data.frame(
term = c("CGRP", "trigeminal nerve", "cortical spreading depression"),
type = c("protein", "anatomy", "biological_process"),
id = c("CUSTOM_1", "CUSTOM_2", "CUSTOM_3"),
source = rep("custom", 3),
stringsAsFactors = FALSE
)
# Extract entities with custom settings
custom_entities <- extract_entities_workflow(
preprocessed_data,
text_column = "abstract",
entity_types = c("disease", "drug", "gene", "protein", "pathway"),
dictionary_sources = c("local", "mesh"),
additional_mesh_queries = mesh_queries,
custom_dictionary = custom_dict,
sanitize = TRUE
)
#> Running in R CMD check environment. Disabling parallel processing.
#> Creating dictionaries for entity extraction...
#> Adding 3 terms from custom dictionary
#> Loading dictionaries sequentially...
#> Using cached dictionary for disease (local)
#> Using cached dictionary for drug (local)
#> Using cached dictionary for gene (local)
#> Using cached dictionary for protein (mesh)
#> Using cached dictionary for pathway (mesh)
#> Created combined dictionary with 174 unique terms
#> Sanitizing dictionary with 171 terms...
#> Removed 8 terms with numbers followed by special characters
#> Correcting type for 'headache' from 'disease' to 'symptom'
#> Correcting type for 'fatigue' from 'disease' to 'symptom'
#> Applied 2 type corrections for commonly misclassified terms
#> Removed 107 terms that did not match their claimed entity types
#> Removed 42 terms with non-alphanumeric characters (final cleanup)
#> Sanitization complete. 14 terms remaining (8.2% of original)
#> Extracting entities from 98 documents...
#> Processing batch 1/1
#> Extracting entities from 98 documents...
#> Extracted 736 entity mentions:
#> anatomy: 2
#> biological_process: 8
#> disease: 533
#> protein: 77
#> symptom: 116
#> Extracted 736 entity mentions in 0.01 minutes
#> anatomy: 2
#> biological_process: 8
#> disease: 533
#> protein: 77
#> symptom: 116
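# View entities that came from the custom dictionary.
# The result has no `source` column (its columns are entity, entity_type, doc_id,
# start_pos, end_pos, sentence, and frequency), so filtering on
# custom_entities$source always returns zero rows. Match on the term instead:
custom_entities[tolower(custom_entities$entity) %in% tolower(custom_dict$term), ]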
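Before passing a custom dictionary to the workflow, it can be worth checking that it has the expected columns and no empty or duplicated terms. The sketch below assumes the four columns used in this vignette (`term`, `type`, `id`, `source`); `validate_dictionary()` is a hypothetical helper and not part of LBDiscover.

```r
# Hypothetical helper: basic structural checks for a dictionary data frame.
validate_dictionary <- function(dict,
                                required = c("term", "type", "id", "source")) {
  missing_cols <- setdiff(required, names(dict))
  if (length(missing_cols) > 0) {
    stop("Dictionary is missing columns: ",
         paste(missing_cols, collapse = ", "))
  }
  # Drop rows with missing or blank terms and keep the required columns
  keep <- !is.na(dict$term) & nzchar(trimws(dict$term))
  dict <- dict[keep, required, drop = FALSE]
  # Drop case-insensitive duplicates, keeping the first occurrence
  dict[!duplicated(tolower(dict$term)), , drop = FALSE]
}

draft_dict <- data.frame(
  term = c("CGRP", "cgrp", "", "trigeminal nerve"),
  type = c("protein", "protein", "protein", "anatomy"),
  id = paste0("CUSTOM_", 1:4),
  source = rep("custom", 4),
  stringsAsFactors = FALSE
)
clean_dict <- validate_dictionary(draft_dict)
clean_dict
```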
Dictionary Sanitization
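The quality of entity extraction depends heavily on the quality of the dictionaries. We can sanitize dictionaries to improve extraction quality. In the example below, sanitization removes "5-HT" and "patient" but keeps "the", "and", and "study", so some manual filtering may still be needed: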
# Create a raw dictionary with some problematic entries
raw_dict <- data.frame(
term = c("migraine", "5-HT", "headache", "the", "and", "patient", "inflammation", "study"),
type = c("disease", "chemical", "symptom", "NA", "NA", "NA", "biological_process", "NA"),
id = paste0("ID_", 1:8),
source = rep("example", 8),
stringsAsFactors = FALSE
)
# Sanitize the dictionary
sanitized_dict <- sanitize_dictionary(
raw_dict,
term_column = "term",
type_column = "type",
validate_types = TRUE,
verbose = TRUE
)
#> Sanitizing dictionary with 8 terms...
#> Removed 1 terms with numbers followed by special characters
#> Removed 1 common non-medical terms, conjunctive adverbs, and general terms
#> Sanitization complete. 6 terms remaining (75% of original)
# View the sanitized dictionary
sanitized_dict
#> term type id source
#> 1 migraine disease ID_1 example
#> 3 headache symptom ID_3 example
#> 4 the NA ID_4 example
#> 5 and NA ID_5 example
#> 7 inflammation biological_process ID_7 example
#> 8 study NA ID_8 example
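Since general words such as "the", "and", and "study" remain in the sanitized result above, an extra manual filter can be applied. The sketch below drops rows whose term is in a stopword list or whose type is missing or the literal string "NA". `drop_uninformative_terms()` and its default stopword list are hypothetical and chosen only to match this example.

```r
# Hypothetical helper: remove stopword-like terms and untyped entries.
drop_uninformative_terms <- function(dict,
                                     stopwords = c("the", "and", "study", "patient")) {
  keep <- !(tolower(dict$term) %in% stopwords) &
    !is.na(dict$type) &
    dict$type != "NA"
  dict[keep, , drop = FALSE]
}

# Same structure as the raw dictionary defined above
example_raw_dict <- data.frame(
  term = c("migraine", "5-HT", "headache", "the", "and", "patient",
           "inflammation", "study"),
  type = c("disease", "chemical", "symptom", "NA", "NA", "NA",
           "biological_process", "NA"),
  id = paste0("ID_", 1:8),
  source = rep("example", 8),
  stringsAsFactors = FALSE
)
filtered_dict <- drop_uninformative_terms(example_raw_dict)
filtered_dict$term
```

Unlike the sanitized output above, this filter keeps "5-HT"; whether terms containing digits and hyphens should be kept is a judgment call for the corpus at hand.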
Mapping Terms to Biomedical Ontologies
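We can map extracted terms to standard biomedical ontologies such as MeSH or UMLS. In the run shown below, the MeSH lookup finds no matches for the input terms, so the result is an empty data frame:
# Define the terms to map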
terms_to_map <- c("migraine", "headache", "CGRP", "serotonin")
# Map to MeSH
mesh_mappings <- map_ontology(
terms_to_map,
ontology = "mesh",
fuzzy_match = TRUE,
similarity_threshold = 0.8
)
#> Searching MeSH database for: disease[MeSH]
#> Found 193563 PubMed records with matching MeSH terms
#> Error getting MeSH links: Must specify either (not both) 'id' or 'web_history' arguments
#> Trying direct MeSH database search...
#> Processing batch 1 of 5
#> Extracted 20 unique terms from MeSH text format
#> Processing batch 2 of 5
#> Extracted 21 unique terms from MeSH text format
#> Processing batch 3 of 5
#> Extracted 25 unique terms from MeSH text format
#> Processing batch 4 of 5
#> Extracted 19 unique terms from MeSH text format
#> Processing batch 5 of 5
#> Extracted 21 unique terms from MeSH text format
#> Retrieved 102 unique terms from MeSH
#> Sanitizing dictionary with 102 terms...
#> Removed 56 terms that did not match their claimed entity types
#> Removed 38 terms with non-alphanumeric characters (final cleanup)
#> Sanitization complete. 8 terms remaining (7.8% of original)
#> No matches found for the input terms in the mesh ontology
# View MeSH mappings
mesh_mappings
#> [1] term ontology_id ontology_term match_type
#> <0 rows> (or 0-length row.names)
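To illustrate what a similarity threshold of 0.8 can mean, the sketch below scores each input term against a small list of ontology labels using edit distance from `utils::adist()`. The similarity formula (one minus distance divided by the longer string length), the helper `fuzzy_map()`, and the label list are assumptions made for this example; `map_ontology()` may define similarity differently.

```r
# Hypothetical helper: map terms to their closest ontology label by edit distance.
fuzzy_map <- function(terms, ontology_terms, threshold = 0.8) {
  # Case-insensitive edit distances: rows are terms, columns are ontology labels
  d <- utils::adist(tolower(terms), tolower(ontology_terms))
  max_len <- outer(nchar(terms), nchar(ontology_terms), pmax)
  sim <- 1 - d / max_len
  # Best-matching ontology label for each term
  best <- apply(sim, 1, which.max)
  best_sim <- sim[cbind(seq_along(terms), best)]
  out <- data.frame(
    term = terms,
    ontology_term = ontology_terms[best],
    similarity = best_sim,
    match_type = ifelse(best_sim == 1, "exact", "fuzzy"),
    stringsAsFactors = FALSE
  )
  out[out$similarity >= threshold, , drop = FALSE]
}

# Illustrative labels, not retrieved from MeSH
labels <- c("Migraine Disorders", "Headache", "Serotonin",
            "Calcitonin Gene-Related Peptide")
example_map <- fuzzy_map(c("headache", "serotonine", "CGRP"), labels)
example_map
```

String similarity cannot connect an abbreviation such as "CGRP" to its full name, so abbreviations usually need a synonym table in addition to fuzzy matching.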
Topic Modeling
We can also apply topic modeling to discover the main themes in the corpus:
# Extract topics from the corpus
topics <- extract_topics(
migraine_articles,
text_column = "abstract",
n_topics = 5,
max_terms = 10
)
#> Tokenizing text...
# View top terms for each topic
topics$topics
#> $`Topic 1`
#> term weight
#> migraine migraine 158.54694
#> cgrp cgrp 33.93941
#> aura aura 28.48012
#> headache headache 26.85610
#> pain pain 26.52241
#> between between 23.21742
#> patients patients 20.81965
#> study study 20.37030
#> related related 20.09579
#> treatment treatment 19.25877
#>
#> $`Topic 2`
#> term weight
#> progression progression 37.439073
#> migraine migraine 36.013245
#> 634 634 10.570779
#> definitions definitions 10.570779
#> midas midas 9.241191
#> mhd mhd 8.808983
#> increase increase 7.308507
#> odds odds 7.052829
#> year year 7.047243
#> definition definition 7.047186
#>
#> $`Topic 3`
#> term weight
#> patients patients 44.09683
#> brain brain 40.43149
#> asd asd 39.81409
#> mwoa mwoa 39.58473
#> stroke stroke 34.74475
#> after after 31.86060
#> network network 29.71455
#> significant significant 28.73627
#> compared compared 26.71239
#> individuals individuals 25.19733
#>
#> $`Topic 4`
#> term weight
#> migraine migraine 73.11895
#> covid covid 60.29671
#> patients patients 48.10263
#> long long 33.81408
#> individuals individuals 31.86965
#> headache headache 27.79943
#> symptoms symptoms 25.11927
#> without without 20.33448
#> sex sex 15.35365
#> study study 15.23040
#>
#> $`Topic 5`
#> term weight
#> migraine migraine 78.23895
#> group group 28.04809
#> patients patients 24.72097
#> study study 16.40871
#> between between 15.56503
#> edema edema 11.28163
#> perilesional perilesional 11.28163
#> days days 11.08811
#> scores scores 10.27279
#> compared compared 10.20141
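Topic models generally start from a document-term matrix, whatever algorithm is applied afterwards (the algorithm behind `extract_topics()` is not stated here). The base-R sketch below builds such a matrix; `build_dtm()` is a hypothetical helper. General words like "between" and "after" appear among the top terms above, so removing stopwords before building the matrix is likely to give more specific topics.

```r
# Hypothetical helper: build a document-term count matrix from raw texts.
build_dtm <- function(texts) {
  tokens <- lapply(texts, function(txt) {
    tk <- unlist(strsplit(tolower(txt), "[^a-z]+"))
    tk[nzchar(tk)]
  })
  vocab <- sort(unique(unlist(tokens)))
  dtm <- matrix(
    0L,
    nrow = length(texts),
    ncol = length(vocab),
    dimnames = list(paste0("doc", seq_along(texts)), vocab)
  )
  for (i in seq_along(tokens)) {
    if (length(tokens[[i]]) == 0) next
    counts <- table(tokens[[i]])
    # Fill in the counts for the terms that occur in document i
    dtm[i, names(counts)] <- as.integer(counts)
  }
  dtm
}

example_dtm <- build_dtm(c("migraine aura migraine", "cgrp migraine"))
example_dtm
```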