Text Preprocessing and Entity Extraction

This vignette explains the text preprocessing and entity extraction capabilities of the LBDiscover package, which are fundamental steps in the literature-based discovery process.

Introduction

Before applying discovery models, we need to preprocess the text data and extract the entities of interest. These steps transform raw text into structured information that can be used for discovering relationships between biomedical concepts.

Loading the Package

library(LBDiscover)
#> Loading LBDiscover package

Data Retrieval

First, let’s retrieve some sample articles:

# Search for articles about migraines
migraine_articles <- pubmed_search(
  query = "migraine pathophysiology",
  max_results = 100
)
#> Created pubmed_cache environment for result caching
#> Searching PubMed for: migraine pathophysiology
#> Found 11643 results, retrieving 100 records
#> Fetching batch 1 of 1 (records 1-100)
#> Processing 100 articles
#>   Processing article 100 of 100
#> Cached search results for future use

# View the first three articles
head(migraine_articles[, c("pmid", "title")], 3)
#>       pmid
#> 1 40371864
#> 2 40357069
#> 3 40354987
#>                                                                                                                         title
#> 1 [Morning headaches in patients with obstructive sleep apnea syndrome: Pathogenesis, differential diagnosis, and treatment].
#> 2             Third Occipital Nerve Block and Cooled Radiofrequency Ablation for Managing Hemicrania Continua: A Case Report.
#> 3                     Hypothalamic connectivity strength is decreasing with polygenic risk in migraine without aura patients.

Basic Text Preprocessing

The first step is to preprocess the text data to extract meaningful terms:

# Preprocess the abstracts
preprocessed_data <- preprocess_text(
  migraine_articles,
  text_column = "abstract",
  remove_stopwords = TRUE,
  custom_stopwords = c("study", "patient", "result", "conclusion"),
  min_word_length = 3,
  max_word_length = 25
)
#> Tokenizing text...

# View terms extracted from the first document
head(preprocessed_data$terms[[1]], 10)
#>          word count
#> 1   abdominal     1
#> 2   addressed     1
#> 3      airway     1
#> 4       algic     1
#> 5       among     1
#> 6  anatomical     1
#> 7       apnea     3
#> 8  associated     6
#> 9   attention     1
#> 10       back     1
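
To make the options above concrete, here is a minimal base-R sketch of the same idea: lowercase the text, split it into tokens, filter by length and stopwords, and count what remains. The function `count_terms()` and its arguments are hypothetical names used only for illustration; this is not how `preprocess_text()` is implemented.

```r
# Toy illustration in base R; this is not the LBDiscover implementation.
count_terms <- function(text, stopwords = character(0),
                        min_len = 3, max_len = 25) {
  # Lowercase and split on anything that is not a letter
  tokens <- strsplit(tolower(text), "[^a-z]+")[[1]]
  # Keep tokens within the length bounds and drop stopwords
  n <- nchar(tokens)
  tokens <- tokens[n >= min_len & n <= max_len & !(tokens %in% stopwords)]
  # Tabulate; table() orders the words alphabetically
  counts <- table(tokens)
  data.frame(word = as.character(names(counts)),
             count = as.integer(counts),
             stringsAsFactors = FALSE)
}

demo <- count_terms(
  "Sleep apnea is associated with headache; apnea is common in the study.",
  stopwords = c("the", "with", "study")
)
demo
```

Short words such as "is" and "in" fall below the minimum length, so they disappear even without being listed as stopwords.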

Optimized Preprocessing for Large Datasets

For larger datasets, we can use the optimized vectorized preprocessing function:

# Use optimized vectorized preprocessing
opt_preprocessed_data <- vec_preprocess(
  migraine_articles,
  text_column = "abstract",
  remove_stopwords = TRUE,
  min_word_length = 3,
  chunk_size = 50  # Process in chunks of 50 documents
)
#> Processing text in 2 chunks...
#>   |======================================================================| 100%

# Compare processing times
system.time({
  preprocess_text(
    migraine_articles,
    text_column = "abstract",
    remove_stopwords = TRUE
  )
})
#> Tokenizing text...
#>    user  system elapsed 
#>   0.065   0.000   0.064

system.time({
  vec_preprocess(
    migraine_articles,
    text_column = "abstract",
    remove_stopwords = TRUE,
    chunk_size = 50
  )
})
#> Processing text in 2 chunks...
#>   |======================================================================| 100%
#>    user  system elapsed 
#>    0.06    0.00    0.06

On this 100-article sample the two functions finish in about the same time (roughly 0.06 seconds elapsed each), so any benefit from chunked processing would only be expected on much larger corpora.

Advanced Text Analysis

N-gram Extraction

We can extract n-grams (sequences of n words) to capture multi-word concepts:

# Extract bigrams (2-word sequences)
bigrams <- extract_ngrams(
  migraine_articles$abstract,
  n = 2,
  min_freq = 2
)

# View the most frequent bigrams
head(bigrams, 10)
#>                 ngram frequency
#> 7529           in the       161
#> 10138     of migraine        79
#> 10239          of the        79
#> 10631             p 0        79
#> 10987   patients with        68
#> 16211   with migraine        58
#> 7452      in migraine        51
#> 2828  associated with        48
#> 14627      this study        47
#> 14908          to the        44
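
The counting behind such a table can be sketched in a few lines of base R. The helper `count_bigrams()` below is a hypothetical stand-in, not the code behind `extract_ngrams()`; it pairs each word with its successor and keeps the pairs that reach a minimum frequency.

```r
# Toy bigram counter in base R; not the LBDiscover implementation.
count_bigrams <- function(texts, min_freq = 2) {
  bigrams <- unlist(lapply(texts, function(txt) {
    words <- strsplit(tolower(txt), "[^a-z0-9]+")[[1]]
    words <- words[nzchar(words)]
    # Missing or one-word texts contribute no bigrams
    if (length(words) < 2 || anyNA(words)) return(character(0))
    # Pair every word with the word that follows it
    paste(head(words, -1), tail(words, -1))
  }))
  freq <- sort(table(bigrams), decreasing = TRUE)
  freq <- freq[freq >= min_freq]
  data.frame(ngram = as.character(names(freq)),
             frequency = as.integer(freq),
             stringsAsFactors = FALSE)
}

demo_bigrams <- count_bigrams(
  c("patients with migraine", "migraine in patients with aura", NA)
)
demo_bigrams
```

Raw counts like these include function words, which is why pairs such as "in the" and "of the" lead the list above.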

Sentence Segmentation

Segmenting text into sentences can be useful for more granular analysis:

# Extract sentences from the first abstract
abstracts <- migraine_articles$abstract
first_abstract <- abstracts[1]

# Make sure we have a valid abstract
if(is.na(first_abstract) || length(first_abstract) == 0 || nchar(first_abstract) == 0) {
  # Find the first non-empty abstract
  valid_idx <- which(!is.na(abstracts) & nchar(abstracts) > 0)
  if(length(valid_idx) > 0) {
    first_abstract <- abstracts[valid_idx[1]]
    cat("First abstract was empty, using abstract #", valid_idx[1], "instead.\n")
  } else {
    # Create a sample abstract for demonstration
    first_abstract <- "This is a sample abstract for demonstration. It contains multiple sentences. Each sentence will be extracted separately."
    cat("No valid abstracts found. Using a sample abstract for demonstration.\n")
  }
}

# Now segment the valid abstract
sentences <- segment_sentences(first_abstract)

# Check if sentences list has elements before trying to access them
if(length(sentences) > 0 && length(sentences[[1]]) > 0) {
  # View the first few sentences
  head(sentences[[1]], min(3, length(sentences[[1]])))
} else {
  cat("No sentences could be extracted. The abstract might be too short or formatted incorrectly.\n")
}
#> [1] "Sleep disorders are often associated with painful ones, especially in patients with chronic syndromes."                                     
#> [2] "Cephalgia occupies an important place among such algic forms as fibromyalgia, back pain, and abdominal and joint pain."                     
#> [3] "Headaches and sleep disturbances may be independent, derived from a single pathogenetic factor, or their relationship may be bidirectional."
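
As a rough illustration of what sentence segmentation involves, the sketch below splits text wherever sentence-ending punctuation is followed by whitespace and an uppercase letter. `split_sentences()` is a hypothetical helper, not the algorithm used by `segment_sentences()`.

```r
# Toy sentence splitter in base R; not the LBDiscover implementation.
split_sentences <- function(text) {
  # Split on whitespace that follows ., ! or ? and precedes an uppercase letter
  parts <- strsplit(text, "(?<=[.!?])\\s+(?=[A-Z])", perl = TRUE)[[1]]
  trimws(parts)
}

demo_sentences <- split_sentences(
  "Sleep disorders are common. Cephalgia is one example. Is the link bidirectional?"
)
demo_sentences
```

A rule this simple also splits after abbreviations that precede a capitalized word (for example "et al. Smith"), so real segmenters need extra handling for such cases.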

Language Detection

When working with multilingual corpora, we can detect the language of each document:

# Take the first five abstracts and drop missing values
abstracts <- migraine_articles$abstract[1:5]
valid_abstracts <- abstracts[!is.na(abstracts)]

# Apply language detection to valid abstracts
if (length(valid_abstracts) > 0) {
  # USE.NAMES = FALSE keeps sapply() from using each full abstract as a row name
  languages <- sapply(valid_abstracts, detect_lang, USE.NAMES = FALSE)
  
  # View results
  data.frame(
    abstract_id = which(!is.na(abstracts)),
    language = languages
  )
} else {
  message("No valid abstracts found for language detection")
}
#>   abstract_id language
#> 1           1       en
#> 2           2       en
#> 3           3       en
#> 4           4       en
#> 5           5       en
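
One simple way to guess a document's language is to count how many of its words appear in short stopword lists for each candidate language. The sketch below illustrates that heuristic only; `guess_language()` and the tiny word lists are made up for this example and say nothing about how `detect_lang()` works internally.

```r
# Toy stopword-based language guesser; not the LBDiscover implementation.
guess_language <- function(text, profiles) {
  if (is.na(text) || !nzchar(text)) return(NA_character_)
  words <- strsplit(tolower(text), "[^[:alpha:]]+")[[1]]
  # Count how many words of the text belong to each language's stopword list
  hits <- vapply(profiles, function(sw) sum(words %in% sw), integer(1))
  if (all(hits == 0)) return(NA_character_)
  names(hits)[which.max(hits)]
}

# Deliberately tiny, illustrative stopword lists
profiles <- list(
  en = c("the", "and", "of", "is", "in", "with"),
  de = c("der", "die", "und", "ist", "mit", "von"),
  fr = c("le", "la", "et", "est", "avec", "des")
)

guess_language("Migraine is a disorder of the brain", profiles)
guess_language("Der Kopfschmerz ist mit der Migraine verbunden", profiles)
```

Returning `NA` when no list matches avoids forcing a label onto very short or unusual texts.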

Entity Extraction

After preprocessing, the next step is to extract biomedical entities from the text.

Loading Entity Dictionaries

First, let’s load entity dictionaries that will be used for entity recognition:

# Load a disease dictionary
disease_dict <- load_dictionary(
  dictionary_type = "disease",
  source = "mesh"
)
#> Searching MeSH database for: disease[MeSH]
#> Found 193563 PubMed records with matching MeSH terms
#> Error getting MeSH links: Must specify either (not both) 'id' or 'web_history' arguments
#> Trying direct MeSH database search...
#> Processing batch 1 of 5
#> Extracted 20 unique terms from MeSH text format
#> Processing batch 2 of 5
#> Extracted 21 unique terms from MeSH text format
#> Processing batch 3 of 5
#> Extracted 25 unique terms from MeSH text format
#> Processing batch 4 of 5
#> Extracted 19 unique terms from MeSH text format
#> Processing batch 5 of 5
#> Extracted 21 unique terms from MeSH text format
#> Retrieved 102 unique terms from MeSH
#> Sanitizing dictionary with 102 terms...
#>   Removed 56 terms that did not match their claimed entity types
#>   Removed 38 terms with non-alphanumeric characters (final cleanup)
#> Sanitization complete. 8 terms remaining (7.8% of original)

# Load a drug dictionary
drug_dict <- load_dictionary(
  dictionary_type = "drug",
  source = "mesh"
)
#> Searching MeSH database for: pharmaceutical preparations[MeSH]
#> Found 985543 PubMed records with matching MeSH terms
#> Error getting MeSH links: Must specify either (not both) 'id' or 'web_history' arguments
#> Trying direct MeSH database search...
#> Processing batch 1 of 2
#> Extracted 23 unique terms from MeSH text format
#> Processing batch 2 of 2
#> Extracted 3 unique terms from MeSH text format
#> Retrieved 26 unique terms from MeSH
#> Sanitizing dictionary with 26 terms...
#>   Removed 26 terms that did not match their claimed entity types
#> Sanitization complete. 0 terms remaining (0% of original)

# View a sample of each dictionary
head(disease_dict, 3)
#>               term            id    type    source
#> 10     Lobomycosis       MESH_10 disease mesh_text
#> 20         Disease MESH_ENTRY_19 disease mesh_text
#> 48 Osteochondrosis        MESH_7 disease mesh_text
head(drug_dict, 3)
#> [1] term   id     type   source
#> <0 rows> (or 0-length row.names)
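
The printed dictionaries share four columns (`term`, `id`, `type`, `source`), and the drug dictionary came back empty after sanitization. A hand-built table with the same columns might look like the sketch below. The drug terms, identifiers, and `"custom"` source label are illustrative choices, and whether `extract_entities()` accepts such a table unchanged is an assumption to check against the package documentation.

```r
# Hypothetical hand-built drug dictionary using the column layout printed above
custom_drug_dict <- data.frame(
  term = c("sumatriptan", "topiramate", "propranolol"),
  id = paste0("CUSTOM_", 1:3),   # made-up identifiers
  type = "drug",
  source = "custom",             # made-up source label
  stringsAsFactors = FALSE
)

# Stand-in for two rows of the disease dictionary shown above
example_disease_dict <- data.frame(
  term = c("Lobomycosis", "Osteochondrosis"),
  id = c("MESH_10", "MESH_7"),
  type = "disease",
  source = "mesh_text",
  stringsAsFactors = FALSE
)

# Matching column names let the two tables be stacked with rbind()
combined_dict <- rbind(example_disease_dict, custom_drug_dict)
combined_dict
```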

Basic Entity Extraction

Now we can extract entities from the text using these dictionaries:

# Extract disease and drug entities
entities <- extract_entities(
  preprocessed_data,
  text_column = "abstract",
  dictionary = rbind(disease_dict, drug_dict),
  case_sensitive = FALSE,
  overlap_strategy = "priority"
)
#> Sanitizing dictionary with 8 terms...
#> Sanitization complete. 8 terms remaining (100% of original)
#> Extracting entities from 98 documents...
#> Extracted 36 entity mentions:
#>   disease: 36

# View some extracted entities
head(entities[, c("doc_id", "entity", "entity_type", "sentence")], 10)
#>    doc_id  entity entity_type
#> 1       5 Disease     disease
#> 2      10 Disease     disease
#> 3      22 Disease     disease
#> 4      22 Disease     disease
#> 5      31 Disease     disease
#> 6      41 Disease     disease
#> 7      44 Disease     disease
#> 8      45 Disease     disease
#> 9      47 Disease     disease
#> 10     54 Disease     disease
#>                                                                                                                                                                                                                                 sentence
#> 1               The effect of hormones, electrolytes (magnesium, calcium, sodium), vitamins (vitamin D, B12), and other biologically active molecules (melatonin, L-carnitine, L-tryptophan) on the course of the disease is considered.
#> 2                                                The the prevalence of aura (p = 0.028), age (p = 0.001) and mean disease duration (p < 0.001, t=-4.257) were significantly higher in migraine patients with WMH than those without WMH.
#> 3                                                                                                                                                       Neuroimaging has contributed to a better understanding of VSS disease mechanism.
#> 4                                                                                       Given the complexity of its disease state, multidisciplinary therapeutic approaches appear to be required for more effective symptom management.
#> 5                                                                                 Ménière's disease (4%) and vestibular neuritis/labyrinthitis (3.9%) were associated with younger patients and unilateral or asymmetrical hearing loss.
#> 6  Therefore, although migraine with and without aura are considered two types of the same disease, more research should focus on their differences, thus finally enabling better specific treatment options for both types of migraine.
#> 7                                                                                                                                                      Common comorbidities were hypertension, diabetes, and polycystic ovarian disease.
#> 8                                                             CONCLUSIONS: Our study shows that UFs share substantial genetic basis with traits related to BP, obesity, diabetes, and migraine, a predominantly female vascular disease.
#> 9                  BACKGROUND: Coronavirus disease 2019 (COVID-19), caused by the SARS-CoV-2 virus, placed unprecedented pressure on public health systems due to its mortality and global panic-and later due to long COVID challenges.
#> 10                        Central nervous system (CNS) disorders, such as Alzheimer's disease (AD), Parkinson's disease (PD), multiple sclerosis (MS), and migraines, rank among the most prevalent and concerning conditions worldwide.
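
Dictionary-based extraction of this kind boils down to searching each text for every dictionary term. The following base-R sketch shows the idea with case-insensitive, whole-word matching; `match_entities()` is a hypothetical helper and does not reproduce the package's overlap handling or sentence tracking.

```r
# Toy dictionary matcher in base R; not the LBDiscover implementation.
# Assumes dictionary terms are plain alphanumeric strings (no regex metacharacters).
match_entities <- function(texts, dictionary, case_sensitive = FALSE) {
  hits <- list()
  for (i in seq_along(texts)) {
    if (is.na(texts[i])) next  # skip missing abstracts
    for (j in seq_len(nrow(dictionary))) {
      # \\b anchors stop a term from matching inside a longer word
      pattern <- paste0("\\b", dictionary$term[j], "\\b")
      found <- gregexpr(pattern, texts[i],
                        ignore.case = !case_sensitive, perl = TRUE)[[1]]
      n <- sum(found > 0)  # gregexpr() returns -1 when nothing matches
      if (n > 0) {
        # One output row per mention
        hits[[length(hits) + 1]] <- data.frame(
          doc_id = rep(i, n),
          entity = dictionary$term[j],
          entity_type = dictionary$type[j],
          stringsAsFactors = FALSE
        )
      }
    }
  }
  if (length(hits) == 0) {
    return(data.frame(doc_id = integer(0), entity = character(0),
                      entity_type = character(0), stringsAsFactors = FALSE))
  }
  do.call(rbind, hits)
}

toy_dict <- data.frame(term = "Disease", type = "disease",
                       stringsAsFactors = FALSE)
toy_texts <- c(
  "Meniere's disease and Alzheimer's disease were reported.",
  "No relevant term here.",
  "Disease duration was longer."
)
found_entities <- match_entities(toy_texts, toy_dict)
found_entities
```

As in the output above, a document that mentions a term twice contributes two rows.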

Complete Entity Extraction Workflow

For a more comprehensive approach, we can use the complete entity extraction workflow:

# Detect whether the code is running under R CMD check.
# Sys.getenv() returns "" (never NULL) for unset variables, so test with nzchar().
is_check <- nzchar(Sys.getenv("_R_CHECK_LIMIT_CORES_")) ||
  (!interactive() && identical(Sys.getenv("R_CHECK_RUNNING"), "true"))

# Set number of cores based on environment
num_cores_to_use <- if (is_check) 1 else 4

# Extract entities using the complete workflow
entities_workflow <- extract_entities_workflow(
  preprocessed_data,
  text_column = "abstract",
  entity_types = c("disease", "drug", "gene", "protein", "pathway"),
  dictionary_sources = c("local", "mesh"),
  sanitize = TRUE,
  parallel = !is_check,           # Disable parallel in check environment
  num_cores = num_cores_to_use    # Use 1 core in check environment
)
#> Running in R CMD check environment. Disabling parallel processing.
#> Creating dictionaries for entity extraction...
#> Loading dictionaries sequentially...
#> Package not installed or dictionary not found. Using example dictionaries.
#> Creating dummy dictionary for disease
#>   Added 20 terms from disease (local)
#> Package not installed or dictionary not found. Using example dictionaries.
#> Creating dummy dictionary for drug
#>   Added 20 terms from drug (local)
#> Package not installed or dictionary not found. Using example dictionaries.
#> Creating dummy dictionary for gene
#>   Added 20 terms from gene (local)
#> Searching MeSH database for: proteins[MeSH]
#> Found 7579035 PubMed records with matching MeSH terms
#> Error getting MeSH links: Must specify either (not both) 'id' or 'web_history' arguments
#> Trying direct MeSH database search...
#> Processing batch 1 of 5
#> Extracted 24 unique terms from MeSH text format
#> Processing batch 2 of 5
#> Extracted 23 unique terms from MeSH text format
#> Processing batch 3 of 5
#> Extracted 19 unique terms from MeSH text format
#> Processing batch 4 of 5
#> Extracted 20 unique terms from MeSH text format
#> Processing batch 5 of 5
#> Extracted 19 unique terms from MeSH text format
#> Retrieved 105 unique terms from MeSH
#>   Added 105 terms from protein (mesh)
#> Searching MeSH database for: metabolic networks and pathways[MeSH]
#> Found 184983 PubMed records with matching MeSH terms
#> Error getting MeSH links: Must specify either (not both) 'id' or 'web_history' arguments
#> Trying direct MeSH database search...
#> Processing batch 1 of 1
#> Extracted 6 unique terms from MeSH text format
#> Retrieved 6 unique terms from MeSH
#>   Added 6 terms from pathway (mesh)
#> Created combined dictionary with 171 unique terms
#> Sanitizing dictionary with 171 terms...
#>   Removed 8 terms with numbers followed by special characters
#>   Correcting type for 'headache' from 'disease' to 'symptom'
#>   Correcting type for 'fatigue' from 'disease' to 'symptom'
#>   Applied 2 type corrections for commonly misclassified terms
#>   Removed 107 terms that did not match their claimed entity types
#>   Removed 42 terms with non-alphanumeric characters (final cleanup)
#> Sanitization complete. 14 terms remaining (8.2% of original)
#> Extracting entities from 98 documents...
#> Processing batch 1/1
#> Extracting entities from 98 documents...
#> Extracted 662 entity mentions:
#>   disease: 533
#>   protein: 13
#>   symptom: 116
#> Extracted 662 entity mentions in 0.14 minutes
#>   disease: 533
#>   protein: 13
#>   symptom: 116

# View summary of entity types
table(entities_workflow$entity_type)
#> 
#> disease protein symptom 
#>     533      13     116
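
The workflow call above requests a fixed four cores outside the check environment. As a minimal sketch, assuming you would rather adapt to the machine at hand, the core count can be derived from parallel::detectCores(). The helper name pick_cores and its max_cores argument are hypothetical and not part of LBDiscover.

```r
# Hypothetical helper (not part of LBDiscover): choose a safe core count.
pick_cores <- function(is_check, max_cores = 4L) {
  # Always stay sequential under R CMD check
  if (is_check) {
    return(1L)
  }
  available <- parallel::detectCores()
  # detectCores() can return NA when the count cannot be determined
  if (is.na(available)) {
    return(1L)
  }
  # Leave one core free, but never go below 1 or above max_cores
  max(1L, min(max_cores, available - 1L))
}

cores_for_check <- pick_cores(is_check = TRUE)
cores_for_local <- pick_cores(is_check = FALSE)
```

The result could then be passed as the num_cores argument in place of the hard-coded value.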

Customizing Entity Extraction

We can customize the entity extraction process by passing type-specific MeSH queries through the additional_mesh_queries argument and user-defined terms through the custom_dictionary argument:

# Define custom MeSH queries for different entity types
mesh_queries <- list(
  "disease" = "migraine disorders[MeSH] OR headache disorders[MeSH]",
  "drug" = "analgesics[MeSH] OR serotonin agonists[MeSH] OR anticonvulsants[MeSH]",
  "gene" = "genes[MeSH] OR channelopathy[MeSH]"
)

# Create a custom dictionary
custom_dict <- data.frame(
  term = c("CGRP", "trigeminal nerve", "cortical spreading depression"),
  type = c("protein", "anatomy", "biological_process"),
  id = c("CUSTOM_1", "CUSTOM_2", "CUSTOM_3"),
  source = rep("custom", 3),
  stringsAsFactors = FALSE
)

# Extract entities with custom settings
custom_entities <- extract_entities_workflow(
  preprocessed_data,
  text_column = "abstract",
  entity_types = c("disease", "drug", "gene", "protein", "pathway"),
  dictionary_sources = c("local", "mesh"),
  additional_mesh_queries = mesh_queries,
  custom_dictionary = custom_dict,
  sanitize = TRUE
)
#> Running in R CMD check environment. Disabling parallel processing.
#> Creating dictionaries for entity extraction...
#> Adding 3 terms from custom dictionary
#> Loading dictionaries sequentially...
#>   Using cached dictionary for disease (local)
#>   Using cached dictionary for drug (local)
#>   Using cached dictionary for gene (local)
#>   Using cached dictionary for protein (mesh)
#>   Using cached dictionary for pathway (mesh)
#> Created combined dictionary with 174 unique terms
#> Sanitizing dictionary with 171 terms...
#>   Removed 8 terms with numbers followed by special characters
#>   Correcting type for 'headache' from 'disease' to 'symptom'
#>   Correcting type for 'fatigue' from 'disease' to 'symptom'
#>   Applied 2 type corrections for commonly misclassified terms
#>   Removed 107 terms that did not match their claimed entity types
#>   Removed 42 terms with non-alphanumeric characters (final cleanup)
#> Sanitization complete. 14 terms remaining (8.2% of original)
#> Extracting entities from 98 documents...
#> Processing batch 1/1
#> Extracting entities from 98 documents...
#> Extracted 736 entity mentions:
#>   anatomy: 2
#>   biological_process: 8
#>   disease: 533
#>   protein: 77
#>   symptom: 116
#> Extracted 736 entity mentions in 0.01 minutes
#>   anatomy: 2
#>   biological_process: 8
#>   disease: 533
#>   protein: 77
#>   symptom: 116

# View custom entities
custom_entities[custom_entities$source == "custom", ]
#> [1] entity      entity_type doc_id      start_pos   end_pos     sentence   
#> [7] frequency  
#> <0 rows> (or 0-length row.names)

Dictionary Sanitization

Extraction quality depends heavily on dictionary quality. The sanitize_dictionary() function removes problematic entries from a dictionary before it is used for extraction:

# Create a raw dictionary with some problematic entries
raw_dict <- data.frame(
  term = c("migraine", "5-HT", "headache", "the", "and", "patient", "inflammation", "study"),
  type = c("disease", "chemical", "symptom", "NA", "NA", "NA", "biological_process", "NA"),
  id = paste0("ID_", 1:8),
  source = rep("example", 8),
  stringsAsFactors = FALSE
)

# Sanitize the dictionary
sanitized_dict <- sanitize_dictionary(
  raw_dict,
  term_column = "term",
  type_column = "type",
  validate_types = TRUE,
  verbose = TRUE
)
#> Sanitizing dictionary with 8 terms...
#>   Removed 1 terms with numbers followed by special characters
#>   Removed 1 common non-medical terms, conjunctive adverbs, and general terms
#> Sanitization complete. 6 terms remaining (75% of original)

# View the sanitized dictionary
sanitized_dict
#>           term               type   id  source
#> 1     migraine            disease ID_1 example
#> 3     headache            symptom ID_3 example
#> 4          the                 NA ID_4 example
#> 5          and                 NA ID_5 example
#> 7 inflammation biological_process ID_7 example
#> 8        study                 NA ID_8 example
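
In the output above, the sanitized dictionary still contains "the", "and", and "study", each with the placeholder type "NA". A minimal follow-up sketch, assuming that entries without a real type are not wanted for extraction, drops them with base R. The data frame below is retyped by hand from the printed output, and the names sanitized_example and typed_dict are hypothetical.

```r
# Hand-copied from the printed sanitized dictionary above
sanitized_example <- data.frame(
  term = c("migraine", "headache", "the", "and", "inflammation", "study"),
  type = c("disease", "symptom", "NA", "NA", "biological_process", "NA"),
  id = paste0("ID_", c(1, 3, 4, 5, 7, 8)),
  source = "example",
  stringsAsFactors = FALSE
)

# Keep rows whose type is neither a missing value nor the literal string "NA"
has_type <- !is.na(sanitized_example$type) & sanitized_example$type != "NA"
typed_dict <- sanitized_example[has_type, ]
typed_dict$term
```

Filtering on the type column rather than on a stop word list keeps the step independent of any particular word list, at the cost of also dropping legitimate terms that were never assigned a type.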

Mapping Terms to Biomedical Ontologies

We can map extracted terms to standard biomedical ontologies like MeSH or UMLS:

# Extract terms to map
terms_to_map <- c("migraine", "headache", "CGRP", "serotonin")

# Map to MeSH
mesh_mappings <- map_ontology(
  terms_to_map,
  ontology = "mesh",
  fuzzy_match = TRUE,
  similarity_threshold = 0.8
)
#> Searching MeSH database for: disease[MeSH]
#> Found 193563 PubMed records with matching MeSH terms
#> Error getting MeSH links: Must specify either (not both) 'id' or 'web_history' arguments
#> Trying direct MeSH database search...
#> Processing batch 1 of 5
#> Extracted 20 unique terms from MeSH text format
#> Processing batch 2 of 5
#> Extracted 21 unique terms from MeSH text format
#> Processing batch 3 of 5
#> Extracted 25 unique terms from MeSH text format
#> Processing batch 4 of 5
#> Extracted 19 unique terms from MeSH text format
#> Processing batch 5 of 5
#> Extracted 21 unique terms from MeSH text format
#> Retrieved 102 unique terms from MeSH
#> Sanitizing dictionary with 102 terms...
#>   Removed 56 terms that did not match their claimed entity types
#>   Removed 38 terms with non-alphanumeric characters (final cleanup)
#> Sanitization complete. 8 terms remaining (7.8% of original)
#> No matches found for the input terms in the mesh ontology

# View MeSH mappings
mesh_mappings
#> [1] term          ontology_id   ontology_term match_type   
#> <0 rows> (or 0-length row.names)
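
To make the similarity_threshold argument more concrete, the sketch below shows one common way a fuzzy match score can be computed: normalized Levenshtein similarity using utils::adist() from base R. This is an illustration only. The source does not state which metric map_ontology() uses, and the helper string_similarity, the misspelled input term, and the candidate labels are all hypothetical.

```r
# Hypothetical helper: similarity in [0, 1] from Levenshtein edit distance
string_similarity <- function(a, b) {
  d <- utils::adist(tolower(a), tolower(b))
  # Normalize each distance by the length of the longer string in the pair
  longest <- pmax(outer(nchar(a), nchar(b), pmax), 1)
  sim <- 1 - d / longest
  dimnames(sim) <- list(a, b)
  sim
}

input_terms <- c("migrane", "serotonin")  # first term is deliberately misspelled
candidate_labels <- c("Migraine Disorders", "Migraine", "Serotonin")

sim <- string_similarity(input_terms, candidate_labels)

# For each input term, keep the best candidate if it clears the threshold
threshold <- 0.8
best_idx <- apply(sim, 1, which.max)
best_score <- sim[cbind(seq_along(input_terms), best_idx)]
matches <- data.frame(
  term = input_terms,
  ontology_term = candidate_labels[best_idx],
  similarity = best_score,
  stringsAsFactors = FALSE
)
matches <- matches[matches$similarity >= threshold, ]
matches
```

Under this metric a single missing letter in an eight-letter word still clears a threshold of 0.8, while a short term compared against a much longer label does not.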

Topic Modeling

We can also apply topic modeling to discover the main themes in the corpus:

# Extract topics from the corpus
topics <- extract_topics(
  migraine_articles,
  text_column = "abstract",
  n_topics = 5,
  max_terms = 10
)
#> Tokenizing text...

# View top terms for each topic
topics$topics
#> $`Topic 1`
#>                term    weight
#> migraine   migraine 158.54694
#> cgrp           cgrp  33.93941
#> aura           aura  28.48012
#> headache   headache  26.85610
#> pain           pain  26.52241
#> between     between  23.21742
#> patients   patients  20.81965
#> study         study  20.37030
#> related     related  20.09579
#> treatment treatment  19.25877
#> 
#> $`Topic 2`
#>                    term    weight
#> progression progression 37.439073
#> migraine       migraine 36.013245
#> 634                 634 10.570779
#> definitions definitions 10.570779
#> midas             midas  9.241191
#> mhd                 mhd  8.808983
#> increase       increase  7.308507
#> odds               odds  7.052829
#> year               year  7.047243
#> definition   definition  7.047186
#> 
#> $`Topic 3`
#>                    term   weight
#> patients       patients 44.09683
#> brain             brain 40.43149
#> asd                 asd 39.81409
#> mwoa               mwoa 39.58473
#> stroke           stroke 34.74475
#> after             after 31.86060
#> network         network 29.71455
#> significant significant 28.73627
#> compared       compared 26.71239
#> individuals individuals 25.19733
#> 
#> $`Topic 4`
#>                    term   weight
#> migraine       migraine 73.11895
#> covid             covid 60.29671
#> patients       patients 48.10263
#> long               long 33.81408
#> individuals individuals 31.86965
#> headache       headache 27.79943
#> symptoms       symptoms 25.11927
#> without         without 20.33448
#> sex                 sex 15.35365
#> study             study 15.23040
#> 
#> $`Topic 5`
#>                      term   weight
#> migraine         migraine 78.23895
#> group               group 28.04809
#> patients         patients 24.72097
#> study               study 16.40871
#> between           between 15.56503
#> edema               edema 11.28163
#> perilesional perilesional 11.28163
#> days                 days 11.08811
#> scores             scores 10.27279
#> compared         compared 10.20141
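
Several topics above include function words such as "between", "after", and "without". As a minimal sketch, assuming topics$topics is a named list of data frames with term and weight columns (as the printed output suggests), the terms can be filtered and condensed into one line per topic. The toy list holds the top rows of two topics copied from the output, and the function_words vector is a hypothetical, hand-picked list.

```r
# Top rows of two topics, copied from the printed output above
topic_list <- list(
  `Topic 1` = data.frame(
    term = c("migraine", "cgrp", "aura", "headache", "pain", "between"),
    weight = c(158.54694, 33.93941, 28.48012, 26.85610, 26.52241, 23.21742),
    stringsAsFactors = FALSE
  ),
  `Topic 3` = data.frame(
    term = c("patients", "brain", "asd", "mwoa", "stroke", "after"),
    weight = c(44.09683, 40.43149, 39.81409, 39.58473, 34.74475, 31.86060),
    stringsAsFactors = FALSE
  )
)

# Hand-picked function words to hide (an assumption, not a package list)
function_words <- c("between", "after", "without")

# One comma-separated line of the highest-weighted remaining terms per topic
top_terms <- vapply(topic_list, function(tbl) {
  kept <- tbl[!tbl$term %in% function_words, ]
  kept <- kept[order(-kept$weight), ]
  paste(head(kept$term, 5), collapse = ", ")
}, character(1))

top_terms
```

This only changes how the topics are displayed. Removing such words before modeling, for example through the stop word options of the preprocessing step, would also change the topics themselves.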

Text Preprocessing and Entity Extraction

Chao Liu

2025-05-15