
Getting Started with Literature-Based Discovery
Chao Liu
2025-05-15
Source: vignettes/Intro_to_Literature-Based_Discovery.Rmd
Introduction
Literature-based discovery (LBD) is an approach to identifying hidden connections between pieces of existing knowledge in the scientific literature. This vignette introduces the LBDiscover package, which provides tools for automated literature-based discovery by analyzing biomedical publications.
Installation
You can install the package from CRAN:
install.packages("LBDiscover")
Alternatively, you can install the development version of LBDiscover from GitHub:
# install.packages("devtools")
devtools::install_github("chaoliu-cl/LBDiscover")
Basic Workflow
The typical workflow for literature-based discovery with this package consists of:
- Retrieving publications from PubMed or other sources
- Preprocessing text data
- Extracting biomedical entities
- Creating a co-occurrence matrix
- Applying discovery models
- Visualizing and evaluating results
Let’s walk through a comprehensive example exploring connections in migraine research.
Example: Exploring Migraine Research
In this example, we’ll explore potential discoveries in migraine research by applying the improved ABC model approach with various utility functions.
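For readers new to the ABC model: it starts from a term A, collects intermediate B terms that co-occur with A, and then looks for C terms that co-occur with a B term but not directly with A. The block below is a minimal base-R sketch of that idea. The placeholder terms, the made-up counts, and the scoring rule (product of the two link strengths) are assumptions for illustration only; they are not the scoring used by abc_model().

```r
# Toy symmetric co-occurrence matrix (placeholder terms, made-up counts)
terms <- c("A", "B1", "B2", "C1", "C2")
co <- matrix(0, nrow = 5, ncol = 5, dimnames = list(terms, terms))

# Helper (hypothetical, for this sketch only): set a symmetric link
set_link <- function(m, x, y, value) {
  m[x, y] <- value
  m[y, x] <- value
  m
}
co <- set_link(co, "A", "B1", 8)
co <- set_link(co, "A", "B2", 3)
co <- set_link(co, "B1", "C1", 5)
co <- set_link(co, "B2", "C2", 2)
co <- set_link(co, "A", "C2", 1)  # C2 is already directly linked to A

a_term <- "A"
# B terms: everything that co-occurs with A
b_terms <- names(which(co[a_term, ] > 0))

candidates <- do.call(rbind, lapply(b_terms, function(b) {
  # C terms: co-occur with B but have no direct link to A
  linked_to_b <- names(which(co[b, ] > 0))
  c_terms <- setdiff(linked_to_b, c(a_term, b_terms))
  if (length(c_terms) == 0) return(NULL)
  data.frame(
    a_term = a_term, b_term = b, c_term = c_terms,
    toy_score = unname(co[a_term, b] * co[b, c_terms]),
    stringsAsFactors = FALSE
  )
}))
print(candidates)
```

Only C1 survives in this toy case, because C2 already co-occurs with A and therefore is not a hidden connection.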
1. Load the package
library(LBDiscover)
#> Loading LBDiscover package
2. Define the primary term of interest
# Define the primary term of interest for our analysis
primary_term <- "migraine"
3. Retrieve publications
We’ll search for articles about migraine pathophysiology and treatment.
# Search for migraine-related articles
migraine_articles <- pubmed_search(
query = paste0(primary_term, " pathophysiology"),
max_results = 1000
)
# Search for treatment-related articles
drug_articles <- pubmed_search(
query = "neurological drugs pain treatment OR migraine therapy OR headache medication",
max_results = 1000
)
# Combine and remove duplicates
all_articles <- merge_results(migraine_articles, drug_articles)
cat("Retrieved", nrow(all_articles), "unique articles\n")
#> Retrieved 1925 unique articles
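The combined set (1,925 articles) is smaller than the 2,000 records requested, which is consistent with some overlap between the two queries. The deduplication idea can be sketched in base R as follows. This is only a sketch: the `pmid` column name and the toy records are assumptions, and merge_results() may use a different key or strategy.

```r
# Two toy result sets that share one record (identifiers are made up)
set_a <- data.frame(pmid = c("101", "102", "103"),
                    title = c("Paper A", "Paper B", "Paper C"),
                    stringsAsFactors = FALSE)
set_b <- data.frame(pmid = c("103", "104"),
                    title = c("Paper C", "Paper D"),
                    stringsAsFactors = FALSE)

# Stack the sets, then keep the first occurrence of each identifier
combined <- rbind(set_a, set_b)
unique_articles <- combined[!duplicated(combined$pmid), ]
rownames(unique_articles) <- NULL

cat("Retrieved", nrow(unique_articles), "unique articles\n")
```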
4. Extract variations of the primary term
# Extract variations of our primary term using the utility function
primary_term_variations <- get_term_vars(all_articles, primary_term)
cat("Found", length(primary_term_variations), "variations of", primary_term, "in the corpus:\n")
#> Found 13 variations of migraine in the corpus:
print(head(primary_term_variations, 10))
#> [1] "migraine" "Migraine" "MigrainE" "migraines" "Migraines"
#> [6] "migraineur" "ofmigraine" "Migraineux" "migraineurs" "Migraineurs"
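The variations above include case variants, plurals, and run-together tokens such as "ofmigraine". One way to collect surface forms like these is a case-insensitive pattern match on the stem, sketched below in base R. The toy abstracts and the pattern are assumptions; get_term_vars() may be implemented differently.

```r
# Toy abstracts (made up for illustration)
abstracts <- c(
  "Migraine is common; migraineurs report aura.",
  "Treatment ofmigraine and other migraines remains difficult."
)
stem <- "migraine"

# Match any run of letters that contains the stem, ignoring case
pattern <- paste0("[[:alpha:]]*", stem, "[[:alpha:]]*")
hits <- regmatches(abstracts, gregexpr(pattern, abstracts, ignore.case = TRUE))

# Flatten the per-abstract matches and drop repeats
variations <- unique(unlist(hits))
print(variations)
```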
5. Preprocess text data
# Preprocess text
preprocessed_articles <- preprocess_text(
all_articles,
text_column = "abstract",
remove_stopwords = TRUE,
min_word_length = 2 # Set min_word_length to capture short terms
)
#> Tokenizing text...
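To make the roles of remove_stopwords and min_word_length concrete, here is a minimal base-R sketch of tokenization with both filters. The sample sentence and the tiny stopword list are assumptions, and preprocess_text() may perform additional or different steps.

```r
text <- "Migraine is a disabling disorder, and CGRP is a key mediator of it."
stopwords <- c("is", "a", "and", "of", "it", "the")  # tiny illustrative list
min_word_length <- 2

# Lowercase, split on anything that is not a letter or digit, drop empties
tokens <- unlist(strsplit(tolower(text), "[^[:alnum:]]+"))
tokens <- tokens[nzchar(tokens)]

# Remove stopwords and tokens shorter than the minimum length
tokens <- tokens[!(tokens %in% stopwords) & nchar(tokens) >= min_word_length]
print(tokens)
```

A low minimum length matters in this domain because short tokens such as gene or peptide abbreviations would otherwise be discarded.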
6. Create a custom dictionary
# Create a custom dictionary with all variations of our primary term
custom_dictionary <- data.frame(
term = c(primary_term, primary_term_variations),
type = rep("disease", length(primary_term_variations) + 1),
id = paste0("CUSTOM_", 1:(length(primary_term_variations) + 1)),
source = rep("custom", length(primary_term_variations) + 1),
stringsAsFactors = FALSE
)
# Define additional MeSH queries for extended dictionaries
mesh_queries <- list(
"disease" = paste0(primary_term, " disorders[MeSH] OR headache disorders[MeSH]"),
"protein" = "receptors[MeSH] OR ion channels[MeSH]",
"chemical" = "neurotransmitters[MeSH] OR vasoactive agents[MeSH]",
"pathway" = "signal transduction[MeSH] OR pain[MeSH]",
"drug" = "analgesics[MeSH] OR serotonin agonists[MeSH] OR anticonvulsants[MeSH]",
"gene" = "genes[MeSH] OR channelopathy[MeSH]"
)
# Sanitize the custom dictionary
custom_dictionary <- sanitize_dictionary(
custom_dictionary,
term_column = "term",
type_column = "type",
validate_types = FALSE # Don't validate custom terms as they're trusted
)
#> Sanitizing dictionary with 14 terms...
#> Sanitization complete. 14 terms remaining (100% of original)
7. Extract biomedical entities
# Extract entities using our custom dictionary
custom_entities <- extract_entities(
preprocessed_articles,
text_column = "abstract",
dictionary = custom_dictionary,
case_sensitive = FALSE,
overlap_strategy = "priority",
sanitize_dict = FALSE # Already sanitized
)
# Extract entities using the standard workflow with improved entity validation
# Check if running in an R CMD check environment.
# Sys.getenv() returns "" (never NULL) for unset variables, so compare values
# directly and use nzchar() instead of is.null()
is_check <- !interactive() &&
identical(Sys.getenv("R_CHECK_RUNNING"), "true")
# More robust check for testing environment
if (!is_check && nzchar(Sys.getenv("_R_CHECK_LIMIT_CORES_"))) {
is_check <- TRUE
}
# Set number of cores based on environment
num_cores_to_use <- if (is_check) 1 else 4
standard_entities <- extract_entities_workflow(
preprocessed_articles,
text_column = "abstract",
entity_types = c("disease", "drug", "gene"),
parallel = !is_check, # Disable parallel in check environment
num_cores = num_cores_to_use, # Use 1 core in check environment
batch_size = 500 # Process 500 documents per batch
)
#> Warning in extract_entities_workflow(preprocessed_articles, text_column =
#> "abstract", : UMLS source requested but no API key provided. Skipping UMLS.
# Uncomment to include UMLS entities
# standard_entities <- extract_entities_workflow(
# preprocessed_articles,
# text_column = "abstract",
# entity_types = c("disease", "drug", "gene", "protein", "pathway", "chemical"),
# dictionary_sources = c("local", "mesh", "umls"), # Including UMLS
# additional_mesh_queries = mesh_queries,
# sanitize = TRUE,
# api_key = "your-umls-api-key", # Your UMLS API key here
# parallel = TRUE,
# num_cores = 4
# )
# Combine entity datasets using our utility function
entities <- merge_entities(
custom_entities,
standard_entities,
primary_term
)
#> Combined 6383 custom entities with 8929 standard entities.
# Filter entities to ensure only relevant biomedical terms are included
filtered_entities <- valid_entities(
entities,
primary_term,
primary_term_variations,
validation_function = is_valid_biomedical_entity
)
#> Filtered from 9338 to 9338 validated entities
# View the first few extracted entities
head(filtered_entities)
#>         entity entity_type doc_id start_pos end_pos
#> 1 antimigraine     disease    251       368     379
#> 2 antimigraine     disease    251       667     678
#> 3 antimigraine     disease    292      1188    1199
#> 4 antimigraine     disease    292       876     887
#> 5 antimigraine     disease    292       632     643
#> 6 antimigraine     disease    388       670     681
#> sentence
#> 1 However, the administration of antimigraine drugs in conventional oral pharmaceutical dosage forms is a challenge, since many molecules have difficulty crossing the blood-brain barrier (BBB) to reach the brain, which leads to bioavailability problems.
#> 2 Efforts have been made to find alternative delivery systems and/or routes for antimigraine drugs.
#> 3 The existence of patients with medication-resistant migraine may be due to the: (i) complex migraine pathophysiology, in which several systems appear to be deregulated before, during, and after a migraine attack; and (ii) pharmacodynamic and pharmacokinetic properties of antimigraine medications.
#> 4 EXPERT OPINION: Current anti-CGRPergic medications, although effective, have limitations, such as side effects and lack of antimigraine efficacy in some patients.
#> 5 AREAS COVERED: By searching multiple electronic scientific databases, this narrative review examined: (i) the role of CGRP in migraine; and (ii) the current knowledge on the effects of CGRPergic antimigraine pharmacotherapies, including a brief analysis of their pharmacodynamic and pharmacokinetic characteristics.
#> 6 Behavioral analysis, antioxidant assay, immunohistochemistry (IHC), histopathological examination, ELISA, and RT-PCR were conducted to evaluate the antimigraine potential of genistein.
#> frequency
#> 1 13
#> 2 13
#> 3 13
#> 4 13
#> 5 13
#> 6 13
8. Create co-occurrence matrix
# Create co-occurrence matrix with validated entities
co_matrix <- create_comat(
filtered_entities,
doc_id_col = "doc_id",
entity_col = "entity",
type_col = "entity_type",
normalize = TRUE,
normalization_method = "cosine"
)
#> Building entity-document matrix...
#> Calculating co-occurrence matrix...
#> Normalizing co-occurrence matrix using cosine method...
# Find our primary term in the co-occurrence matrix
a_term <- find_term(co_matrix, primary_term)
#> Found primary term in co-occurrence matrix
# Check matrix dimensions
dim(co_matrix)
#> [1] 19 19
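The log above names two stages: building an entity-document matrix and normalizing the co-occurrence counts with the cosine method. A minimal base-R sketch of both stages follows. The toy mentions are made up, and the normalization shown (count divided by the square root of the product of the two document frequencies) is the usual cosine formula for binary incidence vectors; whether create_comat() computes it exactly this way is an assumption.

```r
# Toy entity mentions: one row per (document, entity) pair
mentions <- data.frame(
  doc_id = c(1, 1, 2, 2, 3, 3, 3),
  entity = c("migraine", "headache", "migraine", "heparin",
             "migraine", "headache", "heparin"),
  stringsAsFactors = FALSE
)

# Binary entity-by-document incidence matrix
incidence <- unclass(table(mentions$entity, mentions$doc_id))
incidence[incidence > 0] <- 1

# Raw co-occurrence: number of documents in which two entities both appear
co_raw <- incidence %*% t(incidence)

# Cosine normalization: divide by sqrt(doc_freq_i * doc_freq_j)
doc_freq <- diag(co_raw)
co_cosine <- co_raw / sqrt(outer(doc_freq, doc_freq))
print(round(co_cosine, 3))
```

After normalization the diagonal is 1, and frequent entities no longer dominate simply because they appear in many documents.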
9. Apply the improved ABC model
# Apply the improved ABC model with enhanced term filtering and type validation
abc_results <- abc_model(
co_matrix,
a_term = a_term,
c_term = NULL, # Allow all potential C terms
min_score = 0.001, # Lower threshold to capture more potential connections
n_results = 500, # Increase to get more candidates before filtering
scoring_method = "combined",
# Focus on biomedically relevant entity types
b_term_types = c("protein", "gene", "pathway", "chemical"),
c_term_types = c("drug", "chemical", "protein", "gene"),
exclude_general_terms = TRUE, # Enable enhanced term filtering
filter_similar_terms = TRUE, # Remove terms too similar to migraine
similarity_threshold = 0.7, # Relatively strict similarity threshold
enforce_strict_typing = TRUE # Enable strict entity type validation
)
#> Filtered 8 B terms (61.5%) that weren't valid biomedical entities
#> Filtered 5 B terms that didn't match specified entity types: protein, gene, pathway, chemical
#> No suitable B terms found with association score > 0.001 after filtering
# If we don't have enough results, try with less stringent criteria
min_desired_results <- 10
if (nrow(abc_results) < min_desired_results) {
cat("Not enough results with strict filtering. Trying with less stringent criteria...\n")
abc_results <- abc_model(
co_matrix,
a_term = a_term,
c_term = NULL,
min_score = 0.0005, # Even lower threshold
n_results = 500,
scoring_method = "combined",
b_term_types = NULL, # No type constraints
c_term_types = NULL, # No type constraints
exclude_general_terms = TRUE,
filter_similar_terms = TRUE,
similarity_threshold = 0.8, # More lenient similarity threshold
enforce_strict_typing = FALSE # Disable strict type validation as fallback
)
}
#> Not enough results with strict filtering. Trying with less stringent criteria...
#> Filtered 8 B terms (61.5%) that weren't valid biomedical entities
#> Filtered out 0 B terms that were too similar to A term (similarity threshold: 0.8)
#> Filtered 8 potential C terms that weren't valid biomedical entities
#> Identifying potential C terms via 5 B terms...
#> |======================================================================| 100%
# View top results
head(abc_results[, c("a_term", "b_term", "c_term", "abc_score", "b_type", "c_type")])
#>               a_term     b_term       c_term  abc_score  b_type  c_type
#> thrombosis1 migraine thrombosis      heparin 0.05907662 disease    drug
#> headache2   migraine   headache      heparin 0.05746834 symptom    drug
#> headache1   migraine   headache   enoxaparin 0.05632963 symptom    drug
#> headache3   migraine   headache  myocarditis 0.05515907 symptom disease
#> headache    migraine   headache azithromycin 0.05207699 symptom    drug
#> headache4   migraine   headache  tocilizumab 0.05207699 symptom    drug
10. Apply statistical validation to the results
# Apply statistical validation to the results
validated_results <- tryCatch({
validate_abc(
abc_results,
co_matrix,
alpha = 0.1, # More lenient significance threshold
correction = "BH", # Benjamini-Hochberg correction for multiple testing
filter_by_significance = FALSE # Keep all results but mark significant ones
)
}, error = function(e) {
cat("Error in statistical validation:", e$message, "\n")
cat("Using original results without validation...\n")
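# Fallback only: add placeholder p-values derived from ABC scores so that the
# expected columns exist. These are not real significance estimates.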
abc_results$p_value <- 1 - abc_results$abc_score / max(abc_results$abc_score, na.rm = TRUE)
abc_results$significant <- abc_results$p_value < 0.1
return(abc_results)
})
#> Using optimized approach for large matrix validation...
#> Using metadata for document count: 1692
#> Calculating statistical significance using hypergeometric test...
#> 0.0% of connections are statistically significant (p < 0.10, BH correction)
# Sort by ABC score and take top results
validated_results <- validated_results[order(-validated_results$abc_score), ]
top_n <- min(100, nrow(validated_results)) # Larger top N for diversification
top_results <- head(validated_results, top_n)
# View top validated results
head(top_results[, c("a_term", "b_term", "c_term", "abc_score", "p_value", "significant")])
#>               a_term     b_term       c_term  abc_score p_value significant
#> thrombosis1 migraine thrombosis      heparin 0.05907662       1       FALSE
#> headache2   migraine   headache      heparin 0.05746834       1       FALSE
#> headache1   migraine   headache   enoxaparin 0.05632963       1       FALSE
#> headache3   migraine   headache  myocarditis 0.05515907       1       FALSE
#> headache    migraine   headache azithromycin 0.05207699       1       FALSE
#> headache4   migraine   headache  tocilizumab 0.05207699       1       FALSE
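The validation log mentions a hypergeometric test with Benjamini-Hochberg (BH) correction. The base-R sketch below shows how such a test can be set up with phyper() and p.adjust(). The document counts are made up, and the way validate_abc() constructs its test from the co-occurrence matrix is not shown in this vignette, so treat the setup as an assumption.

```r
n_docs <- 1000  # total documents in a toy corpus

# Made-up document counts for three candidate A-C pairs
pairs <- data.frame(
  c_term = c("c1", "c2", "c3"),
  n_a    = c(100, 100, 100),  # documents mentioning A
  n_c    = c(50, 40, 30),     # documents mentioning C
  n_both = c(15, 5, 3),       # documents mentioning both
  stringsAsFactors = FALSE
)

# P(overlap >= n_both) if the n_c documents were drawn at random
pairs$p_value <- phyper(pairs$n_both - 1, pairs$n_a, n_docs - pairs$n_a,
                        pairs$n_c, lower.tail = FALSE)

# Benjamini-Hochberg adjustment across all tested pairs
pairs$p_adjusted <- p.adjust(pairs$p_value, method = "BH")
pairs$significant <- pairs$p_adjusted < 0.1
print(pairs)
```

In this toy data only the first pair overlaps far more than chance would predict, so it is the only one flagged after correction.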
11. Diversify and ensure minimum results
# Diversify results using our utility function
diverse_results <- safe_diversify(
top_results,
diversity_method = "both",
max_per_group = 5,
min_score = 0.0001,
min_results = 5
)
# Ensure we have enough results for visualization
diverse_results <- min_results(
diverse_results,
top_results,
a_term,
min_results = 3
)
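The max_per_group argument suggests that diversification caps how many results any single group can contribute. A minimal base-R sketch of a per-B-term cap is shown below. The placeholder terms and scores are made up, and the grouping logic behind diversity_method = "both" in safe_diversify() is not documented here, so this shows only the general idea.

```r
# Toy results (placeholder terms, made-up scores)
results <- data.frame(
  b_term = c("b1", "b1", "b1", "b2", "b2", "b3"),
  c_term = c("c1", "c2", "c3", "c4", "c5", "c6"),
  abc_score = c(0.057, 0.056, 0.052, 0.059, 0.040, 0.030),
  stringsAsFactors = FALSE
)
max_per_group <- 2

# Sort by descending score, then rank each row within its B term
results <- results[order(-results$abc_score), ]
rank_in_group <- ave(results$abc_score, results$b_term,
                     FUN = function(s) seq_along(s))

# Keep at most max_per_group rows per B term
diverse <- results[rank_in_group <= max_per_group, ]
print(diverse)
```

Without the cap, a single well-connected B term (here b1) would crowd out connections that run through other intermediates.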
12. Visualize the results
# Create heatmap visualization
plot_heatmap(
diverse_results,
output_file = "migraine_heatmap.png",
width = 1200,
height = 900,
top_n = 15,
min_score = 0.0001,
color_palette = "blues",
show_entity_types = TRUE
)
#> No connections are statistically significant (p < 0.05)
#> Created heatmap visualization: migraine_heatmap.png
# Create network visualization
plot_network(
diverse_results,
output_file = "migraine_network.png",
width = 1200,
height = 900,
top_n = 15,
min_score = 0.0001,
node_size_factor = 5,
color_by = "type",
title = "Migraine Treatment Network",
show_entity_types = TRUE,
label_size = 1.0
)
#> Created network visualization: migraine_network.png
13. Create interactive visualizations
# Create interactive HTML network visualization
export_network(
diverse_results,
output_file = "migraine_network.html",
top_n = min(30, nrow(diverse_results)),
min_score = 0.0001,
open = FALSE # Don't automatically open in browser
)
# Create interactive chord diagram
export_chord(
diverse_results,
output_file = "migraine_chord.html",
top_n = min(30, nrow(diverse_results)),
min_score = 0.0001,
open = FALSE
)
#> Number of unique terms: 9
#> First few terms: migraine, thrombosis, headache, fatigue, heparin
#> Role assignments: A=1, B=3, C=5
14. Evaluate literature support and generate report
# Evaluate literature support for top connections
evaluation <- eval_evidence(
diverse_results,
max_results = 5,
base_term = "migraine",
max_articles = 5
)
#>
#> === Evaluation of Top Results ===
#>
#> Evaluating potential treatment: heparin (drug)
#> ABC score: 0.0591
#> P-value: 1 - Not statistically significant
#> Connection through intermediary: thrombosis
#> Found 4 articles directly linking migraine and heparin
#> Most recent article: Stroke Following Blunt Head Trauma: A Case Report and Review of the Literature.
#>
#> Evaluating potential treatment: heparin (drug)
#> ABC score: 0.0575
#> P-value: 1 - Not statistically significant
#> Connection through intermediary: headache
#> Found 4 articles directly linking migraine and heparin
#> Most recent article: Stroke Following Blunt Head Trauma: A Case Report and Review of the Literature.
#>
#> Evaluating potential treatment: enoxaparin (drug)
#> ABC score: 0.0563
#> P-value: 1 - Not statistically significant
#> Connection through intermediary: headache
#> Found 5 articles directly linking migraine and enoxaparin
#> Most recent article: Cerebral Venous Sinus Thrombosis Following Varicella Infection: A Case Report.
#>
#> Evaluating potential treatment: myocarditis (disease)
#> ABC score: 0.0552
#> P-value: 1 - Not statistically significant
#> Connection through intermediary: headache
#> Found 4 articles directly linking migraine and myocarditis
#> Most recent article: Brain white matter hyperintensities in Kawasaki disease: A case-control study.
#>
#> Evaluating potential treatment: azithromycin (drug)
#> ABC score: 0.0521
#> P-value: 1 - Not statistically significant
#> Connection through intermediary: headache
#> Found 1 articles directly linking migraine and azithromycin
#> Most recent article: A comparison of ciprofloxacin, norfloxacin, ofloxacin, azithromycin and cefixime examined by observational cohort studies.
# Prepare articles for report generation
articles_with_years <- prep_articles(all_articles)
#> Found 1925 articles with valid publication years
# Store results for report
results_list <- list(abc = diverse_results)
# Store visualization paths
visualizations <- list(
heatmap = "migraine_heatmap.png",
network = "migraine_network.html",
chord = "migraine_chord.html"
)
# Create comprehensive report
gen_report(
results_list = results_list,
visualizations = visualizations,
articles = articles_with_years,
output_file = "migraine_discoveries.html"
)
#> Generated comprehensive report: migraine_discoveries.html
cat("\nDiscovery analysis complete!\n")
#>
#> Discovery analysis complete!
Interactive Visualizations and Report
The LBDiscover package generates interactive visualizations and a comprehensive report. In the rendered vignette, the report generated above (migraine_discoveries.html) is embedded in this section along with its interactive visualizations.
Advanced Features
The LBDiscover package offers several advanced features that we’ve demonstrated in this example:
- Term variation detection: Automatically finding different forms of your primary term of interest
- Custom dictionary integration: Creating and using custom dictionaries alongside standard ones
- Entity validation: Filtering entities to ensure biomedical relevance
- Improved ABC model: Enhanced scoring methods and filtering options
- Statistical validation: Applying hypergeometric tests with multiple-testing correction (such as Benjamini-Hochberg) to potential discoveries
- Result diversification: Ensuring a diverse set of discovery candidates
- Interactive visualizations: Creating dynamic network and chord diagrams
- Evidence evaluation: Assessing the literature support for discoveries
- Comprehensive reporting: Generating detailed HTML reports of findings
Conclusion
This vignette has demonstrated a comprehensive workflow for literature-based discovery using the LBDiscover package. The improved ABC model and additional utility functions provide a robust framework for identifying potential novel connections in biomedical literature. Users can explore further examples in the package's inst/examples folder.