Getting Started with Literature-Based Discovery

Introduction

Literature-based discovery (LBD) is a powerful approach to identifying hidden connections between existing knowledge in the scientific literature. This vignette introduces the LBDiscover package, which provides tools for automated literature-based discovery by analyzing biomedical publications.

Installation

You can install the package from CRAN:

install.packages("LBDiscover")

You can install the development version of LBDiscover from GitHub:

# install.packages("devtools")
devtools::install_github("chaoliu-cl/LBDiscover")

Basic Workflow

The typical workflow for literature-based discovery with this package consists of:

Retrieving publications from PubMed or other sources
Preprocessing text data
Extracting biomedical entities
Creating a co-occurrence matrix
Applying discovery models
Visualizing and evaluating results

Let’s walk through a comprehensive example exploring connections in migraine research.

Example: Exploring Migraine Research

In this example, we’ll explore potential discoveries in migraine research by applying the improved ABC model approach with various utility functions.

1. Load the package

library(LBDiscover)
#> Loading LBDiscover package

2. Define the primary term of interest

# Define the primary term of interest for our analysis
primary_term <- "migraine"

3. Retrieve publications

We’ll search for articles about migraine pathophysiology and treatment.

# Search for migraine-related articles
migraine_articles <- pubmed_search(
  query = paste0(primary_term, " pathophysiology"),
  max_results = 1000
)

# Search for treatment-related articles
drug_articles <- pubmed_search(
  query = "neurological drugs pain treatment OR migraine therapy OR headache medication",
  max_results = 1000
)

# Combine and remove duplicates
all_articles <- merge_results(migraine_articles, drug_articles)
cat("Retrieved", nrow(all_articles), "unique articles\n")
#> Retrieved 1925 unique articles

4. Extract variations of the primary term

# Extract variations of our primary term using the utility function
primary_term_variations <- get_term_vars(all_articles, primary_term)
cat("Found", length(primary_term_variations), "variations of", primary_term, "in the corpus:\n")
#> Found 13 variations of migraine in the corpus:
print(head(primary_term_variations, 10))
#>  [1] "migraine"    "Migraine"    "MigrainE"    "migraines"   "Migraines"  
#>  [6] "migraineur"  "ofmigraine"  "Migraineux"  "migraineurs" "Migraineurs"

5. Preprocess text data

# Preprocess text
preprocessed_articles <- preprocess_text(
  all_articles,
  text_column = "abstract",
  remove_stopwords = TRUE,
  min_word_length = 2  # Set min_word_length to capture short terms
)
#> Tokenizing text...

6. Create a custom dictionary

# Create a custom dictionary with all variations of our primary term
custom_dictionary <- data.frame(
  term = c(primary_term, primary_term_variations),
  type = rep("disease", length(primary_term_variations) + 1),
  id = paste0("CUSTOM_", 1:(length(primary_term_variations) + 1)),
  source = rep("custom", length(primary_term_variations) + 1),
  stringsAsFactors = FALSE
)

# Define additional MeSH queries for extended dictionaries
mesh_queries <- list(
  "disease" = paste0(primary_term, " disorders[MeSH] OR headache disorders[MeSH]"),
  "protein" = "receptors[MeSH] OR ion channels[MeSH]",
  "chemical" = "neurotransmitters[MeSH] OR vasoactive agents[MeSH]",
  "pathway" = "signal transduction[MeSH] OR pain[MeSH]",
  "drug" = "analgesics[MeSH] OR serotonin agonists[MeSH] OR anticonvulsants[MeSH]",
  "gene" = "genes[MeSH] OR channelopathy[MeSH]"
)

# Sanitize the custom dictionary
custom_dictionary <- sanitize_dictionary(
  custom_dictionary,
  term_column = "term",
  type_column = "type",
  validate_types = FALSE  # Don't validate custom terms as they're trusted
)
#> Sanitizing dictionary with 14 terms...
#> Sanitization complete. 14 terms remaining (100% of original)

7. Extract biomedical entities

# Extract entities using our custom dictionary
custom_entities <- extract_entities(
  preprocessed_articles,
  text_column = "abstract",
  dictionary = custom_dictionary,
  case_sensitive = FALSE,
  overlap_strategy = "priority",
  sanitize_dict = FALSE  # Already sanitized
)

# Extract entities using the standard workflow with improved entity validation
# Check if running in R CMD check environment
is_check <- !interactive() && 
            (!is.null(Sys.getenv("R_CHECK_RUNNING")) && 
             Sys.getenv("R_CHECK_RUNNING") == "true")
             
# More robust check for testing environment
if (!is_check && !is.null(Sys.getenv("_R_CHECK_LIMIT_CORES_"))) {
  is_check <- TRUE
}

# Set number of cores based on environment
num_cores_to_use <- if(is_check) 1 else 4

standard_entities <- extract_entities_workflow(
  preprocessed_articles,
  text_column = "abstract",
  entity_types = c("disease", "drug", "gene"),
  parallel = !is_check,           # Disable parallel in check environment
  num_cores = num_cores_to_use,   # Use 1 core in check environment
  batch_size = 500                # Process 500 documents per batch
)
#> Warning in extract_entities_workflow(preprocessed_articles, text_column =
#> "abstract", : UMLS source requested but no API key provided. Skipping UMLS.

# Uncomment to include UMLS entities
# standard_entities <- extract_entities_workflow(
#   preprocessed_articles,
#   text_column = "abstract",
#   entity_types = c("disease", "drug", "gene", "protein", "pathway", "chemical"),
#   dictionary_sources = c("local", "mesh", "umls"),  # Including UMLS
#   additional_mesh_queries = mesh_queries,
#   sanitize = TRUE,
#   api_key = "your-umls-api-key",  # Your UMLS API key here
#   parallel = TRUE,
#   num_cores = 4
# )

# Combine entity datasets using our utility function
entities <- merge_entities(
  custom_entities,
  standard_entities,
  primary_term
)
#> Combined 6383 custom entities with 8929 standard entities.

# Filter entities to ensure only relevant biomedical terms are included
filtered_entities <- valid_entities(
  entities,
  primary_term,
  primary_term_variations,
  validation_function = is_valid_biomedical_entity
)
#> Filtered from 9338 to 9338 validated entities

# View the first few extracted entities
head(filtered_entities)
#>         entity entity_type doc_id start_pos end_pos
#> 1 antimigraine     disease    251       368     379
#> 2 antimigraine     disease    251       667     678
#> 3 antimigraine     disease    292      1188    1199
#> 4 antimigraine     disease    292       876     887
#> 5 antimigraine     disease    292       632     643
#> 6 antimigraine     disease    388       670     681
#>                                                                                                                                                                                                                                                                                                                      sentence
#> 1                                                                 However, the administration of antimigraine drugs in conventional oral pharmaceutical dosage forms is a challenge, since many molecules have difficulty crossing the blood-brain barrier (BBB) to reach the brain, which leads to bioavailability problems.
#> 2                                                                                                                                                                                                                           Efforts have been made to find alternative delivery systems and/or routes for antimigraine drugs.
#> 3                   The existence of patients with medication-resistant migraine may be due to the: (i) complex migraine pathophysiology, in which several systems appear to be deregulated before, during, and after a migraine attack; and (ii) pharmacodynamic and pharmacokinetic properties of antimigraine medications.
#> 4                                                                                                                                                          EXPERT OPINION: Current anti-CGRPergic medications, although effective, have limitations, such as side effects and lack of antimigraine efficacy in some patients.
#> 5 AREAS COVERED: By searching multiple electronic scientific databases, this narrative review examined: (i) the role of CGRP in migraine; and (ii) the current knowledge on the effects of CGRPergic antimigraine pharmacotherapies, including a brief analysis of their pharmacodynamic and pharmacokinetic characteristics.
#> 6                                                                                                                                    Behavioral analysis, antioxidant assay, immunohistochemistry (IHC), histopathological examination, ELISA, and RT-PCR were conducted to evaluate the antimigraine potential of genistein.
#>   frequency
#> 1        13
#> 2        13
#> 3        13
#> 4        13
#> 5        13
#> 6        13

8. Create co-occurrence matrix

# Create co-occurrence matrix with validated entities
co_matrix <- create_comat(
  filtered_entities,
  doc_id_col = "doc_id",
  entity_col = "entity",
  type_col = "entity_type",
  normalize = TRUE,
  normalization_method = "cosine"
)
#> Building entity-document matrix...
#> Calculating co-occurrence matrix...
#> Normalizing co-occurrence matrix using cosine method...

# Find our primary term in the co-occurrence matrix
a_term <- find_term(co_matrix, primary_term)
#> Found primary term in co-occurrence matrix

# Check matrix dimensions
dim(co_matrix)
#> [1] 19 19

9. Apply the improved ABC model

# Apply the improved ABC model with enhanced term filtering and type validation
abc_results <- abc_model(
  co_matrix,
  a_term = a_term,
  c_term = NULL,  # Allow all potential C terms
  min_score = 0.001,  # Lower threshold to capture more potential connections
  n_results = 500,    # Increase to get more candidates before filtering
  scoring_method = "combined",
  # Focus on biomedically relevant entity types
  b_term_types = c("protein", "gene", "pathway", "chemical"),
  c_term_types = c("drug", "chemical", "protein", "gene"),
  exclude_general_terms = TRUE,  # Enable enhanced term filtering
  filter_similar_terms = TRUE,   # Remove terms too similar to migraine
  similarity_threshold = 0.7,    # Relatively strict similarity threshold
  enforce_strict_typing = TRUE   # Enable strict entity type validation
)
#> Filtered 8 B terms (61.5%) that weren't valid biomedical entities
#> Filtered 5 B terms that didn't match specified entity types: protein, gene, pathway, chemical
#> No suitable B terms found with association score > 0.001 after filtering

# If we don't have enough results, try with less stringent criteria
min_desired_results <- 10
if (nrow(abc_results) < min_desired_results) {
  cat("Not enough results with strict filtering. Trying with less stringent criteria...\n")
  abc_results <- abc_model(
    co_matrix,
    a_term = a_term,
    c_term = NULL,
    min_score = 0.0005,  # Even lower threshold
    n_results = 500,
    scoring_method = "combined",
    b_term_types = NULL,  # No type constraints
    c_term_types = NULL,  # No type constraints
    exclude_general_terms = TRUE,
    filter_similar_terms = TRUE,
    similarity_threshold = 0.8,    # More lenient similarity threshold
    enforce_strict_typing = FALSE  # Disable strict type validation as fallback
  )
}
#> Not enough results with strict filtering. Trying with less stringent criteria...
#> Filtered 8 B terms (61.5%) that weren't valid biomedical entities
#> Filtered out 0 B terms that were too similar to A term (similarity threshold: 0.8)
#> Filtered 8 potential C terms that weren't valid biomedical entities
#> Identifying potential C terms via 5 B terms...
#>   |                                                                              |                                                                      |   0%  |                                                                              |==============                                                        |  20%  |                                                                              |============================                                          |  40%  |                                                                              |==========================================                            |  60%  |                                                                              |========================================================              |  80%  |                                                                              |======================================================================| 100%

# View top results
head(abc_results[, c("a_term", "b_term", "c_term", "abc_score", "b_type", "c_type")])
#>               a_term     b_term       c_term  abc_score  b_type  c_type
#> thrombosis1 migraine thrombosis      heparin 0.05907662 disease    drug
#> headache2   migraine   headache      heparin 0.05746834 symptom    drug
#> headache1   migraine   headache   enoxaparin 0.05632963 symptom    drug
#> headache3   migraine   headache  myocarditis 0.05515907 symptom disease
#> headache    migraine   headache azithromycin 0.05207699 symptom    drug
#> headache4   migraine   headache  tocilizumab 0.05207699 symptom    drug

10. Apply statistical validation to the results

# Apply statistical validation to the results
validated_results <- tryCatch({
  validate_abc(
    abc_results,
    co_matrix,
    alpha = 0.1,  # More lenient significance threshold
    correction = "BH",  # Benjamini-Hochberg correction for multiple testing
    filter_by_significance = FALSE  # Keep all results but mark significant ones
  )
}, error = function(e) {
  cat("Error in statistical validation:", e$message, "\n")
  cat("Using original results without validation...\n")
  # Add dummy p-values based on ABC scores
  abc_results$p_value <- 1 - abc_results$abc_score / max(abc_results$abc_score, na.rm = TRUE)
  abc_results$significant <- abc_results$p_value < 0.1
  return(abc_results)
})
#> Using optimized approach for large matrix validation...
#> Using metadata for document count: 1692
#> Calculating statistical significance using hypergeometric test...
#> 0.0% of connections are statistically significant (p < 0.10, BH correction)

# Sort by ABC score and take top results
validated_results <- validated_results[order(-validated_results$abc_score), ]
top_n <- min(100, nrow(validated_results))  # Larger top N for diversification
top_results <- head(validated_results, top_n)

# View top validated results
head(top_results[, c("a_term", "b_term", "c_term", "abc_score", "p_value", "significant")])
#>               a_term     b_term       c_term  abc_score p_value significant
#> thrombosis1 migraine thrombosis      heparin 0.05907662       1       FALSE
#> headache2   migraine   headache      heparin 0.05746834       1       FALSE
#> headache1   migraine   headache   enoxaparin 0.05632963       1       FALSE
#> headache3   migraine   headache  myocarditis 0.05515907       1       FALSE
#> headache    migraine   headache azithromycin 0.05207699       1       FALSE
#> headache4   migraine   headache  tocilizumab 0.05207699       1       FALSE

11. Diversify and ensure minimum results

# Diversify results using our utility function
diverse_results <- safe_diversify(
  top_results,
  diversity_method = "both",
  max_per_group = 5,
  min_score = 0.0001,
  min_results = 5
)

# Ensure we have enough results for visualization
diverse_results <- min_results(
  diverse_results,
  top_results,
  a_term,
  min_results = 3
)

12. Visualize the results

# Create heatmap visualization
plot_heatmap(
  diverse_results,
  output_file = "migraine_heatmap.png",
  width = 1200,
  height = 900,
  top_n = 15,
  min_score = 0.0001,
  color_palette = "blues",
  show_entity_types = TRUE
)
#> No connections are statistically significant (p < 0.05)
#> Created heatmap visualization: migraine_heatmap.png

# Create network visualization
plot_network(
  diverse_results,
  output_file = "migraine_network.png",
  width = 1200,
  height = 900,
  top_n = 15,
  min_score = 0.0001,
  node_size_factor = 5,
  color_by = "type",
  title = "Migraine Treatment Network",
  show_entity_types = TRUE,
  label_size = 1.0
)
#> Created network visualization: migraine_network.png

13. Create interactive visualizations

# Create interactive HTML network visualization
export_network(
  diverse_results,
  output_file = "migraine_network.html",
  top_n = min(30, nrow(diverse_results)),
  min_score = 0.0001,
  open = FALSE  # Don't automatically open in browser
)

# Create interactive chord diagram
export_chord(
  diverse_results,
  output_file = "migraine_chord.html",
  top_n = min(30, nrow(diverse_results)),
  min_score = 0.0001,
  open = FALSE
)
#> Number of unique terms: 9
#> First few terms: migraine, thrombosis, headache, fatigue, heparin
#> Role assignments: A=1, B=3, C=5

14. Evaluate literature support and generate report

# Evaluate literature support for top connections
evaluation <- eval_evidence(
  diverse_results,
  max_results = 5,
  base_term = "migraine",
  max_articles = 5
)
#> 
#> === Evaluation of Top Results ===
#> 
#> Evaluating potential treatment: heparin  (drug) 
#> ABC score: 0.0591 
#> P-value: 1 - Not statistically significant 
#> Connection through intermediary: thrombosis
#> Found 4 articles directly linking migraine and heparin 
#> Most recent article: Stroke Following Blunt Head Trauma: A Case Report and Review of the Literature. 
#> 
#> Evaluating potential treatment: heparin  (drug) 
#> ABC score: 0.0575 
#> P-value: 1 - Not statistically significant 
#> Connection through intermediary: headache
#> Found 4 articles directly linking migraine and heparin 
#> Most recent article: Stroke Following Blunt Head Trauma: A Case Report and Review of the Literature. 
#> 
#> Evaluating potential treatment: enoxaparin  (drug) 
#> ABC score: 0.0563 
#> P-value: 1 - Not statistically significant 
#> Connection through intermediary: headache
#> Found 5 articles directly linking migraine and enoxaparin 
#> Most recent article: Cerebral Venous Sinus Thrombosis Following Varicella Infection: A Case Report. 
#> 
#> Evaluating potential treatment: myocarditis  (disease) 
#> ABC score: 0.0552 
#> P-value: 1 - Not statistically significant 
#> Connection through intermediary: headache
#> Found 4 articles directly linking migraine and myocarditis 
#> Most recent article: Brain white matter hyperintensities in Kawasaki disease: A case-control study. 
#> 
#> Evaluating potential treatment: azithromycin  (drug) 
#> ABC score: 0.0521 
#> P-value: 1 - Not statistically significant 
#> Connection through intermediary: headache
#> Found 1 articles directly linking migraine and azithromycin 
#> Most recent article: A comparison of ciprofloxacin, norfloxacin, ofloxacin, azithromycin and cefixime examined by observational cohort studies.

# Prepare articles for report generation
articles_with_years <- prep_articles(all_articles)
#> Found 1925 articles with valid publication years

# Store results for report
results_list <- list(abc = diverse_results)

# Store visualization paths
visualizations <- list(
  heatmap = "migraine_heatmap.png",
  network = "migraine_network.html",
  chord = "migraine_chord.html"
)

# Create comprehensive report
gen_report(
  results_list = results_list,
  visualizations = visualizations,
  articles = articles_with_years,
  output_file = "migraine_discoveries.html"
)
#> Generated comprehensive report: migraine_discoveries.html

cat("\nDiscovery analysis complete!\n")
#> 
#> Discovery analysis complete!

Interactive Visualizations and Report

The LBDiscover package generates interactive visualizations and a comprehensive report. Below you can see the embedded report with interactive visualizations.

Embedding HTML Report in the Vignette

Direct Links to Interactive Visualizations

The interactive visualizations are embedded below:

Network Visualization

Chord Diagram

Advanced Features

The LBDiscover package offers several advanced features that we’ve demonstrated in this example:

Term variation detection: Automatically finding different forms of your primary term of interest
Custom dictionary integration: Creating and using custom dictionaries alongside standard ones
Entity validation: Filtering entities to ensure biomedical relevance
Improved ABC model: Enhanced scoring methods and filtering options
Statistical validation: Applying rigorous statistical tests to potential discoveries
Result diversification: Ensuring a diverse set of discovery candidates
Interactive visualizations: Creating dynamic network and chord diagrams
Evidence evaluation: Assessing the literature support for discoveries
Comprehensive reporting: Generating detailed HTML reports of findings

Conclusion

This vignette has demonstrated a comprehensive workflow for literature-based discovery using the LBDiscover package. The improved ABC model and additional utility functions provide a robust framework for identifying potential novel connections in biomedical literature. Users can explore examples that are included in the inst\examples folder.

Chao Liu

2025-05-15