Skip to contents

Introduction

Literature-based discovery (LBD) is a powerful approach to identifying hidden connections between existing knowledge in the scientific literature. This vignette introduces the LBDiscover package, which provides tools for automated literature-based discovery by analyzing biomedical publications.

Installation

You can install the package from CRAN:

install.packages("LBDiscover")

You can install the development version of LBDiscover from GitHub:

# install.packages("devtools")
devtools::install_github("chaoliu-cl/LBDiscover")

Basic Workflow

The typical workflow for literature-based discovery with this package consists of:

  1. Retrieving publications from PubMed or other sources
  2. Preprocessing text data
  3. Extracting biomedical entities
  4. Creating a co-occurrence matrix
  5. Applying discovery models
  6. Visualizing and evaluating results

Let’s walk through a comprehensive example exploring connections in migraine research.

Example: Exploring Migraine Research

In this example, we’ll explore potential discoveries in migraine research by applying the improved ABC model approach with various utility functions.

1. Load the package

library(LBDiscover)
#> Loading LBDiscover package

2. Define the primary term of interest

# Define the primary term of interest for our analysis
primary_term <- "migraine"

3. Retrieve publications

We’ll search for articles about migraine pathophysiology and treatment.

# Search for migraine-related articles
migraine_articles <- pubmed_search(
  query = paste0(primary_term, " pathophysiology"),
  max_results = 1000
)

# Search for treatment-related articles
drug_articles <- pubmed_search(
  query = "neurological drugs pain treatment OR migraine therapy OR headache medication",
  max_results = 1000
)

# Combine and remove duplicates
all_articles <- merge_results(migraine_articles, drug_articles)
cat("Retrieved", nrow(all_articles), "unique articles\n")
#> Retrieved 1935 unique articles

4. Extract variations of the primary term

# Extract variations of our primary term using the utility function
primary_term_variations <- get_term_vars(all_articles, primary_term)
cat("Found", length(primary_term_variations), "variations of", primary_term, "in the corpus:\n")
#> Found 15 variations of migraine in the corpus:
print(head(primary_term_variations, 10))
#>  [1] "migraine"    "Migraine"    "MigrainE"    "MIGRAINE"    "Migraines"  
#>  [6] "migraines"   "migraineur"  "ofmigraine"  "migraineurs" "Migraineurs"

5. Preprocess text data

# Preprocess text
preprocessed_articles <- preprocess_text(
  all_articles,
  text_column = "abstract",
  remove_stopwords = TRUE,
  min_word_length = 2  # Set min_word_length to capture short terms
)
#> Tokenizing text...

6. Create a custom dictionary

# Create a custom dictionary with all variations of our primary term
custom_dictionary <- data.frame(
  term = c(primary_term, primary_term_variations),
  type = rep("disease", length(primary_term_variations) + 1),
  id = paste0("CUSTOM_", 1:(length(primary_term_variations) + 1)),
  source = rep("custom", length(primary_term_variations) + 1),
  stringsAsFactors = FALSE
)

# Define additional MeSH queries for extended dictionaries
mesh_queries <- list(
  "disease" = paste0(primary_term, " disorders[MeSH] OR headache disorders[MeSH]"),
  "protein" = "receptors[MeSH] OR ion channels[MeSH]",
  "chemical" = "neurotransmitters[MeSH] OR vasoactive agents[MeSH]",
  "pathway" = "signal transduction[MeSH] OR pain[MeSH]",
  "drug" = "analgesics[MeSH] OR serotonin agonists[MeSH] OR anticonvulsants[MeSH]",
  "gene" = "genes[MeSH] OR channelopathy[MeSH]"
)

# Sanitize the custom dictionary
custom_dictionary <- sanitize_dictionary(
  custom_dictionary,
  term_column = "term",
  type_column = "type",
  validate_types = FALSE  # Don't validate custom terms as they're trusted
)
#> Sanitizing dictionary with 16 terms...
#> Sanitization complete. 16 terms remaining (100% of original)

7. Extract biomedical entities

# Extract entities using our custom dictionary
custom_entities <- extract_entities(
  preprocessed_articles,
  text_column = "abstract",
  dictionary = custom_dictionary,
  case_sensitive = FALSE,
  overlap_strategy = "priority",
  sanitize_dict = FALSE  # Already sanitized
)

# Extract entities using the standard workflow with improved entity validation
# Check if running in R CMD check environment
is_check <- !interactive() && 
            (!is.null(Sys.getenv("R_CHECK_RUNNING")) && 
             Sys.getenv("R_CHECK_RUNNING") == "true")
             
# More robust check for testing environment
if (!is_check && !is.null(Sys.getenv("_R_CHECK_LIMIT_CORES_"))) {
  is_check <- TRUE
}

# Set number of cores based on environment
num_cores_to_use <- if(is_check) 1 else 4

standard_entities <- extract_entities_workflow(
  preprocessed_articles,
  text_column = "abstract",
  entity_types = c("disease", "drug", "gene"),
  parallel = !is_check,           # Disable parallel in check environment
  num_cores = num_cores_to_use,   # Use 1 core in check environment
  batch_size = 500                # Process 500 documents per batch
)
#> Warning in extract_entities_workflow(preprocessed_articles, text_column =
#> "abstract", : UMLS source requested but no API key provided. Skipping UMLS.

# Uncomment to include UMLS entities
# standard_entities <- extract_entities_workflow(
#   preprocessed_articles,
#   text_column = "abstract",
#   entity_types = c("disease", "drug", "gene", "protein", "pathway", "chemical"),
#   dictionary_sources = c("local", "mesh", "umls"),  # Including UMLS
#   additional_mesh_queries = mesh_queries,
#   sanitize = TRUE,
#   api_key = "your-umls-api-key",  # Your UMLS API key here
#   parallel = TRUE,
#   num_cores = 4
# )

# Combine entity datasets using our utility function
entities <- merge_entities(
  custom_entities,
  standard_entities,
  primary_term
)
#> Combined 6296 custom entities with 5999 standard entities.

# Filter entities to ensure only relevant biomedical terms are included
filtered_entities <- valid_entities(
  entities,
  primary_term,
  primary_term_variations,
  validation_function = is_valid_biomedical_entity
)
#> Filtered from 6383 to 6383 validated entities

# View the first few extracted entities
head(filtered_entities)
#>         entity entity_type doc_id start_pos end_pos
#> 1 antimigraine     disease    148      1488    1499
#> 2 antimigraine     disease    155      1671    1682
#> 3 antimigraine     disease    452       667     678
#> 4 antimigraine     disease    452       368     379
#> 5 antimigraine     disease    494      1188    1199
#> 6 antimigraine     disease    494       632     643
#>                                                                                                                                                                                                                                                                                                                      sentence
#> 1                                                                                                           CONCLUSION AND IMPLICATIONS: DOP-mediated antimigraine effect is stronger in female than male rats and appears to be mediated in chronic conditions through peripheral DOPs in females and central DOPs in males.
#> 2                                                                                                                                                      Our results suggest that the antagonism on the alpha-CGRP isoform plays a relevant role in the antimigraine action of mABs but not in the development of constipation.
#> 3                                                                                                                                                                                                                           Efforts have been made to find alternative delivery systems and/or routes for antimigraine drugs.
#> 4                                                                 However, the administration of antimigraine drugs in conventional oral pharmaceutical dosage forms is a challenge, since many molecules have difficulty crossing the blood-brain barrier (BBB) to reach the brain, which leads to bioavailability problems.
#> 5                   The existence of patients with medication-resistant migraine may be due to the: (i) complex migraine pathophysiology, in which several systems appear to be deregulated before, during, and after a migraine attack; and (ii) pharmacodynamic and pharmacokinetic properties of antimigraine medications.
#> 6 AREAS COVERED: By searching multiple electronic scientific databases, this narrative review examined: (i) the role of CGRP in migraine; and (ii) the current knowledge on the effects of CGRPergic antimigraine pharmacotherapies, including a brief analysis of their pharmacodynamic and pharmacokinetic characteristics.
#>   frequency
#> 1        22
#> 2        22
#> 3        22
#> 4        22
#> 5        22
#> 6        22

8. Create co-occurrence matrix

# Create co-occurrence matrix with validated entities
co_matrix <- create_comat(
  filtered_entities,
  doc_id_col = "doc_id",
  entity_col = "entity",
  type_col = "entity_type",
  normalize = TRUE,
  normalization_method = "cosine"
)
#> Building entity-document matrix...
#> Calculating co-occurrence matrix...
#> Normalizing co-occurrence matrix using cosine method...

# Find our primary term in the co-occurrence matrix
a_term <- find_term(co_matrix, primary_term)
#> Found primary term in co-occurrence matrix

# Check matrix dimensions
dim(co_matrix)
#> [1] 11 11

9. Apply the improved ABC model

# Apply the improved ABC model with enhanced term filtering and type validation
abc_results <- abc_model(
  co_matrix,
  a_term = a_term,
  c_term = NULL,  # Allow all potential C terms
  min_score = 0.001,  # Lower threshold to capture more potential connections
  n_results = 500,    # Increase to get more candidates before filtering
  scoring_method = "combined",
  # Focus on biomedically relevant entity types
  b_term_types = c("protein", "gene", "pathway", "chemical"),
  c_term_types = c("drug", "chemical", "protein", "gene"),
  exclude_general_terms = TRUE,  # Enable enhanced term filtering
  filter_similar_terms = TRUE,   # Remove terms too similar to migraine
  similarity_threshold = 0.7,    # Relatively strict similarity threshold
  enforce_strict_typing = TRUE   # Enable strict entity type validation
)
#> Filtered 8 B terms (100%) that weren't valid biomedical entities
#> No suitable B terms found with association score > 0.001 after filtering

# If we don't have enough results, try with less stringent criteria
min_desired_results <- 10
if (nrow(abc_results) < min_desired_results) {
  cat("Not enough results with strict filtering. Trying with less stringent criteria...\n")
  abc_results <- abc_model(
    co_matrix,
    a_term = a_term,
    c_term = NULL,
    min_score = 0.0005,  # Even lower threshold
    n_results = 500,
    scoring_method = "combined",
    b_term_types = NULL,  # No type constraints
    c_term_types = NULL,  # No type constraints
    exclude_general_terms = TRUE,
    filter_similar_terms = TRUE,
    similarity_threshold = 0.8,    # More lenient similarity threshold
    enforce_strict_typing = FALSE  # Disable strict type validation as fallback
  )
}
#> Not enough results with strict filtering. Trying with less stringent criteria...
#> Filtered 8 B terms (88.9%) that weren't valid biomedical entities
#> Filtered out 0 B terms that were too similar to A term (similarity threshold: 0.8)
#> Filtered 9 potential C terms that weren't valid biomedical entities
#> Identifying potential C terms via 1 B terms...
#>   |                                                                              |                                                                      |   0%  |                                                                              |======================================================================| 100%
#> No ABC connections found

# Check if we have results and what columns are available
if (nrow(abc_results) > 0) {
  cat("Available columns in abc_results:", paste(colnames(abc_results), collapse = ", "), "\n")
  
  # Safely display results with available columns
  display_cols <- intersect(c("a_term", "b_term", "c_term", "abc_score", "b_type", "c_type"), 
                           colnames(abc_results))
  
  if (length(display_cols) > 0) {
    # View top results
    head(abc_results[, display_cols, drop = FALSE])
  } else {
    # If specific columns aren't available, show first few columns
    head(abc_results[, 1:min(6, ncol(abc_results)), drop = FALSE])
  }
} else {
  cat("No ABC results found\n")
}
#> No ABC results found

10. Apply statistical validation to the results

# Apply statistical validation to the results
validated_results <- tryCatch({
  validate_abc(
    abc_results,
    co_matrix,
    alpha = 0.1,  # More lenient significance threshold
    correction = "BH",  # Benjamini-Hochberg correction for multiple testing
    filter_by_significance = FALSE  # Keep all results but mark significant ones
  )
}, error = function(e) {
  cat("Error in statistical validation:", e$message, "\n")
  cat("Using original results without validation...\n")
  # Add dummy p-values based on ABC scores
  abc_results$p_value <- 1 - abc_results$abc_score / max(abc_results$abc_score, na.rm = TRUE)
  abc_results$significant <- abc_results$p_value < 0.1
  return(abc_results)
})
#> ABC results are empty

# Sort by ABC score and take top results
validated_results <- validated_results[order(-validated_results$abc_score), ]
top_n <- min(100, nrow(validated_results))  # Larger top N for diversification
top_results <- head(validated_results, top_n)

# Check available columns and display safely
if (nrow(top_results) > 0) {
  display_cols <- intersect(c("a_term", "b_term", "c_term", "abc_score", "p_value", "significant"), 
                           colnames(top_results))
  
  if (length(display_cols) > 0) {
    # View top validated results
    head(top_results[, display_cols, drop = FALSE])
  } else {
    # Show available columns
    head(top_results[, 1:min(6, ncol(top_results)), drop = FALSE])
  }
} else {
  cat("No validated results available\n")
}
#> No validated results available

11. Diversify and ensure minimum results

# Diversify results using our utility function
diverse_results <- safe_diversify(
  top_results,
  diversity_method = "both",
  max_per_group = 5,
  min_score = 0.0001,
  min_results = 5
)
#> ABC results are empty

# Ensure we have enough results for visualization
diverse_results <- min_results(
  diverse_results,
  top_results,
  a_term,
  min_results = 3
)
#> No results found. Creating a placeholder result for demonstration.

12. Visualize the results

# Create heatmap visualization
plot_heatmap(
  diverse_results,
  output_file = "migraine_heatmap.png",
  width = 1200,
  height = 900,
  top_n = 15,
  min_score = 0.0001,
  color_palette = "blues",
  show_entity_types = TRUE
)
#> Warning in vis_heatmap(results, top_n = min(top_n, nrow(results)), min_score =
#> min_score, : Entity types not found in results. Setting show_entity_types =
#> FALSE
#> No connections are statistically significant (p < 0.05)
#> Created heatmap visualization: migraine_heatmap.png

# Create network visualization
plot_network(
  diverse_results,
  output_file = "migraine_network.png",
  width = 1200,
  height = 900,
  top_n = 15,
  min_score = 0.0001,
  node_size_factor = 5,
  color_by = "type",
  title = "Migraine Treatment Network",
  show_entity_types = TRUE,
  label_size = 1.0
)
#> Warning in vis_network(results, top_n = min(top_n, nrow(results)), min_score =
#> min_score, : Entity types not found in results. Setting show_entity_types =
#> FALSE
#> Created network visualization: migraine_network.png

13. Create interactive visualizations

# Create interactive HTML network visualization
export_network(
  diverse_results,
  output_file = "migraine_network.html",
  top_n = min(30, nrow(diverse_results)),
  min_score = 0.0001,
  open = FALSE  # Don't automatically open in browser
)

# Create interactive chord diagram
export_chord(
  diverse_results,
  output_file = "migraine_chord.html",
  top_n = min(30, nrow(diverse_results)),
  min_score = 0.0001,
  open = FALSE
)
#> Number of unique terms: 7
#> First few terms: migraine, serotonin, CGRP, cortical spreading depression, sumatriptan
#> Role assignments: A=1, B=3, C=3

14. Evaluate literature support and generate report

# Evaluate literature support for top connections
evaluation <- eval_evidence(
  diverse_results,
  max_results = 5,
  base_term = "migraine",
  max_articles = 5
)
#> 
#> === Evaluation of Top Results ===
#> 
#> Evaluating potential treatment: sumatriptan  
#> ABC score: 0.04 
#> P-value: 0.1 - Not statistically significant 
#> Connection through intermediary: serotonin
#> Found 5 articles directly linking migraine and sumatriptan 
#> Most recent article: Current Trends in Management of Migraine: A Review of Current Practice and Recent Advances. 
#> 
#> Evaluating potential treatment: topiramate  
#> ABC score: 0.03 
#> P-value: 0.2 - Not statistically significant 
#> Connection through intermediary: CGRP
#> Found 5 articles directly linking migraine and topiramate 
#> Most recent article: [Migraine headaches: What is new in treatment?]. 
#> 
#> Evaluating potential treatment: propranolol  
#> ABC score: 0.02 
#> P-value: 0.3 - Not statistically significant 
#> Connection through intermediary: cortical spreading depression
#> Found 5 articles directly linking migraine and propranolol 
#> Most recent article: [Migraine headaches: What is new in treatment?].

# Prepare articles for report generation
articles_with_years <- prep_articles(all_articles)
#> Found 1935 articles with valid publication years

# Store results for report
results_list <- list(abc = diverse_results)

# Store visualization paths
visualizations <- list(
  heatmap = "migraine_heatmap.png",
  network = "migraine_network.html",
  chord = "migraine_chord.html"
)

# Create comprehensive report
gen_report(
  results_list = results_list,
  visualizations = visualizations,
  articles = articles_with_years,
  output_file = "migraine_discoveries.html"
)
#> Generated comprehensive report: migraine_discoveries.html

cat("\nDiscovery analysis complete!\n")
#> 
#> Discovery analysis complete!

Interactive Visualizations and Report

The LBDiscover package generates interactive visualizations and a comprehensive report. Below you can see the embedded report with interactive visualizations.

Embedding HTML Report in the Vignette

The interactive visualizations are embedded below:

Network Visualization
Chord Diagram

Advanced Features

The LBDiscover package offers several advanced features that we’ve demonstrated in this example:

  1. Term variation detection: Automatically finding different forms of your primary term of interest
  2. Custom dictionary integration: Creating and using custom dictionaries alongside standard ones
  3. Entity validation: Filtering entities to ensure biomedical relevance
  4. Improved ABC model: Enhanced scoring methods and filtering options
  5. Statistical validation: Applying rigorous statistical tests to potential discoveries
  6. Result diversification: Ensuring a diverse set of discovery candidates
  7. Interactive visualizations: Creating dynamic network and chord diagrams
  8. Evidence evaluation: Assessing the literature support for discoveries
  9. Comprehensive reporting: Generating detailed HTML reports of findings

Conclusion

This vignette has demonstrated a comprehensive workflow for literature-based discovery using the LBDiscover package. The improved ABC model and additional utility functions provide a robust framework for identifying potential novel connections in biomedical literature. Users can explore examples that are included in the inst\examples folder.