Real PubMed Search Analysis with searchAnalyzeR
A Comprehensive Example Using COVID-19 Long-term Effects
searchAnalyzeR Development Team
2025-11-03
Source: vignettes/Real_PubMed_Search_Analysis_with_searchAnalyzeR.Rmd
Introduction
This vignette demonstrates how to use the searchAnalyzeR
package to conduct a comprehensive analysis of systematic review search
strategies using real PubMed data. We’ll walk through a complete
workflow for analyzing search performance, from executing searches to
generating publication-ready reports.
The searchAnalyzeR package provides tools for:
- Search execution and standardization across multiple databases
- Duplicate detection and removal using sophisticated algorithms
- Performance metric calculation including precision, recall, and F1 scores
- Visualization generation for search strategy assessment
- Export capabilities in multiple formats (CSV, Excel, RIS)
- PRISMA diagram creation for systematic review reporting
- Term effectiveness analysis to optimize search strategies
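The performance metrics above follow the standard information-retrieval definitions. A minimal sketch with made-up counts (not from any real search) shows how they relate:

```r
# Standard definitions, using illustrative counts only:
tp <- 80; fp <- 70; fn <- 0  # true positives, false positives, false negatives
precision <- tp / (tp + fp)  # fraction of retrieved records that are relevant
recall    <- tp / (tp + fn)  # fraction of relevant records that were retrieved
f1        <- 2 * precision * recall / (precision + recall)  # harmonic mean
nnr       <- 1 / precision   # "number needed to read" per relevant record
round(c(precision = precision, recall = recall, f1 = f1, nnr = nnr), 3)
```

The same arithmetic underlies the metrics reported later in this vignette.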
Case Study: COVID-19 Long-term Effects
For this demonstration, we’ll analyze a search strategy designed to identify literature on the long-term effects of COVID-19, commonly known as “Long COVID.” This topic represents a rapidly evolving area of research that presents typical challenges faced in systematic reviews.
Getting Started
Required Packages
First, let’s load the required packages for our analysis:
# Load required packages
library(searchAnalyzeR)
library(rentrez) # For PubMed API access
library(xml2) # For XML parsing
library(dplyr)
library(ggplot2)
library(lubridate)
cat("=== searchAnalyzeR: Real PubMed Search Example ===\n")
#> === searchAnalyzeR: Real PubMed Search Example ===
cat("Topic: Long-term effects of COVID-19 (Long COVID)\n")
#> Topic: Long-term effects of COVID-19 (Long COVID)
cat("Objective: Demonstrate search strategy analysis with real data\n\n")
#> Objective: Demonstrate search strategy analysis with real data
Defining the Search Strategy
A well-defined search strategy is crucial for systematic reviews. Here we define our search parameters including terms, databases, date ranges, and filters:
# Define our search strategy
search_strategy <- list(
terms = c(
"long covid",
"post-covid syndrome",
"covid-19 sequelae",
"post-acute covid-19",
"persistent covid symptoms"
),
databases = c("PubMed"),
date_range = as.Date(c("2020-01-01", "2024-12-31")),
filters = list(
language = "English",
article_types = c("Journal Article", "Review", "Clinical Trial")
),
search_date = Sys.time()
)
cat("Search Strategy:\n")
#> Search Strategy:
cat("Terms:", paste(search_strategy$terms, collapse = " OR "), "\n")
#> Terms: long covid OR post-covid syndrome OR covid-19 sequelae OR post-acute covid-19 OR persistent covid symptoms
cat("Date range:", paste(search_strategy$date_range, collapse = " to "), "\n\n")
#> Date range: 2020-01-01 to 2024-12-31
The search strategy includes:
- Search terms: A comprehensive list of synonyms and related terms for Long COVID
- Date range: Covering the period from the start of the pandemic through 2024
- Language filter: English language publications only
- Article types: Focusing on primary research, reviews, and clinical trials
Executing the Search
PubMed Search
The searchAnalyzeR package provides convenient functions
to search PubMed and retrieve article metadata:
# Execute the search using the package function
cat("Searching PubMed for real articles...\n")
#> Searching PubMed for real articles...
raw_results <- search_pubmed(
search_terms = search_strategy$terms,
max_results = 150,
date_range = search_strategy$date_range,
language = "English"
)
#> PubMed Query: ( "long covid"[Title/Abstract] OR "post-covid syndrome"[Title/Abstract] OR "covid-19 sequelae"[Title/Abstract] OR "post-acute covid-19"[Title/Abstract] OR "persistent covid symptoms"[Title/Abstract] ) AND ("2020/01/01"[Date - Publication] : "2024/12/31"[Date - Publication]) AND English [Language]
#> Found 150 articles
#> Retrieving batch 1 of 3
#> Retrieving batch 2 of 3
#> Retrieving batch 3 of 3
cat("\nRaw search completed. Retrieved", nrow(raw_results), "articles.\n")
#>
#> Raw search completed. Retrieved 150 articles.
Data Standardization
Raw search results from different databases often have varying
formats. The std_search_results() function standardizes the
data structure:
# Standardize the results using searchAnalyzeR functions
cat("\nStandardizing search results...\n")
#>
#> Standardizing search results...
standardized_results <- std_search_results(raw_results, source_format = "pubmed")
This standardization ensures that:
- Field names are consistent across different data sources
- Date formats are properly parsed
- Missing values are handled appropriately
- Data types are correctly assigned
Data Quality Assessment
Duplicate Detection
Duplicate detection is critical in systematic reviews, especially when searching multiple databases. The package provides sophisticated algorithms for identifying duplicates:
# Detect and remove duplicates
cat("Detecting duplicates...\n")
#> Detecting duplicates...
dedup_results <- detect_dupes(standardized_results, method = "exact")
cat("Duplicate detection complete:\n")
#> Duplicate detection complete:
cat("- Total articles:", nrow(dedup_results), "\n")
#> - Total articles: 150
cat("- Unique articles:", sum(!dedup_results$duplicate), "\n")
#> - Unique articles: 150
cat("- Duplicates found:", sum(dedup_results$duplicate), "\n\n")
#> - Duplicates found: 0
The detect_dupes() function offers several methods:
- exact: Matches identical titles and authors
- fuzzy: Uses string similarity algorithms
- doi: Matches based on Digital Object Identifiers
- combined: Uses multiple criteria for robust detection
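To illustrate the idea behind the fuzzy method, the sketch below flags near-identical titles using base R's adist() for normalized edit distance. This is an illustration only; the actual algorithm and threshold inside detect_dupes(method = "fuzzy") may differ.

```r
# Three titles: the first two differ only in punctuation and case.
titles <- c(
  "Long COVID: a review of persistent symptoms",
  "Long COVID - a review of persistent symptoms.",
  "Microclots in plasma of Long COVID patients"
)
# Normalize: strip punctuation, collapse whitespace, lowercase
norm <- trimws(tolower(gsub("\\s+", " ", gsub("[[:punct:]]", " ", titles))))
# Edit distance scaled by the longer string's length
d <- adist(norm) / outer(nchar(norm), nchar(norm), pmax)
which(d < 0.1 & upper.tri(d), arr.ind = TRUE)  # flags the pair (1, 2)
```

A stricter threshold trades missed duplicates for fewer false merges, which is why combined matching on DOI plus title is often the most robust choice.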
Search Statistics
Basic statistics help assess the overall quality of the search results:
# Calculate search statistics
search_stats <- calc_search_stats(dedup_results)
cat("Search Statistics:\n")
#> Search Statistics:
cat("- Date range:", paste(search_stats$date_range, collapse = " to "), "\n")
#> - Date range: 2024-01-01 to 2025-05-01
cat("- Missing abstracts:", search_stats$missing_abstracts, "\n")
#> - Missing abstracts: 0
cat("- Missing dates:", search_stats$missing_dates, "\n\n")
#> - Missing dates: 0
Creating a Gold Standard
Demonstration Gold Standard
For performance evaluation, we need a “gold standard” of known relevant articles. In a real systematic review, this would be your manually identified relevant articles. For this demonstration, we create a simplified gold standard:
# Create a gold standard for demonstration
# In a real systematic review, this would be your known relevant articles
# For this example, we'll identify articles that contain key terms in titles
cat("Creating demonstration gold standard...\n")
#> Creating demonstration gold standard...
long_covid_terms <- c("long covid", "post-covid", "post-acute covid", "persistent covid", "covid sequelae")
pattern <- paste(long_covid_terms, collapse = "|")
gold_standard_ids <- dedup_results %>%
filter(!duplicate) %>%
filter(grepl(pattern, tolower(title))) %>%
pull(id)
cat("Gold standard created with", length(gold_standard_ids), "highly relevant articles\n\n")
#> Gold standard created with 83 highly relevant articles
Note: In practice, your gold standard would be created through:
- Expert knowledge of key articles in the field
- Pilot searches and manual review
- Previously published systematic reviews
- Consultation with domain experts
Performance Analysis
Initializing the SearchAnalyzer
The SearchAnalyzer class provides comprehensive tools
for evaluating search performance:
# Initialize SearchAnalyzer with real data
cat("Initializing SearchAnalyzer...\n")
#> Initializing SearchAnalyzer...
analyzer <- SearchAnalyzer$new(
search_results = filter(dedup_results, !duplicate),
gold_standard = gold_standard_ids,
search_strategy = search_strategy
)
Calculating Performance Metrics
The analyzer calculates a comprehensive set of performance metrics:
# Calculate comprehensive metrics
cat("Calculating performance metrics...\n")
#> Calculating performance metrics...
metrics <- analyzer$calculate_metrics()
# Display key metrics
cat("\n=== SEARCH PERFORMANCE METRICS ===\n")
#>
#> === SEARCH PERFORMANCE METRICS ===
if (!is.null(metrics$precision_recall$precision)) {
cat("Precision:", round(metrics$precision_recall$precision, 3), "\n")
cat("Recall:", round(metrics$precision_recall$recall, 3), "\n")
cat("F1 Score:", round(metrics$precision_recall$f1_score, 3), "\n")
cat("Number Needed to Read:", round(metrics$precision_recall$number_needed_to_read, 1), "\n")
}
#> Precision: 0.553
#> Recall: 1
#> F1 Score: 0.712
#> Number Needed to Read: 1.8
cat("\n=== BASIC METRICS ===\n")
#>
#> === BASIC METRICS ===
cat("Total Records:", metrics$basic$total_records, "\n")
#> Total Records: 150
cat("Unique Records:", metrics$basic$unique_records, "\n")
#> Unique Records: 150
cat("Duplicates:", metrics$basic$duplicates, "\n")
#> Duplicates: 0
cat("Sources:", metrics$basic$sources, "\n")
#> Sources: 98
Visualization and Reporting
Performance Visualizations
The package generates publication-ready visualizations to assess search performance:
# Generate visualizations
cat("\nGenerating visualizations...\n")
#>
#> Generating visualizations...
# Overview plot
overview_plot <- analyzer$visualize_performance("overview")
print(overview_plot)
# Temporal distribution plot
temporal_plot <- analyzer$visualize_performance("temporal")
#> Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
#> ℹ Please use `linewidth` instead.
#> ℹ The deprecated feature was likely used in the searchAnalyzeR package.
#> Please report the issue at
#> <https://github.com/chaoliu-cl/searchAnalyzeR/issues>.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.
print(temporal_plot)
# Precision-recall curve (if gold standard available)
if (length(gold_standard_ids) > 0) {
pr_plot <- analyzer$visualize_performance("precision_recall")
print(pr_plot)
}
PRISMA Flow Diagram
The package can generate data for PRISMA flow diagrams, essential for systematic review reporting:
# Generate PRISMA flow diagram data
cat("\nCreating PRISMA flow data...\n")
#>
#> Creating PRISMA flow data...
screening_data <- data.frame(
  id = dedup_results$id[!dedup_results$duplicate],
  identified = TRUE,
  duplicate = FALSE,
  title_abstract_screened = TRUE,
  # The screening outcomes below are independent random draws, for
  # illustration only; in a real review these come from your screening
  # records, and each stage is conditional on the previous one.
  full_text_eligible = runif(sum(!dedup_results$duplicate)) > 0.7,
  included = runif(sum(!dedup_results$duplicate)) > 0.85,
  excluded_title_abstract = runif(sum(!dedup_results$duplicate)) > 0.3,
  excluded_full_text = runif(sum(!dedup_results$duplicate)) > 0.15
)
# Generate PRISMA diagram
reporter <- PRISMAReporter$new()
prisma_plot <- reporter$generate_prisma_diagram(screening_data)
print(prisma_plot)
Data Export and Sharing
Multiple Export Formats
The package supports exporting results in various formats commonly used in systematic reviews:
# Export results in multiple formats
cat("\nExporting results...\n")
#>
#> Exporting results...
output_dir <- tempdir()
export_files <- export_results(
search_results = filter(dedup_results, !duplicate),
file_path = file.path(output_dir, "covid_long_term_search"),
formats = c("csv", "xlsx", "ris"),
include_metadata = TRUE
)
cat("Files exported:\n")
#> Files exported:
for (file in export_files) {
cat("-", file, "\n")
}
#> - /tmp/RtmpvBR40M/covid_long_term_search.csv
#> - /tmp/RtmpvBR40M/covid_long_term_search.xlsx
#> - /tmp/RtmpvBR40M/covid_long_term_search.ris
Metrics Export
Performance metrics can also be exported for further analysis or reporting:
# Export metrics
metrics_file <- export_metrics(
metrics = metrics,
file_path = file.path(output_dir, "search_metrics.xlsx"),
format = "xlsx"
)
cat("- Metrics exported to:", metrics_file, "\n")
#> - Metrics exported to: /tmp/RtmpvBR40M/search_metrics.xlsx
Comprehensive Data Package
For reproducibility, create a complete data package containing all analysis components:
# Create a complete data package
cat("\nCreating comprehensive data package...\n")
#>
#> Creating comprehensive data package...
package_dir <- create_data_package(
search_results = filter(dedup_results, !duplicate),
analysis_results = list(
metrics = metrics,
search_strategy = search_strategy,
screening_data = screening_data
),
output_dir = output_dir,
package_name = "covid_long_term_systematic_review"
)
cat("Data package created at:", package_dir, "\n")
#> Data package created at: /tmp/RtmpvBR40M/covid_long_term_systematic_review
Advanced Analysis
Benchmark Validation
The package includes tools for validating search strategies against established benchmarks:
# Demonstrate benchmark validation (simplified)
cat("\nDemonstrating benchmark validation...\n")
#>
#> Demonstrating benchmark validation...
validator <- BenchmarkValidator$new()
# Add our search as a custom benchmark
validator$add_benchmark(
name = "covid_long_term",
corpus = filter(dedup_results, !duplicate),
relevant_ids = gold_standard_ids
)
# Validate the strategy
validation_results <- validator$validate_strategy(
search_strategy = search_strategy,
benchmark_name = "covid_long_term"
)
cat("Validation Results:\n")
#> Validation Results:
cat("- Precision:", round(validation_results$precision, 3), "\n")
#> - Precision: 0.696
cat("- Recall:", round(validation_results$recall, 3), "\n")
#> - Recall: 0.855
cat("- F1 Score:", round(validation_results$f1_score, 3), "\n")
#> - F1 Score: 0.768
Text Similarity Analysis
Analyze how well the retrieved abstracts match the search terms:
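Jaccard similarity is the size of the intersection of two word sets divided by the size of their union, so scores between a short term list and a full abstract are naturally small: the union is dominated by the abstract's vocabulary. A minimal standalone illustration (the tokenization here is an assumption; calc_text_sim() may tokenize differently):

```r
# Jaccard similarity on lowercase word sets
jaccard <- function(a, b) {
  ta <- unique(strsplit(tolower(a), "\\s+")[[1]])
  tb <- unique(strsplit(tolower(b), "\\s+")[[1]])
  length(intersect(ta, tb)) / length(union(ta, tb))
}
jaccard("long covid symptoms", "persistent long covid symptoms")  # 3/4 = 0.75
```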
# Text similarity analysis on abstracts
cat("\nAnalyzing abstract similarity to search terms...\n")
#>
#> Analyzing abstract similarity to search terms...
search_term_text <- paste(search_strategy$terms, collapse = " ")
similarity_scores <- sapply(dedup_results$abstract[!dedup_results$duplicate], function(abstract) {
if (is.na(abstract) || abstract == "") return(0)
calc_text_sim(search_term_text, abstract, method = "jaccard")
})
cat("Average abstract similarity to search terms:", round(mean(similarity_scores, na.rm = TRUE), 3), "\n")
#> Average abstract similarity to search terms: 0.013
cat("Abstracts with high similarity (>0.1):", sum(similarity_scores > 0.1, na.rm = TRUE), "\n")
#> Abstracts with high similarity (>0.1): 0
Term Effectiveness Analysis
Evaluate which search terms are most effective:
# Analyze term effectiveness
cat("\nAnalyzing individual term effectiveness...\n")
#>
#> Analyzing individual term effectiveness...
term_analysis <- term_effectiveness(
terms = search_strategy$terms,
search_results = filter(dedup_results, !duplicate),
gold_standard = gold_standard_ids
)
print(term_analysis)
#> Term Effectiveness Analysis
#> ==========================
#> Search Results: 150 articles
#> Gold Standard: 83 relevant articles
#> Fields Analyzed: title, abstract
#>
#> term articles_with_term relevant_with_term precision
#> long covid 86 60 0.698
#> post-covid syndrome 5 5 1.000
#> covid-19 sequelae 5 1 0.200
#> post-acute covid-19 15 11 0.733
#> persistent covid symptoms 0 0 0.000
#> coverage
#> 0.723
#> 0.060
#> 0.012
#> 0.133
#> 0.000
# Calculate term effectiveness scores
term_scores <- calc_tes(term_analysis)
cat("\nTerm Effectiveness Scores (TES):\n")
#>
#> Term Effectiveness Scores (TES):
print(term_scores[order(term_scores$tes, decreasing = TRUE), ])
#> Term Effectiveness Analysis
#> ==========================
#> Search Results: 150 articles
#> Gold Standard: 83 relevant articles
#> Fields Analyzed: title, abstract
#>
#> term articles_with_term relevant_with_term precision
#> long covid 86 60 0.698
#> post-acute covid-19 15 11 0.733
#> post-covid syndrome 5 5 1.000
#> covid-19 sequelae 5 1 0.200
#> persistent covid symptoms 0 0 0.000
#> coverage tes
#> 0.723 0.71005917
#> 0.133 0.22448980
#> 0.060 0.11363636
#> 0.012 0.02272727
#> 0.000 0.00000000
# Find top performing terms
top_terms <- find_top_terms(term_analysis, n = 3, plot = TRUE, plot_type = "precision_coverage")
cat("\nTop 3 performing terms:", paste(top_terms$terms, collapse = ", "), "\n")
#>
#> Top 3 performing terms: long covid, post-acute covid-19, post-covid syndrome
if (!is.null(top_terms$plot)) {
print(top_terms$plot)
}
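The TES values above are consistent with the harmonic mean of each term's precision and coverage (an F1-style score). Whether this is the package's exact formula is an assumption, but it reproduces the reported numbers:

```r
# Reproduce the TES reported for "long covid": 60 relevant hits among
# 86 articles containing the term, against 83 gold-standard articles.
p   <- 60 / 86                     # precision = 0.698
cov <- 60 / 83                     # coverage  = 0.723
round(2 * p * cov / (p + cov), 4)  # 0.7101, matching the reported TES
```

This explains why "post-covid syndrome" (precision 1.000 but coverage 0.060) scores below "post-acute covid-19": a term must retrieve a meaningful share of the relevant set, not just be precise.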
Results Interpretation and Recommendations
Automated Recommendations
Based on the calculated metrics, the package can provide automated recommendations:
# Final summary and recommendations
cat("\n=== FINAL SUMMARY AND RECOMMENDATIONS ===\n")
#>
#> === FINAL SUMMARY AND RECOMMENDATIONS ===
cat("Search Topic: Long-term effects of COVID-19\n")
#> Search Topic: Long-term effects of COVID-19
cat("Articles Retrieved:", sum(!dedup_results$duplicate), "\n")
#> Articles Retrieved: 150
cat("Search Date Range:", paste(search_strategy$date_range, collapse = " to "), "\n")
#> Search Date Range: 2020-01-01 to 2024-12-31
if (!is.null(metrics$precision_recall$precision)) {
cat("Search Precision:", round(metrics$precision_recall$precision, 3), "\n")
if (metrics$precision_recall$precision < 0.1) {
cat("RECOMMENDATION: Low precision suggests search may be too broad. Consider:\n")
cat("- Adding more specific terms\n")
cat("- Using MeSH terms\n")
cat("- Adding study type filters\n")
} else if (metrics$precision_recall$precision > 0.5) {
cat("RECOMMENDATION: High precision suggests good specificity. Consider:\n")
cat("- Broadening search if recall needs improvement\n")
cat("- Adding synonyms or related terms\n")
}
}
#> Search Precision: 0.553
#> RECOMMENDATION: High precision suggests good specificity. Consider:
#> - Broadening search if recall needs improvement
#> - Adding synonyms or related terms
Sample Retrieved Articles
Let’s examine some of the retrieved articles to understand the search results:
# Show some example retrieved articles
cat("\n=== SAMPLE RETRIEVED ARTICLES ===\n")
#>
#> === SAMPLE RETRIEVED ARTICLES ===
sample_articles <- filter(dedup_results, !duplicate) %>%
arrange(desc(date)) %>%
head(3)
for (i in 1:nrow(sample_articles)) {
article <- sample_articles[i, ]
cat("\n", i, ". ", article$title, "\n", sep = "")
cat(" Journal:", article$source, "\n")
cat(" Date:", as.character(article$date), "\n")
cat(" PMID:", gsub("PMID:", "", article$id), "\n")
cat(" Abstract:", substr(article$abstract, 1, 200), "...\n")
}
#>
#> 1. Rates, Risk Factors and Outcomes of Complications After COVID-19 in Children.
#> Journal: PubMed: The Pediatric infectious disease journal
#> Date: 2025-05-01
#> PMID: 40232883
#> Abstract: Coronavirus disease 2019 (COVID-19) can lead to various complications, including multisystem inflammatory syndrome in children (MIS-C) and post-COVID-19 conditions (long COVID). This study aimed to de ...
#>
#> 2. Considerations for Long COVID Rehabilitation in Women.
#> Journal: PubMed: Physical medicine and rehabilitation clinics of North America
#> Date: 2025-05-01
#> PMID: 40210368
#> Abstract: The coronavirus disease 2019 (COVID-19) pandemic has given rise to long COVID, a prolonged manifestation of severe acute respiratory syndrome coronavirus 2 infection, which presents with varied sympto ...
#>
#> 3. Self-Assembly of Human Fibrinogen into Microclot-Mimicking Antifibrinolytic Amyloid Fibrinogen Particles.
#> Journal: PubMed: ACS applied bio materials
#> Date: 2025-01-20
#> PMID: 39723824
#> Abstract: Recent clinical studies have highlighted the presence of microclots in the form of amyloid fibrinogen particles (AFPs) in plasma samples from Long COVID patients. However, the clinical significance of ...
Summary and Next Steps
What We’ve Accomplished
This vignette demonstrated a complete workflow using the
searchAnalyzeR package:
cat("\n=== ANALYSIS COMPLETE ===\n")
#>
#> === ANALYSIS COMPLETE ===
cat("This example demonstrated:\n")
#> This example demonstrated:
cat("1. Real PubMed search execution using search_pubmed()\n")
#> 1. Real PubMed search execution using search_pubmed()
cat("2. Data standardization and deduplication\n")
#> 2. Data standardization and deduplication
cat("3. Performance metric calculation\n")
#> 3. Performance metric calculation
cat("4. Visualization generation\n")
#> 4. Visualization generation
cat("5. Multi-format export capabilities\n")
#> 5. Multi-format export capabilities
cat("6. PRISMA diagram creation\n")
#> 6. PRISMA diagram creation
cat("7. Benchmark validation\n")
#> 7. Benchmark validation
cat("8. Term effectiveness analysis\n")
#> 8. Term effectiveness analysis
cat("9. Comprehensive reporting\n")
#> 9. Comprehensive reporting
cat("\nAll outputs saved to:", output_dir, "\n")
#>
#> All outputs saved to: /tmp/RtmpvBR40M
Files Generated
The analysis generates numerous output files for different purposes:
# Clean up and provide final file locations
list.files(output_dir, pattern = "covid", full.names = TRUE, recursive = TRUE)
#> [1] "/tmp/RtmpvBR40M/covid_long_term_search.csv"
#> [2] "/tmp/RtmpvBR40M/covid_long_term_search.ris"
#> [3] "/tmp/RtmpvBR40M/covid_long_term_search.xlsx"
Next Steps
After completing this analysis, typical next steps would include:
- Refining the search strategy based on performance metrics and term effectiveness analysis
- Expanding to multiple databases using similar workflows
- Implementing the refined strategy in your systematic review protocol
- Using the exported data for screening and data extraction phases
- Incorporating visualizations into your systematic review protocol or publication
Additional Resources
For more advanced features and customization options, consult:
- The package documentation: help(package = "searchAnalyzeR")
- Function-specific help: ?search_pubmed, ?SearchAnalyzer, etc.
- Additional vignettes covering specialized topics
- The package GitHub repository for updates and community contributions
This comprehensive workflow demonstrates how
searchAnalyzeR can streamline and enhance the systematic
review search process, providing objective metrics and visualizations to
support evidence-based search strategy development and optimization.