This function runs a complete entity-extraction workflow: it builds dictionaries from one or more sources, optionally sanitizes them, and extracts entity mentions from text in batches, with optional parallel processing and dictionary caching.

Usage

extract_entities_workflow(
  text_data,
  text_column = "abstract",
  entity_types = c("disease", "drug", "gene"),
  dictionary_sources = c("local", "mesh", "umls"),
  additional_mesh_queries = NULL,
  sanitize = TRUE,
  api_key = NULL,
  custom_dictionary = NULL,
  max_terms_per_type = 200,
  verbose = TRUE,
  batch_size = 500,
  parallel = FALSE,
  num_cores = 2,
  cache_dictionaries = TRUE
)

Arguments

text_data

A data frame containing article text data.

text_column

Name of the column in text_data containing the text to process. Default is "abstract".

entity_types

Character vector of entity types to include.

dictionary_sources

Character vector of sources for entity dictionaries.

additional_mesh_queries

Named list of additional MeSH queries.

sanitize

Logical. If TRUE, sanitizes dictionaries before extraction.

api_key

API key for UMLS access. Used only when "umls" is in dictionary_sources. Default is NULL.

custom_dictionary

A data frame containing custom dictionary entries to incorporate into the entity extraction process.

max_terms_per_type

Maximum number of terms to fetch per entity type. Default is 200.

verbose

Logical. If TRUE, prints detailed progress information.

batch_size

Number of documents to process in a single batch. Default is 500.

parallel

Logical. If TRUE, uses parallel processing when available. Default is FALSE.

num_cores

Number of cores to use for parallel processing. Default is 2.

cache_dictionaries

Logical. If TRUE, caches dictionaries for faster reuse. Default is TRUE.
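
As a rough illustration of what batch_size means in practice, the sketch below splits document indices into consecutive batches using base R. It is not the package's internal code: the helper name split_into_batches is hypothetical, and it only shows how 2 documents with a batch size of 500 would end up in a single batch.

```r
# Hypothetical helper (not part of the package): split document indices
# into consecutive batches of at most `batch_size` elements.
split_into_batches <- function(n_docs, batch_size = 500) {
  stopifnot(n_docs >= 0, batch_size >= 1)
  if (n_docs == 0) return(list())
  idx <- seq_len(n_docs)
  # ceiling(idx / batch_size) gives each index its batch number
  unname(split(idx, ceiling(idx / batch_size)))
}

batches <- split_into_batches(1200, batch_size = 500)
n_batches <- length(batches)     # number of batches
batch_sizes <- lengths(batches)  # documents per batch
```

Smaller batches generally lower peak memory use at the cost of more iterations, so the right value depends on document length and available memory.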

Value

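A data frame with one row per extracted entity. In the example output on this page, the columns are entity, entity_type, doc_id, start_pos, end_pos, sentence, and frequency.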
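
The sketch below shows one way to inspect a result of this shape with base R. The data frame is a hand-built stand-in that copies the single row from the example output on this page, so the code runs without the package installed. Treating start_pos and end_pos as 1-based, inclusive offsets into the sentence column is an assumption that matches that row but is not confirmed beyond it.

```r
# Stand-in for a real result: columns and values copied from the example
# output on this page (not produced by calling the package here).
entities <- data.frame(
  entity = "migraine",
  entity_type = "disease",
  doc_id = 1L,
  start_pos = 1L,
  end_pos = 8L,
  sentence = "Migraine is a neurological disorder.",
  frequency = 1L,
  stringsAsFactors = FALSE
)

# Count mentions per entity type
mentions_by_type <- table(entities$entity_type)

# Recover the matched text; ASSUMES positions are 1-based and inclusive
# and index into `sentence`, which holds for the example row
matched <- substr(entities$sentence, entities$start_pos, entities$end_pos)
```

Comparing tolower(matched) against the entity column is a quick sanity check that the offsets line up with the text.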

Examples

# Create example text data
text_data <- data.frame(
  doc_id = 1:2,
  abstract = c(
    "Migraine is a neurological disorder.",
    "Serotonin plays a role in headache."
  )
)

# Extract entities using workflow
entities <- extract_entities_workflow(
  text_data,
  entity_types = c("disease", "chemical"),
  dictionary_sources = "local",
  max_terms_per_type = 10
)
#> Running in R CMD check environment. Disabling parallel processing.
#> Entity type 'chemical' not supported by local source. Using MeSH instead.
#> Creating dictionaries for entity extraction...
#> Loading dictionaries sequentially...
#> Package not installed or dictionary not found. Using example dictionaries.
#> Creating dummy dictionary for disease
#>   Added 8 terms from disease (local)
#> Searching MeSH database for: chemicals[MeSH]
#> Trying direct MeSH database search...
#> Processing batch 1 of 1
#> Extracted 2 unique terms from MeSH text format
#> Retrieved 2 unique terms from MeSH
#>   Added 2 terms from chemical (mesh)
#> Created combined dictionary with 10 unique terms
#> Sanitizing dictionary with 10 terms...
#>   Removed 6 terms that did not match their claimed entity types
#>   Removed 2 terms with non-alphanumeric characters (final cleanup)
#> Sanitization complete. 2 terms remaining (20% of original)
#> Extracting entities from 2 documents...
#> Processing batch 1/1
#> Extracting entities from 2 documents...
#> Extracted 1 entity mentions:
#>   disease: 1
#> Extracted 1 entity mentions in 0.03 minutes
#>   disease: 1
print(head(entities))
#>     entity entity_type doc_id start_pos end_pos
#> 1 migraine     disease      1         1       8
#>                               sentence frequency
#> 1 Migraine is a neurological disorder.         1