
Extract entities from text with improved efficiency using only base R
Source: R/text_preprocessing.R (extract_entities_workflow.Rd)
This function runs a complete workflow for extracting entities from text using dictionaries drawn from multiple sources. It supports batched processing, optional parallel execution, and dictionary caching, and includes error handling.
Usage
extract_entities_workflow(
  text_data,
  text_column = "abstract",
  entity_types = c("disease", "drug", "gene"),
  dictionary_sources = c("local", "mesh", "umls"),
  additional_mesh_queries = NULL,
  sanitize = TRUE,
  api_key = NULL,
  custom_dictionary = NULL,
  max_terms_per_type = 200,
  verbose = TRUE,
  batch_size = 500,
  parallel = FALSE,
  num_cores = 2,
  cache_dictionaries = TRUE
)
Arguments
- text_data
A data frame containing article text data.
- text_column
Name of the column containing the text to process. Default is "abstract".
- entity_types
Character vector of entity types to include. Default is c("disease", "drug", "gene").
- dictionary_sources
Character vector of sources for entity dictionaries. Default is c("local", "mesh", "umls").
- additional_mesh_queries
Named list of additional MeSH queries. Default is NULL.
- sanitize
Logical. If TRUE, sanitizes dictionaries before extraction. Default is TRUE.
- api_key
API key for UMLS access, used if "umls" is in dictionary_sources. Default is NULL.
- custom_dictionary
A data frame of custom dictionary entries to use in addition to the other dictionary sources. Default is NULL.
- max_terms_per_type
Maximum number of terms to fetch per entity type. Default is 200.
- verbose
Logical. If TRUE, prints detailed progress information. Default is TRUE.
- batch_size
Number of documents to process in a single batch. Default is 500.
- parallel
Logical. If TRUE, uses parallel processing when available. Default is FALSE.
- num_cores
Number of cores to use for parallel processing. Default is 2.
- cache_dictionaries
Logical. If TRUE, caches dictionaries for faster reuse. Default is TRUE.
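Examples
The arguments above can be combined as in the following minimal sketch. It is illustrative only: the toy data frame and its pmid column are hypothetical, and restricting dictionary_sources to "local" is assumed here to avoid network lookups and the need for a UMLS API key. The call is guarded so the script still runs when the package that exports extract_entities_workflow is not attached.

```r
# Toy input: one row per article, with the text in the default "abstract" column.
articles <- data.frame(
  pmid = c("1", "2"),  # hypothetical identifier column, not required by the docs
  abstract = c(
    "Metformin is widely prescribed for type 2 diabetes.",
    "BRCA1 mutations are associated with breast cancer risk."
  ),
  stringsAsFactors = FALSE
)

# Only call the function if it is available in the current session.
if (exists("extract_entities_workflow", mode = "function")) {
  entities <- extract_entities_workflow(
    articles,
    text_column = "abstract",
    entity_types = c("disease", "drug", "gene"),
    dictionary_sources = "local",  # assumption: local dictionaries need no API key
    max_terms_per_type = 200,
    batch_size = 500,
    parallel = FALSE,
    verbose = FALSE
  )
  # The return structure is not documented in this section, so just inspect it.
  str(entities)
}
```

To try the other sources, add "mesh" or "umls" to dictionary_sources and supply api_key when "umls" is included.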