Sanitizes dictionary terms so that only entries valid for entity extraction remain. For example, empty terms and terms consisting solely of numbers are removed.
Usage
sanitize_dictionary(
  dictionary,
  term_column = "term",
  type_column = "type",
  validate_types = TRUE,
  verbose = TRUE
)
Arguments
- dictionary
A data frame containing dictionary terms.
- term_column
The name of the column containing the terms to sanitize. Defaults to "term".
- type_column
The name of the column containing entity types. Defaults to "type".
- validate_types
Logical. If TRUE (the default), validates each term against its claimed type.
- verbose
Logical. If TRUE (the default), prints information about the filtering process.
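To make the role of the column-name arguments concrete, the following is a minimal base-R sketch of the two filters that the example output below reports (empty terms and number-only terms). The helper `sketch_sanitize()` and the column names `label` and `kind` are hypothetical and exist only for illustration. This is not the package's implementation, and type validation is not modeled.

```r
# Hypothetical helper: NOT the package's implementation, only a sketch of
# the empty-term and number-only filters mentioned in the example output.
sketch_sanitize <- function(dictionary, term_column = "term", verbose = TRUE) {
  stopifnot(is.data.frame(dictionary), term_column %in% names(dictionary))

  # Look the term column up by name, so any column name can be used.
  terms <- trimws(as.character(dictionary[[term_column]]))

  is_empty   <- is.na(terms) | terms == ""
  is_numeric <- !is_empty & grepl("^[0-9]+$", terms)

  if (verbose) {
    message("Dropping ", sum(is_empty), " empty and ",
            sum(is_numeric), " number-only term(s)")
  }

  result <- dictionary[!(is_empty | is_numeric), , drop = FALSE]
  rownames(result) <- NULL
  result
}

# A dictionary whose columns do not use the default names.
dict <- data.frame(
  label = c("migraine", "123", "", "receptor"),
  kind  = c("disease", "number", "empty", "protein"),
  stringsAsFactors = FALSE
)

cleaned <- sketch_sanitize(dict, term_column = "label", verbose = FALSE)
```

With the real function, the analogous call would pass the non-default names through `term_column` and `type_column` in the same way.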
Examples
# Create a dictionary with problematic terms
dirty_dict <- data.frame(
  term = c("migraine", "europe", "optimization", "receptor", "123", ""),
  type = c("disease", "location", "process", "protein", "number", "empty")
)
# Sanitize the dictionary
clean_dict <- sanitize_dictionary(dirty_dict)
#> Sanitizing dictionary with 6 terms...
#> Removing 1 empty terms
#> Removed 1 terms consisting solely of numbers
#> Sanitization complete. 4 terms remaining (80% of original)
print(clean_dict)
#>           term     type
#> 1     migraine  disease
#> 2       europe location
#> 3 optimization  process
#> 4     receptor  protein
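After sanitizing, it can be useful to see exactly which rows were dropped. The sketch below uses only base R. So that it runs on its own, `clean_dict` is built here as a stand-in that mirrors the printed result above; in practice it would be the data frame returned by `sanitize_dictionary(dirty_dict)`.

```r
dirty_dict <- data.frame(
  term = c("migraine", "europe", "optimization", "receptor", "123", ""),
  type = c("disease", "location", "process", "protein", "number", "empty"),
  stringsAsFactors = FALSE
)

# Stand-in for the sanitized result shown above (an assumption made so the
# example is self-contained; normally this comes from sanitize_dictionary()).
clean_dict <- dirty_dict[1:4, ]

# Rows of the original dictionary whose term did not survive sanitization.
removed <- dirty_dict[!dirty_dict$term %in% clean_dict$term, ]
print(removed)
```

Comparing on the term column works here because the terms are unique. A dictionary with duplicated terms would need a comparison on row identifiers instead.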