Sanitizes dictionary terms so that they are valid for entity extraction, removing entries such as empty strings and terms consisting solely of numbers.

Usage

sanitize_dictionary(
  dictionary,
  term_column = "term",
  type_column = "type",
  validate_types = TRUE,
  verbose = TRUE
)

Arguments

dictionary

A data frame containing dictionary terms.

term_column

The name of the column containing the terms to sanitize.

type_column

The name of the column containing entity types.

validate_types

Logical. If TRUE, validates each term against the entity type given in type_column.

verbose

Logical. If TRUE, prints progress messages about the filtering process, such as how many terms were removed and how many remain.

Value

A data frame with sanitized terms.

Examples

# Create a dictionary with problematic terms
dirty_dict <- data.frame(
  term = c("migraine", "europe", "optimization", "receptor", "123", ""),
  type = c("disease", "location", "process", "protein", "number", "empty")
)

# Sanitize the dictionary
clean_dict <- sanitize_dictionary(dirty_dict)
#> Sanitizing dictionary with 6 terms...
#>   Removing 1 empty terms
#>   Removed 1 terms consisting solely of numbers
#> Sanitization complete. 4 terms remaining (80% of original)
print(clean_dict)
#>           term     type
#> 1     migraine  disease
#> 2       europe location
#> 3 optimization  process
#> 4     receptor  protein
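
The output above reports two filters: empty terms and terms consisting solely of numbers. The following base R sketch imitates just those two steps to show the kind of filtering involved. It is not the package's implementation: `sketch_sanitize()` is a hypothetical helper introduced here for illustration, and it leaves out type validation and any other checks `sanitize_dictionary()` may perform.

```r
# Hypothetical helper, NOT part of the package. It mimics only the two
# filters visible in the example output: empty terms and digits-only terms.
sketch_sanitize <- function(dictionary, term_column = "term") {
  terms <- trimws(as.character(dictionary[[term_column]]))

  # Treat NA and empty (or whitespace-only) strings as empty terms
  is_empty <- is.na(terms) | terms == ""

  # Terms made up solely of digits, e.g. "123"
  is_numeric_only <- grepl("^[0-9]+$", terms)

  # Keep rows that pass both filters and renumber them from 1
  result <- dictionary[!is_empty & !is_numeric_only, , drop = FALSE]
  rownames(result) <- NULL
  result
}

dirty_dict <- data.frame(
  term = c("migraine", "europe", "optimization", "receptor", "123", ""),
  type = c("disease", "location", "process", "protein", "number", "empty"),
  stringsAsFactors = FALSE
)

sketched <- sketch_sanitize(dirty_dict)
print(sketched)
```

Keeping each filter as a separate logical vector makes it straightforward to count how many terms each step removes, which is the sort of information the verbose messages report.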