
Extract and classify entities from text with multi-domain types
Source: R/text_preprocessing.R
extract_entities.Rd
This function extracts entities from text and optionally assigns them to specific semantic categories based on dictionaries.
Usage
extract_entities(
  text_data,
  text_column = "abstract",
  dictionary = NULL,
  case_sensitive = FALSE,
  overlap_strategy = c("priority", "all", "longest"),
  sanitize_dict = TRUE
)
Arguments
- text_data
A data frame containing article text data.
- text_column
Name of the column containing text to process.
- dictionary
Combined dictionary or list of dictionaries for entity extraction.
- case_sensitive
Logical. If TRUE, matching is case-sensitive.
- overlap_strategy
How to handle terms that match multiple dictionaries: "priority", "all", or "longest".
- sanitize_dict
Logical. If TRUE, sanitizes the dictionary before extraction.
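The three overlap_strategy values can be pictured with a small base R sketch. This is only an illustration under assumed semantics, not the package's implementation: resolve_overlaps() and the columns of matches are hypothetical names, "priority" is assumed to keep the first listed type for a term, "all" to keep every match, and "longest" to drop any match whose span lies inside a longer match.

```r
# Hypothetical helper, NOT part of the package; the semantics of each
# strategy are assumptions made for illustration only.
resolve_overlaps <- function(matches,
                             strategy = c("priority", "all", "longest")) {
  strategy <- match.arg(strategy)
  if (strategy == "all" || nrow(matches) < 2) {
    return(matches)  # keep every match as-is
  }
  if (strategy == "priority") {
    # Assumed: rows are already ordered by dictionary priority,
    # so the first row seen for each term wins.
    return(matches[!duplicated(matches$term), , drop = FALSE])
  }
  # "longest": drop a match when a strictly longer match covers its span.
  len <- matches$end - matches$start
  keep <- vapply(seq_len(nrow(matches)), function(i) {
    covered <- matches$start <= matches$start[i] &
      matches$end >= matches$end[i] &
      len > len[i]
    !any(covered)
  }, logical(1))
  matches[keep, , drop = FALSE]
}

# Made-up matches: one term listed under two types, plus a longer term
# that covers the same span.
matches <- data.frame(
  term  = c("serotonin", "serotonin", "serotonin receptor"),
  type  = c("chemical", "neurotransmitter", "protein"),
  start = c(1, 1, 1),
  end   = c(9, 9, 18),
  stringsAsFactors = FALSE
)

resolve_overlaps(matches, "priority")$type  # "chemical" "protein"
resolve_overlaps(matches, "longest")$term   # "serotonin receptor"
nrow(resolve_overlaps(matches, "all"))      # 3
```

Check the package source for the behaviour extract_entities() actually applies under each strategy.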
Examples
# Create example text data
text_data <- data.frame(
  doc_id = 1:3,
  abstract = c(
    "Migraine is a neurological disorder causing severe headache and photophobia.",
    "Serotonin receptors play a role in migraine pathophysiology.",
    "Sumatriptan is an effective treatment for migraine attacks."
  )
)
# Create example dictionary
dictionary <- data.frame(
  term = c("migraine", "headache", "photophobia", "serotonin", "sumatriptan"),
  type = c("disease", "symptom", "symptom", "chemical", "drug")
)
# Extract entities
entities <- extract_entities(text_data, dictionary = dictionary)
#> Sanitizing dictionary with 5 terms...
#> Removed 1 terms that did not match their claimed entity types
#> Sanitization complete. 4 terms remaining (80% of original)
#> Dictionary sanitized: 4 of 5 terms retained
#> Extracting entities from 3 documents...
#> Extracted 6 entity mentions:
#> chemical: 1
#> disease: 3
#> symptom: 2
print(entities)
#>        entity entity_type doc_id start_pos end_pos
#> 2    migraine     disease      1         1       8
#> 3    migraine     disease      2        36      43
#> 4    migraine     disease      3        43      50
#> 1    headache     symptom      1        52      59
#> 5 photophobia     symptom      1        65      75
#> 6   serotonin    chemical      2         1       9
#>                                                                        sentence
#> 2 Migraine is a neurological disorder causing severe headache and photophobia.
#> 3                 Serotonin receptors play a role in migraine pathophysiology.
#> 4                  Sumatriptan is an effective treatment for migraine attacks.
#> 1 Migraine is a neurological disorder causing severe headache and photophobia.
#> 5 Migraine is a neurological disorder causing severe headache and photophobia.
#> 6                 Serotonin receptors play a role in migraine pathophysiology.
#>   frequency
#> 2         3
#> 3         3
#> 4         3
#> 1         1
#> 5         1
#> 6         1
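
The start_pos and end_pos columns in the output above are 1-based, inclusive character offsets within each abstract. As a rough cross-check, they can be reproduced with base R alone. The helper below, locate_term(), is a hypothetical name used only for this sketch. It assumes plain substring matching, which may differ from how extract_entities() matches terms internally.

```r
# Hypothetical helper, NOT part of the package: finds every occurrence
# of a term in one string and returns 1-based inclusive offsets.
locate_term <- function(term, text, case_sensitive = FALSE) {
  if (!case_sensitive) {
    # Lower-casing both sides keeps character offsets intact for ASCII text.
    term <- tolower(term)
    text <- tolower(text)
  }
  hits <- gregexpr(term, text, fixed = TRUE)[[1]]
  if (hits[1] == -1) {
    # gregexpr() signals "no match" with -1
    return(data.frame(start_pos = integer(0), end_pos = integer(0)))
  }
  start <- as.integer(hits)
  data.frame(
    start_pos = start,
    end_pos   = start + attr(hits, "match.length") - 1L
  )
}

abstracts <- c(
  "Migraine is a neurological disorder causing severe headache and photophobia.",
  "Serotonin receptors play a role in migraine pathophysiology.",
  "Sumatriptan is an effective treatment for migraine attacks."
)

locate_term("migraine", abstracts[2])  # start_pos 36, end_pos 43
locate_term("headache", abstracts[1])  # start_pos 52, end_pos 59

# With case-sensitive matching, "migraine" no longer matches the
# capitalised "Migraine" that opens the first abstract.
nrow(locate_term("migraine", abstracts[1], case_sensitive = TRUE))  # 0
```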