Extracts entities from text and, when a dictionary is supplied, assigns each one to the semantic category (entity type) defined in that dictionary.

Usage

extract_entities(
  text_data,
  text_column = "abstract",
  dictionary = NULL,
  case_sensitive = FALSE,
  overlap_strategy = c("priority", "all", "longest"),
  sanitize_dict = TRUE
)

Arguments

text_data

A data frame containing article text data.

text_column

Name of the column containing the text to process. Defaults to "abstract".

dictionary

Combined dictionary, or a list of dictionaries, for entity extraction. The example below uses a data frame with term and type columns.

case_sensitive

Logical. If TRUE, matching is case-sensitive. Defaults to FALSE.

overlap_strategy

How to handle terms that match multiple dictionaries: "priority", "all", or "longest".

sanitize_dict

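Logical. If TRUE (the default), sanitizes the dictionary before extraction. As the example output shows, this can remove terms that do not match their claimed entity type.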

Value

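A data frame with one row per extracted entity mention. In the example below, the columns are entity, entity_type, doc_id, start_pos, end_pos, sentence, and frequency.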
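
Judging by the example output, start_pos and end_pos appear to be 1-based, inclusive character offsets into the source text. The following base R sketch shows one way such positions could be computed. It is not the package's implementation: find_term_positions is a hypothetical helper, and it uses plain substring matching with no word-boundary handling.

```r
# Hypothetical helper (not part of the package): locate every occurrence
# of `term` in `text` and return 1-based, inclusive character positions.
find_term_positions <- function(text, term, case_sensitive = FALSE) {
  if (!case_sensitive) {
    # Lowercase both sides so that "Migraine" matches "migraine"
    text <- tolower(text)
    term <- tolower(term)
  }
  # fixed = TRUE treats the term as a literal string, not a regex
  hits <- gregexpr(term, text, fixed = TRUE)[[1]]
  if (hits[1] == -1) {
    # gregexpr() signals "no match" with -1
    return(data.frame(start_pos = integer(0), end_pos = integer(0)))
  }
  starts <- as.integer(hits)
  data.frame(
    start_pos = starts,
    end_pos = starts + nchar(term) - 1L
  )
}

sentence <- "Migraine is a neurological disorder causing severe headache and photophobia."
pos <- find_term_positions(sentence, "headache")
print(pos)
# start_pos is 52 and end_pos is 59

# substr() takes the same 1-based, inclusive bounds
substr(sentence, pos$start_pos, pos$end_pos)
```

Because the offsets are inclusive, they can be passed straight to substr() to recover the matched text.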

Examples

# Create example text data
text_data <- data.frame(
  doc_id = 1:3,
  abstract = c(
    "Migraine is a neurological disorder causing severe headache and photophobia.",
    "Serotonin receptors play a role in migraine pathophysiology.",
    "Sumatriptan is an effective treatment for migraine attacks."
  )
)

# Create example dictionary
dictionary <- data.frame(
  term = c("migraine", "headache", "photophobia", "serotonin", "sumatriptan"),
  type = c("disease", "symptom", "symptom", "chemical", "drug")
)

# Extract entities
entities <- extract_entities(text_data, dictionary = dictionary)
#> Sanitizing dictionary with 5 terms...
#>   Removed 1 terms that did not match their claimed entity types
#> Sanitization complete. 4 terms remaining (80% of original)
#> Dictionary sanitized: 4 of 5 terms retained
#> Extracting entities from 3 documents...
#> Extracted 6 entity mentions:
#>   chemical: 1
#>   disease: 3
#>   symptom: 2
print(entities)
#>        entity entity_type doc_id start_pos end_pos
#> 2    migraine     disease      1         1       8
#> 3    migraine     disease      2        36      43
#> 4    migraine     disease      3        43      50
#> 1    headache     symptom      1        52      59
#> 5 photophobia     symptom      1        65      75
#> 6   serotonin    chemical      2         1       9
#>                                                                       sentence
#> 2 Migraine is a neurological disorder causing severe headache and photophobia.
#> 3                 Serotonin receptors play a role in migraine pathophysiology.
#> 4                  Sumatriptan is an effective treatment for migraine attacks.
#> 1 Migraine is a neurological disorder causing severe headache and photophobia.
#> 5 Migraine is a neurological disorder causing severe headache and photophobia.
#> 6                 Serotonin receptors play a role in migraine pathophysiology.
#>   frequency
#> 2         3
#> 3         3
#> 4         3
#> 1         1
#> 5         1
#> 6         1