Detect and Remove Duplicate Records

Usage

detect_dupes(results, method = "exact", similarity_threshold = 0.85)

Arguments

results

Standardized search results data frame

method

Method for duplicate detection: one of "exact" (the default), "fuzzy", or "doi"

similarity_threshold

Similarity threshold for fuzzy matching, between 0 and 1 (default 0.85)

Value

Data frame with duplicates marked and removed

Details

This function provides three methods for duplicate detection:

  • exact: Matches records on the title and the first 100 characters of the abstract

  • fuzzy: Uses Jaro-Winkler string similarity to match records whose text is nearly, but not exactly, identical

  • doi: Matches based on cleaned DOI strings
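The exact and doi methods both reduce each record to a comparison key and flag repeated keys. The following base R sketch illustrates that idea only; it is not the package's implementation. The column names `title`, `abstract`, and `doi`, the helper `clean_doi()`, and the specific cleaning rules (lowercasing, trimming whitespace, stripping a resolver or `doi:` prefix) are all assumptions made for illustration.

```r
# Hypothetical input; the real standardized schema may use other column names.
records <- data.frame(
  title = c("Deep learning for triage",
            "Deep learning for triage",
            "A different study"),
  abstract = c(paste0(strrep("a", 100), " first ending"),
               paste0(strrep("a", 100), " second ending"),
               "Short abstract"),
  doi = c("https://doi.org/10.1000/ABC123", "doi:10.1000/abc123", NA),
  stringsAsFactors = FALSE
)

# "exact" idea: key = title plus the first 100 characters of the abstract.
exact_key <- paste(records$title, substr(records$abstract, 1, 100), sep = "|")
exact_dup <- duplicated(exact_key)

# "doi" idea: normalize the DOI, then compare. Cleaning rules are assumed.
clean_doi <- function(doi) {
  doi <- tolower(trimws(doi))
  doi <- sub("^(https?://(dx\\.)?doi\\.org/|doi:\\s*)", "", doi)
  doi[!is.na(doi) & doi == ""] <- NA_character_
  doi
}
cleaned <- clean_doi(records$doi)

# Records without a DOI are never treated as duplicates of each other.
doi_dup <- !is.na(cleaned) & duplicated(cleaned)

# Keep the first occurrence of each exact key.
deduped <- records[!exact_dup, ]
```

In this sketch the first two records share an exact key because their abstracts differ only after character 100, and they share a cleaned DOI despite the different prefixes and letter case.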

For fuzzy matching, similarity_threshold must lie between 0 and 1, where 1 means the strings are identical. Higher values demand closer matches and therefore flag fewer records as duplicates. The default of 0.85 typically works well for academic titles.
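To make the threshold concrete, here is a self-contained base R sketch of Jaro-Winkler similarity applied to titles. It is an illustration under stated assumptions, not the function's actual code: the helper name `jaro_winkler()`, the prefix weight `p = 0.1`, the lowercasing step, and the choice to compare titles are all assumptions. In practice a package such as stringdist provides this measure through `stringsim(method = "jw")`.

```r
# Jaro-Winkler similarity between two strings (1 = identical).
# p is the Winkler prefix weight; 0.1 is the conventional value (assumed here).
jaro_winkler <- function(a, b, p = 0.1) {
  s1 <- strsplit(a, "")[[1]]
  s2 <- strsplit(b, "")[[1]]
  n1 <- length(s1)
  n2 <- length(s2)
  if (n1 == 0 && n2 == 0) return(1)
  if (n1 == 0 || n2 == 0) return(0)

  # Characters match only if they are equal and within this window.
  window <- max(floor(max(n1, n2) / 2) - 1, 0)
  hit1 <- logical(n1)
  hit2 <- logical(n2)
  for (i in seq_len(n1)) {
    lo <- max(1, i - window)
    hi <- min(n2, i + window)
    if (lo > hi) next
    for (j in lo:hi) {
      if (!hit2[j] && s1[i] == s2[j]) {
        hit1[i] <- TRUE
        hit2[j] <- TRUE
        break
      }
    }
  }
  m <- sum(hit1)
  if (m == 0) return(0)

  # Half the number of matched characters that appear in a different order.
  transpositions <- sum(s1[hit1] != s2[hit2]) / 2
  jaro <- (m / n1 + m / n2 + (m - transpositions) / m) / 3

  # Winkler adjustment: reward a shared prefix of up to four characters.
  k <- min(4, n1, n2)
  same <- s1[seq_len(k)] == s2[seq_len(k)]
  prefix <- if (all(same)) k else which(!same)[1] - 1
  jaro + prefix * p * (1 - jaro)
}

# Hypothetical titles; lowercasing before comparison is an assumption.
reference <- "Machine learning in healthcare: a systematic review"
candidates <- c("Machine Learning in Healthcare: A Systematic Review.",
                "Quantum error correction codes")

similarity_threshold <- 0.85
scores <- vapply(candidates,
                 function(x) jaro_winkler(tolower(reference), tolower(x)),
                 numeric(1), USE.NAMES = FALSE)
is_match <- scores >= similarity_threshold
```

The first candidate differs from the reference only in letter case and a trailing period, so it scores well above 0.85. Raising `similarity_threshold` toward 1 would eventually exclude even such near-identical titles.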