This function preprocesses review text by optionally filtering out non-English reviews, removing punctuation, converting text to lowercase, removing stopwords, and stemming the remaining words.
Value
A list containing the following elements:
corpus
: The preprocessed corpus object.
dtm
: The document-term matrix.
filtered_reviews
: The filtered reviews data frame.
Examples
# \donttest{
# Create a temporary file with sample book IDs
temp_file <- tempfile(fileext = ".txt")
writeLines(c("1420", "2767052", "10210"), temp_file)
# Scrape reviews
reviews <- scrape_reviews(temp_file, num_reviews = 5, use_parallel = FALSE)
#> Total book IDs to process: 3
#> 2024-10-25 03:02:24.9218 scrape_goodreads_reviews: Completed! All book reviews extracted
#> Scraping run time = 8.74226522445679
#> Total books processed: 3
# Preprocess the reviews
preprocessed <- preprocess_reviews(reviews, english_only = TRUE)
# Print the document-term matrix
print(preprocessed$dtm)
#> <<DocumentTermMatrix (documents: 13, terms: 1799)>>
#> Non-/sparse entries: 2929/20458
#> Sparsity : 87%
#> Maximal term length: 73
#> Weighting : term frequency (tf)
# Clean up: remove the temporary file
file.remove(temp_file)
#> [1] TRUE
# }
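The steps named in the description (punctuation removal, lowercasing, stopword removal, stemming, then building a document-term matrix) can be sketched offline with the tm package. This is only an illustration under stated assumptions, not the source of preprocess_reviews: the sample texts and the variable sample_text are made up, the choice of tm (with SnowballC for stemming) is an assumption, and the optional English-only filter is left out.

```r
# Illustrative sketch only: NOT the implementation of preprocess_reviews().
# Assumes the tm and SnowballC packages are installed.
library(tm)

# Hypothetical sample reviews, standing in for scraped review text
sample_text <- c(
  "I LOVED this book, truly!",
  "The plot was slow; the characters were boring."
)

# Build a corpus with one document per review
corpus <- VCorpus(VectorSource(sample_text))

# Apply the cleaning steps in the order the description lists them
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stemDocument)      # stemming relies on SnowballC
corpus <- tm_map(corpus, stripWhitespace)   # tidy gaps left by removed words

# Term-frequency document-term matrix: one row per review
dtm <- DocumentTermMatrix(corpus)
print(dtm)
```

Inspecting `Terms(dtm)` shows which stems survive the cleaning, which is a quick way to check that stopwords such as "this" and "the" were dropped.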