Analyzing Similarities between Two Long Speeches
Source: vignettes/Analyzing_Similarities_between_Two_Long_Speeches.Rmd
Introduction
This vignette demonstrates the usage of various similarity functions for analyzing speeches. We'll be using the example data speeches_data, stored in inst/extdata, to showcase these functions.
First, let’s load the example data:
library(conversim)

data_path <- system.file("extdata", "speeches_data.Rdata", package = "conversim")
load(data_path)
# Print a summary of the speeches data
print(summary(speeches_data))
##   speaker_id            text
##  Length:2           Length:2
##  Class :character   Class :character
##  Mode  :character   Mode  :character
Preprocessing Text
Before we begin with the similarity functions, let's look at the preprocess_text function:
# Example usage with our data
original_text <- substr(speeches_data$text[1], 1, 200) # First 200 characters of speech A
preprocessed_text <- preprocess_text(original_text)
print(paste("Original:", original_text))
## [1] "Original: Ladies and Gentlemen, Distinguished Guests,\n\nToday, I stand before you to address one of the most pressing challenges of our time—climate change. What was once a distant concern is now an undeniable r"
## [1] "Preprocessed: ladies and gentlemen distinguished guests today i stand before you to address one of the most pressing challenges of our timeclimate change what was once a distant concern is now an undeniable r"
Topic Similarity
The topic_similarity function calculates the similarity between two speeches based on their topics:
# Example usage with our speeches data
lda_similarity <- topic_similarity(speeches_data$text[1], speeches_data$text[2], method = "lda", num_topics = 5)
lsa_similarity <- topic_similarity(speeches_data$text[1], speeches_data$text[2], method = "lsa", num_topics = 5)
print(paste("LDA Similarity:", lda_similarity))
## [1] "LDA Similarity: 0.169419269706043"
## [1] "LSA Similarity: 1"
Note: The difference between LDA (Latent Dirichlet Allocation) topic similarity (0.1694) and LSA (Latent Semantic Analysis) topic similarity (1) can be attributed to several factors:
1. Different Algorithms
LDA and LSA use fundamentally different approaches for topic modeling and semantic analysis:
- LDA is a probabilistic model that assumes documents are mixtures of topics, and topics are mixtures of words. It aims to reverse-engineer the underlying topic structure that could have generated the observed documents.
- LSA, by contrast, relies on singular value decomposition (SVD) of the term-document matrix, reducing its dimensionality to uncover latent semantic structures.
2. Possible Reasons for LSA’s High Similarity Score
- Dimensionality: If too few topics (dimensions) were chosen for LSA, the semantic space might have been oversimplified, leading to an artificially high similarity score.
- Corpus Size: LSA can be sensitive to the size of the corpus. With only two documents, there may not be enough data for LSA to create a meaningful semantic space.
- Common Vocabulary: Both speeches discuss climate change, and the use of similar high-level vocabulary could lead LSA to treat them as highly similar, especially in a small corpus.
- Implementation Issue: There could be a problem with how cosine similarity was calculated or normalized in the LSA implementation; the sketch after this list offers one quick sanity check.
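One quick way to probe that last point is to bypass dimensionality reduction entirely and compute cosine similarity on raw term-frequency vectors. This is a hedged sketch, not part of the package: if the raw score falls well below 1, the perfect LSA score likely comes from the reduction step rather than from the texts themselves.

term_cosine <- function(a, b) {
  ta <- table(strsplit(tolower(a), "\\W+")[[1]])
  tb <- table(strsplit(tolower(b), "\\W+")[[1]])
  vocab <- union(names(ta), names(tb))
  va <- as.numeric(ta[vocab]); va[is.na(va)] <- 0  # align counts to the joint vocabulary
  vb <- as.numeric(tb[vocab]); vb[is.na(vb)] <- 0
  sum(va * vb) / (sqrt(sum(va^2)) * sqrt(sum(vb^2)))
}
term_cosine(speeches_data$text[1], speeches_data$text[2])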
Lexical Similarity
The lexical_similarity function calculates the similarity between two speeches based on their shared unique words:
# Example usage with our speeches data
lex_similarity <- lexical_similarity(speeches_data$text[1], speeches_data$text[2])
print(paste("Lexical Similarity:", lex_similarity))
## [1] "Lexical Similarity: 0.15180265654649"
Semantic Similarity
The semantic_similarity function calculates the semantic similarity between two speeches using different methods:
# Example usage with our speeches data
tfidf_similarity <- semantic_similarity(speeches_data$text[1], speeches_data$text[2], method = "tfidf")
word2vec_similarity <- semantic_similarity(speeches_data$text[1], speeches_data$text[2], method = "word2vec")
print(paste("TF-IDF Similarity:", tfidf_similarity))
## [1] "TF-IDF Similarity: 0.5"
## [1] "Word2Vec Similarity: 0.999170634893728"
# Note: For GloVe method, you need to provide a path to pre-trained GloVe vectors
# glove_similarity <- semantic_similarity(speeches_data$text[1], speeches_data$text[2], method = "glove", model_path = "path/to/glove/vectors.txt")
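For intuition about the "tfidf" method, here is a base-R sketch of a TF-IDF cosine computation. It is not the package's implementation, and the add-one IDF smoothing is an assumption made so that a two-document corpus does not zero out every shared term:

tfidf_cosine <- function(docs) {
  toks  <- lapply(docs, function(d) strsplit(tolower(d), "\\W+")[[1]])
  vocab <- unique(unlist(toks))
  tf    <- sapply(toks, function(t) as.numeric(table(factor(t, levels = vocab))))
  idf   <- log(length(docs) / rowSums(tf > 0)) + 1  # smoothed IDF
  w     <- tf * idf                                 # vocabulary-by-document weights
  sum(w[, 1] * w[, 2]) / (sqrt(sum(w[, 1]^2)) * sqrt(sum(w[, 2]^2)))
}
tfidf_cosine(speeches_data$text)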
Structural Similarity
The structural_similarity function calculates the similarity between two speeches based on their structure:
# Example usage with our speeches data
struct_similarity <- structural_similarity(strsplit(speeches_data$text[1], "\n")[[1]],
                                           strsplit(speeches_data$text[2], "\n")[[1]])
print(paste("Structural Similarity:", struct_similarity))
## [1] "Structural Similarity: 0.889420039965884"
Stylistic Similarity
The stylistic_similarity function calculates various stylistic features and their similarity between two speeches:
# Example usage with our speeches data
style_similarity <- stylistic_similarity(speeches_data$text[1], speeches_data$text[2])
print("Stylistic Similarity Results:")
## [1] "Stylistic Similarity Results:"
print(style_similarity)
## $text1_features
## ttr avg_sentence_length fk_grade
## 0.644186 23.888889 19.878760
##
## $text2_features
## ttr avg_sentence_length fk_grade
## 0.5490849 23.1153846 17.0446339
##
## $feature_differences
## ttr avg_sentence_length fk_grade
## 0.09510119 0.77350427 2.83412575
##
## $overall_similarity
## [1] 0.8924734
##
## $cosine_similarity
## [1] 0.9949162
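Of the features above, the type-token ratio (ttr) is simply unique words over total words; average sentence length and the Flesch-Kincaid grade follow equally standard definitions. A quick sketch of ttr (tokenization details may differ from the package's):

ttr <- function(text) {
  words <- strsplit(tolower(text), "\\W+")[[1]]
  words <- words[nzchar(words)]  # drop empty tokens
  length(unique(words)) / length(words)
}
ttr(speeches_data$text[1])  # should land near the ttr reported above (~0.644)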
Sentiment Similarity
The sentiment_similarity function calculates the sentiment similarity between two speeches:
# Example usage with our speeches data
sent_similarity <- sentiment_similarity(speeches_data$text[1], speeches_data$text[2])
print(paste("Sentiment Similarity:", sent_similarity))
## [1] "Sentiment Similarity: 0.952602694643716"