conversim is a powerful R package for analyzing similarities in conversations across multiple dimensions. Whether you're studying dialogue dynamics, speech patterns, or conversational alignment, it provides comprehensive tools for measuring topic, lexical, semantic, stylistic, sentiment, structural, and temporal similarities.

Installing and Loading the Package

Install the package:

install.packages("conversim")

Load the package along with required dependencies:

library(conversim)
library(ggplot2)
library(lme4)
library(topicmodels)
library(tm)
library(slam)

Understanding Conversation Data Structure

The conversim package works with conversation data that includes:

  • dyad_id: Identifier for conversation pairs
  • speaker: Identifier for the person speaking (A, B, etc.; some datasets use speaker_id)
  • text: Raw conversation text
  • processed_text: Cleaned and preprocessed text
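
For orientation, here is a minimal data frame in that shape (the values are purely illustrative):

conv_data <- data.frame(
  dyad_id = c(1, 1, 2, 2),
  speaker = c("A", "B", "A", "B"),
  text = c("Hello there!", "Hi, how are you?",
           "Nice weather today.", "Yes, it is."),
  processed_text = c("hello there", "hi how are you",
                     "nice weather today", "yes it is")
)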

Analyzing Single Dyad Conversations

Creating Sample Data

Let’s start with a simple conversation between two speakers:

set.seed(123)
conversation <- data.frame(
  speaker = rep(c("A", "B"), 10),
  processed_text = c(
    "Hello how are you today",
    "I'm doing well thanks for asking",
    "That's great to hear what are your plans",
    "I'm planning to go for a walk later",
    "Sounds nice the weather is beautiful",
    "Yes it's perfect for outdoor activities",
    "Do you often go for walks",
    "Yes I try to walk every day for exercise",
    "That's a great habit to have",
    "Thanks I find it helps me stay healthy",
    "Have you tried other forms of exercise",
    "I also enjoy swimming and yoga",
    "Those are excellent choices too",
    "What about you what exercise do you prefer",
    "I like running and playing tennis",
    "Tennis is fun do you play often",
    "I try to play at least once a week",
    "That's a good frequency to maintain",
    "Yes it keeps me active and social",
    "Social aspects of exercise are important too"
  )
)

Text Preprocessing

Before analysis, preprocess your conversation data:

preprocess_dyads <- function(conversations) {
  conversations$processed_text <- sapply(conversations$text, function(text) {
    text <- tolower(text)                    # normalize case
    text <- gsub("[[:punct:]]", "", text)    # strip punctuation
    text <- gsub("[[:digit:]]", "", text)    # strip numbers
    text <- gsub("\\s+", " ", trimws(text))  # collapse whitespace
    return(text)
  }, USE.NAMES = FALSE)
  
  # Remove rows whose processed text ended up empty
  conversations <- conversations[nchar(conversations$processed_text) > 0, ]
  return(conversations)
}
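
For example, applied to a small raw data frame it lowercases the text, strips punctuation and digits, and normalizes whitespace:

raw_convs <- data.frame(
  speaker = c("A", "B"),
  text = c("Hello, how are you?", "I'm fine, thanks for asking!")
)
raw_convs <- preprocess_dyads(raw_convs)
raw_convs$processed_text
#> [1] "hello how are you"         "im fine thanks for asking"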

Lexical Similarity Sequence

Track how lexical similarity evolves throughout the conversation:

lexical_sim <- lex_sim_seq(conversation, window_size = 3)

# Visualize lexical similarity over time
ggplot(data.frame(Exchange = 1:length(lexical_sim$sequence), 
                  Similarity = lexical_sim$sequence), 
       aes(x = Exchange, y = Similarity)) +
  geom_line() +
  geom_point() +
  labs(title = "Lexical Similarity Sequence", 
       x = "Conversation Exchange", 
       y = "Similarity Score") +
  theme_minimal()


cat("Average lexical similarity:", round(lexical_sim$average, 3))
#> Average lexical similarity: 0.557
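
Under the hood, a windowed lexical measure compares the vocabulary of neighbouring stretches of conversation. As a rough illustration (not conversim's exact formula), here is a Jaccard-overlap version over sliding windows of three turns:

# Illustration only: Jaccard overlap between word sets of adjacent windows
jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))

turns <- strsplit(conversation$processed_text, " ")
win <- 3
sims <- sapply(seq_len(length(turns) - win), function(i) {
  words1 <- unique(unlist(turns[i:(i + win - 1)]))    # window starting at turn i
  words2 <- unique(unlist(turns[(i + 1):(i + win)]))  # next, overlapping window
  jaccard(words1, words2)
})
round(mean(sims), 3)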

Topic Similarity Analysis

Analyze topic coherence using Latent Dirichlet Allocation (LDA):

topic_sim <- topic_sim_seq(conversation, method = "lda", num_topics = 2, window_size = 3)

# Plot topic similarity across conversation segments
ggplot(data.frame(Segment = seq_along(topic_sim$sequence), 
                  Similarity = topic_sim$sequence), 
       aes(x = Segment, y = Similarity)) +
  geom_line() +
  geom_point() +
  labs(title = "Topic Similarity Sequence", 
       x = "Conversation Segment", 
       y = "Similarity Score") +
  theme_minimal()
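
For intuition, the LDA route amounts to fitting a topic model and comparing segment-level topic distributions. A rough sketch of the idea with topicmodels and tm (conversim's internal segmentation and smoothing may differ):

# Fit a 2-topic LDA to the two halves of the conversation and compare
# their topic distributions with cosine similarity
segments <- c(paste(conversation$processed_text[1:10], collapse = " "),
              paste(conversation$processed_text[11:20], collapse = " "))
dtm <- DocumentTermMatrix(Corpus(VectorSource(segments)))
lda_fit <- LDA(dtm, k = 2, control = list(seed = 123))
theta <- posterior(lda_fit)$topics   # segment-by-topic probability matrix
sum(theta[1, ] * theta[2, ]) /
  (sqrt(sum(theta[1, ]^2)) * sqrt(sum(theta[2, ]^2)))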

Semantic Similarity

Measure semantic similarity using TF-IDF vectors:

semantic_sim <- sem_sim_seq(conversation, method = "tfidf", window_size = 3)

ggplot(data.frame(Exchange = 1:length(semantic_sim$sequence), 
                  Similarity = semantic_sim$sequence), 
       aes(x = Exchange, y = Similarity)) +
  geom_line() +
  geom_point() +
  labs(title = "Semantic Similarity Sequence", 
       x = "Conversation Exchange", 
       y = "Similarity Score") +
  theme_minimal()
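
Conceptually, the TF-IDF method embeds each utterance as a weighted term vector and scores similarity as the cosine of the angle between vectors. A minimal sketch using tm (the package's own weighting choices may differ):

# Build TF-IDF vectors over all turns, then compare two turns that share vocabulary
dtm <- DocumentTermMatrix(Corpus(VectorSource(conversation$processed_text)),
                          control = list(weighting = weightTfIdf))
m <- as.matrix(dtm)
cos_sim <- function(x, y) sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
cos_sim(m[3, ], m[9, ])   # turns 3 and 9 share words such as "great"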

Stylistic and Sentiment Analysis

Examine writing style and emotional tone:

# Stylistic similarity
stylistic_sim <- style_sim_seq(conversation, window_size = 3)

# Sentiment similarity
sentiment_sim <- sent_sim_seq(conversation, window_size = 3)

# Create comparison plot
style_data <- data.frame(
  Exchange = 1:length(stylistic_sim$sequence),
  Stylistic = stylistic_sim$sequence,
  Sentiment = sentiment_sim$sequence
)

style_long <- reshape2::melt(style_data, id.vars = "Exchange")

ggplot(style_long, aes(x = Exchange, y = value, color = variable)) +
  geom_line() +
  geom_point() +
  labs(title = "Stylistic vs Sentiment Similarity", 
       x = "Conversation Exchange", 
       y = "Similarity Score",
       color = "Measure") +
  theme_minimal()

Analyzing Multiple Dyads

Loading Multi-Dyad Data

Load the example dataset for multiple conversation pairs:

data_path <- system.file("extdata", "dyad_example_data.Rdata", package = "conversim")
load(data_path)

# Display data structure
head(dyad_example_data)
# A tibble: 6 × 3
  dyad_id speaker_id text                              
    <dbl> <chr>      <chr>                             
1       1 A          What did you think of the new mov…
2       1 B          I haven’t seen it yet. Which one …
3       1 A          The latest superhero film. I hear…
4       1 B          Oh, that one! I’ve been meaning t…
5       1 A          Yes, I thought it was fantastic. …
6       1 B          Really? What about the storyline?…
str(dyad_example_data)
tibble [532 × 3] (S3: tbl_df/tbl/data.frame)
 $ dyad_id   : num [1:532] 1 1 1 1 1 1 1 1 1 1 ...
 $ speaker_id: chr [1:532] "A" "B" "A" "B" ...
 $ text      : chr [1:532] "What did you think of the new movie that just came out?" "I haven’t seen it yet. Which one are you referring to?" "The latest superhero film. I heard it’s getting great reviews." "Oh, that one! I’ve been meaning to watch it. Did you enjoy it?" ...

Preprocessing Multiple Dyads

processed_convs <- preprocess_dyads(dyad_example_data)

Comprehensive Similarity Analysis

Calculate all similarity measures across multiple dyads:

# Topic similarity using multilevel modeling
topic_sim <- topic_sim_dyads(processed_convs, method = "lda", num_topics = 5, window_size = 3)

# Lexical similarity
lexical_sim <- lexical_sim_dyads(processed_convs, window_size = 3)

# Semantic similarity
semantic_sim <- semantic_sim_dyads(processed_convs, method = "tfidf", window_size = 3)

# Structural similarity
structural_sim <- structural_sim_dyads(processed_convs)

# Stylistic similarity
stylistic_sim <- stylistic_sim_dyads(processed_convs, window_size = 3)

# Sentiment similarity
sentiment_sim <- sentiment_sim_dyads(processed_convs, window_size = 3)

# Participant balance
participant_sim <- participant_sim_dyads(processed_convs)

# Timing patterns
timing_sim <- timing_sim_dyads(processed_convs)

Visualizing Multi-Dyad Results

Create comprehensive comparison plots:

# Prepare comparison data
comparison_df <- data.frame(
  dyad = names(topic_sim$similarities_by_dyad),
  topic = sapply(topic_sim$similarities_by_dyad, mean),
  lexical = sapply(lexical_sim$similarities_by_dyad, mean),
  semantic = sapply(semantic_sim$similarities_by_dyad, mean),
  structural = unlist(structural_sim$similarities_by_dyad),
  stylistic = sapply(stylistic_sim$similarities_by_dyad, mean),
  sentiment = sapply(sentiment_sim$similarities_by_dyad, mean),
  participant = unlist(participant_sim$similarities_by_dyad),
  timing = unlist(timing_sim$similarities_by_dyad)
)

# Reshape for plotting
comparison_long <- reshape2::melt(comparison_df, id.vars = "dyad")

# Create boxplot comparison
ggplot(comparison_long, aes(x = variable, y = value, fill = variable)) +
  geom_boxplot() +
  labs(title = "Comparison of Similarity Measures Across Dyads",
       x = "Similarity Measure",
       y = "Similarity Score") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "none")

Topic Similarity Trajectory

Visualize how topic similarity changes within each dyad:

# Prepare data for topic similarity trajectories
topic_sim_df <- data.frame(
  dyad = rep(names(topic_sim$similarities_by_dyad), 
             sapply(topic_sim$similarities_by_dyad, length)),
  similarity = unlist(topic_sim$similarities_by_dyad),
  index = unlist(lapply(topic_sim$similarities_by_dyad, seq_along))
)

# Create trajectory plot
ggplot(topic_sim_df, aes(x = index, y = similarity, color = dyad)) +
  geom_line() +
  geom_point() +
  facet_wrap(~dyad, ncol = 2) +
  labs(title = "Topic Similarity Trajectories Across Dyads",
       x = "Conversation Sequence",
       y = "Similarity Score") +
  theme_minimal() +
  theme(legend.position = "none")

Analyzing Long Speeches

Loading Speech Data

For comparing longer texts like speeches:

data_path <- system.file("extdata", "speeches_data.Rdata", package = "conversim")
load(data_path)

# Preview the speech data
print(summary(speeches_data))
speaker_id            text          
 Length:2           Length:2          
 Class :character   Class :character  
 Mode  :character   Mode  :character

Advanced Similarity Measures

Topic Similarity with Multiple Methods

# LDA-based topic similarity
lda_similarity <- topic_similarity(speeches_data$text[1], speeches_data$text[2], 
                                   method = "lda", num_topics = 5)

# LSA-based topic similarity  
lsa_similarity <- topic_similarity(speeches_data$text[1], speeches_data$text[2], 
                                   method = "lsa", num_topics = 5)

print(paste("LDA Similarity:", round(lda_similarity, 3)))
[1] "LDA Similarity: 0.169"
print(paste("LSA Similarity:", round(lsa_similarity, 3)))
[1] "LSA Similarity: 1"

Comprehensive Semantic Analysis

# TF-IDF semantic similarity
tfidf_similarity <- semantic_similarity(speeches_data$text[1], speeches_data$text[2], 
                                         method = "tfidf")

# Word2Vec semantic similarity
word2vec_similarity <- semantic_similarity(speeches_data$text[1], speeches_data$text[2], 
                                            method = "word2vec", dim = 100, iter = 5)

print(paste("TF-IDF Similarity:", round(tfidf_similarity, 3)))
[1] "TF-IDF Similarity: 0.5"
print(paste("Word2Vec Similarity:", round(word2vec_similarity, 3)))
[1] "Word2Vec Similarity: 0.999"

Detailed Stylistic Analysis

style_analysis <- stylistic_similarity(speeches_data$text[1], speeches_data$text[2])

print("Stylistic Features Comparison:")
[1] "Stylistic Features Comparison:"
print(paste("Text 1 TTR:", round(style_analysis$text1_features["ttr"], 3)))
[1] "Text 1 TTR: 0.644"
print(paste("Text 2 TTR:", round(style_analysis$text2_features["ttr"], 3)))
[1] "Text 2 TTR: 0.549"
print(paste("Overall Stylistic Similarity:", round(style_analysis$overall_similarity, 3)))
[1] "Overall Stylistic Similarity: 0.892"

Key Similarity Measures Explained

  • Topic: measures thematic coherence using LDA or LSA. Best for analyzing content alignment and subject matter similarity.
  • Lexical: calculates shared vocabulary between speakers. Best for studying word choice and linguistic alignment.
  • Semantic: measures meaning similarity using vector representations. Best for understanding conceptual similarity beyond word overlap.
  • Structural: analyzes conversation format and turn-taking patterns. Best for examining dialogue structure and interaction patterns.
  • Stylistic: compares writing style features (TTR, sentence length, readability). Best for studying linguistic style and complexity.
  • Sentiment: measures emotional tone alignment. Best for analyzing emotional synchrony in conversations.
  • Participant: evaluates speaker balance and participation equality. Best for understanding conversation dynamics.
  • Timing: analyzes temporal patterns in conversation flow. Best for studying rhythm and pacing in dialogue.

Multilevel Modeling Benefits

The package uses multilevel modeling (see the sketch after this list) to:

  • Account for dyad-level variation in similarity patterns
  • Provide more accurate overall averages across conversations
  • Handle nested data structures common in conversation analysis
  • Enable statistical inference about conversation patterns
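
As a concrete sketch, the per-dyad topic similarity trajectories assembled above (topic_sim_df) can be pooled with a random intercept per dyad via lme4; this mirrors the idea, though not necessarily conversim's exact internal model:

# Random-intercept model: similarity observations nested within dyads
ml_fit <- lmer(similarity ~ 1 + (1 | dyad), data = topic_sim_df)

fixef(ml_fit)    # pooled overall similarity (fixed intercept)
VarCorr(ml_fit)  # between-dyad variation in similarity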

Best Practices

Window Size Selection

  • Small windows (2-3): Capture immediate conversational alignment
  • Medium windows (4-6): Balance local and global patterns
  • Large windows (7+): Focus on broader conversational themes
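
To see the trade-off concretely, rerun a measure at several window sizes and compare the resulting averages:

# Average lexical similarity for the single-dyad example at three window sizes
sapply(c(2, 4, 7), function(w) lex_sim_seq(conversation, window_size = w)$average)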

Method Selection

  • TF-IDF: Fast, interpretable, good for lexical analysis
  • Word2Vec: Captures semantic relationships, better for meaning
  • LDA: Excellent for topic modeling with multiple themes
  • LSA: Good for discovering latent semantic structures

Data Preprocessing

Always preprocess your text data by:

  • Converting to lowercase
  • Removing punctuation and numbers
  • Normalizing whitespace
  • Filtering empty entries

Conclusion

The conversim package provides a comprehensive toolkit for analyzing conversation similarities across multiple dimensions. Whether studying therapeutic dialogues, educational interactions, or social conversations, these tools enable researchers to quantify and visualize the complex dynamics of human communication.

The package’s strength lies in its ability to capture different aspects of conversational similarity simultaneously, providing a holistic view of dialogue dynamics that can inform research in linguistics, psychology, communication studies, and computational social science.


📌 For more advanced features and updates, visit the conversim GitHub repository.