Analyzing WhatsApp Conversations with Python
I started using WhatsApp around five years ago and, like most people, I use it daily to connect with my family, friends and co-workers. One day I had the idea to analyze the conversations between my partner and me.
I was curious to find out the following:
- How often do we exchange messages?
- Who has sent the most messages?
- What month do we talk the most? Which month do we talk the least?
- What words do we use the most?
- Can I visualize our word frequency?
Starting the Journey - Gathering the data
The first step was to gather the data required for my analysis. In any WhatsApp conversation, you can download the chat with the Export Chat feature.
![WhatsApp Screenshot](/_next/image?url=%2F_next%2Fstatic%2Fmedia%2FImage_from_iOS.5c02c568.png&w=3840&q=75)
The chat is exported as a txt file. When I first opened the text file, I was shocked by the amount of information.
My partner and I talk...a lot.
Initial Observations
In the text file, every message sits on its own line, with a line break as the delimiter. Here is an example of one line of text from a conversation.
[2019-11-09, 6:03:41 PM] Gonzalo Vazquez: Ça va bien !
Every line starts with a timestamp in brackets, followed by the author and finally the message.
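The line layout described above can be captured with a regular expression. The following is a minimal sketch, not the approach used later in this post: the names `LINE_PATTERN` and `parse_line` are hypothetical, and the timestamp format is assumed from the single example line, so other phone locales may export a different format.

```python
import re
from datetime import datetime

# Assumed layout: "[date, time] author: message", optionally preceded by
# the invisible left-to-right mark (\u200e) that some exports contain
LINE_PATTERN = re.compile(
    r"^\u200e?\[(?P<date>[^\]]+)\] (?P<author>[^:]+): (?P<message>.*)$"
)

def parse_line(line):
    """Return a dict with date, author and message, or None if the line does not match."""
    match = LINE_PATTERN.match(line.strip())
    if match is None:
        return None
    return {
        # Timestamp format assumed from the example line in this post
        "date": datetime.strptime(match["date"], "%Y-%m-%d, %I:%M:%S %p"),
        "author": match["author"],
        "message": match["message"],
    }

parsed = parse_line("[2019-11-09, 6:03:41 PM] Gonzalo Vazquez: Ça va bien !")
print(parsed["author"], "|", parsed["message"])
```

Returning `None` for lines that do not match makes it easy to spot lines that need special handling, such as the continuation of a multi-line message.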
First, we load the data into a dataframe using a tab as the delimiter. The exported lines are not tab-separated, so each line should land whole in a single column.
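import os
from os import path
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import datetime
# Needed later for the word cloud
from PIL import Image
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator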
# Import Dataset
original_df = pd.read_csv("data.txt", delimiter="\t")
Exploring the data gives us an interesting view. We have 33,140 observations and one feature.
# Describe Dataframe
original_df.describe()
![](/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Fdataframe-01.a27157f6.png&w=3840&q=75)
The next step is to clean the data and create a new dataframe with three columns (date, author and message).
# Define empty dataframe
df = pd.DataFrame(columns=["date", "author", "message"])
# Split one row into a dict with date, author and message
def split_message(index, message):
    # The row holds a single value: the raw line of text
    raw_line = list(message)[0]
    # Split only on the first "]" and on the first ":" after it, so that
    # brackets or colons inside the message body are preserved
    date_part, rest = raw_line.split("]", 1)
    author, text = rest.split(":", 1)
    return {
        "date": date_part.replace("[", "").replace("\u200e", ""),
        "author": author,
        "message": text,
    }
# Loop through each row in the dataframe, parsing it once per row
for index, row in original_df.iterrows():
    parsed = split_message(index, row)
    df.at[index, "date"] = pd.to_datetime(parsed["date"])
    df.at[index, "author"] = parsed["author"].strip()
    df.at[index, "message"] = parsed["message"].strip()
# Print DataFrame
df
![](/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Fdataframe-02.ed65bffd.png&w=3840&q=75)
Now our data has three features and 33,140 observations. The next step is to replace the current index with the date column.
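# Set df['date'] as the index and delete the column
# pd.to_datetime makes sure the index is a DatetimeIndex, which date-based filtering needs
df.index = pd.to_datetime(df['date'])
del df['date']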
df
Now that we have the date as an index, we can start to filter our dataframe.
# We can now search by year - return 2020
# (recent pandas versions require .loc for selecting rows by date string)
df.loc["2020"]
![](/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Fdataframe-03.5a22d24d.png&w=1920&q=75)
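Filtering by a date string works at other granularities too. The sketch below uses a small made-up dataframe (`sample` and its three messages are hypothetical stand-ins for the real chat) to show the same lookup by year and by month.

```python
import pandas as pd

# Hypothetical three-message sample standing in for the real chat
sample = pd.DataFrame(
    {
        "author": ["Partner A", "Partner B", "Partner A"],
        "message": ["Hello", "Hi!", "Happy new year"],
    },
    index=pd.to_datetime(
        ["2019-11-09 18:03:41", "2019-12-24 09:15:00", "2020-01-01 00:00:05"]
    ),
)

# A partial date string selects every row that falls inside that period
messages_2020 = sample.loc["2020"]
messages_dec_2019 = sample.loc["2019-12"]
print(messages_2020)
print(messages_dec_2019)
```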
Analyzing our conversation
Now that our dataframe is indexed by date and holds the author and message of every line, let's start our analysis. First, let's count the messages sent by each author:
# Print the number of messages sent by each author
df["author"].value_counts()
![](/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Fdataframe-04.bb49eec5.png&w=1080&q=75)
Next, let's compare the messages being sent between 2019 and 2020.
# Number of messages sent in 2019
df.loc["2019", "message"].count()
# Output
>> 28241
# Number of messages sent in 2020
df.loc["2020", "message"].count()
# Output
>> 4899
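The two yearly counts above can also be computed in one pass by grouping on the year of the index. This is a sketch on a hypothetical sample dataframe, not the code that produced the numbers above.

```python
import pandas as pd

# Hypothetical sample standing in for the real chat
sample = pd.DataFrame(
    {
        "author": ["Partner A", "Partner B", "Partner A"],
        "message": ["Hello", "Hi!", "Happy new year"],
    },
    index=pd.to_datetime(
        ["2019-11-09 18:03:41", "2019-12-24 09:15:00", "2020-01-01 00:00:05"]
    ),
)

# Group rows by the year of their timestamp and count the messages in each group
per_year = sample.groupby(sample.index.year)["message"].count()
print(per_year)
```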
Great! Finally, I want to see the frequency of messages sent month by month. To accomplish this, we create a new column that holds a one or a two depending on who the author is.
# Create a new column of 1 or 2 based on author - Partner A (1) and Partner B (2)
def label_author(row):
    if row["author"] == "Partner A":
        return 1
    elif row["author"] == "Partner B":
        return 2
df['author_label'] = df.apply(label_author, axis=1)
Now that we have a new column with ones and twos, we can sum them per month and plot the result in a time-series graph.
![](/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Fgraph.37484cf2.png&w=1920&q=75)
Voilà! Incredible, we talked the most between September and October.
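The plotting code is not shown in this post, so here is one possible way to build a month-by-month view. Note that it swaps in a different technique: instead of summing the 1/2 author labels, it counts the rows per author in each month. The `sample` dataframe and its contents are hypothetical, and a bar chart is used in place of the line graph for simplicity.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical sample standing in for the real chat
sample = pd.DataFrame(
    {
        "author": ["Partner A", "Partner B", "Partner A", "Partner B", "Partner A"],
        "message": ["Hello", "Hi!", "Lunch?", "Sure", "Happy new year"],
    },
    index=pd.to_datetime(
        ["2019-09-01 08:00", "2019-09-15 12:30", "2019-09-20 19:45",
         "2019-10-03 07:10", "2020-01-01 00:00"]
    ),
)

# Count messages per author in each calendar month
monthly = (
    sample.assign(month=sample.index.to_period("M"))
    .groupby(["month", "author"])
    .size()
    .unstack(fill_value=0)
)

# One group of bars per month, one bar per author
ax = monthly.plot(kind="bar")
ax.set_xlabel("Month")
ax.set_ylabel("Messages sent")
print(monthly)
```

Counting rows keeps the two authors on the same scale, whereas a sum of ones and twos weighs Partner B's messages twice as heavily.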
A graphical representation
Finally, we have two questions left to answer:
- What words do we use the most?
- Can I visualize our word frequency?
To answer the questions above, I turned to the ever-popular word cloud. A word cloud is a visual way to represent text. Usually, the frequency of each word is represented by its font size and colour. After some research I found a Python library, https://pypi.org/project/wordcloud/, that takes text as input and produces a word cloud.
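# Loop through each row in Dataframe
# Extract messages and copy to text file
messages = []
for index, row in original_df.iterrows():
    # split_message returns a dict, so keep only the message text
    messages.append(split_message(index, row)["message"].strip())
# Join with newlines so the last word of one message does not merge with the next
with open("messages.txt", "w") as f:
    f.write("\n".join(messages))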
# Read file created
# Get data directory (using getcwd() is needed to support running example in generated IPython notebook)
d = path.dirname(__file__) if "__file__" in locals() else os.getcwd()
# Read the whole text.
text = open(path.join(d, 'messages.txt')).read()
# Create wordcloud using Alice in Wonderland as a mask
# read the mask / color image taken from
# http://jirkavinse.deviantart.com/art/quot-Real-Life-quot-Alice-282261010
alice_coloring = np.array(Image.open(path.join(d, "alice_color.png")))
# Load stop words file
with open("stop-words.txt", "r") as text_stop:
    lines = text_stop.readlines()
stopwords = set(STOPWORDS)
# Set each line as a stopword
for line in lines:
    stopwords.add(line.strip())
wc = WordCloud(background_color="white", max_words=2000, mask=alice_coloring,
               stopwords=stopwords, max_font_size=40,
               random_state=42, min_word_length=4)
# Generate word cloud
wc.generate(text)
# Create coloring from image
image_colors = ImageColorGenerator(alice_coloring)
# Show
fig, axes = plt.subplots(1, 3)
axes[0].imshow(wc, interpolation="bilinear")
# Recolor wordcloud and show
# we could also give color_func=image_colors directly in the constructor
axes[1].imshow(wc.recolor(color_func=image_colors), interpolation="bilinear")
axes[2].imshow(alice_coloring, cmap=plt.cm.gray, interpolation="bilinear")
for ax in axes:
    ax.set_axis_off()
plt.show()
# Store to file
wc.to_file(path.join(d, "alice-colour.png"))
Finally, the result is below. Using Alice in Wonderland as a mask, we create a word cloud of the most used words in the chat conversation.
![](/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Falice.c9a6c7f0.png&w=1080&q=75)
An important tip is to provide a stop-words text file to the generator. The stop-words file is a newline-delimited text file of common filler words, such as am and or, that the word cloud generator will ignore.
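To answer "What words do we use the most?" with numbers rather than a picture, the same stop-word idea can be applied to a plain word count. This is a rough sketch: `top_words`, the sample text and the tiny stop-word set are all hypothetical, and the tokenization is simpler than what the wordcloud library does internally. The minimum word length of four mirrors the `min_word_length=4` setting used above.

```python
import re
from collections import Counter

def top_words(text, stopwords, n=10, min_length=4):
    """Return the n most frequent words, skipping stop words and short words."""
    # Lowercase the text and keep runs of letters only (accented letters included)
    words = re.findall(r"[^\W\d_]+", text.lower())
    kept = [w for w in words if w not in stopwords and len(w) >= min_length]
    return Counter(kept).most_common(n)

# Hypothetical stop words and text standing in for stop-words.txt and messages.txt
sample_stopwords = {"am", "or", "and", "the"}
sample_text = "Coffee or tea? I am making coffee and the pasta. Coffee first, pasta later."

most_used = top_words(sample_text, sample_stopwords, n=2)
print(most_used)
```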
I hope you have enjoyed my blog post and I encourage you to follow me on my learning journey as I share my experiences with you.