YouTube Activity Analysis
Like many others, I spend a nontrivial amount of time on YouTube. While I have a general idea about my watch and search activity on the platform, I was curious about the details of my YouTube consumption behavior. Additionally, I’ve always felt that my video preferences are pretty niche, so I wanted to see just how different my YouTube preferences are from what’s trending.
I used my personal YouTube history provided by Google Takeout and the YouTube API to collect metadata about the videos I watched. I then used data on daily YouTube Trending videos in the US collected from this Kaggle dataset. For each video, there is information about the video title, channel title, publish time, tags, views, likes and dislikes, description, and comment count.
From Google Takeout, I was only able to collect personal history from December 12, 2020 to May 2, 2021. For consistency between months, I filtered out the partial December and May data, leaving 4 full months of data from January to April 2021. Similarly, I filtered the Trending data to just these 4 months as well.
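The date filter described above can be sketched roughly as follows. The original project was done in R; this is a minimal Python sketch, and the record layout (`title`, `time`), the sample entries, and the helper name `in_study_window` are all hypothetical stand-ins rather than the actual Takeout schema.

```python
from datetime import datetime

# Hypothetical watch records; a real export would have one entry per video watched.
watch_history = [
    {"title": "video a", "time": "2020-12-15T23:10:00"},
    {"title": "video b", "time": "2021-01-03T01:45:00"},
    {"title": "video c", "time": "2021-04-30T22:05:00"},
    {"title": "video d", "time": "2021-05-01T00:30:00"},
]

START = datetime(2021, 1, 1)  # first moment kept (inclusive)
END = datetime(2021, 5, 1)    # first moment dropped (exclusive)


def in_study_window(record):
    """Return True if the record falls in January through April 2021."""
    watched_at = datetime.fromisoformat(record["time"])
    return START <= watched_at < END


# Keep only the four full months; partial December and May entries are dropped.
full_months = [r for r in watch_history if in_study_window(r)]
print([r["title"] for r in full_months])
```

Using a half-open window (inclusive start, exclusive end) avoids having to reason about the last second of April 30.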
First, I took a look at my watch frequency and distribution during the day.
The line plot below graphs a count of videos watched by day. Rather than being consistent, my watch count goes up and down quite a bit, most likely tracking midterm cycles and how busy I am. However, daily video counts are only a rough indicator of my watch behavior: video lengths differ, so the amount of time I actually spend on YouTube each day may follow a different pattern.
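For readers who want to reproduce a daily count like this, here is a small Python sketch (the original analysis was in R). The timestamps are made up, and `daily_counts` is a hypothetical helper; it fills in days with no videos as zero so that gaps show up in a line plot instead of being skipped.

```python
from collections import Counter
from datetime import date, datetime, timedelta

# Made-up watch timestamps for illustration.
timestamps = [
    "2021-01-01T00:15:00",
    "2021-01-01T02:40:00",
    "2021-01-03T23:05:00",
]


def daily_counts(stamps):
    """Count videos per calendar day, including zero-count days in between."""
    counts = Counter(datetime.fromisoformat(s).date() for s in stamps)
    first, last = min(counts), max(counts)
    n_days = (last - first).days + 1
    all_days = [first + timedelta(days=offset) for offset in range(n_days)]
    return {day: counts.get(day, 0) for day in all_days}


for day, n in daily_counts(timestamps).items():
    print(day.isoformat(), n)
```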
Next, I plotted video counts by hour of the day to see whether there are certain times of the day I’m more inclined to be on YouTube.
From this clock plot, I see that the wee hours of the morning between 12AM and 4AM are my busiest YouTube-browsing time. Turns out the reason I sleep so late isn’t exactly because I’m doing homework…
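The hourly tally behind a plot like this could look something like the Python sketch below (again, the original was done in R). It assumes the timestamps are stored in UTC and need shifting to local time before extracting the hour; the fixed UTC-5 offset is a simplification that ignores daylight saving time, and the sample timestamps and `hourly_counts` helper are hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

# Assumption: a fixed UTC-5 offset stands in for local time (ignores DST).
LOCAL_TZ = timezone(timedelta(hours=-5))

# Made-up UTC timestamps for illustration.
utc_stamps = [
    "2021-01-03T06:45:00+00:00",
    "2021-01-03T07:10:00+00:00",
    "2021-01-03T20:00:00+00:00",
]


def hourly_counts(stamps, tz=LOCAL_TZ):
    """Return a 24-element list: videos watched in each local hour of the day."""
    hours = Counter(datetime.fromisoformat(s).astimezone(tz).hour for s in stamps)
    return [hours.get(h, 0) for h in range(24)]


counts = hourly_counts(utc_stamps)
for hour, n in enumerate(counts):
    if n:
        print(f"{hour:02d}:00 -> {n}")
```

Skipping the time zone conversion would shift every bar on the clock by the UTC offset, which is easy to miss when the result still looks plausible.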
Then, I looked at my searches to see what videos I’ve been actively seeking out and built a wordcloud to visualize this.
An important note here is that since I tend to watch quite a few non-English videos, several of my search terms contained non-ASCII characters. For this visualization, I decided to filter these out and look only at my most searched English terms.
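One simple way to do this kind of filtering, sketched in Python rather than the R used for the project, is to keep only words made entirely of ASCII letters and then count them. The sample queries and the `english_term_counts` helper are hypothetical, and this word-level rule is an assumption about how the filtering could work, not necessarily how it was done here.

```python
from collections import Counter

# Made-up search queries mixing English and non-English terms.
searches = [
    "guitar tutorial easy",
    "吉他 教程",
    "lovely guitar cover",
    "egg tomato 요리",
]


def english_term_counts(queries):
    """Count lowercase words that consist only of ASCII letters."""
    counts = Counter()
    for query in queries:
        for word in query.lower().split():
            # isascii() drops Chinese/Korean terms; isalpha() drops digits and punctuation.
            if word.isascii() and word.isalpha():
                counts[word] += 1
    return counts


term_counts = english_term_counts(searches)
print(term_counts.most_common(3))
```

The resulting counts are what a word cloud would size its words by. Note that this also discards words containing apostrophes or digits, which may or may not be desirable.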
Not unexpectedly, since I recently started learning guitar, my most searched words relate to guitar or songs (e.g. “cover”, “lyric”, “tutorial”). You also see quite a few random foods (e.g. “egg”, “tomato”, “orange”, “cake”, “shrimp”). When I’m watching YouTube at 3AM, I’m probably getting hungry watching cooking/mukbang videos…
At the same time, I’m also quite pleasantly surprised to see “workout”, which suggests I’ve actually been making some effort towards one of this year’s resolutions to be more active!
I also took a look at which videos I’ve rewatched the most this past semester.
You can see I’ve really been into guitar recently: all of my top 3 videos are songs I learned or have been learning. Sadly, it’s been 14 full repeats of “isn’t she lovely”, and I still haven’t finished learning it 😔
Turning to the Trending data, visualizing the trending videos by category shows that Entertainment, Gaming, and News & Politics videos make it into the trending section most often. We also see that the share of the pie each category takes up stays relatively constant.
Then, I plotted the category distributions over time of just the trending videos I watched. Out of over 500 trending videos each day, I usually watch fewer than 15, and my preferences are largely concentrated in the Music and Entertainment categories.
I also plotted a correlation matrix to see the relationships between video impact metrics.
All variables are positively correlated. This makes sense because likes, dislikes, comments, and views generally all measure reach or engagement. The more views, the more opportunities for engagement with the video. Notably, more dislikes also correlate with more likes, so raw counts mostly reflect a video’s size and should be scaled (for example, by views) before comparing between videos.
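To make the scaling point concrete, here is a small Python sketch (the original analysis was in R) that computes a Pearson correlation by hand and then converts raw counts into per-view rates. The numbers are invented for illustration, and dividing by views is just one possible scaling, not necessarily the one that fits this dataset best.

```python
from math import sqrt

# Invented metrics for four hypothetical videos.
views = [1_000, 10_000, 100_000, 1_000_000]
likes = [50, 400, 6_000, 30_000]
dislikes = [5, 100, 500, 9_000]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Raw counts: bigger videos have more of everything, so likes and dislikes move together.
raw_r = pearson(likes, dislikes)

# Scaled by views: likes and dislikes per view are comparable across video sizes.
like_rate = [l / v for l, v in zip(likes, views)]
dislike_rate = [d / v for d, v in zip(dislikes, views)]
scaled_r = pearson(like_rate, dislike_rate)

print(f"raw likes vs dislikes: {raw_r:.2f}")
print(f"per-view likes vs dislikes: {scaled_r:.2f}")
```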
Sentiment from Trending Videos
YouTube trending videos typically reach millions of views, so the sentiment expressed in a video likely reflects, or at least affects, viewers to some extent. Thus, I analyzed the sentiment of the descriptions of the trending videos from these 4 months to gather insights on overall public emotions. I used the AFINN lexicon, which scores words from -5 (most negative) to 5 (most positive), to score the sentiment of the text, and I averaged the sentiment of all the videos for each day.
The plot above graphs the overall averaged sentiment as seen from trending videos on YouTube. First, we note that the daily averages are all positive but close to neutral, with only one daily average reaching above a score of 1. There are certainly fluctuations in sentiment, with sentiment in late March dropping close to 0. A look at the trending videos from that period brings up videos like “Authorities use pepper balls on South Beach crowds after new curfew passes” and “Witness to King Soopers shooting describes scene”.
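The lexicon-based scoring and daily averaging described in this section can be sketched in Python as below (the project itself used R). This is only an illustration under stated assumptions: `TOY_LEXICON` holds a handful of made-up entries in the AFINN range rather than the real AFINN values, each description is scored as the mean of its matched words, and descriptions with no matched words score 0. The original analysis may have aggregated differently.

```python
from collections import defaultdict

# Illustrative word scores only; NOT the real AFINN lexicon values.
TOY_LEXICON = {"love": 3, "great": 3, "fun": 2, "sad": -2, "terrible": -3}


def description_score(text, lexicon=TOY_LEXICON):
    """Mean lexicon score of matched words; 0.0 if no word matches (assumption)."""
    words = [w.strip(".,!?\"'()").lower() for w in text.split()]
    hits = [lexicon[w] for w in words if w in lexicon]
    return sum(hits) / len(hits) if hits else 0.0


def daily_average(videos):
    """Average the per-description scores for each trending date."""
    by_day = defaultdict(list)
    for day, description in videos:
        by_day[day].append(description_score(description))
    return {day: sum(scores) / len(scores) for day, scores in sorted(by_day.items())}


# Hypothetical (trending date, description) pairs.
videos = [
    ("2021-03-01", "A great and fun day out!"),
    ("2021-03-01", "Such a sad ending."),
    ("2021-03-02", "Unboxing the new phone"),
]

for day, avg in daily_average(videos).items():
    print(day, round(avg, 2))
```

Because many descriptions contain few or no lexicon words, daily averages from an approach like this tend to sit near zero, which is worth keeping in mind when reading small fluctuations.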
Conclusion and Next Steps
Overall, this analysis gave me a more detailed look at my own YouTube activity as well as the makeup of the trending YouTube videos. Some limitations and challenges of my analysis include:
- Limited data: I was hoping to analyze my entire search and watch history on YouTube starting several years back until now, but Google Takeout only retained the past 4 months. Thus, I wasn’t able to examine how my YouTube consumption behavior has changed throughout the years or conduct sentiment analysis on trending videos over a longer period of time.
- Working with the YouTube API: I encountered several challenges when trying to get an authorization key and when working with the YouTube API. This made it difficult to do all the analysis I had hoped to do. Nonetheless, it was interesting to see all the data that is available for the videos and use it to gather some insights about my own behavior.
- Language challenges & global applicability: Since many of the videos I watch have Chinese or Korean titles, my NLP analysis may not have picked up on them as accurately as I would have liked. This can be seen from the extra symbols in the wordcloud of search terms. A next step would be to look into methods of handling non-ASCII characters so that the analysis applies to other languages and regions.
Ultimately, I hope to continue with this analysis to further examine my YouTube activity (e.g. favorite YouTubers or segmenting my watched videos by region). Additionally, I could examine what makes a YouTube video go viral, perhaps using more detailed metadata like when a video was posted, its length, and its category to predict the level of engagement a video receives within a certain time.
About the author
My name is Olivia, and I’m a junior at the University of Pennsylvania Wharton School. This data project was conducted in R for the course OIDD245: Analytics & The Digital Economy.