It's no secret that COVID-19 has had a huge effect on everyone's lives this year. People have been staying inside a lot more and as a result, many are spending much more time online than before. One such activity that many people turned to this year is video games! All you need is a computer/console/phone and an internet connection to play a wide variety of games. However, playing video games is only half the story.
Twitch.tv, or simply Twitch, is a live streaming platform that launched in 2011 and was originally centered around video games. Since then, Twitch has grown to include categories for not just video games, but music, cooking, chatting, and anything else you could think of. Back in 2014, Amazon acquired Twitch for close to $1 billion, which has only fueled Twitch’s growth in the last few years. Anyone with an account can live stream on Twitch for free, although most users just watch others live stream and can chat with other users, including the streamer themselves.
Personally, I watch live streams on Twitch several times a week, if not every day. While I haven’t been keeping track, I would say that I have been watching more live streams on Twitch this year due to COVID-19, and I wonder how Twitch has been affected as a whole. I wanted to see exactly how much of an effect COVID-19 has had on Twitch and whether it has also changed viewing habits.
The data I will be analyzing consists of two datasets from user Ran.Kirsh on Kaggle. Both contain Twitch viewership statistics from January 2016 to October 2020: one covers Twitch as a whole, and the other covers individual categories (games/activities), specifically the top 200 categories by hours watched each month. The two datasets are stored locally as CSV files and can be downloaded from Kaggle with a free account.
import pandas as pd
global_data = pd.read_csv("Twitch_global_data.csv")
global_data.head()
game_data = pd.read_csv("Twitch_game_data.csv", encoding='latin-1')
game_data.head()
Let’s start by seeing how hours watched and streams per month have changed since 2016 and see if there is anything different about 2020.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
# creating list with year positions for plot labeling
# (the row position of each January, assuming rows are in chronological order)
label_positions = []
counter = 0
for index, row in global_data.iterrows():
    if row['Month'] == 1:
        label_positions.append(counter)
    counter = counter + 1
fig, ax = plt.subplots(figsize=(10,7))
# plotting each month's hours watched in billions
ax.plot([x / 1000000000 for x in global_data['Hours_watched']])
ax.set_ylabel('Hours Watched (in billions)')
ax.set_xlabel('Time')
ax.set_title('Hours Watched per Month on Twitch from 2016-2020')
ax.set_xticks(label_positions)
ax.set_xticklabels(['2016', '2017', '2018', '2019', '2020'])
plt.show()
There is a very clear spike in hours watched per month around the beginning of 2020, going from ~0.9-1 billion hours watched per month to ~1.8 billion. This spike lines up with the point in time when COVID-19 cases started to spike and many countries went into lockdown (or at least tried to). While there is growth leading up to 2020, this growth cannot account for the massive spike in hours watched per month.
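One way to check the claim that earlier growth cannot explain the spike is to fit a straight-line trend to the months before 2020 and compare its extrapolation with what was actually observed. Here is a minimal sketch of that idea. The numbers are made up to roughly imitate the shape of the plot (they are not taken from the dataset), and `trend_gap` is a hypothetical helper name.

```python
import numpy as np

def trend_gap(values, n_train):
    """Fit a line to the first n_train points and return
    (actual - extrapolated) for the remaining points."""
    values = np.asarray(values, dtype=float)
    x = np.arange(len(values))
    slope, intercept = np.polyfit(x[:n_train], values[:n_train], deg=1)
    predicted = slope * x[n_train:] + intercept
    return values[n_train:] - predicted

# Illustrative values in billions of hours (NOT real data):
# steady growth for six periods, then a jump.
toy_hours = [0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.8, 1.7]
gap = trend_gap(toy_hours, n_train=6)
print(gap)  # positive values mean the observed hours exceed the trend
```

With the real data, `values` would presumably be `global_data['Hours_watched'] / 1e9` with `n_train=48` (the months of 2016 through 2019), assuming the rows are in chronological order.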
fig, ax = plt.subplots(figsize=(10,7))
# plotting each month's streams in millions
ax.plot([x / 1000000 for x in global_data['Streams']])
ax.set_ylabel('Streams (in millions)')
ax.set_xlabel('Time')
ax.set_title('Streams per Month on Twitch from 2016-2020')
ax.set_xticks(label_positions)
ax.set_xticklabels(['2016', '2017', '2018', '2019', '2020'])
plt.show()
Similar to the previous plot, we can see an increase in the number of live streams per month at the start of 2020. There was another period of growth for Twitch in 2017 and 2018, which was actually larger than the growth from COVID-19. However, because anyone can stream and there are always more viewers than streamers, hours watched is a better metric for gauging activity on Twitch and will be the primary metric for analysis going forward.
Since more time was being spent watching live streams on Twitch, let's now look at hours watched per month for the top 5 games/categories of each month from January 2019 to October 2020. I'll filter the dataframe to keep only entries ranked 1-5 for monthly watch time in 2019 and 2020.
# creating a table with the top 5 games for each month since Jan 2019
game_data_top_5 = game_data[(game_data['Rank'] <= 5) & (game_data['Year'] >= 2019)]
# creating a dictionary to store each game's hours watched for the months it was in the top 5
# (22 slots: January 2019 through October 2020; months outside the top 5 stay None)
top_games = {}
for index, row in game_data_top_5.iterrows():
    if row['Game'] not in top_games.keys():
        top_games[row['Game']] = [None] * 22
    top_games[row['Game']][(int(row['Year']) - 2019) * 12 + int(row['Month']) - 1] = row['Hours_watched']
# unique colors for table (from https://stackoverflow.com/questions/8389636/creating-over-20-unique-legend-colors-using-matplotlib)
NUM_COLORS = len(top_games.keys())
cm = plt.get_cmap('gist_rainbow')
fig, ax = plt.subplots(figsize=(16,10))
ax.set_prop_cycle('color', [cm(1.*i/NUM_COLORS) for i in range(NUM_COLORS)])
# plotting each game's hours watched in millions
for game in top_games.keys():
    ax.plot([None if x is None else x / 1000000 for x in top_games[game]], label=game, marker='.')
ax.set_ylabel('Hours Watched (in millions)')
ax.set_xlabel('Time')
ax.set_title('Hours Watched per Month on Twitch from 2019-2020')
# this plot starts in January 2019, so January 2019 and January 2020 sit at positions 0 and 12
ax.set_xticks([0, 12])
ax.set_xticklabels(['2019', '2020'])
ax.legend(loc='upper left')
plt.show()
If you’re thinking that something weird was going on with VALORANT’s viewership in the first half of 2020, you would absolutely be right. Ahead of its release in June, VALORANT had a closed beta in April where players had to watch VALORANT Twitch streams for a chance to get access to the closed beta. This resulted in many viewers idling in VALORANT Twitch streams, without actually watching them, to increase their chances of getting beta access. Because of this, VALORANT’s viewership in April (the big spike near 350 million hours watched) does not represent normal viewing habits.
Another thing of note is that the Just Chatting category experienced more growth compared to other categories since the start of 2020. This is probably due to an influx of new Twitch users who found the site during quarantine.
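The dictionary built for the top-5 plot above is essentially a month-by-game table. As a rough alternative, the same shape can be produced with `DataFrame.pivot`. The sketch below runs on a tiny made-up frame that only borrows the column names `Game`, `Year`, `Month`, and `Hours_watched` from the dataset. The values and the names `toy`, `month_index`, and `wide` are placeholders.

```python
import pandas as pd

# Placeholder rows (NOT real data), using the dataset's column names.
toy = pd.DataFrame({
    'Game': ['A', 'B', 'A', 'C'],
    'Year': [2019, 2019, 2019, 2019],
    'Month': [1, 1, 2, 2],
    'Hours_watched': [10.0, 7.0, 12.0, 9.0],
})

# Months elapsed since January 2019, the same index used for the lists.
toy['month_index'] = (toy['Year'] - 2019) * 12 + toy['Month'] - 1

# One row per month, one column per game. Months in which a game has
# no row become NaN, which shows up as a gap when plotted.
wide = toy.pivot(index='month_index', columns='Game', values='Hours_watched')
print(wide)
```

This assumes each game appears at most once per month, since `pivot` raises an error on duplicate index/column pairs.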
In order to be a successful streamer, you generally need at least 1 of 2 qualities (in addition to good internet): you need to either be funny and have a great personality, or you need to be really good at video games (or other activities e.g. music). Streaming on Twitch is driven by passion and there rarely is a shortcut to becoming big on Twitch. However, what if you only cared about getting as many viewers as possible by putting in as little effort as possible? What would be the best game / category to stream in this case?
In this next section, I’ll look at the average viewer count and average channel count for Twitch in October 2020 (the most recent data I have), to determine which games / categories would statistically be the best for someone new to live streaming.
oct_global_data = global_data[(global_data['Month'] == 10) & (global_data['Year'] == 2020)]
oct_global_data
# .copy() so that columns can be added later without touching a slice of game_data
oct_game_data = game_data[(game_data['Month'] == 10) & (game_data['Year'] == 2020)].copy()
oct_game_data.drop(columns=['Month', 'Year']).head()
Games that stay popular for a long time usually do so because they have a dedicated viewing audience, so let's add a column to the oct_game_data dataframe for the number of months each game has been in the top 200.
# inserting and populating new column
oct_game_data.insert(12, 'months_in_top_200', 0)
for index, row in oct_game_data.iterrows():
    # each row of game_data is one month in the top 200 for that game
    num_months = len(game_data[game_data['Game'] == row['Game']])
    oct_game_data.at[index, 'months_in_top_200'] = num_months
oct_game_data.drop(columns=['Month', 'Year']).head()
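The same count can also be computed without an explicit loop, by counting rows per game once and then looking each game up. Here is a minimal sketch, run on placeholder rows rather than the real dataset. The names `toy_all`, `toy_oct`, and `months_per_game` are made up for illustration.

```python
import pandas as pd

# Placeholder rows (NOT real data): one row per game per month in the top 200.
toy_all = pd.DataFrame({
    'Game': ['A', 'B', 'A', 'C', 'A', 'B'],
    'Month': [8, 8, 9, 9, 10, 10],
})
toy_oct = toy_all[toy_all['Month'] == 10].copy()

# Count how many rows (months) each game has, then map each game to its count.
months_per_game = toy_all['Game'].value_counts()
toy_oct['months_in_top_200'] = toy_oct['Game'].map(months_per_game)
print(toy_oct)
```

As with the loop, this relies on every row of the full table representing one month in the top 200 for that game.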
Now, let's see whether the number of months a game has been in the Twitch top 200 has any correlation with its average viewers in October using Scikit-learn's linear regression model.
# function to quickly get linear regression model with input x and observed output y, no need to reshape data prior
def get_linear_reg(x, y):
    model = LinearRegression()
    model.fit(x.values.reshape(-1, 1), y.values.reshape(-1, 1))
    return model
# function to quickly get R-squared from given model with input x and observed output y, no need to reshape data prior
def get_r_2(model, x, y):
    return model.score(x.values.reshape(-1, 1), y.values.reshape(-1, 1))
reg = get_linear_reg(oct_game_data['months_in_top_200'], oct_game_data['Avg_viewers'])
fig, ax = plt.subplots(figsize=(14, 8))
ax.scatter(oct_game_data['months_in_top_200'], oct_game_data['Avg_viewers'])
ax.set_xlabel('months_in_top_200')
ax.set_ylabel('Avg_viewers')
ax.set_title('Months in top 200 vs avg viewers in Oct 2020')
# only labeling games with Avg_viewers > 20,000 since these will be the games of interest and
# other games' labels will not be legible due to crowding
for game, months, viewers in zip(oct_game_data['Game'], oct_game_data['months_in_top_200'], oct_game_data['Avg_viewers']):
    if viewers > 20000:
        ax.annotate(game, (months, viewers))
# plotting the linear regression model
ax.plot(np.arange(1, 61), [y for [y] in reg.predict(np.arange(1, 61).reshape(-1, 1))])
plt.show()
print("R squared: " + str(get_r_2(reg, oct_game_data['months_in_top_200'], oct_game_data['Avg_viewers'])))
There doesn't really seem to be any correlation between months spent in the top 200 and average viewers, with the low R squared value confirming this. Games with anywhere from 0-58 months in the top 200 can have similar average viewers.
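For a straight-line fit with a single input, R squared is simply the square of the Pearson correlation coefficient between the two columns, which is why a low R squared can be read as little linear correlation. It says nothing about non-linear relationships. Here is a quick check of that identity on placeholder numbers (not taken from the dataset), using NumPy's `polyfit` instead of Scikit-learn so the arithmetic is visible:

```python
import numpy as np

# Placeholder points (NOT real data).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 5.0])

# Least-squares line through the points.
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (slope * x + intercept)

# Coefficient of determination: 1 - (residual sum of squares / total sum of squares).
r_squared = 1 - np.sum(residuals ** 2) / np.sum((y - y.mean()) ** 2)

# Pearson correlation coefficient between x and y.
r = np.corrcoef(x, y)[0, 1]

print(r_squared, r ** 2)  # the two values agree up to rounding
```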
How about average viewers to channel ratio?
reg = get_linear_reg(oct_game_data['months_in_top_200'], oct_game_data['Avg_viewer_ratio'])
fig, ax = plt.subplots(figsize=(14, 8))
ax.scatter(oct_game_data['months_in_top_200'], oct_game_data['Avg_viewer_ratio'])
ax.set_xlabel('months_in_top_200')
ax.set_ylabel('Avg_viewer_ratio')
ax.set_title('Months in top 200 vs Avg_viewer_ratio in Oct 2020')
# only labeling games with Avg_viewer_ratio > 70 since these will be the games of interest and
# other games' labels will not be legible due to crowding
for game, months, ratio in zip(oct_game_data['Game'], oct_game_data['months_in_top_200'], oct_game_data['Avg_viewer_ratio']):
    if ratio > 70:
        ax.annotate(game, (months, ratio))
# plotting the linear regression model
ax.plot(np.arange(1, 61), [y for [y] in reg.predict(np.arange(1, 61).reshape(-1, 1))])
plt.show()
print("R squared: " + str(get_r_2(reg, oct_game_data['months_in_top_200'], oct_game_data['Avg_viewer_ratio'])))
Looks like it's the same case here according to the plot and the R squared value: no correlation.
Since how long a game has been popular has no correlation with avg_viewers or avg_viewer_ratio, let's instead look at these metrics directly to find out which games would be the best for getting viewers. First, let's see what the average viewer-to-channel ratio is across the top 200 games in October.
# oct_global_data has a single row, so take its first (only) value explicitly
avg_viewers = int(oct_global_data['Avg_viewers'].iloc[0])
avg_channels = int(oct_global_data['Avg_channels'].iloc[0])
print("Average global viewer/channel ratio: " + str(avg_viewers / avg_channels))
print("Average game viewer/channel ratio: " + str(oct_game_data['Avg_viewer_ratio'].mean()))
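These two figures are different kinds of average, so they need not agree: the first is a ratio of totals, while the second is a mean of per-category ratios in which every category counts equally regardless of its size. Here is a small sketch with made-up categories and numbers (not from the dataset) that shows how far apart the two can be:

```python
# Placeholder categories (NOT real data): (average viewers, average channels)
categories = {
    'big_game': (100_000, 5_000),   # 20 viewers per channel
    'small_event': (2_000, 10),     # 200 viewers per channel
}

total_viewers = sum(v for v, c in categories.values())
total_channels = sum(c for v, c in categories.values())

# Ratio of totals: dominated by the large category.
global_ratio = total_viewers / total_channels

# Mean of per-category ratios: every category counts equally.
mean_ratio = sum(v / c for v, c in categories.values()) / len(categories)

print(round(global_ratio, 2), mean_ratio)
```

In the real data there is a further difference: the global figures presumably cover all of Twitch, while the per-game mean only covers the top 200 categories.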
It appears that the global viewer ratio is different from the average viewer ratio of all games. This might be because of some games with a low number of viewers and channels that are able to have a very high viewer ratio. Let's look at 2 bar plots with the same games in the same order showing their ratio and average viewer count:
fig, ax = plt.subplots(figsize=(15, 7))
ax.bar(oct_game_data.sort_values(['Avg_viewer_ratio'], ascending=False)['Game'], oct_game_data.sort_values(['Avg_viewer_ratio'], ascending=False)['Avg_viewer_ratio'])
ax.set_xlabel('Games')
ax.set_ylabel('Avg_viewer_ratio')
ax.set_title('Twitch Games\' Avg_viewer_ratio in Oct 2020')
# hide the x tick labels (200 game names would not be legible)
ax.tick_params(axis='x', labelbottom=False)
plt.show()
fig, ax = plt.subplots(figsize=(15, 7))
ax.bar(oct_game_data.sort_values(['Avg_viewer_ratio'], ascending=False)['Game'], oct_game_data.sort_values(['Avg_viewer_ratio'], ascending=False)['Avg_viewers'])
ax.set_xlabel('Games')
ax.set_ylabel('Avg_viewers')
ax.set_title('Twitch Games\' Avg_viewers in Oct 2020')
# hide the x tick labels (200 game names would not be legible)
ax.tick_params(axis='x', labelbottom=False)
plt.show()
If you're wondering why the x-axis isn't labeled, it's because there's no good way to label all the games and have them be legible. But the order of the games is the same for the above two plots for comparison.
It looks like there are a few games with a very high viewer ratio that also have very low average viewers, which is consistent with the average game viewer ratio being higher than the global viewer ratio. However, the first game (on the far left) has a fairly high average viewer count, so let's look at the top 15 games in terms of viewer ratio to see why this might be.
oct_game_data.sort_values(['Avg_viewer_ratio'], ascending=False).drop(columns=['Month', 'Year', 'months_in_top_200']).head(15)
The game in question (which isn't really a game) is Special Events. Special Events is a category for, well, special events. If there's an awards show or a press conference / event (E3, for example), it will be streamed under the Special Events category. These special events have a high viewer count but low channel count since only a handful of channels will be streaming them, usually a few official ones and then individual streamers re-streaming them to have watch parties within their communities. Because of its nature, I will be considering Special Events as a special case and will not include it in the discussion of which games / categories are statistically the best to stream. I will go ahead and remove Special Events from the table:
# .copy() so that later column insertions act on an independent dataframe
oct_game_data = oct_game_data[oct_game_data['Game'] != "Special Events"].copy()
So, we want games with high average viewers and a high viewer ratio. But according to the two plots from before, there doesn't seem to be any relationship between the two. Let's perform a linear regression to see whether or not this is actually the case.
reg = get_linear_reg(oct_game_data['Avg_viewers'], oct_game_data['Avg_viewer_ratio'])
fig, ax = plt.subplots(figsize=(15,8))
ax.scatter(oct_game_data['Avg_viewers'], oct_game_data['Avg_viewer_ratio'])
ax.set_xlabel('Avg_viewers')
ax.set_ylabel('Avg_viewer_ratio')
ax.set_title('Avg_viewers vs Avg_viewer_ratio in Oct 2020')
# only labeling games with Avg_viewers > 100,000 or Avg_viewer_ratio > 200 since these will be the games of interest and
# other games' labels will not be legible due to crowding
for game, viewers, ratio in zip(oct_game_data['Game'], oct_game_data['Avg_viewers'], oct_game_data['Avg_viewer_ratio']):
    if viewers > 100000 or ratio > 200:
        ax.annotate(game, (viewers, ratio))
# plotting the linear regression model
ax.plot(np.arange(1, oct_game_data['Avg_viewers'].max()), [y for [y] in reg.predict(np.arange(1, oct_game_data['Avg_viewers'].max()).reshape(-1, 1))])
plt.show()
print("R squared: " + str(get_r_2(reg, oct_game_data['Avg_viewers'], oct_game_data['Avg_viewer_ratio'])))
According to the plot and the low value for R squared, it looks like there is indeed no relationship between the two. We need to find a way to accurately describe how good a game is to stream.
In order to quantify how good a game is to stream, I will make a new column, score, that will combine a game's average viewers and its viewer ratio to create a metric that takes both into account since they are both desirable traits. I will multiply the two together and then divide by 100,000 just to keep the final score from being unnecessarily large.
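# start the column at 0.0 so it holds floats (passing the float type itself would store the type object)
oct_game_data.insert(12, 'score', 0.0)
for index, row in oct_game_data.iterrows():
    oct_game_data.at[index, 'score'] = row['Avg_viewers'] * row['Avg_viewer_ratio'] / 100000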
oct_game_data.drop(columns=['Month', 'Year', 'months_in_top_200']).head()
Now, let's just double check to see whether score does actually represent both average viewers and viewer ratio by using linear regression.
reg = get_linear_reg(oct_game_data['Avg_viewers'], oct_game_data['score'])
fig, ax = plt.subplots(figsize=(14,7))
ax.scatter(oct_game_data['Avg_viewers'], oct_game_data['score'])
ax.set_xlabel('Avg_viewers')
ax.set_ylabel('score')
ax.set_title('Avg_viewers vs score in Oct 2020')
# only labeling games with score > 20 since these will be the games of interest and
# other games' labels will not be legible due to crowding
for game, viewers, score in zip(oct_game_data['Game'], oct_game_data['Avg_viewers'], oct_game_data['score']):
    if score > 20:
        ax.annotate(game, (viewers, score))
# plotting the linear regression model
ax.plot(np.arange(1, oct_game_data['Avg_viewers'].max()), [y for [y] in reg.predict(np.arange(1, oct_game_data['Avg_viewers'].max()).reshape(-1, 1))])
plt.show()
print("R squared: " + str(get_r_2(reg, oct_game_data['Avg_viewers'], oct_game_data['score'])))
Judging by the plot itself and the R-squared value of 0.85, I can say that there is a decent relationship between average viewers and score.
And for the relationship between viewer ratio and score...
reg = get_linear_reg(oct_game_data['Avg_viewer_ratio'], oct_game_data['score'])
fig, ax = plt.subplots(figsize=(14,7))
ax.scatter(oct_game_data['Avg_viewer_ratio'], oct_game_data['score'])
ax.set_xlabel('Avg_viewer_ratio')
ax.set_ylabel('score')
ax.set_title('Avg_viewer_ratio vs score in Oct 2020')
# only labeling games with score > 20 since these will be the games of interest and
# other games' labels will not be legible due to crowding
for game, ratio, score in zip(oct_game_data['Game'], oct_game_data['Avg_viewer_ratio'], oct_game_data['score']):
    if score > 20:
        ax.annotate(game, (ratio, score))
# plotting the linear regression model
ax.plot(np.arange(1, oct_game_data['Avg_viewer_ratio'].max()), [y for [y] in reg.predict(np.arange(1, oct_game_data['Avg_viewer_ratio'].max()).reshape(-1, 1))])
plt.show()
print("R squared: " + str(get_r_2(reg, oct_game_data['Avg_viewer_ratio'], oct_game_data['score'])))
There doesn't appear to be a strong relationship here. However, keep in mind that the average viewer count will usually be much bigger than the viewer ratio, meaning that the average viewers will have a greater effect on the score than viewer ratio. The degree to which a viewer ratio will lead to a high score depends on the average viewers. A game's high viewer ratio won't result in a high score if its average viewers is low.
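There is another way to see why average viewers dominates. If the viewer ratio is average viewers divided by average channels (an assumption, although it matches how the global ratio was computed earlier), then the score is the square of average viewers divided by average channels, so viewers enter twice. The sketch below checks this on placeholder numbers that are not from the dataset. The name `score_alt` is made up for illustration.

```python
import numpy as np

# Placeholder values (NOT real data) for three hypothetical categories.
avg_viewers = np.array([50_000.0, 5_000.0, 500.0])
avg_channels = np.array([2_500.0, 50.0, 2.0])

# Assumption: the ratio column is viewers divided by channels.
avg_viewer_ratio = avg_viewers / avg_channels

# Score as defined in the text, and the equivalent viewers-squared form.
score = avg_viewers * avg_viewer_ratio / 100_000
score_alt = avg_viewers ** 2 / (avg_channels * 100_000)

print(avg_viewer_ratio)
print(score)
```

In this toy example the category with the highest viewer ratio ends up with the lowest score, because its viewer count is small.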
Finally, let's rank the games by score and see which are the best games to stream if you just want to get a lot of viewers:
oct_game_data.sort_values(by='score', ascending=False).drop(columns=['Month', 'Year', 'months_in_top_200']).head(10)
So it looks like, as of October 2020, the best games to stream on Twitch are:
While you would think that the best games to stream would be the most popular ones, that doesn't seem to necessarily be the case. Yes, all the games in the top 10 above are relatively popular compared to all 200 top games, but I was surprised that Hearthstone, which was ranked 16th in watch time, ended up being ranked 5th on my list. This is primarily due to Hearthstone having a very good viewer ratio for a game of its size and popularity (127.7 viewers per channel).
Even though Fortnite and Call Of Duty: Modern Warfare were ranked 4th and 5th respectively in terms of watch time, they didn't even make the top 10 of my list due to having very poor viewer ratios (14.81 & 11.45 respectively).
So these are the best games to stream on Twitch as of October 2020. I stress this because what's popular and what's not can change very quickly. Ranked 3rd on my list is Among Us, a game which came out in 2018 and only suddenly got popular earlier this year. There might be some other undiscovered game that will have a surge in popularity and end up being better than many other games on my list. FIFA 21 is another game that should be pointed out because while it will probably remain popular for a while, it also just came out in October 2020, so its current popularity might be due to its recent release.
It's very difficult to tell what will be popular on Twitch, but it's very easy to see what is currently popular on Twitch. So if you want to try to get as many viewers as possible, streaming one of the above 10 games is your best bet!