I have a toy dataset which contains TV episodes, the date they were released and how many streams they have received:

episode                  date        streams
The Shadow Realm         2024-03-08     5987
The Forbidden Tome       2024-03-01     6315
The Devil's Tea          2024-02-23     9584
The Sorcerer's Shoes     2024-02-16     8996
The Enchanted Forest     2024-02-09     7564
The Echoing Abyss        2024-02-02     7982
The Lost Civilization    2024-01-26     8456
The Dimensional Rift     2024-01-19     8834
The Crystal Maze         2024-01-12     9215
The Scepter of Zalthar   2024-01-05     9763

I am interested in analysing the number of streams each episode receives.

In particular, the data show that, in general, the longer an episode has been available, the more streams it has received. At the same time, some more recent episodes have more streams than their release date would suggest (here, The Devil's Tea and The Sorcerer's Shoes).

What method of analysis should I use to determine the popularity of each episode (more streams = more popular) while controlling for the effect of time (earlier release date = more streams)?

The data is also given below in R dput format, if helpful.

structure(list(episode = c("The Shadow Realm", "The Forbidden Tome", 
"The Devil's Tea", "The Sorcerer's Shoes", "The Enchanted Forest", 
"The Echoing Abyss", "The Lost Civilization", "The Dimensional Rift", 
"The Crystal Maze", "The Scepter of Zalthar"), date = structure(c(19790, 
19783, 19776, 19769, 19762, 19755, 19748, 19741, 19734, 19727
), class = "Date"), streams = c(5987, 6315, 9584, 8996, 7564, 
7982, 8456, 8834, 9215, 9763)), class = "data.frame", row.names = c(NA, 
-10L))

1 Answer

Assuming this is just toy data to give the flavor of the problem and you have more observations, I would try

  1. fitting a heteroscedasticity-robust Poisson model of streams on episode age
  2. calculating the expected number of streams at each age, along with a 95% CI

This should look something like this:

[Figure: observed streams against episode age, with the fitted curve, a dashed 95% confidence band, and the two unusually popular episodes labeled]


R Code

# Uncomment the following lines to install the required packages if 
# not already installed
# install.packages(c("tidyverse", "lubridate", "sandwich", "emmeans", "ggrepel"))

# Load necessary libraries
library(tidyverse)
library(lubridate)
library(sandwich)
library(emmeans)
library(ggrepel)

# Input data
data <- tibble(
  episode = c("The Shadow Realm", "The Forbidden Tome", 
               "The Devil's Tea", 
              "The Sorcerer's Shoes", "The Enchanted Forest", 
              "The Echoing Abyss", 
              "The Lost Civilization", "The Dimensional Rift", 
              "The Crystal Maze", 
              "The Scepter of Zalthar"),  # Episode titles
  date = as_date(c("2024-03-08", "2024-03-01", "2024-02-23", 
                   "2024-02-16", "2024-02-09", "2024-02-02", 
                   "2024-01-26", "2024-01-19", "2024-01-12", 
                   "2024-01-05")),  # Episode release dates
  streams = c(5987, 6315, 9584, 8996, 7564, 7982, 8456, 8834, 9215, 
                9763)  # Number of streams
) %>%
  mutate(
    age_in_days = as.numeric(ymd("2024-07-11") - date),  # Age of each episode in days
    # Flag unusually popular episodes for labeling (thresholds eyeballed from the plot)
    outlier = ifelse(streams > 8900 & age_in_days < 150, episode, NA)
  )

# Quasi-Poisson regression model with robust standard errors
model <- glm(streams ~ age_in_days, data = data,
             family = quasipoisson())  # quasipoisson() estimates the dispersion rather than fixing it at 1
robust_vcov <- vcovHC(model, type = "HC1")  # Robust (HC1) variance-covariance matrix

# Generate new data for prediction
new_data <- tibble(age_in_days = 125:188)  
# Create a new data frame for age_in_days range

# Predictions with confidence intervals using emmeans
emm <- emmeans(model, ~ age_in_days, at = list(age_in_days = 125:188),
               vcov. = robust_vcov, type = "response")
emm_summary <- summary(emm)

# Print the column names of emm_summary to check
print(colnames(emm_summary))

# Convert emm_summary to a data frame and rename columns based on actual names
emm_df <- as.data.frame(emm_summary)
colnames(emm_df) <- c("age_in_days", "fitted", "SE", "df", "conf.low",
                      "conf.high")

# Ensure new_data matches emmeans summary
new_data <- left_join(new_data, emm_df, by = "age_in_days")  
# Merge new_data with emm_df

# Plot
ggplot(new_data, aes(x = age_in_days, y = fitted)) +
  geom_line(color = "blue") +                                   # Fitted values
  geom_ribbon(aes(ymin = conf.low, ymax = conf.high),
              linetype = "dashed", alpha = 0.1) +               # Confidence interval ribbon
  geom_point(data = data, aes(x = age_in_days, y = streams)) +  # Observed data points
  geom_text_repel(data = data,
                  aes(x = age_in_days, y = streams, label = outlier),
                  na.rm = TRUE, box.padding = 0.35,
                  point.padding = 0.3, angle = 0) +             # Label the outliers
  theme_minimal() +
  labs(x = "Age in Days", y = "Streams") +
  theme(legend.position = "none")

Stata Code

Same results, but I cannot quite match the CIs from R, probably due to different HC and small-sample corrections. The differences should be less stark with a larger sample.

clear
input str30 episode str10 date int streams
"The Shadow Realm" "2024-03-08" 5987
"The Forbidden Tome" "2024-03-01" 6315
"The Devil's Tea" "2024-02-23" 9584
"The Sorcerer's Shoes" "2024-02-16" 8996
"The Enchanted Forest" "2024-02-09" 7564
"The Echoing Abyss" "2024-02-02" 7982
"The Lost Civilization" "2024-01-26" 8456
"The Dimensional Rift" "2024-01-19" 8834
"The Crystal Maze" "2024-01-12" 9215
"The Scepter of Zalthar" "2024-01-05" 9763
end
gen age_in_days = td(11jul2024) - date(date,"YMD")
gen outlier = episode if streams > 8900 & age_in_days < 150
glm streams age_in_days, vce(robust) family(poisson) irls // irr
margins, at(age_in_days = (125(1)188)) post
marginsplot, addplot(scatter streams age_in_days, ms(Oh) mlab(outlier) mlabangle(20) mlabpos(2)) recastci(rline) ciopt(lpattern(dash)) recast(line) legend(off)

Comments:

  • This looks nice. Can you explain how you arrived at the values 8900 and 150 in the second line of code: outlier = ifelse(streams > 8900 & age_in_days < 150, episode, NA)? It looks like these are values outside the model's confidence intervals, but you do this before you run the model. – SamR, Jul 12 at 9:57
  • @SamR I kind of cheated: eyeballed from the graph and then defined the labels earlier on. I am sure there is a better way to do the labels, but I was being lazy. – dimitriy, Jul 12 at 11:04
  • Honestly, to me, it looks a lot like all the non-outlier episodes fall almost exactly on a straight line of views versus age, one that predicts them better than the Poisson model. To be precise, a model like y = 59.65x - 1589 with a product-moment correlation coefficient of something like 0.999. That would suggest that simply dividing the number of streams minus the intercept by days (or better, hours) since release and comparing to the expected number might be a good measure of popularity. – Obie 2.0, Jul 12 at 18:41
  • The data is a tad weird because despite the non-outliers nearly perfectly matching a linear trend, the number of predicted views is negative until the episode has been out for about 27 days. If I saw that in real data with such perfect linearity, I would suspect that the episodes were effectively not available for the first month, perhaps as a consequence of being from very obscure and unpopular shows and not being recommended anywhere except searches until reaching a low threshold of views. Which might make sense: these numbers are a good 3–4 orders of magnitude below what, say, Netflix has. – Obie 2.0, Jul 12 at 19:13
  • Why is Poisson regression applicable when the variance is 200 times the mean? – rolando2, Jul 13 at 10:20
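The numerical claims in the comments check out. A short Python sketch (assuming only `numpy`; ages computed from 2024-07-11 as in the answer's R code, so the intercept is not directly comparable to the one quoted, which appears to use a slightly later reference date):

```python
import numpy as np

# Ages in days as of 2024-07-11 (as in the answer's R code) and stream counts
age = np.array([125, 132, 139, 146, 153, 160, 167, 174, 181, 188], dtype=float)
streams = np.array([5987, 6315, 9584, 8996, 7564, 7982, 8456, 8834, 9215, 9763],
                   dtype=float)

# Overdispersion check (rolando2's comment): variance far exceeds the mean
dispersion = streams.var(ddof=1) / streams.mean()  # roughly 200

# Linear fit excluding the two flagged episodes (Obie 2.0's comment)
keep = ~((streams > 8900) & (age < 150))
slope, intercept = np.polyfit(age[keep], streams[keep], 1)
r = np.corrcoef(age[keep], streams[keep])[0, 1]

print(round(dispersion), round(slope, 2), round(r, 4))
```

The eight non-flagged episodes do fall almost exactly on a line with slope about 59.6 streams per day and r above 0.999, and the sample variance of streams is on the order of 200 times the mean, supporting both the linear-trend observation and the overdispersion concern (and the answer's use of quasipoisson with robust standard errors rather than a plain Poisson likelihood).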
