I have a toy dataset which contains TV episodes, the date they were released and how many streams they have received:

episode                  date        streams
The Shadow Realm         2024-03-08     5987
The Forbidden Tome       2024-03-01     6315
The Devil's Tea          2024-02-23     9584
The Sorcerer's Shoes     2024-02-16     8996
The Enchanted Forest     2024-02-09     7564
The Echoing Abyss        2024-02-02     7982
The Lost Civilization    2024-01-26     8456
The Dimensional Rift     2024-01-19     8834
The Crystal Maze         2024-01-12     9215
The Scepter of Zalthar   2024-01-05     9763

I am interested in analysing the number of streams each episode receives.

In particular, the data show that, in general, the longer an episode has been available, the more streams it has received. At the same time, some more recent episodes have more streams than their release date would suggest (here, The Devil's Tea and The Sorcerer's Shoes).

What method of analysis should I use to determine the popularity of each episode (more streams = more popular) while controlling for the effect of time (earlier release date = more streams)?

The data is also given below in R dput format, if helpful.

structure(list(episode = c("The Shadow Realm", "The Forbidden Tome", 
"The Devil's Tea", "The Sorcerer's Shoes", "The Enchanted Forest", 
"The Echoing Abyss", "The Lost Civilization", "The Dimensional Rift", 
"The Crystal Maze", "The Scepter of Zalthar"), date = structure(c(19790, 
19783, 19776, 19769, 19762, 19755, 19748, 19741, 19734, 19727
), class = "Date"), streams = c(5987, 6315, 9584, 8996, 7564, 
7982, 8456, 8834, 9215, 9763)), class = "data.frame", row.names = c(NA, 
-10L))

1 Answer

Assuming this is just toy data to give the flavor of the problem and you have more observations, I would try

  1. fitting a heteroscedasticity-robust Poisson model of streams on episode age
  2. calculating the expected number of streams at each age, along with a 95% CI

This should look something like this:

[Figure: observed streams against episode age, with the fitted curve, a dashed 95% confidence band, and the two unusually popular episodes labeled]


R Code

# Uncomment the following lines to install the required packages if 
# not already installed
# install.packages(c("tidyverse", "lubridate", "sandwich", "emmeans", "ggrepel"))

# Load necessary libraries
library(tidyverse)
library(lubridate)
library(sandwich)
library(emmeans)
library(ggrepel)

# Input data
data <- tibble(
  episode = c("The Shadow Realm", "The Forbidden Tome", 
               "The Devil's Tea", 
              "The Sorcerer's Shoes", "The Enchanted Forest", 
              "The Echoing Abyss", 
              "The Lost Civilization", "The Dimensional Rift", 
              "The Crystal Maze", 
              "The Scepter of Zalthar"),  # Episode titles
  date = as_date(c("2024-03-08", "2024-03-01", "2024-02-23", 
                   "2024-02-16", "2024-02-09", "2024-02-02", 
                   "2024-01-26", "2024-01-19", "2024-01-12", 
                   "2024-01-05")),  # Episode release dates
  streams = c(5987, 6315, 9584, 8996, 7564, 7982, 8456, 8834, 9215, 
                9763)  # Number of streams
) %>%
  mutate(
    age_in_days = as.numeric(ymd("2024-07-11") - date),  # Age of each episode in days
    # Flag unusually popular episodes for labeling (thresholds eyeballed from the plot)
    outlier = ifelse(streams > 8900 & age_in_days < 150, episode, NA)
  )

# Quasi-Poisson regression model with robust standard errors
model <- glm(streams ~ age_in_days, data = data,
             family = quasipoisson())  # quasipoisson() estimates the dispersion rather than fixing it at 1
robust_vcov <- vcovHC(model, type = "HC1")  # Robust (HC1) variance-covariance matrix

# Generate new data for prediction
new_data <- tibble(age_in_days = 125:188)  
# Create a new data frame for age_in_days range

# Predictions with confidence intervals using emmeans
emm <- emmeans(model, ~ age_in_days, at = list(age_in_days = 125:188),
               vcov. = robust_vcov, type = "response")
emm_summary <- summary(emm)

# Print the column names of emm_summary to check
print(colnames(emm_summary))

# Convert emm_summary to a data frame and rename columns based on actual names
emm_df <- as.data.frame(emm_summary)
colnames(emm_df) <- c("age_in_days", "fitted", "SE", "df", "conf.low",
                      "conf.high")

# Ensure new_data matches emmeans summary
new_data <- left_join(new_data, emm_df, by = "age_in_days")  
# Merge new_data with emm_df

# Plot
ggplot(new_data, aes(x = age_in_days, y = fitted)) +
  geom_line(color = "blue") +                                   # Fitted values
  geom_ribbon(aes(ymin = conf.low, ymax = conf.high),
              linetype = "dashed", alpha = 0.1) +               # Confidence interval ribbon
  geom_point(data = data, aes(x = age_in_days, y = streams)) +  # Observed data points
  geom_text_repel(data = data,
                  aes(x = age_in_days, y = streams, label = outlier),
                  na.rm = TRUE, box.padding = 0.35,
                  point.padding = 0.3, angle = 0) +             # Label the outliers
  theme_minimal() +
  labs(x = "Age in Days", y = "Streams") +
  theme(legend.position = "none")

Stata Code

Same results, but I cannot quite match the CIs from R, probably due to different HC and small-sample corrections. The differences should be less stark with a larger sample.

clear
input str30 episode str10 date int streams
"The Shadow Realm" "2024-03-08" 5987
"The Forbidden Tome" "2024-03-01" 6315
"The Devil's Tea" "2024-02-23" 9584
"The Sorcerer's Shoes" "2024-02-16" 8996
"The Enchanted Forest" "2024-02-09" 7564
"The Echoing Abyss" "2024-02-02" 7982
"The Lost Civilization" "2024-01-26" 8456
"The Dimensional Rift" "2024-01-19" 8834
"The Crystal Maze" "2024-01-12" 9215
"The Scepter of Zalthar" "2024-01-05" 9763
end
gen age_in_days = td(11jul2024) - date(date,"YMD")
gen outlier = episode if streams > 8900 & age_in_days < 150
glm streams age_in_days, vce(robust) family(poisson) irls // irr
margins, at(age_in_days = (125(1)188)) post
marginsplot, addplot(scatter streams age_in_days, ms(Oh) mlab(outlier) mlabangle(20) mlabpos(2)) recastci(rline) ciopt(lpattern(dash)) recast(line) legend(off)

Comments:

  • This looks nice. Can you explain how you arrived at the values 8900 and 150 in the second line of code: outlier = ifelse(streams > 8900 & age_in_days < 150, episode, NA)? It looks like these are values outside the model's confidence intervals, but you do this before you run the model. – SamR, Jul 12 at 9:57
  • @SamR I kind of cheated: eyeballed from the graph and then defined the labels earlier on. I am sure there is a better way to do the labels, but I was being lazy. – dimitriy, Jul 12 at 11:04
  • Honestly, to me, it looks a lot like all the non-outlier episodes fall almost exactly on a straight line of views versus age, one that predicts them better than the Poisson model. To be precise, a model like y = 59.65x - 1589 with a product-moment correlation coefficient of something like 0.999. That would suggest that simply dividing the number of streams minus the intercept by days (or better, hours) since release and comparing to the expected number might be a good measure of popularity. – Obie 2.0, Jul 12 at 18:41
  • The data is a tad weird because despite the non-outliers nearly perfectly matching a linear trend, the number of predicted views is negative until the episode has been out for about 27 days. If I saw that in real data with such perfect linearity, I would suspect that the episodes were effectively not available for the first month, perhaps as a consequence of being from very obscure and unpopular shows and not being recommended anywhere except searches until reaching a low threshold of views. Which might make sense: these numbers are a good 3–4 orders of magnitude below what, say, Netflix has. – Obie 2.0, Jul 12 at 19:13
  • Why is Poisson regression applicable when the variance is 200 times the mean? – rolando2, Jul 13 at 10:20
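The numerical claims in the comments check out. A short Python sketch (assuming only `numpy`; ages computed from 2024-07-11 as in the answer's R code, so the intercept is not directly comparable to the one quoted, which appears to use a slightly later reference date):

```python
import numpy as np

# Ages in days as of 2024-07-11 (as in the answer's R code) and stream counts
age = np.array([125, 132, 139, 146, 153, 160, 167, 174, 181, 188], dtype=float)
streams = np.array([5987, 6315, 9584, 8996, 7564, 7982, 8456, 8834, 9215, 9763],
                   dtype=float)

# Overdispersion check (rolando2's comment): variance far exceeds the mean
dispersion = streams.var(ddof=1) / streams.mean()  # roughly 200

# Linear fit excluding the two flagged episodes (Obie 2.0's comment)
keep = ~((streams > 8900) & (age < 150))
slope, intercept = np.polyfit(age[keep], streams[keep], 1)
r = np.corrcoef(age[keep], streams[keep])[0, 1]

print(round(dispersion), round(slope, 2), round(r, 4))
```

The eight non-flagged episodes do fall almost exactly on a line with slope about 59.6 streams per day and r above 0.999, and the sample variance of streams is on the order of 200 times the mean, supporting both the linear-trend observation and the overdispersion concern (and the answer's use of quasipoisson with robust standard errors rather than a plain Poisson likelihood).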
