The current demand for entertainment has greatly increased the importance of anime. More and more people are beginning to watch anime and it is well-recognized among millions of people worldwide. As anime continues to be a growing industry, especially as it moves west, the platform, MyAnimeList, represented in the provided dataset, shows how it is continuing to grow and remain relevant in the current world.
This poses many important questions, such as what exactly constitutes an ideal anime that will be popular and scored highly? What variables contribute to the mean score the most? Do newer or older anime tend to be scored higher? How does the intended audience affect the scores of anime? These questions are important to ask since it is interesting to see why people score certain anime higher than others. We want to see what about anime makes them so popular that people continue to watch them. Studying what makes anime scored well could also help studios produce shows that the audience will want to watch, increasing their revenue.
To answer these questions we will be using the “MyAnimeList API” dataset, which was collected by Pat Mendoza. For more background, MyAnimeList works as a tracking service for individuals where people score the anime they’ve watched. MyAnimeList can also be used for people to track what episodes they’re on, which anime they have finished, what they have dropped, and anime that they want to watch. MyAnimeList also contains a list of almost all anime that have ever existed, including their title, year of airing, voice actors, and the ranking of the anime based off of the data many of the users have provided.
| Variable Name | Variable Type | Description | Potential Values |
|---|---|---|---|
| mean | numeric | Average score out of 10 | 0-10 |
| num_scoring_users | numeric | Number of people who scored that anime | 0 - 2400000 |
| title | categorical | Title of anime | n/a |
| demo_de | categorical | Genre of the anime | Shounen, Josei, Seinen, Shoujo, Kids, missing |
| start_season.season | categorical | Season anime aired | winter, spring, summer, fall |
| num_episodes | numeric | Number of episodes | 0 - 1818 |
| rating | categorical | Intended audience of the anime | r, pg_13, pg, g |
| status | categorical | Whether it is still airing or has finished | currently_airing, finished_airing |
| start_season.year | categorical | Era in which the anime started airing: O (old, before 1990), M (middle, 1990 to 2009), or R (recent, 2010 to 2022) | O, M, R |
The response variable we will be using is the mean, which is a numerical value that signifies the score of an anime from 0 to 10. The explanatory variables we will be utilizing to determine the mean include num_scoring_users, demo_de, start_season.year, start_season.season, num_episodes, rating, and status.
We have recoded num_scoring_users, as it had a noticeable skew to the right; we will be using log(num_scoring_users) to resolve this issue. Another recoded variable is start_season.year, which we changed from numerical to categorical. Rather than organizing it by year directly, we chose to separate the years into three categories: “O” for old, which includes anime that started airing before 1990; “M” for middle, which includes anime that started airing from 1990 to 2009; and “R” for recent, which includes anime that started airing from 2010 to 2022. Another recoded variable is start_season.season. Only a minuscule number of TV series have no official release date, so we removed them: they are not representative of the majority of the dataset and would add unnecessary values. We also got rid of the missing values in rating, since they will not provide anything of interest and could take away from our findings. We also recoded demo_de. We did not get rid of the anime labeled as “missing”, because the majority of the anime in the entire list carry that label and we would lose a lot of data. We did, however, fuse Kids,Shoujo and Kids,Shounen into Kids: very few anime had either combined label, so it made sense to merge them into one more meaningful category. Finally, we recoded num_episodes by filtering out data points for anime where num_episodes = 0. These points are not particularly useful for analysis, as shows usually have at least 1 episode, and the logarithm of 0 is undefined.
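The recoding steps described above can be sketched in base R as follows. This is only an illustration under stated assumptions: the data frame `anime` and every value in it are invented, the helper column `log_users` is a hypothetical name, and the original analysis may well have used different functions (for example from the tidyverse).

```r
# Toy data frame standing in for the MyAnimeList data; all values are invented.
anime <- data.frame(
  num_scoring_users = c(120, 45000, 980000, 15, 3000),
  start_season.year = c(1985, 1999, 2015, 2021, 2009),
  num_episodes      = c(26, 0, 12, 1, 50),
  demo_de           = c("Kids,Shoujo", "Shounen", "missing", "Kids,Shounen", "Josei"),
  stringsAsFactors  = FALSE
)

# Drop shows with zero episodes so that log(num_episodes) stays finite.
anime <- anime[anime$num_episodes > 0, ]

# Bin the start year into O (before 1990), M (1990-2009), and R (2010-2022).
anime$start_season.year <- cut(
  anime$start_season.year,
  breaks = c(-Inf, 1989, 2009, 2022),
  labels = c("O", "M", "R")
)

# Merge the two rare combined labels into Kids.
anime$demo_de[anime$demo_de %in% c("Kids,Shoujo", "Kids,Shounen")] <- "Kids"

# Log-transform the right-skewed count of scoring users (hypothetical column name).
anime$log_users <- log(anime$num_scoring_users)

print(anime)
```

Using `cut()` with explicit breaks keeps the three era boundaries in one place, which makes it easy to check that 1989 falls in “O” and 2009 falls in “M”.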
For this project, each of us proposes a model, fits it on the training split, and evaluates it on the testing split, for 3 distinct models in total. Each model uses mean as the response variable and explores how a different set of explanatory variables relates to an anime’s mean score. Renee’s model uses num_episodes, start_season.season, and num_scoring_users as explanatory variables. Rosa’s model uses num_scoring_users, rating, and start_season.year as explanatory variables. Ivy’s model uses num_scoring_users, demo_de, and start_season.season as explanatory variables. By comparing different groups of explanatory variables and their effect on the mean score, we are able to learn more about how viewers score anime and what variables affect a show’s popularity with viewers.
The explanatory variables I will be exploring are num_episodes, num_scoring_users, and start_season.season. Both num_episodes and num_scoring_users enter the model on the log scale.
num_episodes is an important variable to include as anime can be of varying lengths. While most anime has around 12-13 episodes, some can run for multiple seasons of varying lengths. By using this data, we can check whether longer storylines are more highly rated than shorter ones.
num_scoring_users highlights how popular a show is. By analyzing this variable, we can take into account what impact a larger viewer base has on its rating. For the analysis, I will be taking the logarithm of num_scoring_users to account for the large right skew. A scatter plot against the response variable, mean, shows a relatively strong positive association, with the data cloud mostly close to the line of best fit and a correlation of 0.632.
## cor(log(num_scoring_users), mean)
## 1 0.6323044
Lastly, start_season.season specifies which quarterly season an anime came out in. Different seasons hold different meanings in Japanese culture, and could potentially define the theme of shows released during the time period. By studying how season affects the mean score, we can learn whether shows released in a specific season are more popular and research to find out why.
From the boxplot, we can see that while the mean score tends to be around the same each season, there are slight differences in outliers. We can see this especially in winter, which has the lowest mean score and the most outliers, some being close to 0. Meanwhile, spring has outliers both close to 0 and close to 10, the maximum rating. Season could therefore have some association with the mean score, so it is beneficial to analyze.
Now I am going to create the fitted model.
# Fit the linear model on the training split
renee_fit <- lm(mean ~ log(num_scoring_users) + start_season.season + log(num_episodes), data = renee_train)
tidy(renee_fit)
## # A tibble: 6 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 3.75 0.101 37.0 4.55e-226
## 2 log(num_scoring_users) 0.252 0.00640 39.4 4.47e-248
## 3 start_season.seasonspring -0.0164 0.0358 -0.459 6.47e- 1
## 4 start_season.seasonsummer -0.129 0.0436 -2.95 3.18e- 3
## 5 start_season.seasonwinter -0.104 0.0400 -2.59 9.62e- 3
## 6 log(num_episodes) 0.285 0.0201 14.2 1.51e- 43
The model equation is mean-hat = 3.75 + 0.252 * log(num_scoring_users) - 0.0164 * start_season.seasonspring - 0.129 * start_season.seasonsummer - 0.104 * start_season.seasonwinter + 0.285 * log(num_episodes).
Spring: 3.7336 + 0.252 * log(num_scoring_users) + 0.285 * log(num_episodes)
Summer: 3.621 + 0.252 * log(num_scoring_users) + 0.285 * log(num_episodes)
Winter: 3.646 + 0.252 * log(num_scoring_users) + 0.285 * log(num_episodes)
Fall: 3.75 + 0.252 * log(num_scoring_users) + 0.285 * log(num_episodes)
Based on the equations, if we hold all other variables constant and increase log(num_scoring_users) by 1 (that is, multiply num_scoring_users by e, about 2.718), the predicted mean increases by 0.252. Likewise, if we hold all other variables constant and increase log(num_episodes) by 1 (multiply num_episodes by e), the predicted mean increases by 0.285. If log(num_scoring_users) and log(num_episodes) were both 0 (one scoring user and one episode), anime released in the spring would have a predicted mean rating of 3.73, anime released in the summer 3.62, anime released in the winter 3.65, and anime released in the fall 3.75. From the model, we can see that as the number of episodes and the number of scoring users increase, the predicted mean increases. Holding the other variables constant, anime released in the fall have the highest predicted rating, and those released in the summer have the lowest.
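Because the predictors enter through R's log(), which is the natural logarithm, each slope describes the effect of a multiplicative change in the original variable. The short derivation below works this out for the num_scoring_users coefficient of 0.252; the symbol x (the current number of scoring users) and the multiplier k are introduced here purely for illustration.

```latex
\[
\Delta\widehat{\text{mean}}
  = 0.252\,\bigl[\ln(k\,x) - \ln(x)\bigr]
  = 0.252\,\ln k
\]
\[
k = e:\ 0.252 \qquad
k = 2:\ 0.252 \times 0.693 \approx 0.175 \qquad
k = 10:\ 0.252 \times 2.303 \approx 0.580
\]
```

So, under this model, doubling the number of scoring users is associated with a rise of roughly 0.175 in the predicted mean, and a tenfold increase with a rise of roughly 0.58, holding the other variables constant.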
My proposed model will describe the mean based on num_scoring_users, rating, and start_season.year. However, as we described above, num_scoring_users has a prominent skew to the right, so I will be using log(num_scoring_users) instead. The transformed variable is also strongly correlated with the response: as depicted below, log(num_scoring_users) and mean have a correlation of 0.632.
## corr
## 1 0.6323044
This is also reflected in the scatter plot below of log(num_scoring_users) vs. mean.
Though the scatter plot has some outliers, the bulk of the data clearly shows a positive linear association between the two and fits well with the trend line. This along with the correlation coefficient show that num_scoring_users will be a beneficial variable.
The next variable I decided to add to my model was rating. Below is a boxplot depicting the relationship between that and mean, which I looked at in order to see if there was any correlation between the two.
This shows that the higher the rating gets the higher the mean score, as r and pg_13 have higher average mean than pg and g. Thus people rate anime higher if they are made for more mature audiences. Therefore as rating and mean do have some correlation, it would be beneficial to include rating in my model.
The final variable I will add to my model is start_season.year. According to the New York Times, “vast wealth has flooded the anime market in recent years” (Dooley and Hida). This recent influx of money is most likely due to the popularity of more recent anime. The same article notes that “nearly every animation studio in Japan is booked for solid years in advance” (Dooley and Hida). This is likely because recent anime have been highly rated, which creates higher demand for projects from these acclaimed studios. If so, this should be reflected in our data, and newer anime should be associated with a higher mean. This can be examined in the histogram below, which depicts the mean faceted by how old the anime is.
This histogram shows that the peak for older anime is lower and around 6.5 whereas middle and more recent ones peak around 7. Recent anime also have more of a spread and a slight left skew whereas middle is more symmetric. We can also see that recent anime had a higher peak than middle even though they peaked at the same mean. Based on this there does seem to be some relationship as the spreads are different for the different anime classifications, which further supports the background information.
Now that I have chosen my variables, I will fit my linear model.
## # A tibble: 7 × 2
## term estimate
## <chr> <dbl>
## 1 (Intercept) 4.65
## 2 log(num_scoring_users) 0.286
## 3 ratingpg -0.118
## 4 ratingpg_13 -0.209
## 5 ratingr -0.275
## 6 start_season.yearO 0.189
## 7 start_season.yearR -0.431
The equation from my linear model is: mean-hat = 4.646 + 0.286 * log(num_scoring_users) - 0.118 * ratingpg - 0.209 * ratingpg_13 - 0.275 * ratingr + 0.189 * start_season.yearO - 0.431 * start_season.yearR
Old Anime: mean-hat = 4.835 + 0.286 * log(num_scoring_users) - 0.118 * ratingpg - 0.209 * ratingpg_13 - 0.275 * ratingr
Middle Anime: mean-hat = 4.646 + 0.286 * log(num_scoring_users) - 0.118 * ratingpg - 0.209 * ratingpg_13 - 0.275 * ratingr
Recent Anime: mean-hat = 4.215 + 0.286 * log(num_scoring_users) - 0.118 * ratingpg - 0.209 * ratingpg_13 - 0.275 * ratingr
Based on the equations, we can see that if we hold the other variables constant and multiply num_scoring_users by e (that is, increase log(num_scoring_users) by 1), the predicted mean increases by 0.286. If log(num_scoring_users) were 0, old anime with a rating of g would have a mean of 4.835, on average. For middle anime with a rating of g, this value is 4.646, and for recent anime with a rating of g, it is 4.215, on average. For old anime, if log(num_scoring_users) were 0, an anime with a rating of r would have a mean of 4.560 on average, a rating of pg_13 would have a mean of 4.626 on average, and a rating of pg would have a mean of 4.717 on average. For a middle anime, a rating of r would give a mean of 4.371 on average, a rating of pg_13 a mean of 4.437, and a rating of pg a mean of 4.528. Finally, for a recent anime, a rating of r would give a mean of 3.940 on average, a rating of pg_13 a mean of 4.006, and a rating of pg a mean of 4.097.
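The baseline values listed above are all sums of the intercept, one era offset, and one rating offset, so they can be reproduced in a few lines of R. This is a minimal sketch that hard-codes the rounded coefficients from the training-split output; the object names (`intercept`, `year_offset`, `rating_offset`, `baseline`) are hypothetical.

```r
# Rounded coefficients copied from the training-split model output above.
intercept <- 4.646
year_offset <- c(O = 0.189, M = 0, R = -0.431)                      # M is the reference era
rating_offset <- c(g = 0, pg = -0.118, pg_13 = -0.209, r = -0.275)  # g is the reference rating

# Predicted mean when log(num_scoring_users) is 0, for every era/rating pair:
# intercept + era offset + rating offset.
baseline <- intercept + outer(year_offset, rating_offset, "+")
print(round(baseline, 3))
```

Rows of the resulting matrix are eras and columns are ratings, so the value for an old anime rated r is `baseline["O", "r"]`.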
Something interesting to note is that while it seemed like newer anime and ratings targeted to more mature audiences would have a higher mean, the linear model actually showed the opposite. This may be because older anime are able to accumulate more viewers as they have been around for longer. People may also be more critical of more mature anime based on the quality of gore shown, since in entertainment there tends to be a fine line between cool and cheesy.
The explanatory variables that I think are extremely important include the num_scoring_users, demo_de, and start_season.season.
start_season.season, which stands for the season of the year in which the anime first aired, seems to be an important variable, because how busy people are at different times of the year could affect how much anime they watch and how they score it.
The number of scoring users would have a heavy effect on the score. As an anime gets good reviews, people tend to spread the name of the show via word of mouth or just by looking at the ranking list. This compels other viewers to watch the same show and score it on their own, which will lead to a correlation between the score and the number of scoring users, as shown below.
Another potentially relevant explanatory variable is the genre. As shown below, we can see some differences in the scores between the genres. The 3rd quartile in the Josei genre is higher than in most of the other genres, while the Kids genre has a significantly smaller interquartile range than the other genres.
With this information, I created the model below:
## # A tibble: 10 × 2
## term estimate
## <chr> <dbl>
## 1 (Intercept) 5.19
## 2 demo_deKids -0.235
## 3 demo_demissing -0.355
## 4 demo_deSeinen -0.121
## 5 demo_deShoujo 0.0300
## 6 demo_deShounen -0.0567
## 7 start_season.seasonspring -0.0246
## 8 start_season.seasonsummer -0.202
## 9 start_season.seasonwinter -0.154
## 10 log(num_scoring_users) 0.217
Ivy’s model equation: mean-hat = 5.19 - 0.235 * demo_deKids - 0.355 * demo_demissing - 0.121 * demo_deSeinen + 0.0300 * demo_deShoujo - 0.0567 * demo_deShounen - 0.0246 * start_season.seasonspring - 0.202 * start_season.seasonsummer - 0.154 * start_season.seasonwinter + 0.217 * log(num_scoring_users)
The intercept, 5.19, is the predicted mean score for an anime in the Josei genre that aired in the fall, when log(num_scoring_users) is 0. The coefficient of demo_deKids is -0.235. It makes sense that the score is lower when the genre is Kids: most of the users on MyAnimeList tend to be teenagers or older, and kids' shows are not catered to them. The coefficient of -0.355 on missing may reflect anime whose genre is not quite solidified. Such a genre might be experimental or a combination of several, and there is a chance that these anime did not score very well because those artistic liberties are a risky choice. The coefficient of -0.121 on Seinen lowers the predicted score relative to Josei. The positive 0.0300 on Shoujo, a genre aimed at teenage girls, raises the predicted mean slightly. The coefficient of -0.0246 on start_season.seasonspring lowers the predicted score slightly compared to the fall baseline. The coefficients of -0.202 and -0.154 on summer and winter, respectively, are also lower relative to fall. Perhaps fall is a season when successful, comfortable anime come out and provide a lot of relaxation to the viewer. The positive 0.217 on log(num_scoring_users) means that as num_scoring_users increases, so does the predicted mean.
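To make the role of the baseline categories concrete, the sketch below plugs the rounded coefficients from the output above into a small helper. The function `predict_mean` and the example anime (Shounen, summer, 10,000 scoring users) are hypothetical and only illustrate how the terms add up; Josei and fall get an offset of 0 because they are the reference levels.

```r
# Rounded coefficients copied from the model output above.
intercept <- 5.19
demo_offset <- c(Josei = 0, Kids = -0.235, missing = -0.355,
                 Seinen = -0.121, Shoujo = 0.0300, Shounen = -0.0567)
season_offset <- c(fall = 0, spring = -0.0246, summer = -0.202, winter = -0.154)
slope_log_users <- 0.217

# Hypothetical helper: predicted mean score for one anime.
predict_mean <- function(demo, season, num_scoring_users) {
  unname(intercept + demo_offset[demo] + season_offset[season] +
           slope_log_users * log(num_scoring_users))
}

# Example with invented inputs: a Shounen anime that aired in summer
# and has 10,000 scoring users.
example_score <- predict_mean("Shounen", "summer", 10000)
print(round(example_score, 2))
```

Switching only the season argument from "summer" to "fall" raises the prediction by exactly 0.202, which is the summer coefficient with its sign flipped.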
| Model | Adjusted R^2 | RMSE |
|---|---|---|
| Renee Singh | 0.455 | 0.568 |
| Rosa Peterson | 0.467 | 0.568 |
| Ivy Xu | 0.427 | 0.587 |
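Metrics like those in the table above can be obtained along the following lines. This is a sketch on simulated data, not the original analysis: the data frame `toy`, the 75/25 split, and the choice to read adjusted R^2 from the training fit while computing RMSE on the held-out rows are all assumptions made for illustration.

```r
set.seed(1)

# Simulated stand-in data; the real analysis used the MyAnimeList dataset.
n <- 200
toy <- data.frame(num_scoring_users = exp(runif(n, 3, 13)))
toy$mean <- 4.5 + 0.28 * log(toy$num_scoring_users) + rnorm(n, sd = 0.5)

# Assumed 75/25 train/test split.
train_idx <- sample(n, size = 0.75 * n)
train <- toy[train_idx, ]
test <- toy[-train_idx, ]

# Fit on the training split only.
fit <- lm(mean ~ log(num_scoring_users), data = train)

# Adjusted R^2 from the fitted model.
adj_r2 <- summary(fit)$adj.r.squared

# RMSE of the predictions on the held-out testing split.
pred <- predict(fit, newdata = test)
rmse <- sqrt(mean((test$mean - pred)^2))

print(c(adj_r2 = adj_r2, rmse = rmse))
```

Adjusted R^2 penalizes extra predictors, which matters here because the three candidate models estimate different numbers of coefficients.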
The best model out of these is the model proposed by Rosa Peterson that describes mean based on num_scoring_users, rating, and start_season.year. The main reason we chose this model was that it had an adjusted R^2 of 0.467. This was the highest adjusted R^2 out of all the proposed models which means it explained the highest percentage of variability of mean. This model had the lowest RMSE, tied with the one from Renee Singh’s model, but as Rosa Peterson’s had a higher adjusted R^2, it will better reflect the data overall. Below is the output of Rosa Peterson’s model on the entirety of the anime_ranking data set.
## # A tibble: 7 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 4.69 0.0462 101. 0
## 2 log(num_scoring_users) 0.280 0.00531 52.7 0
## 3 ratingpg -0.0926 0.0399 -2.32 2.03e- 2
## 4 ratingpg_13 -0.212 0.0306 -6.94 4.76e-12
## 5 ratingr -0.214 0.0414 -5.16 2.54e- 7
## 6 start_season.yearO 0.206 0.0342 6.02 1.88e- 9
## 7 start_season.yearR -0.393 0.0215 -18.3 1.18e-71
The equation from the model above is: mean-hat = 4.690 + 0.280 * log(num_scoring_users) - 0.093 * ratingpg - 0.212 * ratingpg_13 - 0.214 * ratingr + 0.206 * start_season.yearO - 0.393 * start_season.yearR. Splitting this up by how old the anime is yields the three equations:
Old Anime: mean-hat = 4.896 + 0.280 * log(num_scoring_users) - 0.093 * ratingpg - 0.212 * ratingpg_13 - 0.214 * ratingr
Middle Anime: mean-hat = 4.690 + 0.280 * log(num_scoring_users) - 0.093 * ratingpg - 0.212 * ratingpg_13 - 0.214 * ratingr
Recent Anime: mean-hat = 4.297 + 0.280 * log(num_scoring_users) - 0.093 * ratingpg - 0.212 * ratingpg_13 - 0.214 * ratingr
Based on the equations, we can see that if we hold the other variables constant and multiply num_scoring_users by e (that is, increase log(num_scoring_users) by 1), the predicted mean increases by 0.280. If log(num_scoring_users) were 0, old anime with a rating of g would have a mean of 4.896, on average. For middle anime with a rating of g, this value is 4.690, and for recent anime with a rating of g, it is 4.297, on average. For old anime, if log(num_scoring_users) were 0, an anime with a rating of r would have a mean of 4.682 on average, a rating of pg_13 would have a mean of 4.684 on average, and a rating of pg would have a mean of 4.803 on average. For a middle anime, a rating of r would give a mean of 4.476 on average, a rating of pg_13 a mean of 4.478, and a rating of pg a mean of 4.597. Finally, for a recent anime, a rating of r would give a mean of 4.083 on average, a rating of pg_13 a mean of 4.085, and a rating of pg a mean of 4.204.
This model relates to our questions, as one of the questions we needed to answer asks about the variables that constitute a high-scoring anime. Based on the final model, those variables are num_scoring_users, rating, and start_season.year. As num_scoring_users increases, so does mean. This makes sense: people tend to watch more popular anime, and those anime gain popularity because they are good, which gives them a high mean. In turn, people are drawn to anime with a high mean, causing more people to watch them, and the process repeats. We can see that older anime is also rated higher than recent anime. However, it is interesting to note that the intercepts for the three equations were close together (4.297 to 4.896), meaning start_season.year did not have the biggest impact. The intended audience of an anime also affects its score. Within the fitted model equation, we can see that as anime gets more restrictive in audience rating, the mean lowers. This is interesting because the original exploratory data analysis showed the opposite. In conclusion, while rating and start_season.year did have an impact on the mean, this was not in the way that we originally expected.
Bibliography
Dooley, Ben, and Hikari Hida. “Anime Is Booming. So Why Are Animators Living in Poverty?” The New York Times, 24 Feb. 2021, https://www.nytimes.com/2021/02/24/business/japan-anime.html. Accessed 10 Mar. 2022.
Mendoza, Pat. “MyAnimeList API.” Kaggle, 11 Feb. 2022, https://www.kaggle.com/patmendoza/myanimelist-api. Accessed 22 Feb. 2022.
…