Karmanomics
Analysis of Reddit’s popularity ranking system
grimtrigger
12/8/2010
Contents
Introduction
Data
Descriptive Statistics
Models
    Model 1
    Model 2
    Model 3
    Model 4
    Model 5
    Model 6
Empirical Results
    Model 1
    Model 2
    Model 3
    Model 4
    Model 5
    Model 6
Discussion of variables across models
Conclusion
    Summary
    Further Research
    Limitations
Introduction
Social media is a radical new way for individuals to reach online content. It gives the user access to millions of sources, and gives nontraditional sources a level playing field with traditional sources. Understanding how readers digest these sources is important to content creators and special interests seeking to promote specific content.
This paper looks into one site in particular, Reddit.com. Reddit users, referred to as “Redditors”, can give submissions either positive or negative votes. The sum of these votes is referred to as “Karma” and reflects the popularity of a post. It is important to note that Karma begins at 1 and cannot drop below 0. The most popular posts make it to the front page, where they receive the most attention.
Submissions can take many forms, specifically: articles, blogs, self-posts, images, audio/visual and other. Most of these are self-explanatory, but self-posts refer to simple text submissions by Redditors. They do not link outside of Reddit.com, and are similar to a blog post with Reddit.com as the blog host. These different types of content are likely to receive different levels of Karma since they are digested differently.
Reddit also has a subReddit system. SubReddits are communities organized around ideologies, interests, or lifestyles. Users who identify with certain subReddits have an option to subscribe, and their personalized front page will then contain submissions to that subReddit. SubReddits are also biased towards articles which reflect their ideology, so a liberal article posted to /r/conservative can expect very little Karma.
The size of the subReddit to which an item was submitted is a likely predictor of that submission’s Karma score for two reasons. Firstly, the number of readers in that subReddit reflects the level of interest in that subject matter. Secondly, the number of readers in a subReddit is an indicator of how much face time a new submission will get before it is bumped off. In other words, there is a trade-off between submitting to a large subReddit with a lot of interest but too many voices, and a small subReddit with little interest but where each voice is well heard. In between these two extremes, there is likely a sweet spot which is most fertile for Karma.
It is important to note an anomaly in the subReddit system: the main Reddit.com subReddit. This subReddit has over 33,000 readers but has no discernible special interest. It is a vestigial part of Reddit from before the subReddit system. I flag submissions to this subReddit in my data to see if it affects Karma.
The most important predictor of Karma is immeasurable: engagement. Some posts are more popular simply because they better catch or hold the user’s attention. I use number of comments as an imperfect proxy for engagement. The rationale for this is that commenting takes more effort than simply voting. Commenting on a post indicates that the reader has made enough of a connection to the submission to invest a response.
Data
A sample of 200 submissions was randomly selected from the 100,000,000 submissions made to Reddit prior to submission number 24274635, which was made on December 5th, 2010. These 100,000,000 only include submissions made after Reddit’s subreddit feature was brought out of beta mode in 2008. Data was obtained by randomly generating 200 numbers within the range, converting these random numbers to base 36, and appending each to the URL www.reddit.com/comments/. Of these 200 observations, 51 were deemed spam and not included in the data.
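The sampling procedure described above can be sketched as follows. This is a minimal illustration, not the script actually used: the function names, the seed, and the lower bound of the ID range are assumptions (the paper does not state the lower bound), and drawing without replacement is also an assumption.

```python
import random

BASE36_DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"


def to_base36(n: int) -> str:
    """Convert a non-negative integer to its base-36 string."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return "0"
    digits = []
    while n:
        n, remainder = divmod(n, 36)
        digits.append(BASE36_DIGITS[remainder])
    return "".join(reversed(digits))


def sample_submission_urls(lowest_id, highest_id, k, seed=None):
    """Draw k distinct submission numbers and build one URL per number.

    lowest_id is a hypothetical parameter; the paper only gives the
    upper end of the range (submissions prior to number 24274635).
    """
    rng = random.Random(seed)
    ids = rng.sample(range(lowest_id, highest_id + 1), k)
    return ["www.reddit.com/comments/" + to_base36(i) for i in ids]


# Illustrative call: the lower bound of 1 is a placeholder, not a value from the paper.
urls = sample_submission_urls(lowest_id=1, highest_id=24274634, k=200, seed=2010)
print(len(urls), urls[0])
```

Each generated URL still has to be visited by hand or by a scraper to record Karma, comments, subreddit and submission type, and to screen out spam.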
Karma         Popularity ranking
Comments      Number of comments
Readers       Number of Redditors subscribed to that specific subreddit as of December 5th, 2010
Main          A binary variable; 1 if it was submitted to the “Reddit.com” subreddit, 0 if it was not
Type          Article, Blog, Self, AV, Image, Other are mutually exclusive binary variables which tell what type of submission it was; AV is short for Audio/Visual
Descriptors   Political and NSFW are binary variables that describe the link; NSFW is short for Not Safe For Work and usually refers to pornographic submissions
Descriptive Statistics
             Mean        Median    Max       Min   Std dev
Karma        11.39       1         397       0     46.12
Comments     6.36        0         205       0     22.26
Readers      232478.2    334022    437813    10    19037.1
Main         .26         0         1         0     .44
Article      .31         0         1         0     .46
Blog         .13         0         1         0     .34
Self         .20         0         1         0     .32
AV           .13         0         1         0     .33
Image        .19         0         1         0     .40
Other        .05         0         1         0     .22
Political    .12         0         1         0     .32
NSFW         .03         0         1         0     .18
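For the binary variables, the standard deviation is determined by the mean, so the two columns can be cross-checked. The sketch below is only a consistency check under stated assumptions: it uses the sample size of 149 given in the Limitations section, the sample (n-1) form of the standard deviation, and the means as rounded in the table, so agreement can only be expected to within about 0.01 to 0.02.

```python
import math


def binary_std(mean, n):
    """Sample standard deviation of a 0/1 variable with the given mean."""
    return math.sqrt(mean * (1.0 - mean) * n / (n - 1))


N = 149  # non-spam observations, as stated in the Limitations section

# (mean, reported std dev) copied from the descriptive statistics table
reported = {
    "Main": (0.26, 0.44),
    "Article": (0.31, 0.46),
    "Blog": (0.13, 0.34),
    "Self": (0.20, 0.32),
    "AV": (0.13, 0.33),
    "Image": (0.19, 0.40),
    "Other": (0.05, 0.22),
    "Political": (0.12, 0.32),
    "NSFW": (0.03, 0.18),
}

for name, (mean, std) in reported.items():
    implied = binary_std(mean, N)
    print(f"{name:<10} reported {std:.2f}  implied {implied:.2f}")
```

Most rows agree to within rounding. Self is the row where the implied and reported values differ the most, so that entry may be worth rechecking against the raw data.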
Models
Model 1
LOG(KARMA+1) = β_1 LOG(COMMENTS+1) + β_2 LOG(READERS) + β_3 LOG(READERS)^2 + β_4 MAIN + β_5 ARTICLE + β_6 BLOG + β_7 SELF + β_8 IMAGE + β_9 AV + β_{10} POLITICAL + β_{11} NSFW
This is the basic model I will be testing. It gives a logarithmic interpretation to the quantitative variables since they have exponential distributions. It is regressed through the origin since it can be assumed that if all the independent variables evaluate to 0, LOG(KARMA+1) should be 0.
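To make the specification concrete, here is a minimal sketch of how one observation could be turned into the Model 1 regressors and how a regression through the origin could be fitted by ordinary least squares. It is an illustration under assumptions, not the software used for the results in this paper: the function names are hypothetical, LOG is taken to be the natural logarithm, and the normal equations are solved with a small hand-written Gaussian elimination to keep the example dependency-free.

```python
import math


def model1_row(comments, readers, main, article, blog, self_post,
               image, av, political, nsfw):
    """Regressors of Model 1 for one submission (no intercept column)."""
    log_readers = math.log(readers)  # natural log, an assumption
    return [math.log(comments + 1), log_readers, log_readers ** 2,
            main, article, blog, self_post, image, av, political, nsfw]


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("regressors are collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def ols_through_origin(rows, y):
    """Least-squares coefficients for y = X beta with no intercept."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
           for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# A submission with no comments, one reader and every dummy at 0 has all
# regressors equal to 0, which is the case the through-the-origin
# assumption refers to.
print(model1_row(0, 1, 0, 0, 0, 0, 0, 0, 0, 0))
```

Fitting the models with an intercept (Models 4 to 6) would only require appending a constant 1 to each row.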
Model 2
LOG(KARMA+1) = β_1 ABS(SELF-1)*LOG(COMMENTS+1) + β_2 LOG(READERS) + β_3 LOG(READERS)^2 + β_4 MAIN + β_5 ARTICLE + β_6 BLOG + β_7 SELF + β_8 IMAGE + β_9 AV + β_{10} POLITICAL + β_{11} NSFW
This is similar to Model 1, but only includes COMMENTS as an independent variable for non-SELF submissions. This model recognizes that self-posts often have more comments, not because they are more engaging, but because they are asking a question. Karma is ma
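The interaction term can be illustrated with a short sketch. Since SELF is 0 or 1, ABS(SELF-1) is 1 for non-self submissions and 0 for self-posts, so the comment count drops out for self-posts. The function name is hypothetical and LOG is assumed to be the natural logarithm.

```python
import math


def comments_term(self_post, comments):
    """ABS(SELF-1)*LOG(COMMENTS+1): comments count only for non-self posts."""
    return abs(self_post - 1) * math.log(comments + 1)


print(comments_term(self_post=0, comments=9))   # non-self post: log(10)
print(comments_term(self_post=1, comments=50))  # self-post: term is zeroed out
```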
Model 3
LOG(KARMA+1) = β_2 LOG(READERS) + β_3 LOG(READERS)^2 + β_4 MAIN + β_5 ARTICLE + β_6 BLOG + β_7 SELF + β_8 IMAGE + β_9 AV + β_{10} POLITICAL + β_{11} NSFW
This model does not include COMMENTS. It recognizes that COMMENTS is an imperfect measure of how engaging a submission is, and it may be better to leave it out altogether.
Model 4
LOG(KARMA+1) = β_1 LOG(COMMENTS+1) + β_2 LOG(READERS) + β_3 LOG(READERS)^2 + β_4 MAIN + β_5 ARTICLE + β_6 BLOG + β_7 SELF + β_8 IMAGE + β_9 AV + β_{10} POLITICAL + β_{11} NSFW + C
This is the same as Model 1 but includes an intercept. It goes against the idea that if all the independent variables evaluate to 0, then LOG(KARMA+1) should be 0; however, it grants the model more flexibility.
Model 5
LOG(KARMA+1) = β_1 ABS(SELF-1)*LOG(COMMENTS+1) + β_2 LOG(READERS) + β_3 LOG(READERS)^2 + β_4 MAIN + β_5 ARTICLE + β_6 BLOG + β_7 SELF + β_8 IMAGE + β_9 AV + β_{10} POLITICAL + β_{11} NSFW + C
This is the same as Model 2 but includes an intercept. It goes against the idea that if all the independent variables evaluate to 0, then LOG(KARMA+1) should be 0; however, it grants the model more flexibility.
Model 6
LOG(KARMA+1) = β_2 LOG(READERS) + β_3 LOG(READERS)^2 + β_4 MAIN + β_5 ARTICLE + β_6 BLOG + β_7 SELF + β_8 IMAGE + β_9 AV + β_{10} POLITICAL + β_{11} NSFW + C
This is the same as Model 3 but includes an intercept. It goes against the idea that if all the independent variables evaluate to 0, then LOG(KARMA+1) should be 0; however, it grants the model more flexibility.
Empirical Results
Model 1
Dependent Variable: LOG(KARMA+1)
Variable              Coefficient   Std. Error   t-Statistic   Prob.
LOG(COMMENTS+1)       0.806755      0.058561     13.77636      0.0000
LOG(READERS)          0.210524      0.060243     3.494584      0.0006
LOG(READERS)^2        0.015733      0.004187     3.757704      0.0003
MAIN                  0.336567      0.165573     2.032741      0.0440
ARTICLE               0.487596      0.239094     2.039354      0.0434
BLOG                  0.152499      0.262717     0.580469      0.5626
SELF                  0.279417      0.251761     1.109850      0.2690
IMAGE                 0.415287      0.255414     1.625938      0.1063
AV                    0.117190      0.271638     0.431421      0.6669
POLITICAL             0.133671      0.200026     0.668270      0.5051
NSFW                  0.318165      0.356252     0.893090      0.3734

R-squared             0.631585
Adjusted R-squared    0.604295
The R^2 and adjusted R^2 of the model are promising, explaining 63% and 60% of the variance respectively. But the absolute t-statistics for the non-binary variables, LOG(COMMENTS+1), LOG(READERS), and LOG(READERS)^2, are large. The absolute t-statistics for MAIN and ARTICLE are smaller, but would still be rejected at the 5% level.
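As a reading aid for the regression tables, the t-statistic column is the coefficient divided by its standard error, and the Prob. column is the two-sided probability of seeing a t-statistic at least that large if the true coefficient were zero. In the standard reading, a large absolute t-statistic (small Prob.) means the hypothesis of a zero coefficient is rejected. The sketch below recomputes both from the table magnitudes. It uses a normal approximation in place of the Student's t distribution that the Prob. column presumably uses, which is an assumption, so the probabilities differ slightly from those reported; the function names are illustrative.

```python
import math


def t_statistic(coefficient, std_error):
    """Ratio of a coefficient to its standard error."""
    return coefficient / std_error


def two_sided_p_normal(t):
    """Two-sided tail probability of |t| under a standard normal."""
    return math.erfc(abs(t) / math.sqrt(2.0))


# (coefficient, std. error) magnitudes for three Model 1 rows
model1 = {
    "LOG(COMMENTS+1)": (0.806755, 0.058561),
    "MAIN": (0.336567, 0.165573),
    "BLOG": (0.152499, 0.262717),
}

for name, (coef, se) in model1.items():
    t = t_statistic(coef, se)
    print(f"{name:<16} t = {t:.4f}  approx. p = {two_sided_p_normal(t):.4f}")
```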
Although the explanatory power of the model is still in question, it does seem to match the theory. LOG(COMMENTS+1) has a positive coefficient. The positive and negative coefficients of LOG(READERS) and LOG(READERS)^2 form a downward-facing parabola that matches the theory.
One thing that does not make sense is the positive coefficient of MAIN. However, this alone does not necessarily justify discarding the model.
LOG(KARMA+1) is maximized when a subReddit has 646,711 readers. This is, worryingly, above the maximum number of readers observed.
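The turning-point arithmetic can be checked with a short worked example. It assumes natural logarithms and takes the LOG(READERS)^2 coefficient as negative, as the discussion above describes (the table lists magnitudes). For a curve a*L + b*L^2 with L = ln(READERS) and b < 0, the stationary point is at L = -a/(2b), while L = -a/b is where the curve falls back to zero. The sketch computes both so they can be compared with the figure quoted above; the function names are illustrative.

```python
import math


def zero_crossing_readers(b_log, b_log_sq):
    """READERS at which b_log*L + b_log_sq*L**2 returns to zero (L = ln READERS)."""
    return math.exp(-b_log / b_log_sq)


def vertex_readers(b_log, b_log_sq):
    """READERS at the stationary point of b_log*L + b_log_sq*L**2."""
    return math.exp(-b_log / (2.0 * b_log_sq))


# Model 1 coefficients, with the squared term taken as negative (assumption)
b_log, b_log_sq = 0.210524, -0.015733

print(f"zero crossing: {zero_crossing_readers(b_log, b_log_sq):,.0f} readers")
print(f"vertex:        {vertex_readers(b_log, b_log_sq):,.0f} readers")
```

With the Model 1 coefficients, the zero-crossing formula lands close to the 646,711 quoted above, while the vertex formula gives a much smaller value. Which of the two the reported figures correspond to is worth confirming against the original computation.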
Model 2
Dependent Variable: LOG(KARMA+1)
Variable                       Coefficient   Std. Error   t-Statistic   Prob.
ABS(SELF-1)*LOG(COMMENTS+1)    0.896279      0.085624     10.46765      0.0000
LOG(READERS)                   0.210645      0.069645     3.024566      0.0030
LOG(READERS)^2                 0.014996      0.004837     3.099978      0.0024
MAIN                           0.475060      0.194943     2.436923      0.0161
ARTICLE                        0.376671      0.275028     1.369576      0.1731
BLOG                           0.263911      0.302394     0.872741      0.3844
SELF                           1.047127      0.286963     3.649001      0.0004
IMAGE                          0.196042      0.296138     0.661997      0.5091
AV                             0.047760      0.312909     0.152633      0.8789
POLITICAL                      0.182298      0.230359     0.791364      0.4301
NSFW                           0.207078      0.413124     0.501249      0.6170

R-squared                      0.510749
Adjusted R-squared             0.474509
The R^2 and adjusted R^2 took a hit. While the t-statistics of LOG(READERS) and LOG(READERS)^2 became slightly more promising, they would still be rejected at the 1% level. This model has less explanatory power than Model 1 but partially solves the high t-statistic of LOG(COMMENTS+1).
One thing that does not make sense is the positive coefficient of MAIN. However, this alone does not necessarily justify discarding the model.
LOG(KARMA+1) is maximized when a subReddit has 1,260,155 readers. Once again, this is alarmingly high.
Model 3
Dependent Variable: LOG(KARMA+1)
Variable              Coefficient   Std. Error   t-Statistic   Prob.
LOG(READERS)          0.331916      0.091763     3.617118      0.0004
LOG(READERS)^2        0.020601      0.006424     3.206884      0.0017
MAIN                  0.086370      0.249326     0.346416      0.7296
ARTICLE               0.215134      0.365959     0.587864      0.5576
BLOG                  0.358161      0.403927     0.886697      0.3768
SELF                  0.513964      0.377419     1.361787      0.1755
IMAGE                 0.560227      0.393019     1.425446      0.1563
AV                    0.001797      0.418129     0.004299      0.9966
POLITICAL             0.311065      0.306698     1.014238      0.3123
NSFW                  0.926862      0.544405     1.702522      0.0909

R-squared             0.114504
Adjusted R-squared    0.056333
Compared to the previous two models, this one seems much less able to explain the variation in LOG(KARMA+1). It seems clear that COMMENTS should be included in the model in one way or another. This is also the first of the models to give MAIN a negative coefficient.
LOG(KARMA+1) is maximized when a subReddit has 9,935,701 readers. Once again, this is alarmingly high.
Model 4
Dependent Variable: LOG(KARMA+1)
Variable              Coefficient   Std. Error   t-Statistic   Prob.
LOG(COMMENTS+1)       0.823607      0.058245     14.14045      0.0000
LOG(READERS)          0.075551      0.142707     0.529417      0.5974
LOG(READERS)^2        0.000407      0.008085     0.050355      0.9599
MAIN                  0.251460      0.167757     1.498955      0.1362
ARTICLE               0.263746      0.256681     1.027523      0.3060
BLOG                  0.374099      0.277856     1.346376      0.1805
SELF                  0.529140      0.272857     1.939254      0.0546
IMAGE                 0.133272      0.282462     0.471823      0.6378
AV                    0.136549      0.291516     0.468411      0.6403
POLITICAL             0.084747      0.198470     0.427003      0.6701
NSFW                  0.338229      0.351383     0.962566      0.3375
C                     1.436578      0.651598     2.204700      0.0292

R-squared             0.644481
Adjusted R-squared    0.615297
Compared to its intercept-lacking counterpart, Model 1, this model explains the variance in LOG(KARMA+1) slightly better. On another positive note, all the variables except LOG(COMMENTS+1) and C would not be rejected at the 5% level. The addition of an intercept increases the explanatory power of Model 1.
However, the large coefficient of C runs contrary to theory. The positive coefficient of MAIN also does not make sense. Even more troubling, both LOG(READERS) and LOG(READERS)^2 have negative coefficients, meaning a subReddit of 1 reader is most fertile for Karma. This is obviously untrue, and I feel it is enough to throw this model out.
Model 5
Dependent Variable: LOG(KARMA+1)
Variable                       Coefficient   Std. Error   t-Statistic   Prob.
ABS(SELF-1)*LOG(COMMENTS+1)    0.898469      0.085850     10.46556      0.0000
LOG(READERS)                   0.107513      0.165289     0.650453      0.5165
LOG(READERS)^2                 0.009460      0.009390     1.007499      0.3155
MAIN                           0.442396      0.201006     2.200903      0.0295
ARTICLE                        0.293853      0.300690     0.977264      0.3302
BLOG                           0.345612      0.325407     1.062092      0.2901
SELF                           0.963876      0.311929     3.090049      0.0024
IMAGE                          0.093972      0.331711     0.283295      0.7774
AV                             0.045044      0.341285     0.131983      0.8952
POLITICAL                      0.165556      0.232088     0.713334      0.4769
NSFW                           0.217200      0.414193     0.524393      0.6009
C                              0.521011      0.756962     0.688292      0.4925

R-squared                      0.512473
Adjusted R-squared             0.472452
Relative to Model 2, which lacks an intercept, this model explains about the same amount of variance. The t-statistics for all variables other than ABS(SELF-1)*LOG(COMMENTS+1) are promising. And although including an intercept runs contrary to theory, the intercept is relatively small.
LOG(KARMA+1) is maximized when a subReddit has 86,250 readers. This is the first such value to make practical sense.
Model 6
Dependent Variable: LOG(KARMA+1)
Variable              Coefficient   Std. Error   t-Statistic   Prob.
LOG(READERS)          0.286928      0.220038     1.303996      0.1944
LOG(READERS)^2        0.018185      0.012518     1.452713      0.1486
MAIN                  0.101241      0.258768     0.391243      0.6962
ARTICLE               0.178752      0.401224     0.445516      0.6567
BLOG                  0.394004      0.435486     0.904747      0.3672
SELF                  0.476977      0.412836     1.155366      0.2500
IMAGE                 0.515965      0.440681     1.170835      0.2437
AV                    0.042448      0.456793     0.092925      0.9261
POLITICAL             0.303863      0.309424     0.982028      0.3278
NSFW                  0.932055      0.546788     1.704600      0.0906
C                     0.227920      1.012463     0.225114      0.8222

R-squared             0.114834
Adjusted R-squared    0.049748
The addition of an intercept to Model 3 barely changes the explanatory power of the model. However, the absolute t-statistic of each individual variable is impressively small, with only NSFW being rejected at the 10% level. Also promising is that even though including an intercept runs contrary to theory, its coefficient is small. The negative coefficient of MAIN is also promising.
LOG(KARMA+1) is maximized when a subReddit has 7,119,007 readers. Once again, this is alarmingly high.
Discussion of variables across models
The inclusion of LOG(COMMENTS+1) in some form consistently increased the explanatory power of the model; however, its removal increased the probability that the other variables were part of the model. Model 5 seems to exemplify the best trade-off, where comments relate to Karma only for non-self posts.
All but Model 4 showed that there is a number of readers at which Karma is maximized. However, Model 5 was the only one to give a maximizing number of readers within the range of readers observed. According to Model 5, 86,250 is the magic number of readers.
MAIN was negative in only 2 models, neither of which is Model 5, which seems the most promising. In Model 5, however, MAIN would be rejected at the 5% level. More testing is needed to see how submitting to the Reddit.com subReddit affects Karma.
ARTICLE and IMAGE have positive coefficients in every model, with comfortable t-statistics. It seems conclusive that articles and images receive more Karma than their counterparts.
AV and SELF have sometimes positive and sometimes negative coefficients. In Model 5, both AV and SELF have a negative effect. However, more testing is needed to verify this.
BLOG was the only variable to be negative in every single model. Furthermore, none of the models would reject the effect of BLOG at the 5% level.
POLITICAL has a consistently positive coefficient, while NSFW has a consistently negative coefficient.
Conclusion
Summary
My analysis shows that articles, images, and submissions with a political spin lend themselves to more Karma. Blogs and pornography, however, generate less Karma. Self posts and audio/visual submissions are somewhere in between.
There is also a sweet spot in the size of a subReddit. 86,250 is a rough estimate of where that sweet spot may be. SubReddits smaller than this don’t display the numbers needed to maximize Karma, while subReddits larger than this crowd it out.
There is a lack of evidence to conclude that comments are a good predictor of Karma.
Further Research
The use of comments as a proxy for engagement seemed to be the most notable flaw in my models. If a better proxy could be found, better models could be created.
Limitations
The sample size was unfortunately small: 149 submissions once spam posts are excluded. A larger sample size would greatly increase the validity of the model as well as better expose variable interactions.
Another limitation was the blurry line between the submission categories, particularly between articles and blogs. It is hard to decipher what is a blog when established sources hire bloggers to create content, and individual bloggers attempt to present themselves as established sources.
Another difficult question is what would be considered political. Is a news story which is not overtly editorialized but slightly biased considered political? If the bias wasn’t obvious at a cursory glance, I wouldn’t mark it as political. However, a deeper reading might reveal it was so.
