
Karmanomics

Analysis of Reddit’s popularity ranking system

grimtrigger

12/8/2010


Contents

Introduction
Data
Descriptive Statistics
Models
  Model 1
  Model 2
  Model 3
  Model 4
  Model 5
  Model 6
Empirical Results
  Model 1
  Model 2
  Model 3
  Model 4
  Model 5
  Model 6
Discussion of variables across models
Conclusion
  Summary
  Further Research
  Limitations

Introduction

Social media is a radical new way for individuals to reach online content. It gives users access to millions of sources, and gives non-traditional sources a level playing field with traditional ones. Understanding how readers digest these sources is important to content creators and special interests seeking to promote specific content.

This paper looks into one site in particular: Reddit.com. Reddit users, referred to as “Redditors”, can give submissions either positive or negative votes. The sum of these votes is referred to as “Karma” and reflects the popularity of a post. It is important to note that Karma begins at 1 and cannot drop below 0. The most popular posts make it to the front page, where they receive the most attention.

Submissions can take many forms: articles, blogs, self-posts, images, audio/visual, and other. Most of these are self-explanatory, but self-posts are simple text submissions by Redditors. They do not link outside of Reddit.com, and are similar to a blog post with Reddit.com as the host. These different types of content are likely to receive different levels of Karma since they are digested differently.

Reddit also has a subReddit system. SubReddits are communities organized around ideologies, interests, or lifestyles. Users who identify with certain subReddits have the option to subscribe, and will have their personalized front page contain submissions to those subReddits. SubReddits are also biased towards articles which reflect their ideology, so a liberal article posted to /r/conservative can expect very little Karma.

The size of the subReddit to which an item is submitted is a likely predictor of that submission’s Karma score for two reasons. First, the number of readers in that subReddit reflects the level of interest in that subject matter. Second, the number of readers in a subReddit indicates how much face time a new submission will get before it is bumped off. In other words, there is a tradeoff between submitting to a large subReddit, with a lot of interest but too many voices, and a small subReddit, with little interest but where each voice is well heard. Between these two extremes, there is likely a sweet spot which is most fertile for Karma.

It is important to note an anomaly in the subReddit system: the main Reddit.com subReddit. This subReddit has over 33,000 readers but no discernible special interest. It is a vestigial part of Reddit from before the subReddit system. I mark submissions to this subReddit in my data to see if it affects Karma.

The most important predictor of Karma is immeasurable: engagement. Some posts are more popular simply because they better catch or hold the user’s attention. I use the number of comments as an imperfect proxy for engagement. The rationale for this is that commenting takes more effort than simply voting: commenting on a post indicates that the reader has made enough of a connection to the submission to invest a response.

Data

A sample of 200 submissions was randomly selected from the 100,000,000 submissions made to Reddit prior to submission number 24274635, which was made on December 5th, 2010. These 100,000,000 include only submissions made after Reddit’s subReddit feature was brought out of beta in 2008. Data was obtained by randomly generating 200 numbers within the range, converting these random numbers to base-36, and appending each to the URL www.reddit.com/comments/ . 51 of these 200 observations were deemed spam and were not included in the data.
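The sampling procedure above can be sketched as follows. The cutoff submission number and URL prefix come from the text; the base-36 encoding function and the seed are illustrative assumptions.

```python
import random

BASE36 = "0123456789abcdefghijklmnopqrstuvwxyz"

def to_base36(n):
    """Convert a non-negative integer to its base-36 string form."""
    if n == 0:
        return "0"
    out = []
    while n:
        n, r = divmod(n, 36)
        out.append(BASE36[r])
    return "".join(reversed(out))

# Draw 200 submission numbers below the cutoff and build the comment URLs.
CUTOFF = 24274635  # submission number from December 5th, 2010
random.seed(1)     # hypothetical seed, for reproducibility only
sample_ids = random.sample(range(1, CUTOFF), 200)
urls = ["http://www.reddit.com/comments/" + to_base36(i) for i in sample_ids]
```

Each URL can then be fetched and classified by hand as spam or as one of the submission types below.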

Karma: Popularity ranking

Comments: Number of comments

Readers: Number of Redditors subscribed to that specific subReddit as of December 5th, 2010

Main: A binary variable; 1 if it was submitted to the “Reddit.com” subReddit, 0 if it was not

Type: Article, Blog, Self, AV, Image, and Other are mutually exclusive binary variables which tell what type of submission it was; AV is short for Audio/Visual

Descriptors: Political and NSFW are binary variables that describe the link; NSFW is short for Not Safe For Work and usually refers to pornographic submissions

Descriptive Statistics

 

Variable     Mean       Median    Max       Min    Std dev
Karma        11.39      1         397       0      46.12
Comments     6.36       0         205       0      22.26
Readers      232478.2   334022    437813    10     19037.1
Main         .26        0         1         0      .44
Article      .31        0         1         0      .46
Blog         .13        0         1         0      .34
Self         .20        0         1         0      .32
AV           .13        0         1         0      .33
Image        .19        0         1         0      .40
Other        .05        0         1         0      .22
Political    .12        0         1         0      .32
NSFW         .03        0         1         0      .18
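The summary statistics in the table above can be reproduced for any one variable along these lines (a sketch with made-up values, not the paper’s raw sample):

```python
import numpy as np

# Illustrative karma scores; the actual 149-observation sample is not reproduced here.
karma = np.array([0, 1, 1, 2, 397, 5, 1, 0])

stats = {
    "mean": karma.mean(),
    "median": np.median(karma),
    "max": karma.max(),
    "min": karma.min(),
    "std_dev": karma.std(ddof=1),  # sample standard deviation, as in the table
}
```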

Models

Model 1

LOG(KARMA+1) = β1 LOG(COMMENTS+1) + β2 LOG(READERS) + β3 LOG(READERS)^2 + β4 MAIN + β5 ARTICLE + β6 BLOG + β7 SELF + β8 IMAGE + β9 AV + β10 POLITICAL + β11 NSFW

This is the basic model I will be testing. It gives a logarithmic interpretation to the quantitative variables since they have roughly exponential distributions. It is regressed through the origin since it can be assumed that if all the independent variables evaluate to 0, LOG(KARMA+1) should be 0.
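A minimal sketch of estimating such a model through the origin is below, using synthetic stand-in data and ordinary least squares (the paper’s own estimates were presumably produced in a statistics package; only one of the binary dummies is shown here):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 149  # sample size after removing spam

# Synthetic stand-ins for the observed variables.
karma = rng.integers(0, 398, n)
comments = rng.integers(0, 206, n)
readers = rng.integers(10, 437_814, n)
article = rng.integers(0, 2, n)  # one of the mutually exclusive type dummies

# Log transforms; the +1 keeps the log defined when karma or comments is 0.
y = np.log(karma + 1)
X = np.column_stack([
    np.log(comments + 1),
    np.log(readers),
    np.log(readers) ** 2,
    article,
])

# Regression through the origin: no constant column in X.
beta, residuals, rank, _ = np.linalg.lstsq(X, y, rcond=None)
```

Models 4 through 6 differ only in that a column of ones (the intercept C) is appended to X.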

Model 2

LOG(KARMA+1) = β1 ABS(SELF-1)*LOG(COMMENTS+1) + β2 LOG(READERS) + β3 LOG(READERS)^2 + β4 MAIN + β5 ARTICLE + β6 BLOG + β7 SELF + β8 IMAGE + β9 AV + β10 POLITICAL + β11 NSFW

This is similar to Model 1, but includes COMMENTS as an independent variable only for non-SELF submissions. This model recognizes that self-posts often have more comments not because they are more engaging, but because they are asking a question.
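The interaction term can be built as below: ABS(SELF-1) is 1 for non-self posts and 0 for self-posts, so comments only enter the regression for non-self submissions (a sketch with hypothetical values):

```python
import numpy as np

self_post = np.array([0, 1, 0, 1])   # the SELF dummy
comments = np.array([3, 50, 0, 12])  # raw comment counts

# ABS(SELF - 1) zeroes out the comment regressor whenever SELF == 1.
term = np.abs(self_post - 1) * np.log(comments + 1)
```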

Model 3

LOG(KARMA+1) = β2 LOG(READERS) + β3 LOG(READERS)^2 + β4 MAIN + β5 ARTICLE + β6 BLOG + β7 SELF + β8 IMAGE + β9 AV + β10 POLITICAL + β11 NSFW

This model does not include COMMENTS. It recognizes that COMMENTS is an imperfect measure of how engaging a submission is, and that it may be better to leave it out altogether.

Model 4

LOG(KARMA+1) = β1 LOG(COMMENTS+1) + β2 LOG(READERS) + β3 LOG(READERS)^2 + β4 MAIN + β5 ARTICLE + β6 BLOG + β7 SELF + β8 IMAGE + β9 AV + β10 POLITICAL + β11 NSFW + C

This is the same as Model 1 but includes an intercept. It goes against the idea that if all the independent variables evaluate to 0 then LOG(KARMA+1) should be 0; however, it grants the model more flexibility.

Model 5

LOG(KARMA+1) = β1 ABS(SELF-1)*LOG(COMMENTS+1) + β2 LOG(READERS) + β3 LOG(READERS)^2 + β4 MAIN + β5 ARTICLE + β6 BLOG + β7 SELF + β8 IMAGE + β9 AV + β10 POLITICAL + β11 NSFW + C

This is the same as Model 2 but includes an intercept. It goes against the idea that if all the independent variables evaluate to 0 then LOG(KARMA+1) should be 0; however, it grants the model more flexibility.

Model 6

LOG(KARMA+1) = β2 LOG(READERS) + β3 LOG(READERS)^2 + β4 MAIN + β5 ARTICLE + β6 BLOG + β7 SELF + β8 IMAGE + β9 AV + β10 POLITICAL + β11 NSFW + C

This is the same as Model 3 but includes an intercept. It goes against the idea that if all the independent variables evaluate to 0 then LOG(KARMA+1) should be 0; however, it grants the model more flexibility.
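In design-matrix terms, adding the intercept C in Models 4 through 6 amounts to appending a column of ones to the regressor matrix (a sketch with hypothetical numbers):

```python
import numpy as np

# A hypothetical design matrix without an intercept: 5 observations, 3 regressors.
X = np.arange(15.0).reshape(5, 3)

# Appending a constant column of ones estimates the intercept C alongside the betas.
X_c = np.column_stack([X, np.ones(len(X))])
```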

Empirical Results

Model 1

Dependent Variable: LOG(KARMA+1)

Variable            Coefficient   Std. Error   t-Statistic   Prob.
LOG(COMMENTS+1)      0.806755     0.058561     13.77636      0.0000
LOG(READERS)         0.210524     0.060243      3.494584     0.0006
LOG(READERS)^2      -0.015733     0.004187     -3.757704     0.0003
MAIN                 0.336567     0.165573      2.032741     0.0440
ARTICLE              0.487596     0.239094      2.039354     0.0434
BLOG                -0.152499     0.262717     -0.580469     0.5626
SELF                -0.279417     0.251761     -1.109850     0.2690
IMAGE                0.415287     0.255414      1.625938     0.1063
AV                   0.117190     0.271638      0.431421     0.6669
POLITICAL            0.133671     0.200026      0.668270     0.5051
NSFW                -0.318165     0.356252     -0.893090     0.3734

R-squared            0.631585
Adjusted R-squared   0.604295

The R² and adjusted R² in the model are promising, explaining 63% and 60% of the variance respectively. But the absolute t-statistics for the non-binary variables, LOG(COMMENTS+1), LOG(READERS), and LOG(READERS)^2, are large. The absolute t-statistics for MAIN and ARTICLE are smaller, but the null of a zero coefficient would still be rejected at the 5% level.

Although the explanatory power of the model is still in question, it does seem to match the theory. LOG(COMMENTS+1) has a positive coefficient, and the positive and negative coefficients of LOG(READERS) and LOG(READERS)^2 form a downward-facing parabola, as the theory predicts.

One thing that does not make sense is the positive coefficient of MAIN. However, this does not by itself justify discarding the model.

LOG(KARMA+1) is maximized when a subReddit has 646,711 readers. This is, worryingly, above the maximum number of READERS observed.

Model 2

Dependent Variable: LOG(KARMA+1)

Variable                      Coefficient   Std. Error   t-Statistic   Prob.
ABS(SELF-1)*LOG(COMMENTS+1)    0.896279     0.085624     10.46765      0.0000
LOG(READERS)                   0.210645     0.069645      3.024566     0.0030
LOG(READERS)^2                -0.014996     0.004837     -3.099978     0.0024
MAIN                           0.475060     0.194943      2.436923     0.0161
ARTICLE                        0.376671     0.275028      1.369576     0.1731
BLOG                          -0.263911     0.302394     -0.872741     0.3844
SELF                           1.047127     0.286963      3.649001     0.0004
IMAGE                          0.196042     0.296138      0.661997     0.5091
AV                             0.047760     0.312909      0.152633     0.8789
POLITICAL                      0.182298     0.230359      0.791364     0.4301
NSFW                          -0.207078     0.413124     -0.501249     0.6170

R-squared            0.510749
Adjusted R-squared   0.474509

The R² and adjusted R² took a hit. While the t-statistics of LOG(READERS) and LOG(READERS)^2 moderated slightly, the null of a zero coefficient would still be rejected at the 1% level. This model has less explanatory power than Model 1, but it partially resolves the outsized t-statistic of LOG(COMMENTS+1).

One thing that does not make sense is the positive coefficient of MAIN. However, this does not by itself justify discarding the model.

LOG(KARMA+1) is maximized when a subReddit has 1,260,155 readers. Once again, this is alarmingly high.

Model 3

Dependent Variable: LOG(KARMA+1)

Variable            Coefficient   Std. Error   t-Statistic   Prob.
LOG(READERS)         0.331916     0.091763      3.617118     0.0004
LOG(READERS)^2      -0.020601     0.006424     -3.206884     0.0017
MAIN                -0.086370     0.249326     -0.346416     0.7296
ARTICLE              0.215134     0.365959      0.587864     0.5576
BLOG                -0.358161     0.403927     -0.886697     0.3768
SELF                 0.513964     0.377419      1.361787     0.1755
IMAGE                0.560227     0.393019      1.425446     0.1563
AV                  -0.001797     0.418129     -0.004299     0.9966
POLITICAL            0.311065     0.306698      1.014238     0.3123
NSFW                -0.926862     0.544405     -1.702522     0.0909

R-squared            0.114504
Adjusted R-squared   0.056333

Compared to the previous two models, this one seems much less able to explain the variation in LOG(KARMA+1). It seems clear that COMMENTS should be included in the model in one way or another. This is also the first of the models to give MAIN a negative coefficient.

LOG(KARMA+1) is maximized when a subReddit has 9,935,701 readers. Once again, this is alarmingly high.

Model 4

Dependent Variable: LOG(KARMA+1)

Variable            Coefficient   Std. Error   t-Statistic   Prob.
LOG(COMMENTS+1)      0.823607     0.058245     14.14045      0.0000
LOG(READERS)        -0.075551     0.142707     -0.529417     0.5974
LOG(READERS)^2      -0.000407     0.008085     -0.050355     0.9599
MAIN                 0.251460     0.167757      1.498955     0.1362
ARTICLE              0.263746     0.256681      1.027523     0.3060
BLOG                -0.374099     0.277856     -1.346376     0.1805
SELF                -0.529140     0.272857     -1.939254     0.0546
IMAGE                0.133272     0.282462      0.471823     0.6378
AV                  -0.136549     0.291516     -0.468411     0.6403
POLITICAL            0.084747     0.198470      0.427003     0.6701
NSFW                -0.338229     0.351383     -0.962566     0.3375
C                    1.436578     0.651598      2.204700     0.0292

R-squared            0.644481
Adjusted R-squared   0.615297

Compared to its intercept-lacking counterpart, Model 1, this model explains the variance slightly better; the addition of an intercept increases the explanatory power. On another positive note, for every variable except LOG(COMMENTS+1) and C, the null of a zero coefficient would not be rejected at the 5% level.

However, the large coefficient of C runs contrary to theory, and the positive coefficient of MAIN also does not make sense. Even more troubling, both LOG(READERS) and LOG(READERS)^2 have negative coefficients, implying that a subReddit with a single reader is the most fertile for Karma. This is obviously untrue, and I feel it is enough to throw this model out.

Model 5

Dependent Variable: LOG(KARMA+1)

Variable                      Coefficient   Std. Error   t-Statistic   Prob.
ABS(SELF-1)*LOG(COMMENTS+1)    0.898469     0.085850     10.46556      0.0000
LOG(READERS)                   0.107513     0.165289      0.650453     0.5165
LOG(READERS)^2                -0.009460     0.009390     -1.007499     0.3155
MAIN                           0.442396     0.201006      2.200903     0.0295
ARTICLE                        0.293853     0.300690      0.977264     0.3302
BLOG                          -0.345612     0.325407     -1.062092     0.2901
SELF                           0.963876     0.311929      3.090049     0.0024
IMAGE                          0.093972     0.331711      0.283295     0.7774
AV                            -0.045044     0.341285     -0.131983     0.8952
POLITICAL                      0.165556     0.232088      0.713334     0.4769
NSFW                          -0.217200     0.414193     -0.524393     0.6009
C                              0.521011     0.756962      0.688292     0.4925

R-squared            0.512473
Adjusted R-squared   0.472452

Relative to Model 2, which lacks an intercept, this model explains about the same amount of variance. The t-statistics for all variables other than ABS(SELF-1)*LOG(COMMENTS+1) are promising. And although including an intercept runs contrary to theory, the intercept is relatively small.

LOG(KARMA+1) is maximized when a subReddit has 86,250 readers. This is the first such value to make practical sense.

Model 6

Dependent Variable: LOG(KARMA+1)

Variable            Coefficient   Std. Error   t-Statistic   Prob.
LOG(READERS)         0.286928     0.220038      1.303996     0.1944
LOG(READERS)^2      -0.018185     0.012518     -1.452713     0.1486
MAIN                -0.101241     0.258768     -0.391243     0.6962
ARTICLE              0.178752     0.401224      0.445516     0.6567
BLOG                -0.394004     0.435486     -0.904747     0.3672
SELF                 0.476977     0.412836      1.155366     0.2500
IMAGE                0.515965     0.440681      1.170835     0.2437
AV                  -0.042448     0.456793     -0.092925     0.9261
POLITICAL            0.303863     0.309424      0.982028     0.3278
NSFW                -0.932055     0.546788     -1.704600     0.0906
C                    0.227920     1.012463      0.225114     0.8222

R-squared            0.114834
Adjusted R-squared   0.049748

The addition of an intercept to Model 3 barely changes the explanatory power of the model. However, the absolute t-statistic of each individual variable is small, with only NSFW’s null being rejected at the 10% level. Also promising is that even though including an intercept runs contrary to theory, its coefficient is small. The negative coefficient of MAIN is also promising.

LOG(KARMA+1) is maximized when a subReddit has 7,119,007 readers. Once again, this is alarmingly high.

Discussion of variables across models

The inclusion of LOG(COMMENTS+1) in some form consistently increased the explanatory power of the model; however, its removal increased the probability that the other variables belong in the model. Model 5 seems to exemplify the best tradeoff, relating comments to Karma only for non-self posts.

All but Model 4 implied that there is a number of readers at which Karma is maximized. However, Model 5 was the only one to place that maximum within the range of readers observed: according to Model 5, 86,250 is the magic number of readers.

MAIN had a negative coefficient in only 2 models, neither of which was Model 5, the most promising one. In Model 5, however, the null of a zero coefficient for MAIN would be rejected at the 5% level. More testing is needed to see how submitting to the Reddit.com subReddit affects Karma.

ARTICLE and IMAGE have positive coefficients in every model, with comfortable t-statistics. The evidence suggests that articles and images receive more Karma than their counterparts.

AV and SELF sometimes have positive and sometimes negative coefficients. In Model 5, AV has a negative effect and SELF a positive one. More testing is needed to verify either.

BLOG was the only variable with a negative coefficient in every single model. However, none of the models would reject the null of a zero BLOG coefficient at the 5% level.

POLITICAL has a consistently positive coefficient, while NSFW has a consistently negative coefficient.

Conclusion

Summary

My analysis shows that articles, images, and submissions with a political spin lend themselves to more Karma. Blogs and pornography, however, generate less Karma. Self posts and audio/visual submissions are somewhere in between.

There is also a sweet spot in the size of a subReddit. 86,250 is a rough estimate of where that sweet spot may be. SubReddits smaller than this don’t display the numbers needed to maximize Karma, while subReddits larger than this crowd it out.

There is a lack of evidence to conclude that comments are a good predictor of Karma.

Further Research

The use of comments as a proxy for engagement seemed to be the most notable flaw in my models. If a better proxy could be found, better models could be created.

Limitations

The sample size was unfortunately small: 149 observations after excluding spam posts. A larger sample size would greatly increase the validity of the model as well as better expose variable interactions.

Another limitation was the blurry line between the submission categories, particularly between articles and blogs. It is hard to decipher what is a blog when established sources hire bloggers to create content, and individual bloggers attempt to present themselves as established sources.

Another difficult question is what would be considered political. Is a news story which is not overtly editorialized but slightly biased considered political? If the bias wasn’t obvious at a cursory glance, I wouldn’t mark it as political. However, a deeper reading might reveal it was so.