The objective of this report is to explore a text data set in R and generate insights from it.
The report consists of the following:
1. Data Exploration:
> dim(Basic_Data)
[1] 495 19
> names(Basic_Data)
[1] "deal" "description" "episode"
[4] "category" "entrepreneurs" "location"
[7] "website" "askedFor" "exchangeForStake"
[10] "valuation" "season" "shark1"
[13] "shark2" "shark3" "shark4"
[16] "shark5" "title" "episode.season"
[19] "Multiple.Entreprenuers"
> class(Basic_Data)
[1] "data.frame"
> str(Basic_Data)
'data.frame': 495 obs. of 19 variables:
$ deal : logi FALSE TRUE TRUE FALSE FALSE TRUE ...
$ description : Factor w/ 493 levels "\"Magic\" aromatherapy sprays for kids to help alleviate common fears and anxieties.",..: 202 382 183 352 307 350 69 180 98 152 ...
$ episode : int 1 1 1 1 1 2 2 2 2 2 ...
$ category : Factor w/ 54 levels "Alcoholic Beverages",..: 36 45 3 8 8 45 31 43 2 11 ...
$ entrepreneurs : Factor w/ 422 levels "","Aaron Lemieux",..: 96 406 402 310 223 388 84 264 345 259 ...
$ location : Factor w/ 255 levels "Akron, OH","Alexandria, VA",..: 226 220 8 230 36 154 93 77 117 40 ...
$ website : Factor w/ 456 levels "","http://180cup.com",..: 1 119 130 21 405 128 25 129 278 1 ...
$ askedFor : int 1000000 460000 50000 250000 1200000 500000 200000 100000 500000 250000 ...
$ exchangeForStake : int 15 10 15 25 10 15 20 20 10 10 ...
$ valuation : int 6666667 4600000 333333 1000000 12000000 3333333 1000000 500000 5000000 2500000 ...
$ season : int 1 1 1 1 1 1 1 1 1 1 ...
$ shark1 : Factor w/ 2 levels "Barbara Corcoran",..: 1 1 1 1 1 1 1 1 1 1 ...
$ shark2 : Factor w/ 4 levels "Barbara Corcoran",..: 3 3 3 3 3 3 3 3 3 3 ...
$ shark3 : Factor w/ 3 levels "Daymond John",..: 2 2 2 2 2 2 2 2 2 2 ...
$ shark4 : Factor w/ 4 levels "Daymond John",..: 1 1 1 1 1 1 1 1 1 1 ...
$ shark5 : Factor w/ 5 levels "Daymond John",..: 3 3 3 3 3 3 3 3 3 3 ...
$ title : Factor w/ 493 levels "180 Cup","50 State Capitals in 50 Minutes",..: 205 265 14 90 480 3 104 13 226 83 ...
$ episode.season : Factor w/ 122 levels "1-1","1-10","1-11",..: 1 1 1 1 1 7 7 7 7 7 ...
$ Multiple.Entreprenuers: logi FALSE FALSE FALSE FALSE FALSE FALSE ...
> table(data$deal)
FALSE TRUE
244 251
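The two outcomes are almost evenly split, so a classifier that always predicts the majority class would be right only about half the time. As a minimal sketch (counts copied from the table above; the variable names are illustrative and not part of the original script), that baseline can be computed as:

```r
# Counts copied from the table(data$deal) output above.
deal_counts <- c("FALSE" = 244, "TRUE" = 251)

# Share of each outcome, and the accuracy of always guessing the majority class.
deal_share <- deal_counts / sum(deal_counts)
baseline_accuracy <- max(deal_share)

print(round(deal_share, 3))
print(round(baseline_accuracy, 3))
```

Any model built on the pitch descriptions has to beat this baseline to be useful.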
1. Creating Corpus
> corpus= Corpus(VectorSource(data$description))
> corpus
<<SimpleCorpus>>
Metadata: corpus specific: 1, document level (indexed): 0
Content: documents: 495
> writeLines(strwrap(corpus[[264]]$content,60))
What is this? We're hearing about a tight-knit family, a
closely guarded secret, and mouth-watering pickles. It
sounds like a motion picture but it's Lynnae's Gourmet
Pickles, and although the recipe remains a secret, you can
buy the finished product and enjoy the taste for yourself.
The recipe comes from Lynnae's Great Grandma Toots, a.k.a.
Mrs. Pickles. Her recipe has united the family's women for
over a century. Now Lynnae, together with her sister-in-law
Aly, is bringing its goodness to the masses.
2. Changing to lower case
> corpus2=tm_map(corpus,content_transformer(stri_trans_tolower))
> writeLines(strwrap(corpus2[[264]]$content,60))
what is this? we're hearing about a tight-knit family, a
closely guarded secret, and mouth-watering pickles. it
sounds like a motion picture but it's lynnae's gourmet
pickles, and although the recipe remains a secret, you can
buy the finished product and enjoy the taste for yourself.
the recipe comes from lynnae's great grandma toots, a.k.a.
mrs. pickles. her recipe has united the family's women for
over a century. now lynnae, together with her sister-in-law
aly, is bringing its goodness to the masses.
4. Removing punctuation
> corpus4=tm_map(corpus3,removePunctuation)
> writeLines(strwrap(corpus4[[264]]$content,60))
hearing tightknit family closely guarded secret
mouthwatering pickles sounds like motion picture lynnaes
gourmet pickles although recipe remains secret buy finished
product enjoy taste recipe comes lynnaes great grandma
toots k mrs pickles recipe united familys women century now
lynnae together sisterlaw aly bringing goodness masses
5. Removing Numbers
> corpus5=tm_map(corpus4,removeNumbers)
> writeLines(strwrap(corpus5[[264]]$content,60))
hearing tightknit family closely guarded secret
mouthwatering pickles sounds like motion picture lynnaes
gourmet pickles although recipe remains secret buy finished
product enjoy taste recipe comes lynnaes great grandma
toots k mrs pickles recipe united familys women century now
lynnae together sisterlaw aly bringing goodness masses
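The cleaning steps above can be approximated on a single string in base R. This is only a sketch: `clean_text` and `toy_stop_words` are hypothetical names, the stop-word list is a tiny stand-in for the much longer list shipped with `tm`, and the stop-word step is inferred from the output (the command that creates `corpus3` is not shown in the report).

```r
# Hypothetical helper: a base-R approximation of the cleaning steps,
# not the tm functions used in the report.
clean_text <- function(x, stop_words) {
  x <- tolower(x)                                   # step 2: lower case
  # Assumed step 3: drop stop words (whole-word matches only).
  pattern <- paste0("\\b(", paste(stop_words, collapse = "|"), ")\\b")
  x <- gsub(pattern, "", x, perl = TRUE)
  x <- gsub("[[:punct:]]+", "", x)                  # step 4: punctuation
  x <- gsub("[[:digit:]]+", "", x)                  # step 5: numbers
  x <- gsub("\\s+", " ", x)                         # collapse leftover spaces
  trimws(x)
}

# Tiny illustrative stop-word list (an assumption, not tm's list).
toy_stop_words <- c("what", "is", "this", "a", "and", "the", "for",
                    "over", "has", "her", "in")

sample_text <- "Her recipe has united the family's women for over a century."
cleaned <- clean_text(sample_text, toy_stop_words)
print(cleaned)

# Stop words are removed before punctuation, which explains two oddities
# in the output above.
aka <- clean_text("a.k.a.", toy_stop_words)
sister <- clean_text("sister-in-law", toy_stop_words)
print(c(aka, sister))
```

Under these assumptions the order of the steps matters: "a.k.a." collapses to "k" and "sister-in-law" to "sisterlaw" because "a" and "in" are dropped while the punctuation is still in place.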
3. Random Forest, CART, Logistic regression:
> st_dtm
<<DocumentTermMatrix (documents: 495, terms: 3202)>>
Non-/sparse entries: 9055/1575935
Sparsity : 99%
Maximal term length: 24
Weighting : term frequency (tf)
> dim(dtm_df)
[1] 495 473
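The sparsity figure follows directly from the entry counts. The sketch below first builds a toy document-term matrix in base R to show what is being counted, then redoes the arithmetic for the 495 x 3202 matrix. The toy documents and all variable names are illustrative assumptions, and the step that reduces 3202 terms to the 473 columns of `dtm_df` is not shown in the report, so it is not reproduced here.

```r
# Toy corpus (illustrative only).
docs <- c("gourmet pickles recipe", "secret recipe", "water bottle design")
tokens <- strsplit(docs, " ", fixed = TRUE)
vocab <- sort(unique(unlist(tokens)))

# One row per document, one column per term, cells hold term frequencies.
toy_dtm <- t(vapply(
  tokens,
  function(tk) as.integer(table(factor(tk, levels = vocab))),
  integer(length(vocab))
))
colnames(toy_dtm) <- vocab

# Sparsity is the share of cells that are zero.
toy_sparsity <- mean(toy_dtm == 0)
print(toy_dtm)
print(round(toy_sparsity, 3))

# Same calculation with the counts printed for st_dtm above.
non_sparse <- 9055
sparse <- 1575935
total_cells <- 495 * 3202
report_sparsity <- sparse / total_cells
print(round(100 * report_sparsity, 2))
```

The two entry counts add up to the number of cells (495 x 3202), and the ratio is consistent with the 99% shown in the output.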
Random Forest:
Call:
randomForest(x = dtm_df[-474], y = dtm_df$Deal, ntree = 100,
importance = T)
Type of random forest: classification
Number of trees: 100
No. of variables tried at each split: 21
Classification tree:
rpart(formula = Deal ~ easier + children + design + roll + maniac +
shape + pad + size + water + retail + unique + shoe + transform +
recyclable + create + perfect + alternative + built + without +
like + custom + students + tradition, data = train_set, method = "class")
n= 372
predtestwsa2  0  1
           0 43 46
           1 12 22
Accuracy : 0.5285
95% CI : (0.4364, 0.6191)
No Information Rate : 0.5528
P-Value [Acc > NIR] : 0.7377
Kappa : 0.0995
Sensitivity : 0.7818
Specificity : 0.3235
Pos Pred Value : 0.4831
Neg Pred Value : 0.6471
Prevalence : 0.4472
Detection Rate : 0.3496
Detection Prevalence : 0.7236
Balanced Accuracy : 0.5527
'Positive' Class : 0
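The statistics above can be reproduced by hand from the four cells of the confusion matrix. A minimal sketch, assuming rows are predictions and columns are the reference classes (`cm_metrics` and `test_cm` are hypothetical names, not from the original script):

```r
# Hypothetical helper: basic metrics from a 2x2 table with predictions in
# rows and reference classes in columns.
cm_metrics <- function(cm, positive) {
  negative <- setdiff(rownames(cm), positive)
  tp <- cm[positive, positive]   # predicted positive, actually positive
  fn <- cm[negative, positive]   # predicted negative, actually positive
  fp <- cm[positive, negative]   # predicted positive, actually negative
  tn <- cm[negative, negative]   # predicted negative, actually negative
  c(accuracy       = (tp + tn) / sum(cm),
    sensitivity    = tp / (tp + fn),
    specificity    = tn / (tn + fp),
    pos_pred_value = tp / (tp + fp),
    neg_pred_value = tn / (tn + fn))
}

# Cells copied from the matrix above (filled column by column).
test_cm <- matrix(c(43, 12, 46, 22), nrow = 2,
                  dimnames = list(prediction = c("0", "1"),
                                  reference  = c("0", "1")))

# The output above reports class 0 (no deal) as the positive class.
m <- cm_metrics(test_cm, positive = "0")
print(round(m, 4))
```

Note that the positive class here is 0, whereas the later matrices in this report use 1, so the sensitivity and specificity values are not directly comparable across models without swapping them.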
Logistic regression:
> summary(lr_model2)
Call:
glm(formula = Deal ~ easier + children + design + roll + maniac +
shape + pad + size + water + retail + unique + shoe + transform +
recyclable + create + perfect + alternative + built + without +
like + custom + students + tradition, family = binomial(link = "logit"),
data = train_set)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.0648 -1.0040 -0.2847 1.1725 1.7686
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.4226 0.1404 -3.010 0.00261 **
easier 1.6756 0.5159 3.248 0.00116 **
children 1.2036 0.5850 2.058 0.03963 *
design 0.5696 0.3091 1.843 0.06533 .
roll 16.7074 963.0095 0.017 0.98616
maniac 2.5199 1.4305 1.762 0.07815 .
shape 17.4715 1611.5046 0.011 0.99135
pad -0.1115 0.5962 -0.187 0.85171
size 0.7531 0.7706 0.977 0.32840
water -0.5910 0.4387 -1.347 0.17793
retail 0.5219 0.6473 0.806 0.42003
unique -0.6263 0.6445 -0.972 0.33119
shoe 0.4809 0.7088 0.678 0.49747
transform 0.7515 0.8319 0.903 0.36633
recyclable 0.9175 0.9568 0.959 0.33761
create -0.1470 0.5699 -0.258 0.79642
perfect -0.3779 1.0057 -0.376 0.70710
alternative 17.3052 1219.8803 0.014 0.98868
built 0.4340 0.6336 0.685 0.49338
without 0.1357 0.4277 0.317 0.75097
like 0.3757 0.3961 0.949 0.34285
custom -0.6708 0.5068 -1.324 0.18558
students 1.3802 1.0655 1.295 0.19520
tradition -2.5820 1.2165 -2.123 0.03379 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
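The estimates are on the log-odds scale, so exponentiating one gives the odds ratio for one additional occurrence of the term. Terms such as `roll`, `shape` and `alternative` combine a very large estimate with a far larger standard error, a pattern typical of (quasi-)complete separation, for example a term that occurs only in pitches with one outcome in the training set. The sketch below uses a few rows copied from the table above; the cut-offs used to flag such terms are arbitrary assumptions.

```r
# A few rows copied from the coefficient table above.
coefs <- data.frame(
  term      = c("easier", "children", "tradition", "roll", "shape", "alternative"),
  estimate  = c(1.6756, 1.2036, -2.5820, 16.7074, 17.4715, 17.3052),
  std_error = c(0.5159, 0.5850, 1.2165, 963.0095, 1611.5046, 1219.8803)
)

# Odds ratio per additional occurrence of the term, and the Wald z statistic.
coefs$odds_ratio <- exp(coefs$estimate)
coefs$z_value <- coefs$estimate / coefs$std_error

# Heuristic flag (assumed thresholds): huge estimate with a huge standard error.
coefs$suspect_separation <- abs(coefs$estimate) > 10 & coefs$std_error > 100

print(coefs)
```

For the flagged terms the odds ratios are not meaningful, and their p-values near 1 reflect an unstable fit rather than evidence that the terms are unrelated to the outcome.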
> vif(lr_model)
  easier     made     make children     roll   design  without    water     size
1.047677 1.018854 1.068733 1.003810 1.000000 1.009340 1.141335 1.143490 1.017013
> confusionMatrix(train_cm_lr,positive = "1")
Confusion Matrix and Statistics
   pred_lr_trainwsa_class
      0   1
  0 154  35
  1  90  93
Accuracy : 0.664
95% CI : (0.6135, 0.7118)
No Information Rate : 0.6559
P-Value [Acc > NIR] : 0.3945
Kappa : 0.3246
Sensitivity : 0.7266
Specificity : 0.6311
Pos Pred Value : 0.5082
Neg Pred Value : 0.8148
Prevalence : 0.3441
Detection Rate : 0.2500
Detection Prevalence : 0.4919
Balanced Accuracy : 0.6789
'Positive' Class : 1
Random Forest:
> tuneRF(x=train_set2[,zx2],y=train_set2$Deal,ntreeTry = 501,mtryStart = 2,stepFactor = 2,plot = T,trace = T,doBest = T)
mtry = 2 OOB error = 41.21%
Searching left ...
mtry = 1 OOB error = 41.79%
-0.01398601 0.05
Searching right ...
mtry = 4 OOB error = 39.77%
0.03496503 0.05
Call:
randomForest(x = x, y = y, mtry = res[which.min(res[, 2]), 1])
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 4
n= 347
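The two numbers printed after each search step in the `tuneRF` trace are the relative improvement in OOB error and the `improve` threshold (0.05 by default). The sketch below reproduces them, assuming the `n= 347` line is the training-set size so that the printed percentages can be converted back to misclassification counts; the variable names are illustrative.

```r
# Assumption: 347 training rows, taken from the "n= 347" line above.
n_train <- 347

# OOB error percentages copied from the trace.
oob_pct <- c(mtry_1 = 41.79, mtry_2 = 41.21, mtry_4 = 39.77)

# Convert to whole numbers of misclassified out-of-bag cases.
oob_errors <- round(oob_pct / 100 * n_train)

# Relative improvement of a candidate mtry over the current one.
relative_improvement <- function(err_new, err_old) 1 - err_new / err_old

improve_threshold <- 0.05
left  <- relative_improvement(oob_errors[["mtry_1"]], oob_errors[["mtry_2"]])
right <- relative_improvement(oob_errors[["mtry_4"]], oob_errors[["mtry_2"]])

print(c(left = left, right = right))
print(c(left = left, right = right) > improve_threshold)

# With doBest = TRUE the final forest uses the mtry with the lowest OOB error.
best <- names(which.min(oob_errors))
print(best)
```

Neither direction improves the error by more than the threshold, so the search stops after one step each way, and the final forest is fitted with `mtry = 4`, the value with the lowest OOB error among those tried.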
predtestwsanw2  0  1
             0 58 53
             1  9 28
Accuracy : 0.5811
95% CI : (0.4973, 0.6616)
No Information Rate : 0.5473
P-Value [Acc > NIR] : 0.2291
Kappa : 0.2
Sensitivity : 0.3457
Specificity : 0.8657
Pos Pred Value : 0.7568
Neg Pred Value : 0.5225
Prevalence : 0.5473
Detection Rate : 0.1892
Detection Prevalence : 0.2500
Balanced Accuracy : 0.6057
'Positive' Class : 1
Logistic regression:
Call:
glm(formula = Deal ~ easier + children + design + roll + maniac +
shape + pad + size + water + retail + unique + shoe + transform +
recyclable + create + perfect + alternative + built + without +
like + custom + students + tradition, family = binomial(link = "logit"),
data = train_set2)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.0064 -1.0116 -0.3089 1.1747 1.7615
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.4033 0.1436 -2.808 0.00499 **
easier 1.5846 0.5262 3.012 0.00260 **
children 1.1582 0.5846 1.981 0.04758 *
design 0.4817 0.3190 1.510 0.13098
roll 16.6811 952.0888 0.018 0.98602
maniac 2.5258 1.4092 1.792 0.07307 .
shape 17.4197 1594.8903 0.011 0.99129
pad -1.3078 1.1310 -1.156 0.24756
size 0.3312 0.8210 0.403 0.68664
water -0.6501 0.4473 -1.453 0.14618
retail 0.4214 0.6460 0.652 0.51419
unique -0.5250 0.6502 -0.807 0.41941
shoe 0.3421 0.8237 0.415 0.67793
transform 1.0590 0.9364 1.131 0.25806
recyclable 0.9433 0.9405 1.003 0.31588
create -0.2069 0.5619 -0.368 0.71275
perfect -0.4021 0.9897 -0.406 0.68454
alternative 17.7871 1337.4827 0.013 0.98939
built 0.8023 0.6782 1.183 0.23680
without 0.3977 0.4664 0.853 0.39386
like 0.2729 0.3962 0.689 0.49103
custom -0.4729 0.4912 -0.963 0.33571
students 1.2526 1.0470 1.196 0.23157
tradition -2.4945 1.2364 -2.018 0.04364 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
   pred_lr_testwsa_class3
     0  1
  0 54 13
  1 45 36
Accuracy : 0.6081
95% CI : (0.5246, 0.6872)
No Information Rate : 0.6689
P-Value [Acc > NIR] : 0.95
Kappa : 0.2405
Sensitivity : 0.7347
Specificity : 0.5455
Pos Pred Value : 0.4444
Neg Pred Value : 0.8060
Prevalence : 0.3311
Detection Rate : 0.2432
Detection Prevalence : 0.5473
Balanced Accuracy : 0.6401
'Positive' Class : 1
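One caveat on the logistic regression matrices: the column label `pred_lr_testwsa_class3` suggests that predictions sit in the columns, whereas `confusionMatrix` from `caret` treats the rows of a table as predictions and the columns as the reference. The row totals (67 and 81) also match the reference totals of the random forest test matrix. If the rows are indeed the actual classes (an assumption, since the code that builds the table is not shown), the reported sensitivity and specificity are really the predictive values, and vice versa. A sketch of both readings, with illustrative variable names:

```r
# Cells copied from the test matrix above (filled column by column).
lr_tab <- matrix(c(54, 45, 13, 36), nrow = 2,
                 dimnames = list(row = c("0", "1"), col = c("0", "1")))

# Reading 1 (as printed): rows are predictions, columns are the reference.
printed_sensitivity <- lr_tab["1", "1"] / sum(lr_tab[, "1"])
printed_specificity <- lr_tab["0", "0"] / sum(lr_tab[, "0"])

# Reading 2 (assumed): rows are the actual classes, columns are predictions.
alt_sensitivity <- lr_tab["1", "1"] / sum(lr_tab["1", ])
alt_specificity <- lr_tab["0", "0"] / sum(lr_tab["0", ])
alt_no_info_rate <- max(rowSums(lr_tab)) / sum(lr_tab)

# Accuracy is the same under both readings.
accuracy <- sum(diag(lr_tab)) / sum(lr_tab)

print(round(c(printed_sensitivity = printed_sensitivity,
              printed_specificity = printed_specificity,
              alt_sensitivity = alt_sensitivity,
              alt_specificity = alt_specificity,
              alt_no_info_rate = alt_no_info_rate,
              accuracy = accuracy), 4))
```

Under the second reading the model finds fewer than half of the actual deals on the test set, and the no-information rate drops to the share of class 1 among the actual outcomes. The training matrix for the logistic regression carries the same kind of column label, so the same check applies there.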