
Table of Contents

1 Project Objective
2 Assumptions
3 Exploratory Data Analysis
   3.1 Environment Set up and Data Import
      3.1.1 Install necessary Packages and Invoke Libraries
      3.1.2 Set up working Directory
   3.2 Variable Identification
      3.2.1 Variable Identification – Inferences
   3.3 Univariate Analysis
   3.4 Bi-Variate Analysis
   3.5 Hypothesis Testing
4 Conclusion
   4.1 Recommendation
5 Sample size
6 Appendix
1 Project Objective

The objective of this report is to explore the text data set in R and generate insights from it. The exploration consists of the following steps:

1. Importing the dataset into R
2. Understanding the structure of the text dataset
3. Building CART, Random Forest and Logistic Regression models on the text data ("description")
4. Adding one more independent variable ("Ratio")
5. Re-running CART, Random Forest and Logistic Regression with the two independent variables

1. Data Exploration:
> dim(Basic_Data)
[1] 495 19

> names(Basic_Data)
[1] "deal" "description" "episode"
[4] "category" "entrepreneurs" "location"
[7] "website" "askedFor" "exchangeForStake"
[10] "valuation" "season" "shark1"
[13] "shark2" "shark3" "shark4"
[16] "shark5" "title" "episode.season"
[19] "Multiple.Entreprenuers"

> class(Basic_Data)
[1] "data.frame"

> str(Basic_Data)
'data.frame': 495 obs. of 19 variables:
 $ deal                  : logi FALSE TRUE TRUE FALSE FALSE TRUE ...
 $ description           : Factor w/ 493 levels "\"Magic\" aromatherapy sprays for kids to help alleviate common fears and anxieties.",..: 202 382 183 352 307 350 69 180 98 152 ...
 $ episode               : int 1 1 1 1 1 2 2 2 2 2 ...
 $ category              : Factor w/ 54 levels "Alcoholic Beverages",..: 36 45 3 8 8 45 31 43 2 11 ...
 $ entrepreneurs         : Factor w/ 422 levels "","Aaron Lemieux",..: 96 406 402 310 223 388 84 264 345 259 ...
 $ location              : Factor w/ 255 levels "Akron, OH","Alexandria, VA",..: 226 220 8 230 36 154 93 77 117 40 ...
 $ website               : Factor w/ 456 levels "","http://180cup.com",..: 1 119 130 21 405 128 25 129 278 1 ...
 $ askedFor              : int 1000000 460000 50000 250000 1200000 500000 200000 100000 500000 250000 ...
 $ exchangeForStake      : int 15 10 15 25 10 15 20 20 10 10 ...
 $ valuation             : int 6666667 4600000 333333 1000000 12000000 3333333 1000000 500000 5000000 2500000 ...
 $ season                : int 1 1 1 1 1 1 1 1 1 1 ...
 $ shark1                : Factor w/ 2 levels "Barbara Corcoran",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ shark2                : Factor w/ 4 levels "Barbara Corcoran",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ shark3                : Factor w/ 3 levels "Daymond John",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ shark4                : Factor w/ 4 levels "Daymond John",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ shark5                : Factor w/ 5 levels "Daymond John",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ title                 : Factor w/ 493 levels "180 Cup","50 State Capitals in 50 Minutes",..: 205 265 14 90 480 3 104 13 226 83 ...
 $ episode.season        : Factor w/ 122 levels "1-1","1-10","1-11",..: 1 1 1 1 1 7 7 7 7 7 ...
 $ Multiple.Entreprenuers: logi FALSE FALSE FALSE FALSE FALSE FALSE ...

Data exploration with only deal, description and ratio:


> str(data)
'data.frame': 495 obs. of 3 variables:
 $ description: Factor w/ 493 levels "\"Magic\" aromatherapy sprays for kids to help alleviate common fears and anxieties.",..: 202 382 183 352 307 350 69 180 98 152 ...
 $ deal       : logi FALSE TRUE TRUE FALSE FALSE TRUE ...
 $ ratio      : num 0.15 0.1 0.15 0.25 0.1 ...

> table(data$deal)

FALSE TRUE
244 251
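The transcript never shows how this three-column frame was built. The printed ratio values (0.15, 0.10, 0.15, 0.25, 0.10, ...) match askedFor / valuation, which for this data also equals exchangeForStake / 100, so a plausible reconstruction (an assumption, not the author's shown code) is illustrated here on the first five rows of the str(Basic_Data) output above:

```r
# Hypothetical reconstruction: ratio appears to be askedFor / valuation.
# Values below are copied from the first five rows printed above.
askedFor  <- c(1000000, 460000,  50000,  250000,  1200000)
valuation <- c(6666667, 4600000, 333333, 1000000, 12000000)
ratio <- askedFor / valuation
round(ratio, 2)   # 0.15 0.10 0.15 0.25 0.10 — matching str(data)$ratio above
```

The same division applied to the full Basic_Data frame would yield the data$ratio column used throughout the rest of the report.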

Data Clean-up:

1. Creating the corpus
> corpus= Corpus(VectorSource(data$description))
> corpus
<<SimpleCorpus>>
Metadata: corpus specific: 1, document level (indexed): 0
Content: documents: 495

> writeLines(strwrap(corpus[[264]]$content,60))
What is this? We're hearing about a tight-knit family, a
closely guarded secret, and mouth-watering pickles. It
sounds like a motion picture but it's Lynnae's Gourmet
Pickles, and although the recipe remains a secret, you can
buy the finished product and enjoy the taste for yourself.
The recipe comes from Lynnae's Great Grandma Toots, a.k.a.
Mrs. Pickles. Her recipe has united the family's women for
over a century. Now Lynnae, together with her sister-in-law
Aly, is bringing its goodness to the masses.
2. Converting to lower case
> corpus2=tm_map(corpus,content_transformer(stri_trans_tolower))
> writeLines(strwrap(corpus2[[264]]$content,60))
what is this? we're hearing about a tight-knit family, a
closely guarded secret, and mouth-watering pickles. it
sounds like a motion picture but it's lynnae's gourmet
pickles, and although the recipe remains a secret, you can
buy the finished product and enjoy the taste for yourself.
the recipe comes from lynnae's great grandma toots, a.k.a.
mrs. pickles. her recipe has united the family's women for
over a century. now lynnae, together with her sister-in-law
aly, is bringing its goodness to the masses.

3. Stop words – including can, also, use, used. (removeWords deletes whole words only, which is why the output below still contains fragments such as ".k.." from "a.k.a." and "sister--law" from "sister-in-law".)


> stopwords=c((stopwords("english")),c("can","also","use","used"))
> corpus3=tm_map(corpus2,removeWords,stopwords)
> writeLines(strwrap(corpus3[[264]]$content,60))
? hearing tight-knit family, closely guarded secret,
mouth-watering pickles. sounds like motion picture
lynnae's gourmet pickles, although recipe remains secret,
buy finished product enjoy taste . recipe comes lynnae's
great grandma toots, .k.. mrs. pickles. recipe united
family's women century. now lynnae, together sister--law
aly, bringing goodness masses.

4. Removing punctuation
> corpus4=tm_map(corpus3,removePunctuation)
> writeLines(strwrap(corpus4[[264]]$content,60))
hearing tightknit family closely guarded secret
mouthwatering pickles sounds like motion picture lynnaes
gourmet pickles although recipe remains secret buy finished
product enjoy taste recipe comes lynnaes great grandma
toots k mrs pickles recipe united familys women century now
lynnae together sisterlaw aly bringing goodness masses

5. Removing numbers
> corpus5=tm_map(corpus4,removeNumbers)
> writeLines(strwrap(corpus5[[264]]$content,60))
hearing tightknit family closely guarded secret
mouthwatering pickles sounds like motion picture lynnaes
gourmet pickles although recipe remains secret buy finished
product enjoy taste recipe comes lynnaes great grandma
toots k mrs pickles recipe united familys women century now
lynnae together sisterlaw aly bringing goodness masses

6. Stemming the corpus


> stem_corpus6=tm_map(corpus5,stemDocument, language = "english")
transformation drops documents
> writeLines(strwrap(stem_corpus6[[264]]$content,60))
hear tightknit famili close guard secret mouthwat pickl
sound like motion pictur lynna gourmet pickl although recip
remain secret buy finish product enjoy tast recip come
lynna great grandma toot k mrs pickl recip unit famili
women centuri now lynna togeth sisterlaw ali bring good
mass
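findFreqTerms below is called on st_tdm, and the models use st_dtm, but neither object's creation appears in the transcript. Presumably (an assumption) they come from the standard tm constructors applied to the stemmed corpus:

```r
# Assumed step (not shown in the transcript): build the term matrices from
# the stemmed corpus. st_tdm is terms x documents (used by findFreqTerms);
# st_dtm is documents x terms (used for modelling).
st_tdm <- TermDocumentMatrix(stem_corpus6)
st_dtm <- DocumentTermMatrix(stem_corpus6)
```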
2. Frequency and Correlation:
> (freq.term=findFreqTerms(st_tdm,lowfreq = 20))
[1] "accessories" "allow" "bar" "bottle" "business"
[6] "children" "cloth" "come" "companies" "create"
[11] "custom" "design" "device" "easier" "even"
[16] "feature" "fit" "flavor" "food" "fun"
[21] "get" "go" "help" "home" "include"
[26] "just" "keep" "kidfriendly" "like" "line"
[31] "look" "made" "make" "mix" "natural"
[36] "need" "new" "offer" "one" "online"
[41] "package" "people" "play" "product" "protect"
[46] "provide" "safe" "sell" "service" "store"
[51] "system" "take" "time" "useful" "user"
[56] "water" "way" "without" "work"

3. Random Forest, CART, Logistic Regression:
> st_dtm
<<DocumentTermMatrix (documents: 495, terms: 3202)>>
Non-/sparse entries: 9055/1575935
Sparsity : 99%
Maximal term length: 24
Weighting : term frequency (tf)
> dim(dtm_df)
[1] 495 473
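How dtm_df was derived from st_dtm (3202 terms down to 473 columns) is not shown. A plausible sketch, in which the sparsity threshold is a guess, is to drop sparse terms, coerce to a data frame, and append the response; that would make Deal column 474 and explain the dtm_df[-474] indexing in the random-forest call below:

```r
# Hypothetical reconstruction: the 0.995 sparsity cutoff is an assumption,
# chosen only to illustrate how 3202 terms could shrink to ~473 columns.
dense_dtm <- removeSparseTerms(st_dtm, 0.995)
dtm_df    <- as.data.frame(as.matrix(dense_dtm))
dtm_df$Deal <- factor(as.integer(data$deal))   # 0/1 response appended last
```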

Random Forest:
Call:
randomForest(x = dtm_df[-474], y = dtm_df$Deal, ntree = 100,
importance = T)
Type of random forest: classification
Number of trees: 100
No. of variables tried at each split: 21

OOB estimate of error rate: 41.01%


Confusion matrix:
0 1 class.error
0 133 111 0.4549180
1 92 159 0.3665339
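The random forest above was fit on all 495 rows (OOB error substitutes for a hold-out), but the CART and logistic models that follow use a train_set of 372 rows (see the root node error 183/372) and a 123-row test set; 372 + 123 = 495 points to a 75/25 split. The split itself is never shown; a minimal sketch, with an assumed seed and method:

```r
# Hypothetical 75/25 split (seed and sampling method are assumptions).
set.seed(100)
idx       <- sample(seq_len(nrow(dtm_df)), size = 372)
train_set <- dtm_df[idx, ]
test_set  <- dtm_df[-idx, ]
```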
> printcp(cart2)

Classification tree:
rpart(formula = Deal ~ easier + children + design + roll + maniac +
    shape + pad + size + water + retail + unique + shoe + transform +
    recyclable + create + perfect + alternative + built + without +
    like + custom + students + tradition, data = train_set, method = "class")

Variables actually used in tree construction:
[1] alternative children    design      easier      like        size
[7] tradition

Root node error: 183/372 = 0.49194

n= 372

        CP nsplit rel error  xerror     xstd
1 0.092896      0   1.00000 1.07650 0.052605
2 0.038251      1   0.90710 0.92350 0.052477
3 0.023224      2   0.86885 0.89617 0.052328
4 0.010000      7   0.74863 0.89617 0.052328
CART:
Confusion Matrix and Statistics

predtestwsa2 0 1
0 43 46
1 12 22

Accuracy : 0.5285
95% CI : (0.4364, 0.6191)
No Information Rate : 0.5528
P-Value [Acc > NIR] : 0.7377

Kappa : 0.0995

Mcnemar's Test P-Value : 1.47e-05

Sensitivity : 0.7818
Specificity : 0.3235
Pos Pred Value : 0.4831
Neg Pred Value : 0.6471
Prevalence : 0.4472
Detection Rate : 0.3496
Detection Prevalence : 0.7236
Balanced Accuracy : 0.5527

'Positive' Class : 0
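As a sanity check, the headline numbers in the caret output above follow directly from the printed confusion matrix (rows are predictions, columns the reference, positive class "0"):

```r
# Recomputing caret's statistics by hand from the CART test matrix above.
cm <- matrix(c(43, 12, 46, 22), nrow = 2,
             dimnames = list(pred = c("0", "1"), actual = c("0", "1")))
accuracy    <- sum(diag(cm)) / sum(cm)        # (43 + 22) / 123
sensitivity <- cm["0", "0"] / sum(cm[, "0"])  # 43 / 55, positive class "0"
specificity <- cm["1", "1"] / sum(cm[, "1"])  # 22 / 68
round(c(accuracy, sensitivity, specificity), 4)   # 0.5285 0.7818 0.3235
```

Note that because the positive class here is 0 (no deal), the 0.78 "sensitivity" measures how well the tree spots failed pitches, not deals.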
Logistic regression:
> summary(lr_model2)

Call:
glm(formula = Deal ~ easier + children + design + roll + maniac +
    shape + pad + size + water + retail + unique + shoe + transform +
    recyclable + create + perfect + alternative + built + without +
    like + custom + students + tradition,
    family = binomial(link = "logit"), data = train_set)

Deviance Residuals:
Min 1Q Median 3Q Max
-2.0648 -1.0040 -0.2847 1.1725 1.7686

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.4226 0.1404 -3.010 0.00261 **
easier 1.6756 0.5159 3.248 0.00116 **
children 1.2036 0.5850 2.058 0.03963 *
design 0.5696 0.3091 1.843 0.06533 .
roll 16.7074 963.0095 0.017 0.98616
maniac 2.5199 1.4305 1.762 0.07815 .
shape 17.4715 1611.5046 0.011 0.99135
pad -0.1115 0.5962 -0.187 0.85171
size 0.7531 0.7706 0.977 0.32840
water -0.5910 0.4387 -1.347 0.17793
retail 0.5219 0.6473 0.806 0.42003
unique -0.6263 0.6445 -0.972 0.33119
shoe 0.4809 0.7088 0.678 0.49747
transform 0.7515 0.8319 0.903 0.36633
recyclable 0.9175 0.9568 0.959 0.33761
create -0.1470 0.5699 -0.258 0.79642
perfect -0.3779 1.0057 -0.376 0.70710
alternative 17.3052 1219.8803 0.014 0.98868
built 0.4340 0.6336 0.685 0.49338
without 0.1357 0.4277 0.317 0.75097
like 0.3757 0.3961 0.949 0.34285
custom -0.6708 0.5068 -1.324 0.18558
students 1.3802 1.0655 1.295 0.19520
tradition -2.5820 1.2165 -2.123 0.03379 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 515.60 on 371 degrees of freedom
Residual deviance: 438.16 on 348 degrees of freedom
AIC: 486.16

Number of Fisher Scoring iterations: 16
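The coefficients of roll, shape and alternative (around 17, with standard errors near 1000) and the 16 Fisher scoring iterations are classic symptoms of quasi-complete separation: those stemmed terms most likely appear only in training documents that ended in a deal, so the maximum-likelihood estimates diverge and their Wald p-values are not meaningful. A quick check (a sketch against the report's train_set, not shown in the transcript):

```r
# Cross-tabulate a suspect term against the response; a zero in the
# "term present / Deal = 0" cell confirms separation for that term.
table(present = train_set$roll > 0, Deal = train_set$Deal)
```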

> vif(lr_model)
   easier     made     make children     roll   design  without    water     size
 1.047677 1.018854 1.068733 1.003810 1.000000 1.009340 1.141335 1.143490 1.017013
> confusionMatrix(train_cm_lr,positive = "1")
Confusion Matrix and Statistics

pred_lr_trainwsa_class
0 1
0 154 35
1 90 93

Accuracy : 0.664
95% CI : (0.6135, 0.7118)
No Information Rate : 0.6559
P-Value [Acc > NIR] : 0.3945

Kappa : 0.3246

Mcnemar's Test P-Value : 1.366e-06

Sensitivity : 0.7266
Specificity : 0.6311
Pos Pred Value : 0.5082
Neg Pred Value : 0.8148
Prevalence : 0.3441
Detection Rate : 0.2500
Detection Prevalence : 0.4919
Balanced Accuracy : 0.6789

'Positive' Class : 1
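The transcript does not show how pred_lr_trainwsa_class was obtained from lr_model2; presumably (an assumption) the fitted probabilities were thresholded at 0.5 before tabulating against the response:

```r
# Assumed scoring step: classify as 1 when the fitted probability exceeds 0.5.
p_train <- predict(lr_model2, newdata = train_set, type = "response")
pred_lr_trainwsa_class <- factor(ifelse(p_train > 0.5, 1, 0))
train_cm_lr <- table(pred_lr_trainwsa_class, train_set$Deal)
```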

Adding the new variable ("Ratio"):

Random Forest:
> tuneRF(x = train_set2[, zx2], y = train_set2$Deal, ntreeTry = 501, mtryStart = 2,
         stepFactor = 2, plot = T, trace = T, doBest = T)
mtry = 2 OOB error = 41.21%
Searching left ...
mtry = 1 OOB error = 41.79%
-0.01398601 0.05
Searching right ...
mtry = 4 OOB error = 39.77%
0.03496503 0.05

Call:
randomForest(x = x, y = y, mtry = res[which.min(res[, 2]), 1])
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 4

OOB estimate of error rate: 38.9%


Confusion matrix:
0 1 class.error
0 122 55 0.3107345
1 80 90 0.4705882
CART:
Classification tree:
rpart(formula = Deal ~ easier + children + design + roll + maniac +
shape + pad + size + water + retail + unique + shoe + transform +
recyclable + create + perfect + alternative + built + without +
like + custom + students + tradition, data = train_set2,
method = "class")

Variables actually used in tree construction:
[1] built     children  easier    like      size      tradition transform

Root node error: 170/347 = 0.48991

n= 347

        CP nsplit rel error  xerror     xstd
1 0.076471      0   1.00000 1.00000 0.054777
2 0.017647      1   0.92353 0.99412 0.054770
3 0.010000      7   0.78824 0.93529 0.054596


Confusion Matrix and Statistics

predtestwsanw2 0 1
0 58 53
1 9 28

Accuracy : 0.5811
95% CI : (0.4973, 0.6616)
No Information Rate : 0.5473
P-Value [Acc > NIR] : 0.2291

Kappa : 0.2

Mcnemar's Test P-Value : 4.734e-08

Sensitivity : 0.3457
Specificity : 0.8657
Pos Pred Value : 0.7568
Neg Pred Value : 0.5225
Prevalence : 0.5473
Detection Rate : 0.1892
Detection Prevalence : 0.2500
Balanced Accuracy : 0.6057

'Positive' Class : 1

Logistic regression:
Call:
glm(formula = Deal ~ easier + children + design + roll + maniac +
    shape + pad + size + water + retail + unique + shoe + transform +
    recyclable + create + perfect + alternative + built + without +
    like + custom + students + tradition,
    family = binomial(link = "logit"), data = train_set2)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.0064 -1.0116 -0.3089 1.1747 1.7615

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.4033 0.1436 -2.808 0.00499 **
easier 1.5846 0.5262 3.012 0.00260 **
children 1.1582 0.5846 1.981 0.04758 *
design 0.4817 0.3190 1.510 0.13098
roll 16.6811 952.0888 0.018 0.98602
maniac 2.5258 1.4092 1.792 0.07307 .
shape 17.4197 1594.8903 0.011 0.99129
pad -1.3078 1.1310 -1.156 0.24756
size 0.3312 0.8210 0.403 0.68664
water -0.6501 0.4473 -1.453 0.14618
retail 0.4214 0.6460 0.652 0.51419
unique -0.5250 0.6502 -0.807 0.41941
shoe 0.3421 0.8237 0.415 0.67793
transform 1.0590 0.9364 1.131 0.25806
recyclable 0.9433 0.9405 1.003 0.31588
create -0.2069 0.5619 -0.368 0.71275
perfect -0.4021 0.9897 -0.406 0.68454
alternative 17.7871 1337.4827 0.013 0.98939
built 0.8023 0.6782 1.183 0.23680
without 0.3977 0.4664 0.853 0.39386
like 0.2729 0.3962 0.689 0.49103
custom -0.4729 0.4912 -0.963 0.33571
students 1.2526 1.0470 1.196 0.23157
tradition -2.4945 1.2364 -2.018 0.04364 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 480.90 on 346 degrees of freedom
Residual deviance: 412.01 on 323 degrees of freedom
AIC: 460.01

Number of Fisher Scoring iterations: 16

Confusion Matrix and Statistics

pred_lr_testwsa_class3
0 1
0 54 13
1 45 36

Accuracy : 0.6081
95% CI : (0.5246, 0.6872)
No Information Rate : 0.6689
P-Value [Acc > NIR] : 0.95

Kappa : 0.2405

Mcnemar's Test P-Value : 4.691e-05

Sensitivity : 0.7347
Specificity : 0.5455
Pos Pred Value : 0.4444
Neg Pred Value : 0.8060
Prevalence : 0.3311
Detection Rate : 0.2432
Detection Prevalence : 0.5473
Balanced Accuracy : 0.6401

'Positive' Class : 1
