Basics of Statistics

1) The number of employees reported by a human resource manager is an example of a:
A) flowchart variable
B) discrete variable
C) continuous variable
D) measuring variable

2) Five numbers are given: (5, 10, 15, 5, 15). Now, what would be the sum of deviations of
individual data points from their mean?
A) 10
B) 25
C) 50
D) 0
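
The arithmetic behind question 2 can be checked with a minimal sketch using only the Python standard library. The variable names are illustrative and not part of the quiz.

```python
from statistics import mean

data = [5, 10, 15, 5, 15]

center = mean(data)                      # arithmetic mean of the five values
deviations = [x - center for x in data]  # signed distance of each point from the mean
total_deviation = sum(deviations)        # positive and negative deviations cancel

print(center, deviations, total_deviation)
```

The same cancellation holds for any data set, because the mean is by definition the balance point of the values.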

3) If a positively skewed distribution has a median of 50, which of the following statements is
true?
A) Mean is greater than 50
B) Mean is less than 50
C) Mode is less than 50
D) Both A and C

4) Which of the following is false?


A) Q2 = Median
B) Q2 = 50th percentile
C) Q2 = 25th percentile
D) Q2 = 5th decile

5) Which of these measures are used to analyze the central tendency of data?
A) Mean and Normal Distribution
B) Mean, Median and Mode
C) Mode, Alpha & Range
D) Standard Deviation, Range and Mean

6) Which of the following is true about the histogram given below?

A) The histogram above is unimodal
B) The histogram above is bimodal
C) The figure above is not a histogram
D) None of the above

7) The standard normal curve is symmetric about 0 and the total area under it is 1.
A) Yes
B) No
C) Sometimes
D) Can’t say

8) A number is selected at random from the set of numbers {11, 12, …, 99}.
What is the probability that the selected number contains the digit 9?
A) 19/89
B) 18/89
C) 1/10
D) 11/100
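
Question 8 can be verified by brute-force enumeration. The sketch below simply lists the set and counts the members whose decimal representation contains a 9; the names are illustrative.

```python
from fractions import Fraction

numbers = range(11, 100)  # the set {11, 12, ..., 99}

# Keep every number that has the digit 9 in either position.
with_nine = [n for n in numbers if "9" in str(n)]

probability = Fraction(len(with_nine), len(numbers))
print(len(with_nine), len(numbers), probability)
```
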
9) A random sample of 5 lucky winners is to be selected from a group of 10 ladies and 20 men
using simple random sampling with replacement (SRSWR). The number of ladies among the lucky
winners will follow a:
A) Hypergeometric distribution
B) Bernoulli distribution
C) Binomial distribution
D) Poisson distribution

10) A parcel of 12 books contains 4 books with loose binding. What is the probability that a
random selection of 6 books (without replacement) will contain 3 books with loose binding?
A) 0.24
B) 0.50
C) 0.26
D) 0.48
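
Sampling without replacement, as in question 10, is usually modelled with the hypergeometric distribution. A minimal sketch of that calculation, with illustrative variable names:

```python
from math import comb

total_books = 12   # books in the parcel
loose = 4          # books with loose binding
drawn = 6          # books selected without replacement
wanted = 3         # loose-bound books we want in the selection

# Ways to pick 3 loose books and 3 sound books, over all ways to pick 6 books.
favourable = comb(loose, wanted) * comb(total_books - loose, drawn - wanted)
possible = comb(total_books, drawn)
probability = favourable / possible

print(favourable, possible, round(probability, 4))
```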

11) Seventy percent of the letters received by the popular T.V. program ‘SAHELI’ are written by
ladies. What is the probability that exactly 2 prize awards out of 5 are bagged by ladies?
A) 0.2312
B) 0.1323
C) 0.2854
D) 0.1524
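
One way to read question 11 is as a binomial problem, assuming each of the 5 awards independently goes to a lady with probability 0.7. That independence assumption is not stated in the question; under it, the calculation can be sketched as:

```python
from math import comb

n_awards = 5   # number of prize awards
p_lady = 0.7   # assumed probability that any one award goes to a lady
k = 2          # awards won by ladies

# Binomial probability mass function: C(n, k) * p^k * (1 - p)^(n - k)
probability = comb(n_awards, k) * p_lady**k * (1 - p_lady) ** (n_awards - k)

print(round(probability, 4))
```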

12) The average number of misprints per page of a book is 1.5. Find the probability that a
particular book is free from misprints.
A) 0
B) 0.25
C) 0.22
D) 0.15

13) A personnel officer knows that about 20% of the applicants for a certain position are
suitable for the job. What is the probability that the 5th person interviewed will be the first one
who is suitable?
A) 0.082
B) 0.072
C) 0.080
D) 0.062
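
Question 13 describes a wait for the first success, which is the setting of the geometric distribution. A minimal sketch, assuming applicants are independent and each is suitable with probability 0.2:

```python
p_suitable = 0.2   # probability that an applicant is suitable
trial = 5          # the interview on which the first suitable applicant appears

# Four unsuitable applicants in a row, followed by one suitable applicant.
probability = (1 - p_suitable) ** (trial - 1) * p_suitable

print(round(probability, 3))
```
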
14) Find the probability that a person tossing a fair coin gets the third head on the seventh toss.
A) 0.3
B) 0.2734
C) 0.1563
D) 0.1256

15) What happens to the confidence interval when we introduce some outliers to the data?
A) Confidence interval is robust to outliers
B) Confidence interval will increase with the introduction of outliers.
C) Confidence interval will decrease with the introduction of outliers.
D) We cannot determine the confidence interval in this case.

16) A medical doctor wants to reduce the blood sugar level of all his patients by altering their diet.
He finds that the mean sugar level of all patients is 180 with a standard deviation of 18. Nine of
his patients start dieting, and the sample mean is observed to be 175. He is now considering
recommending that all his patients go on a diet.
Note: He calculates a 99% confidence interval.
What is the standard error of the mean?
A) 9
B) 6
C) 7.5
D) 18
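
The standard error in question 16 is the population standard deviation divided by the square root of the sample size. The sketch below also computes the 99% interval mentioned in the note, under the assumption (not stated in the question) that a normal z-interval with known standard deviation is intended.

```python
from math import sqrt
from statistics import NormalDist

population_sd = 18   # standard deviation given in the question
sample_size = 9      # number of patients who started dieting
sample_mean = 175    # observed mean of the sample

standard_error = population_sd / sqrt(sample_size)

# Two-sided 99% interval: 0.5% in each tail of the standard normal (assumed z-interval).
z = NormalDist().inv_cdf(0.995)
lower = sample_mean - z * standard_error
upper = sample_mean + z * standard_error

print(standard_error, round(lower, 2), round(upper, 2))
```

Under that assumption the interval still contains the overall mean of 180, so the sample alone would not show a clear effect of the diet.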

17) Choose the correct option about the F statistic:
A) The F statistic is always positive
B) The F statistic lies between 0 and 1
C) The F statistic lies between -∞ and +∞
D) The F statistic is always negative

18) What is the relationship between significance level and confidence level?
A) Significance level = Confidence level
B) Significance level = 1- Confidence level
C) Significance level = 1/Confidence level
D) Significance level = sqrt (1 – Confidence level)

19) If a mechanic looks at your car engine and says there is nothing wrong with it and your car
breaks down when you leave the garage, what type of error did the mechanic make?
A) Type I
B) Type II
C) Systematic error
D) Matrix error

20) Null and alternative hypotheses are statements about:


A) population parameters
B) sample parameters
C) sample statistics
D) it depends - sometimes population parameters and sometimes sample statistics

Linear and Logistic Regression


21) What happens when we introduce more variables to a linear regression model?
A) The r squared value may increase or remain constant, the adjusted r squared may
increase or decrease.
B) The r squared may increase or decrease while the adjusted r squared always
increases.
C) Both r square and adjusted r square always increase on the introduction of new
variables in the model.
D) Both might increase or decrease depending on the variables introduced.

22) In univariate linear least squares regression, relationship between correlation coefficient
and coefficient of determination is ______ ?
A) Both are unrelated
B) The coefficient of determination is the coefficient of correlation squared
C) The coefficient of determination is the square root of the coefficient of correlation
D) Both are the same

23) We have a linear regression equation (Y = 5X + 40) for the table below.

X    Y
5    45
6    76
7    78
8    87
9    79

Which of the following is the MAE (Mean Absolute Error) of this linear model?
A) 8.4
B) 10.29
C) 42.5
D) None of the above
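
The mean absolute error in question 23 is the average of the absolute differences between the observed Y values and the values predicted by Y = 5X + 40. A minimal sketch with illustrative names:

```python
xs = [5, 6, 7, 8, 9]
ys = [45, 76, 78, 87, 79]

predictions = [5 * x + 40 for x in xs]                 # fitted values from the given line
abs_errors = [abs(y - p) for y, p in zip(ys, predictions)]
mae = sum(abs_errors) / len(abs_errors)                # mean absolute error

print(predictions, abs_errors, mae)
```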

24) Suppose you are given 7 Scatter plots 1-7 (left to right) and you want to compare Pearson
correlation coefficients between variables of each scatter plot.

1. 1<2<3<4
2. 1>2>3 > 4
3. 7<6<5<4
4. 7>6>5>4
Which of the above is in the right order?
A) 1 and 3
B) 2 and 3
C) 1 and 4
D) 2 and 4

25) The method of least squares dictates that we choose a regression line where the sum of the
squared deviations of the points from the line is:
A) Maximum
B) Minimum
C) Zero
D) Positive

26) The assumption that the variance of the residuals about the predicted dependent variable
scores should be the same for all predicted scores reflects which assumption?
A) Normality
B) Homoscedasticity
C) Singularity
D) Multicollinearity

27) The percent of total variation of the dependent variable Y explained by the set of
independent variables X is measured by
A) Coefficient of Correlation
B) Coefficient of Skewness
C) Coefficient of Determination
D) Standard Error of Estimate

28) The output above gives the regression model results and the VIF of each variable. How many
variables show high multicollinearity?
A) 3
B) 4
C) 2
D) None

29) Which of the following evaluation metrics can be used to evaluate a model while modeling
a continuous output variable?
A) AUC-ROC
B) Accuracy
C) Logloss
D) Mean-Squared-Error

30) Which of the following statements is true about outliers in linear regression?
A) Linear regression is sensitive to outliers
B) Linear regression is not sensitive to outliers
C) Can’t say
D) None of these

31) A dummy variable:


A) is an irrelevant, misleading variable.
B) is one nominal category versus all of the other nominal categories of a variable.
C) is the smallest realistic value of a variable.
D) is a measure of intelligence.

32) A classification table:


A) indicates how well a model has predicted group membership.
B) helps the researcher assess statistical significance.
C) indicates how well the independent variables correlate with the dependent variable.
D) helps the researcher classify a variable into its component categories.

33) Which of the following methods do we use to best fit the data in Logistic Regression?
A) Least Square Error
B) Maximum Likelihood
C) Jaccard distance
D) Both A and B

34) Suppose you have been given a fair coin and you want to find out the odds of getting heads.
Which of the following options is true for such a case?
A) odds will be 0
B) odds will be 0.5
C) odds will be 1
D) None of these
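
Odds are defined as the probability of an event divided by the probability of its complement, which is the quantity logistic regression models on a log scale. A minimal sketch for question 34; the helper name `odds` is illustrative:

```python
def odds(p):
    """Return the odds p / (1 - p) for an event with probability p (p < 1)."""
    return p / (1 - p)

fair_coin_odds = odds(0.5)   # heads and tails are equally likely
print(fair_coin_odds)
```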

35) Which of the following options is true?
A) Linear Regression error values have to be normally distributed, but in the case of Logistic
Regression they do not
B) Logistic Regression error values have to be normally distributed, but in the case of Linear
Regression they do not
C) Both Linear Regression and Logistic Regression error values have to be normally
distributed
D) Neither Linear Regression nor Logistic Regression error values have to be normally
distributed

36) What are the axes of an ROC curve?


A) Vertical axis: % of true negatives; Horizontal axis: % of false negatives
B) Vertical axis: % of true positives; Horizontal axis: % of false positives
C) Vertical axis: % of false negatives; Horizontal axis: % of false positives
D) Vertical axis: % of false positives; Horizontal axis: % of true negatives

37) If, in a dataset with 250 positives, a logistic regression (LogR) model classifies 200 positives
correctly, the specificity is
A) 0.8
B) 0.2
C) 1.25
D) Can’t say

38) True Positive Rate is also called
1. Specificity
2. Recall
3. Sensitivity
4. Accuracy
A) Only 3
B) Only 1
C) Both 2 and 3
D) Both 1 and 4

39) Misclassification Rate is:
A) (FP+FN)/Total
B) 1 - Accuracy
C) Error Rate
D) All of the above
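
The relationships in questions 38 and 39 can be illustrated on a small confusion matrix. The four counts below are hypothetical values chosen only for illustration; they do not come from the quiz.

```python
# Hypothetical confusion-matrix counts (not taken from the quiz).
tp, fn = 50, 4    # actual positives: correctly and incorrectly classified
tn, fp = 40, 6    # actual negatives: correctly and incorrectly classified
total = tp + fn + tn + fp

accuracy = (tp + tn) / total
misclassification_rate = (fp + fn) / total   # also called the error rate
sensitivity = tp / (tp + fn)                 # true positive rate, also called recall
specificity = tn / (tn + fp)                 # true negative rate

print(accuracy, misclassification_rate, sensitivity, specificity)
```

Note that specificity depends only on the actual negatives, which is why counts of positives alone do not determine it.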

40) Classification Table

Choose the correct statement(s) from below:


A) Accuracy is 0.91
B) Misclassification rate is 0.09
C) Sensitivity is 0.95
D) All of the above

Time Series

41) Which of the following are examples of a time series problem?
1. Estimating the number of hotel room bookings in the next 6 months.
2. Estimating the total sales of an insurance company in the next 3 years.
3. Estimating the number of calls for the next week.
A) 1 and 2
B) 2 and 3
C) 1 and 3
D) 1,2 and 3

42) Sum of weights in exponential smoothing is _____.
A) <1
B) 1
C) >1
D) None of the above
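
In simple exponential smoothing the forecast is a weighted average in which the weight on an observation k periods back is alpha * (1 - alpha)^k, with the remaining weight falling on the initial level. A minimal sketch, assuming a hypothetical smoothing constant of 0.3 and ten observations:

```python
alpha = 0.3   # hypothetical smoothing constant, 0 < alpha < 1
n_obs = 10    # hypothetical number of observations

# Weight on the observation k periods back, most recent first.
weights = [alpha * (1 - alpha) ** k for k in range(n_obs)]

# Whatever weight is left over falls on the initial level of the recursion.
initial_level_weight = (1 - alpha) ** n_obs

total_weight = sum(weights) + initial_level_weight
print([round(w, 4) for w in weights], round(total_weight, 10))
```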

43) Multiplicative model for time series is Y=...


A) T - S - C - I
B) T + S + C + I
C) T x S x C x I
D) None

44) The augmented Dickey-Fuller unit root test can be used to test for
A) Normality
B) Independence
C) Stationarity
D) Invertibility

45) In an ARIMA, differencing is carried out


A) To convert a stationary process to a non-stationary process
B) To convert a non-stationary process to a stationary process
C) To remove seasonal fluctuations from the data
D) To remove cyclical fluctuations from the data

Classification

46) Naive Bayes algorithm is a
A) Supervised learning model
B) Unsupervised learning model
C) Both of the Above
D) None of the Above

47) Which of the following distance metrics cannot be used in k-NN?
A) Manhattan
B) Minkowski
C) Mahalanobis
D) All can be used

48) Which of the following machine learning algorithms can be used for imputing missing values
of both categorical and continuous variables?
A) K-NN
B) Linear Regression
C) Logistic Regression
D) None

49) Which of the following algorithms is not an example of an ensemble method?
A) Extra Tree Regressor
B) Random Forest
C) Gradient Boosting
D) Decision Tree

50) Given 1000 observations, a minimum of 200 observations required to split a node, and a
minimum leaf size of 300, what could be the maximum depth of a decision tree?
A) 1
B) 2
C) 3
D) 4
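
The constraint in question 50 can be explored with a short recursion. This is a sketch under two assumptions that the question does not spell out: the root sits at depth 0, and every child of a split must itself hold at least the minimum leaf size. The function name `max_depth` is illustrative.

```python
def max_depth(n_samples, min_split=200, min_leaf=300):
    """Greatest depth reachable from a node holding n_samples observations."""
    # A split needs enough samples to be attempted and to fill two valid children.
    if n_samples < min_split or n_samples < 2 * min_leaf:
        return 0
    # Making one child as small as allowed leaves the most samples for deeper splits.
    return 1 + max_depth(n_samples - min_leaf, min_split, min_leaf)

depth = max_depth(1000)
print(depth)
```

Tracing it by hand: 1000 splits into 300 and 700, 700 splits into 300 and 400, and 400 cannot be divided into two leaves of at least 300.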

51) The data scientists at “Mart Inc” have collected 2013 sales data for 1600 products across 10
stores in different cities. Also, certain attributes of each product and store have been
defined. The aim is to build a predictive model and find out the sales of each
product at a particular store during a defined period.
Which learning problem does this belong to?
A) Supervised learning
B) Unsupervised learning
C) Reinforcement learning
D) None

52) In Random Forest, which of the following is randomly selected?


A) Number of decision trees
B) features to be taken into account when building a tree
C) samples to be given to train individual tree in a forest
D) B and C

53) The minimum and maximum values of the Gini index are:
A) 0 and 0.5
B) 0 and 1
C) 0.5 and 1
D) 1 and 2
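
For question 53, the range of the Gini impurity can be checked numerically. The sketch assumes a two-class problem, which is the usual setting for this quiz question; with k classes the upper bound would be 1 - 1/k instead.

```python
def gini(p):
    """Gini impurity of a node where a fraction p of samples is in class 1 (two classes)."""
    return 1 - (p**2 + (1 - p) ** 2)

# Evaluate the impurity on a grid of class proportions from 0 to 1.
values = [gini(i / 100) for i in range(101)]

print(min(values), max(values))
```

The impurity is 0 for a pure node and largest when the two classes are evenly mixed.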

54) While constructing decision tree algorithms, attribute selection measures are used to
A) Select the splitting criteria that best separate the data
B) Reduce the dimensionality
C) Reduce the error rate
D) Rank attributes

55) A database of 5000 transactions was partitioned into fraudulent and non-fraudulent
transactions. A machine learning algorithm was then deployed on this database.
On completion, the algorithm correctly labeled 75% of the actual fraudulent transactions as
fraudulent. Using this information, complete the table below and answer the question.

Actual class ↓ / Predicted class →   Fraudulent   Non-Fraudulent   Total
Fraudulent                                                          500
Non-Fraudulent                                     A
Total                                              4400             5000

What is the value of A?
A) 4275
B) 4350
C) 4400
D) 4500
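
The table in question 55 can be completed step by step. The sketch assumes one reading of the flattened table: 500 is the row total of actual fraudulent transactions, 4400 is the column total of transactions predicted non-fraudulent, and A is the cell for non-fraudulent transactions predicted non-fraudulent. Variable names are illustrative.

```python
total = 5000
actual_fraud = 500           # assumed: row total for actual fraudulent transactions
predicted_non_fraud = 4400   # assumed: column total for predicted non-fraudulent
recall = 0.75                # share of actual fraud correctly labeled as fraudulent

true_pos = round(recall * actual_fraud)      # fraud predicted as fraud
false_neg = actual_fraud - true_pos          # fraud predicted as non-fraud
true_neg = predicted_non_fraud - false_neg   # cell A under the assumed layout
false_pos = (total - actual_fraud) - true_neg

a_value = true_neg
print(true_pos, false_neg, false_pos, a_value)
```

As a consistency check, the four cells add back up to the 5000 transactions in the database.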

Clustering

56) Which of the following is required by K-means clustering?
A) defined distance metric
B) number of clusters
C) initial guess as to cluster centroids
D) All of the Mentioned

57) What is the minimum number of variables/features required to perform clustering?
A) 0
B) 1
C) 2
D) 3

58) What should be the best choice for the number of clusters based on the following results:

A) 1
B) 2
C) 3
D) 4

59) Do we identify a set of independent variables and a dependent variable when we do
clustering?
A) Yes
B) No
C) Can’t say
D) None of the Above

60) Point out the wrong statement:


A) k-means clustering is a method of vector quantization
B) k-means clustering aims to partition n observations into k clusters
C) k-nearest neighbor is same as k-means
D) None of the Mentioned

Market Basket

61) What should we expect if the support value specified to the Apriori algorithm is increased?
A) Increase in rules excavated
B) Decrease in rules excavated
C) No change in rules excavated
D) None of the above

62) The following points define:
- a data mining technique
- uncovers frequent patterns (association rules) among sets of items in a transactional database
- formed as if/then statements
- relies on conditional probability
A) Descriptive Analytics
B) Association Analysis
C) Market Basket Analysis
D) Descriptive Analysis

63) Methods to measure the goodness of association rules


A) Support
B) Confidence
C) Lift
D) All of the above

64) __________________ is the goal of Association Analysis


A) Giving broad insight into the business
B) Turn item set into association rules
C) Describing dataset
D) None of the above

65) Confidence with respect to association rule mining is


A) Rule occurrence count
B) Ratio of rule occurrence count to total transactions count
C) Ratio of rule occurrence given a condition
D) None of the above
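
Support, confidence and lift from questions 63 and 65 can be computed directly from transaction counts. The five baskets and the rule {bread} -> {milk} below are hypothetical and serve only to illustrate the definitions.

```python
# Hypothetical transactions (not taken from the quiz).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]
antecedent, consequent = {"bread"}, {"milk"}

n = len(transactions)
count_antecedent = sum(antecedent <= t for t in transactions)
count_consequent = sum(consequent <= t for t in transactions)
count_rule = sum((antecedent | consequent) <= t for t in transactions)

support = count_rule / n                     # share of all baskets containing the whole rule
confidence = count_rule / count_antecedent   # rule occurrence given the antecedent
lift = confidence / (count_consequent / n)   # confidence relative to the consequent's base rate

print(support, confidence, lift)
```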

R Programming

66) The dplyr package can be installed from CRAN using:
A) installall.packages("dplyr")
B) install.packages("dplyr")
C) installed.packages("dplyr")
D) none of the mentioned

67) What is the class of the object defined by the expression x <- c(4, "a", TRUE)?
A) Numeric
B) Character
C) Integer
D) Logical

68) If I have two vectors x <- c(1, 3, 5) and y <- c(3, 2, 10), what is produced by the expression
rbind(x, y)?
A) A vector of length 2
B) a 2 by 2 matrix
C) a vector of length 3
D) a 2 by 3 matrix

69) Suppose we have a vector x <- 1:4 and y <- 2:3. What is produced by the expression x + y?
A) a numeric vector with the values 3, 5, 3, 4.
B) an integer vector with the values 3, 5, 5, 7.
C) a numeric vector with the values 1, 2, 5, 7.
D) an error.

70) ______ is the function in R to get the number of observations in a data frame
A) n()
B) ncols()
C) nobs()
D) nrow()

71) Which of the following is used for reading tabular data?
A) read.csv
B) dget
C) readLines
D) none of the mentioned

72) _________ generates summary statistics of different variables in the data frame, possibly
within strata
A) rename
B) summarize
C) set
D) subset

73) What will be the output of the following code snippet?

> paste("a", "b", se = ":")
A) "a+b"
B) "a=b"
C) "a b :"
D) none of the mentioned

74) How are missing values and impossible values represented in the R language, respectively?
A) NaN, NA
B) NA, NaN
C) NA, NULL
D) NULL, NaN

75) Choose the correct statement


A) There is no difference between lapply and sapply functions
B) sapply gives the output as a vector and lapply gives the output as a list
C) sapply gives the output as a list and lapply gives the output as a vector
D) sapply can be used on a dataframe and lapply cannot be used on a dataframe
