
GUDLAVALLERU ENGINEERING COLLEGE

(An Autonomous Institute with Permanent Affiliation to JNTUK, Kakinada)


Seshadri Rao Knowledge Village, Gudlavalleru – 521 356.

Department of Computer Science and Engineering

Internship Viva Questions

Class: IV B.Tech    Academic Year: 2019-20

Batch No: C5    Roll No's: 16481A05H9, 16481A05D1, 16481A05F5, 17485A0526

Name of the project guide: Mahankali Surya Tej

1) What is title of your project?

Ans: Employee Attrition Prediction using Logistic Regression

2) To which domain is your project applicable?

Ans: HR analytics in organizations.

3) What is the problem you have identified in that domain?

Ans: HR and managers are not aware in advance of which employees are likely to leave the organization voluntarily.

4) What is the existing methodology you have identified to solve the problem?

Ans: Predicting voluntary attrition using machine learning models such as ANN and Decision Tree.

5) What are the drawbacks in existing methodology?

Ans: Low accuracy rate

6) What is your proposed methodology?

Ans: Predicting attrition with Logistic Regression, a binary classification model, which gives a better accuracy rate than the existing methodologies.
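
A minimal sketch of this approach with scikit-learn is shown below; the CSV file name and the target encoding are assumptions based on the Kaggle version of the dataset, not taken from the project code.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load the IBM HR Analytics data (file name is an assumption).
df = pd.read_csv("WA_Fn-UseC_-HR-Employee-Attrition.csv")

# Encode the binary target: 1 = employee leaves, 0 = employee stays.
y = (df["Attrition"] == "Yes").astype(int)

# Keep only numeric predictors for this sketch; categorical columns
# would need one-hot encoding before training.
X = df.select_dtypes(include="number")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Scale the features, then fit the binary logistic regression classifier.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```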

7) What is the expected outcome?

Ans: To predict whether a particular employee will remain in the organization or not.

8) What is the data?

Ans: The data contains details about each employee's job, work issues, and work environment.
9) Is it structured, semi-structured, quasi-structured, or unstructured?

Ans: Structured data

10) Is there any conversion from unstructured or quasi-structured to semi-structured, or from semi-structured to structured?

Ans: No

11) What is the reason for conversion?

Ans: No conversion is required.

12) Which data structure are you using?

Ans: NumPy arrays, since they are easy to use during the preprocessing steps.
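
As an illustration (not the project's actual code), converting the loaded DataFrame to a NumPy array for preprocessing might look like this; the file name is an assumption:

```python
import pandas as pd

df = pd.read_csv("WA_Fn-UseC_-HR-Employee-Attrition.csv")  # assumed file name

# .to_numpy() gives a NumPy array that scikit-learn estimators consume directly.
X = df.select_dtypes(include="number").to_numpy()

print(X.shape)         # (rows, numeric columns)
print(X.mean(axis=0))  # vectorised column means, no explicit loops needed
```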

13) Reasons for choosing this data structure?

14) Is there any advantage in choosing a particular data structure?

15) Did you identify any other data structures?

16) Have you identified any improvement from choosing this data structure?

17) What is the size of data in terms of GB or TB?

Ans: About 0.00223 GB (roughly 2.23 MB).

18) In terms of number of records?

Ans: 1471

19) In terms of number of attributes?

Ans: 35

20) In terms of dimensionality?

Ans: 1471 × 35 (records × attributes)

21) Is it high voluminous data?

Ans: No

22) What is the source of data?

Ans: The IBM HR Analytics dataset on Kaggle.

23) What are the target fields?

Ans: Attributes such as Job Satisfaction, Job Level, Working Hours, and Monthly Salary.
24) What are the tools and techniques used to process or analyse the data?

Ans: Correlation heatmap, along with the seaborn, matplotlib, NumPy, and pandas packages.
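
A short sketch of how such a correlation heatmap could be produced with pandas, seaborn, and matplotlib; the file name is an assumption:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("WA_Fn-UseC_-HR-Employee-Attrition.csv")  # assumed file name

# Correlation heatmap over the numeric attributes.
corr = df.select_dtypes(include="number").corr()
plt.figure(figsize=(12, 10))
sns.heatmap(corr, cmap="coolwarm", annot=False)
plt.title("Correlation heatmap of numeric attributes")
plt.tight_layout()
plt.show()
```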

25) What is the operating environment or platform?

Ans: Jupyter Notebook and IBM Watson Machine Learning.

26) Did you use or design any GUI for the end user?

Ans: Node-RED flows are used to design the UI.

27) How clean is the data?

Ans: The data is clean; there are no inaccurate records or missing values.

28) Data conditioning is required or not?

Ans: No

29) Which processes are adopted, such as cleaning, normalizing datasets, and performing transformations?

Ans: Standard scaling techniques are used to transform the data, and other cleaning techniques are used to find missing values.
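
A hedged sketch of the missing-value check and standard scaling described above; the file name is an assumption and the project's exact steps may differ:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("WA_Fn-UseC_-HR-Employee-Attrition.csv")  # assumed file name

# Check for missing values per column before any transformation.
print(df.isnull().sum())

# Standard scaling: shift each numeric column to zero mean, unit variance.
numeric_cols = df.select_dtypes(include="number").columns
df[numeric_cols] = StandardScaler().fit_transform(df[numeric_cols])
```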

30) To what degree the data contains missing or inconsistent values?

Ans: Our data does not contain any missing or inconsistent values.

31) Does the data contain values deviating from normal?

Ans: No

32) Are the data types consistent or not (is the data wholly numeric or a mixture of alphanumeric strings and text)?

Ans: The data is a mixture of alphanumeric strings and text.

33) Does the data make sense? (For example, if the project involves analyzing income values, previewing the data to confirm that the income values are positive, or whether it is acceptable to have zeros or negative values.)

Ans: Yes; the employee attrition target is encoded as 0 or 1, with 0 indicating that the employee does not leave the company and 1 indicating that the employee leaves.

34) Have you used any visualization tools to examine data quality, such as whether the data contains many unexpected values or other indicators of dirty data?

Ans: Some of the methods from the matplotlib package are used to check the quality of the data.
35) Have you used any open-source tool such as OpenRefine, which is a popular GUI-based tool for performing data transformations, or are you using Python?

Ans: We used Python, and Node-RED flows to develop the UI.

36) List the attributes which are numeric and categorical.

Ans: Job Satisfaction, Job level, Monthly Salary, WorkLifeBalance, PerformanceRating.

37) Do you find any relationship between variables using scatter plots?

Ans: Yes, there is a dependent relationship between variables such as Job Satisfaction and PerformanceRating.

38) Have you explored numerical data using box plot and histograms?

Ans: Yes, we explored the numerical data using histograms to examine its distribution (a short sketch follows).
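
A brief sketch of the exploratory plots mentioned in questions 37 and 38; the column names (JobSatisfaction, PerformanceRating, MonthlyIncome) are taken from the public IBM dataset and should be treated as assumptions here:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("WA_Fn-UseC_-HR-Employee-Attrition.csv")  # assumed file name

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Scatter plot to look for a relationship between two attributes.
df.plot.scatter(x="JobSatisfaction", y="PerformanceRating", ax=axes[0])

# Histogram and box plot to explore a numeric attribute's distribution.
df["MonthlyIncome"].plot.hist(bins=30, ax=axes[1])
df.boxplot(column="MonthlyIncome", ax=axes[2])

plt.tight_layout()
plt.show()
```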

39) Do you find any central tendency using mean and median?

Ans: Yes, we found the central tendency of the target fields using the mean and median.

40) What information you are getting by calculating mean?

Ans: The mean is used to find the average value of a field.

41) What information are you getting based on the deviation between the mean and the median?

Ans: The deviation between the mean and the median indicates how the data is skewed around its central value.

42) Are you finding data dispersion using variance and standard deviation?

Ans: There is no dispersion in our data.

43) Have you applied any dimensionality reduction techniques to improve efficiency of
machine learning algorithms in terms of space and time?

Ans: Fields such as employee name and ID are not needed to train the model, so they are dropped using the pandas framework and NumPy arrays.
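
A small sketch of dropping such identifier-like fields with pandas; the column names listed are illustrative assumptions:

```python
import pandas as pd

df = pd.read_csv("WA_Fn-UseC_-HR-Employee-Attrition.csv")  # assumed file name

# Columns that carry no predictive signal are dropped before training;
# the names below are assumed identifier-like columns of that kind.
drop_cols = ["EmployeeNumber", "EmployeeCount", "Over18", "StandardHours"]
df = df.drop(columns=[c for c in drop_cols if c in df.columns])
```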

44) Have you dealt with missing values and outliers to improve quality of the data?

Ans: We dealt with outliers to improve the quality of data.

Questions on regression

1) What is regression?

Ans: Regression is a statistical method used in finance, investing, and other disciplines that

attempts to determine the strength and character of the relationship between one
dependent variable (usually denoted by Y) and a series of other variables (known as

independent variables).

2) What are various regression models?

Ans: Linear Regression, Polynomial Regression, Logistic Regression, and Multiple Linear Regression.

3) What is the purpose of regression?

Ans: In order to predict the value of the dependent variable for individuals for whom some

information concerning the explanatory variables is available, or in order to estimate

the effect of some explanatory variable on the dependent variable.

4) What is simple regression?

Ans: Simple linear regression is a linear regression model with a single explanatory

variable.

5) What is multiple regression?

Ans: Multiple linear regression is a linear regression model with multiple explanatory variables.
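
A generic sketch contrasting simple and multiple linear regression with scikit-learn, using synthetic data purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Simple linear regression: one explanatory variable.
x = rng.normal(size=(100, 1))
y = 3.0 * x[:, 0] + rng.normal(scale=0.1, size=100)
simple = LinearRegression().fit(x, y)
print("simple coefficient:", simple.coef_)

# Multiple linear regression: several explanatory variables.
X = rng.normal(size=(100, 3))
y_multi = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)
multiple = LinearRegression().fit(X, y_multi)
print("multiple coefficients:", multiple.coef_)
```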

Questions on classification

1) What is the difference between supervised and unsupervised machine learning?

Ans: Supervised learning is the technique of accomplishing a task by providing training input and output patterns to the system, whereas unsupervised learning is a self-learning technique in which the system has to discover the features of the input population on its own.

2) When should you use classification over regression?

Ans: The main difference between them is that the output variable in regression is numerical (or continuous), while that for classification is categorical (or discrete); so classification should be used when the target being predicted is categorical.

3) What is the need for Pruning in Decision trees?

Ans: Pruning is a technique in machine learning and search algorithms that reduces the size

of decision trees by removing sections of the tree that provide little power to classify
instances. Pruning reduces the complexity of the final classifier, and hence improves

predictive accuracy by the reduction of overfitting.

4) How do classification and regression differ?

Ans: The difference between them is that the output variable in regression is numerical (or

continuous) while that for classification is categorical (or discrete).

5) What is decision tree classification?

Ans: Decision tree learning is one of the predictive modeling approaches used in statistics,

data mining and machine learning. It uses a decision tree to go from observations about

an item to conclusions about the item's target value

6) What is the role of Entropy in a Decision Tree?

Ans: Entropy is the measure of impurity, disorder, or uncertainty in a bunch of examples.

7) What is the difference between Entropy and Information Gain?

Ans: Information gain (IG) measures how much "information" a feature gives us about the class, whereas entropy is the measure of impurity, disorder, or uncertainty in a bunch of examples.
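
A small worked sketch of entropy and information gain for a binary split, written from the standard definitions rather than from the project code:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, left, right):
    """Entropy of the parent minus the weighted entropy of the children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

parent = np.array([1, 1, 1, 0, 0, 0, 0, 0])
left, right = parent[:4], parent[4:]   # a candidate split
print(information_gain(parent, left, right))
```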

8) What is Overfitting? What is Underfitting?

Ans: Overfitting is good performance on the training data but poor generalization to other data. Underfitting is poor performance on the training data and poor generalization to other data.

9) What is ‘Training set’ and ‘Test set’?

Ans: The training set is the one on which we train and fit our model, basically to fit the parameters, whereas the test set is used only to assess the performance of the model.

10) List some classification accuracy measures

Ans: AUC-ROC curve, confusion matrix, etc.

11) What is confusion matrix?

Ans: A confusion matrix is a table that is often used to describe the performance of a

classification model on a set of test data for which the true values are known.
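
A quick illustration with scikit-learn, using made-up true and predicted labels:

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 1, 0, 1, 0]   # actual attrition labels (illustrative)
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]   # labels predicted by the model

# For binary labels 0/1, rows are actual classes and columns are predicted:
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_true, y_pred))
```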

12) What is true positive and false negative?


Ans: A true positive is an outcome where the model correctly predicts the positive class. And

a false negative is an outcome where the model incorrectly predicts the negative class.

13) What is sensitivity, specificity and precision?

Ans: Sensitivity is the ability of a test to correctly identify those with the disease (true positive rate), whereas specificity is the ability of the test to correctly identify those without the disease (true negative rate). Precision is the fraction of predicted positives that are actually positive.

14) What is ROC curve?

Ans: A receiver operating characteristic curve, or ROC curve, is a graphical plot that

illustrates the diagnostic ability of a binary classifier system as its discrimination

threshold is varied. The ROC curve is created by plotting the true positive rate against

the false positive rate at various threshold settings.
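
A minimal ROC-curve sketch with scikit-learn; the data here is synthetic, purely for illustration:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the attrition data, purely for illustration.
X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = model.predict_proba(X_te)[:, 1]        # probability of the positive class

fpr, tpr, thresholds = roc_curve(y_te, scores)  # TPR vs FPR at each threshold
plt.plot(fpr, tpr, label=f"AUC = {roc_auc_score(y_te, scores):.2f}")
plt.plot([0, 1], [0, 1], linestyle="--")        # chance line
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```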

15) What are Bayesian classifiers?

Ans: Bayesian classifiers are statistical classifiers. They can predict class membership probabilities, such as the probability that a given tuple belongs to a particular class.

16) What is K-Nearest-Neighbor Classifier?

Ans: K-Nearest Neighbors (KNN) is one of the simplest algorithms used in machine learning for regression and classification problems. KNN classifies new data points based on similarity measures.
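
A tiny illustrative KNN sketch with scikit-learn on synthetic data (not the project dataset):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

# Each new point is labelled by a majority vote of its 5 nearest neighbours.
knn = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
print("test accuracy:", knn.score(X_te, y_te))
```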

17) How do you select the best classifier suitable for your application?

Ans: If your data is labeled but you only have a limited amount of it, you should use a classifier with high bias, for example Naïve Bayes.
