Ans: Organizations
Ans: HR and managers are not aware of voluntary attrition among the employees in an
organization.
4) What is the existing methodology you have identified to solve the problem?
Ans: Predicting voluntary attrition using machine learning models such as ANN and
decision trees.
Ans: Prediction using binary classification with logistic regression, which gives a good
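As a rough illustration of binary classification with logistic regression, the sketch below fits a one-feature model by plain gradient descent. The feature (`overtime_hours`), the toy labels, and the learning rate and epoch count are all assumptions made for this example; they are not taken from the project data.

```python
import math

def sigmoid(z):
    """Map any real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, learning_rate=0.1, epochs=2000):
    """Fit weight w and bias b by gradient descent on the log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = 0.0
        grad_b = 0.0
        for x, y in zip(xs, ys):
            error = sigmoid(w * x + b) - y  # predicted probability minus label
            grad_w += error * x
            grad_b += error
        w -= learning_rate * grad_w / n
        b -= learning_rate * grad_b / n
    return w, b

# Hypothetical toy data: weekly overtime hours and attrition (1 = left, 0 = stayed).
overtime_hours = [0, 1, 2, 3, 7, 8, 9, 10]
attrition = [0, 0, 0, 0, 1, 1, 1, 1]

w, b = fit_logistic(overtime_hours, attrition)
p_low = sigmoid(w * 0 + b)    # estimated probability of leaving with no overtime
p_high = sigmoid(w * 10 + b)  # estimated probability of leaving with 10 hours
print(round(p_low, 3), round(p_high, 3))
```

A predicted probability above 0.5 would be read as class 1 (leaving) and below 0.5 as class 0 (staying).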
Ans: The data has the details about employee job, work issues and work environment.
9) Is it structured, semi-structured, quasi-structured or unstructured?
Ans: No
Ans: 0.00223 GB
Ans: 1471
Ans: 35
Ans: (1471x35)
Ans: No
Ans: Some attributes like Job Satisfaction, Job level, Working Hours, Monthly Salary.
24) What are the tools and techniques used to process or analyse the data?
Ans: Correlation heatmap, and the seaborn, matplotlib, NumPy and pandas packages.
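The numbers behind a correlation heatmap are pairwise Pearson correlations. A minimal sketch of that computation, using only the standard library and three hypothetical toy columns (the column values are invented for illustration), might look like this:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def correlation_matrix(columns):
    """Return a nested dict: matrix[a][b] is the correlation of columns a and b."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Hypothetical toy columns, not the real attrition data.
columns = {
    "JobLevel": [1, 2, 3, 4, 5],
    "MonthlySalary": [2000, 3500, 5000, 7200, 9000],
    "JobSatisfaction": [4, 1, 3, 2, 4],
}

matrix = correlation_matrix(columns)
for name, row in matrix.items():
    print(name, [round(value, 2) for value in row.values()])
```

A plotting library would then colour each cell of this matrix to produce the heatmap.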
Ans: The data is clean; there are no inaccurate records or missing values.
Ans: No
Ans: Standard scaling techniques are used to transform the data and other cleaning
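Standard scaling usually means converting each value to a z-score. The sketch below assumes the population standard deviation is used and works on a hypothetical salary column; the function name and the values are illustrative only.

```python
import statistics

def standard_scale(values):
    """Return z-scores: (value - mean) / population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(value - mean) / std for value in values]

# Hypothetical monthly salary values, not taken from the dataset.
monthly_salary = [2000, 3500, 5000, 7200, 9000]
scaled = standard_scale(monthly_salary)
print([round(value, 3) for value in scaled])
```

After scaling, the column has mean 0 and standard deviation 1, so attributes measured in different units become comparable.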
Ans: No
32) Are the data types consistent or not (is the data wholly numeric or a mixture of
alphanumeric strings and text)?
33) Does the data make sense? (For example, if the project involves analyzing income values,
preview the data to confirm that the income values are positive, or whether it is acceptable
to have zeros or negative values.)
Ans: Yes, employee attrition data is defined as 0 or 1 indicating 0 for not leaving the
34) Have you used any visualization tools to examine data quality, such as whether the
data contains many unexpected values or other indicators of dirty data?
Ans: Some methods from the matplotlib package are used to assess the quality of the data.
35) Have you used any open source tool like OpenRefine, which is a popular GUI-based
tool for performing data transformations, or used Python?
Ans: Using Python and Node-RED flows, we have developed the UI.
37) Do you find any relationship between variables using scatter plots?
Ans: Yes, there is a dependency between variables such as Job Satisfaction
and PerformanceRating.
38) Have you explored numerical data using box plot and histograms?
Ans: Yes we have explored the numerical data using histograms to find the cat
39) Do you find any central tendency using mean and median?
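41) What information are you getting from the deviation between mean and median?
Ans: The central tendency, that is, the central value of the data, can be estimated. A large
gap between the mean and the median also suggests a skewed distribution or outliers.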
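A small sketch of comparing mean and median, using a hypothetical salary list with one unusually high value (the numbers are invented for illustration):

```python
import statistics

# Hypothetical monthly salaries; the last value is a deliberate outlier.
monthly_salary = [2000, 2200, 2500, 2700, 3000, 15000]

mean = statistics.mean(monthly_salary)
median = statistics.median(monthly_salary)

# A mean well above the median points to a right-skewed column.
skew_direction = "right" if mean > median else "left or none"
print(round(mean, 2), median, skew_direction)
```

Here the single high value pulls the mean far above the median, which is the kind of deviation the question refers to.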
42) Are you finding data dispersion using variance and standard deviation?
43) Have you applied any dimensionality reduction techniques to improve efficiency of
machine learning algorithms in terms of space and time?
Ans: Some of the fields, like employee name and id, are not needed to train the model, so they are
44) Have you dealt with missing values and outliers to improve quality of the data?
Questions on regression
1) What is regression?
Ans: Regression is a statistical method used in finance, investing, and other disciplines that
attempts to determine the strength and character of the relationship between one
dependent variable (usually denoted by Y) and a series of other variables (known as
independent variables).
Ans: Linear regression, polynomial regression, logistic regression and multiple linear
regression.
Ans: In order to predict the value of the dependent variable for individuals for whom some
Ans: Simple linear regression is a linear regression model with a single explanatory
variable.
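As a minimal sketch of simple linear regression, the least-squares slope and intercept for one explanatory variable can be computed directly. The variable names and the toy data (years of experience against salary in thousands) are assumptions for illustration.

```python
def fit_simple_linear(xs, ys):
    """Least-squares fit of y = slope * x + intercept for one explanatory variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance_x = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance_x
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data that lies exactly on the line y = 5x + 25.
years_experience = [1, 2, 3, 4, 5]
salary_thousands = [30, 35, 40, 45, 50]

slope, intercept = fit_simple_linear(years_experience, salary_thousands)
predicted = slope * 6 + intercept  # prediction for 6 years of experience
print(slope, intercept, predicted)
```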
Ans: Multiple linear regression is a linear regression model with multiple explanatory
variables.
Questions on classification
input and output patterns to the system, whereas unsupervised learning is a self-learning
technique in which the system has to discover the features of the input population
on its own.
Ans: The main difference between them is that the output variable in regression is numerical
Ans: Pruning is a technique in machine learning and search algorithms that reduces the size
of decision trees by removing sections of the tree that provide little power to classify
instances. Pruning reduces the complexity of the final classifier, and hence improves
Ans: The difference between them is that the output variable in regression is numerical (or
Ans: Decision tree learning is one of the predictive modeling approaches used in statistics,
data mining and machine learning. It uses a decision tree to go from observations about
Ans: Information gain (IG) measures how much “information” a feature gives us about the
of examples.
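Information gain is commonly computed as the entropy of the labels minus the weighted entropy of the labels after splitting on a feature. The sketch below follows that definition; the feature names (`overtime`, `department`) and the four-row toy data are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())

def information_gain(feature_values, labels):
    """Entropy of the labels minus the weighted entropy after splitting on the feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(group) / total * entropy(group) for group in groups.values())
    return entropy(labels) - remainder

# Hypothetical toy data: attrition is fully explained by overtime, not by department.
attrition = [1, 1, 0, 0]
overtime = ["yes", "yes", "no", "no"]
department = ["a", "b", "a", "b"]

gain_overtime = information_gain(overtime, attrition)
gain_department = information_gain(department, attrition)
print(gain_overtime, gain_department)
```

A decision tree learner would prefer to split on the feature with the higher gain, here `overtime`.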
Ans: Overfitting is good performance on the training data but poor generalization to other
data. Underfitting is poor performance on the training data and poor generalization to
other data.
Ans: Training set is the one on which we train and fit our model basically to fit the
Ans: A confusion matrix is a table that is often used to describe the performance of a
classification model on a set of test data for which the true values are known.
a false negative is an outcome where the model incorrectly predicts the negative class.
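The four cells of a binary confusion matrix can be counted directly from the true and predicted labels. This is a minimal sketch with hypothetical labels, where 1 is taken to mean the employee left and 0 that the employee stayed.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count true/false positives and negatives for a binary classifier."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for truth, guess in zip(actual, predicted):
        if guess == positive and truth == positive:
            counts["TP"] += 1  # predicted positive, actually positive
        elif guess == positive:
            counts["FP"] += 1  # predicted positive, actually negative
        elif truth == positive:
            counts["FN"] += 1  # predicted negative, actually positive
        else:
            counts["TN"] += 1  # predicted negative, actually negative
    return counts

# Hypothetical test-set labels and model predictions.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

counts = confusion_counts(actual, predicted)
print(counts)  # {'TP': 3, 'FP': 1, 'TN': 3, 'FN': 1}
```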
Ans: Sensitivity is the ability of a test to correctly identify those with the disease (true positive
rate), whereas test Specificity is the ability of the test to correctly identify those without
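Sensitivity and specificity follow directly from confusion matrix counts: sensitivity is TP / (TP + FN) and specificity is TN / (TN + FP). The counts in this sketch are hypothetical.

```python
def sensitivity(true_positives, false_negatives):
    """True positive rate: share of actual positives that were detected."""
    return true_positives / (true_positives + false_negatives)

def specificity(true_negatives, false_positives):
    """True negative rate: share of actual negatives that were correctly rejected."""
    return true_negatives / (true_negatives + false_positives)

# Hypothetical counts from a test set of 100 cases.
tp, fn, tn, fp = 30, 10, 50, 10

sens = sensitivity(tp, fn)
spec = specificity(tn, fp)
print(round(sens, 3), round(spec, 3))
```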
Ans: A receiver operating characteristic curve, or ROC curve, is a graphical plot that
threshold is varied. The ROC curve is created by plotting the true positive rate against
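An ROC curve is conventionally drawn with the false positive rate on the x-axis and the true positive rate on the y-axis, one point per threshold. The sketch below computes those points for hypothetical labels and scores; the function name is illustrative.

```python
def roc_points(actual, scores):
    """Return (false positive rate, true positive rate) pairs, one per threshold."""
    positives = sum(actual)
    negatives = len(actual) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for label, score in zip(actual, scores)
                 if score >= threshold and label == 1)
        fp = sum(1 for label, score in zip(actual, scores)
                 if score >= threshold and label == 0)
        points.append((fp / negatives, tp / positives))
    return points

# Hypothetical true labels and predicted probabilities of the positive class.
actual = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(actual, scores)
for fpr, tpr in points:
    print(round(fpr, 2), round(tpr, 2))
```

Lowering the threshold moves along the curve from (0, 0) towards (1, 1).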
Ans: Bayesian classifiers are statistical classifiers. Bayesian classifiers can predict class
particular class.
Ans: K-Nearest Neighbors (KNN) is one of the simplest algorithms used in machine
learning for regression and classification problems. KNN algorithms use data and classify
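A minimal KNN classifier can be sketched as: measure the distance from the query to every training point, take the k closest, and return the majority label. The two features (job satisfaction, years at company), the toy points and the default k = 3 are assumptions for this example.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical training data: (job satisfaction, years at company) -> attrition.
train_points = [(1, 1), (1, 2), (2, 1), (4, 8), (4, 9), (3, 10)]
train_labels = [1, 1, 1, 0, 0, 0]

likely_leaver = knn_predict(train_points, train_labels, (1.5, 1.5))
likely_stayer = knn_predict(train_points, train_labels, (4, 9.5))
print(likely_leaver, likely_stayer)
```

In practice the features would normally be scaled first, since Euclidean distance is sensitive to the units of each attribute.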
17) How do you select the best classifier for your application?
Ans: If your data is labeled, but you only have a limited amount, you should use