3 authors, including:

Sachin Pawar
Tata Consultancy Services Limited

All content following this page was uploaded by Girish Palshikar on 17 August 2015.
Abstract. Talent Acquisition (TA) is an important function within HR, responsible for recruiting
high-quality people for given job positions through various sources under stringent deadlines and
cost constraints. Given the importance of TA in the overall successful operations and growth of
any organization, in this paper we identify specific “business questions” focused on analyzing
various aspects of the TA processes, and analyze past TA data using statistical techniques to
discover novel patterns, insights and actionable knowledge that can help improve the cost,
efficiency and quality of recruitment. Our predictive analytics mainly concerns the various
durations and delays in TA, candidate selection or rejection, offer acceptance by selected
candidates, and root cause analysis for offer declines. We also use the data-mining technique of
subgroup discovery to identify interesting patterns (e.g., candidate subgroups having unusually
high decline ratios). We illustrate the approaches on a real-life dataset.
1 INTRODUCTION
Quality and productivity of the workforce are very important for people-centric and knowledge-
intensive industries such as IT, BPO and services in general. It is the responsibility of the Talent
Acquisition (TA) function within HR to recruit a workforce of the highest possible quality. The TA
function often works under a highly variable (and often unclear) demand pipeline of future business
requirements. The TA function needs to attract the best possible talent from a complex supply
chain of educational institutes (when experience is not required), job portals, employment agencies,
recruitment consultants, and direct sourcing through buddy referrals, emails, advertisements, walk-ins
and the web. These channels differ in the number and quality of resumes sourced, the time and cost
of sourcing, and the selection and joining ratios of sourced candidates. The recruitments themselves
need to be done under stringent goals, such as the shortest possible time-frames and the lowest
possible recruitment costs and efforts, while working at many locations and dealing with diverse
domains and technical skills. Moreover, a variety of human and economic factors affect recruitments.
Standards such as People CMMI [4] have helped HR management practices to get aligned with
business objectives and to systematize programs of continuous workforce development. The TA
function typically follows a complex business process; Figure 1 shows the major steps in it in a
linear manner for simplicity. In reality, the business process has many variations and special cases
that need to be handled; e.g., candidates may decline or delay joining, or may be found unsuitable
during induction. A separate business process is required for internal recruitment within the
organization.
Given the importance of TA in the overall successful operations and growth of any organization,
it is clearly useful to analyze past TA data using suitable statistical analysis techniques to discover
novel patterns/insights and actionable knowledge which can help in improving the cost, efficiency
and quality of recruitments. Such work is part of the general task of workforce analytics (i.e.,
statistical analysis, modeling and mining of HR data), which is gaining importance [5], [12], [18],
[9], [7], [15], [13], [14]. In this paper, we discuss a system called iTAG that we are building,
which uses domain-driven analytics techniques to answer specific business questions in TA.
Figure 1. Major steps in the TA business process: demand planning, channel activation, resume
sourcing, candidate short-listing, joining, placement, and induction.
This paper is organized as follows. Section 2 outlines the business questions that are important in
analyzing TA processes. Section 3 describes a real-life TA dataset that we have used as a case-
study in this paper. Sections 4 to 6 discuss analytics for answering specific TA related business
questions outlined in Section 2. Section 7 presents conclusions and future work.
2 BUSINESS QUESTIONS

In this paper, we focus on domain-driven data-mining of TA data, where the goal is to answer
specific business questions related to cost, efficiency and quality issues in TA business
processes. Some examples are given below; we outline analytics-based approaches to answering
several of them.
1. What are the most difficult (in terms of cost or time) job requirements to fulfill?
2. What is the most typical (frequent) candidate profile selected (or rejected) for a particular job
requirement?
3. What are the “green flags” for a given job requirement? For example, people with Sun Java
Certification and (Tier 1 college education or Tier 1 company experience) may have a much
higher chance of getting selected than other candidates for a particular job requirement. A
similar question can be asked about “red flags” for a given job requirement.
4. What are the major differences between selected and rejected candidates for a particular
(given) job requirement?
5. What are the major root causes for rejecting candidates for a given job requirement?
6. What are the major root causes for candidates declining the offer for a given job requirement?
7. What campuses are “good” i.e., provide high quality candidates in large numbers with high hit
ratios?
8. What supply channels are “good” i.e., provide high quality candidates in large numbers with
high hit ratios?
9. What is an optimal sourcing plan (in terms of cost or time to join) for a given set of job
requirements? The sourcing plan should partition the job requirements among various supply
channels.
10. What are the major cost heads for TA? How can the cost of recruitment be brought down by,
say, 5%?
11. What are the bottlenecks (in terms of time or quality) in the TA process for sourcing? For
short-listing? For selection? For offers? For joining?
12. Can we predict whether a selected person will join or not?
13. Can we predict whether a candidate called for interview will be selected or not?
3 A REAL-LIFE DATASET
In this paper we present a case-study using a real-life TA dataset. The dataset covers a 1-year
period (Jan. to Dec. 2010) in a BPO organization and contains 26574 records divided into two
tables. The SELECTED table consists of 12185 candidates who were selected, of whom 1148
(9.4%) declined and 11037 (90.6%) actually joined. The REJECTED table consists of 14385
candidates who were interviewed but rejected during the recruitment process. The two tables have
some common attributes (columns) (Table 1). The given data had some quality issues, which we
cleaned using pre-processing steps.
Table 1. Some columns in SELECTED and REJECTED.
Column Name Example Values / Description
CID Candidate ID
GENDER M, F
AGE
TYPE frontline, support, trainee, leadership
DEGREE BA, BCom, BSc, MBA, BPharm, BE, BCA
SPECIALIZATION Electronics, Physics, Marketing, English
INDUSTRY market research, banking, pharma, IT
CLIENT SuperValue, Lufthansa, Honeywell, ABB
GRADE 1, 2, …, 8
TOTAL_EXP Total experience in years
CURRENT_SALARY
LAST_ORG Last employer
INTERVIEW_DATE
*OFFER_DATE
*JOINING_DATE
*DECLINE_DATE
+REJECT_REASON process, subject, communication, attitude
*: columns only in SELECTED; +: columns only in REJECTED; unmarked columns are present in both SELECTED and REJECTED.
Figure 2. (a) Histogram of joining intervals (b) Joining intervals across values of INDUSTRY attribute.
We built a regression model for joining interval as the dependent variable. The regressor variables
were NOTICE_PERIOD and GRADE. We selected a subset of 767 selected and joined candidates to build
the model. The fitted model’s regression coefficients are:
β̂0 = 4.22787, β̂1 = 0.709673, β̂2 = 4.584716 (for the intercept, NOTICE_PERIOD and GRADE respectively)
For this model, R2 = 0.35, which is not very high. The global F-test indicates that at least one of
these regressors is significant and removing either one of them significantly reduces the fit (R2).
Adding other variables (e.g., AGE, TOTAL_EXP, CURRENT_SALARY) does not improve the fit (R2) of the
regression model; in fact, partial F-tests indicate that the added variables are not significant
in the presence of the above two regressors.
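As a rough illustration of the kind of model fitted above, the sketch below fits a two-regressor OLS model by solving the normal equations in pure Python. The data and numbers are synthetic, not the paper's fitted values; in practice one would use a statistics package.

```python
# Minimal OLS sketch for a model of the form
#   joining_interval = b0 + b1*NOTICE_PERIOD + b2*GRADE
# fitted by solving the 3x3 normal equations (X'X) b = X'y.

def fit_ols(rows):
    """rows: iterable of (x1, x2, y). Returns [b0, b1, b2]."""
    # Accumulate X'X and X'y.
    A = [[0.0] * 3 for _ in range(3)]
    b = [0.0] * 3
    for x1, x2, y in rows:
        x = (1.0, x1, x2)
        for i in range(3):
            b[i] += x[i] * y
            for j in range(3):
                A[i][j] += x[i] * x[j]
    # Gaussian elimination with partial pivoting.
    for col in range(3):
        piv = max(range(col, 3), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, 3):
            f = A[r][col] / A[col][col]
            for j in range(col, 3):
                A[r][j] -= f * A[col][j]
            b[r] -= f * b[col]
    # Back-substitution.
    beta = [0.0] * 3
    for i in (2, 1, 0):
        s = sum(A[i][j] * beta[j] for j in range(i + 1, 3))
        beta[i] = (b[i] - s) / A[i][i]
    return beta

# Synthetic candidates: (notice_period_days, grade, joining_interval_days).
data = [(30, 2, 40), (60, 3, 70), (0, 1, 8), (90, 4, 100),
        (30, 1, 35), (45, 2, 55), (15, 3, 30), (75, 2, 85)]
b0, b1, b2 = fit_ols(data)
```

Global and partial F-tests (used above to assess the regressors) would be computed from the residual sums of squares of the full and reduced models.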
The next important question is whether there are any “patterns” among the joining intervals. In
particular, are there any (sufficiently large) subsets of selected candidates (who share some
characteristics) who have suffered “unusually high” joining intervals? We use a data-mining
technique called subgroup discovery to identify such “interesting” subgroups. A subgroup is
characterized by a selector, constructed as a conjunction (AND) of attribute-value pairs. Such a
subgroup is interesting if its joining-interval values are “significantly higher” than those of the
rest of the selected candidates, as determined by Student’s t-test. Our subgroup discovery
algorithm [11] systematically explores (using the beam search technique) the space of all possible
selectors and reports those which are “interesting”. Table 2 shows several such interesting subsets
which have unusually high joining intervals.
The use of a systematic, statistically rigorous subgroup-discovery method facilitates deep
exploration of subgroups (e.g., using 5 or more attributes), which is impossible to do manually due
to the exponentially large number of possibilities. Similar analysis for other intervals (and
delays) also discovers subgroups of significant size with very high values. Such insights can help
the user identify likely sources of long intervals and delays and plan strategies to deal with them
accordingly during recruitment. Efforts to reduce the occurrence of long intervals and high delays
(e.g., root cause analysis and improvement plans) can now be focused on specific subgroups, which
act as bottlenecks in the TA process.
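The core interestingness test described above, comparing a subgroup's joining intervals against those of the remaining candidates via a t-test, can be sketched as follows. The attribute names and records are invented for illustration, and a full implementation would add the beam search over selectors described in [11].

```python
import statistics

def welch_t(group, rest):
    """Welch two-sample t statistic: large positive values mean the
    subgroup's values are unusually high relative to the rest."""
    m1, m2 = statistics.mean(group), statistics.mean(rest)
    v1, v2 = statistics.variance(group), statistics.variance(rest)
    return (m1 - m2) / ((v1 / len(group) + v2 / len(rest)) ** 0.5)

def score_selector(records, selector, value_key="joining_interval"):
    """A selector is a conjunction of attribute=value tests (a dict).
    Returns the t score of the subgroup it defines."""
    match = lambda r: all(r.get(a) == v for a, v in selector.items())
    inside = [r[value_key] for r in records if match(r)]
    outside = [r[value_key] for r in records if not match(r)]
    if len(inside) < 2 or len(outside) < 2:
        return float("-inf")  # subgroup too small to test
    return welch_t(inside, outside)

# Toy records with hypothetical attribute values.
recs = ([{"INDUSTRY": "IT", "GRADE": 3, "joining_interval": 60 + i} for i in range(5)] +
        [{"INDUSTRY": "banking", "GRADE": 2, "joining_interval": 20 + i} for i in range(5)])
t = score_selector(recs, {"INDUSTRY": "IT"})
```

A beam search would start from single-attribute selectors, keep the top-scoring ones, and repeatedly extend them with one more attribute-value pair.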
4.2 Delays
Other than various intervals discussed earlier, the TA process can also be measured against various
delays that may occur. A delay happens when an event does not take place by the expected date.
An important delay (from the perspective of TA process efficiency) is the joining delay, defined
as the difference between the actual day of joining and the expected date of joining (as agreed by
the selected candidate). Note that the joining delay value can be negative, if a candidate joins
before the expected date. We are mainly interested in characterizing positive joining delays, where
the candidate joins later than the agreed date. Other kinds of delays that happen in other stages of
the TA business processes (e.g., for interview, offer roll-out, medical examination) can be similarly
defined and analyzed.
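Computing a delay as defined above is a direct date subtraction; the sketch below shows the joining-delay case, with negative values for candidates who join early (the field names here are illustrative, not the dataset's column names).

```python
from datetime import date

def joining_delay(expected: date, actual: date) -> int:
    """Joining delay in days: positive if the candidate joined after the
    agreed date, negative if earlier, zero if exactly on time."""
    return (actual - expected).days

# A candidate who joined three days late, and one who joined a week early:
late = joining_delay(date(2010, 6, 1), date(2010, 6, 4))    # 3
early = joining_delay(date(2010, 6, 1), date(2010, 5, 25))  # -7
```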
We selected a subset of 3498 selected candidates, out of which 192 (5.5%) had delayed their
joining (i.e., they joined after the expected joining date) and the remaining 3306 (94.5%) were not
delayed (i.e., they joined on or before the expected date). Summary statistics for the joining delays
(in days) of these candidates are: Min: −31, Max: 98, Average: 0.75, STDEV: 5.6, Q1: 0, Q2: 0, Q3: 0.
Clearly, most people join before or on the expected date of joining. Fig. 3(a) shows a histogram of
the joining delays. Figure 3(b) shows the variation of joining delays for different values of
INDUSTRY attribute.
Figure 3. (a) Histogram of joining delays (b) Joining delay across values of INDUSTRY attribute.
As a first analysis, we built a multiple regression model for predicting joining delays, with AGE,
CURRENT_SALARY, TOTAL_EXP, GRADE and INDUSTRY as regressors. The R2 value was extremely
low (0.04), and partial F-tests indicated that none of the regression coefficients were
significant. Since the regression models were poor, we tried classification-based predictive
models. We discretized the joining-delay values into a binary class label: DELAYED (if joining
delay > 0) and NOT_DELAYED (otherwise). We then used the well-known WEKA tool
(http://www.cs.waikato.ac.nz/ml/weka/) to build predictive models for the joining-delay class
label, using standard classification techniques such as Decision Tree, Support Vector Machines
(SVM), and Naïve Bayes. Table 3 shows the results of 5-fold cross-validation (target class =
DELAYED); we do not report results for Decision Tree, whose accuracy was poor. Note the class
imbalance: there are very few examples of class DELAYED. The overall prediction accuracy is good,
but we are really interested in the accuracy for the target class DELAYED (Table 3), which is
quite poor; this indicates that the data does not capture many of the attributes that truly
influence why candidates delay their joining date.
Table 3. Accuracy of predictive models for joining (for the target class DELAYED).
Classifier Precision Recall F-measure
Naïve-Bayes 0.279 0.328 0.301
SVM (RBF) with Normalized attributes, weight = 0.1, 1.0 0.277 0.677 0.393
Random Forest 0.244 0.161 0.194
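The F-measure column in Table 3 is the harmonic mean of precision and recall; the small sketch below recomputes it from the reported precision/recall values, matching the table up to rounding.

```python
def f_measure(precision: float, recall: float) -> float:
    """F1 score: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Precision/recall for the DELAYED class, taken from Table 3:
nb  = f_measure(0.279, 0.328)  # ~0.301 (Naive Bayes)
svm = f_measure(0.277, 0.677)  # ~0.393 (SVM with RBF kernel)
rf  = f_measure(0.244, 0.161)  # ~0.194 (Random Forest)
```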
The next important question is whether there are any “patterns” among the delays. In particular,
are there any (sufficiently large) subsets of selected candidates (who share some characteristics)
who have suffered “unusually high” joining delays? Our subgroup discovery algorithm [11]
discovered various such interesting subsets (Table 4).
5 ANALYSIS OF TA EFFICIENCY
Selection ratio and join ratio are two important KPIs for assessing the efficiency of the TA
business process (among others). Both directly affect the total cost of TA: if the selection ratio
is very low, TA has to call many more candidates for interview than required, and if the join ratio
is very low, all the steps in the TA process have to be repeated. In addition to a higher cost of
acquisition, a low join ratio may cause direct loss of revenue because of placement delays on
important client assignments.
Considering the importance of selection ratio and join ratio, two natural questions are:
1) Can we predict, by looking at the historical data, what kinds of candidates are likely to be
selected?
2) Can we predict, by looking at the historical data, what kinds of candidates are likely to decline?
For example, knowing a candidate’s qualification, experience, current salary, salary offered, and
grade offered, can we predict whether the selected candidate is likely to accept or decline the
offer? Even if we cannot make this prediction with 100% confidence, can we at least estimate the
probability that the candidate will decline the offer? Similar predictions are needed for
candidates who are likely to be selected.
Among the four models we used for predicting ACCEPT/DECLINE, the decision tree outperforms the
other classifiers: it identifies approximately 64% of the target-class cases (recall), and 73% of
its predictions are correct (precision). An additional advantage of decision trees is that the
decision rules can be represented graphically and hence are easy for end-users to understand. Some
examples of decision rules extracted from the discovered decision trees are shown in Figure 4. The
Random Forest model is better for SELECT/REJECT prediction.
IF SOURCE = Buddy
| AND STREAM = COMPUTER/IT
| | AND EXP_BAND = 3-5 years
| | | AND TYPE = Frontline
| | | | AND AGE <= 28
THEN class = SELECT (with 90% confidence)
IF DIVISION = 1.1
| AND HIRING_QTR = Q1
| | AND INDUSTRY = IT
| | | AND SAL_ASPER_GRADE = Above Average
| | | | AND GENDER = M
| | | | | AND HIGHEST_QLFN = Under Graduate
| | | | | | AND TCS_SAL <= 281129
THEN class = DECLINE (with 70% confidence)
Figure 4. Some examples of discovered classification rules.
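Rules such as those in Figure 4 are plain conjunctions over candidate attributes and translate directly into predicates; the sketch below encodes the first rule (the 90% confidence is a property of the mined rule itself, not something this code computes).

```python
def rule_select(c: dict) -> bool:
    """First rule of Figure 4 as a conjunctive predicate: True means the
    rule fires, i.e. it predicts class SELECT (with 90% confidence per
    the mined rule)."""
    return (c.get("SOURCE") == "Buddy"
            and c.get("STREAM") == "COMPUTER/IT"
            and c.get("EXP_BAND") == "3-5 years"
            and c.get("TYPE") == "Frontline"
            and c.get("AGE", 999) <= 28)

# A hypothetical candidate profile matching every condition:
candidate = {"SOURCE": "Buddy", "STREAM": "COMPUTER/IT",
             "EXP_BAND": "3-5 years", "TYPE": "Frontline", "AGE": 26}
```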
We build the DECLINE root cause estimation model in the form of a Bayesian Network (BN) [16].
We adopted a 2-step approach for root-cause analysis of DECLINE events.
1. Given a historical training dataset of past DECLINE decisions (along with the known root
cause for each, as given by the attribute DECLINE_REASON), a BN discovery algorithm [3]
(https://dslpitt.org/genie/) is used to automatically identify a BN from the data. During this
step, dependencies between data attributes are learned, along with their conditional probability
tables.
2. The discovered BN is used to estimate (using an inference algorithm) the likely root cause for
any candidate who has declined, by using his/her profile as evidence. Essentially, given
candidate profile and DECLINE = yes, the only unknown RV is DECLINE_REASON, for
which the most likely value is estimated using an inference algorithm.
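Step 2 can be sketched as exact inference by enumeration on a toy network; the structure, variables and all probabilities below are invented for illustration and are not the discovered network of Figure 5.

```python
# Toy network: DECLINE_REASON -> {SAL_ASPER_GRADE, INDUSTRY}.
# All reason names and probabilities are invented for illustration.
prior = {"better_offer": 0.5, "location": 0.3, "personal": 0.2}
p_sal = {  # P(SAL_ASPER_GRADE = value | reason)
    "better_offer": {"Below Average": 0.7, "Above Average": 0.3},
    "location":     {"Below Average": 0.4, "Above Average": 0.6},
    "personal":     {"Below Average": 0.5, "Above Average": 0.5},
}
p_ind = {  # P(INDUSTRY = value | reason)
    "better_offer": {"IT": 0.8, "banking": 0.2},
    "location":     {"IT": 0.5, "banking": 0.5},
    "personal":     {"IT": 0.4, "banking": 0.6},
}

def likely_reason(sal: str, industry: str) -> str:
    """Most likely DECLINE_REASON given the evidence, by enumeration:
    argmax over reasons of P(reason) * P(sal|reason) * P(industry|reason)."""
    scores = {r: prior[r] * p_sal[r][sal] * p_ind[r][industry] for r in prior}
    return max(scores, key=scores.get)

reason = likely_reason("Below Average", "IT")
```

In the real system the network structure and conditional probability tables come from step 1, and inference is done by the BN tool rather than hand-coded enumeration.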
Figure 5 shows the discovered Bayesian Net and an example of inference of the likely decline
reason for a particular candidate.
Figure 5. Discovered Bayesian Net and an example of inference of the likely decline reason for a candidate.
Acknowledgements. We thank Dr. Harrick Vin, Dr. Ritu Anand, and various people from TCS
HR Department for their extensive help during the course of this work.
REFERENCES
[1] Bondarouk, T., Ruël, H., Guiderdoni-Jourdain, K. and Oiry, E. Handbook of Research on E-Transformation and
Human Resources Management Technologies—Organizational Outcomes and Challenges. IGI Global, 2009.
[2] Budhwar P.S., Varma A., Singh V. and Dhar R. HRM systems of Indian call centres: an exploratory study, Int.
Journal of Human Resource Management, 17(5), 2006.
[3] Cooper, G.F. and Herskovits E. A Bayesian method for the induction of probabilistic networks from data.
Machine Learning, 9, pp. 309--347, 1992.
[4] Curtis, W., Hefley, W.E. and Miller, S.A. The People CMM: A Framework for Human Capital Management.
2/e, Software Engineering Institute, 2009.
[5] Connors, D. and Mojsilovic, A. Workforce Analytics for the Enterprise: An IBM Approach. Chapter in Service
Science Handbook, Maglio, P.P., Kieliszewski, C.A. and Spohrer J.C. (ed.s), Springer, 2010.
[6] Harding, J.A., Shahbaz, M., Srinivas and Kusiak, A. Data mining in Manufacturing: A Review. Journal of
Manufacturing Science and Engineering, 128, pp. 969–976, 2006.
[7] Hu, J., Lu, Y., Mojsilovic, A., Singh, M. and Squillante M. Next generation workforce management analytics
for the globally integrated enterprise. Proc. Institute for Operations Research and the Management Sciences
(INFORMS) Annual Meeting, Washington, DC, October 2008.
[8] Hülsheger, U.R., Maier, G.W. and Stumpp, T. Validity of general mental ability for the prediction of job
performance and training success in Germany: a meta-analysis. Int. Journal of Selection and Assessment, 15: 1,
3–18, 2007.
[9] Lu, Y., Radovanovic, A. and Squillante, M. Workforce Management in Service via Stochastic Network Models.
Proc. 2006 IEEE/INFORMS Int. Conf. on Service Operations, Logistics and Informatics (SOLI 2006), 2006.
[10] Murray, M. and Young, J. Decision Model for Contracting Helpdesk Services. Journal of Service Science,
1(1), 2008 (www.cluteinstitute-onlinejournals.com).
[11] Natu M. and Palshikar G.K. Interesting Subset Discovery and its Application on Service Processes, Workshop
on Data Mining for Services (DMS 2010), Int. Conference on Data Mining (ICDM 2010), Australia, 2010.
[12] Naveh, Y., Richter, Y., Altshuler, Y., Gresh, D. L. and Connors, D. P. Workforce optimization: identification
and assignment of professional workers using constraint programming. IBM Journal of Research and
Development, 51, 2007.
[13] Palshikar, G.K., Deshpande, S., Bhat, S. QUEST: Discovering Insights from Survey Responses. Proc. 8th
Australasian Data Mining Conf. (AusDM09), Dec. 1-4, 2009, Melbourne, Australia, P.J. Kennedy, K.-L. Ong,
P. Christen (Ed.s), CRPIT, vol. 101, published by Australian Computer Society, pp. 83 - 92, 2009.
[14] Palshikar, G.K., Vin H.M., Vijaya Saradhi V. and Mudassar M. Discovering Experts, Experienced Persons and
Specialists for IT Infrastructure Support, Service Science, Vol. 3, No. 1, pp. 1 - 21, Spring 2011.
[15] Patterson, B. Mining the gold: gain competitive advantage through HR data analysis. HR Magazine, 2003.
[16] Pearl, J. Probabilistic reasoning in intelligent systems: networks of plausible inference. Morgan Kaufmann,
1988.
[17] Phua C., Lee, V., Smith-Miles, K. and Gayler, R. A Comprehensive Survey of Data Mining-based Fraud
Detection Research. Artificial Intelligence Review, 2005.
[18] Richter, Y., Naveh, Y., Gresh, D.L. and Connors, D.P. Optimatch: Applying Constraint Programming to
Workforce Management of Highly-skilled Employees. Proc. 2007 IEEE/INFORMS Int. Conf. on Service
Operations, Logistics and Informatics (SOLI 2007), 2007.
[19] Rousseau, D.M. and Barends, E.G.R. Becoming an evidence-based HR practitioner. Human Resource
Management Journal, 21(3), 221–235, 2011.
[20] Strohmeier, S. Research in e-HRM: review and implications. Human Resource Management Review, 17(1), pp.
19-37, 2007.
[21] Van De Voorde, K., Paauwe, J., Van Veldhoven, M. Predicting business unit performance using employee
surveys: monitoring HRM-related changes. Human Resource Management Journal, 20(1), pp. 44–63, 2010.
[22] Yu, P. S. (ed.). Proc. 2007 Int. Workshop on Domain driven Data Mining. ACM Press, 2007.
[23] Yu, P. S. (ed.). Proc. 2008 Int. Workshop on Domain driven Data Mining. ACM Press, 2008.