
Statistical Process Control (SPC)

A Metrics-Based Point of View of Software Processes

Achieving the CMMI Level Four

Reiner Dumke, Isabelle Côté, Olga Andruschak

Otto-von-Guericke-Universität Magdeburg, Institut für Verteilte Systeme,


1 The CMMI Approach

1.1 Basic Intentions of the CMMI
1.2 The CMMI Levels
1.3 The CMMI Metrication

2 Software Measurement Intentions

2.1 The CAME Measurement Framework
2.2 The CMMI Metrics Set by Kulpa and Johnson
2.3 The CMMI-Based Organization’s Measurement Repository

3 The Statistical Software Process (SPC)

3.1 Foundations of the SPC
3.2 Empirical Strategies
3.3 Testing Methods
3.4 Methods of Data Analysis

4 SPC and CMMI

4.1 Basics of Quantified Process Management
4.2 Controlling the Process Improvement

5 References

This preprint presents a new way of integrating the statistically based analysis of the software process (SPC)
into the assessment and improvement activities of the Capability Maturity Model Integration (CMMI) initiative.
Starting from basic statistical methods and the foundations of software experimentation, we describe a
structured approach to the metrication of the different stages of the CMMI. Further, this preprint shows
appropriate methods of statistical analysis for improving the software process areas and activities towards a
quantitatively managed process level, based on the metrics set defined by Kulpa and Johnson.

1 The CMMI Approach
1.1 Basic Intentions of the CMMI
CMMI stands for Capability Maturity Model Integration. It is an initiative to change the general intention
from an assessment view based on the “classical” CMM or ISO 9000 to an improvement view that integrates the
System Engineering CMM (SE-CMM), the Software Acquisition Capability Maturity Model (SA-CMM), the Integrated
Product Development Team Model (IPD-CMM), the System Engineering Capability Assessment Model
(SECAM), the Systems Engineering Capability Model (SECM), and basic ideas of the new versions of ISO
9001 and ISO 15504. The following semantic network shows, without further comment, some classical approaches
to software process evaluation [Ferguson 1998].


[Figure: semantic network relating CMMI to, among others, People CMM, SA-CMM, SW-CMM, SE-CMM, SCE, Baldrige,
ISO 15504 (SPICE), Trillium, TickIT, ISO 9000, Q9000, ISO 10011, ISO 15288, IEEE Stds. 730, 828, 829, 830,
1012, 1016, 1028, 1058, IEEE 1074, IEEE 1220, IEEE/EIA 12207, EIA/IS 731, EIA/IS 632, EIA 632, AQAP 1, 4, 9,
DOD-STD-7935A, MIL-STD-498, MIL-STD-499B, DO-178B, and the AF IPD Guide]

Figure 1: Dependencies of software process evaluation methods and standards

The CMMI is structured in the five maturity levels, the considered process areas, the specific goals (SG) and
generic goals (GG), the common features and the specific practices (SP) and generic practices (GP). The process
areas are defined as follows [Kulpa 2003]:

“The Process Area is a group of practices or activities performed collectively to achieve a specific
objective.”

Such objectives could be requirements management at maturity level 2, requirements development at
maturity level 3, or quantitative project management at maturity level 4. The difference between “specific” and
“generic” goals and practices lies in their scope: the specific ones address the particular aspects of one
process area, whereas the generic ones apply to the IT- or company-wide analysis or improvement. There are four
common features:

The commitment to perform (CO)

The ability to perform (AB)
The directing implementation (DI)
The verifying implementation (VE).

The CO is shown through senior management commitment, the AB is shown through the training of personnel, the
DI is demonstrated by managing configurations, and the VE is demonstrated by objectively evaluating
adherence and by reviewing status with higher-level management.

Figure 2 shows the general relationships between the different components of the CMMI.

[Figure: Process Area 1, Process Area 2, ..., Process Area n; Specific Goals and Generic Goals; Specific
Practices and Generic Practices; Capability Levels]

Figure 2: The CMMI model components

The CMMI gives us some guidance as to what is a required component, an expected component, and simply

1.2 CMMI Levels

There are six capability levels (but five maturity levels), designated by the numbers 0 through 5 [SEI 2002],
including the following process areas:

0. Incomplete: -

1. Performed: best practices;

2. Managed: requirements management, project planning, project monitoring and control, supplier
agreement management, measurement and analysis, process and product quality assurance;

3. Defined: requirements development, technical solution, product integration, verification,
validation, organizational process focus, organizational process definition, organizational training,
integrated project management, risk management, integrated teaming, integrated supplier
management, decision analysis and resolution, organizational environment for integration;

4. Quantitatively Managed: organizational process performance, quantitative project management;

5. Optimizing: organizational innovation and deployment, causal analysis and resolution.

Kulpa and Johnson consider the following specific goals and practices for achieving the different maturity
levels with respect to quantification [Kulpa 2003]:

Level 2: Measurement and Analysis:

The purpose of Measurement and Analysis is to develop and sustain a measurement capability that is used to
support management information needs. Specific Practices by Specific Goal:

SG1 Align Measurement and Analysis Activities: Measurement objectives and activities are aligned
with identified information needs and objectives.
SP1.1 Establish Measurement Objectives: Establish and maintain measurement objectives that are
derived from identified information needs and objectives.
SP1.2 Specify Measures: Specify measures to address the measurement objectives.

SP1.3 Specify Data Collection and Storage Procedures: Specify how measurement data will be
obtained and stored.
SP1.4 Specify Analysis Procedures: Specify how measurement data will be analyzed and reported.

SG2 Provide Measurement Results: Measurement results that address identified information needs
and objectives are provided.
SP2.1 Collect Measurement Data: Obtain specified measurement data.
SP2.2 Analyze Measurement Data: Analyze and interpret measurement data.
SP2.3 Store Data and Results: Manage and store measurement data, measurement specifications, and
analysis results.
SP2.4 Communicate Results: Report results of measurement and analysis activities to all relevant
stakeholders.

Level 2: Process and Product Quality Assurance:

Specific Practices by Specific Goal:

SG1 Objectively Evaluate Processes and Work Products: Adherence of the performed process and
associated work products and services to applicable process descriptions, standards, and procedures is
objectively evaluated.
SP1.1 Objectively Evaluate Processes: Objectively evaluate the designated performed processes
against the applicable process descriptions, standards, and procedures.
SP1.2 Objectively Evaluate Work Products and Services: Objectively evaluate the designated work
products and services against the applicable process descriptions, standards, and procedures.

SG2 Provide Objective Insight: Noncompliance issues are objectively tracked and communicated,
and resolution is ensured.
SP2.1 Communicate and Ensure Resolution of Noncompliance Issues: Communicate quality issues
and ensure resolution of noncompliance issues with the staff and managers.
SP2.2 Establish Records: Establish and maintain records of the quality assurance activities.

Level 3: Verification:

The purpose of Verification is to ensure that selected work products meet their specified requirements.
Specific Practices by Specific Goal:

SG1 Prepare for Verification: Preparation for verification is conducted.

SP1.1 Select Work Products for Verification: Select the work products to be verified and the
verification methods that will be used for each.
SP1.2 Establish the Verification Environment: Establish and maintain the environment needed to
support verification.
SP1.3 Establish Verification Procedures and Criteria: Establish and maintain verification procedures
and criteria for the selected work products.

SG2 Perform Peer Reviews: Peer reviews are performed on selected work products.
SP2.1 Prepare for Peer Reviews: Prepare for peer reviews of selected work products.
SP2.2 Conduct Peer Reviews: Conduct peer reviews on selected work products and identify issues
resulting from the peer review.
SP2.3 Analyze Peer Review Data: Analyze data about preparation, conduct, and results of the peer
reviews.

SG3 Verify Selected Work Products: Selected work products are verified against their specified
requirements.
SP3.1 Perform Verification: Perform verification on the selected work products.
SP3.2 Analyze Verification Results and Identify Corrective Action: Analyze the results of all
verification activities and identify corrective action.

Level 3: Validation:

The purpose of Validation is to demonstrate that a product or product component fulfills its intended use when
placed in its intended environment. Specific Practices by Specific Goal:

SG1 Prepare for Validation: Preparation for validation is conducted.

SP1.1 Select Products for Validation: Select products and product components to be validated and
the validation methods that will be used for each.
SP1.2 Establish the Validation Environment: Establish and maintain the environment needed to
support validation.
SP1.3 Establish Validation Procedures and Criteria: Establish and maintain procedures and criteria
for validation.

SG2 Validate Product or Product Components: The product or product components are validated to
ensure that they are suitable for use in their intended operating environment.
SP2.1 Perform Validation: Perform validation on the selected products and product components.
SP2.2 Analyze Validation Results: Analyze the results of the validation activities and identify issues.

Level 3: Decision Analysis and Resolution:

The purpose of Decision Analysis and Resolution is to analyze possible decisions using a formal evaluation
process that evaluates identified alternatives against established criteria. Specific Practices by Specific Goal:

SG1 Evaluate Alternatives: Decisions are based on an evaluation of alternatives using established
criteria.
SP1.1 Establish Guidelines for Decision Analysis: Establish and maintain guidelines to determine
which issues are subject to a formal evaluation process.
SP1.2 Establish Evaluation Criteria: Establish and maintain the criteria for evaluating alternatives,
and the relative ranking of these criteria.
SP1.3 Identify Alternative Solutions: Identify alternative solutions to address issues.
SP1.4 Select Evaluation Methods: Select the evaluation methods.
SP1.5 Evaluate Alternatives: Evaluate alternative solutions using the established criteria and
methods.
SP1.6 Select Solutions: Select solutions from the alternatives based on the evaluation criteria.

Level 4: Quantitative Project Management:

The purpose of the Quantitative Project Management process area is to quantitatively manage the project’s
defined process to achieve the project’s established quality and process-performance objectives. Specific
Practices by Specific Goal:

SG1 Quantitatively Manage the Project: The project is quantitatively managed using quality and
process-performance objectives.
SP1.1 Establish the Project’s Objectives: Establish and maintain the project’s quality and
process-performance objectives.
SP1.2 Compose the Defined Process: Select the subprocesses that compose the project’s defined
process, based on historical stability and capability data.
SP1.3 Select the Subprocesses that Will Be Statistically Managed: Select the subprocesses of the
project’s defined process that will be statistically managed.
SP1.4 Manage Project Performance: Monitor the project to determine whether the project’s
objectives for quality and process performance will be satisfied, and identify corrective action as
appropriate.

SG2 Statistically Manage Subprocess Performance: The performance of selected subprocesses
within the project’s defined process is statistically managed.
SP2.1 Select Measures and Analytic Techniques: Select the measures and analytic techniques to be
used in statistically managing the selected subprocesses.
SP2.2 Apply Statistical Methods to Understand Variation: Establish and maintain an understanding
of the variation of the selected subprocesses using the selected measures and analytic techniques.
SP2.3 Monitor Performance of the Selected Subprocesses: Monitor the performance of the selected
subprocesses to determine their capability to satisfy their quality and process-performance objectives,
and identify corrective action as necessary.
SP2.4 Record Statistical Management Data: Record statistical and quality management data in the
organization’s measurement repository.

Level 5: Causal Analysis and Resolution:

The purpose of Causal Analysis and Resolution is to identify causes of defects and other problems and take
action to prevent them from occurring in the future. Specific Practices by Specific Goal:

SG1 Determine Causes of Defects: Root causes of defects and other problems are systematically
determined.
SP1.1 Select Defect Data for Analysis: Select the defects and other problems for analysis.
SP1.2 Analyze Causes: Perform causal analysis of selected defects and other problems and propose
actions to address them.

SG2 Address Causes of Defects: Root causes of defects and other problems are systematically
addressed to prevent their future occurrence.
SP2.1 Implement the Action Proposals: Implement the selected action proposals that were developed
in causal analysis.
SP2.2 Evaluate the Effect of Changes: Evaluate the effect of changes on process performance.
SP2.3 Record Data: Record causal analysis and resolution data for use across the project and
organization.

In addressing the basics of project management, the CMMI considers the following components for the
management of IT processes [SEI 2002]:

[Figure: data flows between Quantitative Project Management (QPM), the IPPD-related planning and integrated
team activities, risk management, and ISM, together with the Process Management, Engineering and Support, and
Basic Project Management process areas. Exchanged items include quantitative objectives, subprocesses to
statistically manage, statistical management data, process performance objectives, baselines and models, the
organization’s standard processes and supporting assets, lessons learned, planning and performance data, the
project’s defined process, identified risks, risk status, risk mitigation plans, corrective action, and
monitoring data from suppliers.]

Figure 3: The CMMI project management process areas

Here, QPM stands for Quantitative Project Management, IPM for Integrated Project Management, IPPD for
Integrated Product and Process Development, RSKM for Risk Management, and ISM for Integrated Supplier
Management.

1.3 CMMI Metrication

In order to manage the software process quantitatively, the CMMI gives a set of example metrics. Some of
the relevant software measurement intentions are [SEI 2002]:

Examples of quality and process performance attributes for which needs and priorities might be
identified include the following:
o Functionality
o Reliability
o Maintainability
o Usability
o Duration
o Predictability
o Timeliness
o Accuracy

Examples of quality attributes for which objectives might be written include the following:
o Mean time between failures
o Critical resource utilization
o Number and severity of defects in the released product
o Number and severity of customer complaints concerning the provided service

Examples of process performance attributes for which objectives might be written include the following:
o Percentage of defects removed by product verification activities (perhaps by type of
verification, such as peer reviews and testing)
o Defect escape rates
o Number and density of defects (by severity) found during the first year following product
delivery (or start of service)
o Cycle time
o Percentage of rework time
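
Two of these process performance attributes can be illustrated with a minimal sketch. It assumes that defect counts are available for the period before and after release; the function names and the exact definition of the escape rate are illustrative assumptions, not taken from the CMMI.

```python
def defect_removal_percentage(found_before_release: int, found_after_release: int) -> float:
    """Percentage of all known defects that were removed before release."""
    total = found_before_release + found_after_release
    if total == 0:
        raise ValueError("no defects recorded")
    return 100.0 * found_before_release / total


def defect_escape_rate(found_before_release: int, found_after_release: int) -> float:
    """Percentage of all known defects that escaped into the released product.

    Assumed definition: the complement of the removal percentage.
    """
    return 100.0 - defect_removal_percentage(found_before_release, found_after_release)


# Hypothetical counts: 90 defects found by reviews and tests, 10 reported after delivery
print(defect_removal_percentage(90, 10))
print(defect_escape_rate(90, 10))
```

Organizations often define escape rates per phase rather than per release; the sketch would then be applied to each pair of adjacent phases.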

Examples of sources for objectives include the following:

o Requirements
o Organization's quality and process-performance objectives
o Customer's quality and process-performance objectives
o Business objectives
o Discussions with customers and potential customers
o Market surveys

Examples of sources for criteria used in selecting subprocesses include the following:
o Customer requirements related to quality and process performance
o Quality and process-performance objectives established by the customer
o Quality and process-performance objectives established by the organization
o Organization’s performance baselines and models
o Stable performance of the subprocess on other projects
o Laws and regulations

Examples of product and process attributes include the following:

o Defect density
o Cycle time
o Test coverage

Example sources of the risks include the following:

o Inadequate stability and capability data in the organization’s measurement repository
o Subprocesses having inadequate performance or capability
o Suppliers not achieving their quality and process-performance objectives

o Lack of visibility into supplier capability
o Inaccuracies in the organization’s process performance models for predicting future
performance
o Deficiencies in predicted process performance (estimated progress)
o Other identified risks associated with identified deficiencies

Examples of actions that can be taken to address deficiencies in achieving the project’s objectives
include the following:
o Changing quality or process performance objectives so that they are within the expected
range of the project’s defined process
o Improving the implementation of the project’s defined process so as to reduce its normal
variability (reducing variability may bring the project’s performance within the objectives
without having to move the mean)
o Adopting new subprocesses and technologies that have the potential for satisfying the
objectives and managing the associated risks
o Identifying the risk and risk mitigation strategies for the deficiencies
o Terminating the project

Examples of subprocess measures include the following:

o Requirements volatility
o Ratios of estimated to measured values of the planning parameters (e.g., size, cost, and
schedule)
o Coverage and efficiency of peer reviews
o Test coverage and efficiency
o Effectiveness of training (e.g., percent of planned training completed and test scores)
o Reliability
o Percentage of the total defects inserted or found in the different phases of the project life
cycle
o Percentage of the total effort expended in the different phases of the project life cycle
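
The last two measures in this list are simple distributions over life cycle phases. The following sketch shows one possible computation; the function name `phase_percentages`, the phase names and the counts are hypothetical.

```python
from typing import Dict


def phase_percentages(counts: Dict[str, float]) -> Dict[str, float]:
    """Convert per-phase totals (defects found or effort spent) into percentages."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("nothing recorded in any phase")
    return {phase: 100.0 * value / total for phase, value in counts.items()}


# Hypothetical defect counts per phase of one project
defects_found = {"design": 10, "implementation": 30, "test": 60}
for phase, share in phase_percentages(defects_found).items():
    print(f"{phase}: {share:.1f} %")
```

The same function applies unchanged to effort data, for example person-hours per phase.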

Sources of anomalous patterns of variation may include the following:

o Lack of process compliance
o Undistinguished influences of multiple underlying subprocesses on the data
o Ordering or timing of activities within the subprocess
o Uncontrolled inputs to the subprocess
o Environmental changes during subprocess execution
o Schedule pressure
o Inappropriate sampling or grouping of data

Examples of criteria for determining whether data are comparable include the following:
o Product lines
o Application domain
o Work product and task attributes (e.g., size of product)
o Size of project

Examples of where the natural bounds are calculated include the following:
o Control charts
o Confidence intervals (for parameters of distributions)
o Prediction intervals (for future outcomes)
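
As a minimal sketch of how natural bounds might be obtained from a control chart, the following code computes the natural process limits of an individuals (XmR) chart. The choice of chart type, the function names and the sample data are assumptions for illustration; the CMMI does not prescribe a particular chart. The constant 2.66 is the usual XmR scaling factor (3 divided by d2 = 1.128).

```python
from statistics import mean

# 3 / d2, with d2 = 1.128 for moving ranges of two consecutive observations
XMR_FACTOR = 2.66


def xmr_limits(values):
    """Return (lower, center, upper) natural process limits of an individuals chart."""
    if len(values) < 2:
        raise ValueError("at least two observations are needed")
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    center = mean(values)
    spread = XMR_FACTOR * mean(moving_ranges)
    return center - spread, center, center + spread


def points_outside(values, lower, upper):
    """Indices of observations outside the natural bounds (candidate special causes)."""
    return [i for i, v in enumerate(values) if v < lower or v > upper]


# Hypothetical inspection rates from ten peer reviews
rates = [10, 12, 11, 13, 12, 11, 10, 12, 25, 11]
lower, center, upper = xmr_limits(rates)
print(round(lower, 2), round(center, 2), round(upper, 2))
print(points_outside(rates, lower, upper))
```

In practice the limits are usually computed from a baseline period and then applied to later observations, so that a single extreme value does not widen its own limits.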

Examples of techniques for analyzing the reasons for special causes of variation include the following:
o Cause-and-effect (fishbone) diagrams
o Designed experiments
o Control charts (applied to subprocess inputs or to lower level subprocesses)
o Subgrouping (analyzing the same data segregated into smaller groups based on an
understanding of how the subprocess was implemented facilitates isolation of special
causes)

Examples of when the natural bounds may need to be recalculated include the following:
o There are incremental improvements to the subprocess
o New tools are deployed for the subprocess
o A new subprocess is deployed

o The collected measures suggest that the subprocess mean has permanently shifted or the
subprocess variation has permanently changed
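
One way to obtain such a suggestion from the collected measures is a run rule on the control chart. The sketch below uses a common convention (eight consecutive points on the same side of the center line); this rule and the function name are assumptions for illustration and are not part of the CMMI text.

```python
def has_sustained_shift(values, center, run_length=8):
    """True if `run_length` consecutive values lie on the same side of `center`."""
    run = 0
    last_side = 0
    for v in values:
        side = (v > center) - (v < center)  # +1 above, -1 below, 0 on the line
        if side != 0 and side == last_side:
            run += 1
        else:
            run = 1 if side != 0 else 0
        last_side = side
        if run >= run_length:
            return True
    return False


# Hypothetical cycle times: the later observations all lie above the old center of 10
old_center = 10
cycle_times = [9, 11, 10, 12, 11, 12, 13, 11, 12, 12, 11]
if has_sustained_shift(cycle_times, old_center):
    print("mean may have shifted; consider recalculating the natural bounds")
```

A signal from such a rule is only a trigger for investigation; whether the bounds are recalculated remains a decision based on knowledge of the subprocess.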

Examples of actions that can be taken when a selected subprocess’ performance does not satisfy its
objectives include the following:
o Changing quality and process-performance objectives so that they are within the
subprocess’ process capability
o Improving the implementation of the existing subprocess so as to reduce its normal
variability (reducing variability may bring the natural bounds within the objectives without
having to move the mean)
o Adopting new process elements and subprocesses and technologies that have the potential
for satisfying the objectives and managing the associated risks
o Identifying risks and risk mitigation strategies for each subprocess’ process capability

Examples of other resources provided include the following tools:

o System dynamics models
o Automated test-coverage analyzers
o Statistical process and quality control packages
o Statistical analysis packages

Examples of training topics include the following:

o Process modelling and analysis
o Process measurement data selection, definition, and collection

Examples of work products placed under configuration management include the following:
o Subprocesses to be included in the project’s defined process
o Operational definitions of the measures, their collection points in the subprocesses, and
how the integrity of the measures will be determined
o Collected measures

Examples of activities for stakeholder involvement include the following:

o Establishing project objectives
o Resolving issues among the project’s quality and process-performance objectives
o Appraising performance of the selected subprocesses
o Identifying and managing the risks in achieving the project’s quality and process-
performance objectives
o Identifying what corrective action should be taken

Examples of measures used in monitoring and controlling include the following:

o Profile of subprocesses under statistical management (e.g., number planned to be under
statistical management, number currently being statistically managed, and number that are
statistically stable)
o Number of special causes of variation identified

Examples of activities reviewed include the following:

o Quantitatively managing the project using quality and process-performance objectives
o Statistically managing selected subprocesses within the project’s defined process

Examples of work products reviewed include the following:

o Subprocesses to be included in the project’s defined process
o Operational definitions of the measures
o Collected measures

Based on these quantifications, the CMMI defines: “A ‘managed process’ is a performed process that is planned and
executed in accordance with policy; employs skilled people having adequate resources to produce controlled
outputs; involves relevant stakeholders; is monitored, controlled, and reviewed; and is evaluated for adherence to
its process description.”

2 Software Measurement Intentions
2.1 The CAME Measurement Framework
The following measurement and evaluation framework, which addresses the software product, process and resources,
was developed at the University of Magdeburg [Dumke 1999]. The measurement framework is embedded in
several strategic aspects of the IT area in organizations and societies, as shown in Figure 4.



IT area

CAME strategy

CAME framework

CAME tools

Figure 4: Main areas relating to the software measurement and evaluation framework

We will briefly describe some essential aspects of this framework and the characteristics of its
environments. The CAME strategy draws on the experience with measurement frameworks or metrics programs
that are embedded in the enterprise area ([Dumke 2002], [Eickelmann 2000], [Fehrling 2003], [Kitchenham
1997], [Munson 2003]) and stands for

• Community: the necessity of a group or a team that is motivated and has the knowledge of software
measurement to install software metrics. In general, the members of these groups are organised in metrics
communities such as our German Interest Group on Software Metrics.

• Acceptance: the agreement of the (top) management to install a metrics program in the (IT) business area.
This aspect is strongly connected with the knowledge about required budgets and personnel resources.

• Motivation: the production of measurement and evaluation results in a first metrics application which
demonstrates the convincing benefits of the metrics application. This very important aspect can be
achieved by the application of essential results in the (world-wide) practice which are easy to understand
and should motivate the management. One of the problems with this aspect is that the management
wants to obtain one single (quality) number as a summary of all measured characteristics.

• Engagement: the acceptance of spending effort to implement the software measurement as a permanent
metrics system (with continued measurement, different statistical analysis, metrics set updates etc.). This
aspect includes also the requirement to dedicate personnel resources such as measurement teams etc.

The CAME framework consists of the following four phases which are defined to install a metrics program in
the IT area and which can be used to evaluate the measurement level of this metrics program itself (see also
[Dumke 2001], [Fenton 1997], [Kitchenham 1995], [Putnam 2003], [Zuse 1998]):

• Choice: the selection of metrics based on a special or general measurement view on the kind of
measurement and the related measurement goals,

• Adjustment: the investigation and definition of the measurement characteristics of the metrics for the
specific application field,

• Migration: the installation of a high metrication coverage based on semantic relations between the
metrics along the whole life cycle and along the system architecture,

• Efficiency: the automation level of the construction of a tool-based measurement for the used metrics.

The phases of this framework will be explained in the following sections, including the detailed aspects of
software measurement evaluation and the role of the CAME tools.

The Measurement Choice concerns the use of metrics and involves the following two essential questions:

“What is possible to measure?” vs. “What is necessary to measure?”

Obviously, we only want to measure what is necessary. In most software engineering areas, however, this is
unknown (especially for modern software development paradigms or methodologies such as software agents and
multi-agent systems). The first framework step comprises the choice of the software metrics and measures.
Therefore, we must define the set of software metrics explicitly [Dumke 2003]. The structure of this set of
metrics is based on the following classification principles:

• software product measurement and evaluation is based on the three components: model,
implementation and documentation (see Figure 5),

[Figure: three panels titled “software architecture”, “software operation” and “software documentation”.
Labels include: user interface, problem domain, task management, data management, configuration, components,
data basis; human interface aspects, product data, tasks, distributed tasks and data bases, tasks behaviour,
data handling; appropriateness, marketing documents, tutorials, user manual, development documents
(technology, tests, tools, supports), completeness.]

Figure 5: Simplified visualisation of the product metrication

Note that the metrication process depends on the kind of development method, the application area of
the software system, the implementation paradigm, etc.

• software process measurement and evaluation is based on the process aspects: controlling, phases/steps
and methodologies (see Figure 6),

[Figure: three panels titled “software life cycle”, “software management” and “software methodology”.
Labels include: milestones, problem definition, requirement analysis/specification, design, implementation,
field test, maintenance, phases, workflow; controlling, versioning, project management, quality management,
configuration management, management aspects, efficiency; suitability, support, approach, paradigm, development
methodology, implementation methodology, upper CASE, lower CASE, evaluation.]

Figure 6: Simplified visualisation of the process metrication

• software resources measurement and evaluation is based on the three resource parts: personnel,
software and hardware (see Figure 7).

[Figure: three panels titled “personnel”, “software resources” and “hardware resources”. Labels include:
skills, communication, user, customer, development team (test team), maintenance team, productivity;
compatibility, paradigm, COTS, CASE, system software, performance; reliability, availability, computers,
peripherals, networks, architectures, performance.]

Figure 7: Simplified visualisation of the resources metrication

Our framework starts with the investigation of the chosen metrics and assumes an underlying choice method
such as

• the general measurement goal planning by [Basili 1986] (see also [Wohlin 2000]), which considers
different measurement goals such as understanding of systems, assessment, proof of hypotheses, understanding of
metrics etc.,

• the Goal Question Metric (GQM) paradigm [Solingen 1999], which is directed at the improvement of a
special aspect or component of the software system related to a special goal.
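
The GQM paradigm derives metrics top-down from a goal via questions. A minimal data-structure sketch of this idea follows; the class names and the example goal, questions and metrics are hypothetical and only illustrate the shape of such a tree.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Metric:
    name: str
    unit: str


@dataclass
class Question:
    text: str
    metrics: List[Metric] = field(default_factory=list)


@dataclass
class Goal:
    purpose: str
    questions: List[Question] = field(default_factory=list)

    def metric_names(self) -> List[str]:
        """All distinct metrics needed for this goal, in first-use order."""
        names: List[str] = []
        for question in self.questions:
            for metric in question.metrics:
                if metric.name not in names:
                    names.append(metric.name)
        return names


# Hypothetical example: improving the effectiveness of peer reviews
defects = Metric("defects found per review", "count")
effort = Metric("review effort", "person-hours")
goal = Goal(
    "Improve the effectiveness of peer reviews",
    [
        Question("How many defects does a review find?", [defects]),
        Question("How much does finding one defect cost?", [defects, effort]),
    ],
)
print(goal.metric_names())
```

The resulting list is the chosen metrics set for this goal: anything not reachable from a question is, by construction, not measured.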

The measurement choice step defines the static characteristics of the software measurement process [Feiler
1993]. Note that the choice of software metrics or software measures determines which areas of the IT
department are under control and which are not.

The Measurement Adjustment is related to the experience (expressed in values) of the measured attributes for
the evaluation. The adjustment includes the metrics validation ([Card 2000], [Kitchenham 1995], [Zelkowitz
1997]) and the determination of the metrics algorithm based on the measurement theory ([Henderson 1996],
[Zuse 2003]). The steps in the measurement adjustment are

• the determination of the scale type and (if possible) the unit,

• the determination of the favourable values (as thresholds) for the evaluation of the measurement
component, e. g. by

o discussion or brainstorming in the development or quality team,

o analysing and using the examples in the literature,

o using the thresholds of the metrics tools,

o taking the results of appropriate case studies and experimentation,

• the tuning of the thresholds as

o approximation during the software development from other project components,

o application of a metrics tool for a chosen software product that was classified as a ‘good
qualitative’ example,

• the calibration of the scale (as a transformation of the numerical part of the scale to the empirical one),
which depends on the improvement of the knowledge in the problem domain.

In the adjustment step, we mainly consider the metrics characteristics addressed to the qualitative evaluation
(nominal and ordinal scale types) or to the quantitative evaluation (interval or ratio scale types).
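
The scale type matters because it limits which statistics are meaningful for a metric. The following sketch encodes the usual hierarchy from measurement theory; the names `Scale`, `MIN_SCALE` and `is_meaningful`, and the small selection of statistics, are illustrative assumptions.

```python
from enum import IntEnum


class Scale(IntEnum):
    """Scale types ordered by the amount of structure they carry."""
    NOMINAL = 1
    ORDINAL = 2
    INTERVAL = 3
    RATIO = 4


# Weakest scale type on which each statistic is meaningful
MIN_SCALE = {
    "mode": Scale.NOMINAL,
    "median": Scale.ORDINAL,
    "mean": Scale.INTERVAL,
    "standard deviation": Scale.INTERVAL,
    "ratio of two values": Scale.RATIO,
}


def is_meaningful(statistic: str, scale: Scale) -> bool:
    """True if `statistic` may be applied to data measured on `scale`."""
    return scale >= MIN_SCALE[statistic]


# A defect severity class (low/medium/high) is ordinal: median yes, mean no
print(is_meaningful("median", Scale.ORDINAL))
print(is_meaningful("mean", Scale.ORDINAL))
```

A check of this kind could be attached to each metric in the metrics set, so that an analysis tool refuses to average ordinal ratings.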

The Measurement Migration step is aimed at the dynamic aspects of the measurement framework or metrics
program. This means that we must install a metrics-based network over the software product, process, and
resources components as an Internal Measurement Process (IMP). We “migrate” the idea of metrication to all
components of software development and maintenance. Note that most existing software
measurement approaches or frameworks do not consider this step explicitly. First intentions of this idea are
described as complexity traces in [Ebert 1993], as measurement through the life cycle in [Cool 1993], and as
granularity of object-oriented systems in [Abreu 1995]. Some examples of these kinds of migration for software
products are [Dumke 1999]:

• metrics tracing along the software life cycle, e. g. #notions (problem definition) → #classes
(specification) → #new-defined-classes (design) → #implemented-classes (implementation),

• metrics refinement along the software life cycle, e. g. informal description of a specified service (text
metrics) → PDL description of a service (design metrics) → Java form of a service (code metrics),

• metrics granulation related to the architecture, e. g. in an object-oriented development as the system, the
component, the class/object and the method.

In the process and resources area, semantic characteristics such as process phases and resource versions are
also considered. Viewing the software metrics as a class hierarchy, we can understand the measurement
migration as the definition and design of the metrics behaviour.

On the other hand, the migration step includes the definition and installation of the External Measurement
Process (EMP) as software measurement integration. This means that we must consider the final goals of
software measurement in the IT area. Hence, we need all of the process steps such as measurement, evaluation,
exploitation and application (assessment, decision support, improvement) in a persistent manner ([Eickelmann
2000], [Jacquet 1997], [Wohlin 2000]).

The Measurement Efficiency step includes the instrumentation or automation of the measurement process
by tools. It requires analysing the algorithmic character of the software metrics and the possibility of
integrating tool-based ‘control cycles’ into the software development or maintenance process. We call such
metrics tools CAME (Computer Assisted software Measurement and Evaluation) tools [Dumke 1996]. In
most cases, it is necessary to combine different metrics tools and techniques related to the measurement phases.

Finally, we can describe software measurement intentions as follows:

⇒ We don’t have any general system of measures in software engineering like
in physics. Hence, in software development we must also consider rules of
thumb, statements of trends, analogue conclusions, expertise, estimations and
predictions ([Dumke 2003], [Endres 2003]).

⇒ We also don’t have any standardised measurement system which implements
such a system of measures. Therefore, we must use the general techniques of
assessment (continuous, periodic or certified), general evaluation, experience
and experimentation. Sometimes, experimentation is not immediately
used for decision support, improvement or controlling. We also use
experimentation for understanding new paradigms or for recognizing new
kinds of problems ([Basili 1986], [Wohlin 2000]).

⇒ Software measurement instruments are mostly not based on a physical
analogy such as the column of mercury used to measure temperature. In most
cases, software measurement is counting [Kitchenham 1995].

⇒ Software measurement has a context and is not finished with measurement
values or thresholds. Software measurement can be a generic measurement
and analysis process ([Card 2000], [Jacquet 1997]).

⇒ Empirical techniques are divided into informal observation, formal
experiments, industrial case studies and benchmarking exercises or surveys
([Juristo 2003], [Kitchenham 1997]).

⇒ “In software engineering metrics area, should place more emphasis on the
validity of the mathematical (and statistical) tools which have been (and are
currently being) used in their development and use. Areas which give cause
for concern in the past include the use of dimensionally incorrect equations,
incorrect plotting of equations and consequent incorrect inferences, the
sloppy use of mathematical notation and of calculated values and the lack of
underpinning mathematical models.” [Henderson 1996]

Hence, the software metrics application based on different methodologies or frameworks requires statistical
methods ([Juristo 2003], [Munson 2003], [Pandian 2003], [Sigpurwalla 1999], [Wohlin 2000], [Zuse 1998]).

2.2 The CMMI Metrics Set by Kulpa and Johnson
The following set of metrics is defined by Kulpa and Johnson in order to meet the quantified requirements of the
different CMMI levels [Kulpa 2003].

CMMI Level 2:

Requirements Management
1. Requirements volatility (percentage of requirements changes)
2. Number of requirements by type or status (defined, reviewed, approved, and implemented)
3. Cumulative number of changes to the allocated requirements, including total number of changes
proposed, open, approved, and incorporated into the system baseline
4. Number of change requests per month, compared to the original number of requirements for the
5. Amount of time spent, effort spent, cost of implementing change requests
6. Number and size of change requests after the Requirements phase is completed
7. Cost of implementing a change request
8. Number of change requests versus the total number of change requests during the life of the
9. Number of change requests accepted but not implemented
10. Number of requirements (changes and additions to the baseline)
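
As a minimal sketch of metric 1, requirements volatility can be computed as the percentage of baseline requirements that were changed. The choice of the baseline count as denominator and the function name are assumptions made for illustration; [Kulpa 2003] only names the metric.

```python
def requirements_volatility(changed: int, baseline_total: int) -> float:
    """Percentage of baseline requirements that were added, modified or deleted."""
    if baseline_total <= 0:
        raise ValueError("baseline_total must be positive")
    return 100.0 * changed / baseline_total

# Hypothetical project: 12 changes against a baseline of 80 requirements.
print(requirements_volatility(12, 80))  # 15.0
```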

Project Planning
11. Completion of milestones for the project planning activities compared to the plan (estimates
versus actuals)
12. Work completed, effort and funds expended in the project planning activities compared to the
13. Number of revisions to the project plan
14. Cost, schedule, and effort variance per plan revision
15. Replanning effort due to change requests
16. Effort expended over time to manage the project compared to the plan
17. Frequency, causes, and magnitude of the replanning effort
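
Several of the planning metrics above compare estimates with actuals. One simple way to express such a comparison is the relative variance sketched below; the formula (actual minus planned, divided by planned) is a common convention and an assumption here, and the plan revisions are invented sample data.

```python
def relative_variance(planned: float, actual: float) -> float:
    """Signed deviation of the actual value from the plan, as a fraction of the plan."""
    if planned == 0:
        raise ValueError("planned value must be non-zero")
    return (actual - planned) / planned

# Hypothetical effort figures (person hours) per plan revision: (label, planned, actual).
revisions = [("revision 1", 200.0, 250.0), ("revision 2", 260.0, 247.0)]
for label, planned, actual in revisions:
    print(label, f"{relative_variance(planned, actual):+.1%}")
```

A positive value indicates an overrun, a negative value an underrun relative to the plan.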

Project Monitoring and Control

18. Effort and other resources expended in performing monitoring and oversight activities
19. Change activity for the project plan, which includes changes to size estimates of the work
products, cost/resource estimates, and schedule
20. Number of open and closed corrective actions or action items
21. Project milestone dates (planned versus actual)
22. Number of project milestone dates made on time
23. Number and types of reviews performed
24. Schedule, budget, and size variance between planned and actual reviews
25. Comparison of actuals versus estimates for all planning and tracking items

Measurement and Analysis

26. Number of projects using progress and performance measures
27. Number of measurement objectives addressed

Supplier Agreement Management

28. Cost of the COTS (commercial off-the-shelf) products
29. Cost and effort to incorporate the COTS products into the project
30. Number of changes made to the supplier requirements
31. Cost and schedule variance per supplier agreement
32. Costs of the activities for managing the contract compared to the plan
33. Actual delivery dates for contracted products compared to the plan
34. Actual dates of prime contractor deliveries to the subcontractor compared to the plan
35. Number of on-time deliveries from the vendor, compared with the contract
36. Number and severity of errors found after delivery
37. Number of exceptions to the contract to ensure schedule adherence
38. Number of quality audits compared to the plan

39. Number of Senior Management reviews to ensure adherence to budget and schedule versus the
40. Number of contract violations by supplier or vendor

Process and Product Quality Assurance (QA)

41. Completions of milestones for the QA activities compared to the plan
42. Work completed, effort expended in the QA activities compared to the plan
43. Number of product audits and activity reviews compared to the plan
44. Number of process audits and activities versus those planned
45. Number of defects per release and/or build
46. Amount of time/effort spent in rework
47. Amount of QA time/effort spent in each phase of the life cycle
48. Number of reviews and audits versus number of defects found
49. Total number of defects found in internal reviews and testing versus those found by the customer or end
user after delivery
50. Number of defects found in each phase of the life cycle
51. Number of defects injected during each phase of the life cycle
52. Number of noncompliances written versus the number resolved
53. Number of noncompliances elevated to senior management
54. Complexity of module or component (McCabe, McClure, and Halstead metrics)
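
Metric 54 refers, among others, to McCabe's cyclomatic complexity, which for a control-flow graph with E edges, N nodes and P connected components is V(G) = E - N + 2P. The following is a minimal sketch, assuming the graph is given as a list of (source, target) edges; this representation and the function name are illustrative choices.

```python
def cyclomatic_complexity(edges, components=1):
    """McCabe's V(G) = E - N + 2P for a control-flow graph given as (source, target) edges."""
    nodes = {node for edge in edges for node in edge}
    return len(edges) - len(nodes) + 2 * components

# Control-flow graph of a single if/else statement: 4 edges, 4 nodes, 1 component.
if_else = [("cond", "then"), ("cond", "else"), ("then", "join"), ("else", "join")]
print(cyclomatic_complexity(if_else))  # 2
```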

Configuration Management (CM)

55. Number of change requests or change board requests processed per unit of time
56. Completions of milestones for the CM activities compared to the plan
57. Work completed, effort expended, and funds expended in the CM activities
58. Number of changes to configuration items
59. Number of configuration audits conducted
60. Number of fixes returned as "Not Yet Fixed"
61. Number of fixes returned as "Could Not Reproduce Error"
62. Number of violations of CM procedures (noncompliance found in audits)
63. Number of outstanding problem reports versus rate of repair
64. Number of times changes are overwritten by someone else (or number of times people have the wrong
initial version or baseline)
65. Number of engineering change proposals proposed, approved, rejected, implemented
66. Number of changes by category to code source, and to supporting documentation
67. Number of changes by category, type, and severity
68. Source lines of code stored in libraries placed under configuration control

CMMI Level 3:

Requirements Development
69. Cost, schedule, and effort expended for rework
70. Defect density of requirements specifications
71. Number of requirements approved for build (versus the total number of requirements)
72. Actual number of requirements documented (versus the total number of estimated requirements)
73. Staff hours (total and by Requirements Development activity)
74. Requirements status (percentage of defined specifications out of the total approved and proposed;
number of requirements defined)
75. Estimates of total requirements, total requirements definition effort, requirements analysis effort, and
76. Number and type of requirements changes

Technical Solution
77. Cost, schedule, and effort expended for rework
78. Number of requirements addressed in the product or product-component design
79. Size and complexity of the product, product components, interfaces, and documentation
80. Defect density of technical solutions work products (number of defects per page)
81. Number of requirements by status or type throughout the life of the project (for example, number
defined, approved, documented, implemented, tested, and signed-off by phase)
82. Problem reports by severity and length of time they are open

83. Number of requirements changed during implementation and test
84. Effort to analyze proposed changes for each proposed change and cumulative totals
85. Number of changes incorporated into the baseline by category (e.g., interface, security, system
configuration, performance, and useability)
86. Size and cost to implement and test incorporated changes, including initial estimate and actual size
and cost
87. Estimates and actuals of system size, reuse, effort, and schedule
88. The total estimated and actual staff hours needed to develop the system by job category and activity
89. Estimated dates and actuals for the start and end of each phase of the life cycle
90. Number of diagrams completed versus the estimated total diagrams
91. Number of design modules/units proposed
92. Number of design modules/units delivered
93. Estimates and actuals of total lines of code - new, modified, and reused
94. Estimates and actuals of total design and code modules and units
95. Estimates and actuals for total CPU hours used to date
96. The number of units coded and tested versus the number planned
97. Errors by category, phase discovered, phase injected, type, and severity
98. Estimates of total units, total effort, and schedule
99. System tests planned, executed, passed, or failed
100. Test discrepancies reported, resolved, or not resolved
101. Source code growth by percentage of planned versus actual

Product Integration
102. Product-component integration profile (i.e., product-component assemblies planned and performed,
and number of exceptions found)
103. Integration evaluation problem report trends (e.g., number written and number closed)
104. Integration evaluation problem report aging (i.e., how long each problem report has been open)

Verification
105. Verification profile (e.g., the number of verifications planned and performed, and the defects found;
perhaps categorized by verification method or type)
106. Number of defects detected by defect category
107. Verification problem report trends (e.g., number written and number closed)
108. Verification problem report status (i.e., how long each problem report has been open)
109. Number of peer reviews performed compared to the plan
110. Overall effort expended on peer reviews compared to the plan
111. Number of work products reviewed compared to the plan

Validation
112. Number of validation activities completed (planned versus actual)
113. Validation problem reports trends (e.g., number written and number closed)
114. Validation problem report aging (i.e., how long each problem report has been open)

Organizational Process Focus

115. Number of process improvement proposals submitted, accepted, or implemented
116. CMMI maturity or capability level
117. Work completed, effort and funds expended in the organization's activities for process assessment,
development, and improvement compared to the plans for these activities
118. Results of each process assessment, compared to the results and recommendations of previous

Organizational Process Definition

119. Percentage of projects using the process architectures and process elements of the organization's set
of standard processes
120. Defect density of each process element of the organization's set of standard processes
121. Number of on-schedule milestones for process development and maintenance
122. Costs for the process definition activities

Organizational Training
123. Number of training courses delivered (e.g., planned versus actual)
124. Post-training evaluation ratings
125. Training program quality surveys

126. Actual attendance at each training course compared to the projected attendance
127. Progress in improving training courses compared to the organization's and projects' training plans
128. Number of training waivers approved over time

Integrated Project Management for IPPD

129. Number of changes to the project's defined process
130. Effort to tailor the organization's set of standard processes
131. Interface coordination issue trends (e.g., number identified and closed)

Risk Management
132. Number of risks identified, managed, tracked, and controlled
133. Risk exposure and changes to the risk exposure for each assessed risk, and as a summary percentage
of management reserve
134. Change activity for the risk mitigation plans (e.g., processes, schedules, funding)
135. Number of occurrences of unanticipated risks
136. Risk categorization volatility
137. Estimated versus actual risk mitigation effort
138. Estimated versus actual risk impact
139. The amount of effort and time spent on risk management activities versus the number of actual risks
140. The cost of risk management versus the cost of actual risks
141. For each identified risk, the realized adverse impact compared to the estimated impact

Integrated Teaming
142. Performance according to plans, commitments, and procedures for the integrated team, and
deviations from expectations
143. Number of times team objectives were not achieved
144. Actual effort and other resources expended by one group to support another group or groups, and
vice versa
145. Actual completion of specific tasks and milestones by one group to support the activities of other
groups, and vice versa

Integrated Supplier Management

146. Effort expended to manage the evaluation of sources and selection of suppliers
147. Number of changes to the requirements in the supplier agreement
148. Number of documented commitments between the project and the supplier
149. Interface coordination issue trends (e.g., number identified and number closed)
150. Number of defects detected in supplied products (during integration and after delivery)

Decision Analysis and Resolution

151. Cost-to-benefit ratio of using formal evaluation processes

Organizational Environment for Integration

152. Parameters for key operating characteristics of the work environment

CMMI Level 4:

Organizational Process Performance

153. Trends in the organization's process performance with respect to changes in work products and task
attributes (e.g., size growth, effort, schedule, and quality)

Quantitative Project Management

154. Time between failures
155. Critical resource utilization
156. Number and severity of defects in the released product
157. Number and severity of customer complaints concerning the provided service
158. Number of defects removed by product verification activities (perhaps by type of verification, such
as peer reviews and testing)
159. Defect escape rates
160. Number and density of defects by severity found during the first year following product delivery or
start of service

161. Cycle time
162. Amount of rework time
163. Requirements volatility (i.e., number of requirements changes per phase)
164. Ratios of estimated to measured values of the planning parameters (e.g., size, cost, and schedule)
165. Coverage and efficiency of peer reviews (i.e., number/amount of products reviewed compared to
total number, and number of defects found per hour)
166. Test coverage and efficiency (i.e., number/amount of products tested compared to total number, and
number of defects found per hour)
167. Effectiveness of training (i.e., percent of planned training completed and test scores)
168. Reliability (i.e., mean time-to-failure usually measured during integration and systems test)
169. Percentage of the total defects inserted or found in the different phases of the project life cycle
170. Percentage of the total effort expended in the different phases of the project life cycle
171. Profile of subprocesses under statistical management (i.e., number planned to be under statistical
management, number currently being statistically managed, and number that are statistically
172. Number of special causes of variation identified
173. The cost over time for the quantitative process management activities compared to the plan
174. The accomplishment of schedule milestones for quantitative process management activities
compared to the approved plan (i.e., establishing the process measurements to be used on the
project, determining how the process data will be collected, and collecting the process data)
175. The cost of poor quality (e.g., amount of rework, re-reviews and re-testing)
176. The costs for achieving quality goals (e.g., amount of initial reviews, audits, and testing)
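
Two of the Level 4 metrics (154, time between failures, and 159, defect escape rates) can be sketched as follows. The definitions used are assumptions for illustration: the escape rate is taken as the share of all known defects found after release, and the mean time between failures is the average gap between consecutive failure timestamps.

```python
from statistics import mean

def mean_time_between_failures(failure_times):
    """Average gap between consecutive failure timestamps (same unit as the input)."""
    gaps = [later - earlier for earlier, later in zip(failure_times, failure_times[1:])]
    if not gaps:
        raise ValueError("at least two failures are required")
    return mean(gaps)

def defect_escape_rate(found_before_release, found_after_release):
    """Fraction of all known defects that escaped to the released product."""
    total = found_before_release + found_after_release
    if total == 0:
        raise ValueError("no defects recorded")
    return found_after_release / total

# Hypothetical data: failures observed after 10, 30, 60 and 100 operating hours.
print(mean_time_between_failures([10, 30, 60, 100]))
print(defect_escape_rate(found_before_release=45, found_after_release=5))
```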

CMMI Level 5:
Organizational Innovation and Deployment
177. Change in quality after improvements (e.g., number of reduced defects)
178. Change in process performance after improvements (e.g., change in baselines)
179. The overall technology change activity, including number, type, and size of changes
180. The effect of implementing the technology change compared to the goals (e.g., actual cost saving to
181. The number of process improvement proposals submitted and implemented for each process area
182. The number of process improvement proposals submitted by each project, group, and department
183. The number and types of awards and recognitions received by each of the projects, groups, and
184. The response time for handling process improvement proposals
185. Number of process improvement proposals accepted per reporting period
186. The overall change activity including number, type, and size of changes
187. The effect of implementing each process improvement compared to its defined goals
188. Overall performance of the organization's and projects' processes, including effectiveness, quality,
and productivity compared to their defined goals
189. Overall productivity and quality trends for each project
190. Process measurements that relate to the indicators of the customers' satisfaction (e.g., surveys results,
number of customer complaints, and number of customer compliments)

Causal Analysis and Resolution

191. Defect data (problem reports, defects reported by the customer, defects reported by the user, defects
found in peer reviews, defects found in testing, process capability problems, time and cost for
identifying the defect and fixing it, estimated cost of not fixing the problem)
192. Number of root causes removed
193. Change in quality or process performance per instance of the causal analysis and resolution process
(e.g., number of defects and changes in baseline)
194. The costs of defect prevention activities (e.g., holding causal analysis meetings and implementing
action items), cumulatively
195. The time and cost for identifying the defects and correcting them compared to the estimated cost of
not correcting the defects
196. Profiles measuring the number of action items proposed, open, and completed
197. The number of defects injected in each stage, cumulatively, and over releases of similar products
198. The number of defects

2.3 The CMMI-Based Organization’s Measurement Repository
The following section describes the main activities for defining and implementing measurement repositories
used in an organizational context. The repository contains both product and process measures that are related to
the organization's set of standard processes ([SEI 2002]). It also contains or refers to the information needed to
understand and interpret the measures and assess them for reasonableness and applicability. For example, the
definitions of the measures are used to compare similar measures from different processes.

Typical Work Products:

1. Definition of the common set of product and process measures for the organization's set of standard
processes

2. Design of the organization’s measurement repository

3. Organization's measurement repository (i.e., the repository structure and support environment)

4. Organization’s measurement data


Subpractices:

1. Determine the organization's needs for storing, retrieving, and analyzing measurements.

2. Define a common set of process and product measures for the organization's set of standard
processes. The measures in the common set are selected based on the organization's set of standard
processes. The common set of measures may vary for different standard processes. Operational
definitions for the measures specify the procedures for collecting valid data and the point in the
process where the data will be collected. Examples of classes of commonly used measures include
the following:
• Estimates of work product size (e.g., pages)
• Estimates of effort and cost (e.g., person hours)
• Actual measures of size, effort, and cost
• Quality measures (e.g., number of defects found, severity of defects)
• Peer review coverage
• Test coverage
• Reliability measures (e.g., mean time to failure).

Refer to the Measurement and Analysis process area for more information about defining measures.

3. Design and implement the measurement repository.

4. Specify the procedures for storing, updating, and retrieving measures.

5. Conduct peer reviews on the definitions of the common set of measures and the procedures for
storing and retrieving measures. Refer to the Verification process area for more information about
conducting peer reviews.

6. Enter the specified measures into the repository. Refer to the Measurement and Analysis process
area for more information about collecting and analyzing data.

7. Make the contents of the measurement repository available for use by the organization and projects
as appropriate.

8. Revise the measurement repository, common set of measures, and procedures as the organization’s
needs change. Examples of when the common set of measures may need to be revised include the
following:
• New processes are added
• Processes are revised and new product or process measures are needed
• Finer granularity of data is required
• Greater visibility into the process is required
• Measures are retired.
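
The activities above (define measures, store, retrieve) can be illustrated with a very small in-memory sketch. All class, field and measure names are hypothetical; a real organizational repository would add persistence, access control and the revision procedures described in step 8.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class MeasureDefinition:
    name: str              # e.g. "review_coverage"
    unit: str              # e.g. "%"
    collection_point: str  # point in the process where the data are collected

@dataclass
class MeasurementRepository:
    definitions: dict = field(default_factory=dict)
    records: list = field(default_factory=list)

    def define(self, definition):
        """Register the operational definition of a measure."""
        self.definitions[definition.name] = definition

    def store(self, name, project, value):
        """Store a value; only measures with a definition are accepted."""
        if name not in self.definitions:
            raise KeyError(f"no operational definition for measure {name!r}")
        self.records.append((name, project, value))

    def retrieve(self, name, project=None):
        """Return all values of a measure, optionally restricted to one project."""
        return [value for n, p, value in self.records
                if n == name and (project is None or p == project)]

repo = MeasurementRepository()
repo.define(MeasureDefinition("review_coverage", "%", "end of design phase"))
repo.store("review_coverage", "project A", 80.0)
repo.store("review_coverage", "project B", 65.0)
print(repo.retrieve("review_coverage"))
```

Rejecting values without a definition reflects the requirement that the repository also holds the information needed to interpret each measure.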

3 The Statistical Process Control (SPC)
3.1 Foundations of the Statistical Process Control
This section gives a short description of Statistical Process Control (SPC) based on [Kulpa 2003]. SPC is
often the most dreaded of all subjects when discussing process improvement, because it involves numbers and
then scrutinizing those numbers to determine whether they are correctly collected, reported, and used
throughout the organization. Here we summarize the best practices for collecting metrics that can be
found in other organizations. We describe the different types of charts and discuss reasons for using the
charts and reasons for collecting data.

SPC consists of a set of techniques used to help individuals understand, analyze, and interpret numerical
information. SPC is used to identify and track variation in processes. All processes will have some natural
variation. Due to the normal variation in any process, the numbers (for example, the number of cars waiting
at a stoplight or the number of accidents that occur) can change when the process really has not. So, we
need to understand both the numbers relating to our processes and the changes that occur in our processes so that
we may respond appropriately.

Other terms that you may see are common causes of variation and special causes of variation, as well as common
cause systems and special cause systems. Common causes of variation result from such things as system design
decisions and the use of one development tool over another. This variation will occur predictably across the
entire process associated with it and is considered normal variation. Special causes of variation are those that
arise from such things as inconsistent process execution and lack of resources. This variation is exceptional
variation and is also known as assignable causes of variation. We will use both terms. Other terms you will hear
are in control for predictable processes or steady-state; and out of control for unpredictable processes that are
“outside the natural limits.”

When a process is predictable, it exhibits routine variation as a result of common causes. When a process is
unpredictable, it exhibits exceptional variation as a result of assignable causes. It is our job to be able to tell the
difference and to find the assignable cause. When a process is predictable, it is performing as consistently as it
can (either for better or for worse). It will not be performing perfectly; there will always be some normal, routine
variation. Looking for assignable causes for processes that are running predictably is a waste of time because
you will not find any. Work instead on improving the process itself. When a process is unpredictable, that means
it is not operating consistently. It is a waste of time to try to improve the process itself. In this case, you must
find out why it is not operating predictably and detail the “whys” as specifically as possible. To do that, you
must find and fix the assignable cause(s); that is, the activity that is causing the process to behave erratically.

In contrast to the predictability of a process, we may want to consider if a process is capable of delivering what
is needed by the customer. Capable processes perform within the specification limits set by the customer. So, a
process may be predictable, but not capable.
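
The distinction between a predictable and a capable process can be sketched in a few lines. The sketch assumes that natural process limits are estimated as the mean plus or minus three sample standard deviations of a baseline period, which is a simplification (control charts usually estimate the spread differently). All function names and the data are illustrative.

```python
from statistics import mean, stdev

def natural_limits(baseline):
    """Natural process limits estimated from baseline data (mean +/- 3 standard deviations)."""
    centre, spread = mean(baseline), 3 * stdev(baseline)
    return centre - spread, centre + spread

def is_predictable(observations, limits):
    """A process is treated as predictable if all observations lie within the natural limits."""
    lower, upper = limits
    return all(lower <= x <= upper for x in observations)

def is_capable(limits, spec_lower, spec_upper):
    """A process is treated as capable if its natural limits lie within the customer's specification."""
    lower, upper = limits
    return spec_lower <= lower and upper <= spec_upper

# Hypothetical review effort per work product (hours).
baseline = [10, 12, 11, 9, 10, 11, 12, 10, 9, 11]
limits = natural_limits(baseline)
print(is_predictable([10, 13, 8], limits))           # new observations within natural limits
print(is_capable(limits, spec_lower=8, spec_upper=13))  # limits are wider than this specification
```

With these data the process is predictable, but it is not capable with respect to the narrow specification of 8 to 13 hours.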

Usually, there are seven commonly recognized tools or diagrams for statistical process control:

1. Check sheet
2. Run chart
3. Histogram
4. Pareto chart
5. Scatter diagram/chart
6. Cause and effect or fishbone diagram
7. Control chart

Some basic examples are shown in the following. They are cited from [Kulpa 2003] only to illustrate
the general characteristics.

Check Sheet: The check sheet (see Table 1) is used for counting and accumulating data in a general or
special context.

Table 1: Check sheet Used for Counting and Accumulating Data

Run Chart: The run chart (see Figure 8) tracks trends over a period of time. Points are tracked in the order
in which they occur. Each point represents an observation. You can often see interesting trends in the data
by simply plotting data on a run chart. A danger in using run charts is that you might overreact to normal
variations, but it is often useful to put your data on a run chart to get a feel for process behaviour.

Figure 8: Example of a run chart

Histogram: The histogram (see Figure 9) is a bar chart that presents data that have been collected over a
period of time, and graphically presents these data by frequency. Each bar represents the number of
observations that fit within the indicated range. Histograms are useful because they can be used to see the
amount of variation in a process. The data in this histogram are the same data as in the run chart in Figure
8. Using the histogram, you get a different perspective on the data. You see how often similar values occur
and get a quick idea of how the data are distributed.

Figure 9: A simple example of a histogram
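
The counting behind a histogram can be sketched as follows, assuming equal-width bins identified by their lower bound; the bin width and the sample values are invented for illustration.

```python
from collections import Counter

def histogram(values, bin_width):
    """Count observations per bin; each bin is labelled by its lower bound."""
    counts = Counter((value // bin_width) * bin_width for value in values)
    return dict(sorted(counts.items()))

# Hypothetical numbers of defects found per peer review.
print(histogram([3, 7, 12, 14, 18, 21, 22, 5], bin_width=5))
# {0: 1, 5: 2, 10: 2, 15: 1, 20: 2}
```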

Pareto Chart: The Pareto chart (see Figure 10) is a bar chart that presents data prioritized in some
fashion, usually either by descending or ascending order of importance. Pareto diagrams are used to show
attribute data. Attributes are qualitative data that can be counted for recording and analysis; for example,
counting the number of each type of defect. Pareto charts are often used to analyze the most often occurring
type of something.

Figure 10: An example of a pareto chart
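
The ordering used in a Pareto chart can be sketched as follows: categories are sorted by descending count and a cumulative percentage is added. The defect categories and counts are invented sample data.

```python
def pareto(counts):
    """Return (category, count, cumulative percentage) rows in descending order of count."""
    total = sum(counts.values())
    rows, cumulative = [], 0
    for category, count in sorted(counts.items(), key=lambda item: item[1], reverse=True):
        cumulative += count
        rows.append((category, count, 100.0 * cumulative / total))
    return rows

# Hypothetical defect counts by type.
for row in pareto({"logic": 30, "interface": 50, "documentation": 20}):
    print(row)
```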

Scatter Diagram/Chart: The scatter diagram (see Figure 11) is a diagram that plots data points,
allowing trends to be observed between one variable and another. The scatter diagram is used to test for
possible cause-and-effect relationships. A danger is that a scatter diagram does not prove the cause-and-
effect relationship and can be misused. A common error in statistical analysis is seeing a relationship and
concluding cause-and-effect without additional analysis.

Figure 11: An example of a scatter diagram/chart
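
The strength of the linear relationship visible in a scatter diagram is often summarized by the Pearson correlation coefficient. The sketch below computes it directly from its definition; as stated above, a high value indicates association only and does not prove cause and effect. The sample data are invented.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data: module size (KLOC) versus number of defects found.
size = [1, 2, 3, 4]
defects = [2, 4, 6, 8]
print(pearson_r(size, defects))
```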

Cause-and-Effect/Fishbone Diagram: The cause-and-effect/fishbone diagram (see Figure 12) is a graphical
display of problems and causes. This is a good way to capture team input from a brainstorming meeting, from a
set of defect data, or from a check sheet.

Figure 12: A cause and effect/fishbone diagram example

Control Chart: The control chart (see Figure 13) is basically a run chart with upper and lower limits that
allows an organization to track process performance variation. Control charts are also called process
behavior charts.

Figure 13: Example of a control chart
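
One common way to obtain the upper and lower limits for individual values is the individuals and moving range (XmR) chart, where the limits are the mean plus or minus 2.66 times the average moving range. The sketch below uses this convention as an assumption; it is not necessarily the chart type shown in Figure 13, and the data are invented.

```python
def xmr_limits(values):
    """Return (lower limit, centre line, upper limit) of an individuals (XmR) chart."""
    centre = sum(values) / len(values)
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    mr_bar = sum(moving_ranges) / len(moving_ranges)
    # 2.66 is the standard scaling constant for moving ranges of two consecutive points.
    return centre - 2.66 * mr_bar, centre, centre + 2.66 * mr_bar

def out_of_control_points(values):
    """Indices of observations outside the control limits (possible assignable causes)."""
    lower, _, upper = xmr_limits(values)
    return [i for i, value in enumerate(values) if value < lower or value > upper]

# Hypothetical review effort per work product; the seventh observation is unusual.
effort = [10, 12, 11, 9, 10, 11, 30, 10, 9, 11]
print(out_of_control_points(effort))  # [6]
```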

These seven graphical displays can be used together or separately to help gather data, accumulate data,
and present the data for different functions associated with SPC.

The following seven questions are a starting point for reviewing the data for your charts [Kulpa 2003]:

1. Who collected these data? (Hopefully the same people who are trained in proper data
collection techniques.)
2. How were the data collected? (Hopefully by automated means and at the same part of the
3. When were the data collected? (Hopefully all at the same time on the same day or at the same
time in the process - very important for accounting data dealing with month-end or year-end
closings. )
4. What do the values presented mean? (Have you changed the process recently? Do these values
really tell me what I want or need to know?)
5. How were these values computed from raw inputs? (Have you computed the data to arrive at
the results you want, or to accurately depict the true voice of the process?)
6. What formulas were used? (Are they measuring what we need to measure? Are they working?
Are they still relevant?)

... and the most important question of all:

7. Are we collecting the right data, and are we collecting the data right? (The data collected
should be consistent, and the way data are collected should also be consistent. Do the data
contain the correct information for analysis? In our peer review example, this information
would be size, complexity, and programming language.)

Control charts are used to identify process variation over time. All processes vary. The degree of variance,
and the causes of the variance, can be determined using control charting techniques. While there are many
types of control charts, the ones we have seen the most often are the [Kulpa 2003]:

c-chart: This chart uses a constant sample size of attribute data, where the average sample size is
greater than five. It is used to chart the number of defects (such as “12” or “15” defects per
thousand lines of code). c stands for the number of nonconformities within a constant sample
size.

u-chart: This chart uses a variable sample size of attribute data. This chart is used to chart the
number of defects in a sample or set of samples (such as “20 out of 50” design flaws were a
result of requirements errors). u stands for the number of nonconformities with varying
sample sizes.

np-chart: This chart uses a constant sample size of attribute data, usually greater than or equal to 50.
This chart is used to chart the number defective in a group. For example, a hardware
component might be considered defective, regardless of the total number of defects in it. np
stands for the number defective.

p-chart: This chart uses a variable sample size of attribute data, usually greater than or equal to 50.
This chart is used to chart the fraction defective found in a group. p stands for the proportion
defective.

X and mR charts: These charts use variable data where the sample size is one.

X-bar and R charts: These charts use variable data where the sample size is small. They can also be
based on a large sample size greater than or equal to ten. X-bar stands for the average of the
data collected. R stands for the range (distribution) of the data collected.

X-bar and s charts: These charts use variable data where the sample size is large, usually greater
than or equal to ten.

So, as you can see, you can sometimes use several of the charts, based on the type of data and on the size of
the sample - and the size of the sample may change. Control charts help detect and differentiate between
noise (normal variation of the process) and signals (exceptional variation that warrants further
investigation). Although others may disagree, we recommend that you use the Average Moving Range
(XmR) chart for most situations. There are automated tools that can support building and displaying these
charts. The task we need to undertake is to figure out how to tell the difference between noise and
signals. Properly generated control charts, specifically the XmR chart, can help us in this task. Risk data
(historical data) are critical for generating accurate control charts and for correct SPC analyses.
Table 2 shows the count for each month of the year 2002 and the mR values (moving range).

Table 2: Example of moving range for the calendar year 2002

We can then average the moving ranges in the following statistical manner (see Figure 14), where Cen
stands for center line, UCL for upper control limit, and LCL for lower control limit.

Figure 14: The 2002 moving R chart

We know that the values for the centerlines for each chart were computed by simply taking the average of the
values displayed (i.e., by adding up the values for each month and then dividing by the number of months/values
to compute the average). How were the upper and lower limits calculated for the charts shown above? We can
calculate the limits for both the X (Individual Values) chart and the Average Moving Range (mR) chart as
follows:

For the mR (moving range) chart. The upper range (or upper control limit, or upper natural limit) is
computed by multiplying the average moving range (the centerline of the mR chart) by 3.27.

For the X chart (individual values chart). The upper range for the X chart is computed by
multiplying the average moving range of the associated mR chart by 2.66 and then adding the value for the
centerline of the X chart. The lower range for the X chart is computed by multiplying the average
moving range by 2.66 and then subtracting the product from the centerline of the X chart.
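
The limit calculation for the XmR chart can be sketched as follows. This is a minimal illustration and not the tooling used for Table 2: the function name xmr_limits and the monthly counts are hypothetical, and the scaling constants 2.66 and 3.27 are the customary XmR values, stated here as an assumption about the constants intended above.

```python
def xmr_limits(values):
    """Centerlines and limits for an X chart and its moving range (mR) chart."""
    if len(values) < 2:
        raise ValueError("need at least two values to form a moving range")
    # Moving range: absolute difference between consecutive values
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    x_bar = sum(values) / len(values)
    mr_bar = sum(moving_ranges) / len(moving_ranges)
    return {
        "x_center": x_bar,
        "x_ucl": x_bar + 2.66 * mr_bar,  # assumed customary constant
        "x_lcl": x_bar - 2.66 * mr_bar,
        "mr_center": mr_bar,
        "mr_ucl": 3.27 * mr_bar,         # assumed customary constant
        "moving_ranges": moving_ranges,
    }


# Hypothetical monthly counts, for illustration only
monthly_counts = [10, 12, 11, 13, 12, 14]
limits = xmr_limits(monthly_counts)
print(limits)
```

Note that the limits of the X chart are derived from the centerline of the mR chart, which is why both charts have to be computed together.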

Notice that values for both representations (individual values and average moving range values) must be
gathered and computed. The upper and lower limits for the individual values chart (X chart) depend on the
average variations calculated for the centerline of the average moving range chart. Therefore, these charts are
interdependent and can be used to show relationships between the two types of charts and the two types of data.

We have also seen the limits for the XmR charts calculated using median ranges instead of average ranges. The
median moving range is often more sensitive to assignable causes when the values used contain some very high
range values that inflate the average. Remember that the median range is the range value in
the middle of a list sequenced in ascending or descending order; thus, the median range chart will
automatically “throw out” the very high- or low-end values. Use of the median moving range approach is valid;
however, the formulas (constants) change.
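
A sketch of the median-based variant is given below. The constants 3.145 and 3.865 are the values commonly quoted for the median moving range and are an assumption here, since the text above does not state them; the function name and the data are hypothetical.

```python
import statistics


def xmr_limits_median(values):
    """X-chart limits and mR upper limit based on the median moving range."""
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    x_bar = statistics.mean(values)
    median_mr = statistics.median(moving_ranges)
    x_lcl = x_bar - 3.145 * median_mr  # assumed constant for the median mR
    x_ucl = x_bar + 3.145 * median_mr
    mr_ucl = 3.865 * median_mr         # assumed constant for the median mR
    return x_lcl, x_ucl, mr_ucl


# Hypothetical series whose last value inflates the average moving range
series = [10, 12, 11, 13, 12, 30]
x_lcl, x_ucl, mr_ucl = xmr_limits_median(series)
print(x_lcl, x_ucl, mr_ucl)
```

In this series the single jump to 30 produces a moving range of 18, which raises the average moving range to 4.8 while the median moving range stays at 2, so the median-based limits remain tight enough to flag the jump.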

The most obvious interpretation is when one or more data points fall outside your control limits (either upper or
lower). Those values should be investigated for assignable causes, and the assignable causes should be fixed. If
your control chart shows three out of four consecutive points hovering closer to the limits than to the centerline,
this pattern may signal a shift or trend, and should be investigated (because predictable processes generally show
85 to 90 percent of the data closer to the centerline than to the limits). Remember: useful limits can be
constructed with as few as five or six consecutive values. However, the more data used to compute the limits, the
greater the certainty of the results.

Another way to spot trends is to look at the data points along the centerline. If eight or more consecutive data
points are clustered on the same side of the centerline, a shift in the original baseline or performance of the
process has probably occurred, even without a data point falling outside the limits. This is a signal to be
investigated.
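
The three interpretation rules described above can be checked mechanically. The following sketch is one possible reading of them, with the assumption that "closer to the limits than to the centerline" means beyond the midpoint between centerline and limit on one side; the function name and the rule labels are hypothetical.

```python
def detect_signals(values, center, lcl, ucl, run_length=8):
    """Return (index, rule) pairs for the three interpretation rules."""
    signals = []
    # Rule 1: a point outside the control limits
    for i, x in enumerate(values):
        if x > ucl or x < lcl:
            signals.append((i, "outside limits"))
    # Rule 2: three of four consecutive points nearer a limit than the centerline
    upper_mid = (center + ucl) / 2
    lower_mid = (center + lcl) / 2
    for start in range(len(values) - 3):
        window = values[start:start + 4]
        near_upper = sum(x > upper_mid for x in window)
        near_lower = sum(x < lower_mid for x in window)
        if near_upper >= 3 or near_lower >= 3:
            signals.append((start + 3, "3 of 4 near a limit"))
    # Rule 3: run_length or more consecutive points on one side of the centerline
    run, side = 0, 0
    for i, x in enumerate(values):
        current = (x > center) - (x < center)  # +1 above, -1 below, 0 on the line
        if current != 0 and current == side:
            run += 1
        else:
            side = current
            run = 1 if current != 0 else 0
        if run >= run_length:
            signals.append((i, "run on one side of centerline"))
    return signals


print(detect_signals([11] * 8, center=10, lcl=4, ucl=16))
```

A point lying exactly on the centerline is treated here as breaking a run, which is a design choice and not something the text prescribes.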

c-chart appropriateness: While XmR charts are the most often applied in organizations, and are the most
appropriate charts to use most often, they are not infallible. Sometimes, an event will occur that “skews the
norm”; that is, a rare event way outside of the average has occurred. When this happens, a c-chart is better used.

A c-chart is used for rare events that are independent of each other. The formulas for c-charts are different from
those for XmR charts. First, calculate the average count of the rare occurrence over the total time period in which
the occurrence happened. That number becomes the centerline. The upper limit is calculated by adding three times
the square root of the average count to the average count. The lower limit is calculated by subtracting three times
the square root of the average count from the average count. Charting the number of times a rare event occurs is
pretty useless. However, charting the time periods between recurring rare events can be used to help predict
when another rare event will occur. To do this, count the number of times the rare event occurs (usually per day
per year) and determine the intervals between the rare events. Convert these numbers into the average moving
ranges and, voilà, you can build an XmR chart.
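
Both steps can be sketched briefly: the c-chart limits themselves and the conversion of event dates into intervals that can then be fed into an XmR chart. The function names and the data are hypothetical, and clamping the lower limit at zero is an assumption added because a count cannot be negative.

```python
import math


def c_chart_limits(counts):
    """Return (centerline, lower limit, upper limit) for a c-chart."""
    c_bar = sum(counts) / len(counts)
    spread = 3 * math.sqrt(c_bar)
    # Assumption: a negative lower limit is reported as zero
    return c_bar, max(0.0, c_bar - spread), c_bar + spread


def intervals_between(event_days):
    """Days elapsed between consecutive occurrences of a rare event."""
    return [later - earlier for earlier, later in zip(event_days, event_days[1:])]


print(c_chart_limits([16, 16]))
print(intervals_between([3, 40, 95, 130]))
```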

u-chart appropriateness: The u-chart is based on the assumption that your data are based on a count of
discrete events occurring within well-defined, finite regions/areas, and that these events are independent. The
u-chart assumes a Poisson process. You may want to consider a u-chart when dealing with defects (counts) within
a group of pages (region/area); for example, number of errors per page or the number of defects per 1000 lines of
code. The u-chart differs from the XmR chart in that the upper and lower control limits of the u-chart change
over time. The ū in the u-chart is the weighted average of the count (ū = ∑ count_j / ∑ size_j). The upper control
limit for sample j is calculated by adding three times the square root of ū divided by the sample size (size_j) to ū.
The lower control limit is calculated by subtracting three times the square root of ū divided by size_j from ū.
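
Because the u-chart limits depend on the size of each sample, they have to be computed per sample. The following sketch illustrates this; the function name and the numbers are hypothetical, and the lower limit is clamped at zero as an added assumption.

```python
import math


def u_chart_limits(counts, sizes):
    """Return u-bar and a list of (lower, upper) limits, one pair per sample."""
    u_bar = sum(counts) / sum(sizes)  # weighted average defect rate
    limits = []
    for size in sizes:
        spread = 3 * math.sqrt(u_bar / size)
        limits.append((max(0.0, u_bar - spread), u_bar + spread))
    return u_bar, limits


# Hypothetical defect counts and sample sizes (e.g. in thousand lines of code)
u_bar, per_sample = u_chart_limits(counts=[4, 16, 16], sizes=[1, 4, 4])
print(u_bar, per_sample)
```

The larger samples receive narrower limits, which is the sense in which the control limits of a u-chart change over time.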

3.2 Empirical Strategies

There are three different types of strategies: survey, case study and experiment ([Juristo 2003], [Kitchenham
1997]). These three strategies are examined in more detail in the following.

A survey is applied to subjects already in use (tools, etc.). The usual procedure for gathering information is
the use of questionnaires or interviews. These are applied to a representative sample group, and the outcomes
are then analysed. The aim is to derive conclusions that are descriptive, exploratory or explanatory. By means
of generalization, the result from the sample is mapped to the whole group. It is, however, not possible to
manipulate or control the samples. Nevertheless, it is practicable to compare the result with similar outcomes of
other surveys. Both qualitative and quantitative data can be derived from this strategy. Which one it is
depends on the data collected through the questionnaires or interviews and on whether statistical
analysis methods are applicable. A popular field for this kind of investigation is well known to most
people: social studies. An example is the public opinion poll taken before an election. Such surveys
try to show how people will vote on the actual day of the election.

Another helpful kind of survey method is the application of experience such as rules of thumb. Examples of
these rules of thumb are described in the following as laws and conjectures cited from [Endres 2004].

Process-related experiences:

Fagan’s law: “Inspections significantly increase productivity, quality, and project stability”. There are
three kinds of inspection: design, code, and test inspection. They are applicable in the development
of all information- or knowledge-intensive products. This form of inspection is widespread
throughout the industry today. Inspection also has a key role in the Capability Maturity Model
(CMM). The benefit of inspections can be summarized as follows: they “create awareness for
quality that is not achievable by any other method”.

Porter-Votta law: “Effectiveness of inspections is fairly independent of its organizational form”. A. Porter
and L. Votta investigated the inspection process introduced by Fagan and came up with the
following results: physical meetings are overestimated. They can be helpful while introducing the
inspection process to new people. Once education and experience are in place, they are not that
important anymore. Another finding was that adding more persons to the
inspection team does not increase the detection rate.

Hetzel-Myers law: “A combination of different Verification and Validation methods outperforms any
single method alone”. W. Hetzel and G. Myers claim that it is better to use all three methods in
combination to gain better results at the end. This is due to the fact that design, code and test
inspection are not competitors.

Mills-Jones hypothesis: “Quality entails productivity”. It is also known as “the optimist’s law” and can
be seen as a variation of P. Crosby’s proverb “quality is free”. It is a very intuitive hypothesis: on
the one hand, when the quality is high, less rework has to be done, which results in better
productivity. On the other hand, when quality is poor, more rework has to be done. Therefore
the productivity rate drops as well.

Mays’ hypothesis: “Error prevention is better than error removal”. No matter when an error is detected a
certain amount of rework has to be done (this amount increases the later it is detected). Therefore
it is better to prevent errors. To be able to do so, the circumstances of errors have to be
investigated, identified and then removed. It is still a hypothesis because it is extremely difficult to

Structured conclusions:

Basili-Rombach hypothesis: “Measurements require both goals and models”. Metrics and measurement
need goals and questions; otherwise they do not have a meaning. It is also preferable to use a
top-down approach when specifying the parameters. This leads to the Goal-Question-Metric (GQM)
approach.

Conjecture a: “Human-based methods can only be studied empirically”. The human-based methods
involve (human) judgement and depend on experience and motivation. This is why the results also
depend on these different factors. To be able to understand and control those factors empirical
studies are needed.

Conjecture b: “Learning is best accelerated by a combination of controlled experiments and case
studies”. Observing software development helps the developers to learn. The case studies supply
the project characteristics, (realistic) complexity, project pressure etc. The missing cause-and-effect
insights can be provided through controlled experiments.

Conjecture c: “Empirical results are transferable only if abstracted and packaged with context”. The
information that has been gained needs to be transformed into knowledge with the context borne in
mind. This can be achieved with the help of abstraction. It offers the opportunity to reuse the
results. When the results are abstracted and packaged only two questions remain to be answered:
“Do the results apply to this environment?” and “What are the risks of reusing these results?”

Another form of experience survey is the provision of models, such as the models for measuring software
reliability based on the failure rates and probabilistic characteristics of software systems [Singpurwalla 1999]:

• Jelinski-Moranda model: Jelinski and Moranda assume that the software contains an unknown number
of bugs, say N, and that each time the software fails, a bug is detected and corrected. The failure
rate Ti is proportional to N – i + 1, the number of bugs remaining in the code.

• Bayesian reliability growth model: This model is devoid of any consideration of the number of bugs, because
the relationship between the number of bugs and the frequency of failure is tenuous.

• Musa-Okumoto models: These models postulate a relationship between the intensity
function and the mean value function of a Poisson process; they have gained popularity with users.

• General order statistics models: These models are based on statistical order functions. The
motivation for ordering comes from many applications like hydrology, strength of materials and

• Concatenated failure rate model: These models introduce an infinite memory for storing the failure
rates, where the notion of infinite memory is akin to the notion of invertibility in time series analysis.
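
To make the first of these models concrete, the proportionality stated for the Jelinski-Moranda model can be written out as a short sketch. The proportionality constant phi, the function name and the numbers are hypothetical; the expected time to the next failure is taken as the reciprocal of the rate, which assumes exponentially distributed times between failures.

```python
def jelinski_moranda_rates(total_bugs, phi):
    """Failure rate before the i-th failure: phi * (N - i + 1), for i = 1..N."""
    return [phi * (total_bugs - i + 1) for i in range(1, total_bugs + 1)]


# Hypothetical values: N = 4 initial bugs, proportionality constant phi = 0.5
rates = jelinski_moranda_rates(total_bugs=4, phi=0.5)
expected_gaps = [1 / rate for rate in rates]  # assumes exponential waiting times
print(rates)
print(expected_gaps)
```

Each corrected bug lowers the failure rate by the same amount phi, so the expected time between failures grows as testing proceeds.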

A case study is used to monitor a project. Throughout the study data is collected. This data is then investigated
with statistical methods. The aim is to track variables or to establish relationships between different variables
that have a leading role or effect on the outcome of the study. With the help of this kind of strategy it is possible
to build a prediction model. The statistical analysis methods used for this kind of study include linear
regression and principal component analysis. A disadvantage of this kind of study is generalisation. Depending on
the kind of result, it can be very difficult to find a corresponding generalisation. This also influences the
interpretation and thus makes it more difficult. Like the survey, the case study can provide data for both
qualitative and quantitative research.
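
A prediction model of the kind mentioned above can be as simple as a least-squares line. The sketch below is only an illustration: the function name fit_line and the size and effort figures are hypothetical case-study data.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical observations: module size (KLOC) and effort (person-days)
sizes = [1, 2, 3, 4]
efforts = [3, 5, 7, 9]
slope, intercept = fit_line(sizes, efforts)
predicted_effort = slope * 5 + intercept  # prediction for a 5 KLOC module
print(slope, intercept, predicted_effort)
```

As noted above, such a model describes the observed project; whether it generalises to other projects has to be argued separately.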

Experiments are usually performed in an environment resembling a laboratory to ensure a high amount of
control while carrying out the experiment. The assignments of the different factors for the experiment are
allotted totally at random. More about this random assignment can be found in the following sections. The main
task of an experiment is to manipulate variables and to measure the effects they cause. This measurement data is
the basis for the statistical analysis that is performed afterwards. In the case that it is not possible to assign the
factors through random assignment, so-called quasi-experiments can be used instead of the experiments
described above.

Experiments are used, for instance, to confirm existing theories, to validate measures or to evaluate the accuracy
of models [Wohlin 2000]. Unlike surveys and case studies, experiments only provide data for a
quantitative study. The difference between case studies and experiments is that case studies have a more
observational character. They track specific attributes or establish relationships between attributes but do not
manipulate them. In other words, they observe the on-going project. The characteristic of an experiment, by
contrast, is that control is the main aspect and that the essential factors are not only identified but also manipulated.
There is also a difference between case studies and surveys. A case study is performed during the
execution of a project, whereas a survey looks at the project in retrospect. Although it is possible to perform a survey
before starting a project as a kind of prediction of the outcome, the experience used to do this is based on former
knowledge and hence on experiences gained in the past.

Carrying out experiments in the field of Software Engineering is different from other fields of application
[Juristo 2003]. In software engineering several aspects are rather difficult to establish. These are:

• Finding variable definitions that are accepted by everyone
• Proving that the measures are nominal or ordinal scale
• Validating indirect measures: models and direct measures have to be validated

To be able to carry out an experiment several steps have to be performed [Basili 1986]:

1. The definition of the experiment
2. The planning
3. Carrying out the experiment
4. Analysis and interpretation of the outcomes
5. Presentation of the results

We now take a more detailed look at the steps mentioned above. The experiment definition is the basis for the
whole experiment. It is crucial that this definition is performed with some caution. When the definition is not
well founded and interpreted, the whole effort may have been spent in vain; worse still,
the result of the experiment may not display what was intended. The definition sets up the objective of the
experiment. This can be done by following a framework. The GQM templates, for example, could supply such a
framework [Solingen 1999].

After finishing the definition, the planning step has to be performed. While the previous step answered the
question of why the experiment is performed, this step answers the question of how the experiment will be carried out.
Six different stages are needed to complete the planning phase [Wohlin 2000].

Context selection: The environment in which the experiment will be carried out is selected.

Hypothesis formulation and variable selection: Hypothesis testing is the main aspect for statistical
analysis when carrying out experiments. The goal is to reject the hypothesis with the help of the
data collected through the experiment. In the case that the hypothesis is rejected, it is
possible to draw conclusions from it. More details about hypothesis testing can be found in the
following sections. The selection of variables is a difficult task. Two kinds of variables have to be
identified: dependent and independent ones. This also includes the choice of scale type and range
of the different variables. The section above also contains more information about dependent and
independent variables.

Subject selection: It is performed through sampling methods. Different kinds of sampling can be found at
the end of this chapter. This step is the foundation for the later generalisation. Therefore the
selection chosen here has to be representative of the whole population. The act of sampling the
population can be performed in two ways, either probabilistic or non-probabilistic. The difference
between those two methods is that in the latter the probability of choosing a sample of the
selection is not known. Simple random sampling and systematic sampling, just to name two, are
probability-sampling techniques. Those and other methods can be found at the end of this chapter.
The size of the sample also has influence on the generalisation. A rule of thumb is that the larger
the sample is, the lower the error in generalising the results will be. There are some general
principles described in [Juristo 2003]:

If there is large variability in the population, a large sample size is needed.

The analysis of the data may influence the choice of the sample size. It is therefore needed
to consider how the data shall be analysed already at the design stage of the experiment.

Experiment design: The design tells how the tests are being organized and performed. An experiment is
so to speak a series of tests. A close relationship between the design and the statistical analysis
exists and they have effect on each other. The choices taken before (measurement scale, etc.) and a
closer look at the null-hypothesis help to find the appropriate statistical method to be able to reject
the hypothesis. The following sections provide a deeper view into the subject described shortly

Instrumentation: In this step the instruments needed for the experiment are developed. Three
different aspects have to be addressed: experiment objects (e.g. specification and code
documents), guidelines (e.g. process description and checklists) and measurement. Using
instrumentation does not affect the outcome of the experiment. It is only used to provide means for
performing and monitoring experiments [Wohlin 2000].

Validity evaluation: After the experiments are carried out the question arises how valid the results are.
Therefore it is necessary to think of possibilities to check the validity.

The following terms form an important vocabulary for software engineering experimentation:

Dependent & Independent variables: Variables that are being manipulated or controlled are called
independent variables. When variables are used to study the effects of the manipulation etc. they
are called dependent variables.
Factors: independent variables that are used to study the effect when manipulating them. All the other
independent variables remain unchanged.
Treatment: a specific value of a factor is called a treatment.
Object & Subject: an example of an object is a review of a document. A subject is the person carrying
out the review. Both can be independent variables.
Test (sometimes referred to as Trial): an experiment is built up using several tests. Each single test is
structured in treatment, objects and subjects. However, these tests should not be mixed up with
statistical tests.
Experimental error: gives an indication of how much confidence can be put in the experiment. It is
affected by how many tests have been carried out.
Validity: there are four kinds of validity: internal validity (validity within the environment and reliability
of the results), external validity (how general the findings are), construct validity (how well the
treatment reflects the cause construct) and conclusion validity (relationship between treatment and
outcome).
Randomisation: the analysis of the data has to be done from independent random variables. It can also be
used to select subjects out of the population and to average out effects.
Blocking: is used to eliminate effects that are not desired.
Balancing: when each treatment has the same number of subjects, the design is called balanced.

Software engineering experimentation could be supported by the following sampling methods [Wohlin 2000]:

Simple random sampling: the subjects that are selected are randomly chosen out of a list of the
population.

Systematic sampling: only the first subject is selected randomly out of the list of the population. After that
every n-th subject is chosen.

Stratified random sampling: first the population is divided into different strata, also referred to as groups,
with a known distribution between the different strata. Second the random sampling is applied to
every stratum.

Convenience sampling: the nearest and most convenient subjects are selected.

Quota sampling: various elements of the population are desired. Therefore convenience sampling is
applied to get every single subject.
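
The three probability-sampling methods listed above can be sketched as follows. The function names and the population are hypothetical. For systematic sampling the sketch draws the random start from the first n subjects, which is a common variant and an assumption about the intended procedure.

```python
import random


def simple_random_sample(population, k, rng):
    """Draw k subjects at random, without replacement."""
    return rng.sample(population, k)


def systematic_sample(population, step, rng):
    """Random start among the first `step` subjects, then every step-th subject."""
    start = rng.randrange(step)
    return population[start::step]


def stratified_sample(strata, k_per_stratum, rng):
    """Apply simple random sampling separately within every stratum."""
    return {name: rng.sample(members, k_per_stratum)
            for name, members in strata.items()}


rng = random.Random(1)  # fixed seed so that the illustration is repeatable
developers = list(range(20))  # hypothetical population of 20 subjects
simple = simple_random_sample(developers, 5, rng)
systematic = systematic_sample(developers, 4, rng)
stratified = stratified_sample(
    {"junior": developers[:12], "senior": developers[12:]}, 2, rng)
print(simple, systematic, stratified)
```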

Controlled Experiments: The advantage of this approach is that it promotes comparison and statistical analysis.
Controlled here means that the experiment follows the steps mentioned above ([Basili 1986], [Zelkowitz

1. Experiment definition: it should provide answers to the following questions [3]: “what is studied?”
(object of study), ”what is the intention?” (purpose), “which effect is studied?” (quality focus), “whose
view is represented?” (perspective) and “where is the study conducted?” (context).

2. Experiment planning: the null hypothesis and the alternative hypothesis are formulated. The details (personnel,
environment, measuring scale, etc.) are determined and the dependent and independent variables are
chosen. First thoughts are given to the validity of the results.

3. Experiment realization: the experiment is carried out according to the baselines established in the
design and planning step. The data is collected and validated.

4. Experiment analysis: the data collection gathered during the realization is the basis for this step. First
descriptive statistics are applied to gain an understanding of the submitted data. The data is informally
interpreted. Now the decision has to be made how the data can be reduced. After the reduction the
hypothesis test is performed. More about hypothesis testing can be found in the following sections.

5. Portrayal of the results and conclusion about the hypothesis: the analysis provides the information
that is needed to decide whether the hypothesis was rejected or accepted. These conclusions are
collected and documented. The resulting document comprises the lessons learned.

Experimental design types: The quality of the design decides whether the study is a success or a failure. So it is
very important to meticulously design the experiment [Juristo 2003]. Several principles of how to design an
experiment are known. Those are randomisation, blocking and balancing. In general a combination of the three
methods is applied. The experimental design can be divided into several standard design types. The difference
between them is that they have distinct factors and treatment. The first group relies on one factor, the second on
two and the third group on more than two factors. The following paragraphs will show some detail about the
different design types.

• One-factor design with two treatments:

Field of use: comparison.
Example: comparing two different analysis techniques using several projects
Assignment: techniques are assigned totally at random; the same objects are used for both
Analysis methods: t-test, Mann-Whitney
Benefit: simple experiment

Project Technique 1 Technique 2

1 ☺
2 ☺
3 ☺
4 ☺
5 ☺
6 ☺

Table 3: Assigning analysis techniques to projects
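
The t-test named as an analysis method for this design compares the means of the two treatment groups. The following sketch computes only the test statistic with pooled variance; the function name and the measurements are hypothetical, and the comparison with a critical value of the t distribution is left out.

```python
import math
import statistics


def two_sample_t(sample_a, sample_b):
    """Pooled-variance t statistic and degrees of freedom for two independent samples."""
    n_a, n_b = len(sample_a), len(sample_b)
    var_a = statistics.variance(sample_a)
    var_b = statistics.variance(sample_b)
    pooled = ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    difference = statistics.mean(sample_a) - statistics.mean(sample_b)
    t = difference / math.sqrt(pooled * (1 / n_a + 1 / n_b))
    return t, n_a + n_b - 2


# Hypothetical defects found per project with each analysis technique
technique_1 = [1, 2, 3]
technique_2 = [2, 3, 4]
t_value, dof = two_sample_t(technique_1, technique_2)
print(t_value, dof)
```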

• Paired comparison design (extends the design mentioned above)
Field of use: comparison of two different analysis techniques (two treatments)
Example: comparing two different analysis techniques
Assignment: the subjects are applied to both treatments on the same object; the assignment is
performed randomly
Analysis methods: Paired t-Test, Sign test, Wilcoxon
Benefit: improves the precision of the experiment.

Subjects Treatment 1 Treatment 2

1 2 1
2 1 2
3 2 1
4 2 1
5 1 2
6 1 2

Table 4: Assigning the treatments to pairs
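
For the paired comparison design the analysis works on the differences within each pair. A minimal sketch of the paired t statistic is given below; the function name and the measurements are hypothetical.

```python
import math
import statistics


def paired_t(first, second):
    """Paired t statistic and degrees of freedom for two measurements per subject."""
    diffs = [b - a for a, b in zip(first, second)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    std_diff = statistics.stdev(diffs)
    return mean_diff / (std_diff / math.sqrt(n)), n - 1


# Hypothetical results of four subjects under both treatments
treatment_1 = [5, 7, 6, 8]
treatment_2 = [6, 9, 7, 10]
t_paired, dof_paired = paired_t(treatment_1, treatment_2)
print(t_paired, dof_paired)
```

Because every subject serves as its own control, the variation between subjects drops out of the differences, which is the gain in precision named as the benefit of this design.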

• One factor design with more than two treatments:
Field of use: comparison of all treatments
Example: comparing different programming languages regarding their quality while using them
Assignment: subjects are randomly assigned; one object to all treatments
Analysis methods: analysis of variance (ANOVA), Kruskal-Wallis

Subject Treatment 1 Treatment 2 … Treatment n

1 ☺
2 ☺
3 ☺
4 ☺

Table 5: Assigning the n-treatments to the subjects

• Randomised complete block design:

Field of use: comparison of all treatments with high variability among the subjects. More than two
Example: same as in the design mentioned above
Assignment: each subject uses all treatments; the order is assigned randomly; restriction of
randomisation because of the blocks
Analysis methods: ANOVA, Kruskal-Wallis
Benefit: minimizing the effect of variability. One of the most used designs in experimentation.
Subjects form a more homogenous unit.

Subject Treatment 1 Treatment 2 Treatment 3

1 3 2 1
2 1 2 3
3 1 3 2
4 2 1 3

Table 6: Assigning the subjects to the different treatments using randomized complete block design
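
An assignment like the one in Table 6 can be generated by shuffling the treatment order separately for each subject. The sketch below is an illustration only; the function name and the labels are hypothetical.

```python
import random


def randomized_block_orders(subjects, treatments, rng):
    """Give every subject (block) all treatments in an independently shuffled order."""
    orders = {}
    for subject in subjects:
        order = list(treatments)
        rng.shuffle(order)
        orders[subject] = order
    return orders


orders = randomized_block_orders(
    ["S1", "S2", "S3", "S4"], ["T1", "T2", "T3"], random.Random(7))
for subject, order in orders.items():
    print(subject, order)
```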

• Two factor design:

This design is used when more complex experimentation arrangements are needed. There are now
three hypotheses: one for the effect of the first factor, one for the second factor and one for the
interaction between the two factors. The following paragraphs will depict different two-factor designs.

• 2*2 factorial design:
Example: investigating the understandability of design documents using two
different designs, i.e. structured versus object-oriented design; two treatments per factor
Assignment: randomly assign subjects to combination of the two treatments
Analysis methods: ANOVA

                          Factor 2         Factor 2
                          Treatment 2_1    Treatment 2_2
Factor 1 Treatment 1_1    C, F             B, E
Factor 1 Treatment 1_2    A, H             D, G

Table 7: Possible portrayal of a 2*2 factorial design (Available Subjects: A, B, C, D, E, F, G, H)
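
A balanced random assignment of the eight subjects to the four treatment combinations, as portrayed in Table 7, can be sketched as follows. The function name and the treatment labels are hypothetical.

```python
import itertools
import random


def factorial_assignment(subjects, factor_levels, rng):
    """Shuffle the subjects and deal them evenly over all treatment combinations."""
    cells = list(itertools.product(*factor_levels))
    shuffled = list(subjects)
    rng.shuffle(shuffled)
    assignment = {cell: [] for cell in cells}
    for index, subject in enumerate(shuffled):
        assignment[cells[index % len(cells)]].append(subject)
    return assignment


assignment = factorial_assignment(
    list("ABCDEFGH"), [("1_1", "1_2"), ("2_1", "2_2")], random.Random(3))
for cell, members in assignment.items():
    print(cell, members)
```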

• Two-stage nested design:

Field of use: one factor is similar to another factor for different treatments (two or
more treatments)
Example: efficiency of unit testing using two different designs i.e. functional
programming versus object-oriented programming
Assignment: one of the two factors is nested to the other; the subjects are randomly
Analysis method: ANOVA

Factor and treatment combination Subjects

Factor 1 Treatment 1_1 Factor 2 Treatment A, H
Factor 1 Treatment 1_1 Factor 2 Treatment C, F
Factor 1 Treatment 1_2 Factor 2 Treatment B, E
Factor 1 Treatment 1_2 Factor 2 Treatment D, G

Table 8: Two-stage nested design with two treatments per factor (Available Subjects:
A, B, C, D, E, F, G, H)

• More than two factors designs:

Some experimentation arrangements depend on more than two factors. These kinds of designs are
also called factorial design because the dependent variables also depend on interaction between the
n-factors. Known factorial designs with two treatments are: 2^k factorial design, 2^k fractional
factorial design, one-half fractional factorial design of the 2^k factorial design and one-quarter
fractional factorial design of the 2^k factorial design.
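
The runs of a 2^k factorial design and of its one-half fraction can be enumerated mechanically. In the sketch below the two treatments of each factor are coded as -1 and +1, and the half fraction keeps the runs whose coded levels multiply to +1; this coding and the choice of defining relation are assumptions made for illustration, and the function names are hypothetical.

```python
import itertools
import math


def full_factorial(k):
    """All 2^k combinations of k two-level factors, coded as -1 and +1."""
    return list(itertools.product((-1, 1), repeat=k))


def half_fraction(k):
    """One-half fraction: runs whose coded levels multiply to +1."""
    return [run for run in full_factorial(k) if math.prod(run) == 1]


print(len(full_factorial(3)))
print(half_fraction(3))
```

For k = 3 the full design needs eight runs, while the half fraction needs four, at the price of confounding some effects with each other.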

3.3 Testing Methods

A listing of the statistical testing methods needed for the different design types is given in alphabetical order in
the following. More details about them can be found in [Juristo 2003]:

• ANOVA: This test is an ANalysis Of Variance between groups of artefacts.

• Binomial test: This test analyses the differences between dichotomous variables.

• Chi2: This type of test is used when frequencies are involved. This means that the data has the form of

• F-test: The F-test compares the variances of two (independent) samples.

• Kruskal-Wallis: In this case one-way analysis of variance by ranks is performed.

• Mann-Whitney: When the assumptions made in the t-test are uncertain, it is possible to use the Mann-Whitney
test instead. Similar to the Wilcoxon test, this method is based on ranks.

• Paired t-test: This method compares two samples, gained through repeated measures.

• Sign test: It depends on the sign of the difference of the values of the examined pairs.

• t-test: This test compares two (independent) samples.

• Wilcoxon: For this method it is important that it is possible to determine the greater value of the
examined pair and that the difference can be ranked because the ranks are the basis of the Wilcoxon

Parametric and Non-parametric testing: We start with the parametric tests. Their main characteristic is that
the analysed model is assumed to follow a specific distribution. Usually the assumption is made that some
parameters are normally distributed. The parameters must be measurable on at least an interval scale;
normality can be tested with the Chi2 test.

The main characteristic of the non-parametric tests is that only very general assumptions are made, more
general than those of the parametric tests. Where they are available, they can be used instead of parametric
tests, but not vice versa.

The decision which one of the two mentioned approaches is best suited can be based on two factors. These are
Applicability (what are the assumptions made? The assumptions must be realistic!) and Power (parametric tests
have, in general, higher power than the non-parametric test). The relation between experimental design types,
test methods and parametric, non-parametric tests is shown in the following Table 9 [Juristo 2003].

Design Type                                          Parametric        Non-parametric

One factor, one treatment                            -                 Binomial test, Chi2
One factor, two treatments, completely randomised    t-test, F-test    Mann-Whitney, Chi2
One factor, two treatments, paired comparison        Paired t-test     Wilcoxon, Sign test
One factor, more than two treatments                 ANOVA             Kruskal-Wallis, Chi2
More than two factors                                ANOVA             -

Table 9: Overview of parametric and non-parametric test methods

Hypothesis Testing: One way to evaluate whether a presumption is correct is to use hypothesis testing. The
result, when everything has been carried out correctly, helps us to draw conclusions about
whether the presumption that was used to formulate the tested hypothesis established some cause and effect

Hypothesis testing takes place in several steps that are repeated if needed. The first phase, induction, is
used to formulate the hypothesis to be tested, called the null hypothesis, together with an alternative
hypothesis that is adopted if the null hypothesis is rejected. A test can reject a true null hypothesis or
fail to reject a false one. Such an outcome is referred to as a risk. Two kinds of risk can be identified: the
Type I error (the null hypothesis is true but rejected) and the Type II error (the null hypothesis is false but
not rejected). When talking about these risks it is also necessary to talk about the power of a statistical
test. The power is the probability that the test reveals a true pattern, i.e. that it rejects the null
hypothesis when the null hypothesis is false; it equals one minus the probability of a Type II error. It is
therefore desirable to prefer a test with high power over one with lower power.
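
The two risks and the power can be written compactly. The symbols α and β are the customary textbook notation and are an addition here; they are not used elsewhere in this text.

```latex
\alpha = P(\text{reject } H_0 \mid H_0 \text{ is true})            % Type I error
\qquad
\beta  = P(\text{do not reject } H_0 \mid H_0 \text{ is false})    % Type II error
\qquad
\text{power} = 1 - \beta
```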

The kinds of visualisation used for SPC have been described above. We now briefly give some further
characteristics. A graphical visualisation is an illustrative way of presenting information about different
aspects of the data. In the following passages several visualisation methods are described.

Scatter Plot:

Input       Paired samples (xi, yi)
Portrayal   Two-dimensional grid
Used for    Assessing dependencies between variables
            Tendency of linear relation
            Identification of outliers
            Observation of correlation

Box Plot:

Input       Percentiles
Portrayal   Box plot constructed from different percentiles
Used for    Visualisation of dispersion and skewness


Histogram:

Input       Frequency (or relative frequency) of a value or interval of values
Portrayal   Bars with different heights
Used for    Overview of distribution density
            Indicator for normal distribution

Cumulative Histogram:

Input       Variables with corresponding samples
Portrayal   Bars containing the cumulative sum of frequencies up to the current class of values
Used for    Probability distribution function of the samples from one

Pie Chart:

Input       Data values, divided into a specific number of distinct classes
Portrayal   Segments in a circle. Angles proportional to the relative
Used for    Relative frequency of the data values

In the following we describe an example of a controlled experiment investigating performance when using the
Personal Software Process (PSP) [Wohlin 2000].

First step: Definition

• Object of study: participants in the PSP course, their ability considering performance with respect to
background and experience;

• Purpose: evaluate the individual performance with respect to the individual background;

• Perspective: point of view of researchers and teachers; They would like to know if there are differences
between the participants in the course having different backgrounds;

• Quality focus: Productivity in terms of KLOC (thousands of lines of code) / development time and defect
density in terms of faults / KLOC;

• Context: experiment is run within the PSP;

• Summary (of Definition): Analyse the outcome of the PSP for the purpose of evaluation with respect to
the background of the individuals from the point of view of the researchers and teachers in the context
of the PSP course.

Second step: Planning

• Context selection: PSP course at university; It addresses a real problem and is performed off-line
because it is not used for industrial software development. The programming language is C.

• Hypothesis selection:
Null hypothesis 1, H0: No difference in productivity between students from the Computer Science and
Engineering program (CSE) and the Electrical Engineering program (EE)
H0: Product(CSE) = Product(EE)
Alternative hypothesis 1, H1: Product(CSE) ≠ Product(EE)
Null hypothesis 2, H0: No difference between the students in faults / KLOC (based on prior
knowledge of C)
H0: # of faults / KLOC is independent of C experience
Alternative hypothesis 2, H1: # of faults / KLOC changes with C experience

• Measures: C experience, Faults / KLOC

• Data to be collected: study program (nominal scale), program size in lines of code (ratio scale),
development time in minutes (ratio scale), productivity (ratio scale), experience in C (ordinal scale; a
classification into four groups was used here), and faults / KLOC

• Variables selection:
o Independent variables: program and experience in C.
o Dependent variables: productivity and faults / KLOC

• Selection of subjects: chosen based on convenience. The subjects are samples from the two programs but
were not selected at random.

• Experiment design:
o Randomisation: subjects are not assigned at random. They all use the PSP and take part in all
of the assignments.
o Blocking: not applied
o Balancing: not applicable

• Standard design types:

o First design: one factor (program), two treatments (CSE, EE). A parametric test is chosen, in
this case the t-test, because the dependent variable is ratio scaled.
o Second design: one factor (experience in C), more than two treatments. Here four treatments
can be identified (4 different groups). The dependent variable is also measured on a ratio scale,
so that parametric testing can be applied, in this case the ANOVA test.

• Instrumentation: A survey carried out at the beginning of the course provides the needed data about
experience and background.

• Validity evaluation:
o Internal validity: provided through the number of tests within the course.
o External validity: highly probable that similar results are obtained when the course is run in a
similar way. It is rather difficult to generalize the results to students not taking the course.
However, it might be possible to generalise the outcome to other PSP courses, comparing the
background for example.
o Conclusion validity: not considered to be critical, because faked or incorrect data is
independent of the background.
o Construct validity: two major threats can be identified. First, are the measures appropriate?
Example: Is LOC / development time a good measure for productivity? Second, because it was a
graded course, the students might bias their data. To counter this, it was stated at the beginning of
the course that the grade did not depend solely on the actual data but rather on timely and proper
delivery and on the reports handed in.

Third step: Operation

• Preparation: The students primarily took a course they were not aware of exactly what was being

• Execution:
Execution time: 14 weeks
Number of assignments: 10
Number of participants: 65
At the end of the course, interviews were performed to evaluate the course and the PSP.

• Data validation: Of the 65 students, six were removed because their results were questionable or
invalid. This decision was based on the researchers' and teachers' personal impression of whether the data
delivered for the given assignments were representative. The remaining 59 students (32 CSE, 27 EE) were
used for the statistical analysis and interpretation.

Fourth step: Analysis and interpretation

Descriptive statistics: In Figure 15 the productivity of the two study programs is shown. It gives a hint
that the productivity of the EE students is not as high as the productivity of the CSE students.

Figure 15: Frequency distribution for the productivity (in classes)

As a second method, box plots are made (Figure 16). There it is visible that the EE group has one outlier,
which stays in the data and is considered an extreme value.

Figure 16: Box plot of productivity

The two figures already indicate that the productivity of the EE students is lower than that of the CSE
students. The hypothesis testing might reveal a difference between the two study programs. Let us move
on to the faults / KLOC. The table below shows the different parameters of the faults / KLOC. It can be
seen that the distribution is skewed towards the first group (little or no experience). That is why a box
plot for this group is made (see Figure 17).

Class   Number of   Median value of   Mean value of   Standard deviation
        students    faults/KLOC       faults/KLOC     of faults/KLOC
1       32          66.8              82.9            64.2
2       19          69.7              68.0            22.9
3       6           63.6              67.6            20.6
4       2           63                63.0            17.3

Table 10: Faults/ KLOC for the different experience groups

Figure 17: Box plot for faults/ KLOC for the first group

The descriptive statistics tell what can be expected from the hypothesis testing and where problems due
to outliers might appear.

Data reduction:
It was decided to remove the outliers, which changed the mean value and standard
deviation, as can be seen in Table 11.

Class   Number of   Median value of   Mean value of   Standard deviation
        students    faults/KLOC       faults/KLOC     of faults/KLOC
1       31          66                72.7            29.0

Table 11: Faults/ KLOC for group 1

Hypothesis testing: For the first null hypothesis the t-test was applied. The result can be seen in Table
12. The conclusion is that the hypothesis H0 is rejected: the difference in productivity between the students
from the two programs is significant. The actual reasons for this have to be evaluated further.

Factor        Mean diff.   Degrees of freedom (DF)   t-value   p-value
CSE vs. EE    6.1617       57                        3.283     0.0018

Table 12: t-test result
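
How the quantities in Table 12 arise can be sketched as follows. This is a minimal illustration of the pooled two-sample t-test, assuming equal variances in both groups; the productivity values are hypothetical and are not the data of the study, and the function name `pooled_t_test` is chosen for this example only.

```python
import math
import statistics


def pooled_t_test(sample_a, sample_b):
    """Return (mean difference, degrees of freedom, t-value) for two independent samples."""
    n_a, n_b = len(sample_a), len(sample_b)
    mean_diff = statistics.mean(sample_a) - statistics.mean(sample_b)
    df = n_a + n_b - 2
    # Pooled variance: the two sample variances weighted by their degrees of freedom.
    pooled_var = ((n_a - 1) * statistics.variance(sample_a)
                  + (n_b - 1) * statistics.variance(sample_b)) / df
    std_err = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return mean_diff, df, mean_diff / std_err


# Hypothetical productivity values (NOT the study data).
cse = [22.0, 25.0, 19.0, 30.0, 27.0]
ee = [15.0, 18.0, 21.0, 14.0, 17.0]

mean_diff, df, t_value = pooled_t_test(cse, ee)
print(f"mean diff = {mean_diff:.2f}, DF = {df}, t = {t_value:.3f}")

# With the group sizes of the study (32 CSE, 27 EE) the same rule gives DF = 57.
study_df = 32 + 27 - 2
```

The p-value is then read from the t-distribution with the computed degrees of freedom; that lookup is left out here because the Python standard library does not provide it.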

For the second null-hypothesis the ANOVA test was chosen. The result can be seen in Table 13.

Factor: C vs.   Degrees of      Sum of     Mean      F-value   p-value
Faults/KLOC     freedom (DF)    squares    square
Between         3               3483       1160.9    0.442     0.7236
Errors          55              144304     2623.7    -         -

Table 13: ANOVA test results
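
The columns of Table 13 are linked by simple arithmetic: each mean square is the sum of squares divided by its degrees of freedom, and the F-value is the ratio of the two mean squares. The short sketch below recomputes these values from the table; the function name `f_ratio` is an assumption made for this illustration. With four experience groups and 59 students, the degrees of freedom are 4 - 1 = 3 and 59 - 4 = 55.

```python
def f_ratio(ss_between, df_between, ss_error, df_error):
    """Return (mean square between, mean square error, F-value) of a one-way ANOVA."""
    ms_between = ss_between / df_between
    ms_error = ss_error / df_error
    return ms_between, ms_error, ms_between / ms_error


# Sums of squares and degrees of freedom as printed in Table 13.
ms_between, ms_error, f_value = f_ratio(3483, 3, 144304, 55)
print(f"MS between = {ms_between:.1f}, MS error = {ms_error:.1f}, F = {f_value:.3f}")
```

The recomputed mean square between groups differs from the tabulated 1160.9 in the last digit, presumably because the sum of squares in the table is rounded.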

The outcome was that there is no significant difference in faults / KLOC between the experience groups.
Groups 2, 3 and 4 were then merged to investigate the difference between the newly formed group and
group 1. A t-test was applied to look for differences between those two groups. No significant
results were obtained.

Fifth step: Summary and Conclusion

Two hypotheses were investigated: the influence of the study program on productivity, and the influence of
experience in C on faults per KLOC. The test of the first hypothesis showed that the CSE students were more
productive than the EE students. The test of the second hypothesis showed no significant influence of
experience in C on the number of faults per KLOC. Hence,

• When following the PSP it is better to use a well-known language so that the focus can be solely
on the PSP.
• It is also reasonable to claim that students with a computer science background have a higher
productivity than students with other disciplines as background. Further studies are still
necessary.

3.4 Methods of Data Analysis

In the following we will give some examples of statistical analysis in three kinds of domains [Pandian 2003]:

• Metrics data analysis in frequency domain,

• Metrics data analysis in time domain,

• Metrics data analysis in the relationship domain.

The following methods and examples are cited from [Pandian 2004] in order to achieve a consistent form of
statistical descriptions (see also [Juristo 2003] and [Wohlin 2000]).

Metrics data analysis in frequency domain:

All processes show variations that become evident when a frequency distribution of the process metric is
drawn. Understanding process variation, Deming observes, will lead to profound knowledge of the process.
A frequency distribution also indicates the probability of occurrence of events. Analysis of metrics
data in the frequency domain results in empirical distribution curves. The shape and structure of these
distribution curves represent a process signature. Analyses of distributions are usually based on several
well-known probability distributions. We have selected two distribution types that find practical use in
software projects: the normal distribution and the Rayleigh distribution. All empirical distributions are
referred to one of these two for interpretation.

Normal Distribution: The normal distribution is considered nature's template, the most common pattern of
process variation. A large number of project outcomes can be directly fitted to the ideal normal curve. For
example, effort variance in a family of software projects has been analyzed and found to have a mean value
of 10 percent and a standard deviation of 2 percent. The normal distribution is given by the following
equation, where μ is the mean and σ the standard deviation:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

The process variation illustrated here makes us view software projects from a statistical standpoint.

Bias: A Process Reality: Real-life process behaviour may exhibit a bias. Such distributions lack symmetry and
are skewed to one side. They also have a characteristic “tail”, representing occurrences that have
transgressed or strayed into unusual regions. The bias is characteristic of human systems that use intention
or will to choose among several tactical opportunities. The long tail, such as in the Rayleigh distribution,
bears witness to a fundamental but small propensity of nature to defy human design. This tail could be a
symbol of machine failure in mechanical processes or of estimation failure in project management. The tail of
the schedule variance distribution presented in Figure 18 shows how “best-made estimates” have failed.

Figure 18: Schedule variance bias

As a structure, the skewed Rayleigh distribution has been put to great use in software estimation by Putnam.
Software reliability models use this structure to represent defect leakage into the field over time.
The Rayleigh curve can be expressed as given in the following equation

$$m(t) = 2 K a t\, e^{-a t^2}$$

where m(t) is the manpower, K the total effort, a the constant (shape parameter), and t the time.
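
The behaviour of the Rayleigh curve can be explored with a few lines of code. The sketch below assumes the common Norden-Rayleigh form m(t) = 2·K·a·t·exp(-a·t²), which matches the symbols named above; the values of K and a are hypothetical. Under this form the manpower peaks at t = 1/sqrt(2a), and the cumulative effort approaches K.

```python
import math


def rayleigh_manpower(t, total_effort, a):
    """Manpower at time t, assuming m(t) = 2*K*a*t*exp(-a*t^2)."""
    return 2.0 * total_effort * a * t * math.exp(-a * t * t)


def cumulative_effort(t, total_effort, a):
    """Effort spent up to time t: the integral of m from 0 to t, K*(1 - exp(-a*t^2))."""
    return total_effort * (1.0 - math.exp(-a * t * t))


K = 100.0  # hypothetical total effort, e.g. person-months
a = 0.02   # hypothetical shape parameter

t_peak = 1.0 / math.sqrt(2.0 * a)  # time at which dm/dt = 0
peak_manpower = rayleigh_manpower(t_peak, K, a)
share_at_peak = cumulative_effort(t_peak, K, a) / K

print(f"peak at t = {t_peak:.2f}, manpower = {peak_manpower:.2f}")
print(f"share of total effort spent at the peak = {share_at_peak:.1%}")
```

The long tail after the peak is what makes the curve a candidate model for skewed process behaviour.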

Central Tendency of Processes: Central tendency in a skewed distribution, a more authentic representation of
real-life processes, is difficult to establish. Nevertheless, it is conventional to refer to three measures of
central tendency:

1. Mean
2. Median
3. Mode

The mean is the arithmetic average of all the observations. The median divides a series of data arranged in
the order of magnitude of their values so that an equal number of values lies on either side of it. The median
thus divides the distribution curve into two equal areas. The mode denotes the value that has the
highest frequency of occurrence in the dataset. If the distribution of the data is normal and not skewed, then
the mode, median, and mean are equal.

It is customary to take the mean value to indicate the central value of a metric. It is convenient to think so, and
many business models run on this simple assumption. But when the metrics data set contains outliers and
extreme values, median could be a better choice because it presents a balanced picture. Mode is considered for
setting process goals.
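
The effect of an extreme value on the three measures can be seen in a small example. The effort variance values below are hypothetical; the sketch only uses the `statistics` module of the Python standard library.

```python
import statistics

# Hypothetical effort variance values in percent.
effort_variance = [8, 9, 10, 10, 10, 11, 12]
with_outlier = effort_variance + [60]  # one extreme project added

summary = {}
for label, data in (("clean", effort_variance), ("with outlier", with_outlier)):
    summary[label] = (statistics.mean(data), statistics.median(data), statistics.mode(data))
    print(label, "mean/median/mode:", summary[label])
```

A single extreme value moves the mean from 10 to 16.25 percent, while median and mode stay at 10 percent, which is why the median gives the more balanced picture here.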

Process Spread: Process results wander away from the mean value. The degree of wandering, or spread, is
denoted by the standard deviation, sigma (σ), of process output values. Frequency distributions are the most
natural tools to study and analyze process spread. In Figure 19, three models for effort variance are plotted, all
with different standard deviations but a common central value of 10 percent. Process variations such as these
indicate trouble. The larger the variation, the larger is the uncertainty. It may be noticed that as the spread
increases, the number of “results on target” decreases. When the process deviations get closer to process
boundaries or tolerance limits, the process tends to become unreliable.

Bin   Sigma 2   Sigma 4   Sigma 7
1     0.00      0.02      0.07
2     0.00      0.04      0.09
3     0.00      0.06      0.10
4     0.01      0.10      0.12
5     0.03      0.14      0.13
6     0.08      0.18      0.15
7     0.19      0.23      0.16
8     0.36      0.26      0.16
9     0.53      0.29      0.17
10    0.60      0.30      0.17
11    0.53      0.29      0.17
12    0.36      0.26      0.16
13    0.19      0.23      0.16
14    0.08      0.18      0.15
15    0.03      0.14      0.13
16    0.01      0.10      0.12
17    0.00      0.06      0.10
18    0.00      0.04      0.09
19    0.00      0.02      0.07
20    0.00      0.01      0.06

Figure 19: Dispersion of effort variance: three models

Another example of process dispersion can be seen in how bug-fixing time (TTR, time to repair, in days) falls
into three service levels, corresponding to simple, medium, and complex types of bugs. Fixing each type of bug
is a process of its own, characterized by its central tendency and standard deviation. As illustrated in
Figure 20, the distinction between these processes blurs in some areas, and the maintenance project manager
needs to use this information while setting goals and limits for delivery schedules.

Figure 20: Three service models for bug fixing

Measures of Dispersion: Measures of dispersion describe how the observations in the dataset are spread out.
Important measures of dispersion are

• Range
• Variance
• Standard deviation

Range is the difference between the highest and lowest values in a dataset. Variance measures the fluctuation of
the observations around the mean. The larger the value of the variance, the greater the fluctuation. The standard
deviation, like the variance, also measures the variability of the observations around the mean. Standard
deviation is equal to the positive square root of variance. A standard deviation has the same units as the
observations, and thus is easier to interpret.

Descriptive Statistics: Before we draw any inferences from data (using inferential statistics), we need to do
descriptive statistical study. Hence, metric data can be first studied for its descriptive statistics, which includes
estimation of the following parameters:

• Mean
• Standard error (of the mean)
• Median
• Mode
• Standard deviation
• Variance
• Kurtosis
• Skewness
• Range
• Minimum
• Maximum
• Sum
• Count
• Largest (#)
• Smallest (#)

Note: Skew means lack of symmetry. The skew can be positive (skewed to the right, with a long right tail) or
negative (skewed to the left, with a long left tail). For a positively skewed distribution, the mean is
greater than the median because a few values are large compared to the others. If a distribution is negatively
skewed, the mean is less than the median. Kurtosis is a measure of the peakedness of the dataset. It is also
viewed as a measure of the "heaviness" of the tails of a distribution. A tool for calculating descriptive
statistics is available in Excel as part of the Analysis ToolPak add-in.
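
Outside a spreadsheet, most of the listed parameters can be computed in a few lines. The following is a minimal sketch with hypothetical data and a helper name (`describe`) chosen for this example. Skewness and kurtosis are computed here from the plain moments of the data; spreadsheet tools usually apply small-sample corrections, so their values can differ slightly.

```python
import math
import statistics


def describe(data):
    """Return a dictionary of basic descriptive statistics for a list of numbers."""
    n = len(data)
    mean = statistics.mean(data)
    sd = statistics.stdev(data)  # sample standard deviation
    # Central moments used for the moment-based skewness and excess kurtosis.
    m2 = sum((x - mean) ** 2 for x in data) / n
    m3 = sum((x - mean) ** 3 for x in data) / n
    m4 = sum((x - mean) ** 4 for x in data) / n
    return {
        "count": n,
        "sum": sum(data),
        "minimum": min(data),
        "maximum": max(data),
        "range": max(data) - min(data),
        "mean": mean,
        "median": statistics.median(data),
        "standard deviation": sd,
        "variance": statistics.variance(data),
        "standard error": sd / math.sqrt(n),
        "skewness": m3 / m2 ** 1.5,
        "kurtosis": m4 / m2 ** 2 - 3.0,  # excess kurtosis, 0 for a normal distribution
    }


# Hypothetical faults/KLOC values with one large value on the right.
faults_per_kloc = [40, 45, 50, 55, 60, 65, 70, 180]
stats = describe(faults_per_kloc)
for name, value in stats.items():
    print(f"{name}: {value}")
```

The single large value produces a positive skewness and a mean above the median, in line with the note above.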

Deriving Frequency Distribution from Data: There are three ways of visualizing frequency distribution,
ranging from mathematical to empirical. Each can be applied to a practical situation; each has its advantages.

Probability Density Function Curve: The first is to work from the mean and sigma to construct an ideal normal
distribution curve, applying the equation of the probability density function. One can use the spreadsheet
function NORMDIST and generate the graph by constructing an x,y table (and plotting an x,y chart) in
accordance with the relationship given in the following equation, where μ is the mean and σ the standard
deviation:

$$y = f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

This bell-shaped curve is a classical way of getting a feel for the process. Next we can draw a histogram and
study its shape. The bin intervals (or class intervals) are marked on the x-axis and the frequency on the
y-axis.

One can use a "tally" system to count the number of data points falling into each bin, or use the histogram
macro on the spreadsheet and get the tally as well as the chart. The histogram will present details that had
been ironed out in the normal curve.
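
Both steps, the x,y table of the ideal normal curve and the tally of data points per bin, can be sketched without a spreadsheet. The mean of 10 percent and sigma of 2 percent are taken from the effort variance example above; the observations, the bin width and the function names are assumptions made for this illustration.

```python
import math
from collections import Counter


def normal_pdf(x, mean, sigma):
    """Probability density of the normal distribution at x."""
    return math.exp(-((x - mean) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))


def tally(data, bin_width):
    """Count data points per bin; bin k covers the interval [k*width, (k+1)*width)."""
    return Counter(math.floor(value / bin_width) for value in data)


mean, sigma = 10.0, 2.0  # effort variance example: 10 percent mean, 2 percent sigma

# x,y table of the ideal curve, ready to be plotted as an x,y chart.
xy_table = [(x, normal_pdf(x, mean, sigma)) for x in range(0, 21)]

# Hypothetical observations of effort variance in percent.
observations = [7.5, 8.2, 9.1, 9.8, 10.0, 10.4, 10.9, 11.6, 12.3, 14.8]
counts = tally(observations, bin_width=2.0)

for bin_index in sorted(counts):
    low = bin_index * 2.0
    print(f"[{low:4.1f}, {low + 2.0:4.1f}): {'*' * counts[bin_index]}")
```

Comparing the tally with the ideal curve shows where the empirical distribution departs from the bell shape, for example through the isolated value near 15 percent.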

Empirical Distribution Curve: Finally, we can transform the histogram into a "curve" by constructing a smooth
line that passes through the tops of the histogram bars. Constructing such a curve, sometimes called the
frequency polynomial, is not an attempt to find a mathematical expression for an empirical reality; it is an
attempt to create a graphical pattern, as a model and a continuous representation of process behavior.

Frequency Scan: While arriving at empirical distribution curves, we stand to gain by doing alternative analysis
by varying the bin sizes. One such analysis is “scanning”, where we deliberately run a histogram on a large
number of bins, although the number of data points may not warrant a large number of bins. An example of
schedule variance analysis with 32 bins is depicted in Figure 21.

Figure 21: Frequency analysis with modified bins

The frequency diagram scans the entire process range, like a spectral scanner, and finds occurrences in the
right location on the metrics scale. Such an analysis highlights “bursts” of events, which stand far away in
the frequency domain from the primary process modes. In the background, the best-fit normal curve built from
the process mean and standard deviation is presented. It may be noted that the normal curve is very broad and
shallow, indicating a widely varying process. The standard deviation is about 2.5 times larger than the mean,
with the obvious consequences for the curve. A frequency scan could make several discoveries in process
behaviour, including the following:

• Extreme deviations
• Process outliers
• Natural clusters
• Secondary modes
• Primary modes
• Zoom view of the significant modes

The Filter Effect - Getting a Smooth Overall Picture: We can obtain a smoother function, with the details
ironed out, to show a broad picture of schedule variance, as shown in Figure 22. The desire here is not to
prescribe discrimination rules or locate troublesome groups, but to get a sense of variation.

Figure 22: Frequency diagram designed to give the overall picture

This choice is deliberately made because of the shift in decision-making approach from class discrimination to
variation control.

The same process data that was scanned in the previous figure is now processed with fewer bins, just 7
instead of the original 32. The result is a smoothed curve, which has muffled the fast variations, like a
low-pass filter, and indicates an overall picture.

One can vary the "filter characteristics" of a histogram to see different views of variation, and develop an insight
from these many perspectives. It is like tuning in to different wavelengths, looking for signals.
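
The difference between the "scan" view and the "filtered" view is only the number of bins. The sketch below bins the same hypothetical schedule variance values once with 32 bins and once with 7; the data and the function name `histogram` are assumptions made for this illustration.

```python
def histogram(data, n_bins):
    """Count data points in n_bins equal-width bins spanning min(data) to max(data)."""
    low, high = min(data), max(data)
    width = (high - low) / n_bins
    counts = [0] * n_bins
    for value in data:
        # The maximum value is placed in the last bin.
        index = min(int((value - low) / width), n_bins - 1)
        counts[index] += 1
    return counts


# Hypothetical schedule variance values (percent): a main cluster and a small, distant "burst".
schedule_variance = [1, 2, 2, 3, 3, 3, 4, 4, 5, 6, 7, 8, 10, 12, 30, 31, 32]

scan = histogram(schedule_variance, 32)     # fine-grained frequency scan
overview = histogram(schedule_variance, 7)  # coarse view, acts like a low-pass filter

print("32 bins:", scan)
print(" 7 bins:", overview)
```

With 32 bins many bins stay empty and the isolated burst stands out; with 7 bins the same data collapses into a main mode and one distant group.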

Looking at Histograms: The histogram is known as the “voice of the process”. On a chosen metric, histogram
analysis can reveal process behaviour such as stability and bias. The first-cut analysis is to look at the
shape of the histogram and see the “process signature”. Standard types of histograms have been identified by
Feigenbaum for manufacturing processes. The shapes and types can reveal the nature of the process from which
the data points have been gathered. For example, a histogram truncated on both sides represents product
behaviour after the “out-of-tolerance components” have been removed. A histogram with the central portion
missing can be traced to a population from which the best components have been selected and removed, perhaps
marked as a higher-grade delivery. In software, too, we can identify histograms with telltale signatures.
Three of these signatures are presented in Figure 25, along with their special meanings:

1. Comb structure
2. Right-biased structure
3. Left-biased structure

Many of the other figures furnished in this chapter contain real-life process signatures. Notable among them are
the following:

• Bimodal distribution with equal peaks

• Bimodal distribution with a single dominant peak
• Multiple clusters
• Rayleigh type distribution with long “tail”
• Plateau structure (flat distribution)
• Spurs (in spectral scanning)

Projects can maintain histogram libraries and map them to the contributing process scenarios. This way, every
organization can invent its own histogram types, as shown in Figure 23.

Figure 23: Defect histograms for three processes

Process Capability from Frequency Distribution: A process that is under statistical control is said to be
capable if it is able to satisfy the customer specifications or the goals of the process, in the event customer
specifications are not available. Process capability refers to the inherent ability of a process to repeat results for
a sustained period of time under a given set of conditions. The frequency signature of a capable process has a
few notable characteristics: Single mode, less variation, and process peak tends to be closer to target. In the
classical model of process capability computations, normal distribution is assumed, and numerical indices are
calculated to quantify process capability.

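Process Capability Index Cp: This index indicates the performance of the process by relating the natural
process spread to the specification (tolerance) spread, as shown in the following equation, where USL and LSL
are the upper and lower specification limits and σ is the process standard deviation:

$$C_p = \frac{USL - LSL}{6\sigma}$$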

Modifications of this basic definition are in use to account for the following special situations: Single limit and
process drift. Such indices and their variants were originally designed for mechanical processes, based on well-
established statistical models for process variation, defect occurrence, inspection, and sampling. For software
projects, can we apply Cp? There are several constraints. The beginning of the problem lies in the very nature of
the process called project management or software engineering, each having process signatures different than
that of mechanical processes. Next in line are the difficulties of prescribing control limits and specifications
limits, which cannot be calculated based on old assumptions but require a deep understanding of statistical
distributions of process parameters and defects.
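
For reference, the classical computation can be sketched as follows, assuming the standard definition Cp = (USL - LSL) / (6σ) and a normally distributed metric. The index Cpk shown alongside is one commonly used variant that accounts for process drift; it is added here as an illustration and is not taken from the text. The specification limits are hypothetical.

```python
def cp(usl, lsl, sigma):
    """Process capability index: specification spread divided by natural spread (6 sigma)."""
    return (usl - lsl) / (6.0 * sigma)


def cpk(usl, lsl, mean, sigma):
    """Common variant that penalises a process mean drifting away from the centre."""
    return min(usl - mean, mean - lsl) / (3.0 * sigma)


# Hypothetical specification limits for effort variance in percent.
USL, LSL = 16.0, 4.0
SIGMA = 2.0

print("Cp            :", cp(USL, LSL, SIGMA))
print("Cpk, centred  :", cpk(USL, LSL, mean=10.0, sigma=SIGMA))
print("Cpk, drifted  :", round(cpk(USL, LSL, mean=12.0, sigma=SIGMA), 3))
```

The centred process reaches the same value for both indices; once the mean drifts towards the upper limit, Cpk drops while Cp stays unchanged. As the paragraph above cautions, the result is only as meaningful as the normality assumption and the chosen limits.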

Probability: The area under the probability density function represents the "probability" of occurrence. In
Figure 24, the shaded area represents the probability that the upper specification limit of schedule variance
may be exceeded.

Figure 24: Probability calculation

The exact value of this probability, P(SV > USL), is obtained by dividing the shaded area by the
total area under the curve. The probability that the schedule target will be met corresponds to the unshaded
area. The shaded area, lying outside the limit, constitutes what we can term “process defects”. The white
area is the acceptable region. The areas are integrals of the probability density function (pdf) f(x) between
the specified limits, and can be calculated by using the relationship given in the following equation:

$$P(a \le x \le b) = \int_a^b f(x)\,dx, \qquad P(SV > USL) = \int_{USL}^{\infty} f(x)\,dx$$

Probabilistic Expressions of Capability and Risk: Probabilistic models can be used to determine process
capability and risk. Capability is defined as the probability of meeting the target and risk is the probability of
missing the target. Capability and risk are like two sides of a coin. If a process is not “filled” with capability, the
vacuum will be encroached by risk. A similar analysis can be done almost on all metrics, although the core
metrics such as the ones in the following list are preferred choices: Schedule, productivity, and defects.
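
Capability and risk can be computed directly once a distribution is assumed. The sketch below assumes a normally distributed schedule variance and uses the error function from the Python standard library for the cumulative distribution; the mean, sigma and upper specification limit are hypothetical values.

```python
import math


def normal_cdf(x, mean, sigma):
    """Cumulative probability P(X <= x) for a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sigma * math.sqrt(2.0))))


def capability_and_risk(usl, mean, sigma):
    """Capability = P(metric <= USL), risk = P(metric > USL)."""
    capability = normal_cdf(usl, mean, sigma)
    return capability, 1.0 - capability


# Hypothetical schedule variance: mean 10 percent, sigma 2 percent, upper limit 14 percent.
capability, risk = capability_and_risk(usl=14.0, mean=10.0, sigma=2.0)
print(f"capability = {capability:.4f}, risk = {risk:.4f}")
```

Because the two values always add up to one, every gain in capability is a reduction in risk. For skewed empirical distributions the normal assumption would have to be replaced by the observed relative frequencies.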

Analyzing Process Maturity: Process maturity can be analyzed using frequency distributions. Mature processes
show slim frequency diagrams, with sharp peaks - the fat and the process wanderings having been eliminated.
Mature processes show, decisively, a central value. The danger of secondary process intervention would have
been eliminated to secure stability. The voice of the process will stand clear above noise from spurious
performances, outliers, and strange isolated events. Mature process peaks tend to drift toward customer
satisfaction, resource conservation, and better performances. A productivity distribution, as the project matures
in capability, tends to move toward higher values. The defect distribution peak, in a similar environment, will
move to lower values. A process behaviour model is seldom static. It is highly dynamic, constantly shifting its
location, and changing the shape. The process boundaries keep in tune and the process remains in a constant
state of metamorphosis.

The road to process maturity can be tracked using frequency diagram models of the process, and by arranging a
process maturity storyboard or chronicler, which has now become an industry standard for visualizing
“continuous process improvement”. Figure 25 presents a process maturity storyboard of an organization that is
moving up the maturity grid as time passes. Approximately, the signatures correspond with capability maturity
model (CMM) levels. The metric - the chosen indicator - is effort variance. If the organization's goals can be
marked on these frames, one can easily perceive and estimate quantitatively resource management capability as
well as effort escalation risk, and relate the findings to climbing maturity level. Apart from using process
signatures to narrate a story in time, we can use them to compare business units within an organization or
benchmark teams within a business unit. We could also create a signature board to cover all primary metrics to
see if there is balance in capability or how uncertainty and risk propagate into the deeper recesses of processes.

Figure 25: Process maturity storyboard

Process Diagnosis: Process baselines based on mean and sigma sometimes hide real problems, such as in the
case study described here. The effort variance in this instance shows a bimodal distribution, each mode on either
side of zero. The arithmetic mean is almost zero; going by the mean one may think that the process is on target.
Far from it, the process is severely unstable, toggles between two meta-stable states, as revealed in the frequency
analysis. The project team recognized the problem, the first step in diagnosis, did a causal analysis, and spotted
trouble in the estimation process, which was in its juvenile stage. Either effort was overestimated or it was
underestimated. Where they had provided contingency cushions, it turned out that the expected risks did not
attack. Where they had been optimistic, risks had surfaced eventually. More than estimation, the problem was in
risk forecasting, and linking it with estimation. The team was trying to grapple with the problem and the struggle
resulted in the twin modes.

Search for Natural Process Boundary: Higher-level metrics, such as effort variance, denote complex processes
because they tend to capture the net result of several sub-processes. Calculating process control limits in such
cases is a tricky job. The exact distribution type of each sub-process may not be known, much less the way the
sub-processes combine. Traditional control limits use mean and sigma-based concoctions. But we know the
fallacy of blindly choosing the mean as a representative figure. The questions emerge: What is the true process
limit? What is going to be the decision threshold? Which is an outlier and which is the core? What control limits
do we use in our control charts? We are looking for a natural process boundary that we can trust and use in
decision making. The answer to the question lies in a frequency distribution study of the metric.

Typically, as illustrated in Figure 26, such an analysis would manifest a dominant mode, denoting a primary
process, and a subdued mode, denoting a secondary process. The valley point is taken as the natural process
boundary which can be used as the upper control limit.
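
The valley-point idea can be sketched in a few lines. This is only an illustration under stated assumptions: the metric is already binned into a histogram, the function name and the sample counts are hypothetical, and taking the two tallest local peaks as the dominant and subdued modes is one possible choice among several.

```python
def natural_process_boundary(counts, edges):
    """Return the centre of the lowest bin between the two tallest peaks.

    counts: histogram counts per bin
    edges:  bin edges, one more entry than counts
    Returns None when fewer than two peaks exist (no valley to use).
    """
    n = len(counts)
    peaks = []
    for i in range(n):
        left = counts[i - 1] if i > 0 else -1
        right = counts[i + 1] if i < n - 1 else -1
        # Local peak: higher than the left neighbour, not lower than the right one.
        if counts[i] > left and counts[i] >= right:
            peaks.append(i)
    if len(peaks) < 2:
        return None
    # Keep the two tallest peaks and put them back in bin order.
    tallest = sorted(peaks, key=lambda i: counts[i], reverse=True)[:2]
    first, second = sorted(tallest)
    # The valley is the lowest bin strictly between the two peaks.
    valley = min(range(first + 1, second), key=lambda i: counts[i])
    return (edges[valley] + edges[valley + 1]) / 2


# Hypothetical effort variance histogram, bins of width 5 from -20 to 25 percent.
counts = [2, 9, 14, 8, 3, 1, 4, 6, 2]
edges = [-20 + 5 * i for i in range(len(counts) + 1)]
boundary = natural_process_boundary(counts, edges)
print("natural process boundary near", boundary, "percent")
```

The result depends on the bin width, so it is worth repeating the search with a few different bin choices before adopting the valley as a control limit.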

Figure 26: Natural process boundary

Class Recognition - Productivity: Productivity in software development is a very complex area. Analysis of
productivity using frequency distributions could give tangible benefits. Apart from the baseline normal curve,
the empirical distribution derived with the right choice of bin intervals could reveal "productivity clusters," as
illustrated in the following case study. In Figure 27, four modes have emerged during an organization-wide
analysis of productivity data. These modes point to the existence of four distinct classes of projects; the
discriminating factors could be complexity of job and skill grades of staff. There could also be interplay between
other productivity drivers and barriers.

This diagnosis establishes four productivity levels, and facilitates developing management strategies. It also
provides a fair basis for performance measurement and comparison. The mistake of having and quoting one
productivity figure for the entire organization can now be avoided. The gaps in productivity levels provide a
framework for improvement of performance levels, tools utilization, and better and more objective human
resource management.

Figure 27: Software productivity classes

Benchmarking: A benchmark study using frequency distribution, in addition to the conventional comparison
charts, could bring out more valuable information. Sometimes it is just a comparison of signature between
successful projects and not so successful projects. Sometimes it can be a comparison of motivation level and
commitment. During a benchmarking study using frequency distribution, one can compare the following:

• Process central tendency (dominant peak)
• Number of modes
• Natural process boundary
• Process capability (percent)
• Risk (percent)
• Outliers (percent)
• Extreme values (percent)
• Mean (overall)
• Sigma (overall)

Measuring the True Value: Software measurements can have ambiguities as large as 50 percent. The
measuring process, such as review or testing, has its own sources of uncertainty, noise, and variation. The
measuring tool and the measured process both vary simultaneously, making software measurements even more
difficult. In the presence of this ambiguity, histograms help in getting at the true value: the central tendency or
the dominant mode. The histogram successfully points out the true value, even while presenting the details of
variations. All modern measuring techniques and instruments use histogram analysis to detect true value. A case
in point is defect measurement, fraught with uncertainties of high proportions.

Measuring Defects without Ambiguity: When it comes to defects, the measured value depends on the product of
two factors, as

Measured Value = Actual Defects × Detection Effectiveness.

Detection effectiveness values could vary from 40 to 80 percent, depending on the review methodology used and
the review capability of reviewers. Thus an uncertainty is associated with the review process. Measurement
capability is inversely proportional to measurement uncertainty. The rule book of measurement says that the
measuring instrument should have less uncertainty than the process variation the instrument is trying to measure.
We have to measure defect variations of the order of 10 percent with measuring instruments such as review with
an inherent variation of up to 70 percent. The ambiguity in defect measurements can be overcome by using a
simple signal-processing technique: the defect histogram.
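
To see how large the ambiguity is, the relation between measured and actual defects can be inverted for the two ends of the effectiveness range quoted above. This is a minimal sketch; the function name and the measured count of 20 defects are hypothetical.

```python
def actual_defect_range(measured, eff_low=0.40, eff_high=0.80):
    """Invert Measured = Actual * Effectiveness at both effectiveness extremes.

    Returns (lowest, highest) actual defect count consistent with the
    measured value when effectiveness lies between eff_low and eff_high.
    """
    return measured / eff_high, measured / eff_low


# Hypothetical review result: 20 defects found.
lowest, highest = actual_defect_range(20)
print(f"actual defects between {lowest:.0f} and {highest:.0f}")
```

A single review result is therefore compatible with actual defect counts that differ by a factor of two, which is why a histogram over many reviews is needed to locate the dominant mode.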

Comparison when Distinctions Blur: We go to statistics when we cannot make a judgment without its help.
An example is a case study in which statistics was called upon to compare two review methods. The first (DD) is a
one-person method; the other is a group method (PI/DC). Defect detection probabilities looked very similar in both
cases, and the raw data was confusing. Once the frequency distributions of the findings were plotted (the bottom
curves in Figure 28), the whole picture could be understood.

Figure 28: Review performance comparison

Figure 29: Six Sigma process model

Six Sigma Model: Six Sigma concepts originally began with a process behaviour model in frequency domain.
The graphs shown in Figure 29 show a Six Sigma representation of process capability. Capability is measured by
the gap - safety distance measured in terms of sigma - between the process tendency and performance limit.
Graph A has a safety distance or gap of 3σ, and hence the process has 3σ capabilities. Graph B has a process
peak that is 6σ away from the specification limit, and hence has 6σ capabilities. Defects in a Six Sigma process -
those transgressions across the specification limits - account for a mere 3.4 parts per million (ppm) of the total
events (even after allowing for some wandering of the process peak from the mean).
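
The 3.4 ppm figure can be checked numerically. The sketch below assumes a normally distributed process, a single (one-sided) specification limit, and the conventional allowance of 1.5σ for the wandering of the process peak; the 1.5σ value and the function name are assumptions not stated in the text above.

```python
import math


def tail_ppm(sigma_distance):
    """One-sided normal tail beyond sigma_distance, in parts per million."""
    return 0.5 * math.erfc(sigma_distance / math.sqrt(2)) * 1e6


shift = 1.5  # assumed drift of the process peak, in sigma units

# Defect rate for a 6-sigma and a 3-sigma process after allowing for the drift.
print(f"6 sigma: {tail_ppm(6 - shift):.1f} ppm")
print(f"3 sigma: {tail_ppm(3 - shift):.0f} ppm")
```

Without the assumed drift, the tail beyond 6σ would be far smaller than 3.4 ppm, so the quoted figure is really the tail beyond 4.5σ.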

Metrics data analysis in time domain:
Viewing in Time: Metrics data, organized in the time domain in a framework, present a window into the real world.
Our purpose here is to see what the present holds out in the context of the past. We also wish to connect events,
like a thread connects beads, and see meaningful patterns from which a future can be forecast. We will also be
seeing how control charts can be devised to provide support in decision making. Because software projects run a
predetermined path known as the life cycle, with a finite start and a finite end, time domain analysis proves to be
only natural. Time domain analysis enables project teams to become sensitive to reality, responsive to situations,
and self-organizing through continuous learning.

Temporal Patterns in Metrics: Plotting data in a chronological order brings out the hidden temporal patterns. A
causal factor for attrition, the motivational level of employees is measured here as a commitment index and
gathered every quarter. We recognize first the simple linear trend, and later more intricate nonlinear trends.
While the linear trend captures a broad, long-term behavioural pattern, the local characteristics are captured in
increasing level of details by power, polynomial, and moving average trends. All of them are effective in
suppressing noise but forecasting scope and efficiency vary. Each analysis offers an adaptive perception,
different from the rest. The overall problem, of course, is a steady decline in commitment, but the pattern of
decline, the seasonality, and similarity with known trends provide knowledge.
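
Two of the trend types named above, the linear trend and the moving average, can be sketched without any library. The quarterly commitment index values are hypothetical, as are the function names; the one-step forecast simply extends the fitted line.

```python
def linear_trend(values):
    """Least-squares slope and intercept of values against periods 0, 1, 2, ..."""
    n = len(values)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(values) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def moving_average(values, window):
    """Trailing moving average; the first window - 1 periods have no value."""
    return [sum(values[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(values))]


# Hypothetical commitment index, one value per quarter.
commitment = [82, 80, 81, 77, 78, 74, 75, 71]

slope, intercept = linear_trend(commitment)
smoothed = moving_average(commitment, 4)
next_quarter = intercept + slope * len(commitment)

print(f"change per quarter: {slope:.2f}")
print("4-quarter moving average:", smoothed)
print(f"linear forecast for next quarter: {next_quarter:.1f}")
```

The linear fit gives the long-term direction and a forecast, while the moving average follows local behaviour more closely but cannot be extended beyond the data without further assumptions.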

Time Series Forecasting: Using time series analysis, events can be predicted based on historical trends. The
bug arrival pattern shown here is an important input for maintenance projects to decide the following:

• Work scheduling
• Human resource balancing
• Strategies for service quality assurance

Forecasting requires that we identify structures in the data, which might repeat. Software failure intensity data
can be plotted and the trend can be used to predict failure, as indicated in Figure 30. In fixed assets and facilities
management, assets downtime data can be plotted in time sequence, and the trend may be derived and used to
forecast spare-parts requirements and manpower and tools requirements to fix failure events. With the
information made available by forecasting, one stands to plan better and even avoid those marginal losses that are
bound to be incurred without the benefit of advance information.

Figure 30: Bug arrival trend

Signature Prediction: Beyond the bug arrival statistics, signatures of bug population are captured periodically,
as illustrated in Figure 31, and used in prediction. The signatures become yet another dimension in forecasting.
Here signature refers to a bar graph showing distribution of bugs among the known categories as percentages.
The distribution pattern keeps changing. Risk tracking, risk exposure magnitude, and risk distribution may be
carried out in a similar fashion. Defect magnitude and defect signature are known to have been tracked in a
similar way by IBM in their ODC framework of defect management.

Figure 31: Signature profiles of bug population

Prediction Windows: Prediction may be done by seeing patterns across projects or can be done locally within a
project. For instance, customer satisfaction index may be tracked in an organization, as shown in Figure 32,
project after project, and the trend may be used in decision making. The prediction window here is quite large
and may run into years. Each project runs within a time window inside which predictions are made. Time to
complete a project and cost at completion are both predicted from the earned value graph (EVG), which
cumulatively tracks value and cost as a time series.

Figure 32: Prediction windows

Within a project, there could be smaller process windows where very short time series curves operate. Reliability
growth curve (RGC) tracks defects within the inspection window of the project. Failure intensity curve, being a
reliability model, operates in a window that begins with in-process inspection but goes beyond delivery and
penetrates into deeper time zones of alpha, beta, and acceptance tests and application runs. Every metric operates
in a time window, which also becomes the prediction window. The window patterns are eventually called

Process Characterization - Process Central Tendency Chart: A process behaviour is characterized, in
simple terms, by the mean value and the standard deviation. The first refers to the location of the process and the
next represents variation of the process. The weekly average (Xbar value) of time to repair (TTR) bugs in a
maintenance project itself is a good indicator of the process. Such a plot is called the X-bar chart, shown in
Figure 33(a). When the process variations are quite large, central tendency is more meaningful with median
values. Therefore, monitoring of process median charts is recommended in these conditions. Figure 33(b)
shows the plot of median values for the same set of data.

Figure 33: X-bar chart on TTR

Process Variation Charts: Process variation is represented by standard deviation. Figure 34(a) illustrates
the weekly values for standard deviation, in the form of an S chart. There are occasions when
process range is used as a measure of variation in place of standard deviation, which is represented in
Figure 34(b).

Figure 34: Range-standard deviation chart

Plotting Central Value and Variation Together: When the X-bar chart is accompanied by another chart showing how the range
(maximum minus minimum) varies every week, the pair is called the X-bar-R chart, which has been very popular on the
work floor. A simpler way is to plot the mean, minimum, and maximum values in the same graph and construct
the MMM chart. The weekly data set is known as sub-group (the sub-groups could stand for a group of projects,
a group of components, etc.). In our example, the MMM chart is plotted for sub-groups, each corresponding to
one week. The chart could be modified to consider (µ + σ) and (µ - σ) instead of the maximum and minimum
values to express variations. The MMM format allows forecasting and pattern recognition.
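
The values plotted in the X-bar, median, S, range, and MMM charts all come from the same per-sub-group summary. The following sketch computes them for weekly sub-groups; the time-to-repair figures and the function name are hypothetical.

```python
from statistics import mean, median, stdev


def subgroup_summary(subgroups):
    """Summarize each sub-group (here: one week of TTR values)."""
    rows = []
    for week, values in enumerate(subgroups, start=1):
        rows.append({
            "week": week,
            "xbar": mean(values),                 # X-bar chart
            "median": median(values),             # median chart
            "s": stdev(values),                   # S chart
            "range": max(values) - min(values),   # R chart
            "min": min(values),                   # MMM chart, lower line
            "max": max(values),                   # MMM chart, upper line
        })
    return rows


# Hypothetical time to repair (hours), one list per week.
ttr_hours = [[4, 6, 5, 9], [3, 7, 8, 6], [5, 5, 12, 6]]

summary = subgroup_summary(ttr_hours)
for row in summary:
    print(row)
```

Plotting `xbar`, `min`, and `max` against `week` gives the MMM chart; plotting `xbar` and `range` in two panels gives the X-bar-R pair.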

Control Charts: Park et al., Fenton and Pfleeger, Adrian Burr and Mal Owen, and Thomas Thelin are among
the earliest to have applied the traditional forms of control charts to software engineering processes. Many
software development houses have adapted control charts in one form or another. An established tool in
manufacturing, the control chart is an emergent technology in software development. In a control chart, process
results are plotted in time and compared with an expected value. Examples for the expected values are:

• Control limits set from experience
• Control limits calculated from data
• Specification limits drawn from process requirements
• Process goals set by benchmarking
• Improvement goals
• Estimated value
• Planned value

In Figure 35, the estimated value of cumulative lines of code is plotted against month, and the actually delivered
lines of code are compared with the estimated. The perceived gap between the estimated and actual makes the
process owner see the problem and do something to bring the process result back to the estimated value. Control
here means adhering to a budget or a plan. The essential control chart is a decision support tool, an early warning
radar that alerts the user.

Figure 35: Tracking growth against point estimate

Range in Expected Values: The estimated value, instead of being a point, could have a range, taking a clue
from real-life process variations. Hence, there exists an upper limit and a lower limit for the estimated value, for
a given confidence level. If σ represents the standard deviation and if the limits are estimated at 3σ, for instance,
the associated confidence level is 99.7 percent, provided the data are normally distributed.

Figure 36: Tracking growth against interval estimate

As shown in Figure 36, the actual values are plotted in the background of the estimated mean value and the
limits. Now one sees a problem if the actual values cross the limits because we have already given a tolerance
band to deviations from the expected mean value. Those data points, which lie outside the tolerance band, are
known as outliers. The first improvement one can think of is to prevent outliers, the next improvement being
reduction of the allowed variation band.

Life Cycle Phase Control Charts: The acceptable limits (point estimates) on defect levels are marked in
the life cycle phase control chart. The actual data is superimposed on the expectation levels. Perhaps this type of
control chart is most natural for life cycle projects. One can plot the following metric values in this control chart:

• Effort
• Schedule
• Rework
• Defect found
• Defect leaked
• Review effort

These life cycle phase control charts provide an opportunity to disseminate process goals and deploy them
phase-wise. One can define the ranges around each estimate to be more realistic about goal setting. The expected
values and process goals change with time and improve when the organization makes progress in its processes.
There is perhaps no expected value that can be stationary and permanent.

When Limits Blur: We must recall that uncertainties are associated with each measured value. Each data point
is not a deterministic entity, but probabilistic in nature. If we plot the probability densities of measured values, as
in Figure 37, each data point is not a single point but a distribution. Let us try to answer the following questions.
Have distributions A, B, C, D, and E crossed the limits? Should we read red alert or early warning? The answer:
these are blurred crossings, not abrupt jumps. Statistically, they represent process diffusion. We may relate
control limits to the assumed confidence levels of judgment and appreciate the tentative nature of limits. We can
move up or down the control limits and opt for yet another reference point as UCL. We can fix the UCL and
LCL at chosen points on the process distribution curve and accept the corresponding confidence level for
decision making. Crossing the limit is a question of degree, which depends on assumptions and perceptions and
not so much on the seemingly rigorous mathematical expressions that are used to compute the limits.

Figure 37: Blurred crossings

Selecting Control Limits for Unknown Distributions: When the type of distribution is not known we can
apply Chebyshev's theorem, according to which, for any population or sample, at least $1 - 1/k^2$ of the
observations in the dataset fall within k standard deviations of the mean, where k ≥ 1. This is illustrated in Figure
38 as a relationship between standard deviation and the corresponding confidence level.

Figure 38: Selecting confidence limits for control chart

Chebyshev's theorem provides a lower bound to the proportion of measurements that are within a certain number
of standard deviations from the mean. This lower bound estimate can be very helpful when the distribution of a
particular population is unknown or mathematically intractable. Because the software development process is
totally a human process, one cannot expect a standard distribution pattern. Therefore, we should adopt an
estimation method that does not depend on the data distribution pattern and at the same time reasonably
represents the actual situation. Depending on the confidence level required, one could set the process
capability baseline limits with 1.5σ, 2σ, or 3σ for 56, 75, and 89 percent confidence levels, respectively.
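
The confidence levels quoted for 1.5σ, 2σ, and 3σ follow directly from Chebyshev's bound. The sketch below reproduces them and, for comparison only, shows the coverage that would apply if the data were normally distributed; the function names are hypothetical.

```python
import math


def chebyshev_lower_bound(k):
    """Minimum fraction of observations within k standard deviations,
    valid for any distribution with a finite variance."""
    if k < 1:
        raise ValueError("the bound carries no information for k < 1")
    return 1 - 1 / k ** 2


def normal_coverage(k):
    """Fraction within k standard deviations under a normal distribution."""
    return math.erf(k / math.sqrt(2))


for k in (1.5, 2, 3):
    print(f"{k} sigma: at least {chebyshev_lower_bound(k):.0%} (Chebyshev), "
          f"{normal_coverage(k):.1%} (normal)")
```

The Chebyshev figures are guaranteed minimums, so the actual share of points inside the limits will usually be higher.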

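Control Limits for X m R Chart: When only individual observations are available, rather than sub-group samples, it is
not possible to construct an X-bar-R chart. In this case the alternative is to construct an X moving range (X m R) chart.
Here successive data points are grouped to form a sub-group. Control limits for this chart are derived based on
control chart constants. The limits are given in the following equation, where $\bar{X}$ is the mean of the individual
values and $\overline{mR}$ is the mean moving range of successive pairs.

$$ UCL_X = \bar{X} + 2.66\,\overline{mR}, \qquad LCL_X = \bar{X} - 2.66\,\overline{mR}, \qquad UCL_{mR} = 3.267\,\overline{mR} $$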

Let us consider an application of the X m R chart to the effort variance process. Because this data becomes available
infrequently, typically at project closure, we can characterize this process and arrive at its baseline value through the
application of the X m R chart.
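
A minimal sketch of the X m R computation follows. It assumes the conventional control chart constants for a moving range of two successive points (2.66 for the individuals limits and 3.267 for the moving range limit); the effort variance values and the function name are hypothetical.

```python
def xmr_limits(values):
    """Centre lines and control limits for an individuals / moving range chart."""
    if len(values) < 2:
        raise ValueError("at least two observations are needed")
    # Moving range: absolute difference between successive observations.
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    x_bar = sum(values) / len(values)
    mr_bar = sum(moving_ranges) / len(moving_ranges)
    return {
        "center": x_bar,
        "ucl": x_bar + 2.66 * mr_bar,
        "lcl": x_bar - 2.66 * mr_bar,
        "mr_center": mr_bar,
        "mr_ucl": 3.267 * mr_bar,
    }


# Hypothetical effort variance (percent), one value per closed project.
effort_variance = [4.0, -2.5, 6.0, 1.5, -3.0, 5.5, 0.5]

limits = xmr_limits(effort_variance)
for name, value in limits.items():
    print(f"{name}: {value:.2f}")
```

Because the limits rest on the average moving range rather than on the overall standard deviation, a single extreme project inflates them less than it would inflate mean and sigma limits.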

Process Capability Baseline Charts: Figure 39 shows the process capability baselines with popular control
limits. If tighter control on a metric such as effort variance percent is wanted, one could choose 1.5σ limits; on
the contrary, if the project manager does not want too many causal analyses to be made or if the process is in the
inception stage, one could choose 3σ control limits, wherein at least 89 out of 100 times the process value will be
within the 3σ control limit.

Figure 39: Control chart with confidence limits

Process Capability Baselines from Empirical Distribution: The process history, if available, can be
used to set control limits such as demonstrated in Figure 40, where frequency distribution of historical data
reveals the existence of natural process limits, the valley points dropping off the principal peak. UNPL refers to
upper natural process limit and LNPL refers to lower natural process limit. This approach allows us to use
empirical frequency distributions, which are perhaps more relevant and accurate than the elegant assumptions
made in the traditional computations of limits.

Figure 40: History-based limits

Metrics data analysis in the relationship domain:
A Fertile Domain: Processes are interdependent, forming a network. The interplay between process parameters
has been the subject of several studies in software engineering, leading to understanding of the hidden process
dynamics. The interactions that exist in the process network can be symbolically represented as a map of
relationships between metrics. The symbolic world of relationship between metrics is a new domain, which
mirrors the real world of processes and the influences they exert on one another. The analysis of an individual
metric in the frequency and time domains enhances the indicative abilities of the metric and allows us to see
patterns. In the new domain, we expand our view angle, look at the neighborhood around each metric, spot more
metrics (which seem to be connected), and focus on capturing the interrelationships. The relationship domain
brings in a pragmatic perspective. In the real world, processes do not work in isolation and, as a consequence,
complete truth cannot be represented by isolated metrics. Analysis in the relationship domain complements
analysis in the other domains. When processes work as interconnected systems, the interrelationships may follow
an order or rule. This may be just a local discipline governing a narrow range of process events. Or it may be a
global order, with universal influence. The order may change from time to time when processes shift from one
phase state to another. When we analyze metrics data in the relationship domain, we use metrics "snapshots" of
the process, to try to arrive at formulas that depict the order, rule, or discipline by which the process runs. The
formulas could be local or global, following the characteristic of the process order. Some are ephemeral while
others are everlasting. Some are reversible, some are irreversible. Some are reproducible while others are not.
We search for all. The relationship domain is a fertile hunting ground. Studying relationships among metrics
with existing data is one approach. Making special observations under controlled conditions or conducting
experiments is another approach. The choice between routine observation and experiment is decided by the
proposed degree of rigor in the intended analysis and cost. We proceed with the first choice, studying naturally
available data without incurring the expenditure of experiments. We believe that in a project environment there
is a lot to learn from available data and a lot of improvement can be made from the study results of such data
before the need arises to commission experiments. The relationship between metrics and the expression of the
same as a formula or equation can be presented graphically. In fact, we begin with graphical analysis and then
arrive at empirical formulas.

Search for Relationships: Relationship between metrics is a mirror of interplay between processes. Now we
wish to analyze metrics in search of relationships. In principle we can suppose a relationship between any two
metrics. For example, let us look at the relationships between six core metrics selected from a project:

1. Skill level
2. Productivity
3. Review effectiveness
4. Defect density
5. Effort variance
6. Size

A relationship map of these six core metrics is displayed in Figure 41. The connecting lines denote possible
relationship. Any two metrics provide an opportunity to conceive a relationship. There
are 15 distinct pairs of metrics (6 × 5 / 2), and to match there are 15 relationship lines in the map. Not all the supposed
relationships are meaningful. Some are merely mechanical constructs, just unreal mathematical possibilities. In
others, we do have expectations to uncover relationships of practical significance.

Figure 41: Relationship map

Pairing metrics is a limited, simple step, useful within the limits. We can see a complex set of relationships if we
connect one "driven" metric to five "driver" metrics. This way we are applying a cause-and-effect relationship
or predictor-response model. We take defect density as the effect and can imagine that it is driven by the
remaining five metrics, establishing a one-to-five multivariate mapping. Considering the simultaneous influence
of five predictor metrics on one response metric is a more complete and more rigorous approach.
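
The count of relationship lines and the one-to-five mapping can be enumerated directly. This is a small sketch in which each pair is counted once regardless of order; the variable names are hypothetical.

```python
from itertools import combinations

metrics = [
    "Skill level",
    "Productivity",
    "Review effectiveness",
    "Defect density",
    "Effort variance",
    "Size",
]

# Every candidate relationship line in the map: each pair counted once.
pairs = list(combinations(metrics, 2))
print(len(pairs), "candidate pairwise relationships")

# Predictor-response view: one driven metric, the remaining five as drivers.
response = "Defect density"
drivers = [m for m in metrics if m != response]
print(response, "driven by", drivers)
```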

Perceiving Relationships: Let us consider metrics in ordered pairs - two at a time - and take a look at the
possible types of relationships that can exist between them. Relationships may be perceived by plotting scatter
diagrams. One of the two chosen metrics will be treated as the dependent variable (y-axis), the other as the
independent variable (x-axis). The scatter diagram may reveal relationships, which can be among the five types
mentioned in Table 14.

Type 1   Strong Positive
Type 2   Strong Negative
Type 3   Weak Positive
Type 4   Weak Negative
Type 5   No Relationship

Table 14: Relationships revealed in a scatter diagram

Perceiving the type of influence between metrics allows us to see the interplay between process elements. In
Figure 42 the five types of influences, or relationships, are illustrated.

Figure 42: Scatter plots of relationships

Strength of Relationship: Correlation Coefficient: We may begin the relationship study between two
variables by estimating the correlation coefficient (r), which is a statistical measure of the degree of linear
relationship between the two variables. It lies between +1 and -1 depending on whether the relationship is
positive or negative. The strength of the relationship is expressed by the absolute value of the correlation
coefficient.

Table 15: Productivity skill level data

Let us consider the metrics Skill Level and Productivity as x and y variables for a correlation study. Metrics
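data obtained from a project is given in Table 15. The correlation coefficient r is defined in the following
equation:

$$ r = \frac{\sum_{i}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i}(x_i - \bar{x})^2 \, \sum_{i}(y_i - \bar{y})^2}} $$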

Computation of r using the equation above yields a value of 0.993 for the correlation coefficient. The
computation is shown in the following Table 16.

The correlation analysis shows that there is a good correlation between productivity and skill level. We need not
go through all these time-consuming steps to do a correlation study. Excel and similar spreadsheets lend support
with built-in statistical functions.
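
The same computation takes only a few lines of code. The sketch below implements the usual product-moment (Pearson) correlation coefficient; the skill level and productivity values are hypothetical stand-ins and are not the data of Table 15.

```python
import math


def pearson_r(xs, ys):
    """Product-moment correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical project data (not taken from Table 15).
skill = [1, 2, 3, 4, 5]
productivity = [11, 14, 20, 24, 29]

r = pearson_r(skill, productivity)
print(f"r = {r:.3f}")
```

A value of r close to +1 or -1 indicates a strong linear association; it says nothing yet about cause and effect.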

Table 16: Calculation of correlation coefficient

The calculation is based on the following concrete equations relating to the considered productivity data shown
in the table above.

Causal Relationship and Statistical Correlation: There is a difference between correlation and causal
relationship. Correlation between metrics suggests that they are associated; a change in one follows approximate
changes in the other. However, mere association does not assure causal relationship. Correlation could be
superficial. The variables keep pace perhaps by coincidence. In a feeding experiment with pigeons, food was
dropped in a random manner. However, some pigeons happened to see food drop when they raised their heads.
A coincidence, indeed. These pigeons moved their heads up when they needed food and expected food to drop
from the feeder. Other pigeons thought sideways movement caused food drop. The pigeons soon settled in a self-
devised superstition on the basis of apparent correlation. Expectation (or estimation) based on the strength of
mere correlation might be misleading. Likewise, if the linear correlation coefficient is zero, we cannot come to a
conclusion that there is no relationship at all. Other forms of relations might still exist, invisible because they are
“buried” in the data. Sometimes, linear correlation studies may not be able to grasp highly nonlinear or cyclic
patterns. One should be careful while making correlation studies; correlation can degenerate into scientific
superstition if not validated. Relationship on the other hand goes beyond statistical correlation and coincidence.
Usually a relationship is conceived before data analysis, based on some fundamental assumptions or well-
known, time-proven concepts. Sometimes a new relationship is proposed based on theoretical reasoning, which
awaits validation.

Linear Regression: We will now move from correlation coefficient, which measures the strength of
relationships between two variables, to regression analysis, which determines the mathematical expression of the
relationship. In the simplest form of regression, the dataset is fitted to the equation y = a + bx, where y is the
dependent variable and x is the independent variable. The values of x are assumed to cause or determine the
values of y. y = a + bx is known as the regression line to which the data points regress. This is also taken as a
regression model, which estimates y from x.

Error Sum of Squares: The difference between the estimated value and the true value is called the error
of estimation or residual in regression. For a proposed regression model, one can find the error sum of squares
(SSE) by the following equation.

$$ SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \qquad \hat{y}_i = a + b x_i $$

The Principle of Least Squares: The best fit regression model, built according to the principle of least
squares, is the regression line that achieves a minimum value for the error sum of squares. This is done
through a process of iteration, where the error sum of squares converges to its lowest value; for the
straight-line model the minimizing values of a and b can also be computed directly in closed form.

Standard Error of Estimate: The standard error of estimate (SE) measures the variability or scatter of the
observed values around the regression line. It is also a measure of reliability of the regression line as an
estimation equation. It is calculated using the following equation, where n - 2 is the number of degrees of
freedom of a straight-line fit.

$$ SE = \sqrt{\frac{SSE}{n - 2}} $$

Total Sum of Squares (TSS): This is the total of the squared differences between each sample
observation and the sample mean, as shown in the following equation.

$$ TSS = \sum_{i=1}^{n} (y_i - \bar{y})^2 $$

Coefficient of Determination R²: The coefficient of determination is defined as a measure of the proportion
of variation in y that is accounted for by regression on x.

$$ R^2 = 1 - \frac{SSE}{TSS} $$
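
The quantities defined above can be computed together in one pass. The sketch below fits y = a + bx with the closed-form least-squares solution and then derives the error sum of squares, the total sum of squares, the standard error of estimate (assuming n - 2 degrees of freedom), and the coefficient of determination. Function names and the sample data are hypothetical.

```python
import math


def fit_line(xs, ys):
    """Closed-form least-squares estimates (a, b) for y = a + b * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    return mean_y - b * mean_x, b


def regression_statistics(xs, ys):
    """Fit a line and return SSE, TSS, standard error, and R squared."""
    a, b = fit_line(xs, ys)
    n = len(xs)
    mean_y = sum(ys) / n
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    tss = sum((y - mean_y) ** 2 for y in ys)
    return {
        "a": a,
        "b": b,
        "sse": sse,
        "tss": tss,
        "standard_error": math.sqrt(sse / (n - 2)),
        "r_squared": 1 - sse / tss,
    }


# Hypothetical review effectiveness (percent) and defect density values.
review_effectiveness = [20, 35, 50, 65, 80]
defect_density = [27, 25, 21, 19, 16]

stats = regression_statistics(review_effectiveness, defect_density)
for name, value in stats.items():
    print(f"{name}: {value:.4f}")
```

A negative slope `b` in such a fit would correspond to defect density falling as review effectiveness rises.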

Linear Regression: Example: We present an example of regression analysis on the relationship between
Review Effectiveness (RE) and Defect Density (DD). The independent variable is Review Effectiveness,
and the dependent variable is Defect Density. We expect a relationship between DD and RE. We believe
that increase in RE will make DD come down. However, we do not know whether the relationship will be
nonlinear, weak, or strong; we wish to find from the regression analysis. A typical regression analysis
using the Excel tool yields outputs that include the following results:

• Regression line
• Regression table
• Residual plot
• Regression statistics

The first output, the regression line, is shown in Figure 43. The equation to the regression line and the coefficient
of determination are also printed in a textbox next to the regression line.

Figure 43: Regression line plot

The regression results are presented by the tool in a tabular form as shown in Table 17. This table presents the
predicted values (y estimated) and the observed values (y true). The difference between them is presented as
residuals. The residuals provide important information for judging the adequacy of the regression analysis. One
way they can be used is in a plot of the residuals versus the independent variable. If the residuals do not appear
to be randomly scattered around the horizontal zero line, it may indicate a problem with the regression analysis.

Table 17: Regression analysis results

Perhaps a straight-line relationship is not appropriate, or the assumptions of normality or constant variance are
not reasonable. A plot of the residuals is shown in Figure 44.

Figure 44: Residual plot

Regression statistics include the estimation of the coefficient of determination (R²) and the standard error, as in
Table 18.

Table 18: Regression statistics
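As a minimal sketch of the calculations behind the regression table and the regression statistics, the following Python fragment fits a straight line by least squares and derives the residuals, R2, and the standard error. The data values are invented for illustration and are not the values behind Table 17 or Table 18; the function name `linear_regression` is likewise only an assumption of this sketch.

```python
import math

def linear_regression(x, y):
    """Least-squares fit of y = slope * x + intercept, with basic statistics."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    predicted = [slope * xi + intercept for xi in x]
    residuals = [yi - pi for yi, pi in zip(y, predicted)]  # y true minus y estimated
    ss_res = sum(r ** 2 for r in residuals)
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return {
        "slope": slope,
        "intercept": intercept,
        "predicted": predicted,
        "residuals": residuals,
        "r_squared": 1.0 - ss_res / ss_tot,
        "std_error": math.sqrt(ss_res / (n - 2)),  # two fitted parameters
    }

# Hypothetical data: review effectiveness (%) and defect density (defects/KLOC).
review_effectiveness = [20, 30, 40, 50, 60, 70, 80]
defect_density = [28, 25, 24, 21, 20, 17, 16]

fit = linear_regression(review_effectiveness, defect_density)
print(f"DD = {fit['slope']:.4f} * RE + {fit['intercept']:.3f}")
print(f"R2 = {fit['r_squared']:.4f}, standard error = {fit['std_error']:.3f}")
```

Plotting `fit["residuals"]` against `review_effectiveness` would give a residual plot of the kind described above; a visible pattern in that plot would suggest that a straight line is not adequate.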

Outliers in Relationship: A special graph with sloping lines (1 SE and 2 SE) that run parallel to the best-fit
line and indicate outliers is given in Figure 45. Data points that lie beyond the 1 SE lines are
considered results of process violations and are marked for study and examination. The graph in Figure 45 is
sometimes known as a sloping control chart. Here the control chart raises a trigger when a process changes its
inner dynamics. This trigger is regarded as more proactive than that of conventional control charts.

Figure 45: Reliability of regression line

The decision criterion is the departure from the expected relation, not the magnitude of defect density. For example,
in Figure 45 the outlier has the lowest defect density, and for all practical purposes it represents a good job done by
the developers. However, we wish to question why the relationship with review effectiveness has changed. This
unexpected change in relationship could mean that:

• A new complexity has arrived in the development process.
• Factors other than Review Effectiveness have contributed to defect reduction.
• The intended relationship (DD = -0.1927 RE + 31.199) has failed to govern this outlier for reasons
not known to us.
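The decision rule of the sloping control chart can be sketched in a few lines of Python. The regression equation is the one quoted above; the standard error value and the data points are assumptions made only for this illustration, as is the helper name `flag_relationship_outliers`.

```python
def flag_relationship_outliers(points, slope, intercept, std_error, k=1.0):
    """Return (x, y, residual) for points farther than k standard errors from the line."""
    flagged = []
    for x, y in points:
        expected = slope * x + intercept
        residual = y - expected
        if abs(residual) > k * std_error:
            flagged.append((x, y, residual))
    return flagged

# Regression line quoted in the text: DD = -0.1927 RE + 31.199
SLOPE, INTERCEPT = -0.1927, 31.199
ASSUMED_STD_ERROR = 2.0  # hypothetical; the real value belongs to Table 18

# Hypothetical (RE, DD) observations.
observations = [(40, 23.0), (55, 21.0), (60, 12.0), (70, 18.5)]

outliers = flag_relationship_outliers(observations, SLOPE, INTERCEPT, ASSUMED_STD_ERROR)
for re_value, dd_value, residual in outliers:
    print(f"RE={re_value}, DD={dd_value}: residual {residual:+.2f} beyond 1 SE")
```

Note that the flagged point has a low defect density; it is marked because it departs from the expected relation, not because its value is bad.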

Nonlinear Regression Models: In nonlinear regression the dataset is fitted to nonlinear curves, again using the
principle of least squares. Where linear relationships are absent, there could be nonlinear relationships that we
must verify. Nonlinear regression analysis is an iterative approach. We try different modelling equations; if one
equation does not describe the data, then we try a different equation. The dataset must be carefully examined
before the iteration begins. If there is not enough data in “critical ranges”, it is safer to wait until more data is
collected in the region. If the data is too scattered, nonlinear fittings could give unstable results. If possible,
collect more data to make sure that the wide scatter (suggesting weak relationship) is not a mistake but a reality
we have to deal with. Simple data transformations or normalization may be tried to see if the data scatter can be

Nonlinear Regression Analysis of Productivity: Software development productivity in its simplest
definition is size/effort. Productivity is a heavily loaded metric and is very complex in the sense that many
factors determine its value. Productivity tends to be fundamentally nonlinear in nature. Studies have been
made in mapping productivity drivers to productivity estimates. We will pick size from the potential
drivers and study its relationship with productivity. Metrics data has been collected for size in function
points (FP) and effort in person months (PM). Size is the predictor variable or independent variable x.
Productivity itself is the “response variable” or dependent variable y. The data is presented in Table 19.

Table 19: Data used for nonlinear regression

Nonlinear Regression Analysis: We will use the following nonlinear equations for regression analysis of
a typical productivity dataset given in Table 19. Excel has been used to generate the regression curves
that correspond with the following six nonlinear equations:

1. Nonlinear regression logarithmic equation
2. Nonlinear regression polynomial-degree 2
3. Nonlinear regression polynomial-degree 3
4. Nonlinear regression polynomial-degree 4
5. Nonlinear regression power equation
6. Nonlinear regression exponential equation

Figure 46: Nonlinear regression

Goodness of Fit: The regression curves are shown in Figure 46. It may be seen that the coefficient
of determination, R2, which represents the quality of fit, is different for different regression
equations. The lowest value is 0.3034 for the logarithmic curve and the best value is 0.5621 for
the fourth degree polynomial curve. R2 gives an indication of closeness of data points to the regression
equation in a statistical sense. This helps in making a first order judgment on regression.

Monotonicity: However, choosing the regression curve must consider the other requirements of curve
fitting. The regression curves must be monotonic and stable. A look at the six models in Figure 46 shows
that one model - the fourth-order polynomial - shows a curve that reverses its trend in a few places.
Physically, trend reversal means that a larger program costs less in those regions of reversal - an absurdity.

Stability of Nonlinear Regression Curves: A Comparison: The forecasting ability of nonlinear curves has
to be assessed while choosing regression models. Let us formulate a forecasting problem and examine
how the six nonlinear regression models fare. The forecasting problem we have taken is to predict
productivity value (y) for a given size of 15000 FP (x) (see Table 20). It may be noted that the current
data range is 0 to 11000 FP. This means that the regression curve has to be extrapolated by 4000 FP to
reach an estimate.

Table 20: Results of forecasting

The results of forecasting are illustrated in the figures given in Figure 47. The fourth-order polynomial
predicts a deeply negative value, while all other models predict productivity in the range between 23 and
43 FP/PM. Negative productivity is a physically meaningless number, and magnitude of the negative
value indicates a complete failure in forecasting. The forecasting performance of the fourth-order
polynomial is shown in Figure 47, along with the power curve. It is seen from these results that the
polynomial curve has collapsed to negative values of productivity. Hence, it is a poor and unreliable
estimate. The power curve, however, behaves better and predicts a value that is realistic.

Figure 47: Forecasting nonlinear regression model
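The comparison of curve types and their extrapolation behaviour can be reproduced in outline with the sketch below. It covers only the three equations that can be linearized by a logarithmic transformation (logarithmic, power, exponential); the polynomial fits are omitted because they need a general linear solver. The dataset is invented and is not Table 19, and fitting in transformed coordinates is an assumption of this sketch (a spreadsheet trendline is commonly computed the same way, but that is not confirmed by the text).

```python
import math

def fit_line(u, v):
    """Least-squares slope and intercept of v = slope * u + intercept."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    slope = (sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
             / sum((a - mean_u) ** 2 for a in u))
    return slope, mean_v - slope * mean_u

def r_squared(y, y_hat):
    """Coefficient of determination computed on the original scale of y."""
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot

def fit_models(x, y):
    """Return prediction functions for three linearizable nonlinear curves."""
    ln_x = [math.log(v) for v in x]
    ln_y = [math.log(v) for v in y]
    a_log, b_log = fit_line(ln_x, y)   # y = a * ln(x) + b
    p, ln_c = fit_line(ln_x, ln_y)     # y = c * x ** p
    k, ln_d = fit_line(x, ln_y)        # y = d * exp(k * x)
    return {
        "logarithmic": lambda v: a_log * math.log(v) + b_log,
        "power": lambda v: math.exp(ln_c) * v ** p,
        "exponential": lambda v: math.exp(ln_d + k * v),
    }

# Hypothetical data: size in FP and productivity in FP/PM.
size_fp = [500, 1000, 2000, 4000, 6000, 8000, 11000]
productivity = [12.0, 16.0, 19.0, 25.0, 27.0, 31.0, 34.0]

models = fit_models(size_fp, productivity)
for name, predict in models.items():
    fitted = [predict(v) for v in size_fp]
    print(name, "R2:", round(r_squared(productivity, fitted), 4),
          "forecast at 15000 FP:", round(predict(15000), 1))
```

Comparing the forecast column with the R2 column makes the point of this section visible: the curve with the best fit inside the data range is not necessarily the one that extrapolates sensibly.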

Multiple Linear Regression: So far we have been looking at relationships between one dependent
variable (y) and one independent variable (x). But in many studies we need to consider the influence of
several independent variables. In multiple linear regression, the mean of the dependent variable is a
linear combination of the independent variables, as shown in the following equation.

E(y) = β0 + β1 x1 + β2 x2 + ... + βk xk

Linearity: If the linearity assumption is not met, sometimes we can transform one or more of the x
variables, like taking the square root, and get a linear dependence.

Interaction: If interactions between the independent variables are to be included in the model, then
additional cross products, xi xj, have to be included in the model.
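A minimal sketch of a two-variable linear model, with an optional cross product term, is given below. It solves the normal equations directly, which is adequate for a small illustration but is not how a statistics package would usually do it. The data are synthetic and generated from a known plane so that the recovered coefficients can be checked; the function names are assumptions of this sketch.

```python
def solve(a, b):
    """Solve the linear system a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def fit_multiple(rows, y, interaction=False):
    """Least-squares coefficients [b0, b1, b2], plus b12 for x1*x2 if requested."""
    design = []
    for x1, x2 in rows:
        terms = [1.0, float(x1), float(x2)]
        if interaction:
            terms.append(float(x1 * x2))  # cross product term xi xj
        design.append(terms)
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(k)]
    return solve(xtx, xty)

# Synthetic data: (skill level, review effectiveness %) and defect density
# generated from the plane y = 30 - 2*x1 - 0.1*x2 (an invented relationship).
rows = [(1, 20), (1, 60), (2, 40), (3, 20), (3, 80), (2, 80)]
defect_density = [30.0 - 2.0 * x1 - 0.1 * x2 for x1, x2 in rows]

plane = fit_multiple(rows, defect_density)
with_interaction = fit_multiple(rows, defect_density, interaction=True)
print("b0, b1, b2:", [round(c, 4) for c in plane])
print("b0, b1, b2, b12:", [round(c, 4) for c in with_interaction])
```

Because the synthetic data contain no interaction, the fitted cross product coefficient comes out as zero (up to rounding); on real data a clearly nonzero value would indicate that the effect of one variable depends on the level of the other.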

Surface Plot: We will consider a case study for multiple linear regression with two independent
variables. The dependent variable is Defect Density (y), measured in Defects/KLOC. The
independent variables are Skill Level (x1) and Review Effectiveness (x2). A surface plot of the
linear model is shown in Figure 48. The planar Defect Density surface indicates how quality of the
software work product is influenced by two variables. The surface gently slopes towards the high
performance point with the following coordinate values:

Figure 48: Surface plot

This surface, being a plane, does not offer optimum points but only indicates the general direction of
process improvement.

4 SPC and CMMI

4.1 Basics of Quantified Process Management

In general we can establish the following four categories of processes in the software development ([Kulpa
2003], [SEI 2002]): the project management processes, the process management processes, the engineering
processes, and the support processes. Based on process models like the CMMI we can evaluate main activities
shown in Figure 49.

Figure 49: Activities supported by process models

According to the GQM paradigm and the principles of the CAME framework for successful measurement
application, we can formulate the basic CMMI intentions considering the SPC approach as follows (see Figure

Figure 50: CMMI approach including the SPC

The actual goals are implied in achieving the different levels of the CMMI maturity evaluation. The
appropriate questions for the process maturity can be identified by considering the CMMI key processes. In
the following we give the essential questions in order to satisfy these key processes, cited from [Singpurwalla

Maturity Level 2:

Key Process Area 1 (K21)-Requirements Management

1. For each project involving software development, is there a designated software manager?
2. Does the project software manager report directly to the project (or project development)
3. Does the Software Quality Assurance (SQA) function have a management reporting channel
separate from the software development project management?
4. Is there a designated individual or team responsible for the control of software interfaces?
5. Is there a software configuration control function for each project that involves software

Key Process Area 2 (K22)-Software Quality Assurance

6. Does senior management have a mechanism for the regular review of the status of software
development projects?
7. Is a mechanism used for regular technical interchanges with the customer?
8. Do software development first-line managers sign off on their schedules and cost estimates?
9. Is a mechanism used for controlling changes to the software requirements?
10. Is a mechanism used for controlling changes to the code? (Who can make changes and under what

Key Process Area 3 (K23)-Software Project Planning

11. Is there a required training program for all newly appointed development managers designed to
familiarize them with software project management?
12. Is a formal procedure used to make estimates of software size?
13. Is a formal procedure used to produce software development schedules?
14. Are formal procedures applied to estimating software development cost?
15. Is a formal procedure used in the management review of each software development prior to
making contractual commitments?

Maturity Level 3

Key Process Area 1 (K31)-Integrated Software Management

16. Is a mechanism used for identifying and resolving system engineering issues that affect software?
17. Is a mechanism used for independently calling integration and test issues to the attention of the
project manager?
18. Are the action items resulting from testing tracked to closure?
19. Is a mechanism used for ensuring compliance with the software engineering standards?
20. Is a mechanism used for ensuring traceability between the software requirements and top-level

Key Process Area 2 (K32)-Organization Process Definition

21. Are statistics on Software design errors gathered?
22. Are the action items resulting from design reviews tracked to closure?
23. Is a mechanism used for ensuring traceability between the Software top-level and detailed
24. Is a mechanism used for verifying that the samples examined by Software Quality Assurance are
representative of the work performed?
25. Is there a mechanism for ensuring the adequacy of regression testing?

Key Process Area 3 (K33)-Peer Review

26. Are internal Software design reviews conducted?
27. Is a mechanism used for controlling changes to the Software design?
28. Is a mechanism used for ensuring traceability between Software detailed design and the code?
29. Are Software code reviews conducted?
30. Is a mechanism used for configuration management of the Software tools used in the development

Maturity Level 4

Key Process Area 1 (K41)-Quantitative Process Management

31. Is a mechanism used for periodically assessing the Software engineering process and
implementing indicated improvements?
32. Is there a formal management process for determining if the prototyping of Software functions is
an appropriate part of the design process?
33. Are design and code review coverage measured and recorded?
34. Is test coverage measured and recorded for each phase of functional testing?
35. Are internal design review standards applied?

Key Process Area 2 (K42)-Software Quality Management

36. Has a managed and controlled process database been established for process metrics data across all
37. Are the review data gathered during design reviews analyzed?
38. Are the error data from code reviews and tests analyzed to determine the likely distribution and
characteristics of the errors remaining in the product?
39. Are analyses of errors conducted to determine their process-related causes?
40. Is review efficiency analyzed for each project?

Maturity Level 5

Key Process Area 1 (K51)-Defect Prevention

41. Is software system engineering represented on the system design team?
42. Is a formal procedure used to ensure periodic management review of the status of each software
development project?
43. Is a mechanism used for initiating error prevention actions?
44. Is a mechanism used for identifying and replacing obsolete technologies?
45. Is software productivity analyzed for major process steps?

As the appropriate metrics for answering the questions above, we again give the CMMI metrics
defined by Kulpa and Johnson (only for CMMI Level Four) [Kulpa 2003]:

Organizational Process Performance

QM01: Trends in the organization's process performance with respect to changes in work products and
task attributes (e.g., size growth, effort, schedule, and quality)

Quantitative Project Management

QM02: Time between failures
QM03: Critical resource utilization
QM04: Number and severity of defects in the released product
QM05: Number and severity of customer complaints concerning the provided service
QM06: Number of defects removed by product verification activities (perhaps by type of verification,
such as peer reviews and testing)
QM07: Defect escape rates
QM08: Number and density of defects by severity found during the first year following product delivery
or start of service
QM09: Cycle time
QM10: Amount of rework time
QM11: Requirements volatility (i.e., number of requirements changes per phase)
QM12: Ratios of estimated to measured values of the planning parameters (e.g., size, cost, and schedule)
QM13: Coverage and efficiency of peer reviews (i.e., number/amount of products reviewed compared to
total number, and number of defects found per hour)
QM14: Test coverage and efficiency (i.e., number/amount of products tested compared to total number,
and number of defects found per hour)
QM15: Effectiveness of training (i.e., percent of planned training completed and test scores)
QM16: Reliability (i.e., mean time-to-failure usually measured during integration and systems test)
QM17: Percentage of the total defects inserted or found in the different phases of the project life cycle
QM18: Percentage of the total effort expended in the different phases of the project life cycle
QM19: Profile of subprocesses under statistical management (i.e., number planned to be under statistical
management, number currently being statistically managed, and number that are statistically
QM20: Number of special causes of variation identified
QM21: The cost over time for the quantitative process management activities compared to the plan
QM22: The accomplishment of schedule milestones for quantitative process management activities
compared to the approved plan (i.e., establishing the process measurements to be used on the
project, determining how the process data will be collected, and collecting the process data)
QM23: The cost of poor quality (e.g., amount of rework, re-reviews and re-testing)
QM24: The costs for achieving quality goals (e.g., amount of initial reviews, audits, and testing)

SPC depends on historical data. It also depends on accurate, consistent process data. If you are just beginning the
process improvement journey, do not jump into SPC. You (your data) are not yet ready for it. That is why the
CMMI waits until Maturity Level 4 in the staged representation to suggest the application of SPC techniques. At
Level 2, processes are still evolving. At Level 3, they are more consistent. Level 4 takes process information
from Level 3, and analyzes and structures both the data and their collection. Level 5 takes predictable and
unpredictable processes, and improves them.

4.2 Controlling the Process Improvement

Finally, we describe some statistical methods that specifically support Statistical Process Control (see
[Pandian 2004], [Putnam 2003], [Zelkowitz 1997] and [Zuse 2003]).

The Shewhart control chart, introduced in the 1920s, decomposes process variation into two components: random
variation (within predictable bounds) and systematic variation (anomalies). Random variations, when the cause system
is constant, approach some distribution function and hence remain predictable or statistically stable. Systematic
variations are due to assignable causes such as unusual events, freak incidents, process drifts, and
environmental threats. Shewhart demonstrated how control charts could be used to identify and distinguish the
two types of process variation, to achieve process efficiency, and ensuing economic benefits. Figure 51 shows
how a training manager uses the Shewhart Control Chart to identify (and later solve) two problems:
extraordinary cost for Training ID 7 and the average cost (µ) greater than the budget. Armand V. Feigenbaum
allows control limits to be specified pragmatically from past experience and guesswork.

Figure 51: Controlling the cost of training

Tests for Control Charts: Tests for statistical control have been in use for a long time. The classical tests or
decision rules to be applied while reading the control charts are presented in the following list, along with an
illustration in Figure 52.

Test #1: Any point outside one of the control limits is an indication of a special
cause and needs to be investigated.

Test #2: A run of seven points in succession, either all above or all below the
central line, or all increasing or all decreasing, is an indication of a
special cause and needs to be investigated.

Test #3: Any unusual pattern or trend involving cyclic or drift behaviour of the data
is an indication of a special cause and needs to be investigated.

Test #4: The proportion of points in the middle-third zone of the distance between
the control limits should be about two thirds of all the points under

Figure 52: Tests for Statistical Process Control charts
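Tests #1 and #2 are mechanical enough to be automated. The following sketch shows one possible reading of them; the function names, the example series, and the limits are assumptions made for illustration. Test #3 (unusual patterns) and Test #4 (zone proportions) are left out because they need more judgment or further definitions.

```python
def outside_limits(values, lcl, ucl):
    """Test #1: indices of points outside the control limits."""
    return [i for i, v in enumerate(values) if v < lcl or v > ucl]

def runs_on_one_side(values, center, length=7):
    """Test #2a: indices at which a run of `length` points on one side of the
    central line has been reached (points exactly on the line break the run)."""
    hits, run, last_side = [], 0, 0
    for i, v in enumerate(values):
        side = (v > center) - (v < center)  # +1 above, -1 below, 0 on the line
        if side != 0 and side == last_side:
            run += 1
        else:
            run = 1 if side != 0 else 0
        last_side = side
        if run >= length:
            hits.append(i)
    return hits

def monotone_runs(values, length=7):
    """Test #2b: indices at which `length` successive points are all increasing
    or all decreasing."""
    hits, run, last_dir = [], 1, 0
    for i in range(1, len(values)):
        direction = (values[i] > values[i - 1]) - (values[i] < values[i - 1])
        if direction != 0 and direction == last_dir:
            run += 1
        else:
            run = 2 if direction != 0 else 1
        last_dir = direction
        if run >= length:
            hits.append(i)
    return hits

# Hypothetical metric series with assumed central line and control limits.
series = [9, 11, 10.5, 17, 11, 12, 11, 13, 12, 11, 9]
CENTER, LCL, UCL = 10, 4, 16

print("Test #1 signals at:", outside_limits(series, LCL, UCL))
print("Test #2 (one side) signals at:", runs_on_one_side(series, CENTER))
print("Test #2 (trend) signals at:", monotone_runs(series))
```

In this sketch a signal is reported at every point from the seventh point of a run onwards, so a long run produces several consecutive signals; a reporting tool might prefer to report only the first.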

Control Chart in the Presence of Trend: If the metric shows a trend, such as delivered defect density (DDD)
in Figure 53, the control chart may be partitioned to make a clearer presentation of the problem. The trend line
helps in forecasting and risk estimation. The baseline helps in process analysis, estimation, and setting process

Dual Process Control Charts: Sometimes the metric is a product of two major components, each showing its
own independent characteristics. Defects found by design review, for instance, are the product of defects injected
and review effectiveness, as shown in the following equation.

Defects Found = Defects Injected * Review Effectiveness

Figure 53: Trend and baseline

The UCL in the control chart of defects/KLOC, as shown in Figure 54, is more relevant to the designers, who
have to keep the defect level below the UCL. The LCL, on the other hand, appeals to the reviewers to find more
defects than the LCL. In the defect control chart in Figure 54, the following references are marked for proper

From Dual Limits to Single Limits: The control chart in Figure 54 is cluttered, and one has to strain to read,
analyze, and interpret the chart. When the chart is used to give process feedback, some process owners may mix
up the signals: one limit demands a minimum production of defects, while the other demands just the opposite.

Figure 54: In-process defect control chart

This problem may be solved, and an effective presentation made to the process owner, if we
construct two separate control charts, each delivered to its process owner with the appropriate control limits, as
indicated in Figure 55. After the split, the new control charts look simple and clear, with just one decision rule
marked. The process owner, the designer, or the reviewer, gets a clear signal.

Figure 55: Splitting a double-side limit into two single-side limits

The process defects are marked as circles in both cases. With defects clearly marked and the goal (specification
limit) clearly specified, each process owner can go into causal analysis of process violations and initiate
corrective measures. The purpose of this control chart is to provide effective feedback and facilitate corrective

Control Charts Types: There are several control chart forms in use, including the ones we have used so
far. Below is a brief list for a quick reference. The exact formulas for computations may be found
elsewhere. When we have a large number of data points that can be organized as sub-groups according to
some real-life order, and when the sub-group sizes are used in determining the control limits, the
following charts may be useful.

X-bar chart with UCL and LCL
X-bar - R chart with UCL and LCL
X-bar - S chart with UCL and LCL
p Chart (percentage defectives) with UCL and LCL
u Chart (defects per unit size) with UCL and LCL
c Chart (defect counts per module) with UCL and LCL

If instead of sub-groups we have just an individual data point for every process delivery, we can
artificially create a sub-group by selecting data points from a moving average window, and plot a graph
with control limits calculated in the traditional way.

Individuals chart (XmR) with UCL and LCL
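For the individuals chart, the limits are commonly derived from the average moving range of successive points. The sketch below uses the customary constants for a moving range of two points (2.66 for the individuals limits and 3.267 for the moving range upper limit); the data series and the function name `xmr_limits` are assumptions for illustration only.

```python
def xmr_limits(values):
    """Control limits for an individuals (X) chart and its moving range (mR) chart."""
    n = len(values)
    center = sum(values) / n
    # Moving ranges between successive observations (window of two points).
    moving_ranges = [abs(values[i] - values[i - 1]) for i in range(1, n)]
    mr_bar = sum(moving_ranges) / len(moving_ranges)
    return {
        "center": center,
        "ucl": center + 2.66 * mr_bar,
        "lcl": center - 2.66 * mr_bar,
        "mr_bar": mr_bar,
        "mr_ucl": 3.267 * mr_bar,
    }

# Hypothetical metric values, one per process delivery, in chronological order.
deliveries = [10, 12, 11, 13, 12]
limits = xmr_limits(deliveries)
print({name: round(value, 3) for name, value in limits.items()})
```

For a metric that cannot be negative, such as a defect count, a computed LCL below zero would normally be replaced by zero.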

When all we desire is to characterize the process and generate some performance baseline on a chosen
metric, the following forms may be used. These forms can be used across life cycle phases or across sub-

If we wish to compare actual values with estimates, then the following may be used:

• Cumulative graphs with point estimates
• Cumulative graphs with interval estimates
• Run charts with estimates shown as USL, LSL
• Life cycle profiles with USL and LSL
• Run charts with baseline values (history) marked

Special Forms

Most performance models are constructed this way. A few of them are illustrated in this section.

Multi-Process Tracking Model: A simple way to take a holistic and balanced view of processes is to track all
related process metrics on a radar chart, marking the target values and the achieved values. Cost drivers,
performance drivers, and defect drivers in software development can be plotted on the radar chart for effective
process control. Tracking of multiple goals, all competing for resources, is presented in the radar chart format in
Figure 56. The following is a list of metrics used to represent and measure goals:

• Customer satisfaction index (CUST SAT)
• Productivity index (PROD)
• Employee satisfaction index (EMP SAT)
• Right first time index (RFT)
• Defect removal effectiveness (DRE)
• Training need fulfilment index (TNF)

All these are measured quantitatively on a 0 to 10 scale (ratio scale). Targets and achievement in each direction
are plotted. This is a control chart because it compares reality with expectation and allows one to see deviations.
It gives deeper meaning and allows one to visualize a balanced picture or model on goal achievement.

Figure 56: Goal control radar

Dynamic Model - Automated Control Charts: Control charts in modern times have taken a totally new
form. They are embedded in metric databases and analysis modules, which perform dynamic functions. A
defect-tracking tool uses a defect database as the platform and tracks bug closure. If the time taken exceeds a
preset limit, the software generates a message to the tester. If the bug still lives long after the message, the
software escalates the issue and the message is now flashed to the project manager. The tester or the manager
does not see a physical control chart but gets the results. The limit setting can be a choice of the manager,
whose experience and judgment prevail. Or the limit setting can be done by the software logic, which will use
an appropriate decision rule and raise an alarm. The decision-making algorithm can be simple algebra or a
sophisticated knowledge engine that learns and works with intelligence. The graph is printed, on demand, as a
report from the tool along with other statistics. In a similar way, metrics data analysis tools can generate dynamic
control charts on all metrics. These charts can be published in the monthly process capability baseline reports.
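The escalation logic of such a defect-tracking tool can be sketched as a simple decision rule. The limits, bug identifiers, and function name below are hypothetical and do not describe any particular tool.

```python
def escalation_target(days_open, tester_limit, manager_limit):
    """Decide who is notified about an open bug, based on preset time limits."""
    if days_open > manager_limit:
        return "project manager"  # the bug outlived the first message
    if days_open > tester_limit:
        return "tester"
    return None                   # within limits, nobody is notified

# Hypothetical limits (in days) and hypothetical open bugs.
TESTER_LIMIT, MANAGER_LIMIT = 5, 10
open_bugs = {"BUG-1": 3, "BUG-2": 7, "BUG-3": 14}

notifications = {
    bug_id: escalation_target(age, TESTER_LIMIT, MANAGER_LIMIT)
    for bug_id, age in open_bugs.items()
}
print(notifications)
```

Here the limits are fixed choices of the manager; a tool could equally derive them from historical closure times, for example with control limits computed from past data.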

Control Chart for Effective Application: There are many forms of control charts but they all must be
structured well for effective application. Here are some suggestions. On any metric we can plot a control chart.
Choose the metric that communicates better. For instance, a training manager can choose cost of absenteeism
instead of number of people who are absent because the former makes senior management look at the control
chart seriously. The data should be in chronological order. Most software development processes follow the
learning curve, both first order and second order. Before process stability is achieved, the learning curve is
encountered. Chronological order gives control charts the vital meaning and power. A decision rule must be
provided to enable problem recognition. The rule could be expressed in the following ways:

• Control limits
• Specification limits
• Baseline references
• Estimated values
• Process goals
• Process constraints
• Benchmark values
• Expected trend
• Zones

The reader must be made familiar with the rules for interpretation. The chart must be designed with the most
likely readers in mind, and every effort must be made to make the chart provide effective communication to a
human system (biofeedback). Provide support data as annotations for significant data points. For example, a
defect distribution pie chart can be provided as a companion to a defect control chart. Annotate identified hot
spots or trends with causal analysis findings. We learn from such annotations. Wherever possible, suggested
corrective action may be indicated.

Modernism in Process Control - Decision Support Charts: Metrics data, when presented in time series, offers
a new form that helps to understand the process. A well-structured time series chart could emerge into a model
once it captures a pattern that can be applied as a historic lesson. The time series analysis for trend or process
control is also a time series model of the process, inasmuch as it can increase one's understanding of the process
behaviour and forecast.

What-If Analysis: But the outstanding issue in software projects is whether a process goes according to a
plan or estimate. The need for statistically derived, self-organizing goals, should it arise, is only secondary.
The term control chart may then be replaced with the term decision support chart. The concept of control
limit will be substituted with the concept of decision thresholds. What-if analysis can be done on a control
chart by shifting the limits and seeing each time how many events are picked up and earmarked for
investigation. The problem set will shift according to the location of the threshold line.
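A what-if analysis of this kind amounts to re-evaluating the same data against several candidate thresholds. The following sketch, with invented values and a hypothetical function name, lists which events each threshold would earmark for investigation.

```python
def what_if(values, thresholds):
    """Map each candidate upper threshold to the indices of events exceeding it."""
    return {t: [i for i, v in enumerate(values) if v > t] for t in thresholds}

# Hypothetical metric values (e.g., effort overrun in percent per work package).
overruns = [3, 8, 5, 12, 7]

problem_sets = what_if(overruns, thresholds=[6, 10])
for threshold, events in problem_sets.items():
    print(f"threshold {threshold}: {len(events)} events flagged -> {events}")
```

Lowering the threshold enlarges the problem set; the decision maker can then choose the threshold whose workload for investigation is still manageable.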

Clues, Not Convincing Proof: There are reasons why metrics control charts end up issuing suggestive
clues but not convincing proof about process problems:

• Data errors
• Ambiguity in measurement scale
• Process having nonnormal distributions
• Nonavailability of defect propagation models

But all a project manager is looking for is a set of clues, not final proof. A decision support chart can
coexist with ambiguity but the classical control chart cannot.

If It Is Written on the Wall, Do Not Draw Control Charts: If known problems are not solved, nobody
wants to use a control chart to detect new problems. If trouble can be spotted without having to use a
control chart, avoid control charts. Going one step further, if without the aid of control limits we can spot
outliers using the naked eye, let us not draw control limits.
The connection of control charts with action is now legendary. The best control chart is the one on which
somebody acts.

Regression models have huge application potential in software engineering and management. They support the
creation of a wide variety of knowledge products from simple visual display of relationships to estimation
equations. They can reflect real situations in different degrees of detail, ranging from simple two-variable models
to complex multiple variable models. They can capture process nonlinearity and allow us to exploit this
knowledge, either in optimization or in risk avoidance.

Regression Model Application - Causal Analysis:

Regression models are naturally poised for causal analysis application. The x-y relationship is a cause-effect
relationship (in the predictor-predicted sense). The regression analysis discussed here makes use of productivity
data. Requirement effort (%) has been chosen as the independent variable. The data and the nonlinear regression
line fitted to the dataset are shown in Figure 57. The association rule for causal analysis demands a good R2, and
we get a value of 64.34 percent. The extraneous data and outliers can be put aside and we can focus on the
regression line to do causal analysis. Logic tells us that software productivity should improve with better
requirement capturing (and a direction for causal analysis is set this way). The regression model (nonlinear,
logarithmic) shows asymptotic rise in productivity, and we can see a shoulder on the curve after which it
becomes flat. Requirement effort affects productivity up to a point, then either other factors take over or further
investment on requirement does not yield return.

Figure 57: Influence of requirement analysis effort on productivity

Regression Model Application - Optimum Team Size:

That there exists an optimum team size has been much discussed and widely quoted. But what are the facts? A
regression model of team size on productivity reveals the real picture. Team size productivity data is shown in
Figure 58, and the graph shows the nonlinear regression curve, a power equation, which fits to an R2 of 42.28
percent. According to the regression model, when the team grows away from the organic small size, its
productivity decreases sharply. The nonlinear model does not permit optimization of team size; it imposes a
constraint equation on software projects. The choice is made not on the basis of an intrinsic optimum predicted
by the model but on the basis of other factors. For example, a strategic limit on minimum productivity would
dictate the team size limit. In those cases where a larger team size is chosen based on other considerations, we
know from the model what the corresponding loss in productivity would be, and can take appropriate
countermeasures. This model would also help in breaking work packages into smaller units and operating the
project with the proverbial small teams.

Figure 58: Team size constraints on productivity
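How a power equation acts as a constraint can be illustrated by inverting it: given a strategic minimum productivity, the largest admissible team size follows directly. The coefficients below are invented for the illustration and are not the fitted values of Figure 58; the function names are likewise assumptions.

```python
def predicted_productivity(team_size, a, b):
    """Power model: productivity = a * team_size ** b (b < 0 for a falling curve)."""
    return a * team_size ** b

def max_team_size(min_productivity, a, b):
    """Largest team size whose predicted productivity still meets the minimum.

    Obtained by solving a * n ** b = min_productivity for n; valid for b < 0.
    """
    return (min_productivity / a) ** (1.0 / b)

# Hypothetical coefficients of the power curve.
A, B = 40.0, -0.5

limit = max_team_size(min_productivity=10.0, a=A, b=B)
loss = predicted_productivity(4, A, B) - predicted_productivity(16, A, B)
print(f"team size limit for productivity >= 10: {limit:.1f}")
print(f"predicted productivity loss when growing from 4 to 16 people: {loss:.1f}")
```

The second figure corresponds to the use described above: when a larger team is chosen for other reasons, the model quantifies the productivity loss that has to be compensated.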

Regression Model Application - Building an Effort Estimation Model:

Predicting effort from size has been a favourite game for several researchers. The resulting models go by the
name of cost models and estimation models.

Figure 59: Refined regression model (after removal of outliers)

Our objective here is to apply regression modelling to design an effort estimation model from data commonly
available in projects, namely, effort and size. Some practical data is provided in Table 21.

Table 21: Effort data

Expectation: The metrics used here are effort in hours and size in function points. Size is taken as the
independent variable. The expected relationship, based on several experiences, is a power equation of the

We also expect complications in regression model building. Size measurements can have errors, which
will interfere with regression.

Analysis: Regression analysis of the dataset is shown in Figure 60. A linear regression line appears with
goodness of fit 39.75 percent, a poor value for an estimation model. There is a large scatter of data. The
model requires improvement.

Figure 60: Effort Estimation from size: the first regression

Table 22: Clustered data

Presenting such scatter plots sometimes invites criticism. The lack of a clear trend makes people give up
and lose interest in the analysis; they conclude that "if you have enough data you can prove any theory." The
problem is quite basic. The step missed during data collection is "categorization," a discipline that
ranks low among the measurement scales but can bring clarity.

Clustering: Examining the scatter plot in Figure 60, we notice a possibility for clustering: regrouping the
data according to some logical rule and trying a separate regression for each cluster.
The exploratory data analysis indicates a natural divide in the data that is worth finding. The regrouping
must follow a logic based on some physical reasoning, such as type of project,
nature of technology, or even year of completion. Histograms can be used to test for the existence of strong
clusters. The data were grouped into two clusters; the regrouped data are shown in Table 22.

New Regression Models: The new regression lines, obtained after clustering, are shown in Figure 61. The
goodness of fit is 83.44 percent for one cluster and 67.63 percent for the other, far better than the
39.75 percent of the first run. This example emphasizes the need for iterative runs in
model building. We can continue the iteration with further clustering, transformation, partitioning, or
other means of model refinement. We can also search for better equations, or move to
multiple linear regression with additional predictors. Model building is a process in itself. The quest
ends when we have a reasonable model that has a reasonable confidence level and
agrees with common sense.
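
The effect of clustering on regression quality can be sketched as follows. The records, the category labels and the function name are hypothetical and serve only to show the mechanics: one pooled regression is compared with a separate regression per category.

```python
from collections import defaultdict

def r_squared(points):
    """R^2 of the least-squares line through a list of (x, y) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    # For a simple linear fit, R^2 equals the squared correlation coefficient.
    return sxy ** 2 / (sxx * syy)

# Hypothetical records: (category, size in function points, effort in hours).
records = [
    ("maintenance", 100, 500), ("maintenance", 200, 950),
    ("maintenance", 300, 1600), ("maintenance", 400, 1900),
    ("new development", 100, 1500), ("new development", 200, 3300),
    ("new development", 300, 4300), ("new development", 400, 6100),
]

# Regroup the data by the category recorded during data collection.
clusters = defaultdict(list)
for category, size, effort in records:
    clusters[category].append((size, effort))

pooled_r2 = r_squared([(size, effort) for _, size, effort in records])
cluster_r2 = {name: r_squared(points) for name, points in clusters.items()}

print(f"pooled: R^2 = {pooled_r2:.2%}")
for name, value in cluster_r2.items():
    print(f"{name}: R^2 = {value:.2%}")
```

In this constructed example each cluster follows its own line closely, so the per-cluster fits are much better than the pooled fit. With real data the gain depends on whether the chosen category reflects a genuine difference between the projects.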

Figure 61: Regression after clustering

Important Lesson: This application illustrates one principle: estimation models predict better within their
own families. Each estimation model represents a narrow world, inside which it operates best. There is no
universal estimation model. Hence, even if we have just a few data points, it is better to build our own
estimation models, one for each family.

Statistical Process Control provides a way of handling the increasing complexity of software engineering. In
this preprint the statistical basics were introduced and an example was provided to show how the approach is
applied in practice. Using it profitably requires experience with the approach.
With the requisite experience it is a very powerful tool for controlling current software processes
as well as for planning future projects, so that the overall effort can decrease while the
quality increases.
