Statistical Process Control for Level 4

© Attribution Non-Commercial (BY-NC)

Achieving the CMMI Level Four

dumke@ivs.cs.uni-magdeburg.de, http://ivs.cs.uni-magdeburg.de/sw-eng/agruppe/

Contents

1.1 Basic Intentions of the CMMI

1.2 The CMMI Levels

1.3 The CMMI Metrication

2.1 The CAME Measurement Framework

2.2 The CMMI Metrics Set by Kulpa and Johnson

2.3 The CMMI-Based Organization’s Measurement Repository

3.1 Foundations of the SPC

3.2 Empirical Strategies

3.3 Testing Methods

3.4 Methods of Data Analysis

4.1 Basics of Quantified Process Management

4.2 Controlling the Process Improvement

5 References

Abstract

This preprint presents a new way of integrating statistical analysis of the software process (statistical process control, SPC) into the assessment and improvement activities of the Capability Maturity Model Integration (CMMI) initiative. Building on basic statistical methods and the foundations of software experimentation, we describe a structured approach to the metrication of the different stages of the CMMI. Further, the preprint presents appropriate methods of statistical analysis for improving the software process areas and activities toward a quantitatively managed process level, based on the metrics set defined by Kulpa and Johnson.

1 The CMMI Approach

1.1 Basic Intentions of the CMMI

CMMI stands for Capability Maturity Model Integration. It is an initiative to shift the general intention from an assessment view, based on the “classical” CMM or ISO 9000, to an improvement view. It integrates the System Engineering CMM (SE-CMM), the Software Acquisition Capability Maturity Model (SA-CMM), the Integrated Product Development Team Model (IPD-CMM), the System Engineering Capability Assessment Model (SECAM), the Systems Engineering Capability Model (SECM), and basic ideas of the new versions of ISO 9001 and ISO 15504. The following semantic network shows some classical approaches to software process evaluation, without further comment [Ferguson 1998].

[Figure: semantic network of classical software process evaluation approaches around the CMMI, including People CMM, SA-CMM, SW-CMM, SE-CMM, SSE-CMM, IPD-CMM, FAA-iCMM, SECAM, SECM (EIA/IS 731), ISO 15504 (SPICE), the ISO 9000 series, ISO/IEC 12207, ISO 15288, Trillium, TickIT, Baldrige, and various IEEE, EIA, DOD, and MIL standards]

The CMMI is structured into five maturity levels, the process areas considered, the specific goals (SG) and generic goals (GG), the common features, and the specific practices (SP) and generic practices (GP). The process areas are defined as follows [Kulpa 2003]:

“The Process Area is a group of practices or activities performed collectively to achieve a specific objective.”

Such objectives could be requirements management at maturity level 2, requirements development at maturity level 3, or quantitative project management at maturity level 4. The difference between the “specific” and the “generic” goals and practices lies in the special aspects or areas that the specific ones address, as opposed to general IT-wide or company-wide analysis or improvement. There are four common features:

The commitment to perform (CO)

The ability to perform (AB)

The directing implementation (DI)

The verifying implementation (VE).

The CO is shown through senior management commitment, the AB is shown through the training of personnel, the DI is demonstrated by managing configurations, and the VE is demonstrated by objectively evaluating adherence and by reviewing status with higher-level management.

The following Figure 2 shows the general relationships between the different components of the CMMI

approach.

Capability Levels

The CMMI gives us some guidance as to what is a required component, what is an expected component, and what is simply informative.

There are six capability levels (but five maturity levels), designated by the numbers 0 through 5 [SEI 2002],

including the following process areas:

0. Incomplete: -

2. Managed: requirements management, project planning, project monitoring and control, supplier

agreement management, measurement and analysis, process and product quality assurance;

validation, organizational process focus, organizational process definition, organizational training,

integrated project management, risk management, integrated teaming, integrated supplier

management, decision analysis and resolution, organizational environment for integration;

Kulpa and Johnson consider the following specific goals and practices for achieving the different maturity levels with respect to quantification [Kulpa 2003]:

Level 2: Measurement and Analysis:

The purpose of Measurement and Analysis is to develop and sustain a measurement capability that is used to support management information needs. Specific Practices by Specific Goal:

SG1 Align Measurement and Analysis Activities: Measurement objectives and activities are aligned

with identified information needs and objectives.

SP1.1 Establish Measurement Objectives: Establish and maintain measurement objectives that are

derived from identified information needs and objectives.

SP1.2 Specify Measures: Specify measures to address the measurement objectives.

SP1.3 Specify Data Collection and Storage Procedures: Specify how measurement data will be

obtained and stored.

SP1.4 Specify Analysis Procedures: Specify how measurement data will be analyzed and reported.

SG2 Provide Measurement Results: Measurement results that address identified information needs

and objectives are provided.

SP2.1 Collect Measurement Data: Obtain specified measurement data.

SP2.2 Analyze Measurement Data: Analyze and interpret measurement data.

SP2.3 Store Data and Results: Manage and store measurement data, measurement specifications, and

analysis results.

SP2.4 Communicate Results: Report results of measurement and analysis activities to all relevant

stakeholders.

Level 2: Process and Product Quality Assurance:

SG1 Objectively Evaluate Processes and Work Products: Adherence of the performed process and associated work products and services to applicable process descriptions, standards, and procedures is objectively evaluated.

SP1.1 Objectively Evaluate Processes: Objectively evaluate the designated performed processes

against the applicable process descriptions, standards, and procedures.

SP1.2 Objectively Evaluate Work Products and Services: Objectively evaluate the designated work

products and services against the applicable process descriptions, standards, and procedures.

SG2 Provide Objective Insight: Noncompliance issues are objectively tracked and communicated,

and resolution is ensured.

SP2.1 Communicate and Ensure Resolution of Noncompliance Issues: Communicate quality issues

and ensure resolution of noncompliance issues with the staff and managers.

SP2.2 Establish Records: Establish and maintain records of the quality assurance activities.

Level 3: Verification:

The purpose of Verification is to ensure that selected work products meet their specified requirements. Specific Practices by Specific Goal:

SG1 Prepare for Verification: Preparation for verification is conducted.

SP1.1 Select Work Products for Verification: Select the work products to be verified and the

verification methods that will be used for each.

SP1.2 Establish the Verification Environment: Establish and maintain the environment needed to

support verification.

SP1.3 Establish Verification Procedures and Criteria: Establish and maintain verification procedures

and criteria for the selected work products.

SG2 Perform Peer Reviews: Peer reviews are performed on selected work products.

SP2.1 Prepare for Peer Reviews: Prepare for peer reviews of selected work products.

SP2.2 Conduct Peer Reviews: Conduct peer reviews on selected work products and identify issues

resulting from the peer review.

SP2.3 Analyze Peer Review Data: Analyze data about preparation, conduct, and results of the peer

reviews.

SG3 Verify Selected Work Products: Selected work products are verified against their specified

requirements.

SP3.1 Perform Verification: Perform verification on the selected work products.

SP3.2 Analyze Verification Results and Identify Corrective Action: Analyze the results of all

verification activities and identify corrective action.

Level 3: Validation:

The purpose of Validation is to demonstrate that a product or product component fulfills its intended use when placed in its intended environment. Specific Practices by Specific Goal:

SG1 Prepare for Validation: Preparation for validation is conducted.

SP1.1 Select Products for Validation: Select products and product components to be validated and

the validation methods that will be used for each.

SP1.2 Establish the Validation Environment: Establish and maintain the environment needed to

support validation.

SP1.3 Establish Validation Procedures and Criteria: Establish and maintain procedures and criteria

for validation.

SG2 Validate Product or Product Components: The product or product components are validated to

ensure that they are suitable for use in their intended operating environment.

SP2.1 Perform Validation: Perform validation on the selected products and product components.

SP2.2 Analyze Validation Results: Analyze the results of the validation activities and identify issues.

Level 3: Decision Analysis and Resolution:

The purpose of Decision Analysis and Resolution is to analyze possible decisions using a formal evaluation process that evaluates identified alternatives against established criteria. Specific Practices by Specific Goal:

SG1 Evaluate Alternatives: Decisions are based on an evaluation of alternatives using established

criteria.

SP1.1 Establish Guidelines for Decision Analysis: Establish and maintain guidelines to determine

which issues are subject to a formal evaluation process.

SP1.2 Establish Evaluation Criteria: Establish and maintain the criteria for evaluating alternatives,

and the relative ranking of these criteria.

SP1.3 Identify Alternative Solutions: Identify alternative solutions to address issues.

SP1.4 Select Evaluation Methods: Select the evaluation methods.

SP1.5 Evaluate Alternatives: Evaluate alternative solutions using the established criteria and

methods.

SP1.6 Select Solutions: Select solutions from the alternatives based on the evaluation criteria.

Level 4: Quantitative Project Management:

The purpose of the Quantitative Project Management process area is to quantitatively manage the project’s defined process to achieve the project’s established quality and process-performance objectives. Specific Practices by Specific Goal:

SG1 Quantitatively Manage the Project: The project is quantitatively managed using quality and process-performance objectives.

SP1.1 Establish the Project’s Objectives: Establish and maintain the project’s quality and process-performance objectives.

SP1.2 Compose the Defined Process: Select the subprocesses that compose the project’s defined

process, based on historical stability and capability data.

SP1.3 Select the Subprocesses that Will Be Statistically Managed: Select the subprocesses of the

project’s defined process that will be statistically managed.

SP1.4 Manage Project Performance: Monitor the project to determine whether the project’s

objectives for quality and process performance will be satisfied, and identify corrective action as

appropriate.

SG2 Statistically Manage Subprocess Performance: The performance of selected subprocesses within the project’s defined process is statistically managed.

SP2.1 Select Measures and Analytic Techniques: Select the measures and analytic techniques to be

used in statistically managing the selected subprocesses.

SP2.2 Apply Statistical Methods to Understand Variation: Establish and maintain an understanding

of the variation of the selected subprocesses using the selected measures and analytic techniques.

SP2.3 Monitor Performance of the Selected Subprocesses: Monitor the performance of the selected subprocesses to determine their capability to satisfy their quality and process-performance objectives, and identify corrective action as necessary.

SP2.4 Record Statistical Management Data: Record statistical and quality management data in the

organization’s measurement repository.

Level 5: Causal Analysis and Resolution:

The purpose of Causal Analysis and Resolution is to identify causes of defects and other problems and take action to prevent them from occurring in the future. Specific Practices by Specific Goal:

SG1 Determine Causes of Defects: Root causes of defects and other problems are systematically

determined.

SP1.1 Select Defect Data for Analysis: Select the defects and other problems for analysis.

SP1.2 Analyze Causes: Perform causal analysis of selected defects and other problems and propose

actions to address them.

SG2 Address Causes of Defects: Root causes of defects and other problems are systematically

addressed to prevent their future occurrence.

SP2.1 Implement the Action Proposals: Implement the selected action proposals that were developed

in causal analysis.

SP2.2 Evaluate the Effect of Changes: Evaluate the effect of changes on process performance.

SP2.3 Record Data: Record causal analysis and resolution data for use across the project and

organization.

To address the basics of project management, the CMMI considers the following components for the management of IT processes [SEI 2002]:

[Figure: relationships among the project management process areas QPM, IPM (for IPPD), RSKM, and ISM and the process management, engineering and support, and basic project management process areas, exchanging items such as quantitative objectives, subprocesses to statistically manage, identified risks, risk status, risk mitigation plans, and the project’s defined process]

Here, QPM stands for Quantitative Project Management, IPM for Integrated Project Management, IPPD for Integrated Product and Process Development, RSKM for Risk Management, and ISM for Integrated Supplier Management.

In order to manage the software process quantitatively, the CMMI defines a set of example metrics. Some of these software measurement intentions are the following [SEI 2002]:

Examples of quality and process performance attributes for which needs and priorities might be

identified include the following:

o Functionality

o Reliability

o Maintainability

o Usability

o Duration

o Predictability

o Timeliness

o Accuracy

Examples of quality attributes for which objectives might be written include the following:

o Mean time between failures

o Critical resource utilization

o Number and severity of defects in the released product

o Number and severity of customer complaints concerning the provided service

Examples of process performance attributes for which objectives might be written include the following:

o Percentage of defects removed by product verification activities (perhaps by type of

verification, such as peer reviews and testing)

o Defect escape rates

o Number and density of defects (by severity) found during the first year following product

delivery (or start of service)

o Cycle time

o Percentage of rework time
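
Two of the process performance attributes above, the percentage of defects removed by a verification activity and the defect escape rate, can be made concrete with a small calculation. The following is only a minimal sketch: the function names and the chosen definition (defects found by the activity divided by all defects present when the activity ran) are assumptions made for illustration and are not prescribed by the CMMI.

```python
def removal_effectiveness(found_by_activity: int, escaped_later: int) -> float:
    """Percentage of the defects present that a verification activity removed.

    Assumed definition: found / (found + escaped), where 'escaped' counts
    defects that slipped through and were detected in a later phase.
    """
    total = found_by_activity + escaped_later
    if total == 0:
        raise ValueError("no defects recorded, the percentage is undefined")
    return 100.0 * found_by_activity / total


def escape_rate(found_by_activity: int, escaped_later: int) -> float:
    """Percentage of defects that escaped the verification activity."""
    return 100.0 - removal_effectiveness(found_by_activity, escaped_later)


if __name__ == "__main__":
    # Hypothetical peer review data: 45 defects found, 5 found later in test.
    print(removal_effectiveness(45, 5), escape_rate(45, 5))
```

Tracking these two values per verification type (for example, peer reviews versus testing) gives the kind of data series that the statistical methods in the later sections operate on.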

o Requirements

o Organization's quality and process-performance objectives

o Customer's quality and process-performance objectives

o Business objectives

o Discussions with customers and potential customers

o Market surveys

Examples of sources for criteria used in selecting subprocesses include the following:

o Customer requirements related to quality and process performance

o Quality and process-performance objectives established by the customer

o Quality and process-performance objectives established by the organization

o Organization’s performance baselines and models

o Stable performance of the subprocess on other projects

o Laws and regulations

o Defect density

o Cycle time

o Test coverage

o Inadequate stability and capability data in the organization’s measurement repository

o Subprocesses having inadequate performance or capability

o Suppliers not achieving their quality and process-performance objectives

o Lack of visibility into supplier capability

o Inaccuracies in the organization’s process performance models for predicting future

performance

o Deficiencies in predicted process performance (estimated progress)

o Other identified risks associated with identified deficiencies

Examples of actions that can be taken to address deficiencies in achieving the project’s objectives

include the following:

o Changing quality or process performance objectives so that they are within the expected

range of the project’s defined process

o Improving the implementation of the project’s defined process so as to reduce its normal

variability (reducing variability may bring the project’s performance within the objectives

without having to move the mean)

o Adopting new subprocesses and technologies that have the potential for satisfying the

objectives and managing the associated risks

o Identifying the risk and risk mitigation strategies for the deficiencies

o Terminating the project

o Requirements volatility

o Ratios of estimated to measured values of the planning parameters (e.g., size, cost, and

schedule)

o Coverage and efficiency of peer reviews

o Test coverage and efficiency

o Effectiveness of training (e.g., percent of planned training completed and test scores)

o Reliability

o Percentage of the total defects inserted or found in the different phases of the project life

cycle

o Percentage of the total effort expended in the different phases of the project life cycle

o Lack of process compliance

o Undistinguished influences of multiple underlying subprocesses on the data

o Ordering or timing of activities within the subprocess

o Uncontrolled inputs to the subprocess

o Environmental changes during subprocess execution

o Schedule pressure

o Inappropriate sampling or grouping of data

Examples of criteria for determining whether data are comparable include the following:

o Product lines

o Application domain

o Work product and task attributes (e.g., size of product)

o Size of project

Examples of where the natural bounds are calculated include the following:

o Control charts

o Confidence intervals (for parameters of distributions)

o Prediction intervals (for future outcomes)
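
As an illustration of how natural bounds can be calculated with a control chart, the following sketch computes the limits of an individuals (XmR) chart. It is one common choice among several chart types and is an assumption here, not a method named by the CMMI text; the function names are hypothetical. The constant 2.66 is the usual XmR factor 3/d2 with d2 = 1.128 for moving ranges of two consecutive observations.

```python
from statistics import mean


def xmr_limits(values):
    """Return (lower, center, upper) natural process limits of an individuals chart."""
    if len(values) < 2:
        raise ValueError("need at least two observations")
    center = mean(values)
    # Moving range: absolute difference between consecutive observations.
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    half_width = 2.66 * mean(moving_ranges)
    return center - half_width, center, center + half_width


def out_of_limits(values):
    """Indices of observations outside the natural process limits."""
    lower, _, upper = xmr_limits(values)
    return [i for i, v in enumerate(values) if v < lower or v > upper]


if __name__ == "__main__":
    # Hypothetical review effort per inspection, with one unusual value.
    effort = [10, 10, 10, 10, 10, 10, 10, 10, 10, 30]
    print(xmr_limits(effort), out_of_limits(effort))
```

Points outside the limits are candidates for special causes of variation; whether they really are must still be decided by analysing the subprocess.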

Examples of techniques for analyzing the reasons for special causes of variation include the following:

o Cause-and-effect (fishbone) diagrams

o Designed experiments

o Control charts (applied to subprocess inputs or to lower level subprocesses)

o Subgrouping (analyzing the same data segregated into smaller groups based on an

understanding of how the subprocess was implemented facilitates isolation of special

causes)

Examples of when the natural bounds may need to be recalculated include the following:

o There are incremental improvements to the subprocess

o New tools are deployed for the subprocess

o A new subprocess is deployed

o The collected measures suggest that the subprocess mean has permanently shifted or the

subprocess variation has permanently changed
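
One way to obtain evidence that the subprocess mean has shifted is a run rule: a long run of consecutive observations on the same side of the center line. The sketch below uses a run length of eight, a frequently used convention; both the rule and the function name are assumptions for illustration and are not taken from the CMMI text.

```python
def sustained_shift(values, center, run_length=8):
    """Return the index at which a run of `run_length` consecutive points
    on one side of `center` is completed, or None if no such run exists."""
    run = 0
    side = 0  # +1 above the center line, -1 below, 0 on the line
    for i, v in enumerate(values):
        current = (v > center) - (v < center)
        if current != 0 and current == side:
            run += 1
        else:
            # The run is broken: restart counting on the new side.
            side = current
            run = 1 if current != 0 else 0
        if run >= run_length:
            return i
    return None


if __name__ == "__main__":
    data = [9, 11, 10.5, 11, 12, 11, 11.5, 12, 11, 12]
    print(sustained_shift(data, center=10))
```

A detected run is a signal to investigate, and only after the cause is understood should the natural bounds be recalculated from the data collected after the change.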

Examples of actions that can be taken when a selected subprocess’ performance does not satisfy its

objectives include the following:

o Changing quality and process-performance objectives so that they are within the

subprocess’ process capability

o Improving the implementation of the existing subprocess so as to reduce its normal

variability (reducing variability may bring the natural bounds within the objectives without

having to move the mean)

o Adopting new process elements and subprocesses and technologies that have the potential

for satisfying the objectives and managing the associated risks

o Identifying risks and risk mitigation strategies for each subprocess’ process capability

deficiency

o System dynamics models

o Automated test-coverage analyzers

o Statistical process and quality control packages

o Statistical analysis packages

o Process modelling and analysis

o Process measurement data selection, definition, and collection

Examples of work products placed under configuration management include the following:

o Subprocesses to be included in the project’s defined process

o Operational definitions of the measures, their collection points in the subprocesses, and

how the integrity of the measures will be determined

o Collected measures

o Establishing project objectives

o Resolving issues among the project’s quality and process-performance objectives

o Appraising performance of the selected subprocesses

o Identifying and managing the risks in achieving the project’s quality and process-

performance objectives

o Identifying what corrective action should be taken

o Profile of subprocesses under statistical management (e.g., number planned to be under

statistical management, number currently being statistically managed, and number that are

statistically stable)

o Number of special causes of variation identified

o Quantitatively managing the project using quality and process-performance objectives

o Statistically managing selected subprocesses within the project’s defined process

o Subprocesses to be included in the project’s defined process

o Operational definitions of the measures

o Collected measures

Based on these quantifications, the CMMI defines: “A ‘managed process’ is a performed process that is planned and executed in accordance with policy; employs skilled people having adequate resources to produce controlled outputs; involves relevant stakeholders; is monitored, controlled, and reviewed; and is evaluated for adherence to its process description.”

2 Software Measurement Intentions

2.1 The CAME Measurement Framework

The following measurement and evaluation framework, which addresses the software product, process, and resources, was developed at the University of Magdeburg [Dumke 1999]. The measurement framework is embedded in aspects of IT strategy in organizations and societies, as shown in Figure 4.

[Figure: layers labelled Society, Organization, IT area, CAME strategy, CAME framework, and CAME tools]

Figure 4: Main areas relating to the software measurement and evaluation framework

We briefly describe some essential aspects of this framework and the characteristics of its environments. The CAME strategy is related to experience with measurement frameworks or metrics programs that are embedded in the enterprise area ([Dumke 2002], [Eickelmann 2000], [Fehrling 2003], [Kitchenham 1997], [Munson 2003]) and stands for

• Community: the necessity of a group or a team that is motivated and has the knowledge of software

measurement to install software metrics. In general, the members of these groups are organised in metrics

communities such as our German Interest Group on Software Metrics.

• Acceptance: the agreement of the (top) management to install a metrics program in the (IT) business area. This aspect is strongly connected with knowledge about the required budgets and personnel resources.

• Motivation: the production of measurement and evaluation results in a first metrics application that demonstrates the convincing benefits of the metrics application. This very important aspect can be achieved by applying essential results from (world-wide) practice that are easy to understand and should motivate the management. One of the problems of this aspect is that the management wants to obtain one single (quality) number as a summary of all measured characteristics.

• Engagement: the acceptance of spending effort to implement the software measurement as a permanent

metrics system (with continued measurement, different statistical analysis, metrics set updates etc.). This

aspect includes also the requirement to dedicate personnel resources such as measurement teams etc.

The CAME framework consists of the following four phases which are defined to install a metrics program in

the IT area and which can be used to evaluate the measurement level of this metrics program itself (see also

[Dumke 2001], [Fenton 1997], [Kitchenham 1995], [Putnam 2003], [Zuse 1998]):

• Choice: the selection of metrics based on a special or general measurement view on the kind of

measurement and the related measurement goals,

• Adjustment: the investigation and definition of the measurement characteristics of the metrics for the

specific application field,

• Migration: the installation of a high metrication coverage based on semantic relations between the

metrics along the whole life cycle and along the system architecture,

• Efficiency: the automation level of the construction of a tool-based measurement for the used metrics.

The phases of this framework are explained in the following sections, including detailed aspects of software measurement evaluation and the role of the CAME tools.

The Measurement Choice step, that is, the choice of the metrics to use, involves the following two essential questions:

Obviously, we only want to measure what is necessary. But in most software engineering areas, this aspect is unknown (especially for modern software development paradigms or methodologies such as software agents and multi-agent systems). The first framework step includes the choice of the software metrics and measures. Therefore, we must define the set of software metrics explicitly [Dumke 2003]. The structure of this set of metrics is based on the following classification principles:

software product measurement and evaluation is based on the three components: model,

implementation and documentation (see Figure 5),

[Figure 5: components of the software product considered for measurement and evaluation: model, implementation, and documentation]

Note that the metrication process depends on the kind of development method, the application area of the software system, the implementation paradigm, etc.

software process measurement and evaluation is based on the process aspects: controlling, phases/steps

and methodologies (see Figure 6),

[Figure 6: aspects of the software process considered for measurement and evaluation: controlling (e.g., quality and configuration management), phases (requirement analysis/specification, design, implementation, field test, maintenance), and methodologies (e.g., paradigm, implementation methodology, CASE support)]

software resources measurement and evaluation is based on the three resource parts: personnel,

software and hardware (see Figure 7).

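[Figure 7: software resources considered for measurement and evaluation: personnel (user, customer, development team, test team, maintenance team), software (COTS, CASE, system software), and hardware (computers, peripherals, networks, architectures)]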

Our framework starts with the investigation of the chosen metrics and assumes an underlying choice method such as

• the general measurement goal planning by [Basili 1986] (see also [Wohlin 2000]), which considers different measurement goals such as understanding of systems, assessment, proof of hypotheses, understanding of metrics, etc.,

• the Goal Question Metric (GQM) paradigm [Solingen 1999], which is directed at the improvement of a special aspect or component of the software system related to a special goal.
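
The GQM idea of deriving metrics from questions, and questions from a goal, can be sketched as a small data structure. The class names, the helper function, and the example goal below are hypothetical and serve only to illustrate the refinement; they are not taken from [Solingen 1999].

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Question:
    text: str
    metrics: List[str] = field(default_factory=list)


@dataclass
class Goal:
    statement: str
    questions: List[Question] = field(default_factory=list)


def metrics_to_collect(goal: Goal) -> List[str]:
    """Distinct metrics needed to answer all questions of a goal, in first-seen order."""
    seen: List[str] = []
    for question in goal.questions:
        for metric in question.metrics:
            if metric not in seen:
                seen.append(metric)
    return seen


# Hypothetical example goal with two questions that share one metric.
goal = Goal(
    statement="Improve the effectiveness of peer reviews",
    questions=[
        Question("How many defects do reviews find?",
                 ["defects found per review", "review effort (hours)"]),
        Question("How many defects escape reviews?",
                 ["defects found in test", "defects found per review"]),
    ],
)

if __name__ == "__main__":
    print(metrics_to_collect(goal))
```

The point of the structure is that every collected metric can be traced back to at least one question and goal, so nothing is measured without a stated purpose.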

The measurement choice step defines the static characteristics of the software measurement process [Feiler 1993]. Note that the choice of software metrics or software measures determines which areas of the IT department are under control and which are not.

The Measurement Adjustment is related to the experience (expressed in values) of the measured attributes for

the evaluation. The adjustment includes the metrics validation ([Card 2000], [Kitchenham 1995], [Zelkowitz

1997]) and the determination of the metrics algorithm based on the measurement theory ([Henderson 1996],

[Zuse 2003]). The steps in the measurement adjustment are

• the determination of the scale type and (if possible) the unit,

• the determination of the favourable values (as thresholds) for the evaluation of the measurement

component, e. g. by

o application of a metrics tool for a chosen software product that was classified as a ‘good

qualitative’ example,

• the calibration of the scale (as a transformation of the numerical scale part to the empirical one), which depends on the improvement of knowledge in the problem domain.

In the adjustment step, we mainly consider the metrics characteristics that address the qualitative evaluation (nominal and ordinal scale types) or the quantitative evaluation (interval or ratio scale types).
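
The practical consequence of the scale type is that it restricts which statistics are meaningful. The following minimal sketch follows the usual measurement-theory convention (mode for nominal, median for ordinal, arithmetic mean for interval and ratio data); the enum and the function names are assumptions for illustration.

```python
import statistics
from enum import IntEnum


class Scale(IntEnum):
    NOMINAL = 1
    ORDINAL = 2
    INTERVAL = 3
    RATIO = 4


def central_tendency(values, scale: Scale):
    """Return a measure of central tendency that is admissible for the scale type."""
    if scale == Scale.NOMINAL:
        return statistics.mode(values)        # only equality is defined
    if scale == Scale.ORDINAL:
        return statistics.median_low(values)  # order is defined, distances are not
    return statistics.mean(values)            # distances are defined


def ratios_are_meaningful(scale: Scale) -> bool:
    """Statements such as 'twice as large' require a ratio scale."""
    return scale == Scale.RATIO


if __name__ == "__main__":
    print(central_tendency(["C", "Java", "Java"], Scale.NOMINAL))
    print(central_tendency([1, 2, 2, 5], Scale.ORDINAL))
    print(central_tendency([2, 4, 6], Scale.INTERVAL))
```

For example, averaging ordinal severity classes with the arithmetic mean would suggest a precision that the scale does not support, which is why the scale type is determined first.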

The Measurement Migration step is aimed at the dynamic aspects of the measurement framework or metrics program. This means that we must install a metrics-based network over the software product, process, and resources components as an Internal Measurement Process (IMP). We “migrate” the idea of metrication to all components of software development and maintenance. Note that most existing software measurement approaches or frameworks do not consider this step explicitly. First intentions of this idea are described as complexity traces in [Ebert 1993], as measurement through the life cycle in [Cool 1993], and as granularity of object-oriented systems in [Abreu 1995]. Some examples of these kinds of migration for software products are [Dumke 1999]

• metrics tracing along the software life cycle, e. g. #notions (problem definition) → #classes

(specification) → #new-defined-classes (design) → #implemented-classes (implementation),

• metrics refinement along the software life cycle, e. g. informal description of a specified service (text

metrics) → PDL description of a service (design metrics) → Java form of a service (code metrics),

• metrics granulation related to the architecture, e. g. in an object-oriented development as the system, the

component, the class/object and the method.

In the process and resources area the semantic characteristics such as process phases and resources versions are

also considered. Observing the software metrics as class hierarchy, we can understand the measurement

migration as the definition and design of the metrics behaviour.

On the other hand, the migration step includes the definition and installation of the External Measurement

Process (EMP) as software measurement integration. This means that we must consider the final goals of

software measurement in the IT area. Hence, we need all of the process steps such as measurement, evaluation,

exploitation and application (assessment, decision support, improvement) in a persistent manner ([Eickelmann

2000], [Jacquet 1997], [Wohlin 2000]).


The Measurement Efficiency step includes the instrumentation or the automation of the measurement process

by tools. It requires analysing the algorithmic character of the software metrics and the possibility of the

integration of tool-based ‘control cycles’ in the software development or maintenance process. We will call the

metrics tools CAME (Computer Assisted software Measurement and Evaluation) tools [Dumke 1996]. In

most cases, it is necessary to combine different metrics tools and techniques related to the measurement phases.

in physics. Hence, we must consider in the software development the rules of

thumb, statements of trends, analogue conclusions, expertise, estimations and

predictions also ([Dumke 2003], [Endres 2003]).

the system of measures. Therefore, we must use the general techniques of

assessment (continuous, periodic or certified), general evaluation, experiences

and experimentation. Sometimes, the experimentation is not immediately

used for decision support, improvement or controlling. We also use the

experimentation for understanding of new paradigms or the cognition of new

kinds of problems ([Basili 1986], [Wohlin 2000]).

analogy such as the column of mercury to measure the temperature. In most

cases, software measurement is counting [Kitchenham 1995].

values or thresholds. Software measurement can be a generic measurement

and analysis process ([Card 2000], [Jacquet 1997]).

experiments, industrial case studies and benchmarking exercises or surveys

([Juristo 2003], [Kitchenham 1997]).

⇒ “In software engineering metrics area, should place more emphasis on the

validity of the mathematical (and statistical) tools which have been (and are

currently being) used in their development and use. Areas which give cause

for concern in the past include the use of dimensionally incorrect equations,

incorrect plotting of equations and consequent incorrect inferences, the

sloppy use of mathematical notation and of calculated values and the lack of

underpinning mathematical models.” [Henderson 1996]

Hence, the software metrics application based on different methodologies or frameworks requires statistical

methods ([Juristo 2003], [Munson 2003], [Pandian 2003], [Sigpurwalla 1999], [Wohlin 2000], [Zuse 1998]).


2.2 The CMMI Metrics Set by Kulpa and Johnson

The following set of metrics is defined by Kulpa and Johnson in order to keep the quantified requirements for the

different CMMI levels [Kulpa 2003].

CMMI Level 2:

Requirements Management

1. Requirements volatility (percentage of requirements changes)

2. Number of requirements by type or status (defined, reviewed, approved, and implemented)

3. Cumulative number of changes to the allocated requirements, including total number of changes

proposed, open, approved, and incorporated into the system baseline

4. Number of change requests per month, compared to the original number of requirements for the

project

5. Amount of time spent, effort spent, cost of implementing change requests

6. Number and size of change requests after the Requirements phase is completed

7. Cost of implementing a change request

8. Number of change requests versus the total number of change requests during the life of the

project

9. Number of change requests accepted but not implemented

10. Number of requirements (changes and additions to the baseline)

Project Planning

11. Completion of milestones for the project planning activities compared to the plan (estimates

versus actuals)

12. Work completed, effort and funds expended in the project planning activities compared to the

plan

13. Number of revisions to the project plan

14. Cost, schedule, and effort variance per plan revision

15. Replanning effort due to change requests

16. Effort expended over time to manage the project compared to the plan

17. Frequency, causes, and magnitude of the replanning effort

18. Effort and other resources expended in performing monitoring and oversight activities

19. Change activity for the project plan, which includes changes to size estimates of the work

products, cost/resource estimates, and schedule

20. Number of open and closed corrective actions or action items

21. Project milestone dates (planned versus actual)

22. Number of project milestone dates made on time

23. Number and types of reviews performed

24. Schedule, budget, and size variance between planned and actual reviews

25. Comparison of actuals versus estimates for all planning and tracking items

26. Number of projects using progress and performance measures

27. Number of measurement objectives addressed

28. Cost of the COTS (commercial off-the-shelf) products

29. Cost and effort to incorporate the COTS products into the project

30. Number of changes made to the supplier requirements

31. Cost and schedule variance per supplier agreement

32. Costs of the activities for managing the contract compared to the plan

33. Actual delivery dates for contracted products compared to the plan

34. Actual dates of prime contractor deliveries to the subcontractor compared to the plan

35. Number of on-time deliveries from the vendor, compared with the contract

36. Number and severity of errors found after delivery

37. Number of exceptions to the contract to ensure schedule adherence

38. Number of quality audits compared to the plan


39. Number of Senior Management reviews to ensure adherence to budget and schedule versus the

plan

40. Number of contract violations by supplier or vendor

41. Completions of milestones for the QA activities compared to the plan

42. Work completed, effort expended in the QA activities compared to the plan

43. Number of product audits and activity reviews compared to the plan

44. Number of process audits and activities versus those planned

45. Number of defects per release and/or build

46. Amount of time/effort spent in rework

47. Amount of QA time/effort spent in each phase of the life cycle

48. Number of reviews and audits versus number of defects found

49. Total number of defects found in internal reviews and testing versus those found by the customer or end

user after delivery

50. Number of defects found in each phase of the life cycle

51. Number of defects injected during each phase of the life cycle

52. Number of noncompliances written versus the number resolved

53. Number of noncompliances elevated to senior management

54. Complexity of module or component (McCabe, McClure, and Halstead metrics)

55. Number of change requests or change board requests processed per unit of time

56. Completions of milestones for the CM activities compared to the plan

57. Work completed, effort expended, and funds expended in the CM activities

58. Number of changes to configuration items

59. Number of configuration audits conducted

60. Number of fixes returned as "Not Yet Fixed"

61. Number of fixes returned as "Could Not Reproduce Error"

62. Number of violations of CM procedures (noncompliance found in audits)

63. Number of outstanding problem reports versus rate of repair

64. Number of times changes are overwritten by someone else (or number of times people have the wrong

initial version or baseline)

65. Number of engineering change proposals proposed, approved, rejected, implemented

66. Number of changes by category to code source, and to supporting documentation

67. Number of changes by category, type, and severity

68. Source lines of code stored in libraries placed under configuration control

CMMI Level 3:

Requirements Development

69. Cost, schedule, and effort expended for rework

70. Defect density of requirements specifications

71. Number of requirements approved for build (versus the total number of requirements)

72. Actual number of requirements documented (versus the total number of estimated requirements)

73. Staff hours (total and by Requirements Development activity)

74. Requirements status (percentage of defined specifications out of the total approved and proposed;

number of requirements defined)

75. Estimates of total requirements, total requirements definition effort, requirements analysis effort, and

schedule

76. Number and type of requirements changes

Technical Solution

77. Cost, schedule, and effort expended for rework

78. Number of requirements addressed in the product or product component design

79. Size and complexity of the product, product components, interfaces, and documentation

80. Defect density of technical solutions work products (number of defects per page)

81. Number of requirements by status or type throughout the life of the project (for example, number

defined, approved, documented, implemented, tested, and signed-off by phase)

82. Problem reports by severity and length of time they are open


83. Number of requirements changed during implementation and test

84. Effort to analyze proposed changes for each proposed change and cumulative totals

85. Number of changes incorporated into the baseline by category (e.g., interface, security, system

configuration, performance, and useability)

86. Size and cost to implement and test incorporated changes, including initial estimate and actual size

and cost

87. Estimates and actuals of system size, reuse, effort, and schedule

88. The total estimated and actual staff hours needed to develop the system by job category and activity

89. Estimated dates and actuals for the start and end of each phase of the life cycle

90. Number of diagrams completed versus the estimated total diagrams

91. Number of design modules/units proposed

92. Number of design modules/units delivered

93. Estimates and actuals of total lines of code - new, modified, and reused

94. Estimates and actuals of total design and code modules and units

95. Estimates and actuals for total CPU hours used to date

96. The number of units coded and tested versus the number planned

97. Errors by category, phase discovered, phase injected, type, and severity

98. Estimates of total units, total effort, and schedule

99. System tests planned, executed, passed, or failed

100. Test discrepancies reported, resolved, or not resolved

101. Source code growth by percentage of planned versus actual

Product Integration

102. Product-component integration profile (i.e., product-component assemblies planned and performed,

and number of exceptions found)

103. Integration evaluation problem report trends (e.g., number written and number closed)

104. Integration evaluation problem report aging (i.e., how long each problem report has been open)

Verification

105. Verification profile (e.g., the number of verifications planned and performed, and the defects found;

perhaps categorized by verification method or type)

106. Number of defects detected by defect category

107. Verification problem report trends (e.g., number written and number closed)

108. Verification problem report status (i.e., how long each problem report has been open)

109. Number of peer reviews performed compared to the plan

110. Overall effort expended on peer reviews compared to the plan

111. Number of work products reviewed compared to the plan

Validation

112. Number of validation activities completed (planned versus actual)

113. Validation problem reports trends (e.g., number written and number closed)

114. Validation problem report aging (i.e., how long each problem report has been open)

115. Number of process improvement proposals submitted, accepted, or implemented

116. CMMI maturity or capability level

117. Work completed, effort and funds expended in the organization's activities for process assessment,

development, and improvement compared to the plans for these activities

118. Results of each process assessment, compared to the results and recommendations of previous

assessments

119. Percentage of projects using the process architectures and process elements of the organization's set

of standard processes

120. Defect density of each process element of the organization's set of standard processes

121. Number of on-schedule milestones for process development and maintenance

122. Costs for the process definition activities

Organizational Training

123. Number of training courses delivered (e.g., planned versus actual)

124. Post-training evaluation ratings

125. Training program quality surveys


126. Actual attendance at each training course compared to the projected attendance

127. Progress in improving training courses compared to the organization's and projects' training plans

128. Number of training waivers approved over time

129. Number of changes to the project's defined process

130. Effort to tailor the organization's set of standard processes

131. Interface coordination issue trends (e.g., number identified and closed)

Risk Management

132. Number of risks identified, managed, tracked, and controlled

133. Risk exposure and changes to the risk exposure for each assessed risk, and as a summary percentage

of management reserve

134. Change activity for the risk mitigation plans (e.g., processes, schedules, funding)

135. Number of occurrences of unanticipated risks

136. Risk categorization volatility

137. Estimated versus actual risk mitigation effort

138. Estimated versus actual risk impact

139. The amount of effort and time spent on risk management activities versus the number of actual risks

140. The cost of risk management versus the cost of actual risks

141. For each identified risk, the realized adverse impact compared to the estimated impact

Integrated Teaming

142. Performance according to plans, commitments, and procedures for the integrated team, and

deviations from expectations

143. Number of times team objectives were not achieved

144. Actual effort and other resources expended by one group to support another group or groups, and

vice versa

145. Actual completion of specific tasks and milestones by one group to support the activities of other

groups, and vice versa

146. Effort expended to manage the evaluation of sources and selection of suppliers

147. Number of changes to the requirements in the supplier agreement

148. Number of documented commitments between the project and the supplier

149. Interface coordination issue trends (e.g., number identified and number closed)

150. Number of defects detected in supplied products (during integration and after delivery)

151. Cost-to-benefit ratio of using formal evaluation processes

152. Parameters for key operating characteristics of the work environment

CMMI Level 4:

153. Trends in the organization's process performance with respect to changes in work products and task

attributes (e.g., size growth, effort, schedule, and quality)

154. Time between failures

155. Critical resource utilization

156. Number and severity of defects in the released product

157. Number and severity of customer complaints concerning the provided service

158. Number of defects removed by product verification activities (perhaps by type of verification, such

as peer reviews and testing)

159. Defect escape rates

160. Number and density of defects by severity found during the first year following product delivery or

start of service


161. Cycle time

162. Amount of rework time

163. Requirements volatility (i.e., number of requirements changes per phase)

164. Ratios of estimated to measured values of the planning parameters (e.g., size, cost, and schedule)

165. Coverage and efficiency of peer reviews (i.e., number/amount of products reviewed compared to

total number, and number of defects found per hour)

166. Test coverage and efficiency (i.e., number/amount of products tested compared to total number, and

number of defects found per hour)

167. Effectiveness of training (i.e., percent of planned training completed and test scores)

168. Reliability (i.e., mean time-to-failure usually measured during integration and systems test)

169. Percentage of the total defects inserted or found in the different phases of the project life cycle

170. Percentage of the total effort expended in the different phases of the project life cycle

171. Profile of subprocesses under statistical management (i.e., number planned to be under statistical

management, number currently being statistically managed, and number that are statistically

stable)

172. Number of special causes of variation identified

173. The cost over time for the quantitative process management activities compared to the plan

174. The accomplishment of schedule milestones for quantitative process management activities

compared to the approved plan (i.e., establishing the process measurements to be used on the

project, determining how the process data will be collected, and collecting the process data)

175. The cost of poor quality (e.g., amount of rework, re-reviews and re-testing)

176. The costs for achieving quality goals (e.g., amount of initial reviews, audits, and testing)

CMMI Level 5:

Organizational Innovation and Deployment

177. Change in quality after improvements (e.g., number of reduced defects)

178. Change in process performance after improvements (e.g., change in baselines)

179. The overall technology change activity, including number, type, and size of changes

180. The effect of implementing the technology change compared to the goals (e.g., actual cost saving to

projected)

181. The number of process improvement proposals submitted and implemented for each process area

182. The number of process improvement proposals submitted by each project, group, and department

183. The number and types of awards and recognitions received by each of the projects, groups, and

departments

184. The response time for handling process improvement proposals

185. Number of process improvement proposals accepted per reporting period

186. The overall change activity including number, type, and size of changes

187. The effect of implementing each process improvement compared to its defined goals

188. Overall performance of the organization's and projects' processes, including effectiveness, quality,

and productivity compared to their defined goals

189. Overall productivity and quality trends for each project

190. Process measurements that relate to the indicators of the customers' satisfaction (e.g., surveys results,

number of customer complaints, and number of customer compliments)

191. Defect data (problem reports, defects reported by the customer, defects reported by the user, defects

found in peer reviews, defects found in testing, process capability problems, time and cost for

identifying the defect and fixing it, estimated cost of not fixing the problem)

192. Number of root causes removed

193. Change in quality or process performance per instance of the causal analysis and resolution process

(e.g., number of defects and changes in baseline)

194. The costs of defect prevention activities (e.g., holding causal analysis meetings and implementing

action items), cumulatively

195. The time and cost for identifying the defects and correcting them compared to the estimated cost of

not correcting the defects

196. Profiles measuring the number of action items proposed, open, and completed

197. The number of defects injected in each stage, cumulatively, and over-releases of similar products

198. The number of defects
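Several of the metrics listed above are simple ratios. The following minimal sketch shows how three of them (requirements volatility, defect escape rate, and peer review efficiency) might be computed. The function names, the choice of denominators, and the sample numbers are assumptions made for illustration; they are not definitions taken from [Kulpa 2003].

```python
def requirements_volatility(changed: int, baseline_total: int) -> float:
    """Percentage of baseline requirements changed in a period (assumed definition)."""
    if baseline_total <= 0:
        raise ValueError("baseline_total must be positive")
    return 100.0 * changed / baseline_total


def defect_escape_rate(found_internally: int, found_after_release: int) -> float:
    """Share of all known defects that were found only after release (assumed definition)."""
    total = found_internally + found_after_release
    if total == 0:
        return 0.0
    return found_after_release / total


def review_efficiency(defects_found: int, review_hours: float) -> float:
    """Defects found per hour of peer review."""
    if review_hours <= 0:
        raise ValueError("review_hours must be positive")
    return defects_found / review_hours


# Hypothetical project data, for illustration only.
monthly_changes = {"Jan": 6, "Feb": 3, "Mar": 12}
volatility = {month: requirements_volatility(n, 120) for month, n in monthly_changes.items()}
```

Which denominator is appropriate (for example, the original baseline or the current number of requirements) is an organizational decision that belongs in the operational definition of the measure.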


2.3 The CMMI-Based Organization’s Measurement Repository

The following section includes the main activities for defining and implementing measurement repositories

used in an organizational context. The repository contains both product and process measures that are related to

the organization's set of standard processes ([SEI 2002]). It also contains or refers to the information needed to

understand and interpret the measures and assess them for reasonableness and applicability. For example, the

definitions of the measures are used to compare similar measures from different processes.

1. Definition of the common set of product and process measures for the organization's set of standard

processes

3. Organization's measurement repository (i.e., the repository structure and support environment)

Subpractices:

1. Determine the organization's needs for storing, retrieving, and analyzing measurements.

2. Define a common set of process and product measures for the organization's set of standard

processes. The measures in the common set are selected based on the organization's set of standard

processes. The common set of measures may vary for different standard processes. Operational

definitions for the measures specify the procedures for collecting valid data and the point in the

process where the data will be collected. Examples of classes of commonly used measures include

the following:

Estimates of work product size (e.g., pages)

Estimates of effort and cost (e.g., person hours)

Actual measures of size, effort, and cost

Quality measures (e.g., number of defects found, severity of defects)

Peer review coverage

Test coverage

Reliability measures (e.g., mean time to failure).

Refer to the Measurement and Analysis process area for more information about defining measures.

5. Conduct peer reviews on the definitions of the common set of measures and the procedures for

storing and retrieving measures. Refer to the Verification process area for more information about

conducting peer reviews.

6. Enter the specified measures into the repository. Refer to the Measurement and Analysis process

area for more information about collecting and analyzing data.

7. Make the contents of the measurement repository available for use by the organization and projects

as appropriate.

8. Revise the measurement repository, common set of measures, and procedures as the organization’s

needs change. Examples of when the common set of measures may need to be revised include the

following:

New processes are added

Processes are revised and new product or process measures are needed

Finer granularity of data is required

Greater visibility into the process is required

Measures are retired.
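The subpractices above can be illustrated by a very small in-memory repository. This is only a sketch under stated assumptions: the class names, field names, and methods are hypothetical and are not prescribed by [SEI 2002]; a real repository would also need persistence and access control.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class MeasureDefinition:
    """Operational definition of one measure in the common set (hypothetical fields)."""
    name: str
    unit: str
    collection_point: str  # point in the process where the data are collected
    procedure: str         # how valid data are collected


@dataclass
class MeasurementRepository:
    definitions: dict = field(default_factory=dict)
    values: dict = field(default_factory=dict)

    def define(self, definition: MeasureDefinition) -> None:
        """Add a measure to the common set."""
        self.definitions[definition.name] = definition
        self.values.setdefault(definition.name, [])

    def record(self, name: str, project: str, value: float) -> None:
        """Enter a value; only measures with a definition are accepted."""
        if name not in self.definitions:
            raise KeyError(f"no definition for measure {name!r}")
        self.values[name].append((project, value))

    def retrieve(self, name: str, project=None) -> list:
        """Return all values of a measure, optionally restricted to one project."""
        rows = self.values.get(name, [])
        return [v for p, v in rows if project is None or p == project]


repo = MeasurementRepository()
repo.define(MeasureDefinition("defects_found", "count", "peer review", "count logged defects"))
repo.record("defects_found", "project A", 4)
repo.record("defects_found", "project B", 7)
```

Storing the definition together with the values is what allows similar measures from different processes to be compared.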


3 The Statistical Process Control (SPC)

3.1 Foundations of the Statistical Process Control

This section gives a short description of Statistical Process Control (SPC) based on [Kulpa 2003]. SPC is

often the most dreaded of all subjects when discussing process improvement because it involves numbers, and

then scrutinizing the numbers to determine whether they are correctly collected, reported, and used

throughout the organization. Many organizations collect metrics; here we summarize the best practices that can be

found in other organizations. We describe the different types of charts and discuss reasons for using the

charts and reasons for collecting data.

SPC consists of some techniques used to help individuals understand, analyze, and interpret numerical

information. SPC is used to identify and track variation in processes. All processes will have some natural

variation. Due to the normal variation in any process, the numbers (in this example, the number of cars waiting

at the stoplight, the number of accidents that may occur) can change when the process really has not. So, we

need to understand both the numbers relating to our processes and the changes that occur in our processes so that

we may respond appropriately.

Other terms that you may see are common causes of variation and special causes of variation, as well as common

cause systems and special cause systems. Common causes of variation result from such things as system design

decisions and the use of one development tool over another. This variation will occur predictably across the

entire process associated with it and is considered normal variation. Special causes of variation are those that

arise from such things as inconsistent process execution and lack of resources. This variation is exceptional

variation and is also known as assignable causes of variation. We will use both terms. Other terms you will hear

are in control for predictable processes or steady-state; and out of control for unpredictable processes that are

“outside the natural limits.”

When a process is predictable, it exhibits routine variation as a result of common causes. When a process is

unpredictable, it exhibits exceptional variation as a result of assignable causes. It is our job to be able to tell the

difference and to find the assignable cause. When a process is predictable, it is performing as consistently as it

can (either for better or for worse). It will not be performing perfectly; there will always be some normal, routine

variation. Looking for assignable causes for processes that are running predictably is a waste of time because

you will not find any. Work instead on improving the process itself. When a process is unpredictable, that means

it is not operating consistently. It is a waste of time to try to improve the process itself. In this case, you must

find out why it is not operating predictably and detail the “whys” as specifically as possible. To do that, you

must find and fix the assignable cause(s); that is, the activity that is causing the process to behave erratically.

In contrast to the predictability of a process, we may want to consider if a process is capable of delivering what

is needed by the customer. Capable processes perform within the specification limits set by the customer. So, a

process may be predictable, but not capable.
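The distinction between predictable and capable can be sketched as two independent checks. The function and parameter names below are hypothetical, and the natural process limits are assumed to be given (for example, from a control chart).

```python
def is_predictable(values, lower_natural, upper_natural):
    """True if every observation lies within the natural process limits."""
    return all(lower_natural <= v <= upper_natural for v in values)


def is_capable(lower_natural, upper_natural, lower_spec, upper_spec):
    """True if the natural process limits lie within the customer's specification limits."""
    return lower_spec <= lower_natural and upper_natural <= upper_spec


# Hypothetical numbers: the process is stable between 2 and 12,
# but the customer only accepts values between 0 and 10.
observations = [4, 7, 9, 11, 6]
predictable = is_predictable(observations, 2, 12)
capable = is_capable(2, 12, 0, 10)
```

In this example the process is predictable but not capable, which is the case described in the paragraph above.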

Usually, there are seven commonly recognized tools or diagrams for statistical process control:

1. Check sheet

2. Run chart

3. Histogram

4. Pareto chart

5. Scatter diagram/chart

6. Cause and effect or fishbone diagram

7. Control chart

Some basic examples are shown in the following; we cite them from [Kulpa 2003] only to illustrate

the general characteristics.

Check Sheet: The check sheet (see Table 1) is used for counting and accumulating data in a general or

special context.


Table 1: Check sheet Used for Counting and Accumulating Data

Run Chart: The run chart (see Figure 8) tracks trends over a period of time. Points are tracked in the order

in which they occur. Each point represents an observation. You can often see interesting trends in the data

by simply plotting data on a run chart. A danger in using run charts is that you might overreact to normal

variations, but it is often useful to put your data on a run chart to get a feel for process behaviour.

Histogram: The histogram (see Figure 9) is a bar chart that presents data that have been collected over a

period of time, and graphically presents these data by frequency. Each bar represents the number of

observations that fit within the indicated range. Histograms are useful because they can be used to see the

amount of variation in a process. The data in this histogram are the same data as in the run chart in

Figure 8. Using the histogram, you get a different perspective on the data. You see how often similar values occur

and get a quick idea of how the data are distributed.
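Grouping observations into ranges, as a histogram does, can be sketched in a few lines. The bin width and the sample values are assumptions made for illustration; they are not the data behind Figure 9.

```python
from collections import Counter


def histogram(values, bin_width):
    """Count observations per bin; each key is the lower edge of its bin."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))


# Hypothetical observations (e.g., defects found per review).
observations = [12, 15, 17, 21, 22, 22, 29, 31]
bins = histogram(observations, 10)
```

Here `bins` maps 10 to 3 observations, 20 to 4, and 30 to 1, so most values fall in the range from 20 to 29.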


Pareto Chart: The Pareto chart (see Figure 10) is a bar chart that presents data prioritized in some

fashion, usually either by descending or ascending order of importance. Pareto diagrams are used to show

attribute data. Attributes are qualitative data that can be counted for recording and analysis; for example,

counting the number of each type of defect. Pareto charts are often used to analyze the most often occurring

type of something.
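The ordering behind a Pareto chart can be sketched as follows: categories are sorted by descending count and a cumulative percentage is added. The defect categories and counts are hypothetical.

```python
def pareto(counts):
    """Return (category, count, cumulative percent) rows in descending order of count."""
    total = sum(counts.values())
    ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    rows, running = [], 0
    for category, n in ordered:
        running += n
        rows.append((category, n, round(100.0 * running / total, 1)))
    return rows


# Hypothetical defect counts by type.
defects_by_type = {"interface": 5, "logic": 20, "documentation": 10, "data": 15}
rows = pareto(defects_by_type)
```

The first rows of the result show which few defect types account for most of the defects.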

Scatter Diagram/Chart: The scatter diagram (see Figure 11) is a diagram that plots data points,

allowing trends to be observed between one variable and another. The scatter diagram is used to test for

possible cause-and-effect relationships. A danger is that a scatter diagram does not prove the cause-and-

effect relationship and can be misused. A common error in statistical analysis is seeing a relationship and

concluding cause-and-effect without additional analysis.
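The strength of a linear relationship seen in a scatter diagram is often summarized by the Pearson correlation coefficient. The sketch below is one common way to compute it and is not taken from [Kulpa 2003]; the sample data are hypothetical. As stated above, a high coefficient indicates a relationship but does not prove cause and effect.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length with at least two points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant data")
    return sxy / sqrt(sxx * syy)


# Hypothetical data: module size (KLOC) versus defects found.
size = [1, 2, 3, 4]
defects = [2, 4, 6, 8]
r = pearson_r(size, defects)
```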


Cause-and-Effect/Fishbone Diagram: The cause-and-effect/fishbone diagram (see Figure 12) is a graphical

display of problems and causes. This is a good way to capture team input from a brainstorming meeting, from a

set of defect data, or from a check sheet.

Control Chart: The control chart (see Figure 13) is basically a run chart with upper and lower limits that

allows an organization to track process performance variation. Control charts are also called process

behavior charts.

These seven graphical displays can be used together or separately to help gather data, accumulate data,

and present the data for different functions associated with SPC.

The following seven questions are a start for reviewing the data for your charts [Kulpa 2003]:

1. Who collected these data? (Hopefully the same people who are trained in proper data

collection techniques.)

2. How were the data collected? (Hopefully by automated means and at the same part of the

process.)

3. When were the data collected? (Hopefully all at the same time on the same day or at the same

time in the process - very important for accounting data dealing with month-end or year-end

closings. )

4. What do the values presented mean? (Have you changed the process recently? Do these values

really tell me what I want or need to know?)

5. How were these values computed from raw inputs? (Have you computed the data to arrive at

the results you want, or to accurately depict the true voice of the process?)


6. What formulas were used? (Are they measuring what we need to measure? Are they working?

Are they still relevant?)

7. Are we collecting the right data, and are we collecting the data right? (The data collected

should be consistent, and the way data are collected should also be consistent. Do the data

contain the correct information for analysis? In our peer review example, this information

would be size, complexity, and programming language.)

Control charts are used to identify process variation over time. All processes vary. The degree of variance,

and the causes of the variance, can be determined using control charting techniques. While there are many

types of control charts, the ones we have seen the most often are the [Kulpa 2003]:

c-chart: This chart uses a constant sample size of attribute data, where the average sample size is

greater than five. It is used to chart the number of defects (such as “12” or “15” defects per

thousand lines of code). c stands for the number of nonconformities within a constant sample

size.

u-chart: This chart uses a variable sample size of attribute data. This chart is used to chart the

number of defects in a sample or set of samples (such as “20 out of 50” design flaws were a

result of requirements errors). u stands for the number of nonconformities with varying

sample sizes.

np-chart: This chart uses a constant sample size of attribute data, usually greater than or equal to 50.

This chart is used to chart the number defective in a group. For example, a hardware

component might be considered defective, regardless of the total number of defects in it. np

stands for the number defective.

p-chart: This chart uses a variable sample size of attribute data, usually greater than or equal to 50.

This chart is used to chart the fraction defective found in a group. p stands for the proportion

defective.

X and mR charts: These charts use variable data where the sample size is one.

X-bar and R charts: These charts use variable data where the sample size is small. They can also be

based on a large sample size greater than or equal to ten. X-bar stands for the average of the

data collected. R stands for the range (distribution) of the data collected.

X-bar and s charts: These charts use variable data where the sample size is large, usually greater

than or equal to ten.

So, as you can see, you can sometimes use several of the charts, based on the type of data and on the size of

the sample - and the size of the sample may change. Control charts help detect and differentiate between

noise (normal variation of the process) and signals (exceptional variation that warrants further

investigation). Although others may disagree, we recommend that you use the Average Moving Range

(XmR) chart for most situations. There are automated tools that can support building and displaying these

charts. The task we need to undertake is to figure out how to tell the difference between noise and

signals. Properly generated control charts, specifically the XmR chart, can help us in this task. Risk data

(historical data) are critical for generating accurate control charts and for correct SPC analyses. The

Table 2 shows the count for each month of the year 2002 and the mR values (moving range).

We can then average the moving ranges in the following statistical manner (see Figure 14), where Cen

stands for the centerline, UCL for the upper control limit, and LCL for the lower control limit.

We know that the values for the centerlines for each chart were computed by simply taking the average of the

values displayed (i.e., by adding up the values for each month and then dividing by the number of months/values

to compute the average). How were the upper and lower limits calculated for the charts shown above? We can

calculate the limits for both the X (Individual Values) chart and the Average Moving Range (mR) chart as

follows:

For the mR (moving range) chart. The upper range (or upper control limit, or upper natural limit) is

computed by multiplying the average moving range (the centerline of the mR chart) by 3.268.

For the X chart (individual values chart). The upper range for the X chart is computed by

multiplying the average moving range of the associated chart by 2.660 and then adding the value for the

centerline of the X chart. The lower range for the X chart is computed by multiplying the average

moving range by 2.660 and then subtracting the result from the centerline of the X chart.

Notice that values for both representations (individual values and average moving range values) must be

gathered and computed. The upper and lower limits for the individual values chart (X chart) depend on the

average variations calculated for the centerline of the average moving range chart. Therefore, these charts are

interdependent and can be used to show relationships between the two types of charts and the two types of data.
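
The limit calculation for the XmR chart can be sketched in Python as follows. This is a minimal sketch, not the procedure of [Kulpa 2003] verbatim: the function name `xmr_limits` and the monthly counts are illustrative assumptions (they are not the data of Table 2), and the scaling constants 2.660 and 3.268 are the standard XmR constants for moving ranges formed from two consecutive values.

```python
from statistics import mean

def xmr_limits(values):
    """Return centerlines and control limits for an XmR chart.

    Assumes the standard XmR scaling constants (2.660 and 3.268) for
    moving ranges computed from two consecutive values.
    """
    if len(values) < 2:
        raise ValueError("need at least two values to form a moving range")
    # Moving range: absolute difference between consecutive values.
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    x_center = mean(values)          # centerline of the X chart
    mr_center = mean(moving_ranges)  # centerline of the mR chart
    return {
        "x_center": x_center,
        "x_ucl": x_center + 2.660 * mr_center,
        "x_lcl": x_center - 2.660 * mr_center,
        "mr_center": mr_center,
        "mr_ucl": 3.268 * mr_center,
    }

# Hypothetical monthly counts, for illustration only.
counts = [10, 12, 11, 13, 9, 12]
limits = xmr_limits(counts)
```

Note that the limits of the X chart depend on the centerline of the mR chart, which is the interdependence between the two charts described above.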

We have also seen the limits for the XmR charts calculated using median ranges instead of average ranges. The

median moving range is often more sensitive to assigned causes when the values used contain some very high

range values that inflate the average. Remember that the median range is that range of numbers that hover

around the middle of a list sequenced in ascending or descending order: thus, the median range chart will

automatically “throw out” the very high- or low-end values. Use of the median moving range approach is valid;

however, the formulas (constants) change.

The most obvious interpretation is when one or more data points fall outside your control limits (either upper or

lower). Those values should be investigated for assignable causes, and the assignable causes should be fixed. If

your control chart shows three out of four consecutive points hovering closer to the limits than to the centerline,

this pattern may signal a shift or trend, and should be investigated (because predictable processes generally show

85 to 90 percent of the data closer to the centerline than to the limits). Remember: useful limits can be

constructed with as few as five or six consecutive values. However, the more data used to compute the limits, the

greater the certainty of the results.

Another way to spot trends is to look at the data points along the centerline. If eight or more consecutive data

points are clustered on the same side of the centerline, a shift in the original baseline or performance of the

process has probably occurred, even without a data point falling outside the limits. This is a signal to be

investigated.
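
The three interpretation rules above can be checked mechanically. The following Python sketch is one possible reading of them, under stated assumptions: the function name `find_signals`, the rule labels and the sample data are hypothetical, and the three-out-of-four rule is implemented exactly as worded above (without requiring the points to lie on the same side of the centerline).

```python
def find_signals(values, center, lcl, ucl, run_length=8):
    """Return (rule, index) tuples; index marks where a pattern completes."""
    signals = []

    # Rule 1: a data point falls outside the control limits.
    for i, v in enumerate(values):
        if v > ucl or v < lcl:
            signals.append(("outside_limits", i))

    # Rule 2: three out of four consecutive points lie closer to a
    # limit than to the centerline.
    def near_limit(v):
        return min(abs(ucl - v), abs(v - lcl)) < abs(v - center)

    for i in range(3, len(values)):
        window = values[i - 3:i + 1]
        if sum(near_limit(v) for v in window) >= 3:
            signals.append(("three_of_four_near_limit", i))

    # Rule 3: run_length or more consecutive points on the same side
    # of the centerline (points exactly on the centerline break a run).
    run, prev_side = 0, 0
    for i, v in enumerate(values):
        side = (v > center) - (v < center)
        if side != 0 and side == prev_side:
            run += 1
        else:
            run = 1 if side != 0 else 0
        prev_side = side
        if run >= run_length:
            signals.append(("run_on_one_side", i))
    return signals

# Hypothetical data: one point above the upper limit, then a long run
# above the centerline.
data = [10, 10.5, 9.5, 10, 16, 11, 11, 11, 11, 11, 11, 11, 11]
signals = find_signals(data, center=10, lcl=5, ucl=15)
```

Every reported index is only a prompt for investigation; whether an assignable cause exists still has to be established by looking at the process.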

c-chart appropriateness: While XmR charts are the most often applied in organizations, and are the most

appropriate charts to use most often, they are not infallible. Sometimes, an event will occur that “skews the

norm”; that is, a rare event way outside of the average has occurred. When this happens, a c-chart is better used.

A c-chart is used for rare events that are independent of each other. The formulas for c-charts are different from

XmR charts. First, calculate the average count of the rare occurrence over the total time period that the

occurrence happened. That number becomes the centerline. The upper limit is calculated by adding the average

count to three times the square root of the average count. The lower limit is calculated by subtracting three times

the square root of the average count from the average count. Charting the number of times a rare event occurs is

pretty useless. However, charting the time periods between recurring rare events can be used to help predict

when another rare event will occur. To do this, count the number of times the rare event occurs (usually per day

per year) and determine the intervals between the rare events. Convert these numbers into the average moving

ranges and, voilà, you can build an XmR chart.
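
As a small sketch of the c-chart calculation: the centerline is the average count, and the limits lie three times the square root of the average count above and below it. The function name `c_chart_limits` and the counts are hypothetical; clamping the lower limit at zero is an added assumption, since a count cannot be negative.

```python
from math import sqrt

def c_chart_limits(counts):
    """Return (centerline, lower limit, upper limit) of a c-chart."""
    c_bar = sum(counts) / len(counts)  # average count = centerline
    spread = 3 * sqrt(c_bar)
    # Assumption: the lower limit is clamped at zero.
    return c_bar, max(0.0, c_bar - spread), c_bar + spread

# Hypothetical counts of a rare event per period.
center, lcl, ucl = c_chart_limits([14, 18, 16, 16])
```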

u-chart appropriateness: The u-chart is based on the assumption that your data are based on a count of

discrete events occurring within well-defined, finite regions/areas, and that these events are independent. The u-

chart assumes a Poisson process. You may want to consider a u-chart when dealing with defects (counts) within

a group of pages (region/area); for example, number of errors per page or the number of defects per 1000 lines of

code. The u-chart differs from the XmR chart in that the upper and lower control limits of the u-chart change

over time. The ū in u-chart is the weighted average of the count (ū = ∑ count_j / ∑ size_j). The upper control limit

is calculated by adding ū to three times the square root of ū divided by the size of the respective sample (size_j). The lower control

limit is calculated by subtracting three times the square root of ū divided by the size of the respective sample (size_j) from ū.
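
The following Python sketch illustrates why the limits of a u-chart change from sample to sample: the weighted average is shared, but the spread depends on the size of each sample. The function name `u_chart_limits`, the defect counts and the sizes (in KLOC) are hypothetical, and clamping the lower limit at zero is an added assumption.

```python
from math import sqrt

def u_chart_limits(counts, sizes):
    """Return the weighted average and one (lcl, ucl) pair per sample."""
    u_bar = sum(counts) / sum(sizes)  # weighted average of the counts
    limits = []
    for size in sizes:
        spread = 3 * sqrt(u_bar / size)
        # Assumption: the lower limit is clamped at zero.
        limits.append((max(0.0, u_bar - spread), u_bar + spread))
    return u_bar, limits

# Hypothetical defect counts for modules of 2, 3 and 4 KLOC.
u_bar, limits = u_chart_limits(counts=[8, 12, 16], sizes=[2, 3, 4])
```

Larger samples give narrower limits, so a defect rate that is unremarkable for a small module can be a signal for a large one.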

There are three different types of strategies: survey, case study and experiment ([Juristo 2003], [Kitchenham

1997]). Those three strategies will be looked at in more detail in the following.

The survey is applied to subjects already in use (tools, etc.). The usual procedure for gathering information is

the use of questionnaires or interviews. These are applied to a representative sample group and the outcomes

are then analysed. The aim is to derive conclusions that are descriptive, exploratory or explanatory. With the use

of generalization the result from the sample is mapped to the whole group. It is, however, not possible to

manipulate or control the samples. Nevertheless it is practicable to compare the result with similar outcomes of

other surveys. Both qualitative as well as quantitative data can be derived from this strategy. Which one it is

depends on the data that is being collected through the questionnaires or interviews and whether statistical

analysis methods are applicable or not. A popular field for this kind of investigation is well known to most

people: social studies. An example would be public opinion polls before elections take place. The surveys there

try to show how the people will vote on the actual day of election.

Another helpful kind of survey method is the application of experience such as rules of thumb. Examples of

these rules of thumb are described in the following as laws and conjectures cited from [Endres 2004].

Process-related experiences:

Fagan’s law: “Inspections significantly increase productivity, quality, and project stability”. There are

three kinds of inspection: design, code, and test inspection. They are applicable in the development

of all information or knowledge intensive products. This form of inspection is widespread

throughout the industry today. Inspection also has a key role in the Capability Maturity Model

(CMM). The benefit of inspections can be summarized as follows: they “create awareness for

quality that is not achievable by any other method”.

Porter-Votta law: “Effectiveness of inspections is fairly independent of its organizational form”. A. Porter

and L. Votta investigated the inspection process introduced by Fagan and came up with the

following results: physical meetings are overestimated. They can be helpful when introducing the

inspection process to new people. Once education and experience are present, they are not that

important anymore. Another point revealed was that it is not true that adding more persons to the

inspection team increases the detection rate.

Hetzel-Myers law: “A combination of different Verification and Validation methods outperforms any

single method alone”. W. Hetzel and G. Myers claim that it is better to use all three methods in

combination to gain better results at the end. This is due to the fact that design, code and test

inspection are not competitors.

Mills-Jones hypothesis: “Quality entails productivity”. It is also known as “the optimist’s law” and can

be seen as a variation of P. Crosby’s proverb “quality is free”. It is a very intuitive hypothesis: on

the one hand, when the quality is high, less rework has to be done which results in better

productivity. On the other hand, when quality is poor more rework has to be considered. Therefore

productivity rate drops, as well.

Mays’ hypothesis: “Error prevention is better than error removal”. No matter when an error is detected a

certain amount of rework has to be done (this amount increases the later it is detected). Therefore

it is better to prevent errors. To be able to do so, the circumstances of errors have to be

investigated, identified and then removed. It is still a hypothesis because it is extremely difficult to

prove.

Structured conclusions:

Basili-Rombach hypothesis: “Measurements require both goals and models”. Metrics and measurement

need goals and questions otherwise they do not have a meaning. It is also preferable to use a top-

down approach when specifying the parameters. This leads to the Goal-Question-Metric (GQM)

paradigm.

Conjecture a: “Human-based methods can only be studied empirically”. The human-based methods

involve (human) judgement and depend on experience and motivation. This is why the results also

depend on these different factors. To be able to understand and control those factors empirical

studies are needed.

Conjecture b: “[…] studies”. Observing software development helps the developers to learn. The case studies supply

the project characteristics, (realistic) complexity, project pressure etc. The missing cause-and-effect

insights can be provided through controlled experiments.

Conjecture c: “Empirical results are transferable only if abstracted and packaged with context”. The

information that has been gained needs to be transformed into knowledge with the context borne in

mind. This can be achieved with the help of abstraction. It offers the opportunity to reuse the

results. When the results are abstracted and packaged only two questions remain to be answered:

“Do the results apply to this environment?” and “What are the risks of reusing these results?”

Another form of experience survey is the delivery of models, such as the models for measuring software

reliability based on the failure rates and probabilistic characteristics of software systems [Singpurwalla 1999]:

• Jelinski-Moranda model: Jelinski and Moranda assume that the software contains an unknown number

of bugs, say N, and that each time the software fails, a bug is detected and corrected. The failure

rate of Ti (the i-th time between failures) is proportional to N – i + 1, the number of bugs remaining in the code.

• Bayesian reliability growth model: This model is devoid of any consideration of the number of bugs, because the

relationship between the number of bugs and the frequency of failure is tenuous.

• Musa-Okumoto models: These models postulate a relationship between the intensity

function and the mean value function of a Poisson process; this approach has gained popularity with users.

• General order statistics models: This kind of model is based on statistical order functions. The

motivation for ordering comes from many applications like hydrology, strength of materials and

reliability.

• Concatenated failure rate model: These models introduce an infinite memory for storing the failure

rates, where the notion of infinite memory is akin to the notion of invertibility in time series analysis.
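
To make the Jelinski-Moranda assumption concrete, the following Python sketch computes the failure rate before each failure as a constant times the number of bugs still in the code, N – i + 1. The proportionality constant `phi`, the function name `jm_failure_rates` and the numbers are illustrative assumptions; the expected time between failures is taken as the reciprocal of the rate, which presumes exponentially distributed times between failures.

```python
def jm_failure_rates(total_bugs, phi):
    """Failure rate before the i-th failure, for i = 1..N.

    The rate is phi * (N - i + 1): proportional to the number of bugs
    remaining. `phi` is a hypothetical proportionality constant.
    """
    return [phi * (total_bugs - i + 1) for i in range(1, total_bugs + 1)]

# Hypothetical values: 5 bugs, each contributing a rate of 0.02 per hour.
rates = jm_failure_rates(total_bugs=5, phi=0.02)
# Assumption: exponential times between failures, so the expected gap
# before the i-th failure is 1 / rate.
expected_gaps = [1 / r for r in rates]
```

Each corrected bug lowers the failure rate by the same amount, so the expected time to the next failure grows as testing proceeds.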

A case study is used to monitor the project. Throughout the study data is collected. This data is then investigated

with statistical methods. The aim is to track variables or to establish relationships between different variables

that have a leading role or effect on the outcome of the study. With the help of this kind of strategy it is possible

to build a prediction model. The statistical analysis methods used for this kind of study include linear

regression and principal component analysis. A disadvantage of this study is the generalisation. Depending on

the kind of result it can be very difficult to find a corresponding generalisation. This also influences the

interpretation and thus makes it more difficult. Like the survey the case study can provide data for both

qualitative and quantitative research.

Experiments are usually performed in an environment resembling a laboratory to ensure a high amount of

control while carrying out the experiment. The assignments of the different factors for the experiment are

allotted totally at random. More about this random assignment can be found in the following sections. The main

task of an experiment is to manipulate variables and to measure the effects they cause. This measurement data is

the basis for the statistical analysis that is performed afterwards. In the case that it is not possible to assign the

factors through random assignment, so-called quasi-experiments can be used instead of the experiments

described above.

Experiments are used for instance to confirm existing theories, to validate measures or to evaluate the accuracy

of models [Wohlin 2000]. Unlike surveys and case studies, experiments only provide data for a

quantitative study. The difference between case studies and experiments is that case studies have a more

observational character. They track specific attributes or establish relationships between attributes but do not

manipulate them. In other words they observe the on-going project. The characteristic of an experiment in this

case is that control is the main aspect and that the essential factors are not only identified but also manipulated.

It is also possible to see a difference between case studies and surveys. A case study is performed during the

execution of a project. The survey looks at the project in retrospect. Although it is possible to perform a survey

before starting a project as a kind of prediction of the outcome, the experience used to do this is based on former

knowledge and hence based on those experiences gained in the past.

Carrying out experiments in the field of Software Engineering is different from other fields of application

[Juristo 2003]. In software engineering several aspects are rather difficult to establish. These are:

• Prove that the measures are nominal or ordinal scale

• Validation of indirect measures: models and direct measures have to be validated

To be able to carry out an experiment several steps have to be performed [Basili 1986]:

1. The definition

2. The planning

3. Carrying out the experiment

4. Analysis and Interpretation of the outcomes

5. Presentation of the results

We now take a more detailed look at the different steps mentioned above. The experiment definition is the basis for the

whole experiment. It is crucial that this definition is performed with some caution. When the definition is not

well founded and interpreted, the whole effort could have been spent in vain, and worse, the result of the

experiment may not show what was intended. The definition sets up the objective of the

experiment. Following a framework can do this. The GQM templates could supply such a framework, for

example [Solingen 1999].

After finishing the definition the planning step has to be performed. While the previous step was to answer the

question why the experiment is performed, this step answers the question how the experiment will be carried out.

Six different stages are needed to complete the planning phase [Wohlin 2000].

Context selection: The environment in which the experiment will be carried out is selected.

Hypothesis formulation and variable selection: Hypothesis testing is the main aspect for statistical

analysis when carrying out experiments. The goal is to reject the hypothesis with the help of the

collected data gained through the experiment. In the case that the hypothesis is rejected it is

possible to draw conclusions from it. More details about hypothesis testing can be read in the

following sections. The selection of variables is a difficult task. Two kinds of variables have to be

identified: dependent and independent ones. This also includes the choice of scale type and range

of the different variables. The section above also contains more information about dependent and

independent variables.

Subject selection: It is performed through sampling methods. Different kinds of sampling can be found at

the end of this chapter. This step is the fundament for the later generalisation. Therefore the

selection chosen here has to be representative for the whole population. The act of sampling the

population can be performed in two ways either probabilistic or non-probabilistic. The difference

between those two methods is that in the latter the probability of choosing a sample of the

selection is not known. Simple random sampling and systematic sampling, just to name two, are

probability-sampling techniques. Those and other methods can be found at the end of this chapter.

The size of the sample also has influence on the generalisation. A rule of thumb is that the larger

the sample is the lower the error in generalising the results will be. There are some general

principles described in [Juristo 2003]:

The analysis of the data may influence the choice of the sample size. It is therefore needed

to consider how the data shall be analysed already at the design stage of the experiment.

Experiment design: The design tells how the tests are being organized and performed. An experiment is

so to speak a series of tests. A close relationship between the design and the statistical analysis

exists and they have effect on each other. The choices taken before (measurement scale, etc.) and a

closer look at the null-hypothesis help to find the appropriate statistical method to be able to reject

the hypothesis. The following sections provide a deeper view into the subject described shortly

above.

Instrumentation: In this step the instruments needed for the experiment are being developed. Therefore

three different aspects have to be addressed: experiment objects (i.e. specification and code

documents), guidelines (i.e. process description and checklists) and measurement. Using

instrumentation does not affect the outcome of the experiment. It is only used to provide means for

performing and to monitor experiments [Wohlin 2000].

Validity evaluation: After the experiments are carried out the question arises how valid the results are.

Therefore it is necessary to think of possibilities to check the validity.

The following components are an important vocabulary needed for the software engineering experimentation

process:

Dependent & Independent variables: Variables that are being manipulated or controlled are called

independent variables. When variables are used to study the effects of the manipulation etc. they

are called dependent variables.

Factors: independent variables that are used to study the effect when manipulating them. All the other

independent variables remain unchanged

Treatment: a specific value of a factor is called treatment

Object & Subject: an example for an object is a review of a document. A subject is the person carrying

out the review. Both can be independent variables

Test (sometimes referred to as Trial): an experiment is built up using several tests. Each single test is

structured in treatment, objects and subjects. However, these tests should not be mixed up with

statistical tests

Experimental error: gives an indication of how much confidence can be put in the experiment. It is

affected by how many tests have been carried out

Validity: there are four kinds of validity: internal validity (validity within the environment and reliability

of the results), external validity (how general the findings are), construct validity (how well the

treatment reflects the cause construct) and conclusion validity (relationship between treatment and

outcome)

Randomisation: the analysis of the data has to be done from independent random variables. It can also be

used to select subjects out of the population and to average out effects

Blocking: is used to eliminate effects that are not desired

Balancing: when each treatment has the same number of subjects it is called balanced

Software engineering experimentation could be supported by the following sampling methods [Wohlin 2000]:

Simple random sampling: the subjects that are selected are randomly chosen out of a list of the

population.

Systematic sampling: only the first subject is selected randomly out of the list of the population. After that

every n-th subject is chosen.

Stratified random sampling: first the population is divided into different strata, also referred to as groups,

with a known distribution between the different strata. Second the random sampling is applied to

every stratum.

Convenience sampling: the nearest and most convenient subjects are selected.

Quota sampling: various elements of the population are desired. Therefore convenience sampling is

applied to get every single subject.
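
The three probability-sampling methods listed above can be sketched in Python with the standard `random` module. The function names, the subject labels and the strata are hypothetical. In this sketch, systematic sampling draws the random starting point from the first `step` entries of the list, and stratified sampling draws the same fraction from every stratum; both are simplifying assumptions.

```python
import random

def simple_random_sample(population, k, rng):
    """Choose k subjects at random from the population list."""
    return rng.sample(population, k)

def systematic_sample(population, step, rng):
    """Pick a random start, then take every step-th subject."""
    start = rng.randrange(step)
    return population[start::step]

def stratified_random_sample(strata, fraction, rng):
    """Apply simple random sampling within every stratum."""
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample

rng = random.Random(2024)  # fixed seed so the sketch is repeatable
population = [f"subject_{i:02d}" for i in range(1, 21)]
simple = simple_random_sample(population, 5, rng)
systematic = systematic_sample(population, 4, rng)
strata = {"novice": population[:12], "expert": population[12:]}
stratified = stratified_random_sample(strata, 0.25, rng)
```

Convenience and quota sampling are not shown, because they are not driven by a random mechanism.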

Controlled Experiments: The advantage of this approach is that it promotes comparison and statistical analysis.

Controlled here means that the experiment follows the steps as mentioned above ([Basili 1986], [Zelkowitz

1997]):

1. Experiment definition: it should provide answers to the following questions [3]: “what is studied?”

(object of study), ”what is the intention?” (purpose), “which effect is studied?” (quality focus), “whose

view is represented?” (perspective) and “where is the study conducted?” (context).

2. Experiment planning: null hypothesis and alternative hypothesis are formulated. The details (personnel,

environment, measuring scale, etc.) are determined and the dependent and independent variables are

chosen. First thoughts about the validity of the results.

3. Experiment realization: the experiment is carried out according to the baselines established in the

design and planning step. The data is collected and validated.

4. Experiment analysis: the data collection gathered during the realization is the basis for this step. First

descriptive statistics are applied to gain an understanding of the submitted data. The data is informally

interpreted. Now the decision has to be made how the data can be reduced. After the reduction the

hypothesis test is performed. More about hypothesis testing can be found in the following sections.

5. Portrayal of the results and conclusion about the hypothesis: the analysis provides the information

that is needed to decide whether the hypothesis was rejected or accepted. These conclusions are

collected and documented. This paper comprises the lessons learned.

Experimental design types: The quality of the design decides whether the study is a success or a failure. So it is

very important to meticulously design the experiment [Juristo 2003]. Several principles of how to design an

experiment are known. Those are randomisation, blocking and balancing. In general a combination of the three

methods is applied. The experimental design can be divided into several standard design types. The difference

between them is that they have distinct factors and treatment. The first group relies on one factor, the second on

two and the third group on more than two factors. The following paragraphs will show some detail about the

different design types.

• One factor with two treatments (completely randomised design):

Field of use: comparison.

Example: comparing two different analysis techniques using several projects

Assignment: techniques are assigned totally at random; the same objects are used for both

treatments

Analysis methods: t-test, Mann-Whitney

Benefit: simple experiment

[Table: assignment of subjects 1 to 6 to the two treatments]

• Paired comparison design (extends the design mentioned above)

Field of use: comparison of two different analysis techniques (two treatments)

Example: comparing two different analysis techniques

Assignment: the subjects are applied to both treatments on the same object; the assignment is

performed randomly

Analysis methods: Paired t-Test, Sign test, Wilcoxon

Benefit: improves the precision of the experiment.

Subject   Treatment 1   Treatment 2
1         2             1
2         1             2
3         2             1
4         2             1
5         1             2
6         1             2

• One factor design with more than two treatments:

Field of use: comparison of all treatments

Example: comparing different programming languages regarding their quality while using them

Assignment: subjects are randomly assigned; one object to all treatments

Analysis methods: analysis of variance (ANOVA), Kruskal-Wallis

[Table: assignment of subjects 1 to 4 to the treatments]

• Randomised complete block design:

Field of use: comparison of all treatments with high variability among the subjects. More than two

treatments

Example: same as in the design mentioned above

Assignment: each subject uses all treatments; the order is assigned randomly; restriction of

randomisation because of the blocks

Analysis methods: ANOVA, Kruskal-Wallis

Benefit: minimizing the effect of variability. One of the most used designs in experimentation.

Subjects form a more homogenous unit.

Subject   Treatment 1   Treatment 2   Treatment 3
1         3             2             1
2         1             2             3
3         1             3             2
4         2             1             3

Assignment of the treatments to the subjects in the complete block design

This design is used when more complex experimentation arrangements are needed. There are now

three hypotheses: one for the effect of the first factor, one for the second factor and one for the

interaction between the two factors. The following paragraphs will depict different two factor

designs.

• 2*2 factorial design:

Example: investigating the understandability of design documents using two

different designs, i.e. structured versus object-oriented design; two treatments per

factor

Assignment: randomly assign subjects to combination of the two treatments

Analysis methods: ANOVA

                          Factor 2          Factor 2
                          Treatment 2_1     Treatment 2_2
Factor 1 Treatment 1_1    C, F              B, E
Factor 1 Treatment 1_2    A, H              D, G

(Available Subjects: A, B, C, D, E, F, G, H)

• Two-stage nested design:

Field of use: one factor is similar to another factor for different treatments (two or

more treatments)

Example: efficiency of unit testing using two different designs i.e. functional

programming versus object-oriented programming

Assignment: one of the two factors is nested to the other; the subjects are randomly

assigned

Analysis method: ANOVA

Factor 1          Factor 2            Subjects
Treatment 1_1     Treatment 2_1       A, H
Treatment 1_1     Treatment 2_2       C, F
Treatment 1_2     Treatment 2_2_1     B, E
Treatment 1_2     Treatment 2_2_2     D, G

Table 8: Two-stage nested design with two treatments per factor (Available Subjects:
A, B, C, D, E, F, G, H)

Some experimentation arrangements depend on more than two factors. These kinds of designs are

also called factorial design because the dependent variables also depend on interaction between the

n factors. Known factorial designs with two treatments are: 2^k factorial design, 2^k fractional

factorial design, one-half fractional factorial design of the 2^k factorial design and one-quarter

fractional factorial design of the 2^k factorial design.

A listing of the statistical testing methods needed for the different design types, in alphabetical order, is given in

the following. More details about them can be found in [Juristo 2003]:

• Binomial test: This test analyses the differences between dichotomous variables.

• Chi2: This type of test is used when frequencies are involved. This means that the data has the form of

frequencies.

• F-test: The F-test compares the variance of two (independent) samples

• Mann-Whitney: When the assumption made in the t-test is uncertain it is possible to use the Mann-Whitney test

instead. Similar to the Wilcoxon test, this method is based on ranks.

• Paired t-test: This method compares two samples, gained through repeated measures.

• Sign test: It depends on the sign of the difference of the values of the examined pairs.

• Wilcoxon: For this method it is important that it is possible to determine the greater value of the

examined pair and that the difference can be ranked because the ranks are the basis of the Wilcoxon

test.
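
As an illustration of one of the listed methods, the following Python sketch computes an exact two-sided p-value for the sign test. It is a minimal sketch under stated assumptions: the function name `sign_test_p_value` and the paired values are hypothetical, ties are dropped, and under the null hypothesis positive and negative differences are taken to be equally likely.

```python
from math import comb

def sign_test_p_value(pairs):
    """Exact two-sided sign test for paired observations (a, b)."""
    # Only the sign of each difference matters; ties are dropped.
    diffs = [b - a for a, b in pairs if b != a]
    n = len(diffs)
    if n == 0:
        return 1.0
    positives = sum(d > 0 for d in diffs)
    k = min(positives, n - positives)
    # P(X <= k) for X ~ Binomial(n, 0.5), doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical faults found per subject with technique A and technique B.
pairs = [(5, 7), (4, 6), (6, 9), (3, 5), (7, 8), (5, 4), (4, 7), (6, 8), (5, 5)]
p_value = sign_test_p_value(pairs)
```

With seven positive differences, one negative difference and one tie, the p-value is 2 * 9 / 256, about 0.07, so the null hypothesis would not be rejected at the 5 percent level.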

Parametric and Non-parametric testing: We will start with the parametric tests. Their main characteristic

is that the analysed models have a specific distribution. Usually the assumption is made that

some parameters are normally distributed. The parameters must be measurable at least on an interval scale. The test

for normality can be done with the Chi2 test.

The main characteristic of non-parametric tests is that only a very general assumption is made, more general than

for parametric tests. When they are available they can be used instead of parametric tests, but not vice versa.

The decision which one of the two mentioned approaches is best suited can be based on two factors. These are

Applicability (what are the assumptions made? The assumptions must be realistic!) and Power (parametric tests

have, in general, higher power than the non-parametric test). The relation between experimental design types,

test methods and parametric, non-parametric tests is shown in the following Table 9 [Juristo 2003].

Design                                              Parametric        Non-parametric
One factor, one treatment                           -                 Binomial test, Chi2
One factor, two treatments, completely randomised   t-test, F-test    Mann-Whitney, Chi2
One factor, two treatments, paired comparison       Paired t-test     Wilcoxon, Sign test
One factor, more than two treatments                ANOVA             Kruskal-Wallis, Chi2
More than two factors                               ANOVA             -

Hypothesis Testing: One way to evaluate whether a presumption is correct is to use hypothesis testing.

When everything has been carried out correctly, the result helps us to conclude

whether the presumption used to formulate the tested hypothesis establishes a cause and effect

relationship.

Hypothesis testing takes place in several steps that are applied repeatedly if needed. The first phase, induction, is

used to formulate the first hypothesis, also called the null hypothesis, and an alternative

hypothesis in case the null hypothesis is rejected. It is possible that the test rejects a true hypothesis or

accepts a false one. Such an outcome is referred to as a risk. Two different kinds of risk can be identified:

Type-I-error (the hypothesis is true but rejected) and Type-II-error (the hypothesis is false but accepted). When talking

about the risks it is also necessary to talk about the power of a statistical test. The power indicates the probability

that the statistical test will reveal a true pattern if the null hypothesis is false. It is therefore desirable to choose a

test with a high power over one with a lower power.
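
The two risks and the power can be made tangible with a small worked example in Python. The setting is purely hypothetical: a one-sided binomial test with 20 trials, a null hypothesis of a success rate of 0.5, a rejection threshold of 15 successes, and an assumed true rate of 0.8 for the power calculation. The helper `upper_tail` is an illustrative name.

```python
from math import comb

def upper_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Hypothetical decision rule: reject H0 when at least 15 of 20 trials succeed.
n, critical = 20, 15

# Type I risk: H0 (success rate 0.5) is true but gets rejected.
alpha = upper_tail(n, critical, 0.5)

# Power: probability of rejecting H0 when the true rate is 0.8 (assumed).
power = upper_tail(n, critical, 0.8)

# Type II risk: H0 is false but is not rejected.
beta = 1 - power
```

In this setting the Type I risk is about 2 percent and the power about 80 percent. Raising the threshold lowers the Type I risk but also lowers the power, which is the trade-off behind the choice of test.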

The kinds of visualisation used for SPC have been described above. We now briefly give some further characteristics. A graphical visualisation is an illustrative way of conveying information about different aspects of the data. The following passages describe several visualisation methods.

Scatter Plot:
  Portrayal: Two-dimensional grid
  Used for:  Assessing dependencies between variables
             Tendency of linear relation
             Identification of outliers
             Observation of correlation

Box Plot:
  Input:     Percentiles
  Portrayal: Box plot constructed from different percentiles
  Used for:  Visualisation of dispersion and skewness

Histogram:
  Portrayal: Bars with different heights
  Used for:  Overview of distribution density
             Indicator for normal distribution

Cumulative Histogram:
  Portrayal: Bars containing the cumulative sum of frequencies up to the current class of values
  Used for:  Probability distribution function of the samples from one variable

Pie Chart:
  Portrayal: Segments in a circle, with angles proportional to the relative frequency
  Used for:  Relative frequency of the data values

In the following we describe an example of a controlled experiment investigating performance under the Personal Software Process (PSP) [Wohlin 2000].

• Object of study: participants in the PSP course, their ability considering performance with respect to

background and experience;

• Purpose: evaluate the individual performance with respect to the individual background;

• Perspective: point of view of researchers and teachers; They would like to know if there are differences

between the participants in the course having different backgrounds;

• Quality focus: Productivity in terms of KLOC (thousands of lines of code) / development time and defect density in terms of faults / KLOC;

• Context: experiment is run within the PSP;

• Summary (of Definition): Analyse the outcome of the PSP for the purpose of evaluation with respect to

the background of the individuals from the point of view of the researchers and teachers in the context

of the PSP course.

• Context selection: PSP course at university; It addresses a real problem and is performed off-line

because it is not used for industrial software development. The programming language is C.

• Hypothesis selection:

Null-hypothesis H0: No difference in productivity between students from Computer Science and

Engineering program (CSE) and Electrical Engineering Program (EE)

H0: Product(CSE) = Product(EE)

Alternative Hypothesis H1: Product(CSE) ≠ Product(EE)

Null-hypothesis 2 H0: No difference between the students considering the faults/ KLOC (based on prior

knowledge of C)

H0: # of faults is independent of C experience

Alternative hypothesis 2 H1: # of faults/KLOC changes with experience

• Data to be collected: student program (nominal scale), program size in lines of code (ratio scale), development time in minutes (ratio scale), productivity (ratio scale), experience in C (ordinal scale; a classification into four groups was used here), and faults / KLOC

• Variables selection:

o Independent variables: program and experience in C.

o Dependent variables: productivity and faults / KLOC

• Selection of subjects: chosen based on convenience; They are samples from the two programs and not

chosen by a random sample.

• Experiment design:

o Randomisation: subjects are not assigned at random. They all use the PSP and take part in all

of the assignments.

o Blocking: not applied

o Balancing: not applicable

o First design: one factor (program), two treatments (CSE, EE). A parametric test is chosen, in

this case the t-test because the dependent variables are ratio scaled.

o Second design: one factor (experience in C), more than two treatments. Here four treatments

can be identified (4 different groups). The dependent variable is also measured in a ratio scale

so that parametric testing can be applied. In this case the ANOVA test.

• Instrumentation: A survey carried out at the beginning of the course provides the needed data about

experience and background.

• Validity evaluation:

o Internal validity: provided through the number of tests within the course.

o External validity: highly probable that similar results are obtained when the course is run in a

similar way. It is rather difficult to generalize the results to students not taking the course.

However, it might be possible to generalise the outcome to other PSP courses, comparing the

background for example.

o Conclusion validity: not considered to be critical due to the fact that the faked or incorrect data

is independent from the background.

o Construct validity: two major threats can be identified. First, are the measures appropriate? For example, is LOC / development time a good measure of productivity? Second, because it was a graded course, the students might bias their data. To counter this, it was stated at the beginning of the course that the grade did not depend solely on the actual data but rather on timely and proper delivery and on the reports handed in.

• Preparation: The students primarily took a course; they were not aware of exactly what was being investigated.

• Execution:

execution time: 14 weeks

Number of assignments: 10

Number of participants: 65

At the end of the course interviews were performed to evaluate the course and the PSP.

• Data validation: Of the 65 students, six were removed because their results were rather questionable or invalid. The removal was based on the researchers' and teachers' personal impression of whether the data handed in for the given assignments were representative. The remaining 59 students (32 CSE, 27 EE) were used for the statistical analysis and interpretation.

Descriptive statistics: In Figure 15 the productivity of the two study programs is shown. It gives a hint

that the productivity of the EE students is not as high as the productivity of the CSE students.

As a second method, box plots are made (Figure 16). They show that the EE group has one outlier,

which stays in the data and is considered an extreme value.

The two figures already indicate that the productivity of the EE students is lower than of the CSE

students. The hypothesis testing might reveal a difference between the two study programs. Let us move

on to the faults / KLOC. The table below shows the different parameters of the faults/ KLOC. It can be

seen that the distribution is skewed towards the first group (little or no experience). That is why a box

plot for this group is made (see Figure 17).

Class   Number of students   Median value of faults/KLOC   Mean value of faults/KLOC   Standard deviation of faults/KLOC
1       32                   66.8                          82.9                        64.2
2       19                   69.7                          68.0                        22.9
3       6                    63.6                          67.6                        20.6
4       2                    63                            63.0                        17.3

Figure 17: Box plot for faults/ KLOC for the first group

The descriptive statistics tell what can be expected from the hypothesis testing and where problems due

to outliers might appear.

Data reduction:

It was decided to remove the outliers, which changed the mean value and standard deviation as can be seen in Table 11.

Class   Number of students   Median value of faults/KLOC   Mean value of faults/KLOC   Standard deviation of faults/KLOC
1       31                   66                            72.7                        29.0

Hypothesis testing: For the first null hypothesis the t-test was applied. The result can be seen in Table 12. The conclusion is that the hypothesis H0 is rejected: the difference between the students from the two programs is significant. The actual reasons for this have to be evaluated further.

Factor       Mean difference   Degrees of freedom (DF)   t-value   p-value
CSE vs. EE   6.1617            57                        3.283     0.0018
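
The reported p-value can be cross-checked from the t-value and the degrees of freedom alone. The sketch below is our own illustration, not part of [Wohlin 2000]: it assumes a two-sided test and integrates the density of Student's t distribution numerically with Simpson's rule, using only the Python standard library. The helper names are hypothetical.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def two_sided_p(t_value, df, steps=2000):
    """Two-sided p-value via Simpson integration of the density on [0, |t|]."""
    t_abs = abs(t_value)
    h = t_abs / steps                      # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(t_abs, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_area = total * h / 3           # P(0 <= T <= |t|)
    return 1.0 - 2.0 * central_area

n_cse, n_ee = 32, 27                       # group sizes after data validation
df = n_cse + n_ee - 2                      # degrees of freedom of the two-sample t-test
p = two_sided_p(3.283, df)
print(f"df = {df}, two-sided p = {p:.4f}")
```

The degrees of freedom follow directly from the two group sizes, and the resulting p-value lies well below the usual 0.05 level, in line with the rejection of H0.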

For the second null hypothesis the ANOVA test was chosen. The result can be seen in Table 13.

Faults/KLOC          Degrees of freedom (DF)   Sum of squares   Mean square   F-value   p-value
Between treatments   3                         3483             1160.9        0.442     0.7236
Errors               55                        144304           2623.7        -         -
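
The entries of the ANOVA table are linked by simple arithmetic: each mean square is the sum of squares divided by its degrees of freedom, and the F-value is the ratio of the two mean squares. A minimal sketch that recomputes these quantities from the tabulated values (variable names are ours; small differences to the table are due to rounding):

```python
# Values transcribed from Table 13.
df_between, ss_between = 3, 3483.0
df_error, ss_error = 55, 144304.0

ms_between = ss_between / df_between   # mean square between treatments
ms_error = ss_error / df_error         # mean square of the errors
f_value = ms_between / ms_error        # ANOVA test statistic

print(f"MS between = {ms_between:.1f}, MS error = {ms_error:.1f}, F = {f_value:.4f}")
```

The degrees of freedom are also consistent with the design: four experience classes give 3 degrees of freedom between treatments, and 59 students give 58 in total.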

The outcome was that there is no significant difference in faults/KLOC between the different groups. Groups 2, 3 and 4 were then merged to investigate the difference between the newly formed group and group 1. A t-test was applied to look for differences between those two groups. No significant results were obtained.

Two hypotheses were investigated: study program versus productivity, and experience in C versus faults per KLOC. The first test showed that the CSE students were more productive than the EE students. The second test showed no significant influence of experience in C on the number of faults. Hence,

When following the PSP it is better to use a well-known language so that the focus can solely

be on the PSP.

It is also reasonable to claim that students with a computer science background have a higher

productivity than students with other disciplines as background. It is still necessary to do

further studies.

In the following we will give some examples of statistical analysis in three kinds of domains [Pandian 2003]:

The following methods and examples are cited from [Pandian 2004] in order to achieve a consistent form of

statistical descriptions (see also [Juristo 2003] and [Wohlin 2000]).

All processes show variations that will become evident if a frequency distribution is drawn on the process

metric. Understanding process variation, Deming observes, will lead to profound knowledge of the process.

Frequency distribution also contains an indication about probability of occurrence of events. Analysis of metrics

data in the frequency domain would result in empirical distribution curves. The shape and structure of these

distribution curves represent a process signature. Analyses of distributions are usually based on several well-

known probability distributions. We have selected two distribution types that find practical views in software

projects: normal distribution and the Rayleigh distribution. All empirical distributions are referred to any one of

these two for interpretation.

Normal Distribution: Normal distribution is considered nature's template, the most common pattern of process

variation. A large number of project outcomes can be directly fitted to the ideal normal curve. For example,

effort variance in a family of software projects has been analyzed to find that they have a mean value of 10

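percent and standard deviation of 2 percent. The normal distribution with mean μ and standard deviation σ is given by the following equation.

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^{2}}{2\sigma^{2}}}$$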

The process variation illustrated here makes us view software projects from a statistical standpoint.

Bias: A Process Reality: Real-life process behaviour may exhibit a bias. Such distributions lack symmetry and

are skewed to one side. Also, these have a characteristic “tail”, representing occurrences that have transgressed

or strayed into unusual regions. The bias is characteristic of human systems that use intention or will to choose

among several tactical opportunities. The long tail, such as in Rayleigh distribution, bears evidence to a

fundamental but small propensity of nature to defy human design. This tail could be a symbol of machine failure

in mechanical processes or estimation failure in project management. The tail of the schedule variance

distribution presented in Figure 18 shows how "best-made estimates" have failed.

As a structure, the skewed Rayleigh distribution has been put to great use in software estimation by Putnam.

Software reliability models use this structure to represent defect leakage into the field in the continuum of time.

The Rayleigh curve can be expressed as given in the following equation

$$m(t) = 2Kat\, e^{-at^{2}}$$

where m(t) is the manpower, K the total effort, a the constant (shape parameter), and t the time.
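
To get a feel for the shape of this curve, the following sketch evaluates the manpower function in the form m(t) = 2Kat·exp(-at²) together with its integral, the effort spent up to time t. The values chosen for K and a are hypothetical and only serve as an illustration.

```python
import math

def rayleigh_manpower(t, total_effort, a):
    """Manpower m(t) = 2*K*a*t*exp(-a*t^2) at time t."""
    return 2.0 * total_effort * a * t * math.exp(-a * t * t)

def cumulative_effort(t, total_effort, a):
    """Effort spent up to time t: the integral of m from 0 to t."""
    return total_effort * (1.0 - math.exp(-a * t * t))

K = 100.0                            # hypothetical total effort, e.g. person-months
a = 0.02                             # hypothetical shape parameter
t_peak = 1.0 / math.sqrt(2.0 * a)    # time of peak staffing, from dm/dt = 0

print(f"peak at t = {t_peak:.2f}, manpower = {rayleigh_manpower(t_peak, K, a):.2f}")
print(f"effort spent by the peak: {cumulative_effort(t_peak, K, a):.1f} of {K:.0f}")
```

Setting the derivative of m(t) to zero gives the peak at t = 1/sqrt(2a); at that point a fraction 1 - exp(-1/2) of the total effort has been spent, and the long tail accounts for the rest.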

Central Tendency of Processes: Central tendency in a skewed distribution, a more authentic representation of

real-life processes, is difficult to establish. Nevertheless, it is conventional to refer to three measures of central

tendency:

1. Mean

2. Median

3. Mode

The mean is the arithmetic average of all the observations. The median divides a series of data arranged in

the order of magnitude of their values so that an equal number of values is on either side of the center or median

value. The median divides the distribution curve into two equal areas. The mode denotes the value that has the

highest frequency of occurrence in the dataset. If the distribution of the data is normal and not skewed, then the

mode, median, and mean are equal.

It is customary to take the mean value to indicate the central value of a metric. It is convenient to think so, and

many business models run on this simple assumption. But when the metrics data set contains outliers and

extreme values, median could be a better choice because it presents a balanced picture. Mode is considered for

setting process goals.
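
The effect of an extreme value on the three measures can be seen in a few lines. This is a minimal sketch with hypothetical effort-variance observations; it uses the `statistics` module of the Python standard library.

```python
from statistics import mean, median, mode

# Hypothetical effort-variance observations in percent; 60 is an extreme value.
effort_variance = [8, 9, 10, 10, 10, 11, 12, 60]

center_mean = mean(effort_variance)      # pulled upward by the extreme value
center_median = median(effort_variance)  # middle of the ordered data
center_mode = mode(effort_variance)      # most frequent value

print(center_mean, center_median, center_mode)
```

In this example the single extreme value moves the mean well above the bulk of the data, while median and mode stay at the value around which most observations cluster.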

Process Spread: Process results wander away from the mean value. The degree of wandering, or spread, is

denoted by the standard deviation, sigma (σ), of process output values. Frequency distributions are the most

natural tools to study and analyze process spread. In Figure 19, three models for effort variance are plotted, all

with different standard deviations but a common central value of 10 percent. Process variations such as these

indicate trouble. The larger the variation, the larger is the uncertainty. It may be noticed that as the spread

increases, the number of “results on target” decreases. When the process deviations get closer to process

boundaries or tolerance limits, the process tends to become unreliable.

Effort variance (%)   Model 1   Model 2   Model 3
2                     0.00      0.04      0.09
3                     0.00      0.06      0.10
4                     0.01      0.10      0.12
5                     0.03      0.14      0.13
6                     0.08      0.18      0.15
7                     0.19      0.23      0.16
8                     0.36      0.26      0.16
9                     0.53      0.29      0.17
10                    0.60      0.30      0.17
11                    0.53      0.29      0.17
12                    0.36      0.26      0.16
13                    0.19      0.23      0.16
14                    0.08      0.18      0.15
15                    0.03      0.14      0.13
16                    0.01      0.10      0.12
17                    0.00      0.06      0.10
18                    0.00      0.04      0.09
19                    0.00      0.02      0.07
20                    0.00      0.01      0.06
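
The three columns can be read as normal curves around the common central value of 10 percent. The sketch below rests on our own inference, not on a statement in [Pandian 2004]: standard deviations of 2, 4 and 7 percent and a common scale factor of 3 appear to reproduce the tabulated values to two decimals.

```python
import math

def normal_pdf(x, mu, sigma):
    """Normal probability density with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

MU = 10.0                  # common central value of 10 percent
SIGMAS = (2.0, 4.0, 7.0)   # assumed spreads, inferred from the tabulated values
SCALE = 3.0                # assumed plotting scale, also inferred

def table_row(x):
    """Scaled density of the three models at effort variance x, rounded to 2 decimals."""
    return [round(SCALE * normal_pdf(x, MU, s), 2) for s in SIGMAS]

for x in range(2, 21):
    print(x, *table_row(x))
```

The comparison shows the statement of the text in numbers: the wider the spread, the lower the share of results close to the target of 10 percent.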

Another example of process dispersion can be seen in how bug-fixing time (TTR, time to repair, in days), falls

into three service levels, corresponding to simple, medium, and complex types of bugs. Fixing each type of bug

is a process of its own, characterized by central tendencies and standard deviations. As illustrated in Figure 20,

the distinction between these processes results in blur in some areas, and the maintenance project manager needs

to use this information while setting goals and limits for delivery schedules.

Measures of Dispersion: Measures of dispersion describe how the observations in the dataset are spread out.

Important measures of dispersion are

• Range

• Variance

• Standard deviation

Range is the difference between the highest and lowest values in a dataset. Variance measures the fluctuation of

the observations around the mean. The larger the value of the variance, the greater the fluctuation. The standard

deviation, like the variance, also measures the variability of the observations around the mean. Standard

deviation is equal to the positive square root of variance. A standard deviation has the same units as the

observations, and thus is easier to interpret.
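
The three measures can be computed as follows. This is a minimal sketch with hypothetical time-to-repair values; it treats the data as the whole population (`pvariance`, `pstdev`), which is an assumption on our part. For a sample drawn from a larger population, `statistics.variance` and `statistics.stdev` would be the counterparts.

```python
from statistics import pvariance, pstdev

ttr_days = [2, 3, 3, 4, 5, 7]    # hypothetical time-to-repair values in days

value_range = max(ttr_days) - min(ttr_days)   # highest minus lowest value
variance = pvariance(ttr_days)   # mean squared deviation from the mean (days^2)
std_dev = pstdev(ttr_days)       # positive square root of the variance (days)

print(f"range = {value_range}, variance = {variance:.3f}, sigma = {std_dev:.3f}")
```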

Descriptive Statistics: Before we draw any inferences from data (using inferential statistics), we need to do

descriptive statistical study. Hence, metric data can be first studied for its descriptive statistics, which includes

estimation of the following parameters:

• Mean

• Standard error (of the mean)

• Median

• Mode

• Standard deviation

• Variance

• Kurtosis

• Skewness

• Range

• Minimum

• Maximum

• Sum

• Count

• Largest (#)

• Smallest (#)

Note: Skew means lack of symmetry. The skew can be positive (skewed to the right, with a long tail toward large values) or negative (skewed to the left, with a long tail toward small values). For a positively skewed distribution, the mean is greater than the median because a few values are large compared to the others. If a distribution is negatively skewed, the mean is less than the median. Kurtosis is a measure of the peakedness of the dataset. It is also viewed as a measure of the "heaviness" of the tails of a distribution. A tool for calculating descriptive statistics is available in Excel as a macro in the Analysis ToolPak.
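
Several of the listed parameters are not available as ready-made functions everywhere, so a small sketch may help. It uses the plain moment-based definitions of skewness and excess kurtosis; this choice is our assumption, and spreadsheet functions apply small-sample corrections, so their results differ slightly. The data values are hypothetical.

```python
import math
from statistics import mean, median, stdev

def skewness(data):
    """Moment-based skewness g1 = m3 / m2**1.5 (positive: long right tail)."""
    n, m = len(data), mean(data)
    m2 = sum((x - m) ** 2 for x in data) / n
    m3 = sum((x - m) ** 3 for x in data) / n
    return m3 / m2 ** 1.5

def excess_kurtosis(data):
    """Moment-based excess kurtosis g2 = m4 / m2**2 - 3 (0 for a normal curve)."""
    n, m = len(data), mean(data)
    m2 = sum((x - m) ** 2 for x in data) / n
    m4 = sum((x - m) ** 4 for x in data) / n
    return m4 / m2 ** 2 - 3.0

faults_per_kloc = [40, 55, 60, 62, 66, 70, 75, 90, 250]   # hypothetical values
std_error = stdev(faults_per_kloc) / math.sqrt(len(faults_per_kloc))

print(f"mean = {mean(faults_per_kloc):.1f}, median = {median(faults_per_kloc)}")
print(f"standard error = {std_error:.1f}, skewness = {skewness(faults_per_kloc):.2f}")
```

With one large value in the data the skewness is positive and the mean lies above the median, as described in the note above.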

Deriving Frequency Distribution from Data: There are three ways of visualizing frequency distribution,

ranging from mathematical to empirical. Each can be applied to a practical situation; each has its advantages.

Probability Density Function Curve: The first is to work from the mean and sigma to construct an ideal normal

distribution curve, applying the equation to probability density function. One can use the spreadsheet function

NORMDIST and generate the graph by constructing an x,y table (and plotting an x,y chart) in accordance with

the relationship given in the following equation.

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^{2}}{2\sigma^{2}}}$$

This bell shaped curve is a classical way of getting a feel for the process. Next we can draw a histogram and

study its shape. The bin intervals (or class intervals) are marked in the x-axis and the frequency in the y-axis.

One can use a "tally" system to count the number of data points falling into each bin, or use the histogram macro

on the spreadsheet and get the tally as well as the chart. Histogram will present details that had been ironed out

in the normal curve.
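
The tally can also be produced without a spreadsheet. The following is a minimal sketch with equal-width bins; the function name and the schedule-variance values are hypothetical, and the data is assumed to contain at least two different values.

```python
def histogram(data, n_bins):
    """Tally data into n_bins equal-width bins spanning [min, max]."""
    low, high = min(data), max(data)
    width = (high - low) / n_bins
    counts = [0] * n_bins
    for x in data:
        # The maximum value would fall just past the last bin, so clamp it.
        index = min(int((x - low) / width), n_bins - 1)
        counts[index] += 1
    edges = [low + i * width for i in range(n_bins + 1)]
    return edges, counts

# Hypothetical schedule variances in percent.
schedule_variance = [-5, 0, 2, 3, 4, 5, 5, 6, 8, 10, 12, 15, 22, 35]
edges, counts = histogram(schedule_variance, n_bins=4)
for left, right, count in zip(edges, edges[1:], counts):
    print(f"{left:6.1f} to {right:6.1f} | {'#' * count}")
```

Each printed row is one bin with its tally, which is the text equivalent of the bar chart described above.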

Empirical Distribution Curve. Finally, we can transform the histogram into a "curve" by constructing a smooth line that passes through the tops of the histogram bars. Constructing such a curve, sometimes called the frequency polygon, is not an attempt to find a mathematical expression for an empirical reality; it is an attempt to create a graphical pattern, as a model and a continuous representation of process behavior.

Frequency Scan: While arriving at empirical distribution curves, we stand to gain by doing alternative analysis

by varying the bin sizes. One such analysis is “scanning”, where we deliberately run a histogram on a large

number of bins, although the number of data points may not warrant a large number of bins. An example of

schedule variance analysis with 32 bins is depicted in Figure 21.

The frequency diagram scans the entire process range, like a spectral scanner, and finds occurrences in the right

location in the metrics scale. Such an analysis highlights “bursts” of events, which stand far away in the

frequency domain from the primary process modes. In the background, the best-fit normal curve built from the

process mean and standard deviation is presented. It may be noted that the normal curve is very broad and shallow,

indicating a widely varying process. The standard deviation is about 2.5 times larger than the mean, with the

obvious consequences on the curve. A frequency scan could make several discoveries in process behaviour,

including the following:

• Extreme deviations

• Process outliers

• Natural clusters

• Secondary modes

• Primary modes

• Zoom view of the significant modes

The Filter Effect - Getting a Smooth Overall Picture: We can obtain a smoother function, with the details

ironed out, to show a broad picture of schedule variance, as shown in Figure 22. The desire here is not to

prescribe discrimination rules or locate troublesome groups, but to get a sense of variation.

This choice is deliberately made because of the shift in decision-making approach from class discrimination to

variation control.

The same process data that was scanned in the previous figure is now processed with fewer bins, just 7 instead of the original 32. The result is a smoothed curve, which has muffled the fast variations, like a low-pass filter, and indicates an overall picture.

One can vary the "filter characteristics" of a histogram to see different views of variation, and develop an insight

from these many perspectives. It is like tuning in to different wavelengths, looking for signals.
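
The filter effect can be imitated by merging adjacent bins of a fine-grained histogram. The sketch below is an illustration under our own assumptions: the 32 bin counts are invented, and the bins are merged in groups of four into 8 bins (rather than the 7 bins used in the figure) because 32 divides evenly by 4.

```python
def merge_bins(counts, factor):
    """Coarsen a histogram by summing each run of `factor` adjacent bins."""
    return [sum(counts[i:i + factor]) for i in range(0, len(counts), factor)]

def count_peaks(counts):
    """Number of interior bins that are strictly higher than both neighbours."""
    return sum(
        1 for i in range(1, len(counts) - 1)
        if counts[i - 1] < counts[i] > counts[i + 1]
    )

# Hypothetical frequency scan with 32 bins: a broad hump with rapid fluctuations.
fine = [0, 1, 0, 2, 1, 3, 1, 4, 3, 6, 4, 8, 6, 9, 7, 9,
        8, 7, 8, 5, 6, 3, 4, 2, 3, 1, 2, 0, 1, 0, 1, 0]
coarse = merge_bins(fine, factor=4)

print("32 bins:", count_peaks(fine), "peaks")
print(" 8 bins:", coarse, "->", count_peaks(coarse), "peak")
```

The fine scan shows many small peaks, whereas the coarse view of the same data leaves one dominant hump; no observation is lost, only the resolution changes.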

Looking at Histograms: The histogram is known as the “voice of the process”. On a chosen metric, histogram

analysis can reveal process behaviour such as stability and bias. The first-cut analysis is to look at the shape of

the histogram and see the “process signature”. Standard types of histograms have been identified by Feigenbaum

for manufacturing processes. The shapes and types could reveal the nature of the process from which the data

points have been gathered. For example, a histogram truncated on both sides represents product behaviour after

the "out-of-tolerance components" have been removed. A histogram with the central portion missing can be

traced to a population where the best components have been selected and removed, perhaps marked as a higher-

grade delivery. In software, too, we can identify histograms with telltale signatures. Three of these signatures are

presented in Figure 25, along with their special meanings:

1. Comb structure

2. Right-biased structure

3. Left-biased structure

Many of the other figures furnished in this chapter contain real-life process signatures. Notable among them are

the following:

• Bimodal distribution with a single dominant peak

• Multiple clusters

• Rayleigh type distribution with long “tail”

• Plateau structure (flat distribution)

• Spurs (in spectral scanning)

Projects can maintain histogram libraries and map them to the contributing process scenarios. This way, every

organization can invent its own histogram types, as shown in Figure 23.

Process Capability from Frequency Distribution: A process that is under statistical control is said to be

capable if it is able to satisfy the customer specifications or the goals of the process, in the event customer

specifications are not available. Process capability refers to the inherent ability of a process to repeat results for

a sustained period of time under a given set of conditions. The frequency signature of a capable process has a

few notable characteristics: Single mode, less variation, and process peak tends to be closer to target. In the

classical model of process capability computations, normal distribution is assumed, and numerical indices are

calculated to quantify process capability.

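Process Capability Index Cp: This index indicates the performance of the process by relating the natural process spread to the specification (tolerance) spread, as shown in the following equation.

$$C_p = \frac{USL - LSL}{6\sigma}$$

Here USL and LSL are the upper and lower specification limits and σ is the standard deviation of the process.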

Modifications of this basic definition are in use to account for the following special situations: Single limit and

process drift. Such indices and their variants were originally designed for mechanical processes, based on well-

established statistical models for process variation, defect occurrence, inspection, and sampling. For software

projects, can we apply Cp? There are several constraints. The beginning of the problem lies in the very nature of

the process called project management or software engineering, each having process signatures different than

that of mechanical processes. Next in line are the difficulties of prescribing control limits and specifications

limits, which cannot be calculated based on old assumptions but require a deep understanding of statistical

distributions of process parameters and defects.
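
For a metric where the normal assumption is acceptable, the index is a one-line computation. The following sketch uses the classical definition Cp = (USL - LSL) / (6σ) and, as one commonly used variant for an off-centre process, Cpk; the latter is our addition and is not named in the text. All limits and process values are hypothetical.

```python
def process_capability(usl, lsl, sigma):
    """Cp: specification spread divided by the natural process spread (6 sigma)."""
    return (usl - lsl) / (6.0 * sigma)

def process_capability_k(usl, lsl, mean, sigma):
    """Cpk: a common variant that accounts for an off-centre process mean."""
    return min(usl - mean, mean - lsl) / (3.0 * sigma)

# Hypothetical effort-variance limits and process parameters in percent.
cp = process_capability(usl=16.0, lsl=4.0, sigma=2.0)
cpk_centred = process_capability_k(usl=16.0, lsl=4.0, mean=10.0, sigma=2.0)
cpk_drifted = process_capability_k(usl=16.0, lsl=4.0, mean=12.0, sigma=2.0)

print(f"Cp = {cp:.2f}, Cpk centred = {cpk_centred:.2f}, Cpk drifted = {cpk_drifted:.2f}")
```

A drift of the process mean leaves Cp unchanged but lowers Cpk, which is why variants of the basic index are used when the process is not centred between its limits.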

Probability: The area under probability density function represents "probability" of occurrence. In Figure 24,

the shaded area represents the probability that the upper specification limit of schedule variance may be

transgressed.

The exact value of this probability, P(SV > USL), is obtained by dividing the shaded area by the total area under the curve. The probability that the schedule target will be met corresponds to the unshaded area. The shaded area, lying outside the limit, constitutes what we can term "process defects". The white area is the acceptable region. The areas are integrals of the probability density function (pdf) f over the specified limits and can be calculated by using the relationship given in the following equation

$$P(SV > USL) = \int_{USL}^{\infty} f(x)\,dx$$

Probabilistic Expressions of Capability and Risk: Probabilistic models can be used to determine process

capability and risk. Capability is defined as the probability of meeting the target and risk is the probability of

missing the target. Capability and risk are like two sides of a coin. If a process is not “filled” with capability, the

vacuum will be encroached by risk. A similar analysis can be done almost on all metrics, although the core

metrics such as the ones in the following list are preferred choices: Schedule, productivity, and defects.
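
Capability and risk in this probabilistic sense can be sketched as follows, assuming that the metric is normally distributed and that only an upper specification limit applies. The function name and the schedule-variance values are hypothetical.

```python
from statistics import NormalDist

def capability_and_risk(mean, sigma, usl):
    """Probability of staying within / exceeding an upper specification limit,
    assuming the metric is normally distributed."""
    capability = NormalDist(mu=mean, sigma=sigma).cdf(usl)   # P(X <= USL)
    risk = 1.0 - capability                                  # P(X > USL)
    return capability, risk

# Hypothetical schedule variance: mean 10 percent, sigma 5 percent, limit 20 percent.
capability, risk = capability_and_risk(mean=10.0, sigma=5.0, usl=20.0)
print(f"capability = {capability:.4f}, risk = {risk:.4f}")
```

The two values always add up to one, which is the "two sides of a coin" relation stated above.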

Analyzing Process Maturity: Process maturity can be analyzed using frequency distributions. Mature processes

show slim frequency diagrams, with sharp peaks - the fat and the process wanderings having been eliminated.

Mature processes show, decisively, a central value. The danger of secondary process intervention would have

been eliminated to secure stability. The voice of the process will stand clear above noise from spurious

performances, outliers, and strange isolated events. Mature process peaks tend to drift toward customer

satisfaction, resource conservation, and better performances. A productivity distribution, as the project matures

in capability, tends to move toward higher values. The defect distribution peak, in a similar environment, will

move to lower values. A process behaviour model is seldom static. It is highly dynamic, constantly shifting its

location, and changing the shape. The process boundaries keep in tune and the process remains in a constant

state of metamorphosis.

The road to process maturity can be tracked using frequency diagram models of the process, and by arranging a

process maturity storyboard or chronicler, which has now become an industry standard for visualizing

“continuous process improvement”. Figure 25 presents a process maturity storyboard of an organization that is

moving up the maturity grid as time passes. Approximately, the signatures correspond with capability maturity

model (CMM) levels. The metric - the chosen indicator - is effort variance. If the organization's goals can be

marked on these frames, one can easily perceive and estimate quantitatively resource management capability as

well as effort escalation risk, and relate the findings to climbing maturity level. Apart from using process

signatures to narrate a story in time, we can use them to compare business units within an organization or

benchmark teams within a business unit. We could also create a signature board to cover all primary metrics to

see if there is balance in capability or how uncertainty and risk propagate into the deeper recesses of processes.

Process Diagnosis: Process baselines based on mean and sigma sometimes hide real problems, such as in the

case study described here. The effort variance in this instance shows a bimodal distribution, each mode on either

side of zero. The arithmetic mean is almost zero; going by the mean one may think that the process is on target.

Far from it, the process is severely unstable, toggles between two meta-stable states, as revealed in the frequency

analysis. The project team recognized the problem, the first step in diagnosis, did a causal analysis, and spotted

trouble in the estimation process, which was in its juvenile stage. Either effort was overestimated or it was

underestimated. Where they had provided contingency cushions, it turned out that the expected risks did not

attack. Where they had been optimistic, risks had surfaced eventually. More than estimation, the problem was in

risk forecasting, and linking it with estimation. The team was trying to grapple with the problem and the struggle

resulted in the twin modes.
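
How a mean can hide such a situation is easy to reproduce. The following sketch uses invented effort variances with one cluster of overestimates and one cluster of underestimates; the 5 percent band around the target is likewise an assumption for illustration.

```python
from statistics import mean

# Hypothetical effort variances in percent: one cluster of overestimates
# (negative values) and one cluster of underestimates (positive values).
effort_variance = [-22, -20, -19, -18, -21, 19, 21, 20, 18, 22]

overall_mean = mean(effort_variance)
near_target = [v for v in effort_variance if abs(v) <= 5]

print(f"mean = {overall_mean}, projects within 5 percent of target: {len(near_target)}")
```

The mean sits on the target although not a single project is close to it; only the frequency view reveals the two modes.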

Search for Natural Process Boundary: Higher-level metrics, such as effort variance, denote complex processes

because they tend to capture the net result of several sub-processes. Calculating process control limits in such

cases is a tricky job. The exact distribution type of each sub-process may not be known, much less the way the

sub-processes combine. Traditional control limits use mean and sigma-based concoctions. But we know the

fallacy of blindly choosing the mean as a representative figure. The questions emerge: What is the true process

limit? What is going to be the decision threshold? Which is an outlier and which is the core? What control limits

do we use in our control charts? We are looking for a natural process boundary that we can trust and use in

decision making. The answer to the question lies in a frequency distribution study of the metric.

Typically, as illustrated in Figure 26, such an analysis would manifest a dominant mode, denoting a primary

process, and a subdued mode, denoting a secondary process. The valley point is taken as the natural process

boundary which can be used as the upper control limit.
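
Locating the valley point can be sketched as a search for the lowest bin between the two highest peaks of the histogram. The function below is our own illustration of that idea, not a procedure given in [Pandian 2004]; it ignores flat-topped peaks, and the bin edges and counts are hypothetical.

```python
def natural_boundary(edges, counts):
    """Centre of the lowest bin between the two highest histogram peaks.

    edges has len(counts) + 1 entries. Returns None if fewer than two
    peaks are found. Flat-topped peaks are not detected in this sketch.
    """
    padded = [0] + list(counts) + [0]
    peaks = [i for i in range(len(counts))
             if padded[i] < padded[i + 1] > padded[i + 2]]
    if len(peaks) < 2:
        return None
    # Keep the two highest peaks and order them from left to right.
    first, second = sorted(sorted(peaks, key=lambda i: counts[i], reverse=True)[:2])
    valley = min(range(first, second + 1), key=lambda i: counts[i])
    return (edges[valley] + edges[valley + 1]) / 2.0

# Hypothetical effort-variance histogram: bins of 5 percent from 0 to 45.
edges = [0, 5, 10, 15, 20, 25, 30, 35, 40, 45]
counts = [2, 9, 14, 8, 3, 1, 4, 6, 2]
print("natural process boundary near", natural_boundary(edges, counts), "percent")
```

In this example the dominant mode lies in the 10 to 15 percent bin, the subdued mode in the 35 to 40 percent bin, and the valley between them marks a candidate for the upper control limit.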

Class Recognition - Productivity: Productivity in software development is a very complex area. Analysis of

productivity using frequency distributions could give tangible benefits. Apart from the baseline normal curve,

the empirical distribution derived with the right choice of bin intervals could reveal "productivity clusters," as

illustrated in the following case study. In Figure 27, four modes have emerged during an organization-wide

analysis of productivity data. These modes point to the existence of four distinct classes of projects; the

discriminating factors could be complexity of job and skill grades of staff. There could also be interplay between

other productivity drivers and barriers.

This diagnosis establishes four productivity levels, and facilitates developing management strategies. It also

provides a fair basis for performance measurement and comparison. The mistake of having and quoting one

productivity figure for the entire organization can now be avoided. The gaps in productivity levels provide a

framework for improvement of performance levels, tools utilization, and better and more objective human

resource management.

Figure 27: Software productivity classes

Benchmarking: A benchmark study using frequency distribution, in addition to the conventional comparison

charts, could bring over more valuable information. Sometimes it is just a comparison of signature between

successful projects and not so successful projects. Sometimes it can be a comparison of motivation level and

commitment. During a benchmarking study using frequency distribution, one can compare the following

features:

• Number of modes

• Natural process boundary

• Process capability (percent)

• Risk (percent)

• Outliers (percent)

• Extreme values (percent)

• Mean (overall)

• Sigma (overall)

Measuring the True Value: Software measurements can have ambiguities as large as 50 percent. The

measuring process, such as review or testing, has its own sources of uncertainty, noise, and variation. The

measuring tool and the measured process both vary simultaneously, making software measurements even more

difficult. In the presence of this ambiguity, histograms help in getting at the true value: the central tendency or

the dominant mode. The histogram successfully points out the true value, even while presenting the details of

variations. All modern measuring techniques and instruments use histogram analysis to detect true value. A case

in point is defect measurement, fraught with uncertainties of high proportions.

Measuring Defects without Ambiguity: When it comes to defects, the measured value depends on the product of two factors, as

$$\text{measured defects} = \text{defects present} \times \text{detection effectiveness}$$

Detection effectiveness values could vary from 40 to 80 percent, depending on the review methodology used and

the review capability of reviewers. Thus an uncertainty is associated with the review process. Measurement

capability is inversely proportional to measurement uncertainty. The rule book of measurement says that the

measuring instrument should have less uncertainty than the process variation the instrument is trying to measure.

We have to measure defect variations of the order of 10 percent with measuring instruments such as review with

an inherent variation of up to 70 percent. The ambiguity in defect measurements can be overcome by using a

simple signal-processing technique: the defect histogram.
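A minimal sketch of the defect-histogram idea, assuming hypothetical defect-density readings and an arbitrarily chosen bin width: the readings are binned and the centre of the most populated bin is taken as the dominant mode. The function name and the data are illustrative and do not come from the case study.

```python
from collections import Counter


def dominant_mode(values, bin_width):
    """Return the centre of the most populated histogram bin."""
    # Map each reading to the index of the bin that contains it.
    bins = Counter(int(v // bin_width) for v in values)
    # The modal bin is the one with the highest count.
    index, _count = max(bins.items(), key=lambda item: item[1])
    return (index + 0.5) * bin_width


# Hypothetical defect densities (defects/KLOC) from repeated reviews.
readings = [4.1, 4.6, 5.2, 4.8, 4.4, 9.7, 4.9, 1.2, 4.3, 5.1]
print(dominant_mode(readings, bin_width=1.0))
```

The stray readings (1.2 and 9.7) fall into sparsely populated bins and therefore do not move the reported mode, which is the point of reading the central tendency from the histogram rather than from a single measurement.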

Comparison when Distinctions Blur: We go to statistics when we cannot make a judgment without its help.

An example is a case study in which two review methods had to be compared. The first (DD) is a one-person

method; the other is a group method (PI/DC). Defect detection probabilities looked very similar in both

cases, and the raw data was confusing. Once the frequency distributions of the findings were plotted (the bottom

curves in Figure 28), the whole picture could be understood.

Six Sigma Model: Six Sigma concepts originally began with a process behaviour model in frequency domain.

The graphs shown in Figure 29 show a Six Sigma representation of process capability. Capability is measured by

the gap - safety distance measured in terms of sigma - between the process tendency and performance limit.

Graph A has a safety distance or gap of 3σ, and hence the process has 3σ capabilities. Graph B has a process

peak that is 6σ away from the specification limit, and hence has 6σ capabilities. Defects in a Six Sigma process -

those transgressions across the specification limits - account for a mere 3.4 parts per million (ppm) of the total

events (even after allowing for some wandering of the process peak from the mean).
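The 3.4 ppm figure can be reproduced with a short calculation. The sketch below assumes a normally distributed process and the conventional Six Sigma allowance of a 1.5σ drift of the process peak; both are assumptions of the sketch rather than details stated in the text, and only the one-sided tail beyond the nearer limit is counted.

```python
import math


def defects_ppm(sigma_level, shift=1.5):
    """One-sided defect rate in parts per million for a normal process."""
    # Distance from the (shifted) process peak to the specification limit.
    z = sigma_level - shift
    # Upper-tail probability of the standard normal distribution.
    tail = 0.5 * math.erfc(z / math.sqrt(2))
    return tail * 1e6


print(round(defects_ppm(6), 1))  # Graph B: 6 sigma capability
print(round(defects_ppm(3)))     # Graph A: 3 sigma capability
```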

Metrics data analysis in time domain:

Viewing in Time: Metrics data, organized in the time domain in a framework, present a window into the real world.

Our purpose here is to see what the present holds out in the context of the past. We also wish to connect events,

like a thread connects beads, and see meaningful patterns from which a future can be forecast. We will also be

seeing how control charts can be devised to provide support in decision making. Because software projects run a

predetermined path known as the life cycle, with a finite start and a finite end, time domain analysis proves to be

only natural. Time domain analysis enables project teams to become sensitive to reality, responsive to situations,

and self-organizing through continuous learning.

Temporal Patterns in Metrics: Plotting data in a chronological order brings out the hidden temporal patterns. A

causal factor for attrition, the motivational level of employees is measured here as a commitment index and

gathered every quarter. We recognize first the simple linear trend, and later more intricate nonlinear trends.

While the linear trend captures a broad, long-term behavioural pattern, the local characteristics are captured in

increasing level of details by power, polynomial, and moving average trends. All of them are effective in

suppressing noise but forecasting scope and efficiency vary. Each analysis offers an adaptive perception,

different from the rest. The overall problem, of course, is a steady decline in commitment, but the pattern of

decline, the seasonality, and similarity with known trends provide knowledge.
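As an illustration of the linear and moving-average trends, the following sketch fits a least-squares line to a hypothetical quarterly commitment index and smooths it with a trailing moving average. The data values, window length, and function names are assumptions made for the example.

```python
def linear_trend(ys):
    """Least-squares slope and intercept for equally spaced observations."""
    n = len(ys)
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


def moving_average(ys, window):
    """Trailing moving average; the first window-1 points have no value."""
    return [sum(ys[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(ys))]


commitment = [78, 76, 77, 73, 72, 70, 71, 67]  # hypothetical quarterly index
slope, intercept = linear_trend(commitment)
print(f"trend: {slope:.2f} index points per quarter")
print(moving_average(commitment, window=4))

# Extrapolating the line one quarter ahead gives a simple forecast.
forecast = intercept + slope * len(commitment)
print(f"next quarter (linear forecast): {forecast:.1f}")
```

The line summarizes the long-term decline, while the moving average follows local behaviour more closely at the cost of a shorter forecasting reach.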

Time Series Forecasting: Using time series analysis, events can be predicted based on historical trends. The

bug arrival pattern shown here is an important input for maintenance projects to decide the following:

• Work scheduling

• Human resource balancing

• Strategies for service quality assurance

Forecasting requires that we identify structures in the data, which might repeat. Software failure intensity data

can be plotted and the trend can be used to predict failure, as indicated in Figure 30. In fixed assets and facilities

management, assets downtime data can be plotted in time sequence, and the trend may be derived and used to

forecast spare-parts requirements and manpower and tools requirements to fix failure events. With the

information made available by forecasting, one stands to plan better and even avoid those marginal losses that are

bound to be incurred without the benefit of advance information.

Signature Prediction: Beyond the bug arrival statistics, signatures of bug population are captured periodically,

as illustrated in Figure 31, and used in prediction. The signatures become yet another dimension in forecasting.

Here signature refers to a bar graph showing distribution of bugs among the known categories as percentages.

The distribution pattern keeps changing. Risk tracking, risk exposure magnitude, and risk distribution may be

carried out in a similar fashion. Defect magnitude and defect signature are known to have been tracked in a

similar way by IBM in their ODC framework of defect management.

Prediction Windows: Prediction may be done by seeing patterns across projects or can be done locally within a

project. For instance, customer satisfaction index may be tracked in an organization, as shown in Figure 32,

project after project, and the trend may be used in decision making. The prediction window here is quite large

and may run into years. Each project runs within a time window inside which predictions are made. Time to

complete a project and cost at completion are both predicted from the earned value graph (EVG), which

cumulatively tracks value and cost as a time series.

Within a project, there could be smaller process windows where very short time series curves operate. Reliability

growth curve (RGC) tracks defects within the inspection window of the project. Failure intensity curve, being a

reliability model, operates in a window that begins with in-process inspection but goes beyond delivery and

penetrates into deeper time zones of alpha, beta, and acceptance tests and application runs. Every metric operates

in a time window, which also becomes the prediction window. The window patterns are eventually called

models.

A process is characterized, in simple terms, by the mean value and the standard deviation. The first refers to the location of the process and the

second represents the variation of the process. The weekly average (X-bar value) of time to repair (TTR) bugs in a

maintenance project itself is a good indicator of the process. Such a plot is called the X-bar chart, shown in

Figure 33(a). When the process variations are quite large, central tendency is more meaningful with median

values. Therefore, monitoring of process median charts is recommended in these conditions. Figure 33(b)

shows the plot of median values for the same set of data.

Figure 33: X-bar chart on TTR

Process Variation Charts: Process variation is represented by standard deviation. Figure 34(a) illustrates

the weekly values for standard deviation, in the form of an S chart. There are occasions when

process range is used as a measure of variation in place of standard deviation, which is represented in

Figure 34(b).

Plotting Central Value and Variation Together: When the X-bar chart is accompanied by another chart showing how the range

(maximum minus minimum) varies every week, the pair is called an X-bar-R chart, which has been very popular on the

work floor. A simpler way is to plot the mean, minimum, and maximum values in the same graph and construct

the MMM chart. The weekly data set is known as a sub-group (the sub-groups could stand for a group of projects,

a group of components, etc.). In our example, the MMM chart is plotted for sub-groups, each corresponding to

one week. The chart could be modified to consider (µ + σ) and (µ − σ) instead of the maximum and minimum

values to express variations. The MMM format allows forecasting and pattern recognition.
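The sub-group statistics behind the X-bar, median, S, R, and MMM charts can be computed as in this sketch. The weekly TTR values are hypothetical, and the dictionary layout is only one possible arrangement.

```python
import statistics


def subgroup_summary(subgroups):
    """Per-sub-group statistics used by X-bar, median, S, R and MMM charts."""
    rows = []
    for week, values in enumerate(subgroups, start=1):
        rows.append({
            "week": week,
            "xbar": statistics.mean(values),      # X-bar chart
            "median": statistics.median(values),  # median chart
            "s": statistics.stdev(values),        # S chart (sample std. dev.)
            "range": max(values) - min(values),   # R chart
            "min": min(values),                   # MMM chart, lower trace
            "max": max(values),                   # MMM chart, upper trace
        })
    return rows


# Hypothetical time-to-repair values (hours), one list per week.
ttr_by_week = [[4, 6, 5, 9], [3, 5, 4, 4], [7, 12, 6, 8]]
for row in subgroup_summary(ttr_by_week):
    print(row)
```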

Control Charts: Park et al., Fenton and Pfleeger, Adrian Burr and Mal Owen, and Thomas Thelin are among

the earliest to have applied the traditional forms of control charts to software engineering processes. Many

software development houses have adapted control charts in one form or another. An established tool in

manufacturing, the control chart is an emergent technology in software development. In a control chart, process

results are plotted in time and compared with an expected value. Examples for the expected values are

• Control limits calculated from data

• Specification limits drawn from process requirements

• Process goals set by benchmarking

• Improvement goals

• Estimated value

• Planned value

In Figure 35, the estimated value of cumulative lines of code is plotted against month, and the actually delivered

lines of code are compared with the estimated. The perceived gap between the estimated and actual makes the

process owner see the problem and do something to bring the process result back to the estimated value. Control

here means adhering to a budget or a plan. The essential control chart is a decision support tool, an early warning

radar that alerts the user.

Figure 35: Tracking growth against point estimate

Range in Expected Values: The estimated value, instead of being a point, could have a range, taking a clue

from real-life process variations. Hence, there exist an upper limit and a lower limit for the estimated value, for

a given confidence level. If σ represents the standard deviation and the limits are set at 3σ, for instance,

the associated confidence level is 99.7 percent, assuming normally distributed data.

As shown in Figure 36, the actual values are plotted in the background of the estimated mean value and the

limits. Now one sees a problem if the actual values cross the limits because we have already given a tolerance

band to deviations from the expected mean value. Those data points, which lie outside the tolerance band, are

known as outliers. The first improvement one can think of is to prevent outliers, the next improvement being

reduction of the allowed variation band.
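The outlier test described above can be sketched as follows, assuming the estimated mean and σ are already known; the monthly figures and the function name are hypothetical.

```python
def find_outliers(actuals, expected_mean, sigma, k=3.0):
    """Return (index, value) pairs that fall outside mean +/- k*sigma."""
    lower = expected_mean - k * sigma
    upper = expected_mean + k * sigma
    return [(i, v) for i, v in enumerate(actuals) if v < lower or v > upper]


# Hypothetical monthly effort variance (percent) against an estimate of 10.
actuals = [10, 12, 25, 9, 7]
print(find_outliers(actuals, expected_mean=10, sigma=2, k=3))
```

Lowering `k` narrows the tolerance band, which corresponds to the second improvement step mentioned above: reducing the allowed variation once outliers have been brought under control.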

Life Cycle Phase Control Charts: The acceptable limits (point estimates) on defect levels are marked in

the life cycle phase control chart. The actual data is superimposed on the expectation levels. Perhaps this type of

control chart is most natural for life cycle projects. One can plot the following metrics values in this control chart

format:

• Effort

• Schedule

• Rework

• Defect found

• Defect leaked

• Review effort

These life cycle phase control charts provide an opportunity to disseminate process goals and deploy them

phasewise. One can define the ranges around each estimate to be more realistic about goal setting. The expected

values and process goals change with time and improve when the organization makes progress in its processes.

There is perhaps no expected value that can be stationary and permanent.

When Limits Blur: We must recall that uncertainties are associated with each measured value. Each data point

is not a deterministic entity, but probabilistic in nature. If we plot the probability densities of measured values, as

in Figure 37, each data point is not a single point but a distribution. Let us try to answer the following questions.

Have distributions A, B, C, D, and E crossed the limits? Should we read red alert or early warning? The answer:

these are blurred crossings, not abrupt jumps. Statistically, they represent process diffusion. We may relate

control limits to the assumed confidence levels of judgment and appreciate the tentative nature of limits. We can

move up or down the control limits and opt for yet another reference point as UCL. We can fix the UCL and

LCL at chosen points on the process distribution curve and accept the corresponding confidence level for

decision making. Crossing the limit is a question of degree, which depends on assumptions and perceptions and

not so much on the seemingly rigorous mathematical expressions that are used to compute the limits.

Selecting Control Limits for Unknown Distributions: When the type of distribution is not known we can

apply Chebyshev's theorem, according to which, for any population or sample, at least (1 − 1/k²) of the

observations in the dataset fall within k standard deviations of the mean, where k ≥ 1. This is illustrated in Figure

38 as a relationship between standard deviation and the corresponding confidence level.

Chebyshev's theorem provides a lower bound to the proportion of measurements that are within a certain number

of standard deviations from the mean. This lower bound estimate can be very helpful when the distribution of a

particular population is unknown or mathematically intractable. Because the software development process is

totally a human process, one cannot expect a standard distribution pattern. Therefore, we should adopt an

estimation method, which does not depend on data distribution pattern, and at the same time reasonably

represent the actual situation. Therefore, depending on the confidence level required one could set the process

capability baseline limits with 1.5σ, 2σ, or 3σ for 56, 75, and 89 percent confidence levels, respectively.
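The confidence levels quoted above follow directly from Chebyshev's bound, as this short sketch shows (the function name is illustrative).

```python
def chebyshev_bound(k):
    """Minimum fraction of observations within k standard deviations."""
    if k < 1:
        raise ValueError("the bound is stated for k >= 1")
    return 1 - 1 / k ** 2


for k in (1.5, 2, 3):
    print(f"{k} sigma: at least {chebyshev_bound(k):.0%} of observations")
```

Because the bound holds for any distribution, it is conservative: for data that happen to be close to normal, the actual coverage at 3σ is considerably higher than 89 percent.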

Control Limits for X m R Chart: When only individual data points are available, rather than samples that form sub-groups, it is not possible to

construct an X-bar-R chart. In this case the alternative is to construct an X moving range (X m R) chart.

Here successive data points are grouped to form a sub-group. Control limits for this chart are derived based on

control chart constants. The limits are given in the following equation.

Let us consider an application of the X m R chart to the effort variance process. Because this data becomes available

only infrequently, at project closure, we can characterize this process and arrive at its baseline value through the

application of the X m R chart.
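Since the equation is not reproduced here, the following sketch uses the constants commonly tabulated for individuals and moving range charts with a moving range of two points (2.66 and 3.267). Treat these as an assumption about what the original equation contained; the effort variance values are hypothetical.

```python
def xmr_limits(values):
    """Control limits for an individuals (X) and moving range (mR) chart."""
    # Moving ranges: absolute differences between successive data points.
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    x_bar = sum(values) / len(values)
    mr_bar = sum(moving_ranges) / len(moving_ranges)
    return {
        "x_center": x_bar,
        "x_ucl": x_bar + 2.66 * mr_bar,  # 2.66 is 3/d2 with d2 = 1.128
        "x_lcl": x_bar - 2.66 * mr_bar,
        "mr_center": mr_bar,
        "mr_ucl": 3.267 * mr_bar,        # D4 for sub-groups of size 2
    }


# Hypothetical effort variance (percent) recorded at project closure.
effort_variance = [10, 12, 10, 12]
for name, value in xmr_limits(effort_variance).items():
    print(f"{name}: {value:.2f}")
```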

Process Capability Baseline Charts: Figure 39 shows the process capability baselines with popular control

limits. If tighter control on a metric such as effort variance percent is wanted, one could choose 1.5σ limits; on

the contrary, if the project manager does not want too many causal analyses to be made or if the process is in the

inception stage, one could choose 3σ control limits, wherein nearly 89 out of 100 times the process value will be

within the 3σ control limit.

Process Capability Baselines from Empirical Distribution: The process history, if available, can be

used to set control limits such as demonstrated in Figure 40, where frequency distribution of historical data

reveals the existence of natural process limits, the valley points dropping off the principal peak. UNPL refers to

upper natural process limit and LNPL refers to lower natural process limit. This approach allows us to use

empirical frequency distributions, which are perhaps more relevant and accurate than the elegant assumptions

made in the traditional computations of limits.

Metrics data analysis in the relationship domain:

A Fertile Domain: Processes are interdependent, forming a network. The interplay between process parameters

has been the subject of several studies in software engineering, leading to understanding of the hidden process

dynamics. The interactions that exist in the process network can be symbolically represented as a map of

relationships between metrics. The symbolic world of relationship between metrics is a new domain, which

mirrors the real world of processes and the influences they exert on one another. The analysis of an individual

metric in the frequency and time domains enhances the indicative abilities of the metric and allows us to see

patterns. In the new domain, we expand our view angle, look at the neighborhood around each metric, spot more

metrics (which seem to be connected), and focus on capturing the interrelationships. The relationship domain

brings in a pragmatic perspective. In the real world, processes do not work in isolation and, as a consequence,

complete truth cannot be represented by isolated metrics. Analysis in the relationship domain complements

analysis in the other domains. When processes work as interconnected systems, the interrelationships may follow

an order or rule. This may be just a local discipline governing a narrow range of process events. Or it may be a

global order, with universal influence. The order may change from time to time when processes shift from one

phase state to another. When we analyze metrics data in the relationship domain, we use metrics "snapshots" of

the process, to try to arrive at formulas that depict the order, rule, or discipline by which the process runs. The

formulas could be local or global, following the characteristic of the process order. Some are ephemeral while

others are everlasting. Some are reversible, some are irreversible. Some are reproducible while others are not.

We search for all. The relationship domain is a fertile hunting ground. Studying relationships among metrics

with existing data is one approach. Making special observations under controlled conditions or conducting

experiments is another approach. The choice between routine observation and experiment is decided by the

proposed degree of rigor in the intended analysis and cost. We proceed with the first choice, studying naturally

available data without incurring the expenditure of experiments. We believe that in a project environment there

is a lot to learn from available data and a lot of improvement can be made from the study results of such data

before the need arises to commission experiments. The relationship between metrics and the expression of the

same as a formula or equation can be presented graphically. In fact, we begin with graphical analysis and then

arrive at empirical formulas.

Search for Relationships: Relationship between metrics is a mirror of interplay between processes. Now we

wish to analyze metrics in search of relationships. In principle we can suppose a relationship between any two

metrics. For example, let us look at the relationships between six core metrics selected from a project:

1. Skill level

2. Productivity

3. Review effectiveness

4. Defect density

5. Effort variance

6. Size

A relationship map of these six core metrics is displayed in Figure 41. The connecting lines denote possible

relationship. Any two metrics, taken as a pair, provide an opportunity to conceive a relationship. There

are 15 distinct pairs among six metrics (6 × 5 / 2), and to match there are 15 relationship lines in the map. Not all the supposed

relationships are meaningful. Some are merely mechanical constructs, just unreal mathematical possibilities. In

others, we do have expectations to uncover relationships of practical significance.
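The number of relationship lines can be checked by enumerating the pairs without regard to order; this is a small sketch whose variable names are illustrative.

```python
from itertools import combinations

metrics = ["Skill level", "Productivity", "Review effectiveness",
           "Defect density", "Effort variance", "Size"]

# Each pair of distinct metrics corresponds to one line in the map.
pairs = list(combinations(metrics, 2))
print(len(pairs))
for first, second in pairs[:3]:
    print(first, "<->", second)
```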

Pairing metrics is a limited, simple step, useful within the limits. We can see a complex set of relationships if we

connect one "driven" metric to five "driver" metrics. This way we are applying a cause-and-effect relationship

or predictor-response model. We take defect density as the effect and can imagine that it is driven by the

remaining five metrics, establishing a one-to-five multivariate mapping. Considering the simultaneous influence

of five predictor metrics on one response metric is a more complete and more rigorous approach.

Perceiving Relationships: Let us consider metrics in ordered pairs - two at a time - and take a look at the

possible types of relationships that can exist between them. Relationships may be perceived by plotting scatter

diagrams. One of the two chosen metrics will be treated as the dependent variable (y-axis), the other as the

independent variable (x-axis). The scatter diagram may reveal relationships, which can be among the five types

mentioned in Table 14.

Type     Strength   Relationship
Type 1   Strong     Positive
Type 2   Strong     Negative
Type 3   Weak       Positive
Type 4   Weak       Negative
Type 5   Weak       No relationship

Perceiving the type of influence between metrics allows us to see the interplay between process elements. In

Figure 42 the five types of influences, or relationships, are illustrated.

Strength of Relationship: Correlation Coefficient: We may begin the relationship study between two

variables by estimating the correlation coefficient (r), which is a statistical measure of the degree of linear

relationship between the two variables. It lies between +1 and -1 depending on whether the relationship is

positive or negative. The strength of the relationship is expressed by the absolute value of the correlation

coefficient.

Let us consider the metrics Skill Level and Productivity as x and y variables for a correlation study. Metrics

data obtained from a project is given in Table 15. The correlation coefficient r is defined in the following

equation.

Computation of r using the equation above yields a value of 0.993 for the correlation coefficient. The

computation is shown in the following Table 16.

The correlation analysis shows that there is a good correlation between productivity and skill level. We need not

go through all these time-consuming steps to do a correlation study. Excel and similar spreadsheets lend support

with built-in statistical functions.
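For readers without a spreadsheet at hand, the coefficient can also be computed directly. The sketch below uses the usual product-moment (Pearson) definition, which is assumed to be the equation meant above, with made-up skill and productivity values rather than the data of Table 15.

```python
import math


def pearson_r(xs, ys):
    """Product-moment correlation coefficient of two equal-length series."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    syy = sum((y - y_mean) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


skill = [1, 2, 3, 4, 5]               # hypothetical skill levels
productivity = [10, 14, 19, 22, 27]   # hypothetical productivity values
print(f"r = {pearson_r(skill, productivity):.3f}")
```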

Table 16: Calculation of correlation coefficient

The calculation is based on the following concrete equations relating to the considered productivity data shown

in the table above.

Causal Relationship and Statistical Correlation: There is a difference between correlation and causal

relationship. Correlation between metrics suggests that they are associated; a change in one is accompanied by a roughly

corresponding change in the other. However, mere association does not assure a causal relationship. Correlation could be

superficial. The variables keep pace perhaps by coincidence. In a feeding experiment with pigeons, food was

dropped in a random manner. However, some pigeons happened to see food drop when they raised their heads.

A coincidence, indeed. These pigeons moved their heads up when they needed food and expected food to drop

from the feeder. Other pigeons thought sideways movement caused food drop. The pigeons soon settled in a self-

devised superstition on the basis of apparent correlation. Expectation (or estimation) based on the strength of

mere correlation might be misleading. Likewise, if the linear correlation coefficient is zero, we cannot come to a

conclusion that there is no relationship at all. Other forms of relations might still exist, invisible because they are

“buried” in the data. Sometimes, linear correlation studies may not be able to grasp highly nonlinear or cyclic

patterns. One should be careful while making correlation studies; correlation can degenerate into scientific

superstition if not validated. A relationship, on the other hand, goes beyond statistical correlation and coincidence.

Usually a relationship is conceived before data analysis, based on some fundamental assumptions or well-

known, time-proven concepts. Sometimes a new relationship is proposed based on theoretical reasoning, which

awaits validation.

Linear Regression: We will now move from correlation coefficient, which measures the strength of

relationships between two variables, to regression analysis, which determines the mathematical expression of the

relationship. In the simplest form of regression, the dataset is fitted to the equation y = a + bx, where y is the

dependent variable and x is the independent variable. The values of x are assumed to cause or determine the

values of y. y = a + bx is known as the regression line to which the data points regress. This is also taken as a

regression model, which estimates y from x.

Error Sum of Squares: The difference between the estimated value and the true value is called the error

of estimation or residual in regression. For a proposed regression model, one can find the error sum of squares

by the following equation.

The Principle of Least Squares: The best fit regression model, built according to the principle of least

squares, is the regression line that achieves a minimum value for the error sum of squares. This is done

through a process of iteration, where the error sum of squares converges to its lowest value.

Standard Error of Estimate: The standard error of estimate measures the variability or scatter of the

observed values around the regression line. It is also a measure of reliability of the regression line as an

estimation equation. It is calculated using the following equation.

Total Sum of Squares (TSS): This is the total of the squared differences between each sample

observation and the sample mean, as shown in the following equation.

Coefficient of Determination (R²): This is the proportion of variation in y that is accounted for by regression on x.
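The quantities defined above can be computed together for a straight-line fit. The sketch assumes the usual textbook definitions, including n − 2 degrees of freedom in the standard error of estimate, because the original equations are not reproduced here. For a straight line the least-squares coefficients follow in closed form, so no iteration is needed in this case. The sample data are hypothetical.

```python
import math


def fit_line(xs, ys):
    """Least-squares line y = a + b*x with common regression statistics."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    b = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
         / sum((x - x_mean) ** 2 for x in xs))
    a = y_mean - b * x_mean
    # Error sum of squares: squared residuals around the fitted line.
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    # Total sum of squares: squared deviations from the sample mean.
    tss = sum((y - y_mean) ** 2 for y in ys)
    return {
        "a": a,
        "b": b,
        "sse": sse,
        "se": math.sqrt(sse / (n - 2)),  # standard error of estimate
        "tss": tss,
        "r2": 1 - sse / tss,             # share of variation explained
    }


result = fit_line([1, 2, 3, 4], [2, 4, 5, 8])  # hypothetical data
for name, value in result.items():
    print(f"{name}: {value:.4f}")
```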

Linear Regression: Example: We present an example of regression analysis on the relationship between

Review Effectiveness (RE) and Defect Density (DD). The independent variable is Review Effectiveness,

and the dependent variable is Defect Density. We expect a relationship between DD and RE. We believe

that increase in RE will make DD come down. However, we do not know whether the relationship will be

nonlinear, weak, or strong; we wish to find from the regression analysis. A typical regression analysis

using the Excel tool yields outputs that include the following results:

• Regression line

• Regression table

• Residual plot

• Regression statistics

The first output, the regression line, is shown in Figure 43. The equation to the regression line and the coefficient

of determination are also printed in a textbox next to the regression line.

The regression results are presented by the tool in a tabular form as shown in Table 17. This table presents the

predicted values (y estimated) and the observed values (y true). The difference between them is presented as

residuals. The residuals provide important information for judging the adequacy of the regression analysis. One

way they can be used is in a plot of the residuals versus the independent variable. If the residuals do not appear

to be randomly scattered about the horizontal zero line, it may indicate a problem with the regression analysis.

Perhaps a straight-line relationship is not appropriate, or the assumptions of normality or constant variance are

not reasonable. A plot of the residuals is shown in Figure 44.

Regression statistics include the estimation of the coefficient of determination (R²) and the standard error, as in

Table 18.

Outliers in Relationship: A special graph showing the sloping lines (1 SE and 2 SE) that run parallel to the best

fit line indicating outliers is given in Figure 45. Those data points that lie beyond a threshold of 1 SE slopes are

considered as results of process violations, and marked for study and examination. The graph in Figure 45 is

known sometimes as a sloping control chart. Here the control chart raises a trigger when a process changes its

inner dynamics. This trigger is regarded as more proactive than the conventional control charts.

Departure from the expected relation is the decision criterion, not the magnitude of defect density. For example,

in Figure 45, the outlier has the least defect density, and for all practical reasons it represents a good job done by

the developers. However, we wish to question why the relationship with review effectiveness has changed. This

unexpected change in relationship could mean that:

• Factors other than Review Effectiveness have contributed to defect reduction.

• The intended relationship (DD = -0.1927 RE + 31.199) has failed to govern this outlier for reasons not known to us.
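A sloping control chart check can be sketched as below. The regression line is the one quoted in the text; the data points, the standard error, and the function name are hypothetical.

```python
def relationship_outliers(points, slope, intercept, se, k=1.0):
    """Flag points whose residual from the regression line exceeds k*SE."""
    flagged = []
    for rev_eff, defect_density in points:
        expected = intercept + slope * rev_eff
        residual = defect_density - expected
        if abs(residual) > k * se:
            flagged.append((rev_eff, defect_density, round(residual, 3)))
    return flagged


# Regression line quoted in the text: DD = -0.1927 RE + 31.199.
# The observations and the standard error (2.0) are made up.
points = [(40, 23.0), (60, 20.1), (80, 9.0)]
flagged = relationship_outliers(points, slope=-0.1927, intercept=31.199, se=2.0)
print(flagged)
```

The flagged point has the lowest defect density of the three, which mirrors the remark above: the trigger is the departure from the expected relationship, not a poor value of the metric itself.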

Nonlinear Regression Models: In nonlinear regression the dataset is fitted to nonlinear curves, again using the

principle of least squares. Where linear relationships are absent, there could be nonlinear relationships that we

must verify. Nonlinear regression analysis is an iterative approach. We try different modelling equations; if one

equation does not describe the data, then we try a different equation. The dataset must be carefully examined

before the iteration begins. If the data is not enough in “critical ranges”, it is safer to wait until more data is

collected in the region. If the data is too scattered, nonlinear fittings could give unstable results. If possible,

collect more data to make sure that the wide scatter (suggesting weak relationship) is not a mistake but a reality

we have to deal with. Simple data transformations or normalization may be tried to see if the data scatter can be

narrowed.

Productivity, by definition, is size/effort. Productivity is a heavily loaded metric, and is very complex in the sense that many

factors determine its value. Productivity tends to be fundamentally nonlinear in nature. Studies have been

made in mapping productivity drivers to productivity estimates. We will pick size from the potential

drivers and study its relationship with productivity. Metrics data has been collected for size in function

points (FP) and effort in person months (PM). Size is the predictor variable or independent variable x.

Productivity itself is the “response variable” or dependent variable y. The data is presented in Table 19.

Nonlinear Regression Analysis: We will use the following nonlinear equations for regression analysis of

a typical productivity dataset given in Table 19. Excel has been used to generate the regression curves

that correspond with the following six nonlinear equations:

1. Nonlinear regression logarithmic equation

2. Nonlinear regression polynomial-degree 2

3. Nonlinear regression polynomial-degree 3

4. Nonlinear regression polynomial-degree 4

5. Nonlinear regression power equation

6. Nonlinear regression exponential equation

Goodness of Fit: The regression curves are shown in Figure 46. It may be seen that the coefficient

of determination, R², which represents the quality of fit, is different for different regression

equations. The lowest value is 0.3034 for the logarithmic curve and the best value is 0.5621 for

the fourth degree polynomial curve. R² gives an indication of closeness of data points to the regression

equation in a statistical sense. This helps in making a first order judgment on regression.

Monotonicity: However, choosing the regression curve must consider the other requirements of curve

fitting. The regression curves must be monotonic and stable. A look at the six models in Figure 46 shows

that one model - the fourth-order polynomial - shows a curve, which reverses its trend in a few places.

Physically, trend reversal means that a larger program costs less in those regions of reversal - an absurdity.

Stability of Nonlinear Regression Curves: A Comparison: The forecasting ability of nonlinear curves has

to be assessed while choosing regression models. Let us formulate a forecasting problem and examine

how the six nonlinear regression models fare. The forecasting problem we have taken is to predict

productivity value (y) for a given size of 15000 FP (x) (see Table 20). It may be noted that the current

data range is 0 to 11000 FP. This means that the regression curve has to be extrapolated by 4000 FP to

reach an estimate.

The results of forecasting are illustrated in Figure 47. The fourth-order polynomial

predicts a deeply negative value, while all other models predict productivity in the range between 23 and

43 FP/PM. Negative productivity is a physically meaningless number, and magnitude of the negative

value indicates a complete failure in forecasting. The forecasting performance of the fourth-order

polynomial is shown in Figure 47, along with the power curve. It is seen from these results that the

polynomial curve has collapsed to negative values of productivity. Hence, it is a poor and unreliable

estimate. The power curve, however, behaves better and predicts a value that is realistic.
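To illustrate why the power curve extrapolates more safely, the sketch below fits y = a·x^b by least squares on log-transformed data, which is one common way to fit a power curve and an assumption about method here. The fitted coefficient a is always positive, so the curve stays positive for any positive size and cannot collapse to negative productivity. The data are hypothetical, not those of Table 19.

```python
import math


def fit_power(xs, ys):
    """Fit y = a * x**b by least squares on log-transformed data."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(xs)
    mx, my = sum(lx) / n, sum(ly) / n
    b = (sum((u - mx) * (v - my) for u, v in zip(lx, ly))
         / sum((u - mx) ** 2 for u in lx))
    a = math.exp(my - b * mx)
    return a, b


size = [500, 1000, 2000, 4000, 8000, 11000]   # FP, hypothetical
prod = [12.0, 15.5, 19.0, 24.5, 30.0, 33.0]   # FP/PM, hypothetical
a, b = fit_power(size, prod)

# Extrapolate beyond the observed range, as in the forecasting problem.
forecast = a * 15000 ** b
print(f"y = {a:.3f} * x^{b:.3f}; forecast at 15000 FP: {forecast:.1f} FP/PM")
```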

Multiple Linear Regression: So far we have been looking at relationships between one dependent variable (y) and one independent variable (x). But in many studies we need to consider the influence of several independent variables. In multiple linear regression, the mean of the dependent variable is a linear combination of the independent variables, as shown in the following equation.

$$E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k$$


Linearity: If the linearity assumption is not met, sometimes we can transform one or more of the x

variables, like taking the square root, and get a linear dependence.

Interaction: If interactions between the independent variables are to be included in the model, then additional cross-product terms, x_i x_j, have to be included in the model.

Surface Plot: We will consider a case study for multiple linear regression with two independent variables. The dependent variable is Defect Density (y), measured in Defects/KLOC. The independent variables are Skill Level (x1) and Review Effectiveness (x2). A surface plot of the linear model is shown in Figure 48. The planar Defect Density surface indicates how the quality of the software work product is influenced by the two variables. The surface slopes gently towards the high performance point with the following coordinate values:

This surface, being a plane, does not offer optimum points but only indicates the general direction of

process improvement.
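A planar model of this kind, and the variant with a cross-product term, can be fitted by ordinary least squares. The sketch below assumes NumPy. The observations are invented for illustration and are not the case-study data, and the helper name `fit_linear_model` is hypothetical.

```python
import numpy as np

# Hypothetical observations (not from the case study).
skill = np.array([2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0])            # x1, skill level
review = np.array([0.30, 0.50, 0.35, 0.60, 0.55, 0.80, 0.70, 0.90])   # x2, review effectiveness
defect_density = np.array([9.1, 7.6, 7.9, 6.0, 5.8, 3.9, 4.1, 2.5])   # y, defects/KLOC


def fit_linear_model(columns, y):
    """Least-squares fit of y = b0 + b1*c1 + ...; returns coefficients and R^2."""
    X = np.column_stack([np.ones_like(y)] + list(columns))
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ coef
    r2 = 1.0 - residuals @ residuals / np.sum((y - y.mean()) ** 2)
    return coef, r2


# Planar model: y = b0 + b1*x1 + b2*x2
plane_coef, plane_r2 = fit_linear_model([skill, review], defect_density)

# Model with an interaction term: y = b0 + b1*x1 + b2*x2 + b3*x1*x2
inter_coef, inter_r2 = fit_linear_model([skill, review, skill * review], defect_density)

print("plane:", np.round(plane_coef, 3), "R^2 =", round(float(plane_r2), 3))
print("with interaction:", np.round(inter_coef, 3), "R^2 =", round(float(inter_r2), 3))
```

The signs of the fitted coefficients give the direction of improvement. Adding the interaction term can never lower R², so a higher R² alone is no reason to prefer the larger model.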


4 SPC and CMMI

In general, we can establish the following four categories of processes in software development ([Kulpa 2003], [SEI 2002]): the project management processes, the process management processes, the engineering processes, and the support processes. Based on process models like the CMMI, we can evaluate the main activities shown in Figure 49.

According to the GQM paradigm and the principles of the CAME framework for successful measurement application, we can formulate the basic CMMI intentions with respect to the SPC approach as follows (see Figure 50).


The actual goals are implied by achieving the different levels of the CMMI maturity evaluation. The appropriate questions for process maturity can be identified by considering the CMMI key processes. In the following, we give the essential questions for satisfying these key processes, cited from [Singpurwalla 1999].

Maturity Level 2:

1. For each project involving software development, is there a designated software manager?

2. Does the project software manager report directly to the project (or project development)

manager?

3. Does the Software Quality Assurance (SQA) function have a management reporting channel

separate from the software development project management?

4. Is there a designated individual or team responsible for the control of software interfaces?

5. Is there a software configuration control function for each project that involves software

development?

6. Does senior management have a mechanism for the regular review of the status of software

development projects?

7. Is a mechanism used for regular technical interchanges with the customer?

8. Do software development first-line managers sign off on their schedules and cost estimates?

9. Is a mechanism used for controlling changes to the software requirements?

10. Is a mechanism used for controlling changes to the code? (Who can make changes and under what

circumstances?)

11. Is there a required training program for all newly appointed development managers designed to familiarize them with software project management?

12. Is a formal procedure used to make estimates of software size?

13. Is a formal procedure used to produce software development schedules?

14. Are formal procedures applied to estimating software development cost?

15. Is a formal procedure used in the management review of each software development prior to

making contractual commitments?

Maturity Level 3:

16. Is a mechanism used for identifying and resolving system engineering issues that affect software?

17. Is a mechanism used for independently calling integration and test issues to the attention of the

project manager?

18. Are the action items resulting from testing tracked to closure?

19. Is a mechanism used for ensuring compliance with the software engineering standards?

20. Is a mechanism used for ensuring traceability between the software requirements and top-level

design?

22. Are the action items resulting from design reviews tracked to closure?

23. Is a mechanism used for ensuring traceability between the Software top-level and detailed

designs?

24. Is a mechanism used for verifying that the samples examined by Software Quality Assurance are

representative of the work performed?

25. Is there a mechanism for ensuring the adequacy of regression testing?


Key Process Area 3 (K33)-Peer Review

27. Is a mechanism used for controlling changes to the Software design?

28. Is a mechanism used for ensuring traceability between Software detailed design and the code?

29. Are Software code reviews conducted?

30. Is a mechanism used for configuration management of the Software tools used in the development

process?

Maturity Level 4:

31. Is a mechanism used for periodically assessing the Software engineering process and

implementing indicated improvements?

32. Is there a formal management process for determining if the prototyping of Software functions is

an appropriate part of the design process?

33. Are design and code review coverage measured and recorded?

34. Is test coverage measured and recorded for each phase of functional testing?

35. Are internal design review standards applied?

36. Has a managed and controlled process database been established for process metrics data across all

projects?

37. Are the review data gathered during design reviews analyzed?

38. Are the error data from code reviews and tests analyzed to determine the likely distribution and

characteristics of the errors remaining in the product?

39. Are analyses of errors conducted to determine their process-related causes?

40. Is review efficiency analyzed for each project?

Maturity Level 5:

42. Is a formal procedure used to ensure periodic management review of the status of each software

development project?

43. Is a mechanism used for initiating error prevention actions?

44. Is a mechanism used for identifying and replacing obsolete technologies?

45. Is software productivity analyzed for major process steps?

As the appropriate metrics for answering the questions above, we repeat the CMMI metrics defined by Kulpa and Johnson (only for CMMI Level Four) [Kulpa 2003]:

QM01: Trends in the organization's process performance with respect to changes in work products and

task attributes (e.g., size growth, effort, schedule, and quality)

QM03: Critical resource utilization

QM04: Number and severity of defects in the released product

QM05: Number and severity of customer complaints concerning the provided service

QM06: Number of defects removed by product verification activities (perhaps by type of verification,

such as peer reviews and testing)

QM07: Defect escape rates

QM08: Number and density of defects by severity found during the first year following product delivery

or start of service

QM09: Cycle time


QM10: Amount of rework time

QM11: Requirements volatility (i.e., number of requirements changes per phase)

QM12: Ratios of estimated to measured values of the planning parameters (e.g., size, cost, and schedule)

QM13: Coverage and efficiency of peer reviews (i.e., number/amount of products reviewed compared to

total number, and number of defects found per hour)

QM14: Test coverage and efficiency (i.e., number/amount of products tested compared to total number,

and number of defects found per hour)

QM15: Effectiveness of training (i.e., percent of planned training completed and test scores)

QM16: Reliability (i.e., mean time-to-failure usually measured during integration and systems test)

QM17: Percentage of the total defects inserted or found in the different phases of the project life cycle

QM18: Percentage of the total effort expended in the different phases of the project life cycle

QM19: Profile of subprocesses under statistical management (i.e., number planned to be under statistical

management, number currently being statistically managed, and number that are statistically

stable)

QM20: Number of special causes of variation identified

QM21: The cost over time for the quantitative process management activities compared to the plan

QM22: The accomplishment of schedule milestones for quantitative process management activities

compared to the approved plan (i.e., establishing the process measurements to be used on the

project, determining how the process data will be collected, and collecting the process data)

QM23: The cost of poor quality (e.g., amount of rework, re-reviews and re-testing)

QM24: The costs for achieving quality goals (e.g., amount of initial reviews, audits, and testing)
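Several of these metrics are simple ratios. As a minimal sketch, assuming the raw counts are already available from the measurement repository, QM12 and QM13 could be computed as follows. The function and parameter names are illustrative and are not taken from [Kulpa 2003].

```python
def peer_review_coverage(reviewed, total):
    """QM13, coverage: share of work products that were peer reviewed."""
    return reviewed / total


def peer_review_efficiency(defects_found, review_hours):
    """QM13, efficiency: defects found per review hour."""
    return defects_found / review_hours


def estimate_ratio(estimated, measured):
    """QM12: ratio of an estimated planning parameter to its measured value."""
    return estimated / measured


# Hypothetical project figures.
print(peer_review_coverage(reviewed=45, total=60))               # share of products reviewed
print(peer_review_efficiency(defects_found=18, review_hours=12))  # defects per hour
print(estimate_ratio(estimated=400, measured=500))                # size estimate vs. actual
```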

SPC depends on historical data. It also depends on accurate, consistent process data. If you are just beginning the

process improvement journey, do not jump into SPC. You (your data) are not yet ready for it. That is why the

CMMI waits until Maturity Level 4 in the staged representation to suggest the application of SPC techniques. At

Level 2, processes are still evolving. At Level 3, they are more consistent. Level 4 takes process information

from Level 3, and analyzes and structures both the data and their collection. Level 5 takes predictable and

unpredictable processes, and improves them.

Finally, we describe some statistical methods that specifically support Statistical Process Control (see [Pandian 2004], [Putnam 2003], [Zelkowitz 1997] and [Zuse 2003]).

The Shewhart control chart, introduced in the 1920s, decomposes process variation into two components: random variation (within predictable bounds) and systematic variation (anomalies). Random variations, when the cause system is constant, approach some distribution function and hence remain predictable, or statistically stable. Systematic variations are due to assignable causes such as unusual events, freak incidents, process drifts, and environmental threats. Shewhart demonstrated how control charts could be used to identify and distinguish the two types of process variation in order to achieve process efficiency and the ensuing economic benefits. Figure 51 shows how a training manager uses the Shewhart control chart to identify (and later solve) two problems: an extraordinary cost for Training ID 7 and an average cost (µ) greater than the budget. Armand V. Feigenbaum takes a pragmatic view and allows control limits to be specified from past experience and guesswork.
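For individual observations such as one cost figure per training, the limits of a Shewhart chart are commonly derived from the average moving range (the individuals or XmR chart). The sketch below assumes that chart type, which the text does not name explicitly, and uses hypothetical cost data rather than the data behind Figure 51.

```python
from statistics import mean


def individuals_chart_limits(values):
    """Center line and 3-sigma limits for an individuals (XmR) chart.

    Sigma is estimated from the average moving range: sigma = MR_bar / 1.128,
    so the limits are mean +/- 2.66 * MR_bar.
    """
    if len(values) < 2:
        raise ValueError("need at least two observations")
    center = mean(values)
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    spread = 2.66 * mean(moving_ranges)
    return center - spread, center, center + spread


# Hypothetical training costs per training ID (IDs start at 1).
costs = [10.0, 11.0, 10.0, 11.0, 10.0, 11.0, 25.0, 10.0, 11.0, 10.0]
lcl, center, ucl = individuals_chart_limits(costs)

# Problem 1: points outside the control limits (special causes).
out_of_control = [i + 1 for i, c in enumerate(costs) if c > ucl or c < lcl]

# Problem 2: the process average compared with a hypothetical budget line.
budget = 10.0
over_budget = center > budget

print("limits:", round(lcl, 2), round(center, 2), round(ucl, 2))
print("out of control IDs:", out_of_control, "average over budget:", over_budget)
```

The two findings are of different kinds: the outlier is a special cause to be investigated, while an average above budget is a property of the stable process and calls for a process change.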


Tests for Control Charts: Tests for statistical control have been in use for a long time. The classical tests or

decision rules to be applied while reading the control charts are presented in the following list, along with an

illustration in Figure 52.

Test #1: Any point outside one of the control limits is an indication of a special

cause and needs to be investigated.

Test #2: A run of seven points in succession, either all above the central line or

below the central line or all increasing or all decreasing, is an indication of a

special cause and needs to be investigated.

Test #3: Any unusual pattern or trend involving cyclic or drift behaviour of the data

is an indication of a special cause and needs to be investigated.

Test #4: The proportion of points in the middle-third zone of the distance between

the control limits should be about two thirds of all the points under

observation.
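Tests #1, #2, and #4 are mechanical and can be expressed as small functions, as in the sketch below. Test #3 (unusual patterns) requires judgment and is not covered. The function names are illustrative assumptions.

```python
def outside_limits(values, lcl, ucl):
    """Test #1: indices of points outside the control limits."""
    return [i for i, v in enumerate(values) if v < lcl or v > ucl]


def has_run(values, center, length=7):
    """Test #2: `length` successive points on one side of the center line,
    or `length` successive points all increasing or all decreasing."""
    def longest(flags):
        best = current = 0
        for flag in flags:
            current = current + 1 if flag else 0
            best = max(best, current)
        return best

    above = longest(v > center for v in values)
    below = longest(v < center for v in values)
    steps = list(zip(values, values[1:]))
    rising = longest(b > a for a, b in steps) + 1   # n steps span n + 1 points
    falling = longest(b < a for a, b in steps) + 1
    return max(above, below, rising, falling) >= length


def middle_third_share(values, lcl, ucl):
    """Test #4: share of points in the middle third of the band between the limits."""
    third = (ucl - lcl) / 3.0
    inside = [v for v in values if lcl + third <= v <= ucl - third]
    return len(inside) / len(values)


# Hypothetical readings with limits 0 and 9 and center line 4.5.
readings = [4, 5, 6, 1, 9, 5]
print(outside_limits(readings, 0, 9))
print(has_run(readings, 4.5))
print(round(middle_third_share(readings, 0, 9), 2))
```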

Control Chart in the Presence of Trend: If the metric shows a trend, such as delivered defect density (DDD) in Figure 53, the control chart may be partitioned to present the problem more clearly. The trend line helps in forecasting and risk estimation. The baseline helps in process analysis, estimation, and setting process guidelines.

Dual Process Control Charts: Sometimes the metric is a product of two major components, each showing its own independent characteristics. Defects found by design review, for instance, are the product of defects injected and review effectiveness, as shown in the following equation.

$$\text{Defects found} = \text{Defects injected} \times \text{Review effectiveness}$$


Figure 53: Trend and baseline

The UCL in the control chart of defects/KLOC, as shown in Figure 54, is more relevant to the designers, who have to keep the defect level below the UCL. The LCL, on the other hand, appeals to the reviewers to find more defects than the LCL. In the defect control chart in Figure 54, the following references are marked for proper interpretation:

From Dual Limits to Single Limits: The control chart in Figure 54 is cluttered, and one has to strain to read, analyze, and interpret it. When the chart is used to give process feedback, process owners may receive mixed signals: one limit demands a minimum production of defects, while the other demands just the opposite.


Figure 54: In-process defect control chart

This problem can be solved, and the chart presented effectively, if we construct two separate control charts, each delivered to the respective process owner with the appropriate control limit, as indicated in Figure 55. After the split, the new control charts look simple and clear, with just one decision rule marked. The process owner, whether designer or reviewer, gets a clear signal.

The process defects are marked as circles in both cases. With defects clearly marked and the goal (specification

limit) clearly specified, each process owner can go into causal analysis of process violations and initiate

corrective measures. The purpose of this control chart is to provide effective feedback and facilitate corrective

action.

Control Chart Types: There are several control chart forms in use, including the ones we have used so

far. Below is a brief list for a quick reference. The exact formulas for computations may be found

elsewhere. When we have a large number of data points that can be organized as sub-groups according to

some real-life order, and when the sub-group sizes are used in determining the control limits, the

following charts may be useful.

X-bar - R chart with UCL and LCL

X-bar - S chart with UCL and LCL

p Chart (percentage defectives) with UCL and LCL

u Chart (defects per unit size) with UCL and LCL

c Chart (defect counts per module) with UCL and LCL
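As one worked instance from this list, the u chart handles subgroups of different sizes by giving each subgroup its own limits. The sketch below uses the standard u chart formula, u_bar ± 3·sqrt(u_bar / n_i), and invented inspection data. The function name is an illustrative assumption.

```python
from math import sqrt


def u_chart_limits(defect_counts, unit_sizes):
    """Center line and per-subgroup 3-sigma limits for a u chart.

    defect_counts[i] defects were found in a subgroup of size unit_sizes[i]
    (for example KLOC inspected). Limits are u_bar +/- 3 * sqrt(u_bar / n_i),
    with the lower limit clipped at zero.
    """
    u_bar = sum(defect_counts) / sum(unit_sizes)
    limits = []
    for n in unit_sizes:
        half_width = 3.0 * sqrt(u_bar / n)
        limits.append((max(0.0, u_bar - half_width), u_bar + half_width))
    return u_bar, limits


# Hypothetical inspection data: defects found and KLOC inspected per module.
defects = [12, 8, 30, 9]
kloc = [4.0, 2.0, 4.0, 6.0]
u_bar, limits = u_chart_limits(defects, kloc)

# Modules whose defect density falls outside their own limits.
signals = [i for i, (d, n) in enumerate(zip(defects, kloc))
           if not limits[i][0] <= d / n <= limits[i][1]]
print("u_bar:", u_bar, "signals at module index:", signals)
```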


If instead of sub-groups we have just an individual data point for every process delivery, we can

artificially create a sub-group by selecting data points from a moving average window, and plot a graph

with control limits calculated in the traditional way.

When all we desire is to characterize the process and generate some performance baseline on a chosen

metric, the following forms may be used. These forms can be used across life cycle phases or across sub-

groups.

If we wish to compare actual values with estimates, then the following may be used:

• Cumulative graphs with interval estimates

• Run charts with estimates shown as USL, LSL

• Life cycle profiles with USL and LSL

• Run charts with baseline values (history) marked

Special Forms

Most performance models are constructed this way. A few of them are illustrated in this section.

Multi-Process Tracking Model: A simple way to take a holistic and balanced view of processes is to track all

related process metrics on a radar chart, marking the target values and the achieved values. Cost drivers,

performance drivers, and defect drivers in software development can be plotted on the radar chart for effective

process control. Tracking of multiple goals, all competing for resources, is presented in the radar chart format in

Figure 56. The following is a list of metrics used to represent and measure goals:

• Productivity index (PROD)

• Employee satisfaction index (EMP SAT)

• Right first time index (RFT)

• Defect removal effectiveness (DRE)

• Training need fulfilment index (TNF)

All these are measured quantitatively on a 0 to 10 scale (ratio scale). Targets and achievement in each direction

are plotted. This is a control chart because it compares reality with expectation and allows one to see deviations.

It gives deeper meaning and allows one to visualize a balanced picture or model on goal achievement.


Dynamic Model - Automated Control Charts: Control charts in modern times have taken a totally new form. They are embedded in metric databases and analysis modules, which perform dynamic functions. A defect-tracking tool uses a defect database as the platform and tracks bug closure. If the time taken exceeds a preset limit, the software generates a message to the tester. If the bug remains open long after the message, the software escalates the issue and the message is flashed to the project manager. The tester or the manager does not see a physical control chart but gets the results. The limit can be set by the manager, whose experience and judgment then prevail, or by the software logic, which uses an appropriate decision rule and raises an alarm. The decision-making algorithm can be simple algebra or a sophisticated knowledge engine that learns and works with intelligence. The graph is printed, on demand, as a report from the tool along with other statistics. In a similar way, metrics data analysis tools can generate dynamic control charts on all metrics. These charts can be published in the monthly process capability baseline reports.
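The escalation rule described above amounts to two thresholds on bug age. The following sketch shows one way to express it. The `Bug` record, the function name, and the day limits are hypothetical and do not describe any particular defect-tracking tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Bug:
    bug_id: int
    opened: datetime
    closed: Optional[datetime] = None


def notification_target(bug, now, limit, escalation_limit):
    """Return who should be notified about an open bug, or None."""
    if bug.closed is not None:
        return None                      # closed bugs raise no alarm
    age = now - bug.opened
    if age > escalation_limit:
        return "project manager"         # still open long after the first message
    if age > limit:
        return "tester"                  # preset limit exceeded
    return None


# Hypothetical limits, set by the manager or by a decision rule.
now = datetime(2024, 1, 31)
limit = timedelta(days=5)
escalation_limit = timedelta(days=15)

bugs = [
    Bug(1, datetime(2024, 1, 29)),
    Bug(2, datetime(2024, 1, 20)),
    Bug(3, datetime(2024, 1, 1)),
    Bug(4, datetime(2024, 1, 1), closed=datetime(2024, 1, 3)),
]
for bug in bugs:
    print(bug.bug_id, notification_target(bug, now, limit, escalation_limit))
```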

Control Chart for Effective Application: There are many forms of control charts but they all must be

structured well for effective application. Here are some suggestions. On any metric we can plot a control chart.

Choose the metric that communicates better. For instance, a training manager can choose cost of absenteeism

instead of number of people who are absent because the former makes senior management look at the control

chart seriously. The data should be in chronological order. Most software development processes follow the

learning curve, both first order and second order. Before process stability is achieved, the learning curve is

encountered. Chronological order gives control charts the vital meaning and power. A decision rule must be

provided to enable problem recognition. The rule could be expressed in the following ways:

• Control limits

• Specification limits

• Baseline references

• Estimated values

• Process goals

• Process constraints

• Benchmark values

• Expected trend

• Zones

The reader must be made familiar with the rules for interpretation. The chart must be designed with the most

likely readers in mind, and every effort must be made to make the chart provide effective communication to a

human system (biofeedback). Provide support data as annotations for significant data points. For example, a

defect distribution pie chart can be provided as a companion to a defect control chart. Annotate identified hot

spots or trends with causal analysis findings. We learn from such annotations. Wherever possible, suggested

corrective action may be indicated.

Modernism in Process Control - Decision Support Charts: Metrics data, when presented in time series, offers

a new form that helps to understand the process. A well-structured time series chart could emerge into a model

once it captures a pattern that can be applied as a historic lesson. The time series analysis for trend or process

control is also a time series model of the process, inasmuch as it can increase one's understanding of the process

behaviour and forecast.

What-If Analysis: But the outstanding issue in software projects is whether a process goes according to a plan or estimate. The need for statistically derived, self-organizing goals, should it arise, is only secondary. The term control chart may then be replaced with the term decision support chart. The concept of control limits will be substituted with the concept of decision thresholds. What-if analysis can be done on a control chart by shifting the limits and seeing each time how many events are picked up and earmarked for investigation. The problem set will shift according to the location of the threshold line.
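Shifting the threshold and counting the events it picks up is easy to automate. The sketch below uses hypothetical schedule slippage figures and candidate thresholds. The function names are illustrative.

```python
def flagged_events(values, threshold):
    """Indices of observations above a decision threshold."""
    return [i for i, v in enumerate(values) if v > threshold]


def what_if(values, thresholds):
    """Map each candidate threshold to the events it would earmark."""
    return {t: flagged_events(values, t) for t in thresholds}


# Hypothetical schedule slippage per delivery, in days.
slippage = [2, 5, 1, 9, 4, 12, 3, 7]
scenarios = what_if(slippage, thresholds=[4, 6, 10])
for threshold, events in scenarios.items():
    print(f"threshold {threshold}: {len(events)} events to investigate -> {events}")
```

A tighter threshold enlarges the problem set and a looser one shrinks it, so the manager can choose the threshold that matches the available capacity for investigation.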

Clues, Not Convincing Proof: There are reasons why metrics control charts end up issuing suggestive

clues but not convincing proof about process problems:

• Data errors

• Ambiguity in measurement scale

• Process having nonnormal distributions

• Nonavailability of defect propagation models


But all a project manager is looking for is a set of clues, not final proof. A decision support chart can

coexist with ambiguity but the classical control chart cannot.

If It Is Written on the Wall, Do Not Draw Control Charts: If known problems are not solved, nobody

wants to use a control chart to detect new problems. If trouble can be spotted without having to use a

control chart, avoid control charts. Going one step further, if without the aid of control limits we can spot

outliers using the naked eye, let us not draw control limits.

The connection of control charts with action is now legendary. The best control chart is the one on which

somebody acts.

Regression models have huge application potential in software engineering and management. They support the

creation of a wide variety of knowledge products from simple visual display of relationships to estimation

equations. They can reflect real situations in different degrees of detail, ranging from simple two-variable models

to complex multiple variable models. They can capture process nonlinearity and allow us to exploit this

knowledge, either in optimization or in risk avoidance.

Regression models are naturally suited to causal analysis. The x-y relationship is a cause-effect relationship (in the predictor-predicted sense). The regression analysis discussed here makes use of productivity data. Requirement effort (in percent) has been chosen as the independent variable. The data and the nonlinear regression line fitted to the dataset are shown in Figure 57. The association rule for causal analysis demands a good R², and we get a value of 64.34 percent. The extraneous data and outliers can be put aside and we can focus on the regression line to do causal analysis. Logic tells us that software productivity should improve with better requirement capturing (and a direction for causal analysis is set this way). The regression model (nonlinear, logarithmic) shows an asymptotic rise in productivity, and we can see a shoulder on the curve after which it becomes flat. Requirement effort affects productivity up to a point; beyond it, either other factors take over or further investment in requirements does not yield a return.
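A logarithmic model of this kind is linear in ln(x), so it can be fitted with ordinary least squares. The sketch below assumes NumPy and uses invented data, not the dataset behind Figure 57. It also computes R² and the marginal gain b/x, which quantifies the flattening of the curve.

```python
import numpy as np

# Hypothetical data, not the dataset behind Figure 57.
req_effort_pct = np.array([2.0, 4.0, 6.0, 8.0, 10.0, 14.0, 18.0, 22.0])
productivity = np.array([8.0, 13.0, 15.5, 17.0, 19.0, 20.0, 21.5, 21.0])

# Logarithmic model y = a + b * ln(x), fitted as a straight line in ln(x).
b, a = np.polyfit(np.log(req_effort_pct), productivity, deg=1)
predicted = a + b * np.log(req_effort_pct)

# Coefficient of determination.
ss_res = np.sum((productivity - predicted) ** 2)
ss_tot = np.sum((productivity - productivity.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot

# Marginal gain of one more percentage point of requirement effort: dy/dx = b / x.
marginal_gain = b / req_effort_pct

print("a =", round(float(a), 2), "b =", round(float(b), 2), "R^2 =", round(float(r_squared), 3))
print("marginal gain:", np.round(marginal_gain, 2))
```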

That there exists an optimum team size has been much discussed and widely quoted. But what are the facts? A regression model relating team size to productivity reveals the real picture. Team size and productivity data are shown in Figure 58, and the graph shows the nonlinear regression curve, a power equation, which fits with an R² of 42.28 percent. According to the regression model, when the team grows away from the organic small size, its productivity decreases steeply. The nonlinear model does not permit optimization of team size; it imposes a constraint equation on software projects. The choice is made not on the basis of a best value demonstrated by the model itself but on the basis of other factors. For example, a strategic limit on minimum productivity would dictate the team size limit. In those cases where a larger team size is chosen based on other considerations, we know from the model what the corresponding loss in productivity would be, and can take appropriate countermeasures. This model would also help in breaking work packages into smaller units and operating the project with the proverbial small teams.
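The use of such a model as a constraint can be sketched as follows. The example assumes NumPy, fits a power equation to invented team size data (not the dataset behind Figure 58), and then derives the team size limit implied by a hypothetical minimum productivity.

```python
import numpy as np

# Hypothetical data, not the dataset behind Figure 58.
team_size = np.array([2.0, 3.0, 4.0, 6.0, 8.0, 12.0, 16.0, 20.0])
productivity = np.array([30.0, 26.0, 21.0, 18.0, 15.0, 12.0, 11.0, 9.0])

# Power model y = a * x**b, fitted as a straight line in log-log space.
b, log_a = np.polyfit(np.log(team_size), np.log(productivity), deg=1)
a = np.exp(log_a)


def predicted_productivity(size):
    return a * size ** b


def max_team_size(min_productivity):
    """Largest team size whose predicted productivity still meets the minimum.

    Solves a * x**b = min_productivity for x; meaningful only when b < 0.
    """
    return (min_productivity / a) ** (1.0 / b)


limit = max_team_size(15.0)  # hypothetical strategic minimum productivity
loss = predicted_productivity(4.0) - predicted_productivity(12.0)

print("exponent b:", round(float(b), 3))
print("team size limit:", round(float(limit), 1))
print("predicted loss when growing from 4 to 12:", round(float(loss), 1))
```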

Predicting effort from size has been a favourite game for several researchers. The resulting models go by the name of cost models and estimation models.

Our objective here is to apply regression modelling to design an effort estimation model from data commonly

available in projects, namely, effort and size. Some practical data is provided in Table 21.


Expectation: The metrics used here are effort in hours and size in function points. Size is taken as the independent variable. The expected relationship, based on experience, is a power equation of the form

$$\text{Effort} = a \cdot \text{Size}^{b}$$

We also expect complications in regression model building. Size measurements can have errors, which will interfere with regression.

Analysis: Regression analysis of the dataset is shown in Figure 60. A linear regression line appears with

goodness of fit 39.75 percent, a poor value for an estimation model. There is a large scatter of data. The

model requires improvement.


Presentation of such scatter plots sometimes invites criticism. Lack of clear trend makes people give up

and lose interest in analysis. They conclude that "if you have enough data you can prove any theory." The

problem is quite basic. The step that had been missed in data collection is "categorization," a discipline

lower in the rank of measurement scales but which could bring in clarity.

Clustering: Examining the scatter plot in Figure 60, we may notice a possibility for clustering: regrouping the data according to some logical rule and trying separate regressions for each cluster. The exploratory data analysis indicates a natural divide in the data that is worth finding. The logic for regrouping must be based on some physical reasoning, such as types of projects, nature of technology, or even year of completion. Histograms can be used to test for the existence of strong clusters. The data was grouped into two clusters. The regrouped data is shown in Table 22.

New Regression Models: The new regression lines, obtained after clustering, are shown in Figure 61. The goodness-of-fit figures are 83.44 percent for one cluster and 67.63 percent for the other. Regression quality is far better than in the first run. This example emphasizes the need for iterative runs in model building. We can continue the iteration with further clustering, transformation, partitioning, or other means of model refinement. We can also search for better equations. Of course, we can move to multiple linear regression and achieve better and better models. It is a process in itself. The quest ends when we have a reasonable model that has a reasonable confidence level and agrees with common sense.
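The effect of clustering on goodness of fit can be reproduced in a few lines. The sketch below assumes NumPy and uses invented projects (not the data of Table 21 or Table 22), grouped by a hypothetical project type. It compares the R² of one pooled regression line with the R² of a separate line per cluster.

```python
import numpy as np


def linear_fit_r2(x, y):
    """Slope, intercept and R^2 of a least-squares line."""
    slope, intercept = np.polyfit(x, y, deg=1)
    residuals = y - (slope * x + intercept)
    r2 = 1.0 - np.sum(residuals ** 2) / np.sum((y - y.mean()) ** 2)
    return slope, intercept, r2


# Hypothetical projects: size in FP, effort in hours, project type.
size = np.array([100, 200, 300, 400, 500, 150, 250, 350, 450, 550], dtype=float)
effort = np.array([820, 1580, 2450, 3150, 4050, 420, 640, 930, 1100, 1420], dtype=float)
kind = np.array(["new"] * 5 + ["maintenance"] * 5)

# First run: one regression line for all projects.
_, _, pooled_r2 = linear_fit_r2(size, effort)

# Second run: one regression line per cluster.
cluster_r2 = {}
for label in np.unique(kind):
    mask = kind == label
    _, _, cluster_r2[str(label)] = linear_fit_r2(size[mask], effort[mask])

print("pooled R^2:", round(float(pooled_r2), 3))
for label, r2 in cluster_r2.items():
    print(label, "R^2:", round(float(r2), 3))
```

In this constructed example the two project types follow clearly different slopes, so the pooled fit is poor while each cluster fits well. Real data will rarely separate this cleanly.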

Important Lesson: This application illustrates one principle: estimation models predict better within their own families. Each estimation model represents a narrow world, inside which it operates best. There is no universal estimation model. Hence, even if we have just a few data points, it is better to build our own estimation model, one for each family.

Statistical Process Control provides a way of handling the increasing complexity of software engineering. In this preprint the statistical basics were introduced and an example was provided to show how this approach is applied in practice. Using it profitably requires experience with the approach. With the requisite experience, it is a very powerful tool for controlling the software processes currently under development and also for planning future projects. This means that the overall effort decreases while the quality increases.


5 References

[Abreu 1995] Abreu, F. B.; Gonlao, M.; Esteves, R.: Towards the Design Quality Evaluation in Object-Oriented

Software Systems. Proc. of the 5ICSQ, October 24-26, Austin, Texas, 1995, pp. 44-57

[Basili 1986] Basili, V. R.; Selby, R. W.; Hutchens, D. H.: Experimentation in Software Engineering. IEEE

Transactions on Software Engineering, 12(1986)7, pp. 733-743

[Card 2000] Card, D. N.: Making Measurement Understandable. IEEE Software, January/February 2000, pp. 95-

96

[Cole 1993] Cole, R. J.; Woods, D.: Measurement Through the Software Lifecycle: A Comparative Case Study.

Proc. of the 10th Annual Conference on Application of Software Metrics and Quality Assurance in

Industry, Amsterdam, Netherlands, September 1993, Section 19

[Dumke 2001] Dumke, R.; Abran, A.: Current Trends in Software Measurement. Proc. of the IWSM2001,

Montreal, August 2001, Shaker Publ., 2001

[Dumke 2002] Dumke, R.; Abran, A.; Bundschuh, M.; Symons, C.: Software Measurement and Estimation.

Proc. of the IWSM2002, Magdeburg, October 2002, Shaker Publ., 2002

[Dumke 1999] Dumke, R.; Foltin, E.: An Object-Oriented Software Measurement and Evaluation Framework.

Proc. of the FESMA, October 4-8, 1999, Amsterdam, pp. 59-68

[Dumke 1996] Dumke, R.; Foltin, E.; Koeppe, R.; Winkler, A.: Softwarequalität durch Meßtools – Assessment,

Messung und instrumentierte ISO 9000. Vieweg Publ., Braunschweig, Germany, 1996

[Dumke 2003] Dumke, R.; Lother, M.: Softwarequalitätsmanagement (SQM). Vorlesungsskript. Otto-von-

Guericke-Universität Magdeburg, http://ivs.cs.uni-magdeburg.de/sw-eng/agruppe/ lehre/swt2.shtml

[Ebert 1993] Ebert, C.: Complexity Traces – an Instrument for Software Project Management. Proc. of the 10th

Annual Conference on Application of Software Metrics and Quality Assurance in Industry, Amsterdam,

Netherlands, September 1993, Section 17

[Eickelmann 2000] Eickelmann, N.: Integrating the Balanced Scorecard and Software Measurement

Frameworks. Proc. of the IRMA 2000, Anchorage, Alaska, May 2000, pp. 980-983

[Endres 2003] Endres, Albert; Rombach, D.: A Handbook of Software and System Engineering. Pearson

Education Limited, 2003

[Fehrling 2003] Fehrling, N.: Softwaremetriken im Umfeld der Automobilindustrie. In: Büren et al.: Software-

Messung in der Praxis. Tagungsband der MetriKon 2003, November 2003, Ulm, Shaker-Verlag, 2003,

pp. 163-164

[Feiler 1993] Feiler, P. H.; Humphrey, W. S.: Software Process Development and Enactment: Concepts and

Definitions. Proc. of the 2nd Int. Conference on Software Process, Los Altimos, 1993, pp. 28-40

[Fenton 1997] Fenton, N. E.; Pfleeger, S. L.: Software Metrics – A Rigorous and Practical Approach. Thomson

Publ., 1997

[Ferguson 1998] Ferguson, J.; Sheard, S.: Leveraging Your CMM Efforts for IEEE/EIA 12207. IEEE Software,

September/October 1998, pp. 23-28

[Henderson 1996] Henderson-Sellers, B.: The Mathematical Validity of Software Metrics. Software Engineering

Notes, 21(1996)5, pp. 89-94

[Jacquet 1997] Jacquet, J.; Abran, A.: From Software Metrics to Software Measurement Methods: A Process

Model. Proc. of the ISESS, 1997

[Juristo 2003] Juristo, N.; Moreno, A. M.: Basics of Software Engineering Experimentation. Kluwer Academic

Publishers, Boston, 2003


[Kitchenham 1995] Kitchenham, B., Pfleeger, S. L.; Fenton, N.: Towards a Framework for Software

Measurement Validation. IEEE Transactions on Software Engineering, 21(1995)12, pp. 929-944

[Kitchenham 1997] Kitchenham et al.: Evaluation and assessment in software engineering. Information and

Software Technology, 39(1997), pp. 731-734

[Kulpa 2003] Kulpa, M. K.; Johnson, K. A.: Interpreting the CMMI – A Process Improvement Approach. CRC

Press Company, 2003

[Munson 2003] Munson, J. C.: Software Engineering Measurement. CRC Press Company, Boca Raton London

New York, 2003

[Pandian 2004] Pandian, C. R.: Software Metrics – A Guide to Planning, Analysis, and Application. CRC Press

Company, 2004

[Putnam 2003] Putnam, L. H.; Myers, W.: Five Core Metrics – The Intelligence Behind Successful Software

Management. Dorset House Publishing, New York, 2003

[SEI 2002] SEI: Capability Maturity Model Integration (CMMISM), Version 1.1, Software Engineering Institute,

Pittsburgh, March 2002, CMMI-SE/SW/IPPD/SS, V1.1

[Singpurwalla 1999] Singpurwalla, N. D.; Wilson, S. P.: Statistical Methods in Software Engineering. Springer

Publ., 1999

[Solingen 1999] Solingen, v. R.; Berghout, E.: The Goal/Question/Metric Method. McGraw Hill Publ., 1999

[Wohlin 2000] Wohlin, C, Runeson, P, Höst, M, Ohlsson, M, Regnell, B, Wesslén, A.: Experimentation in

Software Engineering: An Introduction. Kluwer Academic Publishers, Boston, 2000

[Zelkowitz 1997] Zelkowitz, M. V.; Wallace, D. R.: Experimental Models for Validating Technology. IEEE

Computer, May 1998, pp. 23-31

[Zuse 1998] Zuse, H.: A Framework of Software Measurement. De Gruyter Publ., Berlin New York, 1998

[Zuse 2003] Zuse, H.: What can Practioneers learn from Measurement Theory. In Dumke et al.: Investigations

in Software Measurement, Proc. of the IWSM 2003, Montreal, September 2003, pp. 175-176
