
2014 IEEE International Conference on Software Testing, Verification, and Validation Workshops

Weighted Execution Profiles for Software Testing

Joan Farjo and Wes Masri


Department of Electrical and Computer Engineering
American University of Beirut
Beirut, Lebanon
{jmf09, wm13} @aub.edu.lb

Abstract— Existing execution profiles comprise information about the execution of program elements such as statements, branches, and def-use pairs. They lack any information about which elements are potentially more relevant to failure than others; such information could be leveraged to compute dissimilarity metrics that discriminate better between failing and passing profiles, which is beneficial to software testing. In this work, we explore three heuristics from which we derive weights to be associated with the covered program elements, such that higher weights indicate more potential relevance to failure. We evaluate our heuristics by measuring their impact on an established test suite minimization technique. Our experiments showed somewhat promising results.

Keywords— Execution profiles, software testing, test suite minimization.

I. INTRODUCTION

Many software testing techniques are based on the assumption that failing runs induce execution profiles that are dissimilar from those induced by passing runs. Existing profiles typically comprise information about the execution of program elements such as statements, branches, and def-use pairs. These profiles lack any information about which elements are potentially more relevant to failure than others; such information could be leveraged to compute dissimilarity metrics that discriminate better between failing and passing profiles.

In this work, we explore three heuristics from which we derive weights to be associated with the covered program elements, such that higher weights indicate more potential relevance to failure. The first heuristic conjectures that an element that is not covered by many tests should be assigned a high weight, since in test suites of released software the number of failing test cases is typically much smaller than the number of passing test cases. The second heuristic is similar to the first but adopts a fuzzy logic [9] approach, applying a subjective membership function to assign high weights to rarely occurring profiling elements. The third is based on the premise that an element covered by test cases that are very dissimilar from others (outliers) should also be assigned a high weight, since failing test cases are likely to be outliers [8]. Finally, since we aim at improving software testing, our heuristics cannot leverage the (non-existent) pass/fail information as many fault localization techniques do.

We evaluated our heuristics by measuring their impact on an established test suite minimization technique [8].

The rest of this paper is organized as follows. Section II describes the test suite minimization technique that we use in computing one of our heuristics and in our evaluation. Section III presents our heuristics and weight computations. Section IV presents our experimental results. Finally, Section V presents conclusions and future work.

II. CLUSTER-BASED TEST SUITE MINIMIZATION

Given a program P and a test suite T, the aim is to construct T', a subset of T, such that T' is as small as possible and as effective as possible in revealing the defects in P. The minimization proceeds as follows [8]: test cases in T are automatically clustered based on the similarity of their execution profiles comprising statements, branches, and def-use pairs. Then T' is built by randomly selecting one test from each cluster. The goal here is for T' to cover, as much as possible, the behaviors induced by T. Note that the size of T' is dictated by the chosen number of clusters, i.e., by the user.
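To make the one-per-cluster scheme concrete, the following minimal Python sketch clusters test cases by profile similarity and samples one test per cluster. The profile matrix layout, the use of scikit-learn's KMeans, and all identifiers are illustrative assumptions, not the authors' implementation.

    import numpy as np
    from sklearn.cluster import KMeans

    def minimize_suite(profiles, k, rng=None):
        """Cluster tests by profile similarity; pick one test per cluster.

        profiles: (|T| x n_elements) array; row i is the profile of test i.
        k: desired size of the reduced suite (number of clusters).
        Returns the indices of the selected tests (the reduced suite T').
        """
        rng = rng or np.random.default_rng()
        labels = KMeans(n_clusters=k, n_init=10).fit_predict(profiles)
        selected = []
        for c in range(k):
            members = np.flatnonzero(labels == c)
            if members.size:                  # skip empty clusters, if any
                selected.append(rng.choice(members))
        return selected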
III. HEURISTICS AND WEIGHTS

A. Heuristic1

This approach aims at giving more weight to the profiling elements that are not covered by many test cases, since the number of failing tests is typically much smaller than the number of passing tests. The steps for computing this heuristic follow.

Step1: Count how many tests in test suite T cover a given profiling element.

Step2: Assign each profiling element pei a weight defined as:

    weight(pei) = 1 - count(pei) / |T|

where count(pei) is the count obtained in Step1. If pei never occurred in any of the test cases, then weight(pei) is set to 0.
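A minimal sketch of this computation, assuming profiles are stored as a test-by-element matrix of coverage counts; the names are illustrative.

    import numpy as np

    def heuristic1_weights(profiles):
        """Heuristic1: weight(pei) = 1 - count(pei)/|T|; 0 if never covered."""
        covered = np.asarray(profiles) > 0         # which tests cover each element
        count = covered.sum(axis=0)                # count(pei) per profiling element
        weights = 1.0 - count / covered.shape[0]   # rare elements get high weight
        weights[count == 0] = 0.0                  # never-covered elements get weight 0
        return weights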
B. Heuristic2

This heuristic extends Heuristic1 by using a fuzzy membership function when computing the weights. The steps are as follows.

Step1: Count how many tests in test suite T cover a given profiling element pei. The result is stored in count(pei).

Step2: Calculate for every pei the percentage of test cases that covered it, to be used in the membership function described below and shown in Figure 1, to obtain its degree of membership, which will serve as its weight:

    weight(pei) = µ(tcc)

where µ(tcc) is the degree of membership and tcc is the percentage of test cases that covered pei:

    tcc = count(pei) / |T|

The membership function assigns full degree to profiling elements with tcc less than or equal to 20%. As tcc increases above 20%, the degree of membership decreases linearly until reaching 60%, after which the membership function decreases more steeply, giving even less degree to the profiling elements with tcc less than 90%. Any profiling element with tcc equal to or greater than 90% is assigned zero degree of membership. Moreover, a profiling element that is not covered by any test case is assigned a zero degree of membership.

[Figure 1 - Membership function: µ(tcc) plotted against tcc (%); µ = 1 for tcc ≤ 20%, decreasing linearly up to tcc = 60%, dropping more steeply thereafter, and µ = 0 for tcc ≥ 90%.]
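A sketch of this weighting follows, assuming the piecewise-linear shape of Figure 1. The membership degree at the 60% knee (0.6 below) is a value we read off the plot, not one stated in the text, and all identifiers are illustrative.

    import numpy as np

    def membership(tcc, knee=0.6):
        """Piecewise-linear fuzzy membership over tcc in [0, 1] (Figure 1).

        Full membership up to 20% coverage, a linear decrease to `knee`
        at 60%, a steeper decrease to 0 at 90%, and 0 beyond. The `knee`
        value is an assumption read off the plot.
        """
        if tcc <= 0.20:
            return 1.0
        if tcc <= 0.60:
            return 1.0 - (1.0 - knee) * (tcc - 0.20) / 0.40
        if tcc < 0.90:
            return knee * (0.90 - tcc) / 0.30
        return 0.0

    def heuristic2_weights(profiles):
        covered = np.asarray(profiles) > 0
        count = covered.sum(axis=0)
        tcc = count / covered.shape[0]            # fraction of tests covering pei
        weights = np.array([membership(t) for t in tcc])
        weights[count == 0] = 0.0                 # uncovered elements get weight 0
        return weights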
C. Heuristic3

This heuristic aims at giving more weight to the profiling elements covered by test cases that are very dissimilar from others, since failing test cases are likely to be outliers.

Step1: Using the original profiles, perform K-means clustering with the number of clusters chosen to be equal to ceiling(|T|/50).

Step2: Assign to each test case tj a score score(tj) equal to the number of test cases present in the cluster it belongs to; e.g., an outlier will have a small score.

Step3: Count how many times each profiling element occurs when running the entire test suite T; i.e., compute count(pei).

Step4: Assign to each profiling element a weight equal to the average of the scores of the test cases in which it is exercised:

    weight'(pei) = Σj score(tj) / count(pei)

where tj is a test case that exercised pei, and score(tj) is the score of a test case obtained in Step2.

Step5: Normalize the scores in Step4 using min-max normalization; then the final weight assigned to pei becomes:

    weight(pei) = 1 - minmax(weight'(pei))

But if pei is not exercised by any test case, then it is assigned a weight of 0.
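The following sketch ties the five steps together, again under the assumption of a test-by-element count matrix; the scikit-learn KMeans call and all names are our own illustrative choices.

    import numpy as np
    from sklearn.cluster import KMeans

    def heuristic3_weights(profiles):
        """Heuristic3: elements exercised mainly by outlier tests get high weight."""
        X = np.asarray(profiles, dtype=float)
        k = int(np.ceil(X.shape[0] / 50))            # Step1: ceiling(|T|/50) clusters
        labels = KMeans(n_clusters=k, n_init=10).fit_predict(X)
        score = np.bincount(labels, minlength=k)[labels]  # Step2: outliers score low

        covered = X > 0
        count = covered.sum(axis=0)                  # Step3: count(pei)

        # Step4: average score of the tests exercising each element
        sums = covered.T @ score
        w = np.divide(sums, count,
                      out=np.zeros_like(sums, dtype=float), where=count > 0)

        # Step5: min-max normalize, then invert; uncovered elements keep weight 0
        exercised = count > 0
        lo, hi = w[exercised].min(), w[exercised].max()
        weights = np.zeros_like(w, dtype=float)
        if hi > lo:
            weights[exercised] = 1.0 - (w[exercised] - lo) / (hi - lo)
        return weights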

IV. EXPERIMENTAL RESULTS

We conducted experiments to evaluate our heuristics in the context of the test suite minimization technique described in Section II. The subject programs we used were: 1) tot_info, schedule, print_tokens2, space, flex2, and sed3 from the SIR repository [1]; and 2) Tomcat3.0, Tomcat3.2.1, Jigsaw, and JTidy, which we used in previous work [7][8][1]. Our experiments were divided into two parts, one involving a single-fault version of each of the above 10 programs, and the other involving multiple-fault versions of these programs. Table 1 lists the 10 single-fault versions with their respective test suite sizes. Table 2 lists the 10 multiple-fault versions with their respective test suite sizes and numbers of faults. The tables also show that 3 programs were written in C, and the remaining ones in Java. Note that we have previously converted the Siemens programs to Java.

Table 1. Subjects with single fault
Programs            |T|
print_tokens2_v1    1026
tot_info_v1         942
schedule_v2         1114
space_v5 (C)        760
Tomcat3.0           478
Tomcat3.2.1         491
Jigsaw              511
JTidy_v1            301
flex2 (C)           392
sed3_v1 (C)         201

Table 2. Subjects with multiple faults
Programs        # of faults    |T|
print_tokens2   7              1894
tot_info        6              833
schedule        3              1032
space (C)       5              2000
Tomcat3.0       4              485
Tomcat3.2.1     3              497
Jigsaw          4              516
JTidy           3              864
flex2 (C)       3              395
sed3 (C)        3              169

The subject programs and their respective test suites were modified as follows:
1) In some cases of the multiple-fault versions, failing test cases caused by more than one bug were discarded; only passing test cases and failing ones caused by a single bug were kept, to eliminate confusion about which bug triggered the failure of a test case.
2) Test cases that exercised the bug but did not fail were discarded; i.e., coincidental correctness was mitigated [4][5]. This was done to eliminate external factors that might affect our results.
3) The test suites were trimmed to contain around 5% failures while keeping the same ratio of failing test cases caused by each bug. The goal here is to achieve a behavior that is more realistic.

The Java programs were instrumented and profiled using a profiler that we developed in previous work [8]. The resulting profiles contained the frequency of occurrence of statements, branches, and def-use pairs. The C programs, on the other hand, were profiled using GCov to collect statement profiles only.

Our three heuristics were then applied to generate the respective weighted profiles. K-means clustering was repeatedly performed on the unweighted and weighted profiles of each program, while varying the number of clusters [2]. The Squared Euclidean distance measure was used as a dissimilarity metric, the k initial centroids were chosen at random from the data set, and whenever a cluster lost its entire member observations, a new cluster was created consisting of the one point furthest from its centroid.

The one-per-cluster sampling technique was used to select one test case from each cluster and add it to the reduced test suite. The reduced test suite was then checked for the percentage of defects it covered. The number of clusters was varied until 100 percent defect coverage was achieved.
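A hedged sketch of this evaluation loop follows. Applying the heuristic weights by element-wise scaling of the profiles, and growing the cluster count in fixed steps, are our assumptions about details the text leaves open; every identifier is illustrative.

    import numpy as np
    from sklearn.cluster import KMeans

    def minimized_size(profiles, fails, weights=None, step=10):
        """Grow k until the one-per-cluster reduced suite reveals every fault.

        fails[i] is the fault exposed by test i, or None if it passes.
        """
        X = np.asarray(profiles, dtype=float)
        if weights is not None:
            X = X * weights                        # assumed: element-wise weighting
        faults = {f for f in fails if f is not None}
        rng = np.random.default_rng()
        for k in range(step, len(X) + 1, step):
            labels = KMeans(n_clusters=k, n_init=10).fit_predict(X)
            picks = [rng.choice(np.flatnonzero(labels == c)) for c in set(labels)]
            revealed = {fails[i] for i in picks if fails[i] is not None}
            if revealed == faults:
                return k                           # 100% defect coverage reached
        return len(X)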
Table 3 shows: a) the sizes of the minimized sets after applying each of our heuristics on the single-fault subjects; b) the sizes of the minimized sets without applying any of our heuristics (column 5); and c) the score assessing applying the heuristics against the unweighted profiles. The score comprises 3 numbers: the number of times the heuristic performed better, equally well, and worse. Table 4 shows the same information as Table 3, but for the multiple-fault subjects.

Table 3. Minimization Results for single-fault subjects
Programs           |T| Heuristic 1   |T| Heuristic 2   |T| Heuristic 3   |T| unweighted
print_tokens2_v1   8                 10                9                 10
tot_info_v1        20                30                10                60
schedule_v2        9                 10                100               10
space_v5           150               200               200               200
Tomcat3.0          70                30                20                60
Tomcat3.2.1        40                40                20                40
Jigsaw             30                30                20                20
JTidy_v1           200               200               250               150
flex2              20                30                10                20
sed3_v1            70                90                50                60
SCORE              (4, 2, 4)         (2, 4, 4)         (6, 2, 2)

Table 4. Minimization Results for multiple-fault subjects
Programs        |T| Heuristic 1   |T| Heuristic 2   |T| Heuristic 3   |T| unweighted
print_tokens2   300               300               300               350
tot_info        250               100               300               200
schedule        450               350               350               450
space           450               450               400               350
Tomcat3.0       70                90                90                90
Tomcat3.2.1     100               150               100               100
Jigsaw          350               400               250               350
JTidy           70                70                30                100
flex2           80                90                360               150
sed3            70                80                40                70
SCORE           (4, 4, 2)         (5, 1, 4)         (5, 2, 3)

The score for Heuristic 3 in Table 3 is (6, 2, 2), suggesting a relatively good improvement. But a score of (4, 2, 4) for Heuristic 1 suggests no measurable improvement, and the score of (2, 4, 4) for Heuristic 2 is clearly not promising. Looking at the results in Table 4, the scores for all 3 heuristics suggest an improvement. And since Heuristic 3 performed well in both the single-fault and multiple-fault experiments, we can argue that it is the better of the 3. Finally, given that the multiple-fault experiments are more realistic, and despite the limited size of our experiments, we can claim that our heuristics can potentially improve test suite minimization and possibly other techniques that rely on execution profiles.

V. CONCLUSIONS AND FUTURE WORK

We presented three heuristics to improve execution profiles. The first heuristic conjectures that an element that is not covered by many tests should be assigned a high weight. The second is similar to the first but adopts a fuzzy logic approach. The third is based on the premise that an element covered by test cases that are very dissimilar from others (outliers) should also be assigned a high weight.

We empirically evaluated our heuristics and observed that the third exhibited the most positive impact on the execution profiles. Even though our results were not positive in all cases, they were promising enough to motivate us to further investigate more heuristics.

ACKNOWLEDGMENT

This research was supported in part by the Lebanese National Council for Scientific Research.

REFERENCES

[1] Hyunsook Do, Sebastian G. Elbaum, and Gregg Rothermel. Supporting Controlled Experimentation with Testing Techniques: An Infrastructure and its Potential Impact. Empirical Software Engineering 10(4): 405-435, 2005.
[2] Joan Farjo, Wes Masri, and Hazem Hajj. Isolating Failing Test Cases: A Comparative Experimental Study of Clustering Techniques. The 3rd Int'l Conference on Communications and Information Technology, ICCIT 2013, Lebanon, June 2013.
[3] Masri, W. Exploiting the Empirical Characteristics of Program Dependences for Improved Forward Computation of Dynamic Slices. Empirical Software Engineering (ESE) (Springer), 13:369-399, 2008.
[4] Masri, W. and Abou-Assi, R. Cleansing Test Suites from Coincidental Correctness to Enhance Fault-Localization. Third International Conference on Software Testing, Verification and Validation, ICST 2010, Paris, France, April 2010.
[5] Masri, W. and Abou Assi, R. Prevalence of Coincidental Correctness and Mitigation of its Impact on Fault-Localization. ACM Transactions on Software Engineering and Methodology (TOSEM), to appear.
[6] Masri, W., Abou Assi, R., and El-Ghali, M. Generating Profile-Based Signatures for Online Intrusion and Failure Detection. Information and Software Technology (IST) (Elsevier).
[7] Masri, W. and Podgurski, A. Application-Based Anomaly Intrusion Detection with Dynamic Information Flow Analysis. Computers & Security (Elsevier), Vol. 27, pages 176-187, 2008.
[8] Masri, W., Podgurski, A., and Leon, D. An Empirical Study of Test Case Filtering Techniques Based On Exercising Information Flows. IEEE Transactions on Software Engineering, vol. 33, no. 7, page 454, July 2007.
[9] Zadeh, L. A. Fuzzy Sets. Information and Control, vol. 8, pp. 338-353, 1965.

