
10th IEEE International Conference on Software Testing, Verification and Validation Workshops

Poster: BDTest, a System to Test Big Data Frameworks
Alexandre Langeois∗† , Eduardo Cunha de Almeida‡ , Anthony Ventresque∗
∗ Lero@UCD, School of Computer Science, University College Dublin, Ireland Email: anthony.ventresque@ucd.ie
† Ecole Centrale de Nantes, France. Email: alexandre.langeois@eleves.ec-nantes.fr
‡ C3SL Labs, UFPR, Brazil. Email: eduardo@inf.ufpr.br

Abstract—Testing Big Data Processing systems is a challenging task as these systems are usually distributed on various virtual machines (potentially hosted by remote servers). In this poster¹ we present a platform for testing non-functional properties of Big Data frameworks and a first implementation with Hadoop, a well-known big data management and processing platform.

¹ This work was supported by Science Foundation Ireland grant 13/RC/2094.

978-1-5090-6676-6/17 $31.00 © 2017 IEEE. DOI 10.1109/ICSTW.2017.75

I. INTRODUCTION

“Big Data” has become a reality in the past years, with enormous amounts of data generated by humans (e.g., Social Networks) and all sorts of sensors (e.g., Internet of Things). For instance, it is estimated that we create 2.5 quintillion bytes of data every day and that 90% of all data has been created in the last two years [1]. While managing and storing large amounts of data has been a focus of interest, processing it is challenging as it is skill and labour intensive [2]. MapReduce [3] and similar programming models have been proposed to simplify the ingestion, storage and processing of large amounts of data on distributed architectures. The Apache Hadoop [4] implementation of MapReduce is extensively used by companies such as eBay, Facebook or Google [5].

The quality of such software systems is difficult to assess though, as testing any distributed system is a complex challenge. Issues in a distributed software system can come from various elements: software or hardware components, their integration, or the communication between them [6]. Another risk is to introduce artificial defects or to impact the performance of the system with the monitoring apparatus.

In this poster we present BDTest, a system that aims at testing Big Data platforms such as Hadoop. BDTest uses a lower tester architecture [7], which gives indirect control and information about the system under test (SUT) via the underlying services and has a limited impact on the SUT, while offering the possibility to run complex test scenarios. While BDTest could test functional properties, we focus on non-functional ones, for instance by artificially changing the performance or availability of computing nodes in the SUT.

We plan to address in our future work a variety of research questions, such as:
• Can we test the performance of the partitioning and load balancing mechanisms of MapReduce implementations?
• Can we discover performance bottlenecks using BDTest and, for instance, model-based testing techniques?
• Can we identify portability issues using our testing system and heterogeneous platforms?
• To what extent can BDTest be extended to other Big Data frameworks?

II. RELATED WORK

The increased interest in and use of Big Data platforms like Hadoop has led to a lot of research and the development of testing techniques and tools. One of the most interesting attempts is MRUnit [8], a Java framework that aims at creating JUnit-like [9] test cases for MapReduce jobs. However, this tool requires introducing specific calls to its API in the SUT, which can generate new bugs or decrease the performance of the SUT. BDTest tries to avoid any modification of the SUT.

Testing distributed systems has been studied extensively in the past. GridUnit [10], [11] proposed a mixed model: a centralised architecture with the execution of tests at the node level. The main issue with GridUnit is that it does not handle the volatility of computing nodes, even though the fact that nodes can join or leave the system dynamically is an important element of distributed systems. In the context of Big Data processing, volatility corresponds to workers being added/assigned or failing/being decommissioned [12]. P2PTester [13] and Pigeon [14] proposed to use a fully distributed architecture but, like MRUnit, require modifying the code of the SUT.

PeerUnit [15], [7] is a project that inspired BDTest. It can create complex tests for distributed (peer-to-peer) systems without modifying the SUT. Test scenarios in PeerUnit are synchronised over all the computing nodes using either a centralised or a decentralised architecture [16]. BDTest follows the general ideas proposed by PeerUnit, but extends PeerUnit in two original directions: BDTest can (i) manage parallel executions (typical of Big Data tasks) and (ii) run on heterogeneous nodes. HadoopTest [17] is a framework based on PeerUnit created to test Hadoop. The framework uses two kinds of testers, master and worker, but only with a centralised testing architecture. HadoopTest focuses on functional properties, while BDTest aims at addressing non-functional concerns, such as scalability or performance, by modifying the properties of each computing node at a finer grain.

III. MAPREDUCE

MapReduce is a paradigm designed to help developers run data- and process-intensive jobs on large-scale distributed systems. The idea behind MapReduce is that every job is decomposed into two simple units of work, map and reduce, designed by the user. All the rest (distribution, replication, coordination, etc.) is managed by the framework. The data is divided into blocks, stored in various nodes of the system, and the Map function is applied on these blocks to produce intermediate results, which are grouped and distributed among 'reducers'. Those reducers combine the results and compute the final output. MapReduce uses two types of nodes: the master, which assigns and coordinates tasks, and the slaves, which run the Map or Reduce functions when requested.

IV. BDTEST

BDTest is developed in Java and currently has an interface with Hadoop only; we will provide interfaces for other Big Data components and frameworks in the future, towards a full library of Big Data testing tools. In Big Data frameworks “many subtle and hard to diagnose errors happen due to misconfiguration across layers and nodes” [18]. BDTest can detect those errors but is not meant to track down their root causes. Our goal is to reproduce them and see how the SUT behaves.

Each node (i.e., master or slave) is assigned one tester, which can monitor and instrument both the underlying layers (e.g., Operating System, network) and the node itself. The testers can for instance limit the resource utilisation of the OS (e.g., cap the CPU or the memory) and run part of global scenarios (e.g., stress the load balancing algorithm in Hadoop/HDFS). BDTest manages four main resources, individually for each node: the disk I/O (a critical point when you need to read and write large files), the CPU and memory assigned to processes, and the network bandwidth. It is important to note that BDTest does not 'really' change the resources of the machines, but only limits the resources available to the Big Data framework processes during the tests. BDTest also provides a coordinator agent, which dispatches the test case actions to the relevant testers and manages the set of testers, in particular by synchronising their actions.

BDTest can run in either of two modes: centralised or decentralised. In the former, BDTest has only one coordinator, which sends actions to all the testers and is responsible for the parallel execution of the tests. In the decentralised mode, on the other hand, all testers belong to a 'hierarchical testing tree' where testers are also coordinators of their descendants and report to their parents. The root node kicks off the test cases and communicates with its children, which in turn forward the actions to their own children, and so on down to the leaves of the tree. This decentralised architecture scales better than the centralised one and limits the communication overhead.

Validating global properties (a value threshold, for instance) is one of the key challenges in testing distributed systems [19]. BDTest makes sure that each tester manages its own data, which are later aggregated into a global view (a test oracle) by the coordinator (centralised architecture) or the root node (decentralised architecture). BDTest implements assertions and value comparisons as they are simple testing methods [20], [21]. Log file (offline) analysis is not yet implemented in BDTest but this is a potential direction of research.

V. TEST CASE SCENARIOS AND GENERALISATION

BDTest can be used to stress-test the Hadoop/HDFS load balancer, without increasing or decreasing the load but just by simulating a change of the available resources inside a scenario and creating new constraints (e.g., emulating other processes in the host systems). We can see with this method how the SUT will respond to a joint decrease of 50% in CPU and bandwidth on some nodes, and whether the performance remains “acceptable”, i.e., the processing time is still under a certain limit.

Another significant testing opportunity consists in fault injection. Our aim is not to determine the root cause, but to stress the SUT with a predefined 'unexpected' error. This way, we can test robustness and replication (on Hadoop/HDFS) and generate specific errors: what if the three nodes containing the replicas of a block of data fail at the same time? We can also see how the load balancer behaves if a single node leaves the cluster, what the consequences are if this node was working on a task, and how long the system needs to detect that the node has left.

Several applications of the Hadoop 'galaxy' run on top of Hadoop, such as Hive [22] or HBase [23]; similarly, Apache Spark can replace MapReduce in Hadoop and be used with HDFS. We believe this interoperability between Big Data components requires a versatile testing framework. BDTest has a limited interaction with the exact implementation and manages underlying layers (OS and network), which is why we hope to be able to extend our framework easily to other Big Data platforms.

VI. TAKE AWAY MESSAGE

This poster presents BDTest, a system to test Big Data platforms, such as Hadoop. BDTest instruments and profiles the various resources (e.g., network, CPU) of the computing nodes and can run scenarios to test properties of Big Data systems such as load balancing, performance, portability or compatibility. Software testers using BDTest create test scenarios using simple unit tests (assertions and value comparisons) and all sorts of defects, which the testing framework distributes to every computing node (tester) in the system.

We now want to run large-scale tests of Hadoop (and HDFS) and we plan to report our findings in the near future. We also want to extend BDTest to other Big Data modules and platforms of the Hadoop galaxy and more recent MapReduce implementations, such as Apache Spark [24] or Apache Storm [25]. We would also like to evaluate the performance of Big Data platforms with different architectures (i.e., centralised or decentralised) and see whether BDTest can be used for benchmarking as well as for testing.
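As an illustration of the decentralised mode described in Section IV, the following Java sketch models a hierarchical testing tree in which each tester runs an action locally, forwards it to its children, and returns the measurements of its subtree so that the root can evaluate a global threshold property. It is only a minimal sketch under stated assumptions: `TesterNode`, `dispatch` and `allWithin` are hypothetical names introduced here and are not BDTest's actual API, the calls are local and sequential rather than remote and parallel, and the processing times are simulated.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.ToDoubleFunction;

/** Hypothetical tester in a hierarchical testing tree (illustration only, not BDTest's API). */
final class TesterNode {
    private final String name;
    private final List<TesterNode> children = new ArrayList<>();

    TesterNode(String name) {
        this.name = name;
    }

    /** Registers a child tester that this node coordinates; returns this node for chaining. */
    TesterNode addChild(TesterNode child) {
        children.add(child);
        return this;
    }

    /**
     * Runs the action on this node, forwards it to every descendant,
     * and returns the local measurements of the whole subtree.
     */
    List<Double> dispatch(ToDoubleFunction<String> action) {
        List<Double> measurements = new ArrayList<>();
        measurements.add(action.applyAsDouble(name));
        for (TesterNode child : children) {
            measurements.addAll(child.dispatch(action));
        }
        return measurements;
    }

    /** Global verdict: passes only if every local measurement is at most the threshold. */
    static boolean allWithin(List<Double> measurements, double threshold) {
        return measurements.stream().allMatch(value -> value <= threshold);
    }

    public static void main(String[] args) {
        // Simulated processing times in seconds (made-up values, assumption of this sketch).
        Map<String, Double> simulatedSeconds = Map.of(
                "root", 12.0, "worker-a", 15.5, "worker-a1", 31.0, "worker-b", 14.2);

        TesterNode workerA = new TesterNode("worker-a").addChild(new TesterNode("worker-a1"));
        TesterNode root = new TesterNode("root")
                .addChild(workerA)
                .addChild(new TesterNode("worker-b"));

        List<Double> measured = root.dispatch(simulatedSeconds::get);
        System.out.println("measurements: " + measured);
        System.out.println("all under 20 s: " + allWithin(measured, 20.0));
    }
}
```

In this sketch the root only talks to its direct children, which is the property that limits the communication overhead in the decentralised mode; a centralised coordinator would correspond to a tree of depth one.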

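The division of work described in Section III can also be sketched in a few lines of plain Java. The example below is a single-process word count that deliberately does not use the Hadoop API: the user supplies `map` and `reduce`, and the `run` method stands in for the framework by applying `map` to each block, grouping the intermediate pairs by key, and calling `reduce` once per key. The class and method names are assumptions made for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Single-process illustration of the MapReduce model (not the Hadoop API). */
final class WordCountSketch {

    /** User-defined map: turns one block of text into (word, 1) pairs. */
    static List<Map.Entry<String, Integer>> map(String block) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : block.trim().toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(Map.entry(word, 1));
            }
        }
        return pairs;
    }

    /** User-defined reduce: sums all the counts emitted for one word. */
    static int reduce(String word, List<Integer> counts) {
        int total = 0;
        for (int count : counts) {
            total += count;
        }
        return total;
    }

    /** Stand-in for the framework: map every block, group by key, reduce every group. */
    static Map<String, Integer> run(List<String> blocks) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String block : blocks) {
            for (Map.Entry<String, Integer> pair : map(block)) {
                grouped.computeIfAbsent(pair.getKey(), key -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        Map<String, Integer> output = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            output.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return output;
    }

    public static void main(String[] args) {
        // Each string plays the role of one data block stored on a different node.
        System.out.println(run(List.of("big data testing", "testing big data frameworks")));
    }
}
```

In a real deployment the blocks, the grouping step and the reducers are spread over many nodes, which is exactly the part a system such as BDTest targets when it limits the resources of individual nodes.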
REFERENCES
[1] “What is big data?” https://www-01.ibm.com/software/data/bigdata/what-is-big-data.html.
[2] M. Stonebraker, S. Madden, D. J. Abadi, S. Harizopoulos, N. Hachem,
and P. Helland, “The end of an architectural era: (it’s time for a complete
rewrite),” in VLDB, 2007, pp. 1150–1160.
[3] J. Dean and S. Ghemawat, “MapReduce: Simplified data processing on
large clusters,” OSDI, pp. 137–150, 2004.
[4] The Apache Software Foundation, “Apache Hadoop,” 2017,
http://hadoop.apache.org/.
[5] ——, “Who use Hadoop,” 2017, https://wiki.apache.org/hadoop/PoweredBy.
[6] X. Bai, M. Li, B. Chen, W.-T. Tsai, and J. Gao, “Cloud Testing Tools,”
SOSE, 2011.
[7] E. C. de Almeida, “Testing and validation of peer-to-peer systems,”
Ph.D. dissertation, University of Nantes, 2009.
[8] The Apache Software Foundation, “Apache MRUnit,” 2016,
https://mrunit.apache.org/.
[9] “JUnit,” 2017, http://junit.org/junit4/.
[10] A. Duarte, W. Cirne, F. Brasileiro, and P. Machado, “Using the computational grid to speed up software testing,” BSSE, 2005.
[11] ——, “GridUnit: software testing on the grid,” ICSE, pp. 779–782, 2006.
[12] D. Warneke and O. Kao, “Exploiting dynamic resource allocation for
efficient parallel data processing in the cloud,” IEEE transactions on
parallel and distributed systems, vol. 22, no. 6, pp. 985–997, 2011.
[13] F. Dragan, B. Butnaru, I. Manolescu, G. Gardarin, N. Preda, B. Nguyen,
R. Pop, and L. Yeh, “P2PTester: a tool for measuring P2P platform
performance,” ICDE, 2007.
[14] Z. Zhou, H. Wang, J. Zhou, L. Tang, K. Li, W. Zheng, and M. Fang,
“Pigeon: a framework for testing peer-to-peer massively multiplayer
online games over heterogeneous network,” CCNC, 2006.
[15] “PeerUnit,” 2011, http://peerunit.gforge.inria.fr/.
[16] E. C. de Almeida, J. E. Marynowski, G. Sunye, Y. L. Traon, and
P. Valduriez, “Efficient Distributed Test Architectures for Large-Scale
Systems,” ICTSS, 2010.
[17] J. E. Marynowski, M. Albonico, E. C. de Almeida, and G. Sunye,
“Testing MapReduce-Based Systems,” SSBD, 2011.
[18] J. Z. Li, S. He, L. Zhu, X. Xu, M. Fu, L. Bass, A. Liu, and
A. B. Tran, “Challenges to error diagnosis in Hadoop ecosystems,” in
Presented as part of the 27th Large Installation System Administration
Conference (LISA 13). Washington, D.C.: USENIX, 2013, pp. 145–154.
[Online]. Available: https://www.usenix.org/conference/lisa13/technical-sessions/presentation/li
[19] G. Sunye, E. C. D. Almeida, Y. L. Traon, B. Baudry, and J.-M. Jézéquel,
“Model-based testing of global properties on large-scale distributed
systems,” IST, 2014.
[20] L. Baresi and M. Young, “Test oracles,” Technical Report CIS-TR-01-02,
University of Oregon, Dept. of Computer and Information Science,
Eugene, Oregon, USA, 2001.
[21] E. T. Barr, M. Harman, P. McMinn, M. Shahbaz, and S. Yoo, “The
Oracle Problem in Software Testing: A Survey,” TSE, 2015.
[22] The Apache Software Foundation, “Hive,” 2017, https://hive.apache.org/.
[23] ——, “HBase,” 2017, https://hbase.apache.org/.
[24] ——, “Apache Spark,” 2017, http://spark.apache.org/.
[25] ——, “Apache Storm,” 2017, http://storm.apache.org/.
