

www.ictcs.info

The 2nd International Conference on new Trends in Computing Sciences (ICTCS’19)

Proceedings

Amman, Jordan
9 -11 October 2019

Organized by

Princess Sumaya University for Technology, Jordan


Prince Mohammad Bin Fahd University, Saudi Arabia

Editor
Prof. Arafat Awajan

IEEE Catalog Number: CFP19HAB-ART (Xplore)


ISBN: 978-1-7281-2882-5 (Xplore)
IEEE Catalog Number: CFP19HAB-USB (USB)
ISBN: 978-1-7281-2881-8 (USB)
2019 2nd International Conference on new Trends in 
Computing Sciences (ICTCS) 
 
Copyright © 2019 by the Institute of Electrical and Electronics Engineers, Inc. 
All rights reserved. 
 
 
 
 
Copyright and Reprint Permissions 
Abstracting is permitted with credit to the source. Libraries are permitted to photocopy beyond the 
limit of U.S. copyright law for private use of patrons those articles in this volume that carry a code 
at the bottom of the first page, provided the per‐copy fee indicated in the code is paid through 
Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923. 
 
 
 
For other copying, reprint or republication permission, write to IEEE Copyrights Manager, IEEE 
Service Center, 445 Hoes Lane, Piscataway, NJ  08854. 
All rights reserved. 
 
 
 
IEEE Catalog Number:   CFP19HAB-ART (Xplore)  
                                         CFP19HAB-USB (USB) 
ISBN:  978-1-7281-2882-5 (Xplore) 
                                       978-1-7281-2881-8 (USB) 
 
 
 
Printed copies of this publication are available from: 
 
Curran Associates, Inc 
57 Morehouse Lane 
Red Hook, NY  12571 USA 
Phone: (845) 758‐0400 
Fax:  (845) 758‐2633 
E‐mail: curran@proceedings.com 
 
 
 
 
 
 
 
 
Produced by IEEE eXpress Conference Publishing
 
For information on producing a conference proceedings and receiving an estimate,
contact conferencepublishing@ieee.org
http://www.ieee.org/conferencepublishing
Introduction
The second International Conference on new Trends in Computing Sciences (ICTCS’19) is
organized by the King Hussein School for Computing Science at Princess Sumaya University for
Technology in partnership with Prince Mohammad bin Fahd University (PMU), and in
collaboration with the Scientific Research Fund at the Ministry of Higher Education and the Royal
Scientific Society (RSS). Building on the resounding success of its first edition, the goal of the
conference is to continue providing an international biannual forum where international scientists
meet with Jordanian scientists from the different fields of computer science to exchange ideas and
information on current research trends, system developments, and practical experiences.

The conference is held in Amman, Jordan, from 9 to 11 October 2019. We expect more than 200
participants in this edition of the conference, which will feature 65 papers from 28 countries
(acceptance rate 33%), 8 keynote speeches (from the USA, UK, Italy, and Malaysia), an
industrial track with 3 talks, and 3 specialized workshops on data science, computer security and
artificial intelligence.

Major areas covered in the conference include:

Area 1: Data Science and Big Data: Evolutionary Computation, Big Data Analytics, Information
Retrieval for Big Data, Social Network Analysis and Mining, Data and Text Mining, Data Analysis
and Visualization, Computational Statistics and Modeling, Data Engineering, Mining Massive
Data, Data Visualization.

Area 2: Computer and Network Security: Botnet Detection and Prevention, Forensic Investigation
of the IoT, Big Data Forensics, Cloud Computing Security, Network Flow Analysis, Intrusion
Detection and Prevention, Mobile Security, Digital Forensics and Anti-Forensics, Malware
Analysis and Memory Forensics, Wireless Network.

Area 3: Natural Language Processing: Semantic Processing, Lexical Semantics, Ontology, Latent
Semantic Analysis, Linguistic Resources, Paraphrasing, Language Generation, Text Entailment,
Machine Translation, Information Retrieval, Text Mining, Question Answering, Speech Analysis
and Recognition, Arabic Natural Language Processing.

Area 4: Intelligent Systems: Knowledge Representation, Multi-Agent Systems, Machine
Learning, Fuzzy Systems, Expert Systems, Computer Vision, Neural Networks and Applications,
Pattern Recognition, Automated Problem Solving.

Area 5: Internet of Things (IoT): Architectures and protocols for the IoT, IoT new designs and
architectures, IoT/M2M Management, Interoperability of IoT systems, IoT applications, Security,
Identity and privacy of IoT, Reliability of IoT, Scalability issues for IoT networks, Disaster
recovery in IoT.

Area 6: Electronic and virtual Learning: e-Learning Tools, Mobile Learning, Gamification,
Collaborative Learning, Educational Systems Design, Virtual Learning Environments, Virtual
reality for education and workforce training.

Acknowledgments

We would like to thank the program committee members and all the reviewers and sub-reviewers who
worked very hard to support ICTCS’19 and sent their reviews on time. We would like to
thank our main sponsors for their valuable support. In particular, we would like to thank the
Scientific Research Support Fund for its generous support. We also would like to thank the
authors for their high-quality scientific contributions. We appreciate all the support we obtained
from the president of Princess Sumaya University for Technology (Jordan), Prof. Mashhoor Al-Refai,
and from Prof. Issa H. Al Ansari, president of Prince Mohammad bin Fahd University (Saudi
Arabia). Finally, we would like to extend our sincere gratitude to HRH Princess Sumaya Bint
Elhassan for her continuous support and guidance and for accepting to be the patron of ICTCS’19.

Message from the Conference Chairs
On behalf of the Organizing Committee, we are honored and delighted to welcome you to Amman
and to the Second International Conference on New trends in Computing Sciences (ICTCS’19).
ICTCS’19 is organized by the King Hussein School for Computing Science at Princess
Sumaya University for Technology (Jordan) in partnership with Prince Mohammad bin Fahd
University (Saudi Arabia), and is supported by the Scientific Research Fund at the Ministry of
Higher Education and the Royal Scientific Society (RSS).

Building on the resounding success of its first edition, the goal of the conference is to continue
providing an international biannual forum where international scientists meet with Jordanian
scientists from the different fields of computer science to exchange ideas and information on
current research trends, system developments, and practical experiences. The main
themes of ICTCS’19 include Data science, Data mining, Big Data, Artificial Intelligence, Internet
of Things, Natural Language Processing and Computer Security.

The technical program is rich and varied with 8 featured specialized keynote speeches, 3 industrial
presentations and 65 technical papers from 23 countries. In addition, several workshops are
organized in parallel with the conference.

On behalf of the organizing committee, we wish to thank all authors for their papers and
contributions to this conference. We would like to thank the keynote speakers for sharing their deep
knowledge and experience on hot research topics in the different fields of computer science
and information and communication technology. We offer our deep thanks to all the members of
the International Scientific Committee and reviewers, who offered their time and technical
expertise in the review process.

We know that the success of the conference depends ultimately on the many people who have
worked on planning and organizing both the technical program and the supporting social arrangements.
We would like to share with you our gratitude towards all members of the organizing committee
for their efforts and dedication to the success of this conference. We also thank Professor Mashhour
Al Refai, president of PSUT, and Professor Issa H. Al Ansari for their support in organizing this
conference.

Finally, we would also like to thank the Scientific Research Support Fund (Ministry of Higher
Education and Scientific Research, Jordan) for its valuable support of the conference. Special
thanks to all Session Chairs, Student Volunteers and Sponsors for their contributions to make
ICTCS’19 a success.

Prof. Arafat Awajan
Princess Sumaya University for Technology, Jordan

Prof. Faisal AL Anezi
Prince Mohammad Bin Fahd University, Saudi Arabia

ICTCS’19 Committees

Conference Chairs
Arafat Awajan, Princess Sumaya University for Technology, Jordan
Faisal AL Anezi, Prince Mohammad Bin Fahd University, Saudi Arabia

Advisory Committee
Arafat Awajan, Princess Sumaya University for Technology, Jordan (Chair)
Faisal AL Anezi, Prince Mohammad Bin Fahd University, Saudi Arabia (Chair)
Gheith Abandah, IEEE Jordan Section Chair, Jordan
Ali Chamkha, Prince Mohammad Bin Fahd University, Saudi Arabia
Thiab Taha, University of Georgia, USA
Nabeel Fayoumi, Royal Scientific Society, Jordan
Omer Rana, Cardiff University, UK
Aladdin Ayesh, De Montfort University, UK
Mohammad Bettaz, Dean of the Faculty of Information Technology, Philadelphia University
Ahmad Hiasat, Princess Sumaya University for Technology, Jordan
Amjad Hudaib, Dean of the Faculty of IT, Jordan University, Jordan
Jaafar Al Ghazo, Prince Mohammad Bin Fahd University, Saudi Arabia
Mohammad Bataz, Dean of the Faculty of IT, Philadelphia University, Jordan
Nijad Najdawi, Dean of the Faculty of IT, Al-Balqa Applied University, Jordan
Sahar Edwan, Dean of the Faculty of IT, Hashemite University, Jordan
Hassan Shalaby, Al-Hussein Bin Tala University, Jordan
Essam Al Daoud, Zarqa University, Jordan

Organizing Committee
Sufyan Al Majali, Princess Sumaya University for Technology, Jordan
Edward Jaser, Princess Sumaya University for Technology, Jordan
Jaafar Al Ghazo, Prince Mohammad Bin Fahd University, Saudi Arabia
Muder Al Miani, IEEE Jordan Section, Jordan
Khaled Jaber, IEEE Jordan Section, Jordan
Said Ghoul, Philadelphia University, Jordan
Malek Al-Zewairi, Princess Sumaya University for Technology, Jordan
Laila Al-Sayaydeh, Princess Sumaya University for Technology, Jordan
Ezzeldeen A-Issa, Princess Sumaya University for Technology, Jordan

Publication Committee
Edward Jaser, Princess Sumaya University for Technology, Jordan (Chair)
Rania Sinno, Prince Mohammad Bin Fahd University, Saudi Arabia
Laila Al-Sayaydeh, Princess Sumaya University for Technology, Jordan
Malek Al-Zewairi, Princess Sumaya University for Technology, Jordan

Technical Committee
Arafat Awajan, Princess Sumaya University for Technology, Jordan
Chedly B. Yahya, Prince Mohammad Bin Fahd University, Saudi Arabia
Brahim Medjahed, the University of Michigan – Dearborn, USA
Aladdin Ayesh, De Montfort University, UK
Christian Boitet, Joseph Fourier University, France.
Adil Alpkocak, Dokuz Eylul University, Turkey
Abd El-Aziz Ahmed, Anna University, India
Abdallah Qusef, Princess Sumaya University for Technology, Jordan
Eric Schoop, TU Dresden, Germany
Essam Al Daoud, Zarqa University, Jordan
Essam Rashed, The British University in Egypt, Egypt
Fares Fraij, The University of Texas at Austin, USA
Abdallah Shdaifat, The University of Jordan, Jordan
Abdullah Aref, Princess Sumaya University for Technology, Jordan
Abul Bashar, Prince Mohammad Bin Fahd University, Saudi Arabia
Cathryn Peoples, The Open University, UK
Ankur Singh Bist, Krishna Institute of Engineering & Technology, India
Chiheb-Eddine Ben N'Cir, University of Tunis, Tunisia
Adiy Tweissi, Princess Sumaya University for Technology, Jordan
Adnan Gutub, Umm Al-Qura University, Saudi Arabia
Adnan Hnaif, Al-Zaytoonah University of Jordan, Jordan
Bushra Alhijawi, Princess Sumaya University for Technology, Jordan
Daoud Daoud, Princess Sumaya University for Technology, Jordan
Darin El-Nakla, Prince Mohammad Bin Fahd University, Saudi Arabia
Adnan Shaout, University of Michigan, USA
Ahmad Abusukhon, Al-Zaytoonah University of Jordan, Jordan
Ahmad Al-Qerem, Princess Sumaya University for Technology, Jordan
Ahmad Hiasat, Princess Sumaya University for Technology, Jordan
Ala Al-Fuqaha, Western Michigan University, USA
Albara Awajan, Al Balqa Applied University, Jordan
Ali Hadi, Champlain College, Computer and Digital Forensics, USA
Amin Beheshti, Macquarie University, Australia
Amjad Hudeib, The University of Jordan, Jordan
Amjed Almousa, Princess Sumaya University for Technology, Jordan
Ammar Elhassan, Princess Sumaya University for Technology, Jordan
Gaurav Garg, ABV-Indian Institute of Information Technology & Management, India
George Sammour, Princess Sumaya University for Technology, Jordan
Ghassan Al Qaimari, Jumeira University, United Arab Emirates
Ghassan Shobaki, Sacramento State University, USA
Anas Abu Taleb, Princess Sumaya University for Technology, Jordan
Arinola Adefila, Coventry University, UK
Ashraf Odeh, Isra University, Jordan
Ashraf Tahat, Princess Sumaya University for Technology, Jordan
Baha Khasawnwh, Princess Sumaya University for Technology, Jordan

Basheer Dwaoiri, Jordan University of Science and Technology, Jordan
Bassam Hammo, The University of Jordan, Jordan
Heba Abdelnabi, Princess Sumaya University for Technology, Jordan
Hejab Alfawareh, Northern Border University, Saudi Arabia
Bayan Abu Shawar, Arab Open University, Jordan
Dhiah Abu Tair, German Jordanian university, Jordan
Dima Suleiman, Princess Sumaya University for Technology, Jordan
Doaa ElZanfaly, The British University in Egypt, Egypt
Haytham Bani Salameh, Yarmouk University, Irbid, Jordan
Hosam El-Sofany, King Khalid University, Saudi Arabia
Hunaida Awwad, Dokuz Eylul University, Turkey
Huseyin Abachi, Adnan Menderes University, Turkey
Hussein Sane Yagi, The University of Jordan, Jordan
Ibrahim Aljarah, The University of Jordan, Jordan
Ilyes Jenhani, Prince Mohammad Bin Fahd University, Saudi Arabia
Isidro Maya-Jariego, Universidad de Sevilla, Spain
Dojanah Al-Nabulsi, Amman University College, Jordan
Dojanah Bader, Al-Balqa` Applied University, Jordan
Ebaa Fayyoumi, The Hashemite University, Jordan
Edward Jaser, Princess Sumaya University for Technology, Jordan
Emad Abdallah, The Hashemite University, Jordan
Firas Alghanim, Princess Sumaya University for Technology, Jordan
Ghassen Ben Brahim, Prince Mohammad Bin Fahd University, Saudi Arabia
Ghazi Naymat, Princess Sumaya University for Technology, Jordan
Gheith Abandah, University of Jordan, Amman, Jordan.
Hani Almimi, Al-Zaytoonah University of Jordan, Jordan
Hasan Al Shalabi, Al Hussein University, Jordan
Ismail Ababneh, Al al-Bayt University, Jordan
Jaafar Alghazo, Prince Mohammad Bin Fahd University, Saudi Arabia
Jaafer Saraireh, Princess Sumaya University for Technology, Jordan
Jaber Alwidian, Al-Isra University, Jordan
Khaled Al-Begain, University of South Wales, UK
Khaled Almakadmeh, The Hashemite University, Jordan
Khaled Almi'ani, United Arab Emirates
Khaled Alzoubi, Saint Xavier University, USA
Ja'far Alqatawna, The University of Jordan, Jordan
Jalal Atoum, Princess Sumaya University for Technology, Jordan
Jamal Arafat, Ohio University, USA
Jawad Fawaz Al-Asad, Prince Mohammad Bin Fahd University, Saudi Arabia
Jihad Jaam, International Journal of Computing and Information Sciences, United Arab Emirates
Khair Eddin Sabri, The University of Jordan, Jordan
Khalaf Khatatneh, Al-Balqa` Applied University, Jordan
Khaled Mahmoud, Princess Sumaya University for Technology, Jordan
Khaled Makadmeh, The Hashemite University, Jordan
Khaled Mansour, Al-Zaytoonah University of Jordan, Jordan
Khaled Nagaty, The British University in Egypt, Egypt

Majid Ali Khan, Prince Mohammad Bin Fahd University, Saudi Arabia
Malek Al-Zewairi, Princess Sumaya University for Technology, Jordan
Malik Qasaimeh, Princess Sumaya University for Technology, Jordan
Malik Saleh, Prince Mohammad Bin Fahd University, Saudi Arabia
Khaled Younis, The University of Jordan, Jordan
Khamis Omar, Jordan
Khatatneh Khalaf, Al Balqa Applied University, Jordan
Lalit Garg, L-Università ta' Malta, Malta
Leonel Sousa, Universidade de Lisboa, Portugal
Loay Alzubaidi, Prince Mohammad Bin Fahd University, Saudi Arabia
Majdi Rawashdeh, Princess Sumaya University for Technology, Jordan
Majdi Sawalha, The University of Jordan, Jordan
Mamoun Hattab, University of Petra, Jordan
Maram Bani Younes, University of Ottawa, Canada
Mariam Biltawi, Princess Sumaya University for Technology, Jordan
Mariam Khader, Princess Sumaya University for Technology, Jordan
Marius Nagy, Prince Mohammad Bin Fahd University, Saudi Arabia
Marwah Alian, Princess Sumaya University for Technology, Jordan
Mohamed Anis Bach Tobji, ESEN University, Tunisia
Mohamed Aymen Ben Hajkacem, Higher Institute of Management of Tunis, Tunis
Mohamed Wiem Mkaouer, Rochester Institute of Technology, USA
Mohammad Ababneh, Princess Sumaya University for Technology, Jordan
Nadim Obeid, The University of Jordan, Jordan
Nailah Al-Madi, Princess Sumaya University for Technology, Jordan
Naoufel Werghi, Khalifa University, United Arab Emirates
Mohammad Abusharaih, The University of Jordan, Jordan
Mohammad Alauthman, Al-Zaytoonah University of Jordan, Jordan
Mohammad Alia, Al-Zaytoonah University of Jordan, Jordan
Mohammad Al-Zoube, Princess Sumaya University for Technology, Jordan
Mohammad Belal Al Zoubi, Princess Sumaya University for Technology, Jordan
Mohammad Daoud, Microsoft MVP, Jordan
Omar Nofal, Princess Sumaya University for Technology, Jordan
Omar Rana, Cardiff University, UK
Osama Dorgham, Al-Balqa` Applied University, Jordan
Mohammed Al-Saleh, The University of Jordan, Jordan
Mohammed Alweshah, Al-Balqa` Applied University, Jordan
Mohammed Zeki Khedher, The University of Jordan, Jordan
Montassar Ben Messaoud, Higher Institute of Management of Tunis, Tunis
Mostafa Ali, Jordan University of Science and Technology, Jordan
Mouhammd Alkasassbeh, Princess Sumaya University for Technology, Jordan
Mousa Al-Akhras, Saudi Electronic University, Saudi Arabia
Omar Hiari, German Jordanian University, Jordan
Omar M. Al-Jarrah, Jordan University of Science and Technology, Jordan
Mousa Ayyash, Colorado State University, USA
Mustafa Al Fayoumi, Princess Sumaya University for Technology, Jordan
Nadia Sweis, Princess Sumaya University for Technology, Jordan

Nazeeruddin Mohammad, Prince Mohammad Bin Fahd University, Saudi Arabia
Nijad Najdawi, Al Balqa Applied University, Jordan
Omar Al-Hujran, Princess Sumaya University for Technology, Jordan
Omar H. Karam, The British University in Egypt, Egypt
Osama Haj Hassan, Al-Isra University, Jordan
Osama Ouda, The University of Jordan, Jordan
Parag Kulkarni, United Arab Emirates University, United Arab Emirates
Paul Richardson, The University of Michigan – Dearborn, USA
Paul Watta, The University of Michigan – Dearborn, USA
Peter King, Heriot Watt University, UK
Priyanka Chaurasia, ULSTER, UK
Raed Abu Zitar, American University of Madaba, Jordan.
Raghda Hraiz, Princess Sumaya University for Technology, Jordan
Rami Alazrai, German Jordanian University, Jordan
Rawan Ghnemat, Princess Sumaya University for Technology, Jordan
Ridha Ghayoula, Université Laval, Canada
Rosana Marar, Princess Sumaya University for Technology, Jordan
S Smys, RVS Technical Campus, India
Sadiq Alhuwaidi, Prince Mohammad Bin Fahd University, Saudi Arabia
Sahar Idwan, The Hashemite University, Jordan
Said Ghoul, Philadelphia University, Jordan
Salam Fraihat, Princess Sumaya University for Technology, Jordan
Salam Hamdan, Princess Sumaya University for Technology, Jordan
Saleh Abu-Soud, Princess Sumaya University for Technology, Jordan
Thiab Taha, University of Georgia, USA
Varsha Jain, Narsee Monjee Institute of Management Studies, India
Vladimir Geroimenko, The British University in Egypt, Egypt
Wael Etaiwi, Princess Sumaya University for Technology, Jordan
Walid A Salameh, Princess Sumaya University for Technology, Jordan
Samer Sawalha, Princess Sumaya University for Technology, Jordan
Samir Abou El-Seoud, The British University in Egypt, Egypt
Samir Elnakla, Prince Mohammad Bin Fahd University, Saudi Arabia
Samy Ghoniemy, The British University in Egypt, Egypt
Sane Yagi, University of Jordan, Jordan
Saqer Abdel Rahim, Jordan
Sara Tedmori, Princess Sumaya University for Technology, Jordan
Sarah Gellynhail, Western Michigan University, USA
Shadi Aljawarneh, The University of Jordan, Jordan
Yahia Al-Halabi, Princess Sumaya University for Technology, Jordan
Yasmeen Alsufaisan, Prince Mohammad Bin Fahd University, Saudi Arabia
Yi Lu Murphy, University of Michigan, USA
Yousef Daradkeh, Prince Sattam bin Abdulaziz University, KSA
Shahabuddin Muhammad, Prince Mohammad Bin Fahd University, Saudi Arabia
Shaidah Jusoh, Princess Sumaya University for Technology, Jordan
Sharefa Murad, University of Salerno, Italy
Sufyan Almajali, Princess Sumaya University for Technology, Jordan

Suleiman Yerima, De Montfort University, UK
Tarek Abbes, Higher Institute of Electronic and Communication of Sfax, Tunisia
Walid Hussien, The British University in Egypt, Egypt
Wided Guezguez, Tunis Business School, Tunis
Zaydon Hatamleh, Al Ain University of Science and Technology, United Arab Emirates

Keynotes

Prof. Omer Rana


Cardiff University, UK

Realizing Edge Marketplaces


Abstract:
The edge of the network has the potential to host services for supporting a variety of user applications,
ranging in complexity from data preprocessing, image and video rendering, and interactive gaming, to
embedded systems in autonomous cars and built environments. However, the computational and data
resources over which such services are hosted, and the actors that interact with these services, have an
intermittent availability and access profile, introducing significant risk for user applications that must rely
on them. This talk will describe the development of an edge marketplace, which is able to support multiple
providers for offering services at the network edge, and to enable demand supply for influencing the
operation of such a marketplace. Resilience, cost, and quality of service and experience will subsequently
enable such a marketplace to adapt its services over time. This talk will also describe how distributed-ledger
technologies (such as blockchains) provide a promising approach to support the operation of such a
marketplace and regulate its behavior (such as the GDPR in Europe) and operation. Application scenarios
provide context for the discussion of how such a marketplace would function and be utilized in practice.
The talk suggests potential edge services that can be hosted in cities such as Amman (Jordan), and business
models to support these services.

Prof. Ku Ruhana Bt Ku M
Universiti Utara Malaysia, Malaysia

Hybrid Swarm Intelligence Algorithms for Optimization Problems


Abstract:
Computational intelligence and metaheuristic algorithms have become increasingly popular in computer
science, artificial intelligence, machine learning, engineering design, data mining, image processing, and
data-intensive applications. Several algorithms in computational intelligence and optimization are
developed based on swarm intelligence (SI). Different algorithms may have different features and thus may
behave differently, even with different efficiencies. However, there is still a lack of in-depth understanding of
why these algorithms work well and exactly under what conditions.

The current trend is to design hybrid metaheuristics by combining different metaheuristics which will
benefit from the individual advantages of each method. An effective approach consists in combining a
population-based method with a single-solution method (often a local search procedure such as Tabu
search with ant colony optimization (ACO)). In hybrid optimization algorithms, many combinations of
famous optimization methods have been developed, such as a hybrid grey wolf optimizer and genetic
algorithm, hybrid Cuckoo Search and Particle Swarm Optimization (PSO), a hybrid PSO and ACO and a
Hybrid ACO and artificial bee colony algorithm. Hybrid SI-based metaheuristics can obtain satisfying
results when solving optimization problems in a reasonable time. However, they suffer especially with
high-dimensional optimization problems. Future research to overcome this limit could be in the area of
parallel metaheuristics.

Prof. Mubarak Shah
University of Central Florida, USA

View Invariant and Few Shot Human Action Recognition


Abstract:
Automatic recognition of human actions from videos is one of the most active areas of research in Computer
Vision. My group has been working on this problem for some time and we have proposed several different
methods addressing different aspects of this problem. Two important limitations of many of our approaches,
and of other approaches proposed in the literature, are their sensitivity to viewpoint change and their
requirement for a large number of training examples. In this talk I will present our recent work addressing
both view-aware action recognition and learning actions with fewer labels.

Prof. Moussa Ayyash
Chicago State University, USA

Coexistence Strategies in the Era of Artificial Intelligence and Heterogeneous Technologies:
Opportunities and Challenges
Abstract:
Heterogeneous integration of networking and communication technologies has gotten the attention of the
business and research community in recent years. While this integration is logical and brings its own
opportunities when it comes to integrating vast and diverse resources, there are many technical and
administrative challenges which need to be dealt with in order to harvest the real benefits of such
integration. These challenges present themselves at different layers and levels (organization, management,
maintenance, optimization, security, energy utilization, etc.).

This talk highlights the need for a strategic framework for coexisting heterogeneous wired and wireless
deployments and computing infrastructures. The talk provides examples of recent promising solutions that
promote coexistence strategies (e.g. coexisting radio and optical wireless deployments (CROWD)). The
speaker will also focus on the fact that large-scale heterogeneous integration requires artificial intelligence
(AI) techniques that can naturally deal with coexisting heterogeneous environments.

The speaker will briefly shed light on the need for a different future workforce which is ready to deal with
the heterogeneity of computing sciences and the “future-of-work” trends.

Prof. Giorgio Giacinto, University of Cagliari, Italy
Prof. George Dafoulas, Middlesex University, UK

The Digital Future: Research and Education Challenges in Cybersecurity and Digital Forensics
Abstract:
The increasing “digitalisation” of our era makes our social, economic, and political lives highly dependent
on computers. This complex environment is causing different and multiple weaknesses that allow malicious
actors to misuse the systems and cause threats at different levels of severity. Defence is a priority for
all professionals, who need to be aware of the best practices and current technologies for designing secure
systems and be prepared to be resilient to attacks. When facing an attack, not only do damages have to be
limited, but information also needs to be properly collected in order to assess the impact, locate the
vulnerabilities that caused the attack, and eventually attribute the attack. In this reality, artificial intelligence
plays a pivotal role in early detection and post-mortem data analysis, and its effectiveness is related to its
resilience to adversarial attacks aimed at evading detection and misleading the decision process.
Considerable efforts in research, education, and public awareness are needed to build a secure environment
and promote trust.

The shift towards Education 4.0 has significantly changed the pressure for a learning experience that is
fully aligned with a volatile employment sector. Therefore, there is a need for a revised pedagogical approach
in the way digital forensic curricula are delivered and supported. The evolution of educational technologies,
as well as the increasing integration of a range of hands-on experiences in the learning process, means that
digital forensic programmes are enhanced with the use of the Internet of Things, Immersive Learning
Environments, Social Learning Networks, Augmented, Virtual and Mixed Reality, Sensor Generated Data,
Biometrics, and new perspectives on the impact that ethical, social and professional issues have on security and
privacy. Such pressures have triggered a significant reshaping of learning, teaching and assessment practices,
with emphasis on delivering digital forensics programmes in ways that equip graduates towards seamless
employability readiness. This keynote will (i) discuss the various challenges of the changing educational
sector, (ii) share examples of good practice in delivering programmes within the framework of Industrial
Revolution 4.0 and (iii) provide guidance for adapting new educational practices in the delivery of digital
forensic programmes.

Prof. Elhadj Benkhelifa
Staffordshire University, UK

Towards Bio-Inspired Resilient Cloud Environments


Abstract:
As organisations are increasingly adopting cloud computing (centralised and/or decentralised) as the
foundation for their IT infrastructure, the reliability of inherently complex cloud systems comes under
test. The robustness of these infrastructures and services, and their overall resilience, is generally enhanced
by creating redundancy for backup in times of fault, failure or attack. Concepts and processes that exist as
nature's inherently multifunctional capabilities, such as robustness, resiliency, survivability, and
adaptability, could provide inspiration for unconventional methods to solve unique problems in the
computing continuum. Ensuring the resilience of critical infrastructures is ever more necessary with the
increasing threat of cyber-attacks, due to the increased complexity. It is generally accepted that as
complexity increases, resilience and reliability decrease. However, biological systems subvert this rule;
they are inherently much more complex, yet highly reliable. This talk will review and define resilience
disciplines and techniques for cloud computing, then draw parallels between resilience capabilities in
nature, such as those demonstrated in multi-cellular biological systems, and capabilities in cloud
environments.

Prof. Mona T. Diab
The George Washington University, USA

Low Resource Scenarios: Challenges and Opportunities


Abstract:
With the advent of social media, we are witnessing an exponential growth in unstructured data online. A
huge amount of this data is in fact in languages other than English. Some of these languages have rich
automated resources and processing tools, but the majority of the languages in the world are considered
low resource despite their presence online. In this talk, I will address the problem of processing low resource
languages. I will present some of our solutions for language identification, information extraction, machine
translation, and resource creation exploiting rich languages via cross linguistic modeling. Such techniques
can also be cast for cross genre and cross domain challenges.

Prof. Salim Hariri
The University of Arizona, USA

Autonomic Cyber Security (ACS) – The Next Generation of Self-Protection Systems and Services
Abstract:
The increased dependence on cyber systems in business, finance, government and education makes them
prime targets for cyberattacks due to the profound and catastrophic damage these attacks might inflict on
our economy and all aspects of our life. It is widely recognized that cyber resources and services can be
penetrated and exploited. Furthermore, it is widely accepted that cyber-resilience techniques are the most
promising solutions to mitigate cyber-attacks and change the game to advantage the defender over the
attacker. In this presentation, I will present an approach based on biological systems to develop autonomic
cybersecurity technologies that will significantly change how we manage, secure and protect cyber
resources and services. Our approach is based on autonomic computing (self-manage systems with little or
no involvement from users or system administrators), data mining, and anomaly behavior analysis
techniques. The main building components to implement Autonomic Cyber Security (ACS) are: 1)
Innovative data structures (cyber-DNAs) to accurately detect the current operational state of any cyber system
and predict its behavior in the near future; 2) Anomaly Behavior Analysis (ABA) methodology that can
detect with high accuracy and almost no false alarms any anomalous behavior triggered by cyberattacks,
faults (hardware or software) and accidents (malicious or natural); and 3) Self-Management Engine to
deliver automated and semi-automated actions so we can proactively stop or mitigate the impacts of
cyberattacks. I will show through several examples how to apply ACS to secure and protect a wide range
of cyber systems and applications.

Table of Contents

Track 1: Computer and Network Security


Optimized Multi-Layer Hierarchical Network Intrusion Detection System with
Genetic Algorithms .....................................................................................................................1
Pranesh Santikellur, Tahreem Haque, Malek Al-Zewairi, and Rajat Subhra Chakraborty

Leader Election and Blockchain Algorithm in Cloud Environment for E-Health ......................8
Basem Assiri

Automotive Cybersecurity: Foundations for Next-Generation Vehicles ..................................14


Michele Scalas and Giorgio Giacinto

NTRU-Like Secure and Effective Congruential Public-Key Cryptosystem using Big


Numbers ....................................................................................................................................20
Anas Ibrahim, Alexander Chefranov, and Nagham Hamad

Review: Phishing Detection Approaches ..................................................................................27


AlMaha Abu Zuraiq and Mouhammd Alkasassbeh

Detecting Slow Port Scan using Fuzzy Rule Interpolation .......................................................33


Mohammad Almseidin, Mouhammd Al-kasassbeh, and Szilveszter Kovacs

An Approach for Web Applications Test Data Generation Based on Analyzing Client
Side User Input Fields ...............................................................................................................39
Samer Hanna and Hayat Jaber

Achieving Data Integrity and Confidentiality using Image Steganography and


Hashing Techniques ..................................................................................................................45
Ahmed Hambouz, Yousef Shaheen, Abdelrahman Manna, Mustafa Al-Fayoumi, and Sara Tedmori

Detecting Network Anomalies using Machine Learning and SNMP-MIB Dataset


with IP Group ............................................................................................................................51
Abdelrahman Manna and Mouhamad Alkasassbeh

Enhancing Data Protection Provided by VPN Connections over Open WiFi


Networks ...................................................................................................................................56
Ashraf Karaymeh, Mohammad Ababneh, Malik Qasaimeh, and Mustafa Al-Fayoumi

A Proactive Design to Detect Denial of Service Attacks using SNMP-MIB ICMP


Variables ....................................................................................................................................62
Yousef Khaled Shaheen and Mohammad Al Kasassbeh

An Energy Aware Fuzzy Trust Based Clustering with Group Key Management in
MANET Multicasting................................................................................................................68
Gomathi Krishnasamy

Framework for Blockchain Deployment: The Case of Educational Systems ...........................74


Saif Kazakzeh, Eyad Ayoubi, Baraa K. Muslmani, Malik Qasaimeh, and Mustafa Al-Fayoumi

Track 2: Virtual and Electronic Learning

The JOVITAL Project: Capacity Building for Virtual Innovative Teaching and
Learning in Jordan .....................................................................................................................83
Katherine Wimpenny, Arinola Adefila, Alun DeWinter, Valerij Dermol, Nada Trunk Širca,
and Aleš Trunk

The Relation between Individual Student Behaviours in Video Presentation and


Their Modalities using VARK and PAEI Results .....................................................................87
Ahmed Fekry, Georgios Dafoulas, and Manal Ismail

An overview of Digital Forensics Education ............................................................................94


Georgios A. Dafoulas and David Neilson

Enhancing International Virtual Collaborative Learning with Social Learning


Analytics ..................................................................................................................................101
Alexander Clauss, Florian Lenk, and Eric Schoop

Evaluation of Students’ Acceptance of the Leap Motion Hand Gesture Application in


Teaching Biochemistry............................................................................................................107
Nazlena Mohamad Ali and Mohd Shukuri Mohamad Ali

Designing and Implementing an e-Course using Adobe Captivate and Google


Classroom: A Case Study ........................................................................................................113
Shahd Alia and Thair Hamtini

The Importance of Institutional Support in Maintaining Academic Rigor in


E-Learning Assessment ...........................................................................................................119
Darin El-Nakla, Beverley McNally, and Samir El-Nakla

Deep Learning Assisted Smart Glasses as Educational Aid for Visually Challenged
Students ...................................................................................................................................124
Hawra AlSaid, Lina AlKhatib, Aqeela AlOraidh, Shoaa AlHaidar, and Abul Bashar

Track 3: Data Science and Big Data


DeepDR: An Image Guided Diabetic Retinopathy Detection Technique using
Attention-Based Deep Learning Scheme ................................................................................130
Noman Islam, Umair Saeed, Rubina Naz, Jaweria Tanveer, Kamlesh Kumar, and Aftab Ahmed Shaikh

Mitigating the Effect of Data Sparsity: A Case Study on Collaborative Filtering


Recommender System .............................................................................................................136
Bushra Alhijawi, Ghazi Al-Naymat, Nadim Obeid, and Arafat Awajan

Visualizing Program Quality – A Topological Taxonomy of Features ..................................142


Islam Al Omari, Razan Al Omoush, Haneen Innab, and A. Elhassan

Improved Swarm Intelligence Optimization using Crossover and Mutation for


Medical Classification .............................................................................................................152
Mais Yasen and Nailah Al-Madi

Novel Approach towards Arabic Question Similarity Detection ............................................158
Mohammad Daoud
Using K-Means Clustering and Data Visualization for Monetizing Logistics Data ...............164
Hamzah Qabbaah, George Sammour, and Koen Vanhoof

Content Based Image Retrieval Approach using Deep Learning............................................170


Heba Abdel-Nabi, Ghazi Al-Naymat, and Arafat Awajan

Data Analytics and Business Intelligence Framework for Stock Market Trading ..................178
Batool AlArmouty and Salam Fraihat

Track 4: Internet of Things


Reducing Ambulances Arrival Time to Patients .....................................................................183
Mohammad Eshtayah, Jalal Morrar, Ameer Baghdadi, and Amjad Hawash

Framework Architecture for Securing IoT using Blockchain, Smart Contract and
Software Defined Network Technologies ...............................................................................189
Hasan Al-Sakran, Yaser Alharbi, and Irina Serguievskaia

Security Issues in Wireless Sensor Network Broadcast Authentication .................................195


Asad Raza, Ali Abu Romman, and Muhammad Faheem Qureshi

Towards an Integration Concept of Smart Cities ....................................................................202


Naoum Jamous and Stefan Willi Hart

Compression Techniques Used in IoT: A Comparative Study ................................................208


Salam Hamdan, Arafat Awajan, and Sufyan Almajali

Track 5: Natural Language Processing


Using Part of Speech Tagging for Improving Word2vec Model ............................................213
Dima Suleiman and Arafat A. Awajan

Applying Ontology in Computational Creativity Approach for Generating a Story ..............220


Lana Issa and Shaidah Jusoh

Arabic Document Indexing for Improved Text Retrieval .......................................................226


Yaser A. M. Al-Lahham

Evaluation of Question Classification .....................................................................................231


Mariam Biltawi, Arafat Awajan, and Sara Tedmori

Arabic Text Classification of News Articles using Classical Supervised Classifiers .............238
Leen Al Qadi, Hozayfa El Rifai, Safa Obaid, and Ashraf Elnagar

Graph-Based Arabic Key-Phrases Extraction .........................................................................244


Dana Halabi and Arafat Awajan

Arabic Text Keywords Extraction using Word2vec ...............................................................251


Dima Suleiman, Arafat A. Awajan, and Wael Al Etaiwi

A Deep Learning Approach for Arabic Text Classification....................................................258
Katrina Sundus, Fatima Al-Haj, and Bassam Hammo

Arabic Text Semantic Graph Representation ..........................................................................265


Wael Mahmoud Al Etaiwi and Arafat Awajan

Sentiment Analysis for Arabic Language using Attention-Based Simple Recurrent


Unit ..........................................................................................................................................271
Saja Al-Dabet and Sara Tedmori

Track 6: Intelligent Systems


A Novel Medical Image Fusion Algorithm for Detail-Preserving Edge and Feature
Extraction ................................................................................................................................277
Fayadh Alenezi

Classification of Short-Time Single-Lead ECG Recordings using Deep Residual


CNN.........................................................................................................................................283
Areej Kharshid, Haikel S. Alhichri, Ridha Ouni, and Yakoub bazi

Identification and Tagging of Malicious Vehicles through License Plate Recognition ..........289
Ahmad Mostafa, Walid Hussein, and Samir El-Seoud

Cascaded Layered Recurrent Neural Network for Indoor Localization in Wireless


Sensor Networks......................................................................................................................296
Hamza Turabieh and Alaa Sheta

Learning with Dynamic Architectures for Artificial Neural Networks - Adaptive


Batch Size Approach ...............................................................................................................302
Reham Saeed, Rawan Ghnemat, Ghassen Benbrahim, and Ammar Elhassan

Hybrid Machine Learning Classifiers to Predict Student Performance ..................................306


Hamza Turabieh

Automated Grading for Handwritten Answer Sheets using Convolutional Neural


Networks .................................................................................................................................312
Eman Shaikh, Iman Mohiuddin, Ayisha Manzoor, Ghazanfar Latif, and Nazeeruddin Mohammad

Wrapper-Based Feature Selection for Imbalanced Data using Binary Queuing Search
Algorithm ................................................................................................................................318
Thaer Thaher, Majdi Mafarja, Baker Abdalhaq, and Hamouda Chantar

Self-Organizing Maps for Agile Requirements Prioritization ................................................324


Amjad Hudaib and Fatima Alhaj

A Parallel Face Detection Method using Genetic & CRO Algorithms on Multi-Core
Platform ...................................................................................................................................329
Mohammad Khanafsa, Ola Surakhi, and Sami Sarhan

Heart Disease Detection using Machine Learning Majority Voting Ensemble


Method.....................................................................................................................................335
Rahma Atallah and Amjed Al-Mousa

Resolving Conflict of Interests in Recommending Reviewers for Academic
Publications using Link Prediction Techniques ......................................................................341
Sa'ad A. Al-Zboon, Saja Khaled Tawalbeh, Heba Al-Jarrah, Muntaha Al-Asa'd,
Mahmoud Hammad, and Mohammad AL-Smadi

Reconstructing Colored Strip-Shredded Documents Based on the Hungarians


Algorithm ................................................................................................................................347
Fatima Alhaj, Ahmad Sharieh, and Azzam Sleit

Implementation and Comparative Analysis of Semi-Automated Surveillance


Algorithms in Real Time using Fast-NCC ..............................................................................353
Omer Khan, Nayab Saeed, Raheel Muzzammel, Umair Tahir, and Omar Azeem

Adaptive Control of Nonaffine Nonlinear Systems by Neural State Feedback ......................360


M. Bahita and K. Belarbi

Track 7: Miscellaneous
Would It be Profitable Enough to Re-Adapt Algorithmic Thinking for Parallelism
Paradigm ..................................................................................................................................366
Aimad Eddine Debbi, Abdelhak Farhat Hamida, and Haddi Bakhti

Affordable and Portable Realtime Saudi License Plate Recognition using SoC ....................372
Loay Alzubaidi, Ghazanfar Latif, and Jaafar Alghazo

Two Information Systems in Air Transport It is a Short Journey from Success to


Failure ......................................................................................................................................377
Victor P. Lane, Derar Eleyan, and James Snaith

Task Scheduling Based on Modified Grey Wolf Optimizer in Cloud Computing


Environment ............................................................................................................................382
Abdullah Alzaqebah, Rizik Al-Sayyed, and Raja Masadeh

Causal Path Planning Graph Based on Semantic Pre-Link Computation for Web
Service Composition ...............................................................................................................388
Moses Olaifa and Tranos Zuva

Accelerating Stochastic Gradient Descent using Adaptive Mini-Batch Size..........................393


Muayyad Saleh Alsadi, Rawan Ghnemat, and Arafat Awajan

Optimized Multi-Layer Hierarchical Network
Intrusion Detection System with Genetic Algorithms
Pranesh Santikellur
Dept. of Computer Science and Engineering
Indian Institute of Technology
Kharagpur, West Bengal, India 721302
pranesh.sklr@iitkgp.ac.in

Tahreem Haque
Dept. of Computer Science and Engineering
Heritage Institute of Technology
Kolkata, West Bengal, India 700107
tahreemhaque97@gmail.com

Malek Al-Zewairi
Dept. of Computer Science, King Hussein School of Computing Sciences
Princess Sumaya University for Technology
Amman, Jordan
m.alzewairi@psut.edu.jo

Rajat Subhra Chakraborty
Dept. of Computer Science and Engineering
Indian Institute of Technology
Kharagpur, West Bengal, India 721302
rschakraborty@cse.iitkgp.ac.in

Abstract—The number of connected devices on the Internet exceeded 31 billion in 2018, and it is
forecasted that this number will exceed 50 billion by the year 2020. On the other hand, malicious
software and network attacks are rising at an alarming rate. It is estimated that more than 230,000
new malware samples are produced daily and over 53,000 new Cryptoware malware engines are detected
as well. This proliferation in security attacks constitutes a great challenge for Intrusion Detection
Systems (IDS), in particular in detecting modern attacks. In this paper, a multi-layer hierarchical
Network Intrusion Detection System (NIDS) is proposed with the aim of improving the overall detection
performance of the IDS in detecting modern attack types. The proposed multi-layer NIDS utilizes
multiple models of machine learning algorithms in a hierarchical architecture, in addition to using
evolutionary computing, namely Genetic Algorithms, to tune the configurations of the neural network
models used in the first layer. A modern dataset (i.e. CICIDS-2017), which contains several modern
attacks, is used to evaluate the proposed approach. The results showed that the proposed multi-layer
system significantly improved on the error generalization metrics.

Index Terms—Evolutionary Computing; Machine Learning; Network Intrusion Detection; Network Security;
Multi-Layer

I. INTRODUCTION

New technologies such as Cloud and Fog Computing, Big Data, and the Internet of Things (IoT) have
progressed enormously and made people more dependent on computer networking technology than ever
before. At the same time, incidents of information security breaches have increased drastically, and
ensuring end-to-end security is of utmost concern. This ranges from ensuring the safety and
trustworthiness of networking hardware to high-level effective defensive measures against various types
of network attacks which leverage the vulnerabilities of the deployed security protocols. A Network
Intrusion Detection System (NIDS) is considered one of the most important security controls to detect
malicious attack behaviors that might compromise the integrity, confidentiality, or availability of a
network. Because of this role, NIDSs have become an important part of network security.

Over the years, several researchers have proposed different methods and techniques for network
intrusion detection. The research on network intrusion detection systems has evolved along different
approaches, for instance the use of rule-based, statistical-analysis, and Finite State Machine (FSM)
based modeling.

• Statistical-based intrusion detection passes network traffic samples through a statistical inference
test which decides whether a packet belongs to a normal or a malicious flow. It involves both parametric
tests, where an underlying distribution is assumed, and non-parametric tests, which are
distribution-free. Chi-square based detection methods [1] and Parzen-window based methods [2] are
examples of parametric and non-parametric statistical detection systems, respectively.

• Rule-based intrusion detection characterizes normal flow with a rule; any flow which does not follow
the rule is considered malicious. Rule learning algorithms learn the rules from data that can be
expressed in the form of an IF-THEN rule (a toy illustration of such a rule is sketched after this
list). [3] proposed a new rule formation algorithm called base-support association rules to distinguish
between normal and intrusive behavior.

• FSM-based intrusion detection involves deducing an FSM from network data, where states represent
network attacks and transitions are matching features. On each matched feature, a successful transition
is made, and the final acceptance state decides the attack. The authors of [4] proposed a real-time
intrusion detection tool (STAT), which is based on the state transition analysis technique. Salvador
and Chan [5] demonstrate a way to perform time series anomaly detection via states and rules generated
using RIPPER to form a detection system.
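The following toy sketch (our illustration, not taken from the paper) shows what an IF-THEN rule of the
kind described above might look like; the feature names and thresholds are hypothetical.

# Illustrative only: a hand-written IF-THEN rule for flow classification.
# Feature names and thresholds are hypothetical, not from the paper or dataset.
def classify_flow(flow: dict) -> str:
    """Return 'malicious' if any rule fires, otherwise 'benign'."""
    # Rule 1: a very high packet rate with tiny payloads suggests a flooding (DoS) flow.
    if flow["packets_per_second"] > 1000 and flow["mean_packet_size"] < 64:
        return "malicious"
    # Rule 2: one source touching many distinct destination ports suggests a port scan.
    if flow["distinct_dst_ports"] > 100:
        return "malicious"
    return "benign"

print(classify_flow({"packets_per_second": 2500,
                     "mean_packet_size": 40,
                     "distinct_dst_ports": 3}))   # prints: malicious

Rule-learning algorithms automate the discovery of such conditions instead of relying on hand-written
thresholds as in this sketch.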



both supervised and unsupervised techniques, e.g. clustering TABLE I
[?], [6], Bayesian Networks [7], ensemble of methods [?], [8], D ISTRIBUTION OF THE CICIDS2017 DATASET L ABELS
Support Vector Machines (SVM) [?], [9], Artificial Neural Labels Count (%)
Networks (ANN) [?], [10], Decision Trees [?], [11], and Benign 83.34
Random Forests [?], [12]. DoS Hulk 8.1
PortScan 5.6
NIDS researchers have used several well-known public DDoS 1.4
datasets over the years, such as: DARPA [13], KDD99 [14], DoS GoldenEye 0.3
KYOTO [15], NSL-KDD [16]. FTP-Patator 0.28
SSH-Patator 0.20
However, a large number of them are not only outdated DoS slowloris 0.20
but also unreliable to use because of the change in the nature DoS Slowhttptest 0.19
of the network traffic. The authors of [14] gives the insight Botnet 0.06
of HTTPS adoption in current network traffic and mention Web Attack Brute-Force 0.05
Web Attack XSS 0.02
about the growing https usage. To solve these problems, the Infiltration 0.001
The CICIDS-2017 public dataset was created considering the eleven criteria that are necessary for building a reliable benchmark dataset [17].

In this paper, we propose a two-layer hierarchical model of machine learning algorithms for NIDS and evaluate it using the CICIDS-2017 dataset. The attacks present in the dataset have been grouped into four major categories. Feature selection algorithms, namely Correlation Coefficient, Information Gain, and Recursive Feature Elimination, have been applied to identify the important features that characterise the attacks. The first layer identifies the traffic as benign or malicious; it is modelled using AdaBoost, Naive Bayes and Neural Networks, and genetic algorithms are used for the structural optimization of the neural networks. The second layer uses decision trees to find the category of the attack. The proposed hierarchical classification aims to improve the detection rate and achieve a lower false alarm rate.

The rest of this paper is organized as follows. Section II gives details of the CICIDS-2017 dataset, along with the types of attacks. Section III describes the design and implementation of the proposed NIDS architecture. Section IV presents the experimental results, and finally Section V concludes the paper.

II. DATASET DETAILS AND TYPES OF ATTACKS

The CICIDS-2017 dataset is a publicly available dataset consisting of eight separate comma-separated values (CSV) files. These CSV files list features extracted from data corresponding to seven days of network traffic from captured packet format (PCAP) files. The main advantage of the CICIDS-2017 dataset is that the traffic is labeled and contains realistic background traffic. The dataset has 14 different attack classes, which represent common attacks that take place in real life. The attack samples represent 16.65% of the complete dataset.

The dataset contains 84 common features, which are extracted using CICFlowMeter [18]. The feature set can be classified into two groups: one includes the attributes picked directly from the packet flow, and the other includes attributes derived by applying statistical analysis to the first group, e.g. the length of the packets exchanged between source and destination is analyzed to produce the features packet length mean and packet length std.

[TABLE I, Distribution of labels over the complete dataset: only the rows "Web Attack SQL Injection 0.0007" and "Heartbleed 0.0003" are visible here.]

TABLE II
CLASSIFICATION OF THE CICIDS2017 DATASET ATTRIBUTES

Attributes           | Count
Number of Packets    | 6
Size of Packet       | 16
Number of TCP Flags  | 12
Time between Packets | 17
Subflow Information  | 4
Active time          | 4
Idle time            | 4
Bulk Packets         | 6
Initial Window       | 2
Header length        | 3
Segment Size         | 3

TABLE III
GROUPING OF DIFFERENT ATTACKS

Attack group | Attacks
DoS          | DDoS, Slowloris, SlowHTTPTest, GoldenEye
Web          | Brute-Force, XSS, SQL Injection
Port scan    | Port scan
Patator      | SSH-Patator, FTP-Patator

Table I shows the distribution of the different labels over the complete dataset, while Table II classifies the dataset attributes into groups with their counts. From these tables, it is apparent that the data is highly imbalanced. Another observation is that a few attacks, such as Heartbleed, Infiltration, and SQL injection, occur very rarely.

Based on the similarity of the attacks that have a significant share in the dataset, we categorize them into four groups as shown in Table III. The motivation behind the grouping of attacks is to find a generic way of extracting features for an attack group rather than for individual attacks. The attack groups are briefly explained below.
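Before the individual attack types are described, the grouping of Table III can be read as a simple relabelling step. The following minimal sketch assumes the merged CICIDS-2017 frame exposes the class in a column named "Label" and that the label strings below match the ones in the CSV files; both names are assumptions made only for illustration.

import pandas as pd

# Sub-attack label -> attack group, following Table III (label strings assumed)
ATTACK_GROUPS = {
    "DDoS": "DoS", "DoS slowloris": "DoS", "DoS Slowhttptest": "DoS", "DoS GoldenEye": "DoS",
    "Web Attack Brute Force": "Web", "Web Attack XSS": "Web", "Web Attack Sql Injection": "Web",
    "PortScan": "Port scan",
    "SSH-Patator": "Patator", "FTP-Patator": "Patator",
}

def add_group_column(df: pd.DataFrame) -> pd.DataFrame:
    """Map every sub-attack label to its group; unmapped labels (e.g. BENIGN) are kept as-is."""
    df = df.copy()
    df["Group"] = df["Label"].map(ATTACK_GROUPS).fillna(df["Label"])
    return df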
A. Types of Attacks

1) Denial of Service: In a denial of service attack, the attacker exploits the connectivity of the victim machine to disturb the services offered by flooding the victim with many request packets. A DoS attack [19] can be either a single-source attack or a multi-source attack, where the latter is called a distributed denial of service (DDoS) attack.

2) Web Attacks: Websites often use database servers to store their records; databases also keep the various web applications in a persistent state. One of the attacks that compromise the security of back-end databases is SQL Injection. Cross-site scripting (XSS) causes malicious JavaScript code to be executed on the victim's machine [20].

3) Port Scan: A port scan attack [21] involves sending client requests to a range of server ports to find the active and inactive ports on the server. This attack can also reveal the services running on the victim site.

4) Patator: Patator-based attacks are popular brute-force attacks that are used for password guessing. Patator [22] is an open-source, multi-purpose command-line Python tool. The dataset contains packet flows used for brute-force SSH and FTP logins.

III. PROPOSED ARCHITECTURE

We propose a two-layer hierarchical architecture that covers the currently available attacks and is expandable to other attacks. Fig. 1 shows the proposed architecture in detail. The PCAP files (captured network packet files) are given as input to the CICFlowMeter application, which extracts 84 features from the PCAP files; it uses bi-directional flows to generate the features. The data consisting of 84 features are fed to the layer-1 model. The task of the layer-1 model is to predict whether a packet flow is malicious or benign; this does not reveal the kind of attack involved. If the layer-1 model predicts the packet flow as an attack, the flow is passed to layer-2 for further inspection. The layer-2 model predicts the category of attack it belongs to, i.e. DoS, Web Attacks, Patator or Portscan.

The first-layer task is modeled with different algorithms, namely AdaBoost, Naive Bayes and ANN, while in the second layer the attack category prediction is carried out by four separate decision trees. The proposed model is easily scalable, as the first layer differentiates benign traffic from malicious traffic and new attack categories can be classified by the second layer.

Fig. 1. Proposed Multi-Layer Hierarchical Network Intrusion Detection System. [Diagram: PCAP files → CICFlowMeter → Layer-1 binomial classifiers (AdaBoost, ANN optimized using Genetic Algorithms, NB) → "Is attack?" → if yes, Layer-2 multinomial classifier (DT) → attack category; otherwise benign.]

A. Data Pre-Processing

The features generated by the CICFlowMeter [18] application are mainly numeric, except for the flow identifiers. For our work, we merged all eight CSVs to form a single CSV file. The following modifications were made to the dataset before feature selection:

1) Seven features from the dataset were eliminated: Flow ID, Source IP, Destination IP, TimeStamp, Protocol, External IP. These features are connection identifiers and are eliminated so that the model generalizes to any connection identifier values.
2) All the different attack labels were replaced with a single attack label for the layer-1 dataset. The layer-2 dataset consists of benign data together with the respective attack group, where all the sub-attacks belonging to that group are replaced with the group name, e.g. the SSH-Patator and FTP-Patator labels are replaced with the label "Patator".
3) One-hot encoding was applied to Source Port and Destination Port, as they are categorical, and label encoding was applied to all class labels.
4) All the NaN values were replaced with zero.
5) Standardization was applied to the layer-1 dataset: it removes the mean and scales to unit variance, with centering and scaling done for every feature independently.
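The pre-processing steps listed above can be sketched roughly as follows. The file paths, column names (e.g. "Flow ID", "Source Port") and the treatment of infinite values are assumptions made for illustration; the exact header strings vary between releases of the dataset.

import glob
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

ID_COLUMNS = ["Flow ID", "Source IP", "Destination IP", "Timestamp", "Protocol", "External IP"]

frames = [pd.read_csv(path) for path in glob.glob("./csv/*.csv")]
data = pd.concat(frames, ignore_index=True)                              # merge the eight CSVs

data = data.drop(columns=[c for c in ID_COLUMNS if c in data.columns])   # drop identifiers
data = data.replace([np.inf, -np.inf], np.nan).fillna(0)                 # NaN values -> 0

# Layer-1 target: benign vs. attack (all attack labels collapsed into one class)
y_layer1 = (data["Label"] != "BENIGN").astype(int)

# Ports one-hot encoded, class labels label-encoded
X = pd.get_dummies(data.drop(columns=["Label"]), columns=["Source Port", "Destination Port"])
y_labels = LabelEncoder().fit_transform(data["Label"])

# Standardize every feature independently (zero mean, unit variance)
X_scaled = StandardScaler().fit_transform(X)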
1) Feature Selection: Feature selection is the process of choosing a subset of the total features to use for intrusion detection. It helps in removing redundant or unnecessary features, and has the additional advantage of dimensionality reduction. We use the following feature selection methods:

• Correlation Coefficient is a statistical measure that determines the relationship between two features [23]. The correlation coefficient of X and Y is defined as follows:

  r_{X,Y} = \frac{\sum_{i=1}^{N} x_i y_i - N\bar{X}\bar{Y}}{N\sigma_X\sigma_Y}

  If r_{X,Y} is close to 1, the features are highly correlated, which means that one feature contains the information of the other; one of the two is therefore redundant and can be removed. A threshold value of 0.1 is used for extracting the features. Table IV shows the list of features selected by the correlation coefficient.

• Information Gain uses an information-theoretic approach to find the features, based on their entropy [24]. The entropy value is higher when the attack distribution is more even, that is, when the data items are spread over more classes. Information gain measures the utility of each attribute in classifying the data items, expressed in terms of entropy. The entropy and information gain are given by the following formulas:

  E(D) = -\sum_{i=1}^{m} P_i \log_2 P_i

  E(D, A) = \sum_{i=1}^{v} \frac{|D_i|}{|D|} E(D_i)

  Gain(A) = E(D) - E(D, A)

  Table IV shows the list of features selected using information gain.

• Recursive Feature Elimination with ANOVA: Univariate feature selection with the ANOVA (Analysis of Variance) F-test [25] performs the feature scoring. It analyzes each feature individually to determine the strength of the relationship between the feature and the labels. On the scored features, recursive feature elimination [26] is applied, which recursively builds a model, sets the weakest feature aside and then repeats the process with the remaining features until all features in the dataset are exhausted. It uses the weights of a classifier to produce a feature ranking: the eliminated features are those with the lowest weights computed during training. We use ANOVA+RFE for both layer-1 and layer-2 modeling, as shown in Table IV and Table V.

TABLE IV
FEATURES SELECTION FOR LAYER-1

Selection Metric        | Selected Features
Correlation Coefficient | Packet_Length_Mean, min_seg_size_forward, Active_Mean, Active_Std, Active_Max, Active_Min, Idle_Mean, Idle_Std, Idle_Max
Information Gain        | Flow_IAT_Std, Idle_Max, Flow_IAT_Max, Fwd_IAT_Max, Idle_Min, Idle_Mean, Packet_Length_Std, Fwd_IAT_Total, Bwd_IAT_Std, Init_Win_bytes_forward
ANOVA + RFE             | Fwd_IAT_Std, Bwd_Packet_Length_Mean, Flow_IAT_Max, Fwd_IAT_Max, Idle_Max, Bwd_Packet_Length_Max, Idle_Min, Idle_Mean

TABLE V
FEATURES SELECTION FOR LAYER-2 USING ANOVA & RFE

Attack Group | Selected Features
DoS          | Source_Port, Destination_Port, Fwd_Packet_Length_Std, Bwd_Packet_Length_Std, Flow_Bytes.s, Flow_IAT_Max, Fwd_IAT_Total, Bwd_Packets.s
Web Attack   | Source_Port, Destination_Port, Total_Fwd_Packets, Flow_IAT_Mean, Flow_IAT_Std, Fwd_IAT_Min, Bwd_IAT_Min, Total_Length_of_Fwd_Packets
PortScan     | Source_Port, Destination_Port, Total_Length_of_Fwd_Packets, Flow_Bytes.s, Flow_IAT_Std, Bwd_IAT_Min, Fwd_Packets.s, Bwd_Packets.s
Patator      | Source_Port, Destination_Port, Flow_Bytes.s, Flow_Packets.s, Flow_IAT_Min, Fwd_IAT_Min, Bwd_Packets.s, Packet_Length_Std
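The three selection methods above can be sketched with scikit-learn as follows, reusing the X (feature frame) and y_layer1 (benign/attack) variables from the pre-processing sketch. The reading of the correlation criterion as correlation between each feature and the class, the use of mutual information as a stand-in for information gain, the decision-tree RFE estimator and the cut-off values are all assumptions for illustration, not settings stated in the paper.

import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif, RFE
from sklearn.tree import DecisionTreeClassifier

def correlated_with_target(X: pd.DataFrame, y, threshold: float = 0.1):
    """Keep features whose absolute correlation with the class exceeds the threshold."""
    scores = X.apply(lambda col: np.corrcoef(col, y)[0, 1]).abs()
    return scores[scores > threshold].index.tolist()

def top_by_information_gain(X: pd.DataFrame, y, k: int = 10):
    """Rank features by mutual information (information gain) and keep the top k."""
    gain = pd.Series(mutual_info_classif(X, y), index=X.columns)
    return gain.sort_values(ascending=False).head(k).index.tolist()

def anova_rfe(X: pd.DataFrame, y, prefilter_k: int = 30, keep: int = 8):
    """ANOVA F-test scoring followed by recursive feature elimination."""
    scorer = SelectKBest(score_func=f_classif, k=prefilter_k).fit(X, y)
    candidates = X.columns[scorer.get_support()]
    rfe = RFE(estimator=DecisionTreeClassifier(random_state=0),
              n_features_to_select=keep).fit(X[candidates], y)
    return candidates[rfe.support_].tolist()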
B. Classification Methods

1) Genetic Algorithms: Genetic Algorithms (GA) [27] are a heuristic global optimization technique based on the principles of biological evolution and natural selection. GAs simulate the evolution of living organisms, where the fittest individuals dominate over the weaker ones. In genetic algorithms, the search space is represented as a collection of individuals, which are referred to as chromosomes; the set of parameters specifying an individual is called a gene, and the part of the search space to be examined is called the population. The purpose of using a genetic algorithm is to find the dominating individual in the search space, evaluated with respect to an evaluation function called the fitness function. Genetic algorithms use random mutation, crossover and selection procedures to find the dominant individuals, and maintain a balance between exploration of the search space and exploitation of good solutions [28].

2) Artificial Neural Networks: A neural network is a set of interconnected nodes called neurons. Each node has a weighted connection to several other nodes in adjacent layers. Neural networks can learn from supervised or unsupervised training. The important components of training a neural network model include the activation function, the loss function, and the optimization algorithm. The activation function allows the neural network to learn non-linear complex functions. For the supervised model, the loss function calculates the error, i.e. the difference between the output and the target variable. Optimization algorithms are used to find the proper parameters (weights) of the model, and the backpropagation algorithm is used to update the weights of each neuron. The proposed method uses feed-forward neural networks trained to predict layer-1 intrusion
detection. A feed-forward neural network has an input layer, an output layer, and one or more hidden layers in between the input and output layers.

The structural optimization of the neural network is done using genetic algorithms. The optimization involves finding the optimal number of hidden layers, the number of neurons within each layer and the right activation function in order to maximize the performance of the neural network. Each individual represents a single neural network, and the hyperparameters, such as the activation function and the number of layers, are its genes. The genetic algorithm converges to an efficient architecture that produces better results after 10 generations with 20 individuals each. The neural network architecture found by the genetic algorithm consists of two hidden layers with 512 neurons in each layer.

The hidden layer activation function used is the sigmoid function [29]. It is a special case of the logistic function, defined by the formula Sigmoid(z) = \frac{1}{1+e^{-z}}. It is bounded and has a positive derivative at each point. The sigmoid function is the most commonly used activation because of its non-linearity and the computational simplicity of its derivative. Table VI shows the different hyperparameters used for modeling.

TABLE VI
HYPERPARAMETERS USED FOR LAYER-1 NEURAL NETWORK MODELS

Hyperparameter                   | Value
L2 regularization penalty        | 0.0001
Learning rate                    | 0.001
Optimizer                        | Adam
Loss function                    | Cross-entropy
Hidden layer activation function | Sigmoid
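A compact, illustrative rendering of this genetic search over the network structure is given below. The population size of 20 and the 10 generations follow the text; the candidate value grids, the mutation and crossover details, and the use of held-out validation accuracy as the fitness are assumptions, not the authors' exact procedure.

import random
from sklearn.neural_network import MLPClassifier

LAYERS, NEURONS, ACTIVATIONS = [1, 2, 3], [64, 128, 256, 512], ["logistic", "relu", "tanh"]

def random_individual():
    return {"layers": random.choice(LAYERS),
            "neurons": random.choice(NEURONS),
            "activation": random.choice(ACTIVATIONS)}

def fitness(ind, X_tr, y_tr, X_val, y_val):
    # Score an individual by training a small MLP and measuring validation accuracy
    net = MLPClassifier(hidden_layer_sizes=(ind["neurons"],) * ind["layers"],
                        activation=ind["activation"], solver="adam",
                        alpha=1e-4, learning_rate_init=1e-3, max_iter=50)
    return net.fit(X_tr, y_tr).score(X_val, y_val)

def evolve(X_tr, y_tr, X_val, y_val, generations=10, pop_size=20):
    population = [random_individual() for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population,
                        key=lambda ind: fitness(ind, X_tr, y_tr, X_val, y_val),
                        reverse=True)
        parents = ranked[: pop_size // 2]                       # selection
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = random.sample(parents, 2)
            child = {k: random.choice([a[k], b[k]]) for k in a}  # crossover
            if random.random() < 0.2:                            # mutation of one gene
                key = random.choice(list(child))
                child[key] = random_individual()[key]
            children.append(child)
        population = parents + children
    return max(population, key=lambda ind: fitness(ind, X_tr, y_tr, X_val, y_val))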
3) AdaBoost: AdaBoost is an algorithm for constructing a strong classifier as a linear combination of "weak" classifiers [30]. The AdaBoost algorithm corrects the instances misclassified by the weak classifiers, and it is less susceptible to overfitting than most learning algorithms. A group of weak classifiers has to be prepared as input to the AdaBoost algorithm; weak classifiers can be linear classifiers, ANNs or other common classifiers. For modeling, we select decision trees as the weak classifier due to their simplicity.

4) Naive Bayes: The naive Bayes model is based on the Bayes rule of probability theory [31]. Naive Bayes uses the probabilities of several related evidence variables: the probability of an end result is encoded in the model along with the probabilities of the evidence variables. The naive Bayes classifier operates on a strong independence assumption, which means that the probability of one attribute does not affect the probability of another.

5) Decision trees: Decision trees are a very popular approach for classification. Decision trees learn inductively to construct a model from a pre-classified data set. The technique is to select the features which best divide the data items into their classes. Induction of the decision tree uses the training data, which is described in terms of the attributes. To classify an attack, one starts at the root of the decision tree and follows the branch indicated by the outcome of each test until a leaf node is reached. The main problem here is deciding the attribute which will best partition the data into the various classes. There are many methods to construct decision trees, such as ID3 and C4.5 [32] and CART (Classification and Regression Trees) [33]. The ID3 algorithm works on the concept of information gain, while the C4.5 algorithm is an extension of ID3. C4.5 avoids overfitting the data when building the decision tree, can handle continuous attributes, is able to choose an appropriate attribute selection measure, handles training data with missing attribute values and improves computational efficiency. CART is a process of generating a binary tree for decision making [33]; it handles missing data and contains a pruning strategy.

IV. EXPERIMENTAL RESULTS

The proposed modeling setup was implemented using Python 2.7 and sklearn 0.19 [34], and executed on a Linux workstation with 32 GB of main memory and a 4-core, 3.3 GHz processor. The dataset was divided into an 80% training set and a 20% test set. Layer-1 uses MLPClassifier from scikit-learn, which implements a multi-layer perceptron (MLP) algorithm that trains using backpropagation. Similarly, the AdaBoost and Naive Bayes implementations use functions from the sklearn library; the base classifier used for AdaBoost is a decision tree. Layer-2 uses the decision tree classifier from scikit-learn, which implements a split algorithm very similar to C4.5, an extension of the popular ID3 algorithm.

Classification accuracy is not the sole appropriate parameter to measure the performance, since the training set consists of a large amount of benign data compared to malicious network traffic. Hence, we have also estimated precision, recall, F1-score and FAR, along with accuracy, in our experiments. The various classifiers mentioned in Section III-B were applied to the network traffic dataset containing both benign and malicious flows. To ensure that the classifier generalizes well to unseen data, the predictions were evaluated on the test dataset. Table VII shows the best values of accuracy, precision, recall, F1-score and FAR obtained from these classifiers on the test dataset for the first layer. The results were obtained for different feature selection algorithms across the various models. ANN shows the best accuracy compared to the other two classifiers: it correctly identified 98.74% of the malicious traffic in the test set, with a 3.50% false alarm rate. AdaBoost shows a lower false alarm rate than the neural networks.

Table VIII presents the results of the decision-tree-based learning in the layer-2 model. The accuracy, precision, recall and F1-score values are greater than 99%, and the FAR value is as low as 0.04% for portscan attacks. The important point to consider is that if an intrusion detection system produces a high number of false alarms, the detection system is not useful, because normal flows will be reported as malicious. From the above metrics, we can say that the layer-1 and layer-2 model results are promising and encouraging.
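A sketch of the two-layer training and evaluation flow, using the scikit-learn estimators named above, is shown below. The y_groups variable (attack-group labels from the pre-processing step), the restriction of layer-2 training to attack rows, and the specific hyperparameter values are assumptions made for illustration.

from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

def false_alarm_rate(y_true, y_pred):
    """FAR: benign flows reported as attacks, divided by all benign flows."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return fp / (fp + tn)

# 80/20 split, as in the experiments; y_groups (attack-group labels) is assumed
X_tr, X_te, y1_tr, y1_te, yg_tr, yg_te = train_test_split(
    X_scaled, y_layer1, y_groups, test_size=0.2, random_state=42)

# Layer-1: binomial classifiers (benign vs. attack)
layer1_models = {
    "ANN": MLPClassifier(hidden_layer_sizes=(512, 512), activation="logistic"),
    "AdaBoost": AdaBoostClassifier(DecisionTreeClassifier()),
    "NaiveBayes": GaussianNB(),
}
for name, clf in layer1_models.items():
    pred = clf.fit(X_tr, y1_tr).predict(X_te)
    print(name, accuracy_score(y1_te, pred), false_alarm_rate(y1_te, pred))

# Layer-2: a multinomial decision tree over the flows flagged as attacks
ann_pred = layer1_models["ANN"].predict(X_te)
layer2 = DecisionTreeClassifier().fit(X_tr[y1_tr == 1], yg_tr[y1_tr == 1])
group_pred = layer2.predict(X_te[ann_pred == 1])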
TABLE VII
RESULTS FOR LAYER-1 MODEL

Classifier  | Feature Selection       | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%) | FAR (%)
ANN         | Correlation Coefficient | 91.21        | 95.51         | 94.04      | 94.77        | 24.41
ANN         | Information Gain        | 94.72        | 96.88         | 96.78      | 96.83        | 15.72
ANN         | ANOVA + RFE             | 97.54        | 97.87         | 99.18      | 98.52        | 10.02
ANN         | All features            | 98.74        | 99.30         | 99.19      | 99.25        | 3.50
AdaBoost    | Correlation Coefficient | 93.21        | 97.82         | 94.25      | 96.00        | 13.44
AdaBoost    | Information Gain        | 93.12        | 97.60         | 94.34      | 95.94        | 14.52
AdaBoost    | ANOVA + RFE             | 95.99        | 97.85         | 97.35      | 97.60        | 11.01
AdaBoost    | All features            | 98.19        | 99.61         | 98.25      | 98.92        | 2.09
Naive Bayes | Correlation Coefficient | 17.21        | 0.70          | 88.71      | 1.42         | 83.28
Naive Bayes | Information Gain        | 84.03        | 93.50         | 88.07      | 90.71        | 47.12
Naive Bayes | ANOVA + RFE             | 83.59        | 91.95         | 88.78      | 90.34        | 49.53
Naive Bayes | All features            | 56.64        | 48.06         | 99.91      | 64.90        | 72.34

TABLE VIII
RESULTS FOR LAYER-2 MODEL

Sub Attack Category | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%) | FAR (%)
DoS                 | 99.82        | 99.90         | 99.89      | 99.90        | 0.10
Web Attacks         | 100          | 100           | 100        | 100          | 0.24
Portscan            | 100          | 100           | 100        | 100          | 0.04
Patator             | 99.90        | 99.99         | 100        | 100          | 0.07
V. CONCLUSION

In this paper, a multi-layer hierarchical network intrusion detection system was proposed which mainly consists of two layers: the first layer is used for distinguishing benign from malicious traffic using multiple binomial classifiers, including AdaBoost, Neural Networks and Naive Bayes, while in the second layer a multinomial decision tree classifier is used to identify the exact attack category from the potentially malicious traffic. Genetic algorithms were used in the first layer in order to perform neural network structure optimization. Moreover, Correlation Coefficient, Information Gain and Recursive Feature Elimination with ANOVA were utilized in order to find the best features for each attack category. In order to evaluate the proposed model, several combinations of feature selection algorithms and classifiers were applied. The experimental results showed that the NN classifier achieved the best accuracy, whereas the AdaBoost classifier achieved the lowest FAR value in the first layer. Conversely, the results in the second layer might indicate an overfitting issue, which we intend to investigate in future work.

REFERENCES

[1] N. Ye and Q. Chen, "An anomaly detection technique based on a chi-square statistic for detecting intrusions into information systems," vol. 17, pp. 105-112, 2001.
[2] D.-Y. Yeung and C. Chow, "Parzen-window network intrusion detectors," in Proc. 16th International Conference on Pattern Recognition, vol. 4. IEEE, 2002, pp. 385-388.
[3] M. Qin and K. Hwang, "Frequent episode rules for internet anomaly detection," in Proc. Third IEEE International Symposium on Network Computing and Applications (NCA 2004). IEEE, 2004, pp. 161-168.
[4] P. Porras, "STAT - a state transition analysis tool for intrusion detection," Santa Barbara, CA, USA, Tech. Rep., 1993.
[5] S. Salvador, P. Chan, and J. Brodie, "Learning states and rules for time series anomaly detection," in FLAIRS Conference, 2004, pp. 306-311.
[6] L. Portnoy, E. Eskin, and S. Stolfo, "Intrusion detection with unlabeled data using clustering," in Proc. ACM CSS Workshop on Data Mining Applied to Security (DMSA-2001), 2001, pp. 5-8.
[7] M. Panda and M. R. Patra, "Network intrusion detection using naive Bayes," International Journal of Computer Science and Network Security, vol. 7, no. 12, pp. 258-263, 2007.
[8] S. Mukkamala, A. H. Sung, and A. Abraham, "Intrusion detection using an ensemble of intelligent paradigms," Journal of Network and Computer Applications, vol. 28, no. 2, pp. 167-182, 2005. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S1084804504000049
[9] L. Khan, M. Awad, and B. Thuraisingham, "A new intrusion detection system using support vector machines and hierarchical clustering," The VLDB Journal, vol. 16, no. 4, pp. 507-521, Oct. 2007.
[10] H. Debar, M. Becker, and D. Siboni, "A neural network component for an intrusion detection system," in IEEE Symposium on Security and Privacy, 1992, pp. 240-250.
[11] G. Stein, B. Chen, A. S. Wu, and K. A. Hua, "Decision tree classifier for network intrusion detection with GA-based feature selection," in Proc. 43rd Annual Southeast Regional Conference, Volume 2. ACM, 2005, pp. 136-141.
[12] J. Zhang and M. Zulkernine, "A hybrid network intrusion detection technique using random forests," in First International Conference on Availability, Reliability and Security (ARES'06), 2006, 8 pp.-269.
[13] R. Lippmann, J. W. Haines, D. J. Fried, J. Korba, and K. Das, "The 1999 DARPA off-line intrusion detection evaluation," Computer Networks, vol. 34, no. 4, pp. 579-595, 2000.
[14] "KDD Cup 1999," http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html, 2018, accessed: Aug. 2018.
[15] J. Song et al., "Statistical analysis of honeypot data and building of Kyoto 2006+ dataset for NIDS evaluation," in Proc. First Workshop on Building Analysis Datasets and Gathering Experience Returns for Security (BADGERS '11). ACM, 2011, pp. 29-36.
[16] "NSL-KDD data set for network-based intrusion detection systems," http://nsl.cs.unb.ca/NSL-KDD/, 2018, accessed: Aug. 2018.
[17] I. Sharafaldin, A. H. Lashkari, and A. A. Ghorbani, "Toward generating a new intrusion detection dataset and intrusion traffic characterization," in ICISSP, 2018, pp. 108-116.
[18] A. H. Lashkari, G. Draper-Gil, M. S. I. Mamun, and A. A. Ghorbani, "Characterization of Tor traffic using time based features," in ICISSP, 2017, pp. 253-262.
[19] L. Garber, "Denial-of-service attacks rip the internet," Computer, no. 4, pp. 12-17, 2000.
[20] T.-S. Chou, "Security threats on cloud computing vulnerabilities," International Journal of Computer Science & Information Technology, vol. 5, no. 3, p. 79, 2013.
[21] C. B. Lee, C. Roedel, and E. Silenok, "Detection and characterization of port scan attacks," University of California, Department of Computer Science and Engineering, 2003.
[22] “Patator Ver 0.7 ,” https://github.com/lanjelot/patator, 2018, accessed:
Aug 2018.
[23] B. Ratner, “The correlation coefficient: Its values range between +1/-
1, or do they?” Journal of Targeting, Measurement and Analysis for
Marketing, vol. 17, no. 2, pp. 139–142, 2009.
[24] L. Yu and H. Liu, “Feature selection for high-dimensional data: A fast
correlation-based filter solution,” in Proceedings of the 20th interna-
tional conference on machine learning (ICML-03), 2003, pp. 856–863.
[25] M. Berenson, D. Levine, and M. Goldstein, “Intermediate statistical
methods and applications: A computer package approach. 1983.”
[26] Guyon et al., “Gene selection for cancer classification using support
vector machines,” Machine Learning, vol. 46, no. 1, pp. 389–422, 2002.
[27] J. H. Holland, “Genetic algorithms,” Scientific american, vol. 267, no. 1,
pp. 66–73, 1992.
[28] M. Gen and R. Cheng, Genetic algorithms and engineering optimization.
John Wiley & Sons, 2000, vol. 7.
[29] J. Han and C. Moraga, “The influence of the sigmoid function param-
eters on the speed of backpropagation learning,” in From Natural to
Artificial Neural Computation. Springer Berlin Heidelberg, 1995, pp.
195–201.
[30] Y. Freund, R. Schapire, and N. Abe, “A short introduction to boosting,”
Journal-Japanese Society For Artificial Intelligence, vol. 14, no. 771-
780, p. 1612, 1999.
[31] S. J. Russell and P. Norvig, Artificial intelligence: a modern approach
(International Edition). Pearson, 2002.
[32] J. Quinlan, “C4.5: Programs for machine learning,” 1993.
[33] L. Breiman, Classification and regression trees. Routledge, 2017.
[34] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion,
O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vander-
plas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duch-
esnay, “Scikit-learn: Machine learning in Python,” Journal of Machine
Learning Research, vol. 12, pp. 2825–2830, 2011.

Leader Election and Blockchain Algorithm in Cloud
Environment for E-Health
Basem Assiri
Faculty of CS & IT, Jazan University
Jazan, Saudi Arabia
babumussmar@jazanu.edu.sa

Abstract—The enhancement of e-health systems demands the adoption of computerized techniques, new algorithms and methods. To achieve better efficiency, the Electronic Personal Health Record (E-PHR) requires a new mode of storage such as cloud storage. Cloud storage is a supportive technique that provides better security, easy availability and accessibility of files. However, the availability of the E-PHR on the cloud allows parallel access to the corresponding files. In parallel and distributed computing, many users communicate and share resources to achieve a targeted goal; therefore, leader election is a major technique to maintain and coordinate parallelism. This research applies the leader election technique to E-PHRs in the cloud environment. It proposes an adaptive leader election algorithm (ALEA) that takes into account medical and healthcare specifications. The paper therefore incorporates the ideas of a Primary Leader, a Secondary Leader for emergencies, leader appointing and multiple tokens to allow parallel updates. ALEA limits the message passing for leader election and for token acquisition; in the regular case, it reduces the number of messages to zero. Moreover, the paper discusses the advantages and disadvantages of using Blockchain technology to implement ALEA.

Keywords—Distributed System, Leader Election, Cloud Storage, Electronic Personal Health Record, Blockchain

I. INTRODUCTION

During the last few decades, a remarkable development has been noticed in the field of computational applications, Internet tools and technologies, which have become part of many fields of people's daily life. These overlaps create new fields or sub-fields such as e-governance, e-learning, e-banking, e-health and many more. This research focuses on e-health, where computerized technology is used in the field of healthcare. Presently, healthcare organizations have incorporated many new sophisticated techniques to make their systems more advantageous and modern. This is reflected in service provision, stakeholder satisfaction, cost reduction and relief from managerial burden. Healthcare organizations compete in providing better services to their stakeholders, where the cost of services can be time, effort, physical space and infrastructure. The use of technology also facilitates the outsourcing of some services to reduce the cost and effort of management, maintenance, risk handling, and new technology adoption.

In this field, one of the major technologies is the E-PHR. It is a digital version of the PHR that enables patient information to be accessed and exchanged electronically [1]. It provides data accessibility, availability, privacy, security, completeness and consistency (which means having accurate and up-to-date data). It helps in monitoring, controlling, and better communication and coordination, and it decreases not only the costs but also the risks of handling physical healthcare records [1, 2, 3]. It is known that when hurricane Katrina hit the city of New Orleans in the USA in 2005, the flood destroyed the healthcare records of thousands of people. Many people left the city, medical treatments were necessary, but the doctors of the respective hospitals could not access the health history of their patients [4, 5].

In addition, the use of E-PHR technology requires supportive infrastructure, software, hardware and resources. Storage is one of the vital aspects to be considered. Indeed, cloud storage is one of the advanced technologies that facilitate the access of distributed resources. It provides servers to store files in and to access them wherever the Internet is available. Cloud storage technology is space effective, secure, maintainable and accessible, which makes it suitable for the E-PHR [6, 7].

With the use of the E-PHR and cloud storage, many devices process files in parallel. This requires specific control and coordination over the shared files to guarantee data correctness and consistency. Therefore, leader and leader election techniques are usable to control and maintain data consistency on shared E-PHRs. Control means deciding who may create, access, copy, move, edit and delete the E-PHR. Data consistency means having the expected result after each action or process on the data; in other words, the output of each process on the data is predictable [8]. Parallel access to an E-PHR may cause conflicts. There is no conflict when many users access the same E-PHR simultaneously for reading. However, when one or more of them try to update a file, some of them may read outdated data, or the update of one may contrast with the updates of others, which is a conflict. For example, two doctors may access the same E-PHR in parallel and one doctor updates the patient's blood pressure record, while the other may still be considering the non-updated blood pressure reading and has no idea about the change. Leader election provides exclusive access (known as a token) to control and keep the data consistent.

Furthermore, Blockchain is one of the new promising technologies that support the implementation of decentralized distributed systems. A Blockchain is a distributed public ledger that keeps records, transactions or any digital processes [9]. It includes a cluster of nodes that share the same data, propose some processes on the data, and verify the execution of the processes through consensus. Nakamoto exploited the idea of the Blockchain to introduce the first cryptocurrency, known as Bitcoin. Using Bitcoin, users have a peer-to-peer electronic financial system, where they can exchange money without the need for a third party [9]. After the success of Bitcoin, many other cryptocurrencies have been introduced and used for other financial services such as trading and insurance [10, 11]. The idea of Bitcoin inspires people to extend the use of the Blockchain to many other fields, such as judiciary, notary, rights, ownership, healthcare, and educational services [11].

This paper proposes and investigates an adaptive leader election algorithm (ALEA) suited to the use of E-PHRs. It
introduces the idea of a primary leader and a secondary leader, and of having multiple tokens. It shows how to handle the failure of leaders with a limited number of messages. Besides the technical specifications of the leader election, the procedural aspects and administrative rules of the medical environment are considered. The paper also discusses the advantages and disadvantages of using a Blockchain architecture in combination with ALEA.

II. RELATED WORK

Many works propose methodologies (strict or relaxed) to maintain the consistency of cloud storage [12, 13]. Coppieters et al. provide a strict consistency algorithm, where they order all concurrent processes on all replicas. In fact, in a sequential execution it is easy to argue about consistency, since a process accesses the file when the other finishes; thus, there must be a matching between the order of the concurrent execution and the order of a correct sequential execution (which is called serializable) [14]. Zellag and Kemme show that the relaxation of consistency for the cloud results in approximated output, and the influence of such relaxation has an insignificant effect on cloud systems [15].

Another approach to sustain consistency is to choose one user as a leader. The leader controls and coordinates the tasks among all other users to achieve the targeted goal. When users detect a failure of the leader, they elect a new one using leader election algorithms [8, 16]. In the bully algorithm [8, 17], the complexity of electing a new leader is O(n²) messages, which is very expensive. In the token ring algorithm [18], the complexity of electing a new leader is O(n) messages. Numan et al. propose an algorithm that uses a shared centralized queue of all users: the leader is the head of the queue, and when it fails, another user dequeues the old head. The complexity of this approach is O(1) [19].

Furthermore, currently the E-PHR is managed through hospitals or healthcare agencies (a third party). However, the use of Blockchain technology helps to achieve fully decentralized management of the E-PHR. Blockchain also supports the availability, robustness and security of the E-PHR, and all related financial and administrative operations [9, 10, 20].

III. PROPOSED SYSTEM MODEL

ALEA is built and designed in pursuance of the Well-Organized Bully Leader Election algorithm [19], which uses a linked-list queue to reduce the complexity of the leader election process to O(1). For more efficiency and adaptability, we modify the algorithm significantly to make it applicable and compatible with the healthcare (medical) procedures and specifications.

ALEA creates a queue Q with size Size that shows the total number of nodes, where a node is denoted as Node. A node represents a processor/doctor; each doctor is represented with a unique identifier doctor_Id, a pointer to the next node, and a token flag. Upon inserting a new node into the queue, the doctor can read the E-PHR but cannot update the E-PHR unless the token equals 1. The head of the queue is the leader, termed PLeader. If there is any requirement to change the leader, the head node is dequeued and the PLeader pointer moves to the next node. A shared memory is used to store the queue so that every node (user) is able to see the updated information.

In some cases, such as an emergency or a transfer, a temporary queue TempQ is created and a secondary leader called SLeader is appointed.

The Q linked-list is shown in Figure 1, where the main queue list has three doctors and PLeader points to the head of the queue. Another linked-list queue, TempQ, appears in the emergency block with SLeader. Practically, this queue does not usually exist.

Fig. 1. Leader election queue using a linked-list, showing the PLeader pointing to the head node; another linked-list queue appears in the emergency block with SLeader.

The TokenPointer shows who is holding the token for exclusive access to update a file. Section V (C) shows how to have more than one token.

IV. PROPOSED ALGORITHM

In the beginning, an E-PHR is created for a patient, and the hospital appoints a leader, where the leader is the primary doctor PLeader of the patient, as shown in Algorithm 1. For every doctor, a new node is created that contains three things: (i) data, where data is the unique doctor_Id; (ii) a pointer to the next node; and (iii) a token flag with the value of either 0 or 1. Upon inserting a new node into the queue, the doctor can read the E-PHR (when the token value is 0), but for update permission the token has to change to 1. The TokenPointer is another pointer pointing to the node that holds the token (token = 1). Then the queue size is increased.

When PLeader is inaccessible for some reason (except in the failure situation), a temporary queue TempQ is created that is led by SLeader, as shown in Algorithm 2. On creating a new node, the size of the queue is increased. The procedures in Algorithm 2 are almost similar to Algorithm 1.

Algorithm 3 shows how to add a new node (when the PLeader wants to add a new doctor to the doctors' team). It creates a new node, enqueues it to Q, and increases the size of the queue. The same procedure is applicable for TempQ.
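A minimal Python sketch of the queue structure just described is given below. Only the node fields (doctor_Id, next, token) and the PLeader/TokenPointer roles come from the paper; the class and method names are illustrative assumptions.

class DoctorNode:
    def __init__(self, doctor_id: str):
        self.data = doctor_id   # unique doctor_Id
        self.next = None        # pointer to the next node in the queue
        self.token = 0          # 0 = read-only access, 1 = update permission

class EPHRQueue:
    def __init__(self, primary_doctor_id: str):
        # Initialization(): the primary doctor becomes PLeader and gets the token
        node = DoctorNode(primary_doctor_id)
        node.token = 1
        self.pleader = node          # head of the queue = leader
        self.tail = node
        self.token_pointer = node    # node currently holding the token
        self.size = 1

    def add_doctor(self, doctor_id: str) -> None:
        """AddDoctor(): enqueue a new doctor with read-only access."""
        node = DoctorNode(doctor_id)
        self.tail.next = node
        self.tail = node
        self.size += 1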
In some cases, the team or the leader decides to modify the priority of the doctors who access the E-PHR. Then the positions of the nodes in Q have to be rearranged (swapping), as shown in Algorithm 4. After the insertion of the doctors' Ids, TempPointer1 starts from the head position, checks the doctor_Id, and keeps shifting until it finds the first doctor. After that, TempPointer2 continues and keeps shifting until it finds the other doctor. Finally, the algorithm swaps them by inserting doctor_Id2 in the node of TempPointer1 and doctor_Id1 in the node of TempPointer2.

PLeader may retire from leadership while remaining a member of the doctors' team of the E-PHR (Algorithm 5). If the doctor is the only node in Q, the retirement is not allowed. Otherwise, TPointer is used to point to the PLeader node and the PLeader pointer is moved to the next node in Q. If TPointer holds the token, the token is moved to the next node by resetting the token to 0, moving the TokenPointer to the next node, and setting it to 1. Finally, the TPointer node is dequeued and enqueued again at the other end of Q. The same procedure works for TempQ.

Algorithm 6 represents the situation when a doctor will not access the E-PHR anymore (clearness). If the doctor is the only one who handles the E-PHR, the clearness is not allowed; otherwise, the node is removed just as in Algorithm 5. On the other hand, for TempQ all doctors can perform clearness even if theirs is the only node in the queue.

Algorithm 7 explains how to exercise leadership and move the token from one node to another. The PLeader finds the required node using doctor_Id and activates the token by changing it to 1, or deactivates it by resetting it back to 0. In addition, the PLeader can activate the token for more than one node at the same time, which allows parallelism to be applied. This situation arises when no work dependencies exist among the nodes. It is explained in more detail in Section V (C).

In Algorithm 8, when any doctor needs to get the token, it sends an acquiring message to the PLeader. Then it has to wait for a specific time Timeout (it gets the current time, adds the Timeout, stores the new time in T and waits until the current time becomes T). It waits until either it receives an acknowledgement message (reply message) from PLeader or the Timeout finishes. When the timeout finishes without receiving the acknowledgement message, the leader has failed or crashed and the node calls Failure(). Since many nodes may discover the failure of the leader at the same time, every node copies and passes the doctor_Id of the failed leader in ID (more details are explained in Algorithm 9).

Algorithm 9 illustrates the state of failure of a leader. Upon the discovery of a leader failure or crash, the detector node calls Failure() and passes ID, which is a local copy of the doctor_Id of PLeader. In Failure(), the PLeader pointer is moved to the next node and the failed node is dequeued. However, if more than one node detects the failure, all of them call Failure(), which would result in multiple unnecessary dequeues. Therefore, a Compare-and-Swap statement CAS must be used, which is an atomic operation that allows only one node to change the leader. Using CAS, one detector checks whether the PLeader is still in failure (i.e. whether the doctor_Id of PLeader still equals ID) and, if so, calls Clearness(). In Clearness(), the failed leader is dequeued and another leader is appointed. Thus, the other detectors (who apply CAS) will find the doctor_Id of the new PLeader, which is not equal to their local ID, so they have nothing to do.
Algorithm 1
1.  ║ Initialization():
2.    // Upon creating the E-PHR
3.    // Create the queue linked-list
4.    Size = 0
5.    Node = new_node()
6.    Node→data = doctor_Id
7.    Node→next = NULL
8.    Node→token = 0
9.    PLeader ← Node
10.   TokenPointer = PLeader
11.   TokenPointer→token = 1
12.   Size++
13.   return

Algorithm 2
14.  ║ Emergency():
15.    // To add a new doctor as SLeader
16.    // Create a temporary queue TempQ
17.    Node = new_node()
18.    Node→data = doctor_Id
19.    Node→next = NULL
20.    Node→token = 0
21.    SLeader ← Node
22.    TokenPointer1 = SLeader
23.    TokenPointer1→token = 1
24.    Size++
25.    return

Algorithm 3
26.  ║ AddDoctor():
27.    // To add a new doctor to Q
28.    Node = new_node()
29.    Node→data = doctor_Id
30.    Node→next = NULL
31.    Node→token = 0
32.    Q ← enqueue()
33.    Size++
34.    return

Algorithm 4
35.  ║ SwapDoctors(doctor_Id1, doctor_Id2):
36.    // To change the positions of doctors
37.    TempPointer1 = PLeader
38.    TempPointer2
39.    While i = 1 to size do
40.      If (TempPointer1→data != doctor_Id1)
41.        TempPointer1 = TempPointer1→next
42.      Else
43.        // First doctor is found, now find the other
44.        TempPointer2 = TempPointer1→next
45.        Break
46.    End While
47.    While i ≤ size do
48.      If (TempPointer2→data != doctor_Id2)
49.        TempPointer2 = TempPointer2→next
50.      Else
51.        // Second doctor is also found, now swap
52.        TempPointer1→data = doctor_Id2
53.        TempPointer1→token = 0
54.        TempPointer2→data = doctor_Id1
55.        TempPointer2→token = 0
56.        Break
57.    End While
58.    return

Algorithm 5
59.  ║ Retirement():
60.    // To retire from leadership
61.    If (PLeader→next = NULL)
62.      return False
63.    Else
64.      TPointer = PLeader
65.      PLeader = PLeader→next
66.      If (TokenPointer = TPointer)
67.        TokenPointer→token = 0
68.        TokenPointer = TokenPointer→next
69.        TokenPointer→token = 1
70.      TPointer.dequeue()
71.      TPointer.enqueue()
72.    return

Algorithm 6
73.  ║ Clearness():
74.    // To free the patient completely
75.    If (PLeader→next = NULL)
76.      return False
77.    Else
78.      TPointer = PLeader
79.      PLeader = PLeader→next
80.      If (TokenPointer = TPointer)
81.        TokenPointer→token = 0
82.        TokenPointer = TokenPointer→next
83.        TokenPointer→token = 1
84.      TPointer.dequeue()
85.    return

Algorithm 7
86.  ║ Leadership(doctor_Id1):
87.    // When the leader moves the token
88.    // First release the current token
89.    TokenPointer→token = 0
90.    // Now find the node that will get the token
91.    While i = 1 to size do
92.      If (TokenPointer→data != doctor_Id1)
93.        TokenPointer = TokenPointer→next
94.      Else
95.        TokenPointer→token = 1
96.        Break
97.    End While
98.    return

Algorithm 8
99.  ║ Reminder():
100.   // A doctor reminds the leader to get the token
101.   Send_msg(PLeader, "Acquire token")
102.   // Wait for some time (Timeout)
103.   T = CurTime() + Timeout
104.   While ((receive_ack() = false) && (CurTime() < T)) do
105.     Wait()
106.   End While
107.   // If there is no response, then PLeader has failed
108.   // Otherwise it is alive and nothing needs to be done
109.   If (receive_ack() = false)
110.     ID = PLeader→data
111.     Failure(ID)
112.   return

Algorithm 9
113. ║ Failure(ID):
114.   // If the leader is still in failure or crashed
115.   CAS(PLeader→data, ID, Clearness())
116.   return
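The timeout-and-CAS logic of Algorithms 8 and 9 can be rendered in Python roughly as follows. The atomic compare-and-swap is emulated with a lock, and the queue object, its clearness() method, and the send_msg/receive_ack callbacks are assumed helpers (the queue sketch given earlier does not define clearness()); this is only an illustration of the control flow, not the authors' implementation.

import time
import threading

cas_lock = threading.Lock()

def compare_and_swap(queue, expected_leader_id, on_success):
    """Only the first detector whose expectation still holds replaces the leader."""
    with cas_lock:
        if queue.pleader.data == expected_leader_id:
            on_success()

def reminder(queue, send_msg, receive_ack, timeout=5.0):
    """Algorithm 8: ask PLeader for the token and detect its failure on timeout."""
    send_msg(queue.pleader, "Acquire token")
    deadline = time.time() + timeout
    while not receive_ack() and time.time() < deadline:
        time.sleep(0.1)                      # Wait()
    if not receive_ack():
        failed_id = queue.pleader.data       # local copy of the failed leader's ID
        failure(queue, failed_id)

def failure(queue, failed_id):
    """Algorithm 9: dequeue the failed leader and promote the next node exactly once."""
    compare_and_swap(queue, failed_id, lambda: queue.clearness())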
83. TokenPointertoken = 0 represented into two instantaneous events, which are begin
84. TPointer.dequeue() and end. Then, order the concurrent operations in a way
85. return that matches a correct sequential execution; this is known
as Linearizability [22]. Linearizability respects the real-
Algorithm 7 time order of the concurrent execution. Therefore, the
86. ║ Leadership (doctor_Id1): synchronization of events must follow a well-form clock.
87. //When the leader moves the token However, since doctors live in different time zoon and
88. //First get the token accessing files remotely, also patients travel to different
89. TokenPointertoken=0 places; the physical clock is difficult to be used except if
the whole world uses one time zone such as Greenwich
90. //Now find the node that will get the token
Time. Otherwise, it is preferred to use a logical clock to
91. While i=1 to size do
order events such as Lamport's logical clock [8, 22]. In
92. If (TokenPointerdata != doctor_Id1) ALEA, the operations that happen in different processors
93. TokenPointer1 = TokenPointer1 next are ordered since they use a single version of the E-PHR
94. Else (extra versions only for recovery) and any update must use
95. TokenPointer1token=1 token to take places. Thus, the order of the operations
96. Break follows the token movements.
97. End While
98. return C. Parallel Access of E-PHR
To use E-PHR in parallel and avoid all kinds of conflict,
Algorithm 8
read operations accesses the file without acquiring the token,
99. ║ Reminder ():
while the update operations have to acquire the token to
100. //Doctor reminds leader to get the token
execute. The token is implemented as a file lock Lock().
101. Send_msg(PLeader, "Acquire token")
However, the access of the read operation may be denied, if
102. //Wait for some time (Timeout)
the file is locked by an update operation.
103. T = CurTime() + Timeout
104. While ((receive_ack() = false) &&

11
To enhance parallelism, the E-PHR file is divided into multiple sections, so that doctors are able to access different sections in parallel. Every section is a range of bytes with a corresponding lock. This means that there are multiple locks for the same file, and every doctor should specify the required section to access.

Accordingly, ALEA is modified in such a way that there is a set of locks, whose number is denoted as k, where k is an integer (the number of locks equals the number of sections). Every doctor who acquires the token determines the required section s, where s is an integer from 1 to k; in some cases, the leader decides s for every doctor. ALEA is then modified such that the initial value of the token is 0, which implies that there is no access for updates. When an update operation is required, the token value changes to a value s based on the respective (required) section. In addition, to lock the whole file the token value should be k+1. So in ALEA, TokenPointer is replaced by a two-dimensional array of pointers, from which the leader is able to identify the doctors who hold the tokens and their respective section numbers. Clearly, the two-dimensional array has k rows and two columns, one for the doctor_Id and the other showing the value of s.
D. Traffic Flow

As mentioned earlier, the concurrent access of a critical shared resource causes conflicts that result in an incorrect view of the data. To cope with this issue, message passing is required to elect a leader and to move the token.

Firstly, the leader is elected through passing messages among all nodes. As mentioned in Section II, the complexity of the centralized leader election algorithms reaches n² messages [16, 17]. The decentralized leader election algorithms enable more than one leader, and the decisions are based on the votes of the majority [16, 17]; this type of permission requires approximately n messages. However, using the Well-Organized Bully Leader Election algorithm [19] and ALEA, the number of messages is reduced to zero, because the leader election is conducted by maintaining a shared queue linked-list, so the leader is elected without traffic.

Secondly, there are some other kinds of messages to move the token among nodes. In many cases, the token does not necessarily have to be held by the leader. In ALEA, the token is a flag that exists in each node, and the leader sets it to 1 for lock acquisition and back to 0 for lock release. In the normal case, the leader moves and sets the token with no messages. In rare cases, a node (doctor) for some reason insists on getting the token, so it sends a message to acquire the token (Algorithm 8); in this situation, the respective node receives an acknowledgement message from the leader. This phenomenon occurs rarely and does not create any traffic problem in the system.

E. Fairness and Starvation (timeout)

Regarding fairness, there are two dimensions: one is related to the leader election and the other is related to the token.

Firstly, in the situation of fair leader election, the hospital or the patient decides the primary doctor from the medical point of view. All doctors are enlisted (enqueued) in the queue linked-list from one end, while the leadership transfers from the other end with respect to the enqueuing order (First-In-First-Leader). However, in line with the special conditions of healthcare and medical treatment, the leader may decide to change the order of the nodes in the queue. This shrinks the fairness from the technical perspective (Approximate-First-In-First-Leader); however, it is fair from the medical and humanitarian perspective, which is the main concern of the algorithm.

Second, the leader passes the token from one doctor to another, which is completely fair from the medical treatment perspective. Moreover, there is no chance of starvation (when a doctor may wait forever to become a leader or to get the token). Starvation has an alternative meaning in a medical system: in some cases, a doctor is enqueued to access the file for a specific task but has no need to become a leader, so there is no starvation.

However, there are some basic conventional rules and regulations in a medical system that tell when a doctor should get the token or wait. In the case of an emergency, the emergency department gives the access permission over the E-PHR to a secondary leader (there is no starvation). In addition to the previously mentioned scenario, if the leader fails or crashes, a new leader is elected and the token is passed as usual.

VI. USING BLOCKCHAIN

As mentioned previously, the Blockchain is a technology for maintaining an electronic ledger that is built based on the consensus of a cluster of nodes [9]. It is used for financial transactions and it can be extended to many other areas [9, 10]. Indeed, the Blockchain is a technology with many algorithms. Generally, Blockchain algorithms have three stages: (i) one node broadcasts a proposal to the other nodes; (ii) the nodes vote on the correctness of the proposal; (iii) according to the consensus of the votes, the proposal commits or aborts.

To implement ALEA using Blockchain technology, the process of queue creation in Algorithm 1 and Algorithm 2 will be conducted through a consensus decision. In this way, the Blockchain helps to avoid the need for a third party such as hospitals. In addition, the functions maintaining the queue list will go through making a proposal, voting and then taking a decision. This is applicable to functions such as adding a doctor to the queue list (Algorithm 3), swapping doctors (Algorithm 4), or removing a doctor from the queue list (Algorithms 5 and 6). The same procedure is used to manage token acquisition decisions (Algorithms 7, 8 and 9).
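The three-stage flow just described (propose, vote, commit or abort by majority) can be made concrete with the following toy sketch. The node model and the vote predicate are assumptions chosen only to illustrate the control flow; this is not a real Blockchain implementation.

from typing import Callable, List

def run_proposal(proposal: dict, nodes: List[Callable[[dict], bool]]) -> bool:
    """Broadcast the proposal, collect votes, commit only on majority consensus."""
    votes = [node(proposal) for node in nodes]          # (i) broadcast + (ii) vote
    committed = sum(votes) > len(nodes) / 2             # (iii) majority decision
    return committed

# Example: every node checks that the doctor being added is not already enqueued
queue_ids = {"dr-001"}
nodes = [lambda p: p["doctor_id"] not in queue_ids for _ in range(5)]
print(run_proposal({"op": "AddDoctor", "doctor_id": "dr-042"}, nodes))  # True -> commit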
VII. DISCUSSION ON THE USE OF BLOCKCHAIN

The combination of the Blockchain technology with ALEA results in several positive and negative consequences. Thus, there are many issues to discuss, such as decentralization, robustness, availability, ownership protection, security, privacy, computational cost and traffic flow [9, 10, 22].

1) Decentralization: decentralization makes it possible to avoid the permission of the hospital to access the E-PHR, to assign the leadership or to create an emergency linked-list. The decentralization of the Blockchain removes the need for a third party, since the decision is taken through the consensus of the cluster nodes. It gives patients full access to their E-PHR. However, decentralization must be controlled in a strict manner to avoid delay and trust issues.

2) Robustness: the use of the Blockchain does not allow a single point of failure, such that when some nodes fail, the others continue the work. For example, there may be an emergency case while the hospital has a technical issue in giving access to the E-PHR; then, the majority of nodes do the work and the system keeps running robustly.

3) Availability: in Blockchain technology, every node has a complete copy of the files. This provides a high level of availability but increases the number of replications. The large number of redundant replications can be considered a negative point, since it increases the cost of space, communication and file updating processes.

4) Ownership: in regular systems, the owner of the file, hospitals, healthcare agencies and some leaders have the privilege to change the ownership of some files. However, with the Blockchain only the owner of the file is able to change the ownership, which is a negative point that contrasts with the specifications of this model. For example, if the patient is unconscious or has died, then the ownership change should follow another procedure.

5) Security and privacy: with the Blockchain the users' identities are hidden; their files, transactions and processes are encrypted, which is positive from the data sensitivity perspective. However, the nodes are untraceable, which is not acceptable for healthcare systems, for example when some doctors take suspicious or illegal decisions.

6) Immutability: using the Blockchain, committed transactions are unchangeable. Therefore, users are able only to create and read files, whereas in healthcare systems users need to create, read, update and delete files.

7) Performance cost: the Blockchain technology has negative impacts on computational and communication costs. First, the computation must be executed on more than one node (the validators). Second, the proposer proposes a transaction by broadcasting it to all nodes, say n messages; then, the nodes check the correctness of the proposal and send votes to all the others, which costs n² messages; after that, based on the votes, a commit or abort message is broadcast from the n nodes to all the others. Thus, the use of the Blockchain negatively affects the speed of the system and its traffic flow.

VIII. CONCLUSION

This paper proposes an adaptive algorithm for cloud-based E-PHRs, so that it can be easily used with a minimal infrastructure. ALEA enhances parallelism using an alternative leader election technique that is suitable for healthcare systems. The paper analyzes and investigates the performance of the proposed algorithm to clarify the advantages of ALEA compared to the existing ones. It also shows that the use of the Blockchain to implement ALEA has many negative impacts.

REFERENCES

[1] P. C. Tang et al., "Personal health records: definitions, benefits, and strategies for overcoming barriers to adoption," Journal of the American Medical Informatics Association, vol. 13, no. 2, pp. 121-126, 2006.
[2] S. Davis, A. Roudsari, and K. L. Courtney, "Designing personal health record technology for shared decision making," Studies in Health Technology and Informatics, vol. 234, pp. 75-80, 2017.
[3] J. Woollen et al., "Patient experiences using an inpatient personal health record," Applied Clinical Informatics, vol. 7, no. 2, pp. 446-460, 2016.
[4] A. Sherman and I. Shapiro, "Essential facts about the victims of Hurricane Katrina," Center on Budget and Policy Priorities, vol. 1, p. 16, 2005.
[5] S. S. Taylor and J. M. Ehrenfeld, "Electronic health records and preparedness: lessons from Hurricanes Katrina and Harvey," Journal of Medical Systems, vol. 41, no. 11, p. 173, 2017.
[6] A. M.-H. Kuo, "Opportunities and challenges of cloud computing to improve health care services," Journal of Medical Internet Research, vol. 13, no. 3, 2011.
[7] H. T. Dinh et al., "A survey of mobile cloud computing: architecture, applications, and approaches," Wireless Communications and Mobile Computing, vol. 13, no. 18, pp. 1587-1611, 2013.
[8] A. S. Tanenbaum and M. Van Steen, Distributed Systems: Principles and Paradigms. Prentice-Hall, 2007.
[9] S. Nakamoto, "Bitcoin: A peer-to-peer electronic cash system," 2008.
[10] T.-T. Kuo, H.-E. Kim, and L. Ohno-Machado, "Blockchain distributed ledger technologies for biomedical and health care applications," Journal of the American Medical Informatics Association, vol. 24, no. 6, pp. 1211-1220, 2017.
[11] M. Crosby et al., "Blockchain technology: Beyond Bitcoin," Applied Innovation, vol. 2, no. 6-10, p. 71, 2016.
[12] I. A. T. Hashem et al., "The rise of 'big data' on cloud computing: Review and open research issues," Information Systems, vol. 47, pp. 98-115, 2015.
[13] D. Agrawal, S. Das, and A. El Abbadi, "Big data and cloud computing: current state and future opportunities," in Proceedings of the 14th International Conference on Extending Database Technology. ACM, 2011.
[14] T. Coppieters, W. De Meuter, and S. Burckhardt, "Serializable eventual consistency: consistency through object method replay," in Proceedings of the 2nd Workshop on the Principles and Practice of Consistency for Distributed Data. ACM, 2016.
[15] K. Zellag and B. Kemme, "How consistent is your cloud application?" in Proceedings of the Third ACM Symposium on Cloud Computing. ACM, 2012.
[16] G. Tel, Introduction to Distributed Algorithms. Cambridge University Press, 2000.
[17] G. F. Coulouris, J. Dollimore, and T. Kindberg, Distributed Systems: Concepts and Design. Pearson Education, 2005.
[18] P. B. Soundarabai et al., "Message efficient ring leader election in distributed systems," in Computer Networks & Communications (NetCom). Springer, New York, NY, 2013, pp. 835-843.
[19] M. Numan et al., "Well-organized bully leader election algorithm for distributed systems," in 2018 International Conference on Radar, Antenna, Microwave, Electronics, and Telecommunications (ICRAMET). IEEE, 2019.
[20] M. Pilkington, "Blockchain technology: principles and applications," in Research Handbook on Digital Transformations, 2016.
[21] M. M. Michael and M. L. Scott, "Simple, fast, and practical non-blocking and blocking concurrent queue algorithms," Tech. Rep. TR-600, University of Rochester, Department of Computer Science, 1995.
[22] M. Herlihy and N. Shavit, The Art of Multiprocessor Programming. Morgan Kaufmann, 2011.
Automotive Cybersecurity: Foundations for
Next-Generation Vehicles
Michele Scalas, Student Member, IEEE
Department of Electrical and Electronic Engineering
University of Cagliari, Cagliari, Italy
michele.scalas@unica.it

Giorgio Giacinto, Senior Member, IEEE
Department of Electrical and Electronic Engineering
University of Cagliari, Cagliari, Italy
giacinto@unica.it
Abstract—The automotive industry is experiencing a serious transformation due to a digitalisation process and the transition to the new paradigm of Mobility-as-a-Service. The next-generation vehicles are going to be very complex cyber-physical systems, whose design must be reinvented to fulfil the increasing demand for smart services, both for safety and entertainment purposes, causing the manufacturers' model to converge towards that of IT companies. Connected cars and autonomous driving are the preeminent factors that drive along this route, and they cause the necessity of a new design to address the emerging cybersecurity issues: the "old" automotive architecture relied on a single closed network, with no external communications; modern vehicles, instead, are going to be always connected, which means the attack surface will be much more extended. The result is the need for a paradigm shift towards a secure-by-design approach.
In this paper, we propose a systematisation of knowledge about the core cybersecurity aspects to consider when designing a modern car. The major focus is on the in-vehicle network, including its requirements, the most used protocols and their vulnerabilities. Moreover, starting from the attackers' goals and strategies, we outline the proposed solutions and the main projects towards secure architectures. In this way, we aim to provide the foundations for more targeted analyses about the security impact of autonomous driving and connected cars.

Index Terms—Cybersecurity, Mobility, Automotive, Connected Cars, Autonomous Driving

I. INTRODUCTION

The automotive industry is experiencing a serious transformation due to a digitalisation process in many of its aspects and the new mobility models. A recent report by PwC [20] states that by 2030 the vehicle parc in Europe and the USA will slightly decline, but at the same time the global industry profit will significantly grow. The main factor for this phenomenon is the concept of Mobility-as-a-Service (MaaS), i.e. the transition to car sharing and similar services, at the expense of individual car ownership (expected to drop from 90% to 52% in China [20]). In this sense, the main keywords that will contribute to this new model are 'connected cars' and 'autonomous driving'.

According to Upstream Security [27], by 2025 all new cars will be shipped connected, where 'connected' means not only the possibility of leveraging Internet or localisation services, but also the adoption of the V2X (Vehicle-to-X) paradigm. This term refers to the capability of the car to communicate and exchange data with other vehicles (V2V, Vehicle-to-Vehicle), with a generic infrastructure (V2I) or with pedestrians (V2P). The typical application of these models is smart cities, with the aim of optimising traffic management, sending alerts in case of incidents, and coordinating fleets of vehicles.

As regards autonomous driving, it consists in expanding the current Advanced Driver Assistance Systems (ADASs), such as lane keeping and braking assistants, in order to obtain a fully autonomous driverless car. The Society of Automotive Engineers (SAE) defines, in fact, six possible levels of autonomy, from level 0, with no assistance, to level 5, where the presence of the driver inside the car is not needed at all.

All these innovations have a common denominator: information technology. Current top-end vehicles have about 200 million lines of code, up to 200 Electronic Control Units (ECUs) and more than 5 km of copper wires [23], which means cars are becoming very complex software-based IT systems. This fact marks a significant shift in the industry: the "mechanical" world of original equipment manufacturers (OEMs) is converging towards that of IT companies.

In this context, the safety of modern vehicles is strictly related to addressing cybersecurity challenges. The electronic architecture of the vehicle has been designed and standardised over the years as a "closed" system, in which all the data of the ECUs persisted in the internal network. The new services described above require instead that data spread across multiple networks; there is, therefore, a bigger attack surface, i.e. new possibilities to be vulnerable to attackers. Hence, automotive OEMs need to reinvent the car architecture with a secure-by-design approach.

Another implication of this transformation is that the vehicle will be a fully-fledged cyber-physical system (CPS), that is "a system of collaborating computational elements controlling physical entities" [15]. This definition reminds us that, in terms of security, both the cyber- and the physical-related aspects should be considered. As an example, an autonomous car heavily interacts with the real-world environment and faces the challenge of guaranteeing the resilience of its sensing and actuation devices. Therefore, security in automotive also involves addressing the specific issues of a CPS, as can be read in the work by Wang et al. [30]; however, in this paper, we will consider the attacks that are carried out in the cyber-space.

In particular, we propose a systematisation of knowledge that focuses on the in-vehicle network, with the aim to provide the core elements for further analyses about complementary aspects of automotive cybersecurity.

Paper structure. In this paper, Section II firstly lists the constraints in car design, then describes the principal standards for the internal network and the related security vulnerabilities. Section III presents the various goals of the cyberattacks against vehicles, while Section IV gives an overview of the attack strategies. Section V illustrates the proposed solutions for new architectures. Finally, Section VI discusses how the security evaluation can be expanded to address the impact of artificial intelligence and V2X, and Section VII makes concluding remarks.
II. AUTOMOTIVE NETWORKS

This Section describes the basic characteristics of a car's internal network, from the design constraints to the main protocols and their vulnerabilities.

A. Constraints

Although common IT security concepts can be used to design car electronics, there are some specific constraints to consider, both on the hardware and the software side, as summarised by Studnia et al. [26] and Pike et al. [18]:

Hardware limitations: The typical ECUs for cars are embedded systems with substantial hardware limitations, that is, with low computing power and memory. This restriction means some security solutions like cryptography might not be fully implementable. Moreover, the ECUs are exposed to demanding conditions (such as low/high temperatures, shocks, vibrations, electromagnetic interferences), and must have as small an impact as possible on the size and weight of the vehicle. This is why the bus topology, which requires a much lower number of wires, is preferable compared to the star one. These constraints cause the OEMs to be sensitive to component costs, which limits the possibility to embrace innovations.

Timing: Several ECUs must perform tasks with fixed real-time constraints, which are often safety-critical. Therefore, any security measure must not impact these tasks.

Autonomy: Since the driver must be focused on driving, the car should be as autonomous as possible when protection mechanisms take place.

Life-cycle: The life-cycle of a car is much longer than that of conventional consumer electronics, hence the need for durable hardware and easy-to-update software (especially the security-related parts).

Supplier integration: To protect intellectual property, suppliers often provide (software) components without source code; therefore, any modification to improve security can be more difficult.

B. Main Standards

Current vehicles mix different types of networks to let the dozens of ECUs communicate. The primary standards, typically suited to a specific domain and its requirements, are LIN, MOST, FlexRay and CAN; the latter represents the backbone of the entire network, so it is the most explanatory protocol to understand the critical points in automotive cybersecurity. It is worth noting that, due to the transitioning phase in the industry, the topology and the standards are going to change, as will be better illustrated in Section V. Following the survey by Huo et al. [8], the main features of CAN and Automotive Ethernet (one of the newly proposed protocols) are the following:

Fig. 1. Main domains in a modern car. [5]

CAN: The Controller Area Network is the most used protocol for the in-vehicle network. It was released in 1986, but several variants and standards have been developed over the years. For simplicity, there is a low-speed CAN that reaches up to 125 Kb/s, while the high-speed version reaches up to 1 Mb/s; the first one is suited for the body domain, the other one is used in the 'powertrain' (engine or transmission control) and 'chassis' (suspension, steering or braking) domains. The CAN network is implemented with twisted pair wires, and an essential aspect is the network topology, which is a bus line. Although current designs are transitioning to a slightly different setting (Figure 1), with a Domain Controller (DC) that manages different sub-networks for each domain (i.e. functionality), the main idea is still that the CAN bus acts as the backbone and all the data spread across the entire network, in broadcast mode.

Automotive Ethernet: Although its adoption is still limited, Ethernet has a crucial role for next-generation automotive networks; it is a widespread standard for common IT uses, and its high bandwidth is a desirable characteristic for modern vehicles. However, as it is, its cost and weight are not suited for automotive, hence the need for 'Automotive Ethernet': in the past few years, among the various proposals, the 'BroadR-Reach' variant by Broadcom emerged, and its scheme has now been standardised by IEEE (802.3bp and 802.3bw); moreover, other variants are under development by ISO. The standard is currently guided by the One-Pair Ether-Net (OPEN) Alliance. The main difference compared to standard Ethernet is the use of a single unshielded twisted pair, which lets the cost, size and weight decrease significantly without sacrificing bandwidth (100 or 1000 Mb/s).

Before moving on to the description of the vulnerabilities caused by these designs, it is useful to introduce an essential standard for diagnostics: OBD. It stands for On-Board Diagnostics, and it consists of a physical port, mandatory for US and European vehicles, that enables self-diagnostic capabilities in order to detect and signal to the car owner or a technician the presence of failures in a specific component. It gives direct access to the CAN bus, thus posing a serious security threat, as will be described in Section IV; moreover, anyone can buy cheap dongles for the OBD port, extract its data and read them, for example, with a smartphone app.
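The two CAN properties that matter most for what follows, broadcast delivery and identifier-based arbitration, can be made concrete with a small simulation. The sketch below is written for this text (it is not part of any real CAN stack or standard), and the frame layout and node behaviour are deliberately simplified.

```python
# Toy model of a CAN-style bus (illustration only, not a real CAN implementation):
# every frame carries an identifier that doubles as its priority, the bus delivers
# each frame to all attached ECUs (broadcast), and when several frames compete,
# the one with the lowest identifier wins arbitration.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Frame:
    can_id: int      # 11-bit identifier; lower value = higher priority
    data: bytes      # 0..8 bytes of payload

class ToyCanBus:
    def __init__(self) -> None:
        self.ecus: List[Callable[[Frame], None]] = []
        self.pending: List[Frame] = []

    def attach(self, ecu: Callable[[Frame], None]) -> None:
        self.ecus.append(ecu)                    # every ECU sees every frame

    def queue(self, frame: Frame) -> None:
        self.pending.append(frame)

    def arbitrate_and_send(self) -> None:
        while self.pending:
            winner = min(self.pending, key=lambda f: f.can_id)   # ID-based arbitration
            self.pending.remove(winner)
            for ecu in self.ecus:
                ecu(winner)                      # broadcast delivery

if __name__ == "__main__":
    bus = ToyCanBus()
    bus.attach(lambda f: print(f"ECU saw id=0x{f.can_id:03X} data={f.data.hex()}"))
    bus.queue(Frame(0x3E8, b"\x10"))   # low-priority body-domain frame
    bus.queue(Frame(0x0A0, b"\x7f"))   # higher-priority frame is delivered first
    bus.arbitrate_and_send()
```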
C. Vulnerabilities

The constraints described in Section II-A, such as the need to reduce the cost and the size impact of the network, together with the past context in which the in-vehicle data was not exposed to external networks, caused the presence in the (CAN) backbone of the following design vulnerabilities [12]:

Broadcast transmission: Because of the bus topology, the messages between the ECUs spread across the entire network, causing a severe threat: accessing one part of the network (for example the OBD port) implies the possibility to send messages to the entire network or to eavesdrop on these communications.

No authentication: There is no authentication indicating the source of the frames, which means it is possible to send fake messages from every part of the network.

No encryption: The messages can be easily analysed or recorded in order to figure out their function.

ID-based priority scheme: Each CAN frame contains an identifier and a priority field; the transmission of a high-priority frame causes the lower-priority ones to back off, which enables Denial of Service (DoS) attacks.

III. ATTACK GOALS

In this Section, the different motivations that attract attackers are described. Taking the works by Studnia et al. [26] and IET [9] as references, these are the possible attack goals:

Vehicle theft: This is a straightforward reason to attack a vehicle.

Vehicle enhancement: This refers to software modifications especially realised by the owner of the car. The goal might be to lower the mileage of the vehicle, tune the engine settings or install unofficial software in the infotainment.

Extortion: This can be achieved for example through a ransomware-like strategy, i.e. blocking the victim's car until a fee is paid.

Intellectual challenge: The attack is conducted to demonstrate hacking ability.

Intellectual property theft: This refers to the elicitation of the source code for industrial espionage.

Data theft: This is an increasingly important goal, a consequence of the new paradigm of connected cars. There are different types of data to steal, such as:
• License plates, insurance and tax data;
• Location traces;
• Data coming from the connection with a smartphone, such as contacts, text messages, social media data, banking records.
The combination of these data might allow the attacker to discover the victim's habits and points of interest, exposing him to burglary or similar attacks.

IV. ATTACK SCENARIOS

In this Section, an overview of attack techniques and examples is provided. Following the work by Liu et al. [12], the typical attack scheme includes an initial phase in which a physical (e.g., OBD) or wireless (e.g., Bluetooth) car interface is exploited in order to access the in-vehicle network. The most common interface for this purpose is OBD, but several works leverage different entry points: Checkoway et al. [2] succeeded in sending arbitrary CAN frames through a modified WMA audio file burned onto a CD. Mazloom et al. [13] showed some vulnerabilities in the MirrorLink standard that allow controlling the internal CAN bus through a USB-connected smartphone. Rouf et al. [21] analysed the potential vulnerabilities in the Tire Pressure Monitoring System (TPMS), while Garcia et al. [6] found out that two widespread schemes for keyless entry systems present vulnerabilities that allow cloning the remote control, thus gaining unauthorised access to the vehicle.

Once the interface is chosen, the following methodologies are used to prepare and implement the attack (a toy illustration is sketched after this list):

Frame sniffing: Leveraging the broadcast transmission and the lack of cryptography in the network, the attacker can eavesdrop on the frames and discover their function. It is the typical first step to prepare the attack. An example of CAN frame sniffing and analysis is the work by Valasek et al. [28].

Frame falsifying: Once the details of the CAN frames are known, it is possible to create fake messages with false data in order to mislead the ECUs or the driver, e.g., with a wrong speedometer reading.

Frame injection: The fake frames, set with a proper ID, are injected into the CAN bus to target a specific node; this is possible because of the lack of authentication. An illustrative, and very notorious, attack is the exploitation by Miller et al. [14] of the 2014 Jeep Cherokee infotainment system, which can communicate over Sprint's cellular network in order to offer in-car WiFi, real-time traffic updates and other services. This remote attack allowed the researchers to control cyber-physical mechanisms such as steering and braking. The discovery of the vulnerabilities in the infotainment caused a 1.4 million vehicle recall by FCA.

Replay attack: In this case, the attacker sends a recorded series of valid frames onto the bus at the appropriate time, so that he can replay the car opening, start the engine, or turn the lights on. Koscher et al. [11] implemented a replay attack in a real car scenario.

DoS attack: As anticipated in Section II-C, flooding the network with the highest-priority frames prevents the ECUs from regularly sending their messages, therefore causing a denial of service. An example of this attack is the work by Palanca et al. [17].
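As a purely illustrative sketch (under the simplified frame format assumed in the toy model above, and not an attack tool), the bus-level weaknesses behind these methodologies can be reproduced on a simulated broadcast queue: sniffing needs no privilege because every node sees every frame, falsified frames are accepted because nothing authenticates the sender, and a flood of highest-priority identifiers starves legitimate traffic.

```python
# Sketch of frame sniffing, falsifying/injection and DoS on a simulated broadcast bus
# (illustration only). A frame is modelled as an (identifier, payload) pair, and the
# bus delivers pending frames lowest-identifier first, as in CAN arbitration.
from typing import List, Tuple

Frame = Tuple[int, bytes]                 # (identifier, payload); lower id = higher priority

def deliver(pending: List[Frame]) -> List[Frame]:
    """ID-based arbitration: frames leave the bus in increasing identifier order."""
    return sorted(pending, key=lambda f: f[0])

legit: List[Frame] = [(0x1A0, b"\x50"), (0x2B0, b"\x01")]   # hypothetical ECU traffic

sniffed = list(legit)                                       # sniffing: broadcast, no encryption
spoofed: Frame = (0x0C8, (10).to_bytes(1, "big"))           # falsifying: fake 10 km/h reading
flood: List[Frame] = [(0x000, b"")] * 3                     # DoS: identifier 0 always wins

delivered = deliver(legit + [spoofed] + flood)              # injection: no authentication check
print([hex(i) for i, _ in delivered])                       # flood, then spoofed, then legit frames
```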
V. SECURITY COUNTERMEASURES

This Section firstly aims to summarise the basic security principles to consider when designing car electronics and the related technology solutions. Then, it focuses on the major projects for new architectures.

A. Requirements

A typical pattern to help develop secure architectures is the so-called 'CIA triad', i.e. three conditions that should be guaranteed as far as possible: confidentiality, integrity, availability. As the previous Sections demonstrated, none of them is inherently guaranteed by the current reference backbone, the CAN bus.

Bearing in mind these concepts and taking a cue from the work by ACEA [1], the proposed countermeasures and some of the related implementations in the research literature are the following:

Dedicated HW: To compensate for the scarcity of computing power of the ECUs and satisfy the real-time constraints, it may be necessary to integrate hardware platforms specifically designed for security functions. This approach has been pursued, for example, in the EVITA and HIS projects, and it is referred to as Hardware Security Module (HSM) or Security Hardware Extension (SHE).

Cryptography: Encryption can help in ensuring confidentiality and integrity. It is worth noting that implementing cryptography is not trivial, since the low computing power may prevent the OEMs from using robust algorithms, which means cryptography might even be counter-productive. The guidelines recommend state-of-the-art standards, taking care of key management and possibly using dedicated hardware. There are several works about cryptography; for example, Zelle et al. [31] investigated whether the well-known TLS protocol applies to in-vehicle networks.

Authentication: Since different ECUs interact with each other, it is fundamental to know the sender of every incoming message. Two recent works that integrate authentication are those by Mundhenk et al. [16] and Van Bulck et al. [29]. (A minimal sketch of this idea is given after this list.)

Access control: Every component must be authorised in order to gain access to other parts. The guidelines suggest adopting the principle of least privilege, i.e. a policy whereby each user (each ECU in this case) has the lowest level of privileges that still permits it to perform its tasks.

Isolation/Slicing: This hardening measure aims at preventing an attacker from damaging the entire network. This goal can be achieved, for example, by isolating the driving systems from the other networks (e.g., the infotainment), or through a central gateway that employs access control mechanisms.

Intrusion detection: Intrusion Detection Systems (IDSs) monitor the activities in the network searching for malicious or anomalous actions. Some examples in the literature are the works by Song et al. [24] and by Kang et al. [10], the latter using deep neural networks.

Secure updates: Over-The-Air (OTA) updates are, on the one hand, a risk that increases the attack surface; on the other, they are an opportunity to quickly fix discovered vulnerabilities (besides adding new services). Some recent works to secure the updates, and also V2X communications, are those by Dorri et al. [4] and Steger et al. [25], both taking advantage of blockchain.

Incident response and recovery: It is necessary to ensure an appropriate response to incidents, limit the impact of failures and always be able to restore the standard vehicle functionality.

All the above aspects should be fulfilled in a Security Development Lifecycle (SDL) perspective, with data protection and privacy as a priority. Testing and information sharing among industry actors are recommended.
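As an example of the authentication idea in the list above (a minimal sketch, not the scheme of [16] or [29], and with a hypothetical pre-shared key and frame layout), a sender can append a truncated HMAC computed over the frame identifier, a monotonic counter for replay protection, and the payload; the receiver recomputes the tag and compares it in constant time.

```python
# Minimal sketch of CAN-payload authentication with a truncated HMAC (illustration only).
import hmac
import hashlib

KEY = b"per-link secret shared by two ECUs"     # hypothetical pre-shared key

def tag(can_id: int, counter: int, data: bytes, length: int = 4) -> bytes:
    """Truncated HMAC-SHA256 over identifier, counter (replay protection) and payload."""
    msg = can_id.to_bytes(2, "big") + counter.to_bytes(4, "big") + data
    return hmac.new(KEY, msg, hashlib.sha256).digest()[:length]

def verify(can_id: int, counter: int, data: bytes, received: bytes) -> bool:
    return hmac.compare_digest(tag(can_id, counter, data, len(received)), received)

if __name__ == "__main__":
    t = tag(0x1A0, 42, b"\x00\x5a\x00\x00")            # a 4-byte tag fits in an 8-byte frame
    print(verify(0x1A0, 42, b"\x00\x5a\x00\x00", t))   # True
    print(verify(0x1A0, 43, b"\x00\x5a\x00\x00", t))   # False: counter mismatch is rejected
```

The truncation length trades bandwidth for forgery resistance; a real deployment would also need key management, which is exactly what the cited works address.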
B. Main Projects

In the past ten years, several research proposals and standardisation projects have started, aiming to develop and integrate the ideas of the previous Section organically; a map of these initiatives can be seen in Figure 2.

Fig. 2. Safety and security initiatives inside and outside of the automotive domains. (ENISA [5])

Among these projects, SAE J3061 (https://www.sae.org/standards/content/j3061_201601/), finalised in 2016, guides the vehicle cybersecurity development process, ranging from the basic principles to the design tools. However, a new international standard, ISO/SAE 21434, is under development; its goal is to (a) describe the requirements for risk management and (b) define a framework that manages these requirements, without indicating specific technologies, rather giving a reference that is useful also for legal aspects.

Moreover, the implementation of these guidelines and the transition towards a new in-vehicle network architecture are currently guided by projects like AUTOSAR (https://www.autosar.org). This initiative is a partnership born in 2003 between several stakeholders, ranging from the OEMs to the semiconductor companies, which aims to improve the management of the E/E architectures through reuse and exchangeability of software modules; concretely, it standardises the software architecture of the ECUs. It is still an active project, now also focused on autonomous driving and V2X applications, and it covers different functionalities, from cybersecurity to diagnostics, safety and communication. AUTOSAR also supports different software standards, such as GENIVI (https://www.genivi.org), another important alliance aiming to develop open software solutions for In-Vehicle Infotainment (IVI) systems.

VI. DISCUSSION

Fig. 3. Applying security principles ([23])

The ideas expressed in the previous Section can be summarised by Figure 3, which shows how the security principles can be implemented in practice. In our opinion, the primary protocol upon which the backbone of the future in-vehicle network will be built is Automotive Ethernet. Moreover, the takeaway message from these initiatives is the specific focus on security: each building block implies a research activity aimed at proposing a solution tailored to the automotive domain.

In this paper, we examined the core elements and concerns for secure internal networks; however, it is worth discussing, although in an introductory manner, how the same awareness should be extended to the very new actors in automotive, i.e. artificial intelligence and V2X. These elements enable new advanced, smart services (e.g., platooning, that is the use of a fleet of vehicles that travel together in a coordinated and autonomous way) and, as a consequence, further threats. In particular, focusing on artificial intelligence, the primary concerns come from autonomous driving, where deep learning is the main enabling technology. In addition to the inherent complexity of developing a fully autonomous car for the real world, several studies demonstrated how machine learning-based algorithms are vulnerable, i.e. the fact that carefully-perturbed inputs can easily fool classifiers, causing, for example, a stop sign to be classified as a speed limit ([7]). These issues originate the research topic of adversarial learning. Moreover, the use of machine learning is not limited to computer vision but also includes cybersecurity software, such as IDSs, and safety systems, such as drowsiness and distraction detectors. Therefore, it is fundamental to leverage proper techniques (e.g., [3]) to a) avoid consistent drops of performance, b) increase the effort required of the attacker to evade the classifiers, and c) keep the complexity of the algorithms within an acceptable level, given the constraints described in Section II-A. Ultimately, these concerns must be addressed with the same attention as the ones related to the internal network architecture. In this sense, some works, such as [22], propose to include machine learning-specific recommendations in the ISO 26262 standard (https://www.iso.org/standard/68383.html).

VII. CONCLUSION

To sum up, in this paper we showed how the digitalisation process within the automotive industry, where the OEMs are converging towards IT companies and the vehicles are becoming "smartphones on wheels", has come up against serious cybersecurity issues, due to security flaws inherited from an original design in which the in-vehicle network did not interact with the external world. By contrast, the Mobility-as-a-Service paradigm causes the vehicle to be hyper-connected and consequently much more exposed to cyber threats.

In this transition phase, we observed the effort in developing more and more complex platforms in a safety-critical context with strict requirements such as the limited hardware and the real-time constraints. For these reasons, both industry and researchers are working to leverage common IT methodologies from other domains and tailor them to the automotive one. The route towards this goal is not straightforward, as noted in the study by the Ponemon Institute [19]: 84% of the professionals working for OEMs and their suppliers still have concerns that cybersecurity practices are not keeping pace with evolving technologies.

As a final remark, we claim that the core ideas concerning the in-vehicle network described in this paper could be considered for further analyses on the security of autonomous driving and V2X communications.

ACKNOWLEDGEMENT

The authors thank Abinsula srl for the useful discussions on the mechanisms of the automotive industry and its trends.

REFERENCES

[1] ACEA. Principles of Automobile Cybersecurity. Tech. rep. ACEA, 2017.
[2] Stephen Checkoway et al. "Comprehensive Experimental Analyses of Automotive Attack Surfaces". In: USENIX Security Symposium. San Francisco, CA: USENIX Association, 2011, pp. 447-462.
[3] Ambra Demontis et al. "Yes, Machine Learning Can Be More Secure! A Case Study on Android Malware Detection". In: IEEE Transactions on Dependable and Secure Computing (2017), pp. 1-1.
[4] Ali Dorri et al. "BlockChain: A Distributed Solution to Automotive Security and Privacy". In: IEEE Communications Magazine 55.12 (Dec. 2017), pp. 119-125.
[5] ENISA. Cyber Security and Resilience of Smart Cars. Tech. rep. ENISA, 2017.
[6] Flavio D. Garcia et al. "Lock It and Still Lose It - On the (In)Security of Automotive Remote Keyless Entry Systems". In: 25th USENIX Security Symposium (USENIX Security 16). Austin, TX: USENIX Association, 2016.
[7] Tianyu Gu, Brendan Dolan-Gavitt, and Siddharth Garg. "BadNets: Identifying Vulnerabilities in the Machine Learning Model Supply Chain". In: arXiv preprint arXiv:1708.06733 (2019).
[8] Yinjia Huo et al. "A survey of in-vehicle communications: Requirements, solutions and opportunities in IoT". In: 2015 IEEE 2nd World Forum on Internet of Things (WF-IoT). IEEE, Dec. 2015, pp. 132-137.
[9] IET. Automotive Cyber Security: An IET/KTN Thought Leadership Review of Risk Perspectives for Connected Vehicles. Tech. rep. IET, 2014.
[10] Min-Joo Kang and Je-Won Kang. "Intrusion Detection System Using Deep Neural Network for In-Vehicle Network Security". In: PLOS ONE 11.6 (June 2016). Ed. by Tieqiao Tang.
[11] Karl Koscher et al. "Experimental Security Analysis of a Modern Automobile". In: 2010 IEEE Symposium on Security and Privacy. IEEE, 2010, pp. 447-462.
[12] Jiajia Liu et al. "In-Vehicle Network Attacks and Countermeasures: Challenges and Future Directions". In: IEEE Network 31.5 (2017), pp. 50-58.
[13] Sahar Mazloom et al. "A Security Analysis of an In-Vehicle Infotainment and App Platform". In: 10th USENIX Workshop on Offensive Technologies (WOOT 16). USENIX Association, 2016.
[14] Charlie Miller and Chris Valasek. "Remote Exploitation of an Unaltered Passenger Vehicle". In: Black Hat USA 2015 (2015), pp. 1-91.
[15] Roberto Minerva, Abyi Biru, and Domenico Rotondi. "Towards a Definition of IoT". 2015.
[16] Philipp Mundhenk et al. "Security in Automotive Networks: Lightweight Authentication and Authorization". In: ACM Transactions on Design Automation of Electronic Systems 22.2 (Mar. 2017), pp. 1-27.
[17] Andrea Palanca et al. "A Stealth, Selective, Link-Layer Denial-of-Service Attack Against Automotive Networks". In: International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment. Springer Verlag, 2017, pp. 185-206.
[18] Lee Pike et al. "Secure Automotive Software: The Next Steps". In: IEEE Software 34.3 (May 2017), pp. 49-55.
[19] Ponemon Institute. Securing the Modern Vehicle: A Study of Automotive Industry Cybersecurity Practices. Tech. rep. 2019.
[20] PwC. The 2018 Strategy & Digital Auto Report. Tech. rep. 2018.
[21] Ishtiaq Rouf et al. "Security and Privacy Vulnerabilities of In-Car Wireless Networks: A Tire Pressure Monitoring System Case Study". In: 19th USENIX Security Symposium. Washington, DC: USENIX Association, 2010, pp. 11-13.
[22] Rick Salay, Rodrigo Queiroz, and Krzysztof Czarnecki. "An Analysis of ISO 26262: Using Machine Learning Safely in Automotive Software". In: arXiv preprint arXiv:1709.02435 (Sept. 2017).
[23] Balazs Simacsek. "Can we trust our cars?" 2019.
[24] Hyun Min Song, Ha Rang Kim, and Huy Kang Kim. "Intrusion detection system based on the analysis of time intervals of CAN messages for in-vehicle network". In: 2016 International Conference on Information Networking (ICOIN). Vol. 2016-March. IEEE, Jan. 2016, pp. 63-68.
[25] Marco Steger et al. "Secure Wireless Automotive Software Updates Using Blockchains: A Proof of Concept". In: Advanced Microsystems for Automotive Applications 2017. Lecture Notes in Mobility. Springer, Cham, 2018, pp. 137-149.
[26] Ivan Studnia et al. "Survey on security threats and protection mechanisms in embedded automotive networks". In: 2013 43rd Annual IEEE/IFIP Conference on Dependable Systems and Networks Workshop (DSN-W). IEEE, June 2013, pp. 1-12.
[27] Upstream Security. Global Automotive Cybersecurity Report 2019. Tech. rep. 2018.
[28] Chris Valasek and Charlie Miller. "Adventures in Automotive Networks and Control Units". In: Defcon 21. 2013, pp. 260-264.
[29] Jo Van Bulck, Jan Tobias Mühlberg, and Frank Piessens. "VulCAN: Efficient component authentication and software isolation for automotive control networks". In: Proceedings of the 33rd Annual Computer Security Applications Conference - ACSAC 2017. New York, NY, USA: ACM Press, 2017, pp. 225-237.
[30] Eric Ke Wang et al. "Security Issues and Challenges for Cyber Physical System". In: 2010 IEEE/ACM Int'l Conference on Green Computing and Communications & Int'l Conference on Cyber, Physical and Social Computing. IEEE, Dec. 2010, pp. 733-738.
[31] Daniel Zelle et al. "On Using TLS to Secure In-Vehicle Networks". In: Proceedings of the 12th International Conference on Availability, Reliability and Security - ARES '17. New York, NY, USA: ACM Press, 2017, pp. 1-10.
NTRU-Like Secure and Effective Congruential
Public-Key Cryptosystem Using Big Numbers
Anas Ibrahim
Department of Computer Engineering
Eastern Mediterranean University
Palestine Technical University
anas.ibrahim@emu.edu.tr

Alexander Chefranov
Department of Computer Engineering
Eastern Mediterranean University
Famagusta, North Cyprus
alexander.chefranov@emu.edu.tr

Nagham Hamad
Palestine Technical University
Tulkarem, Palestine
nagham.hamad@ptuk.edu.ps
Abstract—We propose RCPKC, a random congruential public-key cryptosystem working on integers modulo q, such that the norm of the two-dimensional vector formed by its private key, (f, g), is greater than q. RCPKC works similarly to NTRU, a fast and secure PKC. NTRU uses high-order (N) polynomials and is susceptible to the lattice basis reduction attack (LBRA), which takes time exponential in N. RCPKC is a secure version of the insecure CPKC proposed by the NTRU authors, which is easily attackable by LBRA since CPKC uses small numbers for the sake of correct decryption. RCPKC specifies a range from which the random numbers shall be selected, and it provides correct decryption for valid users and incorrect decryption for an attacker using Gaussian Lattice Reduction (GLR). Because of its resistance to LBRA, RCPKC is more secure, and, due to the use of big numbers instead of high-order polynomials, it is about 24 (7) times faster in encryption (decryption) than NTRU. Also, RCPKC is more than 3 times faster than the most effective known NTRU variant, BQTRU.

Index Terms—Congruential public-key cryptosystem, Integer, Lattice, Lattice basis reduction attack, LLL algorithm, Minkowski's boundary for a lattice shortest vector norm, NTRU, Polynomial

I. INTRODUCTION

The emergence of cloud computing raises the demand for low computational complexity homomorphic PKC [1], [2]. NTRU [3] is a PKC standardized as IEEE P1363.1 and faster than RSA and ECC [4]. Many variants of NTRU have been proposed and studied recently, targeting a further decrease of its computational complexity. All these variants work with polynomials and mainly differ in the choice of their coefficients, in the ring-defining polynomial, or in using the polynomials as entries of structures such as matrices. We overview them briefly below.

NTRU variants differing in the choice of their coefficients: ETRU [5], working with polynomials over Eisenstein integer coefficients, is faster than NTRU in encryption/decryption by 1.45/1.72 times; BQTRU [6], working over quaternions but with bivariate polynomials, is 7 times faster than NTRU in encryption.

NTRU variants working with different rings: an NTRU variant that works with polynomials over prime cyclotomic rings is proposed in [7]. A variant of NTRU working with non-invertible polynomials is proposed in [8].

NTRU variants working with polynomials inside more complicated structures: MaTRU [9] works with square matrices of polynomials and shows encryption and decryption performance better than NTRU's by 2.5 times. NNRU [10] works with polynomials that are entries of square matrices forming a specified non-commutative ring.

Thus, NTRU and its known variants work with order-N polynomial rings. The main problem NTRU faces is that it is susceptible to the lattice basis reduction attack (LBRA), using the Gaussian lattice reduction (GLR) algorithm for two-dimensional lattices and the LLL algorithm for higher dimensions [11]. The LLL algorithm solves, in time exponential in N, the shortest vector in a lattice problem (SVP), revealing the secret key [12], because the private keys are selected as polynomials with small coefficients for decryption correctness. The NTRU encryption/decryption mechanism is used for polynomials. In [13, pp. 373-376], the authors of NTRU applied that mechanism to integers modulo q >> 1, considering a congruential public-key cryptosystem (CPKC), and found that it is insecure, since GLR finds its private keys in an order of ten iterations. That is why CPKC is considered there as a toy model of NTRU that "provides the lowest dimensional introduction to the NTRU public key cryptosystem" [13, p. 374]. The insecurity of CPKC stems from the choice of the private keys as small numbers to provide decryption correctness.

Thus, from the analysis conducted, we see that NTRU variants try to minimize its computational complexity by extending the coefficients of the polynomials used or by using matrices of polynomials, which allows preserving the security level while decreasing the polynomial order, because operations with high-order polynomials are time-consuming. The extreme case is a polynomial of order zero, that is, a number, as used in CPKC, but CPKC is shown in [13] to be insecure with respect to LBRA by GLR. If CPKC could be made resistant to the GLR attack, it would be the best possible choice among the NTRU modifications.

Herein, we propose a CPKC modification, RCPKC, that specifies a range from which the random numbers shall be selected; it provides correct decryption for valid users and incorrect decryption for an attacker using GLR (a GLR-attacker), i.e. GLR can never find its private key, because GLR solves SVP by returning the shortest vector in a lattice, whereas our private key is in the safe region (above the Minkowski boundary (22), (25) for the shortest vector norm of a lattice). RCPKC is more secure than NTRU because LBRA, currently considered one of the most effective attacks against NTRU, as well as a number of other attacks on NTRU, are not applicable to RCPKC, while RCPKC's resistance to other known attacks on NTRU is similar to that of NTRU. RCPKC is about 24 (7) times faster in encryption (decryption) than NTRU.

The rest of the paper is organized as follows. In Section 2, we overview CPKC and briefly introduce NTRU. In Section 3, LBRA by GLR on CPKC is presented. In Section 4, RCPKC is presented. In Section 5, an RCPKC performance comparison versus NTRU and its variants is presented. Section 6 concludes the paper.
II. OVERVIEW OF CPKC AND NTRU

In this section, we overview CPKC and illustrate it by an example of encryption/decryption. We also briefly describe NTRU.

A. Overview of CPKC

Two secret integers, f, g, are defined as follows:
f < √(q/2), √(q/4) < g < √(q/2), (1)
gcd(f, qg) = 1, (2)
where q is public.
The first secret value, f, has inverses modulo g and q, denoted Fg and Fq respectively, by virtue of (2):
1 = f · Fg mod g, 1 = f · Fq mod q. (3)
The public value, h, is computed using (1), (3) as follows:
h = Fq · g mod q. (4)
Thus, CPKC has the private (secret) key SK = (f, g, q, Fg, Fq) and the public key PK = (h, q).
The plaintext message, m, meets the following condition:
0 < m < √(q/4). (5)
A random integer, r, is chosen as follows:
0 < r < √(q/2). (6)

B. CPKC Encryption

The ciphertext, e, is computed using (4)-(6) as follows:
e = r · h + m mod q. (7)

C. CPKC Decryption

Decryption is described by Steps 1 and 2 below.
Step 1: Multiply the ciphertext (7) by f, getting
a = f · e mod q = r · f · Fq · g + f · m mod q. (8)
Note that a = r · g + f · m if
0 ≤ r · g + f · m < q,
where (3), (4), and (7) are used. The CPKC decryption correctness condition (9) holds under conditions (1), (5), (6):
0 ≤ r · g + f · m < √(q/2) · √(q/2) + √(q/2) · √(q/4) < q. (9)
Thus, the parameters f, g, r are selected small compared to q (see (1), (5), (6)) to meet the CPKC decryption correctness condition (9) used in Step 2 of the decryption.
Step 2: Multiply (8) by Fg, getting
m = a · Fg mod g, (10)
where (3) is used and the term with factor g in (8) vanishes due to (9).

D. Example 1. Example of CPKC Encryption/Decryption

The example is close to Example 1 from [13, p. 375]. Let, according to (1), (2), (5), q = 122430513839, f = 231233, g = 195696, and m = 12345.
According to (3), Fg = 127505 and Fq = 54368439252. The public key component, h, is calculated by (4):
h = Fq · g mod q = 107143708775.
Let, according to (6), r = 10101. The ciphertext, e, is computed according to (7):
e = r · h + m mod q = 95290525699. (11)
To decrypt the ciphertext (11), apply Step 1, equation (8):
a = f · e mod q = r · g + f · m = 4831296681. (12)
In Step 2, the message m is retrieved using (10):
m = Fg · a mod g = 12345. (13)
Thus, in (13), we get back the plaintext, m. We see that the CPKC encryption/decryption procedure (7), (8), (10) works correctly due to (9) holding. Note that the norm of the vector, ‖(f, g)‖ = √(f² + g²) = 302928.4, is small compared to q = 759250123.0.
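The whole CPKC flow (1)-(10) can be exercised directly with Python's arbitrary-precision integers; the following sketch (ours, not the authors' code) uses the Example 1 values, so it should reproduce the quantities in (11)-(13). The modular inverses in (3) are obtained with pow(x, -1, n), available from Python 3.8.

```python
# Sketch of CPKC key generation, encryption and decryption (Section II), Example 1 values.
def cpkc_keygen(q, f, g):
    Fg = pow(f, -1, g)            # f^-1 mod g, exists by (2)
    Fq = pow(f, -1, q)            # f^-1 mod q
    h = (Fq * g) % q              # public key component (4)
    return (f, g, Fg, Fq), h

def cpkc_encrypt(q, h, m, r):
    return (r * h + m) % q        # (7)

def cpkc_decrypt(q, sk, e):
    f, g, Fg, _ = sk
    a = (f * e) % q               # Step 1, (8): equals r*g + f*m when (9) holds
    return (a * Fg) % g           # Step 2, (10)

if __name__ == "__main__":
    q, f, g, m, r = 122430513839, 231233, 195696, 12345, 10101   # Example 1
    sk, h = cpkc_keygen(q, f, g)
    e = cpkc_encrypt(q, h, m, r)
    print("h =", h, " e =", e, " decrypted m =", cpkc_decrypt(q, sk, e))
```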
E. Overview of NTRU

NTRU uses the ring
Rq = Zq[x] / (x^N − 1),
whose elements are polynomials modulo x^N − 1 with coefficients in Zq.
Let T(d1, d2) be the subset of Rq consisting of polynomials having d1 coefficients equal to 1, d2 coefficients equal to -1, and the remaining coefficients equal to zero.
The secret polynomials, f, g, are of the form
f = 1 + p · F, g ∈ T(d + 1, d), (14)
where F = A1 · A2 + A3, Ai is from T(di, di), i = 1,..,3, and p < q is a small integer relatively prime to q.
The public polynomial, h, is computed as follows:
h = p · Fq · g mod q, (15)
where Fq is the inverse of f modulo q. A random polynomial, r, and the message, m, are of the form
r = r1 · r2 + r3, m ∈ Rp, (16)
where ri is from T(dri, dri), i = 1,..,3.
NTRU encryption is represented by (7), and decryption uses (8) followed by a modulo p operation, for the polynomials f, h, r, and m defined in (14)-(16).

III. LATTICE BASIS REDUCTION ATTACK BY GLR ON CPKC PRIVATE KEY/MESSAGE

In the following, ‖x‖, (x · y), ⌊a⌉, and R denote the Euclidean norm [14] of the vector x, the dot product of the vectors x and y, the rounding of the real number a, and the set of real numbers, respectively.
Let E(V1, V2) ⊂ R² be a 2-dimensional lattice with basis vectors V1 and V2:
E(V1, V2) = {a1 · V1 + a2 · V2 : a1, a2 ∈ Z}. (17)
The CPKC private key recovery problem can be formulated as the Shortest Vector Problem (SVP) in the two-dimensional lattice E(V1, V2). From (4), we can see that for any pair of positive integers, F and G, satisfying
G = F · h mod q, F = O(√q), G = O(√q), (18)
(F, G) is likely to serve as the first two components, f, g, of the private key SK [13, p. 376]. Equation (18) can be written as F · h + q · n = G, where n is an integer. So, our task is to find a pair of comparatively small integers, (F, G), such that
F · V1 + n · V2 = (F, G), (19)
where V1 = (1, h) and V2 = (0, q) are basis vectors, at least one of them having Euclidean norm of order O(q). Similarly, the CPKC message recovery problem can be formulated as SVP in the two-dimensional lattice E(V1, V2), where V1, V2 are from (19). From (7), we can see also that for any pair of positive integers, (RR, EM), satisfying
EM = RR · h mod q, RR = O(√q), EM = O(√q), (20)
(RR, EM) is likely to serve as the vector (r, e − m), since the encryption equation (7) can be written as r · h + q · n = e − m, where n is an integer. So, our task is to find a pair of comparatively small integers, (RR, EM), such that
RR · V1 + n · V2 = (RR, EM). (21)
We want to find the shortest vector w from E(V1, V2) using GLR, which might disclose (r, e − m) if e, r are of the order of O(√q). Comparing (19) and (21), we see that they are the same up to the names of the unknowns, and hence, finding the shortest vector in E(V1, V2) may reveal either the private key components (F, G) = (f, g), or the message-related vector, (RR, EM) = (r, e − m).
The GLR algorithm [13, p. 437], shown in Code 1, on termination returns the shortest vector w = v1 in E(V1, V2).

Code 1: GLR algorithm pseudocode finding the shortest vector v1 of the lattice E(V1, V2).
Input: basis vectors V1, V2;
Output: the shortest vector v1 in E(V1, V2);
v1 = V1; v2 = V2;
Loop
  If ||v2|| < ||v1||
    swap v1 and v2.
  Compute m = ⌊(v1 · v2)/||v1||²⌉.
  If m = 0
    return the shortest vector v1 of the basis {v1, v2}.
  Replace v2 with v2 − m·v1.
Continue Loop.

LBRA by GLR using Code 1 on the CPKC private key/message, for the data from Example II-D, finds in 9 iterations the shortest vector v1 = (231233, 195696), as shown in Fig. 1. The shortest vector, v1, found by GLR corresponds to the private key components, (f, g), because they were selected small, having values of order O(√q) according to (1). The message-related vector, (r, e − m), is not disclosed in the attack because e = O(q) in Example II-D.

Fig. 1. Screenshot of LBRA by GLR using MuPAD Code 2 on CPKC for the data from Example 1, finding the private key components, (f, g) = v1, in 9 iterations.

LBRA by GLR succeeds in finding the CPKC private key since, by the settings (1) used, it is likely the shortest vector in the lattice. Minkowski's Second Theorem [15, p. 35] sets an upper bound for the norm of the shortest nonzero vector, λ, in a 2-dimensional lattice:
λ ≤ (λ2 · Vol(L))^(1/2), (22)
where λ2 = 2/√3 ≈ 1.154 is Hermite's constant [15, p. 41], and Vol(L) is the volume of the lattice, which is equal to q for the lattice L = E(V1, V2) with V1, V2 defined in (19). We can rewrite (22) as follows:
λ ≤ α · √q, (23)
where α = √λ2 ≈ 1.07. Defining the relative norm
λ′ = λ/√q, (24)
we obtain the following inequality (25):
λ′ ≤ α. (25)
GLR fails to attack the CPKC private key/message when (25) is not satisfied for the relative norm of the secret vector (f, g), i.e. if
‖(f, g)‖/√q > α (26)
holds, GLR fails to find the CPKC private key/message.
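Code 1 translates directly into a few lines of Python; the version below (a sketch, not the authors' MuPAD code) keeps all arithmetic in exact integers, including the nearest-integer rounding, and also reports the iteration count. Run on the basis (1, h), (0, q) of Example 1, it is expected to return ±(f, g), in line with Fig. 1.

```python
# Gaussian lattice reduction (GLR) for 2-dimensional integer lattices, following Code 1.
def glr(V1, V2):
    def norm2(v):                                   # squared Euclidean norm
        return v[0] * v[0] + v[1] * v[1]

    def round_div(a, b):                            # nearest integer to a/b for b > 0
        return (2 * a + b) // (2 * b)

    v1, v2, iterations = list(V1), list(V2), 0
    while True:
        iterations += 1
        if norm2(v2) < norm2(v1):
            v1, v2 = v2, v1
        m = round_div(v1[0] * v2[0] + v1[1] * v2[1], norm2(v1))
        if m == 0:
            return tuple(v1), iterations            # v1 is the shortest vector of the basis
        v2 = [v2[0] - m * v1[0], v2[1] - m * v1[1]]

if __name__ == "__main__":
    q, h = 122430513839, 107143708775               # lattice (19) for Example 1
    print(glr((1, h), (0, q)))                      # expected: +/-(231233, 195696)
```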
IV. RCPKC PROPOSAL, PROOF OF ITS CORRECTNESS, AND EXAMPLE OF ENCRYPTION/DECRYPTION AND LBRA BY GLR FAILURE

In this section, we propose random CPKC (RCPKC) by adjusting the CPKC described in Section 2 to satisfy (26). The main ideas of RCPKC are the following:

- Contrary to the settings (1) of CPKC, which use a secret key (f, g) with a small norm not exceeding √q, so that (f, g) may be found as a shortest vector (SV) in the lattice E(V1, V2) defined by (19), we use (f, g) with a large norm meeting (26), so that it cannot be returned by LBRA using GLR as an SV.

- Small values (1) are chosen in CPKC to meet the decryption correctness condition (9), which we also meet in RCPKC due to the skew in the components of (f, g). It might happen, as noted by an anonymous reviewer (whom we thank), that in spite of the large norm of (f, g), the SV = (F, G) obtained as a result of LBRA using GLR may meet the decryption correctness condition (9), and thus may be used for correct disclosure of the plaintext message. Therefore, before encrypting by (7), and contrary to CPKC, which uses a random number from the predefined range (6), RCPKC defines a range for the random number selection using the SV, (F, G), returned by the GLR attack on the lattice E(V1, V2) defined by (19), so that the decryption correctness condition (9) holds for (f, g) but does not hold for (F, G), which leads to the failure of LBRA using GLR on RCPKC. Thus, RCPKC assumes that the private key owner selects the range for the random value, r, used in encryption (7), based on the values of the secret key, (f, g), and the respective SV, (F, G), in the lattice E(V1, V2) defined by (19), guaranteeing correct decryption for a valid user and incorrect decryption for a GLR attacker. Because of this special choice of the random value range, the proposed algorithm is called Random CPKC, RCPKC. A potential problem for RCPKC is that the range for random numbers defined in this way may be rather narrow and, thus, the security of RCPKC may suffer. But we show that the range is rather large and may significantly exceed the range for a secret message.

A. RCPKC Proposal and Proof of Its Correctness

To meet (26), we require
f, r ≥ α · √q. (27)
The LBRA by GLR failure condition (26) holds if (27) is true, since
‖(f, g)‖/√q = √(f² + g²)/√q ≥ √(α²·q + g²)/√q > α,
‖(r, e − m)‖/√q = √(r² + (e − m)²)/√q ≥ √(α²·q + (e − m)²)/√q > α,
for g, e − m ≠ 0. Condition (27), in RCPKC, substitutes for the conditions (1), (6) on f, r in CPKC. The message, m, and the private key component, g, instead of (5), (1) used in CPKC, are redefined in RCPKC as follows:
2^mgLen > g ≥ 2^(mgLen−1) > m ≥ 0, (28)
where mgLen represents the length of m and g in bits.
For RCPKC, the decryption correctness condition (9) shall hold, which is true (see (33)) when the f, r values, in addition to (27), meet (29):
q / (2 · 2^mgLen) > f, r. (29)
For q = 2^qLen, (27), (29) can be rewritten as:
2^(qLen−mgLen−1) > f, r ≥ α · 2^(qLen/2). (30)
To have a non-empty range for f, r of width at least α · 2^(qLen/2), from (30) we get the following condition:
2^(qLen/2) / (2 · α) > 2^(mgLen+1). (31)
Defining β = log2(1/(2 · α)) ≈ −1.103, from (31) we have
2^β · 2^(qLen/2) > 2^(mgLen+1),
qLen + 2 · β > 2 · (mgLen + 1),
qLen > 2 · (mgLen + 1 − β). (32)
Let us show that the decryption correctness condition (9) holds when (28), (30), and (32) hold:
r · g + f · m < 2^(qLen−mgLen−1) · 2^mgLen + 2^(qLen−mgLen−1) · 2^(mgLen−1) < 2^(qLen−1) + 2^(qLen−1) = 2^qLen = q. (33)
Thus, for RCPKC, the norm of (f, g) meets (26) and the decryption correctness condition (9) holds. We additionally need the decryption correctness condition (9) to be violated for (F, G), that is, the SV obtained as a result of the GLR attack on the lattice E(V1, V2) defined by (19), so that it cannot be used as a private key for correct decryption of the plaintext message.
Inequality (30) defines a range for r so that f, g, r, m meet (9). Now, we define a constraint on r,
r ≥ rmin ≥ (q + g·|F|)/|G|, (34)
such that F, G, r, m violate (9). Using (34) and (28):
|G · r + F · m| ≥ |G| · |r| − |F| · m ≥ |G| · (q + g·|F|)/|G| − |F| · m ≥ q + g·|F| − m·|F| > q. (35)
Thus, inequality (30) is used for f, but for r, from (34) and (30), we have
2^(qLen−mgLen−1) > r ≥ max(α · 2^(qLen/2), rmin). (36)
For RCPKC security, the range defined by (36) shall be rather large, at least max(α · 2^(qLen/2), rmin) wide; hence:
2^(qLen−mgLen−1) ≥ 2 · max(α · 2^(qLen/2), rmin). (37)
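A quick numeric check (ours, for illustration) confirms that (32) leaves a comfortably wide interval (30) for f and r, both for the toy parameters of Example 2 below and for the k = 112 setting (39) used later in Section V.

```python
# Checking condition (32) and the width of the range (30) for f and r (illustration only).
from math import log2, sqrt

ALPHA = sqrt(2 / sqrt(3))            # alpha = sqrt(Hermite's constant) ~ 1.07, see (23)
BETA = log2(1 / (2 * ALPHA))         # ~ -1.10, see (32)

for qLen, mgLen in [(59, 16), (473, 225)]:
    lower = ALPHA * 2 ** (qLen / 2)          # f, r >= alpha * 2^(qLen/2), see (30)
    upper = 2.0 ** (qLen - mgLen - 1)        # f, r <  2^(qLen - mgLen - 1), see (30)
    print(f"qLen={qLen}, mgLen={mgLen}: (32) holds: {qLen > 2 * (mgLen + 1 - BETA)}, "
          f"range width / lower bound = {upper / lower:.1f}")
```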
RCPKC Proposal

The private key components, (f, g), meet (2), (3), (28), (30), where qLen, mgLen meet (32) and (37), and where (F, G) is an SV obtained as a result of the GLR attack on the lattice E(V1, V2) defined by (19). The public key component, h, is defined by (4). The message, m, meets (28), and the random integer, r, is selected from the range defined by (34), (36).
Encryption and decryption follow (7), and (8), (10), respectively (see Sections II-B and II-C).
The decryption correctness condition (9) is proved for RCPKC in (33), thus proving RCPKC correctness.
Example 2 illustrates RCPKC encryption and decryption, and the GLR failure to find the RCPKC secret key/message.

B. Example 2. Example of RCPKC Encryption/Decryption and LBRA by GLR Failure

For the calculations, we use MuPAD. Let mgLen = 16 and qLen = 59, meeting (32), q = 2^59 = 576,460,752,303,423,488, and let the private key components, g and f, be selected to meet (28) and (30), respectively, as follows: g = 65,535 and f = 812,397,637.
We see that the values of g and f satisfy (28) and (30):
65,536 = 2^mgLen > g ≥ 2^(mgLen−1) = 32,768,
4,398,046,511,104 = 2^(qLen−mgLen−1) > f ≥ α · 2^(qLen/2) = 812,397,633.7.
Similarly, the message, m, is selected to meet (28), m = 14. We see that the value of m satisfies (28), since 2^(mgLen−1) = 2^15 = 32,768 > m = 14. According to (3), Fq = 240,507,095,595,400,845 and Fg = 8,728. The public key component, h, is calculated according to (4) as follows:
h = Fq · g mod q = 42,620,364,389,368,179.
The GLR algorithm, Code 1, can be launched with inputs V1 = (1, h) and V2 = (0, q). GLR terminates in 15 iterations and returns the shortest vector v1 = (F, G) = (214653159, 709596869), see Fig. 2. From (34),
812,397,637 = rmin ≥ (q + g·|F|)/|G| = 812,397,637.
Thus, the random value r = 812,397,637 is selected to meet (36):
4,398,046,511,104 = 2^(qLen−mgLen−1) > r ≥ max(α · 2^(qLen/2) = 812,397,633.7, rmin = 812,397,637) = rmin.
The ciphertext, e, is calculated according to (7) as follows:
e = r · h + m mod q = 65,549.
For decryption, in the first step, according to (8), we multiply the ciphertext, e, by the private key f:
a = f · e mod q = 53,251,852,707,713.
In the second decryption step, according to (10), we multiply a by Fg to get the message m as follows:
m = a · Fg mod g = 14.
We see that the message, m, is correctly retrieved.
Now, we attack RCPKC using GLR Code 1. GLR terminates in 15 iterations, finding v1 = (F, G) = (214653159, 709596869) ≠ (f, g), as shown on the screenshot in Fig. 2. On the other hand, we see that (35) is satisfied as follows:
576,474,822,603,342,779 = |G · r + F · m| > q = 576,460,752,303,423,488.
Hence, trying to decrypt the ciphertext using (F, G) fails, as follows:
aGLR = F · e mod q = 14,070,299,919,291,
mGLR = FG · aGLR mod G ≠ m = 14.
We see that the original message is not disclosed. Thus, using the shortest vector returned by GLR for the ciphertext decryption actually fails.

Fig. 2. Screenshot of the Code 2 run on Example 2. It shows that GLR terminates in 15 iterations finding v1 = (F, G) = (214653159, 709596869), which is neither (f, g) nor (r, e − m); RCPKC decryption using (F, G) results in mGLR = 65549, which is not the original m = 14.
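The complete Example 2 flow can be replayed with the short, self-contained sketch below (ours, not the authors' MuPAD/C99 code). The attacker's shortest vector (F, G) is taken from Fig. 2 rather than recomputed, and α is the constant from (23); the last line checks that the correctness condition (9) is indeed violated for (F, G), as (35) requires.

```python
# Sketch of RCPKC (Section IV) with the Example 2 parameters; requires Python 3.8+.
ALPHA = 1.07                                        # ~ sqrt(Hermite's constant), see (23)

mgLen, qLen = 16, 59                                # satisfy (32)
q = 1 << qLen                                       # q = 2^59
f, g, m = 812_397_637, 65_535, 14                   # f, g meet (28), (30); m < 2^(mgLen-1)

Fg, Fq = pow(f, -1, g), pow(f, -1, q)               # inverses (3)
h = (Fq * g) % q                                    # public key (4)

F, G = 214_653_159, 709_596_869                     # SV a GLR attacker obtains (Fig. 2)
r_min = -(-(q + g * F) // G)                        # ceil((q + g|F|)/|G|), see (34)
r = max(r_min, int(ALPHA * 2 ** (qLen / 2)) + 1)    # r also below 2^(qLen-mgLen-1), see (36)

e = (r * h + m) % q                                 # encryption (7)
a = (f * e) % q                                     # valid user's Step 1 (8)
print("valid user decrypts m =", (a * Fg) % g)      # 14
print("(9) violated for (F, G):", abs(G * r + F * m) > q)   # True, as shown in (35)
```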
V. RCPKC PERFORMANCE EVALUATION

Herein, we use the NTRU parameters EES401EP2 [16] of security level k = 112 bits:
N = 401, p = 3, q = 2048, df1 = df2 = 8, df3 = 6, dg = 133, dr1 = dr2 = 8, dr3 = 6. (38)
In order to meet the same security level, the RCPKC settings satisfying (32) are:
qLen = 473, mgLen = 225. (39)
We use the NTRU code [17], and we have implemented RCPKC in the C99 language, the same as used in [17], with the MPIR library [18], on a PC equipped with a 2 GHz Intel Pentium Dual CPU E2180, 3 GB RAM, and Windows 10. Both the NTRU code [17] and our RCPKC are implemented in Visual Studio 2017. The NTRU parameters (38) and the RCPKC parameters (39) are used. We measure the CPU encryption and decryption time of RCPKC and NTRU for 10^3, 10^4, 10^5, and 10^6 runs (see Tables I and II with the respective averages). In each run, new secret and public keys and messages are chosen randomly for NTRU and RCPKC.

TABLE I. AVERAGE ENCRYPTION TIME OF NTRU AND RCPKC FOR DIFFERENT NUMBERS OF RUNS

Number of runs                               10^3           10^5           10^6
RCPKC average encryption time (s)            6 x 10^-6      7 x 10^-6      7 x 10^-6
NTRU average encryption time (s)             1.52 x 10^-4   1.45 x 10^-4   1.46 x 10^-4
NTRU/RCPKC averages encryption time ratio    25.33          20.71          20.85
NTRU/RCPKC ratio averaged over all runs      23.34

From Table I (Table II), we see that RCPKC is 23.34 (7.5) times faster than NTRU in encryption (decryption). The large difference between the RCPKC encryption and decryption times is due to the observation, made in our experiments, that the multiplication time depends on the length of the operands, and greater-length operands are used in the decryption. In comparison with the NTRU variants presented in Section I, BQTRU [6] is faster than NTRU in encryption/decryption by 7 times, ETRU [5] is faster than NTRU in encryption/decryption by 1.45/1.72 times for N = 400, and MaTRU is faster than NTRU by 2.5 times in encryption/decryption, while the other NTRU variants introduced in Section I have not published information regarding their performance. We see that RCPKC is faster than the fastest recently published NTRU variant, BQTRU, by more than 3 times in encryption. Table III compares the RCPKC and NTRU variants' encryption and decryption times.

VI. CONCLUSION

Thus, we have proposed a secure and effective congruential, modulo q, public-key cryptosystem using big numbers, RCPKC, described in Section 4. It uses the same encryption/decryption mechanism as NTRU does (see Section II-E) but works with numbers. NTRU is susceptible to LBRA by the LLL algorithm because its private key is selected as polynomials with small coefficients to provide a decryption correctness condition similar to (9). Actually, for the sake of greater effectiveness, NTRU allows decryption failure. Contrary to NTRU, RCPKC is resistant to LBRA because its private key components, f, g, are chosen big with respect to q, to form a two-component vector with a norm exceeding the Minkowski boundary (22)-(25) for the shortest vector in a two-dimensional lattice. Hence, LBRA by the GLR algorithm, returning the shortest vector in a two-dimensional lattice, fails to find the large-norm private key vector, (f, g); moreover, the key is set in such a way that an SV obtained by GLR on the induced lattice E(V1, V2) defined by (19) fails to decrypt the plaintext correctly, as shown in (35). In spite of the big numbers f, r meeting (30) used in RCPKC, the decryption correctness condition (9) is guaranteed to hold (see (33)) due to the use of conditions (28), (30), (32) instead of the conditions (1), (5), (6) used in the original insecure CPKC (see Sections II-A to II-C) considered in [13]. Note that the insecurity of the original CPKC stems from the use of the conditions (1), (5), (6), defining numbers f, g, m, r smaller than q that meet Minkowski's boundary (22) and the decryption correctness condition (9). Since the security of RCPKC with respect to other known attacks on NTRU is not less than that of NTRU, we conclude that RCPKC is more secure than NTRU. We have shown (see Section I) that the multiple known variants of NTRU aim at improving its effectiveness by lowering the polynomial order used, thanks to the expansion of their coefficients. RCPKC uses numbers, i.e. minimal possible, order zero, polynomials, which makes it about 25 (7) times more effective in encryption (decryption) than NTRU and more than 3 times more effective in encryption with respect to the fastest recently published NTRU variant, BQTRU [6], as the experiments show (see Tables I-III). RCPKC can be used in the various applications considered in Section I, being more secure and effective than NTRU.
TABLE II. AVERAGE DECRYPTION TIME OF NTRU AND RCPKC FOR DIFFERENT NUMBERS OF RUNS

Number of runs                               10^3           10^5           10^6
RCPKC average decryption time (s)            2.1 x 10^-5    1.9 x 10^-5    2.0 x 10^-5
NTRU average decryption time (s)             1.55 x 10^-4   1.47 x 10^-4   1.44 x 10^-4
NTRU/RCPKC averages decryption time ratio    7.38           7.74           7.20
NTRU/RCPKC ratio averaged over all runs      7.50

TABLE III. NTRU VERSUS ALGORITHMS' (RCPKC AND DIFFERENT NTRU VARIANTS) ENCRYPTION AND DECRYPTION TIME RATIO

Algorithm          NTRU/Algorithm encryption time    NTRU/Algorithm decryption time
Proposed RCPKC     23.34                             7.5
BQTRU [6]          7                                 No data
MaTRU [9]          2.5                               2.5
ETRU [5]           1.45                              1.72

REFERENCES

[7] Y. Yu, G. Xu, and X. Wang, "Provably secure NTRU instances over prime cyclotomic rings," in Public-Key Cryptography - PKC 2017 - 20th IACR International Conference on Practice and Theory in Public-Key Cryptography, Amsterdam, The Netherlands, March 28-31, 2017, Proceedings, Part I, ser. Lecture Notes in Computer Science, S. Fehr, Ed., vol. 10174. Springer, 2017, pp. 409-434. [Online]. Available: https://doi.org/10.1007/978-3-662-54365-8_17
[8] W. D. Banks and I. E. Shparlinski, "A variant of NTRU with non-invertible polynomials," in Progress in Cryptology - INDOCRYPT 2002, Third International Conference on Cryptology in India, Hyderabad, India, December 16-18, 2002, ser. Lecture Notes in Computer Science, A. Menezes and P. Sarkar, Eds., vol. 2551. Springer, 2002, pp. 62-70. [Online]. Available: https://doi.org/10.1007/3-540-36231-2_6
[9] M. Coglianese and B. Goi, "MaTRU: A new NTRU-based cryptosystem," in Progress in Cryptology - INDOCRYPT 2005, 6th International Conference on Cryptology in India, Bangalore, India, December 10-12, 2005, Proceedings, ser. Lecture Notes in Computer Science, S. Maitra, C. E. V. Madhavan, and R. Venkatesan, Eds., vol. 3797. Springer, 2005, pp. 232-243. [Online]. Available: https://doi.org/10.1007/11596219_19
[10] N. Vats, "NNRU, a noncommutative analogue of NTRU," arXiv preprint arXiv:0902.1891, 2009.
[11] A. K. Lenstra, H. W. Lenstra, and L. Lovász, "Factoring polynomials with rational coefficients," Mathematische Annalen, vol. 261, no. 4, pp. 515-534, 1982.
[12] J. Hoffstein, J. H. Silverman, and W. Whyte, "Estimated breaking times for NTRU lattices," NTRU Cryptosystems, Tech. Rep. 012, version 2, 2003.
[13] J. Hoffstein, J. Pipher, and J. H. Silverman, An Introduction to Mathematical Cryptography. New York: Springer, 2014. [Online]. Available: https://doi.org/10.1007/978-1-4939-1711-2_7
[14] N. Bourbaki, Topological Vector Spaces: Chapters 1-5, 1st ed., ser. Elements of Mathematics. Springer-Verlag Berlin Heidelberg, 2003.
[15] I. Smeets, A. Lenstra, H. Lenstra, L. Lovász, P. Q. Nguyen, and B. Vallée, The LLL Algorithm: Survey and Applications. Berlin, Heidelberg: Springer, 2010.
[16] EESS#1: Implementation aspects of NTRU. Last accessed 18/1/2018. [Online]. Available: https://github.com/NTRUOpenSourceProject/ntru-crypto/blob/master/doc/EESS1-v3.1.pdf
[17] W. Whyte and M. Etzel. Open source NTRU public key cryptography algorithm and reference code. Last accessed 18/1/2018. [Online]. Available: https://github.com/NTRUOpenSourceProject/ntru-crypto
[18] W. H. B. Gladman, J. Moxham et al. (2015) MPIR: Multiple precision integers and rationals. Version 2.7.0, http://mpir.org, last accessed 18/1/2018.
[1] K. K. Chauhan, A. K. S. Sanger, and A. Verma, “Homomorphic


encryption for data security in cloud computing,” in 2015 International
Conference on Information Technology, ICIT 2015, Bhubaneswar,
India, December 21-23, 2015, 2015, pp. 206–209. [Online]. Available:
https://doi.org/10.1109/ICIT.2015.39
[2] S. Fehr, Ed., Public-Key Cryptography - PKC 2017 - 20th IACR
International Conference on Practice and Theory in Public-Key
Cryptography, Amsterdam, The Netherlands, March 28-31, 2017,
Proceedings, Part II, ser. Lecture Notes in Computer Science, vol.
10175. Springer, 2017. [Online]. Available: https://doi.org/10.1007/978-
3-662-54388-7
[3] J. Hoffstein, J. Pipher, and J. H. Silverman, “NTRU: A ring-based
public key cryptosystem,” in Algorithmic Number Theory, Third
International Symposium, ANTS-III, Portland, Oregon, USA, June
21-25, 1998, Proceedings, ser. Lecture Notes in Computer Science,
J. Buhler, Ed., vol. 1423. Springer, 1998, pp. 267–288. [Online].
Available: https://doi.org/10.1007/BFb0054868
[4] “Speed records for NTRU,” in Topics in Cryptology - CT-RSA 2010, The
Cryptographers’ Track at the RSA Conference 2010, San Francisco, CA,
USA, March 1-5, 2010. Proceedings, ser. Lecture Notes in Computer
Science, J. Pieprzyk, Ed., vol. 5985. Springer, 2010, pp. 73–88.
[Online]. Available: https://doi.org/10.1007/978-3-642-11925-5 6
[5] K. Jarvis and M. Nevins, “ETRU: NTRU over the Eisenstein integers,”
Designs, Codes and Cryptography, vol. 74, no. 1, pp. 219–242, Jan
2015. [Online]. Available: https://doi.org/10.1007/s10623-013-9850-3
[6] K. Bagheri, M.-R. Sadeghi, and D. Panario, “A non-
commutative cryptosystem based on quaternion algebras,”
Designs, Codes and Cryptography, Dec 2017, published online
https://link.springer.com/journal/10623/onlineFirst/page/1.

26
Review: Phishing Detection Approaches

AlMaha Abu Zuraiq, Computer Science Department, Princess Sumaya University for Technology, Amman, Jordan, alm20178050@std.psut.edu.jo
Mouhammd Alkasassbeh, Computer Science Department, Princess Sumaya University for Technology, Amman, Jordan, m.alkasassbeh@psut.edu.jo

Abstract—Phishing is one of the most common attacks on the internet. It employs social engineering techniques, such as deceiving users with forged websites, in an attempt to gain sensitive information such as credentials and credit card details. This information can be misused, resulting in large financial losses to users. Phishing detection algorithms can be an effective approach to safeguarding users from such attacks. This paper reviews different phishing detection approaches, including Content-Based, Heuristic-Based, and Fuzzy Rule-Based approaches.

Keywords—Phishing, detection, fuzzy, machine learning, malicious website.

I. INTRODUCTION

The internet is everywhere today; we use web services for a range of activities such as sharing knowledge, social communication, and performing various financial activities, including buying, selling, and transferring money. Malicious websites are a serious threat to internet users, and unaware users can become victims of malicious URLs that host undesirable content such as spam, phishing, drive-by-download, and drive-by-exploit attacks.

Phishing is a common attack on the internet. It is defined as the social engineering process of luring users into fraudulent websites to obtain their personal or sensitive information, such as user names, passwords, addresses, credit card details, social security numbers, or any other valuable information. According to the Anti-Phishing Working Group (APWG) reports, the number of phishing incidents reported to the organization over the last quarter of 2016 was 211,032 [1], and they increased by 12% in the last quarter of 2018, which saw 239,910 reports [2]. Furthermore, a recent Microsoft Security Intelligence report (Volume 24) found that phishing attacks were at the top of the web attacks discovered in 2018, and we can only expect them to continue increasing [3].

The major challenge when detecting phishing attacks lies in discovering the techniques utilized. Phishers continuously enhance their strategies and can create web pages that are able to protect themselves against many forms of detection. Accordingly, developing robust, effective, and up-to-date phishing detection methods is necessary to oppose the adaptive techniques employed by phishers [4].

Surveying the literature on phishing detection techniques, we can categorize them into the following approaches: Blacklist-based, Content-based, Heuristic-based, and Fuzzy rule-based approaches. Each of these approaches has its own characteristics and limitations.

The blacklist approach maintains a list of suspicious or malicious URLs that are collected using different sources such as Google Safe Browsing, PhishTank, and users' voting. When a web page is opened, the browser searches the blacklist for it and alerts the user if the webpage is found. The blacklist can be stored on the user's machine or on a server [5]. Blacklists are often used to classify websites as malicious or legitimate; while these techniques have low false-positive rates, they lack the ability to classify newly produced malicious URLs [6].

The content-based approach deploys a deep analysis of the pages' content, building classifiers and extracting features from page contents and third-party services such as search engines and DNS servers. Yet these methods can be ineffective because of the massive number of training features and the reliance on third-party servers, which threatens users' privacy by uncovering their browsing history [4].

In a heuristic-based approach, the detection technique is based on employing various discriminative features extracted by understanding and analyzing the structure of phishing web pages. The method used in processing these features plays a considerable role in classifying web pages effectively and accurately [7].

Fuzzy logic permits intermediate levels among values. In the fuzzy rule-based approach, it is utilized to classify webpages based on the level of phishness that appears in the pages, by implementing and employing a specific group of metrics and predefined rules [8]. Using a fuzzy approach allows processing of ambiguous variables. Fuzzy logic integrates human experts to clarify those variables and the relations between them. Fuzzy logic approaches also use linguistic variables to express phishing features and the likelihood that a web page is phishing [9].

The main purpose of this study is to present a comprehensive survey of existing approaches used in phishing detection. In the literature review, the related work is discussed based on the aforementioned classification of phishing detection approaches.

II. LITERATURE REVIEW

The review of existing studies in phishing detection is categorized into three groups: the Content-Based Approach, the Heuristic-Based Approach, and the Fuzzy Rule-Based Approach. The review is based on studies published between 2013 and 2018.

A. Content-Based Approach
Traditional anti-phishing methods based on visual similarities are effective only in detecting phishing web pages that show a high similarity rate in their contents to the counterpart legitimate web page. Therefore, this work proposed a novel method for detecting phishing web pages by assigning weights to the words drawn from URLs and HTML contents. These words may include brand names that phishers tend to place in various parts of a URL to make it look like the real one. Weights are assigned according to their presence at different positions in URLs. Then, these weights are combined with their term frequency-inverse document frequency (TF-IDF) weights, a numeric statistic intended to show how significant a word is to a document. The most promising words are chosen and sent to Yahoo Search, which returns the domain name with the highest frequency among the top 30 results. Eventually, they decide whether the website is authentic or not by comparing the owners of the returned domain names; a WHOIS lookup is applied to detect the owner of such a domain name [10].
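The weighting idea of [10] can be sketched as follows: tokens taken from the URL and the page text are scored by where they occur and combined with their TF-IDF weight, and the top-scoring words would then be submitted to a search engine, with the returned domain checked against WHOIS. The position weights and helper functions below are illustrative assumptions, not the exact scheme of [10].

# Sketch of the word-weighting idea behind a URL-assisted brand-name
# weighting system: tokens from the URL and page text are scored by position
# (weights below are invented) and combined with TF-IDF weights.
import re
from sklearn.feature_extraction.text import TfidfVectorizer

POSITION_WEIGHT = {"domain": 1.0, "subdomain": 3.0, "path": 2.0, "body": 1.0}

def tokens_with_positions(url: str, body_text: str):
    m = re.match(r"https?://([^/]+)(/.*)?", url)
    host, path = m.group(1), m.group(2) or ""
    parts = host.split(".")
    yield from ((t, "subdomain") for t in parts[:-2])
    yield from ((t, "domain") for t in parts[-2:-1])
    yield from ((t, "path") for t in re.findall(r"[a-z0-9]+", path.lower()))
    yield from ((t, "body") for t in re.findall(r"[a-z0-9]+", body_text.lower()))

def score_words(url, body_text, corpus):
    """Combine illustrative position weights with TF-IDF weights over a corpus."""
    vec = TfidfVectorizer()
    vec.fit(corpus + [body_text])
    tfidf = dict(zip(vec.get_feature_names_out(),
                     vec.transform([body_text]).toarray()[0]))
    scores = {}
    for tok, pos in tokens_with_positions(url, body_text):
        scores[tok] = scores.get(tok, 0.0) + POSITION_WEIGHT[pos] * tfidf.get(tok, 0.1)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

corpus = ["welcome to example bank online banking", "generic page about weather"]
print(score_words("http://paypal.secure-login.example.com/account",
                  "please verify your paypal account password", corpus)[:3])
# brand-like tokens such as "paypal" tend to float to the top of the ranking

In the full method these top-ranked words, rather than the raw score, are what is sent to the search engine before the WHOIS ownership comparison.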
Instead of utilizing brand names and word weights, this study presented a novel approach that utilizes a logo image to determine the identity of a web page by matching real and fake webpages. The proposed approach is composed of two phases: logo extraction and identity verification. In the first phase, a machine learning algorithm is used to detect the right logo image, while in the second phase, the image search offered by Google is used to return the fake identity, which is then utilized for the verification. Because the relationship between the logo and the domain name is unique, the domain name is treated as the identity of the logo. So, a comparison of the domain name retrieved by Google with the one from the web page query permits distinguishing between phishing and legitimate web pages. The experimental results show that the logo extraction phase enhanced phishing detection accuracy and is more useful than extraction phases based on textual features. The system was evaluated using two different datasets made of 1140 phishing webpages obtained from PhishTank and legitimate webpages obtained from Alexa. The accuracy of the proposed system is 93.4% [11].

Aside from the similarities in content between legitimate and phishing web pages, this paper concludes that the consistency between the URL and the web page contents is the master key concept to examine and analyze. So, they proposed a method for recognizing suspicious websites as legitimate or phishing based on the literal and conceptual consistency between the URL and the web page contents. The proposed method is comprised of two phases: a prefiltering phase and a classification phase. Firstly, web pages whose URLs and contents are consistent are filtered out. Then the remaining web pages are categorized according to different factors that include conceptual similarity and heuristic scores obtained from the URL. The experimental results illustrate that the accuracy is 98% [6].

Another study concerning the relationship between the contents and URLs of web pages prepares two feature sets to make the phishing detection process more efficient. The proposed feature sets are utilized to define the relationship between the URL of the web page and its contents, and an approximate string-matching algorithm is used to assess this relationship by utilizing the customized feature sets. Each feature set holds four features: the first set assesses the resource identity of the web page, and the second set identifies the access protocol of the web page components. These features are separate from third-party services such as search engines, blacklists, and whitelists. The two feature sets are extracted from the web page contents and the page document object model (DOM). A support vector machine (SVM) algorithm is applied to classify web pages using a web page feature vector composed of 17 features: 9 features applied in related works and 8 features from the aforementioned feature sets proposed in this work. The essential contribution of this work is to upgrade the performance of phishing detection methods by producing some novel features. The experiments indicate that the accuracy of the proposed model is 99.14% [12].

B. Heuristic-Based Approach

This approach depends on using several features extracted by evaluating the components of phishing web pages.

At the beginning of this study, it is mentioned that email-based spam filtering methods are unqualified to protect other, different web services. Therefore, the countermeasures taken against such threats must be generalized over web services to defend the user from phishing URLs. Thus, this study presented a system that classifies, clusters, categorizes, and ranks URLs automatically, based on host-based and lexical URL features. There are two main contributions in this study: developing a hybrid technique that integrates both clustering and classification, and using a categorization obtained from Microsoft Reputation Services (MRS) to implement URL ranking. Benign URLs are gathered from the DMOZ open directory project and phishing URLs are gathered from PhishTank. The URLs are ranked by employing clustering and categorization, and the results show that the cluster labels enhance the accuracy of the classifier from 97.08% to 98.46% [13].

Due to the importance of defining valuable and clear features, this paper proposed a novel model to detect phishing websites using six heuristic features extracted from URLs (primary domain, subdomain, path domain) and website rank (page rank, Alexa rank, Alexa reputation). More precisely, the similarities between phishing URLs and legitimate URLs are considered. The approach is evaluated by utilizing a training dataset of 11,660 phishing web pages and ten testing datasets, each holding 1,000 phishing web pages or 1,000 legitimate web pages. The experiment results exhibit that the
accuracy of the proposed system in detecting phishing web pages is 97.16% [14].

This study criticizes existing solutions for detecting phishing webpages, such as antiviruses and firewalls, which do not completely protect users from web spoofing attacks. Also, the application of the Secure Socket Layer (SSL) and digital certificates (CA) is not fully effective because some types of SSL and CA can be faked even when the web pages appear to be legitimate. So, this paper proposed a phishing detection method that applies multiple steps to check URL and domain name features. The performance of this work is assessed by applying a dataset of URLs randomly collected from PhishTank and the Yahoo directory; 100 URLs are used (59 legitimate URLs and 41 malicious URLs). PhishChecker detected 68 of the URLs as legitimate and 32 URLs as malicious; the results show that the accuracy of PhishChecker in detecting phishing is 96% [15].

In this paper, URLs are also utilized for checking whether the web page is phishing or not. They proposed a heuristic approach that is able to detect zero-day phishing attacks that cannot be detected by list-based methods; in addition, it is faster than visually based approaches. The system is implemented as a desktop application named PhishShield, which takes a URL as input and classifies it as a phishing or legitimate website. The heuristic features used in this study are drawn out from the web page by using JSoup without any user intervention. To evaluate the performance of the PhishShield application, they obtained 1600 phishing websites from PhishTank and 250 legitimate websites, of which 176 were obtained from PhishTank and the rest were collected randomly. The accuracy attained by the proposed application is 96.57% [16].

Some studies combined a heuristic-based approach with a machine learning algorithm to enhance the classification of web pages. Machine learning algorithms use clearly defined features and effective algorithms to produce an accurate classifier model to distinguish between phishing and legitimate web pages.

First of all, this paper suggested a heuristic-based phishing detection method used to recognize phishing web pages. In the beginning, the system extracts and utilizes URL-based features. Then, these features are applied to machine learning algorithms to recognize whether the web page is phishing or legitimate. This system used 10 features on the input URL dataset. The output results are categorized as either Legitimate or Phishing. Next, a Support Vector Machine algorithm is used on the extracted features to find the values of FP, TP, FN, and TN. Also, the values of the F1-measure and the accuracy are calculated, where the accuracy value was 96%. The dataset of URLs is collected from PhishTank and the Yahoo directory and contains 200 legitimate and phishing web page URLs [17].

This paper also implemented a heuristic-based phishing detection approach with machine learning algorithms applied to URL features. The proposed method elicited URL features of web pages requested by the user and applied them to decide whether a requested web page is phishing or not. To choose the most effective classifier for employing URL-based features, five machine learning techniques are utilized: support vector machine (SVM), naive Bayes, decision tree, k-nearest neighbor (KNN), random tree, and random forest. To evaluate and train a classifier, a dataset was collected of 3,000 phishing webpages from PhishTank and 3,000 legitimate webpages from DMOZ. The experiment results show that the machine learning classifier that achieved the best performance is Random Forest (RF) with 98.23% accuracy [18].

This paper proposed a heuristic-based method to detect phishing web pages by utilizing URL features. A set of 138 features is developed based on previous work. These gathered features are grouped into four different classes: Lexical-based features, Keyword-based features, Reputation-based features, and Search engine-based features. The system is evaluated using a dataset that consists of more than 16,000 phishing and 31,000 non-phishing URLs. Seven different classifiers are implemented: Support Vector Machines (SVM with RBF kernel), SVM with linear kernel, Multilayer Perceptron (MLP), Random Forest (RF), Naïve Bayes (NB), Logistic Regression (LR), and C4.5. According to the experiment results, Random Forest (RF) achieved a higher accuracy rate and a lower error rate [19].

In the two previous works, a heuristic-based approach is implemented with a machine learning algorithm. Each of them utilized its own dataset, employed several features, and applied different machine learning algorithms, but in both studies the Random Forest algorithm achieved the most effective classification of webpages.

The next two studies demonstrate hybrid machine learning approaches that benefit from the strengths of each algorithm while compensating for its weaknesses, because more effective techniques are needed to limit the fast evolution of phishing attacks.

Therefore, this study proposed a method that combines two algorithms: the K-nearest neighbors (KNN) algorithm, which is an effective approach against noisy data, and the Support Vector Machine (SVM) algorithm, which is a robust classifier. The combination process is done in two phases: first KNN is applied, then SVM is employed as a classification tool. The dataset used for the experiment is taken from related work and contains more than 1353 samples gathered from various sources. Each sample record is composed of nine features in addition to the class label, which is Phishing, Legitimate, or Suspicious web page. Consequently, the clarity of KNN is integrated with the effectiveness of SVM, regardless of their disadvantages when used individually. The accuracy of the proposed method is 90.04% [5].

Likewise, this paper proposed a fast and accurate phishing detection method that combines both Naïve Bayes (NB) and Support Vector Machine (SVM), utilizing features of URLs and webpage contents. NB is used for detecting web pages, but if the web pages are not detected efficiently and are still suspicious, SVM is employed to reclassify them. The utilized dataset is generated from PhishTank, with 600 phishing web pages and 400 legitimate ones; 100 legitimate and 100 phishing web pages are used as the training set, and the rest are used as the testing dataset. Experimental results exhibit that this proposed approach achieved high detection accuracy and lower detection time [20].
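Several of the heuristic studies above ([17]-[19]) share the same skeleton: derive lexical features from the URL and feed them to a supervised classifier, with Random Forest reported as the strongest model. The sketch below illustrates that pipeline with scikit-learn; the feature list and the tiny training set are illustrative placeholders, not the 10 or 138 features used in those papers.

# Illustrative URL-feature + Random Forest pipeline (not the feature sets of [17]-[19]).
import re
from urllib.parse import urlparse
from sklearn.ensemble import RandomForestClassifier

def url_features(url: str):
    p = urlparse(url)
    host = p.netloc
    return [
        len(url),                                             # overall URL length
        host.count("."),                                      # number of subdomain levels
        url.count("-") + url.count("@"),                      # suspicious punctuation
        int(bool(re.search(r"\d+\.\d+\.\d+\.\d+", host))),    # raw IP address in host
        int(p.scheme != "https"),                             # no TLS
        int(any(w in url.lower() for w in ("login", "verify", "secure", "update"))),
    ]

train_urls = [
    ("https://www.example.com/about", 0),
    ("https://shop.example.org/item?id=3", 0),
    ("http://192.168.13.7/paypal/login-verify", 1),
    ("http://secure-update.example-bank.account-check.com/login", 1),
]
X = [url_features(u) for u, _ in train_urls]
y = [label for _, label in train_urls]

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(clf.predict([url_features("http://confirm-account.example.com/verify")]))
# likely [1] (phishing) on this toy training data

In practice the training set would be the tens of thousands of labelled URLs listed in Table I, and the feature vector would be far richer; the point of the sketch is only the overall feature-extraction-then-classify structure.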
C. Fuzzy Rule-Based Approach

There are many studies that suggest different phishing detection techniques based on different properties such as URLs, web page contents, or a combination of both. However, each study has its own advantages and drawbacks, so research in this field is always required because the most appropriate, effective, and accurate method does not exist yet.

In trying to get the benefits of a fuzzy logic system, this study proposed a novel approach that targets URL features with a fuzzy logic method. The system is applied in five phases: selecting URL features; calculating the values of 6 heuristics; calculating 12 fuzzy values for the 6 heuristics from membership functions; defuzzification by calculating the mean of the 6 fuzzy values of the phishing linguistic label (MP) and the mean of the 6 fuzzy values of the legitimate linguistic label (ML); and finally comparing the values of MP and ML to classify the web page. The approach was assessed with 11,660 phishing web pages and 5,000 legitimate web pages. The accuracy of the proposed method was 98.17% [21].

This paper presented a phishing detection method using a fuzzy logic technique with five heuristic labels (Highly Legitimate, Legitimate, Suspicious, Phished, and Highly Phished). Classifying web pages is based on specific predetermined rules split into 3 main groups: address bar-based features, domain-based features, and HTML and JavaScript-based features. Whereas the first group is used to recognize webpage authenticity, the second group preserves webpage integrity, and the last group gives reliability to a webpage. The proposed model consists of four steps: in the first step, the fuzzification step converts crisp inputs to fuzzy inputs; then a set of fuzzy rules is defined; after that, the membership functions of the fuzzy sets are determined; and in the final step, the defuzzification process produces the crisp outputs. The system is tested on a dataset of 300 URLs randomly collected from PhishTank and DMOZ. The evaluation is based on a fuzzy logic method using a triangular membership function, followed by three defuzzification methods: Mean of Maximum, the Weighted Average method, and the Centroid method [22].
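The fuzzy classification step described in [21] (and, with different labels, in [22]) can be pictured as follows: each heuristic value receives a "legitimate" and a "phishing" membership degree from a membership function, the means ML and MP are computed, and the larger one decides the class. The triangular membership functions below are illustrative placeholders rather than the ones used in those studies.

# Sketch of an MP/ML-style fuzzy decision over six heuristic scores in [0, 1].
# Membership function breakpoints are invented for illustration.
def tri(x, a, b, c):
    """Triangular membership with feet a, c and peak b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def legit_membership(h):     # a high heuristic score looks legitimate
    return tri(h, 0.3, 1.0, 1.7)

def phish_membership(h):     # a low heuristic score looks phishing
    return tri(h, -0.7, 0.0, 0.7)

def classify(heuristics):
    """heuristics: six scores in [0, 1], one per URL/rank heuristic."""
    ml = sum(map(legit_membership, heuristics)) / len(heuristics)   # mean legitimate value
    mp = sum(map(phish_membership, heuristics)) / len(heuristics)   # mean phishing value
    return ("legitimate" if ml >= mp else "phishing"), ml, mp

print(classify([0.9, 0.8, 0.7, 0.9, 0.6, 0.8]))   # mostly high scores -> legitimate
print(classify([0.1, 0.2, 0.4, 0.1, 0.3, 0.2]))   # mostly low scores  -> phishing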
Instead of using a standalone fuzzy system, this work applied a Neuro-Fuzzy scheme, which is an integration of fuzzy logic and a neural network. This integration enables the use of linguistic and numeric characteristics. The scheme utilized 288 features extracted from five inputs (Legitimate site rules, User-behavior profile, PhishTank, User-specific sites, and Pop-Ups from emails), which had not been used together in one system platform before, and that is the main contribution of this work. While a neural network is effective in treating raw data, fuzzy logic offers a high level of reasoning using numeric and linguistic characteristics. Neuro-Fuzzy was selected due to its ability to learn from data, from the neural network point of view, and to create linguistic rules, from the fuzzy viewpoint. The experiment tested the 288 features by applying 2-fold cross-validation; the resulting accuracy is 98.5% [23].

This paper also proposed a phishing detection method employing fuzzy systems and a neural network, but with a dataset of 300 values extracted from six data sources: Legitimate site rule, User behavior profile, PhishTank, User-specific-site, Pop-up windows, and User-credential profile. These are the same sources as in the previous study, except that a new source is added, the User-credential profile. The proposed system also applies 2-fold cross-validation to train and test the model. The fuzzy model has five functions to understand and make judgments, comprising an input layer, fuzzification, a rule base, normalization, and defuzzification. The proposed system achieved 99.6% accuracy, which is better than the previous study [24].

This paper discusses different phishing techniques developed by other researchers and an efficient way of distinguishing web pages. It is done by taking the benefits of the genetic algorithm for treating phishing web pages and then applying a fuzzy logic technique. Fuzzy logic is implemented to assess the phishing degree of various web pages based on a set of pre-defined rules: if the URL meets the specified rules, then it is estimated to be a phishing webpage and given a score. Ten sets of pre-defined rules are used to assess the phishiness degree of the URL. If a rule matches the webpage URL, it is weighted by a 0.1 score; the total over all ten layers, which varies between 0 and 1, then denotes the phishiness degree, where 1 indicates a very legitimate web page and 0 a very phishy web page. According to these outputs, it can be determined whether the webpage is fake or not. There are four phases to detecting phishing webpages using fuzzy logic: fuzzification, where crisp inputs are transformed into fuzzy inputs; rule evaluation using if...then statements; aggregation of the rule outputs by unifying the outputs of all rules; and defuzzification, where the fuzzy output is transformed into a crisp output (phishy or legitimate). This study concludes that even if a web page contains phishy characteristics, it does not mean that the whole page is phishy. Therefore, using fuzzy logic is one of the most effective methods to obtain the phishiness degree of a web page [8].

III. DATASETS SUMMARY

Table I lists each paper and the corresponding dataset used.

TABLE I. DATASETS

Approach used               Reference   Data sets applied
Content-Based Approach      [10]        167 phishing webpages downloaded from PhishTank; 51 legitimate webpages selected manually.
                            [11]        1,140 webpages: phishing webpages downloaded from PhishTank, legitimate webpages downloaded from Alexa.
                            [6]         2,826 phishing webpages from PhishTank; 13,416 legitimate webpages directly crawled from the Internet.
                            [12]        3,066 phishing webpages, 686 legitimate.
Heuristic-Based Approach    [13]        Phishing webpages downloaded from PhishTank; legitimate webpages downloaded from DMOZ.
                            [14]        11,660 phishing webpages downloaded from PhishTank; 5,000 legitimate webpages downloaded from DMOZ.
                            [15]        59 legitimate webpages downloaded from the Yahoo directory; 41 phishing webpages downloaded from PhishTank.
                            [16]        1,600 phishing webpages downloaded from PhishTank; 250 legitimate webpages, of which 176 are downloaded from PhishTank and the remaining are considered randomly.
                            [17]        200 legitimate webpages downloaded from the Yahoo directory; 200 phishing webpages downloaded from PhishTank.
                            [18]        3,000 phishing webpages downloaded from PhishTank; 3,000 legitimate webpages downloaded from DMOZ.
                            [19]        11,361 phishing webpages downloaded from PhishTank; 22,213 legitimate webpages downloaded from DMOZ.
                            [20]        600 phishing webpages downloaded from PhishTank; 400 legitimate webpages downloaded from PhishTank.
Fuzzy rule-based approach   [21]        11,660 phishing webpages downloaded from PhishTank; 5,000 legitimate webpages downloaded from DMOZ.
                            [22]        300 random webpages: phishing webpages downloaded from PhishTank, legitimate webpages downloaded from DMOZ.
                            [24]        11,660 phishing webpages downloaded from PhishTank; 10,000 legitimate webpages downloaded from DMOZ.
IV. CONCLUSION

Phishing web pages have increased these days, resulting in huge financial losses. The need for methods of protection from these phishing web pages has become very pressing.

Formerly, the blacklist-based approach was the most common method used in the detection of phishing web pages. The drawback of that approach is its inability to recognize non-blacklisted or temporary phishing webpages. Therefore, more robust and effective approaches for detecting phishing attacks have been developed. In this paper, different approaches are reviewed according to three main groups: the Content-Based approach, the Heuristic-Based approach, and the Fuzzy rule-based approach.

The Content-Based approach analyses webpage content, for example by extracting words such as brand names from URLs or HTML contents and giving weights to them, by extracting logo images and comparing them with the original ones, or by finding the consistency between URLs and web content.

In the Heuristic-Based approach, distinctive features extracted from the structure of phishing web pages, such as URLs, domain names, and web page rank, are employed in the detection process. These features are applied to machine learning algorithms to build an accurate classifier to effectively differentiate between phishing and legitimate web pages.

In the Fuzzy rule-based approach, classifying web pages is based on the level of phishness present in the web pages, using predefined rules. The fuzzy logic process is applied in multiple steps, which usually start with a fuzzification step and end with a defuzzification step. A fuzzy rule base may be combined with different artificial intelligence algorithms, such as neural networks and genetic algorithms, to upgrade its functionality.

Finally, we can conclude that there is no perfect approach to be used in detecting phishing web pages. Each approach has its advantages and disadvantages, and improving these approaches is always required.

REFERENCES

[1] Anti-Phishing Working Group (2016). Phishing Activity Trends Report (4th Quarter 2016). Unifying the Global Response to Cybercrime. [Online]. APWG.
[2] Anti-Phishing Working Group (2018). Phishing Activity Trends Report (4th Quarter 2018). Unifying the Global Response to Cybercrime.
[3] Microsoft Security Intelligence Report, Volume 24.
[4] H. a. B. B. a. R. I. Shirazi, "Kn0w Thy Doma1n Name: Unbiased Phishing Detection Using Domain Name Based Features," in Proceedings of the 23rd ACM Symposium on Access Control Models and Technologies, 2018.
[5] A. Altaher, "Phishing websites classification using hybrid SVM and KNN approach," International Journal of Advanced Computer Science and Applications, vol. 8, pp. 90-95, 2017.
[6] Y.-S. a. Y. Y.-H. a. L. H.-S. a. W. P.-C. Chen, "Detect phishing by checking content consistency," Proceedings of the 2014 IEEE 15th International Conference on Information Reuse and Integration (IEEE IRI 2014), pp. 109-119, 2014.
[7] N. a. A. A. a. T. F. Abdelhamid, "Phishing detection based associative classification data mining," Expert Systems with Applications, vol. 41, pp. 5948-5959, 2014.
[8] K. A. K. N. Manoj Kumar, "Detecting Phishing Websites using Fuzzy Logic," International Journal of Advanced Research in Computer Engineering & Technology (IJARCET), vol. 5, 2016.
[9] M. a. H. M. A. a. D. K. a. T. F. Aburrous, "Intelligent phishing detection system for e-banking using fuzzy data mining," Expert Systems with Applications, vol. 37, pp. 7913-7921, 2010.
[10] C. L. a. C. K. L. a. o. Tan, "Phishing website detection using URL-assisted brand name weighting system," 2014 International Symposium on Intelligent Signal Processing and Communication Systems (ISPACS), pp. 054-059, 2014.
[11] K. L. a. C. E. H. a. T. W. K. a. o. Chiew, "Utilisation of website logo for phishing detection," Computers & Security, vol. 54, pp. 16-26, 2015.
[12] M. a. V. A. Y. Moghimi, "New rule-based phishing detection method," Expert Systems with Applications, vol. 53, pp. 231-242, 2016.
[13] M. N. a. M. S. Feroz, "Phishing URL detection using URL ranking," 2015 IEEE International Congress on Big Data, pp. 635-638, 2015.
[14] L. A. T. a. T. B. L. a. N. H. K. a. N. M. H. Nguyen, "A novel approach for phishing detection using URL-based heuristic," 2014 International Conference on Computing, Management and Telecommunications (ComManTel), pp. 298-303, 2014.
[15] A. A. a. A. N. A. Ahmed, "Real time detection of phishing websites," 2016 IEEE 7th Annual Information Technology, Electronics and Mobile Communication Conference (IEMCON), pp. 1-6, 2016.
[16] R. S. a. A. S. T. Rao, "PhishShield: a desktop application to detect phishing webpages through heuristic approach," Procedia Computer Science, vol. 54, pp. 147-156, 2015.
[17] J. a. V. R. G. Solanki, "Website phishing detection using heuristic based approach," Proceedings of the Third International Conference on Advances in Computing, Electronics and Electrical Technology, 2015.
[18] J.-L. a. K. D.-H. a. C.-H. L. Lee, "Heuristic-based approach for phishing site detection using URL features," Proc. of the Third Intl. Conf. on Advances in Computing, Electronics and Electrical Technology (CEET), pp. 131-135, 2015.
[19] R. B. a. D. T. Basnet, "Towards developing a tool to detect phishing URLs: a machine learning approach," 2015 IEEE International Conference on Computational Intelligence & Communication Technology, pp. 220-223, 2015.
[20] X. a. W. H. a. N. T. Gu, "An efficient approach to detecting phishing web," Journal of Computational Information Systems, vol. 9, pp. 5553-5560, 2013.
[21] B. L. a. N. L. A. T. a. N. H. K. a. N. M. H. To, "A novel fuzzy approach for phishing detection," 2014 IEEE Fifth International Conference on Communications and Electronics (ICCE), pp. 530-535, 2014.
[22] S. D. Shirsat, "Demonstrating Different Phishing Attacks Using Fuzzy Logic," 2018 Second International Conference on Inventive Communication and Computational Technologies (ICICCT), pp. 57-61, 2018.
[23] P. A. a. H. M. A. a. T. M. a. S. G. a. A. N. Barraclough, "Intelligent phishing detection and protection scheme for online transactions," Expert Systems with Applications, vol. 40, pp. 4697-4706, 2013.
[24] L. A. T. a. T. B. L. a. N. H. K. Nguyen, "An efficient approach for phishing detection using neuro-fuzzy model," Journal of Automation and Control Engineering, vol. 3, 2015.
Detecting Slow Port Scan Using Fuzzy Rule Interpolation

Mohammad Almseidin, Department of Information Technology, University of Miskolc, Miskolc, Hungary, alsaudi@iit.uni-miskolc.hu
Mouhammd Al-kasassbeh, Computer Science Department, Princess Sumaya University for Technology, Amman, Jordan, m.alkasassbeh@psut.edu.jo
Szilveszter Kovacs, Department of Information Technology, University of Miskolc, Miskolc, Hungary, szkovacs@iit.uni-miskolc.hu

Abstract—Fuzzy Rule Interpolation (FRI) offers a convenient way of delivering rule-based decisions on continuous universes, avoiding the burden of binary decisions. In contrast with classical fuzzy systems, FRI decisions also perform well on partially complete rule bases, serving methodologies that have an incremental rule-base creation structure. These features make the FRI methods a perfect candidate for detecting and preventing different types of attacks in an Intrusion Detection System (IDS) application. This paper aims to introduce a detection approach for slow port scan attacks by adapting the FRI reasoning method. A controlled test-bed environment was also designed and implemented for the purpose of this study. The proposed detection approach was tested and evaluated using different observations. Experimental analysis on a real test-bed environment provides useful insights about the effectiveness of the proposed detection approach. These insights include information regarding the detection approach's efficacy in detecting the port scan attack and in determining its level of severity. In the discussion, the efficacy of the proposed detection approach is compared to the SNORT IDS. The results of the comparison showed that the SNORT IDS was unable to detect the slow and very slow port scan attacks, whereas the proposed FRI rule-based detection approach was able to detect the attacks and generate comprehensive results to further analyze the attack's severity.

Index Terms—Fuzzy Rule Interpolation, Intrusion Detection System, Port Scan Attack, SNORT.

I. INTRODUCTION

The rapid growth of technologies makes protecting computer networks a challenging task. Another challenge is that attackers' needs have grown and changed relative to the rapid technological growth. Attacks are generally not executed blindly; rather, the new techniques are strategically implemented. In other words, the attacker strategically executes several steps before achieving his final goal. The first step is to collect the necessary information about the desired victims. These types of attacks are known as "multi-step attacks" due to their strategic execution, which takes place in various stages. Attackers first focus on finding open pathways to implement their illegal activities in order to eventually break down the availability and integrity of the connected network.

Multi-step attacks made up 60% of total attacks worldwide [1], [2]. The attackers change their techniques because low-level attacks (single-stage attacks) are detected using predefined standard rules. However, multi-step attacks implement several steps, some of which appear legitimate and therefore make these types of attacks more difficult to detect. The most common detection mechanism used is the Intrusion Detection System (IDS). It can be categorized as either anomaly-based or signature-based detection. Anomaly-based detection is able to detect new types of attacks by using the network traffic's historical behaviour; however, this type of detection renders a greater number of false positive alerts. On the other hand, signature-based detection offers the lowest number of false positives for the stored attack signatures (patterns). From another perspective, the signature-based detection mechanism needs to be updated frequently with different attack patterns [3]-[6]. While each of the previous detection mechanisms has its own benefits and drawbacks, the anomaly-based detection mechanism is more widely used [7].

Detecting multi-step attacks is not a straightforward procedure, and the IDS may face difficulties in detecting them [2]. The characteristic strength of these types of attack is that they are carried out sequentially and usually start their sequence with some legal actions used to discover and probe connected computers. After that, the attacks focus on opening direct pathways into the system. This is done by the attackers accumulating significant information about the expected victims. Therefore, one of the most important steps for the attacker is to gather the required information about the expected victims.

The port scan attack [8] is considered a preliminary step of different types of multi-step attacks. It provides significant information about the intended victims within the connected network. Meanwhile, it gathers large amounts of information that are required for the later steps of the attack. From another perspective, the port scan could be useful as a tool for the network administrator to diagnose and troubleshoot their network. However, attackers abuse the port-scan tool and exploit it as a means of attacking the system. Practically, the IDS detects various types of port-scan attacks, but it has difficulty detecting the slow port scan. A slow port scan [9] means that an attacker does not send probe packets from more than two computers permanently; rather, attackers send packets to a host, for example, only every 30 seconds or every 60 seconds. The attacker uses the slow port scan to gather the required
information about the intended victims [10]. At the same time, the IDS generates an alert detecting the heavy port scan, but without any indication of the port scan's degree level. The heavy port scan's degree level could provide the administrator with useful information to help recognize the attack in its early stages and to understand the current network's security status [4], [11].

As a response to the previous issues, this paper proposes a slow port scan detection mechanism built upon the FRI reasoning method. The FRI reasoning methods offer the required extension of the rule-based binary decision to the continuous space. This addition introduces a mechanism for approximating the severity level of the port scan attack. This information helps the administrator better understand the current network's security status. Moreover, because of its interpolated fuzzy nature, the fuzzy rule base can also be partially complete, without risking the loss of a conclusion for some of the observations (compared to classical compositional fuzzy reasoning). As a result, the size of the intrusion detection fuzzy rule-base can be dramatically reduced, and the rule-base can be created incrementally.

This paper is organized as follows: Section II presents recent works related to the application of intrusion detection based on fuzzy systems to detect the port scan attack. Then fuzzy rule interpolation is briefly presented in Section III. Section IV introduces the proposed detection approach in detail, followed by the experiments and results in Section V. Finally, Section VI concludes the paper.

II. RELATED WORKS

This section presents some of the relevant works related to the use of fuzzy systems against the port-scan attack. It also provides a brief overview of different methods and approaches that are used for intrusion detection.

Moshiul et al. [12] propose the Fuzzy Intrusion Recognition Engine (FIRE) as a Network Intrusion Detection System (NIDS), where the fuzzy system is implemented to assess different types of attacks. The proposed FIRE system was tested and evaluated using a simulated attack environment. It consisted of twenty-five independent fuzzy rules used to detect the port scan attack. The output response of FIRE was defined as follows: if the output response was 33%, the FIRE system presented a warning alert; however, if the output response was more than 66%, the FIRE system presented an alarm described as a "dangerous zone". The proposed FIRE successfully detected the port scan attacks within the simulated environment.

Another perspective is presented by Shafiq et al. in [13]. They draw on results from a comparative study of three detection methods tested against the port-scan attack. In this study, the Fuzzy Inference System (FIS), Neural Network (NN) and the Adaptive Neuro-Fuzzy Inference System (ANFIS) were studied and evaluated in order to detect the port scan attack. The endpoint-based traffic dataset [14] was used as a testbed environment. Several performance parameters were calculated to evaluate the FIS, NN and ANFIS using the same dataset. The NN-based IDS obtained the highest false positive rate for detecting the port scan attack within the simulated environment; meanwhile, it recorded the lowest true positive rate. The ANFIS-based IDS obtained the highest true positive rate and the lowest false positive rate compared with the classical inference system. The aforementioned results concluded that the ANFIS method was effective in detecting the port scan attack.

Kim and Lee [10] propose a classical inference system combined with a step-wise policy to generate a useful framework for detecting the port scan. The proposed Traffic Control Framework (TCF) was designed based on several fuzzy rules. The primary parameter for detecting the port-scan attack was the number of packets between the source and destination. The experiment was carried out based on a restricted step-wise policy to control abnormal traffic. Furthermore, the network mapper tool was used to execute the desired port-scan traffic. The proposed TCF model was able to successfully detect the heavy port scan traffic within the simulated environment.

The port-scan attack is considered the first step in multi-step attacks; therefore, the later steps of the attack are launched according to the success of the initial port-scan step. Zhang et al. [8] separately implement two classical inference systems to detect the port-scan attack and ascertain whether the current port-scan attack is part of a multi-step attack. The preliminary parameter for detecting the port scan was the time between different numbers of packets. Moreover, other information was recorded, such as IP addresses and host names, to predict whether the port-scan attack in question was part of a multi-step attack or not. The proposed system generates a two-level alarm: one if it is a standalone port scan attack, and another if it is part of a multi-step attack.

Moreover, there are different dimensions of detecting the port-scan attack in which the fuzzy inference system is incorporated along with other data mining algorithms. One example is in [11], where Ireland et al. combine the classical inference system and the genetic algorithm. In this example, the genetic algorithm was implemented to optimize the membership function parameters. However, the proposed detection method was not limited to detection of the port-scan attack; rather, it was also used to detect other types of denial of service (DoS) attacks. The experiments were carried out using benchmark datasets (the KDD dataset and the RL09 dataset). These datasets include different types of port-scan attacks under the umbrella of probe attacks. The proposed detection method successfully recognized 91.64% of denial of service attacks and 94.79% of probe attacks within those datasets.

Dickerson et al. in [15] proposed an anomaly-based intrusion detection system based on the classical inference system. The proposed system deals with the following protocols: TCP, UDP, and ICMP. It monitors the collected data during a specific sliding window. Regarding the design of the proposed system, there are five membership functions with the following linguistic terms: Low, Med-Low, Medium, Med-High, and High. The proposed method's output response was mapped to the interval between
0 and 1. The intrusion detection fuzzy rules were suggested by an expert. The major parameter used to detect the port scan was the Session Description Protocol (SDP); it indicates the unique connection between source and destination using the same port. The proposed method was tested and evaluated using a simulated attack environment. The proposed system effectively detected the port scan attack in addition to other intrusion types, i.e., backdoor and Trojan horse attacks.

There are several works that contribute to the research into different methods for detecting and preventing port scan attacks; a good summary of the different detection approaches against the port scan attack is provided in [16]. The previous works provide convincing contributions and support the idea that implementing a fuzzy inference system as a detection approach could be suitable for detecting the port scan attack. From another perspective, the previous works still have a common flaw, namely that the classical inference system requires a complete fuzzy rule-base to detect the port scan attack. It can be difficult in some cases to obtain a complete fuzzy rule-base; as a result, when an observation appears, it is possible that it is not covered by any of the fuzzy rules, and in this case the detection approach is incapable of offering the desired output. Unlike the previous efforts, in this work the FRI reasoning method was adopted instead of the classical fuzzy inference system. The advantage of using the FRI reasoning method is that it eliminates the need for a complete fuzzy rule-base; the detection approach can be implemented using only a few significant intrusion detection fuzzy rules.

III. FUZZY RULE INTERPOLATION

The term "fuzzy logic" was initially introduced by Lotfi Zadeh [17]. There are some application areas where the need for handling continuous universes requires the concepts of fuzzy sets and continuous-valued logic instead of crisp sets and binary logic. Fuzzy logic can also be implemented as a suitable reasoning method for application areas dealing with the issue of binary decisions. For example, with regard to intrusion detection, a binary decision is not suitable for recognizing the level of an intrusion. The fuzzy system, however, is able to avoid the binary decision by smoothing the boundaries and presenting more comprehensible results [4]. A detection approach based on a fuzzy system must meet the following demands: specify the input and output universes, specify the input and output fuzzy partitions, and generate the intrusion detection fuzzy rules [18].

In the classical fuzzy inference systems, i.e., Mamdani and Takagi-Sugeno, the fuzzy rule-base must cover all observations (inputs) to generate results; the classical fuzzy inference system cannot generate the expected results for all observations when dealing with a partially defined fuzzy rule-base [7]. The FRI reasoning methods were introduced to generate a conclusion even in the case when the fuzzy rule base is only partially defined (sparse). Moreover, the FRI methods can significantly reduce the number of fuzzy rules [19], because when using the FRI methods there is no need for complete fuzzy rules. The FRI methods approximate the required conclusions based on the most important fuzzy rules; a thorough summary of the FRI methods is presented in [20].

IV. FRI AGAINST PORT SCAN ATTACK

As mentioned in Section III, the input and output universes first need to be defined in order to establish the fuzzy system. The general structure of the proposed detection approach is shown in Fig. 1.

Fig. 1. The Structure of the Proposed Detection Mechanism

The general structure of the proposed detection approach starts by extracting the FRI inference system's required input parameters. The extraction process was executed using SNORT. SNORT is a free, open-source network intrusion detection system [21]; it can be installed and configured to detect various types of intrusions on real-time traffic. The SNORT structure is implemented on top of a packet-capture library. The SNORT detection mechanism is based on predefined rules; these rules act as signatures for different types of intrusions. Every packet that passes through SNORT is thoroughly analyzed and investigated to find any matches to the predefined detection rules. This requires that the repository of predefined rules be continuously updated. SNORT rules can be written in a friendly way, allowing the system administrator to easily edit, delete, and insert new rules [22]. The incorporation of SNORT and the FRI reasoning method is carried out to derive the network input parameters for the FRI detection approach. In sniffing mode, SNORT collects many network parameters and other information.

Important parameters must be defined in order to detect different types of port scan attacks. Time is one of the primary parameters used for recognizing the port scan attack. According to the results in the literature [22], [23], the following parameters were extracted and used as input parameters for the proposed detection approach (a small computation sketch follows the list):
• The Number of Packets Sent (NPS) between source and destination.
• The Average Time between received Packets (ATP) at the destination victim, in milliseconds.
• The Number of Packets Received (NPR) by the destination victim, in seconds.
(inputs) to generate results. However, the classical fuzzy • The Number of Packets Received by the destination
inference system could not generate the expected results for victim in seconds (NPR).
all observations when dealing with partially defined fuzzy To carry out an actual experimental port scan attack, a test-
rule-base [7]. The FRI reasoning methods are introduced bed network environment was constructed. Fig. 2 shows the
to generate conclusion even in case, when the fuzzy rule test-bed network architecture.
base is only partially defined (sparse). Moreover, the FRI According to the experiments conducted, four connected
methods can significantly reduce the number of fuzzy rules computers (Client 1, Client 2, Client 3 and Client 4) were
[19] because when using the FRI methods there is no need considered attackers and the last one was presented as a victim

35
server. The port scan attacks were executed in four phases as follows: very slow port scan, slow port scan, medium port scan, and high port scan. The different phases of the port scan attack are distinguished based on the number of attackers as follows:
• Very slow port scan (1-to-1 attack): client 1 executed the port scan against the victim server.
• Slow port scan (2-to-1 attack): clients 1 and 2 executed the port scan against the victim server.
• Medium port scan (3-to-1 attack): clients 1, 2, and 3 executed the port scan against the victim server.
• High port scan (4-to-1 attack): clients 1, 2, 3, and 4 executed the port scan against the victim server.

Fig. 2. The Test-bed Network Architecture

Four different experiments were performed for the four distinct phases of port scan attacks. Each experiment phase lasted 6 minutes; this time was chosen somewhat arbitrarily. The input parameter (ATP, NPS, and NPR) values were collected and analyzed in every phase. In other words, four different experiment phases were performed in order to obtain the threshold values needed to accurately represent the FRI detection approach's input parameters. IP scanner and NMAP tools were used to execute the port scan attacks. SNORT in sniffing mode and the CommView network analyzer were installed on the victim machine.

In the first phase of experiments, where a very slow port scan was executed from a single attacker machine (1-to-1 attack), the ATP parameter recorded large values, meaning that there were large increments of time between received packets. This is why common intrusion detection systems, such as SNORT and Juniper NetScreen, did not detect the slow port scan attack. The ATP value dramatically decreased towards the fourth phase (4-to-1 attack). From another perspective, the NPR and NPS parameters increased in ascending order from the first phase through the fourth phase of the experiments.

A. Fuzzification and Fuzzy Rule Generation

The FRI detection approach had three input parameters (NPR, NPS, and ATP). For each input parameter, four linguistic terms were used to represent their ranges during each phase of the experiments. Table I lists the linguistic terms used to classify each of the FRI detection approach's input parameters.

TABLE I
LINGUISTIC TERMS OF THE SELECTED PARAMETERS

Parameters   Fuzzy Sets
NPR          Very Slow, Slow, Medium, High
NPS          Very Slow, Slow, Medium, High
ATP          Very Slow, Slow, Medium, High

For the sake of simplicity, the triangular membership function was chosen to represent the above-mentioned primary parameters. The output response of the FRI detection approach was also divided into four similar fuzzy sets distributed from 0 to 1. Fig. 3 shows the supports of the FRI detection approach's antecedent fuzzy sets.

Fig. 3. Supports of the Antecedent Fuzzy Sets of the Proposed Approach
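As a rough illustration of this fuzzification step, the sketch below evaluates a triangular membership function over one input universe partitioned into the four linguistic terms of Table I; the breakpoints are invented placeholders, not the thresholds derived from the four experiment phases.

# Illustrative fuzzification of one input parameter (ATP, in ms) into the four
# linguistic terms of Table I. The triangular shape follows the paper's choice of
# membership function; the breakpoints below are made-up placeholders.
def triangular(x: float, a: float, b: float, c: float) -> float:
    """Membership degree of x in a triangular fuzzy set with feet a, c and peak b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

# Hypothetical supports: a small ATP means packets arrive quickly (heavy scan),
# a large ATP means long gaps between probes (very slow scan).
ATP_SETS = {
    "High":      (0.0,  2.0,  6.0),
    "Medium":    (4.0,  8.0, 12.0),
    "Slow":      (10.0, 14.0, 18.0),
    "Very Slow": (16.0, 20.0, 24.0),
}

def fuzzify(value: float, sets: dict) -> dict:
    """Return the membership degree of a crisp value in every linguistic term."""
    return {term: triangular(value, *abc) for term, abc in sets.items()}

print(fuzzify(18.0, ATP_SETS))
# {'High': 0.0, 'Medium': 0.0, 'Slow': 0.0, 'Very Slow': 0.5}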
Four different experiments were performed for the four
distinct phases of port scan attacks. Each experiment phase The fuzzy rules were designed based on the FRI detection
lasted 6 minutes; this time was chosen somewhat arbitrarily. approach’s input parameters and their corresponding fuzzy
The input parameters (ATP, NPS, and NPR) values were sets [24]. In this work, the fuzzy rules were reduced to
collected and analyzed in every phase. In other words, four simplify the system design. Therefore, the FRI detection
different experiment phases were performed in order to obtain approach consisted only of the most significant fuzzy rule-
the threshold values needed to accurately represent the FRI base. These sparse fuzzy rules were generated based on expert
detection approach’s input parameters. IP scanner and NMAP knowledge, the input parameters’ range values during the four
tools were used to execute the port scan attacks. SNORT in phases of experiments, and the relationship between the input
sniffing mode and CommView network analyzer were installed parameters and the number of attacker clients. Subsequently,
on the victim machine. ten preliminary sparse fuzzy rules were designed. Table II
presents the sparse fuzzy rules 1 .
In the first phase of experiments, where a very slow port
The FRI-based detection approach’s inference engine was
scan was executed based on a single attacker machine (1 to
performed using Fuzzy Rule Interpolation based on the POlar
1 attack), the ATP parameter recorded large values meaning
Cuts (FRIPOC) method [25]. This method was introduced by
that there were large increments of time in between received
Johanyák and Kovács in 2006. The FRIPOC method code,
packets. This is why common intrusion detection systems,
along with other FRI methods, can be accessed from the
such as SNORT and Juniper Netscreen, did not detect the slow
FRI toolbox which can be downloaded freely from [26].
port scan attack. The ATP value dramatically decreased until
Overall, there are many benefits for implementing the FRI
the fourth phase (4 to 1 attack). From another perspective,
reasoning method instead of the classical inference systems.
NPR and NPS parameters starting increasing in ascending
order from the first phase through the fourth phase of the 1 Hint: PL = Port scan Level, VS = Very Small, S = Small, M = Medium
experiments. and H = High
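As a concrete illustration of how these three quantities could be obtained, the short Python sketch below derives per-window values from a list of packet records. It is not the authors' measurement tooling; the record format, the direction labels, and the interpretation of NPS/NPR as packet counts and ATP as the average inter-arrival time at the victim are assumptions made only for the example.

# Illustrative sketch (not the authors' tooling): derive the three FRI input
# parameters from a list of packet records captured during one experiment phase.
# Each record is assumed to be (timestamp_seconds, direction), where direction
# is "to_victim" or "from_victim"; these names are placeholders.

def extract_parameters(packets):
    """Return (NPS, NPR, ATP) for one observation window."""
    to_victim = sorted(t for t, d in packets if d == "to_victim")
    from_victim = [t for t, d in packets if d == "from_victim"]

    nps = len(to_victim)                      # packets sent towards the victim
    npr = len(from_victim)                    # packets returned by the victim
    # average gap between consecutive packets arriving at the victim; large
    # gaps are the signature of the very slow scan described in the text
    gaps = [b - a for a, b in zip(to_victim, to_victim[1:])]
    atp = sum(gaps) / len(gaps) if gaps else 0.0
    return nps, npr, atp

# Example: a sparse (very slow) probe produces small NPS/NPR and a large ATP.
window = [(0.0, "to_victim"), (18.0, "to_victim"), (36.5, "to_victim"),
          (0.1, "from_victim"), (18.1, "from_victim")]
print(extract_parameters(window))   # -> (3, 2, 18.25)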

TABLE II
THE SPARSE FUZZY RULES

No | NPS | NPR | ATP | PL
(NPS, NPR, ATP: input parameters; PL: port scan level)
1 | VS | VS | H | Very Slow
2 | VS | VS | M | Very Slow
3 | VS | S | M | Slow
4 | M | M | M | Medium
5 | H | H | M | High
6 | H | M | M | Medium
7 | M | H | H | High
8 | H | S | S | Medium
9 | VS | H | M | Medium
10 | H | VS | M | Medium
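To make the role of the sparse rule-base in Table II more tangible, the toy Python sketch below stores a subset of the rules and ranks the rules closest to an observation. It is only an illustration: the core values are invented, and a real FRI method such as FRIPOC [25] interpolates a conclusion between the neighbouring rules rather than simply picking the nearest one.

# Toy illustration only (not the FRIPOC method from [25]): part of Table II
# stored as a sparse rule base, with each linguistic term reduced to a
# representative core value so that an observation that fires no rule exactly
# can still be related to its nearest rules. The core values are invented.

CORES = {"VS": 0.1, "S": 0.35, "M": 0.6, "H": 0.9}   # normalised [0, 1] cores

# (NPS, NPR, ATP) antecedents -> port-scan level consequent (subset of Table II)
RULES = [
    (("VS", "VS", "H"), "Very Slow"),
    (("VS", "S",  "M"), "Slow"),
    (("M",  "M",  "M"), "Medium"),
    (("H",  "H",  "M"), "High"),
]

def nearest_rules(obs, k=2):
    """Rank rules by Euclidean distance between the (normalised) observation
    and each rule's antecedent cores; FRI would interpolate a conclusion
    between the closest rules instead of demanding an exact match."""
    def dist(rule):
        antecedent, _ = rule
        return sum((o - CORES[a]) ** 2 for o, a in zip(obs, antecedent)) ** 0.5
    return sorted(RULES, key=dist)[:k]

# An observation lying between the Slow and Medium antecedents:
print(nearest_rules((0.45, 0.5, 0.6)))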

Applying FRI decreases the fuzzy rule-base size, which, subsequently, decreases computation time and simplifies the fuzzy system's design. Furthermore, FRI reasoning can produce an interpolated conclusion even if the fuzzy rule-base is only partially defined.

V. EXPERIMENTS AND RESULTS

This section discusses the results of the implemented experiments. SNORT could be enhanced by extending its binary decision to the continuous space using the FRI reasoning methods. It is worth mentioning that every observation used to evaluate the FRI detection approach was presented as a fuzzy singleton. The FRI detection approach yielded useful information such as the "level of port scan attack", which gives the administrator a better understanding of the recent port scan attack. This information can be expressed through these two observations: the first yielded the crisp values (NPS = 1500, NPR = 3500, and ATP = 2), while the second registered (NPS = 150, NPR = 1050, and ATP = 18). The FRI detection approach's output responses for the two observations are illustrated in Fig. 4 and Fig. 5 respectively, where the first observation was classified as a high port scan attack and the second as a very slow port scan attack.

Fig. 4. FRI Detection Approach Output in Case of High Attack

Fig. 5. FRI Detection Approach Output in Case of Very Slow Attack

These observations, among others, were tested and evaluated using both the proposed FRI detection approach and SNORT, as presented in Table III.

TABLE III
FRI APPROACH VS SNORT OUTPUT ALERTS

Obs | NPS | NPR | ATP | SNORT Alerts | FRI Approach Alerts
1 | 1500 | 3500 | 2 | Attack Alert | High Port Scan Attack
2 | 150 | 1050 | 18 | No Alert | Very Slow Port Scan Attack
3 | 900 | 2500 | 7 | Attack Alert | Medium Port Scan Attack
4 | 77 | 817 | 19 | No Alert | Very Slow Port Scan Attack
5 | 900 | 1000 | 15 | No Alert | Slow Port Scan Attack
6 | 1600 | 3750 | 2 | Attack Alert | High Port Scan Attack
7 | 1100 | 2020 | 8 | Attack Alert | Medium Port Scan Attack
8 | 490 | 1100 | 16 | No Alert | Slow Port Scan Attack

Consequently, these experiments demonstrate the proposed FRI detection approach's ability to present concise, comprehensible results. Moreover, it was able to detect the very slow and slow port scans for which SNORT produced no attack alert. Traditional fuzzy-based detection approaches focus on adapting complete fuzzy rule-bases to detect port scan attacks; however, this may not be a straightforward procedure in some cases. Therefore, the proposed FRI detection approach was based on the FRIPOC FRI method to smooth the boundaries and recognize the level of the port scan attack. Furthermore, the approach was able to generate comprehensive results even if the fuzzy rule-base is only partially defined.

VI. CONCLUSION

This paper introduces a novel approach for detecting port scan attacks. The proposed approach was designed and constructed using fuzzy rule interpolation. The FRI-based detection approach's inference engine was implemented using Fuzzy Rule Interpolation based on the POlar Cuts (FRIPOC) method. The sparse fuzzy rules were generated based on expert knowledge, the range values of the input parameters during the experiments' four phases, and the relationship between the input parameters and the number of attacker
clients. The conducted experiments reflect the proposed FRI-based detection approach's ability to effectively detect the very slow and slow port scans based solely on the sparse fuzzy rules. The FRI-based detection approach's output responses were compared with SNORT, and the results reflected that the proposed detection approach was successful in detecting the very slow port scan attack in instances where SNORT did not render any alert. Furthermore, the FRI-based detection approach presented additional information, such as the level of the port scan attack, instead of a binary alert.

ACKNOWLEDGMENT

The described article was carried out as part of the EFOP-3.6.1-16-00011 "Younger and Renewing University – Innovative Knowledge City – institutional development of the University of Miskolc aiming at intelligent specialization" project implemented in the framework of the Szechenyi 2020 program. The realization of this project is supported by the European Union, co-financed by the European Social Fund.

REFERENCES
[1] Y. Zhang, D. Zhao, and J. Liu, "The application of Baum-Welch algorithm in multistep attack," The Scientific World Journal, vol. 2014, 2014.
[2] M. Almseidin, I. Piller, M. Al-Kasassbeh, and S. Kovacs, "Fuzzy automaton as a detection mechanism for the multi-step attack," International Journal on Advanced Science, Engineering and Information Technology, vol. 9, no. 2, 2019.
[3] M. Almseidin, M. Alzubi, S. Kovacs, and M. Alkasassbeh, "Evaluation of machine learning algorithms for intrusion detection system," in Intelligent Systems and Informatics (SISY), 2017 IEEE 15th International Symposium on. IEEE, 2017, pp. 277–282.
[4] M. Almseidin and S. Kovacs, "Intrusion detection mechanism using fuzzy rule interpolation," Journal of Theoretical and Applied Information Technology, vol. 96, no. 16, pp. 5473–5488, 2018.
[5] M. Alkasassbeh, G. Al-Naymat, A. Hassanat, and M. Almseidin, "Detecting distributed denial of service attacks using data mining techniques," International Journal of Advanced Computer Science and Applications, vol. 7, no. 1, pp. 436–445, 2016.
[6] M. Alkasassbeh and M. Almseidin, "Machine learning methods for network intrusion detection," in The 20th International Conference on Computing, Communication and Networking Technologies (ICCCNT 2018), 2018, pp. 105–110.
[7] M. Almseidin, M. Al-Kasassbeh, and S. Kovacs, "Fuzzy rule interpolation and SNMP-MIB for emerging network abnormality," International Journal on Advanced Science, Engineering and Information Technology, vol. 9, no. 3, pp. 735–744, 2019.
[8] W. Zhang, S. Teng, and X. Fu, "Scan attack detection based on distributed cooperative model," in Computer Supported Cooperative Work in Design (CSCWD 2008), 12th International Conference on. IEEE, 2008, pp. 743–748.
[9] M. Ring, D. Landes, and A. Hotho, "Detection of slow port scans in flow-based network traffic," PLoS ONE, vol. 13, no. 9, p. e0204507, 2018.
[10] J. Kim and J.-H. Lee, "A slow port scan attack detection mechanism based on fuzzy logic and a stepwise policy," 2008.
[11] E. Ireland et al., "Intrusion detection with genetic algorithms and fuzzy logic," in UMM CSci Senior Seminar Conference, 2013, pp. 1–6.
[12] H. M. Moshiul et al., "An efficient framework for network intrusion detection," Computer Science & Telecommunications, vol. 24, no. 1, 2010.
[13] M. Z. Shafiq, M. Farooq, and S. A. Khayam, "A comparative study of fuzzy inference systems, neural networks and adaptive neuro fuzzy inference systems for portscan detection," in Workshops on Applications of Evolutionary Computation. Springer, 2008, pp. 52–61.
[14] "Endpoint Security dataset," http://www.nexginrc.org/Datasets, 2004.
[15] J. E. Dickerson, J. Juslin, O. Koukousoula, and J. A. Dickerson, "Fuzzy intrusion detection," in IFSA World Congress and 20th NAFIPS International Conference, 2001, vol. 3. IEEE, 2001, pp. 1506–1510.
[16] M. H. Bhuyan, D. Bhattacharyya, and J. K. Kalita, "Surveying port scans and their detection methodologies," The Computer Journal, vol. 54, no. 10, pp. 1565–1581, 2011.
[17] L. A. Zadeh, "Fuzzy sets," Information and Control, vol. 8, no. 3, pp. 338–353, 1965.
[18] S. Dhopte and N. Tarapore, "Design of intrusion detection system using fuzzy class-association rule mining based on genetic algorithm," International Journal of Computer Applications, vol. 53, no. 14, 2012.
[19] S. Kovács, "Fuzzy rule interpolation," in Encyclopedia of Artificial Intelligence. IGI Global, 2009, pp. 728–733.
[20] Z. C. Johanyák and S. Kovács, "A brief survey and comparison on various interpolation based fuzzy reasoning methods," Acta Polytechnica Hungarica, vol. 3, no. 1, pp. 91–105, 2006.
[21] M. Roesch et al., "Snort: Lightweight intrusion detection for networks," in LISA, vol. 99, no. 1, 1999, pp. 229–238.
[22] W. El-Hajj, H. Hajj, Z. Trabelsi, and F. Aloul, "Updating snort with a customized controller to thwart port scanning," Security and Communication Networks, vol. 4, no. 8, pp. 807–814, 2011.
[23] W. El-Hajj, F. Aloul, Z. Trabelsi, and N. Zaki, "On detecting port scanning using fuzzy based intrusion detection system," in Wireless Communications and Mobile Computing Conference (IWCMC '08). IEEE, 2008, pp. 105–110.
[24] Y.-C. Chen, L.-H. Wang, S.-M. Chen et al., "Generating weighted fuzzy rules from training data for dealing with the iris data classification problem," International Journal of Applied Science and Engineering, vol. 4, no. 1, pp. 41–52, 2006.
[25] Z. C. Johanyák and S. Kovács, "Fuzzy rule interpolation based on polar cuts," in Computational Intelligence, Theory and Applications. Springer, 2006, pp. 499–511.
[26] Z. C. Johanyák, D. Tikk, S. Kovács et al., "Fuzzy rule interpolation Matlab toolbox – FRI toolbox," in 2006 IEEE International Conference on Fuzzy Systems, July 2006, pp. 351–357.

An Approach for Web Applications Test Data
Generation Based on Analyzing Client Side User
Input Fields
Samer Hanna
Department of Software Engineering, Faculty of Information Technology,
Philadelphia University, Jordan
shanna@philadelphia.edu.jo

Hayat Jaber
Department of Computer Science, Faculty of Information Technology,
Philadelphia University, Jordan
hayoot91@gmail.com

Abstract— Since it is time consuming to manually generate test data for Web applications, automating this task is of great importance for both practitioners and researchers in this domain. To achieve this goal, the research in this paper depends on an ontology that categorizes Web applications inputs according to input types such as number, text, and date. This research presents rules for Test Data Generation for Web Applications (TDGWA) based on the input categories specified by the ontology. Following the approach in this paper, Web applications testers will need shorter time to accomplish the task of TDGWA. The approach has successfully been used to generate test data for different experimental and real-life Web applications.

Keywords—Test Data Generation for Web Applications, Ontology, and Web Applications inputs types

I. INTRODUCTION

Web applications, in different domains, are used by millions of people around the world every day. For this reason, practitioners and researchers in the domain of Web applications must find means to assess the quality of these applications.

Software testing is an important activity that can be used to assess the quality of software applications. Testing includes generating test data and then executing the applications under test with the test data in order to compare the expected output, according to the requirement specifications, with the actual output resulting from the execution.

Testing Web applications is different from testing traditional applications because Web applications have many characteristics that do not exist in traditional applications; one of these characteristics is that they are used by many users at the same time.

Testing and test data generation consume lots of time and effort if done manually, and this also applies to testing and test data generation for Web applications. For this reason, it is very important to find means to automate this task.

Current Web applications testing tools generate the same test data for all the inputs of an application under test regardless of the purpose, semantics, or meaning of each individual input. The main idea of this research is to use an ontology for the purpose of categorizing and relating Web applications inputs in order to facilitate test data generation for a given Web application under test based on the different types of inputs for this application.

To demonstrate the problem discussed in this research, suppose that a Web application quality professional wants to test a Web application that has only the following 3 inputs: user name, age, and country. This person must decide the test data that must be used with each of these 3 inputs. If this task is done manually then it will consume lots of time and effort.

To solve this problem, researchers in this domain must find approaches to automate the task of TDGWA. One of the approaches to accomplish this task is to determine the needed conditions or constraints that must be applied to each input of a Web application under test depending on the type of this input. As examples of such input constraints, consider the Web application with 3 inputs mentioned above; the constraints that can be applied to these inputs are as follows:

• For the "user name" input, the user-inserted value must be between 1 character and 40 characters,
• For the "age" input, the user-inserted value must be between 1 year and 150 years (or any other upper limit for age), and
• For the "country name" input, the user-inserted value must be a valid country name among a specified list of valid countries such as {USA, UAE, etc.}.

To write the code of a tool that can be used for automating the task of TDGWA, this tool must firstly identify the type of each input in the investigated Web application, such as name, date, address, etc. Secondly, the tool must determine the constraints that must be associated with each of these inputs. After accomplishing these two tasks, test data can then be generated by applying different testing techniques, such as boundary value testing, robustness testing, and syntax testing, to the input constraints.

To explain this idea, consider again the above example Web application:

For the name input, since the constraint associated with this input is "a name must be between 1 character and 40 characters (or any other upper limit for a name)," then, according to boundary value-based robustness testing, test data is: (a) an empty name, and (b) a name with more than 40 characters. Besides, according to syntax testing, test data
is: an invalid name that contains special symbols such as "a7$j."

For the age input, since the constraint associated with this input is that "the age must be between 1 and 150," then, according to boundary value-based robustness testing, test data for this input is: {0, 1, 50, 149, 150, 151}.

Finally, for the country input, since the constraint is "a country name must be a valid country name among a specified list of valid countries," then according to syntax testing, test data is: a country that is not in the specified list of countries.

In brief, identifying the type of a certain input, such as name, age, etc., will facilitate the task of determining the constraints that can be associated with this input and will accordingly facilitate generating test data for this input, as explained above.

However, a problem arises here, and it is the main problem solved by this research, namely: it is difficult to automatically identify the type of a certain input depending on the text associated with that input. For example, the texts that can be associated with the inputs in the "name of the user" category can be: "your name", "first name", "family name", "user name", etc. The texts that can be associated with the inputs in the "age" category can be: "age", "your age", "You were born in", and many other texts. Another problem is that it is impossible to exhaustively consider the texts associated with all of the Web applications inputs.

The tool that can be used to automate the task of TDGWA must be able to determine the type of a Web application input depending on the client-side HTML text associated with that input. For example, if the text associated with an input is "your name" or "first name" or any other text related to user names, the tool must figure out that this input is a name.

Since semantic Web ontologies are used to specify the relation between concepts in a certain domain, this research depends on an ontology that classifies Web applications inputs so that a software tool can identify the type of a certain input and accordingly generate test data for this input, as explained in the above example.

The contributions of this paper are:
• Proposing an ontology that can be used to classify texts associated with Web applications inputs.
• Specifying the constraints that can be associated with different Web applications inputs in the ecommerce domain.
• Specifying rules that can be used for TDGWA.
• Determining the testing techniques that can be used with the Web applications input constraints in order to generate valid and invalid test data.

Section 2 of this research gives a brief background about the main keywords of the paper. Section 3 discusses the classifications of Web applications inputs; Section 4 introduces the ontology that is used to classify Web applications inputs. Section 5 describes the approach used in this paper for TDGWA. Section 6 presents the related work. Finally, Section 7 presents the conclusion and the future work.

II. BACKGROUND

A. Test Data Generation

Software testing is a process that is used to detect faults in a software application and also to assess the quality of this application.

Software testing is one of the most important actions to ensure the quality of software applications. Generating test data to reveal the errors in software modules is the major task of software testing. An important step in test data generation is deriving acceptable (valid) and unacceptable (invalid) values for the inputs.

Test data can be generated using different testing techniques. For example, in boundary value testing, test data is chosen at the boundaries of an input; in syntax testing, test data is generated based on violating an input's syntax or regular expression; and in the robustness testing technique, test data is chosen outside the allowable range for an input.

B. Ontology

There are two main definitions of ontology. The first one is in the real world: ontologies give the real world a formal representation by explaining concepts and the relationships between them. The second one is in computer science: an ontology is known as a group of representational primitives used to model a domain of discourse or knowledge.

III. WEB APPLICATIONS INPUT DATA CLASSIFICATION

To generate test data for a Web application, it is very important to classify the inputs to these applications in order to be able to specify test data that are related to each category of inputs. For example, the test data that are related to the inputs in the "date" category will be different from the test data related to the inputs in the "number" category such as "price".

Since it is impossible to consider all of the current Web applications in the world, this research analyzed a sample of 250 Web applications in order to classify the inputs of these applications.

To accomplish the input classifications, this research depends on the following artefacts:
• The text associated with a certain Web application input.
• The HTML type attribute of an input element.

Consider the following extract from a client-side HTML document in Figure 1:

<form>
<p> Your name: <input id="userName" />
<br />
Age: <input type="number" id="userAge" />
<br />
Country:
<select>
<option value="UAE">United Arab Emirates</option>
<option value="USA">United States</option>
<option value="Jor">Jordan</option>
</select>
</p>
</form>
Figure 1. Extract from an HTML client-side page
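A minimal sketch of this client-side analysis, using only Python's standard html.parser module, is shown below. The class name and the simple pairing of each input with the text that precedes it are illustrative choices, not the tool described later in the paper.

# Pair every <input>/<select> element with the text that immediately precedes
# it and with its HTML "type" attribute (if any), using the standard library.
from html.parser import HTMLParser

class InputCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.last_text = ""
        self.fields = []            # (associated_text, tag, type attribute)

    def handle_data(self, data):
        if data.strip():
            self.last_text = data.strip()

    def handle_starttag(self, tag, attrs):
        if tag in ("input", "select"):
            attr_type = dict(attrs).get("type")
            self.fields.append((self.last_text, tag, attr_type))

page = """
<form><p> Your name: <input id="userName" /><br />
Age: <input type="number" id="userAge" /><br />
Country: <select><option>Jordan</option></select></p></form>
"""
collector = InputCollector()
collector.feed(page)
print(collector.fields)
# [('Your name:', 'input', None), ('Age:', 'input', 'number'), ('Country:', 'select', None)]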

To analyze and classify or categorize the inputs in the


HTML document in Figure 1: Figure 2. Ontology for classifying Web applications input types.

• For the “name” input, there is no “type” attribute for As shown in the ontology in Figure 2, the input types that
the input element, in this case, the associated text belong to the date category can be classified into sub-
with this input which is “Your name” will be categories, namely, birth date, start date, end date, and
considered in the input classification process for the departure date. There are many texts that belong to each of
purpose of test data generation for this input. the previous sub-categories, for example, the texts that are
associated with the “birth date” sub-category can be: “Your
• For the second input which is the “age,” there is a birth date”, “the date of your birth”, etc. The same discussion
“type” with a value equals to “number” and this value can be made for the other input categories in Figure 2.
is useful for the test data generation process because,
based on boundary value testing, test data like: very It can be concluded that the main duty for the ontology
big number, very small number, nominal number can in Figure 2 is to classify or categorize Web applications input
be used to test this input. Moreover, the associated types in order to conclude the test data that can be used with
text with this input, which is “age,” can also be used a certain input based on its type as explained in the example
for test data generation because, we can use test data in Section 3.
like 300 which is semantically invalid age to test this
input. So, for the age input, both, type attribute and V. AN APPROACH FOR TEST DATA GENERATION
the associated text can be used for test data FOR WEB APPLICATIONS
generation.
The approach that is suggested by this research for test
• For the third input in Figure 1, which is the “Country” data generation is based on the following activities:
input, it uses the HTML <select> tag and this tag has
no “type” attribute, in this case, it is easy to conclude Activity 1: Specify the input elements in the investigated
that the type of this input is “enumerated” since there HTML document.
is a list of only 3 options. The associated text, which Activity 2: Specify the text associated with each input
is “Country,” is important, in this case, for test data element specified by Activity 1 and also the type attribute of
generation since we must know that the options are this element if it exists.
country names to be able to generate test data such as
an invalid country name or a country name that is not Activity 3: For each associated text specified by Activity
among the options of the select tag. 2, determine the type of this associated text depending on the
ontology in Figure 2. For example, if the associated text with
In brief, as shown in the example in Figure 1, a given HTML input is “Your birth date,” then depending on
determining the type of an input using the “type” attribute or the ontology it can be concluded that this text belong to the
using the associated text with that input, or both, will lead to “Birthdate” sub-category which in return belong to the
determining the test data that must be used to test such input. “Date” category, and so on.
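A simplified stand-in for such a type lookup is sketched below: each ontology concept is reduced to a small keyword list and matched against the associated text by a naive substring search. The keyword table is invented for the example and is far smaller (and cruder) than the ontology used in this research.

# Hypothetical keyword table standing in for the ontology of Figure 2: each
# category/sub-category is matched against the text associated with an input.
ONTOLOGY_KEYWORDS = {
    ("date", "birth date"): ["birth", "born"],
    ("date", "departure date"): ["departure"],
    ("number", "age"): ["age"],
    ("number", "phone"): ["phone", "mobile"],
    ("text", "name"): ["name"],
    ("enum", "country"): ["country"],
}

def classify(associated_text):
    """Return (category, sub-category) for an input's associated text,
    or (None, None) when the table has no matching concept."""
    text = associated_text.lower()
    for (category, sub), keywords in ONTOLOGY_KEYWORDS.items():
        if any(word in text for word in keywords):
            return category, sub
    return None, None

print(classify("Your birth date"))   # ('date', 'birth date')
print(classify("First name"))        # ('text', 'name')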
Activity 4: If an input element has a type attribute then
IV. WEB APPLICATIONS INPUT DATA map this type to one of the main input categories specified by
CLASSIFICATION the main ontology in Figure 2, namely, text, number, date,
In order to generate test data for a Web application input, enum, and URL. For example:
it is important firstly to determine the category of this input
• If the type attribute value is “phone” then it is mapped
e.g. name, age, country, email, etc., after that, test data for
to the “number” category.
this input can be generated based on the associated text of
this input or the “type” attribute as explained in Section 3. • If the type attribute value is “password” then it is
Based on analyzing a sample of 250 Web applications in mapped to the “text” category.
the ecommerce domain, it was concluded that the inputs of • If the type attribute value is “color” then it is mapped
these applications can be classified into the following main to the “enum” category.
categories: text, number, date, enum, and URL as shown in
the ontology in Figure 2. • If the type attribute value is “url” then it is mapped to
the “URL” category.
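The Activity 4 mapping just listed can be expressed as a small lookup table, as sketched below; the fallback to the "text" category for unlisted type values and the extra aliases are assumptions of this sketch rather than rules taken from the paper.

# Activity 4: map the HTML type attribute to one of the main input categories.
TYPE_TO_CATEGORY = {
    "number": "number",
    "phone": "number",
    "tel": "number",        # assumed alias for phone-style inputs
    "password": "text",
    "color": "enum",
    "url": "URL",
    "date": "date",
}

def map_type_attribute(type_attr):
    if type_attr is None:
        return None                           # fall back to the associated text
    return TYPE_TO_CATEGORY.get(type_attr.lower(), "text")

print(map_type_attribute("password"))   # text
print(map_type_attribute("color"))      # enum
print(map_type_attribute(None))         # None -> classify via associated text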

Activity 5: Based on an input main category specified by • If the input is a "password" then test data is any
Activity 3 or Activity 4 apply the following test data random text with size > 100.
generation rules for each of the main input categories.
5. URL
1. Date category
The URL data type has one rule only; we use syntax
The rules that are associated with the inputs in this testing technique to generate wrong URL.
category are based on boundary value testing and they are:
• If the input is a "URL" then test data is URL without
• If the input is a "Birthdate" "departure date" or "end “http://”.
date" then test data is a date with value <1/1/1900 or a
value > current date. Our complete approach of test data generation based on
activity 1 to activity 5 is demonstrated in Figure 3.
• If the input is a "day" then test data is day with value
of day<1 or day >31.
• If the input is a "month" then test data is a month with
value of month<1 or month >12.
• If the input is a "year" then test data is a year with
value of year >2018 (current year) or year < 1900
2. Number category
The rules that are associated with the inputs in this
category are based on boundary value testing and syntax
testing, examples of these rules are:
• If the input is a "phone" then test data is a phone
number that has letter/symbol or a number like
"000000000". (According to syntax testing).
• If the input is a "price" or "income" then test data is a
price or income with value <0. (According to
boundary value testing). Test data can also a random
string value. (According to syntax testing).
• If the input is a "security code" then test data is a
random string value. (According to syntax testing).
3. Enumeration category
In the Enumeration category each input has specific
accepted values; test data is any different value than these
specific values, for example:
• If the input is a "gender" then tests data is any random
text except “Male” or “Female”. (According to
robustness testing).
• If the input is a "marital status" then test data is any
random text except “Married”, “Single” or “Partner”.
(According to robustness testing).
• If the input is a "title" then test data is any random
text except "Miss", "Mrs." " Mr." or “DR.”.
(According to robustness testing).
4. Text category
Figure 3. Test data generation approach
The rules that are associated with the inputs in this
category are based on boundary value testing and syntax As shown in Figure 3, the approach consists of 4 main
testing, example of such rules are: phases; parse HTML page, determine the category of each
input, apply rules and generate test data to assess user input
• If the input is an "email" then test data is an email validation and finally use these test data to assess the web
without”@” sign. (According to syntax testing) application user input validation by invoking the We
• If the input is an "address" or "comments" or application under test using the test data and then analyzing
"message" then test data is any text with size > 500 the response of the application. If an invalid input is accepted
characters. by the investigated application then this application has
semantic based input validation vulnerability.
• If the input is a "name" then test data is any random
text with size > 50.
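The sketch below implements a few of the rules listed above as a single Python function; the concrete boundary values mirror the ones given in the rules, while the category names and the random-string generation are illustrative choices of this sketch.

import random
import string

# A compact sketch of some of the generation rules above (boundary value,
# robustness, and syntax testing applied to invented category names).
def generate_invalid_test_data(category, options=None):
    if category == "birthdate":
        # boundary value rule: outside [1/1/1900, today]
        return ["31/12/1899", "01/01/2999"]
    if category == "age":
        # robustness around the 1..150 range used in the running example
        return [0, -1, 151, 300]
    if category == "price":
        # negative value plus a random string (syntax testing)
        return [-1, "".join(random.choices(string.ascii_letters, k=8))]
    if category == "enum":
        # any value outside the accepted option list
        return ["not-an-option"] if options else []
    if category == "email":
        return ["user.example.com"]          # missing the '@' sign
    if category == "name":
        return ["", "x" * 51, "a7$j"]        # empty, oversized, special symbols
    return []

print(generate_invalid_test_data("age"))
print(generate_invalid_test_data("enum", options=["Male", "Female"]))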

The approach in this paper had successfully been used to AbdulRazzaq et al. [5] Presented an approach of
generate test data for different experimental and real life disclosing Web application attacks. The research identifies
Web applications. Web application attacks applying semantic rules.
Bisht et al. [6] proposed a black-box approach to detect
VI. EVALUATION parameter tampering vulnerability. In this approach, client-
250 Web applications had been analyzed and the inputs side HTML and JavaScript code are analyzed in order to
of these application were fed into the test data generation extract the constraints imposed on a Web application inputs.
ontology depending to the type of each input as explained in The constraints are violated afterwards in order to exploit
Section 4. tampering vulnerabilities in the tested Web application.
After that, another 10 sample Web application were used In Alkhalaf et al. [7], client-side input validation function
in an experiment to evaluate if the test data generation is checked to make sure that it conforms to the policies
ontology can shorten the needed time for TDGWA. In this specified by the research. The policies are based on regular
experiment, 4 testers that work in the filed were asked to expressions that specify the set of acceptable input values. If
generate test data for the 10 applications sample. an input validation function accepts an input that does not
follow the specified regular expression then this is
Two of the testers only were allowed to use the ontology considered vulnerability. The research in this paper is
and the related rules for test data generation discussed in different in that test case generation is based on analyzing the
Section 5. It was estimated that the testers that used this semantics of each input in an HTML page. The approach by
research ontology and the related rules finished their work in Alkhalaf et al. [7] will not work if the HTML page has no
40% less time than the other two testers. Obviously, the client-side validation functions.
ontology and the rules reduced significantly the needed time
for TDGWA. Aydin et al. [8] presented an automated testing
framework for testing input validation and sanitization
The threats and limitations to the experiment is that it operations in web applications based on vulnerability
was the 4 testers have different experience in the field of signatures that are characterized as automata. For
TDGWA also the experiment was conducted by only 4 specification of different types of vulnerabilities they use
testers and using a sample of only 10 Web applications. regular expressions that characterize the strings that would
cause a problem or vulnerability when sent to a security
VII. RELATED WORK sensitive function.
Since it is important to generate test data to assess input Offutt et al. [9] describes specific rules for generating test
validation and quality of Web applications and since this task data for Web applications based on violating the constraints
is time and labor consuming, researchers proposed many associated with Web applications inputs. The concept of
approaches that can be used for the purpose of reducing the bypass testing was introduced to submit values to Web
needed time and effort for this task. applications that are not validated by client-side checking.
The closest approaches to the approach in this research Lei et al. [10] proposed an approach for test case
are: generation to detect SQL injection vulnerability. The
Li et al. [1] Suggested extracting the text associated with approach aimed at improving the coverage and efficiency of
an input of a client side HTML document of a Web test case generation process.
application and then generating valid and invalid text data None of the previous research proposed rules for test case
based on this text. The research in this paper also proposed generation for Web applications based on different testing
using the associated text for a certain input for test data techniques depending on input categories.
generation, however, this research introduces a systematic
approach for categorizing or classifying Web applications VIII. CONCLUSIONS AND FUTURE WORK
input based on ontology in order to facilitate the process of
test data generation. Web applications are used every day by most of the
people around the world which makes the process of
Scholte et al. [2] Proposed an approach that is used to assessing the quality of these applications one of the most
improve the secure development of web applications by important processes to be considered by the researchers and
transparently learning types for web application parameters practitioners in this domain. Since software testing is one of
during testing, and automatically applying robust validators the important processes that can be used to assess quality,
for these parameters at runtime. researches must find means to test Web applications, to do
Deepa et al. [3] Introduced Web applications parameter that researchers must firstly find means to generate test data
tampering vulnerability which is vulnerability that occurs for Web applications.
when a user violates client-side input data constraints and an The approach of TDGWA in this paper is based on
application accepts that input without validation. The analyzing HTML client-side input fields where the
research in this paper can detect such vulnerability. associated texts with inputs are stored in ontology in order to
Shahbaz el al. [4] Presented an approach for generating be able to classify these inputs and generate test data
test data for string validation routines. The approach accordingly. After classifying or categorizing a Web
produces both invalid and valid test cases. Invalid test data is applications inputs, test data can be generated depending on
produced by mutating the input regular expression. the category or type of each input. The approach in this
research will reduce the needed time and effort for TDGWA.

The ontology used in this research for input data classification can be augmented when considering more Web applications in different domains, since this ontology is based on a sample of 250 Web applications only.

A tool will be built that can automatically generate test data for a Web application based on analyzing the client-side data and searching for this data in an ontology of input types.

Future work will also discuss generating test data that can be used to assess whether a Web application can defend itself against one of the known Web applications attacks or vulnerabilities, namely, SQL injection.

REFERENCES
[1] N. Li, T. Xie, M. Jin and C. Liu, "Perturbation-based user-input-validation testing of web applications," The Journal of Systems and Software, vol. 83, no. 11, pp. 2263–2274, 2010.
[2] T. Scholte, W. Robertson, D. Balzarotti and E. Kirda, "Preventing Input Validation Vulnerabilities in Web Applications through Automated Type Analysis," in IEEE 36th Annual Computer Software and Applications Conference, Turkey, 2012.
[3] G. Deepa, P. Thilagam, F. Khan, A. Praseed, A. Pais and N. Palsetia, "Black-box detection of XQuery injection and parameter tampering vulnerabilities in web applications," International Journal of Information Security, pp. 1–16, 2017.
[4] M. Shahbaz, P. McMinn and M. Stevenson, "Automatic generation of valid and invalid test data for string validation routines using web searches and regular expressions," Science of Computer Programming, vol. 97, pp. 405–425, 2015.
[5] A. Razzaq, K. Latif, H. F. Ahmad, A. Hur, Z. Anwar and P. C. Bloodsworth, "Semantic security against web application attacks," Information Sciences, vol. 254, pp. 19–38, 2014.
[6] P. Bisht, T. Hinrichs, N. Skrupsky, R. Bobrowicz and V. Venkatakrishnan, "NoTamper: Automatic Blackbox Detection of Parameter Tampering Opportunities in Web Applications," in 17th ACM Conference on Computer and Communications Security, Chicago, Illinois, USA, 2010.
[7] M. Alkhalaf, T. Bultan and J. L. Gallegos, "Verifying Client-Side Input Validation Functions Using String Analysis," in 34th International Conference on Software Engineering (ICSE), Zurich, Switzerland, 2012.
[8] A. Aydin, M. Alkhalaf and T. Bultan, "Automated Test Generation from Vulnerability Signatures," in International Conference on Software Testing, Verification, and Validation, 2014.
[9] J. Offutt, Y. Wu, X. Du and H. Huang, "Bypass Testing of Web Applications," in 15th International Symposium on Software Reliability Engineering (ISSRE), France, 2004.
[10] L. Lei, X. Jing, L. Minglei and Y. Jufeng, "Dynamic SQL Injection Vulnerability Test Case Generation Model Based on the Multiple Phases Detection Approach," in 2013 IEEE 37th Annual Computer Software and Applications Conference, 2013.
Achieving Data Integrity and Confidentiality Using
Image Steganography and Hashing Techniques
Ahmed Hambouz, Yousef Shaheen, Abdelrahman Manna, Dr. Mustafa Al-Fayoumi, and Dr. Sara Tedmori
Department of Computer Science
Princess Sumaya University for Technology
Amman, Jordan
ahmedhambouz@gmail.com, yousefpsut@icloud.com, manna.93@outlook.com, m.alfayoumi@psut.edu.jo, s.tedmori@psut.edu.jo

Abstract—Most existing steganography algorithms keen on weight, while the Most Significant Bit (MSB) is the left
achieving data confidentiality only by embedding the data into most bit and is associated with the highest weight. The
a cover-media. This research paper introduced a new technique proposed in this paper is based on LSB, due to the
steganography technique that achieves both data minimal effect that LSB has on the original image.
confidentiality and integrity. Data confidentiality is achieved
Typically, LSB encoding in steganography is performed by
by embedding the data bits in a secret manner into stego-
image. Integrity is achieved using SHA 256 hashing algorithm altering the LSB of the cover image to become similar to the
to hash the decoding and encoding variables. The proposed value of the most significant bit of the plaintext.
model performed a high PSNR values for using a dataset of
different image sizes with an average PSNR of 82.933%.
B. Hash Function
Keywords— Steganography, Data Confidentiality, Data
Integrity, PSNR, SHA 256, Data Tampering. Cryptographic hash function is a one way function that
takes as input a variable length plaintext and generates a
I. INTRODUCTION fixed size hash value. The hash function is accounted as a
Image steganography is one of premier secure data hiding robust cryptography technique as it is infeasible to compute
techniques. The role of steganography is to hide sensitive the plaintext. The hash function ensures that the sent
data into a cover image. This will protect the data from being plaintext is untampered by comparing the sent hash value
captured by any unauthorized party. Steganography helps with the decoded hash value. Secure Hash Algorithm (SHA)
maintain data confidentiality, data integrity, data is one of the most commonly used hashing techniques.
authentication, and data privacy. Steganography techniques Many versions of the SHA algorithm have been introduced;
vary depending on the algorithm used. Steganography can be the most popular family of hash function is the SHA-2
combined with symmetric algorithms or asymmetric which was adopted in this research paper. SHA-2 consists of
cryptography techniques. The advantage of a specific six hash functions with hash values: 224, 256, 384, or 512
technique lies in the technique ability to achieve the bits. SHA-512 is separated into two sub-families; SHA-
information security fundamentals [1]. In this paper, a new 512/224 and SHA-512/256 [2].
steganography technique that combines a new approach for
Least Significant Bit (LSB) method with a robust hash The rest of this research paper is organized as follows:
algorithm was introduced. The idea behind this approach is section 2 reviews related data encryption works that exploit
to embed any text into a cover image based on an offset flag, steganography and hashing function in data encryption. The
a shared key, and the robust hash function – SHA256 to proposed technique is described in section 3. The
achieve data confidentiality and integrity. Fig.1 illustrates the performance measures of the designed model are presented
general process of steganography algorithm. in section 4. The results are detailed in section 5 and
discussed in section 6 under security and performance
analysis. Section 7 concludes the paper and provides areas
for future research.
II. RELATED WORK
The vast majority of published researches focus on
securing data transmission using different cryptography
techniques. Steganography is an art that researchers have
adopted for its robustness in encrypting sensitive data and
achieving data confidentiality and integrity.
Jose et al. [3] adopted a new model to embed sensitive
Fig.1. Steganography Workflow data into a cover-image by propagating the plaintext bits over
the cover-image using hash salt technique with a password
A. Least Significant Bit Algorithm
provided by the user. The adopted model increased the
All computer data is represented using binary, and difficulties of brute-force attack as the salt hash will result
grouped together in bytes. The LSB represents the right into 2256 combinations of possible salted passwords. The
most bit of an 8 bits array and is associated with the lowest authors also increased the model security by using Advanced


Encryption Standard (AES) algorithm to encrypt the operator XOR is applied between the LSB bits and the array
plaintext before embedding it into a cover-image. indexes.
Gupta et al. [4] proposed a hybrid approach by Indrayani et al. [9] adopted a new mp3 audio
combining steganography with the AES algorithm and then steganography technique by combining steganography with
used a hash function to increase the security of the model. AES algorithm and MD5 hash function. The designed model
The approach the authors proposed start by encrypting the is divided into four core levels: encrypting the data using
plaintext using AES algorithm. The encrypted data then is AES algorithm, where the key that is used to encrypt the data
stored in a hashed pixel location of the cover-image to is digested using MD5 hash function. The encrypted data
generate a stego-image. The results of the proposed model then are embedded into a cover-image which presents the
achieved an accurate Mean Squared Error (MSE) values encoding process. Once stego-image is received by the
when evaluated on different image types such as, .tiff, .png, second party of communication, the image is extracted and
.jpg, .bmp, and .gif. the cipher text is decrypted using the same hashed-key. The
authors achieved a high secured model against several active
Chaudhary et al. [5] proposed an effective steganography attacks types.
technique that uses RGB images. The idea behind their work
is to indicate the pixel value using the most significant bit of Saini et al. [10] proposed a hybrid approach in image
RGB channels instead of utilizing the entire channel. The security. The image is encrypted using a modified AES
algorithm works as follows: the LSB channels that are used algorithm and then embedded into a cover-image to generate
for hidden data depend on the MSB sequence. For example, a stego-object. A new version of AES algorithm is presented
if the sequence of MSB is 101 then the data hidden sequence “MAES”; where a new shift row transformation is presented.
is GRB. Hash function was adopted in this model by This transformation is done as follows: if the bit value that is
applying the logical operator XOR between the cover-image located in the first row and first column of the initial matrix
LSB bits and the stego-image to indicate which pixels have is even then no shifting is applied. Moreover, the other three
changed. rows will be shifted with an offset value equal to the offset
value of the common AES row shifting transformation. The
Madhuravani et al. [6] presented an authenticated designed model achieves a high PSNR rate of all size images
steganography scheme that uses a dynamic hashing comparing to traditional AES algorithm.
algorithm. Firstly, the texture data is embedded into a cover-
image that is then encoded using a stego-key. When the The previous approaches focused on achieving a high
second party receives the stego-image, the stego-image is secured steganography model using various algorithms of
extracted to generate both the plaintext and the image-size. encryption and encoding. In this paper, a new steganography
The plaintext after the extracting process is applied on a algorithm is presented to achieve confidentiality and integrity
dynamic hash function using either MD5 or SHA functions in a high performance secured model.
to generate a digested text. The digested text is then
embedded into a cover-image to generate the stego-image at III. PROPOSED MODEL
receiver side. Once the stego-image is sent and received by This research paper introduces a hybrid steganography
the sender party, the receiver will extract the message and scheme that combines a new steganography algorithm
compares the received hash with the hash of extracted approach with Hash function. The process of steganography
message. This new approach improved the security of is generally divided into two stages of encoding and
steganography technique by securing the communication decoding. In this research, the proposed model has four main
channel. stages: new image addressing, text size hashing, encoding,
Riasat, et al. [7] introduced a robust hash-based and decoding.
steganography model. The designed model has a strenuous A. New Image Addressing and Confusion Concept
capability to hide image and data without losing the image
Image addressing process starts by selecting a conditional
quality. The data scattering depends on a random number
image size that will be used for embedding the plaintext. The
that is generated by using a hash function, where the hash
reason behind using a conditional image size is that the
function uses both elements; the hash-key and the image
following formula must apply.
chunks number. The image chunks are separated into three
fields, where the ASCII values will distributed on these
chunks sequentially.
Image Size (IS) = Selected Image Size – 512 pixels (1)
Charan et al. [8] proposed an efficient secured
steganography technique using multi-level encryption
algorithms. The adopted model is based on two levels of The intuition behind deducting 512 pixels is to store the
encryption, the Chaos encryption and the Ceaser encryption needed variables, in which the 512 pixels are divided into
techniques. At the beginning, the data is encrypted using two halves; the first half is reserved for the permutated
Ceaser cipher. After the encryption, LSB algorithm is used encoded message size using the permutation formula as
and applied on the RGB image. Thus, the data distribution is follows.
done sequentially, where the first three bits are replaced in
the three LSB bits of Red byte, the second three bits are + , 2 = 0
= (2)
replaced in the three LSB bits of Green, and the last two bits + 256 − , 2 = 1
are replaced in the two LSB bits of Blue. The data after
applying LSB are scattered into a Chaos cover-image that is Where L is Message Size Bits’ Pixel Location Address.
divided into a two dimensional array and then the logical

The second half is reserved for the hashed message size
that will be discussed later. The permutation formula is based
on the bit sequence state; if it is odd then it is stored starting
from the top of the encoded cover-image as shown in
equation (3), and, if the bit sequence is even then it is stored
at the bottom of the encoded cover-image as shown in
equation (4).
X = I + K + √ + ) mod IS (3)

X = MS + K + √ + ) – I) mod IS (4)

Here X is the pixel location where to embed text bit in the


stego-image (encoded cover-image), I represents the text bit
sequence, K is the shared key between the parties of
communication, IS represents the image size, and MS is the
message size.
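Since the addressing formulas (2)-(4) are only partially legible in this copy, the sketch below does not attempt to reproduce them; it merely illustrates the overall idea of this stage, namely a key-dependent scatter address computed per message bit, a reserved block of 512 pixels (eq. (1)), and LSB replacement at the chosen pixel. The hash-based scatter_address function is a stand-in invented for the example, not the authors' formula.

import hashlib

# Illustrative sketch only: scatter each message bit to a key-dependent pixel
# address inside the usable part of the image and replace that pixel's LSB.
def scatter_address(i, key, usable_size):
    digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
    return int.from_bytes(digest[:4], "big") % usable_size

def embed_bit(pixels, index, bit):
    """Replace the least significant bit of one 8-bit pixel value."""
    pixels[index] = (pixels[index] & 0xFE) | bit

def embed_message(pixels, message_bits, key):
    usable_size = len(pixels) - 512          # last 512 pixels reserved (eq. 1)
    for i, bit in enumerate(message_bits):
        embed_bit(pixels, scatter_address(i, key, usable_size), bit)
    return pixels

cover = [128] * 4096                          # flat grey "image" for the demo
stego = embed_message(cover[:], [1, 0, 1, 1], key=42)
print(sum(a != b for a, b in zip(cover, stego)))   # at most 4 pixels touched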
B. Text Size Hashing
First, the plain text was extracted into an array of bits that
will be scattered over the encoded cover-image. The aim of
adopting text size hashing is to increase the difficulty for any
intruder to alter the text. Even if the texture is altered; the
hash text size value indicates if any intentional alteration
occurred. The text size was hashed using SHA-256 where the
resulted hash value was applied on the logical operator XOR
with the shared key K and then stored in the second half of Fig.2. Encoding Process
512 pixels of the image size as shown in the following
equation. D. Decoding
H = (Hashed Message Size) XOR K (5) The decoding process is also divided into two sub-
categories: once the stego-image is received by the second
C. Encoding party of communication, the receiver retrieves the values of
The encoding process in this research paper is divided both halves that were located in the last 512 pixels; first of
into two sub-categories; the first encoding process was to all, a retrieval of the encoded message size was performed
calculate the address where the text bit will be embedded. using equations 3 and 4, then the receiver calculates the
Next, the encoding process was represented by embedding hashed message size value and apply the XOR logical
the text bits in the encoded cover-image to produce a stego- operator with the shared key to ensure the equality between
image, after distributing the hashed message size into the two values that ensures no data alteration occurred during
encoded cover-image. Fig. 2 illustrates the encoding the transmission.
process.
The second sub-category process is how to extract the
text message from the stego-image. The extracted characters
are then stored in an array which is later converted to the
original text (plaintext). Fig. 3 illustrates the decoding
process.
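A small sketch of the integrity check described in the hashing and decoding subsections follows; treating the shared key as an integer mask for the XOR step, and hashing the decimal string form of the message size, are assumptions of this sketch.

import hashlib

# Sketch of the check behind equation (5): the message size is hashed with
# SHA-256 and XORed with the shared key before being stored; the receiver
# repeats the computation and compares the two values.
def protected_size(message_size, key):
    digest = hashlib.sha256(str(message_size).encode()).digest()
    return int.from_bytes(digest, "big") ^ key

def verify(stored_value, recovered_size, key):
    return stored_value == protected_size(recovered_size, key)

key = 0xA5A5A5A5
stored = protected_size(1500, key)           # sender side, hidden in the image
print(verify(stored, 1500, key))             # True  -> size untampered
print(verify(stored, 1501, key))             # False -> alteration detected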

Fig. 3. Decoding Process

IV. PERFORMANCE MEASURES

The proposed model's performance depends on the efficiency of the overall implementation through the four stages. To ensure the efficiency of the proposed model, the following rules are taken into consideration:
• The data must remain fixed after embedding it into the cover-image.
• The visual resolution of the image must remain high with no visual noise.
• The stego-image quality must be near to the original cover-image quality.
• The time consumed in the encoding process is unremarkable.

Peak Signal to Noise Ratio (PSNR) is a well-known tool widely used to measure the visual quality of processed images. In this research, it is used to evaluate the stego-image quality and to compare it with the original cover-image quality. PSNR can be calculated as shown in eq. (6), where a higher PSNR (in dB) indicates better processed image quality:

PSNR = 10 log10( MAX^2 / MSE )        (6)

where MSE is the mean square error between the processed and original cover image, and MAX is the maximum intensity value used to represent each pixel in the image. If the obtained PSNR value is more than 30 dB, then the processed image quality is unremarkably changed compared to the original. However, if the obtained PSNR value is less than 29 dB, this will result in a visual degradation in image quality. Table I illustrates a comparison of PSNR values between the proposed model and the MAES algorithm [10]. The authors of the MAES algorithm achieved high PSNR values as well by combining a steganography technique with an enhanced AES algorithm. The simulation results of MAES were accurate enough to compare with the proposed model's experimental results.

TABLE I. PSNR VALUES OF COVER-IMAGE VS STEGO-IMAGE

Algorithm | Image Size | PSNR (dB)
MAES [10] | 128x128 | 59.1601
MAES [10] | 256x256 | 59.9814
MAES [10] | 512x512 | 59.4681
Proposed Model | 128x128 | 78.6012
Proposed Model | 256x256 | 82.1002
Proposed Model | 512x512 | 87.4905

The overall performance of the proposed model is evaluated based on the time consumed in the encoding process. Attention is always given to ensure that the encoding process is completed in an acceptable time frame. Table II shows the time consumption of encoding characters into different image sizes.

TABLE II. TIME CONSUMPTION OF THE ENCODING PROCESS

Character Size | Image Size | Time (ms)
500 Characters | 128x128 | 42
500 Characters | 256x256 | 166
500 Characters | 512x512 | 532
1000 Characters | 128x128 | 241
1000 Characters | 256x256 | 325
1000 Characters | 512x512 | 632
1500 Characters | 128x128 | 513
1500 Characters | 256x256 | 756
1500 Characters | 512x512 | 852

Histogram analysis is a well-known technique used to measure changes between original and processed images. Histogram analysis calculates the levels of intensity values available in a given image. Therefore, it was used in this research. It resulted in minor differences between the original cover-image and the stego-image. Fig. 4 illustrates the histogram analysis of three images that are used in the proposed model, which indicates that the stego-image can pass the histogram test.
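For reference, eq. (6) can be computed directly as in the short sketch below; the two tiny pixel lists are made up solely to demonstrate the call.

import math

# PSNR = 10 * log10(MAX^2 / MSE), with MSE the mean squared error between
# cover and stego pixels and MAX = 255 for 8-bit greyscale values.
def psnr(cover, stego, max_value=255):
    mse = sum((c - s) ** 2 for c, s in zip(cover, stego)) / len(cover)
    if mse == 0:
        return float("inf")                  # identical images
    return 10 * math.log10((max_value ** 2) / mse)

cover = [120, 121, 119, 120, 122, 118]
stego = [120, 121, 118, 120, 123, 118]       # two LSBs flipped by embedding
print(round(psnr(cover, stego), 2))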

Confidentiality was achieved by embedding the text bits into
the LSB pixels of the cover-image using formulas that
depends on a secret key. The text size was hashed SHA-256
algorithm and then applied it to the logical operator XOR
that produced a hashed message size. This process achieves
the concept of integrity. Using robust hash function
increased the difficulties of brute-force attack; where the
hashed message size is hard to be calculated due to the XOR
operation with the secret shared key.

B. Performance Analysis
Fig. 4. (a) Cover-Image Histogram Vs. (B) Stego-Image Histogram The proposed model fulfilled a high performance through
a set of various metrics. The PSNR results that were
The histogram analysis shows a slight difference between discussed in table 1 had shown that the proposed model was
cover-images that are selected for text embedding and the better than the MAES algorithm. Moreover, the encoding
resulting stego-images. This indicates that the user can not and decoding processes were executed over Csharp
recognize any difference between both images resolutions. programing language, concluding that the time that was
consumed by the overall model is less than the consumed
V. EXPERMENTAL RESULTS AND ANALYSIS time while running the implementation over Matlab or other
The proposed model was measured using Intel® CoreTM machine learning. At the end, the performance of the
i7-4580HQ 64 bits system with 8GB RAM running on proposed model depends on the selected image size as it
windows 8.1. found that the stego-image size is approximately equal to
Different image sizes (128x128, 256x256, and 512x512) the cover-image size.
were selected for the encoding process as illustrated in Fig.5 VI. CONCLUSION AND FUTURE WORK
respectively.
The huge data transfer over public networks leads most of
information security engineers to adopt many methods that
allow them to transfer sensitive data over a secure fashion
environment. Steganography is one of the most commonly
used techniques, due to the ease of use and the high data
security that can be provided. In this paper the combined
approach for image security has been presented. The
scattering and embedding processes of texture data were
achieved over a set of equations. The proposed model
increased the difficulties for any intruder to alter the
embedded sensitive data which proved over histogram
analysis that shows a slight difference between cover-
images that are selected for text embedding and the resulting
stego-images, as the concepts of confusion, permutation,
and hashing were adopted in the discussed model. For future
work, the selected image type should be expanding to
include different types of images such as, .tiff, .bmp, and
.gif.
REFERENCES
[1] R. Kaur and V. K. Banga, "Image Security using Encryption based Algorithm," International Conference on Trends in Electrical, Electronics and Power Engineering (ICTEEP 2012), Singapore, July 15-16, 2012.
[2] I. S. Bajwa and R. Riasat, "A New Perfect Hashing based Approach for Secure Steganography," Sixth International Conference on Digital Information Management (ICDIM 2011), 2011, pp. 174-178.
[3] P. G. Jose, S. Chatterjee, M. Patodia, S. Kabra, and A. Nath, "Hash and Salt based Steganographic Approach with Modified LSB Encoding," International Journal of Innovative Research in Computer and Communication Engineering, vol. 4, issue 6, June 2016.
[4] S. Gupta, A. Kalra, and C. Hasti, "A Hybrid Technique for Spatial Image Steganography," Third International Conference on Computing for Sustainable Global Development (INDIACom), 2016, pp. 643-647.
[5] A. Chaudhry and J. Vasvada, "A Hash Based Approach for Secure Keyless Image Steganography in Lossless RGB Images," Fourth International Congress on Ultra Modern Telecommunications and Control Systems, 2012, pp. 941-944.
[6] B. Madhuravani, D. S. R. Murthy, P. B. Reddy, and K. V. S. N. Rao, "Strong Authentication Using Dynamic Hashing and Steganography," International Conference on Computing, Communication and Automation, 2015, pp. 735-738.
[7] R. Riasat, I. S. Bajwa, and M. Z. Ali, "A Hash Based Approach for Colour Image Steganography," International Conference on Computer Networks and Information Technology, 2011, pp. 303-307.
[8] G. S. Charan, N. Kumar, Karthikeyan, Vaithiyanathan, and D. Lakshmi, "A Novel Based Image Steganography with Multi-Level Encryption," IEEE Second International Conference on Innovations in Information Embedded and Communication Systems (ICIIECS 15), 2015, pp. 1-5.
[9] R. Indrayani, H. A. Nugroho, R. Hidayat, and I. Pratama, "Increasing the Security of MP3 Steganography Using AES Encryption and MD5 Hash Function," Second International Conference on Science and Technology-Computer (ICST), 2016, pp. 129-132.
[10] J. K. Saini and H. K. Verma, "A Hybrid Approach for Image Security by Combining Encryption and Steganography," IEEE Second International Conference on Image Information Processing (ICIIP), 2013, pp. 607-611.

Detecting network anomalies using machine
learning and SNMP-MIB dataset with IP group
Abdelrahman Manna, Princess Sumaya University for Technology, manna.93@outlook.com
Mouhamad Alkasassbeh, Princess Sumaya University for Technology, m.alkasassbeh@psut.edu.jo

Abstract— SNMP-MIB is a widely used approach that applies machine learning to classify data and obtain results, but using the full, huge SNMP-MIB dataset is not efficient and consumes both time and resources. In this paper, REP Tree, J48 (Decision Tree) and Random Forest classifiers were used to train a model that detects anomalous devices inside the network in order to predict the network attacks that affect the Internet Protocol (IP) group. This trained model can be used in the devices that detect anomalies, such as intrusion detection systems.

Keywords—Network attacks, SNMP, SNMP-MIB, Anomaly Detection, DOS.

I. INTRODUCTION
Nowadays, almost the entire world is connected via the internet and the number of internet users is increasing day by day; every user has at least one or two devices, such as a laptop or a mobile phone.
As the number of users increases, the attacks on their devices also increase, especially the attacks that affect networks, which are called "network attacks".
One of the most widely used and well-known attacks is the denial of service (DOS) attack, which will be described in the coming section.
In this paper, a DOS attack is analyzed, as well as the attacks on the Internet Protocol (IP) group, which is a subset of the SNMP-MIB groups described in [1], where the authors showed the different groups that are part of SNMP-MIB, including their different attacks and attack analyses. In this paper only the IP group is taken and analyzed, in order to work on its variables, show the effect of all variables together and their occurrence percentage, then eliminate the most irrelevant ones and concentrate on the most relevant ones that give the highest accuracy for the trained model. This enables the model to detect the network attacks and reduce the false negative rates, which helps in implementing the trained model in the devices that are responsible for detecting network attacks, such as intrusion detection systems.

A. Network Attacks
Network attacks is a term that describes the attacks that may occur and affect a computer network in general. These attacks have big effects on the connected nodes, as they might destroy the software installed on a connected node or prevent connections from reaching or leaving the node; the latter is also known as a denial of service (DOS) attack.
A denial of service attack can be described as an attack that affects the network to prevent access to network resources such as servers. This attack is considered dangerous because it prevents legitimate users from reaching the resources whenever they need them, especially if the resource holds sensitive and important information that needs to be reached immediately.

B. Simple Network Management Protocol (SNMP)
SNMP, introduced in the late 1980s [2], is an application layer protocol that is used to control the functions of network nodes (devices), in order to change their information or their behaviour when needed. SNMP is supported by multiple devices, such as routers, switches, servers and more, and is included in the Internet Protocol (IP) suite.
SNMP collects the data that needs to be managed and manages it using a Management Information Base (MIB) that describes the system configuration.

II. RELATED WORK
One of the current hot topics in network attacks is DOS. Researchers focus on anomaly detection for the anomalies that exploit the network and behave badly, in order to prevent legitimate nodes from being blocked from connecting to the network or from reaching sensitive and important information.
The work in [3] showed in detail the classification and technical analyses of network intrusion detection systems and the aspects that must be taken into consideration when using Intrusion Detection Systems (IDS).
In [4][5] the authors showed one of the most commonly used techniques for detecting nodes that may affect the network and result in a denial of service attack: using machine learning by training a model and giving it a set of attacks with actual measures, so the model can detect the anomalies or attacks depending on the predefined datasets and results.
The authors in [5] discussed a machine learning technique for detecting anomalies that uses feature selection analysis, which takes the top or most frequently used attacks and objects and classifies them in a specific way that does not consume or exhaust the network resources, thereby enhancing performance; however, there is a probability of having false negatives and false positives in the network.
In [6] the authors showed ways of detecting the Distributed Denial of Service (DDOS) attack, which is more dangerous than the regular denial of service because the attacks come from different locations; the authors used a dataset and
applied it to three classification techniques, which are Multilayer Perceptron (MLP), Naïve Bayes and Random Forest.
In [7] the authors used predictive models and classifications for intrusion detection that rely on machine learning classifiers; they used Logistic Regression, Gaussian Naive Bayes, Support Vector Machine and Random Forest algorithms, and their results showed that Random Forest gave the best results in classifying whether the traffic is normal or not.
In [8] the author used MIB data and a Support Vector Machine (SVM) to achieve high accuracy, fast detection and low false alarms.

III. PROPOSED MODEL
The proposed model is trained using the Weka tool (v3.8), which uses machine learning to achieve its results by training a model. In this paper three classifiers were used (Random Forest, J48 (Decision Tree) and REP Tree) to generate the results and check the accuracy of applying the IP group attacks to each classifier; the results are shown in the results section.

A. SNMP-MIB Dataset
In paper [4] the authors used a dataset that contains around 4998 records of 34 variables captured using MIB; paper [1] contains more information and a description of the dataset. In this paper, the group that is used is taken from the dataset used in [4], and the chosen group is the Internet Protocol (IP) group. The attacks considered are the attacks that may result in a DOS attack, which are HTTP flood, UDP flood, ICMP-ECHO, TCP-SYN, Slowpost and Slowloris.
The MIB variables that are used in the IP group are described in Table I.

TABLE I. INTERNET PROTOCOL (IP) VARIABLE DESCRIPTION
V1 (ipInReceives): the total number of input datagrams received from the interfaces, including the ones received in error.
V2 (ipInDelivers): the total number of input datagrams delivered successfully to the IP user protocols (including ICMP).
V3 (ipOutRequests): the total number of IP datagrams supplied to IP in requests for transmission, not including the datagrams counted in ipForwDatagrams.
V4 (ipOutDiscards): the number of output datagrams with no errors preventing their transmission to their destination, but which were nevertheless discarded.
V5 (ipInDiscards): the number of input datagrams with no errors preventing their continued processing, but which were nevertheless discarded.
V6 (ipForwDatagrams): the number of input datagrams for which this entity was not their final destination.
V7 (ipOutNoRoutes): the number of datagrams discarded because no route could be found to transmit them to their destination.
V8 (ipInAddrErrors): the number of input datagrams discarded because the IP address in their header's destination field was not a valid address to be received at this entity.

B. Machine Learning Classifiers
Machine learning classifiers are generated by an application in order to classify the attacks. The classifiers are mainly used to build a model from already classified objects and then use that model to classify new, previously unseen objects as accurately as possible. The classifiers are applied to classify the dataset used in this paper.
The used classifiers are supervised learning algorithms that use labelled training data; they are described in detail as follows:
• Random Forest Algorithm Classifier: Random Forest is a flexible and easy-to-use machine learning algorithm that gives great results most of the time. It is one of the most used algorithms because of its simplicity.
• J48 (Decision Tree) Classifier: the decision tree is built using information gain, a concept that measures the amount of information contained in a set of data. It gives an idea of the importance of an attribute in a dataset.

• REP Tree Algorithm Classifier: the REP Tree algorithm uses regression-tree logic and creates multiple trees in different iterations; after generating the trees it chooses the best one, which is considered the representative [1].

C. Feature Selection
Features are mainly used to reduce the computation time and to improve the performance of the trained model by minimizing the amount of data used; the feature selection strategy aims to remove the irrelevant fields in order to provide good results.
Feature Selection Methods
There are three methods of feature selection based on the evaluation criteria, namely Filter, Wrapper, and Hybrid, as defined by the authors in [9].
Filter methods are used as a pre-processing step. Feature selection here is independent of any machine learning algorithm, so features are selected depending on scores calculated from previous steps and statistics.
Wrapper methods treat the selection of a set of features as a search problem; different features are combined together and then given a score according to the accuracy of the model.
Hybrid methods are a combination of several feature selection methods, such as filter and wrapper, used together to achieve the best results.
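The paper carries out the training and feature selection in Weka 3.8. Purely as an illustrative sketch (not the authors' code), the same idea can be expressed with scikit-learn, where RandomForestClassifier and DecisionTreeClassifier stand in for Weka's Random Forest and J48 (scikit-learn has no REP Tree), mutual information plays the role of an information-gain-style filter, and the CSV path and the V1-V8/label column names are assumptions:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

df = pd.read_csv("snmp_mib_ip_group.csv")           # hypothetical file with V1..V8 + label
X, y = df[[f"V{i}" for i in range(1, 9)]], df["label"]

# Filter-style feature ranking (information-gain-like score), keep the top 5 variables
scores = mutual_info_classif(X, y, random_state=0)
top5 = X.columns[scores.argsort()[::-1][:5]]

X_tr, X_te, y_tr, y_te = train_test_split(X[top5], y, test_size=0.3, random_state=0, stratify=y)
for name, clf in [("Random Forest", RandomForestClassifier(random_state=0)),
                  ("Decision Tree (J48-like)", DecisionTreeClassifier(random_state=0))]:
    clf.fit(X_tr, y_tr)
    print(name, accuracy_score(y_te, clf.predict(X_te)))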

Evaluation Metrics
In this paper, well-known evaluation criteria are used to measure the classifiers' performance: F-Measure, accuracy, precision and recall.
The basic performance is indicated by the confusion matrix in Table II.

TABLE II. CONFUSION MATRIX FOR TWO CLASSES
Actual Positive: Predicted Positive = TP, Predicted Negative = FN
Actual Negative: Predicted Positive = FP, Predicted Negative = TN

The true positive (TP) rate reflects the rate of correct predictions of positive traffic, while the false positive (FP) rate reflects the rate of negative packets that are considered positive traffic. The true negative (TN) rate is the total number of negative traffic that is classified correctly as negative, while the false negative (FN) rate shows the total number of positive traffic that is classified incorrectly as negative.

Precision = TP / (TP + FP)        (1)
Recall = TP / (TP + FN)        (2)
F-Measure = 2 x (Precision x Recall) / (Precision + Recall)        (3)

Fig. 1 shows a description of the recall and precision concepts:
Fig. 1: Precision & Recall

The weighted average accuracy of the classes is shown in Table III; the weighted average of each metric is a result of all of the features used for the IP group, calculated using the WEKA tool and the REP Tree classifier.

TABLE III. WEIGHTED ACCURACY RATE
Accuracy Measure: Weighted Average; TP Rate: 1.0; FP Rate: 0.0; Precision: 1.0; Recall: 1.0; F-Measure: 1.0
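A small sketch of Eqs. (1)-(3) and the confusion-matrix counts, using scikit-learn only as a convenience check; the labels and predictions below are made-up placeholder values, not results from the paper:

from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true = ["attack", "attack", "normal", "normal", "attack"]   # placeholder ground truth
y_pred = ["attack", "normal", "normal", "normal", "attack"]   # placeholder predictions

# With labels ordered [negative, positive], ravel() yields TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=["normal", "attack"]).ravel()
precision = tp / (tp + fp)                                   # Eq. (1)
recall = tp / (tp + fn)                                      # Eq. (2)
f_measure = 2 * precision * recall / (precision + recall)    # Eq. (3)
# Cross-check against scikit-learn's own implementations
assert abs(precision - precision_score(y_true, y_pred, pos_label="attack")) < 1e-9
assert abs(recall - recall_score(y_true, y_pred, pos_label="attack")) < 1e-9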

IV. EXPERIMENTAL RESULTS
The results were calculated using the WEKA tool 3.8, which uses machine learning to achieve its results by training a model. The hardware used is an Intel® Core™ i7, 64-bit system with 8 GB RAM running Windows 10.
The experimental results are shown in this section; the results of the proposed model are generated from the previously mentioned MIB dataset. The classification techniques are used to get results for the IP group separately from the main group. At the end, attribute selection techniques were used to enhance the accuracy of the proposed model by removing the irrelevant features and keeping the most relevant ones. This is used to show the impact of the IP group on the classification of attacks.
The three classifiers' accuracy is shown in Table IV, noting that the REP Tree and Random Forest algorithms were more accurate than J48 (Decision Tree).

TABLE IV. CLASSIFIERS' ACCURACY
Random Forest: 99.98%; J48: 99.88%; REP Tree: 99.98%

The F-Measure results for all of the IP group variables (V1, V2, V3, V4, V5, V6, V7 and V8) are shown in Fig. 2. It can be noticed that for the brute-force attack the three classifiers
gave a value of 1, which means that their accuracy for this attack is 100%, while they differ for the other attacks.
Fig. 2: F-Measure for All IP Group Variables
Top 5 and top 3 variables were selected to remove the most irrelevant variables from the IP group; they were selected using InfoGainAttributeEval, which evaluates the worth of an attribute by measuring the information gain with respect to the class, together with the ranker method.
The results shown in Fig. 3 represent selecting the top 5 variables, which are V1, V4, V5, V6, and V8:
Fig. 3: F-Measure for Top 5 Variables - InfoGain Attribute Evaluator
It can be noticed that the brute-force percentage is still the same, and that the udp-flood, slowpost and slowloris attacks also gave 100% accuracy.
The results shown in Fig. 4 represent selecting the top 3 variables, which are V1, V4, and V5:
Fig. 4: F-Measure for Top 3 Variables - InfoGain Attribute Evaluator
It can be noticed that the brute-force attack accuracy was reduced in comparison with the above results, while the udp-flood and slowpost attacks were still 100% accurate. This means that removing more irrelevant variables or reducing the training set size does not necessarily yield more accuracy, because in this experiment the top five variables gave more accuracy than selecting the top 3.
Another attribute evaluator was used to get the top 5 and top 3 variables, namely ReliefFAttributeEval, which evaluates the worth of an attribute by repeatedly sampling an instance and considering the value of the given attribute for the nearest instance of the same and of a different class.
The results shown in Fig. 5 represent selecting the top 5 variables, which are V1, V5, V6, V7, and V8:
Fig. 5: F-Measure for Top 5 Variables - ReliefF Attribute Evaluator
As shown in Fig. 5, the only attack which was 100% accurate was slowpost.
The results shown in Fig. 6 represent selecting the top 3 variables, which are V6, V7, and V8:
Fig. 6: F-Measure for Top 3 Variables - ReliefF Attribute Evaluator

V. CONCLUSION
In this paper, SNMP-MIB data were used to detect DOS attack anomalies that may affect the network. Three machine learning algorithms were used to classify the data: Random Forest, J48 (Decision Tree) and REP Tree. Two attribute evaluators were used to remove the irrelevant variables and get the top 5 and top 3 variables; the two attribute evaluators are InfoGain and ReliefF. The classifiers and
attributes were applied to the IP group, and the results showed that applying the REP Tree classifier gave the highest accuracy in all cases: the full IP group, the top 5 variables, and the top 3 variables.

VI. REFERENCES
[1] M. Al-Kasassbeh, G. Al-Naymat and E. Al-Hawari, "Towards generating realistic SNMP-MIB dataset for network anomaly detection," International Journal of Computer Science and Information Security, vol. 14, pp. 1162-1185, 2016.
[2] J. Schönwälder, A. Pras, M. Harvan, J. Schippers and R. van de Meent, "SNMP Traffic Analysis: Approaches, Tools, and First Results," 10th IFIP/IEEE International Symposium on Integrated Network Management, 2007.
[3] N. Nanda and A. Parikh, "Classification and Technical Analysis of Network Intrusion Detection Systems," International Journal of Advanced Research in Computer Science, vol. 8, 2017.
[4] M. Alkasassbeh, G. Al-Naymat and E. Hawari, "Using machine learning methods for detecting network anomalies within SNMP-MIB dataset," International Journal of Wireless and Mobile Computing, 2018.
[5] M. Almseidin, M. Al-kasassbeh and S. Kovacs, "Fuzzy Rule Interpolation and SNMP-MIB for Emerging Network Abnormality," International Journal on Advanced Science, Engineering and Information Technology, vol. 9, no. 3, 2019.
[6] S. Aljawarneh, M. Aldwairi and M. Bani Yassein, "Anomaly-based intrusion detection system through feature selection analysis and building hybrid efficient model," Journal of Computational Science, vol. 25, pp. 152-160, 2018.
[7] M. Alkasassbeh, G. Al-Naymat, A. Hassanat and M. Almseidin, "Detecting Distributed Denial of Service Attacks Using Data Mining Techniques," International Journal of Advanced Computer Science and Applications, vol. 7, no. 1, 2016.
[8] M. Belavagi and B. Muniyal, "Performance Evaluation of Supervised Machine Learning Algorithms for Intrusion Detection," in Twelfth International Multi-Conference on Information Processing, 2016.
[9] B. Cui-Mei, "Intrusion Detection Based on One-class SVM and SNMP MIB data," 2009 Fifth International Conference on Information Assurance and Security, 2009.
[10] G. Chandrashekar and F. Sahin, "A survey on feature selection methods," Computers & Electrical Engineering, 2014.
Enhancing Data Protection Provided by VPN
Connections over Open WiFi Networks
Ashraf Karaymeh, KPMG, akaraymeh@kpmg.com, ashrafkaraimeh@gmail.com
Mohammad Ababneh, King Hussein School of Computing Sciences, Princess Sumaya University for Technology, Amman, Jordan, m.ababneh@psut.edu.jo
Malik Qasaimeh, King Hussein School of Computing Sciences, Princess Sumaya University for Technology, Amman, Jordan, m.qasaimeh@psut.edu.jo
Mustafa Al-Fayoumi, King Hussein School of Computing Sciences, Princess Sumaya University for Technology, Amman, Jordan, m.alfayoumi@psut.edu.jo

Abstract—Open Wi-Fi networks are a serious challenge to sensitive and private data because it is hard to know who else is using the network and monitoring traffic. Such open, free and unencrypted networks might allow an adversary to hack the devices connected to them, making the use of such networks highly risky and harmful. In order to use these public networks securely, it is recommended to use a VPN in tunneling mode to assure that the data is encrypted during transmission. But this is not enough, as most of today's smart devices and laptops run applications that might start communicating with their servers before the VPN has been established. In this work, we solve this problem by creating a device that enables users to access the internet securely over public Wi-Fi networks and provides security right from the beginning when deployed between the public Wi-Fi and the user's personal devices. Experiments show the security advantages of our solution.

Keywords— open Wi-Fi network security, Raspberry Pi, OpenVPN.

I. INTRODUCTION
It has become very common these days for employees to work remotely outside their organization's premises. A recent survey published by Forrester Consulting on the Citrix website claims that 65% of the respondents worked remotely at least one day per week, and 37% said that they worked two or more days per week [1]. In order to get access to their work data servers they need to establish internet connections through the gateways of the places they are trying to connect from. These places could be hotels, coffee shops, restaurants, airports, etc.
Allowing employees to work remotely is a high risk to organizations' sensitive data. Most of the big companies use and require VPN technology in order to allow their employees to remotely access and exchange sensitive data [2]. However, in order to establish a VPN connection, someone must connect to the available internet gateway first and wait a few minutes until the VPN connection becomes fully running, leaving his device vulnerable to various types of attacks, especially if he is connecting through an open Wi-Fi network [3].
In addition to open WiFi networks, there is the danger of rogue WiFi networks, where a hacker masquerades a network SSID in popular places and tricks people into connecting to him rather than to the genuine hotspot. This enables the attacker to monitor traffic, possibly infect the victim's devices with malware, possibly take control of the devices and maybe execute Man-in-the-Middle (MITM) attacks [4]. This is also sometimes called an Evil-Twin attack; this attack happens mostly on hotspots left unattended for a long period [5].
The proven solution for providing an additional layer of security when using public and open networks is establishing a Virtual Private Network (VPN) tunnel. This ensures that all traffic is encrypted before transmission. However, until the VPN is established, the system remains exposed to vulnerabilities. Some people would think that they are under VPN protection just because they turned on the VPN connection in their browser or entered their credentials into a VPN client authentication window. Most applications on modern devices need to connect to their servers automatically at startup for various reasons, such as looking for updates or receiving emails and messages as with WhatsApp or Facebook, or even updates to the OS itself, as soon as they see an established Internet connection. Hackers would take advantage of this behavior by monitoring traffic and acquiring some important information about the device, and may even succeed in sending malware to the device in the few minutes before the VPN connection is established [6].
Some solutions try to mitigate this problem by installing a VPN application or a VPN browser on the user's device. But these solutions would only work on certain operating systems and still need to be connected to the internet before establishing the VPN tunnel, which brings us back to square one.
In our work, we present a solution to the problem in the form of an affordable device of our own design that is deployed between the open WiFi and the user's device and is capable of prohibiting any communication from the user's device until the VPN is established. In our solution, we first enforce the establishment of the VPN tunnel, then we allow the encrypted data to be transmitted through the open Wi-Fi.


II. RELATED WORK
In this section, we review work relevant to finding a solution to our problem.
A. Open-VPN
The standard solution for solving this research problem is by
using a VPN network. Multiple protocols support VPN such as:
PPTP, L2TP/IPSec, OpenVPN, SSTP, and IKEv2. There are
many implementations of the VPN technology, but the most
widely used, free and open-source is OpenVPN. In our work, we
assume that the VPN networks are using OpenVPN for its ease
of use, availability, security and compatibility with the network
devices [7]. We focus on the use of OpenVPN in the following
two ways:
1) OpenVPN without a firewall
The VPN service is used to create an encrypted virtual tunnel between the client and the VPN server, as shown in "Fig. 1". This will encrypt all traffic until it reaches the VPN server, which increases security and makes it difficult for a hacker to monitor it.
Fig. 1. OpenVPN without firewall
Furthermore, file sharing must be turned off across the public network to prevent users on the network from easily finding you and your files, especially when you do not want to share anything with anybody [8].
The problem with this solution is that the user needs to set up the VPN connection after connecting to the public Wi-Fi, which may cause some data leakage or data monitoring.
2) OpenVPN with a firewall
This runs as the previous setup but with the addition of a firewall to increase the level of security. The software-configured firewall allows only VPN traffic between the two ends of the VPN tunnel. Hence, only traffic destined to the VPN provider that is needed for setting up the VPN tunnel is allowed, and all other types of traffic are prohibited [9].
This is a more effective method than the previous one, but the firewall needs to be configured each time a new public untrusted Wi-Fi is connected to. It also requires technical knowledge, so not everyone can use it. "Fig. 2" illustrates the layout of this solution.
Fig. 2. OpenVPN with a firewall

B. EncryptMe
EncryptMe is an application that offers encryption; once an internet connection through a public Wi-Fi is established, the application sets up a connection to the server. An advantage of this application is that it can be installed on multiple platforms (Windows, iOS, Android and Amazon operating systems) and can be used for more than one device per account. This is a good solution, but you still need to connect to the Wi-Fi first and then establish the VPN tunnel, which is the same problem as using OpenVPN without a firewall [10].

C. Hotspot 2.0
In 2012, the Wi-Fi Alliance launched this new technology with the goal of making hotspots smarter by introducing a new system that enables users to roam easily between access points. In this technology, a user creates an account in "Passpoint" and authenticates himself by logging in one time. The user then stays connected to the internet while moving around the city, as long as there are hotspots from the same Passpoint operator. In this way, users do not have to check in to any public Wi-Fi, which reduces the risk of connecting to hacked networks [11].
The problem with this solution is that it is a new technology and has not been implemented widely. In addition, it would be very expensive to invest in this solution's infrastructure.

D. Comparison of Solutions
The solutions mentioned earlier have advantages and disadvantages. A comparison was conducted between them and is illustrated in "Table I".

TABLE I. EXISTING SOLUTIONS COMPARISON
# | Solution | Ease of use | OS support | Compatibility with other Wi-Fi solutions | Independent solution
1 | OpenVPN without firewall | X | √ | √ | X
2 | OpenVPN with firewall | X | √ | √ | X
3 | EncryptMe | √ | √ | √ | X
4 | Hotspot 2.0 | √ | √ | X | X

As can be seen from the table, all solutions can work on all platforms, and none of them is an independent solution, as they all require the installation of different software on the user's devices. Solutions "1" and "2" require technical skills in order to conduct the needed configuration during setup. Solution "4" is also easy to use but has the problem of requiring modification on the current user's devices. Solution "3", which seems the best of these four solutions as it is friendly to use, supports all major platforms and is compatible with other Wi-Fi solutions, still has the disadvantage that the user's device needs to be connected to the Internet before establishing the
VPN tunnel, which exposes the device to the hacker's monitoring or attacks.

E. Research gap
Any solution other than the previous ones should work on all platforms and all Wi-Fi networks regardless of band or mechanism, must be easy to use by a person with no technical experience and, most importantly, must not require configuration on the user's different devices.
To the best of our knowledge, there is no solution available that fills all the gaps that we are trying to fill.

III. THE PROPOSED SOLUTION
Our main idea is to design and implement an intermediate device that can operate between an unsecured Wi-Fi and the end user's personal devices. It can be used anywhere a wireless connection is available, regardless of its security or identity.
"Fig. 3" illustrates how the connection is established from the user's device to the office server via a public Wi-Fi. It also shows how a hacker can easily monitor traffic or even interfere with the transmission, as Wi-Fi is considered vulnerable to anyone who has the Pre-Shared Key (PSK) [12].
Fig. 3. Traditional User Connection
Our solution introduces an intermediate device between the user's device and the open public Wi-Fi. It is provided with two Wi-Fi adaptors to connect to each side of the network, providing a physical layer of segregation between them. It is equipped with an OpenVPN software client to provide the security capability needed while rerouting the traffic from one end to the other. "Fig. 4" shows how the device is placed in the previous setup and how it provides resistance to hacking attacks. The Raspberry Pi creates a VPN tunnel connection between the user's devices and the office servers, blocking the hacker from spying on the traffic.
Fig. 4. Proposed Solution to Connect Using Open WiFi

IV. SOLUTION COMPONENTS
Our solution consists of the following components:
A. Raspberry Pi
The core of our solution is a small-sized computer developed by the Raspberry Pi Foundation in the United Kingdom that was originally created to empower the teaching of computer science in developing countries [13]. It has a small ARM processor that can perform just like any other processor used in a laptop or a personal computer [14].
B. Touch Screen 3.5" and protective case
A case is needed to hold all parts together, and a 3.5" touchscreen (LCD resolution 480x320) is also part of the solution.
C. External Wi-Fi Adaptor
A second wireless adaptor is used in our solution. The embedded wireless adaptor which comes with the Raspberry Pi is used to connect the Raspberry Pi with the user's device to establish a network that we call "SecureNet". The additional wireless adaptor is plugged into the USB port and is used to connect the Raspberry Pi with the open Wi-Fi network. The reason why we use the embedded adaptor on the user's side is to make sure that this connection is set up first and the VPN tunnel is created before connecting through the second adaptor, so that a hacker on the open Wi-Fi network would never be able to access any type of unencrypted data from the user's side.
D. OpenVPN
OpenVPN is a VPN implementation that is open-source software under the General Public License (GPL), which is used to establish a VPN tunnel between two endpoints. This tunnel is encrypted using the SSL/TLS protocol [15].
E. Hostapd and UDHCP
Hostapd is a daemon program that runs in the background of any Linux-based OS and provides access point capabilities and an authentication service. It is used to implement IEEE 802.11 access point management along with IEEE 802.1X/WPA/WPA2/EAP authenticators, RADIUS (client and authentication server) and an EAP server [16]. Since the used version of the Raspberry Pi has an embedded Wi-Fi adaptor, no additional hardware driver is needed, and the software can be installed with the following command:
• sudo apt-get install hostapd udhcpd
This command will also activate the DHCP server feature on the device. The DHCP server is essential in case we need to assign multiple IP addresses to more than one user. However, it is much better to use static IP addresses in an operational system.
1) Configure the DHCP server [17]:
• start 192.168.1.2 #the range of IPs
• end 192.168.1.120
• interface wlan0 #The UDHCP device
• remaining yes
• opt dns 8.8.8.8 4.2.2.2 #The DNS servers
• opt subnet 255.255.255.0
• opt router 192.168.1.1 #The Pi's IP addr
• opt lease 864000 #10 days DHCP lease time
• DHCPD_ENABLED="YES" #enable DHCP
2) Configure a static IP address
• sudo ifconfig wlan0 192.168.1.1 # assign the static IP to the embedded wireless adaptor wlan0
3) Configure Hostapd and the SSID:
The next step is to configure Hostapd and to assign the SSID for the user-side Wi-Fi network "SecureNet", along with the security features needed to secure the connection between the user's devices and wlan0.
• interface=wlan0
• driver=nl80211
• ssid=SecureNet
• hw_mode=g
• channel=6
• macaddr_acl=0
• auth_algs=1
• ignore_broadcast_ssid=0
• wpa=2
• wpa_passphrase=****** # a password for wlan0
• wpa_key_mgmt=WPA-PSK
• #wpa_pairwise=TKIP # better not to use this weak encryption (only needed by old client devices)
• rsn_pairwise=CCMP
4) Configure iptables:
The final step is to insert "iptables" rules to allow NAT, using the following:
a) Enable IP forwarding in the kernel:
• sudo sh -c "echo 1 > /proc/sys/net/ipv4/ip_forward"
b) Enable NAT in the kernel:
• sudo iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
• sudo iptables -A FORWARD -i eth0 -o wlan0 -m state --state RELATED,ESTABLISHED -j ACCEPT
• sudo iptables -A FORWARD -i wlan0 -o eth0 -j ACCEPT
c) Make these changes permanent:
• sudo sh -c "iptables-save > /etc/iptables.ipv4.nat"

F. The Second Wi-Fi Adaptor
The second Wi-Fi adaptor is used to establish the connection from the Raspberry Pi to the open Wi-Fi network. This connection is configured as "wlan1" on the Raspberry Pi. Its configuration is the same as the previous adaptor, except that it has to be a DHCP connection in order to be able to acquire its IP address from the public Wi-Fi.

G. The VPN Server
The VPN server is established using OpenVPN as follows:
1) Install OpenVPN [19]:
• sudo apt-get install openvpn
• cp -r /usr/share/doc/openvpn/examples/easy-rsa/2.0 /etc/openvpn/easy-rsa
2) Configure the RSA file:
• nano /etc/openvpn/easy-rsa/vars
• export EASY_RSA="/etc/openvpn/easy-rsa"
• export KEY_SIZE=2048
3) Create the OpenVPN client file.
4) Activate the port forwarding feature on the main router.
The easiest way to do this is to create a file associated with the IP address of the client, mainly because we are using static IPs here. If more than one device is going to connect as an OpenVPN client, then we need to create a client file for each one of them and change the static IP for each correspondingly. By finishing these steps we conclude the implementation of our device. "Fig. 5" depicts the layout of the network and shows where the tunnel is created. Some additional steps were taken to improve the security of our device, such as operating only in WPA2 mode, reducing the signal strength, hiding the SSID and enabling MAC-address filtering.
Fig. 5. Network layout with the solution and components

V. SOLUTION TESTING
To prove that our device has improved the security of VPN-based connections, we tested it in two ways. A vulnerability scan using Nessus was conducted to see whether the solution has increased or reduced the number and type of vulnerabilities found by the scanner [20]. Then we used Wireshark to see whether our solution has helped with the encryption of the data (any data) from the beginning.

A. Nessus Vulnerability Scan
Nessus is a vulnerability assessment tool that scans the network for open ports, services and programs. Its aim is to find the weaknesses and flaws that can be exploited. The scan was conducted in three stages, as illustrated in the following sections.
1) Access Point Vulnerability Scan [Stage (1)]:
The first stage is to scan the access point itself to see what vulnerabilities exist between the user device and the access point. For this experiment a 4G router (Huawei E5377) was used as the access point and a regular HP laptop as the user's device. "Table II" shows the scan results, which are divided into five classifications: Critical, High, Medium, Low, and Info. Info is the lowest rating and Critical is the highest and most dangerous, which needs to be fixed immediately.

TABLE II. NESSUS SCAN RESULTS
Stage 1: Critical 0, High 0, Medium 4, Low 2, Info 19
Stage 2: Critical 0, High 0, Medium 1, Low 1, Info 15
Stage 3: Critical 0, High 0, Medium 1, Low 1, Info 14

2) Access Point + Raspberry Pi without VPN Connection Vulnerability Scan [Stage (2)]:
The second stage is to run the vulnerability scan on the same laptop, but this time the laptop is connected to our solution device. The VPN is turned off so that, in this case, we can find
the vulnerabilities of the device itself. "Table II" also shows the number of vulnerabilities found after performing the stage 2 vulnerability scan. By using only the device, the number of medium vulnerabilities was lowered from four to one, while the low vulnerabilities became only one. The number of Info vulnerabilities decreased from 19 to 15.
3) Access Point + Raspberry Pi with the VPN Connection Vulnerability Scan [Stage (3)]:
The third stage is to run the vulnerability scan after turning on the VPN tunnel. This stage enabled us to find vulnerabilities from the secure connection towards the internet. "Table II" shows the number of vulnerabilities found in the secure part of the network. We can see that running the VPN has managed to reduce only the number of Info vulnerabilities.
4) Analysis of the vulnerabilities found in the three stages:
A comparison of the three stages' vulnerability reports was conducted, and "Annex 1" depicts these vulnerabilities. It is clear that the device has reduced the number of medium-class vulnerabilities to only one (50686 - IP Forwarding Enabled), which is vital for the laptop being used to execute the functions of this experiment. There was also one low vulnerability found on the device (10663 - DHCP Server Detection), which can be neglected since the only reason for having a DHCP server is the purpose of the demo in this project. Once the demo is completed, the DHCP server will be removed and the system will work only on static IPs.
As for the Info vulnerabilities, these are informational findings that have no risk or impact on the security of the project. The number of shared Info vulnerabilities was reduced from 15 to 9, and six new Info-class vulnerabilities were found in stages 2 and 3. These new vulnerabilities appeared due to the configuration and installation on the Raspberry device needed to initiate the SSH server, which is vital to the system and cannot be avoided. The level of these vulnerabilities is Info and they are not considered risky.

B. Using Wireshark
We used Wireshark to sniff traffic from the network and watch packets being transmitted or received. Again, the experiment was executed in three stages, just like the vulnerability scan.
In stages one and two, Wireshark was able to monitor the traffic transmitted and received beyond the access point. In the second stage the device being used to surf the web was connected to the Raspberry device, but without the VPN tunnel initiated. It can be seen that, once the VPN tunnel is initiated, Wireshark could not see anything being transmitted. On the receiving part, it could only see the IP address of the VPN provider, but not the data itself, as in "Fig. 6". This proves the effectiveness of our solution.
Fig. 6. Wireshark Screenshot

VI. CONCLUSION
We created an intermediary device that can be used to help users connect to the Internet securely over open Wi-Fi networks. Our experiments showed that the device achieves good results, improving security and filling the gap of no protection before the establishment of the VPN tunnel. The device is easy to use and affordable.

REFERENCES
[1] Forrester, "Maximize productivity and security with mobile workspaces," [Online]. Available: https://www.citrix.com/content/dam/citrix/en_us/documents/oth/maximize-productivity-and-security-with-mobile-workspaces.pdf.
[2] O. Elkeelany, M. M. Matalgah and J. Qaddour, "Remote access virtual private network architecture for high-speed wireless internet users," Wireless Communications and Mobile Computing, vol. 1, no. 4, p. 567, 2004.
[3] IBM, IBM Security Virtual Private Network V7.2, Rochester: IBM i, 2013.
[4] S. Shetty, M. Song and L. Ma, "Rogue Access Point Detection by Analyzing Network Traffic Characteristics," 1 June 2007. [Online]. Available: https://pdfs.semanticscholar.org/384b/54dd72c7f7418d77d70b987d2cfa2c1da4c5.pdf. [Accessed 1 January 2018].
[5] Z. Tang, Y. Zhao, L. Yang, S. Qi and D. Fang, "Exploiting Wireless Received Signal Strength Indicators to Detect Evil-Twin Attacks in Smart Homes," Mobile Information Systems, vol. 2017, Article ID 1248578, pp. 1-14, 2017.
[6] P. S. Ambavkar, P. U. Patil and P. K. Swamy, "Exploitation of WPA Authentication," IOSR Journal of Engineering, vol. 2, no. 2, pp. 320-324, 2012.
[7] "Best VPN," [Online]. Available: https://www.bestvpn.com/vpn-encryption-the-complete-guide/. [Accessed 23 December 2017].
[8] C. Rubin, "Is public Wi-Fi safe?," Entrepreneur, vol. 44, no. 11, p. 56, 2016.
[9] "Restricting uTorrent to VPN interfaces," Ipredator, [Online]. Available: https://blog.ipredator.se/howto/restricting-utorrent-to-vpn-interfaces-part-1.html. [Accessed 1 January 2018].
[10] EncryptMe, [Online]. Available: https://encrypt.me/. [Accessed 1 January 2018].
[11] Wi-Fi Alliance, "Wi-Fi Certified Passpoint," [Online]. Available: https://www.wi-fi.org/discover-wi-fi/wi-fi-certified-passpoint. [Accessed 31 December 2017].
[12] C. Hoffmann, "How-To Geek," 08 Dec 2014. [Online]. Available: https://www.howtogeek.com/204335/warning-encrypted-wpa2-wi-fi-networks-are-still-vulnerable-to-snooping/.
[13] "Raspberry Pi," Wikipedia, [Online]. Available: https://en.wikipedia.org/wiki/Raspberry_Pi. [Accessed 22 December 2017].
[14] "Raspberry Pi," [Online]. Available: RaspberryPi.org. [Accessed 02 January 2018].
[15] A. Skendzic and B. Kovacic, "Open source system OpenVPN in a function of Virtual Private Network," in IOP Conference Series: Materials Science and Engineering, Belgrade, 2017.
[16] "hostapd," [Online]. Available: https://w1.fi/hostapd/. [Accessed 22 December 2017].
[17] "RPI Wireless Hotspot," eLinux, [Online]. Available: https://elinux.org/RPI-Wireless-Hotspot. [Accessed 22 December 2017].
[18] A. Skendzic and B. Kovacic, "Open source system OpenVPN in a function of Virtual Private Network," in IOP Conference Series: Materials Science and Engineering, Belgrade, 2017.
[19] Raspberry Pi Forums, [Online]. Available: https://www.raspberrypi.org/forums/viewtopic.php?t=81657.
[20] L. Harrison, R. Spahn, M. Iannacone, E. Downing and J. R. Goodall, "NV: Nessus Vulnerability Visualization for the Web," in VizSec '12: Proceedings of the Ninth International Symposium on Visualization for Cyber Security, Seattle, Washington, USA, 2012.
Annex 1: A Detailed Comparison Between The Three Stages of The Vulnerability Reports

A Proactive Design to Detect Denial of Service
Attacks Using SNMP-MIB ICMP Variables
Yousef Khaled Shaheen, Department of Computer Science, Princess Sumaya University for Technology, Amman, Jordan, yousefpsut@icloud.com
Dr. Mohammad Al Kasassbeh, Department of Computer Science, Princess Sumaya University for Technology, Amman, Jordan, m.alkasassbeh@psut.edu.jo

Abstract— One of the cyber-attacks that most attracts cybercriminals is the Denial of Service (DOS) attack. A DOS attack aims to degrade the performance of network appliances and prevent them from doing their intended functions. Moreover, DOS attacks can cause huge damage to data confidentiality, integrity and availability. This paper introduces a system that inspects network traffic and distinguishes DOS attacks from normal traffic based on an adopted dataset. The results show that the adopted algorithms with the ICMP variables achieved a high accuracy of approximately 99.6% in detecting the ICMP Echo attack, the HTTP flood attack, and the Slowloris attack. Moreover, the designed model succeeded with a rate of 100% in distinguishing normal traffic from various DOS attacks.

Keywords—Cyber-attacks, availability, DOS attack, ICMP variables, Meta, Lazy IBK, Bayes, RJ48, Rule Tree.

I. INTRODUCTION
The wide use of the internet and the rapid growth of communication and computer networks increase cybercriminals' activities in attacking these networks and causing catastrophic damage to them. Network security attacks vary based on their effect on the network and the financial losses that they may cost the organization. The DOS attack is listed as one of the easiest attacks to launch, with a huge impact on network assets, costing organizations heavy losses. Much exhaustive research has been done on the financial losses that a DOS attack can cause. The Ponemon Institute reported that the average loss across 641 surveyed individuals was approximately $1.5 million over the year 2015, divided into five categories (Revenue Losses, Technical Support Costs, Operations Disruption, Lost User Productivity, and Damage to Information Technology Assets) [1]. Thus, many organizations aim to protect their networks from the attacks that can cost them heavy losses by using different network security services. One of the commonly used security services is the Intrusion Detection System (IDS), which is a security model designed to detect abnormal and malicious traffic in real time or close to it. An IDS is an effective security service against DOS attacks. The idea behind a DOS attack is to prevent a system from doing its intended functions and to prevent the authorized users from accessing the system resources by injecting a flood of data towards a specific target system. DOS attacks can be launched using two main techniques: either by exploiting vulnerabilities in network servers, appliances and protocols, or by exploiting a huge number of spoofed source addresses. This paper introduces a new model to detect various DOS attacks by using a set of ICMP variables and an adopted dataset of these attacks. A set of algorithms such as Meta, Lazy IBK, Bayes, RJ48 and Rule-Based were adopted to find which one of these algorithms is the most effective in detecting network anomalies.

This paper is organized as follows: Section II provides several related works in the area of using machine learning in detecting network anomalies, while DOS attacks and the SNMP-MIB dataset are illustrated in Section III. The proposed model that is used in this contribution is discussed in Section IV. Section V discusses the experimental results of the adopted methodology. Finally, the conclusion of the provided model and future work are discussed in Section VI.

II. RELATED WORK
Most of the current research focuses on detecting different network attacks using machine learning techniques. Many of these techniques have been introduced, tested, and evaluated. One of the most used techniques in detecting and analyzing network anomalies is SNMP-MIB data.
Al-Kasassbeh et al. [2] generated effective datasets that solved the limited resources in previous datasets. The authors adopted a reliable SNMP-MIB dataset to investigate SNMP for network attack and anomaly detection. The authors collected SNMP-MIB data based on a set of brute-force and DOS attacks. The collected dataset is a reliable published dataset and it consists of 4998 records, where each record maps to 34 MIB variables. The MIB groups are categorized as follows: TCP, UDP, IP, ICMP and Interface.
Al-Kasassbeh et al. [3] adopted a reliable method for detecting network attacks and anomalies based on an SNMP-MIB dataset using machine learning techniques. They proved that SNMP-MIB is an effective technique for detecting a large set of various DOS attacks using three algorithm categories: Random Forest, AdaboostM1, and MLP. The mentioned algorithms were applied to several MIB groups (TCP, UDP, IP, ICMP, and Interface). The classification algorithms achieved varied accuracy based on the group. The Random Forest algorithm achieved a high accuracy when applied to the IP group, with a rate of 100%, and 99.93% when applied to the Interface group.
Al-Kasassbeh [4] proposed a new hybrid approach to capture and detect malicious traffic based on the collected dataset, which is applied as an input to a Neural Network in order to predict the behaviour of the input data. The proposed model achieved a high accuracy, with a rate of
98.3% in capturing and detecting malicious traffic with a minimal false-negative rate.
Sharma et al. [5] proved that volume-based analysis cannot detect all types of network anomalies. The authors used services such as the Simple Network Management Protocol (SNMP), the Network Time Protocol (NTP), and the Domain Name System (DNS) to analyse network anomalies. The NfDump tool was used to collect and capture the network packets.
Niyaz et al. [6] proposed a new scheme using a deep learning approach in order to improve the efficiency of a network intrusion detection system. The authors evaluated network anomaly detection based on the NSL-KDD dataset and achieved a high accuracy rate with minimal false alarm rates.
Suganya [7] adopted a new hybrid approach by combining two methods (misuse-based detection and network anomaly detection) to allow the built system to detect malicious traffic and attacks without needing any previous knowledge of this traffic. The idea behind this approach is to characterize and differentiate the normal and malicious traffic; the normal traffic is then put through the anomaly detection process. The authors achieved an efficient module for detecting DoS attacks, and it was found that the hybrid module is faster in detecting network attacks than the standalone methods (misuse and network anomaly detection). Moreover, the hybrid approach achieved a low false-positive rate with high reliability.
Namvarasl and Ahmadzadeh [8] adopted a new intrusion detection approach based on two main components, the Simple Network Management Protocol and machine learning. The proposed module is designed to detect DOS and DDOS attacks in real time or close to it. The authors designed their model based on three sub-modules, starting by collecting the MIB variables using a set of classifiers (C4.5, feature selection, and Ripper). The intrusion detection system was then built on the chosen variables to detect DoS attacks, where a dataset of 66 variables is mapped to 4 MIB groups (TCP, UDP, IP, and ICMP).
The previous modules are based on gathering their own datasets in order to come up with a scheme that detects network anomalies with the same concept as an intrusion detection system. In this paper, 34 MIB variables were chosen and mapped to the ICMP group in order to evaluate the accuracy of the proposed model.

III. DENIAL OF SERVICE ATTACKS AND SNMP-MIB DATASET
A. DOS Attacks
DOS attacks are among the attacks that most attract intruders, since DOS is a form of attack on service availability. NIST defines a DOS attack as "a set of actions that compromise networks and their resources, preventing the authorized users from doing their intended functions".
DOS attacks compromise many network resources; these resources can be categorized as below:
• Network Bandwidth: network bandwidth relates to the channel capacity between the network appliances and the server, and the capacity of the link that connects a server to the global internet.
• System Resources: a DOS attack can also target the system resources by overloading the server and crashing the handled service.
• Application Resources: in this category, the DOS attack targets a specific application, such as a web server. This attack involves sending a flood of valid requests which consume the resources of a specific application.
This paper considers seven types of DOS attacks, based on the source that generates the flood traffic, as follows.
The SYN spoofing attack is a basic flooding attack that targets the network server which is responsible for responding to TCP connection requests from network hosts. This attack aims to flood the server tables that manage and establish the connections between the server and any host in the network, which leads to denying any future request from legitimate users and prohibiting them from accessing the server.
The SYN spoofing attack depends mainly on the concept of three-way handshaking, where the client starts establishing the connection by sending a TCP SYN packet to the network server, which responds with a SYN ACK packet towards the client, which in turn replies with an ACK packet. The idea behind the SYN spoofing attack is to exploit the victim server's behaviour by generating a flood of SYN packets with counterfeit source IP addresses. The attack consists of embedding a forged IP in the SYN packet instead of the legitimate IP, which leaves the server responding with SYN ACK packets to some other client in the internet cloud and reserving space for the ACK packet in return; this leads the client to keep sending TCP SYN packets waiting for a SYN ACK, and also leads the server to keep replying with SYN ACK packets to another client.
The UDP flooding attack consists of sending a flood of UDP packets to a specific port number of the victim server or system, which takes down the specific port that handles a specific service.
The fourth type of DOS attack consists of HTTP-based attacks, which are divided into two main categories: the Slowloris attack and the HTTP flood attack.
The Slowloris attack engages by setting up several connections to a web server, where in each established connection an incomplete request is embedded that does not include the terminating newline sequence; meanwhile, the attacker keeps sending continuous header lines to keep the connection alive. After keeping the connections alive, the victim web server keeps its connections open waiting for the information that would complete the launched requests, which eventually leaves the web server with all its available channels consumed.
The HTTP flood attack overwhelms the web server with HTTP requests established from several bots. This attack aims to consume the whole web server's resources and take it out of service. One of the common examples of this attack is exhausting the memory and consuming its capacity by overwhelming it with many tasks.
The ICMP Echo attack depends mainly on the ping flood using echo request packets. The ease of using this attack, and the reason it is considered a traditional attack, is that the ICMP protocol is useful for network diagnostics, which leads most network admins to control and restrict this protocol using different network security appliances such as intrusion detection systems, intrusion prevention systems or firewalls. However, this protocol is also critical to some networks, such as TCP/IP networks. Intruders in this attack generate a huge volume of ICMP packets towards the victim server, which utilizes the link bandwidth and makes other users face difficulties while reaching the victim server.
Table 1 classifies the dataset records according to the related attacks.

TABLE I. DATASET RECORDS ACCORDING TO RELATED ATTACKS
No. 1: Normal, 600 records
No. 2: ICMP-Echo Attack, 632 records
No. 3: TCP-SYN Attack, 960 records
No. 4: UDP Flood Attack, 773 records
No. 5: HTTP Flood Attack, 573 records
No. 6: Slowloris Attack, 780 records
No. 7: Slowpost Attack, 480 records
No. 8: Brute Force Attack, 200 records

B. Simple Network Management Protocol (SNMP)
SNMP is an application layer protocol that allows the user to monitor, analyse and manage network traffic. The SNMP protocol is divided into three versions that vary in features: SNMPv1 and SNMPv2 are known as SNMP community versions, whereas SNMPv3 is known as SNMP security; the only difference between these versions is that SNMPv3 is designed with advanced security features. Fig. 1 illustrates the network management architecture.
Fig. 1. Network Management Architecture [9]
Fig. 1 shows that the SNMP network model is divided into two main subsystems: the SNMP Manager and the SNMP Agent. The SNMP Manager is a personal computer that is designed and configured to pull the data from the SNMP Agent. The SNMP Manager is designed to provide a solution for a set of management categories such as fault monitoring, performance monitoring, configuration control and security control. The managed information is organized in the "Management Information Base". The SNMP Agent is embedded on the required device, where it responds to and exchanges the requests and actions of the SNMP Manager using the SNMP protocol.

IV. PROPOSED MODEL
This part is divided into three sections, starting with a brief description of the used dataset. The second section provides a full explanation of the machine learning classifiers used to classify the dataset and decide whether traffic is normal or an attack. The last part provides a summary of the feature selection techniques that are used in the module to evaluate the efficiency of applying these features to the ICMP variables.
A. SNMP-MIB Data
In this research paper, the SNMP-MIB dataset of (Al-Kasassbeh et al., 2016) was used for testing and implementing this paper's approach. The dataset was built from almost 5000 records related to six main types of attacks (ICMP Echo, TCP-SYN, UDP flood, HTTP flood, Slowpost, and Slowloris). The attacks were detected using a set of variables included in the dataset. The traffic prediction is based on the ICMP group. Most network traffic deals with the ICMP protocol to ensure the best packet delivery by comparing the number of sent and received packets. Six MIB variables were selected for this group, as follows:
• The icmpOutMsgs (iOM) variable indicates the total count of attempted ICMP message transmissions.
• The icmpInMsgs (iIM) variable indicates the total number of ICMP messages received.
• The icmpOutDestUnreachs (iOU) variable indicates the total number of ICMP destination-unreachable messages sent.
• The icmpInDestUnreachs (iIU) variable indicates the total count of ICMP destination-unreachable messages received.
• The icmpInEchos (iIE) variable indicates the total number of ICMP echo request packets received.
• The icmpOutEchos (iOE) variable indicates the total number of ICMP echo reply packets sent.
B. Machine Learning Classifiers
The idea behind using classifiers in a network anomaly
detection system is to analyze and classify the corresponding
Fig.1. Network Management Architecture [9] traffic. In this paper, five classifiers were applied on the
adopted dataset as follows.
Fig.1. shows that the SNMP network model is divided
into two main subsystems; the SNMP Manager and the  Meta Bagging classifier was presented by Efron
SNMP Agent. The SNMP Manager is a personal computer Tibshirani. Bagging is a Meta bootstrap algorithm
that is designed and configured to pull the data from SNMP that trains every single classifier randomly of the
Agent. SNMP Manager is designed to provide a solution for original dataset to generate and form a final
a set of faults and categories such as, fault monitoring, prediction. The bagging classifier is divided into two
performance monitoring, configuration control and security categories based on the dataset subset; if the dataset
control. subsets are drawn randomly, then it called pasting.
While if the dataset subsets are drawn with
SNMP Agent plays the main role in network replacement, then it called bagging.
management model, by collecting the required data from the
network and stores them in a database called the

64
• The lazy classifier is known as an algorithm or system that trains on and generalizes the records in the dataset only after the system receives queries. The lazy IBK classifier is applied on the adopted dataset, since it has proved its efficiency when applied to large datasets with various attributes.
• The J48 classifier is an implementation of the decision tree classifier family that is also called C4.5. The J48 algorithm was introduced and developed by Ross Quinlan. Attribute selection is done by top-down induction of decision trees, which then uses key concepts of information theory in order to select the best attribute.
• The rule-based classifier is one of the most commonly used algorithms in artificial intelligence, due to its highly accurate results. The role of this classifier is to use a set of rules in order to generate several choices. Rule-based classifiers have two characteristics: the mutually exclusive rule, where each record in the dataset is covered by at most one rule, and the exhaustive rule, where each record in the dataset is covered by at least one rule.
• The Bayes classifier, also known as Naïve Bayes, was developed by Thomas Bayes. The role of this classifier is conditional probability, which is the probability of something happening given that something else has already occurred. The Bayes classifier computes the probabilities for each class of the dataset, where the class with the highest probability is the predicted class (a rough scikit-learn analogue of these classifiers is sketched below).
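The classifiers above were run in WEKA (version 3.8, as noted in Section V). The sketch below is only a rough illustration of a comparable setup in scikit-learn, not the paper's implementation: BaggingClassifier, KNeighborsClassifier, DecisionTreeClassifier and GaussianNB stand in approximately for Meta Bagging, lazy IBk, J48 and Naïve Bayes (no close scikit-learn analogue of the rule-based classifier is shown). The CSV file and column names are hypothetical.

```python
# Illustrative only: the paper's experiments used WEKA 3.8; these are rough
# scikit-learn analogues of the classifiers described above.
import pandas as pd
from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier   # rough analogue of lazy IBk
from sklearn.tree import DecisionTreeClassifier      # rough analogue of J48/C4.5
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

# Hypothetical file and column names; the six ICMP MIB variables of Section IV-A.
ICMP_VARS = ["iOM", "iIM", "iOU", "iIU", "iIE", "iOE"]
data = pd.read_csv("snmp_mib_dataset.csv")            # assumed CSV export of the dataset
X, y = data[ICMP_VARS], data["traffic_label"]

classifiers = {
    "Meta-Bagging": BaggingClassifier(n_estimators=10),
    "Lazy-IBK": KNeighborsClassifier(n_neighbors=1),
    "J48-like tree": DecisionTreeClassifier(),
    "Naive Bayes": GaussianNB(),
}
for name, clf in classifiers.items():
    # 10-fold cross-validation and the weighted F-measure are used here purely
    # for illustration; the paper's own results are those reported in Section V.
    scores = cross_val_score(clf, X, y, cv=10, scoring="f1_weighted")
    print(f"{name}: weighted F-measure = {scores.mean():.3f}")
```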
C. Attribute Selection
Attribute selection was adopted to improve the proposed model by reducing the number of factors in the dataset that the designed model needs for the testing and learning stages. Thus, attribute selection neglects the irrelevant fields while still providing accurate results.
Attribute selection techniques fall into three main categories, as follows.
• The filter technique operates by selecting features based on their scores in various statistical tests with respect to the outcome variables. The strength of this technique is obvious when applying it to large datasets.
• The wrapper technique searches through the feature space and uses the learning algorithm itself to find the best attribute set. The search of the wrapper technique can proceed in several directions (forward, backward, or bidirectional). The strength of the wrapper technique lies in its efficient results, at the cost of higher complexity, as the algorithm participates in the selection process.
• The hybrid approach combines both the filter and the wrapper techniques, which results in a more complex feature selection technique.
The filter and the wrapper techniques were used in order to compare the accuracy of the generated results: for the filter technique two methods were selected, InfoGain and ReliefF, and for the wrapper technique the correlation-based method was selected.
InfoGain, ReliefF and correlation-based are attribute evaluators that are available in the WEKA machine learning tool. InfoGain finds the most useful attribute for discriminating between the various classes. Moreover, InfoGain determines the best split to be chosen; the more accurate split is the one that has the higher value.
The ReliefF attribute evaluator is an effective method of attribute ranking. The more important attribute is determined by the algorithm output: a more positive number means a more important attribute, where the output is a number that varies between -1 and 1. The attribute weight is continuously updated through the process. Three samples are selected and recognized respectively: a selected sample from the dataset, the closest neighbouring sample that belongs to the same class, and the closest neighbouring sample in a different class. The attribute weight is affected by any change in an attribute value that could also be responsible for a class change.
The correlation-based evaluator is based on finding the correlation between two related features by evaluating the correlation coefficient. An attribute can be redundant either by being derived from another set of attributes or by being related to some other attributes. Therefore, to be considered a good attribute, it should be highly correlated with the class attribute and not highly correlated with any other attributes. Table 2 shows the ICMP variable ranks when they were applied under the attribute selection factors.

TABLE II. ICMP VARIABLE RANKING UNDER THE ATTRIBUTE SELECTION FACTORS
Attribute Selection Factor       | Top 4 ICMP Variables Ranking | Top 3 ICMP Variables Ranking
ReliefF                          | iIU, iOM, iIE, iOE           | iIU, iOM, iIE
InfoGain and Correlation-Based   | iOU, iIE, iOE, iOM           | iOU, iIE, iOE

The search method in this paper was the Ranker search method, which ranked the attributes based on their evaluations from the highest importance to the lowest.
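The attribute evaluators above are WEKA components. As a loose analogue only, the following sketch ranks the six ICMP variables with scikit-learn's mutual information score (an information-gain-style measure) and with a crude absolute-correlation score; ReliefF has no built-in scikit-learn implementation and is omitted here. File and column names are again hypothetical.

```python
# Rough analogue of the WEKA attribute evaluators used in the paper:
# mutual information as an InfoGain-style score, and absolute Pearson
# correlation with the (factorized) class as a simple correlation-style score.
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

ICMP_VARS = ["iOM", "iIM", "iOU", "iIU", "iIE", "iOE"]   # hypothetical column names
data = pd.read_csv("snmp_mib_dataset.csv")
X, y = data[ICMP_VARS], data["traffic_label"]

info_gain = mutual_info_classif(X, y)                     # higher = more informative attribute
y_codes = pd.factorize(y)[0]                               # crude numeric encoding of the class
corr = [abs(np.corrcoef(X[v], y_codes)[0, 1]) for v in ICMP_VARS]

# "Ranker"-style output: attributes ordered from most to least important.
for name, scores in [("InfoGain-like", info_gain), ("Correlation-like", corr)]:
    ranked = sorted(zip(ICMP_VARS, scores), key=lambda p: p[1], reverse=True)
    print(name, [v for v, _ in ranked])
```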
D. Evaluation Metrics
The performance of the proposed model was measured using a set of well-known parameters such as accuracy, precision, and recall. The classifiers' performance was measured based on the confusion matrix shown in Table III.
TABLE III. CONFUSION MATRIX
               | Predicted Positive | Predicted Negative
Actual Positive | TP                | FN
Actual Negative | FP                | TN

The true positive (TP) rate indicates the rate of correct predictions of positive traffic instances. The false positive (FP) rate indicates the proportion of negative packets that are classified as positive. The true negative (TN) rate indicates the total number of negative traffic instances classified correctly as negative, whereas the false negative (FN) rate shows the total number of positive traffic instances classified incorrectly as negative.
The precision rate represents the ratio of the total correct predictions of positive traffic instances to the total count of relevant and irrelevant traffic instances. The recall rate represents the ratio of the correct predictions of positive traffic instances to the total count of relevant traffic instances.
Finally, the accuracy rate takes all confusion matrix parameters into its calculation to measure the correctly classified traffic instances. The precision, recall and accuracy formulas are shown below, respectively.

Precision = TP / (TP + FP)                                     (1)
Recall = TP / (TP + FN)                                        (2)
Accuracy = (TP + TN) / (TP + TN + FP + FN)                     (3)

The F-Measure metric was also measured in this research paper, where its calculation depends on the following formula.

F-Measure = 2 × (Precision × Recall) / (Precision + Recall)    (4)
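Formulas (1)–(4) follow directly from the confusion-matrix counts, as the small helper below illustrates; the example numbers are made up and are not the paper's results.

```python
# Direct implementation of formulas (1)-(4) from per-class confusion counts.
def evaluation_metrics(tp, fp, tn, fn):
    precision = tp / (tp + fp)                                   # formula (1)
    recall = tp / (tp + fn)                                      # formula (2)
    accuracy = (tp + tn) / (tp + tn + fp + fn)                   # formula (3)
    f_measure = 2 * precision * recall / (precision + recall)    # formula (4)
    return precision, recall, accuracy, f_measure

# Hypothetical counts, only to show the calculation:
print(evaluation_metrics(tp=580, fp=20, tn=3900, fn=52))
```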
V. EXPERIMENTAL RESULTS
The results and the performance were evaluated using the Weka machine learning tool, version 3.8, running on an Intel Core i5, 64-bit system with 4 GB of RAM and Windows 10.
The results of the proposed model depend on the MIB dataset mentioned earlier in Section IV. The classification techniques were applied to each group separately. In the end, the attribute selection methods were used to evaluate the accuracy of the proposed model by reducing the number of factors in the dataset. Table 4 shows the weighted accuracy rate for all classes based on the ICMP group and the selected classifiers.

TABLE IV. CLASSIFIERS ACCURACY FACTORS AVERAGE WEIGHT
Classifiers   | TP Rate | FP Rate | Precision | Recall | F-Measure
Bayes         | 0.864   | 0.014   | 0.935     | 0.864  | 0.879
Lazy-IBK      | 0.867   | 0.026   | 0.895     | 0.867  | 0.872
Meta-Bagging  | 0.871   | 0.029   | 0.906     | 0.871  | 0.874
Rules-Based   | 0.867   | 0.026   | 0.895     | 0.867  | 0.872
J48           | 0.868   | 0.026   | 0.896     | 0.868  | 0.872

Fig.2 illustrates the performance of the classifiers that were used in the proposed model in terms of F-Measure rates based on the ICMP variables.

Fig.2. F-Measure Results of All ICMP Group

From Fig.2 it was found that the F-Measure values of all classifiers are efficient for normal traffic, HTTP flood attack, and Slowloris attack. Moreover, the Meta Bagging classifier achieved a high performance in identifying the UDP flood attack.
Figs.3 and 4 illustrate the performance of the classifiers in terms of F-Measure based on the top 4 and top 3 ICMP variables that were selected using the attribute evaluators.

Fig.3. F-Measure Results with Top 4 ICMP Variables – ReliefF Evaluator

Fig.3 shows that the F-Measure results of all classifiers are efficient for normal traffic, ICMP Echo attack, HTTP flood attack, and Slowloris attack. On the other hand, the F-Measures of the classifiers were fragile in detecting the remaining types of DOS attack.
Fig.4. F-Measure Results with Top 3 ICMP Variables – ReliefF Evaluator

From Fig.4 it was found that all classifiers achieved a high F-Measure rate for normal traffic, HTTP flood attack, ICMP Echo attack, and Slowloris attack. However, all classifiers were not efficient in detecting the remaining types of attacks.
The InfoGain and correlation evaluator selectors achieved equivalent F-Measure values when applied to the top 3 and top 4 ICMP variables, as shown in Figs.5 and 6.

Fig.5. F-Measure Results with Top 4 ICMP Variables – InfoGain and Correlation Evaluators

Fig.5 shows that all classifiers achieved high F-Measure rates for normal traffic, HTTP flood attack, ICMP Echo attack, and Slowloris attack. Moreover, the Bayes classifier was the only classifier to achieve a high performance in identifying the Slowpost attack. However, all classifiers are not efficient in detecting the remaining types of DOS attack.

Fig.6. F-Measure Results with Top 3 ICMP Variables – InfoGain and Correlation Evaluators

Fig.6 shows that all classifiers achieved high F-Measure rates for normal traffic, HTTP flood attack, ICMP Echo attack, and Slowloris attack. However, the Bayes classifier failed in detecting the ICMP Echo attack.

VI. CONCLUSION
Data filtering has become essential to protect local and remote networks from the different types of attack that harm sensitive data and cost organizations heavy losses. Thus, many methods were introduced to detect network anomalies in order to keep the network running normally without any disturbance or data disruption. In this paper, it was found that the ICMP group with the adopted classifiers was not efficient in detecting all DOS attacks. Moreover, reducing the count of ICMP variables led to varied performance when detecting these attacks. However, the designed model achieved an efficient performance in detecting some attacks, such as the ICMP Echo attack, HTTP flood attack, and Slowloris attack.
For future work, an enhancement of the ICMP variables should be applied in order to increase their ability to detect all types of DOS attack.

ACKNOWLEDGMENT
I would like to thank Dr. Al-Kasassbeh for his consistent support. The door to his office was always open whenever any assistance was required to complete this research paper. He steered me in the right direction whenever any help was required.

VII. REFERENCES
[1] P. Institute, "2015 Cost of Data Breach Study: Global Analysis," Ponemon Institute, 2015.
[2] M. Alkasassbeh, G. Al-Naymat and E. Hawari, "Towards Generating Realistic SNMP-MIB Dataset for Network Anomaly Detection," International Journal of Computer Science and Information Security (IJCSIS), vol. 14, no. 9, pp. 1161-1185, 2016.
[3] M. Alkasassbeh, G. Al-Naymat and E. Hawari, "Using machine learning methods for detecting network anomalies within SNMP-MIB dataset," International Journal of Wireless and Mobile Computing, vol. 15, no. 1, pp. 67-76, 2018.
[4] M. Alkasassbeh, "A Novel Hybrid Method for Network Anomaly Detection Based on Traffic Prediction and Change Point Detection".
[5] R. Sharma, A. Guleria and R. K. Singla, "Characterizing Network Flows for Detecting DNS, NTP, and SNMP Anomalies," in Intelligent Computing and Information and Communication, 2018.
[6] Q. Niyaz, W. Sun, A. Y. Javaid and M. Alam, "A Deep Learning Approach for Network Intrusion Detection System".
[7] R. Suganya, "Denial-of-Service Attack Detection Using Anomaly with Misuse Based Method," IJCSNS International Journal of Computer Science and Network Security, vol. 16, no. 4, pp. 124-128, 2016.
[8] S. Namvarasl and M. Ahmadzadeh, "A Dynamic Flooding Attack Detection System Based on Different Classification Techniques and Using SNMP MIB Data," International Journal of Computer Networks and Communications Security, vol. 2, no. 9, pp. 279-284, 2014.
[9] "Cisco," Cisco, [Online]. Available: http://www.cisco.com. [Accessed 3 April 2019].
An Energy Aware Fuzzy Trust based Clustering
with group key Management in MANET
Multicasting
1st Dr. Gomathi Krishnasamy
Department of Computer Information Systems
Imam Abdulrahman Bin Faisal University
Dammam, Saudi Arabia
gkrishna@iau.edu.sa
Abstract— The group key maintenance in MANET is especially risky, because of repeated node movement, link breakdown and lower capacity resources. Member movement needs key refreshment to maintain privacy among members. To cope with these characteristics, a variety of clustering concepts are used to subdivide the network. To establish a considerably stable and trustable environment, fuzzy based trust clustering is taken into consideration with group key management. The nodes with the highest trust and energy are elected as Cluster Heads, and each forms a cluster in its range. The proposed work analyzes secure multicast transmission by implementing Polynomial-based key management in Fuzzy Trust based clustered networks (FTBCA) for secure multicast transmission that protects against both internal and external attackers, and measures the performance by injecting attack models.

Keywords—Group Key, Clustering, style, trust, Fuzzy based

I. INTRODUCTION
Due to the dynamic nature of the Mobile Ad-hoc Network (MANET), the presence of internal and external attackers is unavoidable. Many group key management systems focus on external attackers and deny the service to attackers. But the problematic part is internal attackers: they mask their originality and pretend to be legitimate members.
The group key maintenance in MANET is especially risky because of repeated node movement, link breakdown and lower capacity resources. Member movement needs key refreshment to maintain privacy among members. The key refreshment process is done either periodically or immediately for every membership change. To cope with these characteristics, a variety of clustering concepts are used to subdivide the network.
The joint effect of clustering and key management creates a secure environment among group members and maintains forward and backward secrecy.
The proven method of identifying internal attackers is trust based evaluation, and it provides classification based on binary values. In place of binary outputs, fuzzy based classification provides a more linguistic output.
Complex problems can be solved easily without human intervention by applying fuzzy logic concepts, and these can be implemented in anything from simple small systems (microcontrollers) to large networked systems. Fuzzy logic provides definite answers from ambiguous and imprecise inputs. A fuzzy variable produces accurate results using inaccurate data as input. The fuzzy rules are framed using simple IF-THEN-ELSE conditions, which are easy for users to apply. [1][2]
The ultimate trust of a node is a combination of the initial trust of the node estimated using a direct or indirect methodology, the energy level of the node, and packet integrity. One node in the network is nominated as Certificate Authority: the node having the highest final trust value. This node is authorized to issue trust certificates, and a certificate is valid only for a certain period of time; certificates are renewed whenever the time elapses. Consequently, data transactions do not include misbehaving nodes that have not been assigned a certificate. The fuzzy analyzer divides nodes into reliable and unreliable ones; in the meantime, the certificate authority notifies other nodes by producing alerts as soon as a malicious node demands a certificate. [3]
The fuzzy based analyzer is integrated with trust clustering to automate the process of classifying the nodes. The proposed Fuzzy Trust based Clustering with Group Key Management (FTBCGKM) enhances the system performance by eliminating malicious nodes in a short span of time. A kind of active attack, the node capture attack, is introduced to measure the performance of the proposed system. The attacker occupies a node and transmits unwanted data, which subsequently leads to wastage of valuable resources. [17] The proposed FTBCGKM proved its efficiency in handling attackers, as explained in the simulation study. The energy level of nodes is highly maintained in FTBCGKM compared to existing trust based group key systems.

II. LITERATURE SURVEY

A. Cluster Based Group Key Management
As in [4], the authors arranged nodes as Cluster Head (CH), core nodes and periphery. All core nodes have the responsibility to create and distribute the Group Key (GK). Trust values are used for electing the CH, and distrusted nodes are eliminated from the group. The core members generate the GK using a two-round agreement protocol (TRP). The rekeying is done not only by the CH; the core members also take part to reduce the workload.
As in [5], the authors have presented a secure key administration scheme for hierarchical groups; a rekeying key chain approach is used for the GK. A roaming protocol developed between host and home groups offers secure group communication without any new keys. However, the number of groups and the height of the hierarchical structure increase the communication overhead, and the by-product complexity may also be
increased. The protocol is suitable when the population stays the same and node mobility remains within its range.
As in [6], the authors developed a methodology to calculate the trust value of each node by combining direct and indirect trust values. The node with a higher trust value than its two-hop neighbours is selected as CH, and the node with the next higher value is elected as auxiliary CH. The malicious nodes are evacuated from the group by exchanging trust values. The key agreement protocol used here is GDH2 (Group Diffie-Hellman).
As in [7], the authors have presented group key generation for intra-group communication. The authors presented polynomial based key generation, which reduces the storage overhead in the group controller and members. The dominating node becomes CH, chosen by calculating the maximum trust ability and maximum future contact of the node. The intra-group keys are shared between the controller and members without any cryptographic technique. They demonstrated that the number of rekeyings is considerably reduced under membership movement. The broadcast traffic is also reduced with the help of polynomials.

B. Possible Attacks On Clustering Operation In Manet
Securing a MANET is an extremely intense issue, since the odds of having vulnerabilities are higher when compared with conventional wired systems. Because of the absence of a central authority and the dynamic topology of a MANET, a misbehaving node can become part of the network. The performance of the entire network is degraded due to the problems initiated by misbehaving nodes. [8]
The communication group is established on the spot, where there is certainly no phase to check a node's reliability. [15] One of the foremost crucial problems is that misbehaving nodes exist within and outside the network and create voluminous issues during information transmission, as well as exhausting treasured system resources. Unsurprisingly these result in reduced performance and lessen the lifetime of the network as well [9]. To increase the lifetime of the network, various possibilities of attacks are analyzed, but new sorts of attacks will be possible in the future; therefore analyzing and providing answers to those attacks is an ongoing process. [10]
Due to the dynamic topology, any number of nodes can enter and leave over time, and new nodes may not be consistent, so an apparently unfriendly environment is created. The cooperation of each node is expected; however, that is not the case in real-life communication [11].

III. PROPOSED WORK

A. Fuzzy Trust Based Clustering Algorithm (FTBCA)
To dismiss disobedient nodes from clustering, the honesty of a node is tested by fuzzy logic and trust centered schemes. Several of the clustering conceptions fall under the category of insecure clustering, not inspecting the trustworthiness of the node; the communication group might then comprise attackers who distract and destroy the treasured data and resources and mislead in the incorrect directions.
To deal with these circumstances, a reliable atmosphere is formed by following fuzzy trust based clustering and a hierarchical distributed key management system (FTBCGKM). Fig. 1 represents the flow diagram of FTBCGKM.

Fig. 1. Fuzzy Trust based Clustering and Group Key Management

After the initial setup, all nodes are assigned an ID and corresponding key values. Every node calculates its trust value based on the recent transactions with neighboring nodes. Along with the trust value, the energy level of the node is estimated, and these values are fed as input to the fuzzy based analyzer. With the help of the fuzzy rule database, the malicious nodes are removed from further communication.
The clusters are constructed only with trustable nodes, and the group key management process begins from this point. The intra-cluster and inter-cluster keys are generated for group communication. Whenever a membership change-over event occurs, fresh keys are generated to keep up forward and backward security.

B. Isolation Of Distrust Nodes Victimization By Fuzzy Logic Rules
• The trust values evaluated from (1) are taken as input values; these values are then applied
in the fuzzy table for the spontaneous classification of mobile nodes.

T(Na, Nb) = tanh( Σ_{t=1..n} (w_t × r_t) + E_a )    (1)

where
r_t = a recent transaction between the two nodes, with r_t = +1 when the transaction is positive and r_t = -1 when the transaction is negative;
n = the number of transactions between the two nodes;
w_t = the weight of transaction t;
E_a = the energy of node 'a'.
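The printed equation is partly garbled, so the following sketch encodes one plausible reading of Eq. (1) based on the definitions above: the hyperbolic tangent of the weighted sum of recent transaction outcomes plus the node's energy, which keeps the result within the -1 to +1 range used by Table I. The transaction list and energy value in the example are hypothetical.

```python
import math

def trust_value(transactions, energy):
    """Sketch of one reading of Eq. (1).
    transactions: list of (weight, outcome) pairs, outcome = +1 for a positive
    transaction and -1 for a negative one; energy: current energy of node 'a'."""
    weighted_sum = sum(w * r for w, r in transactions)   # weighted sum over n recent transactions
    return math.tanh(weighted_sum + energy)              # tanh bounds the trust in (-1, +1)

# Hypothetical example: three positive and one negative recent transactions.
print(trust_value([(0.3, +1), (0.3, +1), (0.2, +1), (0.2, -1)], energy=0.5))
```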
observations are used for Trust calculation.
In the experiment, four different groups of nodes are
branded specifically Totally Trusted, Trusted, Partly The indirect trust assessment may not be a true value,
trusted and Distrusted by using their trust value. The sometimes malicious node produces fake information during
fuzzy logic variables delimited using the trust value trust calculation. And also Energy is not considered for
ranges from -1 to +1. The subsequent fuzzy table TABLE Electing CH. So it increases the chances of having CH with
I with trust value fixes whether or not to contemplate the low energy.
node for clustering or detach node from network doings.
The advantage over proposed FTBCGKM considers only
The Totally trusted nodes are more suitable to become
direct observation for evaluating trust and energy level of the
CH than the normal trusted nodes.
mobile node is compared before electing CH. The following
simulation section shows comparison of proposed
TABLE I. FUZZY TABLE FTBCGKM with existing ECGK on the basis of increasing
number of nodes and increasing number of attackers.
Fuzzy Evaluated Nodes Category
Ranking Trust Value
A. Simulation Setup
The proposed model simulated using Network Simulator
Very High 0.9 to +1 Totally Trusted NS2 [14], the one hundred mobile nodes unfold with in the
space of 750 x 750m was simulated and shown in Fig. 1. The
High 0.8 to 0.75 Trusted mockup goes for two hundred sec. The complete simulation
factors and their standards are mentioned in TABLE II.
Medium 0.7 to 0.3 Partially Trusted

Low 0.2 to -1 Distrusted

Fuzzy Guidelines:

IF Trust value = HIGH THEN node is PERFECTLY


TRUSTED
IF Trust value = MEDIUM THEN node is PARTLY
TRUSTED
Fig. 2. Hundred Mobile nodes spread in the network
IF Trust value = LOW THEN node is SUSPECTED
As follows abundantly confidential atmosphere instituted The efficiency of the proposed algorithms were evaluated
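A minimal sketch of the segregation and cluster-head election described above, using the Table I ranges (threshold boundaries approximated from the printed values) and the rule that the cluster head is the node with the highest trust and energy; the node list is hypothetical.

```python
# Categorize nodes with the Table I ranges, then pick the cluster head (CH)
# as the node with the highest trust and energy, as described in the text.
def categorize(trust):
    if trust >= 0.9:
        return "Totally Trusted"
    if trust >= 0.75:
        return "Trusted"
    if trust >= 0.3:
        return "Partially Trusted"
    return "Distrusted"

# Hypothetical nodes: (node_id, trust, energy)
nodes = [("n1", 0.95, 0.8), ("n2", 0.78, 0.9), ("n3", 0.4, 0.7), ("n4", -0.2, 0.6)]

eligible = [n for n in nodes if categorize(n[1]) != "Distrusted"]     # distrusted nodes are removed
cluster_head = max(eligible, key=lambda n: (n[1], n[2]))              # highest trust, then energy
print(cluster_head[0], [(nid, categorize(t)) for nid, t, _ in nodes])
```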
C. Fuzzy Trust Based Clustering With Group Key Management (FTBCGKM)
The Fuzzy Trust Clustering is united with the polynomial based group key scheme adopted from [12] ("Polynomial-based key management for secure intra-group and inter-group communication") to offer protected communication. The authentic information transmission initiates as soon as the clusters have been developed and the group key circulated among the cluster members.

IV. EVALUATING FTBCGKM WITH ECGKM
The proposed Fuzzy Trust based Clustering and Group Key Management (FTBCGKM) is compared with the existing "An efficient clustering scheme for group key management in MANETs (ECGK)" [4], where direct and indirect observations are used for the trust calculation.
The indirect trust assessment may not be a true value; sometimes a malicious node produces fake information during the trust calculation. Also, energy is not considered for electing the CH, which increases the chances of having a CH with low energy.
The advantage of the proposed FTBCGKM is that it considers only direct observation for evaluating trust, and the energy level of the mobile node is compared before electing the CH. The following simulation section shows the comparison of the proposed FTBCGKM with the existing ECGK on the basis of an increasing number of nodes and an increasing number of attackers.

A. Simulation Setup
The proposed model was simulated using the Network Simulator NS2 [14]; one hundred mobile nodes spread within an area of 750 x 750 m were simulated, as shown in Fig. 2. The simulation runs for two hundred seconds. The complete simulation factors and their values are listed in TABLE II.

Fig. 2. Hundred Mobile nodes spread in the network

The efficiency of the proposed algorithms was evaluated with the metrics Packet Delivery Ratio (PDR), packet delay and packet drop, with misbehaving nodes invading the network. The number of misbehaving nodes is increased from 2 to 10 as they launch node capture attacks; [16] these are most effectively tackled with FTBCGKM. [13]
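Packet delivery ratio, average delay and packet drop are standard simulation metrics; the sketch below shows how they can be computed from abstract per-packet send/receive records. Parsing of actual NS-2 trace files is omitted, and the example packets are hypothetical.

```python
# Standard definitions of the comparison metrics, computed from abstract
# per-packet records (sent_time, received_time or None for a lost packet).
def performance_metrics(packets):
    sent = len(packets)
    delivered = [(s, r) for s, r in packets if r is not None]
    pdr = len(delivered) / sent                                     # packet delivery ratio
    avg_delay = sum(r - s for s, r in delivered) / len(delivered)   # mean end-to-end delay
    dropped = sent - len(delivered)                                  # dropped packets
    return pdr, avg_delay, dropped

# Hypothetical example: four packets, one of which is lost.
print(performance_metrics([(0.0, 0.02), (0.1, 0.13), (0.2, None), (0.3, 0.31)]))
```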
TABLE II. SIMULATION SETTINGS

B. Performance Metrics based on increasing the misbehaving nodes
Fig. 3, Fig. 4 and Fig. 5 represent the comparison of the proposed Fuzzy Trust based Clustering and Group Key Management (FTBCGKM) with the existing ECGKM using the metrics Packet Delivery Ratio, packet delay and packet drop, with the number of attackers increased from 2 to 10.

Fig. 3. Misbehaving Nodes Vs Delivery Ratio

Fig. 4. Misbehaving Nodes Vs Packet Delay

The delay is 5% lower in FTBCGKM when measured against the existing ECGKM. The packet delay diminished because the proposed system considers direct observations for evaluating the node, so it utilizes only true nodes for data transmission. The proposed system involves FTBCGKM energetic nodes for transmitting packets, so packets are transmitted without any delay.
The packet drop is 6% lower in FTBCGKM when measured against the existing ECGKM, since it perfectly eliminates misbehaving nodes. The foremost advantage of trust based clustering is identifying internal attackers; if and only if it considers direct trust values, elimination of a misbehaving node is easy. The proposed FTBCGKM concentrates only on direct trust, so the deceiving activities of malicious nodes are discouraged. Subsequently the packet drop decreases in the proposed FTBCGKM.

Fig. 5. Misbehaving Nodes Vs Packet Drop

Fig. 6. Misbehaving Nodes Vs Energy Consumption

Fig. 6 represents the energy consumption comparison of the proposed FTBCGKM with the existing ECGKM. The energy consumption is reduced by 30% in the proposed scheme due to the direct trust calculations and the automatic fuzzy analyzer for categorizing the nodes. This increases the stability of the network considerably.

C. Performance Metrics based on increasing the size of the network
The size of the network is increased from 100 nodes to 400 nodes while keeping 15 attackers inside, and the performance of FTBCGKM against the existing ECGKM is studied. The packet delivery ratio (PDR) increased, since the scheme isolates nodes that misuse the traffic. The average packet drop and delay are considerably reduced in the proposed clustering.
Fig. 7. Different size of network Vs. PDR

Fig. 7 shows the delivery ratio of the FTBCGKM and ECGKM techniques for different numbers of nodes. As the network size increases, the packet delivery rate decreases because of network congestion. The delivery ratio of the proposed FTBCGKM approach shows a 10% increase over the ECGKM approach.

Fig. 8. Different size of Network Vs Packet Drop

Fig. 8 shows the packet drop of the FTBCGKM and ECGKM techniques for different numbers of nodes. The conclusion from the above analysis is that the drop of the proposed FTBCGKM approach is 8% less than that of the ECGKM approach.

Fig. 9. Different size of Network Vs Delay

Fig. 9 shows the delay of the FTBCGKM and ECGKM techniques for different numbers of nodes. The delay of the proposed FTBCGKM approach is 4% less than that of the ECGKM approach.
Fig. 10 shows the comparison of the FTBCGKM and ECGKM techniques with respect to energy consumption. Reduced energy consumption is observed in the proposed FTBCGKM. Initially all nodes are set with the same amount of energy for each network-size scenario. After some amount of time, energy decreases due to data transmission and other computational activities.

Fig. 10. Different size of Network Vs Energy Consumption

V. CONCLUSION
The proposed Fuzzy Trust based Clustering with Group Key Management was developed to automate the process of eliminating misbehaving nodes. FTBCGKM constructs clusters with trusted nodes. The proposed FTBCGKM is compared with the existing ECGKM based on metrics such as delay, delivery ratio and drop.
Node capture attackers are introduced to study the performance of the proposed FTBCGKM. The simulation results show the improvement of the proposed FTBCGKM over the existing ECGKM. The direct trust evaluation along with the energy estimation of nodes supports the proposed FTBCGKM in eliminating malicious nodes from communication.
The intra- and inter-cluster communications are well organized in the proposed FTBCGKM, and rekeying is also carried out to preserve secrecy among mobile nodes. One of the weak points of a fuzzy logic system is the considerable amount of memory used for storing the fuzzy logic rules database. This could be a focus of future study to optimize the memory size.

REFERENCES
[1] Mohamed Dyabi, Abdelmajid Hajami and Hakim Allali (2014), "CATP: An Enhanced MANETs Clustering Algorithm Based on Nodes Trusts and Performances", International Journal of Innovative Technology and Exploring Engineering, Vol. 4, Issue 1, pp. 1-9.
[2] Alka Chaudhary, V. N. Tiwari and Anil Kumar (2016), "A New Intrusion Detection System Based on Soft Computing Techniques Using Neuro-Fuzzy Classifier for Packet Dropping Attack in MANETs", International Journal of Network Security, Vol. 18, No. 3, pp. 514-522.
[3] Manoj V (2012), "A Novel Security framework using Trust and Fuzzy logic in MANET", International Journal of Distributed and Parallel Systems, Vol. 3, No. 12, pp. 285-299.
[4] Drira, K., Seba, H. and Kheddouci, H. (2010), "ECGK: An efficient clustering scheme for group key management in MANETs", Elsevier: Computer Communications, Vol. 33, pp. 1094-1107.
[5] Dijiang Huang and Deep Medhi (2008), “A secure group key
management scheme for hierarchical mobile ad hoc networks”, Ad
Hoc Networks, Vol. 6, pp. 560–577
[6] Bhuvaneswari, V. and Chandrasekaran, M. (2014), “Cluster head
based Group key Management for Malicious Wireless Networks
using Trust Metrics”, Journal of Theoretical and Applied Information
Technology, Vol. 68, No. 1, pp. 1-9.
[7] Yanji Piao, JongUk Kima, Usman Tariq and Manpyo Honga, (2013),
“Polynomial-based key management for secure intragroup and inter-
group communication”, Computers and Mathematics with
Applications.
[8] Athira V and Jisha G (2014), “Network layer attacks and protection in
MANET-A survey”, International Journal on Computer science and
Information Technologies, Vol5(3),
pp 3437-3443
[9] Diwaker C, Choudhary S and Dabas P (2013), Attacks on Mobile Ad-
hoc Networks, International Journal of Software and Web Sciences,
Vol. 4(1), pp. 47-53
[10] Supreet Kaur and Varsha Kumari (2015), “Efficient Clustering with
Proposed Load Balancing Technique for MANET”, International
Journal of Computer Applications Vol. 111, No 13
[11] Jayaraj Singh, Arunesh Singh and Raj shree (2015), “An Assessment
of frequently adopted Security patterns in Mobile Ad hoc Network:
Requirement and Security Management Perspective”, Journal of
Wireless Network and Microsystems, Vol. 4, No. 1-2, pp. 1-7.
[12] Piao, Y., Kim, J., Tariq, U and Hong, M. (2013), “Polynomial-based
key management for secure intra-group and inter-group
communication”, Computers & Mathematics with Applications, Vol.
65, No. 9, pp. 1300-1309.
[13] Saju P john and Philip Samuel (2014), “Self- organized Key
Management with trusted certificate exchange in MANET”, Ain
Shams Engineering Journal, Vol. 6, pp. 161-170
[14] NS-2 simulator. Available online: http://www.isi.edu/nanam/ns .
[15] Veerpal Kaur and Simpel Rani (2018), “A Hybrid and Secure
Clustering Technique for Isolation of Black hole Attack in MANET
“,International Journal of Advanced Research in Computer
Engineering & Technology (IJARCET) Vol. 7, Issue 3, pp. 230-237.
[16] Dheepak, T. and Neduncheliyan S (2017), “Security Scheme in MAC
Protocol based Attack Detection Model using Cryptography and
Basiyan method”, Inter. Journal of Pure and Applied Mathematics
Vol. 116, No. 21, pp. 459-467.
[17] Zhe Wei and Shuyan Yu (2018),” Energy Aware and Trust Based
Cluster Head Selection for Ad-hoc Sensor Networks”, International
Journal of Network Security, Vol.20, No.3, PP.496-501.
Framework for Blockchain Deployment:
The Case of Educational Systems
Saif Kazakzeh, Eyad Ayoubi, Baraa K. Muslmani, Malik Qasaimeh, Mustafa Al-Fayoumi
Princess Sumaya University for Technology
Amman, Jordan
xsaifahmadx@gmail.com, eyadayoubi@gmail.com, b.muslamani@yahoo.com ,m.qasaimeh@psut.edu.jo, m.alfayoumi@psut.edu.jo
Abstract— Blockchain is an emerging technology that lacks sophisticated guidelines and frameworks for deployment purposes. This paper proposes a framework that helps in making suitable decisions concerning blockchain model adoption. In addition, the authors classify the major categories of blockchain metrics. Furthermore, the authors evaluate the proposed framework with two well-known educational blockchain-based models.

Keywords—Blockchain, Bitcoin, Ethereum, Decentralization, TrueRec, Blockcerts

I. INTRODUCTION
Blockchain (BC) is a distributed and decentralized data management solution that includes cryptography, consensus mechanisms, and hashing functions to ensure the immutability of its blocks (data). There is no need for a third party to validate the transactions; any completed transaction is recorded simultaneously in an immutable ledger, in a permanent, transparent, verifiable, and secure way, with a timestamp [1]. The ledger is the heart of the blockchain, where the transactions between two parties are efficiently stored in a permanent and verifiable manner. Furthermore, it is possible to program the ledger to enable automatic triggering of transactions [2]. A smart contract is "a computerized transaction protocol that executes the terms of a contract"; it executes automatically, and is visible to all users of the blockchain [3].
The genesis of BC is usually traced to a Japanese theorist known as 'Satoshi Nakamoto', who published an online paper concerning the original source code for the virtual currency Bitcoin in 2009, whereby "nodes collect new transactions into a block, hash them into a hash tree", and subsequently broadcast the block "when they solve the proof-of-work… and the block is added to the block chain" [4].
Decentralization is one of the most important characteristics of BC, whereby users can jointly manage the database in which their transactions are recorded, and there is neither presence nor control of a third party. Fault tolerance, resistance to attacks, and collusion resistance are assured by decentralization [5].
Each block in the BC can contain thousands of transactions, and a new block can be added by a hash verification procedure, known as mining. The new block is then linked to the last block in the chain. Each BC starts with the root block containing its settings [6].
Blockchain has the following advantages [7]: transparency - each party has the capacity to enter into the transaction; immutability - it is not possible to modify the written records; security - the infrastructure offers secure operations using strong cryptography; self-sovereignty, scalability, and decentralization - as a result of the elimination of the third party, it is possible to add new users (nodes) to the chain, and users have the authority to manage their own data; tamper-proofing - a unique timestamp is associated with each data store operation in the blocks [8]. Drawbacks include the high power consumption and time of the mining process, the complexity of managing one's data, and the performance issue; because BC is highly secured, there is a performance trade-off. One potential application area for BC is in educational documentation.
Proving one's level of education and skills, work experience, or even training accomplishment requires certification in some format, including several types of information statements. The most important are: the kind of qualification, such as a "certificate of accomplishment, attendance, or graduation…etc."; the name and the address of the certificate issuer; the name and the title of the certifier who has validated the certificate; the date of obtaining the certificate; and the name of the learner. Moreover, there could be more information based on the type of the certificate, such as the validation period or information about the examination regulations.
Paper-based certificates have advantages such as ease of archiving and retrieval, and they can be displayed to any person for any purpose. However, a hard copy might be subjected to damage or loss, which can lead to difficulties for the holder in being reissued documents or obtaining new copies, costing extra time and money, at the expense of potential opportunities. Similarly, many forced immigrant students and refugees suffer from a lack of certificates because they have lost access to their original locations, and/or cannot contact the authorities to be issued new ones. In contrast, the certificate issuer or authority needs to maintain a database of certificates for a long period of time, and this will lead to the
need of a huge storage repository. At the same time, the issuer or the certifier will act as the party for the validation check of the certificates, which also consumes time, effort, power, and storage [9]. There is an increasing problem with fake documents in the field of educational certificates, with approximately 500 fake doctoral diplomas sold per month in the USA [10]. This comprises an illicit industry worth billions of dollars [11], exacerbating the difficulties of finding people with the prerequisite skills to fill available vacancies [12].
Gert Postel was a German postman who obtained employment as a deputy medical officer in Flensburg using fake documents and a fake curriculum vitae [13]. It is hard to estimate the number of fake certificates that are available in the world. In the USA it was estimated that 41% of job applications presented in 2015 used forged educational information [14].
Different BC-based models are available for certification accreditation that leverage BC's technological advantages. The University of Nicosia was the first higher education institution to use the Bitcoin BC for storing academic certificates [15]. The models of certificate accreditation discussed in the literature review used the two best-known BC technologies, Bitcoin and Ethereum, and all the models considered in our study are based on them. According to [16] and [17], the main differences between the Bitcoin and Ethereum technologies are the programming language used, the time of issuing a block, the basic builds, and the purpose of the blockchain, as illustrated in Table I.
Blockchain has different advantages, and large enterprises and technology leaders in many domains are moving to apply BC to their businesses and processes, despite the lack of any clear guidance on which BC-based techniques are reliable for particular purposes. At the same time, we found that the literature lacks evaluation tools or even frameworks, which would be a big asset in evaluating and choosing which BC-based technology will suit specific needs and requirements. In this paper, we propose a foundational framework for BC evaluation in the domain of document accreditation systems. We aim to help educational institutions apply the most suitable BC model for their certification accreditation systems.

TABLE I. BITCOIN VS ETHEREUM
Comparison           | Bitcoin                                      | Ethereum
Programming language | Stack-based language, limited functionality  | Turing-complete language, any sort of operation is possible
Transaction time     | 10 minutes to produce a new block            | 15 seconds to produce a new block
Basic builds         | SHA-256                                      | Ethash (SHA-3)
Purpose              | Alternative to regular money                 | Not only a cryptocurrency; designed to enable developers to build and run distributed applications

The following section reviews the literature to list the available models that use blockchain in education and certification accreditation, as well as the available evaluation frameworks. The developed framework is then discussed, along with the evaluation metrics. The paper ends with the conclusion and directions for future work.

II. LITERATURE REVIEW
A. Models of blockchain in education
The Blockcerts project was established by the Media Lab of the Massachusetts Institute of Technology and the American company Learning Machine [18]. It is an open-source platform for creating, sharing and verifying education credentials on the BC using Bitcoin technology. It is an open standard, to avoid vendor lock-in and for easier interoperability by not following a certain standard, and it seeks global availability by offering its own standards [19]. Blockcerts has four main elements, as illustrated in Figure 1: issuer - the institution that creates digital certificates; certificates - modified with the requirements of the Mozilla Foundation initiative Open Badges, which contain different statements about the skills, achievements, or characteristics of the student, all recorded in a chain of blocks; verifier - somebody who wants to verify that the certificate has been issued by a certain institution, is related to a particular individual, and has not been altered, independently of a third-party "distributor"; wallet - for storing students' certificates and sharing them with others, e.g. employers. Each student has his own wallet. However, Blockcerts suffers from lacking an effective way to revoke certificates. The current revocation method involves a cryptocurrency amount being controlled by both parties; thus, to revoke a document, one or both parties must spend that amount of cryptocurrency. This is a disadvantage, because it builds a barrier to global participation, as fees must be paid to maintain the system. Moreover, this increases the amount of unwanted information in the system (entailing a higher power load and reduced performance).

Figure 1. Blockcerts elements [18].

A pilot project based on Blockcerts is under development in Malta for professional and academic certifications [20], and the Federation of State Medical Boards in the US is currently launching a pilot project for issuing official documents to the blockchain with Blockcerts [21].
TrueRec, by the German multinational software corporation SAP SE, is another model employing blockchain technology in the education system and employment. TrueRec is based on the open-source distributed platform Ethereum, which is available to the public.
The main objective of TrueRec is to track and verify the credentials of candidates that can later be used in the hiring verification process, or upon admission to an educational institution (e.g. a university), by enabling candidates to upload their certificates to TrueRec and enabling verification by trusted authorities. TrueRec has efficiently reduced the costs (including time) of the hiring process, as described in their patent [22], reducing the seven steps of the traditional verification process to two: receiving the application with the certificates already verified, and conducting the interview.
TrueRec (Figure 2) proved that costs can be significantly reduced by employing blockchain technology while increasing the security and reliability of certificates, and being open to the public ensures the usability of the system.

Figure 2. TrueRec proposed process [22].

The Dutch organization for Applied Scientific Research (TNO) started a blockchain project called the Self-Sovereign Identity Framework to support supplying official information in digital form while only sharing a minimum amount of personal data, managed and stored in a wallet on people's cellphones in an encrypted form. This information provides official confirmation about the identity of the person using a decentralized, public-permissioned blockchain [23].
Blockchain for Education is another practical model for issuing, validating, and sharing certificates. It uses the Ethereum blockchain and is concerned with the correctness of security-relevant contracts, using the approved smart contract template of OpenZeppelin. The aim of the Blockchain for Education platform is to support counterfeit protection as well as secure management and access of certificates according to the needs of learners, companies, educational institutions, and certification authorities. It is similar to Blockcerts in using smart contracts for managing the identity of the certification authorities or the certifiers, and for managing the certificate lifecycle. However, Blockcerts uses Bitcoin, and therefore cannot apply complex contracts. Another benefit is that it allows the identity of the certifier to remain anonymous [24].
However, it is still at the prototype stage, and needs some improvements. For instance, the accreditation authority is a single powerful root node, and if its private key were compromised or lost the whole system would be affected. Moreover, a cost overhead applies for adding certificates to the blockchain, as it is based on the Ethereum blockchain. The revocation model does not allow showing or validating the revoked certificate.
CredenceLedger is a system that stores consolidated data proofs of academic credentials in a blockchain, enabling easy verification by third parties such as employers or education stakeholders. This model depends on a permissioned multichain combined with a mobile application for verifying academic credentials. When students graduate, they are awarded an authentic digital version of their credentials, in addition to the paper certificate. This provides easy access from the mobile to the certificate, and easy verification by the third party. There is no need for transacting a cryptocurrency, as it uses streams (a hexadecimal value with a key-value pair). CredenceLedger is a private blockchain that enables digital forms of credentials to be verified easily, without needing a public blockchain transaction, which incurs mining costs. Furthermore, CredenceLedger does not need a centralized system, and it provides high throughput with low costs [25]. However, it still needs to be tested in public use, and it should be expanded and developed to be used on a public blockchain for global use, because otherwise special effort and knowledge are needed to access the application.
Other models are developed by different vendors, such as Sony Global Education by Sony [26] and Open Certificates by Attores Solutions [27]. However, these models are not discussed in this paper as the systems are in the development and testing phases.

B. Models of blockchain in education
To achieve a better decision-making procedure through blockchain technology, guidelines have been proposed in the literature, such as [28], which presented a blockchain maturity model that extends the CMMI model based on five aspects with four characteristics. The paper aimed to produce guidance on how organizations in different industries could systematically decide on adopting blockchain. However, the adoption procedure is complex. It investigated three non-technical aspects without detailing the process to be considered as a full reference.
Yuan et al. [29] have presented a reference model for researchers in the field that divides the blockchain framework into six layers, as shown in Figure 3. The model is well described and presented, but needs some enhancements to become comprehensive and include all the components that may constitute a blockchain.
More literature has presented different models to support the decision-making process of blockchain technology. Lo et al. [30] proposed an evaluation framework to help organizations assess the suitability of applying blockchain. Through a decision tree, an organization may decide whether using a blockchain technology is suitable for their system or not. However, the framework depends on very limited and strict
Yes/No questions (Figure 4), without taking into consideration some special aspects that may arise for each business and may affect the resulting suitability decision.

Figure 3. Blockchain components [29].

Figure 4. YES/NO decision questions [30].

Blockchain technical specifications have been addressed in a few evaluation frameworks in the literature, but these frameworks lack completeness and only evaluate blockchain models on specific components. Previous studies introduced a novel quantitative framework to analyze the security and performance implications of POW blockchains [31] and evaluated the performance of consensus algorithms in private blockchains in terms of latency and throughput aspects [2].

III. EVALUATION METRICS
This section describes the major metrics currently used in blockchain technology, to provide some insights into the related mechanisms.

A. Security
1) Cryptography in blockchain
Bitcoin and Ethereum are the most used blockchain platforms, as seen in the previously discussed models. These platforms utilize two cryptographic technologies: asymmetric cryptography and hashing functions [32]. This section describes the most-used versions of the former, such as the Elliptic Curve Digital Signature Algorithm (ECDSA), and the X.509 standard, which defines the format of public key certificates, because blockCAM relies only on hashing algorithms.

a) Elliptic Curve Digital Signature Algorithm
This is a cryptographic algorithm used in many blockchain platforms to issue public and private keys, and to digitally sign a file, which allows users to verify the authenticity of a file. Unlike the Advanced Encryption Standard (AES), which encrypts the content of the file, ECDSA protects the file from tampering. The main strengths of ECDSA are that it is impossible to duplicate the signature, and it requires less computing power compared to other algorithms [33]. The major services of ECDSA are [34] (a brief signing sketch follows this list):
• Ensuring data integrity.
• Origin authentication.
• Tamper-proof data.
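As a minimal illustration of signing and verification with ECDSA, the sketch below uses the Python `cryptography` package on the secp256k1 curve (the curve used by Bitcoin and Ethereum); it only demonstrates the signature services listed above and is not tied to any particular platform's transaction format. The document contents are hypothetical.

```python
# Minimal ECDSA sign/verify sketch (Python 'cryptography' package, secp256k1 curve).
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import ec

private_key = ec.generate_private_key(ec.SECP256K1())
document = b"certificate: B.Sc. awarded to <learner>"          # hypothetical signed content

signature = private_key.sign(document, ec.ECDSA(hashes.SHA256()))    # signed with the private key
public_key = private_key.public_key()
# Verification raises InvalidSignature if the document or the signature was tampered with,
# which is what provides integrity, origin authentication, and tamper evidence.
public_key.verify(signature, document, ec.ECDSA(hashes.SHA256()))
print("signature verified")
```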
b) X.509 Standard
The X.509 standard defines the format of public key certificates, in which a certificate contains a public key and an address that specifies the owner. This certificate can be signed by a Certificate Authority (CA) or self-signed by the owner. An X.509 certificate uses the Public Key Infrastructure (PKI) to verify that a public key belongs to the assigned address. The cross-certification process is calibrated by PKIs [35], certifying that all user certificates in PKI 2 (User 2) are trusted by PKI 1, whereby CA1 generates a certificate (Cert 2.1) that contains the public key of CA2. As Cert2 and Cert2.1 have the same subject and public key, there are two valid chains for Cert2.2 (User 2): "cert2.2 → cert2" and "cert2.2 → cert2.1 → cert1". Similarly, CA2 can generate a certificate (Cert1.1) containing the public key of CA1, so that user certificates existing in PKI 1 (User 1) are trusted by PKI 2.

2) Immutability in blockchain
Immutability pertains to non-tampering, such that once a block is added to the blockchain it cannot be altered or modified. Immutability is achieved using hashing functions, in which a hashing function generates a fixed-size hash out of an arbitrary-size file or data to ensure integrity. The most common hashing functions used in blockchain technology are SHA-256 (Bitcoin platform) and Ethash (Ethereum platform). The following section briefly describes these hashing functions.

a) SHA-256 Algorithm
The Secure Hash Algorithm is a cryptographic hash function designed by the US National Security Agency (NSA), and SHA-256 is a member of the SHA-2 family. It is a one-way function that cannot be reversed to decrypt the content of a file or data. The Bitcoin platform uses the SHA-256 algorithm to verify the transactions on the blockchain.

b) Ethash Algorithm
Ethash is a cryptographic hash function used in the Ethereum platform. Ethash uses the Keccak hashing function, which was standardized as SHA-3.
SHA-3 is the latest member of the Secure Hash Algorithm family, released by the US National Institute of Standards and Technology (NIST). The main functionality of SHA-3 is the same as mentioned for SHA-256, in which a hash function takes a message of any length as input and transforms it into short, fixed-length bit strings called hash values.
transparent, and anonymous.
By utilizing these two cryptographic technologies,
asymmetric cryptography and hashing functions make b) Permissioned blockchain (private)
blockchain one of the most secure existing technologies, with Permissioned blockchain is governed by an organization
algorithms not yet solved mathematically (Esslinger et al., or authority, which determines users who are approved to use
2014) [36]. and interact with the blockchain, with varying degrees of
B. Consensus privileges. The main characteristics of permissioned
blockchains are different levels of decentralization, different
1) Consensus protocol levels of transparency and anonymity, and their governance
One of the main mechanisms used in blockchain structure, whereby organizations and communities have a
technology is consensus protocol, used to achieve agreement decision-making role in what architecture to adopt based on
on a single transaction, value, or block in distributed systems. their needs and size. For organizations it may be safer to
Consensus protocol provides reliability in a network. In other adopt permissioned blockchain, while for public
words, consensus means that all the nodes in the network organizations it is more usable to adopt permission-less
agree on the same state of the blockchain [37]. There are blockchain to serve more users.
many types of consensus protocols adapted by blockchain
platforms. The following subsections describe the most D. Blockchain scalability
common consensus protocols. a) Scalability of transactions
a) Proof of work (POW) The scalability of the blockchain is important to serve
Proof of Work protocol is adopted by the Bitcoin and more users. Problems can arise when there are too many
Ethereum platforms. The mechanism of POW works by transactions to be processed by the network. Figure 5 displays
requiring a solution from blockchain mining nodes to a the drastic increase in the number of daily Bitcoin (BTC)
specific mathematical problem in order to add a new block. transactions since 2009 [40]. The scalability of blockchain is
In this case nodes must solve the hash function; the only way a major concern, which is why authors included this as a
is to use trial and error. When a node solves the hash function category.
it receives a reward of some currency to cover some of the b) Scalability of nodes
power consumption costs [38]. POW thus adds new Another means of blockchain scalability is the simplicity
transactions to the blockchain based on computational power of adding new users (nodes) to the blockchain. In permission-
[39]. less architecture it is easier to add new users to the
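A toy version of this trial-and-error search is sketched below; the difficulty target and block data are illustrative assumptions, not Bitcoin's or Ethereum's real parameters.

```python
# Toy proof-of-work sketch: find a nonce so the block hash starts with a given
# number of zero hex digits. Higher difficulty means exponentially more attempts.
import hashlib

def mine(block_data: str, difficulty: int = 4):
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{block_data}{nonce}".encode()).hexdigest()
        if digest.startswith("0" * difficulty):
            return nonce, digest     # puzzle solved by trial and error
        nonce += 1

nonce, digest = mine("prev_hash|tx1,tx2,tx3")
print(nonce, digest)
```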
b) Proof of stake (POS)
The mechanism of proof of stake is that the mining node is now called a forger node: instead of using computational power as a measure, POS uses an amount that must be staked to select the forger node. The higher the stake, the higher the probability of being selected to validate the new block or transaction. The rewarding system is similar to POW, whereby the forger is rewarded with transaction fees. In the case of false validation of a transaction or block, the staked amount is lost.
The Ethereum platform is trying to change its consensus protocol to use POS instead of POW. Other mechanisms such as Delegated Proof of Stake and Proof of Authority exist but are not described in this paper. These consensus protocols aim to increase the time needed to add new blocks to the blockchain, and to ensure that no one can validate false transactions or blocks, or even compromise the blockchain network.
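One common way to realise stake-weighted selection is sketched below. This is an illustration of the general idea under assumed stakes, not a description of Ethereum's planned algorithm.

```python
# Hedged sketch of stake-weighted forger selection: the probability of being
# chosen is proportional to the amount staked.
import random

stakes = {"node_a": 50, "node_b": 30, "node_c": 20}   # hypothetical staked amounts

def select_forger(stakes: dict) -> str:
    nodes = list(stakes)
    weights = [stakes[n] for n in nodes]
    return random.choices(nodes, weights=weights, k=1)[0]

print(select_forger(stakes))  # "node_a" is selected roughly 50% of the time
```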
C. Blockchain architecture
a) Permission-less blockchain (public)
This is the most used architecture among digital currencies (Bitcoin and Ethereum). A permission-less blockchain allows any user to use and interact with the blockchain while maintaining anonymity and transparency. Permission-less blockchains allow any user to run as a normal node or a mining node, to help in verifying new transactions. The main characteristics of permission-less blockchains are that they are decentralized, transparent, and anonymous.

b) Permissioned blockchain (private)
A permissioned blockchain is governed by an organization or authority, which determines the users who are approved to use and interact with the blockchain, with varying degrees of privileges. The main characteristics of permissioned blockchains are different levels of decentralization, different levels of transparency and anonymity, and their governance structure, whereby organizations and communities have a decision-making role in what architecture to adopt based on their needs and size. For organizations it may be safer to adopt a permissioned blockchain, while for public organizations it is more practical to adopt a permission-less blockchain in order to serve more users.

D. Blockchain scalability
a) Scalability of transactions
The scalability of the blockchain is important in order to serve more users. Problems can arise when there are too many transactions to be processed by the network. Figure 5 displays the drastic increase in the number of daily Bitcoin (BTC) transactions since 2009 [40]. The scalability of blockchain is a major concern, which is why the authors included it as a category.

b) Scalability of nodes
Another aspect of blockchain scalability is the simplicity of adding new users (nodes) to the blockchain. In a permission-less architecture it is easier to add new users to the blockchain; however, it takes more time and effort to add new users to a permissioned blockchain, in order to verify the correct identity of the new user and whether the user fulfills the admission requirements.

Figure 5. Number of BTC transactions since 2009 [40].

E. Network performance
The performance of any network has always been a great concern with regard to usability and availability; thus, due to its intrinsic significance, the authors decided to add this category to the evaluation in this paper. The main metrics of performance are throughput and latency, as discussed below.

a) Throughput
Throughput is the rate at which the blockchain platform uploads the verified transactions into the blockchain ledger. Not to be confused with latency: throughput is the rate for uploading a group of transactions, while latency is the time for a single transaction.

b) Latency
Blockchain platforms such as Bitcoin and Ethereum require time to process each block in order to verify its validity. Latency is the amount of time needed to process and validate a single block or transaction before adding it to the ledger. Bitcoin processes transactions in minutes, while Ethereum does so in seconds [41]. Predicting the latency of blockchain-based systems using architectural modelling and simulation focuses on processing time as the main parameter, as the number of transactions and the number of users or nodes increase.
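The distinction can be made concrete with a back-of-the-envelope sketch; the throughput figures reused here are the ones quoted later in Table III and serve only as assumptions for illustration.

```python
# Back-of-the-envelope sketch distinguishing throughput from latency.
def time_to_clear(backlog_tx: int, throughput_tps: float) -> float:
    """Seconds needed to upload a backlog of transactions at a given throughput."""
    return backlog_tx / throughput_tps

# Clearing 10,000 queued transactions at the Table III rates:
print(time_to_clear(10_000, 3))   # ~3333 s at 3 tx/s
print(time_to_clear(10_000, 7))   # ~1429 s at 7 tx/s
# Latency is separate: even at the higher rate, a single transaction still waits
# for its block to be validated before it appears in the ledger.
```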
IV. FRAMEWORK
As discussed before, in our paper we propose a prototype framework for evaluating different blockchain models. In order to achieve that, we started by categorizing the specifications of blockchains that can be used as metrics. Table II shows the different categories that any blockchain model adoption process should consider. Considering that every blockchain model is driven by its business goals, each of them should be interested in some of the mentioned categories and their subsequent metrics more than the others. This offers one way to provide a primitive decision on whether a blockchain model can be adopted for a given business need. For example, of the previously explained blockchain models, BlockCerts and TrueRec, the former can be used for accrediting educational documents even though it takes more time (i.e. higher latency), but it is not suitable for direct purchases due to this characteristic. On the other hand, TrueRec can be used for some systems with less stringent security requirements than those enforced by BlockCerts.

TABLE II. EVALUATION METRICS CATEGORIZATION
Security
  Encryption / digital signature: ECDSA | X.509
  Hashing functions: SHA-256 | Ethash (SHA-3)
Consensus
  Consensus mechanisms: POW | POS
Architecture
  Blockchain architecture: Permission-less (public) | Permissioned (private)
Scalability
  Scalability of transactions: Transaction processing time | Transaction size
  Scalability of nodes: Simplicity of adding nodes to the blockchain
Network Performance
  Throughput: The rate at which the blockchain platform uploads the verified transactions into the blockchain ledger
  Latency: The amount of time needed to process a single block or transaction before it is added to the ledger

A. Evaluation process
The literature offers many processes that may rely on conditional statements [30] or on other guideline models, such as the CMMI model [28]. We have implemented our framework in five steps, taking advantage of the metrics categorization produced earlier. The process flow is shown in Figure 6.

Figure 6. Evaluation process flow

B. Framework steps
a) Specify the business needs
Specifying the business needs and what an organization really looks for is one major step towards our evaluation goal, since business requirements may not be amenable to certain blockchain models' specifications. Knowing the business/client goal helps in deciding which model is most suitable for a particular application.

b) Specify the most relevant blockchain models
This step narrows the evaluation process to the models most relevant to the business goals identified in the previous step. This task should be done by an expert (i.e. a person or organization with experience in blockchain models and business applications) to ensure that no related model that could be more efficient has been missed.

c) Specify the evaluation categories most related to the business needs
The process of specifying the evaluation categories considers their relation to the business goals. In this step, we can include all or some of the evaluation categories presented earlier that have major influences on the business goals. The aim of this process is to eliminate any unrequired metrics from the evaluation, since unnecessary metrics may negatively affect the decision of selecting the suitable blockchain model.

d) Apply the evaluation matrix
So far, we have specified the business goals, the related models, and the metrics categories that we will use to evaluate those models. Now it is time to work on the evaluation matrix, as shown in Table III. This step includes extracting all the required information about the selected models and filling it into the matrix based on the selected evaluation categories.

e) Prioritize the models
Based on the evaluation matrix, we can make any needed trade-off decisions and specify what most suits our business and what is less relevant.
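One possible way to operationalise these five steps is a simple weighted score per model. The paper does not give an explicit weighting formula, so the category weights and per-category scores in the sketch below are illustrative assumptions only.

```python
# Hypothetical weighted-scoring sketch of the evaluation-matrix step.
category_weights = {"security": 0.7, "network_performance": 0.3}   # assumed priorities

# Per-category scores (0-10) assigned by the evaluator for each candidate model.
models = {
    "BlockCerts": {"security": 8, "network_performance": 3},
    "TrueRec":    {"security": 8, "network_performance": 7},
}

def weighted_score(scores: dict, weights: dict) -> float:
    return sum(weights[c] * scores[c] for c in weights)

ranking = sorted(models, key=lambda m: weighted_score(models[m], category_weights),
                 reverse=True)
print(ranking)  # the trade-off decision: the higher total score ranks first
```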
V. USE CASE EVALUATION
In this version of our proposed framework, we considered security and network performance to be the evaluation categories by which to select the model. Other categories will be evaluated when extending this research. The metrics of security are encryption/digital signature and hashing functions, while the metrics of network performance are throughput and latency. These metrics are investigated for the BlockCerts and TrueRec models. The five steps of the framework are applied with consideration of these models, and the use case is described below.

a) Use case
An organization aims to reduce the complexity needed to accredit certificates and other related documents, which consumes a lot of effort, time, and money, by utilizing a globally known and trusted blockchain model.

b) Applying the framework
- Specify the business needs: An educational organization is looking to produce graduation certificates and other related documents for students using a blockchain model.
- Specify the most relevant blockchain models: As per our literature review, we specified the BlockCerts and TrueRec models.
- Specify the evaluation categories that are most related to the business needs: The metrics of the security and network performance categories.
- Apply the evaluation matrix: Table III shows the evaluation matrix for the selected blockchain models.
- Prioritize the metrics based on their impact on the goals: According to the business goals in this use case, security is more important than network performance, due to the fact that the nature of the business allows students or other entities with enough privileges to submit a request for specific documents, and they can later check the status for validity and correctness. Network performance is also important; however, the considered business goals can afford network delays, since the process does not need to be attended. Based on our methodology, as explained in the framework, the BlockCerts weight is 5 and the TrueRec weight is 7.

VI. RESULTS
Considering the above factors, the five outputs, and the evaluation weights of the process, we may assign the following ranking of suitability for the selected models:
• First model: TrueRec; with similar security, it has improved network performance.
• Second model: BlockCerts; with similar security, it has lower network performance.
Detailed performance data is shown in Table III. A graph showing the application of the framework to evaluate the models using the weighting methodology is shown in Appendix A.

TABLE III. EVALUATION MATRIX
Evaluation Categories          | BlockCerts        | TrueRec
Security
  Encryption / digital signature | ECDSA           | ECDSA
  Hashing function               | SHA-256         | Ethash (SHA-3)
Network Performance
  Throughput [42]                | 3 transactions/s | 7 transactions/s
  Latency                        | Minutes          | Seconds

VII. CONCLUSION AND FUTURE WORK
Blockchain is an emerging technology with insufficient evaluation guidelines and frameworks. This paper proposes a foundry framework to evaluate blockchain-based models and to classify major blockchain metrics. In addition, we applied the proposed framework to an educational-institution use case that aimed to adopt a suitable blockchain model to accredit documents. Based on the specified business goals and the proposed framework, we found that the BlockCerts model is the most suitable due to its higher security specifications, even though its technical performance is inferior to that of the second model, TrueRec. The authors aim to produce a deeper classification and evaluation experiment in future work.
REFERENCES
[1] Holotescu, C. (2018). Understanding Blockchain Opportunities and Challenges. eLearning & Software for Education, 4.
[2] Hao, Y., Li, Y., Dong, X., Fang, L., & Chen, P. (2018, June). Performance Analysis of Consensus Algorithm in Private Blockchain. In 2018 IEEE Intelligent Vehicles Symposium (IV) (pp. 280-285). IEEE.
[3] Iansiti, M., & Lakhani, K. R. (2017). The truth about blockchain. Harvard Business Review, 95(1), 118-127.
[4] Nakamoto, S. (2009). The original bitcoin source code. Online at https://github.com/trottier/original-bitcoin (Accessed 29 December 2018).
[5] Buterin, V. (2017). The Meaning of Decentralization. Medium. Online at https://medium.com/@VitalikButerin/the-meaning-of-decentralization-a0c92b76a274 (Accessed 29 December 2018).
[6] Dhillon, V., Metcalf, D., & Hooper, M. (2017). Blockchain Enabled Applications: Understand the Blockchain Ecosystem and How to Make it Work for You. Apress.
[7] Grech, A., & Camilleri, A. F. (2017). Blockchain in education.
[8] Bhowmik, D., & Feng, T. (2017, November). The multimedia blockchain: A distributed and tamper-proof media transaction framework. In Digital Signal Processing (DSP), 2017 22nd International Conference on (pp. 1-5). IEEE.
[9] Grech, A., & Camilleri, A. F. (2017). Blockchain in Education. No. JRC108255. Joint Research Centre (Seville site).
[10] Park, H., & Craddock, A. (2017). Diploma Mills: 9 Strategies for Tackling One of Higher Education's Most Wicked Problems. https://bit.ly/2DoEeyu
[11] Bazley, T. D. (2005). Degree Mills: The Billion Dollar Industry That Has Sold Over a Million Fake Diplomas. College and University, 80(4), 49.
[12] Rutkowski, J. (2007). From the shortage of jobs to the shortage of skilled workers: labor markets in the EU new member states.
[13] Mauz, G. (1997). A juggler, an artist. http://www.spiegel.de/spiegel/print/d-8742708.html
[14] Musee, N. M. (2015). An academic certification verification system based on cloud computing environment. PhD diss., University of Nairobi.
[15] Sharples, M., et al. (2016). Innovating pedagogy 2016: Open University innovation report 5.
[16] Gencer, A. E., Basu, S., Eyal, I., van Renesse, R., & Sirer, E. G. (2018). Decentralization in bitcoin and ethereum networks. arXiv preprint arXiv:1801.03998.
[17] Miller, A., & Bentov, I. (2017, April). Zero-collateral lotteries in Bitcoin and Ethereum. In Security and Privacy Workshops (EuroS&PW), 2017 IEEE European Symposium on (pp. 4-13). IEEE.
[18] Blockchain Credentials. (2018). Blockcerts. Available at: https://www.blockcerts.org (Accessed 29 December 2018).
[19] Schmidt, P. (2016). Blockcerts—An Open Infrastructure for Academic Credentials on the Blockchain. ML Learning (2016/10/24).
[20] Case Study Malta | Learning Machine. From https://www.learningmachine.com/customer-story-malta/ (Accessed 29 December 2018).
[21] Case Study FSMB | Learning Machine. From https://www.learningmachine.com/customer-story-fsmb/ (Accessed 29 December 2018).
[22] Tummuru, N., Sheth-Shah, S., Kunzmann, M., Shirole, S., & Meng, J. (2018). U.S. Patent Application No. 15/385,479.
[23] Jongsma, H. J., & Joosten, H. J. M. (2018). Technical Report Studybits.
[24] Gräther, W., Kolvenbach, S., Ruland, R., Schütte, J., Torres, C., & Wendland, F. (2018). Blockchain for Education: Lifelong Learning Passport. In Proceedings of 1st ERCIM Blockchain Workshop 2018. European Society for Socially Embedded Technologies (EUSSET).
[25] Arenas, R., & Fernandez, P. (2018, June). CredenceLedger: A Permissioned Blockchain for Verifiable Academic Credentials. In 2018 IEEE International Conference on Engineering, Technology and Innovation (ICE/ITMC) (pp. 1-6). IEEE.
[26] Russell, J. (2017). Sony wants to digitize education records using the blockchain. Available at: https://techcrunch.com/2017/08/09/sony-education-blockchain. (Accessed on 01 January 2019).
[27] Open Certificates. Available at: http://opencertificates.co/. (Accessed on 01 January 2019).
[28] Wang, H., Chen, K., & Xu, D. (2016). A maturity model for blockchain adoption. Financial Innovation, 2(1), 12.
[29] Yuan, Y., & Wang, F. Y. (2018). Blockchain and cryptocurrencies: Model, techniques, and applications. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 48(9), 1421-1428.
[30] Lo, S. K., Xu, X., Chiam, Y. K., & Lu, Q. (2017, November). Evaluating Suitability of Applying Blockchain. In Engineering of Complex Computer Systems (ICECCS), 2017 22nd International Conference on (pp. 158-161). IEEE.
[31] Gervais, A., Karame, G. O., Wüst, K., Glykantzis, V., Ritzdorf, H., & Capkun, S. (2016, October). On the security and performance of proof of work blockchains. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security (pp. 3-16). ACM.
[32] Wang, W., Hu, N., & Liu, X. (2018, June). BlockCAM: A Blockchain-Based Cross-Domain Authentication Model. In 2018 IEEE Third International Conference on Data Science in Cyberspace (DSC) (pp. 896-901). IEEE.
[33] Understanding How ECDSA Protects Your Data. https://www.instructables.com/id/Understanding-how-ECDSA-protects-your-data/. (Accessed on 03 January 2019).
[34] Khalique, A., Singh, K., & Sood, S. (2010). Implementation of elliptic curve digital signature algorithm. International Journal of Computer Applications, 2(2), 21-27.
[35] "Cross-Certification Between Root CAs". Qualified Subordination Deployment Scenarios. Microsoft. August 2009. (Accessed on 05 January 2019).
[36] Esslinger et al. (2014).
[37] Consensus protocol. https://lisk.io/academy/blockchain-basics/how-does-blockchain-work/consensus-protocols. (Accessed on 5 January 2019).
[38] Understanding Blockchain Fundamentals, Part 2: Proof of Work & Proof of Stake. https://medium.com/loom-network/understanding-blockchain-fundamentals-part-2-proof-of-work-proof-of-stake-b6ae907c7edb. (Accessed on 05 January 2019).
[39] Types of Consensus Protocols Used in Blockchains. https://hackernoon.com/types-of-consensus-protocols-used-in-blockchains-6edd20951899. (Accessed on 05 January 2019).
[40] Blockchain.com. Bitcoin Charts & Graphs - Blockchain. [online] Available at: https://www.blockchain.com/charts.
[41] Yasaweerasinghelage, R., Staples, M., & Weber, I. (2017, April). Predicting latency of blockchain-based systems using architectural modelling and simulation. In Software Architecture (ICSA), 2017 IEEE International Conference on (pp. 253-256). IEEE.
[42] Chu, S., & Wang, S. (2018). The Curses of Blockchain Decentralization. arXiv preprint arXiv:1810.02937.

Appendix A: Evaluation results

THE JOVITAL PROJECT: CAPACITY BUILDING FOR VIRTUAL INNOVATIVE TEACHING AND LEARNING IN JORDAN

Arinola Adefila, Alun DeWinter, Katherine Wimpenny
Centre for Global Learning, Education and Attainment, Coventry University, Coventry, United Kingdom
ab0191@coventry.ac.uk, aa2567@coventry.ac.uk, k.wimpenny@coventry.ac.uk

Valerij Dermol, Nada Trunk Širca, Aleš Trunk
International School for Social and Business Studies, Celje, Slovenia
valerij@dermol.si, trunk.nada@gmail.com, ales.trunk@mfdps.si

Abstract—This qualitative paper presents the preliminary findings of an ongoing education-focused project, JOVITAL, an international cooperation project co-funded by the Erasmus+ Capacity Building in HE programme of the European Union during the period October 2017 - 2020, involving four European institutions and five Jordanian universities¹. Our paper outlines how new and emerging technologies are being innovatively used in institutions around the world and, on this basis, how they are being adapted and implemented in Jordan as part of JOVITAL. Regulations and instructions on an institutional and national level have been continuously changing over the years, with the Ministry of Higher Education and Scientific Research (MOHESR) approving the blended model within 25% of programmes, placing a cap on the amount of online learning that can take place within a HE programme. However, an alliance of three or more Jordanian universities can establish a fully online programme as well. That being said, MOHESR has expressed some constraints regarding quality assurance, including the way exams are conducted, how learning outcomes are measured, and how course funding and cultural perceptions are considered. Challenges in the open education methodology therefore still exist in the academic medium in Jordan, where three main issues are of particular note: the governmental policies instructed by the Ministry of Higher Education and Scientific Research; the alignment of these policies with regulations published by Jordanian accreditation institutes; and the cultural acceptability of open education and distance learning in general. The ideas we present here include applications of technology for domestic online learning, as well as global partnerships that support the development of intercultural competencies through the use of Virtual Collaborative Learning (VCL) or Collaborative Online International Learning (COIL). This paper presents the activities and the findings of our project work to date and provides a snapshot of the JOVITAL project during its delivery.

Keywords— E-LEARNING, JORDAN, JOVITAL, HIGHER EDUCATION, LEARNING TECHNOLOGIES, ONLINE LEARNING, COLLABORATIVE ONLINE INTERNATIONAL LEARNING, TEACHING AND LEARNING

¹ Full List of JOVITAL Partners: Technische Universität Dresden, Coventry University, International School for Social and Business Studies Slovenia, UNIMED, Princess Sumaya University for Technology (PSUT), German Jordanian University (GJU), Tafila Technical University (TTU), Al-Hussein Bin Talal University (AHU), and Jordan University of Science and Technology (JUST).

I. INTRODUCTION
In a world that is increasingly interconnected, interdependent and diverse, engaging in international and intercultural learning and exchange is a key focus for many Higher Education Institutions (HEIs) around the globe [1][2]. Such a trend can be considered in relation to several issues. For example, universities are experiencing exponential growth in their recruitment of international students [3][4][5]; accordingly, online international learning is increasingly becoming a core pillar of university collaborations for globally networked learning [6][7][8]; and open courses such as Massive Open Online Courses (MOOCs) target learners regardless of their geographic and cultural background [9][10][11]. Many countries are experiencing, due to their demographic and socioeconomic context, a massification phenomenon concerning learners accessing higher education (HE). Because of such trends, responsive and effective education processes are required to maintain quality learning [12][13][14]. As an answer to the challenges mentioned above, state-of-the-art education technology may be used in HEIs to encourage learning as well as the recruitment of international students and the inclusion of students belonging to disadvantaged social groups. However, in some countries, restrictions regarding the amount of e-learning within study programmes can be noted. Such limits can also be seen as a rejection of e-learning methodologies as an inferior or lazy option where learning content is merely dumped online with little effort to contextualise the learning or to improve the learner experience.
In this paper, we outline how new and emerging technologies are being innovatively used in institutions around the world and, on this basis, how they are being adapted and implemented for use in Jordan. This includes applications of technology for domestic online learning, as well as global partnerships that develop intercultural competencies through the use of Collaborative Online International Learning (COIL), sometimes also referred to as Virtual Collaborative Learning (VCL). This paper presents the activities and the findings to date of the JOVITAL project in its goal of building the capacity of Jordanian academics in the design and delivery of collaborative online (international) pedagogies. JOVITAL is an international cooperation project co-funded by the Erasmus+ Capacity Building in HE programmes of the European Union during the period October 2017 - 2020, involving four European institutions and five Jordanian universities. The overall project aims to foster academic exchange using virtual mobility in order to develop the capability of academic staff, university students and disadvantaged learners in Jordan. As part of the overview of the JOVITAL project and the technologies used, the paper also includes a presentation of the applications of technology for domestic online learning, as well as global partnerships that develop intercultural competencies through the use of Online International Learning (OIL).

II. STATE-OF-THE-ART TECHNOLOGIES AND E-LEARNING IN HIGHER EDUCATION (HE)
Implementing e-learning can present significant challenges for HEIs. Many institutions now view e-learning as a strategic tool which can be used to boost their reach, reputation, and finances. There is also increased competition to deliver innovative programmes that attract and connect students across the world, which has many implications for HEIs in terms of course design and content. E-learning makes it possible for students to attend a variety of study programmes without even leaving their country whilst enabling students to connect and engage with the wider world. Such approaches to the delivery of study programmes may also be beneficial for vulnerable and disadvantaged groups who would like to study but have little or no access to HE [11].
That being said, there are concerns over the use of technology and online resources in terms of quality control (as evidenced by the 25% cap seen in the Jordanian HE system); a strategic approach to recognition by national governments needs to exist, especially in regions where there is no strategic oversight over the quality of study programmes and HEIs. The study of Calvo-Porral, Lévy-Mangin and Novo-Corti [15], for example, found that the tangibility and empathy dimensions have the most substantial influence on students' perceived quality. The tangibility dimension is associated with facilities and equipment, while the empathy dimension concerns the attitudes of the teaching and administrative staff towards students. Yusoff, McLeay, and Woodruffe-Burton [16] identified 12 aspects that drive student satisfaction and among them emphasised the importance of the student's learning experience and his or her satisfaction with the quality provision of (online) learning, as well as support mechanisms such as textbooks or IT, which all play a crucial role in the perception of quality. Overall, "satisfaction" in the eyes of a student is a complex concept with foundations in the subjective impressions of pedagogy and the context within which the pedagogy is delivered.

III. GAINS AND BENEFITS TO THE STUDENT EXPERIENCE AND CHANGES TO PEDAGOGY
One of the most important benefits of e-learning is the possibility of enabling students to study at a convenient "pace, place and mode" in order to ensure that the quality of teaching and learning is maintained [17]. The mode of delivery can enhance or inhibit this affordance, and adequately designed e-learning programmes allow for learner-centered, flexible approaches to HE education.
A key area for consideration related to e-learning is also the role of the academic facilitator, who can significantly improve students' learning experience. Such facilitators should have the appropriate skills and competencies in the field of learning. Moreover, Buhl, Andreasen, and Pushpanadham [18] suggest e-learning fragments be included in lecturers' traditional roles. They should not be responsible only for "planning, practice, and reflection"; rather, such activities may now be "performed by different actors with different areas of responsibility" [18]. Therefore, many institutions have introduced support for the technical and design areas of e-learning delivery with the emergence of roles such as learning technologists, e-developers, etc. [19]. Also, teachers have to adopt new skills and techniques so they can prepare and engage students to become reflexive learners in e-learning environments. This might be quite a challenge. Nowadays, students may be accustomed to modern technology, but they are not necessarily adept at engaging in transformative learning and lack the kind of digital capital that enables them to be co-creators of their own learning [20][21].

IV. CHALLENGES IN HE IN JORDAN
Teaching experiences delivered throughout the JOVITAL project and a short review of the use of new and emerging technologies in HEIs around the world enabled us to recognise some key challenges which HE in Jordan should be facing.
For example, in Jordan, e-learning has been associated with removing barriers for female learners in remote locations and providing opportunities to upskill the existing workforce [22]. However, the challenges of ensuring high-quality training have been discussed by employers, leading to restrictions, such as the afore-mentioned 25% cap seen in Jordan.
Another unique challenge for Jordan is related to equity in access, as well as the inclusion of Syrian refugees in the region. Although Jordanian institutions want to include the refugees in e-learning, many barriers exist. In 2017, the Open University attempted to deliver online courses to Syrian refugees in Jordan, which was not well received due to the lack of interactivity [23]. The conclusion stemming from this experience shows that the attitudes towards e-learning are different within different
students' communities and that they should be properly addressed. A key challenge is, therefore, to change the mindset from using technology not just as a tool for teaching, but as a platform for education that seeks to engage the learner with activities and opportunities for feedback and discussion. Such change requires a shift from a 'teacher at the front' model of learning to an approach of designing a course together with appropriate pedagogical implementation.
Through the JOVITAL project, training was made an integral part of the study programme delivery, with a variety of methods and approaches to teacher as well as student training. Namely, the HEI has the responsibility to ensure that staff are adequately equipped with competencies to perform their role, and equally, students need to be supported to study online, with the necessary skills of autonomy and self-efficacy. The preparation of the students is a demanding task, not least because pre-university education does not typically prepare them to tackle the new technology challenges. Students of the 21st century need to develop requisite skills (problem-solving, teamwork and communication skills) for the workplace (Warner & Palmer, 2015). E-learning also requires the students to master the communicative and networking tactics to engage in such online learning spaces. Furthermore, the experiences stemming from the delivery of the JOVITAL project show that institutions need to concern themselves seriously with ensuring that assessment practices are appropriate for the e-learning context. Assessment needs to align with the evolution of e-learning. Assessments should also be varied and flexible.

V. STUDENT EXPERIENCES – DRESDEN VIRTUAL LEARNING ENVIRONMENT
In May 2019, over 500 Jordanian students took part in a Virtual Learning Environment trial, led by Technische Universität Dresden, to experience online learning first hand in a 'live' environment. Students from the Jordanian universities, predominantly from engineering courses, undertook tasks and assessments in the VLE, supported by 'e-lectures' and staff guidance on how to learn in an online environment. This was powered by Elgg, an open-source tool that specialises in social and collaborative activities for education, with the team at Dresden creating the virtual environment. The activities took place in closed groups that saw students enrol to undertake activities that were mapped to specific topics and modules. There were also discussion forums per activity group in order to allow staff and students alike to provide feedback on their experiences of the pilot. In some cases, the online activities were directly incorporated into local taught elements of a module; for example, the systems analysis and design online group was specifically incorporated into the teaching and learning activity for a TTU module, with lectures, class exercises, student presentations and lab work taking place alongside the online discussions and activities.
At the time of writing, the online pilot exercise has recently concluded, with a total of 577 student participants and tutors engaging with the VLE. Following this pilot, a summer school is to take place in Dresden to allow for the training of approximately 25 student experts – specialists who will assist the future delivery of the online provision within the VLE in the Jordanian universities. All participants of the initial pilot have been invited to give qualitative feedback through a survey tool developed by Coventry University. The results of the survey are forthcoming (September 2019), but it is intended that data will be available to present at the ICTCS conference in October 2019.

VI. FINAL THOUGHTS
Through presenting and exploring the activities and findings of JOVITAL, this paper seeks to outline the challenges and benefits of e-learning technologies in HE teaching and learning, and how these can be tailored for use within the unique Jordanian context. In addition, it offers insight into a work-in-progress project that is continually developing and adapting to the needs of all stakeholders and participants. This paper argues that online learning, in many forms, is of benefit to students and teachers alike, but utilisation of technologies requires careful planning, tailoring, and training in order to see maximum benefit. As such, it is imperative that time is taken to train teaching staff and to prepare student expectations of online learning in order to gain the maximum benefit e-learning technologies have to offer. It is not merely enough to buy into technology and expect it to do all of the work – changes to approach and implementation are vital to the success of online approaches to pedagogy. In addition to having access to internet-enabled technology, institutions must also have an awareness of e-learning and the software required to support this. Importantly, HEIs must also take the time to train and develop academic staff to fully realise the potential of e-learning in order to achieve strong levels of learning engagement. Beyond this, institutions must also invest in relevant support staff, which might include IT experts, developers and learning technologists. With this in mind, it is not sufficient to simply 'buy in' to the technology; e-learning needs investment in staff, resources and infrastructure to succeed.
In terms of the next stages of the JOVITAL project, the feedback and results from the Dresden pilot testing and the Dresden Summer School will offer valuable insights into e-learning approaches and student engagement. The ICTCS presentation will also invite participants to give their own views and feedback on JOVITAL, which will offer another route for valuable data for the project.
REFERENCES
[1] Krutky, J. (2008). Intercultural competency: Preparing students to be global citizens. Effective Practices for Academic Leaders, 3(1), pp. 1–15.
[2] Altbach, P. G., Reisberg, L., & Rumbley, L. E. (2009). Trends in Global Higher Education: Tracking an Academic Revolution. UNESCO.
[3] Beech, S. (2018). Adapting to change in the higher education system: international student mobility as a migration industry. Journal of Ethnic and Migration Studies, 44 (4), pp. 610–625.
[4] Borjesson, M. (2017). The global space of international students in 2010. Journal of Ethnic and Migration Studies, 43 (8), pp. 1256–1275.
[5] Fleigler, C. M. (2014). Recruiting the World. University Business, 17 (10), pp. 36–41.
[6] Villar-Onrubia, D. & Rajpal, B. (2016). Online International Learning: Internationalising the Curriculum through Virtual Mobility at Coventry University. Perspectives: Policy and Practice in Higher Education, 20 (2-3), pp. 75–82.
[7] Redden, E. (2014). Inside Higher Ed (online), "Teaching with Tech across Borders", July 2014. https://www.insidehighered.com/news/2014/07/09/faculty-use-internet-based-technologies-create-global-learning-opportunities [accessed 12 May 2015].
[8] Bell, S. (2016). Sustainable distance learning for a sustainable world. Open Learning, 31 (1), pp. 1–8.
[9] Maringe, F., & Sing, N. (2014). Teaching large classes in an increasingly internationalizing higher education environment: pedagogical, quality and equity issues. Higher Education, 67 (6), pp. 761–782.
[10] Brahimi, T., & Sarirete, A. (2015). Learning Outside the classroom through MOOCS. Computers in Human Behaviour, 51, pp. 604–609.
[11] Affouneh, S., Wimpenny, K., Ra'fat Ghodieh, A., Abu Alsaud, L., & Abu Obaid, A. (2018). Reflection on MOOC Design in Palestine: A MOOC as a tool for nationality building. The International Review of Research in Open and Distributed Learning. Accessed at: http://www.irrodl.org/index.php/irrodl/article/view/3469/4610
[12] Affouneh, S. J., & Amin Awad Raba, A. (2017). An Emerging Model of E-Learning in Palestine: The Case of An-Najah National University. Creative Education, 8, pp. 189-201.
[13] Foley, A., & Masingila, J. (2014). Building capacity: challenges and opportunities in large classes pedagogy in Sub-Saharan Africa. Higher Education, 67 (6), pp. 797–808.
[14] Dian-Fu, C., & Yeh, C. C. (2012). Teaching Quality after the Massification of Higher Education in Taiwan. Chinese Education and Society, 45 (5/6), pp. 31–44.
[15] Calvo-Porral, C., Lévy-Mangin, J. P. and Novo-Corti, I. (2013). Perceived quality in higher education: an empirical study. Marketing Intelligence & Planning, 31(6), pp. 601-619. https://doi.org/10.1108/MIP-11-2012-0136
[16] Yusoff, M., McLeay, F., & Woodruffe-Burton, H. (2015). Dimensions driving business student satisfaction in higher education. Quality Assurance in Education, 23(1), pp. 86-104. https://doi.org/10.1108/QAE-08-2013-0035
[17] Serdyukov, P. (2015). Does Online Education Need a Special Pedagogy? Journal of Computing and Information Technology - CIT 23, 2015, 1, pp. 61–74. doi:10.2498/cit.1002511
[18] Buhl, M., Andreasen, L. B., & Pushpanadham, K. (2018). Upscaling the number of learners, fragmenting the role of teachers: How do massive open online courses (MOOCs) form new conditions for learning design? International Review of Education, 64(2), pp. 179-195.
[19] Veletsianos, G. (2011). Designing Opportunities for Transformation with Emerging Technologies. Educational Technology, 51 (2), pp. 41-46. Available at https://www.jstor.org/stable/44429917
[20] Warner, T. & Palmer, E. (2015). Personalising learning: Exploring student and teacher perceptions about flexible learning and assessment in a flipped university course. Computers & Education, 88, pp. 354-369.
[21] Sadeghi, S. H. (2018). E-Learning Practice in Higher Education: A Mixed-Method Comparative Analysis.
[22] Al-Rashdan, A.-F. A. (2009). Higher Education in The Arab World: Hopes and Challenges. In Mohamed Elmenshawy (Ed.), New Chapter of Political Islam. Arab Insight, Volume 2 (6), ISSN 1936-8984, pp. 77–90.
[23] Bothwell, E. (2017). 'Online higher education "unappealing" for Syrian refugees'. Times Higher Education [online]. Accessed at: https://www.timeshighereducation.com/news/online-higher-education-unappealing-syrian-refugees#survey-answer

The relation between Individual Student Behaviours in Video Presentation and their Modalities using VARK and PAEI Results

Ahmed Fekry
Computer Science Department, National Egyptian e-Learning University, Cairo, Egypt
afekrymohamed@eelu.edu.eg

Manal Ismail
Computer Science Department, National Egyptian e-Learning University, Cairo, Egypt
mismaeel@eelu.edu.eg

Georgios Dafoulas
Computer Science Department, Middlesex University, London, UK
g.dafoulas@mdx.ac.uk

Abstract—This research paper aims to investigate the relationship between students' personality characteristics, using well-recognized models, and their behaviours and activities in video content presentation. This paper is part of a research study focusing on video tagging methods for analysing the behaviour of team members. The authors analysed videos of student group presentations and a data set of student personality tests of the same student cohort, identifying their characteristics. By finding a relation between the two we can better support students after assessing video content. The study aims at pursuing associations between human behaviour and personal modalities. This practice can be very supportive in student assessment and career coaching. The work carried out was based on quantitative research methods for analysing the videos and combining them with two personality models: the VARK (Visual, Aural, Read/write, and Kinaesthetic) modality preferences test and the PAEI (Producer, Administrator, Entrepreneur, Integrator) methodology, which give us information about student modality and leadership preferences. We found the average behaviour occurrence and the average presentation duration for each VARK profile and PAEI role. We conclude from our results that students with an Administrator role in PAEI and a Multimodal or Aural style in the VARK model are the most self-talking, while the highest average value for eye focus behaviour is for students with a Producer role in PAEI and a Visual style in the VARK model, and the largest average speech loudness is for students with an Administrator role in PAEI and a Visual style in the VARK model.

Keywords— human behaviour, student presentation, video tagging, video content, learning modality, VARK, PAEI, personal modality, recommendation system.

I. INTRODUCTION
Recent progress in information capture, storage, and communication technologies has increased accessibility to video data. Collaboration with mixed media information, and video in particular, requires more than interfacing with data banks. The recommended approach is to index video information and transform it into organised media. [1] Appropriate annotation of mass video data is very important for traditional text-based search engines to retrieve semantic data. Hence, video annotation has been recognised as a valuable research area. [2]
There is now a mandatory demand for audio-visual or multimedia contents in various fields. [3] In our research, we focus more on observing human behaviours that happen in video content, and on how video tagging techniques can support understanding what is going on in video content.
Previous work has focused on metadata to reach the content. [4] Metadata should describe the video content in a generic method to support indexing and searching, but in our research we use a different way to tag video content and describe human behaviours. This takes place by observing specific behaviours and activities to find a relation which can help in building a future model that can judge video content and generate a point system for specific behaviours. Such a system can be used for both individual behaviour and the behaviours of several group members.
Typical approaches to video annotation include video structure analysis, object discovery and event classification. Throughout the past decade, these approaches have advanced from the use of handcrafted highlights [5] to feature learning techniques. Recent research claims that deep learning can achieve great accuracy in video annotation applications. [6]
To understand human body activities in videos, we need to define human gestures. Gestures can originate from body movements like walking, bending, jumping, and hand waving. While a video is playing, human action detection is not easy to achieve. This problem exists due to variations in the motion appearance of actions, camera angles, movement in the background and any surrounding noise. The objective of a similar application is to detect different gestures in multimedia clips by pre-processing the video and then applying an algorithm for detecting various actions. [7]
Video tagging or concept detection is emphatically related to tasks like scene recognition and object recognition [8]. In our research we focus on the behaviours of humans while making a group presentation, in order to investigate the relations between individuals' behaviours and their personality. We believe that this investigation will help in building a model that gives an automatic rating for students, provides a good understanding of student activities and common behaviour, and summarises video content and extracts important data. Furthermore, this model would help and support building algorithms for an automatic judging model or a recommendation system from these findings. This paper is part of our research in video tagging methods for analysing group member behaviour. This research also involves the exploration of relations in student behaviour patterns and individual characteristics, which is published elsewhere. The focus of this paper is on investigating the relation between an individual's behaviours and certain personality types.
While many teachers around the world and pioneers are calling for students to create 21st-century competencies


through student-centred, hands-on learning, most school frameworks are still using conventional forms of instruction. This reliance on conventional forms of instruction remains troublesome, and often unsatisfying. The issue is further complicated by the large emphasis placed on students demonstrating their knowledge through standardised exams. [9] Therefore, in our research we aimed at investigating group presentations as a method of assessment.
There are different models for determining an individual's learning style, including the Dunn & Dunn model, the Kolb model, the Witkin and Goodenough model, and the VARK model [10]. In our research, we consider the VARK model. VARK (Visual, Aural, Read/Write, and Kinaesthetic) presents the modalities that are used to study information. The VARK model is one of the simplest and at the same time most popular learning styles models. Depending on the way that we perceive information, it uses four sensory modes. Therefore, this model presents four ways that can affect students who are involved in the educational process. [10]
This model contains four styles: (i) visual, (ii) learning by listening to information, (iii) reading/writing and (iv) practical implementation. Students who learn through a visual medium prefer using graphs, diagrams, photos, videos, illustrated textbooks, and flipcharts. They like to think in pictures and may learn best from visual displays. Visual learners are less fulfilled with a presentation where they cannot take point-by-point notes. Some visual learners will indeed take notes when they have printed materials on the work area. [10] This is an important input, as we already consider reading from notes during the presentation to be one of the observed behaviours. Students who learn by listening prefer verbal sessions and materials delivered through discussions and debates. These learners always try to read the text and notes aloud and listen to recorded clips and information from books. They usually want to discuss information for a better understanding. Students who prefer reading presentations and publications like to write some notes; they are similar to visual learners as they also like reading, writing, and drawing. Kinaesthetic learners learn best by doing, through a hands-on approach. These learners like an interactive manner of learning and making tests to learn something first. [10]
VARK characterises learning preferences as "an individual's characteristics and favoured ways of gathering, organising, and thinking about information". The VARK has been broadly utilised for prompting students about their learning preferences. [11] While all the previous types focus on how a student receives knowledge and how he prefers to learn, in this paper we investigate whether those styles affect how students present knowledge in a group or not, and what common behaviours occur among people with the same learning modality.
We used another model, called PAEI, introduced by Dr. Ichak Adizes, describing four important roles that together make up a successful team. Adizes is an expert in management and leadership. [12] This model discusses the characteristics of a successful manager and the effects of mismanagement factors. Adizes concluded that managers and leaders are divided into four categories: Producer (P), Administrator (A), Entrepreneur (E) and Integrator (I), and this was called the PAEI pattern. [13] Also, there is a combined type called "multi-role", which means that users have an equal modality between more than one role.
We combined videos from student group presentations, with all observed behaviours, and the results of student personality tests (VARK, PAEI) identifying their characteristics, as input before the analysis process.

II. LITERATURE REVIEW & RELATED WORK
Analysis of video content has been investigated in different research areas, and there have been many attempts to make automatic analysis of video content. We discuss the rationale for this work next.
Information indexing and retrieval in multimedia content are required to describe multimedia information and to assist people in searching for multimedia resources quickly and efficiently. Videos have the following characteristics:
1) Much richer content than images.
2) Huge amount of raw data.
3) Very little prior structure.
These characteristics make the indexing and retrieval of videos a difficult task. In the past, video databases were relatively small, and indexing and retrieval depended on keywords annotated manually [6]. Some research works tackle the area of video annotation from the perspective of the device resource. [2]
Several research studies focus on analysis in the field of sports, such as "Human action recognition-based video summarization for RGB-D personal sports video", which discusses automatic sports video summarisation. The authors propose a personal sports video summarisation method for self-recorded RGB-D videos [14]. This research relates to our work because we also focus on human behaviours and actions during the video in order to assess building a model for automatic recognition of student behaviour.
A very interesting piece of work titled "Real-time system for human activity analysis" focuses on the analysis of human activities by proposing a real-time human activity analysis system, where a user's activity can be quantitatively evaluated relative to a ground truth recording. The authors use two Kinects to solve the problem of self-occlusion by extracting optimal joint positions. In our research, however, we do not need to observe complicated human body movement. [15]
Recent research from 2018 shows an interest in analysing human gestures in video and states that detecting human action or gesture automatically is difficult, due to extensive variations in the motion appearance of actions, camera angles with respect to the human body, motion in the background, noise and large amounts of video data. [7]
However, most of the researchers working in the area of personal modalities or learning preferences focus more on studying different learning modalities and how they affect the way the learner receives knowledge and his/her academic progress. Some papers discuss designing a specific learning platform, such as an adaptive e-learning platform that uses VARK to support an object orientation course. [16] Most of the current research focuses on using learning styles to enhance the student learning process and increase learners' motivation and understanding, whether in traditional or online learning. Some research focuses more on comparing different learning style models.
Analysing human behaviour during group presentations requires more investigation of the relations between human behaviours during a presentation and the presenters' modalities or learning preferences. This is the focus of our research, as we need to develop a model that can provide a good understanding of human behaviours.

III. RESEARCH METHOD

A. Research Question

In our research, we need to answer the following question: "Is there any relation between human behaviours in video content during presentations and the presenters' learning preferences (VARK) or leadership modality (PAEI)?"

From a different perspective, we need to know whether the learning preferences or leadership modality given by VARK and PAEI affect the behaviour of a student during his or her presentation.

To answer this question, we listed a group of behaviours that can be observed in the videos and used in the analysis. After manually observing these behaviours for a group of students, we try to identify the behaviours common to people with the same preferences, in order to explore any clear relation between them and to combine the results of both analyses. The behaviours that can be observed from video presentations are listed below:

• Non-Verbal Behaviour
  o Body Movement
  o Body Pose
  o Face Expression
  o Eye Contact
  o Self-talking
  o Pause while talking
• Verbal Behaviour
  o Speech Loudness
  o Speech Pace

We recorded videos of the students' group presentations and asked the students to complete an online questionnaire for both VARK and PAEI, in order to obtain their learning and leadership preferences and identify their characteristics. In our research we used a quantitative method: we observed the presentation videos manually, counted the occurrences of the observed behaviours for each presenter, and recorded them for the analysis process. Behaviour details are described in Table 3. We also collected the following information from the video itself to support the analysis process:

• Order of presentation.
• Number of appearances per member.
• Gender of presenter.
• Individual presentation duration.
• Presentation duration.
• Count of group members.
• Start time.
• End time.
• Duration of video.

B. Research Question

We used each behaviour as a tag (node) to collect and record the occurrences of that behaviour; Table 3 describes each behaviour and how it is observed.

C. Data Set

We recorded videos of a group of final-year students at Middlesex University while they presented their projects in a group presentation. The recordings can be classified into:

• Final presentation (snapshot shown in figure 1)
• Brief presentation (snapshot shown in figure 2).

Table 1 shows the video information. We started by observing the occurrences of the non-verbal behaviours manually. Regarding verbal behaviours, we transcribed the speech into text to calculate wpm (words per minute) as a measure of speech pace, dividing the number of words by the speech duration in minutes, as shown in figure 3 (a short worked sketch of this calculation is given after Table I below). Speech volume is measured by loading the waveform of each presenter's speech into audio analysis software that measures the speech loudness in decibels; we then convert the decibels into a magnitude in order to compare all voice volumes, as in figure 4.

TABLE I. VIDEO INFORMATION

Data Source        | Number of Clips | Total Duration | Average Duration | Academic Year
Final presentation | 14 | 2.7 Hours  | 11 Minutes  | 2016/2017
Brief presentation | 15 | 47 Minutes | 3 Minutes   | 2016/2017
Final presentation | 14 | 2.9 Hours  | 12 Minutes  | 2017/2018
Brief presentation | 12 | 32 Minutes | 2.5 Minutes | 2017/2018
Final presentation | 12 | 2.6 Hours  | 13 Minutes  | 2018/2019
Brief presentation | 11 | 50 Minutes | 4.5 Minutes | 2018/2019
Total              | 78 | 10.3 Hours | 8 Minutes   |

Fig. 1 Video snapshot (Final presentation)
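The speech-pace measure described above reduces to a words-per-minute ratio plus a banding step. The following minimal Python sketch illustrates one way to compute it; the function names are illustrative and not taken from the authors' tooling, and the banding thresholds are the speech-rate bands reported later in Table III, following [17].

    def words_per_minute(transcript: str, speech_duration_seconds: float) -> float:
        # Speech pace = number of transcribed words divided by speech duration in minutes.
        words = len(transcript.split())
        return words / (speech_duration_seconds / 60.0)

    def pace_band(wpm: float) -> str:
        # Bands as used in Table III (after Tauroza and Allison [17]):
        # Slow < 150 wpm, Moderate 150-190 wpm, Fast > 190 wpm.
        if wpm < 150:
            return "Slow"
        if wpm <= 190:
            return "Moderate"
        return "Fast"

    # Example: a 3-minute talk transcribed to 540 words gives 180 wpm, i.e. "Moderate".
    example_wpm = words_per_minute("word " * 540, 180.0)
    print(round(example_wpm), pace_band(example_wpm))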

Fig. 2 Video snapshot (Brief presentation)

TABLE II. SURVEY SAMPLE

Member Name | VARK Type | PAEI Role
#1  | Aural  | Producer
#2  | Aural  | Multiroles
#3  | Aural  | Integrator
#4  | Visual | Entrepreneur
#5  | Aural  | Producer
#6  | Aural  | Multiroles
#7  | Visual | Integrator
#8  | Visual | Entrepreneur
#9  | Visual | Producer
#10 | Visual | Producer
#11 | Visual | Integrator
#12 | Visual | Producer
#13 | Visual | Producer
#14 | Visual | Producer

After finishing the observation of 78 distinct video files containing more than 10 hours of presentations by 41 groups of students, we obtained results for 329 different presentations, as in Table 4 (observation sample). We then calculated the percentage of each behaviour's occurrence against the presentation duration, to normalise all the presenters' behaviours and make the values comparable. In this way we also obtain a second data set of behaviours expressed as percentages, related to the students' preferences in the VARK and PAEI models. We asked the students to fill in an online questionnaire for both models; from the students' answers to both surveys we could determine their learning style and management style, as shown in Table 2 (survey sample). Then, using data analysis, we started to visualise and investigate the relation between the survey results and the student behaviours in order to answer our research question.

Fig. 3 Speech Transcript

D. Calculations

After observing the behaviours in the videos, we created the following calculated fields from our observations:

• Stability duration = presentation duration - movement duration
• Eye focus duration = presentation duration - eye focus loss duration
• Self-talking duration = presentation duration - read-from-slide-or-note duration
• Speech loudness: speech volume is measured in decibels as loudness units relative to full scale (LUFS),

    LUFS(dB) = 20 log(amplitude)                (1)

  where LUFS denotes loudness units relative to full scale in decibels.
• Behaviour percentage: each behaviour is normalised against the presentation duration,

    B% = (Tb / Tt) × 100                        (2)

  where B% is the behaviour percentage, Tt is the total presentation time of the presenter, and Tb is the time for which the behaviour occurred during the presentation.

To calculate the amplitude (loudness) of the speech from the measured decibel value, we invert (1):

    amplitude = 10^(LUFS(dB)/20)                (3)

(A short numerical sketch of these conversions is given after the Findings overview below.)

IV. FINDINGS

By combining our video observations of the students' behaviours with the survey results (VARK, PAEI), we started to identify which type of modality or preference has the highest behaviour occurrence. We obtained the results shown in figure 5, which shows the PAEI role with the longest appearance duration. We then drew a column chart for the eye focus, stability and speech loudness behaviours, sorted the roles from the highest score downwards, and obtained the results shown in figure 7.

We repeated the same process for the VARK model and compared the average appearance duration for the VARK styles; the results are shown in figure 6. Regarding the student behaviours (eye focus, stability and speech loudness), we sorted them by the highest score and obtained the results shown in figure 8.

We also combined VARK and PAEI to investigate which combinations of styles occur more often than others and which combinations happen rarely or not at all. Our findings are shown in figure 9.
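The derived fields above are simple arithmetic over the observed durations and the measured decibel values. The following Python sketch illustrates equations (1)-(3) and the normalisation of a behaviour against the presentation duration; the function names are illustrative, not taken from the authors' implementation.

    import math

    def loudness_db(amplitude: float) -> float:
        # Equation (1): LUFS(dB) = 20 * log10(amplitude).
        return 20.0 * math.log10(amplitude)

    def amplitude_from_db(lufs_db: float) -> float:
        # Equation (3): amplitude = 10 ** (LUFS(dB) / 20), the inverse of (1).
        return 10.0 ** (lufs_db / 20.0)

    def behaviour_percentage(behaviour_seconds: float, presentation_seconds: float) -> float:
        # Equation (2): share of the presentation during which the behaviour occurred.
        return 100.0 * behaviour_seconds / presentation_seconds

    # Example: a measured loudness of -24 dB corresponds to an amplitude of about 0.063,
    # and 30 seconds of eye-focus loss in a 10-minute slot is a 5 % behaviour share.
    print(round(amplitude_from_db(-24.0), 3), behaviour_percentage(30, 600))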

V. CONCLUSION

From the results in the findings section, we can conclude that students with an Administrator role have the longest presentation duration, while students with an Integrator role have the shortest. With respect to VARK styles, the longest presentation duration is associated with students classified as Kinaesthetic and the shortest with students classified as Aural. Regarding the occurrence of behaviours, we obtained the following important indicators:

• The most self-talking students are those with an Administrator role in the PAEI roles and a Multimodal style in the VARK model.
• The highest average value for the eye focus behaviour is found for students with a Producer role in the PAEI roles and a Visual style in the VARK model.
• The largest speech loudness average is found for students with an Administrator role in the PAEI roles and a Visual style in the VARK model.

From figure 9, we found that the most frequent combination pattern between VARK and PAEI is (Producer & Visual). We also found that the following combinations did not appear at all in our research:

• Read\Write & Integrator
• Read\Write & Multirole
• Multimodal & Multirole

Fig. 4 Speech Loudness
Fig. 5 Appearance in video & PAEI roles
Fig. 6 Appearance in video & VARK style

VI. REFERENCES

[1] N. Dimitrova, H.-J. Zhang, B. Shahraray, I. Sezan, T. Huang and A. Zakhor, "Applications of video-content analysis and retrieval," IEEE MultiMedia, vol. 9, no. 3, p. 14, 2002.
[2] Y. Mallawarachchi, K. Ashangani, K. U. Wickramasinghe and D. W. De Silva, "Semantic Video Search by Automatic Video Annotation using TensorFlow," in Manufacturing & Industrial Engineering Symposium 2016, Colombo, 2016.
[3] Y. Nakamura, M. Ozeki and Y. Ohta, "Human Behavior Recognition for an Intelligent Video Production System," in Advances in Multimedia Information Processing, Third IEEE Pacific Rim Conference on Multimedia, Hsinchu, Taiwan, 2002.
[4] M. Sanderson, J. S. Pedro and S. Siersdorfer, "Automatic Video Tagging using Content Redundancy," in The 32nd Annual ACM SIGIR, Boston, Massachusetts, USA, 2009.
[5] L. Shao, "Generic Feature Extraction for Image/Video Analysis," in IEEE International Symposium on Consumer Electronics, Petersburg, Russia, 2006.
[6] W. Hu, N. Xie, L. Li and X. Zeng, "A Survey on Visual Content-Based Video Indexing and Retrieval," IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), vol. 41, no. 6, p. 23, 2011.
[7] T. J. Saleem and S. Mushtaq, "Human Gesture Analysis Based on Video," International Journal of Advanced Research in Computer Science, vol. 9, no. 1, p. 5, 2018.
[8] T. Breuel and R. Paredes, "Fast Discriminative Linear Models for Scalable Video Tagging," in International Conference on Machine Learning and Applications, Miami, Florida, USA, 2009.
[9] M. Worsley and P. Blikstein, "Towards the development of learning analytics: Student speech as an automatic and natural form of assessment," Annual Meeting of the American Education Research Association (AERA), p. 22, 2010.
[10] M. Bande, A. Stojanova, N. Stojkovikj, M. Kocaleva, B. Zlatanovska and C. Martinovska-Bande, "Application of VARK learning model on "Data structures and algorithms" course," in IEEE Global Engineering Education Conference (EDUCON), Athens, Greece, 2017.
[11] D. J. Lamb, D. Al-Jumeily, A. J. Hussain and M. Alghamdi, "Assessing the Impact of Web-Based Technology on," in Sixth International Conference on Developments in eSystems Engineering, Abu Dhabi, United Arab Emirates, 2013.
[12] I. Adizes, "Organizational passages—Diagnosing and treating lifecycle problems of organizations," Organizational Dynamics, a division of American Management Associations, vol. 8, no. 1, p. 23, 1979.
[13] H. Shiva and T. Hassan, "Study of Conflict between Adizes's Leadership Styles and Glasser's Five Basic Needs," Mediterranean Journal of Social Sciences, vol. 7, no. 3, p. 8, 2016.
[14] N. Yokoya, A. Tejero-de-Pablos and Y. Nakashima, "Human action recognition-based video summarization for RGB-D personal sports video," in IEEE International Conference on Multimedia and Expo, Seattle, WA, USA, 2016.
[15] L. Guan, N. Khan and R. Tan, "Real-Time System for Human Activity Analysis," in IEEE International Symposium on Multimedia, 2017.
[16] R. M. Silva, C. C. Figueroa, T. P. Rubilar and F. S. Díaz, "An Adaptive E-Learning Platform with VARK Learning Styles to Support the Learning of Object Orientation," in IEEE World Engineering Education Conference, Buenos Aires, Argentina, 2018.
[17] S. Tauroza and D. Allison, "Speech Rates in British English," Applied Linguistics, vol. 11, no. 1, pp. 90-105, 1990.

TABLE III. BEHAVIOUR DESCRIPTIONS

Behaviour | Values | Description | Note

Body Movement | Stable | Default behaviour: the presenter is stable and in a standing position. | Assumed as default; calculated automatically by subtracting the movement duration from the total presentation duration.
Body Movement | Moving | The presenter starts to move his body by changing the position of his legs on the ground. | Counted manually; each occurrence is assumed to take 1 second.
Body Pose | Front | Default behaviour: the presenter faces the camera or audience with his body. | Assumed as default; calculated automatically by subtracting the moving duration from the total presentation duration.
Body Pose | Side | The presenter turns his body away from the camera or the panel so that one of his shoulders is not shown. | Counted manually when the body is in a side position; turning to check the projector is neglected, as this is already covered by the reading mode.
Face Expression | Normal | Default behaviour: the presenter shows a normal facial expression. | Assumed as default; calculated automatically by subtracting the smile duration from the total presentation duration.
Face Expression | Happy (Smile) | The presenter shows a positive expression such as happiness, smiling or relaxing (assuming he is not looking at his team for presentation purposes). | Counted manually by number of occurrences; each occurrence is assumed to take half a second.
Eye Contact | In | Default behaviour: the presenter is looking at the camera. | Assumed as default; calculated automatically by subtracting the out-of-camera focus duration from the total presentation duration.
Eye Contact | Out | The presenter looks away from the camera while presenting (assuming he is not looking at his team for presentation purposes and is not reading a slide from paper or notes). | Counted manually; each occurrence is assumed to take 1 second.
Reading Method | Self-Talking | Default behaviour: the presenter speaks freely without any external support. | Assumed as default; calculated automatically by subtracting the reading-from-note-and-projector duration from the total presentation duration.
Reading Method | Note/Projector | The presenter reads from a note in his hand or from the projector. | Duration recorded manually for each behaviour occurrence.
Pauses while Presentation | Pauses_Count | The student pauses during the presentation; a pause is counted when he stops talking for more than 3 seconds without any interruption from the team or supervisor. | Counted manually during the observation process.
Speech Loudness | Level | Measured using audio analysis software. | Expressed as a percentage after converting decibels into a magnitude; the higher the value, the clearer and more understandable the voice.
Speech Rate (Pace) | Fast / Moderate / Slow | Fast: > 190 wpm (words per minute); Moderate: 150-190 wpm (a comfortable pace); Slow: < 150 wpm [17]. | Obtained by transcribing the voice into text and calculating the number of words per minute.
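The notes in Table III describe a simple estimation scheme: default states are derived by subtraction, and manually counted occurrences are converted into durations using assumed per-occurrence lengths (1 second for movements and eye-focus losses, half a second for smiles). The Python sketch below illustrates that scheme; the constants come straight from the table notes, while the function and field names are illustrative.

    def estimate_durations(presentation_s: float, movement_count: int,
                           eye_out_count: int, smile_count: int,
                           reading_s: float) -> dict:
        # Convert occurrence counts into durations using the Table III assumptions:
        # each movement or eye-focus-loss occurrence ~ 1 s, each smile ~ 0.5 s.
        movement_s = movement_count * 1.0
        eye_out_s = eye_out_count * 1.0
        smile_s = smile_count * 0.5
        # Default states are obtained by subtracting the observed behaviour from the total.
        return {
            "stability_s": presentation_s - movement_s,
            "eye_focus_s": presentation_s - eye_out_s,
            "normal_expression_s": presentation_s - smile_s,
            "self_talking_s": presentation_s - reading_s,
        }

    # Example: a 10-minute slot with 3 movements, 6 eye-focus losses, 4 smiles
    # and 80 s of reading from notes or the projector.
    print(estimate_durations(600, 3, 6, 4, 80))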

TABLE IV. OBSERVATION SAMPLE

Movement # | Hand Gestures | Eye Focus (Out) # | Reading from Note # | Speech Loudness (dB) | Speech Pace
0 | Moderate | 14 | 10  | -31.4  | Moderate
3 | Good     | 2  | 12  | -21.29 | Moderate
0 | Bad      | 6  | 24  | -24.56 | Moderate
0 | Moderate | 6  | 80  | -22.97 | Moderate
0 | Bad      | 1  | 103 | -27.58 | Moderate
0 | Good     | 1  | 16  | -26.96 | Moderate
0 | Good     | 0  | 50  | -21.7  | Moderate
0 | Bad      | 3  | 57  | -28.9  | Moderate
0 | Moderate | 1  | 33  | -25.91 | Moderate
0 | Good     | 1  | 17  | -22.45 | Moderate
0 | Good     | 0  | 37  | -22.77 | Moderate
1 | Good     | 0  | 35  | -23.79 | Moderate
3 | Good     | 3  | 7   | -24.02 | Moderate
5 | Moderate | 2  | 3   | -25.49 | Moderate
2 | Moderate | 1  | 96  | -22.96 | Moderate
2 | Moderate | 2  | 28  | -21.83 | Moderate
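The findings in Section IV are produced by joining per-presenter behaviour measures with the VARK/PAEI survey answers and averaging them per style or role. The following pandas sketch illustrates that aggregation step; the column names, the member key and the numeric values are assumptions for illustration, since the published observation sample does not show the join key (the member-to-style mapping shown matches Table II).

    import pandas as pd

    # Illustrative fragments of the two data sets (survey sample and normalised observations).
    survey = pd.DataFrame({
        "member": ["#1", "#2", "#4"],
        "vark": ["Aural", "Aural", "Visual"],
        "paei": ["Producer", "Multiroles", "Entrepreneur"],
    })
    observations = pd.DataFrame({
        "member": ["#1", "#2", "#4"],          # join key assumed for illustration
        "eye_focus_pct": [92.0, 85.5, 78.0],   # illustrative values, not measured data
        "stability_pct": [99.0, 97.5, 95.0],
    })

    # Join observations with survey answers, then average each behaviour per PAEI role
    # and per VARK style, as done for the comparisons behind figures 5-8.
    merged = observations.merge(survey, on="member")
    print(merged.groupby("paei")[["eye_focus_pct", "stability_pct"]].mean())
    print(merged.groupby("vark")[["eye_focus_pct", "stability_pct"]].mean())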

Fig. 7 PAEI roles Vs. Behaviours

An overview of Digital Forensics Education
Georgios A. Dafoulas David Neilson
Computer Science Department Computer Science Department
Middlesex University Middlesex University
London, United Kingdom London, United Kingdom
g.dafoulas@mdx.ac.uk g.dafoulas@mdx.ac.uk

Abstract—This paper follows an initial review conducted as • An analysis of modules taught in undergraduate
part of an EU-funded Erasmus+ project under the programme programmes.
of Capacity Building in Higher Education. The original paper
focused on providing the state of the art in undergraduate • A discussion on learning outcomes at programme
computer forensic programmes [22]. This work is part of a and module level.
study towards the EU funded Pathway in Forensic Computing
(FORC) project. FORC aims to address the challenges in • A review of state-of-the-art tools and techniques
information society development concerned Cyber Security used in such programmes.
and privacy in a world oriented towards e-technologies. The This second paper extends the study further, as it
project meets the regional needs of the Middle East area by emphasises the fundamental concepts and key topics that are
responding to the current and emerging cyber security threats common in postgraduate curricula. The additional study also
by educating the IT and Legal professionals in the field of e-
provides the means to understand how the sector has seen the
crime, thus supporting development of e-based economics, life
and society in partner countries. The work is funded under
development of online programmes and the skillsets that
project reference number 574063-EPP-1-2016-1-IT-EPPKA2- seem to be viewed as critical for recruitment in the relevant
CBHE-JP, Grant Agreement 2016 – 2556 / 001 – 001. In this industry.
paper we focus on the second work package of the project
(WP-2) aiming to ‘establish a forensic computing pathway’ and II. LITERATURE REVIEW
the first task for this work package aiming at ‘defining
pathway objectives, learning outcomes, and career
There seems to be a limited source of relevant works in
perspective’. In this second paper we focus more on the field, as there are not that many researchers investigating
postgraduate programmes, online provision, and career the undergraduate and postgraduate curricula in digital
pathways. We also consider the availability of programmes in forensics. Anderson et al [1] provide a brief comparison
the US, while emphasis is given more on digital forensic between the British and German models of programme
education. structure with emphasis on teaching forensics at University
level. The authors conclude that the British curriculum
Keywords—Digital Forensics Education, Computer design approach appears to be the more mature of the two.
Forensics, Emerging trends in computer forensics, Computer This has affected this investigation as it focused more on UK
Science Education, Curriculum Development, Curriculum provision of Computer Forensics programmes.
Design
Bashir and Campbell [2] correctly identify that “digital
forensics education curriculum needs to be developed by
I. INTRODUCTION taking into consideration the need for students to be aware of
The aim of this paper is to provide a review of the current the multiplicity of field specializations”. In their efforts to
provision in digital forensics. The study is based on a design a Computer Forensics curriculum for a US institution
preliminary state of the art review of current best practice in they identified as major challenges providing students “with
the field of Computer Forensics, mainly at an undergraduate the greatest depth of knowledge of a particular aspect of a
level. The original work attempted to identify the learning field that encompasses a wide range of technical topics.
outcomes of programmes specialising in computer forensics, According to Conklin et al [3] a shift to Knowledge Unit
with or without a cyber security specialism. The scope of the (KU) based cyber security education is beginning. In their
study is to provide a useful reference point for colleagues paper they illustrate the relationship of training to education
who are in the process of curriculum development in digital identifying two-year associate degrees preparing technicians,
forensics. network operators and system administrators, four-year
bachelor’s degrees producing analysts and engineers, and
The work focused on identifying programmes of study in master’s degrees focusing on risk management and
the field of Computer Forensics in UK, US and EU and management specialists.
review the state of affairs in curriculum design and
development in the field. The following steps were followed: Cooper et al [4] provide a series of illustrations that help
visualising the relationship between digital forensics and
• A literature review of curriculum design and other computing principles. They mention the ACM/IEEE
development practices. Joint Task Force for Computing Curricula in their effort to
• An investigation in Forensic Computing map the domain. They conclude with a number of areas
programmes in UK, EU and US. where greater emphasis is needed for digital forensics as
follows: (i) networking, (ii) information security, (iii)
• An analysis of the programme structure for Forensic systems administration, (iv) electronics, (v) mathematics and
Computing programmes. statistics, (vi) ethics, (vii) criminology, (viii) forensics
science and (ix) law and legal issues. As early as 2001, the
DFRWS report [5] identified a number of areas “as valid



candidates for applicable specialization” in digital forensic consist of four major areas including (i) Computer Science
science, including Data Mining, Languages/Linguistics, and Foundations, (ii) Procedures, Methods and Practice, (iii)
Logic, Statistics and Probability, Signal Processing, Image Legal System and Law, and (iv) Computer Forensics.
Analysis, Encryption, Evidence Preservation, Network Nickson and Venter (2014) provide an interesting
Engineering. According to Gorgone et al [6] the suggested perspective in the field, as in their paper they present "a
courses necessary for a career path in Computer Forensics novel contribution in the digital forensics domain by means
should include Criminal Law (or Criminal Justice), of a guiding ontological model that indicates the placement
Information Assurance & Security, Computer Forensics and of the different digital forensic disciplines and sub-
Network Forensics. disciplines within the domain. The ontology also allows for
the addition of new digital forensic disciplines and sub-
Gottschalk and Liu [7] in their preliminary survey of disciplines, including potential modifications in any one of
computer forensics programs in North America identified a the aforementioned categories”. Sabeil et al [17] analyse the
number of factors for consideration before starting a program Cyber Forensics training programmes in terms of
in Computer Forensics including curriculum design, existing, Competency-Based Framework, proving that “Cyber
programs, faculty, students, facilities, and budget. They Forensics training or education has improper Competency-
suggest “dedicated computer labs are essential for the study Based Framework”.
of computer forensics. Because investigations may involve
unbroken systems, corrupted systems, or physically damaged According to Rowe et al [16] there is a shortage of
systems, computer labs must provide access to multiple “approximately 20,000-30,000 qualified cyber-security
computers as well as expensive equipment for recovering specialists in the US Public Sector alone despite being one of
data from physically damaged or otherwise corrupted the best financially compensated technology-related
systems”. According to Hawthorne and Shumba [9], who domains”. They have suggested a framework for reviewing
discussed their experiences with a “virtual lab for teaching IT provision so cyber-security becomes part of an
digital forensics and cyber investigation online as well as institution’s provision by encouraging IT programmes to:
feedback from distance education students enrolled in the
masters degree program”, lab exercises “ appear to be a very • Verify that they include a pervasive up-to-date
effective method for teaching digital forensics”. Interestingly security element throughout their curriculum.
enough their students reported that the commercial digital • Familiarise students with the terminology of cyber
forensic toolkit Encase from Guidance Software “is a very security.
powerful with too many features, making it less user
friendly”. The authors also investigated five open source • Evaluate their current advanced content in cyber
toolkits and three remote virtual labs. security related topics and where possible, teach
such content in a cyber-security context.
Similarly, Lang et al [10] share their experiences with the
development of an undergraduate programme in digital • Where possible, introduce an advanced cyber-
forensics, by identifying a number of key challenges, security emphasis based on the Prepare, Defend, Act
including: model.
• Balancing training and education, this is why the The FORC project attempts to address these issues by
proposed FORC programme considers practical introducing the Computer Forensics pathway in IT
sessions in the programme modules. programme of partner institutions. Srinivasan [18] shares a
sample undergraduate curriculum, resources needed to
• Lack of an adequate textbook on digital forensics, develop an inexpensive digital forensics lab, and steps to
this is why the FORC programme suggests the integrate this course in the Information Security curriculum,
development of textbooks for each of the eight an approach that is suitable for the scope of the FORC
proposed modules. project. Tu et al [19] claim “there is evidence to suggest that
• Finding qualified faculty, this is why the FORC students can benefit professionally from information
project provides certain training visits for partner assurance skills and knowledge when undertaking network
institution staff. forensics incidents”. The authors also recommend integrating
“a large portion of the business management and business
• Lab setup, this is why the FORC project will include information systems component into the digital forensics
on site visits for setting up appropriate infrastructure. program design, since fraud and other whitecollar crimes are
significant threats to businesses”.
• Selecting appropriate prerequisites, this is why the
current report investigates also admission criteria to An illustration of the career path for digital forensics
the programme. practitioners and a list of academic qualifications,
professional traits and technical skills are Technical Working
• Lack of widely accepted curriculum standards, this is Group for Education and Training in Digital Forensics
why the investigation on UK, EU and US (TWGETDF) [20]. These should be part of the FORC
programmes has taken place. programme documentation and adapted to fit the needs of the
Liu [11] provides yet another perspective on how a local markets.
Computer Forensics programme should be structured. The Woods et al [21] describe in detail how to use realistic
proposed classification of topics includes four key areas, forensic datasets to support digital forensics and security
namely (i) operating systems and networks, (ii) computer education. In particular they explain the use of a multi-modal
security, (iii) procedures, standards and techniques and (iv) corpus consisting of “hard drive images, RAM images,
analysis and presentations. It is therefore proposed that four network captures, and images from other devices typically
major areas of a Computer Forensics programme should
found in forensics investigations such as USB drives and cell the relevant provision focuses still more on cyber security
phones”. They conclude that such scenarios can be used for rather than digital forensics. As before, the lack of English
“multiple purposes at a variety of complexity and difficulty documentation, makes it difficult to acquire a full
levels–in undergraduate classrooms and lab, for training representation of each programme’s curriculum. The
exercises, and to support further research and development of preliminary investigation has included the following phases:
digital forensics tools and techniques.
• Phase 1 – based on using a filter on cyber security
programmes in EU from
III. BACKGROUND http://www.bachelorstudies.com.
The scope of the FORC project is to provide pathways in
• Phase 2 – based on using a filter using Forensic
Computer Forensics that can be integrated to existing IT and
Focus Computer Forensic education in Europe
Computer Science programmes in Palestinian and Jordanian
https://www.forensicfocus.com.
institutions. Emphasis is given on identifying suitable subject
areas that can provide the skillset needed in the Computer • Phase 3 – based on using a filter using the European
Forensics sector, while ensuring there is smooth integration Union Agency for Network and Information Security
with existing courses and syllabi. This paper paves the way https://www.enisa.europa.eu.
for further pathway creation for postgraduate studies. This is
done through an analysis of US programmes that mainly
IV. UNDERGRADUATE STRUCTURE AND CONTENT
offer postgraduate, highly specialized study in the field of
digital forensics. After the text edit has been completed, the paper is ready
for the template. Duplicate the template file by using the
The scope of the paper is to provide a reference point Save As command, and use the naming convention
based on the detailed literature review, showing that this is a prescribed by your conference for the name of your paper. In
field that needs further exploration. The research aims to this newly created file, highlight all of the contents and
identify the predominant topics that are essential for import your prepared text file. You are now ready to style
curriculum design in digital forensics. It is also necessary to your paper; use the scroll down window on the left of the MS
consider how this study helps to determine the availability of Word Formatting toolbar.
online courses in the field, as well as the importance of
studying in the field for recruitment in the relevant sectors. Following the analysis of several programmes from the
UK, US and Europe, it appears that the structure of the
A. UK Provision FORC programme must be based on a number of factors
including (i) analysis of current practices and curricula of
The original review focused primarily on UK institutions, European academic programs in Forensic computing, (ii)
as it appears nationwide there is a significant body of analysis of internationally recognised recommendations
knowledge leading to the design of Computer Forensics dealing with the needed levels of knowledge and skills of the
curriculum. During the first study, thirty-one (31) institutions emerging areas of Forensic Computing, (iii) definition of the
were identified offering undergraduate programmes in pathway mission, objectives, and learning outcomes, (iv)
Computer Forensics, with some of them having options for determined pathway structure and courses' specification
two or three similar programmes. This is primarily due to the including their content and learning outcomes and (v)
existence of specialisations in cyber security. All inclusion of eight courses in the following themes:
programmes but one, lead to BSc qualifications.
• Digital Investigation
Although the focus of the analysis was undergraduate
provision, a review of the postgraduate programmes was also • Issues in Criminal Justice
carried out at national level. A list of eighteen (18)
institutions offering postgraduate programmes was indicative • Digital Forensics
of what is offered in the UK. It appears that key institutions • Ethical Hacking
have invested in introducing specialised postgraduate
programmes with emphasis on cyber security. • Digital Evidence.
Our suggestions for undergraduate programmes in digital
B. US Provision forensics include five core themes as identified in the FORC
The US education system is based on a selection of major project:
and minor subject areas for each programme. As in the first
phase of the study, it is almost impossible to identify all the • Knowledge of the five stages of a Digital
major/minor combinations that include a specialism in digital Investigation: Seizure; Acquisition; Preservation;
forensics, especially when considering the use of computer Analysis; and Reporting.
forensics and cyber security specialisms. However, the main • Knowledge and skills relating to a Digital
finding is that most programmes are based on an IT core Investigation e.g. handling of evidence and
with highly specialised modules towards the final two years professional practices.
of study. A more detailed review of US provision is included
later in the paper, following the preliminary investigation of • Knowledge of professional practices that form the
eight (8) indicative institutions. foundations of Computer Forensics.
• Knowledge of a relevant (in our case the
C. European Provision English) legal system, legal processes, relevant laws
Since the previous paper, some additional programmes and the regulatory environment related to the
have appeared in the EU education sector. The majority of

handling of digital evidence and forensic 12. George Mason University – Digital Forensics and
investigations. Cyber Analysis (Online)
• Generic knowledge of computer and IT e.g. data 13. Capitol Technology University – Cybersecurity
storage, operating systems, file systems and (Online)
Computer Networks.
14. University of Detroit Mercy Information – Assurance
We also recommended that the programme learning (Cybersecurity) (Online)
outcomes are distinguished in the following four categories:
15. Norwich University – Information Security &
• Knowledge and understanding Assurance (Online)
• Cognitive (thinking) skills 16. Edinburgh Napier University – Advanced Security &
Digital Forensics (Online)
• Practical skills
17. Stratford University – Digital Forensics (Online)
• Graduate skills
18. Capella University – Information Assurance and
These programme level learning outcomes should be Cybersecurity (Digital Forensics) (Online)
aligned to module-level learning outcomes that would
describe in more detail the achievement of a student who 19. DeSales University – Digital Forensics (Online)
successfully completes each module, ideally demonstrating 20. University of New South Wales Canberra – Cyber
the full experiential learning cycle as described by Kolb’s Security (Digital Forensics) (Online)
learning style model. The learning, teaching and assessment
strategy of the FORC programme should be in line with 21. University of the Sunshine Coast – Cyber
Bloom’s taxonomy and make full use of the learning Investigations and Forensics (Online)
pyramid as suggested by the National Training Laboratories, 22. Edith Cowan University – Cyber Security (Online)
Bethel, Maine.
23. Auckland University of Technology – Information
V. POSTGRADUATE PROVISION Security and Digital Forensics (Online)

Our current investigation was focused on postgraduate From the 23 institutions, 18 appear to have online
programmes and also an analysis of how many of such provision in the field. This is a very interesting finding, as
programmes are available online. We primarily focused on there is evidence that more institutions are able to shift
UK and US programmes, as discussed below. tuition in digital forensics and related subjects online. This is
despite the technical nature of such courses and the need for
A. US Provision using specialised software. The modules taught appear in
figure 1 at the end of the paper. The most popular module
As mentioned already, we only managed to investigate a topics are (i) cybersecurity foundations, (ii) network
specific part of the educational provision in digital forensics forensics, (iii) legal and ethical issues, (iv) research project,
in the US. The selection of providers and their postgraduate (v) digital forensics analysis and (vi) crime scene
programmes titles are listed below: investigation.
1. George Mason University – Digital Forensics and
Cyber Analysis) B. UK Provision
2. University of South Florida – Cybersecurity (Digital The following list provides all the postgraduate
Forensics) programmes we could identify in UK Higher Education
Institutions (HEIs). In the UK there are only two
3. University of Alabama Birmingham – Computer programmes that are offered both online and on-campus. The
Forensics and Security Management list of available modules are included in figure 2 at the end of
the paper.
4. University of Maryland University College – Digital
Forensics and Cyber Investigation 1. University of East London –Information Security
and Digital Forensics
5. John Jay College of Criminal Justice – Digital
Forensics and Cybersecurity 2. University of Greenwich –Computer Forensics and
Cyber Security
6. University of Central Florida – Digital Forensics
(Online) 3. University of Salford –Cyber Security, Threat
Intelligence and Forensics (MSc)
7. Champlain college – Digital Forensics (Online)
4. Edinburgh Napier University –Advanced Security
8. Stevenson University – Cyber Forensics (Online)
and Digital Forensics (Online and on campus)
9. Utica College – Cybersecurity (Computer Forensics)
5. Middlesex University –Electronic Security and
(Online)
Digital Forensics
10. University of Maryland – Digital Forensics and
6. Canterbury Christchurch University –Digital
Cyber Investigation (Online)
Forensics and Cybersecurity - MSc by Research
11. Sam Houston State University – Digital Forensics
(Online)

7. De Montfort University –Professional Practice in some of the more specific or rare module titles into more
Digital Forensics and Security (Online and on generic categories to enable better comparison. For example,
campus) “Principles of Cybersecurity” (University of Detroit Macy)
was placed under “Cybersecurity Foundations”. Similarly,
8. University of Westminster – Cyber Security and “Wireless Network Security” was simply moved under
Forensics “Network Forensics”. The results we present are not
9. University of Bedfordshire – Computer Security and intended to be used as a distinct quantitative analysis, rather
Forensics than data presented in a way that allows the overall trends to
be detected. Another example of above point is the
10. University of South Wales – Computer Forensics University of Detroit Macy – here they provide a module
called “Secure Acquisition”. This again is included but the
11. University of Portsmouth – Forensic Information column marked “Digital Evidence Management” and also
Technology digital media forensics as it is assumed that the content will
be very similar.
12. Leeds Beckett University – Computer Forensics and
Security (MEng) Another issue is that very few universities had offerings
in a module entitled cybercrime where the focus may be the
13. Coventry University – Forensic Computing types of crime and methods that are used online. This does
14. University of Derby – Digital Forensics and seem a strange omission but could be due to lots of the types
Computer Security of crime being discussed in other modules, and also due to
the fact that it starts to veer towards the subject of
15. Teesside University – Digital Forensics and Cyber criminology.
Investigations
There also appears to be very little provision in terms of
In the UK the most popular module topics appear to be programming and scripting when compared with
(i) network security, (ii) information security and risk undergraduate programmes. The underlying assumption is
management, (iii) incident response, (iv) crime scene that these skills are already in place for this level
investigation and (v) cyber security.
An interesting finding from our wider searches, is that
institutions in Australia appear to have less focus on the legal
VI. DISCUSSION side, whereas this is a much more prominent feature in the
From the analysis of the programmes we came across we USA. It could be argued that this could be due to the country
can discuss a number of findings. It appears that this field of being a more litigious society but we could not find a
education, although highly specialised, still attracts a supporting reference or resource to support such a statement!
significant number of students. It is also an attractive study It is also noticeable that relatively little coverage seems to be
choice for mature students, as it appears that a significant given to mobile forensics in the US when compared with
number of professionals in the field have no formal their Australian counterparts.
education. Increasingly the need for professional standards
and benchmarks push individuals towards postgraduate study VII. CONCLUSIONS
to ensure compliance with a sector that is likely to be more
regulated in the future. In our paper we extended our original study to include a
wider view of digital forensics, covering both undergraduate
There is also a concern about the appearance of online and postgraduate programmes in the UK and US. We
courses and programmes in the field. Although these provide discussed our main findings in terms of the prominent
a suitable option for professionals who wish to study modules offered for postgraduate study and the reasons
remotely, there is a concern whether these programmes can behind such curriculum design choices,
be taught online. There is a significant proportion of highly
specialised software and specific techniques that are difficult
ACKNOWLEDGMENT
to teach remotely.
FORC is funded by the European Commission under the
The use of forensic science, forensic computing and Erasmus+ funding stream. Project reference Number
digital forensics as search keywords make it difficult for 574063-EPP-1-2016-1-IT-EPPKA2-CBHE-JP. Grant
applicants to identify relevant courses that can be easily Agreement 2016 – 2556 / 001 – 001.
compared. Cyber security appears to be a significant
proportion of most programmes, affecting the balance of
programme learning outcomes in certain provisions. REFERENCES
[1] Anderson, P., Dornseif, M., Freiling, F.C., Holz, T., Irons, A., Laing,
In the US several providers tend to offer Associate of C., & Mink, M. 2006. A Comparative Study of Teaching Forensics at
Technical Arts degrees. These concentrate on a particular a University Degree Level. IMF, 116-127.
skill or trade, generally seen as equivalent to the first two [2] Bashir, M., & Campbell, R. (2015). Developing a Standardized and
years of a bachelor’s degree and therefore have less options Multidisciplinary Curriculum for Digital Forensics Education.
and more generalized content for module topics. [3] Conklin, W.A., Cline, R.E., & Roosa, T. 2014. Re-engineering
Furthermore, most institutions tend to introduce more Cybersecurity Education in the US: An Analysis of the Critical
Factors. HICSS.
specialised modules after the first two years of study, and
[4] Cooper, P., Finley, G.T., & Kaskenpalo, P. 2010. Towards standards
mostly in the final year. in digital forensics education. ITiCSE-WGR '10.
Due to the wide variety of topic names and the different [5] DFRWS. 2001. A Road Map for Digital Forensic Research:
ways in which a topic can be represented, we have placed Collective work of all DFRWS attendees, Proceedings of, The Digital

Forensic Research Conference DFRWS 2001 USA, Utica, NY (Aug [14] Karie, N.M., & Venter, H.S. 2014. Toward a general ontology for
7th - 8th). digital forensic disciplines. Journal of forensic sciences, 59 5, 1231-
[6] Gorgone, J.T., Gray, P., Stohr, E.A., Valacich, J.S., & Wigand, R.T. 41.
2006. MSIS 2006: Model Curriculum and Guidelines for Graduate [15] Raghavan, S. and Raghavan, S.V., 2013, November. A study of
Degree Programs in Information Systems. SIGCSE Bulletin, 38, 121- forensic & analysis tools. In Systematic Approaches to Digital
196. Forensic Engineering (SADFE), 2013 Eighth International Workshop
[7] Gottschalk, L., et. al., “Computer Forensics Programs in Higher on (pp. 1-5). IEEE.
Education: A Preliminary Study,” the proceedings of the 36th [16] Ekstrom, J.J., Lunt, B.M., & Rowe, D.C. 2011. The role of cyber-
SIGCSE Technical Symposium on Computer Science Education, St. security in information technology education. SIGITE Conference.
Louis, Missouri, Feb. 23-27, 2005, pp147-151. [17] Sabeil, E. Manaf, A.B.A., Ismail, Z. and Abas, M. 2011. Cyber
[8] Dathan, B., Fitzgerald, S., Gottschalk, L., Liu, J., & Stein, M. 2005. Forensics Competency-Based Framework – Areview. International
Computer forensics programs in higher education: a preliminary Journal on New Computer Architectures and Their Applications
study. SIGCSE. (IJNCAA) 1(3): 991-1000. The Society of Digital Information and
[9] Hawthorne, E.K., Shumba, R.K. 2014. Teaching Digital Forensics Wireless Communications, 2011 (ISSN: 2220-9085).
and Cyber Investigations Online: Our Experiences. European [18] Srinivasan, S. 2013. Digital Forensics Curriculum in Security
Scientific Journal September 2014 /SPECIAL/ edition Vol.2 ISSN: Education. Journal of Information Technology Education:
1857 – 7881. Innovations In Practice. Volume 12, 2013.
[10] Bashir, M., Campbell, R., DeStefano, L., & Lang, A. 2014. [19] Tu, M., Dianxiang, X., Wira, S., Balan, C., and Cronin, K. 2012. On
Developing a new digital forensics curriculum. Digital Investigation, the Development of a Digital Forensics Curriculum. Journal of Digital
11, S76-S84. Forensics, Security and Law, Vol. 7(3). 13-32.
[11] Liu, J. 2016. Developing an Innovative Baccalaureate Program in [20] TWGETDF. 2007. Technical Working Group for Education and
Computer Forensics. 36th ASEE/IEEE Frontiers in Education Training in Digital Forensics. West Virginia University Forensic
Conference S1H-1. Science Initiative
[12] Manson, D., Carlin, A., Ramos, S., Gyger, A., Kaufman, M. and [21] Dittrich, D., Garfinkel, S., Kearton, K., Lee, C.A., LANT, N., Russell,
Treichelt, J., 2007, January. Is the open way a better way? Digital A., & Woods, K. (2011). Creating Realistic Corpora for Security and
forensics using open source tools. In System Sciences, 2007. HICSS Forensic Education. ADFSL Conference on Digital Forensics,
2007. 40th Annual Hawaii International Conference on (pp. 266b- Security and Law, 2011. 123-134.
266b). IEEE. [22] G. Dafoulas, D. Neilson, and H. Sukhvinder, “State of the Art in
[13] Nance, K., Armstrong, H., and Armstrong, C. 2010. Digital Computer Forensic Education – A Review of Computer Forensic
Forensics: Defining an Education Agenda. In System Sciences, 2007. Programmes in the UK, Europe and US”, 2017 International
HICSS 2007. 40th Annual Hawaii International Conference on (pp. 1- Conference on New Trends in Computing Sciences (ICTCS) Amman,
10). IEEE. Jordan.

Fig. 1. List of modules taught in US postgraduate programmes

Fig. 2. List of modules taught in UK postgraduate programmes

Enhancing International Virtual Collaborative
Learning with Social Learning Analytics
Alexander Clauss Florian Lenk Eric Schoop
Chair of Wirtschaftsinformatik esp. Chair of Wirtschaftsinformatik esp. Chair of Wirtschaftsinformatik esp.
Information Management Information Management Information Management
TU Dresden TU Dresden TU Dresden
Dresden, Germany Dresden, Germany Dresden, Germany
alexander.clauss@tu-dresden.de florian.lenk@tu-dresden.de eric.schoop@tu-dresden.de

Abstract— The ability to work collaboratively in intercultural through […] the set-up of an international learning community
virtual teams, is constantly gaining importance for the labour whereby staff and students acquire interpersonal and
market. Virtual Mobility enables students to acquire the necessary intercultural skills” [4]. VM enables students to gain the
intercultural teamwork skills while remaining locally integrated into necessary intercultural teamwork competencies while
their regular studies. But still, international virtual collaborative remaining locally integrated into their regular studies at a
learning scenarios demand much time and effort for planning and lower cost compared to physical mobility [5].
coordination which binds resources. The support concepts for such
collaborative virtual learning groups are also resource-intensive, A proven implementation of Virtual Mobility are Virtual
because learners should be accompanied by qualified e-tutors to Collaborative Learning (VCL) arrangements, these focus on
optimise learning results both at individual and group level. the virtual classroom to include geographically separated
Classical summative tests and exams are rather unsuitable for the learners in a project-based social learning experience [6].
assessment of collaboration as expected learning outcome. These These had been used since 2001 in over 60 mostly
arrangements also need new formative assessment forms, as international learning collaborations at the authors’ chair of
participants need active and ongoing feedback. A meaningful Wirtschaftsinformatik - Information Management.
assessment of learning processes and outcomes should not only be International VCL arrangements are characterised by
based on the observation of ‘soft’ factors but should also be intensive interaction between participants. Tawileh [5] states
complemented by 'hard', fixed, automatically measurable,
that VCL has a „considerable potential to be implemented as
quantitative indicators. To gain these hard indicators the research
a flexible, attractive, and cost-effective modality for virtual
project ISLA - Indicator-based Social Learning Analytics was
launched. This paper presents the procedure for implementation as
mobility that brings authentic international activities to the
well as virtual presence, content creation and relationships within domestic classrooms“. The aim of the arrangement is to
the community as first derived indicators and their prototypical transfer group learning into the virtual room. Small
visualisation in a Learning Analytics Dashboard. international, interdisciplinary groups with around five
participants work on realistic cases for five to seven weeks, in
Keywords—Collaborative Learning; Virtual Mobility; Social a social network using social media tools. The overriding
Learning Analytics; Learning Analytics Dashboard learning objective is the student-centred development of
professional, personal, communication and media skills
I. INTRODUCTION aiming for successful international collaboration, which is
necessary for a well-prepared entry into the knowledge-
Working conditions are shifting more and more, especially intensive, interconnected working world [7]. The learners are
in the field of knowledge work. Modern Information and accompanied by qualified e-tutors to realise formative
Communication Technology (ICT) leads to a decline in the assessment and maximise learning results both at individual
importance of centralised, local, limited workplaces. At the and group level [8].
same time, the ability to work collaboratively in decentralised,
intercultural, interdisciplinary teams is gaining importance [1]. The implementation of formative assessment has a
The preparation of students for these changing working significant influence on teaching and learning settings.
conditions is a major challenge for Higher Education (HE) [2]. "Formative evaluation includes all activities of the teacher
Despite its high importance for the labour market, the and/or the learner that provide information that can be used as
importance of gaining core competencies for international feedback, to modify teaching and learning activities" [9]. The
virtual collaborative work is not yet reflected extensively in general aim is to recognise and respond to students' learning
Higher Education curricula [1]. to improve it during the learning process [10].
International physical mobility of students is associated This requires a changed assessment culture, which should
with high costs and linked to a variety of external factors. be characterised by new forms of examination that go beyond
Deficiently implemented internationalisation strategies, the assessment of individual performance, such as group
limited financial support and legal and administrative assessments with individual components. These new
restrictions are just a few examples of the typical challenges assessment forms can only be implemented objectively,
of physical mobility [3]. The continuous development of ICT purposefully and with legal certainty if they are embedded in
has made a major contribution to the development of the a new assessment culture that evaluates not only final results
growing possibilities of Virtual Mobility (VM). “VM but also learning processes. Wollersheim and Pengel [11]
facilitates intercultural experiences of students and their staff emphasise that "like summative assessments, formative



assessments also require the validity of the content of the different stakeholders to share knowledge and information for
individual tasks, which requires a systematic, controlled sustainable development within a community and to engage
derivation procedure from the topic of the course to the in dialogue with the public [17]. In our context, however, the
intended learning outcomes to the (learning) task and an concept of social learning should be explicitly distinguished
accompanying review process." In the virtual room like in from this form of stakeholder participation. SLA make use of
VCL settings this requires continuous monitoring and data generated by learners’ online activity in order to identify
multiple insights into the learning process in order to provide behavioural patterns within virtual learning environments that
supervisors with a constant overview of learning progress and signify effective collaborative learning processes. This
to offer learners personalised learning support and ad hoc analysis includes both: direct interaction - particularly
feedback on the achievement of learning objectives. dialogue - and indirect interaction, such as learners leaving
behind ratings, recommendations or other activity traces that
In the last years this formative assessment had been done can influence the actions of others [18]. As groups engage in
manually in our VCL arrangements. During their daily virtual collaborative activities, the success factors of their
presence in the social network, the e-tutors track the teamwork are a combination of individual knowledge and
communication among learners and document their skills, their personal and virtual environment, their use of tools
observations in a standardised observation sheet, which can be
as well as the ability to work together. Therefor the focus must
used by the supervisors to support and objectify their grading not only be on learners, but also on their tools and contexts to
decisions [12], [13]. This process requires professional identify relevant behavioural patterns [18]. “Social Learning
qualification from both e-tutors and supervisors as well as Analytics should render learning processes visible and
previous experiences and generous time and effort for actionable at different scales: from national and international
monitoring and making manual observation which are networks to small groups and individual learners” [18].
resource intensive processes [14]. In addition to observing
learners, the e-tutors motivate and moderate the collaborative In contrast to this, scientific work on formative assessment
virtual teamwork. In particular, they must be able to identify focuses primarily on individual learning level and uses
and resolve potential problems and conflicts at an early stage. continuous tests to provide learners with automated feedback
This results in an immense workload. Therefore, in our and to facilitate their learning progress [19], [20]. However,
experience, the e-tutors current span for simultaneous this focus is not an adequate approach for the assessment of
monitoring is four groups at maximum [14]. international virtual collaboration, therefore it was the
declared research objective of this project to enhance virtual
Despite professionalised qualification, standardised collaborative learning with the perspective of data driven
observation sheets and high time and effort, the assessments formative assessment by social learning analytics. The aim is
are currently always a result of the subjective, “soft” to support learning and assessment processes that pursue
interpretation of observations made by e-tutors and certain learning objectives and measurable learning outcomes.
supervisors. In order to develop an assessment culture, a This, however, requires, beyond the provision of digital
meaningful assessment of learning processes and learning artefacts, a conscious examination of the didactic design of the
outcomes should be enhanced by “hard”, fixed, automatically teaching/learning arrangements in order to offer learners
measurable, quantitative indicators. Therefore, meaningful higher-quality learning scenarios and to provide supervisors
data on user activities and interactions with learning content with competent support.
as well as between learners in the virtual room must be
identified, recorded, processed and made available in an To contribute to this research objectives, the project ISLA
understandable form on the basis of digital traces relevant to was conducted from September 2017 till December 2018. In
learning objectives and expected learning outcomes. the following, we will describe how ISLA was approached in
order to derive semi-automatically measurable indicators for
This paper explains the research approach in four work the formative assessment of collaborative group work and
packages. The derived indicators “virtual presence”, “content then use them in a prototypical dashboard on the open source
creation” and “relationships within the community” are social networking platform elgg.com. Afterwards, first
presented. Their prototypical visualization in a learning insights of this application and potentials for the enhancement
analytics dashboard is then described. First results of the
of international virtual collaborations will be presented.
prototype usage indicate that a partially digital support by
Social Learning Analytics allows to expand the support range The procedure for achieving the research objectives
of supervisors and e-tutors to enable a higher number of mentioned is described in the following. It shows how
participants with the same resources. indicators were defined to accompany the learning process
and to assess the achievement of the learning objectives, how
II. RESEARCH APPROACH data types belonging to these indicators were collected and
clustered in appropriate ways to implement and visualise them
To achieve the described research goals and to gain those for e-tutors and supervisors as well asfor learner afterwards.
fixed automatically measurable indicators to foster formative For this purpose four work packages were processed, which
assessment, this research focuses on the approach of Social are described in the following.
Learning Analytics (SLA). The term social learning was
coined by Bandura [15] for individual learning, that takes A. Identification of key factors for formative assessment
place in a social context and is influenced by social norms. It
can occur purely through observation or direct instruction. The aim of the first work package was the theoretical
Within the context of Web 2.0 technology and social networks preparation for the subsequent processing and implementation
the term was extended by Robes [16] to informal, self- of the project. This included an intensive reading period into
organised and connected learning, which is supported by the current research in the field of Social Learning Analytics,
social media and social networks. Social learning can also be the identification of key authors and their respective research
seen as the basis for regional cooperation, bringing together focuses. This literature and author overviews were stored in

an SQL database and visualised with the help of a network formative accompaniment of learners and which supports
representation. In addition, a systematisation in form of a mind supervisors in objectifying their assessment. In the context of
map was created to show the connections between evaluation, a master thesis a prototypical implementation on elgg was
both formative and summative, and processes and their realised [21].
analysis possibilities. Based on these detailed findings, the
data-driven and demand-oriented analysis was started in the For this purpose, 35 databases were analysed in an
second work package. explorative way. 14 empty, 9 irrelevant and 12 potentially
relevant tables were identified. 9 of the 12 empty tables were
old log tables, which once served as a backup and were later
B. Definition of indicators for successful virtual replaced by a newer version. The other three tables were
collaboration intended for functions such as API or georeferencing, but are
The second work package was aimed at the definition of not yet implemented or used. Tables were defined as irrelevant
the indicators for successful collaboration that should be if they only served to create and structure the course or
monitored, to operationalise the VCL’s overriding learning platform or contained metadata that was not relevant for SLA
objective collaboration. In the first step, a systematic literature purposes. The identified indicators are summarised in the
review was carried out to provide a broad theoretical basis. three categories: virtual presence, virtual interaction and
This systematic literature review analysed success factors and virtual relation. In the following these categories and detailed
obstacles of virtual collaboration and provided a focus for the descriptions of indicators are shown.
further development of indicators. On the one hand, these
indicators helped to identify promising behaviour patterns in A. Virtual presence
the context of social learning analytics. On the other hand,
obstacles should be identified to be able to recognise problems The number of visits on the VCL platform can be
early and to define early warning indicators. Further indicators compared to the physical presence in a traditional course [5].
were derived from the observation sheets mentioned above, The indicators were analysed in order to map the virtual
which were continuously refined in the course of our own presence data-driven. This allows the analyses of the
research [12], [13]. The aim was to achieve a systematic, following questions:
controlled derivation procedure from the topic of the course to • How often is a participant present on the platform
the expected learning outcome - the ability to collaborate - to compared to other learners?
provide a basis for indicator-based social learning analytics.
• How has the activity of the users changed over the
C. Collection and processing of interaction data course of the project and the different assignments?
The third work package aimed at collecting and processing Logins of all participants: The first indicator that reflects
interaction data and correlating identified indicators with student engagement in a virtual classroom is the number of
available data that can be used to assess and analyse learners' logins. To display the activity history, the database query can
performance. To operationalise the aforementioned defined be started several times within certain intervals. If the amount
indicators further, existing learner data, from both completed falls below a predefined value over a certain period of time,
and running instances on the used elgg platform, were the e-tutor should ask the group for reasons and intervene if
evaluated, to be able to support the formative assessment of necessary.
the determined factors with the help of data traces from the
Average number of logins per group: The presence within
database of the virtual learning platform. Subsequently, the
the VCL course can also be analysed from a group perspective.
database was also exploratively examined. This data-driven
The potential to trigger the e-tutors to intervene in case of
analysis from two views resulted in 23 database queries using
insufficient activity within a certain period of time is also the
the database language SQL.
main purpose of the indicator. A strictly summative view of
the logins can lead to inaccurate conclusions if there are fewer
D. Testing the data-driven evaluation and visualisation of members in some groups than in others. For this reason,
indicators average values of logins were used across all groups.
The fourth work package focussed on testing the data-
driven evaluation of the indicators and at developing a mock- Total login duration of participants: As an alternative
up for an indicator-based data provisioning and visualisation. indicator, an attempt was made to calculate the total login
For this purpose, meaningful data on user activities and duration of the participants. This would provide a higher
interactions with learning content as well as between learners quality statement regarding the presence in the VCL event,
on the virtual platform had to be identified, recorded, because a high number of logins does not indicate how long a
processed and made available in an understandable form on user was active on the learning platform. In elgg database, all
the basis of digital traces. The testing took place during the successful logins are stored in a table, but only the manual
project, but could only be operated by the project team. If a participant simply closes the corresponding
Therefore, the evaluations of the database were not visible ad- browser window, it cannot be traced when they left the elgg
hoc for the e-tutors. The analyses using SQL queries were platform. A reliable solution that measures whether the
carried out without any problems and delivered meaningful platform is currently visible in the active browser window of
results. the participant and whether the participant has made active
inputs is currently being developed.
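As an illustration of how the two login-based presence indicators could be computed, the following sketch issues the corresponding queries against a relational course database. It is a minimal example under assumed names: the table login_log and its columns user_id, group_id and login_time are hypothetical placeholders and do not reflect the actual elgg schema used in the project.

```python
import sqlite3

# Hypothetical schema: login_log(user_id, group_id, login_time).
# Counts logins per participant and average logins per group member
# for a given observation interval.
LOGINS_PER_USER = """
    SELECT user_id, COUNT(*) AS logins
    FROM login_log
    WHERE login_time BETWEEN ? AND ?
    GROUP BY user_id
"""

AVG_LOGINS_PER_GROUP = """
    SELECT group_id, COUNT(*) * 1.0 / COUNT(DISTINCT user_id) AS avg_logins
    FROM login_log
    WHERE login_time BETWEEN ? AND ?
    GROUP BY group_id
"""

def below_threshold(rows, threshold):
    """Return the ids whose login count falls below the predefined value."""
    return [ident for ident, count in rows if count < threshold]

def presence_report(db_path, start, end, user_threshold=3):
    con = sqlite3.connect(db_path)
    per_user = con.execute(LOGINS_PER_USER, (start, end)).fetchall()
    per_group = con.execute(AVG_LOGINS_PER_GROUP, (start, end)).fetchall()
    con.close()
    return {
        "logins_per_user": dict(per_user),
        "avg_logins_per_group": dict(per_group),
        "users_to_contact": below_threshold(per_user, user_threshold),
    }
```

Running such a report repeatedly over successive intervals yields the activity history described above; whenever a count drops below the predefined value, the e-tutor can be prompted to ask the group for reasons and intervene if necessary.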
III. RESULTS
B. Content creation
In the following it will be described how the results of the The previous indicators only described the unproductive
research project were used to develop a dashboard which virtual presence on the platform. The following provide
supports e-tutors through Social Learning Analytics in the information on the tools used, i.e. the actual productive

activity on the learning platform. The content types analysed indication of possible conflicts that require the e-tutor’s
in our case are published blog posts, remarks on blog posts, attention.
comments, remarks on comments, discussion topics created,
remarks on discussion topics, discussion posts, remarks on D. Dashboard development
discussion posts, chat messages, direct messages, comments In the next step, the indicators and indices developed in the
on tasks and the sum of all content. The mere numerical master thesis were integrated and combined into a first draft
representation of the results, however, allows the answering of a dashboard in the course of a master seminar thesis [22].
of several relevant questions for the evaluation of the The analysis was based on the standardised observation sheet
communication between the participants, for example: for e-tutors provided by our chair and on the preliminary
• Which communication tool is used the most/least and results described in before.
thus has the highest/lowest acceptance among the The analysis platform was created step by step with the
participants? insights gained from systematic literature review, the
• Which participants communicate most/least? accessible data basis and the observation sheet. The dashboard
focuses directly on e-tutors and supervisors, for this reason the
• Do participants actively participate in the discussion structure of the dashboard is based on the observation sheets.
with group members? Figure 1 below schematically illustrates the basic structure of
the Learning Analytics Dashboard.
• Is there a continuous communication over a longer
period of time?
The same content types can also be aggregated in an
elevated fashion: at group level. This not only reveals the
differences within a group or between all participants of the
event. It also gives the opportunity to compare the groups with
each other. So as to avoid any misrepresentation of the values
- equivalent to the logins - average communication tools used
per group were used as indicators. By weighting the individual
items, it is now possible to create indices at individual and
group level, which can, for example, provide information on
the extent of collaboration within the group. The following
Fig. 1. Basic structure of the Learning Analytics Dashboard [22]
table shows an exemplary weighting of the individual content
types. It should be emphasised that this is only an example. The result was a summary page with the observation
Weighting of the different indicators should be both evidence- criteria and eight linked dashboards. The main page offers a
based and adapted to the expected learning outcomes of the clear and user-friendly interface for e-tutors. It represents the
course. This well-founded weighting of indices is currently a linking between the criteria of the observation sheet and the
further focus of our research activities. different dashboards. By selection of the criteria to be
considered they are linked to the corresponding dashboard.
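To make the index construction concrete, the exemplary weights from Table I below can be applied as a simple weighted sum over a participant's content counts and then averaged per group. The snippet is a sketch only: the weights are the illustrative values from Table I, and the count data are invented.

```python
# Illustrative weights taken from the exemplary Table I (not validated values).
WEIGHTS = {
    "published blog posts": 0.5,
    "comments on blog posts": 0.2,
    "discussion topics created": 0.4,
    "comments on discussion topics": 0.1,
    "discussion posts": 0.3,
}

def activity_index(counts):
    """Weighted sum of one participant's content counts."""
    return sum(WEIGHTS.get(kind, 0.0) * n for kind, n in counts.items())

def group_index(member_counts):
    """Average of the individual indices, so groups of different sizes stay comparable."""
    indices = [activity_index(c) for c in member_counts]
    return sum(indices) / len(indices) if indices else 0.0

# Invented counts for two members of one group.
member_a = {"published blog posts": 3, "discussion posts": 7, "comments on blog posts": 2}
member_b = {"discussion topics created": 1, "discussion posts": 4}
print(activity_index(member_a), group_index([member_a, member_b]))
```

As noted above, such weights would have to be derived in an evidence-based way and adapted to the expected learning outcomes of the course before being used in practice.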
TABLE I. EXEMPLARY WEIGHTING OF CONTENT TYPES

Content type                      Weighting
published blog posts              0.5
comments on blog posts            0.2
discussion topics created         0.4
comments on discussion topics     0.1
discussion posts                  0.3
…                                 …

Consequently, it is very clear and easy to use and facilitates the formative assessment. The dashboards are kept very simple and can be modified or extended at any time. The individual dashboards contain various visualisations such as simple tables, stacked column or bar charts, pie charts or maps. Stacked column or bar charts have proven to be the best visualisation option. They are characterised by the fact that they usually display as much data as possible without losing their clarity, and require as little space as possible. For the team criteria we prefer using stacked column diagrams. For
individual criteria the stacked bar diagrams have been
sufficient. Pie diagrams and maps were used as extensions of
C. Relationships within the community dashboards that still had enough space. Simple tables were
In a social network, the focus lies on the interaction of used to view mere data. All mentioned types of visualisation
participants and their networks. A clear indicator in this were additionally extended with adequate colour
context is the number of friends the participants have in combinations to refer to certain values or simply to generate a
different groups and the total number of friends. From this, better clarity.
basic insights can be derived. In the early phase, it becomes As long as the structure of the database does not change,
visible whether the participants in their team are networked the SQL queries and consequently the dashboards work. With
properly. The threshold value recommended therefore is the a click on the update button, the queries are executed again
defined group size plus e-tutors and supervisors. In addition, and the dashboards are updated as well. Figure 3 shows an
it is also possible to determine which "informal relationships exemplary screenshot from the numerical overview for
of friendship" exist beyond group boundaries. In addition, e- discussion contributions of the participants, in group
tutors have the opportunity to investigate reasons for isolated comparison, as a single page of the Learning Analytics
team members. Above all, a sudden drop in the number of Dashboard.
friends within the team during the course can serve as a clear

Fig. 2. Numerical overview for discussion contributions of the participants, in group comparison in the Learning Analytics Dashboard

to realise formative supervised international cMoocs in the


IV. CONCLUSIONS AND FURTHER DEVELOPMENT medium term.
This paper presents the research approach for the
implementation of social learning analytics to foster formative ACKNOWLEDGMENT
assessment in four work packages as well as virtual presence,
content creation and relationships within the community as The research project “Indicator-based Social Learning
first derived hard indicators and their prototypical Analytics (ISLA)” was financed by the State Ministry for
visualisation in a Learning Analytics Dashboard [21], [22]. Higher Education, Research and the Arts in the German
federal state of Saxony.

The dashboard is currently being tested in a joint REFERENCES


international VCL project with 22 participants from TU
Dresden in Germany and 15 participants from Shiraz [1] C. Perez-Sabater, B. Montero-Fleta, P. MacDonald, and A. Garcia-
University in Iran to further adapt it to the needs of e-tutors Carbonell, “Modernizing Education: The challenge of the European
and to re-evaluate and enhance the identified indicators in an project CoMoViWo,” Procedia - Soc. Behav. Sci., vol. 197, no.
international context. First results could indicate that a partial February, pp. 1647–1652, 2015.
digital support of the involved e-tutors and supervisors [2] D. Coyne, “Employability: The Employers’ Perspective and its
through social learning analytics makes it possible to enhance Implications Bologna Process Employability,” Bologna Process
their support range and allows a higher number of participants
Seminar, 2008. [Online]. Available:
with the same resources.
http://www.aic.lv/bolona/2007_09/sem07_09/Luxemb_employ/Pl
In the future, our research on VCL will focus on the enary1_DavidCoyne.pdf. [Accessed: 15-Dec-2017].
expansion of the learning analytics tools as well as the further [3] D. van Damme, “Quality Issues in the Internationalisation of Higher
development of our dashboard. The next step is to gain
Education,” High. Educ., vol. 41, no. 4, pp. 415–441, 2001.
qualitative insights into further development potential through
in-depth interviews with the e-tutors and supervisors involved [4] EuroPACE, “Virtual mobility,” Virtual Mobility, 2010. [Online].
in the project who used the prototype. In an iterative process, Available: http://www.europace.org/interest3.php . [Accessed: 15-
the Learning Analytics Dashboard is implemented, evaluated Jan-2019].
and modified as scientific action research in international [5] W. Tawileh, “Virtual Mobility for Arab University Students: Design
virtual mobility projects. Currently, our research is focusing Principles for International Virtual Collaborative Learning
on supporting e-tutors and supervisors. A future extension to
Environments Based on Cases from Jordan and Palestine,” TU
analyse the effect of the visualised data on the participants
Dresden, 2016.
themselves should be considered. One conceivable approach
would be to use these indicators to promote motivation [6] F. Klauser, E. Schoop, K. Wirth, B. Jungmann, and R. Gersdorf, “The
through gamification measures. The declared research goal is Construction of Complex Internet-Based Learning Environments in
the field of Tension of Pedagogical and Technical Rationality,”

2004.
[7] A. Clauss, “How to Train Tomorrow’s Corporate Trainers – Core
Competences for Community Managers,” in 2018 17th
International Conference on Information Technology Based Higher
Education and Training (ITHET), 2018, pp. 1–8.
[8] A. Clauss, F. Lenk, and E. Schoop, “Digitalisation and
Internationalisation of Learning Processes in Higher Education: A
best practices report,” in Proceedings of the 13th Iranian and 7th
International Conference on e-Learning and e-Teaching (ICeLeT
2018), 2019.
[9] P. Black and D. Wiliam, “Assessment and classroom learning,” Assess.
Educ., vol. 5, no. 1, pp. 7–474, 1998.
[10] B. Cowie and B. Bell, “A Model of Formative Assessment in Science
Education,” Assess. Educ. Princ. Policy Pract., vol. 6, no. 1, pp.
101–116, 1999.
[11] H. W. Wollersheim and N. Pengel, “Von der Kunst des Prüfens -
Assessment literacy,” HDS.Journal - Perspekt. guter Lehre, vol. 2,
pp. 14–32, 2016.
[12] M. Rietze, “Analysing eCollaboration : Prioritisation of Monitoring
Criteria for Learning Analytics in the Virtual Classroom,” pp.
2110–2124, 2016.
[13] M. Rietze, “Monitoring E Collaboration Preparing An Analysis
Framework.” 2016.
[14] F. Lenk, “Virtual Social Learning Environments – a Cybernetic System?
Towards a Decision Support System,” in 2018 17th International
Conference on Information Technology Based Higher Education
and Training (ITHET), 2018, pp. 1–5.
[15] A. Bandura, Social learning theory. Prentice Hall, 1977.
[16] J. Robes, “Social Learning zwischen Management, Unternehmenskultur
und Selbstorganisation,” Wirtschaft Beruf Zeitschrift für berufliche
Bild., vol. 66, pp. 20–25, 2014.
[17] M. S. Reed, A. C. Evely, G. Cundill, I. Fazey, J. Glass, and A. Laing,
“What is Social Learning ?,” Ecol. Soc., 2010.
[18] S. B. Shum and R. Ferguson, “Social learning analytics,” Proc. 2nd Int.
Conf. Learn. Anal. Knowl. - LAK ’12, vol. 15, p. 23, 2012.
[19] & K. B. Kay, J., Reimann, P., Diebold, E., “MOOCs: So Many Learners,
So Much Potential What Is a MOOC?,” pp. 2–9, 2013.
[20] D. T. Tempelaar, A. Heck, H. Cuypers, H. van der Kooij, and E. van de
Vrie, “Formative assessment and learning analytics,” p. 205, 2013.
[21] S. Kretzschmar, “Entwicklung und Evaluation von Indikatoren für Social
Learning Analytics am Beispiel eines Virtual Collaborative
Learning Kurses in Elgg,” Technischen Universität Dresden, 2018.
[22] C. Krebs, “Datengetriebenes Feedback: Erstellung und Implementation
einer Plattform zur Datenanalyse mittels Power BI für E-Tutoren,”
Technische Universität Dresden, 2019.

Evaluation of Students’ Acceptance of the
Leap Motion Hand Gesture Application in
Teaching Biochemistry
Nazlena Mohamad Ali Mohd Shukuri Mohamad Ali
Institute of Visual Informatics (IVI) Faculty of Biotechnology and
Universiti Kebangsaan Malaysia Biomolecular Sciences
43600, Bangi, Selangor, Malaysia Universiti Putra Malaysia
nazlena.ali@ukm.edu.my 43400 Selangor, Malaysia
mshukuri@upm.edu.my

Abstract— This paper presents an early-stage evaluation of the Leap mm) and can distinguish between 10 fingers and track them
Motion controller regarding user acceptance in the teaching and individually. This device is a drastic change from one hand on
learning process. The Leap Motion is a new device for a hand- a mouse or two-finger pinch-to-zoom on novel trackpads and
gesture-controlled user interface. For appropriate evaluation, a smartphones. By moving 10 fingers in the workspace, users
novel experiment and questionnaire were created utilizing 35 can communicate with a computer in many more ways than
Biochemistry undergraduate students in Enzymology from the other devices [2].
Universiti Putra Malaysia. The subjects participated in the user
experiment and performed several tasks, such as rotating,
translating and zooming in and out on the molecules. The tasks
were performed using the Molecules application on an Airspace
platform. The research compared the performance of Leap
Motion with mouse interaction. As a result, 79.2% of the
respondents gave a positive opinion about the Leap Motion
because of its ease of use, acceptance, effectiveness and accuracy.
These students were excited and looked forward to
implementing the Leap Motion in class. Thus, the Leap Motion
controller can potentially be used as a teaching tool for a better
learning experience of the biomolecule.

Keywords— Leap Motion, molecules, evaluation, hand


gesture, biochemistry

I. INTRODUCTION
Gesture-based interaction represents a fundamental and Fig. 1. The Leap Motion controller relative size
universal form of nonverbal communication that has an
essential role in the human-computer relationship. Gesture- Leap Motion has received great attention in recent
based technology offers a natural way of interaction, thus, years because of its many applications, including gaming,
contributing to the key area of engagement [1]. Users robotics, education and medicine. Research by [3] developed
generally prefer and are excited to use multimodal a game for hand rehabilitation using the Leap Motion
interaction, which provides users with the freedom and controller for the more effective rehabilitation process.
flexibility to choose the best inputs for specific tasks. Another work on Leap Motion in an educational
Whether users are pointing to select an object out of a group, environment was carried out by [4] on a hands-on field
putting five-fingers down to shut down the computer or experiment to verify the feasibility of using gesture control on
curling fingers to zoom in or out on an image, gestures play a computer free-hand drawing of elementary students. The
an important role in developing technology with no mouse experimental results and statistical evidence suggested that
and less touch. Gesture-based interaction will help teachers Leap Motion could support elementary free-hand drawing.
and students actively communicate in the classroom. [5] explored Leap Motion’s feasibility in educational usage.
Leap Motion controller (Fig. 1) is a small device that Leap Motion can be considered in applications in educational
allows users to control the computer by gesturing with their fields. An investigation regarding elementary students was
hands and fingers in mid-air (Leap Motion, conducted to assess their technology use and theory of
http://www.leapmotion.com). Leap Motion controller works planned behaviour, or TPB. These students showed significant
effectively, capturing any motion in its workspace and potential in using this new gesture input device.
translating it to the computer. Leap Motion does this through
an array of camera sensors that monitor a 1 cubic foot To demonstrate the easy-to-use and human-friendly
workspace. Leap Motion is also extremely accurate (to 0.01 control, [6] applied and programmed the controller to change
the display settings of three-dimensional objects. Leap Motion
applied easily to educational and medical imaging. In an



education application, Leap Motion can display real three- app site, Airspace. These apps involve games, art, music,
dimensional properties that cannot be shown with a single science and education. We use Molecule, an application
image. For medical imaging applications, Leap Motion developed by Sunset Lake Software
integrates with images for medical image processing and [https://airspace.leapmotion.com/apps/molecules/osx] as
diagnoses. This integration could be used even during surgery shown in Fig. 2.
when doctors cannot use touch computers or controllers.
Meanwhile, research by [7] explores the possibilities of Leap
Motion Controller as a natural way and immersive interaction
experience by the student in elementary logic learning and the
findings show students understand educational content more
easily.
Several studies have been performed using Leap Motion
in other fields, such as medicine, music and detection.
Another related study was conducted by [8] on medical education
that studies the human body. An interactive 3-dimensional
anatomy navigation system supported by the Leap Motion
Controller was proposed in which the motion of fingers and
hand gestures were detected for input control and
successfully evaluated among the students and lecturers in the
medical faculty. A preliminary study and evaluation of this Fig. 2. Snapshot of Molecules application. Various 3D molecular structures
new sensor for building a new Digital Musical Instruments were provided as default. External molecule sources can also be downloaded
from the Protein Data Bank (PDB) and PubChem
(DMIs) were proposed [9]. A series of conventional music
gestures can be recognised by the device. The precision and The molecule is a molecule visualizer. It is a file viewer
latency of these gestures were analysed. Reference [10] that allows a user to display three-dimensional renderings of
evaluated the precision and reliability of the Leap Motion molecules and manipulate them using a Leap Motion
controller in static and dynamic conditions. In addition, the controller. The Leap Motion controller allows users to move
suitability as an economically attractive finger/hand and an open hand in three dimensions to rotate and scale the
object tracking sensor was studied. Reference [2] evaluated molecular structure faster than users with the 2D input of a
the accuracy and repeatability of the Leap Motion controller. mouse. Lateral translation of a structure is accomplished by
For their evaluation, the experimental setup for 3D sensors moving two open hands in parallel (Fig. 3).
was developed using an industrial robot system.
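To illustrate how the manipulation gestures described for the Molecules application (Fig. 3) can be derived from tracked hand data, the sketch below maps the change in palm positions between two frames to a rotate, zoom or pan action. It is only a rough illustration: the palm coordinates would in practice come from the Leap Motion SDK's frame data (not shown here), and the thresholds and the exact gesture-to-action mapping are invented, not the application's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Hand:
    x: float  # lateral palm position in mm
    y: float  # palm height above the controller in mm
    z: float  # depth in mm; smaller values are closer to the monitor (assumed convention)

ZOOM_STEP_MM = 20.0    # invented threshold
ROTATE_STEP_MM = 15.0  # invented threshold

def interpret(prev_hands, curr_hands):
    """Map the hand movement between two frames to a molecule manipulation."""
    if len(prev_hands) == 2 and len(curr_hands) == 2:
        # Two open hands moving in parallel -> lateral translation (panning).
        dx = sum(c.x - p.x for p, c in zip(prev_hands, curr_hands)) / 2
        dy = sum(c.y - p.y for p, c in zip(prev_hands, curr_hands)) / 2
        return ("pan", dx, dy)
    if len(prev_hands) == 1 and len(curr_hands) == 1:
        dz = curr_hands[0].z - prev_hands[0].z
        if abs(dz) > ZOOM_STEP_MM:
            # Hand moved towards the monitor -> zoom in; away -> zoom out.
            return ("zoom_in" if dz < 0 else "zoom_out", abs(dz))
        dx = curr_hands[0].x - prev_hands[0].x
        if abs(dx) > ROTATE_STEP_MM:
            # Sideways sweep interpreted as clockwise / anticlockwise rotation.
            return ("rotate_cw" if dx > 0 else "rotate_ccw", abs(dx))
    return ("idle",)

# Example: one hand moved 30 mm towards the monitor between two frames.
print(interpret([Hand(0, 150, 40)], [Hand(0, 150, 10)]))
```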
An early exploration of the suitability of the Leap Motion
controller for Australian Sign Language (Auslan) recognition
was presented in [12]. The controller can be used with significant
success for basic recognition signs but is not appropriate for
complex signs. Reference [11] evaluated 3D pointing tasks
using the Leap Motion sensor to support 3D object
manipulation. Their assessment used a motion gestural
interface over mouse interaction while performing single
target three-dimensional pointing tasks. In this study, multi-
target tasks were executed better with mouse interaction
because of issues regarding 3D input system accuracy.
The main goal of the research presented in this paper was
to introduce simple and straightforward Leap Motion to 3D-
molecular structures in Enzymology. The experiment was
conducted with several tasks: rotation, translation and zoom
in/out. The evaluation was performed on Leap Motion and
Fig. 3. Basic hand movement in manipulating molecular structures via Leap
mouse input devices to compare the performances.
Motion controller. Zoom-in and zoom-out of the molecules are performed by
II. EXPERIMENTAL DESIGN moving the hand towards the monitor or further away, respectively. The
molecules are rotated by turning clockwise and anticlockwise (Right image).
The experiment began with a short discussion and Panning towards multiple directions is done by placing both hands on top of
demonstration regarding Leap Motion application among the the Leap Motion controller (Left image).
students. We tested a Leap Motion controller on 35 In the present study, the same initial orientation of a
Biochemistry students of Enzymology. The experiment was molecule was established for all respondents. The test session
conducted in two laboratories at the University Putra was randomly divided into two groups that were exposed to
Malaysia (UPM), one for students who completed the different interactions. One group was exposed to the mouse
experimental tasks and questionnaire and another laboratory first. Then, they performed their tasks with the Leap Motion
for students who were waiting for their queue. Thus far, the gestural interface. The other group was exposed to Leap
Leap Motion controller has been used primarily with Motion, then the mouse. An optical mouse of 5 V 100 mA
applications developed specifically for the controller. There wired to the computer by USB was used in the experiment.
were more than 90 apps available through the Leap Motion

The sensitivity and speed of the mouse were kept consistent during the entire test session.

Setting up Leap Motion is straightforward. A user plugs one end into the laptop and the other end into the controller. Then, the user positions Leap Motion where it can see his or her hands, i.e., in front of a laptop or between a desktop keyboard and the screen. When plugged in, the green LED on the front of the device and the infrared LEDs beneath the top plate light up. In the present experiment, the Leap Motion controller was placed on a table. The placement was marked to ensure no undesired movement of the device. Moreover, a video camera was placed in front of the respondents to record their facial expressions. Each student had to rotate, zoom and pan the molecule using a mouse and Leap Motion. The students' facial expressions indirectly showed whether the input device was exciting or boring.

III. RESULTS AND DISCUSSION

Of the 35 students involved in the Leap Motion controller user evaluation, 82.86% (n=29) were female and 17.14% (n=6) were male. As shown in Table I, all males (n=6) in this study were Malays. Table I also shows that 5.71% (n=2) of the females were Chinese, 2.86% (n=1) were Indian and 2.86% (n=1) were Kadazan. The ages of both males and females ranged from 18-24 years old.

TABLE I. DEMOGRAPHIC PROFILE. THE DATA DISTRIBUTIONS INCLUDE GENDER, RACE AND AGE.

Demographic profile    Male (n=6)    Female (n=29)    Total (n=35)
Age: 18-24             6 (17.14)     29 (82.86)       35 (100)
Race: Malay            6 (17.14)     25 (71.43)       35 (100)
Race: Chinese          -             2 (5.71)         -
Race: Indian           -             1 (2.86)         -
Race: Other            -             1 (2.86)         -

According to Table II, most of the students were proficient in using computers, or 42.9% (n=15). Computer literacy was defined as understanding concepts, terminology and operations that relate to general computer use. Computer literacy is essential knowledge required to function independently with a computer. This functionality includes the ability to solve and avoid problems, adapt to new situations, keep information organised and communicate effectively with other computer literate people [13]. Molecular graphics experience is experience in the discipline and philosophy of studying molecules and their properties through graphical representations. Because this is an Enzymology subject, the students were expected to have better experience in molecular graphics. However, 22.9% (n=8) of the students did not have much molecular experience. A majority of the students had little experience in gesture devices. Gestures are defined as any physical movement, large or small, that can be interpreted by any motion sensor.

TABLE II. DEMOGRAPHIC PROFILE OF COMPUTER LITERACY AND MOLECULAR GRAPHIC AND GESTURE DEVICE EXPERIENCE.

Aspects                         1          2          3          4          5
Computer literacy               0 (0.0)    0 (0.0)    10 (28.6)  15 (42.9)  10 (28.6)
Molecular Graphic Experience    1 (2.9)    8 (22.9)   9 (25.7)   10 (28.6)  7 (20.0)
Gesture Device Experience       3 (8.6)    7 (20.0)   6 (17.1)   15 (42.9)  4 (11.4)
Notes: 1, poor; 2, bad; 3, ok; 4, good; 5, excellent.

Table III shows the evaluation performed on hand gesture-based interaction, which uses Leap Motion and a mouse input device. The evaluation shows the comparison of perceived usefulness, ease of use and acceptance towards the Leap Motion controller and mouse. Most of the subjects agreed (n=21) and strongly agreed (n=8) that they would like to use Leap Motion during class compared with a mouse: only 40% (n=14) agreed and 14.3% (n=5) strongly agreed. As a result, 57.1% (n=20) of the subjects agreed and 28.6% (n=10) strongly agreed that Leap Motion would help them understand molecules better. Five students disagreed and said that a mouse would help them understand molecules better. Based on the recorded video (figure not shown), the students looked focused and excited regarding the given task. This approach excites the students and makes them enjoy the class.

TABLE III. EVALUATION OF HAND GESTURE BASED AND MOUSE INTERACTION. THE COMPARATIVE EVALUATION INVOLVES THE ANALYSIS OF PERCEIVED USEFULNESS, EASE OF USE AND ACCEPTANCE TOWARDS THE LEAP MOTION CONTROLLER AND MOUSE. STUDENTS LARGELY ACCEPT THE USE OF THE LEAP MOTION CONTROLLER APPLICATION.

                                    Leap Motion                                              Mouse
Aspects                             1        2          3          4          5              1          2          3          4          5
Using device during class           0 (0.0)  1 (2.9)    4 (11.4)   21 (60.0)  8 (22.9)       0 (0.0)    1 (2.9)    15 (42.9)  14 (40.0)  5 (14.3)
Easy to use                         0 (0.0)  1 (2.9)    8 (22.9)   18 (51.4)  7 (20.0)       0 (0.0)    2 (5.7)    12 (34.3)  13 (37.1)  8 (22.9)
Need technical support              0 (0.0)  3 (8.6)    13 (37.1)  10 (28.6)  8 (22.9)       11 (31.4)  11 (31.4)  7 (20.0)   3 (8.6)    3 (8.6)
Inconsistency                       1 (2.9)  14 (40.0)  14 (40.0)  5 (14.3)   0 (0.0)        5 (14.3)   11 (31.4)  9 (25.7)   8 (22.9)   2 (5.7)
Learn to use it very quickly        0 (0.0)  0 (0.0)    2 (5.7)    22 (62.9)  10 (28.6)      0 (0.0)    0 (0.0)    7 (20.0)   14 (40.0)  14 (40.0)
Cumbersome to use                   1 (2.9)  6 (17.1)   17 (48.6)  9 (25.7)   1 (2.9)        4 (11.4)   6 (17.1)   18 (51.4)  4 (11.4)   3 (8.6)
Confident when using the device     0 (0.0)  1 (2.9)    7 (20.0)   18 (51.4)  8 (22.9)       0 (0.0)    0 (0.0)    9 (25.7)   17 (48.6)  9 (25.7)
Understand molecules better         0 (0.0)  0 (0.0)    4 (11.4)   20 (57.1)  10 (28.6)      0 (0.0)    5 (14.3)   12 (34.3)  12 (34.3)  5 (14.3)
Fatigue if using the device longer  3 (8.6)  13 (37.1)  8 (22.9)   6 (17.1)   4 (11.4)       8 (22.9)   14 (40.0)  7 (20.0)   4 (11.4)   2 (5.7)
Very effective                      0 (0.0)  1 (2.9)    4 (11.4)   19 (54.3)  10 (28.6)      0 (0.0)    5 (14.3)   15 (42.9)  10 (28.6)  5 (14.3)
Very accurate                       0 (0.0)  0 (0.0)    11 (31.4)  19 (54.3)  4 (11.4)       0 (0.0)    2 (5.7)    16 (45.7)  13 (37.1)  4 (11.4)

Notes: 1, strongly disagree; 2, disagree; 3, neutral; 4, agree; 5, strongly agree.
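Each cell of Table III is a count with the corresponding percentage of the n = 35 respondents. A minimal sketch of how such a row can be computed from raw 5-point Likert answers (the sample answers below are invented, not taken from the study data):

```python
from collections import Counter

SCALE = {1: "strongly disagree", 2: "disagree", 3: "neutral", 4: "agree", 5: "strongly agree"}

def likert_row(answers):
    """Return (count, percentage) per scale point, as presented in Table III."""
    n = len(answers)
    counts = Counter(answers)
    return {point: (counts.get(point, 0), round(100 * counts.get(point, 0) / n, 1))
            for point in SCALE}

# Invented answers for one questionnaire item, n = 35.
answers = [2] * 2 + [3] * 6 + [4] * 19 + [5] * 8
print(likert_row(answers))
```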

A mouse would be expected to receive high responses in benefits, Leap Motion is quite accurate and effective during
ease of use because the students are more familiar using it: teaching and learning. Of all of the subjects, 54.3% (n=19)
37.1% (n=13) agreed and 22.9% (n=8) strongly agreed. agreed and 28.6% (n=10) strongly agreed that Leap Motion
However, Leap Motion received a higher percentage in this is very effective when implemented in class. In contrast, only
category than the mouse: 51.4% (n=18) agreed and 20% 28.6% (n=10) of the subjects agreed and 14.3% (n=5)
(n=7). However, the students still needed technical support to strongly agreed regarding the effectiveness of using a mouse
use Leap Motion even though they considered it easy to use in class. Considering the last statement, “This gestural
because 8.6% (n=3) disagree with the statement “I would interface is very accurate”, 54.3% of the subjects (n=19)
need the support of a technical person to use this gestural
agreed, and 37.1% (n=13) strongly agreed. In contrast to
interface”. In contrast, 62.8% (n=22) of the subjects did not
“This mouse is very accurate”, 11.4% (n=4) of the
need technical support to use a mouse. In addition, the
subjects found that they learned both Leap Motion and a respondents agreed and 11.4% (n=4) strongly agreed with
mouse very quickly. this statement.

Of the respondents, 42.9% (n=15) disagreed regarding Overall, 79.2% of the subjects gave a positive opinion of
inconsistency in the Leap Motion controller, and no subject Leap Motion because of its ease of use, acceptance,
strongly agreed. Meanwhile, 45.7% (n=16) of the subjects effectiveness and accuracy. The subjects were interested in
disagreed that a mouse had excessive inconsistency, but two using Leap Motion again in the future.
strongly agreed. Furthermore, 10 subjects agreed that Leap Based on Table IV, more than 97.1% of the respondents
Motion was cumbersome to use. In contrast, the mouse was considered the Leap Motion and mouse easy to use. All the
convenient because only 7 subjects agreed that a mouse was subjects were excited using Leap Motion and some were
cumbersome to use. bored using the mouse. The Leap Motion controller is more
Most of the subjects felt confident using a mouse and efficient, speedier and more relaxed than the mouse for
Leap Motion, but only one subject felt unconfident using 97.2% of the respondents. Of all the respondents, 91.5%
considered the Leap Motion and mouse stable. Furthermore,
Leap Motion. This confidence is a good response to a gestural
they concluded that both devices were accurate.
interface that is new on the market. Furthermore, the subjects
felt fatigued after using Leap Motion for a long time. Using
Leap Motion for approximately a half-hour can feel like an
arm workout. Continually holding the hands up is not desk-
friendly behaviour. Although fatigue could detract from its

TABLE IV. SEMANTIC DIFFERENTIAL SCALES FOR THE LEAP MOTION AND MOUSE.

Assessment aspects   Leap Motion   Mouse
Easy                 97.1          97.2
Difficult            2.9           2.8
Exciting             100           74.2
Mellow               0             25.8
Efficient            97.2          88.6
Personable           2.8           11.4
Speedy               97.2          94.3
Methodical           2.8           5.7
Relaxed              97.2          88.6
Intense              2.8           11.4
Pleasant             85.8          88.6
Unpleasant           14.2          11.4
Stable               91.5          91.5
Volatile             8.5           8.5
Accurate             91.4          91.4
Inaccurate           8.6           8.6

“I think it is a fun and interesting way to teach students about molecular structure. It is awesome and new. Students will find it easier to learn a complicated molecular structure. It also makes learning fun and relaxing. It’s amazing!!!” [PID 169082]

Despite the positive feedback, some negative feedback was also given.

“It would be very nice and interesting if the system is upgraded so that it can detect gestures more quickly. It is slow like I just experienced, and it is better not to use it in teaching and learning.” [PID 166760]

“It is good. But sometimes it is not very efficient because we need to control our hand gestures.” [PID 169426]

“There are pros and cons. The use of hand gestures may attract students to the subject because of the technology. However, to achieve this goal, hand gestures would be my last choice.” [PID 169119]

The Leap Motion controller undoubtedly represents a revolutionary input device for gesture-based human-computer interaction. In this study, we evaluated the controller to introduce new technology in teaching and learning systems. Based on the results and overall experience, we conclude that the Leap Motion controller should be applied to biochemistry classes. The Leap Motion could receive more attention if the sensory space and inconsistent sampling frequency were improved.
inconsistent sampling frequency were improved.
IV. CONCLUSION
In general, the subjects gave positive feedback regarding
the introduction of the Leap Motion to teaching and learning. Teaching with technology can deepen student learning by
The positive reactions and acceptance during the evaluation supporting academic objectives. However, the best
might have occurred because of the Hawthorne effect. Instead technology must be selected while not losing sight of the
of a traditional method (mouse), people tend to prefer the goals of student learning. Gesture-based technology might be
unique and attractive device when introduced to new the best choice for teaching and learning rather than through
technology (Leap Motion). A Leap Motion application can typing or moving a mouse. Gestures are universal and more
provide more intuitive interactions. This scenario presents natural than operating a keyboard or mouse. Gestures could
avenues for further investigation. Most subjects gave positive
also be a valuable tool in maintaining and focusing on
feedback regarding Leap Motion.
students' attention and promoting an interactive classroom.
“I think it’s good if you know how to use it. It is very
efficient in showing the molecules as opposed to using a Based on the results from the experiments and
mouse. It saves time and might even get the students’ questionnaire, we can conclude that the new Leap Motion
attention in class. Truthfully, I’m not into all this enzyme- input device possesses huge potential for use during lecture
protein thing but this device managed to grab my attention (a sessions. Based on the attitudes toward Leap Motion, the
little).” [PID 170176] respondents were very excited to use Leap Motion in the
future compared with mouse interaction. Leap Motion could
“I think it is very exciting to use this technique in teaching make the class more appealing and efficient.
and learning. It corresponds with the growth of technology &
science. I think this will create interest for students to learn REFERENCES
about structural biochemistry, which used to be considered a [1] M. S. A. Rahman, N. M. Ali, and M. Mohd, “Natural user
lame subject. I’m looking forward to this technique being interface for children: From requirement to design,” in Lecture
used in future lessons.” [PID GS38981] Notes in Computer Science (including subseries Lecture Notes in
Artificial Intelligence and Lecture Notes in Bioinformatics),
“Hand gestures can enhance both student and lecturer 2017.
experience during teaching and learning. They make it easier [2] F. Weichert, D. Bachmann, B. Rudak, and D. Fisseler, “Analysis
to navigate the molecule, rotate it and zoom in and out. Hand of the accuracy and robustness of the Leap Motion Controller,”
gestures are more reliable and easier compared to a mouse.” Sensors (Switzerland), 2013.
[PID 168820] [3] M. Alimanova et al., “Gamification of hand rehabilitation
process using virtual reality tools: Using leap motion for hand
“I would use it during class because it is interesting and rehabilitation,” in Proceedings - 2017 1st IEEE International
new. People will be more interested during class. We must Conference on Robotic Computing, IRC 2017, 2017.
move our bodies with gestures, but it is not a burden or [4] T. Yang, K. Miao, and J. Hung, “Gesture control in education for
a young student,” Comput. Technol. Mod. Educ., pp. 44–54,
tiresome at all.” [PID 169901] 2014.
[5] L. Kuo, Y. HJ, M. Ho, S. Su, and H. Yang, “Assessing a new
input device for an educational computer,” Mod. Comput. Appl.

Sci. Educ., pp. 114–116, 2014.
[6] D. Huszar, L. Kovacs, A. Palffy, and A. Horvath, “Application of
three dimensional gesture control for educational and medical
purposes,” Budapest Peter Pazmy Cathol. Univ. Fac. Inf.
Technol. Bionics., 2013.
[7] S. Deb and T. Nama, “Interactive boolean logic learning using
leap motion,” in Proceedings - IEEE 9th International
Conference on Technology for Education, T4E 2018, 2018.
[8] F. L. Nainggolan, B. Siregar, and F. Fahmi, “Anatomy learning
system on human skeleton using Leap Motion Controller,” in
2016 3rd International Conference on Computer and Information
Sciences, ICCOINS 2016 - Proceedings, 2016.
[9] V. Silva, S, Eduardo., Anderson, Jader., Henrique, Janiel.,
Teichrieb and G. Ramalho, “A Preliminary Evaluation of the
Leap Motion Sensor as Controller of New Digital Musical
Instruments,” Cent. Inform., 2013.
[10] J. Guna, G. Jakus, M. Pogačnik, S. Tomažič, and J. Sodnik, “An
analysis of the precision and reliability of the leap motion sensor
and its suitability for static and dynamic tracking,” Sensors
(Switzerland), 2014.
[11] J. Coelho and F. Verbeek, “Pointing Task Evaluation of Leap
Motion Controller in 3D Virtual Environment,” Creat. Differ.
Proc. Chi Sparks 2014 Conf., 2014.
[12] L. E. Potter, J. Araullo, and L. Carter, “The leap motion
controller: A view on sign language,” in Proceedings of the 25th
Australian Computer-Human Interaction Conference:
Augmentation, Application, Innovation, Collaboration, OzCHI
2013, 2013.
[13] H. M. Robinson, Emergent computer literacy: A developmental
perspective. 2008.

Designing and Implementing an e-Course Using
Adobe Captivate and Google Classroom: A Case
Study
Shahd Alia Dr.Thair Hamtini
Department of Computer Information Systems Department of Computer Information Systems
University of Jordan University of Jordan
Amman, Jordan Amman, Jordan
shahdalia94@gmail.com thamtini@ju.edu.jo

Abstract—Nowadays, Learning Management Systems (LMS) And since it is cloud-based, it also gives unlimited storage
are widely spread across academic institutions. They are not capacity. This study proposes a modified Technology Accep-
restricted to online and distant courses but are also useful during tance Model (TAM) [2] that tries to analyze the effectiveness
or in addition to face-to-face learning sessions. This study took
place at The University of Jordan as an attempt to evaluate and acceptance of google classroom using a course designed
the acceptance level of Google Classroom which is one of the on Adobe Captivate 2019, this course targets beginner re-
most trending LMS platforms. The experiment used a course searchers, students and anyone interested in writing a scientific
designed on Adobe Captivate 2019, this course teaches LATEX, a document according to the standards. The platform that is
high-quality typesetting system that includes features designed going to be the subject of this course is called Overleaf, it
for the production of technical and scientific documentation to
help researchers focus on the content of their research and is a very easy to use, beneficial text-editing platform that uses
worry less about the structure of the document. To measure the LATEX [3], yet the learning process is very practical and it is
satisfaction level of learners and teachers with Google classroom, hard to teach this topic using the theoretical abstract ways.
a modified and extended version of Technology Acceptance This study measures the learners understanding for such a
Model(TAM) has been used. The results showed that participants practical topic using google classroom.The rest of the paper
felt comfortable taking a course using Google Classroom, and
they agreed upon the high effectiveness of Google Classroom as is organized as follows: in the next section, a review of
a learning management system. related works is provided, followed by the research questions,
Index Terms—Learning Management System (LMS), Elearn- sampling methods, instrument, a brief on the topic discussed
ing, Google Classroom, LaTeX, Overleaf, Technology Acceptance in this e-course, and research methodology. The results and
Model (TAM) findings are then explained and summarized.
II. RELATED WORK
I. INTRODUCTION
A vast majority of people started to prefer online learning
Google Classroom is a new - cloud-based - product in in the last twenty years over the traditional tedious ways of
Google Apps for Education (GAFE) [1]. This product aims learning. Hence, most countries directed their efforts towards
to give teachers more time to teach and students more time to having a successful, robust online education system for schools
learn. Unlike traditional teaching and learning approaches that and for higher education too. one of the journeys that was
are teacher-centered, time consuming, and inflexible, google worth going through, is the united states journey with online
classroom is an interactive LMS that gives students and education. Elaine and Jeff mentioned in their book [4] that
teachers the ability to ask, comment, and give feedback, it also number of higher education students taking at least one online
eliminates the time needed to get ready and reach the campus course, has been increasing linearly during a ten-year period
since many students and teachers might be living in distant of time. This study started during fall 2002 with less than 10%
areas, this google product definitely helps saving their time of students taking at least one online course, the percentage
and keeps all the concentration on the teaching and learning has been increasing until it reached 32% in fall 2011, which
process. Talking about flexibility, google classroom not only shows how important and beneficial online education was,
allows teachers to provide material and extra examples or even at the beginnings . Elaine and Jeff continued to track the
resources, but it is a free, easy to use platform that allows progress of online education in the united states through their
teachers to create classes, distribute assignments, track student reports [5] [6] all the way from 2011 to 2016. In 2018, three
progress, grade and send feedback in a way that is much more researchers decided to study the impact of the new Online
flexible and efficient than the traditional ways that are based Master of Science in Computer Science (OMSCS) offered by
on papers for everything! the Georgia Institute of Technology (Georgia Tech), to answer



one important question which is ’Can Online Delivery Increase developed to help designers in creating interactive eLearning
Access to Education?’ [7]. Georgia Tech's Computer Science and mLearning content [14], it has many features that help
Department, ranked in the top ten in the United States, was developing an interesting e-course and personalize it according
first to offer a full online version of its highly regarded masters to the learning styles of the targeted learners, then it could be
degree, which was totally equivalent to the in-person degree published either to a web page or published as an application,
and the students were not labeled as ’online students’. The or even published to a specific LMS, just like we did after
biggest benefit of this program, is that it costs about 7,000$ designing the course in this study, it has been published to
compared to the in-person computer science masters degree google classroom.
that costs 45,000$. The price factor has definitely attracted The last step is to test whether the targeted learners and
students to choose the online version, also the fact that this teachers are satisfied with learning such courses using Google
online version said to be equivalent to the traditional in- person Classroom. Technology Acceptance Model(TAM) [15] is one
version, this made it more believable and authentic. As a result, of many models that can be used for this purpose. It is a very
online access to education was significantly increased. popular model that has been used for more than quarter a
Due to the numerous advantages that students and teachers century now to measure human behavior and their acceptance
has experienced with online education, Learning Management or rejection of a certain technology under study based on
System (LMS) concept has emerged and accomplished a some factors, like Perceived usefulness (PU) and Perceived
huge success in the field of teaching, knowledge sharing, and ease-of-use (PEOU), which are considered the two main
dissemination across the Web [8] [9]. One of those LMS’s is factors to be measured when using TAM. Now with the non-
google classroom that we are studying here. Although google stopping development of technologies it became a necessity to
classroom has been launched in 2014, the research papers study the human reaction towards each technology, therefore,
and articles which addressed this topic and tried to find how TAM has been considered a key model for such purposes,
effective Google Classroom is, are very few till this day. Three since it has proved its effectiveness among other models and
papers discussing google classroom were reviewed before approaches that are made for the same purpose. But TAM
starting this study. The first paper [10] discussed the features was not used as an abstract rigid model in all cases, most
of google classroom, factors for adopting it, how teachers researchers believed that there are some other aspects that
are using it, it’s effectiveness and limitations. Those aspects needs to be assessed depending on the topic under study. For
were discussed after interviewing 7 teachers from different example, Saad, Nebebe, and Tan [16] decided to extend TAM
departments and 35 students from the department of English model to fit the e-learning and online education context, since
at Daffodil International University-Bangladesh. it is important to assess students participation, involvement,
The second paper [11] has measured google classroom’s and other factors [11]. Subsequently, TAM has proved its
active learning using a data mining course, the writers used effectiveness for measuring human acceptance for learning
an extended version of Technology Acceptance Model(TAM) management systems [17], and platforms that are made for
suitable to test the satisfaction of learners who enrolled the online education [18] [19] .
data mining subject, and the results showed that overall
students are satisfied with google classroom, therefore, this III. R ESEARCH Q UESTIONS
paper shows that Google classroom is effective as an active The following four questions will act as a guide for this
learning tool. study, and will be addressed based on the research that is
We also managed to go through an Arabic experience conducted:
with google classroom, this paper [12] has adopted Partial • How well do students and teachers in our society accept
Least Square-Structural Equation Model (PLS-SEM), which is to use Google Classroom in education?
considered as a method of structural equation modeling that • Has Google Classroom satisfied all the learner needs?
allows estimating complex cause-effect relationship models • What is the level of interactivity that Google classroom
with latent variables [13]. PLS-SEM was used here to examine has reached?
factors that affect the student’s acceptance of google classroom • Would it be expected to see Google Classroom integrated
at Al Buraimi University College (BUC) in Oman, this exper- in the teaching and learning of various topics in the
iment used an online questionnaire with 337 respondents, and University of Jordan?
results showed that students are willing to take courses using
google classroom. And they are looking forward to implement IV. B RIEF ON THE COURSE M ATERIAL AND D ESIGN PHASE
this technology in higher educational institutions. Overleaf is an online LaTeX editor that is easy to use,
Before publishing the course on any LMS platform, it needs needs no installation, real-time collaboration, version control.
to be structured using one of the tools that are designed It allows researchers to release the burden of editing their
to create online courses, each tool has its features, and you documents, adjusting the fonts, spacing, and writing the ref-
can choose the tool you are going to use to design your erences according to a specific standard, they will only worry
course depending on the purpose of your course and the about the content of the document. Overleaf has accomplished
features you are going to need, one of the most famous a huge success in the past couple of years, and currently
tools is Adobe Captivate. Adobe Captivate is basically a tool used by 600,000 authors at more than 3600 institutions in

180 countries, including CalTech, Stanford, MIT, Harvard,
and Brown. Many researchers and students started to use
Overleaf in the writing of their thesis and research papers,
that is why the goal of this course is to become a guide
for beginners to teach them the basics of Overleaf. All the
topics discussed were very simple in order to match the
understanding of beginners and help them start their first latex
document, therefore, no advanced commands or codes were
mentioned in this course, just the basics of every scientific
document.
This course contains six lessons, starting with an overview of the environment, the most important reserved words in Overleaf and the job of each reserved word, and how to add tables, figures, and equations to your document; the last lesson talked about referencing and citation.

The design of this course has been done using Adobe Captivate 2019, because of the great enhancements added to this release. The following points are the teaching and learning strategies that have been followed in the design of this course:
• The design of the course depends mainly on the software simulation technique. Since it aims to teach researchers how to use the Overleaf environment, the simplest way to design the course was by simulating the real environment and directly showing learners how the actual work is done.
• The text-to-speech feature provided by Adobe Captivate has been used to keep the learners engaged, since many studies proved that students learn better by listening. Also, according to cognitive learning theory [20] and the modality principle [21], it will be overwhelming to the learners if we added on-screen texts to the software simulation. We need to minimize the chances of overloading learners' visual/pictorial channel and present verbal explanation as speech processed through the auditory/verbal channel instead.
• Pop colors and side notes were used when necessary to grab the learners' attention towards something, like when it is time to try a command on Overleaf, or when we are about to explain something that needs the learner's full attention.
• Red squares and arrows were used to show learners on which part of the screen we are working. Sometimes when the learners are not familiar with the software environment they might get lost.
• Learners were given real examples of published papers, so they can relate and understand what the lesson is talking about.
• At the end of each lesson there was a test of knowledge so learners could refresh their memories and make sure they understood the current lesson before moving to the next one. Those tests were designed using motivators and sound effects to keep the learner active and engaged, and to make the learning process more interesting.
The following figure "Fig. 1" shows some screenshots of the course and the knowledge test questions to help you understand the whole idea:

Fig. 1. Screenshots of the Lessons and the Knowledge Tests

V. SAMPLING

The sample was chosen carefully from The University of Jordan so that it contains both males and females from different ages, with varying demographic information, with and without an IT background, for the reason that not all researchers are researchers in the IT field or even interested in technology; the researcher might be from other fields like biology or physics and wish to learn how to use Overleaf. Therefore, Google Classroom and the design of the course must be simple, taking into consideration those learners with no IT background. The one thing that all the participants have in common is that they are all either postgraduate students who are required to write papers and documents, undergraduate students who are interested in the research field, or master and PhD students who are preparing to write their theses and dissertations. The sample also included teachers, so we can measure Google Classroom as an LMS from their point of view.

VI. INSTRUMENT

A questionnaire was developed to measure the learners' satisfaction level with Google Classroom features. This questionnaire contained three parts. The first part consisted of some demographic questions to ensure that the sample has participants with varying demographics. In the second part we asked the participants about how often they use the internet on a daily basis to determine the level of information and communication technology (ICT) usage among those participants. The third part of the questionnaire has 23 questions to measure the opinion of participants in the following areas after taking the course on Google Classroom ("Fig. 2"):

Fig. 2. Areas Under Investigation
To measure the participants' answers mathematically, we used a 5-point Likert scale [22] ranging from 1 (strongly disagree) to 5 (strongly agree). The answers were then analyzed and tabulated to make the data understandable and organized in a meaningful way, so that decisions and enhancements can be made based on those tables.
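As a concrete illustration of this tabulation step, the following minimal Python sketch computes per-aspect means from 5-point Likert responses; the aspect labels and scores are hypothetical examples of the procedure and do not reproduce the study's data.

    # Sketch: tabulating 5-point Likert responses (1 = strongly disagree,
    # 5 = strongly agree) into per-aspect means, as done for Tables III-VII.
    # Aspect names and response values are hypothetical.
    responses = {
        "Course ideas were demonstrated clearly": [4, 3, 5, 4, 2],
        "Feedback property was useful":           [5, 5, 4, 5, 4],
    }

    for aspect, scores in responses.items():
        mean = sum(scores) / len(scores)   # average satisfaction for this aspect
        print(f"{aspect}: mean = {mean:.2f}")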
RESEARCH METHODOLOGY
The first step was to design the lessons of the e-course used in this study. Six lessons were designed using Adobe Captivate 2019; these six lessons covered all topics that are important for beginners in Overleaf, and also included some questions to test the learners' knowledge of each topic after finishing the related lesson.
Then a class was created on Google Classroom, and the material was uploaded to start the experiment. After that, we chose the sample, which included teachers, beginner researchers, and students at The University of Jordan.
A questionnaire was developed to measure all the aspects that needed to be measured regarding Google Classroom features and services, along with the design of the course, the lecturer's ability to deliver information, and the quality of the received information, since all of those factors are important too. After that, all sample participants joined the class, took the course and rated their satisfaction with each feature under study. The answers were collected and analyzed based on the 5-point Likert scale, then tabulated in a meaningful form. The analysis is presented in the next section, and the results and conclusion in the last section.

VII. RESULTS AND FINDINGS
The current sample of 50 participants submitted the questionnaire after completing the e-course and testing the features that Google Classroom offers to enhance the learning process. The first part of the questionnaire contained 4 demographic and personal questions to ensure the diversity of the participants in spite of the fact that the sample is relatively small, and also to ensure that the sample included both students and teachers so as to test Google Classroom's environment from two points of view.
The answers to the first part of the questionnaire are described in "Table I". The results of the first part show that the sample includes both males and females with a good range; 4 of the participants are above 30, and a good combination of undergraduates and postgraduates appears in the 21-30 age range.
The questionnaire contained a question about where participants live. Since we are trying to measure the effectiveness and acceptance of a distance learning tool, it is important to have a number of participants who live in different cities, to see if Google Classroom has helped them achieve the learning objectives without having to go all the way to the campus.
The last question of this part was about the job of each participant: 17 participants were only students, both undergraduates and postgraduates; 10 were teachers, to measure Google Classroom from a teacher's side; 15 participants have IT-related jobs and are expected to be familiar with using technology (this aspect is further tested in the next part of the questionnaire); and 8 participants have other jobs.

TABLE I
DEMOGRAPHIC QUESTIONS

Variable | Frequency | Percent
Gender
  Male | 23 | 46%
  Female | 27 | 54%
Age
  Less than 18 | 0 | 0
  18-20 | 6 | 12%
  21-24 | 17 | 34%
  25-30 | 23 | 46%
  Above 30 | 4 | 8%
Residence
  Amman | 25 | 52%
  Other cities | 24 | 48%
Work
  Student only | 17 | 34%
  Teacher | 10 | 20%
  IT-related job | 15 | 30%
  Other jobs | 8 | 16%

The second part of the questionnaire included a single question to measure how familiar the participants are with using the internet and web-based applications, since this affects their acceptance of using technology in their daily lives, especially e-learning; one does not expect a student who does not use the internet much to be comfortable taking a course in a Google Classroom. "Table II" shows the results of this question and indicates that most of the participants use the internet above the normal range on a daily basis. This suggests that they are familiar with technology, and familiar with seeing and using web pages in general, which eventually helps them accept new technologies and minimizes their resistance towards them.

TABLE II
LEVEL OF INFORMATION AND COMMUNICATION TECHNOLOGY (ICT) USAGE AMONG PARTICIPANTS

Variable | Frequency | Percent
Number of times participants use the internet daily
  1-3 times a day | 2 | 4%
  4-7 times a day | 3 | 6%
  8-10 times a day | 8 | 16%
  11-15 times a day | 20 | 40%
  More than 15 times a day | 17 | 34%

The third and last part of the questionnaire contains 23 questions to measure the quality of the information delivery process, communication and interaction, perceived usefulness, perceived ease of use, and user satisfaction. The following five tables discuss and summarize the results of the participants' answers in those aspects respectively.
"Table III" shows the participants' answers regarding the quality of the information delivery process in this class. All their answers were above the average. They strongly agree that the course activities helped them understand what Overleaf is and how to start working on it; they also agreed that the
lecturer was straight to the point and all ideas and concepts were explained clearly. Therefore, they find Google Classroom a successful medium for disseminating such courses. The highest score was for the feedback property; most of the participants agreed that this feature helped them receive the information much more clearly. This feature gives Google Classroom a competitive advantage against other learning management systems.

TABLE III
QUALITY OF INFORMATION DELIVERY

Aspect | Mean
The course ideas and concepts were demonstrated clearly. | 3.62
The lecturer was straight to the point and delivered the needed information effectively. | 3.6
Participants found Google Classroom suitable for disseminating courses like this course. | 3.9
Course activities helped participants build a basic knowledge of Overleaf. | 4.2
Participants found the feedback property in Google Classroom very useful. | 4.72

The next table, "Table IV", summarizes the participants' evaluation of communication and interaction in Google Classroom. The highest mean goes to Google Classroom being an open platform, which is a very important feature since the whole idea of an LMS is to exchange knowledge and learn using such platforms. The lowest mean was for the second aspect: the participants did not agree much that they were active most of the time. This indicates that there is a need for more motivation boosters to keep learners engaged, and we need to come up with new ideas to ensure that the learners are fully active and focused, and to make the learning process a bit more interesting.

TABLE IV
COMMUNICATION AND INTERACTION

Aspect | Mean
The class was open to new ideas, and participants could have contacted each other if they wanted to. | 4.1
Participants felt engaged in the learning process and active most of the time. | 3.59
Participants felt comfortable taking this course in a Google Classroom. | 3.81
Participants felt that the lecturer was available and easy to contact most of the time. | 4.23
Participants believe that Google Classroom is an open platform to exchange knowledge. | 4.72

"Table V" discusses the perceived usefulness; all mean values were above average again. The highest mean was for the first aspect: learners believed that Google Classroom is an excellent medium for e-learning. Also, with a mean value of 4.2, participants preferred Google Classroom's way of submitting assignments. Most of the participants also liked the organization of Google Classroom and believed it helped them track their performance.

TABLE V
PERCEIVED USEFULNESS

Aspect | Mean
Participants believed that Google Classroom is an excellent medium for e-learning. | 4.63
Google Classroom is a very organized platform that helps students track their performance and understand their current situation in a particular topic. | 3.98
The assignment submission method in Google Classroom made it easier for students to submit assignments on time. | 4.2
Participants successfully achieved their course objectives. | 3.77

"Table VI" shows outstanding results regarding the ease of use of Google Classroom. Most of the participants strongly agreed that it was easy to log in, join the class, navigate through the system, access the course material, and understand how each process is done using Google Classroom.

TABLE VI
PERCEIVED EASE OF USE

Aspect | Mean
It was easy for participants to sign up/log in. | 4.68
Participants did not face any trouble joining the class. | 4.47
It was easy to understand the system and navigate through it. | 4.19
It was easy to access the course material. | 4.17
It was easy to understand the method of assignment submission and feedback. | 3.94
The design of Google Classroom is user-friendly and everything is easy to find. | 4.12

The last table, "Table VII", shows the level of the users' satisfaction with some aspects they tested during the course. Results showed that users were relatively satisfied with most of Google Classroom's features. The lowest mean value, 3.5, was for the third aspect: not all learners believed that this method of learning is more interesting than face-to-face learning. Although Google Classroom focused on being an open platform that allows announcements, commenting, and feedback properties, this was clearly not enough to make all learners believe that it is more interesting than traditional learning.

TABLE VII
USER'S SATISFACTION

Aspect | Mean
Participants would not have preferred to take this course in a normal class. | 3.66
Participants would recommend using Google Classroom in other courses. | 4.31
This method makes the learning process less boring. | 3.5
Participants preferred Google Classroom's method of examination and assignment submission over the traditional (paper) way. | 3.8
VIII. CONCLUSION AND FUTURE WORK
This study shows that most of the participants are satisfied with the features of Google Classroom that were presented through this course. This demonstrates Google Classroom's effectiveness as a learning management system (LMS) and makes it one of the leading LMSs that are expected to be widely used in the teaching and learning process of various topics in the next few years. Google Classroom satisfied almost all the needs of any student: it allowed students to take classes, view material, check the teacher's announcements and comment on them, submit assignments, track their progress in a specific topic, request feedback from the lecturer, and check their grades as updated by the teacher, all of this and more on one platform.
This study also shows that the interactivity of Google Classroom has not reached the required level yet; it needs to motivate the learners more and keep them engaged in order to be an equal alternative to face-to-face learning in certain topics.
Although the initial results of the study are positive, for future work the number of participants needs to be increased to minimize the sampling error and reach more students and teachers at the University of Jordan. Also, the instrument and course used in the experiment need to be designed in a way that allows further analysis of the difference between the teacher's feedback and the learners' feedback. Most importantly, this method of learning needs to be applied to other topics to ensure Google Classroom's effectiveness in all types of classes, all areas and topics, and to ensure that all expected users accept this platform for education.

REFERENCES
[1] M. E. Brown and D. L. Hocutt, "Learning to use, useful for learning: a usability study of Google apps for education," Journal of Usability Studies, vol. 10, no. 4, pp. 160-181, 2015.
[2] P. Legris, J. Ingham, and P. Collerette, "Why do people use information technology? A critical review of the technology acceptance model," Information & Management, vol. 40, no. 3, pp. 191-204, 2003.
[3] C. Hayes, "An introduction to LaTeX," 2016.
[4] I. E. Allen and J. Seaman, Changing Course: Ten Years of Tracking Online Education in the United States. ERIC, 2013.
[5] I. E. Allen and J. Seaman, Grade Level: Tracking Online Education in the United States. ERIC, 2015.
[6] I. E. Allen and J. Seaman, Online Report Card: Tracking Online Education in the United States. ERIC, 2016.
[7] J. Goodman, J. Melkers, and A. Pallais, "Can online delivery increase access to education?" Journal of Labor Economics, vol. 37, no. 1, pp. 1-34, 2019.
[8] S. M. Jafari, S. F. Salem, M. S. Moaddab, and S. O. Salem, "Learning management system (LMS) success: An investigation among the university students," in 2015 IEEE Conference on e-Learning, e-Management and e-Services (IC3e). IEEE, 2015, pp. 64-69.
[9] D. E. Marcial, J. M. N. Te, M. B. Onte, M. L. S. Curativo, and J. A. V. Forster, "LMS on sticks: Development of a handy learning management system," in 2017 7th International Conference on Cloud Computing, Data Science & Engineering - Confluence. IEEE, 2017, pp. 782-787.
[10] S. Iftakhar, "Google Classroom: what works and how?" Journal of Education and Social Sciences, vol. 3, no. 1, pp. 12-18, 2016.
[11] I. N. M. Shaharanee, J. M. Jamil, and S. S. M. Rodzi, "Google Classroom as a tool for active learning," in AIP Conference Proceedings, vol. 1761, no. 1. AIP Publishing, 2016, p. 020069.
[12] R. A. S. Al-Maroof and M. Al-Emran, "Students acceptance of Google Classroom: an exploratory study using PLS-SEM approach," International Journal of Emerging Technologies in Learning (iJET), vol. 13, no. 06, pp. 112-123, 2018.
[13] W. Afthanorhan, "A comparison of partial least square structural equation modeling (PLS-SEM) and covariance based structural equation modeling (CB-SEM) for confirmatory factor analysis," International Journal of Engineering Science and Innovative Technology, vol. 2, no. 5, pp. 198-205, 2013.
[14] K. Siegel, Adobe Captivate 2017: The Essentials. IconLogic, 2017.
[15] N. Marangunić and A. Granić, "Technology acceptance model: a literature review from 1986 to 2013," Universal Access in the Information Society, vol. 14, no. 1, pp. 81-95, 2015.
[16] R. Saade, F. Nebebe, and W. Tan, "Viability of the 'technology acceptance model' in multimedia learning environments: a comparative study," Interdisciplinary Journal of E-Learning and Learning Objects, vol. 3, no. 1, pp. 175-184, 2007.
[17] S. Alharbi and S. Drew, "Using the technology acceptance model in understanding academics' behavioural intention to use learning management systems," International Journal of Advanced Computer Science and Applications, vol. 5, no. 1, pp. 143-155, 2014.
[18] N. Fathema, D. Shannon, and M. Ross, "Expanding the technology acceptance model (TAM) to examine faculty use of learning management systems (LMSs) in higher education institutions," Journal of Online Learning & Teaching, vol. 11, no. 2, 2015.
[19] F. Abdullah and R. Ward, "Developing a general extended technology acceptance model for e-learning (GETAMEL) by analysing commonly used external factors," Computers in Human Behavior, vol. 56, pp. 238-256, 2016.
[20] S. Sepp, S. J. Howard, S. Tindall-Ford, S. Agostinho, and F. Paas, "Cognitive load theory and human movement: towards an integrated model of working memory," Educational Psychology Review, pp. 1-25, 2019.
[21] J. Wang, K. Dawson, K. Saunders, A. D. Ritzhaupt, P. Antonenko, L. Lombardino, A. Keil, N. Agacli-Dogan, W. Luo, L. Cheng et al., "Investigating the effects of modality and multimedia on the learning performance of college students with dyslexia," Journal of Special Education Technology, vol. 33, no. 3, pp. 182-193, 2018.
[22] A. Joshi, S. Kale, S. Chandel, and D. Pal, "Likert scale: Explored and explained," British Journal of Applied Science & Technology, vol. 7, no. 4, p. 396, 2015.
The Importance of Institutional Support in Maintaining Academic Rigor in E-Learning Assessment

Darin El-Nakla, College of Business Administration, Prince Mohammad Bin Fahd University, Alkhobar, Saudi Arabia, delnakla@pmu.edu.sa
Beverley McNally, College of Business Administration, Prince Mohammad Bin Fahd University, Alkhobar, Saudi Arabia, bmcnally@pmu.edu.sa
Samir El-Nakla, College of Engineering, Prince Mohammad Bin Fahd University, Alkhobar, Saudi Arabia, snakla@pmu.edu.sa
Abstract— This paper reports on the perceptions of a group of academics regarding the role of higher education institutions in dealing with cheating when completing on-line assessments. A thematic approach to data collection and analysis was utilized. The findings showed there was an ad-hoc approach to the issue of academic integrity and dealing with cheating. While institutional policies did exist, concerns were expressed as to their overall effectiveness. Additionally, faculty were not provided with sufficient training in the use of detection methods and the use of available systems and processes to ensure academic rigour in relation to cheating in on-line assessments. The findings have implications for institutions in the development and implementation of academic misconduct policies.

Keywords—online, faculty, students, cheating, tools, plagiarism.

This research was funded by Prince Mohammad Bin Fahd University.

I. INTRODUCTION
This paper reports on a small exploratory study conducted in a UK university. The study examined the perceptions of a group of faculty as to how academic dishonesty (cheating) can be minimized and academic integrity achieved and sustained when using on-line assessments. On-line instruction has been growing exponentially over the past two decades. For example, in 2002 a total of 1,602,970 students in higher education took at least one course online. By 2011 this had risen to 6,714,792 students taking one or more online classes [1]. Stack goes on to state that this represents an increase of 318.9%, or a 4.189 to one ratio [1]. This signifies a three-fold increase in the level of on-line participation, from 9.6% to 32.0% in 2011 [2]. Consequently, it can be argued that there has been a corresponding increase in on-line assessment as a feature of distance and eLearning programs provided by tertiary (higher) education institutions [2]. This situation gave rise to the following research problem: How can higher education institutions support faculty in ensuring the incidence of cheating can be minimized?

II. ONLINE ASSESSMENT
As the growth of e-learning delivery has occurred, on-line assessment has become more sophisticated, cost-efficient and easy to use, making it more attractive to educators [2]. Therefore, a question is posited as to what extent it is possible to trust the results achieved. For the purposes of this study, the definition of on-line assessment proposed by Pachler, Daly, Mor, and Mellar, as cited in Baleni [3], was utilized:
"the use of ICT to support the iterative process of gathering and analysing information about student learning by teachers as well as learners and of evaluating it in relation to prior achievement and attainment of intended, as well as unintended learning outcomes"
Furthermore, when a student submits an online assessment, is it possible to prove that he/she wrote it themselves or that they truly understand the subject or material? There has been ongoing concern expressed by educationalists about the perceived increase in the incidence of student academic dishonesty [4, 5, 6]. Academic dishonesty is deemed to be any act of deception perpetrated by the student with the intent to misrepresent one's learning achievement for evaluation purposes [7]. This is of particular concern for higher education institutions, as research has indicated that it increases with the age of the student through to age 25 [7, 8, 9, 10].
Additionally, there is a view that cheating is much easier in an online environment as faculty and students are separated by time and space [6]. There is a lack of research examining academic misconduct related to cheating. Where research has been conducted, it indicates that formal warnings and student counselling are the most preferred means of controlling the prevalence of cheating [7]. However, this appears not to be as successful from the perspective of faculty as it could be. Therefore, faculty members prefer more severe penalties for students involved in cheating. Faculty members are aware of the prevalence of different types of cheating strategies, but they fail to confront them due to lack of evidence [7].
Consequently, three key issues have been identified with regard to the use of on-line assessment [11]. The first is the difficulty with synchronicity of assessments; the second, security and the prevention of students hacking into the system to re-take the test; and the third, collusion, where someone other than the student takes the assessment.

III. METHODOLOGY
The study took place in a UK university under the auspices of the University's research ethics policy. As an exploratory study, a mixed-methods approach was used to gather the data. The form and nature of the research questions indicated the need to employ different data sources. For example, studies that answer who, what and when questions are more likely to be found in the quantitative domain. In order to attain a more in-depth
understanding of these questions, a researcher turns to the answering of how and why questions. Bryman [12] contends that the what, when and where questions support the achieving of an understanding of the causes and effects of people's actions, whereas the how and why questions allow for clarification of the underlying motivations or explanations of the behavior of the individual [13]. The use of questions of a how and why nature encourages the research participants to be self-reflective about their perceptions and views and how they construct meaning from the situations they find themselves in.
Convenience and purposive sampling was used to obtain the sample. The criteria for the sample were that the Faculty member used Blackboard to conduct on-line assessments and was available to be interviewed. This resulted in a sample size of six. The participants came from three schools: Computing and Creative Technology, Business, and Contemporary Science.
Data collection involved face-to-face interviews of approximately 30 minutes. The questionnaires were emailed to the participants prior to the interview in order to maximize the use of time. The questions were designed to elicit responses of both a qualitative and quantitative nature. The qualitative questions sought to identify the participants' personal opinions regarding cheating in on-line tests. The responses to the quantitative questions were summarized in terms of frequency of responses. The participants were also able to provide additional comments to these questions. The responses were analyzed using thematic analysis. The concurrent data collection and thematic analysis followed the six steps recommended by Braun and Clarke [14]. These are: familiarization with the data, the generation of initial codes, searching for themes, reviewing of themes, defining and naming of themes, and producing the report.

IV. RESULTS AND DISCUSSIONS
A. Quantitative questions
• Question 1: Do you believe the incidence of cheating has increased or decreased in the last five years?
Two of the participants said that the incidence of cheating has increased, whereas four participants stated that it has stayed the same. The participants who said it had increased indicated that the detection methods and the university rules for cheating have to be reviewed and reconsidered to overcome cheating, especially in the on-line environment. The participants who stated that cheating had stayed the same over the last five years indicated that if it had risen, urgent measures should be taken to combat this increase; see Fig. 1.

Fig. 1. Incidents of cheating over the last five years (bar chart of participant responses: increased, decreased, no change)

• Question 2: Have you used your institution's process to deal with cheating?
Five of the participants said that they have used the University process to deal with cheating. Only one participant had not used it before. This respondent was the one who stated he had not taken steps to prevent or identify cheating in his classes; see Fig. 2.

Fig. 2. Number of participants who used the institution's policy against cheating (bar chart of yes/no responses)

• Question 3: Please rate your satisfaction with the outcome(s) of the process.
One participant found the outcome of the institution's process acceptable; however, a lack of consideration and flexibility seems to occur. Two of the participants were not satisfied: one of them referred to the situation where the University had not prosecuted any of the students that had been caught cheating, whereas the other referred to the process as ridiculously formal and said that academic staff were frightened to use it. Three of the participants were satisfied with the institution's processes. However, they were emphatic in their view that there was a need for more training and development of Faculty; see Fig. 3.
Fig. 3. Rating of satisfaction with the university process (bar chart of responses: acceptable, satisfied, not satisfied)

B. Qualitative Questions
• Question 1: Briefly describe up to three incidents where you have detected cheating in online assessments in your subjects.
Five of the six participants stated that they have not detected any cheating in online assessments. Only one of the participants had detected cheating, as he caught some of the students attempting to use email to communicate and share answers during the test. Consequently, this has been stopped and email is blocked from being used during the test. Despite the participants not detecting cheating, they do believe that cheating exists. However, owing to poor resources, the lack of detection methods, and the software used to deliver the online assessment, it is not revealed.
• Question 2: How did the different types of cheating occur in your subjects?
All of the participants agreed that students collaborated closely with each other, leading to the possibility of copying work, and also engaged in plagiarism in terms of copying from the internet without references or varying the writing style.
Based on the responses of the participants, most of the cheating occurs because students copy from each other. The reason proffered for this is either that the University policy against cheating is not strict enough to deter the student, or that the instructor fails to take action when it is detected, thereby not deterring students from, or even encouraging them to, continue cheating. There also appeared to be no education for students in what comprises cheating and plagiarism.
• Question 3: How have you implemented processes to prevent cheating in your subjects over the last five years and how effective were they?
Half the participants used codes to prevent students from printing during an online test, and had an invigilator for the test, if taken in a formal classroom situation, to support the Faculty member. However, this is not always possible as invigilators were not easily available. Another approach was to have a set of questions where the order is different; thus, every student gets the same questions with slight variations of wording and a randomized order of questions. This helps prevent collusion or unintentional cheating if someone glances at another screen. Faculty did attempt to teach students how to avoid plagiarism and to make them aware of the consequences of cheating. Cheating can also be reduced by having a fixed time for the online test, ensuring enough time to use references properly if required, and ensuring that not every question is multiple choice. Participants stated that the steps taken were generally successful.
The other half of the participants considered that students were fundamentally honest and did not use any methods of detection other than the Blackboard facilities. However, all participants believed these were not robust enough to detect high levels of cheating. There was an awareness of the increasing sophistication of technology and the ability of students to manipulate their answers.
The participants observed that where steps were taken to prevent cheating, they were effective. However, they were difficult to implement as they were resource intensive, especially of Faculty time. Designing a randomized online test requires the instructor to prepare a large set of questions to be distributed on-line so that students have different questions from each other. Purchased test banks were not always feasible. Often Faculty were not provided with training in the appropriate software to complete this with ease and in a time-effective manner. This was deemed vital to meet the challenges presented by the exponential changes in technology.
• Question 4: What methods do you use to detect cheating?
All of the participants used their own memory when grading. They considered that students submitting their work together will have similar assignments and get scores that are very close together, thus raising suspicions that there is cheating. Also, if a student is receiving high grades and has not been attending the class, then this is suspicious. Only three of the participants use Plagiarism Detection Tools such as JISC, SafeAssign and iThenticate software. These are useful when students are copying and pasting from the internet and from fellow students.
• Question 5: How does your institution convey the policies and processes pertaining to cheating?
The University has its own academic deceit policy and procedures, which can be accessed from the University website; the most important part is quoted:
"The degrees and other academic awards of the University are granted in recognition of a student's individual achievement. Students are not permitted to seek unfair academic advantage, i.e. to cheat. Any deliberate attempt to obtain unfair advantage by one or more of a variety of means will be penalized."
Five of the six respondents suggested that the university policy is well known. However, it was also suggested that the University did not take action beyond having a policy. There was a suggestion that often the University [any university] could compromise its academic reputation if the true incidence of cheating became widely known.
• Question 6: How effective are the processes your institution uses to reduce cheating?
The participants stated that there needed to be more preemptive attempts by the University to reduce cheating. The University needed to be proactive in publicizing the Academic Deceit Policy; for example, students registering at the University for the first time could be given a copy of the Academic Deceit Policy and Procedures, with program tutors also ensuring that the policy is re-stated at the commencement of the course. Again, the importance of students understanding what comprises cheating was stressed.
• Question 7: How could the processes be improved?
All the participants were emphatic that more training and development for Faculty is urgently required. They were very open to finding out more about the latest techniques to prevent cheating, how they can change the assessment design, and how to use collaborative groups to share ideas. The University also has a responsibility to ensure that academic staff make use of the available processes and resources. There also needs to be a change in policy and practices, better reflecting the on-line environment.
• Question 8: Do you agree that a low grade weighting of the online test would reduce cheating?
There was disagreement between the participants as to the best strategy regarding the weighting of assessments. Two of the participants agreed that using a low grade weighting for online assessment will discourage students from cheating, as the penalty and being labelled a cheater if caught are not worth it, whereas with a highly weighted assessment the temptation to cheat may be greater. Two participants stated that this may depend on the students themselves; one participant preferred to have one big assessment for the module rather than many smaller ones, and thinks it does not work to have a low grade weighting for an assessment.

V. CONCLUSION AND RECOMMENDATIONS
The aim of this exploratory study was to identify the awareness of cheating in the on-line environment and Faculty's responses and satisfaction with efforts made to ensure academic integrity. It was found that there was limited awareness on the part of academic staff as to the potential and extent of cheating in on-line assessment. Moreover, the cheating and plagiarism tools available to academic staff to detect cheating were limited. The study provides a basis for further research investigating the challenges posed by the increase in on-line assessment and the potential for a growth in cheating in this form of assessment. This research includes, but is not limited to, the following recommendations.
It is recommended that the use of online cameras and biometric data to monitor students and verify their identification be investigated, especially for those students who are sitting tests away from the university. This would aid in reducing the potential for students to have someone else sit the exam for them.
It is recommended that the University act to provide invigilators for all on-line assessments. This may mean taking a non-traditional approach to their use, one that is more suited to on-line assessments as opposed to being present in a classroom. This would include supporting Faculty with the preparation of assessments so that they achieve 'best practice' in minimizing cheating.
Students' awareness of the academic integrity policies of the University and the signing of a code of conduct document are believed to lower the occurrence of cheating. Further research is needed to establish the effectiveness of such strategies. Further research is also needed into student awareness of what exactly constitutes cheating and plagiarism.
While the participants had not been involved with off-campus assessment, they were aware of the issues that could arise from this and stated that invigilators were essential if this were to occur. Further research is required to establish best practice in this regard.
It was noted that Faculty memory is not always effective, especially if there is a large number of students in the class. Therefore, it is recommended that the Plagiarism Detection Tools be upgraded and become a requirement for use by all academic staff, not only by a few, and that training be given by the University to the academic staff. It is recommended that institutions investigate the challenges presented for Faculty in this situation.
Faculty training in using Plagiarism Detection Tools such as JISC, SafeAssign and iThenticate software is imperative. The study showed that student cheating increased with faculty who do not use the tools.
In summary, the challenge of maintaining academic integrity is not going to go away. It is imperative that all higher education institutions are proactive in meeting the challenge and ensuring that Faculty are supported in their efforts to combat these issues, especially in the e-learning environment.

ACKNOWLEDGMENT
The authors would like to acknowledge the support and research funding by Prince Mohammad Bin Fahd University (PMU).
REFERENCES
[1] S. Stack, "Learning outcomes in an online vs traditional course," International Journal for the Scholarship of Teaching and Learning, vol. 9, no. 1, article 5, 2015.
[2] I. E. Allen and J. Seaman, "Changing course: Ten years of tracking online education in the United States," Babson Survey Research Group and Quahog Research Group, LLC, 2013.
[3] Z. Baleni, "Online formative assessment in higher education: Its pros and cons," The Electronic Journal of e-Learning, vol. 13, no. 4, pp. 228-236, 2015.
[4] M. J. Bishop and M. Cini, "Academic dishonesty and online education (Part 1): Understanding the problem," https://evolllution.com/revenue-streams/distance_online_learning/academic-dishonesty-and-online-education-part-1-understanding-the-problem/, accessed 1 April 2019.
[5] N. Rowe, "Cheating in online student assessment: beyond plagiarism," Online Journal of Distance Learning Administration, vol. 7, no. 2, 2004.
[6] G. Watson and J. Sottile, "Cheating in the digital age: do students cheat more in online courses?" Online Journal of Distance Learning Administration, vol. 13, no. 1, Spring 2010.
[7] P. Singh and R. Thambusamy, "To cheat or not to cheat, that is the question: undergraduates' moral reasoning and academic dishonesty," 7th International Conference on University Learning and Teaching, 2016.
[8] G. J. Cizek, Cheating on Tests: How to Do It, Detect It and Prevent It. Mahwah, NJ: Lawrence Erlbaum, 1999.
[9] M. Dick et al., "Addressing student cheating: definitions and solutions," ACM SIGCSE, vol. 35, no. 2, pp. 172-184, 2003.
[10] A. Lathrop and K. Foss, Student Cheating in the Internet Era: A Wake-Up Call. Englewood, CO: Libraries Unlimited, 2000.
[11] M. Olt, "Ethics and distance education: Strategies for minimizing academic dishonesty in online assessment," Online Journal of Distance Learning Administration, vol. 5, no. 3, 2002.
[12] A. Bryman, Quality and Quantity in Social Research. London: Unwin Hyman, 1988.
[13] V. Braun and V. Clarke, "Using thematic analysis in psychology," Qualitative Research in Psychology, vol. 3, no. 2, pp. 77-101, 2006.
[14] A. Bryman, Social Research Methods. New York: Oxford University Press, 2001.
Deep Learning Assisted Smart Glasses as Educational Aid for Visually Challenged Students

Hawra AlSaid, Lina AlKhatib, Aqeela AlOraidh, Shoaa AlHaidar, Abul Bashar
College of Computer Engineering and Sciences
Prince Mohammad Bin Fahd University
Al-Khobar, Saudi Arabia 31952
Email: abashar@pmu.edu.sa
Abstract— Computer Vision Technology has played a significant role in assisting visually challenged people to carry out their day-to-day activities without much dependency on other people. Smart glasses are one such solution which enables blind or visually challenged people to "read" images. This paper is an attempt in this direction to build a novel smart glass which has the ability to extract and recognize text captured from an image and convert it to speech. It consists of a Raspberry Pi 3 B+ microcontroller which processes the image captured from a webcam super-imposed on the glasses of the blind person. Text detection is achieved using the OpenCV software and the open source Optical Character Recognition (OCR) tools Tesseract and the Efficient and Accurate Scene Text Detector (EAST), based on Deep Learning techniques. The recognized text is further processed by Google's Text to Speech (gTTS) API to convert it to an audible signal for the user. A second feature of this solution is to provide location-based services to blind people by identifying locations in an academic building using RFID technology. This solution has been extensively tested in a university environment for aiding visually challenged students. The novelty of the implemented solution lies in providing the desired computer vision functionalities of image/text recognition in a way which is economical, small-sized, accurate and uses open source software tools. This solution can potentially be used for both educational and commercial applications.

Keywords: Image Recognition; Speech processing; Optical Character Recognition; Deep Learning; Raspberry Pi; Python.

I. INTRODUCTION
In our societies, there are many people who suffer from different diseases or handicaps. According to the World Health Organization (WHO), about 8% of the population in the eastern Mediterranean region has vision difficulties, which include blindness, low vision and other kinds of visual impairment [1]. Such people need to be provided with special facilities so that they can live comfortably. Especially in the field of education, there are special schools and universities for people with special needs [2]. Most blind people and people with vision difficulties were not in a position to complete their studies, because special schools for people with special needs are not available everywhere and most of them are private and expensive. So the only alternative was that they study at home, acquiring basic knowledge from their parents. This education was not technical enough, and hence they could not compete with other people. There are different levels of needs, and not all levels require special places and special schools. For instance, people with vision difficulties can study with other students if they have an appropriate environment. In order to solve this issue, we can use the help of computer vision technology to make special aids with which visually impaired people can live as comfortably as possible.
It is observed that most blind people are intelligent and can study if they have the chance to do so in regular government-administered schools, as these exist almost everywhere. It is a misconception among the majority to think that people who are blind or have vision difficulties cannot live alone and need the help of other people at all times. In fact, they do not need help all the time; they can be independent most of the time and they have the chance to live like other people.
One of the popular solutions in this scenario is to use Smart Glasses for visually impaired people [3]. These types of glasses make use of computer vision hardware and software tools (camera, image processing, image classification and speech processing). Such a solution gives visually impaired people a chance to lead a comfortable life with other people and study in any school or university without needing the help of other people every time. It has been observed that the use of Smart Glasses has increased the percentage of educated people. Most schools, colleges and universities are accepting students with vision difficulties. It is expected that from the next academic year Prince Mohammad bin Fahd University (PMU) will accept blind students for admission [4]. The college would like to start using smart glasses for the first time in this setup and help students to improve their education level with minimum assistance from the instructor.
This was the motivation behind the design and development of smart glasses to help blind and visually impaired students with their studies. These glasses are designed to use computer vision technology to capture an image, extract English text and convert it into an audio signal with the aid of speech synthesis. Also, it was decided to add a feature for translating text/words from English to Arabic, as the majority of the students at PMU are Arabic speaking.
The main objectives of the proposed system can now be summarized as follows: capturing an image, extracting text from the image, identifying the correct text, converting text to speech, translating the text to another language, integrating the
different hardware and software modules, and testing and troubleshooting the working of the proposed system.
The rest of the paper is organized as follows. Section II will provide an overview of the various solutions provided in the area of using deep learning based computer vision techniques for implementing smart glasses for the visually impaired. The details of the proposed system design and implementation will be presented in Section III. Experimental results and their implications will be discussed in Section IV. Conclusions and future directions of the proposed solution will be provided in Section V.

Fig. 1: Conceptual Design of the Proposed System

II. RELATED WORK
Text detection and recognition have been a challenging issue in different computer vision fields. There are many research papers that have discussed different methods and algorithms for extracting text from images. The main purpose of this literature review is to evaluate some of these methods and their effectiveness regarding their text detection accuracy rates.
In end-to-end text recognition, the power of Convolutional Neural Networks combined with new unsupervised feature learning was leveraged, taking advantage of a known training framework to achieve high accuracy in the text and character detection and recognition modules. These two models were combined using simple methods to build an end-to-end text recognition system. The datasets used are ICDAR 2003 and SVT. The 62-way character classifier obtained 83.9% accuracy for cropped characters from the first dataset [5].
In novel scene text recognition, which is an algorithm that mainly depends on machine learning methods, two types of classifiers were designed to achieve higher accuracy: the first was developed to generate candidates, while the second was for filtering out non-textual candidates. A novel technique was developed to take advantage of multi-channel information. Two datasets were used in this study, ICDAR 2005 and ICDAR 2011. This method achieved significant results in different evaluation scenarios [6].
PhotoOCR is a system designed to detect and extract any text from any image using machine learning techniques; it also used distributed language modeling. The goal of this system was to recognize text in challenging images, such as poor quality or blurred images. This system has been used in different applications such as Google Translate. The datasets used for this system are ICDAR and SVT. The results showed that the processing time for text detection and recognition is around 600 ms for one image [7].
For text recognition in natural scene images, an accurate and robust solution has been proposed which uses an MSER algorithm to detect almost all characters in any image. The datasets used for this system are the ICDAR 2011 and Multilingual datasets. The results showed that the MSER approach achieved 88.52% accuracy in character-level recall [8].
An end-to-end real-time text recognition and localization system used an ER (External Regions) detector that covered about 94.8% of the characters, and the processing time for an image with 800x600 resolution was 0.3 s on a standard personal computer. The system used two datasets, ICDAR 2011 and SVT. On the ICDAR 2011 dataset, the method achieved 64.7% image-recall; for the SVT dataset, it achieved 32.9% image-recall [9].
Text detection and localization using Oriented Stroke Detection is a method that took advantage of two important methods, connected components and a sliding window. A character or letter is recognized as a region in the image that has strokes in a particular direction and position. The dataset used is ICDAR 2011; the experimental results showed 66% recall, which is better than previous methods [10].
The Efficient and Accurate Scene Text (EAST) detector method is a simple and powerful pipeline that allows detecting text in natural scenes, and it achieves high accuracy and efficiency. Three datasets were used in this study: ICDAR 2015, COCO-Text and MSRA-TD500. The experiments showed that this method has better results than previous methods regarding accuracy and efficiency [11].
Tesseract is an open source Optical Character Recognition (OCR) engine whose development has been sponsored by Google Inc. Version 4 of Tesseract is based on a deep learning-based artificial recurrent neural network called the Long Short-Term Memory (LSTM) architecture. This OCR engine can support up to 116 languages and has reasonable character recognition rates [12].
The idea of smart glasses is to create wearable computer vision-based glasses for various purposes (e.g. reading, search, navigation). These uses determine the type of practical glasses that need to be designed and developed. In the early stages of development, smart glasses were simple and provided features for carrying out basic tasks, serving as a front-end display for a remote system. However, recent smart glasses have become more sophisticated and provide several features for aiding blind and visually impaired people. A comparative summary of practically implemented smart glasses solutions is presented in Table I. It provides a description of the conceptual model, their benefits, drawbacks and possible improvements required to make them better. It also identifies the research gap in this area, which is filled by our proposed solution (CCES, PMU).
Based on the literature survey presented above regarding the different text detection techniques from images, we propose to implement a novel solution with the following features; the conceptual design of the proposed system is shown in Fig. 1.
Table I: Comparative Summary of Smart Glasses Solutions
(columns: Solution | Developer | Conceptual Design | Benefits | Drawbacks | Improvements)

eSight 3 | CNET's | High resolution camera for image and video capture for low-vision people | Helps low vision people, avoiding surgery | Does not improve vision as it is just an aid | Waterproof versions are under development

Oton Glass | Keisuke Shimakage, Japan | Symbols-to-audio conversion, normal looking glasses, supports English and Japanese languages | Support for dyslexic people; converts images to words and then to audio | Only for people with reading difficulty; no support for blind people | Can be improved to support blind people as well, by including proximity sensors

Aira | Suman Kanuganti | Aira uses smart glasses to scan the environment | Aira agents help users interpret their surroundings through the smart glasses | Waiting time to connect to the Aira agents in order to be able to sense | To include language translation features

Eyesynth | Eyesynth, Spain | Consists of 3D cameras which turn the scene into a (non-verbal) sound signal to provide information about position, size, and shape; it is language independent | Allows blind/limited-sight people to 'feel the space' through sounds; it converts spatial and visual information into audio | It is expensive (575.78$) and it only recognizes objects and directions | Can use verbal audio for better feel and navigation services

Google Glasses | Google Inc. | Google Glasses show information without using hand gestures; users can communicate with the Internet via normal voice commands | Can capture images and videos, get directions, send messages, make audio calls and perform real-time translation using the Word Lens app | It is expensive (1,349.99$) and the glasses are not very helpful for blind people | Reduce costs to make it more affordable for consumers

Our proposed solution | CCES, PMU | Helps people who have vision difficulties, especially blind people | Helps blind people to avoid obstacles, aids in reading and learning, converts image to text and searches for information about the words on the Internet | Currently it supports only the English language, cannot be used while driving, the processing unit is separated from the glasses, and it captures objects within a specific range of distances | The glasses can support other languages; they can also be made smaller and easier to wear

a. Camera and ultrasonic sensor based smart glasses to capture an image having embedded text.
b. Optical character recognition software based on the EAST and Tesseract open source tools.
c. Google text-to-speech conversion of the identified text for the visually challenged person to hear.
d. An RFID-based navigation system to enable the visually impaired person to explore the academic building for locating various lecture and lab rooms.
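To illustrate how features (a)-(d) could fit together on the Raspberry Pi, the following minimal Python sketch outlines one possible control flow; the helper functions are hypothetical stubs, not the authors' implementation, and stand in for the camera, EAST/Tesseract OCR, gTTS and RFID components described in Section III.

    # Minimal orchestration sketch of the proposed smart-glasses flow (a)-(d).
    # All helpers below are hypothetical stubs for the real components.
    def capture_image():
        return "frame.jpg"                      # (a) placeholder for a webcam capture

    def extract_text(image_path):
        return "EXIT"                           # (b) placeholder for EAST detection + Tesseract OCR

    def speak(text, lang="en"):
        print(f"[speaking in {lang}]: {text}")  # (c) placeholder for gTTS audio output

    ROOMS = {"tag-101": "Lecture Hall 1", "tag-202": "Computer Lab 2"}   # hypothetical tag map

    def announce_location(tag_id):
        speak(ROOMS.get(tag_id, "Unknown location"))   # (d) RFID tag mapped to a room name

    if __name__ == "__main__":
        speak(extract_text(capture_image()))    # Button 1 press: read the text aloud
        announce_location("tag-101")            # RFID event: announce the current room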
III. SYSTEM DESIGN AND IMPLEMENTATION
The proposed system consists of two main parts, the hardware and the software. This section describes the details of the hardware and software components used. Fig. 2 shows the process diagram of the proposed system.

A. Hardware Design & Implementation
The various sub-components of the hardware system that have been used to make the smart glasses system are described below.

(i) Raspberry Pi 3 Model B+
The Raspberry Pi is a credit-card-sized computer. It needs to be connected to a keyboard, mouse, display, power supply, SD card and an installed operating system [14]. The Raspberry Pi is a low-cost embedded system that can perform many significant tasks. It can be run as a no-frills PC, a pocketable coding computer, a hub for homemade hardware and more. It includes GPIO (General Purpose Input/Output) pins to control various sensors and actuators. The Raspberry Pi 3 is used for many purposes such as education, coding, and building hardware projects. Here it is used as a low-cost embedded system to control and connect all of the I/O components together. It uses Raspbian or NOOBS as the operating system, which can accomplish many important tasks; for our solution we decided to work with Raspbian as the operating system.

(ii) Digital Camera
The webcam has a view angle of 60° with a fixed focus. It can
capture images with a maximum resolution of 1289 x 720 pixels. It is compatible with most operating system platforms such as Linux, Windows, and MacOS. It has a USB port and a built-in mono microphone. In the solution, the webcam is used as the eyes of the person who wears the smart glasses. The camera captures a picture when the button is pressed (called Button 1, see Fig. 4), in order to detect and recognize the text from the image.

Fig. 2: Process Diagram of the Proposed System

(iii) Ultrasonic Sensor
The purpose of ultrasonic sensors is to measure distance using ultrasonic waves. Ultrasonic sensors emit ultrasonic waves and receive back the echo; by measuring the time of flight, the sensor measures the distance to the object. It can sense distances in the range from 2-400 cm. In the smart glasses, the ultrasonic sensor is used to measure the distance between the camera and an object in order to detect the text in the text image. It was observed, based on experimentation, that the distance to the object should be from 40 cm to 150 cm to capture a clear image (see Fig. 4).
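The distance check described above can be sketched as follows, assuming a trigger/echo style sensor (for example an HC-SR04) wired to two GPIO pins; the pin numbers are illustrative assumptions, and the 40-150 cm gate mirrors the experimentally observed range reported here.

    # Sketch: distance gating for image capture, assuming a trigger/echo
    # ultrasonic sensor on the Raspberry Pi GPIO header. Pin numbers are hypothetical.
    import time
    import RPi.GPIO as GPIO

    TRIG, ECHO = 23, 24
    GPIO.setmode(GPIO.BCM)
    GPIO.setup(TRIG, GPIO.OUT)
    GPIO.setup(ECHO, GPIO.IN)

    def distance_cm():
        GPIO.output(TRIG, True)
        time.sleep(0.00001)                   # 10 microsecond trigger pulse
        GPIO.output(TRIG, False)
        start = end = time.time()
        while GPIO.input(ECHO) == 0:          # wait for the echo pulse to start
            start = time.time()
        while GPIO.input(ECHO) == 1:          # wait for the echo pulse to end
            end = time.time()
        return (end - start) * 34300 / 2.0    # speed of sound ~343 m/s; halve the round trip

    if 40 <= distance_cm() <= 150:            # capture only within the tested 40-150 cm range
        print("Object in range: capture the image")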
(iv) RFID Sensor

A Radio Frequency IDentification (RFID) sensor consists of two main devices, namely the RFID reader and the RFID tag. The RFID tag has digital data, integrated circuits, and a tiny antenna to send information to the RFID reader. Signal frequencies are usually between 125 to 134 kHz and 140 to 148.5 kHz for low frequencies, and 850 to 950 MHz and 2.4 to 2.5 GHz for high frequencies.

The RFID reader is mainly used to collect information from the RFID tag with the help of electromagnetic fields. The process of transferring data from the tag to the reader is done by radio waves. However, in order to achieve this process successfully, the RFID tag and the RFID reader should be within a range between 3-300 feet. Any object can be identified quickly when it is scanned, and the RFID can recognize it. RFID has many applications such as passports, smart cards, and home applications. The RFID sensor is used in our solution by attaching RFID readers in the hall and various classrooms so the blind person can recognize them.
(v) Headphones

Wired headphones were used in our solution since the Raspberry Pi 3 Model B & B+ come with an audio jack, and it is better to take advantage of this feature rather than occupying one of the four USB ports that can be useful for other peripherals. The headphones are used to help the user listen to the text that has been converted to audio after it has been captured by the camera, or to listen to the translation of the text. The headphones are small, lightweight and connected to the glasses, so the user will not be concerned about losing them or feel uncomfortable wearing them.

B. Software Design & Implementation

Below are the software components that have been used in the proposed system for programming the functionalities of the smart glasses system.

(i) OCR Tools: Tesseract and EAST

OCR (Optical Character Recognition) is used to convert typed, printed or handwritten text into machine-encoded text. There are some OCR software engines which try to recognize any text in images, such as Tesseract and EAST. In this project Tesseract version 4 is used because it is one of the best open source OCR engines. The OCR process consists of multiple stages, as shown in Fig. 3.

Fig. 3: Procedural steps during the OCR process

(a) Preprocessing: The main goal of this step is to reduce the noise that resulted from scanning the document, where the characters might be broken or smeared and cause poor rates of recognition. Preprocessing is done by smoothing the digitized characters through filling and thinning. Another aim of this step is to normalize the data to get characters of uniform size, rotation, and slant. Moreover, significant compression in the amount of information is achieved through thresholding and thinning techniques.

(b) Segmentation: In this process, the characters or words are isolated. The words are segmented into isolated characters that are recognized separately. Most OCR algorithms segment words into isolated characters which are recognized individually. Usually, segmentation is done by isolating every connected component.

(c) Feature Extraction: This process captures the significant features of symbols, and it has two types of algorithms, which are pattern recognition and feature extraction/detection.

(d) Classification: OCR systems use the techniques of pattern recognition that assign an unknown sample into a predefined class. One of the ways to help in character classification is to use an English dictionary.

(e) Post-processing: This process includes grouping and error detection & correction. In grouping, the symbols are related to strings. The result of plain symbol recognition in the text is a group of individual symbols.
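For concreteness, the following minimal sketch shows how a captured frame can be passed to Tesseract 4 through the pytesseract wrapper together with OpenCV. This is assumed, illustrative usage (the file name and thresholding choice are placeholders), not the authors' implementation; in the full pipeline the EAST detector would first localize text regions and only those regions would be handed to Tesseract.

import cv2
import pytesseract

image = cv2.imread("captured_frame.jpg")        # frame captured by the webcam
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
# Simple preprocessing: Otsu thresholding reduces noise before recognition
_, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
text = pytesseract.image_to_string(binary, lang="eng")
print(text)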
(ii) OpenCV Libraries
OpenCV is a library of programming functions for real-time computer vision; the library is cross-platform and free to use under the open-source BSD license [15]. For the installation of the OpenCV 4 libraries, the recommended operating system for the Raspberry Pi 3 B+, which is Raspbian Stretch, was installed. Win32 Disk Imager was used to flash the SD card.
(iii) Google Text to Speech (gTTS) API
One of the most important functions of the smart glasses is text to voice conversion. In order to implement this task, we installed gTTS (Google Text-to-Speech). It is a Python library that interfaces with the Google Translate text-to-speech API [13]. gTTS has many features, such as converting unlimited lengths of text to voice, correcting pronunciation errors through customizable text pre-processors, and supporting many languages which can be retrieved when needed. We used gTTS to perform language translation from English to Arabic, which is triggered by Button 2 (see Fig. 4).
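The snippet below is a short, hedged illustration of the gTTS usage described above; the text string and output file name are placeholders rather than values from the paper.

from gtts import gTTS

recognized_text = "Electronics Lab & Embedded Systems Lab"
tts = gTTS(text=recognized_text, lang="en")   # use lang="ar" for the Arabic output
tts.save("speech.mp3")
# The saved MP3 can then be played back through the wired headphones,
# for example with a command-line player such as mpg123.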

IV. EXPERIMENTAL RESULTS AND DISCUSSION


Fig. 4: Procedural steps for the design and development of the prototype

A. Text Recognition from captured image

In this result, the main goal was to check if the text detector used in this solution, which is the EAST pre-trained text detector, and the text recognizer, which is OCR using Tesseract, are working well. The test showed mostly good results on big clear texts and failed on small texts. We found that the recognition depends on the clarity of the text in the picture, its font theme, size, and spacing between the words. One such result is presented in Fig. 5. Here the image had the text “Electronics Lab & Embedded Systems Lab”, which was recognized as “Electronics Lab Embedded Svstems Lab”, which can be said to be about 80% accurate.

Fig. 5: OCR result with EAST detector

B. Text to Speech Conversion

In this result, the idea was to check if the detected text is converted to audio. We used the gTTS (Google Text To Speech) library after it was found that the voice quality was better and clearer than the other TTS libraries such as Festival TTS, Espeak TTS, and Pico TTS. The voice was clear for the correctly detected words. One such result is presented in Fig. 6. Here the text captured from the image was “Moonshine”, which was accurately converted into an audio signal that was clearly audible.

Fig. 6: Google Text to Speech result

C. Final Prototype

Fig. 7 shows the images of the final prototype which we developed. As can be seen, the hardware circuit (Raspberry Pi processor board) can be worn on the arm and is mobile with the user. The glasses, which can be worn on the face, consist of normal sunglasses mounted with the camera and the ultrasonic sensor. We admit that this is not very comfortable, but it has other benefits like low cost ($330), open source hardware (Raspberry Pi) and software (OpenCV, Tesseract and Python). Fig. 7 also shows the testing and working of the RFID-based navigation unit. It provides the visually challenged with information (in audio signals) about the current location in the academic building.

Fig. 7: Final developed prototype

D. Challenges and Limitations

During the implementation of this system, the first challenge was to decide which microcontroller would be appropriate. After some research it was found that the Raspberry Pi has the required features suitable for the system objectives. Initially, the Raspberry Pi Zero W was the choice because it was smaller and lighter than the other versions. It could be set up easily on the glasses rather than the person holding it by hand. However, we also found that the Raspberry Pi 3 has higher processing power, since it consists of a quad core processor and has a faster processing speed than the Raspberry Pi Zero W. It also has a larger memory and extra I/O ports for connecting peripheral devices. Since we also decided to use OCR in conjunction with OpenCV and the Efficient and Accurate Scene Text Detector (EAST), the Raspberry Pi should be able to do multitasking efficiently. We therefore finally decided to opt for the Raspberry Pi 3.

We had proposed glasses with a built-in camera, and it turned out that this built-in camera is not compatible with Raspbian OS; it was only compatible with Windows and MacOS. We tried to install Windows IoT (an operating system from Microsoft designed for use in embedded systems) on the Raspberry Pi, but unfortunately it did not work. To solve this problem, we decided to use a webcam which was compatible with the Raspberry Pi.

V. CONCLUSIONS AND FUTURE WORK

This paper has proposed, implemented and tested novel smart glasses for visually challenged students, which have the features to capture an image, extract and recognize the embedded text and convert it to speech. Our design and implementation was a practical demonstration of how open source hardware and software tools can be integrated to provide a solution which is low-cost, lightweight, re-configurable and scalable. However, there are some limitations in our proposed solution which can be addressed in future implementations. Hence, we recommend the following features to be incorporated in the future versions.
● In order to cater to a wide variety of users, it would be worthwhile to include a multi-lingual feature (e.g. French or Urdu) in the speech translation module.
● To improve the direction and warning messages to the user, we can include a GPS-based navigation and alert system.
● To provide for more spatial visibility, we can include a wide angle camera (e.g. 270° as compared to the 60° currently used).
● Finally, to provide for a more real-time experience, we can include video processing instead of still images.

ACKNOWLEDGEMENT


We sincerely thank the College of Computer Engineering
and Science (CCES) and the management of Prince
Mohammed Bin Fahd University (PMU) for their cooperation
and support in accomplishing this BS senior design project.
REFERENCES
[1] World Health Organization, "Global data on visual impairments 2010".
Retrieved from: https://www.who.int/blindness/GLOBALDATAFINAL
forweb.pdf (Last accessed on 30th May, 2019).
[2] Best Colleges, “ College guide for students with visual impairments”,
Retrieved from: https://www.bestcolleges.com/resources/ college-
planning-with-visual-impairments/ (Last accessed on 30th May, 2019)
[3] Google Inc., "Google Glasses". Retrieved from https://en.wikipedia.org/
wiki/Google_Glass (Last accessed on 30th May, 2019).
[4] Humanitarian projects, "Prince Sultan bin Abdulaziz College for the
visually impaired". Retrieved from http://www.princemohammad.org/
en/Initiatives-College-for-the-Visually-Impaired.aspx (Last accessed on
30th May, 2019).
[5] T. Wang, D. J. Wu, A. Coates, A. Y. Ng, "End-to-end text recognition
with convolutional neural networks” IEEE 21st International Conference
on Pattern Recognition (ICPR), pp.3304-3308, 2012.
[6] H. I., Koo, D. H. Kim, "Scene text detection via connected component
clustering and non-text filtering" IEEE Transactions on Image
Processing, 22(6), pp. 2296-2305.
[7] A. Bissacco, M. Cummins, Y. Netzer, H. Neven, "Reading text in
uncontrolled conditions", Proceedings of the IEEE International
Conference on Computer Vision, pp. 785-792, 2013.
[8] X. C. Yin, X. Yin, K. Huang, H. W. Hao, "Robust text detection in
natural scene images " IEEE transactions on Pattern Analysis and
Machine Intelligence, 36(5), pp. 970-980, 2014.
[9] L. Neumann, J. Matas, "Real-time scene text localization and
recognition ", Proceedings of IEEE International Conference on
Computer Vision and Pattern Recognition (CVPR), pp. 3538-3545,
2012.
[10] L. Neumann, J. Matas, "Scene text localization and recognition with
oriented stroke detection", Proceedings of IEEE International
Conference on Computer Vision, pp. 97-104, 2013.
[11] X. Zhou, C. Yao, H. Wen, Y. Wang, S. Zhou, W. He, J. Liang, "EAST: an efficient and accurate scene text detector", Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5551-5560, 2017.
[12] GitHub Inc., "Tesseract OCR". Retrieved from https://github.com/tesseract-ocr (Last accessed on 30th May, 2019).
[13] Python Software Foundation, "Python". Retrieved from https://www.python.org/ (Last accessed on 30th May, 2019).
[14] Raspberry Pi 3 B+: Retrieved from https://www.raspberrypi.org/ (Last accessed on 30th May, 2019).
[15] OpenCV. Retrieved from https://opencv.org/ (Last accessed on 30th May, 2019).

DeepDR: An image guided diabetic retinopathy detection technique using attention-based deep learning scheme

Noman Islam1, Umair Saeed2, Rubina Naz3, Jaweria Tanveer4, Kamlesh Kumar5, Aftab Ahmed Shaikh6
1 Iqra University
2-6 Sindh Madrassatul Islam University, Karachi

Abstract- This paper proposes an efficient and cost effective deep learning architecture to detect diabetic retinopathy in real time. Diabetes is a leading root cause of eye disease in patients. It damages the eye vessels and releases blood from the vessels. Early detection of diabetic retinopathy is useful to reduce the risk of blindness or any hazard. In this paper, after some pre-processing and data augmentation, InceptionV3 is used as a pre-trained model to extract the initial feature set. A convolutional neural network has been used with attention layers. These additional CNN layers are added to extract the deep features to improve classification performance and accuracy. Initially, the model was proposed by Kevin Mader on Kaggle. The paper introduced additional layers in the proposed model and improved the validation and testing accuracy significantly. More than 90% validation accuracy was achieved with the proposed Convolutional Neural Network model. Testing accuracy was improved by up to 5%. This improvement in accuracy is very significant because the dataset is imbalanced and contains noisy images. It is concluded that the global average pooling (GAP) based attention mechanism increased the deep learning architecture's accuracy in detecting diabetic retinopathy in an imbalanced and noisy image dataset.

Keywords: diabetic retinopathy, deep learning, transfer learning, convolutional neural network, attention mechanism, global average pooling

1. Introduction

Diabetes mellitus has reached an epidemic level globally and according to some statistics it will reach 360 million people by 2030. Despite decades of intense research, diabetic retinopathy (DR) is still the leading cause of visual loss all over the world and accounts for 28% of diabetes patients in the USA. Specifically, it is quite prevalent among the working age population. Patients who suffered from visual loss due to this problem often reflect late diagnosis of diabetes, or sometimes they are unaware of diabetes and eye problems. It has been observed that an earlier diagnosis of retinopathy can prevent or avoid a significant proportion of visual loss. This can also ease the healing process or stop the progression of the disease. However, accurate diagnosis of this disease and identifying the stage of the disease is a challenge. Often the ophthalmologist performs the screening through visual inspection of the fundus and evaluation of color photographs. However, this is an expensive and time consuming process. Most of the patients of diabetic retinopathy live in underdeveloped areas where specialists and the diagnostic infrastructure are not available. Early detection of the disease and treatment is very essential to combat the increasingly large number of retinopathy patients. It can be said that a multidisciplinary approach is required for catering to this challenge.

In this paper, a machine learning approach to the diagnosis of diabetes mellitus is proposed. Machine learning is the branch of artificial intelligence that is based on learning a model from data that can later on perform prediction. So, the paper proposes an approach based on a convolutional neural network to perform the classification task. The images of the fundus are acquired and a convolutional neural network model is trained that provides improved accuracy compared to conventional approaches.

2. Literature Review

In Table 1 previous work is summarized with the respective accuracy. Nursel Yalçin et al. [1] proposed a deep learning based approach for DR disease classification. After some pre-processing, CNN was used to classify the disease in an image dataset with 98.5% validation accuracy. Omer Deperlioglu et al. [2] proposed a CNN based deep learning model; 96.67% validation accuracy was achieved. Darshit Doshi et al. [3] proposed a CNN model with 0.386 accuracy. Three deep learning models were proposed. Image channels (Green, Red) were extracted from the original images and were given to the models respectively. Arkadiusz Kwasigroch et al. [4] proposed a CNN based decision support system for DR disease classification; 82% validation accuracy was claimed. A fully connected convolutional neural network was proposed by Manaswini Jena et al. [5]. The model validation accuracy was claimed as 91.66%. Xiaoliang Wang et al. [6] used a deep learning model with 63.23% validation accuracy. The proposed model was based on the pre-trained model InceptionNetV3.

HaiQuan Chen et al. [7] obtained validation accuracy up to 80.0%. A deep neural network model was discussed in their paper. Abhay Shah et al. [8] described a CNN model with 53.57% accuracy. Igi Ardiyanto et al. [9] proposed a deep learning model for assessment of DR disease in an embedded system. This model was named Deep-DR-Net. The accuracy of this model was claimed up to 65.40%. Hanung Adi Nugroho [10] discussed three different approaches. The first approach was based on pathologies. The second approach was based on foveal avascular zone (FAZ) structure. In the third approach, deep learning was proposed with more than 95% validation accuracy.


FengLi Yu et al. [11] obtained 95.42% validation accuracy using a deep learning model. CNN was used to classify the DR in images. Bhavani Sambaturu et al. [12] achieved 91% validation accuracy via deep learning techniques. Yashal Shakti Kanungo et al. [13] discussed a deep learning model with 88% training accuracy. Syahidah Izza Rufaida et al. [14] achieved 51.05% accuracy using a CNN deep learning model. Ratul Ghosh et al. [15] proposed two deep learning techniques for two DR stages; 95% and 85% validation accuracy were achieved respectively. Roy [16] explained a model based on a fuzzy C mean based technique to extract the features and a support vector machine to classify the features. Dong et al. [17] proposed a wavelet based feature classification technique with up to 84% validation accuracy. S. Choudhury et al. [18] extracted features using a Fuzzy C mean based feature extraction technique. These extracted features were classified using support vector machines. This model obtained 97.6% validation accuracy. The validation accuracy of previous work and our proposed work is shown in Figure 12.

Table 1: Comparison between different proposed models

S.# | First author, Year | Model | Validation Accuracy (%)
1 | Nursel Yalcin, 2018 | CNN | 98.5
2 | Omer Deperlioglu, 2018 | CNN | 96.67
3 | Arkadiusz Kwasigroch, 2018 | CNN | 82
4 | Manaswini Jena, 2018 | CNN | 91.6
5 | Xiaoliang Wang, 2018 | CNN, InceptionNetV3 | 63.3
6 | HaiQuan Chen, 2018 | CNN | 80
7 | Abhay Shah, 2018 | CNN | 53.5
8 | Igi Ardiyanto, 2017 | CNN | 73.3
9 | FengLi Yu, 2017 | CNN | 95.4
10 | Bhavani Sambaturu, 2017 | CNN | 91
11 | Yashal Shakti Kanungo, 2017 | CNN | 88
12 | Syahidah Izza Rufaida, 2017 | CNN | 50.05
13 | Ratul Ghosh, 2017 | CNN | 95
14 | Arisha Roy, 2017 | Fuzzy C mean, SVM | 96.23
15 | Yanyan Dong, 2017 | CNN, SVM | 94.07
16 | S. Choudhury, 2016 | Fuzzy C mean, SVM | 97.6
17 | Darshit Doshi, 2016 | CNN | 38.6
18 | Our Proposed Model | CNN | 94.3

3. Proposed Methodology

Several architectures were trained and tested with different pre-trained models like DenseNet, MobileNet, InceptionV3, VGG16 and VGG19. Optimized results were obtained with the InceptionV3 architecture. Initially we utilized the attention mechanism based CNN with the pre-trained InceptionV3 model discussed on Kaggle for this dataset. This model was proposed by Kevin Mader initially. We contributed to this model by adding some layers to improve the performance and accuracy. The initial layers were used to learn deeper features. The last layers were used to classify the DR label. Following is a detailed overview of our proposed CNN architecture. Table 2 describes further details about our proposed deep learning architecture.

Table 2: Detailed description about our proposed CNN architecture

Layer (type) | Output Shape | Param # | Connected to
input_3 (InputLayer) | (None, 512, 512, 3) | 0 |
xception (Model) | (None, 16, 16, 2048) | 20861480 | input_3[0][0]
batch_normalization_10 (BatchNorm) | (None, 16, 16, 2048) | 8192 | xception[1][0]
dropout_4 (Dropout) | (None, 16, 16, 2048) | 0 | batch_normalization_10[0][0]
conv2d_15 (Conv2D) | (None, 16, 16, 64) | 131136 | dropout_4[0][0]
conv2d_16 (Conv2D) | (None, 16, 16, 16) | 1040 | conv2d_15[0][0]
conv2d_17 (Conv2D) | (None, 16, 16, 8) | 136 | conv2d_16[0][0]
conv2d_18 (Conv2D) | (None, 16, 16, 4) | 36 | conv2d_17[0][0]
conv2d_19 (Conv2D) | (None, 16, 16, 1) | 5 | conv2d_18[0][0]
conv2d_20 (Conv2D) | (None, 16, 16, 2048) | 2048 | conv2d_19[0][0]
multiply_2 (Multiply) | (None, 16, 16, 2048) | 0 | conv2d_20[0][0], batch_normalization_10[0][0]
global_average_pooling2d_3 (GAP) | (None, 2048) | 0 | multiply_2[0][0]
global_average_pooling2d_4 (GAP) | (None, 2048) | 0 | conv2d_20[0][0]
RescaleGAP (Lambda) | (None, 2048) | 0 | global_average_pooling2d_3[0][0], global_average_pooling2d_4[0][0]
dropout_5 (Dropout) | (None, 2048) | 0 | RescaleGAP[0][0]
dense_4 (Dense) | (None, 128) | 262272 | dropout_5[0][0]
dropout_6 (Dropout) | (None, 128) | 0 | dense_4[0][0]
dense_5 (Dense) | (None, 64) | 8256 | dropout_6[0][0]
dense_6 (Dense) | (None, 5) | 325 | dense_5[0][0]
Total params: 21,274,926
Trainable params: 407,302
Non-trainable params: 20,867,624
The InceptionV3 pre-trained model was used for feature extraction. To extract deeper features, further convolutional layers were added. To reduce overfitting, a dropout layer was used with a 0.5 rate. 64 filters with a 1 x 1 kernel size were used in the first convolutional layer; the activation function was ReLU. The second convolutional layer contained 1 x 1 sized 16 filters with the ReLU activation function. A third convolutional layer was added containing 8 filters of 1 x 1 size. This was our contribution to Kevin Mader's proposed model. In the fourth convolutional layer, a sigmoid activation function was used with 1 kernel.

An attention layer was added with a linear activation function. This layer was not trained during the training process (trainable = False) because it was used for attention purposes. Mask features were calculated with the help of the initial extracted features generated by the pre-trained model and the deeper features extracted after adding further convolutional layers. To build the attention mechanism, global average pooling was used. The GAP features and GAP mask were obtained from the mask features and attention layers respectively. A Lambda layer was used to rescale the features.

Two dropout layers with a 0.25 rate were used with the fully connected layers. The first fully connected Dense layer was used with 128 units and the ReLU activation function. Another fully connected layer was added with 64 units and linear activation. This was also our contribution to this architecture. Finally, an output layer was used with softmax activation; 5 units were used in this output layer to classify all five labels accordingly. The model was compiled with the Adamax optimizer and the categorical cross-entropy loss function. Initially, Kevin Mader compiled his proposed model with the Adam optimizer.
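The following tensorflow.keras sketch approximates the attention head described above (1 x 1 convolutions, a sigmoid attention mask, GAP rescaling, dropout and dense layers, Adamax with categorical cross-entropy). It is an illustrative reconstruction under stated assumptions, not the authors' exact code; layer sizes follow the description and Table 2, and the function name build_attention_model is ours.

import tensorflow as tf
from tensorflow.keras import layers, models

def build_attention_model(num_classes=5, input_shape=(512, 512, 3)):
    base = tf.keras.applications.InceptionV3(
        include_top=False, weights="imagenet", input_shape=input_shape)
    base.trainable = False                       # frozen pre-trained feature extractor

    inputs = layers.Input(shape=input_shape)
    features = base(inputs)
    depth = features.shape[-1]

    bn = layers.BatchNormalization()(features)
    x = layers.Dropout(0.5)(bn)

    # Deeper 1x1 convolutions added on top of the pre-trained features
    attn = layers.Conv2D(64, (1, 1), activation="relu")(x)
    attn = layers.Conv2D(16, (1, 1), activation="relu")(attn)
    attn = layers.Conv2D(8, (1, 1), activation="relu")(attn)
    attn = layers.Conv2D(1, (1, 1), activation="sigmoid")(attn)   # attention mask

    # Broadcast the one-channel mask over the feature depth (kept non-trainable)
    mask = layers.Conv2D(depth, (1, 1), use_bias=False, trainable=False,
                         kernel_initializer="ones", activation="linear")(attn)

    masked = layers.Multiply()([mask, bn])
    gap_features = layers.GlobalAveragePooling2D()(masked)
    gap_mask = layers.GlobalAveragePooling2D()(mask)
    # Rescale: mean of the masked features divided by the mean mask activation
    rescaled = layers.Lambda(lambda t: t[0] / (t[1] + 1e-7))([gap_features, gap_mask])

    x = layers.Dropout(0.25)(rescaled)
    x = layers.Dense(128, activation="relu")(x)
    x = layers.Dropout(0.25)(x)
    x = layers.Dense(64, activation="linear")(x)
    outputs = layers.Dense(num_classes, activation="softmax")(x)

    model = models.Model(inputs, outputs)
    model.compile(optimizer="adamax", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model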
4. Details of Proposed Approach

Images of diabetic retinopathy were used from the Kaggle dataset. This dataset contains 35,000 color images. 5 class labels were defined as "No DR", "Mild", "Moderate", "Severe" and "Proliferative DR". The retina images are high-resolution and taken under a diversity of imaging circumstances. Left and right eye images are provided for every patient. Noise is observed in the images. Due to lighting effects, the pixel intensity varies and it causes variation dissimilarity to the classification pathology. Sample images are provided in Figure 1. Images were normalized by using Gaussian smoothing filters. Unsharp masking techniques were used to enhance the edges in the images. The filtering technique of Contrast Limited Adaptive Histogram Equalization (CLAHE) was used to adjust the contrast in the images.
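As a hedged illustration of the pre-processing steps just listed (Gaussian smoothing, unsharp masking and CLAHE), an OpenCV sketch could look as follows; the file name and parameter values are assumptions, not taken from the paper.

import cv2

img = cv2.imread("fundus_image.jpg")

# Gaussian smoothing for noise reduction
blurred = cv2.GaussianBlur(img, (5, 5), 0)

# Unsharp masking: emphasise edges by subtracting a weighted blurred copy
sharpened = cv2.addWeighted(img, 1.5, blurred, -0.5, 0)

# CLAHE applied to the lightness channel to adjust contrast
lab = cv2.cvtColor(sharpened, cv2.COLOR_BGR2LAB)
l, a, b = cv2.split(lab)
clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
enhanced = cv2.merge((clahe.apply(l), a, b))
result = cv2.cvtColor(enhanced, cv2.COLOR_LAB2BGR)
cv2.imwrite("fundus_preprocessed.jpg", result)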
Figure 1: Sample images from Kaggle DR Dataset

Before training, we augmented lots of images to improve the classification performance. We used a 640 x 640 size for the augmented images. We implemented horizontal flips, random brightness and contrast, and random saturation. The color mode was RGB. The minimum crop percentage was 0.001 and the maximum crop percentage was 0.005. The rotation range was set up to 10. For data augmentation, the batch size was 16 and the crop probability was set to 0.5. We shuffled the whole dataset before training.

The Google Colab GPU environment (1x Tesla K80 GPU with 2496 CUDA cores, 12.6 GB RAM) was used for model training and testing. 778 images (an equal number of images from all classes) were used for training and 274 images were used for the validation process. For training, we adjusted the reduce-learning-rate parameters: patience was set to 20 epochs, the cool down parameter was set to 5, and the factor was adjusted to 0.4 (reduction of learning rate). Early stopping parameters were also adjusted: the patience of the early stop parameter was set to 20 and the validation loss quantity was the parameter to be monitored. For testing, 1008 images were used. To show attention, the advanced visualization technique of heatmaps was used. Testing performance measures like accuracy, recall, precision and f-score statistical analysis were used to evaluate the architecture. InceptionV3 transfer learning based architectures from ImageNet were used to extract the initial features.

Accuracy can be further increased by adjustments in the convolutional layers and fully connected layers. Further pre-processing can enhance the classification process in the proposed architecture.

5. Results and Discussion

Training time, training/validation accuracies and losses are provided in Table 3. Performance parameters are provided in Table 4. More than 94% validation accuracy was achieved. On the test dataset, 65% accuracy was obtained. Compared with the initial proposed model, test accuracy was improved by up to 5%. For class labels 0 (No DR), 1 (Mild), 2 (Moderate), 3 (Severe), 4 (Proliferative DR), testing precision was obtained as 72%, 16%, 22%, 11% and 27% respectively. Testing recall was
achieved 90%, 4%, 11%, 3% and 26% respectively. F-scores of 80%, 7%, 11%, 3% and 27% were obtained for class labels 0, 1, 2, 3 and 4 respectively. Total testing time was 33 seconds, at 32 milliseconds per step. The improved model prediction (AUC) was 60%. For class label 0, 632 were true positives; 5, 15, 1 and 6 were true positives for class labels 1, 2, 3 and 4 accordingly. Training time, training loss and accuracy graphs are provided in Figures 2, 3 and 4. Validation loss and accuracy can be seen in Figures 5 and 6. A comparison of the overall testing accuracies of the initial proposed model and our proposed model is shown in Figure 7. The confusion matrix and ROC curve are visualized in Figures 8 and 9. Some examples of actual severity and predicted severity are shown in Figure 10. In Figure 11 the heatmap visualization is given. The heatmap describes the prominent features of the relevant class label. In Figure 13 we compare the learning time between the proposed model and Kevin Mader's model. The learning time of our proposed model is greater, but it reduces gradually on each epoch. As per Figures 10 and 11, the proposed model is able to predict the correct class based on the concerned regions. Figures 4 and 6 show that validation is improving on each epoch. Figures 2 and 3 show that learning time is decreasing on each epoch and the proposed model is trained more quickly. Figure 5 shows that on each epoch, the loss of the proposed model is also decreasing.

Table 3: Training time, training loss/accuracy and validation loss/accuracy for each epoch

Epoch No | Training Time (Second) | Training Loss | Training Accuracy (%) | Validation Loss | Validation Accuracy (%)
1 | 1802 | 1.5947 | 68.4% | 1.6432 | 56.9%
2 | 512 | 1.3743 | 79.3% | 1.4293 | 68.4%
3 | 507 | 1.3381 | 83.2% | 1.1537 | 89.5%
4 | 514 | 1.3249 | 85.4% | 1.0832 | 89.7%
5 | 511 | 1.2376 | 97.0% | 0.9956 | 94.3%

Table 4: Performance parameters of proposed CNN architecture

Class | Precision | Recall | F-Score | Support
0 - No DR | 72 | 90 | 80 | 699
1 - Mild | 16 | 4 | 7 | 113
2 - Moderate | 22 | 11 | 14 | 1425
3 - Severe | 11 | 3 | 5 | 31
4 - Proliferative DR | 27 | 26 | 27 | 23

Figure 2: Training Time for each Epoch
Figure 3: Training Loss for each Epoch
Figure 4: Training Accuracy for each Epoch
Figure 5: Validation Loss for each Epoch
Figure 6: Validation Accuracy for each Epoch
Figure 7: Comparison of testing accuracy of Kevin Mader's [19] model and our proposed model
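Per-class precision, recall and F-score values such as those reported in Table 4 can be reproduced with scikit-learn's classification_report; the arrays below are placeholders standing in for the actual ground-truth labels and model predictions, not values from the paper.

import numpy as np
from sklearn.metrics import classification_report

labels = ["No DR", "Mild", "Moderate", "Severe", "Proliferative DR"]
y_true = np.array([0, 0, 1, 2, 3, 4])   # placeholder ground-truth severity labels
y_pred = np.array([0, 0, 2, 2, 3, 4])   # placeholder predicted labels
print(classification_report(y_true, y_pred, labels=list(range(5)), target_names=labels))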
Figure 8: Confusion matrix of our proposed model
Figure 9: ROC Curve of our proposed model
Figure 10: Actual severity results predicted by our proposed model
Figure 11: Activation heatmap of our proposed learning model
Figure 12: Validation accuracy comparison of different proposed models and our proposed model

6. Conclusion

The global weighted average pooling-based attention mechanism in the convolutional neural network increased the performance and accuracy in detecting diabetic retinopathy in an imbalanced and noisy dataset. Further pre-processing and a balanced dataset will increase the performance and accuracy. Our study shows that a CNN-based deep learning model can detect the severity level of diabetic retinopathy at an initial stage via retinal fundus medical images. CNN models are capable enough to understand the training images and learn from raw pixel values. Our heatmap visualization demonstrates that our model learned the features residing in the image portions correctly. These features are clearly visible to a specialist.
Figure 13: Learning time comparison between proposed model and Kevin Mader's model

References
1. Nursel Yalçin NY, Seyfullah Alver SA, Necla Uluhatun NU. Classification of retinal images with deep learning for early detection of diabetic retinopathy disease. SIU. 2018; 1-4.
2. Omer Deperlioglu O, Utku Köse U. Diagnosis of Diabetic Retinopathy by Using Image Processing and Convolutional Neural Network. ISMSIT. 2018.
3. Darshit Doshi D, Aniket Shenoy A, Deep Sidhpura S, Prachi Gharpure P. Diabetic retinopathy detection using deep convolutional neural networks. CAST. 2018; 261-266.
4. Arkadiusz Kwasigroch K, Bartlomiej Jarzembinski J, Michal Grochowski G. Deep CNN based decision support system for detection and assessing the stage of diabetic retinopathy. IIPhDW. 2018; 111-116.
5. Manaswini Jena M, Smitaprava Mishra S, Debahuti Mishra D. Detection of Diabetic Retinopathy Images Using a Fully Convolutional Neural Network. ICDSBA. 2018.
6. Xiaoliang Wang W, Yongjin Lu L, Yujuan Wang Y, Wei-Bang Chen C. Diabetic Retinopathy Stage Classification Using Convolutional Neural Networks. IRI. 2018; 465-471.
7. HaiQuan Chen C, Xianglong Zeng Z, Yuan Luo L, Wenbin Ye Y. Detection of Diabetic Retinopathy using Deep Neural Network. DSP. 2018; 1-5.
8. Abhay Shah A, Stephanie Lynch S, Meindert Niemeijer M, Ryan Amelon R, Warren Clarida W, James Folk J, Stephen Russell SR, Xiaodong Wu X, Michael D. Abràmoff MD. Susceptibility to misdiagnosis of adversarial images by deep learning based retinal image analysis algorithms. ISBI. 2018; 1454-1457.
9. Hanung Adi Nugroho H. Towards development of a computerised system for screening and monitoring of diabetic retinopathy. EECSI. 2017; 1-1.
10. Fengli Yu Y, Jing Sun S, Annan Li L, Jun Cheng C, Cheng Wan W. Image quality classification for DR screening using deep learning. EMBC. 2017; 664-667.
11. Bhavani Sambaturu B, Bhargav Srinivasan S, Sahana Muraleedhara Prabhu M, Kumar Thirunellai Rajamani T, Thennarasu Palanisamy P, Girish Haritz G, Digvijay Singh BS. A novel deep learning based method for retinal lesion detection. ICACCI. 2017; 33-37.
12. Yashal Shakti Kanungo K, Bhargav Srinivasan S, Savita Choudhary C. Detecting diabetic retinopathy using deep learning. RTEICT. 2017; 801-804.
13. Syahidah Izza Rufaida S, Mohamad Ivan Fanany M. Residual convolutional neural network for diabetic retinopathy. ICACSIS. 2017; 367-374.
14. Ratul Ghosh R, Kuntal Ghosh K, Sanjit Maitra S. Automatic detection and classification of diabetic retinopathy stages using CNN. SPIN. 2017; 550-554.
15. Bariqi Abdillah B, Alhadi Bustamam A, Dewi Sarwinda D. Classification of diabetic retinopathy through texture features analysis. ICACSIS. 2017; 333-338.
16. Arisha Roy R, Debasmita Dutta D, Pratyusha Bhattacharya B, Sabarna Choudhury C. Filter and fuzzy c means based feature extraction and classification of diabetic retinopathy using support vector machines. ICCSP. 2017; 1844-1848.
17. Yanyan Dong D, Qinyan Zhang Z, Zhiqiang Qiao Q, Ji-Jiang Yang Y. Classification of cataract fundus image based on deep learning. IST. 2017; 1-5.
18. S. Choudhury C, S. Bandyopadhyay B, S. K. Latib L, D. K. Kole K, C. Giri G. Fuzzy C means based feature extraction and classification of diabetic retinopathy using support vector machines. ICCSP. 2016; 1520-152.
19. "Diabetic Retinopathy detection", https://www.kaggle.com/kmader/inceptionv3-for-retinopathy-gpu-hr, 2018.
Mitigating the Effect of Data Sparsity: A Case
Study on Collaborative Filtering Recommender
System
Bushra Alhijawi∗, Ghazi Al-Naymat†, Nadim Obeid‡¶, Arafat Awajan§
King Hussein School of Information Technology, Princess Sumaya University for Technology, Amman, Jordan
¶ King Abdullah II School of Information Technology, The University of Jordan, Amman, Jordan

Email: ∗ bus20179001@std.psut.edu.jo, † g.naymat@psut.edu.jo, ‡ nadim@ju.edu.jo, § awajan@psut.edu.jo

Abstract—The sparsity problem is considered as one of the main issues facing collaborative filtering. This paper presents a new dimensionality reduction mechanism that is applicable to collaborative filtering. The proposed mechanism is a statistical-based method that exploits the user-item rating matrix and item-feature matrix to build the User Interest Print (UIP) matrix. The UIP is a dense matrix that stores data reflecting the satisfaction degree of the users about the item's semantic features. This method is developed based on the assumption that people tend to buy items related to what they have previously bought. Also, this method benefited from the fact that the number of features is much less than the number of items and mostly constant. The effectiveness of the proposed mechanism is tested on two real datasets, namely Movielens and HetRec 2011. The obtained accuracy results using the UIP matrix are compared with the ones obtained using the user-item rating matrix. The experimental studies demonstrate the superiority of our proposed method. On average, using the UIP matrix the collaborative filtering achieved 8% improvement in terms of prediction accuracy.

Index Terms—Dimensionality reduction, sparsity, collaborative filtering, recommender system.

I. INTRODUCTION

Collaborative Filtering (CF) is the most common and popular recommendation approach [1], [2]. The core idea behind the CF is to estimate a particular item's probability to be favorite to a user by comparing that user's historical shopping behavior record with the recorded shopping behavior of other like-minded users [3], [4]. The basic assumption that motivates the CF is that there is a high probability that the users will give similar rates to other items if they gave rates to n items in a similar way [5].

The historical shopping behavior records are stored in a data file that can be viewed as a matrix whose rows and columns represent users and items, respectively. This matrix is called the user-item rating matrix. The performance and the recommendation quality produced by the CF depend mainly on the quality of stored data in the user-item rating matrix. The user-item rating matrix usually stores rating records related to tens of thousands of users for tens of thousands of items, thus it will be extremely sparse. The sparsity problem is a result of the fact that most of the users only rated a small proportion of the items [2], [6], [7]. This problem contributes to reduce coverage and cause neighbor transitivity [8]. The CF's coverage is defined as the fraction of items that the CF can provide recommendations for [8]. Therefore, the CF may be unable to produce a recommendation to those items which have only a small number of rates. This is due to the fact that the users are usually rating a small proportion of the items compared with the total number of items in the system. Neighbor transitivity refers to the problem in which like-minded (i.e. similar) users may not be determined since they may not have sufficient and enough common ratings [9]. Consequently, the sparsity problem has a significant negative impact on the accuracy of the CF prediction. The effect of the sparsity problem on CF is examined by Bobadilla and Serradilla [7]. They concluded that the impact of the sparsity effect depends on the k-neighborhood value selected and the used similarity measure.

Several methods have been considered to alleviate the data sparsity problem. Traditionally, the user's demographic information (e.g. gender, country, age, etc...) is utilized to compute the similarity among users [10], [11] which helps in alleviating the neighbor transitivity issue. In addition, the item's semantic information has been used to overcome the issues related to the sparsity problems (i.e. coverage and neighbor transitivity) [2], [12]–[14]. Alhijawi and Kilani [3] used the genetic algorithm to obtain the optimal similarity values among users instead of using the user-item rating matrix. Representing the historical rating data in a lower dimensional space is one of the proposed solutions to deal with this challenge [15]. Principal Component Analysis (PCA) [16], [17] and Singular Value Decomposition (SVD) [18]–[20] are two dimensionality reduction methods used to alleviate the data sparsity problem. PCA is a dimensionality reduction technique proposed by Pearson [21] that is a statistical-based method to obtain an ordered list of components that account for the largest amount of the variance from the data (i.e. finding patterns in a high dimensionality space) [15]. SVD is a matrix factorization approach that decomposes the user-item rating matrix into the product of three lower dimensionality rectangular matrices, proposed by Billsus and Pazzani [22].

This paper presents a dimensionality reduction method that is applied to handling the sparsity problem. The proposed technique is a statistical-based method that exploits the user-item rating matrix (U × I) and item-feature matrix (I × F) to build the User Interest Print (UIP) matrix (U × F). The UIP
matrix reflects the user's satisfaction degree about the item's semantic features (i.e. the concept level). The UIP matrix is a dense matrix, which is used to compute the similarity among users instead of using the user-item rating matrix. The idea of constructing the UIP matrix depends on the assumption that people tend to buy items related to what they have bought, and benefits from the fact that |F| < |I| and |F| is mostly constant.

The rest of this paper is organized as follows. Section II describes the details related to the construction process of the UIP matrix. Section III presents how to use the UIP matrix to produce the recommendation. Section IV focuses on the testing of the UIP matrix and shows the comparison results with the user-item rating matrix. Finally, section V concludes the research paper.

II. THE PROPOSED DIMENSIONALITY REDUCTION METHOD

The UIP matrix is constructed depending on the user-item rating matrix (U × I) and the binary item-feature matrix (I × F). The rows (U) and columns (I) of the user-item rating matrix represent the users and the items in the system. Each entry in the user-item rating matrix is related to the user's satisfaction degree about the items (i.e. instance level). The sparsity of this matrix is high as the users gave rates to a small number of items compared with the overall number of items in the system. The sparsity level of the matrix has a significant effect on the prediction accuracy of the CF. The item-feature matrix is a binary matrix whose rows (I) and columns (F) represent the items and the item's semantic features such as genre. In other words, the item-feature matrix categorizes the items based on the semantic features. This matrix is sparse since the item usually belongs to a small proportion of the features. Note that the number of semantic features in the system is almost constant while both the number of users and items is permanently increasing. Also, the number of features is smaller than the number of users and items in the system (i.e. |F| < |I| and |F| < |U|).

Both sparse matrices are exploited to build a dense matrix, the UIP matrix (U × F). The UIP matrix is constructed to reflect the satisfaction degree of the users about the item's semantic features (i.e. concept level). In other words, the UIP matrix contains information about the preferred and non-preferred features of each user. The core idea lies in the following assumption: people tend to buy items related to what they have bought. By meaning, people usually buy items that have common semantic features. Given the n × m matrix data U-I (n users, m items), the m × k matrix data I-F (m items, k features) and an aggregation function (A), we can obtain an n × k matrix UIP (n users, k features). The UIP matrix will always be a dense matrix which represents the user's interests. Fig. 1 illustrates this idea.

Fig. 1. The basic idea of UIP construction process.

The rates given by the users are aggregated in terms of each item's semantic feature. Thus, the UIP matrix represents the user's interests. Moreover, the sparsity level of the UIP matrix is zero. The entry value in the UIP matrix ranges between [0, Maxr], where Maxr reflects the highest satisfaction degree and 0 indicates that the user is not interested in the items which belong to this feature. Fig. 2 shows a simple example of the UIP matrix, where the sparsity level of both the user-item and item-feature matrices is 58% while the UIP matrix's sparsity level is 0%. Note that the sparsity is computed using Eq. 1. Based on the UIP matrix, the RS can conclude that U1 is interested in the items which belong to features 1 and 4. Thus, the RS should recommend items to this user which belong to those features. At the same time, it makes no sense that the recommendation list contains any items that belong to feature 2, since U1 never showed any interest in the items which belong to this feature. In the baseline CF methods, if the item has no rates it will not be included in the recommendation list. Using the UIP, even if the feature has a satisfaction degree of 0, the items belonging to this feature still have a probability to be included in the recommendation list. In the example, the items belonging to feature 2 still have a probability to be included in the recommendation list since those items may belong to other features which have a satisfaction degree greater than 0. For instance, I4 and I8 belong to features 4 and 3, respectively, thus there is a probability to be included in the recommendation list. In general, the items which have a small number of rates will have a probability to be included in the recommendation list. This will improve the coverage of the CF. The process of constructing the UIP matrix is detailed below.

SparsityLevel = 1 − |Rate| / (|U| × |I|)    (1)

1) Represent each user by a vector of rates (Vu) as follows:

Vu = (ru1, ru2, . . . , run),    (2)
where rui is the rate that user u gave to item i. For instance, VU1 = (5, 0, 0, 0, 0, 0, 4, 0, 3, 2).

2) Represent each item by a vector of features (Vi) as follows:

Vi = (fi1, fi2, . . . , fik),    (3)

where the value of fit is either 0 or 1. fit = 1 if item i belongs to feature t. Otherwise, fit = 0. For instance, VI1 = (1, 0, 0, 1, 0).

3) Compute the interest print (IPu) for each user as follows:

IPu = (SatDegree_f1^u, SatDegree_f2^u, . . . , SatDegree_fk^u)    (4)

For instance, IPU1 = (3.5, 0, 3, 3.5, 2.5).

SatDegree_ft^u = (Σ_i r_u^i) / #i, where i ∈ ft    (5)

where,
• IPu represents the interest print of user u.
• SatDegree_ft^u represents the satisfaction degree of user u about semantic feature ft.

For instance, SatDegree_F1^U1 = (r_U1^I1 + r_U1^I7 + r_U1^I9 + r_U1^I10) / 4 = (5 + 4 + 3 + 2) / 4 = 3.5.

Fig. 2. Example on UIP matrix construction process.
Section IV-A presents details related to the data that have
been used in the experiments. Section IV-B provides detailed
related to the experiments design and the measures that were
used in the experiments. Finally, the results are presented and
discussed in Section IV-C.
A. Datasets
To evaluate the UIP matrix, two real datasets, MovieLens
and HetRec 2011, were considered (described in Table I). The
descriptions of the datasets are as follows:
1
• Movielens dataset [23]. This dataset is considered as
one of the most popular reference RS research over the
last years [24]. In this dataset, 100,000 5-star scale rating
collected from 943 users on 1682 movies. Each user has
rated at least 20 movies and each movie belongs to at
least one category from the 18 categories. Hence, 32.4%
of the users gave rates to 20 − 40 items which are a
small number of rates compared with remaining users.
The average number of rates is 106. The sparsity level of
Fig. 2. Example on UIP matrix construction process. this dataset is 93.7% (sparsity level =1 − (100000/(943 ∗
1682)) = 0.937).
III. UIP M ATRIX FOR R ECOMMENDATION • HetRec 2011 (MovieLens + IMDb/Rotten Tomatoes)

In general, the recommendation problem consists of finding dataset 2 [25]. It is an extension of the MovieLens10M
a set of items that have the highest probability to be favorite dataset, published by GroupLeans research group 3 . The
to a particular user (AU ). The challenge is to predict these HetRec 2011 dataset includes 2113 users, 10197 movies,
probabilities accurately. More formally, the recommendation 95321 actors, 4060 directors and 20 genres. In this
problem can be formulated as follows: dataset, the users have provided ratings on a 5-star scale
Let U = u1 , u2 , u3 , ..., un be a set of users, I = 1 http://grouplens.org/datasets/movielens/
i1 , i2 , i3 , ..., im be a set of all possible items and F = 2 https://grouplens.org/datasets/hetrec-2011/

f1 , f2 , f3 , ..., fk be a set of item’s features. 3 https://grouplens.org/

138
• HetRec 2011 (MovieLens + IMDb/Rotten Tomatoes) dataset [25] (https://grouplens.org/datasets/hetrec-2011/). It is an extension of the MovieLens10M dataset, published by the GroupLens research group (https://grouplens.org/). The HetRec 2011 dataset includes 2113 users, 10197 movies, 95321 actors, 4060 directors and 20 genres. In this dataset, the users have provided ratings on a 5-star scale and it includes 855598 ratings. Each user gave rates to at least 20 items. Hence, 496 users gave rates to 20 − 100 items (i.e. 23.5% of overall users) and 38% of those users gave rates to 20 − 40 items. The average number of rates is 405. The sparsity level of this dataset is 96.1% (sparsity level = 1 − (855598/(2113 ∗ 10197)) = 0.961).

TABLE I
THE DATASETS SPECIFICATIONS USED IN THE EXPERIMENTS

 | Movielens | HetRec 2011
Number of users | 943 | 2113
Number of movies | 1682 | 10197
Number of genres | 18 | 20
Number of actors | 0 | 95321
Number of directors | 0 | 4060
Number of ratings | 100000 | 855598
Rating scale | 1-5 | 1-5
Sparsity level | 93.7% | 96.1%

B. Experiments setup

The prediction accuracy resulting from using the UIP matrix shown in this paper is compared to the result obtained from using the user-item rating matrix. The Mean Absolute Error (MAE) (Eq. 7) is utilized as a prediction accuracy measure. The prediction accuracy is computed using a different number of neighbors (K). The value of K ranges between 25 to 400. A smaller value of MAE signifies better prediction quality.

MAE = (1/#U) Σ_{u=1..#U} ( Σ_{i=1..#Iu} |p_{u,i} − r_{u,i}| ) / #Iu    (7)

where #U represents the number of users and #Iu represents the number of items rated by the user u.

Both metrics are utilized in the baseline CF approaches: Pearson-based CF and cosine-based CF. Thus, the most popular similarity metrics are considered in the experiments to select the similar users to the AU: Pearson correlation (Eq. 8) and cosine (Eq. 9) [7].

Pearson(AU, u) = Σ_{i∈I} (r_{AU,i} − r̄_{AU})(r_{u,i} − r̄_u) / ( sqrt(Σ_{i∈I} (r_{AU,i} − r̄_{AU})²) · sqrt(Σ_{i∈I} (r_{u,i} − r̄_u)²) )    (8)

Cosine(AU, u) = Σ_{i∈I} (r_{AU}^i ∗ r_u^i) / ( sqrt(Σ_{i∈I} (r_{AU}^i)²) · sqrt(Σ_{i∈I} (r_u^i)²) )    (9)

where,
• I is the group of items that both users AU and u have rated.
• r_{AU,i} is the rate of user AU on item i.
• r̄_{AU} is the mean rating value of user AU.
• r_{u,i} is the rate of user u on item i.
• r̄_u is the mean rating value of user u.

For the prediction step, Resnick's Adjusted Weighted Sum (Eq. 10) was considered. Note that, in this step, the feature's (x = f) rate and the item's (x = i) rate are predicted when using the UIP matrix and the user-item rating matrix, respectively.

p_{AU,x} = r̄_{AU} + Σ_{u=1..kAU} [sim(AU, u) ∗ (r_u^x − r̄_u)] / Σ_{u=1..kAU} sim(AU, u)    (10)

Note that only the genre feature was considered to construct the UIP matrix. Thus, the dimensions of the user-item rating matrix and the UIP matrix when using the Movielens dataset are 943 × 1682 and 943 × 18, respectively. While a 2113 × 10197 user-item rating matrix and a 2113 × 20 UIP matrix are considered when using the HetRec 2011 dataset.
¯ )(ru,i − r¯u )
qP q (8) by 4.2% when using the UIP matrix.
2 P 2
i∈I (rAU,i − rAU¯ ) i∈I (ru,i − r¯u ) According to Bobadilla and Serradilla [7], the performance
of the cosine-based CF is negatively affected by the sparsity
Puz i i problem and this negative behavior can be reduced by selecting
u=1 (rAU ∗ ru ) high k-neighbor values. While the performance of the Pearson-
Cosine(AU, u) = qP qP , (9)
i 2 2
i∈I (rAU ) i∈I (rui ) based CF is positively affected by the sparsity problem. The
experiments had been conducted using two real datasets with
where, different sparsity level. Whereas, the sparsity level of HetRect
• I is the group of items that both users AU and u have 2011 dataset is higher than the one of Movielens dataset.
rated. The gathered results indicate that the sparsity level has a
• rAU,i is the rate of user AU on item i. positive impact on the behavior of Pearson-based CF. The
• rAU
¯ is the mean rating value of user AU . percentage improvement made by Pearson-based CF using UIP
• ru,i is the rate of user u on item i. matrix lies in the range [0.16%-12.3%] and [0.38%-16.3%]
• r¯u is the mean rating value of user u. when using Moveilens and HetRec 2011, respectively. Fig.

139
3(B) and Fig. 4(B) show that the highest distance between the accuracy results achieved using the UIP matrix and the ones achieved using the user-item rating matrix is for K values in the range [25-75]. Thus, using the UIP matrix alleviates the effect of the sparsity problem on the cosine-based CF performance even when the k-neighbor value is small. The percentage improvement achieved by applying the cosine-based CF using the UIP matrix on Movielens and HetRec 2011 lies in the range [0.58%-15.1%] and [1.8%-17.5%], respectively.

V. CONCLUSION

In this research, a new dimensionality reduction method was proposed to handle the sparsity problem of CFRS. The core idea lies in exploiting both the user-item rating matrix and the item-feature matrix to form the UIP matrix. The UIP matrix has two main features:
• The UIP is a dense matrix.
• The UIP matrix reflects the user's satisfaction degree about the item's semantic features. The UIP matrix stores values that range between [0, Maxr]. The Maxr represents the highest satisfaction degree and 0 indicates that the user has no interest in the items which belong to this feature.

To generate the recommendation, the UIP matrix is used to compute the similarity between users instead of using the user-item rating matrix.

The prediction accuracy resulting from using the UIP matrix was compared to the one gathered from using the user-item rating matrix. Two benchmark datasets were utilized in the experiments, namely Movielens and HetRec 2011. The results obtained proved that using the UIP matrix leads to fewer errors in prediction than when using the user-item rating matrix.

REFERENCES

[1] Lalita Sharma and Anju Gera. A survey of recommendation system: Research challenges. International Journal of Engineering Trends and Technology (IJETT), 4(5):1989–1992, 2013.
[2] Bushra Alhijawi, Nadim Obeid, Arafat Awajan, and Sara Tedmori. Improving collaborative filtering recommender system using semantic information. In International Conference on Information and Communication Systems (ICICS 2018). IEEE, 2018.
[3] Bushra Alhijawi and Yousef Kilani. Using genetic algorithms for measuring the similarity values between users in collaborative filtering
recommender systems. In 2016 IEEE/ACIS 15th International Conference on Computer and Information Science (ICIS), pages 1–6. IEEE, 2016.
[4] Jonathan L Herlocker, Joseph A Konstan, and John Riedl. Explaining collaborative filtering recommendations. In Proceedings of the 2000 ACM conference on Computer supported cooperative work, pages 241–
250. ACM, 2000.
[5] Bushra Alhijawi. The use of the genetic algorithms in the recommender
systems, 2017.
[6] Bushra Alhijawi and Yousef Kilani. The recommender system: A survey.
International Journal of Advanced Intelligence Paradigms, 10:1, 2018.
[7] Jesus Bobadilla and Francisco Serradilla. The effect of sparsity on
collaborative filtering metrics. In Proceedings of the Twentieth Aus-
tralasian Conference on Australasian Database-Volume 92, pages 9–18.
Australian Computer Society, Inc., 2009.
[8] Xiaoyuan Su and Taghi M Khoshgoftaar. A survey of collaborative
filtering techniques. Advances in artificial intelligence, 2009, 2009.
[9] Sajad Ahmadian, Mohsen Afsharchi, and Majid Meghdadi. A novel
approach based on multi-view reliability measures to alleviate data
sparsity in recommender systems. Multimedia Tools and Applications,
pages 1–36, 2019.
[10] Laila Safoury and Akram Salah. Exploiting user demographic attributes
for solving cold-start problem in recommender system. Lecture Notes
on Software Engineering, 1(3):303–307, 2013.
[11] Mohammad Yahya H Al-Shamri. User profiling approaches for demo-
graphic recommender systems. Knowledge-Based Systems, 100:175–
187, 2016.
[12] Mehrbakhsh Nilashi, Othman Ibrahim, and Karamollah Bagherifard. A
recommender system based on collaborative filtering using ontology and
dimensionality reduction techniques. Expert Systems with Applications,
92:507–520, 2018.
[13] G. Lv, C. Hu, and S. Chen. Research on recommender system based on
ontology and genetic algorithm. Neurocomputing, 187:92–97, 2016.
[14] Qusai Shambour, Mouath Hourani, and Salam Fraihat. An item-
based multi-criteria collaborative filtering algorithm for personalized
recommender systems. International Journal of Advanced Computer
Science and Applications, 7(8):274–279, 2016.
[15] Xavier Amatriain, Alejandro Jaimes, Nuria Oliver, and Josep M Pujol.
Data mining methods for recommender systems. In Recommender
systems handbook, pages 39–71. Springer, 2011.
[16] Mehrbakhsh Nilashi, Mohammad Dalvi Esfahani, Morteza Zamani
Roudbaraki, T Ramayah, and Othman Ibrahim. A multi-criteria col-
laborative filtering recommender system using clustering and regression
techniques. Journal of Soft Computing and Decision Support Systems,
3(5):24–30, 2016.
[17] Mehrbakhsh Nilashi, Othman bin Ibrahim, Norafida Ithnin, and Nor Ha-
niza Sarmin. A multi-criteria collaborative filtering recommender system
for the tourism domain using expectation maximization (em) and pca–
anfis. Electronic Commerce Research and Applications, 14(6):542–562,
2015.
[18] Jesús Bobadilla, Rodolfo Bojorque, Antonio Hernando Esteban, and
Remigio Hurtado. Recommender systems clustering using bayesian non
negative matrix factorization. IEEE Access, 6:3549–3564, 2018.
[19] Remigio Hurtado Ortiz, Rodolfo Bojorque Chasi, and César Inga Chalco.
Clustering-based recommender system: Bundle recommendation using
matrix factorization to single user and user communities. In Interna-
tional Conference on Applied Human Factors and Ergonomics, pages
330–338. Springer, 2018.
[20] Bo Zhu, Fernando Ortega, Jesús Bobadilla, and Abraham Gutiérrez. As-
signing reliability values to recommendations using matrix factorization.
Journal of computational science, 26:165–177, 2018.
[21] Karl Pearson. Liii. on lines and planes of closest fit to systems of points
in space. The London, Edinburgh, and Dublin Philosophical Magazine
and Journal of Science, 2(11):559–572, 1901.
[22] Daniel Billsus and Michael J Pazzani. Learning collaborative informa-
tion filters. In Icml, volume 98, pages 46–54, 1998.
[23] Jonathan L Herlocker, Joseph A Konstan, Al Borchers, and John Riedl.
An algorithmic framework for performing collaborative filtering. In
22nd Annual International ACM SIGIR Conference on Research and
Development in Information Retrieval, SIGIR 1999, pages 230–237.
Association for Computing Machinery, Inc, 1999.
[24] Qusai Shambour and Jie Lu. A hybrid multi-criteria semantic-enhanced
collaborative filtering approach for personalized recommendations. In
the 2011 IEEE/WIC/ACM International Conferences on Web Intelligence

141
Visualizing Program Quality – A Topological
Taxonomy of Features
Islam Al Omari Razan Al Omoush Haneen Innab A. Elhassan
isl20188022@std.psut.edu.jo raz20188047@std.psut.edu.jo han20188024@std.psut.edu.jo a.elhassan@psut.edu.jo

King Hussein School of Computing Sciences


Princess Sumaya University for Technology,
Amman, Jordan

Abstract— In this paper we design a hierarchical, interactive visualization to simplify the assessment of BSc program quality. The idea is based on extracting features from direct assessment data at various levels in the QA taxonomy and linking these features to enable the user to navigate the visualization and obtain insights at different levels of detail. The data was gathered over a period of 5 academic semesters and processed in Oracle before being modelled in Microsoft's Power-BI Desktop, which is gaining prominence as a platform for business intelligence.

Keywords—LMS, Quality Assurance, Student Outcome, Assessment, Topology, Taxonomy, Treemap, Rubric, Features, Exploratory, Explanatory, Machine Learning, Data Science.

I. INTRODUCTION

The surge in influence and breadth of Data Science and AI developments has opened up application opportunities in all walks of life, facilitating and augmenting our daily chores and tasks with fast, reliable and ever-evolving machines, both hardware and software. Many of the high-risk, high-cost or laborious tasks are now prime targets for recent developments in AI and Robotics. Business Intelligence, which requires parsing massive data and linking it to complex packages for generating management reports, is also reaping the benefits of recent developments in Machine Learning, Data Science and, particularly, Data Exploration and Visualization [12].

Educational Data Mining techniques have also contributed quite positively, enabling academic management to monitor, assess, pre-empt and react to shortcomings in student performance [20, 21]. Regulations and policies for licensing and quality notwithstanding, there are many metrics that academic programs need to assess, monitor and report on a regular basis, to specific standards and templates. These include, but are not limited to, Student and Faculty Direct Assessments at various levels of detail: Campus/Site, Program, Year, Course, Section, Gender, Knowledge Domain, Assessment Method, etc. Factors like monitoring, assessing and reporting on segregated campuses, flipped classroom [7], blended, online or collaborative programs introduce additional challenges.

II. PROBLEM STATEMENT

A. Learning Management System (LMS)

The most common solutions academic management resort to are off-the-shelf or home-designed Learning Management Systems [7, 8], which tend to suffer from (a combination of) high cost and a non-trivial learning curve requiring business/teaching model adjustments and changes in the roles and habits of teachers and administrators alike; hence they induce grassroots resistance at various levels. In terms of QA reporting, the main issues with LMS systems are that they require specialist skills for configuration, data loading and data exporting, and that the reports they produce are static. In addition, these reports are not necessarily aligned with the ever-changing requirements of the QA and governing organizations. In contrast, the new breed of Business Intelligence tools offers [10, 29]:

i. Depth of KPIs. BI software packages can handle metrics and insights across disparate disciplines and industries. Good BI applications cope well with changes of scenarios and operationalization processes, including manufacturing, accounts and finance, healthcare, ecommerce, management and education.

ii. Interactive. These packages can be plugged into databases with in-built or 3rd-party libraries to monitor triggers on data updates in real time and consequently sync across multiple access points.

iii. Intuitive. Supporting both dashboards and reporting without explicit instructions or help from the vendor, with a simple learning curve that amplifies cognition for users of any background.

iv. Support for ad hoc queries. Allowing multiple users from the same dataset or different data sources with no data refresh requirements.


v. Rich Visual Toolsets. Covering bar graphs, pie charts, line charts, Gantt charts and other visual formats in support of the business intelligence and operationalization process.

vi. GUI/Wizard Drag-and-Drop Tools. Allowing simple and quick dragging and dropping of data controls or visual controls while the BI engine handles the synching and linking code seamlessly.

B. The Convergence of Business Intelligence and Data Science

Contemporary Business Intelligence solutions [22], [23], [24], [25] have capitalized on recent developments in AI, Data Science and Big Data to offer intuitive, integrated exploration and visualization platforms that allow analysts to quickly load and clean data, manage links and relationships, and highlight anomalies and potential insights, as well as generate easy-to-use, shareable interactive dashboards for non-technical business managers to use to explore and explain data trends.

Given the complexity of the data structures and the myriad of sources that flow within and beyond a typical academic program on a daily basis, these platforms are an ideal candidate to bridge the following gaps:

• Technical Support Personnel
• Daily Users who are used to their own data format and are highly resistant to change
• Ever-changing demands of regulatory bodies in terms of data content and structure of reports
• Management demands for an outlook shift from data to information to insights

It is inevitable that capabilities such as Data Mining of student assessments [6], offering hitherto unseen insights, anomalies and relations, will be a major factor in academic institutes' decision process for procurement or home-development.

C. Motivation

The challenges and history described above have motivated the writers to design a hierarchical exploratory and visualization system for QA data for academic programs of most subject areas [13, 26, 14]. The idea is to have a very short turnaround cycle between instructors collecting data as part of their teaching tasks on the one hand, and management access to intuitive, interactive dashboards with insights rather than tables of data with little or no information value on the other. The solution proposed in this article is based on the use of treemaps in a hierarchical structure that supports drill-down, drill-up, cross-referencing and cross-filtering between levels of learning taxonomies [9] to offer satisfactory visual exploratory analyses [17, 15]. The main requirement for the visualization is that it should offer insights from the data optimized for readability and navigation [28], especially for users with limited knowledge of both the data business domain and the visualization metaphor [11]. Visualization overviews should be easy to use, reliable, able to handle complexity and not fall short when presenting large amounts of detailed information [27], nor should they require steep learning and familiarization curves [36].

III. BACKGROUND

There are several visual metaphors [18, 19] that can be deployed to convey features or data properties to users [11] in a way that maintains a minimal level of quality, reusability and interactivity [16, 36]. The choice of visualization depends on the data domain, operationalization tasks and target audience; some visualization techniques have advantages over others for performing certain tasks [29].

Depending on the data mining output, data items that require representation and modelling exist at several layers including: (i) Student Level modeling, such as knowledge, academic achievement and learning styles; (ii) Student Behavior modeling, e.g., sleeping, inquiry and willingness to collaborate; (iii) Student Performance modeling, such as achievement, competence and deficiencies; (iv) Assessment modeling, e.g., testing, as well as online and offline assessment; (v) Student support and feedback modeling, e.g., complaints, critique and evaluations; (vi) Curriculum and domain knowledge modeling, including topics pacing, TA support and student-centered learning [21].

Treemaps (figure 3.1) are robust and stable [30] visualization metaphors which can handle data changes well and allow visual organization of data into non-occlusive hierarchical data groups in rectangular shapes, which are proportional (in size, color and label) to the occurrence of the source data in the corresponding database or the number of branches in a traditional 2D tree structure. They support representation of multi-dimensional measurements for various tasks in the operationalization [12] process that lead to interactive dashboards.

Treemaps are ideal for displaying relatively rich data content in a small canvas, thus assisting users in the process of understanding data distributions and data structures [33]. The authors of [33] proposed a method to improve the Moon and Spencer aesthetics measure theory and introduced it into the evaluation of the aesthetics of multiple blocks. They also made use of the principle of expansive and contractive colors [32] to emphasize the differences between data blocks in support of data discrimination.

Figure 3.1 shows a treemap representation of the various assessments in a BSc program (Final, Project, Major1, etc.); block color coding will be covered in the discussion below.

Figure 3.1 Assessment Treemap for a BSc program
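To make the treemap metaphor concrete, the following is a minimal, illustrative Python sketch that lays out assessment volumes as a treemap. It is not part of the authors' toolchain (the paper's dashboards are built in Power-BI); it assumes the third-party squarify and matplotlib packages, and the assessment counts are made-up example values.

# Minimal treemap sketch of assessment volumes (illustrative only; the
# paper's dashboards are built in Power-BI, not in Python).
import matplotlib.pyplot as plt
import squarify  # pip install squarify (assumed available)

# Hypothetical counts of assessment instances per assessment type.
assessments = {"Final": 520, "Project": 310, "Major1": 280, "Major2": 240,
               "Quiz": 180, "Homework": 150, "Lab": 90}

labels = [f"{name}\n{count}" for name, count in assessments.items()]
sizes = list(assessments.values())

# squarify maps each value to a rectangle whose area is proportional to it.
squarify.plot(sizes=sizes, label=labels, alpha=0.8)
plt.axis("off")
plt.title("Assessment treemap (illustrative)")
plt.show()

Area encodes the number of assessment instances per type, which mirrors how the Power-BI treemap in figure 3.1 sizes its blocks; color coding of attainment would be layered on top in the dashboard itself.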
The power of treemaps and their ability to represent hierarchical structures that support grouping, filtering and drill-down tasks are also adopted in [34], wherein treemaps in conjunction with calendar controls are used to represent the temporal dimension as the main attribute for hierarchy configuration and drill-down navigation. The authors created an application with Java and the Swing library to build the graphical user interface, based on the MVC model to represent 3 layers:

Model – Database interface, initialization and functional operations
View – User-facing visual interface
Controller – Auxiliary operations, including adapting the treemap to the interface and managing auxiliary operations (breadcrumbs).

Treemaps are subject to a balancing, trade-off risk between compactness and readability. If the analyst opts for a very compact design, it becomes very difficult for readers and users to grasp the intended insights and information stories due to the limited space for grouping cues or other visual features. The authors of [35] presented Bubble Treemaps, which support uncertain, hierarchical data visualization by deliberately allocating extra space to encode additional information. This was achieved with the use of group circles in conjunction with contours inspired by circular arc splines (figure 3.2).

Figure 3.2 Bubble Treemap for Hierarchical Data that is also Uncertain

Any algorithm or solution designed to provide visibility of business processes needs to deploy familiar business graphics [27], preserve the link between the end product and the order/structure of the underlying data, and not do so at the cost of instability over time, especially when displaying dynamically changing data [31].

A. Limitations of 2D Models

Regardless of how versatile the modelling tool and notation are, the effectiveness of 2D treemaps is subject to many factors including the complexity of the feature-set and the number of visual elements/categories offered, as well as the types of insights sought. As the 2D visualization approaches its limit of expressiveness, the value of alternative representations begins to make more sense. One option is 3D visualizations, which introduce their own set of anomalies, rendering the upgrade non-viable in terms of added visualization and insight value. This is due to the usual issues with 3D models including depth ambiguity, tilt angle impact, hidden components, etc. The alternative option explored in this paper is linked multi-view models with cross-reference, cross-filtering and drill-down capability.

IV. METHODOLOGY

In order to model and visualize the performance of the educational process at various levels, the basic elements of data are collected and engineered to identify features, anomalies and relationships. Then a recursive process of aggregation and abstraction generates aspect views for different models that will feed the PowerBI reports, as described below.

A. Micro Assessment Features - Layer1

The education process can be categorized as micro learning, which takes place in the classroom and involves the professors and the students, and macro learning, which describes the student experiences and knowledge discovery at program level, thus involving the student and the program (administration) as an abstract stakeholder.

The features in the data produced from the micro learning process are of the lowest granularity level and constitute the essential building components for the other model/abstraction levels; these are:

StudentID, CourseID, Course-Outcome (CLO), Section, Assessment, Grade%, Professor, Semester-Year

Examples of the data collected throughout the teaching semester and collated as part of a formative assessment and monitoring plan are shown in table 4.1 in appendix A.

B. Rubric Features - Layer2

Because of the wide variation of the values of student classroom scores, the large number of assessments that can be collected each semester, and the lack of sufficient resources or justification to respond to individual student scores, it is more practical for education managers to focus efforts on a higher-level set of aggregated rubric assessments. These are generated from student scores as follows:
Pseudo-code: Make-Rubric-Line
Input(): 4 performance standards U-Min/U-Max, ... E-Max
Input(): Layer 1 Assessment Data (Student-Course-Outcome assessments)
Output(): Single line rubric per Course & Outcome (CID-CLO) combination
Output(): Performance standard percentages: U, M, A, E
Output(): Population: number of student assessment instances (ai) parsed
Begin
  For each Course & Outcome pair in All Assessments ((CID-CLO) ϵ AA)
    Int Population = 0;
    For each Assessment Instance in All Assessments (ai ϵ AA)
      If (ai.Score > U-Min && ai.Score < U-Max)
        U++;    // Unsatisfactory
      else if (ai.Score > M-Min && ai.Score < M-Max)
        M++;    // Minimal
      else if (ai.Score > A-Min && ai.Score < A-Max)
        A++;    // Adequate
      else
        E++;    // Exemplary
      End If
      Population++;
    End For
    U = 100 * U / Population;
    M = 100 * M / Population;
    A = 100 * A / Population;
    E = 100 * E / Population;
  End For
End.

where U, M, A and E denote the Unsatisfactory, Minimal, Adequate and Exemplary performance standards respectively. The performance standards and the From and To range values of each standard (table 4.2) are designated by the academic program administration as necessary. A rubric line looks like the ones shown in table 4.3; all tables are in appendix A.

C. Course Aggregated Assessments - Layer3

The rubrics in Layer2 above include a performance instance for every Course-CLO pair that is assessed throughout the academic semester. In the build-up to the 4th abstraction layer below, it follows to group the rubric records for every course (CID, no CLO) and take the accumulated averages of the performance standards U, M, A and E. The resulting dataset contains one single instance for every CID, as follows:

Pseudo-code: Make-Course-Assessment
Input(): All rubric lines of CID-CLO assessments as in Layer2
Output(): For each CID, a course-rubric assessment (cr ϵ CR)
Output(): CID, CU, CM, CA, CE: aggregated from U, M, A, E in Layer2
Output(): CAssessments: number of assessments in cr
Begin
  For each CID ϵ C; C: all courses, CID: course (feature) in the rubric dataset
    For each Rubric-Line (rl) ϵ R; R: all rubric assessments
      If (rl.CID == CID)
      {
        CU += rl.U;
        CM += rl.M;
        CA += rl.A;
        CE += rl.E;
        CAssessmentPopulation += rl.Population;
        CAssessments++;
      }
      End If
    End For
    CU = CU / CAssessments;    // average over the course's assessments
    CM = CM / CAssessments;
    CA = CA / CAssessments;
    CE = CE / CAssessments;
  End For
End.

Table 4.4 (Appendix A) shows samples of course level (Layer3) assessments for the data in Table 4.3 in appendix A.

D. Student Outcome (SO) Assessment – Layer4

The assessment layer that is most indicative of the health status of the BSc program is the one that uses the three assessment layers above in conjunction with Course-SO and CLO-SO mappings to calculate the attainment rates of the Student Outcomes. The health status of the Student Outcome attainment rates tends to form the first item on the checklist of most QA processes and requirements. It is calculated according to the algorithm below.

Pseudo-code: Make-SO-Assessment
Begin
  For each ((CID-CLO) ϵ AA) in rubric instance (ri) in Layer2
    For each SO in CLO-SO Mapping | per CID-CLO pair
      SO.U += ri.U;
      SO.M += ri.M;
      SO.A += ri.A;
      SO.E += ri.E;
      SO.Population += ri.Population;
      SO.Lines++;
    End For
    SO.U = 100 * SO.U / SO.Lines;
    SO.M = 100 * SO.M / SO.Lines;
    SO.A = 100 * SO.A / SO.Lines;
    SO.E = 100 * SO.E / SO.Lines;
  End For
End.

The data is derived from a set of Course Learning Outcome (CLO) assessments collected from the classrooms of a small academic college over a 3-year period. Instruments include Major1 (M1), Major2 (M2), Midterm (MT), Final (F), in-course Projects (P), Capstone Projects (SD), Internships (INT), Quizzes (Q), Homework Assignments (HW) and Labs.
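As an illustration of the Layer1-to-Layer2 step above, the following is a minimal Python sketch that bins Layer1 assessment records into a rubric line of U/M/A/E percentages. It is an illustrative assumption, not the authors' Oracle/Power-BI implementation; the record fields and the band boundaries follow table 4.2, and the sample scores follow table 4.1.

# Minimal sketch of Make-Rubric-Line: turn Layer1 assessment scores into a
# rubric line (U/M/A/E percentages) per Course-CLO pair.
from collections import defaultdict

BANDS = [("U", 0, 25), ("M", 25, 50), ("A", 50, 75), ("E", 75, 100.01)]

def band_of(score):
    # Map a score (0-100) to its performance standard code (table 4.2).
    for code, lo, hi in BANDS:
        if lo <= score < hi:
            return code
    raise ValueError(f"score out of range: {score}")

def make_rubric_lines(assessments):
    # assessments: iterable of dicts with keys CourseID, CLO, Score (Layer1).
    counts = defaultdict(lambda: {"U": 0, "M": 0, "A": 0, "E": 0, "Population": 0})
    for a in assessments:
        line = counts[(a["CourseID"], a["CLO"])]
        line[band_of(a["Score"])] += 1
        line["Population"] += 1
    # Convert the counts to percentages, one rubric line per Course-CLO pair.
    rubric = {}
    for key, line in counts.items():
        pop = line["Population"]
        rubric[key] = {code: 100.0 * line[code] / pop for code in "UMAE"}
        rubric[key]["Population"] = pop
    return rubric

# Example: three scores for course C101, CLO 1 (as in table 4.1).
sample = [{"CourseID": "C101", "CLO": 1, "Score": s} for s in (45, 85, 65)]
print(make_rubric_lines(sample))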
The multi-layer design above comes as part of a comprehensive data ETL, wrangling and PowerBI/Tableau modelling and visualization process, as illustrated in figure 4.1 below.

Figure 4.1 Process Summary

V. EXPERIMENTAL USE CASES

The usability of a visualization for data exploration or insight explanation depends on many factors, including the underlying data structures, the ability of the visualization to simulate the real-world business model accurately, and the intuitive design of the user interface and views, to name but a few. In order to assess the usability of the dashboard design, an experiment was conducted to answer a few business questions. For each question, the number of steps/clicks needed for the user to obtain the answer they seek is shown in table 5.1 in appendix A (note that details including Course Id, Title, Student ID and Professor Name have been concealed in some of the screenshots).

It is clear that even with simple ETL and data preparation, combined with basic business knowledge of the typical set of management requirements for this data domain, the information value of the visualization is, nonetheless, positive, with the potential to cover many more use cases than the sample above (table 5.1). A comparative analysis between the PowerBI dashboard and a typical LMS system for common functional requirements like the ones in table 5.1 is presented in table 5.2.

Figure 5.1 Use Case Usability Analysis (number of clicks per business operation for the Viz Dashboard, LMS and Spreadsheet)

VI. LIMITATIONS AND FUTURE WORK

Although the editing, exploration and explanation power of data analysis and visualization packages is already very strong, developments in this area as well as in Big Data and Data Mining technologies continue unabated, with new features added on a daily basis to the most prominent packages such as PowerBI, Tableau and other BI toolkits.

Some of the limiting aspects of the current version of the technology that the authors had to overcome include:

(i) The decoupling from the classic relational concepts of foreign keys, referential integrity, 1:many and many:many relationships is still not complete, and good awareness of these concepts is still necessary, as shown in the relationship editor (figures 6.1a and 6.1b).

Figure 6.1a Relationship Editor
Figure 6.1b Relationship Editor

(ii) R Script – There is a limitation on the size of the dataset that an R script can handle before it can be hosted in PowerBI; in order to parse our 19000 assessment instances, we had to loop through an R script with a 1000-instance capacity 19 times to create our Rubrics view.

(iii) Due to the marketing competition between the big players in IT solutions, the capabilities of analysis and visualization packages are constantly shifting, with updates and patches made on a daily basis to the most common packages. While this is seemingly an advantageous scenario, it adds its own challenges for developers who seek stability in solutions, at least over quarterly periods.

(iv) The piling of linking, loading, hosting, sharing and exporting features onto visualization packages can often complicate the process of implementing solutions. The lines between data loading, ETL, scripting, AI/ML, and data storage and retrieval capabilities are quite blurry in many of the packages we experimented with, wherein each solution attempts to be the one place for all aspects of AI, ML & BI.

(v) Regardless of the visualization solution we experimented with, it is still very difficult to implement a visualization solution without solid awareness of the business domain as well as the underlying structure of the data schema to be presented for analysis.

REFERENCES

[1] Handl, J.; Knowles, J. Feature subset selection in unsupervised learning via multiobjective optimization. Int. J. Comput. Intell. Res. 2006, 2, 217–238.
[2] Jain, Divya & Singh, Vijendra. (2018). An Efficient Hybrid Feature Selection model for Dimensionality Reduction. Procedia Computer Science. 132. 333-341. 10.1016/j.procs.2018.05.188.
[3] Jianyu Miao, Lingfeng Niu, A Survey on Feature Selection, Procedia Computer Science, Volume 91, 2016, Pages 919-926, ISSN 1877-0509.
[4] https://doi.org/10.1016/j.procs.2016.07.111 (http://www.sciencedirect.com/science/article/pii/S1877050916313047)
[5] R. A. Fisher, The use of multiple measurements in taxonomic problems, Annals of Eugenics 7 (2) (1936) 179–188.
[6] B. Guo, R. Zhang, G. Xu, C. Shi, L. Yang: "Predicting Students Performance in Educational Data Mining", International Symposium on Educational Technology (ISET), Pages: 125-128, (2015).
[7] Blackboard Learning Management System. https://jo.blackboard.com/index.html?nog=1&cc=US. Accessed on 21st April 2019.
[8] Moodle Learning Management System. https://moodle.org/ Accessed on 21st April 2019.
[9] Bloom, B., 1956. Taxonomy of Educational Objectives: The Classification of Educational Goals, Handbook 1 Cognitive Domain. McKay, New York.
[10] Finance Online, Best BI Systems for Dashboards. https://financesonline.com/whats-the-best-bi-tool-to-create-dashboards-with-kpi-and-reporting/, accessed 20th April 2019.
[11] Hiniker, A., Hong, S., Kim, Y., Chen, N., West, J. D. and Aragon, C. (2017), Toward the operationalization of visual metaphor. Journal of the Association for Information Science and Technology, 68: 2338-2349. doi:10.1002/asi.23857
[12] D. Fisher, M. Meyer. Making Data Visual: A Practical Guide to Using Visualization for Insight. O'Reilly Media, 2017.
[13] Swaid S., Maat M., Krishnan H., Ghoshal D., Ramakrishnan L. (2018) Usability Heuristic Evaluation of Scientific Data Analysis and Visualization Tools. In: Ahram T., Falcão C. (eds) Advances in Usability and User Experience. AHFE 2017. Advances in Intelligent Systems and Computing, vol 607. Springer, Cham.
[14] Al-Murtadha, M. (2019), Enhancing EFL Learners' Willingness to Communicate with Visualization and Goal-Setting Activities. TESOL Q, 53: 133-157. doi:10.1002/tesq.474
[15] A. Marcus, D. Comorski and A. Sergeyev, "Supporting the evolution of a software visualization tool through usability studies," 13th International Workshop on Program Comprehension (IWPC'05), St. Louis, MO, USA, 2005, pp. 307-316. doi: 10.1109/WPC.2005.34
[16] Elmqvist, N., & Yi, J. S. (2015). Patterns for visualization evaluation. Information Visualization, 14(3), 250–269. https://doi.org/10.1177/1473871613513228
[17] F. P. Gusmao, B. R. Delazeri, S. N. Matos, M. Guimaraes, and M. G. Canteri, "Hierarchical visualization techniques: a case study in the domain of meta-analysis," 2018.
[18] H. J. Schulz, "Treevis.net: A tree visualization reference," IEEE Comput. Graph. Appl., vol. 31, no. 6, pp. 11–15, 2011.
[19] The Data Visualization Catalogue, https://datavizcatalogue.com/index.html, accessed on 22nd April 2019.
[20] T. Devasia, Vinushree T P and V. Hegde, "Prediction of students performance using Educational Data Mining," 2016 International Conference on Data Mining and Advanced Computing (SAPIENCE), Ernakulam, 2016, pp. 91-95.
[21] I. Jenhani, G. B. Brahim and A. Elhassan, "Course Learning Outcome
Performance Improvement: A Remedial Action Classification Based
Approach," 2016 15th IEEE International Conference on Machine
Learning and Applications (ICMLA), Anaheim, CA, 2016, pp. 408-
413.
[22] Microsoft PowerBI – Business Intelligence & Visualization Package.
https://powerbi.microsoft.com/en-us/ accessed May 2019
[23] Tableau Business Intelligence.
https://www.tableau.com/products/desktop accessed May 2019
[24] Business Intelligence Solutions.
https://www.sap.com/products/analytics/business-intelligence-bi.html
accessed May 2019
[25] Matthew O. Ward, Georges Grinstein, Daniel Keim. Interactive Data
Visualization: Foundations, Techniques, and Applications, Second
Edition. First Published 2015, eBook Published 11 June 2015,
https://doi.org/10.1201/b18379, eBook ISBN 9780429173226
[26] A. Mittmann and A. Von Wangenheim, “A Multi-Level Visualization
Scheme for Poetry,” 2016 20th Int. Conf. Inf. Vis., pp. 312–317, 2016.
[27] R. Vliegen, J. J. van Wijk and E. van der Linden, "Visualizing Business
Data with Generalized Treemaps," in IEEE Transactions on
Visualization and Computer Graphics, vol. 12, no. 5, pp. 789-796,
Sept.-Oct. 2006.
doi: 10.1109/TVCG.2006.200.
[28] P. Craig and X. Huang, “Animated Space-Filling Hierarchy Views for
Security Risk Control and Visualization on Mobile Devices,” no.
Meita, pp. 772–775, 2015.
[29] J. S. Yi, Y. Kang, J. T. Stasko, and J. A. Jacko, “Toward a Deeper
Understanding of the Role of Interaction in Information Visualization,”
IEEE Trans. Vis. Comput. Graph., vol. 13, no. 6, pp. 1224–1231, 2007.
[30] M. Sondag, B. Speckmann, and K. Verbeek, “Stable Treemaps via
Local Moves,” IEEE Trans. Vis. Comput. Graph., vol. 24, no. 1, pp.
729–738, 2018.
[31] B. Shneiderman and M. Wattenberg, “Ordered Treemap Layouts,” vol.
2001, pp. 2–7, 2001.
[32] H. Di, X. Tang, and S. Wang, “A Novel High-dimension Data
Visualization Method Based on Concept Color Spectrum Diagram,”
2015 IEEE 11th Int. Colloq. Signal Process. Its Appl., pp. 140–144,
2015.
[33] Y. Xie, “Using Color to Improve the Discrimination and Aesthetics of
Treemaps,” vol. 21, no. 4, p. 2016, 2016.
[34] M. B. De Carvalho, B. S. Meiguins, and J. M. De Morais, “Temporal
data visualization technique based on Treemap,” Proc. Int. Conf. Inf.
Vis., vol. 2016-August, pp. 399–403, 2016.
[35] J. Görtler, C. Schulz, D. Weiskopf, and O. Deussen, “Bubble
Treemaps for Uncertainty Visualization,” IEEE Trans. Vis. Comput.
Graph., vol. 24, no. 1, pp. 719–728, 2018.
[36] H. M. Nicholas, B. Liebold, D. Pietschmann, P. Ohler, and P.
Rosenthal, “Hierarchy Visualization Designs and their Impact on
Perception and Problem Solving Strategies,” Proc. Int. Conf. Adv.
Comput. Interact., no. c, pp. 93–101, 2017.

Appendix A
Tables

Table 4.1 Sample Student Assessments

Student ID | Course | Section | CLO | Assessment | Semester | Score % | Instructor
20161234 | C101 | 1 | 1 | M1 | 161 | 45 | ProfA
20171234 | C101 | 1 | 1 | M1 | 161 | 85 | ProfA
20162345 | C101 | 1 | 1 | M1 | 161 | 65 | ProfA
20181234 | C102 | 1 | 2 | Q1 | 161 | 77 | Ms D

Table 4.2 Sample Performance Standards

Standard | Code | From | To
Unsatisfactory | U | 0 | 24.99
Minimal | M | 25 | 49.99
Adequate | A | 50 | 74.99
Exemplary | E | 75 | 100

Table 4.3 Sample Rubric Lines

Course ID | Description | CLO | U% | M% | A% | E% | Population
C401 | Capstone | 1 | 0 | 0 | 8 | 91 | 832
C401 | Capstone | 3 | 10 | 20 | 30 | 40 | 832
C102 | CS2 | 4 | 7 | 15 | 29 | 48 | 606
C102 | CS2 | 5 | 17 | 5 | 39 | 38 | 606
C104 | Data Struct. | 4 | 2 | 18 | 35 | 43 | 870
C104 | Data Struct. | 1 | 1 | 9 | 36 | 53 | 870
C104 | Data Struct. | 3 | 2 | 7 | 37 | 51 | 870
C104 | Data Struct. | 5 | 3 | 8 | 34 | 52 | 870

Table 4.4 Course Level Assessments (aggregated from table 4.3)

Course ID | Description | Assessments | U% | M% | A% | E% | Population
C401 | Capstone | 2 | 5 | 10 | 19 | 65 | 1664
C102 | CS2 | 2 | 12 | 10 | 34 | 43 | 1212
C104 | Data Struct. | 4 | 2 | 11 | 35 | 50 | 3480
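For illustration, the short Python sketch below reproduces the Layer3 aggregation of table 4.4 from the Layer2 rubric lines of table 4.3 by averaging per course; the published table shows the averages rounded to whole percentages, so the printed values agree with it up to rounding. The list literal and function are assumptions for illustration, not the authors' implementation.

# Layer3 sketch: aggregate the Layer2 rubric lines of table 4.3 into the
# course-level rows of table 4.4 by averaging per course.
from statistics import mean

rubric_lines = [  # (CourseID, U%, M%, A%, E%, Population), from table 4.3
    ("C401", 0, 0, 8, 91, 832), ("C401", 10, 20, 30, 40, 832),
    ("C102", 7, 15, 29, 48, 606), ("C102", 17, 5, 39, 38, 606),
    ("C104", 2, 18, 35, 43, 870), ("C104", 1, 9, 36, 53, 870),
    ("C104", 2, 7, 37, 51, 870), ("C104", 3, 8, 34, 52, 870),
]

for cid in sorted({line[0] for line in rubric_lines}):
    rows = [line for line in rubric_lines if line[0] == cid]
    u, m, a, e = (mean(r[i] for r in rows) for i in range(1, 5))
    population = sum(r[5] for r in rows)
    # Prints: course, number of assessments, averaged U/M/A/E, total population.
    print(cid, len(rows), u, m, a, e, population)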
Table 5.1 Visualization Use Cases

Business Question: Overall, aggregated health status of the Student Outcomes (Program Learning Outcomes) for all courses and all assessments in all sections for all semesters that the data covers.
Visualization Source - Steps: Calculated and visualized in the 1st view (Student Outcomes). Color code: Green: within attainment threshold; Amber: borderline; Red: requires attention.

Business Question: Best performing SO.
Visualization Source - Steps: From the view above: d, n, l.

Business Question: Worst performing SO.
Visualization Source - Steps: From the view above: a.

Business Question: Worst performing SO cause/trigger course.
Visualization Source - Steps: 1. Click on SO "a"; all other views will correspondingly cross-filter. 2. Sort the "Course Performance" view by Unsatisfactory rate (descending) and see the worst performing courses of SO "a".

Business Question: Assessments of the worst performing SO cause/trigger course.
Visualization Source - Steps: From the view above: 1. Select a course from the "Course Performance" view. 2. Check the 3rd view, "Rubrics – Aggregated CLO Assessments", to see the available assessments. 3. Check all grades of the selected course in the 4th view, "Student Grades", or 4. click a rubric assessment in the 3rd view and see its filtered grades in the 4th view. The view below shows the course assessment for CLO4 selected in the rubrics on the left, and all the constituent grades for individual students shown on the right, filtered by CLO4 across all sections.

Business Question: Students with best/worst grades in the assessments above.
Visualization Source - Steps: From the 4th view, sort by "GRADE", descending/ascending.

Table 5.2 Use Case Usability Analysis

Business Operation | Operations/Clicks (Viz Dashboard) | Operations/Clicks (LMS) | Operations/Clicks (Other, e.g., Excel)
SO Health Status | 1 | 5-10 | 5-20
Best SO | 1 | 10 | 5-20
Worst SO | 1 | 10 | 5-20
Best SO – Trigger Course | 3-4 | 10-15 | 15-20
Best SO – Trigger Course – Rubric Line | 4-5 | 10-15 | 15-20
Best SO – Trigger Course – Rubric Line – Student Grade | 5-6 | 15-20 | 15-25
Management Report | 1-3 | 3-10 | 10-15
Improved Swarm Intelligence Optimization using
Crossover and Mutation for Medical Classification
Mais Yasen, Nailah Al-Madi
Department of Computer Science
Princess Sumaya University for Technology
Amman, Jordan
mai20130045@std.psut.edu.jo, n.madi@psut.edu.jo

Abstract – Early diagnosis helps in curing most diseases or in making them more bearable; it is therefore vital to enhance the accuracy of predicting chronic diseases. Extreme Learning Machine (ELM) is a classifier which can be efficiently used to predict diseases. The Artificial Bee Colony algorithm (ABC) and the Dragonfly Algorithm (DA) have been efficiently used in several optimization problems, including the optimization of ELM settings. Evolutionary Computation is a type of optimization algorithm which has biological operators to find desired solutions. Two of these operators are crossover and mutation (CM), which are used to generate new solutions from old ones and can be integrated with swarm intelligence algorithms to enhance their results. In this paper, models that make use of ABC and DA to optimize the number of hidden neurons and the weights of ELM are presented. Moreover, crossover and mutation are combined with the swarm search of ABC and DA for chronic disease forecasting, in models called ELM-ABC-CM and ELM-DA-CM. Four real datasets are used to evaluate the proposed models, and their results are compared with the results of standard ABC and DA and of other well-known classifiers, including regular ELM, using different evaluation metrics. The results show that crossover and mutation improved the outcome of ABC and DA. Moreover, ELM-DA-CM proved its efficiency over ELM-ABC-CM.

Keywords—Machine Learning; Swarm Intelligence; Evolutionary Computation; Extreme Learning Machine; Dragonfly Algorithm; Artificial Bee Colony; Crossover; Mutation; Medical Prediction.

I. INTRODUCTION

Early diagnosis is important to cure most diseases or to manage them by preventing their consequences and making them more bearable [1]. Therefore, it is an essential requirement to increase the accuracy of predicting diseases such as heart disease, hepatitis, diabetes, and diabetic retinopathy. The symptoms of these diseases need to be taken into consideration when forecasting them using machine learning [2].

Machine learning (ML) in artificial intelligence enables computers to learn without being explicitly programmed [2]. It finds patterns by searching through the data and uses the detected patterns to alter program actions accordingly [2]. The process whereby algorithms reflect what has been learned in the past from training data to predict new data is called supervised ML [3]. Classification is one of the main supervised ML tasks; it aims to build a model based on previous data to classify new data.

Extreme Learning Machine (ELM) is a neural network that is inspired by the biological brain; it consists of a computational model that contains a number of processing nodes called neurons [4]. Neurons send signals to one another over a large number of weighted connections that link the input, hidden and output layers together for communication purposes. The ELM training method is feedforward: it travels from the input layer to the output layer and adjusts the weights without returning back to the input layer, and it avoids getting stuck in local optima. This can explain why ELM has good generalization performance without using cycles, thus learning faster than other training methods such as backpropagation [5].

To increase the prediction accuracy of ELM, it can be implemented in conjunction with optimization algorithms to efficiently choose the number of its hidden layer nodes and the values of the weights throughout the learning process [27]. Swarm Intelligence (SI) is a type of population-based and nature-inspired metaheuristic optimization algorithm that reflects the natural behavior of biological swarm groups [6]. The Artificial Bee Colony algorithm (ABC) and the Dragonfly Algorithm (DA) are SI algorithms that can be applied to the optimization of the number of hidden nodes and the weights of an ELM. The reason why ABC and DA were chosen is that ABC has a feature of grouping the solutions and DA has a feature of distracting from enemies; these features and their phases enable the employment of natural operations. Also, ABC and DA have proved their efficiency in previous works [7, 8].

Evolutionary Computation (EC) is another type of population-based and nature-inspired metaheuristic optimization algorithm. EC iteratively applies biological evolution to generate solutions [9]. Crossover and mutation are two vital biological operators in EC that are used to generate new populations from an existing one and to enhance the results by having more exploration and exploitation [9]. These operators can be applied with SI optimization algorithms to enhance the prediction accuracy of an optimized classification algorithm. The contribution of this paper is summarized as follows:
1. Using crossover and mutation on ABC and DA.
2. Optimizing ELM using ABC-CM and DA-CM and improving the tuning of ELM.
3. Using 4 real datasets for training and testing our models.
4. Evaluating the proposed models and comparing them with other classifiers.

This paper is structured as follows: Section II includes the related literature in the area of work. Section III describes the background of the methods used in this work. Section IV includes the proposed methodology used in the development. Section V presents the experiments and the results, and Section VI concludes the research and discusses future work.


II. RELATED WORK

The crossover and mutation operations of EC can be used to enhance the search process of SI algorithms, and they were previously applied on particle swarm optimization (PSO).

As mentioned in [10], PSO adjusts itself based on the previous information about particles and the performance of neighbors. The work presented a PSO with discrete binary variables, where the authors tested 5 different De Jong evaluation functions, and where evolutionary algorithms with crossover combine information from the parents to allow leaps. When there are many global solutions, crossover could be harmful, because two solutions may give better results than any crossover of them. The results showed that PSO was able to solve various problems, which means it is extremely flexible and robust, but it had a problem in getting out of a good local optimum.

The work presented in [11] used PSO for solution updates and combined it with Gaussian mutation. The results were compared with the original PSO and a Genetic Algorithm (GA) using De Jong's functions. PSO with Gaussian mutation was able to outperform GA. Furthermore, PSO with mutation was applied on a gene network, where it got better results than standard GA and PSO.

The authors in [12] proposed a theta-PSO with crossover and mutation to enhance PSO. Their proposed algorithm was capable of getting out of local minima by adjusting the parameters properly. The results were tested on 4 multi-modal functions; their algorithm reached the global optimum in a limited number of iterations, achieving a high success rate. On the other hand, the value of fitness increased at the beginning of crossover and mutation, resulting in a long iteration time.

In [13], the authors addressed that PSO is widely used as a stochastic technique in global optimization. As PSO includes the local and global best position variables, and because of its early convergence, it can easily get stuck in a local optimal solution. Keeping a big search space and ensuring population diversity can help in preventing that problem by balancing exploration and exploitation. The authors introduced crossover and mutation with PSO, performed on all of the particles in the current iteration if the diversity of the particles reaches a value less than a predefined threshold. The results were applied on 12 widely used nonlinear functions and showed that the proposed approach had better performance than standard PSO.

As mentioned in [14], the authors presented a new PSO algorithm for solving global optimization problems, called QPSO. QPSO is a combination of quadratic crossover and the basic PSO algorithm, where diversity is maintained by preventing the search space from shrinking and accepting any new solution even if it is worse than the best solution found so far. The results were tested on 12 benchmark functions and showed that QPSO performed better than the BPSO algorithm in dimensions up to 50.

From the related work, and to the best of our knowledge, it can be seen that applying crossover and mutation with SI enhanced the results, and it was never studied on DA. Moreover, optimizing ELM using EC combined with ABC and DA was never studied before.

III. BACKGROUND

This section discusses the methods used in this work, starting with ELM, then ABC and DA, and lastly crossover and mutation.

A. Extreme Learning Machine

Extreme Learning Machine consists of a single-hidden-layer feedforward Neural Network (NN) used for classification. A neural network could be defined as "a computing system made up of a number of simple, highly interconnected processing elements, which process information by their dynamic state response to external inputs" [15]. Fig. 1 shows the structure of a NN, where NNs are distributed in layers that consist of a group of interconnected nodes, also known as neurons, which have an activation function inside and are inspired by the biological brain [4].

Fig. 1 ELM NN Structure [4]

In ELM the training method is feedforward, which has no cycles; taking into consideration that ELM is feedforward, using an appropriate number of hidden nodes and weights will enable the algorithm to learn fast without propagating back. Choosing the ELM settings unwisely can slow down ELM or result in a low accuracy; therefore it is vital to select the number of hidden neurons and the weights wisely, instead of just generating them randomly.

In the process of training, the sigmoid function is used in the hidden neurons; it is used in the hidden layer to transfer between different neurons, as shown in Equation (1) [16], where x is the value of the input for each node.

f(x) = 1 / (1 + e^(-x))   (1)

The execution steps of ELM are:
1. Extract features for the input layer nodes.
2. Randomly specify the number of hidden layer nodes.
3. Randomly generate the weights and biases connecting the input layer with the hidden layer.
4. For each hidden node compute the sigmoid value.
5. Calculate the Moore-Penrose generalized inverse of the hidden layer output matrix.
6. For each output compute the sigmoid value.

From the steps above it can be seen that ELM could be altered to work with an optimization method to solve the problem of initializing the number of hidden nodes and weights randomly.
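The execution steps above can be made concrete with a short, illustrative numpy sketch of a single-hidden-layer ELM under the usual ELM formulation; this is a minimal sketch for illustration, not the authors' implementation, and the synthetic dataset, layer size and variable names are assumptions.

# Minimal ELM sketch: random input weights, sigmoid hidden layer, and output
# weights obtained with the Moore-Penrose pseudo-inverse (illustrative only).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))                   # Equation (1)

def elm_train(X, Y, n_hidden, rng=np.random.default_rng(0)):
    # X: (samples, features); Y: (samples, outputs) one-hot targets.
    W = rng.uniform(-1, 1, (X.shape[1], n_hidden))    # random input weights
    b = rng.uniform(-1, 1, n_hidden)                  # random biases
    H = sigmoid(X @ W + b)                            # hidden layer output matrix
    beta = np.linalg.pinv(H) @ Y                      # Moore-Penrose solution
    return W, b, beta

def elm_predict(X, W, b, beta):
    return sigmoid(X @ W + b) @ beta

# Tiny synthetic example with two classes.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
Y = np.eye(2)[(X[:, 0] > 0).astype(int)]              # one-hot labels
W, b, beta = elm_train(X, Y, n_hidden=20)
pred = elm_predict(X, W, b, beta).argmax(axis=1)
print("training accuracy:", (pred == Y.argmax(axis=1)).mean())

In a sketch like this, the number of hidden nodes and the random input weights are exactly the quantities the paper proposes to tune with ABC and DA instead of leaving them purely random.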
B. Artificial Bee Colony

The Artificial Bee Colony algorithm (ABC) has been efficiently implemented in many optimization problems; the optimization of the hidden nodes and weights of an ELM could be one of these problems. The ABC algorithm was first proposed by Karaboga in 2005 [17], and it is a meta-heuristic SI optimization algorithm inspired by the foraging behavior of honeybees in nature. The solution of ABC is represented in a multi-dimensional search space as food sources and a population of three different types of bees (employed, onlooker, and scout). Let xi be the food source set found by the employed bees for each iteration of the ABC, xi = {xi1, xi2, …, xin}, where n is the number of solutions needed. Equation (2) is used to calculate a new derived solution [18]:

v(i,j) = x(i,j) + φ(i,j) * (x(i,j) − x(y,j))   (2)

where φ is a random number between 0 and 1, y is a random number between 0 and the maximum number of food sources (y should not equal the current food source i), and j is a random number generated between 0 and the maximum number of solutions.

Equation (3) is used to calculate the probability of each solution suggested by the employed bees; it is also known as the roulette wheel equation that evaluates the solutions based on the fitness values achieved. This phase is called the onlooker bee phase [18]:

p(i) = fit(i) / Σ(j=1..sn) fit(j)   (3)

where i is the current solution, pi is the probability of solution i, fiti is the fitness value of solution i, sn is the solutions total, j is the solutions counter, and fitj is the fitness of each solution j.

The scout bee phase is the final stage; it is responsible for checking the epoch reached so far, which is the number of times the solution is allowed to get worse than the solution produced before. The scout bee abandons the old solution and discovers a new solution for the employed bees to work on in the following iterations. Equation (4) is used to calculate the new solution [18]:

x(i,j) = lb(j) + φ * (ub(j) − lb(j))   (4)

where ub and lb are the vectors that contain the upper bounds and lower bounds allowed for the solution, φ is a random number between 0 and 1, i is the current food source, and j is a random number between 0 and the maximum number of solutions.

C. Dragonfly Algorithm

The Dragonfly Algorithm (DA) was first proposed by Seyedali Mirjalili in 2016 [19], and it is an algorithm that can be used in the optimization of the ELM number of hidden nodes and weights. DA is a meta-heuristic SI optimization algorithm inspired by the static and dynamic behaviors of dragonflies in nature [20]. In the static behavior, a large number of dragonflies migrate in a certain direction, travelling for long distances [21]. On the other hand, in the dynamic behavior, dragonflies get into groups and fly over different areas to find food resources [22].

DA has five principles that are important in finding the required solutions. First, the separation principle implies the static collision avoidance of a dragonfly from other dragonflies that are close to its position [19]. Second, the alignment principle reflects the process of velocity matching of a dragonfly to other dragonflies that are close to its position [19]. Third, the cohesion principle is the tendency of a dragonfly towards the center of the space that contains other dragonflies close to its position [19]. Fourth, the main aim of dragonfly swarms is to stay alive and survive, thus all dragonflies move towards the food sources in the attraction to food principle [19]. Fifth, to survive, all dragonflies move away as far as possible from the enemy sources in the distraction from enemies principle [19]. To calculate the values of the different principles the following equations are used [19]:

S(i) = − Σ(j=1..n) (X − X(j))   (5)
A(i) = Σ(j=1..n) V(j) / n   (6)
C(i) = Σ(j=1..n) X(j) / n − X   (7)
F(i) = Xf − X   (8)
E(i) = Xe + X   (9)

The separation is calculated using Equation (5), where X is the position of the current dragonfly (i), Xj is the position of the jth dragonfly close to the current one, and n is the total number of neighboring dragonflies. The alignment is found using Equation (6), where Vj is the velocity of the jth dragonfly close to the current one (i). The cohesion is calculated as shown in Equation (7). The attraction to food is calculated using Equation (8), where Xf is the position of the food source. The distraction from the enemy is calculated as shown in Equation (9), where Xe is the position of the enemy. The values of ∆X and X are calculated using Equations (10) and (11) [19], where s, a, c, f, e, and w are the weights of their corresponding principles (S, A, C, F, E, and ∆X). e is calculated using Equation (12), where i is the current iteration and I is the maximum number of iterations; s, a, and c are three different random numbers between 0 and 2e, f is a random number between 0 and 2, and w is calculated using Equation (13).

∆X(t+1) = (s*S(i) + a*A(i) + c*C(i) + f*F(i) + e*E(i)) + w*∆X(t)   (10)
X(t+1) = X(t) + ∆X(t+1)   (11)
e = 0.1 − i * (0.1 / (I/2))   (12)
w = 0.9 − i * ((0.9 − 0.4) / I)   (13)

D. Crossover and Mutation

Evolutionary Computation (EC) is another type of population-based and nature-inspired metaheuristic optimization algorithm. What distinguishes EC is the use of biological evolution on candidate solutions to remove the worst and to change solutions iteratively [9]. Crossover and mutation are popular examples of the operators used in EC. These operators are applied to generate new solutions from existing ones [9].

Crossover usually occurs in every iteration to combine the genetic information of two parents and generate new children [23]. Crossover has many types; uniform crossover is illustrated in Fig. 2, where two parents are integrated in a uniform pattern to generate a new child [24]. There are two reasons why uniform crossover was chosen to be implemented: first, using a uniform pattern will guarantee having stable and proportional new derived solutions; second, uniform crossover will help in reaching the best solution faster, because the amount of solution change is high. On the other hand, mutation usually happens less frequently, to find better solutions by altering the genetic information of one or more genes of the members of a solution [23]. Fig. 3 shows bit inversion mutation, where a single gene is altered [24].

Fig. 2 Crossover [24]   Fig. 3 Mutation [24]
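For illustration, the following minimal numpy sketch shows the two operators described above, uniform crossover and single-gene mutation, applied to real-valued solution vectors such as ELM weight sets. The function names and the way a gene is perturbed (resampling within the bounds, as the real-valued analogue of bit inversion) are assumptions for illustration, not the authors' exact implementation.

# Illustrative sketch of the two EC operators used in ABC-CM / DA-CM:
# uniform crossover of two parent solutions and mutation of a single gene.
import numpy as np

rng = np.random.default_rng(42)

def uniform_crossover(parent_a, parent_b):
    # Each gene of the child is copied from parent_a or parent_b with equal
    # probability (the uniform pattern of Fig. 2).
    mask = rng.random(parent_a.shape) < 0.5
    return np.where(mask, parent_a, parent_b)

def mutate_one_gene(solution, lower=0.0, upper=1.0):
    # Alter a single randomly chosen gene by resampling it within the bounds
    # (assumed real-valued analogue of the bit-inversion mutation of Fig. 3).
    child = solution.copy()
    gene = rng.integers(child.size)
    child[gene] = rng.uniform(lower, upper)
    return child

# Example: two candidate weight vectors for a 6-weight configuration.
p1 = rng.uniform(0, 1, 6)
p2 = rng.uniform(0, 1, 6)
child = uniform_crossover(p1, p2)
mutant = mutate_one_gene(child)
print("parent 1:", p1)
print("parent 2:", p2)
print("child   :", child)
print("mutant  :", mutant)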
IV. PROPOSED APPROACHES

The following are the execution steps of the proposed models, where the fitness calculation is done by sending the proposed number of hidden nodes and their weights to ELM. The ABC-CM and DA-CM steps are:
1. Calculate the solution probability using the roulette wheel. Check if the solution probability is lower than the probability of mutation.
2. In the employed bee (or dynamic) phase: if the probability is lower, alter the current solution using mutation, where a new solution is derived using the equations explored in Section III.
3. Choose the two parent solutions that got the highest fitness.
4. Calculate the solution probability using the roulette wheel. Check if the solution probability is lower than the probability of crossover.
5. In the scout bee (or static) phase: if the probability is lower, reset the solution that reached the epoch and generate a new solution by combining the two selected parents in a uniform pattern.
6. Repeat steps 1 to 5 in each iteration of ABC or DA.

The execution steps of ELM-ABC, as shown in Fig. 4, are:
1. Initialize all food sources randomly.
2. Employed bees find all the possible solutions.
3. Find the fitness value for each proposed solution using ELM, retrieving the resulting accuracy.
4. The onlooker bee phase calculates the probability of each solution, then decides greedily, based on a random number, whether to follow the solution or not.
5. The scout bee phase checks if each solution reached the epoch time.
6. Store the best solution based on a greedy selection.
7. Repeat steps 2 to 6 until the maximum number of iterations is reached.

The execution steps of ELM-DA, as shown in Fig. 5, are:
1. Initialize the dragonfly positions and the position differences (∆X) randomly.
2. Calculate the fitness values for the proposed solutions.
3. Start the static phase by updating the best fitness value.
4. If the fitness value is better than the best fitness found so far, update the best food source with the solution.
5. If the fitness value is worse than the worst fitness found so far, update the worst enemy source with the solution.
6. Start the dynamic phase by calculating s, a, c, f, e, w, and ∆X.
7. Calculate the separation, alignment, cohesion, attraction to food, and distraction from enemy values.
8. Update the dragonfly position differences (∆X) and the dragonfly positions (X).
9. Repeat steps 3 to 8 until the maximum number of iterations is reached.

V. EXPERIMENTS AND RESULTS

The performance of our approaches was evaluated by conducting a number of experiments that are explained in this section.

A. Data

The performance evaluation was done on four medical datasets [25]. First, we applied feature selection on the datasets using gain ratio, to consider only the most relevant features to the class attribute, using WEKA [26]. Then we split the data into two sets, 66% for training and 34% for testing, as shown in Table 1.

Table 1 Number of Records and Features in the data files

Dataset | Training | Testing | Features (Selected)
Heart disease | 177 | 93 | 14 (10)
Hepatitis | 102 | 53 | 20 (11)
Diabetes | 506 | 262 | 9 (5)
Retinopathy | 760 | 391 | 19 (16)

Fig. 4 ELM-ABC Process   Fig. 5 ELM-DA Process

B. Experiments settings

For the evaluation of our models, the fitness function was accuracy. The settings of ABC and DA used are: Iterations: 100, Swarm size: 20, Seed: Random, Number of Sources: 50, Upper bound: 1, Lower bound: 0, Epoch: 50, Crossover probability: 0.8, Mutation probability: 0.2. The ELM settings are: Output Neurons: 2, Seed: Random, Hidden Layers: 1, Hidden Layer Nodes: Random.

After preparing the datasets and building our proposed models, ELM-ABC-CM and ELM-DA-CM need to be run 30 times to cover the randomness of the ABC and DA solutions. To evaluate our models, they are compared with seven classifiers implemented in WEKA with their default settings: Bayes Network (BN), Naïve Bayes (NB), Decision Tree (J48), K-Nearest Neighbors (IBK), K-star (K*), Repeated Incremental Pruning (J-Rip), and Artificial Neural Network (ANN).

To evaluate the efficiency of the classifiers we use the following metrics: accuracy, recall, precision, F-measure, and AUC, using Equations (14-17), where TN is true negative, TP is true positive, FN is false negative and FP is false positive; AUC (18) denotes the area under the ROC curve.

Accuracy = (TP + TN) / (TP + TN + FP + FN)   (14)
Recall = TP / (TP + FN)   (15)
Precision = TP / (TP + FP)   (16)
F-measure = 2 * (Precision * Recall) / (Precision + Recall)   (17)
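A small Python sketch of the confusion-matrix metrics in Equations (14)-(17) is given below for illustration; the AUC of (18) is omitted here, since a ROC curve needs the ranked prediction scores rather than only the four counts. The counts used are made-up example values, and the function is an illustrative assumption rather than the authors' evaluation code.

# Illustrative computation of the evaluation metrics of Equations (14)-(17)
# from confusion-matrix counts (example values only).
def classification_metrics(tp, tn, fp, fn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)                   # Equation (14)
    recall = tp / (tp + fn)                                       # Equation (15)
    precision = tp / (tp + fp)                                    # Equation (16)
    f_measure = 2 * precision * recall / (precision + recall)    # Equation (17)
    return {"accuracy": accuracy, "recall": recall,
            "precision": precision, "f_measure": f_measure}

# Example: 40 true positives, 35 true negatives, 10 false positives, 8 false negatives.
print(classification_metrics(tp=40, tn=35, fp=10, fn=8))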
Table 2 Results (*1 Accuracy, *2 Precision, *3 Recall, *4 F-measure, *5 AUC)
Classifier Heart Disease Hepatitis
*1 *2 *3 *4 *5 *1 *2 *3 *4 *5
BN 81.52 0.88 0.80 0.84 0.82 81.13 0.45 0.56 0.50 0.71
NB 82.61 0.88 0.82 0.85 0.83 86.79 0.60 0.67 0.63 0.79
J48 67.39 0.79 0.62 0.69 0.69 86.79 0.67 0.44 0.53 0.70
IBK 78.26 0.89 0.73 0.80 0.80 81.13 0.45 0.56 0.50 0.71
K* 71.74 0.78 0.73 0.75 0.71 90.57 0.75 0.67 0.71 0.81
J-Rip 72.83 0.83 0.69 0.75 0.74 83.02 0.50 0.67 0.57 0.77
ANN 76.09 0.87 0.71 0.78 0.77 88.68 0.67 0.67 0.67 0.80
ELM 75.00 0.90 0.65 0.76 0.77 84.91 0.60 0.33 0.43 0.64
ELM-ABC 83.70 0.83 0.91 0.87 0.82 84.91 0.56 0.56 0.56 0.73
STDEV 1.35 0.03 0.02 0.01 0.02 5.05 0.12 0.13 0.12 0.07
best runs 85.87 0.88 0.93 0.88 0.85 88.68 0.67 0.67 0.67 0.80
ELM-DA 83.70 0.83 0.91 0.87 0.82 88.68 0.80 0.44 0.57 0.71
STDEV 0.00 0.01 0.01 0.00 0.00 0.82 0.04 0.15 0.09 0.06
best runs 83.70 0.84 0.93 0.87 0.83 88.68 0.75 0.67 0.67 0.80
ELM-ABC-CM 84.78 0.86 0.89 0.88 0.84 88.68 0.80 0.44 0.57 0.71
STDEV 1.33 0.03 0.01 0.01 0.02 0.96 0.02 0.06 0.05 0.03
best runs 85.87 0.88 0.93 0.88 0.85 90.57 0.83 0.56 0.67 0.77
ELM-DA-CM 84.62 0.84 0.93 0.88 0.82 90.57 0.83 0.56 0.67 0.77
STDEV 0.54 0.01 0.01 0.00 0.01 0.96 0.05 0.08 0.04 0.03
best runs 84.78 0.84 0.95 0.88 0.83 90.57 0.83 0.89 0.70 0.88

Classifier Diabetes Diabetic Retinopathy


*1 *2 *3 *4 *5 *1 *2 *3 *4 *5
BN 74.05 0.83 0.76 0.80 0.73 77.39 0.79 0.88 0.84 0.73
NB 77.48 0.80 0.87 0.84 0.73 74.33 0.80 0.80 0.80 0.72
J48 76.34 0.83 0.80 0.82 0.74 83.14 0.80 0.98 0.88 0.77
IBK 68.32 0.75 0.79 0.77 0.63 91.57 0.89 1.00 0.94 0.88
K* 69.47 0.74 0.83 0.78 0.63 88.51 0.85 1.00 0.92 0.84
J-Rip 77.48 0.81 0.86 0.83 0.73 81.61 0.82 0.91 0.87 0.77
ANN 76.34 0.83 0.80 0.82 0.74 84.67 0.84 0.95 0.89 0.80
ELM 76.72 0.79 0.89 0.84 0.71 72.80 0.75 0.88 0.81 0.66
ELM-ABC 77.86 0.84 0.82 0.83 0.76 75.10 0.78 0.85 0.82 0.71
STDEV 2.22 0.03 0.03 0.02 0.03 3.82 0.04 0.02 0.02 0.06
best runs 77.86 0.84 0.83 0.83 0.76 77.39 0.82 0.89 0.83 0.75
ELM-DA 78.63 0.85 0.82 0.84 0.77 75.86 0.77 0.91 0.83 0.69
STDEV 0.18 0.00 0.01 0.00 0.00 0.58 0.02 0.05 0.01 0.02
best runs 82.44 0.88 0.85 0.87 0.81 76.63 0.80 0.96 0.84 0.72
ELM-ABC-CM 78.63 0.86 0.80 0.83 0.78 75.38 0.79 0.85 0.82 0.71
STDEV 8.15 0.07 0.12 0.08 0.09 0.00 0.00 0.00 0.00 0.00
best runs 88.55 0.96 0.86 0.91 0.90 75.10 0.79 0.85 0.82 0.71
ELM-DA-CM 79.77 0.83 0.87 0.85 0.76 76.25 0.78 0.89 0.83 0.71
STDEV 2.31 0.03 0.05 0.02 0.03 0.71 0.01 0.02 0.00 0.02
best runs 82.06 0.86 0.91 0.87 0.78 76.63 0.79 0.91 0.83 0.72
C. Results

It can be concluded from Table 2 that DA performs more productively with ELM (ELM-DA) than ABC (ELM-ABC) on 3 datasets. This is explained by comparing the Standard Deviation (STDEV) values of ABC with DA, where ABC has a higher STDEV due to the strong effect of the randomness of its operations on the solutions it produces. However, the best runs of ABC are very competitive in comparison with DA. Furthermore, ELM-ABC and ELM-DA enhanced the results of ELM on all datasets.

ELM-ABC-CM and ELM-DA-CM improved the results of ELM-ABC and ELM-DA, and crossover and mutation decreased the STDEV values of the 30 runs, which means that the randomness of the solutions is decreased. The diabetes, heart and hepatitis datasets give a high indication that ELM-DA-CM is capable of reaching the best solution, in comparison with all of the classifiers mentioned in the table. Moreover, ELM-ABC-CM has the best accuracy on the heart disease dataset. On the other hand, the results
of both approaches were very competitive on the other datasets, and their best runs were better than most classifiers in the table.

VI. CONCLUSION AND FUTURE WORK

The goal of this work was to construct models that can predict chronic diseases and to evaluate their performance. The proposed models are swarm-based and integrate crossover and mutation with the search of ABC and DA (called ABC-CM and DA-CM). The enhanced ABC and DA models were used to improve the results of the ELM classifier. The datasets used in this research were real patients' records of four different medical cases. Results were compared with other well-known classifiers, including ELM, using different evaluation metrics. The results showed that ELM-ABC-CM and ELM-DA-CM improved the efficiency of ELM-ABC and ELM-DA, and that crossover and mutation decreased the randomness of the solutions produced. Moreover, ELM-DA-CM reached the best prediction in three datasets, and ELM-ABC-CM got the best accuracy in one dataset.

Based on the results, as future work it is necessary to enlarge the search space of ABC and DA to increase their accuracy. Moreover, the running time was long; thus it is important to find a way of parallelizing these models to achieve good results in a meaningful time.

REFERENCES

[1] WEBMD, "Health Screening: Finding Health Problems Early", Retrieved on: February 11, 2019. From: www.webmd.com.
[2] Margaret Rouse, (2016), "Analytics tools help make sense of big data", AWS, Retrieved on: December 6, 2018, From: searchbusinessanalytics.techtarget.com.
[3] Jerome H. Friedman, (1997), "Data mining and statistics: What's the connection", Proceedings of the 29th Symposium on the Interface Between Computer Science and Statistics, PP 1-7.
[4] Jun-Shien Lin, and Shi-Shang Jang, (1998), "Nonlinear Dynamic Artificial Neural Network Modeling Using an Information Theory Based Experimental Design Approach", American Chemical Society, Vol. 37, PP 3640–3651.
[5] Guang-Bin Huang, Qin-Yu Zhu, Chee-Kheong Siew, (2006), "Extreme learning machine: Theory and applications", Neurocomputing, Vol. 70, PP 489-501.
[6] Beni G., Wang J., (1993), "Swarm Intelligence in Cellular Robotic Systems", Robots and Biological Systems: Towards a New Bionics?,
[10] J. Kennedy and R. C. Eberhart, "A discrete binary version of the particle swarm algorithm", (1997), IEEE International Conference on Systems, Man, and Cybernetics, Vol. 5, PP. 4104-4108.
[11] N. Higashi and H. Iba, "Particle swarm optimization with Gaussian mutation", (2003), IEEE Swarm Intelligence Symposium, PP. 72-79.
[12] Weimin Zhong, Jianliang Xing and Feng Qian, "An improved theta-PSO algorithm with crossover and mutation", (2008), 7th World Congress on Intelligent Control and Automation, PP. 5308-5312.
[13] Dong G, Cooper J., "Particle Swarm Optimization with Crossover and Mutation Operators Using the Diversity Criteria", (2013), ASME International Design Engineering Technical Conferences and Computers and Information in Engineering Conference, Vol. 3A, PP. V03AT03A010.
[14] Pant M., Thangaraj R., Abraham A., (2007), "A New PSO Algorithm with Crossover Operator for Global Optimization Problems", Innovations in Hybrid Intelligent Systems, Advances in Soft Computing, Vol. 44, PP. 215-222.
[15] Maureen Caudill, (1989), "Neural Network Primer", San Francisco: Miller Freeman Inc., PP 321.
[16] A. C. C. Coolen, (1998), "A Beginner's Guide to the Mathematics of Neural Networks", Springer, Chapter 2, PP 13-70.
[17] Dervis Karaboga, (2005), "An Idea Based on Honey Bee Swarm for Numerical Optimization", Technical Report-TR06, PP 1-10.
[18] Yunfeng Xu, Ping Fan, Ling Yuan, (2013), "A Simple and Efficient Artificial Bee Colony Algorithm", Mathematical Problems in Engineering (MPE), Volume 2013, PP 1-9.
[19] Seyedali Mirjalili, (2016), "Dragonfly algorithm: a new meta-heuristic optimization technique for solving single-objective, discrete, and multi-objective problems", Springer, PP 1053–1073.
[20] M. A. Salam, H. M. Zawbaa, E. Emary, K. K. A. Ghany and B. Parv, (2016), "A hybrid dragonfly algorithm with extreme learning machine for prediction", INnovations in Intelligent SysTems and Applications (INISTA), PP. 1-6.
[21] Robert W. Russell, Michael L. May, Kenneth L. Soltesz, John W. Fitzpatrick, (1998), "Massive Swarm Migrations of Dragonflies in Eastern North America", University of Notre Dame, PP 325-342.
[22] Martin Wikelski, David Moskowitz, James S Adelman, Jim Cochran, David S Wilcove, Michael L May, (2006), "Simple rules guide dragonfly migration", PMC, PP 325-329.
[23] Zakir H. Ahmed, (2010), "Genetic Algorithm for the Traveling Salesman Problem using Sequential Constructive Crossover Operator", International Journal of Biometrics and Bioinformatics
Vol. 102, PP 703-712. (IJBB), Vol. 3, PP 96-105.
[7] M. Z. Yasen, R. A. Al-Jundi and N. S. Al-Madi, (2017), “Optimized [24] Marek Obitko, (1998), “Introduction to Genetic Algorithms”,
ANN-ABC for Thunderstorms Prediction”, 2017 International Obitko, Retrieved on: February 14, 2019, From: obitko.com.
Conference on New Trends in Computing Sciences (ICTCS), PP 98- [25] David Aha, (2013), “UCI Machine Learning Repository”, University
103. of California Irvine.
[8] M. Yasen, N. Al-Madi and N. Obeid, (2018), “Optimizing Neural [26] WEKA, Version: 3.8, Retrieved on: September 5, 2016, From:
Networks using Dragonfly Algorithm for Medical Prediction”, 2018 www.cs.waikato.ac.nz.
8th International Conference on Computer Science and Information [27] Faris, H., Ala’M, A. Z., Heidari, A. A., Aljarah, I., Mafarja, M.,
Technology (CSIT), PP 71-76. Hassonah, M. A., & Fujita, H. (2019). “An intelligent system for
[9] Al-Jundi, Ruba, Mais Yasen, and Nailah Al-Madi, (2017), spam detection and identification of the most relevant features based
“Thunderstorms Prediction using Genetic Programming”, on evolutionary random weight networks”. Information Fusion, 48,
International Journal of Information Systems and Computer 67-83.
Sciences, Vol. 7, PP 1-7.
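As an illustrative aside: the crossover and mutation operators credited above with reducing the run-to-run randomness of the ABC and DA searches can take many forms. The minimal Python sketch below shows one generic possibility (single-point crossover plus Gaussian mutation on real-valued solution vectors). It is not the authors' implementation; the operator choices, mutation rate and noise scale are assumptions made only for illustration.

# Generic real-valued crossover and mutation operators of the kind integrated
# into the ABC and DA searches above (illustrative sketch only; the authors'
# exact operators and parameters are not reproduced here).
import numpy as np

rng = np.random.default_rng(42)

def single_point_crossover(parent1, parent2):
    # swap the tails of two solution vectors at a random cut point
    point = rng.integers(1, len(parent1))
    child1 = np.concatenate([parent1[:point], parent2[point:]])
    child2 = np.concatenate([parent2[:point], parent1[point:]])
    return child1, child2

def gaussian_mutation(solution, rate=0.1, sigma=0.05):
    # perturb a random fraction of the genes with small Gaussian noise
    mask = rng.random(len(solution)) < rate
    return solution + mask * rng.normal(0.0, sigma, len(solution))

p1, p2 = rng.random(10), rng.random(10)
c1, c2 = single_point_crossover(p1, p2)
print(gaussian_mutation(c1))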

Novel Approach towards Arabic Question
Similarity Detection
Mohammad Daoud
CS Department, Faculty of IT
American University of Madaba
Madaba, Jordan
m.daoud@aum.edu.jo

Abstract—In this paper we are addressing the automatic detection of Arabic question similarity, which is an essential issue in a variety of NLP/NLU applications such as question answering systems, virtual assistants, chatbots, etc. We are proposing and experimenting with a rule-based approach that relies on lexical and semantic similarity between questions with the utilization of supervised learning algorithms. Our approach categorizes questions semantically according to their type and scope; this categorization is based on hypothetical rules that have been validated empirically, for example, a Timex Factoid question (a question asking about time) is less likely similar to an Enamex Factoid question (a question asking about a named entity). This article details the procedures of question pair preprocessing, lexical analysis, feature extraction and selection and, most importantly, the similarity measures. According to the experiment we have conducted, our approach achieved promising precision and accuracy based on test data of 1450 question pairs.

Keywords—text similarity, question analysis, question similarity, semantic similarity, data science, Natural Language Processing.

I. INTRODUCTION

Finding similarity between various textual units (words, expressions, phrases, paragraphs, ...) is an important NLP task [1]. Many applications report significant improvements in their performance when a text similarity component is deployed, such as information retrieval [2], machine translation [3], text clustering [4], sentiment analysis [5], etc. This task was tackled by researchers from different points of view. Some methods assume that two textual units are similar if they share subsequences of characters and words; for example, cosine similarity and Jaccard similarity [6] can be used as a simple similarity measure between phrases based on the common words between them. Semantic similarity tries to find logical similarity between texts even in the absence of lexical similarity [7]; for example, a semantic network or a corpus can be used to determine the degree of similarity between two words or expressions even if the text seems different in terms of its characters and words [8].

Similarity between questions is an interesting task that can be very helpful for a series of applications such as question answering systems [9], virtual assistants [10], chatbots [11], etc. It can be considered a sub-problem of text similarity. The challenge here is that questions are difficult to process and have short to no textual context. Besides, questions are paraphrased more often than other utterances [12].

Arabic question similarity is even more challenging, because Arabic is a pi-language (poorly informatized language) [13] [14] and gaining semantic information from its corpus is difficult. Few research attempts have addressed Arabic question similarity, where mediocre results have been achieved (when compared to other resourceful languages) [15].

With the absence or the scarceness of a relevant semantic corpus for Arabic, a rule-based system for categorizing questions can be used [16]. In this paper we are seeking a hybrid approach that utilizes supervised learning and hypothetical rules to find similarity and to detect paraphrasing.

Many researchers focus only on corpus data-driven approaches to cluster, classify and map words and phrases [7] [17]. We believe that this is an essential part of the similarity detection task. However, in the context of question similarity, certain rules can be set to improve the understanding of the questions and to relate them accordingly. For example, the following two questions are distanced even though they have high string similarity, high term similarity, and high semantic similarity, simply because the first one asks about the time and the second one asks about a location. Q1 = "Arabic: متى وقعت غزوة بدر؟ - English: When did the Battle of Badr take place?" Q2 = "Arabic: اين وقعت غزوة بدر؟ - English: Where did the Battle of Badr take place?". In this paper we are forming a framework to understand Arabic questions and to use this in improving question similarity.

This paper is organized as follows: the next section lists and compares the most relevant related work. After that, in section three we introduce our approach in question comparison and analysis. In section four we detail the aspects of the data set we are using for the experiment, and the preprocessing method. Then section five shows the experiment and its results, while section six evaluates and assesses our method. Finally, we draw some conclusions, future work and possible applications.

II. RELATED WORK

Similarity between phrases can be approached through textual (string) similarity and semantic similarity. Question similarity, which is the focus of this paper, is a sub-problem of phrasal similarity. Therefore, this section will address



phrasal similarity in general and then will discuss attempts of Arabic question similarity detection.

Textual similarity [18] relies on the string representation of phrases. And therefore, simply, two phrases are similar if they have similar strings. There are two main approaches in string similarity; the first one treats the phrase as a sequence of characters [19] and the second one treats phrases as lexical units glued with a syntax [20]. Longest common subsequence [21], Jaro [22], Damerau-Levenshtein [23], and Needleman-Wunsch [24] are considered amongst the most frequently used character-based similarity algorithms. While Block Distance, Cosine similarity [25], Dice's coefficient [26], Euclidean distance (L2), and Jaccard similarity [6] are well known algorithms for lexical-based similarity [27]. The advantage of these two approaches is that they are simple, and effective for short phrases that belong to the same domain, where there is limited word ambiguity.

Semantic similarity can be effective to address word ambiguity [17]. It tries to map different lexical units based on their meaning distance, regardless of their string distance. Most of the semantic similarity algorithms rely on large corpus to extract additional information about the constructs of the phrase. For example, finding the similar words based on their frequent colocation. The following algorithms and methods are considered as corpus semantic similarity algorithms: Hyperspace Analogue to Language (HAL) [28], Latent Semantic Analysis (LSA) [29], Generalized Latent Semantic Analysis (GLSA) [30], Explicit Semantic Analysis (ESA) [31], Pointwise Mutual Information - Information Retrieval (PMI-IR) [32], Second-order co-occurrence pointwise mutual information (SCO-PMI) [33], Normalized Google Distance (NGD) [34] and Extracting DIStributionally similar words using COoccurrences (DISCO) [35]. These algorithms are effective only with the availability of large and clean corpus, and they assume relatedness based on the textual colocations.

Usually, a semantic network is augmented to the semantic similarity engine such as Wordnet [36]. In fact many researchers are using Wordnet heavily to measure the distances between words and phrases, which can be considered as an independent semantic similarity measure. Which is effective for resourceful languages such as English (English Wordnet has 155 327 words organized in 175 979 synsets).

We are proposing a hybrid approach that utilizes string similarity and semantic similarity but without demanding huge resources, which is still considered a problem for languages such as Arabic.

III. QUESTION COMPARISON

In this paper we are introducing a novel method to determine similarity between two Arabic questions. Our algorithm employs textual and semantic similarity. This section details our approach, starting with the main algorithm, preprocessing, feature generation and question scope analysis.

A. Lexical and Semantic Similarity

To compare between two questions, we generate a list of features for every couple. Algorithm 1 receives q1 and q2 and utilizes certain similarity measures to produce a list of features that belong to the couple (q1 - q2).

Algorithm 1
FindSimilarityFeatures( Couples of questions C )
// start of Algorithm 1
For each couple cx ( q1 , q2 ) in C
    nq1 = Normalize ( q1 )
    nq2 = Normalize ( q2 )
    nqq1 = QuestionNormalization ( nq1 )
    nqq2 = QuestionNormalization ( nq2 )
    bowq1 = BOW ( nqq1 )
    bowq2 = BOW ( nqq2 )
    nerq1 = NER ( q1 )
    nerq2 = NER ( q2 )
    Posq1 = pos ( nqq1 )
    Posq2 = pos ( nqq2 )
    F[x][] = { lcs ( nq1 , nq2 ) ,
               cos ( bowq1 , bowq2 ) ,
               jaccard ( bowq1 , bowq2 ) ,
               euc ( bowq1 , bowq2 ) ,
               jaccard ( nerq1 , nerq2 ) ,
               cos ( nerq1 , nerq2 ) ,
               jaccard ( posq1 , posq2 ) ,
               cos ( posq1 , posq2 ) ,
               Startsim ( bowq1 , bowq2 ) ,
               Endsim ( bowq1 , bowq2 ) ,
               QWsim ( bowq1 , bowq2 ) }
Return F
// end of Algorithm 1

The algorithm starts by normalizing the Arabic text of q1 and q2. Then special question normalization is done as shown in Algorithm 2, where nonstandard question words and expressions are detected and replaced by standard words. This will eliminate unnecessary variations and will result in more accurate similarity measures. Algorithm 2 is equipped with a list of nonstandard question words and their standard equivalences. The list is sorted according to the length of the nonstandard question words, so that the algorithm will make longest match detection.

Algorithm 2
QuestionNormalization ( q )
// start of algorithm 2
Read table1 ( n , s ) [ ]
// table1 (n = non standard question words, s = standard question form)
// table1 is sorted according to the numbers of words of n in descending order
For each t ( n , s ) in table1 [ ]
    q.replace ( n , s )
Return q
// end of Algorithm 2

After question normalization, the similarity will be measured between the following: (1) the bag of words from the normalized q1 and q2, (2) the named entities in q1 and q2, and (3) q1 and q2 after part of speech tagging. For named entity recognition (NER) and for part of speech (POS) analysis we use [37].
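To make the lexical part of Algorithm 1 concrete, the following minimal Python sketch (not taken from the paper) computes three of its eleven features: LCS length, cosine similarity and Jaccard similarity over the bag of words. Whitespace tokenization is assumed, and the question normalization, NER and POS features are omitted here because the paper obtains segmentation, NER and POS from the external toolkit cited as [37].

# Minimal illustrative sketch of three of Algorithm 1's features
# (assumes whitespace tokenization; helper names are our own).
from collections import Counter
import math

def bow(question):
    # bag of words as a token count vector
    return Counter(question.split())

def jaccard(c1, c2):
    a, b = set(c1), set(c2)
    return len(a & b) / len(a | b) if (a | b) else 0.0

def cosine(c1, c2):
    common = set(c1) & set(c2)
    dot = sum(c1[t] * c2[t] for t in common)
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def lcs_length(s1, s2):
    # longest common subsequence over characters (dynamic programming)
    dp = [[0] * (len(s2) + 1) for _ in range(len(s1) + 1)]
    for i, a in enumerate(s1, 1):
        for j, b in enumerate(s2, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if a == b else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(s1)][len(s2)]

def similarity_features(q1, q2):
    b1, b2 = bow(q1), bow(q2)
    return [lcs_length(q1, q2), cosine(b1, b2), jaccard(b1, b2)]

# Example: two lexically close questions still share most lexical features.
print(similarity_features("متى وقعت غزوة بدر", "اين وقعت غزوة بدر"))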
Algorithm 3 dissimilarity of the last 1 or two words might alter the focus
Startsim( q1 [ ] , q2 [ ] ) of the questions completely.
// start of algorithm 3 Algorithm 5
If q1 [ 0 ] == q2 [ 0 ] and q1[ 1 ] = = q2 [ 1 ] QWsim( q1 [ ] , q2 [ ] )
Return 1 // start of algorithm 5
Else if q1 [ 0 ] == q2 [ 0 ] qw1 = Getquestionword ( q1 )
Return 0 qw2 = Getquestionword( q2 )
Else if qw1 and qw2 belong to same scope
Return – 1 Return 1
// end of algorithm 3 else if qw1 and qw2 belong to related scopes
Return 0
Algorithm 1 generates the following features:
else
1. Longest common subsequence between Return – 1
the normalized q1 and q2 // end of algorithm 5
2. Cosine similarity between the normalized
BOW of q1 and q2 Algorithm 5 calculates similarity based on question type,
it returns 1 if q1 and q2 are of the same type and scope. It
3. Jaccard similarity between the normalized
returns 0 if they have related scopes, and it returns -1 if they
BOW of q1 and q2
have completely different scopes. Getquestionword is
4. Euclidian distance between the normalized afunction that detects the question word(s) that has been in
BOW of q1 and q2 thequestion. Next section discusses question scopes analysis
in details.
5. Jaccard similarity between the named
entity of q1 and q2
B. Semantic similarity
6. Cosine similarity between the named Table 1 suggest a categorization of the main scopes of
entity of q1 and q2 Arabic questions, as we can see; each scope is categorized by
7. Jaccard similarity between the part of the possible question. The answer of a TimexF question
speech analysis for q1 and q2 would be a time or date. While the answer of a LocF
question is a location. Semantically the two question will
8. Cosine similarity between the part of most likely get two different answers and therefore, they
speech analysis for q1 and q2 have a semantic distance, even if the two questions are
9. Start similarity which is described in lexically similar.
algorithm 3
TABLE 1. Scopes of Arabic questions
10. End similarity which is described in
algorithm 4 ID Scope Question Paraphrase
words d words
11. Question word similarity which is TimexF Time - ‫ ايان‬,‫متى‬ ‫“ في اي وقت‬in
described in algorithm 5 Factoid “When” what time”
Algorithm 3 returns 1 if the 2 starting question words in ‫“ في اي سنة‬in
q1 and q2 are the same. It returns 0 if only the first word in what year”
q1 is equivalent to the first word in q2. And it returns -1 if ‫ما ھو تاريخ‬
the first and the second words in q1 are not the same as the “what is the
date”
words in q2.
LocF Location - ‫أين‬ ‫“ ما موقع‬What
Algorithm 4 Factoid Where is the
Endsim( q1 [ ] , q2 [ ] ) location”
// start of algorithm 4 ‫“ في اي مدينة‬in
If q1 [ q1.length – 1 ] == q2 [ q2.length – 1 ] what city”
“in which
and q1 [ q1.length – 2 ] == q2[ q2.length – 2 ] country” ‫في اي‬
Return 1 ‫دولة‬
Else if q1 [ q1.length – 1 ] == q2 [ q2.length – 1 ] NVF Numeric ‫كم‬ ‫“ ما طول‬what
Return 0 value - How many is the length”
Else Factoid How Much ‫ما ھي المسافة‬
Return – 1 “what is the
// end of algorithm 4 distance”
‫ما عرض‬
Algorithm 4 returns 1 if the last 2 words in q1 and q2 are “what is the
the same. It returns 0 if the last word in q1 is equivalent to width”
the last word in q2. And it returns -1 if the last two words in NEF Named ‫لمن‬ ‫“ الى من‬for
q1 are not the same as the words in q2. The idea behind the Entity - Whose whom”
feature generated by algorithm 4 is simple, some couple Factoid ‫“ من ھو‬Who
might produce high textual similarity, and however, the is”

‫“ الي‬For The 1450 couples were normalized (Arabic and question
whom” normalization) and then used to generate the features
NED Named ‫ ما‬,‫من‬ ‫ما تعريف‬ described in section III.
Entity - What “what is the
Definition difinition” The distribution of the scopes of the 600 unique
‫“ من ھو‬Who questions was as shown in table 2.
is”
M Method ‫كيف‬ ‫ما ھي طريقة‬ TABLE 2. The distribution of the scopes of the 600 unique questions
How “What is the
method” Scope Number of
‫ما ھو وصفة‬ questions
“What is the
recipe” Time - Factoid 88
‫ما الخطوات‬ Location - Factoid 79
“What are
the steps” Numeric value - 69
P Purpose ‫لماذا‬ ‫“ ما ھو السبب‬ Factoid
Why what is the
reason” Named Entity - 27
‫ما المسبب‬ Factoid
“What
Named Entity - 55
causes”
Definition
C Cause ‫ماذا‬ ‫ما الذي‬
What “What” Method 78
L List ‫ عدد‬,‫اذكر‬
List Purpose 48
YN Yes/No ‫ھل‬ ‫“ ء‬Question
Cause 45
Is/was/are… Hamza”
List 19
We seek to give a similarity measure for a couple of Yes/No 92
Arabic questions based on the scope of their interrogative
word (question word). We use empirical and hypothetical
approaches to establish the needed rules.
V. EXPERIMENT
It is intuitive that a method question that starts with “‫كيف‬
- How” will be dissimilar to a factoid timex question that We used several classification algorithms provided by
starts with “‫ متى‬- when” and based on that we can WEKA 3.8 [38] on the generated data set. Random Forests
hypothesize the following rule: [39] with 10 folds cross validation has produced the best
results amongst other classifiers that we have tested in terms
If q1.scope = M and q2.scope = TimexF then qw1 = -1 of precision, recall and f-Measures.
This hypothetical rule can be confirmed empirically by Table 3 shows the results reported from Random Forests
an experiment. In the same way we assumed that if the scope Classifier.
of the two questions is the same then they have a similarity
measure of 1.
TABLE 3. Results reported by Random Forests Algorithm, with our
We found out through the experiment that some of the proposed features
scopes have unconfirmed similarity such as NEF – NED, and
Precision Recall F-measures
P – M. Therefore, such occurrence would result in a 0
similarity measure. Yes 0.82 0.59 0.69
No 0.85 0.95 0.90
IV. DATA PREPARATION
For experimentation, we have selected 300 Arabic Weighted 0.84 0.85 0.84
questions from the Frequently Asked pages of various United Avg.
Nation’s organizations. And we have randomly selected 300
interrelated casual Arabic questions from ejaaba.com. We
used these 600 questions to randomly generate 1450 couples. To evaluate our novel approach we ran the test after
Each couple was given a YES or NO label, to indicate the removing our special features (End similarity, Start
similarity of the two questions. 419 couples were labeled Similarity, Question Word Similarity), and therefore the
with a YES, and 1031 couples were labeled as NO. Because remaining features were simply based on cosine similarity,
it was difficult to find YES-labeled questions in the jaccard similarity, Euclidean distance and Longest Common
randomly generated couples, we used paraphrasing to Subsequence. Table 4 shows results for the same test but
generate half of the YES-labeled couples and we used the without our features.
same technique with 100 NO-labeled questions.
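As a rough, hypothetical equivalent of the experimental setup described above (the paper itself uses WEKA 3.8 [38]), the following scikit-learn sketch evaluates a Random Forests classifier with 10-fold cross-validation. The feature matrix and labels below are placeholders standing in for the 1450 generated couples and their YES/NO labels.

# Hypothetical scikit-learn counterpart of the WEKA experiment:
# Random Forests with 10-fold cross-validation on placeholder data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.random((1450, 11))          # 1450 couples x 11 similarity features (placeholder)
y = rng.integers(0, 2, size=1450)   # YES/NO labels (placeholder)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(clf, X, y, cv=10, scoring="f1_weighted")
print(scores.mean())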

TABLE 4. Results reported by Random Forests Algorithm, without our [2] X. Ye, H. Shen, X. Ma, R. Bunescu, and C. Liu, “From word
proposed features
embeddings to document similarities for improved information
Precision Recall F-measures retrieval in software engineering,” in Proceedings of the 38th
Yes 0.40 0.32 0.35 international conference on software engineering, 2016, pp. 404–
415.
No 0.74 0.80 0.77
[3] M. Simard, N. Ueffing, P. Isabelle, and R. Kuhn, “Rule-based
Weighted 0.64 0.66 0.65 translation with statistical phrase-based post-editing,” in
Avg. dl.acm.org, 2007, pp. 203–206.
[4] C. C. Aggarwal and C. X. Zhai, “A survey of text clustering
algorithms,” in Mining Text Data, vol. 9781461432, Boston, MA:
As you can see there is a significant drop in accuracy for Springer US, 2012, pp. 77–128.
the same algorithms in terms of precision (-0.2), recall (-
0.19) and F-measures (-0.19). [5] B. Pang and L. Lee, Opinion Mining and Sentiment Analysis:
Foundations and Trends in Information Retrieval, vol. 2, no. 1–2.
2008.
VI. EVALUATION AND ASSESSMENT
[6] A. Huang, “Similarity measures for text document clustering,” in
Our system can detect question paraphrasing and
New Zealand Computer Science Research Student Conference,
synonymy with an overall precision of 0.85. The proposed
question type similarity increased the accuracy, especially NZCSRSC 2008 - Proceedings, 2008, pp. 49–56.
for NO-labeled questions. This was achieved without using a [7] A. Islam, “Semantic text similarity using corpus-based word
lexical or semantic dictionary. similarity and string similarity,” ACM Trans. Knowl. Discov.

From table 3 we notice that the accuracy of the YES- [8] M. Steyvers and J. B. Tenenbaum, “The large-scale structure of
labeled questions is behind the accuracy of the NO-Labeled semantic networks: Statistical analyses and a model of semantic
questions and that can be due to the fact that question type growth,” Cogn. Sci., vol. 29, no. 1, pp. 41–78, 2005.
similarity was very effective in determining if two questions [9] J. Weston et al., “Towards AI-Complete Question Answering: A
are dissimilar (for example, “When” questions can’t be Set of Prerequisite Toy Tasks,” arxiv.org, 2015.
similar to “Where” questions, and that can be easily
[10] T. R. Gruber, C. D. Brigham, D. S. Keen, G. Novick, and B. S.
determined). However, determining similar questions within
the same scope needs more than question type similarity. We Phipps, “Using Context Information to Facilitate Processing of
noticed that some of the YES-Labeled errors could be Commands in A Virtual Assistant,” Washington, DC U.S. Pat.
avoided by a simple synonymy lexicon. Trademark Off., 2018.

Our accuracy results are comparable with similar [11] N. M. Radziwill and M. C. Benton, “Evaluating Quality of
experiments, even those that were performed on resourceful Chatbots and Intelligent Conversational Agents,” Apr. 2017.
languages such as English [40] [41]. [12] T. Jurczyk, A. Deshmane, and J. D. Choi, “Analysis of
Wikipedia-based Corpora for Question Answering,” Jan. 2018.
We believe that utilizing a domain dedicated lexicon can
improve the results even more, and that is definitely a future [13] M. Daoud, “Building Arabic polarizerd lexicon from rated online
research focus. customer reviews,” in Proceedings - 2017 International
Conference on New Trends in Computing Sciences, ICTCS 2017,
VII. CONCLUSION 2018, vol. 2018-Janua, pp. 241–246.

We have presented a novel approach to detect similarity [14] C. R. Silveira, M. T. P. Santos, and M. X. Ribeiro, “A flexible
between Arabic questions. Our rule based similarity architecture for the pre-processing of solar satellite image time
algorithm showed effectiveness according to the experiment series data - The SETL architecture,” Int. J. Data Mining, Model.
we have conducted, despite its limited dependency on a Manag., vol. 11, no. 2, pp. 129–143, 2019.
lexical resource. String based similarity and lexical based [15] A. Hamza, N. En-Nahnahi, K. A. Zidani, and S. El Alaoui
similarity can be used as a base for our algorithm, but they Ouatik, “An arabic question classification method based on new
have narrow capabilities and thus our proposed similarity
taxonomy and continuous distributed representation of words,” J.
measures presented in this paper has improved accuracy and
King Saud Univ. - Comput. Inf. Sci., 2019.
precision. The results obtained by the experiment were
comparable to similar experiments in the English language, [16] C. Grosan and A. Abraham, “Rule-Based Expert Systems,” 2011,
which is significant considering that English is a resource pp. 149–185.
rich language if compared to Arabic. We anticipate that the [17] A. Prior and M. Geffet, “Word Association Strength, Mutual
result will be improved furthermore with the help of a Information and Semantic Similarity,” in EuroCogSci 2003,
carefully constructed multi domain Arabic lexicon. And this
2003.
is part of our future work.
[18] J. Lu, C. Lin, W. Wang, C. Li, and H. Wang, “String similarity
measures and joins with synonyms,” in Proceedings of the 2013
REFERENCES
international conference on Management of data - SIGMOD ’13,
2013, p. 373.
[1] M. K. Vijaymeena and K. Kavitha, “A survey on similarity [19] G. Navarro and Gonzalo, “A guided tour to approximate string
measures in text mining,” Mach. Learn. Appl. An Int. J., vol. 3, matching,” ACM Comput. Surv., vol. 33, no. 1, pp. 31–88, Mar.
no. 2, pp. 19–28, 2016. 2001.

[20] P. Gamallo, C. Gasperin, A. Agustini, and G. P. Lopes, similar words,” Nat. Lang. Process., no. 2003, pp. 37–44, 2008.
“Syntactic-Based Methods for Measuring Word Similarity,” [36] G. A. Miller, “WordNet: A Lexical Database for English,”
Springer, Berlin, Heidelberg, 2001, pp. 116–125. Commun. ACM, vol. 38, no. 11, pp. 39–41, Nov. 1995.
[21] A. Apostolico and C. Guerra, “The longest common subsequence [37] A. Abdelali, K. Darwish, N. Durrani, and H. Mubarak, “Farasa:
problem revisited,” Algorithmica, vol. 2, no. 1–4, pp. 315–336, A Fast and Furious Segmenter for Arabic.”
Nov. 1987. [38] E. Frank et al., “Weka-A Machine Learning Workbench for Data
[22] P. Angeles and A. Espino-gamez, “Comparison of methods Mining,” in Data Mining and Knowledge Discovery Handbook,
Hamming Distance , Jaro , and Monge-Elkan,” DBKDA 2015 Boston, MA: Springer US, 2009, pp. 1269–1277.
Seventh Int. Conf. Adv. Databases, Knowledge, Data Appl., no. c, [39] L. Breiman, “Random forests,” Mach. Learn., pp. 5–32, 2001.
pp. 63–69, 2015.
[40] P. Nakov et al., “SemEval-2017 Task 3: Community Question
[23] F. Miller, A. Vandome, and J. McBrewster, “distance: Answering.”
Information theory, computer science, string (computer science),
[41] B. V Galbraith, B. Pratap, and D. Shank, “Talla at SemEval-2017
string metric, damerau? Levenshtein distance, spell checker,
Task 3: Identifying Similar Questions Through Paraphrase
hamming distance,” 2009.
Detection.”
[24] V. Liki, “The Needleman-Wunsch algorithm for sequence
alignment 7th Melbourne Bioinformatics Course,” cs.sjsu.edu,
pp. 1–46.
[25] R. Mihalcea, C. Corley, and C. Strapparava, “Corpus-based and
knowledge-based measures of text semantic similarity,” in
Proceedings of the National Conference on Artificial
Intelligence, 2006, vol. 1, pp. 775–780.
[26] N. Oco, L. R. Syliongka, R. E. Roxas, and J. Ilao, “Dice’s
coefficient on trigram profiles as metric for language similarity,”
in 2013 International Conference Oriental COCOSDA held
jointly with 2013 Conference on Asian Spoken Language
Research and Evaluation (O-COCOSDA/CASLRE), 2013, pp. 1–
4.
[27] D. Daoud and M. Daoud, “Extracting terminological
relationships from historical patterns of social media terms,” in
Lecture Notes in Computer Science (including subseries Lecture
Notes in Artificial Intelligence and Lecture Notes in
Bioinformatics), 2018, vol. 9623 LNCS, pp. 218–229.
[28] L. Azzopardi, M. Girolami, and M. Crowe, “Probabilistic
hyperspace analogue to language,” in Proceedings of the 28th
annual international ACM SIGIR conference on Research and
development in information retrieval - SIGIR ’05, 2005, p. 575.
[29] T. Hofmann, “Probabilistic latent semantic indexing,” in
Proceedings of the 22nd Annual International ACM SIGIR
Conference on Research and Development in Information
Retrieval, SIGIR 1999, 1999, vol. 51, no. 2, pp. 50–57.
[30] M. Monjurul Islam and A. S. M. Latiful Hoque, “Automated
essay scoring using Generalized Latent Semantic Analysis,” in
2010 13th International Conference on Computer and
Information Technology (ICCIT), 2010, pp. 358–363.
[31] O. Egozi, S. Markovitch, and E. Gabrilovich, “Concept-Based
Information Retrieval Using Explicit Semantic Analysis,” ACM
Trans. Inf. Syst., vol. 29, no. 2, pp. 1–34, Apr. 2011.
[32] G. Bouma, “Normalized (Pointwise) Mutual Information in
Collocation Extraction.”
[33] M. A. Islam and D. Inkpen, “Second Order Co-occurrence PMI
for determining the semantic similarity of words,” in Proceedings
of the 5th International Conference on Language Resources and
Evaluation, LREC 2006, 2006, pp. 1033–1038.
[34] R. L. Cilibrasi and P. M. B. Vitanyi, “The Google Similarity
Distance,” IEEE Trans. Knowl. Data Eng., vol. 19, no. 3, pp.
370–383, Mar. 2007.
[35] P. Kolb, “Disco: A multilingual database of distributionally

Using K-Means Clustering and Data Visualization
for Monetizing logistics Data
George Sammour2 Koen vanhoof 1
Hamzah Qabbaah1
Department of Management Department of Business Informatics1
Department of Business Informatics1
Information Systems2
Hasselt university
Hasselt university
Princess Sumaya University for
Diepenbeek, Belgium
Diepenbeek, Belgium Technology (PSUT) Amman, Jordan
Koen.vanhoof@uhasselt.be
Hamzah.qabbaah@uhasselt.be George.sammour@psut.edu.jo

Abstract— Logistics companies collect large amounts of data on the shipments they perform while at the same time facing a challenge to understand their complicated market better. They can extract useful market knowledge by using data mining technologies such as visualization and clustering. The detailed results of such big data analytics methods can also be monetized under certain circumstances. We studied the data on the transactions of a logistics company in the Middle East. K-Means clustering of their data proved to generate deeper insight into several clusters of customers having different profiles. The results propose a best fit model for the clustering. Since the clustering and visualization results are relevant, reliable and anonymous, they fit the monetization criteria as well. Improved data driven marketing applications are possible for the customers.

Keywords— k-means clustering, data visualization, customer segmentation, big data monetization

I. INTRODUCTION

Data, when analysed and interpreted well, can tell companies a lot about their customers' interests and allow them to improve their customers' experiences. They are also a potential source of income generation [1]. Companies like Google and Facebook are already earning most of their revenues by enabling marketers to target a specific audience, based on the audience characteristics [2]. Companies can derive this income from their own collected high-quality data by selling them to other companies. Data are thus valuable for internal use and potential use by other companies [3]. This monetizing process is however facing a number of challenges. Acquiring the required large amount of data often exceeds the budgets of potential customers, and the platforms for monetizing the data efficiently are still lacking. Moreover, data quality has to be unquestionable [4].

Data monetization by companies has not been studied extensively. Only few articles [5-7] have studied the phenomenon, and mainly from the angle of concerns over privacy. Authors in [5] adopted an economics-based approach which addresses the issue of disseminating sensitive data to a third party data user [2]. The economics-based approach normally assesses the value of the data to be monetized on four characteristics. The data quality has to be reliable, the data set has to be relevant to the potential customer, the data have to be anonymous and secured, and finally segmented data have a larger potential to lead to relevant business applications [3] [2]. This signifies that before starting the monetizing effort, companies first have to visualize their data and apply segmentation methods to them to make them more valuable for potential customers.

This paper investigates this process in particular for a large logistics company in the Middle East. Our focus will be mostly on the segmentation phase as well as visualizing the data set to be monetized.

II. RESEARCH QUESTION AND METHODOLOGY

In this paper, we try to answer the following research question: "How can segmentation in several customer groups be used to enable the monetization of the data used in it?" The data mining technique we used to segment the available data set is clustering.

Clustering is the task of segmenting a heterogeneous population into a number of more homogeneous subgroups or clusters with similar characteristics, such that both the homogeneity of elements within clusters and the heterogeneity between clusters are maximized [8]. It has been applied in a wide variety of fields, such as engineering, computer sciences (web mining, spatial database analysis, and segmentation), life and medical sciences, earth sciences, social sciences and economics (in marketing, business analysis and CRM management) [9]. What distinguishes clustering from classification is consequently that clustering does not rely on ordering data along predefined classes. Cluster analysis is based on heuristics that try to maximize the similarity between in-cluster elements and the dissimilarity between inter-cluster elements [10]. This task has been performed in our paper through the k-Means algorithm. This algorithm partitions the data set into k clusters in which each object or instance is assigned to the closest central point with the nearest mean [11]. Next, the heuristic performs a reassignment of the central points. The algorithm is completed when the assignments of the individual instances no longer change.

Our study consists of two separate parts. The first part results in obtaining a data set that eventually can be monetized. This part develops in-depth statistics and visualization charts about the dataset. It also shows the product market share statistics according to our destination countries. Since this part aims at showing in which way the visualization charts and statistics can help in getting a clearer understanding about the dataset, we will in this paper only shortly refer to it; moreover, we will show an example of products market share for the Jordan case. The second part is the K-Means clustering itself, which is explained in a separate section. We then look at the monetizability. We will use the monetizing characteristics mentioned before in section one to evaluate the monetizability of these data.
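For readers who want to see the assign/update loop described above spelled out, here is a compact NumPy sketch of k-Means on synthetic two-dimensional data. It is only an illustration; the clustering reported later in the paper is produced with Tableau's built-in implementation, and the value of k and the toy data here are arbitrary.

# Illustrative k-Means assign/update loop on synthetic data
# (not the Tableau implementation used in the paper).
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # assignment step: each instance goes to its nearest center
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: recompute each center as the mean of its members
        new_centers = np.array([points[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):   # assignments no longer change
            break
        centers = new_centers
    return labels, centers

data = np.random.default_rng(1).normal(size=(300, 2))
labels, centers = kmeans(data, k=3)
print(centers)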
III. DATA PREPARATION AND VISUALIZATION TABLE I. DATA DESCRIPTION

The data used in this research were obtained from a Variables Data type Data Description
logistics services company situated in the Middle East. Variables in the original data set
Cleaning, merging tables and pre-processing of the data has
ID Integer The ID of the order
been applied in order to obtain the final data set. We have
created new relevant variables to describe and group some CODValueUSD Double The amount of cash on delivery
other variables. We also standardized the values of some
Payment String Type of payment: prepaid, cash, third
variables to the same unit (kg , US dollar) to have more party, free
accurate results when analyzing the data. The total number of Destination String Destination city of shipment
transactions in the final dataset is equal to the size of the
sample (n=85959). Table I below shows the variables, the type Origin Country String Country of origin of the shipment
of data they represent and the description of each of the DestCountry String Country of destination of the
variables used in our research. shipment
ShipperID Integer The ID of the E-commerce
These data were then visualized using Tableau software. companies
Different attributes and dimensions, such as location, products, CODFlag Boolean Cash on delivery flag
customers and e-commerce companies were extensively
“Consignee Tel” Integer The telephone number of the
represented in graphs. These attributes are grouped in different customer
ways, such as by “customers”, “products”, “e-companies”, Created variables after data preparation
“destination countries” and so on. The list of visualized
dimensions is only shown here in Table II. Moreover, our Weight In KG Double Total weight in KG
destination countries were “Saudi Arabia”, “UAE” and Total Value Double The price of the goods in the
“Jordan”, we will show a sample of the results of the e- USD shipment in US Dollar
commerce market share for the common products transferred Product Group String Product group name of the shipment
only in Jordan case as shown in Fig.1. name
Product group Integer Product group ID
The figure presents the e-commerce companies market ID
share on the basis of the products transactions for Jordan. E-
company “15037” has the highest market share for “Apparel”,
”Bag/Case”, “Beauty supplies”, “Book”, “Food/Grocery”,
“Jewellery Accessories” and “shoes” with 69%, 86%, 92%, TABLE II. THE VISUALISATION OF THE DIFFERENT DIMENSIONS
82%, 77%, 85% and 79% respectively. Whereas e-company
“197483” has the highest market share percentages for “letter/ Number of transactions Dimensions
(variables)
card/ document” product with 40%. We can see the market Distribution percentages of country of Origin country
share of the products for the top five e-companies in the figure. destination, country of origin and city of Destination country
destination. Destination city
Products transferred to the country of Products 
IV. K -MEANS CLUSTER ANALYSIS AND RESULTS destination Destination country
Products transferred to the city of destination Products 
Customer segmentation focuses on getting knowledge Destination city
about the structure of customers and is used for targeted Products transferred from country of origin Products Origin
country
marketing [12], such as in new product development, The distribution percentages of the e- E-commerce
optimizing placement of retail products on shelves, analysis of commerce companies have orders transferred companies 
cannibalization between products and more general in to country of destination Destination country
analysing the affinity between products and cross-category The distribution percentages of the customers Customer 
sales promotion [13, 14]. The segmentation efforts we have orders transferred to the country of Destination country
destination
performed are essential for developing improved segmentation
The distribution percentages of the product Customer 
bases for e-marketing applications such as the monetization of categories by the customers Products
the data in the dataset. [14, 15] The distribution percentages of the product Origin country 
categories transferred to the countries of Destination country
Our segmentation model has the purpose to find segments destination from the countries of origin.  Products
of customers sharing the same profile on the basis of a Retuned orders distribution by the country of Return products
combination of the variables products bought, location and destination Destination country
value of the goods purchased. Retuned orders distribution by the city of Return products 
destination Destination city
The variables used in our model are Avg. Total Value Retuned orders distribution by the e- Return products 
USD, Product Group Name, Country of destination, Consignee commerce companies E-commerce company
Tel and Destination. Retuned orders distribution by the customers Return products 
Customer
In order to find the best cluster fit experiments we have
experiment the analysis for 2 to 5 clusters. Table III shows the
results of the 2-clusters solution. Both clusters have “Apparel”
as the most common product ordered.

TABLE IV. THE RESULTS OF THE 3-CLUSTERS SOLUTION
Attributes/
Cluster 1 Cluster 2 Cluster 3
Clusters
Number of 78244 5457 2257
Items
Avg. Total 111.31 46.487 95.425
Value USD
Product Group Apparel Apparel Apparel
Name

Most Common
Country of SA JO AE
Fig.1: E-commerce companies market share on the basis of the Destination
products transactions for Jordan Consignee Tel 9665555XX 96265358X 97145076XX
X XX X
Destination RUH AMM DXB
Note DXB: Dubai.AE: United Arab Emirates

TABLE V. THE RESULTS OF THE 4-CLUSTERS SOLUTION


Attributes/
Cluster 1 Cluster 2 Cluster 3 Cluster 4
Clusters
Number of 78234 5457 2257 10
Items
Avg. Total 109.09 46.487 95.425 17466
Value USD
Product Apparel Apparel Apparel DVD/CD
Group Name

Most Common
Country of SA JO AE SA
Destination
Consignee 9665555 9626535 97145076 96614393
Tel X 8XXX XXX X
Destination RUH AMM DXB JED
Note: JED: Jeddah

TABLE VI. THE RESULTS OF THE 5-CLUSTERS SOLUTION


Fig.1: E-commerce companies market share on the basis of the
products transactions for Jordan Attributes/
Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5
Clusters
Number of 78140 5457 2257 99 5
Items
Cluster-1 shows that the orders are most frequently Avg. Total 105.27 46.487 95.425 3608 25251
shipped to “Riyadh” and have an average total price of 110.86. Value USD
Product Apparel Apparel Apparel Apparel Camera
In Cluster-2 orders are most frequently shipped to “Amman”
Group
with an average of the price of 46.487. The Consignee Tel.
Most Common

Name
variable shows the most common customer having transactions Country of SA JO AE SA SA
within each clusters. Destination
Consignee 9665555 962655 971457 9665712 9665051
Table IV, table V and table VI show the results of the 3- Tel XX 8XX 7XX XX XX
clusters, 4-clusters and 5-clusters solution respectively. Destination RUH AMM DXB RUH JED

TABLE III. THE RESULTS OF THE 2-CLUSTERS SOLUTION


Table IV shows that all clusters have “Apparel” as the most
Attributes/Clusters Cluster 1 Cluster 2 common product ordered. Cluster-1 shows that the orders are
Number of Items 80501 5457 most frequently shipped to “Riyadh” and have an average total
Avg. Total Value 110.86 46.487 price of 111.31 . In Cluster-2 orders are most frequently
USD shipped to “Amman” with an average of the price of 46.487.
Product Group Apparel Apparel Cluster-3 contains orders shipped most frequently to “Dubai”
Most Common

Name with the average total price of 95.425. The Consignee Tel.
Country of SA JO
Destination
variable shows the most common customer having transactions
Consignee Tel 9665555XXX 9626535XXX within each clusters.
Destination RUH AMM Table V shows the first three clusters have “Apparel”
Note: RUH: Riyadh, AMM: Amman. SA: Saudi Arabia, JO: product as a most common product ordered, while in cluster-4
Jordan the most common one is “DVD/CD”. Cluster-1 shows that the

most frequent orders are shipped to "Riyadh" with the average of the total price = 109.09. Cluster-2 shows that the most frequent orders are shipped to "Amman" with the average of the price = 46.487. Cluster-3 shows that the most frequent orders are shipped to "Dubai" with the average of the total price = 95.425.

Cluster-4 shows that the most frequent orders are shipped to "Jeddah" with the highest average of the total price = 17466. The Consignee Tel variable shows the most common customer having transactions within each cluster.

Table VI shows the first four clusters have the "Apparel" product as the most common product ordered, while in cluster-5 the most common one is "Camera". Cluster-1 shows that the most frequent orders are shipped to "Riyadh" with the average of the total price = 105.72. Cluster-2 shows that the most frequent orders are shipped to "Amman" with the average of the total price = 46.487. Cluster-3 shows that the most frequent orders are shipped to "Dubai" with the total average of the price = 95.425. Cluster-4 shows that the most frequent orders are shipped to "Riyadh" with the average of the total price = 3608.

Finally, cluster-5 shows that the most frequent orders are shipped to "Riyadh" with the highest average of the total price = 25251. The Consignee Tel variable shows the most common customer having transactions within each cluster.

We use the Calinski-Harabasz criterion to assess cluster quality. The Calinski-Harabasz criterion is defined as (1):

    CH = (SSB / (k - 1)) / (SSW / (N - k))        (1)

where SSB is the overall between-cluster variance, SSW the overall within-cluster variance, k the number of clusters, and N the number of observations [16]. The greater the value of this ratio, the more cohesive the clusters (low within-cluster variance) and the more distinct/separate the individual clusters are (high between-cluster variance). If a user does not specify the number of clusters, Tableau picks the number of clusters corresponding to the first local maximum of the Calinski-Harabasz index automatically. The result of the Calinski-Harabasz test indicates that the best cluster fit model contains three clusters.

To validate the best fit cluster solution we used ANOVA statistics. The results are shown in Table VII. The analysis of variance (ANOVA) of all cluster solutions shows that the p-value < 0.001 for the continuous variable "Total Value USD". So the values were statistically different between all the clusters. Moreover, the number of items of the last two clusters for the 5-clusters solution is 99 and 5 items only, and cluster-4 in the 4-clusters solution only contains 10 items. The distribution of the number of items for the 3-clusters solution is much more acceptable, since cluster-3, containing the lowest number of items, counts 2257 items in total. Therefore our selection confirms the Calinski-Harabasz result that the 3-clusters solution is the best cluster fit for the model.

Fig.2 shows the distribution of the average of the total price according to the 3-clusters solution for our model.

TABLE VII. THE RESULTS OF THE ANALYSIS OF THE VARIANCE TEST FOR OUR MODEL

Number of clusters   Variable               F-statistic   P-value
2-clusters           Avg. Total Value USD   269.9         0.000
3-clusters           Avg. Total Value USD   138.5         0.000
4-clusters           Avg. Total Value USD   1.28e+04      0.000
5-clusters           Avg. Total Value USD   1.4e+04       0.000

Cluster-1 has the highest average of the total price of shipments transferred to "Riyadh" in Saudi Arabia with 1250 USD. The most expensive shipped products are "Computer" and "I-Phone" respectively in the cluster. The most expensive shipped products in cluster-2, which are transferred to "Amman" in Jordan, are "IPad", "Computer" and "Laptop" with 600, 450 and 440 USD respectively. The most expensive shipped products in cluster-3, which are transferred to "Dubai" and "Abu Dhabi" in UAE, are "Laptop" and "Computer" with 900 and 600 USD respectively. Whereas the shipped products to "Abu Dhabi" are much cheaper, since the average of the total values is less than 200 USD.

IV. CONCLUSION

Our best fit K-Means clustering model segmented the customers mainly according to destination cities, products and the price. Each cluster group profiles customers sharing identical product interests coupled to the amount they normally like to spend when using e-commerce for shopping. The model proves to be an excellent model for e-commerce websites wanting to segment their customers based on their interests and location, one of the potential marketing applications [17].

Moreover, the clustering and data visualization also allow to know the distribution pattern of the shipments according to "product types", "customers", "cities" and so on. This information is highly valuable for the logistics companies possessing these datasets. It helps them in managing their transactions better, but also allows to monetize the knowledge contained in their data and sell it to other companies. The major benefit lies in identifying groups of customers with profiles that are fairly similar and to draw value from these profile characteristics as much as possible.

Knowing for instance the average value of the shipments and a percentage-wise subdivision of the product categories involved that are shipped to a certain destination is marketing knowledge shippers (shipperID was one of the variables) can be interested in for directing their marketing efforts. These companies normally do not have this knowledge themselves in the same detail, so the logistics service companies can help to improve their marketing efforts and eventually monetize the data as a marketing application. The data are reliable (as they are taken from the dataset of all logistics transactions by the logistics company). They are relevant to the customer companies as they are all situated in the same sector and region. The shipperID makes the data anonymous and the results are segmented. Thus all four criteria for monetization previously mentioned are fulfilled.
according to the 3-clusters solution for our model.

Fig.2 The distribution of the average of the total price of the most frequently shipped products to the
most common destinations according to the 3-cluster solution.

V. MANAGERIAL CONSEQUENCES
We recommend all e-commerce companies to segment
Data
their customer base. It will improve their campaign contents by Segmentation
tying them better to customer characteristics and thus improve
their effectiveness perspective. The study grouped each Data source Customer:
Data collection Demographic
customer per product category most frequently bought, Data Geographic
location and e-commerce company most frequently dealt with. Preprocessing Behavioral
Thus when an e-commerce company intends to increase their Profitability

market share, it should consider the customers segmentation


results in their communication. They should for instance not
send campaigns of “electronics” products to customers just
interested just in “Food” products, or direct communication
campaigns to customers in the SA offering of products just
Marketing Data
send to Jordan. In the other words, sending relevant
Application Visualization
advertisements to the right customers based on their interests
and characteristics will have positive short term effects and in Targeted And
the long run avoid customers unsubscribing from their website. communication
Clustering Data
Moreover the results of the segmentation model can also be mining
Monetizing data
monetized by the companies gathering the individual modeling
transaction data.
Thus this study is an excellent example of the process by
which data driven marketing applications are developed over FIG.3. DATA DRIVEN MARKETING APPLICATIONS PROCESSES FOR
time by companies. This process starts with the data available LOGISTIC COMPANIES

in the company and their pre-processing and ends with results


that could be used for marketing applications including their
monetization. In our case this was linked to logistics channels.
This process is represented in Fig. 3.

In this paper we indeed proposed a marketing application conference on Next Generation Mobile Apps, Services and
for a logistics company. After the first step in which the data Technologies. 2013.
[5] Li, X.-B. and S. Raghunathan, Pricing and disseminating customer data
are made ready for data modelling, we suggest that the with privacy awareness. Decision support systems, 2014. 59: p. 63-73.
companies involved segment their customers geographically, [6] Laudon, K.C., Markets and privacy. Commun. ACM, 1996. 39(9): p. 92-
behaviourally and on the basis of profitability. The products 104.
and transaction routes can be segmented on the basis of the [7] Bélanger, F. and R.E. Crossler, Privacy in the Digital Age: A Review of
logistics application service we proposed. The next step is to Information Privacy Research in Information Systems. MIS Quarterly,
2011. 35(4): p. 1017-1041
visualize the results that have to be made clear for the decision [8] Joseph F. Hair, J., et al., Multivariate data analysis (4th ed.): with
makers. Thus our proposed work is made ready to be used for readings. 1995: Prentice-Hall, Inc. 745.
marketing applications linked to the logistics channels. [9] George Sammour , B.D., Koen Vanhoof and Geert Wets, Identiying
homogenous customer segments for risk email marketing experements,
Our research used logistics data in a different way. By in 11th International Conference on Enterprise Information Systems.
applying k-means clustering to these data. We focused on 2009: milan , italy. p. 89-94.
finding segments of customers sharing the same profile on the [10] Fraley, C. and A.E. Raftery, Model-Based Clustering, Discriminant
basis of a combination of the variables products bought, Analysis, and Density Estimation. Journal of the American Statistical
location and value of the goods purchased. Our contribution in Association, 2002. 97(458): p. 611-631.
[11] Carmona, C.J., et al., Web usage mining to improve the design of an e-
this study is to add to this research stream the value of commerce website: OrOliveSur.com. Expert Systems with Applications,
extensively looking into the monetization possibility of specific 2012. 39(12): p. 11243-11249.
logistics data of e-commerce companies (a field and [12] Gruca, T.S. and B.R. Klemz, Optimal new product positioning: A
combination has not studied before) and tries to indicate genetic algorithm approach. European Journal of Operational Research,
whether in an international context these data are valuable 2003. 146(3): p. 621-633.
[13] Leeflang, P.S.H., et al., Decomposing the sales promotion bump
enough to be marketed. accounting for cross-category effects. International Journal of Research
in Marketing, 2008. 25(3): p. 201-214.
[14] Holý, V., O. Sokol, and M. Černý, Clustering retail products based on
customer behaviour. Applied Soft Computing, 2017. 60: p. 752-762.
REFERENCES [15] Tsai, C.Y. and C.C. Chiu, A purchase-based market segmentation
methodology. Expert Systems with Applications, 2004. 27(2): p. 265-
[1] Tsai, C.-W., et al., Big data analytics: a survey. Journal of Big Data, 276.
2015. 2(1): p. 21. [16] Tableau. Find Clusters in Data. 2019; Available from:
[2] Bataineh, A.S., et al., Monetizing Personal Data: A Two-Sided Market https://onlinehelp.tableau.com/current/pro/desktop/en-us/clustering.htm.
Approach. Procedia Computer Science, 2016. 83: p. 472-479. [17] Hamzah Qabbaah, George.Sammour., Koen Vanhoof, DECISION TREE
[3] platform, L.s.d.m., How to Monetize Your Data, in How to Monetize ANALYSIS TO IMPROVE E-MAIL MARKETING CAMPAIGNS.
Your Data. 2018, Lotame. International Journal “Information Theories and Applications”, 2018.
[4] Mizouni, R. and M.E. Barachi. Mobile Phone Sensing as a Service: 25(4): p. 303-330.
Business Model and Use Cases. in 2013 Seventh International

Content Based Image Retrieval Approach using
Deep Learning
Heba Abdel-Nabi, Ghazi Al-Naymat, Arafat Awajan
Department of Computer Science, Princess Sumaya University for Technology, Amman, Jordan
h.yousif88@yahoo.com, g.naymat@psut.edu.jo, awajan@psut.edu.jo

Abstract— In a world that seeks perfect results for any search query, an information retrieval system that produces accurate and relevant output is desired. However, because of the famous semantic gap problem of image representation, a Content Based Image Retrieval (CBIR) system faces some difficulties, since it highly depends on the extracted image features as the basis for a similarity check between the query image and the database images. The proposed approach overcomes these difficulties with the aid of the fastest-growing technology, namely deep learning. In addition, it explores the effects of merging the features extracted from the latter layers of the deep network to achieve better retrieval results. The experimental results demonstrate the effectiveness of the proposed scheme in terms of the number of relevant retrieved images for the query results and the mean average precision, while keeping the computational complexity low, since it uses an already trained deep convolutional model called AlexNet. Thus, the complexity of training a deep model from scratch is avoided.

Keywords—Image Retrieval, Content Based, Deep Learning, AlexNet.

I. INTRODUCTION
In the fast growing and technology accelerated era, the distribution and storage of digital images have become easy and widely available. Therefore, a huge amount of digital images is stored and uploaded online in huge databases, such as the World Wide Web or medical image databases. Consequently, search queries based on images have become essential. Since these databases differ from traditional databases by the type of unstructured data stored in them, new information retrieval methods have been introduced.

There are two main approaches for image retrieval: the text or concept based approach and the content based approach. The text based approach depends on manual indexing and the quality of the tagged keywords and annotations that describe the images for retrieval purposes. However, the annotation based method can be considered an infeasible retrieval method for many reasons: the manual annotation process is time consuming, tedious, subjective, and incomplete. Moreover, the assigned keywords may not describe the image properly, since, for example, different keywords can describe a certain image while at the same time a single keyword can describe multiple images [1], i.e., a single keyword can have different semantic meanings. All these factors indicate that fixed keyword and feature engineering is not suitable for image retrieval, especially for large scale image databases.

On the other hand, the second approach, Content Based Image Retrieval (CBIR), overcomes these limitations and improves the retrieval performance by searching the images based on their visual contents, represented by low and middle level features such as color, texture and shape, and then comparing the similarities of some of these features between the images in the database and the query image. Determining the similarities and the proper features that best describe the image is often relative. Therefore, this raises the famous semantic gap problem, formed between the low level visual features of the images, represented normally by their intensity or pixel values, and the high level of human perception [2].

With the advances in machine learning methods, retrieval methods based on them succeed in outperforming the traditional retrieval methods that are based only on image indexing and keyword tagging, especially when searching a large database for a match of the requested image query. However, the machine learning approaches have limited performance, because in order to be successful they must be combined with supervised learning that requires a labeled dataset for the training process; these labeled data consist of pairs of inputs and labels, identifying the correct output for each input, that must be manually extracted by a human domain expert. For a large database containing millions of images, such fixed feature engineering becomes infeasible, and consequently the same limitation of the text based approaches reappears.

The recent revolutions in computer vision and image recognition, thanks to the deep learning breakthrough in 2006 [3], make deep learning seem a potential bridge over this gap for retrieving images, because it has the ability to process raw data and build an internal feature representation of it through its multiple nonlinear layers of abstraction, eventually providing a high conceptual representation of the image. In other words, deep learning has the capability to learn the image semantic representation through its training phase. Therefore, a deep learning based model for content based image retrieval is proposed in this paper.

Any content based image retrieval system is concerned with achieving two goals: being able to recognize the existence of the query image in the image database [4], and retrieving the most similar images to it (not the images of the most probable classes) [5] through the multidimensional extracted feature vector. This feature vector is extracted from



each image in the database and is considered a substitute for the images when a search query is conducted.

This paper proposes a design for an image retrieval framework that combines a fusion of different high level feature representations of the image, obtained by exploring the features extracted from different layers of a deep neural model, and combines them into a single feature vector in order to achieve higher retrieval efficiency and to increase the degree of similarity of the retrieved images as a response to a query image. Our system does not rely on any human crafted features; instead it learns the features directly from a convolutional neural network model, the AlexNet [6] deep model.

This paper is organized as follows: section 2 gives the literature review of recent content based image retrieval methods, section 3 gives an overview of the deep learning concept in general and the convolutional neural network model in particular, the proposed methodology is discussed in section 4, the experimental setup and discussion are outlined in section 5, and finally the conclusion is presented in section 6.

II. LITERATURE REVIEW
A variety of image retrieval methods that use low level feature descriptors have been proposed for image representation. They can be divided into global features, such as color features [7], edge features [8], texture features [9], GIST and CENTRIST, and into local feature representations, such as the bag of words model [10] that uses local feature descriptors (SIFT [11], SURF [12]).

Tunga et al. proposed an image retrieval approach based on a machine learning algorithm that treats the image categories as semantic concepts of the images [13]. It predicts the category of the query image and, instead of computing the similarity between the query image and all the images in the database, computes it only between the query image and the images belonging to the query image category.

A mapping learning scheme for large scale image applications is proposed by Singh [14] that maps from high dimensional data to binary codes that preserve the semantic similarity. Kumar et al. proposed a CBIR system based on SIFT and ORB, K-Means clustering and LPP dimensionality reduction methods [15].

A two-layer codebook features based image retrieval method is proposed by Liu et al. [16] that represents a fusion of both high and low level features, where the high level features are extracted from the GoogLeNet deep convolutional network that captures the human perception, while the low-level features such as texture and color are generated from Dot-Diffused Block Truncation Coding (DDBTC).

Saritha et al. proposed a content-based image retrieval (CBIR) framework [17] that uses a deep belief network (DBN) to learn effective feature representations of images. It introduced a multi-feature image retrieval method by combining the features of color histogram, edge, edge directions, etc. These features are extracted and stored as small signature files; similar images should have similar signatures. These signatures are compared with the content based signature. During the similarity measure, the distances between the different features are measured, and appropriate weights are applied to normalize the distance coefficients.

A framework of deep learning with application to CBIR tasks, using a convolutional neural network to learn effective feature representations of images, is introduced by Wan et al. [18]. It introduced three schemes: the first is a direct feature representation from the fully connected layers of a Convolutional Neural Network (CNN); the second scheme refines the similarity learning by considering the relationship of instances belonging to the same class as relevant and those belonging to different classes as irrelevant; finally, the third scheme is refining by retraining a pretrained model on a new dataset.

Wang et al. proposed a CBIR model that is also based on a CNN [19]; it presented a thorough analysis of the feature representations extracted from the different layers of a CNN. They evaluated their study based on the pretrained AlexNet model and the IMAGENET 2012 dataset, and they state the following observations: 1) the features extracted from the fc4096a and fc4096b fully connected layers of the examined model perform well, i.e., the layers that directly follow the stack of convolutional and pooling layers perform the best on datasets from unseen categories and have the better generalization ability (they trained the network on their new dataset); 2) the cosine similarity is better than the Euclidean similarity.

A new Deep Supervised Hashing (DSH) method is proposed by Liu et al. [20] to learn compact binary codes for highly efficient image retrieval on large-scale datasets with the help of a convolutional neural network.

III. DEEP LEARNING: OVERVIEW
Deep learning is a branch of artificial intelligence (AI) that is inspired by the way the human brain works, i.e., it tries to simulate the human brain's ways of reasoning, prediction and flexible thinking when learning something for the first time. Humans tend to organize their ideas and concepts in a hierarchy: they begin by learning simple concepts first, and then combine a number of these simple concepts to form a complex one. Similarly, the deep learning network is fed with examples through the training data; initially the network does not do well, and then it adjusts and fine tunes its parameters through its hierarchical structure until it reaches the required state of proficiency. Deep learning processes the data through multiple nonlinear structures that begin by modeling simple patterns in the image, such as edges, and then move gradually through the layers until a higher conceptual representation of the data is finally achieved by exploring its underlying structure without human involvement.

The Convolutional Neural Network (CNN) is the most popular and widely used deep network, especially in the computer vision field, since it has a large learning capability due to its stacked layer structure. It has succeeded in tasks that were previously beyond reach, such as image classification, video classification, object recognition and image captioning. The CNN has a hierarchical structure of feature maps and consists of three kinds of layers: the convolutional layer, the pooling or subsampling layer, and the fully connected layer. The convolutional layer is the core and most important layer of a CNN; the convolution term refers to the process of filtering through the image for a specific pattern. It accepts the data from the input layer, then puts this data, which is mainly an image, into a set of convolutional filters, each of which sweeps over a certain slice of the image.
Each set of filters activates certain features from the image, i.e., searches for a specific pattern in the entire image. When a match is found, it is mapped into a feature space identical to the particular feature searched by the filter, creating a feature map for each pattern searched. Thus, the number of the required visual elements that must be found in the image determines the number of feature maps, which are organized in a hierarchical structure.

After the convolutional layer, the input is passed through a nonlinear transformation in the pooling layer to reduce the dimensionality of the feature maps and to shrink the number of parameters the network needs to learn, and consequently reduce the required storage and processing resources. The pooling layer focuses on detecting the most relevant pattern discovered by the convolutional layer by applying either the maximum or the average subsampling operation on each of the regions of the feature map, one at a time, as done in the convolutional layer. As a result, the local features are aggregated to identify more complex features, and the location of the strongest correlation of each feature is preserved. This can induce a certain amount of translation invariance of the features.

The final layer is the fully connected layers, which form a regular neural network added as a final stage after the alternating structure of convolutional and pooling layers. They aim to classify the discovered complex patterns by matching them against labels in order to define one correct label for each pattern, giving them more meaning.

In a deep CNN model, the alternation of the convolution and pooling layers is repeated many times, depending on the image size and pattern complexity. CNNs came to prominence in 2012, when Krizhevsky et al. used them to achieve outstanding, super-human recognition results in the ImageNet recognition competition [6]; their proposed architecture is shown in Fig. 1. It used multiple feature maps per convolutional layer, in addition to multiple convolutional layers, to capture the patterns at different levels, with the higher level layers taking the feature maps from the lower-level layers as input.
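To make the alternating structure described above concrete, the following is a minimal sketch of a toy network with two convolution/pooling stages followed by fully connected layers. PyTorch is assumed here; the paper does not name an implementation framework, and the layer sizes are illustrative only.

```python
# Toy CNN sketch: convolution -> pooling repeated, then fully connected layers.
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),   # feature maps for simple patterns (edges)
            nn.ReLU(),
            nn.MaxPool2d(2),                              # pooling keeps the strongest responses
            nn.Conv2d(16, 32, kernel_size=3, padding=1),  # higher-level patterns built on lower maps
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 8 * 8, 128),                   # fully connected layers match patterns to labels
            nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

# A 32x32 RGB image passes through two conv/pool stages (spatial size 32 -> 16 -> 8).
out = TinyCNN()(torch.randn(1, 3, 32, 32))
print(out.shape)  # torch.Size([1, 10])
```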
IV. THE PROPOSED METHODOLOGY
Deep learning has an important property represented by its ability to transfer knowledge to new data by taking advantage of already pretrained deep models built by experts, such as AlexNet [6]. The use of a pretrained network works around the requirement of large datasets for training a deep model. The higher layers in the model capture the high level features that may change from one dataset to another, while the low level features captured by the lower layers are almost the same. The features extracted in the upper layers of the CNN can therefore be a good descriptor for image retrieval [5].

CBIR requires the extraction of features from images, then comparing and ranking based on the similarity between image features. Enhancing the learning of effective feature representations and improving the similarity measures are crucial requirements for a good content based image retrieval system. Image retrieval using only a single feature may be inefficient: it may either retrieve images that are not similar to the query image or fail to retrieve similar images [4]. Therefore, the proposed approach combines features to achieve higher retrieval efficiency.

Fig. 1. AlexNet Architecture [21].

A. The used deep model
AlexNet is a large deep convolutional neural network that was used to classify the 1.2 million high-resolution images of the LSVRC-2010 ImageNet contest into 1000 different classes. It consists of 60 million parameters and 650,000 neurons, with five convolutional layers, followed by max-pooling layers, and three globally connected layers with a final 1000-way softmax, as shown in Fig. 1. A variant of this model was used in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%. The fully connected layers can be considered the characteristic representation of the image for the CBIR task [19].

FC6: The first fully connected layer of AlexNet, the first one that follows the alternation of the convolutional and pooling layers. The dimension of the feature vector extracted from this layer is 4096.
FC7: The second fully connected layer, with a feature vector of size 4096.
FC8: The last fully connected layer, the one directly before the output layer; the feature vector size is 1000.
The query image: The image to be searched for in the image database, whether the same image is present or not, and however many similar images exist in the database.

A search for the best layers to combine and the best weight factors to maximize the efficiency for the CBIR task was made in this paper. After some trial and error, we found that the two fully connected layers of the AlexNet model named FC6 and FC8 are the best pair, and the final combined feature vector of these two layers' features is computed using Equation (1) below.

combined feature vector = tanh(FC6(end-1001, end)) + sinh(FC8)    (1)

The fully connected layers FC6, FC7 and FC8 process the features resulting from the convolution and pooling layers by flattening them into a one dimensional vector that reflects these feature probabilities. Afterwards, using backpropagation, appropriate weights and different levels of abstraction are constructed according to the layer order. The combination of the FC6 and FC8 layers is done to explore the abstraction obtained directly after the convolutional layers together with the abstraction right before the output layer.
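The following is a minimal sketch of how the FC6 and FC8 activations can be captured from a pretrained AlexNet and combined as in Equation (1). PyTorch and torchvision are assumed (the paper only states that a pretrained AlexNet is used), the input size and normalization constants are those commonly used with the torchvision model, and the tail of FC6 is read here as its last 1000 components so that it can be added element-wise to the 1000-dimensional FC8 vector.

```python
# Sketch: extract FC6/FC8 activations from a pretrained AlexNet and combine them (Eq. 1).
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

model = models.alexnet(pretrained=True).eval()
activations = {}

# Forward hooks capture the outputs of the FC6 and FC8 layers during a normal forward pass.
model.classifier[1].register_forward_hook(lambda m, i, o: activations.update(fc6=o.detach()))
model.classifier[6].register_forward_hook(lambda m, i, o: activations.update(fc8=o.detach()))

preprocess = T.Compose([
    T.Resize((224, 224)),          # resize every database/query image to the model input size
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def combined_feature_vector(path):
    """Equation (1): tanh of the tail of FC6 plus sinh of FC8 (a 1000-D descriptor)."""
    x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        model(x)
    fc6_tail = activations["fc6"][0, -1000:]   # assumed reading of FC6(end-1001, end)
    return torch.tanh(fc6_tail) + torch.sinh(activations["fc8"][0])

# vec = combined_feature_vector("database/horse_01.jpg")   # hypothetical file name
```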
Since we combine the features from different fully connected layers of the AlexNet model, as depicted in Equation (1), appropriate indirect weights are considered to combine the features when calculating the similarity measures between the query image and the images in the database. Below are the design steps of the proposed CBIR system, and Fig. 2 shows a flowchart that represents the proposed approach (a short preprocessing sketch is given after the flowchart caption below).

B. Design Steps of the Proposed CBIR System
1. Preparing Phase: The database of images undergoes the first phase of preparing and collecting the features; this is done for each image in the database. The preprocessing consists of the following steps:
A) The database images must be preprocessed in order to suit the network model input, by either cropping the image to the correct size or resizing it. In the proposed approach, each image in the database is resized to the proper size that the used model accepts at its input layer. To improve the model performance and to avoid overfitting in the deep model, any images with a size smaller than the size the input layer accepts are excluded from the database, which therefore increases the chances of successful retrieval.
B) The images are then fed to the deep model.
C) Two new features for each image are extracted from the FC6 and FC8 layers of the deep model.
D) A weighted combination of the two vectors is performed, in which only part of the higher dimensional layer, i.e., FC6, is taken and combined with the FC8 feature vector according to Equation (1). This is done to increase the efficiency of the features extracted from the last fully connected layers in retrieving the relevant images by introducing partial support of the FC6 layer. Note that each of these layers learns a different abstraction of the image.

2. Searching Phase: The second phase is the search for identical or similar images to the query image. The main task of the CBIR system is to find N exact matches or similar images to that query. The query image undergoes the same feature extraction procedure described in phase one above.

3. Similarity and Ranking Phase: The similarity measurement step captures the semantic similarity between the combined feature vectors obtained from the images in the database and the combined feature vector obtained from the query image. Then all the images in the database are ranked, and the top N images are retrieved.
Fig. 2. Flowchart of the Proposed CBIR.
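A minimal sketch of the preparing phase follows, reusing the combined_feature_vector function from the earlier sketch. The folder name and minimum size are illustrative assumptions; the key point is that undersized images are skipped and every remaining image is mapped to its combined FC6/FC8 descriptor.

```python
# Sketch of the preparing phase: filter, extract and store the combined feature vectors.
import os
import torch
from PIL import Image

MIN_SIZE = 224          # assumed input size of the used model
DB_DIR = "image_db"     # hypothetical database folder

def build_feature_index(db_dir=DB_DIR):
    index = {}
    for name in sorted(os.listdir(db_dir)):
        path = os.path.join(db_dir, name)
        with Image.open(path) as img:
            if min(img.size) < MIN_SIZE:       # step A: exclude undersized images
                continue
        index[name] = combined_feature_vector(path)   # steps B-D: extract and combine FC6/FC8
    return index

# feature_index = build_feature_index()
# torch.save(feature_index, "feature_index.pt")       # persisted so later queries reuse it
```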


V. DISCUSSION AND EXPERIMENTAL RESULTS
The performance is judged by the quality of the retrieved images, i.e., their counts and how relevant they are to the query image. This is measured using the precision and recall metrics.

A. Image Dataset
The dataset consists of a collection of 600 images in 20 different categories: Horses, Bears, Buses, Cars, Sport Cars, Cats, Dogs, Ducks, Flowers, Roses, Boathouses, Guitars, Old Castle, Owl, Pepper, Sailboats, Sheep, Sunset, Tiger and Tomatoes. Each of these categories contains 30 images, taken from the ImageNet 2012 dataset [22]. A set of 15 query images has been applied, as shown in Fig. 3, to test the efficiency of our system; similar categories have been selected, such as cars vs. sport cars. The top 30 images are retrieved for each query image. The AlexNet network was not trained on this dataset before.
B. The Similarity Measures
The similarity measure between the feature representation of the query image and the feature representations of the images in the dataset is a critical step in assessing the overall performance of a CBIR system. Based on the recommendation of [19], we used the cosine similarity as the measure in the experiments.
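A minimal sketch of the similarity and ranking step follows, assuming the feature_index dictionary built in the earlier sketches: the cosine similarity is computed between the query descriptor and every stored descriptor, and the top N images are returned.

```python
# Sketch: cosine-similarity ranking of the database images against a query image.
import torch
import torch.nn.functional as F

def retrieve(query_vec, feature_index, top_n=30):
    scores = {
        name: F.cosine_similarity(query_vec, vec, dim=0).item()
        for name, vec in feature_index.items()
    }
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_n]

# top30 = retrieve(combined_feature_vector("queries/tiger.jpg"), feature_index)  # hypothetical paths
```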

C. Performance Metrics
The retrieval performance of a content based image retrieval system depends mainly on the feature representations and the similarity measurements. The main aim is to design an image retrieval system that is efficient and effective [13] by fulfilling two requirements: speed and precision. The quality of the retrieval and how relevant it is to the query image is measured through the precision and recall values. A higher value of precision and recall indicates a better image retrieval result, meaning the set of returned images is more preferable to the user.

For performance evaluation, we used five metrics to evaluate the proposed scheme; they are listed below. The results of these metrics for each of the query images used in the experiment are presented in Tables 1 to 15, in which the precision and recall values at ranks 1, 5, 10, 15, 20, 25 and 30 are listed, in addition to the number of retrieved images that are similar to the query image and the average precision (AP) for each query image. In addition, we compared the results of the proposed approach with the results obtained from image retrieval based on the features extracted from the layers FC6 and FC8 that are used to construct the combined vector.

- The precision at a particular rank (P@K): measures the ability of the system to retrieve only images that are relevant when the number of retrieved images is k.
Precision = # relevant images retrieved / # total of images retrieved    (2)
- The recall at a particular rank (R@K): measures the ability of the system to retrieve all the images that are relevant when the number of retrieved images is k.
Recall = # relevant images retrieved / # total of relevant images    (3)
- The average precision (AP): averages the precision values at the rank positions where a relevant image is retrieved.
- The number of relevant retrieved images.
- The mean average precision (MAP): measured by computing the mean of the average precision of each query image.

Fig. 3. The query images that are used in the experiments.

Fig. 4 shows the mean average precision values of the system when the features are extracted from different layers, such as FC6 and FC8, and from the combined vector used in the proposed approach. As can be noted, the proposed scheme enhanced the results compared with the single-feature based systems, which proves the effectiveness of this approach.

Fig. 4. MAP of CBIR using the features obtained from each layer and the proposed approach.
Table 1. Query Image Tiger: Retrieval Performance on the Two Fully Connected Layers and on the Proposed Approach.
Feature extraction Evaluation # relevant retrieved/total Query
@1 @5 @10 @15 @20 @25 @30
method metrics retrieved AP
Precision 1 1 1 1 0.950 0.920 0.866
FC6: 4096 – D 26/30 0.9705
Recall 0.033 0.166 0.333 0.500 0.633 0.766 0.866
Precision 1 1 1 1 1 0.920 0.900
FC8: 1000 – D 27/30 0.9826
Recall 0.033 0.166 0.333 0.500 0.666 0.766 0.900
Proposed approach Precision 1 1 1 1 1 1 1
30/30 1
1000 - D Recall 0.033 0.166 0.333 0.500 0.666 0.833 1

Table 2. Query Image sailboat1: Retrieval Performance on the Two Fully Connected Layers and on the Proposed Approach.
Feature extraction Evaluation # relevant retrieved/total Query
@1 @5 @10 @15 @20 @25 @30
method metrics retrieved AP
Precision 1 1 1 0.866 0.750 0.680 0.666
FC6: 4096 – D 20/30 0.9048
Recall 0.033 0.166 0.333 0.433 0.500 0.566 0.666

Precision 1 1 1 1 0.900 0.880 0.833
FC8: 1000 – D 25/30 0.9544
Recall 0.033 0.166 0.333 0.500 0.600 0.733 0.833
Proposed approach Precision 1 1 1 1 1 1 1
30/30 1
1000 - D Recall 0.033 0.166 0.333 0.500 0.666 0.833 1

Table 3. Query Image sailboat 2: Retrieval Performance on the Two Fully Connected Layers and on the Proposed Approach.
Feature extraction Evaluation # relevant retrieved/total Query
@1 @5 @10 @15 @20 @25 @30
method metrics retrieved AP
Precision 1 1 0.900 0.866 0.650 0.520 0.433
FC6: 4096 – D 13/30 0.9521
Recall 0.033 0.166 0.300 0.433 0.433 0.433 0.433
Precision 1 0.800 0.800 0.733 0.700 0.640 0.600
FC8: 1000 – D 18/30 0.7913
Recall 0.033 0.133 0.266 0.366 0.466 0.533 0.600
Proposed approach Precision 1 0.600 0.400 0.460 0.550 0.640 0.700
21/30 0.6263
1000 - D Recall 0.033 0.100 0.133 0.233 0.366 0.533 0.700

Table 4. Query Image horse: Retrieval Performance on the Two Fully Connected Layers and on the Proposed Approach.
Feature extraction Evaluation # relevant retrieved/total Query
@1 @5 @10 @15 @20 @25 @30
method metrics retrieved AP
Precision 1 1 1 1 0.950 0.960 0.900
FC6: 4096 – D 27/30 0.9722
Recall 0.033 0.166 0.330 0.500 0.633 0.833 0.933
Precision 1 1 1 1 1 0.92 0.866
FC8: 1000 – D 26/30 0.9864
Recall 0.033 0.166 0.333 0.500 0.666 0.766 0.866
Proposed approach Precision 1 1 1 1 1 1 0.966
29/30 0.9976
1000 - D Recall 0.033 0.166 0.333 0.500 0.666 0.833 0.966

Table 5. Query Image sunset: Retrieval Performance on the Two Fully Connected Layers and on the Proposed Approach.
Feature extraction Evaluation # relevant retrieved/total Query
@1 @5 @10 @15 @20 @25 @30
method metrics retrieved AP
Precision 1 1 0.900 0.933 0.900 0.760 0.633
FC6: 4096 – D 19/30 0.9569
Recall 0.033 0.166 0.300 0.466 0.600 0.633 0.633
Precision 1 1 1 0.866 0.800 0.800 0.733
FC8: 1000 – D 22/30 0.9560
Recall 0.033 0.166 0.333 0.433 0.533 0.666 0.733
Proposed approach Precision 1 1 1 1 0.850 0.880 0.800
24/30 0.9606
1000 - D Recall 0.033 0.166 0.333 0.500 0.566 0.733 0.800

Table 6. Query Image cat: Retrieval Performance on the Two Fully Connected Layers and on the Proposed Approach.
Feature extraction Evaluation # relevant retrieved/total Query
@1 @5 @10 @15 @20 @25 @30
method metrics retrieved AP
Precision 1 1 0.900 0.866 0.750 0.640 0.600
FC6: 4096 – D 18/30 0.9001
Recall 0.033 0.166 0.300 0.433 0.500 0.533 0.600
Precision 1 1 0.900 0.666 0.550 0.440 0.400
FC8: 1000 – D 12/30 0.8935
Recall 0.033 0.166 0.300 0.333 0.366 0.366 0.400
Proposed approach Precision 1 1 0.900 0.866 0.750 0.640 0.533
16/30 0.9240
1000 - D Recall 0.033 0.166 0.300 0.433 0.500 0.533 0.533

Table 7. Query Image guitar: Retrieval Performance on the Two Fully Connected Layers and on the Proposed Approach.
Feature extraction Evaluation # relevant retrieved/total Query
@1 @5 @10 @15 @20 @25 @30
method metrics retrieved AP
Precision 1 1 1 1 0.950 0.800 0.666
FC6: 4096 – D 20/30 0.9909
Recall 0.033 0.166 0.333 0.500 0.633 0.666 0.666
Precision 1 1 1 1 1 1 0.900
FC8: 1000 – D 27/30 0.9960
Recall 0.033 0.166 0.333 0.500 0.666 0.833 0.900
proposed approach Precision 1 1 1 1 1 1 1
30/30 1
1000 - D Recall 0.033 0.166 0.333 0.500 0.666 0.833 1

Table 8. Query Image brown bear: Retrieval Performance on the Two Fully Connected Layers and on the Proposed Approach.
Feature extraction Evaluation # relevant retrieved/total Query
@1 @5 @10 @15 @20 @25 @30
method metrics retrieved AP
Precision 1 1 1 1 0.9 0.92 0.833
FC6: 4096 – D 25/30 0.9720
Recall 0.033 0.166 0.333 0.500 0.600 0.766 0.833
Precision 1 1 0.900 0.866 0.900 0.88 0.800
FC8: 1000 – D 24/30 0.9097
Recall 0.033 0.166 0.300 0.433 0.600 0.733 0.800
Proposed approach Precision 1 1 1 1 1 0.96 0.866
26/30 0.9891
1000 - D Recall 0.033 0.166 0.333 0.500 0.666 0.800 0.866

Table 9. Query Image Owl 1: Retrieval Performance on the Two Fully Connected Layers and on the Proposed Approach.
Feature extraction Evaluation # relevant retrieved/total Query
@1 @5 @10 @15 @20 @25 @30
method metrics retrieved AP
Precision 1 1 1 1 0.9 0.8 0.766
FC6: 4096 – D 23/30 0.9488
Recall 0.033 0.166 0.333 0.500 0.600 0.666 0.766
Precision 1 1 0.9 0.933 0.9 0.88 0.8
FC8: 1000 – D 24/30 0.9294
Recall 0.033 0.166 0.300 0.466 0.600 0.733 0.800
Proposed approach Precision 1 1 1 1 0.95 0.88 0.833
25/30 0.9749
1000 – D Recall 0.033 0.166 0.333 0.500 0.633 0.733 0.833

Table 10. Query Image Owl 2: Retrieval Performance on the Two Fully Connected Layers and on the Proposed Approach.
Feature extraction Evaluation # relevant retrieved/total Query
@1 @5 @10 @15 @20 @25 @30
method metrics retrieved AP
Precision 1 0.400 0.200 0.133 0.100 0.080 0.066
FC6: 4096 – D 2/30 0.8335
Recall 0.033 0.066 0.066 0.066 0.066 0.066 0.066
Precision 1 0.600 0.500 0.466 0.400 0.400 0.366
FC8: 1000 – D 11/30 0.5992
Recall 0.033 0.100 0.166 0.233 0.266 0.333 0.366
Proposed approach Precision 1 1 0.700 0.600 0.55 0.56 0.600
18/30 0.7636
1000 - D Recall 0.033 0.166 0.233 0.300 0.366 0.466 0.600

Table 11: Query Image Color Duck: Retrieval Performance on the Two Fully Connected Layers and on the Proposed Approach.
Feature extraction Evaluation # of relevant retrieved/total Query
@1 @5 @10 @15 @20 @25 @30
method metrics retrieved AP
Precision 1 1 1 0.800 0.750 0.720 0.633
FC6: 4096 – D 19/30 0.8983
Recall 0.033 0.166 0.333 0.400 0.500 0.600 0.633
Precision 1 1 1 1 1 1 0.866
FC8: 1000 – D 26/30 0.9972
Recall 0.033 0.166 0.333 0.500 0.666 0.833 0.866
Proposed approach Precision 1 1 1 1 1 1 0.966
29/30 0.9988
1000 - D Recall 0.033 0.166 0.333 0.500 0.666 0.833 0.966

Table 12. Query Image Mix Pepper: Retrieval Performance on the Two Fully Connected Layers and on the Proposed Approach.
Feature extraction Evaluation # relevant retrieved/total Query
@1 @5 @10 @15 @20 @25 @30
method metrics retrieved AP
Precision 1 1 0.700 0.733 0.650 0.640 0.566
FC6: 4096 – D 17/30 0.8029
Recall 0.033 0.166 0.233 0.366 0.433 0.533 0.566
Precision 1 1 0.800 0.666 0.550 0.520 0.566
FC8: 1000 – D 17/30 0.7660
Recall 0.033 0.166 0.266 0.333 0.366 0.433 0.566
proposed approach Precision 1 1 0.800 0.866 0.850 0.840 0.766
23/30 0.8742
1000 – D Recall 0.033 0.166 0.266 0.433 0.566 0.700 0.766

Table 13. Query Image Red Pepper: Retrieval Performance on the Two Fully Connected Layers and on the Proposed Approach.
Feature extraction Evaluation # relevant retrieved/total Query
@1 @5 @10 @15 @20 @25 @30
method metrics retrieved AP
Precision 1 0.800 0.700 0.666 0.650 0.680 0.666
FC6: 4096 – D 20/30 0.7384
Recall 0.033 0.133 0.233 0.333 0.433 0.566 0.666
Precision 1 0.800 0.900 0.866 0.750 0.720 0.700
FC8: 1000 – D 21/30 0.8436
Recall 0.033 0.133 0.300 0.433 0.500 0.600 0.700
Proposed approach Precision 1 0.800 0.700 0.800 0.850 0.800 0.766
23/30 0.8278
1000 - D Recall 0.033 0.133 0.233 0.400 0.566 0.666 0.766

Table 14. Query Image sheep 1: Retrieval Performance on the Two Fully Connected Layers and on the Proposed Approach.
Feature extraction Evaluation # relevant retrieved/total Query
@1 @5 @10 @15 @20 @25 @30
method metrics retrieved AP
Precision 1 0.800 0.800 0.733 0.65 0.68 0.633
FC6: 4096 – D 19/30 0.7668
Recall 0.033 0.133 0.266 0.366 0.433 0.566 0.633
Precision 1 1 0.900 0.933 0.95 0.800 0.800
FC8: 1000 – D 24/30 0.9311
Recall 0.033 0.166 0.300 0.466 0.633 0.666 0.800
proposed approach Precision 1 1 1 1 1 1 0.900
27/30 0.9986
1000 - D Recall 0.033 0.166 0.333 0.5 0.666 0.833 0.900

Table 15. Query Image sheep 2: Retrieval Performance on the Two Fully Connected Layers and on the Proposed Approach.
Feature extraction Evaluation # of relevant retrieved/total Query
@1 @5 @10 @15 @20 @25 @30
method metrics retrieved AP
Precision 1 1 0.900 0.666 0.700 0.720 0.633
FC6: 4096 – D 19/30 0.8452
Recall 0.033 0.166 0.300 0.333 0.466 0.600 0.633
Precision 1 1 1 1 0.850 0.760 0.666
FC8: 1000 – D 20/30 0.9548
Recall 0.033 0.166 0.333 0.500 0.566 0.633 0.666
proposed approach Precision 1 1 1 1 1 1 0.900
27/30 0.9986
1000 - D Recall 0.033 0.166 0.333 0.500 0.666 0.8333 0.900

VI. CONCLUSION AND FUTURE WORK
In this paper a direct approach was used for extracting the image features, using a combination of the features extracted from the fully connected layers of a pretrained AlexNet model. A single feature is not always guaranteed to give accurate results and the best performance; thus, for high accuracy retrieval, we combined the feature vectors extracted from two fully connected layers into a single one. The experimental results demonstrated the effectiveness of the proposed approach by either increasing the number of relevant retrieved images or increasing the mean average precision. A further enhancement can be made to improve the results by fine-tuning the network, modifying the last layer and training the model again on a new dataset that would be the image database used for retrieving the relevant images. Moreover, an extra technique for dimensionality reduction can be deployed in the proposed framework to reduce the size of the extracted feature vector.

AlexNet is a simple yet powerful deep learning model that is considered the base of the computer vision revolution.
Other alternatives such as ResNet [23] succeed in outperforming AlexNet by achieving a lower error rate: a 3.57% error rate using an ensemble of residual nets on ImageNet, while AlexNet achieved a 15.3% error rate. Nevertheless, the complexity also rises: AlexNet has just eight layers, while ResNet may contain up to 152 layers with residual connections. Therefore, to prove our idea we used the simpler model; however, the effects of ResNet and other recent deep models on image retrieval can be explored as a future extension of this work.
REFERENCES
[1] Li, T., Mei, T., Yan, S., Kweon, I.S. and Lee, C., 2009, June. Contextual decomposition of multi-label images. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on (pp. 2270-2277). IEEE.
[2] Saritha, R.R., Paul, V. and Kumar, P.G., 2018. Content based image retrieval using deep learning process. Cluster Computing, pp.1-14.
[3] Hinton, G.E., Osindero, S. and Teh, Y.W., 2006. A fast learning algorithm for deep belief nets. Neural computation, 18(7), pp.1527-1554.
[4] Shereena, V.B. and David, J.M., 2014. Content Based Image Retrieval: A Review. In Computer Science & Information Technology, Computer Science Conference Proceedings (CSCP) (pp. 65-77).
[5] Piras, L. and Giacinto, G., 2017. Information fusion in content based image retrieval: A comprehensive overview. Information Fusion, 37, pp.50-60.
[6] Krizhevsky, A., Sutskever, I. and Hinton, G.E., 2012. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems (pp. 1097-1105).
[7] Hiremath, P.S. and Pujari, J., 2007, December. Content based image retrieval using color, texture and shape features. In Advanced Computing and Communications, 2007. ADCOM 2007. International Conference on (pp. 780-784). IEEE.
[8] Jain, A.K. and Vailaya, A., 1996. Image retrieval using color and shape. Pattern recognition, 29(8), pp.1233-1244.
[9] Islam, M.M., Zhang, D. and Lu, G., 2008, December. Automatic categorization of image regions using dominant color based vector quantization. In Digital Image Computing: Techniques and Applications (pp. 191-198). IEEE.
[10] Avni, U., Greenspan, H., Konen, E., Sharoon, M. and Goldberger, J., 2011. X-ray categorization and retrieval on the organ and pathology level, using patch-based visual words. IEEE Trans. Medical Imaging, 30(3), pp.733-746.
[11] Lowe, D.G., 1999. Object recognition from local scale-invariant features. In Computer vision, 1999. The proceedings of the seventh IEEE international conference on (Vol. 2, pp. 1150-1157).
[12] Bay, H., Tuytelaars, T. and Van Gool, L., 2006, May. Surf: Speeded up robust features. In European conference on computer vision (pp. 404-417). Springer, Berlin, Heidelberg.
[13] Tunga, S., Jayadevappa, D. and Gururaj, C., 2015. A comparative study of content based image retrieval trends and approaches. International Journal of Image Processing (IJIP), 9(3), pp.127-155.
[14] Singh, A.V., 2015. Content-based image retrieval using deep learning. Rochester Institute of Technology.
[15] Kumar, M., Chhabra, P. and Garg, N.K., 2018. An efficient content based image retrieval system using BayesNet and K-NN. Multimedia Tools and Applications, pp.1-14.
[16] Liu, P., Guo, J.M., Wu, C.Y. and Cai, D., 2017. Fusion of deep learning and compressed domain features for content-based image retrieval. IEEE Transactions on Image Processing, 26(12), pp.5706-5717.
[17] Saritha, R.R., Paul, V. and Kumar, P.G., 2018. Content based image retrieval using deep learning process. Cluster Computing, pp.1-14.
[18] Wan, J., Wang, D., Hoi, S.C.H., Wu, P., Zhu, J., Zhang, Y. and Li, J., 2014, November. Deep learning for content-based image retrieval: A comprehensive study. In Proceedings of the 22nd ACM international conference on Multimedia (pp. 157-166). ACM.
[19] Wang, H., Cai, Y., Zhang, Y., Pan, H., Lv, W. and Han, H., 2015, November. Deep learning for image retrieval: What works and what doesn't. In Data Mining Workshop (ICDMW), 2015 IEEE International Conference on (pp. 1576-1583). IEEE.
[20] Liu, H., Wang, R., Shan, S. and Chen, X., 2016. Deep supervised hashing for fast image retrieval. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2064-2072).
[21] Han, X., Zhong, Y., Cao, L. and Zhang, L., 2017. Pre-trained AlexNet architecture with pyramid pooling and supervision for high spatial resolution remote sensing image scene classification. Remote Sensing, 9(8), p.848.
[22] Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M. and Berg, A.C., 2015. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3), pp.211-252.
[23] He, K., Zhang, X., Ren, S. and Sun, J., 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770-778).
Data Analytics and Business Intelligence
Framework for Stock Market Trading
Batool AlArmouty, Salam Fraihat
Computer Science Department, King Hussein School of Computing Sciences, Princess Sumaya University for Technology, Amman, Jordan
armoutib@gmail.com, s.fraihat@psut.edu.jo

Abstract— Business intelligence is an umbrella term that combines architectures, tools, databases, analytical tools, applications, and methodologies. The efficiency of making decisions can increase significantly using business intelligence solutions, by taking advantage of the existing historical or real-time data of the business. Trading in stock markets inherently carries the risk of losing money, and making efficient decisions requires extensive experience in the market. In this paper, we propose a framework that makes use of historical stock price data to help investors make more efficient trading decisions.

Keywords— Business Intelligence, Business Analytics, Data Analytics, Decision Support Systems, Stock Market.

I. INTRODUCTION
Business Intelligence (BI) is defined as a collective term that combines different technologies, applications, and tools used for gathering data from sources, storing, analyzing and visualizing it, with the purpose of helping users make better decisions [1]. In the last few years, data has been increasing rapidly, and with the ease of acquiring and storing this data, organizations have started to leverage it to enhance the decision-making process.

BI's objective is to transform data into information through analysis to meet the business objective of the user [2], by enabling the user to interactively manipulate the data and apply different analyses in the way she/he needs in order to extract information and get valuable insights from the data.

The stock market is a public market with strict regulation for the trading of companies' stocks, where each stock is a share of a company permitted to be traded, called listed company ownership. Investors make money by buying stocks at a lower price and selling them at a higher price; stock prices are determined by the success of the company, supply and demand, and external factors like government regulations.

Investors take risks in determining the best time for selling or buying stocks, which is why they need efficient help to reduce the risks of this decision. In this paper, a business intelligence framework is proposed using historical stock market data, with the objective of analyzing stock market attributes for a collection of companies and enhancing the efficiency of choosing the appropriate time for buying or selling specific company stocks, in order to reduce the risk of losing money.

The proposed framework covers the conversion of the data to useful information satisfying the business objective. The framework contains the processes of acquiring the data from its source, the transformations applied on the data, storing the data into an appropriate data warehouse, the analytics applied on the data after the transformation to extract information, and finally the presentation of the extracted information to the end-user, most likely the decision-maker. The rest of the paper is structured as follows: section II presents the related work, section III explains the requirements of the framework, the proposed architectures are given in section IV, the design in section V, and the business intelligence presentation in section VI; finally, the conclusion is given in section VII.

II. RELATED WORK
A. Stock Market Analysis
Umadevi et al [3] applied analytical techniques on stock market data and tried to design a prediction model. The authors obtained Google, Apple and Microsoft stock prices over six months, with four attributes (low, high, open and close). The analysis applied on the stock market data involves stock scores and candlestick plots to visualize all the parameters.

Alraddadi [4] performed an analysis using the stock price data of the John Wiley & Sons company over one year; the data contains six attributes (open, high, low, closing, and adjusted close). The author applied descriptive statistics to explore the nature of the data, and analysis measures including measures of central tendency and measures of variability. Moreover, they made use of plots, like histograms and time series plots, to fully understand the nature of the data.

Sen et al [5] analyzed the Indian stock market by decomposing the time series data into three components: the trend, the seasonal component, and the random component. The decomposition was done to help understand whether the buys are short-term or long-term and to discover the pattern of the stock trading. Based on this analysis, the months in which the seasonal component plays a major role were discovered, giving an idea about the trends of the stocks. Moreover, the decomposition results were used to forecast the values for 12 months.

Bhoopathi et al [6] proposed a framework to discover the trends in stock trading by finding causal relationships in a stock dataset, in the form of direct, indirect, and exception association rules; the framework also considers the events and government decisions that may influence the stock trading.

B. Business Intelligence
Martin et al [7] proposed a business intelligence framework consisting of Quantitative bankruptcy prediction components, where financial features found using Genetic


Algorithm are applied to predict the business performance system developments, because of their importance in
quantitatively, Qualitative Bankruptcy prediction developing effective systems, and detecting errors in early
components, in which the features are found using expert stages of the development process [12]. In the proposed
analysis and predict the business performance qualitatively framework, the requirements are divided as follows:
using Ant Miner Algorithm and a Customized reporting
A. Business Requirements
where the right information is delivered to the right user in
the requested presentation using Fuzzy Multi-Criteria • Show each company’s stock price and quantity
Decision Support System (FMCDS). trend over time.
Jadi et al [8] suggested a framework for collecting data, • Compare companies’ trends.
as the first step of implementing an e-government business • Predict the better investment in terms of expected
intelligence system. For the purpose of taking benefits of the profit.
immense data collected in enhancing decision making and • Dashboard to perform visual analytics and
effectively enhance public services, the authors took the prediction.
morocco e-government system as a case study for collecting
B. Data Requirements
the data. They suggested three sources of data: First,
government-to-government, where they suggest enabling the 1- Data Source: The data can be collected from
interaction between government departments’ databases. Google Finance.
Second, government-to-business, this source of data 2- Data Acquisition: Historical data of four
depends on the organization's way of storing data, instead of companies’ stock prices will be collected from the
storing each government organization separately, store them source and saved in a place for preparations.
in one database. Third, government-to-citizen, the authors Company names: Apple, Nike, Disney, Microsoft.
proposed an approach for collecting information related to 3- Data Transformation: the collected data should
citizen interaction with government organizations. be prepared for the analysis before storing it, the
Khedr et al [9] proposed a framework of a business preparation consists of:
intelligence system for healthcare analytics. The framework • Data Quality: Data must be assured to have
contains six tiers: First, Data source tier, in this tier, twelve good quality, which is achieved when the data
different sources are proposed to be used in the business embodies the “Five Cs”: clean, consistent,
intelligence system. Second, Extract Transform and Load conformed, current, and comprehensive.
(ETL), the data extracted from the data sources are The data contains missing dates, these dates
integrated and transformed in the staging area, to ensure the are weekends and holidays, in which the
data quality. Third, Data storage tier, this tier composed of market closes, therefore we will not fill them.
two components, data warehouse and three types of data • Feature Selection: Data must be reduced to
mart (Operational data mart, Medical claims data mart, and contain only the features that will be used for
financial data mart). Fourth, Analytics tier, in this tier, the the analysis and prediction.
authors suggested applying diagnostic, descriptive and 4- Data Storage: Data must be stored to be ready for
predictive analytics on the data. Fifth, Optimization tier, the being viewed or used for reports and analytics
results obtained from the analytics tier is modified in this anywhere and at any time.
tier. Finally, Presentation tier, in this tier the result of the
C. Functional Requirements
system is presented visually to the user, which makes it
easier to make decisions. • The system should allow the user to view the trend
Olexova [10] presented a case study by applying BI in of each company for a period chosen by the user
the retail chain, the study was conducted in a sports-fashion (year, season, or month).
multi-brand chain of retail stores. In this study, the BI life • The system should allow the user to show which is
cycle was analyzed, besides evaluating the factors impacting bigger, the opening or the close for each day.
the BI adoption. The main findings of this study are • The system should allow the user to show the trend
considering the most important benefit of BI adoption is for more than one company in the same Fig,
improving decision-making in both speed and quality, and companies chosen by the user.
according to the managers, the customization of the BI • The system should allow the user to view basic
system is the more important factor for a successful BI descriptive statistics (maximum, minimum, first,
adoption. second, and third quantiles) for each attribute of the
company.
• The system should allow the user to show the
III. REQUIREMENTS distribution for each company for specific years or
The system requirements describe the capabilities and months.
functions that will be implemented to satisfy the user’s • The system should allow the user to compute basic
business objectives [11]. Requirements are supposed to statistics (Mean, Standard Deviation, and Range)
define what the system is supposed to do and to be, which for each company in a specific year.
distinguishes each system from its competitors, they range • The system should allow the user to compute basic
from high-level requirements defining the system statistics for each company from the start of the
performance to the very basic functionalities that will be current year.
needed in the system. Defining high-quality requirements is
considered a critical phase in the Business Intelligence
• The system should provide a recommendation to a A. Information Architecture
user based on the prediction of the best action to
take (strongly sell, sell, neutral, buy, strongly buy). Defines the processes needed to transform the data
obtained from the source, to readable information in the BI
D. Non-functional Requirements system (Fig 1).
• Interactive Visualization: User should be able to
interact directly with the BI application to display
the wanted companies, plot, and metric she/he
wants.
• Performance: the dashboard must be fast when
changing companies or functionalities, quick
responses for the user requests, and use a minimum
memory.
• User-friendly: User should not have difficulties
when interacting with the dashboard.
• Portability: User should be able to access the
system using (Windows, Linux, Mac, Android, Fig. 1. Information Architecture.
IOS) operating systems, installed on any hardware.
• Reliability: System should be tested to determine 1- Data Creation: This process is done by the data
the probability of failure, and ensure that the source, where the data is created.
system can handle these failures without disturbing 2- Data Integration: This process is done after
the user. collecting the data from sources, where the data is
• Availability: User should be able to access the integrated, cleaned, and ensure the quality of the
system anytime and anywhere. data.
• Scalability: The system should have the ability to 3- Data Analytics: Analytics is applied to the data
improve, by adding new functionality, or after storing it in the data warehouse.
companies without disturbing existing activities. 4- Business Intelligence Application: The results of
the analytics will be represented to the decision-
E. Technical Requirements makers.
• ETL (Extract, Transform, Load) Data should be B. Data Architecture
extracted from the data sources, transformed in the
proper form, then will be stored in a data Define the processes of data integration,
warehouse. transformation, storage and workflow needed to meet the
• Data Acquisition: Data should be extracted from requirements of the information architecture (see Fig 2).
different sources e.g. the internet, DB server, Excel
File.
• Staging: Data will be temporary stored between
the data source and the data warehouse, to be
transformed and analyzed before storing in the
Data warehouse.
• Data Warehouse: Transformed data will be stored
after the staging in a structured form, where the
information will be retrieved for applying the
requested analytics.
• OLAP (Online Analytics Processing) used for
analyzing the data using the Multidimensional
model.
• Dashboard: is the business intelligence
application, where results of the analytics applied
Fig. 2. Data Architecture.
on the data, will be displayed to the user and allow
the interaction with the dashboard.
1- System of Record (SOR): The authorized data source
for a particular data that can be found in multiple
IV. ARCHITECTURE FRAMEWORK sources. In this framework the data type that will be
used is historical data, the historical data contains the
The proposed business intelligence system is represented stock price at the opening of the market, the price at
through a set of architectures identifying system the closing of the market, the highest and the lowest
components and the relations between them [13]. The prices of the stock at the day, and the volume of the
system requirements defined earlier are interpreted into traded stocks.
structures to meet the business objectives. In our business 2- System of Integration (SOI): The process of
intelligence framework, we provide four types of integrating the data. After the data gathering,
architectures: techniques to ensure the quality of the data, like detect

missing data, and duplication will be applied, then the
relevant features will be selected before storing the
data in the data warehouse.
3- System of Analytics (SOA): The discovery of
meaningful patterns in the data, the requested analysis
will be applied in this process.

C. Technical Architecture
Fig.3 illustrate the techniques that will be used in
implementing the business intelligence system.

Fig 5. Data Schema.


V. DESIGN
The proposed schema is provided in Fig 5. In this
framework we used the dimensional model, with three
dimensions for company, stock and date.
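The following is a minimal sketch of this dimensional model, assuming pandas; the table and column names and the sample values are illustrative, not taken from the paper, and only the company and date dimensions are spelled out, with the daily stock measures kept in the fact table for brevity.

```python
# Sketch of the dimensional (star) model: dimension tables plus a stock fact table.
import pandas as pd

company_dim = pd.DataFrame({
    "company_id": [1, 2, 3, 4],
    "company_name": ["Apple", "Nike", "Disney", "Microsoft"],
})

date_dim = pd.DataFrame({"date": pd.date_range("2019-01-02", periods=3, freq="B")})
date_dim["date_id"] = date_dim.index + 1
date_dim["year"], date_dim["month"] = date_dim["date"].dt.year, date_dim["date"].dt.month

# Fact table: one row per company per trading day, referencing both dimensions (illustrative values).
stock_fact = pd.DataFrame({
    "company_id": [1, 1, 1],
    "date_id":    [1, 2, 3],
    "open":  [154.9, 143.9, 144.5],
    "high":  [158.9, 145.7, 148.6],
    "low":   [154.2, 142.0, 143.8],
    "close": [157.9, 142.2, 148.3],
    "volume": [37039700, 91312200, 58607100],
})

# A typical dashboard query: yearly mean closing price per company.
report = (stock_fact.merge(company_dim).merge(date_dim)
          .groupby(["company_name", "year"])["close"].mean())
print(report)
```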

Fig. 3. Technical Architecture.

The required data is available on the internet, therefore


it will be extracted, transformed, and then loaded in the data
warehouse before OLAP can be applied on the data, finally,
the data will be displayed to the user using the dashboard.
D. Product Architecture
In Fig.4, we define the products that could be used in
implementing the techniques required for the business
intelligence system.

Fig 6. Dashboard prototype.


VI. BUSINESS INTELLIGENCE APPLICATION
The information extracted from the data must be presented to the end-user in an understandable way, using a BI application to display the requested metrics. The application is implemented to help monitor the progress of the business and to empower efficient decision making that improves the business.
Fig. 4. Product Architecture.

The historical stock market data is provided by Google Finance (www.google.com/finance) for free and is extracted using the Python Quandl library (www.quandl.com/tools/python); using this library, the number of years of historical data can be specified, and the transformation stage can then be done in Python. The data warehouse can then be built using SQL Server (www.microsoft.com/en-us/sql-server/sql-server-2017), OLAP can be done using icCube (www.iccube.com), and finally the dashboard can be implemented using Tableau (www.tableau.com).

In the proposed framework, we suggest using a dashboard, which is a data visualization tool that displays several metrics on the same page, allowing the user to compare the results of different metrics. For the obtained data, the dashboard will contain plots of the companies' different attributes: a line plot to show the trend of the company stocks and a boxplot to show the quantiles and make it easier to spot outliers, in addition to the presentation of the standard deviation and mean for each year. The dashboard will also have the ability to compare more than one company and year together in the same plot. A prototype of the proposed dashboard is presented in Fig 6.
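A minimal sketch of the acquisition and transformation stages follows, assuming the quandl and pandas Python packages; the dataset code "WIKI/AAPL", the API key placeholder and the two-year window are illustrative, since the paper only states that the data is pulled through the Quandl library.

```python
# Sketch: acquire daily prices via quandl, select the needed attributes, and compute
# the per-year statistics that the dashboard displays.
import quandl
import pandas as pd

quandl.ApiConfig.api_key = "YOUR_API_KEY"   # hypothetical key

def load_prices(dataset_code="WIKI/AAPL", years=2):
    """Acquire the last `years` years of daily prices for one company."""
    end = pd.Timestamp.today().normalize()
    start = end - pd.DateOffset(years=years)
    df = quandl.get(dataset_code, start_date=str(start.date()), end_date=str(end.date()))
    # Feature selection: keep only the attributes used by the dashboard.
    df = df[["Open", "High", "Low", "Close", "Volume"]]
    # Missing dates are weekends/holidays when the market is closed, so they are not filled.
    return df

prices = load_prices()
# Dashboard metrics: mean, standard deviation and range of the closing price per year.
yearly = prices["Close"].groupby(prices.index.year).agg(["mean", "std", lambda s: s.max() - s.min()])
print(yearly)
```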
VII. CONCLUSION
Business Intelligence plays an important role in the success and survival of a business. Nowadays it has become easier to apply business intelligence because of the ease of collecting data using the internet. In this work, we proposed a framework that makes use of the historical
stock market of companies, to help the investors in making
future trading decisions. The framework proposed the
techniques and tools to collect data, transform, store,
analyze and present them to the end-user, in our case the
investor.
REFERENCES
[1] Azeroual O., Theel H., “The Effects of Using Business
Intelligence Systems on an Excellence Management and
Decision-Making Process by Start-Up Companies: A Case
Study”, International Journal of Management Science and
Business Administration, 2018, pp.30-40.
[2] Chang V., Larson D., “A Review and Future Direction
of Agile, Business Intelligence, Analytics and Data
Science”, International Journal of Information Management,
2016, pp.700-710.
[3] Gaonka A., Kulkarni R. et al, “Analysis of Stock Market
using Streaming Data Framework”, International
Conference on Advances in Computing, Communications
and Informatics, 2018, pp.1388-1390.
[4] Alraddadi R., “Statistical Analysis of Stock Prices in
John Wiley & Sons”, Journal of Emerging Trends in
Computing and Information Sciences, 2015, pp. 38-47.
[5] Sen J. and Chaudhuri T., “A framework for Predictive
Analysis of Stock Market Indices – A Study of the Indian
Auto Sector”, arXiv, 2015, pp. 1-19.
[6] Bhoopathi H. and Rama B., “A Novel Framework for
Stock Trading Analysis Using Casual Relationship Mining”,
2017, International Conference on Advances in Electrical,
Electronics, Information, Communication, and Bio-
Informatics (AEEICB), pp. 1-6.
[7] Martin A., Lakshmi T., and Venkatesan V., “A Business
Intelligence Framework for Business Performance using
Data Mining Techniques”, International Conference on
Emerging Trends in Science, Engineering and Technologies,
2012, pp. 373-380.
[8] Jadi Y., Lin J., “An Implementation Framework of
Business Intelligence in e-government systems for
developing countries: Case study: Morocco e-government
system”, International Conference on Information Society,
2017, pp.138-142.
[9] Khedr A., Kholeif S. et al., “An Integrated Business
Intelligence Framework for Healthcare Analytics”,
International Journal of Advanced Research in Computer
Science and Software Engineering, 2017, pp. 263-270.
[10] Olexova C., “Business Intelligence Adoption: A Case
Study in the Retail Chain”, WSEAS Transactions on
Business and Economics, 2014, pp. 95-106.
[11] Bahill, A.T. and Dean, F.F. 2009. Discovering system
requirements. Handbook of Systems Engineering and
Management. A.P. Sage and W.B. Rouse, eds. John Wiley
& Sons. 205–266.
[12] Chakraborty A., Baowaly M. et al, “The Role of
Requirement Engineering in Software Development Life
Cycle”, Journal of Emerging Trends in Computing and
Information Sciences, 2012, pp.723-729.
[13] Bass L., Clements P. et al, “Software Architecture in
Practice”, Second Edition, Chapter 1, Addison-Weley,2003.

Reducing Ambulances Arrival Time to Patients
1st Mohammad Eshtayah 2nd Jalal Morrar 3rd Ameer Baghdadi 4th *Amjad Hawash
ICS Dept. ICS Dept. ICS Dept. ICS Dept.
An-Najah N. University An-Najah N. University An-Najah N. University An-Najah N. University
Nablus, Palestine Nablus, Palestine Nablus, Palestine Nablus, Palestine
mohammed.eshtayah@gmail.com jj.yy.mm1996@gmail.com ameer.r.baghdadi@gmail.com amjad@najah.edu

I. ABSTRACT
Minimizing access time to patients in the case of accidents and/or cure needs is a very important issue related to saving lives. It is very important for ambulances to reach the places of accidents and/or medical requests in a minimized time. Several solutions have been proposed to handle this issue. With the prevalence of the GPS system, it is possible to develop mobile applications that direct ambulance drivers to reach patients' places in a minimized period of time. However, it is possible to minimize this time even further when people with emergency cases are able to drive to a point where they can meet ambulances moving ahead towards them. In this work, we developed a mobile-based application that enables patients to request ambulances after searching for the nearest ones depending on the patients' GPS locations. Upon a request, the application, depending on the GPS locations of the patient and the nearest ambulance, draws the shortest route for both (on their handheld devices) after the ambulance crew receives the request. The application keeps updating the route while both are moving with their vehicles until they reach a contact point. The application also enables contacting the involved hospitals and/or health care centers for curing purposes. A website is also developed for managerial purposes by all application stakeholders. Experimental tests were conducted and promising results were achieved in terms of minimizing patients' access times.
Index Terms—GPS, Patient's Treatment, Shortest Route.

II. INTRODUCTION
In the field of patients' health care, several technologies have been applied to improve care and treatment. Supplying hospitals with improved medical equipment, risk analysis, catastrophe management and other fields are considered important for patients'1 treatment. Accessing patients' locations at the right time definitely increases the possibility of saving their lives and enables providing the right medical treatment [1]. The average time (in big cities) to get an ambulance on time is 7 minutes for severe and serious situations, while it is 15 minutes for other cases [7]. Several works were conducted to minimize patients' access time by ambulances, such as reducing traffic crises, controlling traffic lights and constructing widely spread health care centers. The introduction of different Geographic Information Systems (GIS) services, and especially locating the exact position of patients by Global Positioning System (GPS) technology, enhanced patients' curing and decreased the loss of lives in case of emergencies [2]. However, reaching patients in minimal time can be further improved by decreasing the distance between patients and ambulances. This is possible when a patient is able to move (if his/her medical situation permits) from his/her location towards a requested ambulance that is moving towards the patient. By this, a reduction in access time is gained.
This work is related to developing a GPS-based emergency application that can be used by all Human Information System (HIS) parties: patients, hospitals and medical centers, ambulance drivers, and the Ministry of Health as a governmental and organizing party [6]. By this application, a patient, in the case of some emergency, can request an ambulance. The system searches for the nearest available one to be informed with the request. The ambulance starts moving towards the patient, who also starts moving towards the ambulance, both directed by a shared map2 drawn on their handheld devices, with two placeholder icons moving towards each other to depict the patient's and the ambulance's movements until they reach a contact point, taking into account the minimal path between both to decrease the access time as much as possible [11].
After providing the necessary first medical aid, the ambulance crew sends (using the application) a medical request to some hospital, which is notified with the case in order to make the necessary medical preparations until the patient reaches the hospital. The application then determines and draws the shortest route to the hospital and directs the ambulance driver to deliver the patient to the hospital.
After considering related work in Section III, Section IV illustrates the system architecture and how the system works, while Section V contains the conducted experimental tests and, finally, Section VI contains the conclusion of the work.

*Corresponding Author.
1 We would like to notify the reader that the term patient can be exchanged with the term user to comply with traditional Software Engineering terms.
2 The map is drawn by the Google Maps API.

III. RELATED WORKS
Several works were conducted to enhance the arrival of patients to hospitals in the case of emergencies. Some of these works depend on the GPS system, while others depend on finding the



minimal path between the patient location and a hospital taking a web-based program installed in a local hospital computer
into account the roads conditions. based on a person’s employee, providing the GPS interfaces
The work presented in [6], an application entitled ”One- with the functionality necessary to run the system effectively,
Click Smartphone Automatic Sending System” was developed ambulances and physically available warders.
in order to facilitate access to patients via GPS. The developed To facilitate the process of accessing information by means
application enables patients to request ambulances easily by a of the thumb, smart application linked to Arduino device is
simple and handy interface. The application searches for the developed in [8]. The fingerprint is located inside all the data
nearest available vacant ambulance and sends the request to associated with the person by the thumb. When the finger
it. The message contains patient information such as his/her is placed on, all the data entered by the person appear to
location, personal information and kind of infection. Upon the medical crew. Arduino device with a connected GSM
the receive of the message, the ambulance driver can view modem are used to send SMS message containing necessay
the patient’s location on the map and starts moving towards information about the patient to some hospital. The patient’s
him/her. ID and his/her location are transferred via the GPS shield to
The involvement of a control room based on GPS and GIS to his mobile phone, which in turn sends an ambulance to the
monitor streets congestion and control the traffic lights was the patient’s location.
work of [9]. Authors of the work developed a web application Our contribution in this work is related to minimizing the
to view vacant ambulances locations in which patients are able arrival time of ambulances to patients. This can be done by
to request the nearest one. As soon as an ambulance arrives at determining both locations of patients and available ambu-
the patient’s location, the the nearby hospitals and clinics are lances. The system then computes and draws the shortest route
shown. Upon the provided data to the application, the control between a given patient and the nearest ambulance. After
rooms become able to identify the route of the ambulance to determining an ambulance, a map that contains both initial
reach the requested hospital in which enables the control room locations is constructed and appears on both mobile devices.
to control the traffic lights involved in the route to facilitate Upon the movement of patient’s vehicle and the requested
the arrival of patient to the hospital. ambulance 3 the route becomes shorter and shorter until they
Enhancing the communication between citizens and ambu- both reach a contact point. The ambulance crew uses the
lances was introduced in the work [10]. Authors of the work application to send a medical request to a related hospital that
developed an application to be managed by a supervisor. When receives a request in order to prepare the necessary medical
the patient asks for an ambulance, the request is sent to a treatment.
special database, including the patient’s location. Depending
IV. SYSTEM ARCHITECTURE
on the saved data in the database, the supervisor sends an
ambulance to the patient’s site. The ambulance driver follows In this section, we present an overview of the system and
the patient’s location on the map. The supervisor in turn sends its architecture showing its major components and their func-
full information about the ambulance to the patient. tionalities. The system enables patients to request and track
To transform the cities of India into a smart city by GPS, an ambulances using patients’ mobile phones. When some patient
application has been developed depends on the implementation sends a request through the system, the later searches for the
of functional and behavioral of people [4]. When searching nearest available vacant ambulances. Upon the determination
for an ambulance by a patient, ambulances within a radius of some ambulance, the application sends patient’s information
of 5 km is searched for. Hospitals within 10 km were then to be saved inside the Firebase database in the request table
searched for and a query is sent to the closest one. The system (discussed later on). These information are then sent to the
sends information about the patient to the hospital and sends medical crew of the selected ambulance including patient
information about the hospital to the patient as well. location. Depending on the GPS service, a route between the
The work of [3] is related to develop a device with ARM 7 patient and the ambulance locations is drawn on a map on the
processor module which consists of biomedical sensors, GPS patient’s mobile and the crew mobile taking into account the
receiver, and GSM modem. This small device is able to Obtain shortest available route between both. Once the patient vehicle
patient health conditions using a body temperature sensor and the ambulance reach some contact point after following
and a heart rate sensor. The collected data stored in micro- the drawn route, and after determining the patient’s medical
controller memory and sent to a special server. The location condition, the crew asks the application to suggest the nearest
of the ambulance can display the condition of the heart rate hospital, in which a request is sent to the hospital containing
and patient temperature after receiving SMS messages. all necessary information with respect to the patient medical
In order to improve the quality of health care in the rural condition.
areas, an application has been created that is associated with Figure 1 contains the major parts (software and hardware)
a GPS service [5]. Patients are allowed to book consultations involved in the system with arrows depicting the movement of
in the hospital and to request ambulances and also allow the data between these parts. The major components of the system
patient to communicate with health care providers on their are:
way to the patient’s site. Receive instant feedback in terms of 3 Special placeholders appear on the map to indicate the patient’s vehicle
time of arrival and delay time. Authors of the work created and the ambulance.

1) Patients' side software.
2) Ambulance crew side software.
3) 3G/4G wireless network for data traffic.
4) Web-based application.
5) The GPS service.
6) The Firebase4 database.

Fig. 1. Major components of the system.

In the following sections, each component is described briefly, taking into account the structure of each component as well as its role in the system:

A. Patients' side software:
It is an Android-based software that utilizes the GPS service and allows patients to request ambulances. After the request process, and depending on the medical situation of the patient, s/he has the choice to move his/her vehicle following the direction of the route drawn by the application till the contact point with the requested ambulance, or to just wait for the arrival of the ambulance. The main algorithm in the patient's side software is the code responsible for searching for the nearest ambulance. The following pseudocode illustrates the algorithm:

function requestAmbulance(Patient patient){
    patient.GPSLocation = getLocation();                 // current GPS position of the patient
    Ambulance[] ambulances = searchForAmbulances();      // all vacant ambulances from Firebase
    Ambulance ambulance = findTheClosest(patient.GPSLocation, ambulances);
    Boolean reply = sendRequest(patient, ambulance);     // store the request in the database
    if (reply == true)
        drawMap(patient.GPSLocation, ambulance.GPSLocation);
}

The code starts by executing the function requestAmbulance, which takes a patient object as a parameter. The patient object contains the necessary information to identify the current patient who is requesting the service. The function getLocation() contacts the GPS service and returns the current GPS location of the patient, stored in the member variable GPSLocation of the patient object. After that, the function searchForAmbulances() contacts the Firebase database to load all available and vacant ambulances and saves their records in the array of objects ambulances. The retrieved array ambulances is then sent as a parameter to the function findTheClosest along with the parameter patient.GPSLocation. This function is responsible for finding the closest ambulance to the patient, and the data related to that ambulance is saved in the object ambulance. Determining the closest ambulance to the patient includes determining the complete shortest route between the selected ambulance and the patient. After that, the function sendRequest() is executed to save the necessary data in the Firebase database, taking the patient and ambulance objects as parameters. Finally, if the reply of the function sendRequest() is true, then the function drawMap() is executed, taking two parameters: patient.GPSLocation and ambulance.GPSLocation.
Figure 2 below represents a sample drawn route between a given patient and the location of a selected ambulance. When the closest available ambulance is determined, the request is sent to its crew. The patient is then able to track the ambulance until they both reach the contact point. The positions of both the patient's vehicle and the requested ambulance logos on the drawn map change every 3 seconds by continuously contacting the GPS service for both5.

Fig. 2. Sample shortest route between patient and the closest available ambulance.

4 We adopted the Firebase DBMS in this work due to its simplicity and its ability to listen for data changes, which simplifies the notification process for the different system components.
5 In the case of a traffic jam and/or any road obstacles, the system is able to find the next available shortest route.
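As a rough, illustrative counterpart of findTheClosest (the real system relies on the shortest driving route from the Google Maps API rather than straight-line distance), a nearest-ambulance search over GPS coordinates could be sketched in Python as follows; the data structure for ambulances is assumed, not taken from the paper:

from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance in kilometres between two GPS points.
    dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))

def find_the_closest(patient_loc, ambulances):
    # patient_loc is a (lat, lon) tuple; each ambulance is a dict such as
    # {"id": "amb-12", "location": (31.9, 35.2)} -- an assumed structure.
    return min(ambulances, key=lambda amb: haversine_km(*patient_loc, *amb["location"]))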

B. Ambulance crew side software:
The software here can be adjusted to one of two modes, On Duty or Off Duty, to indicate whether the ambulance at a given time is on duty or vacant. Of course, this state is saved in the Firebase database. The software is programmed to keep contacting the database every 10 seconds in order to check if there is a related request. Now, when a patient's request is sent to the database and saved in a table called Requests, the crew software automatically accepts the case and a notification appears on their handheld device. Upon this, a map is drawn on the device showing both the ambulance and the patient's locations. As in the part of the software installed for patients, the application keeps updating these positions till the ambulance and the patient's vehicle reach the contact point. After examining the patient's medical situation, the crew use the application to fill in a special form describing the patient's medical situation, like: the blood type, blood pressure, inhale/exhale condition, whether the patient suffers from chronic disease(s), the types of daily medicine s/he takes, etc. All these data are sent to the database in order to be used by the desired hospital previously selected by the crew so as to prepare the necessary treatment for the patient's medical condition.
The following pseudocode illustrates the process. Please notice that this code is executed when the mode of the software is On Duty:

function checkRequest(Ambulance ambulance){
    Request request = loadRequest(ambulance);        // the request assigned to this ambulance
    GPSLocation patientLocation = request.getLocation();
    GPSLocation ambulanceLocation = getLocation();   // current GPS position of the ambulance
    replyRequest(request, true);                     // mark the request as accepted in the database
    drawMap(patientLocation, ambulanceLocation);
    changeState(ambulance, OFF);                     // the ambulance is now busy
}

The code starts by executing the function checkRequest(), which takes an ambulance object as a parameter. The body of the function contains a set of function calls, starting with the function loadRequest() that takes the ambulance object as a parameter. The return value of that function is a request object that contains the necessary information about the patient requesting the ambulance. The patient's GPS location is extracted from the object request by the member function getLocation() and saved in the variable patientLocation. The ambulance GPS location is determined by executing the function getLocation() that contacts the GPS service. The function replyRequest is then executed; it saves in the database the value true along with the request object to indicate the readiness of the ambulance to handle the request. Both locations (patient and ambulance) are passed as parameters to the function drawMap, which draws the route between the patient and the ambulance. Finally, the function changeState() is executed, taking two parameters, the ambulance and the binary value OFF, to change the state of the ambulance to On Duty in order to prevent the system from displaying that ambulance for other requests till its state becomes Off Duty.
acceptance, the function drawMap() that takes the parameters
The code starts by executing the function checkRequest() amulance.getLocation() and hospital.getLocation() is executed
that takes ambulance object as a parameter. The body of the in which it draws the shortest map between the ambulance
function contains set of function calls started by executing location and the selected hospital location. The same route
the function loadRequest() that takes the parameter ambulance is also drawn by the web-based application installed for the
object as a parameter. The return value of that function is a involved hospital as we will illustrate in the next section. After
request object that contains necessary information about the breaking the loop, the crew changes the state of the ambulance
patient requesting the ambulance. The patient GPS location to ON to indicate they are vacant, of course after the delivery
is extracted from the object request by the member function process takes place, the crew software is configured to OFF
getLocation() and saved in the variable patinetLocation. The Duty to indicate the readiness of other medical requests.
ambulance GPS location is determined by executing the func-
tion getLocation() that contacts the GPS service. The function C. Web-based application:
replyRequest is then executed that saves in the database The web-based application installed in all involved hospitals
the value true along with the request object to indicate the keeps checking the database for any medical request sent by
readiness of the ambulance to handle the request. Both of some ambulance crew. If any request is available, and if the
the locations (patient and ambulance) are parameterised to the hospital is able to accept the medical case (for example, if it
function drawMap that draws the route between the patient and has enough vacant rooms), the hospital guarantees the request,
the ambulance. Finally, the function changeState() is executed and the crew software is immediately informed by that. If
taking two parameters: the ambulance and the binary value the hospital response is negative, the crew software is also
OFF to change the state of the ambulance to be In Duty in informed in order to let the crew to contact another related
order to prevent the system from displaying that ambulance in
6 If the patient medical situation is not serious and can be delivered to any
other requests till its state becomes Off Duty.
close hospital, the crew software searches for the nearest hospital and draws a
The following pseudocode is executed after the a given am- map containing the shortest route between the contact point and the hospital
bulance reaches the contact point with the requesting patient location.

hospital. Upon the accept sign of the hospital, a map is drawn in the web-based application indicating the GPS position of the ambulance and showing the closest route between the ambulance and the hospital7. This route keeps being drawn till the ambulance reaches the hospital. While the patient is heading to the hospital, the latter prepares the necessary medical treatment till the delivery of the patient.
The following pseudocode illustrates the process involved in the web-based application of the hospital:

function checkRequests(Hospital hospital){
    if (exist(request))
        if (HospitalReady == true){
            replyRequest(request.ambulance, true);    // positive reply to the ambulance
            drawMap(ambulance.getLocation(), hospital.getLocation());
        }
        else
            replyRequest(request.ambulance, false);   // the hospital cannot handle the case
}

Figure 3 below contains some of the information sent to a hospital (appearing on its web-based application) according to a notification sent by an ambulance crew through the system. The information contains the GPS location of the patient together with other personal information and his/her initial medical situation.

Fig. 3. Sample notification data arrived to some hospital from an ambulance crew.

The code starts by executing the function checkRequests(), which takes the object hospital as a parameter. Upon the existence of some request (by some ambulance), and if the hospital is able to handle the case, the function replyRequest() with the value true is executed to inform the ambulance object of the positive reply, and the function drawMap for the two locations ambulance.getLocation() and hospital.getLocation() is then executed to draw the shortest route between the ambulance and the hospital8. If the hospital is unable to handle the request, the same function is executed with the value false.
The Red Crescent Society has its own system account to use the web-based application in order to manage the different accounts in the system. The society is responsible for all ambulances in Palestine in terms of their mapping with hospitals as well as their daily situations and conditions. Its part of the software enables the society to track all ambulances involved in the system and their movements. The society has the authority to manage all accounts and is able to generate different reports related to all system participants: patients, ambulances and hospitals.

7 The same map and route are also drawn in the crew software.
8 Of course the same route is drawn by the application of the crew after getting the positive reply of the hospital.

D. Database:
We used Firebase as the DBMS in this work due to its simplicity and fast data retrieval. Figure 4 below represents the ER diagram of the constructed database. The users entity is used to store information about patients of the system, the Requests entity is used to store requests of patients for ambulances as well as requests of an ambulance crew for a hospital, the ambulance entity is used to store information about the participating ambulances, the hospital entity is used to store information about the participating hospitals, and a patient entity contains information about the patients who requested the service. However, a new patient record is created if the hospital has no previous information about the patient.

Fig. 4. ER diagram.

Figure 5 represents a sequence diagram for requesting an ambulance, as an example. The diagram includes the main objects involved in the process as well as the set of functions invoked.

Fig. 5. Sequence diagram for requesting an ambulance.

The sequence of execution starts when some patient requests an ambulance by invoking the function reqAmbulance() on the

System Interface object, which contacts the GPS System by invoking the function getLocation() to get the current location of the patient. The GPS System object in turn returns the location, saved in patientGPSLocation; then the System Interface invokes the function loadAvailableAmbulances() on the Firebase DBMS object, which returns all the vacant ambulances loaded in an array called Ambulances. The loaded array is then searched for the closest GPS location to the patient. Upon the search process, the closest ambulance is requested by invoking the function request(ambulanceID, patientGPSLocation) on the Ambulance object, which returns a response with the value ok; then the patient is notified.

V. EXPERIMENTAL TESTS
In order to measure the improvement of our approach in this work, we conducted 5 different experiments to measure the amount of time needed from initially requesting an ambulance till reaching some hospital. We intentionally fixed the request times for the two types of requests, by phone and by the system, in order to be accurate in the calculation process. So, we asked some volunteers to divide themselves into two groups. The first group requests an ambulance by phone and at the same time the second group requests another ambulance using the system; and in order to be fair and accurate in the calculations, we asked the two ambulances to be in the same location at the time of the requests. We repeated the experiment five times from different locations with respect to requesters and ambulances. Table I represents a comparison between requesting an ambulance and waiting for it till it reaches the patient, and requesting an ambulance and starting to move by car till the contact point with the ambulance, given that the request times (given in hours:minutes) for both, by phone and by the system, in the five experiments are 15:10, 17:38, 12:07, 14:02 and 9:05 respectively. We noticed a reduction in access time as a whole.

TABLE I
A COMPARISON BETWEEN REQUESTING AN AMBULANCE BY PHONE CALL VS. BY THE SYSTEM.

            Request By Phone            Request by the System
Exp. #   Arrival Time  Tour Time    Arrival Time  Tour Time    Saving
#1          15:23         13           15:20         10          3
#2          17:47          9           17:44          6          3
#3          12:10          3           12:10          3          0
#4          14:10          8           14:06          4          4
#5          09:13          8           09:09          4          4
(Tour Time and Saving are given in minutes.)

Figure 6 represents the data appearing in Table I, where a reader can notice the amount of time saved when requesting the ambulance by the system vs. requesting by phone call.

Fig. 6. A chart represents the data appearing in Table I.
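The comparison in Fig. 6 can be reproduced directly from Table I; the short Python script below is only an illustration and is not part of the system described in the paper:

import matplotlib.pyplot as plt

# Tour times in minutes, taken from Table I.
experiments = ["#1", "#2", "#3", "#4", "#5"]
by_phone = [13, 9, 3, 8, 8]
by_system = [10, 6, 3, 4, 4]

positions = range(len(experiments))
plt.bar([p - 0.2 for p in positions], by_phone, width=0.4, label="Request by phone")
plt.bar([p + 0.2 for p in positions], by_system, width=0.4, label="Request by the system")
plt.xticks(list(positions), experiments)
plt.ylabel("Tour time (minutes)")
plt.legend()
plt.show()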
VI. CONCLUSION
Minimizing the waiting time for ambulance arrival increases the possibility of saving lives. The work presented is related to decreasing the distance between patients and the requested ambulances in an attempt to minimize the waiting time for cure and treatment. The simple experiments done in this work highlighted the possibility of reducing the waiting time till the arrival of ambulances to patients by utilizing the GPS system in order to compute the shortest route between the two parties. For future work, we plan to improve the data exchanged between patients and the ambulance crew, such as voice messages, so that the crew could give patients and/or their relatives some directive information till the arrival of the ambulance.

REFERENCES
[1] Alan Campbell and Matt Ellington. Reducing time to first on scene: An ambulance-community first responder scheme. Emergency Medicine International, 2016, 2016.
[2] Carlson JA, Schipperijn J, Kerr J, Saelens BE, Natarajan L, Frank LD, Glanz K, Conway TL, Chapman JE, Cain KL, Sallis JF. Locations of physical activity as assessed by GPS in young adolescents. PubMed Central PMCID: PMC4702023, 137:2015–2430, 2016.
[3] S. Dixit and A. Joshi. A review paper on design of GPS and GSM based intelligent ambulance monitoring. International Journal of Engineering Research and Applications, 4(7):101–103, 2014.
[4] Poonam Gupta, Satyasheel Pol, Dharmanath Rahatekar, and Avanti Patil. Smart ambulance system. International Journal of Computer Applications, 6:23–26, 2016.
[5] Bassey Isong, Nosipho Dladlu, and Tsholofelo Magogodi. Mobile-based medical emergency ambulance scheduling system. International Journal of Computer Network and Information Security, 8(11):14, 2016.
[6] Vijdan Khalique, Shafaq Shaikh, Murlee Daas, and Syed Muhammad Shehram Shah. Automatic ambulance dispatch system via one-click smartphone application. Indian Journal of Science and Technology, 10, 09 2017.
[7] Price L. Treating the clock and not the patient: ambulance response times and risk. Qual Saf Health Care, 15(2):127–30, 2006.
[8] Miss Priyanka Bachate, Miss Pratima Jadhav, Miss Anjalee, Miss Sonali Rayewar and Prof. Premlatha G. Survey on ambulance tracking with patient health monitoring system using GPS. Open Access International Journal of Science and Engineering (OAIJSE), 2:2456–3293, 2017.
[9] Muhd Zafeeruddin Bin Mohd Sakriya and Joshua Samual. Ambulance emergency response application.
[10] Muhd Zafeeruddin Bin Mohd Sakriya and Joshua Samual. Ambulance emergency response application.
[11] Thije van Barneveld, Caroline Jagtenberg, Sandjai Bhulai, and Rob van der Mei. Real-time ambulance relocation: Assessing real-time redeployment strategies for ambulance relocation. Socio-Economic Planning Sciences, 62:129–142, 2018.
Framework Architecture for Securing IoT Using
Blockchain, Smart Contract and Software Defined
Network Technologies
Hasan Al-Sakran YASER ALHARBI Irina Serguievskaia
MIS Department MIS Department unaffiliated
King Saud University King Saud University Riyadh, Saudi Arabia
Riyadh, Saudi Arabia Riyadh, Saudi Arabia serguievskaia@gmail.com
halsakran@ksu.edu.sa 437106487@student.ksu.edu.sa

Abstract— The botnet problem of launching Distributed and control server represent a malicious party [7]. Botnets
Denial of Service (DDoS) attacks on other networks mainly are typically constructed in several operational stages:
arises from the rapid growth in the number of insecure propagation, infection, command and control
Internet of Things (IoT) devices distributed across these communication, and execution of attacks [8].
networks. The focus of this work is to defend such an IoT
network and its associated resources from attacks, and to IoT devices have low computing capabilities. Client-
prevent such networks from becoming a part of botnet server architecture for management of the IoT devices has a
launching DDoS attacks on other networks and resources. To single point of failure which may lead to DDoS attacks like
achieve these objectives, this research emphases designing of a Mirai Botnet. There are several conventional solutions to IoT
botnet prevention model for Internet of Things using emerging security challenges. All of them come from the traditional
technologies such as Blockchain, Smart Contract, and Software information security practices that build controls to protect
Defined Networking (SDN). Blockchain is decentralized the IoT devices and its users which, in turn, consist of
structure which fits with the decentralized nature of IoT. For technical, operational and managerial controls [1].
securing IoT network, the proposed solution presented in this
research based on building above the IoT network a The complexity of managing the IoT networks security
Blockchain network and on top of it to use Smart Contracts significantly increased by the dynamic nature of IoT devices,
that embedded the SDN rules. like smart devices (cars and watches) with rich resources to
sensors, industrial robotics, and actuators with limited
Keywords— blockchain, internet of things, software-defined resources, and their heterogeneity.
networks, smart contract, botnet, distributed denial of service
IBM describes blockchain as a technology for
democratizing the future IoT since it addresses the current
I. INTRODUCTION critical challenges [9]:
There will be 50 billion Internet of Things (IoT) devices
by 2020 according to Cisco prediction [1]. Number of • A lot of IoT solutions are expensive as a result of
interconnected systems already exceeded the number of related to the deployment costs and maintenance of
human beings [2]. The worldwide technology spending on centralized clouds compiled mostly of the supplier
IoT is predicted to reach $1.2 Trillion in 2022 [3]. As and middlemen costs.
number of IoT implementations increases, so does the • Software updates distribution to millions of devices
number of connected into networks devices. Devices for maintenance purposes is quite problematic.
connected to the Internet are subjects to cyber-attacks. For
example, there has been a noticeable upsurge in DDoS • Technological partners usually give device access to
attacks [4]. Security issues, such as privacy, authorization, centralized organizations (service providers or
verification, access control, system configuration, manufacturers). This may lead to breach of privacy
information storage, and management, are the main and anonymity thus causing diminishing trust of IoT
challenges in an IoT environment [5]. These security issues adopters.
cannot be solved with conventional security solutions alone. This work proposes an alternative solution methodology
There are many differences between conventional networks, for solving IoT security problems by applying blockchain
those that are used to connect PC's and servers, and IoT technology. Blockchain technology is an attractive way to
networks, decentralized and distributed in nature, and they enforce privacy for IoT enabled devices and to maintain trust
have to be taken into account within the IoT network. It follows digital security
A serious problem for IoT is Botnet attacks. Botnet requirements: availability, accountability, integrity, and
attacks were originally created for PCs. But such an increase confidentiality. Availability of data in distributed network is
of IoT devices in recent years and their low security level led assured by keeping a copy of data in each block. Data
to emerging and rapid evolving of IoT-based botnets. integrity can be achieved by checking the received data that
already checked within a blockchain network. Transferring
A botnet is a computer network of infected devices data only within the network of trusted devices assures
controlled by malware [6]. Infected IoT devices or bots and confidentiality; and accountability is maintained because
are controlled by a botmaster bots and botnets via command any transaction of data must be verified by other devices. All



this can prevent IoT devices from forwarding malicious data Application Layer: provides high quality smart services
or information to other devices. according to requester’s needs. Business Layer: receives the
data from application layer; holds responsibility for building
Proposed architecture employs blockchain, smart a business model on the received data, designing, analyzing,
contracts and SDN to handle a botnet prevention system for implementing, evaluating, monitoring and developing IoT
IoT in a fully decentralized manner. The objective is to system elements. IoT system management is conducted on
create an automated and easy-to-manage prevention system this layer.
by not allowing IoT devices to run applications which will
make them a part of botnet network to launch DDoS attacks The following are six building blocks or elements of IoT:
on other network and assets. IoT devices are just forwarding
devices. The aim of SDN is to computerize the network • Identification: responsible for naming and matching
functions and separate the forwarding plane from the control services with their demands, and differentiation
plane that applies programmability and automation to between objects and their addresses based on their
networks thus enabling business agility. SDN controller ID's. Electronic Product Codes EPC and ubiquitous
represents an effective approach to customize security Codes uCode are examples of identification.
policies and services in a dynamic way, so that flow rules • Sensing: collecting related objects data and sending it
can be applied in smart contract to enforce security policies back to the database or the cloud so that the data can
and for tracking any suspicious traffic for prevention of be analyzed to perform specific activities based on
botnet creations. required services.
The remainder of this paper is organized as follows. • Communication: different communication
Background on IoT and blockchain are discussed section 2. technologies, for example Bluetooth, WiFi, ZigBee,
Literature review and related work presented in section 3. 4G and others, are being used to connect together
Section 4 will introduce the overall architecture of the heterogeneous objects, usually working in low power
proposed solution platform. The conclusion and future work mode, to carry out specific services.
are in section 5.
• Computation: encompasses microcontrollers,
II. BACKGROUND microprocessors, FPGA, system-on-chip SoC, and
software applications.
A. Internet of Things • Services represented by four categories: identity-
Internet of Things (IoT) is the term used to describe related, information aggregation, collaborative-aware,
equipping objects or things with communications capabilities and ubiquitous services.
to connect with internet technologies by using the proper
architecture and infrastructure for accessing, managing and • Semantics: includes resources discovery and usage,
controlling these connected things. It aims to connect information modeling, and data recognition and
everything, anywhere, and anytime and provide each object a analysis by smart extraction of knowledge using
virtual existence in the cyber space. It provides the different machines in order to deliver the required
enterprises a more detailed awareness of their operations and services.
assets by leveraging data generated by IoT devices. Traditional security models, such as authentication,
The IoT components are: physical objects, sensors, nonrepudiation, confidentiality, access control, integrity and
actuators, virtual objects such as electronic tickets, people, availability, could be applied here to guarantee basic security
services such as cloud services, platforms such as data services because the IoT represent an extension of Internet
analytics platforms, and networks [10]. IoT architecture technologies to the higher level, But such developments
should be flexible and layered to make it capable to bring with them new problems [2]:
interconnect a large amount of IoT heterogeneous devices. • As limited access or closed networks are evolving to
There are several architectural layers for IoT [11]. open systems, the need for protection of these
Objects or Perception Layer represents the physical devices, interconnected devices from attacks is increasing
for example sensors and actuators. These devices perform significantly. But it could be dealt with, for example,
collection data about weight, temperature, vibration, by means of security alarms or some other ways.
acceleration, motion, humidity, location, etc. This layer • Different security policies and numerous security
transfer data to the object abstraction layer. techniques are used for devices’ interactions with
Object Abstraction Layer is where the data gathered from each other in an IoT network, making interactions can
objects layer are moved to the service management layer via become quite complicated.
secure channels by means of telecommunications • Limited computational power and different
technologies such as ZigBee, 3G, Bluetooth, WiFi and etc. It operational IoT environments can present yet another
performs cloud computing and data management problem.
functionalities. Service Management or Middleware Layer:
here the service is matched with its requester based on • Serious security problems can be encountered due to
addresses and names so that IoT application programmers the some IoT potential to interact with multiple
can work with the objects without taking into account the nodes.
hardware. It functions are processing data, making decisions
To secure the IoT systems, a large number of
and delivering the required services.
conventional countermeasures should be designed and
implemented. However, conventional security models are not

capable of complete IoT protection due to the differences • Transactions: the blockchain enables the information
between the conventional networks and IoT. In contrast from sharing and exchange among nodes on a P2P basis. This
the conventional networks, IoT devices are configured on information is transferred from node to node in files. After
low-power lossy network (LLN) topology which has tight each transaction the blockchain state is changed.
limits on power, memory, and processing resources.
• Consensus Mechanism: is needed to keep track of the
One example how these limits can affect system security transactions and ensure secure exchange (transferred in full,
is node impersonation in LLN that can lead to great data cannot be altered, time stamped) to avoid fraud such as
losses. It can happen if an attacker can connect to the double-spending attacks. To maintain a consistent state the
network using any identity during the data transmission same content-updating protocol for the ledger is agreed upon
process, and he can be assumed an authentic node [5]. There and used by all nodes. Blocks will not be accepted without
are also some differences in security features and this consensus mechanism.
requirements.
Four key characteristics of Blockchain were formalized in
[14]:
B. Blockchain
A blockchain, as its name implies, is a chain of • Immutable: permanent and tamper-proof. A
timestamped blocks that are linked by cryptographic means. blockchain is a permanent record of transactions.
It is a distributed ledger whose data are shared among a Once a block is added, it cannot be altered thus
network of peers. Blockchain technologies are capable to creating trust in the transaction record.
track, coordinate, carry out transactions, and store • Decentralized: a blockchain is stored in a file that can
information from a large amount of devices, enabling the be accessed and copied by any node on the network
creation of applications that require no centralized cloud. thus ensuring decentralization.
Four basic concepts that Blockchain is based on are [12]: • Consensus Driven: trust verification. Consensus
• A peer-to-peer network: there is no central trusted models provide rules for validating a block. Each
third party and all nodes have the same privileges. At block on the blockchain is verified independently
each node a pair of public/private keys is used for using these rules. In Bitcoin, this is referred to as the
interaction with other nodes, where the public key is mining process. Frequently a scarce resource is used
used as an address of the node on the network and to prove that adequate effort was made, such as
private key is used to sign transactions. computing power. No central authority or an explicit
trust-granting agent is participating in this
• Open and distributed ledger: each node has got its mechanism.
own copy of the same ledger. The ledger is open and
transparent to everyone. • Transparent: the blockchain is an open file, any party
can access it and audit transactions.
• Ledger copies synchronization: is done by
broadcasting the new transactions publicly, validating A Blockchain is built of a chain of blocks and each one
the new transactions, and adding the validated contains a database of transactions (see Fig. 1). The
transactions to the ledger. Blockchain is extended by adding blocks that are related to
each other by hashing algorithms and hence the Blockchain
• Mining: are competing among themselves to represents a complete ledger of transactions history. The
understand who will be the first to take the new additional block can be validated by network using
transaction, validate it and put it into ledger thus cryptography. In addition to transactions, each block has a
creating the chain. timestamp, a hash value of previous block, and a nonce
The core components that build Blockchain and its which is a random number for verifying the hash. Hashing
operations are as follows [13]: concept ensures the integrity of the data in the chain. Hash
values are unique. Fraud can be prevented since changes to
• Asymmetric Key Cryptography: public/ private key the block needs to change the whole chain of blocks [15].
pairs are used to secure its operation.

[Figure: a chain of blocks (Block 1, ..., Block i-1, Block i, Block i+1); each block holds a hash, a timestamp, a nonce, and a list of transactions Trans 1 ... Trans n.]

Fig. 1 Blockchain Structure
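To make the hash-linking idea concrete, here is a minimal, self-contained Python sketch (not taken from the paper) in which each block stores the hash of the previous one, so altering any transaction invalidates every later block:

import hashlib, json, time

def block_hash(block: dict) -> str:
    # Deterministic SHA-256 over the block's contents.
    return hashlib.sha256(json.dumps(block, sort_keys=True).encode()).hexdigest()

def new_block(prev_hash: str, transactions: list, nonce: int) -> dict:
    return {"timestamp": time.time(), "prev_hash": prev_hash,
            "nonce": nonce, "transactions": transactions}

# Chain two blocks as in Fig. 1: block i+1 stores the hash of block i.
genesis = new_block("0" * 64, ["Trans 1"], nonce=0)
second = new_block(block_hash(genesis), ["Trans 1", "Trans 2"], nonce=42)
assert second["prev_hash"] == block_hash(genesis)   # tampering with genesis would break this link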

Smart contracts are digital contracts of an agreement that detection is aimed to break the chain of botnet cycle but was
can be programmed and automatically execute the terms of designed to be implemented on the propagation stage.
the agreement to carry out a particular action, events,
transaction, etc., at a given time or after a certain set of In contrast with the above method that intended to do its
conditions have been met. Smart contracts execute exactly job on the early stage of botnet life cycle, Meidan et. al., [8]
what its transitioning parts want to do without an developed a method that is used on the last stage and acts as
intermediary. They can be developed by using a high-level a last line of defense. Researchers assumed that botnets are
language called Solidity and are the ultimate automation of evolving and that they would be able to bypass the detection
trust. Digitizing a contract can be very useful for discovering tools that targeted early stages of the botnet lifecycle. This
attacks. Smart Contracts stored and replicated on a method uses deep learning techniques to take behavior
blockchain. The use of smart contracts allows for the snapshots and train the model, called AutoEncoder in the
validation of transactions and verification of counterparties, paper, to detect abnormal behavior of the system .
therefore reducing the risk of attacks. In [7] detecting of botnets is performed by analyzing
communities of IoT devices that are formed according to
C. Software Defined Network (SDN) Overview their network traffic. IoT devices sense and process data, and
SDN holds the concept of privacy and security by design communicate with other IoT devices. Developed system,
and aims to increasing network programmability by called AutoBotCatcher, uses blockchain to allow
separation of control and data planes. It’s a cost-effective collaboration of a set of pre-identified untrusted parties in
software-based solution compared to manual intervention to order to perform dynamic collaborative botnet detection by
each of the devices in IoT network. By separating the control collecting and auditing IoT devices’ network traffic flows as
plane from data or forwarding plane from independent IOT blockchain transactions. This solution uses blockchain rather
devices, SDN allows a sub network and overall network to than a centralized system because of the benefits that the
be centrally managed and monitored. Using SDN it becomes blockchain might bring. The consensus concept allows
easy to define security policies for a network that provides AutoBotCatcher validate correct execution of the collaborate
capabilities of preventing DDoS attacks. Using SDN in IoT process to be performed without central trusted part .
has several advantages. But using SDN instead of traditional A botnet prevention mechanism supplemented by
networking paradigm brings the issue of centrality nature of blockchain and SDN has been proposed in [17]. According
SDN. SDN separates control plane from data plane, but this to authors, each network consists of three modules; Security
means centralizing the control plane and make it target to policy module (SecPoliMod), Controller module (ConMod),
attackers. and Log module (LogMod). The SecPoliMod enforces
SDN has the following characteristics [16]: security policies and designate approved list of IoT devices
that meet minimum security requirements to prevent the
• Directly Programmable - due to decoupling of control whole IoT network from becoming a part of a botnet. The
plane and forwarding functions, network control is LogMod parses the flow rules running on the SDN controller
directly programmable of the network and checks the latest authenticated flow rules
linked in the blockchain for tracking any suspicious traffic
• Agile – where network flow could be dynamically destined to any innocent network for prevention of Botnet
adjusted to meet changing needs creations.
• Centrally managed – network intelligence is Architecture to defend against DDoS attacks by building
centralized in software-based SDN controllers a collaborative mechanism between service providers'
• Programmatically configured – managers can networks was developed in [18]. Blockchain is leveraged as
configure, manage, secure, and optimize network a transactions exchange media between SDN controllers in
resources very quickly by using SDN programs the service providers’ autonomous networks. Service
providers enrolled in this Blockchain service can signal the
• Open Standards based and Vendor neutral – hence occurrence of the DDoS attacks and take advantage of the
network design and operation is simplified because of shared detection and motivation mechanisms. The goal is to
open standards and vendor-agnostic devices and create an automated and easy-to-manage DDoS mitigation
protocols service. Three building blocks of this solution are:
blockchain, smart contracts and software defined network.
III. LITERATURE REVIEW
a smart contract which is linked to registry-based type of
The major challenge in IoT is security of IoT devices and smart contracts. When the attacker overloads the web server
networks, and privacy of people and organizations that get of one of the service provider's autonomous network, IP
benefits from using the IoT. The traditional approaches to addresses of attackers are stored in the smart contract.
defeat threats on IoT are inapplicable due to the Service provider's autonomous network will then receive
decentralization nature of IoT network. updated lists of addresses to be blocked when they receive
Researchers attempt develop solutions to detect botnets. the Blockchain blocks that contain smart contracts.
Prokofiev et. al. [6] built a machine learning predictive A blockchain based solution to secure IoT devices in a
model that employs logistic regression technique for botnet smart home setup has been described in [19]. The developed
detection. This model has ability estimate the probability of blockchain has three-tier architecture, smart home or local
the IoT device being a member of a botnet or a bot. Data network, overlay network, and cloud storage.
need to be gathered to train the model. It is accomplished by
collecting data from 100 botnets oriented on IoT devices and When developers want to create blockchain systems for
capable to perform brute-force attacks. This method of botnet specific purposes, they are must to have a platform that will
A blockchain-based solution to secure IoT devices in a smart home setup has been described in [19]. The developed blockchain has a three-tier architecture: the smart home or local network, an overlay network, and cloud storage.

When developers want to create blockchain systems for specific purposes, they must have a platform that can support such a system with physical-world applications, big data integration, data integrity, data storage, big data analytics, identity privacy, data access security, trusted data sharing and collaboration, IoT integration, and general distributed and parallel computing. The design of such a blockchain platform is described in [20].

A discussion of the hosting location of the blockchain (directly on the IoT device, in the cloud, or in the fog) is presented in [21]. Hosting the blockchain directly on the IoT device is impractical due to the limitations of computational resources, insufficient device bandwidth, and the need to preserve power.

IV. FRAMEWORK ARCHITECTURE OF IOT SECURITY SYSTEM

In this work we present an adaptive blockchain and SDN model in a hierarchical, distributed network environment. Figure 2 presents an overview of the architecture of the proposed model, which consists of:

• Global BC (GBC): stores the source IP addresses that should be allowed or blocked. It provides the necessary information on a large scale, and the global blockchain is used to provide large-scale event detection.

• Network segments: each maintains a local private blockchain that saves all transactions, and a smart contract containing SDN controllers that hold the policies for accessing the devices.

• Task agent: constantly sniffs and monitors the network traffic flows of IoT devices and takes actions accordingly.

• Mobile agent: moves within the system carrying a communicating object. It can be generated dynamically during execution and can reconfigure itself dynamically based on changes to the services.

This design deploys a set of SDN controllers at each network segment to respond to attacks in that specific segment. All SDN controllers are embedded within smart contracts. Each blockchain segment is connected directly to the GBC via a mobile agent. All SDN controllers in each network segment are connected to the GBC in a distributed manner using local private blockchain smart contract techniques. This allows automatic configuration of responses from IoT devices. Each network segment covers a small associated community of IoT devices and a local storage unit saving all transactions, which can also be used as a local backup drive.

Fig. 2. System Architecture

Each smart contract segment contains security policies that meet the minimum security requirements to prevent a network segment of IoT devices from becoming a botnet. The SDN controller acts as a firewall; it is responsible for data analysis and detection service in a timely manner. The task agent of each segment monitors the network traffic flow of the IoT devices within the segment and takes actions according to the situation. In performing such operations, two types of IP address lists are created: one for predefined and trusted addresses, and a second for IP addresses that have previously been detected as part of a botnet. Each local smart contract first needs to register itself in the global smart contract registry, which stores all relevant smart contracts that should be watched. Each network segment reports the results of its data processing to the global blockchain layer via a mobile agent.

The local blockchain and its associated smart contract are managed by a local network administrator who is responsible for adding and removing IoT devices. Adding devices is done by performing a first transaction that starts the local blockchain; all subsequent transactions are then chained together. It is also the administrator's responsibility to remove devices from the sub-network, which is done by removing the ledger related to the device. Devices can communicate with each other if the administrator permits it; this permission is granted by giving the devices a shared key.

The controllers within the smart contract update the flow rules by verifying the version of the flow rule table that maintains the IP address lists of trusted IoT devices in their network segment and of IP addresses detected as part of a botnet. Smart contracts need to run on a blockchain to ensure that the contract content cannot be changed. If any IoT device does not follow the rules in the smart contract, for example by sending undesirable data, then this device is considered part of a botnet and capable of launching a DDoS attack.

The architecture checks the network traffic flow of each IoT device within its network segment. If the traffic of an IoT device comes from one of the trusted IP addresses, the device is allowed to forward its data. Otherwise the device and its traffic are considered untrustworthy, the IoT device is isolated, and the system immediately stops the flow of data from it. This attack information is reported to the global blockchain so it can be shared among the connected controllers to block similar activities before other segments of the network are affected.

V. CONCLUSION

In this work, we proposed a decentralized blockchain-based architecture for securing IoT devices. The blockchain network is built on top of an SDN network that interrelates the networks of different service providers that are willing to share information about botnet DDoS attacks using smart contracts. Blockchain, smart contracts, and SDN technology together offer a promising way to address the security problems of IoT, and the proposed approach can prevent an IoT device from becoming part of a botnet by combining these technologies. The benefits of blockchain for IoT and the related IoT security issues were discussed. Enterprises that already have IoT systems, or that are just developing IoT initiatives, are recommended to take blockchain technology into consideration and develop a strategy to secure their IoT systems.
REFERENCES

[1] M. A. Khan and K. Salah, "IoT security: Review, blockchain solutions, and open challenges," Future Generation Computer Systems, vol. 82, pp. 395–411, November 2017, http://doi.org/10.1016/j.future.2017.11.022
[2] A. Sfar, E. Natalizio, Y. Challal, and Z. Chtourou, "A roadmap for security challenges in the Internet of Things," Digital Communications and Networks, vol. 4, pp. 118–137, 2018, https://doi.org/10.1016/j.dcan.2017.04.003
[3] L. Columbus, "Roundup Of Internet Of Things Forecasts And Market Estimates," https://www.forbes.com/sites/louiscolumbus/2018/12/13/2018-roundup-of-internet-of-things-forecasts-and-market-estimates/#729e5aa27d83, Accessed February 23, 2019.
[4] Akamai, "How to Protect Against DDoS Attacks - Stop Denial of Service," https://www.akamai.com/us/en/resources/protect-against-ddos-attacks.jsp, Accessed 10 Jan 2017.
[5] F. A. Alaba, M. Othman, I. Abaker, T. Hashem, and F. Alotaibi, "Internet of Things security: A survey," vol. 88, pp. 10–28, April 2017, http://doi.org/10.1016/j.jnca.2017.04.002
[6] A. O. Prokofiev, Y. S. Smirnova, and V. A. Surov, "A method to detect Internet of Things botnets," Proceedings of the 2018 IEEE Conference of Russian Young Researchers in Electrical and Electronic Engineering (ElConRus 2018), January 2018, pp. 105–108, http://doi.org/10.1109/EIConRus.2018.8317041
[7] G. Sagirlar, B. Carminati, and E. Ferrari, "AutoBotCatcher: Blockchain-based P2P Botnet Detection for the Internet of Things," pp. 1–8, 2018, http://doi.org/10.1109/CIC.2018.00-46
[8] Y. Meidan, M. Bohadana, Y. Mathov, Y. Mirsky, A. Shabtai, D. Breitenbacher, and Y. Elovici, "N-BaIoT: Network-based detection of IoT botnet attacks using deep autoencoders," IEEE Pervasive Computing, vol. 17(3), pp. 12–22, 2018, http://doi.org/10.1109/MPRV.2018.03367731
[9] T. M. Fernández-Caramés and P. Fraga-Lamas, "A Review on the Use of Blockchain for the Internet of Things," IEEE Access, vol. 6, pp. 32979–33001, 2018, http://doi.org/10.1109/ACCESS.2018.2842685
[10] I. Bojanova, "What Makes Up the Internet of Things?", Accessed Feb. 24, 2019, https://www.computer.org/web/sensing-iot/content?g=53926943&type=article&urlTitle=what-are-the-components-of-iot-
[11] A. Al-Fuqaha, M. Guizani, M. Mohammadi, M. Aledhari, and M. Ayyash, "Internet of Things: A Survey on Enabling Technologies, Protocols, and Applications," IEEE Communications Surveys and Tutorials, vol. 17(4), pp. 2347–2376, http://doi.org/10.1109/COMST.2015.2444095
[12] A. Panarello, N. Tapas, G. Merlino, F. Longo, and A. Puliafito, "Blockchain and IoT integration: A systematic survey," Sensors (Switzerland), vol. 18, http://doi.org/10.3390/s18082575
[13] B. D. Puthal, N. Malik, S. P. Mohanty, E. Kougianos, and G. Das, "Everything You Wanted to Know About the Blockchain: Its Promise, Components, Processes, and Problems," July 2018, http://doi.org/10.1109/MCE.2018.2816299
[14] K. Sutan, U. Ruhi, and R. Lakhani, "Conceptualizing Blockchains: Characteristics & Applications," 11th IADIS International Conference Information Systems, 2018, Accessed Feb. 27, 2019, https://arxiv.org/ftp/arxiv/papers/1806/1806.03693.pdf
[15] M. Nofer, P. Gomber, O. Hinz, and D. Schiereck, "Blockchain," Business & Information Systems Engineering, vol. 59(3), pp. 183–187, 2017, http://doi.org/10.1007/s12599-017-0467-3
[16] ONF (Open Networking Foundation), "Software-Defined Networking Definition," Accessed April 9, 2019, https://www.opennetworking.org/sdn-definition/
[17] Q. Shafi and Abdulbasit, "DDoS Botnet Prevention using Blockchain in Software Defined Internet of Things," Proceedings of the 2019 16th International Bhurban Conference on Applied Sciences & Technology (IBCAST), Islamabad, Pakistan, 8–12 January 2019.
[18] B. Rodrigues, T. Bocek, A. Lareida, D. Hausheer, S. Rafati, and B. Stiller, "A Blockchain-Based Architecture for Collaborative DDoS Mitigation with Smart Contracts," D. Tuncer et al. (Eds.): AIMS 2017, LNCS 10356, pp. 16–29, 2017, https://doi.org/10.1007/978-3-319-60774-0_2
[19] A. Dorri, S. S. Kanhere, and R. Jurdak, "Blockchain in internet of things: Challenges and Solutions," http://doi.org/10.1145/2976749.2976756
[20] Z. Shae and J. J. P. Tsai, "On the Design of a Blockchain Platform for Clinical Trial and Precision Medicine," Proceedings of the International Conference on Distributed Computing Systems, pp. 1972–1980, http://doi.org/10.1109/ICDCS.2017.61
[21] M. Samaniego and R. Deters, "Blockchain as a Service for IoT," Proceedings of the 2016 IEEE International Conference on Internet of Things (iThings), IEEE Green Computing and Communications (GreenCom), IEEE Cyber, Physical, and Social Computing (CPSCom), and IEEE Smart Data (SmartData), pp. 433–436, 2017, http://doi.org/10.1109/iThings-GreenCom-CPSCom-SmartData.2016.102

Security issues in Wireless Sensor Network
Broadcast Authentication
Asad Raza, Ali Abu Romman, Muhammad Faheem Qureshi
ISET Department, Abu Dhabi Polytechnic, Abu Dhabi, United Arab Emirates
asad.raza@adpoly.ac.ae, ali.aburomman@adpoly.ac.ae, muhammad.qureshi@adpoly.ac.ae

Abstract—The influence of wireless sensor networks is increasing day by day due to their cost effectiveness in handling real-world challenges. A WSN consists of many small devices with limited power and limited computational capability that monitor physical and environmental conditions and communicate over wireless links. WSNs are introducing new techniques of communication and dissemination of information in wireless networks. Because of their involvement in many applications, secure authentication of broadcast packets is a mandatory requirement and one of the great challenges in WSN security. In this paper we discuss the threats to WSNs concerning secure communication. However, the focus of this paper is to highlight the security issues regarding broadcast authentication in WSN and to analyze the proposed solutions with respect to various parameters.

Keywords—Wireless sensor network, Broadcast authentication, Security

I. INTRODUCTION

A wireless sensor network is a collection of nodes, a few hundred or even thousands, organized into a cooperative network to monitor temperature, sound, vibration, pressure, etc., and then pass the measurements to a central device known as an access point (AP) or base station (BS). There are three major components in a wireless sensor node: a sensor component (which senses and takes measurements), a computing component (which processes data) and a communication component (which enables communication between nodes) [16]. The sensor nodes have low power and limited processing capability. WSNs have a variety of applications ranging from military to home and industrial applications. The applications of wireless sensor networks have not only impacted but changed our daily life. Some of the most common applications of WSN are summarized below.

⎯ Land sliding detection: WSNs are used in landslide detection to detect the movement of soil and changes in the parameters before and after a landslide. The information gathered from the sensor nodes can be utilized to forecast landslides in advance.

⎯ Air Pollution Monitoring: With fast-growing industrial activities, the problem of air pollution is becoming a serious concern for health. Traditional data logging methods are considered not only complex but also time consuming. WSNs are used to reduce the complexity of air pollution monitoring and to obtain real-time measurements.

⎯ Forest fire Detection: There have been many incidents of massive forest fires due to human carelessness and mistakes, which have a negative impact on the ecosystem, and the results of forest fires can be catastrophic [17]. Sensor nodes installed in forests provide real-time and accurate fire detection. Early fire detection is very critical to minimize the impact.

⎯ Battlefield Monitoring: On the battlefield, sensor nodes can be used to monitor enemy activities. Based on the information gathered from the sensor nodes, the army can plan how to prepare against the enemy's activities.

⎯ Weather Monitoring: The application of WSN in weather monitoring is similar to that of air pollution and fire detection. Sensor nodes can be used for weather monitoring, and early prediction of rain or flood can be made in order to take precautionary measures in advance.

⎯ Health Monitoring: One of the most popular applications of WSN is patient health monitoring. Wearable health monitoring units can help doctors to continuously monitor a patient's health and maintain an optimal health status. Research shows that the addition of WSNs in health monitoring has shown very positive indicators in terms of patients' recovery from critical medical conditions like cardiac arrest [18].

These are only a few of the applications of wireless sensor networks; there are numerous situations in which wireless sensor networks can prove to be the most suitable solution.

Like any other network, wireless sensor networks are also prone to security threats. The sensor nodes may be located in different locations and use wireless links to gather and transport or communicate important information. It is important to understand the attacks pertaining to wireless sensor networks before we narrow down our discussion to broadcast authentication issues. Some of the common attacks on wireless sensor networks are discussed below.
ATTACKS ON WIRELESS SENSOR NETWORKS

The most common attacks on WSN are:

• Routing information spoofing: In this type of attack, the attacker sends fake routing information, creates routing loops, generates false error messages, and lengthens or shortens the source route. This attack decreases the lifetime of the network and increases latency [7].

• Selective Forwarding: In WSNs the multi-hop paradigm is common, in which each node must forward received messages correctly and securely. If a node gets compromised, it may refuse to forward messages or forward only selected (possibly malicious) ones.

• Denial of service attack: DoS is a type of attack that prevents the network from performing its normal operation or makes it unavailable to legitimate users. It can be launched in different ways, e.g. sending jamming signals, which are radio signals transmitted to interfere with the radio frequency used by the sensor network in order to jam the nodes, or sending too many bogus messages (flooding) to the nodes to cause power failure.

• Sybil Attack: In this type of attack, a malicious node appears to be in more than one place. The node presents more than one identity to the neighboring nodes in the network. This type of attack mostly affects geographic routing. It can be prevented if each pair of neighboring nodes uses a unique key for initializing communication.

• Wormhole: The basic idea of this type of attack is to tunnel packets received in one part of the network to another part. A well-placed wormhole can disrupt the whole routing. A node that is multiple hops away from the base station is deceived into believing it is one or two hops away through the wormhole. This attack can be launched in conjunction with a Sybil attack.

• Sinkhole: In this type of attack, a compromised node makes itself look attractive to other nodes in the network with respect to the routing algorithm. The compromised node attracts all the network traffic to pass through it, creating a sinkhole with the attacker at the center to collect information. This attack can be used to launch other types of attacks such as selective forwarding [7].

• Hello Flooding: In WSNs many protocols require that nodes broadcast Hello messages to neighboring nodes to advertise their presence. The receivers of a Hello message assume that they are in radio range of the sender. A laptop-class attacker can broadcast routing or other information to deceive the nodes in the network into believing he is their neighbor and may start exchanging information.

• Acknowledgement spoofing: Some routing protocols use link-layer acknowledgements, which can be spoofed by the attacker to convince the nodes that weak links are strong or that dead links are alive. As a result, a node can select a weak link for routing, and packets sent over the weak link can be lost.

• Impersonation: In this type of attack, the attacker adds a node to the existing network by copying the ID of an existing node. The attacker is then able to corrupt, misroute or delete packets, and can disclose the cryptographic keys as well.

• Eavesdropping: Eavesdropping is not an active attack. In this attack the attacker listens to the network traffic to discover secret information. This type of attack is very hard to prevent; in most cases encryption is the only solution to prevent eavesdropping.

• Traffic Analysis: Through traffic analysis the attacker can determine the base station, as all traffic goes toward a single point. If the base station is compromised, the attacker will be able to make the whole network useless.

• Mote Class: Mote-class attacks, also called insider attacks, are launched either by a compromised node or by an attacker who has taken (stolen) the key material, code or data from a legitimate sensor node.

• Laptop Class: Also called an outsider attack; the attacker has no special access to the WSN but has access to more powerful devices, such as a laptop, which replace a legitimate node. This attack can jam the entire network, as the radio transmitter power is high.

The next section will discuss the related research work on security issues in general and broadcast authentication in particular pertaining to WSNs.

II. RELATED WORK

Broadcast authentication is a very crucial security service in wireless sensor networks because it allows the nodes to send authenticated broadcast messages to all other nodes. Techniques such as μTESLA and multi-level μTESLA have been proposed to handle broadcast authentication, but none of these techniques has been effective in terms of bandwidth and the number of sender nodes, and these techniques also cannot handle denial-of-service attacks.

Donggang Liu proposed a technique for broadcast authentication which is based on μTESLA but can handle both the problem of the number of sender nodes and DoS attacks. He also proposed a technique to revoke broadcast capability from nodes which are malicious or compromised [19].

Considering the challenges of μTESLA, Mohamed Hamdy Eldefrawy has proposed a protocol which uses two different hash functions and the Chinese Remainder Theorem. The protocol has proved to be more efficient and effective than μTESLA because the receivers can authenticate the broadcast messages in real time [20].

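For orientation, the delayed-key-disclosure idea that μTESLA and the schemes above build on can be sketched in a few lines. This is a minimal illustration only, not the full protocol (real μTESLA additionally needs loose time synchronization and a per-interval key schedule); it uses only the Python standard library.

```python
# Minimal sketch of delayed key disclosure with a one-way key chain.
import hashlib
import hmac


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


# Sender: build a key chain and publish keys[0] as the commitment.
# After reversing, keys[i] == h(keys[i + 1]); later keys stay secret for now.
chain_length = 5
keys = [hashlib.sha256(b"sender-secret-seed").digest()]
for _ in range(chain_length):
    keys.append(h(keys[-1]))
keys.reverse()
commitment = keys[0]

# Interval i: broadcast a message with a MAC under the still-secret key K_i.
i, message = 1, b"temperature=27"
mac = hmac.new(keys[i], message, hashlib.sha256).digest()

# Later, K_i is disclosed. A receiver checks that K_i belongs to the chain
# (hashing it i times must reproduce the commitment) and then checks the MAC.
disclosed_key = keys[i]
check = disclosed_key
for _ in range(i):
    check = h(check)
assert check == commitment
assert hmac.compare_digest(mac, hmac.new(disclosed_key, message, hashlib.sha256).digest())
print("broadcast message authenticated")
```

The security argument is that the MAC key is only released after every receiver has already buffered the message, so an attacker who learns the key can no longer forge messages for that interval.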
Rongxing Lu, Xiaodong Lin, Haojin Zhu, Xiaohui Liang and Xuemin Shen proposed a scheme called BECAN which is not only effective in terms of bandwidth, but can also preemptively detect bogus packet injections and helps to conserve energy. This also helps to lessen the burden on the sink in detecting bogus data injection [21].

M. Ramesh and C. Suresh have proposed a broadcast authentication scheme based on TESLA and ECDH; however, it uses three keys. This scheme reduces the loss and delay in WSNs. After taking the initial parameters, the scheme is based on three main steps: auxiliary key generation, public/private key generation based on elliptic curve Diffie-Hellman, and key concatenation, which finally results in a hash key [22]. This hash key is validated before broadcasting the packets. If the key is verified, the packets are broadcast; otherwise they are discarded and the sink is informed.

Tsern-Huei Lee proposed two new user authentication protocols slightly different from password-based solutions. These protocols are very lightweight in terms of computational power and communication load as compared to strong password-based techniques. Despite their simplicity, these authentication protocols provide comparable security [1].

The scheme proposed by Junqi Zhang and V. Varadharajan is based on the LOCK scheme and employs ID-based secure group key management, which minimizes the key storage requirement and the number of rekeying messages [2]. Yilin Wang and Maosheng Qin proposed a scheme dealing with key management issues using asymmetric cryptographic techniques [3]. Haiguang Chen, XinHua Chen and Junyu Niu proposed an authentication scheme focusing on the unique characteristics and novel misbehaviors found in WSNs; the proposed scheme authenticates nodes based on the abnormal behaviors or actions they would carry out anyway [4]. Norziana Jamil, Sera Syarmila Sameon and Ramlan Mahmood proposed a scheme focusing on authentication. Their work is based on identity-bits commitments for authentication, mainly addressing forgery and replay attacks [5]. D. Manivannan, B. Vijayalakshmi and P. Neelamegam proposed a new protocol in which congruence equations and number theory concepts are introduced to provide secure authentication among the nodes [6].

III. SECURITY CHALLENGES IN WSNBA

The main advantage of broadcasting and multicasting is that it reduces the communication overhead, but at the same time it also requires that only legitimate nodes/parties should be able to access those messages. Some of the major challenges in WSNBA are listed below.

⎯ Secure routing protocols
⎯ Key Establishment Issues
⎯ Fast response broadcast authentication
⎯ Defending DoS Attacks
⎯ Location privacy in WSN

We will briefly discuss all these challenges in this section.

1. Secure routing protocols:
One of the major issues in WSNs is a secure routing protocol, which must not only protect the routing information but should also be lightweight. It is extremely challenging to design a secure routing protocol, as the sensor nodes have low power and low-capacity memory and processing power. WSN routing security deals with the authentication of the user node and the verification of the packet being sent. Authentication can be achieved by using the base station, a key or a certificate; the certificate is the unique ID of each node. The scheme described in [8] provides a secure and efficient routing protocol using encryption and authentication.

Iman Almomani and Emad Almashakbeh [9] proposed a power-efficient, secure routing protocol to manage the resource limitations in WSN. This protocol is a combination of tree-based and cluster-based protocols, but it uses LEACH as a base protocol for the cluster formation process, which has some weaknesses [10]. The aim of any secure routing protocol is to guarantee the authentication, integrity and availability of packets. Some of the well-known secure routing protocols that address various issues are TESLA, μTESLA, the intrusion-tolerant routing protocol (INSENS), SPINS and trust routing for location-aware sensor networks (TRANS), to name a few.

2. Key Establishment Issues:
To securely exchange data, the protocol must establish and manage key distribution among all the nodes in the WSN that want to communicate. New nodes should be securely deployed and enabled to start secure communication with the existing nodes in the network. No unauthorized node or user should get access to the network. The limitations of WSNs, such as the vulnerability to physical node capture, limited computational and communication power, and no prior knowledge of the deployment of the sensor network, make the design more challenging.

One of the basic approaches is to use a single shared key for the entire network. In this approach all communication is encrypted with the same key and then a MAC (message authentication code) is appended. Although this approach lifts the burden of key management, it has many drawbacks: if one node is compromised, the whole network will be compromised [11].

Another approach to key distribution is to use asymmetric cryptography, more commonly called public key cryptography. In this approach, before deployment of the nodes, a master public/private key pair is generated. Then, for each node, a public/private key pair is generated. Each node stores its key pair, the master public key and the master key signature. Now all nodes are ready to exchange keys. Nodes exchange their public keys and master key signatures. The public keys of nodes can be verified by checking the master key signature using the master public key. Once the verification of a node's public key is done, a symmetric key is generated and transmitted between the communicating nodes by encrypting it with the public key of the receiver node. Now the two nodes are ready for secure communication using this symmetric key. The reason for using a symmetric key for encryption is that it is computationally less expensive than public key encryption. But this approach also has some problems, e.g. key generation and verification overhead and vulnerability to DoS and node replication attacks.

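A minimal sketch of this master-signature flow is shown below, using the third-party Python "cryptography" package. The key types are illustrative assumptions (Ed25519 for the master signature, X25519 for the pairwise agreement), and a Diffie-Hellman style exchange stands in for transporting a freshly generated symmetric key under the receiver's public key; the surveyed schemes do not prescribe these particular primitives.

```python
# Illustrative sketch only: master-signed node keys, then a pairwise secret.
# Requires the third-party package "cryptography" (pip install cryptography).
from cryptography.hazmat.primitives import serialization
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
from cryptography.hazmat.primitives.asymmetric.x25519 import X25519PrivateKey


def raw(public_key) -> bytes:
    # Raw byte encoding of a public key, used as the signed payload.
    return public_key.public_bytes(serialization.Encoding.Raw,
                                   serialization.PublicFormat.Raw)


# Before deployment: generate the master key pair and sign each node's public key.
master = Ed25519PrivateKey.generate()
master_public = master.public_key()

node_a, node_b = X25519PrivateKey.generate(), X25519PrivateKey.generate()
sig_a = master.sign(raw(node_a.public_key()))
sig_b = master.sign(raw(node_b.public_key()))

# In the field: nodes exchange public keys plus master signatures and verify
# them with the master public key before deriving a shared symmetric secret.
master_public.verify(sig_b, raw(node_b.public_key()))   # raises InvalidSignature if forged
master_public.verify(sig_a, raw(node_a.public_key()))

shared_a = node_a.exchange(node_b.public_key())
shared_b = node_b.exchange(node_a.public_key())
assert shared_a == shared_b   # both ends now hold the same key material for symmetric crypto
```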
Another approach is to use pair-wise shared keys, in which each node has a unique symmetric shared key with every other node in the network. The main disadvantage of this approach is the storage of too many keys on each node: if the network is large, this approach is not feasible at all in terms of storage capacity. Adding new nodes to the network is also very difficult, which causes scalability issues. There is another key distribution approach known as random key pre-distribution, described in [11] and [12]. None of the schemes is perfect in all respects; all have advantages and disadvantages and apply to specific situations.

3. Fast response broadcast authentication
Another major challenge in WSNs is fast and authenticated broadcast operation. The public-key-based method of broadcast authentication in wireless sensor networks is considered more efficient than the symmetric-key-based approach because of simple protocol operations, such as requiring no synchronization. Using a PKC-based approach, a sensor node can detect false messages, as it authenticates them before forwarding. But the problem is that PKC operations on low-computational-power nodes will increase the message propagation time. Another issue is that a sensor node may forward messages before even authenticating them. To achieve fast and efficient broadcast operation, the sensor node has to decide when to authenticate first and when to forward the message first, based on the situation. To deal with this dilemma, two new schemes have been proposed in [13]. These schemes help the sensor nodes decide when to authenticate first and when to forward first. The two schemes are known as:
⎯ Key Pool Scheme
⎯ Key Chain Scheme

For efficient and capture-resistant PKC-based broadcast authentication protocols, distribution of secret keys among the sensor nodes and Bloom filters are used in both schemes. There are two ways to solve the broadcast authentication problem: one is the hardware approach and the second is the protocol approach. The hardware approach secures keys inside the sensor node to prevent any attack from bypassing the protocol by equipping the nodes with tamper-resistant memory, allowing a MAC approach to be used in the wireless sensor network. Because of the high cost of tamper-resistant hardware, the hardware approach is limited to critical applications. Many researchers are therefore focusing on creating new protocols that resist node capture. As discussed in the previous sections, µTESLA (Timed Efficient Stream Loss-tolerant Authentication) and its various extensions are efficient, low-computational-overhead broadcast authentication protocols, but they require some level of synchronization between nodes during periodic broadcasting. Because the PKC-based method deals efficiently with the verification delay problem, most researchers are trying to speed up the operations of PKC. In the PKC approach, the two schemes mentioned above use a digital signature as the primary authentication mechanism along with Bloom filters. In these schemes the sensors are low-cost devices without tamper-resistant hardware, performing basic cryptographic and public key operations. Both the key pool scheme and the key chain scheme use the strength of public key cryptography and efficiently solve the issue without the need for periodic key distribution or synchronization. We will briefly discuss these two schemes below.

i. Key Pool Scheme

In the key pool scheme, the nodes are divided into groups, each of which possesses a partition of the network's key pool, while the access point possesses all the keys. This scheme consists of three phases:
a) Pre-deployment phase: In this phase each node stores the keys necessary for node-level operations, the access point public key, the network-wide hash function and the independent hash functions for BFVs [13].
b) Signature generation: In this phase the digital signature is generated, and then the access point creates a BFV from the digital signature which is used by the sensor nodes to pre-verify the signature [14]. Finally the message is broadcast as {M, tt, DS, I, BFV}, where M = message, tt = time stamp, DS = digital signature, and I = the set of all key indices included in the BFV.
c) Message verification and forwarding phase: This phase is responsible for message verification and forwarding whenever a broadcast message is received at each node in the network [13].

Advantages of the Key Pool Scheme:
• Through key partitioning, the attacker learns only limited information from node capturing.
• Provides protection against denial-of-service attacks.
• Minimum broadcast authentication delay is achieved.
Disadvantages of the Key Pool Scheme:
• The hashing operation adds computational overhead.
• Transmitting additional bits in every broadcast message results in communication overhead.

ii. Key Chain Scheme

The key chain scheme uses multiple one-way hash chains to avoid the communication overhead of the key pool scheme. In this scheme there is no need to include indices in the broadcast messages, because it uses a one-way hash chain in the forward direction, in which each node starts with key index zero and advances to higher key indices.

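Both schemes rely on a Bloom filter vector (BFV) so that a node can cheaply pre-verify a broadcast before running the expensive public-key signature check. The toy sketch below shows the underlying data structure only; the bit-vector size, the number of hash functions and the key labels are arbitrary assumptions rather than the parameters used in [13].

```python
# Toy Bloom-filter vector (BFV), standard library only.
import hashlib


class BloomFilter:
    def __init__(self, size_bits: int = 256, num_hashes: int = 3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = [False] * size_bits

    def _positions(self, item: bytes):
        # Derive several bit positions per item from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + item).digest()
            yield int.from_bytes(digest, "big") % self.size

    def add(self, item: bytes) -> None:
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item: bytes) -> bool:
        # False positives are possible, false negatives are not.
        return all(self.bits[pos] for pos in self._positions(item))


# Access point side: insert the keys tied to the current broadcast.
bfv = BloomFilter()
for k in (b"key-17", b"key-42", b"key-63"):
    bfv.add(k)

# Sensor node side: cheap membership test before the costly signature check.
print(bfv.might_contain(b"key-42"))   # True -> proceed to verify the signature
print(bfv.might_contain(b"key-99"))   # very likely False -> drop without signature check
```

The cost of the pre-check is a handful of hash operations, which is why it can be done on every received packet even on constrained nodes.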
The key chain scheme comprises three main phases, similar to the key pool scheme discussed in the previous section [13].
a) Pre-deployment phase: First of all, a global key pool must be generated, which consists of the N starting keys of N independent key chains. Each node selects k starting keys from the N keys at random. As in the key pool scheme, every node is configured with the public key of the base station, the network-wide hash function and the independent hash functions for BFVs.
b) Signature generation phase: Access points generate the digital signature as DS = E_private(H ‖ tt). Every key chain advances from K_ij to K_i(j+1), where i ∈ [1, N]; after that the access point inserts all new keys into the BFV as in [13]. The final message to be broadcast is given below, where c is the index of the current key in the chain:
Broadcast Message = [M, tt, DS, c, BFV]
c) Message verification and forwarding phase: In this phase tt is checked for freshness and the BFV for forgery. If the BFV test passes, the sensor node advances the local key chain to that index. If all the corresponding BFV bits for the new local keys are verified (otherwise the packet is dropped), the node forwards the message. Before accepting the packet, the last step is to verify the DS.

Advantages of the Key Chain Scheme:
• Indices are not required to be included in the broadcast message.
• Multiple one-way hash chains are used, which eliminates communication overhead in WSNs.
• Given the starting key and the network-wide hash function, the key at any position in the chain can be computed.
Disadvantages of the Key Chain Scheme:
• This scheme is prone to a single point of failure: once a key is compromised, the whole key chain is compromised.

4. Defending against DoS Attacks

Since digital signatures are expensive for low-computational-power sensor nodes, an attacker can launch a DoS attack by forging a large number of broadcast messages with digital signatures and then forcing the nodes to authenticate the signatures, consuming the battery power of the nodes. One way to defend against this DoS attack is to use a sender-specific one-way key chain [13]. In this approach each sender has a pre-authenticator that is added to the broadcast packet. When the receiver receives the broadcast packet, it first verifies the pre-authenticator; only if this verification is successful is the digital signature verified.

The pre-authenticator is derived from a pseudorandom function and the sender node ID. The pseudorandom function, say f, is known only to the base station (access point), which can verify a sender's pre-authenticator when needed. Before sending or broadcasting a data packet, the sender first distributes its pre-authenticator to all receivers, either using a secure broadcast key or using pair-wise shared keys. Before the communication can take place, each node saves the node ID and the most recent pre-authenticator of all its neighbors. Finally, the broadcast message has the following format: [i | M_i | DS_i | K_i^v], where i is the index, M_i is the message to be broadcast, DS_i is the digital signature and K_i^v is the i-th pre-authenticator of node v.

When this data packet is received at the receiver end, the pre-authenticator is verified first, to check that a packet with the i-th index has not been received before and that v is a valid neighbor. The receiver performs the verification by checking K_j^v = f^(j-i)(K_i^v), because the key K_i^v can only be generated by node v. If the verification is successful, the receiver then verifies the DS; otherwise the packet is dropped. At the end, j is replaced by i and K_j^v by K_i^v at the receiver end [15].

One of the advantages of the one-way hash function is that, as long as the i-th packet has not been broadcast, the attacker cannot figure out K_i^v, since it depends on i, and hence the attacker cannot fake the pre-authenticator. Consider the case when the attacker receives a broadcast packet from a node X, keeps the pre-authenticator, forges the message and then replays the forged message. If the receiver Y is a neighbor of node X, it must have received the unmodified packet with the same pre-authenticator at the same time the adversary received the broadcast packet, so node Y can detect the forged message because it has already seen this pre-authenticator. If node Y is not a neighbor of X, then X is also not among Y's neighbors, so Y will detect the modified packet and drop it.

In many situations it is required to add new sensor nodes after the initial deployment. The addition of a new node changes the neighborhood associations of the already existing nodes. Therefore, the new node needs to be handled so that it can be recognized as a valid node and can broadcast packets and verify the packets it receives. For this purpose, first of all, for identity, an ID certificate is calculated by signing the node ID (each node has its unique ID) with the private key of the base station. All the sensor nodes, both old and new, are pre-configured with the base station's public key, so the new node can prove its validity with its ID certificate. For the new node to communicate with its neighboring nodes, key distribution and management are required between the newly entered node and the neighboring nodes to create shared keys among them. It is assumed that the new node meets the preliminary requirements and can perform operations based on public key cryptography (PKC); therefore, an asymmetric key algorithm can be used for the initial communication. The sensor node is loaded with a public/private key pair, and a public certificate is generated using the private key of the base station. The new node then broadcasts its public key certificate to the existing nodes, which can verify the certificate using the base station's public key. Once the verification procedure is done, the new node can use its public/private key pair to communicate securely and then set up a pair-wise shared key for symmetric cryptographic operations.
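The pre-authenticator check described earlier in this section can be sketched as follows. SHA-256 stands in for the pseudorandom function f, and the indices are arranged so that the newer value hashes forward to the value stored for that neighbor; everything else (chain length, labels) is an illustrative assumption.

```python
# Sketch of the cheap pre-authenticator check performed before signature verification.
import hashlib


def f(key: bytes) -> bytes:
    return hashlib.sha256(key).digest()


def accept_packet(stored_index: int, stored_preauth: bytes,
                  packet_index: int, packet_preauth: bytes) -> bool:
    """True if the packet's pre-authenticator hashes forward to the last
    value stored for this neighbor (done before the costly signature check)."""
    if packet_index <= stored_index:          # replayed or out-of-date index
        return False
    value = packet_preauth
    for _ in range(packet_index - stored_index):
        value = f(value)
    return value == stored_preauth


# Sender side: one-way chain where chain[i] == f(chain[i + 1]); chain[0] is
# distributed first, later values are attached to packets in increasing order.
n = 10
chain = [b"node-v-secret"]
for _ in range(n):
    chain.append(f(chain[-1]))
chain.reverse()

stored_index, stored_preauth = 0, chain[0]    # receiver's state for neighbor v
print(accept_packet(stored_index, stored_preauth, 3, chain[3]))   # True
print(accept_packet(stored_index, stored_preauth, 3, b"forged"))  # False
```

A forged packet fails this hash check almost for free, so the node never spends energy verifying the attacker's bogus digital signature.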
5. Location Privacy in WSN:
The localization process is used to find the locations of the WSN nodes. Correct location information is crucial because in many applications, e.g. the location of an enemy in a military setting or the location of a fire, accurate location information is needed to take immediate action [16].

An attacker can disclose the location information. The attacker can be an insider or an outsider. An inside attacker is more dangerous because he claims to be an authentic user and has access to the authentication credentials. An outside attacker can get into the network physically and can passively eavesdrop on the communication. There are two basic approaches to prevent location disclosure in a WSN. The first one is based on encrypting all messages, but the problem is that the attacker can observe the physical characteristics of the wireless network, and repeated messages from the same sensor node reveal more and more information until the node is localized, so this approach is not very effective. The second approach is to make all the nodes go to sleep so that no location privacy can be revealed, but this defeats the purpose of using a WSN.

Neelanjana Dutta, Abhinav Saxena and Sriram Chellappan proposed a new scheme in which only a few nodes participate in any message exchange. This approach protects the privacy of many nodes. The idea behind this approach is that the sensor nodes have to satisfy two conflicting requirements, i.e. localizing the adversary and hiding their own existence from the adversary [17]. The proposed protocol not only preserves the privacy of many nodes but also provides the adversary's location.

Conclusion
In this paper we have discussed WSNs, their applications, threats to WSNs in general and security challenges pertaining to WSNBA in particular. Broadcast authentication is one of the crucial security services in wireless sensor networks, but there are several security challenges, which have been discussed in this paper. These challenges include secure routing protocols, key establishment issues, fast response WSNBA, and defending against DoS attacks, which have been described in detail. For all the issues related to broadcast authentication there are existing solutions, which we have summarized in this paper. All the solutions or proposed schemes have their own pros and cons, which are discussed in detail.

REFERENCES

[1] Tsern-Huei Lee, "Simple Dynamic User-Authentication Protocols for Wireless Sensor Networks," The Second International Conference on Sensor Technologies and Applications, Dec. 2008.
[2] Junqi Zhang and V. Varadharajan, "A New Security Scheme for Wireless Sensor Networks," Global Telecommunications Conference, Dec. 4, 2014.
[3] Yilin Wang and Maosheng Qin, "Security for wireless sensor networks," Control Automation and Systems (ICCAS), 2010 International Conference on, 27-30 Oct. 2010.
[4] Haiguang Chen, XinHua Chen, Junyu Niu, "Implicit Security Authentication Scheme in Wireless Sensor Networks," International Conference on Multimedia Information Networking and Security (MINES), 4-6 Nov. 2012.
[5] Norziana Jamil, Sera Syarmila Sameon, Ramlan Mahmood, "A User Authentication Scheme based on Identity-bits Commitment for Wireless Sensor Networks," 2010 Second International Conference on Network Applications, Protocols and Services, 14 Nov. 2010.
[6] D. Manivannan, B. Vijayalakshmi, P. Neelamegam, "An Efficient Authentication Protocol Based on Congruence for Wireless Sensor Networks," IEEE International Conference on Recent Trends in Information Technology (ICRTIT 2011), Anna University, Chennai, June 3-5, 2011.
[7] Hemanta Kumar Kalita and Avijit Kar, "Wireless sensor network security analysis," International Journal of Next-Generation Networks (IJNGN), Vol. 1, No. 1, December 2009.
[8] Jiliang Zhou, "Efficient and Secure Routing Protocol Based on Encryption and Authentication for Wireless Sensor Networks," Artificial Intelligence and Education (ICAIE), 2010 International Conference, 18 Nov. 2010.
[9] Iman Almomani and Emad Almashakbeh, "A Power-Efficient Secure Routing Protocol For Wireless Sensor Networks," WSEAS Transactions On Computers, ISSN: 1109-2750, Issue 9, Volume 9, September 2010.
[10] G. M. Shafiullah, A. Gyasi-Agyei, P. J. Wolfs, "A research Survey of Energy-Efficient and QoS-Aware Routing Protocols for Wireless Sensor Network," Springer Science, 2008.
[11] Haowen Chan, Adrian Perrig, and Dawn Song, "Key distribution techniques for sensor networks," Available at: http://www.cs.cmu.edu/~haowen/randomkey.pdf
[12] Noureddine Mehallegue, Ahmed Bouridane, Emi Garcia, "Efficient path key establishment for wireless sensor networks," EURASIP Journal on Wireless Communications and Networking, Volume 2008, January 2008, Hindawi Publishing Corp., New York, NY, United States.
[13] Panoat Chuchaisri and Richard Newman, "Fast Response PKC-Based Broadcast Authentication in Wireless Sensor Networks," Collaborative Computing: Networking, Applications and Worksharing (CollaborateCom), 2010 6th International Conference, 12 May 2011.
[14] F. Ye, H. Luo, S. Lu, and L. Zhang, "Statistical en-route filtering of injected false data in sensor networks," INFOCOM 2004, Twenty-third Annual Joint Conference of the IEEE Computer and Communications Societies, pp. 2446-2457, vol. 4.
[15] Xiaojiang Du, Mohsen Guizani, Yang Xiao, Hsiao-Hwa Chen, "Defending DoS Attacks on Broadcast Authentication in Wireless Sensor Networks," IEEE ICC 2008 proceedings.
[16] Avinash Srinivasan and Jie Wu, "A Survey on Secure Localization in Wireless Sensor Networks," Available at: http://ahvaz.ist.unomaha.edu/azad/temp/sal/07-srinivasan-localization-sensor-wireless-security-coverage-beacon-network.pdf
[17] Neelanjana Dutta, Abhinav Saxena and Sriram Chellappan, "Defending Wireless Sensor Networks Against Adversarial Localization," Mobile Data Management (MDM), 2010 Eleventh International Conference.
[18] Aleksandar Milenkovic, Chris Otto, Emil Jovanov, "Wireless sensor networks for personal health monitoring: Issues and an implementation," Journal of Computer Communications, Volume 29, Issue 13-14, August 2006, Pages 2521-2533.
[19] Donggang Liu, Peng Ning, Sencun Zhu, Sushil Jajodia, "Practical Broadcast Authentication in Sensor Networks," Proceedings of MOBIQUITOUS '05, The Second Annual International Conference on Mobile and Ubiquitous Systems: Networking and Services, pp. 118-132.
[20] Mohamed Hamdy Eldefrawy, Muhammad Khurram Khan, Khaled Alghathbar, and Eun-Suk Cho, "Broadcast Authentication for Wireless Sensor Networks Using Nested Hashing and the Chinese Remainder Theorem," Sensors (Basel), 2010, 10(9): 8683-8695, published online 2010 Sep. 17, doi: 10.3390/s100908683.
[21] Rongxing Lu, Xiaodong Lin, Haojin Zhu, Xiaohui Liang, Xuemin Shen, "BECAN: A Bandwidth-Efficient Cooperative Authentication Scheme for Filtering Injected False Data in Wireless Sensor Networks," IEEE Transactions on Parallel and Distributed Systems, 2012, 23(1): 32-43, DOI: 10.1109/tpds.2011.95.
[22] M. Ramesh and C. Suresh, "An Analysis of Broadcast Authentication and Security Schemes in Wireless Sensor Networks," International Journal of Engineering and Technology, Oct. 2013.
[23] R. Bhatia, M. P. S., "Wireless sensor networks for monitoring the environmental activities," Proceedings of the IEEE International Conference on Computational Intelligence and Computing Research (ICCIC '10), December 2010, doi: 10.1109/ICCIC.2010.5705791.

Towards An Integration Concept of Smart Cities
Naoum Jamous, Magdeburg Research and Competence Cluster (MRCC), Otto von Guericke University Magdeburg (OVGU), Magdeburg, Germany, naoum.jamous@ovgu.de
Stefan Willi Hart, Digital Business Integration, Accenture GmbH, Cologne, Germany, Stefan.willi.hart@accenture.com

Abstract—Urbanization is one of the greatest challenges of our time. The smart city concept was introduced to support efficient utilization of the limited resources of our world. However, implementing this concept is a challenging task that researchers and practitioners are dealing with. In this paper, different smart city frameworks as well as enterprise integration approaches are discussed. In order to support the orchestration of smart city services, an integration concept is presented.

Keywords—Smart City, systems integration, Internet of things (IOT), IT Operation Management, Supply Chain Management.

I. INTRODUCTION

Urbanization is one of the greatest challenges in our world. For instance, in 2018, 55% of the population was living in urban environments, and this share is expected to reach 68% by 2050 [1]. The smart city concept was introduced to support efficient utilization of the ecosphere's limited resources. However, implementing this concept is a challenging task for both researchers and practitioners. For example, the European Union is supporting those researchers and practitioners by funding Smart Cities and Internet of Things (IoT) related research in the context of the Horizon 2020 program [2].

Considering the schematic maturity level model of smart cities, it can be stated that the ultimate goal is to develop a self-learning and adaptive smart city with endogenous and exogenous networks [3]. In order to achieve this goal, smart city initiatives in different areas, such as traffic management or waste management, have to be implemented. However, to achieve the highest maturity level, these initiatives should be integrated in a common communication framework [3], [4]. Since a smart city project involves several partners, the implementation dilemma is one of the most crucial aspects to be considered. This dilemma describes the challenge of managing the differences in stakeholders' interests while planning and developing smart city initiatives [3]. Moreover, responsibilities for failure and success are often not clarified. In addition to the management aspects, technological issues must be considered: high degrees of safety, security, as well as system integration are necessary [5]. According to many researchers, incomplete information and communication technologies, the variety of technological standards, data privacy, and integration are some of the main transformation barriers in smart city projects [3], [4], and [6].

An integration problem can be solved by several techniques. Thus, evaluating the integration solutions is a multi-dimensional problem [7]. This paper focuses on the integration of different smart city systems to achieve the highest maturity level possible. While the concept of web services is recommended for single initiatives in the scientific literature, achieving synergy effects among different initiatives requires a higher degree of communication between the systems used [7], [8], and [9].

In a broader sense, an integrated smart city system can be compared to the IT system landscape of an enterprise describing its supply chain [10]. Thus, a concept similar to a supply chain management (SCM) system can be used to support a smart city project. SCM is a process-oriented management approach comprising all movements of raw materials, components, products, and information along the value creation and supply chain from raw material to the end product [11]. In the context of this idea, integration approaches for supply chain management could be transferred to smart city projects, as indicated in figure 1.

Fig. 1. Supply Chain of a Smart City System [10]

In this work, different smart city frameworks as well as enterprise integration approaches will be discussed. In order to support the orchestration of smart city services, an integration concept will be proposed and discussed. The paper then closes with a conclusion paragraph stating the main findings and obstacles.

II. SMART CITY FRAMEWORKS

According to A. M. Townsend, smart cities can be defined as "places where information technology is combined with infrastructure, architecture, everyday objects, and even our bodies to address social, economic, and environmental problems" [12]. Therefore, smart city development is a challenging task. According to C. Etezadzadeh, a smart city consists of the following enablers: natural basics, urban actors and their contributions, integrated urban management and urban governance, objectives and versioning, infrastructures, a layer of information and communication technologies, and resilience [4]. With the introduction of smart city solutions, new business models and new value chains arise on the basis of inter-sectoral cooperation.

As described in the previous section, there are many barriers to overcome. Researchers have proposed development
frameworks to be used in smart city projects. Hereafter, four different frameworks are presented.

A. Smart City Initiatives Design Framework (SCID)
The SCID authors developed a conceptual model to be used while designing a concrete smart city initiative. This model was based on the analysis of ten different smart city initiatives [13]. It describes the main features of an initiative. Using the Leontief "Input-Output Model" [14], the authors aimed to create an explicit link between the environmental factors that affect the initiative directly and the achieved results. Thereby, a value-oriented perspective is associated with the solution. The model consists of six main elements, as depicted in figure 2.

Fig. 2. The Smart City Initiatives Framework (SCID) [13]

The element "Smart City Initiatives" describes how the specific smart-city-related projects can be implemented. These projects have an impact on the city policy domain of the location in which the initiative is carried out. This in turn produces some result for the city and for the various stakeholders. On the lower level of the model, the elements of enablers, critical success factors, and challenges play a role. The critical success factors are extracted from the two elements "enablers" and "challenges". SCID describes two different implementation approaches. The first is the top-down approach, stating that smart cities are initially planned, designed, and developed on the basis of drafts. The second approach is a bottom-up one; it assumes that existing cities are upgraded with smart features.

B. Modified Smart City Initiatives Design Framework
This model is an improvement of the SCID framework. Here, the SCID is converted, using a decision support model, into a schematic transformation meta-model [3]. It firstly considers the maturity degree of the city development, and secondly its iterative development. Another aspect which is revisited in the model is the holistic planning approach; this planning should provide essential guidelines for the redesign of the city.

C. Integration Model for Smart City Development according to V. Javidroozi [15]
This model is based on a literature review, questionnaires, and interviews [15]. It is derived from the Business Process Change (BPC) model of Kettinger and Grover [16], which consists of four dimensions: information & technology, people, management, and structure. V. Javidroozi merged the management and structure dimensions. The model has three levels. In the first one, the technological factors, the business process changes and the human challenges are identified. The second level is divided into technologically related challenges, process-related challenges, and human problems. At the third level, further dimensions are derived from the process-related challenges: inter-organizational, functional, and managerial challenges.

D. SMART model
This model is proposed by S. Ben Letaifa [17] and uses a top-down approach. As shown in figure 3, the model consists of three paths (macro, mezzo, and micro) and five consecutive main phases: Strategy, Multidisciplinary, Appropriation, Roadmap, and Technology [17]. The top level of the framework is the macro level. At this level, the development of the smart city strategy and the mobilization of multidisciplinary resources occur. The mezzo level focuses on the various actors' appropriation of the project and the creation of a clear road map for the realization of the city. At the micro level, possible technologies are identified to support the strategy and hence the initiatives.

In the strategic phase, the local challenges and the population are input requirements to realize the common vision to be pursued. Afterwards, the objectives are defined and the strategy is carried out. Then the multidisciplinary resources should be mobilized. The following phase is all about the iterative and agile improvement of the definition and development of the project. The various multidisciplinary actors should be integrated in order to transform them into active members of the project. Then the detailed planning is carried out, including the implementation of the services. In the last phase, the technology selection is detailed.

Fig. 3. The SMART model [17]

III. ENTERPRISE INTEGRATION – GENERAL CONCEPTS

The term system integration is defined by J. Myerson as: "…Systems integration involves a complete system of business processes, managerial practices, organizational interactions and structural alignments… It is an all inclusive process designed to create relatively seamless and highly agile processes and organizational structures that are aligned with the strategic and financial objectives of the enterprise… Systems integration represents a progressive and iterative cycle of melding technologies, human performance, knowledge and operational processes together." [18]. Hence, it is an extensive and complex process in which various aspects need to be considered. System integration has five essential characteristics to be considered [18]:
• Functional and technical compatibility is provided.
• The technologies used to process applications and data are relatively transparent to users.
• The issue is selecting the best technology with respect the connection of the application landscape to a central
to longevity, adaptability and scalability, and speed of communication component (Message Broker) can be found
solution delivery. in both. However, SOA requires that the connected
applications follow the service paradigm, whereas they can
• Application systems, data, access paths to data, and remain discrete in an EAI scenario [24], [25].
graphical user interfaces (GUIs) are harmonized and
standardized for the user. Another important difference is that EAI is driven by the
business processes, while SOA is driven by technology [26].
• All enterprise wide applications and computing Plus, SOA use a top-down approach while EAI follow a
environments are scalable and portable to a variety of bottom-up approach [20]. SOA defines standards for various
needs. integrations [27]. In contrast, the EAI intends to extend the
The integration objectives must be defined before setting changes from one system to a cluster of systems. Generally,
the integration approach to be followed. In the definition of SOA enables a wide range of enterprise applications to
integration objectives, two Integration items need be integrate the use of standardized services [25]. EAI
considered: the task level and the task manager level. Based integrated against it at the system level, so to speak, by the
on these two levels, four main aspects are determined [19]: integration system output [28].

• Redundancy: Describes how a particular function IV. ENTERPRISE INTEGRATION –IMPLEMENTATION


associated to system components, and how redundant
APPROACHES
copies can be removed without affecting the system
performance After presenting two main general concepts of enterprise
integration, hereafter three implementation approaches are
• Linkage: Describes the type and number of discussed.
communication channels among the components.
• Consistency: Describes the behavior of a formal A. Point-to-Point (P2P)
system and its relation to the components. As shown in figure 4, a point-to-point refers to a
connection between two equal IT systems [20]. Each system
• Goal orientation: Describes how the object network has a connection to every other system. The problem is that
and the application system to contribute to the the number of interfaces (S) grows with each new system
achievement of the overall task. (n): S = n*(n-1)/2 [20]. This leads to a very high degree of
This research focuses on the technological system complexity, which is difficult to manually management. For
integration due to the complexity of the subject. In general, example, a system landscape with 12 systems would already
two basic approaches are utilized: the Enterprise Application have 66 interfaces after the P2P approach. New systems can
Integration (EAI) and the Service-Oriented Architecture easily be added into the existing architecture. There is no
(SOA). reorganization of the system landscape. It is thus a fast
realization of unused interfaces or custom design possibility
A. Enterprise Application Integration (EAI) of individual interfaces. Therefore, it is a quick realization of
possible interfaces or individual design options of individual
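The quadratic growth of the interface count can be traced with a short calculation. The snippet below is only an illustrative sketch of the S = n*(n-1)/2 relation and is not taken from the paper.

```python
def p2p_interfaces(n: int) -> int:
    """Number of point-to-point interfaces in a fully meshed landscape of n systems."""
    return n * (n - 1) // 2

# A landscape with 12 systems already needs 66 interfaces,
# and doubling the number of systems roughly quadruples the integration effort.
for n in (4, 12, 24):
    print(n, "systems ->", p2p_interfaces(n), "interfaces")   # 6, 66, 276
```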
B. Hub and Spoke
A hub and spoke system consists of services with an adapter (spoke) and a central broker (hub) [19]. As depicted in figure 5, the adapter serves to realize the connectivity to the connected systems. Messages are sent to the central broker and transformed there for the target system. Afterwards, the message is forwarded according to defined rules via the adapter to the target system [19].

Fig. 5. Hub and Spoke approach

C. Enterprise Service Bus (ESB)
According to D. A. Chappell, an Enterprise Service Bus is "a standards-based integration platform that combines messaging, web services, data transformation, and intelligent routing to reliably connect and coordinate the interaction of significant numbers of diverse applications across extended enterprises with transactional integrity" [29]. As presented in figure 6, the ESB is a messaging backbone. This messaging system controls the flow among the various services and applications which are linked to the ESB. Subscribing applications have adapters which take messages from the bus and transform them into the format required by the application.

Fig. 6. Enterprise Service Bus approach
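As a rough, hypothetical illustration of the adapter idea described above (not taken from [29] and not the paper's implementation), a bus can be modelled as a broker that routes messages by topic to subscribing applications, each of which owns an adapter that converts the bus format into its own format. All class, topic and application names below are invented for illustration.

```python
class Adapter:
    """Converts a bus message into the format a subscribing application expects."""
    def __init__(self, app_name, transform):
        self.app_name = app_name
        self.transform = transform

class EnterpriseServiceBus:
    """Minimal messaging backbone: routes by topic and lets adapters transform messages."""
    def __init__(self):
        self.subscribers = {}                      # topic -> list of adapters

    def subscribe(self, topic, adapter):
        self.subscribers.setdefault(topic, []).append(adapter)

    def publish(self, topic, message):
        for adapter in self.subscribers.get(topic, []):
            delivered = adapter.transform(message)  # per-application transformation
            print(adapter.app_name, "received", delivered)

bus = EnterpriseServiceBus()
bus.subscribe("smart.energy", Adapter("BillingApp", lambda m: {"kwh": m["value"]}))
bus.subscribe("smart.energy", Adapter("DashboardApp", lambda m: str(m)))
bus.publish("smart.energy", {"sensor": "meter-17", "value": 3.2})
```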
V. DISCUSSION
The models presented in chapter 2 follow a strict process chain and therefore they are not flexible. Still, the modified SCID solves different problems, but iteratively. This means that the city will always be analyzed and evaluated based on its goals and needs; then the transformation processes toward a smart city will start. The process of the other frameworks is similar to the waterfall model in project management. Yet, all frameworks consider only the organizational aspect. They describe the challenges to be considered and their impact on the development process.

The development of a suitable smart city system landscape, and/or the integration of various smart city solutions into a system landscape, is not considered in any model yet. The sustainability of the system landscape is an essential aspect to be considered while developing a smart city. Due to the rapid change in information technologies, it is important to have a flexible system environment enabling the exchange of individual components/services. Therefore, real-time migration, the switch to new technologies, and the integration of new services are at the forefront. The maintenance of the services and the system environment should also be considered. Another important aspect is the potential synergy among the smart city solutions. For example, in the field of smart logistics, all related systems should use the same sensors, and/or they should all be able to communicate.

Table 1 presents an evaluation of the proposed integration approaches based on [31]. It can be seen that all three approaches have medium or lower development complexity. In particular, the ESB has a low development complexity since there are a lot of open standards, reducing the development effort [30]. When the maintenance complexity of the approaches is considered, it can be seen that the P2P approach is the most complex. The reason is that when a service is exchanged or replaced, all connections to the other services need to be altered or updated. In the hub and spoke approach, all hubs must be serviced, while for the ESB only the affected services need to be serviced. Consequently, in many change and maintenance scenarios, these two approaches are appropriate. With respect to coupling, the situation is similar: the P2P approach is very tightly coupled, unlike the other two approaches, and the ESB approach is completely loosely coupled. In terms of scalability, the P2P and ESB approaches have certain advantages as they are highly scalable, whereas the hub and spoke approach is dependent on the hub structure, so there may be limitations in the processing. For the extensibility of the approaches, hub and spoke and ESB show advantages over the P2P approach: if a new node or service is added in the P2P approach, all other nodes must know the protocol of the new service, which leads to an increase in the system's reorganization costs.

TABLE I. COMPARISON OF THE DIFFERENT APPROACHES [31]

Parameter | P2P | Hub and Spoke | ESB
Complexity of development | Medium | Medium | Low
Complexity of maintenance | High | Medium | Less
Coupling | Tightly coupled | Need to know the protocol of the hub | Loose coupling via use of adapters
Scalability | High | Restricted by hub infrastructure | High
Extensibility | Low | High | High
Security mechanism | Up to individual service | Hub can complete the implementation | Built-in mechanism with adapters
Data latency | Real-time | Present, depends on frequency of updates to hub | Depends on messaging backbone
Performance | High (no overhead) | Depends on the hub's infrastructure | Some dip; depends on adapters
Best fits | Intra-business service integration | Intra- and inter-business service integration | Enterprise-wide integration

The latency depends in all cases on the central orchestration component (hub or messaging backbone). At this point, the P2P approach has its advantages: it can respond via its connections to each system very quickly and in real time. For the performance, exactly the same consideration applies. This results in optimal application scenarios for the individual approaches. For the P2P approach, it is intra-business service integration, because fast internal communication is possible and often the same standards are used within the company itself. The hub and spoke approach is an intermediate solution; it can be used for intra- and inter-business service integration, because the hub acts as a central broker and external services can reach it via an adapter. The disadvantage of this approach is that each service requires the protocol of the hub. The ESB approach is especially suitable for enterprise-wide integration, since each service provides its own adapter.

VI. PROPOSED FRAMEWORK
Based on the previous discussion, it can be concluded that there is a lack of a uniform system landscape concept integrating all smart city solutions. Furthermore, the evaluation of the different integration approaches shows that no single integration approach meets all the necessary requirements for a unified approach. Thus, there is no single approach that can be followed to achieve all the integration objectives; a combination of different approaches is the most suitable strategy to realize city-wide services integration. In figure 7, a possible system landscape concept is proposed. The model recommends using the Point-to-Point (P2P) approach within each smart city area (e.g. Smart Energy, Smart Living, etc.). This ensures a high degree of crosslinking among the services; therefore, better process control is possible. Here, the disadvantages of the P2P approach are negligible due to the relatively small number of services related to one smart city area. The data of each unit must flow into the hub. Hence, the different areas can be decoupled, and unnecessary dependencies can be avoided. This is followed by an integration layer using an ESB concept. At this level, the transformation, planning, and orchestration of the data of the services take place. Moreover, security and policies are managed. The presentation layer is the last level. Here, users can access and interact with the system through a browser, a mobile device, or other means.

Fig. 7. A Smart City landscape Model
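The layering just described can be written down as a small declarative sketch. This is a hypothetical, simplified picture only; the area names, service names and policy names below are invented for illustration and do not come from the paper or from figure 7.

```python
# Hypothetical description of the proposed landscape: P2P meshes inside each
# smart city area, one hub per area, an ESB-based integration layer, and a
# presentation layer on top.
landscape = {
    "areas": {                           # fully meshed (P2P) inside each area
        "Smart Energy": ["metering", "grid-control", "billing"],
        "Smart Living": ["lighting", "waste", "parking"],
    },
    "area_hubs": {                       # each area's data flows into its hub
        "Smart Energy": "energy-hub",
        "Smart Living": "living-hub",
    },
    "integration_layer": {               # ESB: transformation, orchestration, security
        "bus": "city-esb",
        "policies": ["authentication", "data-retention"],
    },
    "presentation_layer": ["browser", "mobile-app"],
}

for area, services in landscape["areas"].items():
    links = len(services) * (len(services) - 1) // 2
    print(f"{area}: {links} internal P2P links, hub = {landscape['area_hubs'][area]}")
```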
VII. CONCLUSION
In this research, different smart city frameworks as well as enterprise integration approaches are discussed. In order to support the orchestration of smart city services, an integration landscape model has been proposed. The proposed model serves as an example to show that a combination of different approaches is the best solution to solve the integration problem of smart city services. Evaluating the model is one of the main future steps.

REFERENCES
[1] The UN Department of Economic and Social Affairs. May 2018. URL: https://www.un.org/development/desa/en/news/population/2018-revision-of-world-urbanization-prospects.html
[2] Portmann, E., and Finger, M. 2015. "Smart Cities – Ein Überblick!," HMD Praxis der Wirtschaftsinformatik (52:4), pp. 470–481.
[3] Jaekel, M. 2015. Smart City wird Realität, Wiesbaden: Springer Fachmedien Wiesbaden.
[4] Etezadzadeh, C. 2015. Smart City – Stadt der Zukunft?: Die Smart City 2.0 als lebenswerte Stadt und Zukunftsmarkt, Wiesbaden: Springer Vieweg.
[5] Alawadhi, S., Aldama-Nalda, A., Chourabi, H., Gil-Garcia, J. R., Leung, S., Mellouli, S., Nam, T., Pardo, T. A., Scholl, H. J., and Walker, S. 2012. "Building Understanding of Smart City Initiatives," in Electronic Government, H. J. Scholl, M. Janssen, M. A. Wimmer, C. E. Moe and L. S. Flak (eds.), Berlin, Heidelberg: Springer Berlin Heidelberg, pp. 40–53.
[6] Scholl, H. J., Janssen, M., Wimmer, M. A., Moe, C. E., and Flak, L. S. (eds.) 2012. Electronic Government, Berlin, Heidelberg: Springer Berlin Heidelberg.
[7] Pero, M., Kühne, S., and Fähnrich, K.-P. 2014. "Integration – eine Dienstleistung mit Zukunft," in Enterprise-Integration, G. Schuh and V. Stich (eds.), Berlin, Heidelberg: Springer Berlin Heidelberg, pp. 125–137.
[8] Ferreira, D. R. 2013. Enterprise Systems Integration, Berlin, Heidelberg: Springer Berlin Heidelberg.
[9] Schuh, G., and Stich, V. (eds.) 2014. Enterprise-Integration, Berlin, Heidelberg: Springer Berlin Heidelberg.
[10] Javidroozi, V., Shah, H., Cole, A., and Amini, A. 2014. "Smart City as an Integrated Enterprise: A Business Process Centric Framework Addressing Challenges in Systems Integration," Paris, July 20-24, 2014, IARIA.
[11] Kummer, S., Grün, O., and Jammernegg, W. 2009. Grundzüge der Beschaffung, Produktion und Logistik (Lehr- und Übungsbuch), München: Addison Wesley in Pearson Education Deutschland.
[12] Townsend, A. M. 2014. Smart Cities: Big Data, Civic Hackers, and the Quest for a New Utopia, New York, NY: Norton.
[13] Ojo, A., Curry, E., Janowski, T., and Dzhusupova, Z. 2015. "Designing Next Generation Smart City Initiatives: The SCID Framework," in Transforming City Governments for Successful Smart Cities, M. P. Rodríguez-Bolívar (ed.), Cham: Springer International Publishing, pp. 43–67.
[14] Leontief, W. 1971. "Theoretical Assumptions and Nonobserved Facts," American Economic Review, Vol. 61, No. 1, pp. 1–7, March 1971.
[15] Javidroozi, V., Shah, H., Cole, A., and Amini, A. 2015. "Towards a City's Systems Integration Model for Smart City Development: A Conceptualization," Las Vegas, 7-9 December.
[16] Kettinger, W. J., and Grover, V. 1995. "Toward a Theory of Business Process Change Management," Journal of Management Information Systems (12:1), pp. 9–30.
[17] Ben Letaifa, S. 2015. "How to strategize smart cities: Revealing the SMART model," Journal of Business Research (68:7), pp. 1414–1419.
[18] Myerson, J. M. 2002. The Complete Book of Middleware, Boca Raton, Fla: Auerbach.
[19] Ferstl, O. K., and Sinz, E. J. 2006. Grundlagen der Wirtschaftsinformatik, München: Oldenbourg.
[20] Aier, S. 2007. Integrationstechnologien als Basis einer nachhaltigen Unternehmensarchitektur: Abhängigkeiten zwischen Organisationen und Informationstechnologie, Berlin: Gito-Verlag.
[21] Ruf, W., Mucksch, H., and Biethahn, J. 2007. Ganzheitliches Informationsmanagement, Band II: Entwicklungsmanagement, München: De Gruyter Oldenbourg.
[22] Ziemen, T. 2006. Standardisierte Integration und Datenmigration in heterogenen Systemlandschaften am Beispiel von Customer-Relationship-Management.
[23] Organization for the Advancement of Structured Information Standards 2006. Reference Model for Service Oriented Architecture 1.0: OASIS.
[24] Draheim, D. 2010. "Service-Oriented Architecture," in Business Process Technology, D. Draheim (ed.), Berlin, Heidelberg: Springer Berlin Heidelberg, pp. 221–241.
[25] Fischer, S., and Werner, C. 2007. "Towards Service-Oriented Architectures," in Semantic Web Services, R. Studer, S. Grimm and A. Abecker (eds.), Berlin, Heidelberg: Springer Berlin Heidelberg, pp. 15–24.
[26] Papazoglou, M. P., and van den Heuvel, W.-J. 2007. "Service oriented architectures: Approaches, technologies and research issues," The VLDB Journal (16:3), pp. 389–415.
[27] Masak, D. 2005. Moderne Enterprise Architekturen, Berlin, Heidelberg: Springer-Verlag Berlin Heidelberg.
[28] Aier, S. (ed.) 2004. Enterprise Application Integration: Flexibilisierung komplexer Unternehmensarchitekturen, Berlin: GITO-Verlag.
[29] Chappell, D. A. 2004. Enterprise Service Bus, Sebastopol: O'Reilly Media.
[30] Bianco, P., Kotermanski, R., and Merson, P. 2007. "Evaluating a Service-Oriented Architecture," Carnegie Mellon University.
[31] Cognizant 20-20 Insights 2013. "Comparing and Contrasting SOA Variants for Enterprise Application Integration."
Compression Techniques Used in IoT: A Comparative Study

Salam Hamdan, Arafat Awajan, Sufyan Almajali
Department of Computer Science
Princess Sumaya University for Technology
Amman, Jordan
S.hamdan@psut.edu.jo, awajan@psut.edu.jo, s.almajali@psut.edu.jo

Abstract— Due to the improvement of technology, most of the devices used nowadays are connected to the internet; therefore, a huge amount of data is generated, transmitted, and used by these devices. In general, these devices are limited in resources such as memory, processors, and battery lifetime. Reducing the data size reduces the energy required to process the data, minimizes the storage of the data, and reduces the energy required to transmit it. The need for applying data compression techniques on these devices therefore comes in handy. This paper provides a survey and a comparative study among the most commonly used IoT compression techniques. The study addresses the techniques in terms of different attributes such as the compression type (lossless or lossy), the limitations of the compression technique, the location where the compression is applied, and the implementation of the compression technique.

Keywords—internet of things, wireless sensor network, data compression

I. INTRODUCTION
The Internet of Things (IoT) is a network that connects various types of devices with each other [1], including wireless sensor networks [2]. Sensors can be found almost everywhere, from the sensors implanted in the human body to the deepest point in the oceans. However, most of these devices have constrained resources. Their memory is always limited to a small RAM and flash memory [3], and they are provided with a short battery lifetime [4].

IoT devices are used in numerous types of applications, as IoT enables human-to-device and device-to-device connection in a trustworthy and reliable manner [5]. These applications include, but are not limited to, healthcare applications [6], Mobile Ad hoc Networks (MANET) [7], transportation systems, and heat and electricity management [8].

The limitations of memory and battery lifetime in IoT devices create the need to reduce the size of the data in order to minimize the CPU cycles needed to process these data and to reduce the memory space that is needed to save the data. In addition, data size reduction reduces the bandwidth required to transmit the data. Thus, the implementation of data compression techniques is very important in IoT devices. Data compression is essential for transmission, storage, and in-network processing. Reducing the network traffic is also essential in order to avoid saturation and to allow many devices to work cooperatively within the same hub [9].

There are two types of compression: lossy compression and lossless compression. In lossy compression, the original data cannot be retrieved from the compressed file; the file size is reduced permanently by eliminating the redundant data. On the other hand, in lossless compression, all original data is completely recovered after uncompressing the file [10]. In IoT restricted devices, lossy compression algorithms achieve better compression efficiency than lossless compression algorithms, by taking advantage of the existence of redundant data, because there is no need to recover the redundant data [11].
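As a small, generic illustration of the lossless case (not tied to any of the surveyed techniques), a standard dictionary coder such as zlib recovers the original bytes exactly, while a lossy scheme would typically quantize or drop samples before encoding:

```python
import zlib

readings = b"23.1,23.1,23.1,23.2,23.2,23.1,23.1,23.1"   # highly redundant sensor data
compressed = zlib.compress(readings)
restored = zlib.decompress(compressed)

print(len(readings), "->", len(compressed), "bytes")
assert restored == readings   # lossless: every original byte is recovered
```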
This paper briefly discusses the most common compression techniques used in IoT and provides a comparative study between these techniques with respect to the compression type, the amount of energy or space the technique saves, in what solutions these techniques are implemented, whether the compression happens on the node side or the server side, and the type of IoT application. Also, the comparison covers whether the experiments were simulated, emulated or run on testbeds.

This paper is organized as follows: the second section discusses the previous works on data compression in IoT networks, the third section differentiates between these compression techniques and, finally, section four concludes the paper.

II. LITERATURE REVIEW
Several IoT applications have employed various compression methods. In [12], Pielli et al. proposed an optimized MAC layer protocol that combines energy efficiency and data compression for IoT devices. They consider a network of N users that aims to use the up-link channel, which is a link from the IoT device to a Base Station, using the Time Division Multiple Access (TDMA) scheme, which allows sharing the same frame with the same frequency among several users by dividing the frame into different time slots [13]. For each frame, energy is consumed for the following reasons: 1) data processing, 2) data transmission and 3) data sensing and circuitry costs. In their MAC protocol, they aimed to extend the network lifetime and to fulfill the Quality of Service (QoS) requirements. The nodes generate data from the environment. To compress the input signal, a number of CPU cycles per bit are needed; thus, the energy that is consumed by the node's CPU depends on the node's processor. In their protocol, they defined the optimal energy allocation over time that balances the lifetime of the network and the average of the maximum distortion; this problem is called the Energy Allocation Problem (EAP), which is a convex optimization problem that depends on observation by using an alternative optimization procedure. EAP determines the amount and the optimal allocation of the energy consumption for each frame. After defining the optimal energy allocation, they determined, for a specific frame, the powers and the durations of the transmissions by finding the consumed energy for each time slot. The Frame-Oriented Problem (FOP) handles a single frame and defines the transmission powers and durations. They also came up with an optimal policy to minimize the average of the data distortion degree, which is a function of the compression ratio, by adopting an information-theoretic approach.

Deepu et al. [14] proposed a hybrid scheme of lossy and lossless compression that consists of lossy compression, lossy decompression and an entropy encoder. They applied their technique to the cardiovascular diseases IoT application, especially with wearable electrocardiogram (ECG) sensors. The data generated from the ECG sensors is compressed with a lossy compression with a high compression ratio (CR). The output of the lossy compression produces an initial estimation of the QRS peak location, heart rate variability (HRV), etc. They also consider the case when an overall analysis is required for a signal; therefore, in their hybrid scheme, the original ECG is reconstructed using the lossy decompression and then the difference between the reconstructed signal and the original signal is estimated; they call it the residual error, and it has a very low dynamic range. Thereafter, the bit rate of the residual error is minimized by passing it to an entropy encoder. The original signal can be represented in lossless form by using the lossy compressed signal together with the encoded residual. This hybrid scheme has several advantages. First, it enables a hybrid transmission mode, which minimizes the power consumption, since only the compressed data is transmitted. Also, the transmission is power-aware: most of the sensors are battery-based devices, and in case of low battery the transmission switches to lossy compression only, to reduce the amount of power consumption. In addition, the local storage usage is optimized by storing only the lossy data in memory. Another advantage is that the error tolerance is increased, by removing the redundancy between data samples that are close to each other. The results show that the power was reduced by 18% for the lossy compression, while for the lossless compression the power was reduced to 53%. This scheme is efficient for healthcare applications, where some cases will need the original data.

Fig. 1: Block diagram for the hybrid scheme
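The hybrid idea can be sketched in a few lines: a coarse lossy stage, a reconstruction, and a low-dynamic-range residual that is then handed to an entropy coder. This is an illustrative approximation only, not the authors' implementation from [14]: simple quantization stands in for the ECG-specific compressor, zlib stands in for the entropy coder, and the result here is only near-lossless because the residual itself is quantized for the sake of the toy example.

```python
import numpy as np
import zlib

signal = np.sin(np.linspace(0, 20, 500)) + 0.01 * np.random.randn(500)

# Lossy stage: coarse quantization keeps only the dominant shape of the signal.
step = 0.1
lossy = np.round(signal / step).astype(np.int16)    # compact form to transmit/store
reconstructed = lossy * step                        # lossy decompression

# The residual has a very small dynamic range, so it entropy-codes well.
residual = np.round((signal - reconstructed) / 0.001).astype(np.int16)
encoded_residual = zlib.compress(residual.tobytes())

restored = reconstructed + residual * 0.001         # lossy part + residual
print("encoded residual bytes:", len(encoded_residual))
print("max reconstruction error:", np.abs(restored - signal).max())
```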


In their approach [15], Ukil et al. aimed to increase the information gain from the compressed data by analyzing the data, extracting the robust outliers generated by the sensor, and adjusting the parameters exhaustively. Due to its ability to achieve high data retrieval after decompression, this approach is efficient for various sensor applications. In order to extract the most important features, they used statistical and information theoretic techniques. They made a hardware implementation to test the information gain after decompression.

Fig. 2. Traditional data compression

Dang et al. [16] proposed a Robust Information-Driven Architecture (RIDA) that aims to improve the compression by determining the correlation of the data within a cluster of sensors. Their approach is only suitable for fixed networks, where they can group the sensors into clusters. Also, they assumed that if any two nodes in the same cluster want to communicate with each other, the communication takes only one hop. Their architecture contains three main parts: information-driven logical mapping, a resiliency mechanism, and a compression algorithm. In the first part, the nodes within the same cluster exchange their readings among each other; thus each node learns a pattern about the whole cluster, and the nodes choose logical indices for each other based on the data content. In the second part, the resiliency part, the faulty and missing nodes are detected, isolated and classified all along the compression and decompression process. Generally speaking, the nodes first distribute their readings to the cluster; therefore, each node has a glance at the data within each area. The node coefficient contains the corresponding index as the logical index, and if the coefficient is not zero the node sends it back to the server. The data can be retrieved from the non-zero coefficients, the missing data is classified, and then the physical map is retrieved by doing the remapping process. This approach reduced the energy consumption and the bandwidth by sending only a few non-zero coefficients.

Gandhi et al. [17] proposed an algorithm called Grouping and Amplitude Scaling (GAMPS). In their algorithm, they aimed to reduce the space needed for archiving the data on the server side and also to reduce the query time for the data generated from the sensors; like RIDA, they take advantage of the correlation between the data. First, they formalized the multi-sensor compression problem. Thereafter they proposed GAMPS as a compression method for stream data generated from a large number of sensors. In their algorithm, the groups of sensor signals that can be maximally compressed together are discovered dynamically. Furthermore, each compressed data item has an index in order to make data queries easier. They also enhance the signal compression ratio by using a suitable amplitude scaling.

Ukil et al. [18] proposed a dynamic lossy compression method called SensCompr which is influenced by information theoretic and statistical techniques. In their approach, they reconstruct a huge amount of varied sensor data accurately by using the Chebyshev approximation, which is a nonlinear model. Also, it works on reducing the redundant data, as shown in figure 2, like traditional lossy compression. SensCompr extracts the important information and then adjusts the parameters. This process happens in traditional compression as well; however, in their method, they solve the fixed block size problem by introducing a dynamic block size.

Park et al. [19] proposed a machine learning based compression algorithm that uses neural network regression to vectorize the data. However, vectorizing the entire data set is inefficient using the neural network alone; therefore, they divide the entire data set according to a specific range, after which they vectorize the divided chunks and then merge them. The compression was done using the divide and conquer method, since the neural network alone is not sufficient to compress the hourly generated data. The generated data is divided into time units; thereafter, in each unit they apply neural regression. In the conquer process they apply different machine learning techniques, namely coefficient averaging, Euclidean distance, cosine similarity and re-learning, in order to represent the data easily and choose the machine learning technique with the highest accuracy. The results show that Euclidean distance has the highest accuracy among these techniques.

In their work [20], Hsu et al. aim to adjust the storage and precision cost for sensors that generate video stream data. In order to make the size of the video smaller than the original, the authors proposed to omit the redundant video frames (frames that have only slight differences among each other) and store only the differentiable frames. In order to find the differences among the frames, the authors adopted the Structural Similarity Index Measure (SSIM). By doing this, the size of the video is brought down to 60%.
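A compact way to mimic that frame-dropping idea is sketched below. It assumes scikit-image is available, the 0.95 similarity threshold is arbitrary, and none of it is taken from [20]; it only illustrates the SSIM-based filtering of near-duplicate frames.

```python
import numpy as np
from skimage.metrics import structural_similarity as ssim

def keep_differentiable_frames(frames, threshold=0.95):
    """Keep a frame only if it differs enough (low SSIM) from the last kept frame."""
    kept = [frames[0]]
    for frame in frames[1:]:
        if ssim(kept[-1], frame, data_range=1.0) < threshold:
            kept.append(frame)            # sufficiently different -> store it
    return kept

# Tiny synthetic example: ten nearly identical frames plus one changed frame.
base = np.random.rand(64, 64)
frames = [base + 0.001 * np.random.rand(64, 64) for _ in range(10)]
frames.append(np.roll(base, 16, axis=1))   # a genuinely different frame
print(len(keep_differentiable_frames(frames)), "of", len(frames), "frames kept")
```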
In [9], Stojkoska et al. proposed a lightweight delta compression algorithm by developing a new coding scheme. This scheme can be used with temporally correlated data. In order to compress data, they collected raw temperature data from MICAz Crossbow nodes using the MOTE_VIEW application [21]. The change in temperature is slow and the correlation is temporal; thus the next temperature reading depends on the previous one. Therefore, the delta values, i.e. the differences between each temperature value and the previous one, are dependent. They then apply a statistical approach to the delta values; the result is a probability distribution of the deltas, which turns out to be Gaussian. Thereafter, they found the variance of this distribution, which leads to the most probable delta values, namely -1, 0 and 1. Taking advantage of this result, they propose a statistical encoding of the possible delta values, in which the most probable values get fewer bits, consequently reducing the number of bits required to encode the deltas.
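The coding idea can be illustrated with a toy encoder (an illustrative sketch only; the actual code table of [9] is not reproduced here): the most probable deltas (-1, 0, 1) get the shortest codewords, and rarer deltas fall back to a longer escape code.

```python
def delta_encode(samples):
    """Short prefix-free codes for the most probable deltas (-1, 0, +1)."""
    codebook = {0: "0", 1: "10", -1: "110"}         # frequent deltas -> short codes
    bits, previous = [], samples[0]
    for value in samples[1:]:
        delta = value - previous
        # rare deltas: escape prefix followed by an 8-bit two's-complement value
        bits.append(codebook.get(delta, "111" + format(delta & 0xFF, "08b")))
        previous = value
    return "".join(bits)

temperatures = [21, 21, 22, 22, 22, 21, 21, 24, 24, 23]
encoded = delta_encode(temperatures)
print(encoded, f"({len(encoded)} bits instead of {8 * (len(temperatures) - 1)})")
```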
In this paper, the authors summarized the common
In [22], [23], and [24], they allow the extraction of IoT compression data techniques in IoT and made a comparative
context data from IoT devices and store this data in a study among them in order of the compassion type, in what the
customized servers and providers. Providers provide the scheme will enhance the network, what techniques are used
collected IoT data, along with specialized services that are within the compression technique, the location of the
custom to the applications need. One of these services is to compression, node side or server side, the applications
apply compression at the cloud level for IoT data before it is deployed these techniques. As a result, most of compression
delivered to the IoT enabled applications. techniques were lossy techniques and aimed to reduce the
energy consumption.

III. COMPARISON STUDY REFERENCES


[1] Atzori, L., Iera, A., and Morabito, G.: ‘The internet of things: A survey’,
In this section, a comparison is done between the schemes Computer networks, 2010, 54, (15), pp. 2787-2805
that are mentioned in the literature based on the following
[2] Gubbi, J., Buyya, R., Marusic, S., and Palaniswami, M.: ‘Internet of
attributes: Things (IoT): A vision, architectural elements, and future directions’,
Future generation computer systems, 2013, 29, (7), pp. 1645-1660
[3] Hossain, M.M., Fotouhi, M., and Hasan, R.: ‘Towards an analysis of
security issues, challenges, and open problems in the internet of things’,

210
in Editor (Ed.)^(Eds.): ‘Book Towards an analysis of security issues, transactions on biomedical circuits and systems, 2017, 11, (2), pp. 245-
challenges, and open problems in the internet of things’ (IEEE, 2015, 254
edn.), pp. 21-28 [15] Ukil, A., Bandyopadhyay, S., Sinha, A., and Pal, A.: ‘Adaptive Sensor
[4] Tripathi, P.: ‘Vision, Opportunities and Challenges in Internet of Things Data Compression in IoT systems: Sensor data analytics based
(IoT)’, 2017 approach’, in Editor (Ed.)^(Eds.): ‘Book Adaptive Sensor Data
[5] Lee, I., and Lee, K.: ‘The Internet of Things (IoT): Applications, Compression in IoT systems: Sensor data analytics based approach’
investments, and challenges for enterprises’, Business Horizons, 2015, (IEEE, 2015, edn.), pp. 5515-5519
58, (4), pp. 431-440 [16] Dang, T., Bulusu, N., and Feng, W.-c.: ‘Rida: A robust information-
[6] Catarinucci, L., De Donno, D., Mainetti, L., Palano, L., Patrono, L., driven data compression architecture for irregular wireless sensor
Stefanizzi, M.L., and Tarricone, L.: ‘An IoT-aware architecture for smart networks’, in Editor (Ed.)^(Eds.): ‘Book Rida: A robust information-
healthcare systems’, IEEE Internet of Things Journal, 2015, 2, (6), pp. driven data compression architecture for irregular wireless sensor
515-526 networks’ (Springer, 2007, edn.), pp. 133-149
[7] Bellavista, P., Cardone, G., Corradi, A., and Foschini, L.: ‘Convergence [17] Gandhi, S., Nath, S., Suri, S., and Liu, J.: ‘Gamps: Compressing multi
of MANET and WSN in IoT urban scenarios’, IEEE Sensors Journal, sensor data by grouping and amplitude scaling’, in Editor (Ed.)^(Eds.):
2013, 13, (10), pp. 3558-3567 ‘Book Gamps: Compressing multi sensor data by grouping and
amplitude scaling’ (ACM, 2009, edn.), pp. 771-784
[8] Kyriazis, D., Varvarigou, T., White, D., Rossi, A., and Cooper, J.:
‘Sustainable smart city IoT applications: Heat and electricity [18] Ukil, A., Bandyopadhyay, S., and Pal, A.: ‘IoT data compression:
management & Eco-conscious cruise control for public transportation’, Sensor-agnostic approach’, in Editor (Ed.)^(Eds.): ‘Book IoT data
in Editor (Ed.)^(Eds.): ‘Book Sustainable smart city IoT applications: compression: Sensor-agnostic approach’ (IEEE, 2015, edn.), pp. 303-312
Heat and electricity management & Eco-conscious cruise control for [19] Park, J., Park, H., and Choi, Y.-J.: ‘Data compression and prediction
public transportation’ (IEEE, 2013, edn.), pp. 1-5 using machine learning for industrial IoT’, in Editor (Ed.)^(Eds.): ‘Book
[9] Stojkoska, B.R., and Nikolovski, Z.: ‘Data compression for energy Data compression and prediction using machine learning for industrial
efficient IoT solutions’, in Editor (Ed.)^(Eds.): ‘Book Data compression IoT’ (IEEE, 2018, edn.), pp. 818-820
for energy efficient IoT solutions’ (2017, edn.), pp. 1-4 [20] Hsu, C.-C., Fang, Y.-T., and Yu, F.: ‘Content-Sensitive Data
[10] Nelson, M., and Gailly, J.-L.: ‘The data compression book’ (M & t Compression for IoT Streaming Services’, in Editor (Ed.)^(Eds.): ‘Book
Books New York, 1996. 1996) Content-Sensitive Data Compression for IoT Streaming Services’ (IEEE,
2017, edn.), pp. 147-150
[11] Bose, T., Bandyopadhyay, S., Kumar, S., Bhattacharyya, A., and Pal, A.:
‘Signal Characteristics on Sensor Data Compression in IoT-An [21] Datasheet, M.: ‘Crossbow technology inc’, San Jose, California, 2006,
Investigation’, in Editor (Ed.)^(Eds.): ‘Book Signal Characteristics on 50
Sensor Data Compression in IoT-An Investigation’ (IEEE, 2016, edn.), [22] Almajali, S., Abou-Tair, D. 'Cloud based intelligent extensible shared
pp. 1-6 context services',in the proceeding of 2017 Second International
[12] Pielli, C., Biason, A., Zanella, A., and Zorzi, M.: ‘Joint optimization of Conference on Fog and Mobile Edge Computing (FMEC). pp:133-138
energy efficiency and data compression in TDMA-based medium access [23] Almajali, S ; Bany Salameh, H. ; Ayyash, M. Elgala, H. 'A framework
control for the IoT’, in Editor (Ed.)^(Eds.): ‘Book Joint optimization of for efficient and secured mobility of IoT devices in mobile edge
energy efficiency and data compression in TDMA-based medium access computing', in the Proceeding of 2018 Third International Conference on
control for the IoT’ (IEEE, 2016, edn.), pp. 1-6 Fog and Mobile Edge Computing (FMEC), pp: 58 - 62
[13] Jung, P.: ‘Time Division Multiple Access (TDMA)’, Wiley [24] Almajali, S., Abou-Tair, D., Bany Salameh, H. ; Ayyash, M. Elgala, H.'
Encyclopedia of Telecommunications, 2003 A distributed multi-layer MEC-cloud architecture for processing large
[14] Deepu, C.J., Heng, C.-H., and Lian, Y.: ‘A hybrid data compression scale IoT-based multimedia applications', Multimedia Tools and
scheme for power reduction in wireless sensors for IoT’, IEEE Applications. September 2019, Volume 78, Issue 17, pp 24617–24638.

TABLE I: COMPARISON TABLE

Compression technique | Lossless | Lossy | Energy | Bandwidth | Information gain | Storage | Time | Techniques used | Server side or node side | Application | Implementation
Pielli et al. [12] | | ✔ | ✔ | ✔ | | | | TDMA-based scheme | Node side | Environment | Numerical evaluation
Deepu et al. [14] | ✔ | ✔ | ✔ | | | ✔ | | Entropy coding | Node side | Healthcare | Hardware implementation
Ukil et al. [15] | | ✔ | | | ✔ | | | Statistical and information theoretic techniques | Node side | Various applications | Hardware implementation
Dang et al. [16] | | ✔ | ✔ | ✔ | | | | Data correlation among groups of sensors, clustering | Node side | Temperature | Hardware implementation
Gandhi et al. [17] | | ✔ | | | | ✔ | ✔ | Data correlation among groups of sensors, clustering | Server side | Temperature and humidity | Hardware implementation
Ukil et al. [18] | | ✔ | ✔ | ✔ | ✔ | | | Statistical and information theoretic techniques | Node side | Healthcare | Hardware implementation
Park et al. [19] | | ✔ | | | | ✔ | | Machine learning, regression and divide and conquer, coefficient averaging, Euclidean distance, cosine similarity and re-learning | Server side | Industrial applications | Simulation
Hsu et al. [20] | | ✔ | | | | ✔ | | Structural similarity index measure | Server side | Video streaming applications | Hardware implementation
Stojkoska et al. [9] | ✔ | | ✔ | | | | | Statistical approach | Node side | Temperature | Simulation
Using Part of Speech Tagging for Improving Word2vec Model

Dima Suleiman
Computer Science Department, King Hussein Faculty of Computing Sciences, Princess Sumaya University for Technology
Teacher at the University of Jordan
Amman, Jordan
d.suleiman@psut.edu.jo

Arafat A. Awajan
Computer Science Department, King Hussein Faculty of Computing Sciences, Princess Sumaya University for Technology
Amman, Jordan
awajan@psut.edu.jo

Abstract— Word2vec is an efficient word embedding model that converts words to vectors by considering the syntax and semantic relationships between words. In this paper, an extension of the two approaches of the word2vec model is proposed. The proposed model considers the part of speech tagging of the words when exploring the probability of predicting the output word given the input. Considering part of speech tagging provides deeper semantic meaning for words when training the model. Thus, the quality of the generated vectors becomes higher and more representative. In addition, the proposed model equips the user with the ability to query about the words and their part of speech tagging. In this case, words that have the same surface form but different meanings will have different vector representations. This model can be used in all languages, including English and Arabic. The focus of this paper is on the Arabic language. The experiments are performed on the OSAC datasets, which consist of 22,429 documents. Several pre-processing stages, including the use of the Farasa stemmer, take place. Moreover, the part of speech tagging of the words is determined using the Farasa toolkit.

Keywords—Word2vec; CBOW; Skip-Gram; Part-of-Speech tagging; Cosine Similarity; Arabic Natural Language Processing; Semantic Similarity.

I. INTRODUCTION
Using vectors to represent words is very crucial for several natural language processing (NLP) applications [1]. The word2vec word embedding model is one of the most recently used word embedding models [1], [2]. The learning process must be conducted on a large amount of unstructured data in order to improve the quality of the generated vectors. Even though word2vec is a neural network architecture, it does not require complex operations, especially because the activation function is linear. The word2vec model consists of two approaches called the continuous Bag-of-Words approach (CBOW) and the continuous Skip-Gram approach (Skip-Gram).

In this paper, an extension of the original word2vec model is proposed. The idea is to consider the part-of-speech tagging (POST) when calculating the probability of predicting a word given the context words, in the case of CBOW. On the other hand, in the case of Skip-Gram, the probability of predicting the context words given the input word is maximized after considering the part of speech tagging of the words. The proposed model generates different vector representations for the same words that have the same shape or surface form but different meanings. For example, in English the word "can" may have two meanings, (يستطيع) and (علبة). Also, in Arabic we have the word "ذهب" which means (gold) and the word "ذهب" which means (went). The two words have the same shape but different meanings. In the original word2vec model, both words are considered the same, which is not accurate since the two words have different meanings. In the proposed model, the part of speech tagging is taken into account and becomes part of the words. In this case, the two words are considered different. The proposed model can be applied to all languages. However, the target language of this paper is the Arabic language. The OSAC datasets are used in training the model. Moreover, several preprocessing stages are carried out, including using Farasa for segmentation and stemming [3]. Moreover, Farasa is used to determine the part of speech tagging for all words [3].

This paper is organized as follows: section II discusses the related works and backgrounds. Arabic language features are explained in section III. The proposed model is covered in section IV. Section V introduces the experimental results and finally the conclusion is presented in section VI.

II. RELATED WORKS AND BACKGROUNDS
2.1 Word Embedding
Word embedding is the representation of words as vectors [2], [4]. Dealing with vectors is more useful than dealing with the word itself, especially in NLP applications. To be useful, word embedding must consider both syntax and semantic features when representing words [2]. In this case, the context similarity between words can be calculated using the Euclidean distance and the cosine similarity.
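For reference, the cosine similarity between two N-dimensional word vectors u and v is the standard definition (added here for completeness, using the paper's notation for the dimension size N):

$\cos(u, v) = \dfrac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert} = \dfrac{\sum_{i=1}^{N} u_i v_i}{\sqrt{\sum_{i=1}^{N} u_i^{2}} \; \sqrt{\sum_{i=1}^{N} v_i^{2}}}$

Two words whose vectors point in similar directions obtain a similarity close to 1.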
One-hot representation is one of the word embedding models [5]. In one-hot representation, the dimension of each vector is equal to the vocabulary size, where all the entries are "zeros" except one entry whose value is set to "one". The index of the entry that has the "one" value is equal to the position of the word in the vocabulary, where the words in the vocabulary are sorted based on the frequency of the words in the corpus. Assume that the vocabulary size is 10,000; then the representation of the fourth word in the vocabulary consists of "zeros" in all entries except the fourth entry, whose value is set to "one".
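The example above can be written down directly (a small illustrative snippet; the vocabulary size is the same toy value used in the text):

```python
import numpy as np

vocab_size = 10_000
position = 3                      # the fourth word in the frequency-sorted vocabulary

one_hot = np.zeros(vocab_size)
one_hot[position] = 1.0           # a single "one", all other entries stay "zero"

print(one_hot.sum(), one_hot[position], one_hot.shape)   # 1.0 1.0 (10000,)
```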
There are two problems associated with the one-hot representation model. The first problem is the curse of dimensionality, which is related to the large dimension size of the vectors. The second problem is the absence of syntax and semantic relationships between the vectors, since most of the entries are "zeros". Both problems can be addressed using a model proposed by Mikolov in 2013 [1], [2], which is called the word2vec model.

There are two approaches of the word2vec model, which are called the CBOW and Skip-Gram models. More details about the word2vec model and its two approaches are covered in the following subsections.

2.2 Word2vec Model
Word2vec is a word embedding model which is composed of a neural network with one hidden layer. The hidden layer has no activation function. Furthermore, the number of neurons in the hidden layer is equal to the dimension size of the generated word vector. Also, word2vec consists of one input layer, one output layer and two weighting matrices, one between the input and hidden layers and one between the hidden and output layers. Moreover, the softmax function is used as the objective function at the output layer. The quality of the generated vectors is highly affected by the quality and the size of the corpus that is used for training the model. In order to get significant results, the quality of the corpus must be high and its size must be large. The two approaches of word2vec consist of the same neural network architecture and hyper-parameters. The vocabulary size, the context window and the dimension size are examples of the hyper-parameters. The vocabulary size is the number of vocabulary entries to be represented, which are the most frequent words in the corpus. The context window is the window that surrounds the input word in the CBOW model and surrounds the target word in the case of the Skip-Gram model. Moreover, the dimension size is the size of the newly generated word vector.

There are several NLP applications that can deal easily with the vectors and provide significant results, such as text summarization and sentiment analysis [6]–[13]. Both the CBOW and Skip-Gram models are explained in the following subsections.

2.2.1 Continuous Bag-of-Words Approach (CBOW)
The architecture of the CBOW and Skip-Gram models can be seen in Fig. 1. In CBOW, the input consists of several adjacent words called context words, while the output or target word is the middle word. The context window size is represented by c, which is used to determine the number of context input words that surround the output word. The log-linear probability of finding the target word given the context words must be maximized in order to improve the quality of the generated vectors, as shown in Eq. (1) [6]. $w^{(t)}$ is used to represent the target or output word, the symbols $w^{(t-c)}, \ldots, w^{(t-2)}, w^{(t-1)}, w^{(t+1)}, w^{(t+2)}, \ldots, w^{(t+c)}$ represent the context or input words, and |V| is used to denote the vocabulary size.

$\frac{1}{|V|}\sum_{t=1}^{|V|} \log p\left(w^{(t)} \mid w^{(t-c)},\ldots,w^{(t-1)},w^{(t+1)},\ldots,w^{(t+c)}\right)$    (1)

2.2.2 Continuous Skip-Gram Approach (Skip-Gram)
In Skip-Gram, the input is only one word, while the output is the context or surrounding words of the input. $w^{(t)}$ is used to denote the input word, while the symbols $w^{(t-c)}, \ldots, w^{(t-2)}, w^{(t-1)}, w^{(t+1)}, w^{(t+2)}, \ldots, w^{(t+c)}$ represent the surrounding words. The log probability of predicting the context words given the input word must be maximized in order to get high quality vectors, as shown in Eq. (2) [6].

$\frac{1}{|V|}\sum_{t=1}^{|V|} \sum_{-c \le j \le c,\; j \ne 0} \log p\left(w^{(t+j)} \mid w^{(t)}\right)$    (2)

In both approaches, the input vectors are represented using the one-hot representation. The generated vector size is equal to the dimension size, which is the number of neurons in the hidden layer. The generated vectors can easily be used to explore the relationship between words by subtracting and adding them. For example, if the vector of "العراق" (Iraq) is subtracted from the vector of "بغداد" (Baghdad) and the vector of "مصر" (Egypt) is added to the result, the resulting vector is very close to the vector of "القاهرة" (Cairo) [14].
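Assuming a trained model is available, this kind of analogy query can be expressed with recent versions of the gensim library. The snippet is only a usage sketch: the two-sentence toy corpus and the parameter values are invented to keep it self-contained and runnable, and they are not the corpus or settings used later in this paper.

```python
from gensim.models import Word2Vec

# A real experiment would use a large pre-processed corpus; this tiny toy
# corpus only makes the snippet runnable.
sentences = [["بغداد", "عاصمة", "العراق"], ["القاهرة", "عاصمة", "مصر"]]
model = Word2Vec(sentences, vector_size=50, window=2, sg=1, min_count=1, epochs=50)

# Baghdad - Iraq + Egypt should land near Cairo when trained on enough text.
print(model.wv.most_similar(positive=["بغداد", "مصر"], negative=["العراق"], topn=1))
```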
Fig. 1. CBOW and Skip-Gram models architecture [2]

2.3 Word2vec Based Models
The Skip-Gram based word embedding model was improved in 2014 by Levy and Goldberg [15]. Levy et al. used the dependency relation in order to determine the relationship between words. Thus, the context window contains arbitrary words instead of adjacent words; the selection of the arbitrary words is based on whether there exists a relation with the input word or not. Using the dependency relation improved the quality of the generated vectors by considering more semantic features [16].

The syntactic relationship between words was considered in both approaches of word2vec in 2015 by Ling et al. [17]. In their research, the position of the words around the input word is considered by using several weighting matrices between the hidden and output layers instead of one shared matrix as in Skip-Gram. The number of weighting matrices is equal to the number of context words. On the other hand, in the case of CBOW, the embeddings of the context input words are concatenated in the same order of occurrence. After that, the result of the concatenation is passed to the output predictor [17].

In 2016, in Skip-Gram, the distance between the context words and the input word was taken into account [18]. Komninos and his colleagues extended the Skip-Gram model by using the dependency graph to determine the distance of the relation between the words and the input word [18]. In addition to considering the dependency relation, the adjacent context words are also taken into account. The syntax of word embedding is very crucial and has received more attention recently [19]–[21].

III. ARABIC LANGUAGE FEATURES
Arabic is considered an official language in several regions of the world [22]. Even so, the number of research papers that are concerned with the Arabic language is limited due to the shortage of Arabic resources [23]. In order to improve the quality of the results of Arabic research, some Arabic features such as part-of-speech tagging (POST) and dependency parsing must be taken into account. The morphological nature of the Arabic language makes the process of dealing with it harder and requires more effort [22]. As a result, other Arabic NLP tasks such as normalization and segmentation become harder and must be considered.

Normalization
There are two types of vowels in the Arabic language, short and long. The diacritical marks are used to represent the short vowels, such as (بَ, بِ, بُ). On the other hand, letters are used to represent long vowels. One of the challenges of the Arabic language is related to having several marks such as hamza "ء", dot ".", or madda "~" on the same letter. For example, "ا" may be written as "أ", "آ" and "إ". In this case, normalization is used to consider all these shapes to be the same. For example, the words "ايام" and "أيام", which mean (days), must be considered the same, since both of them have the same meaning.
meaning.
Segmentation A) Continuous Bag-of-Word Model
Another important NLP task is the segmentation process.
Segmentation faces several problems such as keeping letters To simplify the explanations, we will simplify the CBOW
that must be removed and segmenting words that must not be to consist only one input (context) word and one predicted
segmented. For example, “‫( ”ال‬The) must be removed, such output word. In this case, the equation that is applied between
that the words “‫( ”بلد‬country) and “‫( ”البلد‬The country) must be the input and the hidden layer is as follows [24]:
considered the same. However, in some words “‫ ”ال‬is part of
the name and must not be removed such as “‫( ”ألغاز‬mysteries).
h = W x = ( ,.) = v ……. (3)
Therefore, Farasa is used since Farasa segmenter has high
quality of segmenting the words and removing the parts that
are not part of them, which can be determined based on the After that, score for each word uj at the output layer is
context [3]. computed using Eq. (4) [24].

u = v’ h ……. (4)

215
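One straightforward way to realize such a vocabulary in practice (a sketch based on the description above; the tagger output format shown here is invented and is not Farasa's actual API) is to append the tag to every token before the corpus is fed to the embedding trainer, so that the noun reading and the verb reading of "كتب" become distinct vocabulary entries:

```python
def tag_tokens(tagged_sentence):
    """Merge each (word, POS) pair into a single vocabulary token."""
    return [f"{word}_{pos}" for word, pos in tagged_sentence]

# Hypothetical tagger output for two sentences containing the surface form "كتب".
sentence1 = [("قرأ", "VERB"), ("أحمد", "NOUN"), ("كتب", "NOUN")]     # "books"
sentence2 = [("كتب", "VERB"), ("أحمد", "NOUN"), ("رسالة", "NOUN")]   # "wrote"

corpus = [tag_tokens(sentence1), tag_tokens(sentence2)]
print(corpus[0][2], corpus[1][0])   # كتب_NOUN and كتب_VERB are two different entries
```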
The proposed model includes an extension of the CBOW and Skip-Gram approaches of word2vec. More details are covered in the following subsections.

A) Continuous Bag-of-Words Model
To simplify the explanations, we will simplify CBOW to consist of only one input (context) word and one predicted output word. In this case, the equation that is applied between the input and the hidden layer is as follows [24]:

$h = W^{T}x = W_{(k,\cdot)}^{T} = v_{w_I}^{T}$    (3)

After that, the score $u_j$ for each word at the output layer is computed using Eq. (4) [24].

$u_j = {v'}_{w_j}^{T} h$    (4)

Where ${v'}_{w_j}$ represents the vector of the jth column in the second weighting matrix W'. It can be clearly seen that the activation function is a linear function. Moreover, the posterior distribution of the words must be determined. This distribution of the words is multinomial, and it can be obtained using a log-linear classification model such as softmax. In this case, the output $y_j$ of a certain unit j in the output layer can be computed as shown in Eq. (5) [24].

$y_j = p(w_j \mid w_I) = \dfrac{\exp(u_j)}{\sum_{j'=1}^{V}\exp(u_{j'})}$    (5)

By substituting Eq. (3) and Eq. (4) in Eq. (5), we can get Eq. (6) [24].

$y_j = p(w_j \mid w_I) = \dfrac{\exp\left({v'}_{w_j}^{T} v_{w_I}\right)}{\sum_{j'=1}^{V}\exp\left({v'}_{w_{j'}}^{T} v_{w_I}\right)}$    (6)

In this equation, both $v_w$ and ${v'}_w$ represent vector representations of the word w. $v_w$ is the input vector, which is a certain row in the weighting matrix W between the input and hidden layers, while ${v'}_w$ is the output vector, which is a certain column of the weighting matrix W' between the hidden and output layers.

On the other hand, in the case where the input (context) consists of more than one word, the output of the hidden layer is equal to the average of the input vectors of the context words multiplied by the input-hidden weighting matrix W, which is computed using Eq. (7) [24].

$h = \frac{1}{c} W^{T}\left(x_1 + x_2 + \cdots + x_c\right)$    (7)

Where $x_1, x_2, \ldots, x_c$ are the vectors of the first, second, ..., and c-th words in the context respectively, and c is the number of context words. After substituting $x_1, x_2, \ldots, x_c$ with their input vector representations $v_{w_1}, v_{w_2}, \ldots, v_{w_c}$ in Eq. (7), we get Eq. (8) [24].

$h = \frac{1}{c}\left(v_{w_1} + v_{w_2} + \cdots + v_{w_c}\right)$    (8)
h= ( + +⋯+ ) ……. (8) In Skip-Gram, instead of having one output score uj as in


CBOW we have c uj which is represented by uc,j , since instead
In the proposed model, the Eq.(9) is used to compute the of having one output, we have c-panel output. Where c is the
number of context words in the output layer. Therefore, there
output of the hidden layer, where , is the transpose of
are c multinomial distributions in the output instead of one. Yc,j
a vector that represents the word wi with certain part of speech
is the output of the jth unit in the c-panel which can be
tagging posti.
computed using the softmax as shown in Eq.(15) [24] for each
output in the context.
h = W x = W( ,.) = v , ……. (9)
,
y , =p w , =w , w = ……. (15)
After that, the score of the word and its part of speech
tagging u , is computed at the output layer as shown in
Eq. (10). Where v’ , represents the jth column in the
Where wc,j is the jth word in the c-panel and wo,c is the c
matrix W’ which represent the output vector of certain word context words in the output context words o. Moreover, wi is
and its part of speech tagging.

216
the input word. Note that the same hidden-output weighting matrix W' is used for all the context output words, thus:

u_{c,j} = u_j = v'^T_{w_j} h        (16)

Where v'_{w_j} is the j-th column in the hidden-output weighting matrix W'. After substituting Eq. (14) and Eq. (16) in Eq. (15), we get Eq. (17):

y_{c,j} = p(w_{c,j} = w_{O,c} | w_I) = exp(v'^T_{w_j} v_{w_I}) / Σ_{j'=1}^{V} exp(v'^T_{w_{j'}} v_{w_I})        (17)

The output of the hidden layer in the proposed model is computed using Eq. (18), where v^T_{w_i, post_i} is the transpose of the input vector of the word w_i and its part of speech tag post_i:

h = W^T x = W^T_{(k,·)} = v^T_{w_i, post_i}        (18)

On the other hand, the output score of the word and its part of speech tag at the output layer, u_{c, post_c, j, post_j}, for the j-th word in the c-th panel is computed using Eq. (19), where v'^T_{w_j, post_j} is the transpose of the output vector of the word w_j and its part of speech tag post_j:

u_{c, post_c, j, post_j} = u_{j, post_j} = v'^T_{w_j, post_j} h        (19)

In Skip-Gram, as we mentioned before, there are c multinomial distributions in the output instead of one. y_{c, post_c, j, post_j} is the output of the j-th unit in the c-th panel for a certain word and its part of speech tag, which can be computed using the softmax as shown in Eq. (20) for each output in the context:

y_{c, post_c, j, post_j} = p(w_{c, post_c, j, post_j} = w_{O,c, post_{O,c}} | w_{I, post_I}) = exp(u_{c, post_c, j, post_j}) / Σ_{j'=1}^{V} exp(u_{j', post_{j'}})        (20)

After substituting Eq. (18) and Eq. (19) in Eq. (20), we get Eq. (21):

y_{c, post_c, j, post_j} = p(w_{c, post_c, j, post_j} = w_{O,c, post_{O,c}} | w_{I, post_I}) = exp(v'^T_{w_j, post_j} v_{w_i, post_i}) / Σ_{j'=1}^{V} exp(v'^T_{w_{j'}, post_{j'}} v_{w_i, post_i})        (21)

Finally, the log-linear probability in Eq. (2) must be modified to Eq. (22) to consider the part of speech tags post of the words w:

(1/|V|) Σ_{t=1}^{|V|} Σ_{-c ≤ j ≤ c, j ≠ 0} log p( (w^{(t+j)}, post^{(t+j)}) | (w^{(t)}, post^{(t)}) )        (22)

V. EXPERIMENTAL RESULTS

1) Datasets and Pre-processing
The experiments are conducted on the OSAC datasets [25]. OSAC are benchmark datasets which include documents from several domains such as Sports, Health, Economics and others, with a total number of documents equal to 22,429. The quality of the vectors that are generated from the word embedding is highly affected by the quality of the corpus. Thus, several pre-processing stages are applied: the first stage includes removing non-Arabic words, diacritical marks and punctuation marks. The second stage is replacing all numbers with the NUM keyword. After that, in the third stage, the Farasa stemmer is used for segmenting the words and retrieving their stems [3]. This stage is very crucial, especially for the Arabic language. For example, the sentence "قام أحمد بتصحيح االمتحانات" which means (Ahmed corrected the exams) becomes "قام أحمد تصحيح امتحان" after using Farasa. In this example, the stem of the word "امتحانات" (exams), which is "امتحان" (exam), is retrieved. In word embedding, using the stem is more useful than using the word itself [26]. For example, the words "امتحان" /emtehan/, "امتحانات" /emtehanat/, "امتحاناتھم" /emtehanatehem/ and "امتحانه" /entehanoh/, which are translated to (exam), (exams), (their exams) and (his exam) respectively, must be considered as one word "امتحان" (exam). Finally, the last stage of pre-processing is the normalization.

2) Experimental settings
The proposed model was implemented using Python v3.5.3 and TensorFlow v1.12.0. The experiments are performed on a standalone computer whose specifications include a 3.4GHz Intel Core i7 quad-core processor and 24 GB RAM. The hyperparameters used in the experiments are a vocabulary size of 50,000, a dimension size of 100, and a context window of 9. The experiments are conducted on the extension of both approaches of the word2vec model, including CBOW and Skip-Gram.

3) Results and Discussions
After training the proposed model, we selected two words, "ذھب" and "جمع". If the word "ذھب" is a noun its meaning is (gold), and if it is a verb its meaning is (went). The proposed model enables the user to query and retrieve the vector representation of the word for a certain part of speech tag. Cosine similarity is used to compute the similarity of vectors, as shown in Eq. (23). The vector representation of the word "ذھب" with the (Noun) part of speech tag is completely different from the vector representation of the word "ذھب" with the (Verb) part of speech tag. Thus, if we use cosine similarity to retrieve the most similar words for the word "ذھب" with different part of speech tags, we find that the similar words are different. For example, the word "ذھب" with the (Noun)
part of speech tag means (Gold), thus we can notice that the most similar words are "نقد" (Money), "فاتورة" (Bill), "ملجم" (Mine) and others. On the other hand, the most similar words for the word "ذھب" with the (Verb) part of speech tag, which means (Went), are "رجع" (Went Back), "عاد" (Is Back), "سافر" (Travelled) and others. Table 1 and Table 2 show the most similar words for the words "ذھب" and "جمع" for the Verb and Noun part of speech tags for the CBOW and Skip-Gram models respectively.

Cosine Similarity(word1, word2) = (word1 · word2) / (|word1| |word2|)        (23)

TABLE 1. THE MOST SIMILAR WORDS FOR THE WORDS "ذھب" AND "جمع" IN THE CBOW MODEL FOR VERB AND NOUN PART OF SPEECH TAGGING.

Word "ذھب":
Verb: توجه (Go To), رجع (Went Back), التفت (Turned), ارجع (Come Back), سافر (Traveled), عمد (Went), صعد (Ascended), اصطحب (Take), نظر (Looked), انصرف (Run Along)
Noun: اشترى (Bought), ملجم (Mine), درھم (Dirham), شقة (Flat), حلق (Earring), فاتورة (Bill), معدن (Metal), حلي (Jewels), بضاعة (Goods), بائع (Seller)

Word "جمع":
Verb: ربط (Link), قارن (Compared), درس (Studied), فرق (Differentiate), ضم (Sign), فصل (Separated), نظم (Organized), احصى (Counted), مجموعة (Group), قطع (Cut)
Noun: تفريق (Differentiation), ربط (Link), فرق (Differentiate), خلط (Mix), ضم (Join), لؤلؤ (Pearl), مئوي (Percentage), سائر (Other), توزيع (Distribution), قوي (Strong)

TABLE 2. THE MOST SIMILAR WORDS FOR THE WORDS "ذھب" AND "جمع" IN THE SKIP-GRAM MODEL FOR VERB AND NOUN PART OF SPEECH TAGGING.

Word "ذھب":
Verb: رجع (Went Back), عاد (Is Back), صعد (Ascended), انتقل (Moved), خرج (Came Out), اسرع (Become Faster), توجه (Go To), وصل (Arrived), ادى (Led), صار (Became)
Noun: نقد (Money), بيع (Sell), ثمن (Price), رھان (Bet), سبيكة (Alloy), معدن (Metal), دوالر (Dollars), عملة (Currency), بوليصة (Policy), رھان (Bet)

Word "جمع":
Verb: جرى (Ran), نظم (Organized), فرق (Differentiate), فرق (Groups), مقارنة (Comparison), فرعي (Sub), بين (Between), شمل (Include), حصل (Retrieved), ضم (Combined)
Noun: فارق (Difference), اقام (Stayed), الف (Make Kind), فرد (Individual), توزيع (Distribution), فندق (Hotel), ھدف (Target), مزدلفة (Muzdalifah), منافسة (Competition), مصف (Parking)

The word "جمع", if it is a (Noun), means (Group of People) or (Addition Operation), while if it is a (Verb) it means (Add). We can notice that the most similar words in the case of the Verb part of speech tag are different from the most similar words in the case of the Noun part of speech tag.

VI. CONCLUSION

In this paper, an extension of both approaches of the word2vec model, including CBOW and Skip-Gram, is proposed. The main idea of the proposed approach is to consider part of speech tagging when training the word embedding model. Thus, the same word with different part of speech tags must be considered different. Therefore, if we have two words that have the same surface form but different part of speech tags, the results are two different words with different meanings and different vector representations. The proposed model can be applied to several languages such as English and Arabic. However, in the Arabic language the process is harder because of its morphological nature. In this paper, the experiments are conducted on the Arabic language using the OSAC datasets. Moreover, the Farasa toolkit is used for segmentation, stemming and determining the part of speech tags of the words.

REFERENCES

[1] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, "Distributed Representations of Words and Phrases and their Compositionality," in Advances in Neural Information Processing Systems, pp. 3111–3119, 2013.
[2] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient Estimation of Word Representations in Vector Space," arXiv:1301.3781 [cs], Jan. 2013.
[3] A. Abdelali, K. Darwish, N. Durrani, and H. Mubarak, "Farasa: A Fast and Furious Segmenter for Arabic," in Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations, San Diego, California, 2016, pp. 11–16.
[4] J. Pennington, R. Socher, and C. Manning, "Glove: Global Vectors for Word Representation," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 2014, pp. 1532–1543.
[5] R. Socher, "Recursive Deep Learning for Natural Language Processing and Computer Vision," Ph.D. thesis, Stanford University, 2014.
[6] A. Mahdaouy, E. Gaussier, and S. Ouatik El Alaoui, "Arabic Text Classification Based on Word and Document Embeddings," International Conference on Advanced Intelligent Systems and Informatics, 2016.
[7] Y. Li, Q. Pan, T. Yang, S. Wang, J. Tang, and E. Cambria, "Learning Word Representations for Sentiment Analysis," Cognitive Computation, vol. 9, no. 6, pp. 843–851, Dec. 2017.
[8] D. Suleiman, A. Awajan, and N. Al-Madi, "Deep Learning
Based Technique for Plagiarism Detection in Arabic Texts," in 2017 International Conference on New Trends in Computing Sciences (ICTCS), Amman, 2017, pp. 216–222.
[9] D. Suleiman and A. Awajan, "Comparative study of word embeddings models and their usage in Arabic language applications," International Arab Conference on Information Technology (ACIT), Werdanye, Lebanon, pp. 1–7, 2018.
[10] A. El Mahdaouy, S. O. El Alaoui, and E. Gaussier, "Improving Arabic information retrieval using word embedding similarities," International Journal of Speech Technology, vol. 21, no. 1, pp. 121–136, Mar. 2018.
[11] P. Lauren, G. Qu, J. Yang, P. Watta, G.-B. Huang, and A. Lendasse, "Generating Word Embeddings from an Extreme Learning Machine for Sentiment Analysis and Sequence Labeling Tasks," Cognitive Computation, vol. 10, no. 4, pp. 625–638, Aug. 2018.
[12] D. Suleiman and A. Awajan, "Bag-of-concept based keyword extraction from Arabic documents," in 2017 8th International Conference on Information Technology (ICIT), Amman, Jordan, 2017, pp. 863–869.
[13] D. Suleiman, A. Awajan, and W. Al Etaiwi, "The Use of Hidden Markov Model in Natural Arabic Language Processing: a survey," Procedia Computer Science, vol. 113, pp. 240–247, 2017.
[14] T. Mikolov, W. Yih, and G. Zweig, "Linguistic Regularities in Continuous Space Word Representations," in Proceedings of HLT-NAACL, 2013.
[15] O. Levy and Y. Goldberg, "Dependency-Based Word Embeddings," in Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Baltimore, Maryland, 2014, pp. 302–308.
[16] S. MacAvaney and A. Zeldes, "A Deeper Look into Dependency-Based Word Embeddings," in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop, New Orleans, Louisiana, USA, 2018, pp. 40–45.
[17] W. Ling, C. Dyer, A. W. Black, and I. Trancoso, "Two/Too Simple Adaptations of Word2Vec for Syntax Problems," in Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Denver, Colorado, 2015, pp. 1299–1304.
[18] A. Komninos and S. Manandhar, "Dependency Based Embeddings for Sentence Classification Tasks," in Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, California, 2016, pp. 1490–1500.
[19] N. T. Pham, G. Kruszewski, A. Lazaridou, and M. Baroni, "Jointly optimizing word representations for lexical and sentential tasks with the C-PHRASE model," in Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Beijing, China, 2015, pp. 971–981.
[20] J. Cheng and D. Kartsaklis, "Syntax-Aware Multi-Sense Word Embeddings for Deep Compositional Models of Meaning," in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, 2015, pp. 1531–1542.
[21] K. Hashimoto, P. Stenetorp, M. Miwa, and Y. Tsuruoka, "Jointly Learning Word Representations and Composition Functions Using Predicate-Argument Structures," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 2014, pp. 1544–1555.
[22] A. Farghaly and K. Shaalan, "Arabic Natural Language Processing: Challenges and Solutions," ACM Transactions on Asian Language Information Processing, vol. 8, no. 4, pp. 1–22, Dec. 2009.
[23] A. Awajan, "Arabic Text Preprocessing for the Natural Language Processing Applications," Arab Gulf Journal of Scientific Research, vol. 25, no. 4, pp. 179–189, 2007.
[24] X. Rong, "word2vec Parameter Learning Explained," arXiv:1411.2738 [cs], Nov. 2014.
[25] M. K. Saad and W. Ashour, "OSAC: Open Source Arabic Corpora," 6th ArchEng Int. Symposiums, EEECS, vol. 10, 2010.
[26] I. El Bazi and N. Laachfoubi, "Is Stemming Beneficial for Learning Better Arabic Word Representations?," in Lecture Notes in Real-Time Intelligent Systems, vol. 756, J. Mizera-Pietraszko, P. Pichappan, and L. Mohamed, Eds. Cham: Springer International Publishing, 2019, pp. 508–517.
Applying Ontology in Computational Creativity
Approach for Generating a Story
Lana Issa
Department of Computer Science
Princess Sumaya University for Technology
Amman, Jordan
lanaissa238@hotmail.com

Shaidah Jusoh
Department of Computer Graphics and Animation
Princess Sumaya University for Technology
Amman, Jordan
s.ibrahim@psut.edu.jo

Abstract. Computational creativity is a young multidisciplinary field which has a promising future for effectively giving options in tackling and handling many automated systems. It is very useful, for example, in creating a narrative story for various purposes, which is actually a human art. In this research, we are investigating methods that can be applied in a computational approach for generating structured narratives automatically, to suit educational purposes. In this paper, we present a literature review of the work done in this field so far, and we propose a framework that is designed to generate educational stories using a computational creativity approach. The major contribution of this paper is a proposed computational creativity approach consisting of hybrid Artificial Intelligence methods to generate educational stories.

Keywords- Computational Creativity; Natural Language Generation; Ontology; Automatic Story Generation.

I. INTRODUCTION
Modern computer science techniques have provided multiple options for building solutions that make life easier for humans. Various intelligent methods which have the ability to simulate human practices have been developed to provide problem-solving options. Computational Creativity is a new research area that proposes several ideas related to simulating humans' creative behavior in many areas. Narrating is one of the creative fields in which humans use their creative minds to produce creative content that satisfies readers. With the computational creativity approach, an automated narrative generation system may be able to create content that looks like human-written content. Automatic story generation is a very interesting example of automating the process of generating narratives.
Integrating a computational creativity approach, which helps with analyzing and understanding human writings such as jokes, stories, and so on [21][23], into natural language processing (NLP) may produce an effective system for generating well-written text automatically.
Automatic story generation is considered a young sub-field of Artificial Intelligence (AI) and an application of Computational Creativity. AI methods are extending to cover more and more human intelligence applications, while Computational Creativity is about employing Artificial Intelligence methods to simulate human creativity in several life aspects.
In this paper, we present related work in this field, then we propose a story generation approach which contains a hybrid of AI methods and aims at embedding educational material within the entertaining nature of storytelling.
The proposed automatic story generating approach contains multiple steps: first of all, constructing stories creatively using an ontology as a knowledge base to supply the story generator with the needed content to generate informative stories. The ontology stores all the needed scientific concepts, relations, and attributes, to get detailed knowledge about any concept that will be used in a generated story. Then, a creativity-based algorithm will design a suitable story plot according to the given information. Finally, a language generation method will be used to generate full sentences to form the final story in its final shape.
In this work, we design an alternative approach to traditional e-learning methods, which deliver the educational content in its already prepared shape. Automating the process of generating the educational material in the shape of a story is entertaining and helpful for students at a young age. This work could also be implemented as an extraordinary feature for intelligent systems that provide valuable content.
This paper is organized as follows: section II reviews the work done in the field, section III describes the proposed method, section IV contains the conclusions and the future work of this paper, and section V includes the references.

II. LITERATURE REVIEW
A. Computational Creativity
Humans are gifted with creativity, which gives them the ability to come up with novel ideas, new solutions, or any type of novel creation that helps them achieve several goals. Several issues in life do not need a formal method to find a solution; they need a different way of idea generation that includes unexpectedness and novelty. In this area, we find various creative methods created by creative humans to suit such a nature of problems.
In the world of computer science, many problems were solved by creating methods that simulate humans' cognitive behavior; this has helped humans reach solutions faster and deal with bigger volumes of data.
Recent endeavors have taken a bigger step by investigating the ability to simulate human creative behavior, to produce better solutions or to solve problems of an informal nature. With the existing artificial intelligence (AI) techniques, it is possible to employ them to build a method that produces novel yet familiar ideas. Although it has been difficult to find a precise definition of creativity in order to guide the construction of computational creativity, a certain pattern could be analyzed and some rules could be extracted to guide this process, which aims at building problem-solving methods that are inspired by humans' creativity to achieve novelty and familiarity [2].
Humans' creativity has been behind implementing and inventing many solutions that were found to be useful in many fields. Here, we review some examples of using computational creativity in various fields.
In the field of decision making, computational creativity techniques were proved to be very effective. It utilizes available
information in a creative way that supports each decision to be made.
The Deep Green concept [1] employed computational creativity as an innovative approach to deploy simulation to support military operations while they are being conducted. The authors developed software agents that process information on the military operation to make military operations planning easier, and, with a space graph of possible future states along with an assessment of the possibility of reaching those future states, they designed a dynamic approach that uses information acquired at the moment to make decisions.
In [2], the authors review many decision-making problems for which computational creativity was found feasible to be used, because it helps with assessing situations, exploring possible actions, and improving the planning process. The building of such creative solutions has developed to reach a state where it affects humans in their own decision-making process, such as in chess [15]. The program was enhanced until it reached a stage where chess players were learning from the program how to search and evaluate each movement in the game: "humans play chess like machines, and machines play chess the way humans used to play" [15].
Computational creativity supports planning and decision making, which are activities that are usually done by leaders who use their creativity to come up with the best suitable plan or decision. One of the biggest examples that could be listed here is making decisions in military training [2]. There have been many effective solutions developed for military training using artificial intelligence, virtual reality, game trainers and many other new trends in technology. In a game trainer, for example, computational creativity could be employed by building complex characters that might behave like real-life soldiers, called "intelligent agents", which are programmed with a human-like background such as emotions or education [3][4].
Another field that relies on coming up with creative strategies is the field of marketing. It requires a certain understanding of the advertised product and the target audience in order to build innovative marketing strategies. Computational creativity has been applied in this field, where it was used to automate the creative work in advertisement [6]. In their work, the creative system was programmed to produce a list of advertising messages that contain novel ideas yet familiar expressions.
In the field of generating narratives, many studies have proposed methods to generate written content such as stories, jokes, metaphors, and so on. This part is discussed in the following section, which discusses generating natural language in detail.

B. Automatic story generation
Writings are a form of creative art produced by humans, as are cooking, music, and paintings. With the existence of computational creativity, the probability of automating those creative products of humans becomes higher. Computational creativity techniques are very helpful in the world of generating writings in the form of stories, by providing useful methods connected to natural language processing methods so that the results would be valuable and novel.
Producing a pleasing narrative requires a lot of creativity and intelligence. Therefore, an intelligent system is required to automatically process, build, and produce creative entertaining content. Stories, in particular, have multiple elements such as characters, setting, plot, and so on, that should be chosen carefully to build an attractive story.
The idea of automatic story generation is generating stories automatically using intelligent computer programs to finally produce content that looks like human-produced stories. This process requires building basic knowledge for the creative program to learn how to combine words to form a story which has all the expected story elements.
Stories are one form of entertainment that many people look for, and generating creative content is the biggest challenge in entertainment. Many efforts around the world are put towards finding new ideas that participate in creating non-traditional learning methods such as interesting e-learning systems, for example the edraak website [58] created by the Queen Rania Foundation (QRF). Automatic story generation could add a new flavor to educational platforms or educational intelligent solutions, by simulating the comprehension, linguistic, and entertainment skills that writers have in an automated method that is formed with respect to human creativity. Several people have designed frameworks that generate short stories; MEXICA [5], for example, is a computer model that produces short stories guided by content, linguistic, and cultural constraints.
The production of stories in MEXICA is driven by the chosen actions. After learning from several existing stories stored in its information repository, it analyzes how the normal action flow should be designed. Each event has a set of pre- and post-conditions; whenever an event is added to the story, automatic story compliance checking is done to check whether further events need to be added to satisfy the set of defined rules. The main idea in MEXICA is improvisation to produce creativity. The system was created by creating two agents. The two agents have partially different knowledge bases and collaborate in the story generation process.
Other than MEXICA, there are many famous story generation systems such as DEFACTO [34], Tale-Spin [35], OPIATE [36], KIIDS [32], Minstrel [37], and MAKEBELIEVE [38].
In general, the approaches to generating stories could be divided into two different approaches: generating story structures and generating a full story. The first approach is about generating a complex structure of elements depending on stored atomic elements and using a production grammar, while the other approach generates a full story from A to Z, and this is usually done using a planning or simulation approach to build a story. These two approaches to generating a full story are discussed in the next section.
Before the year 2000, DEFACTO [34] and Tale-Spin [35] were introduced. DEFACTO is a framework that uses logical formulas to generate the structure of a story. This framework implements a dynamic technique to produce a story with user engagement. On the other hand, Tale-Spin is another story generating framework, but it generates a full story and not just story structures.
Another example of previous work in the area of generating stories is the work proposed by Peinado and Gervas [32]. In their paper, they presented a system (KIIDS) that generates fabulas, which is a narratological term for the set of story events that form the story. Their system was built with respect to Vladimir Propp as a narratological background, and their system learns from existing stored fabulas using description logic. Their system
followed the same narrating structure but with changed content. Their results were evaluated by comparison with randomly generated stories and existing stored stories.
MAKEBELIEVE is an interactive story generating agent that generates short stories after the user inputs the first line of the story. MAKEBELIEVE follows a hybrid approach, generating both story structures and full stories. It is based on a commonsense knowledge base which does not only suit storytelling but many other goals as well.
The authors in [8] proposed a strategy for computational narrating. Their methodology has three key features. First, the story plot is created incrementally by consulting an automatically created knowledge base. Second, the generator realizes the different components of the generation pipeline stochastically, without extensive manual coding. Third, they create and store multiple versions of a story in a tree structure; story creation amounts to traversing the tree and choosing the nodes with the highest scores. Then, they created two scoring methods that rate stories in terms of how coherent and interesting they are. Overall, their results demonstrate that the proposed over-generation-and-ranking methodology was feasible in creating short stories that follow a narrative structure. However, their approach stochastically combined sentences for stories, and there is no guarantee that these stories will be interesting or coherent.
The author in [9] presented a virtual storytelling system (AVEIRO). In their system, the characters are implemented as intelligent, semi-autonomous agents. A virtual director (an agent with general knowledge about plot structure) controls their actions and guarantees that a well-structured plot develops. They do not make use of pre-defined scripts, which implies that the plot is not prescribed but created by the characters. Their approach has been implemented in a general multi-agent framework, the Virtual Storyteller. The framework for the Virtual Storyteller has been fully implemented; however, the knowledge bases were very limited.
The authors of [10] suggested an approach to automatically build individual sentences with the help of an ontology that stores the needed knowledge. Their sentence generation model receives as input a specification of what it is supposed to deliver, and produces as output a corresponding natural language expression. Language grammar is a basis for the sentence generation. They considered sentence-structure planning according to grammatical rules, along with a selection of syntax, plus the ordering and morphological generation. Their system concentrates on the construction of sentences to make some sort of a story.
The same authors of [10] built on their basic idea to form an automatic story generation framework [11] that gives an environment to the user to build or rewrite the story according to their selections, through user interaction. The most attractive feature of their framework is that it enables the user to choose the characters, objects, and locations for the story from which it is built. The ontology used gives the characteristics of the characters, objects, and locations to the produced story.
Charles et al. [12] presented results from the first version of a fully implemented storytelling prototype, which illustrates the generation of variations of a conventional storyline. These variations result from the interaction of autonomous characters with each other, with environment resources, or from user interaction.
Riedl et al. [13] presented a planning algorithm for story generation. The story planners are restricted by the fact that they can only work on the story world given, which impacts the ability of the planner to discover a solution story plan, and the quality and structure of the story plan if one is found, which requires semantics. The ISR planning algorithm assumes creative control over parts of the story world description.
Riedl et al. [14] outline the flow of a story as a linear representation of events, with anticipated user actions and system-controlled agent actions combined in a partially ordered plan. For each possible way the user violates the story plan, an alternative story plan is created.
The literature on automatic story generation systems can take one of two directions: planning-based and simulation-based story generation [2]. In the planning-based generation of a story, the characters and events are statically defined, then a plot is created based on the defined characters and events. Of course, the events are not randomly placed but arranged according to the pre- and post-conditions of each event, so that each event is suitable for the previous and posterior event. In the planning-based approach to story generation, a fixed set of story events is set and then characters and events are combined to form stories.
An example of this approach is MEXICA [5]. The framework proposed by the creators of MEXICA considered the pre- and post-conditions of events when generating the story.
Another author who followed this approach in story generation is Riedl [27], where he proposed a general planning model using AI by retrieving and reusing vignettes, which are fragments of story that hold examples of narrating situations. This method allowed him to create a space of creative solutions that help with creating stories with respect to a planning concept inspired by existing plans.
The story-generating process is not as static as forming a plan and sticking to it without final touches or changes to follow some rules or constraints. Traditionally, sticking to a fixed plan will give us one feasible solution but will not provide us with all possible solutions that might be generated. Adding unexpectedness or randomness to the planning-based approach will produce a non-deterministic output each time the system runs, which will lead to having multiple output options.
Regarding how efficient a planning-based method is, the final goal of automatic story generation systems is to provide something that satisfies the audience, which requires altering the story generation method to produce a happy ending, for example, or to add comedy, among many other options. This is about employing creativity in generating stories and not just following a fixed method to generate expected, repeated stories.
Unlike the planning-based approach to generating stories, which revolves around events, the simulation-based generating approach revolves around individuals. In this approach a scope of characters is created, each denoted with its properties and possible actions. The rules in this approach capture the way characters interact in the real world. Thus, no specific plan-based constraints exist; the generated stories will comply with the rules existing by nature in the scope of the behavior of each character. Hence, the output is not guaranteed to be interesting; it will just follow a natural flow. But since this approach focuses highly on creating realistic characters, they might be equipped with elements that make them more realistic and close to the nature of humans, such as emotions.
Rank et al. [4] reported on an interactive storytelling approach [4] where, for example, the factor of emotion creatively affects the creation of
the agents (characters) or the story. This character configuration process added a creative touch over the generated stories.
However, a hybrid approach might combine both methods to get a hybrid advanced method that might produce more satisfying results. But as mentioned earlier, each approach could be tackled and updated in infinite ways that might suit various domains to produce satisfying results. In the planning-based approach, infinite creative planning methods could be found to guide a story production. And in the simulation-based approach, infinite real-world inspired factors could drive the behavior of any character to produce many versions of a requested story.

III. PROPOSED APPROACH
Writing a story is about preparing the right content and putting it in a good structure. The story content and presentation matter to the readers [39]. Well-written stories usually have impressive content that was prepared by a creative writer. In this work, we aim at designing a method that simulates humans' creativity in writing stories. The proposed approach contains two major phases: the first phase is planning the story in terms of components and structure, using a computational creativity approach. The second phase is the linguistics of the story, where the story sentences are generated according to the pre-planned content.

A. Planning the story
Planning a story requires preparing all the story elements. In the area of children's short stories, the writer Nancy Krulik [26], who is an author of the Katie Kazoo, Switcheroo book series, defined five essential elements of a story:

• Characters: Characters which interact and form the story events; there are primary characters and secondary characters, and there might be a star character which the story focuses on.
• Setting: The story setting is usually the environment or time in which the story will take place.
• Plot: The sequence of events that form the story and contain the main details the author wants to deliver.
• Conflict: The main event that occurs in the story, which usually contains a problem that needs to be solved.
• Resolution: The solution of the story conflict and the closing state of the events sequence.

Applying computational creativity in designing stories requires employing computational creativity in planning the basic elements of a story. In order to prepare a story's content, some type of knowledge needs to be stored in a way that aids the process of preparing story elements. In this work, we designed a method based on an ontology [42] technique to represent the reference material that the story is built from. The ontology is a knowledge base which will be used to build the concepts of the educational contents that are to be delivered in the produced stories. Figure 1 shows an example of an ontology which contains concepts from the Cambridge primary science stage 4 learner's book [30].

Figure 1. Example of mapping concepts into the ontology

The story plan is set by considering all of the story elements that need to be prepared. In this work, our method is designed to generate short stories for children at a young age (5-7 years old). Thus, we choose not to follow a very complicated method in preparing story elements, in order to keep the story simple and suitable for kids. However, our method is scalable such that it can be modified by increasing the level of complexity in making some decisions when preparing the story components. Figure 2 shows an overview of our story generating approach.

Figure 2. General components of the generating story approach

A story usually contains a main character [26][28]. In our approach, a user has to select the main character that he/she wants to learn about in the shape of a story. The second element of stories is the story settings, which is usually the time and place of the story. These components are set stochastically according to the domain of the educational material in order to choose suitable story settings. For example, if the chosen domain is science, the settings are set according to the places where the scientific objects exist. A decision is made according to the chosen main character that is stored in the ontology along with its properties, actions, suitable environments, and all the other relations with related concepts. Stories usually contain events that are described as a story conflict, which usually contains a problem or a challenge [26], and a story resolution, which is the event that presents a solution to the main conflict of the story and is usually the exciting part that the readers look forward to. The story events are actions performed by story characters, and usually the conflict and resolution events contain actions performed by the main character. Thus, and in order to maintain such information, we rely on an ontology that has all the concepts stored and labeled; the actions that are stored for characters could be labeled according to the possibility of acting as a conflict action, and possibly as a resolution action along with the corresponding conflict action. In this way, when retrieving information from the ontology, the character's properties, places, and labeled actions are retrieved, and that forms a good base of content when planning a story.
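To illustrate how such labeled knowledge could be organised, the following minimal Python sketch (the class names, fields and the frog example are illustrative assumptions of ours, not the authors' implementation, which could equally be built with an ontology editor such as Protégé [31]) stores a concept together with its properties, suitable places, and actions labeled as possible conflict or resolution actions, and shows how a planner might retrieve that material:

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Action:
    name: str
    is_conflict: bool = False           # can this action act as the story conflict?
    resolves: Optional[str] = None      # name of the conflict action this action resolves, if any

@dataclass
class Concept:
    name: str
    properties: dict = field(default_factory=dict)
    places: list = field(default_factory=list)    # suitable story settings for this concept
    actions: list = field(default_factory=list)   # labeled actions the character can perform

# A toy entry inspired by a primary-science domain; the values are illustrative only.
frog = Concept(
    name="frog",
    properties={"habitat": "pond", "diet": "insects"},
    places=["pond", "garden"],
    actions=[
        Action("swims"),
        Action("loses its pond", is_conflict=True),
        Action("finds a new pond", resolves="loses its pond"),
    ],
)

def story_material(concept):
    """Retrieve the labeled content the plot planner needs for a chosen main character."""
    conflicts = [a.name for a in concept.actions if a.is_conflict]
    resolutions = {a.resolves: a.name for a in concept.actions if a.resolves}
    return concept.places, conflicts, resolutions

print(story_material(frog))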
Planning a good plot is essential for making a good story. A plot is a term that describes the set of events that make up a story, which is the main part of the story. Creating a unique plot is what makes a new story. Writers focus on creating an exciting and interesting plot to catch the readers' interest and to produce an entertaining story. In order to make a good plot, a set of story events is ordered in an exciting way such that it will capture the interest of the reader. Usually, when a writer writes a story, the writer sets a goal idea or a core event that needs to be delivered in the story, then sequences of events are developed to create a full plot. An interesting concept followed by writers is the causality [7] concept, which is about considering the cause and effect of every chosen action in a story, in order to choose the pre-actions and post-actions of this action carefully. The main idea of causality is that every process is caused by many possible processes and a process could cause several other processes, so a plot could develop in many possible ways; it depends on the writer's choice of each story event that makes sense according to the previously chosen events. This idea was followed by many authors who proposed methods for generating stories, such as the authors of [32].
In generating a story automatically, a system should be intelligent enough to create a story plot. The system has to be able to predict the sequence of actions. Previous computational creativity systems have been developed using various methods; however, none of them has considered predicting the sequence of actions in generating a story. In this work, we propose to use an artificial intelligence method, namely the Markov Chain Model, for the prediction purpose [43].
In our proposed approach, using the Markov Chain Model, we follow a stochastic approach which depends on stored information about possible orderings of events to suit the story goal. This part is about simulating the creativity of humans when forming story plots, to design a suitable plot for our system to follow when generating the story. As previously mentioned, we find our problem of story generation to be a problem that employs exploratory creativity in exploring a set of possible elements to build the final solution. The exploration is controlled with certain constraints and requirements, and the options are measured according to the domain measurement basis.
A plot is a sequence of events that form a story. Those events are actions performed by the characters involved in the story. So, designing a plot requires ordering characters' actions into a reasonable order according to some constraints. In order to simulate this process we chose the Markov chain model [43]. The Markov chain model is a model that uses statistics in determining a sequence of elements according to certain rules or history. This model involves building multiple possible sequences in the shape of directed graphs and then choosing a suitable sequence according to the probabilities on the links that connect the sequence elements. Figure 3 shows an example of a simple Markov Chain Model for the possible actions in a child story.

Figure 3. Example of a Markov chain model for a child's possible actions

A Discrete Time Markov Chain [43] is a sequence of random variables such that the probability of the next state n depends only on the previous state n-1 and not on the overall sequence. This is expressed in the following probabilistic formula:

P(X_{n+1} = x | X_1 = x_1, X_2 = x_2, …, X_n = x_n) = P(X_{n+1} = x | X_n = x_n)

The probabilities could be set in multiple ways: a stochastic approach could be followed, or a statistics-based approach according to a stored history of sequences that participates in forming the current probabilities.
The Markov Chain Model is able to form multiple sequences of character actions. Then, it will choose either the sequence with the highest probabilities connecting the elements, or a sequence that sums up to be above a certain threshold set by domain constraints. For now, we will consider taking the sequence with the highest summation of probabilities.
Applying the Markov Chain Model to find a suitable ordering for story events ensures planning the shape of the story by choosing a mathematically feasible sequence of actions, which will make sure that the outcome is reasonable and valuable.

B. Building the story
This stage contains natural language generation. The previous stage sets the story outline creatively; what remains is making all the linguistic decisions to satisfy the goal of producing a story that is complete, correct, and coherent. Natural language generation contains 6 main tasks: content planning, text design, sentence planning, lexicalization, referring expression generation, and linguistic realization [16, 17]. The first 3 tasks are application dependent and are set according to the domain of the language generating application. The last 3 tasks are about making decisions about the linguistics of the story; many techniques in the literature could help with the implementation process, but here we design the general method of generating educational stories, and such details related to the implementation will be discussed in our future work.

C. Evaluating the story
Evaluating any creative content is considered a bit of a challenge [40], because usually the evaluation in the artistic field is very subjective and humans' taste is nondeterministic. Also, such productions cannot be automatically assessed. So, such systems require human evaluation of the output. Many types of evaluation methods could be used to get useful information about the validity of such a proposed system, such as questionnaires, surveys and observations.
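As an illustration of the plot-planning step of Section III.A, the minimal Python sketch below builds candidate action sequences from hand-set transition probabilities of the kind shown in Figure 3 and keeps the sequence with the highest sum of transition probabilities, as described above. The action names and the probability values are illustrative assumptions only, not taken from the ontology or from the authors' implementation:

transitions = {
    "wake up": {"eat breakfast": 0.6, "play": 0.4},
    "eat breakfast": {"go to school": 0.7, "play": 0.3},
    "play": {"eat breakfast": 0.5, "go to school": 0.5},
    "go to school": {"learn a lesson": 0.8, "play": 0.2},
    "learn a lesson": {},
}

def best_plot(start, length):
    """Enumerate action sequences of the given length starting from `start`
    and return the one whose sum of transition probabilities is highest."""
    frontier = [([start], 0.0)]            # partial sequences with their accumulated score
    for _ in range(length - 1):
        expanded = []
        for seq, score in frontier:
            for nxt, p in transitions.get(seq[-1], {}).items():
                expanded.append((seq + [nxt], score + p))
        frontier = expanded or frontier    # stop expanding when no outgoing actions remain
    return max(frontier, key=lambda item: item[1])

print(best_plot("wake up", 4))   # e.g. wake up -> eat breakfast -> go to school -> learn a lesson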

IV. CONCLUSIONS AND FUTURE WORK


We aimed at working on an educational story generator that
generates informative entertaining stories automatically. We
reviewed the existing work in the related fields which are:
computational creativity, NLG, and automatic story generation

which combines previous topics, and then we studied all possible paths that could be followed to build such a system. We have designed a reasonable method for building such a system. In future work, we aim to implement this proposed method and test its ability to generate educational short stories for children. After that, we also hope to continue investigating the ability to employ computational creativity techniques in designing even the smallest details in the stories, to make them look like human-produced stories.

V. REFERENCES
[1] Surdu, J. R. & Kittka, K. (2008), The Deep Green concept, in 'Proceedings of the 2008 Spring simulation multiconference', Society for Computer Simulation International, San Diego, CA, USA, pp. 623-631.
[2] Jändel, M. (2013a), 'Computational Creativity in Naturalistic Decision-Making', Submitted to International conference on computational creativity, 2013.
[3] Swartjes, I. & Vromen, J. (2007), Narrative Inspiration: Using Case Based Problem Solving to Support Emergent Story Generation, in '4th International Joint Workshop on Computational Creativity'.
[4] Rank, S.; Hoffmann, S.; Struck, H.-G.; Spierling, U. & Petta, P. (2012), Creativity in Configuring Affective Agents for Interactive Storytelling, in 'Proc. of the 3rd International Conference on Computational Creativity'.
[5] Perez y Perez, R.; Negrete, S.; Peñaloza, E.; Castellanos, V.; Ávila, R. & Lemaitre, C. (2010), MEXICA-Impro: A Computational Model for Narrative Improvisation, in 'Proc. of the International Conference on Computational Creativity'.
[6] Strapparava, C.; Valitutti, A. & Stock, O. (2007), Automatizing Two Creative Functions for Advertising, in 'International Conference on Computational Creativity'.
[7] Bunge, M. (2012) Causality and Modern Science: Third Revised Edition. Massachusetts, USA: Courier Corporation.
[8] "Learning to Tell Tales: A Data-driven Approach to Story Generation", Neil McIntyre and Mirella Lapata. Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP. Suntec, Singapore, August 2-7, 2009.
[9] "The Virtual Storyteller: Story Creation by Intelligent Agents", Mariët Theune, Sander Faas, Anton Nijholt, and Dirk Heylen. University of Twente, PO Box 217, 7500 AE Enschede, The Netherlands.
[10] "A novel approach for construction of sentences for automatic story generation using ontology", A. Jaya and G.V. Uma. Proceeding of International Conference on Computing, Communication, and Networking (2008).
[11] "An intelligent system for semi-automatic story generation for kids using ontology", A. Jaya and G.V. Uma. Proceedings of the Third Annual ACM Bangalore Conference (2010).
[12] Charles, F.; Mead, S.J.; Cavazza, M. "Character-driven story generation in interactive storytelling", Proceedings of the Seventh International Conference on Virtual Systems and Multimedia, pp. 609-615, Oct. 2001.
[13] Riedl, M. and Young, R.M., "Open-World Planning for Story Generation", Proceedings of the 19th International Joint Conference on Artificial Intelligence, California, USA, 2004.
[14] Riedl, M. and Young, R.M., "From Linear Story Generation to Branching Story Graphs", American Association for Artificial Intelligence (www.aaai.org), 2005, pp. 23-29.
[15] Bushinsky, S. 2009. Deus ex machina: a higher creative species in the game of chess. AI Magazine 30(3):63-69.
[16] Reiter, E., & Dale, R. (1997). Building natural-language generation systems. Natural Language Engineering, 3, 57-87.
[17] Reiter, E., & Dale, R. (2000). Building Natural Language Generation Systems. Cambridge University Press, Cambridge, UK.
[18] Binsted, K., & Ritchie, G. D. (1994). An implemented model of punning riddles. In Proc. AAAI'94.
[19] Binsted, K., & Ritchie, G. D. (1997). Computational rules for generating punning riddles. Humor: International Journal of Humor Research, 10 (1), 25-76.
[20] Stock, O., & Strapparava, C. (2005). The act of creating humorous acronyms. Applied Artificial Intelligence, 19 (2), 137-151.
[21] Petrovic, S., & Matthews, D. (2013). Unsupervised joke generation from big data. In Proc. ACL'13, pp. 228-232.
[22] Hervás, R., Pereira, F., Gervás, P., & Cardoso, A. (2006). Cross-domain analogy in automated text generation. In Proc. 3rd joint workshop on Computational Creativity, pp. 43-48.
[23] Veale, T., & Hao, Y. (2007). Comprehending and Generating Apt Metaphors: A Web-driven, Case-based Approach to Figurative Language. In Proc. AAAI'07, pp. 1471-1476.
[24] Veale, T., & Hao, Y. (2008). Fluid knowledge representation for understanding and generating creative metaphors. In Proc. COLING'08, pp. 945-952.
[25] S. Han, H. Shim, B. Kim, S. Park, S. Ryu and G. G. Lee, "Keyword question answering system with report generation for linked data," 2015 International Conference on Big Data and Smart Computing (BIGCOMP), Jeju, 2015, pp. 23-26. doi: 10.1109/35021BIGCOMP.2015.7072843.
[26] Nancy Krulik Katie Kazoo Classroom Crew. www.katiekazoo.com/nancy.html, accessed 18 Nov 2018.
[27] Riedl, M. O. (2008), Vignette-Based Story Planning: Creativity Through Exploration and Retrieval, in 'Proc. 5th International Joint Workshop on Computational Creativity'.
[28] List of narrative forms. (2019). Retrieved from https://en.wikipedia.org/wiki/List_of_narrative_forms
[29] R. Rosenfeld, "Two decades of statistical language modeling: where do we go from here?," in Proceedings of the IEEE, vol. 88, no. 8, pp. 1270-1278, Aug. 2000.
[30] Fiona Baxter, Liz Dilley, Alan Cross, Jon Board. (June 2014). Cambridge Primary Science Stage 4 Learner's Book. Cambridge: University of Cambridge.
[31] Musen, M.A. The Protégé project: A look back and a look forward. AI Matters. Association of Computing Machinery Specific Interest Group in Artificial Intelligence, 1(4), June 2015. DOI: 10.1145/2557001.25757003.
[32] Federico Peinado, Pablo Gervas, "Evaluation of Automatic Generation of Basic Stories", New Generation Computing, Computational Paradigms, and Computational Intelligence. Special issue: Computational Creativity 24(3):289-302, 2006.
[33] Jie Bao, Caragea, D., Honavar, V. "Towards Collaborative Environments for Ontology Construction and Sharing." Collaborative Technologies and Systems, CTS 2006.
[34] Sgouros, N. M., "Dynamic Generation, Management and Resolution of Interactive Plots", Artificial Intelligence 107, 1, pp. 29-62, 1999.
[35] Meehan, James R., "TALE-SPIN and Micro TALE-SPIN", in Inside computer understanding (Schank, Roger C., and Riesbeck, Christopher K. ed.), Lawrence Erlbaum Associates, Hillsdale, NJ, 1981.
[36] Fairclough, C. and Cunningham, P., "A Multiplayer Case Based Story Engine", in Proceedings of the 4th International Conference on Intelligent Games and Simulation, EUROSIS, pp. 41-46, 2003.
[37] Turner, S. R., Minstrel: A Computer Model of Creativity and Storytelling, Technical report UCLA-AI-92-04, Computer Science Department, University of California, USA, 1992.
[38] Hugo Liu, Push Singh. (2002). MAKEBELIEVE: Using Commonsense to Generate Stories. Proceedings of the Eighteenth National Conference on Artificial Intelligence, AAAI 2002, Edmonton, Alberta, Canada. AAAI Press, July 28 - August 1, 2002, pp. 957-958.
[39] Soleimani, H., & Akbari, M.G. (2013). The Effect of Storytelling on Children's Learning English Vocabulary: A Case in Iran.
[40] Boden, M. (2009). Computer Models of Creativity. AI Magazine, Vol. 30, No. 3, Fall 2009.
[41] Gervás, P.; Pérez y Pérez, R.; Sosa, R. & Lemaitre, C. (2007), On the Fly Collaborative Story-Telling: Revising Contributions to Match a Shared Partial Story Line, in 'Proc. of International Joint Workshop on Computational Creativity'.
[42] R. Studer, V. Benjamins, and D. Fensel, "Knowledge engineering: Principles and methods," Data & Knowledge Engineering, vol. 25, no. 12, pp. 161-197, 1998. [Online]. Available: http://dx.doi.org/10.1016/S0169-023X(97)00056-6.
[43] Jaiswal, S. (2019). Python Markov Chains Beginner Tutorial. [online] DataCamp Community. Available at: https://www.datacamp.com/community/tutorials/markov-chains-python-tutorial [Accessed 14 Apr. 2019].
Arabic Document Indexing for Improved Text
Retrieval
Yaser A. M. Al-Lahham
Computer Science Department
Zarqa University
Zarqa – Jordan
yasirlhm@zu.edu.jo

Abstract - Arabic document indexing is a challenging process due to the complex morphological nature of the Arabic language. Methods of document indexing in the literature relied on applying morphological schemes to extract terms. These morphological schemes mainly depend on root extraction and stemming. This paper proposes a simple document indexing method based on selecting only definite words (those that have the prefix AL, or for which it is acceptable to have this prefix). The words preceding and/or succeeding these definite words are also considered. The proposed method is tested using the TREC-2001/2002 Arabic test collection. The proposed method outperforms selecting all terms, either without stemming or stemmed by the Light10 stemmer; for example, indexing documents by selecting definite words and the words that come after them enhances the Mean Average Precision of the Light10 by 4.4%, and at the same time decreases the index size by 6.1%.

Keywords - Arabic Information Retrieval; Arabic Document Indexing; Index Term Selection; Arabic Language Processing

I. INTRODUCTION

Arabic has a rich vocabulary since words can be devised by adding, stressing, or combining words, or just by changing a diacritic of a letter in a word. The application of these rules on Arabic words gives Arabic a complicated morphological structure, which complicates the document-indexing process [10]. These complications can be noticed in many cases; for example, a word can have different forms for plural and pair forms, definite articles, male or female, or any other usage. Applying these morphological rules on the word "كتاب" (a book) produces the following forms: "كتابي" my book, "كتابھا" her book, "كتابھم" their book (male), "كتابھن" their book (female), "كتب" books, "الكتاب" the book, and "كتابھما" their book (for two). All of these forms of the word refer to the same meaning. In Arabic Information Retrieval, this problem makes it difficult to match terms of a query to index terms of documents [18].

To solve these problems, some research efforts indexed documents using Arabic morphological analysis to extract index terms. Although many proposals based on root extraction are effectively used in some areas, such as automatic diacritization of Arabic sentences [9], root extraction shows less significance when used in Arabic text retrieval [15]. Alternatively, stemming is mostly used to index a document by mapping different word forms into a single term, or stem [8], which could solve the problem of matching two words that have the same meaning but different shapes [10]. Stemming produces a reduced index and enhances the retrieval (in terms of recall); other researchers found that it has little effect on the precision ratio, such as [16].

Recently, light stemmers have been used for document indexing in order to reduce the complexity of morphological analyzers and heavy stemmers. An example of a widely used light stemmer is the Light10 [16], which applies a few morphological rules to strip off a predefined list of affixes. Light stemmers encounter some problems; for example, using different stemming algorithms with the same affix lists produces different results [10], and for short queries light stemmers behave the same way as no stemming [21].

Using light stemmers for document indexing could be improved by choosing a representative subset of terms instead of selecting all terms, since the results recorded by many researchers showed that it gives better retrieval and reduces the index size [18]. These results motivated the proposal of this paper. This paper proposes a different approach to indexing documents: it selects index terms that are most likely to have an important role in Arabic sentences, such as the definite words (those that have the prefix "ال") and the words after/before them. Definite words could gain more importance as keywords in Arabic text, as the article is added to nominal words to upgrade importance, as a previous knowledge indicator, and as a definite conjunctive article added to active and passive participles [20]. Once a document is indexed according to a word that has the prefix "ال", all documents are later indexed according to this word, regardless of whether it has the prefix "ال" or not.

All over the paper, the term "AL-Word" means a word that begins with the article "ال" or "AL". AL-Words and the words before them are referred to as "ALBEFORE", AL-Words and the words after them as "ALAFTER", and AL-Words and the words after and before them as "ALBEFORE_AFTER".

The rest of the paper is organized as follows: section 2 includes a survey of the related work. Section 3 presents the proposed method of document indexing. Section 4 presents the evaluation procedure, results, and discussion. Finally, section 5 concludes the paper and presents the future work.

II. RELATED WORK

Arabic document indexing includes index terms' selection, which can be categorized into statistical, linguistic, and combined linguistic and statistical techniques.

Statistical approaches use properties of index terms such that the selection criterion is not oriented towards a specific language or application; for example, Al-Kabi et al. in [2] used term frequency-inverted term frequency (TF-ITF) and term co-occurrence as statistical parameters to extract index terms. Inverse document frequency (IDF) is also used as a statistical parameter to select n-gram character stems; for example, Awajan in [7] used the IDF to distinguish a stem from terms that seem like that stem, since these terms are more frequent than stems and so will have a higher document frequency than a stem.

Other methods used machine learning techniques to apply morphological rules of the Arabic language to extract words' roots and stems. Some researchers apply these methods to extract features to be used in a specific application; for example, Al-Thubaity et al. in [4] tested several methods to extract features suitable for Arabic text classification.

Some root extraction methods, for example Nehar et al. in [19], used finite state transducers to determine index terms, and in case a word maps to more than one root, some statistical methods are applied to resolve conflicts. Morphological analysis suffers from ambiguity and covers a limited number of word forms [10].
Light stemmers were proposed to overcome the complications of morphological analysers, where tables of a limited number of common affixes are determined, and the longest matched affix of a word is removed [7]. Light stemmers were improved by updating previously used tables, where a new affix is added to develop a new table at each improvement.

Recently, enhancements on existing light stemmers have been proposed, for example SAFAR [13] and P-Stemmer [14], which extends the prefix list of the Light10 stemmer and does not strip the suffixes; P-Stemmer is used for text classification. Mustafa, Mohammad et al., in [17], extended the Light10 stemmer by adding more prefixes and suffixes to be stripped out; additionally, some conditions were imposed, such as a one-letter prefix being stripped out only if the remaining term length is greater than three letters. They proposed another linguistic-based stemmer that uses some morphological aspects to classify words into categories, such that a different stemming method is applied on each category. Abdelali et al. proposed the FARASA stemmer [1], which uses a Support Vector Machine to rank multiple segments that could be the stem of a word.

Alternatively, a dictionary or lookup-table approach can be used, such that for each stem all of the words that belong to it are stored and replaced by that stem during the indexing procedure, as indicated in [3]. However, this method is static and needs tables to be periodically updated.

On the other hand, documents could be indexed by selecting index terms according to a combination of morphological analysis and other linguistic tools such as Part Of Speech (POS); for example, Awajan in [6] proposed a method that automatically extracts keywords from Arabic documents using unsupervised learning and statistical aspects of words.

Previous research work extracted index terms according to statistical and morphological aspects, and other tools such as POS taggers. However, the Arabic language has other aspects that can be used for simpler and more efficient index term selection. This research is primarily based on using some of these aspects for index term selection in order to index documents, as explained in section III.

III. DOCUMENTS' INDEXING FRAMEWORK

The proposed document indexing selects a subset of words that are most likely to have importance in Arabic language sentences and semi-sentences. The selected words include definite words (AL-Words) whose prefix is the article "ال", or any of its forms (وال فال بال كال لل), and terms preceding and/or following them (ALAFTER/ALBEFORE). The words that are acceptable to have this prefix are also considered as definite words, even if they do not carry the prefix AL. Some words, such as verbs and most persons' names, cannot have the prefix AL, so such a word can be considered in case it precedes or follows a word that can have this prefix. The overall index term-selection framework is presented in Fig-1. The framework begins with a pre-processing stage (stop-word and punctuation removal, and normalization); the next step is to apply the proposed selection of words, as follows:

• Select AL-Words only.
• Select AL-Words and the words following them.
• Select AL-Words and the words before them.
• Select AL-Words and the words before them, and the words following them.

The selected words are normalized by using some light stemmer, and finally the framework produces the index terms' list.

Fig. 1 Index Term Selection Framework (AL-Words and the words before/after them pass through pre-processing and light stemming to produce the index terms)

The following subsections explain the rationale of using these aspects of the Arabic language to select index terms.

III.1. Select definite Words (AL-Words)

The article "ال" in Arabic is used for different purposes: it can be used as a redundant article added to nominal words to upgrade importance, as a previous knowledge indicator, and as a definite conjunctive article added to active and passive participles [20]. So this article is a good indicator to determine terms that are most likely to be important index terms, because the names and adjectives that this article defines represent informative terms in the text, as indicated by [5] and [12].

As "ال" indicates previous knowledge, it focuses on a concept that was previously mentioned in the text, being the topic of that text, so these words are expected to have a significant role in the text, and are important enough to be selected as index terms.

Passive and active participles, which are determined by the article "ال", are frequently used for focusing on the event rather than the entity that actually did that event, which indicates that these words are of sufficient importance to be selected as index terms. For example, in "ھُدِمَ الجدار" or "the wall has been destroyed", the term "الجدار" is significant to be selected as it is the entity affected by the verb "ھدم". Moreover, "ال" is used to identify different concepts, as indicated in table-1.

TABLE-1 EXAMPLES OF AL-WORDS' USAGE
Usage | Example (begins by "ال")
Region name | الشرق األوسط (the Middle East)
Enterprise name | الشركة العربية (the Arab Company)
Focus | قضية الالجئين (the Refugee Cause)
Relationship | أھداف المؤتمر (the Conference Objectives)
Place name | المسجد األقصى (the Aqsa Mosque)
Family name | الھاشمي (Al-Hashmi)

Moreover, selecting AL-Words prevents the IR system from ignoring some words that have the same shape as stop words, since, in the Arabic language, stop words cannot have "ال" as a prefix, while these words accept this prefix. Some examples are listed in table-2.

TABLE-2 EXAMPLES OF WORDS THAT SEEM AS STOP WORDS
Stop word | Meaning | Word of same shape (bare / with "ال") | Meaning
ايه | which | آية / اآلية | Verses
فھم | they | فھم / الفھم | Understanding
وھم | and they | وھم / الوھم | Mystery

Although it is beneficial to select AL-Words as index terms, as indicated above, some groups of words will not have a chance to be selected, such as:

• Words that have the suffixes "ـه", "ھا", "ھن", or "ھم", as in the Arabic language these words will never begin with "ال".
• Most persons' names, since most Arabic names cannot begin with "ال".
These problems could be solved by extending the selection of index terms to include the words that come before/after AL-Words.

AL-Words are determined and selected according to the following criteria: (1) a word is considered an AL-Word if it begins with the definite article, or (2) it was previously found, in any document, to begin with this article. For example, if the indexer finds the word "الحاسوب" or "the computer", it will also consider the word "حاسوب", which does not have the prefix "ال".
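A minimal Python sketch of this two-pass selection is given below. It assumes whitespace tokenization, the article forms listed in section III, and a three-letter minimum on the remaining term; the helper names are illustrative and not part of the original system.

```python
# Minimal sketch of the two-pass AL-Word selection (assumptions: whitespace
# tokenization, only the article forms listed in section III).
AL_FORMS = ("ال", "وال", "فال", "بال", "كال", "لل")

def strip_al(token):
    """Return the token without its definite-article prefix, or None if it has none."""
    for form in AL_FORMS:
        if token.startswith(form) and len(token) - len(form) >= 3:
            return token[len(form):]
    return None

def collect_al_words(documents):
    """Pass 1: remember every word that appears with the article anywhere in the collection."""
    known = set()
    for doc in documents:
        for token in doc.split():
            bare = strip_al(token)
            if bare is not None:
                known.add(bare)
    return known

def select_index_terms(doc, known, before=True, after=True):
    """Pass 2: keep AL-Words (criteria 1 and 2) and, optionally, their neighbours."""
    tokens = doc.split()
    selected = set()
    for i, token in enumerate(tokens):
        bare = strip_al(token)
        is_al_word = bare is not None or token in known   # criterion (1) or (2)
        if is_al_word:
            selected.add(bare or token)
            if before and i > 0:
                selected.add(tokens[i - 1])                # ALBEFORE
            if after and i + 1 < len(tokens):
                selected.add(tokens[i + 1])                # ALAFTER
    return selected
```

The `before`/`after` flags correspond to the ALBEFORE, ALAFTER, and ALBEFORE_AFTER options of the framework.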
III.2. Select words before and after the AL-Words

In the Arabic language, a word that precedes an AL-Word is a genitive of that word, which adds a 'property' of the AL-Word, so it is more convenient to select both of them as index terms. For example, in the expression "مدير الشركة" or "the company manager", selecting both the AL-Word "الشركة" ("the company") and the word that precedes it ("مدير" or "manager") makes the index meaningful.

On the other hand, selecting the word that succeeds the AL-Word enriches the index term list, since it is an adjective of the AL-Word. For example, in the expression "الكون متمدد" or "the expanding universe", the AL-Word is "الكون" or "the universe", and the word after it is "ممتد" or "expanding", which adds a feature to the term "الكون".

Selecting the AL-Word and the words before/after it also makes it possible to select the words that have the suffixes "ـه", "ھا", "ھن", or "ھم", which cannot be AL-Words as indicated in the previous sub-section. These terms can be selected as they may succeed or precede AL-Words; for example, in "والؤھم العرقي" or "their ethnic allegiance", the word "والؤھم" will have a chance to be an index term since it comes before the AL-Word "العرقي". This criterion also gives words representing entities' names a chance to be included as index terms, since they may precede or succeed AL-Words, for example "األستاذ علي", "الدكتور مصطفى", "محمد األمين", and so on.

An explanatory study of manually selecting AL-Words, and the words that precede/succeed them, performed on a sample of 30 documents (drawn at random from the TREC-2001/2002 collection), yields the following results:

66% of persons' names found in these test documents can be indexed; the other missed names (that aren't indexed) are mostly mentioned in sport news documents, as these documents include players' names as lists.

An average of 6.97 words/document that represent enterprises and organizations are selected to index documents, such as "األمم المتخدة" or United Nations.

An average of 8.47 words/document that represent significant expressions are indexed; examples of these expressions are "الحرب الدينية" (holy war), "الحكم الذاتي" (autonomy), "السعفة الذھبية" (golden palm), etc.

Although this manual study was applied on a small number of documents, it can motivate moving forward to examine the proposed framework in a larger information retrieval environment.

IV. EVALUATION

To evaluate the proposed method of document indexing, five experiments were performed to examine each of the following options of document indexing: (1) index documents using AL-Words only (ALONLY), (2) index documents using AL-Words and the words before them (ALBEFORE), (3) index documents using AL-Words and the words after them (ALAFTER), (4) index documents using AL-Words and the words before and after them (ALBEFORE_AFTER), and (5) an experiment performed to index documents using all of the words of the collection, in two forms: without stemming (referred to as ALL_TERMS) and stemmed by the Light10 stemmer (denoted as LIGHT10). The result of the fifth experiment is used as a baseline to evaluate the results of the proposed experiments.

Test Collection: all of the experiments were applied on the TREC-2001/2002 Arabic newswire corpus, which was collected by the Agence France-Presse (AFP) news agency and published by the Linguistic Data Consortium (LDC). This corpus has 383,872 documents and 666,094 distinct words. Documents are written using the UTF-8 encoding and have SGML format, such that each document has an identifier, a title, a heading, a body text, and a trailer.

The corpus includes 75 user needs or "Topics" (25 topics from TREC-2001 extended by 50 new topics in TREC-2002); each topic has three parts: a title, a description, and a narrative. In this paper, a query is created by considering only the topic title and its description.

Indexing: weights of terms in a document are calculated using Okapi BM25+. All experiments were performed using the default parameters of BM25+; these default values are: the tuning parameter for term frequency scaling (k1=1.2), the tuning parameter for the term frequency in the query (k3=8), and the tuning parameter for document length normalization (b=0.75). The coefficient δ=1, since it is found that this value makes BM25+ work better [22]. In all of the experiments, words are considered for document indexing if they satisfy the following two conditions:

Having a size of three letters or more.

Having a document frequency greater than two.

The second condition is empirically concluded from the results of the experiments; it is noted that the words that appear in only one or two documents were mainly incorrectly spelled.
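As a concrete illustration of the scoring described above, the following is a minimal sketch of a BM25+ term weight with the stated defaults (k1=1.2, k3=8, b=0.75, δ=1). It is an assumed, simplified implementation for a single query-document pair, including a commonly used IDF variant, and not the exact code used in the experiments.

```python
import math

def bm25_plus_weight(query_tf, doc_tf, df, N, doc_len, avg_doc_len,
                     k1=1.2, k3=8.0, b=0.75, delta=1.0):
    """Sketch of the BM25+ weight of one term for one document.

    query_tf : frequency of the term in the query
    doc_tf   : frequency of the term in the document
    df       : number of documents containing the term
    N        : number of documents in the collection
    """
    idf = math.log((N - df + 0.5) / (df + 0.5) + 1.0)        # common IDF variant (assumed)
    norm = k1 * ((1.0 - b) + b * doc_len / avg_doc_len)       # document length normalization
    doc_part = (k1 + 1.0) * doc_tf / (norm + doc_tf) + delta  # the "+ delta" lower bound of BM25+
    query_part = (k3 + 1.0) * query_tf / (k3 + query_tf)      # query term frequency scaling
    return idf * doc_part * query_part

def score_document(query_terms, doc_terms, df, N, avg_doc_len):
    """Sum the BM25+ weights over the terms shared by the query and the document."""
    doc_len = len(doc_terms)
    score = 0.0
    for term in set(query_terms) & set(doc_terms):
        score += bm25_plus_weight(query_terms.count(term), doc_terms.count(term),
                                  df[term], N, doc_len, avg_doc_len)
    return score
```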
Documents are pre-processed by removing stop-words (and all of their forms), where the list of stop words is obtained from the "Sourceforge" open source site (http://sourceforge.net/projects/arabicstopwords/), and punctuation is also removed. Words are normalized to a single form as given in table-3.

TABLE-3 LETTER NORMALIZATION CRITERIA
Original form (word prefix) | Normalized to
أ , إ | ا
ـة | ـه
فال , وال , بال , كال , لل | ال
اا | ا
وو | و
فف | ف
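A small sketch of the letter-normalization step of table-3 is shown below, written as simple string replacements; the rule set and ordering follow the table, and everything else about the function is an assumption for illustration.

```python
import re

# Normalization rules taken from table-3 (applied to word prefixes / letters).
NORMALIZATION_RULES = [
    (r"[أإ]", "ا"),                      # hamza forms of Alif -> bare Alif
    (r"ة", "ه"),                         # Ta Marbuta -> Ha
    (r"^(وال|فال|بال|كال|لل)", "ال"),     # article variants at the start of a word -> AL
    (r"اا", "ا"),                        # collapse doubled letters
    (r"وو", "و"),
    (r"فف", "ف"),
]

def normalize_word(word):
    """Apply the table-3 normalization rules to a single token."""
    for pattern, replacement in NORMALIZATION_RULES:
        word = re.sub(pattern, replacement, word)
    return word
```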
To ensure that all of the words that come before/after all AL-Words in the collection are selected, the indexing procedure is performed by selecting all of the definite words in the collection as a first step; then a second pass over the collection is performed to select the words coming before/after them. The selected words are stemmed by the Light10 stemmer.

Results presentation and discussion: Topics are indexed the same way as documents, such that a query is formed by combining the header of a topic and its description. Queries are submitted to the system, a ranked list of the returned documents for each query is obtained, and the Mean Average Precision (MAP) is calculated for each indexing method. A comparison between the results of the experiments that apply each of the proposed methods is presented in Fig-2, while
statistical data about the index size is listed in table-4. Table-5 presents a detailed comparison of the results of the different indexing methods. The presented results show that selecting the definite words (AL-Words), and the words before/after them, to index documents is more effective than selecting all terms; for example, selecting AL-Words and the words before them (ALBEFORE) outperformed selecting all terms, either stemmed by the Light10 stemmer (LIGHT10) or without stemming (ALL_TERMS), as in Fig-2(a). It could be noted that the results of indexing the documents using ALBEFORE, ALAFTER, and ALBEFORE_AFTER are close, as presented in Fig-2(b) and detailed in table-5. The reason for this closeness in results can be justified since most of the words that succeed the definite words can also precede them, as shown by manually scanning a number of documents that were randomly selected from the TREC-2001/2002 collection. The scanned documents show that 73% of the words that follow AL-Words also precede them.

The proposed method is evaluated using other metrics, namely the R-Precision and the precision at the 10th retrieved document (10-Precision). R-Precision represents the precision at the Rth retrieved document, where Ri is the number of documents known to be relevant (according to the judgment provided by the Linguistic Data Consortium, LDC) to a user need Qi. The results show that the proposed indexing method gains higher R-Precision and 10-Precision than using all terms, either stemmed (LIGHT10) or not (ALL_TERMS), as presented in table-5.
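The three reported metrics can be computed from a ranked result list as in the following sketch; it assumes binary relevance judgments and is only meant to make the definitions above concrete.

```python
def precision_at_k(ranked_docs, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant (10-Precision uses k=10)."""
    top_k = ranked_docs[:k]
    return sum(1 for d in top_k if d in relevant) / k

def r_precision(ranked_docs, relevant):
    """Precision at R, where R is the number of relevant documents for the topic."""
    return precision_at_k(ranked_docs, relevant, len(relevant))

def average_precision(ranked_docs, relevant):
    """Average of the precision values at each rank where a relevant document appears."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked_docs, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """MAP over a set of topics; `runs` maps a topic id to (ranked_docs, relevant_set)."""
    return sum(average_precision(r, rel) for r, rel in runs.values()) / len(runs)
```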
A comparison between the dictionary sizes of the proposed indexing techniques, as presented in table-4, shows that the dictionary sizes of ALBEFORE and ALAFTER are very close, and the index sizes (number of postings) as well. However, selecting AL-Words and both the words before and after them (ALBEFORE_AFTER) does not enhance the MAP value over that of ALBEFORE and ALAFTER, even though it has a larger index size. This result could be justified as follows: the number of index terms (in the case of ALBEFORE_AFTER) is much higher than that of both ALBEFORE and ALAFTER, which makes the number of index terms derived for each document increase; this means that the possibility of a document randomly sharing insignificant terms with a query also increases. As a consequence, more irrelevant documents could be returned as a response to that query, increasing the number of returned false positive documents, which decreases the precision value.

TABLE-4 INDEX TERMS' STATISTICS FOR ALL INDEXING TECHNIQUES IMPLEMENTED USING THE LIGHT10 STEMMER
METHOD | INDEX TERMS: NUMBER | INDEX TERMS: % TO ALL-TERMS | POSTINGS: NUMBER | POSTINGS: % TO LIGHT10 | POSTINGS / TERM | POSTINGS: % TO ALL-TERMS
AL-WORDS | 3195 | 06.7 | 273746 | 74.4 | 856. | 67.9
ALBEFORE | 2196 | 46.4 | 347631 | 94.5 | 158. | 86.2
ALAFTER | 2207 | 46.6 | 345367 | 93.9 | 156. | 85.7
ALBEFORE_AFTER | 2498 | 52.8 | 364403 | 99.1 | 145. | 90.4
LIGHT10 | 2681 | 56.6 | 367695 | 100 | 137. | 91.2
ALL_TERMS | 4730 | 100 | 402842 | 109. | 85.1 | 100.

Table-5 shows that selecting definite words and the terms before/after them enhances the result of selecting all terms stemmed by the Light10 by 4.4% on average, while it reduces the index size (number of postings) by 6.1% and the dictionary size (number of terms) by 17%. However, selecting only the definite words (ALONLY) got 77% of the MAP that was gained by selecting all words stemmed by the Light10, but it decreases the index size by 25%. The results of the proposed method give implementers more options to trade off between index size and higher precision.

Fig. 2 The average interpolated precision at 11 recall levels: (a) comparison between ALBEFORE, LIGHT10, and ALL_TERMS; (b) comparison between the different document-indexing options proposed in this paper.

TABLE-5 COMPARISON OF PRECISION AT THREE LEVELS
METHOD | 10-PRECISION | R-PRECISION | MAP VALUE | MAP % TO LIGHT10 | MAP % TO ALL TERMS
AL-WORDS | 0.486 | 0.330 | 0.311 | 93.95 | 113.08
ALBEFORE | 0.501 | 0.364 | 0.346 | 103.81 | 125.72
ALAFTER | 0.505 | 0.368 | 0.347 | 104.23 | 126.23
ALBEFORE_AFTER | 0.499 | 0.370 | 0.348 | 104.48 | 126.53
LIGHT10 | 0.486 | 0.353 | 0.331 | 100.00 | 121.10
ALL_TERMS | 0.456 | 0.315 | 0.275 | 82.57 | 100.00
10-PRECISION: precision at the 10th retrieved document. R-PRECISION: precision at R (number of actual relevant documents of a topic).

There is much recent research conducted to improve Arabic text retrieval; El-Mahdaouy et al. in [11], for example, improved the retrieval by using a query expansion scheme based on embedding the most similar terms to the query terms (either in the document or in the collection). They applied their method using the Farasa stemmer [1]. Their results show a 6.5% enhancement over their baseline (BM25), while in this paper the enhancement is 4.4% over the baseline (the Light10 stemmer). This difference could be due to the fact that the authors of [11] used a different stemmer, and their system is trained on many collections, including TREC, which was used to test their system. Moreover, this paper uses only about 82% of the terms indexed by the baseline, while they used 100% of the terms indexed by their baseline.
Recently, Mustafa et al., in [17], applied their extended Light10 stemmer on the TREC 2001/2002 collection and used the average precision as a performance criterion; the results show a 5% enhancement over the basic Light10, while they show a 13% enhancement over the baseline using the linguistic-based stemmer. This paper improves the Light10 retrieval by the same percentage (using the MAP measure), while at the same time reducing the postings of the Light10 by 6.1%.

Concerning the reduction of the dimensionality of the index space, Awajan in [7] reduced the dimension of the stemmed index space by about 18% when applied on 100KB of text, while in this paper the space (number of postings) is reduced by 32% when selecting only about 7% of all terms (for example in the case of ALONLY), applied on 1GB of text. Moreover, selecting ALONLY terms got 93.4% of the MAP gained by selecting all terms (stemmed by the Light10), and 121.1% of the MAP of all words without stemming.

V. CONCLUSION

This paper proposed a simple and effective indexing method for Arabic documents. The method adopts a technique of document indexing that relies on selecting the definite words and the words surrounding them, namely the AL-Words, and the AL-Words with the words before and/or after them.

The proposed indexing techniques are tested for retrieval using a standard corpus of Arabic documents; the results give an argument to consider the effectiveness of definite words for indexing documents. Likewise, also considering the words before/after the definite words shows better results. A comparison shows that the proposed document indexing gives better results than selecting all terms, whether stemmed or not, at different retrieval levels: MAP, R-Precision, and precision at the 10th retrieved document.

However, the Arabic language has a wide range of sentence forms and phrases, which can be used as heuristics to select indexing terms for more efficient retrieval. As future work, more experiments will be conducted using other test collections and other stemmers, to give more evidence of the effectiveness of the proposed method, and to disclose more related heuristics that could be derived from the structure of the Arabic language.

REFERENCES
[1] Abdelali A., Darwish K., Durrani N., et al. "Farasa: A Fast and Furious Segmenter for Arabic." In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations, San Diego, CA, 12-17 June 2016, pp. 11-16. http://www.aclweb.org/anthology/N16-3003
[2] Al-Kabi, Mohammed N., Saif A. Kazakzeh, Belal M. Abu Ata, Saif A. Al-Rababah, and Izzat M. Alsmadi. 2015. "A Novel Root Based Arabic Stemmer." Journal of King Saud University - Computer and Information Sciences 27 (2). https://doi.org/10.1016/j.jksuci.2014.04.001
[3] Al-Sughaiyer, Imad A., and Ibrahim A. Al-Kharashi. 2004. "Arabic Morphological Analysis Techniques: A Comprehensive Survey." Journal of the American Society for Information Science and Technology 55 (3): 189-213. https://doi.org/10.1002/asi.10368
[4] Al-Thubaity A., Alhoshan M., and Hazzaa I. 2015. "Using Word N-Grams as Features in Arabic Text Classification." In Studies in Computational Intelligence. Springer, Cham. https://doi.org/10.1007/978-3-319-10389-1_3
[5] Ali, Chedi Bechikh, and H. Haddad. 2013. "A Quality Study of Noun Phrases as Document Keywords for Information Retrieval." In International Conference on Control, Engineering & Information Technology (CEIT13), Economics & Strategic Management of Business Process. Sousse, Tunisie.
[6] Awajan, Arafat. 2015. "Keyword Extraction from Arabic Documents Using Term Equivalence Classes." ACM Trans. Asian Low-Resour. Lang. Inf. Process. 14 (2): 7:1-7:18. https://doi.org/10.1145/2665077
[7] Awajan, Arafat. 2015. "Semantic Similarity Based Approach for Reducing Arabic Texts Dimensionality." International Journal of Speech Technology 19 (2): 191-201. https://doi.org/10.1007/s10772-015-9284-6
[8] Brahmi, Abderrezak, Ahmed Ech-Cherif, and Abdelkader Benyettou. 2013. "An Arabic Lemma-Based Stemmer for Latent Topic Modeling." The International Arab Journal of Information Technology 10 (2): 160-68.
[9] Chennoufi, Amine, and Azzeddine Mazroui. 2017. "Morphological, Syntactic and Diacritics Rules for Automatic Diacritization of Arabic Sentences." Journal of King Saud University - Computer and Information Sciences 29 (2): 156-63. https://doi.org/10.1016/j.jksuci.2016.06.004
[10] Darwish, Kareem. 2014. "Arabic Information Retrieval." Foundations and Trends in Information Retrieval 7 (4): 239-342. https://doi.org/10.1561/1500000031
[11] El-Mahdaouy, Abdelkader, Saïd Ouatik El Alaoui, and Eric Gaussier. 2018. "Improving Arabic Information Retrieval Using Word Embedding Similarities." International Journal of Speech Technology 21 (1): 121-136. https://doi.org/10.1007/s10772-018-9492-y
[12] El-Shishtawy, Tarek, and Fatma El-Ghannam. 2012. "An Accurate Arabic Root-Based Lemmatizer for Information Retrieval Purposes." IJCSI International Journal of Computer Science 9 (1).
[13] Jaafar, Younes, Driss Namly, Karim Bouzoubaa, and Abdellah Yousfi. 2017. "Enhancing Arabic Stemming Process Using Resources and Benchmarking Tools." Journal of King Saud University - Computer and Information Sciences 29 (2): 164-70. https://doi.org/10.1016/j.jksuci.2016.11.010
[14] Kanan, Tarek, and Edward A. Fox. 2016. "Automated Arabic Text Classification with P-Stemmer, Machine Learning, and a Tailored News Article Taxonomy." Journal of the Association for Information Science and Technology 67 (11): 2667-2683. https://doi.org/10.1002/asi.23609
[15] Larkey, L., L. Ballesteros, and M. Connell. 2007. "Light Stemming for Arabic Information Retrieval." Arabic Computational Morphology, 221-243. https://doi.org/10.1145/564376.564425
[16] Larkey, Leah S., Lisa Ballesteros, and Margaret E. Connell. 2002. "Improving Stemming for Arabic Information Retrieval: Light Stemming and Co-Occurrence Analysis," 275-82.
[17] Mustafa, Mohammad, Afag Salah Aldeen, Mohammed E. Zidan, Rihab E. Ahmed, and Yasir Eltigani. 2019. "Developing Two Different Novel Techniques for Arabic Text Stemming." Intelligent Information Management 11: 1-23.
[18] Mustafa, Mohammad, Afag Salah Eldeen, Sulieman Bani-Ahmad, and Abdelrahman Osman Elfaki. 2017. "A Comparative Survey on Arabic Stemming: Approaches and Challenges," 39-67. https://doi.org/10.4236/iim.2017.92003
[19] Nehar, Attia, Djelloul Ziadi, and Hadda Cherroun. 2015. "Rational Kernels for Arabic Root Extraction and Text Classification." Journal of King Saud University - Computer and Information Sciences. https://doi.org/10.1016/j.jksuci.2015.11.004
[20] Obadah, Muhammad I. 2011. Dictionary of Syntax, Morphology, and Lexicon Grammar, Presentations and Rhyme. First edition. Cairo: Literature Library.
[21] Taghva, Kazem, Rania Elkhoury, and Jeffrey Coombs. 2005. "Arabic Stemming Without a Root Dictionary." In International Conference on Information Technology: Coding and Computing (ITCC), 152-57. Las Vegas, Nevada. https://doi.org/10.1109/ITCC.2005.90
[22] Lv, Yuanhua, and ChengXiang Zhai. 2011. "Lower-Bounding Term Frequency Normalization." In Proceedings of the 20th ACM International Conference on Information and Knowledge Management, 7-16. Glasgow, Scotland, UK: ACM. https://doi.org/10.1145/2063576.2063584
Evaluation of Question Classification
Mariam Biltawi, Arafat Awajan, Sara Tedmori
Computer Science Department
King Hussein School of Computing Sciences
Princess Sumaya University for Technology
Amman, Jordan
maryam@psut.edu.jo, awajan@psut.edu.jo, s.tedmori@psut.edu.jo

Abstract— The goal of this paper is to study question which is either to impose, to rebuke, to threat, to command, to
classification for the Arabic language using machine learning pray, or to hope. Arabic questions can be asked either by using
approaches. Different experiments were conducted using two Arabic IWs or without using them. There are two types of
types of weighting schemes and three classifiers; Multinomial Arabic IWs; (1) question particles used for yes/no questions,
Naïve Bayes, Decision Trees, and Support Vector Machine. The and they are Hamza (‫ )أ‬and Hal (‫)ھل‬, and (2) other question
dataset used in the experiments is an updated version of CLEF. words such as; who - ‫من‬, what – ‫ ماذا‬،‫ما‬, where - ‫اين‬, when –
The best results were obtained when the dataset was ‫ ايان‬، ‫متى‬, how – ‫ انى‬، ‫كيف‬, how much/many - ‫كم‬, which – ‫[ اي‬2].
preprocessed by removing punctuation, diacritics, stop-words, Some Arabic questions do not use IW. Such questions can be
and performing normalization and stemming and then using the
list questions that might start with the words (‫ عدد‬، ‫ )اذكر‬and
TF-IDF weighting scheme; with SVM being the best classifier
which mean list or explanation questions that start with the
among the three with an F1-score of 81%.
words (‫ فسر‬، ‫ )اشرح‬and which mean explain.
Keywords—question classification, Arabic question Question are classified in order to determine either the
classification, multinomial Naïve Bayes, Decision Trees, Support question type or the answer type. Questions can be factoid or
Vector Machine non-factoid. Factoid questions can have different answers, for
I. INTRODUCTION example a person name, organization, locations, etc. While
non-factoid questions can be definition, casual, etc. The goal
Question Answering (QA) is an application of computer of this paper is to experiment question classification utilizing
science that spans several core areas including information the updated version of the translated CLEF dataset. The
retrieval and natural language processing. QA is concerned contents of original version of the CLEF dataset can be dated
with building systems that can automatically answer questions back to 1994. Hence, some of the questions are outdated and
provided by humans. QA can be seen as an extension to search can’t be utilized for purposes of constructing QA systems that
engines but rather than providing a group of documents as a use the web as the answers’ source. In addition, some of the
result, QA systems provide concise and correct answers, questions in the original version of CLEF are syntactically
saving navigation time for users. Generally, QA systems can incorrect. This paper is organized as follows: section 2
be classified based on the source of answers, which can either presents the related work, section 3 provides information
be structured or unstructured. Unstructured can be documents relating to the used dataset, section 4 presents the classifiers
from the web, while structured QA systems may use used in the experiment, section 5 presents weighting schemes
knowledge bases [1]. used, section 6 presents the evaluation measures, section 7
QA systems rely on different fields and technologies, presents the experiment and the results, section 8 discusses the
including; NLP, information retrieval, semantic web results and section 9 presents the conclusion.
technologies, database technologies, and human computer II. RELATED WORK
interaction. QA systems can be implemented to construct
either structured or unstructured answers to deal with different Generally text classification can be categorized into; (1)
types of questions including: How, Why, Fact, List, definition, rule-based techniques [3], (2) Machine Learning (ML)
Cross-lingual, semantically constrained, and hypothetical techniques [4], and (3) hybrid-based techniques [5]. Rule-
questions. These questions can be either domain-specific, or based techniques are usually unsupervised since no model
open-domain that deal with nearly anything [1] . needs to be trained, and they can be either lexicon-based,
pattern-based, or both. ML techniques are usually named
A question is a natural language sentence, phrase, or even corpus-based techniques and are categorized into supervised
a word, used to request information or test someone’s learning and unsupervised learning. However, hybrid-based
knowledge. A question usually starts with an interrogation techniques are usually either a mix of supervised and
word (IW). Questions posed in the beginning of a research in unsupervised learning techniques, or a mix of rule-based and
order to identify the main objectives of the study or to ML techniques.
determine the type of the problem the writer is trying to solve
are referred to as research questions. Rhetorical questions, on Some researchers have experimented with question
the other hand refer to the questions that are used to begin a classification techniques in the Arabic language. Al-Chalabi
discussion or to emphasize a point rather than requesting a et al. [6] presented a rule-based Arabic question classifying
direct answer. Rhetorical questions can either have obvious technique. Their technique relies on the Arabic (IW) where
answers or can be used as metaphors (example: Can birds fly?) each IW within a question represents a class, while questions
or can have no answers and hence used for negative assertion that do not use IW were neglected. The authors considered the
or sarcasm (example: Who cares?). IWs; (‫ كم‬- how much, how many, how far, and how long), (‫من‬
- who), (‫ ما‬- what), (‫ اين‬- where), (‫ متى‬- when), (‫ اي‬- which),
In Arabic, there are two types of questions: (1) Real (‫ كيف‬- how). They have proposed patterns for each class as
questions: that are used to request a direct answer from the illustrated in table 1. Most of the patterns start with IW and
respondent and (2) Metaphorical questions: that do not seek a maybe followed by a noun phrase (NP) or a verb phrase (VP),
specific answer. Metaphorical questions are identical to and any word format (WF) that will not affect the
rhetorical questions discussed earlier in terms of their purpose classification process. The only IW that may start with a
proposition (PP) is (‫ اي‬- which). The IW (‫ ما‬- what) is either [12] dataset, and results showed an accuracy of 92.83% in
followed by (HOA - ‫)ھو‬, (HEA - ‫)ھي‬, or a NP. The experiment classifying the course-grain classes and an accuracy of
was conducted on 200 questions, applied on context free- 89.32% in classifying fine-grain classes.
grammar and regular expressions written using NooJ tool.
Results showed a recall and precision of 93% and 100% III. DATASET USED
respectively. The experimented dataset is an updated version of the
original translated CLEF dataset and which consists of 800
TABLE I. QUESTION PATTERNS PROPOSED IN [6] question-answer pairs. For purposes of this research, a native
IW ANSWER TYPE CLASS PATTERN Arabic expert was assigned the mission to review the dataset
‫( كم‬HOW MUCH, Number IW NP VP WF for correctness. Knowing that the contents of the dataset can
HOW MANY, IW VP WF be tracked back to 90s. Therefore, some of the questions were
HOW FAR, AND either updated or deleted. For example the question “ ‫كيف يمكن‬
HOW LONG)
‫( من‬WHO) Person/ Organization IW NP WF
‫ ”للتصوير بالرنين المغناطيسي العمل؟‬is syntactically incorrect, and
‫( ما‬WHAT) Device/ Geographical IW HOA NP WF there is no answer attached to it; thus, it was deleted. Other
location/ Sports/ IW HEA NP WF questions were syntactically corrected such as “ ‫كيف عدد سكان‬
Organization/ Art/ Person IW NP WF ‫ ”فيتنام ؟‬and mean “how is the population of Vietnam?”, the IW
‫( اين‬WHERE) Geographical location IW VP WF “‫ ”كيف‬means “how” is replaced with “‫ ”كم‬which means “how
‫( متى‬WHEN) Date IW VP WF much”.
‫( اي‬WHICH) Number/ Geographical PP IW NP WF
location/ History/ Sports IW NP WF Replicated questions were also deleted, keeping just one
‫( كيف‬HOW) Science IW VP WF form of them, noting that some of the replicas were
syntactically incorrect. For example the question “‫”ما ھو أداه؟‬
Al-Shawakfa [7] proposed a rule-based technique to which means “what is the tool” is replicated twice with
classify questions according to IWs. The examined IWs; (‫ من‬- different answer each time, the question is ambiguous and
who, whose), (‫ متى‬- when), (‫ – اين‬where), (‫ماذا‬, ‫ ما‬- what, does not ask about a specific thing, nor does it mimic human
which), (‫مما‬, ‫ – ما‬what), (‫ماھي‬, ‫ ما ھو‬- what is), (‫ كم‬- how much, behavior; thus, both question were deleted. In addition,
how many), (‫ – لماذا‬why), (‫ – اي‬which), (‫ – كيف‬how). The questions that have English letters were excluded as well. For
classes assigned are; person, organization, temporal example, questions like “‫ ما ھي‬UEFA ‫ ”؟‬and means “what is
expressions, location, product, event, object, device, sports, UEFA?” and its answer is “‫”االتحاد االوروبي لكرة القدم‬, were also
art, thing, numeric expressions, reason, history, and science. omitted because they are not pure Arabic questions.
The question is tokenized, then classified according to a set of Furthermore, questions like “‫ ”من ھو كريستو؟‬which means
patterns defined by the authors. “who is Cristo?” and its answer was “‫ ”فنان من أصل ھنغاري‬which
Lahbari et al. [8] proposed a rule-based method to classify means “an artist with Hungarian origin”, and which have no
Arabic questions, they also compared between two types of records when a google search is performed, have also been
question taxonomies; the first taxonomy is Arabic taxonomy excluded. Some other questions were updated, for example the
proposed by the authors, while the second one is proposed by question “‫ ”من ھو رئيس وكالة الطاقة الذرية ؟‬and means “Who is the
Li and Roth in [9]. Their rule-based method first starts by head of the IAEA?” and at that time the head of IAEA was
normalizing the questions through removing diacritics and “Hans Blix”, was updated into “ ‫من كان رئيس الوكالة الدولية للطاقة‬
punctuation, then tokenizing. Next is the pattern matching ‫ ؟‬1997-1981 ‫ ”الذرية في الفترة‬which means “who was the head of
step, where each IW corresponds to one class, except what ( ،‫ما‬ IAEA in period of 1981-1997?”.
‫ )اي‬which can have multiple classes according to the noun The total number of updated questions is 189 question-
present in the question; therefore, it is further processed by answer pairs, while the excluded questions were 200 question-
removing stop-words and identifying the nouns. Experiments answer pairs. The resulting size of the dataset was 600
were conducted using CLEF and TREC translated datasets. question-answer pairs. Question were given labels manually,
The Arabic taxonomy classes used to label the question were; and the total number of classes were eight; casual, definition,
time, description, location, human, number. The question description, entity, human, list, location, and numeric.
classes used for experimentation from Li and Roth taxonomy Questions assigned the classes; entity, human, location, and
were; abbreviation, definition, description, location, person, numeric, are of type factoid questions, where the answers are
time, number, entity, and other. Experimental results showed short and represent facts. For example, entity can be an
an accuracy of 78% and an error rate of 3.39%. The same organization, metric, currency, etc. While human can be
authors conducted three experiments using the same question human name or occupation. Location may represent city,
taxonomies to compare between three classifiers (SVM, NB, country, river, or mountain. And numeric can be date, time,
and DT) in [10]. Results showed that SVM outperformed both population, etc.
NB and DT when Arabic taxonomy used with a recall,
precision, and f-measure of 89%, 93%, and 90% respectively. The remaining classes (casual, definition, description, and
list) are considered non-factoid and their answer may exceed
Aouichat et al. [11] presented an approach to classify one sentence, and can have multiple answers according to the
Arabic questions using Li and Roth taxonomy. Their approach writing style, and each class differs from the answer it asks
starts by preprocessing the questions through applying for, for example casual questions usually ask for a reason or
tokenization, removing diacritics, normalization, and date and purpose. Definition questions ask for definition for a term or
time labeling using regular expressions. Next, the an entity, description questions ask for methods and
preprocessed questions are fed into the SVM classifier to explanations, and list questions ask for steps. Table 2 shows
assign them a course-grain class, and then they are fed into the the number of questions under each class.
Convolutional Neural Network (CNN) to assign them a fine-
grain class. Experiments were conducted on TALAA-AFAQ TABLE II. NUMBER OF QUESTIONS UNDER EACH CLASS

CLASS NUMBER OF QUESTIONS numeric 23
CASUAL 5 ‫كم‬ How 58
DEFINITION 60 (much/many)
DESCRIPTION 16 ‫متى‬ When numeric 54
ENTITY 107 ‫ماذا‬ What casual 1
HUMAN 110 definition 1
LIST 21 entity 6
LOCATION 128 human 4
NUMERIC 153 list 2
TOTAL 600 ‫الى‬ To casual 1
Table 3, shows the first tokens used in the dataset along entity 1
with their count, noting that these tokens are normalized and numeric 1
the affixes were not removed. As illustrated in the table there ‫اين‬ Where location 72
‫بمن‬ Whom, who human 1
are seven tokens that are not IWs and came at the beginning ‫اي‬ Which entity 1
of the questions, these tokens are ( ، ‫ اعطي‬، ‫ فوق‬، ‫ عدد‬، ‫ الى‬، ‫في‬ ‫عدد‬ List, enumerate list 6
‫ منذ‬، ‫)على‬. Note that IWs in the Arabic language can come ‫بماذا‬ What casual 1
either; (1) at the beginning of the question, (2) as a second entity 2
token in the question, and (3) at the end of the question. In the location 1
updated CLEF, the third case is not included. ‫فوق‬ On, above location 1
‫كيف‬ How description 13
TABLE III. FIRST TOKENS FOR THE QUESTION IN THE DATASET ‫اعطي‬ Give entity 7
human 2
FIRST MEANING IW FREQUENCY list 3
TOKEN location 4
‫ما‬ What Yes 179 ‫فيما‬ What casual 1
‫من‬ Who Yes 132 ‫على‬ On, onto numeric 1
‫في‬ In No 40 ‫لماذا‬ Why casual 1
‫كم‬ How Yes 58 ‫الي‬ For what entity 1
(much/many) ‫منذ‬ Since numeric 1
‫متى‬ When Yes 54 ‫باي‬ Which entity 1
‫ماذا‬ What Yes 14 TOTAL 600
‫الى‬ To No 3 Table 5 illustrates the second token for the IW “‫ ”ما‬and the
‫اين‬ Where Yes 72 non-IWs “‫ منذ‬، ‫ على‬، ‫ فوق‬، ‫ الى‬، ‫”في‬. It also shows the number of
‫بمن‬ Whom, who Yes 1 times that both the first and second tokens occur together.
‫اي‬ Which Yes 1
Obviously, the IW (‫ )ما‬is frequently used with the pronouns
‫عدد‬ List, enumerate No 6
‫بماذا‬ What Yes 4 (‫ ھي‬، ‫)ھو‬. The question classes of questions starting with the
‫فوق‬ On, above No 1 IW “‫ ”ما‬can differ according to the second or third tokens. For
‫كيف‬ How Yes 13 example, the question “‫”ما ھي المجرة التي ينتمي اليھا كوكب األرض ؟‬
‫اعطي‬ Give No 16 which means “What is the galaxy to which the Earth
‫فيما‬ What Yes 1 belongs?” is given the class “entity”, and the question “ ‫ما ھي‬
‫على‬ On, onto No 1 ‫ ”غولدمان ساكس ؟‬which means “What is Goldman Sachs?” is
‫لماذا‬ Why Yes 1 given the class “definition”, both questions start with the IW
‫الي‬ For what Yes 1
“‫ ”ما‬followed by the pronoun “‫”ھي‬.
‫منذ‬ Since No 1
‫باي‬ Which Yes 1 Questions that start with non-IWs, such as “‫ ”في‬is followed
TOTAL 600 with (‫ ايه‬، ‫)اي‬, to indicate either time, location or an entity. The
Table 4 shows the number of classes under each IW. For questions that start with (‫ )الى‬are followed with one of the
example question starting with (‫ )ما‬can be assigned one of the tokens (‫ ماذا‬، ‫ كم‬، ‫)اي‬. Questions having the second token (‫)اي‬
seven classes; definition, description, entity, human, list, indicate the answer to be an organization, and thus are given
location, and numeric. Therefore it will be hard to specify the class entity. Questions having the second token (‫)كم‬
rules or patterns for each case. The questions with the IW “‫”من‬ indicate that the answer is a number, thus are given the class
can be given one of six classes. numeric. Finally, questions having the second token (‫)ماذا‬
indicate that the answer should be a reason or purpose,
TABLE IV. NUMBER OF CLASSES ACCORDING TO IWS
therefore are given the class casual.
FIRST MEANING CLASS FREQUENCY
TOKEN TABLE V. THE OCCURRENCE OF TWO TOKEN TOGETHER
‫ما‬ What definition 22
description 2 FIRST SECOND MEANING SECOND
entity 77 TOKEN TOKEN FREQ
human 24 ‫ما‬ ‫ھو‬ He 75
list 9 ‫ھي‬ She 83
location 30 ‫الذي‬ That, whose, 7
numeric 15 whome
‫من‬ Who definition 37 ‫اسم‬ Name 9
description 1 ‫االسباب‬ Reasons 1
entity 10 ‫الشركتين‬ The two 1
human 79 companies
list 1 ‫جنسيه‬ Nationality 1
location 4 ‫االفرقه‬ Teams 1
‫في‬ In entity 1 ‫اصل‬ origin 1
location 16 ‫في‬ ‫ايه‬ Which 19

‫اي‬ Which 21 ‫الشركتين‬ The two 1 No ‫باع‬ Sold 1 No
‫الى‬ ‫اي‬ Which 1 companies
‫استغرق‬ Took 1 No ‫نجح‬ Succeesed 1 No
‫كم‬ How (much/ 1
‫من‬ Of 4 No ‫عاش‬ Lived 1 No
many) ‫اخترع‬ Invented 1 No ‫اطيح‬ Dropped 1 No
‫ماذا‬ What 1 ‫اقيمت‬ Establish 2 No ‫جرت‬ Took place 1 No
‫فوق‬ ‫ايه‬ Which 1 ‫اقيم‬ Establish 1 No
‫على‬ ‫اي‬ Which 1
‫منذ‬ ‫متى‬ When 1 IV. EXPERIMENTED CLASSIFIERS
In this paper, the task of question classification was
Table 6 illustrates all the second tokens that occurred in performed using three classifiers; Multinomial Naïve Bayes
the dataset. Note that the second token maybe an IW attached (MNB), Decision Trees (DT), and Support Vector Machine
to another word. For example the question “ ‫إلى كم يصل عدد‬ (SVM).
‫”السكان في الواليات المتحدة األمريكية ؟‬, here we can see that “‫”إلى‬ 1. Multinomial Naïve Bayes (MNB):
which means “to” is not an IW but it is attached to an IW. The
IW is “‫ ”كم‬and means how much, the two words together mean Naïve Bayes classifier was selected because it can work
(to how much) and the full question means “How many people well on small datasets and can be considered computationally
are in the United States?” As a conclusion, using rules or fast. There are several extensions of NB classifiers, and
identifying patterns for question is time consuming, because Multinomial NB (MNB) is one that works with discrete
the IWs as stated earlier may come at the beginning of the features. MNB is a widely used text classifier and is useful
question, as the second token for the question, or even at the classifier when term frequency matters. Generally NB
end of the question. Therefore, the purpose of this paper is to classifiers refers to the independent assumption between the
examine machine learning classifiers to test their capability of features given the class in the model, equation 1 represents the
classifying the classes given for the updated version of CLEF Bayes theorem.
dataset.

P(c|q) = P(q|c) P(c) / P(q) …………………………… (1)
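As an illustration only, the following minimal sketch scores a question against each class with equation (1), using add-one smoothed word likelihoods and dropping P(q) as the text suggests; the function names and the multinomial training routine are assumptions, not the paper's implementation.

```python
import math
from collections import Counter, defaultdict

def train_mnb(questions, labels):
    """Collect class priors and word counts for a multinomial model."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for q, c in zip(questions, labels):
        for w in q.split():
            word_counts[c][w] += 1
            vocab.add(w)
    return class_counts, word_counts, vocab

def predict_mnb(question, class_counts, word_counts, vocab):
    """Pick argmax_c of log P(c) + sum_w log P(w|c); the constant P(q) is dropped."""
    total = sum(class_counts.values())
    best_class, best_score = None, float("-inf")
    for c, n_c in class_counts.items():
        score = math.log(n_c / total)                       # prior P(c)
        denom = sum(word_counts[c].values()) + len(vocab)   # add-one smoothing
        for w in question.split():
            score += math.log((word_counts[c][w] + 1) / denom)
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```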
TABLE VI. SECOND TOKENS IN THE DATASET
SECOND MEANING FREQ IW SECOND MEANING FREQ IW Where ( | ) is the posterior probability, i.e. the
TOKEN
‫ھو‬ He 140 No
TOKEN
‫تنحى‬ Step aside 1 No
probability of the class given the question, ( | ) is the
‫ايه‬ Which 20 Yes ‫تولى‬ Took over 1 No likelihood, i.e. the probability of the question given the class,
‫ھي‬ She 91 No ‫اندلعت‬ Broke out 2 No ( ) is the prior, i.e. probability of the class, and ( ) is the
‫قدر‬ Estimate 1 No ‫قام‬ Started 3 No
‫عدد‬ Number 36 No ‫دخلت‬ Entered 1 No
normalization constant, i.e. the probability of the question.
‫حاز‬ Possess 1 No ‫انتقل‬ Moved 1 No Generally, the normalization constant is neglected.
‫يبلغ‬ Reaches 2 No ‫الفاٮز‬ Winner 1 No
‫اي‬ Which 25 Yes ‫جنسيه‬ Nationality 1 No 2. Decision Trees (DT):
‫يسمى‬ Called 2 No ‫ولد‬ Born 5 No
‫يوجد‬ Exist 19 No ‫تاسست‬ Established 3 No A Decision Tree (DT) is a non-parametric supervised tree-
‫كان‬ Was 21 No ‫وصل‬ Reached 1 No
‫الذي‬ That, 42 No ‫حل‬ Settle 1 No
based machine learning algorithm, that can be used for
whose, classification and regression problems. DT can perform multi-
whome class classification, which is the case in this paper. DT can
‫تزوج‬ married 3 No ‫تعني‬ Mean 1 No
‫نوع‬ Type 1 No ‫فريق‬ Team 1 No
handle both categorical and numerical data. It can map non-
‫العناصر‬ Elements 1 No ‫يقوم‬ Do/ work 1 No linear relationships among features. DT predict classes
‫ادخلت‬ Entered 1 No ‫ولدت‬ Born 2 No through learning simple decision rules from the training
‫تغطى‬ Covers 1 No ‫تجري‬ Take place 1 No
‫كم‬ How 1 Yes ‫اصبحت‬ Became 2 No examples.
(much/
many) 3. Support Vector Machine (SVM):
‫تم‬ Done 6 No ‫دفعت‬ Paid 1 No
‫اتھم‬ Accuse 2 No ‫بلغ‬ Reached 1 No Support Vector Machine (SVM) is a supervised machine
‫ھم‬ They 3 No ‫توجد‬ Located 6 No learning algorithm that works through fitting a boundary to a
‫الفٮات‬ Categories 1 No ‫تاسس‬ Founded 1 No
‫اجزاء‬ Parts 1 No ‫توغلت‬ Penetrated 1 No
region of training examples that are alike. SVM is known to
‫ظھر‬ Appear 1 No ‫يقع‬ Located 4 No perform well with small datasets. It also works well in high
‫ينتقل‬ Transfer 1 No ‫تنتج‬ Produce 3 No dimensional space, where natural text can be high
‫اسم‬ Name 23 No ‫حاله‬ Case 1 No
‫تحدث‬ Happen 1 No ‫تبيع‬ Sell 1 No dimensional. This paper experiments the linear SVM
‫استقال‬ Quit 1 No ‫افتتح‬ Opened 1 No classifier.
‫حصلت‬ Obtain 1 No ‫توفي‬ Died 3 No
‫يتم‬ Complete 4 No ‫الذين‬ Those 1 No V. WEIGHTING SCHEMES
‫بلغت‬ Reach 2 No ‫انعقد‬ Was held 1 No
‫حدث‬ Happen 3 No ‫وقع‬ Happened 2 No Before feeding the questions to the classifier, they are
‫قتل‬ Kill 2 No ‫تقام‬ Held 1 No
‫تفعل‬ Do 1 No ‫بدات‬ Started 1 No represented using two different schemes, Term-Frequency
‫مات‬ Die 5 No ‫عقد‬ Was held 2 No (TF), and TF-Inverse Document Frequency (TF-IDF). Where
‫اين‬ Where 2 Yes ‫عقدت‬ Was held 1 No TF is the number of times each term occurred in a document,
‫يمكن‬ Can 2 No ‫حطت‬ Landed 1 No
‫تحول‬ Convert 1 No ‫االفرقه‬ Teams 1 No and IDF represents the number of documents that a term
‫كانت‬ Was 3 No ‫تمت‬ Done 2 No appears in, it increases the weights of non-frequents terms,
‫بني‬ Build 2 No ‫متى‬ When 1 Yes while decreasing weights of frequent terms. Both TF and IDF
‫ماذا‬ What 5 Yes ‫اصل‬ Origin 1 No
‫االسباب‬ Reasons 1 No ‫يصب‬ Pour 1 No can be combined by multiplying their values, to adjust the
‫اطلقت‬ Launched 2 No ‫مره‬ Number of 2 No frequency of a term for how rarely it is used. Thus, TF is used
times
‫تقع‬ Located 19 No ‫تبلغ‬ Reaches 1 No
to measure frequency while TF-IDF is used to measure
‫اسماء‬ Names 5 No ‫صدر‬ Released 1 No relevancy, this applies on saying; terms that are frequent may
‫طرق‬ Methods 1 No ‫اصبح‬ Become 1 No
‫يعمل‬ Works 2 No ‫اصطدمت‬ Bumped 1 No

not be relevant, such as the stop-words. Equation 2 is the formula of IDF for the term t:

IDF_t = log(N / DF_t) …………………………… (2)

where N is the total number of documents (training examples), and DF_t is the number of documents the term t appeared in. Equation 3 is the formula for TF-IDF:

TF-IDF_{t,d} = TF_{t,d} × IDF_t …………………………… (3)

where TF_{t,d} is the frequency of the term t in the document d, and IDF_t is the IDF of the term t.

Normalization was conducted on all three forms of the dataset. The goal of normalization is to standardize letters; for example, Alif (ا), which has multiple forms (أ ، إ ، آ), is transformed into the bare Alif (ا). Another letter is TA (ة), which is sometimes written as HA (ه), therefore TA is transformed to HA. There is also Hamza (ء), which may come on Alif Maqsoora (ى) such as (ئ) or on Waw (و) such as (ؤ); the (ء) is removed in both cases. Punctuation and diacritics were both removed for the three forms of the dataset as well.

The difference between the first form of the preprocessed dataset and the second form is either to keep the stop words or
remove them. the removal of the stop-words is performed by
VI. EVALUATION MEASURES checking a lexicon that consists of Arabic normalized stop-
words, therefore, the first two tokens were not checked for
The measurements used to compute the experimental stop-word removal, because (‫ )من‬which is an IW if it comes as
results were; precision, recall, and F1-score. Where precision a first token, can be considered a stop-word when it come in
(equation 4) is the ratio of the correctly predicted examples the middle of the question.
among the retrieved examples, recall (equation 5) is the ratio
of correctly predicted examples among the total amount of The difference between the second and third form of the
relevant examples, and F1-score (equation6) models accuracy preprocessed datasets lays in adding stemming to the third
combining both precision and recall. Three types of averages form. While it differs from the first dataset in two things; it
are taken for each measurement, macro-average, micro- has no stop-words while the first form has, and the token are
average, and weighted-average. stemmed while the tokens in the first form are not stemming.
Stemming is done using the shallow stemmer ISRI.
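A sketch of producing the three preprocessed forms described above is given below; it uses NLTK's ISRIStemmer and a placeholder stop-word list, both of which are assumptions about the exact tooling rather than the authors' code.

```python
import re
from nltk.stem.isri import ISRIStemmer  # shallow Arabic stemmer mentioned above

ARABIC_DIACRITICS = re.compile(r"[\u064B-\u0652]")
PUNCTUATION = re.compile(r"[^\w\s]")
STOP_WORDS = {"من", "في", "على"}   # placeholder; the paper uses a normalized stop-word lexicon

def normalize(text):
    """Standardize letters: Alif variants to ا, ة to ه, and map ئ/ؤ back to ى/و."""
    text = re.sub(r"[أإآ]", "ا", text)
    return text.replace("ة", "ه").replace("ئ", "ى").replace("ؤ", "و")

def preprocess(question, remove_stop_words=False, stem=False):
    """Return form 1, 2 or 3 of a question depending on the flags."""
    text = PUNCTUATION.sub(" ", ARABIC_DIACRITICS.sub("", question))
    tokens = normalize(text).split()
    if remove_stop_words:
        # the first two tokens are kept, as interrogative words can look like stop words
        tokens = tokens[:2] + [t for t in tokens[2:] if t not in STOP_WORDS]
    if stem:
        stemmer = ISRIStemmer()
        tokens = [stemmer.stem(t) for t in tokens]
    return " ".join(tokens)
```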
P = TP / (TP + FP) …………………………… (4)

R = TP / (TP + FN) …………………………… (5)

F1 = 2 × P × R / (P + R) …………………………… (6)

Where P is the precision, R is the recall, TP is the number of true positive examples, FP is the number of false positive examples, FN is the number of false negative examples, and F1 is the F1-score.

There is no significant change in the number of unique tokens between the first form of the processed dataset and the second: before removing stop-words the number was 1732, while after removing the stop-words it became 1719, so only 13 unique tokens were removed with the stop-words. The number of unique tokens became 1243 after stemming. Table 7 shows these numbers.

TABLE VII. THE NUMBER OF UNIQUE TOKENS IN EACH FORM OF
PREPROCESSED DATASET
Micro-average computes the average for all classes by PREPROCESSED PREPROCESSING NUMBER OF
summing all their contributions globally, macro-average is DATASET PERFORMED UNIQUE TOKENS
implemented through computing each measure independently FORM1 Normalization 1732
for each class and then taking the average, and weighted Punctuation Removal
average is implemented by first computing each measure Diacritics Removal
FORM2 Normalization 1719
independently and then taking the average for each metric by Punctuation Removal
multiplying them with the number of instances in the class. Diacritics Removal
The two measures that can be important in the multi-class Stop-words Removal
problem is the micro- and weighted- average, because macro- FORM3 Normalization 1243
average treats all classes equally while the other two does not. Punctuation Removal
Diacritics Removal
The reason behind using these measurements is because the
Stop-words Removal
dataset is unbalanced, and the number of each class in the Stemming
dataset differs from one another, as demonstrated in table 8.
VII. EXPERIMENT AND EXPERIMENTAL RESULTS

Twelve experiments were conducted on the updated version of the CLEF dataset, where three forms of the dataset were prepared:

1. The questions were preprocessed by first normalizing them and then removing punctuation marks and diacritics, keeping the stop-words.
2. The questions were preprocessed by first normalizing them and then removing punctuation marks, diacritics, and also stop-words.
3. The questions were preprocessed by first normalizing them and then removing punctuation marks, diacritics, and stop-words, with stemming added to this step.

A word that appears as a first token can be considered a stop-word when it comes in the middle of the question.

The difference between the second and third forms of the preprocessed dataset lies in adding stemming to the third form, while the third form differs from the first form in two things: it has no stop-words while the first form has, and its tokens are stemmed while the tokens of the first form are not. Stemming is done using the shallow ISRI stemmer.

There is no significant change in the number of unique tokens between the first and second forms of the preprocessed dataset: before removing stop-words the number was 1732, while after removing the stop-words it became 1719, so only 13 unique tokens were removed with the stop-words. After stemming, the number of unique tokens became 1243. Table 7 shows these numbers.

TABLE VII. THE NUMBER OF UNIQUE TOKENS IN EACH FORM OF THE PREPROCESSED DATASET

PREPROCESSED DATASET   PREPROCESSING PERFORMED                                                         NUMBER OF UNIQUE TOKENS
FORM1                  Normalization, Punctuation Removal, Diacritics Removal                          1732
FORM2                  Normalization, Punctuation Removal, Diacritics Removal, Stop-words Removal      1719
FORM3                  Normalization, Punctuation Removal, Diacritics Removal, Stop-words Removal,     1243
                       Stemming

To experiment on the three forms of the preprocessed dataset, three classifiers were used (MNB, DT, and SVM), each time with one of two weighting schemes, TF and TF-IDF. The dataset is divided into a training set and a testing set, where 400 questions are used for training and 200 questions for testing. Table 8 shows the number of questions under each class for both the training and testing sets, and Tables 9 to 14 illustrate the results of experimenting with the three classifiers on the three forms of the preprocessed dataset.

TABLE VIII. NUMBER OF QUESTIONS UNDER EACH CLASS IN THE TRAINING AND TESTING SETS

CLASS        TRAINING SET   TESTING SET
CASUAL       4              1
DEFINITION   50             10
DESCRIPTION  13             3
ENTITY       74             33
HUMAN        69             41
LIST         18             3
LOCATION     78             50
NUMERIC      94             59
TOTAL        400            200
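The experimental grid just described (three dataset forms, two weighting schemes, and the MNB, DT, and SVM classifiers on a fixed 400/200 split) maps onto a standard scikit-learn workflow. The sketch below is an illustration under assumptions rather than the authors' implementation: TF weighting is approximated with raw counts (CountVectorizer), the SVM with LinearSVC, and the ISRI stemming step with NLTK's ISRIStemmer.

# Illustrative sketch (assumed, not the authors' code) of one cell of the
# experimental grid: one weighting scheme evaluated with the three classifiers.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report
from nltk.stem.isri import ISRIStemmer  # shallow Arabic stemmer (FORM3 only)

stemmer = ISRIStemmer()

def stem_text(text):
    # Applied only when building FORM3; normalization and stop-word removal
    # are assumed to have been done beforehand.
    return " ".join(stemmer.stem(tok) for tok in text.split())

def run_experiment(train_texts, y_train, test_texts, y_test, use_tfidf=True):
    # TF-IDF weighting versus plain term-frequency counts.
    vectorizer = TfidfVectorizer() if use_tfidf else CountVectorizer()
    X_train = vectorizer.fit_transform(train_texts)   # 400 training questions
    X_test = vectorizer.transform(test_texts)         # 200 testing questions
    for name, clf in [("MNB", MultinomialNB()),
                      ("DT", DecisionTreeClassifier()),
                      ("SVM", LinearSVC())]:
        clf.fit(X_train, y_train)
        print(name)
        print(classification_report(y_test, clf.predict(X_test), zero_division=0))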
TABLE IX. RESULTS OF THE THREE CLASSIFIERS USING TF WEIGHTING ON THE DATASET OF FORM1

              precision            recall               f1-score
              MNB   DT    SVM      MNB   DT    SVM      MNB   DT    SVM
casual        0.00  0.00  0.00     0.00  0.00  0.00     0.00  0.00  0.00
definition    0.33  0.29  0.31     0.90  1.00  1.00     0.49  0.44  0.48
description   1.00  0.60  1.00     1.00  1.00  1.00     1.00  0.75  1.00
entity        0.42  0.41  0.46     0.55  0.45  0.36     0.47  0.43  0.41
human         0.62  0.83  0.71     0.59  0.59  0.66     0.60  0.69  0.68
list          0.00  0.00  0.00     0.00  0.00  0.00     0.00  0.00  0.00
location      1.00  0.98  1.00     0.66  0.80  0.82     0.80  0.88  0.90
numeric       0.93  0.98  0.95     0.86  0.83  0.92     0.89  0.90  0.93
micro avg     0.69  0.70  0.73     0.69  0.70  0.73     0.69  0.70  0.73
macro avg     0.54  0.51  0.55     0.57  0.58  0.59     0.53  0.51  0.55
weighted avg  0.75  0.79  0.78     0.69  0.70  0.73     0.70  0.73  0.75

TABLE X. RESULTS OF THE THREE CLASSIFIERS USING TF-IDF WEIGHTING ON THE DATASET OF FORM1

              precision            recall               f1-score
              MNB   DT    SVM      MNB   DT    SVM      MNB   DT    SVM
casual        0.00  0.00  0.00     0.00  0.00  0.00     0.00  0.00  0.00
definition    0.39  0.31  0.31     0.90  1.00  1.00     0.55  0.48  0.48
description   1.00  0.60  1.00     1.00  1.00  1.00     1.00  0.75  1.00
entity        0.51  0.38  0.50     0.58  0.36  0.39     0.54  0.37  0.44
human         0.63  0.79  0.70     0.63  0.63  0.68     0.63  0.70  0.69
list          0.00  0.00  0.00     0.00  0.00  0.00     0.00  0.00  0.00
location      1.00  0.95  1.00     0.80  0.80  0.82     0.89  0.87  0.90
numeric       0.93  0.96  0.98     0.88  0.85  0.92     0.90  0.90  0.95
micro avg     0.74  0.71  0.74     0.74  0.70  0.74     0.74  0.70  0.74
macro avg     0.56  0.50  0.56     0.60  0.58  0.60     0.56  0.51  0.56
weighted avg  0.77  0.77  0.80     0.74  0.71  0.74     0.75  0.72  0.76

TABLE XI. RESULTS OF THE THREE CLASSIFIERS USING TF WEIGHTING ON THE DATASET OF FORM2

              precision            recall               f1-score
              MNB   DT    SVM      MNB   DT    SVM      MNB   DT    SVM
casual        0.00  0.00  0.00     0.00  0.00  0.00     0.00  0.00  0.00
definition    0.27  0.23  0.41     0.90  0.90  0.90     0.42  0.36  0.56
description   1.00  0.60  1.00     1.00  1.00  1.00     1.00  0.75  1.00
entity        0.44  0.49  0.56     0.58  0.52  0.76     0.50  0.55  0.64
human         0.67  0.88  0.80     0.54  0.56  0.59     0.59  0.69  0.68
list          0.00  0.50  0.00     0.00  0.33  0.00     0.00  0.40  0.00
location      1.00  0.97  1.00     0.68  0.76  0.82     0.81  0.85  0.90
numeric       0.96  1.00  0.95     0.88  0.86  0.92     0.92  0.93  0.93
micro avg     0.69  0.71  0.78     0.69  0.71  0.78     0.69  0.71  0.78
macro avg     0.54  0.58  0.59     0.57  0.62  0.62     0.53  0.56  0.59
weighted avg  0.77  0.83  0.82     0.69  0.71  0.78     0.71  0.75  0.79

TABLE XII. RESULTS OF THE THREE CLASSIFIERS USING TF-IDF WEIGHTING ON THE DATASET OF FORM2

              precision            recall               f1-score
              MNB   DT    SVM      MNB   DT    SVM      MNB   DT    SVM
casual        0.00  0.00  0.00     0.00  0.00  0.00     0.00  0.00  0.00
definition    0.29  0.27  0.38     0.90  0.90  0.90     0.44  0.42  0.53
description   1.00  0.60  1.00     1.00  1.00  1.00     1.00  0.75  1.00
entity        0.43  0.51  0.52     0.55  0.58  0.67     0.48  0.54  0.59
human         0.65  0.85  0.75     0.54  0.54  0.59     0.59  0.66  0.66
list          0.00  0.50  0.00     0.00  0.33  0.00     0.00  0.40  0.00
location      1.00  0.98  1.00     0.68  0.80  0.84     0.81  0.88  0.91
numeric       0.93  0.94  0.96     0.88  0.86  0.92     0.90  0.90  0.94
micro avg     0.69  0.72  0.77     0.69  0.71  0.77     0.69  0.73  0.77
macro avg     0.54  0.58  0.58     0.57  0.63  0.61     0.53  0.57  0.58
weighted avg  0.76  0.81  0.81     0.69  0.72  0.77     0.71  0.75  0.78

TABLE XIII. RESULTS OF THE THREE CLASSIFIERS USING TF WEIGHTING ON THE DATASET OF FORM3

              precision            recall               f1-score
              MNB   DT    SVM      MNB   DT    SVM      MNB   DT    SVM
casual        0.00  0.00  0.00     0.00  0.00  0.00     0.00  0.00  0.00
definition    0.33  0.24  0.33     0.90  1.00  0.90     0.49  0.39  0.49
description   1.00  0.60  1.00     1.00  1.00  1.00     1.00  0.75  1.00
entity        0.46  0.50  0.65     0.58  0.27  0.52     0.51  0.35  0.58
human         0.68  0.81  0.77     0.61  0.61  0.66     0.64  0.69  0.71
list          0.00  0.00  0.00     0.00  0.00  0.00     0.00  0.00  0.00
location      1.00  0.98  0.96     0.74  0.86  0.94     0.85  0.91  0.95
numeric       0.95  0.93  0.93     0.88  0.93  0.93     0.91  0.93  0.93
micro avg     0.72  0.72  0.79     0.72  0.72  0.79     0.73  0.73  0.79
macro avg     0.55  0.51  0.58     0.59  0.58  0.62     0.55  0.50  0.58
weighted avg  0.78  0.79  0.81     0.72  0.72  0.79     0.74  0.74  0.79

TABLE XIV. RESULTS OF THE THREE CLASSIFIERS USING TF-IDF WEIGHTING ON THE DATASET OF FORM3

              precision            recall               f1-score
              MNB   DT    SVM      MNB   DT    SVM      MNB   DT    SVM
casual        0.00  0.00  0.00     0.00  0.00  0.00     0.00  0.00  0.00
definition    0.43  0.33  0.36     0.90  1.00  0.90     0.58  0.50  0.51
description   1.00  0.60  1.00     1.00  1.00  1.00     1.00  0.75  1.00
entity        0.57  0.42  0.66     0.61  0.33  0.58     0.59  0.37  0.61
human         0.68  0.76  0.80     0.61  0.63  0.68     0.64  0.69  0.74
list          0.00  1.00  0.00     0.00  0.33  0.00     0.00  0.50  0.00
location      0.98  0.93  0.96     0.94  0.86  0.94     0.96  0.90  0.95
numeric       0.93  0.93  0.93     0.88  0.88  0.93     0.90  0.90  0.93
micro avg     0.78  0.73  0.81     0.78  0.73  0.81     0.78  0.73  0.81
macro avg     0.57  0.62  0.59     0.62  0.63  0.63     0.58  0.58  0.59
weighted avg  0.79  0.77  0.82     0.78  0.73  0.81     0.78  0.74  0.81

VIII. DISCUSSION

The results show 0% for precision, recall, and F1-score for all three classifiers when classifying the class casual; this is because the number of questions under this class is not enough for the classifiers to learn. The same situation is demonstrated for the list class, where the number of questions in the training set under this class is 18. The only classifier that handled this class, in three situations, was DT. The first two situations are illustrated in Tables 11 and 12 for the dataset of form2, where stop-word removal is applied: for both weighting schemes, TF and TF-IDF, precision, recall, and F1-score reached 50%, 33%, and 40%, respectively. In the third situation for the same class, when DT is applied on form3 of the preprocessed dataset, where stemming and stop-word removal are performed, the F1-score reached 50% with 100% precision and 33% recall; this shows the ability of DT to perform well when the number of dimensions decreases. Nevertheless, in terms of overall performance, DT was the worst classifier when applied on form3 using both the TF and TF-IDF weighting schemes. Another case in which DT was the worst is when it was applied on form1 using TF-IDF weighting. On the other hand, the best result for DT is obtained when classifying the class numeric, as its F1-score reached 90% or above in all cases.

MNB was expected to perform well on the current dataset, which is relatively small, but its overall performance was not as expected; in particular, it was the worst classifier in three cases: case 1, when applied on form1 using the TF weighting scheme; case 2, when applied on form2 using the TF weighting scheme; and case 3, when applied on form2 using TF-IDF weighting. However, MNB perfectly classified the class description in all experiments, with precision and recall equal to 100% in most cases.

SVM was the best classifier for question classification in all cases, with both micro-average and weighted-average F1-scores reaching 81% at their maximum, while at their minimum the micro-average and weighted-average F1-scores reached 73% and 75%, respectively. For individual classes, the best results reached 100% for precision, recall, and F1-score when classifying the class description in all cases. The best results obtained by the three classifiers were in the last experiment, when using all preprocessing steps and the TF-IDF weighting scheme.
Table 15 shows the classes used for the question classification experiments in [10], where the authors conducted a comparison between two taxonomies, an Arabic taxonomy proposed by the authors and the Li and Roth taxonomy, after applying a set of preprocessing steps on the dataset, consisting of removing punctuation, diacritics, and stop-words and then performing tokenization. Their dataset consisted of translated questions from both CLEF and TREC, with 800 and 1500 question-answer pairs, respectively. Note that the Arabic and Li and Roth taxonomies contain two numeric classes (time and number), while in our proposed taxonomy time and number fall under one class named numeric. The Arabic taxonomy does not have the classes definition and entity, while the Li and Roth taxonomy and our taxonomy do. The three taxonomies share the classes human (or person), location, and description. However, Li and Roth have the class abbreviation, which is used neither in the Arabic taxonomy nor in ours, while we have proposed two other classes, casual and list.

TABLE XV. EXPERIMENTED CLASSES IN [10] AND PROPOSED CLASSES

                  EXPERIMENTED CLASSES
ARABIC TAXONOMY   human, description, location, time, and number.
LI AND ROTH       abbreviation, definition, description, location (city, country, and other location), person, time, number, entity, other.
OUR EXPERIMENT    casual, definition, description, entity, human, list, location, numeric.

The results show that the three classifiers with the Arabic taxonomy outperformed the same classifiers in our experiment. Note that the difference is not significant for the NB classifiers, where our F1-score is 1% lower than that of the Arabic taxonomy and on par with the Li and Roth taxonomy. However, DT in our experiment outperformed DT in the Li and Roth taxonomy, with an F1-score reaching 74% compared with 66% for Li and Roth. On the other hand, the SVM classifier reached 90% using the Arabic taxonomy, a difference of 9% from our experiment, and only 2% between Li and Roth and our experiment. Note that the results obtained in our experiment are promising, especially since our dataset contains only 600 records compared with the two other experiments in Table 15. Therefore, our intention is to increase the size of the data as future work by manually updating and labeling the translated TREC dataset, as done in this paper.

TABLE XVI. EXPERIMENTAL RESULTS FOR [10] AND THE EXPERIMENT OF THE CURRENT PAPER

                  NB    DT    SVM   TOTAL NUMBER OF QUESTIONS
ARABIC TAXONOMY   79%   81%   90%   2300
LI AND ROTH       78%   66%   83%   2300
OUR EXPERIMENT    78%   74%   81%   600

IX. CONCLUSION AND FUTURE WORK

This paper presented comparative Arabic question classification experiments on an updated version of the translated CLEF dataset, which was labeled manually using eight classes: casual, definition, description, entity, human, list, location, and numeric. The experiments were conducted using three classifiers, MNB, DT, and SVM, after applying a number of preprocessing steps on the dataset and creating three versions of the dataset differing in the applied preprocessing steps. The best results were obtained after performing all the preprocessing steps and using the TF-IDF weighting scheme, with F1-scores of 78%, 74%, and 81% for the classifiers MNB, DT, and SVM, respectively. Future work will focus on conducting similar experiments on a larger dataset.

X. REFERENCES

[1] R. K. Santosh and K. Shaalan, "A review and future perspectives of Arabic question answering systems," IEEE Transactions on Knowledge and Data Engineering, pp. 3169-3190, 2016.
[2] K. C. Ryding, A Reference Grammar of Modern Standard Arabic, Cambridge University Press, 2005.
[3] C. C. Aggarwal and C. Zhai, "A survey of text classification algorithms," in Mining Text Data, Springer, 2012, pp. 163-222.
[4] B. Agarwal and N. Mittal, "Text classification using machine learning methods - a survey," in Proceedings of the Second International Conference on Soft Computing for Problem Solving (SocProS 2012), Springer, 2014.
[5] C. P. Rose, A. Roque, D. Bhembe, and K. Vanlehn, "A hybrid text classification approach for analysis of student essays," in Proceedings of the HLT-NAACL 03 Workshop on Building Educational Applications Using Natural Language Processing - Volume 2, 2003.
[6] H. M. Al Chalabi, S. K. Ray, and K. Shaalan, "Question classification for Arabic question answering systems," in 2015 International Conference on Information and Communication Technology Research (ICTRC), IEEE, 2015.
[7] E. Al-Shawakfa, "A rule-based approach to understand questions in Arabic question answering," Jordanian Journal of Computers and Information Technology, vol. 2, pp. 210-231, 2016.
[8] I. Lahbari, S. E. A. Ouatik, and K. A. Zidani, "A rule-based method for Arabic question classification," in 2017 International Conference on Wireless Networks and Mobile Communications (WINCOM), IEEE, 2017.
[9] X. Li and D. Roth, "Learning question classifiers," in Proceedings of the 19th International Conference on Computational Linguistics - Volume 1, 2002.
[10] I. Lahbari, S. O. El Alaoui, and K. A. Zidani, "Toward a new Arabic question answering system," International Arab Journal of Information Technology (IAJIT), vol. 15, pp. 610-619, 2018.
[11] A. Aouichat, M. S. H. Ameur, and A. Guessoum, "Arabic question classification using support vector machines and convolutional neural networks," in International Conference on Applications of Natural Language to Information Systems, Springer, 2018.
[12] A. Aouichat and A. Guessoum, "Building TALAA-AFAQ, a corpus of Arabic factoid question-answers for a question answering system," in International Conference on Applications of Natural Language to Information Systems, Springer, 2017.
Arabic Text Classification of News Articles Using
Classical Supervised Classifiers
Leen Al Qadi, Hozayfa El Rifai, Safa Obaid, and Ashraf Elnagar
Dept. of Computer Science
University of Sharjah
Sharjah, UAE
ashraf@sharjah.ac.ae

Abstract—Automatic document categorization gains more im- Nowadays, manual classification that is done by experts is
portance in view of the plethora of textual documents added not so fruitful due to the large number of text documents.
constantly on the web. Text categorization or classification is As a result, automated classifiers were proven to be more
the process of automatically tagging a textual document with
most relevant label. Text categorization for Arabic language is effective and a great alternative utilizing machine learning
interesting in the absence of large and free datasets. Our objective algorithms. Many applications and examples of text catego-
is to automatically identify the category of a document based on rization have been explored such as sentiment analysis [1]–[5],
its linguistic features. To achieve this goal, we constructed a new spam filtering [6] and [7] language identification [8], dialect
dataset which contains almost 90k Arabic news articles with their identification [9] and many more.
tags from Arabic news portals. The dataset shall be made freely
available to the research community on Arabic computational lin- Using machine learning for structuring data is especially
guistics. The dataset has four main categories: Business, Sports, helpful in the field of business. It enhances decision- making
Technology and Middle East. Each collected article was cleaned and automates processes, getting faster results. For instance,
from Latin characters, numbers, punctuation and stop words. marketers can research, collect and analyze keywords used by
To investigate the effectiveness of the dataset, we used an array competitors.
of classical supervised machine learning classifiers. Namely, the
following 10 popular classifiers were used: Logistic Regression, The Arabic language is the mother tongue of more than
Nearest Centroid, Decision Tree (DT), Support Vector Machines 300 million people and it is one of the languages that present
(SVM), K-nearest neighbors (KNN), XGBoost Classifier, Random significant challenges to many NLP applications. It is a highly
Forest Classifier, Multinomial Classifier, Ada-Boost Classifier, and inflected and derived language. The scale of Arabic compu-
Multi-Layer Perceptron (MLP). In pursuit of high accuracy, tational linguistic research work is now orders of magnitude
we implemented an ensemble model to combine best classifiers
together in a majority-voting classifier. Our experimental results beyond what was available a decade ago, but still it has so
showed solid performance with a minimum F1-score of 87.7%, much room to grow.
achieved by Ada-Boost and top performance of 97.9% achieved The statistics reported by the Internet World Stats show that
by SVM. The experimental results are presented in terms of the Arabic language is the fourth popular language online by
confusion matrices, F1-scores, and accuracy. share of Internet users with an estimate of 226,595,470 Arabic
Index Terms—Arabic Text Classification; Single-Label Classi-
fication, Arabic Dataset, Shallow Learning Classifiers. Internet users by language, which represent 5.2% of all the
World’s Internet users as of April, 2019. Moreover, that out
of 444,016,517 Arabic speaking people (as estimated in 2019),
I. INTRODUCTION
51.0% of them use the Internet. The highest growth rate among
Due to the heavy usage of the Internet and Web 2.0, all languages in the last nineteen years for the number of online
enormous amounts of repositories had arisen. The increasing users was for the Arabic language, achieving 8,917.3%. In our
number of these repositories of online documents resulted in work context, we constructed a dataset of Arabic news articles
a growing demand for automatic categorization algorithms. scraped from multiple websites for the purpose of our research.
Majority of the data, which is generated, is in textual form, 10 classifiers were implemented to predict the most probable
which is highly unstructured in nature, yet extremely rich in class an article should belong to. In addition, we implemented
information. Extracting insights from such data can be hard a voting classifier, which takes into account the classifiers that
and time-consuming, so machine-learning algorithms are used gave the best accuracy scores while predicting the label.
to organize massive chunks of the data and perform a number An automatic Arabic news article labeling system extracts
of automated tasks. Text classification is a fundamental task in features from the articles using the TF-IDF technique. After
NLP (Natural Language Processing) that is used for assigning turning each article into a feature vector, it will identify which
tags to text and classifying it under categories based on features are most common under which class (in the training
its content. Classifying huge textual data standardizes the phase). This will help the classifier when encountering a new
platform, and makes searching for information much easier article, to predict which class it falls under after turning it into
and more feasible, and improves and simplifies the overall a feature vector.
experience of automated navigation. We propose a single-class text classifier and the objective



is to assign an Arabic news article to a specific class out TABLE I
of 4 classes. We adopt a supervised approach to classify the ARTICLES COUNT FOR EACH SCRAPED NEWS PORTAL.
articles. We experimented with using a different vectorizer for Websites Classes Articles Count
the articles to see the possibility of it affecting the accuracy. Sky News Arabia Sports 7923
We also tested the effect of using a custom-made stop-words Sports 3800
Tech 1680
list instead of the built-in list in the NLTK library. CNN Arabia
Middle East 21516
The remaining of the paper is organized as follows: liter- Business 3908
ature review is presented in Section II. Section II-A demon- Bein Sports Sports 6603
Tech-wd Tech 23682
strates the dataset. Section III describes the proposed classifi- Arabic RT Business 896
cation systems. Section IV presents the experimental results. Youm7 Business 14478
Finally, we conclude the work in Section V. CNBC Arabia Business 4653

dedicated for classification because either there are no defined


classes such as 1.5 billion words Arabic Corpus [19], or the
existing classes are not well defined. Therefore, the authors
propose a new preprocessed and filtered corpus “NADA”,
composed from two existing corpora OSAC and DAA. The
authors used the DDC hierarchical number system, that allows
for each main category to be divided into ten sub-categories
and so on. “NADA” has 10 categories in total, with 13,066
documents. We believe the size is small with respect to the
proposed number of categories.
In addition, [20] investigates text classification using the
SVM classifier on two datasets that differ in languages (En-
Fig. 1. Dataset distribution percentages. glish and Portuguese). It was found that the Portuguese dataset
needs more powerful document representations, such as the
use of word order and syntactical and/or semantic information.
II. PREVIOUS WORK
Shedding the light on the research papers that focused
Several papers review the various English text classification on using the classical supervised machine learning classifiers
approaches and existing literature in addition to the many such as Decision Tree [21], NB [22]–[24], SVM [24] and
surveys covering the subject such as [10]. Some surveys that [25], KNN [23]. While other authors preferred to work on
cover Arabic text categorization are also available, [11] and classification using deep learning and neural networks [26]
[12]. and [27] where they witnessed an overall better performance.
Recently, more research works are focusing on Arabic text
Lastly, it is clear that the performance of classification
classification, and on enriching the Arabic corpus. In [13],
algorithms in Arabic text classification is greatly influenced
authors have compared the result of using six main classi-
by the quality of data source, feature representation techniques
fiers, using the same data sets and under the same environ-
as the irrelevant and redundant features of data degrade the
mental settings. The data sets were mainly collected from
accuracy and performance of the classifier.
(www.aljazeera.net), and they have found that Naive Bayes
gave the best results, with or without using feature selection
methods. A. dataset
Some of the papers focus on the feature selection method
like in [14]. Implementing the KNN classifier, they study the We collected the proposed dataset using web scraping
effect of using unigrams and bigrams as representation of the (Python Scrapy), from seven popular news websites (bein-
documents, instead of the traditional single term indexing (bag sports.com, tech-wd.com, skynewsarabic.com, Arabic.rt.com,
of words) method. Moreover, on feature selection, in [15], the cnbcarabia.com, arabic.cnn.com and youm7.com). The dataset
authors investigated the performance of four classifiers using 2 consists of 89,189 articles with more than 32.5M words. The
different feature selection methods which are Information Gain dataset has the following four categories [Business, Sports,
(IG), and the(X2) statistics (CHI squared) on a BBC Arabic Technology and Middle East]. The articles are written in
dataset. [The use of SVM classifier (with Chi squared feature Modern Standard Arabic (MSA), and so there are no dialects
selection) for Arabic text classification, in [16], gives the best involved. The collected articles were grouped in one corpus.
results. In [17], a new feature selection method is presented We aimed at producing a nearly balanced dataset to avoid bias.
where it outperformed five other approaches using the SVM The average of the scraped articles for each category is about
classifier. Regarding the availability of Arabic datasets online, 22k articles. Figure 1 and Table I show the distribution of the
[18] suggest that some of the existing Arabic corpora are not four categories of this dataset.

TABLE II
COMPARISON BETWEEN TF-IDFVECTORIZER AND
COUNTVECTORIZER.

Algorithms tf-idfVectorizer (%) countVectorizer (%)


Logistic 96.4 97.3
SVC 97.5 97.0
DT Classifier 92.4 91.7
Multinomial NB 91.1 96.8
XGB Classifier 91.2 91.2
KNN Classifier 95.0 69.9
RF Classifier 95.1 94.5
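The vectorizer comparison summarised in Table II can be reproduced along the following lines; this is an illustrative sketch only (the data loading and the classifier used here are placeholders, not the authors' code):

# Illustrative sketch: comparing CountVectorizer and TfidfVectorizer with the
# same classifier on a held-out split of a labelled article collection.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

def compare_vectorizers(texts, labels):
    X_tr, X_te, y_tr, y_te = train_test_split(texts, labels,
                                              test_size=0.2, random_state=0)
    for name, vec in [("countVectorizer", CountVectorizer()),
                      ("tf-idfVectorizer", TfidfVectorizer())]:
        clf = LinearSVC().fit(vec.fit_transform(X_tr), y_tr)
        acc = accuracy_score(y_te, clf.predict(vec.transform(X_te)))
        print(f"{name}: {acc:.3f}")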
III. PROPOSED CLASSIFICATION SYSTEMS


A. Text Features
In text processing, words of the articles represent categorical Fig. 2. Summary of the work-flow of the classifiers.
features. However, most machine algorithms cannot under-
stand text. To solve such problem, we create numerical vectors
to represent the text. Each sentence is being represented 2) Multinomial Naı̈ve Bayes: Using Bayes Theorem, this
by one vector. Turning text into vectors is a process called classifier calculates the probability of each label for
vectorization. CountVectorizer and Tf-idfVectorizer are the a given data, then outputs the label with the highest
most common techniques used for this task. A comparison probability. The classifier assumes that the attributes
has been made to choose which vectorizer should be selected. are independent of each other. In other words, the
The dataset we used for this comparison had almost 40k presence of one feature does not affect the presence of
articles distributed over 3 categories. As it is shown in Table another, therefore all the attributes contribute equally in
2, higher accuracies were provided by classifiers that use the producing the output.
TF-IDFVectorizer, which we adopt in our work. The term 3) Decision Tree: This classifier resembles a tree, with
frequency-inverse document frequency (TF-IDF) scales down each node representing a feature/attribute, and each
the impact of tokens that occur very frequently. The TF-IDF corresponding leaf representing a result. Each branch
Vectorizer is composed by two terms: represents a condition and whenever a condition is
• Term Frequency (TF): measures how frequently a word answered, a new condition will be distributed recursively
occurs in an article. Since every article is different in until a conclusion is reached. Recursion is used to
length, it is possible that a term would appears much partition the tree into a number of conditions with their
more times in long articles that shorter ones. outcomes.
• Inverse Document Frequency (IDF): measures how im- 4) Support Vector Machines (SVM): This is a supervised
portant a word is by weighing down the frequent terms non-probabilistic binary linear classifier that is extremely
and scale up the rare ones. popular and robust. It constructs a model and outputs a
In addition to that, we tested the classifiers once using the line, known as the hyperplane, between classes. This
built-in stop words list and once using a custom-made list. hyperplane separates the data into classes. Both linear
We achieved higher accuracy scores after implementing our and nonlinear classification can be performed by the
list, and we used it moving forward with the experiments. SVM classifier. The hyperplane can be written as the
Figure 2 demonstrates the overall work-flow of our system. vector of input articles x satisfying w · x − b = 0, where
w is the normal vector to the hyperplane and b is the
B. Selected Classifiers
bias.
There are different types of supervised classifiers that can 5) Random Forest: This is a supervised ensemble learning-
be used to for our text categorization task. The essential based classifier. It uses an array of decision trees. The
role of these classifiers is to simply map input data to a outcome class is determined as an aggregate of such
predicted category. We studied the performance of 10 different trees. Technically, given a set of articles x1 , x2 , · · · , xn
classifiers, and a majority voting one. The classifiers are: and their corresponding classes y1 , y2 , · · · , yn . Each
1) Logistic Regression: Logistic Regression is the appropri- classification tree is trained using a random sample
ate regression analysis to conduct when the dependent (xi , yi ), where i ranges from 1 to the total number
variable is dichotomous (binary). Like all regression of trees. The predicted class shall be produced using
analyses, the logistic regression is a predictive analysis. a majority vote of all used trees.
It is used to describe data and to explain the relationship 6) XGBoost Classifiers: This is a supervised classifier,
between one dependent binary variable and one or more which has gained popularity because of winning a good
nominal, ordinal, interval or ratio-level independent vari- number of Kaggle challenges. Like Random Forest, it is
ables. an ensemble technique of decision trees and a variant of

gradient boosting algorithm.
7) Multi-layer Perceptron (MLP): This is a supervised
classifier. It consists of three (or more) layers of neuron
nodes (an input and an output layer with one or more
hidden layers). Each node of one layer is connected
to the nodes of the next layer, and uses a non-linear
activation function to produce output.
8) KNeighbors Classifier: This is a supervised classifier.
In order to classify a given data point, we take into
consideration the number of nearest neighbors of this
point. Each neighbor votes for a class and the class
with the highest votes is taken as the prediction. In
other words, the major vote of the point’s neighbors will
determine the class of this point.
9) Nearest Centroid Classifier: This is a supervised classi-
fier. It is a non-parametric algorithm where each class is
represented by the centroid of its members. It assigns to Fig. 3. Confusion Matrix for the worst classifier.
tested articles the label of the class of training samples
whose mean (centroid) is closest to the article.
10) AdaBoost Classifier: This is a supervised classifier. It
is a meta-estimator that begins by fitting a classifier on
the original dataset and then fits additional copies of
the classifier on the same dataset but where the weights
of incorrectly classified instances are adjusted such that
subsequent classifiers focus more on difficult cases.
11) Voting Classifier: A very interesting ensemble solution.
It is not an actual classifier but a wrapper for a set of
different classifiers. The final decision on a prediction is
taken by majority vote.
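As a concrete illustration of the majority-voting wrapper described in item 11, scikit-learn's VotingClassifier can combine several of the classifiers listed above. The particular subset of estimators below is an assumption chosen for illustration; the paper does not state exactly which classifiers entered its voting ensemble.

# Illustrative hard-voting ensemble (the estimator subset is assumed).
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

voter = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("svm", LinearSVC()),
        ("mnb", MultinomialNB()),
        ("rf", RandomForestClassifier()),
    ],
    voting="hard",  # hard voting: the final label is the majority of the predicted labels
)
# Usage: voter.fit(X_train, y_train); y_pred = voter.predict(X_test)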
IV. EXPERIMENTAL RESULTS AND DISCUSSION
A. Setup and Pre-processing
Our objective is to explore the success of using 11 different
classifiers to classify Arabic news categories. Our experiments
involve single-label classification on the collected dataset
and comparing the results of using the same classifiers with Fig. 4. Confusion Matrix for the best classifier.
another recently reported dataset ‘Akhbarona’ [28], which has
7 different categories. We split our constructed dataset into
80% for training and 20% for testing. All classifiers were samples for each Arabic character. In fact, normalization can
trained on the training set which consist of 71,707 labeled even affect the meaning of some Arabic words.
articles, then tested on the testing set which consists of 17,432
B. Text Classification
articles.
To evaluate the performance of our classifiers, we report We implemented all the classifiers using Scikit-learn. With
the accuracy score, which is simply expressed as the ratio just using the default hyper-parameters as a black-box on our
of the number of correctly classified articles. The number of testing set and L1 penalty for some of the classifiers. We tested
extracted features from our training set is more than 344k the proposed classifiers on the testing set. The accuracy scores
features. are high and clearly show the strength of the system as well
Furthermore, text pre-processing is used to clean the dataset as the hyper parameters used with each classifier.
by removing all the non-Arabic content. This approach is Table III shows the precision, recall, and F1-score measures
highly recommended when dealing with text collected from for each of the tested classifiers on our dataset. Accuracy
the web. The next step is to clean all the scraped articles scores are almost same as F1-scores. The average of the accu-
by removing elongation, punctuation, Arabic digits, isolated racy scores is 94.8%. The SVM classifier produced the best re-
chars, qur’anic symbols, Latin letters, and other marks. sult of 97.9%. However, the Ada-Boost classifier produced the
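A minimal sketch of this cleaning step is shown below. The exact character ranges and the handling of elongation (tatweel) are assumptions, since the paper does not list its cleaning rules.

# Illustrative cleaning sketch (assumed rules): keep Arabic letters and spaces,
# dropping Latin letters, digits, punctuation, elongation, and other marks.
import re

ARABIC_LETTERS = r"\u0621-\u064A"   # core Arabic letter range
TATWEEL = "\u0640"                  # elongation (tatweel) character

def clean_article(text: str) -> str:
    text = text.replace(TATWEEL, "")                     # remove elongation
    text = re.sub(fr"[^{ARABIC_LETTERS}\s]", " ", text)  # drop everything else
    return re.sub(r"\s+", " ", text).strip()             # collapse whitespace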
Although most of the research works on Arabic computa- worst result of 87.7%. Furthermore, four classifiers produced
tional linguistics apply normalization on the collected text, we close results between 97.5% and 97.9%. For the rest of the
believe this step is not necessary. The dataset provides enough classifiers, two classifiers (MultinomialNB and KNeighbors)

TABLE III
ACCURACY METRICS FOR CLASSIFIERS TESTING ON OUR
DATASET.

Algorithms Precision Recall F1-score


Logistic Regression 0.98 0.98 0.98
SVC 0.98 0.98 0.98
DT Classifier 0.91 0.91 0.91
Multinomial NB 0.96 0.96 0.96
XGB Classifier 0.94 0.93 0.94
KNN Classifier 0.95 0.95 0.95
RF Classifier 0.95 0.94 0.94
Nearest Centroid 0.95 0.94 0.94
Ada-Boost Classifier 0.89 0.88 0.88
MLP Classifier 0.98 0.95 0.95
Voting Classifier 0.98 0.98 0.98

Fig. 6. An example of an incorrectly classified ’Technology’ article as


’Business’ article.

category as predicted by the SVM classifier! This shows how


precise the classifiers are.
For the second part of the testing, we experimented with
a recently reported dataset, which is Akhbarona. It is an
unbalanced dataset that consists of seven categories holding
46,900 articles. We split the dataset into 80% training and 20%
testing. It is anticipated that the accuracy scores will be lower
Fig. 5. An example of a correctly classified ’Business’ article.
for two reasons; having an unbalanced dataset may lead to
biased classifiers towards a certain category, and increasing the
number of classes will raise the possibility of misclassifying
performed above the average with accuracy scores of 96.3% an article. Table IV shows that precision, recall, and F1-
and 95.4%, respectively, while the other classifiers performed score metrics. The accuracy results are matching the F1-
below the average with accuracy scores range from 87.7% to scores on Akhaborna dataset. The average of the accuracies
94.4%. Figures 3 and 4 show the confusion matrix of the worst is 90.0%. The SVM classifier produced the best result of
(ADB) and best classifiers (SMV), respectively. 94.4%. However, the Ada-Boost classifier produced the worst
result of 87.7%. Furthermore, four classifiers out of eleven
C. Testing produced close results between 93.9% and 94.4%. For the rest
This phase is divided into 2 parts. The first part is to test seven classifiers, one classifier only (KNeighbors) performed
the best classifier (SVM) by having it to predict the class of above the average with an accuracy of 90.8%. The other six
articles a testing set from our collected dataset. classifiers performed below the average with accuracy scores
Figure 5 shows an example of an article from the testing range from 77.9% and 88.4%.
set, taken from the ’Business’ category. Our model was 95.7%
V. CONCLUSIONS
sure the article should be classified under the same category.
The SVM classifier was able to record 99.6% accuracy on In this work, we have developed a single-class text classifier
other news articles. These results show the robustness of the system for Arabic news articles. We present a benchmark
classifier. Additionally, we checked some of the predictions in dataset which contains almost 90k Arabic news articles with
the list of misclassified articles in the testing section of the their tags scraped from seven different websites. We described
dataset, to try to make sense of the misclassification. We have the collection, cleaning and construction steps of the dataset.
found that some articles do indeed belong to a different class, We examined our dataset by implementing different eleven
that is not the class they were originally classified under by the classifiers.
news website. In Figure 6, we show an article that is originally The classifiers were trained and tested on the proposed
tagged as “Technology”. However, after reading the article, dataset. SVM model produced the best results among all the
we believe that it should be classified under the “Business” other classifiers. The final accuracy scores range from 87% to

TABLE IV [12] A. al Sbou, “A survey of arabic text classification models,” International
ACCURACY METRICS FOR CLASSIFIERS TESTING ON Journal of Electrical and Computer Engineering, vol. 8, pp. 4352–4355,
AKHBARONA DATASET. 12 2018.
[13] A. El-Halees, “A comparative study on arabic text classification.”
Algorithm Precision Recall F1-score Egyptian Computer Science Journal, vol. 30, 01 2008.
Logistic Regression 0.94 0.94 0.94 [14] R. Al-shalabi and R. Obeidat, “Improving knn arabic text classification
SVC 0.94 0.94 0.94 with n-grams based document indexing,” in in Proceedings of the 6 th
DT Classifier 0.83 0.83 0.83 International Conference on Informatics and Systems INFOS2008, 2008,
Multinomial NB 0.91 0.88 0.88 pp. 108–112.
XGB Classifier 0.89 0.88 0.88 [15] G. Raho, R. Al-Shalabi, G. Kanaan, and A. Nassar, “Different
KNN Classifier 0.91 0.91 0.91 classification algorithms based on arabic text classification: Feature
RF Classifier 0.88 0.88 0.88 selection comparative study,” International Journal of Advanced
Nearest Centroid 0.89 0.86 0.87 Computer Science and Applications, vol. 6, no. 2, 2015. [Online].
Ada-Boost Classifier 0.80 0.78 0.78 Available: http://dx.doi.org/10.14569/IJACSA.2015.060228
MLP Classifier 0.94 0.94 0.94 [16] A. M. A. Mesleh, “Chi square feature extraction based svms arabic
Voting Classifier 0.94 0.94 0.94 language text categorization system,” Journal of Computer Science,
vol. 3, no. 6, pp. 430–435, 2007, exported from https://app.dimensions.ai
on 2019/02/03.
[17] B. Hawashin, A. Mansour, and S. Aljawarneh, “An efficient feature
97%. We also used the voting classifier, hoping for improving selection method for arabic text classification,” International Journal
of Computer Applications, vol. 83, pp. 1–6, 12 2013.
the accuracy, using a majority vote of ten classifiers. However, [18] N. Alalyani and S. L. Marie-Sainte, “Nada: New arabic dataset
the result is comparable to the SVM classifier. for text classification,” International Journal of Advanced Computer
A further investigation has taken place to check the ro- Science and Applications, vol. 9, no. 9, 2018. [Online]. Available:
http://dx.doi.org/10.14569/IJACSA.2018.090928
bustness of our proposed system. We trained and tested the [19] I. Abu El-Khair, “1.5 billion words arabic corpus,” 11 2016.
classifiers on the recently reported “Akhabrona” dataset. The [20] T. Gonçalves and P. Quaresma, “The impact of nlp techniques in
number of classes is increased to have seven classes. The the multilabel text classification problem,” in Intelligent Information
Processing and Web Mining, M. A. Kłopotek, S. T. Wierzchoń, and
results were as good as on our dataset. The SVM classifier K. Trojanowski, Eds., 2004, pp. 424–428.
scored the highest. In future, we intend to increase the number [21] F. Harrag, E. El-Qawasmeh, and P. Pichappan, “Improving arabic
of classes in our dataset. We have also shown the need of text categorization using decision trees,” in 2009 First International
Conference on Networked Digital Technologies, July 2009, pp. 110–115.
multi-label text classification which we start on soon. [22] M. EL KOURDI, A. BENSAID, and T.-e. Rachidi, “Automatic arabic
document categorization based on the naı̈ve bayes algorithm,” 08 2004.
REFERENCES [23] M. Bawaneh, M. Alkoffash, and A. Alrabea, “Arabic text classification
[1] A. Elnagar and O. Einea, “Brad 1.0: Book reviews in arabic dataset,” using k-nn and naive bayes,” Journal of Computer Science, vol. 4, 07
in 2016 IEEE/ACS 13th International Conference of Computer Systems 2008.
and Applications (AICCSA), Nov 2016, pp. 1–8. [24] S. Alsaleem, “Automated arabic text categorization using svm and nb,”
[2] A. Dahou, S. Xiong, J. Zhou, M. H. Haddoud, and P. Duan, “Word International Arab Journal of eTechnology, vol. 2, no. 2, pp. 124–128,
embeddings and convolutional neural network for Arabic sentiment June 2011.
classification,” in Proceedings of COLING 2016, the 26th International [25] A. Mohammad, T. Alwadan, and O. Almomani, “Arabic text catego-
Conference on Computational Linguistics: Technical Papers, Dec. 2016, rization using support vector machine, naı̈ve bayes and neural network,”
pp. 2418–2427. GSTF Journal on Computing (JoC), vol. 5, 09 2016.
[3] A. A. Altowayan and A. Elnagar, “Improving arabic sentiment anal- [26] M. Biniz, S. Boukil, F. El Adnani, L. Cherrat, and A. Elmajid
ysis with sentiment-specific embeddings,” in 2017 IEEE International El Moutaouakkil, “Arabic text classification using deep learning tech-
Conference on Big Data (Big Data). IEEE, 2017, pp. 4314–4320. nics,” International Journal of Grid and Distributed Computing, vol. 11,
[4] A. Elnagar, Y. S. Khalifa, and A. Einea, Hotel Arabic-Reviews Dataset pp. 103–114, 09 2018.
Construction for Sentiment Analysis Applications. Springer Interna- [27] A. Elnagar, O. Einea, and R. A. Debsi, “Automatic text tagging of arabic
tional Publishing, 2018, pp. 35–52. news articles using ensemble deep learning models,” in Proceedings
[5] A. Elnagar, L. Lulu, and O. Einea, “An annotated huge dataset for of the 3rd International Conference on Natural Language and Speech
standard and colloquial arabic reviews for subjective sentiment analysis,” Processing, Sep. 2019.
Procedia Computer Science, vol. 142, pp. 182 – 189, 2018, arabic [28] O. Einea, A. Elnagar, and R. A. Debsi, “Sanad: Single-label
Computational Linguistics. arabic news articles dataset for automatic text categorization,”
[6] A. Al-alwani and M. Beseiso, “Article: Arabic spam filtering us- Data in Brief, p. 104076, 2019. [Online]. Available:
ing bayesian model,” International Journal of Computer Applications, http://www.sciencedirect.com/science/article/pii/S2352340919304305
vol. 79, no. 7, pp. 11–14, October 2013.
[7] Y. Li, X. Nie, and R. Huang, “Web spam classification method based
on deep belief networks,” Expert Systems with Applications, vol. 96, pp.
261 – 270, 2018.
[8] S. Malmasi and M. Dras, “Language identification using classifier
ensembles,” in Proceedings of the Joint Workshop on Language Tech-
nology for Closely Related Languages, Varieties and Dialects. Hissar,
Bulgaria: Association for Computational Linguistics, Sep. 2015, pp. 35–
43. [Online]. Available: https://www.aclweb.org/anthology/W15-5407
[9] L. Lulu and A. Elnagar, “Automatic arabic dialect classification using
deep learning models,” Procedia Computer Science, vol. 142, pp. 262 –
269, 2018, arabic Computational Linguistics.
[10] C. C. Aggarwal and C. Zhai, A Survey of Text Classification Algorithms,
2012, pp. 163–222.
[11] I. Hmeidi, M. Al-Ayyoub, N. A. Mahyoub, and M. A. Shehab, “A
lexicon based approach for classifying arabic multi-labeled text,” In-
ternational Journal of Web Information Systems, vol. 12, no. 4, pp.
504–532, 2016.

Graph-Based Arabic Key-phrases Extraction
Dana Halabi Arafat Awajan
Department of Computer Science Department of Computer Science
Princess Sumaya University for Technology Princess Sumaya University for Technology
Amman, Jordan Amman, Jordan
d3hhalabi@yahoo.com awajan@psut.edu.jo

Abstract— This paper proposes Arabic key-phrases unsupervised methods). Some approaches like Al-Kabi et al.
extraction using graph representation. The proposed approach [1] was based on building a co-occurrence matrix for the most
based on representing the text of an individual document as a frequent terms and used the knowledge of the χ2 and TF-ITF
graph, where the nodes within the graph hold the words’ stem measures. The terms with high χ2 were considered to be
and the edges represent the co-occurrence relation between stems keywords. Awajan [2] proposed unsupervised two-phase
in specific window size. After building the graph, graph-based approach. In phase one, the author detected all N-gram for the
centrality measures were used in ranking the nodes according to possible candidate keywords, then in phase two, he used a
their importance. Then the ranking results are sorted decently to morphological analyzer to calculate the frequency of N-gram
determine the top n nodes. The stems that are represented by the
term based on the root and stem of terms. Awajan [3] proposed
top n nodes will be considered as the key-stems of the individual
new technology based on a vector space model to compute the
document. The performance of our work is measured using the
three accuracy measures: Precision, Recall, and F-Measure. The
most frequency N-gram in the text. In addition to count
obtained result reached 54%, 82% and 64% for Precision, frequency of terms within the doc, the final frequency of N-
Recall, and F-measure respectively. grams within a document depends on their weight and degree.
El-Shishtawy et al. [4] represented a supervised learning
Keywords— Natural language processing, Arabic, key-phrases method for extracting key-phrases from a document based on
extraction, graph, ranking, centrality measures linguistic knowledge and annotated Arabic corpus, they used
syntactic rules based on Part Of Speech (POS) to extract the
I. INTRODUCTION key-phrases. Suleiman et al. [19] proposed Arabic keywords
extraction based on bag-of-concept to extract keywords from
Recently, 4.7% of internet users are Arabic speakers, which the text and used a semantic vector space model to group
will impact an increasing amount of Arabic contents in the web synonym words into classes. Although the tested dataset had
world [6]. This yields the need to have efficient ways to extract only three documents, the proposed method showed significant
information and knowledge from the available amount of data. results.
Key-phrases extraction is a useful Natural Language
Processing (NLP) task that can be used in NLP related tasks In this work, we propose an Arabic keywords extraction
such as automatic document(s) summarization, information approach based on a simple weighted graph. The main idea is
retrieval, search engine... etc. to convert the sequence of words in the sentences of the
document into a simple graph that its nodes represent the terms
Currently, there are limited researches related to keywords and its edges represent frequency co-occurrence relation.
extraction from Arabic contents. In this work, we propose a
new approach, based on a weighted graph. The main idea of Representing the document as a graph was introduced by
proposed work is to represent the document as a graph, in Mihalcea and Tarau (2004) [8] for English content. In their
which its nodes will hold the candidate stems, and the weighted work, they introduced a ranking algorithm called Text-Rank
edges represent the frequency co-occurrence relation between based on PageRank ranking algorithm proposed by Brin et al.
the connected nodes within predetermined window size. [13]. The Text-Rank model considered the words as lexical
Centrality measures will be used to analyze the network and units represented as nodes in an undirected weighted graph.
rank the N-top most important candidate nodes (stems) in the The edges in the graph represented the co-occurrence relation
graph that will be considered as key-stems. These key-stems between words. The best-achieved result for Mihalcea and
will be used to extract the keywords and key-phrases. Tarau [8] was 31% for precision. Litvak et al. and Last (2008)
[15], also represented the text as a graph using Hyperlink-
In this paper, section two represents the related work. Induced Topic Search (HITS) for ranking the nodes. Boudin
Section three explains a basic theoretical background of graph represented the text as a graph and proposed a comparison of
theory. Next, section four illustrates the proposed application. different centrality measures for Key-phrase extraction. In his
Section five represents the experiments and evaluation. Finally, work, he recommended the use of degree centrality measure to
conclusion and future work will be held in section six. ranking the nodes [16].
Kim et al. [7] represented the Korean content as a graph
II. RELATED WORKS and apply the original PageRank algorithm for ranking the
Most of the work for Arabic key-phrases extraction nodes within a graph. In their work, they had achieved more
depended on the use of statistical methods (supervised and than 71% for precision. For Arabic content, Al-Taani et al. [14]



used the graph and PageRank to summarize documents. In their Where p1, p2, ..., pn are the pages under
work, they segmented the text into sentences and gave a unique consideration, M(pi) is the set of pages that link to pi.
id to each sentence. The sentences’ ids were then represented L(pj) is the number of outbound links on page pj,
by the graph’s nodes, while the edge between every two and N is the total number of pages [13].
sentences represented the score of cosine similarity measure
between them. PageRank measure was then used to rank the IV. THE PROPOSED SYSTEM
nodes. Daoud et al. [18] utilized Social Networks to build a In this work, the document is represented as a graph. Each
corpus from Arabic tweets as a source of information to sentence in the document will be tokenized and represented by
statistically extract groups of used key-phrase over time. After the stems of its words. The set of stems will be stored as nodes,
that, relating key-phrases based on the search results were and the weighted edges represent the frequency co-occurrence
returned for a target topic and ranked. relation between stems within a predefined window. A window
of size n represents a sequence of n words in the sentence. A
III. GRAPH-BASED CENTRALITY MEASURES centrality (ranking) measure will be used to analyze the graph
In Graph theory, centrality measures [12,13] are used to and rank the top-n most important candidate stems.
analyze the network in order to determine the most "important" Four centrality (ranking) measures are used to illustrate the
or "prominent" nodes based on node location. The measures results of the proposed system: PageRank, Betweenness
that are usually used in network analysis are summarized Centrality, Closeness Centrality, and Degree Centrality.
below:
The proposed system consists of three phases. Pre-
1. Degree centrality: is the number of adjacent processing phase, graph building and ranking phase, and post-
neighbors to node i in an unweighted graph processing phase. The general architecture of the proposed
system is illustrated in Figure 1.
$C_D(i) = \deg(i)$    (1)
The first phase is for preparing and cleaning the raw data of
the entire text (document). The second phase is the core of key-
The node(s) that has the highest number of direct phrases extraction system where the main operations are held.
contact with many other nodes is said to be the node It will produce the N-top key-stems (nodes). In other words,
with high degree centrality [12]. this phase extracts the most N-top important stems within the
2. Closeness centrality: it describes how close a node i document. Finally, the last phase works on the N-top key-stems
is to all the other nodes in the network and produce the final format M-top key-phrases.
$C_C(i) = \frac{1}{\sum_{j \neq i} d(i,j)}$    (2)

Where d(i, j) is the number of edges between node i


and node j for the path (i, j) in an unweighted graph.
The node(s) that has the shortest communication path
to others, or in other words, the node(s) that have the
minimal number of steps to reach other nodes will
have the high closeness centrality [12].
3. Betweenness centrality: is the number of shortest
paths going through the node σst (i)

$C_B(i) = \sum_{s \neq i \neq t} \frac{\sigma_{st}(i)}{\sigma_{st}}$    (3)

where s is the start node in the unweighted graph, and


t is the end node.
The node(s) that lies on many shortest paths from s to
t is the node(s) that has the high betweenness
centrality [12].
4. PageRank: Is an algorithm used by the Google Fig. 1. The System General Architecture
Search engine to rank the results. PageRank is a
variant of Eigenvector centrality, in which the
importance of a node depends on the importance of
a. Phase one: Pre-processing Phase
its neighbors.
This phase aims to clean data by removing non-needed
$PR(p_i) = \frac{1-d}{N} + d \sum_{p_j \in M(p_i)} \frac{PR(p_j)}{L(p_j)}$    (4), where d is the damping factor.
words such as non-Arabic characters or phrases i.e. http://, stop
words and functional words. This phase consists of sequence
steps. It is very important and should be accurate as its result

will be the input for the graph in the next phase. In other words, The first time any two stems appear in the same window the
it has a big effect on how to represent the entire Arabic weight of E between these two stems will initialize to one, then
document as a graph. Algorithm 1 represents the pseudo-code each time these two stems appear again in the same window
of “Pre-Processing phase”. the edge’s weight will be incremented by one.
Algorithm 1. Pseudo-code of “Pre-processing phase”. For research purposes, the proposed system is testing with
window size ranged from 2 to 10, in addition to the window
Inputs with a size equal to an exact number of stems in a sentence.

• Arabic document, StopWords list, Punctuation In the second step, the centrality (ranking) measure is
list, Special Characters list applied to produce the N-top key-stems list. The number of
key-stems should be determined before step 2. Its default value
Outputs
is set to 10. The centrality (ranking) measure should be
determined whether it will be PageRank, Betweenness
• sentence_no_stopwords: tokens for each sentence Centrality, Closeness Centrality or Degree Centrality.
(a two-dimensions list) Algorithm 2 represents the pseudo-code of step one “Build
• sentences_stems: stems for each token in each graph G”, and algorithm 2 represents the pseudo-code of step
sentence (a two-dimensions list) two “Ranking graph G”.
• sentences: the set of sentences before remove
Algorithm 2. Pseudo-code of “Build graph G”.
StopWords
Steps Inputs:

1. Use punctuation* to split the text into a list of • sentence_no_stopwords: tokens for each sentence (a
sentences two-dimensions list)
2. For each sentence in sentences Do • sentences_stems: stems for each token in each
2.1. Remove Punctuations sentence (a two-dimensions list)
2.2. Remove Special Characters • win_size (Window Size): {2,3,.., 10} U {A:
2.3. Make a copy from the original sentence Sentence}*
2.4. Remove StopWords from the new copy
2.5. Add the new sentence that is without the stop Outputs:
words to a new list sentence_no_stopwords
2.6. Split the sentence_no_stopwords to tokens • G = {V, E}
2.7. POS Tagging each token in the tokens
• V = stems
2.8. Find tokens that have POS tag value ϵ
• E = Undirected co-occurrence relation between two
{DTNN, NN, DTNNP, NNP} and save
stems in the same window.
these tokens in new list noun_tokens**
3. For each token in noun_tokens Do • pair_set: set contains the pair of stem and the token
3.1. Compute stem(token) related to it from sentence_no_stopwords
3.2. Add the stem to noun_stems list Steps
4. For each sentence in sentence_no_stopwords Do
4.1. For each token in the tokens of sentence Do 1. For each sentence in sentences_stems Do
4.1.1. If stem(token) not in noun_stems 1.1. For each stem in sentences_stems - 1
Remove the token from the sentence's Do
tokens 1.1.1. If stemi not in G, Then
Add stemi to G
4.1.2. Otherwise
Add stem(token) to sentences_stems 1.1.2. If stemi+1 not in G, Then
Add stemi+1 to G
* In step 1, the punctuations used to split the text into sentences
are Arabic comma (،), dot (.) and (‫)؟‬. 1.1.3. If edge (stemi, stemi+1) not in G
** In step 2.8, the graph will build depending on the stems for Then
only the nouns that come in the Arabic document. Add edge (stemi, stemi+1, weight = 1) to G

1.1.4. Otherwise
b. Phase two: Graph building and Ranking Phase
Update edge (stemi, stemi+1) [weight] += 1
This phase is the core of the system. It has two main steps.
In the first step, the undirected weighted graph G = (V, E) is 2. Add pair(stemi, tokeni) to pair_set
created, where V holds the stems and E holds the edges * Window size: could be from 2 to 10, or it could be the whole
between two stems that represent the co-occurrence relation sentence.
between them if these two stems appear in the same window.

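The graph construction and ranking just described can be sketched with NetworkX, which the paper reports using for the centrality computations; the input format and the window handling below are illustrative assumptions rather than the authors' exact code.

# Illustrative sketch: build the weighted co-occurrence graph and rank stems.
import networkx as nx

def build_cooccurrence_graph(sentences_stems, win_size=2):
    # sentences_stems: list of sentences, each given as a list of stems.
    G = nx.Graph()
    for stems in sentences_stems:
        for i in range(len(stems)):
            for j in range(i + 1, min(i + win_size, len(stems))):
                u, v = stems[i], stems[j]
                if u == v:
                    continue
                if G.has_edge(u, v):
                    G[u][v]["weight"] += 1       # co-occurred again in a window
                else:
                    G.add_edge(u, v, weight=1)   # first co-occurrence
    return G

def top_n_key_stems(G, n=10, measure="pagerank"):
    centrality = {
        "pagerank": nx.pagerank,
        "degree": nx.degree_centrality,
        "closeness": nx.closeness_centrality,
        "betweenness": nx.betweenness_centrality,
    }[measure]
    scores = centrality(G)
    # Sort nodes by score in descending order and keep the N-top key-stems.
    return sorted(scores, key=scores.get, reverse=True)[:n]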
Algorithm 3. Pseudo-code of "Ranking graph G".
Inputs:
• G = (V, E)
• Ranking_algorithm: PageRank (PR), Betweenness Centrality (BC), Closeness Centrality (CC) or Degree Centrality (DC)
• N: number of the top stems to be selected from the candidate key-stems
Outputs:
• prime list of the N-top key-stems
Steps
1. Rank the nodes in G according to the centrality (ranking) measure
2. Sorted_list = sort the nodes in descending order according to their ranking value
3. Extract from the Sorted_list the N-top key-stems

c. Phase three: Post-Processing Phase

The last phase receives the N-top key-stems from phase two and transforms them to produce the most appropriate M-top key-phrases by replacing the stems with their surface forms as listed in the original document. Algorithm 4 represents the pseudo-code of the "Post-Processing phase".

Algorithm 4. Pseudo-code of "Post-Processing Phase".
Inputs:
• N-top key-stems list
• pair_set: set containing the pairs of each stem and the token related to it from sentence_no_stopwords
• sentences: original sentences list (output from phase 1)
Outputs:
• The final list of keywords and key-phrases
Steps
1. Extract from pair_set the tokens whose stems are in the N-top key-stems, and mark these tokens as keywords
2. Examine all keywords against the original sentence units to decide which tokens appear adjacent to each other
3. If two tokens are adjacent to each other, combine them to produce a key-phrase; a key-phrase can be two keywords or more
4. If the tokens are adjacent to each other, use the sentences list to decide whether any deleted stop-words need to be restored to give the key-phrase a meaningful reading
5. At the end, this step produces a list of keywords (one word) and a list of key-phrases (two, three or at most four words)

V. EXPERIMENT AND EVALUATION

A. Environment and Configuration Settings for Evaluation
The new system was tested against a dataset of 60 documents. Most of these documents were collected from the Aljazeera.net site, and they were manually annotated for their key-phrases. Information about the dataset is summarized in Table 1.

TABLE I. DATASET INFORMATION (dataset of 60 documents)
Total number of tokens before removing stop words: 36317 | Average per document: 605
Total number of tokens after removing stop words: 25648 | Average per document: 427
Total number of unique stems: 9898 | Average per document: 165
Total number of manually annotated key-phrases: 343 | Average per document: 6
Total number of automatic key-stems: 526 | Average per document: 9

For each document, the manually annotated key-phrases were converted to a set of keywords; then the same stemmer that is used within the system extracted the stems of these keywords. These stems represent the annotated-stems.
In order to evaluate the performance of the system, a comparison between the annotated-stems and the N-top key-stems (the output from step 2 of phase two, Ranking graph G) was conducted to compute the main measures of accuracy: Precision (P), Recall (R) and F-measure (F).
The testing results were generated for four (centrality) ranking measures: PageRank, Betweenness Centrality, Closeness Centrality, and Degree Centrality. The centrality measures were computed using the NetworkX package [17]. The effect of window size on the accuracy of the outputs was also tested.

B. The Document Test
In order to illustrate the detailed steps of the proposed system, a test document was selected: the Arabic version of the Universal Declaration of Human Rights, available at http://www.un.org/ar/documents/udhr/. The original document contains 1411 tokens (including stop-words and punctuation). The number of manually annotated stems is 12. The default value of the N-top key-stems was updated from 10 to 15. After converting the text of the document to the graph, we found 328 unique stems representing the nodes in graph G. Graph G has a different number of edges for each window size; Table 2 summarizes the number of edges for each window size.

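A short sketch of the ranking step (Algorithm 3) and of the evaluation described above, using the NetworkX centrality routines; the annotated_stems list and the helper names are assumptions for illustration, not part of the paper's code.

import networkx as nx

def rank_stems(G, measure="degree", n_top=15):
    """Rank the graph nodes by a centrality measure and return the N-top key-stems."""
    measures = {
        "pagerank": nx.pagerank,
        "betweenness": nx.betweenness_centrality,
        "closeness": nx.closeness_centrality,
        "degree": nx.degree_centrality,
    }
    scores = measures[measure](G)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:n_top]

def precision_recall_f(predicted_stems, annotated_stems):
    """Compare predicted key-stems with the manually annotated stems."""
    tp = len(set(predicted_stems) & set(annotated_stems))
    p = tp / len(predicted_stems) if predicted_stems else 0.0
    r = tp / len(annotated_stems) if annotated_stems else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
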
TABLE II. NUMBER OF EDGES OF G FOR DIFFERENT WINDOW SIZES
Document title: اإلعالن العالمي لحقوق اإلنسان (Universal Declaration of Human Rights); Tokens = 1411, Nodes = 328, Key-stems = 15, Number of manually annotated stems = 12.
Window Size | Number of Edges
2 | 590
3 | 1023
4 | 1316
5 | 1470
6 | 1504
7 | 1466
8 | 1379
9 | 1298
10 | 1216
Sentence's length | 550

Figure 2 illustrates a visual representation of graph G, with window size equal to 5, for the subgraph corresponding to the following part of the document:

ل ّما كان االعتراف بالكرامة المتأصلة في جميع أعضاء األسرة البشرية
(Since recognition of the inherent dignity of all members of the human family)

The number of nodes (stems) is 6, namely {(word, stem): (االعتراف, عرف), (بالكرامة, كرم), (المتأصلة, ءصل), (أعضاء, عضو), (األسرة, ءسر), (البشرية, بشر)}, and the number of edges is 5.

Fig. 2. Visual representation of the subgraph with window size = 5.

Tables 3 and 4 display the P, R and F values for key-stems = 15 using the PageRank (PR), Betweenness Centrality (BC), Closeness Centrality (CC) and Degree Centrality (DC) measures for the selected test document.

TABLE III. P, R AND F VALUES FOR KEY-STEMS = 15 USING PR AND BC FOR THE SELECTED TEST DOCUMENT
Window Size | PageRank (P, R, F) | Betweenness Centrality (P, R, F)
2 | 60, 75, 67 | 53, 67, 59
3 | 60, 75, 67 | 60, 75, 67
4 | 47, 58, 52 | 47, 58, 52
5 | 53, 67, 59 | 40, 50, 44
6 | 60, 75, 67 | 47, 58, 52
7 | 53, 67, 59 | 40, 50, 44
8 | 47, 58, 52 | 47, 58, 52
9 | 47, 58, 52 | 47, 58, 52
10 | 40, 50, 44 | 40, 50, 44
Sentence's length | 40, 50, 44 | 33, 42, 37

TABLE IV. P, R AND F VALUES FOR KEY-STEMS = 15 USING CC AND DC FOR THE SELECTED TEST DOCUMENT
Window Size | Closeness Centrality (P, R, F) | Degree Centrality (P, R, F)
2 | 60, 75, 67 | 60, 75, 67
3 | 60, 75, 67 | 60, 75, 67
4 | 67, 83, 74 | 47, 58, 52
5 | 60, 75, 67 | 53, 67, 59
6 | 67, 83, 74 | 60, 75, 67
7 | 47, 58, 52 | 53, 67, 59
8 | 53, 67, 59 | 47, 58, 52
9 | 40, 50, 44 | 47, 58, 52
10 | 40, 50, 44 | 40, 50, 44
Sentence's length | 47, 58, 52 | 40, 50, 44

C. Evaluation
The results of the system evaluation are summarized in Tables 5 and 6. They hold the average values of P, R and F over all documents in the dataset, for all window size values.

TABLE V. P, R AND F VALUES FOR KEY-STEMS = 15 USING PR AND BC FOR THE WHOLE DATASET
Window Size | PageRank (P, R, F) | Betweenness Centrality (P, R, F)
2 | 52, 79, 62 | 48, 73, 57
3 | 52, 79, 62 | 48, 73, 57
4 | 51, 77, 61 | 48, 72, 57
5 | 50, 76, 60 | 46, 70, 55
6 | 49, 75, 59 | 46, 70, 55
7 | 48, 72, 57 | 43, 65, 51
8 | 45, 68, 53 | 42, 63, 50
9 | 44, 67, 53 | 41, 63, 50
10 | 41, 62, 49 | 38, 58, 45
Sentence's length | 26, 39, 31 | 31, 46, 37

TABLE VI. P, R AND F VALUES FOR KEY-STEMS = 15 USING CC AND DC FOR THE WHOLE DATASET
Window Size | Closeness Centrality (P, R, F) | Degree Centrality (P, R, F)
2 | 46, 71, 55 | 54, 82, 64
3 | 48, 73, 58 | 53, 81, 63
4 | 48, 72, 58 | 52, 79, 62
5 | 48, 73, 58 | 50, 76, 60
6 | 48, 72, 57 | 48, 74, 58
7 | 47, 71, 56 | 48, 73, 57
8 | 46, 69, 54 | 46, 70, 55
9 | 45, 68, 54 | 45, 69, 54
10 | 45, 68, 54 | 43, 65, 52
Sentence's length | 41, 61, 48 | 32, 48, 39

In terms of precision, Degree Centrality gives the best results with window size = 2, while PageRank gives results very close to Degree Centrality, followed by Closeness Centrality and Betweenness Centrality. In general, window sizes between 2 and 6 give the best performance for all centrality (ranking) measures. Furthermore, Degree Centrality, which refers to the number of ties a node has to other nodes, has the best time cost compared to the other centrality (ranking) measures.

D. Discussion
The keyword-extraction system presented in this work is a purely unsupervised method. It depends mainly on the content of the entire document. In other words, a word that does not appear in the document at least once has no chance of becoming a keyword candidate. Because of the way the graph is built (an acyclic weighted graph), the more a term is repeated in the document, the better its chance of becoming a keyword candidate. In most cases, however, the keywords of a document include terms or phrases that are not present in the treated document. For example, our dataset contains an article about Kurdistan and Turkey whose keywords include the term "Abdullah Öcalan عبدﷲ أوجالن" although it does not appear as a term in the article. In another example, an article about events in Khartoum city (the capital of Sudan) has Sudan among its keywords, beside Khartoum, although "Sudan" appears only once in the document. Another article about Al-Aqsa Mosque uses the term "Alhrm Alqdsy Alshryf الحرم القدسي الشريف" throughout the document, but its keywords use "Almsjd Alaqsa المسجد األقصى".
We can conclude that keywords can contain synonyms of the candidate terms extracted from the document using centrality (ranking) measures. Since the present approach does not take synonyms into account, this is a limitation of the approach.

VI. CONCLUSION AND FUTURE WORK
The new system is based on representing the words of the document as a graph and using graph-based centrality measures to rank the words. It achieved a very good performance according to the three accuracy measures: Precision, Recall, and F-measure, which reach 54, 82 and 64 respectively. Although there is no significant difference between the four centrality measures, Degree Centrality shows better performance than the others. There are still other ranking methods that could be tested, such as TextRank and HITS.
In general, keyword extraction based on graph centrality measures gives better performance than statistical approaches. One limitation is that a term must appear at least once in the document to become a key-phrase candidate, and it should appear more than once to increase its chance. One possible solution is to take the word's synonyms into account.

REFERENCES
[1] Al-Kabi M., Al-Belaili H., Abul-Huda B., and Wahbeh A., "Keyword Extraction Based On Word Co-Occurrence Statistical Information for Arabic Text", Abhath Al-Yarmouk: Basic Sci. & Eng., vol. 22, no. 1, pp. 75-9, 2013.
[2] Awajan A., "Unsupervised Approach for Automatic Keyword Extraction from Arabic Documents", The 2014 Conference on Computational Linguistics and Speech Processing, pp. 175-184, The Association for Computational Linguistics and Chinese Language Processing, 2014.
[3] Awajan A., "Keyword extraction from Arabic documents using term equivalence classes", ACM Trans. Asian Low-Resour. Lang., 2015.
[4] El-Shishtawy T.A. and Al-Sammak A.K., "Arabic Keyphrase Extraction using Linguistic knowledge and Machine Learning Techniques", Proceedings of the Second International Conference on Arabic Language Resources and Tools, The MEDAR Consortium, Cairo, Egypt, 2009.
[5] Sahmoudi I., Froud H. and Lachkar A., "A new keyphrases extraction method based on suffix tree data structure for Arabic documents clustering", Int. J. Database Manag. Syst., vol. 5, no. 6, pp. 17-33, 2013.
[6] http://www.internetworldstats.com/stats7.htm
[7] Kim Y., Kim M., Park S., Cattle A., Shin H. and Otmakhova J., "Applying Graph-based Keyword Extraction to Document Retrieval", International Joint Conference on Natural Language Processing, pp. 864-868, Nagoya, Japan, 14-18 October 2013.
[8] Mihalcea R. and Tarau P., "TextRank: Bringing Order into Texts", Proceedings of EMNLP 2004, pp. 404-411, Barcelona, Spain, Association for Computational Linguistics, 2004.
[9] Kolaczyk E. and Csardi G., "Statistical Analysis of Network Data with R", ISBN-13: 978-1493909827, 2014.
[10] Wasserman S. and Faust K., "Social Network Analysis: Methods and Applications", Cambridge University Press, ISBN: 9780521387071, 1994.
[11] Newman M., "Networks: An Introduction", Oxford University Press, ISBN: 9780199206650, 2010.
[12] Freeman L., "Centrality in Social Networks: Conceptual Clarification", Social Networks, vol. 1, no. 3, pp. 215-239, 1979.
[13] Brin S. and Page L., "The PageRank Citation Ranking: Bringing Order to the Web", Stanford Digital Library Technologies Project, 1998.
[14] Al-Taani A. and Al-Omour M., "An Extractive Graph-based Arabic Text Summarization Approach", The International Arab Conference on Information Technology, 2014.
[15] Litvak M. and Last M., "Graph-Based Keyword Extraction for Single-Document Summarization", Coling 2008: Proceedings of the Workshop on Multi-source Multilingual Information Extraction and Summarization, pp. 17-24, Manchester, August 2008.
[16] Boudin F., "A comparison of centrality measures for Graph-Based Keyphrase extraction", International Joint Conference on Natural Language Processing, pp. 834-838, Nagoya, Japan, 14-18 October 2013.
[17] https://networkx.github.io/
[18] Daoud D., Al-Kouz A. and Daoud M., "Time-sensitive Arabic multiword expressions extraction from social networks", International Journal of Speech Technology, vol. 19, pp. 249-258, 2016.
[19] Suleiman D. and Awajan A., "Bag-of-concept based keyword extraction from Arabic documents", 2017 8th International Conference on Information Technology (ICIT), pp. 863-869, IEEE, May 2017.

Arabic Text Keywords Extraction using Word2vec

Dima Suleiman, Computer Science Department, King Hussein Faculty of Computing Sciences, Princess Sumaya University for Technology; Teacher at the University of Jordan, Amman, Jordan (d.suleiman@psut.edu.jo)
Arafat A. Awajan, Computer Science Department, King Hussein Faculty of Computing Sciences, Princess Sumaya University for Technology, Amman, Jordan (awajan@psut.edu.jo)
Wael Al Etaiwi, Computer Science Department, King Hussein Faculty of Computing Sciences, Princess Sumaya University for Technology, Amman, Jordan (w.etaiwi@psut.edu.jo)

Abstract— Automatic keyword extraction is very useful for text summarization, information retrieval and other natural language applications. This paper proposes a method that extracts keywords from Arabic documents based on the semantic similarity of words. The proposed method groups words into classes, placing similar words in the same class. The Word2vec word embedding model is used to represent words as vectors, so the semantic similarity between words can easily be computed using cosine similarity. Words with high contextual semantic similarity are therefore grouped in the same class. Using Word2vec, the method can also generate abstractive keywords that do not exist in the original text. The experiments are conducted on three documents. The results show that the proposed model improves keyword extraction performance compared with previous models in terms of precision, recall and F-measure.

Keywords—Word2vec; Keyword Extraction; Cosine Similarity; Arabic Natural Language Processing; Semantic Similarity.

I. INTRODUCTION

There are huge amounts of online documents that cannot be accessed and retrieved in a search process because of the lack of keywords. The keyword extraction process is very important since the number of documents that are available online is huge. Keywords can be used in several Natural Language Processing (NLP) applications such as information retrieval, text summarization and classification.
Automatic keyword extraction should be used instead of manual extraction, since manual extraction is a time-consuming process. There are many keyword extraction techniques for English documents [1]. Extraction of keywords from Arabic documents is a newer topic and needs more attention. The Arabic language is very important; its importance relates to its number of speakers and to geopolitical aspects. In addition, Arabic is undeniably one of the top languages on the Internet, and many important documents available online are written in Arabic.
In general, the keyword list of each document consists of four to ten keywords, and each keyword consists of one or more words. These keywords can be used to determine the content of the document; the keyword list can also help in classifying and summarizing the text, in addition to facilitating information retrieval tasks [2].
In this research, we propose a new keyword extraction method to extract keywords from Arabic documents. The proposed method is built on grouping words into classes according to their semantic similarities. The semantic similarity between words can be computed using cosine similarity; thus, in order to compute the semantic similarity, the words are converted into vectors using a word embedding model. In this research, the word2vec word embedding model is used to convert words into vectors [3]. The proposed approach consists of five phases. The first phase is the preprocessing phase, in which the Farasa text processing toolkit is used for segmenting and stemming the text [4]. The second phase consists of creating the word-context matrix using the semantic vector space model, context-based semantic similarity and synonyms [5]. Words that have the same stem and high similarities are grouped into the same cluster in the third phase. In the fourth phase, the weights of N words are computed using N-grams. Finally, the keywords are extracted.
The rest of this paper is organized as follows: section II covers the related work on keyword extraction methods. The proposed approach is discussed in section III and its evaluation is covered in section IV. Finally, section V presents the conclusion.

II. RELATED WORK
The newer approaches to keyword extraction depend mainly on NLP techniques in addition to statistical analysis [1], whereas the older methods used the frequency of the words in the corpus.
Many applications used bag-of-words (BoW) vectors to represent the text [6]. A BoW vector consists of words and their weights based on their frequency. Different statistical measures are used to extract keywords from this representation, such as information gain, word frequency and mutual information.



Decomposition of singular values is an example of using factor analytical methods [6]. Feature extraction is used to handle the problem of synonym words; this can be done by grouping words with similar meanings together. Grouping words together facilitates representing the text as a collection of concepts, so this representation is called bag-of-concepts (BoC).
Representing documents as BoW faces several problems. The first problem is that BoW does not take the meaning of words into consideration; the second is its huge dimensionality. Both problems can be addressed by using bag-of-concepts (BoC) [7].
Although BoC overcomes many problems of BoW, it requires external resources such as gazetteers, named entities, dictionaries and synonym lists. In addition, BoC requires expensive computation. Thus, another approach is used as an alternative, based on what is called vector space [6]. Vector space depends on random indexing, which is used to generate context vectors in order to represent the documents as BoC. Using BoC to represent documents can help in many NLP applications [7], and word vectors can be used to create clusters [8], [9].
In [10] the authors implemented a new method for extracting keywords based on a Document Frequency (DF) threshold method. After extracting the keywords, a vector space model was used to represent the documents, and the documents were then normalized using the TF x IDF weighting scheme.
Keyword extraction from Arabic documents has received little research attention. The authors of [11] presented an approach called KP-Miner for extracting keywords from both Arabic and English documents. KP-Miner applied two conditions together with TF x IDF weights in order to extract keywords: first, the phrase from which the keyword will be extracted must occur in the text at least n times; second, a condition related to the position of the keyword in the document. In order to satisfy the two conditions, two threshold values must be defined. To adapt to Arabic documents, El-Beltagy et al. applied some preprocessing tasks such as removing stop words and stemming. For stemming, they used a custom-built stemmer that removes affixes.
Some researchers used machine learning and linguistic methods to extract keywords, such as El-Shishtawy et al. [12]. Linguistic information was used in three phases: tokenization, word abstract forming and Part of Speech Tagging (PoST). The proposed system used an annotated Arabic corpus in order to increase the extraction speed. False results may occur since the system deals with all types of words, such as verbs, names, function words and adjectives, equally. There are a limited number of Arabic documents annotated with keywords, and the number of resources for building a learning model is limited.
In order to decrease the dimensionality, text can be represented using a semantic-based model. Sedki et al. [13] proposed a system that used a distributional similarity measure to find the similarity between words in order to classify them into equivalence classes. Each class has an identifier and represents a group, and the words in the same group are considered similar. Another way to generate similar words is by using morphology, such that words that have the same stem root-pattern are assigned the same class identifier. Moreover, the size of the named entity set is reduced by grouping similar entities into the same class. Finally, the similar words are identified by applying distributional similarity on the word-context matrix.
Combining linguistics with statistics was used to extract keywords from Arabic documents in [14]. Extracting the morphological pattern and the root of derivative words was used as a preprocessing phase; in the cleaning phase, the meaningless words were removed; finally, the most frequent words were grouped into classes (clusters). Experimental results showed that the average precision was 31% and the average recall was 53%.
Extraction of keywords from documents requires knowledge of the document domain. This can be achieved either by using linguistic rules related to the domain, supervised machine learning, or a combination of the two. On the other hand, the human effort needed for domain-knowledge keyword extraction is high, which can be addressed by using word embedding. Qiu et al. [15] recommended using word embedding in keyword extraction since it does not require domain knowledge.
A Support Vector Machine together with statistical features was used for extracting Arabic keywords in [16]. The experimental results showed that the precision and recall were 0.77 and 0.58 respectively.
In this paper, the semantic vector space model together with context semantic similarity is used to extract keywords from Arabic documents. The new method is unsupervised, so there is no need to create or use an annotated corpus.

III. PROPOSED SYSTEM
The proposed keyword extraction approach is an enhancement of the Arabic document keyword extraction method proposed in [17]. Both methods use the bag-of-concepts (BoC). The main enhancement is the use of semantic context similarity, together with synonyms, when creating the Term Equivalence Classes introduced in [14], [17]. Another enhancement of the proposed method is the generation of abstractive keywords that do not exist in the original text. There are many works addressing keyword extraction from English documents, but few research papers investigate Arabic document keyword extraction. Manual keyword extraction is time consuming, so automatic keyword extraction is a good solution in order to retrieve keywords in reasonable time. In order to increase the accuracy, a combination of linguistics and statistics is used, and no prior domain knowledge is needed to extract keywords from an Arabic document.

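The class building described in the following sections merges words whose word2vec vectors have high cosine similarity. The following is a minimal sketch of that similarity test using gensim; the model file name, the similarity threshold and the helper name are illustrative assumptions, not values taken from the paper.

import gensim

# Load a pre-trained Arabic word2vec model (e.g. an AraVec model);
# the file name below is a placeholder, not a path given in the paper.
model = gensim.models.Word2Vec.load("aravec_wikipedia_cbow_100.model")

def same_context_class(word1, word2, threshold=0.7):
    """Return the cosine similarity of two words and whether they would be
    merged into the same class under an (assumed) similarity threshold."""
    similarity = model.wv.similarity(word1, word2)
    return similarity, similarity >= threshold
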
Fig. 1: Phases of the Proposed Method

The proposed approach consists of five phases, as shown in Fig. 1. The first phase is the preprocessing phase, which consists of several stages: the first stage is cleaning the text, which includes removing punctuation marks, non-Arabic words, numerals and diacritics. The second stage is tokenization, which splits the text into words; in this research, the Farasa segmenter is used for tokenization [4]. The last stage is stemming, where the Farasa stemmer is used to get the stems of the words.
In the second phase, the semantic vector space model is used to represent words. Each row in the semantic vector space represents a word's stem and its weight, where the weight is the frequency of the words that have the same stem. In the third (clustering) phase, the words that have the same stem are grouped together in the same class. Moreover, as in [17], synonym words are also assigned to the same class; in addition, we group words that have high context semantic similarity in the same class. The fourth phase is the statistical phase, where N-grams are used. In this phase, N-grams include unigrams, bigrams and trigrams; for each N-gram there is one entry in the semantic vector space, and the frequency of the N-gram is used to compute its weight. Finally, keyword extraction is the last phase.
The Universal Declaration of Human Rights document is used to illustrate the five phases of the proposed keyword extraction method.

A. Document Preprocessing and Cleaning
Document preprocessing and cleaning is a crucial step in many NLP applications; instead of dealing with the words themselves, it is better to deal with the words' stems. The preprocessing phase includes removing stop words, in addition to extracting each word's stem. In the tokenization process, the text is split into words.
Keyword extraction is mainly based on the frequency of words: the words that occur frequently are more likely to be nominated as keywords. However, even though stop words occur frequently, they cannot be considered keywords, so stop words are removed in the preprocessing phase.

B. Term Weight
The term weights are computed using the score or the frequency of their stems. Furthermore, the weights are affected by the position of the words in the document: words that occur in the abstract, introduction and conclusion have a higher probability of being keywords, so they are assigned higher weights than other words and thus have a better chance of being selected as keywords.
In order to calculate the weights based on the position of the word in the document, the document is divided into N sections, where each section is assigned a weight as in [5]. The same probabilities used in [14] are used here. Li represents the probability of the words in a certain section i; the values of Li are 0.24, 0.51 and 0.11 for words that occur in the first or last paragraph, the title, and an internal paragraph, respectively. Freq(w, i) represents the frequency of word w in section i, as in [14]. The weight of a word w is therefore computed using Eq. (1) [14]:

weight(w) = Σi (Li × Freq(w, i))   (1)

All the paragraphs of the Universal Declaration of Human Rights document, which we use as an example in this paper, have the same probability since the document does not contain an abstract, introduction or conclusion.

C. Building the Bag-of-Concepts
Building the BoC consists of three stages. In the first stage, words that have the same stem are grouped together in the same class in a process called Term Normalization. Each class is given a weight, such that for a class C the weight is computed using Eq. (2) [14]:

ClassWeight(C) = Σ Weight(Wi), Wi ϵ C   (2)

The second stage groups words that are synonyms into the same class; WordNet is used to find the synonym words [18]. The last stage groups words that have high context similarity into the same class. In this stage, the pre-trained word2vec model is used to convert the words into vectors, and cosine similarity is used to find the similarity values between words.
Table I shows the results of term normalization for the classes with the highest count or weight.
Table II displays the classes after considering the synonyms. After using WordNet, the synonym words are combined into the same class. For example, classes C, D and E, which represent the words "فرد", "شخص" and "انسان" respectively, are combined into one class. Since the count of the word "فرد" is the highest, the name of the combined class will be "فرد" and it is represented by C.

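A small sketch of the position-based word weight of Eq. (1) and the class weight of Eq. (2); the Li values are those quoted above, while the section labels, data structures and example frequencies are illustrative assumptions.

# Section probabilities as given in the paper: first/last paragraph, title, internal paragraph.
L = {"first_last": 0.24, "title": 0.51, "internal": 0.11}

def word_weight(freq_per_section):
    """Eq. (1): weight(w) = sum_i Li * Freq(w, i).

    freq_per_section: dict mapping section kind -> frequency of the word in that section.
    """
    return sum(L[section] * freq for section, freq in freq_per_section.items())

def class_weight(word_weights):
    """Eq. (2): ClassWeight(C) = sum of the weights of the words in class C."""
    return sum(word_weights)

# e.g. a stem appearing 3 times in internal paragraphs and once in the title
w = word_weight({"internal": 3, "title": 1})
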
TABLE I. CLASSES AND THEIR COUNTS
Class Symbol | Label Root/Stem | Count
A | حق | 61
B | حرية | 22
C | فرد | 19
D | شخص | 18
E | انسان | 14
F | عمل | 10
G | حماية | 10
H | لكل | 10
I | اجتماعي | 8
J | عام | 7
K | قانون | 7
L | دولة | 7
M | مجتمع | 7
N | بلد | 7
O | كرامة | 6
P | جميع | 6
Q | اسرة | 6
R | متساوي | 6
S | تمتع | 6
T | أخر | 6
U | امة | 6
V | اساسي | 6
W | اعالن | 6
X | احترام | 4
Y | عادل | 4
Z | مساواة | 3

The counts of the words in classes C, D and E are 19, 18 and 14 respectively. Therefore, the count of the new class is 51, which is the summation of the counts of the three classes C, D and E. Also, by using synonyms, the two classes N and L are combined, with a total count of 14, and the class is labeled "دولة".
Moreover, the context semantic similarity is considered when combining classes. Words that occur within the same semantic context are considered similar. The pre-trained word2vec model provided by AraVec [19] is used in the proposed model, and cosine similarity is used to compute the similarity between the words in the document. The results show that the classes "حق", "حرية", "كرامة" and "مساواة", represented by the symbols A, C, L and V respectively in Table II, have high similarity values and are combined into one class. As mentioned previously, the name of the class will be "حق" since the count of the class "حق" is the highest. Also, the count of the new class is equal to the summation of the counts of the classes A, C, L and V; thus, the count of the combined class is 61 + 22 + 6 + 3 = 92. Table III displays the classes after considering the context semantic similarity.
The proposed approach differs from the approach proposed in [17] in considering the semantic similarity of words that occur within the same context when combining classes; in [17], only words that have the same stem, and synonyms, are grouped together.

TABLE II. CLASSES OF SYNONYM WORDS AND THEIR COUNTS
Class Symbol | Label Root/Stem | Count
A | حق | 61
B | فرد | 51
C | حرية | 22
D | دولة | 14
E | عمل | 10
F | حماية | 10
G | لكل | 10
H | اجتماعي | 8
I | عام | 7
J | قانون | 7
K | مجتمع | 7
L | كرامة | 6
M | جميع | 6
N | اسرة | 6
O | متساوي | 6
P | تمتع | 6
Q | أخر | 6
R | امة | 6
S | اساسي | 6
T | اعالن | 6
U | عادل | 4
V | مساواة | 3

TABLE III. CLASSES OF MOST SIMILAR WORDS AND THEIR COUNTS
Class Symbol | Label Root/Stem | Count
A | حق | 92
B | فرد | 51
C | دولة | 14
D | عمل | 10
E | حماية | 10
F | لكل | 10
G | اجتماعي | 8
H | عام | 7
I | قانون | 7
J | مجتمع | 7
K | جميع | 6
L | اسرة | 6
M | متساوي | 6
N | تمتع | 6
O | أخر | 6
P | امة | 6
Q | اساسي | 6
R | اعالن | 6
S | عادل | 4
T |  | 

D. N-Gram Detection and Scoring
N-grams must be taken into account in the statistical analysis, since in most documents a keyword is composed of one or more words. The score of a unigram is the same as the count or the weight of the word, since a unigram consists of one word. On the other hand, the scores of bigrams and trigrams do not equal their weights, since a bigram is composed of two classes and a trigram of three classes.
The weight of an N-gram (NG) is computed over the document sections as in Eq. (3) [14]; the number of sections in the document is represented by M:

NgramWeight(NG) = Σi (Li × Freq(NG, i))   (3)

The total score of the N-gram is calculated by adding the weights of all the classes that compose it to the weight of the N-gram itself. For example, the score of the bigram "حق فرد" is equal to the summation of the weights of the classes "حق" and "فرد", which is 80, and the weight of the bigram "حق فرد", which is 16; thus, the total score is 96. Similarly, the score of the bigram "حق شخص" is the summation of the weights of the classes "حق" and "شخص", which is 79, and the weight of the bigram "حق شخص", which is 15; thus, the total score is 94. In the case of the bigram "حق انسان", the total score is 85 while the bigram weight is 10.
The N-gram score is calculated by Eq. (4) [17], where N is the number of classes that compose the N-gram:

Score(NG) = NgramWeight(NG) + Σj ClassWeight(Cj)   (4)

The synonyms and the semantic context similarity of words must be considered not only for unigrams but also for N-grams. For example, the bigrams "حق فرد", "حق شخص" and "حق انسان" are considered as one bigram, "حق فرد", since "فرد", "شخص" and "انسان" are synonyms; the reason for choosing "حق فرد" instead of the other two is that the count of "فرد" is higher than that of "شخص" and "انسان". Furthermore, the bigrams "حق فرد", "حرية فرد", "كرامة فرد" and "مساواة فرد" are considered as one bigram, "حق فرد", because "حق", "حرية", "كرامة" and "مساواة" have highly similar contexts based on the word2vec model and are grouped into one class.

Moreover, the reason for choosing "حق فرد" instead of the other bigrams is that the count of "حق" is higher than that of "حرية", "كرامة" and "مساواة".
The score of the N-gram in the case of synonyms and semantic context similarity of words is calculated as follows. In the previous example, in the case of synonyms, the new class "حق فرد" is the combination of all the bigrams "حق فرد", "حق شخص" and "حق انسان". The NgramWeight (NG) of "حق فرد" is the sum of the weights of the three bigrams "حق فرد", "حق شخص" and "حق انسان", which are 16, 15 and 10 respectively; in this example, the total is 41. After that, 41 is added to the weights of the classes "فرد", "حق", "شخص" and "انسان"; thus, the total score of the bigram "حق فرد" is 153. In the case of combining classes that have high context semantic similarity, the same process applies: the NgramWeight (NG) of "حق فرد" is the sum of the weights of the four bigrams "حق فرد", "حرية فرد", "كرامة فرد" and "مساواة فرد", which are 41, 15, 4 and 2 respectively; in this example, the total is 62. After that, 62 is added to the weights of the classes "فرد", "حق", "حرية", "كرامة" and "مساواة"; thus, the total score of the bigram "حق فرد" is 205.
Eq. (4) is modified to consider the synonyms and the words that have context semantic similarity, as in Eq. (5). In Eq. (5), B represents the number of N-grams that have synonym words, such as "حق فرد", "حق شخص" and "حق انسان" (in this case the value of B is 3), and C represents the number of N-grams that have high context similarity, such as "حق فرد", "حرية فرد", "كرامة فرد" and "مساواة فرد" (in this case the value of C is 4):

Score(NG) = Σb=1..B NgramWeight(NGb) + Σc=1..C NgramWeight(NGc) + Σj ClassWeight(Cj) + ClassWeightOfSynonym + ClassWeightOfSimilar   (5)

TABLE IV. BIGRAMS AND THEIR SCORES
First Term | Second Term | Weight of First Term | Weight of Second Term | Weight of Bigram (count) | Score
فرد | حق | 51 | 92 | 62 | 205
دولة | حق | 14 | 92 | 10 | 116
حماية | حق | 10 | 92 | 9 | 111
عمل | حق | 10 | 92 | 8 | 110
اجتماعي | حق | 8 | 92 | 7 | 107
عام | حق | 7 | 92 | 6 | 105
جميع | حق | 6 | 92 | 6 | 104
اسرة | حق | 6 | 92 | 6 | 104
متساوي | حق | 6 | 92 | 6 | 104
تمتع | حق | 6 | 92 | 6 | 104
قانون | حق | 7 | 92 | 5 | 104
امة | حق | 6 | 92 | 6 | 104
اساسي | حق | 6 | 92 | 6 | 104

The scores of the most frequent bigrams and trigrams are shown in Tables IV and V respectively.

TABLE V. TRIGRAMS AND THEIR SCORES
First Term | Second Term | Third Term | Weights (First, Second, Third) | Weight of Trigram (count) | Score
دولة | فرد | حق | 14, 51, 92 | 9 | 166
عمل | فرد | حق | 10, 51, 92 | 7 | 160
حماية | فرد | حق | 10, 51, 92 | 6 | 159
اجتماعي | فرد | حق | 8, 51, 92 | 6 | 157
عام | فرد | حق | 7, 51, 92 | 5 | 155
قانون | فرد | حق | 7, 51, 92 | 4 | 154
آخر | فرد | حق | 6, 51, 92 | 5 | 154
اساسي | فرد | حق | 6, 51, 92 | 5 | 154
مجتمع | فرد | حق | 7, 51, 92 | 4 | 154
تمتع | فرد | حق | 6, 51, 92 | 4 | 153
امة | فرد | حق | 6, 51, 92 | 4 | 153

E. Selection of Keywords
The keywords with the highest scores have a high probability of being the predicted keywords. The application requirements, the size of the document and the user needs determine the number of selected keywords. Moreover, the number of words in a keyword affects the selection: if two keywords have the same score, the keyword with the larger number of words is selected. The list of selected keywords is shown in Table VI.

TABLE VI. CANDIDATE KEYWORDS AND THEIR SCORES
Keyword | Score
حق فرد | 205
حق فرد دولة | 166
حق فرد عمل | 160
حق فرد حماية | 159
حق فرد اجماعي | 157
حق فرد عام | 155

IV. EXPERIMENTS AND EVALUATION

A. Description of Datasets
The performance of the proposed approach is evaluated by comparing the generated keywords with the manually extracted keywords. The experiments are conducted using three documents, whose titles are: "اإلعالن العالمي لحقوق اإلنسان", "المؤتمر الوطني األردني" and "العنف لدى طالب جامعة ال البيت". The keywords of the documents are displayed below: the keywords of "المؤتمر الوطني األردني" are shown in Table VII, and Table VIII lists the keywords of the document "العنف لدى طالب جامعة ال البيت".
Table VIII presents the keywords of the third document without considering semantics. Two synonym keywords can be seen: "العنف جامعة طالبة" and "العنف جامعة طالباً". The two keywords are grouped into one class when semantics is used; accordingly, the score of the generated class is higher, and this class has a higher probability of being selected as a candidate keyword.

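A small sketch of the merged-variant bigram scoring described above (Eq. (4) extended with the synonym/similarity merging of Eq. (5)): the score of a merged bigram is the sum of the weights of its variant bigrams plus the weights of the classes of its constituent words. The function and variable names are illustrative; the numbers reproduce the synonym example for "حق فرد" (variant weights 16 + 15 + 10 = 41, plus class counts 61 + 19 + 18 + 14, giving 153).

def merged_ngram_score(variant_ngram_weights, constituent_class_weights):
    """Score of a merged N-gram: summed weights of its synonym/similar variants
    plus the weights (counts) of the classes of the words that compose it."""
    return sum(variant_ngram_weights) + sum(constituent_class_weights)

# Synonym case from the text: three bigram variants with weights 16, 15 and 10,
# and constituent class counts 61, 19, 18 and 14.
score = merged_ngram_score([16, 15, 10], [61, 19, 18, 14])   # -> 153
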
TABLE VII. KEYWORDS OF THE "المؤتمر الوطني األردني" DOCUMENT
Keyword | Score
المؤتمر الوطني األردني | 32
األردني الوطني | 26
المؤتمر الوطني | 24
المؤتمر األردني | 20

TABLE VIII. KEYWORDS OF THE "العنف لدى طالب جامعة ال البيت" DOCUMENT
Keyword
العنف جامعة طالباً
العنف جامعة طالبة
طلبة الدراسة جامعة
طلبة جامعة العنف

In order to evaluate the accuracy of the proposed model, the results are compared with the results in [14] and [17], since the three models all evaluate the first document and [17] evaluates all the documents. In all documents, the keywords were extracted manually in order to compare them with the generated keywords.

B. Word2vec Model
Word2vec is a word embedding model that was proposed by Mikolov in 2013 [3][20]. Word embedding is used to convert words into low-dimensional vectors such that similar words can be explored in terms of syntactic and semantic similarity. Using Word2vec facilitates finding the most similar words that occur within the same context. The word2vec model has two approaches: the Continuous Bag of Words (CBOW) model and the skip-gram model. Both approaches are neural networks consisting of one input, one hidden and one output layer. In order to produce significant results, both approaches must be trained on a large corpus [21],[22]. They also share the same hyper-parameters, such as the vocabulary size, the context window and the dimension size. The vocabulary size is the number of most frequent words in the vocabulary. The context window is the window surrounding the input word in the case of skip-gram, or the output (target) word in the case of CBOW. The dimension size is the size of the vector used to represent each word. The dimension of each vector in the input and output layers is equal to the vocabulary size, while the number of neurons in the hidden layer is equal to the dimension size. The similarity between words can be measured using cosine similarity, Euclidean distance or other measures.
In this paper, we used one of the six pre-trained word embedding models prepared by an Arabic open source project called AraVec [19]. AraVec models are trained on three different resources, including Wikipedia Arabic articles, tweets and World Wide Web pages, with more than 3,300,000,000 tokens in total. The proposed keyword approach uses the model trained on Wikipedia with a dimension size of 100. Cosine similarity is used to find the words in the documents that are highly similar. For example, in the "اإلعالن العالمي لحقوق اإلنسان" document, the words "حق", "حرية", "كرامة" and "مساواة" are highly similar; thus, these words are grouped into one class. We can also notice that the word "عدالة" has high similarity with the word "حرية". The word "عدالة" does not exist in the document, yet in the proposed model it appears in the generated keywords; this type of generated keyword is called an abstractive keyword. Abstractive keywords are keywords that are generated by the keyword extraction model but do not occur in the original document.

C. Performance Evaluation
The evaluation measures used in this research are precision (P), recall (R) and F-measure. True positive, false positive, true negative and false negative values are defined for each keyword w. A keyword w that is selected by the proposed algorithm and also selected by manual extraction is considered a true positive. There are three groups of experiments, determined by the number of keywords extracted: the first group contains 5 keywords, and the second and third groups contain 10 and 15 keywords respectively. Table IX displays the results.
The number of selected keywords determines the precision, as shown in Table IX. The second document achieved the best precision since its number of keywords is low; thus, the accuracy is high. On the other hand, the precision for large documents, such as the third one, is high while the recall is moderate. Furthermore, a comparison between the results of the proposed approach and the results in [14], [17] was made; the results show that the proposed model outperforms the previous models in terms of precision, recall and F-measure.

V. CONCLUSION

With the increasing number of available online Arabic documents, the need for keyword extraction methods increases. The importance of keywords stems from their use in several NLP applications such as text summarization and information retrieval. Automated keyword extraction is used instead of manual extraction since manual extraction is time consuming. A new Arabic keyword extraction approach was proposed in this research. The proposed method considers context semantic similarity and combines Arabic language linguistic properties with statistical analysis in order to extract keywords from Arabic documents. Words that have the same stem are grouped together in the same class; words with synonym stems are also grouped together, as are words with high context semantic similarity. The Word2vec model was used to convert the words into vectors, which facilitated computing the semantic similarity between words that occur within the same context. Moreover, abstractive keywords were generated using the word2vec model. The experimental results showed that the proposed model improved the results of extracting keywords from Arabic documents. The experiments were conducted using three documents and the results were reasonable.

TABLE IX. RESULTS AFTER APPLYING THE ALGORITHM
Document number | Keywords per document | 5 extracted (P, R, F) | 10 extracted (P, R, F) | 15 extracted (P, R, F)
1 | 15 | 0.80, 0.27, 0.40 | 0.70, 0.44, 0.54 | 0.60, 0.56, 0.58
2 | 9 | 1, 0.5, 0.67 | 1, 0.91, 0.90 | -, -, -
3 | 15 | 1, 0.31, 0.48 | 0.80, 0.47, 0.60 | 0.67, 0.60, 0.63

REFERENCES
[1] J. D. Cohen, "Highlights: Language- and domain-independent automatic indexing terms for abstracting," Journal of the American Society for Information Science, vol. 46, no. 3, pp. 162-174, 1995.
[2] S. Rose, D. Engel, N. Cramer, and W. Cowley, "Automatic Keyword Extraction from Individual Documents," in Text Mining, M. W. Berry and J. Kogan, Eds. Chichester, UK: John Wiley & Sons, Ltd, pp. 1-20, 2010.
[3] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient Estimation of Word Representations in Vector Space," arXiv:1301.3781 [cs], Jan. 2013.
[4] A. Abdelali, K. Darwish, N. Durrani, and H. Mubarak, "Farasa: A Fast and Furious Segmenter for Arabic," in Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations, San Diego, California, 2016, pp. 11-16.
[5] A. Awajan, "Semantic similarity based approach for reducing Arabic texts dimensionality," International Journal of Speech Technology, vol. 19, no. 2, pp. 191-201, Jun. 2016.
[6] M. Sahlgren and R. Cöster, "Using bag-of-concepts to improve the performance of support vector machines in text categorization," in Proceedings of the 20th International Conference on Computational Linguistics - COLING '04, Geneva, Switzerland, 2004, pp. 487-es.
[7] F. Wang, Z. Wang, Z. Li, and J.-R. Wen, "Concept-based Short Text Classification and Ranking," in Proceedings of the 23rd ACM International Conference on Information and Knowledge Management - CIKM '14, Shanghai, China, 2014, pp. 1069-1078.
[8] H. K. Kim, H. Kim, and S. Cho, "Bag-of-concepts: Comprehending document representation through clustering words in distributed representation," Neurocomputing, vol. 266, pp. 336-352, Nov. 2017.
[9] S. Albitar, B. Espinasse, and S. Fournier, "Semantic Enrichments in Text Supervised Classification: Application to Medical Domain," The Twenty-Seventh International Flairs Conference, March 2014.
[10] D. R. Al-Shalabi, "Arabic Text Categorization Using kNN Algorithm," in Proceedings of the 4th International Multiconference on Computer Science and Information Technology, vol. 4, Amman, Jordan, April 5-7, 2006.
[11] S. R. El-Beltagy and A. Rafea, "KP-Miner: A keyphrase extraction system for English and Arabic documents," Information Systems, vol. 34, no. 1, pp. 132-144, Mar. 2009.
[12] T. El-Shishtawy and A. Al-sammak, "Arabic Keyphrase Extraction using Linguistic knowledge and Machine Learning Techniques," CoRR, vol. abs/1203.4605, 2012.
[13] E. Sedki, A. Alzaqah, and A. Awajan, "Arabic Text Dimensionality Reduction Using Semantic Analysis," WSEAS Transactions on Information Science and Applications, vol. 12, p. 10, 2015.
[14] A. Awajan, "Keyword Extraction from Arabic Documents using Term Equivalence Classes," ACM Transactions on Asian and Low-Resource Language Information Processing, vol. 14, no. 2, pp. 1-18, Apr. 2015.
[15] Y. Qiu, H. Li, S. Li, Y. Jiang, R. Hu, and L. Yang, "Revisiting Correlations between Intrinsic and Extrinsic Evaluations of Word Embeddings," in Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data, vol. 11221, M. Sun, T. Liu, X. Wang, Z. Liu, and Y. Liu, Eds. Cham: Springer International Publishing, 2018, pp. 209-221.
[16] B. Alarmouty and S. Tedmori, "Automated Keyword Extraction using Support Vector Machine from Arabic News Documents," pp. 342-346, doi: 10.1109/JEEIT.2019.8717420, 2019.
[17] D. Suleiman and A. Awajan, "Bag-of-concept based keyword extraction from Arabic documents," in 2017 8th International Conference on Information Technology (ICIT), Amman, Jordan, 2017, pp. 863-869.
[18] W. Black et al., "Introducing the Arabic WordNet Project," Proceedings of the Third International WordNet Conference, Sojka, Choi, Fellbaum and Vossen, Eds., 2006.
[19] A. B. Soliman, K. Eissa, and S. R. El-Beltagy, "AraVec: A set of Arabic Word Embedding Models for use in Arabic NLP," Procedia Computer Science, vol. 117, pp. 256-265, 2017.
[20] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, "Distributed Representations of Words and Phrases and their Compositionality," in Advances in Neural Information Processing Systems, pp. 3111-3119, 2013.
[21] D. Suleiman and A. Awajan, "Comparative study of word embeddings models and their usage in Arabic language applications," International Arab Conference on Information Technology (ACIT), Werdanye, Lebanon, pp. 1-7, 2018.
[22] D. Suleiman, A. Awajan, and N. Al-Madi, "Deep Learning Based Technique for Plagiarism Detection in Arabic Texts," in 2017 International Conference on New Trends in Computing Sciences (ICTCS), Amman, 2017, pp. 216-222.

A Deep Learning Approach for Arabic Text Classification

Katrina Sundus, Computer Science Department, The University of Jordan, Amman, Jordan (sun.katrina@yahoo.com)
Fatima Al-Haj, Computer Science Department, The University of Jordan, Amman, Jordan (Alhaj5661@gmail.com)
Bassam Hammo, Computer Information Systems, The University of Jordan, Amman, Jordan (b.hammo@ju.edu.jo)

Abstract—Advancement in information technology has produced massive textual material that is available online. Text classification algorithms are at the core of many natural language processing (NLP) applications. Several algorithms have been implemented to tackle the classification problem for English and other European languages, but few attempts have been carried out to solve the problem of Arabic text classification. In this paper, we demonstrate a feed-forward deep learning (DL) neural network for the Arabic text classification problem. The first layer uses term frequency-inverse document frequency (TF-IDF) vectors constructed from the most frequent words of the document collection. The output of the first layer is used as an input to the second layer. To reduce the classification error rate, we used the Adam optimizer. We conducted a set of experiments on two multi-class Arabic datasets to evaluate our approach based on standard measures such as precision, recall, F-measure, support, accuracy and time to build the model. We compared our approach with the logistic regression (LR) algorithm. The experiments showed that the deep learning approach outperformed the logistic regression algorithm for Arabic text classification.

Keywords—Arabic Text Classification, Machine Learning, Logistic Regression, Neural Networks, Deep Learning.

I. INTRODUCTION

Finding relevant information about a specific topic in a massive amount of exponentially growing online textual data is a challenging problem. Organizing data in predetermined categories may help to solve this dilemma. Hence, the need for efficient and effective automatic classification algorithms is always in demand. Text classification algorithms are at the core of many NLP applications such as text summarization, question answering, sentiment analysis, spam detection and text visualization.
The main task of text classification can be summarized as follows: given a document D, find zero or more categories to which D belongs. A binary classification process involves a collection made of two classes, while a multi-classification process acts on a data collection of more than two classes to be assigned to an unseen document.
Text classification can be either manual or automatic. Manual text classification has been the core task of classifying library content since the early days. Automatic text classification, on the other hand, is mainly done by computer machines using classification techniques.
Textual material can be classified in many ways based on metadata such as text subject, type, publication year, and author name. In this paper, we only consider the subject classification property [1].
Automatic text classification may fall under one of three categories: supervised, unsupervised and semi-supervised. In supervised text classification, human interaction is involved to provide some classification information. In unsupervised text classification, also known as text clustering, classification is completed without any external information. In semi-supervised text classification, the categorization is completed using some external mechanism [2].
Arabic is one of the top ten languages used on the web [3]. Although the Arabic language is growing rapidly on the internet, its content is still as low as 3%. The current rapid growth is a compelling motivation for researchers and developers to build effective systems and tools to advance research in Arabic NLP.
Deep learning (DL) is considered a part of neural networks and is the fastest growing field in machine learning methods. It can be supervised, semi-supervised or unsupervised [4]. DL allows various computational models to be composed into multiple processing layers that participate in learning representations of the data at different abstraction levels.
In this work, we demonstrate a feed-forward supervised DL model for Arabic text classification. The first layer uses term frequency-inverse document frequency (TF-IDF) vectors constructed from the most frequent words of the document collection. The output of the first layer is used as an input to the second layer. Optimization methods are used to reduce the error rate between the computed and the target output; the error rate is usually measured by a loss function. In this paper, we used a common optimizer called Adam.
To test the model, we carry out a set of experiments on two multiclass, single-label Arabic datasets based on standard measures such as precision, recall, F-measure, support, accuracy and the time to build the model. We then compare our proposed model with the logistic regression (LR) algorithm.
The rest of the paper is organized as follows. The second section presents the related work. The third section presents the research background. In section four, a description of the test datasets is presented. In section five, we discuss the research methodology. Experiments and their results are presented in section six, followed by the conclusion in section seven.



II. LITERATURE REVIEW
In [5], the authors used Self Organizing Maps (SOM) for the text classification task in English and Spanish, using three different datasets. They separated the classes according to their topics. Using SOM as an unsupervised learning algorithm for text classification, the authors achieved the highest accuracy results when compared with other studied unsupervised learning algorithms.
In [6], the authors proposed a hybrid of Associative Classification (AC) with Naïve Bayes (NB). The AC model suffers from a large number of classification rules, and the various pruning methods remove some important information, hence affecting the right decision. According to the authors, these drawbacks were handled by using NB. The proposed mechanism increased the efficiency of Arabic text classification by integrating association rule mining with the classification task.
In [1], the authors applied DL to multiclass, single-label Arabic sentence classification. They used convolutional neural networks (CNN) along with a word embedding layer. The word embedding layer used the AraVec Arabic word2vec model, which captures the floating-point vector similarity of words. The CNN used hyper-parameters such as the layer count, filter count, filter size, stride and padding. They applied tokenization, word indices and pad sequences as a preprocessing phase. The model consists of a sequential stack of an embedding layer and three convolutional layers. The dataset contained the 1500 most common words. They claimed that their approach achieved highly accurate results on the NLP tasks.
The author of [7] investigated the impact of the preprocessing phase and its relation to accuracy and performance improvement when using machine learning algorithms.
In [8], the authors applied very deep convolutional networks to text classification using English language corpora. They demonstrated that performance increased by using many layers, each with different features. They used 29 convolutional layers and max-pooling on distinct public text classification tasks. The authors claimed that their deep architecture functioned well on assorted corpus sizes, even on big datasets.
In [9], the authors applied DL to Arabic text classification. They used stemming to extract, select and reduce the collected features. The TF-IDF scheme was applied to the documents as a feature weighting technique. Finally, CNN classification was applied on multiple benchmarks and achieved good results.
The author of [10] applied logistic regression to Arabic text classification. They conducted their experiments on the Aljazeera Arabic News (Alj-News) dataset. Experiments showed that the logistic regression algorithm is a competitive approach for Arabic text classification.

III. BACKGROUND
In this section, we give a brief introduction to deep learning and logistic regression.

A. Deep Learning
DL is a branch of the machine learning (ML) area and is based on artificial neural networks (ANN) [4]. DL was introduced to ML by Rina Dechter in 1986 and to ANN by Igor Aizenberg et al. in 2000 [16]. Advancement in DL research has produced many learning architectures such as deep neural networks, deep belief networks, and recurrent and convolutional neural networks. DL has been applied to different research areas, including computer vision, natural language processing, speech recognition, social media, and bioinformatics. DL has been successful in producing excellent results compared with many traditional ML algorithms and human experts [18].
A deep neural network (DNN) is an ANN with multiple layers of algorithms to process data [17]. Information is passed through connected layers: the output of a previous layer provides an input to the next layer. The first layer is referred to as the input layer, while the last is referred to as the output layer. The rest of the layers are called the hidden layers. Each layer is typically a simple, uniform algorithm containing one kind of activation function.
An important aspect of DL is feature selection (FS). FS uses algorithms to automatically extract meaningful features from data for training, learning, and understanding. Nowadays, the processing of big data and the evolution of AI technology both depend on evolving DL technology.

B. Logistic Regression
The history of LR may be traced back to the 19th century, when it was first used to describe the growth rate of populations [19]. Today, LR is a classification algorithm frequently used in the fields of medicine, biology and social sciences, where it serves as an important statistical tool.
LR uses a statistical method and a logistic function to analyze a dataset by measuring the relationship between a categorical dependent variable and one or more independent variables. A categorical variable is a variable that takes values falling in limited categories instead of being continuous. Categorical data may be of various types, including binary data (Pass/Fail or Yes/No), unordered nominal data (Cat, Dog, Sheep) or ordinal data (Low, Medium, High).

IV. THE PROPOSED APPROACH
In this paper, we demonstrate a feed-forward DL neural network model for multiclass, single-label Arabic text classification. Each input node is multiplied by a weight and a bias value is added to it. The weights are calculated using a specific algorithm: the algorithm starts by initializing the weights with random values, and then the nodes are trained with back-propagation.
The input to the first dense layer is the TF-IDF vectors of the most frequent terms. The output of this layer is the input to the next layer. The number of nodes in the output layer equals the number of classes in the classification process (four classes for the first dataset and nine for the second dataset).

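A minimal sketch of such a feed-forward network, written with the Keras Sequential API mentioned later in Section V, is shown below. The layer sizes and the 5000-term TF-IDF vocabulary are illustrative assumptions, not values reported by the authors.

    # Sketch of the feed-forward classifier described above.
    # Hidden-layer sizes and NUM_TERMS are illustrative assumptions.
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Dense

    NUM_TERMS = 5000    # length of the TF-IDF vector per document (assumed)
    NUM_CLASSES = 4     # four classes for dataset-I, nine for dataset-II

    model = Sequential([
        Dense(512, activation="relu", input_shape=(NUM_TERMS,)),  # first dense layer on TF-IDF input
        Dense(256, activation="relu"),                            # hidden layer
        Dense(NUM_CLASSES, activation="softmax"),                 # one output node per class
    ])

    # Adam optimizer and a cross-entropy loss, trained with back-propagation.
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    model.summary()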
Training the model is an iterative process, and hence the number of iterations must be specified. The iterations are called epochs. In our proposed model, we used ten epochs. In addition, a batch size parameter determines how many samples are processed in the forward/backward pass of each epoch. This parameter is applied to increase the computational speed and to minimize the number of epochs required for training.

To reduce the error between the computed and the target output, we used a common optimizer called Adam. The error is measured by a loss function.

A. The datasets

In this work, we used two test datasets to conduct the experiments and to validate the efficiency of the proposed approach in solving the problem of Arabic text classification. The following is a description of the two datasets.

Dataset-I. The first dataset, Khaleej-2004, was borrowed from [20]. It contains 5690 Arabic documents organized into four categories: Economy, International News, Local News, and Sport. Table 1 shows the characteristics of the first dataset.

TABLE 1. CHARACTERISTICS OF DATASET-I

Category     Num. of Documents   Words Before Processing   Words After Processing
Economy      909                 418978                    8946
Int. News    935                 534532                    10382
Local News   2398                967525                    12519
Sport        1430                551728                    9978
Total        5690                2472763                   41825

Dataset-II. The second dataset was borrowed from [12]. It contains a set of 1445 Arabic documents organized into nine categories: Computer, Economics, Education, Engineering, Law, Medicine, Politics, Religion, and Sport. Table 2 shows the characteristics of the second dataset.

TABLE 2. CHARACTERISTICS OF DATASET-II

Category      Num. of Documents   Words Before Processing   Words After Processing
Computer      70                  13959                     2890
Economics     220                 99853                     8505
Education     68                  51749                     6803
Engineering   115                 143240                    9280
Law           97                  199156                    11346
Medicine      232                 97230                     9097
Politics      184                 85015                     9009
Religion      227                 140307                    9769
Sport         232                 73604                     4156
Total         1445                904113                    70855

B. The preprocessing phase

Data preprocessing is extremely important in many research areas such as NLP, DM, and ML. It improves the quality of the raw experimental data and has a significant impact on the performance of supervised learning models [11]. The primary aim of preprocessing is to reduce the feature space and to minimize the error rate. Appropriate data preprocessing and analysis is the next step in text classification. Data preprocessing includes the following stages:

1) Text tokenization. The tokenization process takes the dataset and splits it into separate words (tokens). The words are separated at multiple delimiters, including white spaces, tabs, and punctuation marks. The output of the tokenization process is of two types: tokens that correspond to units whose characters are recognizable, such as punctuation marks, numeric data, dates, etc., and tokens that need further morphological analysis. Tokens of one or two characters in length, non-Arabic characters, and numerical values are ignored and excluded from the dataset, as they affect the performance of the classifier [9]. Regular expressions can be a helpful tool for tokenization [1].

2) Stop-words removal. Stop-words are usually functional words such as conjunctions and prepositions. They occur frequently in a text and have low impact on the classification process [9]. A compiled list of Arabic stop-words is usually used to eliminate them from a text [14]. Developers of NLP applications usually remove stop-words from search engine indices, as this reduces the size of the indices dramatically and, hence, improves recall and precision [15].

3) Word stemming. Stemming is the process of mapping derivative words onto the base form, the stem, which they share. Stemming uses morphological heuristics to remove affixes from words before indexing them. As an example, the Arabic words "كتاب", "كاتب" and "مكتبة" share the same root "كتب". In this work, we utilized the stemmer described in [13].

After the preprocessing phase, the dataset is represented in a form suitable for the ML phase. Consequently, the most relevant stems must be extracted and converted into vectors. The vector space is represented as a two-dimensional matrix where the columns denote the stems and the rows denote the documents. The entries of the matrix are the weights of the stems in their corresponding documents. The TF-IDF scheme is used to assign weights to stems. Equations (1)-(3) are used to calculate the weights of the terms.

TF(i, j) = (number of occurrences of term i in document j) / (total number of terms in document j)    (1)

IDF(i, j) = (total number of documents in the dataset) / (number of documents that contain term i)    (2)

w_ij = TF(i, j) × IDF(i, j)    (3)

where TF(i, j) is the frequency of term i in document j and IDF(i, j) is the inverse frequency of term i with respect to all documents (i.e., the dataset). Finally, the weight of term i in document j, w_ij, is calculated by (3).

C. Model evaluation metrics

Text classification evaluation is performed using four standard measures: classification accuracy, precision, recall, and F1-score. Accuracy is simply the ratio of correctly predicted observations to the total observations. It is calculated using (4).

Accuracy = (TP + TN) / (TP + TN + FP + FN)    (4)

Precision is the ratio of correctly predicted positive observations to the total predicted positive observations.
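As a small illustration of the weighting scheme in equations (1)-(3) above, the following sketch computes TF-IDF weights for a toy set of already tokenized and stemmed documents. It mirrors the ratio form of the equations as stated; it is not the paper's implementation, and the sample stems are invented transliterated stand-ins for Arabic stems.

    import math
    from collections import Counter

    # Toy corpus of tokenized/stemmed documents (invented stand-in stems).
    docs = [["ktb", "wld", "ktb"],
            ["lcb", "wld"],
            ["ktb", "mdrs"]]

    N = len(docs)
    # Document frequency: in how many documents each stem appears.
    df = Counter(stem for doc in docs for stem in set(doc))

    def tfidf(doc):
        counts = Counter(doc)
        total = len(doc)
        weights = {}
        for stem, n in counts.items():
            tf = n / total        # Eq. (1): term frequency within the document
            idf = N / df[stem]    # Eq. (2): all documents / documents containing the stem
            weights[stem] = tf * idf  # Eq. (3): TF-IDF weight
        return weights

    for j, doc in enumerate(docs):
        print(j, tfidf(doc))

In practice, library implementations such as scikit-learn's TfidfVectorizer apply a logarithm to the document-frequency ratio and normalize the resulting vectors, but the plain ratio above follows the equations as written.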
Precision is calculated using (5). Recall is the ratio of correctly predicted positive observations to all observations in the actual class and is captured by (6). The F1-score is used when a balance between precision and recall is needed and is calculated by (7).

Precision = TP / (TP + FP)    (5)

Recall = TP / (TP + FN)    (6)

F1-score = 2 × (Precision × Recall) / (Precision + Recall)    (7)

D. Split the datasets into testing and training portions

After the datasets of Arabic documents had been preprocessed, they were split into a training set (80%) and a test set (20%). This division was done using the train-test split library in Python.

V. EXPERIMENTAL RESULTS

In this section, we compare our work with the LR supervised classification model. We evaluate the LR model based on standard measures, namely precision, recall, and F1-score. The DL model is evaluated based on accuracy and the loss on the training and validation datasets. A confusion matrix, which is a visualized summary of the classification prediction results, is produced to evaluate the classification accuracy of both models. The number of correct and incorrect predictions is summarized with counts for each class. The confusion matrix provides an insightful overview of the errors made by the classifier and of the misclassified instances.

In order to implement the two models, we used Python 3.7.0 with the JetBrains PyCharm Community Edition 2017.2.4 IDE. The DL model used Keras, a DL and neural network API running on top of the TensorFlow library. The Keras API supports two main types of models: the sequential model API, which we used in this work, and the functional API, which may be used for advanced models with complex NN architectures. The applied sequential model is a stack of dense layers. The experiments were carried out on a laptop with the following specifications: Intel(R) Core(TM) i7-4510U CPU at 2.40 GHz, 8 GB RAM, and Windows 10 64-bit operating system.

A. Experimenting with dataset-I

The two supervised classification models, namely the LR and the DL, were both tested on dataset-I and dataset-II. Table 3 shows the evaluation results of the LR model applied on dataset-I. The best performance of the model was on the "Sport" class, with a precision of 0.97, recall of 0.98, and F1-score of 0.97. The reported support value for each class denotes the number of samples of the true responses in that class. The next best performance of the LR classification model was on the "International news" class, and the weakest performance was on the "Economy" class.

Table 4 shows the confusion matrix of the LR model on dataset-I. The diagonal entries are the correctly classified documents, whereas the off-diagonal entries are the misclassified ones. As an example, the confusion matrix shows that class 0, the "Economy" class, has 151/178 of its documents correctly classified, which is equivalent to 84.8% of its documents. Looking across the same row of the "Economy" class, we find that 27/178 documents, which form 15.2% of its documents, were misclassified and predicted as class 2, "Local news".

TABLE 3. EVALUATION RESULTS OF THE LR MODEL ON DATASET-I

Class#   Name         Precision   Recall   F1-score   Support
0        Economy      0.88        0.85     0.87       178
1        Int. news    0.98        0.91     0.94       192
2        Local news   0.91        0.94     0.93       489
3        Sport        0.97        0.98     0.97       279

TABLE 4. CONFUSION MATRIX OF THE LOGISTIC REGRESSION CLASSIFICATION MODEL FOR DATASET-I (rows: actual class, columns: predicted class)

Actual \ Predicted   0                 1                 2                 3
0 (Economy)          84.8% (151/178)   0                 15.2% (27)        0
1 (Int. news)        1.0% (2)          91.1% (175/192)   7.3% (14)         0.5% (1)
2 (Local news)       3.5% (17)         0.8% (4)          94.3% (461/489)   1.4% (7)
3 (Sport)            0.4% (1)          0                 1.8% (5)          97.8% (273/279)

For the DL classification model, Table 5 shows the confusion matrix on dataset-I. Taking a closer look at the confusion matrix, it is obvious that the deep learning model outperformed the logistic regression in two classes, namely class 0 ("Economy", +3.4%) and class 1 ("International News", +4.7%). For class 2 ("Local News") and class 3 ("Sport"), the logistic regression model was better than the deep learning model.

TABLE 5. CONFUSION MATRIX OF THE DEEP LEARNING CLASSIFICATION MODEL FOR DATASET-I (rows: actual class, columns: predicted class)

Actual \ Predicted   0                 1                 2                 3
0 (Economy)          88.2% (157/178)   0                 11.8% (21)        0
1 (Int. news)        0.5% (1)          95.8% (184/192)   3.6% (7)          0
2 (Local news)       4.7% (23)         0.8% (4)          92.8% (454/489)   1.6% (8)
3 (Sport)            0.4% (1)          0                 2.3% (6)          97.5% (272/279)

To visualize the accuracy on the training and testing data for the DL model, each class of the dataset was split into two groups: a training group and a validation group. Table 6 shows the training and validation accuracy values of dataset-I over 10 epochs. As shown in Fig. 1, the training accuracy reached approximately 97.5%, while the validation accuracy reached approximately 93.76% at the 10th epoch.

Table 7 depicts the loss values on the training and validation data over 10 epochs, while Fig. 2 shows that the training loss decreased at every epoch until it was close to 0.012. Meanwhile, the validation loss fluctuated around 0.09 up to the 4th epoch and then increased steadily, reaching 0.113 at the 10th epoch.
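The 80/20 split described in subsection D is commonly performed with scikit-learn's train_test_split. The following is a minimal, self-contained sketch; the random matrix and labels merely stand in for the real TF-IDF document matrix and class labels, and the random seed is an assumption.

    import numpy as np
    from sklearn.model_selection import train_test_split

    # Synthetic stand-ins for the real data: a small "TF-IDF" matrix X
    # (documents x stems) and labels y for four classes.
    rng = np.random.default_rng(0)
    X = rng.random((100, 50))
    y = rng.integers(0, 4, size=100)

    # 80% training / 20% test split, as described in subsection D.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.20, random_state=42)
    print(X_train.shape, X_test.shape)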

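Continuing the sketch above, the per-class measures of equations (4)-(7) and the row-normalized confusion matrices of Tables 3-5 can be obtained with scikit-learn; the logistic regression baseline and its hyperparameters are illustrative assumptions, not the authors' configuration.

    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import classification_report, confusion_matrix

    # A logistic regression baseline similar in spirit to the paper's LR model.
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)

    # Precision, recall, F1-score and support per class, as in Table 3.
    print(classification_report(y_test, y_pred, digits=2))

    # Row-normalized confusion matrix: each row shows how the documents of one
    # actual class were distributed over the predicted classes, as in Table 4.
    print(confusion_matrix(y_test, y_pred, normalize='true'))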
TABLE 6. TRAINING AND VALIDATION ACCURACY OF DATASET-I

Epoch   Validation Accuracy   Training Accuracy
1       0.729                 0.567
2       0.894                 0.843
3       0.928                 0.925
4       0.929                 0.947
5       0.929                 0.953
6       0.934                 0.958
7       0.933                 0.963
8       0.939                 0.967
9       0.937                 0.972
10      0.938                 0.976

Fig. 1. Training & validation accuracy of DL classification of dataset-I

TABLE 7. TRAINING AND VALIDATION LOSS OF DATASET-I

Epoch   Validation Loss   Training Loss
1       0.1149            0.2687
2       0.0914            0.0797
3       0.0917            0.0564
4       0.0905            0.0444
5       0.0929            0.0355
6       0.0964            0.0292
7       0.0987            0.0238
8       0.1006            0.019
9       0.1097            0.0154
10      0.1129            0.0124

Fig. 2. Training & validation loss of DL classification model of dataset-I

B. Experimenting with dataset-II

Dataset-II has 1445 Arabic documents and nine different classes, as described in Table 2. Table 8 shows the evaluation results of the logistic regression model applied on dataset-II. The best performance of the logistic regression model was on the "Sport" class, with a precision of 1.0, recall of 1.0, and F1-score of 1.0. The next best performance of the logistic regression model was on the "Computer" class, and the weakest performance was on the "Economics" class.

TABLE 8. LOGISTIC REGRESSION RESULTS OF DATASET-II

Class #   Name          Precision   Recall   F1-score   Support
0         Computer      1           0.79     0.88       19
1         Economics     0.73        0.93     0.81       40
2         Education     1           0.75     0.86       16
3         Engineering   0.95        1        0.98       21
4         Law           1           0.52     0.69       21
5         Medicine      0.92        1        0.96       34
6         Politics      0.92        1        0.96       44
7         Religion      0.96        0.95     0.95       55
8         Sports        1           1        1          39

Table 9 shows the confusion matrix of the logistic regression model for dataset-II. Again, the diagonal entries are the correctly classified documents, whereas the off-diagonal entries are the misclassified ones. For example, the confusion matrix shows that class 0, the "Computer" class, has 15/19 of its documents (78.9%) correctly classified, while two documents (10.5%) were misclassified and predicted as class 1, "Economics", one document (5.3%) was predicted as class 3, "Engineering", and one document (5.3%) was misclassified and predicted as class 5, "Medicine".

TABLE 9. CONFUSION MATRIX OF THE LOGISTIC REGRESSION CLASSIFICATION MODEL OF DATASET-II

Table 10 shows the confusion matrix of the DL model on dataset-II. The confusion matrix shows that the DL model outperformed the logistic regression model in four classes, namely "Computer" (+10.6%), "Education" (+18.8%), "Law" (+9.5%), and "Religion" (+3.7%).

TABLE 10. CONFUSION MATRIX OF THE DL CLASSIFICATION MODEL OF DATASET-II

To visualize the accuracy on the training and testing data for the DL model, we split each class of the dataset into two groups: a training group and a validation group.
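Per-epoch accuracy and loss values of the kind reported in Tables 6, 7, 11, and 12 are what Keras returns from model.fit. The sketch below tabulates such a training history for the hypothetical model sketched earlier in Section IV; the dummy data, batch size, and validation split are assumptions, not the paper's settings.

    import numpy as np
    from tensorflow.keras.utils import to_categorical

    # Dummy stand-ins for the prepared TF-IDF matrix and one-hot labels
    # (shapes match the hypothetical 5000-term, 4-class model above).
    X_tfidf = np.random.rand(1000, 5000).astype("float32")
    y_onehot = to_categorical(np.random.randint(0, 4, size=1000), num_classes=4)

    # Ten epochs with a validation split; batch size is an assumption.
    history = model.fit(X_tfidf, y_onehot, epochs=10, batch_size=64,
                        validation_split=0.2, verbose=0)

    print("Epoch  Val. acc  Train acc  Val. loss  Train loss")
    for e in range(10):
        h = history.history
        print(f"{e + 1:>5}  {h['val_accuracy'][e]:.4f}   {h['accuracy'][e]:.4f}"
              f"   {h['val_loss'][e]:.4f}   {h['loss'][e]:.4f}")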
Table 11 shows the training and validation accuracy values of dataset-II over 10 epochs. Fig. 3 shows that the training accuracy reached approximately 99.4%, while the validation accuracy reached approximately 94.1% at the 10th epoch.

TABLE 11. TRAINING AND VALIDATION ACCURACY VALUES OF DATASET-II

Epoch   Validation Accuracy   Training Accuracy
1       0.7128                0.5718
2       0.7993                0.8028
3       0.8581                0.8763
4       0.9135                0.9256
5       0.9308                0.9602
6       0.9343                0.9749
7       0.9412                0.9775
8       0.9343                0.9870
9       0.9377                0.9896
10      0.9412                0.9939

Fig. 3. Training & validation accuracy of DL classification of dataset-II

Table 12 shows the loss values on the training and validation data over 10 epochs. Fig. 4 shows that the training loss decreased at every epoch until it was close to 0.013. Meanwhile, the validation loss decreased at every epoch, reaching 0.0385 at the 10th epoch.

TABLE 12. TRAINING AND VALIDATION LOSS OF DATASET-II

Epoch   Validation Loss   Training Loss
1       0.2826            0.3224
2       0.1752            0.2215
3       0.1134            0.1263
4       0.0833            0.0787
5       0.0654            0.0538
6       0.0551            0.0383
7       0.0482            0.0287
8       0.0438            0.0217
9       0.0404            0.0168
10      0.0385            0.0131

C. Accuracy and time evaluation of both datasets

Fig. 5 shows the accuracy score evaluation of dataset-I and dataset-II for the two tested classification models. The evaluation results showed that the deep learning classification model achieves better accuracy than the logistic regression model on both datasets. We also concluded that the accuracy increases when the dataset has more class variation, as in dataset-II.

Fig. 4. Training & validation loss of DL classification of dataset-II

Fig. 5. Accuracy score

Fig. 6 shows the model building time for the two tested classification models. The evaluation results showed that the time consumed by deep learning to build the model was less than the time consumed by the logistic regression model for both datasets.

Fig. 6. Model building time

VI. CONCLUSION

In this paper, we demonstrated how a supervised deep learning neural network classification model can be applied to solve the problem of multiclass, single-label Arabic text classification. We implemented a feed-forward DL neural network. The input to the first layer was the TF-IDF vectors of the most frequent terms in the dataset. The output of the first layer was used as the input to the next layer. In addition, we used the Adam optimizer to reduce the error rate. We conducted a set of experiments to validate our approach using two datasets of Arabic documents. We used the supervised logistic regression classification model as a base classifier. Experimental results showed a significant improvement in classification

accuracy and in model building time in favor of the deep learning model compared with the logistic regression model. The results indicate that deep learning classification models are very promising for the Arabic text classification problem.
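The model building time compared in Fig. 6 can be reproduced by simply timing each model's training call. A minimal sketch using only the standard library is shown below; the commented usage refers to the hypothetical objects from the earlier sketches.

    import time

    def timed_fit(fit_callable):
        """Measure the wall-clock time of one model-building (training) call."""
        start = time.perf_counter()
        fit_callable()
        return time.perf_counter() - start

    # Hypothetical usage with the models sketched earlier:
    #   lr_time = timed_fit(lambda: clf.fit(X_train, y_train))
    #   dl_time = timed_fit(lambda: model.fit(X_tfidf, y_onehot, epochs=10,
    #                                         batch_size=64, verbose=0))
    #   print(f"LR build time: {lr_time:.1f}s, DL build time: {dl_time:.1f}s")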
REFERENCES
[1] D. Sagheer and F. Sukkar, "Arabic Sentences Classification via Deep Learning," International Journal of Computer Applications, 182(5), pp. 40-46, 2018.
[2] R. G. Rossi, A. Lopes, and S. O. Rezende, "Optimization and label propagation in bipartite heterogeneous networks to improve transductive classification of texts," Information Processing & Management, 52(2), pp. 217-257, 2016.
[3] https://speakt.com/top-10-languages-used-internet/, visited 20 May 2019.
[4] J. Schmidhuber, "Deep learning in neural networks: An overview," Neural Networks, vol. 61, pp. 85-117, 2015.
[5] J. Saarikoski, J. Laurikkala, K. Järvelin, and M. Juhola, "Self-Organising Maps in Document Classification: A Comparison with Six Machine Learning Methods," in International Conference on Adaptive and Natural Computing Algorithms (ICANNGA 2011), Springer-Verlag Berlin Heidelberg, 2011.
[6] W. Hadi, Q. A. Al-Radaideh, and S. Alhawari, "Integrating Associative Rule-based Classification with Naïve Bayes for Text Classification," Applied Soft Computing, 69, pp. 344-356, 2018.
[7] R. Alshammari, "Arabic Text Categorization using Machine Learning Approaches," International Journal of Advanced Computer Science and Applications (IJACSA), 9(3), pp. 226-230, 2018.
[8] A. Conneau, H. Schwenk, Y. L. Cun, and L. Barrault, "Very Deep Convolutional Networks for Text Classification," in Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, vol. 1, pp. 1107-1116, 2017.
[9] S. Boukil, M. Biniz, F. El Adnani, L. Cherrat, and A. E. El Moutaouakkil, "Arabic Text Classification Using Deep Learning Technics," International Journal of Grid and Distributed Computing, 11(9), pp. 103-114, 2018.
[10] M. M. Al-Tahrawi, "Arabic Text Categorization Using Logistic Regression," I.J. Intelligent Systems and Applications, vol. 6, pp. 71-78, 2015.
[11] S. B. Kotsiantis, D. Kanellopoulos, and P. E. Pintelas, "Data preprocessing for supervised leaning," International Journal of Computer Science, 1(2), pp. 111-117, 2006.
[12] A. M. A. Mesleh, "Chi Square Feature Extraction Based SVMs Arabic Language Text Categorization System," Journal of Computer Science, 3(4), pp. 430-435, 2007.
[13] B. Hammo, S. Yagi, O. Ismail, and M. AbuShariah, "Exploring and exploiting a historical corpus for Arabic," Language Resources and Evaluation, 50(4), pp. 839-861, 2016.
[14] M. T. Alrefaie, Arabic stop-words list, available online: https://github.com/mohataher/arabic-stop-words, visited 20 May 2019.
[15] G. Salton and C. Buckley, "Term-weighting approaches in automatic text retrieval," Information Processing and Management, 24, pp. 513-523, 1988.
[16] J. Schmidhuber, "Deep learning," Encyclopedia of Machine Learning and Data Mining, pp. 1-11, 2016.
[17] Y. Bengio, "Learning deep architectures for AI," Foundations and Trends in Machine Learning, 2(1), pp. 1-127, 2009.
[18] L. Deng and Y. Dong, "Deep learning: methods and applications," Foundations and Trends in Signal Processing, 7(3-4), pp. 197-387, 2014.
[19] N. Bacaër, "Verhulst and the logistic equation (1838)," in A Short History of Mathematical Population Dynamics, pp. 35-39, Springer, London, 2011.
[20] Khaleej-2004 Arabic corpus compiled by Dr. Mourad Abbas. Available online: https://sourceforge.net/projects/arabiccorpus/files/arabiccorpus%20%28utf-8%29/, visited 20 May 2019.

Arabic Text Semantic Graph Representation
Wael Mahmoud Al Etaiwi Arafat Awajan
Princess Sumaya University for Technology Princess Sumaya University for Technology
Amman. Jordan Amman. Jordan
w.etaiwi@psut.edu.jo awajan@psut.edu.jo

Abstract— Semantic representing of Arabic text can facilitate weighted graph called semantic graph, in which weights
several language processing applications. It reflects the meaning represents the semantic relations between vertices (words). For
of the text as it is understood by humans. Semantic graphs can be paragraphs or documents, the semantic graph becomes more
used to enhance the performance of several natural language complex and difficult to manipulate. On the other hand, network
processing applications such as question answering and textual representation is flexible and accumulative, thus, it is suitable
entailment. This paper proposed a graph-based Arabic text for real-time and online applications. Finally, a set of predefined
semantic representation model. The proposed model aims to rules are used in the rule-based representation to represent the
represent the meaning of Arabic sentences as a rooted acyclic semantic relation between words. The combination of applied
graph. Most of the works on semantic representation have focused
rules may differ among different implementations. The order of
on the English language. Furthermore, not much work considered
checked rules and the priority of applied rules may produce
and focused on the Arabic language. In this paper, the proposed
model dedicated to the Arabic language and considers its features different representations. This affects the process of retrieving
and challenges. the original text from its rule-based representation negatively[6].
Semantic parsing refers to the process of mapping text into
Keywords—semantic graph; knowledge representation; its semantic representation [7]. Many different methods and
semantic; semantic representation. techniques are used in the semantic parsers such as machine
learning and linguistics-based methods [8]. Semantic parsers are
I. INTRODUCTION classified into two main types: Deep semantic parsers and
The knowledge representation using a predefined set of shallow semantic parsers. Deep semantic parsers are used to
notations that can be used by a computer program in a systematic represent text components such as multiword expressions [9].
way is called semantic [1]. The semantic relations between While in shallow semantic parsers, each word in the text is
words and text components play a key role in several represented according to its meaning and its semantic
applications, especially for text analysis and mining relationship with other words [10].
applications. For the Arabic language, the semantic parsers are limited and
Semantic representation is to reflect what human understand have less attention with comparison to other languages. This is
about the meaning of given text semantically. It is used in due to the lack of high-quality resources and tools that could be
several Natural Language Processing (NLP) applications such used to Arabic NLP models. Furthermore, the Arabic language
as Question Answering (QA), Textual Entailment (TE) and text has a sophisticated syntax and morphology structure. Thus, most
summarization. of the proposed Arabic parsers focus on the syntax and
morphology of Arabic text rather that it’s semantic
Semantic representation models can be classified into four representation [11].
main categories: Predicate logic representation, Frame
representation, Network representation, and Rule-based In this paper, we propose an Arabic text semantic network
representation. In the predicate logic representation, the representation model. The semantic graph will be used to
language is used as a notation set in order to represent the represent the meaning of Arabic sentences. The proposed
semantic relation between text components [2], [3]. A set of representation model is used to represent different sentences into
logic notations is used to express the meaning of words in the the same semantic graph when they share the same meaning.
sentence. For example, the sentence “The weather is beautiful” The proposed model designed for the Arabic language and it has
is represented as: beautiful(weather). The complexity of the ability to represent and retrieve Arabic sentences easily.
representing complex sentences is the main drawback of this The remaining of this paper is structured as follows: section
representation model. Furthermore, ignoring helping verbs and II presents the related work. The proposed model is presented in
supporting words reduces the retrieval process quality [2]. In section III. Some examples are illustrated in section IV. Section
frame representation, the original text is represented as slots of V discussed the main challenges of the proposed model. Finally,
components and parts. Each part carries a specific type of the conclusion and future works are presented in section VI.
information [4]. The key step in this model is to split the original
text into its appropriate components and part, which is a time-
consuming process. Furthermore, retrieving the original text II. RELATED WORK
from its frame representation is a very difficult task. In network Most of the proposed researches on semantic representation
representation, also call graph representation, the semantic and parsers are small and domain-oriented [12]. Furthermore,
relations are represented as a set of vertices and edges [5]. The they are oriented for the English language. Several graph-based



semantic representation models proposed and investigated such FrameNet in a single unified knowledge resource in order to
as Abstract Meaning Representation (AMR) [12], Groningen improve disambiguation accuracy in the machine translation
Meaning Bank (GMB) [13], and Universal Networking task.
Language (UNL) [14]. The proposed models are different in
Most of the proposed Arabic semantic representation models
terms of structure, granularity, and automaticity. Semantic
use graph knowledge representation rather than other knowledge
features and their representation are used to enhance several text
representation techniques. Most of the proposed models do not
processing applications such as Paraphrase identification [15]
consider the syntactical and morphological features of Arabic
and documents classification [16]. For the Arabic language, few
text, as well as semantic features, in the parsing process.
models were proposed to represent Arabic sentences
Moreover, translating other languages resources (such as
semantically.
English) into Arabic missed the unique features of the Arabic
AMR is proposed by Banarescu et al. [12] in order to language, which may affect the semantic representation of
represent sentences as semantic rooted directed graphs. The Arabic text.
proposed model is dedicated to the English language. PropBank
frames [17] were utilized to represent words and the semantic III. PROPOSED MODEL
relations between them. AMR is a manual semantic
representation model. The proposed model uses rooted Directed Acyclic Graphs
(DAGs) to represent sentences as a semantic graphs. The
A language-independent semantic representation model is vertices of the graph represent the words and the main concepts
proposed by Uchida et al. [18]. The proposed model, called (Person, Location, conjunctions, and Date\Time) in the
Universal Networking Language (UNL), facilitates the sentence. While the edges that connect vertices together
translation task of sentences in any language into other represent the semantic relation between words.
languages. UNL mainly used in machine translation methods
such as English-Arabic text translation [19]. Each word in the original sentence has been represented as a
vertex in the semantic graph, and each word has several
For the Arabic language, several semantic representation attributes or related words, such as root, synonym, and type
models are proposed to represent Arabic text using many (noun, verb or article). These related words have been
different representation structures. Graph-based semantic represented in the semantic graph as vertices and linked to the
representation model is proposed by Ismail et al [20]. Graph’s original word via labeled edges according to the word’s attribute
vertices represent text words and concepts. While the graph’s type. The proposed model includes four main groups of
edges represent the semantic relation between words. The relations:
proposed model consists of five main steps: Preprocessing, word
sense instantiation, concept validation, sentence ranking, and A. Verb Relations.
semantic graph generation.
In the Arabic language, each verb has several attributes:
Predicate logic is used to represent the Arabic text subject, one or more objects, tense, and occurrence frequency.
semantically. Haddad [11] proposed a logic-based These attributes have been represented in the semantic graph in
compositional model for semantic analysis of Arabic sentences. two different ways: 1) the attributes that connect two words will
Arabic syntactical constituents are the main components in the be represented by adding a new edge that connects their two
proposed model. The proposed model extends the concept of the vertices. For instance: subject or object attributes. 2) The
generalized language quantification to Generalized Arabic attributes that are related to the verb itself, and not connected to
Quantifiers (GAQ) that utilize lambda-calculus and type other words, such as tense relation. These attributes will be
theoretical analysis of Arabic structure. represented by adding a new vertex and connect it to the verb
vertex. For example, the sentence “‫( ”لعب الولد‬the boy played)
Lhioui et al [21] proposed a rule-based semantic frame
consists of two main vertices ( "‫“لعب‬and "‫( )"الولد‬Figure 1), the
representation for Arabic speech. The proposed model is used to
subject relation is represented by adding new edges between the
enhance human-machine Spoken Dialogue System (SDS). In
existing vertices. While the tense relation is represented by
which, the fairly constrained semantic space is limited. SDS
adding a new vertex and new edge to the graph.
requires online representation of speech during the dialogue that
makes the representation process more difficult. TIHR_ARC
corpus was used in the experiments in order to evaluate the
proposed model.
A FrameNet [22] frames for Quranic verbs was proposed by
Sharaf et al. [23]. The authors compared the semantic frames of
verbs in the Quran with verbs in English FrameNet. The frames
were used to build a lexical dataset of verb valences in the
Quran. The authors compared the semantic frames of verbs in
the Quran with verbs in English FrameNet.
Frame semantic representation was used for interlingual
machine translation applications. Lakhfif et al. [25] used frame Figure 1: Adding Verb Relation.
semantic representation for Arabic language machine
translation. The proposed model integrates WordNet [24] and

B. Noun Relations person name. Figure 4 illustrates an example of
Mainly, noun relations classified into two main groups: the representing the sentence “‫”سجل محمد صالح الھدف‬
relation between two nouns, and relation based on noun type. (Mohamad Salah scored the goal).
The first group is represented by adding a new edge to connect
the two nouns together, such as adjective (Adj), modifier (Mod.)
and identifier (Idf.). For example; the sentence “‫”الشمس مشرقة‬
(The sun is shining) consists of two nouns; “‫( ”الشمس‬The sun)
and its adjective noun “‫( ”مشرقة‬shining). Thus, a direct adjective
edge will be added to connect the two vertices. Figure 2
illustrates an example.

Figure 4: Adding Person Relation.


Figure 2: Adding Noun Relation.
3) Date\Time. Date\Time has four main attributes: start,
finish, duration and date which has one or more sub-
The second group of noun relations depends on the noun attributes (day, month, year … etc.). As well as location
type. Nouns can be categorized into three main types: Location, and person, date\time has been represented by creating a
Person and Date\Time. Each type has its own attributes and new concept vertex called (Time) and adding it to the
properties. These categories are represented as follow: semantic graph. After that, the date\time word in the
1) Location. Location noun has five main attributes: place, sentence has been connected to this concept vertex with
path, direction, source, and destination. In order to a new edge labeled with relation type. Figure 5 illustrates
represent location nouns, a new concept vertex called an example of representing the sentence “ ‫وقع الحادث صباح‬
(Location) has been created and added to the semantic ‫( ”الخميس‬The accident occurred Thursday morning).
graph. Then, the location noun has been connected to this
concept vertex with a new edge labeled with relation
type. For example: as illustrated in Figure 3, in order to
represent the sentence “‫( ”ذھب الرجل الى المنزل‬the man went
to the home), the verb “‫( ”ذھب‬went) is connected to
concept vertex (Location) via a new edge. Another new
edge connects the concept vertex with the original
location vertex with relation type (Destination).

Figure 5: Adding Date\Time Relation.

C. Conjunctions Relation.
Additional concept vertex has been used to represent
conjunctions relation such as “‫( ”و‬and) and “‫( ”أو‬or).
Figure 3: Adding Location Relation. Furthermore, the conjunction options have been represented as
relation edges that connect the concept conjunction vertex with
the original word vertices. For instance, the representation of the
sentence “‫( ”اشترى الطالب قلم ودفتر‬The student bought a pen and a
2) Person. In the Arabic language, person names can be
notebook) is illustrated in Figure 6.
mentioned in the original text in many different forms.
They may consist of one or more phrases (e.g. "‫"طه حسين‬
or as a noun phrase (e.g. “‫)”مخترع الذرة‬. Thus, a new
concept vertex called (Person) has been created and
added to the semantic graph in order to represent the

vertex. Additional concept vertex is added to represent the
location.

Figure 6: Adding Conjunction Relations.

Figure 8: semantic graph to represent three different


D. Questions (Interrogatives) Relation. sentences.
Another concept vertex has been added in order to represent
the unknown object. For example: in order to represent the
question “‫( ”من كتب المقال؟‬who wrote the article), a new vertex
called “unknown” is created and attached to the verb vertex (the
subject that we ask about), as illustrated in Figure 7.

Figure 7: Adding Question Relation.
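To make the construction described in this section concrete, the following is a rough sketch, using the networkx library, of how word vertices, concept vertices, and labeled relation edges could be assembled for the example of Figure 1 ("لعب الولد", the boy played). The attribute and relation labels are only indicative of the scheme discussed above, not an exact specification from the paper.

    import networkx as nx

    # Words become vertices; verb attributes either link two word vertices
    # (subject) or add a new attribute/concept vertex (tense).
    g = nx.DiGraph()

    g.add_node("لعب", kind="verb")       # the verb "played"
    g.add_node("الولد", kind="noun")     # the noun "the boy"
    g.add_node("Past", kind="concept")   # tense represented by an added vertex

    g.add_edge("لعب", "الولد", relation="Subject")  # subject relation between two words
    g.add_edge("لعب", "Past", relation="Tense")     # tense relation via a new vertex

    # A question such as "من كتب المقال؟" would attach an "Unknown" concept
    # vertex to the verb in the same way (cf. Figure 7).
    for u, v, data in g.edges(data=True):
        print(f"{u} --{data['relation']}--> {v}")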

In the proposed model, different sentences are represented


by the same semantic graph if they share the same semantic
meaning. This is due to the fact that the order of the words in the
sentence has no impact on its semantic representation. For
example; the semantic illustrated in Figure 8 represents the
following sentences: Figure 9: semantic representation of the sentence: “ ‫اغتيال‬
‫( ” ”مدير ھيئة العدالة والمسائلة في بغداد‬The assassination of the
• “‫( ”ذھب الولد الى المدرسة في الصباح‬The boy went to school director of Justice and Accountability Commission in
in the morning). Baghdad).
• "‫ ذھب الولد الى المدرسة‬، ‫( "في الصباح‬In the morning, the boy
went to school). The second example, illustrated in Figure 10, represents the
• "ً ‫( "الى المدرسة ذھب الولد صباحا‬To school, the boy went in sentence “ ‫قرر الجيش األمريكي خفض عدد قواته في الباكستان خالل العام‬
the morning). ‫( ”المقبل‬The US military has decided to reduce the number of its
troops in Pakistan during next year). In this example, location
and date/time concept vertices are added to the semantic graph.
IV. EXAMPLES
The sentence contains a verb and its attributes (subject and
In this section, different examples and test cases illustrated object).
and discussed. The first example is representing the sentence:
“‫( ”اغتيال مدير ھيئة العدالة والمسائلة في بغداد‬The assassination of the V. CHALLENGES
director of Justice and Accountability Commission in Baghdad).
This statement has no verbs. It contains conjunction word “‫”و‬ Although the proposed model could be used to represent
(and) and location noun “‫( ”بغداد‬Baghdad). As illustrated in most of the Arabic language statements, it has some limitations
Figure 9, each word in the sentence is represented as a separated and challenges that may affect the outputted semantic graph. The
Arabic language is a sophisticated language in terms of

structure, and it has many challenging features. In the Arabic VI. CONCLUSION AND FUTURE WORK
language, many different types of ambiguities affects the This paper proposed a model for Arabic text semantic
understanding of sentences meaning, for instance, the same representation. The proposed model represents text components
word could be used for either location or time, such as the word (words) and the semantic relation between them as a rooted
“‫ ”مشرق‬that could be used as location noun (e.g. “ ‫سافرت باتجاه‬ acyclic graph. The proposed model is dedicated to the Arabic
‫( ”المغرب‬I traveled towards the west)), or it could be used as language. It considers Arabic language features and challenges.
daytime noun (e.g. "‫( " عدت الى المنزل بعد مغرب الشمس‬I went home The vertices in the proposed semantic graph consist of original
after sunset). words in addition to the main concepts. Main concepts include
location, person and date time. The proposed model could be
used to represent different types of Arabic sentences including
questions and conjunctions.
In our ongoing research, we are going to utilize the proposed
model to enhance different Arabic NLP application such as
textual entailment and question answering. Furthermore, a new
dataset that contains a collection of pre-generated graphs could
be established and produced.

REFERENCES
[1] P. J. Hayes, “Some Problems and Non-problems in
Representation Theory,” in Proceedings of the 1st
Summer Conference on Artificial Intelligence and
Simulation of Behaviour, Amsterdam, The Netherlands,
The Netherlands, 1974, pp. 63–79.
[2] A. Ali and M. A. Khan, “Selecting predicate logic for
knowledge representation by comparative study of
knowledge representation schemes,” in 2009
Figure 10: semantic representation of the sentence: “ ‫قرر‬ International Conference on Emerging Technologies,
‫( ”الجيش األمريكي خفض عدد قواته في الباكستان خالل العام المقبل‬The US 2009, pp. 23–28.
military has decided to reduce the number of its troops in [3] A. Ali and M. A. Khan, “Knowledge representation of
Pakistan during next year).
Urdu text using predicate logic,” in 2010 6th
International Conference on Emerging Technologies
(ICET), 2010, pp. 293–298.
Another challenge that may affect the quality of the semantic
graph is Name Entity Recognition (NER). The lack of capital [4] M. Minsky, “A Framework for Representing
letters in the Arabic language makes the task of NER a Knowledge,” Massachusetts Institute of Technology,
challenging task. Furthermore, names in Arabic are derived from Cambridge, MA, USA, 1974.
adjectives. For example, the word “‫ ”كريم‬can be used as a named [5] M. R. Quillian, “Semantic Networks,” in Semantic
entity (person name) or an adjective which means (generous). Information Processing, M. L. Minsky, Ed. MIT Press,
1968.
Several Arabic text processing toolkits were proposed for the [6] M. A. Tayal, M. M. Raghuwanshi, and L. G. Malik,
Arabic language in order to perform specific text processing “Semantic Representation for Natural Languages,” Int.
tasks, such as POS tagging, segmentation, dependency parsing, Refereed J. Eng. Sci. IRJES, vol. 4, no. 10, pp. 01–07,
and others. The quality of the used toolkit affects the semantic
Oct. 2015.
representation of the Arabic text.
[7] Y. Wilks and D. Fass, “The preference semantics
In order to overcome the Arabic text semantic representation family,” Comput. Math. Appl., vol. 23, no. 2, pp. 205–
challenges, further preprocessing tasks should be conducted 221, 1992.
with further analysis of the Arabic text. More understanding of [8] P. Liang, “Learning Executable Semantic Parsers for
the morphological and syntactical features of the Arabic Natural Language Understanding,” Commun ACM, vol.
language yield to better semantic representation. On the other 59, no. 9, pp. 68–76, Aug. 2016.
hand, using high quality resources that are dedicated for Arabic [9] P. Liang and C. Potts, “Bringing Machine Learning and
language is more useful than using translated resources from Compositional Semantics Together,” Annu. Rev.
other languages, since Arabic resources considers Arabic Linguist., vol. 1, no. 1, pp. 355–376, 2015.
features and challenges during the processing.
[10] D. Jurafsky and J. H. Martin, Speech and language
processing, vol. 3. Pearson London, 2014.
[11] B. Haddad, “Semantic Representation of Arabic: a
Logical Approach towards Compositionality and

Generalized Arabic Quantifiers,” Int J Comput Proc communication, understanding, and collaboration,”
Orient. Lang, vol. 20, pp. 37–52, 2007. Tokyo UNUIASUNL Cent., 1996.
[12] L. Banarescu et al., “Abstract Meaning Representation [19] S. Alansary, M. Nagi, and N. Adly, “The universal
for Sembanking,” in Proceedings of the 7th Linguistic networking language in action in English-Arabic
Annotation Workshop and Interoperability with machine translation,” in Proceedings of 9th Egyptian
Discourse, Sofia, Bulgaria, 2013, pp. 178–186. Society of Language Engineering Conference on
[13] J. Bos, V. Basile, K. Evang, N. J. Venhuizen, and J. Language Engineering,(ESOLEC 2009), 2009, pp. 23–
Bjerva, “The groningen meaning bank,” in Handbook of 24.
linguistic annotation, Springer, 2017, pp. 463–496. [20] S. S. Ismail, M. Aref, and I. F. Moawad, “Rich semantic
[14] O. Abend and A. Rappoport, “Universal Conceptual graph: A new semantic text representation approach for
Cognitive Annotation (UCCA),” in Proceedings of the arabic language,” in 7th WSEAS European Computing
51st Annual Meeting of the Association for Conference (ECC ‘13), 2013.
Computational Linguistics (Volume 1: Long Papers), [21] C. Lhioui, A. Zouaghi, and M. Zrigui, “A Rule-based
Sofia, Bulgaria, 2013, pp. 228–238. Semantic Frame Annotation of Arabic Speech Turns for
[15] M. AL-Smadi, Z. Jaradat, M. AL-Ayyoub, and Y. Automatic Dialogue Analysis,” Procedia Comput. Sci.,
Jararweh, “Paraphrase identification and semantic text vol. 117, pp. 46–54, 2017.
similarity analysis in Arabic news tweets using lexical, [22] C. F. Baker, C. J. Fillmore, and J. B. Lowe, “The
syntactic, and semantic features,” Inf. Process. Manag., Berkeley FrameNet Project,” in Proceedings of the 36th
vol. 53, no. 3, pp. 640–652, May 2017. annual meeting on Association for Computational
[16] Z. Kastrati, A. S. Imran, and S. Y. Yayilgan, “The impact Linguistics -, 1998.
of deep learning on document classification using [23] A. Sharaf and E. Atwell, “Knowledge representation of
semantically rich representations,” Inf. Process. Manag., the Quran through frame semantics: A corpus-based
vol. 56, no. 5, pp. 1618–1632, Sep. 2019. approach,” Corpus Linguist.-2009, p. 12, 2009.
[17] M. Palmer, D. Gildea, and P. Kingsbury, “The [24] G. A. Miller, “WordNet: A Lexical Database for
Proposition Bank: An Annotated Corpus of Semantic English,” Commun ACM, vol. 38, no. 11, pp. 39–41, Nov.
Roles,” Comput. Linguist., vol. 31, no. 1, pp. 71–106, 1995.
2005.
[18] H. Uchida, M. Zhu, and T. Della Senta, “Unl: Universal
networking language–an electronic language for

Sentiment Analysis for Arabic Language using
Attention-Based Simple Recurrent Unit
Saja Al-Dabet Sara Tedmori
Department of Computer Science Department of Computer Science
Princess Sumaya University for Technology Princess Sumaya University for Technology
Amman, Jordan Amman, Jordan
saja.aldabet@yahoo.com s.tedmori@psut.edu.jo

Abstract— With the growing number of people who express used in Quran, Modern Standard Arabic (MSA) which is
their opinions on the web, Sentiment Analysis have become an derived from the classical form and used for the formal
active research field that aims to analyze and classify the writing and speaking, and colloquial Arabic which is a
sentiment polarity of opinionated reviews. Recently, Deep regional delicate that used for the informal speaking and
Learning models have been extensively used for many Natural
varies by the region [2].
Language Processing tasks including Sentiment Analysis. In this
paper, the authors propose a Deep Learning model for Arabic
language sentence-level Sentiment Analysis. The proposed There are three main approaches to SA: (1) lexical based
model represents an integration between an emerged variant of approaches, (2) Machine Learning (ML) based approaches,
Recurrent Neural Networks known as Simple Recurrent Unit and (3) hybrid approaches. Lexical based approaches are
which is characterized by its light recurrent computations, and dependent on external lexicons that are used to uncover the
an attention mechanism that concentrates more on the sentiment polarity. In ML based approaches, supervised
important parts of an input text. The Simple Recurrent Unit learning techniques are applied. Lastly, hybrid approaches
model allows parallel recurrent calculations that lead to
integrate both lexical and ML based approaches [3]. Deep
enhance the training process in terms of time and accuracy.
Experiments were performed to evaluate the performance of the
Learning (DL), a subfield of ML, has demonstrated its power
proposed model using the Large Scale Arabic Book Reviews and success in a variety of fields including Natural Language
(LABR) dataset. The proposed model obtained state of the art Processing (NLP). Recurrent Neural Networks (RNN)
results compared to other Deep Learning models where it including its variants, such as Gated Recurrent Unit (GRU)
achieved 94.53% in terms of accuracy measure with faster [4] and Long-Short Term Memory (LSTM) [5], are capable
execution time. of dealing with a large number of sequence modeling tasks
such as language understanding [6], [7], opinion mining [8],
Keywords—Sentiment Analysis, Deep Learning, Natural and Question Answering (Q&A) [9]. However, the RNN
Language Processing, Recurrent Neural Networks, Simple
models are limited by the timestep dependency as the
Recurrent Unit, Attention Mechanism.
calculation of each timestep is dependent on the completion
I. INTRODUCTION of the previous one which restricts the processing of long
sequences especially in deep models. This dependency
With the rise of web 2.0 services, people around the globe
makes the operations slower and less scalable than other DL
have become more willing to express their opinions and share
models like Convolutional Neural Networks (CNN) which
them with others using different platforms such as e-
allows parallel computations [10], [11]. Simple Recurrent
commerce websites, blogs, social media websites, and many
Unit (SRU) model was proposed as a light recurrent model
others. Such opinions can be exploited by various
designed to have a parallelism feature with careful parameters
applications like sales prediction, reputation evaluation, and
initialization property. Moreover, SRU model utilizes
intention analysis. In recent years, and as published opinions
highway connections which improve the training process
continue to play a vital role in customers’ purchase decisions,
even within a model of multiple layers. SRU model have been
there has been a steady increase in interest in the field of
applied in Q&A, machine translation and different text
Sentiment Analysis (SA) and its applications. SA is a field of
classification tasks with comparable results and speed [11].
study that aims to analyze and classify people emotions,
evaluations, or opinions as positive, negative, or neutral. SA
is divided into three levels; aspect-level, sentence-level, and The advances of DL have reshaped the NLP research. DL
document-level. Aspect-level aims to classify the sentiment models have been integrated with attention mechanisms for
polarity for different aspects by considering the discussed different tasks which help the model to automatically
entities. The two latter levels, however are more general concentrate more on the important words in a sentence. In
levels that treat the sentence or the document as it expresses SA, words in a sentence do not contribute equally in
a sentiment about specific entity without considering the classifying the sentiment polarity. Consider the following
discussed aspects of each entity[1]. sentence, “The story is full of suspense, worth to read it”.
Only the words “suspense” and “worth” play an important
role in determining the sentiment polarity of this sentence
Although the majority of SA research efforts target
which is positive sentiment. The advantage of using attention
languages such as English, SA for the Arabic language has
has been widely conducted for aspect-level SA [12]–[14].
gained special attention in the last few years. The Arabic
language is spoken by millions of people in the Arab world.
Arabic comes in three main forms; classical Arabic which is In this paper, the authors aim to investigate the use of an
SRU model with the attention mechanism (Att-SRU model)



in SA for the Arabic language. The targeted level of SA is the a structure-parser was added to create the syntactic parse tree
sentence-level. The rest of paper is organized as follows: structure which is used to specify the model recursion order.
section II presents the related work using a variety of DL It was found that the proposed recursive model outclassed the
models. Section III describes the methodology. Section IV RAE model and other Machine Learning (ML) models
presents a discussion for the obtained results by the proposed including Naïve Bayes (NB) and Support Vector Machine
model. Finally, section V draws the conclusion. (SVM) using the same evaluation datasets.
II. RELATED WORK The authors of [24] applied Recursive Neural Tensor
As the focus of this research is Arabic SA using DL Networks (RNTN) using different morphological abstraction
techniques, this section gives a review of the existing works levels, starting from the word level to the root level. To use
in this field. The authors of [15] focused on building Arabic that model, there was a need for a sentiment treebank which
word embeddings using different architectures based on a includes annotated syntactic parse trees for sentiments at
web-crawled corpus. The generated embeddings were used as multiple consistency levels. Therefore, an Arabic Sentiment
an input for a Convolutional Neural Networks (CNN) model Treebank (ArSent TB) was created. The experimental results
that classifies the sentiment polarities. The proposed showed that RNTN model achieved superior results and
architecture was evaluated using different Arabic datasets outclassed many classifiers such as RAE, LSTM, and SVM.
including LABR dataset [16] which is used in this research. The performance analysis revealed that the model’s
The experimental results showed an enhanced performance performance improved while using the morphological
of the CNN model while using pre-trained word embeddings. abstraction especially at stem-level.

The authors of [17] examined several DL architectures In [25], several experiments were conducted using
based on LSTM and CNN models for Arabic SA such as different ML models such as SVM, NB, Logistic Regression
simple LSTM, simple CNN, a combination of LSTM and (LR) which were trained using Term Frequency Inverse
CNN, stacked LSTM, and a combination of two LSTM Document Frequency (TF-IDF), Unigrams, and Bigrams
models with different dropout probabilities and combination [26]. Moreover, the DL models were also examined including
methods. The evaluation of these models was based on two DNN and CNN. Those models were trained using words’
publicly available Twitter datasets; Arabic Sentiment Twitter frequency and word2vec [27] respectively. The authors
Dataset (ASTD) [18] and Arabic Twitter (ArTwitter) dataset introduced a health dataset collected from Twitter data. The
[19]. The experiments showed promising results of the best results were obtained when using the SVM model.
combined LSTM model which outperformed all the tested
models. In [28], the authors released a pre-trained word
embeddings based on a large Arabic corpus. The generated
In another study [20], the authors exploited the advantage embeddings were built using the architectures of word2vec
of combining CNN and LSTM models. To consider the model; Continues Bag-of-Words (CBOW) and skip-gram. A
morphological diversity of the Arabic language, different set of classifiers were used in the experiments: Linear SVM,
levels of SA were explored; character-level, word-level, and Random Forest, and Logistic Regression. The classifiers
character n-grams level. Both word-level and character n- were trained using the generated embeddings. The
grams level shown better results than the word level where experiments showed that utilizing word embeddings can
the used dataset was based on Twitter data and the character- slightly enhance performance. The Logistic Regression and
level increased the number of extracted features without any the Linear SVM classifiers outperformed the other classifiers.
beneficial effect.
In [29], the authors applied a lexicon-based fuzzy
The work presented in [21] explored the effect of utilizing approach. Their model is composed of two stages: in the first
various DL models for Arabic SA. The authors investigated stage weights were assigned to the entered text, and the
four models; Deep Neural Networks (DNN), Deep Belief second one was to apply fuzzy logic operations to classify the
Networks (DBN), Recursive Auto Encoder (RAE), and a sentiment polarity. The lexicon-based fuzzy approach was
combination of DBN and Deep Auto-Encoder (DAE) compared with a lexicon-based approach and achieved better
models. To train the first three models, lexicon-based features results.
were used based on ArSent lexicon [22]. However, the last
model was trained using the indices of raw words. According III. METHODOLOGY
to the reported results, the RAE model achieved the best The architecture of the proposed model is shown in figure
performance over the investigated models where the parsing 1, which consists of four main modules: the input module,
order and context-semantic were considered. Although the SRU module, attention module, and output module. Given a
RAE model obtained the best results, it suffers from the set of sentences which represent book reviews from the
limited capability of generalizing semantics and modeling LABR dataset, the proposed model aims to classify the
morphological interaction between morphemes. sentiment polarities of those sentences as positive or negative
sentiments.
The same group of authors extended the work in [21] and
developed a Recursive DL Model for Opinion Mining in
Arabic (AROMA) [23]. This model handled the limitations
of RAE by adding a morphological tokenization procedure
followed by sentiment and semantic embeddings. Moreover,
III. METHODOLOGY

The architecture of the proposed model is shown in Fig. 1. It consists of four main modules: the input module, the SRU module, the attention module, and the output module. Given a set of sentences representing book reviews from the LABR dataset, the proposed model aims to classify the sentiment polarity of each sentence as positive or negative.

Fig. 1. The architecture of the proposed Att-SRU model.

A. Input Module

The input sentences are treated as a sequence of words w_1, …, w_T. These words are represented in the distributional space as vectors x_i ∈ ℝ^d and stored in a lookup embeddings table E ∈ ℝ^{d×|V|}, where d refers to the vectors' dimension and |V| refers to the vocabulary size. Words in the lookup table are transformed into indices, and each index is associated with a vector representation. The main idea behind word embeddings is that similar words tend to occur in similar contexts and hence can be used to infer the relationships between words [30], [31]. The lookup table is constructed using a set of unsupervised pre-trained embeddings which are trained on a massive amount of data. The utilized vectors are built using the CBOW model, which is trained to use the surrounding context words to predict the center word. The vectors are built on a word-level representation and created based on a Wikipedia dataset [32]. The output of this module is the retrieved vector representation per word, x_1, …, x_T.

B. Simple Recurrent Unit (SRU) Module

After receiving the input vectors, this module aims to extract the input's sequential features. The proposed model utilizes a gated network similar to LSTM and GRU but with a parallelism scheme. The architecture of SRU includes two main components: light recurrence operations and highway operations. The light recurrence operations involve reading the input vectors and extracting the sequential features by computing the cell state c_t. This could be achieved using the following equations:

f_t = σ(W_f x_t + v_f ⊙ c_{t−1} + b_f)   (1)

c_t = f_t ⊙ c_{t−1} + (1 − f_t) ⊙ (W x_t)   (2)

where f_t represents a forget gate that controls the flow of information and c_t represents the cell state, which is calculated as the adaptive average of the previous cell state c_{t−1} and the current input (W x_t) with respect to f_t. W and W_f refer to parameter matrices, and v_f and b_f refer to vectors learned during the training process. The way of using the previous cell state c_{t−1} makes a substantial difference between the SRU and other recurrent models, where a point-wise multiplication operation ⊙ is used rather than matrix multiplication. This means that the current state does not have to wait for the full completion of the previous state c_{t−1}. In this way, the dimensions of the state vector become independent of each other, which facilitates the parallelization process.

In order to make gradient-based training easier, a highway component [33] is used in SRU. A reset gate r_t is utilized, which integrates the state c_t produced by the light recurrence with the current input x_t. In the case of stacking multiple layers of this model, the (1 − r_t) ⊙ x_t term in (4) supplies the connections that permit the gradient to flow through the layers. The following equations illustrate the process:

r_t = σ(W_r x_t + v_r ⊙ c_{t−1} + b_r)   (3)

h_t = r_t ⊙ c_t + (1 − r_t) ⊙ x_t   (4)

The description of the used parameters is as mentioned before [11]. The extracted sequential features of this module are passed to the next one.

C. Attention Module

The purpose of this module is to automatically emphasize the important words which could have an impact on the sentiment classification decision. Therefore, this module receives the generated outputs of the former module and tries to provide additional information about the input in order to empower the model. Let H = {h_1, h_2, …, h_T} be the set of SRU output vectors, one per word. The attention mechanism extracts a sentence representation for the input as follows:

u_t = tanh(W_w h_t + b_w)   (5)

α_t = exp(u_t^T u_w) / Σ_t exp(u_t^T u_w)   (6)

s = Σ_t α_t h_t   (7)

where the hidden vectors H are fed into a single-layer multilayer perceptron (MLP) to calculate the hidden representation u_t for each hidden vector h_t. Afterward, the significance of each word is calculated as the similarity between the generated vector u_t and a trainable vector u_w; the output is normalized using a softmax function to form the attention weight α_t. Finally, the sentence representation s is produced as the weighted sum of the attention weights and the words' hidden vectors [34], [35]. This representation is passed to the output module to classify the sentiment of the given sentence.

D. Output Module

After receiving the final representation from the attention module, a sigmoid layer is utilized in order to classify the sentiment of each sentence. The model is trained in a supervised manner by minimizing the cross-entropy error between the classified sentiment polarities and the actual sentiment polarities. In addition, an L2 regularization term is used to alleviate model overfitting [36]. The loss function is defined as follows:

J(θ) = − Σ_{(x,y)∈D} Σ_{c∈C} y_c log ŷ_c(x; θ) + λ‖θ‖²   (8)
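The sketch below is a minimal NumPy forward pass of a single SRU layer following Eqs. (1)–(4); it is not the authors' TensorFlow implementation, and it assumes the input and hidden dimensions are equal so that the highway term (1 − r_t) ⊙ x_t is well defined.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sru_forward(X, W, Wf, Wr, vf, vr, bf, br):
    """SRU layer over a sequence X of shape (T, d); weight matrices are (d, d)."""
    T, d = X.shape
    # The matrix products involve no recurrence, so they can be computed for all
    # time steps at once; only the light recurrence below is sequential.
    U, F, R = X @ W, X @ Wf, X @ Wr
    c = np.zeros(d)
    H = np.zeros((T, d))
    for t in range(T):
        c_prev = c
        f = sigmoid(F[t] + vf * c_prev + bf)      # forget gate, Eq. (1)
        c = f * c_prev + (1.0 - f) * U[t]          # light recurrence, Eq. (2)
        r = sigmoid(R[t] + vr * c_prev + br)       # reset gate, Eq. (3)
        H[t] = r * c + (1.0 - r) * X[t]            # highway output, Eq. (4)
    return H
```

Because the gates use only element-wise products with the previous cell state, every dimension of the state can be updated independently, which is the property that enables the parallel/fast training discussed above.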

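The attention pooling of Eqs. (5)–(7) can be written as the following short NumPy sketch; W_w, b_w and u_w stand for the trainable projection, bias and context vector, and the shapes are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attention_pool(H, Ww, bw, uw):
    """H: SRU outputs (T, d); Ww: (d, a); bw: (a,); uw: (a,)."""
    U = np.tanh(H @ Ww + bw)       # hidden representation u_t, Eq. (5)
    alpha = softmax(U @ uw)        # attention weights alpha_t, Eq. (6)
    s = alpha @ H                  # weighted-sum sentence vector s, Eq. (7)
    return s, alpha
```

The returned weights alpha are the per-word scores that are later visualized in Fig. 4.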
where D refers to the training dataset, C refers to the sentiment polarity classes, y ∈ ℝ^{|C|} refers to the sentiment class, which is represented as a one-hot vector where 1 marks the true class and 0 the false class, ŷ(x; θ) refers to the estimated sentiment distribution, and λ refers to the regularization weight.

IV. EXPERIMENTS AND RESULTS

To evaluate the performance of the proposed model, the experiments detailed in this section were conducted.

A. Data

The LABR dataset [16] is a book reviews dataset composed of 63000 reviews collected from the Goodreads website. The reviews are annotated with a rating from 1 to 5 stars. The utilized version of the dataset is a binary version with two classes: a positive class for 4-5 star ratings and a negative class for 1-2 star ratings. The dataset consists of 42832 positive reviews and 8224 negative reviews.

In order to prepare the dataset, each sentence in the dataset was tokenized into a sequence of words. Thereafter, a pre-trained vector for each word was retrieved. The adopted pre-trained vectors were built on the word-level representation trained on a Wikipedia dataset with 300 dimensions.

B. Experimental Setup

The experiments were conducted on a Windows 10 machine with a 64-bit operating system, 16 GB RAM, and an Intel(R) Core(TM) i7 CPU. The development environment was Python 3.6, and the implementation used the TensorFlow open-source machine learning library [37]. For the SRU module, the number of hidden cells was 100. The model was trained with 15 epochs, a learning rate of 0.001, L2 of 0.001 for weight regularization, a dropout probability of 0.7, a batch size of 128, the Adaptive Moment Estimation (Adam) optimizer [38] for stochastic weight optimization, and sigmoid cross entropy as the loss function.

C. Evaluation Measure

In order to evaluate the proposed model, the accuracy measure was used. The accuracy measure is defined as the number of correctly classified sentiment polarities divided by the total number of sentiment polarities. This measure is calculated as follows:

Accuracy = (TP + TN) / (TP + TN + FP + FN)   (9)

where TP is the number of relevant sentiment polarities that are correctly classified, TN is the number of irrelevant sentiment polarities that are correctly classified, FP is the number of irrelevant sentiment polarities that are incorrectly classified, and FN is the number of relevant sentiment polarities that are incorrectly classified.

D. Results and Discussion

Table 1 presents the experimental results of the proposed model in comparison to other Arabic models from the literature. The Arabic models can be summarized as follows:

- CNN: a CNN model was trained based on word embeddings generated using the word2vec model. Convolutional filters with different sizes were used for the convolutional operations. To down-sample the extracted features, a max-over-time pooling operation was used. Finally, a sigmoid function was used for the classification purpose [15].
- Baseline: a linear SVM classifier was trained using N-grams and TF-IDF features [16].
- Random Forest, Linear SVM, Logistic Regression: these classifiers were trained based on word embeddings generated from a large Arabic corpus. The generated word embeddings were trained using the word2vec model. The classifiers were used with the default parameter configurations [28].
- Fuzzy Logic: the learning process was divided into two phases: a data pre-processing and feature extraction phase, and a fuzzy control system phase. The model was trained based on lexicon-based features [29].

TABLE 1. Experimental Results

Model                      Accuracy   Time/Min
CNN [15]                   89.6%      -
Baseline [16]              75.1%      -
Random Forest [28]         80.05%     -
Linear SVM [28]            81.27%     -
Logistic Regression [28]   81.88%     -
Fuzzy Logic [29]           80.59%     -
GRU                        90.62%     114
SRU                        92.96%     34
GRU + Attention            93.75%     129
SRU + Attention            94.53%     40

The proposed model outperformed the baseline and all the previous models and achieved the best results. The baseline model [16] obtained the worst result, since SVM requires more comprehensive feature engineering to achieve better performance. The utilization of word embeddings in [28] slightly improved the performance of the SVM, Random Forest, and Logistic Regression classifiers, as word embeddings capture semantic features of the text. These features can be very helpful for the handled task. The Fuzzy Logic model [29] achieved a result comparable to the other models, as it was based on lexicon information. The CNN model obtained the best result among the models reported in the literature. The effectiveness of using CNN models for text classification tasks has been studied by many researchers: CNN is characterized by its hierarchical architecture that is capable of extracting local-invariant features, which help in text modeling tasks. However, despite the CNN capability of extracting local features, recurrent models, which are characterized by their sequential architecture, still achieve better results in text classification tasks. This can explain the noteworthy results obtained by the proposed model against the previous models.
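For reference, the training configuration described in Section IV-B could be expressed roughly as the Keras sketch below. The SRU layer is not a stock Keras layer, so a GRU stand-in is used here; the vocabulary and embedding sizes are assumptions, and the dropout probability of 0.7 is interpreted as the Keras drop rate.

```python
import tensorflow as tf

VOCAB_SIZE, EMB_DIM, HIDDEN = 10000, 300, 100   # assumed sizes for this sketch

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, EMB_DIM, mask_zero=True),
    tf.keras.layers.GRU(HIDDEN, dropout=0.7,                       # stand-in for the SRU layer
                        kernel_regularizer=tf.keras.regularizers.l2(0.001)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss="binary_crossentropy",      # sigmoid cross entropy
              metrics=["accuracy"])
# model.fit(x_train, y_train, batch_size=128, epochs=15)
```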

Our experiments aimed to examine the effect of (1) using the SRU model for the SA task and (2) integrating the attention mechanism with the SRU model for the SA task. To evaluate the impact of using the SRU model against other recurrent models, a Gated Recurrent Unit (GRU) was also applied using the same hyperparameter configuration. The reported results in Table 1 show that using SRU cells for such a task leads to better performance in terms of accuracy and time. The simplified design of SRU cells helps improve the training process, where the used element-wise multiplication makes training easier and a good candidate for obtaining better results. Figures 2 and 3 show the training accuracy of the SRU model and the GRU model, respectively. It can be noticed that training the SRU model is more robust compared with the GRU model: the SRU does not suffer while training, and the accuracy improves smoothly over the steps. This demonstrates that SRU cells are practically simpler to train than GRU cells. Moreover, the probability of overfitting with the SRU is lower than that of a GRU model. The reason behind that is the imposed constraints on the recurrent weights, which prevent extensive correlations over the same layer. Furthermore, the parallelization scheme allowed the SRU model to be much faster than the GRU model because of the light recurrence, as each state does not have to wait for the previous one to finish.

Fig. 2. The accuracy of the SRU model over training steps, where the x-axis is the number of steps and the y-axis is the accuracy percentage.

Fig. 3. The accuracy of the GRU model over training steps, where the x-axis is the number of steps and the y-axis is the accuracy percentage.

Integrating the attention mechanism has a beneficial impact on both the SRU and GRU models, as it improved the achieved results. The best accuracy was achieved using the SRU model with the attention mechanism. Although the attention computations slowed down the SRU training, the model is still much faster than the GRU. The obtained results verified the advantage of using such a mechanism for the SA task. This could be explained by the attention mechanism's capability to focus on the important words in the sentence rather than giving the same level of attention to all words. Figure 4 shows a visualization example of attention weights. It can be noticed that the positive words "اكثر" and "رائعة" have higher attention weights than the rest of the words in the sentence. This helps the model to focus more on these words and take the final classification decision, which is positive in this example.

Fig. 4. A visualization example of attention weights. Dark red refers to high attention weights and the lighter shade refers to lower weights.

V. CONCLUSION AND FUTURE WORK

In this paper, the authors have proposed a DL model to tackle sentence-level SA for the Arabic language. The proposed model investigated the utilization of a variant of the recurrent model called the Simple Recurrent Unit (SRU), which permits parallel recurrent computations, and the integration of such a model with the attention mechanism. The proposed model outperformed the Gated Recurrent Unit (GRU) and obtained competitive results compared to other DL models. The obtained results were better in terms of both time and the accuracy measure.

In future experiments, the authors plan to compare the SRU model with other DL models using different datasets. Moreover, different types of word embeddings could be implemented for the SA task. The authors also plan to investigate other attention mechanisms for Arabic SA, such as self-attention.

REFERENCES
[1] P. D. Turney, “Thumbs up or thumbs down?: semantic orientation applied to unsupervised classification of reviews,” in Proceedings of the 40th annual meeting on association for computational linguistics, 2002, pp. 417–424.
[2] A. Farghaly and K. Shaalan, “Arabic natural language processing: Challenges and solutions,” ACM Trans. Asian Lang. Inf. Process., vol. 8, no. 4, p. 14, 2009.
[3] M. Biltawi, W. Etaiwi, S. Tedmori, A. Hudaib, and A. Awajan, “Sentiment classification techniques for Arabic language: A survey,” in 2016 7th International Conference on Information and Communication Systems (ICICS), 2016, pp. 339–346.
[4] K. Cho et al., “Learning phrase representations using RNN encoder-decoder for statistical machine translation,” arXiv Prepr. arXiv1406.1078, 2014.
[5] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Comput., vol. 9, no. 8, pp. 1735–1780, 1997.
[6] A. Suhr, S. Iyer, and Y. Artzi, “Learning to map context-dependent sentences to executable formal queries,” arXiv Prepr. arXiv1804.06868, 2018.
[7] A. Suhr and Y. Artzi, “Situated mapping of sequential instructions to actions with single-step reward observation,” arXiv Prepr. arXiv1805.10209, 2018.
[8] O. Irsoy and C. Cardie, “Opinion mining with deep recurrent neural networks,” in Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), 2014, pp. 720–728.

[9] A. Kumar et al., “Ask me anything: Dynamic memory networks mining in arabic as a low resource language,” ACM Trans. Asian
for natural language processing,” in International Conference on Low-Resource Lang. Inf. Process., vol. 16, no. 4, p. 25, 2017.
Machine Learning, 2016, pp. 1378–1387. [24] R. Baly, H. Hajj, N. Habash, K. B. Shaban, and W. El-Hajj, “A
[10] J. Bradbury, S. Merity, C. Xiong, and R. Socher, “Quasi-recurrent sentiment treebank and morphologically enriched recursive deep
neural networks,” arXiv Prepr. arXiv1611.01576, 2016. models for effective sentiment analysis in arabic,” ACM Trans.
[11] T. Lei, Y. Zhang, S. I. Wang, H. Dai, and Y. Artzi, “Simple Asian Low-Resource Lang. Inf. Process., vol. 16, no. 4, p. 23,
recurrent units for highly parallelizable recurrence,” in 2017.
Proceedings of the 2018 Conference on Empirical Methods in [25] A. M. Alayba, V. Palade, M. England, and R. Iqbal, “Arabic
Natural Language Processing, 2018, pp. 4470–4481. language sentiment analysis on health services,” in 2017 1st
[12] J. Liu and Y. Zhang, “Attention modeling for targeted sentiment,” International Workshop on Arabic Script Analysis and
in Proceedings of the 15th Conference of the European Chapter of Recognition (ASAR), 2017, pp. 114–118.
the Association for Computational Linguistics: Volume 2, Short [26] C. Manning, P. Raghavan, and H. Schütze, “Introduction to
Papers, 2017, vol. 2, pp. 572–577. information retrieval,” Nat. Lang. Eng., vol. 16, no. 1, pp. 100–
[13] M. Yang, W. Tu, J. Wang, F. Xu, and X. Chen, “Attention Based 103, 2010.
LSTM for Target Dependent Sentiment Classification.,” in AAAI, [27] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean,
2017, pp. 5013–5014. “Distributed representations of words and phrases and their
[14] D. Ma, S. Li, X. Zhang, and H. Wang, “Interactive attention compositionality,” in Advances in neural information processing
networks for aspect-level sentiment classification,” arXiv Prepr. systems, 2013, pp. 3111–3119.
arXiv1709.00893, 2017. [28] A. A. Altowayan and L. Tao, “Word embeddings for Arabic
[15] A. Dahou, S. Xiong, J. Zhou, M. H. Haddoud, and P. Duan, “Word sentiment analysis,” in 2016 IEEE International Conference on
embeddings and convolutional neural network for arabic sentiment Big Data (Big Data), 2016, pp. 3820–3825.
classification,” in Proceedings of coling 2016, the 26th [29] M. Biltawi, W. Etaiwi, S. Tedmori, and A. Shaout, “Fuzzy based
international conference on computational linguistics: Technical Sentiment Classification in the Arabic Language,” in Proceedings
papers, 2016, pp. 2418–2427. of SAI Intelligent Systems Conference, 2018, pp. 579–591.
[16] M. Aly and A. Atiya, “Labr: A large scale arabic book reviews [30] J. R. Firth, “A synopsis of linguistic theory, 1930-1955,” Stud.
dataset,” in Proceedings of the 51st Annual Meeting of the Linguist. Anal., 1957.
Association for Computational Linguistics (Volume 2: Short [31] Z. S. Harris, “Distributional structure,” Word, vol. 10, no. 2–3, pp.
Papers), 2013, vol. 2, pp. 494–498. 146–162, 1954.
[17] S. Al-Azani and E.-S. M. El-Alfy, “Hybrid deep learning for [32] A. B. Soliman, K. Eissa, and S. R. El-Beltagy, “Aravec: A set of
sentiment polarity determination of arabic microblogs,” in arabic word embedding models for use in arabic nlp,” Procedia
International Conference on Neural Information Processing, Comput. Sci., vol. 117, pp. 256–265, 2017.
2017, pp. 491–500. [33] R. K. Srivastava, K. Greff, and J. Schmidhuber, “Training very
[18] M. Nabil, M. Aly, and A. Atiya, “Astd: Arabic sentiment tweets deep networks,” in Advances in neural information processing
dataset,” in Proceedings of the 2015 Conference on Empirical systems, 2015, pp. 2377–2385.
Methods in Natural Language Processing, 2015, pp. 2515–2519. [34] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation
[19] N. A. Abdulla, N. A. Ahmed, M. A. Shehab, and M. Al-Ayyoub, by jointly learning to align and translate,” arXiv Prepr.
“Arabic sentiment analysis: Lexicon-based and corpus-based,” in arXiv1409.0473, 2014.
2013 IEEE Jordan conference on applied electrical engineering [35] Z. Yang, D. Yang, C. Dyer, X. He, A. Smola, and E. Hovy,
and computing technologies (AEECT), 2013, pp. 1–6. “Hierarchical attention networks for document classification,” in
[20] A. M. Alayba, V. Palade, M. England, and R. Iqbal, “A combined Proceedings of the 2016 conference of the North American chapter
cnn and lstm model for arabic sentiment analysis,” in International of the association for computational linguistics: human language
Cross-Domain Conference for Machine Learning and Knowledge technologies, 2016, pp. 1480–1489.
Extraction, 2018, pp. 179–191. [36] C. Cortes, M. Mohri, and A. Rostamizadeh, “L 2 regularization for
[21] A. Al Sallab, H. Hajj, G. Badaro, R. Baly, W. El Hajj, and K. B. learning kernels,” in Proceedings of the Twenty-Fifth Conference
Shaban, “Deep learning models for sentiment analysis in Arabic,” on Uncertainty in Artificial Intelligence, 2009, pp. 109–116.
in Proceedings of the second workshop on Arabic natural [37] M. Abadi et al., “Tensorflow: a system for large-scale machine
language processing, 2015, pp. 9–17. learning.,” in OSDI, 2016, vol. 16, pp. 265–283.
[22] G. Badaro, R. Baly, H. Hajj, N. Habash, and W. El-Hajj, “A large [38] D. P. Kingma and J. Ba, “Adam: A method for stochastic
scale Arabic sentiment lexicon for Arabic opinion mining,” in optimization,” arXiv Prepr. arXiv1412.6980, 2014.
Proceedings of the EMNLP 2014 workshop on arabic natural
language processing (ANLP), 2014, pp. 165–173.
[23] A. Al-Sallab, R. Baly, H. Hajj, K. B. Shaban, W. El-Hajj, and G.
Badaro, “Aroma: A recursive deep learning model for opinion

A novel medical image fusion algorithm
for detail-preserving edge and feature extraction
Fayadh Alenezi
Department of Electrical Engineering, Faculty of Engineering, Jouf University, Sakaka 72388, Saudi Arabia
Fshenezi@Ju.edu.sa

Abstract—By combining two or more medical images into method improve edges and textual information of the fused
one, image fusion has become an important tool for clinical image by combining Gabor filtering with maximum
diagnosis. However, existing fusion methods have also shown selection and fuzzy fusion, resulting in an image with low
significant limitations, such as the loss of information content, information content [1]. Another recent method uses a
weak contrast, noise and lengthy computation times. This Pulse-Coupled Neural Network (PCNN) together with
paper presents a novel technique for medical image fusion that Gabor filtering, in order to produce fused images with high
seeks to preserve and boost detailed information of the source information content [7], although the estimation of the
images, while promoting its edges and textual features and PCNN parameters requires a significant computation time.
suppressing noise. The method is based on a feature-linking,
In this paper, an improvement of the information content of
pulse-coupled neural network, followed by a modified Haar
the fused image is sought by preserving and promoting the
wavelet transform that leads to maximum-selection fusion in
the transformed domain and high-scale Wiener filtering of the textual features of the source images. For this purpose, a
resulting image. The new algorithm is presented, described and Feature-Linking model (FLM) [8] and a Modified Haar
evaluated on two sets of images, and its results are compared to Wavelet Transform (MHWT) [9] are combined. FLM is
those obtained from existing fusion methods. The performance aimed at enhancing the contrast of the fused image by
of the newly developed algorithm is shown to be superior over preserving and boosting detailed information of the source
the reference fusion methods in terms of a set of quality images. MHWT also strengthens the contrast of the FLM
metrics based on subjective visual perception criteria, thus image output before fusion, which is accomplished by
confirming its potential benefits to medical diagnosis. maximum selection rule. The fused image is filtered using a
high-scale Wiener filtering [8] in order to smooth out the
Keywords—Edge extraction, Information Content, HAAR noise in the final image.
transform, FLM PCNN, Wiener filter, Medical Image Fusion.
The rest of the paper is organized as follows: the
proposed algorithm is presented in section II, and section III
I. INTRODUCTION presents simulation results and their discussion. Section IV
Image fusion aims at combining complementary and provides analysis and conclusions.
redundant information from two or more images. The fused
(composite) image has superior qualities than any individual II. PROPOSED METHOD
input image [1]. Image fusion improves quality of decision-
making and therefore has found applications in medical A. Overview and Background
imaging, military science, biometrics and machine vision
[2]. The block diagram in Fig. 1 represents the algorithmic
steps proposed to achieve the desired fused image. The input
Image fusion methods are divided into spatial and images are fed into a lateral-inhibited and excited feature-
transform domain fusion methods. Spatial domain fusion linking pulse-coupled neural network (FLM-PCNN), in
methods directly handle the pixels of input images [3]. On order to boost and preserve key features. The output from
the other hand, transform-domain methods operate in an FLM-PCNN is decomposed in the spatial resolution domain
alternative domain in which images are represented via using a Discrete Wavelet Transform (DWT). The
some suitable transformation [3]. Image fusion can also be transformed image is then modified by means of the Haar
categorized based on fusion stages, such as pixel-level, wavelet transform to increase image contrast and extract
feature-level and decision-level fusion [4]. Pixel-level edge details and features. All of the wavelet coefficients are
fusion involves generating a composite image based on then combined using maximum fusion rule in order to
predetermination of pixel intensities of source images [4]. preserve salient features of the images. The fused image is
Feature-level fusion is based on extracting salient features then filtered using a high-scale Wiener filter to reduce
from source images, such as edges or texture [4]. Decision- image noise [8] and to optimize the complementary effects
level fusion entails the pre-extraction of information from of the inverse transformation from the earlier procedure.
source images, followed by the application of a set of
decision rules in order to have a common interpretation [4].
Although generally helpful, existing image fusion
techniques have also had significant drawbacks. For
instance, weighted average fusion methods have often
produced outputs with reduced contrast [5]; image fusion
using controllable cameras depends on camera motion and
Fig. 1: Schematic representation showing the proposed algorithm.
does not work with still images [6]; image fusion based on
probabilistic techniques entails huge and lengthy
computation efforts. A recently developed image fusion



B. Proposed FLM

The proposed Feature-Linking Model (FLM), which is similar to the traditional Pulse-Coupled Neural Network (PCNN), has two inputs: feeding inputs and linking inputs [10]. The FLM has two leaky integrators, which represent the membrane potential and the threshold of the neurons, one for each pixel in the network [11]. Unlike PCNN, the FLM's leaky integrators enable synchronization and desynchronization across different regions in a medical image, thus replicating the function of the human eye [11]. Image contrast is enhanced by the FLM through simultaneous timing of the first-generated action potential, and by keeping a time matrix record of action potential timing across the entire network [12]. The spiking (action potential) of the neuron is differential, which helps extract image information.

The time matrix in the FLM imitates the greyscale intensity of an image [12]: it maintains a logarithmic relationship to the stimuli matrix and is therefore implemented as a single-pass record, consistent with the Weber-Fechner law. In order to imitate the visual perception of the human eye, the parameters of the FLM must be carefully chosen in a manner similar to the Mach band effect. The Mach band effect is an optical illusion in which the contrast between edges of slightly different shades of gray is exaggerated as soon as they make contact [13]; this visual effect is important in triggering edge detection in the human visual system [14]. The Mach band effect is emphasized by introducing two constants into the linking inputs of the FLM network. These constants, ϵ and φ, are related to lateral inhibition [15] and lateral excitation [16], respectively. The lateral excitation ensures that only mutually exciting neurons relevant to stimuli are selected [15]. The lateral inhibition ensures that irrelevant neurons are suppressed [16].

a) Leaky integrator

Leaky integrators are the most important component of the FLM, and they describe the dynamic potential V(t) of a neural oscillator,

dV(t)/dt = −λ V(t) + S   (1)

where t represents time, S (the input stimulus) is the pixel value of the preprocessed image, and λ is the leak rate (0 < λ < 1). Eq. (1) can be discretized as

V(n) − V(n−1) = −λ V(n−1) + S   (2)

where V(n) is the discretized potential and n is the discrete time index. Eq. (2) can be rewritten as

V(n) = f V(n−1) + S   (3)

where f = 1 − λ is the attenuation time constant of the leaky integrator. Eq. (3) represents the generic form of a leaky integrator.

b) Membrane potential

A cortical neuron is mostly bi-directionally connected; feeding synapses are of the feedforward type while linking synapses are of the feedback type [10]. Fig. 2 shows the feature-linking model for the proposed method. Fig. 3 shows a neuron that has feeding synapses and lateral linking synapses. Feeding synapses are connected to a spatially corresponding stimulus. Lateral linking is shown in Fig. 3, which illustrates how synapses are connected to outputs of neighboring neurons within a predetermined radius σ [10]. In this paper, σ is large enough to allow effective filtering of the neurons. Locally excitatory linking inputs have a negative, globally inhibitory term that supports desynchronization [17]. Therefore, the dendritic signals to the neuron are the feeding inputs and the linking inputs, respectively, as represented by

F_ij(t) = Σ_kl M_ijkl Y_kl(t−1) + S_ij   (4)

L_ij(t) = Σ_pq W_ijpq Y_pq(t−1) − d + φ − ϵ   (5)

where indices (i, j) denote each reference neuron, indices (k, l) and (p, q) denote neighboring neurons; F_ij(t) is the feeding input; Y_kl(t−1) denotes the postsynaptic action potential; S_ij is the stimulus for the neuron; the term M_ijkl is a synaptic weight applied to the feeding inputs; L_ij(t) denotes a linking input, and W_ijpq is a synaptic weight applied to the linking inputs. The positive constant d applies to the global inhibition, ϵ is a negative constant for lateral inhibition, and φ is a positive constant for lateral excitation.

Fig. 2. Schematic of the proposed feature-linking PCNN model with feeding input, linking input, leaky integrator and spike generator.

Fig. 3. Schematic of the relationship between linking inputs with excitatory (φ) and inhibitory (ϵ) neurons.

In order to enable synchronization, stimulus-driven feedforward streams are combined with stimulus-induced feedback streams [18]. The leaky integrator driven by the membrane potential is described by

U_ij(t) = f U_ij(t−1) + F_ij(t) [1 + β L_ij(t)]   (6)

where f is the attenuation time constant of the membrane potential and β is the linking strength. After algebraic substitution of (4) and (5) into (6), the neural membrane potential can finally be expressed as

U_ij(t) = f U_ij(t−1) + [Σ_kl M_ijkl Y_kl(t−1) + S_ij] [1 + β (Σ_pq W_ijpq Y_pq(t−1) − d + φ − ϵ)]   (7)
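A compact sketch of one FLM iteration over the whole image is given below. It is not the authors' MATLAB implementation: the 3×3 neighbourhood kernel is an assumption, the parameter values should be taken from Table I, and the form of the linking input follows the reconstruction of Eq. (5) above, which is uncertain.

```python
import numpy as np
from scipy.ndimage import convolve

def flm_step(S, U, Y, theta, f, g, h, d, eps, phi, beta):
    """One FLM update; S is the normalized stimulus, U the membrane potential,
    Y the previous spikes, theta the threshold (all arrays of the image size)."""
    K = np.ones((3, 3)); K[1, 1] = 0.0                 # assumed neighbourhood weights M, W
    F = convolve(Y, K, mode="constant") + S            # feeding input, Eq. (4)
    L = convolve(Y, K, mode="constant") - d + phi - eps  # linking input, Eq. (5) as reconstructed
    U = f * U + F * (1.0 + beta * L)                   # membrane potential, Eq. (6)
    Y = (U > theta).astype(float)                      # fire where potential exceeds threshold
    theta = g * theta + h * Y                          # threshold leaky integrator, Eq. (8)
    return U, Y, theta
```

Iterating this step and recording, for each pixel, the time of its first spike yields the time matrix used to enhance contrast before the wavelet stage.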

c) Threshold

The threshold Θ_ij(t) of the neuron can be represented by a leaky integrator. The threshold Θ_ij(t) is given by

Θ_ij(t) = g Θ_ij(t−1) + h Y_ij(t−1)   (8)

where Y_ij(t−1) is the postsynaptic action potential, g is the attenuation time constant and h is a magnitude adjustment.

d) Summary of FLM action on input image

Each pixel of the input images corresponds to one neuron of the network; therefore, a two-dimensional image matrix is represented as r × c neurons (r being the number of image rows and c the number of columns). The input image intensity G is normalized according to

S = (p / m)^k · (G − min(G)) / (max(G) − min(G)) + ε   (9)

where S represents the output of the FLM stage, min(G) returns the minimum value of G, max(G) returns the maximum value of G, and ε is a small positive constant which ensures nonzero pixel values; it has been set to the smallest gray-scale value of the matrix, ε = min(G).

The first multiplying term in (9) normalizes the pixel value across its local neighborhood: p is the peak-to-mean amplitude of the neurons' filter response to the edge, m is the mean amplitude, used to achieve contrast invariance during normalization, and k is a normalization constant, which is set to 0.5. The normalization increases the lateral inhibition, thereby sharpening the visual and feature properties of the images [14]. These sharp-masked images are subsequently transformed using the MHWT.

C. Modified Haar Wavelet Transform

The output of the FLM becomes the input at this stage. The image is read as a matrix by applying the MHWT along the rows and columns of the entire image matrix. The process largely relies on Haar filters in order to help extract image features [19]. When the MHWT is applied along the rows and columns of the FLM-enhanced image matrix, a transformed image matrix is obtained, featuring one level of the input image divided into four corners, namely: the upper left corner, the approximation of the FLM-enhanced image; the lower left corner, the vertical details; the upper right corner, the horizontal details; and the lower right corner, the high-frequency detail component of the FLM-enhanced image, as presented in Fig. 4.

Fig. 4. MHWT decomposition of the FLM-enhanced image to matrix form.

The low-low (LL) sub-band is in the upper left corner and originates from low-pass filtering in both directions [9]. The LL sub-band, where all approximations take place, consists of the low-frequency components of the image, and it is split further at higher levels of decomposition. The HL, HH and LH sub-bands are the detail components [9]. The HL sub-band results from high-pass filtering in the row direction and low-pass filtering on the columns. The HH sub-band is high-pass filtered in all directions, while the LH sub-band results from low-pass filtering in the row direction and high-pass filtering on the columns [9].

All the visible details, such as edges and lines of the FLM-enhanced image, are assumed perpendicular to the orientation of the high-pass filtering. The proposed MHWT, as shown in Fig. 4, consists of four nodes as opposed to two nodes in the DWT.

The first average sub-signal in the proposed MHWT is estimated by

a¹ = (a_1, a_2, …, a_{N/2})   (10)

where N is the signal length. The signal is denoted by f, where f = (f_1, f_2, f_3, …, f_N). For instance, the mean of the first sub-signal of length N/2 can be approximated as

a_m = (f_{2m−1} + f_{2m}) / √2   (11)

and the corresponding detail sub-signal at the same level is approximated as

d_m = (f_{2m−1} − f_{2m}) / √2   (12)

where m = 1, 2, 3, …, N/2.

The maximum coefficients resulting from the MHWT processes corresponding to all source images are selected, and the inverse transform of this selection is subsequently obtained. The resulting image from the inverse transform is the fused image, which is then fed to the next stage.

D. Space-variant Wiener filter

The fused image from the MHWT stage is filtered using a high-scale Wiener filter to optimize the trade-off between noise power and signal power [20]. High-scale Wiener filtering is used, as opposed to a general Wiener filter, in order to solve the problem of invariance across different image regions. High-scale Wiener filtering is achieved by amplifying the magnitude of the fused image pixels so that their energies dominate over that of the noise [8]. The energy of the spectral components of the fused image pixels that is smaller than the noise energy is set to zero, leading to a noise-free image. The proposed Wiener filter output f(x) of the input image ℊ(x) is described as follows:

f(x) = m(x) + w(x) [ℊ(x) − m(x)]   (13)

where ℊ represents the pixel intensities, m is the local mean of the fused image pixel intensities, and

w(x) = max( (|ℬ|² − γ σ²) / |ℬ|² ; 0 )   (14)

where |ℬ|² is the variance of the local fused image intensities, σ is the noise standard deviation, and γ is a Lagrange constant that ensures the filter has a low frequency response at high frequencies.
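As a small illustration of the sub-signal construction in Eqs. (10)–(12) and of the maximum-selection rule described at the end of Section C, the sketch below splits two 1-D signals into Haar average/detail sub-signals, keeps the larger-magnitude coefficient at each position (an assumed reading of "maximum coefficients"), and inverts the transform.

```python
import numpy as np

def haar_split(f):
    """One-level Haar split of an even-length 1-D signal (Eqs. 11-12)."""
    a = (f[0::2] + f[1::2]) / np.sqrt(2.0)   # average sub-signal
    d = (f[0::2] - f[1::2]) / np.sqrt(2.0)   # detail sub-signal
    return a, d

def haar_merge(a, d):
    """Inverse of haar_split."""
    f = np.empty(a.size * 2)
    f[0::2] = (a + d) / np.sqrt(2.0)
    f[1::2] = (a - d) / np.sqrt(2.0)
    return f

def fuse_max(f1, f2):
    """Maximum-selection fusion of two signals in the Haar-transformed domain."""
    a1, d1 = haar_split(f1); a2, d2 = haar_split(f2)
    a = np.where(np.abs(a1) >= np.abs(a2), a1, a2)   # keep the stronger coefficient
    d = np.where(np.abs(d1) >= np.abs(d2), d1, d2)
    return haar_merge(a, d)
```

Applying the split along rows and then columns of an image yields the LL, LH, HL and HH sub-bands discussed above.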

The first term inside the max[•] operator in (14) ensures the filter is dynamic, making it spatially variant. On the other hand, the weight coefficients (|ℬ|² − γσ²)/|ℬ|² in the max[•] operator depend on the spectrum of the fused image and take values ranging from 0 to 1 depending on the magnitude of the noise variance σ²:

w(x) = 1 if |ℬ|² ≥ γσ², and w(x) = 0 otherwise.   (15)
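The space-variant update of Eqs. (13)–(15) can be sketched as follows; the window size and the noise-variance estimate are assumptions, and the clipping at zero reproduces the max[•] operator.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def space_variant_wiener(g, noise_var, gamma=1.0, win=5):
    """g: fused image (float array); noise_var: estimated noise power sigma^2."""
    mu = uniform_filter(g, size=win)                     # local mean m(x) in Eq. (13)
    var = uniform_filter(g * g, size=win) - mu ** 2      # local variance |B|^2
    gain = np.maximum((var - gamma * noise_var) / np.maximum(var, 1e-12), 0.0)  # Eq. (14)
    return mu + gain * (g - mu)                          # Eq. (13)
```

Where the local variance falls below the (scaled) noise power the gain is zero and only the local mean survives, which is the noise-suppression behaviour summarized by Eq. (15).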
The space-variant Wiener-filtered image has more features preserved, which is crucial in medical imaging, where images are typically characterized by poor contrast [22, 23]. Thus, by letting the filter vary from one region to another, there is enough flexibility to expose the appropriate details of the fused image for further operations. The constant γ also helps ensure that the parts of the power spectrum of un-degraded or noisy images that are hard to estimate are also filtered.

III. SIMULATION RESULTS

The proposed algorithm has been implemented using MATLAB R2018b, and then applied to two different sets of images (the input images and resulting images corresponding to Examples 1 and 2 are shown in Fig. 5 and Fig. 6, respectively). The FLM parameters used in these simulations are listed in Table I. The results have been evaluated using subjective visual perception criteria based on a set of performance quality metrics, namely: entropy, which measures the information content of the image [8]; overall cross entropy (OCE), which measures the difference between the input images and the fused image [23]; and average gradient (AVG), which measures the clarity of the fused image [24]. The results are compared with existing medical image fusion methods such as Shearlets and Human Feature Visibility (SHFV), Contourlet Transform (CT) and Discrete Wavelet Transform (DWT); such comparison is presented in Fig. 7 and Tables II and III. Finally, the graphical representation of the selected performance metrics for the proposed method and the reference algorithms is displayed in Figs. 8, 9, 10, 11, 12 and 13.

TABLE I. LIST OF THE PROPOSED FLM PARAMETER VALUES

Parameter   Value
f           0.015
g           0.975
h           1.95 × 10
d           2.05
ϵ           −0.2
φ           1.05
β           0.0295
α           0.015

Fig. 5. Inputs and fused image for the algorithm test using Example 1.

TABLE II. PERFORMANCE OF PROPOSED ALGORITHM COMPARED TO EXISTING ALGORITHMS FOR EXAMPLE 1

Algorithm   Entropy   OCE      AVG
Proposed    7.401     0.5806   0.0817
SHFV        7.1961    0.6214   0.0795
CT          6.6424    0.9041   0.0765
DWT         6.5142    0.7274   0.0662

Fig. 6. Inputs and fused image for the algorithm test using Example 2.

TABLE III. PERFORMANCE OF PROPOSED ALGORITHM COMPARED TO EXISTING ALGORITHMS FOR EXAMPLE 2

Algorithm   Entropy   OCE      AVG
Proposed    7.084     0.7348   0.0701
SHFV        6.9467    0.7654   0.0521
CT          6.8824    0.8843   0.0433
DWT         6.5198    1.1076   0.0419

Fig. 10. AVG of proposed algorithm compared to existing algorithms for
Example 1.

Fig. 11. AVG of proposed algorithm compared to existing algorithms for


Example 2-

Finally, the depiction of OCE values in Examples 1 and


2, presented in Fig. 12 and Fig. 13 respectively, indicate that
Fig. 7. Comparison of proposed algorithm results with result from existing
the proposed method yields the lowest OCE values among
algorithms all methods, which is also a positive aspect in terms of
performance. Lower OCE values means that the fused
The graphical representation of entropy values obtained results obtained performed better when compared to exiting
for examples 1 and 2, shown in Fig. 8 and Fig. 9 methods.
respectively, reveal that the proposed method yields better
entropy values than all of the reference algorithms, a desired
feature in image fusion methods. Higher entropy values
implies that there is more information content in the fused
images.

Fig. 12. OCE of proposed algorithm compared to existing algorithms for


Example 1.

Fig. 8. Entropy of proposed algorithm compared to existing algorithms for


Example 1.

Fig. 13. OCE of proposed algorithm compared to existing algorithms for


Example 2.

IV. CONCLUSION
Medical image fusion is critical in medicine in order to
Fig. 9. Entropy of proposed algorithm compared to existing algorithms for
Example 2. enable correct and accurate clinical diagnosis. Image
features such as textures and edges are important in accurate
Furthermore, the AVG values obtained for Examples 1 non-invasive treatments. This paper proposes a medical
and 2 and graphically displayed in Fig. 10 and Fig. 11 image fusion method based on combination of FLM,
respectively, expose the superiority of the proposed method MHWT and space-variant Wiener filter. The algorithm,
with respect to the selected reference algorithms, given the which is precisely aimed at improving those critical image
fact that higher AVG values are preferred, since they reflect features, exhibits a remarkable improvement when
an increased clarity of the fused image. compared to existing fusion methods. The evaluation has
been based on a set of performance metrics, showing that
the proposed algorithm outperforms the existing ones
despite having low computational complexity. The proposed

method yields images with better edges, information content for image enhancement," Neural computation, vol. 28, no. 6, pp.
and contrast. This performance can be attributed to better 1072-1100, 2016.
edge detection and extraction due to the MHWT, increased [13] A. Tsofe, H. Spitzer and S. Einav, "Does the Chromatic Mach bands
richness in information content thanks to the FLM and effect exist?," Journal of vision, vol. 9, no. 6, pp. 20-20, 2009.
superior contrast enhancement and smoothing of noise by
[14] F. A. A. Kingdom, "Mach bands explained by response
the space-variant Wiener filter. normalization," Frontiers in human neuroscience, vol. 8, p. 843,
2014.
Based on this preliminary evaluation, it is possible to
conclude that the proposed algorithm can potentially bring [15] F. G. J. Montolio, W. Meems, M. S. A. Janssens, L. Stam and N. M.
Jansonius, "Lateral inhibition in the human visual system in patients
significant benefits to the field of medical diagnosis. with glaucoma and healthy subjects: a case-control study," PloS one ,
Nevertheless, a more thorough evaluation considering an vol. 11, no. 3, p. e0151006, 2016.
increased number of examples and a more extensive set of
performance indicators is deemed necessary in order to fully [16] J. H. Byrne, Introduction to neurons and neuronal networks, 2013.
assess the performance of this novel method. [17] R. D. Stewart, I. Fermin and M. Opper, "Region growing with pulse-
coupled neural networks: an alternative to seeded region growing,"
IEEE Transactions on Neural Networks, vol. 13, no. 6, pp. 1557-
REFERENCES 1562, 2002.

[1] F. Alenezi and E. Salari, "Medical Image Fusion (MIF) Exploring [18] T. Brosch and H. Neumann, "Interaction of feedforward and
Textural Information," in 2018 IEEE International Conference on feedback streams in visual cortex in a firing-rate model of columnar
Electro/Information Technology (EIT), Rochester, MI, USA , 2018. computations," Neural Networks, vol. 54, pp. 11-16, 2014.

[2] F. Alenezi and E. Salari, "Perceptual Local Contrast Enhancement [19] S. Audithan and R. M. Chandrasekaran, "Document text extraction
and Global Variance Minimization of Medical Images for Improved from document images using haar discrete wavelet transform,"
Fusion," International Journal of Imaging Science and Engineering European journal of scientific research, vol. 36, no. 4, pp. 502-512,
(IJISE), vol. 10, no. 3, pp. 1-10, 2018. 2009.

[3] D. K. Sahu and M. P. Parsai, "Different image fusion techniques-a [20] G. Cristobal, P. Schelkens and H. Thienpont, Optical and digital
critical review," International Journal of Modern Engineering image processing: fundamentals and applications, John Wiley &
Research (IJMER), vol. 2, no. 5, pp. 4298-4301, 2012. Sons, 2013.

[4] S. K. Shah and D. U. Shah, "Comparative study of image fusion [21] A. Umarani, "Enhancement of coronary artery using image fusion
techniques based on spatial and transform domain," International based on discrete wavelet transform," Biomedical Research, vol. 27,
Journal of Innovative Research in Science, Engineering and no. 4, pp. 1118-1122, 2016.
Technology (IJIRSET), vol. 3, no. 6, pp. 10168-10175, 2014.
[22] R. Singh and A. Khare, "Multiscale medical image fusion in wavelet
[5] J. Kong, K. Zheng, J. Zhang and X. Feng, "Multi-focus image fusion domain," The Scientific World Journal, vol. 2013, pp. 1-11, 2013.
using spatial frequency and genetic algorithm," International Journal
of Computer Science and Network Security, vol. 8, no. 2, pp. 220- [23] L. Yang, B. L. Guo and W. Ni, "Multimodality medical image fusion
based on multiscale geometric analysis of contourlet transform,"
224, 2008.
Neurocomputing, vol. 72, no. 1-3, pp. 203-211, 2008.
[6] W. B. Seales and S. Dutta, "Everywhere-in-focus image fusion using [24] Z. Li, Z. Jing, X. Yang and S. Sun, "Color transfer based remote
controlablle cameras," in International Society for Optics and
sensing image fusion using non-separable wavelet frame transform,"
Photonics, 1996.
Pattern Recognition Letters, vol. 26, no. 13, pp. 2006-2014, 2005.
[7] F. Alenezi and E. Salari, "Novel Technique for Improved Texture [25] T. Schoenauer, S. Atasoy, N. Mehrtash and H. Klar, "NeuroPipe-
and Information Content of Fused Medical Images," in 2018 IEEE
Chip: A digital neuro-processor for spiking neural networks," IEEE
International Symposium on Signal Processing and Information
Transactions on Neural Networks, vol. 13, no. 1, pp. 205-213, 2002.
Technology (ISSPIT), 2018.

[8] F. Alenezi and E. Salari, "A Novel Image Fusion Method Which [26] M. Deshmukh and U. Bhosale, "Image fusion and image quality
assessment of fused images," International Journal of Image
Combines Wiener Filtering, Pulsed Chain Neural Networks and
Processing (IJIP), vol. 4, no. 5, p. 484, 2010.
Discrete Wavelet Transforms for Medical Imaging Applications,"
International Journal of Computer Sci ence And Technology, vol. 9, [27] L. Yaroslavsky, Yaroslavsky, L. (2013). Digital holography and
no. 4, pp. 9-15, 2018. digital image processing: principles, methods, algorithms. Springer
Science & Business Media, New York: Springer Science+Business
9] G. Singh, G. Singh and G. S. Aujla, "MHWT-A Modified Haar Media, 2004, p. 323.
Wavelet Transformation for Image Fusion," International Journal of
Computer Applications, vol. 79, no. 1, pp. 26-31, 2013. [28] N. A. Al-Azzawi, "Medical Image Fusion based on Shearlets and
Human Feature Visibility," International Journal of Computer
[10] R. Eckhorn, H. J. Reitboeck, M. T. Arndt and P. Dicke, "Feature Applications, vol. 125, no. 12, pp. 1-12, 2015.
linking via synchronization among distributed assemblies:
Simulations of results from cat visual cortex," Neural computation,
vol. 2, no. 3, pp. 293-307, 1990.

[11] J. L. a. P. M. L. Johnson, "PCNN models and applications," IEEE


transactions on neural networks, vol. 10, no. 3, pp. 480-498, 1999.

[12] K. Zhan, J. Teng, J. Shi, Q. Li and M. Wang, "Feature-linking model

Classification of Short-time Single-lead ECG
Recordings Using Deep Residual CNN
Areej Kharshid
Department of Computer Engineering, King Saud University, Riyadh, Saudi Arabia 11543
Areej.kharshid@gmail.com

Ridha Ouni
Department of Computer Engineering, King Saud University, Riyadh, Saudi Arabia 11543
rouni@ksu.edu.sa

Haikel S. Alhichri
Advanced lab for Intelligent Systems Research (ALISR), Department of Computer Engineering, King Saud University, Riyadh, Saudi Arabia 11543
hhichri@ksu.edu.sa

Yakoub Bazi
Advanced lab for Intelligent Systems Research (ALISR), Department of Computer Engineering, King Saud University, Riyadh, Saudi Arabia 11543
ybazi@ksu.edu.sa

Abstract— This paper presents a method for the ECG devices cannot replace the larger more expensive
classification of short-time single-lead ECG recordings of devices used in hospitals but they can have a major role in
variable size. These recordings are published as part of a early detection of AF through long term daily monitoring [3].
challenge in 2017 by PhysioNet. The goal of the challenge is to The dataset in the competition is challenging because the class
classify the ECG recordings into four classes (Normal, atrial sizes are unbalanced, which is problematic for many
fibrillation, other abnormalities, and too noisy). The dataset is classification algorithms. Another difficulty in this dataset is
challenging because the high inter-class variability and because that each ECG recording has one label yet they have variable
class sizes are unbalanced. The proposed method starts by sizes (from 9 seconds to 60 seconds in length), which is again
denoising the ECG recordings using bandpass filtering, then
makes it difficult to use directly in raw format as input to many
detecting and correcting inverted signals using our own
deep classification algorithms.
proposed algorithm. Since the recording have variable size, our
proposed solution extracts a large set of features (188) that the More than 70 groups participated in the 2017 ECG
literature has shown to be effective in characterizing ECG challenge. For example, the method of Teijeiro et al. [4]
signals and detecting abnormalities. Then we present our own extracts morphological and rhythm-related features using an
carefully designed residual convolutional neural network (CNN) abductive framework for time series interpretation [5]. Then,
with 5 hidden layers and use advanced and efficient training the authors feed these features into two classifiers, one that
techniques to build a deep learning classifier for the solution. evaluates the record globally, using aggregated values for each
Finally the paper presents preliminary results of testing the
feature; and another one that evaluates the record as a
proposed solution on the challenge dataset and shows its
classification capabilities.
sequence, using a Recurrent Neural Network fed with the
individual features for each detected heartbeat. Kropf et al. [6]
Keywords—short-time single-lead ECG recordings, Atrial proposed a method which starts by extracting a total of 380
Fibrillation detection, Deep Residual Convolutional Neural features from both time and frequency domain. They used
Networks (CNN). these features to train a Random Forest–based classifier
(bagged decision trees). Billeci et al. [7] proposed an approach
I. INTRODUCTION that starts by extracting fifty different features which can be;
Early diagnosis of irregular heart rhythm known as 1) computed on the ECG signal, 2) derived from the RR series,
arrhythmia, helps reduce the risk of severe complications, and 3) obtained by merging QRS morphology and rhythm.
such as stroke or heart failure. Atrial fibrillation (AF) is one Then, they select a subset of thirty discriminating features
of the most common heart arrhythmias today, affecting an using the stepwise linear discriminant analysis (LDA)
estimated 1% of the population [1]. It is the leading cause of algorithm. After that, the least squares support vector machine
stroke, so detecting it is important. An Electrocardiogram (SVM) classifier performs the classification step.
(ECG) is the most important method for AF detection. ECG Another top-performing method, proposed by Datta et al.
records the electrical activity of the heart at rest and provides [8], used a two layer binary cascaded approach where the first
information about heart rate and rhythm. It can show if there binary classifier separates the unlabelled recordings into two
is enlargement of the heart due to high blood pressure intermediate classes (’normal+others’ and ’AF+noisy’). Then,
(hypertension) or evidence of a previous heart attack each intermediate class is separated into two using a second
(myocardial infarction). binary classifier in a second layer. This method also relies on
In 2017, the PhysioNet/Computing in Cardiology a feature extraction step before classification. It extracts more
presented a challenge that asked researchers and practitioners than 150 features including morphological, Prior art AF
to provide a reliable solution for the screening of AF from features, frequency features, statistical features, and others.
short-time single-lead ECG signals acquired with a Zabihi et al. [9] propose a hybrid classification approach
commercial low-cost hand-held device [2]. These hand-held for ECGs recorded by the AliveCor hand-held devices [9]. It



combines features from multi-domains including time, algorithm. Residual networks deal with this problem by using
frequency, time-frequency, phase space, and meta-level. It the residual blocks as shown in Fig. 1.
utilizes a feature selection approach based on a random forest
classifier. Finally, the selected features are classified by
another random forest classifier.
Bin et al. [10] developed an approach to determine AF
based on decision tree ensemble. The proposed approach first
utilized an improved Hamilton and Tompkins algorithm to
detect the QRS complex. Second, they derived thirty features
from each ECG record, all these features can be categorized
into four main groups: AF features, Morphology features, RR
interval features, and Similarity indexed between beats. Last,
a binary decision tree based classifier is used for classification.
Behar et al. [11] used feature-based machine learning Fig.1: Illustration of the residual network concept
approach. As preprocessing step to determine and select the The basic idea is to make a shortcut connection from the
highest quality continuous sub-segment signal, R-peak input layer to some future hidden layer, where the input x is
detectors are used to evaluate the RR interval, and a second- added with the output of that hidden. In other words the hidden
by-second basis is used to estimate the signal quality. Based mapping H(x) is now F(x)+x. Thus the neural network now
on the results a set of features were extracted: heart rate maps x to the residual H(x)-x. The idea is that making the
variability, ECG morphology and the presence of ectopic network learn how to map this residual to 0 is better and more
beats. These features were used to train SVM classifiers in a efficient than learning the mapping between x and F(x).
one-vs-rest approach.
In this work, we present a solution based on a residual
Bonizzi et al. [12] presented a two-stage ensemble convolutional neural network (CNN). In contrast to many
learning method to detect the AF from other rhythms approaches which rely on the pretrained ResNet model, we
depending on several sets of known AF features. In stage one, the ECG record is classified into noisy versus non-noisy records based on extracting a set of features that do not exist in the ECG signal. In stage two, the records are classified into AF versus non-AF classes. For both stages, an ensemble of decision trees adapted with RUSBoost was used to classify the records.
When it comes to convolutional neural networks (CNNs), the literature contains three proposed solutions that use the short-time ECG records dataset. Xiong et al. [13] proposed a 16-layer CNN that accepts 5 seconds of the raw ECG signal at a time. The final class probability of the full ECG record is the average of the individual 5-second clip probabilities. Another work, presented by Warrick and Homsi [14], uses a one-layer CNN as an autoencoder to learn local and discriminative features from the raw input sequence. Three LSTM layers are then stacked on top of the CNN to encode the sequential patterns. Finally, the proposed deep network adds one fully connected hidden layer and a softmax layer for classification. Andreotti et al. [15] presented a comparison between a feature-based classifier and a deep learning approach (a convolutional neural network) for detecting rhythms from short ECG segments, which are then assigned to four classes. The two methods were developed with different tools: the feature-based approach was implemented in MATLAB with the WFDB toolbox, while the deep learning approach was developed using TensorFlow in Python 3.5.
As can be seen, many methods have been proposed in the literature; however, none of them have used residual networks to solve the problem of classifying short-time single-lead ECG records. Residual networks are a type of deep neural network proposed recently for image classification tasks [16]. They are famously known under the name ResNet and are commonly used as pre-trained models in many works in the literature. The accuracy of deep neural networks gets saturated due to the diminishing-gradient phenomenon, which means the gradient goes to zero during backpropagation. Rather than relying on such a large pre-trained model, we build our own 5-layer residual CNN in this work. This is the preferred approach due to the small size of the dataset. The method starts with a preprocessing step, which performs signal denoising and detection of (vertically) flipped signals. The second step involves extracting a set of important features from the ECG records. Next, the proposed method trains a residual CNN with the extracted features in order to classify the short-time ECG records. Thus, the main contributions of this paper are twofold: 1) a new algorithm for the detection of inverted ECG signals, and 2) a residual CNN designed and trained from scratch for the classification of short-time single-lead ECG recordings of variable size.
The remainder of the paper presents a description of the methodology in Section 2, experimental results in Section 3, and concluding remarks in Section 4.

II. PROPOSED METHOD

This section describes the methodology in detail. It contains subsections describing the preprocessing step, the feature extraction step, and the classification step using a deep residual CNN.

A. ECG record preprocessing

1) Denoising:
Important cardiac information is typically contained below 20 Hz in the ECG signal. Thus, any ECG segment having frequency power above 50 Hz can be safely assumed to be noise. Patient movement, bad electrodes, and improper electrode site preparation are the main causes of signal corruption. Another form of noise is baseline wander, which is mainly caused by patient breathing and voltage fluctuation [18]. The denoising step therefore preprocesses each ECG signal with a 10th-order bandpass Butterworth filter with cut-off frequencies of 0.5 Hz and 50 Hz.
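A minimal sketch of this denoising step, assuming SciPy is available and the 300 Hz sampling rate of the challenge data; the exact implementation used by the authors is not shown in the paper, and note that SciPy doubles the design order for band-pass filters:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def denoise_ecg(signal, fs=300.0, low=0.5, high=50.0, order=10):
    """Band-pass filter an ECG record to suppress baseline wander (<0.5 Hz)
    and high-frequency noise (>50 Hz), as described in the preprocessing step."""
    sos = butter(order, [low, high], btype="bandpass", fs=fs, output="sos")
    # Zero-phase filtering avoids shifting the positions of the QRS complexes.
    return sosfiltfilt(sos, np.asarray(signal, dtype=float))
```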
2) Detection of inverted signals:
Upon inspection of the ECG recordings in the provided challenge dataset, many are found to be inverted [5], [6]. If a

normal ECG signal is inverted then a classification algorithm
may classify it as abnormal, because of the difficulty in
detecting P waves. Thus detecting and correcting inverted
signals is an important step to improve classification
accuracy. Our algorithm for inverted signal detection is
illustrated in Fig. 2. In the figure we see two examples of normal signals, one that is not flipped (Fig. 2a) and one that is flipped (Fig. 2b). Our algorithm uses a sliding window with a size of 600 samples, i.e., 2 seconds, since the sampling rate is 300 Hz. This window size also guarantees that the sliding window covers at least two heart beats, since one heart beat takes 300 samples on average. Inside this window we
compute the maximum and minimum values. Then we
compute the midpoint between the maximum and minimum
values.
Fig. 2: Illustration of inverted ECG records. The yellow line shows the short-time signal mean and the sliding window is shown in red. (a) Non-inverted normal record, (b) inverted normal record.

One can easily observe that the midpoint should be below the signal mean (the yellow line in the figure) in the case of an inverted record. In the case of a non-inverted record, this midpoint will be above the signal mean. Therefore, in order to detect inverted signals, our algorithm simply counts the number of windows producing midpoints above the signal mean versus the number producing midpoints below the signal mean. If the number of windows producing midpoints below the signal mean is larger, then the ECG signal is inverted.
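The following is a small sketch of this inversion test, assuming a NumPy array sampled at 300 Hz; non-overlapping windows are assumed since the stride is not stated, and the function and variable names are ours:

```python
import numpy as np

def is_inverted(ecg, fs=300, win_sec=2.0):
    """Return True if the ECG record appears vertically flipped.

    For each 2-second window the midpoint between the window maximum and
    minimum is compared against the global signal mean; a majority of
    midpoints below the mean indicates an inverted record."""
    ecg = np.asarray(ecg, dtype=float)
    win = int(win_sec * fs)                      # 600 samples at 300 Hz
    mean = ecg.mean()
    below = above = 0
    for start in range(0, len(ecg) - win + 1, win):
        w = ecg[start:start + win]
        mid = (w.max() + w.min()) / 2.0
        if mid < mean:
            below += 1
        else:
            above += 1
    return below > above

# A record flagged as inverted could be corrected by flipping it around its
# mean, e.g. corrected = 2 * ecg.mean() - ecg (an assumption on our part; the
# paper does not state how the correction itself is performed).
```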
B. Feature extraction
ECG classification is a challenging problem due to the high inter- and intra-class variability within the data. Thus, feature extraction based on expert knowledge about the main characteristics of ECG signals is an important step in an ECG classification algorithm. One of the main characteristics of ECG signals is the QRS complex, which is illustrated in Fig. 3. Pan-Tompkins is a well-known algorithm for the detection of the QRS complex and the R peaks in ECG signals. The challenge organizers have provided a MATLAB implementation of the Pan-Tompkins algorithm, which we have used in our work. Next, the algorithm uses the detected R peaks to locate the other main points of the ECG waveform.
After detection of the P, Q, R, S, and T points, a set of 188 features is extracted from each ECG recording [4]. These include:
shown in Fig. 4b. However, a 5-layer CNN without residual
Frequency features: These features are extracted using the
connections, shown in Fig. 4a, is also used and compared
Short Time Fourier Transform (STFT). The algorithm
with the residual CNN to illustrate the effectiveness of the
employs a Hamming window of size 2 seconds (600 samples)
residual connections. The convolutional layers have a size of
that is sliding over the ECG recording every 300 samples
256 using ReLU as activation functions with an alpha value
(giving an overlap between windows of 50%). From each
equal to 0.2. Each convolutional layer is followed by a Batch
sliding window we extract the following set of features mean
normalization layer and a dropout layer with a fraction equal
spectral centroid [19], spectral flux [20], and spectral roll-off
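As an illustration of how a few of the listed HRV and AF-related quantities can be computed from the detected R peaks, here is a hedged NumPy sketch; the exact definitions used by the authors follow [22] and [24] and may differ in detail:

```python
import numpy as np

def rr_features(r_peak_samples, fs=300.0):
    """Compute a few RR-interval features from R-peak sample indices."""
    rr = np.diff(np.asarray(r_peak_samples)) / fs      # RR intervals in seconds
    drr = np.diff(rr)                                   # successive differences
    return {
        "n_rr": rr.size,                                # number of RR intervals
        "rr_var": rr.var(),                             # RR variance (AF indicator)
        "sdnn": rr.std(),                               # SDNN
        "rmssd": np.sqrt(np.mean(drr ** 2)),            # RMSSD
        "sdsd": drr.std(),                              # SDSD
        "pnn50": np.mean(np.abs(drr) > 0.05),           # pNNx with x = 50 ms
    }
```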
C. Classification using residual CNN
The classification step uses a deep residual CNN, as shown in Fig. 4b. A 5-layer CNN without residual connections, shown in Fig. 4a, is also used and compared with the residual CNN to illustrate the effectiveness of the residual connections. The convolutional layers have a size of 256 and use ReLU-type activation functions with an alpha value equal to 0.2. Each convolutional layer is followed by a batch normalization layer and a dropout layer with a dropout fraction of 0.2.
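A minimal Keras sketch of one residual block following this description is given below. The kernel size, pooling, input length, and the use of LeakyReLU (our reading of "ReLU with an alpha value of 0.2") are assumptions on our part; the full architecture of Fig. 4 is not reproduced here:

```python
from tensorflow.keras import layers, models

def conv_block(x, filters=256, kernel_size=16):
    # Conv -> BatchNorm -> LeakyReLU(0.2) -> Dropout(0.2), as described above.
    y = layers.Conv1D(filters, kernel_size, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.LeakyReLU(alpha=0.2)(y)
    return layers.Dropout(0.2)(y)

def build_residual_cnn(n_features, n_classes=4, n_blocks=5):
    inp = layers.Input(shape=(n_features, 1))
    x = layers.Conv1D(256, 16, padding="same")(inp)   # project to 256 channels
    for _ in range(n_blocks):
        y = conv_block(x)
        x = layers.Add()([x, y])                      # residual (skip) connection
    x = layers.GlobalAveragePooling1D()(x)
    out = layers.Dense(n_classes, activation="softmax")(x)
    return models.Model(inp, out)
```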

Fig. 4: Proposed CNN architecture for classification of short-time single-lead ECG records. (a) 5-layer CNN without residual connections, (b) the same CNN with residual connections.

III. EXPERIMENTAL RESULTS

This section presents the preliminary results obtained using our proposed method on the short-time single-lead ECG records dataset.

A. Dataset description
The short-time ECG recordings dataset [2] has a total of 12186 records. AliveCor, a company that makes hand-held ECG devices, donated this dataset for the challenge. Fig. 5 shows the composition of the dataset. As can be seen, 8528 records are selected for training and 300 records for testing. The training set is divided into Normal (N), AF (A), Other (O), and Noisy (~) classes with sizes 5154, 771, 2557, and 46, respectively. In addition, the testing set is divided as follows: 150 records for the N class, 50 for the A class, 70 for the O class, and 30 for the ~ class. One clear observation is the large class imbalance of the dataset.

Fig. 5: Short-time single-lead ECG dataset composition. (a) Training set with 8528 samples, (b) testing set with 300 samples.
B. Experimental setup
We implement the proposed deep residual CNN in the Keras environment, a high-level neural network application programming interface written in Python. We set the number of epochs to 100 and fix the batch size to 100 samples. Additionally, we set the learning rate of the Adam optimization method to 0.0001. For the exponential decay rates of the moment estimates and for epsilon, we use the default values of 0.9, 0.999, and 1e-8, respectively. We note that all experiments are conducted on an HP workstation with an Intel Xeon 2.40 GHz processor, 24.00 GB of RAM, and a GeForce GTX1090 GPU with 11 GB of memory.
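A hedged sketch of this training configuration is shown below; the model constructor and data arrays are placeholders, and only the optimizer settings, epochs, and batch size come from the text above:

```python
from tensorflow.keras.optimizers import Adam

# model = build_residual_cnn(n_features=188)            # see the sketch above
optimizer = Adam(learning_rate=1e-4, beta_1=0.9, beta_2=0.999, epsilon=1e-8)
# model.compile(optimizer=optimizer,
#               loss="sparse_categorical_crossentropy",
#               metrics=["accuracy"])
# model.fit(x_train, y_train, epochs=100, batch_size=100,
#           validation_data=(x_val, y_val))
```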
For performance evaluation, we present the results using the F1 score. Given a confusion matrix as shown in Fig. 6, the per-class F1 scores are computed as presented in (1), (2), (3), and (4):

FN = 2Nn / (ΣN + Σn)   (1)
FA = 2Aa / (ΣA + Σa)   (2)
FO = 2Oo / (ΣO + Σo)   (3)
F~ = 2Pp / (ΣP + Σp)   (4)

where Nn, Aa, Oo, and Pp are the numbers of correctly classified Normal, AF, Other, and Noisy records, ΣN, ΣA, ΣO, and ΣP are the numbers of reference records in each class, and Σn, Σa, Σo, and Σp are the numbers of records assigned to each class by the classifier (see Fig. 6). Finally, following the guidelines of the PhysioNet/Computing in Cardiology challenge [2], we compute an overall F1 score using the F1 scores for the N, A, and O classes as follows: F1 = (FN + FA + FO)/3.

Fig. 6: Definition of the parameters used in the score formulas in equations (1), (2), (3), and (4).
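For reference, a small sketch of how these per-class and overall scores can be computed from a 4x4 confusion matrix (rows = reference class, columns = predicted class, in the order N, A, O, ~); this mirrors the challenge scoring as we read it and is not code from the paper:

```python
import numpy as np

def challenge_f1(confusion):
    """Per-class F1 (N, A, O, ~) and the overall score (mean over N, A, O)."""
    c = np.asarray(confusion, dtype=float)
    ref_totals = c.sum(axis=1)        # records per reference class
    pred_totals = c.sum(axis=0)       # records per predicted class
    f1 = 2 * np.diag(c) / (ref_totals + pred_totals)
    overall = f1[:3].mean()           # average over N, A and O only
    return f1, overall
```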
C. Preliminary results
Table I shows the effect of the detection of inverted signals on the classification accuracy for both types of CNN. As the table shows, detecting and correcting the inverted signals provides a significant improvement in classification accuracy: it increases from 87.58% to 91.76% for the 5-layer CNN, and from 89.09% to 95.09% for the 5-layer CNN with residual connections. These results indicate that correcting the inverted ECG signals is important in order to obtain accurate classification.
Table II compares our method with top-performing deep learning methods based on the F1 scores. We have selected for comparison only the five topmost solutions in the PhysioNet/Computing in Cardiology challenge 2017. These include the works of Teijeiro et al. [5], Kropf et al. [6], Billeci et al. [7], Datta et al. [8], and Plesinger et al. [30].

TABLE I. EFFECT OF INVERSION DETECTION ON CLASSIFICATION ACCURACY

                                                 F1 score per class           Overall
Method                                           N      A      O      ~       F1 score
5-layer CNN, without inversion correction        95.42  93.75  84.21  -       87.58
5-layer CNN, with inversion correction           96.71  90.52  86.13  -       91.76
5-layer residual CNN, without inversion corr.    93.89  92.63  83.72  -       89.09
5-layer residual CNN, with inversion corr.       97.10  94.85  93.33  85.58   95.09

We have also selected three more papers from the challenge because they use deep neural networks in their solution. In particular, we have selected the works of Xiong et al. [13], Warrick et al. [14], and Andreotti et al. [15]. Thus, in total we have selected 8 methods from the challenge for comparison.

TABLE II. COMPARISON OF CLASSIFICATION ACCURACY WITH STATE-OF-THE-ART

                           F1 score per class           Overall
Method                     N      A      O      ~       F1 score
Top 5 methods
Teijeiro et al. [5]        93.29  95.74  84.62  -       91.22
Kropf et al. [6]           95.50  98.95  92.42  -       95.62
Billeci et al. [7]         92.72  94.62  83.20  -       90.18
Datta et al. [8]           99.66  98.95  98.46  -       99.02
Plesinger et al. [30]      95.30  95.83  85.94  -       92.36
Other deep NN methods
Xiong et al. [13]          92.31  96.91  82.17  -       90.46
Warrick et al. [14]        89.93  89.36  70.07  -       83.12
Andreotti et al. [15]      96.35  84.71  89.05  -       90.03
5-layer CNN [ours]         96.71  90.52  86.13  83.71   91.76
Residual CNN [ours]        97.10  94.85  93.33  85.58   95.09

The results of our method are better than those of 6 out of the 8 methods, which is good as a preliminary result. The two methods that beat our method are the one by Datta et al. [8], which achieved a score of 99.02, and the one by Kropf et al. [6], which achieved a score of 95.62.
Similar to this work, the method by Datta et al. [8] relies on a feature extraction step where more than 150 features are extracted. However, it uses a two-layer cascaded binary approach, where a first binary classifier separates the recordings into two intermediate classes ('normal+others' and 'AF+noisy'). Then, each intermediate class is separated into two classes using a second binary classifier in a second layer. Clearly, this is the reason for the good performance of their method. Thus, we should definitely investigate this cascaded approach in our future work.
As for the second work, by Kropf et al. [6], it again starts by extracting a set of features from each ECG recording. However, they extract a total of 380 features from both the time and frequency domains. This is a larger set of features than what we are using in our method (188 features only). For classification they then use a random forest-based classifier (bagged decision trees). We believe the larger number of extracted features explains the slightly better results they achieved.

IV. CONCLUSION
This paper presented a feature-based deep learning approach to classify rhythms from short-time single-lead ECG recordings of variable size. A general performance evaluation has been carried out on the PhysioNet challenge dataset and compared against the most recent works published in this field. Our results show that residual CNNs are more capable of classifying short-time single-lead ECG recordings. Moreover, the proposed method, based on our own algorithm for inverted signal detection and a 5-layer CNN trained from scratch, has shown strong classification capability, reaching 91.76% and 95.09% when using the 5-layer CNN and the 5-layer residual CNN approaches, respectively.
Finally, it is challenging to reliably detect AF from a short-time single lead of ECG, and the broad taxonomy of rhythms makes this particularly difficult. However, two alternatives can be followed in future work in order to reliably improve the results: increasing the number of extracted features and using a cascaded approach for ECG classification.

ACKNOWLEDGMENT
This work was supported by the Deanship of Scientific Research at King Saud University through the Local Research Group Program under Project RG-1435-055.

REFERENCES
[1] "What is Atrial Fibrillation (AFib or AF)?," www.heart.org. [Online]. Available: https://www.heart.org/en/health-topics/atrial-fibrillation/what-is-atrial-fibrillation-afib-or-af. [Accessed: 19-Apr-2019].
[2] G. Clifford et al., "AF Classification from a Short Single Lead ECG Recording: the Physionet Computing in Cardiology Challenge 2017," presented at the 2017 Computing in Cardiology Conference, 2017.
[3] K. M. Griffiths, E. N. Clark, B. Devine, and P. W. Macfarlane, "Assessing the accuracy of limited lead recordings for the detection of Atrial Fibrillation," in Computing in Cardiology 2014, 2014, pp. 405–408.
[4] T. Teijeiro, C. A. García, D. Castro, and P. Félix, "Arrhythmia classification from the abductive interpretation of short single-lead ECG records," in 2017 Computing in Cardiology (CinC), 2017, pp. 1–4.
[5] T. Teijeiro, P. Félix, J. Presedo, and D. Castro, "Heartbeat Classification Using Abstract Features From the Abductive Interpretation of the ECG," IEEE Journal of Biomedical and Health Informatics, vol. 22, no. 2, pp. 409–420, Mar. 2018.
[6] M. Kropf, D. Hayn, and G. Schreier, "ECG classification based on time and frequency domain features using random forests," in 2017 Computing in Cardiology (CinC), 2017, pp. 1–4.
[7] L. Billeci, F. Chiarugi, M. Costi, D. Lombardi, and M. Varanini, "Detection of AF and other rhythms using RR variability and ECG spectral measures," in 2017 Computing in Cardiology (CinC), 2017, pp. 1–4.
[8] S. Datta et al., "Identifying normal, AF and other abnormal ECG rhythms using a cascaded binary classifier," in 2017 Computing in Cardiology (CinC), 2017, pp. 1–4.
[9] M. Zabihi, A. B. Rad, A. K. Katsaggelos, S. Kiranyaz, S. Narkilahti, and M. Gabbouj, "Detection of atrial fibrillation in ECG hand-held devices using a random forest classifier," in 2017 Computing in Cardiology (CinC), 2017, pp. 1–4.
[10] G. Bin, M. Shao, G. Bin, J. Huang, D. Zheng, and S. Wu, "Detection of atrial fibrillation using decision tree ensemble," in 2017 Computing in Cardiology (CinC), 2017, pp. 1–4.
[11] J. A. Behar, A. A. Rosenberg, Y. Yaniv, and J. Oster, "Rhythm and quality classification from short ECGs recorded using a mobile device," in 2017 Computing in Cardiology (CinC), 2017, pp. 1–4.
[12] P. Bonizzi, K. Driessens, and J. Karel, "Detection of atrial fibrillation episodes from short single lead recordings by means of ensemble learning," in 2017 Computing in Cardiology (CinC), 2017, pp. 1–4.
[13] Z. Xiong, M. K. Stiles, and J. Zhao, "Robust ECG signal classification for detection of atrial fibrillation using a novel neural network," in 2017 Computing in Cardiology (CinC), 2017, pp. 1–4.
[14] P. A. Warrick and M. N. Homsi, "Cardiac arrhythmia detection from ECG combining convolutional and long short-term memory networks," in 2017 Computing in Cardiology (CinC), 2017, pp. 1–4.
[15] F. Andreotti, O. Carr, M. A. F. Pimentel, A. Mahdi, and M. D. Vos, "Comparing feature-based classifiers and convolutional neural networks to detect arrhythmia from short segments of ECG," in 2017 Computing in Cardiology (CinC), 2017, pp. 1–4.
[16] K. He, X. Zhang, S. Ren, and J. Sun, "Deep Residual Learning for Image Recognition," in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
[17] K. He, X. Zhang, S. Ren, and J. Sun, "Deep Residual Learning for Image Recognition," in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
[18] G. M. Friesen, T. C. Jannett, M. A. Jadallah, S. L. Yates, S. R. Quint, and H. T. Nagle, "A comparison of the noise sensitivity of nine QRS detection algorithms," IEEE Trans Biomed Eng, vol. 37, no. 1, pp. 85–98, Jan. 1990.
[19] K. K. Paliwal, "Spectral subband centroid features for speech recognition," in Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '98), 1998, vol. 2, pp. 617–620.
[20] D. Giannoulis and J. D. Reiss, "Parameter Automation in a Dynamic Range Compressor," 2013.
[21] G. Peeters, B. L. Giordano, P. Susini, N. Misdariis, and S. McAdams, "The Timbre Toolbox: extracting audio descriptors from musical signals," J. Acoust. Soc. Am., vol. 130, no. 5, pp. 2902–2916, Nov. 2011.
[22] R. Banerjee, R. Vempada, K. M. Mandana, A. D. Choudhury, and A. Pal, "Identifying Coronary Artery Disease from Photoplethysmogram," in Proceedings of the 2016 ACM International Joint Conference on Pervasive and Ubiquitous Computing: Adjunct, New York, NY, USA, 2016, pp. 1084–1088.
[23] L. Maršánová et al., "ECG features and methods for automatic classification of ventricular premature and ischemic heartbeats: A comprehensive experimental study," Scientific Reports, vol. 7, no. 1, p. 11239, Sep. 2017.
[24] S. Sarkar, D. Ritscher, and R. Mehra, "A detector for a chronic implantable atrial tachyarrhythmia monitor," IEEE Trans Biomed Eng, vol. 55, no. 3, pp. 1219–1224, Mar. 2008.
[25] R. Alcaraz, D. Abásolo, R. Hornero, and J. J. Rieta, "Optimal parameters study for sample entropy-based atrial fibrillation organization analysis," Comput Methods Programs Biomed, vol. 99, no. 1, pp. 124–132, Jul. 2010.
[26] D. E. Lake and J. R. Moorman, "Accurate estimation of entropy in very short physiological time series: the problem of atrial fibrillation detection in implanted ventricular devices," Am. J. Physiol. Heart Circ. Physiol., vol. 300, no. 1, pp. H319–325, Jan. 2011.
[27] J. Park, S. Lee, and M. Jeon, "Atrial fibrillation detection by heart rate variability in Poincare plot," Biomed Eng Online, vol. 8, p. 38, Dec. 2009.
[28] S. Bandyopadhyay et al., "An unsupervised learning for robust cardiac feature derivation from PPG signals," in 2016 38th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), 2016, pp. 740–743.
[29] C. Puri et al., "Classification of normal and abnormal heart sound recordings through robust feature selection," in 2016 Computing in Cardiology Conference (CinC), 2016, pp. 1125–1128.
[30] F. Plesinger, P. Nejedly, I. Viscor, J. Halamek, and P. Jurak, "Automatic detection of atrial fibrillation and other arrhythmias in holter ECG recordings using rhythm features and neural networks," in 2017 Computing in Cardiology (CinC), 2017, pp. 1–4.

Identification and Tagging of Malicious Vehicles through License
Plate Recognition
Ahmad Mostafa#, Walid Hussein*, Samir El-Seoud+
# Computer Networks Department, The British University in Egypt, El-Sherouk, Egypt
E-mail: ahmad.mostafa@bue.edu.eg

* Computer Science Department, The British University in Egypt, El-Sherouk, Egypt


E-mail: walid.hussein@bue.edu.eg

+ Software Engineering Department, The British University in Egypt, El-Sherouk, Egypt


E-mail: samir.elseoud@bue.edu.eg

Abstract— Vehicular Ad-hoc NETworks (VANET) are becoming a reality in today’s world. These networks are composed of highly
dynamic and capable vehicles and they rely on information that originates and is exchanged between each other. One of the main
success factors of this communication is the validity of the data communicated. Hence, malicious vehicles pose a serious threat to
VANETs. Once a vehicle is identified as malicious, the main challenge is to keep a centralized ledger of the malicious vehicles within
the network. In this paper, an innovative distributed framework is proposed for the identification and the tagging of malicious vehicles.
This framework is based on Arabic license plate recognition using different image recognition algorithms, and the identification of a vehicle as malicious or non-malicious propagates through the network, with higher accuracy in comparison to other common plate
recognition approaches. The details of both the vehicle communication framework and the image processing pipeline are presented, and
the framework is validated through different implementations and discussion.

Keywords—VANET, Image Processing, Number Plate Recognition, Feature Extraction, Malicious Nodes.

between the vehicles to only remain for a few seconds de-


I. INTRODUCTION pending on the speed and the direction of the vehicles [5]. In
Vehicular Ad-Hoc Networks (VANET) have been the fo- other words, this means that two vehicles will only get in con-
cus of research in the past few years, and many advancements tact with each other for a short period of time due to the fast
have been made in the routing protocols as well as tackling entrance and exit of the vehicles from the network [6].
other practical issues. VANET is based on radio communica- Because of this rapid change, one of the main challenges
tion between vehicles, which has been introduced as early as facing the VANET technology is the vehicle malicious activ-
1925 [1]. VANET relies on two modes of communication: ity. This types of malicious activity affect many types of net-
vehicle-to-vehicle (V2V) communication and vehicle-to-in- works, however, the severity in VANET is evident. The com-
frastructure (V2I) communication [2]. In V2V, vehicles com- munication that takes place in VANET is the basis for routing
municate with each other in an ad-hoc fashion, while in V2I, the vehicles, safety stops, controlling the speed, and much
vehicles communicate with infrastructure units on the road- more. If this information is malicious, the damage and the
side, often named roadside units (RSUs). Between both tech- harm done can cause a loss in life and money. Hence, the ex-
nologies, V2V is becoming the de facto technology in istence and the undetectability of malicious vehicles in
VANET, and is proposed to be mandatory for each new vehi- VANETs can have dire consequences. Many solutions have
cle to have V2V communication capabilities in the USA by proposed in the literature in order to deal with malicious nodes
2023 [3]. However, V2V remains to have many challenges, in different types of networks. In wireless sensor networks
mainly due to the peer-to-peer distributed nature of commu- (WSN), some algorithms have been proposed in order to de-
nication. Unlike common peer-to-peer networks, VANETs tect malicious nodes. For example, the one proposed solution
witness a rapid change in the network topology due to the is to detect the malicious vehicle using the signal strength of
highly dynamic nature of vehicles [4]. This causes the links



the messages received from the nodes [7]. Although such so- Malicious nodes in networks are able to manipulate either
lutions have been proposed for different networks, they can- the data or the information conveyed from one node to an-
not be easily adopted by vehicular networks. other. Hence, it is important for networks to be able to identify
Once vehicles are identified as malicious, another chal- malicious vehicles and keep that information within the net-
lenge presents itself. The challenge now becomes how to work. There have been many proposals in literature on how to
identify the vehicles in a distributed ad-hoc network. The identify and isolate malicious vehicles. In [12], the authors
problem can be illustrated as shown in figure 1. In this figure, proposed the modification of the AODV [13] routing protocol
we have three vehicles traveling across the road. Vehicle B is to deal with malicious vehicles. The modification introduced
malicious and vehicle A is able to detect that it is a malicious was to allow the RSU to identify a vehicle as malicious if the
vehicle. However, vehicle C is not aware of this, and in order vehicle speed changes abruptly, or if the vehicle does not reg-
for vehicle A to inform vehicle C, then the message has to be ister with the RSU. At that point, the vehicle is isolated from
sent through vehicle B. Moreover, once vehicle B recognizes the network. One limitation is that no framework was intro-
that it has been identified as malicious, then it can spoof or duced to preserve the isolation or the identification of a vehi-
forge a new identity in order to be able to communicate with cle as malicious, and once a vehicle moves on to another RSU
other vehicles. In order to overcome this issue, we propose a or cluster or if it spoofs its identity, then the process will start
protocol that allows the neighboring vehicles to identify the all over.
malicious vehicle through the license plate recognition, and The authors in [14] proposed the utilization of sensor data
identifying the license plate as malicious. The advantage of from honest nodes in order to identify the malicious activity
this system is that the identity cannot be forged since it is ex- and hence be able to recognize the malicious vehicles. Once
tracted by the surrounding vehicles rather than sent by the ma- the vehicle has been recognized as malicious, the identifica-
licious vehicle. One other advantage is that it is not feasible tion of the vehicle is a challenge. The reason for this is that if
to change the license plate easily during the vehicle travel. the dependence on the identification of the vehicle is based on
a specific signature, the vehicle can exit the network and
reenter it with a new signature. This issue has been proposed
to be solved through key management and certification
schemes [15, 16]. One main challenge for this approach is that
it requires the existence of some form or centralized authority
which is non-existent in VANETs. In this paper, we propose
identifying the vehicle based on the license plate number. This
B approach is based on real-time image recognition.
Other approaches have also been introduced to detect ma-
A licious vehicles through the surrounding and environmental
information. In [17], the authors introduced TrustLevel which
allows the surrounding vehicles to create a perception map
C and an expectation of what should be received from other ve-
hicles. If the packet received is deemed malicious, then the
Fig. 1 Vehicle Identification in VANET source vehicle is isolated and no communication is allowed
with it. One major limitation for this approach is that if the
In this paper, we present the details of this proposed frame- vehicle exits the network or moves on to a new cluster of ve-
work. We present a method to detect the plate numbers hicles, the whole process will have to be repeated again.
through image recognition and following that, we amend this In the framework that we introduced in this paper, the main
identification to the packet received from the vehicle. advantage is that it allows for the overall network to tag a ve-
The paper is organized as follows: section 2 discusses the hicle as malicious and that this identification is preserved.
related work. Section 3 discusses the general architecture in- This is achieved whether the vehicle decides to exit and renter
cluding the image recognition and the communication proto- the network or if it decides to change the neighboring vehicles
col. Section 4 discusses license plate recognition, while sec- in the cluster. This is possible through the utilization of li-
tion 5 focuses on the details of the framework. Lastly, we pre- cense plate number detection using image recognition.
sent the future work and conclusion. Number plate recognition is not a new field and significant
work has been done in this area. In [18], the authors developed
II. RELATED WORK a number plate detection system which receives the images
Security of most networks depends on a centralized sys- with high resolution from digital camera and different back-
tem, in which the authentication of the nodes and the outcast ground then resizing the image with 1024 * 768 size to be
of malicious nodes happens at a central location [8]. This more applicable. The next step was to enhance the input im-
problem is more challenging in VANET due to the high speed age with image processing filters and techniques to be more
of vehicles that causes frequent network disconnections applicable and to reduce the noise in the image and to improve
which leads to a more decentralized system. This has opened the contrast. In [19], the process is divided into three stages.
the VANET to a slew of security challenges [9]. Some of The first stage is plate detection, the second stage is character
these challenges are the data consistency liability, low toler- segmentation, and the third stage is optical character recogni-
ance for error, key distribution [10], and malicious vehicles tion. Firstly, the input image is converted to gray scale, and
[11]. the edge detection is applied to isolate the plate and the me-

dian filter to reduce the noise. Afterwards, histogram equali- In order to achieve the steps mentioned in the flowchart
zation is implemented to remap the pixels of the image and above, there have to be some requirements. These require-
improve the quality. The most ideal number plate area is dis- ments are:
covered by looking at width by height variable of genuine In-
1.The license plate number has to be recognized by other
dian number plates to a similar component of plate like re-
neighboring vehicles instead of being sent by the source ve-
gions found by this strategy. Secondly, the characters of the
hicle. This is important in order to avoid allowing the mali-
distinguished number plate district are divided utilizing Re-
cious vehicle to spoof its own identity in a reputation based
gion props capacity of MATLAB to get jumping boxes for
system.
each of the characters. Region props restores the littlest
bouncing box that contains a character. The third step is ap- 2.The license plate number will have to be amended to the
plying Optical Character Recognition (OCR) using template received packet. Hence, the neighboring vehicle will have to
matching or supervised learning approach. It works by pixel- have the ability to identify and distinguishing which one of
by-pixel correlation of the picture and the layout for every the received packets will be associated with which license
conceivable removal of the format. For each character and plate.
number there is template created in the database for each one
from 0 to 9 and from A to Z. However, the published accuracy 3.There has to be a ledger for the identified malicious ve-
in relation to the computational time is not suitable for the hicles that will propagate through the network in order to
real-time plate recognition application introduced in this pa- minimize the effect of these vehicles in the network.
per. 4.There has to be a mechanism for vehicles to recover from
being tagged as malicious. Moreover, the reputation of
III. GENERAL ARCHITECTURE whether a vehicle is malicious or not should be a consensus
The general architecture can be displayed through the in order to avoid a benign vehicle being tagged as malicious
flowchart shown in figure 2. The protocol steps can be ex- by an actual malicious vehicle.
plained as follows: It is important to note that the vehicle decides the location
of the vehicle using two main approaches:
1.Each vehicle takes photos of the surrounding vehicles li-
1. Through image recognition: In which we estimate the
cense plates, and start recognizing the plate numbers and
distance of the vehicle based on the image analysis.
how far away the vehicle is. The method of achieving that
2. Using RSSI readings from the received packet.
has been explained in the previous section.
These two values are compared to each other, and since the
2.Once a vehicle receives a packet, it needs to identify vehicle continuously move, it is important that the vehicle
which packet belongs to which license plate. This is chal- captures images and analyzes them continuously in order to
lenging in case the license plate is on the further side from be ready to amend the identity of the vehicle to the packet
the camera, and hence, the picture is not obtainable. Another once it is analyzed.
case that might be challenging is if there are two vehicles In the following section, we will discuss the details of the
that are very close to each other, and hence, it will be diffi- protocol. We will start by discussing the license plate recog-
cult to distinguish which one is the source of the packet. In nition, following that, we will discuss the actual protocol and
order to decide which vehicle is the source of the packet, the how the messages are tagged. Also, we will be discussing the
RSSI measurements are used in order to estimate the dis- different cases.
tance of the packet source (the vehicle) from the destination.
This distance is also compared based on the image recogni-
tion and deciding the distance of the vehicle from the cam-
era, and based on both values of distance and RSSI, the ve-
hicle source of the packet is identified.
3.The packet is analyzed in order to determine whether it
is malicious or not. This can be achieved based on the con-
tent of the packet and data being sent. For example, if the
packet is supposed to be including traffic data, then this data
can be verified using data from other vehicles. However, this
framework introduced in this paper can function with any
malicious data analysis algorithm.
4.Once the packet is considered either malicious or benign,
the license plate is given a score.
5.The score along with the license plate is saved in a ledger
and this ledger is distributed throughout the network.
6.This plate along with the score is broadcasted with a
timestamp of when the original packet was received.

Fig. 3 Vehicle Schematic diagram for the enhancement process of the cap-
tured plate image.

The important aspect of converting the image to grayscale


is to preserve the contrast, sharpness, shadow and the image
structure. Successful conversion of the image to grayscale
helps reduce the cost and simplifies the techniques required
for the plate recognition, as demonstrated in figure 4. Follow-
ing this, a Gaussian filter is applied in the frequency domain,
which is a low pass filter which passes the low frequency sig-
nals and cut-off high frequencies. The main use of this filter
is to reduce or remove the noise and blurring of the image and
to offer edge positioning uprooting.

Fig. 4 Vehicle Converting the vehicle image to Grayscale.

After that, Sobel edge detection is applied to the picture,


which has the responsibility of identifying the edges of the
image. This step is crucial to the success of the segmentation
and the optical character recognition processes. The Sobel op-
erator used is a discrete differential operator, and it uses two
different 3*3 kernels, one for the x-direction and the other one
for the y-direction. We apply the two kernels on the image to
approximate the derivatives in horizontal and vertical
Fig. 2 Flowchart detailing the protocol. changes, and at each point, we can calculate magnitude with
G = |Gx| + |Gy|. Figure 5(a) shows the license plate before
IV. LICENSE NUMBER RECOGNITION applying the Sobel filter, while figure 5(b) shows the plate af-
The automated plate recognition is based on several main steps: the first is the enhancement of the image used to detect the plate; the second is the detection of the plate area in the image; the third is the segmentation of every character in the plate; and the final step is optical character recognition.
The objective of the enhancement technique is to improve the original image so that it is more suitable for the application. There are several ways to enhance an image, depending on the application. In this system, we first convert the image to grayscale; secondly, we apply a Gaussian filter; thirdly, we apply a Sobel filter, followed by the application of a thresholding method; finally, we apply a morphological operation and Canny edge detection. The different steps are shown in figure 3.

Fig. 5 The plate image, (a) before applying the Sobel filter, and (b) after applying the Sobel filter.
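A compact OpenCV sketch of this enhancement chain (grayscale, Gaussian smoothing, Sobel gradient magnitude with G = |Gx| + |Gy|, thresholding at 70% of the maximum, morphological closing, and Canny edges); kernel sizes and the morphological step parameters are our assumptions, since the paper does not give exact values:

```python
import cv2
import numpy as np

def enhance_plate_image(path):
    """Prepare a captured vehicle image for plate detection."""
    gray = cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2GRAY)
    blurred = cv2.GaussianBlur(gray, (5, 5), 0)                 # low-pass noise removal
    gx = cv2.Sobel(blurred, cv2.CV_64F, 1, 0, ksize=3)          # x-direction kernel
    gy = cv2.Sobel(blurred, cv2.CV_64F, 0, 1, ksize=3)          # y-direction kernel
    grad = np.uint8(np.clip(np.abs(gx) + np.abs(gy), 0, 255))   # G = |Gx| + |Gy|
    _, binary = cv2.threshold(grad, 0.7 * grad.max(), 255, cv2.THRESH_BINARY)
    closed = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, np.ones((3, 3), np.uint8))
    edges = cv2.Canny(closed, 100, 200)                         # final edge map
    return edges
```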

A thresholding technique has to be applied as shown in fig-


ure 6, where all image values higher than 70% of the maxi-
mum value are settled to “1”, while other values are settled to

“0”. The importance of the thresholding process is that it con- Accuracy 79.84% 80.8% 88%
verts the image to a bi-level picture by using an ideal edge as
described in figure 7. It is worth mentioning that the first technique is applying non-
linear support vector machines through a radial basis function.
The second technique is directly applying a template match-
ing on the captured number plate image to reduce the compu-
Fig. 6 Thresholding process. tational time. While the proposed method is applying a tem-
plate matching on a filtered (by Guassian filter) and smoothed
(by thresholding and Canny edge detection process)version of
the number plate image. The accuracy represents how the
technique provided correct number plate recognition out of all
the test images.
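A hedged sketch of such a template-matching recognition step over a database of character templates; the directory layout, image handling, and the use of normalized cross-correlation are assumptions on our part, as the paper only states that pixel-by-pixel template matching is applied:

```python
import cv2
import os

def recognize_character(char_img, template_dir="templates"):
    """Return the label of the best-matching template for one segmented
    grayscale character image (e.g. files named '7.png', 'alef.png', ...)."""
    best_label, best_score = None, -1.0
    for fname in os.listdir(template_dir):
        template = cv2.imread(os.path.join(template_dir, fname), cv2.IMREAD_GRAYSCALE)
        resized = cv2.resize(char_img, (template.shape[1], template.shape[0]))
        score = cv2.matchTemplate(resized, template, cv2.TM_CCOEFF_NORMED)[0][0]
        if score > best_score:
            best_label, best_score = os.path.splitext(fname)[0], score
    return best_label, best_score
```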

Fig. 7 Output of the thresholding process. V. PROTOCOL


Canny edge detection is employed for detecting the edges In this section, we will elaborate on each step and how it is
in the image as shown in figure 8. However, before detecting achieved. The first step has already been explained in the pre-
the letters from the plate, the plate itself needs to be detected. vious section.
This is achieved by finding the contours of the image and A. Packet Source Identification
resizing it. The resizing that we use in this method is 179*91
pixel. The final step in our method is the plate recognition In order to be able to identify the packet source, the packet
which is composed of the OCR segmentation of the detected has to be matched with one of the vehicles that we have ob-
characters and apply template matching on them to find the tained a picture of. This picture might include a license plate
associated character(s) in the database. The database consists number, or it might not depending on whether the license
of all possible Arabic letters and digits, as displayed in figure plate is located on the front end or the rear end of the vehicle.
9. If the license plate is located in the image, then we will esti-
mate the distance from the vehicle in the image using image
processing techniques as explained in the previous section.
From the packet received, we can detect the RSSI signal, and
based on that, estimate the distance from the source. This dis-
tance estimate is compared with the license plate distance and
the vehicle source of the packet is identified.
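The matching of a received packet to an observed plate can be sketched as below. The RSSI-to-distance conversion here uses a log-distance path-loss model, which is an assumption on our part; the paper only states that a distance is estimated from the RSSI and compared with the camera-based estimate:

```python
def rssi_to_distance(rssi_dbm, rssi_at_1m=-45.0, path_loss_exp=2.7):
    """Rough distance estimate (metres) from a log-distance path-loss model."""
    return 10 ** ((rssi_at_1m - rssi_dbm) / (10.0 * path_loss_exp))

def match_packet_to_plate(packet_rssi, observed_plates, tolerance_m=3.0):
    """observed_plates: list of (plate_number, camera_distance_m) pairs.
    Returns the plate whose camera-based distance best agrees with the
    RSSI-based distance, or None if no observation is close enough."""
    rssi_dist = rssi_to_distance(packet_rssi)
    best = min(observed_plates, key=lambda p: abs(p[1] - rssi_dist))
    return best[0] if abs(best[1] - rssi_dist) <= tolerance_m else None
```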
Fig. 8 Output of canny edge detection. In the other scenario, if the license plate is not visible, then
we have one of two scenarios. Either there is another third
neighboring vehicle that can detect the license plate (case A),
or there is none (case B).
• Case A: This case can be illustrated using figure 9. In
this case, receiving vehicle (vehicle C) can clearly detect
the license plate of the transmitting vehicle (vehicle B).
Fig. 9 The OCR segmentation and digits/letters extraction of the number In this scenario, there is no problem and no need for the
plate. presence of other vehicles in the vicinity.
The accuracy of the proposed plate recognition technique • Case B: This case is more problematic since the receiv-
by OCR segmentation found to be higher than other ap- ing vehicle (vehicle C) cannot detect the transmitting ve-
proaches which implement either support vector machine or hicles’ (vehicle B) license plate. In this scenario, vehicle
template matching algorithm, as summarized in Table 1. C will have to use the help of a third vehicle within the
TABLE I vicinity (vehicle A). Vehicle A will follow the first step
COMPARISON OF THE RESULTS OBTAINED BY THE PROPOSED TECHNIQUE has this information in its local database, vehicle C will
WITH OTHER TECHNIQUES WHICH APPLY SUPPORT VECTOR MACHINE AND surrounding vehicles and recognizes the plate numbers,
TEMPLATE MATCHING IN THE RECOGNITION STEP.
and the location of the vehicle. Since vehicle A already
has this information in its local database, vehicle C will
First Tech- Second
The pro- send a request to vehicle A to ask for the plate number
Techniques posed of the vehicle that is present at an estimated location at
nique Technique
technique a specific timestamp. Vehicle A will search in its local
database to find the matching vehicle, and will send this
Detection Method Vertical Comparing Find con- information back to vehicle C.
Edge Detec- width by tours
tion width factor

Recognition Support Template OCR


Method Vector Ma- Matching
chine

from the transmitting vehicle. RSS is the strength of the elec-
tromagnetic signal that attenuates with distance. The further
the signal travels, the more the signal attenuates. This can be
demonstrated using experimental results as shown in figure
10. This experiment was done using Tmote Sky sensors in an
indoor environment. The experiment was repeated 100 times
in different indoor environments with different locations from
walls and reflective surfaces.
In these experimental results, we utilized two wireless sensor
devices to transmit wireless signals in an indoor environment
to simulate the reflection of wireless signals which is present
due to the reflection/diffraction of the wireless signals off of
Case A Case B
vehicle metal body.
Figure 9. Different possible scenarios of license plate locations on the transmitting vehicle.

As shown in the figure, the strength of the electromagnetic signal attenuates with the distance. Although the attenuation is not uniform, the attenuation of the signal is clear and can be used to estimate the location of the source of the
vehicle.
It is important to note that the vast majority of vehicles are
B. Plate number detection and vehicle location estimation
equipped with global positioning systems (GPS), which pro-
The plate number detection happens according to the pre- vide an accurate location of the vehicle. However, the depend-
vious section. However, after detecting the vehicle plate num- ence on the RSSI for estimating the location of the vehicle is
ber, it is imperative that the vehicle that captures the image is based on the assumption that malicious vehicle can modify
able to estimate the location of the vehicle in the picture. This the GPS location amended with the packet. On the other hand,
includes two elements: in the case of RSSI localization, the transmitting vehicle is not
• The orientation of the camera on the vehicle involved in its localization and it is fully dependent on the
• The resolution of the camera, through which we can es- receiving vehicle.
timate the distance of the vehicle.
Based on these two elements, it becomes feasible to detect
both the distance of the vehicle and the angle.

C. Local Database
When a vehicle detects a license plate, it saves the location
of the vehicle with the timestamp the location was detected,
and the plate number in a local database. There are periodic
operations that take place in the database in order to keep it Fig. 10 Received Signal Strength attenuation with distance
up to date. These operations are:
• The timestamp is checked regularly. If the timestamp is VI. FUTURE WORK
before a certain time threshold, this entry in the table is
purged. This threshold depends on the density and the The protocol presented in this paper needs further verifica-
vehicles and velocity of the vehicles in the vicinity. If tion through being implemented in a test-bed in order to test
the velocity is low and the density of the vehicles is high, for different factors such as the speed of plate recognition as
then the location of the vehicle will be difficult to ex- compared to the speed of vehicle movement. It is important in
trapolate. However, if the velocity is high and the den- this protocol that the vehicle plate recognition is done and
sity is low, then it becomes feasible to predict the loca- amended to the packet being transmitted before the distance
tion of the vehicle. The density of the network can be separating the two vehicles becomes larger than the commu-
inferred from the number of packets received at the nication range between them. This can be tested by installing
source directly, or are being forwarded. If the number of a camera on the vehicles and testing the accuracy of the recog-
unique packets is high, then the density is high and vice- nition during movement, and speed.
versa.
• The new entries in the table are communicated to the VII. CONCLUSIONS
surrounding vehicles in order to copy the entries. VANETs are becoming a reality in the technology world,
These two operations ensure that the vehicles collaborate and they have a large impact on both the technology and hu-
with each other in order to detect the plate numbers. man life. Malicious vehicles pose a serious threat to the secu-
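A minimal sketch of such a local database entry and its periodic purge; the threshold value and the dictionary layout are illustrative, since the paper leaves the threshold to the density and velocity heuristic described above:

```python
import time

class PlateDatabase:
    """Local per-vehicle store of (plate, location, timestamp) observations."""

    def __init__(self, max_age_s=10.0):
        self.max_age_s = max_age_s           # tuned from vehicle density/velocity
        self.entries = {}                    # plate -> {"location": ..., "ts": ...}

    def observe(self, plate, location):
        self.entries[plate] = {"location": location, "ts": time.time()}

    def purge(self):
        # Drop entries older than the threshold, as described above.
        now = time.time()
        self.entries = {p: e for p, e in self.entries.items()
                        if now - e["ts"] <= self.max_age_s}

    def new_entries_since(self, ts):
        # Entries to be communicated to the surrounding vehicles.
        return {p: e for p, e in self.entries.items() if e["ts"] > ts}
```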
rity and safety of both the vehicles and the personnel using it,
and the consequences of this threat are dire. In order to over-
D. RSSI Measurement come and deal with malicious vehicles, we propose a novel
The relative location of the vehicle is estimated from the framework that includes both a communication protocol as
packet received using the received signal strength (RSSI) well as image processing algorithms in which the vehicle is
tagged based on its license plate. Once the license plate is

tagged as malicious or non-malicious, this information is 20. A. Thomas, “Reducing air pollution in cairo: Raise user costs and in-
vest in public transit.” Available: https://erf.org.eg/publications/reduc-
propagated through the network. We discussed the different
ing-air-pollution-in-cairo-raiseuser-costs-and-invest-in-public-transit,
components of our proposed system, and we presented the 2018
feasibility of this framework through actual experiments. Fu- 21. Cho, Woong, Sang In Kim, Hyun kyun Choi, Hyun Seo Oh, and Dong
ture work will be required to rigorously test this system in ac- Yong Kwak. "Performance evaluation of V2V/V2I communications:
The effect of midamble insertion." In 2009 1st International Confer-
tual vehicle networks and with random malicious nodes.
ence on Wireless Communication, Vehicular Technology, Information
However, the framework is a step towards a reliable distrib- Theory and Aerospace & Electronic Systems Technology, pp. 793-797.
uted framework to handle malicious vehicles. IEEE, 2009.

REFERENCES
1. Flurscheim H , “Patent No. (US 1612427 A) 28522/23”. UK, 1925
2. NHTSA. “Vehicle-to-vehicle communication.” Available:
https://www.nhtsa.gov/technology-innovation/vehicle-vehiclecom-
munication, 2015
3. ERPINNEWS, “Fog computing vs edge computing”, Available:
https://erpinnews.com/fog-computing-vs-edge-computing, 2018
4. Li, F., & Wang, Y. “Routing in vehicular ad hoc networks: A sur-
vey”. IEEE Vehicular technology magazine, 2007.
5. E. Schoch, F. Kargl, and M. Weber, “Communication patterns in
vanets,” IEEE Communications Magazine, vol. 46, no. 11, 2008.
6. Naumov, V., & Gross, T. R. “Connectivity-aware routing (CAR) in
vehicular ad-hoc networks". In INFOCOM 2007. 26th IEEE Interna-
tional Conference on Computer Communications. IEEE , p: 1919-
1927
7. W. Pires, T. H. de Paula Figueiredo, H. C. Wong, and A. A. F.
Loureiro, “Malicious node detection in wireless sensor networks,” in
Parallel and distributed processing symposium, 2004. Proceedings.
18th international. IEEE, 2004, p. 24.
8. Carman, D. W., Kruus, P. S., & Matt, B. J. “Constraints and ap-
proaches for distributed sensor network security." DARPA Project re-
port,(Cryptographic Technologies Group, Trusted Information Sys-
tem, NAI Labs), 2000.
9. Raw, Ram Shringar, Manish Kumar, and Nanhay Singh. "Security
challenges, issues and their solutions for VANET." International jour-
nal of network security & its applications, 2013.
10. Hao, Yong, Yu Cheng, and Kui Ren. "Distributed key management
with protection against RSU compromise in group signature based
VANETs." In IEEE GLOBECOM 2008-2008 IEEE Global Telecom-
munications Conference, pp. 1-5. IEEE, 2008.
11. Golle, Philippe, Dan Greene, and Jessica Staddon. "Detecting and cor-
recting malicious data in VANETs." In Proceedings of the 1st ACM
international workshop on Vehicular ad hoc networks, pp. 29-37.
ACM, 2004.
12. Praba, V. Lakshmi, and A. Ranichitra. "Isolating malicious vehicles
and avoiding collision between vehicles in VANET." In 2013 Interna-
tional Conference on Communication and Signal Processing, pp. 811-
815. IEEE, 2013.
13. Perkins, Charles, Elizabeth Belding-Royer, and Samir Das. Ad hoc on-
demand distance vector (AODV) routing. No. RFC 3561. 2003.
14. Marti, S., T. Giuli, K. Lai, and M. Baker. "Mitigating routing misbe-
havior in ad hoc networks." In Proceedings of MOBICOM 2000.
15. Studer, Ahren, Elaine Shi, Fan Bai, and Adrian Perrig. "TACKing to-
gether efficient authentication, revocation, and privacy in VANETs."
In 2009 6th Annual IEEE Communications Society Conference on
Sensor, Mesh and Ad Hoc Communications and Networks, pp. 1-9.
IEEE, 2009.
16. Haas, Jason J., Yih-Chun Hu, and Kenneth P. Laberteaux. "Design and
analysis of a lightweight certificate revocation mechanism for
VANET." In Proceedings of the sixth ACM international workshop on
VehiculAr InterNETworking, pp. 89-98. ACM, 2009.
17. Rezgui, Jihene, and Cédryk Doucet. "Detection of malicious vehicles
with demerit and reward level system." In 2017 International Sympo-
sium on Networks, Computers and Communications (ISNCC), pp. 1-
6. IEEE, 2017.
18. Shidore, M. M., and S. P. Narote. "Number plate recognition for indian
vehicles." IJCSNS International Journal of Computer Science and
Network Security , 2011, p: 143-146.
19. A. Puranic, D. K. T., and U. V., “Article: Vehicle number plate recog-
nition system: A literature review and implementation using template
matching,” International Journal of Computer Applications, vol. 134,
no. 1, pp. 12– 16, January 2016, published by Foundation of Computer
Science (FCS), NY, USA.

Cascaded Layered Recurrent Neural Network for
Indoor Localization in Wireless Sensor Networks
1st Hamza Turabieh 2nd Alaa Sheta
Information Technology Department Computer Science Department
CIT collage, Taif University Southern Connecticut State University
Taif, KSA New Haven, CT 06515, USA
h.turabieh@tu.edu.sa shetaa1@southernct.edu

Abstract—The growth in using various smart wireless devices methods necessitate expensive site surveys to gather fingerprint
in the last few decades has given rise to indoor localization service data for localizing mobile device. The dynamic nature of
(ILS). Indoor localization is defined as the process of locating a fingerprint information in indoor wireless environments makes
user location in an indoor environment. Indoor device localization
has been widely studied due to its popular applications in public the problem even complicated and computationally expensive.
settlement planning, health care zones, disaster management, the In [14], the author provided a comparison between several
implementation of location-based services (LBS) and the Internet deterministic localization methods. They include Non-Linear
of Things (IoT). The ILS problem can be formulated as a learning Regression (NLR), Iterative Non-Linear Regression (INLR),
problem utilizing Wi-Fi technology. The measured Wi-Fi signal Least Squares (LS), Random Sample Consensus (RANSAC)
strength can be used as an indication of the distribution of
users in a various indoor location. Developing a classification and Trilaterate on Minima (ToM). A data set was collected
model with high accuracy can be achieved using a machine from real environments over a space of size 550 m2 . The
learning approach. Artificial Neural Network is one of the most finding proves that NLR is the best approach. The full
successful trends in machine learning. In this article, we provide availability and accessibility of smart phones and wearable
our initial idea of using Cascaded Layered Recurrent Neural devices that adopt wireless communication feature have made
Network (L-RNN) for the classification of user localization in
an indoor environment. Several neural network models were the localization and pursuing such devices much more acces-
trained, with the best performance attainment is reported. The sible. Dissimilar to most outdoor GPS navigation systems.
experimental results marked that the presented L-RNN model is In many cases, the fingerprints are repeated owing to the
highly accurate for indoor localization and can be utilized for available Access Points (APs) and interference, which make a
many applications. duplication of the matched patterns and the user’s fingerprint.
Index Terms—Layered Recurrent Neural Network, User Lo-
calization, Indoor Environment, Prediction. Thus, improving the classification performance and reducing
the computation cost of the WiFi indoor localization systems
is urgently needed.
I. I NTRODUCTION
Traditional fingerprinting consists of two steps (i) Offline
Sensor nodes localization is an essential task for numerous step, where the fingerprint database is created at an early
emerging applications of wireless sensor networks (WSNs) stage, and (ii) Online step, where user position is determined
such as precision agriculture, forest monitoring, home security, based on Received Signal Strength (RSS). Comparing the
smart buildings, health monitoring and many others [1], [2]. current RSS with stored RSS signal to determine user location
Precise estimation of the sensor node location is vital for is a time-consuming approach and not works well in-case
the effectiveness of location-aware services. In the past few of changing building infrastructure [15]. As a result, finding
decades, the Indoor Localization Service has become one of the
hot research topics [3]–[5]. User and device localization has found
many applications in areas such as health sector, disaster RSS database and not influence by changing the building
management [6], [7], Internet of Things (IoT) [8], [9], smart infrastructure.
cities [10], [11], and smart buildings [12], [13]. Currently, we The WiFi indoor localization systems based on machine
still do not have a reliable and accurate indoor localization learning methods are broadly used in the literature. Several
system that can provide an exact location for a person. We machine learning techniques for indoor localization were pro-
cannot, for example, navigate persons at home or offices using posed such as Nearest Neighbor (NN [16], K-Nearest Neigh-
Google Maps. Recently, the proliferation of smart phones and bor (KNN) [17], [18], Artificial Neural Networks (ANNs)
other mobile devices made indoor localization more possible [19], Support Vector Regression (SVR) [20] and Deep Neural
for enabling location-based services. Networks [21]. In [22] authors developed a detailed study in a
Today, emerging indoor positioning systems is significantly real environment by exploring a number of ANN-based meth-
studied due to the increasing demands on universal posi- ods such as Radial Basis Function (RBF), Multi-Layer Per-
tioning. Most indoor wireless sensor network localization ceptron (MLP), Recurrent Neural Networks (RNN), Position-



Velocity (PV), Position Velocity Acceleration (PVA) and Reduced Radial Basis Function (RRBF). They show that RBF can outperform the other ANN types; however, RBF requires more computational and memory resources than the other methods. Khatab et al. [23] applied an autoencoder-based deep extreme learning machine to estimate RSS values; the obtained results show a notable improvement in indoor user localization when high-level extracted features are used and the number of training samples is increased. Wu et al. [24] applied channel state information (CSI) and a Naive Bayes (NB) classifier for passive indoor localization. Haider et al. [25] proposed pre- and post-processing techniques combined with deep learning classifiers to estimate indoor user location; the obtained results show excellent performance of the proposed approach even with missing RSS values. Sun et al. [26] proposed Gaussian process regression models to predict the spatial distribution of RSS for indoor localization systems.

The main contribution of this paper is a new cascaded-architecture neural network method that accurately predicts indoor user location with minimal computational time. A cascaded layered recurrent neural network is proposed to handle the localization problem. The proposed approach is tested using two data sets provided by the International Conference on Indoor Positioning and Indoor Navigation (IPIN) [27].

The rest of this paper is organized as follows: Section II presents the layered recurrent neural network approach. Section III presents the experimental data sets used in this paper. Section IV shows the experimental results and analysis of the proposed approach. Section V draws the conclusions and future work.

Fig. 1. Wi-Fi indoor localization system.

II. LAYERED RECURRENT NEURAL NETWORK

One of the most successful ML algorithms is the Artificial Neural Network (ANN), which simulates the behavior of the human brain. Six general types of ANN are used by researchers: the Feedforward Neural Network (FFNN) [28], the Radial Basis Function Neural Network (RBFNN) [29], the Kohonen Self-Organizing Neural Network (KSONN) [30], the Recurrent Neural Network (RNN) [31], the Convolutional Neural Network (CNN) [32], and the Modular Neural Network [33]. An ANN is able to learn from input data even if the data distribution is inaccurate or incomplete. After training, the ANN can be used directly for prediction or classification purposes [34]. ANNs have been used successfully in different areas such as optimization problems, image processing, robotics, and forecasting [35]. In this work, we adopt the Layered-RNN (L-RNN) structure for indoor user localization.

An L-RNN can handle complex problems if it is well trained. The main advantage of the L-RNN is its dynamic memory: information can be memorized temporarily inside the L-RNN model. The learning process inside the L-RNN follows a time-varying pattern by applying either feedback or feedforward connections, so the L-RNN output depends on the calculations from the input data at the current step and on the output from previous steps. The final output of the L-RNN therefore depends on both previous and current input data [34].

Figure 2 shows a simple example of an L-RNN model and demonstrates its output at time t. Given an input sequence L = (L_1, ..., L_t), the L-RNN evaluates the hidden sequence P = (P_1, ..., P_t) and the output sequence y = (y_1, ..., y_t) as shown in Equations (1) and (2):

    P_t = f(W_hL * L_t + W_PP * P_{t-1} + b_P)      (1)
    y_t = f(W_yh * P_t + b_y)                       (2)

where f(.) is the activation function (a sigmoid function). The L-RNN has three weight matrices:
  1) W_hL holds the connection weights between the input and hidden layers,
  2) W_PP holds the recurrent weights between the hidden layer and itself, and
  3) W_yh holds the connection weights between the hidden and output layers.
The bias parameters b_P and b_y are represented as vectors to simplify the learning process for each recurrent neuron.

Fig. 2. The L-RNN model.
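To make the recurrence in Equations (1) and (2) concrete, the following minimal NumPy sketch evaluates one L-RNN step. The dimensions, the random initialization, and the helper name lrnn_step are illustrative assumptions, not the authors' implementation.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def lrnn_step(L_t, P_prev, W_hL, W_PP, W_yh, b_P, b_y):
        """One recurrent step following Equations (1) and (2)."""
        P_t = sigmoid(W_hL @ L_t + W_PP @ P_prev + b_P)   # Eq. (1): hidden state
        y_t = sigmoid(W_yh @ P_t + b_y)                   # Eq. (2): output
        return P_t, y_t

    # Illustrative sizes: 520 RSS inputs (UJIIndoorLoc), 260 hidden units, 1 output.
    rng = np.random.default_rng(0)
    n_in, n_hid, n_out = 520, 260, 1
    W_hL = rng.normal(scale=0.01, size=(n_hid, n_in))
    W_PP = rng.normal(scale=0.01, size=(n_hid, n_hid))
    W_yh = rng.normal(scale=0.01, size=(n_out, n_hid))
    b_P, b_y = np.zeros(n_hid), np.zeros(n_out)

    P = np.zeros(n_hid)                        # initial hidden state
    for L_t in rng.normal(size=(3, n_in)):     # a toy 3-step input sequence
        P, y = lrnn_step(L_t, P, W_hL, W_PP, W_yh, b_P, b_y)
    print(y.shape)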
III. EXPERIMENTAL DATA

To evaluate the proposed approach, two public data sets are employed in this paper. The first data set estimates the user location in terms of building and floor, while the second estimates the user location inside a single floor. The following subsections describe each data set briefly. For the indoor user localization problem, the input is a matrix of RSS values and the output is the estimated location. We propose a cascaded L-RNN model that estimates both the building number and the floor number from the RSS values; Figure 3 presents a pictorial diagram of the proposed model.
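As a rough illustration of the cascade in Figure 3, the sketch below chains two independently trained estimators so that the predicted building identifier is appended to the RSS vector fed to the floor estimator. The scikit-learn MLP classifiers only stand in for the two L-RNN stages and are an assumption, not the authors' code.

    import numpy as np
    from sklearn.neural_network import MLPClassifier

    def train_cascade(X_rss, building, floor):
        """Stage 1 predicts the building; stage 2 sees RSS plus the predicted building."""
        building_clf = MLPClassifier(hidden_layer_sizes=(X_rss.shape[1] // 2,),
                                     max_iter=1000).fit(X_rss, building)
        b_hat = building_clf.predict(X_rss).reshape(-1, 1)
        floor_clf = MLPClassifier(hidden_layer_sizes=(X_rss.shape[1] // 2,),
                                  max_iter=1000).fit(np.hstack([X_rss, b_hat]), floor)
        return building_clf, floor_clf

    def predict_cascade(building_clf, floor_clf, X_rss):
        b_hat = building_clf.predict(X_rss).reshape(-1, 1)
        f_hat = floor_clf.predict(np.hstack([X_rss, b_hat]))
        return b_hat.ravel(), f_hat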
A. UJIIndoorLoc data set

The UJIIndoorLoc data set was introduced in 2014 at the International Conference on Indoor Positioning and Indoor Navigation (IPIN) [27]. It is constructed from Received Signal Strength Indicator (RSSI) values collected from 520 access points and covers the buildings and floors of a university campus. The campus consists of three buildings, and each building has a different number of floors


in the range [0, 4]; missing measurements are denoted by +100 dBm. The training set consists of 19,937 samples, while the testing set consists of 1,111 samples. Table I shows sample records of the UJIIndoorLoc data set. The data set is available on the UC Irvine Machine Learning Repository website [36].

TABLE I. SAMPLE DATA FOR THE UJIINDOORLOC DATA SET

    WAP1   WAP2   WAP3   ...   WAP520   Floor   Building
    -97    -75    +100   ...   +100     1       0
    +100   -85    -70    ...   -20      0       1
    +100   -95    +100   ...   +100     4       2
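Before feeding the RSS matrix to a network, the +100 dBm placeholders for unheard access points are usually replaced with a very weak signal value and the features scaled. The snippet below is a hedged preprocessing sketch: the -105 dBm fill value, the min-max scaling, and the column names (following the UCI copy of the data set) are our assumptions rather than choices stated in the paper.

    import pandas as pd

    def prepare_ujiindoorloc(csv_path):
        """Load a UJIIndoorLoc-style CSV: 520 WAP columns plus FLOOR/BUILDINGID labels."""
        df = pd.read_csv(csv_path)
        rss = df.filter(regex=r"^WAP").to_numpy(dtype=float)
        rss[rss == 100] = -105.0                             # unheard AP -> assumed very weak signal
        rss = (rss - rss.min()) / (rss.max() - rss.min())    # min-max scaling to [0, 1]
        return rss, df["BUILDINGID"].to_numpy(), df["FLOOR"].to_numpy()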
B. Wireless Indoor Localization data set

This data set was collected in a single office in Pittsburgh, western Pennsylvania. The office has seven Wi-Fi routers and four rooms, and the RSS was collected to determine the user location inside the office; in some locations the RSS values were collected every second. The data set consists of 2,000 samples, which we divided into a training set (70%) and a testing set (30%). This data set only predicts the room inside the office. Table II shows sample records of the Wireless Indoor Localization data set. The data set is available on the UC Irvine Machine Learning Repository website [36].

TABLE II. SAMPLE DATA FOR INDOOR LOCALIZATION BASED ON WI-FI SIGNAL STRENGTH

    WS1   WS2   WS3   WS4   WS5   WS6   WS7   Room
    -64   -56   -61   -66   -71   -82   -81   1
    -68   -57   -61   -65   -71   -85   -85   1
    -63   -60   -60   -67   -76   -85   -84   1
    -42   -53   -62   -38   -66   -65   -69   2
    -44   -55   -61   -41   -66   -72   -68   2
    -41   -58   -56   -40   -73   -69   -73   2
    -54   -53   -54   -50   -63   -79   -77   3
    -50   -56   -54   -50   -71   -79   -77   3
    -49   -57   -52   -51   -60   -89   -83   3
    -58   -56   -47   -62   -36   -85   -84   4
    -61   -52   -49   -56   -46   -84   -83   4
    -55   -50   -51   -61   -48   -82   -79   4

IV. EXPERIMENTAL RESULTS AND ANALYSIS

Several experiments were performed using MATLAB R2014a to examine the performance of the proposed L-RNN in classifying indoor user locations. The tuning parameters of the L-RNN are provided in Table III; the parameter values were chosen after a number of tuning experiments.

TABLE III. PARAMETER SETTINGS FOR THE L-RNN

    Parameter                               Value
    Number of iterations                    1000
    Number of neurons in the input layer    number of RSS values
    Number of neurons in the hidden layer   number of RSS values / 2
    Number of neurons in the output layer   1
    Threshold of the transfer output        0.5

Table IV summarizes the accuracy results for each data set.

Fig. 3. Proposed Cascaded L-RNN model for indoor localization.

TABLE IV. STATISTICAL RESULTS OVER 11 RUNS BASED ON ACCURACY VALUE (%)

               UJIIndoorLoc            Wireless Indoor Localization
               Building    Floor       Room
    Average    96.45       87.41       95.80
    Best       97.80       89.30       96.30
    Worst      94.63       84.80       96.00
    Std.        0.85        0.83        0.60

The performance of the L-RNN on the UJIIndoorLoc data set is more accurate for building estimation than for floor estimation. The main reason that the L-RNN reaches only 89.30% floor accuracy on the UJIIndoorLoc testing data is that the estimated building number is used as an input to the L-RNN model that estimates the floor number; although the building estimation accuracy is 97.8%, its errors propagate and reduce the accuracy of the floor estimation stage. Overall, the average estimation accuracy for Building and Floor (B&F) is 91.8%. The performance of the L-RNN on the Wireless Indoor Localization data set is outstanding for two reasons: (i) the data set contains only 2,000 samples, which is 93.55% smaller than the UJIIndoorLoc data set, and (ii) the data set has no missing values. Table IV also shows the statistical results of the proposed approach over 11 runs; the low standard deviation values indicate that the performance of the L-RNN approach is stable.

Figure 4 shows the performance of the L-RNN model during training. The L-RNN converges within 143 iterations. This fast convergence is due to the ability of the L-RNN to learn by generating various abstract representations of the data. A further advantage is that the network structure can be expanded deeper to cope with the requirements of the modeling problem.

Fig. 4. L-RNN convergence process (mean squared error per epoch for the training, validation, and test sets; best validation performance 0.037294 at epoch 43, training stopped after 143 epochs).

A. Comparison

In this section, we compare the proposed L-RNN with several methods reported in the literature.
  - Table V compares the proposed approach with state-of-the-art methods based on the average accuracy value. Our method ranks second for the UJIIndoorLoc data set.
  - Table VI compares the results for the Wireless Indoor Localization data set with other methods in the literature. Our method outperforms all reported results and ranks first. A large campus with many buildings and multiple floors will

increase the complexity of the indoor user localization problem. As a result, deep learning algorithms such as the L-RNN will be more applicable to such problems than traditional machine learning algorithms.

TABLE V. COMPARISON WITH STATE-OF-THE-ART METHODS BASED ON THE AVERAGE ACCURACY VALUES FOR THE UJIINDOORLOC DATA SET

    Rank   Approach                Average accuracy (%)
    1      CNN [37]                95.41
    2      Cascaded L-RNN          93.55
    3      Scalable DNN [38]       92.89
    4      SAE + classifier [39]   91.10

TABLE VI. COMPARISON WITH STATE-OF-THE-ART METHODS BASED ON THE AVERAGE ACCURACY VALUES FOR THE WIRELESS INDOOR LOCALIZATION DATA SET

    Rank   Approach                Average accuracy (%)
    1      Cascaded L-RNN          96.30
    2      FPSPGSA-NN [40]         95.16
    3      SVM [40]                92.68
    4      Naïve Bayes [40]        90.47
    5      PSOGSA-NN [40]          83.28
    6      GSA-NN [40]             77.53
    7      PSO-NN [40]             64.66

V. CONCLUSION AND FUTURE WORKS

In this paper, we proposed a cascaded layered recurrent neural network to predict indoor user location using Wi-Fi fingerprinting. The L-RNN was examined using two public data sets. A set of experiments was performed, and the obtained results show that the L-RNN works well with either a small or a large number of samples and achieves high accuracy on the indoor localization problem. Future work will investigate the exact position of indoor users (their real location inside a floor or room) and will evaluate different machine learning methods such as the Convolutional Neural Network (CNN) and the Modular Neural Network (MNN).

REFERENCES
 [1] B. Rashid and M. H. Rehmani, "Applications of wireless sensor networks for urban areas," J. Netw. Comput. Appl., vol. 60, no. C, pp. 192–219, Jan. 2016.
 [2] S. R. J. Ramson and D. J. Moni, "Applications of wireless sensor networks — a survey," in 2017 International Conference on Innovations in Electrical, Electronics, Instrumentation and Media Technology (ICEEIMT), Feb 2017, pp. 325–329.
 [3] M. Kwak, Y. Park, J. Kim, J. Han, and T. Kwon, "An energy-efficient and lightweight indoor localization system for internet-of-things (iot) environments," Proc. ACM Interact. Mob. Wearable Ubiquitous Technol., vol. 2, no. 1, pp. 17:1–17:28, Mar. 2018. [Online]. Available: http://doi.acm.org/10.1145/3191749
 [4] E. Martin, O. Vinyals, G. Friedland, and R. Bajcsy, "Precise indoor localization using smart phones," in Proceedings of the 18th ACM International Conference on Multimedia, ser. MM '10. New York, NY, USA: ACM, 2010, pp. 787–790. [Online]. Available: http://doi.acm.org/10.1145/1873951.1874078
 [5] Y. Gu, A. Lo, and I. Niemegeers, "A survey of indoor positioning systems for wireless personal networks," Commun. Surveys Tuts., vol. 11, no. 1, pp. 13–32, Jan. 2009. [Online]. Available: http://dx.doi.org/10.1109/SURV.2009.090103
 [6] S. Doeweling, T. Tahiri, P. Sowinski, B. Schmidt, and M. Khalilbeigi, "Support for collaborative situation analysis and planning in crisis management teams using interactive tabletops," in Proceedings of the 2013 ACM International Conference on Interactive Tabletops and Surfaces, ser. ITS '13. New York, NY, USA: ACM, 2013, pp. 273–282. [Online]. Available: http://doi.acm.org/10.1145/2512349.2512823
 [7] K. Tran, D. Phung, B. Adams, and S. Venkatesh, "Indoor location prediction using multiple wireless received signal strengths," in Proceedings of the 7th Australasian Data Mining Conference - Volume 87, ser. AusDM '08. Darlinghurst, Australia: Australian Computer Society, Inc., 2008, pp. 187–192. [Online]. Available: http://dl.acm.org/citation.cfm?id=2449288.2449317
 [8] S. K. Pandey and M. A. Zaveri, "Localization for collaborative processing in the internet of things framework," in Proceedings of the Second International Conference on IoT in Urban Space, ser. Urb-IoT '16. New York, NY, USA: ACM, 2016, pp. 108–110. [Online]. Available: http://doi.acm.org/10.1145/2962735.2962752
 [9] T. Kramp, R. van Kranenburg, and S. Lange, Introduction to the Internet of Things. Berlin, Heidelberg: Springer Berlin Heidelberg, 2013, pp. 1–10. [Online]. Available: https://doi.org/10.1007/978-3-642-40403-0_1
[10] E. Curry, S. Dustdar, Q. Z. Sheng, and A. Sheth, "Smart cities – enabling services and applications," Journal of Internet Services and Applications, vol. 7, no. 1, p. 6, Jun 2016. [Online]. Available: https://doi.org/10.1186/s13174-016-0048-6
[11] A. Ojo, Z. Dzhusupova, and E. Curry, "Exploring the Nature of the Smart Cities Research Landscape," in Smarter as the New Urban Agenda: A Comprehensive View of the 21st Century City, R. Gil-Garcia, T. A. Pardo, and T. Nam, Eds. Springer, 2015. [Online]. Available: http://www.edwardcurry.org/publications/Landscape_Preprint.pdf
[12] A. Filippoupolitis and E. Gelenbe, "An emergency response system for intelligent buildings," in Sustainability in Energy and Buildings, N. M'Sirdi, A. Namaane, R. J. Howlett, and L. C. Jain, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2012, pp. 265–274.
[13] W. Zeiler, R. van Houten, and G. Boxem, "Smart buildings: Intelligent software agents," in Sustainability in Energy and Buildings, R. J. Howlett, L. C. Jain, and S. H. Lee, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2009, pp. 9–17.
[14] A. Rice and R. Harle, "Evaluating lateration-based positioning algorithms for fine-grained tracking," in Proceedings of the 2005 Joint Workshop on Foundations of Mobile Computing, ser. DIALM-POMC '05. New York, NY, USA: ACM, 2005, pp. 54–61. [Online]. Available: http://doi.acm.org/10.1145/1080810.1080820
[15] P. Jiang, Y. Zhang, W. Fu, H. Liu, and X. Su, "Indoor mobile localization based on wi-fi fingerprint's important access point," International Journal of Distributed Sensor Networks, vol. 11, no. 4, p. 429104, 2015. [Online]. Available: https://doi.org/10.1155/2015/429104
[16] C. Li, Z. Qiu, and C. Liu, "An improved weighted k-nearest neighbor algorithm for indoor positioning," Wirel. Pers. Commun., vol. 96, no. 2, pp. 2239–2251, Sep. 2017. [Online]. Available: https://doi.org/10.1007/s11277-017-4295-z
[17] A. Belay Adege, Y. Yayeh, G. Berie, H. Lin, L. Yen, and Y. R. Li, "Indoor localization using k-nearest neighbor and artificial neural network back propagation algorithms," in 2018 27th Wireless and Optical Communication Conference (WOCC), April 2018, pp. 1–2.
[18] M. Y. Umair and K. V. R., "An enhanced k-nearest neighbor algorithm for indoor positioning systems in a wlan," in 2014 IEEE Computers, Communications and IT Applications Conference, Oct 2014, pp. 19–23.
[19] M. V. Moreno-Cano, M. A. Zamora-Izquierdo, J. Santa, and A. F. Skarmeta, "An indoor localization system based on artificial neural networks and particle filters applied to intelligent buildings," Neurocomput., vol. 122, pp. 116–125, Dec. 2013. [Online]. Available: http://dx.doi.org/10.1016/j.neucom.2013.01.045
[20] A. Chriki, H. Touati, and H. Snoussi, "Svm-based indoor localization in wireless sensor networks," in 2017 13th International Wireless Communications and Mobile Computing Conference (IWCMC), June 2017, pp. 1144–1149.
[21] W. Zhang, K. Liu, W. Zhang, Y. Zhang, and J. Gu, "Deep neural networks for wireless localization in indoor and outdoor environments,"

     Neurocomput., vol. 194, no. C, pp. 279–287, Jun. 2016. [Online]. Available: https://doi.org/10.1016/j.neucom.2016.02.055
[22] M. Altini, D. Brunelli, E. Farella, and L. Benini, "Bluetooth indoor localization with multiple neural networks," in IEEE 5th International Symposium on Wireless Pervasive Computing 2010, May 2010, pp. 295–300.
[23] Z. E. Khatab, A. Hajihoseini, and S. A. Ghorashi, “A fingerprint method
for indoor localization using autoencoder based deep extreme learning
machine,” IEEE Sensors Letters, vol. 2, no. 1, pp. 1–4, March 2018.
[24] Z. Wu, Q. Xu, J. Li, C. Fu, Q. Xuan, and Y. Xiang, “Passive indoor
localization based on csi and naive bayes classification,” IEEE Trans-
actions on Systems, Man, and Cybernetics: Systems, vol. 48, no. 9, pp.
1566–1577, Sep. 2018.
[25] A. Haider, Y. Wei, S. Liu, and S.-H. Hwang, “Pre- and post-processing
algorithms with deep learning classifier for wi-fi fingerprint-based
indoor positioning,” Electronics, vol. 8, no. 2, 2019. [Online]. Available:
http://www.mdpi.com/2079-9292/8/2/195
[26] W. Sun, M. Xue, H. Yu, H. Tang, and A. Lin, “Augmentation of
fingerprints for indoor wifi localization based on gaussian process
regression,” IEEE Transactions on Vehicular Technology, vol. 67, no. 11,
pp. 10 896–10 905, Nov 2018.
[27] J. Torres-Sospedra, R. Montoliu, A. Martı́nez-Usó, J. P. Avariento, T. J.
Arnau, M. Benedito-Bordonau, and J. Huerta, “Ujiindoorloc: A new
multi-building and multi-floor database for wlan fingerprint-based indoor
localization problems,” in 2014 International Conference on Indoor
Positioning and Indoor Navigation (IPIN), Oct 2014, pp. 261–270.
[28] T. D. Sanger, “Optimal unsupervised learning in a single-layer linear
feedforward neural network,” Neural Networks, vol. 2, no. 6, pp. 459 –
473, 1989. [Online]. Available: http://www.sciencedirect.com/science/
article/pii/0893608089900440
[29] S. Elanayar V.T. and Y. C. Shin, “Radial basis function neural net-
work for approximation and estimation of nonlinear stochastic dynamic
systems,” IEEE Transactions on Neural Networks, vol. 5, no. 4, pp.
594–603, July 1994.
[30] N. R. Pal, J. C. Bezdek, and E. C. . Tsao, “Generalized clustering
networks and kohonen’s self-organizing scheme,” IEEE Transactions on
Neural Networks, vol. 4, no. 4, pp. 549–557, July 1993.
[31] H. HaddadPajouh, A. Dehghantanha, R. Khayami, and K.-K. R.
Choo, “A deep recurrent neural network based approach for
internet of things malware threat hunting,” Future Generation
Computer Systems, vol. 85, pp. 88 – 96, 2018. [Online]. Available:
http://www.sciencedirect.com/science/article/pii/S0167739X1732486X
[32] W. Liu, Z. Wang, X. Liu, N. Zeng, Y. Liu, and F. E. Alsaadi, “A
survey of deep neural network architectures and their applications,”
Neurocomputing, vol. 234, pp. 11 – 26, 2017. [Online]. Available:
http://www.sciencedirect.com/science/article/pii/S0925231216315533
[33] B. L. Happel and J. M. Murre, “Design and evolution of modular neural
network architectures,” Neural Networks, vol. 7, no. 6, pp. 985 – 1004,
1994, models of Neurodynamics and Behavior. [Online]. Available:
http://www.sciencedirect.com/science/article/pii/S0893608005801558
[34] H. Turabieh, M. Mafarja, and X. Li, “Iterated feature selection
algorithms with layered recurrent neural network for software fault
prediction,” Expert Systems with Applications, vol. 122, pp. 27 – 42,
2019. [Online]. Available: http://www.sciencedirect.com/science/article/
pii/S0957417418308030
[35] A. J. Maren, C. T. Harston, and R. M. Pap, Handbook of Neural
Computing Applications. Orlando, FL, USA: Academic Press, Inc.,
1990.
[36] D. Dua and C. Graff, “UCI machine learning repository,” 2019.
[Online]. Available: http://archive.ics.uci.edu/ml
[37] J. Jang and S. Hong, “Indoor localization with wifi fingerprinting using
convolutional neural network,” in 2018 Tenth International Conference
on Ubiquitous and Future Networks (ICUFN), July 2018, pp. 753–758.
[38] K. S. Kim, S. Lee, and K. Huang, “A scalable deep neural network
architecture for multi-building and multi-floor indoor localization based
on wi-fi fingerprinting,” Big Data Analytics, vol. 3, no. 1, p. 4, Apr
2018. [Online]. Available: https://doi.org/10.1186/s41044-018-0031-2
[39] M. Nowicki and J. Wietrzykowski, “Low-effort place recognition with
wifi fingerprints using deep learning,” in Automation 2017, R. Szewczyk,
C. Zieliński, and M. Kaliczyńska, Eds. Cham: Springer International
Publishing, 2017, pp. 575–584.
[40] J. G. Rohra, B. Perumal, S. J. Narayanan, P. Thakur, and R. B. Bhatt, "User localization in an indoor environment using fuzzy hybrid of particle swarm optimization & gravitational search algorithm with neural networks," in Proceedings of Sixth International Conference on Soft Computing for Problem Solving, K. Deep, J. C. Bansal, K. N. Das, A. K. Lal, H. Garg, A. K. Nagar, and M. Pant, Eds. Singapore: Springer Singapore, 2017, pp. 286–295.

Learning with Dynamic Architectures for Artificial
Neural Networks - Adaptive Batch Size Approach

Reham Saeed* Rawan Ghnemat* Ghassen Benbrahim§ Ammar Elhassan*


riham_mefleh@hotmail.com r.ghnemat@psut.edu.jo gbrahim@pmu.edu.sa a.elhassan@psut.edu.jo

*
King Hussein School of Computer Science
Princess Sumaya University for Technology
Jordan

§College of Computer Engineering & Science


Prince Mohammad bin Fahd University
KSA

Abstract— In this research we explore the performance of the ADANET framework with a custom search space on image-classification datasets, using TensorFlow libraries in combination with adaptive batch sizes for learning. In our experiments we classified Fashion-MNIST data and the MNIST data of handwritten digits, and obtained favorable results in terms of both training time and accuracy by alternating the learning batch size dynamically. Our testing was applied using a simple deep neural network (DNN) and also a convolutional neural network (CNN).

Keywords—Artificial Neural Networks, Convolutional Neural Networks, Batch Size, Convergence, Accuracy, Training, Feed-forward, Two-Layer Feed-Forward Net, Sampling, Tensorflow, ADANet, Stochasticity

I. INTRODUCTION

Artificial Neural Networks (ANNs) are machine learning models inspired by the human brain [12], [13]. They are considered among the most powerful structures capable of producing highly accurate learning. However, these structures did not initially gain very high popularity because of their complex designs, long training times, and the fact that machine learning model candidate selection requires its own domain expertise. As computational power and specialized deep learning hardware such as TPUs become more readily available, machine learning models will grow larger and ensembles will become more prominent. Neural networks have been applied in different domains such as classification problems, including for images, speech recognition, expert systems, fuzzy logic and control, to name but a few [15], [16].

ADANET [1], [2] is a fast, flexible and easy-to-use AutoML framework introduced by Google. It is an adaptive structural learning platform for Artificial Neural Networks (ANNs) developed to handle both the structure and the weights of the ANN. It is a lightweight TensorFlow-based [3] platform for high-quality ensemble learning that does not depend on domain expertise. The code, which is based on the AdaNet algorithm [2], is open source and:
  i.   supports learning of the ANN structure as an ensemble of subnetworks;
  ii.  integrates with the existing TensorFlow design and ecosystem;
  iii. performs well on novel datasets by offering sensible default search spaces;
  iv.  can utilize expert information when available, through a flexible API;
  v.   utilizes distributed CPU, GPU, and TPU hardware to efficiently accelerate training.

II. PROBLEM STATEMENT

The design and training of Artificial Neural Networks take a long time to converge and achieve acceptable accuracy. Some of the drawbacks associated with their design include



training difficulty due to optimization issues as well as the time requirements of the training process itself [11].

It has been shown that large, fixed training batch sizes fail to find good minimizers for non-convex problems like deep neural networks [14]. One way to overcome this limitation is to increase the batch size dynamically over time: the initial iterations with smaller batches yield increased stochasticity, which helps the training converge towards a good minimizer, while increasing the batch size in later training cycles speeds up convergence towards that minimizer.

III. BACKGROUND AND RELATED WORK

The work in this article, which explores adaptive batch sizes for various neural network architectures, is inspired by [8] as well as [19], whose approaches consider second-order, Newton-type methods. The authors in [8] explored the effect of varying batch sizes on the learning rate, particularly for strongly-convex, unconstrained optimization problems, and experimented with batch size increases as an alternative to learning rate decreases. [19] applied estimated gradient variance as a basis for batch size selection in their learning models.

The idea of incremental learning rate variance and scaling was introduced by [4] for ImageNet CNN training in a large, distributed GPU cluster environment targeting 8192-sample batches, although that work uses fixed batch sizes. Scaling multi-layer learning rates by the norms of the gradients was also used successfully to reach batch sizes of 32k+ in [5]; this too is based on fixed-batch-size training models.

Adapting the batch size against the learning rate during ANN training was applied recently by [6], who demonstrated that small batch sizes yield efficient learning while large batch sizes are more efficient in terms of computational complexity. The crux of their approach is to adapt the batch size dynamically during training rather than working with a static batch size across all training cycles. Their work struck a favorable combination of the high convergence rates offered by small batch sizes and the high throughput of large batch sizes, improving overall performance by factors of 6+ while compromising accuracy by less than 1%.

Motivated by the popularity of variance-reduced methods that achieve linear convergence rates with small sample sizes, [7] increased sample sizes dynamically in stochastic gradient descent iterations and developed theoretical and empirical methods to counter the prohibitive issues inherent in multiple training passes; they obtained positive performance increments within accuracy thresholds "on an n-sample in 2n, instead of n log n steps". Similar work demonstrating the effectiveness of combining learning rates with dynamic batch sizes was performed by [8] and [9], and [10] applied novel, adaptive approaches that control the increases in batch sizes, with applications to convex problems and convolutional neural networks (CNNs). These approaches, however, did not explore the performance benefits that can be gained by combining adaptive batch sizes with existing large-batch-size CNN training work such as [4], [5].

IV. METHODOLOGY

In order to find an optimum balance between ANN accuracy and learning rate, it is possible to alternate between two modes: fast-forward and normal. Fast-forward iterates faster than normal mode by using a smaller number of samples in each mini-batch. Switching between these two modes dynamically is triggered by accuracy fluctuations: the faster mode is used as long as it yields acceptable results, and training reverts to normal mode otherwise. With this approach, it is possible to perform training exercises feasibly even on commodity CPUs. We applied ADANET in fast (larger batch size) mode as long as accuracy was increasing, and switched to the slower, normal (smaller batch size) mode otherwise. We can multiply the batch size by a factor n without changing the learning rate. The basic crux of the idea is described in the pseudo-algorithm below:

    Algorithm: Adaptive batch size ADANET
    // use ADANET to choose the structure of a neural network
    // configure ADANET Colab settings
    // set the initial batch size and keep the same learning rate
    BSet = {10, 50, 100, 200}; setPointer = 0;
    BSize = BSet[setPointer]; iLR = 0.01;
    iAcc = 0;                        // initial accuracy
    iMode = "Normal";
    do {
        iteration++;
        newAcc = TrainBatch(BSize);  // run one training batch
        if (iAcc > newAcc) {
            setPointer++;
            BSize = BSet[setPointer];  // next batch size
        } else {
            iAcc = newAcc;
        }
    } loop until exit criteria
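The following is a hedged, runnable sketch of the loop above using plain tf.keras instead of the ADANET estimator; the model architecture, the batch-size set {10, 50, 100, 200}, and the one-epoch-per-step granularity are illustrative assumptions.

    import tensorflow as tf

    (x_tr, y_tr), (x_te, y_te) = tf.keras.datasets.fashion_mnist.load_data()
    x_tr, x_te = x_tr / 255.0, x_te / 255.0

    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),
                  loss="sparse_categorical_crossentropy", metrics=["accuracy"])

    batch_set = [10, 50, 100, 200]     # candidate batch sizes (assumed)
    ptr, best_acc = 0, 0.0
    for iteration in range(20):        # exit criterion: fixed iteration budget
        model.fit(x_tr, y_tr, batch_size=batch_set[ptr], epochs=1, verbose=0)
        _, acc = model.evaluate(x_te, y_te, verbose=0)
        if acc < best_acc and ptr < len(batch_set) - 1:
            ptr += 1                   # accuracy dropped: move to the next batch size
        else:
            best_acc = max(best_acc, acc)
        print(f"iter={iteration:02d} batch={batch_set[ptr]:4d} acc={acc:.4f}")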
A. Test Data Description

The Fashion-MNIST dataset consists of a training set of 60,000 examples and a test set of 10,000 examples; each example is a 28x28 grayscale image associated with a label from 10 classes [17]. The second set of tests was conducted against the MNIST database of handwritten digit images, which offers a collection of handwritten digits used in optical character recognition (OCR) and in data science and machine learning research [18].
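Both datasets ship with tf.keras, so a minimal tf.data input pipeline of the kind fed to a TensorFlow estimator might look as follows; the batching and shuffling parameters are illustrative assumptions.

    import tensorflow as tf

    def make_input_fn(images, labels, batch_size=100, shuffle=True):
        """Build an input_fn that yields (features, label) batches from numpy arrays."""
        def input_fn():
            x = (images / 255.0).astype("float32")
            ds = tf.data.Dataset.from_tensor_slices(({"x": x}, labels))
            if shuffle:
                ds = ds.shuffle(10_000)
            return ds.batch(batch_size)
        return input_fn

    (x_tr, y_tr), _ = tf.keras.datasets.fashion_mnist.load_data()
    train_input_fn = make_input_fn(x_tr, y_tr, batch_size=100)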
V. EXPERIMENTAL RESULTS

The Fashion modified National Institute of Standards and Technology (Fashion-MNIST) dataset was fed into TensorFlow using the
Estimator convention; the neural network was then built using two different classifiers. The first model used a simple deep neural network (DNN) classifier and the second used a convolutional neural network (CNN). The accuracy was evaluated for both classifiers.
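As an illustration of the two model families compared here, the sketch below defines a simple DNN and a small CNN in tf.keras; the layer sizes are our assumptions and not the exact candidate networks explored by ADANET.

    import tensorflow as tf
    from tensorflow.keras import layers

    def build_dnn(num_classes=10):
        """A simple fully connected classifier for 28x28 grayscale images."""
        return tf.keras.Sequential([
            layers.Flatten(input_shape=(28, 28)),
            layers.Dense(256, activation="relu"),
            layers.Dense(num_classes, activation="softmax"),
        ])

    def build_cnn(num_classes=10):
        """A small convolutional classifier for the same inputs."""
        return tf.keras.Sequential([
            layers.Reshape((28, 28, 1), input_shape=(28, 28)),
            layers.Conv2D(32, 3, activation="relu"),
            layers.MaxPooling2D(),
            layers.Conv2D(64, 3, activation="relu"),
            layers.Flatten(),
            layers.Dense(num_classes, activation="softmax"),
        ])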
Figure 1 shows accuracy over iterations with batch sizes of 10 to 200 for the Fashion-MNIST dataset with the DNN classifier and a learning rate of 0.01. Figure 2 shows accuracy over iterations with batch sizes of 10 to 200 for the Fashion-MNIST dataset with the CNN ADANET model and a learning rate of 0.01.

Fig. 1. Fashion-MNIST dataset accuracy using the simple DNN classifier (curves colored by iteration count: 1000, 1500, 2000, 3000).

Fig. 2. Accuracy tests - CNN ADANET model.

Figures 3 and 4 show the results of the ADANET tests on the MNIST digits dataset with 1000 to 3000 iterations.

Fig. 3. MNIST dataset accuracy using the simple DNN classifier.

Fig. 4. Digits MNIST dataset accuracy using the CNN classifier.

A. Test Colab Specification

The tests above were conducted with the following Colab specification:
    CPU: 1x single-core hyper-threaded Xeon processor (1 core, 2 threads) @ 2.3 GHz (no Turbo Boost), 45 MB cache
    RAM: ~12.6 GB available
    Disk: ~320 GB available

B. Analysis

It is evident from both sets of results that ADANET accuracy rates are around 4% to 10% higher with CNN models than with simple DNN models for similar batch sizes and iteration counts. We can also see a significant improvement when using the large batch size (200) over the small batch size (10). Furthermore, changing the number of iterations has seemingly little effect on the obtained accuracy rates, which directly affects the time needed to train on big datasets.
VI. LIMITATIONS AND FUTURE WORK

The experimental testing conducted in this research, which adaptively varies the batch sizes used in training artificial neural networks, indicates that training DNN and CNN models in this way has a linear effect on speed with minuscule accuracy-degradation overhead. We alternated between large and small batch sizes as needed without compromising speed, by using fewer iterations, and our experiments show the same improvement under different classifier-dataset combinations. The proposed procedure could be applied equally well to large and small datasets, with different classifiers and different model architectures.

We presented experimental results demonstrating that our procedure was successful in training the network and performs better than procedures using a static small batch size with adaptive learning rates. Further experiments with other batch sizes and larger datasets will be conducted in the future.

REFERENCES
 [1] Charles Weill, "Introducing AdaNet: Fast and Flexible AutoML with Learning Guarantees," Cornell University, 2018, https://arxiv.org/abs/1905.00080, accessed March 2019.
 [2] Corinna Cortes, Xavier Gonzalvo, Vitaly Kuznetsov, Mehryar Mohri and Scott Yang, "AdaNet: adaptive structural learning of artificial neural networks," in Proceedings of the 34th International Conference on Machine Learning (ICML'17), Volume 70, pp. 874-883, Sydney, NSW, Australia, August 06-11, 2017.
 [3] M. Abadi et al., "TensorFlow: A System for Large-Scale Machine Learning," 12th USENIX Symposium on Operating Systems Design and Implementation 2016, pp. 265-283, Savannah, GA. https://www.usenix.org/conference/osdi16/technical-sessions/presentation/abadi, accessed May 2019.
 [4] Priya Goyal, Piotr Dollár, Ross B. Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He, "Accurate, large minibatch SGD: training ImageNet in 1 hour," arXiv preprint arXiv:1706.02677, 2017. http://arxiv.org/abs/1706.02677, accessed May 2019.
 [5] Yang You, Igor Gitman, and Boris Ginsburg, "Scaling SGD batch size to 32K for ImageNet training," arXiv preprint arXiv:1708.03888, 2017. http://arxiv.org/abs/1708.03888, accessed May 2019.
 [6] Aditya Devarakonda, Maxim Naumov, Michael Garland, "AdaBatch: Adaptive Batch Sizes for Training Deep Neural Networks," Cornell University, Feb 2018. https://arxiv.org/abs/1712.02029, accessed May 2019.
 [7] Hadi Daneshmand, Aurelien Lucchi, and Thomas Hofmann, "Starting Small - Learning with Adaptive Sample Sizes," in Proceedings of the International Conference on Machine Learning, pp. 1463-1471, 2016.
 [8] Reza Harikandeh, Mohamed Osama Ahmed, Alim Virani, Mark Schmidt, Jakub Konečný, and Scott Sallinen, "Stop wasting my gradients: Practical SVRG," in Advances in Neural Information Processing Systems, pp. 2251-2259, 2015.
 [9] Soham De, Abhay Yadav, David Jacobs, and Tom Goldstein, "Big Batch SGD: Automated Inference using Adaptive Batch Sizes," arXiv preprint arXiv:1610.05792, October 2016.
[10] Lukas Balles, Javier Romero, and Philipp Hennig, "Coupling adaptive batch sizes with learning rates," in Proceedings of the Conference on Uncertainty in Artificial Intelligence, pp. 410-419, 2017.
[11] P. G. Maghami and D. W. Sparks, "Design of neural networks for fast convergence and accuracy: dynamics and control," IEEE Transactions on Neural Networks, vol. 11, no. 1, pp. 113-123, Jan. 2000. doi: 10.1109/72.822515, http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=822515&isnumber=17821, accessed May 2019.
[12] S. Haykin, Neural Networks: A Comprehensive Foundation. New York: Macmillan, 1994.
[13] M. I. Elmasry, Ed., VLSI Artificial Neural Networks Engineering. Norwell, MA: Kluwer, 1994.
[14] Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang, "On large-batch training for deep learning: Generalization gap and sharp minima," arXiv preprint arXiv:1609.04836, 2016.
[15] G. W. Irwin, K. Warwick, K. J. Hunt, Neural Network Applications in Control, 1995. books.google.com, accessed May 2019.
[16] I. A. Basheer, M. Hajmeer, "Artificial neural networks: fundamentals, computing, design, and application," Journal of Microbiological Methods, Volume 43, Issue 1, 2000, Pages 3-31, ISSN 0167-7012, https://doi.org/10.1016/S0167-7012(00)00201-3.
[17] Han Xiao, Kashif Rasul, Roland Vollgraf, "Fashion-MNIST: A Novel Image Dataset for Benchmarking Machine Learning Algorithms," arXiv:1708.07747v2, accessed May 2019.
[18] L. Deng, "The MNIST Database of Handwritten Digit Images for Machine Learning Research [Best of the Web]," IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 141-142, Nov. 2012. doi: 10.1109/MSP.2012.2211477. URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6296535&isnumber=6296521, accessed May 2019.
[19] Richard H. Byrd, Gillian M. Chin, Jorge Nocedal, and Yuchen Wu, "Sample size selection in optimization methods for machine learning," Mathematical Programming, 134(1):127-155, 2012.

Hybrid Machine Learning Classifiers to Predict
Student Performance
Hamza Turabieh
Information Technology Department
CIT College, Taif University
Taif, KSA
h.turabieh@tu.edu.sa

Abstract—Recently, machine learning technology has been successfully involved in our life in various domains. In this paper, we investigate the machine learning concept for educational data mining systems, which focus on developing new approaches to discover meaningful knowledge from stored data. Educational data come from different resources such as academic data from students, virtual courses, e-learning log files, and so on. Predicting student marks is a challenging problem in the educational sector. We applied a hybrid feature selection algorithm with different machine learning classifiers (i.e., nearest neighbors (kNN), Convolutional Neural Network (CNN), Naïve Bayes (NB) and decision trees (C4.5)) to predict student performance. A feature selection algorithm is used to select the most valuable features; in this paper, we applied a binary genetic algorithm as a wrapper feature selection method. A benchmark dataset from the UCI Machine Learning Repository is used, and the obtained results show excellent performance.

Index Terms—Machine learning, Student performance, Feature selection.

I. INTRODUCTION

Educational systems have complex data that can be used to discover hidden knowledge and thereby improve the overall educational system [1]. Educational data (e.g., e-learning log files, student marks, admission/registration data, virtual courses, and so on) can be manipulated using machine learning approaches to find meaningful models. Several researchers have adopted different methods (i.e., classification, clustering, statistics, and so on) to mine educational data [2], [3]. Predicting student performance is a challenging problem that educational institutions such as universities, schools, and training centers face every year. Predicting student performance at an early stage encourages educational institutes to find solutions that prevent negative student outcomes [4]. Lecturers can anticipate the performance of their students and find appropriate learning strategies to improve it; moreover, such predictions can enhance institutional enrolment policies and help students improve their grades.

Machine Learning (ML) methods have been used successfully in several domains, such as healthcare [5], environmental studies [6], industry [7] and educational systems [3]. To date, the machine learning concept in the educational sector is still attracting researchers [8], [9]. Moreover, the concepts of e-learning and big data in education provide researchers with extremely large data sets that should be examined correctly to help educators and decision-makers improve educational systems.

Educational domains pose a set of challenging problems for machine learning researchers, since education systems offer complex information such as student information, class and schedule information, admission and registration records, and alumni information. The motivation of this paper is to predict student performance based on historical data using a hybrid machine learning approach.

This paper aims to investigate the performance of different machine learning classifiers, with and without a feature selection algorithm, for predicting student performance. A binary genetic algorithm is employed for feature selection to reduce the dimensionality of the search space, which improves the overall performance of the classifiers and reduces the computational time.

The rest of this paper is organized as follows: Section II explores related work on machine learning in educational systems. Section III presents the proposed hybrid approach. Section IV presents the experimental dataset used in this paper. Section V shows the experimental results and analysis of the proposed approach. Section VI draws the conclusions and future work.

II. RELATED WORKS

Machine learning for educational systems has been investigated in depth by [1], who defined five different fields: prediction, discovery within models, extraction of data for human judgment, clustering, and relationship mining. Most previous work on education systems is related to universities or virtual learning [10], and in all previous works the data were collected either from surveys or from e-learning systems.

Kapur et al. [11] applied two different machine learning methods (i.e., J48 Decision Tree and Random Forest) to predict student marks in the education field; the collected data consist of 480 entries related to student enrollment. Veracano et al. [12] applied different machine learning methods to estimate student dropout on an unbalanced dataset, collecting 419 samples from one Mexican high school. Saif et al. [13] investigated several courses and explored how good or poor achievement can be predicted. Saarela et al. [14] proposed a system to predict the difficulty level of different math questions and whether students can



solve these questions or not. Xu et al. [15] proposed a novel approach based on ensemble learning techniques to predict student performance. Asif et al. [16] used a machine learning approach to predict students' final-year performance based on historical student data. Prasada Rao et al. [17] applied three machine learning algorithms (i.e., J48, Naïve Bayes and Random Forest) to predict the performance of 200 students from computer science and engineering departments. Anuradha and Velmurugan [18] applied several machine learning methods to predict student performance and proposed a model of student performance predictors.

III. PROPOSED HYBRID APPROACH

In this paper, we propose a hybrid approach combining a wrapper feature selection method with several machine learning classifiers (i.e., nearest neighbors (kNN), Convolutional Neural Network (CNN), Naïve Bayes (NB) and decision trees (C4.5)). The Binary Genetic Algorithm (BGA) is used as the feature selection algorithm. Figure 1 shows a pictorial diagram of the proposed approach; the following subsections describe its main components.

Fig. 1. A pictorial diagram of the proposed hybrid methodology.

A. Binary genetic algorithm

One of the most successful population-based algorithms is the Genetic Algorithm (GA) [19], which simulates natural selection. GA has been applied successfully in various domains [20]-[22]. The GA process is iterative: it starts by generating a pool of solutions (the population), and then three genetic operators (selection, crossover, and mutation) are applied to selected solutions. This process is repeated until the maximum number of iterations is reached or the optimal fitness value is achieved [19]. Figure 2 shows the pseudo-code for GA.

The GA operators are the core operations inside GA and determine its overall performance: two solutions are selected from the pool using a chosen selection approach (e.g., random, roulette wheel, or tournament), a crossover operation (e.g., single, double, or uniform) is performed, followed by mutation, and the population is updated based on an elitism replacement strategy.

In this paper, we apply a Binary Genetic Algorithm (BGA), where each solution is represented as a binary string; Figure 3 illustrates a single BGA step. An Artificial Neural Network (ANN) is used as the internal classifier, and the fitness function is presented in Equation (1), where E is the overall error rate, β is a threshold value (β = 5), |R| is the number of selected features, and |N| is the total number of features. Table I shows the parameter settings used for the internal classifier.

    Fitness = E * (1 + β * |R| / |N|)      (1)

TABLE I. PARAMETER SETTINGS FOR THE ANN INTERNAL CLASSIFIER

    Parameter                            Value
    Number of neurons in input layer     Number of selected features
    Number of neurons in hidden layer    10
    Number of neurons in output layer    1
    Training sample                      70% of the data
    Testing sample                       15% of the data
    Validation sample                    15% of the data
    Fitness function                     Mean square error
B. Machine learning classifiers

Machine learning offers many classification algorithms, so we limit our study to four different classifiers: nearest neighbors (kNN), Convolutional Neural Network (CNN), Naïve Bayes (NB) and decision trees (C4.5). All of these classifiers have been applied successfully in several domains. CNN is one of the most successful machine learning models; it performs a deep learning process that can predict complex data [23]. The kNN algorithm uses a similarity measure to assign the dominant label of the nearest group [24]; in this paper we select k = 5. The NB classifier is a simple approach that can be viewed as a Bayesian network based on two assumptions: the features are independent, and there are no hidden features that affect the final prediction [25]. C4.5 is a decision tree classifier that employs information-based criteria to set up decision trees [26]. The tree is extended based on the valuable information
Given:
-nP: base population size.
-nI: number of iterations.
-rC: rate of crossover.
-rM: rate of mutation.
Generate initial population of size nP.
Evaluate initial population according to the fitness function.
While (current iteration ≤ nI)
//Breed rC × nP new solutions.
Select two parent solutions from current population.
Form offspring’s solutions via crossover.
IF(rand(0.0, 1.0) < rM)
Mutate the offspring’s solutions.
end IF
Evaluate each child solution according to the fitness function.
Add offspring’s to population.
//population size is now MaxPop=nP× (1+rC).
Remove the rC× nP least-fit solutions from population.
end While
Output the global best solution

Fig. 2. The pseudo-code for Genetic Algorithm.
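For readers who want to run the loop in Figure 2, the following is a minimal binary GA skeleton in Python; the random parent selection, the toy fitness, and the default rates are illustrative assumptions (a real run would plug in the wrapper fitness of Equation (1) and roulette wheel selection).

    import numpy as np

    rng = np.random.default_rng(42)

    def evolve(fitness_fn, n_features, n_pop=100, n_iter=1000, r_cross=0.75, r_mut=0.01):
        """Binary GA with single-point crossover, bit-flip mutation, and elitist replacement."""
        pop = rng.integers(0, 2, size=(n_pop, n_features))
        fit = np.array([fitness_fn(ind) for ind in pop])
        for _ in range(n_iter):
            children = []
            for _ in range(int(r_cross * n_pop) // 2):
                p1, p2 = pop[rng.choice(n_pop, 2, replace=False)]
                cut = rng.integers(1, n_features)              # single-point crossover
                c1 = np.concatenate([p1[:cut], p2[cut:]])
                c2 = np.concatenate([p2[:cut], p1[cut:]])
                for c in (c1, c2):
                    flip = rng.random(n_features) < r_mut      # bit-flip mutation
                    c[flip] ^= 1
                    children.append(c)
            pop = np.vstack([pop, children])
            fit = np.concatenate([fit, [fitness_fn(c) for c in children]])
            keep = np.argsort(fit)[:n_pop]                     # keep the fittest (elitism)
            pop, fit = pop[keep], fit[keep]
        return pop[np.argmin(fit)], fit.min()

    # Toy fitness: prefer chromosomes with few selected bits (stand-in for Eq. (1)).
    best, best_fit = evolve(lambda ind: ind.sum(), n_features=33, n_iter=50)
    print(best_fit)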

received. All classifiers are trained and tested using five-fold cross-validation. Readers interested in classification algorithms and their applications are referred to [23], [27]-[29].
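The sketch below instantiates the non-deep classifiers with scikit-learn and evaluates them with the five-fold cross-validation used here; DecisionTreeClassifier stands in for C4.5, and the CNN would be built separately, so treat the exact estimators as assumptions rather than the authors' MATLAB implementations.

    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.naive_bayes import GaussianNB
    from sklearn.tree import DecisionTreeClassifier

    classifiers = {
        "kNN (k=5)": KNeighborsClassifier(n_neighbors=5),
        "Naive Bayes": GaussianNB(),
        "C4.5-style tree": DecisionTreeClassifier(criterion="entropy"),
    }

    def evaluate(X, y):
        """Report mean 5-fold cross-validation accuracy for each classifier."""
        for name, clf in classifiers.items():
            scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
            print(f"{name:16s} accuracy = {scores.mean():.3f} (std {scores.std():.3f})")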
IV. EXPERIMENTAL DATA

In this paper, we use a public dataset proposed by Cortez and Silva [30], [31] in 2008. The dataset describes secondary education in Portugal, where secondary school lasts three years and there are two types of secondary schools: private and public. The grading system ranges from 0 (lowest) to 20 (highest), and each student is evaluated three times during a year. The data were collected during the 2005-2006 academic year from two public schools.

The final grade reflects student performance. The dataset consists of student marks, demographics, school information, etc.; it has 649 samples, and each sample has 33 attributes. The dataset covers two distinct student performance cases: (i) Mathematics (mat) and (ii) Portuguese language (por). Table II shows a description of the dataset. The final target is G3 (final-year grade), which has a strong correlation with attributes G2 and G1. More details about the dataset can be found in [30], and the dataset is available at https://archive.ics.uci.edu/ml/datasets/student+performance.
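The UCI copy of this dataset is distributed as semicolon-separated CSV files (student-mat.csv and student-por.csv); a hedged loading sketch is shown below, with the binarization of G3 into pass/fail added purely as an illustrative choice.

    import pandas as pd

    def load_student_data(path="student-mat.csv", pass_mark=10):
        """Load the UCI student performance data and derive a pass/fail target from G3."""
        df = pd.read_csv(path, sep=";")                  # the UCI files are ';'-separated
        y = (df["G3"] >= pass_mark).astype(int)          # illustrative pass/fail label
        X = pd.get_dummies(df.drop(columns=["G3"]))      # one-hot encode nominal attributes
        return X, y

    # X, y = load_student_data()
    # Mathematics subset: 395 rows; Portuguese subset: 649 rows.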
V. EXPERIMENTAL RESULTS AND ANALYSIS

In this research, we evaluate the ability of a binary genetic algorithm combined with different machine learning classifiers to enhance the prediction of student performance on the Mathematics (mat) dataset. All experiments were evaluated using MATLAB R2014a. Two types of experiments were performed: without feature selection and with feature selection. Table IV shows the parameter settings for the binary genetic algorithm; all settings were carefully selected after preliminary experiments. Each classifier was executed 11 times. Four measurement criteria are used to evaluate the obtained results: accuracy, precision, recall, and F-measure; Equations (2), (3), (4) and (5) show how each criterion is evaluated. All of these equations are calculated from the confusion matrix shown in Table III, where:
  1) TP (true positives): cases predicted positive whose actual value is also positive;
  2) TN (true negatives): cases predicted negative whose actual value is also negative;
  3) FN (false negatives): cases whose actual class is positive but whose estimated value is negative;
  4) FP (false positives): cases whose actual value is negative but whose estimated value is positive.

    Accuracy  = (TP + TN) / (TP + FP + FN + TN)                     (2)
    Precision = TP / (TP + FP)                                      (3)
    Recall    = TP / (TP + FN)                                      (4)
    F-Measure = 2 * (Recall * Precision) / (Recall + Precision)     (5)
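Equations (2)-(5) translate directly into code; the helper below computes them from a predicted/actual label pair for the positive class and is only a convenience sketch (scikit-learn's precision_recall_fscore_support gives the same numbers).

    import numpy as np

    def binary_metrics(y_true, y_pred, positive=1):
        """Accuracy, precision, recall, and F-measure per Equations (2)-(5)."""
        y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
        tp = np.sum((y_pred == positive) & (y_true == positive))
        tn = np.sum((y_pred != positive) & (y_true != positive))
        fp = np.sum((y_pred == positive) & (y_true != positive))
        fn = np.sum((y_pred != positive) & (y_true == positive))
        accuracy = (tp + tn) / (tp + fp + fn + tn)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f_measure = (2 * recall * precision / (recall + precision)
                     if recall + precision else 0.0)
        return accuracy, precision, recall, f_measure

    print(binary_metrics([1, 0, 1, 1, 0], [1, 0, 0, 1, 1]))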
Table V shows the results obtained by all classifiers without the feature selection algorithm. The CNN approach clearly outperforms the other classifiers in terms of accuracy, while C4.5 is the worst. The CNN performs well compared to the other approaches because of its structure: a CNN has many different filters/kernels that convolve over a given input volume and learns by creating increasingly abstract representations of the data as the network structure expands deeper, so the CNN structure extracts features that yield higher
Fig. 3. A demonstration of Binary Genetic Algorithm for a single iteration [20].

accuracy results. Figure 4 shows the boxplot diagrams (best, worst, average, and median accuracy) for all four classifiers without feature selection; it is clear that the performance of CNN outperforms the other approaches.

Fig. 4. Boxplots of accuracy for all classifiers without feature selection.

Table VI shows the results obtained after employing the BGA feature selection algorithm. Compared with the results reported in Table V, all results are improved except for the NB method: kNN improves by 2%, CNN by 2%, and C4.5 by 3%. It is clear that the feature selection algorithm reduces the complexity of the dataset and enhances the overall prediction performance. Figure 5 presents the boxplot diagrams for all classifiers with the feature selection algorithm; all methods show stable performance after the size of the dataset is reduced.

Fig. 5. Boxplots of accuracy for all classifiers with feature selection.

VI. CONCLUSION AND FUTURE WORKS

In this paper, we proposed a hybrid feature selection algorithm with a set of machine learning algorithms to predict student performance. Four different machine learning algorithms were examined: nearest neighbors (kNN), Convolutional Neural Network (CNN), Naïve Bayes (NB) and decision trees (C4.5). A Binary Genetic Algorithm (BGA) is used as a wrapper feature selection approach. The obtained results show that BGA can enhance the performance of the classifiers: the performance of all classifiers except NB is improved by 2-3% after applying the BGA algorithm, and the performance of CNN outperforms all other methods.

In our future work, we will examine different feature selection algorithms and perform a deeper analysis to build a hyper-heuristic model that selects the best classifier.

TABLE II
DATASET DISTRIBUTION .

Attribute Description (Domain)


Sex student’s sex (binary: female or male)
Age student’s age (numeric: from 15 to 22)
school student’s school (binary: Gabriel Pereira or Mousinho da Silveira)
Address student’s home address type (binary: urban or rural)
Pstatus parent’s cohabitation status (binary: living together or apart)
Medu mother’s education (numeric: from 0 to 4a )
Mjob mother’s job (nominalb )
Fedu father’s education (numeric: from 0 to 4a )
Fjob father’s job (nominalb )
guardian student’s guardian (nominal: mother, father or other)
famsize family size (binary: ≤ 3 or > 3)
famrel quality of family relationships (numeric: from 1 – very bad to 5 – excellent)
reason reason to choose this school (nominal: close to home, school reputation, course preference or other)
traveltime home to school travel time (numeric: 1 – < 15 min., 2 – 15 to 30 min., 3 – 30 min. to 1 hour or 4 – > 1 hour).
studytime weekly study time (numeric: 1-< 2 hours, 2-2 to 5 hours, 3-5 to 10 hours or 4-> 10 hours)
failures number of past class failures (numeric: n if 1 ≤ n < 3, else 4)
schoolsup extra educational school support (binary: yes or no)
famsup family educational support (binary: yes or no)
activities extra-curricular activities (binary: yes or no)
paidclass extra paid classes (binary: yes or no)
internet Internet access at home (binary: yes or no)
nursery attended nursery school (binary: yes or no)
higher wants to take higher education (binary: yes or no)
romantic with a romantic relationship (binary: yes or no)
freetime free time after school (numeric: from 1 – very low to 5 – very high)
goout going out with friends (numeric: from 1 – very low to 5 – very high)
Walc weekend alcohol consumption (numeric: from 1 – very low to 5 – very high)
Dalc workday alcohol consumption (numeric: from 1 – very low to 5 – very high)
health current health status (numeric: from 1 – very bad to 5 – very good)
absences number of school absences (numeric: from 0 to 93)
G1 first period grade (numeric: from 0 to 20)
G2 second period grade (numeric: from 0 to 20)
G3 final grade (numeric: from 0 to 20)
a: 0 – none, 1 – primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education
b:teacher, health care related, civil services (e.g. administrative or police), at home or other.

TABLE III. THE CONFUSION MATRIX

                                  Predicted Class
                                  Class = Yes            Class = No
    Actual Class   Class = Yes    True Positive (TP)     False Negative (FN)
                   Class = No     False Positive (FP)    True Negative (TN)

TABLE IV. PARAMETER SETTINGS FOR THE BGA

    Parameter              Value
    Number of iterations   1000
    Population size        100
    Crossover rate         0.75
    Mutation rate          0.01
    Selection type         Roulette Wheel Selection (RWS)
    Crossover type         Single, Double or Uniform

TABLE V. RESULTS OBTAINED WITHOUT FEATURE SELECTION

            Accuracy   Precision   Recall   F1-measure
    kNN     0.86       0.81        0.83     0.85
    NB      0.82       0.78        0.75     0.70
    CNN     0.93       0.88        0.89     0.90
    C4.5    0.79       0.73        0.72     0.66

TABLE VI. RESULTS OBTAINED WITH FEATURE SELECTION

            Accuracy   Precision   Recall   F1-measure
    kNN     0.88       0.84        0.84     0.86
    NB      0.82       0.80        0.78     0.73
    CNN     0.95       0.92        0.91     0.91
    C4.5    0.82       0.76        0.80     0.77


Automated Grading for Handwritten Answer Sheets
using Convolutional Neural Networks
Eman Shaikh, Iman Mohiuddin, Ayisha Manzoor
Department of Computer Engineering,
Prince Mohammad bin Fahd University,
Al Khobar, Saudi Arabia.

Ghazanfar Latif*, Nazeeruddin Mohammad
Department of Computer Science,
Prince Mohammad bin Fahd University,
Al Khobar, Saudi Arabia.
*Email: glatif@pmu.edu.sa

Abstract—Optical Character Recognition (OCR) is an extensive research field in image processing and pattern recognition. Traditional character recognition methods cannot distinguish a character or a word from a scanned image. This paper proposes a system that uses a personal computer, a portable scanner, and an application program to automatically correct handwritten answer sheets. For handwritten character recognition, the scanned images are fed through a machine learning classifier known as the Convolutional Neural Network (CNN). Two CNN models were proposed and trained on 250 images that were collected from students at Prince Mohammad Bin Fahd University. The proposed system outputs the final score of the student by comparing each classified answer with the correct answer. The experimental results exhibited that the proposed system achieved a high testing accuracy of 92.86%. The system can be used by instructors in several educational institutions to automatically and effectively grade the handwritten answer sheets of students.

Keywords—Handwritten Numerals Recognition, Convolutional Neural Network, Handwritten Character Recognition, Scanned Document Segmentation

I. INTRODUCTION
In recent years, handwritten recognition has been considered one of the most engrossing and demanding research areas in the sphere of image processing and pattern recognition. Handwritten recognition systems contribute remarkably to the development of automated procedures and enhance the interaction between humans and computerized systems in several operations. Nowadays, various technological approaches in organizations and institutions help to reduce the time consumed in grading answer sheets manually. This is achieved by raising the accuracy and avoiding the inaccuracies caused by humans. Hence, comparing answer sheets with their answer keys and grading the student answers monotonously is a tedious and arduous task that should be automated.
For this purpose, Optical Character Recognition (OCR) is implemented to transform handwritten or typed text images that are captured with the help of a scanner into an electronic or machine-based text image. Predominantly, handwritten recognition systems are characterized into two categories, namely offline and online recognition. In offline handwritten recognition systems, the handwriting written on paper is normally captured by a scanner which recognizes the characters, and the completed handwritten text is obtainable as an image. In online recognition systems, by contrast, the characters are typed from input devices. Offline character recognition is more complicated than online character recognition because writing styles may differ from one user to another and considerable noise occurs in the offline characters during the writing of the text and the scanning of the document [4, 16]. Hence, offline handwritten recognition continues to be an active field of research towards exploring innovative procedures that would enhance the accuracy of handwritten recognition systems.
This paper proposes an automated system for grading handwritten answer sheets with the help of Convolutional Neural Networks (CNN). All the answer sheets were scanned separately through a portable scanner, and the scanned images were stored as black and white images. After scanning each answer sheet, the scanned images were given as input to the segmentation algorithm. This is done to separate the questions from the answers written in each box. The segmentation procedure divided the images into more comprehensive divisions and procured more relevant data. Each segmented character and digit answer was extracted to generate parameters for testing and training. The data obtained was a handwritten dataset consisting of a few English alphabets and numerals. The dataset was then used to score the students' answer sheets. The recognition of the students' answers was done using two proposed CNN architectures.
The remaining paper is organized as follows: Section II introduces the literature review, Section III describes the proposed framework, Section IV demonstrates the experimental results, and Section V discusses the conclusion.

II. LITERATURE REVIEW
The concept of handwritten recognition has been a confined sphere of research in the discipline of pattern recognition and image processing over the past years, and indeed there is a broad demand for optical character recognition on handwritten scripts. In this section, an extensive analysis of extant works in handwritten recognition systems that depend on various machine learning techniques is presented. Although printed text recognition is considered a largely solved issue these days, handwritten text recognition remains a demanding task, mainly due to the huge variation in handwriting among people, including the size, orientation, thickness, format, and dimension of each



written letter or digit. Various machine learning methods total 1000 answer sheets and detected 100% accuracy.
have been suggested for handwritten text recognition. This Muangprathub et al. [11] presented a method to
section describes different handwritten recognition automatically grade scanned multiple-choice answer sheets
approaches using machine learning classifiers such as the using k-nearest neighbors. 560 answer sheets were evaluated
automated grading of handwritten answers and the in total. The proposed system operated almost three times
recognition of handwritten alphabets and digits in various quicker than the manual approach and the result was an
languages. average of 100% accuracy in case of nearly complete
Brown [1] proposed an automated system using MNIST markings whereas the accuracy for the cases of incomplete
handwritten digit dataset to grade handwritten numerical markings, such as small markings, overflow, and deleted or
answers of scanned student answer sheets using CNN. CNN unclean markings was obtained to be 62.42%, 93.16%, and
was used to estimate the student answers and produced an 99.57% respectively. Patole et al. In [12] proposed an
accuracy of 95.6%. In [2], the authors implemented linear innovative idea for grading multiple-choice tests using a
regression classifier to grade SAT standardized test essays scanner that could grade a multiple-choice exam. This
instinctively by merging the character and word length project used C# language to combine the computing power
features of essays. Kaggle dataset was used to automate the of C++ to evaluate each student’s academic performance as
grading and resulted in an accuracy of 87.65%. In [3], an well for student’s to provide feedbacks on staff members.
automated scoring system for multiple choice answers was Lastly, the system was able to provide benefits which are
implemented that allowed the users to print and scan all the better scalability and suitability to asynchronous mode of
answer sheets. The training time taken for each answer sheet evaluation as compared to traditional evaluation systems.
was 35 seconds or 0.4 seconds. Feedback propagation neural Tayana et al. [13] suggested a method for correcting
network classifier was used for the implementation of the multiple choice-based answer sheets with the help of
system and obtained an accuracy of 90%. Supic et al. [4] mathematical format and k-nearest neighbors (k-NN). The
proposed an automated system to recognize handwritten database used for the manipulation of the images contained
alphabetical answers from an answer sheet containing 680 certified answer sheets and 10 basic image folders
multiple choice-based questions. Random forest classifier containing 26 questions with four choices for each question.
was implemented to automate the reading and was tested on An overall accuracy rate of 99.85% was obtained. In [14],
a dataset sample that contained 3960 scanned answer sheets the author proposed an approach to grade a specifically
with an accuracy of 89.88%. Srihari et al. [5] described designed multiple choice question paper with ten questions
computational scoring methods for handwritten essays in and five choices. The system could obtain 82.44%
reading comprehension tests. The handwritten response accuracy.
dataset consisted of 300 essays out of which 150 essays Ciresan et al. [15] proposed a handwritten character
were used as training sets and 150 essays as testing sets. classification using CNN. Along with the MNIST dataset, a
ANN classifier was used for the scoring methods and special and more challenging dataset called as the NIST SD
obtained an accuracy of 87.63%. Mahana et al. [6] designed 19 dataset was used for this purpose. CNN’s were trained
an automated system for essay grading using Kaggle dataset for around 900 epochs. The total training time consumed
that consisted of 13000 essays. Various essay features were was twelve hours and the accuracy resulted to be 75.66%.
extracted from the training set with the help of a linear Latif et al. [16] proposed a deep learning architecture by
regression model and obtained an accuracy of 91.85%. using Deep Convolutional Neural Networks for the
Saengtongsrikamon et al. [7] developed an Optical Mark recognition of Multilanguage handwritten numerals. The
Recognition (OMR) software that was used as an OMR databases used to test the accuracy of the proposed method
machine with neural networks and was then implemented in were MADBase (Arabic), MNIST (English), HODA
a scanner. This software captured and scored the answers of (Persian), PMUdb (Urdu, a database was created as there
multiple choices questions with an accuracy of 95.24%. The was no pre-existing database available for this language),
OMR machine scanned 1,000 answer sheets using multiple and DHCD (Devanagri). And the resultant overall accuracy
scanners with varying resolutions. In [8], the authors obtained for each language was 99.322%. Singh. N. [17]
proposed a novel local feature extraction method that was suggested an active method for handwritten Devanagari
used to design a multi-language handwritten numeral characters based on ANN classifier and resulted in an
recognition system. The databases used were MADBase accuracy of 98.65%. The training time taken for the
(Arabic), MNIST (English), HODA (Persian), PMU-UD handwritten Devanagari character recognition of 400
(Urdu, a database was created for this language), ICDAR samples was 2.33 seconds. Kumar et al. [18] suggested
(Bengali) and DHCD (Devanagri). Moreover, the same English handwritten character recognition with the help of
authors in [9] enhanced the feature set which was tested Kernel-based SVM and MLP Neural Network
with the help of different classifier methods and it was Classifiers. An isolated handwritten character dataset
found that the Random Forrest classifier achieved the best written by different people were considered to be the
results with an average recognition rate of 96.73%. dataset. 27 features were obtained from each character
In [10], an image processing method was proposed to during training, and these features were consumed for
automatically grade answer sheets that contained multiple training the SVM with 80.96% accuracy. Rao et al. [19]
choice questions. This approach enabled every user to print proposed an analysis of English handwritten character
whatever answer sheet they wished to print and after the recognition algorithm based on CNN. MNIST and SVHN
printing process, they utilized a normal scanner and datasets were used for this purpose. Accuracies produced by
computer to assess the answer sheets. The proposed method MNIST and SVHN Dataset on handwritten character
consumed a training time of 1.4 seconds per sheet out of the recognition were 94.65% and 95.1% respectively. Jong et al.
[20] proposed a research on handwritten English alphabet the python scoring code which loads the trained CNN model
recognition systems based on extreme learning machine. and outputs the score for each student.
Extreme Learning Machine (ELM) classifier was used for A. Handwritten Experimental Data
this purpose and OCR datasets were used to train and test Table I depicts a comprehensive description of the
data. The total training time taken for each English alphabet segmented dataset used in the project. The dataset was
was 0.398 seconds and the average accuracy produced was collected by distributing the template to 250 students in
95.513%. Prince Mohammad Bin Fahd University. As illustrated in
In [21] an algorithm based on local adaptive thresholding Fig. 2, each answer sheet consisted of 20 questions,
and geometric features was proposed to segment different therefore a total of 5080 segmented images was obtained for
regions from scanned Arabic documents based on the 250 answer sheets. Since some of the images were not
Physical Layout Analysis (PLA). This method was applied segmented properly, only 4871 segmented images were
to a random dataset of images from various publishers selected (shown in Table II). Therefore, a total of 209
containing the text zone, image zone, and graphic zone. This images was discarded as some of the scanned images
algorithm achieved an average recognition of 86.71% for captured were slightly tilted in orientation. This posed a
Text and Image block regions. problem because the segmentation algorithm considered
III. PROPOSED SYSTEM each pixel value of the template. A slight difference in the
orientation of the template causes different pixel values
which would further yield undesirable results. Moreover, the
data split chosen used the 80/20 rule, in which 80% of the
dataset, i.e., 3507 images were used for training and 20% of
the segmented images, i.e., 973 images were used for testing
purposes.

TABLE I. COLLECTED DATASET

Answers Classes Dataset Size


A 461
B 403
C 330
D 226
F 835
T 1090
1 271
2 251
3 251
4 249
5 252
6 252
Total 4871
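For illustration, a minimal sketch (not the authors' code) of loading such a segmented-image set and applying the 80/20 train/test split described above; the folder layout, file format, and preprocessing are assumptions made only for this example.

# Sketch: load segmented answer images and split them 80/20 for training/testing.
import glob
import numpy as np
from PIL import Image
from sklearn.model_selection import train_test_split

CLASSES = ["A", "B", "C", "D", "F", "T", "1", "2", "3", "4", "5", "6"]  # classes of Table I

def load_segments(root="segments"):
    images, labels = [], []
    for idx, cls in enumerate(CLASSES):
        for path in glob.glob(f"{root}/{cls}/*.jpg"):        # hypothetical per-class folders
            img = Image.open(path).convert("L").resize((64, 64))
            images.append(np.asarray(img, dtype=np.float32) / 255.0)
            labels.append(idx)
    return np.stack(images), np.array(labels)

X, y = load_segments()
# 80% of the images for training, 20% for testing, stratified over the 12 classes
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)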

TABLE II. INCORRECTLY SEGMENTED IMAGES

Fig. 1. Proposed System


B. Segmentation
A portable system was proposed as is illustrated in Fig. 1. Algorithm for Segmentation of Isolated Answers from
It was designed to automatically recognize and grade the Template is as follows:
handwritten answer sheets. For which a portable scanner
‘Fujitsu Scansnap ix100’ is used to scan and store the Input: Input scanned Answer Sheet
student’s handwritten onto the Raspberry Pi where the Output: Isolated Answers segments with labels
scanned images are converted from pixmap to jpg format. Step 1: Start
The jpg formatted image is sent to a laptop/PC from
Step 2: Convert input scanned document to black and
the Raspberry Pi wirelessly which is done by using
the SSH server. The laptop on which the scanned answer white.
sheets are uploaded contains the CNN models as well as Step 3: Map the scanned document to the pre-defined
the MATLAB Segmentation code. First, the scanned images template of answer sheet.
are segmented so that only the handwritten alphabets and Step 4: Extract the questions from the input by cropping
numbers are fed to the machine learning algorithm. the image segments based on the template.
Thus, after segmentation, the training data is fed to the CNN Step 5: Isolation of the twenty answers from the template is
models so that they are trained well to recognize the obtained by discarding the outliers.
handwritten answers whereas the testing data is provided to Step 6: Label all the images segments based on the prior
scanned template.

Step 7: Save each segment answer in local drive. size. Higher batch size significantly degrades the quality of
Step 8: End the model, as measured by its ability to generalize.
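A minimal OpenCV sketch of Steps 2-7 above, assuming the (x, y, width, height) coordinates of the twenty answer boxes in the pre-defined template are known in advance; the coordinate values below are placeholders, not the real template layout.

# Sketch: template-based cropping of the twenty answer boxes from a scanned sheet.
import os
import cv2

ANSWER_BOXES = {f"Q{q}": (950, 120 + 85 * (q - 1), 160, 70) for q in range(1, 21)}  # hypothetical

def segment_answer_sheet(scan_path, out_dir="segments"):
    os.makedirs(out_dir, exist_ok=True)
    gray = cv2.imread(scan_path, cv2.IMREAD_GRAYSCALE)                        # Step 2: read the scan
    _, bw = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)  # black and white
    for label, (x, y, w, h) in ANSWER_BOXES.items():                          # Steps 4-6: crop and label
        segment = bw[y:y + h, x:x + w]
        cv2.imwrite(os.path.join(out_dir, f"{label}.jpg"), segment)           # Step 7: save each segment

segment_answer_sheet("scanned_sheet_001.jpg")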
The output size of an image that is produced from the
Table III depicts images of some properly segmented images hidden layer is fed into a logistic function like softmax.
from a student’s answer sheet (refer Fig. 2 for the sample ReLU layer or the activation function performs an element-
template). wise activation function max (0, x) that changes the negative
values to zeros. The layer does not alter the size of the
TABLE III. SAMPLES OF SEGMENTED ANSWERS volume since there are no hyperparameters present. The
softmax layer outputs a probability distribution, that is, the
values of the output sum equal to 1. In addition, the softmax
layer is a soft version of the max-output layer and hence it is
differentiable and also resilient to outliers. Max pooling is
the most used type of pooling which only takes the most
important part of the input volume and the largest element
from the rectified feature map. Dropout is a layer whose
function is to drop out a random set of activations in the
layer by setting them to zero. Moreover, it forces the network
to be redundant by providing the network with the right
classification or output for a specific example even if few of
the activations are dropped out. It also assures that the
network is not getting overfitted to the training data.
Dense is  a non-linear activation function that first
performs classification to the features that are extracted by
the convolutional layers, then it downsamples the pooling
layers. Each node present in this layer is connected to every
node in the preceding layer. Adam is an optimization
algorithm which is used in replacement of the classical
stochastic gradient descent procedure to update the network
weights iterative based on training data. It usually is a
combination of RMSprop and Stochastic Gradient Descent
with a momentum that uses the squared gradients to scale the
learning rate like RMSprop. The cross-entropy loss function
calculates the error rate between the expected value and the
original value. Minimizing cross-entropy loss function
approximately will help gain better performance. In the
hidden/convolutional layer, all the artificial neurons are
attached to the neurons of the preceding layers, in order to
give out an output by picking up a set of weighted inputs. It
is necessary to minimize the number of hidden layers, due to
the fact that a large number of hidden layers would result in
an overfit and enlarged computation.
D. Proposed CNN Models
Fig. 2. Sample Answer Sheet Fig. 3 illustrates the proposed CNN architecture of
Model 1. The input image for the CNN model used is of size
C. Convolutional Neural Network (CNN) 64x64 and then it is passed through a convolution layer of
Parameters are an essential part of Convolutional Neural 64 filters with a kernel size of 5x5 and ReLU activation
Network that helps  to optimize the quality of the neural function. It is then followed by the 2 x 2 MAX Pooling layer
network. Their role is to avoid the overfitting and that downsamples the image and aids in identifying the most
underfitting of the model for a given dataset. Changes in the important features. This leads to a decrease in the size of the
parameters helps to get the desired results for a specific
image. Then it passed through another convolution layer of
problem. This section talks about the different CNN
48 filters of kernel size 3 x 3 and ReLU activation function.
parameters which were implemented to design the CNN
models. In order to avoid overfitting, the images then go through
20% regularization in the first dropout layer. The image
Firstly, a kernel size of more than 5 x 5 is not used since further goes to more convolution, max
large kernel size results in a slower training time. Secondly, pooling, ReLU function, and dropout layers until the sample
to minimize the error on the training data, the number of data is ultimately converted into one-dimensional vector
rounds of optimization that were implemented during which happens due to the flatten layer. The final
training is increased. However, this can lead to an overfitting layers comprise of three dense layers that consist of 512,
in the neural network which will thereby result in 256 and 12 features. The first two dense layer
performance degradation during the testing phase. In order to
uses ReLU and the third dense layer
analyze this, monitoring of error performance is done
uses softmax activation function which helps to convert the
separately on the testing data as the number of
epochs increases. Larger batch size requires larger memory

output into a probability distribution. The image is then recognized based on its probability distribution value.

Fig. 3. Proposed CNN architecture of Model 1

Fig. 4 illustrates the proposed Model 2 with slight changes. The first convolution layer consists of 32 filters with a kernel size of 5x5 and ReLU activation function. It is then followed by another convolutional layer of 64 filters with the same kernel size. Then the images go through 10% regularization in the first dropout layer. A 2 x 2 MAX Pooling layer is then applied to the images. Later, the images are passed through another convolution layer of 32 filters of kernel size 3 x 3 and ReLU activation function. In order to avoid overfitting, the images then go through 20% regularization in a second dropout layer. Further, the images undergo more convolution, max pooling, ReLU, and dropout layers. The final layers comprise four dense layers that consist of 512, 256, 64 and 12 features. The first three dense layers use ReLU and the fourth dense layer uses the softmax activation function.

Fig. 4. Proposed CNN architecture of Model 2

E. Scoring
The algorithm for scoring each student's answer sheet is as follows:

Input: Input Template Filled in A4 sheet
Output: Score of the Input Template
Step 1: Start
Step 2: Load the CNN Model
Step 3: Obtain segmented student's answer sheets
Step 4: Read each segmented student's answer files
Step 5: Compare student's answers with True answers
Step 6: Score the answer sheets
Step 7: Display the score
Step 8: End.

Fig. 2 illustrates a student's answer sheet which provides the correct answers to all the questions. After segmentation, the answers to each question are used as the reference to score the students' answer sheets.

IV. EXPERIMENTAL RESULTS
Table IV illustrates the experimental results accomplished by the two proposed CNN models. Epoch sizes of 10, 25, 50 and 100 were employed in each experiment. Furthermore, for each epoch setting, batch sizes of 50, 100 and 200 were implemented. Several parameters were tuned and the influence of each on the level of accuracy was evaluated. A total of 12 experiments were carried out separately for model 1 and model 2. Both models were executed with the common goal of finding an optimized architecture. The experimental results demonstrate that the test accuracies generated for both models were heavily dependent upon the number of epochs and the batch size. An increase in the epoch size led to an increase in the test accuracy. Conversely, an increase in the batch size led to a decrease in the test accuracy. Optimal test accuracy was achieved when the epoch size is equal to the batch size for both model 1 (92.866%) and model 2 (92.3274%). Overall, the test accuracy level achieved for model 1 is better in terms of lower computation time and a minor increase in test accuracy. The results provide an enhanced basis for utilizing CNN architectures for handwritten character recognition as a resolution to the challenges caused by traditional methods.

TABLE IV. EXPERIMENTAL RESULTS OF PROPOSED CNN MODEL

Model #   Batch Size   Epochs   Testing Accuracy   Computation Time (Seconds)
1         50           10       90.120 %           1772.201
1         100          10       86.451 %           1409.258
1         200          10       80.164 %           954.4838
1         50           25       91.916 %           2284.184
1         100          25       92.430 %           3919.011
1         200          25       92.353 %           2258.307
1         50           50       92.866 %           4750.766
1         100          50       92.763 %           4483.056
1         200          50       92.738 %           5266.739
1         50           100      92.840 %           10790.350
1         100          100      92.763 %           10123.380
1         200          100      92.840 %           9349.122
2         50           10       90.4799 %          1803.066
2         100          10       90.4799 %          1363.192
2         200          10       81.011 %           1885.508
2         50           25       92.2504 %          3579.950
2         100          25       92.1991 %          3295.304
2         200          25       90.8648 %          4753.194
2         50           50       92.3274 %          6125.266
2         100          50       92.3531 %          6142.417
2         200          50       92.3531 %          5819.936
2         50           100      92.4301 %          11866.290
2         100          100      92.4044 %          12197.170
2         200          100      92.1735 %          11079.200
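To make the Model 1 description of Section III-D concrete, an illustrative Keras sketch of its layer stack is given below. Only the layers explicitly mentioned in the text are shown; the additional repeated convolution/pooling/dropout blocks and all training settings are assumptions.

# Sketch of the Model 1 layer stack (sizes taken from the text; not the authors' code).
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Conv2D(64, (5, 5), activation="relu", input_shape=(64, 64, 1)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(48, (3, 3), activation="relu"),
    layers.Dropout(0.20),                     # 20% regularization
    layers.Flatten(),
    layers.Dense(512, activation="relu"),
    layers.Dense(256, activation="relu"),
    layers.Dense(12, activation="softmax"),   # 12 answer classes (A-D, T, F, 1-6)
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
# e.g. model.fit(X_train, y_train, epochs=50, batch_size=50) corresponds to one of the
# epoch/batch configurations evaluated in Table IV.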

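A compact sketch of Steps 2-6 of the scoring algorithm: each segmented answer is classified by the trained CNN and compared with the answer key. All names and data structures here are illustrative, not taken from the paper.

# Sketch: score one answer sheet from its segmented answer images.
import numpy as np

def score_sheet(model, segments, answer_key, classes):
    # segments: {question: 64x64 grayscale array}, answer_key: {question: correct label}
    score = 0
    for question, image in segments.items():
        probs = model.predict(image.reshape(1, 64, 64, 1), verbose=0)
        predicted = classes[int(np.argmax(probs))]
        if predicted == answer_key[question]:
            score += 1
    return score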
V. CONCLUSION
Offline handwritten recognition systems based on machine learning algorithms have significant importance in the research field. However, recognition remains difficult due to the presence of odd characters or similarity in shape among multiple characters. This paper proposed a system that was implemented to recognize handwritten characters and then display the final score of the student. The system was evaluated on a dataset that consisted of 250 answer sheets, and this data was tested by using two deep convolutional neural network models. The results attained a high testing accuracy of 92.86%. The accuracy of the system was lower than that of some systems mentioned in Section II because the system used its own handwritten dataset. In future work, the segmentation algorithm can be improved to attain a higher segmentation accuracy. Moreover, the proposed CNN architectures can also be enhanced to achieve much higher performance and accuracy in displaying the score of the student.

REFERENCES
[1] Brown, M. T. (2017). Automated Grading of Handwritten Numerical Answers. In 2018 16th International Conference on Frontiers in Handwriting Recognition (ICFHR) (pp. 279-284). IEEE.
[2] Murray, K. W., & Orii, N. (2012). Automatic essay scoring. IEICE Transactions on Information and Systems, 102(1), 147-155.
[3] Alomran, M., & Chia, D. (2018). Automated Scoring System for Multiple Choice Test with Quick Feedback. International Journal of Information and Education Technology, 8(8).
[4] Cupic, M., Brkic, K., Hrkac, T., Mihajlovic, Z., & Kalafatic, Z. (2014, May). Automatic recognition of handwritten corrections for multiple-choice exam answer sheets. In Information and Communication Technology, Electronics and Microelectronics (MIPRO), 2014 37th International Convention on (pp. 1136-1141). IEEE.
[5] Srihari, S., Collins, J., Srihari, R., Srinivasan, H., Shetty, S., & Brutt-Griffler, J. (2008). Automatic scoring of short handwritten essays in reading comprehension tests. Artificial Intelligence, 172(2-3), 300-324.
[6] Mahana, M., Johns, M., & Apte, A. (2012). Automated essay grading using machine learning. In Document Analysis and Recognition, ICDAR. 10th International Conference (pp. 1206-1210). IEEE.
[7] Saengtongsrikamon, C., Meesad, P., & Sodsee, S. (2009). Scanner-based optical mark recognition. Information Technology Journal, 5(1), 69-73.
[8] Alghazo, J. M., Latif, G., Alzubaidi, L., & Elhassan, A. (2019). Multi-Language Handwritten Digits Recognition based on Novel Structural Features. Journal of Imaging Science and Technology, 63(2), 20502-1.
[9] Alghazo, J. M., Latif, G., Elhassan, A., Alzubaidi, L., Al-Hmouz, A., & Al-Hmouz, R. (2017). An Online Numeral Recognition System Using Improved Structural Features: A Unified Method for Handwritten Arabic and Persian Numerals. Journal of Telecommunication, Electronic and Computer Engineering (JTEC), 9(2-10), 33-40.
[10] Chai, D. (2016, December). Automated marking of printed multiple-choice answer sheets. In 2016 IEEE International Conference on Teaching, Assessment, and Learning for Engineering (TALE) (pp. 145-149). IEEE.
[11] Muangprathub, J., Shichim, O., Jaroensuk, Y., & Kajornkasirat, S. (2018). Automatic Grading of Scanned Multiple-Choice Answer Sheets.
[12] Patole, S., Pawar, A., Patel, A., Panchal, A., & Joshi, R. (2016, March). Automatic system for grading multiple choice questions and feedback analysis. International Journal of Technical Research and Applications, 12(39), 16-19.
[13] Tavana, A. M., Abbasi, M., & Yousefi, A. (2016, September). Optimizing the correction of MCQ test answer sheets using digital image processing. In 2016 Eighth International Conference on Information and Knowledge Technology (IKT) (pp. 139-143). IEEE.
[14] Abbas, A. A. (2009). An automatic system to grade multiple choice questions paper-based exams. Journal of University of Anbar for Pure Science, 3(1), 174-181.
[15] Ciresan, D. C., Meier, U., Gambardella, L. M., & Schmidhuber, J. (2011, September). Convolutional neural network committees for handwritten character classification. In Document Analysis and Recognition (ICDAR), 2011 International Conference on (pp. 1135-1139). IEEE.
[16] Latif, G., Alghazo, J., Alzubaidi, L., Naseer, M. M., & Alghazo, Y. (2018, March). Deep Convolutional Neural Network for Recognition of Unified Multi-Language Handwritten Numerals. In 2018 IEEE 2nd International Workshop on Arabic and Derived Script Analysis and Recognition (ASAR) (pp. 90-95). IEEE.
[17] Singh, N. (2018, February). An Efficient Approach for Handwritten Devanagari Character Recognition based on Artificial Neural Network. In 2018 5th International Conference on Signal Processing and Integrated Networks (SPIN) (pp. 894-897). IEEE.
[18] Kumar, P., Sharma, N., & Rana, A. (2012). Handwritten Character Recognition using Different Kernel based SVM Classifier and MLP Neural Network (A Comparison). International Journal of Computer Applications, 53(11), 413-435.
[19] Rao, Z., Zeng, C., Wu, M., Wang, Z., Zhao, N., Liu, M., & Wan, X. (2018). Research on a handwritten character recognition algorithm based on an extended nonlinear kernel residual network. KSII Transactions on Internet & Information Systems, 12(1), 25-31.
[20] Jeong, S. H., Nam, Y. S., & Kim, H. K. (2003, August). Non-similar candidate removal method for off-line handwritten Korean character recognition. In Document Analysis and Recognition, 2003. Proceedings. Seventh International Conference on (pp. 323-328). IEEE.
[21] Al-Dobais, M. A., Alrasheed, F. A. G., Latif, G., & Alzubaidi, L. (2018, March). Adoptive Thresholding and Geometric Features based Physical Layout Analysis of Scanned Arabic Books. In 2018 IEEE 2nd International Workshop on Arabic and Derived Script Analysis and Recognition (ASAR) (pp. 171-176). IEEE.

Wrapper-based Feature Selection for Imbalanced
Data using Binary Queuing Search Algorithm
Thaer Thaher Majdi Mafarja Baker Abdalhaq Hamouda Chantar
IT Dept. CS Dept. ICS Dept. CS Dept.
At-Tadamun Society Birzeit University An-Najah National University Sebha University
Nablus, Palestine Birzeit, Palestine Nablus, Palestine Sebha, Libya
thaer.thaher@gmail.com mmafarja@birzeit.edu baker@najah.edu hamoudak77@gmail.com

Abstract—The non-uniform distribution of classes (imbalanced classification model. If the learning algorithm is involved in the
data) and the presence of irrelevant and/or redundant infor- selection process, then the method is said to follow the wrapper
mation are considered as challenging aspects encountered in approach. Otherwise, the filter approach is being followed. The
most real-world domains. In this paper, we propose an efficient
software fault prediction (SFP) model based on a wrapper main difference between filters and wrappers is that the filter
feature selection method combined with Synthetic Minority approach is computationally more efficient than the wrapper
Oversampling Technique (SMOTE) with the aim of maximizing approach, thus the selected features may not be appropriate for
the prediction accuracy of the learning model. A binary variant of some learning algorithms. However, in the wrapper approach,
recent optimization algorithm; Queuing Search Algorithm (QSA), the selection of the features is decided based on the classi-
is introduced as a search strategy in wrapper FS method. The
performance of the proposed model is assessed on 14 real-world fication accuracy of machine learning algorithm. This may
benchmarks from the PROMISE repository in terms of three lead to high computational time, but at the same time, higher
evaluation measures; sensitivity, specificity, and area under the performance is guaranteed [1].
curve (AUC). Experimental results reveal a positive impact of FS problem can be defined as the task of finding the subset
the SMOTE technique in improving the prediction performance the selection of the features is decided based on the classi-
in a highly imbalanced data. Moreover, the binary QSA (BQSA)
shows superior efficacy on 64.28% of the datasets compared with lead to high computational time, but at the same time, higher
other state-of-the-art algorithms in handling the problem of FS. is considered as a hard optimization problem. First, FS is
The combination of BQSA and SMOTE achieved an acceptable formulated as a multi-objective problem, in which the lowest
AUC results (66.47-87.12%). number of features that satisfies highest prediction quality is
Index Terms—Queuing Search Algorithm, Feature Selection, required [1]. Second, datasets with a large number of features
SMOTE, Transfer Function, Software Fault Prediction
(high dimensional) increase the complexity of this problem.
As a general speaking, there exists 2N possible subsets when
I. I NTRODUCTION
dealing with a dataset with N number of features. The
In data mining, classification techniques are used to cate- search space is exponentially increased, and thus complete
gorize data points into predefined labels based on the avail- and random search methods are impractical to handle such
able features in the dataset. Therefore, the used features for problem [2]. On the other hand, heuristic search approaches
building the classification models have high influence on the have shown superior performance in tackling various complex
performance of the constructed models. That’s to say, if some problems. These techniques guide the search process towards
irrelevant or redundant features are available in the dataset, a high quality solution within a reasonable time [3].
they will mislead the classification model, and consequently Human-based algorithms are a class of metaheuristics which
degrade its performance. Thus, selecting the most informative are inspired by some human activities. Examples of this cate-
features becomes a crucial process in order to get a high- gory include Teaching Learning Based Optimization (TLBO),
performance classification model with less computational time. Harmony Search (HS), and Passing Vehicle Search (PVS)[4].
FS is a preprocessing step that aims to build a robust classifi- Recently, various algorithms have been well exploited with
cation model by involving the most informative features, and wrapper feature approaches in order to handle the FS problem.
discard noisy and irrelevant ones. So, FS is considered as an In [5], a HS based wrapper selection approach was proposed to
optimization problem that can be handled based on two main improve the handwritten word recognition. Allam and Nadim
stages, namely feature subset generation, and feature subset [6] introduced a binary variant of TLBO to be used as a search
evaluation. A search mechanism is employed to generate strategy to handle FS. Other variants of metaheuristic algo-
feasible subsets of features, while an evaluation technique rithms have been extensively utilized for FS problem. Some
(learning algorithm) is used to assess the generated subsets examples of these algorithms include: Whale Optimization
and thus guiding the search process towards to optimal solution Algorithm (WOA)[7], Gravitational Search Algorithm (GSA)
[1]. [8], and Particle Swarm Optimization [9]. However, QSA has
FS methods are mainly distinguished based on their de- not been employed in the area of FS yet.
pendency on the learning algorithm that is used to build the Another interesting challenge that may degrade the perfor-



mance of classification techniques is the imbalanced dataset B. SMOTE oversampling technique
[10]. It is a very common scenario in real-world problems in The SMOTE is recognized as a powerful oversampling
which the class of interest (the minority class) is much rarer method designed by Chawla et al. to provide a balanced
than other classes (the majority class). The unequally repre- data [14]. The main advantages of SMOTE compared to other
sentation of target class decreases the prediction performance sampling techniques are: 1) it preserves the original data
of the classification model. For this reason, various approaches without information loss, and 2) it increases the representation
have been proposed to handle the problem of imbalanced of minority class samples without duplication through a non-
dataset such as kernal-Based methods, Active Learning meth- random mechanism [10]. The SMOTE algorithm generates
ods, Cost-Sensitive methods, and sampling methods [10]. new synthetic observations based on the original minority
In this paper, an efficient software fault prediction (SFP) instances [14]. For each minority sample (xi ), the SMOTE
model is proposed based on a wrapper feature selection procedure identifies the k-nearest minority neighbours (x̂ij )
combined with Synthetic Minority Oversampling Technique using the Euclidean distance, where j = 1, 2, ..., k. Then new
(SMOTE) for the sack of increasing the accuracy. Our contri- synthetic samples are generated along the lines joining the
butions are summarized as follows: minority sample and its j selected neighbours as in Eq. (1)
• The SMOTE method is investigated using various over-
sampling ratios with the aim of handling the imbalanced xnew = xi + (x̂i − xi ) ∗ r (1)
class distribution in the datasets. where r is a random vector within [0,1], x̂i denotes one of the
• A wrapper-based FS method using a binary variant of the
k neighbours. The value of k depends on the desired amount
recently proposed QSA algorithm is designed to select the of oversampling.
most informative features.
• In comparison to the state-of-the-art algorithms, the pro- III. Q UEUING S EARCH A LGORITHM
posed binary QSA showed a clear superiority. QSA is a novel human-based meta-heuristic optimizer that
The rest of this paper is set out as follows: Section II was recently developed by Zhang et al. [4] in 2018. QSA
presents basics and theoretical background related to the mimics the behaviors and interactions between humans during
problem of imbalanced data. The QSA algorithm is introduced the queuing process. In our daily activities such as shopping
briefly in Section III. Section IV describes the proposed feature and waiting for services, customers usually follow the staff
selection approach. Section V reports the experimental results member with high ability to provide services in a short
as well as the analysis and discussion. Lastly, the conclusion time. Inspired by this principle, the authors have developed
and future directions are presented in Section VI. the population-based QSA algorithm, in which the group
II. BASICS AND BACKGROUND of customers represents the search agents, while the staff
members lead these search agents. In QSA, the trad off
A. Imbalanced Datasets
between exploration and exploitation are implemented during
Data imbalance problem is considered as one of the com- three main phases (called business1, business2, and business3
mon problems in data mining and ML fields [10]. This respectively). The following subsections briefly explain these
problem arises when the target classes of the instances within phases.
a dataset are unequally distributed such that the minority
classes have significantly lower representation than the ma- A. Business 1
jority classes [11]. In such a case, the classification models In this phase, all customers (search agents) are distributed
become more biased towards the dominant class and thus into three queues named queue1n (n=1,2,3). The fittest three
leads to incorrectly prediction of the minority class (i.e. most agents denoted by A1n are selected as staffs (the leaders). The
instances will be predicted as majority class) [12]. It has been number of customers (q1n ) in each corresponding queue are
observed that in various real-world domains such as disease calculated as in Eq. (2) based on the principle that the better
diagnosis, fraud detection, and software defect prediction, staffs can server more customers within the same amount of
the minor class is usually more important than the others. time.
Therefore, Handling the problem of imbalanced data is posing 1/T1n
q1n = N × (2)
a major challenge [10] 1/T11 + 1/T12 + 1/T13
Various techniques have been introduced including sampling where N is the number of search agents, and T1n represents
methods, Cost-Sensitive methods, Kernel-Based methods, and the fitness value (service time) for the selected staffs. Then
so on [10]. In sampling mechanism, the balanced representa- the customers in each queue are updated using one of the two
tion of classes can be achieved by either reducing the number rules as defined in Eqs. (3) and (4):
of majority class samples (under sampling) or through in-
creasing the number of minority class samples (oversampling) Xinew = A + F11 (3)
[13]. However, the shortcoming of under-sampling approach
Xinew = Xi + F12 (4)
is the information loss which may cause poor classification
performance. To overcome this problem, we utilized the over- where i is the current iteration, Xinew is the updated state of
sampling technique using SMOTE method in this work. Xinew , A is the staff for customeri , F 11 and F 12 represents

the fluctuation of service process which are represented in Eq. IV. T HE PROPOSED A PPROACH
(5) and (6) respectively. A. Data pre-processing
F11 = β × α × (E. |A − Xi |) + (e × A − e × Xi ) (5) In this work, the experiments are applied over 14 real
F12 = β × α × (E. |A − Xi |) (6) datasets available in PROMISE software engineering reposi-
tory. These data are considered to be free of noise and missing
where α is a random number in the interval [-1, 1], E is values while having imbalanced samples [15]. Therefore, we
an Erlang distributed random vector by size 1× D, e is an applied SMOTE technique with different oversampling ratios
Erlang distributed random number, | | represents the absolute to get more balanced datasets.
value, (.) denotes the element by element multiplication, and
β is an adaptive control parameter used to adjust the range of B. Feature Selection using Binary Queuing Search Algorithm
fluctuation which is computed as in Eq. (7) Metaheuristics are problem-independent, so they can be
adapted to handle problems related to various domains [3].
β = e^(ln(1/t) × (t/T)^0.5)    (7)
However, two essential design aspects should be considered:
where t is the current iteration, and T represents the maximum the solution representation for the handled problem and the
allowable iterations. evaluation (fitness) function.
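A rough numpy sketch of the business-1 update (Eqs. (2)-(7)) is given below. The Erlang draws use shape parameter 1, and the even choice between the rules of Eqs. (3) and (4) is a simplifying assumption; the original paper defines its own selection scheme.

# Sketch of one business-1 iteration for a real-valued QSA population.
import numpy as np

def business1_step(X, staffs, staff_fitness, t, T_max, rng=np.random.default_rng(0)):
    N, D = X.shape
    q = np.round(N * (1.0 / staff_fitness) / np.sum(1.0 / staff_fitness)).astype(int)  # Eq. (2)
    beta = np.exp(np.log(1.0 / t) * (t / T_max) ** 0.5)                                # Eq. (7)
    X_new = X.copy()
    start = 0
    for A, qn in zip(staffs, q):                           # each staff serves its queue
        for i in range(start, min(start + qn, N)):
            alpha = rng.uniform(-1, 1)
            E = rng.gamma(shape=1.0, scale=1.0, size=D)    # Erlang-distributed vector
            e = rng.gamma(shape=1.0, scale=1.0)            # Erlang-distributed scalar
            F11 = beta * alpha * (E * np.abs(A - X[i])) + (e * A - e * X[i])   # Eq. (5)
            F12 = beta * alpha * (E * np.abs(A - X[i]))                        # Eq. (6)
            X_new[i] = A + F11 if rng.random() < 0.5 else X[i] + F12           # Eqs. (3)/(4)
        start += qn
    return X_new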
B. Business 2 1) Solution representation: FS is recognized as a binary
optimization problem. That is, the set of features are encoded
In the second phase, A portion of customers are selected
as a vector of zeros and ones. In which, a specific feature is
to utilize the update strategies of this phase. as in business
selected if the value of its corresponding element is set to 1;
1, there are three queues where the number of customers for
otherwise it is ignored because the value of its corresponding
each queue is computed by Eq. (2). Initially, the customers are
element is set to 0. However, QSA was originally designed to
sorted in descending order based on their fitness value (fi ),
solve problems with a continuous search domain. Therefore,
then each costumer given a probability to be handled as in Eq.
we should employ an efficient binarization method that allows
pri = rank(fi )/N (8) adapting QSA to solve the binary FS problem.
Transfer function (TF) is recognized as a simple, cheap,
Hence, the worst agents have a higher chance of being handled
and efficient operator that has been widely used to map the
than the fittest ones. For each agent, a random number (r)
continuous search space to a binary one [16]. In the this
within [0,1] is generated, if the random number is less than
strategy, the optimizer works without adjustments, then the
pri , then this agent will be updated. In business2, the selected
obtained solutions are converted into binary by including two
agents are updated based on two patterns as defined in Eq.
Xinew = Xi + e × (Xr1 − Xr2) if r < cv;  Xi + e × (A − Xr1) if r ≥ cv    (9)
steps. 1) The TF is employed to map the real values in Rn into values in the range [0,1] such that each value represents the probability of transforming the corresponding real value into
binary. 2) A binarization rule that is used to convert the output
where e is an Erlang distributed random number, Xr1 and Xr2
of TF into 1 or 0 [17].
are two randomly selected customers, A is the leader(staff) for 1

each considered queue. cv is a confusion degree which is used 0.9

0.8

to control the selection between the two updating patterns, 0.7

0.6

which is computed using Eq.


T(X)

0.5

0.4

0.3

Cv = T21 /(T22 + T23 ) (10) 0.2

0.1

where T2n n=(1,2,3) represents the fitness values of the best 0


-10 -5 0
X
5 10

three customers obtained so far in business2. It is noticeable


that the first update pattern in Eq. (9) has a higher chance to be Fig. 1: S-shaped TF
selected as the increase of Cv . So, the state of the customers
In this paper, a binary variant of QSA (BQSA) is proposed
is updated based on other customers.
based on a sigmoidal TF (or S-shaped) as shown in Fig. 1 .
C. Business 3 This function as given in Eq. (12) was originally introduced
As in business2, a part of agents are allowed to handle this by Kennedy and Eberhart [18]. For the next step, we applied
phase. However, the update process is done at the level of the standard rule as defined in Eq. (13)
dimensions. This means that the j th dimension of each search 1
agent is mutated according to mutation probability assigned T (xji (t)) = j (12)
1 + e−xi (t)
for this agent by Eq. (8). The dimensions are updated as the
rule in Eq. (11) where xji represents the j th dimension of the ith solution at
iteration t, and T (xji (t)) is the probability value obtained by
Xinew = Xr1,j + (e × Xr2,j − e × Xi,j ) (11) TF. {
It can be seen that other customers have a greater influence j 0 If r < T (xji (t + 1))
xi (t + 1) = (13)
than staffs in last phase of QSA. 1 otherwise

where r is a random number restricted in range [0,1], and and the remaining part was used for testing purposes. This
Xij (t + 1) is the new binary output. procedure is repeated k times, thus each instance of the dataset
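A short sketch of the S-shaped transfer function and binarization rule of Eqs. (12) and (13): the continuous QSA position is mapped to per-dimension probabilities and then to a binary feature mask.

# Sketch: map a continuous position vector to a binary feature-selection mask.
import numpy as np

def binarize(x_continuous, rng=np.random.default_rng(0)):
    probs = 1.0 / (1.0 + np.exp(-x_continuous))   # Eq. (12): sigmoid per dimension
    r = rng.random(x_continuous.shape)
    return np.where(r < probs, 0, 1)              # Eq. (13): 0 if r < T(x), 1 otherwise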
2) Fitness function: An efficient fitness function is required is given the opportunity to be employed k − 1 times to train
to guide the search process, thus the generated subset is given the model and one time to validate it. Due to the stochastic
a score that describes its quality. The desired objective of FS behavior of the utilized optimizers, each conducted experi-
is to minimize the number of selected features and maximize ment was repeated 10 times. Hence, an individual algorithm
the classification performance. These two contradictory criteria was evaluated 10 ∗ k times for each dataset. By using this
were formulated using Eq. (14) mechanism, we can be more confident with the results of the
|R| proposed model.
↓ F itness = α × E + β × (14) The implementation of the proposed approach was done
|N |
using MATLAB-R2017a, and the wrapper FS model with
where E denotes the classification error rate, |R| indicates the KNN classifier (with k = 5 [2]) as an evaluation method was
number of selected features, |N | is the number of original adopted to generate the best feature subset. We used KNN for
features. α and β (the complement of α) are two controlling its simplicity and low computational time compared to other
parameters ∈ [0, 1] which are employed to balance between classifiers. It is also a non-parametric learning algorithm that
the importance of both criteria. has shown superior results in several previous FS experiments
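A minimal sketch of the wrapper fitness of Eq. (14), using the KNN classifier (K = 5) with 10-fold cross validation and alpha = 0.99, beta = 0.01 as listed in Table II; the handling of an empty feature subset is an assumption.

# Sketch: evaluate one candidate feature subset (binary mask) as in Eq. (14).
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def fitness(mask, X, y, alpha=0.99, beta=0.01, folds=10):
    selected = np.flatnonzero(mask)
    if selected.size == 0:                     # an empty subset gets the worst score
        return 1.0
    knn = KNeighborsClassifier(n_neighbors=5)
    error = 1.0 - cross_val_score(knn, X[:, selected], y, cv=folds).mean()   # E in Eq. (14)
    return alpha * error + beta * (selected.size / X.shape[1])               # |R| / |N| term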
V. E XPERIMENTAL R ESULTS [19], [7]. All experiments were tested on an Intel machine with
Core i5 2.2GHz processor and 4 GB RAM. To be consistent
In this paper, to test the performance of the proposed and fair, all optimizers in this work were experimented using
approaches, a set of well-known benchmark software fault prediction datasets from the PROMISE repository (see Table I) were used. Observing Table I, it can easily be seen that all datasets are imbalanced: the occurrences of the positive cases are very low compared to the negative cases.

TABLE I: Description of the software fault prediction datasets

Dataset  version  #instances  #defective instances  %defective instances
ant      1.7      745         166                   0.223
camel    1.0      339         13                    0.038
camel    1.2      608         216                   0.355
camel    1.4      872         145                   0.166
camel    1.6      965         188                   0.195
jedit    3.2      272         90                    0.331
jedit    4.0      306         75                    0.245
jedit    4.1      312         79                    0.253
jedit    4.2      367         48                    0.131
jedit    4.3      492         11                    0.022
log4j    1.0      135         34                    0.252
log4j    1.1      109         37                    0.339
log4j    1.2      205         189                   0.922
xalan    2.4      723         110                   0.152

The experiments in this work were conducted in a set of phases. In the first phase, we tried to obtain the best oversampling percentage (see Table IV); then we compared the performance of the proposed BQSA on the datasets without applying any oversampling technique and after using SMOTE as an oversampling technique, where three measurements (i.e., sensitivity, specificity, and AUC) were used to assess BQSA. In the last phase, two types of experiments were conducted to examine the effectiveness of the proposed method: in the first experiment, the classification outcomes of the KNN classifier were compared with those obtained after applying BQSA for FS, while in the second experiment, BQSA was compared with other wrapper FS methods implemented using four SI algorithms, namely the Binary Whale Optimization Algorithm (BWOA), Binary Gravitational Search Algorithm (BGSA), Binary Bat Algorithm (BBA), and Binary Ant Lion Optimizer (BALO).

In all experiments, the classification algorithm (KNN) was trained and tested using the n-fold cross-validation method (where n=10). In this procedure, each dataset has been split into 10 parts, such that nine parts were used for training and the remaining part for testing. All optimizers were run with the same common parameters (100 iterations and 30 search agents); these values were selected after conducting extensive experiments. The other specific parameters were selected based on recommended settings in the original papers and related works on FS. The list of parameter values is presented in Table II. Please note that the best obtained results in the reported tables are highlighted in boldface.

TABLE II: The used parameter settings

Fitness function: α = 0.99, β = 0.01
Common parameters: No. of iterations = 100, population size = 30, dimension = #features, No. of runs = 10
Classification: KNN classifier (K=5), 10-fold cross-validation
GSA: G0 = 100, α = 20
BBA: Qmin = 0, Qmax = 2, loudness A = 0.5, pulse rate r = 0.5
GWO: a decreased from 2 to 0
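The fitness weights α = 0.99 and β = 0.01 in Table II follow the convention used in related wrapper FS studies, where the objective combines the classification error of the induced KNN model with the relative size of the selected feature subset. The exact implementation is not reproduced in this excerpt, so the following Python sketch is only an assumed illustration of that trade-off (the function name and the use of scikit-learn are ours, not the paper's):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def fitness(mask, X, y, alpha=0.99, beta=0.01):
    """Assumed wrapper FS objective: alpha * (10-fold CV error of KNN on the
    selected columns) + beta * (fraction of features kept). Lower is better."""
    selected = np.flatnonzero(mask)
    if selected.size == 0:                      # an empty subset is the worst case
        return 1.0
    knn = KNeighborsClassifier(n_neighbors=5)   # K = 5 as in Table II
    accuracy = cross_val_score(knn, X[:, selected], y, cv=10).mean()
    return alpha * (1.0 - accuracy) + beta * selected.size / X.shape[1]
```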

A. Evaluation Measurements

The performance of classifiers on a set of test data can be described using a specific table called the confusion matrix (or error matrix). Table III demonstrates a confusion matrix for a binary classifier, in which the instances of a given test set are classified as either positive or negative. Various basic measures, such as accuracy, error rate, sensitivity, and specificity, are calculated based on the four outcomes (TP, TN, FP, and FN) of the confusion matrix. Other evaluation measures, such as the Area Under the ROC Curve (AUC), can be derived from the basic measures.

TABLE III: Confusion matrix for binary classification.

                 Predicted positive    Predicted negative
Actual positive  True Positive (TP)    False Negative (FN)
Actual negative  False Positive (FP)   True Negative (TN)
• Sensitivity (true positive rate): the percentage of the positive cases that were predicted as positive.

Sensitivity = TP / (TP + FN)   (15)

• Specificity (true negative rate): the percentage of the negative cases that were predicted as negative.

Specificity = TN / (TN + FP)   (16)

• AUC: an efficient evaluation measure based on the trade-off between sensitivity and specificity. It is calculated as follows:

AUC = (Sensitivity + Specificity) / 2   (17)

For imbalanced data, we are interested in having a high sensitivity on the minority class and a high specificity on the majority class, so AUC is a more appropriate measure of prediction quality over imbalanced data than the accuracy metric [12].
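These measures can be computed directly from the confusion-matrix counts. The short Python sketch below (the function and variable names are ours, for illustration only) reproduces Equations (15)-(17) for a single fold:

```python
def fold_metrics(tp, fn, fp, tn):
    """Compute sensitivity, specificity and AUC from confusion-matrix counts,
    following Equations (15)-(17)."""
    sensitivity = tp / (tp + fn)           # true positive rate, Eq. (15)
    specificity = tn / (tn + fp)           # true negative rate, Eq. (16)
    auc = (sensitivity + specificity) / 2  # trade-off between the two, Eq. (17)
    return sensitivity, specificity, auc

# Example: 30 defective instances (9 missed) and 300 normal instances (30 false alarms)
print(fold_metrics(tp=21, fn=9, fp=30, tn=270))   # -> (0.7, 0.9, 0.8)
```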
B. The impact of the oversampling ratio

In order to find the best oversampling ratio, an extensive experiment was conducted in which different values of the oversampling percentage (100%, 200%, 300%, and 400%) were used. The value that obtained the best results was used in all subsequent experiments. The average AUC results of this experiment are reported in Table IV. Observing the results, we can notice a significant improvement in the AUC values for all re-sampled datasets compared to the original ones. However, one can clearly see that the impact of using a low percentage (i.e., 100% and 200%) is smaller than that of using a high percentage (i.e., 300% and 400%). This is due to the highly imbalanced data, in which a sufficient number of minority instances needs to be oversampled. Moreover, it seems that a further increase in the oversampling ratio (i.e., greater than 300%) has a negative impact on the prediction quality. As per the F-test, increasing the percentage of the minority class by 300% obtains the best rank, followed by 400%, 200%, 100%, and 0% (the original). Thus, the 300% oversampling percentage was adopted in all experiments.

TABLE IV: Average AUC values of BQSA with SMOTE using different oversampling ratios

Dataset     original  100%   200%   300%   400%
ant-1.7     0.699     0.804  0.815  0.835  0.819
camel-1.0   0.498     0.666  0.727  0.763  0.814
camel-1.2   0.590     0.670  0.687  0.665  0.663
camel-1.4   0.592     0.719  0.777  0.786  0.798
camel-1.6   0.569     0.702  0.757  0.770  0.764
jedit-3.2   0.740     0.795  0.801  0.809  0.799
jedit-4.0   0.703     0.765  0.775  0.795  0.785
jedit-4.1   0.723     0.791  0.822  0.828  0.818
jedit-4.2   0.667     0.813  0.826  0.851  0.857
jedit-4.3   0.508     0.563  0.583  0.787  0.788
log4j-1.0   0.788     0.827  0.873  0.859  0.862
log4j-1.1   0.804     0.819  0.845  0.849  0.835
log4j-1.2   0.683     0.761  0.867  0.871  0.880
xalan-2.4   0.614     0.751  0.799  0.828  0.822
Rank (F-test)  5.00   3.86   2.50   1.64   2.00
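To make the meaning of a 300% oversampling ratio concrete, the sketch below generates three synthetic samples per minority instance by interpolating towards a randomly chosen minority neighbour, in the spirit of SMOTE [14]. It is a simplified stand-in for illustration, not the implementation used in this paper:

```python
import numpy as np

def smote_oversample(X_min, n_percent=300, k=5, rng=np.random.default_rng(0)):
    """Generate n_percent/100 synthetic samples per minority instance by
    interpolating between it and one of its k nearest minority neighbours."""
    n_new = n_percent // 100
    synthetic = []
    for x in X_min:
        d = np.linalg.norm(X_min - x, axis=1)        # distances to all minority samples
        neighbours = np.argsort(d)[1:k + 1]          # skip the sample itself
        for _ in range(n_new):
            nb = X_min[rng.choice(neighbours)]
            gap = rng.random()                       # random point on the segment x -> nb
            synthetic.append(x + gap * (nb - x))
    return np.vstack(synthetic)

X_min = np.random.default_rng(1).random((11, 20))    # e.g. jedit-4.3 has 11 defective rows
X_new = smote_oversample(X_min, n_percent=300)       # 33 synthetic defective samples
```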
C. Evaluation results of BQSA with and without SMOTE

In this section, a deeper comparison is conducted between the results based on the original datasets (without oversampling) and those based on the datasets oversampled with the 300% percentage. Table V shows the results of the three measurements (i.e., sensitivity, specificity, and AUC) for BQSA, where each measurement was obtained on both the original datasets and the modified ones. Regarding the sensitivity values (which represent the ability to detect the class of interest, i.e., the defective cases), we can see that using the SMOTE technique significantly improves the performance of BQSA for all datasets. On the other hand, there is a remarkable degradation in the specificity values (which represent the ability to detect the normal cases). To balance these two contradictory behaviours, we rely on the AUC values. Inspecting these values, we can notice the superior performance of BQSA when using SMOTE for all datasets. This result is expected since the datasets are highly imbalanced, and the results on the original datasets suffer from a bias towards the majority class.

TABLE V: The performance of BQSA before and after utilizing SMOTE in terms of sensitivity, specificity, and AUC measures

Dataset     Sensitivity (original / SMOTE)   Specificity (original / SMOTE)   AUC (original / SMOTE)
ant-1.7     0.5018 / 0.9218                  0.8965 / 0.7476                  0.6991 / 0.8347
camel-1.0   0.0000 / 0.5769                  0.9960 / 0.9482                  0.4980 / 0.7625
camel-1.2   0.3745 / 0.8877                  0.8061 / 0.4416                  0.5903 / 0.6647
camel-1.4   0.2324 / 0.8548                  0.9512 / 0.7180                  0.5918 / 0.7864
camel-1.6   0.2043 / 0.8563                  0.9332 / 0.6842                  0.5687 / 0.7702
jedit-3.2   0.6533 / 0.9228                  0.8264 / 0.6962                  0.7399 / 0.8095
jedit-4.0   0.5107 / 0.8793                  0.8961 / 0.7100                  0.7034 / 0.7946
jedit-4.1   0.5494 / 0.9016                  0.8957 / 0.7549                  0.7225 / 0.8283
jedit-4.2   0.3896 / 0.8786                  0.9436 / 0.8232                  0.6666 / 0.8509
jedit-4.3   0.0182 / 0.6205                  0.9977 / 0.9530                  0.5079 / 0.7867
log4j-1.0   0.6265 / 0.9125                  0.9495 / 0.8059                  0.7880 / 0.8592
log4j-1.1   0.6865 / 0.9257                  0.9222 / 0.7722                  0.8044 / 0.8489
log4j-1.2   0.9852 / 0.8862                  0.3813 / 0.8563                  0.6832 / 0.8712
xalan-2.4   0.2718 / 0.8675                  0.9553 / 0.7879                  0.6136 / 0.8277

D. Comparison of BQSA versus other optimizers

In this section, we present a comparison between the BQSA approach and other similar approaches from the literature. To make a fair comparison, all approaches were implemented and all runs were conducted in the same environment with the same parameter settings. Table VI shows the average AUC results for all FS approaches (i.e., BQSA, BWOA, BGSA, BBA, and BALO), in addition to the results of KNN with no FS on the fully over-sampled datasets. It can be seen that BQSA achieved the best results among all approaches on 65% of the datasets and comes in first place according to the F-test, while BWOA comes in second place by achieving the best results on 28% of the datasets. Comparing the results of the FS approaches with those of KNN without FS, it can be concluded that FS as a preprocessing step has an impact on the performance of the learning algorithm by selecting the most informative features. According to the F-test, the BBA algorithm and KNN* without FS come in last place among all approaches.

Figure 2 shows the convergence curves of the FS approaches on some datasets. It is clear that BQSA recorded the best performance among the compared approaches, having the fastest convergence rate on the presented datasets as well as the lowest fitness values. Only BWOA among the presented approaches competes with BQSA on the xalan-2.4 dataset. However, BBA and BGSA suffer from the problem of premature convergence in all cases.
TABLE VI: AUC results of BQSA versus other approaches with SMOTE [KNN*: the classification model without FS]

Dataset     KNN*   BQSA   BWOA   BGSA   BBA     BALO
ant-1.7     0.835  0.835  0.836  0.826  0.8182  0.8340
camel-1.0   0.684  0.763  0.761  0.740  0.6977  0.7589
camel-1.2   0.658  0.665  0.688  0.660  0.6364  0.6669
camel-1.4   0.789  0.786  0.786  0.780  0.7720  0.7837
camel-1.6   0.739  0.770  0.765  0.747  0.7355  0.7642
jedit-3.2   0.751  0.809  0.798  0.794  0.7673  0.8027
jedit-4.0   0.760  0.795  0.793  0.781  0.7696  0.7884
jedit-4.1   0.793  0.828  0.826  0.806  0.7682  0.7988
jedit-4.2   0.786  0.851  0.852  0.835  0.8088  0.8433
jedit-4.3   0.673  0.787  0.782  0.748  0.6903  0.7806
log4j-1.0   0.752  0.859  0.854  0.823  0.8100  0.8492
log4j-1.1   0.727  0.849  0.856  0.825  0.8050  0.8524
log4j-1.2   0.681  0.871  0.867  0.825  0.7681  0.8404
xalan-2.4   0.821  0.828  0.824  0.824  0.8090  0.8186
Rank (F-test)  5.04  1.57  1.86  3.96  5.43    3.14
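The "Rank (F-test)" rows reported above are consistent with the mean rank of each method over the fourteen datasets, the ranking that underlies Friedman-type tests. Under that assumption (the paper does not spell out the computation), they can be reproduced as follows:

```python
import numpy as np
from scipy.stats import rankdata

def mean_ranks(scores):
    """scores: (n_datasets, n_methods) array, higher is better (e.g. AUC).
    Rank the methods on each dataset (1 = best, ties share the average rank)
    and return the mean rank of every method."""
    per_dataset = np.array([rankdata(-row) for row in scores])  # rank in descending order
    return per_dataset.mean(axis=0)

# Toy check with two datasets and three hypothetical methods
scores = np.array([[0.835, 0.835, 0.836],
                   [0.684, 0.763, 0.761]])
print(mean_ranks(scores))   # -> [2.75 1.75 1.5 ]
```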
[Figure 2 comprises four convergence-curve panels plotting fitness value against iteration number (0-100) for BQSA, BWOA, BGSA, BBA, and BALO on (a) camel-1.0, (b) jedit-4.1, (c) jedit-4.2, and (d) xalan-2.4.]

Fig. 2: Convergence curves of the meta-heuristic approaches on selected datasets (with SMOTE).

There are several justifications for the excellent performance of BQSA compared to the other algorithms. It is a multi-leader optimizer in which the whole population is divided into three groups, so that each group follows one of the three best leaders; this allows the search agents to explore several regions at once and thus avoid being trapped in local optima. In addition, it uses several ways to update solutions, which gives the approach robustness and the ability to perform well on several types of objective functions. Furthermore, the utilized adaptive mutation mechanism increases the amount of exploitation towards the end of the iterations.

VI. CONCLUSION

In this paper, a binary version of the QSA algorithm (called BQSA) was introduced to serve as a search strategy in a wrapper FS method. In addition to the FS, to improve the quality of the learning model, a resampling technique (i.e., SMOTE) was used to rebalance the datasets, as they are imbalanced. Having a FS method besides a resampling technique aims to improve the quality of the datasets that are used to train the learning model. The experiments in this paper were conducted on multiple levels: the impact of resampling the datasets was studied in the first stage, based on some preliminary results used to select the best resampling ratio; then a comparison between BQSA and other similar approaches was conducted. The results have confirmed that the BQSA-based wrapper FS technique combined with SMOTE can be utilized as a promising approach for predicting faults in real-world software projects. Our future directions will be related to investigating new variants of BQSA by utilizing different S-shaped and V-shaped transfer functions.

REFERENCES

[1] V. Kumar and S. Minz, "Feature selection: A literature review," Smart Computing Review, vol. 4, pp. 211-229, 2014.
[2] M. Mafarja and S. Mirjalili, "Whale optimization approaches for wrapper feature selection," Applied Soft Computing, vol. 62, pp. 441-453, 2018.
[3] E.-G. Talbi, Metaheuristics: From Design to Implementation. John Wiley & Sons, 2009, vol. 74.
[4] J. Zhang, M. Xiao, L. Gao, and Q.-K. Pan, "Queuing search algorithm: A novel metaheuristic algorithm for solving engineering optimization problems," Applied Mathematical Modelling, vol. 63, 2018.
[5] S. Das, P. Singh, S. Bhowmik, R. Sarkar, and M. Nasipuri, "A harmony search based wrapper feature selection method for holistic Bangla word recognition," Procedia Computer Science, vol. 89, pp. 395-403, 2016.
[6] M. Allam and M. Nandhini, "Optimal feature selection using binary teaching learning based optimization algorithm," Journal of King Saud University - Computer and Information Sciences, 2018.
[7] M. Mafarja, I. Jaber, S. Ahmed, and T. Thaher, "Whale optimisation algorithm for high-dimensional small-instance feature selection," International Journal of Parallel, Emergent and Distributed Systems, pp. 1-17, 2019.
[8] M. Taradeh, M. Mafarja, A. A. Heidari, H. Faris, I. Aljarah, S. Mirjalili, and H. Fujita, "An evolutionary gravitational search-based feature selection," Information Sciences, 2019.
[9] M. Mafarja, R. Jarrar, S. Ahmed, and A. Abusnaina, "Feature selection using binary particle swarm optimization with time varying inertia weight strategies," 2018.
[10] H. He and E. Garcia, "Learning from imbalanced data," IEEE Transactions on Knowledge and Data Engineering, vol. 21, pp. 1263-1284, 2009.
[11] R. Longadge and S. Dongre, "Class imbalance problem in data mining review," Int. J. Comput. Sci. Netw., vol. 2, 2013.
[12] G. Ismail Sayed, A. Tharwat, and A. E. Hassanien, "Chaotic dragonfly algorithm: An improved metaheuristic algorithm for feature selection," Applied Intelligence, 2018.
[13] M. Rahman and D. N. Davis, "Addressing the class imbalance problem in medical datasets," International Journal of Machine Learning and Computing, vol. 3, p. 224, 2013.
[14] N. Chawla, K. Bowyer, L. O. Hall, and W. P. Kegelmeyer, "SMOTE: Synthetic minority over-sampling technique," J. Artif. Intell. Res. (JAIR), vol. 16, pp. 321-357, 2002.
[15] J. Sayyad Shirabad and T. Menzies, "The PROMISE repository of software engineering databases," School of Information Technology and Engineering, University of Ottawa, Canada, 2005. [Online]. Available: http://promise.site.uottawa.ca/SERepository
[16] S. Mirjalili and A. Lewis, "S-shaped versus V-shaped transfer functions for binary particle swarm optimization," Swarm and Evolutionary Computation, vol. 9, pp. 1-14, 2013.
[17] B. Crawford, R. Soto, G. Astorga, J. Garcia Conejeros, C. Castro, and F. Paredes, "Putting continuous metaheuristics to work in binary search spaces," Complexity, vol. 2017, pp. 1-19, 2017.
[18] J. Kennedy and R. C. Eberhart, "A discrete binary version of the particle swarm algorithm," in 1997 IEEE International Conference on Systems, Man, and Cybernetics. Computational Cybernetics and Simulation, vol. 5. IEEE, 1997, pp. 4104-4108.
[19] T. Liao and R. Kuo, "Five discrete symbiotic organisms search algorithms for simultaneous optimization of feature subset and neighborhood size of KNN classification models," Applied Soft Computing, vol. 64, pp. 581-595, 2018.
Self-Organizing Maps for Agile Requirements Prioritization

Amjad Hudaib
King Abdullah II School for Information Technology, The University of Jordan, Amman, Jordan
ahudaib@ju.edu.jo

Fatima Alhaj
King Abdullah II School for Information Technology, The University of Jordan, Amman, Jordan
fat9170261@fgs.ju.edu.jo
Abstract—In building software systems, decisions made at the specification phase strongly affect the rest of the system life cycle. Well-defined requirements at this phase increase the chance of achieving the ultimate goal of delivering software that meets stakeholders' needs. Given limited time and a predefined budget, not all requirements can be fulfilled with the same priority. Here comes the need for requirements prioritization (RP) techniques. This paper presents a new approach to deal with the dynamic nature of the requirements prioritization process in agile development. Training a self-organizing map according to each requirement's predefined features is the main process in the proposed approach. The trained map produces a set of clusters, and a further rank is given to each requirement according to the map's resulting weights. The proposed approach was implemented using different variables related to the requirements themselves and to the self-organizing map, to show its ability to prioritize requirements in the agile development model.

Index Terms—Requirement Prioritization, Agile, ASD, SOM, Self-Organizing Map, Clustering.

I. INTRODUCTION

Dealing with a large number of software requirements along with limited resources (e.g., tight deadlines, budget) can be very confusing for software project managers. Since it is impractical to implement all the requirements at the same time [1], RP helps determine which requirements should be implemented in the early releases and which of them should be set aside for later implementation. The ordering of requirements in the project time-line should be decided precisely; delaying essential requirements may affect the overall success of a software product [2].

Since RP is one of the vital processes of requirements engineering (RE), which plays a vital role in the success or failure of a software product [3], [4], a serious need for efficient RP techniques arises. As [5] mentions, empirical studies over the years have ranked RP as the highest RE sub-area after requirements negotiation. A considerable amount of research work has been done in providing different techniques for RP, and there is still a need to fully analyze these techniques to enhance the maturity of this research exploration [6].

Properly prioritized requirements enable planning the development tasks reasonably to meet the stakeholders' expectations. Even in the commercial off-the-shelf (COTS) case, planning a software release is a vital factor for the product's success, and this planning process should be influenced by prioritizing requirements [7]. RP transforms the project into a sequential execution order or releases [8], where software quality depends on this defined order [9]. This order also helps to avoid conflicting requirements [10] and to remove unnecessary requirements [11]. Implementing requirements following this order leads to an incremental, cumulative, and systematic delivery to the client. Of course, this can help to adjust the project schedule and to discover any hidden misunderstandings between the organization and the stakeholders before moving forward into upcoming stages of the software product's SDLC [12].

Prioritization of requirements is considered the most challenging task for RE teams [13]. Prioritizing requirements with predefined resources is a complex process, and as the number of stakeholders increases, it becomes an even more complicated task; each stakeholder has a different opinion about the requirements, and a single requirement can be considered a major priority by one stakeholder and a minor priority by another [14]. A good RP technique will keep track of all the requirement weights that the stakeholders assign [15]. Stakeholder involvement should be defined and justified in every RP technique.

Agile development can deal with changing requirements [16]. Sprints, which are the incremental releases in an agile system, should be featured according to the changing prioritized requirements [17]. RP in agile differs from RP in traditional RE (waterfall, non-agile development) in two main points [18]: first, prioritizing and re-prioritizing happen throughout the different agile iterations. The RP process is applied before each iteration, which gives enough time before making a decision and allows new information and results to be taken into consideration. Second, the prioritization process mainly considers the business value or relative benefit [19], with early implementation of the highest-priority requirements to obtain the highest business value and the lowest


risk possible.

The main objective of this paper is to provide an approach for RP within the agile model; specifically, to provide a process that can cluster and rank a dynamic list of requirements defined in ASD according to their related features.

This paper is organized as follows: Section II briefly summarizes related work on recent RP approaches and techniques; Section III provides the specification of the proposed approach and relates it to a brief background; Section IV describes the experimental results of implementing the proposed approach; finally, Section V draws the conclusions of this study.

II. RELATED WORK

As mentioned earlier, there are various requirement prioritization techniques. Some of them are common RP techniques, such as the MoSCoW method, validated learning, and the walking skeleton [20], [21], and some of them are dedicated to agile software development (ASD).

Anand and Dinakaran's technique [21] aims to overcome stakeholder conflict by performing two continuous functions, join and prune. The algorithm focuses on identifying the requirements most frequently asked for by stakeholders; the Apriori algorithm, commonly used in databases to find frequent item sets, is employed for this purpose.

Another RP framework is PBURC (a patterns-based unsupervised requirements clustering), which uses unsupervised machine learning for requirements clustering in distributed agile development. Case-Based Ranking (CBRank) [22] also uses machine learning techniques to guide stakeholders in prioritizing project requirements; this technique relies on three main procedures: pair sampling, preference acquisition, and rank learning.

The RIZE technique was proposed to support ASD [23]. It was mainly created to overcome the "issue starvation situation", which arises when a feature or a bug is never treated or accomplished. This situation happens in agile environments due to the short iteration time, which forces developers to work on the higher-priority requirements, leaving the others starving. Another technique that serves ASD is the "walking skeleton" technique, first referred to by Vic Basili [24]. Usually, ASD starts with sprint zero, but the walking-skeleton approach starts with the implementation of a small end-to-end function. It acts as an "initial guess" in the process of developing a final implementation that meets the complete set of requirements. Developer teams in an agile project choose some project features, which specify some requirement(s) to be implemented within a short time.

The MoSCoW method is about ordering requirements by priority, where the most vital requirements need to be implemented first. According to this method, requirements are categorized into four groups: Must have, Should have, Could have, and Won't have [1].

A common method in the RP world is the Analytic Hierarchy Process (AHP), a technique based on pair-wise comparisons of requirements in a matrix, where a scale is assigned to each pair according to the users' preferences. Interactive prioritization using a genetic algorithm [25] is another algorithm for RP. It includes incremental knowledge acquisition and makes use of defined constraints (e.g., dependencies, priorities); the constraints can be included with the requirements or acquired from the client iteratively during the RP process. Disagreement is the main notion this algorithm relies on: the disagreement between two different orders can be defined as the set of pairs in the transitive closure of the first order that appear reversed in the closure of the second order.

Azzolini and Passoni [26] proposed a methodology to cluster stakeholders according to their user-requirement preferences, extracted from the AHP method. The clustering process used SOM networks. Their two-level clustering process helped reduce the computational cost and the noise, and it showed a high relation with the stakeholders' knowledge about the domain. SOM clustering can be used in a number of contexts and in other disciplines, such as addressing the problem of pest risk assessment [27], where SOMs have been used to filter the large amounts of information about the distribution of pests and plant pathogens and to use it in risk assessment in order to help prioritize policy and resources. In the project management field, Parvizian et al. [28] used SOM to reduce the high-dimensional information related to project management, which helps in identifying potential risks.

III. METHODOLOGY

A. Requirements prioritization model

Considering the literature on requirement-related factors that affect a requirement's importance and prioritization, a set of features was defined and used to estimate each requirement's priority. Given that ri ∈ R, where R = {r1, r2, ..., rn} is the set of software requirements, ERP(ri) is the estimated requirement priority calculated for requirement ri, computed according to the following equations:

ERP(ri) = RP(ri) · (ω1 · IM(ri) + ω2 · BP(ri)) / (ω3 · R(ri) + ω4 · T(ri))   (1)

IM(ri) = (1 + DR) / n   (2)

The notation of the equations is described in Table I.
TABLE I: Related feature descriptions for Equation 1.

Notation         Description
ERP(ri)          The estimated requirement priority calculated for requirement ri
RP(ri)           The requirement priority given in the requirement specification phase
IM(ri)           Potential impact
BP(ri)           Business profit
R(ri)            Risk
T(ri)            Time to accomplish
DR               Number of dependent requirements
n                Number of requirements
ω1, ω2, ω3, ω4   Weights related to each variable

All the mentioned features can be assigned a specific numerical value for each requirement, except the weights related to each feature (ω1, ω2, ω3, ω4). The weights can be discovered by a trained SOM, as will be explained in the upcoming subsection.
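A direct reading of Equations (1) and (2) with the notation of Table I can be coded as below. The sketch is only illustrative; the weight values and requirement features in the example are taken from the first row of Table IV in the experimental section:

```python
def estimated_priority(rp, bp, risk, time, dr, n, w):
    """Estimated requirement priority ERP(ri), Eq. (1); the potential impact
    IM(ri) is derived from the number of dependent requirements, Eq. (2)."""
    w1, w2, w3, w4 = w
    im = (1 + dr) / n                                    # Eq. (2)
    return rp * (w1 * im + w2 * bp) / (w3 * risk + w4 * time)

# Requirement with index 12 in Table IV (n = 50): RP=5, BP=5, R=5, T=2, DR=5
w = (2.32567555, 1.1015222, 0.61854229, 1.32536492)
print(estimated_priority(rp=5, bp=5, risk=5, time=2, dr=5, n=50, w=w))  # -> about 5.0377
```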
B. SOM model

SOM is an unsupervised learning algorithm based on a neural network. It can detect patterns in complex multi-dimensional data. SOM is applied in this context to support clustering, i.e., partitioning a data set into a set of clusters.

Each neuron i in the map has a d-dimensional weight vector wi = (wi1, wi2, ..., wid), where i = 1, 2, ..., m, with the same dimensionality as the input. The SOM algorithm steps are summarized as follows [29]:

1) Initialize the map parameters (a weight vector wi for each neuron).
2) Select an input vector x(t) randomly.
3) Define the best matching unit (BMU) for the input vector. It is usually referred to as "the winner neuron" and is defined using the Euclidean distance measure [28]. The winner neuron c is defined as [29]:

c = arg min(1 ≤ i ≤ m) ||wi(t) − x(t)||   (3)

where x(t) is the input vector at iteration t and wi(t) is the weight vector of neuron i.
4) The weight vectors of the neurons are continuously updated according to several factors, including the neighbourhood function, the width of the neighbourhood radius, and the coordinate position of the neuron on the map. The radius decreases during training according to the learning rate and the training length.
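The steps above can be illustrated with a compact from-scratch NumPy training loop. This is shown only for clarity; the experiments in this paper use the SimpSOM library rather than this code, and the neighbourhood and decay choices below are assumptions:

```python
import numpy as np

def train_som(data, rows=20, cols=20, epochs=1000, lr0=0.1, rng=np.random.default_rng(0)):
    """Minimal SOM training loop: pick a random input, find its BMU (Eq. 3),
    then pull the BMU and its neighbours towards the input with a decaying
    radius and learning rate."""
    n_features = data.shape[1]
    weights = rng.random((rows, cols, n_features))
    grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1)
    radius0 = max(rows, cols) / 2
    for t in range(epochs):
        x = data[rng.integers(len(data))]                     # random input vector x(t)
        dist = np.linalg.norm(weights - x, axis=2)            # distance to every neuron
        bmu = np.unravel_index(np.argmin(dist), dist.shape)   # winner neuron c, Eq. (3)
        frac = 1 - t / epochs
        radius, lr = radius0 * frac, lr0 * frac               # shrinking neighbourhood
        grid_dist = np.linalg.norm(grid - np.array(bmu), axis=2)
        h = np.exp(-(grid_dist ** 2) / (2 * radius ** 2 + 1e-9))
        weights += lr * h[..., None] * (x - weights)          # neighbourhood update
    return weights

def bmu_of(x, weights):
    """Map an input vector to its best matching unit on the trained map."""
    return np.unravel_index(np.argmin(np.linalg.norm(weights - x, axis=2)),
                            weights.shape[:2])
```

Calling train_som on the five-feature requirement vectors and then bmu_of for every requirement yields the node assignments from which clusters can be read off, analogous to the BMU lists returned by the library used in the experiments.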
C. SOM requirement prioritization

According to the proposed RP model, the input data contain the full set of requirements (R) and each requirement's related features (RP(ri), BP(ri), IM(ri), R(ri), and T(ri)). The values of the estimated requirement priority ERP(ri) can be calculated for each ri after finding an approximation of the weights related to each feature.

A SOM is defined for this work such that each node has a position, a weight vector, and associated input data. Training the SOM using the whole set of data vectors positions each data vector onto the map, so the input data are mapped to the most similar node on the SOM. All features of the input data are used to determine similarity. In addition, each node has a weight vector of the same size as the input data's feature vector.

Depending on the map size, each map node is linked to some points in the input data; a node is associated with a weight vector. Each input datum can be linked to the map by first finding the weight vector that is closest to the data vector and then mapping the data vector to the corresponding map node, its BMU.

This approach is proposed to be adopted within the agile model. The clustering process using SOM has an iterative nature that corresponds to the agile iterations. The clusters defined by the SOM are ready to be designed and developed. Besides, any change or improvement can be considered in the next iteration, which will train the SOM again using the new input data and their related features. A cluster validation process is done after obtaining the weight of each feature and applying Equation 1 to calculate the estimated priority and rank the requirements within one cluster.

IV. EXPERIMENTAL RESULTS

Five data sets were created, each with a different number of requirements: 100, 250, 500, 750, and 1000. Each requirement is related to five features (RP(ri), BP(ri), IM(ri), R(ri), and T(ri)) that were generated randomly.

To build the SOM, the SimpSOM library [30] was used, which is a lightweight Python library for Kohonen self-organising maps. It was used with Python 3.7.0 and JetBrains PyCharm Community Edition 2017. According to the SOM model, after training the map it returns a list of BMUs and positions for each input (requirement). The resulting clusters may vary according to many factors: the number of iterations (epochs), the map size, and the input data (requirements). Each map node (neuron) is related to one or more input data points.

Table II shows the resulting clustering for the different data sets, with a fixed map size of 20×20 and a learning rate of 0.1. It is mentioned in the tool documentation that the training is done using a "bootstrap-like" method, where instead of using all the input points, a random input point is selected at each iteration and used to update the weights of all the nodes in the map. The number of epochs was set to ten times the input size.

Table III provides the acquired clusters for a given requirement set of 300 requirements; maps of different sizes were used to provide insight into the increasing number of learned clusters as the SOM size increases. The number of epochs was fixed at 3000, with a learning rate of 0.1.
TABLE II: Resulting clusters of training the SOM on different input data.

Req. count   Epochs   Training time (s)   Clusters
100          1000     63.7021             5
250          2500     104.9862            8
500          5000     212.1284            9
750          7500     332.1857            10
1000         10000    455.9605            12

TABLE III: SOM clusters registered for a data set of 300 requirements.

Map size   Training time (s)   Clusters
10×10      38.0235             2
20×20      127.2452            7
30×30      318.2253            16
40×40      673.9764            28
50×50      1316.8793           35

Clusters are given an associated rank according to the SOM's trained weights. Fig. 1 shows the 14 ranked clusters of a requirement set of size 500 using the SOM RP model.

Fig. 1. Example of a figure caption.

The result is a map in which input data with similar conditions are close to each other. Such a map can be exploited to illustrate the similarity of nodes (neurons): areas where the corresponding colour is at the minimum of the heat-map colour coding have a low distance between each other and form one cluster. The clusters are also separated from each other by boundaries of nodes with high distances between them, as shown in Fig. 2. The resulting clusters can be developed sequentially, one requirements cluster per sprint. Also, for the data points that belong to a single cluster, a further sequential order is obtained by applying Equation 1 mentioned in the previous section. Using Equation 1, each requirement is assigned an estimated priority. Next, the requirements' estimated priorities are ranked, resulting in an ordered list of requirements according to their priorities. For example, consider a cluster with four requirements of a data set (n = 50), with feature values as shown in Table IV and weights given by the trained SOM. The estimated RP for each of them is calculated and a rank is associated with it, as the last column suggests.

Fig. 2. A heat-map that shows the SOM nodes and the weight differences among neighbouring nodes.
TABLE IV: Example of ranking requirements within a cluster using SOM weights.

Index  ω1          ω2          ω3          ω4          RP  IM    BP  R  T  DR  ERP          Associated rank
12     2.32567555  1.1015222   0.61854229  1.32536492  5   0.1   5   5  2  5   5.037652318  1
4      1.914848    0.13488556  0.43310942  0.83541     3   0.18  1   1  3  8   0.489455076  3
3      1.72306123  0.21889004  0.6731828   0.66829695  1   0.1   1   2  3  4   0.116731193  4
27     2.05979796  0.56444777  0.65129787  0.98508874  5   0.02  3   5  1  0   2.044686238  2

V. CONCLUSIONS

This paper presents a new approach to deal with the dynamic nature of the requirements prioritization process in agile development. The method basically depends on a self-organizing map that defines a set of ranked clusters that can be related to successive sprints. Using the proposed work, project managers can gain good insight into the entire project development plan and can adapt the development process to any newly added requirements. Using the trained SOM can support decision making in terms of requirement prioritization, and each resulting cluster can be internally ranked using the trained SOM weights.

REFERENCES

[1] A. Hudaib, R. Masadeh, M. H. Qasem, and A. Alzaqebah, "Requirements prioritization techniques comparison," Modern Applied Science, vol. 12, no. 2, p. 62, 2018.
[2] L. Alawneh, "Requirements prioritization using hierarchical dependencies," in Information Technology - New Generations. Springer, 2018, pp. 459-464.
[3] A. R. Asghar, A. Tabassum, S. N. Bhatti, and S. A. A. Shah, "The impact of analytical assessment of requirements prioritization models: an empirical study," 2017.
[4] Y. V. Singh, B. Kumar, S. Chand, and D. Sharma, "A hybrid approach for requirements prioritization using logarithmic fuzzy trapezoidal approach (LFTA) and artificial neural network (ANN)," in International Conference on Futuristic Trends in Network and Communication Technologies. Springer, 2018, pp. 350-364.
[5] T. Ambreen, N. Ikram, M. Usman, and M. Niazi, "Empirical research in requirements engineering: trends and opportunities," Requirements Engineering, vol. 23, no. 1, pp. 63-95, 2018.
[6] M. Dabbagh, S. P. Lee, and R. M. Parizi, "Functional and non-functional requirements prioritization: empirical evaluation of IPA, AHP-based, and HAM-based approaches," Soft Computing, vol. 20, no. 11, pp. 4497-4520, 2016.
[7] J. R. F. Dos Santos, A. B. Albuquerque, and P. R. Pinheiro, "Requirements prioritization in market-driven software: A survey based on large numbers of stakeholders and requirements," in 10th International Conference on the Quality of Information and Communications Technology (QUATIC). IEEE, 2016, pp. 67-72.
[8] A. Alzaqebah, R. Masadeh, and A. Hudaib, "Whale optimization algorithm for requirements prioritization," in 9th International Conference on Information and Communication Systems (ICICS). IEEE, 2018, pp. 84-89.
[9] M. Yousuf, M. U. Bokhari, and M. Zeyauddin, "An analysis of software requirements prioritization techniques: A detailed survey," in 3rd International Conference on Computing for Sustainable Global Development (INDIACom). IEEE, 2016, pp. 3966-3970.
[10] R. V. Anand and M. Dinakaran, "WhaleRank: an optimisation based ranking approach for software requirements prioritisation," International Journal of Environment and Waste Management, vol. 21, no. 1, pp. 1-21, 2018.
[11] H. Ahuja, G. Purohit et al., "Understanding requirement prioritization techniques," in International Conference on Computing, Communication and Automation (ICCCA). IEEE, 2016, pp. 257-262.
[12] R. Qaddoura, A. Abu-Srhan, M. H. Qasem, and A. Hudaib, "Requirements prioritization techniques review and analysis," in 2017 International Conference on New Trends in Computing Sciences (ICTCS). IEEE, 2017, pp. 258-263.
[13] H. F. Hofmann and F. Lehner, "Requirements engineering as a success factor in software projects," IEEE Software, no. 4, pp. 58-66, 2001.
[14] J. A. Khan, I. U. Rehman, Y. H. Khan, I. J. Khan, and S. Rashid, "Comparison of requirement prioritization techniques to find best prioritization technique," International Journal of Modern Education and Computer Science, vol. 7, no. 11, p. 53, 2015.
[15] M. A. Awais, "Requirements prioritization: challenges and techniques for quality software development," Advances in Computer Science: an International Journal, vol. 5, no. 2, pp. 14-21, 2016.
[16] R. V. Anand and M. Dinakaran, "Popular agile methods in software development: Review and analysis," International Journal of Applied Engineering Research, vol. 11, no. 5, pp. 3433-3437, 2016.
[17] M. Brhel, H. Meth, A. Maedche, and K. Werder, "Exploring principles of user-centered agile software development: A literature review," Information and Software Technology, vol. 61, pp. 163-181, 2015.
[18] Z. Racheva, M. Daneva, K. Sikkel, A. Herrmann, and R. Wieringa, "Do we know enough about requirements prioritization in agile projects: Insights from a case study," in 2010 18th IEEE International Requirements Engineering Conference. IEEE, pp. 147-156.
[19] K. Wiegers, "First things first: prioritizing requirements," Software Development, vol. 7, no. 9, pp. 48-53, 1999.
[20] R. Popli, N. Chauhan, and H. Sharma, "Prioritising user stories in agile environment," in 2014 International Conference on Issues and Challenges in Intelligent Computing Techniques (ICICT). IEEE, 2014, pp. 515-519.
[21] R. V. Anand and M. Dinakaran, "Handling stakeholder conflict by agile requirement prioritization using Apriori technique," Computers & Electrical Engineering, vol. 61, pp. 126-136, 2017.
[22] P. Avesani, S. Ferrari, and A. Susi, "Case-based ranking for decision support systems," in International Conference on Case-Based Reasoning. Springer, 2003, pp. 35-49.
[23] M. S. Rahim, A. Z. M. E. Chowdhury, and S. Das, "RIZE: A proposed requirements prioritization technique for agile development," in 2017 IEEE Region 10 Humanitarian Technology Conference (R10-HTC), Dec. 2017, pp. 634-637.
[24] V. R. Basil and A. J. Turner, "Iterative enhancement: A practical technique for software development," IEEE Transactions on Software Engineering, no. 4, pp. 390-396, 1975.
[25] P. Tonella, A. Susi, and F. Palma, "Interactive requirements prioritization using a genetic algorithm," Information and Software Technology, vol. 55, no. 1, pp. 173-187, 2013.
[26] M. Azzolini and L. I. Passoni, "Prioritization of software requirements: a cognitive approach," in Fourth International Workshop on Knowledge Discovery, Knowledge Management and Decision Support. Atlantis Press, 2013.
[27] S. Worner, M. Gevrey, R. Eschen, M. Kenis, D. Paini, S. Singh, M. Watts, and K. Suiter, "Prioritizing the risk of plant pests by clustering methods; self-organising maps, k-means and hierarchical clustering," NeoBiota, vol. 18, p. 83, 2013.
[28] J. Parvizian, H. Tarkesh, S. Farid, and A. Atighehchian, "Project management using self-organizing maps," Industrial Engineering and Management Systems, vol. 5, no. 1, 2006.
[29] V. Chaudhary, R. Bhatia, and A. K. Ahlawat, "A novel self-organizing map (SOM) learning algorithm with nearest and farthest neurons," Alexandria Engineering Journal, vol. 53, no. 4, pp. 827-831, 2014.
[30] F. Comitani, "SimpSOM (Simple Self-Organizing Maps)," 2019. [Online]. Available: https://pypi.org/project/SimpSOM/
A Parallel Face Detection Method using Genetic & CRO Algorithms on Multi-core Platform

Mohammad Khanafsa
Computer Science Department, University of Jordan, Amman, Jordan
mkhanafsa@gmail.com

Ola Surakhi
Computer Science Department, University of Jordan, Amman, Jordan
ola.surakhi@gmail.com

Sami Sarhan
Computer Science Department, University of Jordan, Amman, Jordan
samiserh@ju.edu.jo
Abstract—Face recognition is a well-known biometric method used in many applications for authentication and identification. The original face recognition scheme takes a face image, extracts its features, and stores them as a vector in the database. The saved vector is then compared with the input image by comparing features in order to recognize it. Many methods have been proposed to achieve this and to increase the level of identification accuracy. This paper proposes a new method using two meta-heuristic algorithms, the Genetic and Chemical Reaction Optimization algorithms, both implemented in parallel on a multicore platform. The aim is to increase the accuracy of image matching with a lower error rate and to increase the performance of the system in terms of speedup.

Keywords—Chemical Reaction Optimization algorithm; Face Recognition; Genetic Algorithm; Multi-threaded

Introduction

Face recognition is a biometric technique used in many applications and systems. Because of its importance, many fields pay great attention to it, such as security, image processing, and psychology [1-7]. The face recognition process consists of three main steps: face detection, feature extraction, and face recognition, as shown in Figure 1 [8].

Figure 1: Face recognition process steps

Face detection detects the face in an image by determining its position. The features are then extracted from the face and saved in a vector, which is used as a signature for the image that discriminates one individual from another. Last, face recognition is done by comparing the extracted features of the input image with the ones stored in the database, and based on the matching rate, the recognition is accepted.

Each face image consists of more than 80 point features, one of which is selected as a pivot. The features are extracted by evaluating the distance between the pivot and each of the 80 points in the image and saving them in a vector in the database, which is then used for comparison in order to recognize the individual later. The enrollment phase in the face recognition system consists of processing the image, extracting features, and saving them in the database.

After the enrollment phase, all users' images are saved in the database to be used for identification. When a user is to be identified, the entered image is compared with the one stored in the database, which is the matching phase of the face recognition system. The features are extracted from the input image by finding the distance between the pivot and each point in the image, and the extracted features are compared with the ones stored in the database. If the number of matched features is greater than a threshold value, then the user is identified; otherwise the user is not matched and not recognized.

Two factors play an important role in the matching phase: the pivot point and the weight of each area in the face image. Selecting the pivot point correctly implies extracting a set of features that can increase the matching accuracy. Dividing the face image into a set of areas and assigning a weight to each area, such that the sum of the weight values is one, will enhance accuracy. Some areas in the face image are clear and can be assigned a high weight; other areas may have special objects that affect the accuracy and thus should be assigned a low weight.

In [15], two meta-heuristic algorithms are used to achieve that: the Genetic Algorithm (GA) and the Chemical Reaction Optimization algorithm (CRO). A meta-heuristic algorithm can be used to expand the search in the problem's search space in order to generate better solutions. As the algorithm consists of a number of iterations, better and better solutions can be generated after each iteration until the best solution is reached.

GA and CRO are used in this paper to search for the best point in the image to be selected as the pivot point, to assign a weight value to each area in the image, and to generate a set of features that are not important and may reduce the matching rate. The excluded features are stored in a vector. The selection process is repeated in each iteration to get better results that increase the matching rate and enhance the accuracy of the recognition system. The algorithms are implemented in parallel using a multicore platform to speed up the training of the algorithms and increase efficiency.

The rest of the paper is organized as follows: Section 2 gives an overview of GA and CRO. Section 3 introduces the proposed work. Section 4 shows the experimental results. A detailed discussion of the experimental results is given in Section 5, and Section 6 gives the conclusion.

I. BACKGROUND

A. Genetic Algorithm

The Genetic Algorithm [9, 10, 11] is a meta-heuristic algorithm used to solve large search problems. GA depends on an initial population that consists of a set of individuals. A chromosome can be represented as a vector where each entry is a gene. The chromosome represents the set of variables that need to be updated during the algorithm's run to reach the best


solution. The set of variables is defined according to the problem. The value of each variable can be updated based on a fitness function, which is used to help generate better solutions. The main steps of GA can be summarized as follows:
1. Initialization: generate a random population that consists of a set of individuals.
2. Evaluate the fitness value of each individual.
3. Selection: select individuals from the population as parents.
4. Crossover: exchange information to generate new offspring for the next generation; the offspring are not identical to the parents but inherit their traits. Crossover can be single-point, two-point, or uniform.
5. Mutation: change the value of some bits in the chromosome to get a better solution.
Steps 2 to 5 are repeated for a predefined number of iterations, which is enough to generate the best solution. The most important factors that affect the performance of a heuristic algorithm are how the individual is represented and how the fitness function is defined.

In the proposed method, the individual consists of the set of parameters that affect the matching accuracy, which are the pivot point, the weight of each face area, the set of excluded feature points, and the distances between the selected pivot and all feature points. The fitness function is the matching value between the original image features, which are saved in the database, and the input image features. The best solution will have the highest fitness value, which represents the highest matching rate.
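A compressed sketch of this encoding and loop is given below. The chromosome layout, the toy matching-rate fitness (evaluated here on a single enrolled/probe pair rather than the whole database), and all parameter values are illustrative assumptions written in Python, whereas the paper's implementation is in Java:

```python
import numpy as np

rng = np.random.default_rng(0)
N_POINTS, N_AREAS, POP = 67, 4, 30      # 67 feature points per image; 4 areas is an assumption

def new_individual():
    w = rng.random(N_AREAS)
    return {"pivot": int(rng.integers(N_POINTS)),
            "weights": w / w.sum(),                        # area weights sum to 1
            "excluded": set(rng.choice(N_POINTS, size=5, replace=False).tolist())}

def match_rate(ind, enrolled, probe, area_of, tol=2.0):
    """Toy fitness: weighted fraction of non-excluded points whose distance to the
    pivot differs by less than `tol` between the enrolled and the probe image."""
    keep = [i for i in range(N_POINTS) if i not in ind["excluded"]]
    ref = np.linalg.norm(enrolled[keep] - enrolled[ind["pivot"]], axis=1)
    test = np.linalg.norm(probe[keep] - probe[ind["pivot"]], axis=1)
    hits = np.abs(ref - test) < tol
    w = np.array([ind["weights"][area_of[i]] for i in keep])
    return float((w * hits).sum() / w.sum())

def evolve(enrolled, probe, area_of, generations=100):
    pop = [new_individual() for _ in range(POP)]
    for _ in range(generations):
        pop.sort(key=lambda ind: match_rate(ind, enrolled, probe, area_of), reverse=True)
        parents = pop[:POP // 2]
        children = []
        for a, b in zip(parents[::2], parents[1::2]):       # crossover: mix two parents
            child = {"pivot": a["pivot"],
                     "weights": (a["weights"] + b["weights"]) / 2,
                     "excluded": a["excluded"] | b["excluded"]}
            if rng.random() < 0.1:                          # mutation: move the pivot
                child["pivot"] = int(rng.integers(N_POINTS))
            children.append(child)
        pop = parents + children + [new_individual() for _ in range(POP - len(parents) - len(children))]
    return max(pop, key=lambda ind: match_rate(ind, enrolled, probe, area_of))
```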
B. Chemical Reaction Optimization Algorithm

The Chemical Reaction Optimization algorithm (CRO) is a meta-heuristic algorithm inspired by the nature of chemical reactions [12, 13]. The population in CRO consists of a set of molecules, each of which has a number of parameters. The important parameters of a molecule are (a) the molecular structure (ω), (b) the potential energy (PE), and (c) the kinetic energy (KE). The molecular structure (ω) captures a solution of the problem, which can be represented as a vector. The potential energy is the fitness function of the algorithm, and the kinetic energy helps in deciding whether a generated solution is better than the previous one or not.

For a number of molecules placed in a container, collisions start to happen between them. The CRO algorithm starts with a set of molecules which change after each iteration of the algorithm through collisions. These collisions are chemical reactions that move a molecule from one solution to a better one where possible [13]. There are four types of reactions:
a. On-wall ineffective collision: occurs when a molecule collides with the wall of the container and then bounces back, causing an ineffective collision. The transformation of the molecular structure can be represented as ω → ω′.
b. Decomposition: happens when a molecule hits the wall and then decomposes into smaller parts; ω → ω1 + ω2.
c. Inter-molecular ineffective collision: happens when multiple molecules collide with each other and then bounce away; ω1 + ω2 → ω1′ + ω2′.
d. Synthesis: the opposite of decomposition; it happens when two or more molecules collide and combine together; ω1 + ω2 → ω′.
One of these reactions takes place in each iteration to build a solution and redistribute energy between molecules. After a number of iterations, the best solution is generated. Applying the CRO algorithm in the proposed work helps in generating the best point to be selected as a pivot, updating the weight values of the area sets in the image, and selecting the excluded features. The fitness function here represents the matching rate, which should be high to increase the accuracy of the system.

II. THE PROPOSED WORK

The main aim of the new method is to use meta-heuristic algorithms in parallel over the face recognition scheme to enhance accuracy and reduce the error matching rate. The original face recognition method consists mainly of two steps: enrollment and matching. The enrollment phase depends on choosing a point in the face image as a pivot point and then calculating the distance between the pivot point and all other face feature points. The output of this phase is a set of vectors that contain these distances. The second phase is matching, which takes the user's image as input, extracts its features, calculates distances, and then compares them with the original ones stored in the database for matching and recognition.

The face image is divided into a set of areas; each area consists of a number of features that contribute to increasing or decreasing the matching accuracy. Some areas may have a set of new features that are not stored in the original database image and will reduce the matching accuracy. For example, the image of a face with glasses has some new features that are different from the one without glasses; when comparing both images of the same individual, the system may not recognize them. Many objects in the image may affect the accuracy of recognition, such as a moustache, glasses, brightness, etc.

In this paper, the proposed method uses a meta-heuristic algorithm applied to the enrollment phase to choose the best point in the image to be selected as the pivot and to assign weights to the image areas. The implementation of this phase is done in parallel using a multithreaded platform. Two algorithms are used to achieve that: Genetic and Chemical Reaction Optimization. The goal is to increase accuracy, obtain better results with a better matching rate, and use a parallel implementation to reduce the running time needed and increase the performance of the system.

The use of the meta-heuristic algorithm aims to achieve three different points, as follows:
1. Choosing the best point in the image to be used as a pivot.
2. Dividing the image into a set of areas and assigning a weight to each one.
3. Excluding a set of unnecessary features that may reduce accuracy and putting them in an array of excluded features.
The meta-heuristic algorithm always searches for the best solution by repeating the search for a predefined number of iterations; better and better solutions can be generated with each iteration. In the proposed work, the algorithm generates a new pivot point, new area weights, and a new excluded-features array. The new solutions are compared with the old ones, and the better of them is saved for the next iteration.

The proposed work in this paper uses the Genetic and CRO algorithms, implemented in parallel on a multi-core platform. The output of each algorithm is generated and compared. The detailed steps of each algorithm are shown in the next subsections.

A. Data set

The data set used in this paper is taken from the XM2VTSDB multi-modal face database project [14], which covers 371 individuals, each with more than one session. The overall number of images is 2360, with 67 features for each image. The features of the collected images differ, as they contain images of females and males of different ages and colours. These images were taken over a period of four months.

B. Parallel-Genetic Face Recognition

The GA consists of different phases, the most important of which is determining the fitness function, which plays the role of enhancing the value of the desired parameters to get a better solution. In the proposed algorithm, the fitness function generates the matching rate of the image. Based on it, the new solution will have a new pivot point that is used to exclude features that have less effect on the accuracy, and new weights for the face areas such that their total value is 100%.

The mapping between the GA phases and the proposed method is shown in Table I.

TABLE I: Mapping between GA steps and the proposed work

Genetic phase: Mapped to the proposed idea
Individual: Pivot point, weight for each face area, set of excluded feature points, and distances between the selected pivot and all feature points.
Population: Set of individuals that contains the initial pivot point, the initial weight value for each face area in the first round, and the initial set of excluded features from the first random round.
Search space: The different solutions found through the different iterations.
Fitness function: Match values for the testing data sets based on the training data for all faces; the best solution will have the highest fitness value, which means the highest match rate for the different images compared with all feature information saved in the database.
Crossover: Generate different values for the pivot, face area weights, and excluded array based on the best solution combined with other solutions.
Mutation: Random perturbation of the generated solution based on a specific value.

Applying the algorithm in parallel is done by evaluating the time needed to run each step from the above-mentioned table; the step which takes the longest time is run on multiple threads. After evaluating the time for each step, it was found that the matching step took the longest time when running the proposed algorithm sequentially, and thus it was divided into a set of jobs that run on multiple threads.

Each area of the image runs on a single core, which extracts features from it to be compared with the originally extracted features and to generate a set of excluded features that reduce the matching accuracy. Communication between the different cores is done to exchange information; since the time needed to run the algorithm sequentially is much greater than the time needed to run it in parallel, the overall communication overhead between cores can be ignored, which allows the gains in speedup and accuracy to be achieved.

C. Parallel-CRO Face Recognition

The Chemical Reaction Optimization algorithm is a meta-heuristic algorithm that searches for the best solution in the search space. As mentioned before, it consists of a set of steps, and as in any other meta-heuristic algorithm, the important step is determining the fitness function according to the problem. The fitness function in the proposed algorithm evaluates the matching value for the variables pivot, area weights, and excluded features, which influence the matching results.

The mapping between the CRO steps and the proposed method is shown in Table II.

TABLE II: Mapping between CRO steps and the proposed work

Chemical meaning: Its meaning in the proposed idea
Molecular structure: Set of solutions found based on the original solution.
Potential energy: The values of the important variables, such as the pivot value, the excluded-array values, and the weight values of the different face areas.
Kinetic energy: Measure of tolerance for accepting a worse solution.
Number of hits: Total number of iterations used for a specific experiment.
Minimum structure: Current optimal value of the matching based on the different variable values.
Synthesis, ω1 + ω2 → ω′: Two solutions with two potential energies are combined with each other to select a single solution with the highest potential energy, which refers to the highest match percentage for all faces.
Inter-molecular ineffective collision, ω1 + ω2 → ω1′ + ω2′: Two solutions with two potential energies produce a solution with the highest potential energy value, by combining different parts of both solutions, such as selecting the best excluded-array values from one solution with the face-area weights from another solution, to obtain a solution with the highest matching percentage for all faces.
Decomposition, ω → ω1 + ω2: A single solution with a specific potential energy produces two new separate solutions, each with a different potential energy.
On-wall ineffective collision, ω → ω′: A single solution is combined with another random solution, where each solution has its own potential energy, to produce a new one with a potential energy different from the original solution.

The parallel implementation of the algorithm is similar to what was applied for the Genetic algorithm. After evaluating the running time of the CRO face recognition steps, it was found that the matching step needs the longest time. This step is implemented in parallel by distributing the jobs between multiple cores in order to reduce the running time and enhance the speedup. The results of this implementation showed a great enhancement in the overall performance of the system in terms of speedup and accuracy.
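As an illustration of this job distribution, the sketch below farms the per-area matching out to a pool of worker processes and combines the area scores with their weights. It is written in Python purely to show the idea; the paper's implementation is multi-threaded Java, and the function names are hypothetical:

```python
from concurrent.futures import ProcessPoolExecutor

def match_area(args):
    """Compare the stored and probe feature vectors of one face area and
    return the fraction of features that agree within a tolerance."""
    stored, probe, tol = args
    hits = sum(abs(s - p) < tol for s, p in zip(stored, probe))
    return hits / len(stored)

def parallel_matching(stored_areas, probe_areas, weights, workers=4, tol=2.0):
    """Run the matching of each face area on its own worker and combine the
    per-area scores with the area weights (which sum to one)."""
    jobs = [(s, p, tol) for s, p in zip(stored_areas, probe_areas)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        scores = list(pool.map(match_area, jobs))
    return sum(w * s for w, s in zip(weights, scores))
```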
III. EXPERIMENTAL RESULTS

A. Experimental Results Using GA

The GA was implemented in parallel using the Java programming language on an Intel Core i7-3632QM CPU at 2.20 GHz with 8 GB of RAM and Windows 7 64-bit. As mentioned before, the step that takes most of the running time of the method is the matching step. In order to show the improvement in running time for the matching step, the algorithm was first executed sequentially on different data sets, starting from 50 × 10 images (images of 50 persons, with 10 different samples each) up to 371 × 10 images. The times needed to run the matching step in the sequential implementation using GA are shown in Table III.

TABLE III: Sequential implementation time for GA face recognition

Input size        Execution time (ms)   Matching step time (ms)   Matching accuracy
50 persons × 10   3030                  2434                      98%
100 persons × 10  8750                  8143                      96%
150 persons × 10  3000                  2168                      99%
200 persons × 10  34000                 33405                     97%
250 persons × 10  61000                 58735                     93%
300 persons × 10  150000                74809                     91%
335 persons × 10  100500                99675                     91%
371 persons × 10  117000                110051                    93%

The algorithm was then run in parallel using 2, 4, 6, and 8 cores. Each time, the matching-step time and the accuracy were estimated to compare the results. The results of each implementation are shown in Table IV.

TABLE IV: Parallel implementation time (ms) of the matching step for GA face recognition using 2, 4, 6 and 8 cores

Input size        2 cores   4 cores   6 cores   8 cores
50 persons × 10   1529      1000      946       1148
100 persons × 10  4620      3713      2824      3348
150 persons × 10  9800      8427      5729      6814
200 persons × 10  16000     11887     9583      11820
250 persons × 10  30982     21135     15231     14881
300 persons × 10  40825     24370     19757     21724
335 persons × 10  49923     32566     25907     25453
371 persons × 10  66697     34246     29381     33750
accuracy. The above results show that for maximum input
size, the longest step in the running algorithm which is
B. Experimantal Results Using CRO matching step needs 110051 m/s which is reduced by half to
The CRO algorithm had been implemented in parallel using be 49923 m/s when it runs in parallel using two cores. Having
Java programming language, on Intel core I7-3632QM more cores run in parallel enhanced execution time of the
CPU2.20GHz, 8GB of RAM and windows 7 64 bits. The algorithm till 6 cores, after that the communication between
main aim of the parallel implementation is to reduce running cores increased and dependency between them increased
time and enhance performance of the recognition algorithm. which decreased performance of algorithm in terms of
the same data set with the same number of images were used. speedup. The accuracy of the results did not change when
The result of running algorithm sequentially is shown in the increasing degree of parallelism. Using GA as a heuristic
next Table. method on face recognition technique to choose pivot point,
assigning weight for each area in the face image and choosing
a set of excluded features enhanced the performance of the
332
Parallel GA face recognition shows an enhancement in the execution time while maintaining the same level of accuracy. The above results show that, for the maximum input size, the longest step in the algorithm, the matching step, needs 110051 ms; this is reduced by roughly half to 49923 ms when it runs in parallel using two cores. Adding more cores enhanced the execution time of the algorithm up to 6 cores; after that, the communication and dependency between cores increased, which decreased the performance of the algorithm in terms of speedup. The accuracy of the results did not change when increasing the degree of parallelism. Using GA as a heuristic method in the face recognition technique to choose the pivot point, assign a weight to each area of the face image, and choose a set of excluded features enhanced the performance of the algorithm by reducing the running time while keeping the same level of accuracy.
A comparison between the time needed to run the proposed method sequentially and in parallel using 2, 4, 6 and 8 cores is shown in Figure 2.

Figure 2: Comparison between the time needed to run the proposed algorithm using GA sequentially and in parallel

B. Parallel CRO Face Recognition
The results from parallel CRO face recognition show a great enhancement in terms of execution time. The sequential time needed to run the longest step in the algorithm, the matching step, is 197000 ms for the largest input size. When running the algorithm in parallel using 2 cores, the time is reduced to 123000 ms, achieving about a 35% enhancement in speedup. When increasing the number of parallel cores to 4, the running time is reduced by about 60%.
The optimal number of cores needed to run the algorithm is 4; after that, the speedup decreased due to the communication overhead between cores as the input size increased. Table VI shows that the running time is 37500 ms for the largest input size of 371 x 10 images running on 4 cores, which is close to the result of running the algorithm on 6 cores with an input size of 300 x 10 images; beyond that, the speedup increased as the input size increased.
A comparison between the time needed to run the proposed method using the CRO algorithm sequentially and in parallel using 2, 4, 6 and 8 cores is shown in Figure 3.

Figure 3: Comparison between the time needed to run the proposed algorithm using CRO sequentially and in parallel

C. Comparison between GA and CRO Performance and Accuracy in the Face Recognition Scheme
In order to compare the parallel performance of GA and CRO in the proposed face recognition method, the results of each algorithm for different numbers of cores are compared in Figure 4.
Figure 4 shows the results for parallel GA and CRO in the proposed face recognition using 2, 4, 6 and 8 cores. For the 2-core case, it is clear that GA performed better in terms of speedup. For the largest input size of 371 x 10 images, it took 123000 ms for the CRO algorithm to perform the matching step in the proposed method, whereas GA reduced the time by almost half for the same step with the same input size.

Figure 4: Comparison between the parallel performance of applying GA and CRO in the proposed face recognition method

When increasing the number of cores to 4, the performance of GA is very close to the performance of the CRO algorithm, with approximately the same time needed for the matching step with the largest data set.
The performance of parallel GA increased further when using 6 cores. It took the algorithm 29381 ms to run the matching step in the proposed face recognition, which is about a quarter of the time CRO needed for the matching step with the same data size and the same number of cores. GA has the better performance in terms of speedup.
The same conclusion holds when using 8 cores. Using GA in parallel over 2, 4, 6 and 8 cores gives better results with less execution time compared with the performance of the parallel CRO algorithm.
A comparison between the speedup and efficiency achieved by the parallel implementations of Genetic face recognition and CRO face recognition is shown in Table VII. The speedup formula is given in formula (1), and the efficiency formula is given in formula (2).

Speedup = sequential time / parallel time    (1)
Efficiency = speedup / number of cores    (2)
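Formulas (1) and (2) can be checked directly against the reported values. The sketch below reproduces the GA 2-core entry of Table VII from the sequential and 2-core matching times of the largest data set (371 x 10 images); note that efficiency appears to be reported as a percentage in Table VII.

def speedup(t_sequential, t_parallel):
    # Formula (1): speedup = sequential time / parallel time
    return t_sequential / t_parallel

def efficiency(sp, cores):
    # Formula (2): efficiency = speedup / number of cores, shown as a percentage
    return 100.0 * sp / cores

# GA face recognition, largest input size (times in ms, from Tables III and IV)
sp = speedup(110051, 66697)   # about 1.650
eff = efficiency(sp, 2)       # about 82.5
print(round(sp, 3), round(eff, 3))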
TABLE VII. SPEEDUP AND EFFICIENCY FOR GA FACE RECOGNITION AND CRO FACE RECOGNITION
Cores | GA speedup | GA efficiency (%) | CRO speedup | CRO efficiency (%)
2 cores | 1.650 | 82.501 | 1.602 | 80.081
4 cores | 3.214 | 80.339 | 5.253 | 131.333
6 cores | 3.746 | 62.428 | 2.231 | 37.184
8 cores | 3.261 | 40.760 | 3.322 | 41.526

As mentioned before, the execution time for CRO face recognition using 2 cores is almost double the execution time of GA face recognition. However, the speedup achieved by running GA face recognition in parallel on 2 cores is very close to that achieved by CRO face recognition, even though the execution times differ. That is because the sequential time for GA face recognition is almost half that of CRO face recognition. GA is faster than the CRO algorithm in recognition problems.
The best performance for CRO face recognition was achieved when the algorithm ran on 4 cores, where the speedup and efficiency are highest. In GA face recognition, the speedup increases up to 6 cores; after that, the performance decreases as mentioned before.

V. CONCLUSIONS
This paper proposed a parallel implementation of face recognition using a meta-heuristic algorithm. The meta-heuristic algorithm is used to choose the best point in the image to be selected as a pivot point, to evaluate a weight for each area of the face image, and to exclude a number of unneeded features. Two algorithms were used, GA and CRO. Both algorithms were implemented in parallel using the Java programming language, on an Intel Core i7-3632QM CPU at 2.20 GHz with 8 GB of RAM and Windows 7 64-bit. Different numbers of cores and different data sets were used for testing. The results of the implementation show that the proposed method can give better results, with higher accuracy and a lower error rate, compared with the original face recognition. The parallel implementation enhanced performance by decreasing the running time. GA shows better performance than CRO in terms of speedup and efficiency.

Heart Disease Detection Using Machine Learning
Majority Voting Ensemble Method
Rahma Atallah, Communications Engineering Department, Princess Sumaya University for Technology, Amman, Jordan, r_rahma@hotmail.com
Amjed Al-Mousa, Computer Engineering Department, Princess Sumaya University for Technology, Amman, Jordan, a.almousa@psut.edu.jo

Abstract—This paper presents a majority voting ensemble method that is able to predict the possible presence of heart disease in humans. The prediction is based on simple, affordable medical tests conducted in any local clinic. Moreover, the aim of this project is to provide more confidence and accuracy to the doctor's diagnosis, since the model is trained using real-life data of healthy and ill patients. The model classifies the patient based on the majority vote of several machine learning models in order to provide more accurate solutions than having only one model. Finally, this approach produced an accuracy of 90% based on the hard voting ensemble model.

Keywords—Machine learning; majority voting ensemble method; heart disease; UCI dataset; classification.

I. INTRODUCTION
In the present era, heart disease rates have dramatically increased to become the leading cause of death among adults in the United States, due to the widespread adoption of unhealthy habits [1]. These include a decline in physical activity, since the technology trend is moving towards replacing human physical activity, and unhealthy eating habits, which are directly linked to an increased risk of heart disease.
Starting off with the definition of a heart disease: according to [2], the National Heart, Lung, and Blood Institute states that heart disease is a disruption to the heart's normal electrical system and pumping functions, where the disease makes it harder for the heart muscle to pump blood efficiently.
Furthermore, according to the World Health Organization (WHO), 17.9 million people die each year from cardiovascular diseases, which corresponds to 31% of all deaths around the world [3]. This creates the need for an affordable system that is able to give a preliminary assessment of a patient based on relatively simple medical tests that are affordable to everyone.
To conduct the training and testing of the machine learning model, the Cleveland dataset from the well-known UCI repository was used, since it is an authenticated dataset that is widely used for training and testing machine learning models [4]. The dataset contains 303 instances and 14 attributes that are based on well-known factors thought to correlate with the risk of heart disease.
The approach presented in this paper uses the hard voting ensemble method, which is a technique where multiple machine learning models are combined and the prediction result is based on the majority vote of all models. This technique is used in order to improve the overall prediction results, since the combination of models produces a powerful collaborative overall model.
Section II of this paper presents a review of related work, then Section III introduces the intricate details of the dataset, the data preprocessing, and the machine learning techniques used. Moreover, the results of each model, along with the overall accuracy of the hard voting model, are presented in Section IV. Finally, a conclusion is outlined in Section V.

II. RELATED WORK
In the field of heart disease detection, a variety of techniques regarding data preprocessing and model variation have been used. The work presented in [5] used the same dataset as this paper, but different machine learning models were implemented. Three discrete classifier models were built, which included a Support Vector Machine (SVM) classifier, the naïve Bayes algorithm, and C4.5. The prediction of heart disease was conducted based on each of these models discretely and produced a maximum accuracy of 84.12% with the SVM model.
The work in [6] also used the Cleveland heart disease dataset, but the classification models that were implemented involved only tree algorithms. Those included J48, the Logistic Model tree, and the Random Forest algorithm. A comparison of the three methodologies was conducted, and the highest accuracy achieved was 84% using the J48 algorithm.
Furthermore, the work in [7] presents a prediction system for coronary artery heart disease using four different datasets, including the Cleveland dataset. The algorithms used for prediction involved only decision tree techniques, namely C4.5 and the Fast Decision Tree. At first, the model is trained on each dataset using all features. Then the best features from each dataset are selected and used for training the model. This technique improved the average prediction accuracy across all datasets from 76.3% to 77.5% using C4.5, and for the Fast Decision Tree the average accuracy improved from 75.48% to 78.06%.



The work in [8] uses data mining techniques, where the large Cleveland dataset with all 76 attributes is investigated in order to extract hidden and previously unknown patterns. This allows the prediction to utilize the most dominant and effective attributes provided in the dataset. The machine learning algorithm consists of different decision tree methods (J48, Logistic Model Tree algorithm, Random Forest algorithm). The highest accuracy is obtained from the J48 model, which is 56.76%, with a total model build time of 0.04 seconds.
Finally, the work in [9] deploys various machine learning models in order to investigate the highest performance metrics (Accuracy, Sensitivity, Specificity, and Kappa). The machine learning algorithms involve Random Forest, Logistic Regression and an Artificial Neural Network. The Cross-Industry Standard Process for Data Mining (CRISP-DM) technique is used to find insights and meaningful information from the data. CRISP-DM involves six stages, which were followed in this research. Moreover, the accuracies obtained from the models used were as follows: 80.9% for the Random Forest, 79.78% for the Artificial Neural Network, and 85.39% for the Logistic Regression.

III. EXPERIMENTAL SETUP
The objective of this paper is to produce a heart disease prediction system using the aforementioned dataset. This dataset represents real-life data, which serves the purpose of this paper and allows the prediction system to generalize to any new data.

TABLE I. ATTRIBUTE INFORMATION
Attributes | Type | Description
Age | Continuous | Age in years
Sex | Discrete | 0=female, 1=male
Cp | Discrete | Chest pain type: 1=typical angina, 2=atypical angina, 3=non-anginal pain
Trestbps | Continuous | Resting blood pressure (mm/Hg)
Chol | Continuous | Cholesterol (mg/dl)
Fbs | Discrete | Fasting blood sugar > 120 (mg/dl): 1=true, 0=false
Restecg | Discrete | Resting electrocardiographic result: 0=normal, 1=ST-T abnormality, 2=probable left ventricular hypertrophy
Thalach | Continuous | Maximum heart rate achieved
Exang | Discrete | Exercise-induced angina: 1=yes, 0=no
Old peak ST | Continuous | ST depression induced by exercise relative to rest
Slope | Discrete | Peak exercise slope segment: 1=up sloping, 2=flat, 3=down sloping
Ca | Discrete | Number of major vessels colored by fluoroscopy (0-3)
Thal | Discrete | Heart rate: 3=normal, 6=fixed defect, 7=reversible defect
Target | Discrete | Diagnosis classes: 0=healthy, 1=possible heart disease

A. Dataset Attribute Information
The UCI repository was used to retrieve the heart disease database. The original database contains 76 attributes, but based on extensive experiments it was found that the most effective attributes were 14. The Cleveland database contains the most dominant 14 attributes, which is why they were chosen for training the model. Table I presents each attribute's name, type, and description.
In order to analyze the data, a correlation value was calculated between each attribute and the Target diagnosis. It can be noted that the attributes most highly correlated with the target were Cp, Thalach, Exang, and Oldpeak. This helps in forming an overview of the data being dealt with.

TABLE II. CORRELATION WITH TARGET DIAGNOSIS
Attribute name | Correlation value
Cp | 0.433798
Thalach | 0.421741
Slope | 0.345877
Restecg | 0.137230
Fbs | -0.028046
Chol | -0.085239
Trestbps | -0.144931
Age | -0.225439
Sex | -0.280937
Thal | -0.344029
Ca | -0.391724
Oldpeak | -0.430696
Exang | -0.436757

Moreover, to further form a clear overview of the correlation between each pair of attributes, a heat map showing the correlations between all features is shown in Figure 1.

Figure 1: Heat map of cross-correlation values
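The correlation values reported in Table II can be reproduced in a few lines of pandas. This is only a sketch: it assumes the 14 Cleveland attributes have been saved to a CSV file (the file name below is hypothetical) with a column named target holding the diagnosis.

import pandas as pd

# Assumed: the Cleveland data with the 14 attributes of Table I,
# stored under a hypothetical file name.
df = pd.read_csv("cleveland_heart.csv")

# Pearson correlation of every attribute with the target diagnosis,
# sorted from the most positive to the most negative (compare Table II).
corr_with_target = df.corr()["target"].drop("target").sort_values(ascending=False)
print(corr_with_target)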

Also, a pie chart, shown in Figure 2, displays the gender distribution of the instances in the Cleveland data set. It is clear that the dataset contains more males (68%) than females (32%).

Figure 2: Gender distribution within the dataset

Furthermore, for the continuous attributes, data visualization histograms are plotted to preview the data distribution, as shown in Figures 3-6. It can be noted that all of the continuous attributes have a normal distribution.

Figure 3: Age distribution
Figure 4: Resting blood pressure distribution
Figure 5: Cholesterol distribution
Figure 6: Maximum heart rate achieved

For the age attribute in Figure 3, it can be seen that most observations lie between 47 and 61 years old. To further investigate whether age has a relation to having heart disease, Figures 7 and 8 show the age distribution for people with no heart disease and people with heart disease, respectively. It is observed that people with heart disease had a major concentration in the age ranges 51-53 and around 41.

Figure 7: Age distribution for people with no heart disease
Figure 8: Age distribution for people with heart disease

In addition, the highest correlated continuous attribute (Thalach) is plotted against age, as shown in Figure 9, to examine whether there is any relation. It is noticed that for people with heart disease, at all age ranges, the heart rate was generally higher than that for people with no heart disease. In addition, in both groups, as age increased the maximum heart rate decreased, leading to the negative correlation of -0.4 with age shown earlier in Figure 1.

Figure 9: Maximum heart rate distribution vs. age

B. Data preprocessing
The data in the Cleveland dataset had different scales, which led to the need to scale the large continuous values using the Min-Max normalization strategy. This strategy linearly transforms the data by subtracting the minimum and dividing by the data range, as shown in equation (1). Thus, the data is mapped to a range between 0 and 1, which helps the machine learning model to form a clearer trend between the data and normalizes the impact of the different parameters.

x' = (x - min(x)) / (max(x) - min(x))    (1)
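Equation (1) is the standard Min-Max scaler, so the preprocessing step can be illustrated with scikit-learn; the sample values below are illustrative, not taken from the dataset.

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Illustrative values for one continuous attribute (e.g., cholesterol).
X = np.array([[233.0], [286.0], [199.0], [250.0]])

# fit_transform applies x' = (x - min) / (max - min) column-wise,
# mapping every value into the [0, 1] range as described above.
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
print(X_scaled.ravel())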

IV. MACHINE LEARNING ALGORITHM
After analyzing the data, the data was split into training and testing sets at a ratio of 80% training data and 20% testing data. This split is needed to validate the ability of the model to generalize to new data. Several classifier models have been tested, as follows:
A. Stochastic Gradient Descent (SGD) Classifier
Starting off with the first model, a binary classifier that uses the SGD approach was built. The SGD approach picks random instances from the training set and computes the gradient based on that single instance in order to reach the minimum value of the cost function. Then, based on the parameters chosen to minimize the cost function, classification occurs using the simple binary classifier, which is able to identify whether heart disease is present or not.

B. K-Nearest Neighbor Classifier
The second model that was built is the K-Nearest Neighbor classifier. The algorithm in this classifier involves finding the distances between the new instance and all of the training instances; then, from a predefined number K, it selects the nearest K data points to the new instance. Finally, classification occurs based on the majority class of the K data points selected. The K number in this project was chosen to be 7, since it produced the best results based on the GridsearchCV.

C. Random Forest Classifier
The third model that was built is the Random Forest Classifier. This model involves building multiple decision trees and combining them together in order to obtain a more accurate and stable prediction. In this project, 1000 trees worked best according to the GridsearchCV.

D. Logistic Regression Classifier
The fourth model built was the Logistic Regression Classifier. According to [10], the Logistic Regression Classifier computes a weighted sum of the input features and outputs the logistic of this result. The logistic is a sigmoid function that outputs a number between 0 and 1. Then, based on the estimated probability, the classification occurs.

E. Ensemble Classifier
Finally, the four models mentioned in this section are combined in an ensemble method where the classification is done based on the majority vote of the models (hard voting). The voting occurs when each model makes a prediction for each instance, and the output prediction is the one that receives more than half of the votes.
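The hard-voting combination described above can be expressed with scikit-learn's VotingClassifier. The sketch below uses the hyperparameters reported in this section (K = 7 neighbors, 1000 trees); all other settings, and the commented-out training call, are illustrative assumptions.

from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.neighbors import KNeighborsClassifier

# The four base models of Section IV combined by majority (hard) voting.
ensemble = VotingClassifier(
    estimators=[
        ("sgd", SGDClassifier(random_state=42)),
        ("knn", KNeighborsClassifier(n_neighbors=7)),
        ("rf", RandomForestClassifier(n_estimators=1000, random_state=42)),
        ("lr", LogisticRegression(max_iter=1000)),
    ],
    voting="hard",  # each model casts one vote; the majority class wins
)

# Usage (X_train, y_train, X_test would come from the 80/20 split):
# ensemble.fit(X_train, y_train)
# predictions = ensemble.predict(X_test)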

V. RESULTS AND ANALYSIS
Starting off with the SGD classifier, the prediction was run on the test set, which is considered unseen data that the model has never seen before. The first test was run with the default parameters of the classifier and produced an accuracy of 80%. Then, after running a GridsearchCV, the optimized parameters based on cross-validation were found and the accuracy increased to 88%. Figure 10 shows the confusion matrix obtained from this model.

Figure 10: SGD classifier confusion matrix

Moving on to the second model, the K-Nearest Neighbor classifier: the model was built with the default parameters and was run on the unseen test set. The accuracy came out to be 82%, and after running GridsearchCV to find the optimized parameters the accuracy went up to 87%. Figure 11 shows the confusion matrix obtained from the results.

Figure 11: KNN classifier confusion matrix

Moreover, the third model that was built was the Random Forest Classifier. The model was built using the default parameters and conducted predictions on the unseen test set. The accuracy came out to be 85%; then a GridsearchCV was deployed and the model was rebuilt using the optimized parameters to produce an accuracy of 87%. Also, feature importance was computed for this classifier, and the top three features were (Oldpeak, Ca, Thalach). Figure 12 shows the confusion matrix obtained from this model.

Figure 12: Random Forest classifier confusion matrix

Furthermore, the last model that was built was the Logistic Regression classifier. The model was built using the default parameters and the classification occurred on the unseen test set. The accuracy came out to be 87%, and after conducting GridsearchCV the accuracy remained the same, since the default parameters turned out to be the same as the optimized parameters. Figure 13 shows the confusion matrix of this model.

Figure 13: Logistic Regression classifier confusion matrix

Table III shows the overall final accuracies of the four models.

TABLE III. ACCURACY OF THE MODELS
Model Name | Accuracy
SGD Classifier | 88%
KNN Classifier | 87%
Random Forest Classifier | 87%
Logistic Regression Classifier | 87%
Hard Voting Ensemble Method | 90%

Also, Figure 14 shows how running the GridsearchCV, which is based on the cross-validation technique, improves the accuracy of every model. This shows the need to fine-tune the parameters of any machine learning algorithm.

Figure 14: GridsearchCV accuracy improvement
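The tuning step described above corresponds to scikit-learn's GridSearchCV. The paper does not list the exact parameter grids, so the grid below (for the K-Nearest Neighbor model) is only an illustrative assumption.

from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Illustrative grid; the paper reports that K = 7 was selected this way.
param_grid = {"n_neighbors": [3, 5, 7, 9, 11], "weights": ["uniform", "distance"]}

search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")
# search.fit(X_train, y_train)                  # X_train, y_train from the 80/20 split
# print(search.best_params_, search.best_score_)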
To further investigate the models built, a receiver operating characteristic (ROC) curve was plotted, as shown in Figure 15, for all of the models involved in this project. The ROC curve represents the diagnostic ability of the classifier, and the area under each curve is calculated and displayed in Figure 15. The closer the area under the ROC curve is to one, the better the diagnostic ability of the model.

Figure 15: ROC curve for all models
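The ROC curves and the area under each curve can be computed with scikit-learn; the following sketch uses a synthetic stand-in for the data purely to stay self-contained, so the numbers it prints are not the paper's results.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Cleveland split, only to keep the sketch runnable.
X, y = make_classification(n_samples=303, n_features=13, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
fpr, tpr, _ = roc_curve(y_te, clf.predict_proba(X_te)[:, 1])
print("AUC =", auc(fpr, tpr))  # the closer to 1.0, the better the diagnostic ability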

Finally, the overall accuracy of this project after conducting the hard voting ensemble method came out to be 90%, which is considered a fairly adequate accuracy that can be further built upon in the future.

VI. CONCLUSION
In conclusion, this paper presented a machine learning ensemble technique that combined multiple machine learning techniques in order to provide a more accurate and robust model for predicting the possibility of having a heart disease. The ensemble model achieved 90% accuracy, which exceeds the accuracy of each individual classifier. The model can be used to assist doctors in analyzing patient cases in order to validate their diagnosis and help decrease human error.

REFERENCES
[1] "Heart Disease Facts & Statistics," Centers for Disease Control and Prevention. [Online]. Available: https://www.cdc.gov/heartdisease/facts.htm. [Accessed: 27-Apr-2019].
[2] NHLBI, NIH, "Anatomy of the Heart," 2011 [updated 17 November 2011; cited 10 January 2015]. Available: http://www.nhlbi.nih.gov/health/health-topics/topics/hhw/anatomy
[3] "Cardiovascular diseases (CVDs)," World Health Organization, 26-Sep-2018. [Online]. Available: https://www.who.int/cardiovascular_diseases/en/. [Accessed: 27-Apr-2019].
[4] M. Lichman, UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science, 2013.
[5] D. Chaki, A. Das, and M. Zaber, "A comparison of three discrete methods for classification of heart disease data," Bangladesh Journal of Scientific and Industrial Research, vol. 50, no. 4, pp. 293–296, 2015.
[6] R. G. Saboji, "A scalable solution for heart disease prediction using classification mining technique," 2017 International Conference on Energy, Communication, Data Analytics and Soft Computing (ICECDS), 2017.
[7] R. El-Bialy, M. Salamay, O. Karam, and M. Khalifa, "Feature analysis of coronary artery heart disease data sets," Procedia Computer Science, vol. 65, pp. 459–468, 2015.
[8] J. Patel, D. TejalUpadhyay, and S. Patel, "Heart disease prediction using machine learning and data mining technique," International Journal of Computer Science & Communication, vol. 7, no. 1, pp. 129–137, 2015. DOI: 10.090592/IJCSC.2016.018.
[9] S. Ghosh, "Application of various data mining techniques to classify heart diseases," 2017. [Online]. Available: https://pdfs.semanticscholar.org/dbe6/7e47cb35edc283cebd5cf06dd67faf1ad100.pdf [Accessed 13 Jul. 2019].
[10] A. Géron, Hands-on Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems. Beijing: O'Reilly, 2018.
Resolving Conflict of Interests in Recommending
Reviewers for Academic Publications Using Link
Prediction Techniques
Sa'ad A. Al-Zboon, Saja Khaled Tawalbeh, Heba Al-Jarrah, Muntaha Al-asa'd, Mohammad AL-Smadi, Dept. of Computer Science, Jordan University of Science and Technology, Irbid, Jordan
Mahmoud Hammad, Dept. of Software Engineering, Jordan University of Science and Technology, Irbid, Jordan
Emails: saalzboon16@cit.just.edu.jo, sajatawalbeh91@gmail.com, hebaatta96@gmail.com, mabalasad@gmail.com, m-hammad@just.edu.jo, masmadi@just.edu.jo

Abstract—An honest peer-review process is key to producing high quality scientific research. However, this process depends on two main factors: (1) the expertise of reviewers in the topic of a submitted paper and (2) the relationships between reviewers and authors. To satisfy the first factor, editors and conference chairs manually select reviewers, whereas to satisfy the second factor and prevent any conflict of interest (CoI) between reviewers and authors, reviewers and authors are asked to declare any CoI manually. Such a solution is tedious for all actors and error-prone. To solve this problem and satisfy those two factors, we have developed a novel framework that (1) recommends expert reviewers and (2) resolves the CoI problem. To develop our framework, we have represented the DBLP citation network dataset as a graph database using Neo4J. Cypher queries are used to select expert reviewers. Various link prediction algorithms, especially the Adamic Adar and the Common Neighbors algorithms, have been utilized to resolve any potential conflict of interest.
Index Terms—Conflict of Interests (CoIs), DBLP, Link Prediction, Adamic Adar, Common Neighbors.

I. INTRODUCTION
The peer-review process for evaluating scientific research is a crucial process for producing high quality research papers and successfully running academic events. An honest peer-review process relies on two main factors: (1) the expertise of reviewers in the topic of the submitted paper and (2) the relationships between reviewers and authors. Unfortunately, meeting these two factors is not an easy task.
Currently, achieving the first factor depends mainly on the editors and conference chairs to decide who can review what. Regarding the second factor, reviewers and authors need to declare any conflict of interest (CoI) manually. Although achieving those two factors is very important, the current solution is, unfortunately, tedious and error-prone.
A conflict of interest, simply, occurs when a reviewer's judgment might be compromised by an existing relationship to an author of a submitted paper. There are many forms of relationships that can lead to a CoI, such as a student-supervisor relationship, working at the same affiliation, co-authorship, family relationships, etc. On the other hand, a recommended reviewer of a paper, arguably, should be an active researcher who has some publications on the topic of that paper.
To solve the aforementioned problem and achieve the two main factors easily and efficiently, we have developed a framework that recommends expert reviewers in the topic of a given paper while resolving the CoI problem. To develop our framework, we utilized graph mining techniques [1] to recommend expert reviewers and detect any potential CoI between reviewers and authors.
Graph mining techniques have been used in several domains such as computer networks [2] [3], social networks [4]–[6], co-authorship networks [7] [8], and other fields. These techniques depend on data extraction techniques such as classification and clustering. Relations between people, whether business relationships, friendship, or otherwise, are represented as graphs. A graph is represented as G(V, E), where V is a set of vertices (nodes) and E is a set of edges. In social networks, nodes represent people and edges represent the relations between them. A relationship can be direct, meaning that there is a direct edge between two nodes, or indirect (an implicit relationship), meaning that there is a path between two nodes but not a direct edge.
To detect implicit relationships, various link prediction algorithms have been developed. Link prediction algorithms calculate the possibility of two nodes having a direct edge


between them in the future. In our approach, to compute the In addition, Cho and Yu [21] presented a new link prediction
CoI between a camdidate reviewer and an author, we utilized algorithm to analyze and explore the collaboration network
two widely used link prediction algorithms: the Adamic Adar on a data collected at the University of Bristol by finding
(AA) [9] and the Common Neighbors (CN) [10] algorithms. the similarity scores between interdisciplinary research either
In this research, we used the Digital Bibliography and by taking the intersection of co-author network. Moreover,
Library Project (DBLP) dataset [11] which contains informa- Xiao et al. [22] proposed a two-phase link prediction selection
tion about more than 4 million publications in the computer approach named TPLP. TPLP predicts top-k links in a graph
science domain. The dataset expressed as a graph database stream that are most likely to connect with each other.
using Neo4J graph database [12]. Authors represented as nodes Ahmed and Khan [23] presented a study to analyze and
and the relationships between them represented as edges. We understand social network to explore patterns of a relation-
also used the Latent Dirichlet Allocation [13] topic modeling ship in the graph. They used centrality measures to analyze
technique to extract the topic of the publications. Finally, we complex structure of co-authorship network. They clustered
utilized the Cypher query language [14] of the Neo4j graph authors that work on each field and then they ranked them
database to query the graph. according to their publications on each field. Then they used
To the best of our knowledge, this is the first research to adjacency matrix to represent the graph and to find the top
(1) apply the LDA model to extract topics of publications group of authors who work repeatedly in the same domain.
hosted on a Neo4j graph database, (2) use Cypher queries, to Finally, Amjad et al. [24] surveyed several methods for ranking
recommend candidate reviewers who are expert in the topic authors and discussed the advantage and disadvantage of each
of a given paper, and (3) use various link prediction algorithm one of them.
to resolve the CoI problem. All of the aforementioned papers related to our work since
The remainder of this paper is structured as follows: Section they all use graph analysis techniques and link prediction to
II discuss the related work. Section III describes the utilized achieve their goals. However, none of them solve the prob-
link prediction algorithms in our approach. Section IV de- lem we are solving, that is, recommending expert reviewers
scribes our methodology to process the dataset and hosting it without CoI for a given paper using social network analysis
in a Neo4j graph database as well as describing our proposed techniques and link prediction algorithms.
framework. Section V presents an experimental evaluation of
III. L INK P REDICTION A LGORITHMS
our approach. Finally, Section VI concludes the paper with
avenues of future work. Link prediction algorithms are techniques to predict weather
two nodes will have a relationship (a link) between them in
II. R ELATED W ORK the future or not. Link prediction techniques have been used
Chuan et al. [15] proposed a new metric, called LDAcosin, extensively in social networks analysis (SNA) [25]. Various
for recommending authors to work together based on the algorithms have been developed to calculate the closeness
similarity of the content of their publications. Dai et al. [16] (similarity) between two nodes in a graph including Adamic
proposed an approach called TMALCCite to automatically Adar (AA) [9], Common Neighbors (CN) [10], Preferential
recommend citations for researchers. TMALCCite uses text Attachment (PA) [26] and its enhanced version [27], Resource
similarity between two papers as well as leveraging social Allocation (RA) [28], etc. Each algorithm calculates the simi-
network analysis technique to increase the effectiveness of the larity between two nodes differently.
proposed approach. In this approach, the LDA, and the matrix In this research, we applied the Adamic Adar and the
factorization to trace author communities were aggregated Common Neighbors techniques. This section describes each
together. algorithm in more details.
Zhou et al. [17] investigated the problem of attacking the A. Adamic Adar Algorithm
similarity-based link prediction algorithms through deleting
The Adamic Adar (AA) proposed by Lada A. Adamic and
some links either by using local knowledge of the target link or
Eytan Adar in 2003 predicts a link between two nodes in a
by using global knowledge of the network. They showed that
graph based on the shared neighbors between them [9].
solving such a problem is NP-Hard. Ahuja et al. [18] proposed
AA defined as the sum of the inverse logarithmic of the
a new efficient algorithm that calculates a Directed Acyclic
shared neighbors between two nodes. The AA link (probability
Graph (DAG) from a directed graph of a social network. The
of connection) between two nodes is computed using the
proposed algorithm was two times faster than conventional
following equation:
algorithms.
Bütün and Kaya [19] proposed a pattern-based link predic- ∑ 1
tion technique to increase the performance of link prediction AA(A, B) =
accuracy for complex network. Fard et. al [20] proposed a log |N(n)|
n∈N(A)∩N(B)
technique to predict the existence of a target relationship be-
tween two nodes in a heterogeneous information network. The Where, A and B represent two nodes in a graph, n
proposed technique supports a set of hidden and topological represents the shared neighbors between nodes A and B,
features. N (A) represents the number of neighbors for nodes A, N (B)

where A and B represent two nodes in a graph, n represents a shared neighbor of nodes A and B, N(A) and N(B) represent the sets of neighbors of nodes A and B, and |N(n)| represents the number of neighbors of the shared neighbor n.
A value of 0 indicates that the two nodes are not close, whereas a higher value indicates that the two nodes are closer.

B. Common Neighbors Algorithm
The Common Neighbors (CN) algorithm measures the link prediction between two nodes based on their shared neighbors [10]. It relies on the fact that, if two strangers have one common friend, those two strangers are more likely to meet in the future than two strangers without a common friend.
The CN value is computed using the following equation:

CN(A, B) = |N(A) ∩ N(B)|

where N(A) and N(B) are the neighbor sets of node A and node B, respectively. This equation calculates the convergence between the two nodes. A CN value of 0 indicates that the two nodes are not nearby, whereas a higher value means the two nodes are closer.
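Both measures are available in standard graph libraries; the sketch below evaluates them with NetworkX on a toy co-author graph. It only illustrates the two formulas above and is not the Neo4j-based pipeline used in this paper.

import networkx as nx

# Toy co-author graph: an edge means at least one joint publication.
G = nx.Graph([("A", "C"), ("B", "C"), ("A", "D"), ("B", "D"), ("C", "D")])

# Adamic Adar: sum of 1 / log(degree) over the shared neighbors of A and B.
aa = list(nx.adamic_adar_index(G, [("A", "B")]))[0][2]

# Common Neighbors: number of shared neighbors of A and B.
cn = len(list(nx.common_neighbors(G, "A", "B")))

print(aa, cn)  # a value of 0 for both would indicate no detectable closeness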
IV. METHODOLOGY
In this research, we define a reviewer as an active researcher who has some publications on a given topic during the last 5 years and who has no CoI with any author of a given paper. This section describes our approach in more detail. Section IV-A describes the dataset we used in our research. Section IV-B describes our mechanism for hosting the processed dataset. Finally, Section IV-D describes our framework for finding the candidate reviewers for a given paper.

A. Dataset Preparation
This section describes the dataset we used in our research as well as the pre-processing techniques we applied to clean up the data.
1) Dataset: The DBLP Citation Network dataset [11] has been used in this research. DBLP is a computer science bibliography that provides bibliographic information on major computer science journals and conferences. There are several versions of the DBLP dataset, including DBLP citation network v1, DBLP citation network v4, ACM citation v9, and DBLP citation network v11. In this research, we have used DBLP version 11.
The dataset contains 4,107,340 publications. For each publication, the dataset contains the publication id, authors {name, id, and organization as org}, title, venue {raw, id}, year, number of citations (n_citation), references, publisher, etc. Moreover, the dataset contains 36,624,464 citation relationships. The DBLP dataset is available in different formats such as XML, RDF, and JSON files. In this study, we used the JSON file, in which each line of the dataset file represents a paper.
This dataset has been used in research for many purposes such as data clustering [29], topic modeling analysis [30], conflict of interest [31], and expert finding [32].
2) Data Pre-processing: To increase the performance of our approach and prepare the data for the hosting process, we have implemented a Python script to clean the data of any unnecessary content, such as removing special characters, removing unneeded features from the dataset, and removing old publications (more than 10 years old).
The dataset contains 4,107,340 publications. However, after the pre-processing step, we end up with 2,219,099 publications with a data size of 1.21 GB. For each publication, the cleaned dataset contains the publication id, title, authors' names, venue raw, year, field of study (fos), and references.
Finally, we used the Latent Dirichlet Allocation (LDA) [13] topic modeling technique to find the topics of all publications in the dataset from their titles and stored them to be used later by our framework (recall Section IV-D).
computer science journals and conferences. There are several Relationships # 36,516,189
versions of the DBLP dataset including DBLP citation network
v1, DBLP citation network v4, ACM citation v9, and DBLP Figure 1 depicts the structure of the stored graph in the
citation network v11. In this research, we have used DBLP Neo4j database. The figure shows that there are four labels
version 11. (Article, Author, Topic, and Venue). An Article
The dataset contains 4,107,340 publications. For each publi- has Topic, authored by one or more Author, cited by
cation, the dataset contains the publication id, authors {name, different Article, and presented in a Venue.
id, and organization as org}, title, venue {raw, id}, year,
number of citations (n_citation), references, publisher, etc. C. Co-Author Graph Building
Moreover, the dataset contains 36,624,464 citation relation- Once the graph database has been hosted on the Neo4j,
ships. The DBLP dataset is available in different formats such we create a co-author relationships model that describes the
as XML, RDF, and JSON files. In this study, we used the collaborations between different authors. To build such a
JSON file in which each line in the dataset file represents a graphical model, we relied on the “Article authored by or
paper. or more Author" relationships in the stored graph database
This dataset has been used in the research for many purposes (see Figure 1). This step also needs to be done one time (offline
such as data clustering [29], topic modeling analysis [30], process). Each co-author relationship indicates that there is, at
conflict of interest [31], and expert finding [32]. least, one research collaboration between two authors.

Figure 1. Graph Model for the Dataset

Query 1: This Cypher query builds the co-author relationships after we hosted the dataset on Neo4j and created the required labels, i.e., Article, Author, Topic, and Venue.

MATCH (a1)<-[:AUTHOR]-(paper)-[:AUTHOR]->(a2:Author)
WITH a1, a2, paper
ORDER BY a1, paper.year
WITH a1, a2, collect(paper)[0].year AS year,
     count(*) AS collaborations
MERGE (a1)-[coauthor:CO_AUTHOR {year: year}]-(a2)
SET coauthor.collaborations = collaborations;

Query 1. Building a Co-Author Graph

D. CoI Framework
Figure 2 depicts our approach for determining the reviewers of a given paper. The framework consists of four steps and a decision for each candidate reviewer to decide whether he has a CoI with any author of the given paper. In our approach, hosting the DBLP on Neo4j (recall Sections IV-A and IV-B) and extracting the LDA topic model (using Query 1) are considered configuration steps that need to be done only once.
Once a new paper is submitted to our approach (see Figure 2), our approach extracts the list of authors from the header of the paper. Then, it extracts the topic of the given paper from its title using the LDA model.
Using the extracted topic of the given paper, our approach queries the graph database hosted on Neo4j to get the top 10 active authors on that topic. Those top 10 authors are considered candidate reviewers, but not recommended reviewers, until we check whether there is any CoI between them and any author of the given paper. To get the top 10 authors of a given topic, we use the Cypher query below, Query 2.
Query 2: This Cypher query retrieves the top 10 active authors in a specific topic (the candidate reviewers) and orders them in descending order of their number of publications on that topic. REVIEWER is the name of a candidate reviewer, Number_Of_Publication_In_Topic is the total number of publications of that reviewer in the topic, and Topic_A and Topic_B are examples of the topics of the given paper.

MATCH (reviewer)<-[:AUTHOR]-(paper)
MATCH (paper:TOPIC {HAS_TOPIC: 'Topic_A, Topic_B'})
RETURN reviewer AS REVIEWER,
       COUNT(paper) AS Number_Of_Publication_In_Topic
ORDER BY Number_Of_Publication_In_Topic DESC
LIMIT 10

Query 2. Get Top Active Authors in a Topic

After retrieving the top 10 authors (using Query 2), our approach utilizes the link prediction algorithms, mainly the AA and CN algorithms (recall Section III), to find the percentage of any implicit co-author relationship between each candidate reviewer and all authors of the given paper.
To calculate the AA and the CN values, our approach runs Query 3 and Query 4, respectively.
Query 3: This query runs the Adamic Adar (AA) link prediction algorithm.

MATCH (a1:Author {name: 'author_one'})
MATCH (a2:Author {name: 'author_two'})
RETURN a1.name AS REVIEWER,
       algo.linkprediction.adamicAdar(a1, a2,
         {relationshipQuery: 'CO_AUTHOR'}) AS AA_Score

Query 3. Adamic Adar Query

Query 4: This query runs the Common Neighbors (CN) link prediction algorithm.

MATCH (a1:Author {name: 'author_one'})
MATCH (a2:Author {name: 'author_two'})
RETURN a1.name AS REVIEWER,
       algo.linkprediction.commonNeighbors(a1, a2,
         {relationshipQuery: 'CO_AUTHOR'}) AS CN_Score

Query 4. Common Neighbors Query

After calculating the AA prediction value (Query 3) and the CN prediction value (Query 4) for each candidate reviewer, our approach decides whether a given candidate reviewer is considered a recommended reviewer (the prediction values of AA and CN are both zero) or not (the prediction value of AA or CN is greater than zero).
It is worth mentioning that our approach also, as a final step that is not shown in Figure 2, calculates the AA prediction value between the top 10 candidate reviewers themselves to determine whether there is any CoI among the reviewers and highlights the results. Knowing that there is a CoI between two reviewers would help the editor not to assign the given paper to those two reviewers, since the decision of one reviewer might affect the decision of the other, relying on the fact that a CoI indicates that the two reviewers know and communicate with each other.
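The decision step can be summarized in a few lines. The sketch below assumes the AA and CN scores returned by Queries 3 and 4 have already been collected into dictionaries keyed by candidate-reviewer name; running the queries themselves is omitted.

def recommend(candidates, aa_scores, cn_scores):
    # A candidate is recommended only if both link-prediction scores are zero,
    # i.e., no implicit co-author link to any author of the submitted paper.
    recommended, with_coi = [], []
    for reviewer in candidates:
        if aa_scores.get(reviewer, 0.0) == 0.0 and cn_scores.get(reviewer, 0.0) == 0.0:
            recommended.append(reviewer)
        else:
            with_coi.append(reviewer)
    return recommended, with_coi

# Example with the values of Tables III and IV (all scores are 0.0):
# recommend(["Giuseppe Riva", "Andrea Gaggioli"], {"Giuseppe Riva": 0.0}, {})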

Figure 2. Overview of our CoI Framework

V. EXPERIMENTAL EVALUATION
To evaluate our approach, we applied our CoI approach on a publication to get the recommended reviewers and exclude the candidate reviewers with a CoI.
For this experiment we chose the publication "Decoupling assessment and serious games to support guided exploratory learning in smart education" [33]. The authors of this paper are Mohammad Al-Smadi, Nicola Capuano, and Christian Guetl. The topic that the Extract Topic step (see Figure 2) obtained from the LDA model according to the title of this paper is "Virtual, Local, Reality, Health, Error, Phase, Education, Dynamics, Medical, Continuous".
Running Query 2 in the next step of our approach, Retrieve top 10 authors, retrieved the top 10 candidate reviewers in descending order of their publications, as shown in Table II. This table shows the name of each candidate reviewer as well as the number of his publications in the last 5 years in the topic of the selected paper [33]. It is worth mentioning that our database contains 42,092 publications in the topic of the selected paper in the last 5 years, authored by 86,938 authors.

Table II. TOP 10 CANDIDATE REVIEWERS OF [33]
Reviewer Name | Number of Publications in Topic
Giuseppe Riva | 77
Andrea Gaggioli | 59
Mark Billinghurst | 59
Kimon P. Valavanis | 57
Karl Rihaczek | 40
Dieter Schmalstieg | 35
Vinton G. Cerf | 35
Peter Pagel | 34
Gary McGraw | 31
Stefania Serafin | 31

The Link Prediction Algorithm step of our approach calculates the AA prediction value (using Query 3) and the CN prediction value (using Query 4) of each candidate reviewer. Table III shows the Adamic Adar prediction value of each candidate reviewer, and Table IV shows the Common Neighbors prediction value of each candidate reviewer.

Table III. ADAMIC ADAR PREDICTION VALUES OF THE TOP 10 CANDIDATE REVIEWERS
Reviewer Name | AA Score
Giuseppe Riva | 0.0
Andrea Gaggioli | 0.0
Mark Billinghurst | 0.0
Kimon P. Valavanis | 0.0
Karl Rihaczek | 0.0
Dieter Schmalstieg | 0.0
Vinton G. Cerf | 0.0
Peter Pagel | 0.0
Gary McGraw | 0.0
Stefania Serafin | 0.0

Table IV. COMMON NEIGHBORS PREDICTION VALUES OF THE TOP 10 CANDIDATE REVIEWERS
Reviewer Name | CN Score
Giuseppe Riva | 0.0
Andrea Gaggioli | 0.0
Mark Billinghurst | 0.0
Kimon P. Valavanis | 0.0
Karl Rihaczek | 0.0
Dieter Schmalstieg | 0.0
Vinton G. Cerf | 0.0
Peter Pagel | 0.0
Gary McGraw | 0.0
Stefania Serafin | 0.0

As discussed in Section IV-D, as a last step our approach highlights the recommended reviewers who have a CoI between them, based on their AA prediction values, and displays that to the editor. The final result of the recommended reviewers for our selected paper is shown in Table V.

Table V. RECOMMENDED REVIEWERS FOR [33] WITHOUT CoI
Reviewer Name
Giuseppe Riva
Andrea Gaggioli
Mark Billinghurst
Kimon P. Valavanis
Karl Rihaczek
Dieter Schmalstieg
Vinton G. Cerf
Peter Pagel
Gary McGraw
Stefania Serafin

VI. CONCLUSION AND FUTURE WORK
This paper presented a solution to the conflict of interests (CoIs) problem for recommending reviewers for a given paper. Our approach used the DBLP citation network dataset, hosted as a graph database on Neo4J, and graph mining techniques to suggest recommended reviewers without CoI. LDA modeling was used to extract the topic of each publication

and the Cypher query language was used to retrieve the candidate reviewers. Finally, several link prediction algorithms have been utilized to calculate the CoI prediction value of each candidate reviewer. The final list of the reviewers is presented with highlights for any reviewer with a CoI.
In the future, we are planning to use machine learning techniques together with link prediction algorithms to solve the CoIs problem. Moreover, we plan to use different graph mining techniques to achieve better results.

ACKNOWLEDGEMENTS
This research is partially funded by Jordan University of Science and Technology, Research Grant Number: 20170107.

REFERENCES
[1] L. Tang and H. Liu, "Graph mining applications to social network analysis," in Managing and Mining Graph Data. Springer, 2010, pp. 487–513.
[2] Baoxing Chen, Wenjun Xiao, and B. Parhami, "Internode distance and optimal routing in a class of alternating group networks," IEEE Transactions on Computers, vol. 55, no. 12, pp. 1645–1648, Dec 2006.
[3] M. Ljubojević, A. Bajić, and D. Mijić, "Centralized monitoring of computer networks using zenoss open source platform," in 2018 17th International Symposium INFOTEH-JAHORINA (INFOTEH), March 2018, pp. 1–5.
[4] Z. Lu, Y. E. Sagduyu, and Y. Shi, "Integrating social links into wireless networks: Modeling, routing, analysis, and evaluation," IEEE Transactions on Mobile Computing, vol. 18, no. 1, pp. 111–124, Jan 2019.
[5] L. Zhang, H. Li, C. Zhao, and X. Lei, "Social network information propagation model based on individual behavior," China Communications, vol. 14, no. 7, pp. 1–15, July 2017.
[6] S. H. Sajadi, M. Fazli, and J. Habibi, "The affective evolution of social norms in social networks," IEEE Transactions on Computational Social Systems, vol. 5, no. 3, pp. 727–735, Sep. 2018.
[7] L. Guo, X. Cai, F. Hao, D. Mu, C. Fang, and L. Yang, "Exploiting fine-grained co-authorship for personalized citation recommendation," IEEE Access, vol. 5, pp. 12714–12725, 2017.
[8] M. Kudělka, Z. Horák, V. Snášel, P. Krömer, J. Platoš, and A. Abraham, "Social and swarm aspects of co-authorship network," Logic Journal of the IGPL, vol. 20, no. 3, pp. 634–643, June 2012.
[9] L. A. Adamic and E. Adar, "Friends and neighbors on the web," Social Networks, vol. 25, no. 3, pp. 211–230, 2003.
[10] D. Liben-Nowell and J. Kleinberg, "The link-prediction problem for social networks," Journal of the American Society for Information Science and Technology, vol. 58, no. 7, pp. 1019–1031, 2007.
[11] J. Tang, J. Zhang, L. Yao, J. Li, L. Zhang, and Z. Su, "Arnetminer: extraction and mining of academic social networks," in Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2008, pp. 990–998.
[12] Neo4j Inc., "Neo4j: Graph database," Accessed May 2019, https://neo4j.com/.
[13] D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent dirichlet allocation," Journal of Machine Learning Research, vol. 3, no. Jan, pp. 993–1022, 2003.
[14] Neo4j Inc., "Cypher query language," Accessed May 2019, https://neo4j.com/developer/cypher/.
[15] P. M. Chuan, M. Ali, T. D. Khang, N. Dey et al., "Link prediction in co-authorship networks based on hybrid content similarity metric," Applied Intelligence, vol. 48, no. 8, pp. 2470–2486, 2018.
[16] T. Dai, L. Zhu, X. Cai, S. Pan, and S. Yuan, "Explore semantic topics and author communities for citation recommendation in bipartite bibliographic network," Journal of Ambient Intelligence and Humanized Computing, vol. 9, no. 4, pp. 957–975, 2018.
[17] K. Zhou, T. P. Michalak, M. Waniek, T. Rahwan, and Y. Vorobeychik, "Attacking similarity-based link prediction in social networks," in Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems, 2019, pp. 305–313.
[18] R. Ahuja, V. Singhal, and A. Banga, "Using hierarchies in online social networks to determine link prediction," in Soft Computing and Signal Processing. Springer, 2019, pp. 67–76.
[19] E. Bütün and M. Kaya, "A pattern based supervised link prediction in directed complex networks," Physica A: Statistical Mechanics and its Applications, vol. 525, pp. 1136–1145, 2019.
[20] A. M. Fard, E. Bagheri, and K. Wang, "Relationship prediction in dynamic heterogeneous information networks," in European Conference on Information Retrieval. Springer, 2019, pp. 19–34.
[21] H. Cho and Y. Yu, "Link prediction for interdisciplinary collaboration via co-authorship network," Social Network Analysis and Mining, vol. 8, no. 1, p. 25, 2018.
[22] Y. Xiao, H. Huang, F. Zhao, and H. Jin, "TPLP: Two-phase selection link prediction for vertex in graph streams," in Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, 2019, pp. 514–525.
[23] A. Ahmed, M. F. Khan, M. Usman, and K. Saleem, "Analysis of coauthorship network in political science using centrality measures," arXiv preprint arXiv:1902.06692, 2019.
[24] T. Amjad, A. Daud, and N. R. Aljohani, "Ranking authors in academic social networks: a survey," Library Hi Tech, vol. 36, no. 1, pp. 97–128, 2018.
[25] D. Liben-Nowell and J. Kleinberg, "The link-prediction problem for social networks," Journal of the American Society for Information Science and Technology, vol. 58, no. 7, pp. 1019–1031, 2007.
[26] A.-L. Barabási and R. Albert, "Emergence of scaling in random networks," Science, vol. 286, no. 5439, pp. 509–512, 1999.
[27] K. Hu, J. Xiang, W. Yang, X. Xu, and Y. Tang, "Link prediction in complex networks by multi degree preferential-attachment indices," arXiv preprint arXiv:1211.1790, 2012.
[28] T. Zhou, L. Lü, and Y.-C. Zhang, "Predicting missing links via local information," The European Physical Journal B, vol. 71, no. 4, pp. 623–630, 2009.
[29] H. Yin, A. R. Benson, and J. Leskovec, "The local closure coefficient: A new perspective on network clustering," in Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining. ACM, 2019, pp. 303–311.
[30] X. Kong, Y. Shi, S. Yu, J. Liu, and F. Xia, "Academic social networks: Modeling, analysis, mining and applications," Journal of Network and Computer Applications, 2019.
[31] S. Wu, U. L. Hou, S. S. Bhowmick, and W. Gatterbauer, "Pistis: A conflict of interest declaration and detection system for peer review management," in Proceedings of the 2018 International Conference on Management of Data. ACM, 2018, pp. 1713–1716.
[32] C. Shi, Z. Zhang, P. Luo, P. S. Yu, Y. Yue, and B. Wu, "Semantic path based personalized recommendation on weighted heterogeneous information networks," in Proceedings of the 24th ACM International Conference on Information and Knowledge Management. ACM, 2015, pp. 453–462.
[33] A.-S. Mohammad, N. Capuano, and C. Guetl, "Decoupling assessment and serious games to support guided exploratory learning in smart education," Journal of Ambient Intelligence and Humanized Computing, vol. 9, no. 3, pp. 497–511, 2018.

Reconstructing Colored Strip-Shredded Documents based on the Hungarians Algorithm

Fatima Alhaj, Ahmad Sharieh, Azzam Sleit
King Abdullah II School of IT, University of Jordan, Amman, Jordan
fat9170261@fgs.ju.edu.jo, sharieh@ju.edu.jo, azzam.sleit@ju.edu.jo

Abstract—One of the common problems in forensic and investigation science is reconstructing destroyed documents that have been strip-shredded. This work intends to design a strip matching algorithm that matches the edges of the strips in order to reconstruct the original document. The proposed algorithm is divided into three phases. The first is an image-based similarity evaluation that produces a score function (which includes building a "distance matrix"). The second phase is an assignment phase that matches the border pixels of the right side of one shred to the left side of another shred (using the Hungarians algorithm). The third phase defines the sequence according to the matched strips in order to merge the shreds and reconstruct the document. The proposed work is compared with a nearest neighbor search algorithm in terms of accuracy and speed. The Hungarians reassembling algorithm scores better accuracy and run time than nearest neighbor reassembling. The proposed approach scored 96.2 percent average accuracy when reassembling an available online benchmark.

Index Terms—Document Reconstruction, Feature matching, Hungarians Algorithm, Nearest neighbor search, Strip-shredded documents.

I. INTRODUCTION

Automatic shred reconstruction involves finding the correct spatial arrangement of a given set of shreds in order to reassemble a complete document. This problem is usually handled by historians and forensic investigators [6]. It is used in many domains such as health informatics, insurance claim analysis [1], and the military sector [2]. It can also be used to recover documents that were accidentally lost [1]. Manual reconstruction can be used, in which the parts are arranged and analyzed as if they were a puzzle [3]. The huge number of possible shred arrangements makes a manual solution inefficient, exhausting, and time consuming. Many methodologies have been proposed to provide automated and semi-automated document reconstruction. Whether manual or automated reconstruction is used, the greatest challenge is shred identification and matching.

In general, shredding machines produce three categories of shreds: rectangular strips (spaghetti), cross-cut, and circular [3]. This work aims to design, implement, and test an algorithm that solves the problem of reconstructing a shredded document. This implies finding the correct positioning of the given n shreds in order to form the original document. Each shred can be represented as a binary bitmap, and it is assumed that the shreds are placed in the correct orientation. Few researchers have worked on the strip-cut document reconstruction problem [4]. We intend to improve the strip-cut matching algorithm specifically, and to outperform the sequential "best match" and "minimum distance" search for each shred. Searching for the nearest neighbor match of each strip from both sides has the drawback of being time consuming. Motivated by this drawback, a new approach is proposed to reconstruct strip-shredded text documents by first specifying the problem as an optimization problem and then reformulating it as a maximum bipartite matching problem. The Hungarians algorithm is deployed to find the best match with a reduced complexity.

The paper is organized as follows. In Section 2, a brief overview of related work is given. Section 3 presents the methodology and how the shredded documents are obtained: the naive algorithm used to solve this problem is described along with its complexity analysis, followed by the optimized algorithm and its phases, which puts both algorithms in a comparative frame in terms of time complexity. In Section 4, the experimental results and a thorough quantitative evaluation of the proposed approach are presented. Finally, Section 5 concludes the paper.

II. RELATED WORK

The problem of reconstructing shredded documents is closely related to the problem of automatically solving jigsaw puzzles. Schauer et al. [5] considered the shredded document as a form of jigsaw puzzle.



They specified three types of fragments: manually torn documents, cross-cut shredded documents, and strip-shredded documents.

Some work used image-based similarity evaluation, such as Lin and Fan-Chiang [3], who proposed a reconstruction algorithm for strip-shredded documents. Their algorithm is based on image feature matching and a sorted graph-based representation of the shreds. In [3], they used a color-matching-based method to produce an impressive accuracy.

On the other hand, another group of researchers concentrated on text-based documents and exploited character features to match shred boundaries. Perl et al. [8] proposed an optical character recognition algorithm to match two shred boundaries using character histograms. When the number of text lines decreases, the precision of the paper reconstruction changes in a non-increasing order.

Sleit et al. [4] proposed a solution for the reconstruction of cross-cut shredded text documents (RCCSTD) problem based on iteratively building clusters of related shreds. Biesinger et al. [7] investigated the same problem with an improved genetic algorithm.

Butler and Chakraborty [1] proposed the "Deshredder" approach, which provides a visual analysis and makes use of user involvement to direct the reconstruction process. The approach represents shredded pieces as time series and uses nearest neighbor matching techniques, which enables matching not just the contours of the shredded pieces but also their content. Some literature deals with reconstructing strip-shredded documents by extracting information from the boundaries of the document strips [2], [3], [12], but that work does not concentrate on the order of growth of the algorithm (time complexity); it focuses on finding a solution rather than on a better run time.

Justino et al. worked on reconstructing hand-shredded documents [10]. Their methodology pre-processes each shred with a polygonal approximation in order to reduce the complexity of the boundaries. The next stage is feature extraction, followed by a matching stage. Shreds of hand-shredded documents usually have irregular boundaries, which need extra processing before matching; they applied polyline simplification using the Douglas–Peucker (DP) algorithm. The performance of Justino et al.'s methodology degrades as the number of shreds gets bigger, since it affects the polygonal approximation.

III. PROPOSED METHODOLOGY

The result of the shredding process is a set of n shreds Sh = sh0, ..., shn, which also represents the input to the algorithm. This work is divided into three subsequent phases. The first phase applies a similarity score function and results in an n × n distance matrix. The second phase uses the Hungarians method for bipartite matching to find the border matches. The third phase defines the sequence according to the matched strips in order to merge the shreds and reconstruct the document.

A. Distance matrix

The objective of this stage is to define the differences between the borders of the shreds. Consider this step as a preprocessing step that applies a similarity score function. The score function calculates a score for each pair of shreds to measure the similarity (dissimilarity) between them; specifically, it measures the pixel difference. This step results in distanceR(shx, shy) and distanceL(shx, shy), where distanceR(shx, shy) is the similarity score of placing shx to the right of shy and distanceL(shx, shy) is the score of placing it to the left side. For n shreds, on the order of n² scores are calculated.

Algorithm 1: Hungarians reassembling algorithm
input : sh[] ← shreds
output: Ordered sequence of shreds
1  distance[] ← ∞, counter1 ← 0, counter2 ← 0
2  while counter1 ≤ numberOfStrips(sh) do
3      while counter2 ≤ numberOfStrips(sh) do
4          distance[counter1][counter2] = strip-Distance(counter1, counter2)  (total distance between band edges)
5      end
6  end
7  Hungarians-Reassembling(distance[], sh[]) returns indexes assigning the best right match for the left side of every shred
8  sequence[] ← 0
9  indexes = Hungarians-Reassembling(distance[])
10 while i, v in indexes do
11     val = distance[i][v]
12     sequence.append((v, val))
13 end
14 return sequence[]

The preprocessing steps are defined in Algorithm 1 (lines 1-6). The innermost loop calculates the distance between two pixel values as a measure of how similar they are. This process deals with the pixel vector, whatever the color model used. The outer loops ensure that each shred is compared with all other shreds. The procedure returns the sum of distances between the rightmost column of pixels of one shred and the leftmost column of pixels of another shred. The resulting matrix "distance" holds the sums of distances between the edges of every pair of shreds. The complexity of the preprocessing in Algorithm 1 (lines 1-6) is n² × h, where h is the height of the shreds. Assuming the height of a document is the number of shreds multiplied by some constant, then h = C × n and the run time complexity is approximately O(n³).
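For illustration, a minimal NumPy sketch of this preprocessing step follows. It is an assumed implementation, not the authors' code: the helper name edge_distance and the use of absolute pixel-vector differences simply mirror the description of Algorithm 1 (lines 1-6) above.

import numpy as np

def edge_distance(right_edge, left_edge):
    # Sum of absolute pixel-vector differences between two strip edges.
    # right_edge, left_edge: arrays of shape (height, channels).
    return np.abs(right_edge.astype(int) - left_edge.astype(int)).sum()

def build_distance_matrix(shreds):
    # Distance matrix of Algorithm 1 (preprocessing, lines 1-6).
    # shreds: list of n arrays of shape (height, width, channels).
    # distance[i][j] = cost of placing shred j immediately to the right of shred i,
    # i.e. comparing the rightmost column of shred i with the leftmost column of shred j.
    n = len(shreds)
    distance = np.full((n, n), np.inf)
    for i in range(n):
        for j in range(n):
            if i != j:
                distance[i, j] = edge_distance(shreds[i][:, -1], shreds[j][:, 0])
    return distance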

The input used in this step is a shredded document produced by a shredding function. This shredding function generates any number of shreds out of a document by splitting it into strips and shuffling the strips into random positions. Figure 1 shows an example of a document shredded into 40 strips.

Fig. 1. A strip-shredded document into 40 strips (n = 40).

Fig. 2 shows an illustration of the Algorithm 1 preprocessing. For each point (x, y) on the right side of strip 1 and the left side of strip 28, along the entire strip height, the algorithm finds the difference between their pixel vectors (the absolute value). In the distance matrix, distance[1][28] equals the summation of the pixel distances for every point (x, y) along the strip height. A similar summation is done for all points to fill the distance matrix. By doing that, the algorithm compares all shreds' left sides with all shreds' right sides.

Fig. 2. Algorithm 1 distance matrix process for strips 1 and 28.

Fig. 3. A part of the Algorithm 1 distance matrix for the document of Fig. 1.

Fig. 3 shows a part of the 40 × 40 distance matrix constructed by implementing Algorithm 1 (preprocessing part) on the document shown in Fig. 1.

B. Hungarians Method for Bipartite Matching

The strip-shredded document reconstruction problem can be defined as an assignment (matching) problem on a bipartite graph with two disjoint sets: the set of all shreds' left-side border columns and the set of all shreds' right-side border columns. The right side of a specific shred is to be assigned to another shred's left side. Thus, the Hungarians algorithm suits this kind of matching perfectly, using the distance information between shreds that was collected earlier in Algorithm 1.

The Hungarians algorithm was chosen because it achieves a low-order polynomial run time, with a worst-case run time complexity of O(n³). Besides, optimality is guaranteed by the Hungarians assignment algorithm [9]. The basic rule of the Hungarians method is that the number of rows and columns should be equal, which is true for the two-dimensional matrix (distance) introduced in Algorithm 1. This similarity matrix is equivalent to the cost matrix in the Hungarians method.

In Algorithm 1, line 7 calls "Hungarians-Reassembling" as a function. "Hungarians-Reassembling", as part of the proposed methodology, was built according to Pilgrim's work [11] and his description of the Munkres assignment algorithm, and was used to achieve the reassembling purpose.

Fig. 4 shows a part of the output of running "Hungarians-Reassembling" on the document of Fig. 1. Each pair of shreds is the best assignment found by the Hungarians algorithm. The number after the arrow represents the sum of distances between the two strips. For example, the distance between the right border of strip 4 and the left border of strip 19 is the minimum among all other strip distances, which equals 6958.
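As an illustrative sketch of the assignment step (an assumption, not the authors' Munkres-based Hungarians-Reassembling routine), the distance matrix can be passed to SciPy's linear_sum_assignment, which solves the same minimum-cost bipartite matching problem (SciPy uses a Jonker-Volgenant-style solver).

import numpy as np
from scipy.optimize import linear_sum_assignment

def hungarian_reassembling(distance):
    # distance: (n, n) cost matrix from the preprocessing step;
    # distance[i][j] is the cost of placing shred j to the right of shred i.
    cost = np.array(distance, dtype=float)
    finite = cost[np.isfinite(cost)]
    np.fill_diagonal(cost, finite.max() * 10.0)   # forbid matching a shred with itself
    rows, cols = linear_sum_assignment(cost)      # optimal one-to-one assignment
    return [(int(j), float(cost[i, j])) for i, j in zip(rows, cols)]

Chaining the returned (shred, right-neighbor) pairs would reproduce the kind of matches illustrated in Fig. 4.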

TABLE I
EXPERIMENTAL RUN TIME OF THE HUNGARIANS REASSEMBLING ALGORITHM MEASURED IN SECONDS FOR EACH n SHREDS

Number of shreds (n)    Run time (seconds)
25      0.856
50      3.219
70      6.221
100     12.9035
120     18.6241
160     33.1682
200     51.8686
300     120.3616

Fig. 4. A part of "Hungarians-Reassembling" output indexes.

Fig. 6 shows the run time obtained by the implementation
constructed for Hungarians reassembling. Fig. 6 also shows
the theoretical time complexity of the proposed algorithm.
C. Defining Shreds Sequence The results show similar behavior of the run time complexity
The last remaining process employs the Hungarians of the theoretical and experimental analysis.
reassembling algorithm matches or the assigned pair of
shreds to define a sequence of shreds. This sequence will be Table 1 shows a sample result of the running time of
used to attach shreds and reconstruct the original document. Hungarians reassembling algorithm, where time increases as
Algorithm 1 (line 8-14) shows how to achieve this. This step the number of shreds increases.
run time complexity is approximately O(n3 ).
A benchmark called “Caltech and Pasadena Entrances”, pro-
Fig.5 shows the steps of reconstruction a strip shredded vided by Caltech University (http://www.vision.caltech.edu”),
document using Hungarians algorithm. This fig. summarizes containing 86 different images was shredded and reassembled
the process proposed in this research along with the resulting using Hungarians reassembling algorithm. The average accu-
reconstructed document, which was obtained by applying racy obtained was 0.962. Fig. 7 shows the different accuracy
Algorithm 1 (lines 8-14) using the indexes in Fig. 4. This score using the mentioned benchmark.
part of the algorithm appends the matched strips in order The Nearest Neighbor reassembling algorithm was
into a sequence list, then attaches document strips in the formulated in Algorithm 2, although it is not provided
space required for the document image. Next section will explicitly, but it is proposed by [1], [12]. Their algorithm
investigates the performance of these processes. depends on searching each row of the matrix sequentially
to find the minimum distance. Then, the resulting minimum
values are used to join the shreds.
IV. E XPERIMENTS AND R ESULTS
The analytical run time complexity of the Hungarians Algorithm 2: Nearest Neighbor reassembling
algorithm reassembling calculated in section 3 was used with
input : distance[] calculated in Algorithm 1,
different number of shreds. Values obtained from assuming
document shreds (sh[ ])
different value of n, which is the number of shreds, and using
output: define pairs of matched shreds depending on
these values in the run time complexity function resulting
minimum distance
from algorithm analysis.
1 distance[] calculated in Algorithm 1
The customized constants (C)= 4.17E-10, defined by the
2 sequence[] ← 0
frequency of the processor in local machine, was used in the
3 sh[] ← shreds
experiments.
4 min[] ← 0
5 while counter1 ≤ length(sh) do
The proposed method was implemented using Python
6 sequence[ ].append (sh[counter1])
v3.6.3, and PyCharm (JetBrains PyCharm - Community
7 while counter2 ≤ length(sh) do
Edition 2017.2.4 x64). Shreds were obtained using shredding
8 if distance[counter1][counter2] <
function that shreds any document image to an assigned
min[counter1] then min[counter1] =
number of shreds. This algorithm takes n (number of shreds)
distance[counter1][counter2];
as input, along with a document image and splits the image
9 end
to n shreds. This is followed by a shuffle process to put each
10 end
shred in a different random position other than its original
position.
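A minimal sketch of such a shredding function, assumed here for illustration only, splits an image array into n vertical strips and shuffles them; any leftover columns are discarded when the width is not divisible by n.

import numpy as np

def shred_document(image, n, seed=None):
    # Split an image (H x W x C array) into n vertical strips and shuffle them.
    # Returns the shuffled strips and the permutation applied, so that the
    # reconstruction accuracy can later be checked against the ground truth.
    height, width = image.shape[:2]
    strip_width = width // n
    strips = [image[:, i * strip_width:(i + 1) * strip_width] for i in range(n)]
    rng = np.random.default_rng(seed)
    order = rng.permutation(n)          # a random position for every strip
    return [strips[i] for i in order], order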
Algorithm 2 finds the minimum distance in each row of

Fig. 5. The reassembled document using Hungarians reassembling algorithm.

Fig. 6. The analytical run time complexity against the implementation run Fig. 7. Nearest Neighbor reassembling (NNR) and Hungarians reassembling
time of Hungarians reassembling algorithm. (HR) run time.

n × n matrix that produced by Algorithm1. In every search ratio of the corrected positioned shreds to the number of all
(each row), the search space decreased by one. When a shreds. The accuracy depends on the document image and the
minimum is assigned as a match, discard the entire column color distribution of it. Hungarians reassembling shows a more
from subsequent minimum searches. stable performance than the nearest neighbor reassembling.
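The following sketch illustrates the greedy search described above (an assumed implementation of the idea behind Algorithm 2, not the cited code): each row of the distance matrix is scanned for its minimum, and a chosen column is excluded from subsequent searches.

import numpy as np

def nearest_neighbor_reassembling(distance):
    # Greedy row-by-row matching on the (n, n) distance matrix.
    # For each shred i, pick the cheapest still-available right neighbor;
    # once a column is taken it is discarded from later searches.
    n = distance.shape[0]
    available = np.ones(n, dtype=bool)
    sequence = []
    for i in range(n):
        row = np.where(available, distance[i], np.inf)
        row[i] = np.inf                      # a shred cannot be its own neighbor
        j = int(np.argmin(row))
        available[j] = False
        sequence.append((j, float(distance[i, j])))
    return sequence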
Fig. 9 shows the compared accuracy results after applying both
Analyzing the complexity of Algorithm4, the number of NNR and HR algorithms using different number of shreds (n).
searches
∑n for minimum equals to n+(n−1)+(n−2)+.....+1 =
2 Table II shows both the experimental run time complexity
i=1 i = (n(n + 1))/2 = O(n ). Therefore, adding the first
phase of building distance matrix which costs O (n3 ), into and the accuracy of implementation of both methods: HR and
this phase will results O(n3 ). Fig. 8 shows a comparison NNR. In general, it is clear that the HR performs better than
between the two algorithms: Nearest Neighbor reassembling NNR in both the run time and the accuracy.
(NNR) and Hungarians reassembling (HR) in terms of their
run time. Since both of them have similar theoretical run time
complexity of O(n3 ), the shape of cubic function is obvious V. C ONCLUSION
for both of them. The difference in the constants that was This paper investigates the power of Hungarians method
dropped in the HR makes it faster than the NNR. and its ability to find the best match to provide an algorithmic
solution for reassembling colored shredded documents. The
Accuracy was defined for both algorithms, HR and NNR, by algorithm has three phases. The first phase is finding image-
comparing the original document image with the reassembled based similarity and produces a distance matrix. The matrix
document results, strip by strip. Accuracy function finds the defines the distances between the left sides and the right

TABLE II
EXPERIMENTAL RUN TIME (IN SECONDS) AND ACCURACY OF THE HUNGARIANS REASSEMBLING (HR) AND NEAREST NEIGHBOR REASSEMBLING (NNR) ALGORITHMS FOR EACH n SHREDS

Number of shreds (n)   HR Run time (sec.)   NNR Run time (sec.)   HR Accuracy   NNR Accuracy
50      3.219      9.0668     0.96   0.96
100     15.5388    40.7209    0.95   0.87
150     28.9084    79.8505    0.92   0.84
200     47.3328    127.8999   0.90   0.84
250     65.5816    173.9221   0.88   0.82
300     120.3616   335.283    0.84   0.76

[2] A. S. Atallah, E. Emary and M. S. El-Mahallawy, ”A Step toward


Speeding Up Cross-Cut Shredded Document Reconstruction,” 2015
Fifth International Conference on Communication Systems and Network
Technologies, Gwalior, 2015, pp. 345-349. doi: 10.1109/CSNT.2015.69
[3] Huei-Yung Lin, Wen-Cheng Fan-Chiang, Reconstruction of shredded
document based on image feature matching, Expert Systems with
Applications, Volume 39, Issue 3, 2012, Pages 3324-3332, ISSN 0957-
4174, https://doi.org/10.1016/j.eswa.2011.09.019.
[4] Sleit, A., Massad, Y. Musaddaq,An alternative clustering approach for
reconstructing cross cut shredded text documents, M. Telecommun Syst
(2013) 52: 1491. https://doi.org/10.1007/s11235-011-9626-x
[5] Schauer C., Prandtstetter M., Raidl G.R. (2010) A Memetic Algorithm
for Reconstructing Cross-Cut Shredded Text Documents. In: Blesa M.J.,
Blum C., Raidl G., Roli A., Sampels M. (eds) Hybrid Metaheuristics.
HM 2010. Lecture Notes in Computer Science, vol 6373. Springer,
Berlin, Heidelberg.
[6] F. Richter, C. X. Ries, N. Cebron and R. Lienhart, ”Learning to Reassem-
ble Shredded Documents,” in IEEE Transactions on Multimedia, vol. 15,
no. 3, pp. 582-593, April 2013. doi: 10.1109/TMM.2012.2235415
Fig. 8. Accuracy comparison chart between NNR and HR. [7] Biesinger B., Schauer C., Hu B., Raidl G.R. (2013) Enhancing a
Genetic Algorithm with a Solution Archive to Reconstruct Cross Cut
Shredded Text Documents. In: Moreno-Dı́az R., Pichler F., Quesada-
Arencibia A. (eds) Computer Aided Systems Theory - EUROCAST
side of each strip (shred). In second phase, the Hungarians 2013. EUROCAST 2013. Lecture Notes in Computer Science, vol 8111.
algorithm was used to match pairs of shreds. Third phase Springer, Berlin, Heidelberg
[8] J. Perl, M. Diem, F. Kleber and R. Sablatnig, ”Strip shredded document
defines the sequence according to the matched strips in order reconstruction using optical character recognition,” 4th International
to merge the shreds and reconstructs the document. Conference on Imaging for Crime Detection and Prevention 2011 (ICDP
2011), London, 2011, pp. 1-6. doi: 10.1049/ic.2011.0132
[9] Wong, J.K.: A new implementation of an algorithm for
The proposed work was compared with the nearest the optimal assignment problem: An improved version of
neighbor search algorithm in term of accuracy and run munkres’ algorithm. BIT Numerical Mathematics 19(3),
time. The accuracy of the output results by implementing 418424 (Sep 1979). https://doi.org/10.1007/BF01930994,
https://doi.org/10.1007/BF01930994
Hungarians reassembling algorithm depends greatly in the [10] Justino, E., Oliveira, L.S., Freitas, C.: Reconstructing shredded docu-
image color distribution along with the number of shreds. The ments through feature matching. Forensic Science International 160(2),
proposed algorithm accuracy and run time were evaluated 140 147 (2006).
[11] Pilgrim, R.: Tutorial on implementation of munkres’ assignment algo-
and compared with the Nearest Neighbor reassembling. The rithm (1995)
proposed algorithm shows better results in terms of run time, [12] Marlos A. O. Marques and Cinthia O. A. Freitas. 2009. Recon-
and also a more stable accuracy as the number of shreds structing strip-shredded documents using color as feature matching.
In Proceedings of the 2009 ACM symposium on Applied Com-
increases. puting (SAC ’09). ACM, New York, NY, USA, 893-894. DOI:
https://doi.org/10.1145/1529282.1529475
Since the matching between each two shreds is an inde-
pendent process, parallel processing can be applied to the
proposed algorithm to get a better speed up. This can be in-
vestigated in future work for parallel Hungarians reassembling
algorithm.

R EFERENCES

[1] Butler, Patrick , Chakraborty, Prithwish ,Ramakrishan, Naren. (2012).


The Deshredder: A visual analytic approach to reconstructing
shredded documents. IEEE Conference on Visual Analytics Sci-
ence and Technology 2012, VAST 2012 - Proceedings. 113-122.
10.1109/VAST.2012.6400560.

Implementation and Comparative Analysis of
Semi-automated surveillance algorithms in real
time using Fast-NCC
Omer Khan, Nayab Saeed, Raheel Muzzammel, Umair Tahir and Omar Azeem
Electrical Engineering Department, University of Lahore, Lahore Pakistan
omerkhan128@gmail.com

Abstract - Chaotic environment, irregular motion of automated systems without the interference of human
objects creates challenging environment in the field of being [6].
computer vision. Advance target tracking techniques are Background subtraction with alpha is used for tracking
used to overcome these problems, but few parameters objects by calculating deviations from the background
are considered ideal in those scenarios, or those
model [7]. In this technique background is initialized
parameters are ignored. In this research, cross
correlation technique is applied for target tracking with first few frames. Where adoptive coefficient
which is famous for feature extraction in image and large value of alpha leave tail mark of moving
processing. Further, normalized cross correlation and object. Statistical method provides difference of whole
fast-normalized cross correlation are implemented and frame with reference frame and then resultant frame is
results are compared. As enormous computation are grouped to create object, which requires expensive
required in these techniques, real time target tracking is computing [8]. It is not commonly used for real time
factual challenge faced by this technique. High target tracking. Temporal differencing method uses
performance embedded hardware is required to few consecutive frames to extract the moving object
implement these techniques. In this research,
“TMS320DM642 evaluation module with TVP video
but this technique is not very good [9]. When object
decoders” digital signal processor embedded board is stop moving, this technique fails to detect the object.
carefully chosen for this purpose. These techniques are Eigen background subtraction provides motion
implemented on TMS320DM642 evaluation module and detection using Eigen space model [10]. In this
their results are carefully analyzed in this research. method, dimensionality of the space constructed from
sample images is reduced by the help of Principal
Keywords — Digital Signal Processor (DSP); Evaluation Component Analysis (PCA). In this technique
Module (EVM); External Memory Interface (EMIF); additional overhead is added to calculate principal
Synchronous Dynamic Random-Access Memory component. Correspondence based matching
(SDRAM); Field-Programmable Gate Array (FPGA);
Universal Asynchronous Receiver-Transmitter (UART);
algorithm, takes object of current frame and previous
Normalized Cross-Correlation (NCC); Real time Tracking frame then Euclidian Distance is calculated [11]. On
(RTT); Region of Interest (ROI); Phase Alternation Line the basis of Euclidian distance, next location of object
(PAL); Computational Time (CT); Frame Per Second is predicted which increases chance of target miss.
(FPS); In this research cross-correlation, normalized NCC
and fast-NCC is selected for comparative analysis of
I. INTRODUCTION target tracking algorithm. These algorithm select
The phenomenon of analyzing video sequences is optimized target vector and search area vector, which
known as video surveillance. Video surveillance is a provide less computing and fast processing rate. In
demanding and vigorous region in the field of most of the systems, hardware optimization is required
computer vision and has been proved vital in data for real time tracking for automated systems.
storing and displaying [1]. Video surveillance Dedicated hardware can be designed to perform
activities can be categorized in three types: manual application specific tasks so it will be much expensive
video surveillance, semi-autonomous video to use redundant hardware [12].
surveillance and fully-autonomous system [2]. TMS320DM642 evaluation module with TVP video
Automated surveillance systems are required to decoders” digital signal processor embedded board is
provide target tracking and feature extraction [3]. selected for this research to improve target tracking
Video surveillance for a long time by a human algorithm in real time [13]. The DSP on the DM642
operator is not possible. Solution for this problem real EVM interfaces to on-board peripherals through the
time target tracking algorithms are widely used in 64-bit wide EMIF of the three 8/16 bit wide video
surveillance [4] [5]. Image processing and computer ports [14]. The SDRAM, Flash, FPGA, and UART
vision are areas of recent research to provide [15] are each connected to one bus. The EMIF bus is



also connected to the daughter card expansion Video frame are acquired in TMS320DM642
connectors which are used for add-in boards. evaluation module with TVP video decoders through
On board video encoders and decoders interface the a PAL standard camera on video port 0. Target object
video ports and expansion connectors [16]. Two is provided through UART port by its location, target
decoders and one encoder are standard on the EVM. vector is created to search in a specific search area and
On screen, display functions are implemented in an some preprocessing is also done to reduce the noise.
external FPGA which resides between the output When targeted object is found and matched, a box is
video port and the video decoder. created around the object.

II. PROPOSED METHODOLOGY III. IMPLEMENTATION


This system is proposed to test the behavior of cross Correlation is a measure of the degree to which two
correlation, normalized cross correlation and fast variables agree, not necessarily in actual value but in
normalized cross correlation for real time target general behavior. The two variables are the
tracking. To implement these algorithm and record corresponding pixel values in two images, template
their results, “TMS320DM642 evaluation module and source.
with TVP video decoders” digital signal processor In this research, video frames are provided by a PAL
embedded board is used. standard camera with frame size of 720x480. Then,
this algorithm waits for the location of target area
through UART port. When location is provided it
creates a target vector of (64x64). After the target is
specified, it locks the UART port until the target is
discarded by user. The frame next to the target area is
specified. Search area is identified that is of resolution
(128x128).
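As a rough illustration of these vector sizes (assumed code, not the DSP implementation), the sketch below crops a 64x64 target vector at the user-supplied location and a 128x128 search window centred on the same location in the next frame.

import numpy as np

def crop_window(frame, cx, cy, size):
    # Crop a size x size window centred at (cx, cy), clipped to the frame borders.
    h, w = frame.shape[:2]
    half = size // 2
    x0, y0 = max(cx - half, 0), max(cy - half, 0)
    return frame[y0:min(y0 + size, h), x0:min(x0 + size, w)]

def extract_vectors(prev_frame, next_frame, cx, cy):
    # (cx, cy) would arrive over the UART port in the real system.
    target_vector = crop_window(prev_frame, cx, cy, 64)    # 64 x 64 template
    search_area = crop_window(next_frame, cx, cy, 128)     # 128 x 128 search window
    return target_vector, search_area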

Figure 2: Resolution of Vectors

Cross correlation provide better and efficient results in


transform domain, but for feature matching,
normalized cross correlation is preferred and it does
not have simple expression in frequency domain.
Spatial domain is most commonly used to compute
normalized cross correlation. In this research, an
algorithm is implemented of cross correlation with
some pre-processing which provides enhanced results.
This pre-processing is applied on search window and
target vector to reduce the number of computations. It
enhances the processing power as well as frame rate.

A. Pre Processing
In this research, system is not designed for a specific
targets so, pre-processing is applied to reduce noise. If
the image sequence has noise, noise should be
removed. Common types of noises found in image
Figure 1: Proposed Flow Chart sequences includes salt and pepper noise, the pixels

affected by salt and pepper noise have different colors or intensities from their surrounding pixels; they are removed by applying a median filter. Another type of noise found in image sequences is Gaussian noise, where every pixel value in the image is changed by a small amount. Noise removal methods for Gaussian noise include Gaussian smoothing.

i. Gaussian Filtering
In this algorithm, Gaussian smoothing is applied to the depth video only, along the spatial dimension. Each frame in the depth video is convolved with the Gaussian smoothing filter independently. This preprocessing step removes sharp changes in the video. A 4x4 Gaussian filter is applied in this research.

ii. Contrast Adjustment
In this research, histogram equalization is applied only on the search area and the target area. It is quite effective for local texture enhancement. Performing normalization on the target vector and the search area vector requires the maximum and minimum values of both vectors.

IV. MATHEMATICAL FORMULATION OF CROSS CORRELATION
Template matching is the most important element of this research for providing an efficient and effective target tracking algorithm.

A. Template Matching by Cross-Correlation
Among many techniques, cross-correlation is one of the most commonly used techniques for template matching. Cross-correlation for template matching is motivated by the Euclidean distance. The Euclidean distance between two points is the length of the line segment connecting them; in Cartesian coordinates, if p = (p1, p2, ..., pn) and q = (q1, q2, ..., qn) are two points in Euclidean n-space, then the distance d(p, q) from p to q, or from q to p, is given by the Pythagorean formula:

d(p, q) = d(q, p) = sqrt( Σ_i (q_i − p_i)² )   (1)

where n determines the number of Cartesian coordinates; in this case there are only the two coordinates of a frame, so the equation reduces to:

d(p, q) = d(q, p) = sqrt( (q1 − p1)² + (q2 − p2)² )   (2)

It seems intuitively likely that the convolution output will be highest at places where the image structure matches the mask structure, where large image values get multiplied by large mask values. This idea can be tried by picking out part of our image to use as a mask. The squared Euclidean distance between the search area f and the target vector t can be represented by equation 3:

d²_{f,t}(u, v) = Σ_{x,y} [ f(x, y) − t(x − u, y − v) ]²   (3)

The search area is the region extracted from the input video, which is represented by f above. Expanding equation 3 gives:

d²_{f,t}(u, v) = Σ_{x,y} [ f(x, y)² − 2 f(x, y) t(x − u, y − v) + t(x − u, y − v)² ]   (4)

In equation 4, the term Σ t(x − u, y − v)² is a constant value, as it is the squared target vector summed over its displacement in the search area; since the target vector is constant, it remains the same for every frame. The term Σ f(x, y)² is approximately constant if the search area remains roughly the same most of the time. By treating these two terms as constant, the remaining cross-correlation term simplifies to:

c(u, v) = Σ_{x,y} f(x, y) t(x − u, y − v)   (5)

This equation measures the similarity between the search area and the target vector. If the energy of the image Σ f(x, y)² varies with position, matching using equation 5 can fail. For example, the cross-correlation between the target vector and an exactly matching region in the search area may be lower than that between the target vector and another region of the search area because of changes in lighting conditions across the image sequence.

i. Correlation Coefficient
Environmental changes cause amplitude changes in the video sequence, and these variations create challenging situations for target tracking. Normalizing the search area and the target vector is suited to this problem, yielding a cosine-like correlation coefficient.

B. Normalized Cross-Correlation
Normalized cross-correlation addresses the challenges discussed in the previous sections. To find the target vector in the search area of a two-dimensional frame, the normalized cross-correlation value γ(u, v) is calculated for every point (u, v), where the target t has been displaced by (u, v) over f. The equation below represents the basic normalized cross-correlation coefficient:

γ(u, v) = Σ_{x,y} [ f(x, y) − f̄_{u,v} ] [ t(x − u, y − v) − t̄ ] / sqrt( Σ_{x,y} [ f(x, y) − f̄_{u,v} ]² · Σ_{x,y} [ t(x − u, y − v) − t̄ ]² )   (6)

Here t̄ and f̄_{u,v} represent the mean values of the target vector and of the search area under the target, respectively. They are given by the following equations:

f̄_{u,v} = (1 / (Nx · Ny)) Σ_{x,y} f(x, y)   (7)
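The sketch below is an assumed NumPy illustration of equation (6) for grayscale frames, not the embedded implementation: it evaluates the coefficient at one displacement (u, v) and then slides it over the whole search area to locate the maximum.

import numpy as np

def ncc_at(f, t, u, v):
    # Normalized cross-correlation coefficient of template t at offset (u, v) in image f.
    th, tw = t.shape
    window = f[v:v + th, u:u + tw].astype(float)
    t = t.astype(float)
    dw = window - window.mean()          # f(x, y) - f_bar(u, v)
    dt = t - t.mean()                    # t(x - u, y - v) - t_bar
    denom = np.sqrt((dw ** 2).sum() * (dt ** 2).sum())
    return (dw * dt).sum() / denom if denom > 0 else 0.0

def match(search_area, target):
    # Brute-force NCC search; returns the (u, v) with the highest coefficient.
    H, W = search_area.shape
    th, tw = target.shape
    scores = np.array([[ncc_at(search_area, target, u, v)
                        for u in range(W - tw + 1)]
                       for v in range(H - th + 1)])
    v, u = np.unravel_index(np.argmax(scores), scores.shape)
    return u, v, scores[v, u]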

investigate the efficiency of this implemented
1 algorithm are; average time to process a frame, frame
̅ , = ( − , − ) (8)
rate, accuracy by percentage of throughput and error
Where search area dimensions are and power consumption.
determined by , limits for target vectors are
defined by . For a search window of size D. Average time to process a frame
and a feature of size requires approximately To calculate average time to process a frame, a
( − − 1) additions and ( − + 1) hardware pin is set. When the process is completed,
multiplications. the pin is reset. This is monitored by a digital login
analyzer and time duration is recorded.
C. FAST NORMALIZED CROSS-CORRELATION
E. Frame rate
Fast calculation of the normalized cross-correlation is
provided by using two sum tables over the image In a video sequence of one minute, it have 60 second,
function and search area energy . Sum tables it is difficult to provide frame rate for the video
over the search area are pre-computed sequence of one minute so we take average of frames
integrals. After calculating sum tables, the arithmetic for five second.
operations are efficiently reduced to only three
addition/subtraction operations from ∗ F. Accuracy
computations. Various approaches can be used to Accuracy of algorithm can be determined by
efficiently calculate the denominator (image calculating number of frames in which target object is
variances) of (6), however, it cannot be directly found or lost. Number of frames in which target is lost
applied to compute the cross correlation between can further be categorized in two, target lost and false
search area and target vector, as the one shown in the target located. In this research we considered target
numerator of (6). found and lost only. To calculate both following
formula are used.

i. Calculation of Numerator and Denominator ℎ ℎ % = ∗ 100

Numerator of (6) can be expressed as
( , )= ( , ) ( − , − ) −
(9) % = ∗ 100

Where, (9) provides the simplified term for nominator
of normalized cross-correlation coefficient. G. Power consumption
Using these sum tables mean for search area from (7) Power consumption is a major constrained in
can be very efficiently calculated independent of the embedded hardware. It is not possible to calculate
size of target vector. Now (9) can be represented as; power consumed by individual units so in this research
power of whole system is calculated.
( , )= + − 1, + −1
(10) VI. RESULTS
− − 1, + −1
− ( + − 1, − 1) In order to provide results for comparative analysis of
+ ( − 1, − 1) cross correlation techniques for real time target
It is clear from (10), that only three tracking, experiments are made to obtain results. To
addition/subtraction are required to calculate the test all these, implemented algorithm is tested in
double sum over ( , ) by evaluation of sum different scenarios. Four set of video sequences of
table ( , ). Sum-tables are calculated using the time interval two minutes; normal video sequence,
recursive equations for the target vector. The basic chaotic video sequence, low contrast video sequence
functions in each overlapping template sub-image are and dark video sequence.
then calculated by threshold of the image and labeling
and identifying the boundaries and centers of A. Target tracking using Cross-correlation results
landmark points or natural speckle patterns on the
Cross-correlation is the basic technique that is
skin.
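One possible realization of these sum tables, assumed here for illustration, uses NumPy cumulative sums as integral images of f and f²; with the padded tables, the local sum (and hence the local mean or energy) over any template-sized window costs only three additions/subtractions, as stated for equation (10).

import numpy as np

def sum_tables(f):
    # Integral images of f and f^2, padded with a zero row and column.
    f = f.astype(np.float64)
    s = np.zeros((f.shape[0] + 1, f.shape[1] + 1))
    s2 = np.zeros_like(s)
    s[1:, 1:] = f.cumsum(axis=0).cumsum(axis=1)
    s2[1:, 1:] = (f ** 2).cumsum(axis=0).cumsum(axis=1)
    return s, s2

def local_sum(s, u, v, tw, th):
    # Sum of f over the th x tw window whose top-left corner is (u, v):
    # three additions/subtractions per window, independent of the template size.
    return s[v + th, u + tw] - s[v, u + tw] - s[v + th, u] + s[v, u]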
implemented on hardware to observe the results as
described in above subsections. It is not possible to
V. EXPERIMENTAL PARAMETERS compare these results with some previous research,
To establish understanding and gather authentic because TMS320DM642 evaluation module with
research results these video sequence are repeatedly TVP video decoders is general purpose DSP hardware;
tested through algorithm. Parameters selected to on this hardware no such tracking algorithm is

implemented. Later in this research, reason of systems. In case, time period is even more important
improvements and decline in values are discussed. to provide real time target tracking.
Table 1: Target tracking using Cross-correlation results

Video Sequence              Normal   Chaotic  Low Contrast  Dark
Computational time (ms)     50.78    54.01    55.21         54.38
Frame Rate (FPS)            18.95    16.40    17.87         15.32
Target Located              13       11       10            8
Throughput %                72.22    68.75    62.50         53.33
Error %                     27.78    31.25    37.50         46.67
Power Consumption (Watt)    20.01    22.62    21.35         22.24

Table 3: Target tracking using Fast-NCC results

Video Sequence              Normal   Chaotic  Low Contrast  Dark
Computational time (ms)     48.71    50.12    50.26         49.54
Frame Rate (FPS)            19.79    17.83    18.66         18.66
Target Located              16       12       14            13
Throughput %                84.21    70.59    77.78         76.47
Error %                     15.59    29.41    22.22         23.53
Power Consumption (Watt)    17.11    18.74    16.81         17.60
It is visible from result, table of cross-correlation,
B. Target tracking using NCC results normalized cross-correlation and fast-NCC; results of
Normalized cross-correlation is implemented to deal fast-NCC are way better then both previous techniques
with the low contrast images, it provide significant concerning tracking time, which rapidly changes
improvement in low contrast and dark video sequence. frame rate as well. Over all, accuracy of the system is
By comparing the two tables. It is clearly visible that increased but for the video sequence with chaotic
all the parameters are improved significantly. environment, throughput falls because in fast
Table 2: Target tracking using NCC results computation it detects false object as tacking object
which also considered as target lost. Power
Video Low
Normal Chaotic Dark consumption is also reduced, but for chaotic
Sequence Contrast
Computational environment, power utilization is increased. Results
50.13 53.61 54.11 53.83 discussed in the above section are graphically shown
time (ms)
Frame Rate in form of charts below for comparison.
19.21 16.53 17.24 15.86
(FPS)
Target Located 15 12 12 10

Throughput % 78.95 75.00 70.59 66.66

Error % 21.05 25.00 29.41 33.33


Power
18.76 19.14 19.05 19.21
Consumption
Time period is improved because values are
normalized which reduces the load from ALU,
normalizing makes multiplication and addition
process fast. As process time on one frame decreases Figure 3: Comparative Analysis of Computational Time
it allows our algorithm to process new frames so due (ms)
to decrease in time period frame rate increases.
Normalization spreads pixel spectra in a specified
range. It also increases the percentage throughput of
algorithm. Percentage error is inversely proportional
to percentage throughput so it decreases with same
ratio. Power consumption is also reduced because
utilization of resources is reduced.

C. Target tracking using Fast-NCC results


Fast normalized cross-correlation is implemented to
improve the time period of the algorithm. Time is the
most important parameter for real time embedded

efficient utilization of existing hardware resources.
The applied algorithm in this research does not deal
with circumstances like; target object is scaled, tilted,
and object is rotated from its original location.
Multipoint tracking can also provide variety of
applications in the field of computer vision.

REFERENCE
[ 1 ] Ding Zhonglin, and LiLi. "Research on a hibrid moving object
detection algorithm in video surveillance system",
Proceedings of 2011 International Conference on Computer
Science and Network Technology, 2011.
[ 2 ] Venkatesan, R., and A. Balaji Ganesh. "Supervised and
Figure 4: Comparative Analysis of Frame Rate (FPS) Unsupervised Learning Approaches for Tracking Moving
Vehicles", Proceedings of the 2014 International Conference
on Interdisciplinary Advances in Applied Computing -
ICONIAAC 14, 2014.
[ 3 ] Purshottam J. Assudani. "Dot pattern feature extraction,
selection and matching using LBP, Genetic Algorithm and
Euclidean distance", 2012 International Conference on
Computing Communication and Applications, 02/2012
[ 4 ] Oh, Seung-Taek, Nak-Hyun Chun, Seung-Young Yoo, Ho-
Yeop Lee, and Hak-Eun Lee. "A Study on the Target 2D
Tracking Analysis Using Digital Image Correlation at Bridge
Deck Wind Tunnel Test", IABSE Congress Report, 2012.
[ 5 ] C. Wang, X. Chang, Y. Zhang, L. Zhang and X. Chen, "Failure
Analysis of Composite Structures Based on Digital Image
Correlation Method," 2017 International Conference on
Sensing, Diagnostics, Prognostics, and Control (SDPC),
Shanghai, 2017, pp. 473-476.
Figure 5: Comparative Analysis of Throughput (%) [ 6 ] Hii, A.J.H.. "Fast normalized cross correlation for motion
tracking using basis functions", Computer Methods and
Programs in Biomedicine, 200605
[ 7 ] G. Adhikari, S. K. Sahani, M. S. Chauhan and B. K. Das, "Fast
real time object tracking based on normalized cross correlation
and importance of thresholding segmentation," 2016
International Conference on Recent Trends in Information
Technology (ICRTIT), Chennai, 2016, pp. 1-5.
[ 8 ] P. Hwang, K. Eom, J. Jung and M. Kim, "A Statistical
Approach to Robust Background Subtraction for Urban
Traffic Video," 2009 International Workshop on Computer
Science and Engineering, Qingdao, 2009, pp. 177-181.
[ 9 ] T. Shibahara, T. Aoki, H. Nakajima and K. Kobayashi, "A
Sub-Pixel Stereo Correspondence Technique Based on 1D
Phase-only Correlation," 2007 IEEE International Conference
on Image Processing, San Antonio, TX, 2007, pp. V - 221-V -
224.
[ 10 ] Zheng Yi and Fan Liangzhong, "Moving object detection
Figure 6: Comparative Analysis of Power Consumption based on running average background and temporal
(Watt) difference," 2010 IEEE International Conference on
Intelligent Systems and Knowledge Engineering, Hangzhou,
2010, pp. 270-272.
VII. CONCLUSION [ 11 ] T. Cooke, "Eigen-Patch Based Background Subtraction," 2011
This paper presents comparative analysis of cross International Conference on Digital Image Computing:
correlation, NCC and Fast-NCC for the application of Techniques and Applications, Noosa, QLD, 2011, pp. 462-
467.
real time target tracking. Computational time, frame [ 12 ] N. Amrouche, A. Khenchaf and D. Berkani, "Multiple target
rate, throughput and power consumption results tracking using track before detect algorithm," 2017
provide promising improvements for NCC and fast- International Conference on Electromagnetics in Advanced
NCC as compared to cross correlation. In this Applications (ICEAA), Verona, 2017, pp. 692-695.
[ 13 ] F. E. T. Munsayac, L. M. B. Alonzo, D. E. G. Lindo, R. G.
research, results show that these algorithms work for
Baldovino and N. T. Bugtai, "Implementation of a normalized
all environmental situation; chaotic, low contrast, and coefficient-based template matching algorithm in number
dark scenes. system conversion," 2017 IEEE 9th International Conference
In future, noise filters can be added to improve the on Humanoid, Nanotechnology, Information Technology,
Communication and Control, Environment and Management
efficiency of this algorithm. There are hardware
(HNICEM), Manila, 2017, pp. 1-4.
optimization algorithms that can be applied to achieve

[ 14 ] M. V. G. Rao, P. R. Kumar and A. M. Prasad, "Implementation
of real time image processing system with FPGA and DSP,"
2016 International Conference on Microelectronics,
Computing and Communications (MicroCom), Durgapur,
2016, pp. 1-4. doi: 10.1109/MicroCom.2016.7522496
[ 15 ] X. Nguyen, L. Nguyen, T. Bui and H. Huynh, "A real-time
DSP-based hand gesture recognition system," 2012 IEEE
International Symposium on Signal Processing and
Information Technology (ISSPIT), Ho Chi Minh City, 2012,
pp. 000286-000291.
[ 16 ] Y. XiaoPing and L. Jieyun, "Hardware Design of Video
Stabilization System Based on TMS320DM642," 2010 Fourth
International Conference on Genetic and Evolutionary
Computing, Shenzhen, 2010, pp. 86-89.

Adaptive Control of Nonaffine Nonlinear Systems by
Neural state Feedback

M. Bahita 1st Department of Chemical Engineering, K. Belarbi 2nd Ecole nationale polytechnique de
Faculty of Process Engineering, Constantine 3 Constantine, University of Constantine 3, Constantine
University, Constantine 25000, Algeria 25000, Algeria
mbahita@yahoo.fr kbelarbi@yahoo.com

Abstract— In this paper, a new control method for a class of involve certain types of function approximators in their
single input single output nonaffine nonlinear systems is considered learning mechanism.
using radial basis function (RBF) neural networks (NNs). Firstly, the
existence of an ideal implicit feedback linearization control is Fuzzy logic systems and artificial neural networks [10-11]
established based on implicit function theory. An online RBF system
is introduced to approximate this ideal implicit feedback linearization
have been widely used as adjustable components in adaptive
law. The proposed neural fuzzy adaptive controller ensures that the control. In particular, these systems are introduced to
system output tracks a given bounded reference signal, while the approximate unknown nonlinear functions in nonlinear
closed loop stability results are provided and guaranteed using systems in the form of linear regression with respect to
Lyapunov theory. The effectiveness of the proposed controller is unknown parameters and then to apply the well developed
illustrated through a simulation to a nonaffine nonlinear system. adaptive control techniques.
Keywords—Adaptive control; Nonaffine nonlinear systems; Adaptive NNs design methods have been proposed to
Neural networks; Implicit function theorem.
control affine nonlinear system. In practice, many physical or
nonaffine systems are inherently nonlinear, whose input
I. INTRODUCTION variables may enter in the systems nonlinearly. To solve the
Artificial Neural Networks have gone through a rapid control problem for nonaffine nonlinear systems, several
development and grown past the experimental stage to become works have been proposed [7], [12-13], for a comprehensive
implemented in a wide range of engineering applications, such survey, see [14].
as for example state estimation, pattern recognition, signal
processing, process modeling, process quality control and data Fuzzy logic belongs [10] to a class of knowledge based
reconciliation [1-6]. systems. The main advantage of fuzzy logic is the possibility
of implementing human expert knowledge in the form of
The neural network (NNs) is capable of modeling non- linguistic if-then rules and provides a mathematical formalism
linear systems [7-8]. On the basis of supplied training data the for implementing these rules in the form of a computer
neural network learns (trains) the relationship between the program. Fuzzy logic is a rigorous mathematical field offering
process input and output. The training sets consist of one or very interesting solutions for control. Moreover, it offers
more input data and one or more output data. After the methods to control non-linear plants known to be difficult to
training of the network, a test-set of data should be used to model, and also can be used as an estimation technique or
verify whether the desired relationship was learned. In approximator in adaptive control, where the parameters are
practical applications a neural network can be used when the updated during plant operation.
exact model is not known. It is a good example of a ‘black-
box’ technique. With the combination of neural network and In this work and based on our previous related works
adaptive systems [4-5], the control techniques of most [15-16], we will use the same fuzzy logic system of Mamdany
complex systems have been improved. Adaptive control has type to approximate a non-linear term appeared in the
found extensive applications for plants that are complex and adaptation law of the RBF controller parameters. The radial
ill-defined [9]. Mathematical models might not be available basis function (RBF) controller is used in a direct neural fuzzy
for many complex systems in practice, and the adaptive adaptive control structure for a class of single input single
control problem of these systems is far from being output (SISO) unknown and nonaffine nonlinear systems.
satisfactorily resolved. Most of the adaptive controllers More specifically, the RBF controller is used online to
approximate the unknown implicit feedback linearization



control law based on the implicit function theory [17]. The x ( n ) = f ( x) + g ( x).u , where the control input u enters linearly
centers of the RBF network controller are updated using the k- in the model.
means algorithm [18-20]. The weight adaptation is derived
from the minimization of the control error instead of the We note here that we will make the following assumption
tracking error and is based on a gradient descent algorithm. As concerning the system (1) and reference signal y m (t ) .
mentioned above, the unknown control error appearing in the
adaptation laws is approximated based on the same online Assumption 1. The function f u ( x, u ) = ∂f ( x, u ) / ∂u is nonzero
estimate provided by Mamdani fuzzy inference system used in and bounded for all ( x, u ) ∈ Ω x xR . This implies that
[15,16]. The contribution of this work is that this new
f u ( x, u ) is strictly either positive or negative for
structure of control concerns the nonaffine on the control input
systems (where the control input variable enters in the systems all ( x, u ) ∈ Ω x xR . Without loss of generality, it is assumed
nonlinearly) and based on the implicit function theorem [17], that it exists a constant c such that
f u ( x, u ) ≥ c > 0 for all
whereas in our previous works [15-16], the control structure
( x, u ) ∈ Ω x xR . Define the tracking error vector as
concerns the affine on the control input (where the control
input variable enters in the systems linearly) systems without
use of the implicit function theorem. e = (e, e,..., e ( n −1) )T ∈ R n (2)

The paper is organized as follows. Section II describes the Where e = ym − y (3)


problem formulation containing the class of SISO nonaffine
nonlinear systems under study. In Section III, structure and
Then, from (1), we get
properties of the RBF system are presented with the adaptive
law. In section IV, the proposed method is used to control a ( n)
nonaffine (in which the control input appears nonlinearly) e (n) = ym − f ( x, u ) (4)
nonlinear system ensuring the convergence of the tracking
error to the neighborhood of the origin. Finally, Section V Which can be written in matrix form as
contains the conclusion.
(n) T
e = Ac .e + B.[ y m + k .e − f ( x, u )] (5)

II. PROBLEM FORMULATION Where


Consider the nonaffine single input single output nonlinear  0 1 0 ... 0 0   0
system represented in the following form  0 0  ,  0
 0 1 ... 0
 
(6)
Ac =   B = . 
x1 = x2    
 0 0 0 ... 0 1   0
x 2 = x3 − k 0 − k1 − k2 ... − k n −2 − k n −1  1 
x n = f ( x, u ) (1)
y = x1 Let k = (k 0 , k1 ,..., k n −1 ) T ∈ R n be a positive constant vector
selected such that the matrix Ac is stable. Thus, for any given
Where x = [ x1 , x 2 ,...x n ]T ∈ R n is the state vector of the
positive definite symmetric matrix Q , there exists a unique
system in the normal form which is assumed available for
measurement, u ∈ R and y ∈ R are the control input and positive definite symmetric solution P to the following
Lyapunov algebraic equation:
output of the system respectively, f ( x, u ) is an unknown
nonlinear function. The control objective is to design an (7)
AcT .P + PAc = −Q
adaptive neural network controller for system (1) such that the
error e(t ) = ym (t ) − y (t ) tends to zero, where ym (t ) is the
Let a signal v defined as
reference signal. T
(8)
v = y m( n ) + k .e
Remark1. Our contribution in this work is that the form is a
nonaffine in the control input, i.e., x n = f ( x, u ) , where the Substituting (7) in (5), we obtain
control input u enters nonlinearly in the model (see please the
e = Ac .e − B.[ f ( x, u ) − v] (9)
model equations in the simulation part IV). Besides, the work
is based on the implicit function theorem [17]. Whereas, in our
previous work [15, 16] and any other not cited work, the From assumption 1 and the fact that the signal v , defined in
model studied is in an affine form in the control input, i.e., (8), does not explicitly depend upon the control input u , i.e.,

\partial v / \partial u = 0, the partial derivative of f(x, u) - v with respect to the input u satisfies

\partial ( f(x, u) - v ) / \partial u = \partial f(x, u) / \partial u > 0    (10)

Thus, based on the implicit function theorem [17], we know that the nonlinear algebraic equation f(x, u) - v = 0 is locally solvable for the input u for each (x, v). Thus, there exists some ideal controller u^{*}(x, v) satisfying the following equality for all (x, v) \in \Omega_x \times R:

f(x, u^{*}(x, v)) - v = 0    (11)

Therefore, if the control input u is chosen as the ideal control law, i.e., u = u^{*}, the closed-loop error dynamics (9) reduce to

\dot{e} = A_c e    (12)

Define the following positive definite Lyapunov function:

V = (1/2) e^T P e    (13)

Differentiating V with respect to time, and using (12) and (7), we obtain:

\dot{V} = -(1/2) e^T Q e    (14)

We conclude that \dot{V} is a negative semi-definite function and that the tracking error e(t) and its derivatives e^{(i)}(t), i = 1, ..., n-1, go to zero as t goes to infinity.

However, the implicit function theorem only guarantees the existence of the ideal controller u^{*}(x, v) for system (1); it does not prescribe a technique for constructing it, even if the dynamics of the system are well known. In the following, a neural network of RBF type will be used to construct this unknown ideal implicit controller.

III. THE NEURAL NETWORK ADAPTIVE CONTROLLER

The RBF network (as described in [15, 16]) can be considered as a two-layer network with only one hidden layer. The output depends linearly on the weights. More explicitly, the output of an RBF neural network system can be put in the following form:

u_c(x, \theta) = \theta^T \xi(x) = \sum_{i=1}^{n_r} \xi_i \theta_i,   with \xi_i = \psi(|| x - c_i ||)    (15)

x is the input vector, \psi is a nonlinear function called a radial basis function, \theta are the connection weights (parameters) to be adapted between the hidden layer and the output layer, c_i are the centres of the basis functions and n_r is the number of basis functions. The most used basis function is the Gaussian function. \theta^T = [\theta_1, \theta_2, ..., \theta_{n_r}] contains all the adjustable parameters and \xi(x) is a vector of radial basis functions. It has been proven that (15) can approximate, over a compact set \Omega_Z, any smooth function up to a given degree of accuracy [21].

Let u^{*} be the ideal implicit unknown controller that makes the tracking error e = y_m - y as small as possible. The parameter update will be designed so as to minimize the error e_u between u^{*} and the output u_c(x, \theta) = \theta^T \xi(x) of the actual RBF neural controller, with

e_u = u^{*} - u_c(x, \theta)    (16)

This leads to the cost function:

J = min (1/2) ( u^{*} - u_c(x, \theta) )^2    (17)

Based on the gradient descent law, the connection weights of the RBF network controller are adjusted under the following law:

\dot{\theta} = -\gamma \, \partial J / \partial \theta    (18)

with \gamma > 0 the learning rate and

\partial J / \partial \theta = -e_u \, \partial u_c / \partial \theta    (19)

Using (15) and (19), (18) can be written as:

\dot{\theta} = \gamma e_u \, \partial u_c / \partial \theta = \gamma e_u \xi(z)    (20)

As e_u is unknown, we estimate it by a fuzzy system of Mamdani type with output \hat{e}_u, based on the work done in [22]; we then obtain the new law:

\dot{\theta} = \gamma \hat{e}_u \xi(z)    (21)

We first note that the update law (21) does not guarantee the boundedness of the weights. In order to ensure the boundedness of the weights, we use the so-called e-modification [23]:

\dot{\theta} = \gamma' \hat{e}_u \xi(z) - \gamma' | \hat{e}_u | v_0 \theta    (22)

where v_0 > 0 is a design constant.

Remark 2. In summary, our adaptive controller, shown in Fig. 1, consists of three blocks: an RBF controller, a Mamdani fuzzy estimator of the control error, and the adaptation mechanism.
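For illustration only, the controller output (15) and the e-modification update (22) can be written in a few lines of Python. This is a sketch under our own assumptions (fixed centres, a common Gaussian width, and an externally supplied estimate of the control error); it is not the authors' implementation, and the Mamdani fuzzy estimator of [22] is not reproduced here.

```python
import numpy as np

# Sketch of the RBF controller of Section III (not the authors' code).
class RBFController:
    def __init__(self, centers, sigma=1.8, gamma=1.2, nu0=0.005):
        self.c = np.asarray(centers, dtype=float)   # one centre c_i per row
        self.sigma = sigma                          # common Gaussian width
        self.gamma = gamma                          # learning rate gamma'
        self.nu0 = nu0                              # e-modification constant v_0
        self.theta = np.zeros(len(self.c))          # connection weights, initialised to 0

    def basis(self, x):
        # Gaussian radial basis functions: psi(r) = exp(-r^2 / (2 sigma^2)), r = ||x - c_i||
        r = np.linalg.norm(self.c - np.asarray(x, dtype=float), axis=1)
        return np.exp(-r**2 / (2.0 * self.sigma**2))

    def control(self, x):
        # Controller output, eq. (15): u_c(x, theta) = theta^T xi(x)
        return float(self.theta @ self.basis(x))

    def update(self, x, e_u_hat, dt):
        # e-modification update, eq. (22):
        # theta_dot = gamma' * e_u_hat * xi(x) - gamma' * |e_u_hat| * v_0 * theta
        xi = self.basis(x)
        theta_dot = self.gamma * e_u_hat * xi - self.gamma * abs(e_u_hat) * self.nu0 * self.theta
        self.theta += dt * theta_dot
```

In the spirit of the paper, the centres would be initialised uniformly in [-2, 2] and re-adjusted online by k-means; that step is omitted here.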


IV. SIMULATION RESULTS

To test the effectiveness of the proposed RBF fuzzy adaptive controller, we propose to simulate a SISO nonaffine
nonlinear system [12], [14] which is described by the following differential equations:

\dot{x}_1 = x_2
\dot{x}_2 = x_1^2 + 0.15 u^3 + 0.1 (1 + x_2^2) u + \sin(0.1 u) + d(t)    (23)
y = x_1

where d(t) = 0.5 \sin(10 t) is an external disturbance included in order to test the robustness of the RBF fuzzy adaptive controller against external disturbances. The control objective is to force the system output y(t) = x_1(t) to track the desired trajectory y_m(t) = \sin(t) + \cos(0.5 t). The unknown ideal implicit feedback linearization controller is approximated by an RBF system in the form of (15). Clearly, the derivatives of the reference y_m exist and are bounded. The parameters are chosen as \gamma = 1.2, v_0 = 0.005, step size dt = 0.01, and k = [k_0, k_1]^T = [5, 5]^T in order to have all roots of s^2 + k_1 s + k_0 = 0 in the open left-half plane, and Q = diag(1.1, 1.1) > 0. Then:

P = [ 1.10  0.10 ; 0.10  0.120 ]    (24)

The RBF controller has five radial basis functions. The parameters \theta are initialised to 0. The centres of the basis functions in the RBF network are uniformly distributed in the interval [-2, 2] and are adjusted using the k-means algorithm [18-20]. The RBF network has two inputs x = [z_1, z_2]^T = [x_1, (k^T e + y_m^{(2)})], with e = [y_m - y, \dot{y}_m - \dot{y}]^T. The basis functions used are Gaussian functions of the following form:

\psi(r) = \exp( -r^2 / (2 \sigma^2) )    (25)

with r = || x - c_i || and a width \sigma = 1.8. The initial conditions (x_1(0), x_2(0))^T = (0.6, 0.5)^T are used in the simulation. The simulation results for the first state variable y = x_1 are shown in Fig. 2, the second state variable x_2 is shown in Fig. 3, and the control input signal is shown in Fig. 4. The tracking error signal is shown in Fig. 5. From these figures, we can conclude that the control objective is reached, i.e., the system output y(t) = x_1(t) tracks the desired trajectory very well, eliminating the added perturbation d(t) in the system equation (23). In other words, we can see that the tracking error is bounded and converges rapidly to a value close to zero (in spite of the perturbation d(t)), confirming the smoothing property of the overall RBF fuzzy logic system.

Fig. 1. Structure of the RBF adaptive fuzzy controller (blocks: fuzzy updating law, RBF network controller with output u_c(x, \theta), and plant; inputs: reference and system states)

Fig. 2. Output y = x_1 (. . . . .) and the desired reference trajectory y_m (__)

Fig. 3. Velocity x_2 (. . .) and the desired reference trajectory y_m (__)
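The closed loop of Section IV can be sketched with a simple Euler integration of (23), reusing the RBFController sketch of Section III. It is only a sketch: estimate_control_error is a hypothetical stand-in for the Mamdani fuzzy estimator of [22], not the rule base actually used in the paper.

```python
import numpy as np

def plant(x, u, t):
    # Nonaffine plant of eq. (23) with disturbance d(t) = 0.5 sin(10 t)
    d = 0.5 * np.sin(10.0 * t)
    x1, x2 = x
    return np.array([x2, x1**2 + 0.15 * u**3 + 0.1 * (1 + x2**2) * u + np.sin(0.1 * u) + d])

def estimate_control_error(e, de):
    # Placeholder for the Mamdani fuzzy estimate of e_u (assumption, not the paper's rules)
    return 0.5 * e + 0.1 * de

dt, T = 0.01, 20.0
k = np.array([5.0, 5.0])                                   # k = [k0, k1]^T
centers = np.linspace(-2.0, 2.0, 5).reshape(-1, 1).repeat(2, axis=1)
ctrl = RBFController(centers)                              # sigma = 1.8, gamma = 1.2, v0 = 0.005
x = np.array([0.6, 0.5])                                   # initial conditions

for step in range(int(T / dt)):
    t = step * dt
    ym, dym = np.sin(t) + np.cos(0.5 * t), np.cos(t) - 0.5 * np.sin(0.5 * t)
    ddym = -np.sin(t) - 0.25 * np.cos(0.5 * t)
    e = np.array([ym - x[0], dym - x[1]])                  # tracking error vector, eq. (2)
    v = ddym + k @ e                                       # signal v, eq. (8)
    z = np.array([x[0], v])                                # controller inputs [z1, z2]
    u = ctrl.control(z)
    ctrl.update(z, estimate_control_error(e[0], e[1]), dt)
    x = x + dt * plant(x, u, t)                            # Euler step, dt = 0.01
```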
Fig. 4. The control signal u

Fig. 5. Tracking error e

When comparing our results briefly to the works done in [12] and [14], we can observe that the evolution of the states x_1 and x_2 in our work is the same as in [12], and both are better than the results obtained in [14]. The evolution of the tracking error around zero, as shown in Fig. 5, confirms the obtained results.

V. CONCLUSION

In this work, a direct neural adaptive control for a class of SISO unknown and nonaffine nonlinear systems has been introduced. A radial basis function (RBF) system is used online to approximate the unknown implicit feedback linearization control law based on the implicit function theory. The weight adaptation of the RBF network is derived from the minimization of the control error instead of the tracking error, based on the gradient descent algorithm. The algorithm is applied in simulation to control a nonaffine nonlinear system. The proposed adaptive RBF controller ensures the convergence of the tracking error to a neighborhood of the origin based on the Lyapunov theory and guarantees that all signals are bounded.

REFERENCES

[1] Xiongbo Wan, Zidong Wang, Min Wu, Xiaohui Liu, "H-infinity State Estimation for Discrete-Time Nonlinear Singularly Perturbed Complex Networks Under the Round-Robin Protocol," vol. 30, no. 2, pp. 415–426, 2019.
[2] Bingrong Xu, Qingshan Liu, Tingwen Huang, "A Discrete-Time Projection Neural Network for Sparse Signal Reconstruction With Application to Face Recognition," vol. 30, no. 1, pp. 151–162, 2019.
[3] Massimiliano Luzi, Maurizio Paschero, Antonello Rizzi, Enrico Maiorino, Fabio Massimo Frattale Mascioli, "A Novel Neural Networks Ensemble Approach for Modeling Electrochemical Cells," vol. 30, no. 2, pp. 343–354, 2019.
[4] Xiucai Huang, Yongduan Song, Junfeng Lai, "Neuro-Adaptive Control With Given Performance Specifications for Strict Feedback Systems Under Full-State Constraints," vol. 30, no. 1, pp. 25–34, 2019.
[5] Yan-Jun Liu, Shu Li, Shaocheng Tong, C. L. Philip Chen, "Adaptive Reinforcement Learning Control Based on Neural Approximation for Nonlinear Discrete-Time Systems With Unknown Nonaffine Dead-Zone Input," vol. 30, no. 1, pp. 295–305, 2019.
[6] Brian Roffel and Ben Betlem, Process Dynamics and Control: Modeling for Control and Prediction. John Wiley and Sons, 2006.
[7] S. L. Dai, C. Wang, M. Wang, "Dynamic Learning From Adaptive Neural Network Control of a Class of Nonaffine Nonlinear Systems," IEEE Transactions on Neural Networks and Learning Systems, vol. 25, no. 1, pp. 111–123, 2014.
[8] J. L. Tao, Y. Yang, D. H. Wang, C. Guo, "A robust adaptive neural networks controller for maritime dynamic positioning system," Neurocomputing, vol. 110, no. 1, pp. 128–136, 2013.
[9] K. J. Åström and B. Wittenmark, Adaptive Control, 2nd ed. Addison-Wesley, 1995.
[10] G. Feng, "A Survey on Analysis and Design of Model-Based Fuzzy Control Systems," IEEE Trans. Fuzzy Syst., vol. 14, no. 5, pp. 676–697, 2006.
[11] R. E. Precup, H. Hellendoorn, "A survey on industrial applications of fuzzy control," Computers in Industry, vol. 62, pp. 213–226, 2011.
[12] S. Labiod and T. M. Guerra, "Adaptive fuzzy control of a class of SISO nonaffine nonlinear systems," Fuzzy Sets and Systems, vol. 158, no. 10, pp. 1126–1137, 2007.

[13] M. Chen and S. S. Ge, “Direct adaptive neural control for a class of
uncertain nonaffine nonlinear systems based on disturbance observer,”
IEEE Transactions on Cybernetics, vol. 43, no. 4, pp. 1213–1225, 2013.
[14] Chaojiao Sun, Bo Jing, and Zongcheng Liu, “Adaptive Neural Control
of Nonaffine Nonlinear Systems without Differential Condition for
Nonaffine Function,” Hindawi Publishing Corporation, Mathematical
Problems in Engineering., vol. 16, 1, pp. 1-11, 2016.
[15] M. Bahita and K. Belarbi, “On-line Neural Network, Adaptive Control
of a Class of Nonlinear Systems Using Fuzzy Inference Reasoning,”
Rev. Roum. Sci. Techn. – Electrotechn. et Energ., vol. 54, no. 1, pp. 401–410, Bucharest, 2015.
[16] M. Bahita and K. Belarbi, “Radial Basis Function Controller of a Class
of Nonlinear Systems Using Mamdani Type as a Fuzzy Estimator,”
Procedia Engineering, vol. 41, pp. 501 – 509, 2012.
[17] KC Border, “Notes on the Implicit Function Theorem, “ Caltech:
Division of the Humanities and Social Sciences, pp. 1 - 21, 2018.
[18] C. Darken and J. Moody, "Fast adaptive k-means clustering: Some empirical results," International Joint Conference on Neural Networks, vol. 2, pp. 233–238, 1990.
[19] M. Bahita and K. Belarbi, “Neural Stable Adaptive Control for a Class
of Nonlinear System Without Use of a Supervisory Term in The
Control Law,” Journal of Engineering Science and Technology, Vol. 7,
No. 1, pp. 97 – 118, February 2012.
[20] M. Bahita and K. Belarbi, Fuzzy and Neural Adaptive Control of a Class
of Nonlinear Systems. ISBN: 978-3-8484-8920-6, LAP LAMBERT
Academic Publishing GmbH & Co. KG Heinrich-Böcking-Str. 6-8,
66121, Saarbrücken, Germany, 2012.
[21] T. P Chen and H. Chen, “Approximation capability to functions of
several variables, nonlinear functionals, and operators by radial basis
function neural networks,” IEEE. Trans. Neural Networks, vol. 6, no. 4,
pp. 904-910, 1995.
[22] M. Bahita and K. Belarbi, “ Real-time application of a fuzzy adaptive
control to one level in a three tank system,” Journal of systems and
control engineering, vol. 232 no. 7, pp. 845-856, 2018.
[23] P. A. Ioannou, J. Sun, “ Robust Adaptive Control,” Prentice-Hall, 1996.

Would it be Profitable Enough to Re-adapt
Algorithmic Thinking for Parallelism Paradigm
1st Aimad Eddine Debbi 2nd Abdelhak Farhat Hamida 3rd Haddi Bakhti
dept. of informatics. dept. of electronics dept. electronics
Mohamed Boudhiaf university Ferhat Abbas university Mohamed Boudhiaf university
M’sila, Algeria Setif, Algeria M’sila, Algeria
aimad-eddine.debbi@univ-msila.dz a ferhat h@yahoo.fr ahmed3791@gmail.com

Abstract—Much of the progress in computing system components today is devoted to granting more support for parallelism. This is affording many opportunities for High Performance Computing (HPC) application developers, who are now able to accelerate run-times progressively. Adapting algorithmic writing to the parallelism paradigm is likely to lead to additional improvement in run-times. This paper deals with this matter. We carry out empirical measures to assess how worthwhile it is to re-adapt algorithmic thinking for the parallel processing context. We provide thorough comparisons of achievable accelerations among a number of different kinds of sorting algorithms. We use a proprietary framework previously meant to serve as a front-end kernel in an automatic parallelization compiler, and we supplement it with interpolation to make performance predictions for large-scale parallelization. Sequential, semi-parallel and parallel algorithms for the sorting problem are all involved in the empirical tests, considering different distributions for the randomized input records. The results allow us to estimate how much the innovation of specific parallel algorithms could be more profitable than the parallelization of serial programs.

Index Terms—parallelism paradigm; workload characterization; profiling; inherent parallelism assessment; static analysis

I. INTRODUCTION

Almost all of today's computing platforms are parallel systems. They may embed at once several many/multi-core CPUs, a number of GPU clusters and even many FPGA chips. This is affording good opportunities for High Performance Computing (HPC) application developers, but exploiting those innovative architectures and accelerating run-times by parallelization is still a challenging task. Automatic parallelization frameworks [1]–[3] sometimes fail to carry out a total parallelization of sequential programs. They are usually not able to process appropriately some critical code parts containing complicated interdependencies, which makes execution crash. Semi-automatic parallelization tools like CUDA, OpenMP and OpenACC need large involvement of end-developers. End users using semi-automatic parallelization tools have to specify explicitly the parallel parts in programs. Their mission is still hard when they deal with sequential programs, except if the algorithms to parallelize contain explicit parallel parts.

In many cases, several algorithms may come to be a valid solution for a single problem. However, they are generally expected to have some characteristics that make some of them more appropriate for parallelization than others. The intrinsic parallel potential of an algorithm and its ability to be effortlessly parallelized are the two most important properties that favour its adoption for parallelization. This paper deals with the impact of those two features on parallelization issues. The present investigation aims to bring some clarification about which would be more advantageous: parallelizing serial programs, or seeking to produce parallel ones that are not difficult to implement on parallel systems. Such an investigation may be extended to predict compliances between the triplet (algorithms - target architectures - parallel programming paradigms).

In the same vein, we suggest a new metric to allow fairer comparisons among parallelization offers. Instead of using the absolute speedup for comparisons, we propose to consider a relative speedup given as the ratio of the speedup with respect to the maximum achievable speedup. This maximum achievable speedup is determinable by an earlier profiler we have previously proposed in [4].

We provide in this paper a large set of empirical tests to assess achievable speedups in a number of sorting algorithms. We have chosen to consider three classes of algorithms: sequential algorithms, parallel sorting algorithms and semi-parallel algorithms for the sorting problem. Tests are extended using regression to get results for large-scale problems. That set of instrumentations allows us to appreciate how profitable parallel algorithms are and how much effort is required for their implementation.

II. RELATED WORKS

The present work bears similarities with many research efforts [5]–[9] suggested for workload characterization. It is even closer to the ones dealing with the parallelization concern problem [8], [10]–[13]. We share with them a common objective, in the sense that we plan to attenuate the difficulties stemming from parallelization issues. Speedup estimation is a central challenge addressed in almost all of those works. In particular, some of the works stem from efforts done for automatic parallelization.

Peruse [6] is an LLVM-based profiling tool designed to characterize loop features and help developers recognize the
amenability of loops for acceleration. The authors of Peruse stated that their tool builds upon and complements a long line of research related to workload characterization and parallelization. They indicated as well that their tool serves as a loop filter in some scenarios, especially when development involves the use of the Aladdin tool [14] for ASICs. The authors of Peruse also suggest the use of a machine-learning model to predict the potential speedup of loops when off-loaded to a fixed-function hardware accelerator. The loop features considered in the characterization include, among others: annotated parallel, atomics, big operations, vectorizability and idempotence. Besides, the authors provide options to perform feature filtering using SQL-like queries.

Pintool [15] is another tool suggested to help evaluate the potential benefit of accelerating particular regions. It does not focus only on loop regions, but concentrates on the communication aspects to characterize bottlenecks and understand the dynamic memory access behavior. The authors seek to determine whether an accelerator is likely computation- or communication-bound.

The authors of [5] and [11] provided the Kismet profiler. They looked to determine how much benefit might come from refactoring serial programs for parallel execution using a hierarchical critical path analysis (HCPA) technique. HCPA is an extension of the critical path analysis (CPA) proposed earlier in [8] for the Comet tool. Kismet adopts a usage model following the gprof [12] style. It performs dynamic analyses of loops and functions to determine the amount of available parallelism. It incorporates system constraints to assess upper bounds of speedups. The constraints include the number of cores, cache effects and synchronization overheads.

The authors of [16] dealt with performance prediction and characterization by employing the concepts of resemblance and similarity. In their proposed approach, the authors measure a set of microarchitecture-independent characteristics of the target application and then relate them to the same characteristics of a standardized benchmark suite already known by a sequence of performance scores on the systems of interest. The authors considered six categories of microarchitecture-independent characteristics to form a collection of 47 characteristics. The six categories relate to instruction mix, ILP, register traffic, working set size, data stream strides and branch predictability. In other words, the authors have performed an indirect characterization via standardized benchmarks acting as proxies.

In the present contribution, we plan to understand whether the consideration of new algorithm formulations (e.g., the innovation of specific parallel algorithms) could lead to gaining interesting proportions of performance, and whether this approach would be more effortless and attractive than considering the parallelization of serial programs. So we do not provide further descriptions of the profiler [4] used for the present analysis; the additional details concerning that tool are provided in [4].

III. A BRIEF SUMMARY ABOUT THE PROFILER

The tool used for the upcoming empirical tests stems from an earlier effort carried out to produce an automatic parallelization compiler. No identifier was assigned to it in our previous paper [4], while here we have chosen to designate it as Ex DeProK, which stands for Explicit Dependence Profiling Kernel. Mainly, its strength is that it features the ability to discriminate explicitly all dependences in limited-size kernels (programs) by building a particular set of Data Flow Graphs (DFGs) we named The Map. It allows evaluating with exactitude the total inherent parallelism in code regions even when they encompass complex dependency schemes. The intrinsic sequential fraction is given simply as the ratio of the depth of the longest path in the Map to the total workload [4].

IV. EMPIRICAL TESTS

The results stemming from the upcoming tests are expected to bring additional guidance to assist the selection of the most profitable implementation among potentially available parallelization scenarios. We deal in this sequence of instrumentation with two variants of algorithms for the sorting problem. In the first set are gathered some simple and classical algorithms considered sequential. In the second variant we considered the quick sort and the merge sort, qualified as parallel algorithms. The sorting problem is treated here only as an arbitrary example of processing. Besides the fact that sorting is abundantly involved in a lot of applications and subjected to thorough analysis in research [17]–[20], there are some properties that have favored its selection in the present investigation. Several algorithms are now available for the sorting problem, among which some are parallel and others are sequential. The nested loops involved in sorting algorithms usually form complicated paths that carry complex dependency schemes. The inherent parallel potential may be qualified as slight at small scale, but it is unknown at large scales. This inherent parallel potential is input dependent. Our evaluations of the parallel potential are performed considering the two variants of algorithms and are presented subsequently in the following subsections A and B.

Fig. 1. Maximum achievable speedup in the bubble sort for several small sizes of records that follow a Poisson distribution (speedup vs. input size).

A. Inherent Parallelism Assessment in Serial Algorithms

In this first set we dealt with two simple classical algorithms. The bubble and insertion sorting algorithms are considered sequential algorithms.
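The quantity just described can be illustrated with a small computation: if the Map is viewed as a DAG of unit-cost operations with dependence edges, the sequential fraction is the depth of the longest path divided by the total workload, and the maximum achievable speedup is its inverse. The sketch below is our own illustration of that ratio, not the Ex DeProK code.

```python
from collections import deque

def max_achievable_speedup(num_ops, edges):
    """num_ops: number of unit-cost operations (nodes 0..num_ops-1).
    edges: iterable of (src, dst) dependence pairs; src must finish before dst."""
    succ = [[] for _ in range(num_ops)]
    indeg = [0] * num_ops
    for a, b in edges:
        succ[a].append(b)
        indeg[b] += 1

    depth = [1] * num_ops                      # longest chain ending at each node
    queue = deque(i for i in range(num_ops) if indeg[i] == 0)
    while queue:                               # topological sweep over the DAG
        n = queue.popleft()
        for m in succ[n]:
            depth[m] = max(depth[m], depth[n] + 1)
            indeg[m] -= 1
            if indeg[m] == 0:
                queue.append(m)

    span = max(depth)                          # depth of the longest path in the Map
    sequential_fraction = span / num_ops
    return num_ops / span, sequential_fraction

# Example: 6 operations where op 5 needs ops 3 and 4, which need ops 0..2.
print(max_achievable_speedup(6, [(0, 3), (1, 3), (2, 4), (3, 5), (4, 5)]))
```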
1) Bubble Sort: We considered different distributions, including Poisson, geometric and uniform distributions, for the randomized inputs. Tests were then carried out considering different input sizes. The total inherent parallelism (or maximum achievable speedup) for the small range of input sizes is evaluated using our profiler Ex DeProK. The record sizes considered for this small range are 16, 32, 64, 96, 110, 128, 160, 224 and 256. The results of our measures for this small range are shown in Fig. 1. Inherent parallelism for the large record sizes is obtained by regression using a quadratic interpolation, as sketched below. Fig. 2 indicates the maximum achievable speedup for large sizes of records. Inherent parallelism in the bubble sort algorithm when the inputs are randomized records following a geometric distribution is illustrated in Fig. 3 and Fig. 4. Achievable speedups are shown as well in Fig. 5 and Fig. 6 when the inputs are randomized records following a uniform distribution.
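The extrapolation to large sizes mentioned in the paragraph above amounts to a quadratic fit on the small-size measurements. The sketch below illustrates this with placeholder values; they are not the measured speedups of the paper.

```python
import numpy as np

# Quadratic interpolation of small-size speedup measurements, evaluated at large sizes.
small_sizes = np.array([16, 32, 64, 96, 110, 128, 160, 224, 256])
measured_speedup = np.array([8.0, 15.0, 28.0, 40.0, 45.0, 52.0, 63.0, 86.0, 97.0])  # placeholder data

coeffs = np.polyfit(small_sizes, measured_speedup, deg=2)   # fit a degree-2 polynomial
model = np.poly1d(coeffs)

large_sizes = np.array([1000, 2500, 5000])
print({int(n): float(model(n)) for n in large_sizes})       # predicted maximum achievable speedups
```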

Fig. 2. Maximum achievable speedup in the bubble sort algorithm for large sizes of records that follow a Poisson distribution (marked points: speedup 507.5 at size 1078, 1766 at 2769, 3425 at 4321).

Fig. 3. Maximum achievable speedup in the bubble sort for several small sizes of records that follow a geometric distribution.

Fig. 4. Maximum achievable speedup in the bubble sort algorithm for large sizes of records that follow a geometric distribution (marked points: 573.4 at 1143, 1963 at 2763, 3928 at 4311).

Fig. 5. Maximum achievable speedup in the bubble sort algorithm for several small sizes of records that follow a uniform distribution.

Fig. 6. Maximum achievable speedup in the bubble sort algorithm for large sizes of records that follow a uniform distribution (marked points: 621.7 at 1299, 1714 at 2707, 3221 at 4097).

2) Sort by Insertion: Likewise, the inherent parallelism is evaluated in the insertion sort algorithm considering different distributions for the randomized inputs. The results of our tests for the sort by insertion algorithm are given in Fig. 7 and Fig. 8.

Fig. 7. Maximum achievable speedup in the sort by insertion algorithm considering several record sizes that follow a geometric distribution.

B. Inherent Parallelism Assessment in Parallel Algorithms

We deal in this section with the quick sort and the merge sort algorithms. They are considered parallel because we proceed with sorting using the logic of partitioning. We perform the sort incrementally on the several partitions forming the target domain. We deal with the simple form of the algorithms where we consider only a dichotomous partitioning. The quick sort is implemented using a recursive call of a function involving two arguments named "bottom" and "top". Listing 1, given below, gives a slight indication of the implementation we have chosen to make.
Listing 1: Pseudo-code for the quick sort algorithm

Quick(bottom, top) {
    pivot;
    pivot = Make_dichotomous_parts&getpivot();
    Quick(bottom, pivot);
    Quick(pivot + 1, top);
}

The "Quick()" function call can be mapped to concurrent threads. However, the function Make_dichotomous_parts&getpivot() is a sequential fragment. Any algorithm is likely to contain a portion of code that must remain sequential. Each thread recursively spawns two threads that handle the function calls Quick(bottom, pivot) and Quick(pivot+1, top). The results of the tests are given in Fig. 9 and Fig. 10. The merge sort can be implemented in a recursive or a non-recursive form. In Listing 2 we give a variant of the pseudo-code for the recursive form of the merge sort. Once again, only a dichotomous partitioning is applied, and every thread will spawn two threads for handling the recursive calls. In addition, this pseudo-code is given here to indicate that the sequential portions will be located in the function Make_merge().

Listing 2: Pseudo-code for the merge sort algorithm in a recursive form

Merge_sort(bottom, top) {
    var end_of_first_half, start_of_last_half;
    if ((bottom + 1) != top) {
        end_of_first_half = (bottom + top) / 2;
        start_of_last_half = end_of_first_half + 1;
        Merge_sort(bottom, end_of_first_half);
        Merge_sort(start_of_last_half, top);
        Make_merge(bottom, end_of_first_half, start_of_last_half, top);
    }
    else
        Two_elements_sort(bottom, top);
}
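For concreteness, the dichotomous mapping described around Listings 1 and 2 can be turned into runnable Python, in which the two recursive calls are handed to threads and only the merge step runs serially. The helper names are ours and, because of CPython's GIL, the snippet only illustrates the structure of the mapping rather than a real speedup.

```python
import threading

def make_merge(a, bottom, mid, top):
    # Sequential merge of a[bottom:mid+1] and a[mid+1:top+1] (the serial fraction).
    merged, i, j = [], bottom, mid + 1
    while i <= mid and j <= top:
        if a[i] <= a[j]:
            merged.append(a[i]); i += 1
        else:
            merged.append(a[j]); j += 1
    merged.extend(a[i:mid + 1])
    merged.extend(a[j:top + 1])
    a[bottom:top + 1] = merged

def merge_sort(a, bottom, top, depth=3):
    if bottom >= top:
        return
    mid = (bottom + top) // 2
    if depth > 0:
        # Dichotomous partitioning: the two recursive calls are mapped to threads.
        t1 = threading.Thread(target=merge_sort, args=(a, bottom, mid, depth - 1))
        t2 = threading.Thread(target=merge_sort, args=(a, mid + 1, top, depth - 1))
        t1.start(); t2.start()
        t1.join(); t2.join()
    else:
        merge_sort(a, bottom, mid, 0)
        merge_sort(a, mid + 1, top, 0)
    make_merge(a, bottom, mid, top)

data = [5, 2, 9, 1, 7, 3, 8, 6, 4, 0]
merge_sort(data, 0, len(data) - 1)
print(data)   # [0, 1, 2, ..., 9]
```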
Fig. 8. Maximum achievable speedup in the sort by insertion algorithm considering large sizes of records that follow a geometric distribution (marked points: 814.1 at 1064, 2117 at 2690, 3444 at 4276).

Fig. 9. Maximum achievable speedup in the quick sort algorithm considering several record sizes that follow a geometric distribution.

Fig. 10. Maximum achievable speedup in the quick sort algorithm considering large sizes of records that follow a geometric distribution (marked points: 19.66 at 1039, 25.51 at 2502, 31.96 at 4116).

V. INTERPRETATIONS

The results of these empirical tests allowed the following observations. The intrinsic parallel potential increases proportionally as the size of the domain rises, following linear or hyperbolic asymptotes obtained by a linear or quadratic interpolation. Although bubble sort and insertion sort are considered sequential, at large scales they contain a considerable amount of intrinsic parallelism. However, extracting a large amount of parallelism from such algorithms remains a serious challenge. In the quick sort, the results have been poor; the intrinsic parallel potential remains slight even when sorting is done at large scales. Our explanation for that outcome is probably that the function Make_dichotomous_parts&getpivot() is potentially sequential and is not appropriate enough for parallelism.

In contrast, the parallel potential in the merge sort is good. Quick sort and merge sort, as parallel algorithms, can be scheduled for parallel implementation as discussed previously by mapping the recursive functions to threads. So, it is expected to get a profitable result with the merge sort, while scheduling the quick sort in the suggested form for parallel implementation does not seem to be beneficial. If we accept that the whole of the sequential fraction is kept only inside the function Make_merge(), we may state that we will be able to achieve, using threads, a speedup close to S_max / log2(N), where
S_max is the maximum achievable speedup indicated in the curves of Fig. 11 and Fig. 12 for the size N; i.e., we can achieve ×5.3 of speedup when the domain size is 64 records, and we achieve ×39.2 of speedup when the domain size is 1024 records.

Fig. 11. Maximum achievable speedup in the merge sort algorithm considering several record sizes that follow a geometric distribution.

Fig. 12. Maximum achievable speedup in the merge sort algorithm considering large sizes of records that follow a geometric distribution (marked points: 391.9 at 1013, 1461 at 2606, 2976 at 4104).

VI. CONCLUSION

Many algorithms, even though they appear to be sequential in nature, contain at large scale a considerable amount of inherent parallelism. At large scale we generally have good opportunities to get favorable accelerations by scheduling them for parallel implementation. In many cases, however, they may be written in forms that are not easy to implement in a parallel way, i.e., when the algorithms contain deeply nested loops with complex dependencies. These nested loops are hard to handle by automatic parallelizers realizing loop transformations, and they cannot be mapped easily to parallel threads. Applying the concept of "divide to rake" is suitable for making parallelizations, and parallel algorithms admit the application of such a concept. In the quick sort and merge sort algorithms we applied a dichotomous partitioning. Our profiling has shown that we have little to no chance of obtaining good accelerations with quick sort in the form indicated earlier. The merge sort contains a considerable amount of inherent parallelism, and since it is possible to map its functions to threads, it appears that it will be the most profitable parallelization scenario. Quick sort should not be considered absolutely poor for parallelization: firstly, the enhancement of the function "Make_dichotomous_parts&getpivot()" and, secondly, increasing the degree of partitioning may both considerably improve the inherent parallel potential of the quick sort. The matter of improving the forms of parallel algorithms may be the subject of a separate future investigation.

REFERENCES

[1] H. Bae, D. Mustafa, J. W. Lee, Aurangzeb, H. Lin, C. Dave, R. Eigenmann, and S. P. Midkiff, "The cetus source-to-source compiler infrastructure: Overview and evaluation," Int J Parallel Prog, vol. 41, pp. 753–767, December 2013.
[2] S. Campanoni, T. M. Jones, G. Holloway, G. Y. Wei, and D. Brooks, "Helix: making the extraction of thread-level parallelism mainstream," IEEE Micro, vol. 32, pp. 08–18, 2012.
[3] C. Dave, H. Bae, S. Min, S. Lee, R. Eigenmann, and S. Midkiff, "Cetus: A source-to-source compiler infrastructure for multicores," Computer, vol. 42, no. 12, pp. 36–42, December 2009.
[4] A. E. Debbi and H. Bakhti, "Incremental banerjee test conditions committing for robust parallelization framework," Turk J Elec Eng Comp Sci, vol. 26, pp. 2595–2604, May 2018.
[5] D. Jeon, S. Garcia, C. Louie, and M. B. Taylor, "Kismet: Parallel speedup estimates for serial programs," in Proceedings of the 2011 ACM International Conference on Object Oriented Programming Systems Languages and Applications, ser. OOPSLA '11. ACM, 2011, pp. 519–536.
[6] S. Kumar, V. Srinivasan, A. Sharifian, N. Sumner, and A. Shriraman, "Peruse and profit: Estimating the accelerability of loops," in Proceedings of the 2016 International Conference on Supercomputing, ser. ICS '16. ACM, 2016, pp. 21:1–21:13.
[7] V. H. F. Oliveira, A. F. A. Furtunato, L. F. Silveira, K. Georgiou, K. Eder, and S. Xavier-de Souza, "Application speedup characterization: Modeling parallelization overhead and variations of problem size and number of cores," in Companion of the 2018 ACM/SPEC International Conference on Performance Engineering, ser. ICPE '18. ACM, 2018, pp. 43–44.
[8] M. Kumar, "Measuring parallelism in computation-intensive scientific/engineering applications," IEEE Transactions on Computers, vol. 37, no. 9, pp. 1088–1098, September 1988.
[9] A. Ketterlin and P. Clauss, "Profiling data-dependence to assist parallelization: Framework, scope, and optimization," in Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO-45. IEEE Computer Society, 2012, pp. 437–448.
[10] A. Elnashar and S. Aljahadli, "Experimental and theoretical speedup prediction of mpi-based applications," Computer Science and Information Systems, vol. 10, pp. 1247–1267, June 2013.
[11] D. Jeon, S. Garcia, C. Louie, S. Kota Venkata, and M. B. Taylor, "Kremlin: Like gprof, but for parallelization," in Proceedings of the 16th ACM Symposium on Principles and Practice of Parallel Programming, ser. PPoPP '11. ACM, 2011, pp. 293–294.
[12] S. L. Graham, P. B. Kessler, and M. K. Mckusick, "Gprof: A call graph execution profiler," in Proceedings of the 1982 SIGPLAN Symposium on Compiler Construction, ser. SIGPLAN '82. ACM, 1982, pp. 120–126.
[13] S. Garcia, D. Jeon, C. M. Louie, and M. B. Taylor, "Kremlin: Rethinking and rebooting gprof for the multicore age," in Proceedings of the 32nd ACM SIGPLAN Conference on Programming Language Design and Implementation, ser. PLDI '11. ACM, 2011, pp. 458–469.
[14] Y. S. Shao, B. Reagen, G. Wei, and D. Brooks, "Aladdin: A pre-rtl, power-performance accelerator simulator enabling large design space exploration of customized architectures," in 2014 ACM/IEEE 41st International Symposium on Computer Architecture (ISCA), 2014, pp. 97–108.
[15] M. A. Kim and S. Edwards, "Computation vs. memory systems: Pinning down accelerator bottlenecks," AMAS-BT - 3rd Workshop on Architectural and Microarchitectural Support for Binary Translation, pp. 86–98, June 2010.
[16] K. Hoste, A. Phansalkar, L. Eeckhout, A. Georges, L. K. John, and K. D. Bosschere, "Performance prediction based on inherent program similarity," Proceedings of the 15th International Conference on Parallel Architectures and Compilation Techniques, pp. 114–122, 2006.
[17] Y. Yang, P. Yu, and Y. Gan, "Experimental study on the five sort algorithms," in 2011 Second International Conference on Mechanic Automation and Control Engineering, July 2011, pp. 1314–1317.
[18] Z. Yildiz, M. Aydin, and G. Yilmaz, “Parallelization of bitonic sort
and radix sort algorithms on many core gpus,” in 2013 International
Conference on Electronics, Computer and Computation (ICECCO),
November 2013, pp. 326–329.
[19] M. H. Durad and M. N. A. and, “Performance analysis of parallel
sorting algorithms using mpi,” in 2014 12th International Conference
on Frontiers of Information Technology, December 2014, pp. 202–207.
[20] Z. Cheng, K. Qi, L. Jun, and H. Yi-Ran, “Thread-level parallel algorithm
for sorting integer sequence on multi-core computers,” in 2011 Fourth
International Symposium on Parallel Architectures, Algorithms and
Programming, December 2011, pp. 37–41.

Affordable and Portable Realtime Saudi License
Plate Recognition using SoC
Loay Alzubaidi Ghazanfar Latif Jaafar Alghazo
Department of Computer Science, Department of Computer Science, Department of Computer Engineering,
Prince Mohammad bin Fahd Prince Mohammad bin Fahd Prince Mohammad bin Fahd
University,Al Khobar, Saudi Arabia. University,Al Khobar, Saudi Arabia. University,Al Khobar, Saudi Arabia.
lalzubaidi@pmu.edu.sa glatif@pmu.edu.sa jghazo@pmu.edu.sa

Abstract— Stand-alone single board computers (SoC) have become so inexpensive and yet so powerful that they have paved the way for easily developing fully automated systems. SoC systems are equipped with sensors, cameras and various embedded systems that allow developing systems that interact with the surrounding environment. Therefore, the task of capturing images of license plates and using Optical Character Recognition (OCR) techniques to recognize the numerals and characters allows for developing an inexpensive License Plate (LP) recognition system. LP systems are important and can be used for various applications, from traffic control and toll payment to parking access. This paper proposes a Raspberry Pi based LP recognition for the Arabic/English characters and numerals on license plates used in Saudi Arabia. The proposed process utilizes the phases of preprocessing, segmentation, feature extraction and classification. At the end of the preprocessing phase, the characters and numerals are segmented. Pixel distribution and horizontal projection profiles are used in the feature extraction phase for the segmented image. A distance classifier and a k-nearest neighbors classifier are used in the classification phase. The proposed system achieved an accuracy of 90.6%. The advantage of such a system is the low cost and portability, making it affordable and easily deployable in any location.

Keywords— Single Board Computer; Raspberry Pi; Saudi License Plate; Real time Plate Number Plate Recognition; KNN

I. INTRODUCTION

These days, everything tends to be moving toward automation. People used to deal with everything manually; for example, people used to open gates manually, which means that users had to stop the vehicle and wait for someone to check their authorization before passing the gate. This process requires at least one man to stand by the gate and check the vehicle, open the gate manually, and then close it. The invention of remote controlled garage doors has had a great impact on making the lives of consumers easier; the security person opens and closes the gate with the press of a button. However, as technology improves, the lives of consumers become easier still. Thus, this system aims to have the gate open automatically without needing a person to spend his whole day standing there to press a button.

The system approaches the same idea in an easy and automated way by recognizing the vehicle's plate number; then, if it is authorized, the system will automatically open the gate by using a low cost embedded system. One of the biggest advantages of automation is ensuring the quality and consistency of the product without forgetting the important aspect of security. The system is going to automate the functionality of gate systems by using a unique sign for opening the gates. In other words, each individual vehicle has its own unique plate which goes through identification and security processes [1].

This research aims at the design and implementation of a plate number recognition system. Unlike a gate opener that uses a remote control in the hand of a human as a third party, the system takes a picture of a detected approaching vehicle, analyses the images and only opens the gate when a recognized vehicle plate is identified. The main objective of the research is to develop a real-time fully automated number plate recognition system that is based on the Raspberry Pi. This system will be built using the Raspberry Pi as the main component. The system will be able to detect the vehicle, recognize the plate, compare it with the database and control the gate.

II. BACKGROUND

With the start of the 20th century, the automobile industry boomed and the number of motorized vehicles increased rapidly. From 1890 to 1910 the world witnessed a transition from horses to automobiles. As the number increased, law enforcement officials started facing issues in maintaining vehicle records and tracing them. As a result, the first number plate was introduced by France in 1890, and Germany followed by introducing it in 1893. In the United States, Massachusetts was the first state to introduce number plates, in 1903, with proper vehicle registration and driver's license registration. The Netherlands became the first country to introduce a national license plate, in 1899, starting with license plate number 1, which reached 2001 in 1906, as they selected a different way to number the license plates [2]. Fig. 1 shows some of the initial number plates introduced by different countries.

In 1938, the first oil well was discovered in Saudi Arabia. However, because of World War II in 1939, the Saudi government delayed the development programs and research on the oil industry until 1946. From 1946 to 1950, the Kingdom of Saudi Arabia witnessed a revolution in the oil industry, which raised the country's economy, and in this period traffic in Saudi Arabia was on the rise, which led to the development of the license plate to register the necessary information regarding automobile owners. The first license plates in Saudi Arabia appeared in 1950-1962, where they differed from one region to another, as shown in Fig. 2. In 1972, license plates were established for the entire country with different types of use (private, bus, taxi and truck), as shown in Fig. 3. However, in 2007 the design was changed once again, because license plates were not enough for the demand and population increase, as shown in Fig. 4 [3]. The new version was different from previous ones; the 1996 series was considered to be most preferred by the majority of the public.



Fig. 1. Very first license plate designs in different countries

Fig. 2. First series of Saudi Arabia license plates

Fig. 3. Unified license plate for the entire country

Fig. 4. New license plate design to meet the increased population demands

III. PROPOSED SYSTEM

The proposed system consists of different modules. The first module deals with the hardware part of our system, the second module with the software part, the third module with the database part, and the fourth module with the design part of our system. Fig. 5 shows the flow chart of the proposed system. The system's start point is the ultrasonic sensor. When the sensor detects an approaching vehicle, it turns on the camera to capture an image. The captured image is processed to extract the vehicle number by applying the KNN technique [4]. The extracted result is compared with the database in order to decide whether to open the gate or not.

A. Dimensions of Saudi License Plate

The new Saudi license plate standard size is 310 mm × 155 mm (1:2 proportions). The license plate is divided into 5 regions, which are shown in Fig. 6. The right part, Region 1 (R1), contains the name of the country in Arabic (السعودية), the three letters K S A, and the palm tree of the Saudi emblem. The top right region (R2) has three Arabic letters and the top left region (R3) has one to four Arabic numerals. The lower part contains the remaining two regions: R4 has three English letters, and R5 has one to four numbers.

B. Connecting System Components Together

For a real-time fully automated Saudi number plate recognition system, different hardware and software components are combined together. The proposed system mainly relies on the Raspberry Pi 3.0, which contains a 64-bit quad-core 1.2 GHz processor, 1 GB SDRAM and a 440 MHz VideoCore IV GPU with 802.11n/Bluetooth wireless support [5]. This model of the Raspberry Pi contains 40 pins, including 26 General Purpose Input Output (GPIO) pins and pins for 3V/5V voltage supply [6]. These pins can be controlled by C/Java/Python scripts to link hardware and software components, along with the Open Source Computer Vision (OpenCV) libraries [7][8]. The high-performance Cortex-A53 processor has four processor cores with L2 cache and supports both 64-bit and 32-bit applications with low power consumption. The Raspberry Pi board also has 4 USB ports and an Ethernet port. The USB port is used for our camera, which will be used to capture the vehicle image for number plate recognition. The full description of the Raspberry Pi 3 board is shown in Fig. 7.

Fig. 5: Flow chart of the proposed system

Fig. 6: Saudi license plate regions

An HC-SR04 ultrasonic sensor is used, which has a 4-pin circuit board, as shown in Fig. 8. When the Raspberry Pi provides Vcc with 5 V, there is a working current of 15 mA through the circuit. Using Ohm's law, it can be figured out that the inner resistance of the entire circuit is 333 Ω. When the Python code is executed, a 5 V pulse is sent to the Trig pin to generate a 40 kHz wave from both sensors in the forward direction. The maximum range of these wave pulses is 4 m,
and the minimum range for which it can give a distance is 2 cm. The angle of the waves that are being generated is 15°. After the wave pulse has been sent, when the pulse hits an object and bounces back to the sensor, the Trig becomes high for 10 μs, indicating that there is an object in range. It then shoots 8-cycle bursts of ultrasound at 40 kHz through the echo; these 8-cycle bursts are called "sonic bursts" [9]. The range can be calculated from the moment the trigger signal is sent to the moment the echo signal is received, using Equation (1); the timing is shown in Fig. 9.

range = (echo pulse duration × speed of sound) / 2    (1)

Fig. 8: HC-SR04 ultrasonic sensor pin description

Fig. 9: Ultrasonic sensor timing diagram

Fig. 10: Servo motor pins and PWM cycle

A lightweight and small-sized SG90 servo motor is used, which can rotate 180° in total, 90° in each direction. It can rotate with a speed of 60° per second and operates with 4.8 V to 6 V. The servo motor is controlled using a Python library on the Raspberry Pi. This servo motor consists of 3 pins, PWM, Vcc and GND, which are described in Fig. 10. To control the servo motor, the frequency (or period) and the duty cycle are adjusted to set the servo angle. We look up the timing for our specific servo; a Hitec HS-645MG is used in the example. A 0° angle requires a high pulse of 600 μs (0.6 ms) and a 180° angle requires a 2400 μs (2.4 ms) pulse. Therefore, to achieve a spread of 180° of movement, a spread of (2.4 ms − 0.6 ms) = 1.8 ms of pulse time is required (0.9 ms for a 90° spread). Based on these calculations, a pulse time of 0.01 ms per degree is required.

To maintain the servo position, a pulse needs to be sent every 10 ms, i.e., a frequency of 100 Hz is required, as in Equation (2). Based on the above pulse time calculations, the duty cycle for the desired angle of the servo motor is calculated as in Equation (3) [10].

f = 1 / T = 1 / 0.010 s = 100 Hz    (2)

duty cycle (%) = ((0.6 + 0.01 × angle) / 10) × 100    (3)
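The timing relations above translate directly into a short Raspberry Pi script. The sketch below uses the standard RPi.GPIO calls; the pin numbers and the detection threshold are arbitrary examples, not the wiring used in the paper.

```python
import time
import RPi.GPIO as GPIO

TRIG, ECHO, SERVO = 23, 24, 18          # example BCM pin numbers (assumption)

GPIO.setmode(GPIO.BCM)
GPIO.setup(TRIG, GPIO.OUT)
GPIO.setup(ECHO, GPIO.IN)
GPIO.setup(SERVO, GPIO.OUT)

def read_distance_cm():
    # Trigger a short pulse, then time the echo pulse as in eq. (1):
    # distance = echo duration x speed of sound / 2.
    GPIO.output(TRIG, True)
    time.sleep(10e-6)
    GPIO.output(TRIG, False)
    start = end = time.time()
    while GPIO.input(ECHO) == 0:
        start = time.time()
    while GPIO.input(ECHO) == 1:
        end = time.time()
    return (end - start) * 34300.0 / 2.0      # speed of sound ~343 m/s

def servo_duty_cycle(angle_deg):
    # Eq. (3): pulse = 0.6 ms + 0.01 ms per degree, over a 10 ms period (100 Hz, eq. (2)).
    return (0.6 + 0.01 * angle_deg) / 10.0 * 100.0

pwm = GPIO.PWM(SERVO, 100)                    # 100 Hz carrier
pwm.start(servo_duty_cycle(0))                # gate closed
if read_distance_cm() < 100:                  # example threshold: vehicle within 1 m
    pwm.ChangeDutyCycle(servo_duty_cycle(90)) # rotate servo to open the gate
```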
For capturing images of the vehicle, we used a Logitech C310 camera, which needs a supply voltage of 5 V from the USB port in addition to 100 mA of current, giving it, according to Ohm's law, an internal resistance of 50 Ω. The camera captures images at a 5 megapixel resolution and HD video at 1280×720 pixels.

Fig. 7: Description of Raspberry Pi 3 board components

C. License Plate Recognition

The first license plate recognition method used in the proposed system was based on Tesseract-OCR. Tesseract was originally developed in 1994 by the Hewlett-Packard (HP) Laboratories and was further improved in 1998 to support C++ on Windows [11]. In 2005, HP made Tesseract open source. From 2006 onward, Google has been making changes to it to further enhance it. Currently the OCR supports more than 100 languages [12].

K-nearest neighbor (KNN) is a supervised classifier with the ability for instance-based learning [13]. Training samples along with their attributes are used for classifying a new object and subsequently determining the nearest neighbor of any instance through the use of various algorithms [14]. Classification in KNN requires analyzing similar groups. KNN works very well with multi-modal classes and is known to be an accurate process. However, in KNN all features are treated equally when computing similarities. This may lead to classification errors, especially when the feature set is small.
TABLE I. ENGLISH TO ARABIC LETTERS MAPPING

No. | Arabic letter | English letter | Description
1   | ا  | A | ***
2   | ب  | B | ***
3   | ح  | J | Does not have an English letter similar to the pronunciation of the letter (ح)
4   | د  | D | ***
5   | ر  | R | ***
6   | س  | S | ***
7   | ص  | X | Letter (S) was reserved for the letter (س) and the letter (C) is similar to (G)
8   | ط  | T | ***
9   | ع  | E | ***
10  | ق  | G | ***
11  | ك  | K | ***
12  | ل  | L | ***
13  | م  | Z | (M) is similar to (N) and is thus rejected as too wide
14  | ن  | N | ***
15  | هـ | H | ***
16  | و  | U | (W) is rejected as too wide
17  | ي  | V | (Y) is rejected as too high

With the same concept, KNN is used as an algorithm for character detection. The algorithm needs to be trained first for a certain set of characters; then it becomes ready to use and to compare what it sees with what it has been trained on. Understanding the concept of KNN is not enough to implement it in a real case, since the input image will not be as clear as the algorithm would like it to be, so a set of image processing steps is needed to prepare the image for extracting the information in it, and then to look for the suitable matches and assess each one of them to see whether it satisfies being a character or not [15]. The process consists mainly of two parts: the first is locating the plate in the image, and the second is detecting the characters in the plate itself using KNN. If the first part of the process fails to successfully locate a plate, the whole process fails. Before passing the captured image to Tesseract, preprocessing is done, including converting the color image to grey level, erosion and dilation [16][17]. Sample results of the extracted plates are shown in Fig. 11 and Fig. 12.

IV. EXPERIMENTAL RESULTS

Table 2 shows a general comparison between all three algorithms used and how accurate the results are; generally, KNN gives the most accurate result due to our modification and implementation of it. Based on the achieved results, we can see that the license plate recognition method that had the most accurate results is the method based on KNN. KNN resulted in recognizing the tested images with an average of 90%, whereas the OpenALPR license plate recognition method resulted in recognizing the tested images with an average of 75%. Furthermore, the Tesseract-OCR based license plate recognition method resulted in recognizing the tested images with an average of 55%. After looking at these results, it was decided to implement the KNN based license plate recognition method, as it resulted in the highest percentage of accuracy of 90%. Figure 13 shows the accuracy comparison of the proposed method with the other existing techniques.

Fig. 11: A) Original image, B) Gray scaled image, C) Threshold based binary image, D) Image after finding all contours, E) Image after finding possible characters, F) Image after finding all vectors of matching characters, G) Boundary of matching characters of the plate part, H) Extracted English letters part of the plate, I) Extracted numbers part of the plate.

TABLE II. PERFORMANCE OF DIFFERENT METHODS

Plate set / Method   | Tesseract | OpenALPR | KNN
License Plate Set 1  | 40 %      | 70 %     | 90 %
License Plate Set 2  | 60 %      | 80 %     | 92 %
License Plate Set 3  | 61 %      | 74 %     | 95 %
License Plate Set 4  | 57 %      | 75 %     | 88 %
License Plate Set 5  | 48 %      | 70 %     | 88 %
Average              | 53.2 %    | 73.8 %   | 90.6 %

Fig. 12: Converting from RGB to gray and then to a binary image of the detected license plate.
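The pipeline summarised in Fig. 11 and Fig. 12 (grayscale, thresholding, contour search, per-character KNN) can be sketched with standard OpenCV calls. This is an illustration, not the authors' code: train_samples and train_labels are placeholders for a real training set of flattened character images and their class labels.

```python
import cv2
import numpy as np

def find_character_boxes(plate_bgr):
    gray = cv2.cvtColor(plate_bgr, cv2.COLOR_BGR2GRAY)               # B) gray scale
    thresh = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                   cv2.THRESH_BINARY_INV, 11, 2)     # C) binary image
    found = cv2.findContours(thresh, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    contours = found[0] if len(found) == 2 else found[1]             # D) contours (OpenCV 3/4)
    boxes = [cv2.boundingRect(c) for c in contours
             if cv2.boundingRect(c)[3] > 0.4 * plate_bgr.shape[0]]   # E) plausible characters
    return thresh, sorted(boxes, key=lambda b: b[0])                 # left-to-right order

def train_knn(train_samples, train_labels):
    knn = cv2.ml.KNearest_create()
    knn.train(np.float32(train_samples), cv2.ml.ROW_SAMPLE, np.float32(train_labels))
    return knn

def classify_characters(knn, thresh, boxes, size=(20, 30)):
    labels = []
    for (x, y, w, h) in boxes:
        roi = cv2.resize(thresh[y:y + h, x:x + w], size).reshape(1, -1)
        _, result, _, _ = knn.findNearest(np.float32(roi), k=3)      # KNN vote over k neighbours
        labels.append(int(result[0][0]))
    return labels
```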
100% users as well as will use different LED lights to indicate that a
vehicle is allowed or denied to enter.
90%

80% REFERENCES
[1] Arth, C., Limberger, F., & Bischof, H. (2007, June). Real-time license
70%
plate recognition on an embedded DSP-platform. In 2007 IEEE
Conference on Computer Vision and Pattern Recognition (pp. 1-8).
60%
IEEE..
Fig. 13: Performance comparison chart of different LP detection methods (recognition rates of Tesseract, OpenALPR and KNN across License Plate Sets 1-5).

V. CONCLUSION
Our research aims to create integrated systems that reduce manual labor, discard redundant work and move towards an automated future. Three different ways to process the license plate, OpenALPR, Tesseract and KNN, are discussed, and the results of each algorithm are compared, singling out KNN for its superior results in terms of license recognition. The ultrasonic sensor measures the distance of the car approaching the gate; when a certain distance is measured, an instruction is sent to the camera to capture a picture of the car's license plate. This image is processed and passed as input to the KNN algorithm, which opens the gate if the result is found in the database; otherwise the gate does not open. This system can be integrated into a main gate, removing the need for security personnel to be stationed there all the time. When a vehicle is verified by the security official, its license plate details are inserted into the database. This information from the database is used to open the gate once the license plate is verified, making it easier for the security personnel to make their rounds and focus on other useful tasks rather than staying at the gate and opening it manually all the time.
In future, we will try to improve the algorithm to recognize Arabic letters and numbers. We will also add more training and testing data to improve the results. At the hardware level, we will add an LCD to display the important messages of the system.
will add LCD to display the important messages to the system & Engineering Research, vol. 2 – no. 5, pp. 1-6.

376
Two Information Systems in Air Transport
It is a Short Journey from Success to Failure
Victor P. Lane, Business School, London South Bank University, London, UK (profviclane@btinternet.com)
Derar Eleyan, Computer Science Department, Palestine Technical University-Kadoorie, Tulkarem, Palestine (d.eleyan@ptuk.edu.ps)
James Snaith, Business School, London South Bank University, London, UK (snaithja@lsbu.ac.uk)
Abstract-- Businesses across the world are launching ambitious computer-based information systems (ISs) to improve their competitive advantage. ISs have evolved over three decades. In this period, have we gained appropriate knowledge to successfully create ISs, i.e., from conception to design realization and implementation? In this paper, we examine two different IS case studies from the air transport sector – a sector currently under considerable scrutiny. The aim is to extract lessons from the successes and failures of these case studies to help practitioners.

Keywords— information systems; problems; failure; success; terminal airport; baggage system; contingency planning

I. INTRODUCTION

A. Study Motivation and Current Trends
Recently, it has been noticeable that several important major ISs have been implemented (1) with a 'big-bang' implementation, (2) with virtually no contingency or emergency planning to alleviate any initial problems that might occur with the IS implementation, and (3) such that the IS end-users, i.e., in many ways the innocent spectators or customers, suffer all the pain of an initial system outage [1, 2]. Are these anomalies or current trends?
It is puzzling that at the same time during which we are thinking about (1) computer science in terms of AI and robotics, (2) massive and highly expensive modern infrastructure, mission-critical computer-based information systems are still witnessing ruinous and appalling IS implementation failures.
These are not ultra-modern or futuristic ISs. These are computer-based ISs that should not fail in implementation in 2019. Why failure? It is a conundrum! Therefore, the benefits of this study are that it will help to highlight (1) the continuing problems relating to the implementations of computer-based systems and (2) the underlying factors causing the failures. The impact of this study is to raise the problems of implementation and their significance related to the agenda of computer science and IS practitioners.
This paper relates to problems that can occur in many of the wide-ranging computer science topics covered in present-day state-of-the-art international conferences – including topics from AI to robotics and machine-learning, and from pattern-recognition to automation.

B. Ubiquitous Information Systems
Nowadays, technology and information systems (ISs) permeate everywhere in businesses. Without ISs, normal daily business activities would be difficult. A successful IS project is a project that creates a system which operates in its business environment and satisfies the requirements of business end-users [3]. There are many dangers with planning new ISs. The main problem is that a new IS means change for someone.
Change is always problematic. Consequently, problems can occur, often declared as a failure. These can occur in many forms, e.g., deviation from previously agreed aims; inability to satisfy an end-user's needs; overspending on IS development; or missing target dates [4]. Understanding and exploring success and failure of business ISs is a difficult task because failures are complex and involve elaborate socio-technical arrangements [5]. While simple failures relate to over-spending or taking too long to build, the serious failures are those that malfunction and interfere with the operation of the business. Such events recently occurred with the Boeing 737 plane, related to the anti-stall software [6]. The causes of this problem are unknown. Therefore, it cannot be used in this paper. Instead, two earlier case studies, also from air transport, are used [7, 8, 9, 10]. The paper analyses the good and not-so-good incidents within these two scenarios, to learn from successes and failures. This paper is part of research activities spanning the last three decades, that started with exploration of failures in hospital ISs [11, 12, 13].

C. Research Method
The aim is to answer the overarching question 'Can we now develop ISs without any major risk of failure?' The answer to this question is obtained using paradigms and guidelines relating to failure and success of ISs. Two IS case studies are outlined. The paradigms and guidelines are presented in Section II. They form the framework for analyzing the case studies. The paper combines two research approaches - (1) a case study approach and (2) an argumentative approach.
A case study provides descriptions that exist in one specific situation in a single organization. A major advantage of the case study approach is that it captures reality. It has limitations because the data come from one single situation, and in a different organization the data might have limitations. An argumentative approach captures
ideas based on speculation and opinion. It can be a valuable contribution to theory building.

II. SUCCESS AND FAILURE
There are many methods of analyzing failures of IS projects and many suggestions for how failure might be caused. For example: Do end-users believe that the IS and its software operate correctly? Does it deliver the anticipated benefits? Does the value of benefits achieved exceed the costs of the IS development? There are myriads of ways to do this assessment [14, 15, 16].
These paradigms and guidelines often overlap. Therefore, in this paper, only a small number are used for analysis. They indicate how, where and why failures have occurred. It is astonishing how the same mistakes continually re-occur. It is clear that "learning from failures" is not straightforward.

A. The Standish Group: Ideas of success and failures
For over 30 years, the Standish Group in the USA, a primary research advisory organization, has written about software project performance, and success and failure in IT projects. Its reports, entitled "Chaos", try to show methods for the achievement of successful projects – see Table 1 for suggestions of success factors.

TABLE 1. SUCCESSFUL IT PROJECTS – STANDISH GROUP
Success Factors – 1994: 1. Executive Management Support; 2. User Involvement; 3. Clear Listing of Requirements; 4. Proper Planning; 5. Realistic Expectations; 6. Smaller Project Milestones; 7. Competent Staff; 8. Ownership; 9. Clear Vision & Objectives; 10. Hard-Working, Focused Staff.
Success Factors – 2012: 1. Executive Support; 2. User Involvement; 3. Clear Business Objectives; 4. Emotional Maturity; 5. Optimizing Scope; 6. An Agile Process; 7. Project Management Expertise; 8. Skilled Resources; 9. Execution; 10. Tools & Infrastructure.

B. The Critical Factors Approach
The items listed in Table 2 are the "Critical Failure Factors" which have been found to be associated with success or failure in computing projects [17, 18]. It is recognized that one factor alone may not be critical, but the concatenated effect of several factors brings greater risk and possible failure. These factors are intended to help practitioners to identify the true status of a computing project and, in the case of a troubled project, lead to appropriate remedial action.
Some of the observations and recommendations which are associated with this approach are like those described in the next section. For example, that senior (non-technical) general management should not abdicate their responsibility to 'manage' projects to any internal or external party; and that there is no certain way to avoid a disaster.

TABLE 2. CRITICAL FACTORS ASSOCIATED WITH FAILURE
I. Organizational Context: CF1 Poor reporting structures; CF2 Abdicating responsibility; CF3 Bad news moderated.
II. Management of Project: CF4 Over commitment to success; CF5 Over commitment to completion; CF6 Unable to be impartial; CF7 Political external pressures; CF8 Targets set outside the project.
III. Conceptual Stage: CF9 Complexity underestimated; CF10 ICT over emphasized; CF11 Lure & trap of leading-edge IT.
IV. Design Realisation: CF12 Poor consultation (stakeholders); CF13 IT fix for management problem; CF14 Design by committee.
V. During Building Stage: CF15 Competency; CF16 Staff turnover; CF17 Communication.
VI. Implementation into the Organization: CF18 Poor testing of product; CF19 Poor training of users; CF20 Receding deadlines.

C. The Pragmatic and Sceptical Approach
This pragmatist's guide to avoiding failures provides several maxims, some of which appear both recognizable and sometimes disparaging.

TABLE 3: THEMES THAT RECUR IN MOST LARGE FAILURES
T1. Over ambitious. T2. Technocrats think they know it all. T3. Computing must be beneficial. T4. Management abdicate responsibility. T5. Credulity - it will turn out alright when needed. T6. Conflicts that may have a conflict of interest. T7. Custom built product. T8. Concealment of bad news by middle managers. T9. Buck passing. T10. Mistaken belief - litigation will solve problems.

This does not detract from the fact that they are extremely useful and, regrettably, too often ignored by practitioners. This approach starts from the premise that the mistakes that occur in computer-based projects are always similar. These recurring themes are encapsulated in Table 3. The approach encourages a pessimistic and cautious view of computing, which is based on the conviction that (1) many computing projects 'fail', and (2) few if any systems ever come close to being perfect [19]. However, while emphasizing the need for caution, and being disparaging of IT enthusiasm or IT hyperbole, there is a recognition that no enterprise in the 21st century can survive without IT systems &/or without change. This change must be brought
about by harnessing technology - without the lure of the technology causing developers to lose sight of the only real goal, which is to bring improvements and benefits to the business.

III. CASE STUDIES
The two case studies are selected from the air transport industry. At this point in time, for various incidents across the world, some life-critical embedded IS/IT systems in air transport are under scrutiny [6]. The two case studies illustrate some of the problems faced by IT practitioners in the air transport industry.

A. Case Study 1: A New Terminal at Heathrow Airport
The new Heathrow Terminal 5 was designed to be one of the most technologically advanced airport terminals in the world, but the initial opening was indefensible. While the terminal resulted in (1) praise for its use of ISs for the planning and creating of a huge civil engineering project [20], it may well be remembered because of (2) the failure of the new but relatively 'humble and unexciting' baggage system [7, 8]. The overall cost for T5 was £4.3bn [$US 8.5 billion], with £250m [$US 323m] invested in technology and IT systems, see Table 4. The complex systems used 400,000 people-hours for software engineering. The terminal T5 required 180 IT suppliers, 163 IT systems, 546 interfaces, more than 9,000 connected devices, and 2,100 PCs.
Written evidence submitted to the UK Government's Transport Select Committee discovered that a multitude of problems were unearthed in the first days of operation of T5. Unfortunately, the attempts of IT staff of BA to alleviate the failings did not reduce the passenger problems. Instead the initial problems were intensified.

TABLE 4. TERMINAL T5
Cost of Terminal 5: £4.3 billion [$US 8.5 B]. Terminal area: 251 hectares. Workers on site at one time: 6,000. Glass walls: 30,000 m2. New aircraft stands: 60. Tunnels constructed: 13,000 m.

The only airline using T5 was BA, but BA was not responsible for the airport and not for T5. The responsibility and ownership of the airport and T5 lies with a different private company, BAA - approximately 90% of the ownership is non-British. The CEO of BA, the major airline using Terminal 5, stated that (1) IT difficulties, plus (2) a lack of testing, played a significant part in the malfunctions at T5. However, he suggested that if the issues had been simply IT related, then the airline might have coped. A huge number of non-IT difficulties hit the T5 implementation during its first few days, and these were intensified by the way staff handled these difficulties. These non-IT problems included (1) insufficient reservations for car parking for new T5 staff, causing many staff to be missing from their posts, (2) security searches delayed, (3) staff not fully trained, (4) construction of parts of the T5 building being incomplete at the time of the airport opening, and (5) some 10% of the terminal's 275 lifts not being operational.
Each item, in isolation, appears trivial, especially in the context of such a huge undertaking. The overall project was a new technologically advanced airport terminal costing £4.3bn. Prior to the real-time implementation of the baggage IS, there had been 66 trials, using 15,000 people from the public and from stakeholders, and 400,000 bags. Together, these created 50,000 passenger profile trials and all travel scenarios. But non-IT problems, apparently tiny, compounded the IT/IS difficulties, such that on the day of opening, the terminal was 'not fit for purpose' – see Table 5. A UK Government report [8] said that the main factors causing the failures were (1) insufficient communication between the terminal user and terminal operator, and (2) poor staff training combined with incomplete systems testing. Slightly different from the claim of the CEO of BA. Naturally, many of the faults were in the remit of BAA, the owner of the airport. BAA is approximately 90% owned by non-UK parties. Some of the faults that occurred are detailed below.
• There were difficulties with the LAN facility. Consequently, at some check-in stations, handheld computer devices were inoperable, and consequently airport staff could not enter baggage-data into the baggage IS.
• An initial problem was that the BA loading staff could not sign on to the baggage-reconciliation system – so staff had to reconcile bags manually, causing significant delays.
• In the afternoon of opening day, BA could no longer accept checked baggage. Therefore, at check-in passengers were told that they could choose between (1) travelling without baggage or (2) re-booking their flight. Unfortunately, passengers already checked-in and waiting in the departure lounge were informed that they would be leaving without their bags.
• During the earlier testing of the baggage system, IT testing staff installed 'testing software'. This software was not removed, and it caused problems when it was used with real events. In real use, the T5 baggage system did not receive data about luggage transferring to BA from other airlines. Therefore, these 'unknown' bags were sent for manual sorting in a storage service, i.e., a storage centre outside T5.
• An "incorrect configuration" between ISs stopped the feed of data from the baggage-handling system to the baggage reconciliation system. A week after the original opening, the reconciliation system failed for the whole day. Bags missed their flights because the faulty system told staff that they had not been security screened.
• BA was compelled to cancel flights as it attempted, unsuccessfully, to understand and clear the luggage blockage.

TABLE 5: PROBLEMS IN FIRST 5 DAYS OF OPENING
Number of passenger bags misplaced: 23,000. Flights cancelled: 500. Losses made: some £16m [$US 21m].
Some 10 years before the terminal T5 construction, the Denver International Airport in the USA had experienced similar problems with its baggage IS [21]. The Denver baggage system, then the most advanced system in the world, is still known as a notorious example of project failure. It was planned to automate the handling of baggage for the entire Denver airport. The baggage system was found to be extremely complex, and the resultant problems caused the new airport to be unused for 16 months. The Denver delay added approximately $US 560m to the cost of the airport.
Terminal T5 was more fortunate. At its opening, T5 was the largest free-standing structure in the UK. It had the finest of architects and civil engineers, namely Richard Rogers and Arup with Mott MacDonald, respectively. T5's first passengers arrived at 4.50am on 27 March 2008 on a flight from Hong Kong. It was perfectly successful. The rest of that day, and later days, were chaos. After its less than auspicious opening, the terminal T5 had many misfortunes. Over the first 10 days, 42,000 bags did not fly in the same plane as their owners. The first full schedule from T5 occurred on 8 April 2008, some 1½ weeks after its first opening.

B. Air traffic control system failure – Los Angeles Airport
The Los Angeles International Airport, USA, locally known as the LAX airport, is the primary international airport for the city, and is the world's third-busiest airport based on total movements. The air traffic control system, i.e., the En-Route Automation Modernization (ERAM) system, is fundamental for the safe running of such a large airport. ERAM was developed by the Lockheed Martin Corp and cost $2.4 billion. In April 2014, ERAM was thought to be secure and dependable.
However, on 30 April 2014, a rogue plane entered the flying space [9, 10]. An air traffic controller could see that this unknown plane (1) was going in and out of the Los Angeles control area multiple times, and (2) was higher than normal commercial flights. It did not have a simple point-to-point route like normal commercial flights. Later, it was recognised that the aeroplane was a U-2 spy plane, operating at high altitude, with a complex flight plan. The controller entered its entry at 60,000 feet. The ERAM system calculated all possible flight paths of the unknown plane to ensure that it was not on a crash route with the commercial planes with known flight paths at lower altitudes. Unfortunately, before all paths could be completed, the process used a large amount of available memory and interrupted the system's other flight-processing functions, causing a system crash. The system then recycled, attempting to complete the process - a repeating failure. Commercial flights have a relatively small data need, and the rogue plane quickly overran the remainder of the system's data memory.
With the ERAM system down, the air traffic controllers in the regional LAX main centre switched to a simple, uncomplicated back-up system. In this way, they could see the commercial planes on their screens. Using phones and paper, they were able to send flight information relating to commercial planes flying in their airspace, and to other control centres in the region.
This type of incident is not reported fully. Although few details about the system crash have been made public, major features are known. The incident affected a large area of the south west of the USA - from western Arizona to the west coast, and from Mexico to southern Nevada. Fortunately, no accidents or injuries occurred, but hundreds of flights were delayed. The National Air Traffic Controllers Association reported that ERAM was back up and running within an hour - perhaps a good indicator of the strength of the air traffic control system. It appears the problem was caused by a simple lack of memory.

IV. ANALYSIS
A. Heathrow Terminal 5
As is to be expected with a high-profile and well-funded project such as Heathrow Terminal 5, the project management teams followed the sound principles and guidelines that are described in the publications listed in Section II and outlined in Tables 1, 2 and 3. In addition, there were significant sums spent on system testing and staff training, but without successful completion. However, it is often suggested that practitioners exaggerate how much time and money they spend on testing [22]. In the case of Terminal 5, whatever was spent was insufficient in that the T5 project became a dreadful failure.
There are no major or significant omissions or differences. However, there were some minor differences, which later contributed to the failure.
With respect to project management, i.e., item 7 of Table 1, staff from BA and BAA knew that the overall project was a little late with some parts of the building works. The knock-on from this was that the IS testing was started late, without any correction to the completion date, i.e., the late start should have resulted in a delay in the testing completion time, and perhaps also in the T5 opening time or date. At the time, this was thought to be too small a time-change to cause problems.
In Table 2, the critical factors CF18, 19 & 20 emphasise how important testing, training and receding deadlines are, particularly in the final stages of implementation. At this stage, any problems have little leeway for additional time to address last-minute faults, like those that occurred in T5. In addition, CF5 focusses on over-commitment to completion dates; in the case of T5, commitment to completion dates caused reduced times for testing and training. System testing and staff training are often thought of as non-important activities; in the case of T5 they were crucial.
In Table 3, items that often occur in IS failures are Themes T6 and T7, namely possible conflicts of interest and the dangers of custom-built computer systems. Both are apparent in the Terminal T5 events. Theme T5, credulity, is also evident, i.e., the wishful thinking that 'when we start' everything will be satisfactory.
The above events emerged in various forms in the T5 situation. However, in the above discussion there is not one item that seems large enough to cause the huge upheaval that occurred at the Terminal T5 opening. It was their combined effects, plus logistics and building incidents, that brought the whole T5 enterprise to a virtual halt.
Finally, it is well-known that a 'big-bang' approach can cause problems. There is no explanation as to why BA
selected this approach, rather than a gradual phased approach.

B. Air Traffic Control System at Los Angeles Airport
The LAX system degraded slowly before failing completely. Fortunately, the organization had a back-up manual system to take over virtually all traffic control needs. Everything went well. It was certainly a 'failure', but one can only applaud the way in which the back-up system operated.
The organization did not claim it was their 'contingency plan', but it certainly helped the air traffic controllers to work in unexpected and dangerous circumstances. The LAX incident, like the T5 implementation, was a huge bewilderment.
The T5 incidents caused passengers to be inconvenienced, but did not pose any threat to passengers' safety; whereas the LAX incident had the greater potential of danger. It was more likely to endanger the lives of people within the LAX location. LAX could claim that they were well prepared for even this unexpected rogue event.

V. CONCLUSIONS
In Section I, the question "Can we now develop ISs without any major risk of failure?" was posed. The insights that we have from the guidelines addressed in Section II help practitioners. However, Case 1 reminds us that it may not be the computer science or the brilliant IT that will cause failure. Other simple basic practices related to 'management of change' are at least as important. In Case 1, was it wise to use a 'big-bang' implementation? Or was the absolute necessity of good training and of system testing really understood by BA and BAA management? Or was it simply that the date-of-start took preference over training and testing times?
The LAX case study shows that even the best system can be unable to continue when a rogue incident occurs. It also demonstrates the need for back-up systems. The events of the LAX incident are not fully known; so it is unwise to pontificate. Nevertheless, the LAX back-up system averted problems, and possibly fatalities. It also reminds practitioners of the need for contingency planning.
Recently, there have been UK incidents - with banks and with mobile phone companies [1, 2] - where these large companies have used a 'big-bang' approach that failed. The end-users, like the T5 passengers, were innocent by-standers or victims. Is a 'big-bang' implementation, even if it is more problematic to end-users, coming into fashion? This would appear to be a suitable topic for future research.
While new technologies provide us with new business opportunities, the case studies remind us of the dangers of forgetting the lessons we have learned from past experiences, such as:
• the importance of the completion of system testing,
• the importance of good quality staff training,
• the absolute necessity for cooperation between staff from all organizations involved in the first real-time working of the new system,
• the difficulties of a 'big-bang' implementation - it is better, if possible, to use a phased implementation, and
• the dangers of receding deadlines, leading to attempting to do jobs in less time than originally estimated.
Finally, Case Study 2 highlights the importance of back-up systems and contingency planning. In Case 1, i.e., Terminal T5, if any serious back-up or contingency planning had been in place, or if the system testing and the staff training had been correctly completed [22], then the problems that occurred within T5 might never have occurred.

REFERENCES
[1] M. Field, "O2 network restored after Ericsson software outage left millions of mobile users without 4G data access," The Telegraph, London, 7 December 2018.
[2] J. Jolly, "The TSB bank computer meltdown bill rises to £330m," The Guardian, London, 1 February 2019.
[3] L. McLeod, B. Doolin, and G. MacDonell, "A perspective-based understanding of project success," Project Management J., vol. 43, pp. 68–86, 2012.
[4] R. Sweis, "An investigation of failure in information systems projects: The case of Jordan," J. Management Research, vol. 7, 2015.
[5] B. Shore, "Systematic biases and culture in project failures," Project Management J., vol. 39, pp. 5–16, 2008.
[6] G. Toham and H. Smith, "Investigators believe Ethiopian Boeing 737 Max's anti-stall system activated," The Guardian, London, 29 March 2019.
[7] R. Thomson, "British Airways reveals what went wrong with Terminal 5: The full extent of the IT problems," Computer Weekly On-Line, 14 May 2008.
[8] UK House of Commons – Transport Committee, The opening of Heathrow Terminal 5: Twelfth Report of Session 2007–08, HC 543, London: Stationery Office, 22 October 2008.
[9] J. Hamil, "Los Angeles air traffic meltdown: system simply ran out of memory," The Register Online, 12 May 2014.
[10] A. Scott and J. Menn, "Exclusive: Air traffic system failure caused by computer memory shortage," Reuters, Technology News, 12 May 2014.
[11] V. P. Lane, "Information systems projects – Are failures congenital or acquired?" Current Perspectives in Healthcare Computing, pp. 156-164, March 1999 [Proc. of HC'99, British Computer Society].
[12] V. P. Lane, "The NPfIT in the NHS: £12.7bn – The NHS computer system can still provide joined-up healthcare," The Guardian, London, p. 31, 4 August 2009.
[13] V. P. Lane, J. A. Snaith, and D. C. Lane, "Hospital information systems – Are failures problems of the past?" Invited Paper, Annual Journal of Medical Informatics & Technologies, University of Silesia, vol. 11, pp. 11-22, November 2009.
[14] R. Ibrahim, E. Ayazi, S. Nasrmalek, and S. Nakhat, "An investigation of critical failure factors in IT projects," J. Business and Management, vol. 10, pp. 87-92, 2013.
[15] M. Kateb, R. Swies, B. Obeidat, and M. Maqableh, "An investigation of the critical factors of information system implementation in Jordanian information technology companies," European J. Business and Management, vol. 7, pp. 11-28, 2015.
[16] H. N. Nasir and S. Sahibuddin, "Critical success factors for software projects: A comparative study," Scientific Research and Essays, vol. 6, pp. 2174-2186, 2016.
[17] C. Sauer, Information Systems Project Performance: A Continuing Journey, Warwick Business School, ISM Forum, Warwick University, 2008.
[18] K. T. Yeo, "Critical failure factors in information system projects," Int. J. Project Management, vol. 20, pp. 241–246, 2002.
[19] T. Collins and D. Bicknell, Crash: Ten Easy Ways to Avoid a Computer Disaster, Simon & Schuster, Sydney, Australia, 1997.
[20] A. Davies, D. Gann, and T. Douglas, "Innovation in mega-projects: Systems integration at London Heathrow Terminal 5," California Management Review, vol. 51, pp. 101-125, Winter 2009.
[21] M. Schloh, "Analysis of the Denver International Airport baggage system," Computer Science Department, School of Engineering, California Polytechnic State University, 16 Feb. 1996.
[22] M. M. Beller, G. Gousios, A. Panichella, and A. Zaidman, "When, how, and why developers (do not) test in their integrated development environments," in Proceedings of the 10th Joint Meeting on Foundations of Software Engineering (ESEC/FSE 2015), ACM, New York, USA, pp. 179-190, 2015.
Task Scheduling based on Modified Grey Wolf
Optimizer in Cloud Computing Environment
Abdullah Alzaqebah, Computer Science Department, The World Islamic Sciences and Education University, Amman, Jordan (Abdullah.zaqebah@wise.edu.jo)
Rizik Al-Sayyed, Information Technology Department, King Abdullah II School for Information Technology, University of Jordan, Amman, Jordan (r.alsayyed@ju.edu.jo)
Raja Masadeh, Computer Science Department, The World Islamic Sciences and Education University, Amman, Jordan (raja.masadeh@wise.edu.jo)
Abstract—Task scheduling is considered one of the most critical problems in the cloud computing environment. The main target of task scheduling is to schedule jobs on virtual machines and to improve performance. This study employed the Grey Wolf Optimization (GWO) algorithm with modifications to the fitness function, making it handle multiple objectives in a single fitness value; the makespan and cost are the objectives included in the fitness in order to solve the task scheduling problem. The main target of this technique is to reduce both cost and makespan. The CloudSim tool is used to evaluate the objectives of the proposed method. The simulation results showed that the proposed method (Modified Grey Wolf Optimizer - MGWO) has better performance than both the traditional Grey Wolf Optimization Algorithm (GWO) and the Whale Optimization Algorithm (WOA) with makespan-based fitness, in terms of makespan, cost and degree of imbalance.

Keywords—GWO, MGWO, WOA, Fitness, Makespan, and Cost

I. INTRODUCTION
Due to the availability of big data as well as the on-demand operation in cloud computing (CC), the requirements of CC environments have increased in recent years. CC [1, 2] allows clients to access the available and suitable resources such as internet applications, storage, and servers [3]. The main role of the cloud service provider is to handle and manage client requests (services) over the Internet [4]. The CC environment presents various services to clients. The most important services are Platform as a Service (PaaS) [5], Infrastructure as a Service (IaaS) [6], Expert as a Service (ExaaS) [7] and Software as a Service (SaaS) [8][9]. The cloud clients have various tasks, and these tasks are implemented and achieved at the same time by the available resources in the cloud. The performance of CC can be improved by mapping tasks onto resources in an optimized manner. One of the most critical operations of the cloud is task scheduling, which has a great influence on the entire cloud by impacting the Quality of Service (QoS) [10, 11]. CC task scheduling preserves the balance over the entire system load. Each job demands response time, memory and computing time in several scales. In addition, the CC has distributed resources.
An efficient task scheduling method must minimize the makespan of the application [12]. Therefore, there is a need for algorithms to schedule the cloud tasks of users which optimally assign tasks to resources as well as reduce the makespan. However, there are other criteria playing a role in cloud task scheduling, such as cost and utilization. A multi-objective task scheduling algorithm has to maximize resource utilization and minimize both makespan and cost to optimize the scheduling in cloud environments.
Cloud task scheduling is known as an NP-complete problem [13]. More precisely, the required time for detecting the solution changes with the problem size [14]. Cloud task scheduling algorithms are categorized into two classes, namely meta-heuristic and heuristic algorithms. Heuristic algorithms are problem-specific strategies; they cannot be used to answer open problems. On the other hand, meta-heuristic algorithms can be used (or applied) to solve a wide range of problems in reasonable time.
Recently, meta-heuristic algorithms have become the most applied techniques for task scheduling because they find optimal or near-optimal solutions in reasonable time. Moreover, they detect the solutions by employing random choices. The most suitable example of a meta-heuristic algorithm is the Genetic Algorithm (GA), which is adopted by many studies to solve the task scheduling problem (TSP) in several manners. In the literature studies [15-18], the required time for mapping tasks onto resources increases when the number of jobs is increased.
In this research, we propose cloud task scheduling based on a multi-objective model and the Grey Wolf Optimization (GWO) algorithm to minimize both cost and makespan in cloud environments. The CloudSim tool is used to evaluate the proposed technique.
The organization of the paper is described as follows: Section II contains the related work, while Section III describes the GWO algorithm in detail. Section IV outlines the suggested work. Simulation results are presented in Section V. Finally, Section VI concludes this research.

II. RELATED WORK
Many researchers have tried to solve cloud task scheduling using different techniques. Most of them employed meta-heuristic algorithms such as GA, ACO, GWO, and WOA in order to solve one of the main problems of the cloud environment, which is the task scheduling problem (TSP), as well as to find the optimal distribution of available resources. However, there are still some issues in this research area [2, 19].
A novel algorithm is proposed which is based on a neural network (NN) in order to classify the task queues which occur on any resource as well as to grant priorities to a variety of tasks [20]. An NN is considered an artificial intelligence system which can discover and distinguish a pattern. Also, it can learn by instance and adapt to novel
concepts and knowledge. Employing an NN has high potential to optimize the mapping of tasks onto virtual machines (VMs) in CC environments.
Few researchers have employed the GWO algorithm to solve the problem. Multi-objective cloud independent task scheduling based on mean GWO is presented in [21]. The primary objectives of the proposed algorithm [21] are to reduce both makespan and power consumption. Based on simulation results, they proved that the suggested Mean Grey Wolf Optimization algorithm has better results than the traditional GWO and PSO algorithms. The work in [22] employed the GWO method in order to schedule dependent tasks in CC environments. Makespan, cost, and resource utilization are taken into consideration. The experimental results showed that the proposed algorithm has better performance than the other existing techniques.
Some studies used the Whale Optimization Algorithm (WOA) to solve the TSP. The study of Sharma, M. et al. [23] focused on minimizing both energy consumption and makespan for cloud independent task scheduling. Experiments are performed over a variable number of tasks and VMs. Based on simulation results, the suggested technique provided superior results to the Min-min algorithm in terms of makespan and consumed energy. Another cloud task scheduling technique is suggested based on WOA and a multi-objective model, called W-Scheduler [24]. The main objectives of W-Scheduler are reducing makespan and budget cost. In addition, the simulation results of W-Scheduler outperformed the other compared existing algorithms. Another multi-objective WOA is proposed in the study of Reddy, G. N. et al. [25] in order to schedule independent tasks in CC environments. Energy consumption, makespan, resource utilization and quality of service are taken into account. Simulation results proved that the suggested algorithm has better performance compared with the existing techniques. Masadeh, R. et al. [26] proposed a new metaheuristic optimization algorithm called the Vocalization behavior of humpback Whale Optimization Algorithm (VWOA). VWOA mimics the vocalization behavior of humpback whales in nature. Also, the researchers introduced a cloud task scheduling technique based on the VWOA and a multi-objective model that is focused on makespan, cost, resource utilization, and energy consumption. The simulation results showed that the proposed technique has better performance than other algorithms.
Many researchers have utilized Ant Colony Optimization (ACO) to solve the TSP in the CC environment. A cloud task scheduling algorithm is proposed based on load balancing and the ACO algorithm (LBACO) [27]. This algorithm balanced the entire system load, in turn minimizing makespan. Simulation results showed that the suggested strategy provided superior results to First-Come-First-Served (FCFS) and traditional ACO algorithms. Another solution is proposed in the study of Tawfeek, M. A. et al. [28] that takes into consideration the makespan and degree of imbalance. Moreover, the experimental results demonstrated that the suggested strategy outperformed Round-Robin (RR) and FCFS techniques. Dependent task scheduling based on ACO and two-way ant strategies is introduced in the work of Zhou, Y. et al. [29]. The experimental results demonstrated that the proposed technique can greatly minimize the total execution time to find the available cloud resources as well as significantly improve efficiency.
Some studies employed a Genetic Algorithm (GA) to propose novel cloud scheduling techniques. A new scheduling strategy that assists in appropriate and dynamic resource utilization is proposed in the work of Kumar, P. et al. [30]. In other words, an improved GA is introduced which combined the Min-Min and Max-Min techniques in the traditional GA. Based on simulation results, the proposed strategy outperformed the traditional GA in terms of makespan. A suggested enhancement of GA is introduced by Wang, T. et al. [31], which achieved independent task scheduling while minimizing makespan and balancing the entire system load. The experimental results proved that the suggested algorithm can reduce the makespan and balance the system load efficiently.

III. GREY WOLF OPTIMIZATION (GWO) ALGORITHM
The Grey Wolf Optimization (GWO) algorithm is considered one of the most recent nature-inspired meta-heuristic optimization algorithms and was proposed by [32]. It mimics the foraging and hunting behavior of grey wolves. The most distinguishing feature of grey wolves is their social hierarchy; they live in a pack that consists of 5-12 wolves. Each pack has alpha, beta, delta, and omega members. Alpha is the leader, which is responsible for taking the decisions. Beta is a consultant to the leader (alpha) which helps alpha to make decisions. Delta wolves are described as subordinates that submit to the upper levels (alpha and beta) but dominate the lower level, which is called omega.
The hunting behavior of grey wolves is split into stages as follows [32-37]:
• Tracking, chasing and approaching the prey.
• Pursuing, encircling and harassing the prey until it stops moving.
• Attacking the prey.
The mathematical model of the GWO algorithm is provided as follows:
1- Encircling prey: during the hunt phase, the grey wolves encircle the prey, which is mathematically modeled in the following equations Eq. 1 and Eq. 2:

D = |C · X_p(t) − X(t)|    (1)

X(t + 1) = X_p(t) − A · D    (2)

Where t indicates the current iteration, A and C are coefficient vectors, X_p denotes the position vector of the prey and X represents the position vector of a grey wolf. In addition, A and C are computed using the following Eq. 3 and Eq. 4:

A = 2a · r_1 − a    (3)
C = 2 · r_2    (4)

Where r_1 and r_2 represent random vectors in [0, 1] and a is linearly decreased from 2 to 0 [32].
2- Hunting: This phase is guided by the leader alpha and the consultants, the beta and delta wolves, which have enough knowledge about the position of the prey. Thus, the rest of the wolves should update their locations according to the locations of the best agents, which is mathematically modeled in the following equations Eq. 5, 6 and 7:

D_α = |C_1 · X_α − X|,  D_β = |C_2 · X_β − X|,  D_δ = |C_3 · X_δ − X|    (5)

X_1 = X_α − A_1 · D_α,  X_2 = X_β − A_2 · D_β,  X_3 = X_δ − A_3 · D_δ    (6)

X(t + 1) = (X_1 + X_2 + X_3) / 3    (7)

3- Exploitation and exploration (attacking prey and searching for prey): The prey being chased and attacked by the wolves represents the ability of the wolves to catch the prey. More precisely, this ability of the wolves can lead to the global optimum, which is the ability of exploitation. The value of A plays a significant role; in case |A| < 1, the grey wolves are obliged to assault the prey. In case |A| > 1, the grey wolves are forced to go away from the prey and look for another one. Algorithm 1 shows the pseudocode of the GWO algorithm [32].

Algorithm 1: Pseudocode of GWO algorithm
Begin
1. Initialize population
2. Initialize a, A and C
3. Calculate the fitness of each search agent
4. X_α = the best search agent
5. X_β = the second-best search agent
6. X_δ = the third-best search agent
7. While (t < maximum number of iterations)
8.   For each search agent
9.     Update the position of the current search agent by equation (7)
10.  End for
11.  Update a, A and C
12.  Calculate the fitness of the current search agent
13.  Update best solution
14.  Update X_α, X_β and X_δ
15.  t = t + 1
16. End While
17. Return X_α
End
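The update rules in Eq. 1-7 and Algorithm 1 can be summarized in a compact sketch. The following Python outline is only an illustration of the standard GWO update loop for a generic minimization problem; the function names, the population size and the sphere objective in the usage example are assumptions made for the illustration and are not taken from the paper.

import numpy as np

def gwo_minimize(objective, dim, bounds, n_agents=20, max_iter=100):
    # Minimal Grey Wolf Optimizer sketch following Eq. 1-7 (Mirjalili et al., 2014).
    low, high = bounds
    wolves = np.random.uniform(low, high, (n_agents, dim))      # initialize population
    alpha = beta = delta = None
    alpha_f = beta_f = delta_f = np.inf                         # best three fitness values so far

    for t in range(max_iter):
        # Evaluate agents and refresh the alpha, beta and delta leaders.
        for w in wolves:
            f = objective(w)
            if f < alpha_f:
                alpha_f, beta_f, delta_f = f, alpha_f, beta_f
                alpha, beta, delta = w.copy(), alpha, beta
            elif f < beta_f:
                beta_f, delta_f = f, beta_f
                beta, delta = w.copy(), beta
            elif f < delta_f:
                delta_f, delta = f, w.copy()
        a = 2 - 2 * t / max_iter                                 # 'a' decreases linearly from 2 to 0
        for i in range(n_agents):
            x_new = np.zeros(dim)
            for leader in (alpha, beta, delta):                  # one term per leader
                r1, r2 = np.random.rand(dim), np.random.rand(dim)
                A = 2 * a * r1 - a                               # Eq. 3
                C = 2 * r2                                       # Eq. 4
                D = np.abs(C * leader - wolves[i])               # Eq. 5
                x_new += leader - A * D                          # Eq. 6
            wolves[i] = np.clip(x_new / 3.0, low, high)          # Eq. 7: average of X_1, X_2, X_3
    return alpha, alpha_f

# Example usage on a simple sphere function (illustrative only).
best_pos, best_val = gwo_minimize(lambda x: float(np.sum(x ** 2)), dim=5, bounds=(-10, 10))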
IV. PROPOSED WORK
The task manager, which is called the cloud broker, is responsible for collecting and controlling the tasks submitted by cloud users. More precisely, the management process distributes the incoming tasks to the available resources (VMs) in the cloud datacenter. The main aim of the broker is to optimize some needed parameters, such as makespan, cost, resource utilization and energy consumption, by assigning the tasks to VMs so as to satisfy the optimization function.
The scheduling process is based on some parameters; the scheduler needs information about the resources during the task execution process. The Resource Information Server (RIS) is responsible for feeding the scheduler with this information by summarizing the data center information, such as CPUs, memories and all other information about the contained VMs. On the other hand, the scheduler assigns the tasks to the resources based on this information with respect to optimizing the given parameters [38].
In this research, GWO is employed as the scheduling engine for the cloud tasks because of its optimization behaviour with respect to consumed time; it was recently proposed by Mirjalili (2014). The GWO scheduler algorithm starts with random individuals (solutions), then evaluates these solutions according to their fitness values and updates the search agents' positions in order to create other solutions. Moreover, according to the evaluation process (fitness function), the algorithm reaches a near-optimal or optimal solution by keeping the solutions with the best fitness values. In this research, the modification is to make the fitness function contain multiple objectives instead of a single one; the objectives cost and makespan are used inside the fitness function of GWO in order to evaluate each solution. Thus, the approach is based on a multi-objective function and the Grey Wolf Optimization algorithm, and for that reason we call it MGWO.

A. Performance Metrics
1- Makespan: is the overall execution time needed to accomplish the tasks in the CC environment. A lower makespan value means better efficiency in CC, and this is achieved by making the scheduler assign tasks to the appropriate VM according to the task's information and the RIS information. Assume ET is the execution time of task t_n on VM Vm_m, where {t1, t2, ..., tn} are tasks, {Vm1, Vm2, ..., Vmm} are VMs and the execution times are {ET1, ET2, ..., ETn}. Eq. 8 shows the makespan fitness function [26].

Makespan = max{ET1, ET2, ..., ETn}    (8)

2- Cost: is the execution cost of executing a task on a specific VM; this cost relies on the length of the task (TaskSize), the cost of transferring the task to the specific VM and the storage of that VM. Eq. 9 shows the cost equation and Eq. 10 illustrates the fitness of the cost metric [26, 29].

Cost = …    (9)

F_cost = { … }    (10)
3- Evaluation of the Fitness Function: in this paper, the two performance metrics makespan and cost are included in the fitness function of the MGWO scheduler, which aims to minimize the fitness value; this is the modification to the traditional use of GWO. The fitness equation is presented in Eq. 11, where ti represents the i'th task from the task list.

Fitness = (Makespan + Cost)    (11)
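To make the multi-objective evaluation concrete, the sketch below shows one plausible way to score a candidate task-to-VM assignment by combining makespan and cost into a single fitness value, in the spirit of Eq. 8-11. It is an illustration only, not the paper's exact formulation: the execution-time model (task length divided by VM speed), the per-VM cost rates and the unweighted sum are assumptions made for the example.

import numpy as np

def evaluate_schedule(assignment, task_lengths, vm_speeds, vm_cost_rates):
    # Score a task-to-VM assignment; assignment[i] is the VM index of task i.
    assignment = np.asarray(assignment)
    n_vms = len(vm_speeds)
    # Execution time of each task on its assigned VM (simple length/speed model - an assumption).
    exec_times = task_lengths / vm_speeds[assignment]
    # Makespan: completion time of the most loaded VM (cf. Eq. 8).
    vm_loads = np.array([exec_times[assignment == j].sum() for j in range(n_vms)])
    makespan = vm_loads.max()
    # Cost: execution time weighted by the assigned VM's cost rate (assumed cost model, cf. Eq. 9-10).
    cost = float((exec_times * vm_cost_rates[assignment]).sum())
    # Combined fitness to be minimized (cf. Eq. 11).
    return makespan + cost

# Illustrative usage: 6 tasks scheduled on 3 VMs (all values made up).
tasks = np.array([400.0, 250.0, 900.0, 120.0, 600.0, 300.0])   # task lengths
speeds = np.array([100.0, 250.0, 500.0])                        # VM processing speeds
rates = np.array([0.1, 0.3, 0.6])                               # cost per unit of execution time
print(evaluate_schedule([0, 1, 2, 0, 2, 1], tasks, speeds, rates))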

V. SIMULATION RESULTS
The proposed algorithm is simulated using the CloudSim tool, whose platform is based on Java. All the experiments are validated on a personal computer with an Intel Core i7 processor, 16 GB RAM, and the Windows 8.1 operating system. The proposed MGWO differs from GWO by modifying the core fitness function to consider multiple objectives instead of just the single objective of makespan. The outcomes of the modified GWO (MGWO) are compared with the original GWO and the existing WOA technique, since WOA is a recently proposed optimizer by Mirjalili (2016), with various numbers of independent tasks (200, 400, 600, 800 and 1000) and different numbers of VMs (1, 2, 4 and 8), in terms of makespan, cost and degree of imbalance.
The simulation results showed minimum cost and total execution time compared with the other selected algorithms. In this simulation, each scenario is executed 10 times and then the average is calculated and taken into consideration. The average makespan for executed tasks using MGWO, GWO and WOA is illustrated in Fig. 1 – Fig. 4. It is obvious that MGWO has better performance than the existing WOA and traditional GWO in terms of makespan, because using cost together with makespan in the fitness function to evaluate the solutions makes for a better scheduling process, which directly affects the overall makespan. In addition, when the number of VMs equals one, all algorithms give the same results in both makespan and cost, since there are no other resources to schedule tasks onto.
The cost represents the execution cost of running an independent task on a particular VM. It depends on the task's length, the VM's storage and the cost of transmitting the task to the particular VM. Because the simulation settings are almost the same, as clearly shown in the results, there is no significant difference in terms of the cost metric. Fig. 5 – Fig. 8 show the cost of different numbers of executing tasks on various numbers of VMs.

Fig. 1: Makespan of various numbers of tasks when the number of VMs is 1.
Fig. 2: Makespan of various numbers of tasks when the number of VMs is 2.
Fig. 3: Makespan of various numbers of tasks when the number of VMs is 4.
Fig. 4: Makespan of various numbers of tasks when the number of VMs is 8.
Fig. 5: Scheduling cost of various numbers of tasks when the number of VMs is 1.
Fig. 6: Scheduling cost of various numbers of tasks when the number of VMs is 2.
Fig. 7: Scheduling cost of various numbers of tasks when the number of VMs is 4.
Fig. 8: Scheduling cost of various numbers of tasks when the number of VMs is 8.

The Degree of Imbalance (DI) measures the imbalance among VMs [36] using the following Eq. 5:

DI = (Tmax − Tmin) / Taverage    (5)

Where Tmax represents the maximum execution time of the VMs, and Tmin and Taverage denote the minimum and average execution times, respectively. Fig. 9 illustrates the DI experiments, which are performed for different numbers of independent tasks on 8 VMs. It clearly shows that MGWO achieved the lowest degree of imbalance, which means better scheduling balance.
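The degree-of-imbalance calculation is a one-liner once the per-VM execution times are known; the short sketch below, with made-up example values, simply restates the formula above.

def degree_of_imbalance(vm_times):
    # DI = (Tmax - Tmin) / Taverage over the per-VM total execution times.
    t_max, t_min = max(vm_times), min(vm_times)
    t_avg = sum(vm_times) / len(vm_times)
    return (t_max - t_min) / t_avg

# Example with hypothetical completion times (seconds) of 8 VMs.
print(degree_of_imbalance([12.0, 10.5, 11.2, 9.8, 12.4, 10.9, 11.7, 10.1]))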
Fig. 9: Degree of Imbalance of GWO, MGWO, and WOA on 8 VMs.

VI. CONCLUSION
Various meta-heuristic algorithms are employed in order to develop task scheduling methods for the CC environment. In this work, a new task scheduling method based on GWO (MGWO) is introduced by modifying the fitness function to combine multiple objectives in a single fitness value instead of using the single makespan objective. The proposed independent task scheduling based on both cost and makespan is executed in CloudSim. The performance of the proposed technique is compared with traditional GWO and WOA. The simulation results provided good outcomes in reducing makespan, cost, and degree of imbalance.

REFERENCES
[1] Mell, P., & Grance, T. (2011). The NIST definition of cloud computing.
[2] Joseph, A. D., Katz, R., Konwinski, A., Lee, G., Patterson, D., & Rabkin, A. (2010). A view of cloud computing. Communications of the ACM, 53(4).
[3] He, H., Xu, G., Pang, S., & Zhao, Z. (2016). AMTS: Adaptive multi-objective task scheduling strategy in cloud computing. China Communications, 13(4), 162-171.
[4] Lin, X., Wang, Y., Xie, Q., & Pedram, M. (2014). Task scheduling with dynamic voltage and frequency scaling for energy minimization in the mobile cloud computing environment. IEEE Transactions on Services Computing, 8(2), 175-186.
[5] Navimipour, N. J., Rahmani, A. M., Navin, A. H., & Hosseinzadeh, M. (2015). Expert Cloud: A Cloud-based framework to share the knowledge and skills of human resources. Computers in Human Behavior, 46, 57-74.
[6] Malawski, M., Juve, G., Deelman, E., & Nabrzyski, J. (2015). Algorithms for cost- and deadline-constrained provisioning for scientific workflow ensembles in IaaS clouds. Future Generation Computer Systems, 48, 1-18.
[7] Navimipour, N. J. (2015). A formal approach for the specification and verification of a trustworthy human resource discovery mechanism in the expert cloud. Expert Systems with Applications, 42(15-16), 6112-6131.
[8] Keshanchi, B., Souri, A., & Navimipour, N. J. (2017). An improved genetic algorithm for task scheduling in the cloud environments using the priority queues: formal verification, simulation, and statistical testing. Journal of Systems and Software, 124, 1-21.
[9] Alkhanak, E. N., Lee, S. P., & Khan, S. U. R. (2015). Cost-aware challenges for workflow scheduling approaches in cloud computing environments: Taxonomy and opportunities. Future Generation Computer Systems, 50, 3-21.
[10] Rimal, B. P., Jukan, A., Katsaros, D., & Goeleven, Y. (2011). Architectural requirements for cloud computing systems: an enterprise cloud approach. Journal of Grid Computing, 9(1), 3-26.
[11] Rimal, B. P., Choi, E., & Lumb, I. (2009, August). A taxonomy and survey of cloud computing systems. In 2009 Fifth International Joint Conference on INC, IMS and IDC (pp. 44-51). IEEE.
[12] Navin, A. H., Navimipour, N. J., Rahmani, A. M., & Hosseinzadeh, M. (2014). Expert grid: new type of grid to manage the human resources and study the effectiveness of its task scheduler. Arabian Journal for Science and Engineering, 39(8), 6175-6188.
[13] Ullman, J. D. (1975). NP-complete scheduling problems. Journal of Computer and System Sciences, 10(3), 384-393.
[14] Xu, Y., Li, K., He, L., & Truong, T. K. (2013). A DAG scheduling scheme on heterogeneous computing systems using double molecular structure-based chemical reaction optimization. Journal of Parallel and Distributed Computing, 73(9), 1306-1322.
[15] Singh, S., & Kalra, M. (2014, November). Scheduling of independent tasks in cloud computing using modified genetic algorithm. In 2014 International Conference on Computational Intelligence and Communication Networks (pp. 565-569). IEEE.
[16] Kaur, K., & Kaur, A. (2015). Optimal scheduling and load balancing in cloud using enhanced genetic algorithm. International Journal of Computer Applications, 125(11).
[17] Wang, T., Liu, Z., Chen, Y., Xu, Y., & Dai, X. (2014, August). Load balancing task scheduling based on genetic algorithm in cloud computing. In 2014 IEEE 12th International Conference on Dependable, Autonomic and Secure Computing (pp. 146-152). IEEE.
[18] Lakshmi, R. D., & Srinivasu, N. (2016). A dynamic approach to task scheduling in cloud computing using genetic algorithm. Journal of Theoretical & Applied Information Technology, 85(2).
[19] Singh, P., Dutta, M., & Aggarwal, N. (2017). A review of task scheduling based on meta-heuristics approach in cloud computing. Knowledge and Information Systems, 52(1), 1-51.
[20] Maqableh, M., & Karajeh, H. (2014). Job scheduling for cloud computing using neural networks. Communications and Network, 6(3), 191-200.
[21] Natesan, G., & Chokkalingam, A. (2018). Task scheduling in heterogeneous cloud environment using mean grey wolf optimization algorithm. ICT Express, 1-5.
[22] Khalili, A., & Babamir, S. M. (2017). Optimal scheduling workflows in cloud computing environment using Pareto-based Grey Wolf Optimizer. Concurrency and Computation: Practice and Experience, 29(11), 1-11.
[23] Sharma, M., & Garg, R. (2017, December). Energy-aware whale-optimized task scheduler in cloud computing. In 2017 International Conference on Intelligent Sustainable Systems (ICISS) (pp. 121-126). IEEE.
[24] Sreenu, K., & Sreelatha, M. (2017). W-Scheduler: whale optimization for task scheduling in cloud computing. Cluster Computing, 1-12.
[25] Reddy, G. N., & Kumar, S. P. (2017, October). Multi objective task scheduling algorithm for cloud computing using whale optimization technique. In International Conference on Next Generation Computing Technologies (pp. 286-297). Springer, Singapore.
[26] Masadeh, R., Sharieh, A., & Mahafzah, B. A. Humpback whale optimization algorithm based on vocal behavior for task scheduling in cloud computing.
[27] Li, K., Xu, G., Zhao, G., Dong, Y., & Wang, D. (2011, August). Cloud task scheduling based on load balancing ant colony optimization. In 2011 Sixth Annual ChinaGrid Conference (pp. 3-9). IEEE.
[28] Tawfeek, M. A., El-Sisi, A., Keshk, A. E., & Torkey, F. A. (2013, November). Cloud task scheduling based on ant colony optimization. In 2013 8th International Conference on Computer Engineering & Systems (ICCES) (pp. 64-69). IEEE.
[29] Zhou, Y., & Huang, X. (2013, November). Scheduling workflow in cloud computing based on ant colony optimization algorithm. In 2013 Sixth International Conference on Business Intelligence and Financial Engineering (pp. 57-61). IEEE.
[30] Kumar, P., & Verma, A. (2012). Independent task scheduling in cloud computing by improved genetic algorithm. International Journal, 2(5).
[31] Wang, T., Liu, Z., Chen, Y., Xu, Y., & Dai, X. (2014, August). Load balancing task scheduling based on genetic algorithm in cloud computing. In 2014 IEEE 12th International Conference on Dependable, Autonomic and Secure Computing (pp. 146-152). IEEE.
[32] Mirjalili, S., Mirjalili, S. M., & Lewis, A. (2014). Grey wolf optimizer. Advances in Engineering Software, 69, 46-61.
[33] Masadeh, R., Alzaqebah, A., Hudaib, A., & Rahman, A. A. (2018). Grey wolf algorithm for requirements prioritization. Modern Applied Science, 12(2), 54.
[34] Masadeh, R., Hudaib, A., & Alzaqebah, A. (2018). WGW: A hybrid approach based on whale and grey wolf optimization algorithms for requirements prioritization. Advances in Systems Science and Applications, 18(2), 63-83.
[35] Masadeh, R., Sharieh, A., & Sliet, A. (2017). Grey wolf optimization applied to the maximum flow problem. International Journal of Advanced and Applied Sciences, 4(7), 95-100.
[36] Yassien, E., Masadeh, R., Alzaqebah, A., & Shaheen, A. (2017). Grey wolf optimization applied to the 0/1 knapsack problem. International Journal of Computer Applications, 169(5), 11-15.
[37] Alzaqebah, A., & Abu-Shareha, A. A. (2019). Ant Colony System algorithm with dynamic pheromone updating for 0/1 knapsack problem. International Journal of Intelligent Systems and Applications, 11(2), 9.
[38] Menascé, D. A., Saha, D., Porto, S. C. D., Almeida, V. A., & Tripathi, S. K. (1995). Static and dynamic processor scheduling disciplines in heterogeneous parallel architectures. Journal of Parallel and Distributed Computing, 28(1), 1-18.
Causal Path Planning Graph Based on Semantic Pre-link Computation for Web Service Composition

Moses Olaifa, Department of ICT, Vaal University of Technology, Vanderbijlpark, South Africa, newmosesolaifa@yahoo.com
Tranos Zuva, Department of ICT, Vaal University of Technology, Vanderbijlpark, South Africa, tranosz@vut.ac.za

Abstract—The web has impacted development across different spheres of life by facilitating connection and communication between people and machines, with organizational productivity enhancement. Beyond connection and communication, access to functionalities via the same web for solving business tasks has increased its popularity. The idea of deploying functionalities on the web, termed Service Oriented Computing (SOC), has been a major research area for some time. A key area of research focus in SOC is service composition. Service composition deals with aggregating available services to address complex business processes or produce better functionalities. Due to the explosion in the number of published web services, a need to improve the performance of web service composition arises. One of the key research issues in service composition is providing an efficient web discovery approach that contributes to an improved web service composition. This work proposes an efficient web service composition framework based on causal path pre-computation.

Index Terms—Service Oriented Computing, web service, service composition

I. INTRODUCTION

The concept of web service is rooted in Service Oriented Architecture (SOA), a paradigm of Service Oriented Computing (SOC) that deals with the organization and provision of web-deployable software components called web services, which encapsulate different functionalities and business processes, from simple requests to complex business processes [1]. Web services are loosely coupled, self-describing and self-contained applications that can be discovered via their published descriptions and remotely invoked through the internet across different platforms using XML-based standards such as the Simple Object Access Protocol (SOAP) [2] [3]. Self-describing refers to the capability of a service to describe its operations and parameter requirements so that service brokers can dynamically determine the functionalities of the service and how it can be invoked. Its self-contained characteristic signifies its autonomy and platform-independent nature. In order to use any of the web services, a service request in the form of a service specification is required. Between the request and the delivery of the request result is a series of tasks, including search, discovery, invocation and execution of relevant services published by service providers. Publication of services includes the description of the functional and non-functional components provided by the services, possibly in machine-understandable formats. Available components are searched using different approaches in respect of any specification defined by the users. In some cases, a single service appropriate for a request may not be found, hence the need to compose a set of services that provides the required output.

The problem of service composition is a major problem in the dynamic and fast-growing web service environment [4] [5] [6]. More specific is the problem of time-efficient service discovery for the web service composition process. While most research works have focused on service composition, these works are based on conventional web discovery processes. Existing approaches underlying conventional web service discovery do not form a suitable basis for time-efficient service composition [17]. Some attempts have been made at integrating the web service discovery and composition processes. However, the problem of a lack of well-defined service discovery approaches underpinning the service composition approaches still persists. Web service composition requires more than conventional service discovery approaches for improved performance in the face of growing web services. This research work presents a framework for web service composition based on causal path pre-computation of service concepts.

II. RELATED WORK

Different approaches have been proposed to deal with issues surrounding web service composition [11] [12] [10] [8] [9]. No matter the approach used, central to any composition process is the service discovery process. This is required for identifying the component services contributing to the generation of the final composite service. For realizing an appropriate published service for a particular web service request specification, service retrieval through discovery of the particular service is performed. This involves the search through service registries, matchmaking of concepts, and ranking and selection of services. The main goal of any of the discovery approaches [13] [14] [15] is the retrieval of appropriate services for service requests. Performance evaluation is based on the ability to retrieve the required service under the assumption that the request will be satisfied by a single appropriate service.



However, there are situations where a single service that satisfies a particular service request may not be available. Aggregation of multiple services is then required before such requests can be satisfied. Each of the component services required must be discovered. Even though the majority of composition approaches are based on conventional service discovery approaches, these discovery approaches are not well defined for an efficient composition process. This problem therefore requires a well-defined service discovery approach underlying web service composition.

A Causal Link Matrix (CLM) that computes and stores request input- and output-related web services was presented in [16]. The CLM interfaces between the service discovery and service composition processes. Input and output parameters of the service requests are used to index the rows and columns of the CLM, and the corresponding services are discovered. These are used in the generation of the composition plan. However, discovery of relevant services for the CLM is underpinned by conventional service discovery approaches and is provided at composition time. This in effect impacts the time efficiency of the composition process. Furthermore, discovery of services is strictly based on request input and output parameters, with a requirement that requesters declaratively define the operations that will be involved in the composition. While these may enhance the composition for an optimized and efficient composition plan, solutions may not always be found even though they exist. [17] in their work provided a theoretical analysis of the dependency of service composition on service discovery. This analysis separates the discovery problem for service composition into an input discovery problem and an output discovery problem. An inverted index map is constructed for the recovery of input-relevant services or output-relevant services using a relevantIO function. However, with this definition of the service discovery problem, solutions may not always be found even if a cumulative set of partial solutions that can fulfill the request exists. In view of the above challenges, a causal path pre-computation framework for efficient web service composition is presented in this work. This framework has the ability to improve on the time efficiency of web service composition approaches.

III. SEMANTIC PRE-LINK COMPUTATION

The semantic pre-link computation map is an overlay structure that interacts with the discovery system. The service pre-link computation is the upper layer responsible for maintaining a partial link between concept pairs considered as binary-value parameters. The composition taxonomy maintained by the service pre-link computation engine is central to the realization of the universal composition of any received service request. Each of the service concepts described in the composition taxonomy is annotated with a link vector that captures the relationship between the concept and published services with similar concepts. This similarity is measured by the level of subsumption relationship existing between a taxonomy concept and related concepts defined in the web service publication. The captured link vector information is utilized as service pointers in the construction of possible adjacent service nodes for composition in later sections. With the constructed link vector, the discovery/matching process is partially reduced to a look-up task. Hence, a considerable amount of time is saved, which translates into an enhanced efficiency of both the service discovery and composition modules.

When a service is published, the pre-link computation taxonomy is updated with any newly defined concept not existing in the taxonomy, together with a ranked list of input-relevant services based on their subsumption relationship. The published service parameters are sent as a request to the discovery system for input-relevant services. Before any update of the taxonomy is done by the pre-computation module, taxonomy concept matching is initiated by identifying the concept node that matches the published service input concept(s). The degree of similarity between the corresponding concept node(s) and the defined ontology concept is computed and the link vector is subsequently updated. For an exact match, the update of the immediate parent and child node of the matched concept is performed. This is necessary in cases where the services that defined an exact match for a concept no longer exist; such cases will consider the plug-in (subsume) match. The link vector associated with each concept in the hierarchy is represented as out_c1 = [⟨in_s_i, rel⟩, ⟨in_s_j, rel⟩, ..., ⟨in_s_n, rel⟩]. The pair ⟨in_s_i, rel⟩ represents the information on the subsumption relationship with the input concepts of service s_i, where in_s_i and rel denote the service s_i and the relationship between the concepts respectively. The entries are ranked based on the similarity measure, with exact matches ranked highest and subsume matches ranked least.
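To make the pre-link idea concrete, the following is a minimal Python sketch of such a concept-to-services map (hypothetical names such as PreLinkMap, publish and input_relevant; this is an illustration under the stated assumptions, not the authors' implementation).

# Minimal sketch of a semantic pre-link map: each taxonomy concept keeps a
# link vector of (service, relation) pairs, ranked with exact matches before
# subsume (plug-in) matches. Hypothetical names; not the authors' code.
from collections import defaultdict

RANK = {"exact": 0, "plug-in": 1}  # assumed two-level subsumption ranking

class PreLinkMap:
    def __init__(self):
        self.links = defaultdict(list)  # concept -> [(service_id, relation), ...]

    def publish(self, service_id, input_concepts, relation="exact"):
        """Update the link vector of every input concept of a newly published service."""
        for concept in input_concepts:
            vec = self.links[concept]
            vec.append((service_id, relation))
            vec.sort(key=lambda pair: RANK.get(pair[1], len(RANK)))

    def input_relevant(self, output_concepts):
        """Look up, instead of re-discovering, services whose inputs overlap the given outputs."""
        return {sid for c in output_concepts for sid, _ in self.links.get(c, [])}

# Example: once services are published, discovery becomes a dictionary look-up.
plm = PreLinkMap()
plm.publish("s1", {"City"})
plm.publish("s2", {"City", "Date"}, relation="plug-in")
print(plm.input_relevant({"City"}))  # -> {'s1', 's2'} (set order may vary)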

IV. CAUSAL DIRECTED GRAPH FOR UNIVERSAL SERVICE COMPOSITION

The composition problem is defined as a causal directed graph B [18] described by a set of nodes and directed edges. The arrows can be interpreted as the causal relationship between causes and effects, where the cause of any node (effect) exists in the graph. Suppose there is a path from s_i to s_j; then s_i has a causal effect on s_j in the system. Readers can refer to [19] for more details.

Assume that edges g(in_s_0) and g(out_s_{n+1}) correspond to the boundary edges of B for a composition space C ⊂ S, that is, the incident edges on service s_0 and the emergent edges from service s_{n+1} defining the start and end nodes of B. In the path from s_0 to s_{n+1}, s_i traverses a set of descendant services adj_s ⊂ C, and we say the set adj_s is a requisite neighbor set to s_i. Each service adj_s_i traverses a new set adj_s on the path to s_{n+1}. If adj_s_i and adj_s_j are services in B with adj_s_i traversing adj_s_j, the emergence of adj_s_j is conditionally dependent on adj_s_i, and the path between adj_s_i and adj_s_j is referred to as the causal effect path.

Given a set of requisite neighboring services adj_s = {adj_s_1, ..., adj_s_n} over the services set [s_0, ..., s_{n+1}], we define a mapping function ∮(s) for s ∈ B that traverses the corresponding requisite set adj_s in equation (1):

    ∮(s_i) = { adj_s ∈ C | out_s_i ⊗ adj_s_In ≠ {∅} }    (1)

where s and adj_s are concept vectors s = {in_s, out_s} and adj_s = {adj_s_In, adj_s_Out} respectively. Suppose B is bounded by s_0 = {∅, r_in} and s_{n+1} = {r_out, ∅}; the sum of all requisite services in the boundary (s_0, s_{n+1}] generates a predictor set for r = {r_in, r_out}. We define a predictor set over the services (s_0, s_{n+1}) = {s_i | s_0 < s_i < s_{n+1}} and a set of requisite neighboring services adj_s as

    B_P = ∑_{adj_s_i ∈ C} ( ∮(s_i) | s_i ∈ C )    (2)

where s_i, adj_s_i ∈ C and s_0 < adj_s_i ≤ s_{n+1}.

The predictor network B_P models the dependencies between the services and the corresponding requisite services in C, with an ordering that satisfies the requirements of r = {in_r, out_r}. With respect to B_P, each adjacent service node adj_s_i captures the dependency on its set of requisite service nodes. Given a set of requisite services adj_s over services s_i bounded by r = {r_in, r_out}, it is assumed that B_P factors into the composition of composite services and atomic services corresponding to the products of paired service node edges (s_{i-1}, s_i). For any two adjacent service nodes s_{i-1} and s_i in B_P, where s_{i-1} is a parent service node to s_i, we define the cost of the causal effect path c(s_{i-1}, s_i) as the incident edge cost c(in_s_i) on service node s_i. Therefore, the cost of a node c(s_i) is defined as the sum of the costs of all causal effect edges c(in_s_i) on s:

    c(s_i) = ∑_{i=1}^{k} c(in_s_i)    (3)

Let the adjacent service nodes generated from the set of requisite services by the mapping function ∮(s_i) for node s_i be the set {adj_s_1, ..., adj_s_k}, where adj_s_i ∈ B. We have also defined the predictor set B_P and the edge cost between service node pairs c(s). We can now define the causal directed graph for a Universal Composite Service for the service query r = {r_in, r_out}. In the composition problem, the Universal Composite Service is a causal directed graph described by the ordering of a set of atomic nodes and paired nodes defined over a set of s_i ∈ C and bounded by the service query s_0 = {∅, r_in} and s_{n+1} = {r_out, ∅}.

The causal directed graph B* for a Universal Composite Service over a set of services in C is a tuple {B_P, c(s), ∮(s)}, where B_P is the predictor set over the services in C, ∮(s) is a mapping function that traverses the corresponding requisite services adj_s ⊂ C for a service s_i, and c(s) is the cost of traversing an adjacent service node adj_s_i from a requisite service s_i.

Algorithm 1 generates a universal composition plan for a service request r = {r_in, r_out}. Lines 1 to 3 initialize the service request input parameters and the expected output parameters. Each of the request input and output sets has at least two parameters. Initially, the set of request input parameters req_in is presented for discovery of the input-relevant services from the semantic pre-link map. In the corresponding services set [⟨s_1, rel⟩, ⟨s_2, rel⟩, ..., ⟨s_n, rel⟩], the highest ranked service is selected for each concept parameter. These services form the initial set of services for the universal composition plan. For each step of the composition, matInp and unmatInp track the set of service input parameters that have been discovered, and the outstanding input and output parameters prepared for the next step in the composition process. For each service discovered in the current step of the composition, the output parameters are compared with the expected parameter set req_out. For all outstanding request output parameters, discovery of input-relevant services is performed until req_out is empty.

Algorithm 1 CompPlan(r_in, r_out)
1: req_in = {r_in_i}
2: req_out = {r_out_i}
3: sel_comp = ∅
4: while req_out ≠ ∅ do
5:   matInp = compos_conc(c, req_inp)
6:   unmatInp = ∅
7:   unmatInp = req_inp \ matInp
8:   req_in = ∅
9:   for all c_i ∈ matInp do
10:    s_i = argmax_{c_i} [⟨s_1, rel⟩, ⟨s_2, rel⟩, ..., ⟨s_n, rel⟩]
11:    if s_i ∉ sel_comp then
12:      sel_comp = sel_comp + s_i
13:      comp_req_inp = comp_req_inp + out_s_i
14:    else
15:      continue
16:    end if
17:  end for
18:  for all out_s_i ∈ comp_req_inp do
19:    matOut = outMatch(req_out, comp_req_inp)
20:  end for
21:  req_out = req_out \ matOut
22:  req_inp = comp_req_inp \ matOut
23:  req_inp = unmatInp ∪ req_inp
24: end while
25: return sel_comp

V. BACKWARD PRUNING

After the generation of the causal directed universal composition, pruning of the universal composition is performed to realize a minimum predictor set minB_P that will produce the optimal composition plan. The pruning starts from the goal service nodes and traces back until the initial service nodes are reached. Traversing the graph backwards enables the elimination of service nodes that do not have a causal effect. Therefore, the minimum predictor set that produces the optimal composition plan is given as:

    minB_P = { min ∑_{i=n+1}^{0} ( ∮(s_i) | s_i ⊗ s_{i-1} )  ∀ s_i ∈ C_N }    (4)

All the services that are not included in the minimum predictor set have no causal effect on the goal service nodes.
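As an illustration of the backward pruning in equation (4), the following is a minimal Python sketch (hypothetical graph representation and names; not the authors' implementation): starting from the goal node, it walks the causal edges in reverse and keeps only services that lie on some causal path to the goal.

# Minimal sketch of backward pruning over a causal directed graph. 'edges' maps
# each service to the services it feeds (its causal children); only nodes that
# can reach the goal are kept, everything with no causal effect is discarded.
def backward_prune(edges, goal):
    # Build the reverse adjacency: child -> set of parents (causal predecessors).
    parents = {}
    for src, dsts in edges.items():
        for dst in dsts:
            parents.setdefault(dst, set()).add(src)
    kept, stack = {goal}, [goal]
    while stack:                      # walk backwards from the goal node
        node = stack.pop()
        for p in parents.get(node, ()):
            if p not in kept:
                kept.add(p)
                stack.append(p)
    return kept

# Example: s3 has no path to the goal, so it is pruned from the universal plan.
edges = {"s0": {"s1", "s3"}, "s1": {"s2"}, "s2": {"goal"}, "s3": set()}
print(sorted(backward_prune(edges, "goal")))  # ['goal', 's0', 's1', 's2']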

VI. EXPERIMENT AND RESULT DISCUSSION

The presented framework is evaluated for scalability and time efficiency. The performance of this approach is compared with composition based on the conventional service discovery approach. For conventional discovery, all component services for the universal composition are realized without the aid of the semantic pre-link map. In order to observe the scalability and time efficiency of the composition process, the number of web services in the service environment is varied between 150 and 4000 services. The metrics used in the evaluation are given as follows:

• Universal composition size: the number of services involved in the universal composition.
• Optimal composition size: the number of services realized after pruning the universal composition plan.
• Universal composition time: the time taken from request processing to the completion of the composition.

At each increase in the number of web services in the environment, the experiment is performed 10 times. Figure 1 shows the processing time required by the different approaches for universal composition generation. The conventional discovery (conv_disc) shows a tremendous increase in processing time as the number of web services grows. As the number of web services increases, a higher processing time is required for the search and matching of service concepts.

Fig. 1. Universal Composition Time.

Unlike the composition based on pre_sem, each service required in the composition process has to be discovered directly from the service environment, which translates to a higher processing time requirement. In the case of pre_sem, discovery of component services is based on the semantic pre-link map, which reduces the required time. Figure 2 shows the number of service nodes involved in the universal composition plan for each of the approaches. In each of the experiments, conv_disc is observed to involve a higher number of services in the composition than pre_sem. This may be due to the fact that composition based on conventional service discovery is directly exposed to the service description, and subsumption relationships between concepts other than exact matching are also considered. This may bloat the number of services involved in the composition unnecessarily.

Fig. 2. Universal Composition Size.

The above results are recorded for similar requests throughout the experiments. However, there may be slightly varying results if the requests are changed from one experiment to another. Overall, the impact of searching and reasoning for every component service needed for the composition using conventional service discovery increases the processing time for composition.

VII. CONCLUSION AND FUTURE WORK

This study presents a novel approach to improve the scalability and time efficiency of web service composition. It combines a semantic pre-link computation phase with a causal directed graph to realize an enhanced discovery approach for web service composition. The semantic pre-computation phase performs a pre-discovery of relevant services according to the different input and output definitions of the web services. This saves a reasonable amount of the time spent during the composition process. Furthermore, the causal directed graph allows for the aggregation of a smaller number of component services, which improves the scalability of the service composition.

This work assumes a largely stationary web service environment. For future research, there is a need to improve the pre-processing phase to fully address the dynamic nature of the web service environment. With service creation and deletion done at random, this may impact the expected time for the composition. In addition, component services of existing composite services can be deleted without updating the composite services. Therefore, more work is required in the area of changing web service environments.

REFERENCES
[1] Mike P. Papazoglou, "Service-oriented computing: Concepts, characteristics and directions." In Proceedings of the Fourth International
Conference on Web Information Systems Engineering, 2003. WISE
2003., pp. 3-12. IEEE, 2003.
[2] G. Mein, S. Pal, G. Dhondu, T.K. Anand, A. Stojanovic, M. Ghosein, P.M. Oeuvray, "Simple object access protocol." U.S. Patent No. 6,457,066. 24 Sep. 2002.
[3] O. Hatzi, D. Vrakas, M. Nikolaidou, N. Bassiliades, D. Anagnostopou-
los, I. Vlahavas, An integrated approach to automated semantic web
service composition through planning. IEEE Transactions on Services
Computing. 2011 Apr 7;5(3):319-32.
[4] S. Kalasapur, Kumar M, B.A. Shirazi, ”Dynamic service composition
in pervasive computing.” IEEE Transactions on Parallel and Distributed
Systems. 2007 Jul;18(7):907-18.
[5] K. Fujii, T. Suda, Semantics-based context-aware dynamic service composition. ACM Transactions on Autonomous and Adaptive Systems (TAAS). 2009 May 1;4(2):12.
[6] X. Wang, J. Cao, Y. Xiang, Dynamic cloud service selection using
an adaptive learning mechanism in multi-cloud computing. Journal of
Systems and Software. 2015 Feb 1;100:195-210.
[7] P. Rodriguez-Mier, C. Pedrinaci, M. Lama, M. Mucientes, An integrated
semantic web service discovery and composition framework. IEEE
transactions on services computing. 2015 Feb 11;9(4):537-50.
[8] A. Vakili, NJ. Navimipour, Comprehensive and systematic review of the
service composition mechanisms in the cloud environments. Journal of
Network and Computer Applications. 2017 Mar 1;81:24-36.
[9] Y. Lu, X. Xu, A semantic web-based framework for service composi-
tion in a cloud manufacturing environment. Journal of manufacturing
systems. 2017 Jan 1;42:69-81.
[10] RB. Lamine, RB. Jemaa, IA. Amor, Graph planning based composition
for adaptable semantic web services. Procedia Computer Science. 2017
Jan 1;112:358-68.
[11] M. Liu, M. Wang, W. Shen, N. Luo, J. Yan, A quality of service
(QoS)-aware execution plan selection approach for a service composition
process. Future Generation Computer Systems. 2012 Jul 1;28(7):1080-9.
[12] D. Wang, Y. Yang, Z. Mi, A genetic-based approach to web service
composition in geo-distributed cloud environment. Computers Electrical
Engineering. 2015 Apr 1;43:129-41.
[13] M. Klusch, Semantic web service description. In CASCOM: Intelligent Service Coordination in the Semantic Web, 2008 (pp. 31-57). Birkhäuser Basel.
[14] M. Klusch, P. Kapahnke, iSeM: Approximated reasoning for adaptive hybrid selection of semantic services. In Extended Semantic Web Conference, 2010 May 30 (pp. 30-44). Springer, Berlin, Heidelberg.
[15] G. Priyadharshini, R. Gunasri, A survey on semantic web service
discovery methods. International Journal of Computer Applications.
2013 Jan 1;82(11).
[16] EG. Da Silva, LF. Pires, M. Van Sinderen, Towards runtime discovery,
selection and composition of semantic services. Computer communica-
tions. 2011 Feb 15;34(2):159-68.
[17] P. Rodriguez-Mier, C. Pedrinaci, M. Lama, M. Mucientes, An integrated
semantic web service discovery and composition framework. IEEE
transactions on services computing. 2015 Feb 11;9(4):537-50.
[18] S. Greenland, J. Pearl, JM. Robins, Causal diagrams for epidemiologic
research. Epidemiology. 1999 Jan 1;10:37-48.
[19] P. Spirtes, CN. Glymour, R. Scheines, D. Heckerman, C. Meek, G.
Cooper, T. Richardson, Causation, prediction, and search. MIT press;
2000.
[20] K. Czajkowski, S. Fitzgerald, I. Foster, C. Kesselman, Grid information
services for distributed resource sharing. In Proc. 10th IEEE Symp. on
High Performance Distributed Computing. 2001.
[21] National Center for Biotechnology Information.
http://www.ncbi.nlm.nih.gov

Accelerating Stochastic Gradient Descent using Adaptive Mini-Batch Size

Muayyad Saleh Alsadi, Computer Science Dept., Princess Sumaya University for Tech., Amman, Jordan, muayyad.a@opensooq.com
Rawan Ghnemat, Computer Science Dept., Princess Sumaya University for Tech., Amman, Jordan, r.ghnemat@psut.edu.jo
Arafat Awajan, Computer Science Dept., Princess Sumaya University for Tech., Amman, Jordan, awajan@psut.edu.jo

Abstract—Training Artificial Neural Networks takes a long time to converge and achieve acceptable accuracy. The proposed method alternates between two modes: a fast-forward mode and a normal mode. The fast-forward mode iterates faster than the normal mode by using a smaller number of samples in each mini-batch. Cycling between those two modes in an adaptive way is driven by accuracy change, selectively using the faster mode as long as it gives good results; otherwise, it falls back to the normal mode. This way, training becomes feasible even on commodity CPUs. Our approach was tested on a commodity CPU on the Pets-37 dataset, obtaining an accuracy of 91% in less than an hour, and on the Birds-200 dataset, obtaining an accuracy of 72% in less than two and a half hours.

Index Terms—Artificial Neural Network; Convolutional Neural Networks; Stochastic Gradient Descent; Adaptive Batch Size; Deep Learning

I. INTRODUCTION

Many types of ANN architectures have been applied to a wide range of applications[1], including: Multi-Layer Neural Networks, Unsupervised Learning for Deep Architectures, Deep Generative Architectures, and Convolutional Neural Networks (CNNs, or more commonly ConvNets)[2], which were inspired by how the human visual cortex is assumed to work.

The most basic form of ANN is the Perceptron, a single node having a single output defined to be the weighted summation of its input; those weights are the trainable parameters. Deep Neural Networks (DNNs)[1] are Multi-Layer Neural Networks with two or more hidden layers.

A CNN is similar to an ANN, but its trainable weights are the values of many convolutional matrix operators called kernels or filters, operating on an input within a spatial context. In the case of colored image input, the matrices of each color channel are stacked into a 3D volume called a "Tensor" (higher-dimensional stacked matrices). Several filters can be implemented in the form of a convolution matrix, like edge-detection, sharpening, smoothing, blurring, and pattern-matching filters. The size of each convolutional filter matrix, also called the "kernel size" or the "neighborhood size", specifies the receptive field of the filter. ANN techniques can be used to learn specific values in those many convolutional matrices, resulting in image filters that get activated when exposed to specific visual features or structures.

A typical basic design of a CNN model starts with an input image of a certain width and height W_i × H_i; in the case of color images, that is a volume of size W_i × H_i × 3. That volume is fed to a sequence of convolution layers of a certain kernel size and depth (number of filters). A pooling layer follows (maximum pooling or average pooling). The network goes deeper by alternating many convolution and pooling layers. The objective of the design is to form a flat signal with no spatial dimension (width = 1 and height = 1) so that the signal is along the depth axis, which will be the signal of the output classes; an example is seen in Figure 1, showing the design of LeNet[3]. The objective is achieved using strides on some layers (pooling or convolutional), reducing the width and height of the output, or by having a convolutional filter whose kernel size matches its input size.

Fig. 1. Design of LeNet[3]

The final output o_i from the last layer is processed with the Softmax function (equation 1) to form probabilistic-like values.

    softmax(o_i) = e^{o_i} / ∑_{j=1}^{n} e^{o_j}    (1)
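As a quick illustration of equation (1), the following is a minimal NumPy sketch (not part of the original paper); subtracting the maximum is a common numerical-stability trick that does not change the result.

# Minimal NumPy sketch of the softmax in equation (1); not from the paper.
import numpy as np

def softmax(o):
    e = np.exp(o - np.max(o))  # shift by the max to avoid overflow
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))  # approx. [0.659 0.242 0.099]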



This paper proposes a method to accelerate the convergence of Artificial Neural Networks (ANNs) by alternating between different settings, basically a normal slower mode and a faster mode, adaptively fast-forwarding parts of the training process that show no significant increase in accuracy.

After the "Literature Review", the proposed method is described in Section III. Empirical experiments are conducted in Section IV. Before ending with the conclusion in Section VI, the proposed method is discussed and analyzed in Section V.

This paper used two datasets:
• "The Caltech-UCSD Birds-200-2011 Dataset"[35], or birds-200 for short.
• "The Oxford-IIIT Pet Dataset"[31], or pets-37 for short.

The accuracy is measured by sampling from a 10% partition that is never presented during training. A small part of each dataset (10% in this research) is set aside and not presented during the training phase, validating generalization and avoiding over-fitting.

II. LITERATURE REVIEW

The deeper the network, the harder it gets to train (convergence needs more operations, more time, and requires more labeled samples). Some methods focused on having a more efficient design, while others focused on the training process for the same network design.

SqueezeNet[5][6], MobileNet[7], and MobileNet v2.0[8] delayed down-sampling using strides (in pooling and convolution) toward the end of the network, and instead used a different way to reduce complexity and the number of trainable parameters early, namely the concept of separable operators[9]: decomposing a 3×3 filter into two operators, a depth-wise 3×3 filter and a point-wise 1×1 filter. This technique was also used by [10] with batch normalization. Szegedy et al.[11] used auxiliary classifier branches connected to early intermediate layers and added a fraction of the loss (0.3) of these classifiers to the total loss of the network; this forces the network to discriminate in early shallow stages. Later they[12] suggested batch-normalized auxiliary classifiers, lowering parameter count, avoiding bottlenecks, factorizing convolutions, and model regularization via label smoothing.

Without changing the network design, accelerating training has been studied for a long time. Yann LeCun[13] was one of the early researchers studying various techniques to enhance the process of training ANNs. One of those techniques is "Stochastic Learning" or Stochastic Gradient Descent (SGD), which is done by taking small random samples (mini-batches) instead of the whole batch of training data; it converges much faster than iterating over all training data as in "Batch Learning". It is not only fast to converge but also better at handling noise and non-linearity. That is why batch learning was considered inefficient[14].

Fig. 2. Graph from [4] showing highlighted parts that can hypothetically be fast-forwarded

Several methods have been used to accelerate the convergence of the back-propagation of the Stochastic Gradient Descent (SGD) algorithm based on different factors, like SGD momentum[15], which helped training deep and recurrent neural networks overcome random poor initialization using tuned momentum methods. Adaptive sub-gradient methods (AdaGrad)[16] and stochastic objective functions were used in Adam[17], and ADADELTA[18] uses a per-dimension adaptive learning rate. Krizhevsky in AlexNet[19] trained two parts of a convolutional network in parallel using two GPU cores. Mini-batch size as a factor for the training of deep neural networks was used in distributed synchronous SGD with large mini-batches[20]; the authors presented how to achieve better accuracy in a much shorter time by using a large mini-batch size of 8192 images. Fine-tuning works by adapting an existing pre-trained source model by only modifying the weights of the top layer so that it fits the target task. The Model-Agnostic Meta-Learning (MAML)[21] approach accelerates that by training a model called a meta-learner during a meta-learning phase, treating the entire set of tasks as training examples. This approach does not involve varying the mini-batch size during training. AdaNet[23] uses a general theoretical method to optimize a neural network structure and its parameters by balancing between the theoretical model and experimental results based on specific data.

Many papers studied adjusting learning rates during training. The Linear Scaling Rule[20] suggests multiplying the learning rate by the same factor k that is used to multiply the batch size, which can be rephrased as "to compensate for the slowing resulting from a k× larger batch size, use larger approaching steps by multiplying the learning rate by k". Cyclic Learning Rate (CLR)[24] uses a cyclic triangular learning-rate wave that ranges between two values. The warm restarts method[25] used a similar cyclic learning-rate wave, but with a cosine shape. Super-Convergence[26] suggested using large values of the learning rate besides the cyclic technique. Ruder's survey[27] reviewed many gradient descent optimization algorithms and explained many strategies used to enhance this algorithm. Some researchers benchmarked different methods and setups[28][29].

One paper[30] suggested increasing the batch size instead of decaying the learning rate.

Based on this literature review, the proposed method is different because it focuses on batch size instead of learning rate, but unlike [30] it is not a one-directional change; it alternates between two or more settings. Cycling between large and small batch sizes is not periodic but rather adaptive, based on the validation accuracy observed in the preceding run.

TABLE I
OVERHEAD OF A 3×3×128 CONVOLUTION IN INCEPTION, ALMOST HALVED USING SEPARABLE OPERATORS

Name        | Input      | Kernel  | Output     | Weights             | Mults
Separable version:
point-wise  | 28×28×192  | 1×1×96  | 28×28×96   | 1×1×96×192          | 28×28×96×192
depth-wise  | 28×28×96   | 3×3×128 | 28×28×128  | 3×3×128×96          | 28×28×3×3×128×96
Total       |            |         |            | 129K                | 101M
Non-separable version:
original    | 28×28×192  | 3×3×128 | 28×28×128  | 3×3×128×192 = 221K  | 28×28×3×3×192×128 = 173M

III. PROPOSED PROCEDURE

In its most basic form, the proposed method is shown in Algorithm 1, which has two modes: a normal, slower mode with a normal mini-batch size, and a faster one with a smaller mini-batch size called "fast-forward mode". The two batch sizes of the two modes (normal and fast) are hyper-parameters that depend on how much boost we want to get; using half the batch size in fast-forward mode results in twice as many iterations in the same time period. For example, one can have the normal batch size be 128 and the smaller one be 64. Instead of periodically cycling between those two modes, keep using the fast mode as long as accuracy is increasing, and switch to the slower normal mode when it is not increasing.

Algorithm 1 Adaptive mini-batch size by alternating two batch sizes
1: let acc_old ← 0
2: let batch_size ← small_batch
3: while exit criteria do    ▷ like number of iterations
4:   run_training_batch()
5:   let acc_new ← evaluate()
6:   if acc_new > acc_old then
7:     let batch_size ← small_batch
8:   else
9:     let batch_size ← normal_batch
10:  let acc_old ← acc_new

A more generic Algorithm 2 uses a custom criterion to switch modes, and custom hyper-parameters for each mode, like batch size, learning rate, dropout rate, regularization factors, and number of iterations.

Algorithm 2 Generic adaptive mini-batch size by alternating two configurations
1: initialize normal_batch_size    ▷ normal settings
2: initialize normal_learning_rate
3: initialize ff_batch_size    ▷ fast-forward settings
4: initialize ff_learning_rate
5: let acc_old ← 0
6: let mode ← normal
7: while exit criteria do    ▷ like number of iterations
8:   run_training_batch()
9:   let acc_new ← evaluate()
10:  if ff criteria then
11:    let mode ← ff
12:  else
13:    let mode ← normal
14:  let acc_old ← acc_new

It is reasonable to assume that a given fully utilized machine has a constant throughput (regardless of batch size), which is the rate of items per second (or images per second in our case) it can process. This is not the case when using small batch sizes on special hardware of large capacity like GPUs, but it is the case for commodity CPUs. In other words, the time needed to process a batch is linearly proportional to the number of items in the batch.

Assuming we want a boost factor of n, so that the faster mode is n times faster than the normal mode, its batch size would be normal_batch_size/n, and the number of iterations in each step can be set to normal_iterations/n so that the time spent in each step is the same regardless of the mode. In other words, we will be doing n times more updates in the same period of time.

Since the proposed method has only two modes, the learning-rate hyper-parameter of each mode can be handpicked and tuned.

One can mix and match settings for each mode depending on the needed boost. Examples of hyper-parameter choices for the two modes:
• Normal batch size, normal learning rate, and number of iterations
  – For example, batch size = 64, learning rate = 0.01, iteration count = 100
• Fast-forward hyper-parameter examples:
  – 2× setup: halve the batch size, keep the same learning rate, and use 2× the number of iterations
  – 10× setup: 1/10 of the batch size and 1/2 of the learning rate, with 10× the number of iterations
  – ...etc.

The fast-forward criterion can be defined in multiple ways. The simplest one is "if the new accuracy is better than the old one": if we are getting better, keep going using the fast-forward mode; why use the slower mode if the faster mode is enough to increase the accuracy?

Another criterion can be defined based on the number of iterations with stalled accuracy. For example, if accuracy did not get better after three consecutive iterations, switch to normal mode; else keep going using fast-forward mode.

One is not limited to only two configurations; for example, one can define three or more modes, or even an arbitrary number of modes. The general form would be: if accuracy has stalled for more than a threshold of iterations, adjust the parameters like this:

    h′ = factor × h

where factor is a multiplier and h is a hyper-parameter like batch size, number of steps, or learning rate; otherwise, reset to the initial fast-forward settings.
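The following is a minimal, framework-agnostic Python sketch of the two-mode loop of Algorithms 1 and 2; train_batches and evaluate are hypothetical stand-ins for the framework's training step and validation pass, and the concrete settings are illustrative, not the authors' code.

# Minimal sketch of the adaptive two-mode training loop of Algorithms 1 and 2.
# train_batches() and evaluate() are hypothetical callables supplied by the
# caller; this is an illustration, not the authors' implementation.
NORMAL = {"batch_size": 64, "learning_rate": 0.01, "iterations": 100}
FAST   = {"batch_size": 32, "learning_rate": 0.01, "iterations": 200}  # 2x setup

def adaptive_training(train_batches, evaluate, total_steps):
    acc_old, mode = 0.0, FAST          # start in the fast-forward mode
    for _ in range(total_steps):
        train_batches(**mode)          # run one step's worth of mini-batches
        acc_new = evaluate()           # accuracy on the held-out 10% partition
        # Fast-forward criterion: keep the fast mode while accuracy improves,
        # otherwise fall back to the normal (slower, lower-risk) settings.
        mode = FAST if acc_new > acc_old else NORMAL
        acc_old = acc_new
    return acc_old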

IV. EMPIRICAL EXPERIMENTS

A. The setup

Using TensorFlow 1.12, Inception V1[11] pre-trained on ImageNet 1K[32] is fine-tuned[33][34] on "The Caltech-UCSD Birds-200-2011 Dataset" (Birds-200 task)[35] and "The Oxford-IIIT Pet Dataset" (Pets-37 task)[31] by only training the last fully connected layer, except when otherwise noted. Training was done on a laptop having a 7th generation Intel CPU (specifically i7-4710MQ@2.50GHz).

B. Effectiveness of smaller batch size

The test was done 4 times using batch sizes of 200, 100, 50, and 10 with a fixed learning rate of 0.01. Metrics measured on the batch while training (like cross-entropy loss) will be over-estimated, as they give results based on the small, non-representative batch (for example, the ten images in the batch). That is why 10% of the dataset was used for validation (more than one thousand images, independent of the training batch size) to evaluate the model periodically. Accuracy on that dataset was recorded and the results are shown in Figure 3 and Table II.

TABLE II
ACCURACY OVER ITERATIONS IN STEPS FOR DIFFERENT BATCH SIZES

Iterations (steps) | Batch 10 | Batch 50 | Batch 100 | Batch 200
1000               | 6%       | 10%      | 8%        | 8%
1500               | 10%      | 16%      | 16%       | 14%
2000               | 14%      | 21%      | 20%       | -
3000               | 21%      | 28%      | 27%       | -

Fig. 3. Top-1 accuracy fine-tuning the Birds-200 dataset with different batch sizes; x-axis is in steps.

By looking at accuracy over steps, one might think that a batch size of 10 was always the worst: after 1500 steps it was at only 10% accuracy, while the others were around 15%. But this is not a good measure, as a batch size of 10 is 5 times faster than 50 and 20× faster than 200. By making the x-axis measure time (hours) instead of steps, the result is as in Figure 4.

Fig. 4. Top-1 accuracy fine-tuning the Birds-200 dataset with different batch sizes; x-axis is in hours.

TABLE III
ACCURACY OVER TIME IN HOURS FOR DIFFERENT BATCH SIZES

Time (hours) | Batch 10 | Batch 50 | Batch 100 | Batch 200
0:30         | 37%      | 16%      | 6%        | 2%
1:00         | 46%      | 29%      | 14%       | 5%
1:30         | 51%      | 36%      | 21%       | 8%
2:00         | 56%      | 40%      | 28%       | 12%

Looking at Table III, the smallest batch size of 10 was systematically better than the others; for example, after two hours it reached more than 56% accuracy while the others were lagging behind at 40%, 28%, and 12% for the 50, 100, and 200 batch sizes respectively.

The setup with a batch size of 10 achieved 10% accuracy in only 5 minutes, while the batch size of 200 took around more than one and a half hours to reach the same 10% accuracy.

C. Effect of using large learning-rates

Figure 5 shows the effect of different choices of batch size and learning rate. It was done on "The Oxford-IIIT Pet Dataset" (Pets-37)[31] with batch sizes of 8 and 32 and learning rates of 0.001 and 0.0005. While using small batch sizes means doing more frequent updates to the trainable weights, using a large learning rate means doing updates of larger magnitude (and larger risk too). Faster convergence toward better accuracy can be considered analogous to walking with faster steps (smaller batch size, more frequent updates) or walking with wider steps (larger learning rate).

But how extreme can one go with smaller batch sizes and larger learning rates? By looking at Figure 6, one can see that using a ridiculously, extremely small batch size of 4 items and an extremely large learning rate of 0.1 did actually manage to cross 80% accuracy in just below 5 minutes, which would have required one hour of training using the best settings of Figure 5.

The downside of such an extreme setting is that it got stuck very early. Accuracy was flapping beyond that point and never crossed 89%, even after an hour of training.
Compared to other settings, a batch size of 8 items and a learning rate of 0.1 first crossed 89% as early as 20 minutes, compared to 28 minutes for the same batch size with a learning rate of 0.01. Similarly, the extreme learning rate of 0.1 got stuck early, and until minute 44 it was still at 89%, which was the point in time at which the smaller learning rate of 0.01 caught up; later, at minute 52, it first reached 91%, while the larger learning rate of 0.1 needed 96 minutes.

Fig. 5. Evaluation accuracy along the time axis for different batch sizes and learning rates (LRs) for the Pets-37 task.

Fig. 6. Comparing accuracy over time for extremely small batch sizes and learning rates for the Pets-37 task.

We further tested the proposed method on the more challenging Birds-200 task. As seen in Figure 4, it got stuck at an accuracy below 50% no matter which batch size was used. This time, we included the last two Inception blocks (Mixed 5c and Mixed 5b) besides the last fully connected layer in the training process, giving even more parameters to train. When trained with fixed batch sizes, as seen in Figure 8, the model never got beyond an accuracy of 66.25%, wasting the rest of the training time. On the other hand, the adaptive batch size obtained more than 72% accuracy after about 2 hours and 21 minutes, as seen in Figure 7. The model was first initialized with a batch size of 4 and a learning rate of 0.1; then for the first hour of training we used a fast-forward mode with a batch size of 4 and a normal mode with a batch size of 8, both with a learning rate of 0.01; after an hour we switched to a fast mode with a batch size of 8 and a learning rate of 0.005 and a normal mode with a batch size of 16 and a learning rate of 0.001.

Fig. 7. Accuracy over time of the Birds-200 task trained using adaptive batch sizes of 4,8 then 8,16.

Fig. 8. Accuracy over time of the Birds-200 task trained using fixed batch sizes of 4 and 8.

V. ANALYSIS AND DISCUSSION

To summarize the numbers from the previous experiments: a ridiculously, extremely small batch size and an extremely large learning rate do high-risk updates, which is useful when the network has nothing to lose, that is, at the initialization phase. Later, the network needs more precise, lower-risk updates. This is similar to focusing a microscope with the coarse adjustment knob until the view is not getting any better, then using the more precise fine adjustment knob to continue adjusting the focus slowly until reaching the optimal focus; if the fine adjustment knob were used from the beginning, it would take too long to adjust the focus.

Inspired by this, the proposed method uses fast, high-risk settings as long as it is getting better results, and once it does not, it switches to the normal, slower and finer mode. Because the adjustments or updates that are back-propagated are not absolute but rather relative to the error delta (the ratio is the learning rate), one can alternate between the two modes multiple times.

Stochastic Gradient Descent (SGD) and similar algorithms take a batch of random samples (as opposed to the whole training dataset), pass it to the neural network, calculate the error, and based on it make an adjustment to the trainable parameters (weights and biases), then repeat.
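As a toy illustration of that loop (not from the paper), the following NumPy sketch performs mini-batch SGD updates on a simple linear model; the dataset and hyper-parameters are placeholders.

# Toy NumPy illustration of mini-batch SGD: sample a small random batch,
# compute the error, and nudge the weights by a small fraction (the learning
# rate) of the gradient, then repeat. Not from the paper.
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 5)), rng.normal(size=1000)   # stand-in dataset
w, lr, batch_size = np.zeros(5), 0.01, 8

for step in range(100):
    idx = rng.integers(0, len(X), size=batch_size)          # random mini-batch
    xb, yb = X[idx], y[idx]
    err = xb @ w - yb                                       # prediction error
    grad = xb.T @ err / batch_size                          # gradient of MSE/2
    w -= lr * grad                                          # adjust weights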

And like with using the microscope, the proposed method uses faster, higher-risk settings as long as they work. One way to make convergence faster is to walk with short, fast steps, that is, by doing more frequent updates, even based on a sample that is too small to represent all classes in the task. In the Birds-200 task, a sample of two items from each class means a batch size of 400 items. Doing frequent updates based on a batch barely representing 5% of the classes was very effective, as in the experiment shown in Figure 4 and in Table III. Iterating 20 times faster (even with ridiculously under-represented samples) resulted in multiple times better accuracy for two main reasons. First, the relation between batch time and batch size is linear while the negative effect on batch accuracy is not. Second, updates are made based on a small fraction of the error delta, namely the learning rate, which can be as small as 0.001; the next step would also be under-represented but in a different way, affecting different classes due to the stochastic nature of the training algorithm, so those flapping small-fraction mistakes cancel each other out and are negligible compared to the accumulative move toward the least-error point.

The time consumed to process a batch is linearly proportional to the batch size (assuming the machine is fully utilized); that is, using 8 items per batch is 50× faster than 400 items per batch. On the other hand, the negative side effect on accuracy (if any) of using a smaller batch size is not linear; that is, as long as the accuracy is increasing at a rate less than 50× slower, it is a win situation. When using 200 items per batch instead of 400 we get double the speed, but we will not lose half of the accuracy-increasing rate.

As long as we are increasing the accuracy, there is no need to use slower settings. But if accuracy gets stuck due to a lack of enough samples for the different classes, one might use a slower mode for a few steps, long enough to put the SGD in a good initial position to start sliding with the gradient using the faster mode.

One might ask: if smaller batches are good, why not use them all the way? Why does one need an adaptive batch size based on some criterion? The accuracy-increasing rate is not linear, and after a long while it starts to flatten into a horizontal line, as it is much easier to go from 10% to 15% than from 90% to 95%. When accuracy gets stuck, or even worse starts flapping and decreasing, one needs to activate the normal mode with the slower hyper-parameters that eventually overcome that barrier.

VI. CONCLUSION

In SGD, smaller batch sizes are very effective (even if they are 20× or 50× smaller than the number of classes). They have a linear effect on speed while barely degrading the accuracy-increasing rate; one can exploit this property to fast-forward the "boring" parts of the training process and get good results in hours for training that used to take days or require specialized hardware. This can be summarized simply as: do a very high-risk initialization, then "Train-Measure-Adapt-Repeat". As long as it is getting better results, keep using the fast-forwarding settings.

REFERENCES

[1] W. Liu, Z. Wang, X. Liu, N. Zeng, Y. Liu, and F. E. Alsaadi, "A survey of deep neural network architectures and their applications," Neurocomputing, vol. 234, pp. 11–26, 2017.
[2] Y. Bengio et al., "Learning deep architectures for AI," Foundations and Trends in Machine Learning, vol. 2, no. 1, pp. 1–127, 2009.
[3] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[4] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[5] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer, "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size," arXiv preprint, 2016. [Online]. Available: http://arxiv.org/abs/1602.07360
[6] K. He and J. Sun, "Convolutional neural networks at constrained time cost," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 5353–5360.
[7] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, "MobileNets: Efficient convolutional neural networks for mobile vision applications," arXiv preprint, 2017. [Online]. Available: http://arxiv.org/abs/1704.04861
[8] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, "Inverted residuals and linear bottlenecks: Mobile networks for classification, detection and segmentation," arXiv preprint, 2018. [Online]. Available: http://arxiv.org/abs/1801.04381
[9] F. Mamalet and C. Garcia, "Simplifying convnets for fast learning," Artificial Neural Networks and Machine Learning–ICANN 2012, pp. 58–65, 2012.
[10] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in International Conference on Machine Learning, ser. ICML'15, 2015, pp. 448–456. [Online]. Available: http://dl.acm.org/citation.cfm?id=3045118.3045167
[11] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1–9.
[12] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the inception architecture for computer vision," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2818–2826.
[13] Y. LeCun, L. Bottou, G. B. Orr, and K.-R. Müller, "Efficient BackProp," in Neural Networks: Tricks of the Trade, Springer, 1998.
[14] D. R. Wilson and T. R. Martinez, "The general inefficiency of batch training for gradient descent learning," Neural Networks, vol. 16, no. 10, pp. 1429–1451, 2003.
[15] I. Sutskever, J. Martens, G. Dahl, and G. Hinton, "On the importance of initialization and momentum in deep learning," in International Conference on Machine Learning, 2013, pp. 1139–1147.
[16] J. Duchi, E. Hazan, and Y. Singer, "Adaptive subgradient methods for online learning and stochastic optimization," Journal of Machine Learning Research, vol. 12, no. Jul, pp. 2121–2159, 2011.
[17] D. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[18] M. D. Zeiler, "ADADELTA: An adaptive learning rate method," arXiv preprint arXiv:1212.5701, 2012.
[19] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[20] P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He, "Accurate, large minibatch SGD: Training ImageNet in 1 hour," arXiv preprint arXiv:1706.02677, 2017.
[21] C. Finn, P. Abbeel, and S. Levine, "Model-agnostic meta-learning for fast adaptation of deep networks," arXiv preprint, 2017. [Online]. Available: http://arxiv.org/abs/1703.03400
[22] A. Defazio, F. Bach, and S. Lacoste-Julien, "SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives," in Advances in Neural Information Processing Systems, 2014, pp. 1646–1654.
results keep using fast-forwarding settings. 1646–1654.

[23] C. Cortes, X. Gonzalvo, V. Kuznetsov, M. Mohri, and S. Yang, “Adanet:
Adaptive structural learning of artificial neural networks,” arXiv preprint,
2016. [Online]. Available: http://arxiv.org/abs/1607.01097
[24] L. N. Smith, “Cyclical learning rates for training neural networks,”
in Applications of Computer Vision (WACV), 2017 IEEE Winter
Conference on. IEEE, 2017, pp. 464–472. [Online]. Available:
http://arxiv.org/abs/1506.01186
[25] I. Loshchilov and F. Hutter, “Sgdr: Stochastic gradient descent
with warm restarts,” arXiv preprint, 2016. [Online]. Available:
http://arxiv.org/abs/1608.03983
[26] L. N. Smith and N. Topin, “Super-convergence: Very fast training of
residual networks using large learning rates,” arXiv preprint, 2017.
[Online]. Available: http://arxiv.org/abs/1708.07120
[27] S. Ruder, “An overview of gradient descent optimiza-
tion algorithms,” arXiv preprint, 2016. [Online]. Available:
http://arxiv.org/abs/1609.04747
[28] H. Zhu, M. Akrout, B. Zheng, A. Pelegris, A. Phanishayee,
B. Schroeder, and G. Pekhimenko, “Tbd: Benchmarking and analyzing
deep neural network training,” arXiv preprint, 2018. [Online]. Available:
http://arxiv.org/abs/1803.06905
[29] C. Coleman, D. Narayanan, D. Kang, T. Zhao, J. Zhang, L. Nardi,
P. Bailis, K. Olukotun, C. Ré, and M. Zaharia, “Dawnbench: An end-
to-end deep learning benchmark and competition,” Training, vol. 100,
no. 101, p. 102, 2017.
[30] S. L. Smith, P.-J. Kindermans, C. Ying, and Q. V. Le, “Don’t decay the
learning rate, increase the batch size,” arXiv preprint, 2017. [Online].
Available: http://arxiv.org/abs/1711.00489
[31] O. M. Parkhi, A. Vedaldi, A. Zisserman, and C. V. Jawahar, “Cats
and dogs,” in 2012 IEEE Conference on Computer Vision and Pattern
Recognition, 2012, pp. 3498–3505.
[32] J. Deng, A. Berg, S. Satheesh, H. Su, A. Khosla, and L. Fei-Fei,
“Imagenet large scale visual recognition competition,” (ILSVRC2012),
2012.
[33] W. Ouyang, X. Wang, C. Zhang, and X. Yang, “Factors in finetun-
ing deep model for object detection with long-tail distribution,” in
Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, 2016, pp. 864–873.
[34] M. Oquab, L. Bottou, I. Laptev, and J. Sivic, “Learning and transferring
mid-level image representations using convolutional neural networks,”
in Proceedings of the IEEE conference on computer vision and pattern
recognition, 2014, pp. 1717–1724.
[35] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie, “The
caltech-ucsd birds-200-2011 dataset,” California Institute of Technology,
Tech. Rep. CNS-TR-2011-001, 2011.

Author Index
A
Ababneh, Mohammad .......... 56
Abdalhaq, Baker .......... 318
Abdel-Nabi, Heba .......... 170
Adefila, Arinola .......... 83
AlArmouty, Batool .......... 178
Al-Asa'd, Muntaha .......... 341
Al-Dabet, Saja .......... 271
Alenezi, Fayadh .......... 277
Al Etaiwi, Wael .......... 251
Al Etaiwi, Wael Mahmoud .......... 265
Al-Fayoumi, Mustafa .......... 45, 56, 74
Alghazo, Jaafar .......... 372
AlHaidar, Shoaa .......... 124
Alhaj, Fatima .......... 324, 347
Al-Haj, Fatima .......... 258
Alharbi, Yaser .......... 189
Alhichri, Haikel S. .......... 283
Alhijawi, Bushra .......... 136
Ali, Mohd Shukuri Mohamad .......... 107
Ali, Nazlena Mohamad .......... 107
Alia, Shahd .......... 113
Al-Jarrah, Heba .......... 341
Alkasassbeh, Mouhamad .......... 51
Alkasassbeh, Mouhammd .......... 27
Al-kasassbeh, Mouhammd .......... 33
AlKhatib, Lina .......... 124
Al-Kasassbeh, Mohammad .......... 62
Al-Lahham, Yaser A.M. .......... 226
Al-Madi, Nailah .......... 152
Almajali, Sufyan .......... 208
Al-Mousa, Amjed .......... 335
Almseidin, Mohammad .......... 33
Al-Naymat, Ghazi .......... 136, 170
Al Omari, Islam .......... 142
Al Omoush, Razan .......... 142
AlOraidh, Aqeela .......... 124
Alsadi, Muayyad Saleh .......... 393
AlSaid, Hawra .......... 124
Al-Sakran, Hasan .......... 189
Al-Sayyed, Rizik .......... 382
AL-Smadi, Mohammad .......... 341
Al Qadi, Leen .......... 238
Alzaqebah, Abdullah .......... 382
Al-Zboon, Sa'ad A. .......... 341
Al-Zewairi, Malek .......... 1
Alzubaidi, Loay .......... 372
Assiri, Basem .......... 8
Atallah, Rahma .......... 335
Awajan, Arafat .......... 136, 170, 208, 231, 244, 265, 393
Awajan, Arafat A. .......... 213, 251
Ayoubi, Eyad .......... 74
Azeem, Omar .......... 353

B
Baghdadi, Ameer .......... 183
Bahita, M. .......... 360
Bakhti, Haddi .......... 366
Bashar, Abul .......... 124
bazi, Yakoub .......... 283
Belarbi, K. .......... 360
Benbrahim, Ghassen .......... 302
Biltawi, Mariam .......... 231

C
Chakraborty, Rajat Subhra .......... 1
Chantar, Hamouda .......... 318
Chefranov, Alexander .......... 20
Clauss, Alexander .......... 101

D
Dafoulas, Georgios .......... 87
Dafoulas, Georgios A. .......... 94
Daoud, Mohammad .......... 158
Debbi, Aimad Eddine .......... 366
Dermol, Valerij .......... 83
DeWinter, Alun .......... 83

E
Eleyan, Derar .......... 377
Elhassan, A. .......... 142
Elhassan, Ammar .......... 302
Elnagar, Ashraf .......... 238
El-Nakla, Darin .......... 119
El-Nakla, Samir .......... 119
El Rifai, Hozayfa .......... 238
El-Seoud, Samir .......... 289
Eshtayah, Mohammad .......... 183

F
Fekry, Ahmed .......... 87
Fraihat, Salam .......... 178

G
Ghnemat, Rawan .......... 302, 393
Giacinto, Giorgio .......... 14

H
Halabi, Dana .......... 244
Hamad, Nagham .......... 20
Hambouz, Ahmed .......... 45
Hamdan, Salam .......... 208
Hamida, Abdelhak Farhat .......... 366
Hammad, Mahmoud .......... 341
Hammo, Bassam .......... 258
Hamtini, Thair .......... 113
Hanna, Samer .......... 39
Haque, Tahreem .......... 1
Hart, Stefan Willi .......... 202
Hawash, Amjad .......... 183
Hudaib, Amjad .......... 324
Hussein, Walid .......... 289

I
Ibrahim, Anas .......... 20
Innab, Haneen .......... 142
Islam, Noman .......... 130
Ismail, Manal .......... 87
Issa, Lana .......... 220

J
Jaber, Hayat .......... 39
Jamous, Naoum .......... 202
Jusoh, Shaidah .......... 220

K
Karaymeh, Ashraf .......... 56
Kazakzeh, Saif .......... 74
Khan, Omer .......... 353
Khanafsa, Mohammad .......... 329
Kharshid, Areej .......... 283
Kovacs, Szilveszter .......... 33
Krishnasamy, Gomathi .......... 68
Kumar, Kamlesh .......... 130

L
Lane, Victor P. .......... 377
Latif, Ghazanfar .......... 312, 372
Lenk, Florian .......... 101

M
Mafarja, Majdi .......... 318
Manna, Abdelrahman .......... 45, 51
Manzoor, Ayisha .......... 312
Masadeh, Raja .......... 382
McNally, Beverley .......... 119
Mohammad, Nazeeruddin .......... 312
Mohiuddin, Iman .......... 312
Morrar, Jalal .......... 183
Mostafa, Ahmad .......... 289
Muslmani, Baraa K. .......... 74
Muzzammel, Raheel .......... 353

N
Naz, Rubina .......... 130
Neilson, David .......... 94

O
Obaid, Safa .......... 238
Obeid, Nadim .......... 136
Olaifa, Moses .......... 388
Ouni, Ridha .......... 283

Q
Qabbaah, Hamzah .......... 164
Qasaimeh, Malik .......... 56, 74
Qureshi, Muhammad Faheem .......... 195

R
Raza, Asad .......... 195
Romman, Ali Abu .......... 195

S
Saeed, Nayab .......... 353
Saeed, Reham .......... 302
Saeed, Umair .......... 130
Sammour, George .......... 164
Santikellur, Pranesh .......... 1
Sarhan, Sami .......... 329
Scalas, Michele .......... 14
Schoop, Eric .......... 101
Serguievskaia, Irina .......... 189
Shaheen, Yousef .......... 45
Shaheen, Yousef Khaled .......... 62
Shaikh, Aftab Ahmed .......... 130
Shaikh, Eman .......... 312
Sharieh, Ahmad .......... 347
Sheta, Alaa .......... 296
Širca, Nada Trunk .......... 83
Sleit, Azzam .......... 347
Snaith, James .......... 377

Suleiman, Dima .......... 213, 251
Sundus, Katrina .......... 258
Surakhi, Ola .......... 329

T
Tahir, Umair .......... 353
Tanveer, Jaweria .......... 130
Tawalbeh, Saja Khaled .......... 341
Tedmori, Sara .......... 45, 231, 271
Thaher, Thaer .......... 318
Trunk, Aleš .......... 83
Turabieh, Hamza .......... 296, 306

V
Vanhoof, Koen .......... 164

W
Wimpenny, Katherine .......... 83

Y
Yasen, Mais .......... 152

Z
Zuraiq, AlMaha Abu .......... 27
Zuva, Tranos .......... 388
