www.ictcs.info
Proceedings
Amman, Jordan
9–11 October 2019
Organized by
Editor
Prof. Arafat Awajan
The conference is held in Amman, Jordan, from 9–11 October 2019. We expect more than 200
participants in this edition of the conference, which will feature 65 papers from 28 countries
(acceptance rate 33%), 8 keynote speeches (from the USA, UK, Italy, and Malaysia), an
industrial track with 3 talks, and 3 specialized workshops on data science, computer security, and
artificial intelligence.
Area 1: Data Science and Big Data: Evolutionary Computation, Big Data Analytics, Information
Retrieval for Big Data, Social Network Analysis and Mining, Data and Text Mining, Data Analysis
and Visualization, Computational Statistics and Modeling, Data Engineering, Mining Massive
Data, Data Visualization.
Area 2: Computer and Network Security: Botnet Detection and Prevention, Forensic Investigation
of the IoT, Big Data Forensics, Cloud Computing Security, Network Flow Analysis, Intrusion
Detection and Prevention, Mobile Security, Digital Forensics and Anti-Forensics, Malware
Analysis and Memory Forensics, Wireless Network.
Area 3: Natural Language Processing: Semantic Processing, Lexical Semantics, Ontology, Latent
Semantic Analysis, Linguistic Resources, Paraphrasing, Language Generation, Text Entailment,
Machine Translation, Information Retrieval, Text Mining, Question Answering, Speech Analysis
and Recognition, Arabic Natural Language Processing.
Area 5: Internet of Things (IoT): Architectures and protocols for the IoT, IoT new designs and
architectures, IoT/M2M Management, Interoperability of IoT systems, IoT applications, Security,
Identity and privacy of IoT, Reliability of IoT, Scalability issues for IoT networks, Disaster
recovery in IoT.
Area 6: Electronic and Virtual Learning: e-Learning Tools, Mobile Learning, Gamification,
Collaborative Learning, Educational Systems Design, Virtual Learning Environments, Virtual
Reality for Education and Workforce Training.
Acknowledgments
We would like to thank the program committee members, all the reviewers, and sub-reviewers who
worked very hard to support ICTCS’19 and sent their reviews on time. We would like to
thank our main sponsors for their valuable support. In particular, we would like to thank the
Scientific Research Support Fund for their generous support. We also would like to thank the
authors for their high-quality scientific contributions. We appreciate all the support we obtained
from the president of Princess Sumaya University for Technology (Jordan), Prof. Mashhoor Al-Refai,
and from Prof. Issa H. Al Ansari, president of Prince Mohammad bin Fahd University (Saudi
Arabia). Finally, we would like to extend our sincere gratitude to HRH Princess Sumaya Bint
Elhassan for her continuous support and guidance and for accepting to be the patron of ICTCS’19.
Message from the Conference Chairs
On behalf of the Organizing Committee, we are honored and delighted to welcome you to Amman
and to the Second International Conference on New Trends in Computing Sciences (ICTCS’19).
ICTCS’19 is organized by the King Hussein School for Computing Science at Princess
Sumaya University for Technology (Jordan) in partnership with Prince Mohammad bin Fahd
University (Saudi Arabia), and is supported by the Scientific Research Fund at the Ministry of
Higher Education and the Royal Scientific Society (RSS).
Building on the resounding success of its first edition, the goal of the conference is to continue
providing an international biannual forum where international scientists meet Jordanian
scientists from the different fields of computer science to exchange ideas and information on
current research trends, system developments, and practical experiences. The main
themes of ICTCS’19 include Data Science, Data Mining, Big Data, Artificial Intelligence, the Internet
of Things, Natural Language Processing, and Computer Security.
The technical program is rich and varied, with 8 specialized keynote speeches, 3 industrial
presentations, and 65 technical papers from 23 countries. In addition, several workshops are
organized in parallel with the conference.
On behalf of the organizing committee, we wish to thank all authors for their papers and
contributions to this conference. We would like to thank the keynote speakers for sharing their deep
knowledge and experience in the hot research topics of the different fields of computer science
and information and communication technology. We offer our deep thanks to all the members of
the International Scientific Committee and reviewers, who offered their time and technical
expertise in the review process.
We know that the success of the conference depends ultimately on the many people who have
worked in planning and organizing both the technical program and the supporting social arrangements.
We would like to share with you our gratitude towards all members of the organizing committee
for their efforts and dedication to the success of this conference. We also thank Professor Mashhour
Al Refai, president of PSUT, and Professor Issa H. Al Ansari for their support in organizing this
conference.
Finally, we would also like to thank the Scientific Research Support Fund (Ministry of Higher
Education and Scientific Research, Jordan) for its valuable support to the conference. Special
thanks to all Session Chairs, Student Volunteers, and Sponsors for their contributions to making
ICTCS’19 a success.
ICTCS’19 Committees
Conference Chairs
Arafat Awajan, Princess Sumaya University for Technology, Jordan
Faisal AL Anezi, Prince Mohammad Bin Fahd University, Saudi Arabia
Advisory Committee
Arafat Awajan, Princess Sumaya University for Technology, Jordan (Chair)
Faisal AL Anezi, Prince Mohammad Bin Fahd University, Saudi Arabia (Chair)
Gheith Abandah, IEEE Jordan Section Chair, Jordan
Ali Chamkha, Prince Mohammad Bin Fahd University, Saudi Arabia
Thiab Taha, University of Georgia, USA
Nabeel Fayoumi, Royal Scientific Society, Jordan
Omer Rana, Cardiff University, UK
Aladdin Ayesh, De Montfort University, UK
Mohammad Bettaz, Dean of the Faculty of Information Technology, Philadelphia University
Ahmad Hiasat, Princess Sumaya University for Technology, Jordan
Amjad Hudaib, Dean of the Faculty of IT, Jordan University, Jordan
Jaafar Al Ghazo, Prince Mohammad Bin Fahd University, Saudi Arabia
Nijad Najdawi, Dean of the Faculty of IT, Al-Balqa Applied University, Jordan
Sahar Edwan, Dean of the Faculty of IT, Hashemite University, Jordan
Hassan Shalaby, Al-Hussein Bin Talal University, Jordan
Essam Al Daoud, Zarqa University, Jordan
Organizing Committee
Sufyan Al Majali, Princess Sumaya University for Technology, Jordan
Edward Jaser, Princess Sumaya University for Technology, Jordan
Jaafar Al Ghazo, Prince Mohammad Bin Fahd University, Saudi Arabia
Muder Al Miani, IEEE Jordan Section, Jordan
Khaled Jaber, IEEE Jordan Section, Jordan
Said Ghoul, Philadelphia University, Jordan
Malek Al-Zewairi, Princess Sumaya University for Technology, Jordan
Laila Al-Sayaydeh, Princess Sumaya University for Technology, Jordan
Ezzeldeen Al-Issa, Princess Sumaya University for Technology, Jordan
Publication Committee
Edward Jaser, Princess Sumaya University for Technology, Jordan (Chair)
Rania Sinno, Prince Mohammad Bin Fahd University, Saudi Arabia
Laila Al-Sayaydeh, Princess Sumaya University for Technology, Jordan
Malek Al-Zewairi, Princess Sumaya University for Technology, Jordan
Technical Committee
Arafat Awajan, Princess Sumaya University for Technology, Jordan
Chedly B. Yahya, Prince Mohammad Bin Fahd University, Saudi Arabia
Brahim Medjahed, the University of Michigan – Dearborn, USA
Aladdin Ayesh, De Montfort University, UK
Christian Boitet, Joseph Fourier University, France.
Adil Alpkocak, Dokuz Eylul University, Turkey
Abd El-Aziz Ahmed, Anna University, India
Abdallah Qusef, Princess Sumaya University for Technology, Jordan
Eric Schoop, TU Dresden, Germany
Essam Al Daoud, Zarqa University, Jordan
Essam Rashed, The British University in Egypt, Egypt
Fares Fraij, The University of Texas at Austin, USA
Abdallah Shdaifat, The University of Jordan, Jordan
Abdullah Aref, Princess Sumaya University for Technology, Jordan
Abul Bashar, Prince Mohammad Bin Fahd University, Saudi Arabia
Cathryn Peoples, The Open University, UK
Ankur Singh Bist, Krishna Institute of Engineering & Technology, India
Chiheb-Eddine Ben N'Cir, University of Tunis, Tunisia
Adiy Tweissi, Princess Sumaya University for Technology, Jordan
Adnan Gutub, Umm Al-Qura University, Saudi Arabia
Adnan Hnaif, Al-Zaytoonah University of Jordan, Jordan
Bushra Alhijawi, Princess Sumaya University for Technology, Jordan
Daoud Daoud, Princess Sumaya University for Technology, Jordan
Darin El-Nakla, Prince Mohammad Bin Fahd University, Saudi Arabia
Adnan Shaout, University of Michigan, USA
Ahmad Abusukhon, Al-Zaytoonah University of Jordan, Jordan
Ahmad Al-Qerem, Princess Sumaya University for Technology, Jordan
Ahmad Hiasat, Princess Sumaya University for Technology, Jordan
Ala Al-Fuqaha, Western Michigan University, USA
Albara Awajan, Al Balqa Applied University, Jordan
Ali Hadi, Champlain College, Computer and Digital Forensics, USA
Amin Beheshti, Macquarie University, Australia
Amjad Hudeib, The University of Jordan, Jordan
Amjed Almousa, Princess Sumaya University for Technology, Jordan
Ammar Elhassan, Princess Sumaya University for Technology, Jordan
Gaurav Garg, ABV-Indian Institute of Information Technology & Management, India
George Sammour, Princess Sumaya University for Technology, Jordan
Ghassan Al Qaimari, Jumeira University, United Arab Emirates
Ghassan Shobaki, Sacramento State University, USA
Anas Abu Taleb, Princess Sumaya University for Technology, Jordan
Arinola Adefila, Coventry University, UK
Ashraf Odeh, Isra University, Jordan
Ashraf Tahat, Princess Sumaya University for Technology, Jordan
Baha Khasawnwh, Princess Sumaya University for Technology, Jordan
Basheer Dwaoiri, Jordan University of Science and Technology, Jordan
Bassam Hammo, The University of Jordan, Jordan
Heba Abdelnabi, Princess Sumaya University for Technology, Jordan
Hejab Alfawareh, Northern Border University, Saudi Arabia
Bayan Abu Shawar, Arab Open University, Jordan
Dhiah Abu Tair, German Jordanian university, Jordan
Dima Suleiman, Princess Sumaya University for Technology, Jordan
Doaa ElZanfaly, The British University in Egypt, Egypt
Haytham Bani Salameh, Yarmouk University, Irbid, Jordan
Hosam El-Sofany, King Khalid University, Saudi Arabia
Hunaida Awwad, Dokuz Eylul University, Turkey
Huseyin Abachi, Adnan Menderes University, Turkey
Hussein Sane Yagi, The University of Jordan, Jordan
Ibrahim Aljarah, The University of Jordan, Jordan
Ilyes Jenhani, Prince Mohammad Bin Fahd University, Saudi Arabia
Isidro Maya-Jariego, Universidad de Sevilla, Spain
Dojanah Al-Nabulsi, Amman University College, Jordan
Dojanah Bader, Al-Balqa` Applied University, Jordan
Ebaa Fayyoumi, The Hashemite University, Jordan
Edward Jaser, Princess Sumaya University for Technology, Jordan
Emad Abdallah, The Hashemite University, Jordan
Firas Alghanim, Princess Sumaya University for Technology, Jordan
Ghassen Ben Brahim, Prince Mohammad Bin Fahd University, Saudi Arabia
Ghazi Naymat, Princess Sumaya University for Technology, Jordan
Gheith Abandah, University of Jordan, Amman, Jordan.
Hani Almimi, Al-Zaytoonah University of Jordan, Jordan
Hasan Al Shalabi, Al Hussein University, Jordan
Ismail Ababneh, Al al-Bayt University, Jordan
Jaafar Alghazo, Prince Mohammad Bin Fahd University, Saudi Arabia
Jaafer Saraireh, Princess Sumaya University for Technology, Jordan
Jaber Alwidian, Al-Isra University, Jordan
Khaled Al-Begain, University of South Wales, UK
Khaled Almakadmeh, The Hashemite University, Jordan
Khaled Almi'ani, United Arab Emirates
Khaled Alzoubi, Saint Xavier University, USA
Ja'far Alqatawna, The University of Jordan, Jordan
Jalal Atoum, Princess Sumaya University for Technology, Jordan
Jamal Arafat, Ohio University, USA
Jawad Fawaz Al-Asad, Prince Mohammad Bin Fahd University, Saudi Arabia
Jihad Jaam, International Journal of Computing and Information Sciences, United Arab Emirates
Khair Eddin Sabri, The University of Jordan, Jordan
Khalaf Khatatneh, Al-Balqa` Applied University, Jordan
Khaled Mahmoud, Princess Sumaya University for Technology, Jordan
Khaled Makadmeh, The Hashemite University, Jordan
Khaled Mansour, Al-Zaytoonah University of Jordan, Jordan
Khaled Nagaty, The British University in Egypt, Egypt
Majid Ali Khan, Prince Mohammad Bin Fahd University, Saudi Arabia
Malek Al-Zewairi, Princess Sumaya University for Technology, Jordan
Malik Qasaimeh, Princess Sumaya University for Technology, Jordan
Malik Saleh, Prince Mohammad Bin Fahd University, Saudi Arabia
Khaled Younis, The University of Jordan, Jordan
Khamis Omar, Jordan
Khatatneh Khalaf, Al Balqa Applied University, Jordan
Lalit Garg, L-Università ta' Malta, Malta
Leonel Sousa, Universidade de Lisboa, Portugal
Loay Alzubaidi, Prince Mohammad Bin Fahd University, Saudi Arabia
Majdi Rawashdeh, Princess Sumaya University for Technology, Jordan
Majdi Sawalha, The University of Jordan, Jordan
Mamoun Hattab, University of Petra, Jordan
Maram Bani Younes, University of Ottawa, Canada
Mariam Biltawi, Princess Sumaya University for Technology, Jordan
Mariam Khader, Princess Sumaya University for Technology, Jordan
Marius Nagy, Prince Mohammad Bin Fahd University, Saudi Arabia
Marwah Alian, Princess Sumaya University for Technology, Jordan
Mohamed Anis Bach Tobji, ESEN University, Tunisia
Mohamed Aymen Ben Hajkacem, Higher Institute of Management of Tunis, Tunisia
Mohamed Wiem Mkaouer, Rochester Institute of Technology, USA
Mohammad Ababneh, Princess Sumaya University for Technology, Jordan
Nadim Obeid, The University of Jordan, Jordan
Nailah Al-Madi, Princess Sumaya University for Technology, Jordan
Naoufel Werghi, Khalifa University, United Arab Emirates
Mohammad Abusharaih, The University of Jordan, Jordan
Mohammad Alauthman, Al-Zaytoonah University of Jordan, Jordan
Mohammad Alia, Al-Zaytoonah University of Jordan, Jordan
Mohammad Al-Zoube, Princess Sumaya University for Technology, Jordan
Mohammad Belal Al Zoubi, Princess Sumaya University for Technology, Jordan
Mohammad Daoud, Microsoft MVP, Jordan
Omar Nofal, Princess Sumaya University for Technology, Jordan
Omar Rana, Cardiff University, UK
Osama Dorgham, Al-Balqa` Applied University, Jordan
Mohammed Al-Saleh, The University of Jordan, Jordan
Mohammed Alweshah, Al-Balqa` Applied University, Jordan
Mohammed Zeki Khedher, The University of Jordan, Jordan
Montassar Ben Messaoud, Higher Institute of Management of Tunis, Tunisia
Mostafa Ali, Jordan University of Science and Technology, Jordan
Mouhammd Alkasassbeh, Princess Sumaya University for Technology, Jordan
Mousa Al-Akhras, Saudi Electronic University, Saudi Arabia
Omar Hiari, German Jordanian University, Jordan
Omar M. Al-Jarrah, Jordan University of Science and Technology, Jordan
Mousa Ayyash, Colorado State University, USA
Mustafa Al Fayoumi, Princess Sumaya University for Technology, Jordan
Nadia Sweis, Princess Sumaya University for Technology, Jordan
Nazeeruddin Mohammad, Prince Mohammad Bin Fahd University, Saudi Arabia
Nijad Najdawi, Al Balqa Applied University, Jordan
Omar Al-Hujran, Princess Sumaya University for Technology, Jordan
Omar H. Karam, The British University in Egypt, Egypt
Osama Haj Hassan, Al-Isra University, Jordan
Osama Ouda, The University of Jordan, Jordan
Parag Kulkarni, United Arab Emirates University, United Arab Emirates
Paul Richardson, The University of Michigan – Dearborn, USA
Paul Watta, The University of Michigan – Dearborn, USA
Peter King, Heriot Watt University, UK
Priyanka Chaurasia, ULSTER, UK
Raed Abu Zitar, American University of Madaba, Jordan.
Raghda Hraiz, Princess Sumaya University for Technology, Jordan
Rami Alazrai, German Jordanian University, Jordan
Rawan Ghnemat, Princess Sumaya University for Technology, Jordan
Ridha Ghayoula, Université Laval, Canada
Rosana Marar, Princess Sumaya University for Technology, Jordan
S Smys, RVS Technical Campus, India
Sadiq Alhuwaidi, Prince Mohammad Bin Fahd University, Saudi Arabia
Sahar Idwan, The Hashemite University, Jordan
Said Ghoul, Philadelphia University, Jordan
Salam Fraihat, Princess Sumaya University for Technology, Jordan
Salam Hamdan, Princess Sumaya University for Technology, Jordan
Saleh Abu-Soud, Princess Sumaya University for Technology, Jordan
Thiab Taha, University of Georgia, USA
Varsha Jain, Narsee Monjee Institute of Management Studies, India
Vladimir Geroimenko, The British University in Egypt, Egypt
Wael Etaiwi, Princess Sumaya University for Technology, Jordan
Walid A Salameh, Princess Sumaya University for Technology, Jordan
Samer Sawalha, Princess Sumaya University for Technology, Jordan
Samir Abou El-Seoud, The British University in Egypt, Egypt
Samir Elnakla, Prince Mohammad Bin Fahd University, Saudi Arabia
Samy Ghoniemy, The British University in Egypt, Egypt
Sane Yagi, University of Jordan, Jordan
Saqer Abdel Rahim, Jordan
Sara Tedmori, Princess Sumaya University for Technology, Jordan
Sarah Gellynhail, Western Michigan University, USA
Shadi Aljawarneh, The University of Jordan, Jordan
Yahia Al-Halabi, Princess Sumaya University for Technology, Jordan
Yasmeen Alsufaisan, Prince Mohammad Bin Fahd University, Saudi Arabia
Yi Lu Murphy, University of Michigan, USA
Yousef Daradkeh, Prince Sattam bin Abdulaziz University, KSA
Shahabuddin Muhammad, Prince Mohammad Bin Fahd University, Saudi Arabia
Shaidah Jusoh, Princess Sumaya University for Technology, Jordan
Sharefa Murad, University of Salerno, Italy
Sufyan Almajali, Princess Sumaya University for Technology, Jordan
Suleiman Yerima, De Montfort University, UK
Tarek Abbes, Higher Institute of Electronic and Communication of Sfax, Tunisia
Walid Hussien, The British University in Egypt, Egypt
Wided Guezguez, Tunis Business School, Tunisia
Zaydon Hatamleh, Al Ain University of Science and Technology, United Arab Emirates
Keynotes
Prof. Ku Ruhana Bt Ku M
Universiti Utara Malaysia, Malaysia
The current trend is to design hybrid metaheuristics by combining different metaheuristics so as to
benefit from the individual advantages of each method. An effective approach consists in combining a
population-based method with a single-solution method (often a local search procedure, such as Tabu
search with ant colony optimization (ACO)). Many combinations of well-known optimization methods
have been developed, such as a hybrid grey wolf optimizer and genetic algorithm, a hybrid Cuckoo
Search and Particle Swarm Optimization (PSO), a hybrid PSO and ACO, and a hybrid ACO and
artificial bee colony algorithm. Hybrid SI-based metaheuristics can obtain satisfying results when
solving optimization problems in a reasonable time. However, they struggle in particular with
high-dimensional optimization problems. Future research to overcome this limit could be in the area of
parallel metaheuristics.
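The hybridization pattern described above can be sketched as follows: a population-based method (here a minimal particle swarm optimizer) whose incumbent best solution is periodically refined by a single-solution method (here a basic hill climber). This is an illustrative sketch on a toy objective, not any of the published hybrids named in the abstract; all parameter values are arbitrary.

```python
import random

random.seed(0)  # deterministic run for illustration

def sphere(x):
    """Toy objective to minimize: sum of squares, optimum 0 at the origin."""
    return sum(v * v for v in x)

def local_search(x, f, step=0.1, iters=50):
    """Single-solution method: simple hill climbing around x."""
    best, best_f = list(x), f(x)
    for _ in range(iters):
        cand = [v + random.uniform(-step, step) for v in best]
        cf = f(cand)
        if cf < best_f:
            best, best_f = cand, cf
    return best, best_f

def hybrid_pso(f, dim=5, swarm=20, iters=100):
    """Population-based method (PSO) hybridized with periodic local search."""
    pos = [[random.uniform(-5, 5) for _ in range(dim)] for _ in range(swarm)]
    vel = [[0.0] * dim for _ in range(swarm)]
    pbest = [list(p) for p in pos]
    pbest_f = [f(p) for p in pos]
    g = min(range(swarm), key=lambda i: pbest_f[i])
    gbest, gbest_f = list(pbest[g]), pbest_f[g]
    for t in range(iters):
        for i in range(swarm):
            for d in range(dim):
                r1, r2 = random.random(), random.random()
                vel[i][d] = (0.7 * vel[i][d]
                             + 1.5 * r1 * (pbest[i][d] - pos[i][d])
                             + 1.5 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            fi = f(pos[i])
            if fi < pbest_f[i]:
                pbest[i], pbest_f[i] = list(pos[i]), fi
                if fi < gbest_f:
                    gbest, gbest_f = list(pos[i]), fi
        if t % 10 == 0:  # periodically refine the incumbent with local search
            gbest, gbest_f = local_search(gbest, f)
    return gbest, gbest_f

best, best_f = hybrid_pso(sphere)
print(round(best_f, 6))
```

The local search never worsens the incumbent, so the hybrid can only improve on plain PSO for the same budget, at the cost of extra objective evaluations.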
Prof. Mubarak Shah
University of Central Florida, USA
Prof. Moussa Ayyash
Chicago State University, USA
This talk highlights the need for a strategic framework for coexisting heterogeneous wired and wireless
deployments and computing infrastructures. The talk provides examples of recent promising solutions that
promote coexistence strategies (e.g., coexisting radio and optical wireless deployments (CROWD)). The
speaker will also focus on the fact that large-scale heterogeneous integration requires artificial intelligence
(AI) techniques that can naturally deal with coexisting heterogeneous environments.
The speaker will briefly shed light on the need for a different future workforce, one ready to deal with
the heterogeneity of computing sciences and the “future-of-work” trends.
Prof. Giorgio Giacinto, University of Cagliari, Italy
Prof. George Dafoulas, Middlesex University, UK
The shift towards Education 4.0 has significantly increased the pressure for a learning experience that is
fully aligned with a volatile employment sector. Therefore, there is a need for a revised pedagogical approach
in the way digital forensics curricula are delivered and supported. The evolution of educational technologies,
as well as the increasing integration of a range of hands-on experiences in the learning process, means that
digital forensics programmes are enhanced with the use of the Internet of Things, Immersive Learning
Environments, Social Learning Networks, Augmented, Virtual and Mixed Reality, Sensor-Generated Data,
Biometrics, and new perspectives on the impact that ethical, social, and professional issues have on security and
privacy. Such pressures have triggered a significant reshaping of learning, teaching, and assessment practices,
with emphasis on delivering digital forensics programmes in ways that equip graduates for seamless
employability readiness. This keynote will (i) discuss the various challenges of the changing educational
sector, (ii) share examples of good practice in delivering programmes within the framework of Industrial
Revolution 4.0, and (iii) provide guidance for adopting new educational practices in the delivery of digital
forensics programmes.
Prof. Elhadj Benkhelifa
Staffordshire University, UK
Prof. Mona T. Diab
The George Washington University, USA
Prof. Salim Hariri
The University of Arizona, USA
Table of Contents
Leader Election and Blockchain Algorithm in Cloud Environment for E-Health ......................8
Basem Assiri
An Approach for Web Applications Test Data Generation Based on Analyzing Client
Side User Input Fields ...............................................................................................................39
Samer Hanna and Hayat Jaber
An Energy Aware Fuzzy Trust Based Clustering with Group Key Management in
MANET Multicasting................................................................................................................68
Gomathi Krishnasamy
Track 2: Virtual and Electronic Learning
The JOVITAL Project: Capacity Building for Virtual Innovative Teaching and
Learning in Jordan .....................................................................................................................83
Katherine Wimpenny, Arinola Adefila, Alun DeWinter, Valerij Dermol, Nada Trunk Širca,
and Aleš Trunk
Deep Learning Assisted Smart Glasses as Educational Aid for Visually Challenged
Students ...................................................................................................................................124
Hawra AlSaid, Lina AlKhatib, Aqeela AlOraidh, Shoaa AlHaidar, and Abul Bashar
Novel Approach towards Arabic Question Similarity Detection ............................................158
Mohammad Daoud
Using K-Means Clustering and Data Visualization for Monetizing Logistics Data ...............164
Hamzah Qabbaah, George Sammour, and Koen Vanhoof
Data Analytics and Business Intelligence Framework for Stock Market Trading ..................178
Batool AlArmouty and Salam Fraihat
Framework Architecture for Securing IoT using Blockchain, Smart Contract and
Software Defined Network Technologies ...............................................................................189
Hasan Al-Sakran, Yaser Alharbi, and Irina Serguievskaia
Arabic Text Classification of News Articles using Classical Supervised Classifiers .............238
Leen Al Qadi, Hozayfa El Rifai, Safa Obaid, and Ashraf Elnagar
A Deep Learning Approach for Arabic Text Classification....................................................258
Katrina Sundus, Fatima Al-Haj, and Bassam Hammo
Identification and Tagging of Malicious Vehicles through License Plate Recognition ..........289
Ahmad Mostafa, Walid Hussein, and Samir El-Seoud
Wrapper-Based Feature Selection for Imbalanced Data using Binary Queuing Search
Algorithm ................................................................................................................................318
Thaer Thaher, Majdi Mafarja, Baker Abdalhaq, and Hamouda Chantar
A Parallel Face Detection Method using Genetic & CRO Algorithms on Multi-Core
Platform ...................................................................................................................................329
Mohammad Khanafsa, Ola Surakhi, and Sami Sarhan
Resolving Conflict of Interests in Recommending Reviewers for Academic
Publications using Link Prediction Techniques ......................................................................341
Sa'ad A. Al-Zboon, Saja Khaled Tawalbeh, Heba Al-Jarrah, Muntaha Al-Asa'd,
Mahmoud Hammad, and Mohammad AL-Smadi
Track 7: Miscellaneous
Would It be Profitable Enough to Re-Adapt Algorithmic Thinking for Parallelism
Paradigm ..................................................................................................................................366
Aimad Eddine Debbi, Abdelhak Farhat Hamida, and Haddi Bakhti
Affordable and Portable Realtime Saudi License Plate Recognition using SoC ....................372
Loay Alzubaidi, Ghazanfar Latif, and Jaafar Alghazo
Causal Path Planning Graph Based on Semantic Pre-Link Computation for Web
Service Composition ...............................................................................................................388
Moses Olaifa and Tranos Zuva
Optimized Multi-Layer Hierarchical Network
Intrusion Detection System with Genetic Algorithms

Pranesh Santikellur
Dept. of Computer Science and Engineering
Indian Institute of Technology
Kharagpur, West Bengal, India 721302
pranesh.sklr@iitkgp.ac.in

Tahreem Haque
Dept. of Computer Science and Engineering
Heritage Institute of Technology
Kolkata, West Bengal, India 700107
tahreemhaque97@gmail.com

Abstract—The number of connected devices on the Internet exceeded 31 billion in 2018, and it is forecast that this number will exceed 50 billion by the year 2020. On the other hand, malicious software and network attacks are rising at an alarming rate. It is estimated that more than 230,000 new malware samples are produced daily, and over 53,000 new Cryptoware malware engines are detected as well. This proliferation of security attacks constitutes a great challenge for Intrusion Detection Systems (IDS), in particular in detecting modern attacks. In this paper, a multi-layer hierarchical Network Intrusion Detection System (NIDS) is proposed with the aim of improving the overall detection performance of the IDS for modern attack types. The proposed multi-layer NIDS utilizes multiple machine learning models in a hierarchical architecture, in addition to using evolutionary computing, namely Genetic Algorithms, to tune the configurations of the neural network models used in the first layer. A modern dataset (CICIDS-2017), which contains several modern attacks, is used to evaluate the proposed approach. The results showed that the proposed multi-layer system significantly improved on the error generalization metrics.

Index Terms—Evolutionary Computing; Machine Learning; Network Intrusion Detection; Network Security; Multi-Layer

I. INTRODUCTION

New technologies such as Cloud and Fog Computing, Big Data, and the Internet of Things (IoT) have progressed enormously and made people more dependent on computer networking technology than ever before. At the same time, incidents of information security breaches have increased drastically, and ensuring end-to-end security is of utmost concern. This ranges from ensuring the safety and trustworthiness of networking hardware to high-level effective defensive measures against various types of network attacks that leverage the vulnerabilities of the deployed security protocols. The Network Intrusion Detection System (NIDS) is considered one of the most important security controls for detecting malicious attack behaviors that might compromise the integrity, confidentiality, or availability of a network. Because of this role, NIDSs have become an important part of network security.

Over the years, several researchers have proposed different methods and techniques for network intrusion detection. The research on network intrusion detection systems has evolved along different approaches, for instance the use of rule-based, statistical-analysis, and Finite State Machine (FSM) based modeling:

• Statistical-based intrusion detection passes network traffic samples through a statistical inference test which decides whether a packet belongs to a normal or a malicious flow. It involves both parametric tests, where the underlying distribution is assumed, and non-parametric tests, which are distribution-free. Chi-square based detection methods [1] and Parzen-window based methods [2] are parametric and non-parametric statistical detection systems, respectively.

• Rule-based intrusion detection characterizes normal flow with a rule; any flow which does not follow the rule is considered malicious. Rule learning algorithms learn the rules from data that can be expressed in the form of an IF-THEN rule. [3] proposed a new rule formation algorithm called base-support association rules to distinguish between normal and intrusive behavior.

• FSM-based intrusion detection deduces an FSM from network data, where states represent network attacks and transitions are matching features. On each matched feature, a successful transition is made, and a final acceptance state decides the attack. The authors of [4] proposed a real-time intrusion detection tool (STAT), which is based on the state transition analysis technique. Salvador and Chan [5] demonstrate a way to perform time series anomaly detection via states and rules generated using RIPPER.

Machine learning has been widely used in NIDS including
packets. A DoS attack [19] can be either a single-source attack or a multi-source attack, where the latter is called a distributed denial of service (DDoS) attack.

2) Web Attacks: Websites often use database servers to store their records; databases also keep the various web applications in a persistent state. One of the attacks that compromise the security of back-end databases is SQL Injection. Cross-site scripting causes malicious JavaScript code to be executed on the victim's machine [20].

3) Port Scan: A port scan attack [21] involves sending client requests to a range of server ports to find active and inactive ports on the server. This attack can also reveal the services running on the victim site.

4) Patator: Patator-based attacks are popular brute-force attacks used for password guessing. Patator [22] is an open-source multi-purpose command-line Python tool. The dataset contains packet flows used for brute-force SSH and FTP logins.

[Figure: Proposed architecture — PCAP files are processed by CICFlowMeter, and the extracted flows feed the Layer-1 binomial classifiers (AdaBoost, ANN, and NB), with the ANN optimized using Genetic Algorithms.]
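For context, the following is a minimal sketch of how CICFlowMeter-style flow records map to the binomial (benign vs. attack) target consumed by the Layer-1 classifiers. The field names follow CICIDS-2017 conventions, but the records and label values below are fabricated for illustration.

```python
from collections import Counter

# Miniature stand-in for CICFlowMeter output; real CICIDS-2017 CSVs carry
# ~80 numeric features per flow plus a string Label column.
flows = [
    {"Destination_Port": 80, "Flow_IAT_Max": 1200.0, "Label": "BENIGN"},
    {"Destination_Port": 22, "Flow_IAT_Max": 50.0,   "Label": "SSH-Patator"},
    {"Destination_Port": 80, "Flow_IAT_Max": 900.0,  "Label": "DoS Hulk"},
    {"Destination_Port": 21, "Flow_IAT_Max": 30.0,   "Label": "FTP-Patator"},
]

def layer1_target(flow):
    """Binomial Layer-1 label: 0 for benign traffic, 1 for any attack."""
    return 0 if flow["Label"] == "BENIGN" else 1

counts = Counter(layer1_target(f) for f in flows)
print(dict(counts))  # → {0: 1, 1: 3}
```

The per-class counts matter in practice because CICIDS-2017 is heavily imbalanced toward benign flows, which is one motivation for a hierarchical design: Layer-1 separates benign from attack, and Layer-2 distinguishes the attack families.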
mine the relationship between two features [23]. The correlation coefficient of X and Y is defined as follows:

r_{X,Y} = ( Σ_{i=1}^{N} (x_i y_i) − N·X̄·Ȳ ) / ( N·σ_X·σ_Y )

If r_{X,Y} is near 1, the features are highly correlated: one feature contains the information about the other, so one of the features is redundant and can be removed. A threshold value of 0.1 is used for extracting the features. Table IV shows the list of features selected by the correlation coefficient.
• Information gain uses an information-theoretic approach to find the features, based on their entropy [24]. The entropy value is higher when the attack distribution is more even, that is, when the data items have more classes. Information gain is a measure of the utility of each attribute in classifying the data items, measured using the entropy value. The entropy and information gain are given by the following formulas:

E(D) = − Σ_{i=1}^{m} P_i log2(P_i)

E(D, A) = Σ_{i=1}^{v} (|D_i| / |D|) · E(D_i)

Gain(A) = E(D) − E(D, A)

Table IV shows the list of features selected using information gain.
• Recursive Feature Elimination with ANOVA: Univariate feature selection with the ANOVA (Analysis of Variance) F-test [25] performs the feature scoring. It analyzes each feature individually to determine the strength of the relationship between the feature and the labels. On the scored features, recursive feature elimination [26] is applied, which recursively builds a model, sets a feature aside, and then repeats the process with the remaining features until all features in the dataset are exhausted. It uses the weights of a classifier to produce a feature ranking. The eliminated features are those with the lowest weights computed during training. We use ANOVA+RFE for both Layer-1 and Layer-2 modeling, as shown in Table IV and Table V.

TABLE IV
FEATURE SELECTION FOR LAYER-1

Selection Metric | Selected Features
Correlation Coefficient | Packet_Length_Mean, min_seg_size_forward, Active_Mean, Active_Std, Active_Max, Active_Min, Idle_Mean, Idle_Std, Idle_Max
Information Gain | Flow_IAT_Std, Idle_Max, Flow_IAT_Max, Fwd_IAT_Max, Idle_Min, Idle_Mean, Packet_Length_Std, Fwd_IAT_Total, Bwd_IAT_Std, Init_Win_bytes_forward
ANOVA + RFE | Fwd_IAT_Std, Bwd_Packet_Length_Mean, Flow_IAT_Max, Fwd_IAT_Max, Idle_Max, Bwd_Packet_Length_Max, Idle_Min, Idle_Mean

TABLE V
FEATURE SELECTION FOR LAYER-2 USING ANOVA & RFE

Attack Group | Selected Features
DoS | Source_Port, Destination_Port, Fwd_Packet_Length_Std, Bwd_Packet_Length_Std, Flow_Bytes.s, Flow_IAT_Max, Fwd_IAT_Total, Bwd_Packets.s
Web Attack | Source_Port, Destination_Port, Total_Fwd_Packets, Flow_IAT_Mean, Flow_IAT_Std, Fwd_IAT_Min, Bwd_IAT_Min, Total_Length_of_Fwd_Packets
PortScan | Source_Port, Destination_Port, Total_Length_of_Fwd_Packets, Flow_Bytes.s, Flow_IAT_Std, Bwd_IAT_Min, Fwd_Packets.s, Bwd_Packets.s
Patator | Source_Port, Destination_Port, Flow_Bytes.s, Flow_Packets.s, Flow_IAT_Min, Fwd_IAT_Min, Bwd_Packets.s, Packet_Length_Std

B. Classification Methods
1) Genetic Algorithms: Genetic Algorithms (GA) [27] are a heuristic global optimization technique based on the principles of biological evolution and natural selection. GAs simulate the evolution of living organisms, where the fittest individuals dominate the weaker ones. In genetic algorithms, the search space of the algorithm is represented as a collection of individuals, which are referred to as chromosomes. The set of parameters specifying an individual is called a gene. The part of the search space to be examined is called the population. The purpose of using a genetic algorithm is to find the dominating individual in the search space, evaluated with respect to an evaluation function called the fitness function. The genetic algorithm uses random mutation, crossover and selection procedures to find the dominant individuals. Genetic algorithms maintain a balance between exploration of the search space and exploitation of good solutions [28].
2) Artificial Neural Networks: A neural network is a set of interconnected nodes called neurons. Each node has a weighted connection to several other nodes in adjacent layers. Neural networks can learn from supervised or unsupervised training. The important components of training a neural network model include the activation function, the loss function, and the optimization algorithm. The activation function enables the neural network to learn non-linear, complex functions. For a supervised model, the loss function calculates the error, i.e., the difference between the output and the target variable. Optimization algorithms are used to find the proper parameters (weights) of a model. The backpropagation algorithm is used to update the weights of each neuron. The proposed method uses feed-forward neural networks trained to predict Layer-1 intrusion
detection. A feed-forward neural network has an input layer, an output layer, and one or more hidden layers between the input and output layers.
The structural optimization of the neural network is done using genetic algorithms. The optimization involves finding the optimal number of hidden layers, the number of neurons within each layer, and the right activation function, in order to maximize the performance of the neural network. Each individual represents a single neural network, and the hyperparameters, such as the activation function and the number of layers, are the genes. The genetic algorithm converges to an efficient architecture that produces better results after 10 generations with 20 individuals each. Our neural network architecture, converged by genetic algorithms, consists of two hidden layers with 512 neurons in each layer.
The hidden-layer activation function used is the sigmoid function [29]. It is a special case of the logistic function, defined by the formula Sigmoid(z) = 1 / (1 + e^(−z)). It is bounded and has a positive derivative at each point. The sigmoid function is the most commonly used because of its non-linearity and the computational simplicity of its derivative. Table VI shows the different hyperparameters used for modeling.

TABLE VI
HYPERPARAMETERS USED FOR LAYER-1 NEURAL NETWORK MODELS

Hyperparameter | Value
L2 regularization penalty | 0.0001
Learning Rate | 0.001
Optimizer | Adam
Loss Function | Cross-entropy
Hidden Layer Activation Function | Sigmoid

3) AdaBoost: AdaBoost is an algorithm for constructing a strong classifier as a linear combination of "weak" classifiers [30]. The AdaBoost algorithm corrects the instances misclassified by the weak classifiers, and it is less susceptible to overfitting than most learning algorithms. A group of weak classifiers has to be prepared as input to the AdaBoost algorithm. Weak classifiers can be linear classifiers, ANNs or other common classifiers. For modeling, we select decision trees as the weak classifier due to their simplicity.
4) Naive Bayes: The naive Bayes model is based on Bayes' rule in probability theory [31]. Naive Bayes uses the probabilities of several related evidence variables. The probability of an outcome is encoded in the model along with the probabilities of the evidence variables. The naive Bayes classifier operates on a strong independence assumption: the probability of one attribute does not affect the probability of another.
5) Decision Trees: Decision trees are a very popular approach to classification. Decision trees learn inductively to construct a model from a pre-classified data set. The technique is to select the features that best divide the data items into their classes. Induction of the decision tree uses the training data, which is described in terms of the attributes. To classify an attack, one starts at the root of the decision tree and follows the branch indicated by the outcome of each test until a leaf node is reached. The main problem here is deciding the attribute that will best partition the data into the various classes. There are many methods to construct decision trees, such as ID3 and C4.5 [32], and CART (Classification and Regression Trees) [33].
The ID3 algorithm works on the concept of information gain, while the C4.5 algorithm is an extension of ID3. C4.5 avoids overfitting the data when determining a decision tree. It can also handle continuous attributes, is able to choose an appropriate attribute selection measure, handles training data with missing attribute values, and improves computational efficiency. CART is a process of generating a binary tree for decision making [33]. CART handles missing data and contains a pruning strategy.

IV. EXPERIMENTAL RESULTS

The proposed modeling setup was implemented using Python 2.7 and sklearn 0.19 [34], and executed on a Linux workstation with 32 GB of main memory and a 4-core, 3.3 GHz processor. The dataset was divided into an 80% training set and a 20% test set. Layer-1 uses MLPClassifier from scikit-learn, which implements a multi-layer perceptron (MLP) algorithm that trains using backpropagation. Similarly, the AdaBoost and naive Bayes implementations use functions from the sklearn library. The base classifier used for AdaBoost is decision trees. Layer-2 uses the decision tree classifier from scikit-learn. It implements a split algorithm very similar to C4.5, which is an extension of the popular ID3 algorithm.
Classification accuracy is not the sole appropriate parameter to measure performance, since the training set consists of a large amount of benign data compared to malicious network traffic. Hence, we have also estimated precision, recall, F1-score and FAR, along with accuracy, in our experiments. The various classifiers mentioned in Section III-B were applied to the network traffic dataset containing both benign and malicious flows. To ensure that the classifier generalizes well to unseen data, prediction is evaluated on the test dataset. Table VII shows the best values of accuracy, precision, recall, F1-score and FAR obtained from these classifiers on the test dataset for the first layer. The results were obtained for different feature selection algorithms across various models. ANN shows the best accuracy compared to the other two classifiers. It correctly identified 98.74% of the malicious traffic in the test set, with a 3.50% false alarm rate. AdaBoost shows a lower false alarm rate than the neural network.
Table VIII presents the results of decision-tree based learning in the Layer-2 model. The accuracy, precision, recall and F1-score values are greater than 99%, and the FAR value is as low as 0.04% for portscan attacks. An important point to consider is that an intrusion detection system with a high number of false alarms is not useful, because normal flows will be reported as malicious. From the above metrics, we can say the Layer-1 and Layer-2 model results are promising and encouraging.
TABLE VII
RESULTS FOR LAYER-1 MODEL
Classifier Features Selection Accuracy (%) Precision (%) Recall (%) F1-Score (%) FAR (%)
Correlation Coefficient 91.21 95.51 94.04 94.77 24.41
ANN Information Gain 94.72 96.88 96.78 96.83 15.72
ANOVA+ RFE 97.54 97.87 99.18 98.52 10.02
All features 98.74 99.30 99.19 99.25 3.50
Correlation Coefficient 93.21 97.82 94.25 96.00 13.44
Adaboost Information Gain 93.12 97.60 94.34 95.94 14.52
ANOVA+ RFE 95.99 97.85 97.35 97.60 11.01
All features 98.19 99.61 98.25 98.92 2.09
Correlation Coefficient 17.21 0.7 88.71 1.42 83.28
Naive Bayes Information Gain 84.03 93.50 88.07 90.71 47.12
ANOVA+ RFE 83.59 91.95 88.78 90.34 49.53
All features 56.64 48.06 99.91 64.90 72.34
TABLE VIII
RESULTS FOR LAYER-2 MODEL
Sub Attack Categories Accuracy (%) Precision (%) Recall (%) F1-Score (%) FAR (%)
DoS 99.82 99.90 99.89 99.90 0.1
Web Attacks 100 100 100 100 0.24
Portscan 100 100 100 100 0.04
Patator 99.9 99.99 100 100 0.07
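As a reading aid for Tables VII and VIII, the reported metrics follow the standard confusion-matrix definitions, with FAR (false alarm rate) being FP / (FP + TN); a minimal sketch (illustrative, not from the paper's code):

```python
def detection_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall, F1-score and false-alarm rate, as fractions."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # a.k.a. detection rate
    f1 = 2 * precision * recall / (precision + recall)
    far = fp / (fp + tn)             # fraction of benign flows flagged as malicious
    return accuracy, precision, recall, f1, far
```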
V. CONCLUSION

In this paper, a multi-layer hierarchical network intrusion detection system was proposed, which mainly consists of two layers: the first layer distinguishes benign from malicious traffic using multiple binomial classifiers, including AdaBoost, neural networks and naive Bayes, while in the second layer a multinomial decision-tree classifier is used to identify the exact attack category from potentially malicious traffic. Genetic algorithms were used in the first layer to perform neural network structure optimization. Moreover, correlation coefficient, information gain and recursive feature elimination with ANOVA were utilized to find the best features for each attack category. To evaluate the proposed model, several combinations of feature selection algorithms and classifiers were applied. The experimental results showed that the NN classifier achieved better accuracy, whereas the AdaBoost classifier achieved the lowest FAR value in the first layer. Conversely, the results in the second layer might indicate an overfitting issue, which we intend to investigate in future work.

REFERENCES

[1] N. Ye and Q. Chen, "An anomaly detection technique based on a chi-square statistic for detecting intrusions into information systems," vol. 17, pp. 105–112, 2001.
[2] D.-Y. Yeung and C. Chow, "Parzen-window network intrusion detectors," in Pattern Recognition, 2002. Proceedings. 16th International Conference on, vol. 4. IEEE, 2002, pp. 385–388.
[3] M. Qin and K. Hwang, "Frequent episode rules for internet anomaly detection," in Network Computing and Applications, 2004 (NCA 2004). Proceedings. Third IEEE International Symposium on. IEEE, 2004, pp. 161–168.
[4] P. Porras, "STAT – a state transition analysis tool for intrusion detection," Santa Barbara, CA, USA, Tech. Rep., 1993.
[5] S. Salvador, P. Chan, and J. Brodie, "Learning states and rules for time series anomaly detection," in FLAIRS Conference, 2004, pp. 306–311.
[6] L. Portnoy, E. Eskin, and S. Stolfo, "Intrusion detection with unlabeled data using clustering," in Proceedings of ACM CSS Workshop on Data Mining Applied to Security (DMSA-2001), 2001, pp. 5–8.
[7] M. Panda and M. R. Patra, "Network intrusion detection using naive Bayes," International Journal of Computer Science and Network Security, vol. 7, no. 12, pp. 258–263, 2007.
[8] S. Mukkamala, A. H. Sung, and A. Abraham, "Intrusion detection using an ensemble of intelligent paradigms," Journal of Network and Computer Applications, vol. 28, no. 2, pp. 167–182, 2005. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S1084804504000049
[9] L. Khan, M. Awad, and B. Thuraisingham, "A new intrusion detection system using support vector machines and hierarchical clustering," The VLDB Journal, vol. 16, no. 4, pp. 507–521, Oct 2007.
[10] H. Debar, M. Becker, and D. Siboni, "A neural network component for an intrusion detection system," in IEEE Symposium on Security and Privacy, 1992, pp. 240–250.
[11] G. Stein, B. Chen, A. S. Wu, and K. A. Hua, "Decision tree classifier for network intrusion detection with GA-based feature selection," in Proceedings of the 43rd Annual Southeast Regional Conference – Volume 2. ACM, 2005, pp. 136–141.
[12] J. Zhang and M. Zulkernine, "A hybrid network intrusion detection technique using random forests," in First International Conference on Availability, Reliability and Security (ARES'06), 2006, pp. 8 pp.–269.
[13] R. Lippmann, J. W. Haines, D. J. Fried, J. Korba, and K. Das, "The 1999 DARPA off-line intrusion detection evaluation," Computer Networks, vol. 34, no. 4, pp. 579–595, 2000.
[14] "KDD Cup 1999," http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html, 2018, accessed: Aug 2018.
[15] J. Song et al., "Statistical analysis of honeypot data and building of Kyoto 2006+ dataset for NIDS evaluation," in Proceedings of the First Workshop on Building Analysis Datasets and Gathering Experience Returns for Security, ser. BADGERS '11. ACM, 2011, pp. 29–36.
[16] "NSL-KDD data set for network-based intrusion detection systems," http://nsl.cs.unb.ca/NSL-KDD/, 2018, accessed: Aug 2018.
[17] I. Sharafaldin, A. H. Lashkari, and A. A. Ghorbani, "Toward generating a new intrusion detection dataset and intrusion traffic characterization," in ICISSP, 2018, pp. 108–116.
[18] A. H. Lashkari, G. Draper-Gil, M. S. I. Mamun, and A. A. Ghorbani, "Characterization of Tor traffic using time based features," in ICISSP, 2017, pp. 253–262.
[19] L. Garber, "Denial-of-service attacks rip the internet," Computer, no. 4, pp. 12–17, 2000.
[20] T.-S. Chou, "Security threats on cloud computing vulnerabilities," International Journal of Computer Science & Information Technology, vol. 5, no. 3, p. 79, 2013.
[21] C. B. Lee, C. Roedel, and E. Silenok, "Detection and characterization of port scan attacks," University of California, Department of Computer Science and Engineering, 2003.
[22] "Patator Ver 0.7," https://github.com/lanjelot/patator, 2018, accessed: Aug 2018.
[23] B. Ratner, "The correlation coefficient: Its values range between +1/−1, or do they?" Journal of Targeting, Measurement and Analysis for Marketing, vol. 17, no. 2, pp. 139–142, 2009.
[24] L. Yu and H. Liu, "Feature selection for high-dimensional data: A fast correlation-based filter solution," in Proceedings of the 20th International Conference on Machine Learning (ICML-03), 2003, pp. 856–863.
[25] M. Berenson, D. Levine, and M. Goldstein, "Intermediate statistical methods and applications: A computer package approach," 1983.
[26] Guyon et al., "Gene selection for cancer classification using support vector machines," Machine Learning, vol. 46, no. 1, pp. 389–422, 2002.
[27] J. H. Holland, "Genetic algorithms," Scientific American, vol. 267, no. 1, pp. 66–73, 1992.
[28] M. Gen and R. Cheng, Genetic Algorithms and Engineering Optimization. John Wiley & Sons, 2000, vol. 7.
[29] J. Han and C. Moraga, "The influence of the sigmoid function parameters on the speed of backpropagation learning," in From Natural to Artificial Neural Computation. Springer Berlin Heidelberg, 1995, pp. 195–201.
[30] Y. Freund, R. Schapire, and N. Abe, "A short introduction to boosting," Journal of the Japanese Society for Artificial Intelligence, vol. 14, no. 771-780, p. 1612, 1999.
[31] S. J. Russell and P. Norvig, Artificial Intelligence: A Modern Approach (International Edition). Pearson, 2002.
[32] J. Quinlan, "C4.5: Programs for machine learning," 1993.
[33] L. Breiman, Classification and Regression Trees. Routledge, 2017.
[34] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, "Scikit-learn: Machine learning in Python," Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
Leader Election and Blockchain Algorithm in Cloud
Environment for E-Health
Basem Assiri
Faculty of CS & IT, Jazan University
Jazan, Saudi Arabia
babumussmar@jazanu.edu.sa
Abstract—The enhancement of e-health systems demands the adoption of computerized techniques, new algorithms and methods. To achieve better efficiency, the Electronic Personal Health Record (E-PHR) requires a new mode of storage, such as cloud storage. Cloud storage is a supportive technique that provides better security and easy availability and accessibility of files. However, the availability of the E-PHR on the cloud allows parallel access to the corresponding files. In parallel and distributed computing, many users communicate and share resources to achieve a targeted goal. Therefore, leader election is a major technique that maintains and coordinates parallelism. This research applies the leader election algorithm to the E-PHR in the cloud environment. It proposes an adaptive leader election algorithm (ALEA) that takes into account medical and healthcare specifications. Therefore, the paper incorporates the ideas of a Primary Leader, a Secondary Leader for emergencies, leader appointment, and multiple tokens to allow parallel updates. ALEA limits the message passing for leader election and for acquiring the token; in the regular case it reduces the number of messages to 0. Moreover, the paper discusses the advantages and disadvantages of using Blockchain technology to implement ALEA.

Keywords—Distributed System, Leader Election, Cloud Storage, Electronic Personal Health Record, Blockchain

I. INTRODUCTION

During the last few decades, remarkable development has been noticed in the field of computational applications, Internet tools and technologies, which has made them part of many fields in people's daily life. These overlaps create new fields or sub-fields such as e-governance, e-learning, e-banking, e-health and many more. This research focuses on e-health, where computerized technology is used in the field of healthcare. Presently, healthcare organizations have incorporated many new, sophisticated techniques to make their systems more advantageous and modern. This is reflected in providing services, stakeholders' satisfaction, cost reduction, and removing managerial burden. Healthcare organizations compete in providing better services to their stakeholders. The service costs can be time, effort, physical space and infrastructure. The use of technology also facilitates the outsourcing of some services to reduce the cost and effort of management, maintenance, risk handling, and new technology adoption.

In this field, one of the major technologies is the use of the E-PHR. It is a digital version of the PHR that enables patient information to be accessed and exchanged electronically [1]. This provides data accessibility, availability, privacy, security, completeness and consistency (which means having accurate and up-to-date data). It helps in monitoring, controlling, and better communication and coordination. It decreases not only the costs but also the risks of keeping physical healthcare records [1, 2, 3]. It is known that when hurricane Katrina hit the city of New Orleans in the USA in 2005, the flood destroyed the healthcare records of thousands of people. Many people left the city; medical treatments were necessary, but the doctors of the respective hospitals could not access the health history of the respective patients [4, 5].

In addition, the use of E-PHR technology requires supportive infrastructure, software, hardware and resources. Storage is one of the vital aspects to be considered. Indeed, cloud storage is one of the advanced technologies that facilitate access to distributed resources. It provides servers to store files in and to access them wherever the Internet is available. Cloud storage technology is size-effective, secure, maintainable and accessible, which makes it suitable for the E-PHR [6, 7].

With the use of the E-PHR and cloud storage, many devices process files in parallel. This requires specific control and coordination over the shared files to guarantee data correctness and consistency. Therefore, leader and leader election techniques are usable to control and maintain data consistency on shared E-PHRs. Control means deciding who may create, access, copy, move, edit and delete the E-PHR. Data consistency means having the expected result after each action or process on the data; in other words, the output of each process on the data is predictable [8]. Parallel access to an E-PHR may cause conflicts. There is no conflict when many users access the same E-PHR simultaneously for reading. However, when one or more of them try to update a file, some of them may read un-updated data, or the update of one may contradict the updates of others, which is a conflict. For example, when two doctors access the same E-PHR in parallel and one doctor updates the patient's blood pressure record, the other may still be considering the non-updated blood pressure reading and have no idea about the change. Leader election provides exclusive access (known as a token) to control and keep data consistent.

Furthermore, Blockchain is one of the new, promising technologies that support the implementation of decentralized distributed systems. Blockchain is a distributed public ledger that keeps records, transactions or any digital processes [9]. Blockchain includes a cluster of nodes that share the same data, propose processes on the data, and verify the execution of the processes through consensus. Nakamoto exploited the idea of the Blockchain to introduce the first cryptocurrency, known as Bitcoin. Using Bitcoin, users have a peer-to-peer electronic financial system in which they can exchange money with no need for a third party [9]. After the success of Bitcoin, many other cryptocurrencies have been introduced and used for other financial services such as trading and insurance [10, 11]. The idea of Bitcoin inspires people to extend the use of the Blockchain to many other fields such as judiciary, notary, rights, ownership, healthcare and educational services [11].

This paper proposes and investigates an adaptive leader election algorithm (ALEA) suiting the use of E-PHRs. It
978-1-7281-2882-5/19/$31.00 ©2019 IEEE
introduces the ideas of a primary leader, a secondary leader, and having multiple tokens. It shows how to handle the failure of leaders with a limited number of messages. Besides the technical specifications of the leader election, the procedural aspects and administrative rules of the medical environment are considered. The paper also discusses the advantages and disadvantages of using the Blockchain architecture in combination with ALEA.

II. RELATED WORK

Many works propose methodologies (strict or relaxed) to maintain the consistency of cloud storage [12, 13]. Coppieters et al. provide a strict consistency algorithm, where they order all concurrent processes on all replicas. In fact, in sequential execution, it is easy to argue about consistency, since a process accesses the file when the other finishes. Thus, there must be a match between the order of the concurrent execution and the order of a correct sequential execution (which is called serializable) [14]. Zellag and Kemme show that the relaxation of consistency for the cloud results in approximated output, and the influence of such relaxation has an insignificant effect on cloud systems [15].

Another approach to sustaining consistency is to choose one user as a leader. The leader controls and coordinates the tasks among all other users to achieve the targeted goal. When users detect a failure of the leader, they elect a new one using leader election algorithms [8, 16]. In the bully algorithm [8, 17], the complexity of electing a new leader is O(n²) messages, which is very expensive. In the token ring algorithm [18], the complexity of electing a new leader is O(n) messages. Numan et al. propose an algorithm that uses a shared centralized queue of all users. The leader is the head of the queue; when it fails, another user dequeues the old head. The complexity of this approach is O(1) [19].

Furthermore, the E-PHR is currently managed through hospitals or healthcare agencies (third parties). However, the use of Blockchain technology helps to achieve fully decentralized management of the E-PHR. Blockchain also supports the availability, robustness and security of the E-PHR, and all related financial and administrative operations [9, 10, 20].

III. PROPOSED SYSTEM MODEL

ALEA is built and designed in pursuance of the Well-Organized Bully Leader Election algorithm [20], which uses a linked-list queue to reduce the complexity of the leader election process to O(1). For more efficiency and adaptability, we modify the algorithm significantly to be applicable and compatible with healthcare (medical) procedures and specifications.

ALEA creates a queue Q whose size Size shows the total number of nodes, where a node is denoted as Node. A node represents a processor/doctor; each doctor is represented by a unique identifier doctor_Id, a pointer to the next node, and a token flag. Upon inserting a new node into the queue, the doctor can read the E-PHR but cannot update it unless the token equals 1. The head of the queue is the leader, termed PLeader. If there is any requirement to change the leader, it dequeues the head node and moves the PLeader pointer to the next node. A shared memory is used to store the queue so that every node (user) is capable of seeing the updated information.

In some cases, such as emergency or transfer, it creates a temporary queue TempQ and appoints a secondary leader called SLeader.

The Q linked-list is shown in Figure 1, where the main queue list has three doctors and PLeader points to the head of the queue. Another linked-list queue, TempQ, appears in the emergency block with SLeader. Practically, this queue does not usually exist.

Fig. 1. Leader election queue using a linked-list, showing the PLeader pointing to the head node; another linked-list queue appears in the emergency block with SLeader.

The TokenPointer shows who is holding the token for exclusive access to update a file. Section V (C) shows how to have more than one token.

IV. PROPOSED ALGORITHM

In the beginning, an E-PHR is created for a patient and the hospital appoints a leader, where the leader is the primary doctor PLeader of the patient, as shown in Algorithm 1. For every doctor, create a new node that contains three things: (i) data, where data is the unique doctor_Id; (ii) a pointer to the next node; and (iii) a token flag with a value of either 0 or 1. Upon inserting a new node into the queue, the doctor can read the E-PHR (when the token value is 0), but for update permission the token has to change to 1. The TokenPointer is another pointer pointing to the node that holds the token (token = 1). Then increase the queue size.

When the PLeader is inaccessible for some reason (except in a failure situation), create a temporary queue TempQ that is led by SLeader, as shown in Algorithm 2. On creating a new node, increase the size of the queue. The procedures in Algorithm 2 are almost similar to those of Algorithm 1.

Algorithm 3 shows how to add a new node (when the PLeader wants to add a new doctor to the doctors' team). It creates a new node, enqueues it to Q, and increases the size of the queue. The same procedure is applicable for TempQ.
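The queue just described (head = PLeader, a per-node token flag, and O(1) leader replacement) can be sketched as follows. This is a simplified single-process illustration: the field names follow the paper, but the class and method names are our own, not the paper's pseudocode.

```python
class Node:
    """Queue node: unique doctor_Id in `data`, a next pointer, a token flag."""
    def __init__(self, doctor_id):
        self.data = doctor_id
        self.next = None
        self.token = 0

class EPHRQueue:
    """Linked-list queue whose head is PLeader; leader replacement is a
    single dequeue of the head, i.e. O(1)."""
    def __init__(self, doctor_id):
        self.pleader = Node(doctor_id)   # the primary doctor leads
        self.tail = self.pleader
        self.pleader.token = 1           # the leader initially holds the token
        self.token_pointer = self.pleader
        self.size = 1

    def add_doctor(self, doctor_id):
        node = Node(doctor_id)           # new doctors join with token = 0
        self.tail.next = node
        self.tail = node
        self.size += 1

    def change_leader(self):
        """Dequeue the head and move PLeader to the next node."""
        if self.pleader.next is None:
            return                       # the only doctor cannot be removed
        old = self.pleader
        self.pleader = old.next
        if self.token_pointer is old:    # move the token along with leadership
            old.token = 0
            self.token_pointer = self.pleader
            self.pleader.token = 1
        self.size -= 1
```

In the full algorithm the leader change is guarded by a CAS so that only one failure detector performs the dequeue; that synchronization is omitted here.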
In some cases, the team or the leader decides to modify the priority of the doctors who access the E-PHR. Then, it has to rearrange the positions of the nodes in Q (swapping), as shown in Algorithm 4. After the insertion of the doctors' Ids, TempPointer1 starts from the head position, checks the doctor_Id, and keeps shifting until it finds the first doctor. After that, TempPointer2 continues and keeps shifting until it finds the other doctor. Finally, it swaps them by inserting doctor_Id1 in the node of TempPointer1 and doctor_Id2 in the node of TempPointer2.

A PLeader may retire from leadership but remain a member of the doctors' team of the E-PHR (Algorithm 5). If the doctor is the only node in Q, the retirement is not allowed. Otherwise, use TPointer to point to the PLeader node and move the PLeader pointer to the next node in Q. If TPointer has the token, move the token to the next node by resetting the token to 0, moving the TokenPointer to the next node, and setting it to 1. Finally, dequeue the TPointer node and enqueue it again at the other end of Q. The same procedure works for TempQ.

Algorithm 6 represents the situation when a doctor will not access the E-PHR anymore (clearness). If the doctor is the only one who handles the E-PHR, the clearness is not allowed. Otherwise, remove the node just as in Algorithm 5. On the other hand, for TempQ all doctors can make clearness even if theirs is the only node in the queue.

Algorithm 7 explains how to exercise leadership and how to move the token from one node to another. The PLeader finds the required node using doctor_Id and activates the token by changing it to 1, or deactivates it by resetting it back to 0. In addition, the PLeader can activate the token for more than one node at the same time. This helps us to apply parallelism. This situation arises when no work dependencies exist among the nodes. It is explained in more detail in Section V (C).

In Algorithm 8, when any doctor needs to get the token, it sends an acquiring message to the PLeader. Then it has to wait for a specific time Timeout. (It gets the current time, adds the Timeout, stores the new time in T, and waits until the current time becomes T.) It waits until either it receives an acknowledgment (reply) message from the PLeader or the Timeout finishes. When the Timeout finishes without the acknowledgment message being received, the leader has failed or crashed, and the node calls Failure(). Since many nodes may discover the failure of the leader at the same time, every node copies and passes the doctor_Id of the failed leader in ID (more details are explained in Algorithm 9).

Algorithm 9 illustrates the state of failure of a leader. Upon the discovery of a leader failure or crash, the detector node calls Failure() and passes ID, which is a local copy of the doctor_Id of the PLeader. In Failure(), move the PLeader pointer to the next node and dequeue the failed node. Conversely, if more than one node detects the failure, all of them call Failure(), which would result in multiple unnecessary dequeues. Therefore, it must use a Compare-and-Swap statement CAS, which is an atomic operation that allows only one node to change the leader. Using CAS, one detector checks whether the PLeader is still in failure (whether the doctor_Id of the PLeader still equals ID) and calls Clearness(). In Clearness(), the failed leader is dequeued and another leader is appointed. Thus, the other detectors (who apply CAS) will find the doctor_Id of the new PLeader, which is not equal to their local ID, so they have nothing to do.

Algorithm 1
1. ║ Initialization():
2. // Upon creating the E-PHR
3. // Create the queue linked-list
4. Size = 0
5. Node = new_node()
6. Node→data = doctor_Id
7. Node→next = NULL
8. Node→token = 0
9. PLeader ← Node
10. TokenPointer = PLeader
11. TokenPointer→token = 1
12. Size++
13. return

Algorithm 2
14. ║ Emergency():
15. // To add a new doctor as SLeader
16. // Create a temporary queue TempQ
17. Node = new_node()
18. Node→data = doctor_Id
19. Node→next = NULL
20. Node→token = 0
21. SLeader ← Node
22. TokenPointer1 = SLeader
23. TokenPointer1→token = 1
24. Size++
25. return

Algorithm 3
26. ║ AddDoctor():
27. // To add a new doctor to Q
28. Node = new_node()
29. Node→data = doctor_Id
30. Node→next = NULL
31. Node→token = 0
32. Q ← enqueue()
33. Size++
34. return

Algorithm 4
35. ║ SwapDoctors(doctor_Id1, doctor_Id2):
36. // To change the positions of doctors
37. TempPointer1 = PLeader
38. TempPointer2
39. While i = 1 to Size do
40. If (TempPointer1→data != doctor_Id1)
41. TempPointer1 = TempPointer1→next
42. Else
43. // First doctor is found, now find the other
44. TempPointer2 = TempPointer1→next
45. Break
46. End While
47. While i ≤ Size do
48. If (TempPointer2→data != doctor_Id2)
49. TempPointer2 = TempPointer2→next
50. Else
51. //Second doctor is also found, now swap (CurTime()< T) do
52. TempPointer1data = doctor_Id2 105. Wait()
53. TempPointer1token = 0 106. End While
54. TempPointer2data = doctor_Id1 107. //If there is no response, then PLeader fails
55. TempPointer2token = 0 108. //Otherwise it is a live and do nothing
56. Break 109. If (receive_ack() = false)
57. End While 110. ID = Pleaderdata
58. return 111. Failure(ID)
112. return
Algorithm 5
59. ║ Rretirement(): Algorithm 9
60. //To retire from leadership 113. ║ Failure (ID):
61. If (PLeadernext = NULL) 114. // If leader still in failure or crash
62. return False 115. CAS (Pleaderdata, ID, Clearness())
63. Else 116. Return
64. TPointer = PLeader
65. PLeader = PLeadernext
66. If (TokenPointer = TPointer) V. ANALYSIS
67. TokenPointertoken = 0 This section discusses many important points such as
68. TokenPointer = TokenPointernext algorithm correctness, synchronization, file sharing, traffic
69. TokenPointertoken = 0 flow and replication.
70. TPointer.dequeue()
71. TPointer.enqueue() A. Correctness
72. return
It is trivial to argue about the correctness of ALEA
Algorithm 6 since it relies on the correctness of bully algorithm [8, 17],
73. ║ Clearness(): and the well-organized bully algorithm [19]. Moreover,
74. //To free the patient completely ALEA uses a linked-list queue in which it follows well-
75. If (PLeadernext = NULL) known lock-based or lock-free algorithms such as the
76. return False algorithm of Michael and Scot [21], that is considered as
the best lock-free algorithm in this filed.
77. Else
78. TPointer = PLeader
79. PLeader = PLeader next B. Syncronization
80. If (TokenPointer = TPointer) To synchronize operations, ALEA considers the event-
81. TokenPointertoken = 0 based models [8]. Indeed, every doctor performs a read
82. TokenPointer = TokenPointernext and/or update operations on the file. Every operation is
83. TokenPointertoken = 0 represented into two instantaneous events, which are begin
84. TPointer.dequeue() and end. Then, order the concurrent operations in a way
85. return that matches a correct sequential execution; this is known
as Linearizability [22]. Linearizability respects the real-
Algorithm 7 time order of the concurrent execution. Therefore, the
86. ║ Leadership (doctor_Id1): synchronization of events must follow a well-form clock.
87. //When the leader moves the token However, since doctors live in different time zoon and
88. //First get the token accessing files remotely, also patients travel to different
89. TokenPointertoken=0 places; the physical clock is difficult to be used except if
the whole world uses one time zone such as Greenwich
90. //Now find the node that will get the token
Time. Otherwise, it is preferred to use a logical clock to
91. While i=1 to size do
order events such as Lamport's logical clock [8, 22]. In
92. If (TokenPointerdata != doctor_Id1) ALEA, the operations that happen in different processors
93. TokenPointer1 = TokenPointer1 next are ordered since they use a single version of the E-PHR
94. Else (extra versions only for recovery) and any update must use
95. TokenPointer1token=1 token to take places. Thus, the order of the operations
96. Break follows the token movements.
97. End While
98. return C. Parallel Access of E-PHR
To use E-PHR in parallel and avoid all kinds of conflict,
Algorithm 8
read operations accesses the file without acquiring the token,
99. ║ Reminder ():
while the update operations have to acquire the token to
100. //Doctor reminds leader to get the token
execute. The token is implemented as a file lock Lock().
101. Send_msg(PLeader, "Acquire token")
However, the access of the read operation may be denied, if
102. //Wait for some time (Timeout)
the file is locked by an update operation.
103. T = CurTime() + Timeout
104. While ((receive_ack() = false) &&
11
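To make the listings above concrete, the queue maintenance and token movement of Algorithms 3–7 can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the class and method names (`DoctorQueue`, `add_doctor`, `retirement`, `leadership`) are illustrative only, and message passing, TempQ and the failure handling of Algorithms 8–9 are intentionally omitted.

```python
# Minimal, illustrative sketch of the ALEA doctors' queue (Algorithms 3-7).
# Names are assumptions for this sketch, not identifiers from the paper.

class Node:
    def __init__(self, doctor_id):
        self.data = doctor_id
        self.token = 0          # 1 = this doctor holds the update token
        self.next = None

class DoctorQueue:
    def __init__(self, doctor_id):
        # Algorithm 1: the first doctor becomes PLeader and gets the token.
        self.pleader = Node(doctor_id)
        self.pleader.token = 1
        self.token_pointer = self.pleader
        self.size = 1

    def _find(self, doctor_id):
        node = self.pleader
        while node is not None and node.data != doctor_id:
            node = node.next
        return node

    def _tail(self):
        node = self.pleader
        while node.next is not None:
            node = node.next
        return node

    def add_doctor(self, doctor_id):
        # Algorithm 3: enqueue a new doctor at the tail of Q.
        self._tail().next = Node(doctor_id)
        self.size += 1

    def swap_doctors(self, id1, id2):
        # Algorithm 4: exchange the positions (data) of two doctors;
        # both tokens are reset, as in the listing.
        n1, n2 = self._find(id1), self._find(id2)
        if n1 is not None and n2 is not None:
            n1.data, n2.data = n2.data, n1.data
            n1.token = n2.token = 0

    def retirement(self):
        # Algorithm 5: the leader steps down but stays in the queue.
        if self.pleader.next is None:
            return False                 # the only doctor cannot retire
        old = self.pleader
        self.pleader = old.next
        if self.token_pointer is old:    # pass the token to the next node
            old.token = 0
            self.token_pointer = old.next
            self.token_pointer.token = 1
        old.next = None                  # dequeue the old leader and
        self._tail().next = old          # enqueue it at the other end
        return True

    def leadership(self, doctor_id):
        # Algorithm 7: the leader moves the token to the given doctor.
        self.token_pointer.token = 0
        node = self._find(doctor_id)
        if node is not None:
            node.token = 1
            self.token_pointer = node
```

For instance, with three doctors enqueued, `leadership("d3")` activates the token of the third doctor, and a subsequent `retirement()` transfers the leadership to the second doctor while re-enqueuing the former leader at the tail.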
To enhance parallelism, the E-PHR file is divided into multiple sections, so that doctors can access different sections in parallel. Every section is a range of bytes with a corresponding lock. This means there are multiple locks for the same file, and every doctor must specify the section it needs to access.

Accordingly, ALEA is modified so that there is a set of locks whose number is denoted by k, where k is an integer (the number of locks equals the number of sections). Every doctor acquires the token and determines the required section s, where s is an integer from 1 to k. In some cases, the leader decides s for every doctor. ALEA is further modified so that the initial value of the token is 0, which implies there is no update access. When an update operation is required, the token value changes to the value s of the respective (required) section. In addition, to lock the whole file, the token value is set to k+1. In ALEA, TokenPointer is therefore replaced by a two-dimensional array of pointers, from which the leader can identify the doctors who hold tokens and their respective section numbers. The two-dimensional array has k rows and two columns, one for the doctor_Id and the other for the value of s.

D. Traffic Flow

As mentioned earlier, concurrent access to a critical shared resource causes conflicts that result in an incorrect view of the data. To cope with this issue, message passing is required to elect a leader and to move the token.

Firstly, the leader is elected through messages passed among all nodes. As mentioned in Section II, the complexity of centralized leader election algorithms reaches n² messages [16, 17]. Decentralized leader election algorithms allow more than one leader, with decisions based on the votes of the majority [16, 17]; this type of permission requires approximately n messages. However, using the Well-Organized Bully Leader Election algorithm [19] and ALEA, the number of messages is reduced to 0, because leader election is conducted by maintaining a shared queue linked list, so the leader is elected without traffic.

Secondly, there are other kinds of messages for moving the token among nodes. In many cases, the token does not need to be held by the leader. In ALEA, the token is a flag that exists in each node; the leader sets it to 1 to acquire the lock and resets it to 0 to release the lock. In the normal case, the leader moves and sets the token with no messages. In rare cases, a node (doctor) insists for some reason on getting the token, so it sends a message to acquire it (Algorithm 8). In this situation, the respective node receives an acknowledgement message from the leader. This phenomenon occurs rarely and does not create any traffic problem in the system.

E. Fairness and Starvation (Timeout)

Regarding fairness, there are two dimensions: one related to leader election and the other related to the token.

Firstly, for fair leader election, the hospital or the patient decides the primary doctor from a medical point of view. All doctors are enlisted (enqueued) in the queue linked list from one end, while leadership transfers from the other end with respect to the enqueuing order (First-In-First-Leader). However, given the special conditions of healthcare and medical treatment, the leader may decide to change the order of the nodes in the queue. This reduces fairness from the technical perspective (Approximate-First-In-First-Leader), but it is fair on medical and humanitarian grounds, which are the main concern of the algorithm.

Secondly, the leader passes the token from one doctor to another, which is completely fair from the perspective of medical treatment. Moreover, there is no chance of starvation (a doctor waiting forever to become a leader or to get the token). Starvation has an alternative meaning in a medical system: in some cases, a doctor is enqueued to access the file for a specific task but has no need to become a leader, so there is no starvation.

There are, however, some basic conventional rules and regulations in a medical system that tell when a doctor should get the token or wait. In the case of an emergency, the emergency department grants access permission over the E-PHR to a secondary leader (so there is no starvation). In addition, as in the previously mentioned scenario, if the leader fails or crashes, a new leader is elected and the token is passed as usual.

VI. USING BLOCKCHAIN

As mentioned previously, Blockchain is a technology for building an electronic ledger based on the consensus of a cluster of nodes [9]. It is used for financial transactions and can be extended to many other areas [9, 10]. Indeed, Blockchain is a technology with many algorithms. Generally, Blockchain algorithms have three stages: (i) one node broadcasts a proposal to the other nodes; (ii) the nodes vote on the correctness of the proposal; (iii) according to the consensus of the votes, the proposal commits or aborts.

To implement ALEA using Blockchain technology, the queue-creation process of Algorithm 1 and Algorithm 2 will be conducted through consensus decisions, so Blockchain helps avoid the need for a third party such as hospitals. In addition, the functions that maintain the queue list will go through making a proposal, voting and then taking a decision. This applies to functions such as adding a doctor to the queue list (Algorithm 3), swapping doctors (Algorithm 4), or removing a doctor from the queue list (Algorithms 5 and 6). The same procedure is used to manage token-acquisition decisions (Algorithms 7, 8 and 9).

VII. DISCUSSION ON THE USE OF BLOCKCHAIN

Combining Blockchain technology with ALEA has several positive and negative consequences. There are therefore many issues to discuss, such as decentralization, robustness, availability, ownership protection, security, privacy, computational cost and traffic flow [9, 10, 22].

1) Decentralization: decentralization removes the need for the hospital's permission to access the E-PHR, to assign the leadership or to create an emergency linked list. The decentralization of Blockchain removes the need for a third party, since the decision is taken through the consensus
of the cluster nodes. It gives patients full access to their E-PHR. However, decentralization must be controlled strictly to avoid delay and trust issues.

2) Robustness: using Blockchain avoids a single point of failure: when some nodes fail, the others continue the work. For example, if there is an emergency case while the hospital has a technical issue in granting access to the E-PHR, the majority of nodes do the work and the system keeps running robustly.

3) Availability: in Blockchain technology, every node has a complete copy of the files. This provides a high level of availability but increases the number of replicas. The large number of redundant replicas can be considered a negative point, since it increases the cost of space, communication and file updating.

4) Ownership: in regular systems, the owner of the file, hospitals, healthcare agencies and some leaders have the privilege to change the ownership of some files. With Blockchain, however, only the owner of the file can change its ownership, which is a negative point that contrasts with the specifications of this model. For example, if the patient is unconscious or has died, the ownership change needs a different procedure.

5) Security and privacy: with Blockchain, user identities are hidden, and files, transactions and processes are encrypted, which is positive from a data-sensitivity perspective. However, the nodes are untraceable, which is not acceptable for healthcare systems, for example when some doctors take suspicious or illegal decisions.

6) Immutability: using Blockchain, committed transactions are unchangeable. Users are therefore only able to create and read files, whereas in healthcare systems users need to create, read, update and delete files.

7) Performance cost: Blockchain technology has negative impacts on computational and communication costs. First, the computation must be executed on more than one node (the validators). Second, the proposer proposes a transaction by broadcasting it to all nodes, say n messages. Then the nodes check the correctness of the proposal and send votes to all the others, which costs n² messages. After that, based on the votes, a commit or abort message is broadcast from n nodes to all the others. Thus, the use of Blockchain negatively affects the speed of the system and its traffic flow.

VIII. CONCLUSION

This paper proposes an adaptive algorithm for the cloud-based E-PHR, so that it can be easily used with minimal infrastructure. ALEA enhances parallelism using an alternative leader election technique that is suitable for healthcare systems. The paper analyzes and investigates the performance of the proposed algorithm to clarify the advantages of ALEA compared to existing ones. It also shows that using Blockchain to implement ALEA has many negative impacts.

REFERENCES

[1] Tang, Paul C., et al. "Personal health records: definitions, benefits, and strategies for overcoming barriers to adoption." Journal of the American Medical Informatics Association 13.2 (2006): 121-126.
[2] Davis, Selena, A. Roudsari, and Karen L. Courtney. "Designing Personal Health Record Technology for Shared Decision Making." Studies in Health Technology and Informatics 234 (2017): 75-80.
[3] Woollen, Janet, et al. "Patient experiences using an inpatient personal health record." Applied Clinical Informatics 7.02 (2016): 446-460.
[4] Sherman, Arloc, and Isaac Shapiro. "Essential facts about the victims of Hurricane Katrina." Center on Budget and Policy Priorities 1 (2005): 16.
[5] Taylor, Shayne Sebold, and Jesse M. Ehrenfeld. "Electronic health records and preparedness: lessons from Hurricanes Katrina and Harvey." Journal of Medical Systems 41.11 (2017): 173.
[6] Kuo, Alex Mu-Hsing. "Opportunities and challenges of cloud computing to improve health care services." Journal of Medical Internet Research 13.3 (2011).
[7] Dinh, Hoang T., et al. "A survey of mobile cloud computing: architecture, applications, and approaches." Wireless Communications and Mobile Computing 13.18 (2013): 1587-1611.
[8] Tanenbaum, Andrew S., and Maarten Van Steen. Distributed Systems: Principles and Paradigms. Prentice-Hall, 2007.
[9] Nakamoto, Satoshi. "Bitcoin: A peer-to-peer electronic cash system." (2008).
[10] Kuo, Tsung-Ting, Hyeon-Eui Kim, and Lucila Ohno-Machado. "Blockchain distributed ledger technologies for biomedical and health care applications." Journal of the American Medical Informatics Association 24.6 (2017): 1211-1220.
[11] Crosby, Michael, et al. "Blockchain technology: Beyond bitcoin." Applied Innovation 2.6-10 (2016): 71.
[12] Hashem, Ibrahim Abaker Targio, et al. "The rise of 'big data' on cloud computing: Review and open research issues." Information Systems 47 (2015): 98-115.
[13] Agrawal, Divyakant, Sudipto Das, and Amr El Abbadi. "Big data and cloud computing: current state and future opportunities." Proceedings of the 14th International Conference on Extending Database Technology. ACM, 2011.
[14] Coppieters, Tim, Wolfgang De Meuter, and Sebastian Burckhardt. "Serializable eventual consistency: consistency through object method replay." Proceedings of the 2nd Workshop on the Principles and Practice of Consistency for Distributed Data. ACM, 2016.
[15] Zellag, Kamal, and Bettina Kemme. "How consistent is your cloud application?" Proceedings of the Third ACM Symposium on Cloud Computing. ACM, 2012.
[16] Tel, Gerard. Introduction to Distributed Algorithms. Cambridge University Press, 2000.
[17] Coulouris, George F., Jean Dollimore, and Tim Kindberg. Distributed Systems: Concepts and Design. Pearson Education, 2005.
[18] Soundarabai, P. Beaulah, et al. "Message efficient ring leader election in distributed systems." Computer Networks & Communications (NetCom). Springer, New York, NY, 2013. 835-843.
[19] Numan, Muhammad, et al. "Well-Organized Bully Leader Election Algorithm for Distributed System." 2018 International Conference on Radar, Antenna, Microwave, Electronics, and Telecommunications (ICRAMET). IEEE, 2019.
[20] Pilkington, Marc. "Blockchain technology: principles and applications." Research Handbook on Digital Transformations (2016): 225.
[21] Michael, Maged M., and Michael L. Scott. Simple, Fast, and Practical Non-Blocking and Blocking Concurrent Queue Algorithms. No. TR-600. University of Rochester, Department of Computer Science, 1995.
[22] Herlihy, Maurice, and Nir Shavit. The Art of Multiprocessor Programming. Morgan Kaufmann, 2011.
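To close the discussion of Sections VI and VII, the three-stage Blockchain flow described there (propose, vote, commit or abort by majority consensus) can be sketched in a few lines of Python. All names here (`LedgerNode`, `propose`, the dict-shaped proposal) are assumptions made for this sketch, not identifiers from the paper, and network broadcasting is simulated by a simple function call.

```python
# Illustrative sketch of the three-stage Blockchain flow from Section VI:
# (i) broadcast a proposal, (ii) the nodes vote, (iii) commit or abort
# according to the majority. Names are assumptions for this sketch only.

class LedgerNode:
    def __init__(self, ledger):
        self.ledger = list(ledger)   # every node keeps a full copy of the ledger

    def vote(self, proposal):
        # Example voting rule: reject adding a doctor who is already enqueued.
        return proposal["doctor_id"] not in self.ledger

def propose(nodes, proposal):
    # Stage (i): one node broadcasts the proposal to all the others.
    votes = [node.vote(proposal) for node in nodes]    # stage (ii): voting
    if sum(votes) > len(nodes) // 2:                   # stage (iii): majority consensus
        for node in nodes:
            node.ledger.append(proposal["doctor_id"])  # commit on every copy
        return "commit"
    return "abort"
```

Under this sketch, adding a doctor (Algorithm 3) becomes `propose(nodes, {"doctor_id": "d2"})`, which commits only if a majority of nodes accept it; proposing a doctor already present in the ledger aborts.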
Automotive Cybersecurity: Foundations for
Next-Generation Vehicles
Michele Scalas, Student Member, IEEE, and Giorgio Giacinto, Senior Member, IEEE
Department of Electrical and Electronic Engineering
Abstract—The automotive industry is experiencing a serious transformation due to a digitalisation process and the transition to the new paradigm of Mobility-as-a-Service. The next-generation vehicles are going to be very complex cyber-physical systems, whose design must be reinvented to fulfil the increasing demand for smart services, both for safety and entertainment purposes, causing the manufacturers' model to converge towards that of IT companies. Connected cars and autonomous driving are the preeminent factors that drive along this route, and they cause the necessity of a new design to address the emerging cybersecurity issues: the "old" automotive architecture relied on a single closed network, with no external communications, whereas modern vehicles are going to be always connected, which means the attack surface will be much more extended. The result is the need for a paradigm shift towards a secure-by-design approach. In this paper, we propose a systematisation of knowledge about the core cybersecurity aspects to consider when designing a modern car. The major focus is on the in-vehicle network, including its requirements, the currently most used protocols and their vulnerabilities. Moreover, starting from the attackers' goals and strategies, we outline the proposed solutions and the main projects towards secure architectures. In this way, we aim to provide the foundations for more targeted analyses about the security impact of autonomous driving and connected cars.

Index Terms—Cybersecurity, Mobility, Automotive, Connected Cars, Autonomous Driving

I. INTRODUCTION

THE automotive industry is experiencing a serious transformation due to a digitalisation process in many of its aspects and the new mobility models. A recent report by PwC [20] states that by 2030 the vehicle parc in Europe and the USA will slightly decline, but at the same time the global industry profit will significantly grow. The main factor for this phenomenon is the concept of Mobility-as-a-Service (MaaS), i.e. the transition to car sharing and similar services, at the expense of individual car ownership (expected to drop from 90% to 52% in China [20]). In this sense, the main keywords that will contribute to this new model are 'connected cars' and 'autonomous driving'.

According to Upstream Security [27], by 2025 the totality of new cars will be shipped connected, where 'connected' refers not only to the possibility of leveraging Internet or localisation services but also to the adoption of the V2X (Vehicle-to-X) paradigm. This term refers to the capability of the car to communicate and exchange data with other vehicles (V2V, Vehicle-to-Vehicle), with a generic infrastructure (V2I) or with pedestrians (V2P). The typical application of these models is smart cities, with the aim of optimising traffic management, sending alerts in case of incidents, and coordinating fleets of vehicles.

As regards autonomous driving, it consists in expanding the current Advanced Driver Assistance Systems (ADASs), such as lane keeping and braking assistants, in order to obtain a fully autonomous driverless car. The Society of Automotive Engineers (SAE) provides, in fact, six possible levels of autonomy, from level 0, with no assistance, to level 5, where the presence of the driver inside the car is not needed at all.

All these innovations have a common denominator: information technology. Current top-end vehicles have about 200 million lines of code, up to 200 Electronic Control Units (ECUs) and more than 5 km of copper wires [23], which means cars are becoming very complex software-based IT systems. This fact marks a significant shift in the industry: the "mechanical" world of original equipment manufacturers (OEMs) is converging towards that of IT companies.

In this context, the safety of modern vehicles is strictly related to addressing cybersecurity challenges. The electronic architecture of the vehicle has been designed and standardised over the years as a "closed" system, in which all the data of the ECUs persisted in the internal network. The new services described above require instead that data spread across multiple networks; there is, therefore, a bigger attack surface, i.e. new possibilities to be vulnerable to attackers. Hence, automotive OEMs need to reinvent the car architecture with a secure-by-design approach.

Another implication of this transformation is that the vehicle will be a fully-fledged cyber-physical system (CPS), that is, "a system of collaborating computational elements controlling physical entities" [15]. This definition reminds us that, in terms of security, both the cyber- and the physical-related aspects should be considered. As an example, an autonomous car heavily interacts with the real-world environment and faces the challenge of guaranteeing the resilience of the sensing and actuation devices. Therefore, security in automotive also involves addressing the specific issues of a CPS, as can be read in the work by Wang et al. [30]; however, in this paper, we will consider the attacks that are carried out in the cyber-space.
A. Constraints

Although common IT security concepts can be used to design car electronics, there are some specific constraints to consider, both on the hardware and the software side, as summarised by Studnia et al. [26] and Pike et al. [18]:

Hardware limitations The typical ECUs for cars are embedded systems with substantial hardware limitations, that is, with low computing power and memory. This restriction means some security solutions, such as cryptography, might not be fully implementable. Moreover, the ECUs are exposed to demanding conditions (such as low/high temperatures, shocks, vibrations, electromagnetic interference) and must have as small an impact as possible on the size and weight of the vehicle. This is why the bus topology, which requires a much lower number of wires, is preferable to the star one. These constraints cause the OEMs to be sensitive to component costs, which limits the possibility to embrace innovations.

Timing Several ECUs must perform tasks with fixed real-time constraints, which are often safety-critical. Therefore, any security measure must not impact these tasks.

Autonomy Since the driver must be focused on driving, the car should be as autonomous as possible when protection mechanisms take place.

Life-cycle The life-cycle of a car is much longer than that of conventional consumer electronics, hence the need for durable hardware and easy-to-update software (especially security-related software).

Supplier Integration To defend intellectual property, suppliers often provide (software) components without source code; therefore, any modification to improve security can be more difficult.

Fig. 1. Main domains in a modern car. [5]

CAN The Controller Area Network is the most used protocol for the in-vehicle network. It was released in 1986, but several variants and standards have been developed over the years. For simplicity, there is a low-speed CAN that reaches up to 125 Kb/s, while the high-speed version reaches up to 1 Mb/s; the first is suited for the body domain, while the other is used in the 'powertrain' (engine or transmission control) and 'chassis' (suspension, steering or braking) domains. The CAN network is implemented with twisted-pair wires, and an essential aspect is the network topology, which is a bus line. Although current designs are transitioning to a slightly different setting (Figure 1), with a Domain Controller (DC) that manages different sub-networks for each domain (i.e. functionality), the main idea is still that the CAN bus acts as the backbone and all the data spread across the entire network, in broadcast mode.

Automotive Ethernet Although its adoption is still limited, Ethernet has a crucial role for next-generation automotive networks; it is a widespread standard for common IT uses, and its high bandwidth is a desirable characteristic for modern vehicles. However, as it is, its cost and weight are not suited for automotive, hence the need for 'Automotive Ethernet': in the past few years, among the various proposals, the 'BroadR-Reach' variant by Broadcom emerged, and its scheme has now been standardised by IEEE (802.3bp and 802.3bw); moreover, other variants are under development by ISO. The standard is currently guided by the One-Pair Ether-Net (OPEN) alliance. The main difference compared to standard Ethernet is the use of a single unshielded twisted pair, which lets the cost, size and weight decrease significantly without sacrificing bandwidth (100 or 1000 Mb/s).

Before moving on to the description of the vulnerabilities caused by these designs, it is useful to introduce an essential standard for diagnostics: OBD. It stands for On-Board Diagnostics, and it consists of a physical port, mandatory for US and European vehicles, that enables self-diagnostic capabilities in order to detect and signal to the car owner or a technician the presence of failures in a specific component. It gives direct access to the CAN bus, thus causing a serious security threat, as will be described in Section IV; moreover, anyone can buy cheap dongles for the OBD port, extract its data and read them, for example, with a smartphone app.

C. Vulnerabilities

The constraints described in Section II-A, such as the need to reduce the cost and the size impact of the network, together with the past context in which the in-vehicle data was not exposed to external networks, caused the presence in the (CAN) backbone of the following design vulnerabilities [12]:

Broadcast transmission Because of the bus topology, the messages between the ECUs spread across the entire network, causing a severe threat: accessing one part of the network (for example the OBD port) implies the possibility to send messages to the entire network or to eavesdrop on these communications.

No authentication There is no authentication indicating the source of the frames, which means it is possible to send fake messages from every part of the network.

No encryption The messages can be easily analysed or recorded in order to figure out their function.

ID-based priority scheme Each CAN frame contains an identifier and a priority field; the transmission of a high-priority frame causes the lower-priority ones to back off, which enables Denial of Service (DoS) attacks.

III. ATTACK GOALS

In this Section, different motivations that attract the attackers are described. Taking the works by Studnia et al. [26] and IET [9] as references, these are the possible attack goals:

Vehicle theft This is a straightforward reason to attack a vehicle.

Vehicle enhancement This refers to software modifications especially realised by the owner of the car. The goal might be to lower the mileage of the vehicle, tune the engine settings or install unofficial software in the infotainment.

Extortion This can be achieved for example through a ransomware-like strategy, i.e. blocking the victim's car until a fee is paid.

Intellectual challenge The attack is conducted to demonstrate hacking ability.

Intellectual property theft This refers to the elicitation of the source code for industrial espionage.

Data theft This is an increasingly important goal, a consequence of the new paradigm of connected cars. There are different types of data to steal, such as:
• License plates, insurance and tax data;
• Location traces;
• Data coming from the connection with a smartphone, such as contacts, text messages, social media data, banking records.
The combination of these data might allow the attacker to discover the victim's habits and points of interest, exposing him to burglary or similar attacks.

IV. ATTACK SCENARIOS

In this Section, an overview of attack techniques and examples is provided. Following the work by Liu et al. [12], the typical attack scheme includes an initial phase in which a physical (e.g., OBD) or wireless (e.g., Bluetooth) car interface is exploited in order to access the in-vehicle network. The most common interface for accessing it is OBD, but several works leverage different entry points: Checkoway et al. [2] succeeded in sending arbitrary CAN frames through a modified WMA audio file burned onto a CD. Mazloom et al. [13] showed some vulnerabilities in the MirrorLink standard that allow controlling the internal CAN bus through a USB-connected smartphone. Rouf et al. [21] analysed the potential vulnerabilities in the Tire Pressure Monitoring System (TPMS), while Garcia et al. [6] found out that two widespread schemes for keyless entry systems present vulnerabilities that allow cloning the remote control, thus gaining unauthorised access to the vehicle.

Once the interface is chosen, the following methodologies are used to prepare and implement the attack:

Frame sniffing Leveraging the broadcast transmission and the lack of cryptography in the network, the attacker can eavesdrop on the frames and discover their function. It is the typical first step to prepare the attack. An example of CAN frame sniffing and analysis is the work by Valasek et al. [28].

Frame falsifying Once the details of the CAN frames are known, it is possible to create fake messages with false data in order to mislead the ECUs or the driver, e.g., with a wrong speedometer reading.

Frame injection The fake frames, set with a proper ID, are injected into the CAN bus to target a specific node; this is possible because of the lack of authentication. An illustrative, and very notorious, attack is the exploitation by Miller et al. [14] of the 2014 Jeep Cherokee infotainment system, which has the ability to communicate over Sprint's cellular network in order to offer in-car Wifi, real-time traffic updates and other services. This remote attack allowed to control some
cyber-physical mechanisms such as steering and braking. The discovery of the vulnerabilities in the infotainment caused a 1.4 million vehicle recall by FCA.

Replay attack: In this case, the attacker sends a recorded series of valid frames into the bus at the appropriate time, so he can repeat the car opening, start the engine, or turn the lights on. Koscher et al. [11] implemented a replay attack in a real car scenario.

DoS attack: As anticipated in Section II-C, flooding the network with the highest priority frames prevents the ECUs from regularly sending their messages, therefore causing a denial of service. An example of this attack is the work by Palanca et al. [17].

V. SECURITY COUNTERMEASURES

This Section firstly aims to summarise the basic security principles to consider when designing car electronics and related technology solutions. Then, it focuses on the major projects for new architectures.

A. Requirements

A typical pattern to help to develop secure architectures is the so-called 'CIA triad', i.e. three conditions that should be guaranteed as far as possible: confidentiality, integrity, availability. As the previous Sections demonstrated, none of them is inherently guaranteed through the current reference backbone —the CAN bus.

Bearing in mind these concepts and taking a cue from the work by ACEA [1], the proposed countermeasures and some of the related implementations in the research literature are the following:

Dedicated HW: To supply the scarcity of computing power of the ECUs and satisfy the real-time constraints, it may be necessary to integrate hardware platforms specifically designed for security functions. This approach has been pursued, for example, in the EVITA and HIS projects, and it is referred to as Hardware Security Module (HSM) or Secure Hardware Extension (SHE).

Cryptography: Encryption can help in ensuring confidentiality and integrity. It is worth noting that implementing cryptography is not trivial, since the low computing power may prevent the OEMs from using robust algorithms, which means cryptography might even be counterproductive. The guidelines recommend state-of-the-art standards, taking care of key management and possibly using dedicated hardware. There are several works about cryptography; for example, Zelle et al. [31] investigated whether the well-known TLS protocol applies to in-vehicle networks.

Authentication: Since different ECUs interact with each other, it is fundamental to know the sender of every incoming message. Two recent works that integrate authentication are those by Mundhenk et al. [16] and Van Bulck et al. [29].

Access control: Every component must be authorised in order to gain access to other parts. The guidelines suggest adopting the principle of least privilege, i.e. a policy whereby each user (each ECU in this case) should have the lowest level of privileges which still permits it to perform its tasks.

Isolation/Slicing: This hardening measure aims at preventing the chance for an attacker to damage the entire network. This goal can be achieved, for example, by isolating the driving systems from the other networks (e.g., the infotainment), or through a central gateway that employs access control mechanisms.

Intrusion detection: Intrusion Detection Systems (IDSs) monitor the activities in the network searching for malicious or anomalous actions. Some examples in the literature are the works by Song et al. [24] and by Kang et al. [10], which uses deep neural networks.

Secure updates: The Over-The-Air (OTA) updates are on the one hand a risk that increases the attack surface; on the other, they are an opportunity to quickly fix the discovered vulnerabilities (besides adding new services). Some recent works to secure the updates but also V2X communications are those by Dorri et al. [4] and Steger et al. [25], both taking advantage of blockchain.

Incident response and recovery: It is necessary to ensure an appropriate response to incidents, limit the impact of the failures and be always able to restore the standard vehicle functionality.

All the above aspects should be fulfilled in a Security Development Lifecycle (SDL) perspective, with data protection and privacy as a priority. Testing and information sharing among industry actors are recommended.

B. Main Projects

In the past ten years, several research proposals and standardisation projects started, aiming to develop and integrate the ideas of the previous Section organically; a map of these initiatives can be seen in Figure 2.

Fig. 2. Safety and security initiatives inside and outside of the automotive domains. (ENISA [5])

Among these projects, SAE J3061¹, finalised in 2016, guides the vehicle cybersecurity development process, ranging from the basic principles to the design tools. However, a

¹ https://www.sae.org/standards/content/j3061_201601/
new international standard, the ISO/SAE 21434, is under development; its goal is to (a) describe the requirements for risk management, and (b) define a framework that manages these requirements, without indicating specific technologies, rather giving a reference, useful also for legal aspects.

Moreover, the implementation of these guidelines and the transition towards a new in-vehicle network architecture is currently guided by some projects like AUTOSAR². This initiative is a partnership born in 2003 between several stakeholders, ranging from the OEMs to the semiconductor companies, which aims to improve the management of the E/E architectures through reuse and exchangeability of software modules; concretely, it standardises the software architecture of the ECUs. It is still an active project, now also focused on autonomous driving and V2X applications, and it covers different functionalities, from cybersecurity to diagnostics, safety, and communication. AUTOSAR also supports different software standards, such as GENIVI³, another important alliance aiming to develop open software solutions for In-Vehicle Infotainment (IVI) systems.

VI. DISCUSSION

Fig. 3. Applying security principles ([23])

The ideas expressed in the previous Section can be summarised by Figure 3, which shows how the security principles can be implemented in practice. In our opinion, the primary protocol upon which the backbone of the future in-vehicle network will be built is Automotive Ethernet. Moreover, the takeaway message from these initiatives is the specific focus on security: each building block implies a research activity aimed at proposing a solution tailored for the automotive domain.

In this paper, we examined the core elements and concerns for secure internal networks; however, it is worth discussing, although in an introductory manner, how the same awareness should be extended to the very new actors in automotive, i.e. artificial intelligence and V2X. These elements enable new advanced, smart services —e.g., platooning, that is the use of a fleet of vehicles that travel together in a coordinated and autonomous way— and, as a consequence, further threats. In particular, focusing on artificial intelligence, the primary concerns come from autonomous driving, where deep learning is the main enabling technology. In addition to the inherent complexity in developing a fully autonomous car for the real world, several studies demonstrated how machine learning-based algorithms are vulnerable, i.e. the fact that carefully-perturbed inputs can easily fool classifiers, causing, for example, a stop sign to be classified as a speed limit ([7]). These issues originate the research topic of adversarial learning. Moreover, the use of machine learning is not limited to computer vision but also includes cybersecurity software, such as IDSs, and safety systems, such as drowsiness and distraction detectors. Therefore, it is fundamental to leverage proper techniques (e.g., [3]) to (a) avoid consistent drops of performance, (b) increase the effort required of the attacker to evade the classifiers, and (c) keep the complexity of the algorithms within an acceptable level, given the constraints described in Section II-A. Ultimately, these concerns must be addressed with the same attention as the ones related to the internal network architecture. In this sense, some works, such as [22], propose to include machine learning-specific recommendations in the ISO 26262⁴ standard.

VII. CONCLUSION

To sum up, in this paper we deduced how the digitalisation process within the automotive industry, where the OEMs are converging towards IT companies and the vehicles are becoming "smartphones on wheels", came up against serious cybersecurity issues, due to security flaws inherited from an original design where the in-vehicle network did not interact with the external world. By contrast, the Mobility-as-a-Service paradigm causes the vehicle to be hyper-connected and consequently much more exposed to cyber threats.

In this transition phase, we observed the effort in developing more and more complex platforms in a safety-critical context with strict requirements such as the limited hardware and the real-time constraints. For these reasons, both the industry and the researchers are pledging to leverage the common IT methodologies from other domains and tailor them for the automotive one. The route towards this goal is not straightforward, as noted in the study by the Ponemon Institute [19]: 84% of the professionals working for OEMs and their suppliers still have concerns that cybersecurity practices are not keeping pace with evolving technologies.

As a final remark, we claim that the core ideas concerning the in-vehicle network, described in this paper, could be considered for further analyses on the security of autonomous driving and V2X communications.

ACKNOWLEDGEMENT

The authors thank Abinsula srl for the useful discussions on the mechanisms of the automotive industry and its trends.

REFERENCES

[1] ACEA. Principles of Automobile Cybersecurity. Tech. rep. ACEA, 2017.

² https://www.autosar.org
³ https://www.genivi.org
⁴ https://www.iso.org/standard/68383.html
[2] Stephen Checkoway et al. "Comprehensive Experimental Analyses of Automotive Attack Surfaces". In: USENIX Security Symposium. San Francisco, CA: USENIX Association, 2011, pp. 447–462.
[3] Ambra Demontis et al. "Yes, Machine Learning Can Be More Secure! A Case Study on Android Malware Detection". In: IEEE Transactions on Dependable and Secure Computing (2017), pp. 1–1.
[4] Ali Dorri et al. "BlockChain: A Distributed Solution to Automotive Security and Privacy". In: IEEE Communications Magazine 55.12 (Dec. 2017), pp. 119–125.
[5] ENISA. Cyber Security and Resilience of Smart Cars. Tech. rep. ENISA, 2017.
[6] Flavio D. Garcia et al. "Lock It and Still Lose It – On the (In)Security of Automotive Remote Keyless Entry Systems". In: 25th USENIX Security Symposium (USENIX Security 16). Austin, TX: USENIX Association, 2016.
[7] Tianyu Gu, Brendan Dolan-Gavitt, and Siddharth Garg. "BadNets: Identifying Vulnerabilities in the Machine Learning Model Supply Chain". In: arXiv preprint arXiv:1708.06733 (2019).
[8] Yinjia Huo et al. "A survey of in-vehicle communications: Requirements, solutions and opportunities in IoT". In: 2015 IEEE 2nd World Forum on Internet of Things (WF-IoT). IEEE, Dec. 2015, pp. 132–137.
[9] IET. Automotive Cyber Security: An IET/KTN Thought Leadership Review of Risk Perspectives for Connected Vehicles. Tech. rep. IET, 2014.
[10] Min-Joo Kang and Je-Won Kang. "Intrusion Detection System Using Deep Neural Network for In-Vehicle Network Security". In: PLOS ONE 11.6 (June 2016). Ed. by Tieqiao Tang.
[11] Karl Koscher et al. "Experimental Security Analysis of a Modern Automobile". In: 2010 IEEE Symposium on Security and Privacy. IEEE, 2010, pp. 447–462.
[12] Jiajia Liu et al. "In-Vehicle Network Attacks and Countermeasures: Challenges and Future Directions". In: IEEE Network 31.5 (2017), pp. 50–58.
[13] Sahar Mazloom et al. "A Security Analysis of an In-Vehicle Infotainment and App Platform". In: 10th USENIX Workshop on Offensive Technologies (WOOT 16). USENIX Association, 2016.
[14] Charlie Miller and Chris Valasek. "Remote Exploitation of an Unaltered Passenger Vehicle". In: Black Hat USA 2015 (2015), pp. 1–91.
[15] Roberto Minerva, Abyi Biru, and Domenico Rotondi. "Towards a Definition of IoT". 2015.
[16] Philipp Mundhenk et al. "Security in Automotive Networks: Lightweight Authentication and Authorization". In: ACM Transactions on Design Automation of Electronic Systems 22.2 (Mar. 2017), pp. 1–27.
[17] Andrea Palanca et al. "A Stealth, Selective, Link-Layer Denial-of-Service Attack Against Automotive Networks". In: International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment. Springer Verlag, 2017, pp. 185–206.
[18] Lee Pike et al. "Secure Automotive Software: The Next Steps". In: IEEE Software 34.3 (May 2017), pp. 49–55.
[19] Ponemon Institute. Securing the Modern Vehicle: A Study of Automotive Industry Cybersecurity Practices. Tech. rep. 2019.
[20] PwC. The 2018 Strategy & Digital Auto Report. Tech. rep. 2018.
[21] Ishtiaq Rouf et al. "Security and Privacy Vulnerabilities of In-Car Wireless Networks: A Tire Pressure Monitoring System Case Study". In: 19th USENIX Security Symposium. Washington, DC: USENIX Association, 2010, pp. 11–13.
[22] Rick Salay, Rodrigo Queiroz, and Krzysztof Czarnecki. "An Analysis of ISO 26262: Using Machine Learning Safely in Automotive Software". In: arXiv preprint arXiv:1709.02435 (Sept. 2017).
[23] Balazs Simacsek. "Can we trust our cars?" 2019.
[24] Hyun Min Song, Ha Rang Kim, and Huy Kang Kim. "Intrusion detection system based on the analysis of time intervals of CAN messages for in-vehicle network". In: 2016 International Conference on Information Networking (ICOIN). IEEE, Jan. 2016, pp. 63–68.
[25] Marco Steger et al. "Secure Wireless Automotive Software Updates Using Blockchains: A Proof of Concept". In: Advanced Microsystems for Automotive Applications 2017. Lecture Notes in Mobility. Springer, Cham, 2018, pp. 137–149.
[26] Ivan Studnia et al. "Survey on security threats and protection mechanisms in embedded automotive networks". In: 2013 43rd Annual IEEE/IFIP Conference on Dependable Systems and Networks Workshop (DSN-W). IEEE, June 2013, pp. 1–12.
[27] Upstream Security. Global Automotive Cybersecurity Report 2019. Tech. rep. 2018.
[28] Chris Valasek and Charlie Miller. "Adventures in Automotive Networks and Control Units". In: DEF CON 21. 2013, pp. 260–264.
[29] Jo Van Bulck, Jan Tobias Mühlberg, and Frank Piessens. "VulCAN: Efficient component authentication and software isolation for automotive control networks". In: Proceedings of the 33rd Annual Computer Security Applications Conference (ACSAC 2017). New York, NY: ACM Press, 2017, pp. 225–237.
[30] Eric Ke Wang et al. "Security Issues and Challenges for Cyber Physical System". In: 2010 IEEE/ACM Int'l Conference on Green Computing and Communications & Int'l Conference on Cyber, Physical and Social Computing. IEEE, Dec. 2010, pp. 733–738.
[31] Daniel Zelle et al. "On Using TLS to Secure In-Vehicle Networks". In: Proceedings of the 12th International Conference on Availability, Reliability and Security (ARES '17). New York, NY: ACM Press, 2017, pp. 1–10.
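The ID-based priority scheme described in Section II-C, and the DoS attack it enables, can be illustrated with a minimal arbitration model. The following Python sketch is a deliberately simplified model (the node names and identifiers are invented for illustration), not a real CAN stack:

```python
# Simplified model of CAN arbitration: the pending frame with the lowest
# identifier wins the bus. Persistently injecting ID 0x000 frames therefore
# starves every other ECU, which is the DoS attack discussed in the paper.
# Illustrative sketch only; IDs and node names are invented.

def arbitrate(pending):
    """Return the (identifier, sender) pair that wins arbitration: lowest ID."""
    return min(pending, key=lambda frame: frame[0])

def simulate(slots, attacker_active):
    """Count how many transmission slots the legitimate ECUs manage to use."""
    legitimate_sent = 0
    for _ in range(slots):
        pending = [(0x100, "engine"), (0x200, "brakes")]  # legitimate traffic
        if attacker_active:
            pending.append((0x000, "attacker"))  # highest-priority flood frame
        if arbitrate(pending)[1] != "attacker":
            legitimate_sent += 1
    return legitimate_sent

print(simulate(1000, attacker_active=False))  # 1000: ECUs transmit freely
print(simulate(1000, attacker_active=True))   # 0: the flood wins every slot
```

The countermeasures of Section V (gateway-based isolation, access control) limit which nodes are able to inject such high-priority frames in the first place.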
NTRU-Like Secure and Effective Congruential Public-Key Cryptosystem Using Big Numbers

Anas Ibrahim
Department of Computer Engineering, Eastern Mediterranean University
Palestine Technical University
anas.ibrahim@emu.edu.tr

Alexander Chefranov
Department of Computer Engineering, Eastern Mediterranean University, Famagusta, North Cyprus
alexander.chefranov@emu.edu.tr

Nagham Hamad
Palestine Technical University, Tulkarem, Palestine
nagham.hamad@ptuk.edu.ps
Abstract—We propose RCPKC, a random congruential public key cryptosystem working on integers modulo q, such that the norm of the two-dimensional vector formed by its private key, (f, g), is greater than q. RCPKC works similarly to NTRU, the fastest secure PKC. NTRU uses high-order, N, polynomials and is susceptible to the lattice basis reduction attack (LBRA) taking time exponential in N. RCPKC is a secure version of the insecure CPKC proposed by the NTRU authors, which is easily attackable by LBRA since CPKC uses small numbers for the sake of correct decryption. RCPKC specifies a range from which the random numbers shall be selected; it provides correct decryption for valid users and incorrect decryption for an attacker using Gaussian Lattice Reduction (GLR). Because of its resistance to LBRA, RCPKC is more secure, and, due to the use of big numbers instead of high-order polynomials, about 24 (7) times faster in encryption (decryption) than NTRU. Also, RCPKC is more than 3 times faster than the most effective known NTRU variant, BQTRU.

Index Terms—Congruential public-key cryptosystem, Integer, Lattice, Lattice basis reduction attack, LLL algorithm, Minkowski's boundary for a lattice shortest vector norm, NTRU, Polynomial

I. INTRODUCTION

The emergence of cloud computations raises demand for low computational complexity homomorphic PKC [1] [2]. NTRU [3] is a PKC standardized as IEEE P1363.1 and faster than RSA and ECC [4]. Many variants of NTRU have been proposed and studied recently, targeting a further decrease of its computational complexity. All these variants work with polynomials and mainly differ in the choice of their coefficients, the ring-defining polynomial, or the use of the polynomials as entries of structures such as matrices. We overview them briefly below.

NTRU variants differing in the choice of their coefficients: ETRU [5], working with polynomials over Eisenstein integer coefficients, is faster than NTRU in encryption/decryption by 1.45/1.72 times; BQTRU [6], working over quaternions but with bivariate polynomials, is 7 times faster than NTRU in encryption.

NTRU variants working with different rings: an NTRU variant that works with polynomials over prime cyclotomic rings is proposed in [7]; a variant of NTRU working with non-invertible polynomials is proposed in [8].

NTRU variants working with polynomials inside more complicated structures: MATRU [9] works with square matrices of polynomials and shows better encryption and decryption performance than NTRU by 2.5 times; NNRU [10] works with polynomials that are entries of square matrices forming a specified non-commutative ring.

Thus, NTRU and its known variants work with order-N polynomial rings. The main problem NTRU faces is that it is susceptible to the lattice basis reduction attack (LBRA), using the Gaussian lattice reduction (GLR) algorithm for two-dimensional lattices and the LLL algorithm for higher dimensions [11]. The LLL algorithm solves in time exponential in N the shortest vector in a lattice problem (SVP), revealing the secret key [12], because the private keys are selected as polynomials with small coefficients for the decryption correctness. The NTRU encryption/decryption mechanism is defined for polynomials. In [13, p. 373-376], the authors of NTRU applied that mechanism to integers modulo q >> 1, considering a congruential public key cryptosystem (CPKC), and found that it is insecure, since GLR finds its private keys in an order of ten iterations. That is why CPKC is considered there as a toy model of NTRU that "provides the lowest dimensional introduction to the NTRU public key cryptosystem" [13, p. 374]. The insecurity of CPKC stems from the choice of the private keys as small numbers to provide decryption correctness.

Thus, from the analysis conducted we see that NTRU variants try to minimize its computational complexity by extending the coefficients of the polynomials used, or by using matrices of polynomials, which allows preserving the security level while decreasing the polynomial order, because operations with high-order polynomials are time consuming. The extreme case is polynomial order zero, that is, a number, as used in CPKC; but CPKC is shown in [13] to be insecure with respect to LBRA by GLR. If CPKC could be made resistant to the GLR attack, it would be the best possible choice among the NTRU modifications.

Herein, we propose a CPKC modification, RCPKC, that specifies a range from which the random numbers shall be selected; it provides correct decryption for valid users and incorrect decryption for a GLR attacker, i.e. GLR can never find its private key, because GLR solves SVP by returning the shortest vector in a lattice, whereas our private key is in the safe region (above the Minkowski's boundary
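The GLR attack on CPKC described above — recovering a small private pair (f, g) from the public value h — can be sketched in Python, whose built-in big integers suit the setting. The loop mirrors Code 1 of Section III; the parameters below are illustrative stand-ins, not the paper's example values, and this is an independent reimplementation, not the authors' MuPAD code:

```python
# Gaussian lattice reduction (GLR) on the 2-dimensional CPKC lattice with
# basis V1 = (1, h), V2 = (0, q). When the private pair (f, g) is small
# relative to sqrt(q), a shortest lattice vector is +/-(f, g), so GLR
# recovers it. Illustrative reimplementation of Code 1; toy parameters.

def glr(v1, v2):
    """Return a shortest nonzero vector of the lattice spanned by v1, v2."""
    def dot(a, b):
        return a[0] * b[0] + a[1] * b[1]

    while True:
        if dot(v2, v2) < dot(v1, v1):
            v1, v2 = v2, v1          # keep v1 the shorter vector
        n, d = dot(v1, v2), dot(v1, v1)
        m = (2 * n + d) // (2 * d)   # nearest integer to n / d
        if m == 0:
            return v1
        v2 = (v2[0] - m * v1[0], v2[1] - m * v1[1])

q = 2 ** 40
f, g = 1231, 1553                    # tiny secret pair, far below sqrt(q)
h = g * pow(f, -1, q) % q            # public key component, as in (4)

v = glr((1, h), (0, q))
print(v)                             # +/-(1231, 1553): the private key leaks
```

With the CPKC settings (1), this is exactly the roughly ten-iteration key recovery reported in Section III; RCPKC defeats it by placing (f, g) above the Minkowski boundary, condition (26).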
where Fq is the inverse of f modulo q. A random polynomial, r, and a message, m, are of the form:

r = r1 · r2 + r3, m ∈ Rp, (16)

where ri is from T(dri, dri), i = 1..3.

NTRU encryption is represented by (7), and decryption uses (8) followed by a modulo p operation, for the polynomials f, h, r, and m defined in (14)-(16).

III. LATTICE BASIS REDUCTION ATTACK BY GLR ON CPKC PRIVATE KEY/MESSAGE

In the following, ||x||, (x · y), ⌊a⌉, and R denote the Euclidean norm [14] of the vector x, the dot product of the vectors x and y, rounding of the real number a, and the set of real numbers, respectively.

Let E(V1, V2) ⊂ R² be a 2-dimensional lattice with basis vectors V1 and V2:

E(V1, V2) = {a1 · V1 + a2 · V2 : a1, a2 ∈ Z}. (17)

The CPKC private key recovery problem can be formulated as the Shortest Vector Problem (SVP) in the two-dimensional lattice E(V1, V2). From (4), we can see that for any pair of positive integers, F and G, satisfying

G = F · h mod q, F = O(√q), G = O(√q), (18)

(F, G) is likely to serve as the first two components, f, g, of the private key, SK [13, p. 376]. Equation (18) can be written as F · h + q · n = G, where n is an integer. So, our task is to find a pair of comparatively small integers, (F, G), such that

F · V1 + n · V2 = (F, G), (19)

where V1 = (1, h) and V2 = (0, q) are basis vectors, at least one of them having Euclidean norm of order O(q). Similarly, the CPKC message recovery problem can be formulated as SVP in the two-dimensional lattice E(V1, V2), where V1, V2 are from (19). From (7), we can see also that for any pair of positive integers, (RR, EM), satisfying

EM = RR · h mod q, RR = O(√q), EM = O(√q), (20)

(RR, EM) is likely to serve as the vector (r, e − m), since the encryption equation (7) can be written as r · h + q · n = e − m, where n is an integer. So, our task is to find a pair of comparatively small integers, (RR, EM), such that

RR · V1 + n · V2 = (RR, EM). (21)

We want to find the shortest vector w of E(V1, V2) using GLR; it might disclose (r, e − m) if e, r are of the order of O(√q). Comparing (19) and (21), we see that they are the same up to the unknowns' names used, and hence, finding the shortest vector in E(V1, V2) may reveal either the private key components (F, G) = (f, g), or the message related vector (RR, EM) = (r, e − m).

The GLR algorithm [13, p. 437], shown in Code 1, on termination returns the shortest vector w = v1 in E(V1, V2).

Code 1: GLR algorithm pseudocode finding the shortest vector v1 of the lattice E(V1, V2).
  Input: basis vectors V1, V2;
  Output: the shortest vector v1 in E(V1, V2);
  v1 = V1; v2 = V2;
  Loop:
    If ||v2|| < ||v1||, swap v1 and v2.
    Compute m = ⌊(v1 · v2)/||v1||²⌉.
    If m = 0, return the shortest vector v1 of the basis {v1, v2}.
    Replace v2 with v2 − m·v1.
  Continue Loop.

LBRA by GLR using Code 1 on the CPKC private key/message, for the data from Example II-D, finds in 9 iterations the shortest vector v1 = (231233, 195696), as shown in Fig. 1. The shortest vector v1 found by GLR corresponds to the private key components (f, g), because they were selected small, having values of order O(√q) according to (1). The message related vector, (r, e − m), is not disclosed in the attack because e = O(q) in Example II-D.

Fig. 1. Screenshot of LBRA by GLR using MuPAD Code 2 on CPKC for the data from Example 1, finding the private key components (f, g) = v1 in 9 iterations.

LBRA by GLR succeeds in finding the CPKC private key since it, by the settings (1) used, is likely the shortest vector in the lattice. Minkowski's Second Theorem [15, p. 35] sets an upper bound for the norm of the shortest nonzero vector, λ, in a 2-dimensional lattice:

λ ≤ √λ2 · Vol(L)^(1/2), (22)

where λ2 = 2/√3 ≈ 1.154 is Hermite's constant [15, p. 41], and Vol(L) is the volume of the lattice, which is equal to q for the lattice L = E(V1, V2) where V1, V2 are defined in (19). We can rewrite (22) as follows:

λ ≤ α · √q, (23)
where α = √λ2 ≈ 1.07. Introducing the relative norm

λ' = λ/√q, (24)

we obtain the following inequality (25):

λ' ≤ α. (25)

GLR fails attacking the CPKC private key/message when (25) is not satisfied for the relative norm of the secret vector (f, g), i.e. if

||(f, g)||/√q > α (26)

holds, GLR fails to find the CPKC private key/message.

IV. RCPKC PROPOSAL, PROOF OF ITS CORRECTNESS, AND EXAMPLE OF ENCRYPTION/DECRYPTION AND LBRA BY GLR FAILURE

In this section, we propose random CPKC (RCPKC) by adjusting CPKC, described in Section 2, to satisfy (26). The main ideas of RCPKC are:

- Contrary to the settings (1) of CPKC, which use a secret key (f, g) with small norm not exceeding √q, so that (f, g) may be found as a shortest vector (SV) in the lattice E(V1, V2) defined by (19), we use (f, g) with a large norm meeting (26), so that it cannot be returned as an SV by LBRA using GLR;

- Small values (1) are chosen in CPKC to meet the decryption correctness condition (9), which we also meet in RCPKC due to the skew in the components of (f, g). It might happen, as noted by an anonymous Reviewer, whom we thank, that in spite of the large norm of (f, g), the SV = (F, G) obtained as the result of LBRA using GLR may meet the decryption correctness condition (9), and thus may be used for correct disclosure of the plaintext message. Therefore, before encrypting by (7), our proposed RCPKC — contrary to CPKC, which uses a random number from the predefined range (6) — defines a range for the random number selection using the SV, (F, G), returned by a GLR attack on the lattice E(V1, V2) defined by (19), so that the decryption correctness condition (9) holds for (f, g) but does not hold for (F, G), which leads to the failure of LBRA using GLR on RCPKC. Thus, RCPKC assumes that the private key owner selects the range for the random value, r, used in encryption (7), based on the values of the secret key (f, g) and the respective SV (F, G) in the lattice E(V1, V2) defined by (19), guaranteeing correct decryption for a valid user and incorrect decryption for a GLR attacker. Because of this special choice of the random value range, the proposed algorithm is called Random CPKC (RCPKC). A possible problem for RCPKC is that the range for random numbers defined this way may be rather narrow and, thus, the security of RCPKC may suffer. But we show that the range is rather large and may significantly exceed the range for a secret message.

A. RCPKC Proposal and Proof of Its Correctness

To meet (26), we require

f, r ≥ α · √q. (27)

The LBRA by GLR failure condition (26) holds if (27) is true, since

||(f, g)||/√q = √(f² + g²)/√q ≥ √(α² · q + g²)/√q > α,

||(r, e − m)||/√q = √(r² + (e − m)²)/√q ≥ √(α² · q + (e − m)²)/√q > α,

for g, e − m ≠ 0. Condition (27), in RCPKC, substitutes for the conditions (1), (6) on f, r in CPKC. The message, m, and the private key component, g, instead of (5), (1) used in CPKC, are redefined in RCPKC as follows:

2^mgLen > g ≥ 2^(mgLen−1) > m ≥ 0, (28)

where mgLen represents the length of m and g in bits.

For RCPKC, the decryption correctness condition (9) shall hold, which is true (see (33)) when the f, r values, in addition to (27), meet (29):

q/(2 · 2^mgLen) > f, r. (29)

For q = 2^qLen, (27), (29) can be rewritten:

2^(qLen−mgLen−1) > f, r ≥ α · 2^(qLen/2). (30)

To have a non-empty range for f, r, of width at least α · 2^(qLen/2), from (30) we get the following condition:

2^(qLen/2)/(2 · α) > 2^(mgLen+1). (31)

Defining β = log2(1/(2 · α)) ≈ −1.103, from (31) we have

2^β · 2^(qLen/2) > 2^(mgLen+1),
qLen + 2 · β > 2 · (mgLen + 1),
qLen > 2 · (mgLen + 1 − β). (32)

Let us show that the decryption correctness condition (9) holds when (28), (30), and (32) hold:

r · g + f · m < 2^(qLen−mgLen−1) · 2^mgLen + 2^(qLen−mgLen−1) · 2^(mgLen−1) < 2^(qLen−1) + 2^(qLen−1) = 2^qLen = q. (33)

Thus, for RCPKC, the norm of (f, g) meets (26) and the decryption correctness condition (9) holds. We need, additionally, that the decryption correctness condition (9) is violated for (F, G), that is, for the SV obtained as the result of a GLR attack on the lattice E(V1, V2) defined by (19), so that it cannot be used as a private key for correct decryption of the plaintext message.

Inequality (30) defines a range for r so that f, g, r, m meet (9). Now, we define a constraint on r,

r ≥ rmin ≥ (q + g · |F|)/|G| (34)

such that F, G, r, m violate (9). Using (34) and (28):

|G · r + F · m| ≥ |G| · |r| − |F| · m
≥ |G| · (q + g · |F|)/|G| − |F| · m ≥ q + g · |F| − m · |F| > q. (35)

Thus, inequality (30) is used for f, but for r, from (34) and (30), we have

2^(qLen−mgLen−1) > r ≥ max(α · 2^(qLen/2), rmin). (36)

For RCPKC security, the range defined by (36) shall be rather large, of width at least max(α · 2^(qLen/2), rmin), hence:

2^(qLen−mgLen−1) ≥ 2 · max(α · 2^(qLen/2), rmin). (37)

RCPKC Proposal
The private key components (f, g) meet (2), (3), (28), (30), where qLen, mgLen meet (32) and (37), and where (F, G) is an SV obtained as the result of a GLR attack on the lattice E(V1, V2) defined by (19). The public key component, h, is defined by (4). The message, m, meets (28), and the random integer, r, is selected from the range defined in (34), (36).

Encryption and decryption follow (7), and (8), (10), respectively (see Sections II-B and II-C).

The decryption correctness condition (9) is proved for RCPKC in (33), thus proving RCPKC correctness.

Example 2 illustrates RCPKC encryption and decryption, and GLR failure to find the RCPKC secret key/message.

B. Example 2. Example of RCPKC Encryption/Decryption and LBRA by GLR Failure

This example shows RCPKC encryption and decryption, and GLR failure to find the RCPKC secret key/message. For the calculations, we use MuPAD.

Let mgLen = 16, qLen = 59, meeting (32); q = 2^59 = 576,460,752,303,423,488; the private key components, g and f, are selected to meet (28) and (30), respectively, as follows: g = 65,535, and f = 812,397,637.

We see that the values of g and f satisfy (28) and (30):

65,536 = 2^mgLen > g ≥ 2^(mgLen−1) = 32,768,
4,398,046,511,104 = 2^(qLen−mgLen−1) > f ≥ α · 2^(qLen/2) = 812,397,633.7.

Similarly, the message, m, is selected to meet (28): m = 14. We see that the value of m satisfies (28), for 2^(mgLen−1) = 2^15 = 32,768 > m = 14. According to (3), Fq = 240,507,095,595,400,845, and Fg = 8,728. The public key component, h, is calculated according to (4) as follows:

h = Fq · g mod q = 42,620,364,389,368,179.

The GLR algorithm, Code 1, can be launched with inputs V1 = (1, h) and V2 = (0, q). GLR terminates in 15 iterations and returns the shortest vector v1 = (F, G) = (214653159, 709596869), see Fig. 2. From (34),

812,397,637 = rmin ≥ (q + g · |F|)/|G| = 812,397,637.

Thus, the random value r = 812,397,637 is selected to meet (36):

4,398,046,511,104 = 2^(qLen−mgLen−1) > r ≥ max(α · 2^(qLen/2) = 812,397,633.7, rmin = 812,397,637) = rmin.

The ciphertext, e, is calculated according to (7) as follows:

e = r · h + m mod q = 65,549.

For decryption, in the first step, according to (8), we multiply the ciphertext, e, by the private key f:

a = f · e mod q = 53,251,852,707,713.

In the second decryption step, according to (10), we multiply a by Fg to get the message m as follows:

m = a · Fg mod g = 14.

We see that the message, m, is correctly retrieved.

Now, we attack RCPKC using GLR Code 1. GLR terminates in 15 iterations, finding v1 = (F, G) = (214653159, 709596869) ≠ (f, g), as shown in the screenshot in Fig. 2. On the other hand, we see that (35) is satisfied as follows:

576,474,822,603,342,779 = |G · r + F · m| > q = 576,460,752,303,423,488.

Hence, trying to decrypt the ciphertext using (F, G) fails as follows:

aGLR = F · e mod q = 14,070,299,919,291,
mGLR = FG · aGLR mod G = 65,549 ≠ m = 14.

We see that the original message is not disclosed. Thus, using the shortest vector returned by GLR for the ciphertext decryption actually fails.

V. RCPKC PERFORMANCE EVALUATION

Herein, we use the NTRU parameters EES401EP2 [16], of security level k = 112 bits:

N = 401, p = 3, q = 2048, df1 = df2 = 8, df3 = 6, dg = 133, dr1 = dr2 = 8, dr3 = 6. (38)

In order to meet the same security level, the RCPKC settings satisfying (32) are:

qLen = 473, mgLen = 225. (39)

We use the NTRU code [17], and we have implemented RCPKC in the C99 language, the same as used in [17], with the MPIR library [18], on a PC equipped with a 2 GHz Intel Pentium Dual CPU E2180, 3 GB RAM, and Windows 10. Both the NTRU code [17] and our RCPKC are implemented in Visual Studio 2017. The NTRU parameters (38) and the RCPKC parameters (39) are used. We measure the CPU encryption and decryption time of RCPKC and NTRU for 10³, 10⁴, 10⁵, and 10⁶ runs (see Tables I and II with the respective averages). In each run, new secret and public keys, and messages, are chosen randomly for NTRU and RCPKC. From Table I (Table II), we see that RCPKC is 23.34 (7.5) times faster than NTRU in encryption (decryption). The large
24
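The modular arithmetic of Example 2 can be re-checked with any big-integer package; the sketch below uses plain Python integers instead of MuPAD, and assumes the congruential encryption rule e = (r · h + m) mod q for step (7), which is not spelled out on this page.

```python
# Sketch: re-checking Example 2 with Python integers (the paper used MuPAD).
# Assumption: encryption is e = (r*h + m) mod q, per (7).
q = 2**59                      # qLen = 59
g, f = 65_535, 812_397_637     # private key components, meeting (28), (30)
m, r = 14, 812_397_637         # message and random value

Fq = pow(f, -1, q)             # f^(-1) mod q, cf. (3); needs Python >= 3.8
Fg = pow(f, -1, g)             # f^(-1) mod g
h = Fq * g % q                 # public key component, cf. (4)

e = (r * h + m) % q            # encryption, cf. (7) (assumed form)
a = f * e % q                  # first decryption step, cf. (8)
assert a == 53_251_852_707_713 # the value printed in the example
assert a * Fg % g == m         # second step, cf. (10): message recovered
```

Since f · F_q ≡ 1 (mod q), the first step yields a = r · g + f · m exactly whenever that sum is below q, which is what makes the second reduction mod g recover m.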
TABLE I
AVERAGE ENCRYPTION TIME OF NTRU AND RCPKC FOR DIFFERENT NUMBER OF RUNS

Number of runs                              10^3          10^5          10^6
NTRU average encryption time (s)            1.52 × 10^-4  1.45 × 10^-4  1.46 × 10^-4
NTRU/RCPKC average encryption time ratio    25.33         20.71         20.85

NTRU/RCPKC average encryption time ratio, averaged over all runs: 23.34
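The averaging procedure behind Tables I and II can be sketched as a simple harness of the following shape; `encrypt` here is only a dummy placeholder for whichever NTRU or RCPKC call is being measured, not the paper's C99 implementation.

```python
import time

def average_time(op, runs):
    """Average wall-clock seconds of op() over the given number of runs."""
    start = time.perf_counter()
    for _ in range(runs):
        op()
    return (time.perf_counter() - start) / runs

def encrypt():
    # Dummy modular-arithmetic workload standing in for a real encryption call.
    pow(812_397_637, 65_537, 2**59)

for runs in (10**3, 10**4):
    print(f"{runs} runs: {average_time(encrypt, runs):.2e} s/run")
```

In each run the paper additionally regenerates fresh keys and messages; a faithful harness would do the key generation outside the timed `op()` only if key setup time is meant to be excluded.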
TABLE II
AVERAGE DECRYPTION TIME OF NTRU AND RCPKC FOR DIFFERENT NUMBER OF RUNS

Number of runs                              10^3          10^5          10^6
RCPKC average decryption time (s)           2.1 × 10^-5   1.9 × 10^-5   2.0 × 10^-5
NTRU average decryption time (s)            1.55 × 10^-4  1.47 × 10^-4  1.44 × 10^-4
NTRU/RCPKC average decryption time ratio    7.38          7.74          7.20

NTRU/RCPKC average decryption time ratio, averaged over all runs: 7.50

TABLE III
NTRU VERSUS ALGORITHMS' (RCPKC AND DIFFERENT NTRU VARIANTS) ENCRYPTION AND DECRYPTION TIME RATIO

Algorithm         NTRU/Algorithm encryption time   NTRU/Algorithm decryption time
Proposed RCPKC    23.34                            7.5
BQTRU [6]         7                                No data
MaTRU [9]         2.5                              2.5
ETRU [5]          1.45                             1.72

REFERENCES

[7] Y. Yu, G. Xu, and X. Wang, "Provably secure NTRU instances over prime cyclotomic rings," in Public-Key Cryptography - PKC 2017 - 20th IACR International Conference on Practice and Theory in Public-Key Cryptography, Amsterdam, The Netherlands, March 28-31, 2017, Proceedings, Part I, ser. Lecture Notes in Computer Science, S. Fehr, Ed., vol. 10174. Springer, 2017, pp. 409-434. [Online]. Available: https://doi.org/10.1007/978-3-662-54365-8_17
[8] W. D. Banks and I. E. Shparlinski, "A variant of NTRU with non-invertible polynomials," in Progress in Cryptology - INDOCRYPT 2002, Third International Conference on Cryptology in India, Hyderabad, India, December 16-18, 2002, ser. Lecture Notes in Computer Science, A. Menezes and P. Sarkar, Eds., vol. 2551. Springer, 2002, pp. 62-70. [Online]. Available: https://doi.org/10.1007/3-540-36231-2_6
[9] M. Coglianese and B. Goi, "MaTRU: A new NTRU-based cryptosystem," in Progress in Cryptology - INDOCRYPT 2005, 6th International Conference on Cryptology in India, Bangalore, India, December 10-12, 2005, Proceedings, ser. Lecture Notes in Computer Science, S. Maitra, C. E. V. Madhavan, and R. Venkatesan, Eds., vol. 3797. Springer, 2005, pp. 232-243. [Online]. Available: https://doi.org/10.1007/11596219_19
[10] N. Vats, "NNRU, a noncommutative analogue of NTRU," arXiv preprint arXiv:0902.1891, 2009.
[11] A. K. Lenstra, H. W. Lenstra, and L. Lovász, "Factoring polynomials with rational coefficients," Mathematische Annalen, vol. 261, no. 4, pp. 515-534, 1982.
[12] J. Hoffstein, J. H. Silverman, and W. Whyte, "Estimated breaking times for NTRU lattices," NTRU Cryptosystems, Tech. Rep. 012, version 2, 2003.
[13] J. Hoffstein, J. Pipher, and J. H. Silverman, An Introduction to Mathematical Cryptography. New York: Springer, 2014. [Online]. Available: https://doi.org/10.1007/978-1-4939-1711-2_7
[14] N. Bourbaki, Topological Vector Spaces: Chapters 1-5, 1st ed., ser. Elements of Mathematics. Springer-Verlag Berlin Heidelberg, 2003.
[15] I. Smeets, A. Lenstra, H. Lenstra, L. Lovász, P. Q. Nguyen, and B. Vallée, The LLL Algorithm: Survey and Applications. Berlin, Heidelberg: Springer, 2010.
[16] EESS#1: Implementation aspects of NTRU. Last accessed 18/1/2018. [Online]. Available: https://github.com/NTRUOpenSourceProject/ntru-crypto/blob/master/doc/EESS1-v3.1.pdf
[17] W. Whyte and M. Etzel. Open source NTRU public key cryptography algorithm and reference code. Last accessed 18/1/2018. [Online]. Available: https://github.com/NTRUOpenSourceProject/ntru-crypto
[18] W. Hart, B. Gladman, J. Moxham, et al. (2015) MPIR: Multiple precision integers and rationals. Version 2.7.0, http://mpir.org. Last accessed 18/1/2018.
Review: Phishing Detection Approaches
AlMaha Abu Zuraiq
Computer Science Department
Princess Sumaya University for Technology
Amman, Jordan
alm20178050@std.psut.edu.jo

Mouhammd Alkasassbeh
Computer Science Department
Princess Sumaya University for Technology
Amman, Jordan
m.alkasassbeh@psut.edu.jo
Abstract—Phishing is one of the most common attacks on the internet. It employs social engineering techniques, such as deceiving users with forged websites, in an attempt to gain sensitive information like credentials and credit card details. This information can be misused, resulting in large financial losses to users. Phishing detection algorithms can be an effective approach to safeguarding users from such attacks. This paper reviews different phishing detection approaches, including Content-Based, Heuristic-Based, and Fuzzy Rule-Based approaches.

Keywords—Phishing, detection, fuzzy, machine learning, malicious website.

I. INTRODUCTION

The internet is everywhere today; we use web services for a range of activities such as sharing knowledge, social communication, and performing various financial activities, including buying, selling, and money transfers. Malicious websites are a serious threat to internet users, and unaware users can become victims of malicious URLs that host undesirable content such as spam, phishing, drive-by-downloads, and drive-by-exploits.

Phishing is a common attack on the internet. It is defined as the social engineering process of luring users into fraudulent websites to obtain their personal or sensitive information, such as user names, passwords, addresses, credit card details, social security numbers, or any other valuable information. According to the Anti-Phishing Working Group (APWG), the number of phishing incidents reported to the organization over the last quarter of 2016 was 211,032 [1]; this increased by 12% in the last quarter of 2018, which saw 239,910 reports [2]. Furthermore, a recent Microsoft Security Intelligence Report (Volume 24) found that phishing attacks topped the web attacks discovered in 2018, and we can only expect them to continue increasing [3].

The major challenge in detecting phishing attacks lies in discovering the techniques utilized. Phishers continuously enhance their strategies and can create web pages that are able to protect themselves against many forms of detection. Accordingly, developing robust, effective, and up-to-date phishing detection methods is necessary to oppose the adaptive techniques employed by phishers [4].

The blacklist-based approach relies on sources like Google Safe Browsing, PhishTank, and users' voting. When a web page is loaded, the browser searches the blacklist for it and alerts the user if the web page is found. The blacklist can be stored on the user's machine or on a server [5]. Blacklists are often used to classify websites as malicious or legitimate; while these techniques have low false-positive rates, they lack the ability to classify newly produced malicious URLs [6].

The content-based approach deploys a deep analysis of the pages' content, building classifiers and extracting features from page contents and from third-party services such as search engines and DNS servers. Yet these methods are hindered by the massive number of training features and by the reliance on third-party servers, which assaults users' privacy by uncovering their browsing history [4].

In the heuristic-based approach, detection is based on various discriminative features extracted by understanding and analyzing the structure of phishing web pages. The method used in processing these features plays a considerable role in classifying web pages effectively and accurately [7].

Fuzzy logic permits intermediate values between levels. In the fuzzy rule-based approach, it is utilized to classify web pages based on the level of phishness that appears in the pages, by implementing a specific group of metrics and predefined rules [8]. The fuzzy approach allows the processing of ambiguous variables; fuzzy logic integrates human expertise to clarify those variables and the relations between them. Fuzzy logic approaches also use linguistic variables to express phishing features and the likelihood that a web page is phishing [9].

The main purpose of this study is to present a comprehensive survey of existing approaches used in phishing detection. In the literature review, the related work is discussed based on the aforementioned classification of phishing detection approaches.

II. LITERATURE REVIEW

The review of existing studies on phishing detection approaches is categorized into three groups: the Content-Based approach, the Heuristic-Based approach, and the Fuzzy Rule-Based approach. The review is based on studies from the period between 2013 and 2018.
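The blacklist lookup described in the introduction amounts to a set-membership test against a locally cached or server-side list. A minimal sketch, with made-up URLs and a plain Python set standing in for services such as Google Safe Browsing or PhishTank:

```python
# Minimal blacklist check: the browser-side test described above reduces
# to membership in a (locally cached) URL set. URLs are made up.
blacklist = {
    "http://paypa1-login.example",
    "http://secure-bank-update.example",
}

def is_blacklisted(url: str) -> bool:
    """Alert condition: True when the visited URL is on the blacklist."""
    return url in blacklist

assert is_blacklisted("http://paypa1-login.example")
# A freshly registered phishing URL evades this check, which is exactly
# the weakness noted in [6]:
assert not is_blacklisted("http://brand-new-phish.example")
```

The low false-positive rate of the approach follows directly from this design: only explicitly listed URLs ever trigger an alert.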
The accuracy of the proposed system in detecting phishing web pages is 97.16% [14].

This study criticizes existing solutions for detecting phishing web pages, such as antiviruses and firewalls, for not completely protecting users from web spoofing attacks. Likewise, the application of the Secure Socket Layer (SSL) and digital certificates (CAs) is not fully effective, because some types of SSL certificates and CAs can be faked so that web pages appear legitimate. This paper therefore proposed a phishing detection method that applies multiple steps to check URL and domain name features. The performance of this work is assessed on a dataset of URLs randomly collected from PhishTank and the Yahoo directory; 100 URLs are used (59 legitimate and 41 malicious). PhishChecker detected 68 of the URLs as legitimate and 32 as malicious; the results show that the accuracy of PhishChecker in detecting phishing is 96% [15].

In this paper, URLs are also utilized to check whether a web page is phishing. The authors proposed a heuristic approach able to detect zero-day phishing attacks that cannot be detected by list-based methods; in addition, it is faster than visual-similarity-based approaches. The system is implemented as a desktop application named PhishShield, which takes a URL as input and classifies it as a phishing or legitimate website. The heuristic features used in this study are extracted from the web page using JSoup without any user intervention. To evaluate the performance of the PhishShield application, the authors obtained 1,600 phishing websites from PhishTank and 250 legitimate websites, of which 176 were obtained from PhishTank and the rest were collected randomly. The accuracy attained by the proposed application is 96.57% [16].

Some studies combined a heuristic-based approach with a machine-learning algorithm to enhance the classification of web pages. Machine learning algorithms are applied to the extracted features to produce an accurate classifier model that distinguishes between phishing and legitimate web pages.

First, this paper suggested a heuristic-based phishing detection method to recognize phishing web pages. The system first extracts and utilizes URL-based features; these features are then fed to machine learning algorithms, which recognize whether the web page is phishing or legitimate. This system uses 10 features of the input URL dataset, and the output is categorized as either Legitimate or Phishing. The Support Vector Machine algorithm is applied to the extracted features to find the values of FP, TP, FN, and TN; the F1-measure and accuracy are also calculated, where the accuracy value was 96%. The dataset of URLs, which contains 200 legitimate and phishing web page URLs, was collected from PhishTank and the Yahoo directory [17].

This paper also implemented a heuristic-based phishing detection approach with machine learning algorithms over URL features. The proposed method extracts URL features of web pages requested by the user and applies them to decide whether a requested web page is phishing. To choose the classifier most effective for the URL-based features, several machine learning techniques are utilized: support vector machine (SVM), naive Bayes, decision tree, k-nearest neighbor (KNN), random tree, and random forest. To evaluate and train a classifier, a dataset of 3,000 phishing web pages collected from PhishTank and 3,000 legitimate web pages from DMOZ was used. The experimental results show that the machine learning classifier achieving the best performance is Random Forest (RF), with 98.23% accuracy [18].

This paper proposed a heuristic-based method to detect phishing web pages by utilizing URL features. A set of 138 features was developed based on previous work. These features are grouped into four classes: lexical-based features, keyword-based features, reputation-based features, and search-engine-based features. The system is evaluated using datasets consisting of more than 16,000 phishing and 31,000 non-phishing URLs. Seven different classifiers are implemented: SVM with an RBF kernel, SVM with a linear kernel, Multilayer Perceptron (MLP), Random Forest (RF), Naïve Bayes (NB), Logistic Regression (LR), and C4.5. According to the experimental results, Random Forest (RF) achieved the highest accuracy rate and the lowest error rate [19].

In the two previous works, a heuristic-based approach was implemented with machine learning algorithms. Although each utilized its own dataset, employed different features, and applied different machine learning algorithms, in both studies the Random Forest algorithm achieved the most effective classification of web pages.

The next two studies demonstrate hybrid machine learning approaches that benefit from the strengths of each algorithm while mitigating its weaknesses, because more effective techniques are needed to limit the fast evolution of phishing attacks.

This study proposed a method that combines two algorithms: the k-nearest neighbors (KNN) algorithm, which is effective against noisy data, and the Support Vector Machine (SVM) algorithm, which is a robust classifier. The combination is done in two phases: first KNN is applied, then SVM is employed as the classification tool. The dataset used for the experiment is taken from related work and contains more than 1,353 samples gathered from various sources. Each sample record is composed of nine features in addition to the class label, which is Phishing, Legitimate, or Suspicious. Consequently, the simplicity of KNN is integrated with the effectiveness of SVM, avoiding the disadvantages of each when used individually. The accuracy of the proposed method is 90.04% [5].

Likewise, this paper proposed a fast and accurate phishing detection method that combines Naïve Bayes (NB) and the Support Vector Machine (SVM), utilizing features of URLs and web page contents. NB is used first to detect phishing pages; if a web page is not detected confidently and remains suspicious, SVM is employed to reclassify it. The dataset is generated from PhishTank and consists of 600 phishing and 400 legitimate web pages; 100 legitimate and 100 phishing web pages are used as the training set, and the rest serve as the testing set. Experimental results show that this approach achieves high detection accuracy and low detection time [20].
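The URL-based heuristics recurring in the studies above (lexical and keyword features fed to a classifier) can be illustrated with a small, self-contained extractor. The particular features below are common lexical choices for this task, not the exact 10 or 138 features of the reviewed papers, and the suspicious-word list is made up for illustration:

```python
from urllib.parse import urlparse

SUSPICIOUS_WORDS = ("login", "verify", "update", "secure")  # illustrative only

def lexical_features(url: str) -> dict:
    """A handful of lexical URL features of the kind fed to classifiers
    such as Random Forest in the studies above (illustrative set)."""
    host = urlparse(url).netloc
    return {
        "url_length": len(url),
        "num_dots": host.count("."),             # many subdomains look suspicious
        "has_at_symbol": "@" in url,             # '@' can hide the real host
        "has_ip_host": host.replace(".", "").isdigit(),
        "num_suspicious_words": sum(w in url.lower() for w in SUSPICIOUS_WORDS),
    }

f = lexical_features("http://192.168.0.1/secure-login/verify?acct=1")
assert f["has_ip_host"] and f["num_suspicious_words"] == 3
```

Vectors like `f` would then be stacked into a training matrix and handed to whichever classifier is being evaluated, with accuracy, F1, TP/FP/TN/FN computed on a held-out split as in [17]-[19].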
C. Fuzzy Rule-Based Approach

Many studies have suggested phishing detection techniques based on different properties, such as URLs, web page contents, or a combination of both. However, each study has its own advantages and drawbacks, so research in this field is always required, because the most appropriate, effective, and accurate method does not exist yet.

In trying to obtain the benefits of a fuzzy logic system, this study proposed a novel approach that targets URL features with a fuzzy logic method. The system is applied in five phases: selecting URL features; calculating the values of 6 heuristics; calculating 12 fuzzy values for the 6 heuristics from membership functions; defuzzification, by calculating the mean of the 6 fuzzy values of the phishing linguistic label (MP) and the mean of the 6 fuzzy values of the legitimate linguistic label (ML); and, finally, comparing the values of MP and ML to classify the web page. The approach was assessed with 11,660 phishing web pages and 5,000 legitimate web pages. The accuracy of the proposed method was 98.17% [21].

This paper presented a phishing detection method using a fuzzy logic technique with five heuristic labels (Highly Legitimate, Legitimate, Suspicious, Phished, and Highly Phished). Classifying web pages is based on specific predetermined rules split into three main groups: address-bar-based features, domain-based features, and HTML- and JavaScript-based features. The first group is used to recognize web page authenticity, the second group preserves web page integrity, and the last group gives reliability to a web page.

The proposed model consists of four steps: first, the fuzzification step, which converts crisp inputs to fuzzy inputs; then a set of fuzzy rules is defined; after that, the membership functions of the fuzzy sets are determined; the final step is the defuzzification process, which produces the crisp outputs. The system is tested on a dataset of 300 URLs randomly collected from PhishTank and DMOZ. The evaluation is based on a fuzzy logic method using a triangular membership function, with three defuzzification methods: the mean-of-maximum, weighted-average, and centroid methods [22].

Instead of using a standalone fuzzy system, this work applied a Neuro-Fuzzy Scheme, an integration of fuzzy logic and a neural network. This integration enables the use of both linguistic and numeric characteristics. The scheme utilized 288 features extracted from five inputs (legitimate site rules, user-behavior profile, PhishTank, user-specific sites, and pop-ups from emails), which had not been used together in one system platform before; that is the main contribution of this work. While a neural network is effective in treating raw data, fuzzy logic offers a high level of reasoning using numeric and linguistic characteristics. Neuro-fuzzy was selected due to its ability to learn from data, from the neural network point of view, and to create linguistic rules, from the fuzzy viewpoint. The experiment tested the 288 features with 2-fold cross-validation, resulting in an accuracy of 98.5% [23].

This paper also proposed a phishing detection method employing fuzzy systems and a neural network, but with a dataset of 300 values extracted from six data sources: legitimate site rules, user-behavior profile, PhishTank, user-specific sites, pop-up windows, and user-credential profile. These are the same as in the previous study, except for the newly added user-credential profile source. The proposed system was also trained and tested using 2-fold cross-validation. The fuzzy model has five functions to understand and make judgments, including an input layer, fuzzification, a rule base, normalization, and defuzzification. The proposed system achieved 99.6% accuracy, which is better than the previous study [24].

This paper discusses different phishing techniques developed by other researchers and an efficient way of distinguishing web pages. It is done by getting the benefit of a genetic algorithm to treat phishing web pages and then applying a fuzzy logic technique. Fuzzy logic is implemented to assess the phishing degree of various web pages based on a set of predefined rules: if the URL meets the specified rules, it is estimated to be a phishing web page and given a score. Ten sets of predefined rules are considered for assessing the phishiness degree of the URL. If a rule matches the web page URL, it is weighted with a score of 0.1; the total over all ten layers, which varies between 0 and 1, denotes the phishiness degree, where 1 indicates a very legitimate web page and 0 a very phishy web page.

According to these outputs, it can be determined whether the web page is fake or not. There are four phases to detecting phishing web pages using fuzzy logic: fuzzification, where crisp inputs are transformed into fuzzy inputs; rule evaluation, using if...then statements; aggregation of the rule outputs, by unifying the outputs of all rules; and defuzzification, where the fuzzy output is transformed into a crisp output (phishy or legitimate). This study concludes that even if a web page contains phishy characteristics, it does not mean that the whole page is phishy. Therefore, using fuzzy logic is one of the most effective methods to obtain the phishiness degree of a web page [8].

III. DATASETS SUMMARY

Table I lists each paper and the corresponding dataset used.

TABLE I. DATASETS

Content-Based Approach:
  [10]: 167 phishing webpages downloaded from PhishTank; 51 legitimate webpages selected manually.
  [11]: 1,140 webpages; phishing webpages downloaded from PhishTank, legitimate webpages downloaded from Alexa.
  [6]: 2,826 phishing webpages from PhishTank; 13,416 legitimate webpages directly crawled from the Internet.
  [12]: 3,066 phishing webpages, 686 legitimate.
  [13]: Phishing webpages downloaded from PhishTank; legitimate webpages downloaded from DMOZ.

Heuristic-Based Approach:
  [14]: 11,660 phishing webpages downloaded from PhishTank; 5,000 legitimate webpages downloaded from DMOZ.
  [15]: 59 legitimate webpages downloaded from the Yahoo directory; 41 phishing webpages downloaded from PhishTank.
  [16]: 1,600 phishing webpages downloaded from PhishTank; 250 legitimate webpages, of which 176 were downloaded from PhishTank and the remainder collected randomly.
  [17]: 200 legitimate webpages downloaded from the Yahoo directory; 200 phishing webpages downloaded from PhishTank.
  [18]: 3,000 phishing webpages downloaded from PhishTank; 3,000 legitimate webpages downloaded from DMOZ.
  [19]: 11,361 phishing webpages downloaded from PhishTank; 22,213 legitimate webpages downloaded from DMOZ.
  [20]: 600 phishing webpages downloaded from PhishTank; 400 legitimate webpages downloaded from PhishTank.

Fuzzy Rule-Based Approach:
  [21]: 11,660 phishing webpages downloaded from PhishTank; 5,000 legitimate webpages downloaded from DMOZ.
  [22]: 300 random webpages; phishing webpages downloaded from PhishTank, legitimate webpages downloaded from DMOZ.
  [24]: 11,660 phishing webpages downloaded from PhishTank; 10,000 legitimate webpages downloaded from DMOZ.

IV. CONCLUSION

Phishing web pages have increased these days, resulting in huge financial losses. The need for methods of protection from these phishing web pages has become very pressing.

Formerly, the blacklist-based approach was the most common method used in the detection of phishing web pages. The drawback of that approach is its inability to recognize non-blacklisted or temporary phishing web pages. Therefore, more robust and effective approaches for detecting phishing attacks have been developed. In this paper, different approaches were reviewed according to three main groups: the Content-Based approach, the Heuristic-Based approach, and the Fuzzy Rule-Based approach.

The Content-Based approach analyses web page content, for example by extracting words such as brand names from URLs or HTML contents and giving weights to them, extracting logo images and comparing them with the original ones, or checking the consistency between URLs and web content.

In the Heuristic-Based approach, distinctive features extracted from the structure of phishing web pages, such as URLs, domain names, and web page rank, are employed in the detection process. These features are applied to machine learning algorithms to build an accurate classifier that effectively differentiates between phishing and legitimate web pages.

In the Fuzzy Rule-Based approach, web pages are classified based on the level of phishness present in them, using predefined rules. The fuzzy logic process is applied in multiple steps, usually starting with a fuzzification step and ending with a defuzzification step. Fuzzy rules may be combined with different artificial intelligence algorithms, such as neural networks and genetic algorithms, to upgrade their functionality.

Finally, we can conclude that there is no perfect approach for detecting phishing web pages. Each approach has its advantages and disadvantages, and improving these approaches is always required.

REFERENCES

[1] Anti-Phishing Working Group, "Phishing Activity Trends Report, 4th Quarter 2016," Unifying the Global Response To Cybercrime, APWG, 2016. [Online].
[2] Anti-Phishing Working Group, "Phishing Activity Trends Report, 4th Quarter 2018," Unifying the Global Response To Cybercrime, 2018.
[3] Microsoft, "Microsoft Security Intelligence Report, Volume 24."
[4] H. Shirazi, B. Bezawada, and I. Ray, "Kn0w Thy Doma1n Name: Unbiased Phishing Detection Using Domain Name Based Features," in Proceedings of the 23rd ACM Symposium on Access Control Models and Technologies, 2018.
[5] A. Altaher, "Phishing websites classification using hybrid SVM and KNN approach," International Journal of Advanced Computer Science and Applications, vol. 8, pp. 90-95, 2017.
[6] Y.-S. Chen, Y.-H. Yu, H.-S. Liu, and P.-C. Wang, "Detect phishing by checking content consistency," Proceedings of the 2014 IEEE 15th International Conference on Information Reuse and Integration (IEEE IRI 2014), pp. 109-119, 2014.
[7] N. Abdelhamid, A. Ayesh, and F. Thabtah, "Phishing detection based associative classification data mining," Expert Systems with Applications, vol. 41, pp. 5948-5959, 2014.
[8] K. A. K. N. Manoj Kumar, "Detecting Phishing Websites using Fuzzy Logic," International Journal of Advanced Research in Computer Engineering & Technology (IJARCET), vol. 5, 2016.
[9] M. Aburrous, M. A. Hossain, K. Dahal, and F. Thabtah, "Intelligent phishing detection system for e-banking using fuzzy data mining," Expert Systems with Applications, vol. 37, pp. 7913-7921, 2010.
[10] C. L. Tan, K. L. Chiew, et al., "Phishing website detection using URL-assisted brand name weighting system," 2014 International Symposium on Intelligent Signal Processing and Communication Systems (ISPACS), pp. 054-059, 2014.
[11] K. L. Chiew, E. H. Chang, W. K. Tiong, et al., "Utilisation of website logo for phishing detection," Computers & Security, vol. 54, pp. 16-26, 2015.
[12] M. Moghimi and A. Y. Varjani, "New rule-based phishing detection method," Expert Systems with Applications, vol. 53, pp. 231-242, 2016.
[13] M. N. Feroz and S. Mengel, "Phishing URL detection using URL ranking," 2015 IEEE International Congress on Big Data, pp. 635-638, 2015.
[14] L. A. T. Nguyen, B. L. To, H. K. Nguyen, and M. H. Nguyen, "A novel approach for phishing detection using URL-based heuristic," 2014 International Conference on Computing, Management and Telecommunications (ComManTel), pp. 298-303, 2014.
[15] A. A. Ahmed and N. A. Abdullah, "Real time detection of phishing websites," 2016 IEEE 7th Annual Information Technology, Electronics and Mobile Communication Conference (IEMCON), pp. 1-6, 2016.
[16] R. S. Rao and S. T. Ali, "PhishShield: a desktop application to detect phishing webpages through heuristic approach," Procedia Computer Science, vol. 54, pp. 147-156, 2015.
[17] J. Solanki and R. G. Vaishnav, "Website phishing detection using heuristic based approach," Proceedings of the Third International Conference on Advances in Computing, Electronics and Electrical Technology, 2015.
[18] J.-L. Lee, D.-H. Kim, and C.-H. Lee, "Heuristic-based approach for phishing site detection using URL features," Proc. of the Third Intl. Conf. on Advances in Computing, Electronics and Electrical Technology (CEET), pp. 131-135, 2015.
[19] R. B. Basnet and T. Doleck, "Towards developing a tool to detect phishing URLs: a machine learning approach," 2015 IEEE International Conference on Computational Intelligence & Communication Technology, pp. 220-223, 2015.
[20] X. Gu, H. Wang, and T. Ni, "An efficient approach to detecting phishing web," Journal of Computational Information Systems, vol. 9, pp. 5553-5560, 2013.
[21] B. L. To, L. A. T. Nguyen, H. K. Nguyen, and M. H. Nguyen, "A novel fuzzy approach for phishing detection," 2014 IEEE Fifth International Conference on Communications and Electronics (ICCE), pp. 530-535, 2014.
[22] S. D. Shirsat, "Demonstrating Different Phishing Attacks Using Fuzzy Logic," 2018 Second International Conference on Inventive Communication and Computational Technologies (ICICCT), pp. 57-61, 2018.
[23] P. A. Barraclough, M. A. Hossain, M. Tahir, G. Sexton, and N. Aslam, "Intelligent phishing detection and protection scheme for online transactions," Expert Systems with Applications, vol. 40, pp. 4697-4706, 2013.
[24] L. A. T. Nguyen, B. L. To, and H. K. Nguyen, "An efficient approach for phishing detection using neuro-fuzzy model," Journal of Automation and Control Engineering, vol. 3, 2015.
Detecting Slow Port Scan Using Fuzzy Rule
Interpolation
Mohammad Almseidin
Department of Information Technology
University of Miskolc
Miskolc, Hungary
alsaudi@iit.uni-miskolc.hu

Mouhammd Al-kasassbeh
Computer Science Department
Princess Sumaya University for Technology
Amman, Jordan
m.alkasassbeh@psut.edu.jo

Szilveszter Kovacs
Department of Information Technology
University of Miskolc
Miskolc, Hungary
szkovacs@iit.uni-miskolc.hu
Abstract—Fuzzy Rule Interpolation (FRI) offers a convenient way of delivering rule-based decisions on continuous universes, avoiding the burden of binary decisions. In contrast with classical fuzzy systems, FRI also performs well on partially complete rule bases, serving methodologies that have an incremental rule-base creation structure. These features make the FRI methods a perfect candidate for detecting and preventing different types of attacks in an Intrusion Detection System (IDS) application. This paper aims to introduce a detection approach for slow port scan attacks by adapting the FRI reasoning method. A controlled test-bed environment was also designed and implemented for the purpose of this study. The proposed detection approach was tested and evaluated using different observations. Experimental analysis on a real test-bed environment provides useful insights about the effectiveness of the proposed detection approach. These insights include information regarding the detection approach's efficacy in detecting the port scan attack and in determining its level of severity. In the discussion, the efficacy of the proposed detection approach is compared to the SNORT IDS. The results of the comparison showed that the SNORT IDS was unable to detect the slow and very slow port scan attacks, whereas the proposed FRI rule-based detection approach was able to detect the attacks and generate comprehensive results to further analyze the attack's severity.

Index Terms—Fuzzy Rule Interpolation, Intrusion Detection System, Port Scan Attack, SNORT.

I. INTRODUCTION

The rapid growth of technologies makes protecting computer networks a challenging task. Another challenge is that attackers' needs have grown and changed relative to the rapid technological growth. Attacks are generally not executed blindly. Rather, new techniques are strategically implemented. In other words, the attacker strategically executes several steps before achieving his final goal. The first step is to collect the necessary information about the desired victims. These types of attacks are known as "multi-step attacks" due to their strategic execution, which takes place in various stages. Attackers first focus on finding open pathways to implement their illegal activities in order to eventually break down the availability and integrity of the connected network.

Multi-step attacks made up 60% of total attacks worldwide [1], [2]. The attackers change their techniques because low-level attacks (single-stage attacks) are detected using predefined standard rules. However, multi-step attacks implement several steps, some of which appear legitimate and therefore make these types of attacks more difficult to detect. The most common detection mechanism used is the Intrusion Detection System (IDS). It can be categorized as either anomaly-based or signature-based detection. Anomaly-based detection is able to detect new types of attacks by using the network traffic's historical behaviour. However, this type of detection renders a greater number of false positive alerts. On the other hand, signature-based detection offers the lowest number of false positives for the stored attack signatures (patterns). From another perspective, the signature-based detection mechanism needs to be updated frequently with different attack patterns [3]–[6]. While each of the previous detection mechanisms has its own benefits and drawbacks, the anomaly-based detection mechanism is more widely used [7].

Detecting multi-step attacks is not a straightforward procedure. The IDS may face difficulties in detecting multi-step attacks [2]. The characteristic strength of these types of attack is that they are carried out sequentially and usually start their sequence with some legal actions used to discover and probe connected computers. After that, the attacks focus on opening direct pathways into the system. This is done by the attackers accumulating significant information about the expected victims. Therefore, one of the most important steps for the attacker is to gather the required information about the expected victims.

The port scan attack [8] is considered a preliminary step of different types of multi-step attacks. It provides significant information about the intended victims within the connected network. Meanwhile, it gathers large amounts of information that are required for the later steps of the attack. From another perspective, the port scan could be useful as a tool for the network administrator to diagnose and troubleshoot their network. However, attackers abuse the port-scan tool to exploit it as a means of attacking the system. Practically, the IDS detects various types of port-scan attacks but it has difficulty detecting the slow port scan. A slow port scan [9] means that an attacker does not send probe packets from more than two computers permanently. Rather, attackers send packets to a host, for example, only every 30 or 60 seconds. The attacker uses the slow port scan to gather the required
0 and 1. The intrusion-based fuzzy rules were suggested by the expert. The major parameter used to detect the port scan was the Session Description Protocol (SDP); it indicates the unique connection between source and destination using the same port. The proposed method was tested and evaluated using a simulated attack environment. The proposed system effectively detected the port scan attack in addition to other intrusion types, i.e. backdoor and Trojan horse attacks.

There are several works that contribute to the research into different methods for detecting and preventing port scan attacks. A good summary of the different detection approaches against the port scan attack is provided in [16]. The previous works provide convincing contributions and support the idea that implementing a fuzzy inference system as a detection approach could be suitable for detecting the port scan attack. From another perspective, the previous works still have a common flaw, namely that the classical inference system requires a complete fuzzy rule-base to detect the port scan attack. It could be difficult in some cases to obtain a complete fuzzy rule-base. As a result, when an observation appeared, it is possible that it was not covered by any of the fuzzy rules. In this case, the detection approach was incapable of offering the desired output. Unlike the previous efforts, in this work the FRI reasoning method was adapted instead of the classical fuzzy inference system. The advantage of using the FRI reasoning method is that it eliminates the need for a complete fuzzy rule-base; the detection approach can be implemented using only a few significant intrusion detection fuzzy rules.

III. FUZZY RULE INTERPOLATION

The term "fuzzy logic" was initially introduced by Lotfi Zadeh [17]. There are some application areas where the need for handling continuous universes requires the concepts of fuzzy sets and continuous-valued logic instead of crisp sets and binary logic. Fuzzy logic can also be implemented as a suitable reasoning method for application areas dealing with the issue of binary decisions. For example, with regard to intrusion detection, a binary decision is not suitable for recognizing the level of intrusion. The fuzzy system, however, is able to avoid the binary decision by smoothing the boundaries and presenting more comprehensible results [4]. A detection approach based on the fuzzy system must meet the following demands: specify the input and output universes, specify the input and output fuzzy partitions, and generate the intrusion detection fuzzy rules [18].

In the classical fuzzy inference systems, i.e. Mamdani and Takagi-Sugeno, the fuzzy rule-base must cover all observations (inputs) to generate results. However, the classical fuzzy inference system could not generate the expected results for all observations when dealing with a partially defined fuzzy rule-base [7]. The FRI reasoning methods are introduced to generate conclusions even in the case when the fuzzy rule base is only partially defined (sparse). Moreover, the FRI methods can significantly reduce the number of fuzzy rules [19] because when using the FRI methods there is no need for complete fuzzy rules. The FRI methods approximate the required conclusions based on the most important fuzzy rules; a thorough summary of the FRI methods is presented in [20].

IV. FRI AGAINST PORT SCAN ATTACK

As mentioned in Section III, first the input and output universes needed to be defined in order to establish the fuzzy system. The general structure of the proposed detection approach is shown in Fig. 1.

Fig. 1. The Structure of The Proposed Detection Mechanism

The general structure of the proposed detection approach was initiated by extracting the FRI inference system's required input parameters. The extraction process was executed using SNORT. SNORT is a free, open-source network intrusion detection system [21]; it can be installed and configured to detect various types of intrusions using real-time traffic. The SNORT structure is implemented on top of a packet-capture library. The SNORT detection mechanism is based on predefined rules. These rules act as signatures for different types of intrusions. Every packet that passed through SNORT was thoroughly analyzed and investigated to find any matches to the predefined detection rules. This requires that the repository of predefined rules be continuously updated. SNORT rules can be written in a friendly way, allowing the system administrator to easily edit, delete and insert new rules [22]. The incorporation of SNORT and the FRI reasoning method is carried out to derive the network input parameters for the FRI detection approach. In the sniffing mode, SNORT collects many network parameters and information.

Important parameters must be defined in order to detect different types of port scan attacks. Time is one of the primary parameters used for recognizing the port scan attack. According to the results of the literature in [22], [23], the following parameters were extracted and used as input parameters for the proposed detection approach:

• The Number of Packets Sent (NPS) between source and destination.
• The Average Time between received Packets (ATP) by the destination victim in milliseconds.
• The Number of Packets Received (NPR) by the destination victim in seconds.

To carry out an actual experimental port scan attack, a test-bed network environment was constructed. Fig. 2 shows the test-bed network architecture.

According to the experiments conducted, four connected computers (Client 1, Client 2, Client 3 and Client 4) were considered attackers and the last one was presented as a victim
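The three parameters above can be computed from captured traffic records. The following is a minimal sketch under assumed data shapes; the (timestamp, source, destination) tuple format and the function name are illustrative, not the paper's implementation:

```python
def extract_parameters(packets, src, dst):
    """Derive the FRI input parameters from a list of captured packets.

    `packets` is assumed to be a list of (timestamp_ms, source, destination)
    tuples, e.g. dumped by a sniffer running alongside SNORT.
    """
    sent = [t for (t, s, d) in packets if s == src and d == dst]
    received = [t for (t, s, d) in packets if d == dst]

    nps = len(sent)      # Number of Packets Sent (NPS)
    npr = len(received)  # Number of Packets Received (NPR)
    # Average Time between received Packets (ATP), in milliseconds.
    gaps = [b - a for a, b in zip(received, received[1:])]
    atp = sum(gaps) / len(gaps) if gaps else 0.0

    return nps, npr, atp

# A fast scan produces many packets with tiny inter-arrival gaps, while a
# slow scan produces few packets separated by long gaps (e.g. 30-60 s).
fast = [(i * 2, "attacker", "victim") for i in range(100)]
print(extract_parameters(fast, "attacker", "victim"))  # → (100, 100, 2.0)
```

A slow-scan trace with one probe every 30 000 ms would instead yield a small NPS and a large ATP, which is exactly the sparsely covered region the interpolated fuzzy rules must handle.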
A. Fuzzification and Fuzzy Rule Generation

The FRI detection approach had three input parameters (NPS, NPR and ATP). For each input parameter, four linguistic terms were used to represent their ranges during each phase of the experiments. Table I lists the linguistic terms used to classify each of the FRI detection approach's input parameters.

TABLE I. LINGUISTIC TERMS OF THE SELECTED PARAMETERS
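The fuzzification step above, together with the interpolation idea from Section III, can be sketched as follows. The triangular partitions and the two sample rules are illustrative assumptions (the paper's actual ranges and its FRIPOC method are not reproduced here); the interpolation shown is the simpler KH-style linear interpolation applied to singleton consequents:

```python
def triangle(x, a, b, c):
    """Triangular membership degree of x, with peak at b and feet at a and c."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

# Illustrative linguistic terms for the ATP universe (ms); the ranges used
# in the paper's four experiment phases are not reproduced here.
ATP_TERMS = {
    "low":       (-1, 2, 6),
    "medium":    (4, 8, 12),
    "high":      (10, 14, 18),
    "very-high": (16, 20, 24),
}

def fuzzify(x, terms):
    """Return the membership degree of x in every linguistic term."""
    return {name: triangle(x, *abc) for name, abc in terms.items()}

# Sparse rule base on the ATP axis: antecedent positions with singleton
# consequents (0 = benign .. 1 = high-severity attack).
rules = [(2.0, 1.0), (20.0, 0.2)]

def interpolate(x, rules):
    """KH-style linear interpolation between the two flanking rules."""
    (x1, y1), (x2, y2) = sorted(rules)
    x = min(max(x, x1), x2)  # clamp observations outside the covered span
    lam = (x - x1) / (x2 - x1)
    return (1 - lam) * y1 + lam * y2

# An observation at ATP = 11 ms fires no rule directly (it falls in the gap
# between the two antecedents), yet interpolation yields a graded severity.
print(round(interpolate(11.0, rules), 2))  # → 0.6
```

This is the property the paper relies on: a singleton observation that is not covered by any stored rule still receives a severity estimate instead of no answer.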
TABLE II. THE SPARSE FUZZY RULES

This section discusses the results of the implemented experiments. SNORT could be enhanced by extending its binary decision to the continuous space using the FRI reasoning methods. It is worth mentioning that every observation used to evaluate the FRI detection approach was presented as a fuzzy singleton. The FRI detection approach yielded useful information such as the "level of port scan attack", which gives the administrator a better understanding of the recent port scan attack. This information can be expressed through these two observations: the first yielded the following crisp values (NPS = 1500, NPR = 3500, and ATP = 2) while the second registered (NPS = 150, NPR = 1050, and ATP = 18). The FRI detection approach's output responses for the two previous observations are illustrated in Fig. 4 and Fig. 5 respectively, where the first observation was classified as a high port scan attack and the second as a very slow port scan attack.

Fig. 4. FRI Detection Approach Output in Case of High Attack

TABLE III. FRI APPROACH VS SNORT OUTPUT ALERTS

       Input Parameters         Detection Method-based IDS Alerts
Obs   NPS    NPR    ATP    SNORT Alerts    FRI Approach Alerts
1     1500   3500   2      Attack Alert    High Port Scan Attack
2     150    1050   18     No Alert        Very Slow Port Scan Attack
3     900    2500   7      Attack Alert    Medium Port Scan Attack
4     77     817    19     No Alert        Very Slow Port Scan Attack
5     900    1000   15     No Alert        Slow Port Scan Attack
6     1600   3750   2      Attack Alert    High Port Scan Attack
7     1100   2020   8      Attack Alert    Medium Port Scan Attack
8     490    1100   16     No Alert        Slow Port Scan Attack

Consequently, these experiments demonstrate the proposed FRI detection approach's ability to present concise, comprehensible results. Moreover, it had the ability to detect the very slow and slow port scans where SNORT raised no attack alert. Traditional fuzzy-based detection approaches focus on adapting complete fuzzy rules to detect port scan attacks. However, this may not be a straightforward procedure in some cases. Therefore, the proposed FRI detection approach was based on the FRIPOC FRI method to smooth the boundaries and recognize the level of port scan attack. Furthermore, the approach was able to generate comprehensive results even if the fuzzy rule-base is only partially defined.

VI. CONCLUSION

This paper introduces a novel approach for detecting port scan attacks. The proposed approach was designed and constructed using fuzzy rule interpolation. The FRI-based detection approach's inference engine used the Fuzzy Rule Interpolation based on POlar Cuts (FRIPOC) method. The sparse fuzzy rules were generated based on expert knowledge, the range values of the input parameters during the experiments' four phases, and the relationship between the input parameters and the number of attacker
clients. The conducted experiments reflect the proposed FRI-based detection approach's ability to effectively detect the very slow and slow port scans based solely on the sparse fuzzy rules. The FRI-based detection approach's output responses were compared with SNORT, and the results reflected that the proposed detection approach was successful in detecting the very slow port scan attack in instances where SNORT did not render any alert. Furthermore, the FRI-based detection approach presented additional information, such as the level of port scan attack, instead of a binary alert.

ACKNOWLEDGMENT

The described article was carried out as part of the EFOP-3.6.1-16-00011 "Younger and Renewing University – Innovative Knowledge City – institutional development of the University of Miskolc aiming at intelligent specialization" project implemented in the framework of the Szechenyi 2020 program. The realization of this project is supported by the European Union, co-financed by the European Social Fund.

REFERENCES

[1] Y. Zhang, D. Zhao, and J. Liu, "The application of Baum-Welch algorithm in multistep attack," The Scientific World Journal, vol. 2014, 2014.
[2] M. Almseidin, I. Piller, M. Al-Kasassbeh, and S. Kovacs, "Fuzzy automaton as a detection mechanism for the multi-step attack," International Journal on Advanced Science, Engineering and Information Technology, vol. 9, no. 2, 2019.
[3] M. Almseidin, M. Alzubi, S. Kovacs, and M. Alkasassbeh, "Evaluation of machine learning algorithms for intrusion detection system," in Intelligent Systems and Informatics (SISY), 2017 IEEE 15th International Symposium on. IEEE, 2017, pp. 000277–000282.
[4] M. Almseidin and S. Kovacs, "Intrusion detection mechanism using fuzzy rule interpolation," Journal of Theoretical and Applied Information Technology, vol. 96, no. 16, pp. 5473–5488, 2018.
[5] M. Alkasassbeh, G. Al-Naymat, A. Hassanat, and M. Almseidin, "Detecting distributed denial of service attacks using data mining techniques," International Journal of Advanced Computer Science and Applications, vol. 7, no. 1, pp. 436–445, 2016.
[6] M. Alkasassbeh and M. Almseidin, "Machine learning methods for network intrusion detection," in The 20th International Conference on Computing, Communication and Networking Technologies ICCCNT 2018, 2018, pp. 105–110.
[7] M. Almseidin, M. Al-kasassbeh, and S. Kovacs, "Fuzzy rule interpolation and SNMP-MIB for emerging network abnormality," International Journal on Advanced Science, Engineering and Information Technology, vol. 9, no. 3, pp. 735–744, 2019.
[8] W. Zhang, S. Teng, and X. Fu, "Scan attack detection based on distributed cooperative model," in Computer Supported Cooperative Work in Design, 2008. CSCWD 2008. 12th International Conference on. IEEE, 2008, pp. 743–748.
[9] M. Ring, D. Landes, and A. Hotho, "Detection of slow port scans in flow-based network traffic," PLoS ONE, vol. 13, no. 9, p. e0204507, 2018.
[10] J. Kim and J.-H. Lee, "A slow port scan attack detection mechanism based on fuzzy logic and a stepwise policy," 2008.
[11] E. Ireland et al., "Intrusion detection with genetic algorithms and fuzzy logic," in UMM CSci Senior Seminar Conference, 2013, pp. 1–6.
[12] H. M. Moshiul et al., "An efficient framework for network intrusion detection," Computer Science & Telecommunications, vol. 24, no. 1, 2010.
[13] M. Z. Shafiq, M. Farooq, and S. A. Khayam, "A comparative study of fuzzy inference systems, neural networks and adaptive neuro fuzzy inference systems for portscan detection," in Workshops on Applications of Evolutionary Computation. Springer, 2008, pp. 52–61.
[14] "Endpoint Security dataset," http://www.nexginrc.org/Datasets, 2004.
[15] J. E. Dickerson, J. Juslin, O. Koukousoula, and J. A. Dickerson, "Fuzzy intrusion detection," in IFSA World Congress and 20th NAFIPS International Conference, 2001. Joint 9th, vol. 3. IEEE, 2001, pp. 1506–1510.
[16] M. H. Bhuyan, D. Bhattacharyya, and J. K. Kalita, "Surveying port scans and their detection methodologies," The Computer Journal, vol. 54, no. 10, pp. 1565–1581, 2011.
[17] L. A. Zadeh, "Fuzzy sets," Information and Control, vol. 8, no. 3, pp. 338–353, 1965.
[18] S. Dhopte and N. Tarapore, "Design of intrusion detection system using fuzzy class-association rule mining based on genetic algorithm," International Journal of Computer Applications, vol. 53, no. 14, 2012.
[19] S. Kovács, "Fuzzy rule interpolation," in Encyclopedia of Artificial Intelligence. IGI Global, 2009, pp. 728–733.
[20] Z. C. Johanyák and S. Kovács, "A brief survey and comparison on various interpolation based fuzzy reasoning methods," Acta Polytechnica Hungarica, vol. 3, no. 1, pp. 91–105, 2006.
[21] M. Roesch et al., "Snort: Lightweight intrusion detection for networks," in LISA, vol. 99, no. 1, 1999, pp. 229–238.
[22] W. El-Hajj, H. Hajj, Z. Trabelsi, and F. Aloul, "Updating Snort with a customized controller to thwart port scanning," Security and Communication Networks, vol. 4, no. 8, pp. 807–814, 2011.
[23] W. El-Hajj, F. Aloul, Z. Trabelsi, and N. Zaki, "On detecting port scanning using fuzzy based intrusion detection system," in Wireless Communications and Mobile Computing Conference, 2008. IWCMC'08. International. IEEE, 2008, pp. 105–110.
[24] Y.-C. Chen, L.-H. Wang, S.-M. Chen et al., "Generating weighted fuzzy rules from training data for dealing with the Iris data classification problem," International Journal of Applied Science and Engineering, vol. 4, no. 1, pp. 41–52, 2006.
[25] Z. C. Johanyák and S. Kovács, "Fuzzy rule interpolation based on polar cuts," in Computational Intelligence, Theory and Applications. Springer, 2006, pp. 499–511.
[26] Z. C. Johanyák, D. Tikk, S. Kovács et al., "Fuzzy rule interpolation Matlab toolbox - FRI toolbox," in 2006 IEEE International Conference on Fuzzy Systems, July 2006, pp. 351–357.
An Approach for Web Applications Test Data
Generation Based on Analyzing Client Side User
Input Fields
Samer Hanna
Department of Software Engineering, Faculty of Information Technology, Philadelphia University, Jordan
shanna@philadelphia.edu.jo

Hayat Jaber
Department of Computer Science, Faculty of Information Technology, Philadelphia University, Jordan
hayoot91@gmail.com
Abstract— Since it is time consuming to manually generate test data for Web applications, automating this task is of great importance for both practitioners and researchers in this domain. To achieve this goal, the research in this paper depends on an ontology that categorizes Web applications inputs according to input types such as number, text, and date. This research presents rules for Test Data Generation for Web Applications (TDGWA) based on the input categories specified by the ontology. Following the approach in this paper, Web applications testers will need a shorter time to accomplish the task of TDGWA. The approach has successfully been used to generate test data for different experimental and real-life Web applications.

Keywords—Test Data Generation for Web Applications, Ontology, Web Applications input types

I. INTRODUCTION

Web applications, in different domains, are used by millions of people around the world every day. For this reason, practitioners and researchers in the domain of Web applications must find means to assess the quality of these applications.

Software testing is an important activity that can be used to assess the quality of software applications. Testing includes generating test data and then executing the applications under test with the test data in order to compare the expected output, according to the requirement specifications, with the actual output resulting from the execution.

Testing Web applications is different from testing traditional applications because Web applications have many characteristics that do not exist in traditional applications; one of these characteristics is that they are used by many users at the same time.

Testing and test data generation consume lots of time and effort if done manually, and this also applies to testing and test data generation for Web applications. For this reason, it is very important to find means to automate this task.

Current Web applications testing tools generate the same test data for all the inputs of an application under test regardless of the purpose, semantics, or meaning of each individual input. The main idea of this research is to use an ontology for the purpose of categorizing and relating Web applications inputs in order to facilitate test data generation for a given Web application under test based on the different types of inputs for this application.

To demonstrate the problem discussed in this research, suppose that a Web application quality professional wants to test a Web application that has only the following 3 inputs: user name, age, and country. This person must decide the test data that must be used with each of these 3 inputs. If this task is done manually then it will consume lots of time and effort.

To solve this problem, researchers in this domain must find approaches to automate the task of TDGWA. One of the approaches to accomplish this task is to determine the needed conditions or constraints that must be applied to each input of a Web application under test depending on the type of this input. As examples of such input constraints, consider the Web application with 3 inputs mentioned above; the constraints that can be applied to these inputs are as follows:

• For the "user name" input, the user-inserted value must be between 1 character and 40 characters,

• For the "age" input, the user-inserted value must be between 1 year and 150 years (or any other upper limit for age), and

• For the "country name" input, the user-inserted value must be a valid country name among a specified list of valid countries such as {USA, UAE, etc.}.

To write the code of a tool that can be used for automating the task of TDGWA, this tool must firstly identify the type of each input in the investigated Web application, such as name, date, address, etc. Secondly, the tool must determine the constraints that must be associated with each of these inputs. After accomplishing these two tasks, test data can then be generated by applying different testing techniques, such as boundary value testing, robustness testing, and syntax testing, to the input constraints.

To explain this idea, consider again the above example Web application:

For the name input, since the constraint associated with this input is "a name must be between 1 character and 40 characters (or any other upper limit for a name)," then the test data according to boundary value-based robustness testing is: (a) an empty name, and (b) a name with more than 40 characters. Besides, according to the syntax testing, test data
• For the "name" input, there is no "type" attribute for the input element; in this case, the text associated with this input, which is "Your name", will be considered in the input classification process for the purpose of test data generation for this input.

• For the second input, which is the "age," there is a "type" attribute with a value equal to "number", and this value is useful for the test data generation process because, based on boundary value testing, test data like a very big number, a very small number, or a nominal number can be used to test this input. Moreover, the text associated with this input, which is "age," can also be used for test data generation because we can use test data like 300, which is a semantically invalid age, to test this input. So, for the age input, both the type attribute and the associated text can be used for test data generation.

• For the third input in Figure 1, which is the "Country" input, it uses the HTML <select> tag and this tag has no "type" attribute; in this case, it is easy to conclude that the type of this input is "enumerated" since there is a list of only 3 options. The associated text, which is "Country," is important in this case for test data generation since we must know that the options are country names to be able to generate test data such as an invalid country name or a country name that is not among the options of the select tag.

In brief, as shown in the example in Figure 1, determining the type of an input using the "type" attribute or using the text associated with that input, or both, will lead to determining the test data that must be used to test such an input.

IV. WEB APPLICATIONS INPUT DATA CLASSIFICATION

In order to generate test data for a Web application input, it is important firstly to determine the category of this input, e.g. name, age, country, email, etc.; after that, test data for this input can be generated based on the associated text of this input or the "type" attribute, as explained in Section 3.

Based on analyzing a sample of 250 Web applications in the ecommerce domain, it was concluded that the inputs of these applications can be classified into the following main categories: text, number, date, enum, and URL, as shown in the ontology in Figure 2.

As shown in the ontology in Figure 2, the input types that belong to the date category can be classified into sub-categories, namely, birth date, start date, end date, and departure date. There are many texts that belong to each of the previous sub-categories; for example, the texts that are associated with the "birth date" sub-category can be: "Your birth date", "the date of your birth", etc. The same discussion can be made for the other input categories in Figure 2.

It can be concluded that the main duty of the ontology in Figure 2 is to classify or categorize Web applications input types in order to conclude the test data that can be used with a certain input based on its type, as explained in the example in Section 3.

V. AN APPROACH FOR TEST DATA GENERATION FOR WEB APPLICATIONS

The approach that is suggested by this research for test data generation is based on the following activities:

Activity 1: Specify the input elements in the investigated HTML document.

Activity 2: Specify the text associated with each input element specified by Activity 1, and also the type attribute of this element if it exists.

Activity 3: For each associated text specified by Activity 2, determine the type of this associated text depending on the ontology in Figure 2. For example, if the text associated with a given HTML input is "Your birth date," then depending on the ontology it can be concluded that this text belongs to the "Birthdate" sub-category, which in turn belongs to the "Date" category, and so on.

Activity 4: If an input element has a type attribute, then map this type to one of the main input categories specified by the main ontology in Figure 2, namely, text, number, date, enum, and URL. For example:

• If the type attribute value is "phone" then it is mapped to the "number" category.

• If the type attribute value is "password" then it is mapped to the "text" category.

• If the type attribute value is "color" then it is mapped to the "enum" category.

• If the type attribute value is "url" then it is mapped to the "URL" category.
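Activities 3 and 4 reduce to lookups: from the type attribute, or from keywords in the associated text, to one of the five main categories. A minimal sketch follows; the keyword lists are a small illustrative stand-in for the ontology of Figure 2, and the function name is an assumption, not the paper's tool:

```python
# Mapping of HTML type attribute values to main categories (Activity 4).
TYPE_TO_CATEGORY = {
    "phone": "number", "number": "number",
    "password": "text", "text": "text",
    "color": "enum",
    "url": "URL",
    "date": "date",
}

# Keywords from associated texts mapped to categories (Activity 3);
# a tiny stand-in for the ontology's sub-categories in Figure 2.
TEXT_KEYWORDS = {
    "date": ["birth date", "start date", "end date", "departure date"],
    "number": ["age", "phone", "price", "income"],
    "enum": ["country", "gender", "marital status", "title"],
    "text": ["name", "email", "address", "comments"],
}

def categorize(type_attr, associated_text):
    """Return the main input category, preferring the type attribute.

    Naive substring matching stands in for the ontology lookup here.
    """
    if type_attr and type_attr.lower() in TYPE_TO_CATEGORY:
        return TYPE_TO_CATEGORY[type_attr.lower()]
    text = associated_text.lower()
    for category, keywords in TEXT_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return category
    return "text"  # default fallback category

print(categorize(None, "Your birth date"))  # → date
print(categorize("number", "age"))          # → number
```

For the "Your birth date" example above, the text-based path fires because no type attribute is present, mirroring the first bullet of the Figure 1 discussion.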
Activity 5: Based on an input's main category specified by Activity 3 or Activity 4, apply the following test data generation rules for each of the main input categories.

1. Date category

The rules that are associated with the inputs in this category are based on boundary value testing, and they are:

• If the input is a "Birthdate", "departure date" or "end date" then test data is a date with value < 1/1/1900 or a value > current date.

• If the input is a "day" then test data is a day with value of day < 1 or day > 31.

• If the input is a "month" then test data is a month with value of month < 1 or month > 12.

• If the input is a "year" then test data is a year with value of year > 2018 (current year) or year < 1900.

2. Number category

The rules that are associated with the inputs in this category are based on boundary value testing and syntax testing; examples of these rules are:

• If the input is a "phone" then test data is a phone number that has a letter/symbol, or a number like "000000000". (According to syntax testing.)

• If the input is a "price" or "income" then test data is a price or income with value < 0 (according to boundary value testing). Test data can also be a random string value (according to syntax testing).

• If the input is a "security code" then test data is a random string value. (According to syntax testing.)

3. Enumeration category

In the Enumeration category each input has specific accepted values; test data is any value different from these specific values, for example:

• If the input is a "gender" then test data is any random text except "Male" or "Female". (According to robustness testing.)

• If the input is a "marital status" then test data is any random text except "Married", "Single" or "Partner". (According to robustness testing.)

• If the input is a "title" then test data is any random text except "Miss", "Mrs.", "Mr." or "Dr.". (According to robustness testing.)

4. Text category

The rules that are associated with the inputs in this category are based on boundary value testing and syntax testing; examples of such rules are:

• If the input is an "email" then test data is an email without the "@" sign. (According to syntax testing.)

• If the input is an "address", "comments" or "message" then test data is any text with size > 500 characters.

• If the input is a "name" then test data is any random text with size > 50.

• If the input is a "password" then test data is any random text with size > 100.

5. URL

The URL data type has one rule only; we use the syntax testing technique to generate a wrong URL:

• If the input is a "URL" then test data is a URL without "http://".

Our complete approach of test data generation, based on Activity 1 to Activity 5, is demonstrated in Figure 3.

Figure 3. Test data generation approach

As shown in Figure 3, the approach consists of 4 main phases: parse the HTML page, determine the category of each input, apply the rules and generate test data to assess user input validation, and finally use these test data to assess the Web application's user input validation by invoking the Web application under test with the test data and then analyzing the response of the application. If an invalid input is accepted by the investigated application then this application has a semantic-based input validation vulnerability.
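The Activity 5 rules lend themselves to a table of small test-data generators keyed by input kind. The concrete values below follow the rules listed above, but the function and data names are illustrative assumptions rather than the paper's implementation:

```python
import random
import string

def random_text(size):
    """Random letters of a given length, used for syntax/robustness values."""
    return "".join(random.choices(string.ascii_letters, k=size))

# Invalid-test-data generators implementing a few of the Activity 5 rules.
RULES = {
    # Date category: boundary value testing.
    "birthdate": lambda: ["31/12/1899", "1/1/2100"],
    "day":       lambda: [0, 32],
    "month":     lambda: [0, 13],
    # Number category: boundary value and syntax testing.
    "phone":     lambda: ["12a-45#", "000000000"],
    "price":     lambda: [-1, random_text(8)],
    # Enumeration category: robustness testing.
    "gender":    lambda: [random_text(6)],          # anything but Male/Female
    # Text category: boundary value and syntax testing.
    "email":     lambda: ["user.example.com"],      # missing the @ sign
    "name":      lambda: [random_text(51)],         # size > 50
    # URL category: syntax testing.
    "url":       lambda: ["www.example.com/page"],  # missing http://
}

def generate_test_data(input_kind):
    """Return invalid test values for an input kind, or [] if no rule fits."""
    rule = RULES.get(input_kind.lower())
    return rule() if rule else []

print(generate_test_data("email"))  # → ['user.example.com']
```

Feeding these values to the application under test and checking whether any of them is accepted corresponds to the final phase of Figure 3.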
The approach in this paper had successfully been used to generate test data for different experimental and real-life Web applications.

VI. EVALUATION
250 Web applications were analyzed, and the inputs of these applications were fed into the test data generation ontology according to the type of each input, as explained in Section 4.
After that, another 10 sample Web applications were used in an experiment to evaluate whether the test data generation ontology can shorten the time needed for TDGWA. In this experiment, 4 testers who work in the field were asked to generate test data for the 10-application sample.
Only two of the testers were allowed to use the ontology and the related rules for test data generation discussed in Section 5. It was estimated that the testers who used this research's ontology and the related rules finished their work in 40% less time than the other two testers. Clearly, the ontology and the rules significantly reduced the time needed for TDGWA.
The threats and limitations of the experiment are that the 4 testers have different levels of experience in the field of TDGWA, and that the experiment was conducted with only 4 testers and a sample of only 10 Web applications.

VII. RELATED WORK
Since it is important to generate test data to assess the input validation and quality of Web applications, and since this task is time and labor consuming, researchers have proposed many approaches for reducing the time and effort this task requires.
The closest approaches to the approach in this research are:
Li et al. [1] suggested extracting the text associated with an input of a client-side HTML document of a Web application and then generating valid and invalid test data based on this text. The research in this paper also proposes using the text associated with a certain input for test data generation; however, this research introduces a systematic approach for categorizing or classifying Web application inputs based on an ontology in order to facilitate the process of test data generation.
Scholte et al. [2] proposed an approach to improve the secure development of web applications by transparently learning types for web application parameters during testing, and automatically applying robust validators for these parameters at runtime.
Deepa et al. [3] introduced Web application parameter tampering vulnerability, which occurs when a user violates client-side input data constraints and an application accepts that input without validation. The research in this paper can detect such a vulnerability.
Shahbaz et al. [4] presented an approach for generating test data for string validation routines. The approach produces both invalid and valid test cases. Invalid test data is produced by mutating the input regular expression.
AbdulRazzaq et al. [5] presented an approach for disclosing Web application attacks. The research identifies Web application attacks by applying semantic rules.
Bisht et al. [6] proposed a black-box approach to detect parameter tampering vulnerability. In this approach, client-side HTML and JavaScript code are analyzed in order to extract the constraints imposed on a Web application's inputs. The constraints are then violated in order to exploit tampering vulnerabilities in the tested Web application.
In Alkhalaf et al. [7], a client-side input validation function is checked to make sure that it conforms to the policies specified by the research. The policies are based on regular expressions that specify the set of acceptable input values. If an input validation function accepts an input that does not follow the specified regular expression, then this is considered a vulnerability. The research in this paper is different in that test case generation is based on analyzing the semantics of each input in an HTML page. The approach by Alkhalaf et al. [7] will not work if the HTML page has no client-side validation functions.
Aydin et al. [8] presented an automated testing framework for testing input validation and sanitization operations in web applications based on vulnerability signatures that are characterized as automata. To specify different types of vulnerabilities, they use regular expressions that characterize the strings that would cause a problem or vulnerability when sent to a security-sensitive function.
Offutt et al. [9] describe specific rules for generating test data for Web applications based on violating the constraints associated with Web application inputs. The concept of bypass testing was introduced to submit values to Web applications that are not validated by client-side checking.
Lei et al. [10] proposed an approach for test case generation to detect SQL injection vulnerability. The approach aimed at improving the coverage and efficiency of the test case generation process.
None of the previous research proposed rules for test case generation for Web applications based on different testing techniques depending on input categories.

VIII. CONCLUSIONS AND FUTURE WORK
Web applications are used every day by most people around the world, which makes assessing the quality of these applications one of the most important processes to be considered by researchers and practitioners in this domain. Since software testing is one of the important processes that can be used to assess quality, researchers must find means to test Web applications; to do that, they must first find means to generate test data for Web applications.
The approach to TDGWA in this paper is based on analyzing HTML client-side input fields, where the texts associated with the inputs are stored in an ontology in order to classify these inputs and generate test data accordingly. After classifying or categorizing a Web application's inputs, test data can be generated depending on the category or type of each input. The approach in this research will reduce the time and effort needed for TDGWA.
The ontology used in this research for input data classification can be augmented by considering more Web applications in different domains, since this ontology is based on a sample of only 250 Web applications.
A tool will be built that can automatically generate test data for a Web application based on analyzing the client-side data and searching for this data in an ontology of input types.
Future work will also discuss generating test data that can be used to assess whether a Web application can defend itself against one of the known Web application attacks or vulnerabilities, namely, SQL injection.

REFERENCES
[1] N. Li, T. Xie, M. Jin and C. Liu, "Perturbation-based user-input-validation testing of web applications," The Journal of Systems and Software, vol. 83, no. 11, pp. 2263–2274, 2010.
[2] T. Scholte, W. Robertson, D. Balzarotti and E. Kirda, "Preventing Input Validation Vulnerabilities in Web Applications through Automated Type Analysis," in IEEE 36th Annual, Turkey, 2012.
[3] G. Deepa, P. Thilagam, F. Khan, A. Praseed, A. Pais and N. Palsetia, "Black-box detection of XQuery injection and parameter tampering vulnerabilities in web applications," International Journal of Information Security, pp. 1-16, 2017.
[4] M. Shahbaz, P. McMinn and M. Stevenson, "Automatic generation of valid and invalid test data for string validation routines using web searches and regular expressions," Science of Computer Programming, vol. 97, pp. 405-425, 2015.
[5] A. Razzaq, K. Latif, H. F. Ahmad, A. Hur, Z. Anwar and P. C. Bloodsworth, "Semantic security against web application attacks," Information Sciences, vol. 254, pp. 19-38, 2014.
[6] P. Bisht, T. Hinrichs, N. Skrupsky, R. Bobrowicz and V. Venkatakrishnan, "NoTamper: Automatic Blackbox Detection of Parameter Tampering Opportunities in Web Applications," in 17th ACM Conference on Computer and Communications Security, Chicago, Illinois, USA, 2010.
[7] M. Alkhalaf, T. Bultan and J. L. Gallegos, "Verifying Client-Side Input Validation Functions Using String Analysis," in 34th International Conference on Software Engineering (ICSE), Zurich, Switzerland, 2012.
[8] A. Aydin, M. Alkhalaf and T. Bultan, "Automated Test Generation from Vulnerability Signatures," in International Conference on Software Testing, Verification, and Validation, 2014.
[9] J. Offutt, Y. Wu, X. Du and H. Huang, "Bypass Testing of Web Applications," in 15th International Symposium on Software Reliability Engineering (ISSRE), France, 2004.
[10] L. Lei, X. Jing, L. Minglei and Y. Jufeng, "Dynamic SQL Injection Vulnerability Test Case Generation Model Based on the Multiple Phases Detection Approach," in 2013 IEEE 37th Annual Computer Software and Applications Conference, 2013.
Achieving Data Integrity and Confidentiality Using
Image Steganography and Hashing Techniques
Ahmed Hambouz, Yousef Shaheen, Abdelrahman Manna, Dr. Mustafa Al-Fayoumi, and Dr. Sara Tedmori
Department of Computer Science
Princess Sumaya University for Technology
Amman, Jordan
ahmedhambouz@gmail.com, yousefpsut@icloud.com, manna.93@outlook.com, m.alfayoumi@psut.edu.jo, s.tedmori@psut.edu.jo
Abstract—Most existing steganography algorithms are keen on achieving data confidentiality only, by embedding the data into a cover-media. This research paper introduces a new steganography technique that achieves both data confidentiality and integrity. Data confidentiality is achieved by embedding the data bits in a secret manner into the stego-image. Integrity is achieved using the SHA-256 hashing algorithm to hash the decoding and encoding variables. The proposed model produced high PSNR values on a dataset of different image sizes, with an average PSNR of 82.933%.

Keywords—Steganography, Data Confidentiality, Data Integrity, PSNR, SHA 256, Data Tampering.

I. INTRODUCTION
Image steganography is one of the premier secure data hiding techniques. The role of steganography is to hide sensitive data in a cover image. This protects the data from being captured by any unauthorized party. Steganography helps maintain data confidentiality, data integrity, data authentication, and data privacy. Steganography techniques vary depending on the algorithm used. Steganography can be combined with symmetric algorithms or asymmetric cryptography techniques. The advantage of a specific technique lies in its ability to achieve the information security fundamentals [1]. In this paper, a new steganography technique that combines a new approach to the Least Significant Bit (LSB) method with a robust hash algorithm is introduced. The idea behind this approach is to embed any text into a cover image based on an offset flag, a shared key, and the robust hash function SHA-256 to achieve data confidentiality and integrity. Fig. 1 illustrates the general process of a steganography algorithm.

Fig. 1. Steganography Workflow

A. Least Significant Bit Algorithm
All computer data is represented in binary and grouped together in bytes. The LSB is the rightmost bit of an 8-bit array and is associated with the lowest weight, while the Most Significant Bit (MSB) is the leftmost bit and is associated with the highest weight. The technique proposed in this paper is based on LSB, due to the minimal effect that LSB has on the original image. Typically, LSB encoding in steganography is performed by altering the LSB of the cover image to become similar to the value of the most significant bit of the plaintext.

B. Hash Function
A cryptographic hash function is a one-way function that takes as input a variable-length plaintext and generates a fixed-size hash value. The hash function is considered a robust cryptographic technique as it is infeasible to recover the plaintext. The hash function ensures that the sent plaintext is untampered by comparing the sent hash value with the decoded hash value. The Secure Hash Algorithm (SHA) is one of the most commonly used hashing techniques. Many versions of the SHA algorithm have been introduced; the most popular family of hash functions is SHA-2, which was adopted in this research paper. SHA-2 consists of six hash functions with hash values of 224, 256, 384, or 512 bits. SHA-512 is separated into two sub-families: SHA-512/224 and SHA-512/256 [2].

The rest of this research paper is organized as follows: section 2 reviews related data encryption works that exploit steganography and hashing functions in data encryption. The proposed technique is described in section 3. The performance measures of the designed model are presented in section 4. The results are detailed in section 5 and discussed in section 6 under security and performance analysis. Section 7 concludes the paper and provides areas for future research.

II. RELATED WORK
The vast majority of published research focuses on securing data transmission using different cryptography techniques. Steganography is an art that researchers have adopted for its robustness in encrypting sensitive data and achieving data confidentiality and integrity.
Jose et al. [3] adopted a new model to embed sensitive data into a cover-image by propagating the plaintext bits over the cover-image using a hash salt technique with a password provided by the user. The adopted model increased the difficulty of a brute-force attack, as the salted hash results in 2^256 combinations of possible salted passwords. The authors also increased the model's security by using the Advanced
Encryption Standard (AES) algorithm to encrypt the plaintext before embedding it into a cover-image.
Gupta et al. [4] proposed a hybrid approach by combining steganography with the AES algorithm, and then used a hash function to increase the security of the model. The approach the authors proposed starts by encrypting the plaintext using the AES algorithm. The encrypted data is then stored in a hashed pixel location of the cover-image to generate a stego-image. The proposed model achieved accurate Mean Squared Error (MSE) values when evaluated on different image types such as .tiff, .png, .jpg, .bmp, and .gif.
Chaudhary et al. [5] proposed an effective steganography technique that uses RGB images. The idea behind their work is to indicate the pixel value using the most significant bit of the RGB channels instead of utilizing the entire channel. The algorithm works as follows: the LSB channels that are used for hidden data depend on the MSB sequence. For example, if the MSB sequence is 101, then the data hiding sequence is GRB. A hash function was adopted in this model by applying the logical operator XOR between the cover-image LSB bits and the stego-image to indicate which pixels have changed.
Madhuravani et al. [6] presented an authenticated steganography scheme that uses a dynamic hashing algorithm. First, the texture data is embedded into a cover-image that is then encoded using a stego-key. When the second party receives the stego-image, the stego-image is extracted to generate both the plaintext and the image size. The plaintext, after the extraction process, is applied to a dynamic hash function using either MD5 or SHA functions to generate a digested text. The digested text is then embedded into a cover-image to generate the stego-image at the receiver side. Once the stego-image is sent and received, the receiver extracts the message and compares the received hash with the hash of the extracted message. This new approach improved the security of the steganography technique by securing the communication channel.
Riasat et al. [7] introduced a robust hash-based steganography model. The designed model has a strong capability to hide images and data without losing image quality. The data scattering depends on a random number that is generated using a hash function, where the hash function uses two elements: the hash-key and the number of image chunks. The image chunks are separated into three fields, where the ASCII values are distributed over these chunks sequentially.
Charan et al. [8] proposed an efficient secured steganography technique using multi-level encryption algorithms. The adopted model is based on two levels of encryption: the Chaos encryption and the Caesar encryption techniques. At the beginning, the data is encrypted using the Caesar cipher. After the encryption, the LSB algorithm is applied to the RGB image. The data distribution is done sequentially, where the first three bits are replaced in the three LSB bits of the Red byte, the second three bits are replaced in the three LSB bits of Green, and the last two bits are replaced in the two LSB bits of Blue. The data, after applying LSB, is scattered into a Chaos cover-image that is divided into a two-dimensional array, and then the logical operator XOR is applied between the LSB bits and the array indexes.
Indrayani et al. [9] adopted a new mp3 audio steganography technique by combining steganography with the AES algorithm and the MD5 hash function. The designed model is divided into four core levels: encrypting the data using the AES algorithm, where the key that is used to encrypt the data is digested using the MD5 hash function. The encrypted data is then embedded into a cover-image, which represents the encoding process. Once the stego-image is received by the second party of the communication, the image is extracted and the ciphertext is decrypted using the same hashed key. The authors achieved a model highly secure against several types of active attacks.
Saini et al. [10] proposed a hybrid approach to image security. The image is encrypted using a modified AES algorithm and then embedded into a cover-image to generate a stego-object. A new version of the AES algorithm, "MAES", is presented, in which a new shift-row transformation is introduced. This transformation is done as follows: if the bit value located in the first row and first column of the initial matrix is even, then no shifting is applied. The other three rows are shifted with an offset value equal to the offset value of the common AES row-shifting transformation. The designed model achieves a high PSNR rate for all image sizes compared to the traditional AES algorithm.
The previous approaches focused on achieving a highly secure steganography model using various algorithms of encryption and encoding. In this paper, a new steganography algorithm is presented to achieve confidentiality and integrity in a high-performance secure model.

III. PROPOSED MODEL
This research paper introduces a hybrid steganography scheme that combines a new steganography algorithm approach with a hash function. The process of steganography is generally divided into the two stages of encoding and decoding. In this research, the proposed model has four main stages: new image addressing, text size hashing, encoding, and decoding.

A. New Image Addressing and Confusion Concept
The image addressing process starts by selecting a conditional image size that will be used for embedding the plaintext. The reason behind using a conditional image size is that the following formula must apply.

Image Size (IS) = Selected Image Size − 512 pixels    (1)

The intuition behind deducting 512 pixels is to store the needed variables, in which the 512 pixels are divided into two halves; the first half is reserved for the permutated encoded message size using the permutation formula as follows.

L = … + …,            if … mod 2 = 0    (2)
L = … + 256 − …,      if … mod 2 = 1

Where L is the Message Size Bits' Pixel Location Address.
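Setting aside the addressing formulas, the basic LSB substitution described in Section II-A can be sketched as follows. This is a generic illustration over a flat list of 8-bit pixel values, not the paper's exact scheme (which additionally scatters bits using the addressing formulas):

```python
def embed_lsb(pixels, message):
    """Embed message bytes into the least significant bits of a flat list
    of 8-bit pixel values (generic LSB substitution sketch)."""
    bits = [(byte >> i) & 1 for byte in message for i in range(7, -1, -1)]
    if len(bits) > len(pixels):
        raise ValueError("cover image too small for message")
    stego = list(pixels)
    for i, bit in enumerate(bits):
        stego[i] = (stego[i] & 0xFE) | bit  # overwrite only the LSB
    return stego

def extract_lsb(pixels, length):
    """Recover `length` bytes from the LSBs of the stego pixels."""
    out = []
    for i in range(length):
        byte = 0
        for bit in pixels[i * 8:(i + 1) * 8]:
            byte = (byte << 1) | (bit & 1)
        out.append(byte)
    return bytes(out)
```

Because only the lowest-weight bit of each pixel changes, every pixel value moves by at most 1, which is why LSB embedding has minimal visual effect on the cover image.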
The second half is reserved for the hashed message size, which will be discussed later. The permutation formula is based on the bit sequence state: if it is odd, then it is stored starting from the top of the encoded cover-image as shown in equation (3); and if the bit sequence is even, then it is stored at the bottom of the encoded cover-image as shown in equation (4).

X = (I + K + √(…)) mod IS    (3)
X = (MS + K + √(…) − I) mod IS    (4)
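The text-size hashing step, in which the message size is hashed with SHA-256 and then combined with the shared key via XOR (as described in the security analysis), can be sketched as below. The exact byte layout and key-expansion used by the model are assumptions here:

```python
import hashlib

def hash_message_size(size, shared_key):
    """Hash the message size with SHA-256, then XOR the 32-byte digest
    with the shared key (sketch; the model's exact encoding of the size
    and key handling are assumptions)."""
    digest = hashlib.sha256(str(size).encode()).digest()
    # Repeat the key to cover the digest length, then XOR byte-wise.
    key = (shared_key * (len(digest) // len(shared_key) + 1))[:len(digest)]
    return bytes(d ^ k for d, k in zip(digest, key))
```

Without the shared key, an attacker who recovers the stored value cannot easily reverse it, which is the basis of the integrity claim.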
PSNR = 10 log10 (Max² / MSE)    (6)

Where MSE is the mean square error between the processed and original cover image, and Max is the maximum intensity value used to represent each pixel in the image. If the obtained PSNR value is more than 30 dB, then the processed image quality is barely changed compared to the original. However, if the obtained PSNR value is less than 29 dB, there will be a visual degradation in image quality. Table 1 illustrates a comparison of PSNR values between the proposed model and the MAES algorithm [10]. The authors of the MAES algorithm also achieved high PSNR values by combining a steganography technique with an enhanced AES algorithm. The simulation results of MAES were accurate enough to compare against the proposed model's experimental results.
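Equation (6) with the MSE definition above can be computed directly; the sketch below takes images as flat pixel sequences and assumes 8-bit pixels (Max = 255):

```python
import math

def psnr(original, processed, max_intensity=255):
    """Compute PSNR (equation 6) from the mean square error (MSE)
    between the original and processed images, given as equal-length
    flat pixel sequences."""
    mse = sum((o - p) ** 2 for o, p in zip(original, processed)) / len(original)
    if mse == 0:
        return float("inf")  # identical images
    return 10 * math.log10(max_intensity ** 2 / mse)
```

For example, a stego-image whose pixels each differ from the cover by 1 gives MSE = 1 and PSNR ≈ 48 dB, well above the 30 dB threshold mentioned above.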
Fig. 4. (a) Cover-Image Histogram vs. (b) Stego-Image Histogram

The histogram analysis shows only a slight difference between the cover-images that are selected for text embedding and the resulting stego-images. This indicates that the user cannot recognize any difference between the two images' resolutions.

V. EXPERIMENTAL RESULTS AND ANALYSIS
The proposed model was measured using an Intel® Core™ i7-4580HQ 64-bit system with 8 GB RAM running on Windows 8.1.
Different image sizes (128x128, 256x256, and 512x512) were selected for the encoding process, as illustrated in Fig. 5.

Confidentiality was achieved by embedding the text bits into the LSB pixels of the cover-image using formulas that depend on a secret key. The text size was hashed with the SHA-256 algorithm, and the logical operator XOR was then applied to produce a hashed message size. This process achieves the concept of integrity. Using a robust hash function increased the difficulty of a brute-force attack, since the hashed message size is hard to calculate due to the XOR operation with the secret shared key.

B. Performance Analysis
The proposed model achieved high performance across a set of various metrics. The PSNR results discussed in Table 1 show that the proposed model performed better than the MAES algorithm. Moreover, the encoding and decoding processes were executed in the C# programming language, so the time consumed by the overall model is less than the time consumed when running the implementation in Matlab or other machine-learning platforms. Finally, the performance of the proposed model depends on the selected image size, as it was found that the stego-image size is approximately equal to the cover-image size.

VI. CONCLUSION AND FUTURE WORK
The huge amount of data transferred over public networks leads most information security engineers to adopt methods that allow them to transfer sensitive data in a secure fashion. Steganography is one of the most commonly used techniques, due to its ease of use and the high data security it can provide. In this paper, a combined approach for image security has been presented. The scattering and embedding processes of the texture data were achieved through a set of equations. The proposed model increased the difficulty for any intruder to alter the embedded sensitive data, which was proven through histogram analysis showing only a slight difference between the cover-images selected for text embedding and the resulting stego-images, as the concepts of confusion, permutation, and hashing were adopted in the discussed model. For future work, the selected image types should be expanded to include different types of images such as .tiff, .bmp, and .gif.

REFERENCES
Detecting network anomalies using machine
learning and SNMP-MIB dataset with IP group
Abdelrahman Manna, Mouhamad Alkasassbeh
Princess Sumaya University for Technology
manna.93@outlook.com, m.alkasassbeh@psut.edu.jo
Abstract—SNMP-MIB is a widely used approach that uses machine learning to classify data and obtain results, but using the huge SNMP-MIB dataset is not efficient; it is also time and resource consuming. In this paper, REP Tree, J48 (Decision Tree), and Random Forest classifiers were used to train a model that detects anomalous devices inside the network in order to predict the network attacks that affect the Internet Protocol (IP) group. This trained model can be used in the devices that detect anomalies, such as intrusion detection systems.

Keywords—Network attacks, SNMP, SNMP-MIB, Anomaly Detection, DOS.

I. INTRODUCTION
Nowadays, almost the entire world is connected via the internet, and the number of internet users is increasing day by day; every user has at least 1-2 devices, such as a laptop or a mobile phone.
As the number of users increases, the attacks on their devices also increase, especially the attacks that affect networks, which are called "network attacks".
One of the most widely used and well-known attacks is the denial of service (DOS) attack, which will be described in the coming section.
In this paper, a DOS attack is analyzed, as well as the attacks on the Internet Protocol (IP) group, which is a subset of the SNMP-MIB groups described in [1], where the authors showed different groups that are part of SNMP-MIB, including their different attacks and attack analyses. In this paper, only the IP group is taken and analyzed in order to work on its variables, show the effect of all variables together and their occurrence percentages, then eliminate the most irrelevant ones and concentrate on the most relevant ones that give the highest accuracy for the trained model. This enables the model to detect the network attacks and reduce the false negative rates, which helps in implementing the trained model in the devices that are responsible for detecting network attacks, such as intrusion detection systems.

A. Network Attacks
Network attacks is a term that describes the attacks that may occur and affect a computer network in general. These attacks have big effects on the connected nodes, as they might destroy the software installed on a connected node or prevent connections from reaching or leaving the node; this is also known as a denial of service (DOS) attack.
A denial of service attack can be described as an attack that affects the network to prevent access to network resources such as a server. This attack is considered dangerous because it prevents legitimate users from reaching the resources whenever they need them, especially if the resource holds sensitive and important information that needs to be reached immediately.

B. Simple Network Management Protocol (SNMP)
SNMP, introduced in the late 1980s [2], is an application-layer protocol that is used to control the functions of network nodes (devices) in order to change their information or change the devices' behaviours when needed. SNMP is supported by multiple devices such as routers, switches, servers, and more, and is included in the Internet Protocol (IP) suite.
SNMP collects the data that needs to be managed and manages it using a management information base (MIB) that describes the system configuration.

II. RELATED WORK
One of the current hot topics in network attacks is DOS; researchers focus on anomaly detection for the anomalies that exploit the network and behave badly in order to prevent the legitimate nodes from connecting to the network or from reaching sensitive and important information.
In [3], the classification and technical analyses of network intrusion detection systems were shown in detail, along with the aspects that must be taken into consideration when using Intrusion Detection Systems (IDS).
In [4] [5], the authors showed one of the most commonly used techniques for detecting nodes that may affect the network and result in a denial of service attack: using machine learning, by training a model and giving it a set of attacks with actual measures so that the model can detect the anomalies or attacks depending on the predefined datasets and results.
The authors in [5] discussed a machine learning technique for detecting anomalies that uses feature selection analysis, which takes the top or most frequently used attacks and objects and classifies them in a specific way that does not consume or exhaust the network resources, by enhancing the performance; however, there is a probability of having false negatives and false positives in the network.
In [6], the authors showed ways of detecting the Distributed Denial of Service (DDOS) attack, which is more dangerous than the regular denial of service because the attacks come from different locations; the authors used a dataset and
• REP Tree Algorithm Classifier: the REP Tree algorithm uses regression tree logic and creates multiple different trees in different iterations; after generating the trees, it chooses the best one of them, and this is considered the representative [1].

C. Feature Selection
Features are mainly used to reduce the computation time and to improve the performance of the trained model by minimizing the amount of data used; the feature selection strategy aims to remove the irrelevant fields to provide good results.

Feature Selection Methods
There are three methods for feature selection based on the evaluation criteria, which are Filter, Wrapper, and Hybrid, as defined by the authors in [9].
Filter methods are used as a step before processing. Feature selection is independent of any machine learning algorithm, so features are selected depending on their scores, which are calculated from previous steps and statistics.
Wrapper methods treat selecting a set of features as a search problem; this is done by combining different features together and then giving them a score according to the accuracy of the model.
Hybrid methods are a combination of several feature selection methods, such as filter and wrapper, used together to achieve the best results.

The true negative (TN) rate is the total number of negative traffic that is classified correctly as negative, while the false negative (FN) rate shows the total number of positive traffic that is classified incorrectly as negative.

Precision = TP / (TP + FP)    (1)
Recall = TP / (TP + FN)    (2)
F-Measure = 2 × (Precision × Recall) / (Precision + Recall)    (3)

Fig. 1 shows a description of the recall and precision concepts.
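The precision, recall, and F-measure metrics used to evaluate the classifiers can be computed directly from the TP/FP/FN counts (standard formulas, shown here as a brief sketch):

```python
def precision(tp, fp):
    """Fraction of traffic predicted positive that is truly positive."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Fraction of truly positive traffic that is predicted positive."""
    return tp / (tp + fn)

def f_measure(p, r):
    """Harmonic mean of precision p and recall r."""
    return 2 * p * r / (p + r)
```

An F-measure of 1 (as reported below for the bruteforce attack) requires both precision and recall to equal 1, i.e. no false positives or false negatives for that class.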
The true positive (TP) rate reflects the rate of correct predictions of positive traffic, while the false positive (FP) rate reflects the rate of negative packets that are considered positive traffic.

Accuracy: 99.98% / 99.88% / 99.98%

The F-Measure results for all of the IP group variables (V1, V2, V3, V4, V5, V6, V7 and V8) are shown in Fig. 2; it can be noticed that for the bruteforce attack the three used classifiers
gave 1, which means that their accuracy for this attack is 100%, while they differ in the other attacks.
The results shown in Fig. 3 represent selecting the top 5 variables, which are (V1, V4, V5, V6, and V8).
The results shown in Fig. 4 represent selecting the top 3 variables, which are (V1, V4, and V5).

Fig. 4: F-Measure for Top 3 variables - InfoGain Attribute Evaluator

It can be noticed that the bruteforce attack accuracy was reduced in comparison with the above results, but the udp-flood and slowpost attacks were still 100% accurate. This means that removing more irrelevant variables or reducing the training set size does not necessarily yield more accuracy, because in this experiment the top five variables gave more accuracy than selecting the top 3.

V. CONCLUSION
In this paper, SNMP-MIB data were used to detect DOS attack anomalies that may affect the network. Three machine learning algorithms were used to classify the data: Random Forest, J48 (Decision Tree), and REP Tree. Two attribute evaluators were used to remove the irrelevant variables and get the top 5 and top 3 variables; the two attribute evaluators are InfoGain and ReliefF. The classifiers and
attributes were applied on the IP group, and the results showed that applying the REP Tree algorithm classifier gave the highest accuracy every time: on the full IP group, the top 5, and the top 3.

VI. REFERENCES
[1] M. Al-Kasassbeh, G. Al-Naymat and E. Al-Hawari, "Towards generating realistic SNMP-MIB dataset for network anomaly detection," International Journal of Computer Science and Information Security, vol. 14, pp. 1162–1185, 2016.
[2] J. Schönwälder, A. Pras, M. Harvan, J. Schippers and R. van de Meent, "SNMP Traffic Analysis: Approaches, Tools, and First Results," in 10th IFIP/IEEE International Symposium on Integrated Network Management, 2007.
[3] N. Nanda and A. Parikh, "Classification and Technical Analysis of Network Intrusion Detection Systems," International Journal of Advanced Research in Computer Science, vol. 8, 2017.
[4] M. Alkasassbeh, G. Al-Naymat and E. Hawari, "Using machine learning methods for detecting network anomalies within SNMP-MIB dataset," International Journal of Wireless and Mobile Computing, 2018.
[5] M. Almseidin, M. Al-kasassbeh and S. Kovacs, "Fuzzy Rule Interpolation and SNMP-MIB for Emerging Network Abnormality," International Journal on Advanced Science, Engineering and Information Technology, vol. 9, no. 3, 2019.
[6] S. Aljawarneh, M. Aldwairi and M. BaniYassein, "Anomaly-based intrusion detection system through feature selection analysis and building hybrid efficient model," Journal of Computational Science, vol. 25, pp. 152-160, 2018.
[7] M. Alkasassbeh, G. Al-Naymat, A. Hassanat and M. Almseidin, "Detecting Distributed Denial of Service Attacks Using Data Mining Techniques," International Journal of Advanced Computer Science and Applications, vol. 7, no. 1, 2016.
[8] M. Belavagi and B. Muniyal, "Performance Evaluation of Supervised Machine Learning Algorithms for Intrusion Detection," in Twelfth International Multi-Conference on Information Processing, 2016.
[9] B. Cui-Mei, "Intrusion Detection Based on One-class SVM and SNMP MIB data," in 2009 Fifth International Conference on Information Assurance and Security, 2009.
[10] G. Chandrashekar and F. Sahin, "A survey on feature selection methods," Computers & Electrical Engineering, 2014.
Enhancing Data Protection Provided by VPN
Connections over Open WiFi Networks
Ashraf Karaymeh
KPMG
akaraymeh@kpmg.com, ashrafkaraimeh@gmail.com

Mohammad Ababneh
King Hussein School of Computing Sciences
Princess Sumaya University for Technology, Amman, Jordan
m.ababneh@psut.edu.jo
Abstract—Open Wi-Fi networks are a serious challenge to sensitive and private data because it is hard to know who else is using the network and monitoring traffic. Such open, free and unencrypted networks might allow an adversary to hack devices connected to them, making the use of such networks highly risky and harmful. In order to use these public networks securely, it is recommended to use a VPN in tunneling mode to ensure that the data is encrypted during transmission. But this is not enough, as most of today's smart devices and laptops run applications that might start communicating with their servers before this VPN has been established. In this work, we solve this problem by creating a device that enables users to access the internet securely over public Wi-Fi networks and provides security right from the beginning when deployed between the public Wi-Fi and the user's personal devices. Experiments show the security advantages of our solution.

Keywords—open Wi-Fi network security, Raspberry Pi, OpenVPN.

I. INTRODUCTION
It has become very common these days for employees to work remotely outside their organization's premises. A recent survey published by Forrester Consulting on the Citrix website claims that 65% of the respondents have worked remotely at least one day per week, and 37% said that they worked two or more days per week [1]. In order to get access to their work data servers, they need to establish internet connections through the gateways of the places they are connecting from. These places could be hotels, coffee shops, restaurants, airports, etc.

Allowing employees to work remotely is a high risk to organizations' sensitive data. Most big companies use and require VPN technology in order to allow their employees to remotely access and exchange sensitive data [2]. However, in order to establish a VPN connection, someone must first connect to the available internet gateway and wait a few minutes until the VPN connection becomes fully running, leaving his device vulnerable to various types of attacks, especially if he is connecting through an open Wi-Fi network [3].

In addition to open WiFi networks, there is the danger of rogue WiFi networks, where a hacker masquerades a network SSID in popular places and tricks people into connecting to him rather than to the genuine hot spot. This enables the attacker to monitor traffic, possibly infect the victim's devices with malware, possibly take control of the devices, and maybe execute Man-in-the-Middle (MITM) attacks [4]. This is also sometimes called an Evil-Twin attack; it mostly targets hot spots left unattended for a long period [5].

The proven solution for providing an additional layer of security when using public and open networks is establishing a Virtual Private Network (VPN) tunnel. This ensures that all traffic is encrypted before transmission. However, until the VPN is established, the system remains exposed to vulnerabilities. Some people think that they are under VPN protection just because they turned on the VPN connection in their browser or entered their credentials into a VPN client authentication window. Most applications on modern devices need to connect to their servers automatically at startup for various reasons, such as looking for updates or receiving emails and messages as with WhatsApp or Facebook, or even updates to the OS itself, as soon as they see an established Internet connection. Hackers take advantage of this behavior by monitoring traffic and acquiring important information about the device, and may even succeed in sending malware to the device in the few minutes before the VPN connection is established [6].

Some solutions try to mitigate this problem by installing a VPN application or a VPN browser on the user's device. But these solutions only work on certain operating systems and still need to be connected to the internet before establishing the VPN tunnel, which brings us back to square one.

In our work, we present a solution to the problem in the form of an affordable device of our own design that is deployed between the open WiFi and the user's device and is capable of prohibiting any communication from the user's device until the VPN is established. In our solution, we first enforce the establishment of the VPN tunnel; then we allow the encrypted data to be transmitted through the open Wi-Fi.
TABLE I. COMPARISON OF EXISTING SOLUTIONS
No. | Solution | Ease of use | Support … Wi-Fi | … with other solutions | …
1 | OpenVPN tunnel without firewall | X | √ | √ | X
2 | OpenVPN with firewall | X | √ | √ | X
3 | EncryptMe | √ | √ | √ | X
4 | Hotspot 2.0 | √ | √ | X | X

Fig. 1. OpenVPN without firewall
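The core behaviour of the proposed device, prohibiting all traffic from the user's device until the VPN tunnel is established and only then forwarding the encrypted data, can be sketched abstractly as a small gate. This is an illustrative model only; the actual device enforces the policy with firewall rules rather than application code, and the class and method names below are invented for the sketch.

```python
# Illustrative model of the device's enforce-then-allow policy:
# every packet from the user's device is dropped until the VPN
# tunnel is established (names are hypothetical, not the device's code).
class VpnGate:
    def __init__(self):
        self.tunnel_up = False

    def establish_tunnel(self):
        self.tunnel_up = True

    def forward(self, packet):
        # Before the tunnel exists, nothing leaves the user's device.
        if not self.tunnel_up:
            return "DROP"
        return "FORWARD(encrypted)"

gate = VpnGate()
print(gate.forward("whatsapp-sync"))   # DROP: tunnel not yet established
gate.establish_tunnel()
print(gate.forward("whatsapp-sync"))   # FORWARD(encrypted)
```

This is exactly the gap the existing solutions in Table I leave open: they encrypt traffic once the VPN is up, but do not block the application chatter that happens before it.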
VPN tunnel, which exposes the device to hackers' monitoring or attacks.

E. Research gap
Any solution other than the previous ones should work on all platforms and on all Wi-Fi networks regardless of band or mechanism, must be easy to use by a person with no technical experience and, most importantly, does not require
• sudo ifconfig wlan0 192.168.1.1 # assign a static IP to the embedded wireless adaptor wlan0

3) Configure Hostapd and SSID:
The next step is to configure hostapd and to assign the SSID for the user's side Wi-Fi network "SecureNet", along with the security features needed to secure the connection between the user's devices and wlan0.
• interface=wlan0
• driver=nl80211
• ssid=SecureNet
• hw_mode=g
• channel=6
• macaddr_acl=0
• auth_algs=1
• ignore_broadcast_ssid=0
• wpa=2
• wpa_passphrase=****** # A password for wlan0
• wpa_key_mgmt=WPA-PSK
• #wpa_pairwise=TKIP # Better not to use this weak encryption (only used by old client devices)
• rsn_pairwise=CCMP

4) Configure IPTables:
The final step is to insert an "iptables" rule to allow NAT using the following:
a) Enable IP forwarding in the kernel:
• sudo sh -c "echo 1 > /proc/sys/net/ipv4/ip_forward"
b) Enable NAT in the kernel:
• sudo iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
• sudo iptables -A FORWARD -i eth0 -o wlan0 -m state --state RELATED,ESTABLISHED -j ACCEPT
• sudo iptables -A FORWARD -i wlan0 -o eth0 -j ACCEPT
c) Make these changes permanent:
• sudo sh -c "iptables-save > /etc/iptables.ipv4.nat"

F. The second Wi-Fi Adaptor
The second Wi-Fi adaptor is used to establish the connection from the Raspberry Pi to the open Wi-Fi network. This connection is configured as "wlan1" on the Raspberry Pi. Its configuration is the same as that of the previous adaptor, except that it has to be a DHCP connection in order to be able to acquire its IP address from the public Wi-Fi.

G. The VPN server
The VPN server is established using OpenVPN as follows:
1) Install OpenVPN [19]
• sudo apt-get install openvpn
• cp -r /usr/share/doc/openvpn/examples/easy-rsa/2.0 /etc/openvpn/easy-rsa
2) Configure the RSA file
• nano /etc/openvpn/easy-rsa/vars
• export EASY_RSA="/etc/openvpn/easy-rsa"
• export KEY_SIZE=2048
3) Create the OpenVPN Client File
4) Activate the port forwarding feature on the main router.
The easiest way to do this is to create a file associated with that IP address of the client, mainly because we are using a static IP here. If more than one device is going to connect as an OpenVPN client, then we need to create a client file for each one of them and change the static IP for each correspondingly. By finishing these steps we conclude the implementation of our device. "Fig. 5" depicts the layout of the network and shows where the tunnel is created. Some additional steps were taken to improve the security of our device, such as: operating only in WPA2 mode, reducing the signal strength, hiding the SSID and enabling MAC-address filtering.

Fig. 5. Network layout with the solution and components

V. SOLUTION TESTING
To prove that our device has improved the security of VPN-based connections, we have tested it in two ways. A vulnerability scan using Nessus was conducted to see whether the solution has increased or reduced the number and type of vulnerabilities found by the scanner [20]. Then we used Wireshark to see if our solution has helped with the encryption of the data (any data) from the beginning.

A. NESSUS Vulnerability Scan
Nessus is a vulnerability assessment tool that scans the network for open ports, services and programs. Its aim is to find the weaknesses and flaws that can be exploited. The scan was conducted in three stages, as illustrated in the following sections.

1) Access Point Vulnerability Scan [Stage (1)]:
The first stage is to scan the access point itself to see what vulnerabilities exist between the user device and the access point. For this experiment, a 4G router (Huawei E5377) was used as the access point and a regular HP laptop as the user's device. "Table II" shows the scan results, which were divided into 5 classifications: Critical, High, Medium, Low, and Info. Info is the lowest rating, and Critical is the highest and most dangerous, needing to be fixed immediately.

TABLE II. NESSUS SCAN RESULTS
Stage | CRITICAL | HIGH | MEDIUM | LOW | INFO
1 | 0 | 0 | 4 | 2 | 19
2 | 0 | 0 | 1 | 1 | 15
3 | 0 | 0 | 1 | 1 | 14

2) Access point + Raspberry Pi without VPN connection Vulnerability Scan [Stage (2)]:
The second stage is to run the vulnerability scan on the same laptop. But this time the laptop is connected to our solution device. The VPN is turned off so that, in this case, we can find
the vulnerabilities of the device itself. "Table II" also shows the number of vulnerabilities found after performing the stage 2 vulnerability scan. By using only the device, the number of medium vulnerabilities has been lowered from four to one, while the low vulnerabilities have become only one. The number of INFO vulnerabilities has decreased from 19 to 15.

3) Access point + Raspberry Pi with the VPN connection Vulnerability Scan [Stage (3)]:
The third stage is to run the vulnerability scan after turning on the VPN tunnel. This stage enabled us to find vulnerabilities from the secure connection towards the internet. "Table II" shows the number of vulnerabilities found in the secure part of the network. We can see that running the VPN has managed to reduce only the number of INFO vulnerabilities.

4) Analysis of the vulnerabilities found in the three stages
A comparison of the three stages' vulnerability reports was conducted, and "Annex 1" depicts these vulnerabilities. It is clear that the device has reduced the number of medium-class vulnerabilities to only one vulnerability (50686 - IP Forwarding Enabled), which is vital for the laptop being used to execute the functions of this experiment. There was also one low vulnerability found on the device (10663 - DHCP Server Detection), which can be neglected since the only reason for having a DHCP server is for the purpose of the demo in this project. Once the demo is completed, the DHCP server will be removed and the system will work only on static IPs.

As for Info vulnerabilities, these are informational and have no risk or impact on the security of the project. The number of shared INFO vulnerabilities was reduced from 15 to 9, and six new INFO-class vulnerabilities were found in stages 2 and 3. These new vulnerabilities appeared due to the configuration and installation on the Raspberry device to initiate the SSH server, which is very vital to the system and cannot be avoided. The level of these vulnerabilities is INFO and is not considered risky.

B. Using Wireshark
We used Wireshark to sniff traffic from the network and watch packets being transmitted or received. Again, the experiment was executed in three stages, just like the vulnerability scan.

In stages one and two, Wireshark was able to monitor the traffic transmitted and received beyond the access point. However, in the second stage the device being used to surf the web was connected to the Raspberry device, but without having the VPN tunnel initiated. It can be seen that, once the VPN tunnel is initiated, Wireshark could not see anything being transmitted. On the receiving part, it could only see the IP address of the VPN provider, but not the data itself, as in "Fig 6". This proves the effectiveness of our solution.

Fig. 6. Wireshark screenshot

VI. CONCLUSION
We created an intermediary device that can be used to help users connect to the Internet securely over open Wi-Fi networks. Our experiments showed that the device achieves good results, improving security and filling the gap of no protection before the establishment of the VPN tunnel. The device is easy to use and affordable.

REFERENCES
[1] Forrester Consulting, [Online]. Available: https://www.citrix.com/content/dam/citrix/en_us/documents/oth/maximize-productivity-and-security-with-mobile-workspaces.pdf.
[2] O. Elkeelany, M. M. Matalgah and J. Qaddour, "Remote access virtual private network architecture for high-speed wireless internet users," Wireless Communications and Mobile Computing, vol. 1, no. 4, p. 567, 2004.
[3] IBM, IBM Security Virtual Private Network V7.2, Rochester: IBM i, 2013.
[4] S. Shetty, M. Song and L. Ma, "Rogue Access Point Detection by Analyzing Network Traffic Characteristics," 1 June 2007. [Online]. Available: https://pdfs.semanticscholar.org/384b/54dd72c7f7418d77d70b987d2cfa2c1da4c5.pdf. [Accessed 1 January 2018].
[5] Z. Tang, Y. Zhao, L. Yang, S. Qi and D. Fang, "Exploiting Wireless Received Signal Strength Indicators to Detect Evil-Twin Attacks in Smart Homes," Mobile Information Systems, vol. 2017, Article ID 1248578, pp. 1-14, 2017.
[6] P. S. Ambavkar, P. U. Patil and P. K. Swamy, "Exploitation of WPA Authentication," IOSR Journal of Engineering, vol. 2, no. 2, pp. 320-324, 2012.
[7] "Best VPN," [Online]. Available: https://www.bestvpn.com/vpn-encryption-the-complete-guide/. [Accessed 23 December 2017].
[8] C. Rubin, "Is public Wi-Fi safe?," Entrepreneur, vol. 44, no. 11, p. 56, 2016.
[9] "Restricting uTorrent to VPN interfaces," IPredator, [Online]. Available: https://blog.ipredator.se/howto/restricting-utorrent-to-vpn-interfaces-part-1.html. [Accessed 1 January 2018].
[10] Encrypt.me, [Online]. Available: https://encrypt.me/. [Accessed 1 January 2018].
[11] Wi-Fi Alliance, "Wi-Fi CERTIFIED Passpoint," [Online]. Available: https://www.wi-fi.org/discover-wi-fi/wi-fi-certified-passpoint. [Accessed 31 December 2017].
[12] C. Hoffmann, "How-To Geek," 8 December 2014. [Online]. Available: https://www.howtogeek.com/204335/warning-encrypted-wpa2-wi-fi-networks-are-still-vulnerable-to-snooping/.
[13] "Raspberry Pi," Wikipedia, [Online]. Available: https://en.wikipedia.org/wiki/Raspberry_Pi. [Accessed 22 December 2017].
[14] "Raspberry Pi," [Online]. Available: RaspberryPi.org. [Accessed 2 January 2018].
[15] A. Skendzic and B. Kovacic, "Open source system OpenVPN in a function of Virtual Private Network," in IOP Conference Series: Materials Science and Engineering, Belgrade, 2017.
[16] "hostapd," [Online]. Available: https://w1.fi/hostapd/. [Accessed 22 December 2017].
[17] "RPI Wireless Hotspot," eLinux, [Online]. Available: https://elinux.org/RPI-Wireless-Hotspot. [Accessed 22 December 2017].
[18] A. Skendzic and B. Kovacic, "Open source system OpenVPN in a function of Virtual Private Network," in IOP Conference Series: Materials Science and Engineering, Belgrade, 2017.
[19] Raspberry Pi Forums, [Online]. Available: https://www.raspberrypi.org/forums/viewtopic.php?t=81657.
[20] L. Harrison, R. Spahn, M. Iannacone, E. Downing and J. R. Goodall, "NV: Nessus Vulnerability Visualization for the Web," in VizSec '12: Proceedings of the Ninth International Symposium on Visualization for Cyber Security, Seattle, Washington, USA, 2012.

Annex 1: A Detailed Comparison Between The Three Stages of The Vulnerability Reports
A Proactive Design to Detect Denial of Service
Attacks Using SNMP-MIB ICMP Variables
Yousef Khaled Shaheen
Department of Computer Science
Princess Sumaya University for Technology
Amman, Jordan
yousefpsut@icloud.com

Dr. Mohammad Al Kasassbeh
Department of Computer Science
Princess Sumaya University for Technology
Amman, Jordan
m.alkasassbeh@psut.edu.jo
Abstract—One of the cyber-attacks that most attracts cyber criminals is the Denial of Service (DOS) attack. A DOS attack aims to degrade the performance of network appliances so that they cannot carry out their intended functions. Moreover, DOS attacks can cause huge damage to data confidentiality, integrity and availability. This paper introduces a system that monitors network traffic and distinguishes DOS attacks from normal traffic based on an adopted dataset. The results show that the adopted algorithms with the ICMP variables achieved a high accuracy of approximately 99.6% in detecting the ICMP Echo attack, HTTP Flood attack, and Slowloris attack. Moreover, the designed model succeeded with a rate of 100% in distinguishing normal traffic from various DOS attacks.

Keywords—Cyber-attacks, availability, DOS attack, ICMP variables, Meta, Lazy IBK, Bayes, RJ48, rule tree.

I. INTRODUCTION
The wide use of the internet and the rapid growth of communication and computer networks have increased cybercriminals' activities in attacking these networks and causing catastrophic damage to them. Network security attacks vary based on their effect on the network and the financial losses that they may cost the organization. The DOS attack is listed as one of the easiest attacks to launch, with a huge impact on network assets, costing organizations heavy losses. Much exhaustive research has been done on the financial losses that a DOS attack can cause. The Ponemon Institute reported that the average losses for 641 individuals were approximately equal to $1.5 million over the year 2015, divided into five categories (Revenue Losses, Technical Support Costs, Operations Disruption, Lost User Productivity, and Damage to Information Technology Assets) [1]. Thus, many organizations aim to protect their networks from the several attacks that can cost them heavy losses using different network security services. One of the commonly used security services is the Intrusion Detection System (IDS), a security model designed to detect abnormal and malicious traffic in real time or close to it. An IDS is an effective security service against DOS attacks. The idea behind a DOS attack is to prevent a system from doing its intended functions and to prevent authorized users from accessing the system resources by injecting a flood of data toward a specific target system. DOS attacks can be categorized into two main techniques: exploiting vulnerabilities in the network servers, appliances and protocols, or exploiting a huge amount of spoofed source addresses. This paper introduces a new model to detect various DOS attacks by using a set of ICMP variables and an adopted dataset of these attacks. A set of algorithms such as Meta, Lazy IBK, Bayes, RJ48 and Rule-Based were adopted to find which one of these algorithms is the most effective in detecting network anomalies.

This paper is organized as follows: Section II provides several related works in the area of using machine learning in detecting network anomalies, while the DOS attacks and the SNMP-MIB dataset are illustrated in Section III. The proposed model used in this contribution is discussed in Section IV. Section V discusses the experimental results of the adopted methodology. Finally, the conclusion of the provided model and future work are discussed in Section VI.

II. RELATED WORK
Most of the current research focuses on detecting different network attacks using machine learning techniques. Many of these techniques have been introduced, tested, and evaluated. One of the most used techniques in detecting and analyzing network anomalies is SNMP-MIB data.

Al-Kasassbeh et al. [2] generated effective datasets that solved the limited resources in the previous datasets. The authors adopted a reliable SNMP-MIB dataset to investigate the SNMP for network attack and anomaly detection. The authors collected SNMP-MIB data based on a set of Brute-Force and DOS attacks. The collected dataset is a reliable published dataset and consists of 4998 records, where each record is mapped to 34 MIB variables. The MIB groups are categorized as follows: TCP, UDP, IP, ICMP and Interface.

Al-Kasassbeh et al. [3] adopted a reliable method of detecting network attacks and anomalies based on the SNMP-MIB dataset using machine learning techniques. They proved that SNMP-MIB is an effective technique for detecting a large set of various DOS attacks using three algorithm categories: Random Forest, AdaboostMI, and MLP. The mentioned algorithms were applied to several MIB groups (TCP, UDP, IP, ICMP, and Interface). The classification algorithms achieved varied accuracy based on the group. The Random Forest algorithm achieved a high accuracy when it was applied to the IP group, with a rate of 100%, and 99.93% when it was applied to the Interface group.

Al-Kasassbeh [4] proposed a new hybrid approach to capture and detect malicious traffic based on the collected dataset, which is applied as an input to a Neural Network in order to predict the behaviour of the input data. The proposed model achieved a high accuracy with a rate of
ICMP Echo attack depends mainly on the ping flood using echo request packets. The ease of using this attack, and the reason behind considering it a traditional attack, is that the ICMP protocol is useful in network diagnostics, which leads most network admins to control and restrict this protocol using different network security appliances such as intrusion detection systems, intrusion prevention systems or firewalls. However, this protocol is also critical to some networks, such as TCP/IP networks. Intruders in this attack generate a huge volume of ICMP packets toward the victim server, which saturates the link bandwidth and makes other users face difficulties in reaching the victim server.

Table 1 classifies the dataset records according to the related attacks.

TABLE I. DATASET RECORDS ACCORDING TO RELATED ATTACKS
No. | Traffic Label | Traffic Count
1 | Normal | 600
2 | ICMP-Echo Attack | 632
3 | TCP-SYN Attack | 960
4 | UDP Flood Attack | 773
5 | HTTP Flood Attack | 573
6 | Slowloris Attack | 780
7 | Slowpost Attack | 480
8 | Brute Force Attack | 200

B. Simple Network Management Protocol (SNMP)
SNMP is an application layer protocol that allows the user to monitor, analyse and manage network traffic. The SNMP protocol comes in three versions that vary in features: SNMPv1 and SNMPv2 are known as SNMP community, while SNMPv3 is known as SNMP security; the only difference between these versions is that SNMPv3 is designed with advanced security features. Fig.1 illustrates the network management architecture.

Fig.1. Network Management Architecture [9]

Fig.1 shows that the SNMP network model is divided into two main subsystems: the SNMP Manager and the SNMP Agent. The SNMP Manager is a personal computer that is designed and configured to pull the data from the SNMP Agent. The SNMP Manager is designed to provide a solution for a set of fault categories such as fault monitoring, performance monitoring, configuration control and security control.

The SNMP Agent plays the main role in the network management model by collecting the required data from the network and storing it in a database called the "Management Information Base". The SNMP Agent is embedded on the required device, where it responds to and exchanges the requests and actions from the SNMP Manager using the SNMP protocol.

IV. PROPOSED MODEL
This part is divided into three sections, starting with a brief description of the used dataset. The second section provides a full explanation of the machine learning classifiers used to classify the dataset and decide whether traffic is normal or an attack. The last part provides a summary of the feature selection techniques that are used in the model to evaluate the efficiency of applying these features on the ICMP variables.

A. SNMP-MIB Data
In this research paper, the (Al-Kasassbeh et al. 2016) SNMP-MIB dataset was used for testing and implementing this paper's approach. The dataset was built from almost 5000 records related to six main types of attacks (ICMP Echo, TCP-SYN, UDP flood, HTTP flood, Slowpost, and Slowloris). The set of attacks were detected using a set of variables that are included in the dataset. The traffic prediction will be based on the ICMP group. Most of the network traffic deals with the ICMP protocol to ensure the best packet delivery by comparing the number of sent and received packets. Six MIB variables were selected for this group as follows:
• The icmpOutMsgs (iOM) variable indicates the total count of attempted ICMP sent messages.
• The icmpInMsgs (iIM) variable is an indicator of the total number of ICMP received messages.
• The icmpOutDestUnreachs (iOU) variable indicates the total amount of ICMP destination-unreachable messages sent.
• The icmpInDestUnreachs (iIU) variable is an indicator of the total count of ICMP destination-unreachable messages received.
• The icmpInEchos (iIE) variable indicates the total number of ICMP echo request packets received.
• The icmpOutEchos (iOE) variable indicates the total number of ICMP echo reply packets sent.

B. Machine Learning Classifiers
The idea behind using classifiers in a network anomaly detection system is to analyze and classify the corresponding traffic. In this paper, five classifiers were applied on the adopted dataset as follows.
• The Meta Bagging classifier was presented by Efron and Tibshirani. Bagging is a meta bootstrap algorithm that trains every single classifier on random samples of the original dataset to generate and form a final prediction. The bagging classifier is divided into two categories based on how the dataset subsets are drawn: if the subsets are drawn randomly without replacement, it is called pasting, while if the subsets are drawn with replacement, it is called bagging.
• The lazy classifier is known as an algorithm or a system that trains on and generalizes the records in the dataset only after the system receives queries. The Lazy IBK classifier is applied on the adopted dataset, since it has proved its efficiency when applied to large datasets with various attributes.
• The J48 classifier is an implementation branch of the tree classifier family that is also called C4.5. The J48 algorithm was introduced and developed by Ross Quinlan. The process of attribute selection is done over top-down induction of decision trees, which then uses key concepts of information theory in order to select the best attribute.
• The rule-based classifier is one of the most commonly used algorithms in artificial intelligence, due to its highly accurate results. The role of this classifier is to use a set of rules in order to generate several choices. Rule-based classifiers fall into two characteristics: the mutually exclusive rule, where each record in the dataset is covered by at most one rule, and the exhaustive rule, where each record in the dataset is covered by at least one rule.
• The Bayes classifier is also known as Naïve Bayes; this classifier was developed by Thomas Bayes. The role of this classifier is conditional probability, which is the probability of something happening given that something else has already occurred. The Bayes classifier computes the probabilities for each class of the dataset, where the class with the highest probability is the predicted class.

C. Feature Selection Techniques
• … variables. The strength of this technique is obvious when applying it on large datasets.
• The wrapper technique looks through the feature space and uses the algorithm to find the best attribute set. The searching method of the wrapper technique can be in several directions (forward, backwards, or bidirectional). The strength of the wrapper technique relates to its efficient results because of the complexity of this method, as it participates in the selection process.
• The hybrid approach combines both the filter and the wrapper techniques, which results in a complex feature selection technique.

The filter and the wrapper techniques were used in order to compare the accuracy of the generated results, where for the filter technique two methods were selected, InfoGain and ReliefF, and for the wrapper technique a correlation-based method was selected.

InfoGain, ReliefF, and Correlation-based are attribute evaluators that are used in the WEKA machine learning tool. InfoGain finds out the most useful attribute for discriminating between the various classes to be used. Moreover, InfoGain determines the best split to be chosen; the more accurate split is the one that has a higher value.

The ReliefF attribute evaluator is an effective method of attribute ranking. The rule behind selecting the more important attribute is based on the algorithm output: a more positive number means a more important attribute, where the output is a number that varies between -1 and 1. The attribute weight is continuously updated through the process. Three samples are selected and recognized respectively: a selected sample from the dataset, the closest neighbouring sample that belongs to the same class in the dataset, and the closest neighbouring sample in a different class in the dataset. The attribute weight is affected by any change that can be made to any attribute value which could also be responsible for the class change.

The correlation-based evaluator is based on finding the correlation between two related features by evaluating the correlation coefficient. An attribute can be redundant either by being derived from another set of attributes or by being related to some other attributes. So, to consider an attribute a good attribute, it should be highly correlated to the class attribute and not highly correlated to any other attributes. Table 2 shows the ICMP variable ranks when they were applied under the attribute selection factors.

TABLE II. ICMP VARIABLE RANKS UNDER THE ATTRIBUTE SELECTION FACTORS

The search method in this paper was done by using the ranker searching method, which ranked the attributes based on their evaluations from the highest importance to the lowest one.

D. Evaluation Metrics
The performance of the proposed model was measured using a set of well-known parameters such as accuracy,
precision, and recall. The classifiers performance was TABLE IV. CLASSIFIERS ACCURACY FACTORS AVERAGE WEIGHT
measured based on the confusion matrix as follows Accuracy Factors Average Weight
Classifiers TP FP Precision Recall F-
TABLE III. CONFUSION MATRIX Rate Rate Measure
Predicted Class Bayes 0.864 0.014 0.935 0.864 0.879
Lazy-IBK 0.867 0.026 0.895 0.867 0.872
Actual Class Positive Negative
Met -Bagging 0.871 0.029 0.906 0.871 0.874
Positive TP FP Rules-Based 0.867 0.026 0.895 0.867 0.872
Negative FN TN RJ.48 0.868 0.026 0.896 0.868 0.872
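The conditional-probability idea behind the Bayes classifier can be sketched as below. This is a minimal illustration with add-one smoothing; the toy traffic records and feature names are invented for the example and are not the paper's dataset.

```python
from collections import Counter, defaultdict

def train_naive_bayes(records):
    """Estimate class counts and per-feature value counts from (features, label) pairs."""
    class_counts = Counter(label for _, label in records)
    feat_counts = defaultdict(Counter)  # (feature index, class) -> value counts
    for feats, label in records:
        for i, v in enumerate(feats):
            feat_counts[(i, label)][v] += 1
    return class_counts, feat_counts

def classify(feats, class_counts, feat_counts):
    """Return the class with the highest posterior probability (add-one smoothing)."""
    total = sum(class_counts.values())
    best, best_p = None, -1.0
    for label, count in class_counts.items():
        p = count / total  # prior P(class)
        for i, v in enumerate(feats):
            counter = feat_counts[(i, label)]
            # smoothed estimate of P(feature value | class)
            p *= (counter[v] + 1) / (sum(counter.values()) + len(counter) + 1)
        if p > best_p:
            best, best_p = label, p
    return best

# Hypothetical traffic records: (protocol, packet-size band) -> label
data = [(("icmp", "small"), "normal"), (("icmp", "large"), "attack"),
        (("udp", "large"), "attack"), (("icmp", "small"), "normal")]
cc, fc = train_naive_bayes(data)
print(classify(("icmp", "small"), cc, fc))  # most probable class for the query
```

The class whose prior multiplied by the smoothed conditional probabilities is largest wins, which is exactly the "highest probability" rule described above.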
The true positive (TP) rate indicates the rate of correct predictions of the positive traffic instances. The false positive (FP) rate indicates the proportion of negative packets that were classified as positive. The true negative (TN) rate indicates the total number of negative traffic instances correctly classified as negative, whereas the false negative (FN) rate shows the total number of positive traffic instances incorrectly classified as negative.

The precision rate represents the ratio of the total correct predictions of positive traffic instances to the total count of relevant and irrelevant traffic instances. The recall rate represents the ratio of the correct predictions of positive traffic instances to the total count of relevant traffic instances.

Finally, the accuracy rate takes all confusion matrix parameters into its calculation to measure the correctly classified traffic instances. The precision, recall, and accuracy formulas are shown below, respectively:

    Precision = TP / (TP + FP)                    (1)
    Recall = TP / (TP + FN)                       (2)
    Accuracy = (TP + TN) / (TP + TN + FP + FN)    (3)

Fig. 2 illustrates the performance of the classifiers that were used in the proposed model in terms of F-Measure rates based on the ICMP variables.

Fig. 2. F-Measure Results of All ICMP Group

From Fig. 2 it was found that the F-Measure values of all classifiers are efficient for normal traffic, the HTTP flood attack, and the slowloris attack. Moreover, the Meta-Bagging classifier achieved a high performance in identifying the UDP flood attack.
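The three metrics can be computed directly from the confusion-matrix counts; the counts below are made-up numbers used only to exercise the formulas.

```python
def metrics(tp, fp, fn, tn):
    """Precision, recall and accuracy from confusion-matrix counts,
    following the standard definitions of equations (1)-(3)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return precision, recall, accuracy

# Illustrative counts for a two-class traffic split (not from the paper).
p, r, a = metrics(tp=86, fp=3, fn=14, tn=97)
print(round(p, 3), round(r, 3), round(a, 3))  # → 0.966 0.86 0.915
```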
66
Fig. 4. F-Measure Results with Top 3 ICMP Variables – ReliefF Evaluator

From Fig. 4 it was found that all classifiers achieved a high F-Measure rate for normal traffic, the HTTP flood attack, the ICMP Echo attack, and the slowloris attack. However, all classifiers were not efficient in detecting the remaining types of attacks.

VI. CONCLUSION

Data filtering has become essential to protect local and remote networks from different types of attacks that harm sensitive data and cost organizations heavy losses. Thus, many methods have been introduced to detect network anomalies in order to keep the network running normally without any disturbance or data disruption. In this paper, it was found that the ICMP group with the adopted classifiers was not efficient in detecting all DOS attacks. Moreover, reduced sets of ICMP variables varied in their performance when detecting these attacks. However, the designed model achieved an efficient performance in detecting some attacks, such as the ICMP Echo attack, the HTTP flood attack, and the slowloris attack.

For future work, an enhancement of the ICMP variables should be applied in order to increase their ability to detect all types of DOS attacks.
VII. REFERENCES
An Energy Aware Fuzzy Trust based Clustering
with group key Management in MANET
Multicasting
1st Dr. Gomathi Krishnasamy
Department of Computer Information Systems
Imam Abdulrahman Bin Faisal University
Dammam, Saudi Arabia
gkrishna@iau.edu.sa
Abstract— Group key maintenance in MANETs is especially risky because of repeated node movement, link breakdown, and lower-capacity resources. Member movement requires key refreshment to maintain privacy among members. To cope with these characteristics, a variety of clustering concepts are used to subdivide the network. To establish a considerably stable and trustable environment, fuzzy-based trust clustering is taken into consideration together with group key management. The nodes with the highest trust and energy are elected as Cluster Heads, and each forms a cluster in its range. The proposed work analyzes secure multicast transmission by implementing polynomial-based key management in fuzzy trust based clustered networks (FTBCA), which protects against both internal and external attackers; performance is measured by injecting attack models.

The ultimate trust of a node is a combination of the initial trust of the node, estimated using the direct or indirect methodology, the energy level of the node, and packet integrity. One node in the network is nominated as the Certificate Authority: the node having the highest final trust value. This node is authorized to issue trust certificates, each valid only for a certain period of time; certificates are renewed whenever this time elapses. Data transactions do not include misbehaving nodes that have not been assigned a certificate. The fuzzy analyzer separates reliable and unreliable nodes; meanwhile, the Certificate Authority notifies other nodes by producing alerts as soon as a malicious node requests a certificate. [3]
in the fuzzy table for spontaneous classification of mobile nodes.

    T(Na, Nb) = tanh( (1/n) Σt rt · wt · st + Ea )    (1)

where
    rt = recent transactions among the nodes,
    n  = number of transactions among the two nodes,
    wt = weight of a transaction,
    st = +1 when the transaction is positive, -1 when the transaction is negative,
    Ea = energy of node 'a'.

In the experiment, four different groups of nodes are labeled, specifically Totally Trusted, Trusted, Partially Trusted, and Distrusted, according to their trust values. The fuzzy logic variables are delimited using trust values ranging from -1 to +1. The subsequent fuzzy table (TABLE I) uses the trust value to decide whether to consider a node for clustering or to detach the node from network activities. The Totally Trusted nodes are more suitable to become CH than the normal Trusted nodes.

TABLE I. FUZZY TABLE

    Fuzzy Ranking    Evaluated Trust Value    Nodes Category
    Very High        0.9 to +1                Totally Trusted
    High             0.8 to 0.75              Trusted
    Medium           0.7 to 0.3               Partially Trusted

Fuzzy Guidelines:

C. Fuzzy Trust Based Clustering With Group Key Management (FTBCGKM)

The fuzzy trust clustering is united with the polynomial-based group key scheme adopted from [12] ("Polynomial-based key management for secure intra-group and inter-group communication") to offer protected communication. The authenticated information transmission initiates as soon as the clusters are developed and the group key is circulated among the cluster members.

IV. EVALUATING FTBCGKM WITH ECGKM

The proposed Fuzzy Trust based Clustering and Group Key Management (FTBCGKM) is compared with the existing "An efficient clustering scheme for group key management in MANETs (ECGK)" [4], in which direct and indirect observations are used for trust calculation.

The indirect trust assessment may not yield a true value; sometimes a malicious node produces fake information during trust calculation. Also, energy is not considered when electing the CH, which increases the chances of having a CH with low energy.

The advantage of the proposed FTBCGKM is that it considers only direct observation for evaluating trust, and the energy level of the mobile node is compared before electing the CH. The following simulation section compares the proposed FTBCGKM with the existing ECGK on the basis of an increasing number of nodes and an increasing number of attackers.

A. Simulation Setup

The proposed model is simulated using the Network Simulator NS2 [14]; one hundred mobile nodes spread within a space of 750 x 750 m were simulated, as shown in Fig. 1. The simulation runs for two hundred seconds. The complete simulation factors and their values are listed in TABLE II.
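The trust computation of Eq. (1) and the TABLE I categorization can be sketched as follows. Note that this is one possible reading of the formula, since the symbols in the equation are only partially legible in the text; the transaction list and energy value are illustrative, and the exact category boundaries are an assumption because the listed intervals leave small gaps.

```python
import math

def trust(transactions, energy):
    """One reading of Eq. (1): each transaction is a (weight, sign)
    pair with sign = +1 for a positive transaction and -1 for a
    negative one; the averaged weighted outcome plus the node's
    energy is squashed by tanh into the [-1, +1] range."""
    n = len(transactions)
    avg = sum(w * s for w, s in transactions) / n
    return math.tanh(avg + energy)

def node_category(trust_value):
    """TABLE I mapping; boundary placement between rows is assumed."""
    if trust_value >= 0.9:
        return "Totally Trusted"
    if trust_value >= 0.75:
        return "Trusted"
    if trust_value >= 0.3:
        return "Partially Trusted"
    return "Distrusted"

# Two positive transactions, one negative, with a moderate energy level.
t = trust([(0.8, +1), (0.5, +1), (0.9, -1)], energy=0.3)
print(round(t, 3), node_category(t))
```

Because tanh saturates at ±1, the output always stays inside the fuzzy range used by the table.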
Fig. 3. Misbehaving Nodes Vs Delivery Ratio

Fig. 6. Misbehaving Nodes Vs Energy Consumption
delay of the proposed FTBCGKM approach is 4% less than that of the ECGKM approach.

Fig. 10 shows the comparison of the FTBCGKM and ECGKM techniques with respect to energy consumption; the proposed FTBCGKM shows reduced energy consumption. Initially, all nodes are set with the same amount of energy in the different numbers-of-nodes scenarios. After some time, the energy decreases due to data transmission and other computational activities.

Fig. 8. Different Size of Network Vs Packet Drop

Fig. 8 shows the packet drop of the FTBCGKM and ECGKM techniques for the different numbers-of-nodes scenario. The conclusion from the above analysis is that the drop of the proposed FTBCGKM approach is 8% less than that of the ECGKM approach.

V. CONCLUSION

The proposed Fuzzy Trust based Clustering with Group Key Management was developed to automate the process of eliminating misbehaving nodes. The FTBCGKM constructs clusters with trusted nodes. The proposed FTBCGKM is compared with the existing ECGKM on metrics such as delay, delivery ratio, and drop.

Node capture attackers are injected to study the performance of the proposed FTBCGKM. The simulation results show the betterment of the proposed FTBCGKM over the existing ECGKM. The direct trust evaluation, along with the energy estimation of nodes, helps the proposed FTBCGKM eliminate malicious nodes from communication.

The intra- and inter-cluster communications are well organized in the proposed FTBCGKM, and rekeying is also carried out to preserve secrecy among mobile nodes. One of the weak points of a fuzzy logic system is that a considerable amount of memory is used for storing the fuzzy rule database; optimizing this memory size could be the focus of future study.
REFERENCES
[5] Dijiang Huang and Deep Medhi (2008), "A secure group key management scheme for hierarchical mobile ad hoc networks", Ad Hoc Networks, Vol. 6, pp. 560-577.
[6] Bhuvaneswari, V. and Chandrasekaran, M. (2014), "Cluster head based Group key Management for Malicious Wireless Networks using Trust Metrics", Journal of Theoretical and Applied Information Technology, Vol. 68, No. 1, pp. 1-9.
[7] Yanji Piao, JongUk Kim, Usman Tariq and Manpyo Hong (2013), "Polynomial-based key management for secure intra-group and inter-group communication", Computers and Mathematics with Applications.
[8] Athira V and Jisha G (2014), "Network layer attacks and protection in MANET - A survey", International Journal on Computer Science and Information Technologies, Vol. 5(3), pp. 3437-3443.
[9] Diwaker C, Choudhary S and Dabas P (2013), "Attacks on Mobile Ad-hoc Networks", International Journal of Software and Web Sciences, Vol. 4(1), pp. 47-53.
[10] Supreet Kaur and Varsha Kumari (2015), "Efficient Clustering with Proposed Load Balancing Technique for MANET", International Journal of Computer Applications, Vol. 111, No. 13.
[11] Jayaraj Singh, Arunesh Singh and Raj Shree (2015), "An Assessment of frequently adopted Security patterns in Mobile Ad hoc Network: Requirement and Security Management Perspective", Journal of Wireless Network and Microsystems, Vol. 4, No. 1-2, pp. 1-7.
[12] Piao, Y., Kim, J., Tariq, U. and Hong, M. (2013), "Polynomial-based key management for secure intra-group and inter-group communication", Computers & Mathematics with Applications, Vol. 65, No. 9, pp. 1300-1309.
[13] Saju P John and Philip Samuel (2014), "Self-organized Key Management with trusted certificate exchange in MANET", Ain Shams Engineering Journal, Vol. 6, pp. 161-170.
[14] NS-2 simulator. Available online: http://www.isi.edu/nsnam/ns/.
[15] Veerpal Kaur and Simpel Rani (2018), "A Hybrid and Secure Clustering Technique for Isolation of Black hole Attack in MANET", International Journal of Advanced Research in Computer Engineering & Technology (IJARCET), Vol. 7, Issue 3, pp. 230-237.
[16] Dheepak, T. and Neduncheliyan, S. (2017), "Security Scheme in MAC Protocol based Attack Detection Model using Cryptography and Basiyan method", International Journal of Pure and Applied Mathematics, Vol. 116, No. 21, pp. 459-467.
[17] Zhe Wei and Shuyan Yu (2018), "Energy Aware and Trust Based Cluster Head Selection for Ad-hoc Sensor Networks", International Journal of Network Security, Vol. 20, No. 3, pp. 496-501.
Framework for Blockchain Deployment:
The Case of Educational Systems
Saif Kazakzeh, Eyad Ayoubi, Baraa K. Muslmani, Malik Qasaimeh, Mustafa Al-Fayoumi
Princess Sumaya University for Technology
Amman, Jordan
xsaifahmadx@gmail.com, eyadayoubi@gmail.com, b.muslamani@yahoo.com ,m.qasaimeh@psut.edu.jo, m.alfayoumi@psut.edu.jo
Abstract— Blockchain is an emerging technology that lacks sophisticated guidelines and frameworks for deployment purposes. This paper proposes a framework that helps in making suitable decisions concerning blockchain model adoption. In addition, the authors classify the major categories of blockchain metrics. Furthermore, the authors evaluate the proposed framework with two well-known educational blockchain-based models.

Keywords—Blockchain, Bitcoin, Ethereum, Decentralization, TrueRec, Blockcerts

I. INTRODUCTION

Blockchain (BC) is a distributed and decentralized data management solution that includes cryptography, consensus mechanisms, and hashing functions to ensure the immutability of its blocks (data). There is no need for a third party to validate the transactions; any completed transaction is recorded simultaneously in an immutable ledger, in a permanent, transparent, verifiable, and secure way, with a timestamp [1]. The ledger is the heart of the blockchain, where the transactions between two parties are efficiently stored in a permanent and verifiable manner. Furthermore, it is possible to program the ledger to enable automatic triggering of transactions [2]. A smart contract is "a computerized transaction protocol that executes the terms of a contract"; it executes automatically, and is visible to all users of the blockchain [3].

The genesis of BC is usually traced to a Japanese theorist known as 'Satoshi Nakamoto', who published an online paper concerning the original source code for the virtual currency Bitcoin in 2009, whereby "nodes collect new transactions into a block, hash them into a hash tree", and subsequently broadcast the block "when they solve the proof-of-work… and the block is added to the block chain" [4].

Decentralization is one of the most important characteristics of BC, whereby users jointly manage the database in which their transactions are recorded, and there is neither presence nor control of a third party. Fault tolerance, resistance to attacks, and collusion resistance are assured by decentralization [5].

Each block in the BC can contain thousands of transactions, and a new block can be added by a hash verification procedure, known as mining. The new block is then linked to the last block in the chain. Each BC starts with the root block containing its settings [6].

Blockchain has the following advantages [7]: transparency - each party has the capacity to enter into the transaction; immutability - it is not possible to modify the written records; security - the infrastructure offers secure operations using strong cryptography; self-sovereignty, scalability, and decentralization - as a result of the elimination of the third party, it is possible to add new users (nodes) to the chain, and users have the authority to manage their own data; tamper-proofing - a unique timestamp is associated with each data store operation in the blocks [8]. Drawbacks include the high power consumption and time of the mining process, the complexity of managing one's data, and the performance issue: because BC is highly secured, there is a performance trade-off. One potential application area for BC is in educational documentation.

Proving one's level of education and skills, work experience, or even training accomplishment requires certification in some format, including several types of information statements. The most important are: the kind of qualification, such as "certificate of accomplishment, attendance, or graduation, etc."; the name and address of the certificate issuer; the name and title of the certifier who has validated the certificate; the date of obtaining the certificate; and the name of the learner. Moreover, there could be more information based on the type of certificate, such as the validation period or information about the examination regulations.

Paper-based certificates have advantages such as ease of archiving and retrieval, and they can be displayed to any person for any purpose. However, a hard copy might be subjected to damage or loss, which can lead to difficulties for the holder in being reissued documents or obtaining new copies, costing extra time and money, at the expense of potential opportunities. Similarly, many forced immigrant students and refugees suffer from a lack of certificates because they have lost access to their original locations, and/or cannot contact the authorities to be issued new ones. In contrast, the certificate issuer or authority needs to maintain a database of certificates for a long period of time, and this will lead to the
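The hash-linked structure of a blockchain, where each block carries the hash of its predecessor and the chain starts from a root block, can be sketched as follows. This is a minimal illustration; the transaction payloads are invented, and real blockchains add timestamps, Merkle trees, and consensus on top of this linking.

```python
import hashlib
import json

def make_block(transactions, prev_hash):
    """A block records its payload and the hash of the previous block;
    its own hash is the SHA-256 digest of that content, so changing
    any earlier block breaks every later link in the chain."""
    body = {"transactions": transactions, "prev_hash": prev_hash}
    # The digest is computed before the "hash" field is attached.
    body["hash"] = hashlib.sha256(
        json.dumps(body, sort_keys=True).encode()).hexdigest()
    return body

# The chain starts with a root ("genesis") block containing its settings.
genesis = make_block(["root block settings"], prev_hash="0" * 64)
block1 = make_block(["Alice pays Bob 5"], prev_hash=genesis["hash"])
print(block1["prev_hash"] == genesis["hash"])  # → True: blocks are linked
```

Verifying the chain means recomputing each digest and comparing it with the stored one, which is why tampering with a recorded transaction is immediately detectable.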
TrueRec, by the German multinational software corporation SAP SE, is another model employing blockchain technology in the education system and employment. TrueRec is based on the open-source distributed platform Ethereum, which is available to the public.

The main objective of TrueRec is to track and verify credentials for candidates that can later be used in the hiring verification process, or upon admission to an educational institution (e.g. a university), by enabling candidates to upload their certificates to TrueRec, enabling verification by trusted authorities. TrueRec has efficiently reduced the costs (including time) of the hiring process, as described in their patent [22], reducing the seven steps of the traditional verification process to two: receiving the application with the certificates already verified; and conducting the interview.

TrueRec (Figure 2) proved that costs can be significantly reduced by employing blockchain technology while increasing the security and reliability of certificates, and being open to the public ensures the usability of the system.

Figure 2. TrueRec proposed process [22].

The Dutch organization for Applied Scientific Research (TNO) started a blockchain project called the Self-Sovereign Identity Framework to support supplying official information in digital form while only sharing a minimum amount of personal data, managed and stored in a wallet on people's cellphones in encrypted form. This information provides official confirmation about the identity of the person using a decentralized, public-permissioned blockchain [23].

Blockchain for Education is another practical model for issuing, validating, and sharing certificates. The Ethereum blockchain is concerned with the correctness of security-relevant contracts, using the approved smart contract template of OpenZeppelin. The aim of the Blockchain for Education platform is to support counterfeit protection as well as secure management and access of certificates according to the needs of learners, companies, educational institutions, and certification authorities. It is similar to Blockcerts in using smart contracts for managing the identity of the certification authorities or the certifiers, and for managing the certificate lifecycle. However, Blockcerts uses Bitcoin, and therefore cannot apply complex contracts. Another benefit is that it allows the identity of the certifier to remain anonymous [24].

However, it is still at the prototype stage, and needs some improvements. For instance, the accreditation authority is a single powerful root node, and if its private key were compromised or lost, the whole system would be affected. Moreover, a cost overhead applies for adding certificates to the blockchain, as it is based on the Ethereum blockchain. The revocation model does not allow showing or validating the revoked certificate.

CredenceLedger is a system that stores consolidated data proofs of academic credentials in a blockchain, enabling easy verification by third parties such as employers or education stakeholders. This model depends on a permissioned multichain combined with a mobile application for verifying academic credentials. When students graduate, they are awarded an authentic digital version of their credentials, in addition to the paper certificate. This provides easy access from the mobile to the certificate, and easy verification by the third party. There is no need for transacting a cryptocurrency, as it uses streams (hexadecimal values with a key-value pair). CredenceLedger is a private blockchain that enables digital forms of credentials to be verified easily, without needing a public blockchain transaction, which incurs mining costs. Furthermore, CredenceLedger does not need a centralized system, and it provides high throughput with low costs [25]. However, it still needs to be tested in public use, and it should be expanded and developed to be used on a public blockchain for global use, because otherwise special efforts and knowledge are needed to access the application.

Other models are being developed by different vendors, such as Sony Global Education by Sony [26] and Open Certificates by Attores Solutions [27]. However, these models are not discussed in this paper, as the systems are in the development and testing phases.

B. Models of blockchain in education

To achieve a better decision-making procedure through blockchain technology, guidelines have been proposed in the literature, such as [28], which presented a blockchain maturity model that extends the CMMI model based on five aspects with four characteristics. The paper aimed to produce guidance on how organizations in different industries could systematically decide on adopting blockchain. However, the adoption procedure is complex; it investigated three non-technical aspects without detailing the process enough to be considered a full reference.

Yuan et al. [29] presented a reference model for researchers in the field that divides the blockchain framework into six layers, as shown in Figure 3. The model is well described and presented, though it needs some enhancements to become comprehensive and include all the components that may constitute a blockchain.

More literature presented different models to support the decision-making process for blockchain technology. Lo et al. [30] proposed an evaluation framework to help organizations assess the suitability of applying blockchain. Through a decision tree, an organization may decide whether using blockchain technology is suitable for their system or not. However, the framework depends on very limited and strict
Yes/No questions (Figure 4), without taking into consideration some special aspects that may arise for each business and that may affect the resultant suitability decision.

Figure 3. Blockchain components [29].

This section describes the most-used versions of the former, such as the Elliptic Curve Digital Signature Algorithm (ECDSA) and the X.509 Standard, which defines the format of public key certificates, because blockCAM relies only on hashing algorithms.

a) Elliptic Curve Digital Signature Algorithm

This is a cryptographic algorithm used in many blockchain platforms to issue public and private keys, and to digitally sign a file, which allows users to verify the authenticity of the file. Unlike the Advanced Encryption Standard (AES), which encrypts the content of the file, ECDSA protects the file from tampering. The main strengths of ECDSA are that it is practically impossible to duplicate a signature, and it requires less computing power compared to other algorithms [33]. The major services of ECDSA are [34]:

• Ensuring data integrity.
• Origin authentication.
• Tamper-proof data.

b) X.509 Standard

The X.509 Standard defines the format of public key certificates, in which a certificate contains a public key and an address that specifies the owner. This certificate can be signed by a Certificate Authority (CA) or signed by the owner. An X.509 certificate uses the Public Key Infrastructure (PKI) to verify that a public key belongs to the assigned address. The cross-certification process is calibrated by PKIs [35], certifying that all user certificates in PKI 2 (User 2) are trusted by PKI 1, whereby CA1 generates a certificate (Cert 2.1) that contains the public key of CA2. As Cert2 and Cert2.1 have the same subject and public key, there are two valid chains for Cert2.2 (User 2): "cert2.2 → cert2" and "cert2.2 → cert2.1 → cert1". Similarly, CA2 can generate a certificate (Cert1.1) containing the public key of CA1 so that user certificates existing in PKI 1 (User 1) are trusted by PKI 2.
SHA-3 is the latest member of the Secure Hash Algorithm family, released by the US National Institute of Standards and Technology (NIST). The main functionality of SHA-3 is the same as that of SHA-256: a hash function takes a message of any length as input and transforms it into a short, fixed-length bit string called a hash value.

By utilizing these two cryptographic technologies, asymmetric cryptography and hashing functions make blockchain one of the most secure existing technologies, with algorithms not yet solved mathematically (Esslinger et al., 2014) [36].

B. Consensus

1) Consensus protocol

One of the main mechanisms used in blockchain technology is the consensus protocol, used to achieve agreement on a single transaction, value, or block in distributed systems. The consensus protocol provides reliability in a network. In other words, consensus means that all the nodes in the network agree on the same state of the blockchain [37]. There are many types of consensus protocols adopted by blockchain platforms; the following subsections describe the most common ones.

a) Proof of work (POW)

The Proof of Work protocol is adopted by the Bitcoin and Ethereum platforms. The mechanism of POW works by requiring blockchain mining nodes to solve a specific mathematical problem in order to add a new block. In this case, nodes must solve the hash function, and the only way is trial and error. When a node solves the hash function, it receives a reward of some currency to cover part of the power consumption costs [38]. POW thus adds new transactions to the blockchain based on computational power [39].

b) Proof of stake (POS)

In proof of stake, the mining node is instead called a forger node: rather than using computational power as a measure, POS uses an amount that must be staked to select the forger node. The higher the stake, the higher the probability of being selected to validate the new block or transaction. The rewarding system is similar to POW, whereby the forger is rewarded with transaction fees. In the case of false validation of a transaction or block, the staked amount is lost.

The Ethereum platform is trying to change its consensus protocol to use POS instead of POW. Other mechanisms, such as Delegated Proof of Stake and Proof of Authority, exist but are not described in this paper. These consensus protocols aim to reduce the time needed to add new blocks to the blockchain, and to ensure that no one can validate false transactions or blocks, or even compromise the blockchain network.

C. Blockchain architecture

a) Permission-less blockchain (public)

This is the most used architecture among digital currencies (Bitcoin and Ethereum). A permission-less blockchain allows any user to use and interact with the blockchain while maintaining anonymity and transparency. Permission-less blockchains allow any user to run a normal node or a mining node, to help verify new transactions. The main characteristics of permission-less blockchains are that they are decentralized, transparent, and anonymous.

b) Permissioned blockchain (private)

A permissioned blockchain is governed by an organization or authority, which determines the users who are approved to use and interact with the blockchain, with varying degrees of privileges. The main characteristics of permissioned blockchains are different levels of decentralization, different levels of transparency and anonymity, and their governance structure, whereby organizations and communities have a decision-making role in which architecture to adopt based on their needs and size. For organizations it may be safer to adopt a permissioned blockchain, while for public organizations it is more usable to adopt a permission-less blockchain to serve more users.

D. Blockchain scalability

a) Scalability of transactions

The scalability of the blockchain is important to serve more users. Problems can arise when there are too many transactions to be processed by the network. Figure 5 displays the drastic increase in the number of daily Bitcoin (BTC) transactions since 2009 [40]. The scalability of blockchain is a major concern, which is why the authors included this as a category.

b) Scalability of nodes

Another aspect of blockchain scalability is the simplicity of adding new users (nodes) to the blockchain. In the permission-less architecture it is easy to add new users to the blockchain; however, it takes more time and effort to add new users to a permissioned blockchain, in order to verify the correct identity of the new user and whether the user fulfills the admission requirements.

Figure 5. Number of BTC transactions since 2009 [40].

E. Network performance

The performance of any network has always been a great concern regarding usability and availability; thus, due to its intrinsic significance, the authors decided to add this category to the evaluation in this paper. The main metrics of performance are throughput and latency, as discussed below.
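The trial-and-error hash search at the core of POW can be sketched as follows. This is a toy illustration: real networks use a much higher difficulty and a full block header, and the payload string here is invented.

```python
import hashlib

def mine(block_data, difficulty=3):
    """Trial-and-error nonce search: keep hashing until the SHA-256
    digest starts with `difficulty` zero hex digits. The difficulty
    controls the expected number of attempts (16**difficulty on average)."""
    prefix = "0" * difficulty
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{block_data}:{nonce}".encode()).hexdigest()
        if digest.startswith(prefix):
            return nonce, digest
        nonce += 1

nonce, digest = mine("block #1: Alice pays Bob")
print(digest.startswith("000"))  # → True
```

Note the asymmetry the text describes: finding the nonce takes many hash evaluations, but any node can verify the result with a single hash, which is what makes the reward-for-work scheme auditable.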
a) Throughput

Throughput is the rate at which the blockchain platform uploads verified transactions into the blockchain ledger. It is not to be confused with latency: throughput is the rate for uploading a group of transactions, while latency is the rate for a single transaction.

b) Latency

Blockchain platforms such as Bitcoin and Ethereum require time to process each block in order to verify its validity. Latency is the amount of time needed to process and validate a single block or transaction before adding it to the ledger. Bitcoin processes transactions in minutes, while Ethereum does so in seconds [41]. Predicting the latency of blockchain-based systems using architectural modelling and simulation focuses on processing time as the main parameter, as the number of transactions and the number of users or nodes increase.

IV. FRAMEWORK

As discussed before, in this paper we propose a prototype framework for evaluating different blockchain models. To achieve this, we started by categorizing the specifications of blockchains that can be used as metrics. Table II shows the different categories that any blockchain model adoption process should consider. Considering that every blockchain model is driven by its business goals, each of them should be interested in some of the mentioned categories and their metrics more than the others. This offers one way to provide a primitive decision on whether a blockchain model can be adopted for a given business need. For example, of the previously explained blockchain models, BlockCerts and TrueRec, the former can be used for educational document accreditation even though it takes more time (i.e. higher latency), but it is not suitable for direct purchases due to this characteristic. On the other hand, TrueRec can be used for some systems with less stringent security requirements than those enforced by BlockCerts.

TABLE II. EVALUATION METRICS CATEGORIZATION

    Security
        Encryption/digital signature   ECDSA; X.509
        Hashing functions              SHA-256; Ethash (SHA-3)
    Consensus
        Consensus mechanisms           POW; POS
    Architecture
        Blockchain architecture        Permission-less (public); Permissioned (private)
    Scalability
        Scalability of transactions    Transaction processing time; Transaction size
        Scalability of nodes           Simplicity of adding nodes to the blockchain
    Network Performance
        Throughput                     The rate at which the blockchain platform uploads
                                       the verified transactions into the blockchain ledger
        Latency                        The amount of time needed to process a single block
                                       or transaction before being added to the ledger

A. Evaluation process

The literature has produced many processes that may rely on conditional statements [30] or on other guideline models, such as the CMMI model [28]. We have implemented our framework in five steps, taking advantage of the metrics categorization produced earlier. The process flow is shown in Figure 6 below.

B. Framework steps

a) Specify the business needs

Specifying the business needs and what an organization really looks for is a major step toward our evaluation goal, since business requirements may not be amenable to certain blockchain models' specifications. Knowing the business/client goal helps in deciding which model is most suitable for a particular application.

b) Specify the most relevant blockchain models

This step is about narrowing the evaluation process to the models most relevant to the business goals identified in the previous step. This task should be done by an expert (i.e. a person or organization with experience in blockchain models and business applications) to ensure that no related model that could be more efficient has been missed.
79
Prioritize the metrics based on impacts on goals -
c) Specify evaluation categories most related to According to the business goals in this usecase, security
business needs is more important than network performance, due to the
fact that the nature of the business allows students or other
The process of specifying the evaluation categories considers entities with enough privileges to submit a request for
their relation to the business goals. In this step, we can specific documents, and they can later check the status for
include all or some of the evaluation categories that were validity and correctness. Network performance is also
presented earlier that have major influences on the business important; however, the considered business goals can
goals. The aim of this process is to eliminate any unrequired afford network delays, since the process does not need to
metrics from the evaluation, since these unnecessary metrics be attended. Based on our methodology, as explained in
may negatively affect the decision of selecting the suitable the framework, BlockCerts weight is 5, and TrueRec
blockchain model. weight is 7.
REFERENCES

[1] Holotescu, C. (2018). Understanding Blockchain Opportunities and Challenges. eLearning & Software for Education, 4.
[2] Hao, Y., Li, Y., Dong, X., Fang, L., & Chen, P. (2018, June). Performance Analysis of Consensus Algorithm in Private Blockchain. In 2018 IEEE Intelligent Vehicles Symposium (IV) (pp. 280-285). IEEE.
[3] Iansiti, M., & Lakhani, K. R. (2017). The truth about blockchain. Harvard Business Review, 95(1), 118-127.
[4] Nakamoto, S. (2009). The original bitcoin source code. Online at https://github.com/trottier/original-bitcoin (Accessed 29 December 2018).
[5] Buterin, V. (2017). The Meaning of Decentralization. Medium. Online at https://medium.com/@VitalikButerin/the-meaning-of-decentralization-a0c92b76a274 (Accessed 29 December 2018).
[6] Dhillon, V., Metcalf, D., & Hooper, M. (2017). Blockchain Enabled Applications: Understand the Blockchain Ecosystem and How to Make it Work for You. Apress.
[7] Grech, A., & Camilleri, A. F. (2017). Blockchain in education.
[8] Bhowmik, D., & Feng, T. (2017, November). The multimedia blockchain: A distributed and tamper-proof media transaction framework. In Digital Signal Processing (DSP), 2017 22nd International Conference on (pp. 1-5). IEEE.
[9] Grech, A., & Camilleri, A. F. (2017). Blockchain in Education. No. JRC108255. Joint Research Centre (Seville site).
[10] Park, H., & Craddock, A. (2017). Diploma Mills: 9 Strategies for Tackling One of Higher Education's Most Wicked Problems. https://bit.ly/2DoEeyu
[11] Bazley, T. D. (2005). Degree Mills: The Billion Dollar Industry That Has Sold Over a Million Fake Diplomas. College and University, 80(4), 49.
[12] Rutkowski, J. (2007). From the shortage of jobs to the shortage of skilled workers: labor markets in the EU new member states.
[13] Mauz, G. (1997). A juggler, an artist. http://www.spiegel.de/spiegel/print/d-8742708.html
[14] Musee, N. M. (2015). An academic certification verification system based on cloud computing environment. PhD diss., University of Nairobi.
[15] Sharples, M., et al. (2016). Innovating pedagogy 2016: Open University innovation report 5.
[16] Gencer, A. E., Basu, S., Eyal, I., van Renesse, R., & Sirer, E. G. (2018). Decentralization in bitcoin and ethereum networks. arXiv preprint arXiv:1801.03998.
[17] Miller, A., & Bentov, I. (2017, April). Zero-collateral lotteries in Bitcoin and Ethereum. In Security and Privacy Workshops (EuroS&PW), 2017 IEEE European Symposium on (pp. 4-13). IEEE.
[18] Blockchain Credentials. (2018). Blockcerts. Available at: https://www.blockcerts.org (Accessed 29 December 2018).
[19] Schmidt, P. (2016). Blockcerts—An Open Infrastructure for Academic Credentials on the Blockchain. ML Learning (24/10/2016).
[20] Case Study Malta | Learning Machine. From https://www.learningmachine.com/customer-story-malta/ (Accessed 29 December 2018).
[21] Case Study FSMB | Learning Machine. From https://www.learningmachine.com/customer-story-fsmb/ (Accessed 29 December 2018).
[22] Tummuru, N., Sheth-Shah, S., Kunzmann, M., Shirole, S., & Meng, J. (2018). U.S. Patent Application No. 15/385,479.
[23] Jongsma, H. J., & Joosten, H. J. M. (2018). Technical Report Studybits.
[24] Gräther, W., Kolvenbach, S., Ruland, R., Schütte, J., Torres, C., & Wendland, F. (2018). Blockchain for Education: Lifelong Learning Passport. In Proceedings of 1st ERCIM Blockchain Workshop 2018. European Society for Socially Embedded Technologies (EUSSET).
[25] Arenas, R., & Fernandez, P. (2018, June). CredenceLedger: A Permissioned Blockchain for Verifiable Academic Credentials. In 2018 IEEE International Conference on Engineering, Technology and Innovation (ICE/ITMC) (pp. 1-6). IEEE.
[26] Russell, J. (2017). Sony wants to digitize education records using the blockchain. Available at: https://techcrunch.com/2017/08/09/sony-education-blockchain. (Accessed on 01 January 2019).
[27] Open Certificates. Available at: http://opencertificates.co/. (Accessed on 01 January 2019).
[28] Wang, H., Chen, K., & Xu, D. (2016). A maturity model for blockchain adoption. Financial Innovation, 2(1), 12.
[29] Yuan, Y., & Wang, F. Y. (2018). Blockchain and cryptocurrencies: Model, techniques, and applications. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 48(9), 1421-1428.
[30] Lo, S. K., Xu, X., Chiam, Y. K., & Lu, Q. (2017, November). Evaluating Suitability of Applying Blockchain. In Engineering of Complex Computer Systems (ICECCS), 2017 22nd International Conference on (pp. 158-161). IEEE.
[31] Gervais, A., Karame, G. O., Wüst, K., Glykantzis, V., Ritzdorf, H., & Capkun, S. (2016, October). On the security and performance of proof of work blockchains. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security (pp. 3-16). ACM.
[32] Wang, W., Hu, N., & Liu, X. (2018, June). BlockCAM: A Blockchain-Based Cross-Domain Authentication Model. In 2018 IEEE Third International Conference on Data Science in Cyberspace (DSC) (pp. 896-901). IEEE.
[33] Understanding How ECDSA Protects Your Data. https://www.instructables.com/id/Understanding-how-ECDSA-protects-your-data/. (Accessed on 03 January 2019).
[34] Khalique, A., Singh, K., & Sood, S. (2010). Implementation of elliptic curve digital signature algorithm. International Journal of Computer Applications, 2(2), 21-27.
[35] "Cross-Certification Between Root CAs". Qualified Subordination Deployment Scenarios. Microsoft. August 2009. (Accessed on 05 January 2019).
[36] Yasaweerasinghelage, R., Staples, M., & Weber, I. (2017, April). Predicting latency of blockchain-based systems using architectural modelling and simulation. In Software Architecture (ICSA), 2017 IEEE International Conference on (pp. 253-256). IEEE.
[37] Consensus protocol. https://lisk.io/academy/blockchain-basics/how-does-blockchain-work/consensus-protocols. (Accessed on 5 January 2019).
[38] Understanding Blockchain Fundamentals, Part 2: Proof of Work & Proof of Stake. https://medium.com/loom-network/understanding-blockchain-fundamentals-part-2-proof-of-work-proof-of-stake-b6ae907c7edb. (Accessed on 05 January 2019).
[39] Types of Consensus Protocols Used in Blockchains. https://hackernoon.com/types-of-consensus-protocols-used-in-blockchains-6edd20951899. (Accessed on 05 January 2019).
[40] Blockchain.com. (2009). Bitcoin Charts & Graphs - Blockchain. [online] Available at: https://www.blockchain.com/charts [Accessed Nov. 2009].
[41] Yasaweerasinghelage, R., Staples, M., & Weber, I. (2017, April). Predicting latency of blockchain-based systems using architectural modelling and simulation. In Software Architecture (ICSA), 2017 IEEE International Conference on (pp. 253-256). IEEE.
[42] Chu, S., & Wang, S. (2018). The Curses of Blockchain Decentralization. arXiv preprint arXiv:1810.02937.
Appendix A: Evaluation results
THE JOVITAL PROJECT: CAPACITY
BUILDING FOR VIRTUAL INNOVATIVE
TEACHING AND LEARNING IN
JORDAN
Arinola Adefila, Alun DeWinter, Katherine Wimpenny
Centre for Global Learning, Education and Attainment
Coventry University
Coventry, United Kingdom
ab0191@coventry.ac.uk, aa2567@coventry.ac.uk, k.wimpenny@coventry.ac.uk

Valerij Dermol, Nada Trunk Širca, Aleš Trunk
International School for Social and Business Studies
Celje, Slovenia
valerij@dermol.si, trunk.nada@gmail.com, ales.trunk@mfdps.si
This qualitative paper presents the preliminary findings of an ongoing education-focused project, JOVITAL, an international cooperation project co-funded by the Erasmus+ Capacity Building in HE programme of the European Union during the period October 2017 - 2020, involving four European institutions and five Jordanian universities.¹ Our paper outlines how new and emerging technologies are being innovatively used in institutions around the world and, on this basis, how they are being adapted and implemented in Jordan as part of JOVITAL. Regulations and instructions on an institutional and national level have been continuously changing over the years, with the Ministry of Higher Education and Scientific Research (MOHESR) approving the blended model within 25% of programmes, placing a cap on the amount of online learning that takes place within a HE programme. However, an alliance of three or more Jordanian universities can establish a fully online programme as well. That being said, MOHESR has expressed some constraints regarding quality assurance, including the way exams are conducted, how learning outcomes are measured, and how course funding and cultural perceptions are considered. Challenges in the open education methodology, therefore, still exist in the academic medium in Jordan, where three main issues are of particular note: the governmental policies instructed by the Ministry of Higher Education and Scientific Research; the alignment of these policies with regulations published by Jordanian accreditation institutes; and the cultural acceptability of open education and distance learning in general. The ideas we present here include applications of technology for domestic online learning, as well as global partnerships that support the development of intercultural competencies through the use of Virtual Collaborative Learning (VCL) or Collaborative Online International Learning (COIL). This paper presents the activities and the findings of our project work to date and provides a snapshot of the JOVITAL project during its delivery.

Keywords— E-LEARNING, JORDAN, JOVITAL, HIGHER EDUCATION, LEARNING TECHNOLOGIES, ONLINE LEARNING, COLLABORATIVE ONLINE INTERNATIONAL LEARNING, TEACHING AND LEARNING

¹ Full List of JOVITAL Partners: Technische Universität Dresden, Coventry University, International School for Social and Business Studies Slovenia, UNIMED, Princess Sumaya University for Technology (PSUT), German Jordanian University (GJU), Tafila Technical University (TTU), Al-Hussein Bin Talal University (AHU), and Jordan University of Science and Technology (JUST)

I. INTRODUCTION

In a world that is increasingly interconnected, interdependent and diverse, engaging in international and intercultural learning and exchange is a key focus for many Higher Education Institutions (HEIs) around the globe [1][2]. Such a trend can be considered in relation to several issues. For example, universities are experiencing exponential growth in their recruitment of international students [3][4][5]; accordingly, online international learning is increasingly becoming a core pillar of university collaborations for globally networked learning [6][7][8]; and open courses such as Massive Open Online Courses (MOOCs) target learners regardless of their geographic and cultural background [9][10][11]. Many countries are experiencing, due to their demographic and socioeconomic context, a massification phenomenon concerning learners accessing higher education (HE). Because of such trends, responsive and effective education processes are required to maintain quality learning [12][13][14]. As an answer to the challenges mentioned above, state-of-the-art education technology may be used in HEIs to encourage learning as well as the recruitment of international students and the inclusion of students belonging to disadvantaged social groups. However, in some countries, restrictions regarding the amount of e-learning within study programmes can be noted. Such limits can also be seen as a rejection of e-learning methodologies as an inferior or lazy option where learning content merely is dumped online with little effort to contextualise the learning or to improve the learner experience.
978-1-7281-2882-5/19/$31.00 ©2019 IEEE
In this paper, we outline how new and emerging technologies are being innovatively used in institutions around the world and, on this basis, how they are being adapted and implemented for use in Jordan. This includes applications of technology for domestic online learning, as well as global partnerships that develop intercultural competencies through the use of Collaborative Online International Learning (COIL), sometimes also referred to as Virtual Collaborative Learning (VCL). This paper presents the activities and the findings to date of the JOVITAL project in its goal of building the capacity of Jordanian academics in the design and delivery of collaborative online (international) pedagogies. JOVITAL is an international cooperation project co-funded by the Erasmus+ Capacity Building in HE programmes of the European Union during the period October 2017 - 2020, involving four European institutions and five Jordanian universities. The overall project aims to foster academic exchange using virtual mobility in order to develop the capability of academic staff, university students and disadvantaged learners in Jordan. As part of the overview of the JOVITAL project and the technologies used, the paper also includes a presentation of the applications of technology for domestic online learning, as well as global partnerships that develop intercultural competencies through the use of Online International Learning (OIL).

II. STATE-OF-THE-ART TECHNOLOGIES AND E-LEARNING IN HIGHER EDUCATION (HE)

Implementing e-learning can present significant challenges for HEIs. Many institutions now view e-learning as a strategic tool which can be used to boost their reach, reputation, and finances. There is also increased competition to deliver innovative programmes that attract and connect students across the world, which has many implications for HEIs in terms of course design and content. E-learning makes it possible for students to attend a variety of study programmes without even leaving their country whilst enabling students to connect and engage with the wider world. Such approaches to the delivery of study programmes may also be beneficial for vulnerable and disadvantaged groups who would like to study but have little or no access to HE [11].

That being said, there are concerns over the use of technology and online resources in terms of quality control (as evidenced by the 25% cap seen in the Jordanian HE system); a strategic approach to recognition by national governments needs to exist, especially in regions where there is no strategic oversight over the quality of study programmes and HEIs. The study of Calvo-Porral, Levy-Mangin, and Novo-Corti [15], for example, found that the tangibility and empathy dimensions have the most substantial influence on students' perceived quality. The tangibility dimension is associated with facilities and equipment, while the empathy dimension concerns the attitudes of the teaching and administrative staff towards students. Yusoff, McLeay, and Woodruffe-Burton [16] identified 12 aspects that drive student satisfaction and among them they emphasised the importance of the student's learning experience and his or her satisfaction with the quality provision of (online) learning, as well as support mechanisms such as textbooks or IT, which all play a crucial role in the perception of quality. Overall, "satisfaction" in the eyes of a student is a complex concept with foundations in the subjective impressions of pedagogy and the context within which the pedagogy is delivered.

III. Gains and benefits to the student experience and changes to pedagogy

One of the most important benefits of e-learning is the possibility of enabling students to study at a convenient "pace, place and mode" in order to ensure that the quality of teaching and learning is maintained [17]. The mode of delivery can enhance or inhibit this affordance, and adequately designed e-learning programmes allow for learner-centered flexible approaches to HE education.

A key area for consideration related to e-learning is also the role of the academic facilitator, who can significantly improve students' learning experience. Such facilitators should have appropriate skills and competencies in the field of learning. Moreover, Buhl, Andreasen, and Pushpanadham [18] suggest e-learning fragments be included in lecturers' traditional roles. They should not be responsible only for "planning, practice, and reflection"; rather, such activities may now be "performed by different actors with different areas of responsibility" [18]. Therefore, many institutions have introduced support for the technical and design areas of e-learning delivery with the emergence of roles such as learning technologists, e-developers, etc. [19].

Also, teachers have to adopt new skills and techniques so they can prepare and engage the students to become reflexive learners in e-learning environments. This might be quite a challenge. Nowadays, the students may be accustomed to modern technology, but they are not necessarily adept at engaging in transformative learning and lack the kind of digital capital that enables them to be co-creators of their own learning [20][21].

IV. CHALLENGES IN HE IN JORDAN

Teaching experiences delivered throughout the JOVITAL project and a short review of the use of new and emerging technologies in HEIs around the world enabled us to recognise some key challenges which HE in Jordan is facing. For example, in Jordan, e-learning has been associated with removing barriers for female learners in remote locations and providing opportunities to upskill the existing workforce [22]. However, the challenges of ensuring high-quality training have been discussed by employers, leading to restrictions such as the afore-mentioned 25% cap seen in Jordan.

Another unique challenge for Jordan is related to equity in access, as well as the inclusion of Syrian refugees in the region. Although Jordanian institutions want to include the refugees in e-learning, many barriers exist. In 2017, the Open University attempted to deliver online courses to Syrian refugees in Jordan, which was not well received due to the lack of interactivity [23]. The conclusion stemming from this experience shows that the attitudes towards e-learning are different within different
students' communities and they should be properly addressed. A key challenge is, therefore, to change the mindset to using technology not just as a tool for teaching, but as a platform for education that seeks to engage the learner with activities and opportunities for feedback and discussion. Such change requires the shift from a 'teacher at the front' model of learning to an approach of designing a course together with appropriate pedagogical implementation.

Through the JOVITAL project, training was made an integral part of the study programme delivery, with a variety of methods and approaches to the teacher as well as student training. Namely, the HEI has the responsibility to ensure that staff is adequately equipped with competencies to perform their role, and equally, students need to be supported to study online, with the necessary skills of autonomy and self-efficacy. The preparation of the students is a demanding task, not least because pre-university education does not typically prepare them to tackle the new technology challenges. Students of the 21st century need to develop requisite skills (problem-solving, teamwork and communication skills) for the workplace (Warner & Palmer, 2015). E-learning also requires the students to master the communicative and networking tactics to engage in such online learning spaces. Furthermore, the experiences stemming from the delivery of the JOVITAL project show that institutions need to concern themselves seriously with ensuring that assessment practices are appropriate for the e-learning context. Assessment needs to align with the evolution of e-learning. Assessments should also be varied and flexible.

V. STUDENT EXPERIENCES – DRESDEN VIRTUAL LEARNING ENVIRONMENT

In May 2019, over 500 Jordanian students took part in a Virtual Learning Environment trial, led by Technische Universität Dresden, to experience online learning first hand in a 'live' environment. Students from the Jordanian universities, predominantly from engineering courses, undertook tasks and assessments in the VLE, supported by 'e-lectures' and staff guidance on how to learn in an online environment. This was powered by Elgg, an open-source tool that specialises in social and collaborative activities for education, with the team at Dresden creating the virtual environment. The activities took place in closed groups that saw students enrol to undertake activities that were mapped to specific topics and modules. There were also discussion forums per activity group in order to allow staff and students alike to provide feedback on their experiences of the pilot. In some cases, the online activities were directly incorporated into local taught elements of a module; for example, the systems analysis and design online group was specifically incorporated into the teaching and learning activity for a TTU module, with lectures, class exercises, student presentations and lab work taking place alongside the online discussions and activities, with students engaging with the VLE.

Following this pilot, a summer school is to take place in Dresden to allow for the training of approximately 25 student experts – specialists who will assist the future delivery of online learning within the VLE in the Jordanian universities. All participants of the initial pilot have been invited to give qualitative feedback through a survey tool developed by Coventry University. The results of the survey are forthcoming (September 2019), but it is intended that data will be available to present at the ICTCS conference in October 2019.

VI. FINAL THOUGHTS

Through presenting and exploring the activities and findings of JOVITAL, this paper seeks to outline the challenges and benefits of e-learning technologies in HE teaching and learning, and how these can be tailored for use within the unique Jordanian context. In addition, it offers insight into a work-in-progress project that is continually developing and adapting to the needs of all stakeholders and participants. This paper argues that online learning, in many forms, is of benefit to students and teachers alike, but utilisation of technologies requires careful planning, tailoring, and training in order to see maximum benefit. As such, it is imperative that time is taken to train teaching staff and to prepare student expectations of online learning in order to gain the maximum benefit e-learning technologies have to offer. It is not merely enough to buy into technology and expect it to do all of the work – changes to approach and implementation are vital to the success of online approaches to pedagogy. In addition to having access to internet-enabled technology, institutions must also have an awareness of e-learning and the software required to support this. Importantly, HEIs must also take the time to train and develop academic staff to fully realise the potential of e-learning in order to achieve strong levels of learning engagement. Beyond this, institutions must also invest in relevant support staff, which might include IT experts, developers and learning technologists. With this in mind, it is not sufficient to simply 'buy in' to the technology; e-learning needs investment in staff, resources and infrastructure to succeed.

In terms of the next stages of the JOVITAL project, the feedback and results from the Dresden pilot testing and the Dresden Summer School will offer valuable insights into e-learning approaches and student engagement. The ICTCS presentation will also invite participants to give their own views and feedback on JOVITAL, which will offer another route for valuable data for the project.
The relation between Individual Student Behaviours
in Video Presentation and their Modalities using
VARK and PAEI Results
Manal Ismail, Ahmed Fekry
Computer Science Department
National Egyptian e-Learning University
Cairo, Egypt
mismaeel@eelu.edu.eg, afekrymohamed@eelu.edu.eg

Georgios Dafoulas
Computer Science Department
Middlesex University
London, UK
g.dafoulas@mdx.ac.uk
Abstract—This research paper aims to investigate the relationship between students' personality characteristics, using well-recognized models, and their behaviours and activities in video content presentation. This paper is part of a research study focusing on video tagging methods for analysing the behaviour of team members. The authors analysed videos of student group presentations and a data set of student personality tests of the same student cohort, identifying their characteristics. By finding a relation between the two we can better support students after assessing video content. The study aims at pursuing associations between human behaviour and personal modalities. This practice can be very supportive in student assessment and career coaching. The work carried out was based on quantitative research methods for analysing the videos and combining them with two personality models, the VARK (Visual, Aural, Read/write, and Kinaesthetic) modality preferences test and the PAEI (Producer, Administrator, Entrepreneur, Integrator) methodology, to give us information about student modality and leadership preferences. We found the average behaviour occurrence and average presentation duration for each VARK profile and PAEI role. We conclude from our results that students with an Administrator role in the PAEI roles and with Multimodal and Aural styles in the VARK model are the most self-talking, while the highest average value for eye focus behaviour is for students with a Producer role in the PAEI roles and with Visual style in the VARK model, and the largest speech loudness average is for students with an Administrator role in the PAEI roles and with Visual style in the VARK model.

Keywords— human behaviour, student presentation, video tagging, video content, learning modality, VARK, PAEI, personal modality, recommendation system.

I. INTRODUCTION

Recent progress in information capture, storage, and communication technologies has increased accessibility to video data. Collaboration with mixed media information, and video in particular, requires more than interfacing with data banks. The recommended approach is to index video information and transform it into organised media [1]. Appropriate annotation of mass video data is very important for traditional text-based search engines to retrieve semantic data. Hence, video annotation has been recognised as a valuable research area [2].

There is now a mandatory demand for audio-visual or multimedia content in various fields [3]. In our research, we focus more on observing human behaviours that happen in video content, and how video tagging techniques can support understanding what is going on in video content.

Previous work has focused on metadata to reach the content [4]. Metadata should describe the video content in a generic method to support indexing and searching, but in our research, we use a different way to tag video content and describe human behaviours. This takes place by observing specific behaviours and activities to find a relation which can help in building a future model that can judge video content and generate a point system for specific behaviours. Such a system can be used for both individual behaviour and the behaviours of several group members.

Typical approaches to video annotation include video structure analysis, object discovery and event classification. Throughout the past decade, these approaches have advanced from the use of handcrafted highlights [5] to feature learning techniques. Recent research claims that deep learning can achieve great accuracy in video annotation applications [6].

To understand human body activities in videos, we need to define human gestures. Gestures can originate from body movements like walking, bending, jumping, and hand waving. While a video is playing, human action detection is not easy to achieve. This problem exists due to variations in the motion appearance of actions, camera angles, movement in the background and any surrounding noise. The objective of such an application is to detect different gestures in multimedia clips by pre-processing the video and then applying an algorithm for detecting various actions [7].

Video tagging or concept detection is emphatically related to tasks like scene recognition and object recognition [8]. In our research we focus on the behaviours of humans while making a group presentation, in order to investigate the relations between individuals' behaviours and their personality. We believe that this investigation will help in building a model that gives an automatic rating for students and has a good understanding of student activities and common behaviour, as well as summarising video content and extracting important data. Furthermore, this model would help and support in building algorithms for an automatic judging model or a recommendation system from these findings. This paper is part of our research in video tagging methods for analysing group member behaviour. This research also involves the exploration of relations in student behaviour patterns and individual characteristics, which is published elsewhere. The focus of this paper is on investigating the relation between an individual's behaviours and certain personality types.

While many teachers around the world and pioneers are calling for students to create 21st-century competencies
Analysing human behaviour during group presentations requires more investigation about the relations between human behaviours during a presentation and their modalities or preferences of learning. This is the focus of our research, as we need to develop a model that can have a good understanding of human behaviours.

III. RESEARCH METHOD

• End time.
• Duration of Video.

B. Research Question

We used each behaviour as a tag (node) to collect and record the occurrence of the behaviour; the following table describes each behaviour and how it is to be observed.
D. Calculations
After observing the behaviours in the videos, we created the following calculated fields from our observations:
• Stability duration = presentation duration − movement duration
• Eye focus duration = presentation duration − eye focus loss duration
• Self-talking duration = presentation duration − read from slide or note duration

Fig. 2 Video snapshot (Brief presentation)
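The calculated fields defined above follow directly from the recorded durations. A minimal sketch, with all durations in seconds and illustrative values (the variable names are ours, not from the paper's tooling):

```python
# All durations in seconds; the values here are illustrative.
presentation_duration = 300
movement_duration = 12          # summed movement occurrences
eye_focus_loss_duration = 25    # time spent looking away from the camera
read_duration = 60              # reading from slide, note or projector

# Calculated fields as defined in the paper:
stability_duration = presentation_duration - movement_duration
eye_focus_duration = presentation_duration - eye_focus_loss_duration
self_talking_duration = presentation_duration - read_duration

print(stability_duration, eye_focus_duration, self_talking_duration)
# 288 275 240
```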
Fig. 4 Speech Loudness

Fig. 5 Appearance in video & PAEI roles

Fig. 6 Appearance in video & VARK style

V. CONCLUSION
From the results in the findings section, we can conclude that students with an Administrator role have the longest presentation duration, while students with an Integrator role have the shortest. With respect to VARK styles, the longest presentation duration is associated with students classified as Kinaesthetic, and the shortest with students classified as Aural. Regarding the occurrence of behaviours, we obtained the following important indicators:
• The most self-talking students are those with an Administrator role in the PAEI roles and a Multimodal style in the VARK model.
• The highest average value for eye focus behaviour is for students with a Producer role in the PAEI roles and a Visual style in the VARK model.
• The largest average speech loudness is for students with an Administrator role in the PAEI roles and a Visual style in the VARK model.

From figure 9, we find that the most frequent combination pattern between VARK and PAEI is (Producer & Visual). We also found that the following combinations did not appear at all in our research:
• Read\Write & Integrator
• Read\Write & Multirole
• Multimodal & Multirole

VI. REFERENCES
[1] N. Dimitrova, H.-J. Zhang, B. Shahraray, I. Sezan, T. Huang and A. Zakhor, "Applications of video-content analysis and retrieval," IEEE MultiMedia, vol. 9, no. 3, p. 14, 2002.
[2] Y. Mallawarachchi, K. Ashangani, K. U. Wickramasinghe and D. W. De Silva, "Semantic Video Search by Automatic Video Annotation using TensorFlow," in Manufacturing & Industrial Engineering Symposium 2016, Colombo, 2016.
[3] Y. Nakamura, M. Ozeki and Y. Ohta, "Human Behavior Recognition for an Intelligent Video Production System," in Advances in Multimedia Information Processing, Third IEEE Pacific Rim Conference on Multimedia, Hsinchu, Taiwan, 2002.
[4] M. Sanderson, J. S. Pedro and S. Siersdorfer, "Automatic Video Tagging using Content Redundancy," in The 32nd Annual ACM SIGIR, Boston, Massachusetts, USA, 2009.
[5] L. Shao, "Generic Feature Extraction for Image/Video Analysis," in IEEE International Symposium on Consumer Electronics, Petersburg, Russia, 2006.
[6] W. Hu, N. Xie, L. Li and X. Zeng, "A Survey on Visual Content-Based Video Indexing and Retrieval," IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), vol. 41, no. 6, p. 23, 2011.
[7] T. J. Saleem and S. Mushtaq, "Human Gesture Analysis Based on Video," International Journal of Advanced Research in Computer Science, vol. 9, no. 1, p. 5, 2018.
[8] T. Breuel and R. Paredes, "Fast Discriminative Linear Models for Scalable Video Tagging," in International Conference on Machine Learning and Applications, Miami, Florida, USA, 2009.
[9] M. Worsley and P. Blikstein, "Towards the development of learning analytics: Student speech as an automatic and natural form of assessment," Annual Meeting of the American Education Research Association (AERA), p. 22, 2010.
[10] M. Bande, A. Stojanova, N. Stojkovikj, M. Kocaleva, B. Zlatanovska and C. Martinovska-Bande, "Application of VARK learning model on 'Data structures and algorithms' course," in IEEE Global Engineering Education Conference (EDUCON), Athens, Greece, 2017.
[11] D. J. Lamb, D. Al-Jumeily, A. J. Hussain and M. Alghamdi, "Assessing the Impact of Web-Based Technology on," in Sixth International Conference on Developments in eSystems Engineering, Abu Dhabi, United Arab Emirates, 2013.
[12] I. Adizes, "Organizational passages—Diagnosing and treating lifecycle problems of organizations," Organizational Dynamics, vol. 8, no. 1, p. 23, 1979.
[13] H. Shiva and T. Hassan, "Study of Conflict between Adizes's Leadership Styles and Glasser's Five Basic Needs," Mediterranean Journal of Social Sciences, vol. 7, no. 3, p. 8, 2016.
[14] N. Yokoya, A. Tejero-de-Pablos and Y. Nakashima, "Human action recognition-based video summarization for RGB-D personal sports video," in IEEE International Conference on Multimedia and Expo, Seattle, WA, USA, 2016.
[15] L. Guan, N. Khan and R. Tan, "Real-Time System for Human Activity Analysis," in IEEE International Symposium on Multimedia, 2017.
[16] R. M. Silva, C. C. Figueroa, T. P. Rubilar and F. S. Díaz, "An Adaptive E-Learning Platform with VARK Learning Styles to Support the Learning of Object Orientation," in IEEE World Engineering Education Conference, Buenos Aires, Argentina, 2018.
[17] S. Tauroza and D. Allison, "Speech Rates in British English," Applied Linguistics, vol. 11, no. 1, pp. 90-105, 1990.
Behaviour | Variant | Description | How it is calculated
Movement | Moving | The presenter starts to move his body by changing his leg position on the ground. | Calculated manually by counting the number of occurrences, assuming each occurrence takes 1 second.
Body Pose | Front | Default behaviour: the presenter faces the camera or audience with his body. | Assumed as default; calculated automatically by subtracting moving duration from total presentation duration.
Body Pose | Side | The presenter moves his body away from the camera or panel so that one shoulder is not shown. | Calculated manually when the body is in a side position; movements made to check the projector are neglected, as these are already counted in reading mode.
Face Expression | Normal | Default behaviour: the presenter shows a normal facial expression. | Assumed as default; calculated automatically by subtracting smile duration from total presentation duration.
Face Expression | Happy (Smile) | The presenter shows a positive expression such as happiness, smiling or relaxing (assuming he is not looking at his team for presentation purposes). | Calculated manually by number of occurrences, assuming each occurrence takes half a second.
Eye Contact | In | Default behaviour: the presenter is looking at the camera. | Assumed as default; calculated automatically by subtracting focus-out-of-camera duration from total presentation duration.
Eye Contact | Out | The presenter looks away from the camera while presenting (assuming he is not looking at his team for presentation purposes and not reading a slide from paper or notes). | Calculated manually by counting the number of occurrences, assuming each occurrence takes 1 second.
Reading Method | Self-Talking | Default behaviour: the presenter talks without any external support. | Assumed as default; calculated automatically by subtracting reading-from-note-and-projector duration from total presentation duration.
Reading Method | Note/Projector | The presenter reads from a note in his hand or from the projector. | Duration calculated manually by recording the duration of each occurrence.
Pauses while Presentation | Pauses_Count | The student pauses while presenting; a pause is counted when he stops talking for more than 3 seconds without any interruption from the team or supervisor. | Calculated manually during the observation process.
Speech Loudness | Level | Measured using audio analysis software. | Calculated as a percentage after converting decibels into a magnitude; the higher the value, the clearer and more understandable the voice.
Speech Rate (Pace) | Fast/Moderate/Slow | Fast: > 190 wpm (words per minute); Moderate: 150 - 190 wpm (a comfortable pace); Slow: < 150 wpm [17]. | Calculated by transcribing voice into text, then computing the number of words per minute.
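The loudness and pace rules described above can be sketched as follows. Only the wpm thresholds come from the paper (via [17]); the dB-to-magnitude conversion and the reference floor level are our assumptions for illustration.

```python
def pace_label(words: int, minutes: float) -> str:
    """Classify speech rate using the thresholds from the table [17]."""
    wpm = words / minutes
    if wpm > 190:
        return "Fast"
    if wpm >= 150:
        return "Moderate"
    return "Slow"

def loudness_percent(db: float, floor_db: float = -60.0) -> float:
    """Map a (negative) dB level onto a 0-100 scale relative to an
    assumed floor level; higher means clearer, more audible speech."""
    magnitude = 10 ** (db / 20)   # dB -> linear magnitude
    floor = 10 ** (floor_db / 20)
    top = 1.0                     # 0 dB reference
    return 100 * (magnitude - floor) / (top - floor)

print(pace_label(170 * 3, 3.0))   # 170 wpm -> prints "Moderate"
```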
Movement # | Hand Gestures | Eye Focus (Out) # | Reading from Note # | Speech Loudness (dB) | Speech Pace
0 | Moderate | 14 | 10 | -31.4 | Moderate
3 | Good | 2 | 12 | -21.29 | Moderate
0 | Bad | 6 | 24 | -24.56 | Moderate
0 | Moderate | 6 | 80 | -22.97 | Moderate
0 | Bad | 1 | 103 | -27.58 | Moderate
0 | Good | 1 | 16 | -26.96 | Moderate
0 | Good | 0 | 50 | -21.7 | Moderate
0 | Bad | 3 | 57 | -28.9 | Moderate
0 | Moderate | 1 | 33 | -25.91 | Moderate
0 | Good | 1 | 17 | -22.45 | Moderate
0 | Good | 0 | 37 | -22.77 | Moderate
1 | Good | 0 | 35 | -23.79 | Moderate
3 | Good | 3 | 7 | -24.02 | Moderate
5 | Moderate | 2 | 3 | -25.49 | Moderate
2 | Moderate | 1 | 96 | -22.96 | Moderate
2 | Moderate | 2 | 28 | -21.83 | Moderate
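Summary statistics over rows like those above can be computed straightforwardly. The values below are copied from the table; note that the per-group averages reported in the conclusion would additionally require the PAEI/VARK columns, which are not shown here.

```python
# Speech loudness values (dB) from the 16 table rows above.
loudness_db = [-31.4, -21.29, -24.56, -22.97, -27.58, -26.96, -21.7, -28.9,
               -25.91, -22.45, -22.77, -23.79, -24.02, -25.49, -22.96, -21.83]

mean_loudness = sum(loudness_db) / len(loudness_db)
print(round(mean_loudness, 2))  # -24.66
```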
Fig. 7 PAEI roles Vs. Behaviours
An overview of Digital Forensics Education
Georgios A. Dafoulas
Computer Science Department, Middlesex University, London, United Kingdom
g.dafoulas@mdx.ac.uk

David Neilson
Computer Science Department, Middlesex University, London, United Kingdom
g.dafoulas@mdx.ac.uk
Abstract—This paper follows an initial review conducted as part of an EU-funded Erasmus+ project under the programme of Capacity Building in Higher Education. The original paper focused on providing the state of the art in undergraduate computer forensic programmes [22]. This work is part of a study towards the EU-funded Pathway in Forensic Computing (FORC) project. FORC aims to address the challenges in information society development concerning Cyber Security and privacy in a world oriented towards e-technologies. The project meets the regional needs of the Middle East area by responding to the current and emerging cyber security threats by educating the IT and Legal professionals in the field of e-crime, thus supporting development of e-based economics, life and society in partner countries. The work is funded under project reference number 574063-EPP-1-2016-1-IT-EPPKA2-CBHE-JP, Grant Agreement 2016 – 2556 / 001 – 001. In this paper we focus on the second work package of the project (WP-2), aiming to 'establish a forensic computing pathway', and the first task for this work package, aiming at 'defining pathway objectives, learning outcomes, and career perspective'. In this second paper we focus more on postgraduate programmes, online provision, and career pathways. We also consider the availability of programmes in the US, while emphasis is given more to digital forensic education.

Keywords—Digital Forensics Education, Computer Forensics, Emerging trends in computer forensics, Computer Science Education, Curriculum Development, Curriculum Design

I. INTRODUCTION
The aim of this paper is to provide a review of the current provision in digital forensics. The study is based on a preliminary state-of-the-art review of current best practice in the field of Computer Forensics, mainly at an undergraduate level. The original work attempted to identify the learning outcomes of programmes specialising in computer forensics, with or without a cyber security specialism. The scope of the study is to provide a useful reference point for colleagues who are in the process of curriculum development in digital forensics.

The work focused on identifying programmes of study in the field of Computer Forensics in the UK, US and EU, and on reviewing the state of affairs in curriculum design and development in the field. The following steps were followed:
• A literature review of curriculum design and development practices.
• An investigation into Forensic Computing programmes in the UK, EU and US.
• An analysis of the programme structure for Forensic Computing programmes.
• An analysis of modules taught in undergraduate programmes.
• A discussion on learning outcomes at programme and module level.
• A review of state-of-the-art tools and techniques used in such programmes.

This second paper extends the study further, as it emphasises the fundamental concepts and key topics that are common in postgraduate curricula. The additional study also provides the means to understand how the sector has seen the development of online programmes and the skillsets that seem to be viewed as critical for recruitment in the relevant industry.

II. LITERATURE REVIEW
There seems to be a limited source of relevant works in the field, as there are not that many researchers investigating the undergraduate and postgraduate curricula in digital forensics. Anderson et al [1] provide a brief comparison between the British and German models of programme structure, with emphasis on teaching forensics at University level. The authors conclude that the British curriculum design approach appears to be the more mature of the two. This has affected this investigation, as it focused more on UK provision of Computer Forensics programmes.

Bashir and Campbell [2] correctly identify that "digital forensics education curriculum needs to be developed by taking into consideration the need for students to be aware of the multiplicity of field specializations". In their efforts to design a Computer Forensics curriculum for a US institution, they identified as a major challenge providing students "with the greatest depth of knowledge of a particular aspect of a field that encompasses a wide range of technical topics".

According to Conklin et al [3], a shift to Knowledge Unit (KU) based cyber security education is beginning. In their paper they illustrate the relationship of training to education, identifying two-year associate degrees preparing technicians, network operators and system administrators, four-year bachelor's degrees producing analysts and engineers, and master's degrees focusing on risk management and management specialists.

Cooper et al [4] provide a series of illustrations that help visualise the relationship between digital forensics and other computing principles. They mention the ACM/IEEE Joint Task Force for Computing Curricula in their effort to map the domain. They conclude with a number of areas where greater emphasis is needed for digital forensics, as follows: (i) networking, (ii) information security, (iii) systems administration, (iv) electronics, (v) mathematics and statistics, (vi) ethics, (vii) criminology, (viii) forensics science and (ix) law and legal issues. As early as 2001, the DFRWS report [5] identified a number of areas "as valid
handling of digital evidence and forensic investigations.
• Generic knowledge of computer and IT, e.g. data storage, operating systems, file systems and computer networks.

We also recommended that the programme learning outcomes are distinguished in the following four categories:
• Knowledge and understanding
• Cognitive (thinking) skills
• Practical skills
• Graduate skills

These programme-level learning outcomes should be aligned to module-level learning outcomes that would describe in more detail the achievement of a student who successfully completes each module, ideally demonstrating the full experiential learning cycle as described by Kolb's learning style model. The learning, teaching and assessment strategy of the FORC programme should be in line with Bloom's taxonomy and make full use of the learning pyramid as suggested by the National Training Laboratories, Bethel, Maine.

V. POSTGRADUATE PROVISION
Our current investigation focused on postgraduate programmes, and also on an analysis of how many of such programmes are available online. We primarily focused on UK and US programmes, as discussed below.

A. US Provision
As mentioned already, we only managed to investigate a specific part of the educational provision in digital forensics in the US. The selected providers and their postgraduate programme titles are listed below:
1. George Mason University – Digital Forensics and Cyber Analysis
2. University of South Florida – Cybersecurity (Digital Forensics)
3. University of Alabama Birmingham – Computer Forensics and Security Management
4. University of Maryland University College – Digital Forensics and Cyber Investigation
5. John Jay College of Criminal Justice – Digital Forensics and Cybersecurity
6. University of Central Florida – Digital Forensics (Online)
7. Champlain College – Digital Forensics (Online)
8. Stevenson University – Cyber Forensics (Online)
9. Utica College – Cybersecurity (Computer Forensics) (Online)
10. University of Maryland – Digital Forensics and Cyber Investigation (Online)
11. Sam Houston State University – Digital Forensics (Online)
12. George Mason University – Digital Forensics and Cyber Analysis (Online)
13. Capitol Technology University – Cybersecurity (Online)
14. University of Detroit Mercy – Information Assurance (Cybersecurity) (Online)
15. Norwich University – Information Security & Assurance (Online)
16. Edinburgh Napier University – Advanced Security & Digital Forensics (Online)
17. Stratford University – Digital Forensics (Online)
18. Capella University – Information Assurance and Cybersecurity (Digital Forensics) (Online)
19. DeSales University – Digital Forensics (Online)
20. University of New South Wales Canberra – Cyber Security (Digital Forensics) (Online)
21. University of the Sunshine Coast – Cyber Investigations and Forensics (Online)
22. Edith Cowan University – Cyber Security (Online)
23. Auckland University of Technology – Information Security and Digital Forensics (Online)

From the 23 institutions, 18 appear to have online provision in the field. This is a very interesting finding, as there is evidence that more institutions are able to shift tuition in digital forensics and related subjects online. This is despite the technical nature of such courses and the need for using specialised software. The modules taught appear in figure 1 at the end of the paper. The most popular module topics are (i) cybersecurity foundations, (ii) network forensics, (iii) legal and ethical issues, (iv) research project, (v) digital forensics analysis and (vi) crime scene investigation.

B. UK Provision
The following list provides all the postgraduate programmes we could identify in UK Higher Education Institutions (HEIs). In the UK there are only two programmes that are offered both online and on campus. The list of available modules is included in figure 2 at the end of the paper.
1. University of East London – Information Security and Digital Forensics
2. University of Greenwich – Computer Forensics and Cyber Security
3. University of Salford – Cyber Security, Threat Intelligence and Forensics (MSc)
4. Edinburgh Napier University – Advanced Security and Digital Forensics (Online and on campus)
5. Middlesex University – Electronic Security and Digital Forensics
6. Canterbury Christchurch University – Digital Forensics and Cybersecurity (MSc by Research)
7. De Montfort University – Professional Practice in Digital Forensics and Security (Online and on campus)
8. University of Westminster – Cyber Security and Forensics
9. University of Bedfordshire – Computer Security and Forensics
10. University of South Wales – Computer Forensics
11. University of Portsmouth – Forensic Information Technology
12. Leeds Beckett University – Computer Forensics and Security (MEng)
13. Coventry University – Forensic Computing
14. University of Derby – Digital Forensics and Computer Security
15. Teesside University – Digital Forensics and Cyber Investigations

In the UK the most popular module topics appear to be (i) network security, (ii) information security and risk management, (iii) incident response, (iv) crime scene investigation and (v) cyber security.

VI. DISCUSSION
From the analysis of the programmes we came across, we can discuss a number of findings. It appears that this field of education, although highly specialised, still attracts a significant number of students. It is also an attractive study choice for mature students, as it appears that a significant number of professionals in the field have no formal education. Increasingly, the need for professional standards and benchmarks pushes individuals towards postgraduate study to ensure compliance with a sector that is likely to be more regulated in the future.

There is also a concern about the appearance of online courses and programmes in the field. Although these provide a suitable option for professionals who wish to study remotely, there is a concern whether these programmes can be taught online. There is a significant proportion of highly specialised software and specific techniques that are difficult to teach remotely.

The use of forensic science, forensic computing and digital forensics as search keywords makes it difficult for applicants to identify relevant courses that can be easily compared. Cyber security appears to form a significant proportion of most programmes, affecting the balance of programme learning outcomes in certain provisions.

In the US several providers tend to offer Associate of Technical Arts degrees. These concentrate on a particular skill or trade, are generally seen as equivalent to the first two years of a bachelor's degree, and therefore have fewer options and more generalised content for module topics. Furthermore, most institutions tend to introduce more specialised modules after the first two years of study, and mostly in the final year.

Due to the wide variety of topic names and the different ways in which a topic can be represented, we have placed some of the more specific or rare module titles into more generic categories to enable better comparison. For example, "Principles of Cybersecurity" (University of Detroit Mercy) was placed under "Cybersecurity Foundations". Similarly, "Wireless Network Security" was simply moved under "Network Forensics". The results we present are not intended to be used as a distinct quantitative analysis, but rather as data presented in a way that allows the overall trends to be detected. Another example of the above point is the University of Detroit Mercy, which provides a module called "Secure Acquisition". This again is included, but under the column marked "Digital Evidence Management" and also under digital media forensics, as it is assumed that the content will be very similar.

Another issue is that very few universities had offerings in a module entitled cybercrime, where the focus may be the types of crime and methods that are used online. This does seem a strange omission, but it could be due to many of these types of crime being discussed in other modules, and also due to the fact that the topic starts to veer towards the subject of criminology.

There also appears to be very little provision in terms of programming and scripting when compared with undergraduate programmes. The underlying assumption is that these skills are already in place at this level.

An interesting finding from our wider searches is that institutions in Australia appear to have less focus on the legal side, whereas this is a much more prominent feature in the USA. It could be argued that this is due to the latter being a more litigious society, but we could not find a supporting reference or resource to support such a statement. It is also noticeable that relatively little coverage seems to be given to mobile forensics in the US when compared with their Australian counterparts.

VII. CONCLUSIONS
In our paper we extended our original study to include a wider view of digital forensics, covering both undergraduate and postgraduate programmes in the UK and US. We discussed our main findings in terms of the prominent modules offered for postgraduate study and the reasons behind such curriculum design choices.

ACKNOWLEDGMENT
FORC is funded by the European Commission under the Erasmus+ funding stream. Project reference number 574063-EPP-1-2016-1-IT-EPPKA2-CBHE-JP. Grant Agreement 2016 – 2556 / 001 – 001.

REFERENCES
[1] Anderson, P., Dornseif, M., Freiling, F.C., Holz, T., Irons, A., Laing, C., & Mink, M. 2006. A Comparative Study of Teaching Forensics at a University Degree Level. IMF, 116-127.
[2] Bashir, M., & Campbell, R. 2015. Developing a Standardized and Multidisciplinary Curriculum for Digital Forensics Education.
[3] Conklin, W.A., Cline, R.E., & Roosa, T. 2014. Re-engineering Cybersecurity Education in the US: An Analysis of the Critical Factors. HICSS.
[4] Cooper, P., Finley, G.T., & Kaskenpalo, P. 2010. Towards standards in digital forensics education. ITiCSE-WGR '10.
[5] DFRWS. 2001. A Road Map for Digital Forensic Research: Collective work of all DFRWS attendees. Proceedings of The Digital
Forensic Research Conference DFRWS 2001 USA, Utica, NY (Aug 7th-8th).
[6] Gorgone, J.T., Gray, P., Stohr, E.A., Valacich, J.S., & Wigand, R.T. 2006. MSIS 2006: Model Curriculum and Guidelines for Graduate Degree Programs in Information Systems. SIGCSE Bulletin, 38, 121-196.
[7] Gottschalk, L., et al. 2005. Computer Forensics Programs in Higher Education: A Preliminary Study. Proceedings of the 36th SIGCSE Technical Symposium on Computer Science Education, St. Louis, Missouri, Feb. 23-27, 2005, pp. 147-151.
[8] Dathan, B., Fitzgerald, S., Gottschalk, L., Liu, J., & Stein, M. 2005. Computer forensics programs in higher education: a preliminary study. SIGCSE.
[9] Hawthorne, E.K., & Shumba, R.K. 2014. Teaching Digital Forensics and Cyber Investigations Online: Our Experiences. European Scientific Journal, September 2014, special edition, Vol. 2. ISSN: 1857-7881.
[10] Bashir, M., Campbell, R., DeStefano, L., & Lang, A. 2014. Developing a new digital forensics curriculum. Digital Investigation, 11, S76-S84.
[11] Liu, J. 2016. Developing an Innovative Baccalaureate Program in Computer Forensics. 36th ASEE/IEEE Frontiers in Education Conference, S1H-1.
[12] Manson, D., Carlin, A., Ramos, S., Gyger, A., Kaufman, M., & Treichelt, J. 2007. Is the open way a better way? Digital forensics using open source tools. 40th Annual Hawaii International Conference on System Sciences (HICSS 2007), pp. 266b-266b. IEEE.
[13] Nance, K., Armstrong, H., & Armstrong, C. 2010. Digital Forensics: Defining an Education Agenda. Hawaii International Conference on System Sciences (HICSS), pp. 1-10. IEEE.
[14] Karie, N.M., & Venter, H.S. 2014. Toward a general ontology for digital forensic disciplines. Journal of Forensic Sciences, 59(5), 1231-1241.
[15] Raghavan, S., & Raghavan, S.V. 2013. A study of forensic & analysis tools. Eighth International Workshop on Systematic Approaches to Digital Forensic Engineering (SADFE), pp. 1-5. IEEE.
[16] Ekstrom, J.J., Lunt, B.M., & Rowe, D.C. 2011. The role of cyber-security in information technology education. SIGITE Conference.
[17] Sabeil, E., Manaf, A.B.A., Ismail, Z., & Abas, M. 2011. Cyber Forensics Competency-Based Framework – A Review. International Journal on New Computer Architectures and Their Applications (IJNCAA), 1(3), 991-1000. The Society of Digital Information and Wireless Communications. ISSN: 2220-9085.
[18] Srinivasan, S. 2013. Digital Forensics Curriculum in Security Education. Journal of Information Technology Education: Innovations in Practice, Vol. 12.
[19] Tu, M., Dianxiang, X., Wira, S., Balan, C., & Cronin, K. 2012. On the Development of a Digital Forensics Curriculum. Journal of Digital Forensics, Security and Law, 7(3), 13-32.
[20] TWGETDF. 2007. Technical Working Group for Education and Training in Digital Forensics. West Virginia University Forensic Science Initiative.
[21] Dittrich, D., Garfinkel, S., Kearton, K., Lee, C.A., Lant, N., Russell, A., & Woods, K. 2011. Creating Realistic Corpora for Security and Forensic Education. ADFSL Conference on Digital Forensics, Security and Law, 2011, 123-134.
[22] Dafoulas, G., Neilson, D., & Sukhvinder, H. 2017. State of the Art in Computer Forensic Education – A Review of Computer Forensic Programmes in the UK, Europe and US. 2017 International Conference on New Trends in Computing Sciences (ICTCS), Amman, Jordan.
Fig. 1. List of modules taught in US postgraduate programmes
Enhancing International Virtual Collaborative
Learning with Social Learning Analytics
Alexander Clauss
Chair of Wirtschaftsinformatik, esp. Information Management, TU Dresden, Dresden, Germany
alexander.clauss@tu-dresden.de

Florian Lenk
Chair of Wirtschaftsinformatik, esp. Information Management, TU Dresden, Dresden, Germany
florian.lenk@tu-dresden.de

Eric Schoop
Chair of Wirtschaftsinformatik, esp. Information Management, TU Dresden, Dresden, Germany
eric.schoop@tu-dresden.de
Abstract—The ability to work collaboratively in intercultural virtual teams is constantly gaining importance for the labour market. Virtual Mobility enables students to acquire the necessary intercultural teamwork skills while remaining locally integrated into their regular studies. But still, international virtual collaborative learning scenarios demand much time and effort for planning and coordination, which binds resources. The support concepts for such collaborative virtual learning groups are also resource-intensive, because learners should be accompanied by qualified e-tutors to optimise learning results both at individual and group level. Classical summative tests and exams are rather unsuitable for the assessment of collaboration as an expected learning outcome. These arrangements also need new formative assessment forms, as participants need active and ongoing feedback. A meaningful assessment of learning processes and outcomes should not only be based on the observation of 'soft' factors but should also be complemented by 'hard', fixed, automatically measurable, quantitative indicators. To gain these hard indicators, the research project ISLA - Indicator-based Social Learning Analytics was launched. This paper presents the procedure for implementation, as well as virtual presence, content creation and relationships within the community as the first derived indicators, and their prototypical visualisation in a Learning Analytics Dashboard.

Keywords—Collaborative Learning; Virtual Mobility; Social Learning Analytics; Learning Analytics Dashboard

I. INTRODUCTION
Working conditions are shifting more and more, especially in the field of knowledge work. Modern Information and Communication Technology (ICT) leads to a decline in the importance of centralised, local, limited workplaces. At the same time, the ability to work collaboratively in decentralised, intercultural, interdisciplinary teams is gaining importance [1]. The preparation of students for these changing working conditions is a major challenge for Higher Education (HE) [2]. Despite its high importance for the labour market, the importance of gaining core competencies for international virtual collaborative work is not yet reflected extensively in Higher Education curricula [1].

International physical mobility of students is associated with high costs and linked to a variety of external factors. Deficiently implemented internationalisation strategies, limited financial support, and legal and administrative restrictions are just a few examples of the typical challenges of physical mobility [3]. The continuous development of ICT has made a major contribution to the growing possibilities of Virtual Mobility (VM). "VM facilitates intercultural experiences of students and their staff through […] the set-up of an international learning community whereby staff and students acquire interpersonal and intercultural skills" [4]. VM enables students to gain the necessary intercultural teamwork competencies while remaining locally integrated into their regular studies, at a lower cost compared to physical mobility [5].

A proven implementation of Virtual Mobility are Virtual Collaborative Learning (VCL) arrangements; these focus on the virtual classroom to include geographically separated learners in a project-based social learning experience [6]. These have been used since 2001 in over 60 mostly international learning collaborations at the authors' chair of Wirtschaftsinformatik - Information Management. International VCL arrangements are characterised by intensive interaction between participants. Tawileh [5] states that VCL has a "considerable potential to be implemented as a flexible, attractive, and cost-effective modality for virtual mobility that brings authentic international activities to the domestic classrooms". The aim of the arrangement is to transfer group learning into the virtual room. Small international, interdisciplinary groups with around five participants work on realistic cases for five to seven weeks, in a social network using social media tools. The overriding learning objective is the student-centred development of professional, personal, communication and media skills aiming for successful international collaboration, which is necessary for a well-prepared entry into the knowledge-intensive, interconnected working world [7]. The learners are accompanied by qualified e-tutors to realise formative assessment and maximise learning results both at individual and group level [8].

The implementation of formative assessment has a significant influence on teaching and learning settings. "Formative evaluation includes all activities of the teacher and/or the learner that provide information that can be used as feedback, to modify teaching and learning activities" [9]. The general aim is to recognise and respond to students' learning in order to improve it during the learning process [10].

This requires a changed assessment culture, which should be characterised by new forms of examination that go beyond the assessment of individual performance, such as group assessments with individual components. These new assessment forms can only be implemented objectively, purposefully and with legal certainty if they are embedded in a new assessment culture that evaluates not only final results but also learning processes. Wollersheim and Pengel [11] emphasise that "like summative assessments, formative
102
an SQL database and visualised with the help of a network representation. In addition, a systematisation in form of a mind map was created to show the connections between evaluation, both formative and summative, and processes and their analysis possibilities. Based on these detailed findings, the data-driven and demand-oriented analysis was started in the second work package.

B. Definition of indicators for successful virtual collaboration

The second work package was aimed at the definition of the indicators for successful collaboration that should be monitored, to operationalise the VCL's overriding learning objective, collaboration. In the first step, a systematic literature review was carried out to provide a broad theoretical basis. This systematic literature review analysed success factors and obstacles of virtual collaboration and provided a focus for the further development of indicators. On the one hand, these indicators helped to identify promising behaviour patterns in the context of social learning analytics. On the other hand, obstacles should be identified to be able to recognise problems early and to define early warning indicators. Further indicators were derived from the observation sheets mentioned above, which were continuously refined in the course of our own research [12], [13]. The aim was to achieve a systematic, controlled derivation procedure from the topic of the course to the expected learning outcome - the ability to collaborate - to provide a basis for indicator-based social learning analytics.

C. Collection and processing of interaction data

The third work package aimed at collecting and processing interaction data and correlating identified indicators with available data that can be used to assess and analyse learners' performance. To operationalise the aforementioned defined indicators further, existing learner data from both completed and running instances on the used elgg platform were evaluated, to be able to support the formative assessment of the determined factors with the help of data traces from the database of the virtual learning platform. Subsequently, the database was also exploratively examined. This data-driven analysis from two views resulted in 23 database queries using the database language SQL.

D. Testing the data-driven evaluation and visualisation of indicators

The fourth work package focussed on testing the data-driven evaluation of the indicators and on developing a mock-up for indicator-based data provisioning and visualisation. For this purpose, meaningful data on user activities and interactions with learning content as well as between learners on the virtual platform had to be identified, recorded, processed and made available in an understandable form on the basis of digital traces. The testing took place during the project, but could only be operated by the project team. Therefore, the evaluations of the database were not visible ad-hoc for the e-tutors. The analyses using SQL queries were carried out without any problems and delivered meaningful results.

III. RESULTS

In the following it will be described how the results of the research project were used to develop a dashboard which supports e-tutors through Social Learning Analytics in the formative accompaniment of learners and which supports supervisors in objectifying their assessment. In the context of a master thesis, a prototypical implementation on elgg was realised [21].

For this purpose, 35 database tables were analysed in an explorative way. 14 empty, 9 irrelevant and 12 potentially relevant tables were identified. 9 of the 12 empty tables were old log tables, which once served as a backup and were later replaced by a newer version. The other three tables were intended for functions such as API or georeferencing, but are not yet implemented or used. Tables were defined as irrelevant if they only served to create and structure the course or platform or contained metadata that was not relevant for SLA purposes. The identified indicators are summarised in the three categories: virtual presence, virtual interaction and virtual relation. In the following, these categories and detailed descriptions of indicators are shown.

A. Virtual presence

The number of visits on the VCL platform can be compared to the physical presence in a traditional course [5]. The indicators were analysed in order to map the virtual presence in a data-driven way. This allows the analysis of the following questions:

• How often is a participant present on the platform compared to other learners?

• How has the activity of the users changed over the course of the project and the different assignments?

Logins of all participants: The first indicator that reflects student engagement in a virtual classroom is the number of logins. To display the activity history, the database query can be started several times within certain intervals. If the amount falls below a predefined value over a certain period of time, the e-tutor should ask the group for reasons and intervene if necessary.

Average number of logins per group: The presence within the VCL course can also be analysed from a group perspective. The potential to trigger the e-tutors to intervene in case of insufficient activity within a certain period of time is also the main purpose of the indicator. A strictly summative view of the logins can lead to inaccurate conclusions if there are fewer members in some groups than in others. For this reason, average values of logins were used across all groups.

Total login duration of participants: As an alternative indicator, an attempt was made to calculate the total login duration of the participants. This would provide a higher-quality statement regarding the presence in the VCL event, because a high number of logins does not indicate how long a user was active on the learning platform. In the elgg database, all successful logins are stored in a table, but only the manual logouts. If a participant simply closes the corresponding browser window, it cannot be traced when they left the elgg platform. A reliable solution that measures whether the platform is currently visible in the active browser window of the participant and whether the participant has made active inputs is currently being developed.

B. Content creation

The previous indicators only described the unproductive virtual presence on the platform. The following provide information on the tools used, i.e. the actual productive
activity on the learning platform. The content types analysed in our case are published blog posts, remarks on blog posts, comments, remarks on comments, discussion topics created, remarks on discussion topics, discussion posts, remarks on discussion posts, chat messages, direct messages, comments on tasks and the sum of all content. The mere numerical representation of the results, however, allows the answering of several relevant questions for the evaluation of the communication between the participants, for example:

• Which communication tool is used the most/least and thus has the highest/lowest acceptance among the participants?

• Which participants communicate most/least?

• Do participants actively participate in the discussion with group members?

• Is there a continuous communication over a longer period of time?

The same content types can also be aggregated in an elevated fashion: at group level. This not only reveals the differences within a group or between all participants of the event. It also gives the opportunity to compare the groups with each other. So as to avoid any misrepresentation of the values - equivalent to the logins - average communication tools used per group were used as indicators. By weighting the individual items, it is now possible to create indices at individual and group level, which can, for example, provide information on the extent of collaboration within the group. Table I shows an exemplary weighting of the individual content types. It should be emphasised that this is only an example. Weighting of the different indicators should be both evidence-based and adapted to the expected learning outcomes of the course. This well-founded weighting of indices is currently a further focus of our research activities.

TABLE I. EXEMPLARY WEIGHTING OF CONTENT TYPES

Content type                     Weighting
published blog posts             0.5
comments on blog posts           0.2
discussion topics created        0.4
comments on discussion topics    0.1
discussion posts                 0.3
…                                …

C. Relationships within the community

In a social network, the focus lies on the interaction of participants and their networks. A clear indicator in this context is the number of friends the participants have in different groups and the total number of friends. From this, basic insights can be derived. In the early phase, it becomes visible whether the participants in their team are networked properly. The recommended threshold value is therefore the defined group size plus e-tutors and supervisors. In addition, it is also possible to determine which "informal relationships of friendship" exist beyond group boundaries. In addition, e-tutors have the opportunity to investigate reasons for isolated team members. Above all, a sudden drop in the number of friends within the team during the course can serve as a clear indication of possible conflicts that require the e-tutor's attention.

D. Dashboard development

In the next step, the indicators and indices developed in the master thesis were integrated and combined into a first draft of a dashboard in the course of a master seminar thesis [22]. The analysis was based on the standardised observation sheet for e-tutors provided by our chair and on the preliminary results described before.

The analysis platform was created step by step with the insights gained from the systematic literature review, the accessible data basis and the observation sheet. The dashboard focuses directly on e-tutors and supervisors; for this reason, the structure of the dashboard is based on the observation sheets. Figure 1 schematically illustrates the basic structure of the Learning Analytics Dashboard.

Fig. 1. Basic structure of the Learning Analytics Dashboard [22]

The result was a summary page with the observation criteria and eight linked dashboards. The main page offers a clear and user-friendly interface for e-tutors. It represents the linking between the criteria of the observation sheet and the different dashboards. By selecting the criteria to be considered, they are linked to the corresponding dashboard. Consequently, it is very clear and easy to use and facilitates the formative assessment. The dashboards are kept very simple and can be modified or extended at any time. The individual dashboards contain various visualisations such as simple tables, stacked column or bar charts, pie charts or maps. Stacked column or bar charts have proven to be the best visualisation option. They are characterised by the fact that they usually display as much data as possible without losing their clarity, and require as little space as possible. For the team criteria we prefer using stacked column diagrams. For individual criteria the stacked bar diagrams have been sufficient. Pie diagrams and maps were used as extensions of dashboards that still had enough space. Simple tables were used to view mere data. All mentioned types of visualisation were additionally extended with adequate colour combinations to refer to certain values or simply to generate better clarity.

As long as the structure of the database does not change, the SQL queries and consequently the dashboards work. With a click on the update button, the queries are executed again and the dashboards are updated as well. Figure 2 shows an exemplary screenshot from the numerical overview for discussion contributions of the participants, in group comparison, as a single page of the Learning Analytics Dashboard.
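The virtual-presence indicators described above amount to SQL queries against the platform database. A minimal sketch of the login-count and group-average indicators, with an invented stand-in schema (the real elgg tables and columns differ), could look like:

```python
import sqlite3

# Invented stand-in schema for the learning platform's login log;
# the real elgg database uses different table and column names.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE logins (user_id INTEGER, group_id INTEGER, login_time INTEGER);
    INSERT INTO logins VALUES
        (1, 1, 100), (1, 1, 200), (2, 1, 150),
        (3, 2, 120), (3, 2, 130), (3, 2, 140), (4, 2, 160);
""")

# Logins of all participants: number of logins per user.
per_user = dict(conn.execute(
    "SELECT user_id, COUNT(*) FROM logins GROUP BY user_id"))

# Average number of logins per group: totals divided by group size,
# so that groups with fewer members remain comparable.
per_group = {
    gid: total / members
    for gid, total, members in conn.execute(
        "SELECT group_id, COUNT(*), COUNT(DISTINCT user_id)"
        " FROM logins GROUP BY group_id")
}

# Early-warning check: a group average below a predefined value should
# prompt the e-tutor to ask the group for reasons and intervene.
THRESHOLD = 2.0
flagged = [gid for gid, avg in per_group.items() if avg < THRESHOLD]
print(per_user, per_group, flagged)
```

Re-running such queries at fixed intervals yields the activity history mentioned above; the dashboard's update button simply re-executes them against the live database.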
Fig. 2. Numerical overview for discussion contributions of the participants, in group comparison in the Learning Analytics Dashboard
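The index construction behind Table I is a weighted sum of a participant's content-type counts. A small sketch, using the exemplary weights from Table I and invented counts:

```python
# Exemplary weights taken from Table I; a real course would calibrate
# these against the expected learning outcomes.
WEIGHTS = {
    "published blog posts": 0.5,
    "comments on blog posts": 0.2,
    "discussion topics created": 0.4,
    "comments on discussion topics": 0.1,
    "discussion posts": 0.3,
}

def collaboration_index(counts):
    """Weighted sum of content-type counts for one participant or group."""
    return sum(WEIGHTS.get(ctype, 0.0) * n for ctype, n in counts.items())

# Invented example counts for one participant:
counts = {"published blog posts": 2, "discussion posts": 5,
          "comments on blog posts": 3}
print(collaboration_index(counts))  # 2*0.5 + 5*0.3 + 3*0.2 = 3.1
```

Applied to per-group averages instead of individual counts, the same function yields the group-level index.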
[7] A. Clauss, “How to Train Tomorrow’s Corporate Trainers – Core
Competences for Community Managers,” in 2018 17th
International Conference on Information Technology Based Higher
Education and Training (ITHET), 2018, pp. 1–8.
[8] A. Clauss, F. Lenk, and E. Schoop, “Digitalisation and
Internationalisation of Learning Processes in Higher Education: A
best practices report,” in Proceedings of the 13th Iranian and 7th
International Conference on e-Learning and e-Teaching (ICeLeT
2018), 2019.
[9] P. Black and D. Wiliam, “Assessment and classroom learning,” Assess.
Educ., vol. 5, no. 1, pp. 7–74, 1998.
[10] B. Cowie and B. Bell, “A Model of Formative Assessment in Science
Education,” Assess. Educ. Princ. Policy Pract., vol. 6, no. 1, pp.
101–116, 1999.
[11] H. W. Wollersheim and N. Pengel, “Von der Kunst des Prüfens -
Assessment literacy,” HDS.Journal - Perspekt. guter Lehre, vol. 2,
pp. 14–32, 2016.
[12] M. Rietze, “Analysing eCollaboration: Prioritisation of Monitoring
Criteria for Learning Analytics in the Virtual Classroom,” pp.
2110–2124, 2016.
[13] M. Rietze, “Monitoring eCollaboration - Preparing an Analysis
Framework,” 2016.
[14] F. Lenk, “Virtual Social Learning Environments – a Cybernetic System?
Towards a Decision Support System,” in 2018 17th International
Conference on Information Technology Based Higher Education
and Training (ITHET), 2018, pp. 1–5.
[15] A. Bandura, Social learning theory. Prentice Hall, 1977.
[16] J. Robes, “Social Learning zwischen Management, Unternehmenskultur
und Selbstorganisation,” Wirtschaft Beruf Zeitschrift für berufliche
Bild., vol. 66, pp. 20–25, 2014.
[17] M. S. Reed, A. C. Evely, G. Cundill, I. Fazey, J. Glass, and A. Laing,
“What is Social Learning?,” Ecol. Soc., 2010.
[18] S. B. Shum and R. Ferguson, “Social learning analytics,” Proc. 2nd Int.
Conf. Learn. Anal. Knowl. - LAK ’12, vol. 15, p. 23, 2012.
[19] J. Kay, P. Reimann, E. Diebold, and B. Kummerfeld, “MOOCs: So Many
Learners, So Much Potential … What Is a MOOC?,” pp. 2–9, 2013.
[20] D. T. Tempelaar, A. Heck, H. Cuypers, H. van der Kooij, and E. van de
Vrie, “Formative assessment and learning analytics,” p. 205, 2013.
[21] S. Kretzschmar, “Entwicklung und Evaluation von Indikatoren für Social
Learning Analytics am Beispiel eines Virtual Collaborative
Learning Kurses in Elgg,” Technische Universität Dresden, 2018.
[22] C. Krebs, “Datengetriebenes Feedback: Erstellung und Implementation
einer Plattform zur Datenanalyse mittels Power BI für E-Tutoren,”
Technische Universität Dresden, 2019.
Evaluation of Students’ Acceptance of the
Leap Motion Hand Gesture Application in
Teaching Biochemistry
Nazlena Mohamad Ali
Institute of Visual Informatics (IVI)
Universiti Kebangsaan Malaysia
43600, Bangi, Selangor, Malaysia
nazlena.ali@ukm.edu.my

Mohd Shukuri Mohamad Ali
Faculty of Biotechnology and Biomolecular Sciences
Universiti Putra Malaysia
43400 Selangor, Malaysia
mshukuri@upm.edu.my
Abstract— This paper presents an early stage of the Leap Motion controller regarding user acceptance in the teaching and learning process. The Leap Motion is a new device for a hand-gesture-controlled user interface. For appropriate evaluation, a novel experiment and questionnaire were created utilizing 35 Biochemistry undergraduate students in Enzymology from the Universiti Putra Malaysia. The subjects participated in the user experiment and performed several tasks, such as rotating, translating and zooming in and out on the molecules. The tasks were performed using the Molecules application on an Airspace platform. The research compared the performance of Leap Motion with mouse interaction. As a result, 79.2% of the respondents gave a positive opinion about the Leap Motion because of its ease of use, acceptance, effectiveness and accuracy. These students were excited and looked forward to implementing the Leap Motion in class. Thus, the Leap Motion controller can potentially be used as a teaching tool for a better learning experience of the biomolecule.

I. INTRODUCTION

Gesture-based interaction represents a fundamental and universal form of nonverbal communication that has an essential role in the human-computer relationship. Gesture-based technology offers a natural way of interaction, thus contributing to the key area of engagement [1]. Users generally prefer and are excited to use multimodal interaction, which provides users with the freedom and flexibility to choose the best inputs for specific tasks. Whether users are pointing to select an object out of a group, putting five fingers down to shut down the computer or curling fingers to zoom in or out on an image, gestures play an important role in developing technology with no mouse and less touch. Gesture-based interaction will help teachers and students actively communicate in the classroom.

The Leap Motion controller (Fig. 1) is a small device that allows users to control the computer by gesturing with their hands and fingers in mid-air (Leap Motion, http://www.leapmotion.com). The Leap Motion controller works effectively, capturing any motion in its workspace and translating it to the computer. Leap Motion does this through an array of camera sensors that monitor a 1 cubic foot workspace. Leap Motion is also extremely accurate (to 0.01 mm) and can distinguish between 10 fingers and track them individually. This device is a drastic change from one hand on a mouse or two-finger pinch-to-zoom on novel trackpads and smartphones. By moving 10 fingers in the workspace, users can communicate with a computer in many more ways than with other devices [2].

Fig. 1. The Leap Motion controller relative size

Leap Motion has received great attention in recent years because of its many applications, including gaming, robotics, education and medicine. Research by [3] developed a game for hand rehabilitation using the Leap Motion controller for a more effective rehabilitation process. Another work on Leap Motion in an educational environment was carried out by [4]: a hands-on field experiment to verify the feasibility of using gesture control for computer free-hand drawing by elementary students. The experimental results and statistical evidence suggested that Leap Motion could operate elementary free-hand drawing. [5] explored Leap Motion's feasibility in educational usage. Leap Motion can be considered in applications in educational fields. An investigation regarding elementary students was conducted to assess their technology use and theory of planned behaviour, or TPB. These students showed significant potential in using this new gesture input device.

To demonstrate the easy-to-use and human-friendly control, [6] applied and programmed the controller to change the display settings of three-dimensional objects. Leap Motion applied easily to educational and medical imaging. In an
The sensitivity and speed of the mouse were kept consistent during the entire test session.

Setting up Leap Motion is straightforward. A user plugs one end into the laptop and the other end into the controller. Then, the user positions Leap Motion where it can see his or her hands, i.e., in front of a laptop or between a desktop keyboard and the screen gallery. When plugged in, the green LED on the front of the device and the infrared LEDs beneath the top plate light up. In the present experiment, the Leap Motion controller was placed on a table. The placement was marked to ensure no undesired movement of the device. Moreover, a video camera was placed in front of the respondents to record their facial expressions. Each student must rotate, zoom and pan the molecule using a mouse and Leap Motion. The students' facial expressions indirectly showed whether the input device was exciting or boring.

III. RESULTS AND DISCUSSION

Of the 35 students involved in the Leap Motion controller user evaluation, 82.86% (n=29) were female and 17.14% (n=6) were male. As shown in Table I, all males (n=6) in this study were Malays. Table I also shows that 5.71% (n=2) of the females were Chinese, 2.86% (n=1) were Indian and 2.86% (n=1) were Kadazan. The ages of both males and females ranged from 18-24 years old.

TABLE I. DEMOGRAPHIC PROFILE. THE DATA DISTRIBUTIONS INCLUDE GENDER, RACE AND AGE.

Demographic profile   Male (n=6)   Female (n=29)   Total (n=35)
Age:
18-24                 6 (17.14)    29 (82.86)      35 (100)
Race:                                              35 (100)
Malay                 6 (17.14)    25 (71.43)
Chinese                            2 (5.71)
Indian                             1 (2.86)
Other                              1 (2.86)

… gesture devices. Gestures are defined as any physical movement, large or small, that can be interpreted by any motion sensor.

TABLE II. DEMOGRAPHIC PROFILE OF COMPUTER LITERACY AND MOLECULAR GRAPHIC AND GESTURE DEVICE EXPERIENCE.

Aspects                        1         2          3          4          5
Computer literacy              0 (0.0)   0 (0.0)    10 (28.6)  15 (42.9)  10 (28.6)
Molecular Graphic Experience   1 (2.9)   8 (22.9)   9 (25.7)   10 (28.6)  7 (20.0)
Gesture Device Experience      3 (8.6)   7 (20.0)   6 (17.1)   15 (42.9)  4 (11.4)

Notes: 1, poor; 2, bad; 3, ok; 4, good; 5, excellent.

Table III shows the evaluation performed on hand gesture-based interaction, which uses Leap Motion and a mouse input device. The evaluation shows the comparison of perceived usefulness, ease of use and acceptance towards the Leap Motion controller and mouse. Most of the subjects agreed (n=21) and strongly agreed (n=8) that they would like to use Leap Motion during class; for the mouse, only 40% (n=14) agreed and 14.3% (n=5) strongly agreed. Furthermore, 57.1% (n=20) of the subjects agreed and 28.6% (n=10) strongly agreed that Leap Motion would help them understand molecules better. Five students disagreed and said that a mouse would help them understand molecules better. Based on the recorded video (figure not shown), the students looked focused and excited regarding the given task. This approach excites the students and makes them enjoy the class.
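The percentages quoted throughout this section are simple shares of the 35-participant sample, rounded to one decimal in the tables (two decimals in the demographics). A trivial sketch of the conversion:

```python
N = 35  # number of participants in the evaluation

def pct(n, total=N, decimals=1):
    """Share of respondents as a percentage of the sample."""
    return round(100.0 * n / total, decimals)

print(pct(29))              # female share: 82.9 (82.86 with two decimals)
print(pct(19))              # e.g. the "agree" count for several items: 54.3
print(pct(1, decimals=2))   # a single participant out of 35: 2.86
```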
TABLE III. EVALUATION OF HAND GESTURE BASED AND MOUSE INTERACTION. THE COMPARATIVE EVALUATION INVOLVES THE ANALYSIS ON PERCEIVED USEFULNESS, EASE OF USE AND ACCEPTANCE TOWARDS THE LEAP MOTION CONTROLLER AND MOUSE. STUDENTS MAJORLY ACCEPT THE USE OF LEAP MOTION CONTROLLER APPLICATION.

Cells give counts (percentages) for ratings 1-5.

Aspects | Leap Motion: 1, 2, 3, 4, 5 | Mouse: 1, 2, 3, 4, 5
Using device during class | 0 (0.0), 1 (2.9), 4 (11.4), 21 (60.0), 8 (22.9) | 0 (0.0), 1 (2.9), 15 (42.9), 14 (40.0), 5 (14.3)
Easy to use | 0 (0.0), 1 (2.9), 8 (22.9), 18 (51.4), 7 (20.0) | 0 (0.0), 2 (5.7), 12 (34.3), 13 (37.1), 8 (22.9)
Need technical support | 0 (0.0), 3 (8.6), 13 (37.1), 10 (28.6), 8 (22.9) | 11 (31.4), 11 (31.4), 7 (20.0), 3 (8.6), 3 (8.6)
Inconsistency | 1 (2.9), 14 (40.0), 14 (40.0), 5 (14.3), 0 (0.0) | 5 (14.3), 11 (31.4), 9 (25.7), 8 (22.9), 2 (5.7)
… quickly | 0 (0.0), 0 (0.0), 2 (5.7), 22 (62.9), 10 (28.6) | 0 (0.0), 0 (0.0), 7 (20.0), 14 (40.0), 14 (40.0)
Cumbersome to use | 1 (2.9), 6 (17.1), 17 (48.6), 9 (25.7), 1 (2.9) | 4 (11.4), 6 (17.1), 18 (51.4), 4 (11.4), 3 (8.6)
Confident when using the device | 0 (0.0), 1 (2.9), 7 (20.0), 18 (51.4), 8 (22.9) | 0 (0.0), 0 (0.0), 9 (25.7), 17 (48.6), 9 (25.7)
Understand molecules better | 0 (0.0), 0 (0.0), 4 (11.4), 20 (57.1), 10 (28.6) | 0 (0.0), 5 (14.3), 12 (34.3), 12 (34.3), 5 (14.3)
… device longer | 3 (8.6), 13 (37.1), 8 (22.9), 6 (17.1), 4 (11.4) | 8 (22.9), 14 (40.0), 7 (20.0), 4 (11.4), 2 (5.7)
Very effective | 0 (0.0), 1 (2.9), 4 (11.4), 19 (54.3), 10 (28.6) | 0 (0.0), 5 (14.3), 15 (42.9), 10 (28.6), 5 (14.3)
Very accurate | 0 (0.0), 0 (0.0), 11 (31.4), 19 (54.3), 4 (11.4) | 0 (0.0), 2 (5.7), 16 (45.7), 13 (37.1), 4 (11.4)
A mouse would be expected to receive high responses in ease of use because the students are more familiar with using it: 37.1% (n=13) agreed and 22.9% (n=8) strongly agreed. However, Leap Motion received a higher percentage in this category than the mouse: 51.4% (n=18) agreed and 20% (n=7) strongly agreed. Nevertheless, the students still needed technical support to use Leap Motion even though they considered it easy to use, because only 8.6% (n=3) disagreed with the statement "I would need the support of a technical person to use this gestural interface". In contrast, 62.8% (n=22) of the subjects did not need technical support to use a mouse. In addition, the subjects found that they learned both Leap Motion and a mouse very quickly.

Of the respondents, 42.9% (n=15) disagreed regarding inconsistency in the Leap Motion controller, and no subject strongly agreed. Meanwhile, 45.7% (n=16) of the subjects disagreed that a mouse had excessive inconsistency, but two strongly agreed. Furthermore, 10 subjects agreed that Leap Motion was cumbersome to use. In contrast, the mouse was convenient because only 7 subjects agreed that a mouse was cumbersome to use.

Most of the subjects felt confident using a mouse and Leap Motion, but only one subject felt unconfident using Leap Motion. This confidence is a good response to a gestural interface that is new on the market. Furthermore, the subjects felt fatigued after using Leap Motion for a long time. Using Leap Motion for approximately a half-hour can feel like an arm workout. Continually holding the hands up is not desk-friendly behaviour. Although fatigue could detract from its benefits, Leap Motion is quiet, accurate and effective during teaching and learning. Of all of the subjects, 54.3% (n=19) agreed and 28.6% (n=10) strongly agreed that Leap Motion is very effective when implemented in class. In contrast, only 28.6% (n=10) of the subjects agreed and 14.3% (n=5) strongly agreed regarding the effectiveness of using a mouse in class. Considering the last statement, "This gestural interface is very accurate", 54.3% of the subjects (n=19) agreed and 11.4% (n=4) strongly agreed. For the statement "This mouse is very accurate", 37.1% (n=13) of the respondents agreed and 11.4% (n=4) strongly agreed.

Overall, 79.2% of the subjects gave a positive opinion of Leap Motion because of its ease of use, acceptance, effectiveness and accuracy. The subjects were interested in using Leap Motion again in the future.

Based on Table IV, more than 97.1% of the respondents considered the Leap Motion and mouse easy to use. All the subjects were excited using Leap Motion and some were bored using the mouse. The Leap Motion controller is more efficient, speedier and more relaxed than the mouse for 97.2% of the respondents. Of all the respondents, 91.5% considered the Leap Motion and mouse stable. Furthermore, they concluded that both devices were accurate.
TABLE IV. SEMANTIC DIFFERENTIAL SCALES FOR THE LEAP MOTION AND MOUSE.

Assessment aspects   Leap Motion   Mouse
Easy                 97.1          97.2
Difficult            2.9           2.8
Exciting             100           74.2
Mellow               0             25.8
Efficient            97.2          88.6
Personable           2.8           11.4
Speedy               97.2          94.3
Methodical           2.8           5.7
Relaxed              97.2          88.6
Intense              2.8           11.4
Pleasant             85.8          88.6
Unpleasant           14.2          11.4
Stable               91.5          91.5
Volatile             8.5           8.5
Accurate             91.4          91.4
Inaccurate           8.6           8.6

IV. CONCLUSION

In general, the subjects gave positive feedback regarding the introduction of the Leap Motion to teaching and learning. The positive reactions and acceptance during the evaluation might have occurred because of the Hawthorne effect. Instead of a traditional method (mouse), people tend to prefer the unique and attractive device when introduced to new technology (Leap Motion). A Leap Motion application can provide more intuitive interactions. This scenario presents avenues for further investigation. Most subjects gave positive feedback regarding Leap Motion.

"I think it's good if you know how to use it. It is very efficient in showing the molecules as opposed to using a mouse. It saves time and might even get the students' attention in class. Truthfully, I'm not into all this enzyme-protein thing but this device managed to grab my attention (a little)." [PID 170176]

"I think it is very exciting to use this technique in teaching and learning. It corresponds with the growth of technology & science. I think this will create interest for students to learn about structural biochemistry, which used to be considered a lame subject. I'm looking forward to this technique being used in future lessons." [PID GS38981]

"Hand gestures can enhance both student and lecturer experience during teaching and learning. They make it easier to navigate the molecule, rotate it and zoom in and out. Hand gestures are more reliable and easier compared to a mouse." [PID 168820]

"I would use it during class because it is interesting and new. People will be more interested during class. We must move our bodies with gestures, but it is not a burden or tiresome at all." [PID 169901]

"I think it is a fun and interesting way to teach students about molecular structure. It is awesome and new. Students will find it easier to learn a complicated molecular structure. It also makes learning fun and relaxing. It's amazing!!!" [PID 169082]

Despite the positive feedback, some negative feedback was also given.

"It would be very nice and interesting if the system is upgraded so that it can detect gestures more quickly. It is slow like I just experienced, and it is better not to use it in teaching and learning." [PID 166760]

"It is good. But sometimes it is not very efficient because we need to control our hand gestures." [PID 169426]

"There are pros and cons. The use of hand gestures may attract students to the subject because of the technology. However, to achieve this goal, hand gestures would be my last choice." [PID 169119]

The Leap Motion controller undoubtedly represents a revolutionary input device for gesture-based human-computer interaction. In this study, we evaluated the controller to introduce new technology in teaching and learning systems. Based on the results and overall experience, we conclude that the Leap Motion controller should be applied to biochemistry classes. The Leap Motion could receive more attention if the sensory space and inconsistent sampling frequency were improved.

Teaching with technology can deepen student learning by supporting academic objectives. However, the best technology must be selected while not losing sight of the goals of student learning. Gesture-based technology might be a better choice for teaching and learning than typing or moving a mouse. Gestures are universal and more natural than operating a keyboard or mouse. Gestures could also be a valuable tool in maintaining and focusing students' attention and promoting an interactive classroom.

Based on the results from the experiments and questionnaire, we can conclude that the new Leap Motion input device possesses huge potential for use during lecture sessions. Based on the attitudes toward Leap Motion, the respondents were very excited to use Leap Motion in the future compared with mouse interaction. Leap Motion could make the class more appealing and efficient.

REFERENCES
[1] M. S. A. Rahman, N. M. Ali, and M. Mohd, “Natural user interface for children: From requirement to design,” in Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2017.
[2] F. Weichert, D. Bachmann, B. Rudak, and D. Fisseler, “Analysis of the accuracy and robustness of the Leap Motion Controller,” Sensors (Switzerland), 2013.
[3] M. Alimanova et al., “Gamification of hand rehabilitation process using virtual reality tools: Using leap motion for hand rehabilitation,” in Proceedings - 2017 1st IEEE International Conference on Robotic Computing, IRC 2017, 2017.
[4] T. Yang, K. Miao, and J. Hung, “Gesture control in education for a young student,” Comput. Technol. Mod. Educ., pp. 44–54, 2014.
[5] L. Kuo, Y. HJ, M. Ho, S. Su, and H. Yang, “Assessing a new input device for an educational computer,” Mod. Comput. Appl. Sci. Educ., pp. 114–116, 2014.
[6] D. Huszar, L. Kovacs, A. Palffy, and A. Horvath, “Application of
three dimensional gesture control for educational and medical
purposes,” Budapest Peter Pazmy Cathol. Univ. Fac. Inf.
Technol. Bionics., 2013.
[7] S. Deb and T. Nama, “Interactive boolean logic learning using
leap motion,” in Proceedings - IEEE 9th International
Conference on Technology for Education, T4E 2018, 2018.
[8] F. L. Nainggolan, B. Siregar, and F. Fahmi, “Anatomy learning
system on human skeleton using Leap Motion Controller,” in
2016 3rd International Conference on Computer and Information
Sciences, ICCOINS 2016 - Proceedings, 2016.
[9] V. Silva, S, Eduardo., Anderson, Jader., Henrique, Janiel.,
Teichrieb and G. Ramalho, “A Preliminary Evaluation of the
Leap Motion Sensor as Controller of New Digital Musical
Instruments,” Cent. Inform., 2013.
[10] J. Guna, G. Jakus, M. Pogačnik, S. Tomažič, and J. Sodnik, “An
analysis of the precision and reliability of the leap motion sensor
and its suitability for static and dynamic tracking,” Sensors
(Switzerland), 2014.
[11] J. Coelho and F. Verbeek, “Pointing Task Evaluation of Leap
Motion Controller in 3D Virtual Environment,” Creat. Differ.
Proc. Chi Sparks 2014 Conf., 2014.
[12] L. E. Potter, J. Araullo, and L. Carter, “The leap motion
controller: A view on sign language,” in Proceedings of the 25th
Australian Computer-Human Interaction Conference:
Augmentation, Application, Innovation, Collaboration, OzCHI
2013, 2013.
[13] H. M. Robinson, Emergent computer literacy: A developmental
perspective. 2008.
112
Designing and Implementing an e-Course Using
Adobe Captivate and Google Classroom: A Case
Study
Shahd Alia
Department of Computer Information Systems
University of Jordan
Amman, Jordan
shahdalia94@gmail.com

Dr. Thair Hamtini
Department of Computer Information Systems
University of Jordan
Amman, Jordan
thamtini@ju.edu.jo
Abstract—Nowadays, Learning Management Systems (LMS) are widely used across academic institutions. They are not restricted to online and distance courses but are also useful during, or in addition to, face-to-face learning sessions. This study took place at the University of Jordan as an attempt to evaluate the acceptance level of Google Classroom, one of the most popular LMS platforms. The experiment used a course designed in Adobe Captivate 2019 that teaches LaTeX, a high-quality typesetting system with features designed for the production of technical and scientific documentation, helping researchers focus on the content of their research and worry less about the structure of the document. To measure the satisfaction of learners and teachers with Google Classroom, a modified and extended version of the Technology Acceptance Model (TAM) was used. The results showed that participants felt comfortable taking a course through Google Classroom and agreed on its high effectiveness as a learning management system.

Index Terms—Learning Management System (LMS), E-learning, Google Classroom, LaTeX, Overleaf, Technology Acceptance Model (TAM)

I. INTRODUCTION

Google Classroom is a new cloud-based product in Google Apps for Education (GAFE) [1]. This product aims to give teachers more time to teach and students more time to learn. Unlike traditional teaching and learning approaches, which are teacher-centered, time-consuming, and inflexible, Google Classroom is an interactive LMS that gives students and teachers the ability to ask questions, comment, and give feedback. It also eliminates the time needed to get ready and travel to campus; since many students and teachers may live in distant areas, this Google product saves their time and keeps the concentration on the teaching and learning process. As for flexibility, Google Classroom not only allows teachers to provide material and extra examples or resources, but is also a free, easy-to-use platform that lets teachers create classes, distribute assignments, track student progress, and grade and send feedback in a way that is much more flexible and efficient than the traditional ways that rely on paper for everything. Since it is cloud-based, it also offers unlimited storage capacity. This study proposes a modified Technology Acceptance Model (TAM) [2] to analyze the effectiveness and acceptance of Google Classroom using a course designed in Adobe Captivate 2019. The course targets beginner researchers, students, and anyone interested in writing a scientific document according to the standards. The platform that is the subject of this course is Overleaf, a very easy-to-use, beneficial text-editing platform based on LaTeX [3]. The learning process is very practical, and it is hard to teach this topic in a theoretical, abstract way; this study therefore measures learners' understanding of such a practical topic using Google Classroom. The rest of the paper is organized as follows: the next section provides a review of related work, followed by the research questions, sampling methods, instrument, a brief description of the topic discussed in this e-course, and the research methodology. The results and findings are then explained and summarized.

II. RELATED WORK

A vast majority of people have come to prefer online learning over the traditional, tedious ways of learning in the last twenty years. Hence, most countries have directed their efforts towards building a successful, robust online education system for schools and for higher education. One journey worth examining is that of the United States with online education. Allen and Seaman noted in their book [4] that the number of higher education students taking at least one online course increased steadily over a ten-year period. Their study started in fall 2002 with less than 10% of students taking at least one online course; the percentage kept increasing until it reached 32% in fall 2011, which shows how important and beneficial online education was, even in its beginnings. Allen and Seaman continued to track the progress of online education in the United States through their reports [5], [6] all the way from 2011 to 2016. In 2018, three researchers decided to study the impact of the new Online Master of Science in Computer Science (OMSCS) offered by the Georgia Institute of Technology (Georgia Tech), to answer
180 countries, including CalTech, Stanford, MIT, Harvard, and Brown. Many researchers and students have started to use Overleaf in writing their theses and research papers, which is why the goal of this course is to be a guide that teaches beginners the basics of Overleaf. All the topics discussed were kept very simple in order to match the understanding of beginners and help them start their first LaTeX document; therefore, no advanced commands or code were covered in this course, just the basics of every scientific document.
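To illustrate the scope just described, the basics the course covers (document structure, tables, figures, equations, and referencing) fit in a minimal Overleaf-ready document along the following lines. This sketch is not taken from the course itself; the title, author, and bibliography key are invented for the example:

```latex
\documentclass{article}
\usepackage{graphicx}  % needed for \includegraphics

\begin{document}

\title{My First Paper}
\author{A. Student}
\maketitle

\section{Introduction}
A sentence with a citation \cite{knuth}.

\begin{table}[h]
  \centering
  \begin{tabular}{lc}
    Aspect & Mean \\
    Ease of use & 4.2 \\
  \end{tabular}
  \caption{A small table.}
\end{table}

\begin{figure}[h]
  \centering
  % example-image is a placeholder graphic shipped with TeX Live
  \includegraphics[width=0.5\linewidth]{example-image}
  \caption{A small figure.}
\end{figure}

\begin{equation}
  E = mc^2
\end{equation}

\bibliographystyle{plain}
\bibliography{refs}  % assumes a refs.bib file with a "knuth" entry

\end{document}
```

Compiling this on Overleaf produces a titled article with one table, one figure, one numbered equation, and a reference list, which is essentially the end-to-end workflow the six lessons walk through.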
This course contains six lessons, starting with an overview of the environment, then the most important reserved words in Overleaf and the job of each reserved word, then how to add tables, figures, and equations to your document; the last lesson covers referencing and citation.

The design of this course was done using Adobe Captivate 2019, because of the great enhancements added to this release. The following are the teaching and learning strategies followed in the design of this course:
• The design of the course depends mainly on the software simulation technique. Since the course aims to teach researchers how to use the Overleaf environment, the simplest way to design it was to simulate the real environment and directly show learners how the actual work is done.
• The text-to-speech feature provided by Adobe Captivate was used to keep the learners engaged, since many studies have shown that students learn better by listening. Also, according to cognitive learning theory [20] and the modality principle [21], it would be overwhelming for the learners if we added on-screen text to the software simulation. We need to minimize the chances of overloading the learners’ visual/pictorial channel and instead present verbal explanation as speech processed through the auditory/verbal channel.
• Pop colors and side notes were used when necessary to draw the learners’ attention to something, such as when it is time to try a command on Overleaf, or when we are about to explain something that needs the learner’s full attention.
• Red squares and arrows were used to show learners which part of the screen we are working on, since learners who are not familiar with the software environment might otherwise get lost.
• Learners were given real examples of published papers, so they can relate to and understand what the lesson is talking about.
• At the end of each lesson there was a test of knowledge, so learners could refresh their memories and make sure they understood the current lesson before moving on to the next one. These tests were designed with motivators and sound effects to keep the learner active and engaged, and to make the learning process more interesting.

Fig. 1 shows some screenshots of the course and the knowledge-test questions to help you understand the whole idea.

Fig. 1. Screenshots of the Lessons and the Knowledge Tests

V. SAMPLING

The sample was chosen carefully from the University of Jordan so that it contains both males and females of different ages and with varying demographic information, with and without an IT background, because not all researchers work in the IT field or are even interested in technology; a researcher might come from another field, such as biology or physics, and wish to learn how to use Overleaf. Therefore, Google Classroom and the design of the course must be simple, taking into consideration those learners with no IT background. The one thing that all the participants have in common is that they are either postgraduate students who are required to write papers and documents, undergraduate students who are interested in the research field, or master’s and PhD students who are preparing to write their theses and dissertations. The sample also included teachers, so that we can assess Google Classroom as an LMS from their point of view.

VI. INSTRUMENT

A questionnaire was developed to measure the learners’ satisfaction with Google Classroom’s features. This questionnaire contained three parts. The first part consisted of demographic questions, to ensure that the sample had participants with varying demographics. In the second part, we asked the participants how often they use the internet on a daily basis, to determine the level of information and communication technology (ICT) usage among them. The third part of the questionnaire has 23 questions that measure the opinion of the participants, after taking the course on Google Classroom, in the areas shown in Fig. 2.

Fig. 2. Areas Under Investigation
To measure the participants’ answers mathematically, we used a 5-point Likert scale [22] ranging from 1 (strongly disagree) to 5 (strongly agree). The answers were then analyzed and tabulated to make the data understandable and organized in a meaningful way, so that decisions and enhancements can be made based on those tables.

familiar with using technology (an aspect that is further tested in the next part of the questionnaire); we also have 8 participants with other jobs.

TABLE I
DEMOGRAPHIC QUESTIONS
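The Likert-scale tabulation described above amounts to averaging the 1-to-5 responses per questionnaire aspect. A minimal sketch follows; the aspect names and response values are invented for illustration, since the paper's raw responses are not published:

```python
from statistics import mean

# Hypothetical 5-point Likert responses (1 = strongly disagree ... 5 = strongly agree),
# one list of participant answers per questionnaire aspect.
responses = {
    "easy to sign up / log in": [5, 5, 4, 5, 4, 5],
    "easy to access course material": [4, 4, 5, 4, 4, 4],
}

# Tabulate the mean score per aspect, as done for Tables III-VII in the paper.
table = {aspect: round(mean(scores), 2) for aspect, scores in responses.items()}

for aspect, m in table.items():
    print(f"{aspect}: {m}")
```

With real data, the same loop yields exactly the "Aspect / Mean" rows reported in the results tables.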
lecturer was straight to the point and all ideas and concepts were explained clearly. Therefore, they find Google Classroom a successful medium for disseminating such courses. The highest score was for the feedback property: most of the participants agreed that this feature helped them receive the information much more clearly. This feature gives Google Classroom a competitive advantage over other learning management systems.

TABLE III
QUALITY OF INFORMATION DELIVERY

Aspect | Mean
The course idea and concepts were demonstrated clearly. | 3.62
The lecturer was straight to the point and delivered the needed information effectively. | 3.6
Participants found Google Classroom suitable for disseminating courses like this course. | 3.9
Course activities helped participants build basic knowledge of Overleaf. | 4.2
Participants found the feedback property in Google Classroom very useful. | 4.72

The next table, Table IV, summarizes the participants’ evaluation of communication and interaction in Google Classroom. The highest mean goes to Google Classroom being an open platform, which is a very important feature, since the whole idea of an LMS is to exchange knowledge and learn using such platforms. The lowest mean was for the second aspect: the participants did not agree much that they were active most of the time. This indicates that more motivation boosters are needed to keep learners engaged, and that we need to come up with new ideas to ensure that the learners are fully active and focused, and to make the learning process a bit more interesting.

TABLE IV
COMMUNICATION AND INTERACTION

Aspect | Mean
The class was open to new ideas, and participants could have contacted each other if they wanted to. | 4.1
Participants felt engaged in the learning process and active most of the time. | 3.59
Participants felt comfortable taking this course in a Google Classroom. | 3.81
Participants felt that the lecturer was available and easy to contact most of the time. | 4.23
Participants believe that Google Classroom is an open platform to exchange knowledge. | 4.72

Table V discusses the perceived usefulness; all mean values were above average again. The highest mean was for the first aspect: learners believed that Google Classroom is an excellent medium for e-learning. Participants also preferred, with a mean value of 4.2, Google Classroom’s way of submitting assignments. Most of the participants also liked Google Classroom’s organization and believed it helped them track their performance.

TABLE V
PERCEIVED USEFULNESS

Aspect | Mean
Participants believed that Google Classroom is an excellent medium for e-learning. | 4.63
Google Classroom is a very organized platform that helps students track their performance and understand their current situation in a particular topic. | 3.98
The assignment submission method in Google Classroom made it easier for students to submit assignments on time. | 4.2
Participants successfully achieved their course objectives. | 3.77

Table VI shows outstanding results regarding ease of use in Google Classroom. Most of the participants strongly agreed that it was easy to log in, join the class, navigate through the system, access the course material, and understand how each process is done using Google Classroom.

TABLE VI
PERCEIVED EASE OF USE

Aspect | Mean
It was easy for participants to sign up/log in. | 4.68
Participants did not face any trouble joining the class. | 4.47
It was easy to understand the system and navigate through it. | 4.19
It was easy to access the course material. | 4.17
It was easy to understand the method of assignment submission and feedback. | 3.94
The design of Google Classroom is user-friendly and everything is easy to find. | 4.12

The last table, Table VII, shows the level of users’ satisfaction with some aspects they tested during the course. Results showed that users were relatively satisfied with most of Google Classroom’s features. The lowest mean value, 3.5, was for the third aspect: not all learners believed that this method of learning is more interesting than face-to-face learning. Although Google Classroom focuses on being an open platform that allows announcements, commenting, and feedback, this was clearly not enough to make all learners believe that it is more interesting than traditional learning.

TABLE VII
USER’S SATISFACTION

Aspect | Mean
Participants would not have preferred to take this course in a normal class. | 3.66
Participants would recommend using Google Classroom in other courses. | 4.31
This method makes the learning process less boring. | 3.5
Participants preferred Google Classroom’s method of examination and assignment submission over the traditional (paper) way. | 3.8
VIII. CONCLUSION AND FUTURE WORK

This study shows that most of the participants are satisfied with the features of Google Classroom that were presented through this course. This demonstrates Google Classroom’s effectiveness as a learning management system (LMS) and makes it one of the leading LMSs expected to be widely used in the teaching and learning of various topics in the next few years. Google Classroom satisfied almost all the needs of any student: it allowed students to take classes, view material, check the teacher’s announcements and comment on them, submit assignments, track their progress in a specific topic, request feedback from the lecturer, and check their grades as updated by the teacher, all of this and more on one platform. This study also shows that the interactivity of Google Classroom has not yet reached the required level; it needs to motivate the learners more and keep them engaged in order to be an equal alternative to face-to-face learning in certain topics.

Although the initial results of the study are positive, for future work the number of participants needs to be increased to minimize the sampling error and reach more students and teachers at the University of Jordan. The instrument and course used in the experiment also need to be designed in a way that allows further analysis of the difference between teachers’ feedback and learners’ feedback. Most importantly, this method of learning needs to be applied to other topics to ensure Google Classroom’s effectiveness in all types of classes, areas, and topics, and to ensure that all expected users accept this platform for education.

REFERENCES
[1] M. E. Brown and D. L. Hocutt, “Learning to use, useful for learning: a usability study of Google Apps for Education,” Journal of Usability Studies, vol. 10, no. 4, pp. 160–181, 2015.
[2] P. Legris, J. Ingham, and P. Collerette, “Why do people use information technology? A critical review of the technology acceptance model,” Information & Management, vol. 40, no. 3, pp. 191–204, 2003.
[3] C. Hayes, “An introduction to LaTeX,” 2016.
[4] I. E. Allen and J. Seaman, Changing Course: Ten Years of Tracking Online Education in the United States. ERIC, 2013.
[5] I. E. Allen and J. Seaman, Grade Level: Tracking Online Education in the United States. ERIC, 2015.
[6] I. E. Allen and J. Seaman, Online Report Card: Tracking Online Education in the United States. ERIC, 2016.
[7] J. Goodman, J. Melkers, and A. Pallais, “Can online delivery increase access to education?” Journal of Labor Economics, vol. 37, no. 1, pp. 1–34, 2019.
[8] S. M. Jafari, S. F. Salem, M. S. Moaddab, and S. O. Salem, “Learning management system (LMS) success: An investigation among the university students,” in 2015 IEEE Conference on e-Learning, e-Management and e-Services (IC3e). IEEE, 2015, pp. 64–69.
[9] D. E. Marcial, J. M. N. Te, M. B. Onte, M. L. S. Curativo, and J. A. V. Forster, “LMS on sticks: Development of a handy learning management system,” in 2017 7th International Conference on Cloud Computing, Data Science & Engineering (Confluence). IEEE, 2017, pp. 782–787.
[10] S. Iftakhar, “Google Classroom: what works and how?” Journal of Education and Social Sciences, vol. 3, no. 1, pp. 12–18, 2016.
[11] I. N. M. Shaharanee, J. M. Jamil, and S. S. M. Rodzi, “Google Classroom as a tool for active learning,” in AIP Conference Proceedings, vol. 1761, no. 1. AIP Publishing, 2016, p. 020069.
[12] R. A. S. Al-Maroof and M. Al-Emran, “Students acceptance of Google Classroom: an exploratory study using PLS-SEM approach,” International Journal of Emerging Technologies in Learning (iJET), vol. 13, no. 6, pp. 112–123, 2018.
[13] W. Afthanorhan, “A comparison of partial least square structural equation modeling (PLS-SEM) and covariance based structural equation modeling (CB-SEM) for confirmatory factor analysis,” International Journal of Engineering Science and Innovative Technology, vol. 2, no. 5, pp. 198–205, 2013.
[14] K. Siegel, Adobe Captivate 2017: The Essentials. IconLogic, 2017.
[15] N. Marangunić and A. Granić, “Technology acceptance model: a literature review from 1986 to 2013,” Universal Access in the Information Society, vol. 14, no. 1, pp. 81–95, 2015.
[16] R. Saade, F. Nebebe, and W. Tan, “Viability of the ‘technology acceptance model’ in multimedia learning environments: a comparative study,” Interdisciplinary Journal of E-Learning and Learning Objects, vol. 3, no. 1, pp. 175–184, 2007.
[17] S. Alharbi and S. Drew, “Using the technology acceptance model in understanding academics’ behavioural intention to use learning management systems,” International Journal of Advanced Computer Science and Applications, vol. 5, no. 1, pp. 143–155, 2014.
[18] N. Fathema, D. Shannon, and M. Ross, “Expanding the technology acceptance model (TAM) to examine faculty use of learning management systems (LMSs) in higher education institutions,” Journal of Online Learning & Teaching, vol. 11, no. 2, 2015.
[19] F. Abdullah and R. Ward, “Developing a general extended technology acceptance model for e-learning (GETAMEL) by analysing commonly used external factors,” Computers in Human Behavior, vol. 56, pp. 238–256, 2016.
[20] S. Sepp, S. J. Howard, S. Tindall-Ford, S. Agostinho, and F. Paas, “Cognitive load theory and human movement: towards an integrated model of working memory,” Educational Psychology Review, pp. 1–25, 2019.
[21] J. Wang, K. Dawson, K. Saunders, A. D. Ritzhaupt, P. Antonenko, L. Lombardino, A. Keil, N. Agacli-Dogan, W. Luo, L. Cheng et al., “Investigating the effects of modality and multimedia on the learning performance of college students with dyslexia,” Journal of Special Education Technology, vol. 33, no. 3, pp. 182–193, 2018.
[22] A. Joshi, S. Kale, S. Chandel, and D. Pal, “Likert scale: Explored and explained,” British Journal of Applied Science & Technology, vol. 7, no. 4, p. 396, 2015.
The Importance of Institutional Support in
Maintaining Academic Rigor in E-Learning
Assessment
Darin El-Nakla
College of Business Administration
Prince Mohammad Bin Fahd University
Alkhobar, Saudi Arabia
delnakla@pmu.edu.sa

Beverley McNally
College of Business Administration
Prince Mohammad Bin Fahd University
Alkhobar, Saudi Arabia
bmcnally@pmu.edu.sa

Samir El-Nakla
College of Engineering
Prince Mohammad Bin Fahd University
Alkhobar, Saudi Arabia
snakla@pmu.edu.sa
Abstract—This paper reports on the perceptions of a group of academics regarding the role of higher education institutions in dealing with cheating in on-line assessments. A thematic approach to data collection and analysis was utilized. The findings showed that there was an ad-hoc approach to the issue of academic integrity and dealing with cheating. While institutional policies did exist, concerns were expressed as to their overall effectiveness. Additionally, faculty were not provided with sufficient training in the use of detection methods and of the available systems and processes to ensure academic rigour in relation to cheating in on-line assessments. The findings have implications for institutions in the development and implementation of academic misconduct policies.

Keywords—online, faculty, students, cheating, tools, plagiarism.

This research was funded by Prince Mohammad Bin Fahd University.
978-1-7281-2882-5/19/$31.00 ©2019 IEEE

I. INTRODUCTION

This paper reports on a small exploratory study conducted in a UK university. The study examined the perceptions of a group of faculty as to how academic dishonesty (cheating) can be minimized and academic integrity achieved and sustained when using on-line assessments. On-line instruction has been growing exponentially over the past two decades. For example, in 2002 a total of 1,602,970 students in higher education took at least one course online. By 2011 this had risen to 6,714,792 students taking one or more online classes [1]. Stack goes on to state that this represents an increase of 318.9%, or a 4.189-to-one ratio [1]. This signifies a three-fold increase in the level of on-line participation, from 9.6% to 32.0% in 2011 [2]. Consequently, it can be argued that there has been a corresponding increase in on-line assessment as a feature of distance and eLearning programs provided by tertiary (higher) education institutions [2]. This situation gave rise to the following research problem: How can higher education institutions support faculty in ensuring the incidence of cheating can be minimized?

As e-learning delivery has increased, on-line assessment has become more sophisticated, cost-efficient, and easy to use, making it more attractive to educators [2]. Therefore, a question is posited as to what extent it is possible to trust the results achieved. For the purposes of this study, the definition of on-line assessment proposed by Pachler, Daly, Mor, and Mellar, as cited in Baleni [3], was utilized:

“the use of ICT to support the iterative process of gathering and analysing information about student learning by teachers as well as learners and of evaluating it in relation to prior achievement and attainment of intended, as well as unintended learning outcomes”

Furthermore, when a student submits an online assessment, is it possible to prove that he/she wrote it themselves or that they truly understand the subject or material? There has been ongoing concern expressed by educationalists about the perceived increase in the incidence of student academic dishonesty [4, 5, 6]. Academic dishonesty is deemed to be any act of deception perpetrated by the student with the intent to misrepresent one’s learning achievement for evaluation purposes [7]. This is of particular concern for higher education institutions, as research has indicated that cheating increases with the age of the student through to age 25 [7, 8, 9, 10].

Additionally, there is a view that cheating is much easier in an online environment, as faculty and students are separated by time and space [6]. There is a lack of research examining academic misconduct related to cheating. Where research has been conducted, it indicates that formal warnings and student counselling are the most preferred means of controlling the prevalence of cheating [7]. However, this appears not to be as successful from the perspective of faculty as it could be; therefore, faculty members prefer more severe penalties for students involved in cheating. Faculty members are aware of the prevalence of different types of cheating strategies, but they fail to confront them due to lack of evidence [7].

Consequently, three key issues have been identified with regard to the use of on-line assessment [11]: first, the difficulty with synchronicity of assessments; second, security and the prevention of students hacking into the system to re-take the test; and third, collusion, where someone other than the student takes the assessment.

The study took place in a UK university under the auspices of the university’s research ethics policy. As an exploratory study, a mixed-methods approach was used to gather the data. The form and nature of the research questions indicated the need to employ different data sources. For example, studies that answer who, what, and when questions are more likely to be found in the quantitative domain. In order to attain a more in-depth
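The enrollment figures quoted in the introduction can be verified with a few lines of arithmetic; this is only a sanity check on the numbers cited from [1] and [2], not new data:

```python
# Enrollment figures for students taking at least one online course,
# as cited in the introduction.
students_2002 = 1_602_970
students_2011 = 6_714_792

# "a 4.189 to one ratio"
ratio = students_2011 / students_2002

# "an increase of 318.9%"
pct_increase = (students_2011 - students_2002) / students_2002 * 100

print(round(ratio, 3))         # 4.189
print(round(pct_increase, 1))  # 318.9
```

Both quoted statistics describe the same comparison: the percentage increase is simply the ratio minus one, expressed in percent.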
understanding of these questions, a researcher turns to the answering of how and why questions. Bryman [12] contends that the what, when, and where questions support the achievement of an understanding of the causes and effects of people’s actions, whereas the how and why questions allow for clarification of the underlying motivations or explanations of the behavior of the individual [13]. The use of questions of a how and why nature encourages the research participants to be self-reflective about their perceptions and views and about how they construct meaning from the situations they find themselves in.

Fig. 1. Incidents of cheating over the last five years

Convenience and purposive sampling was used to obtain the sample. The criteria for the sample were that the faculty member used Blackboard to conduct on-line assessments and was available to be interviewed. This resulted in a
sample size of six. The participants came from three
schools, Computing and Creative Technology, Business and • Question 2: Have you used your institution’s
Contemporary Science. process to deal with cheating?
Data collection involved face to face interviews of
approximately 30 minutes. The questionnaires were emailed to the participants prior to the interview in order to maximize the use of time. The questions were designed to elicit responses of both a qualitative and a quantitative nature. The qualitative questions sought to identify the participants' personal opinions regarding cheating in on-line tests. The responses to the quantitative questions were summarized in terms of the frequency of responses. The participants were also able to provide additional comments on these questions. The responses were analyzed using thematic analysis. The concurrent data collection and thematic analysis followed

IV. RESULTS AND DISCUSSIONS

A. Quantitative questions

• Question 1: Do you believe the incidence of cheating has increased or decreased in the last five years?

Two of the participants said that the incidence of cheating has increased, whereas four participants stated that it has stayed the same. The participants who said it has increased indicated that the detection methods and the university rules for cheating have to be reviewed and reconsidered to overcome cheating, especially in the on-line environment. The participants who stated cheating has stayed the same over the last five years indicated that if it had risen then urgent measures should be taken to combat this increase, see Fig. 1.

Five of the participants said that they have used the University process to deal with cheating. Only one participant has not used it before. This respondent was the one who stated he had not taken steps to prevent or identify cheating in his classes, see Fig. 2.

Fig. 2. Number of participants who used the institution policy against cheating

• Question 3: Please rate your satisfaction with the outcome(s) of the process.

One participant was satisfied with the institution's process; however, a lack of consideration and flexibility seems to occur. Two of the participants were not satisfied: one of them referred to the situation where the University has not prosecuted any of the students that have been caught cheating, whereas the other referred to the process as ridiculously formal, such that academic staff were frightened to use it. Three of the participants were satisfied with the institution's processes. However, they were emphatic in their view that there was a need for more training and development of Faculty, see Fig. 3.
Fig. 3. Rating the satisfaction of the university process

B. Qualitative Questions

• Question 1: Briefly describe up to three incidents where you have detected cheating in online assessments in your subjects.

Five of the six participants stated that they have not detected any cheating in online assessments; only one of the participants had detected cheating, as he caught some of the students attempting to use emails to communicate and share answers during the test. Consequently, this has been stopped and email is blocked from being used during the test. Despite the participants not detecting cheating, they do believe that cheating exists. However, owing to poor resources, the lack of detection methods, and the software used to deliver the online assessment, it is not revealed.

• Question 2: How did the different types of cheating occur in your subjects?

All of the participants agreed that students collaborated closely with each other, leading to the possibility of copying work, as well as engaging in plagiarism by copying from the internet without references or varying the writing style.

Based on the responses of the participants, most of the cheating occurs because students copy from each other. The reason proffered for this is either that the University policy against cheating is not strict enough to deter the student, or that the instructor fails to take action when cheating is detected, thereby not deterring students from, or even encouraging them, to continue cheating. There also appeared to be no education for students in what comprises cheating and plagiarism.

• Question 3: How have you implemented processes to prevent cheating in your subjects over the last five years, and how effective were they?

Half the participants used codes to prevent students from printing during an online test, and had an invigilator for the test if it was taken in a formal classroom situation to support the Faculty member. However, this is not always possible, as invigilators were not easily available. Another measure was to have a set of questions where the order is different: every student gets the same questions, with slight variation of wording and randomizing of the order of questions. This helps prevent collusion or unintentional cheating if someone glances at another student's screen. The steps taken were generally successful.

The other half of the participants considered that students were fundamentally honest and did not use any methods of detection other than the Blackboard facilities. However, all participants believed these were not robust enough to detect high levels of cheating. There was an awareness of the increasing sophistication of technology and of the ability of students to manipulate their answers.

The participants observed that where steps were taken to prevent cheating, they were effective. However, they were difficult to implement, as they were resource intensive, especially of Faculty time. Designing a randomized online test requires the instructor to prepare a large set of questions to be distributed on-line so that students receive different questions from each other. Purchased test-banks were not always feasible. Often Faculty were not provided with training in the appropriate software to complete this with ease and in a time-effective manner. This was deemed vital to meet the challenges presented by the exponential changes in technology.

• Question 4: What methods do you use to detect cheating?

All of the participants used their own memory when grading. They considered that students submitting their work together will have similar assignments and get scores that are very close together, thus raising suspicions of cheating. Also, if a student is receiving high grades but has not been attending the class, then this is suspicious. Only three of the participants use Plagiarism Detection Tools such as JISC, SafeAssign and iThenticate. These are useful when students are copying and pasting from the internet and from fellow students.

• Question 5: How does your institution convey the policies and processes pertaining to cheating?

The University has its own academic deceit policy and procedures, which can be accessed from the University website; the most important part is quoted:
"The degrees and other academic awards of University are granted in recognition of a student's individual achievement. Students are not permitted to seek unfair academic advantage, i.e. to cheat. Any deliberate attempt to obtain unfair advantage by one or more of a variety of means will be penalized".
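The question-set randomization the participants describe under Question 3 can be sketched with Python's standard library. The function name and the use of a per-student seed are illustrative assumptions, not part of any institutional system:

```python
import random

def build_test(question_bank, n_questions, seed):
    """Draw a per-student test variant: sample n questions from the
    shared bank, then shuffle their order so no two students see the
    same sequence even when their questions overlap."""
    rng = random.Random(seed)  # e.g. seed with the student ID
    questions = rng.sample(question_bank, n_questions)
    rng.shuffle(questions)
    return questions

bank = [f"Q{i}" for i in range(1, 51)]        # a 50-question bank
paper = build_test(bank, 10, seed=20231234)   # hypothetical student ID
```

Seeding with the student ID keeps each generated paper reproducible for later review while still differing between students.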
Five of the six respondents suggested that the university policy is well known. However, it was also suggested that the University did not take action beyond having a policy. There was a suggestion that often the University [any university] could compromise its academic reputation if the true incidence of cheating became widely known.

• Question 6: How effective are the processes your institution uses to reduce cheating?

The participants stated that there needed to be more preemptive attempts by the University to reduce cheating. The University needed to be proactive in publicizing the Academic Deceit Policy. For example, students registering at the University for the first time could be given a copy of the Academic Deceit Policy and Procedures, with program tutors also ensuring that the policy is re-stated at the commencement of the course. Again, the importance of students understanding what comprises cheating was stressed.

• Question 7: How could the processes be improved?

All the participants were emphatic that more training and development for Faculty is urgently required. They were very open to finding out more about the latest techniques to prevent cheating, how they can change the assessment design, and using collaborative groups to share ideas. The University also has a responsibility to ensure academic staff make use of the available processes and resources. There also needs to be a change in policy and practices to better reflect the on-line environment.

• Question 8: Do you agree that a low grade weighting of the online test would reduce cheating?

There was disagreement between the participants as to the best strategy regarding the weighting of assessments. Two of the participants agreed that using a low grade weighting for online assessment will discourage students from cheating, as the penalty, and being labelled a cheater if caught, is not worth it; with a highly weighted assessment, however, the temptation to cheat may be greater. Two participants stated that this may depend on the students themselves; one participant preferred to have one big assessment for the module rather than many smaller ones, and thinks a low grade weighting for an assessment does not work.

V. CONCLUSION AND RECOMMENDATION

The aim of this exploratory study was to identify awareness of cheating in the on-line environment and Faculty's responses and satisfaction with efforts made to ensure academic integrity. It was found that there was limited awareness on the part of academic staff as to the potential and extent of cheating in on-line assessment. Moreover, the cheating and plagiarism tools available to academic staff to detect cheating were limited. The study provides a basis for further research investigating the challenges posed by the increase in on-line assessment and the potential for growth in cheating in this form of assessment. This research includes but is not limited to the following recommendations.

It is recommended that the use of online cameras and biometric data to monitor students and verify their identification be investigated, especially for those students who are sitting tests away from the university. This would aid in reducing the potential for students to have someone else sit the exam for them.

It is recommended that the University act to provide invigilators for all on-line assessments. This may mean taking a non-traditional approach to their use, one that is more suited to on-line assessments as opposed to being present in a classroom. This would include supporting Faculty with the preparation of assessments so they achieve 'best practice' in minimizing cheating.

Students' awareness of the academic integrity policies of the University, and signing a code of conduct document, are believed to lower the occurrence of cheating. Further research is needed to establish the effectiveness of such strategies. Further research is also needed investigating student awareness of what exactly constitutes cheating and plagiarism.

While the participants had not been involved with off-campus assessment, they were aware of the issues that could arise from it and stated that invigilators were essential if it were to occur. Further research is required to establish best practice in this regard.

It was noted that the Faculty's memory is not always effective, especially if there is a large number of students in the class. Therefore, it is recommended that the Plagiarism Detection Tools are upgraded and become a requirement for use by all academic staff, not only a few, and that training in them should be given by the University to the academic staff. It is recommended that institutions investigate the challenges presented for Faculty in this situation.

Faculty training in using Plagiarism Detection Tools such as JISC, SafeAssign and iThenticate is imperative. The study showed that student cheating increased with faculty who do not use the tools.

In summary, the challenge of maintaining academic integrity is not going to go away. It is imperative that all higher education institutions are proactive in meeting the challenge and ensure Faculty are supported in their efforts to combat these issues, especially in the e-learning environment.

ACKNOWLEDGMENT

The authors would like to acknowledge the support and funding of this research by Prince Mohammad Bin Fahd University (PMU).
Deep Learning Assisted Smart Glasses as Educational
Aid for Visually Challenged Students
Hawra AlSaid, Lina AlKhatib, Aqeela AlOraidh, Shoaa AlHaidar, Abul Bashar
Abstract— Computer Vision Technology has played a significant role in assisting visually challenged people to carry out their day-to-day activities without much dependency on other people. Smart glasses are one such solution, enabling blind or visually challenged people to "read" images. This paper is an attempt in this direction to build a novel smart glass which has the ability to extract and recognize text captured from an image and convert it to speech. It consists of a Raspberry Pi 3 B+ microcontroller which processes the image captured from a webcam super-imposed on the glasses of the blind person. Text detection is achieved using the OpenCV software and the open source Optical Character Recognition (OCR) tools Tesseract and the Efficient and Accurate Scene Text Detector (EAST), based on Deep Learning techniques. The recognized text is further processed by Google's Text to Speech (gTTS) API to convert it to an audible signal for the user. A second feature of this solution is to provide location-based services to blind people by identifying locations in an academic building using RFID technology. This solution has been extensively tested in a university environment for aiding visually challenged students. The novelty of the implemented solution lies in providing the desired computer vision functionalities of image/text recognition in a manner that is economical, small-sized and accurate, using open source software tools. This solution can potentially be used for both educational and commercial applications.

Keywords: Image Recognition; Speech Processing; Optical Character Recognition; Deep Learning; Raspberry Pi; Python.

I. INTRODUCTION

In our societies, there are many people who suffer from different diseases or handicaps. According to the World Health Organization (WHO), about 8% of the population in the Eastern Mediterranean region has vision difficulties, including blindness, low vision and other kinds of visual impairment [1]. Such people need to be provided with special facilities so that they can live comfortably. Especially in the field of education, there are special schools and universities for people with special needs [2]. Most blind people and people with vision difficulties were not in a position to complete their studies, as special schools for people with special needs are not available everywhere, and most of them are private and expensive. So the only alternative was to study at home, acquiring basic knowledge from their parents. This education was not technical enough, and hence they could not compete with other people. There are different levels of needs, and not all levels require special places and special schools. For instance, people with vision difficulties can study with other students if they have an appropriate environment. In order to solve this issue, we can use the help of computer vision technology to make special aids with which visually impaired people can live comfortably, as far as possible.

It is observed that most blind people are intelligent and can study if they have the chance to study in regular government-administered schools, as these exist almost everywhere. It is a misconception among the majority who think that people who are blind or have vision difficulties cannot live alone and need the help of other people at all times. In fact, they do not need help all the time; they can be independent most of the time and have the chance to live like other people.

One of the popular solutions in this scenario is to use Smart Glasses for visually impaired people [3]. These types of glasses make use of computer vision hardware and software tools (camera, image processing, image classification and speech processing). Such a solution gives visually impaired people a chance to lead a comfortable life with other people and to study in any school or university without the need of help from other people every time. It has been observed that the use of Smart Glasses has increased the percentage of educated people. Most schools, colleges and universities are accepting students with vision difficulties. It is expected that from the next academic year Prince Mohammad bin Fahd University (PMU) will accept blind students for admission [4]. The college would like to start using smart glasses for the first time in this setup and help students to improve their education level with minimum assistance from the instructor.

This was the motivation behind the design and development of smart glasses: to help blind and visually impaired students with their studies. These glasses are designed to use computer vision technology to capture an image, extract English text and convert it into an audio signal with the aid of speech synthesis. Also, it was decided to add a feature for translating text/words from English to Arabic, as the majority of the students at PMU are Arabic speaking.

The main objectives of the proposed system can now be summarized as follows: capturing an image, extracting text from the image, identifying the correct text, converting text to speech, translating the text to another language, and integrating the
Table I: Comparative Summary of Smart Glasses Solutions
Fig. 2: Process Diagram of the Proposed System
group of individual symbols.
(ii) OpenCV Libraries
OpenCV is a library of programming functions for real-time computer vision. The library is cross-platform and free to use under the open-source BSD license [15]. To install the OpenCV 4 libraries, the recommended operating system for the Raspberry Pi 3 B+, Raspbian Stretch, was installed; Win32 Disk Imager was used to flash the SD card.
(iii) Google Text to Speech (gTTS) API
One of the most important functions of the smart glasses is text-to-voice conversion. In order to implement this task, we installed gTTS (Google Text-to-Speech), a Python library that interfaces with the Google Translate API [13]. gTTS has many useful features: it can convert text of unlimited length to voice, provide pronunciation corrections using customizable text pre-processors, and supports many languages, which can be retrieved when needed. We used gTTS to perform language translation from English to Arabic (invoked by Button 2, see Fig. 4).
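A minimal sketch of this text-to-speech step follows. The button-routing helper and the output file name are illustrative assumptions, not the authors' code; `gTTS` itself is the real package (`pip install gTTS`) and its `save` call needs network access to Google's service:

```python
def route_button(button):
    """Map the two push buttons to gTTS language settings: Button 1
    speaks the recognized English text, Button 2 the Arabic translation
    (the translation itself is assumed to happen upstream)."""
    if button == 1:
        return {"lang": "en"}
    if button == 2:
        return {"lang": "ar"}
    raise ValueError(f"unknown button: {button}")

def speak(text, button=1, out_path="speech.mp3"):
    """Convert text to an MP3 via gTTS and return the file path."""
    from gtts import gTTS  # imported lazily so routing stays testable offline
    gTTS(text=text, **route_button(button)).save(out_path)  # needs network
    return out_path
```

The resulting MP3 would then be played back through the earphone attached to the glasses.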
worthwhile to include a multi-lingual feature (e.g. French or Urdu) in the speech translation module.
● To improve the direction and warning messages to the user, we can include a GPS-based navigation and alert system.
● To provide wider visibility, we can include a wide-angle camera (e.g. 270° as compared to the 60° currently used).
● Finally, to provide a more real-time experience, we can include video processing instead of still images.
DeepDR: An image guided diabetic retinopathy detection technique using attention-based deep learning scheme
Noman Islam1, Umair Saeed2, Rubina Naz3, Jaweria Tanveer4, Kamlesh Kumar5, Aftab Ahmed Shaikh6
1 Iqra University
2-6 Sindh Madrassatul Islam University, Karachi
Abstract- This paper proposes an efficient and cost-effective deep learning architecture to detect diabetic retinopathy in real time. Diabetes is a leading root cause of eye disease in patients: it illuminates eye vessels and releases blood from the vessels. Early detection of diabetic retinopathy is useful to reduce the risk of blindness or other hazards. In this paper, after some pre-processing and data augmentation, InceptionV3 is used as a pre-trained model to extract the initial feature set. A convolutional neural network has been used with attention layers; these additional CNN layers are added to extract deep features and improve classification performance and accuracy. Initially, the model was proposed by Kevin Mader on Kaggle. This paper introduces additional layers into the proposed model and improves the validation and testing accuracy significantly. More than 90% validation accuracy was achieved with the proposed Convolutional Neural Network model, and testing accuracy was improved by up to 5%. This improvement in accuracy is very significant because the dataset is imbalanced and contains noisy images. It is concluded that the global average pooling (GAP) based attention mechanism increased the deep learning architecture's accuracy in detecting Diabetic Retinopathy in an imbalanced and noisy image dataset.

Keywords: diabetic retinopathy, deep learning, transfer learning, convolutional neural network, attention mechanism, global average pooling

1. Introduction

Diabetes mellitus has reached an epidemic level globally, and according to some statistics it will reach 360 million people by 2030. Despite decades of intense research, diabetic retinopathy (DR) is still the leading cause of visual loss all over the world and accounts for 28% of diabetes patients in the USA. It is especially prevalent among working-age populations. Patients who suffer from visual loss due to this problem often reflect a late diagnosis of diabetes, or are sometimes unaware of their diabetes and eye problems. It has been observed that an earlier diagnosis of retinopathy can prevent or avoid a significant proportion of visual loss. This can also ease the healing process or stop the progression of the disease. However, accurate diagnosis of this disease, and identifying its stage, is a challenge. Often an ophthalmologist performs the screening through visual inspection of the fundus and evaluation of color photographs. However, this is an expensive and time-consuming process. Most patients with diabetic retinopathy live in underdeveloped areas where specialists and the diagnostic infrastructure are not available. Early detection of the disease and treatment is essential to combat the increasingly large number of retinopathy patients. It can be said that a multidisciplinary approach is required for catering to this challenge.

In this paper, a machine learning approach to the diagnosis of diabetes mellitus is proposed. Machine learning is the branch of artificial intelligence that is based on learning a model from data that can later perform prediction. The paper proposes an approach based on a convolutional neural network to perform the classification task. Images of the fundus are acquired and a convolutional neural network model is trained that provides improved accuracy compared to conventional approaches.

2. Literature Review

In Table 1, previous work is summarized with its respective accuracy. N. Yalçin et al. [1] proposed a deep learning based approach for DR disease classification; after some pre-processing, a CNN was used to classify the disease in an image dataset with 98.5% validation accuracy. O. Deperlioglu et al. [2] proposed a CNN based deep learning model; 96.67% validation accuracy was achieved. D. Doshi et al. [3] proposed a CNN model with 0.386 accuracy. Three deep learning models were proposed; image channels (Green, Red) were extracted from the original images and given to the models respectively. A. Kwasigroch et al. [4] proposed a CNN based decision support system for DR disease classification; 82% validation accuracy was claimed. A fully connected convolutional neural network was proposed by M. Jena et al. [5]; the model's validation accuracy was claimed as 91.66%. X. Wang et al. [6] used a deep learning model with 63.23% validation accuracy; the proposed model was based on the pre-trained model InceptionV3.

H. Chen et al. [7] obtained validation accuracy of up to 80.0% with the deep neural network model discussed in their paper. A. Shah et al. [8] described a CNN model with 53.57% accuracy. I. Ardiyanto et al. [9] proposed a deep learning model, named Deep-DR-Net, for assessment of DR disease in an embedded system; the accuracy of this model was claimed to be up to 65.40%. H. A. Nugroho [10] discussed three different approaches: the first was based on pathologies, the second on the foveal avascular zone (FAZ) structure, and in the third, deep learning was proposed with more than 95% validation accuracy.
An InceptionV3 pre-trained model was used for feature extraction. To extract deeper features, further convolutional layers were added. To reduce overfitting, a dropout layer with a 0.5 rate was used. 64 filters with a 1 × 1 kernel size were used in the first convolutional layer, with ReLU as the activation function. The second convolutional layer contained 16 filters of size 1 × 1 with the ReLU activation function. A third convolutional layer was added containing 8 filters of size 1 × 1; this was our contribution to Kevin Mader's proposed model. In the fourth convolutional layer, a sigmoid activation function was used with 1 kernel.
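The 1 × 1 convolution stack just described, together with the GAP-based attention it feeds, can be sketched in plain numpy. The shapes and scaled random weights are toy assumptions for illustration, not the trained model:

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, C = 5, 5, 32                 # toy stand-in for the InceptionV3 feature map
feats = rng.random((H, W, C))      # "initial features" from the pre-trained model

def conv1x1(x, out_ch):
    """A 1x1 convolution is a per-pixel linear map across channels (+ ReLU).
    Weights are scaled by 1/sqrt(fan_in) to keep magnitudes moderate."""
    w = rng.standard_normal((x.shape[-1], out_ch)) / np.sqrt(x.shape[-1])
    return np.maximum(x @ w, 0.0)

x = conv1x1(feats, 64)             # 64 filters, 1x1
x = conv1x1(x, 16)                 # 16 filters, 1x1
x = conv1x1(x, 8)                  # 8 filters, 1x1 (the added third layer)
w_out = rng.standard_normal((8, 1)) / np.sqrt(8)
mask = 1.0 / (1.0 + np.exp(-(x @ w_out)))   # sigmoid with 1 kernel -> attention mask

# GAP-based attention: weight the base features by the mask, average over
# the spatial grid, then rescale by the mask's own average (the Lambda layer).
gap_feats = (feats * mask).mean(axis=(0, 1))   # shape (C,)
gap_mask = mask.mean(axis=(0, 1))              # shape (1,)
attended = gap_feats / gap_mask                # attention-pooled feature vector
```

The rescaling step is why the mask itself needs no training here: it acts as a spatial weighting that the downstream dense layers learn to exploit.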
An attention layer was added with a linear activation function. This layer was not trained during the training process (trainable = False), because it was used for attention purposes. Mask features were calculated with the help of the initial features generated by the pre-trained model and the deeper features extracted after adding the further convolutional layers. To build the attention mechanism, global average pooling was used. GAP features and a GAP mask were obtained from the mask features and attention layers, respectively. A Lambda layer was used to rescale the features.

Two Dropout layers with a 0.25 rate were used with the fully connected layers. The first fully connected Dense layer used 128 units with the ReLU activation function. Another fully connected layer was added with 64 units and linear activation; this was also our contribution to this architecture. Finally, an output layer with softmax activation was used, with 5 units to classify the five labels accordingly. The model was compiled with the Adamax optimizer and the categorical cross-entropy loss function. Initially, Kevin Mader compiled his proposed model with the Adam optimizer.

4. Details of Proposed Approach

Images of diabetic retinopathy were taken from the Kaggle dataset, which contains 35,000 color images. Five class labels are defined: "No DR", "Mild", "Moderate", "Severe" and "Proliferative DR". The retina images are high-resolution, taken under a diversity of imaging circumstances. Left and right eye images are provided for every patient. Noise is observed in the images; due to lighting effects, pixel intensity varies, which causes dissimilarity within classes. Sample images are provided in Figure 1. Images were normalized using Gaussian smoothing filters. Unsharp masking techniques were used to enhance the edges in the images. The Contrast Limited Adaptive Histogram Equalization filtering technique was used to adjust the contrast in the images.

Figure 1: Sample images from Kaggle DR Dataset

Before training, we augmented a large number of images to improve classification performance. We used a 640 × 640 size for augmented images. We implemented horizontal flips and used random brightness, contrast and saturation. The color mode was RGB. The minimum crop percentage was 0.001 and the maximum crop percentage was 0.005. The rotation range was set to 10. For data augmentation, the batch size was 16 and the crop probability was set to 0.5. We shuffled the whole dataset before training.

A Google Colab GPU environment (1x Tesla K80 GPU with 2496 CUDA cores, 12.6 GB RAM) was used for model training and testing. 778 images (an equal number of images from all classes) were used for training and 274 images were used for the validation process. For training, we adjusted the reduce-learning-rate parameters: patience was set to 20 epochs, the cooldown parameter was set to 5, and the factor was adjusted to 0.4 (the reduction of the learning rate). The early-stopping parameters were also adjusted: we set the patience of the early-stop parameter to 20, with validation loss as the quantity to be monitored. For testing, 1008 images were used. To show attention, the advanced visualization technique of heatmaps was used. Testing performance measures such as accuracy, recall, precision and F-score were used to evaluate the architecture. An InceptionV3 transfer-learning based architecture pre-trained on ImageNet was used to extract the initial features.

Accuracy can be further increased by adjustment of the convolutional and fully connected layers. Further pre-processing can enhance the classification process in the proposed architecture.

5. Results and Discussion

Training time, training/validation accuracies and losses are provided in Table 3. Performance parameters are provided in Table 4. More than 94% validation accuracy was achieved. On the test dataset, 65% accuracy was obtained. Compared with the initial model, test accuracy was improved by up to 5%. For class labels 0 (No DR), 1 (Mild), 2 (Moderate), 3 (Severe) and 4 (Proliferative DR), testing precision of 72%, 16%, 22%, 11% and 27% respectively was obtained. Testing recall was
achieved 90%, 4%, 11%, 3% and 26% respectively. F-scores of 80%, 7%, 11%, 3% and 27% were obtained for class labels 0, 1, 2, 3 and 4 respectively. Total testing time was 33 seconds, at 32 milliseconds per step. The improved model's prediction (AUC) was 60%. For class label 0, there were 632 true positives; 5, 15, 1 and 6 were the true positives for class labels 1, 2, 3 and 4 respectively. Training time, training loss and accuracy graphs are provided in Figures 1, 2, 3 and 4. Validation loss and accuracy can be seen in Figures 5 and 6. A comparison of the overall testing accuracies of the initial model and our proposed model is shown in Figure 7. The confusion matrix and ROC curve are visualized in Figures 8 and 9. Some examples of actual severity and predicted severity are shown in Figure 10. In Figure 11 a heatmap visualization is given; the heatmap describes the prominent features of the relevant class label. In Figure 13 we compare learning time between the proposed model and Kevin Mader's model. The learning time of our proposed model is greater, but it reduces gradually on each epoch. As per Figures 10 and 11, the proposed model is able to predict the correct class based on the concerned regions. Figures 4 and 6 show that validation is improving on each epoch. Figures 1 and 3 show that learning time is decreasing on each epoch and the proposed model trains more quickly. Figure 5 shows that on each epoch, the loss of the proposed model is also decreasing.

Figure 3: Training Loss for each Epoch
Figure 4: Training Accuracy for each Epoch
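As a sanity check on these numbers, the F-score is the harmonic mean of precision and recall, F1 = 2PR/(P + R); for class 0, the stated 72% precision and 90% recall reproduce the reported 80%:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

class0 = f1(0.72, 0.90)   # -> 0.80, matching the reported class-0 F-score
```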
Table 3: Training time, training loss/accuracy and validation loss/accuracy for each epoch

Epoch No | Training Time (s) | Training Loss | Training Accuracy (%) | Validation Loss | Validation Accuracy (%)
---------|-------------------|---------------|-----------------------|-----------------|------------------------
1        | 1802              | 1.5947        | 68.4                  | 1.6432          | 56.9
2        | 512               | 1.3743        | 79.3                  | 1.4293          | 68.4
3        | 507               | 1.3381        | 83.2                  | 1.1537          | 89.5
4        | 514               | 1.3249        | 85.4                  | 1.0832          | 89.7
5        | 511               | 1.2376        | 97.0                  | 0.9956          | 94.3

Figure 5: Validation Loss for each Epoch
Figure 8: Confusion matrix of our proposed model
Figure 12: Validation accuracy comparison of different proposed models and our proposed model
Figure 13: Learning time comparison between the proposed model and Kevin Mader's model

References
1. Yalçın N, Alver S, Uluhatun N. Classification of retinal images with deep learning for early detection of diabetic retinopathy disease. SIU. 2018; 1-4.
2. Deperlioglu O, Köse U. Diagnosis of diabetic retinopathy by using image processing and convolutional neural network. ISMSIT. 2018.
3. Doshi D, Shenoy A, Sidhpura D, Gharpure P. Diabetic retinopathy detection using deep convolutional neural networks. CAST. 2018; 261-266.
4. Kwasigroch A, Jarzembinski B, Grochowski M. Deep CNN based decision support system for detection and assessing the stage of diabetic retinopathy. IIPhDW. 2018; 111-116.
5. Jena M, Mishra SP, Mishra D. Detection of diabetic retinopathy images using a fully convolutional neural network. ICDSBA. 2018.
6. Wang X, Lu Y, Wang Y, Chen W. Diabetic retinopathy stage classification using convolutional neural networks. IRI. 2018; 465-471.
7. Chen H, Zeng X, Luo Y, Ye W. Detection of diabetic retinopathy using deep neural network. DSP. 2018; 1-5.
8. Shah A, Lynch S, Niemeijer M, Amelon R, Clarida W, Folk J, Russell S, Wu X, Abràmoff MD. Susceptibility to misdiagnosis of adversarial images by deep learning based retinal image analysis algorithms. ISBI. 2018; 1454-1457.
9. Nugroho HA. Towards development of a computerised system for screening and monitoring of diabetic retinopathy. EECSI. 2017; 1-1.
10. Yu F, Sun J, Li A, Cheng J, Wan C. Image quality classification for DR screening using deep learning. EMBC. 2017; 664-667.
11. Sambaturu B, Srinivasan B, Prabhu SM, Rajamani KT, Palanisamy T, Haritz G, Singh D. A novel deep learning based method for retinal lesion detection. ICACCI. 2017; 33-37.
12. Kanungo YS, Srinivasan B, Choudhary S. Detecting diabetic retinopathy using deep learning. RTEICT. 2017; 801-804.
13. Rufaida SI, Fanany MI. Residual convolutional neural network for diabetic retinopathy. ICACSIS. 2017; 367-374.
14. Ghosh R, Ghosh K, Maitra S. Automatic detection and classification of diabetic retinopathy stages using CNN. SPIN. 2017; 550-554.
15. Abdillah B, Bustamam A, Sarwinda D. Classification of diabetic retinopathy through texture features analysis. ICACSIS. 2017; 333-338.
16. Roy A, Dutta D, Bhattacharya P, Choudhury S. Filter and fuzzy c-means based feature extraction and classification of diabetic retinopathy using support vector machines. ICCSP. 2017; 1844-1848.
17. Dong Y, Zhang Q, Qiao Z, Yang J. Classification of cataract fundus image based on deep learning. IST. 2017; 1-5.
18. Choudhury S, Bandyopadhyay S, Latib SK, Kole DK, Giri C. Fuzzy c-means based feature extraction and classification of diabetic retinopathy using support vector machines. ICCSP. 2016; 1520-152.
19. "Diabetic Retinopathy detection", https://www.kaggle.com/kmader/inceptionv3-for-retinopathy-gpu-hr, 2018.
Mitigating the Effect of Data Sparsity: A Case
Study on Collaborative Filtering Recommender
System
Bushra Alhijawi∗, Ghazi Al-Naymat†, Nadim Obeid‡¶, Arafat Awajan§
King Hussein School of Information Technology, Princess Sumaya University for Technology, Amman, Jordan
¶ King Abdullah II School of Information Technology, The University of Jordan, Amman, Jordan
Abstract—The sparsity problem is considered one of the main issues facing collaborative filtering. This paper presents a new dimensionality reduction mechanism that is applicable to collaborative filtering. The proposed mechanism is a statistical-based method that exploits the user-item rating matrix and the item-feature matrix to build the User Interest Print (UIP) matrix. The UIP is a dense matrix that stores data reflecting the satisfaction degree of the users with the item's semantic features. The method is developed based on the assumption that people tend to buy items related to what they have previously bought. Also, the method benefits from the fact that the number of features is much less than the number of items and is mostly constant. The effectiveness of the proposed mechanism is tested on two real datasets, namely Movielens and HetRec 2011. The accuracy results obtained using the UIP matrix are compared with the ones obtained using the user-item rating matrix. The experimental studies demonstrate the superiority of our proposed method: on average, using the UIP matrix the collaborative filtering achieved an 8% improvement in terms of prediction accuracy.

Index Terms—Dimensionality reduction, sparsity, collaborative filtering, recommender system.

I. INTRODUCTION

Collaborative Filtering (CF) is the most common and popular recommendation approach [1], [2]. The core idea behind CF is to estimate a particular item's probability of being a favorite of a user by comparing that user's historical shopping behavior record with the recorded shopping behavior of other like-minded users [3], [4]. The basic assumption that motivates CF is that there is a high probability that users will give similar rates to other items if they gave rates to n items in a similar way [5].

The historical shopping behavior records are stored in a data file that can be viewed as a matrix whose rows and columns represent users and items, respectively. This matrix is called the user-item rating matrix. The performance and the recommendation quality produced by CF depend mainly on the quality of the data stored in the user-item rating matrix. The user-item rating matrix usually stores rating records relating tens of thousands of users to tens of thousands of items, thus it will be extremely sparse. The sparsity problem is a result of the fact that most users rate only a small proportion of the items [2], [6], [7]. This problem contributes to reduced coverage and causes neighbor transitivity [8]. The CF's coverage is defined as the fraction of items that the CF can provide recommendations for [8]. Therefore, the CF may be unable to produce a recommendation for those items which have only a small number of rates. This is due to the fact that users usually rate a small proportion of the items compared with the total number of items in the system. Neighbor transitivity refers to the problem in which like-minded (i.e. similar) users may not be determined since they may not have sufficient common ratings [9]. Consequently, the sparsity problem has a significant negative impact on the accuracy of the CF prediction. The effect of the sparsity problem on CF was examined by Bobadilla and Serradilla [7]. They concluded that the impact of the sparsity effect depends on the selected k-neighborhood value and the similarity measure used.

Several methods have been considered to alleviate the data sparsity problem. Traditionally, the user's demographic information (e.g. gender, country, age, etc.) is utilized to compute the similarity among users [10], [11], which helps in alleviating the neighbor transitivity issue. In addition, the item's semantic information has been used to overcome the issues related to the sparsity problems (i.e. coverage and neighbor transitivity) [2], [12]–[14]. Alhijawi and Kilani [3] used the genetic algorithm to obtain the optimal similarity values among users instead of using the user-item rating matrix. Representing the historical rating data in a lower-dimensional space is one of the proposed solutions to deal with this challenge [15]. Principal Component Analysis (PCA) [16], [17] and Singular Value Decomposition (SVD) [18]–[20] are two dimensionality reduction methods used to alleviate the data sparsity problem. PCA is a dimensionality reduction technique proposed by Pearson [21]; it is a statistical-based method that obtains an ordered list of components accounting for the largest amount of the variance in the data (i.e. finding patterns in a high-dimensionality space) [15]. SVD, proposed for this setting by Billsus and Pazzani [22], is a matrix factorization approach that decomposes the user-item rating matrix into the product of three lower-dimensionality rectangular matrices.

This paper presents a dimensionality reduction method that is applied to handle the sparsity problem. The proposed technique is a statistical-based method that exploits the user-item rating matrix (U × I) and the item-feature matrix (I × F) to build the User Interest Print (UIP) matrix (U × F). The UIP
where r_ui is the rate that user u gave to item i. For instance, V_U1 = (5, 0, 0, 0, 0, 0, 4, 0, 3, 2).

2) Represent each item by a vector of features (V_i) as follows:

   V_i = (f_i1, f_i2, ..., f_ik),    (3)

where the value of f_it is either 0 or 1: f_it = 1 if item i belongs to feature t; otherwise, f_it = 0. For instance, V_I1 = (1, 0, 0, 1, 0).

3) Compute the interest print (IP^u) for each user as follows:

   IP^u = (SatDegree^u_f1, SatDegree^u_f2, ..., SatDegree^u_fk)    (4)

For instance, IP^U1 = (3.5, 0, 3, 3.5, 2.5).

   SatDegree^u_ft = (Σ_i r^i_u) / #i, where i ∈ ft    (5)

where
• IP^u represents the interest print of user u.
• SatDegree^u_ft represents the satisfaction degree of user u about semantic feature ft.

For instance, SatDegree^U1_F1 = (r^I1_U1 + r^I7_U1 + r^I9_U1 + r^I10_U1) / 4 = (5 + 4 + 3 + 2) / 4 = 3.5.

Fig. 2. Example of the UIP matrix construction process.

III. UIP MATRIX FOR RECOMMENDATION

In general, the recommendation problem consists of finding a set of items that have the highest probability of being favorites of a particular user (AU). The challenge is to predict these probabilities accurately. More formally, the recommendation problem can be formulated as follows:

Let U = u1, u2, u3, ..., un be a set of users, I = i1, i2, i3, ..., im be a set of all possible items and F = f1, f2, ..., fk be a set of features. Let R : U × I → U − I be a utility function that measures the probability of item (i) being a favorite of the user (u). For each user, the recommendation problem consists of finding the item (i∗) which maximizes the utility of user u. Mostly, the utility of an item is represented as a rate which indicates the user's satisfaction level with this item. These values are aggregated in terms of features and stored in the UIP matrix. To achieve this goal (i.e. finding i∗), the similarity between the AU and other users is computed depending on the UIP matrix. The most similar users to the AU (u∗) are used as input to a prediction function P : u∗ × F to predict the AU's satisfaction degree about the features (f). Then, the predicted rate of each item is computed depending on the satisfaction level of the features to which the item belongs (Eq. 6). Based on the predicted rates, the item set (I) will be labeled as either favorite items or non-favorite items. The favorite item set is ordered and considered as a recommendation list for the AU.

   ItemPR_i = (Σ P^AU_ft) / #f, where i ∈ ft    (6)

IV. EXPERIMENTS AND RESULTS

This section provides details of how the UIP matrix was tested. Various experiments were conducted to compare the prediction accuracy obtained using the UIP matrix with the one obtained using the user-item rating matrix. Section IV-A presents details related to the data that have been used in the experiments. Section IV-B provides details related to the experiment design and the measures used in the experiments. Finally, the results are presented and discussed in Section IV-C.

A. Datasets

To evaluate the UIP matrix, two real datasets, MovieLens and HetRec 2011, were considered (described in Table I). The descriptions of the datasets are as follows:

• MovieLens dataset 1 [23]. This dataset is considered one of the most popular references in RS research over the last years [24]. In this dataset, 100,000 5-star-scale ratings were collected from 943 users on 1682 movies. Each user has rated at least 20 movies and each movie belongs to at least one of the 18 categories. Hence, 32.4% of the users gave rates to 20-40 items, which is a small number of rates compared with the remaining users. The average number of rates is 106. The sparsity level of this dataset is 93.7% (sparsity level = 1 − (100000/(943 × 1682)) = 0.937).

• HetRec 2011 (MovieLens + IMDb/Rotten Tomatoes) dataset 2 [25]. It is an extension of the MovieLens10M dataset, published by the GroupLens research group 3. The HetRec 2011 dataset includes 2113 users, 10197 movies, 95321 actors, 4060 directors and 20 genres. In this dataset, the users have provided ratings on a 5-star scale

1 http://grouplens.org/datasets/movielens/
2 https://grouplens.org/datasets/hetrec-2011/
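The UIP construction in Eqs. (3)-(5) reduces the sparse U × I rating matrix to a dense U × F matrix with two matrix products. The following is an illustrative NumPy sketch, not the authors' code; it assumes a rating of 0 encodes "unrated", as in the V_U1 example:

```python
import numpy as np

def build_uip(R, F):
    """Build the User Interest Print (UIP) matrix.

    R : (n_users, n_items) rating matrix, 0 = unrated.
    F : (n_items, n_features) binary item-feature matrix (Eq. 3).
    Entry (u, t) of the result is the mean rate user u gave to the
    items carrying feature t (Eq. 5), or 0 if u rated none of them.
    """
    rated = (R > 0).astype(float)   # 1 where a rating exists
    sums = R @ F                    # total rating mass per feature
    counts = rated @ F              # number of rated items per feature
    with np.errstate(invalid="ignore", divide="ignore"):
        return np.where(counts > 0, sums / counts, 0.0)

# Worked example from the paper: V_U1 over a 10-item, 5-feature catalogue,
# where items I1, I7, I9, I10 carry feature F1.
R = np.array([[5, 0, 0, 0, 0, 0, 4, 0, 3, 2]], dtype=float)
F = np.zeros((10, 5))
F[[0, 6, 8, 9], 0] = 1
print(build_uip(R, F)[0, 0])        # -> 3.5, i.e. (5+4+3+2)/4
```

The two products `R @ F` and `rated @ F` make the per-feature sum and count in Eq. (5) a single vectorized step instead of a per-user loop.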
TABLE I
THE DATASET SPECIFICATIONS USED IN THE EXPERIMENTS.

                      Movielens   HetRec 2011
Number of users       943         2113
Number of movies      1682        10197
Number of genres      18          20
Number of actors      0           95321
Number of directors   0           4060
Number of ratings     100000      855598
Rating scale          1-5         1-5
Sparsity level        93.7%       96.1%

and includes 855598 ratings. Each user gave rates to at least 20 items. Hence, 496 users gave rates to 20-100 items (i.e. 23.5% of overall users) and 38% of those users gave rates to 20-40 items. The average number of rates is 405. The sparsity level of this dataset is 96.1% (sparsity level = 1 − (855598/(2113 × 10197)) = 0.961).

B. Experiments setup

The prediction accuracy obtained using the UIP matrix is compared in this paper to the one obtained using the user-item rating matrix. The Mean Absolute Error (MAE) (Eq. 7) is utilized as the prediction accuracy measure. The prediction accuracy is computed using different numbers of neighbors (K); the value of K ranges between 25 and 400. A smaller value of MAE signifies better prediction quality.

   MAE = (1/#U) Σ_{u=1..#U} [ (Σ_{i=1..#Iu} |p_u,i − r_u,i|) / #Iu ],    (7)

where #U represents the number of users and #Iu represents the number of items rated by user u.

Both metrics are utilized in the baseline CF approaches: Pearson-based CF and cosine-based CF. Thus, the most popular similarity metrics are considered in the experiments to select the users similar to the AU: Pearson correlation (Eq. 8) and cosine (Eq. 9) [7].

   Pearson(AU, u) = [ Σ_{i∈I} (r_AU,i − r̄_AU)(r_u,i − r̄_u) ] / [ sqrt(Σ_{i∈I} (r_AU,i − r̄_AU)²) · sqrt(Σ_{i∈I} (r_u,i − r̄_u)²) ]    (8)

   Cosine(AU, u) = [ Σ_{i∈I} (r_AU,i · r_u,i) ] / [ sqrt(Σ_{i∈I} (r_AU,i)²) · sqrt(Σ_{i∈I} (r_u,i)²) ],    (9)

where
• I is the group of items that both users AU and u have rated.
• r_AU,i is the rate of user AU on item i.
• r̄_AU is the mean rating value of user AU.
• r_u,i is the rate of user u on item i.
• r̄_u is the mean rating value of user u.

For the prediction step, Resnick's Adjusted Weighted Sum (Eq. 10) was considered. Note that, in this step, the feature's (x = f) rate and the item's (x = i) rate are predicted when using the UIP matrix and the user-item rating matrix, respectively.

   p_AU,x = r̄_AU + [ Σ_{u=1..k_AU} sim(AU, u) · (r^x_u − r̄_u) ] / [ Σ_{u=1..k_AU} sim(AU, u) ]    (10)

Note that only the genre feature was considered to construct the UIP matrix. Thus, the dimensions of the user-item rating matrix and the UIP matrix when using the Movielens dataset are 943 × 1682 and 943 × 18, respectively, while a 2113 × 10197 user-item rating matrix and a 2113 × 20 UIP matrix are considered when using the HetRec 2011 dataset.

C. Results

The results presented in this section refer to the prediction accuracy, measured using the MAE. The x-axis represents the different K-neighbor values used and the y-axis represents the MAE results.

Fig. 3 shows the MAE results obtained from applying Pearson-based CF (Fig. 3(A)) and cosine-based CF (Fig. 3(B)) using the Movielens dataset. Generating the recommendation using Pearson-based CF depending on the UIP matrix leads to fewer errors, particularly for K values in the range [25-125]. Using the UIP matrix with Pearson-based CF improved the prediction accuracy by 1.6% on average. In general, the performance of the cosine-based CF when depending on the UIP matrix is better than when depending on the user-item rating matrix; the prediction errors recorded when using the UIP matrix are, on average, 2.4% less than those recorded when using the user-item rating matrix.

Fig. 4 reports the accuracy results collected using the HetRec 2011 dataset. The recommendation methods (i.e. Pearson-based CF (Fig. 4(A)) and cosine-based CF (Fig. 4(B))) achieve significantly fewer errors when using the UIP matrix than when using the user-item rating matrix for any selected K-neighbor value in the range [25-200]. The comparative results in Fig. 4(A) show improvements in accuracy of up to 16.3% when depending on the UIP matrix, while the cosine-based CF (Fig. 4(B)) improved the prediction accuracy by 4.2% when using the UIP matrix.

According to Bobadilla and Serradilla [7], the performance of cosine-based CF is negatively affected by the sparsity problem and this negative behavior can be reduced by selecting high k-neighbor values, while the performance of Pearson-based CF is positively affected by the sparsity problem. The experiments were conducted using two real datasets with different sparsity levels; the sparsity level of the HetRec 2011 dataset is higher than that of the Movielens dataset. The gathered results indicate that the sparsity level has a positive impact on the behavior of Pearson-based CF. The percentage improvement made by Pearson-based CF using the UIP matrix lies in the range [0.16%-12.3%] and [0.38%-16.3%] when using Movielens and HetRec 2011, respectively.
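The evaluation pipeline of Eqs. (7)-(10) can be sketched compactly in NumPy. This is an illustrative reimplementation, not the authors' code; it assumes dense rating vectors in which 0 marks an unrated item:

```python
import numpy as np

def pearson(ra, rb):
    """Pearson correlation over the co-rated items of two users (Eq. 8)."""
    co = (ra > 0) & (rb > 0)
    if co.sum() < 2:
        return 0.0
    a, b = ra[co] - ra[co].mean(), rb[co] - rb[co].mean()
    d = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / d) if d else 0.0

def cosine(ra, rb):
    """Cosine similarity over the co-rated items of two users (Eq. 9)."""
    co = (ra > 0) & (rb > 0)
    d = np.linalg.norm(ra[co]) * np.linalg.norm(rb[co])
    return float(ra[co] @ rb[co] / d) if d else 0.0

def resnick(sim, r_neigh, mean_neigh, mean_au):
    """Resnick's adjusted weighted sum (Eq. 10) for one target x."""
    return mean_au + np.sum(sim * (r_neigh - mean_neigh)) / np.sum(sim)

def mae(pred, actual):
    """Per-user mean absolute error, averaged over users (Eq. 7).

    pred, actual: parallel lists of arrays, one array per user,
    restricted to that user's rated items.
    """
    return float(np.mean([np.mean(np.abs(p - a))
                          for p, a in zip(pred, actual)]))
```

The same functions serve both configurations: fed rows of the UIP matrix they compare users by feature-level satisfaction, fed rows of the user-item rating matrix they compare users by raw ratings.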
Fig. 3. The prediction accuracy results using Movielens.
Fig. 4. The prediction accuracy results using HetRec 2011.

Fig. 3(B) and Fig. 4(B) show that the largest gap between the accuracy results achieved using the UIP matrix and those achieved using the user-item rating matrix occurs for K values in the range [25-75]. Thus, using the UIP matrix alleviates the effect of the sparsity problem on cosine-based CF performance even when the k-neighbor value is small. The percentage improvement achieved by applying cosine-based CF using the UIP matrix on Movielens and HetRec 2011 lies in the range [0.58%-15.1%] and [1.8%-17.5%], respectively.

V. CONCLUSION

In this research, a new dimensionality reduction method was proposed to handle the sparsity problem of CFRS. The core idea lies in exploiting both the user-item rating matrix and the item-feature matrix to form the UIP matrix. The UIP matrix has two main features:

• The UIP is a dense matrix.
• The UIP matrix reflects the user's satisfaction degree about the item's semantic features. The UIP matrix stores values in the range [0, Max_r], where Max_r represents the highest satisfaction degree and 0 indicates that the user has no interest in the items which belong to this feature.

To generate the recommendation, the UIP matrix is used to compute the similarity between users instead of using the user-item rating matrix. The prediction accuracy obtained using the UIP matrix was compared to the one gathered using the user-item rating matrix. Two benchmark datasets were utilized in the experiments, namely Movielens and HetRec 2011. The obtained results proved that using the UIP matrix leads to fewer errors in prediction than using the user-item rating matrix.

REFERENCES

[1] Lalita Sharma and Anju Gera. A survey of recommendation system: Research challenges. International Journal of Engineering Trends and Technology (IJETT), 4(5):1989–1992, 2013.
[2] Bushra Alhijawi, Nadim Obeid, Arafat Awajan, and Sara Tedmori. Improving collaborative filtering recommender system using semantic information. In International Conference on Information and Communication Systems (ICICS 2018). IEEE, 2018.
[3] Bushra Alhijawi and Yousef Kilani. Using genetic algorithms for measuring the similarity values between users in collaborative filtering
recommender systems. In 2016 IEEE/ACIS 15th International Conference on Computer and Information Science (ICIS), pages 1–6. IEEE, 2016.
[4] Jonathan L Herlocker, Joseph A Konstan, and John Riedl. Explaining collaborative filtering recommendations. In Proceedings of the 2000 ACM Conference on Computer Supported Cooperative Work, pages 241–250. ACM, 2000.
[5] Bushra Alhijawi. The use of the genetic algorithms in the recommender systems, 2017.
[6] Bushra Alhijawi and Yousef Kilani. The recommender system: A survey. International Journal of Advanced Intelligence Paradigms, 10:1, 2018.
[7] Jesus Bobadilla and Francisco Serradilla. The effect of sparsity on collaborative filtering metrics. In Proceedings of the Twentieth Australasian Conference on Australasian Database - Volume 92, pages 9–18. Australian Computer Society, Inc., 2009.
[8] Xiaoyuan Su and Taghi M Khoshgoftaar. A survey of collaborative filtering techniques. Advances in Artificial Intelligence, 2009, 2009.
[9] Sajad Ahmadian, Mohsen Afsharchi, and Majid Meghdadi. A novel approach based on multi-view reliability measures to alleviate data sparsity in recommender systems. Multimedia Tools and Applications, pages 1–36, 2019.
[10] Laila Safoury and Akram Salah. Exploiting user demographic attributes for solving cold-start problem in recommender system. Lecture Notes on Software Engineering, 1(3):303–307, 2013.
[11] Mohammad Yahya H Al-Shamri. User profiling approaches for demographic recommender systems. Knowledge-Based Systems, 100:175–187, 2016.
[12] Mehrbakhsh Nilashi, Othman Ibrahim, and Karamollah Bagherifard. A recommender system based on collaborative filtering using ontology and dimensionality reduction techniques. Expert Systems with Applications, 92:507–520, 2018.
[13] G. Lv, C. Hu, and S. Chen. Research on recommender system based on ontology and genetic algorithm. Neurocomputing, 187:92–97, 2016.
[14] Qusai Shambour, Mouath Hourani, and Salam Fraihat. An item-based multi-criteria collaborative filtering algorithm for personalized recommender systems. International Journal of Advanced Computer Science and Applications, 7(8):274–279, 2016.
[15] Xavier Amatriain, Alejandro Jaimes, Nuria Oliver, and Josep M Pujol. Data mining methods for recommender systems. In Recommender Systems Handbook, pages 39–71. Springer, 2011.
[16] Mehrbakhsh Nilashi, Mohammad Dalvi Esfahani, Morteza Zamani Roudbaraki, T Ramayah, and Othman Ibrahim. A multi-criteria collaborative filtering recommender system using clustering and regression techniques. Journal of Soft Computing and Decision Support Systems, 3(5):24–30, 2016.
[17] Mehrbakhsh Nilashi, Othman bin Ibrahim, Norafida Ithnin, and Nor Haniza Sarmin. A multi-criteria collaborative filtering recommender system for the tourism domain using expectation maximization (EM) and PCA-ANFIS. Electronic Commerce Research and Applications, 14(6):542–562, 2015.
[18] Jesús Bobadilla, Rodolfo Bojorque, Antonio Hernando Esteban, and Remigio Hurtado. Recommender systems clustering using Bayesian non negative matrix factorization. IEEE Access, 6:3549–3564, 2018.
[19] Remigio Hurtado Ortiz, Rodolfo Bojorque Chasi, and César Inga Chalco. Clustering-based recommender system: Bundle recommendation using matrix factorization to single user and user communities. In International Conference on Applied Human Factors and Ergonomics, pages 330–338. Springer, 2018.
[20] Bo Zhu, Fernando Ortega, Jesús Bobadilla, and Abraham Gutiérrez. Assigning reliability values to recommendations using matrix factorization. Journal of Computational Science, 26:165–177, 2018.
[21] Karl Pearson. LIII. On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 2(11):559–572, 1901.
[22] Daniel Billsus and Michael J Pazzani. Learning collaborative information filters. In ICML, volume 98, pages 46–54, 1998.
[23] Jonathan L Herlocker, Joseph A Konstan, Al Borchers, and John Riedl. An algorithmic framework for performing collaborative filtering. In 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 1999, pages 230–237. Association for Computing Machinery, Inc, 1999.
[24] Qusai Shambour and Jie Lu. A hybrid multi-criteria semantic-enhanced collaborative filtering approach for personalized recommendations. In the 2011 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology - Volume 01, pages 71–78. IEEE Computer Society, 2011.
[25] Iván Cantador, Peter Brusilovsky, and Tsvi Kuflik. 2nd workshop on information heterogeneity and fusion in recommender systems (HetRec 2011). In Proceedings of the 5th ACM Conference on Recommender Systems, RecSys 2011, New York, NY, USA, 2011. ACM.
Visualizing Program Quality – A Topological
Taxonomy of Features
Islam Al Omari Razan Al Omoush Haneen Innab A. Elhassan
isl20188022@std.psut.edu.jo raz20188047@std.psut.edu.jo han20188024@std.psut.edu.jo a.elhassan@psut.edu.jo
conjunction with calendar controls are used to represent the temporal dimension as the main attribute for hierarchy configuration and drill-down navigation. The authors created an application with Java and the Swing library to build the graphical user interface based on the MVC model, representing 3 layers: an upgrade deemed non-viable in terms of added visualization and insight value. This is due to the usual issues with 3D models, including depth ambiguity, tilt angle impact, hidden components, etc. The alternate option explored in this paper is linked multi-view models with cross-reference, cross-filtering and drill-down capability.
Begin
  For each Course & Outcome pair in All Assessments ((CID-CLO) ϵ AA)
    Int Population = 0;
    For each Assessment Instance in All Assessments (ai ϵ AA)
      If (ai.Score > U-MIN && ai.Score < U-Max)
        U++;                    // Unsatisfactory
      else if (ai.Score > M-MIN && ai.Score < M-Max)
        M++;                    // Minimal
      else if (ai.Score > A-MIN && ai.Score < A-Max)
        A++;                    // Adequate
      else
        E++;                    // Exemplary
      Population++;
      End If
    End For
    U = 100 * U / Population;
    M = 100 * M / Population;
    A = 100 * A / Population;
    E = 100 * E / Population;
  End For
End.

where U, M, A and E denote the Unsatisfactory, Minimal, Adequate and Exemplary performance standards respectively. The sample performance standards (table 4.2) and the From and To range values for each performance standard (table 4.2) are designated by the academic program administration as necessary. A rubric line looks like the ones shown in table 4.3; all tables are in appendix A.

C. Course Aggregated Assessments - Layer3

The rubrics in Layer2 above include a performance instance for every Course-CLO pair that is assessed throughout the academic semester. In the build-up to the 4th abstraction layer below, rubric records are grouped for every course (CID, no CLO), taking the accumulated averages of the performance standards U, M, A and E. The resulting dataset contains one single instance for every CID, as follows:

Pseudo-code: Make-Course-Assessment
Input(): All Rubric Lines of CID-CLO Assessments as in Layer2
Output(): For each CID in Course-Rubric assessment (cr ϵ CR)
Output(): CID, CU, CM, CA, CE: Aggregated from U, M, A, E in Layer2
Output(): CAssessments: Number of Assessments in cr

Begin
  For each CID ϵ C; C: All courses, CID: (feature) in Rubric Dataset
    For each Rubric-Line (rl) ϵ R; R: All rubric assessments
      If (rl.CID == CID)
      {
        CU += rl.U;
        CM += rl.M;
        CA += rl.A;
        CE += rl.E;
        CAssessmentPopulation += rl.Population;
        CAssessments++;
      }
      End If
    End For
  End For
End.

Table 4.4 (Appendix A) shows samples of course-level (Layer3) assessments for the data in Table 4.3 in appendix A.

D. Student Outcome (SO) Assessment – Layer4

The assessment layer that is most indicative of the health status of the BSc program is the one that uses the three assessment layers above in conjunction with the Course-SO and CLO:SO mappings to calculate the attainment rates of the Student Outcomes. The Student Outcome attainment rates tend to form the first item on the checklist of most QA processes and requirements. They are calculated according to the algorithm below.

Pseudo-code: Make-SO-Assessment
Begin
  For each ((CID-CLO) ϵ AA) in rubric instance (ri) in Layer2
    For each SO in CLO-SO Mapping | per CID-CLO pair
      SO.U += ri.U;
      SO.M += ri.M;
      SO.A += ri.A;
      SO.E += ri.E;
      SO.Population += ri.Population;
      SO.Lines++;
    End For
    SO.U = 100 * SO.U / SO.Lines;
    SO.M = 100 * SO.M / SO.Lines;
    SO.A = 100 * SO.A / SO.Lines;
    SO.E = 100 * SO.E / SO.Lines;
  End For
End.

The data is derived from a set of Course Learning Outcome (CLO) assessments collected from the classrooms of a small academic college over a 3-year period. Instruments include Major1 (M1), Major2 (M2), Midterm (MT), Final (F), In-course Projects (P), Capstone Projects (SD), Internships (INT), Quizzes (Q), Homework Assignments (HW) and Labs.
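Assuming the rubric lines live in a tabular dataset, the Layer3 grouping described above maps naturally onto a pandas groupby. This is an illustrative sketch, not the authors' implementation; the column names are hypothetical, the sample values echo the style of Table 4.4, and the "accumulated averages" from the text are taken directly with a mean rather than the pseudocode's running sums:

```python
import pandas as pd

# Hypothetical Layer2 rubric dataset: one row per assessed CID-CLO pair.
rubric = pd.DataFrame({
    "CID":        ["C401", "C401", "C102"],
    "U":          [5.0, 5.0, 12.0],     # performance-standard percentages
    "M":          [10.0, 10.0, 10.0],
    "A":          [19.0, 19.0, 34.0],
    "E":          [65.0, 65.0, 43.0],
    "Population": [800, 864, 1212],     # assessed students per rubric line
})

# Layer3: group rubric lines per course, average the performance
# standards, and total the population and assessment counts.
course = (rubric.groupby("CID")
                .agg(CU=("U", "mean"), CM=("M", "mean"),
                     CA=("A", "mean"), CE=("E", "mean"),
                     CAssessmentPopulation=("Population", "sum"),
                     CAssessments=("U", "size")))
print(course)
```

With these sample values, course C401 aggregates to two assessments with a population of 1664, mirroring the C401 row of Table 4.4.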
The multi-layer design above comes as part of a comprehensive data ETL, wrangling and PowerBI/Tableau modelling and visualization process, as illustrated in figure 4.1 below.

[Figure: Usability Analysis (number of Clicks)]
REFERENCES
[2] Jain, Divya & Singh, Vijendra. (2018). An Efficient Hybrid Feature
Selection model for Dimensionality Reduction. Procedia Computer
Science. 132. 333-341. 10.1016/j.procs.2018.05.188.
[21] I. Jenhani, G. B. Brahim and A. Elhassan, "Course Learning Outcome
Performance Improvement: A Remedial Action Classification Based
Approach," 2016 15th IEEE International Conference on Machine
Learning and Applications (ICMLA), Anaheim, CA, 2016, pp. 408-
413.
[22] Microsoft PowerBI – Business Intelligence & Visualization Package.
https://powerbi.microsoft.com/en-us/ accessed May 2019
[23] Tableau Business Intelligence.
https://www.tableau.com/products/desktop accessed May 2019
[24] Business Intelligence Solutions.
https://www.sap.com/products/analytics/business-intelligence-bi.html
accessed May 2019
[25] Matthew O. Ward, Georges Grinstein, Daniel Keim. Interactive Data
Visualization: Foundations, Techniques, and Applications, Second
Edition. First published 2015, eBook published 11 June 2015,
https://doi.org/10.1201/b18379, eBook ISBN 9780429173226
[26] A. Mittmann and A. Von Wangenheim, “A Multi-Level Visualization
Scheme for Poetry,” 2016 20th Int. Conf. Inf. Vis., pp. 312–317, 2016.
[27] R. Vliegen, J. J. van Wijk and E. van der Linden, "Visualizing Business
Data with Generalized Treemaps," in IEEE Transactions on
Visualization and Computer Graphics, vol. 12, no. 5, pp. 789-796,
Sept.-Oct. 2006.
doi: 10.1109/TVCG.2006.200.
[28] P. Craig and X. Huang, “Animated Space-Filling Hierarchy Views for
Security Risk Control and Visualization on Mobile Devices,” no.
Meita, pp. 772–775, 2015.
[29] J. S. Yi, Y. Kang, J. T. Stasko, and J. A. Jacko, “Toward a Deeper
Understanding of the Role of Interaction in Information Visualization,”
IEEE Trans. Vis. Comput. Graph., vol. 13, no. 6, pp. 1224–1231, 2007.
[30] M. Sondag, B. Speckmann, and K. Verbeek, “Stable Treemaps via
Local Moves,” IEEE Trans. Vis. Comput. Graph., vol. 24, no. 1, pp.
729–738, 2018.
[31] B. Shneiderman and M. Wattenberg, “Ordered Treemap Layouts,” vol.
2001, pp. 2–7, 2001.
[32] H. Di, X. Tang, and S. Wang, “A Novel High-dimension Data
Visualization Method Based on Concept Color Spectrum Diagram,”
2015 IEEE 11th Int. Colloq. Signal Process. Its Appl., pp. 140–144,
2015.
[33] Y. Xie, “Using Color to Improve the Discrimination and Aesthetics of
Treemaps,” vol. 21, no. 4, p. 2016, 2016.
[34] M. B. De Carvalho, B. S. Meiguins, and J. M. De Morais, “Temporal
data visualization technique based on Treemap,” Proc. Int. Conf. Inf.
Vis., vol. 2016-August, pp. 399–403, 2016.
[35] J. Görtler, C. Schulz, D. Weiskopf, and O. Deussen, “Bubble
Treemaps for Uncertainty Visualization,” IEEE Trans. Vis. Comput.
Graph., vol. 24, no. 1, pp. 719–728, 2018.
[36] H. M. Nicholas, B. Liebold, D. Pietschmann, P. Ohler, and P.
Rosenthal, “Hierarchy Visualization Designs and their Impact on
Perception and Problem Solving Strategies,” Proc. Int. Conf. Adv.
Comput. Interact., no. c, pp. 93–101, 2017.
Appendix A
Tables

Course ID | Description  | Assessments | U% | M% | A% | E% | Population
C401      | Capstone     | 2           | 5  | 10 | 19 | 65 | 1664
C102      | CS2          | 2           | 12 | 10 | 34 | 43 | 1212
C104      | Data Struct. | 4           | 2  | 11 | 35 | 50 | 3480
Table 5.1 Visualization Use Cases
Students with best/worst grades of the assessments above | From the 4th view, sort by "GRADE", descending/ascending
Improved Swarm Intelligence Optimization using
Crossover and Mutation for Medical Classification
Mais Yasen, Nailah Al-Madi
Department of Computer Science
Princess Sumaya University for Technology
Amman, Jordan
mai20130045@std.psut.edu.jo, n.madi@psut.edu.jo
Abstract – Early diagnosis helps in curing most diseases or in making them more bearable; it is therefore vital to enhance the accuracy of predicting chronic diseases. Extreme Learning Machine (ELM) is a classifier that can be efficiently used to predict diseases. The Artificial Bee Colony algorithm (ABC) and the Dragonfly Algorithm (DA) have been used efficiently in several optimization problems, including the optimization of ELM settings. Evolutionary Computation is a type of optimization algorithm that uses biological operators to find desired solutions. Two of these operators are crossover and mutation (CM), which generate new solutions from old ones and can be integrated with swarm intelligence algorithms to enhance their results. In this paper, models that use ABC and DA to optimize the number of hidden neurons and the weights of ELM are presented. Moreover, crossover and mutation are combined with the swarm search of ABC and DA for chronic disease forecasting, in models called ELM-ABC-CM and ELM-DA-CM. Four real datasets are used to evaluate the proposed models, and their results are compared with those of standard ABC and DA and of other well-known classifiers, including regular ELM, using different evaluation metrics. The results show that crossover and mutation improved the outcome of ABC and DA. Moreover, ELM-DA-CM proved its efficiency over ELM-ABC-CM.

Keywords—Machine Learning; Swarm Intelligence; Evolutionary Computation; Extreme Learning Machine; Dragonfly Algorithm; Artificial Bee Colony; Crossover; Mutation; Medical Prediction.

I. INTRODUCTION
Early diagnosis is important to cure most diseases or to manage them by preventing their consequences and making them more bearable [1]. Therefore, it is essential to increase the accuracy of predicting diseases such as heart disease, hepatitis, diabetes, and diabetic retinopathy. The symptoms of these diseases need to be taken into consideration when forecasting them using machine learning [2].

Machine learning (ML), a branch of artificial intelligence, enables computers to learn without being explicitly programmed [2]. It finds patterns by searching through the data and uses the detected patterns to alter program actions accordingly [2]. The process in which algorithms apply what has been learned from training data to predict new data is called supervised ML [3]. Classification is one of the main supervised ML tasks; it aims to build a model based on previous data to classify new data.

Extreme Learning Machine (ELM) is a neural network inspired by the biological brain; it consists of a computational model that contains a number of processing nodes called neurons [4]. Neurons send signals to one another over a large number of weighted connections that link the input, hidden, and output layers together. ELM training is feedforward: it travels from the input layer to the output layer and adjusts the weights without returning to the input layer, and it avoids getting stuck in local optima. This can explain why ELM has good generalization performance without using cycles, thus learning faster than other training methods such as backpropagation [5].

To increase the prediction accuracy of ELM, it can be implemented in conjunction with optimization algorithms to efficiently choose the number of its hidden-layer nodes and the values of the weights throughout the learning process [27]. Swarm Intelligence (SI) is a type of population-based, nature-inspired metaheuristic optimization algorithm that reflects the natural behavior of biological swarms [6]. The Artificial Bee Colony algorithm (ABC) and the Dragonfly Algorithm (DA) are SI algorithms that can be applied to optimize the number of hidden nodes and the weights of an ELM. ABC and DA were chosen because ABC has a feature of grouping the solutions and DA has a feature of distraction from enemies; these features and their phases enable the employment of natural operations. Also, ABC and DA proved their efficiency in previous works [7, 8].

Evolutionary Computation (EC) is another type of population-based, nature-inspired metaheuristic optimization algorithm. EC iteratively applies biological evolution to generate solutions [9]. Crossover and mutation are two vital biological operators in EC that are used to generate new populations from an existing one and to enhance the results through more exploration and exploitation [9]. These operators can be applied with SI optimization algorithms to enhance the prediction accuracy of an optimized classification algorithm. The contribution of this paper is summarized as follows:
1. Using crossover and mutation on ABC and DA.
2. Optimizing ELM using ABC-CM and DA-CM and improving the tuning of ELM.
3. Using 4 real datasets for training and testing our models.
4. Evaluating the proposed models and comparing them with other classifiers.

This paper is structured as follows: Section II includes the related literature in the area of work. Section III describes the background of the methods used in this work. Section IV presents the proposed methodology. Section V presents the experiments and the results, and Section VI concludes the research and discusses future work.
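The ELM training scheme described above (randomly drawn, fixed input weights; output weights computed analytically in one pass) can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation; the sigmoid activation, the synthetic data, and all function names are our own assumptions.

```python
import numpy as np

def elm_train(X, y, n_hidden, rng):
    """Minimal ELM sketch: random fixed input weights, analytic output weights."""
    W = rng.uniform(-1.0, 1.0, size=(X.shape[1], n_hidden))  # never updated
    b = rng.uniform(-1.0, 1.0, size=n_hidden)
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))   # hidden-layer activations (sigmoid)
    beta = np.linalg.pinv(H) @ y             # Moore-Penrose pseudo-inverse solve
    return W, b, beta

def elm_predict(X, W, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return H @ beta

# Toy usage on synthetic data (hypothetical, for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(float)
W, b, beta = elm_train(X, y, n_hidden=40, rng=rng)
train_acc = float(np.mean((elm_predict(X, W, b, beta) > 0.5) == (y > 0.5)))
```

Because the hidden layer is never retrained, the only free choices are the number of hidden nodes and the random weights, which is exactly what the ABC/DA search in this paper sets out to optimize.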
problems. The ABC algorithm was first proposed by Karaboga in 2005 [17]. It is a meta-heuristic SI optimization algorithm inspired by the foraging behavior of honeybees in nature. The solution of ABC is represented in a multi-dimensional search space as food sources and a population of three different types of bees (employed, onlooker, and scout). Let xi be the food source set found by the employed bees for each iteration of the ABC, xi = {xi1, xi2, …, xin}, where n is the number of solutions needed. Equation (2) is used to calculate a new derived solution [18].

v(i,j) = x(i,j) + ϕ(i,j) * (x(i,j) − x(y,j)) … (2)

where ϕ is a random number between 0 and 1, y is a random number between 0 and the maximum number of food sources (y must not equal the current food source i), and j is a random number generated between 0 and the maximum number of solutions.

Equation (3) calculates the probability of each solution suggested by the employed bees. It is also known as the roulette-wheel equation, which evaluates the solutions based on the fitness values achieved; this phase is called the onlooker bee phase [18].

p(i) = fit(i) / Σ(j=1..sn) fit(j) … (3)

where i is the current solution, pi is the probability of solution i, fiti is the fitness value of solution i, sn is the total number of solutions, j is the solutions counter, and fitj is the fitness of each solution j.

The scout bee phase is the final stage; it is responsible for checking the epoch reached so far, which is the number of times a solution is allowed to get worse than the solution produced before. The scout bee abandons the old solution and discovers a new solution for the employed bees to work on in the following iterations. Equation (4) is used to calculate the new solution [18].

x(i,j) = lb(j) + ϕ * (ub(j) − lb(j)) … (4)

where ub and lb are the vectors that contain the upper and lower bounds allowed for the solution, ϕ is a random number between 0 and 1, i is the current food source, and j is a random number between 0 and the maximum number of solutions.

C. Dragonfly Algorithm
The Dragonfly Algorithm (DA) was first proposed by Seyedali Mirjalili in 2016 [19]. It is an algorithm that can be used to optimize the number of hidden nodes and the weights of an ELM. DA is a meta-heuristic SI optimization algorithm inspired by the static and dynamic behaviors of dragonflies in nature [20]. In the static behavior, a large number of dragonflies migrate in a certain direction, travelling long distances [21]. In the dynamic behavior, on the other hand, dragonflies form groups and fly over different areas to find food resources [22].

DA has five principles that are important in finding the required solutions. First, the separation principle implies the static collision avoidance of a dragonfly from other dragonflies close to its position [19]. Second, the alignment principle reflects the velocity matching of a dragonfly to other dragonflies close to its position [19]. Third, the cohesion principle is the tendency of a dragonfly towards the center of the space that contains other dragonflies close to its position [19]. Fourth, the main aim of dragonfly swarms is to stay alive and survive, thus all dragonflies move towards the food sources in the attraction-to-food principle [19]. Fifth, to survive, all dragonflies move as far away as possible from the enemy sources in the distraction-from-enemies principle [19]. The following equations are used to calculate the values of the different principles [19]:

S(i) = − Σ(j=1..n) (X(i) − X(j)) … (5)
A(i) = (Σ(j=1..n) V(j)) / n … (6)
C(i) = (Σ(j=1..n) X(j)) / n − X(i) … (7)
F(i) = Xf − X(i) … (8)
E(i) = Xe + X(i) … (9)

The separation is calculated using Equation (5), where Xi is the position of the current dragonfly (i), Xj is the position of the jth dragonfly close to the current one, and n is the total number of dragonflies. The alignment is found using Equation (6), where Vj is the velocity of the jth dragonfly close to the current one (i). The cohesion is calculated as shown in Equation (7). The attraction to food is calculated using Equation (8), where Xf is the position of the food source. The distraction from the enemy is calculated as shown in Equation (9), where Xe is the position of the enemy. The values of ∆X and X are calculated using Equations (10) and (11) [19], where s, a, c, f, e, and w are the weights of the corresponding principles (S, A, C, F, E, and ∆X). e is calculated using Equation (12), where i is the current iteration and I is the maximum number of iterations. s, a, and c are three different random numbers between 0 and 2e, f is a random number between 0 and 2, and w is calculated using Equation (13).

∆X(t+1) = (s·S(i) + a·A(i) + c·C(i) + f·F(i) + e·E(i)) + w·∆X(t) … (10)
X(t+1) = X(t) + ∆X(t+1) … (11)
e = 0.1 − i * (0.1 / (I/2)) … (12)
w = 0.9 − i * ((0.9 − 0.4) / I) … (13)

D. Crossover and Mutation
Evolutionary Computation (EC) is another type of population-based, nature-inspired metaheuristic optimization algorithm. What distinguishes EC is the use of biological evolution on candidate solutions to remove the worst ones and to change solutions iteratively [9]. Crossover and mutation are popular examples of the operators used in EC. These operators are applied to generate new solutions from existing ones [9].

Crossover usually occurs in every iteration to combine the genetic information of two parents and generate new children [23]. Crossover has many types; uniform crossover is illustrated in Fig. 2, where two parents integrate in a uniform pattern to generate a new child [24]. There are two reasons why uniform crossover was chosen: first, using a uniform pattern guarantees stable and proportional new derived solutions; second, uniform crossover helps in reaching the best solution faster, because the amount of solution change is high. On the other hand, mutation usually happens less frequently; it finds better solutions by altering the genetic information of one or more genes of a solution [23]. Fig. 3 shows bit-inversion mutation, where a single gene is altered [24].

Fig. 2 Crossover [24]    Fig. 3 Mutation [24]
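The roulette-wheel probability of Equation (3) and the uniform crossover and bit-inversion mutation operators of Section III.D can be sketched as follows. This is an illustrative sketch, not the paper's code; the function names and the binary gene encoding are our own assumptions.

```python
import random

def roulette_probabilities(fitnesses):
    """Eq. (3): p_i = fit_i / sum over all fit_j."""
    total = sum(fitnesses)
    return [f / total for f in fitnesses]

def uniform_crossover(parent_a, parent_b, rng):
    """Uniform crossover (Fig. 2): each gene comes from either parent with equal chance."""
    return [a if rng.random() < 0.5 else b for a, b in zip(parent_a, parent_b)]

def bit_inversion_mutation(solution, rng):
    """Bit-inversion mutation (Fig. 3): flip one randomly chosen binary gene."""
    child = list(solution)
    i = rng.randrange(len(child))
    child[i] = 1 - child[i]
    return child

# Toy usage with illustrative fitness values and binary parents.
rng = random.Random(42)
probs = roulette_probabilities([2.0, 3.0, 5.0])
child = uniform_crossover([0, 0, 0, 0], [1, 1, 1, 1], rng)
mutant = bit_inversion_mutation([0, 1, 0, 1], rng)
```

In the proposed ABC-CM and DA-CM models, the roulette-wheel probability decides whether crossover or mutation fires in a given phase, exactly as the step lists in Section IV describe.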
IV. PROPOSED APPROACHES
The following are the execution steps of the proposed models, where the fitness calculation is done by sending the proposed number of hidden nodes and their weights to ELM. The ABC-CM and DA-CM steps are:
1. Calculate the solution probability using the roulette wheel. Check whether the solution probability is lower than the mutation probability.
2. In the employed bee (or dynamic) phase: if the probability is lower, alter the current solution using mutation, where a new solution is derived using the equations explored in Section III.
3. Choose the two parent solutions with the highest fitness.
4. Calculate the solution probability using the roulette wheel. Check whether the solution probability is lower than the crossover probability.
5. In the scout bee (or static) phase: if the probability is lower, reset the solution that reached the epoch and generate a new solution by combining the two selected parents in a uniform pattern.
6. Repeat steps 1 to 5 in each iteration of ABC or DA.

The execution steps of ELM-ABC, as shown in Fig. 4, are:
1. Initialize all food sources randomly.
2. Employed bees find all the possible solutions.
3. Find the fitness value for each proposed solution using ELM, retrieving the resulting accuracy.
4. The onlooker bee phase calculates the probability of each solution, then decides greedily, based on a random number, whether to follow the solution or not.
5. The scout bee phase checks whether each solution reached the epoch time.
6. Store the best solution based on a greedy selection.
7. Repeat steps 2 to 6 until the maximum number of iterations is reached.

The execution steps of ELM-DA, as shown in Fig. 5, are:
1. Initialize the dragonfly positions and position differences (∆X) randomly.
2. Calculate the fitness values for the proposed solutions.
3. Start the static phase by updating the best fitness value.
4. If the fitness value is better than the best fitness found so far, update the best food source with the solution.
5. If the fitness value is worse than the worst fitness found so far, update the worst enemy source with the solution.
6. Start the dynamic phase by calculating s, a, c, f, e, w, and ∆X.
7. Calculate the separation, alignment, cohesion, attraction-to-food, and distraction-from-enemy values.
8. Update the dragonfly position differences (∆X) and the dragonfly positions (X).
9. Repeat steps 3 to 8 until the maximum number of iterations is reached.

Fig. 4 ELM-ABC Process    Fig. 5 ELM-DA Process

V. EXPERIMENTS AND RESULTS
The performance of our approaches was evaluated by conducting a number of experiments, which are explained in this section.

A. Data
The performance evaluation was done on four medical datasets [25]. First, we applied feature selection on the datasets using gain ratio, to consider only the features most relevant to the class attribute, using WEKA [26]. Then we split the data into two sets, 66% for training and 34% for testing, as shown in Table 1.

Table 1 Number of Records and Features in the data files
Dataset        Training  Testing  Features (Selected)
Heart disease  177       93       14 (10)
Hepatitis      102       53       20 (11)
Diabetes       506       262      9 (5)
Retinopathy    760       391      19 (16)

B. Experiments settings
For the evaluation of our models, the fitness function was accuracy. The ABC and DA settings used are: Iterations: 100, Swarm size: 20, Seed: Random, Number of Sources: 50, Upper bound: 1, Lower bound: 0, Epoch: 50, Crossover probability: 0.8, Mutation probability: 0.2. The ELM settings are: Output Neurons: 2, Seed: Random, Hidden Layers: 1, Hidden Layer Nodes: Random.

After preparing the datasets and building our proposed models, ELM-ABC-CM and ELM-DA-CM need to be run 30 times to cover the randomness of the ABC and DA solutions. To evaluate our models, they are compared with seven classifiers implemented in WEKA with their default settings: Bayes Network (BN), Naïve Bayes (NB), Decision Tree (J48), K-Nearest Neighbors (IBK), K-star (K*), Repeated Incremental Pruning (J-Rip), and Artificial Neural Network (ANN).

To evaluate the efficiency of the classifiers we use the following metrics: accuracy, recall, precision, F-measure, and AUC, using Equations (14-18), where TN is true negative, TP is true positive, FN is false negative, and FP is false positive.

Accuracy = (TP + TN) / (TP + TN + FP + FN) … (14)
Recall = TP / (TP + FN) … (15)
Precision = TP / (TP + FP) … (16)
F-measure = 2 · (Precision · Recall) / (Precision + Recall) … (17)
AUC = area under the ROC curve … (18)
Table 2 Results (*1 Accuracy, *2 Precision, *3 Recall, *4 F-measure, *5 AUC)
Classifier Heart Disease Hepatitis
*1 *2 *3 *4 *5 *1 *2 *3 *4 *5
BN 81.52 0.88 0.80 0.84 0.82 81.13 0.45 0.56 0.50 0.71
NB 82.61 0.88 0.82 0.85 0.83 86.79 0.60 0.67 0.63 0.79
J48 67.39 0.79 0.62 0.69 0.69 86.79 0.67 0.44 0.53 0.70
IBK 78.26 0.89 0.73 0.80 0.80 81.13 0.45 0.56 0.50 0.71
K* 71.74 0.78 0.73 0.75 0.71 90.57 0.75 0.67 0.71 0.81
J-Rip 72.83 0.83 0.69 0.75 0.74 83.02 0.50 0.67 0.57 0.77
ANN 76.09 0.87 0.71 0.78 0.77 88.68 0.67 0.67 0.67 0.80
ELM 75.00 0.90 0.65 0.76 0.77 84.91 0.60 0.33 0.43 0.64
ELM-ABC 83.70 0.83 0.91 0.87 0.82 84.91 0.56 0.56 0.56 0.73
STDEV 1.35 0.03 0.02 0.01 0.02 5.05 0.12 0.13 0.12 0.07
best runs 85.87 0.88 0.93 0.88 0.85 88.68 0.67 0.67 0.67 0.80
ELM-DA 83.70 0.83 0.91 0.87 0.82 88.68 0.80 0.44 0.57 0.71
STDEV 0.00 0.01 0.01 0.00 0.00 0.82 0.04 0.15 0.09 0.06
best runs 83.70 0.84 0.93 0.87 0.83 88.68 0.75 0.67 0.67 0.80
ELM-ABC-CM 84.78 0.86 0.89 0.88 0.84 88.68 0.80 0.44 0.57 0.71
STDEV 1.33 0.03 0.01 0.01 0.02 0.96 0.02 0.06 0.05 0.03
best runs 85.87 0.88 0.93 0.88 0.85 90.57 0.83 0.56 0.67 0.77
ELM-DA-CM 84.62 0.84 0.93 0.88 0.82 90.57 0.83 0.56 0.67 0.77
STDEV 0.54 0.01 0.01 0.00 0.01 0.96 0.05 0.08 0.04 0.03
best runs 84.78 0.84 0.95 0.88 0.83 90.57 0.83 0.89 0.70 0.88
of both approaches were very competitive on the other datasets, and their best runs were better than those of most classifiers in the table.

VI. CONCLUSION AND FUTURE WORK
The goal of this work was to construct models that can predict chronic diseases and to evaluate their performance. The proposed models are swarm-based and integrate crossover and mutation with the search of ABC and DA (called ABC-CM and DA-CM). The enhanced ABC and DA models were used to improve the results of the ELM classifier. The datasets used in this research were real patients' records of four different medical cases. The results were compared with those of other well-known classifiers, including ELM, using different evaluation metrics. The results showed that ELM-ABC-CM and ELM-DA-CM improved the efficiency of ELM-ABC and ELM-DA, and that crossover and mutation decreased the randomness of the solutions produced. Moreover, ELM-DA-CM reached the best prediction in three datasets, and ELM-ABC-CM got the best accuracy in one dataset.

Based on the results, as future work it is necessary to enlarge the search space of ABC and DA to increase their accuracy. Moreover, the running time was long; thus it is important to find a way of parallelizing these models to achieve good results in a meaningful time.

REFERENCES
[1] WEBMD, "Health Screening: Finding Health Problems Early", Retrieved on: February 11, 2019, From: www.webmd.com.
[2] Margaret Rouse, (2016), "Analytics tools help make sense of big data", AWS, Retrieved on: December 6, 2018, From: searchbusinessanalytics.techtarget.com.
[3] Jerome H. Friedman, (1997), "Data mining and statistics: What's the connection", Proceedings of the 29th Symposium on the Interface Between Computer Science and Statistics, PP 1-7.
[4] Jun-Shien Lin and Shi-Shang Jang, (1998), "Nonlinear Dynamic Artificial Neural Network Modeling Using an Information Theory Based Experimental Design Approach", American Chemical Society, Vol. 37, PP 3640-3651.
[5] Guang-Bin Huang, Qin-Yu Zhu, Chee-Kheong Siew, (2006), "Extreme learning machine: Theory and applications", Neurocomputing, Vol. 70, PP 489-501.
[6] Beni G., Wang J., (1993), "Swarm Intelligence in Cellular Robotic Systems", Robots and Biological Systems: Towards a New Bionics?, Vol. 102, PP 703-712.
[7] M. Z. Yasen, R. A. Al-Jundi and N. S. Al-Madi, (2017), "Optimized ANN-ABC for Thunderstorms Prediction", 2017 International Conference on New Trends in Computing Sciences (ICTCS), PP 98-103.
[8] M. Yasen, N. Al-Madi and N. Obeid, (2018), "Optimizing Neural Networks using Dragonfly Algorithm for Medical Prediction", 2018 8th International Conference on Computer Science and Information Technology (CSIT), PP 71-76.
[9] Al-Jundi, Ruba, Mais Yasen, and Nailah Al-Madi, (2017), "Thunderstorms Prediction using Genetic Programming", International Journal of Information Systems and Computer Sciences, Vol. 7, PP 1-7.
[10] J. Kennedy and R. C. Eberhart, (1997), "A discrete binary version of the particle swarm algorithm", IEEE International Conference on Systems, Man, and Cybernetics, Vol. 5, PP 4104-4108.
[11] N. Higashi and H. Iba, (2003), "Particle swarm optimization with Gaussian mutation", IEEE Swarm Intelligence Symposium, PP 72-79.
[12] Weimin Zhong, Jianliang Xing and Feng Qian, (2008), "An improved theta-PSO algorithm with crossover and mutation", 7th World Congress on Intelligent Control and Automation, PP 5308-5312.
[13] Dong G., Cooper J., (2013), "Particle Swarm Optimization with Crossover and Mutation Operators Using the Diversity Criteria", ASME International Design Engineering Technical Conferences and Computers and Information in Engineering Conference, Vol. 3A, PP V03AT03A010.
[14] Pant M., Thangaraj R., Abraham A., (2007), "A New PSO Algorithm with Crossover Operator for Global Optimization Problems", Innovations in Hybrid Intelligent Systems, Advances in Soft Computing, Vol. 44, PP 215-222.
[15] Maureen Caudill, (1989), "Neural Network Primer", San Francisco: Miller Freeman Inc., PP 321.
[16] A. C. C. Coolen, (1998), "A Beginner's Guide to the Mathematics of Neural Networks", Springer, Chapter 2, PP 13-70.
[17] Dervis Karaboga, (2005), "An Idea Based on Honey Bee Swarm for Numerical Optimization", Technical Report TR06, PP 1-10.
[18] Yunfeng Xu, Ping Fan, Ling Yuan, (2013), "A Simple and Efficient Artificial Bee Colony Algorithm", Mathematical Problems in Engineering (MPE), Volume 2013, PP 1-9.
[19] Seyedali Mirjalili, (2016), "Dragonfly algorithm: a new meta-heuristic optimization technique for solving single-objective, discrete, and multi-objective problems", Springer, PP 1053-1073.
[20] M. A. Salam, H. M. Zawbaa, E. Emary, K. K. A. Ghany and B. Parv, (2016), "A hybrid dragonfly algorithm with extreme learning machine for prediction", INnovations in Intelligent SysTems and Applications (INISTA), PP 1-6.
[21] Robert W. Russell, Michael L. May, Kenneth L. Soltesz, John W. Fitzpatrick, (1998), "Massive Swarm Migrations of Dragonflies in Eastern North America", University of Notre Dame, PP 325-342.
[22] Martin Wikelski, David Moskowitz, James S. Adelman, Jim Cochran, David S. Wilcove, Michael L. May, (2006), "Simple rules guide dragonfly migration", PMC, PP 325-329.
[23] Zakir H. Ahmed, (2010), "Genetic Algorithm for the Traveling Salesman Problem using Sequential Constructive Crossover Operator", International Journal of Biometrics and Bioinformatics (IJBB), Vol. 3, PP 96-105.
[24] Marek Obitko, (1998), "Introduction to Genetic Algorithms", Retrieved on: February 14, 2019, From: obitko.com.
[25] David Aha, (2013), "UCI Machine Learning Repository", University of California, Irvine.
[26] WEKA, Version 3.8, Retrieved on: September 5, 2016, From: www.cs.waikato.ac.nz.
[27] Faris H., Ala'M A. Z., Heidari A. A., Aljarah I., Mafarja M., Hassonah M. A., Fujita H., (2019), "An intelligent system for spam detection and identification of the most relevant features based on evolutionary random weight networks", Information Fusion, 48, PP 67-83.
Novel Approach towards Arabic Question
Similarity Detection
Mohammad Daoud
CS Department, Faculty of IT
American University of Madaba
Madaba, Jordan
m.daoud@aum.edu.jo
Abstract—In this paper we address the automatic detection of Arabic question similarity, which is an essential issue in a variety of NLP/NLU applications such as question answering systems, virtual assistants, chatbots, etc. We propose and experiment with a rule-based approach that relies on lexical and semantic similarity between questions, with the utilization of supervised learning algorithms. Our approach categorizes questions semantically according to their type and scope; this categorization is based on hypothetical rules that have been validated empirically. For example, a Timex Factoid question (a question asking about time) is less likely to be similar to an Enamex Factoid question (a question asking about a named entity). This article details the procedures of question-pair preprocessing, lexical analysis, feature extraction and selection, and, most importantly, the similarity measures. According to the experiment we have conducted, our approach achieved promising precision and accuracy on test data of 1450 question pairs.

Keywords—text similarity, question analysis, question similarity, semantic similarity, data science, Natural Language Processing.

I. INTRODUCTION
Finding similarity between various textual units (words, expressions, phrases, paragraphs, …) is an important NLP task [1]. Many applications report significant improvements in their performance when a text similarity component is deployed, such as information retrieval [2], machine translation [3], text clustering [4], and sentiment analysis [5]. This task has been tackled by researchers from different points of view. Some methods assume that two textual units are similar if they share subsequences of characters and words; for example, cosine similarity and Jaccard similarity [6] can be used as simple similarity measures between phrases based on the words they have in common. Semantic similarity tries to find logical similarity between texts even in the absence of lexical similarity [7]; for example, a semantic network or a corpus can be used to determine the degree of similarity between two words or expressions even if the texts seem different in terms of their characters and words [8].

Similarity between questions is an interesting task that can be very helpful for a series of applications such as question answering systems [9], virtual assistants [10], and chatbots [11]. It can be considered a sub-problem of text similarity. The challenge here is that questions are difficult to process and have little to no textual context. Besides, questions are paraphrased more often than other utterances [12].

Arabic question similarity is even more challenging, because Arabic is a pi-language (poorly informatized language) [13] [14] and gaining semantic information from its corpus is difficult. Few research attempts have addressed Arabic question similarity, and mediocre results have been achieved (when compared to other resourceful languages) [15].

With the absence or scarceness of a relevant semantic corpus for Arabic, a rule-based system for categorizing questions can be used [16]. In this paper we pursue a hybrid approach that utilizes supervised learning and hypothetical rules to find similarity and to detect paraphrasing.

Many researchers focus only on corpus data-driven approaches to cluster, classify, and map words and phrases [7] [17]. We believe that this is an essential part of the similarity detection task. However, in the context of question similarity, certain rules can be set to improve the understanding of the questions and to relate them accordingly. For example, the following two questions are distanced even though they have high string similarity, high term similarity, and high semantic similarity, simply because the first one asks about a time and the second one asks about a location. Q1 = "Arabic: متى وقعت غزوة بدر؟ - English: When did the Battle of Badr take place?" Q2 = "Arabic: اين وقعت غزوة بدر؟ - English: Where did the Battle of Badr take place?". In this paper we form a framework to understand Arabic questions and use it to improve question similarity detection.

This paper is organized as follows: the next section lists and compares the most relevant related work. After that, in Section III we introduce our approach to question comparison and analysis. In Section IV we detail the aspects of the data set we use for the experiment, and the preprocessing method. Section V then shows the experiment and its results, while Section VI evaluates and assesses our method. Finally, we draw some conclusions and discuss future work and possible applications.

II. RELATED WORK
Similarity between phrases can be approached through textual (string) similarity and semantic similarity. Question similarity, which is the focus of this paper, is a sub-problem of phrasal similarity. Therefore, this section will address
(Question-type table, continued from the previous page)
…   | الي "For whom"
NED | Named Entity - Definition | من، ما "Who, What" | ما تعريف "what is the definition", من هو "Who is"
M   | Method                    | كيف "How"           | ما هي طريقة "What is the method", ما هو وصفة "What is the recipe", ما الخطوات "What are the steps"
P   | Purpose                   | لماذا "Why"         | ما هو السبب "what is the reason", ما المسبب "What causes"
C   | Cause                     | ماذا "What"         | ما الذي "What"
L   | List                      | اذكر، عدد "List"
YN  | Yes/No                    | هل "Is/was/are…"    | ء "Question Hamza"

We seek to give a similarity measure for a couple of Arabic questions based on the scope of their interrogative word (question word). We use empirical and hypothetical approaches to establish the needed rules.

It is intuitive that a method question that starts with "كيف - How" will be dissimilar to a factoid timex question that starts with "متى - when", and based on that we can hypothesize the following rule:

If q1.scope = M and q2.scope = TimexF then qw1 = -1

This hypothetical rule can be confirmed empirically by an experiment. In the same way, we assumed that if the scope of the two questions is the same then they have a similarity measure of 1.

We found out through the experiment that some of the scopes have unconfirmed similarity, such as NEF - NED and P - M. Therefore, such an occurrence results in a 0 similarity measure.

IV. DATA PREPARATION
For experimentation, we selected 300 Arabic questions from the Frequently Asked Questions pages of various United Nations organizations, and we randomly selected 300 interrelated casual Arabic questions from ejaaba.com. We used these 600 questions to randomly generate 1450 couples. Each couple was given a YES or NO label to indicate the similarity of the two questions. 419 couples were labeled YES, and 1031 couples were labeled NO. Because it was difficult to find YES-labeled questions in the randomly generated couples, we used paraphrasing to generate half of the YES-labeled couples, and we used the same technique with 100 NO-labeled questions.

The 1450 couples were normalized (Arabic and question normalization) and then used to generate the features described in Section III.

The distribution of the scopes of the 600 unique questions was as shown in Table 2.

TABLE 2. The distribution of the scopes of the 600 unique questions
Scope                      Number of questions
Time - Factoid             88
Location - Factoid         79
Numeric value - Factoid    69
Named Entity - Factoid     27
Named Entity - Definition  55
Method                     78
Purpose                    48
Cause                      45
List                       19
Yes/No                     92

V. EXPERIMENT
We used several classification algorithms provided by WEKA 3.8 [38] on the generated data set. Random Forests [39] with 10-fold cross validation produced the best results among the classifiers we tested, in terms of precision, recall, and F-measure. Table 3 shows the results reported by the Random Forests classifier.

TABLE 3. Results reported by Random Forests Algorithm, with our proposed features
               Precision  Recall  F-measures
Yes            0.82       0.59    0.69
No             0.85       0.95    0.90
Weighted Avg.  0.84       0.85    0.84

To evaluate our novel approach, we ran the test after removing our special features (End Similarity, Start Similarity, Question Word Similarity); the remaining features were therefore simply based on cosine similarity, Jaccard similarity, Euclidean distance, and Longest Common Subsequence. Table 4 shows the results for the same test without our features.
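The scope rules described above (same scope gives 1, unconfirmed pairs such as NEF-NED and P-M give 0, incompatible scopes give -1) can be sketched as a feature function. This is an illustration of the idea, not the paper's implementation: the scope codes are shorthand for the paper's categories, and treating every remaining cross-scope pair as -1 is our simplifying assumption.

```python
# Scope codes (assumed shorthand): TimexF = time factoid, NEF = named-entity
# factoid, NED = named-entity definition, M = method, P = purpose, etc.
UNCONFIRMED_PAIRS = {frozenset({"NEF", "NED"}), frozenset({"P", "M"})}

def scope_similarity(scope1, scope2):
    """Question-word similarity feature based on the hypothesized scope rules."""
    if scope1 == scope2:
        return 1                      # same scope: similar question words
    if frozenset({scope1, scope2}) in UNCONFIRMED_PAIRS:
        return 0                      # empirically unconfirmed similarity
    return -1                         # clearly incompatible scopes

# "How ..." (method) versus "When ..." (time factoid) questions are dissimilar.
qw_feature = scope_similarity("M", "TimexF")
```

A feature like this is what lets the classifier rule out pairs such as "When"/"Where" questions even when their string similarity is high.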
TABLE 4. Results reported by Random Forests Algorithm, without our proposed features

               Precision   Recall   F-measures
Yes            0.40        0.32     0.35
No             0.74        0.80     0.77
Weighted Avg.  0.64        0.66     0.65

As the table shows, there is a significant drop in accuracy for the same algorithm in terms of precision (-0.2), recall (-0.19) and F-measures (-0.19).

VI. EVALUATION AND ASSESSMENT

Our system can detect question paraphrasing and synonymy with an overall precision of 0.85. The proposed question type similarity increased the accuracy, especially for NO-labeled questions. This was achieved without using a lexical or semantic dictionary.

From Table 3 we notice that the accuracy of the YES-labeled questions is behind the accuracy of the NO-labeled questions, which can be due to the fact that question type similarity was very effective in determining whether two questions are dissimilar (for example, "When" questions can't be similar to "Where" questions, and that can be easily determined). However, determining similar questions within the same scope needs more than question type similarity. We noticed that some of the YES-labeled errors could have been avoided by a simple synonymy lexicon.

Our accuracy results are comparable with those of similar experiments, even those performed on resource-rich languages such as English [40] [41].

We believe that utilizing a domain-dedicated lexicon can improve the results even more, and that is definitely a future research focus.

VII. CONCLUSION

We have presented a novel approach to detect similarity between Arabic questions. Our rule-based similarity algorithm showed effectiveness in the experiment we conducted, despite its limited dependency on lexical resources. String-based similarity and lexical-based similarity can be used as a base for our algorithm, but they have narrow capabilities, and thus the similarity measures proposed in this paper improved accuracy and precision. The results obtained in the experiment were comparable to those of similar experiments on the English language, which is significant considering that English is a resource-rich language compared to Arabic. We anticipate that the results will be improved further with the help of a carefully constructed multi-domain Arabic lexicon, and this is part of our future work.

REFERENCES

[1] M. K. Vijaymeena and K. Kavitha, "A survey on similarity measures in text mining," Mach. Learn. Appl. An Int. J., vol. 3, no. 2, pp. 19–28, 2016.
[2] X. Ye, H. Shen, X. Ma, R. Bunescu, and C. Liu, "From word embeddings to document similarities for improved information retrieval in software engineering," in Proceedings of the 38th International Conference on Software Engineering, 2016, pp. 404–415.
[3] M. Simard, N. Ueffing, P. Isabelle, and R. Kuhn, "Rule-based translation with statistical phrase-based post-editing," 2007, pp. 203–206.
[4] C. C. Aggarwal and C. X. Zhai, "A survey of text clustering algorithms," in Mining Text Data, Boston, MA: Springer US, 2012, pp. 77–128.
[5] B. Pang and L. Lee, Opinion Mining and Sentiment Analysis: Foundations and Trends in Information Retrieval, vol. 2, no. 1–2, 2008.
[6] A. Huang, "Similarity measures for text document clustering," in New Zealand Computer Science Research Student Conference, NZCSRSC 2008 - Proceedings, 2008, pp. 49–56.
[7] A. Islam, "Semantic text similarity using corpus-based word similarity and string similarity," ACM Trans. Knowl. Discov.
[8] M. Steyvers and J. B. Tenenbaum, "The large-scale structure of semantic networks: Statistical analyses and a model of semantic growth," Cogn. Sci., vol. 29, no. 1, pp. 41–78, 2005.
[9] J. Weston et al., "Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks," arxiv.org, 2015.
[10] T. R. Gruber, C. D. Brigham, D. S. Keen, G. Novick, and B. S. Phipps, "Using Context Information to Facilitate Processing of Commands in a Virtual Assistant," Washington, DC: U.S. Pat. Trademark Off., 2018.
[11] N. M. Radziwill and M. C. Benton, "Evaluating Quality of Chatbots and Intelligent Conversational Agents," Apr. 2017.
[12] T. Jurczyk, A. Deshmane, and J. D. Choi, "Analysis of Wikipedia-based Corpora for Question Answering," Jan. 2018.
[13] M. Daoud, "Building Arabic polarized lexicon from rated online customer reviews," in Proceedings - 2017 International Conference on New Trends in Computing Sciences, ICTCS 2017, 2018, pp. 241–246.
[14] C. R. Silveira, M. T. P. Santos, and M. X. Ribeiro, "A flexible architecture for the pre-processing of solar satellite image time series data - The SETL architecture," Int. J. Data Mining, Model. Manag., vol. 11, no. 2, pp. 129–143, 2019.
[15] A. Hamza, N. En-Nahnahi, K. A. Zidani, and S. El Alaoui Ouatik, "An Arabic question classification method based on new taxonomy and continuous distributed representation of words," J. King Saud Univ. - Comput. Inf. Sci., 2019.
[16] C. Grosan and A. Abraham, "Rule-Based Expert Systems," 2011, pp. 149–185.
[17] A. Prior and M. Geffet, "Word Association Strength, Mutual Information and Semantic Similarity," in EuroCogSci 2003, 2003.
[18] J. Lu, C. Lin, W. Wang, C. Li, and H. Wang, "String similarity measures and joins with synonyms," in Proceedings of the 2013 International Conference on Management of Data - SIGMOD '13, 2013, p. 373.
[19] G. Navarro, "A guided tour to approximate string matching," ACM Comput. Surv., vol. 33, no. 1, pp. 31–88, Mar. 2001.
[20] P. Gamallo, C. Gasperin, A. Agustini, and G. P. Lopes, "Syntactic-Based Methods for Measuring Word Similarity," Springer, Berlin, Heidelberg, 2001, pp. 116–125.
[21] A. Apostolico and C. Guerra, "The longest common subsequence problem revisited," Algorithmica, vol. 2, no. 1–4, pp. 315–336, Nov. 1987.
[22] P. Angeles and A. Espino-Gamez, "Comparison of methods Hamming Distance, Jaro, and Monge-Elkan," in DBKDA 2015, Seventh Int. Conf. Adv. Databases, Knowledge, Data Appl., pp. 63–69, 2015.
[23] F. Miller, A. Vandome, and J. McBrewster, "Distance: Information theory, computer science, string (computer science), string metric, Damerau–Levenshtein distance, spell checker, Hamming distance," 2009.
[24] V. Liki, "The Needleman-Wunsch algorithm for sequence alignment, 7th Melbourne Bioinformatics Course," cs.sjsu.edu, pp. 1–46.
[25] R. Mihalcea, C. Corley, and C. Strapparava, "Corpus-based and knowledge-based measures of text semantic similarity," in Proceedings of the National Conference on Artificial Intelligence, 2006, vol. 1, pp. 775–780.
[26] N. Oco, L. R. Syliongka, R. E. Roxas, and J. Ilao, "Dice's coefficient on trigram profiles as metric for language similarity," in 2013 International Conference Oriental COCOSDA held jointly with 2013 Conference on Asian Spoken Language Research and Evaluation (O-COCOSDA/CASLRE), 2013, pp. 1–4.
[27] D. Daoud and M. Daoud, "Extracting terminological relationships from historical patterns of social media terms," in Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2018, vol. 9623 LNCS, pp. 218–229.
[28] L. Azzopardi, M. Girolami, and M. Crowe, "Probabilistic hyperspace analogue to language," in Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval - SIGIR '05, 2005, p. 575.
[29] T. Hofmann, "Probabilistic latent semantic indexing," in Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 1999, 1999, vol. 51, no. 2, pp. 50–57.
[30] M. Monjurul Islam and A. S. M. Latiful Hoque, "Automated essay scoring using Generalized Latent Semantic Analysis," in 2010 13th International Conference on Computer and Information Technology (ICCIT), 2010, pp. 358–363.
[31] O. Egozi, S. Markovitch, and E. Gabrilovich, "Concept-Based Information Retrieval Using Explicit Semantic Analysis," ACM Trans. Inf. Syst., vol. 29, no. 2, pp. 1–34, Apr. 2011.
[32] G. Bouma, "Normalized (Pointwise) Mutual Information in Collocation Extraction."
[33] M. A. Islam and D. Inkpen, "Second Order Co-occurrence PMI for determining the semantic similarity of words," in Proceedings of the 5th International Conference on Language Resources and Evaluation, LREC 2006, 2006, pp. 1033–1038.
[34] R. L. Cilibrasi and P. M. B. Vitanyi, "The Google Similarity Distance," IEEE Trans. Knowl. Data Eng., vol. 19, no. 3, pp. 370–383, Mar. 2007.
[35] P. Kolb, "Disco: A multilingual database of distributionally similar words," Nat. Lang. Process., no. 2003, pp. 37–44, 2008.
[36] G. A. Miller, "WordNet: A Lexical Database for English," Commun. ACM, vol. 38, no. 11, pp. 39–41, Nov. 1995.
[37] A. Abdelali, K. Darwish, N. Durrani, and H. Mubarak, "Farasa: A Fast and Furious Segmenter for Arabic."
[38] E. Frank et al., "Weka - A Machine Learning Workbench for Data Mining," in Data Mining and Knowledge Discovery Handbook, Boston, MA: Springer US, 2009, pp. 1269–1277.
[39] L. Breiman, "Random forests," Mach. Learn., pp. 5–32, 2001.
[40] P. Nakov et al., "SemEval-2017 Task 3: Community Question Answering."
[41] B. V. Galbraith, B. Pratap, and D. Shank, "Talla at SemEval-2017 Task 3: Identifying Similar Questions Through Paraphrase Detection."
Using K-Means Clustering and Data Visualization for Monetizing Logistics Data

Hamzah Qabbaah (1), Koen Vanhoof (1), George Sammour (2)

(1) Department of Business Informatics, Hasselt University, Diepenbeek, Belgium
(2) Department of Management Information Systems, Princess Sumaya University for Technology (PSUT), Amman, Jordan

Hamzah.qabbaah@uhasselt.be, Koen.vanhoof@uhasselt.be, George.sammour@psut.edu.jo
Abstract— Logistics companies collect large amounts of data on the shipments they perform, while at the same time facing the challenge of understanding their complicated market better. They can extract useful market knowledge by using data mining technologies such as visualization and clustering. The detailed results of such big data analytics methods can also be monetized under certain circumstances. We studied the data on the transactions of a logistics company in the Middle East. K-Means clustering of their data proved to generate deeper insight into several clusters of customers having different profiles. The results propose a best-fit model for the clustering. Since the clustering and visualization results are relevant, reliable and anonymous, they fit the monetization criteria as well. Improved data-driven marketing applications are possible for the customers.

Keywords— k-means clustering, data visualization, customer segmentation, big data monetization

I. INTRODUCTION

Data, when analysed and interpreted well, can tell companies a lot about their customers' interests and allow them to improve their customers' experiences. They are also a potential source of income generation [1]. Companies like Google and Facebook are already earning most of their revenues by enabling marketers to target a specific audience, based on the audience characteristics [2]. Companies can derive this income from their own collected high-quality data by selling them to other companies. Data are thus valuable for internal use and for potential use by other companies [3]. This monetizing process is however facing a number of challenges. Acquiring the required large amount of data often exceeds the budgets of potential customers, and the platforms for monetizing the data efficiently are still lacking. Moreover, data quality has to be unquestionable [4].

Data monetization by companies has not been studied extensively. Only a few articles [5-7] have studied the phenomenon, and mainly from the angle of concerns over privacy. The authors in [5] adopted an economics-based approach which addresses the issue of disseminating sensitive data to a third-party data user [2]. The economics-based approach normally assesses the value of the data to be monetized on four characteristics: the data quality has to be reliable, the data set has to be relevant to the potential customer, the data have to be anonymous and secured, and finally, segmented data have a larger potential to lead to relevant business applications [3] [2]. This signifies that before starting the monetizing effort, companies first have to visualize their data and apply segmentation methods to them to make them more valuable for potential customers.

This paper investigates this process in particular for a large logistics company in the Middle East. Our focus will be mostly on the segmentation phase, as well as on visualizing the data set to be monetized.

II. RESEARCH QUESTION AND METHODOLOGY

In this paper, we try to answer the following research question: "How can segmentation in several customer groups be used to enable the monetization of the data used in it?" The data mining technique we used to segment the available data set is clustering.

Clustering is the task of segmenting a heterogeneous population into a number of more homogeneous subgroups or clusters with similar characteristics, such that both the homogeneity of elements within clusters and the heterogeneity between clusters are maximized [8]. It has been applied in a wide variety of fields, such as engineering, computer sciences (web mining, spatial database analysis, and segmentation), life and medical sciences, earth sciences, social sciences and economics (in marketing, business analysis and CRM management) [9]. What distinguishes clustering from classification is consequently that clustering does not rely on ordering data along predefined classes. Cluster analysis is based on heuristics that try to maximize the similarity between in-cluster elements and the dissimilarity between inter-cluster elements [10]. This task has been performed in our paper through the k-Means algorithm. This algorithm partitions the data set into k clusters in which each object or instance is assigned to the closest central point with the nearest mean [11]. Next, the heuristic performs a reassignment of the central points. The algorithm is completed when the assignments of the individual instances no longer change.

Our study consists of two separate parts. The first part results in obtaining a data set that eventually can be monetized. This part develops in-depth statistics and visualization charts about the dataset. It also shows the product market share statistics according to our destination countries. Since this part aims at showing in which way the visualization charts and statistics can help in getting a clearer understanding of the dataset, we will only shortly refer to it in this paper; moreover, we will show an example of the product market shares for the Jordan case. The second part is the K-Means clustering itself, which is explained in a separate section. We then look at the monetizability. We will use the monetizing characteristics mentioned before in section one to evaluate the monetizability of these data.
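The k-Means procedure used in this study (assign each instance to the nearest central point, recompute the central points, stop when assignments no longer change) can be sketched in a few lines. This is a toy NumPy illustration of the algorithm, not the Tableau implementation the paper relies on:

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Partition rows of X into k clusters by repeatedly assigning each
    point to its nearest centroid and recomputing the centroids."""
    rng = np.random.default_rng(seed)
    # Initialize centroids with k distinct data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.full(len(X), -1)
    for _ in range(max_iters):
        # Distance of every point to every centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments stopped changing: converged
        labels = new_labels
        # Reassign each centroid to the mean of its cluster.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids
```

On the study's data the rows would be numeric encodings of the segmentation variables (for example Avg. Total Value USD plus encoded product group and destination); any numeric matrix works here.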
978-1-7281-2882-5/19/$31.00 ©2019 IEEE
III. DATA PREPARATION AND VISUALIZATION

The data used in this research were obtained from a logistics services company situated in the Middle East. Cleaning, merging tables and pre-processing of the data have been applied in order to obtain the final data set. We have created new relevant variables to describe and group some other variables. We also standardized the values of some variables to the same unit (kg, US dollar) to have more accurate results when analyzing the data. The total number of transactions in the final dataset is equal to the size of the sample (n = 85959). Table I below shows the variables, the type of data they represent and the description of each of the variables used in our research.

TABLE I. DATA DESCRIPTION

Variables          | Data type | Data Description
Variables in the original data set:
ID                 | Integer   | The ID of the order
CODValueUSD        | Double    | The amount of cash on delivery
Payment            | String    | Type of payment: prepaid, cash, third party, free
Destination        | String    | Destination city of shipment
Origin Country     | String    | Country of origin of the shipment
DestCountry        | String    | Country of destination of the shipment
ShipperID          | Integer   | The ID of the e-commerce companies
CODFlag            | Boolean   | Cash on delivery flag
"Consignee Tel"    | Integer   | The telephone number of the customer
Created variables after data preparation:
Weight In KG       | Double    | Total weight in KG
Total Value USD    | Double    | The price of the goods in the shipment in US Dollar
Product Group name | String    | Product group name of the shipment
Product group ID   | Integer   | Product group ID

These data were then visualized using Tableau software. Different attributes and dimensions, such as location, products, customers and e-commerce companies, were extensively represented in graphs. These attributes are grouped in different ways, such as by "customers", "products", "e-companies", "destination countries" and so on. The list of visualized dimensions is shown in Table II. Moreover, since our destination countries were "Saudi Arabia", "UAE" and "Jordan", we show a sample of the results of the e-commerce market share for the common products transferred in the Jordan case, as shown in Fig. 1.

TABLE II. THE VISUALISATION OF THE DIFFERENT DIMENSIONS

Number of transactions | Dimensions (variables)
Distribution percentages of country of destination, country of origin and city of destination | Origin country; Destination country; Destination city
Products transferred to the country of destination | Products; Destination country
Products transferred to the city of destination | Products; Destination city
Products transferred from the country of origin | Products; Origin country
The distribution percentages of the e-commerce companies that have orders transferred to the country of destination | E-commerce companies; Destination country
The distribution percentages of the customers' orders transferred to the country of destination | Customer; Destination country
The distribution percentages of the product categories by the customers | Customer; Products
The distribution percentages of the product categories transferred to the countries of destination from the countries of origin | Origin country; Destination country; Products
Returned orders distribution by the country of destination | Return products; Destination country
Returned orders distribution by the city of destination | Return products; Destination city
Returned orders distribution by the e-commerce companies | Return products; E-commerce company
Returned orders distribution by the customers | Return products; Customer

The figure presents the e-commerce companies' market share on the basis of the product transactions for Jordan. E-company "15037" has the highest market share for "Apparel", "Bag/Case", "Beauty supplies", "Book", "Food/Grocery", "Jewellery Accessories" and "Shoes" with 69%, 86%, 92%, 82%, 77%, 85% and 79% respectively, whereas e-company "197483" has the highest market share for the "Letter/Card/Document" product with 40%. The market share of the products for the top five e-companies can be seen in the figure.

IV. K-MEANS CLUSTER ANALYSIS AND RESULTS

Customer segmentation focuses on getting knowledge about the structure of customers and is used for targeted marketing [12], such as in new product development, optimizing the placement of retail products on shelves, the analysis of cannibalization between products and, more generally, in analysing the affinity between products and cross-category sales promotion [13, 14]. The segmentation efforts we have performed are essential for developing improved segmentation bases for e-marketing applications such as the monetization of the data in the dataset [14, 15].

Our segmentation model has the purpose of finding segments of customers sharing the same profile on the basis of a combination of the variables products bought, location and value of the goods purchased.

The variables used in our model are Avg. Total Value USD, Product Group Name, Country of Destination, Consignee Tel and Destination.

In order to find the best cluster fit, we have run the analysis for 2 to 5 clusters. Table III shows the results of the 2-clusters solution. Both clusters have "Apparel" as the most common product ordered.
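The preparation steps of Section III (standardizing weights to kilograms, values to US dollars, and deriving per-country market-share statistics as visualized in Fig. 1) might look roughly like this in pandas. The column names loosely follow Table I, but the sample rows, the unit columns and the JOD conversion rate are illustrative assumptions of ours, not taken from the paper's dataset:

```python
import pandas as pd

# Hypothetical raw shipment rows; names loosely follow Table I.
raw = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "WeightValue": [500.0, 2.0, 1.5, 3.0],   # mixed units (assumed)
    "WeightUnit": ["g", "kg", "kg", "kg"],
    "TotalValue": [100.0, 50.0, 30.0, 20.0],
    "Currency": ["USD", "JOD", "USD", "USD"],
    "Product Group name": ["Apparel", "Apparel", "Book", "Apparel"],
    "DestCountry": ["SA", "JO", "JO", "SA"],
})

# Standardize weight to kilograms ("Weight In KG" in Table I).
raw["Weight In KG"] = raw["WeightValue"].where(
    raw["WeightUnit"] == "kg", raw["WeightValue"] / 1000.0)

# Standardize monetary value to US dollars ("Total Value USD").
JOD_TO_USD = 1.41  # illustrative fixed conversion rate
raw["Total Value USD"] = raw["TotalValue"].where(
    raw["Currency"] == "USD", raw["TotalValue"] * JOD_TO_USD)

# Product market share per destination country, as visualized in Fig. 1.
counts = raw.groupby(["DestCountry", "Product Group name"]).size()
share = counts.div(raw.groupby("DestCountry").size(), level=0)
```

`share` then holds, per destination country, the fraction of transactions falling in each product group, which is the quantity the market-share charts display.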
Fig. 1: E-commerce companies' market share on the basis of the product transactions for Jordan.

TABLE III. THE RESULTS OF THE 2-CLUSTERS SOLUTION

Attributes/Clusters                | Cluster 1  | Cluster 2
Most Common Country of Destination | SA         | JO
Consignee Tel                      | 9665555XXX | 9626535XXX
Destination                        | RUH        | AMM
Note: RUH: Riyadh, AMM: Amman. SA: Saudi Arabia, JO: Jordan.

The Consignee Tel variable shows the most common customer having transactions within each cluster. Table IV, Table V and Table VI show the results of the 3-clusters, 4-clusters and 5-clusters solutions respectively.

TABLE IV. THE RESULTS OF THE 3-CLUSTERS SOLUTION

Attributes/Clusters                | Cluster 1  | Cluster 2   | Cluster 3
Number of Items                    | 78244      | 5457        | 2257
Avg. Total Value USD               | 111.31     | 46.487      | 95.425
Most Common Product Group Name     | Apparel    | Apparel     | Apparel
Most Common Country of Destination | SA         | JO          | AE
Consignee Tel                      | 9665555XXX | 96265358XXX | 97145076XXX
Destination                        | RUH        | AMM         | DXB
Note: DXB: Dubai. AE: United Arab Emirates.

TABLE V. THE RESULTS OF THE 4-CLUSTERS SOLUTION

Attributes/Clusters                | Cluster 1 | Cluster 2   | Cluster 3   | Cluster 4
Most Common Country of Destination | SA        | JO          | AE          | SA
Consignee Tel                      | 9665555X  | 96265358XXX | 97145076XXX | 96614393X
Destination                        | RUH       | AMM         | DXB         | JED
Note: JED: Jeddah.

TABLE VI. THE RESULTS OF THE 5-CLUSTERS SOLUTION

Attributes/Clusters                | Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 | Cluster 5
Most Common Country of Destination | SA        | JO        | AE        | SA        | SA
Consignee Tel                      | 9665555XX | 9626558XX | 9714577XX | 9665712XX | 9665051XX
Destination                        | RUH       | AMM       | DXB       | RUH       | JED

Table V shows that the first three clusters have "Apparel" as the most common product ordered, while in cluster-4 the most common one is "DVD/CD". Cluster-1 shows that the
most frequent orders are shipped to "Riyadh", with an average total price of 109.09. Cluster-2 shows that the most frequent orders are shipped to "Amman", with an average price of 46.487. Cluster-3 shows that the most frequent orders are shipped to "Dubai", with an average total price of 95.425. Cluster-4 shows that the most frequent orders are shipped to "Jeddah", with the highest average total price of 17466. The Consignee Tel variable shows the most common customer having transactions within each cluster.

Table VI shows that the first four clusters have "Apparel" as the most common product ordered, while in cluster-5 the most common one is "Camera". Cluster-1 shows that the most frequent orders are shipped to "Riyadh", with an average total price of 105.72. Cluster-2 shows that the most frequent orders are shipped to "Amman", with an average price of 46.487. Cluster-3 shows that the most frequent orders are shipped to "Dubai", with an average total price of 95.425. Cluster-4 shows that the most frequent orders are shipped to "Riyadh", with an average total price of 3608. Finally, cluster-5 shows that the most frequent orders are shipped to "Riyadh", with the highest average total price of 25251. The Consignee Tel variable shows the most common customer having transactions within each cluster.

We use the Calinski-Harabasz criterion to assess cluster quality. The Calinski-Harabasz criterion is defined as (1):

CH = (SSB / (k - 1)) / (SSW / (N - k))    (1)

where SSB is the overall between-cluster variance, SSW the overall within-cluster variance, k the number of clusters, and N the number of observations [16]. The greater the value of this ratio, the more cohesive the clusters (low within-cluster variance) and the more distinct/separate the individual clusters are (high between-cluster variance). If a user does not specify the number of clusters, Tableau automatically picks the number of clusters corresponding to the first local maximum of the Calinski-Harabasz index. The result of the Calinski-Harabasz test indicates that the best cluster fit model contains three clusters.

To validate the best-fit cluster solution we used ANOVA statistics. The results are shown in Table VII.

TABLE VII. THE RESULTS OF THE ANALYSIS OF VARIANCE TEST FOR OUR MODEL

Number of clusters | Variable             | F-statistic | P-value
2-clusters         | Avg. Total Value USD | 269.9       | 0.000
3-clusters         | Avg. Total Value USD | 138.5       | 0.000
4-clusters         | Avg. Total Value USD | 1.28e+04    | 0.000
5-clusters         | Avg. Total Value USD | 1.4e+04     | 0.000

The analysis of variance (ANOVA) of all cluster solutions shows a p-value < 0.001 for the continuous variable "Total Value USD", so the values were statistically different between all the clusters. Moreover, the number of items in the last two clusters of the 5-clusters solution is only 99 and 5 items, and cluster-4 in the 4-clusters solution contains only 10 items. The distribution of the number of items for the 3-clusters solution is much more acceptable, since cluster-3, which contains the lowest number of items, counts 2257 items in total. Therefore our selection confirms the Calinski-Harabasz result that the 3-clusters solution is the best cluster fit for the model.

Fig. 2 shows the distribution of the average total price according to the 3-clusters solution for our model. Cluster-1 has the highest average total price of shipments transferred to "Riyadh" in Saudi Arabia, with 1250 USD. The most expensive shipped products in this cluster are "Computer" and "I-Phone" respectively. The most expensive shipped products in cluster-2, which are transferred to "Amman" in Jordan, are "IPad", "Computer" and "Laptop" with 600, 450 and 440 USD respectively. The most expensive shipped products in cluster-3, which are transferred to "Dubai" and "Abu Dhabi" in the UAE, are "Laptop" and "Computer" with 900 and 600 USD respectively, whereas the products shipped to "Abu Dhabi" are much cheaper, since the average total values are less than 200 USD.

VI. CONCLUSION

Our best-fit K-Means clustering model segmented the customers mainly according to destination cities, products and price. Each cluster group profiles customers sharing identical product interests, coupled to the amount they normally like to spend when shopping through e-commerce. The model proves to be an excellent model for e-commerce websites wanting to segment their customers based on their interests and location, one of the potential marketing applications [17].

Moreover, the clustering and data visualization also make it possible to know the distribution pattern of the shipments according to "product types", "customers", "cities" and so on. This information is highly valuable for the logistics companies possessing these datasets. It helps them in managing their transactions better, but also allows them to monetize the knowledge contained in their data by selling it to other companies. The major benefit lies in identifying groups of customers with profiles that are fairly similar and drawing value from these profile characteristics as much as possible.

Knowing, for instance, the average value of the shipments and a percentage-wise subdivision of the product categories that are shipped to a certain destination is marketing knowledge that shippers (ShipperID was one of the variables) can be interested in when directing their marketing efforts. These companies normally do not have this knowledge themselves in the same detail, so the logistics service companies can help to improve their marketing efforts and eventually monetize the data as a marketing application. The data are reliable (as they are taken from the dataset of all logistics transactions by the logistics company). They are relevant to the customer companies, as they are all situated in the same sector and region. The ShipperID makes the data anonymous, and the results are segmented. Thus all four criteria for monetization previously mentioned are fulfilled.
Fig.2 The distribution of the average of the total price of the most frequently shipped products to the
most common destinations according to the 3-cluster solution.
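The Calinski-Harabasz ratio of Eq. (1), used above to select the 3-clusters solution, is straightforward to compute directly. The sketch below is our own illustration (scikit-learn's `calinski_harabasz_score` implements the same index); note that for a single variable such as Avg. Total Value USD this ratio coincides with the one-way ANOVA F-statistic reported in Table VII:

```python
import numpy as np

def calinski_harabasz(X, labels):
    """CH = (SSB / (k - 1)) / (SSW / (N - k)), per Eq. (1):
    between-cluster variance over within-cluster variance,
    corrected for k clusters and N observations."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    n, overall_mean = len(X), X.mean(axis=0)
    clusters = np.unique(labels)
    k = len(clusters)
    ssb = ssw = 0.0
    for c in clusters:
        pts = X[labels == c]
        centroid = pts.mean(axis=0)
        # Between-cluster part: cluster size times squared centroid offset.
        ssb += len(pts) * np.sum((centroid - overall_mean) ** 2)
        # Within-cluster part: squared deviations from the centroid.
        ssw += np.sum((pts - centroid) ** 2)
    return (ssb / (k - 1)) / (ssw / (n - k))
```

Evaluating this for k = 2 to 5 and taking the first local maximum mirrors Tableau's automatic choice of the number of clusters.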
V. MANAGERIAL CONSEQUENCES

We recommend all e-commerce companies to segment their customer base. It will improve their campaign contents by tying them better to customer characteristics, and thus improve their effectiveness. The study grouped each customer per product category most frequently bought, location and e-commerce company most frequently dealt with. Thus when an e-commerce company intends to increase their

[Figure: data segmentation process (data source, data collection, data preprocessing) and customer segmentation bases (demographic, geographic, behavioral, profitability).]
In this paper we indeed proposed a marketing application for a logistics company. After the first step, in which the data are made ready for data modelling, we suggest that the companies involved segment their customers geographically, behaviourally and on the basis of profitability. The products and transaction routes can be segmented on the basis of the logistics application service we proposed. The next step is to visualize the results, which have to be made clear for the decision makers. Thus our proposed work is ready to be used for marketing applications linked to the logistics channels.

Our research used logistics data in a different way, by applying k-means clustering to these data. We focused on finding segments of customers sharing the same profile on the basis of a combination of the variables products bought, location and value of the goods purchased. Our contribution in this study is to add to this research stream the value of extensively looking into the monetization possibility of specific logistics data of e-commerce companies (a field and combination that has not been studied before) and to indicate whether, in an international context, these data are valuable enough to be marketed.

REFERENCES

[1] Tsai, C.-W., et al., Big data analytics: a survey. Journal of Big Data, 2015. 2(1): p. 21.
[2] Bataineh, A.S., et al., Monetizing Personal Data: A Two-Sided Market Approach. Procedia Computer Science, 2016. 83: p. 472-479.
[3] Lotame, How to Monetize Your Data. 2018, Lotame.
[4] Mizouni, R. and M.E. Barachi, Mobile Phone Sensing as a Service: Business Model and Use Cases, in 2013 Seventh International Conference on Next Generation Mobile Apps, Services and Technologies. 2013.
[5] Li, X.-B. and S. Raghunathan, Pricing and disseminating customer data with privacy awareness. Decision Support Systems, 2014. 59: p. 63-73.
[6] Laudon, K.C., Markets and privacy. Commun. ACM, 1996. 39(9): p. 92-104.
[7] Bélanger, F. and R.E. Crossler, Privacy in the Digital Age: A Review of Information Privacy Research in Information Systems. MIS Quarterly, 2011. 35(4): p. 1017-1041.
[8] Hair, J.F., et al., Multivariate data analysis (4th ed.): with readings. 1995: Prentice-Hall, Inc. 745.
[9] Sammour, G., B.D., K. Vanhoof and G. Wets, Identifying homogeneous customer segments for risk email marketing experiments, in 11th International Conference on Enterprise Information Systems. 2009: Milan, Italy. p. 89-94.
[10] Fraley, C. and A.E. Raftery, Model-Based Clustering, Discriminant Analysis, and Density Estimation. Journal of the American Statistical Association, 2002. 97(458): p. 611-631.
[11] Carmona, C.J., et al., Web usage mining to improve the design of an e-commerce website: OrOliveSur.com. Expert Systems with Applications, 2012. 39(12): p. 11243-11249.
[12] Gruca, T.S. and B.R. Klemz, Optimal new product positioning: A genetic algorithm approach. European Journal of Operational Research, 2003. 146(3): p. 621-633.
[13] Leeflang, P.S.H., et al., Decomposing the sales promotion bump accounting for cross-category effects. International Journal of Research in Marketing, 2008. 25(3): p. 201-214.
[14] Holý, V., O. Sokol, and M. Černý, Clustering retail products based on customer behaviour. Applied Soft Computing, 2017. 60: p. 752-762.
[15] Tsai, C.Y. and C.C. Chiu, A purchase-based market segmentation methodology. Expert Systems with Applications, 2004. 27(2): p. 265-276.
[16] Tableau, Find Clusters in Data. 2019; Available from: https://onlinehelp.tableau.com/current/pro/desktop/en-us/clustering.htm.
[17] Qabbaah, H., G. Sammour, and K. Vanhoof, Decision tree analysis to improve e-mail marketing campaigns. International Journal "Information Theories and Applications", 2018. 25(4): p. 303-330.
Content Based Image Retrieval Approach using
Deep Learning
Heba Abdel-Nabi Ghazi Al-Naymat Arafat Awajan
Department of Computer Science Department of Computer Science Department of Computer Science
Princess Sumaya University for Princess Sumaya University for Princess Sumaya University for
Technology Technology Technology
Amman, Jordan Amman, Jordan Amman, Jordan
h.yousif88@yahoo.com g.naymat@psut.edu.jo awajan@psut.edu.jo
Abstract— In a world that seeks perfect results for any search query, an information retrieval system that produces accurate and relevant output is desired. However, because of the well-known semantic gap problem of image representation, a Content Based Image Retrieval (CBIR) system faces some difficulties, since it depends heavily on the extracted image features as the basis for a similarity check between the query image and the database images. The proposed approach overcomes these difficulties with the aid of the fastest growing technology, namely Deep Learning. In addition, it explores the effects of merging the features extracted from the latter layers of the deep network to achieve better retrieval results. The experimental results demonstrate the effectiveness of the proposed scheme in terms of the number of relevant retrieved images in the query results and the mean average precision, while keeping the computational complexity low, since it uses an already trained deep convolutional model called AlexNet. This, in turn, avoids the complexity of training a deep model from scratch.

Keywords—Image Retrieval, Content Based, Deep Learning, AlexNet.

I. INTRODUCTION

In the fast growing and technology accelerated era, the distribution and storage of digital images have become easy and widely available. Therefore, a huge amount of digital images is stored and uploaded online in huge databases, such as the World Wide Web or medical image databases. Consequently, search queries based on images have become essential. Since these databases differ from traditional databases by the type of unstructured data stored in them, new information retrieval methods have been introduced.

There are two main approaches to image retrieval: the text (or concept) based approach and the content based approach. The text based approach depends on manual indexing and the quality of the tagged keywords and annotations that describe the images for retrieval purposes. However, the annotation based method can be considered infeasible for many reasons: the manual annotation process is time consuming, tedious, subjective, and incomplete. Moreover, the assigned keywords may not describe the image properly, since, for example, different keywords can describe a certain image while at the same time a single keyword can describe multiple images [1], i.e., a single keyword can have different semantic meanings. All these factors indicate that fixed keyword and feature engineering is not suitable for image retrieval, especially for large scale image databases.

On the other hand, the second approach, Content Based Image Retrieval (CBIR), overcomes these limitations and improves the retrieval performance by searching the images based on their visual contents, represented by low and middle level features such as color, texture and shape, and then comparing the similarities in some of these features between the images in the database and the query image. Determining the similarities and the proper features that best describe the image is often relative. This raises the well-known semantic gap problem, formed by the gap between the low level visual features of the images, represented normally by their intensity or pixel values, and the high level of human perception [2].

With the advances in machine learning methods, retrieval methods based on them succeeded in outperforming the traditional retrieval methods that are based only on image indexing and keyword tagging, especially when searching a large database for a match to the requested query image. However, the machine learning approaches have limited performance, because in order to be successful they must be combined with supervised learning, which requires a labeled dataset for the training process; these labeled data, i.e., pairs of inputs and labels that identify the correct output for each input, must be prepared manually by a human domain expert. For a large database containing millions of images, fixed feature engineering becomes infeasible, and consequently the same limitation of the text based approaches reappears.

The recent revolutions in computer vision and image recognition, thanks to the deep learning breakthrough in 2006 [3], make deep learning a potential bridge over this gap for retrieving images, because it has the ability to process raw data and build an internal feature representation of it through its multiple nonlinear layers of abstraction, eventually providing a high conceptual representation of the image. In other words, deep learning has the capability to learn the semantic representation of the image through its training phase. Therefore, a deep learning based model for content based image retrieval is proposed in this paper.

Any content based image retrieval system is concerned with achieving two goals: to recognize whether the query image exists in the image database [4], and to retrieve the images most similar to it (not the images of the most probable classes) [5] through the multidimensional extracted feature vector. This feature vector is extracted from
Equation (1), appropriate indirect weights are considered to combine the features when calculating the similarity measures between the query image and the images in the database. Below are the design steps of the proposed CBIR system, and Fig. 2 shows a flowchart that represents the proposed approach.

B. Design Steps of the Proposed CBIR System

1. Preparing Phase: The database of images undergoes the first phase of preparing and collecting the features; this is done for each image in this database. The preprocessing consists of the following steps:

A) The database images must be preprocessed in order to suit the network model input, by either cropping the image to the correct size or resizing it. In the proposed approach, each image in the database is resized to the proper size that the used model accepts at its input layer. To improve the model performance and to avoid overfitting in the deep model, any images with a size less than the size that the input layer accepts are excluded from the database, which therefore increases the chances of successful retrieval.

B) The images are then fed to the deep model.

C) Two new feature vectors for each image are extracted from the FC6 and FC8 layers of the deep model.

D) A weighted combination of the two vectors is performed, in which only part of the higher dimensional layer, i.e., FC6, is taken and combined with the FC8 feature vector according to Equation (1). This is done to increase the efficiency of the features extracted from the last fully connected layers in retrieving the relevant images by introducing partial support from the FC6 layer. Note that each of these layers learns a different abstraction of the image.

2. Searching Phase: The second phase is the search for identical or similar images to the query image. The main task of the CBIR system is to find N exact matches or similar images to that query. The query image undergoes the same feature extraction procedure described in phase one above.

3. Similarity and Ranking Phase: The similarity measurement step captures the semantic similarity between the combined feature vectors obtained from the images in the database and the combined feature vector obtained from the query image. Then all the images in the database are ranked, and the top N images are retrieved.

V. DISCUSSION AND EXPERIMENTAL RESULTS

The performance is judged by the quality of the retrieved images, i.e., their counts and how relevant they are to the query image. This is measured using the precision and recall metrics.

A. Image Dataset

It consists of a collection of 600 images in 20 different categories: Horses, Bears, Buses, Cars, Sport Cars, Cats, Dogs, Ducks, Flowers, Roses, Boathouses, Guitars, Old Castle, Owl, Pepper, Sailboats, Sheep, Sunset, Tiger and Tomatoes. Each of these categories contains 30 images, taken from the ImageNet 2012 dataset [22]. A set of 15 query images was applied, as shown in Fig. 3, to test the efficiency of our system; similar categories were selected, such as cars vs. sport cars. The top 30 images are retrieved for each query image. The AlexNet network was not trained on this dataset beforehand.
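The preparing and searching phases above can be sketched as follows. Since Equation (1) and its exact weights are not reproduced in this excerpt, the weights `w_fc6` and `w_fc8`, the truncation of FC6 to its first 1000 dimensions, and the use of cosine similarity for ranking are illustrative assumptions; the feature vectors here are random stand-ins for the values AlexNet would produce.

```python
import numpy as np

def combine_features(fc6, fc8, w_fc6=0.5, w_fc8=0.5):
    """Weighted combination of AlexNet FC6 (4096-D) and FC8 (1000-D) features.

    Only part of FC6 (here, its first 1000 dimensions) is used, so the
    combined vector matches the 1000-D size reported for the proposed
    approach. The weights are hypothetical; Equation (1) is not shown
    in this excerpt.
    """
    return w_fc6 * fc6[:1000] + w_fc8 * fc8

def rank_database(query_vec, db_vecs, top_n=30):
    """Rank database images by cosine similarity to the query vector."""
    q = query_vec / np.linalg.norm(query_vec)
    db = db_vecs / np.linalg.norm(db_vecs, axis=1, keepdims=True)
    sims = db @ q
    return np.argsort(-sims)[:top_n]  # indices of the top-N most similar images

# Toy usage: 600 database images with random stand-in features.
rng = np.random.default_rng(0)
db_fc6 = rng.normal(size=(600, 4096))
db_fc8 = rng.normal(size=(600, 1000))
db = np.stack([combine_features(a, b) for a, b in zip(db_fc6, db_fc8)])

query = combine_features(db_fc6[42], db_fc8[42])  # query identical to image 42
top = rank_database(query, db)
```

Because the query features here are identical to those of image 42, that image is ranked first, which mirrors the paper's first goal of recognizing an image already present in the database.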
C. Performance Metrics

The retrieval performance of a content based image retrieval system depends mainly on the feature representations and the similarity measurements. The main aim is to design an image retrieval system that is efficient and effective [13] by fulfilling two requirements: speed and precision. The quality of the retrieval, and how relevant it is to the query image, is measured through the precision and recall values. Higher precision and recall values indicate a better image retrieval result, meaning the set of returned images is more preferable to the user.

For performance evaluation, we used five metrics to evaluate the proposed scheme; they are listed below. The results of these metrics for each of the query images used in the experiment are presented in Tables 1 to 15, in which the precision and recall values at ranks 1, 5, 10, 15, 20, 25 and 30 are listed, in addition to the number of retrieved images that are similar to the query image and the average precision (AP) for each query image. In addition, we compared the results of the proposed approach with the results obtained from image retrieval based on the features extracted from layers FC6 and FC8, which are used to construct the combined vector.

- The precision at a particular rank (P@K): measures the ability of the system to retrieve only images that are relevant when the number of retrieved images is k.

Precision = # relevant images retrieved / # total images retrieved (2)

- The recall at a particular rank (R@K): measures the ability of the system to retrieve all the images that are relevant when the number of retrieved images is k.

Recall = # relevant images retrieved / # total relevant images (3)

- The average precision (AP): averages the precision values at the rank positions where a relevant image is retrieved.

Fig. 3. The query images that are used in the experiments.

Fig. 4 shows the mean average precision values of the system when the features are extracted from different layers, namely FC6 and FC8, and from the combined vector used in the proposed approach. As can be noted, the proposed scheme enhanced the results compared with the single feature based systems, which proves the effectiveness of this approach.

Fig. 4. MAP of CBIR using the features obtained from each layer and the proposed approach (bar chart; MAP values range from roughly 88% to 94%).
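The metrics above can be computed directly from the ranked result list. The following is a minimal sketch of Equations (2) and (3) and of AP; the relevance flags are toy values for illustration, not the paper's data:

```python
def precision_at_k(relevant_flags, k):
    """P@K: fraction of the top-k retrieved images that are relevant."""
    return sum(relevant_flags[:k]) / k

def recall_at_k(relevant_flags, k, num_relevant):
    """R@K: fraction of all relevant images found in the top-k results."""
    return sum(relevant_flags[:k]) / num_relevant

def average_precision(relevant_flags):
    """AP: mean of the precision values at ranks where a relevant image appears."""
    hits, precisions = 0, []
    for rank, rel in enumerate(relevant_flags, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

# Toy ranked result list: 1 = relevant, 0 = not relevant.
flags = [1, 1, 0, 1, 0]
print(precision_at_k(flags, 5))   # 3 relevant in the top 5 -> 0.6
print(average_precision(flags))   # (1/1 + 2/2 + 3/4) / 3
```

The mean average precision (MAP) reported in Fig. 4 would then simply be the mean of `average_precision` over all 15 query images.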
Table 1. Query Image Tiger: Retrieval Performance on the Two Fully Connected Layers and on the Proposed Approach.

Method             Metric     @1     @5     @10    @15    @20    @25    @30    Relevant/retrieved  AP
FC6: 4096-D        Precision  1      1      1      1      0.950  0.920  0.866  26/30               0.9705
                   Recall     0.033  0.166  0.333  0.500  0.633  0.766  0.866
FC8: 1000-D        Precision  1      1      1      1      1      0.920  0.900  27/30               0.9826
                   Recall     0.033  0.166  0.333  0.500  0.666  0.766  0.900
Proposed: 1000-D   Precision  1      1      1      1      1      1      1      30/30               1
                   Recall     0.033  0.166  0.333  0.500  0.666  0.833  1
Table 2. Query Image sailboat 1: Retrieval Performance on the Two Fully Connected Layers and on the Proposed Approach.

Method             Metric     @1     @5     @10    @15    @20    @25    @30    Relevant/retrieved  AP
FC6: 4096-D        Precision  1      1      1      0.866  0.750  0.680  0.666  20/30               0.9048
                   Recall     0.033  0.166  0.333  0.433  0.500  0.566  0.666
FC8: 1000-D        Precision  1      1      1      1      0.900  0.880  0.833  25/30               0.9544
                   Recall     0.033  0.166  0.333  0.500  0.600  0.733  0.833
Proposed: 1000-D   Precision  1      1      1      1      1      1      1      30/30               1
                   Recall     0.033  0.166  0.333  0.500  0.666  0.833  1
Table 3. Query Image sailboat 2: Retrieval Performance on the Two Fully Connected Layers and on the Proposed Approach.

Method             Metric     @1     @5     @10    @15    @20    @25    @30    Relevant/retrieved  AP
FC6: 4096-D        Precision  1      1      0.900  0.866  0.650  0.520  0.433  13/30               0.9521
                   Recall     0.033  0.166  0.300  0.433  0.433  0.433  0.433
FC8: 1000-D        Precision  1      0.800  0.800  0.733  0.700  0.640  0.600  18/30               0.7913
                   Recall     0.033  0.133  0.266  0.366  0.466  0.533  0.600
Proposed: 1000-D   Precision  1      0.600  0.400  0.460  0.550  0.640  0.700  21/30               0.6263
                   Recall     0.033  0.100  0.133  0.233  0.366  0.533  0.700
Table 4. Query Image horse: Retrieval Performance on the Two Fully Connected Layers and on the Proposed Approach.

Method             Metric     @1     @5     @10    @15    @20    @25    @30    Relevant/retrieved  AP
FC6: 4096-D        Precision  1      1      1      1      0.950  0.960  0.900  27/30               0.9722
                   Recall     0.033  0.166  0.330  0.500  0.633  0.833  0.933
FC8: 1000-D        Precision  1      1      1      1      1      0.920  0.866  26/30               0.9864
                   Recall     0.033  0.166  0.333  0.500  0.666  0.766  0.866
Proposed: 1000-D   Precision  1      1      1      1      1      1      0.966  29/30               0.9976
                   Recall     0.033  0.166  0.333  0.500  0.666  0.833  0.966
Table 5. Query Image sunset: Retrieval Performance on the Two Fully Connected Layers and on the Proposed Approach.

Method             Metric     @1     @5     @10    @15    @20    @25    @30    Relevant/retrieved  AP
FC6: 4096-D        Precision  1      1      0.900  0.933  0.900  0.760  0.633  19/30               0.9569
                   Recall     0.033  0.166  0.300  0.466  0.600  0.633  0.633
FC8: 1000-D        Precision  1      1      1      0.866  0.800  0.800  0.733  22/30               0.9560
                   Recall     0.033  0.166  0.333  0.433  0.533  0.666  0.733
Proposed: 1000-D   Precision  1      1      1      1      0.850  0.880  0.800  24/30               0.9606
                   Recall     0.033  0.166  0.333  0.500  0.566  0.733  0.800
Table 6. Query Image cat: Retrieval Performance on the Two Fully Connected Layers and on the Proposed Approach.

Method             Metric     @1     @5     @10    @15    @20    @25    @30    Relevant/retrieved  AP
FC6: 4096-D        Precision  1      1      0.900  0.866  0.750  0.640  0.600  18/30               0.9001
                   Recall     0.033  0.166  0.300  0.433  0.500  0.533  0.600
FC8: 1000-D        Precision  1      1      0.900  0.666  0.550  0.440  0.400  12/30               0.8935
                   Recall     0.033  0.166  0.300  0.333  0.366  0.366  0.400
Proposed: 1000-D   Precision  1      1      0.900  0.866  0.750  0.640  0.533  16/30               0.9240
                   Recall     0.033  0.166  0.300  0.433  0.500  0.533  0.533
Table 7. Query Image guitar: Retrieval Performance on the Two Fully Connected Layers and on the Proposed Approach.

Method             Metric     @1     @5     @10    @15    @20    @25    @30    Relevant/retrieved  AP
FC6: 4096-D        Precision  1      1      1      1      0.950  0.800  0.666  20/30               0.9909
                   Recall     0.033  0.166  0.333  0.500  0.633  0.666  0.666
FC8: 1000-D        Precision  1      1      1      1      1      1      0.900  27/30               0.9960
                   Recall     0.033  0.166  0.333  0.500  0.666  0.833  0.900
Proposed: 1000-D   Precision  1      1      1      1      1      1      1      30/30               1
                   Recall     0.033  0.166  0.333  0.500  0.666  0.833  1
Table 8. Query Image brown bear: Retrieval Performance on the Two Fully Connected Layers and on the Proposed Approach.

Method             Metric     @1     @5     @10    @15    @20    @25    @30    Relevant/retrieved  AP
FC6: 4096-D        Precision  1      1      1      1      0.900  0.920  0.833  25/30               0.9720
                   Recall     0.033  0.166  0.333  0.500  0.600  0.766  0.833
FC8: 1000-D        Precision  1      1      0.900  0.866  0.900  0.880  0.800  24/30               0.9097
                   Recall     0.033  0.166  0.300  0.433  0.600  0.733  0.800
Proposed: 1000-D   Precision  1      1      1      1      1      0.960  0.866  26/30               0.9891
                   Recall     0.033  0.166  0.333  0.500  0.666  0.800  0.866
Table 9. Query Image Owl 1: Retrieval Performance on the Two Fully Connected Layers and on the Proposed Approach.

Method             Metric     @1     @5     @10    @15    @20    @25    @30    Relevant/retrieved  AP
FC6: 4096-D        Precision  1      1      1      1      0.900  0.800  0.766  23/30               0.9488
                   Recall     0.033  0.166  0.333  0.500  0.600  0.666  0.766
FC8: 1000-D        Precision  1      1      0.900  0.933  0.900  0.880  0.800  24/30               0.9294
                   Recall     0.033  0.166  0.300  0.466  0.600  0.733  0.800
Proposed: 1000-D   Precision  1      1      1      1      0.950  0.880  0.833  25/30               0.9749
                   Recall     0.033  0.166  0.333  0.500  0.633  0.733  0.833
Table 10. Query Image Owl 2: Retrieval Performance on the Two Fully Connected Layers and on the Proposed Approach.

Method             Metric     @1     @5     @10    @15    @20    @25    @30    Relevant/retrieved  AP
FC6: 4096-D        Precision  1      0.400  0.200  0.133  0.100  0.080  0.066  2/30                0.8335
                   Recall     0.033  0.066  0.066  0.066  0.066  0.066  0.066
FC8: 1000-D        Precision  1      0.600  0.500  0.466  0.400  0.400  0.366  11/30               0.5992
                   Recall     0.033  0.100  0.166  0.233  0.266  0.333  0.366
Proposed: 1000-D   Precision  1      1      0.700  0.600  0.550  0.560  0.600  18/30               0.7636
                   Recall     0.033  0.166  0.233  0.300  0.366  0.466  0.600
Table 11. Query Image Color Duck: Retrieval Performance on the Two Fully Connected Layers and on the Proposed Approach.

Method             Metric     @1     @5     @10    @15    @20    @25    @30    Relevant/retrieved  AP
FC6: 4096-D        Precision  1      1      1      0.800  0.750  0.720  0.633  19/30               0.8983
                   Recall     0.033  0.166  0.333  0.400  0.500  0.600  0.633
FC8: 1000-D        Precision  1      1      1      1      1      1      0.866  26/30               0.9972
                   Recall     0.033  0.166  0.333  0.500  0.666  0.833  0.866
Proposed: 1000-D   Precision  1      1      1      1      1      1      0.966  29/30               0.9988
                   Recall     0.033  0.166  0.333  0.500  0.666  0.833  0.966
Table 12. Query Image Mix Pepper: Retrieval Performance on the Two Fully Connected Layers and on the Proposed Approach.

Method             Metric     @1     @5     @10    @15    @20    @25    @30    Relevant/retrieved  AP
FC6: 4096-D        Precision  1      1      0.700  0.733  0.650  0.640  0.566  17/30               0.8029
                   Recall     0.033  0.166  0.233  0.366  0.433  0.533  0.566
FC8: 1000-D        Precision  1      1      0.800  0.666  0.550  0.520  0.566  17/30               0.7660
                   Recall     0.033  0.166  0.266  0.333  0.366  0.433  0.566
Proposed: 1000-D   Precision  1      1      0.800  0.866  0.850  0.840  0.766  23/30               0.8742
                   Recall     0.033  0.166  0.266  0.433  0.566  0.700  0.766
Table 13. Query Image Red Pepper: Retrieval Performance on the Two Fully Connected Layers and on the Proposed Approach.

Method             Metric     @1     @5     @10    @15    @20    @25    @30    Relevant/retrieved  AP
FC6: 4096-D        Precision  1      0.800  0.700  0.666  0.650  0.680  0.666  20/30               0.7384
                   Recall     0.033  0.133  0.233  0.333  0.433  0.566  0.666
FC8: 1000-D        Precision  1      0.800  0.900  0.866  0.750  0.720  0.700  21/30               0.8436
                   Recall     0.033  0.133  0.300  0.433  0.500  0.600  0.700
Proposed: 1000-D   Precision  1      0.800  0.700  0.800  0.850  0.800  0.766  23/30               0.8278
                   Recall     0.033  0.133  0.233  0.400  0.566  0.666  0.766
Table 14. Query Image sheep 1: Retrieval Performance on the Two Fully Connected Layers and on the Proposed Approach.

Method             Metric     @1     @5     @10    @15    @20    @25    @30    Relevant/retrieved  AP
FC6: 4096-D        Precision  1      0.800  0.800  0.733  0.650  0.680  0.633  19/30               0.7668
                   Recall     0.033  0.133  0.266  0.366  0.433  0.566  0.633
FC8: 1000-D        Precision  1      1      0.900  0.933  0.950  0.800  0.800  24/30               0.9311
                   Recall     0.033  0.166  0.300  0.466  0.633  0.666  0.800
Proposed: 1000-D   Precision  1      1      1      1      1      1      0.900  27/30               0.9986
                   Recall     0.033  0.166  0.333  0.500  0.666  0.833  0.900
Table 15. Query Image sheep 2: Retrieval Performance on the Two Fully Connected Layers and on the Proposed Approach.

Method             Metric     @1     @5     @10    @15    @20    @25    @30    Relevant/retrieved  AP
FC6: 4096-D        Precision  1      1      0.900  0.666  0.700  0.720  0.633  19/30               0.8452
                   Recall     0.033  0.166  0.300  0.333  0.466  0.600  0.633
FC8: 1000-D        Precision  1      1      1      1      0.850  0.760  0.666  20/30               0.9548
                   Recall     0.033  0.166  0.333  0.500  0.566  0.633  0.666
Proposed: 1000-D   Precision  1      1      1      1      1      1      0.900  27/30               0.9986
                   Recall     0.033  0.166  0.333  0.500  0.666  0.833  0.900
Other alternatives, such as ResNet [23], succeed in outperforming AlexNet by achieving a lower error rate: a 3.57% error rate using an ensemble of residual nets on ImageNet, while AlexNet achieved a 15.3% error rate. Nevertheless, the complexity also rises; AlexNet has just eight layers, while ResNet may contain up to 152 layers with residual connections. Therefore, to prove our idea we used the simpler model; however, the effects of ResNet and other recent deep models on image retrieval can be explored as a future extension of this work.

REFERENCES

[1] Li, T., Mei, T., Yan, S., Kweon, I.S. and Lee, C., 2009, June. Contextual decomposition of multi-label images. In Computer Vision and Pattern Recognition, CVPR 2009, IEEE Conference on (pp. 2270-2277). IEEE.
[2] Saritha, R.R., Paul, V. and Kumar, P.G., 2018. Content based image retrieval using deep learning process. Cluster Computing, pp.1-14.
[3] Hinton, G.E., Osindero, S. and Teh, Y.W., 2006. A fast learning algorithm for deep belief nets. Neural Computation, 18(7), pp.1527-1554.
[4] Shereena, V.B. and David, J.M., 2014. Content Based Image Retrieval: A Review. In Computer Science & Information Technology, Computer Science Conference Proceedings (CSCP) (pp. 65-77).
[5] Piras, L. and Giacinto, G., 2017. Information fusion in content based image retrieval: A comprehensive overview. Information Fusion, 37, pp.50-60.
[6] Krizhevsky, A., Sutskever, I. and Hinton, G.E., 2012. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).
[7] Hiremath, P.S. and Pujari, J., 2007, December. Content based image retrieval using color, texture and shape features. In Advanced Computing and Communications, ADCOM 2007, International Conference on (pp. 780-784). IEEE.
[8] Jain, A.K. and Vailaya, A., 1996. Image retrieval using color and shape. Pattern Recognition, 29(8), pp.1233-1244.
[9] Islam, M.M., Zhang, D. and Lu, G., 2008, December. Automatic categorization of image regions using dominant color based vector quantization. In Digital Image Computing: Techniques and Applications (pp. 191-198). IEEE.
[10] Avni, U., Greenspan, H., Konen, E., Sharoon, M. and Goldberger, J., 2011. X-ray categorization and retrieval on the organ and pathology level, using patch-based visual words. IEEE Trans. Medical Imaging, 30(3), pp.733-746.
[11] Lowe, D.G., 1999. Object recognition from local scale-invariant features. In Computer Vision, 1999. The Proceedings of the Seventh IEEE International Conference on (Vol. 2, pp. 1150-1157).
[12] Bay, H., Tuytelaars, T. and Van Gool, L., 2006, May. Surf: Speeded up robust features. In European Conference on Computer Vision (pp. 404-417). Springer, Berlin, Heidelberg.
[13] Tunga, S., Jayadevappa, D. and Gururaj, C., 2015. A comparative study of content based image retrieval trends and approaches. International Journal of Image Processing (IJIP), 9(3), pp.127-155.
[14] Singh, A.V., 2015. Content-based image retrieval using deep learning. Rochester Institute of Technology.
[15] Kumar, M., Chhabra, P. and Garg, N.K., 2018. An efficient content based image retrieval system using BayesNet and K-NN. Multimedia Tools and Applications, pp.1-14.
[16] Liu, P., Guo, J.M., Wu, C.Y. and Cai, D., 2017. Fusion of deep learning and compressed domain features for content-based image retrieval. IEEE Transactions on Image Processing, 26(12), pp.5706-5717.
[17] Saritha, R.R., Paul, V. and Kumar, P.G., 2018. Content based image retrieval using deep learning process. Cluster Computing, pp.1-14.
[18] Wan, J., Wang, D., Hoi, S.C.H., Wu, P., Zhu, J., Zhang, Y. and Li, J., 2014, November. Deep learning for content-based image retrieval: A comprehensive study. In Proceedings of the 22nd ACM International Conference on Multimedia (pp. 157-166). ACM.
[19] Wang, H., Cai, Y., Zhang, Y., Pan, H., Lv, W. and Han, H., 2015, November. Deep learning for image retrieval: What works and what doesn't. In Data Mining Workshop (ICDMW), 2015 IEEE International Conference on (pp. 1576-1583). IEEE.
[20] Liu, H., Wang, R., Shan, S. and Chen, X., 2016. Deep supervised hashing for fast image retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 2064-2072).
[21] Han, X., Zhong, Y., Cao, L. and Zhang, L., 2017. Pre-trained AlexNet architecture with pyramid pooling and supervision for high spatial resolution remote sensing image scene classification. Remote Sensing, 9(8), p.848.
[22] Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M. and Berg, A.C., 2015. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3), pp.211-252.
[23] He, K., Zhang, X., Ren, S. and Sun, J., 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 770-778).
Data Analytics and Business Intelligence
Framework for Stock Market Trading
Batool AlArmouty Salam Fraihat
Computer Science Department Computer Science Department
King Hussein School of Computing Sciences King Hussein School of Computing Sciences
Princess Sumaya University for Technology Princess Sumaya University for Technology
Amman, Jordan Amman, Jordan
armoutib@gmail.com, s.fraihat@psut.edu.jo
Abstract— Business intelligence is an umbrella term that combines architectures, tools, databases, analytical tools, applications, and methodologies. The efficiency of decision making can increase significantly using business intelligence solutions, by taking advantage of the existing historical or real-time data of the business. Trading in stock markets inevitably involves the risk of losing money, and making efficient decisions requires extensive experience in the market. In this paper, we propose a framework that makes use of historical stock price data to help investors make more efficient trading decisions.

Keywords— Business Intelligence, Business Analytics, Data Analytics, Decision Support Systems, Stock Market.

I. INTRODUCTION

Business Intelligence (BI) is defined as a collective term that combines different technologies, applications, and tools used for gathering data from sources, then storing, analyzing and visualizing it, with the purpose of helping users make better decisions [1]. In the last few years, data has been increasing rapidly, and with the ease of acquiring and storing this data, organizations have started to leverage it to enhance the decision-making process.

BI's objective is transforming data into information through analysis to meet the business objective of the user [2], by enabling the user to interactively manipulate the data and apply different analyses in the way she/he needs to extract information and get valuable insights from the data.

The stock market is a public market with strict regulation for the trading of companies' stocks, where each stock is a share of a company permitted to be traded, called a listed company. Investors make money by buying stocks at a lower price and selling them at a higher price; stock prices are determined by the success of the company, supply and demand, and external factors like government regulations.

Investors take risks in determining the best time for selling or buying stocks, which is why they need efficient help to reduce the risks of this decision. In this paper, a business intelligence framework is proposed using historical stock market data, with the objective of analyzing stock market attributes for a collection of companies and enhancing the efficiency of choosing the appropriate time for buying or selling a specific company's stocks, in order to reduce the risk of losing money.

The proposed framework covers the conversion of the data into useful information satisfying the business objective. The framework contains the processes of acquiring the data from its source, the transformations applied on the data, storing the data into an appropriate data warehouse, the analytics applied on the data after the transformation to extract information, and finally the presentation of the extracted information to the end-user, most likely the decision-maker. The rest of the paper is structured as follows: section II presents the related work, section III explains the requirements of the framework, section IV the proposed architectures, section V the design, and section VI the business intelligence presentation; finally, section VII concludes the paper.

II. RELATED WORK

A. Stock Market Analysis

Umadevi et al. [3] applied analytical techniques on stock market data and tried to design a prediction model. The authors obtained Google, Apple and Microsoft stock prices over six months, with four attributes (low, high, open and close). The analysis applied on the stock market data involves stock scores and candlestick plots to visualize all the parameters.

Alraddadi [4] analyzed the stock prices of the John Wiley & Sons company over one year; the data contains six attributes (open, high, low, closing, and adjusted close). The author applied descriptive statistics to explore the nature of the data, including measures of central tendency and measures of variability. Moreover, they made use of plots, like histograms and time series plots, to fully understand the nature of the data.

Sen et al. [5] analyzed the Indian stock market by decomposing the time series data into three components: the trend, the seasonal component, and the random component. The decomposition was done to help understand whether the buys are short-term or long-term and to discover the pattern of stock trading. Based on this analysis, the months in which the seasonal component plays a major role were discovered, giving an idea about the trends of the stocks. Moreover, the decomposition results were used to forecast the values for 12 months.

Bhoopathi et al. [6] proposed a framework to discover the trends in stock trading by finding causal relationships in the stock dataset, in the form of direct, indirect, and exception association rules; the framework also considers the events and government decisions that may influence stock trading.

B. Business Intelligence

Martin et al. [7] proposed a business intelligence framework consisting of quantitative bankruptcy prediction components, where financial features found using Genetic
missing data, and duplication will be applied; then the relevant features will be selected before storing the data in the data warehouse.

3- System of Analytics (SOA): The discovery of meaningful patterns in the data; the requested analysis will be applied in this process.

C. Technical Architecture

Fig. 3 illustrates the techniques that will be used in implementing the business intelligence system.
stock market of companies, to help the investors in making future trading decisions. The framework proposes the techniques and tools to collect the data, transform, store, and analyze it, and present it to the end-user, in our case the investor.
REFERENCES
[1] Azeroual O., Theel H., “The Effects of Using Business
Intelligence Systems on an Excellence Management and
Decision-Making Process by Start-Up Companies: A Case
Study”, International Journal of Management Science and
Business Administration, 2018, pp.30-40.
[2] Chang V., Larson D., “A Review and Future Direction
of Agile, Business Intelligence, Analytics and Data
Science”, International Journal of Information Management,
2016, pp.700-710.
[3] Gaonka A., Kulkarni R. et al, “Analysis of Stock Market
using Streaming Data Framework”, International
Conference on Advances in Computing, Communications
and Informatics, 2018, pp.1388-1390.
[4] Alraddadi R., “Statistical Analysis of Stock Prices in
John Wiley & Sons”, Journal of Emerging Trends in
Computing and Information Sciences, 2015, pp. 38-47.
[5] Sen J. and Chaudhuri T., “A framework for Predictive
Analysis of Stock Market Indices – A Study of the Indian
Auto Sector”, arXiv, 2015, pp. 1-19.
[6] Bhoopathi H. and Rama B., “A Novel Framework for
Stock Trading Analysis Using Casual Relationship Mining”,
2017, International Conference on Advances in Electrical,
Electronics, Information, Communication, and Bio-
Informatics (AEEICB), pp. 1-6.
[7] Martin A., Lakshmi T., and Venkatesan V., “A Business
Intelligence Framework for Business Performance using
Data Mining Techniques”, International Conference on
Emerging Trends in Science, Engineering and Technologies,
2012, pp. 373-380.
[8] Jadi Y., Lin J., “An Implementation Framework of
Business Intelligence in e-government systems for
developing countries: Case study: Morocco e-government
system”, International Conference on Information Society,
2017, pp.138-142.
[9] Khedr A., Kholeif S. et al, “An Integrated Business
Intelligence Framework for Healthcare Analytics”,
International Journal of Advanced Research in Computer
Science and Software Engineering, 2017, pp. 263-270.
[10] Olexova C., “Business Intelligence Adoption: A Case
Study in the Retail Chain”, WSEAS Transactions on
Business and Economics, 2014, pp. 95-106.
[11] Bahill A.T. and Dean F.F., “Discovering System
Requirements”, in Handbook of Systems Engineering and
Management, A.P. Sage and W.B. Rouse (eds.), John Wiley
& Sons, 2009, pp. 205–266.
[12] Chakraborty A., Baowaly M. et al, “The Role of
Requirement Engineering in Software Development Life
Cycle”, Journal of Emerging Trends in Computing and
Information Sciences, 2012, pp.723-729.
[13] Bass L., Clements P. et al, “Software Architecture in
Practice”, Second Edition, Chapter 1, Addison-Wesley, 2003.
Reducing Ambulances Arrival Time to Patients
Mohammad Eshtayah, Jalal Morrar, Ameer Baghdadi, Amjad Hawash*
ICS Dept., An-Najah N. University, Nablus, Palestine
mohammed.eshtayah@gmail.com, jj.yy.mm1996@gmail.com, ameer.r.baghdadi@gmail.com, amjad@najah.edu
1) Patients' side software.
2) Ambulance crew side software.
3) 3G/4G wireless network for data traffic.
4) Web-based application.
5) The GPS service.
6) The Firebase database.

a parameter to the function findTheClosest along with the
parameter patient.GPSLocation. This function is responsible
for finding the closest ambulance to the patient, in which
the data related to that ambulance is saved in the object
ambulance. Determining the closest ambulance to the patient
includes determining the complete shortest route between the
selected ambulance and the patient. After that, the function
sendRequest() is executed to save the necessary data in the
Firebase database, taking the patient and ambulance objects as
parameters. Finally, if the reply of the function sendRequest() is
true, then the function drawMap() is executed, taking two
parameters: patient.GPSLocation and ambulance.GPSLocation.
Figure 1 below represents a sample drawn route between a
given patient and the location of a selected ambulance. When
the closest available ambulance is determined, the request is
sent to its crew. The patient is then able to track the ambulance
until they both reach the contact point. The positions of
both the patient's vehicle and the requested ambulance logos on
the drawn map change every 3 seconds by continuously
contacting the GPS service for both.
Fig. 1. Major components of the system.
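The closest-ambulance search performed by findTheClosest can be sketched in runnable form. The paper does not give its implementation, so the great-circle (haversine) distance, the Python rendering, and the sample coordinates below are illustrative assumptions; the actual system also computes the full shortest driving route:

```python
import math

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) pairs."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * math.asin(math.sqrt(h))

def find_the_closest(patient_location, ambulances):
    """Return the vacant ambulance nearest to the patient (the paper's findTheClosest)."""
    vacant = [amb for amb in ambulances if amb["vacant"]]
    return min(vacant, key=lambda amb: haversine_km(patient_location, amb["gps"]))

# Illustrative data: (lat, lon) pairs around Nablus.
patient_gps = (32.2211, 35.2544)
ambulances = [
    {"id": "A1", "gps": (32.2300, 35.2600), "vacant": True},
    {"id": "A2", "gps": (32.2220, 35.2550), "vacant": True},
    {"id": "A3", "gps": (32.2000, 35.2000), "vacant": False},
]
closest = find_the_closest(patient_gps, ambulances)
print(closest["id"])  # A2: nearest among the vacant ambulances
```

A straight-line ranking like this is only a first filter; a production dispatcher would rank by road-network travel time instead.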
B. Ambulance crew side software:
The software here can be adjusted to one of two modes, On Duty or Off Duty, to indicate whether the ambulance at a given time is busy or vacant. Of course, this state is saved in the Firebase database. The software here is programmed to keep contacting the database every 10 seconds in order to check if there is a related request. Now, when a patient's request is sent to the database and saved in a table called Requests, the crew software automatically accepts the case with a notification appearing on their handheld device. Upon this, a map is drawn on the device showing both the ambulance and the patient's locations. As in the part of the software installed for patients, the application keeps updating these positions till the ambulance and the patient's vehicle reach the contact point. After examining the patient's medical situation, the crew use the application to fill a special form describing the patient's medical situation, like: the type of blood, blood pressure, inhale/exhale condition, whether the patient suffers from chronic disease(s), the types of daily medicine s/he takes, etc. All these data are sent to the database in order to be used by the desired hospital previously selected by the crew, in order to prepare the necessary treatment upon the patient's medical condition.
The following pseudocode illustrates the process. Please notice that this code is executed when the mode of the software is On Duty:

function checkRequest(Ambulance ambulance) {
    Request request = loadRequest(ambulance);
    GPSLocation patientLocation = request.getLocation();
    GPSLocation ambulanceLocation = getLocation();
    replyRequest(request, true);
    drawMap(patientLocation, ambulanceLocation);
    changeState(ambulance, OFF);
}

The code starts by executing the function checkRequest(), which takes an ambulance object as a parameter. The body of the function contains a set of function calls, starting with loadRequest(), which takes the ambulance object as a parameter. The return value of that function is a request object that contains the necessary information about the patient requesting the ambulance. The patient's GPS location is extracted from the request object by the member function getLocation() and saved in the variable patientLocation. The ambulance's GPS location is determined by executing the function getLocation(), which contacts the GPS service. The function replyRequest is then executed; it saves in the database the value true along with the request object to indicate the readiness of the ambulance to handle the request. Both locations (patient and ambulance) are passed to the function drawMap, which draws the route between the patient and the ambulance. Finally, the function changeState() is executed, taking two parameters: the ambulance and the binary value OFF, to mark the ambulance as busy in order to prevent the system from displaying that ambulance in other requests till it becomes vacant again.

The following pseudocode is executed after a given ambulance reaches the contact point with the requesting patient and when the ambulance crew searches for a suitable hospital to handle the case. After the medical crew determines the medical situation of the patient, the crew fills (with their software) a special form indicating the medical situation of the patient and determining the suitable hospital to deliver the patient to. (If the patient's medical situation is not serious and s/he can be delivered to any close hospital, the crew software searches for the nearest hospital and draws a map containing the shortest route between the contact point and the hospital location.)

function patientPickUp(Patient patient, Ambulance ambulance) {
    Hospital hospitals[] = searchForNearestHospital(ambulance.getLocation());
    foreach (hospital in hospitals) {
        Boolean reply = sendRequest(ambulance, hospital);
        if (reply == true) {
            drawMap(ambulance.getLocation(), hospital.getLocation());
            break;
        }
    }
    changeState(ambulance, ON);
}

The code starts by executing the function patientPickUp(), which takes the objects patient and ambulance as parameters. The location of the ambulance is extracted from the ambulance object by executing the member function getLocation(), and the returned location is then sent to the function searchForNearestHospital(), which searches for the nearest hospitals with respect to the current location of the ambulance. All hospital data are stored in an array of Hospital objects called hospitals, sorted in ascending order of their GPS distance from the ambulance. The code then iterates over the list of hospitals, and in each iteration the function sendRequest() is executed by the crew (if the current hospital is suitable). The function sendRequest takes two parameters, the ambulance and hospital objects, and returns a Boolean value (saved in reply) indicating whether the hospital accepted the request or not. Upon acceptance of the request, the function drawMap(), which takes the parameters ambulance.getLocation() and hospital.getLocation(), is executed; it draws the shortest route between the ambulance location and the selected hospital location. The same route is also drawn by the web-based application installed for the involved hospital, as we will illustrate in the next section. After breaking the loop, the crew changes the state of the ambulance to ON to indicate that they are vacant again once the delivery process takes place, so that the software is ready to receive other medical requests.
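The hospital-selection loop of patientPickUp() can be sketched as runnable Python. Here sendRequest is a stand-in for the Firebase-backed call, and the field names (distance_km, ready) are illustrative assumptions:

```python
def patient_pickup(ambulance, hospitals, send_request):
    """Sketch of the paper's patientPickUp(): offer the case to the nearest
    hospitals in ascending-distance order until one accepts, then mark the
    ambulance vacant (state ON)."""
    accepted = None
    for hospital in sorted(hospitals, key=lambda h: h["distance_km"]):
        if send_request(ambulance, hospital):  # does the hospital accept the case?
            accepted = hospital                # the route would be drawn here (drawMap)
            break
    ambulance["state"] = "ON"                  # ON = vacant again
    return accepted

# Illustrative run: the nearest hospital declines, the next one accepts.
hospitals = [
    {"name": "H-near", "distance_km": 2.1, "ready": False},
    {"name": "H-mid",  "distance_km": 4.7, "ready": True},
    {"name": "H-far",  "distance_km": 9.3, "ready": True},
]
ambulance = {"id": "A1", "state": "OFF"}
chosen = patient_pickup(ambulance, hospitals, lambda a, h: h["ready"])
print(chosen["name"], ambulance["state"])  # H-mid ON
```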
hospital. Upon the accept sign of the hospital, a map is drawn in the web-based application indicating the GPS position of the ambulance and showing the closest route between the ambulance and the hospital. This route keeps being drawn till the ambulance reaches the hospital. While the patient is heading to the hospital, the latter prepares the necessary medical treatment till the delivery of the patient.
The following pseudocode illustrates the process involved in the web-based application of the hospital:

function checkRequests(Hospital hospital) {
    if (exist(request))
        if (HospitalReady == true) {
            replyRequest(request.ambulance, true);
            drawMap(ambulance.getLocation(), hospital.getLocation());
        }
        else
            replyRequest(request.ambulance, false);
}

The web-based application is also used to generate different reports related to all system participants: patients, ambulances, and hospitals.

D. Database:
We used Firebase as a DBMS in this work due to its simplicity and fast data retrieval. Figure 4 below represents the ER diagram of the constructed database. The users entity is used to store information about patients of the system, the Requests entity is used to store requests of patients for ambulances as well as requests of an ambulance crew for a hospital, the ambulance entity is used to store information about participating ambulances, the hospital entity is used to store information about participating hospitals, and a patient entity contains information about the patients who requested the service. A new patient record is created if the hospital has no previous information about the patient.
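The hospital-side polling routine checkRequests() can be sketched as runnable Python. The capacity test via a vacant_rooms field is an illustrative assumption matching the "enough vacant rooms" example above:

```python
def check_requests(hospital, pending_requests):
    """Sketch of the hospital web application's checkRequests(): accept a
    pending crew request when the hospital has capacity, otherwise decline
    so the crew can try another hospital. Field names are illustrative."""
    replies = []
    for request in pending_requests:
        ready = hospital["vacant_rooms"] > 0
        replies.append({"ambulance": request["ambulance"], "accepted": ready})
        if ready:
            hospital["vacant_rooms"] -= 1  # a room is reserved; the map is drawn on accept
    return replies

hospital = {"name": "H1", "vacant_rooms": 1}
pending = [{"ambulance": "A1"}, {"ambulance": "A2"}]
print(check_requests(hospital, pending))
# the first request is accepted, the second declined once capacity is exhausted
```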
System Interface object, which contacts the GPS System by
invoking the function getLocation() to get the current location
of the patient. The GPS System object in turn returns the
location, saved in patientGPSLocation. The System Interface
then invokes the function loadAvailableAmbulances() on the
Firebase DBMS object, which returns all the vacant ambulances
loaded in an array called Ambulances. The loaded array is
then searched for the closest GPS location to the patient.
Upon completion of the search, the closest ambulance is requested
by invoking the function request(ambulanceID, patientGPSLocation)
on the Ambulance object; when it returns a response with the
value ok, the patient is notified.
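The component interaction just described can be sketched as runnable Python, with GPSSystem and FirebaseDBMS as in-memory stand-ins for the real services; the class and field names are illustrative assumptions:

```python
class GPSSystem:
    """Stand-in for the GPS service invoked by the System Interface."""
    def get_location(self, user):
        return user["gps"]

class FirebaseDBMS:
    """Stand-in for the Firebase query that loads vacant ambulances."""
    def __init__(self, ambulances):
        self._ambulances = ambulances
    def load_available_ambulances(self):
        return [a for a in self._ambulances if a["vacant"]]

def request_ambulance(patient, gps, db):
    """Sketch of the sequence above: locate the patient, load the vacant
    ambulances, pick the closest, and send it the request."""
    loc = gps.get_location(patient)
    candidates = db.load_available_ambulances()
    # squared-degree comparison is enough to rank nearby points
    closest = min(candidates,
                  key=lambda a: (a["gps"][0] - loc[0]) ** 2 + (a["gps"][1] - loc[1]) ** 2)
    closest["vacant"] = False  # request(ambulanceID, patientGPSLocation)
    return {"status": "ok", "ambulance": closest["id"]}

db = FirebaseDBMS([{"id": "A1", "gps": (32.23, 35.26), "vacant": True},
                   {"id": "A2", "gps": (32.22, 35.25), "vacant": True}])
response = request_ambulance({"gps": (32.221, 35.251)}, GPSSystem(), db)
print(response)  # {'status': 'ok', 'ambulance': 'A2'}
```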
V. EXPERIMENTAL TESTS
In order to measure the improvement of our approach in this work, we conducted 5 different experiments to measure the amount of time needed from the initial request for an ambulance till reaching a hospital. We intentionally fixed the request times for the two types of requests, by phone and by the system, in order to be accurate in the calculation process. So, we asked some volunteers to divide themselves into two groups. The first group requests an ambulance by phone and, at the same time, the second group requests another ambulance using the system; and in order to be fair and accurate in the calculations, we asked the two ambulances to be in the same location at the time of the requests. We repeated the experiment five times from different locations with respect to requesters and ambulances. Table I represents a comparison between requesting an ambulance and waiting for it till it reaches the patient, and requesting an ambulance and starting to move by car till the contact point with the ambulance, given that the request times (given in hours:minutes) for both, by phone and by the system, in the five experiments are 15:10, 17:38, 12:07, 14:02, and 9:05 respectively. We noticed a reduction in access time as a whole.

TABLE I
A COMPARISON BETWEEN REQUESTING AN AMBULANCE BY PHONE CALL VS. BY THE SYSTEM.

            Request By Phone           Request by the System
Exp. #   Arrival Time  Tour Time    Arrival Time  Tour Time    Saving
#1          15:23         13           15:20         10           3
#2          17:47          9           17:44          6           3
#3          12:10          3           12:10          3           0
#4          14:10          8           14:06          4           4
#5          09:13          8           09:09          4           4

Figure 6 represents the data appearing in Table I, where a reader can notice the amount of time saved with an ambulance request by the system vs. requesting by phone call.

Fig. 6. A chart representing the data in Table I.

VI. CONCLUSION
Minimizing the waiting time for ambulance arrival increases the possibility of saving lives. The work presented is related to decreasing the distance between patients and the requested ambulances in a try to minimize the waiting time for cure and treatment. The simple experiments done in this work highlighted the possibility of reducing the wait time till the arrival of ambulances to patients by utilizing the GPS system in order to compute the shortest route between the two parties. For future work, we plan to improve the data exchanged between patients and the ambulance crew, like voice messages, so that the crew could give patients and/or their relatives some directive information till the arrival of the ambulance.

REFERENCES
[1] Alan Campbell and Matt Ellington. Reducing time to first on scene: An ambulance-community first responder scheme. Emergency Medicine International, 2016, 2016.
[2] Carlson JA, Schipperijn J, Kerr J, Saelens BE, Natarajan L, Frank LD, Glanz K, Conway TL, Chapman JE, Cain KL, Sallis JF. Locations of physical activity as assessed by GPS in young adolescents. PubMed Central PMCID: PMC4702023, 137:2015–2430, 2016.
[3] S. Dixit and A. Joshi. A review paper on design of GPS and GSM based intelligent ambulance monitoring. International Journal of Engineering Research and Applications, 4(7):101–103, 2014.
[4] Poonam Gupta, Satyasheel Pol, Dharmanath Rahatekar, and Avanti Patil. Smart ambulance system. International Journal of Computer Applications, 6:23–26, 2016.
[5] Bassey Isong, Nosipho Dladlu, and Tsholofelo Magogodi. Mobile-based medical emergency ambulance scheduling system. International Journal of Computer Network and Information Security, 8(11):14, 2016.
[6] Vijdan Khalique, Shafaq Shaikh, Murlee Daas, and Syed Muhammad Shehram Shah. Automatic ambulance dispatch system via one-click smartphone application. Indian Journal of Science and Technology, 10, 09 2017.
[7] L. Price. Treating the clock and not the patient: ambulance response times and risk. Qual Saf Health Care, 15(2):127–30, 2006.
[8] Anjalee, Sonali Rayewar, Priyanka Bachate, Pratima Jadhav, and Premlatha G. Survey on ambulance tracking with patient health monitoring system using GPS. Open Access International Journal of Science and Engineering (OAIJSE), 2:2456–3293, 2017.
[9] Muhd Zafeeruddin Bin Mohd Sakriya and Joshua Samual. Ambulance emergency response application.
[10] Muhd Zafeeruddin Bin Mohd Sakriya and Joshua Samual. Ambulance emergency response application.
[11] Thije van Barneveld, Caroline Jagtenberg, Sandjai Bhulai, and Rob van der Mei. Real-time ambulance relocation: Assessing real-time redeployment strategies for ambulance relocation. Socio-Economic Planning Sciences, 62:129–142, 2018.
Framework Architecture for Securing IoT Using
Blockchain, Smart Contract and Software Defined
Network Technologies
Hasan Al-Sakran, MIS Department, King Saud University, Riyadh, Saudi Arabia, halsakran@ksu.edu.sa
Yaser Alharbi, MIS Department, King Saud University, Riyadh, Saudi Arabia, 437106487@student.ksu.edu.sa
Irina Serguievskaia, unaffiliated, Riyadh, Saudi Arabia, serguievskaia@gmail.com
Abstract— The botnet problem of launching Distributed Denial of Service (DDoS) attacks on other networks mainly arises from the rapid growth in the number of insecure Internet of Things (IoT) devices distributed across these networks. The focus of this work is to defend such an IoT network and its associated resources from attacks, and to prevent such networks from becoming part of a botnet launching DDoS attacks on other networks and resources. To achieve these objectives, this research emphasizes the design of a botnet prevention model for the Internet of Things using emerging technologies such as Blockchain, Smart Contract, and Software Defined Networking (SDN). Blockchain is a decentralized structure which fits with the decentralized nature of IoT. For securing the IoT network, the solution presented in this research is based on building a Blockchain network above the IoT network and, on top of it, using Smart Contracts that embed the SDN rules.

Keywords— blockchain, internet of things, software-defined networks, smart contract, botnet, distributed denial of service

I. INTRODUCTION
There will be 50 billion Internet of Things (IoT) devices by 2020 according to Cisco's prediction [1]. The number of interconnected systems has already exceeded the number of human beings [2]. Worldwide technology spending on IoT is predicted to reach $1.2 Trillion in 2022 [3]. As the number of IoT implementations increases, so does the number of devices connected to networks. Devices connected to the Internet are subject to cyber-attacks. For example, there has been a noticeable upsurge in DDoS attacks [4]. Security issues, such as privacy, authorization, verification, access control, system configuration, information storage, and management, are the main challenges in an IoT environment [5]. These security issues cannot be solved with conventional security solutions alone. There are many differences between conventional networks, those that are used to connect PCs and servers, and IoT networks, which are decentralized and distributed in nature, and they have to be taken into account within the IoT network.

A serious problem for IoT is botnet attacks. Botnet attacks were originally created for PCs, but the increase in IoT devices in recent years and their low security level have led to the emergence and rapid evolution of IoT-based botnets.

A botnet is a computer network of infected devices controlled by malware [6]. Infected IoT devices, or bots, are controlled by a botmaster; the bots, botnets, and command and control server represent a malicious party [7]. Botnets are typically constructed in several operational stages: propagation, infection, command and control communication, and execution of attacks [8].

IoT devices have low computing capabilities. A client-server architecture for the management of IoT devices has a single point of failure, which may lead to DDoS attacks like the Mirai Botnet. There are several conventional solutions to IoT security challenges. All of them come from the traditional information security practices that build controls to protect the IoT devices and their users, which, in turn, consist of technical, operational and managerial controls [1].

The complexity of managing IoT network security is significantly increased by the dynamic nature of IoT devices, which range from smart devices (cars and watches) with rich resources to sensors, industrial robotics, and actuators with limited resources, and by their heterogeneity.

IBM describes blockchain as a technology for democratizing the future IoT since it addresses the current critical challenges [9]:
• A lot of IoT solutions are expensive as a result of the deployment and maintenance costs of centralized clouds, composed mostly of supplier and middlemen costs.
• Software update distribution to millions of devices for maintenance purposes is quite problematic.
• Technological partners usually give device access to centralized organizations (service providers or manufacturers). This may lead to a breach of privacy and anonymity, thus diminishing the trust of IoT adopters.

This work proposes an alternative solution methodology for solving IoT security problems by applying blockchain technology. Blockchain technology is an attractive way to enforce privacy for IoT-enabled devices and to maintain trust. It follows the digital security requirements: availability, accountability, integrity, and confidentiality. Availability of data in a distributed network is assured by keeping a copy of the data in each block. Data integrity can be achieved by checking the received data that was already checked within a blockchain network. Transferring data only within the network of trusted devices assures confidentiality; and accountability is maintained because any transaction of data must be verified by other devices. All
capable of complete IoT protection due to the differences between the conventional networks and IoT. In contrast to conventional networks, IoT devices are configured on a low-power lossy network (LLN) topology, which has tight limits on power, memory, and processing resources.

One example of how these limits can affect system security is node impersonation in LLN, which can lead to great data losses. It can happen if an attacker can connect to the network using any identity during the data transmission process and be assumed to be an authentic node [5]. There are also some differences in security features and requirements.

B. Blockchain
A blockchain, as its name implies, is a chain of timestamped blocks that are linked by cryptographic means. It is a distributed ledger whose data are shared among a network of peers. Blockchain technologies are capable of tracking, coordinating, and carrying out transactions and storing information from a large number of devices, enabling the creation of applications that require no centralized cloud.

Four basic concepts that Blockchain is based on are [12]:
• A peer-to-peer network: there is no central trusted third party and all nodes have the same privileges. At each node a pair of public/private keys is used for interaction with other nodes, where the public key is used as an address of the node on the network and the private key is used to sign transactions.
• Open and distributed ledger: each node has its own copy of the same ledger. The ledger is open and transparent to everyone.
• Ledger copies synchronization: is done by broadcasting the new transactions publicly, validating the new transactions, and adding the validated transactions to the ledger.
• Mining: miners compete among themselves to determine who will be the first to take the new transaction, validate it and put it into the ledger, thus creating the chain.

The core components that build Blockchain and its operations are as follows [13]:
• Asymmetric Key Cryptography: public/private key pairs are used to secure its operation.
• Transactions: the blockchain enables information sharing and exchange among nodes on a P2P basis. This information is transferred from node to node in files. After each transaction the blockchain state is changed.
• Consensus Mechanism: is needed to keep track of the transactions and ensure secure exchange (transferred in full, data cannot be altered, time stamped) to avoid fraud such as double-spending attacks. To maintain a consistent state, the same content-updating protocol for the ledger is agreed upon and used by all nodes. Blocks will not be accepted without this consensus mechanism.

Four key characteristics of Blockchain were formalized in [14]:
• Immutable: permanent and tamper-proof. A blockchain is a permanent record of transactions. Once a block is added, it cannot be altered, thus creating trust in the transaction record.
• Decentralized: a blockchain is stored in a file that can be accessed and copied by any node on the network, thus ensuring decentralization.
• Consensus Driven: trust verification. Consensus models provide rules for validating a block. Each block on the blockchain is verified independently using these rules. In Bitcoin, this is referred to as the mining process. Frequently a scarce resource is used to prove that adequate effort was made, such as computing power. No central authority or explicit trust-granting agent participates in this mechanism.
• Transparent: the blockchain is an open file; any party can access it and audit transactions.

A Blockchain is built of a chain of blocks, and each one contains a database of transactions (see Fig. 1). The Blockchain is extended by adding blocks that are related to each other by hashing algorithms, and hence the Blockchain represents a complete ledger of the transaction history. An additional block can be validated by the network using cryptography. In addition to transactions, each block has a timestamp, a hash value of the previous block, and a nonce, which is a random number for verifying the hash. The hashing concept ensures the integrity of the data in the chain. Hash values are unique. Fraud can be prevented since changing a block requires changing the whole chain of blocks [15].

Fig. 1. The chain of blocks: each block carries its hash, a timestamp, and a list of transactions, linked to the previous block's hash.
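A minimal sketch of the block structure just described (transactions, timestamp, previous-block hash, and nonce) shows why altering one block invalidates the rest of the chain. This Python rendering is illustrative, not the system's implementation:

```python
import hashlib
import json
import time

def block_hash(block):
    """Hash a block's transactions, timestamp, previous hash, and nonce."""
    payload = {k: block[k] for k in ("transactions", "timestamp", "prev_hash", "nonce")}
    return hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()

def make_block(transactions, prev_hash, nonce=0):
    """Build a block with the fields listed above plus its own hash."""
    block = {"transactions": transactions, "timestamp": time.time(),
             "prev_hash": prev_hash, "nonce": nonce}
    block["hash"] = block_hash(block)
    return block

def chain_is_valid(chain):
    """Every block must match its own hash and point at its predecessor's hash."""
    return (all(b["hash"] == block_hash(b) for b in chain)
            and all(chain[i]["prev_hash"] == chain[i - 1]["hash"]
                    for i in range(1, len(chain))))

genesis = make_block(["genesis"], prev_hash="0" * 64)
chain = [genesis, make_block(["tx1"], genesis["hash"])]
print(chain_is_valid(chain))             # True
chain[0]["transactions"].append("forged")
print(chain_is_valid(chain))             # False: the tampered block no longer matches its hash
```

Because each block's hash covers the previous block's hash, rewriting any historical block forces recomputing every later block, which is exactly the tamper-evidence property described above.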
Smart contracts are digital contracts of an agreement that can be programmed to automatically execute the terms of the agreement, carrying out a particular action, event, transaction, etc., at a given time or after a certain set of conditions have been met. Smart contracts execute exactly what their transacting parties want to do, without an intermediary. They can be developed using a high-level language called Solidity and are the ultimate automation of trust. Digitizing a contract can be very useful for discovering attacks. Smart Contracts are stored and replicated on a blockchain. The use of smart contracts allows for the validation of transactions and verification of counterparties, therefore reducing the risk of attacks.

C. Software Defined Network (SDN) Overview
SDN holds the concept of privacy and security by design and aims at increasing network programmability by separating the control and data planes. It is a cost-effective software-based solution compared to manual intervention on each of the devices in an IoT network. By separating the control plane from the data or forwarding plane of independent IoT devices, SDN allows a subnetwork and the overall network to be centrally managed and monitored. Using SDN, it becomes easy to define security policies for a network, which provides capabilities for preventing DDoS attacks. Using SDN in IoT has several advantages, but using SDN instead of the traditional networking paradigm brings the issue of the centralized nature of SDN: SDN separates the control plane from the data plane, but this means centralizing the control plane and making it a target for attackers.

SDN has the following characteristics [16]:
• Directly Programmable: due to the decoupling of the control plane and forwarding functions, network control is directly programmable.
• Agile: network flow can be dynamically adjusted to meet changing needs.
• Centrally Managed: network intelligence is centralized in software-based SDN controllers.
• Programmatically Configured: managers can configure, manage, secure, and optimize network resources very quickly by using SDN programs.
• Open-Standards Based and Vendor Neutral: network design and operation are simplified because of open standards and vendor-agnostic devices and protocols.

III. LITERATURE REVIEW
The major challenge in IoT is the security of IoT devices and networks, and the privacy of the people and organizations that benefit from using the IoT. The traditional approaches to defeating threats on IoT are inapplicable due to the decentralized nature of IoT networks.

Researchers have attempted to develop solutions to detect botnets. Prokofiev et al. [6] built a machine-learning predictive model that employs the logistic regression technique for botnet detection. This model has the ability to estimate the probability of an IoT device being a member of a botnet, or a bot. Data need to be gathered to train the model; this is accomplished by collecting data from 100 botnets oriented to IoT devices and capable of performing brute-force attacks. This method of botnet detection is aimed at breaking the chain of the botnet cycle but was designed to be implemented at the propagation stage.

In contrast with the above method, which is intended to do its job at the early stage of the botnet life cycle, Meidan et al. [8] developed a method that is used at the last stage and acts as a last line of defense. The researchers assumed that botnets are evolving and that they would be able to bypass the detection tools that target early stages of the botnet lifecycle. This method uses deep learning techniques to take behavior snapshots and train the model, called AutoEncoder in the paper, to detect abnormal behavior of the system.

In [7], detection of botnets is performed by analyzing communities of IoT devices that are formed according to their network traffic. IoT devices sense and process data, and communicate with other IoT devices. The developed system, called AutoBotCatcher, uses blockchain to allow collaboration of a set of pre-identified untrusted parties in order to perform dynamic collaborative botnet detection by collecting and auditing IoT devices' network traffic flows as blockchain transactions. This solution uses blockchain rather than a centralized system because of the benefits that the blockchain might bring. The consensus concept allows AutoBotCatcher to validate correct execution of the collaborative process without a central trusted party.

A botnet prevention mechanism supplemented by blockchain and SDN has been proposed in [17]. According to the authors, each network consists of three modules: a Security policy module (SecPoliMod), a Controller module (ConMod), and a Log module (LogMod). The SecPoliMod enforces security policies and designates an approved list of IoT devices that meet minimum security requirements to prevent the whole IoT network from becoming part of a botnet. The LogMod parses the flow rules running on the SDN controller of the network and checks the latest authenticated flow rules linked in the blockchain, tracking any suspicious traffic destined to any innocent network in order to prevent botnet creation.

An architecture to defend against DDoS attacks by building a collaborative mechanism between service providers' networks was developed in [18]. Blockchain is leveraged as a transaction exchange medium between SDN controllers in the service providers' autonomous networks. Service providers enrolled in this Blockchain service can signal the occurrence of DDoS attacks and take advantage of the shared detection and motivation mechanisms. The goal is to create an automated and easy-to-manage DDoS mitigation service. The three building blocks of this solution are blockchain, smart contracts, and software defined networking. Collaborative defense system participants first need to create a smart contract, which is linked to a registry-based type of smart contract. When an attacker overloads the web server of one of the service providers' autonomous networks, the IP addresses of the attackers are stored in the smart contract. The service providers' autonomous networks will then receive updated lists of addresses to be blocked when they receive the Blockchain blocks that contain the smart contracts.

A blockchain-based solution to secure IoT devices in a smart home setup has been described in [19]. The developed blockchain has a three-tier architecture: smart home or local network, overlay network, and cloud storage.

When developers want to create blockchain systems for specific purposes, they must have a platform that will
192
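For illustration, the registry-based workflow of [18] can be sketched as follows. This is a minimal Python sketch of the idea only, with hypothetical class and method names; the actual system in [18] runs as a smart contract on a blockchain, not as Python code.

```python
# Sketch of the collaborative blocklist idea from [18]: a registry
# "smart contract" collects attacker IPs reported by enrolled service
# providers, and every provider derives its blocklist from the shared
# state. All names here are illustrative, not the authors' code.

class MitigationRegistry:
    """Stands in for the registry-based smart contract."""

    def __init__(self):
        self.enrolled = set()        # provider IDs that joined the service
        self.blocked_ips = set()     # attacker IPs recorded in the contract

    def enroll(self, provider_id):
        self.enrolled.add(provider_id)

    def report_attack(self, provider_id, attacker_ips):
        # Only enrolled providers may signal a DDoS attack.
        if provider_id not in self.enrolled:
            raise PermissionError("provider not enrolled")
        self.blocked_ips.update(attacker_ips)

    def current_blocklist(self):
        # Every provider reads the same shared state when a new
        # block containing the contract is received.
        return frozenset(self.blocked_ips)


registry = MitigationRegistry()
registry.enroll("AS-100")
registry.enroll("AS-200")

# AS-100 is attacked and publishes the offending source addresses.
registry.report_attack("AS-100", {"203.0.113.7", "203.0.113.9"})

# AS-200 picks up the updated list and can block the same sources
# before the attack reaches its own network.
assert "203.0.113.7" in registry.current_blocklist()
```

The point of the design is that detection done once, by one provider, becomes a mitigation rule for every enrolled provider as soon as the block propagates.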
support such a system with physical-world applications, big data integration, data integrity, data storage, big data analytics, identity privacy, data access security, trusted data sharing and collaboration, IoT integration, and general distributed and parallel computing. The design of such a blockchain platform is described in [20].
A discussion of the hosting location of the blockchain – directly on the IoT device, in the cloud, or in the fog – is presented in [21]. Hosting the blockchain directly on the IoT device is impractical due to the limitations of computational resources, insufficient device bandwidth, and the need to preserve power.

IV. FRAMEWORK ARCHITECTURE OF IOT SECURITY SYSTEM

In this work we demonstrate an adaptive blockchain and SDN in a hierarchical, distributed network environment. Figure 2 presents an overview of the architecture of the proposed model, which consists of:
• Global BC (GBC): stores the source IP addresses that should be allowed or blocked and provides the necessary information on a large scale. The global blockchain is used to provide large-scale event detection.
• Network segments: maintain a local private blockchain that saves all transactions, and a smart contract containing SDN controllers that hold the policies for accessing the devices.
• Task agent: constantly sniffs and monitors the network traffic flow of IoT devices and takes actions accordingly.
• Mobile agent: moves within the system and carries a communicating object. It can be generated dynamically during execution and can reconfigure itself dynamically based on changes in the services.
This design deploys a set of SDN controllers at each network segment to respond to attacks in that specific segment. All SDN controllers are embedded within smart contracts. Each blockchain segment is connected directly to the GBC via a mobile agent. All SDN controllers in each network segment are connected to the GBC in a distributed manner using local private blockchain smart contract techniques. This allows automatic configuration of responses from IoT devices. Each network segment covers a small associated community of IoT devices and a local storage unit saving all transactions, which can also serve as a local backup drive.

Fig. 2 System Architecture

Each smart contract segment contains security policies that meet the minimum security requirements to prevent a network segment of IoT devices from becoming a botnet. The SDN controller acts as a firewall; it is responsible for data analysis and detection services in a timely manner. The task agent of each segment monitors the network traffic flow of the IoT devices within the segment and takes actions according to the situation. In performing these operations, two types of IP address lists are created: one for predefined, trusted addresses, and a second for IP addresses that have previously been detected as part of a botnet. Each local smart contract first needs to register itself in the global smart contract registry, which stores all relevant smart contracts that should be watched. Each network segment reports the results of its data processing to the global blockchain layer via the mobile agent.
The local blockchain and its associated smart contract are managed by the local network administrator, who is responsible for adding and removing IoT devices. A device is added by performing a first transaction that starts the local blockchain; all subsequent transactions are then chained together. It is also the administrator's responsibility to remove devices from the subnetwork, which is done by removing the ledger related to the device. Devices can communicate with each other if the administrator permits; this permission is granted by giving the devices a shared key.
The controllers within the smart contract update the flow rules by verifying the version of the flow rule table that maintains the IP address lists: trusted IP addresses of IoT devices in their network segment, and IP addresses detected as part of a botnet. Smart contracts need to run on a blockchain to ensure that the contract content cannot be changed. If any IoT device does not follow the rules in the smart contract, for example by sending undesirable data, then this device is considered part of a botnet and capable of launching a DDoS attack.
The architecture checks the network traffic flow of each IoT device within its network segment. If the traffic of an IoT device originates from the trusted IP addresses, the device is allowed to forward its data. Otherwise the device and its traffic are considered untrustworthy, the IoT device is isolated, and the system immediately stops the flow of data from it. This attack information is reported to the global blockchain so it can be shared among the connected controllers to block similar activities before other segments of the network are affected.

V. CONCLUSION

In this work, we proposed a decentralized blockchain-based architecture for securing IoT devices. The blockchain network is built on top of an SDN network that interrelates different service providers' networks that are willing to share information about botnet DDoS attacks on their networks using smart contracts. Blockchain, smart contracts, and SDN technology together provide a strong foundation for addressing the security problems of the IoT. We proposed an approach that can prevent an IoT device from becoming part of a botnet by using these technologies. The benefits of blockchain for the IoT and IoT security issues were discussed. Enterprises that already have IoT systems, or that are just developing IoT initiatives, are advised to take blockchain technology into consideration and develop a strategy to secure their IoT systems.
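The per-segment decision logic described in Section IV (classify each flow against the trusted list and the previously detected botnet list, isolate untrusted devices, and report to the global blockchain) can be sketched as follows. The names and list contents are hypothetical illustrations, not part of the proposed implementation.

```python
# Sketch of the Section IV decision logic: a task agent classifies each
# flow against the two IP address lists kept in the segment's smart
# contract, isolates untrusted devices, and reports the event to the
# global blockchain (GBC). Names are illustrative only.

TRUSTED = {"10.0.0.2", "10.0.0.3"}   # predefined trusted devices
BOTNET = {"10.0.0.66"}               # IPs previously detected as bots

def handle_flow(src_ip, reported_to_gbc):
    """Return 'forward' or 'isolate' for a flow from src_ip."""
    if src_ip in TRUSTED and src_ip not in BOTNET:
        return "forward"
    # Untrusted traffic: stop the flow, record the device as part of a
    # botnet, and share the event with the global blockchain so other
    # segments can block it pre-emptively.
    BOTNET.add(src_ip)
    reported_to_gbc.append(src_ip)
    return "isolate"

gbc_reports = []
assert handle_flow("10.0.0.2", gbc_reports) == "forward"
assert handle_flow("10.0.0.99", gbc_reports) == "isolate"
assert gbc_reports == ["10.0.0.99"]
```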
REFERENCES

[1] M. A. Khan and K. Salah, "IoT security: Review, blockchain solutions, and open challenges," Future Generation Computer Systems, vol. 82, pp. 395–411, November 2017, http://doi.org/10.1016/j.future.2017.11.022
[2] A. Sfar, E. Natalizio, Y. Challal, and Z. Chtourou, "A roadmap for security challenges in the Internet of Things," Digital Communications and Networks, vol. 4, pp. 118–137, 2018, https://doi.org/10.1016/j.dcan.2017.04.003
[3] L. Columbus, "Roundup Of Internet Of Things Forecasts And Market Estimates," https://www.forbes.com/sites/louiscolumbus/2018/12/13/2018-roundup-of-internet-of-things-forecasts-and-market-estimates/#729e5aa27d83, accessed February 23, 2019.
[4] Akamai, "How to Protect Against DDoS Attacks - Stop Denial of Service," https://www.akamai.com/us/en/resources/protect-against-ddos-attacks.jsp, accessed 10 Jan 2017.
[5] F. A. Alaba, M. Othman, I. Abaker, T. Hashem, and F. Alotaibi, "Internet of Things security: A survey," vol. 88, pp. 10–28, April 2017, http://doi.org/10.1016/j.jnca.2017.04.002
[6] A. O. Prokofiev, Y. S. Smirnova, and V. A. Surov, "A method to detect Internet of Things botnets," Proceedings of the 2018 IEEE Conference of Russian Young Researchers in Electrical and Electronic Engineering (ElConRus 2018), January 2018, pp. 105–108, http://doi.org/10.1109/EIConRus.2018.8317041
[7] G. Sagirlar, B. Carminati, and E. Ferrari, "AutoBotCatcher: Blockchain-based P2P Botnet Detection for the Internet of Things," pp. 1–8, 2018, http://doi.org/10.1109/CIC.2018.00-46
[8] Y. Meidan, M. Bohadana, Y. Mathov, Y. Mirsky, A. Shabtai, D. Breitenbacher, and Y. Elovici, "N-BaIoT—Network-based detection of IoT botnet attacks using deep autoencoders," IEEE Pervasive Computing, vol. 17(3), pp. 12–22, 2018, http://doi.org/10.1109/MPRV.2018.03367731
[9] T. M. Fernández-Caramés and P. Fraga-Lamas, "A Review on the Use of Blockchain for the Internet of Things," IEEE Access, vol. 6, pp. 32979–33001, 2018, http://doi.org/10.1109/ACCESS.2018.2842685
[10] I. Bojanova, "What Makes Up the Internet of Things?", https://www.computer.org/web/sensing-iot/content?g=53926943&type=article&urlTitle=what-are-the-components-of-iot-, accessed February 24, 2019.
[11] A. Al-Fuqaha, M. Guizani, M. Mohammadi, M. Aledhari, and M. Ayyash, "Internet of Things: A Survey on Enabling Technologies, Protocols, and Applications," IEEE Communications Surveys and Tutorials, vol. 17(4), pp. 2347–2376, http://doi.org/10.1109/COMST.2015.2444095
[12] A. Panarello, N. Tapas, G. Merlino, F. Longo, and A. Puliafito, "Blockchain and IoT integration: A systematic survey," Sensors (Switzerland), vol. 18, http://doi.org/10.3390/s18082575
[13] D. Puthal, N. Malik, S. P. Mohanty, E. Kougianos, and G. Das, "Everything You Wanted to Know About the Blockchain: Its Promise, Components, Processes, and Problems," July 2018, http://doi.org/10.1109/MCE.2018.2816299
[14] K. Sultan, U. Ruhi, and R. Lakhani, "Conceptualizing Blockchains: Characteristics & Applications," 11th IADIS International Conference Information Systems, 2018, accessed February 27, 2019, https://arxiv.org/ftp/arxiv/papers/1806/1806.03693.pdf
[15] M. Nofer, P. Gomber, O. Hinz, and D. Schiereck, "Blockchain," Business & Information Systems Engineering, vol. 59(3), pp. 183–187, 2017, http://doi.org/10.1007/s12599-017-0467-3
[16] ONF (Open Networking Foundation), "Software-Defined Networking Definition," accessed April 9, 2019, https://www.opennetworking.org/sdn-definition/
[17] Q. Shafi and A. Basit, "DDoS Botnet Prevention using Blockchain in Software Defined Internet of Things," Proceedings of the 2019 16th International Bhurban Conference on Applied Sciences & Technology (IBCAST), Islamabad, Pakistan, 8–12 January 2019.
[18] B. Rodrigues, T. Bocek, A. Lareida, D. Hausheer, S. Rafati, and B. Stiller, "A Blockchain-Based Architecture for Collaborative DDoS Mitigation with Smart Contracts," in D. Tuncer et al. (Eds.): AIMS 2017, LNCS 10356, pp. 16–29, 2017, https://doi.org/10.1007/978-3-319-60774-0_2
[19] A. Dorri, S. S. Kanhere, and R. Jurdak, "Blockchain in internet of things: Challenges and Solutions," http://doi.org/10.1145/2976749.2976756
[20] Z. Shae and J. J. P. Tsai, "On the Design of a Blockchain Platform for Clinical Trial and Precision Medicine," Proceedings - International Conference on Distributed Computing Systems, pp. 1972–1980, http://doi.org/10.1109/ICDCS.2017.61
[21] M. Samaniego and R. Deters, "Blockchain as a Service for IoT," Proceedings - 2016 IEEE International Conference on Internet of Things (iThings), IEEE Green Computing and Communications (GreenCom), IEEE Cyber, Physical, and Social Computing (CPSCom), and IEEE Smart Data (SmartData), pp. 433–436, 2017, http://doi.org/10.1109/iThings-GreenCom-CPSCom-SmartData.2016.102
Security issues in Wireless Sensor Network
Broadcast Authentication
Asad Raza1 Ali Abu Romman2 Muhammad Faheem Qureshi3
ISET Department ISET Department ISET Department
Abu Dhabi Polytechnic Abu Dhabi Polytechnic Abu Dhabi Polytechnic
Abu Dhabi, United Arab Emirates Abu Dhabi, United Arab Emirates Abu Dhabi, United Arab Emirates
asad.raza@adpoly.ac.ae ali.aburomman@adpoly.ac.ae muhammad.qureshi@adpoly.ac.ae
Abstract—The influence of wireless sensor networks is increasing day by day due to their cost effectiveness in handling real-world challenges. A WSN consists of many small devices with limited power and limited computational capability that monitor physical and environmental conditions and communicate over a wireless link. WSNs are introducing new techniques of communication and dissemination of information in wireless networks. Because WSNs are involved in many applications, secure authentication of broadcast packets is a mandatory requirement and one of the great challenges in WSN security. In this paper we discuss the threats to secure communication in WSNs. The focus of the paper, however, is to highlight the security issues regarding broadcast authentication in WSNs and to analyze the proposed solutions with respect to various parameters.

Keywords—Wireless sensor network, Broadcast authentication, Security

I. INTRODUCTION

A wireless sensor network is a collection of nodes, a few hundred or even thousands, organized into a cooperative network to monitor temperature, sound, vibration, pressure, etc., and pass the measurements to a central device known as an access point (AP) or base station (BS). There are three major components in a wireless sensor network: a sensor component (which senses and takes measurements), a computing component (which processes data), and a communication component (which enables communication between nodes) [16]. The sensor nodes have low power and limited processing capability. WSNs have a variety of applications ranging from military to home and industrial applications. The applications of wireless sensor networks have not only impacted but changed our daily life. Some of the most common applications of WSNs are summarized below.

⎯ Land sliding detection:
WSNs are used in landslide detection to detect the movement of soil and changes in the relevant parameters before and after a landslide. The information gathered from the sensor nodes can be utilized to forecast landslides in advance.

⎯ Air Pollution Monitoring:
With fast-growing industrial activities, the problem of air pollution is becoming a serious health concern. Traditional data logging methods are considered not only complex but also time consuming. WSNs are used to reduce the complexity of air pollution monitoring and to obtain real-time measurements.

⎯ Forest fire Detection:
There have been many incidents of massive forest fires due to human carelessness and mistakes, which have a negative impact on the ecosystem, and the results of a forest fire can be catastrophic [17]. Sensor nodes are installed in forests to provide real-time and accurate fire detection; advance fire detection is very critical to minimize the impact.

⎯ Battlefield Monitoring:
On the battlefield, sensor nodes can be used to monitor enemy activities. Based on the information gathered from the sensor nodes, the army can plan how to prepare against the enemy's activities.

⎯ Weather Monitoring:
The application of WSNs in weather monitoring is similar to that of air pollution and fire detection. Sensor nodes can be used for weather monitoring, and early predictions about rain or flood can be made to take precautionary measures in advance.

⎯ Health Monitoring:
One of the most popular applications of WSNs is patient health monitoring. Wearable health monitoring units can help doctors continuously monitor a patient's health and maintain an optimal health status. Research shows that the addition of WSNs to health monitoring has produced very positive indicators in terms of patients' recovery when responding to critical medical conditions like cardiac arrest [18].

These are only a few of the applications of wireless sensor networks. There are numerous situations in which wireless sensor networks can prove to be the most suitable solution.

Like any other network, wireless sensor networks are prone to security threats. The sensor nodes may be located in different locations and use wireless links to gather and transport or communicate important information. It is important to understand the attacks pertaining to wireless sensor networks before we narrow down our discussion to broadcast authentication issues. Some of the common attacks on wireless sensor networks are discussed below.
978-1-7281-2882-5/19/$31.00 ©2019 IEEE 195
ATTACKS ON WIRELESS SENSOR NETWORKS

The most common attacks on WSNs are:
• Routing information spoofing: In this type of attack, the attacker sends fake routing information, creating routing loops, generating false error messages, and lengthening or shortening the source route. This attack decreases the lifetime of the network and increases latency [7].
• Selective Forwarding: In WSNs the multi-hop paradigm is common, in which each node must forward received messages correctly and securely. If a node is compromised, it may refuse to forward messages or forward only selected (malicious) ones.
• Denial of service attack: DoS is a type of attack that prevents the network from performing its normal operation or makes it unavailable to legitimate users. It can be launched in different ways, e.g., by sending jamming signals (radio signals transmitted to interfere with the radio frequency used by the sensor network) to jam a node, or by sending too many bogus messages (flooding) to the nodes to cause power failure.
• Sybil Attack: In this type of attack, a malicious node appears to be in more than one place: the node presents more than one identity to the neighboring nodes in the network. This attack mostly affects geographic routing. It can be prevented if each pair of neighboring nodes uses a unique key to initialize communication.
• Wormhole: The basic idea of this type of attack is to tunnel packets received in one part of the network to another part. A well-placed wormhole can disrupt the whole routing. A node that is multiple hops away from the base station is deceived into believing it is one or two hops away through the wormhole. This attack can be launched in conjunction with a Sybil attack.
• Sinkhole: In this type of attack, a compromised node makes itself look attractive to the other nodes in the network with respect to the routing algorithm. The compromised node attracts all the network traffic to pass through it, creating a sinkhole with the attacker at the center to collect information. This attack can be used to launch other attacks such as selective forwarding [7].
• Hello Flooding: In WSNs many protocols require nodes to broadcast a Hello message to neighboring nodes to advertise their presence. The receivers of a Hello message assume that they are in radio range of the sender. A laptop-class attacker can broadcast routing or other information to deceive the nodes in the network into believing that he is their neighbor and may start exchanging information.
• Acknowledgement spoofing: Some routing protocols use link-layer acknowledgements, which can be spoofed by the attacker to convince the nodes that weak links are strong or that dead links are alive. As a result, a node can select a weak link for routing, and packets sent over the weak link can be lost.
• Impersonation: In this type of attack, the attacker adds a node to the existing network by copying the ID of an existing node. The attacker is then able to corrupt, misroute, or delete packets, and can disclose the cryptographic keys as well.
• Eavesdropping: Eavesdropping is not an active attack. In this attack the attacker listens to the network traffic to discover secret information. This type of attack is very hard to prevent; in most cases encryption is the only solution.
• Traffic Analysis: Through traffic analysis the attacker can determine the base station, as all traffic goes toward a single point. If the base station is compromised, the attacker will be able to make the whole network useless.
• Mote Class: Mote-class attacks, also called insider attacks, are launched either by a compromised node or by an attacker who has taken (stolen) key material, code, or data from a legitimate sensor node.
• Laptop Class: Also called an outsider attack, the attacker has no special access to the WSN but has access to more powerful devices, such as a laptop, which replace a legitimate node. This attack can jam the entire network because its radio transmitter power is high.

The next section discusses the related research work on security issues in general, and broadcast authentication in particular, pertaining to WSNs.

II. RELATED WORK

Broadcast authentication is a crucial security service in wireless sensor networks because it allows the nodes to send authenticated broadcast messages to all other nodes. Techniques such as μTESLA and multi-level μTESLA have been proposed to handle broadcast authentication, but none of these techniques is effective in terms of bandwidth and the number of sender nodes, and they cannot handle denial-of-service attacks.

Donggang Liu proposed a technique for broadcast authentication which is based on μTESLA but can handle both the problem of the number of sender nodes and DoS attacks. He also proposed a technique to revoke the broadcast capability of nodes which are malicious or compromised [19].

Considering the challenges of μTESLA, Mohamed Hamdy Eldefrawy proposed a protocol which uses two different hash functions and the Chinese Remainder Theorem. The protocol has proved to be more efficient and effective than μTESLA because the receivers can authenticate the broadcast messages in real time [20].
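The μTESLA family discussed above rests on a one-way key chain: the sender commits to K_0 = H^n(K_n), later discloses keys in order, and a receiver verifies a disclosed key by hashing it back to the commitment. A minimal sketch of just the chain mechanics (the real protocol additionally relies on loose time synchronization and delayed key disclosure, which are omitted here):

```python
import hashlib

def h(data: bytes) -> bytes:
    """One-way function H, instantiated with SHA-256 for illustration."""
    return hashlib.sha256(data).digest()

# Sender builds the chain K_n -> K_{n-1} = H(K_n) -> ... -> K_0.
n = 5
chain = [b"secret-seed"]            # K_n, chosen by the sender
for _ in range(n):
    chain.append(h(chain[-1]))
chain.reverse()                      # chain[i] is now K_i
commitment = chain[0]                # K_0, distributed authentically at bootstrap

def verify_disclosed_key(key: bytes, index: int, commitment: bytes) -> bool:
    """Hash the disclosed key `index` times; it must reach K_0."""
    for _ in range(index):
        key = h(key)
    return key == commitment

# Keys are disclosed in order K_1, K_2, ...; each one checks out
# against the commitment, while a forged key does not.
assert verify_disclosed_key(chain[3], 3, commitment)
assert not verify_disclosed_key(b"forged", 3, commitment)
```

Because H is one-way, an attacker who sees K_i cannot compute the not-yet-disclosed K_{i+1}, which is what makes delayed disclosure safe.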
Rongxing Lu, Xiaodong Lin, Haojin Zhu, Xiaohui Liang, and Xuemin Shen proposed a scheme called BECAN which is not only effective in terms of bandwidth, but can also preemptively detect bogus packet injections and helps to conserve energy. It also helps to lessen the burden on the sink in detecting bogus data injection [21].

M. Ramesh and C. Suresh proposed a broadcast authentication scheme based on TESLA and ECDH; however, it uses three keys. This scheme reduces loss and delay in WSNs. After taking the initial parameters, the scheme is based on three main steps: auxiliary key generation, public/private key generation based on elliptic curve Diffie-Hellman, and key concatenation, which finally results in a hash key [22]. This hash key is validated before broadcasting the packets. If the key is verified, the packets are broadcast; otherwise they are discarded, and the sink is informed.

Tsern-Huei Lee proposed two new user authentication protocols slightly different from password-based solutions. These protocols are very lightweight in terms of computational power and communication load compared to strong password-based techniques. Despite its simplicity, this authentication scheme provides comparable security [1].

The scheme proposed by Junqi Zhang and V. Varadharajan is based on the LOCK scheme and employs ID-based secure group key management, which minimizes the key storage requirement and the number of rekeying messages [2]. Yilin Wang and Maosheng Qin proposed a scheme dealing with key management issues using asymmetric cryptographic techniques [3]. Haiguang Chen, XinHua Chen, and Junyu Niu proposed an authentication scheme focusing on the unique characteristics and novel misbehaviors found in WSNs; the proposed scheme authenticates a node based on abnormal behavior or actions it would carry out anyway [4]. Norziana Jamil, Sera Syarmila Sameon, and Ramlan Mahmood proposed a scheme focusing on authentication; their work is based on identity-bit commitments, mainly addressing forgery and replay attacks [5]. D. Manivannan, B. Vijayalakshmi, and P. Neelamegam proposed a new protocol in which congruence equations and number theory concepts are introduced to provide secure authentication among the nodes [6].

III. SECURITY CHALLENGES IN WSNBA

The main advantage of broadcasting and multicasting is that it reduces the communication overhead, but at the same time it requires that only legitimate nodes/parties be able to access those messages. Some of the major challenges in WSN broadcast authentication (WSNBA) are listed below:
⎯ Secure routing protocols
⎯ Key establishment issues
⎯ Fast response broadcast authentication
⎯ Defending against DoS attacks
⎯ Location privacy in WSN
We will briefly discuss all these challenges in this section.

1. Secure routing protocols:
One of the major issues in WSNs is the secure routing protocol, which should not only protect the routing information but also be lightweight. It is extremely challenging to design a secure routing protocol, as the sensor nodes have low power and limited memory and processing capacity. WSN routing security deals with the authentication of the user node and verification of the packet being sent. Authentication can be achieved by using the base station, a key, or a certificate; the certificate is the unique ID of each node. The scheme described in [8] provides a secure and efficient routing protocol using encryption and authentication.
Iman Almomani and Emad Almashakbeh [9] proposed a power-efficient, secure routing protocol to manage the resource limitations in WSNs. This protocol is a combination of tree-based and cluster-based protocols, but it uses LEACH as the base protocol for the cluster formation process, which has some weaknesses [10]. The aim of any secure routing protocol is to guarantee the authentication, integrity, and availability of packets. Some of the well-known secure routing protocols that address various issues are TESLA, μTESLA, the intrusion-tolerant routing protocol INSENS, SPINS, and trust routing for location-aware sensor networks (TRANS), to name a few.

2. Key Establishment Issues:
To exchange data securely, the protocol must establish and manage key distribution between all the nodes in the WSN that want to communicate. A new node should be securely deployed and enabled to start secure communication with the existing nodes in the network. No unauthorized node or user should get access to the network.
The limitations of WSNs, such as vulnerability to physical node capture, limited computational and communication power, and no prior knowledge of the deployment of the sensor network, make WSN design more challenging.
One of the basic approaches is to use a single shared key for the entire network. In this approach all communication is encrypted with the same key and a MAC (message authentication code) is appended. Although this approach lifts the burden of key management, it has many drawbacks: if one node is compromised, the whole network will be compromised [11].
Another approach to key distribution uses asymmetric cryptography, more commonly called public key cryptography. In this approach, before deployment of the nodes, a master public/private key pair is generated; then a public/private key pair is generated for each node. Each node stores its key pair, the master public key, and the master-key signature. The nodes are then ready to exchange keys: nodes exchange their public keys and master-key signatures, and a node's public key can be verified by checking the master-key signature using the master public key. Once the verification of a node's public key is done, a symmetric key is generated and transmitted between the communicating nodes by encrypting it with the public key of the receiver node. The two nodes are then ready for secure communication using this
symmetric key. The reason for using a symmetric key for encryption is that it is computationally less expensive than public key encryption. But this approach also has some problems, e.g., key generation and verification overhead, and vulnerability to DoS and node replication attacks.
Another approach is to use pair-wise shared keys, in which each node has a unique symmetric shared key with every other node in the network. The main disadvantage of this approach is the storage of too many keys on each node: if the network is large, this approach is not feasible at all in terms of storage capacity, and adding new nodes becomes very difficult, which causes scalability issues.
There is another key distribution approach, known as random key pre-distribution, described in [11] and [12]. None of the schemes is perfect in all respects; all have advantages and disadvantages and apply to specific situations.

3. Fast response broadcast authentication
Another major challenge in WSNs is fast and authenticated broadcast operation. The public-key-based method of broadcast authentication in wireless sensor networks is considered more efficient than the symmetric-key-based approach because of its simple protocol operations, such as requiring no synchronization. Using a PKC-based approach, a sensor node can detect false messages because it authenticates a message before forwarding it. The problem, however, is that PKC operations on low-computational-power nodes increase message propagation time. Another issue is that a sensor node may forward messages before even authenticating them. To achieve fast and efficient broadcast operation, the sensor node has to decide, based on the situation, when to authenticate first and when to forward first. To deal with this dilemma, two new schemes have been proposed in [13] that help the sensor nodes make this decision. These two schemes are known as:
⎯ Key Pool Scheme
⎯ Key Chain Scheme
For efficient and capture-resistant PKC-based broadcast authentication protocols, distribution of secret keys among the sensor nodes and Bloom filters are used in both schemes. There are two ways to solve the broadcast authentication problem: the hardware approach and the protocol approach. The hardware approach secures the keys inside the sensor node, preventing attacks that bypass the protocol, by equipping nodes with tamper-resistant memory, allowing the MAC approach to be used in wireless sensor networks. Because of the high cost of tamper-resistant hardware, the hardware approach is limited to critical applications.
Many researchers are focusing on creating new protocols that resist node capture. As discussed in the previous sections, μTESLA (Timed Efficient Stream Loss-tolerant Authentication) and its various extensions are efficient, low-computational-overhead broadcast authentication protocols, but they require some level of synchronization between nodes during periodic broadcasting.
Because the PKC-based method deals efficiently with the verification delay problem, most researchers are trying to speed up the operations of PKC. In the PKC approach, the two schemes mentioned above use digital signatures as the primary authentication mechanism along with Bloom filters. In these schemes the sensors are low-cost devices without tamper-resistant hardware, performing basic cryptographic and public key operations. Both the key pool scheme and the key chain scheme use the strength of public key cryptography and efficiently solve the issue without the need for periodic key distribution or synchronization. We briefly discuss these two schemes below.

i. Key Pool Scheme

In the key pool scheme, the nodes are divided into groups, each possessing a partition of the network's key pool, while the access point possesses all the keys. This scheme consists of three phases:
a) Pre-deployment phase:
In this phase each node stores the keys necessary for node-level operations, the access point's public key, the network-wide hash function, and independent hash functions for BFVs [13].
b) Signature generation:
In this phase a digital signature is generated, and the access point then creates a BFV from the digital signature, which is used by the sensor nodes to pre-verify the signature [14]. Finally, the message is broadcast as {M, tt, DS, I, BFV}, where
M = message,
tt = time stamp,
DS = digital signature,
I = set of all key indices included in the BFV.
c) Message verification and forwarding phase:
This phase is responsible for message verification and forwarding whenever a broadcast message is received at a node in the network [13].

Advantages of the Key Pool Scheme:
• Through key partitioning, the attacker learns only limited information from node capture.
• It provides protection against denial-of-service attacks.
• Minimum broadcast authentication delay is achieved.
Disadvantages of the Key Pool Scheme:
• The hashing operation adds computational overhead.
• Transmitting additional bits in every broadcast message results in communication overhead.

ii. Key Chain Scheme

The key chain scheme uses multiple one-way hash chains to avoid the communication overhead of the key pool scheme. In this scheme there is no need to include indices in the broadcast messages because it uses a one-way hash chain in the forward direction, in which each node starts with key index zero and proceeds to higher key indices.
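The BFV-based pre-verification used by both schemes can be illustrated with a toy Bloom filter: the access point inserts authentication material into a bit vector, and a sensor node cheaply checks membership before paying for full signature verification. This is an illustrative reconstruction with arbitrary parameters, not the construction or parameters of [13]:

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter standing in for the BFV (parameters arbitrary)."""

    def __init__(self, size=256, hashes=3):
        self.size = size
        self.hashes = hashes
        self.bits = [False] * size

    def _positions(self, item: bytes):
        # Independent hash functions derived by salting SHA-256.
        for i in range(self.hashes):
            digest = hashlib.sha256(bytes([i]) + item).digest()
            yield int.from_bytes(digest[:4], "big") % self.size

    def add(self, item: bytes):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item: bytes) -> bool:
        # False => definitely not inserted; True => probably inserted.
        return all(self.bits[pos] for pos in self._positions(item))

# Access point inserts the key material covered by this broadcast round.
bfv = BloomFilter()
bfv.add(b"key-17")

# A sensor node pre-verifies before running costly PKC operations:
# a hit means "proceed to full signature verification", a miss means
# the packet can be dropped without any public-key work.
assert bfv.might_contain(b"key-17")
```

The trade-off noted in the text follows directly: the hash evaluations add computational overhead, and the BFV bits enlarge every broadcast message, in exchange for filtering out bogus packets before any expensive signature verification.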
The key chain scheme comprises three main phases, similar to the key pool scheme discussed in the previous section [13].

a) Predeployment phase: First of all, a global key pool must be generated, which consists of the N starting keys of N independent key chains. Each node selects k starting keys from the N keys at random. As in the key pool scheme, every node is configured with the public key of the base station, the network-wide hash function, and the independent hash functions for the BFVs.

b) Signature generation phase: The access point generates the digital signature as DS = E_private(H(M ‖ tt)). Every key chain advances from K_ij to K_i(j+1), where i ∈ [1, N]; after that, the access point inserts all new keys into the BFVs as in [13]. The final message to be broadcast is given below, where c is the index of the current key in the chain:

Broadcast Message = [M, tt, DS, c, BFV]

c) Message verification and forwarding phase: In this phase, tt is checked for freshness and the BFV for forgery. When the BFV test passes, the sensor node advances its local key chain to that index. If all the corresponding BFV bits for the new local keys are verified (otherwise the packet is dropped), the node forwards the message. Before accepting the packet, the last step is to verify the DS.

Advantages of Key Chain Scheme:
• Indices are not required to be included in the broadcast message.
• Multiple one-way hash chains are used, which eliminates communication overhead in WSNs.
• Given the starting key and the network-wide hash function, the key at any position in the chain can be computed.

Disadvantages of Key Chain Scheme:
• The scheme is prone to a single point of failure: once one key is compromised, the whole key chain is compromised.

4. Defending against DoS Attacks

When the receiver receives a broadcast packet, it first verifies the pre-authenticator. Only if this verification succeeds is the digital signature verified. The pre-authenticator is derived from a pseudorandom function and the sender's node ID. The pseudorandom function, say f, is known only to the base station (access point), which can verify a sender's pre-authenticator when needed. Before sending or broadcasting a data packet, the sender first distributes its pre-authenticator to all receivers, either with a secure broadcast key or using pairwise shared keys. Before communication can take place, each node saves the node ID and the most recent pre-authenticator of all its neighbors. Finally, the broadcast message has the following format: [i / Mi / DSi / Kiv], where

i is the index,
Mi is the message to be broadcast,
DSi is the digital signature, and
Kiv is the i-th pre-authenticator of node v.

When this data packet is received at the receiver end, the pre-authenticator is verified first, to check that a packet with the i-th index has not been received before and that v is a valid neighbor. The receiver performs the verification by checking Kjv = f^(j-i)(Kiv), because the key Kiv can only be generated by node v. If this verification succeeds, the receiver verifies the DS; otherwise the packet is dropped. At the end, j is replaced by i and Kjv by Kiv at the receiver end [15].

One advantage of the one-way hash function is that, as long as the i-th packet has not been broadcast, the attacker cannot figure out Kiv, since it depends on i; hence the attacker cannot fake the pre-authenticator. Consider the case where the attacker receives a broadcast packet from a node X, keeps the pre-authenticator, forges the message, and then replays the forged message. If the receiver Y is a neighbor of node X, it must have received the unmodified packet with the same pre-authenticator at the same time the adversary received the broadcast packet, so node Y can detect the forged message because it has already seen this pre-authenticator. If node Y is not a neighbor of X, then X is also not among Y's neighbors, so Y will detect the modified packet and drop it.

In many situations it is necessary to add new sensor nodes after the initial deployment. Adding a new node changes the neighborhood associations of the existing nodes. Therefore, the new node must be handled so that it can be recognized as a valid node and can broadcast packets and verify the packets it receives. For this purpose, an ID certificate is first computed by signing the node ID (each node has a unique ID) with the private key of the base station. All sensor nodes, both old and new, are pre-configured with the base station's public key. The new node can then prove its validity with its ID certificate.
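The forward hash chain and the receiver check Kjv = f^(j-i)(Kiv) described above can be sketched as follows. SHA-256 stands in for the pseudorandom function f, and the seed and chain length are illustrative assumptions:

```python
import hashlib

def f(key: bytes) -> bytes:
    """The public one-way function; SHA-256 stands in for the scheme's f."""
    return hashlib.sha256(key).digest()

def make_chain(seed: bytes, length: int) -> list[bytes]:
    """Build the chain forward: K_{j+1} = f(K_j). Only node v knows the seed."""
    chain = [hashlib.sha256(b"seed|" + seed).digest()]
    for _ in range(length - 1):
        chain.append(f(chain[-1]))
    return chain

def verify(k_i: bytes, i: int, k_j: bytes, j: int) -> bool:
    """Receiver check K_j == f^(j-i)(K_i): the newly revealed key K_i must
    hash forward to the stored pre-authenticator K_j (j > i), which only
    the chain owner could have arranged."""
    if j < i:
        return False
    cur = k_i
    for _ in range(j - i):
        cur = f(cur)
    return cur == k_j

# Node v builds a chain and distributes K_7 as its initial pre-authenticator;
# later packets reveal lower-index keys that receivers verify and then store.
chain = make_chain(b"node-v-secret", 8)
print(verify(chain[5], 5, chain[7], 7))      # True: f applied twice to K_5 yields K_7
print(verify(b"\x00" * 32, 5, chain[7], 7))  # False: a forged key does not chain to K_7
```

Because f is one-way, an attacker who has seen K_7 cannot compute K_5 ahead of its disclosure, which is exactly why the pre-authenticator cannot be faked.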
… Remainder Theorem," Sensors (Basel) (10:9), 2010, pp. 8683–8695. Published online 17 September 2010. doi: 10.3390/s100908683.
Towards An Integration Concept of Smart Cities
Naoum Jamous
Magdeburg Research and Competence Cluster (MRCC)
Otto von Guericke University Magdeburg (OVGU)
Magdeburg, Germany
naoum.jamous@ovgu.de

Stefan Willi Hart
Digital Business Integration
Accenture GmbH
Cologne, Germany
Stefan.willi.hart@accenture.com
Abstract— Urbanization is one of the greatest challenges of our time. The smart city concept was introduced to support efficient utilization of the limited resources in our world. However, implementing this concept is a challenging task that researchers and practitioners are dealing with. In this paper, different smart city frameworks as well as enterprise integration approaches are discussed. In order to support the orchestration of smart city services, an integration concept is presented.

Keywords—Smart City, systems integration, Internet of Things (IoT), IT Operation Management, Supply Chain Management.

I. INTRODUCTION

Urbanization is one of the greatest challenges in our world. For instance, in 2018, 55% of the population was living in urban environments, and this share is expected to reach 68% by 2050 [1]. The smart city concept was introduced to support efficient utilization of the ecosphere's limited resources. However, implementing this concept is a challenging task for both researchers and practitioners. For example, the European Union is supporting researchers and practitioners by funding smart city and Internet of Things (IoT) related research in the context of the Horizon 2020 program [2].

Considering the schematic maturity level model of smart cities, it can be stated that the ultimate goal is to develop a self-learning and adaptive smart city with endogenous and exogenous networks [3]. In order to achieve this goal, smart city initiatives in different areas, such as traffic management or waste management, have to be implemented. However, to achieve the highest maturity level, these initiatives should be integrated into a common communication framework [3], [4]. Since a smart city project involves several partners, the implementation dilemma is one of the most crucial aspects to be considered. This dilemma describes the challenge of managing the stakeholders' differing interests while planning and developing smart city initiatives [3]. Moreover, responsibilities for failure and success are often not clarified. In addition to the management aspects, technological issues must be considered: high degrees of safety, security, and system integration are necessary [5]. According to many researchers, incomplete information and communication technologies, the variety of technological standards, data privacy, and integration are some of the main transformation barriers in smart city projects [3], [4], [6].

An integration problem can be solved by several techniques. Thus, evaluating integration solutions is a multi-dimensional problem [7]. This paper focuses on the integration of different smart city systems to achieve the highest maturity level possible. While the concept of web services is recommended for single initiatives in the scientific literature, achieving synergy effects among different initiatives requires a higher degree of communication between the systems used [7], [8], [9].

In a broader sense, an integrated smart city system can be compared to the IT system landscape of an enterprise describing its supply chain [10]. Thus, a concept similar to a supply chain management (SCM) system can be used to support a smart city project. SCM is a process-oriented management approach comprising all movements of raw materials, components, products, and information along the value creation and supply chain, from raw material to end product [11]. In the context of this idea, integration approaches for supply chain management could be transferred to smart city projects, as indicated in figure 1.

Fig. 1. Supply Chain of a Smart City System [10]

In this work, different smart city frameworks as well as enterprise integration approaches will be discussed. In order to support the orchestration of smart city services, an integration concept will be proposed and discussed. The paper closes with a conclusion stating the main findings and obstacles.

II. SMART CITY FRAMEWORKS

According to A. M. Townsend, smart cities can be defined as "places where information technology is combined with infrastructure, architecture, everyday objects, and even our bodies to address social, economic, and environmental problems" [12]. Smart city development is therefore a challenging task. According to C. Etezadzadeh, a smart city consists of the following enablers: natural basics, urban actors and their contributions, integrated urban management and urban governance, objectives and versioning, infrastructures, a layer of information and communication technologies, and resilience [4]. With the introduction of smart city solutions, new business models and new value chains arise on the basis of inter-sectoral cooperation.

As described in the previous section, there are many barriers to overcome. Researchers have proposed development
978-1-7281-2882-5/19/$31.00 ©2019 IEEE
frameworks to be used in smart city projects. Hereafter, four different frameworks are presented.

A. Smart City Initiatives Design Framework (SCID)

The SCID authors developed a conceptual model to be used while designing a concrete smart city initiative. The model is based on the analysis of ten different smart city initiatives [13] and describes the main features of an initiative. Using the Leontief "Input-Output Model" [14], the authors aimed to create an explicit link between the environmental factors that affect the initiative directly and the achieved results. Thereby, a value-oriented perspective is associated with the solution. The model consists of six main elements, as depicted in figure 2.

Fig. 2. The Smart City Initiatives Framework (SCID) [13]

The element "Smart City Initiatives" describes how specific smart city related projects can be implemented. These projects have an impact on the city policy domain of the location in which the initiative is carried out. This in turn produces results for the city and for the various stakeholders. On the lower level of the model, the elements enablers, critical success factors, and challenges play a role. The critical success factors are extracted from the two elements "enablers" and "challenges". SCID describes two different implementation approaches. The first is the top-down approach, stating that smart cities are initially planned, designed, and developed on the basis of drafts. The second approach is a bottom-up one. It assumes that existing cities are upgraded with smart features.

B. Modified Smart City Initiatives Design Framework

This model is an improvement of the SCID framework. Here, the SCID is converted, using a decision support model, into a schematic transformation meta-model [3]. It considers, firstly, the maturity degree of the city's development and, secondly, its iterative development. Another aspect which is revisited in the model is the holistic planning approach. This planning should provide essential guidelines for the redesign of the city.

C. Integration Model for Smart City Development according to V. Javidroozi [15]

This model is based on a literature review, questionnaires, and interviews [15]. It is derived from the Business Process Change (BPC) model of Kettinger and Grover [16], which consists of four dimensions: information & technology, people, management, and structure. V. Javidroozi merged the management and structure dimensions. The model has three levels. In the first one, the technological factors, the business process changes, and the human challenges are identified. The second level is divided into technologically related challenges, process related challenges, and human problems. At the third level, further dimensions are derived from the process related challenges: inter-organizational, functional, and managerial challenges.

D. SMART model

This model was proposed by S. Ben Letaifa [17] and uses a top-down approach. As shown in figure 3, the model consists of three paths (macro, mezzo, and micro) and five consecutive main phases: Strategy, Multidisciplinary, Appropriation, Roadmap, and Technology [17]. The top level of the framework is the macro level. At this level, the development of the smart city strategy and the mobilization of multidisciplinary resources occur. The mezzo level focuses on the various actors' appropriation of the project and the creation of a clear roadmap for the realization of the city. At the micro level, possible technologies are identified to support the strategy and hence the initiatives.

In the strategic phase, the local challenges and the population are input requirements to realize the common vision to be pursued. Afterwards, the objectives are defined and the strategy is carried out. Then, the multidisciplinary resources should be mobilized. In the following phase, it is all about the iterative and agile improvement of the definition and development of the project. The various multidisciplinary actors should be integrated in order to transform them into active members of the project. Then, the detailed planning is carried out, including the implementation of the services. In the last phase, the technology selection is detailed.

Fig. 3. The SMART model [17]

III. ENTERPRISE INTEGRATION – GENERAL CONCEPTS

The term system integration is defined by J. Myerson as: "…Systems integration involves a complete system of business processes, managerial practices, organizational interactions and structural alignments… It is an all inclusive process designed to create relatively seamless and highly agile processes and organizational structures that are aligned with the strategic and financial objectives of the enterprise… Systems integration represents a progressive and iterative cycle of melding technologies, human performance, knowledge and operational processes together." [18]. Hence, it is an extensive and complex process in which various aspects need to be considered. System integration has five essential characteristics [18]:

• Functional and technical compatibility is provided.
• The technologies used to process applications and data are relatively transparent to users.
• The issue is selecting the best technology with respect to longevity, adaptability, scalability, and speed of solution delivery.
• Application systems, data, access paths to data, and graphical user interfaces (GUIs) are harmonized and standardized for the user.
• All enterprise-wide applications and computing environments are scalable and portable to a variety of needs.

The integration objectives must be defined before setting the integration approach to be followed. In the definition of integration objectives, two integration items need to be considered: the task level and the task manager level. Based on these two levels, four main aspects are determined [19]:

The connection of the application landscape to a central communication component (message broker) can be found in both. However, SOA requires that the connected applications follow the service paradigm, whereas they can remain discrete in an EAI scenario [24], [25].

Another important difference is that EAI is driven by the business processes, while SOA is driven by technology [26]. In addition, SOA uses a top-down approach while EAI follows a bottom-up approach [20]. SOA defines standards for various integrations [27]. In contrast, EAI intends to propagate changes from one system to a cluster of systems. Generally, SOA enables a wide range of enterprise applications to integrate through the use of standardized services [25]. EAI, by contrast, integrates at the system level, so to speak, through the output of the integration system [28].
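The central communication component mentioned above can be sketched as a minimal message broker with one adapter per subscribing application, so that each target system receives the message in its own native format. The topic name, message fields, and formats below are invented purely for illustration:

```python
from collections import defaultdict
from typing import Callable

class MessageBroker:
    """Central hub: producers publish to a topic; the broker forwards each
    message to every subscriber through that subscriber's adapter, which
    converts the canonical message into the target system's own format."""

    def __init__(self):
        self._subs = defaultdict(list)  # topic -> list of (adapter, handler)

    def subscribe(self, topic: str, adapter: Callable, handler: Callable):
        self._subs[topic].append((adapter, handler))

    def publish(self, topic: str, message: dict):
        for adapter, handler in self._subs[topic]:
            handler(adapter(message))

# Two hypothetical smart-city systems with different native formats.
received = []
broker = MessageBroker()
broker.subscribe("traffic.events",
                 adapter=lambda m: f"{m['sensor']}:{m['count']}",        # legacy string format
                 handler=received.append)
broker.subscribe("traffic.events",
                 adapter=lambda m: {"id": m["sensor"], "n": m["count"]},  # dict/JSON-style format
                 handler=received.append)

broker.publish("traffic.events", {"sensor": "cam-07", "count": 12})
print(received)  # ['cam-07:12', {'id': 'cam-07', 'n': 12}]
```

The design point this illustrates is the one the comparison turns on: producers and consumers only know the broker, so replacing one system means touching one adapter rather than every connection.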
Fig. 5. Hub and Spoke approach

The message is forwarded according to defined rules via the adapter to the target system [19].

C. Enterprise Service Bus (ESB)

According to D. A. Chappell, an Enterprise Service Bus is "a standards-based integration platform that combines messaging, web services, data transformation, and intelligent routing to reliably connect and coordinate the interaction of significant numbers of diverse applications across extended enterprises with transactional integrity" [29]. As presented in figure 6, the ESB is a messaging backbone. This messaging system controls the flow among the various services and applications that are linked to the ESB. Subscribing applications have adapters which take messages from the bus and transform them into the format required by the application.

The development of a suitable smart city system landscape, and/or the integration of various smart city solutions into a system landscape, is not yet considered in any model. The sustainability of the system landscape is an essential aspect to be considered while developing a smart city. Due to the rapid change in information technologies, it is important to have a flexible system environment enabling the exchange of individual components and services. Therefore, real-time migration, the switch to new technologies, and the integration of new services are at the forefront. The maintenance of the services and the system environment should also be considered. Another important aspect is the potential synergy among the smart city solutions. For example, in the field of smart logistics, all related systems should use the same sensors, and/or they should all be able to communicate.

Table 1 presents an evaluation of the proposed integration approaches based on [31]. It can be seen that all three approaches have middle or lower development complexity. In particular, the ESB has a low development complexity, since there are many open standards reducing the development effort [30]. When the maintenance complexity of the approaches is considered, it can be seen that the P2P approach is the most complex: when a service is exchanged or replaced, all connections to the other services need to be altered or updated. In the hub and spoke approach, all hubs must be serviced, while for the ESB only the needed special services have to be serviced. Consequently, in many change and maintenance cases, these two approaches would be appropriate. With respect to coupling, the situation is similar. The P2P approach is very tightly coupled, unlike the other two approaches, whereby the ESB approach is completely loosely coupled. In terms of scalability, the P2P and the ESB approaches have certain advantages, as they are highly scalable. The hub and spoke approach depends on the hub structure; therefore, there may be limitations in the processing. For the extensibility of
the approaches, hub and spoke and ESB show advantages over the P2P approach. If a new node or service is added in the P2P approach, all other nodes must know the protocol of the new service. This leads to an increase in the system's reorganization costs.

The latency depends in all cases on the central orientation tool (hub or messaging backbone). At this point, the P2P approach has its advantages: it can respond via its connections to each system very quickly and in real time. For performance, exactly the same consideration applies. This results in optimal application scenarios for the individual approaches. For the P2P approach, it is intra-business service integration, because fast internal communication is possible and often the same standards are used within the company itself. The hub and spoke approach is an intermediate solution; it can be used for intra- and inter-business service integration, because the hub is a central broker and external services can reach it via an adapter. The disadvantage of this approach is that each service requires the protocol of the hub. The ESB approach is especially suitable for enterprise integration, since each service provides its own adapter.

VI. PROPOSED FRAMEWORK

Based on the previous discussion, it can be concluded that there is a lack of a uniform system landscape concept integrating all smart city solutions. Furthermore, the evaluation of the different integration approaches shows that no single integration approach meets all the necessary requirements for a unified approach. Thus, there is no single approach that can be followed to achieve all the integration objectives; a combination of different approaches is the most suitable strategy to realize city-wide services integration. In figure 7, a possible system landscape concept is proposed. The model recommends using the point-to-point (P2P) approach within each smart city area (e.g. Smart Energy, Smart Living, etc.). This ensures a high degree of crosslinking among the services, and therefore a better process control is possible. Here, the disadvantages of the P2P approach are negligible due to the relatively small number of services related to one smart city area. The data of each unit must flow into the hub. Hence, the different areas can be decoupled, and unnecessary dependencies can be avoided. This is followed by an integration layer using an ESB concept. At this level, the transformation, planning, and orchestration of the services' data take place. Moreover, security and policies are managed. The presentation layer is the last level. Here, users can access and interact with the system through a browser, a mobile device, or other means.

Fig. 7. A Smart City Landscape Model

VII. CONCLUSION

In this research, different smart city frameworks as well as enterprise integration approaches were discussed. In order to support the orchestration of smart city services, an integration landscape model has been proposed. The proposed model serves as an example to show that a combination of different approaches is the best solution to solve the integration problem of smart city services. Evaluating the model is one of the main future steps.

REFERENCES

[1] The UN Department of Economic and Social Affairs. May 2018. URL: https://www.un.org/development/desa/en/news/population/2018-revision-of-world-urbanization-prospects.html
[2] Portmann, E., and Finger, M. 2015. "Smart Cities – Ein Überblick!," HMD Praxis der Wirtschaftsinformatik (52:4), pp. 470–481.
[3] Jaekel, M. 2015. Smart City wird Realität, Wiesbaden: Springer Fachmedien Wiesbaden.
[4] Etezadzadeh, C. 2015. Smart City – Stadt der Zukunft?: Die Smart City 2.0 als lebenswerte Stadt und Zukunftsmarkt, Wiesbaden: Springer Vieweg.
[5] Alawadhi, S., Aldama-Nalda, A., Chourabi, H., Gil-Garcia, J. R., Leung, S., Mellouli, S., Nam, T., Pardo, T. A., Scholl, H. J., and Walker, S. 2012. "Building Understanding of Smart City Initiatives," in Electronic Government, H. J. Scholl, M. Janssen, M. A. Wimmer, C. E. Moe and L. S. Flak (eds.), Berlin, Heidelberg: Springer Berlin Heidelberg, pp. 40–53.
[6] Scholl, H. J., Janssen, M., Wimmer, M. A., Moe, C. E., and Flak, L. S. (eds.) 2012. Electronic Government, Berlin, Heidelberg: Springer Berlin Heidelberg.
[7] Pero, M., Kühne, S., and Fähnrich, K.-P. 2014. "Integration – eine Dienstleistung mit Zukunft," in Enterprise-Integration, G. Schuh and V. Stich (eds.), Berlin, Heidelberg: Springer Berlin Heidelberg, pp. 125–137.
[8] Ferreira, D. R. 2013. Enterprise Systems Integration, Berlin, Heidelberg: Springer Berlin Heidelberg.
[9] Schuh, G., and Stich, V. (eds.) 2014. Enterprise-Integration, Berlin, Heidelberg: Springer Berlin Heidelberg.
[10] Javidroozi, V., Shah, H., Cole, A., and Amini, A. 2014. "Smart City as an Integrated Enterprise: A Business Process Centric Framework Addressing Challenges in Systems Integration," Paris, July 20–24, 2014, IARIA.
[11] Kummer, S., Grün, O., and Jammernegg, W. 2009. Value Pack Grundzüge der Beschaffung, Produktion und Logistik + Übungsbuch: Bundle Lehr- und Übungsbuch, München: Addison Wesley in Pearson Education Deutschland.
[12] Townsend, A. M. 2014. Smart Cities: Big Data, Civic Hackers, and the Quest for a New Utopia, New York, NY: Norton.
[13] Ojo, A., Curry, E., Janowski, T., and Dzhusupova, Z. 2015. "Designing Next Generation Smart City Initiatives: The SCID Framework," in Transforming City Governments for Successful Smart Cities, M. P. Rodríguez-Bolívar (ed.), Cham: Springer International Publishing, pp. 43–67.
[14] Leontief, W. 1971. "Theoretical Assumptions and Nonobserved Facts," American Economic Review (61:1), pp. 1–7.
[15] Javidroozi, V., Shah, H., Cole, A., and Amini, A. 2015. "Towards a City's Systems Integration Model for Smart City Development: A Conceptualization," Las Vegas, 7–9 December.
[16] Kettinger, W. J., and Grover, V. 1995. "Toward a Theory of Business Process Change Management," Journal of Management Information Systems (12:1), pp. 9–30.
[17] Ben Letaifa, S. 2015. "How to strategize smart cities: Revealing the SMART model," Journal of Business Research (68:7), pp. 1414–1419.
[18] Myerson, J. M. 2002. The Complete Book of Middleware, Boca Raton, FL: Auerbach.
[19] Ferstl, O. K., and Sinz, E. J. 2006. Grundlagen der Wirtschaftsinformatik, München: Oldenbourg.
[20] Aier, S. 2007. Integrationstechnologien als Basis einer nachhaltigen Unternehmensarchitektur: Abhängigkeiten zwischen Organisationen und Informationstechnologie, Berlin: GITO-Verlag.
[21] Ruf, W., Mucksch, H., and Biethahn, J. 2007. Ganzheitliches Informationsmanagement: Band II: Entwicklungsmanagement, München: De Gruyter Oldenbourg.
[22] Ziemen, T. 2006. Standardisierte Integration und Datenmigration in heterogenen Systemlandschaften am Beispiel von Customer-Relationship-Management.
[23] Organization for the Advancement of Structured Information Standards 2006. Reference Model for Service Oriented Architecture 1.0: OASIS.
[24] Draheim, D. 2010. "Service-Oriented Architecture," in Business Process Technology, D. Draheim (ed.), Berlin, Heidelberg: Springer Berlin Heidelberg, pp. 221–241.
[25] Fischer, S., and Werner, C. 2007. "Towards Service-Oriented Architectures," in Semantic Web Services, R. Studer, S. Grimm and A. Abecker (eds.), Berlin, Heidelberg: Springer Berlin Heidelberg, pp. 15–24.
[26] Papazoglou, M. P., and van den Heuvel, W.-J. 2007. "Service oriented architectures: Approaches, technologies and research issues," The VLDB Journal (16:3), pp. 389–415.
[27] Masak, D. 2005. Moderne Enterprise Architekturen, Berlin, Heidelberg: Springer-Verlag Berlin Heidelberg.
[28] Aier, S. (ed.) 2004. Enterprise Application Integration: Flexibilisierung komplexer Unternehmensarchitekturen, Berlin: GITO-Verlag.
[29] Chappell, D. A. 2004. Enterprise Service Bus, Sebastopol: O'Reilly Media.
[30] Bianco, P., Kotermanski, R., and Merson, P. 2007. "Evaluating a Service-Oriented Architecture," Carnegie Mellon University.
[31] Cognizant 20-20 Insights 2013. "Comparing and Contrasting SOA Variants for Enterprise Application Integration."
Compression Techniques Used in IoT: A Comparative Study

Salam Hamdan, Arafat Awajan, Sufyan Almajali
Department of Computer Science
Princess Sumaya University for Technology
Amman, Jordan
S.hamdan@psut.edu.jo, awajan@psut.edu.jo, s.almajali@psut.edu.jo
Abstract— Due to the improvement of technology, most of the cannot be retrieved from the compressed file, therefore, the file
devices used nowadays are connected to the internet, therefore a size is reduced permanently by eliminating the redundant data.
huge amount of data is generated, transmitted, and used by these On the other hand, in lossless compression, all original data is
devices. In general, these devices are limited in resources such as completely recovered after uncompressing the file [10]. In IoT
memory, processors, and battery lifetime. Reducing the data size restricted devices, lossy compression algorithms have better
reduces the energy required to process this data, minimizes the efficiency in compression rather than the lossless compression
storage of this, data and the energy required to transmit this algorithms, by taking the advantage the existence of the
data. The need for applying data compression techniques on these redundant data, because, there is no need to recover the
devices will come in handy. This paper provides a survey and a redundant data [11].
comparative study among most commonly used IoT compression This paper briefly discusses the most common compression
techniques. The study addresses the techniques in terms of techniques used in IoT and it will provide a comparative study
different attributes such as the compression type, lossless or lossy, between these techniques with respect to the compression type,
the limitations of the compression technique, the location of the amount of energy or space the technique will save, in what
where the compression is applied, and the implementation of the solutions these techniques are implemented, whether the
compression technique. compression happens in the node side or server side, the type of
IoT application. Also, the comparison covers whether the
experiments were simulated, emulated, or run on testbeds.

Keywords: internet of things, wireless sensor network, data compression

I. INTRODUCTION

The Internet of Things (IoT) is a network that connects various types of devices with each other [1], including wireless sensor networks [2]. Sensors can be found almost everywhere, from sensors implanted in the human body to the deepest points of the oceans. However, most of these devices have constrained resources: their memory is limited to small RAM and flash storage [3], and they run on a short battery lifetime [4].

IoT devices are used in numerous types of applications, as they enable human-to-device and device-to-device connections in a trustworthy and reliable manner [5]. These applications include, but are not limited to, healthcare [6], mobile ad hoc networks (MANETs) [7], transportation systems, and heat and electricity management [8].

The limited memory and battery lifetime of IoT devices create the need to reduce the size of the data, both to minimize the CPU cycles needed to process it and to reduce the memory space needed to store it. In addition, data size reduction reduces the bandwidth required to transmit the data. Thus, implementing data compression techniques is very important for IoT devices. Data compression is essential for transmission, storage, and in-network processing. Reducing network traffic is also essential to avoid saturation and to let many devices work cooperatively within the same hub [9].

There are two types of compression: lossy compression and lossless compression. In lossy compression, the original data

This paper is organized as follows: Section II discusses previous work on data compression in IoT networks, Section III differentiates between these compression techniques, and Section IV concludes the paper.

II. LITERATURE REVIEW

Several IoT applications employ various compression methods. In [12], Pielli et al. proposed an optimized MAC-layer protocol that combines energy efficiency and data compression for IoT devices. They consider a network of N users sharing the uplink channel, i.e., the link from an IoT device to a base station, using a Time Division Multiple Access (TDMA) scheme, which lets several users share the same frame on the same frequency by dividing the frame into time slots [13]. In each frame, energy is consumed for three reasons: 1) data processing, 2) data transmission, and 3) data sensing and circuitry costs. Their MAC protocol aims to extend the network lifetime and to fulfill quality-of-service (QoS) requirements. The nodes sense data from the environment. Compressing the input signal requires a number of CPU cycles per bit, so the energy consumed by a node's CPU depends on the node's processor. The protocol defines the optimal energy allocation over time that balances the network lifetime against the average maximum distortion; this is called the energy allocation problem (EAP), a convex optimization problem solved with an alternating optimization procedure. EAP determines the amount and the optimal allocation of the energy consumed in each frame. After defining the optimal energy allocation, they determined for a
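The lossy/lossless distinction can be illustrated with a minimal sketch: zlib as an off-the-shelf lossless coder, and plain quantization as a stand-in lossy coder (the readings and step size below are illustrative, not from any of the surveyed papers):

```python
import zlib

# Toy sensor trace (temperature x100; values are illustrative).
samples = [2150, 2151, 2151, 2153, 2152, 2150, 2149, 2150]

# Lossless: zlib round-trips to the exact original bytes.
raw = b"".join(s.to_bytes(2, "big") for s in samples)
packed = zlib.compress(raw)
assert zlib.decompress(packed) == raw

# Lossy: quantize to a coarser grid. The alphabet shrinks (cheaper to
# store and transmit), but the original values are only approximately
# recoverable.
STEP = 5
quantized = [round(s / STEP) for s in samples]
restored = [q * STEP for q in quantized]
max_err = max(abs(a - b) for a, b in zip(samples, restored))
print(max_err)  # reconstruction error stays within the quantization step
```

The trade-off shown here is the one the surveyed techniques navigate: lossless coding preserves everything, while lossy coding buys a smaller representation at the cost of bounded error.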
Deepu et al. [14] proposed a hybrid compression scheme that combines lossy and lossless compression and consists of a lossy compressor, a lossy decompressor, and an entropy encoder. They applied their technique to a cardiovascular-disease IoT application, specifically wearable electrocardiogram (ECG) sensors. The data generated by the ECG sensors is compressed with a lossy compressor at a high compression ratio (CR); the output of the lossy compression also produces an initial estimate of the QRS peak location, heart rate variability (HRV), etc.

Fig. 1: Block diagram for the hybrid scheme.

They also consider the case where an overall analysis of a signal is required: in their hybrid scheme, the original ECG is reconstructed using the lossy decompressor, and the difference between the reconstructed signal and the original signal, which they call the residual error, is estimated; it has a very low dynamic range. Thereafter, the bit rate of the residual error is minimized by passing it through an entropy encoder. The original signal can be represented losslessly by combining the lossy compressed signal with the encoded residual. This hybrid scheme has several advantages. First, it enables a hybrid transmission mode that minimizes power consumption, since only the compressed data is transmitted. The transmission is also power-aware: most sensors are battery-powered devices, and when the battery is low, the transmission falls back to lossy compression only, to reduce power consumption. In addition, local storage usage is optimized by storing only the lossy data in memory. Another advantage is increased error tolerance, obtained by removing the redundancy between data samples that are close to each other. The results show that power was reduced by 18% for lossy compression and by 53% for lossless compression. This scheme is well suited to healthcare applications, where some cases need the original data.

In their approach [15], Ukil et al. aimed to increase the information gain from the compressed data by analyzing the data, extracting robust outliers generated by the sensor, and exhaustively adjusting the parameters. Thanks to its ability to achieve high data retrieval after decompression, this approach is efficient for various sensor applications. To extract the most important features, they used statistical and information-theoretic techniques, and they built a hardware implementation to test the information gain after decompression.

Dang et al. [16] proposed a Robust Information-Driven Architecture (RIDA) that aims to improve compression by determining the correlation of the data within a cluster of sensors. Their approach is only suitable for fixed networks, since it must group the sensors into clusters, and they assume that any two nodes in the same cluster can communicate in a single hop. Their architecture contains three main parts: information-driven logical mapping, a resiliency mechanism, and a compression algorithm. In the first part, the nodes within the same cluster exchange their readings with each other, so each node learns a pattern for the whole cluster, and logical indices are assigned to the nodes based on the data content. In the second part, the resiliency mechanism, faulty and missing nodes are detected, isolated, and classified throughout the compression and decompression process. Generally speaking, the nodes first distribute their readings across the cluster, so each node has a view of the data within its area. Each coefficient carries its corresponding logical index, and a node sends a coefficient back to the server only if it is non-zero. The data can be retrieved from the non-zero coefficients, the missing data is classified, and the physical map is recovered by a remapping process. Because only a few non-zero coefficients are sent, this approach reduces both energy consumption and bandwidth.

Gandhi et al. [17] proposed an algorithm called Grouping and Amplitude Scaling (GAMPS). Their goal is to reduce the space needed to archive the data on the server side and to reduce the query time over the data generated by the sensors; like RIDA, they take advantage of the correlation between the data. They first formalize the multi-sensor compression problem and then propose GAMPS as a compression method for streaming data generated by a large number of sensors. Their algorithm dynamically discovers groups of sensor signals that can be maximally compressed together. Furthermore, each compressed signal is indexed in order to make data queries easier, and the compression ratio is further improved by applying a suitable amplitude scaling.

Ukil et al. [18] proposed a dynamic lossy compression method called SensCompr, which is influenced by information-theoretic and statistical techniques. In their approach, they accurately reconstruct a huge amount of varied sensor data using the Chebyshev approximation, which is a nonlinear model. Like traditional lossy compression, it also reduces redundant data, as shown in Figure 2. SensCompr extracts the important information and then adjusts the parameters. This also happens in traditional compression, but their method replaces the fixed block size with a dynamic block size.
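The hybrid idea in [14] (a lossy reconstruction plus an entropy-coded residual that restores the signal exactly) can be sketched as follows; plain quantization stands in for the ECG lossy compressor and zlib for the entropy encoder, and the signal values are illustrative:

```python
import zlib

def lossy_compress(signal, step=8):
    """Coarse quantization stands in for the ECG lossy compressor."""
    return [round(s / step) for s in signal]

def lossy_decompress(codes, step=8):
    return [c * step for c in codes]

signal = [3, 130, 258, 390, 251, 120, -4, -131]
codes = lossy_compress(signal)
approx = lossy_decompress(codes)

# The residual between the original and the lossy reconstruction has a
# small dynamic range, so an entropy coder packs it tightly.
residual = [s - a for s, a in zip(signal, approx)]
packed = zlib.compress(bytes(r + 128 for r in residual))

# Lossless recovery = lossy reconstruction + decoded residual.
decoded = [b - 128 for b in zlib.decompress(packed)]
restored = [a + r for a, r in zip(approx, decoded)]
assert restored == signal
```

This mirrors the scheme's hybrid transmission mode: a receiver that only needs the approximate signal can stop at `approx`, while one that needs the exact signal additionally decodes the residual.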
Park et al. [19] proposed a machine-learning-based compression algorithm that uses neural network regression to vectorize the data. However, vectorizing the entire data set with a neural network alone is inefficient, so they divide the data according to a specific range, vectorize the divided chunks, and then merge them. The compression is done with a divide-and-conquer method, since the neural network by itself is not sufficient to compress the hourly generated data. The generated data is divided into time units, and neural regression is applied within each unit. In the conquer step they apply several machine learning techniques, namely coefficient averaging, Euclidean distance, cosine similarity, and re-learning, to represent the data easily, and they choose the technique with the highest accuracy. The results show that Euclidean distance has the highest accuracy among these techniques.

In their work [20], the authors aim to balance the storage and precision costs for sensors that generate video stream data. To make the stored video smaller than the original, they propose omitting redundant video frames (frames that differ only slightly from each other) and storing only differentiable frames; to measure the differences between frames, they adopt the Structural Similarity Index Measure (SSIM). By doing this, the size of the video is reduced to 60% of the original.

In [9], Stojkoska et al. proposed a lightweight delta compression algorithm based on a new coding scheme. This scheme can be used with temporally correlated data. To compress data, they collected raw temperature data from MICAz Crossbow nodes using the MOTE-VIEW application [21]. Temperature changes slowly and the correlation is temporal, so the next temperature reading depends on the previous one. The delta values, i.e., the differences between consecutive temperature readings, are therefore dependent. They then apply a statistical analysis to the delta values, which yields a Gaussian probability distribution for the deltas. From the variance of this distribution, they find that the most probable delta values are -1, 0, and 1. Taking advantage of this result, they propose a statistical encoding of the possible delta values, in which the most probable values get the fewest bits, consequently reducing the number of bits required to encode a delta.

The works in [22], [23], and [24] allow the extraction of IoT context data from IoT devices and store this data in customized servers and providers. The providers serve the collected IoT data, along with specialized services customized to the applications' needs. One of these services is to compress the IoT data at the cloud level before it is delivered to the IoT-enabled applications.

III. COMPARISON

The compression techniques are compared according to the following attributes:

• Compression type: whether the technique is lossy, lossless, or uses both types.

• The goal of the compression technique, i.e., the limitation it aims to address, such as energy consumption or data size.

• The techniques and algorithms the authors used to compress the data.

• The location of the compression: on the IoT device (node side) or on the server side.

• The application type the compression technique is best suited for.

• Implementation: whether the technique was simulated, emulated, or implemented in hardware.

Table 1 presents a brief comparison of the compression techniques according to these attributes. The table shows that most of the compression techniques use lossy compression, since retrieving all the data is not necessary and removing the redundant data does not affect application functionality. The table also shows that most of the techniques focus on reducing the energy required to transmit and process the data in order to extend network and device lifetime. Furthermore, most techniques concentrate on network availability parameters such as energy and bandwidth, while less work addresses information gain. Most of the techniques apply statistical approaches to compress the data, and some use machine learning techniques. Also, most of the compression techniques were implemented in hardware.

IV. CONCLUSION

Nowadays, most devices are connected to the internet, which causes a huge amount of data to be generated from these devices, and these devices have limited resources. To make IoT networks more efficient, the data needs to be compressed in order to reduce the energy required to process it, the storage required to keep it, and the energy required to transmit it.

In this paper, the authors summarized the common data compression techniques in IoT and made a comparative study among them in terms of the compression type, what the scheme enhances in the network, the techniques used within the compression method, the location of the compression (node side or server side), and the applications that deploy these techniques. As a result, most compression techniques are lossy and aim to reduce energy consumption.
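The frame-dropping idea can be sketched as follows; a mean absolute difference over toy 4-pixel frames stands in for SSIM, and the threshold and frame values are illustrative:

```python
def mean_abs_diff(a, b):
    """Stand-in dissimilarity measure for SSIM (simplified):
    lower value means more similar frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def keep_differentiable_frames(frames, threshold=2.0):
    """Store a frame only if it differs enough from the last kept one."""
    kept = [frames[0]]
    for frame in frames[1:]:
        if mean_abs_diff(frame, kept[-1]) > threshold:
            kept.append(frame)
    return kept

frames = [
    [10, 10, 10, 10],   # first frame, always kept
    [10, 11, 10, 10],   # near-duplicate, dropped
    [50, 52, 49, 51],   # scene change, kept
    [50, 52, 50, 51],   # near-duplicate, dropped
]
print(len(keep_differentiable_frames(frames)))  # 2
```

A real implementation would compare full frames with SSIM (which accounts for luminance, contrast, and structure) rather than raw pixel differences, but the store-only-differentiable-frames control flow is the same.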
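The coding scheme described above can be sketched as follows; the codeword table is an illustrative prefix code (not the exact one from [9]), with the most probable deltas -1, 0, and 1 getting the shortest codewords and an 8-bit escape for rare larger deltas:

```python
# Short codewords for the most probable deltas; everything else is
# escaped and sent as an 8-bit two's-complement value.
CODES = {0: "0", 1: "10", -1: "110"}
ESCAPE = "111"

def encode(readings):
    bits, prev = [], 0
    for r in readings:
        delta = r - prev
        if delta in CODES:
            bits.append(CODES[delta])
        else:
            bits.append(ESCAPE + format(delta & 0xFF, "08b"))
        prev = r
    return "".join(bits)

temps = [20, 20, 21, 21, 20, 25, 25, 24]
encoded = encode(temps)
print(len(encoded))  # 33 bits, versus 64 bits at 8 bits per reading
```

Because consecutive temperature readings differ mostly by -1, 0, or 1, the common case costs one to three bits per reading instead of a full sample width.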
…in 'Towards an analysis of security issues, challenges, and open problems in the internet of things' (IEEE, 2015), pp. 21–28
[4] Tripathi, P.: 'Vision, Opportunities and Challenges in Internet of Things (IoT)', 2017
[5] Lee, I., and Lee, K.: 'The Internet of Things (IoT): Applications, investments, and challenges for enterprises', Business Horizons, 2015, 58, (4), pp. 431–440
[6] Catarinucci, L., De Donno, D., Mainetti, L., Palano, L., Patrono, L., Stefanizzi, M.L., and Tarricone, L.: 'An IoT-aware architecture for smart healthcare systems', IEEE Internet of Things Journal, 2015, 2, (6), pp. 515–526
[7] Bellavista, P., Cardone, G., Corradi, A., and Foschini, L.: 'Convergence of MANET and WSN in IoT urban scenarios', IEEE Sensors Journal, 2013, 13, (10), pp. 3558–3567
[8] Kyriazis, D., Varvarigou, T., White, D., Rossi, A., and Cooper, J.: 'Sustainable smart city IoT applications: Heat and electricity management & Eco-conscious cruise control for public transportation' (IEEE, 2013), pp. 1–5
[9] Stojkoska, B.R., and Nikolovski, Z.: 'Data compression for energy efficient IoT solutions' (2017), pp. 1–4
[10] Nelson, M., and Gailly, J.-L.: 'The Data Compression Book' (M&T Books, New York, 1996)
[11] Bose, T., Bandyopadhyay, S., Kumar, S., Bhattacharyya, A., and Pal, A.: 'Signal Characteristics on Sensor Data Compression in IoT: An Investigation' (IEEE, 2016), pp. 1–6
[12] Pielli, C., Biason, A., Zanella, A., and Zorzi, M.: 'Joint optimization of energy efficiency and data compression in TDMA-based medium access control for the IoT' (IEEE, 2016), pp. 1–6
[13] Jung, P.: 'Time Division Multiple Access (TDMA)', Wiley Encyclopedia of Telecommunications, 2003
[14] Deepu, C.J., Heng, C.-H., and Lian, Y.: 'A hybrid data compression scheme for power reduction in wireless sensors for IoT', IEEE Transactions on Biomedical Circuits and Systems, 2017, 11, (2), pp. 245–254
[15] Ukil, A., Bandyopadhyay, S., Sinha, A., and Pal, A.: 'Adaptive Sensor Data Compression in IoT systems: Sensor data analytics based approach' (IEEE, 2015), pp. 5515–5519
[16] Dang, T., Bulusu, N., and Feng, W.-c.: 'RIDA: A robust information-driven data compression architecture for irregular wireless sensor networks' (Springer, 2007), pp. 133–149
[17] Gandhi, S., Nath, S., Suri, S., and Liu, J.: 'GAMPS: Compressing multi sensor data by grouping and amplitude scaling' (ACM, 2009), pp. 771–784
[18] Ukil, A., Bandyopadhyay, S., and Pal, A.: 'IoT data compression: Sensor-agnostic approach' (IEEE, 2015), pp. 303–312
[19] Park, J., Park, H., and Choi, Y.-J.: 'Data compression and prediction using machine learning for industrial IoT' (IEEE, 2018), pp. 818–820
[20] Hsu, C.-C., Fang, Y.-T., and Yu, F.: 'Content-Sensitive Data Compression for IoT Streaming Services' (IEEE, 2017), pp. 147–150
[21] MICAz Datasheet: Crossbow Technology Inc., San Jose, California, 2006
[22] Almajali, S., and Abou-Tair, D.: 'Cloud based intelligent extensible shared context services', in Proceedings of the 2017 Second International Conference on Fog and Mobile Edge Computing (FMEC), pp. 133–138
[23] Almajali, S., Bany Salameh, H., Ayyash, M., and Elgala, H.: 'A framework for efficient and secured mobility of IoT devices in mobile edge computing', in Proceedings of the 2018 Third International Conference on Fog and Mobile Edge Computing (FMEC), pp. 58–62
[24] Almajali, S., Abou-Tair, D., Bany Salameh, H., Ayyash, M., and Elgala, H.: 'A distributed multi-layer MEC-cloud architecture for processing large scale IoT-based multimedia applications', Multimedia Tools and Applications, 2019, 78, (17), pp. 24617–24638
TABLE 1 (fragment): comparison of the compression techniques.

Park et al. [19]: divide and conquer with coefficient averaging, Euclidean distance, cosine similarity, and re-learning.
Chun-Chi et al. [20]: lossy and lossless; structural similarity index measure; server side; video streaming applications; hardware implementation.
Stojkoska et al. [9]: statistical approach; node side; temperature; simulation.
Using Part of Speech Tagging for Improving Word2vec Model

Dima Suleiman
Computer Science Department, King Hussein Faculty of Computing Sciences, Princess Sumaya University for Technology
Teacher at the University of Jordan
Amman, Jordan
d.suleiman@psut.edu.jo

Arafat A. Awajan
Computer Science Department, King Hussein Faculty of Computing Sciences, Princess Sumaya University for Technology
Amman, Jordan
awajan@psut.edu.jo
number of context words. On the other hand, in the case of CBOW, the embeddings of the context input words are concatenated in their order of occurrence, and the result of the concatenation is passed to the output predictor [17].

In 2016, the Skip-Gram model was extended to take into account the distance between the context words and the input word [18]. Komninos and his colleagues extended Skip-Gram by using the dependency graph to determine the distance of the relation between the context words and the input word [18]. In addition to the dependency relations, the adjacent context words are also taken into account. The syntax of word embeddings is crucial and has received increasing attention recently [19]–[21].

III. ARABIC LANGUAGE FEATURES

Arabic is an official language in several regions of the world [22]. Even so, the number of research papers concerned with the Arabic language is limited due to the shortage of Arabic resources [23]. To improve the quality of Arabic research results, Arabic-specific features such as part-of-speech tagging (POST) and dependency parsing must be taken into account. The morphological nature of Arabic makes processing it harder and more labor-intensive [22]; as a result, other Arabic NLP tasks such as normalization and segmentation become harder and must be considered.

Normalization

There are two types of vowels in the Arabic language: short and long. Diacritical marks are used to represent short vowels, while letters are used to represent long vowels. One of the challenges of Arabic is that several marks, such as hamza "ء", a dot, or madda "~", can appear on the same letter. For example, "ا" may be written as "أ", "آ", or "إ". Normalization is used to treat all these shapes as the same. For example, the words "ايام" and "أيام", both of which mean (days), must be considered the same, since they have the same meaning.

Segmentation

Another important NLP task is segmentation. Segmentation faces several problems, such as keeping letters that must be removed and segmenting words that must not be segmented. For example, "ال" (The) must usually be removed, so that the words "بلد" (country) and "البلد" (the country) are considered the same. However, in some words "ال" is part of the name and must not be removed, such as "ألغاز" (mysteries). Therefore, Farasa is used, since the Farasa segmenter is highly accurate at segmenting words and removing the parts that do not belong to them, which can be determined from the context [3].

IV. PROPOSED MODEL

The proposed method is an extension of both approaches of the word2vec model proposed by Mikolov in 2013 [2]. The main purpose of the proposed approach is to capture more precise syntactic and semantic features during training. In this paper, we propose using Part-of-Speech Tagging (POST) to learn high-quality distributed vector representations. In both approaches of word2vec, the vocabulary size is denoted by V and the dimension size by N. The vocabulary size is the number of most frequent words in the corpus, while the dimension size is the size of the vector used to represent each word, which equals the number of neurons in the hidden layer. Every input word is a vector of size V in one-hot representation: all entries are zero except one, which is set to one. There are two weighting matrices: W, between the input layer and the hidden layer, and W', between the hidden layer and the output layer. The sizes of W and W' are V x N and N x V respectively, so each row of W and each column of W' is an N-dimensional vector. The transpose of row i of W, $v_{w_i}^T$, represents the vector of the input word $w_i$. For example, if the position of the input (context) word in the vocabulary is k, then the one-hot representation x of the word has $x_k = 1$ and $x_{k'} = 0$ for all $k' \neq k$.

In the original word2vec model, there is a vocabulary list that contains the most frequent words in the corpus. In the proposed model, however, each vocabulary entry contains not only the word, as in the original model, but also its part-of-speech tag. Thus, the same word may have more than one entry in the list, one per part-of-speech tag. For example, consider the word "كتب": if it is a noun, it means (books), but if it is a verb, it means (wrote). In this case, the proposed model has two entries for "كتب", one tagged "Noun" and one tagged "Verb", instead of one entry.

The proposed model extends both the CBOW and Skip-Gram approaches of word2vec. More details are covered in the following subsections.

A) Continuous Bag-of-Words Model

To simplify the explanation, we restrict CBOW to a single input (context) word and a single predicted output word. In this case, the equation applied between the input and the hidden layer is as follows [24]:

$h = W^T x = W^T_{(k,\cdot)} = v^T_{w_I}$ ……. (3)

After that, the score $u_j$ for each word at the output layer is computed using Eq. (4) [24]:

$u_j = v'^T_{w_j} h$ ……. (4)
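A minimal sketch of this alef normalization (mapping the hamza and madda variants back to the bare alef):

```python
def normalize_alef(text):
    """Map the alef variants to the bare alef so that spelling
    variants of the same word compare equal."""
    for variant in ("أ", "إ", "آ"):
        text = text.replace(variant, "ا")
    return text

# The two spellings of (days) become identical after normalization.
print(normalize_alef("أيام") == normalize_alef("ايام"))  # True
```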
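Eqs. (3) and (4), followed by a softmax over the scores, can be sketched with toy weights (V = 4, N = 3, and all matrix values are illustrative, not trained parameters):

```python
import math

# Toy dimensions: vocabulary size V = 4, hidden size N = 3.
V, N = 4, 3
W = [  # input -> hidden weights; row i is the input vector v_{w_i}
    [0.1, 0.2, 0.0],
    [0.0, 0.5, 0.1],
    [0.3, 0.1, 0.4],
    [0.2, 0.0, 0.2],
]
W_prime = [  # hidden -> output weights, N x V; column j is v'_{w_j}
    [0.1, 0.0, 0.2, 0.1],
    [0.0, 0.3, 0.1, 0.2],
    [0.4, 0.1, 0.0, 0.3],
]

def forward(k):
    """Eq. (3): a one-hot input at position k just selects row k of W;
    Eq. (4): score u_j = v'_{w_j}^T h; then softmax gives y_j."""
    h = W[k]
    u = [sum(h[n] * W_prime[n][j] for n in range(N)) for j in range(V)]
    z = [math.exp(s) for s in u]
    total = sum(z)
    return [s / total for s in z]

probs = forward(2)
print(abs(sum(probs) - 1.0) < 1e-9)  # softmax output is a distribution
```

Note that multiplying $W^T$ by a one-hot vector never needs an actual matrix product: it reduces to a row lookup, which is why each row of W can be read directly as a word's embedding.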
where $v'_{w_j}$ is the jth column of the second weighting matrix W'. It can be clearly seen that the activation function is linear. Moreover, the posterior distribution of the words must be determined; this distribution is multinomial and can be obtained using a log-linear classification model such as softmax. In this case, the output $y_j$ of unit j in the output layer can be computed as shown in Eq. (5) [24]:

$y_j = p(w_j \mid w_I) = \frac{\exp(u_j)}{\sum_{j'=1}^{V} \exp(u_{j'})}$ ……. (5)

By substituting Eq. (3) and Eq. (4) in Eq. (5), we get Eq. (6) [24]:

$y_j = p(w_j \mid w_I) = \frac{\exp(v'^T_{w_j} v_{w_I})}{\sum_{j'=1}^{V} \exp(v'^T_{w_{j'}} v_{w_I})}$ ……. (6)

In this equation, both $v_w$ and $v'_w$ represent vector representations of the word w: $v_w$ is the input vector, a certain row of the weighting matrix W between the input and hidden layers, while $v'_w$ is the output vector, a certain column of the weighting matrix W' between the hidden and output layers.

On the other hand, when the input (context) consists of more than one word, the output of the hidden layer equals the average of the input vectors of the context words multiplied by the input-hidden weighting matrix W, computed using Eq. (7) [24]:

$h = \frac{1}{C} W^T (x_1 + x_2 + \cdots + x_C)$ ……. (7)

where $x_1, x_2, \ldots, x_C$ are the one-hot vectors of the first, second, …, Cth context words and C is the number of context words. After substituting $x_1, x_2, \ldots, x_C$ with their input vector representations $v_{w_1}, v_{w_2}, \ldots, v_{w_C}$ in Eq. (7), we get Eq. (8) [24]:

$h = \frac{1}{C} (v_{w_1} + v_{w_2} + \cdots + v_{w_C})$ ……. (8)

In the proposed model, the score of a word and its part-of-speech tag at the output layer is computed using Eq. (10):

$u_{j,post_j} = v'^T_{w_j,post_j} h$ ……. (10)

In this case, Eq. (5) is modified to consider the part-of-speech tag when computing the output $y_{j,post_j}$ of unit j in the output layer, as shown in Eq. (11):

$y_{j,post_j} = p(w_{j,post_j} \mid w_{I,post_I}) = \frac{\exp(u_{j,post_j})}{\sum_{j'=1}^{V} \exp(u_{j',post_{j'}})}$ ……. (11)

By substituting Eq. (9) and Eq. (10) in Eq. (11), we get Eq. (12):

$y_{j,post_j} = p(w_{j,post_j} \mid w_{I,post_I}) = \frac{\exp(v'^T_{w_j,post_j} v_{w_I,post_I})}{\sum_{j'=1}^{V} \exp(v'^T_{w_{j'},post_{j'}} v_{w_I,post_I})}$ ……. (12)

Finally, the log-linear probability in Eq. (1) must be modified to Eq. (13) to consider the part-of-speech tags post of the words w:

$\frac{1}{T} \sum_{t=1}^{T} \log p\big((w_t, post_t) \mid (w_{t-c}, post_{t-c}), \ldots, (w_{t-1}, post_{t-1}), (w_{t+1}, post_{t+1}), \ldots, (w_{t+c}, post_{t+c})\big)$ ……. (13)

B) Skip-Gram Model

In the case of Skip-Gram, as explained previously, there is only one input word and several context (output) words. In this case, the output of the hidden layer is calculated as shown in Eq. (14) [24]:

$h = W^T x = W^T_{(k,\cdot)} = v^T_{w_I}$ ……. (14)
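The model's core change, keying the vocabulary by (word, POS) pairs so that each tagged sense gets its own embedding, can be sketched as follows (the tags and the tiny tagged corpus are illustrative):

```python
from collections import Counter

# POS-tagged corpus: the same surface form "كتب" appears as both a
# noun (books) and a verb (wrote), so it yields two vocabulary entries.
tagged_corpus = [
    ("كتب", "NOUN"), ("جديدة", "ADJ"),
    ("كتب", "VERB"), ("الطالب", "NOUN"),
    ("كتب", "NOUN"),
]

# Vocabulary entries are (word, POS) pairs instead of bare words,
# ordered by frequency as in the standard word2vec vocabulary.
counts = Counter(tagged_corpus)
vocab = {entry: idx for idx, (entry, _) in enumerate(counts.most_common())}

print(("كتب", "NOUN") in vocab and ("كتب", "VERB") in vocab)  # True
```

Training then proceeds exactly as in ordinary CBOW/Skip-Gram, except that each one-hot index refers to a (word, POS) entry, so the two senses of "كتب" learn separate input and output vectors.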
the input word. Note that the same hidden-output weighting matrix W' is used for all the context output words, thus:

$u_{c,j} = u_j = v'^T_{w_j} h$ ……. (16)

where $v'_{w_j}$ is the jth column of the hidden-output matrix W'. After substituting Eq. (14) and Eq. (16) in Eq. (15), we get Eq. (17):

$y_{c,j} = p(w_{c,j} = w_{O,c} \mid w_I) = \frac{\exp(u_{c,j})}{\sum_{j'=1}^{V} \exp(u_{j'})}$ ……. (17)

The output of the hidden layer in the proposed model is computed using Eq. (18), where $v^T_{w_I,post_I}$ is the transpose of the input vector of the word $w_I$ and its part-of-speech tag $post_I$:

$h = W^T x = W^T_{(k,\cdot)} = v^T_{w_I,post_I}$ ……. (18)

On the other hand, the output score of the word and its part-of-speech tag at the output layer, $u_{c,post_c,j,post_j}$, for the jth word in the c-panel is computed using Eq. (19), where $v'^T_{w_j,post_j}$ is the transpose of the output vector of the word $w_j$ and its part-of-speech tag $post_j$:

$u_{c,post_c,j,post_j} = u_{j,post_j} = v'^T_{w_j,post_j} h$ ……. (19)

In Skip-Gram, as mentioned before, there are C multinomial distributions in the output instead of one. $y_{c,post_c,j,post_j}$ is the output of the jth unit in the c-panel for a certain word and its part-of-speech tag, which can be computed using the softmax shown in Eq. (20) for each output in the context:

$y_{c,post_c,j,post_j} = p(w_{c,post_c,j,post_j} = w_{O,c,post_{O,c}} \mid w_I, post_I) = \frac{\exp(u_{c,post_c,j,post_j})}{\sum_{j'=1}^{V} \exp(u_{j',post_{j'}})}$ ……. (20)

After substituting Eq. (18) and Eq. (19) in Eq. (20), we get Eq. (21):

$y_{c,post_c,j,post_j} = p(w_{c,post_c,j,post_j} = w_{O,c,post_{O,c}} \mid w_I, post_I) = \frac{\exp(v'^T_{w_j,post_j} v_{w_I,post_I})}{\sum_{j'=1}^{V} \exp(v'^T_{w_{j'},post_{j'}} v_{w_I,post_I})}$ ……. (21)

Finally, the log-linear probability in Eq. (2) must be modified to Eq. (22) to consider the part-of-speech tags post of the words w:

$\frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\, j \ne 0} \log p\big((w_{t+j}, post_{t+j}) \mid (w_t, post_t)\big)$ ……(22)

V. EXPERIMENTAL RESULTS

1) Datasets and Pre-processing

The experiments are conducted on the OSAC datasets [25]. OSAC is a set of benchmark datasets that include documents from several domains such as sports, health, economics, and others, with a total of 22,429 documents. The quality of the vectors generated by word embedding is highly affected by the quality of the corpus, so several pre-processing stages are applied. The first stage includes removing non-Arabic words, diacritical marks, and punctuation. The second stage replaces all numbers with a NUM keyword. In the third stage, the Farasa stemmer is used to segment the words and retrieve their stems [3]; this stage is crucial for Arabic. For example, the sentence "قام أحمد بتصحيح الامتحانات", which means (Ahmed corrected the exams), becomes "قام أحمد تصحيح امتحان" after using Farasa. In this example, the stem "امتحان" (exam) of the word "امتحانات" (exams) is retrieved. Word embedding using stems is more useful than using the words themselves [26]. For example, the words "امتحان" /emtehan/, "امتحانات" /emtehanat/, "امتحاناتهم" /emtehanatehem/ and "امتحانه" /emtehanoh/, which translate to (exam), (exams), (their exams), and (his exam) respectively, must be considered one word, "امتحان" (exam). Finally, the last stage of pre-processing is normalization.

2) Experimental settings

The proposed model was implemented using Python v3.5.3 and TensorFlow v1.12.0. The experiments were performed on a standalone computer with a 3.4 GHz Intel Core i7 quad-core processor and 24 GB of RAM. The hyperparameters used in the experiments are a vocabulary size of 50,000, a dimension size of 100, and a context window of 9. The experiments are conducted on the extensions of both word2vec approaches, CBOW and Skip-Gram.

3) Results and Discussions

After training the proposed model, we selected two words, "ذهب" and "جمع". If the word "ذهب" is a noun it means (gold), and if it is a verb it means (went). The proposed model enables the user to query and retrieve the vector representation of a word for a certain part-of-speech tag. Cosine similarity, shown in Eq. (23), is used to compute the similarity of vectors. The vector representation of the word "ذهب" with the (Noun) part-of-speech tag is completely different from the vector representation of "ذهب" with the (Verb) tag. Thus, if we use cosine similarity to retrieve the most similar words for "ذهب" under different part-of-speech tags, we find that the similar words differ. For example, the word "ذهب" with the (Noun)
part of speech tagging means (Gold), thus we can notice that ضم Combined مصف Parking
the most similar words are “( ”نقدMoney), “( ”فاتورةBill), “”ملجم
(Mine) and others. On the other hand, the most similar words
for the word “ ”ذھبwith (Verb) part of speech tagging which The word “ ’جمعif it is (Noun) it means (Group of People)
means (Went) are “( ”رجعWent Back), “( ”عادIs Back), “”سافر or (Addition Operation) while if it is (Verb) it means (Add).
(Travelled) and others. Table 1 and Table 2 show the most We can notice that, the most similar words in case of Verb
similar words for the words “ ”ذھبand “ ”جمعfor Verb and part of speech tagging are different than the most similar
Noun part of speech tagging for CBOW and Skip-Gram words the case of Noun part of speech tagging.
models respectively. VI. CONCLUSION
word1. word2
Cosine Similarity(word1, word2) =
|word1| |word2| In this paper, extension of both approaches of word2vec
model including CBOW and Skip-Gram is proposed. The
……(23)
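Eq. (23) can be implemented directly; the toy POS-keyed vectors below are illustrative placeholders, not trained embeddings:

```python
import math

def cosine_similarity(v1, v2):
    """Eq. (23): dot(v1, v2) / (|v1| * |v2|)."""
    dot = sum(a * b for a, b in zip(v1, v2))
    return dot / (math.sqrt(sum(a * a for a in v1)) *
                  math.sqrt(sum(b * b for b in v2)))

# The same surface word under two POS tags has two different vectors.
vectors = {
    ("ذهب", "NOUN"): [0.9, 0.1, 0.0],
    ("ذهب", "VERB"): [0.0, 0.2, 0.9],
    ("نقد", "NOUN"): [0.8, 0.2, 0.1],
}
sim_noun = cosine_similarity(vectors[("ذهب", "NOUN")], vectors[("نقد", "NOUN")])
sim_verb = cosine_similarity(vectors[("ذهب", "VERB")], vectors[("نقد", "NOUN")])
print(sim_noun > sim_verb)  # the (Noun) sense sits closer to (Money)
```

Ranking the whole vocabulary by this score against a chosen (word, POS) query is exactly how the most-similar-word lists in Tables 1 and 2 are produced.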
main idea of the proposed approach is to consider part of
speech tagging when training the word embedding model.
TABLE 1. THE MOST SIMILAR WORDS FOR THE WORDS "ذهب" AND "جمع" IN CBOW MODEL FOR VERB AND NOUN PART OF SPEECH TAGGING.

| Word | Verb (Arabic) | Verb (English) | Noun (Arabic) | Noun (English) |
| ذهب | توجه | Go To | اشترى | Bought |
| | رجع | Went Back | منجم | Mine |
| | التفت | Turned | درهم | Dirham |
| | ارجع | Come Back | شقة | Flat |
| | سافر | Traveled | حلق | Earring |
| | عمد | Went | فاتورة | Bill |
| | صعد | Ascended | معدن | Metal |
| | اصطحب | Take | حلي | Jewels |
| | نظر | Looked | بضاعة | Goods |
| | انصرف | Run Along | بائع | Seller |
| جمع | ربط | Link | تفريق | Differentiation |
| | قارن | Compared | ربط | Link |
| | درس | Studied | فرق | Differentiate |
| | فرق | Differentiate | خلط | Mix |
| | ضم | Sign | ضم | Join |
| | فصل | Separated | لؤلؤ | Pearl |
| | نظم | Organized | مئوي | Percentage |
| | احصى | Counted | سائر | Other |
| | مجموعة | Group | توزيع | Distribution |
| | قطع | Cut | قوي | Strong |

TABLE 2. THE MOST SIMILAR WORDS FOR THE WORDS "ذهب" AND "جمع" IN SKIP-GRAM MODEL FOR VERB AND NOUN PART OF SPEECH TAGGING.

| Word | Verb (Arabic) | Verb (English) | Noun (Arabic) | Noun (English) |
| ذهب | رجع | Went Back | نقد | Money |
| | عاد | Is Back | بيع | Sell |
| | صعد | Ascended | ثمن | Price |
| | انتقل | Moved | رهان | Bet |
| | خرج | Came Out | سبيكة | Alloy |
| | اسرع | Become Faster | معدن | Metal |
| | توجه | Go To | دولار | Dollars |
| | وصل | Arrived | عملة | Currency |
| | ادى | Led | بوليصة | Policy |
| | صار | Became | رهان | Bet |
| جمع | جرى | Ran | فارق | Difference |
| | نظم | Organized | اقام | Stayed |
| | فرق | Differentiate | الف | Make kind |
| | فرق | Groups | فرد | Individual |
| | مقارنة | Comparison | توزيع | Distribution |
| | فرعي | Sub | فندق | Hotel |
| | بين | Between | هدف | Target |
| | شمل | Include | مزدلفة | Muzdalifah |
| | حصل | Retrieved | منافسة | Competition |

VI. CONCLUSION

In this paper, an extension of both approaches of the word2vec model, CBOW and Skip-Gram, is proposed. The main idea of the proposed approach is to consider part of speech tagging when training the word embedding model. Thus, the same word with different part of speech tagging must be considered different. Therefore, if we have two words that have the same surface form but different part of speech tagging, the result is two different words with different meanings and different vector representations. The proposed model can be applied to several languages such as English and Arabic. However, in the Arabic language the process is harder because of its morphological nature. In this paper, the experiments are conducted on the Arabic language using the OSAC datasets. Moreover, the Farasa toolkit is used for segmentation, stemming and determining the part of speech tagging of the words.

REFERENCES

[1] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, "Distributed Representations of Words and Phrases and their Compositionality," in Advances in Neural Information Processing Systems, 2013, pp. 3111–3119.
[2] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient Estimation of Word Representations in Vector Space," arXiv:1301.3781 [cs], Jan. 2013.
[3] A. Abdelali, K. Darwish, N. Durrani, and H. Mubarak, "Farasa: A Fast and Furious Segmenter for Arabic," in Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations, San Diego, California, 2016, pp. 11–16.
[4] J. Pennington, R. Socher, and C. Manning, "GloVe: Global Vectors for Word Representation," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 2014, pp. 1532–1543.
[5] R. Socher, "Recursive Deep Learning for Natural Language Processing and Computer Vision," Ph.D. thesis, Stanford University, 2014.
[6] A. El Mahdaouy, E. Gaussier, and S. Ouatik El Alaoui, "Arabic Text Classification Based on Word and Document Embeddings," International Conference on Advanced Intelligent Systems and Informatics, 2016.
[7] Y. Li, Q. Pan, T. Yang, S. Wang, J. Tang, and E. Cambria, "Learning Word Representations for Sentiment Analysis," Cognitive Computation, vol. 9, no. 6, pp. 843–851, Dec. 2017.
[8] D. Suleiman, A. Awajan, and N. Al-Madi, "Deep Learning Based Technique for Plagiarism Detection in Arabic Texts," in 2017 International Conference on New Trends in Computing Sciences (ICTCS), Amman, 2017, pp. 216–222.
[9] D. Suleiman and A. Awajan, "Comparative Study of Word Embeddings Models and Their Usage in Arabic Language Applications," International Arab Conference on Information Technology (ACIT), Werdanye, Lebanon, 2018, pp. 1–7.
[10] A. El Mahdaouy, S. O. El Alaoui, and E. Gaussier, "Improving Arabic Information Retrieval Using Word Embedding Similarities," International Journal of Speech Technology, vol. 21, no. 1, pp. 121–136, Mar. 2018.
[11] P. Lauren, G. Qu, J. Yang, P. Watta, G.-B. Huang, and A. Lendasse, "Generating Word Embeddings from an Extreme Learning Machine for Sentiment Analysis and Sequence Labeling Tasks," Cognitive Computation, vol. 10, no. 4, pp. 625–638, Aug. 2018.
[12] D. Suleiman and A. Awajan, "Bag-of-Concept Based Keyword Extraction from Arabic Documents," in 2017 8th International Conference on Information Technology (ICIT), Amman, Jordan, 2017, pp. 863–869.
[13] D. Suleiman, A. Awajan, and W. Al Etaiwi, "The Use of Hidden Markov Model in Natural Arabic Language Processing: A Survey," Procedia Computer Science, vol. 113, pp. 240–247, 2017.
[14] T. Mikolov, W. Yih, and G. Zweig, "Linguistic Regularities in Continuous Space Word Representations," in HLT-NAACL, 2013.
[…] "…Embeddings for Deep Compositional Models of Meaning," in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, 2015, pp. 1531–1542.
[21] K. Hashimoto, P. Stenetorp, M. Miwa, and Y. Tsuruoka, "Jointly Learning Word Representations and Composition Functions Using Predicate-Argument Structures," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 2014, pp. 1544–1555.
[22] A. Farghaly and K. Shaalan, "Arabic Natural Language Processing: Challenges and Solutions," ACM Transactions on Asian Language Information Processing, vol. 8, no. 4, pp. 1–22, Dec. 2009.
[23] A. Awajan, "Arabic Text Preprocessing for the Natural Language Processing Applications," Arab Gulf Journal of Scientific Research, vol. 25, no. 4, pp. 179–189, 2007.
[24] X. Rong, "word2vec Parameter Learning Explained," arXiv:1411.2738 [cs], Nov. 2014.
[25] M. K. Saad and W. Ashour, "OSAC: Open Source Arabic Corpora," 6th ArchEng Int. Symposiums, EEECS, vol. 10, 2010.
[26] I. El Bazi and N. Laachfoubi, "Is Stemming Beneficial for Learning Better Arabic Word Representations?," in Lecture Notes in Real-Time Intelligent Systems, vol. 756, J. Mizera-Pietraszko, P. Pichappan, and L. Mohamed, Eds. Cham: Springer International Publishing, 2019, pp. 508–517.
Applying Ontology in Computational Creativity
Approach for Generating a Story
Lana Issa
Department of Computer Science
Princess Sumaya University for Technology
Amman, Jordan
lanaissa238@hotmail.com

Shaidah Jusoh
Department of Computer Graphics and Animation
Princess Sumaya University for Technology
Amman, Jordan
s.ibrahim@psut.edu.jo
978-1-7281-2882-5/19/$31.00 ©2019 IEEE

Abstract. Computational creativity is a young multidisciplinary field with a promising future for providing effective options for tackling and handling many automated systems. It is very useful, for example, in creating a narrative story for various purposes, which is essentially a human art. In this research, we investigate methods that can be applied as a computational approach for generating structured narratives automatically, to suit educational purposes. In this paper, we present a literature review of the work done in this field so far, and we propose a framework designed to generate educational stories using a computational creativity approach. The major contribution of this paper is a proposed computational creativity approach consisting of hybrid Artificial Intelligence methods to generate educational stories.

Keywords- Computational Creativity; Natural Language Generation; Ontology; Automatic Story Generation.

I. INTRODUCTION

Modern computer science techniques have provided multiple options for building solutions that make life easier for humans. Various intelligent methods which have the ability to simulate human practices have been developed to provide problem-solving options. Computational Creativity is a new research area that proposes several ideas related to simulating humans' creative behavior in many areas. Narrating is one of the creative fields in which humans use their creative minds to produce creative content that satisfies readers. With the computational creativity approach, an automated narrative generation system may be able to create content that looks like human-written content. Automatic story generation is a very interesting example of automating the process of generating narratives.

Integrating a computational creativity approach, which helps with analyzing and understanding human writings such as jokes, stories, and so on [21][23], into natural language processing (NLP) may produce an effective system for generating well-written text automatically. Automatic story generation is considered a young sub-field of Artificial Intelligence (AI) and an application of Computational Creativity. AI methods are extending to cover more and more human intelligence applications, while Computational Creativity is about employing Artificial Intelligence methods to simulate human creativity in several aspects of life.

In this paper, we present related work in this field; then, we propose a story generation approach which contains a hybrid of AI methods and aims at embedding educational material within the entertaining nature of storytelling. The proposed automatic story generating approach contains multiple steps: first of all, constructing stories creatively using an ontology as a knowledge base to supply the story generator with the needed content to generate informative stories. The ontology stores all the needed scientific concepts, relations, and attributes, to get detailed knowledge about any concept that will be used in a generated story. Then, a creativity-based algorithm will design a suitable story plot according to the given information. Finally, a language generation method will be used to generate full sentences to form the story in its final shape.

In this work, we design an alternative approach to traditional e-learning methods, which deliver the educational content in its already prepared shape. Automating the process of generating the educational material in the shape of a story is entertaining and helpful for students at a young age. This work could also be implemented as an extraordinary feature for intelligent systems that provide valuable content.

This paper is organized as follows: section II reviews the work done in the field, section III describes the proposed method, section IV contains the conclusions and the future work of this paper, and section V includes the references.

II. LITERATURE REVIEW

A. Computational Creativity

Humans are gifted with creativity, which gives them the ability to come up with novel ideas, new solutions, or any type of novel creation that helps them achieve several goals. Several issues in life do not need a formal method to find a solution; they need a different way of idea generation that includes unexpectedness and novelty. In this area, we find various creative methods created by creative humans to suit such a nature of problems.

In the world of computer science, many problems were solved by creating methods that simulate humans' cognitive behavior; this has supported humans to reach solutions faster and to deal with bigger volumes of data. Recent endeavors have taken a bigger step by investigating the ability to simulate human creative behavior, to produce better solutions or to solve problems of an informal nature. With the existing artificial intelligence (AI) techniques, it is possible to employ them to build a method that produces novel yet familiar ideas. Although it has been difficult to find a precise definition of creativity in order to guide the construction of computational creativity, a certain pattern could be analyzed and some rules could be extracted to guide this process, which aims at building problem-solving methods that are inspired by human creativity to achieve novelty and familiarity [2].

Human creativity has been behind implementing and inventing many solutions that were found to be useful in many fields. Here, we review some examples of using computational creativity in various fields.

In the field of decision making, computational creativity techniques were proved to be very effective. It utilizes available
information in a creative way that supports each decision to be made. The Deep Green concept [1] employed computational creativity as an innovative approach to deploy simulation to support military operations while they are being conducted. The authors developed software agents that process information on the military operation to make military operations planning easier; and, with a space graph of possible future states, along with an assessment of the possibility of approaching future states, they designed a dynamic approach that uses information acquired at the moment to make decisions.

In [2], the authors review many decision-making problems for which computational creativity was found feasible, because it helps with assessing situations, exploring possible actions, and improving the planning process. The building of such creative solutions has developed to reach a state where it affects humans in their own decision-making process, as in chess [15]. The program was enhanced until it reached a stage where chess players were learning from the program how to search and evaluate each movement in the game: "humans play chess like machines, and machines play chess the way humans used to play" [15].

Computational creativity thus supports planning and decision making, activities that are usually done by leaders who use their creativity to come up with the most suitable plan or decision. One of the biggest examples that could be listed here is making decisions in military training [2]. There have been many effective solutions developed for military training using artificial intelligence, virtual reality, game trainers and many other new trends in technology. In a game trainer, for example, computational creativity could be employed by building complex characters that might behave like real-life soldiers, called "intelligent agents", which are programmed with a human-like background such as emotions or education [3][4].

Another field that relies on coming up with creative strategies is marketing. It requires a certain understanding of the advertised product and the target audience in order to build innovative marketing strategies. Computational creativity has been applied in this field, where it was used to automate the creative work in advertisement [6]. In that work, the creative system was programmed to produce a list of advertising messages that contain novel ideas yet familiar expressions.

In the field of generating narratives, many studies have proposed methods to generate written content such as stories, jokes, metaphors, and so on. This part is discussed in the following section, which discusses generating natural language in detail.

B. Automatic story generation

Writing is a form of creative art produced by humans, as are cooking, music, and painting. With the existence of computational creativity, the probability of automating these creative products of humans becomes higher. Computational creativity techniques are very helpful in the world of generating writings in the form of stories, by providing useful methods connected to natural language processing so the results would be valuable and novel.

Producing a pleasing narrative requires a lot of creativity and intelligence. Therefore, an intelligent system is required to automatically process, build, and produce creative entertaining content. Stories, in particular, have multiple elements, such as characters, setting, and plot, that should be chosen carefully to build an attractive story.

The idea of automatic story generation is to generate stories automatically using intelligent computer programs to finally produce content that looks like human-produced stories. This process requires building basic knowledge for the creative program to learn how to combine words to form a story which has all the expected story elements.

Stories are one form of entertainment that many people look for, and generating creative content is the biggest challenge in entertainment. Many efforts around the world are put towards finding new ideas that participate in creating non-traditional learning methods, such as interesting e-learning systems, for example the Edraak website [58] created by the Queen Rania Foundation (QRF). Automatic story generation could add a new flavor to educational platforms or educational intelligent solutions, by simulating the comprehension, linguistic, and entertainment skills that writers have, in an automated method that is formed with respect to human creativity. Several people have designed frameworks that generate short stories; MEXICA [5], for example, is a computer model that produces short stories guided by content, linguistic, and cultural constraints.

The production of stories in MEXICA is driven by the chosen actions. After learning from several existing stories stored in its information repository, it analyzes how the normal action flow should be designed. Each event has a set of pre- and post-conditions; whenever an event is added to the story, automatic story compliance checking is done to check whether further events need to be added to satisfy the set of defined rules. The main idea in MEXICA is improvisation to produce creativity. The system was created with two agents which have partially different knowledge bases and collaborate in the story generation process.

Other than MEXICA, there are many famous story generation systems such as DEFACTO [34], Tale-Spin [35], OPIATE [36], KIIDS [32], Minstrel [37], and MAKEBELIEVE [38]. In general, the approaches to generating stories could be divided into two: generating story structures and generating a full story. The first approach is about generating a complex structure of elements depending on stored atomic elements and using a production grammar, while the other approach generates a full story from A to Z, usually using a planning or simulation approach to build the story. These two approaches are discussed in the next section.

Before the year 2000, DEFACTO [34] and Tale-Spin [35] were introduced. DEFACTO is a framework that uses logical formulas to generate a structure of a story. This framework implements a dynamic technique to produce a story with user engagement. On the other hand, Tale-Spin is another story generating framework, but it generates a full story and not just story structures.

Another example of previous work in the area of generating stories is the work proposed by Peinado and Gervas [32]. In their paper, they presented a system (KIIDS) that generates fabulas, a narratological term for the set of story events that form the story. Their system was built with Vladimir Propp as a narratological background, and it learns from existing stored fabulas using description logic. Their system
followed the same narrating structure but with changing the content. Their results were evaluated by comparison with randomly generated stories and existing stored stories.

MAKEBELIEVE is an interactive story generating agent that generates short stories after the user inputs the first line of the story. MAKEBELIEVE follows a hybrid approach, generating both story structures and full stories. It is based on a commonsense knowledge base which suits not only storytelling but many other goals as well.

The authors in [8] proposed a strategy for computational narration. Their methodology has three key features. Firstly, the story plot is made progressively by consulting an automatically constructed knowledge base. Secondly, the generator realizes the different parts of the generation pipeline stochastically, without broad manual coding. Thirdly, they create and store multiple versions of a story in a tree structure; story creation amounts to navigating the tree and choosing the nodes with the highest scores. They then created two scoring methods that rate stories in terms of how coherent and interesting they are. By and large, their results demonstrate that the proposed over-generation-and-ranking methodology is feasible for creating short stories that follow a narrative structure. However, their approach stochastically combined sentences for stories, and there is no guarantee that these stories will be interesting or coherent.

The author in [9] presented a virtual storytelling system (AVEIRO). In their system, the characters are executed as intelligent, semi-autonomous agents. A virtual director (an agent with general knowledge about plot structure) controls their activities and guarantees that a well-organized plot develops. They do not make use of pre-defined scripts, which implies that the plot is not prescribed but made by the characters. Their approach has been implemented in a general multi-agent framework, the Virtual Storyteller. The framework for the Virtual Storyteller has been fully implemented; however, the knowledge bases were very constrained.

The authors of [10] suggested an approach to automatically build individual sentences with the help of an ontology that stores the needed knowledge. Their sentence generation model receives as input a specification of what it is supposed to deliver, and produces as output a corresponding natural language expression. Language grammar is the basis for the sentence generation. They considered sentence-structure planning according to grammatical rules, along with a selection of syntax, plus ordering and morphological generation. Their system concentrates on the construction of sentences to make some sort of a story.

The same authors built on their basic idea to form an automatic story generation framework [11] that gives the user an environment to build or rewrite the story according to their selections, through user interaction. The most attractive component of their framework is that it enables the user to choose the characters, items, and locations for the story being built. The utilized ontology gives the characteristics of the characters, items, and locations to the produced story.

Charles et al. [12] presented results from the first version of a completely implemented narration prototype, which demonstrates the generation of variations of a conventional storyline. These variations result from the interaction of autonomous characters with each other, with environment assets, or from user collaboration.

Riedl et al. [13] gave a planning algorithm for story generation. The story planners are restricted in that they can only work on the story world given, which impacts the capacity of the planner to discover a solution story plan, and the quality and structure of the story plan if one is found, which requires semantics. The ISR planning algorithm expects creative power over parts of the story world description.

Riedl et al. [14] outline the flow of a story as a linear representation of events, with foreseen user actions and system-controlled agent actions combined together in a partially ordered plan. For each conceivable way the user violates the story plan, an alternative story plan is created.

The literature on automatic story generation systems can take one of two directions: planning-based and simulation-based story generation [2]. In planning-based generation of a story, the characters and events are statically defined, then a plot is created based on the defined characters and events. Of course, the events are not randomly placed but arranged according to the pre- and post-conditions of each event, so that each event is suitable for the previous and posterior event. In the planning-based approach of story generation, a fixed set of story events is set and then characters and events are combined to form stories. An example of this approach is MEXICA [5]. The framework proposed by the creators of MEXICA considered the pre- and post-conditions of events when generating the story.

Another author who followed this approach in story generation is Riedl [27], who proposed a general planning model using AI by retrieving and reusing vignettes, which are fragments of story that hold examples of narrating situations. This method allowed him to create a space of creative solutions that help with creating stories with respect to a planning concept inspired by existing plans.

The story-generating process is not as static as forming a plan and sticking to it without final touches or changes to follow some rules or constraints. Traditionally, sticking to a fixed plan will give us one feasible solution but will not provide all possible solutions that might be generated. Adding unexpectedness or randomness to the planning-based approach will produce a non-deterministic output each time the system runs, which will lead to having multiple output options.

Regardless of how efficient a planning-based method is, the final goal of automatic story generation systems is to provide something that satisfies the audience, which requires altering the story generation method to produce a happy ending, for example, or to add comedy, among many other possibilities. This is about employing creativity in generating stories and not just following a fixed method to generate expected, repeated stories.

Unlike the planning-based approach to generating stories, which revolves around events, the simulation-based generating approach revolves around individuals. In this approach a scope of characters is created, each denoted with its properties and possible actions. The rules in this approach hold the way characters interact in the real world. Thus, no specific plan-based constraints exist; the generated stories will comply with the rules existing by nature in the scope of the behavior of each character. Hence, the output is not guaranteed to be interesting; it will just follow a natural flow. But since this approach focuses highly on creating realistic characters, they might be equipped with elements that make them more realistic and close to the nature of humans, such as emotions.

Rank et al. [4] reported on an interactive storytelling approach where, for example, the factor of emotion creatively affects the creation of
the agents (characters) or the story. This character configuration process added a creative touch over the generated stories.

A hybrid approach might combine both methods to get an advanced method that might produce more satisfying results. But as mentioned earlier, each approach could be tackled and updated in infinite ways that might suit various domains to produce satisfying results. In the planning-based approach, infinite creative planning methods could be found to guide a story production. And in the simulation-based approach, infinite real-world-inspired factors could drive the behavior of any character to produce many versions of a requested story.

III. PROPOSED APPROACH

Writing a story is about preparing the right content and putting it in a good structure. The story content and presentation matter to the readers [39]. Well-written stories usually have impressive content that was prepared by a creative writer. In this work, we aim at designing a method that simulates humans' creativity in writing stories. The proposed approach contains two major phases: the first phase is planning the story in terms of components and structure, using a computational creativity approach; the second phase is the linguistics of the story, where the story sentences are generated according to the pre-planned content.

A. Planning the story

Planning a story requires preparing all the story elements. In the area of children's short stories, the writer Nancy Krulik [26], author of the Katie Kazoo, Switcheroo book series, defined five essential elements of a story:

Figure 1. Example of mapping concepts into the ontology

The story plan is set by considering all of the story elements that need to be prepared. In this work, our method is designed to generate short stories for children at a young age (5-7 years old). Thus, we choose not to follow a very complicated method in preparing story elements, in order to keep the story simple and suitable for kids. However, our method is scalable such that it can be modified by increasing the level of complexity in making some decisions when preparing the story components. Figure 2 shows an overview of our story generating approach.
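To make the role of the knowledge base concrete, a minimal concept store in the spirit of the concept mapping of Figure 1 might look as follows. The concepts, attributes, and relations are hypothetical placeholders, not the paper's actual ontology.

```python
# A toy concept store: each concept carries a type, attributes, and
# relations, mirroring the "concepts, relations, and attributes" the
# proposed ontology is said to hold (example entries are invented).
ontology = {
    "lion": {
        "is_a": "animal",
        "attributes": {"habitat": "savanna", "diet": "carnivore"},
        "relations": {"eats": ["zebra", "gazelle"]},
    },
    "zebra": {
        "is_a": "animal",
        "attributes": {"habitat": "savanna", "diet": "herbivore"},
        "relations": {"eats": ["grass"]},
    },
}

def facts_about(concept):
    """Collect (subject, predicate, object) facts the story planner can use."""
    entry = ontology[concept]
    facts = [(concept, "is_a", entry["is_a"])]
    facts += [(concept, attr, val) for attr, val in entry["attributes"].items()]
    for rel, objs in entry["relations"].items():
        facts += [(concept, rel, obj) for obj in objs]
    return facts
```

A planner could then query `facts_about("lion")` to weave correct scientific detail into a generated scene.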
Planning a good plot is essential for making a good story. A plot is a term that describes the set of events that make up a story, which is the main part of the story. Creating a unique plot is what makes a new story. Writers focus on creating an exciting and interesting plot to catch the reader's interest and to produce an entertaining story. In order to make a good plot, a set of story events is ordered in an exciting way such that it will capture the reader's interest. Usually, when a writer writes a story, the writer sets a goal idea or a core event that needs to be delivered in the story; then, sequences of events are developed to create a full plot. An interesting concept followed by writers is causality [7], which is about considering the cause and effect of every chosen action in a story, choosing the pre-actions and post-actions of that action carefully. The main idea of causality is that every process is caused by many possible processes and a process could cause several other processes; so, a plot could develop in many possible ways, and it depends on the writer's choice of each story event that makes sense according to the previously chosen events. This idea was followed by many authors who proposed methods for generating stories, such as the authors of [32].

In generating a story automatically, a system should be intelligent enough to create a story plot. The system has to be able to predict the sequence of actions. Previous computational creativity approaches have been developed using various methods; however, none of them has considered predicting the sequence of actions in generating a story. In this work, we propose to use an artificial intelligence method, namely the Markov Chain Model, for the prediction purpose [43].

Figure 3. Example of a Markov chain model for a child's possible actions

A Discrete Time Markov Chain [43] is a sequence of random variables such that the probability of the next state Xn+1 depends only on the previous state Xn and not on the overall sequence. This is expressed in the following probabilistic formula:

P(Xn+1 = x | X1 = x1, X2 = x2, …, Xn = xn) = P(Xn+1 = x | Xn = xn)

Probabilities could be set in multiple ways: a stochastic approach could be followed, or a statistically based approach according to a stored history of sequences that participates in forming the current probabilities.

The Markov Chain Model is able to form multiple sequences of character actions. Then, it will choose either the sequence with the highest probabilities connecting the elements, or a sequence that sums up to be above a certain threshold set by domain constraints. For now, we will consider taking the sequence with the highest summation of probabilities. Applying the Markov Chain Model to find a suitable ordering for story events assures planning the shape of the story by choosing a mathematically feasible sequence of actions, which will make sure that the outcome is reasonable and valuable.
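The event-ordering step just described (build candidate action sequences over a transition graph, then keep the one with the highest summation of transition probabilities) can be sketched as follows. The states and probabilities are hypothetical stand-ins for those of Figure 3.

```python
# Toy transition graph over a child character's possible actions
# (states and probabilities are invented for illustration).
transitions = {
    "wake_up":       {"eat_breakfast": 0.7, "play": 0.3},
    "eat_breakfast": {"go_to_school": 0.8, "play": 0.2},
    "play":          {"go_to_school": 0.5, "sleep": 0.5},
    "go_to_school":  {"learn": 1.0},
    "learn":         {"sleep": 1.0},
    "sleep":         {},  # terminal state
}

def all_paths(state, path=None):
    """Enumerate every action sequence from `state` to a terminal state."""
    path = (path or []) + [state]
    successors = transitions.get(state, {})
    if not successors:
        yield path
    for nxt in successors:
        yield from all_paths(nxt, path)

def score(path):
    """Sum of transition probabilities along the path -- the selection
    criterion described in the text."""
    return sum(transitions[a][b] for a, b in zip(path, path[1:]))

best_plan = max(all_paths("wake_up"), key=score)
```

Exhaustive enumeration is only viable for a small, acyclic action graph like this one; a thresholded or sampled search, as the text also allows, would be needed for larger state spaces.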
B. Building the story
In our proposed approach, using Markov Chain Model, we This stage contains natural language generation. The previous
follow a stochastic based approach which depends on stored stage sets the story outline creatively, what's remaining is
information about possible ordering of events to suit the story making all linguistic decisions to satisfy the goal of producing a
goal. This part is about simulating the creativity of humans story that is complete, correct, and coherent. Natural language
when forming story plots to design a suitable plot for our system generation contains 6 main tasks: content planning, text design,
to follow when generating the story. As previously mentioned, sentence planning, lexicalization, referring expression
we find our problem of story generation is a problem that hires generation, and linguistic realization[16, 17]. The first 3 tasks
the exploratory creativity in exploring a set of possible elements are application dependant and set according to the domain of the
to build the final solution. The exploration is controlled with language generating application. The last 3 tasks are about
certain constraints and requirements, also the options are making decisions about the linguistics of the story, a lot of
measured according to the domain measurement basis. techniques are used in the literature that could help with the
implementation process, but, here we design the general method
A plot is a sequence of events that form a story. Those events are of generating educational stories and such details related to the
actions performed by the characters involved in the story. So, implementation will be discussed in our future work.
designing a plot requires ordering characters' actions into a
reasonable order according to some constraints. In order to simulate C. Evaluating the story
this process we chose the Markov chain model[43]. Markov chain Evaluating any creative content is considered a bit of a
model is a model that hires statistics in determining a sequence of challenge[40]; because usually the evaluation in the artistic field
elements according to certain rules or history. This model contains is very subjective and humans taste is nondeterministic. Also,
building multiple possible sequences in shapes of directed graphs such productions cannot be automatically assessed. So, such
and then choosing a suitable sequence according to all the systems require human evaluation for the output. Many types of
probabilities between links that connect the sequence elements. evaluation methods could be used to get useful information
Figure 3 shows an example of a simple Markov Chain Model for a about the validity of such a proposed system, such as
possible actions in a child story. questionnaires, surveys and observations.
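The plot-ordering idea can be illustrated with a small sketch. This is our illustrative code, not the authors' implementation: the action names and transition probabilities are invented. It scores each candidate ordering by the summation of transition probabilities along its links and keeps the best one, as described above.

```python
from itertools import permutations

# Illustrative sketch only: the story actions and transition probabilities
# below are invented, not taken from the paper's system.
TRANSITIONS = {
    ("wake_up", "eat_breakfast"): 0.6,
    ("wake_up", "go_to_school"): 0.3,
    ("eat_breakfast", "go_to_school"): 0.7,
    ("go_to_school", "learn_lesson"): 0.8,
    ("eat_breakfast", "learn_lesson"): 0.1,
}

def score(sequence):
    """Summation of transition probabilities along a candidate ordering."""
    return sum(TRANSITIONS.get(pair, 0.0)
               for pair in zip(sequence, sequence[1:]))

def best_ordering(actions):
    """Choose the ordering of actions with the highest summed probability."""
    return max(permutations(actions), key=score)

print(best_ordering(["wake_up", "eat_breakfast", "go_to_school", "learn_lesson"]))
# ('wake_up', 'eat_breakfast', 'go_to_school', 'learn_lesson')
```

An exhaustive search over permutations is only feasible for a handful of actions; a real system would walk the chain greedily or sample from it, but the scoring criterion (highest summation of link probabilities) is the same.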
In this work, we studied a problem which combines the previous topics, and then we studied all possible paths that could be followed to build such a system. We have designed a reasonable method for building such a system. In future work, we aim to implement this proposed method and test its ability to generate educational short stories for children. After that, we also hope to continue investigating the ability to apply computational creativity techniques to designing even the smallest details in the stories, to make them look like human-produced stories.

V. REFERENCES
[1] Surdu, J. R. & Kittka, K. (2008), The Deep Green concept, in 'Proceedings of the 2008 Spring Simulation Multiconference', Society for Computer Simulation International, San Diego, CA, USA, pp. 623-631.
[2] Jändel, M. (2013a), 'Computational Creativity in Naturalistic Decision-Making', submitted to the International Conference on Computational Creativity, 2013.
[3] Swartjes, I. & Vromen, J. (2007), Narrative Inspiration: Using Case Based Problem Solving to Support Emergent Story Generation, in '4th International Joint Workshop on Computational Creativity'.
[4] Rank, S.; Hoffmann, S.; Struck, H.-G.; Spierling, U. & Petta, P. (2012), Creativity in Configuring Affective Agents for Interactive Storytelling, in 'Proc. of the 3rd International Conference on Computational Creativity'.
[5] Pérez y Pérez, R.; Negrete, S.; Peñaloza, E.; Castellanos, V.; Ávila, R. & Lemaitre, C. (2010), MEXICA-Impro: A Computational Model for Narrative Improvisation, in 'Proc. of the International Conference on Computational Creativity'.
[6] Strapparava, C.; Valitutti, A. & Stock, O. (2007), Automatizing Two Creative Functions for Advertising, in 'International Conference on Computational Creativity'.
[7] Bunge, M. (2012), Causality and Modern Science: Third Revised Edition. Massachusetts, USA: Courier Corporation.
[8] McIntyre, N. & Lapata, M. (2009), "Learning to Tell Tales: A Data-driven Approach to Story Generation", in Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, Suntec, Singapore, August 2-7, 2009.
[9] Theune, M.; Faas, S.; Nijholt, A. & Heylen, D., "The Virtual Storyteller: Story Creation by Intelligent Agents", University of Twente, PO Box 217, 7500 AE Enschede, The Netherlands.
[10] Jaya, A. & Uma, G.V. (2008), "A novel approach for construction of sentences for automatic story generation using ontology", in Proceedings of the International Conference on Computing, Communication, and Networking.
[11] Jaya, A. & Uma, G.V. (2010), "An intelligent system for semi-automatic story generation for kids using ontology", in Proceedings of the Third Annual ACM Bangalore Conference.
[12] Charles, F.; Mead, S.J. & Cavazza, M. (2001), "Character-driven story generation in interactive storytelling", in Proceedings of the Seventh International Conference on Virtual Systems and Multimedia, pp. 609-615, Oct. 2001.
[13] Riedl, M. & Young, R.M. (2004), "Open-World Planning for Story Generation", in Proceedings of the 19th International Joint Conference on Artificial Intelligence, California, USA.
[14] Riedl, M. & Young, R.M. (2005), "From Linear Story Generation to Branching Story Graphs", American Association for Artificial Intelligence (www.aaai.org), pp. 23-29.
[15] Bushinsky, S. (2009), "Deus ex machina: a higher creative species in the game of chess", AI Magazine 30(3):63-69.
[16] Reiter, E. & Dale, R. (1997), "Building natural-language generation systems", Natural Language Engineering, 3, 57-87.
[17] Reiter, E. & Dale, R. (2000), Building Natural Language Generation Systems. Cambridge University Press, Cambridge, UK.
[18] Binsted, K. & Ritchie, G. D. (1994), "An implemented model of punning riddles", in Proc. AAAI'94.
[19] Binsted, K. & Ritchie, G. D. (1997), "Computational rules for generating punning riddles", Humor: International Journal of Humor Research, 10(1), 25-76.
[20] Stock, O. & Strapparava, C. (2005), "The act of creating humorous acronyms", Applied Artificial Intelligence, 19(2), 137-151.
[21] Petrovic, S. & Matthews, D. (2013), "Unsupervised joke generation from big data", in Proc. ACL'13, pp. 228-232.
[22] Hervás, R.; Pereira, F.; Gervás, P. & Cardoso, A. (2006), "Cross-domain analogy in automated text generation", in Proc. 3rd Joint Workshop on Computational Creativity, pp. 43-48.
[23] Veale, T. & Hao, Y. (2007), "Comprehending and Generating Apt Metaphors: A Web-driven, Case-based Approach to Figurative Language", in Proc. AAAI'07, pp. 1471-1476.
[24] Veale, T. & Hao, Y. (2008), "Fluid knowledge representation for understanding and generating creative metaphors", in Proc. COLING'08, pp. 945-952.
[25] Han, S.; Shim, H.; Kim, B.; Park, S.; Ryu, S. & Lee, G. G. (2015), "Keyword question answering system with report generation for linked data", 2015 International Conference on Big Data and Smart Computing (BIGCOMP), Jeju, pp. 23-26. doi: 10.1109/35021BIGCOMP.2015.7072843.
[26] Krulik, N., Katie Kazoo Classroom Crew. www.katiekazoo.com/nancy.html. Accessed 18 Nov. 2018.
[27] Riedl, M. O. (2008), "Vignette-Based Story Planning: Creativity Through Exploration and Retrieval", in 'Proc. 5th International Joint Workshop on Computational Creativity'.
[28] List of narrative forms (2019). Retrieved from https://en.wikipedia.org/wiki/List_of_narrative_forms
[29] Rosenfeld, R. (2000), "Two decades of statistical language modeling: where do we go from here?", Proceedings of the IEEE, vol. 88, no. 8, pp. 1270-1278, Aug. 2000.
[30] Baxter, F.; Dilley, L.; Cross, A. & Board, J. (June 2014), Cambridge Primary Science Stage 4 Learner's Book. Cambridge: University of Cambridge.
[31] Musen, M.A. (2015), "The Protégé project: A look back and a look forward", AI Matters (ACM SIGAI), 1(4), June 2015. DOI: 10.1145/2557001.25757003.
[32] Peinado, F. & Gervás, P. (2006), "Evaluation of Automatic Generation of Basic Stories", New Generation Computing, Special issue: Computational Creativity, 24(3):289-302.
[33] Bao, J.; Caragea, D. & Honavar, V. (2006), "Towards Collaborative Environments for Ontology Construction and Sharing", Collaborative Technologies and Systems, CTS 2006.
[34] Sgouros, N. M. (1999), "Dynamic Generation, Management and Resolution of Interactive Plots", Artificial Intelligence 107(1), pp. 29-62.
[35] Meehan, J. R. (1981), "TALE-SPIN and Micro TALE-SPIN", in Inside Computer Understanding (Schank, R. C. & Riesbeck, C. K., eds.), Lawrence Erlbaum Associates, Hillsdale, NJ.
[36] Fairclough, C. & Cunningham, P. (2003), "A Multiplayer Case Based Story Engine", in Proceedings of the 4th International Conference on Intelligent Games and Simulation, EUROSIS, pp. 41-46.
[37] Turner, S. R. (1992), Minstrel: A Computer Model of Creativity and Storytelling, Technical report UCLA-AI-92-04, Computer Science Department, University of California, USA.
[38] Liu, H. & Singh, P. (2002), "MAKEBELIEVE: Using Commonsense to Generate Stories", in Proceedings of the Eighteenth National Conference on Artificial Intelligence, AAAI 2002, Edmonton, Alberta, Canada, AAAI Press, pp. 957-958.
[39] Soleimani, H. & Akbari, M.G. (2013), "The Effect of Storytelling on Children's Learning English Vocabulary: A Case in Iran".
[40] Boden, M. (2009), "Computer Models of Creativity", AI Magazine, 30(3), Fall 2009.
[41] Gervás, P.; Pérez y Pérez, R.; Sosa, R. & Lemaitre, C. (2007), "On the Fly Collaborative Story-Telling: Revising Contributions to Match a Shared Partial Story Line", in 'Proc. of the International Joint Workshop on Computational Creativity'.
[42] Studer, R.; Benjamins, V. & Fensel, D. (1998), "Knowledge engineering: Principles and methods", Data & Knowledge Engineering, vol. 25, no. 1-2, pp. 161-197. Available: http://dx.doi.org/10.1016/S0169-023X(97)00056-6
[43] Jaiswal, S. (2019), "Python Markov Chains Beginner Tutorial", DataCamp Community. Available at: https://www.datacamp.com/community/tutorials/markov-chains-python-tutorial [Accessed 14 Apr. 2019].
Arabic Document Indexing for Improved Text Retrieval
Yaser A. M. Al-Lahham
Computer Science Department
Zarqa University
Zarqa – Jordan
yasirlhm@zu.edu.jo
Abstract - Arabic document indexing is a challenging process due to the complex morphological nature of the Arabic language. Methods of document indexing in the literature relied on applying morphological schemes to extract terms. These morphological schemes mainly depend on root extraction and stemming. This paper proposes a simple document indexing method based on selecting only definite words (those that have the prefix AL, or for which it is acceptable to have this prefix). The words preceding and/or succeeding these definite words are also considered. The proposed method is tested using the TREC-2001/2002 Arabic test collection. The proposed method outperforms selecting all terms, either without stemming or stemmed by the Light10 stemmer; for example, indexing documents by selecting definite words and the words that come after them enhances the Mean Average Precision of Light10 by 4.4%, and at the same time decreases the index size by 6.1%.

Keywords - Arabic Information Retrieval; Arabic Document Indexing; Index Term Selection; Arabic Language Processing

I. INTRODUCTION

Arabic has a rich vocabulary, since words can be devised by adding, stressing, or combining words, or just by changing a diacritic of a letter in a word. The application of these rules to Arabic words gives Arabic a complicated morphological structure, which complicates the document-indexing process [10]. These complications can be noticed in many cases: a word can have different forms for the plural and dual, definite articles, male or female, or any other usage. For example, applying these morphological rules to the word "كتاب" (a book) yields the following forms: "كتابي" my book, "كتابها" her book, "كتابهم" their book (male), "كتابهن" their book (female), "كتب" books, "الكتاب" the book, and "كتابهما" their book (for two). All of these forms of the word refer to the same meaning. In Arabic Information Retrieval, this problem makes it difficult to match the terms of a query to the index terms of documents [18].

To solve these problems, some research efforts indexed documents using Arabic morphological analysis to extract index terms. Although many proposals based on root extraction are effectively used in some areas, such as automatic diacritization of Arabic sentences [9], root extraction shows less significance when used in Arabic text retrieval [15]. Alternatively, stemming is mostly used to index a document by mapping different word forms to a single term, or stem [8], which can solve the problem of matching two words having the same meaning but different shapes [10]. Stemming produces a reduced index and enhances retrieval (in terms of recall); other researchers found that it has little effect on the precision ratio, such as [16].

Recently, light stemmers have been used for document indexing in order to reduce the complexity of morphological analyzers and heavy stemmers. An example of a widely used light stemmer is Light10 [16], which applies a few morphological rules to strip off a predefined list of affixes. Light stemmers encounter some problems; for example, using different stemming algorithms with the same affix lists produces different results [10], and for short queries light stemmers behave the same way as no stemming [21].

Using light stemmers for document indexing can be improved by choosing a representative subset of terms instead of selecting all terms, since the results recorded by many researchers showed that this gives better retrieval and reduces the index size [18]. These results motivated the proposal of this paper. This paper proposes a different approach to indexing documents: it selects the index terms that are most likely to have an important role in Arabic sentences, such as the definite words (those that have the prefix "ال"), and the words after/before them. Definite words can gain more importance as keywords in Arabic text, as the article is added to nominal words to upgrade importance, as a previous-knowledge indicator, and as a definite conjunctive article added to active and passive participles [20]. Once a document is indexed according to a word that has the prefix "ال", all later documents are indexed according to this word, regardless of whether it has the prefix "ال" or not.

Throughout the paper, the term "AL-Word" means a word that begins with the article "ال" or "AL". AL-Words together with the words before them are referred to as "ALBEFORE", AL-Words with the words after them as "ALAFTER", and AL-Words with the words before and after them as "ALBEFORE_AFTER".

The rest of the paper is organized as follows: section 2 includes a survey of the related work. Section 3 presents the proposed method of document indexing. Section 4 presents the evaluation procedure, results, and discussion. Finally, section 5 concludes the paper and presents the future work.

II. RELATED WORK

Arabic document indexing includes index term selection, which can be categorized into statistical, linguistic, and combined linguistic and statistical techniques.

Statistical approaches use properties of index terms such that the selection criterion is not oriented towards a specific language or application. For example, Al-Kabi et al. in [2] used term frequency-inverted term frequency (TF-ITF) and term co-occurrence as statistical parameters to extract index terms. Inverse document frequency (IDF) is also used as a statistical parameter to select n-gram character stems; for example, Awajan in [7] used the IDF to distinguish a stem from terms that seem like that stem, since such terms are more frequent than stems and so will have a higher document frequency than a stem.

Other methods used machine learning techniques to apply morphological rules of the Arabic language to extract words' roots and stems. Some researchers apply these methods to extract features to be used in a specific application; for example, Al-Thubaity et al. in [4] tested several methods to extract features suitable for Arabic text classification.

Some root extraction methods, for example Nehar et al. in [19], used finite state transducers to determine index terms; in case a word maps to more than one root, some statistical methods are applied to resolve conflicts. Morphological analysis suffers from ambiguity and covers a limited number of word forms [10].
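The document-frequency signal underlying these statistical approaches can be sketched as follows. This is illustrative code, not any cited paper's implementation; it simply shows why frequent surface forms receive lower IDF weight than rarer stems.

```python
import math

def idf(term, documents):
    """Inverse document frequency of `term` over a list of tokenized
    documents (each document given as a set of terms).  Terms that
    occur in many documents get a lower weight than rarer ones."""
    df = sum(1 for doc in documents if term in doc)
    return math.log(len(documents) / df) if df else 0.0
```

For instance, over three documents where a term appears in two of them, its IDF is log(3/2); a term appearing in every document scores log(1) = 0.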
978-1-7281-2882-5/19/$31.00 ©2019 IEEE 226
Light stemmers were proposed to overcome the complications of morphological analysers: tables of a limited number of common affixes are determined, and the longest matched affix of a word is removed [7]. Light stemmers were improved by updating previously used tables, where a new affix is added to develop a new table at each improvement.

Recently, enhancements to existing light stemmers have been proposed, for example SAFAR [13], and P-Stemmer [14], which extends the prefix list of the Light10 stemmer and does not strip the suffixes; P-Stemmer is used for text classification. Mustafa, Mohammad et al., in [17], extended the Light10 stemmer by adding more prefixes and suffixes to be stripped out; additionally, some conditions were imposed, such as a one-letter prefix being stripped out only if the remaining term length is greater than three letters. They proposed another linguistic-based stemmer that uses some morphological aspects to classify words into categories, such that a different stemming method is applied to each category. Abdelali et al. proposed the FARASA stemmer [1], which uses a Support Vector Machine to rank multiple segments that could be the stem of a word.

Alternatively, a dictionary or lookup-table approach can be used, such that for each stem all of the words belonging to it are stored and replaced by that stem during the indexing procedure, as indicated in [3]. However, this method is static and needs its tables to be periodically updated.

On the other hand, documents can be indexed by selecting index terms according to a combination of morphological analysis and other linguistic tools such as Part Of Speech (POS); for example, Awajan in [6] proposed a method that automatically extracts keywords from Arabic documents using unsupervised learning and statistical aspects of words.

Previous research extracted index terms according to statistical and morphological aspects, and other tools such as POS taggers. However, the Arabic language has other aspects that can be used for simpler and more efficient index term selection. This research is primarily based on using some of these aspects for index term selection in order to index documents, as explained in section III.

III. DOCUMENTS' INDEXING FRAMEWORK

The proposed document indexing selects a subset of words that are most likely to have importance in Arabic sentences and semi-sentences. The selected words include definite words (AL-Words), whose prefix is the article "ال" or any of its forms (وال، فال، بال، كال، لل), and the terms preceding and/or following them (ALBEFORE/ALAFTER). Words for which it is acceptable to have this prefix are also considered definite words, even if they do not carry the prefix AL. Some words, such as verbs and most persons' names, cannot have the prefix AL, so such a word is considered only when it precedes or follows a word that can have this prefix. The overall index term-selection framework is presented in Fig. 1. The framework begins with a pre-processing stage (stop-word and punctuation removal, and normalization); the next step is to apply the proposed selection of words, followed by light stemming to produce the index terms.

Fig. 1 Index Term Selection Framework (pre-processing → selection of AL-Words and words before/after → light stemming → index terms)

The following subsections explain the rationale of using these aspects of the Arabic language to select index terms.

III.1. Select definite Words (AL-Words)

The article "ال" in Arabic is used for different purposes: it can be used as a redundant article added to nominal words to upgrade importance, as a previous-knowledge indicator, and as a definite conjunctive article added to active and passive participles [20]. So this article is a good indicator of terms that are most likely to be important index terms, because the names and adjectives that this article defines represent informative terms in the text, as indicated by [5] and [12].

As "ال" indicates previous knowledge, it points to a concept previously mentioned in the text, being the topic of that text, so these words are expected to have a significant role in the text and to be important enough to be selected as index terms.

Passive and active participles, which are determined by the article "ال", are frequently used to focus on the event rather than the entity that actually performed it, which indicates that these words are of sufficient importance to be selected as index terms. For example, in "هُدِم الجدار" ("the wall has been destroyed"), the term "الجدار" is significant enough to be selected, as it is the entity affected by the verb "هدم". Moreover, "ال" is used to identify different concepts, as indicated in Table 1.

TABLE-1 EXAMPLES OF AL-WORDS' USAGE
Usage | Example beginning with "ال"
Region name | الشرق الأوسط (the Middle East)
Enterprise name | الشركة العربية (the Arab Company)
Focus | قضية اللاجئين (the Refugee Cause)
Relationship | أهداف المؤتمر (the Conference Objectives)
Place name | المسجد الأقصى (the Aqsa Mosque)
Family name | الهاشمي (Al-Hashmi)

Moreover, selecting AL-Words prevents the IR system from ignoring some words that have the same shape as stop words, since, in the Arabic language, stop words cannot take "ال" as a prefix, while these words accept this prefix. Some examples are listed in Table 2.

TABLE-2 EXAMPLES OF WORDS THAT SEEM LIKE STOP WORDS
Stop word | Word of same shape
ايه (which) | آية، الآية (Verses)
فهم (they) | فهم، الفهم (understanding)
وهم (and they) | وهم، الوهم (Mystery)
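The AL-Word selection step described above can be sketched as follows. This is an illustrative sketch under the paper's ALBEFORE/ALAFTER/ALBEFORE_AFTER definitions, not the author's code; tokenization and normalization are assumed to have been done in the pre-processing stage.

```python
# Forms of the definite article, per the paper: ال وال فال بال كال لل
AL_PREFIXES = ("ال", "وال", "فال", "بال", "كال", "لل")

def is_al_word(token):
    """A definite (AL-) word begins with the article or one of its forms."""
    return token.startswith(AL_PREFIXES)

def select_index_terms(tokens, scheme="ALAFTER"):
    """Select AL-Words, plus their neighbouring words per the chosen scheme."""
    selected = set()
    for i, tok in enumerate(tokens):
        if not is_al_word(tok):
            continue
        selected.add(tok)
        if scheme in ("ALAFTER", "ALBEFORE_AFTER") and i + 1 < len(tokens):
            selected.add(tokens[i + 1])
        if scheme in ("ALBEFORE", "ALBEFORE_AFTER") and i > 0:
            selected.add(tokens[i - 1])
    return selected
```

For example, for the tokens ["قرأ", "الطالب", "درسا"], the ALAFTER scheme selects the AL-Word "الطالب" and the word after it, "درسا". A fuller implementation would also admit words that merely accept the prefix, as the paper describes.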
[Results table: R-precision, number of terms, and number of postings for each indexing method, shown as percentages relative to the ALL-TERMS and Light10 baselines.]
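The R-precision measure reported in the results above is a standard IR metric: precision computed at rank R, where R is the number of documents relevant to the query. A minimal sketch (illustrative, not the paper's evaluation code):

```python
def r_precision(ranked_ids, relevant_ids):
    """Precision over the top-R retrieved documents, where R is the
    number of documents judged relevant for the query."""
    r = len(relevant_ids)
    if r == 0:
        return 0.0
    top_r = ranked_ids[:r]
    return sum(1 for doc_id in top_r if doc_id in relevant_ids) / r
```

For a query with two relevant documents, a ranking whose top two results contain one of them scores an R-precision of 0.5.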
Evaluation of Question Classification
Mariam Biltawi, Arafat Awajan, Sara Tedmori
Computer Science Department
King Hussein School of Computing Sciences
Princess Sumaya University for Technology
Amman, Jordan
maryam@psut.edu.jo, awajan@psut.edu.jo, s.tedmori@psut.edu.jo
Abstract— The goal of this paper is to study question classification for the Arabic language using machine learning approaches. Different experiments were conducted using two types of weighting schemes and three classifiers: Multinomial Naïve Bayes, Decision Trees, and Support Vector Machine. The dataset used in the experiments is an updated version of CLEF. The best results were obtained when the dataset was preprocessed by removing punctuation, diacritics, and stop-words, and performing normalization and stemming, and then using the TF-IDF weighting scheme, with SVM being the best classifier among the three with an F1-score of 81%.

Keywords—question classification, Arabic question classification, multinomial Naïve Bayes, Decision Trees, Support Vector Machine

I. INTRODUCTION

Question Answering (QA) is an application of computer science that spans several core areas including information retrieval and natural language processing. QA is concerned with building systems that can automatically answer questions provided by humans. QA can be seen as an extension to search engines, but rather than providing a group of documents as a result, QA systems provide concise and correct answers, saving navigation time for users. Generally, QA systems can be classified based on the source of answers, which can either be structured or unstructured. Unstructured sources can be documents from the web, while structured QA systems may use knowledge bases [1].

QA systems rely on different fields and technologies, including NLP, information retrieval, semantic web technologies, database technologies, and human-computer interaction. QA systems can be implemented to construct either structured or unstructured answers to deal with different types of questions, including How, Why, Fact, List, definition, cross-lingual, semantically constrained, and hypothetical questions. These questions can be either domain-specific, or open-domain dealing with nearly anything [1].

A question is a natural language sentence, phrase, or even a word, used to request information or test someone's knowledge. A question usually starts with an interrogation word (IW). Questions posed at the beginning of a research effort in order to identify the main objectives of the study or to determine the type of problem the writer is trying to solve are referred to as research questions. Rhetorical questions, on the other hand, refer to questions that are used to begin a discussion or to emphasize a point rather than to request a direct answer. Rhetorical questions can have obvious answers, be used as metaphors (example: Can birds fly?), or have no answers and hence be used for negative assertion or sarcasm (example: Who cares?).

In Arabic, there are two types of questions: (1) real questions, which are used to request a direct answer from the respondent, and (2) metaphorical questions, which do not seek a specific answer. Metaphorical questions are identical to the rhetorical questions discussed earlier in terms of their purpose, which is either to impose, to rebuke, to threaten, to command, to pray, or to hope. Arabic questions can be asked either by using Arabic IWs or without using them. There are two types of Arabic IWs: (1) question particles used for yes/no questions, namely Hamza (أ) and Hal (هل), and (2) other question words such as who (من), what (ما، ماذا), where (اين), when (متى، ايان), how (كيف، انى), how much/many (كم), and which (اي) [2]. Some Arabic questions do not use an IW. Such questions can be list questions that might start with the words (اذكر، عدد), which mean "list", or explanation questions that start with the words (اشرح، فسر), which mean "explain".

Questions are classified in order to determine either the question type or the answer type. Questions can be factoid or non-factoid. Factoid questions can have different answers, for example a person name, an organization, a location, etc., while non-factoid questions can be definition, casual, etc. The goal of this paper is to experiment with question classification utilizing the updated version of the translated CLEF dataset. The contents of the original version of the CLEF dataset date back to 1994. Hence, some of the questions are outdated and can't be utilized for purposes of constructing QA systems that use the web as the answers' source. In addition, some of the questions in the original version of CLEF are syntactically incorrect. This paper is organized as follows: section 2 presents the related work, section 3 provides information relating to the used dataset, section 4 presents the classifiers used in the experiment, section 5 presents the weighting schemes used, section 6 presents the evaluation measures, section 7 presents the experiment and the results, section 8 discusses the results, and section 9 presents the conclusion.

II. RELATED WORK

Generally, text classification can be categorized into (1) rule-based techniques [3], (2) Machine Learning (ML) techniques [4], and (3) hybrid-based techniques [5]. Rule-based techniques are usually unsupervised, since no model needs to be trained, and they can be lexicon-based, pattern-based, or both. ML techniques are usually called corpus-based techniques and are categorized into supervised learning and unsupervised learning. Hybrid-based techniques are usually either a mix of supervised and unsupervised learning techniques, or a mix of rule-based and ML techniques.

Some researchers have experimented with question classification techniques in the Arabic language. Al-Chalabi et al. [6] presented a rule-based Arabic question classifying technique. Their technique relies on the Arabic IW, where each IW within a question represents a class, while questions that do not use an IW were neglected. The authors considered the IWs (كم - how much, how many, how far, and how long), (من - who), (ما - what), (اين - where), (متى - when), (اي - which), and (كيف - how). They proposed patterns for each class, as illustrated in Table I. Most of the patterns start with an IW, maybe followed by a noun phrase (NP) or a verb phrase (VP), and any word format (WF) that will not affect the classification process. The only IW that may start with a preposition (PP) is (اي - which). The IW (ما - what) is either followed by (هو - HOA), (هي - HEA), or an NP. The experiment was conducted on 200 questions, applied on context-free grammar and regular expressions written using the NooJ tool. Results showed a recall and precision of 93% and 100% respectively.

TABLE I. QUESTION PATTERNS PROPOSED IN [6]
IW | Answer type class | Pattern
كم (how much, how many, how far, how long) | Number | IW NP VP WF; IW VP WF
من (who) | Person/Organization | IW NP WF
ما (what) | Device/Geographical location/Sports/Organization/Art/Person | IW HOA NP WF; IW HEA NP WF; IW NP WF
اين (where) | Geographical location | IW VP WF
متى (when) | Date | IW VP WF
اي (which) | Number/Geographical location/History/Sports | PP IW NP WF; IW NP WF
كيف (how) | Science | IW VP WF

Al-Shawakfa [7] proposed a rule-based technique to classify questions according to IWs. The examined IWs were (من - who, whose), (متى - when), (اين - where), (ما، ماذا - what, which), (ما، مما - what), (ما هو، ما هي - what is), (كم - how much, how many), (لماذا - why), (اي - which), and (كيف - how). The classes assigned are person, organization, temporal expressions, location, product, event, object, device, sports, art, thing, numeric expressions, reason, history, and science. The question is tokenized, then classified according to a set of patterns defined by the authors.

Lahbari et al. [8] proposed a rule-based method to classify Arabic questions; they also compared two types of question taxonomies: the first is an Arabic taxonomy proposed by the authors, while the second was proposed by Li and Roth in [9]. Their rule-based method first normalizes the questions by removing diacritics and punctuation, then tokenizes them. Next is the pattern matching step, where each IW corresponds to one class, except what (ما، اي), which can have multiple classes according to the noun present in the question; therefore, it is further processed by removing stop-words and identifying the nouns. Experiments were conducted using the CLEF and TREC translated datasets. The Arabic taxonomy classes used to label the questions were time, description, location, human, and number. The question classes used for experimentation from the Li and Roth taxonomy were abbreviation, definition, description, location, person, time, number, entity, and other. Experimental results showed an accuracy of 78% and an error rate of 3.39%. The same authors conducted three experiments using the same question taxonomies to compare three classifiers (SVM, NB, and DT) in [10]. Results showed that SVM outperformed both NB and DT when the Arabic taxonomy was used, with a recall, precision, and f-measure of 89%, 93%, and 90% respectively.

Aouichat et al. [11] presented an approach to classify Arabic questions using the Li and Roth taxonomy. Their approach starts by preprocessing the questions through applying … [12] dataset, and results showed an accuracy of 92.83% in classifying the coarse-grain classes and an accuracy of 89.32% in classifying the fine-grain classes.

III. DATASET USED

The experimented dataset is an updated version of the original translated CLEF dataset, which consists of 800 question-answer pairs. For purposes of this research, a native Arabic expert was assigned the mission of reviewing the dataset for correctness, knowing that the contents of the dataset can be tracked back to the 90s. Therefore, some of the questions were either updated or deleted. For example, the question "كيف يمكن للتصوير بالرنين المغناطيسي العمل؟" is syntactically incorrect, and there is no answer attached to it; thus, it was deleted. Other questions were syntactically corrected, such as "كيف عدد سكان فيتنام؟", meaning "how is the population of Vietnam?": the IW "كيف" ("how") is replaced with "كم", which means "how much".

Replicated questions were also deleted, keeping just one form of them, noting that some of the replicas were syntactically incorrect. For example, the question "ما هو أداه؟", which means "what is the tool?", is replicated twice with a different answer each time; the question is ambiguous and does not ask about a specific thing, nor does it mimic human behavior; thus, both questions were deleted. In addition, questions that contain English letters were excluded as well. For example, questions like "ما هي UEFA؟", meaning "what is UEFA?", whose answer is "الاتحاد الاوروبي لكرة القدم", were omitted because they are not pure Arabic questions. Furthermore, questions like "من هو كريستو؟", which means "who is Cristo?", whose answer was "فنان من أصل هنغاري" ("an artist with Hungarian origin"), and which have no records when a Google search is performed, have also been excluded. Some other questions were updated; for example the question "من هو رئيس وكالة الطاقة الذرية؟", meaning "Who is the head of the IAEA?" (at that time the head of the IAEA was "Hans Blix"), was updated to "من كان رئيس الوكالة الدولية للطاقة الذرية في الفترة 1981-1997؟", which means "who was the head of the IAEA in the period 1981-1997?".

The total number of updated questions is 189 question-answer pairs, while the excluded questions were 200 question-answer pairs. The resulting size of the dataset was 600 question-answer pairs. Questions were labeled manually, and the total number of classes was eight: casual, definition, description, entity, human, list, location, and numeric. Questions assigned the classes entity, human, location, and numeric are factoid questions, where the answers are short and represent facts. For example, an entity can be an organization, metric, currency, etc., while human can be a human name or occupation. Location may represent a city, country, river, or mountain, and numeric can be a date, time, population, etc.

The remaining classes (casual, definition, description, and list) are considered non-factoid; their answer may exceed one sentence, and can have multiple answers according to the writing style, and each class differs in the answer it asks for; for example, casual questions usually ask for a reason or
tokenization, removing diacritics, normalization, and date and purpose. Definition questions ask for definition for a term or
time labeling using regular expressions. Next, the an entity, description questions ask for methods and
preprocessed questions are fed into the SVM classifier to explanations, and list questions ask for steps. Table 2 shows
assign them a course-grain class, and then they are fed into the the number of questions under each class.
Convolutional Neural Network (CNN) to assign them a fine-
grain class. Experiments were conducted on TALAA-AFAQ TABLE II. NUMBER OF QUESTIONS UNDER EACH CLASS
232
CLASS        NUMBER OF QUESTIONS
CASUAL       5
DEFINITION   60
DESCRIPTION  16
ENTITY       107
HUMAN        110
LIST         21
LOCATION     128
NUMERIC      153
TOTAL        600

Table 3 shows the first tokens used in the dataset along with their counts, noting that these tokens are normalized and their affixes were not removed. As illustrated in the table, seven tokens that are not IWs appear at the beginning of questions: (في، الى، عدد، فوق، اعطي، منذ، على). Note that IWs in the Arabic language can come (1) at the beginning of the question, (2) as the second token in the question, or (3) at the end of the question. In the updated CLEF dataset, the third case does not occur.

TABLE III. FIRST TOKENS FOR THE QUESTIONS IN THE DATASET

FIRST TOKEN  MEANING          IW   FREQUENCY
ما           What             Yes  179
من           Who              Yes  132
في           In               No   40
كم           How (much/many)  Yes  58
متى          When             Yes  54
ماذا         What             Yes  14
الى          To               No   3
اين          Where            Yes  72
بمن          Whom, who        Yes  1
اي           Which            Yes  1
عدد          List, enumerate  No   6
بماذا        What             Yes  4
فوق          On, above        No   1
كيف          How              Yes  13
اعطي         Give             No   16
فيما         What             Yes  1
على          On, onto         No   1
لماذا        Why              Yes  1
الي          For what         Yes  1
منذ          Since            No   1
باي          Which            Yes  1
TOTAL                              600

Table 4 shows the number of classes under each IW. For example, a question starting with (ما) can be assigned one of seven classes: definition, description, entity, human, list, location, and numeric; it is therefore hard to specify rules or patterns for each case. Questions with the IW (من) can be given one of six classes.

TABLE IV. NUMBER OF CLASSES ACCORDING TO IWS

FIRST TOKEN  MEANING          CLASS        FREQUENCY
ما           What             definition   22
                              description  2
                              entity       77
                              human        24
                              list         9
                              location     30
                              numeric      15
من           Who              definition   37
                              description  1
                              entity       10
                              human        79
                              list         1
                              location     4
في           In               entity       1
                              location     16
                              numeric      23
كم           How (much/many)  numeric      58
متى          When             numeric      54
ماذا         What             casual       1
                              definition   1
                              entity       6
                              human        4
                              list         2
الى          To               casual       1
                              entity       1
                              numeric      1
اين          Where            location     72
بمن          Whom, who        human        1
اي           Which            entity       1
عدد          List, enumerate  list         6
بماذا        What             casual       1
                              entity       2
                              location     1
فوق          On, above        location     1
كيف          How              description  13
اعطي         Give             entity       7
                              human        2
                              list         3
                              location     4
فيما         What             casual       1
على          On, onto         numeric      1
لماذا        Why              casual       1
الي          For what         entity       1
منذ          Since            numeric      1
باي          Which            entity       1
TOTAL                                      600

Table 5 illustrates the second tokens for the IW (ما) and the non-IWs (في، الى، فوق، على، منذ). It also shows the number of times that the first and second tokens occur together. Clearly, the IW (ما) is frequently used with the pronouns (ھو، ھي). The class of a question starting with the IW (ما) can differ according to the second or third tokens. For example, the question "ما ھي المجرة التي ينتمي اليھا كوكب األرض ؟", which means "What is the galaxy to which the Earth belongs?", is given the class "entity", while the question "ما ھي غولدمان ساكس ؟", which means "What is Goldman Sachs?", is given the class "definition"; both questions start with the IW (ما) followed by the pronoun (ھي).
Questions that start with the non-IW (في) are followed by (اي، ايه) to indicate a time, a location, or an entity. Questions that start with (الى) are followed by one of the tokens (اي، كم، ماذا). Questions whose second token is (اي) indicate that the answer is an organization and are thus given the class entity; questions whose second token is (كم) indicate that the answer is a number and are given the class numeric; finally, questions whose second token is (ماذا) indicate that the answer should be a reason or purpose and are therefore given the class casual.

TABLE V. THE OCCURRENCE OF TWO TOKENS TOGETHER

FIRST TOKEN  SECOND TOKEN  MEANING            FREQ
ما           ھو            He                 75
             ھي            She                83
             الذي          That, whose, whom  7
             اسم           Name               9
             االسباب        Reasons            1
             الشركتين      The two companies  1
             جنسيه         Nationality        1
             االفرقه        Teams              1
             اصل           Origin             1
في           ايه           Which              19
TABLE V (continued)

FIRST TOKEN  SECOND TOKEN  MEANING          FREQ
في           اي            Which            21
الى          اي            Which            1
             كم            How (much/many)  1
             ماذا          What             1
فوق          ايه           Which            1
على          اي            Which            1
منذ          متى           When             1

Table 6 illustrates all the second tokens that occur in the dataset. Note that the second token may be an IW attached to another word. For example, in the question "إلى كم يصل عدد السكان في الواليات المتحدة األمريكية ؟", the first token "إلى", which means "to", is not an IW, but it is attached to one: the IW is "كم", meaning "how much"; the two words together mean "to how much", and the full question means "How many people are in the United States?" In conclusion, using rules or identifying patterns for questions is time consuming, because IWs, as stated earlier, may come at the beginning of the question, as the second token of the question, or even at the end of the question. Therefore, the purpose of this paper is to examine machine learning classifiers and test their capability of predicting the classes of the updated version of the CLEF dataset.

TABLE VI. SECOND TOKENS IN THE DATASET

SECOND TOKEN  MEANING            FREQ  IW
ھو            He                 140   No
ايه           Which              20    Yes
ھي            She                91    No
قدر           Estimate           1     No
عدد           Number             36    No
حاز           Possess            1     No
يبلغ          Reaches            2     No
اي            Which              25    Yes
يسمى          Called             2     No
يوجد          Exist              19    No
كان           Was                21    No
الذي          That, whose, whom  42    No
تزوج          Married            3     No
نوع           Type               1     No
العناصر       Elements           1     No
ادخلت         Entered            1     No
تغطى          Covers             1     No
كم            How (much/many)    1     Yes
تم            Done               6     No
اتھم          Accuse             2     No
ھم            They               3     No
الفٮات        Categories         1     No
اجزاء         Parts              1     No
ظھر           Appear             1     No
ينتقل         Transfer           1     No
اسم           Name               23    No
تحدث          Happen             1     No
استقال        Quit               1     No
حصلت          Obtain             1     No
يتم           Complete           4     No
بلغت          Reach              2     No
حدث           Happen             3     No
قتل           Kill               2     No
تفعل          Do                 1     No
مات           Die                5     No
اين           Where              2     Yes
يمكن          Can                2     No
تحول          Convert            1     No
كانت          Was                3     No
بني           Build              2     No
ماذا          What               5     Yes
االسباب        Reasons            1     No
اطلقت         Launched           2     No
تقع           Located            19    No
اسماء         Names              5     No
طرق           Methods            1     No
يعمل          Works              2     No
تنحى          Step aside         1     No
تولى          Took over          1     No
اندلعت        Broke out          2     No
قام           Started            3     No
دخلت          Entered            1     No
انتقل         Moved              1     No
الفاٮز        Winner             1     No
جنسيه         Nationality        1     No
ولد           Born               5     No
تاسست         Established        3     No
وصل           Reached            1     No
حل            Settle             1     No
تعني          Mean               1     No
فريق          Team               1     No
يقوم          Do/work            1     No
ولدت          Born               2     No
تجري          Take place         1     No
اصبحت         Became             2     No
دفعت          Paid               1     No
بلغ           Reached            1     No
توجد          Located            6     No
تاسس          Founded            1     No
توغلت         Penetrated         1     No
يقع           Located            4     No
تنتج          Produce            3     No
حاله          Case               1     No
تبيع          Sell               1     No
افتتح         Opened             1     No
توفي          Died               3     No
الذين         Those              1     No
انعقد         Was held           1     No
وقع           Happened           2     No
تقام          Held               1     No
بدات          Started            1     No
عقد           Was held           2     No
عقدت          Was held           1     No
حطت           Landed             1     No
االفرقه        Teams              1     No
تمت           Done               2     No
متى           When               1     Yes
اصل           Origin             1     No
يصب           Pour               1     No
مره           Number of times    2     No
تبلغ          Reaches            1     No
صدر           Released           1     No
اصبح          Become             1     No
اصطدمت        Bumped             1     No
الشركتين      The two companies  1     No
باع           Sold               1     No
استغرق        Took               1     No
نجح           Succeeded          1     No
من            Of                 4     No
عاش           Lived              1     No
اخترع         Invented           1     No
اطيح          Dropped            1     No
اقيمت         Established        2     No
جرت           Took place         1     No
اقيم          Established        1     No

IV. EXPERIMENTED CLASSIFIERS

In this paper, the task of question classification was performed using three classifiers: Multinomial Naïve Bayes (MNB), Decision Trees (DT), and Support Vector Machine (SVM).

1. Multinomial Naïve Bayes (MNB):

The Naïve Bayes classifier was selected because it works well on small datasets and is computationally fast. There are several extensions of NB classifiers; Multinomial NB (MNB) is one that works with discrete features. MNB is a widely used text classifier and is useful when term frequency matters. Generally, NB classifiers rely on the assumption that the features are independent given the class. Equation 1 is Bayes' theorem:

P(c|q) = P(q|c) P(c) / P(q)                (1)

where P(c|q) is the posterior probability, i.e. the probability of the class given the question, P(q|c) is the likelihood, i.e. the probability of the question given the class, P(c) is the prior, i.e. the probability of the class, and P(q) is the normalization constant, i.e. the probability of the question. Generally, the normalization constant is neglected.

2. Decision Trees (DT):

A Decision Tree (DT) is a non-parametric supervised tree-based machine learning algorithm that can be used for classification and regression problems. DT can perform multi-class classification, which is the case in this paper. It can handle both categorical and numerical data and can map non-linear relationships among features. A DT predicts classes by learning simple decision rules from the training examples.

3. Support Vector Machine (SVM):

Support Vector Machine (SVM) is a supervised machine learning algorithm that works by fitting a boundary to a region of training examples that are alike. SVM is known to perform well on small datasets. It also works well in high-dimensional spaces, and natural text is typically high dimensional. This paper experiments with the linear SVM classifier.

V. WEIGHTING SCHEMES

Before feeding the questions to the classifiers, they are represented using two different schemes: Term Frequency (TF) and TF-Inverse Document Frequency (TF-IDF). TF is the number of times each term occurs in a document, while IDF reflects the number of documents a term appears in; it increases the weights of infrequent terms while decreasing the weights of frequent terms. TF and IDF can be combined by multiplying their values, to adjust the frequency of a term for how rarely it is used. Thus, TF is used to measure frequency, while TF-IDF is used to measure relevancy; in other words, terms that are frequent may
not be relevant, such as stop-words. Equation 2 is the formula of IDF for a term t:

IDF_t = log(N / n_t)                       (2)

where N is the total number of documents (training examples) and n_t is the number of documents the term t appears in. Equation 3 is the formula for TF-IDF:

TF-IDF_t,d = TF_t,d × IDF_t                (3)

where TF_t,d is the frequency of the term t in document d and IDF_t is the IDF of the term t.

VI. EVALUATION MEASURES

The measurements used to compute the experimental results were precision, recall, and F1-score. Precision (equation 4) is the ratio of correctly predicted examples among the retrieved examples, recall (equation 5) is the ratio of correctly predicted examples among the total number of relevant examples, and the F1-score (equation 6) models accuracy by combining both precision and recall. Three types of averages are taken for each measurement: macro-average, micro-average, and weighted average.

P = TP / (TP + FP)                         (4)

R = TP / (TP + FN)                         (5)

F1 = 2 × P × R / (P + R)                   (6)

where P is the precision, R is the recall, TP is the number of true positive examples, FP is the number of false positive examples, FN is the number of false negative examples, and F1 is the F1-score.

Micro-average computes the average over all classes by summing all their contributions globally; macro-average computes each measure independently for each class and then takes the average; and weighted average computes each measure independently for each class and then averages it, weighting each class by its number of instances. The two measures that matter most in this multi-class problem are the micro- and weighted averages, because macro-average treats all classes equally while the other two do not. These measurements were chosen because the dataset is unbalanced: the number of questions per class differs considerably, as demonstrated in Table 8.

VII. EXPERIMENT AND EXPERIMENTAL RESULTS

Twelve experiments were conducted on the updated version of the CLEF dataset, for which three forms of the dataset were prepared:

1. The questions were preprocessed by first normalizing them, and then removing punctuation marks and diacritics, keeping the stop-words.
2. The questions were preprocessed by first normalizing them, and then removing punctuation marks, diacritics, and also stop-words.
3. The questions were preprocessed by first normalizing them, and then removing punctuation marks, diacritics, and stop-words, with stemming added to this step.

Normalization was conducted on all three forms of the dataset. The goal of normalization is to standardize letters; for example, Alif (ا), which has multiple forms (أ، إ، آ), is transformed into the bare Alif (ا). Another letter is TA (ة), which is sometimes written as HA (ه); therefore TA is transformed to HA. There is also Hamza (ء), which may appear on Alif Maqsoora (ى) as (ئ) or on Waw (و) as (ؤ); the (ء) is removed in both cases. Punctuation and diacritics were removed in all three forms of the dataset as well.
The difference between the first and second forms of the preprocessed dataset is whether the stop-words are kept or removed. Stop-word removal is performed by checking a lexicon of normalized Arabic stop-words; however, the first two tokens of each question were not checked for stop-word removal, because (من), which is an IW when it comes as a first token, can be considered a stop-word when it comes in the middle of the question.
The difference between the second and third forms lies in adding stemming to the third form. The third form differs from the first form in two things: it has no stop-words, while the first form has, and its tokens are stemmed, while the tokens in the first form are not. Stemming is done using the shallow stemmer ISRI.
There is no significant change in the number of unique tokens between the first and second forms of the preprocessed dataset: before removing stop-words the number was 1732, and after removing them it became 1719, so only 13 unique tokens were removed. After stemming, the number of unique tokens became 1243. Table 7 shows these numbers.

TABLE VII. THE NUMBER OF UNIQUE TOKENS IN EACH FORM OF THE PREPROCESSED DATASET

PREPROCESSED DATASET  PREPROCESSING PERFORMED                                  NUMBER OF UNIQUE TOKENS
FORM1                 Normalization, Punctuation Removal, Diacritics Removal   1732
FORM2                 Normalization, Punctuation Removal, Diacritics Removal,  1719
                      Stop-words Removal
FORM3                 Normalization, Punctuation Removal, Diacritics Removal,  1243
                      Stop-words Removal, Stemming

To experiment on the three forms of the preprocessed dataset, three classifiers were used (MNB, DT, and SVM), each time with one of the two weighting schemes (TF and TF-IDF). The dataset was divided into training and testing sets, with 400 questions used for training and 200 questions for testing; Table 8 shows the number of questions under each class in the training and testing sets, and Tables 9 to 14 illustrate the results of the three classifiers on the three forms of the preprocessed dataset.

TABLE VIII. NUMBER OF QUESTIONS UNDER EACH CLASS IN THE TRAINING AND TESTING SETS

CLASS        TRAINING SET  TESTING SET
CASUAL       4             1
DEFINITION   50            10
DESCRIPTION  13            3
ENTITY       74            33
HUMAN        69            41
LIST         18            3
LOCATION     78            50
NUMERIC      94            59
TOTAL        400           200

micro avg     0.72 0.72 0.79   0.72 0.72 0.79   0.73 0.73 0.79
macro avg     0.55 0.51 0.58   0.59 0.58 0.62   0.55 0.50 0.58
weighted avg  0.78 0.79 0.81   0.72 0.72 0.79   0.74 0.74 0.79

TABLE XIV. RESULTS OF THE THREE CLASSIFIERS USING TF-IDF WEIGHTING ON THE DATASET FORM3
applying a set of preprocessing steps on the dataset, consisting of removing punctuation, diacritics, and stop-words, then performing tokenization. Their dataset consisted of translated questions from both CLEF and TREC, having 800 and 1500 question-answer pairs, respectively. Note that the Arabic and Li and Roth taxonomies both contain two numeric classes (time and number), while in our proposed taxonomy time and number fall under one class named numeric. The Arabic taxonomy does not have the classes definition and entity, while the Li and Roth taxonomy and ours do. The three taxonomies share the classes human (or person), location, and description. However, Li and Roth have the class abbreviation, which is used neither in the Arabic taxonomy nor in ours, while we have proposed two other classes: casual and list.

TABLE XV. EXPERIMENTED CLASSES IN [10] AND PROPOSED CLASSES

TAXONOMY         EXPERIMENTED CLASSES
ARABIC TAXONOMY  human, description, location, time, and number
LI AND ROTH      abbreviation, definition, description, location (city, country, and other location), person, time, number, entity, other
OUR EXPERIMENT   casual, definition, description, entity, human, list, location, numeric

The results show that the three classifiers with the Arabic taxonomy outperformed the same classifiers in our experiment. Note that the difference is not significant for the NB classifiers: our F1-score is 1% lower than that of the Arabic taxonomy and equal to that of the Li and Roth taxonomy. However, DT in our experiment outperformed DT with the Li and Roth taxonomy, with an F1-score of 74% versus 66%. On the other hand, the SVM classifier reached 90% using the Arabic taxonomy, a difference of 9% from our experiment, and only 2% between Li and Roth and our experiment. The results obtained in our experiment are promising, especially since our dataset contains only 600 records, compared to the other two experiments in Table 16. Therefore, our intention is to increase the size of the data in future work by manually updating and labeling the translated TREC dataset, as done in this paper.

TABLE XVI. EXPERIMENTAL RESULTS FOR [10] AND THE EXPERIMENT OF THE CURRENT PAPER

TAXONOMY         NB   DT   SVM  TOTAL NUMBER OF QUESTIONS
ARABIC TAXONOMY  79%  81%  90%  2300
LI AND ROTH      78%  66%  83%  2300
OUR EXPERIMENT   78%  74%  81%  600

IX. CONCLUSION AND FUTURE WORK

This paper presented comparative Arabic question classification experiments on an updated version of the translated CLEF dataset, which was labeled manually using eight classes: casual, definition, description, entity, human, list, location, and numeric. The experiments were conducted using three classifiers (MNB, DT, and SVM), after applying a number of preprocessing steps to the dataset and creating three versions of it that differ in the applied preprocessing steps. The best results were obtained after performing all the preprocessing steps and using the TF-IDF weighting scheme, with F1-scores of 78%, 74%, and 81% for the classifiers MNB, DT, and SVM, respectively. Future work will focus on conducting similar experiments on a larger dataset.

X. REFERENCES

[1] R. K. Santosh and K. Shaalan, "A review and future perspectives of Arabic question answering systems," IEEE Transactions on Knowledge and Data Engineering, pp. 3169-3190, 2016.
[2] K. C. Ryding, A Reference Grammar of Modern Standard Arabic, Cambridge University Press, 2005.
[3] C. C. Aggarwal and C. Zhai, "A survey of text classification algorithms," in Mining Text Data, Springer, 2012, pp. 163-222.
[4] B. Agarwal and N. Mittal, "Text classification using machine learning methods - a survey," in Proceedings of the Second International Conference on Soft Computing for Problem Solving (SocProS 2012), Springer, 2014.
[5] C. P. Rose, A. Roque, D. Bhembe and K. Vanlehn, "A hybrid text classification approach for analysis of student essays," in Proceedings of the HLT-NAACL 03 Workshop on Building Educational Applications Using Natural Language Processing - Volume 2, 2003.
[6] H. M. Al Chalabi, S. K. Ray and K. Shaalan, "Question classification for Arabic question answering systems," in 2015 International Conference on Information and Communication Technology Research (ICTRC), IEEE, 2015.
[7] E. Al-Shawakfa, "A rule-based approach to understand questions in Arabic question answering," Jordanian Journal of Computers and Information Technology, vol. 2, pp. 210-231, 2016.
[8] I. Lahbari, S. E. A. Ouatik and K. A. Zidani, "A rule-based method for Arabic question classification," in 2017 International Conference on Wireless Networks and Mobile Communications (WINCOM), IEEE, 2017.
[9] X. Li and D. Roth, "Learning question classifiers," in Proceedings of the 19th International Conference on Computational Linguistics - Volume 1, 2002.
[10] I. Lahbari, S. O. El Alaoui and K. A. Zidani, "Toward a new Arabic question answering system," International Arab Journal of Information Technology (IAJIT), vol. 15, pp. 610-619, 2018.
[11] A. Aouichat, M. S. H. Ameur and A. Geussoum, "Arabic question classification using Support Vector Machines and Convolutional Neural Networks," in International Conference on Applications of Natural Language to Information Systems, Springer, 2018.
[12] A. Aouichat and A. Guessoum, "Building TALAA-AFAQ, a corpus of Arabic factoid question-answers for a question answering system," in International Conference on Applications of Natural Language to Information Systems, Springer, 2017.
Arabic Text Classification of News Articles Using
Classical Supervised Classifiers
Leen Al Qadi, Hozayfa El Rifai, Safa Obaid, and Ashraf Elnagar
Dept. of Computer Science
University of Sharjah
Sharjah, UAE
ashraf@sharjah.ac.ae
Abstract—Automatic document categorization gains more importance in view of the plethora of textual documents added constantly on the web. Text categorization, or classification, is the process of automatically tagging a textual document with the most relevant label. Text categorization for the Arabic language is interesting given the absence of large and free datasets. Our objective is to automatically identify the category of a document based on its linguistic features. To achieve this goal, we constructed a new dataset which contains almost 90k Arabic news articles with their tags from Arabic news portals. The dataset shall be made freely available to the research community on Arabic computational linguistics. The dataset has four main categories: Business, Sports, Technology and Middle East. Each collected article was cleaned of Latin characters, numbers, punctuation and stop words. To investigate the effectiveness of the dataset, we used an array of classical supervised machine learning classifiers; namely, the following 10 popular classifiers: Logistic Regression, Nearest Centroid, Decision Tree (DT), Support Vector Machines (SVM), K-Nearest Neighbors (KNN), XGBoost Classifier, Random Forest Classifier, Multinomial Classifier, Ada-Boost Classifier, and Multi-Layer Perceptron (MLP). In pursuit of high accuracy, we implemented an ensemble model that combines the best classifiers in a majority-voting classifier. Our experimental results showed solid performance, with a minimum F1-score of 87.7%, achieved by Ada-Boost, and a top performance of 97.9%, achieved by SVM. The experimental results are presented in terms of confusion matrices, F1-scores, and accuracy.

Index Terms—Arabic Text Classification, Single-Label Classification, Arabic Dataset, Shallow Learning Classifiers.

I. INTRODUCTION

Due to the heavy usage of the Internet and Web 2.0, enormous amounts of repositories have arisen. The increasing number of these repositories of online documents resulted in a growing demand for automatic categorization algorithms. The majority of the generated data is in textual form, which is highly unstructured in nature yet extremely rich in information. Extracting insights from such data can be hard and time-consuming, so machine-learning algorithms are used to organize massive chunks of the data and perform a number of automated tasks. Text classification is a fundamental task in NLP (Natural Language Processing) that is used for assigning tags to text and classifying it under categories based on its content. Classifying huge textual data standardizes the platform, makes searching for information much easier and more feasible, and improves and simplifies the overall experience of automated navigation.
Nowadays, manual classification done by experts is not so fruitful due to the large number of text documents. As a result, automated classifiers utilizing machine learning algorithms have proven to be more effective and a great alternative. Many applications of text categorization have been explored, such as sentiment analysis [1]-[5], spam filtering [6], [7], language identification [8], dialect identification [9], and many more.
Using machine learning for structuring data is especially helpful in the field of business. It enhances decision-making and automates processes, yielding faster results. For instance, marketers can research, collect and analyze keywords used by competitors.
The Arabic language is the mother tongue of more than 300 million people, and it is one of the languages that present significant challenges to many NLP applications; it is a highly inflected and derivational language. The scale of Arabic computational linguistic research work is now orders of magnitude beyond what was available a decade ago, but it still has much room to grow.
The statistics reported by Internet World Stats show that Arabic is the fourth most popular language online by share of Internet users, with an estimated 226,595,470 Arabic Internet users, representing 5.2% of all the world's Internet users as of April 2019. Moreover, out of 444,016,517 Arabic-speaking people (as estimated in 2019), 51.0% use the Internet. The highest growth rate in the number of online users among all languages in the last nineteen years was for the Arabic language, reaching 8,917.3%. In our work context, we constructed a dataset of Arabic news articles scraped from multiple websites for the purpose of our research. 10 classifiers were implemented to predict the most probable class an article should belong to. In addition, we implemented a voting classifier, which takes into account the classifiers that gave the best accuracy scores while predicting the label.
An automatic Arabic news article labeling system extracts features from the articles using the TF-IDF technique. After turning each article into a feature vector, it identifies which features are most common under which class (in the training phase). This helps the classifier, when encountering a new article, to predict which class it falls under after turning it into a feature vector.
We propose a single-class text classifier and the objective
TABLE II. COMPARISON BETWEEN TF-IDF VECTORIZER AND COUNTVECTORIZER.
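Table II (whose body is not preserved here) contrasts scikit-learn's two common text vectorizers: CountVectorizer, which produces raw term counts, and TfidfVectorizer, which rescales those counts by inverse document frequency. A minimal sketch of the difference, using made-up English sentences in place of the paper's Arabic articles:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus standing in for the news articles (the real input is Arabic text).
docs = [
    "striker scores goal in final",
    "market shares fall as profits fall",
    "striker signs for rival club",
]

count_vec = CountVectorizer()
counts = count_vec.fit_transform(docs)   # sparse matrix of raw term counts

tfidf_vec = TfidfVectorizer()
tfidf = tfidf_vec.fit_transform(docs)    # counts re-weighted by IDF, L2-normalized rows

# Both build the same vocabulary; only the cell values differ. For example,
# "fall" occurs twice in the second document, so its raw count there is 2,
# while its TF-IDF weight also reflects how many documents contain it.
```

Either matrix can be fed to the classifiers of Section III; the paper's feature extraction uses the TF-IDF variant.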
gradient boosting algorithm.
7) Multi-Layer Perceptron (MLP): This is a supervised classifier. It consists of three (or more) layers of neuron nodes: an input and an output layer with one or more hidden layers. Each node of one layer is connected to the nodes of the next layer and uses a non-linear activation function to produce output.
8) KNeighbors Classifier: This is a supervised classifier. To classify a given data point, we consider a number of nearest neighbors of this point. Each neighbor votes for a class, and the class with the most votes is taken as the prediction; in other words, the majority vote of the point's neighbors determines the class of this point.
9) Nearest Centroid Classifier: This is a supervised classifier. It is a parameter-free algorithm where each class is represented by the centroid of its members. It assigns to tested articles the label of the class of training samples whose mean (centroid) is closest to the article.
10) AdaBoost Classifier: This is a supervised classifier. It is a meta-estimator that begins by fitting a classifier on the original dataset and then fits additional copies of the classifier on the same dataset, but with the weights of incorrectly classified instances adjusted such that subsequent classifiers focus more on difficult cases.
11) Voting Classifier: A very interesting ensemble solution. It is not an actual classifier but a wrapper for a set of different classifiers. The final decision on a prediction is taken by majority vote.

Fig. 3. Confusion Matrix for the worst classifier.
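The majority-vote wrapper described in item 11 maps directly onto scikit-learn's VotingClassifier. A minimal sketch; the member classifiers and the synthetic data below are illustrative, not the paper's exact configuration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

# Stand-in feature matrix; in the paper each row would be a TF-IDF article vector
# and each label one of the four news categories.
X, y = make_classification(n_samples=300, n_features=20, n_informative=10,
                           n_classes=4, random_state=0)

# Hard voting: each fitted classifier casts one vote per article and the
# majority label wins, mirroring the wrapper described above.
vote = VotingClassifier(estimators=[
    ("lr", LogisticRegression(max_iter=1000)),
    ("svm", LinearSVC()),
    ("rf", RandomForestClassifier(random_state=0)),
], voting="hard")

vote.fit(X, y)
preds = vote.predict(X)
acc = vote.score(X, y)  # training accuracy of the ensemble
```

With voting="hard" only the members' predicted labels are needed, so classifiers without probability estimates (such as LinearSVC) can participate directly.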
IV. EXPERIMENTAL RESULTS AND DISCUSSION
A. Setup and Pre-processing
Our objective is to explore the success of using 11 different classifiers to classify Arabic news categories. Our experiments involve single-label classification on the collected dataset and comparing the results of using the same classifiers with another recently reported dataset, 'Akhbarona' [28], which has 7 different categories. We split our constructed dataset into 80% for training and 20% for testing. All classifiers were trained on the training set, which consists of 71,707 labeled articles, then tested on the testing set, which consists of 17,432 articles.
To evaluate the performance of our classifiers, we report the accuracy score, which is simply expressed as the ratio of correctly classified articles. The number of extracted features from our training set is more than 344k features.
Furthermore, text pre-processing is used to clean the dataset by removing all the non-Arabic content. This approach is highly recommended when dealing with text collected from the web. The next step is to clean all the scraped articles by removing elongation, punctuation, Arabic digits, isolated chars, qur'anic symbols, Latin letters, and other marks.
Although most of the research works on Arabic computational linguistics apply normalization to the collected text, we believe this step is not necessary. The dataset provides enough samples for each Arabic character. In fact, normalization can even affect the meaning of some Arabic words.
Fig. 4. Confusion Matrix for the best classifier.
B. Text Classification
We implemented all the classifiers using Scikit-learn, mostly using the default hyper-parameters as a black-box, with an L1 penalty for some of the classifiers. We tested the proposed classifiers on the testing set. The accuracy scores are high and clearly show the strength of the system as well as the suitability of the hyper-parameters used with each classifier.
Table III shows the precision, recall, and F1-score measures for each of the tested classifiers on our dataset. Accuracy scores are almost the same as the F1-scores. The average of the accuracy scores is 94.8%. The SVM classifier produced the best result of 97.9%, while the Ada-Boost classifier produced the worst result of 87.7%. Furthermore, four classifiers produced close results between 97.5% and 97.9%. For the rest of the classifiers, two classifiers (MultinomialNB and KNeighbors)
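The setup described above — extracting term features, an 80/20 train/test split, and accuracy as the ratio of correctly classified articles — can be sketched with Scikit-learn, which the authors report using with mostly default hyper-parameters; the tiny corpus and labels here are stand-ins for the real dataset of ~89k Arabic articles:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Stand-in corpus; the real dataset has 71,707 training and 17,432 test articles.
articles = ["economy market trade", "football match goal",
            "economy bank trade", "football league goal"] * 10
labels = ["econ", "sport", "econ", "sport"] * 10

# 80% training / 20% testing, as in the paper.
X_tr, X_te, y_tr, y_te = train_test_split(
    articles, labels, test_size=0.2, random_state=0, stratify=labels)

clf = make_pipeline(TfidfVectorizer(), LinearSVC())  # features + linear SVM
clf.fit(X_tr, y_tr)

# Accuracy = ratio of correctly classified test articles.
accuracy = clf.score(X_te, y_te)
```

Swapping `LinearSVC` for any of the other ten classifier classes reproduces the comparison in the same pipeline.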
TABLE III. ACCURACY METRICS FOR CLASSIFIERS TESTING ON OUR DATASET.
97%. We also used the voting classifier, hoping to improve the accuracy by using a majority vote of ten classifiers. However, its result is comparable to the SVM classifier.
A further investigation has taken place to check the robustness of our proposed system. We trained and tested the classifiers on the recently reported "Akhbarona" dataset. The number of classes is increased to seven. The results were as good as on our dataset, and the SVM classifier scored the highest. In the future, we intend to increase the number of classes in our dataset. We have also shown the need for multi-label text classification, which we will start on soon.

TABLE IV. ACCURACY METRICS FOR CLASSIFIERS TESTING ON THE AKHBARONA DATASET.

Algorithm            Precision  Recall  F1-score
Logistic Regression    0.94      0.94     0.94
SVC                    0.94      0.94     0.94
DT Classifier          0.83      0.83     0.83
Multinomial NB         0.91      0.88     0.88
XGB Classifier         0.89      0.88     0.88
KNN Classifier         0.91      0.91     0.91
RF Classifier          0.88      0.88     0.88
Nearest Centroid       0.89      0.86     0.87
Ada-Boost Classifier   0.80      0.78     0.78
MLP Classifier         0.94      0.94     0.94
Voting Classifier      0.94      0.94     0.94

REFERENCES
[1] A. Elnagar and O. Einea, "Brad 1.0: Book reviews in arabic dataset," in 2016 IEEE/ACS 13th International Conference of Computer Systems and Applications (AICCSA), Nov 2016, pp. 1–8.
[2] A. Dahou, S. Xiong, J. Zhou, M. H. Haddoud, and P. Duan, "Word embeddings and convolutional neural network for Arabic sentiment classification," in Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, Dec. 2016, pp. 2418–2427.
[3] A. A. Altowayan and A. Elnagar, "Improving arabic sentiment analysis with sentiment-specific embeddings," in 2017 IEEE International Conference on Big Data (Big Data), IEEE, 2017, pp. 4314–4320.
[4] A. Elnagar, Y. S. Khalifa, and A. Einea, Hotel Arabic-Reviews Dataset Construction for Sentiment Analysis Applications. Springer International Publishing, 2018, pp. 35–52.
[5] A. Elnagar, L. Lulu, and O. Einea, "An annotated huge dataset for standard and colloquial arabic reviews for subjective sentiment analysis," Procedia Computer Science, vol. 142, pp. 182–189, 2018, Arabic Computational Linguistics.
[6] A. Al-alwani and M. Beseiso, "Arabic spam filtering using bayesian model," International Journal of Computer Applications, vol. 79, no. 7, pp. 11–14, October 2013.
[7] Y. Li, X. Nie, and R. Huang, "Web spam classification method based on deep belief networks," Expert Systems with Applications, vol. 96, pp. 261–270, 2018.
[8] S. Malmasi and M. Dras, "Language identification using classifier ensembles," in Proceedings of the Joint Workshop on Language Technology for Closely Related Languages, Varieties and Dialects, Hissar, Bulgaria: Association for Computational Linguistics, Sep. 2015, pp. 35–43. [Online]. Available: https://www.aclweb.org/anthology/W15-5407
[9] L. Lulu and A. Elnagar, "Automatic arabic dialect classification using deep learning models," Procedia Computer Science, vol. 142, pp. 262–269, 2018, Arabic Computational Linguistics.
[10] C. C. Aggarwal and C. Zhai, A Survey of Text Classification Algorithms, 2012, pp. 163–222.
[11] I. Hmeidi, M. Al-Ayyoub, N. A. Mahyoub, and M. A. Shehab, "A lexicon based approach for classifying arabic multi-labeled text," International Journal of Web Information Systems, vol. 12, no. 4, pp. 504–532, 2016.
[12] A. al Sbou, "A survey of arabic text classification models," International Journal of Electrical and Computer Engineering, vol. 8, pp. 4352–4355, Dec. 2018.
[13] A. El-Halees, "A comparative study on arabic text classification," Egyptian Computer Science Journal, vol. 30, Jan. 2008.
[14] R. Al-shalabi and R. Obeidat, "Improving knn arabic text classification with n-grams based document indexing," in Proceedings of the 6th International Conference on Informatics and Systems (INFOS2008), 2008, pp. 108–112.
[15] G. Raho, R. Al-Shalabi, G. Kanaan, and A. Nassar, "Different classification algorithms based on arabic text classification: Feature selection comparative study," International Journal of Advanced Computer Science and Applications, vol. 6, no. 2, 2015. [Online]. Available: http://dx.doi.org/10.14569/IJACSA.2015.060228
[16] A. M. A. Mesleh, "Chi square feature extraction based svms arabic language text categorization system," Journal of Computer Science, vol. 3, no. 6, pp. 430–435, 2007.
[17] B. Hawashin, A. Mansour, and S. Aljawarneh, "An efficient feature selection method for arabic text classification," International Journal of Computer Applications, vol. 83, pp. 1–6, Dec. 2013.
[18] N. Alalyani and S. L. Marie-Sainte, "Nada: New arabic dataset for text classification," International Journal of Advanced Computer Science and Applications, vol. 9, no. 9, 2018. [Online]. Available: http://dx.doi.org/10.14569/IJACSA.2018.090928
[19] I. Abu El-Khair, "1.5 billion words arabic corpus," Nov. 2016.
[20] T. Gonçalves and P. Quaresma, "The impact of nlp techniques in the multilabel text classification problem," in Intelligent Information Processing and Web Mining, M. A. Kłopotek, S. T. Wierzchoń, and K. Trojanowski, Eds., 2004, pp. 424–428.
[21] F. Harrag, E. El-Qawasmeh, and P. Pichappan, "Improving arabic text categorization using decision trees," in 2009 First International Conference on Networked Digital Technologies, July 2009, pp. 110–115.
[22] M. El Kourdi, A. Bensaid, and T.-e. Rachidi, "Automatic arabic document categorization based on the naïve bayes algorithm," Aug. 2004.
[23] M. Bawaneh, M. Alkoffash, and A. Alrabea, "Arabic text classification using k-nn and naive bayes," Journal of Computer Science, vol. 4, July 2008.
[24] S. Alsaleem, "Automated arabic text categorization using svm and nb," International Arab Journal of e-Technology, vol. 2, no. 2, pp. 124–128, June 2011.
[25] A. Mohammad, T. Alwadan, and O. Almomani, "Arabic text categorization using support vector machine, naïve bayes and neural network," GSTF Journal on Computing (JoC), vol. 5, Sep. 2016.
[26] M. Biniz, S. Boukil, F. El Adnani, L. Cherrat, and A. Elmajid El Moutaouakkil, "Arabic text classification using deep learning technics," International Journal of Grid and Distributed Computing, vol. 11, pp. 103–114, Sep. 2018.
[27] A. Elnagar, O. Einea, and R. A. Debsi, "Automatic text tagging of arabic news articles using ensemble deep learning models," in Proceedings of the 3rd International Conference on Natural Language and Speech Processing, Sep. 2019.
[28] O. Einea, A. Elnagar, and R. A. Debsi, "Sanad: Single-label arabic news articles dataset for automatic text categorization," Data in Brief, p. 104076, 2019. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S2352340919304305
Graph-Based Arabic Key-phrases Extraction

Dana Halabi
Department of Computer Science
Princess Sumaya University for Technology
Amman, Jordan
d3hhalabi@yahoo.com

Arafat Awajan
Department of Computer Science
Princess Sumaya University for Technology
Amman, Jordan
awajan@psut.edu.jo
Abstract— This paper proposes Arabic key-phrases extraction using graph representation. The proposed approach is based on representing the text of an individual document as a graph, where the nodes within the graph hold the words' stems and the edges represent the co-occurrence relation between stems within a specific window size. After building the graph, graph-based centrality measures are used to rank the nodes according to their importance. The ranking results are then sorted in descending order to determine the top n nodes. The stems represented by the top n nodes are considered the key-stems of the individual document. The performance of our work is measured using three accuracy measures: Precision, Recall, and F-Measure. The obtained results reached 54%, 82% and 64% for Precision, Recall, and F-measure respectively.

Keywords— Natural language processing, Arabic, key-phrases extraction, graph, ranking, centrality measures

I. INTRODUCTION

Recently, 4.7% of internet users are Arabic speakers, which will drive an increasing amount of Arabic content on the web [6]. This yields the need for efficient ways to extract information and knowledge from the available amount of data. Key-phrases extraction is a useful Natural Language Processing (NLP) task that can be used in NLP-related tasks such as automatic document summarization, information retrieval, search engines, etc.

Currently, there is limited research related to keywords extraction from Arabic content. In this work, we propose a new approach based on a weighted graph. The main idea of the proposed work is to represent the document as a graph, in which the nodes hold the candidate stems and the weighted edges represent the frequency co-occurrence relation between the connected nodes within a predetermined window size. Centrality measures are used to analyze the network and rank the N-top most important candidate nodes (stems) in the graph, which are considered key-stems. These key-stems are then used to extract the keywords and key-phrases.

In this paper, section two presents the related work. Section three explains the basic theoretical background of graph theory. Next, section four illustrates the proposed application. Section five presents the experiments and evaluation. Finally, the conclusion and future work are given in section six.

II. RELATED WORKS

Most of the work on Arabic key-phrases extraction has depended on the use of statistical methods (supervised and unsupervised methods). Some approaches, like Al-Kabi et al. [1], were based on building a co-occurrence matrix for the most frequent terms and used the χ2 and TF-ITF measures; the terms with high χ2 were considered to be keywords. Awajan [2] proposed an unsupervised two-phase approach: in phase one, the author detected all N-grams for the possible candidate keywords; in phase two, he used a morphological analyzer to calculate the frequency of each N-gram term based on the root and stem of terms. Awajan [3] proposed a new technique based on a vector space model to compute the most frequent N-grams in the text; in addition to counting the frequency of terms within the document, the final frequency of N-grams within a document depends on their weight and degree. El-Shishtawy et al. [4] presented a supervised learning method for extracting key-phrases from a document based on linguistic knowledge and an annotated Arabic corpus; they used syntactic rules based on Part Of Speech (POS) tags to extract the key-phrases. Suleiman et al. [19] proposed Arabic keywords extraction based on bag-of-concepts to extract keywords from the text and used a semantic vector space model to group synonym words into classes. Although the tested dataset had only three documents, the proposed method showed significant results.

In this work, we propose an Arabic keywords extraction approach based on a simple weighted graph. The main idea is to convert the sequence of words in the sentences of the document into a simple graph whose nodes represent the terms and whose edges represent the frequency co-occurrence relation.

Representing the document as a graph was introduced by Mihalcea and Tarau (2004) [8] for English content. In their work, they introduced a ranking algorithm called TextRank based on the PageRank ranking algorithm proposed by Brin et al. [13]. The TextRank model considered words as lexical units represented as nodes in an undirected weighted graph; the edges in the graph represented the co-occurrence relation between words. The best result achieved by Mihalcea and Tarau [8] was 31% for precision. Litvak and Last (2008) [15] also represented the text as a graph, using Hyperlink-Induced Topic Search (HITS) for ranking the nodes. Boudin represented the text as a graph and proposed a comparison of different centrality measures for key-phrase extraction; in his work, he recommended the use of the degree centrality measure for ranking the nodes [16].

Kim et al. [7] represented Korean content as a graph and applied the original PageRank algorithm for ranking the nodes within the graph. In their work, they achieved more than 71% for precision. For Arabic content, Al-Taani et al. [14]
will be the input for the graph in the next phase. In other words, it has a big effect on how the entire Arabic document is represented as a graph. Algorithm 1 presents the pseudo-code of the "Pre-Processing phase".

Algorithm 1. Pseudo-code of "Pre-processing phase".

Inputs:
• Arabic document, StopWords list, Punctuation list, Special Characters list

Outputs:
• sentence_no_stopwords: tokens for each sentence (a two-dimensional list)
• sentences_stems: stems for each token in each sentence (a two-dimensional list)
• sentences: the set of sentences before removing StopWords

Steps
1. Use punctuation* to split the text into a list of sentences
2. For each sentence in sentences Do
  2.1. Remove Punctuations
  2.2. Remove Special Characters
  2.3. Make a copy of the original sentence
  2.4. Remove StopWords from the new copy
  2.5. Add the new sentence without the stop words to a new list sentence_no_stopwords
  2.6. Split the sentence_no_stopwords into tokens
  2.7. POS-tag each token in the tokens
  2.8. Find tokens whose POS tag value ϵ {DTNN, NN, DTNNP, NNP} and save these tokens in a new list noun_tokens**
3. For each token in noun_tokens Do
  3.1. Compute stem(token)
  3.2. Add the stem to the noun_stems list
4. For each sentence in sentence_no_stopwords Do
  4.1. For each token in the tokens of the sentence Do
    4.1.1. If stem(token) not in noun_stems
      Remove the token from the sentence's tokens
    4.1.2. Otherwise
      Add stem(token) to sentences_stems

* In step 1, the punctuation marks used to split the text into sentences are the Arabic comma (،), the dot (.) and the Arabic question mark (؟).
** In step 2.8, the graph is built depending on the stems of only the nouns that occur in the Arabic document.

b. Phase two: Graph building and Ranking Phase

This phase is the core of the system. It has two main steps. In the first step, the undirected weighted graph G = (V, E) is created, where V holds the stems and E holds the edges between two stems that represent the co-occurrence relation between them if these two stems appear in the same window. The first time any two stems appear in the same window, the weight of the edge between them is initialized to one; then, each time these two stems appear again in the same window, the edge's weight is incremented by one.

For research purposes, the proposed system is tested with window sizes ranging from 2 to 10, in addition to a window whose size equals the exact number of stems in a sentence.

In the second step, the centrality (ranking) measure is applied to produce the N-top key-stems list. The number of key-stems should be determined before step 2; its default value is set to 10. The centrality (ranking) measure should be set to PageRank, Betweenness Centrality, Closeness Centrality or Degree Centrality. Algorithm 2 presents the pseudo-code of step one, "Build graph G", and Algorithm 3 presents the pseudo-code of step two, "Ranking graph G".

Algorithm 2. Pseudo-code of "Build graph G".

Inputs:
• sentence_no_stopwords: tokens for each sentence (a two-dimensional list)
• sentences_stems: stems for each token in each sentence (a two-dimensional list)
• win_size (Window Size): {2, 3, .., 10} U {A: Sentence}*

Outputs:
• G = {V, E}
• V = stems
• E = undirected co-occurrence relation between two stems in the same window
• pair_set: set containing the pairs of each stem and the token related to it from sentence_no_stopwords

Steps
1. For each sentence in sentences_stems Do
  1.1. For each stem in sentences_stems - 1 Do
    1.1.1. If stemi not in G, Then
      Add stemi to G
    1.1.2. If stemi+1 not in G, Then
      Add stemi+1 to G
    1.1.3. If edge (stemi, stemi+1) not in G, Then
      Add edge (stemi, stemi+1, weight = 1) to G
    1.1.4. Otherwise
      Update edge (stemi, stemi+1) [weight] += 1
2. Add pair(stemi, tokeni) to pair_set

* Window size: can range from 2 to 10, or it can be the whole sentence.
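Phase two — sliding a window over each sentence's stems, incrementing co-occurrence edge weights, then ranking nodes and keeping the top N — can be sketched without NetworkX (which the paper's implementation actually uses [17]). Degree centrality stands in here for the four measures, and counting every pair inside each overlapping window is a simplifying assumption about the paper's exact windowing:

```python
from itertools import combinations

def build_graph(sentences_stems, win_size=2):
    """Algorithm 2 sketch: undirected weighted co-occurrence graph.
    Edges are keyed by a frozenset of two stems; value = co-occurrence count."""
    edges = {}
    for stems in sentences_stems:
        for start in range(len(stems)):
            window = stems[start:start + win_size]       # co-occurrence window
            for a, b in combinations(window, 2):
                if a != b:
                    key = frozenset((a, b))
                    edges[key] = edges.get(key, 0) + 1   # weight += 1 on each re-occurrence
    return edges

def top_n_keystems(edges, n=10):
    """Algorithm 3 sketch with weighted degree centrality:
    rank nodes by total incident edge weight, keep the N best."""
    score = {}
    for pair, weight in edges.items():
        for node in pair:
            score[node] = score.get(node, 0) + weight
    ranked = sorted(score.items(), key=lambda kv: kv[1], reverse=True)
    return [node for node, _ in ranked[:n]]

# Two toy "sentences" of stems; with win_size=2 only adjacent stems are linked.
g = build_graph([["حق", "انسان", "حق"], ["انسان", "كرامة"]], win_size=2)
keystems = top_n_keystems(g, n=2)
```

Using PageRank, Betweenness or Closeness Centrality instead would simply swap the scoring step for the corresponding NetworkX call on the same weighted graph.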
Algorithm 3. Pseudo-code of "Ranking graph G".

Inputs:
• G = (V, E)
• Ranking_algorithm: PageRank (PR), Betweenness Centrality (BC), Closeness Centrality (CC) or Degree Centrality (DC)
• N: number of the top stems to be selected from the candidate key-stems

Outputs:
• prime list of N-top key-stems

Steps
1. Rank the nodes in G according to the centrality (ranking) measure
2. Sorted_list = sort the nodes in descending order according to their ranking value
3. Extract from the Sorted_list the N-top key-stems

c. Phase three: Post-Processing Phase

The last phase receives the N-top key-stems from phase two and transforms them to produce the most appropriate M-top key-phrases by replacing the stems with their surface forms as listed in the original document. Algorithm 4 presents the pseudo-code of the "Post-Processing phase".

Algorithm 4. Pseudo-code of "Post-Processing Phase".

Inputs:
• N-top key-stems list
• pair_set: set containing the pairs of each stem and the token related to it from sentence_no_stopwords
• sentences: original sentences list (output from phase 1)

Outputs:
• The final list of keywords and key-phrases

Steps
1. Extract from pair_set the tokens whose stems are in the N-top key-stems, and designate these tokens as keywords
2. Examine all keywords against the original sentence units to decide which tokens appear adjacent to each other
3. If two tokens are adjacent to each other, combine them to produce a key-phrase; a key-phrase can be two keywords or more
4. If the tokens are adjacent to each other, use the sentences list to check whether any deleted stop-words need to be restored to give the key-phrase a significant meaning
5. At the end, this step produces a list of keywords (one word) and a list of key-phrases (two, three or at most four words)

V. EXPERIMENT AND EVALUATION

A. Environment and Configuration Settings for Evaluation

The new system was tested against a dataset of 60 documents. Mostly, these documents were collected from the Aljazeera.net site. They were manually annotated for their key-phrases. Information about the dataset is summarized in Table 1.

TABLE I. DATASET INFORMATION (dataset of 60 documents)

Total number of tokens before removing stop words: 36317; average per document: 605
Total number of tokens after removing stop words: 25648; average per document: 427
Total number of unique stems: 9898; average per document: 165
Total number of manually annotated stems: 343; average per document: 6
Total number of automatic key stems: 526; average per document: 9

For each document, the manually annotated key-phrases were converted to a set of keywords; then the same stemmer used within the system was applied to extract the stems of the keywords. These stems represent the annotated-stems.

In order to evaluate the performance of the system, a comparison between the annotated-stems and the N-Top Key-stems (the output from step 2 of phase two, Ranking Graph) was conducted to compute the main measures of accuracy: Precision (P), Recall (R) and F-measure (F).

The testing results were generated for four centrality (ranking) measures: PageRank, Betweenness Centrality, Closeness Centrality, and Degree Centrality. The centrality measures were computed using the NetworkX package [17]. The effect of window size on the accuracy of the outputs was also tested.

B. The Document Test

In order to illustrate the detailed steps of the proposed system, a test document was selected: the Arabic version of the human rights declaration available at http://www.un.org/ar/documents/udhr/. The original document contains 1411 tokens (including stop-words and punctuation). The number of manually annotated stems equals 12. The default value of N-Top Key-stems was updated from 10 to 15. After converting the text of the document to the graph, we found 328 unique stems representing the nodes in the graph G. The graph G had different edge counts according to the window size. Table 2 summarizes the number of edges for each window size.
TABLE II. NUMBER OF EDGES FOR G WITH DIFFERENT WINDOW SIZES

Document Title: الإعلان العالمي لحقوق الإنسان (Universal Declaration of Human Rights)
Tokens = 1411, Nodes = 328, Key-stems = 15
Number of manually annotated stems = 12

Window Size: Number of Edges
2: 590
3: 1023
4: 1316
5: 1470
6: 1504
7: 1466
8: 1379
9: 1298
10: 1216
Sentence's length: 550

Figure 2 illustrates a visual representation of graph G, for a window size equal to 5, of a subgraph for the following part of the document:

لمّا كان الاعتراف بالكرامة المتأصلة في جميع أعضاء الأسرة البشرية
(Since recognition of the inherent dignity of all members of the human family)

The number of nodes (stems) is 6, namely {(word, stem): (الاعتراف, عرف), (بالكرامة, كرم), (المتأصلة, ءصل), (أعضاء, عضو), (الأسرة, ءسر), (البشرية, بشر)}, and the number of edges is 5.

TABLE III. P, R AND F VALUES FOR KEY-STEMS = 15 USING THE PR AND BC FOR THE SELECTED TEST DOCUMENT.

Window Size: PageRank (P R F) | Betweenness Centrality (P R F)
2: 60 75 67 | 53 67 59
3: 60 75 67 | 60 75 67
4: 47 58 52 | 47 58 52
5: 53 67 59 | 40 50 44
6: 60 75 67 | 47 58 52
7: 53 67 59 | 40 50 44
8: 47 58 52 | 47 58 52
9: 47 58 52 | 47 58 52
10: 40 50 44 | 40 50 44
Sentence's length: 40 50 44 | 33 42 37

TABLE IV. P, R AND F VALUES FOR KEY-STEMS = 15 USING THE CC AND DC FOR THE SELECTED TEST DOCUMENT.

Window Size: Closeness Centrality (P R F) | Degree Centrality (P R F)
2: 60 75 67 | 60 75 67
3: 60 75 67 | 60 75 67
4: 67 83 74 | 47 58 52
5: 60 75 67 | 53 67 59
6: 67 83 74 | 60 75 67
7: 47 58 52 | 53 67 59
8: 53 67 59 | 47 58 52
9: 40 50 44 | 47 58 52
10: 40 50 44 | 40 50 44
Sentence's length: 47 58 52 | 40 50 44
C. Evaluation

The results of the system evaluation are summarized in Tables V and VI. They hold the average values of P, R, and F over all documents in the dataset for all window size values.
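The comparison between the annotated-stems and the extracted N-top key-stems reduces to set overlap; a minimal sketch of the three measures (toy stem lists, illustrative only):

```python
def prf(annotated, extracted):
    """Precision, recall and F-measure of extracted key-stems
    against the manually annotated stems."""
    hits = len(set(annotated) & set(extracted))          # correctly extracted stems
    p = hits / len(extracted) if extracted else 0.0      # precision
    r = hits / len(annotated) if annotated else 0.0      # recall
    f = 2 * p * r / (p + r) if p + r else 0.0            # harmonic mean
    return p, r, f

# Two of the four extracted stems match the three annotated stems.
p, r, f = prf(annotated=["حق", "انسان", "كرامة"],
              extracted=["حق", "انسان", "عمل", "دول"])
```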
TABLE V. P, R AND F VALUES FOR KEY-STEMS = 15 USING THE PR AND BC FOR THE WHOLE DATASET (continued).

Window Size: PageRank (P R F) | Betweenness Centrality (P R F)
5: 50 76 60 | 46 70 55
6: 49 75 59 | 46 70 55
7: 48 72 57 | 43 65 51
8: 45 68 53 | 42 63 50
9: 44 67 53 | 41 63 50
10: 41 62 49 | 38 58 45
Sentence's length: 26 39 31 | 31 46 37

TABLE VI. P, R AND F VALUES FOR KEY-STEMS = 15 USING THE CC AND DC FOR THE WHOLE DATASET.

Window Size: Closeness Centrality (P R F) | Degree Centrality (P R F)
2: 46 71 55 | 54 82 64
3: 48 73 58 | 53 81 63
4: 48 72 58 | 52 79 62
5: 48 73 58 | 50 76 60
6: 48 72 57 | 48 74 58
7: 47 71 56 | 48 73 57
8: 46 69 54 | 46 70 55
9: 45 68 54 | 45 69 54
10: 45 68 54 | 43 65 52
Sentence's length: 41 61 48 | 32 48 39

In terms of precision values, Degree Centrality gives the best results with window size = 2, while PageRank gives very close results to Degree Centrality, followed by Closeness Centrality and Betweenness Centrality. In general, window sizes from 2 to 6 obtained the best performance for all centrality (ranking) measures. Furthermore, Degree Centrality, which refers to the number of ties a node has to other nodes, has the best time cost compared to the other centrality (ranking) measures.

D. Discussion

The keywords-extraction system presented in this work is a pure unsupervised method. It mainly depends on the content of the entire document. In other words, any word that does not appear in the document at least one time has no chance to be one of the candidate keywords. According to the way the graph is built (an acyclic weighted graph), the more a term is repeated in the document, the better its chance of being one of the keyword candidates. But in most cases, the keywords of a document usually include some terms or phrases that are not present in the treated document. For example, in our dataset there is an article about Kurdistan and Turkey; when looking at its keywords, we see that it has the term "Abdullah Öcalan عبدالله أوجلان" although it is not a term in the article. In another example, an article about events that happened in Khartoum city (the capital of Sudan) has Sudan in its keywords beside Khartoum, although "Sudan" appears only one time in the document. Another article, talking about the Al-Aqsa Mosque, uses the term "Alhrm Alqdsy Alshryf الحرم القدسي الشريف" throughout the document, but the keywords use "Almsjd Alaqsa المسجد الأقصى".

We can conclude that keywords can contain synonym terms for candidate terms extracted from the document using centrality (ranking) measures. As this work's approach does not take synonyms into account, this yields a limitation of the approach.

VI. CONCLUSION AND FUTURE WORK

The new system is based on representing the words of the document as a graph and uses graph-based centrality measures to rank the words. It achieved a very good performance according to the three accuracy measures: Precision, Recall, and F-Measure, which reached 54, 82 and 64 respectively. Although there is no significant difference between the four centrality measures, Degree Centrality shows better performance than the others. There are still other ranking methods that could also be tested, such as TextRank and HITS.

In general, graph centrality measures give the keywords extraction process better performance than statistical approaches. One limitation is that any term, to be one of the key-phrase candidates, must appear at least once in the document, and in order to increase its chance, it should appear more than once. One possible solution is to take the word's synonyms into account.

REFERENCES
[1] Al-Kabi M., Al-Belaili H., Abul-Huda B., and Wahbeh A., "Keyword Extraction Based On Word Co-Occurrence Statistical Information for Arabic Text", in Abhath Al-Yarmouk: "Basic Sci. & Eng.", Vol. 22, No. 1, pp. 75-9, 2013.
[2] Awajan A., "Unsupervised Approach for Automatic Keyword Extraction from Arabic Documents", in The 2014 Conference on Computational Linguistics and Speech Processing, pp. 175-184, The Association for Computational Linguistics and Chinese Language Processing, 2014.
[3] Awajan A., "Keyword extraction from Arabic documents using term equivalence classes", in ACM Trans. Asian Low-Resour. Lang., 2015.
[4] El-Shishtawy T.A. and Al-Sammak A.K., "Arabic Keyphrase Extraction using Linguistic knowledge and Machine Learning Techniques", in Proceedings of the Second International Conference on Arabic Language Resources and Tools, The MEDAR Consortium, Cairo, Egypt, 2009.
[5] Sahmoudi I., Froud H. and Lachkar A., "A new keyphrases extraction method based on suffix tree data structure for Arabic documents clustering", in Int. J. Database Manag. Syst., vol. 5, no. 6, pp. 17-33, 2013.
[6] http://www.internetworldstats.com/stats7.htm
[7] Kim Y., Kim M., Park S., Cattle A., Shin H. and Otmakhova J., "Applying Graph-based Keyword Extraction to Document Retrieval", in International Joint Conference on Natural Language Processing, pp. 864–868, Nagoya, Japan, 14-18 October 2013.
[8] Mihalcea R. and Tarau P., "TextRank: Bringing Order into Texts", in Proceedings of EMNLP 2004, pp. 404–411, Barcelona, Spain, Association for Computational Linguistics, 2004.
[9] Kolaczyk E. and Csardi G., "Statistical Analysis of Network Data with R", ISBN-13: 978-1493909827, 2014.
[10] Wasserman S. and Faust K., "Social Network Analysis: Methods and Applications", Cambridge University Press, ISBN: 9780521387071, 1994.
[11] Newman M., "Networks: An Introduction", Oxford University Press, ISBN: 9780199206650, 2010.
[12] Freeman L., "Centrality in Social Networks: Conceptual Clarification", in Social Networks, vol. 1, no. 3, pp. 215-239, 1979.
[13] Brin S. and Page L., "The PageRank Citation Ranking: Bringing Order to the Web", in Stanford Digital Library Technologies Project, 1998.
[14] Al-Taani A. and Al-Omour M., "An Extractive Graph-based Arabic Text Summarization Approach", in The International Arab Conference on Information Technology, 2014.
[15] Litvak M. and Last M., "Graph-Based Keyword Extraction for Single-Document Summarization", in Coling 2008: Proceedings of the workshop on Multi-source Multilingual Information Extraction and Summarization, pp. 17–24, Manchester, August 2008.
[16] Boudin F., "A comparison of centrality measures for Graph-Based Keyphrase extraction", in International Joint Conference on Natural Language Processing, pp. 834–838, Nagoya, Japan, 14-18 October 2013.
[17] https://networkx.github.io/
[18] Daoud D., Al-Kouz A. and Daoud M., "Time-sensitive Arabic multiword expressions extraction from social networks", in International Journal of Speech Technology, vol. 19, pp. 249–258, 2016.
[19] Suleiman D. and Awajan A., "Bag-of-concept based keyword extraction from Arabic documents", in 2017 8th International Conference on Information Technology (ICIT), pp. 863-869, IEEE, May 2017.
Arabic Text Keywords Extraction using Word2vec

Dima Suleiman
Computer Science Department
King Hussein Faculty of Computing Sciences
Princess Sumaya University for Technology
Teacher at the University of Jordan
Amman, Jordan
d.suleiman@psut.edu.jo

Arafat A. Awajan
Computer Science Department
King Hussein Faculty of Computing Sciences
Princess Sumaya University for Technology
Amman, Jordan
awajan@psut.edu.jo

Wael Al Etaiwi
Computer Science Department
King Hussein Faculty of Computing Sciences
Princess Sumaya University for Technology
Amman, Jordan
w.etaiwi@psut.edu.jo
Fig. 1: Phases of Proposed Method

The proposed approach consists of five phases, as shown in Fig. 1. The first phase is the preprocessing phase, which consists of several stages. The first stage is cleaning the text; cleaning includes removing punctuation marks, non-Arabic words, numerals, and diacritics. The second stage is tokenization, which splits the text into words; in this research, the Farasa segmenter is used for tokenization [4]. The last stage is stemming, where the Farasa stemmer is used to get the stems of the words.

In the second phase, the semantic vector space model is used to represent words. Each row in the semantic vector space represents a word's stem and the word's weight, where the weight is the frequency of words that have the same stem. In the clustering phase, the words that have the same stem are grouped together in the same class. Moreover, as in [17], the synonym words are also assigned to the same class. In addition, we group words that have high contextual semantic similarity together in the same class. The fourth phase is the statistical phase, where N-grams are used; in this phase, N-grams include unigrams, bigrams and trigrams. For each N-gram, there exists one entry in the semantic vector space, and the frequency of the N-gram is used to compute its weight. Finally, keywords extraction is the last phase.

The Universal Declaration of Human Rights document is used to apply the five phases of the proposed keyword extraction method.

DOCUMENT PROCESSING AND CLEANING

A. Document Preprocessing and Cleaning

Document preprocessing and cleaning is a crucial step in many NLP applications; instead of dealing with the words themselves, it is better to deal with the words' stems. The preprocessing phase includes removing stop words, in addition to extracting the words' stems. In the tokenization process, the text is split into words.

B. Term Weight

The term weights are computed using the score or the frequency of their stem. Furthermore, the weights are affected by the position of the words in the document, such that words that occur in the abstract, introduction and conclusion have a higher probability of being keywords. Therefore, such words have higher weights than other words and, accordingly, higher opportunities to be keywords.

In order to calculate the weights based on the position of a word in the document, the document is divided into N sections, where each section is assigned a weight as in [5]. The same probabilities that are used in research [14] are used here. Li represents the probability of the words in a certain section i; the values of Li are 0.24, 0.51 and 0.11 for words that occur in the first or last paragraph, the title, and an internal paragraph, respectively. Freq(w, i) represents the frequency of word w in section i as in [14]. Therefore, the weight of a certain word w is computed using Eq. (1) [14]:

weight(w) = Σi (Li · Freq(w, i))    (1)

All the paragraphs of the Universal Declaration of Human Rights document, which we use as an example in this paper, have the same probability, since the document does not contain an abstract, introduction or conclusion.

C. Building Bag-of-Concept

Building the bag-of-concept consists of three stages. In the first stage, words that have the same stem are grouped together in the same class, in a process called Term Normalization. Each class is given a weight, such that for class C the weight is computed using Eq. (2) [14]:

ClassWeight(C) = Σ Weight(Wi), Wi ϵ C    (2)

The second stage includes grouping words that are synonyms in the same class; WordNet is used to find the synonym words [18]. The last stage consists of grouping words that have high context similarity in the same class. In this stage, the pre-trained word2vec model is used to convert the words into vectors. In addition, the cosine similarity is used to find the similarity values between words.

Table I shows the results of term normalization for the classes with the highest count or weight.

Table II displays the classes after considering the
Keyword extraction mainly based on frequency of words. synonyms. After using WordNet, the synonym words are
The words that occur frequently are more nominated to be combined into the same class. For example, classes C, D and
keywords. However, even that stop words occur frequently E which represent words “”فرد, “ ”شخصand “”انسان
they cannot be considered as keywords, thus stop words are respectively will be combined into one class. Since the count
removed in preprocessing phase. of word “ ”فردis the highest, thus the name of the combined
class will be “ ”فردand it is represented by C.
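The position-based weighting of Eq. (1) and the class weighting of Eq. (2) can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the section probabilities Li come from the text above, while the sample frequencies and the function names are invented for demonstration.

```python
# Sketch of Eqs. (1) and (2): position-weighted word weights and class weights.
# Section probabilities L_i as given in the text: first/last paragraph, title,
# internal paragraph. The frequency data below is invented for illustration.

L = {"first_last": 0.24, "title": 0.51, "internal": 0.11}

def word_weight(freq_by_section):
    """Eq. (1): weight(w) = sum_i L_i * Freq(w, i)."""
    return sum(L[sec] * f for sec, f in freq_by_section.items())

def class_weight(word_weights):
    """Eq. (2): ClassWeight(C) = sum of Weight(W_i) for all W_i in class C."""
    return sum(word_weights)

# Hypothetical frequencies of one stem across the three section types:
w1 = word_weight({"first_last": 4, "title": 1, "internal": 10})
w2 = word_weight({"first_last": 2, "internal": 5})

print(round(w1, 2))                       # 2.57
print(round(class_weight([w1, w2]), 2))   # 3.6
```

Grouping several stems into one class and summing their weights with `class_weight` mirrors the term-normalization step described above.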
TABLE I
CLASSES AND THEIR COUNTS

Symbol  Root/Stem  Count    Symbol  Root/Stem  Count
A       حق         61       N       بلد        7
B       حرية       22       O       كرامة      6
C       فرد        19       P       جميع       6
D       شخص        18       Q       اسرة       6
E       انسان      14       R       متساوي     6
F       عمل        10       S       تمتع       6
G       حماية      10       T       أخر        6
H       لكل        10       U       امة        6
I       اجتماعي    8        V       اساسي      6
J       عام        7        W       اعالن      6
K       قانون      7        X       احترام     4
L       دولة       7        Y       عادل       4
M       مجتمع      7        Z       مساواة     3

TABLE III
CLASSES OF MOST SIMILAR WORDS AND THEIR COUNTS

Symbol  Root/Stem  Count    Symbol  Root/Stem  Count
A       حق         92       K       جميع       6
B       فرد        51       L       اسرة       6
C       دولة       14       M       متساوي     6
D       عمل        10       N       تمتع       6
E       حماية      10       O       أخر        6
F       لكل        10       P       امة        6
G       اجتماعي    8        Q       اساسي      6
H       عام        7        R       اعالن      6
I       قانون      7        S       عادل       4
J       مجتمع      7        T
The counts of the words in classes C, D, and E are 19, 18, and 14 respectively. Therefore, the count of the new class will be 51, which is the summation of the counts of the three classes C, D, and E. Also, by using synonyms, the two classes N and L will be combined, with a total count equal to 14, and the combined class is labeled "دولة".

Moreover, the contextual semantic similarity is considered when combining classes: the words that occur within the same semantic context are considered similar. The pre-trained word2vec model provided by AraVec [19] is used in the proposed model, and cosine similarity is used to compute the similarity between the words in the document. The results show that the classes "حق", "حرية", "كرامة", and "مساواة", which are represented by the symbols A, C, L, and V respectively in Table II, have high similarity values and will be combined into one class. As mentioned previously, the name of the class will be "حق", since the count of the class "حق" is the highest. Also, the count of the new class will be equal to the summation of the counts of the classes A, C, L, and V; thus, the count of the combined class is equal to 61 + 22 + 6 + 3 = 92. Table III displays the classes after considering the contextual semantic similarity.

The proposed approach differs from the approach proposed in [17] in considering the semantic similarity of words that occur within the same context when combining classes; in [17], only words that have the same stem or that are synonyms are grouped together.

TABLE II
CLASSES OF SYNONYM WORDS AND THEIR COUNTS

Symbol  Root/Stem  Count    Symbol  Root/Stem  Count
A       حق         61       L       كرامة      6
B       فرد        51       M       جميع       6
C       حرية       22       N       اسرة       6
D       دولة       14       O       متساوي     6
E       عمل        10       P       تمتع       6
F       حماية      10       Q       أخر        6
G       لكل        10       R       امة        6
H       اجتماعي    8        S       اساسي      6
I       عام        7        T       اعالن      6
J       قانون      7        U       عادل       4
K       مجتمع      7        V       مساواة     3

D. N-Gram Detection and Scoring

N-grams must be taken into account in the statistical analysis, since in most documents the keywords are composed of one or more words. The score of a unigram is the same as the count (weight) of the word, since a unigram consists of one word. On the other hand, the scores of bigrams and trigrams do not equal their weights, since a bigram is composed of two classes and a trigram of three classes.

The weight of an N-gram (NG) is computed using Eq. (3) [14], where the number of sections in the document is represented by M:

    NgramWeight(NG) = Σi=1..M (Li × Freq(NG, i))    (3)

The total score of the N-gram is calculated by adding the weights of all the classes that compose it to the weight of the N-gram itself. For example, the score of the bigram "حق فرد" is equal to the summation of the weights of the classes "حق" and "فرد", which is 80, plus the weight of the bigram "حق فرد", which is 16; thus, the total score is 96. Likewise, the score of the bigram "حق شخص" is equal to the summation of the weights of the classes "حق" and "شخص", which is 79, plus the weight of the bigram "حق شخص", which is 15; thus, the total score is 94. In the case of the bigram "حق انسان", the total score is 85, while the bigram weight is 10.

The N-gram score is calculated by Eq. (4) [17], where N is the number of classes that compose the N-gram:

    Score(NG) = NgramWeight(NG) + Σj=1..N ClassWeight(Cj)    (4)

The synonyms and the contextual semantic similarity of words must be considered not only for unigrams but also for N-grams. For example, the bigrams "حق فرد", "حق شخص" and "حق انسان" are considered as one bigram, "حق فرد", since "فرد", "شخص" and "انسان" are synonyms. The reason for choosing "حق فرد" instead of the other two is that the count of "فرد" is higher than those of "شخص" and "انسان". Furthermore, the bigrams "حق فرد", "حرية فرد", "كرامة فرد" and "مساواة فرد" are considered as one bigram, "حق فرد", since "حق", "حرية", "كرامة" and "مساواة" have highly similar contexts based on the word2vec model and are grouped into one class. Moreover, the reason for
choosing "حق فرد" instead of the other bigrams is that the count of "حق" is higher than the counts of "حرية", "كرامة" and "مساواة".

The score of an N-gram in the case of synonyms and contextual semantic similarity can be calculated as follows. In the previous example, in the case of synonyms, the new class "حق فرد" is the combination of the bigrams "حق فرد", "حق شخص" and "حق انسان". NgramWeight(NG) of "حق فرد" is therefore the sum of the weights of the three bigrams "حق فرد", "حق شخص" and "حق انسان", which are equal to 16, 15, and 10 respectively; in this example, the total is 41. After that, 41 is added to the weights of the classes "فرد", "حق", "شخص" and "انسان"; thus, the total score of the bigram "حق فرد" is 153. In the case of combining classes that have high contextual semantic similarity, the same process is applied: NgramWeight(NG) of "حق فرد" is the sum of the weights of the four bigrams "حق فرد", "حرية فرد", "كرامة فرد" and "مساواة فرد", which are equal to 41, 15, 4, and 2 respectively; in this example, the total is 62. After that, 62 is added to the weights of the classes "فرد", "حق", "حرية", "كرامة" and "مساواة"; thus, the total score of the bigram "حق فرد" is 205.

Eq. (4) is modified to consider the synonyms and the words that have contextual semantic similarity, as in Eq. (5). In Eq. (5), B represents the number of N-grams that have synonymous words, such as "حق فرد", "حق شخص" and "حق انسان" (in this case the value of B is 3), and C represents the number of N-grams that have high contextual similarity, such as "حق فرد", "حرية فرد", "كرامة فرد" and "مساواة فرد" (in this case the value of C is 4):

    Score(NG) = Σb=1..B NgramWeight(NG(b)) + Σc=1..C NgramWeight(NG(c))
                + Σj ClassWeight(Cj) + ClassWeightOfSynonym + ClassWeightOfSimilar    (5)

The scores of the most frequent bigrams and trigrams are shown in Tables IV and V, respectively.

TABLE IV
BIGRAMS AND THEIR SCORES
(columns: bigram; weights of the first and second terms; bigram weight/count; score)

حق فرد        92   51   62   205
حق دولة       92   14   10   116
حق حماية      92   10   9    111
حق عمل        92   10   8    110
حق اجتماعي    92   8    7    107
حق عام        92   7    6    105
حق جميع       92   6    6    104
حق اسرة       92   6    6    104
حق متساوي     92   6    6    104
حق تمتع       92   6    6    104
حق قانون      92   7    5    104
حق امة        92   6    6    104
حق اساسي      92   6    6    104

TABLE V
TRIGRAMS AND THEIR SCORES
(columns: trigram; weights of the first, second, and third terms; trigram weight/count; score)

حق فرد دولة      92   51   14   9   166
حق فرد عمل       92   51   10   7   160
حق فرد حماية     92   51   10   6   159
حق فرد اجتماعي   92   51   8    6   157
حق فرد عام       92   51   7    5   155
حق فرد قانون     92   51   7    4   154
حق فرد آخر       92   51   6    5   154
حق فرد اساسي     92   51   6    5   154
حق فرد مجتمع     92   51   7    4   154
حق فرد تمتع      92   51   6    4   153
حق فرد امة       92   51   6    4   153

E. Selection of Keywords

The keywords that have the highest scores have the highest probability of being selected as predicted keywords. The application requirements, the size of the document, and the user's needs determine the number of selected keywords. Moreover, the number of words in a keyword affects the selection: if two keywords have the same score, the keyword with the larger number of words is selected. The list of selected keywords can be seen in Table VI.

TABLE VI
CANDIDATE KEYWORDS AND THEIR SCORES

Keyword          Score
حق فرد           205
حق فرد دولة      166
حق فرد عمل       160
حق فرد حماية     159
حق فرد اجتماعي   157
حق فرد عام       155

IV. EXPERIMENTS AND EVALUATION

A. Description of Datasets

The performance of the proposed approach is evaluated by comparing the generated keywords with manually extracted keywords. The experiments are conducted using three documents, whose titles are: "اإلعالن العالمي لحقوق اإلنسان", "المؤتمر الوطني األردني" and "العنف لدى طالب جامعة ال البيت". The keywords of the documents are displayed below: the keywords of "المؤتمر الوطني األردني" are displayed in Table VII, and Table VIII lists the keywords for the document "العنف لدى طالب جامعة ال البيت".

Table VIII presents the keywords of the third document without considering semantics. Two synonymous keywords can be seen: "العنف جامعة طالباً" and "العنف جامعة طالبة". The two keywords are grouped into one class when semantics are used; accordingly, the score of the generated class is higher, and this class has a higher probability of being selected as a candidate keyword.
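The N-gram scoring of Section D (Eqs. (4) and (5)) can be sketched as follows. This is a minimal illustration that reproduces the worked totals quoted in the text (96 for the plain score of "حق فرد", and 153/205 for the synonym- and similarity-merged scores); the function names are ours, not the authors'.

```python
# Sketch of the N-gram scoring of Eqs. (4) and (5). The numbers reproduce the
# worked examples in the text; the function names are illustrative only.

def score(ngram_weight, class_weights):
    """Eq. (4): NgramWeight(NG) plus the weights of the composing classes."""
    return ngram_weight + sum(class_weights)

def merged_score(variant_weights, merged_class_weights):
    """Eq. (5), simplified: sum the weights of all synonym/similar variants
    of the N-gram, then add the weights of the merged classes."""
    return sum(variant_weights) + sum(merged_class_weights)

# Plain Eq. (4): bigram "حق فرد" with ClassWeight("حق") = 61,
# ClassWeight("فرد") = 19, and bigram weight 16.
print(score(16, [61, 19]))                     # 96, as in the text

# Synonym merging: variants "حق فرد", "حق شخص", "حق انسان" (16, 15, 10)
# plus the merged class weights 51 ("فرد" after synonyms) and 61 ("حق").
print(merged_score([16, 15, 10], [51, 61]))    # 153, as in the text

# Similarity merging: the four variants weigh 41, 15, 4, and 2; the merged
# class weights are 51 ("فرد") and 92 ("حق" after similarity merging).
print(merged_score([41, 15, 4, 2], [51, 92]))  # 205, matching Table IV
```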
TABLE VII
KEYWORDS OF THE "المؤتمر الوطني األردني" DOCUMENT

Keyword                   Score
المؤتمر الوطني األردني     32
األردني الوطني             26
المؤتمر الوطني             24
المؤتمر األردني            20

TABLE VIII
KEYWORDS OF THE "العنف لدى طالب جامعة ال البيت" DOCUMENT

Keyword
العنف جامعة طالباً
العنف جامعة طالبة
طلبة الدراسة جامعة
طلبة جامعة العنف

In order to evaluate the accuracy of the proposed model, the results are compared with the results in [14] and [17]; the three models all evaluate the first document, while [17] also evaluates all the documents. In all documents, the keywords are extracted manually in order to make comparisons with the generated keywords.

B. Word2vec Model

Word2vec is a word embedding model that was proposed by Mikolov in 2013 [3][20]. Word embedding is used to convert words into low-dimensional vectors, such that similar words can be explored in terms of syntactic and semantic similarity. Using word2vec facilitates finding the most similar words that occur within the same context. Furthermore, the word2vec model has two approaches: the Continuous Bag of Words (CBOW) model and the skip-gram model. Both approaches are neural networks that consist of one input layer, one hidden layer, and one output layer. In order to obtain significant results, both approaches must be trained on a large corpus [21], [22]. They also share the same hyperparameters, such as the vocabulary size, the context window, and the dimension size. The vocabulary size is the number of most frequent words in the vocabulary. The context window is a window that surrounds the input word in the case of skip-gram, and the output (target) word in the case of CBOW. The dimension size is the size of each vector used to represent the words. The dimension of each vector in the input and output layers is equal to the vocabulary size, while the number of neurons in the hidden layer is equal to the dimension size. The similarity between words can be measured using cosine similarity, Euclidean distance, and other measures.

In this paper, we used one of the six pre-trained word embedding models prepared by an Arabic open-source project called AraVec [19]. AraVec models are trained using three different resources: Wikipedia Arabic articles, tweets, and World Wide Web pages, with more than 3,300,000,000 tokens in total. The proposed keyword approach used the model that was trained on Wikipedia with a dimension size of 100. Cosine similarity is used to find the words in the documents that are highly similar. For example, in the "اإلعالن العالمي لحقوق اإلنسان" document, the words "حق", "حرية", "كرامة", and "مساواة" are highly similar; thus, these words are grouped into one class. We can also notice that the word "عدالة" has high similarity with the word "حرية". Although the word "عدالة" does not exist in the document, it appears in the keywords generated by the proposed model. This type of generated keyword is called an abstractive keyword: abstractive keywords are keywords that are generated by the keyword extraction model but do not occur in the original document.

C. Performance Evaluation

The evaluation measures that we use in this research are precision (P), recall (R), and F-measure. True positive, false positive, true negative, and false negative values are defined for each keyword w: a keyword w that is selected by the proposed algorithm and also selected by manual extraction is considered a true positive. There are three groups of experiments, determined according to the number of keywords extracted: the first group contains 5 keywords, and the second and third groups contain 10 and 15 keywords, respectively. Table IX displays the results.

The number of selected keywords determines the precision, as shown in Table IX. The second document achieved the best precision, since its number of keywords is low and thus the accuracy is high. On the other hand, for large documents such as the third one, the precision is high and the recall is moderate. Furthermore, a comparison between the results of the proposed approach and the results in [14], [17] is made; the results show that the proposed model outperforms the previous models in terms of precision, recall, and F-measure.

V. CONCLUSION

Due to the increasing number of available online Arabic documents, the need for keyword extraction methods is growing. The importance of keywords stems from their use in several NLP applications, such as text summarization and information retrieval. Automated keyword extraction is used instead of manual extraction, since manual extraction is time-consuming. A new Arabic keyword extraction approach was proposed in this research. The proposed method considers contextual semantic similarity, and it combines Arabic linguistic properties with statistical analysis in order to extract keywords from Arabic documents. Words that have the same stem are grouped together in the same class; words with synonymous stems are also grouped together, in addition to grouping words that have high contextual semantic similarity. A word2vec model was used to convert the words into vectors, which facilitated computing the semantic similarity between words that occur within the same context. Moreover, abstractive keywords were generated using the word2vec model. The experimental results showed that the proposed model improved the results of extracting keywords from Arabic
documents. The experiments were conducted using three documents, and the results were reasonable.

TABLE IX
RESULTS AFTER APPLYING THE ALGORITHM
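The cosine comparison of word2vec vectors described in Section B above can be sketched as follows. The three-dimensional vectors are invented toy data (the AraVec model used in the paper has 100 dimensions); only the cosine formula itself is as described.

```python
import math

# Cosine similarity between two word-embedding vectors, as used to decide
# whether two words belong to the same context class. The toy 3-d vectors
# below are invented; real AraVec vectors have 100 dimensions.

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

v_haq     = [0.9, 0.1, 0.3]   # hypothetical vector for "حق"
v_hurriya = [0.8, 0.2, 0.4]   # hypothetical vector for "حرية"
v_amal    = [0.1, 0.9, 0.2]   # hypothetical vector for "عمل"

print(round(cosine(v_haq, v_hurriya), 3))  # high similarity -> same class
print(round(cosine(v_haq, v_amal), 3))     # low similarity  -> separate classes
```

In the proposed approach, word pairs whose cosine similarity exceeds a chosen threshold are merged into one class, as was done for "حق", "حرية", "كرامة", and "مساواة".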
[6] M. Sahlgren and R. Cöster, "Using bag-of-concepts to improve the performance of support vector machines in text categorization," in Proceedings of the 20th International Conference on Computational Linguistics (COLING '04), Geneva, Switzerland, 2004, pp. 487-es.
[7] F. Wang, Z. Wang, Z. Li, and J.-R. Wen, "Concept-based short text classification and ranking," in Proceedings of the 23rd ACM International Conference on Information and Knowledge Management (CIKM '14), Shanghai, China, 2014, pp. 1069–1078.
[8] H. K. Kim, H. Kim, and S. Cho, "Bag-of-concepts: Comprehending document representation through clustering words in distributed representation," Neurocomputing, vol. 266, pp. 336–352, Nov. 2017.
[9] S. Albitar, B. Espinasse, and S. Fournier, "Semantic enrichments in text supervised classification: Application to medical domain," in Proceedings of the Twenty-Seventh International FLAIRS Conference, March 2014.
[10] R. Al-Shalabi, "Arabic text categorization using kNN algorithm," in Proceedings of the 4th International Multiconference on Computer Science and Information Technology, vol. 4, Amman, Jordan, April 5–7, 2006.
[16] B. Alarmouty and S. Tedmori, "Automated keyword extraction using support vector machine from Arabic news documents," 2019, pp. 342–346, doi: 10.1109/JEEIT.2019.8717420.
[17] D. Suleiman and A. Awajan, "Bag-of-concept based keyword extraction from Arabic documents," in 2017 8th International Conference on Information Technology (ICIT), Amman, Jordan, 2017, pp. 863–869.
[18] W. Black et al., "Introducing the Arabic WordNet project," in Proceedings of the Third International WordNet Conference, Sojka, Choi, Fellbaum, and Vossen, Eds., 2006.
[19] A. B. Soliman, K. Eissa, and S. R. El-Beltagy, "AraVec: A set of Arabic word embedding models for use in Arabic NLP," Procedia Computer Science, vol. 117, pp. 256–265, 2017.
[20] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, "Distributed representations of words and phrases and their compositionality," in Advances in Neural Information Processing Systems, 2013, pp. 3111–3119.
[21] D. Suleiman and A. Awajan, "Comparative study of word embeddings models and their usage in Arabic language applications," in International Arab Conference on Information Technology (ACIT), Werdanye, Lebanon, 2018, pp. 1–7.
A Deep Learning Approach for Arabic Text Classification

Katrina Sundus                     Fatima Al-Haj                      Bassam Hammo
Computer Science Department        Computer Science Department        Computer Information Systems
The University of Jordan           The University of Jordan           The University of Jordan
Amman, Jordan                      Amman, Jordan                      Amman, Jordan
sun.katrina@yahoo.com              Alhaj5661@gmail.com                b.hammo@ju.edu.jo

Abstract—Advancement in information technology has produced massive textual material that is available online. Text classification algorithms are at the core of many natural language processing (NLP) applications. Several algorithms have been implemented to tackle the classification problem for English and other European languages, but few attempts have been carried out to solve the problem of Arabic text classification. In this paper, we demonstrate a feed-forward deep learning (DL) neural network for the Arabic text classification problem. The first layer uses term frequency-inverse document frequency (TF-IDF) vectors constructed from the most frequent words of the document collection. The output of the first layer is used as an input to the second layer. To reduce the classification error rate, we used the Adam optimizer. We conducted a set of experiments on two multi-class Arabic datasets to evaluate our approach based on standard measures such as precision, recall, F-measure, support, accuracy, and the time to build the model. We compared our approach with the logistic regression (LR) algorithm. The experiments showed that the deep learning approach outperformed the logistic regression algorithm for Arabic text classification.

Keywords—Arabic Text Classification, Machine Learning, Logistic Regression, Neural Networks, Deep Learning.

I. INTRODUCTION

Finding relevant information about a specific topic in a massive amount of exponentially growing online textual data is a challenging problem. Organizing data into predetermined categories may help to solve this dilemma; hence, the need for efficient and effective automatic classification algorithms is always in demand. Text classification algorithms are at the core of many NLP applications, such as text summarization, question answering, sentiment analysis, spam detection, and text visualization.

The main task of text classification can be summarized as follows: given a document D, find zero or multiple categories to which D belongs. A binary classification process involves a collection made of two classes, while a multi-class classification process acts on a data collection of more than two classes to be assigned to an unseen document.

Text classification can be either manual or automatic. Manual text classification was the core task of classifying library content since the early days. On the other hand, automatic text classification is mainly done by computer machines using classification techniques.

Textual material can be classified in many ways based on metadata such as text subject, type, publication year, and author name. In this paper, we only consider the subject classification property [1].

Automatic text classification may fall under one of three categories: supervised, unsupervised, and semi-supervised. In supervised text classification, human interaction is involved to provide some classification information, whereas in unsupervised text classification, also known as text clustering, classification is completed without any external information. In semi-supervised text classification, the categorization is completed using some external mechanism [2].

Arabic is one of the top ten languages used on the web [3]. Although Arabic content is growing rapidly on the internet, it still accounts for as little as 3% of the total. This rapid growth is a compelling motivation for researchers and developers to build effective systems and tools to advance research in Arabic NLP.

Deep learning (DL) is considered a part of neural networks and is the fastest growing field in machine learning. It can be supervised, semi-supervised, or unsupervised [4]. DL allows various computational models composed of multiple processing layers to learn representations of the data at different levels of abstraction.

In this work, we demonstrate a feed-forward supervised DL model for Arabic text classification. The first layer uses term frequency-inverse document frequency (TF-IDF) vectors constructed from the most frequent words of the document collection. The output of the first layer is used as an input to the second layer. Optimization methods are used to reduce the error rate between the computed and the target output; the error rate is usually measured by a loss function. In this paper, we used a common optimizer called Adam.

To test the model, we carried out a set of experiments on two multi-class, single-label Arabic datasets based on standard measures such as precision, recall, F-measure, support, accuracy, and the time to build the model. We then compared our proposed model with the logistic regression (LR) algorithm.

The rest of the paper is organized as follows. The second section presents the related work. The third section presents the research background. In section four, a description of the test datasets is presented. In section five, we discuss the research methodology. Experiments and their results are presented in section six, followed by the conclusion in section seven.
Training the model is an iterative process; hence, the number of iterations for the model must be specified. These iterations are called epochs. In our proposed model, we used ten epochs. In addition, a batch size parameter is used to determine how many samples to process in the forward/backward pass of each epoch. The reason for applying this parameter is to increase the computational speed and to minimize the number of epochs required for running.

To reduce the error between the computed and the target output, we used a common optimizer called Adam. The error is measured by a loss function.

A. The datasets

In this work, we used two test datasets to conduct the experiments and to validate the efficiency of the proposed approach for solving the problem of Arabic text classification. The following is a description of the two datasets.

Dataset-I. The first dataset, Khaleej-2004, was borrowed from [20]. It contains 5690 Arabic documents organized into four categories: Economy, International News, Local News, and Sport. Table 1 shows the characteristics of the first dataset.

TABLE 1. CHARACTERISTICS OF DATASET-I

Category    Num. of Documents  Num. of Words Before Processing  Num. of Words After Processing
Economy     909                418978                            8946
Int. News   935                534532                            10382
Local News  2398               967525                            12519
Sport       1430               551728                            9978
Total       5690               2472763                           41825

Dataset-II. The second dataset was borrowed from [12]. It contains a set of 1445 Arabic documents organized into nine categories: Computer, Economics, Education, Engineering, Law, Medicine, Politics, Religion, and Sport. Table 2 shows the characteristics of the second dataset.

TABLE 2. CHARACTERISTICS OF DATASET-II

Category     Num. of Documents  Num. of Words Before Processing  Num. of Words After Processing
Computer     70                 13959                             2890
Economics    220                99853                             8505
Education    68                 51749                             6803
Engineering  115                143240                            9280
Law          97                 199156                            11346
Medicine     232                97230                             9097
Politics     184                85015                             9009
Religion     227                140307                            9769
Sport        232                73604                             4156
Total        1445               904113                            70855

B. The preprocessing phase

Data preprocessing is extremely important for many research areas such as NLP, DM, and ML, as it allows improving the quality of the raw experimental data. Data preprocessing has a significant impact on the performance of supervised learning models [11]. The primary aim of preprocessing is to reduce the test space and to minimize the error rate. Appropriate data preprocessing and data analysis is the next step in text classification. Data preprocessing includes the following stages:

1) Text tokenization. The tokenization process takes the dataset and splits it into separate words (tokens). The words are separated at multiple delimiters, including white spaces, tabs, and punctuation marks. The output of the tokenization process is of two types: tokens that correspond to units whose characters are recognizable, such as punctuation marks, numeric data, dates, etc., and tokens that need further morphological analysis. Tokens of one or two characters in length, non-Arabic characters, and numerical values are ignored and excluded from the dataset, as they affect the performance of the classifier [9]. Regular expressions can be a helpful tool for tokenization [1].

2) Stop-words removal. Stop-words are usually functional words; they include conjunctions, prepositions, etc. They occur frequently in a text, and they have low impact on the classification process [9]. A compiled list of Arabic stop-words is usually used to eliminate them from a text [14]. Developers of NLP applications usually remove stop-words from search engine indices, as this reduces the size of the indices dramatically and, hence, improves recall and precision [15].

3) Word stemming. Stemming is the process of mapping derivative words onto the base form, the stem, which they share. Stemming uses morphological heuristics to remove affixes from words before indexing them. As an example, the Arabic words "كتاب", "كاتب" and "مكتبة" share the same root "كتب". In this work, we utilized the stemmer described in [13].

After the preprocessing phase, the dataset is represented in a form suitable for the ML phase. Consequently, the most relevant stems of words must be extracted and converted into vectors. The vector space is represented as a two-dimensional matrix, where the columns denote the stems and the rows denote the documents. The entries in the matrix are the weights of the stems in their corresponding documents. The TF-IDF scheme is used to assign weights to stems. Equations (1)-(3) are used to calculate the weights of the terms:

    TF(i, j) = (number of occurrences of term i in document j) / (total number of terms in document j)    (1)

    IDF(i) = log(total number of documents / number of documents containing term i)    (2)

    wij = TF(i, j) × IDF(i)    (3)

where TF(i, j) is the frequency of term i in document j, IDF(i) is the inverse of the frequency of term i with respect to all documents (i.e., the dataset), and the weight of term i in document j, wij, is calculated by (3).

C. Model evaluation metrics

Text classification evaluation is performed using four standard measures: classification accuracy, precision, recall, and F1-score. Accuracy is simply the ratio of correctly predicted observations to the total observations. It is calculated using (4):

    Accuracy = (TP + TN) / (TP + TN + FP + FN)    (4)

Precision is the ratio of correctly predicted positive observations to the total predicted positive observations.
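The TF-IDF weighting of Equations (1)-(3) can be sketched as follows. This is a minimal illustration with an invented toy corpus of stemmed documents, not the authors' implementation; the paper does not state the logarithm base, so the natural log used here is an assumption.

```python
import math

# Sketch of Eqs. (1)-(3): TF-IDF weights over a toy corpus of stemmed
# documents. The documents below are invented for illustration, and the
# natural logarithm is an assumption (the paper does not state the base).

docs = [
    ["كتب", "علم", "كتب"],
    ["علم", "درس"],
    ["كتب", "درس", "درس"],
]

def tf(term, doc):
    """Eq. (1): occurrences of the term divided by the document length."""
    return doc.count(term) / len(doc)

def idf(term, corpus):
    """Eq. (2): log of total documents over documents containing the term."""
    containing = sum(1 for d in corpus if term in d)
    return math.log(len(corpus) / containing)

def tfidf(term, doc, corpus):
    """Eq. (3): w_ij = TF(i, j) * IDF(i)."""
    return tf(term, doc) * idf(term, corpus)

print(round(tfidf("كتب", docs[0], docs), 3))  # frequent term, in 2 of 3 docs
print(round(tfidf("درس", docs[0], docs), 3))  # 0.0: term absent from this doc
```

The matrix described above is obtained by evaluating `tfidf` for every stem (column) against every document (row).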
Precision is calculated using (5). Recall is the ratio of correctly predicted positive observations to all observations in the actual class and is captured by (6). The F1-score is used when a balance between precision and recall is needed and is calculated by (7):

    Precision = TP / (TP + FP)    (5)

    Recall = TP / (TP + FN)    (6)

    F1-score = 2 × (Precision × Recall) / (Precision + Recall)    (7)

D. Split the datasets into testing and training portions

After the datasets of Arabic documents have been preprocessed, they were split into a training set (80%) and a test set (20%). This division is completed by using the split train test library in Python.

V. EXPERIMENTAL RESULTS

In this section, we compare our work with the LR supervised classification model. We evaluate the LR model based on standard measures, namely precision, recall, and F1-score. The DL model is evaluated based on accuracy and the percentage of loss in the training and validation datasets. A confusion matrix, which is a visualized summary of the classification prediction results, is produced to evaluate the classification accuracy of both models. The number of correct and incorrect predictions is summarized with counted values for each class. The confusion matrix provides an insightful overview of the errors made by the classifier and the misclassified instances.

In order to implement the two models, we used Python 3.7.0 with the aid of the JetBrains PyCharm Community Edition 2017.2.4 Python IDE. The DL model used Keras, which is a DL and neural network API running on top of the TensorFlow library. The Keras API supports two main types of models: the sequential model API, which we used in this work, and the functional API, which may be used for advanced models with complex NN architectures. The applied sequential model is a stack of layers using a dense layer. The experiments were carried

The confusion matrix shows that class 0, the "Economy" class, has (151/178) of its documents correctly classified, which is equivalent to 84.8% of the documents. Looking across the same row of the "Economy" class, we find that (27/178) documents, which form 15.2% of its documents, were misclassified and predicted as class 2, "Local news".

TABLE 3. EVALUATION RESULTS OF THE LR MODEL ON DATASET-I

Class#  Name        Precision  Recall  F1-score  Support
0       Economy     0.88       0.85    0.87      178
1       Int. news   0.98       0.91    0.94      192
2       Local news  0.91       0.94    0.93      489
3       Sport       0.97       0.98    0.97      279

TABLE 4. CONFUSION MATRIX OF THE LOGISTIC REGRESSION CLASSIFICATION MODEL FOR DATASET-I

                              Predicted
Actual  Class  0                1                2                3
        0      84.8% (151/178)                   15.2% (27)
        1      1.0% (2)         91.1% (175/192)  7.3% (14)        0.5% (1)
        2      3.5% (17)        0.8% (4)         94.3% (461/489)  1.4% (7)
        3                       0.4% (1)         1.8% (5)         97.8% (273/279)

For the DL classification model, Table 5 shows the confusion matrix for dataset-I. Taking a closer look at the confusion matrix, it is obvious that the deep learning model outperformed logistic regression in two classes, namely class 0 ("Economy") (+3.4) and class 1 ("International News") (+4.7). For class 2 ("Local News") and class 3 ("Sport"), the logistic regression model was better than the deep learning model.

TABLE 5. CONFUSION MATRIX OF THE DEEP LEARNING CLASSIFICATION MODEL FOR DATASET-I

                              Predicted
Actual  Class  0                1                2                3
        0      88.2% (157/178)                   11.8% (21)
261
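The split-and-evaluate pipeline described in Section D and Section V can be sketched with scikit-learn. This is only an illustrative sketch: the corpus, labels, and vectorizer settings below are stand-ins, not the paper's Arabic datasets or exact configuration.

```python
# Hypothetical sketch of the 80/20 split plus LR evaluation described above.
# The toy corpus below stands in for the Arabic news documents of dataset-I.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

docs = ["economy text", "sport text", "local news", "intl news"] * 25
labels = [0, 3, 2, 1] * 25

X = TfidfVectorizer().fit_transform(docs)

# 80% training / 20% test split, as in Section D
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.2, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
y_pred = clf.predict(X_te)

print(confusion_matrix(y_te, y_pred))       # rows = actual class, cols = predicted
print(classification_report(y_te, y_pred))  # precision, recall, F1 per class (eqs. 5-7)
```

`classification_report` computes exactly the per-class precision, recall, and F1-score of equations (5)-(7), and `confusion_matrix` yields the count table summarized in Tables 4 and 5.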
TABLE 6. TRAINING AND VALIDATION ACCURACY OF DATASET-I

Epoch  Validation Accuracy  Training Accuracy
1      0.729                0.567
2      0.894                0.843
3      0.928                0.925
4      0.929                0.947
5      0.929                0.953
6      0.934                0.958
7      0.933                0.963
8      0.939                0.967
9      0.937                0.972
10     0.938                0.976

The next best performance of the logistic regression model was on the "Computer" class, and the least performance was on the "Economics" class.

TABLE 8. LOGISTIC REGRESSION RESULTS OF DATASET-II

Class#  Name         Precision  Recall  F1-score  Support
0       Computer     1.00       0.79    0.88      19
1       Economics    0.73       0.93    0.81      40
2       Education    1.00       0.75    0.86      16
3       Engineering  0.95       1.00    0.98      21
4       Law          1.00       0.52    0.69      21
5       Medicine     0.92       1.00    0.96      34
6       Politics     0.92       1.00    0.96      44
7       Religion     0.96       0.95    0.95      55
8       Sports       1.00       1.00    1.00      39
Fig. 6 shows the model building time for the two tested classification models. The evaluation results show that the time consumed by deep learning to build the model was less than the time consumed by the logistic regression model for both datasets.
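The Keras sequential model described in the implementation part of Section V (a stack of dense layers on top of the Tensorflow backend) can be sketched as follows. This is an illustrative sketch only: the layer sizes, activations, optimizer, and the 4-class output are assumptions, not hyperparameters reported by the paper.

```python
# Illustrative Keras sequential model: a stack of dense layers, as described
# in Section V. All hyperparameters below are assumptions for demonstration.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

num_features, num_classes = 5000, 4  # e.g. TF-IDF vocabulary size, 4 news classes

model = keras.Sequential([
    layers.Dense(512, activation="relu", input_shape=(num_features,)),
    layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Toy run on random data just to exercise the training loop
X = np.random.rand(32, num_features).astype("float32")
y = np.random.randint(0, num_classes, size=32)
model.fit(X, y, epochs=1, verbose=0)
```

Training with `validation_split` would produce the per-epoch training and validation accuracies of the kind reported in Table 6.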
Fig. 3. Training & validation accuracy of DL classification of dataset-II
accuracy and model building time in favor of the deep learning model compared with the logistic regression model. The results indicate that deep learning classification models are very promising for the Arabic text classification problem.
Arabic Text Semantic Graph Representation

Wael Mahmoud Al Etaiwi
Princess Sumaya University for Technology
Amman, Jordan
w.etaiwi@psut.edu.jo

Arafat Awajan
Princess Sumaya University for Technology
Amman, Jordan
awajan@psut.edu.jo
Abstract— Semantic representation of Arabic text can facilitate several language processing applications. It reflects the meaning of the text as it is understood by humans. Semantic graphs can be used to enhance the performance of several natural language processing applications such as question answering and textual entailment. This paper proposes a graph-based Arabic text semantic representation model. The proposed model aims to represent the meaning of Arabic sentences as a rooted acyclic graph. Most of the work on semantic representation has focused on the English language, and not much work has considered the Arabic language. The model proposed in this paper is dedicated to the Arabic language and considers its features and challenges.

Keywords—semantic graph; knowledge representation; semantic; semantic representation.

I. INTRODUCTION

Knowledge representation using a predefined set of notations that can be used by a computer program in a systematic way is called semantics [1]. The semantic relations between words and text components play a key role in several applications, especially text analysis and mining applications.

Semantic representation aims to reflect what humans understand about the meaning of a given text. It is used in several Natural Language Processing (NLP) applications such as Question Answering (QA), Textual Entailment (TE) and text summarization.

Semantic representation models can be classified into four main categories: predicate logic representation, frame representation, network representation, and rule-based representation. In predicate logic representation, the language is used as a notation set in order to represent the semantic relations between text components [2], [3]. A set of logic notations is used to express the meaning of words in the sentence. For example, the sentence "The weather is beautiful" is represented as: beautiful(weather). The complexity of representing complex sentences is the main drawback of this representation model. Furthermore, ignoring helping verbs and supporting words reduces the quality of the retrieval process [2]. In frame representation, the original text is represented as slots of components and parts. Each part carries a specific type of information [4]. The key step in this model is to split the original text into its appropriate components and parts, which is a time-consuming process. Furthermore, retrieving the original text from its frame representation is a very difficult task. In network representation, also called graph representation, the semantic relations are represented as a set of vertices and edges [5]. The resulting weighted graph, called a semantic graph, has weights that represent the semantic relations between vertices (words). For paragraphs or documents, the semantic graph becomes more complex and difficult to manipulate. On the other hand, network representation is flexible and accumulative; thus, it is suitable for real-time and online applications. Finally, a set of predefined rules is used in rule-based representation to represent the semantic relations between words. The combination of applied rules may differ among implementations, and the order of checked rules and the priority of applied rules may produce different representations. This negatively affects the process of retrieving the original text from its rule-based representation [6].

Semantic parsing refers to the process of mapping text into its semantic representation [7]. Many different methods and techniques are used in semantic parsers, such as machine learning and linguistics-based methods [8]. Semantic parsers are classified into two main types: deep semantic parsers and shallow semantic parsers. Deep semantic parsers are used to represent text components such as multiword expressions [9], while in shallow semantic parsers, each word in the text is represented according to its meaning and its semantic relationship with other words [10].

For the Arabic language, semantic parsers are limited and have received less attention in comparison with other languages. This is due to the lack of high-quality resources and tools that could be used to build Arabic NLP models. Furthermore, the Arabic language has a sophisticated syntax and morphology. Thus, most of the proposed Arabic parsers focus on the syntax and morphology of Arabic text rather than its semantic representation [11].

In this paper, we propose an Arabic text semantic network representation model. The semantic graph is used to represent the meaning of Arabic sentences. The proposed representation model represents different sentences with the same semantic graph when they share the same meaning. The proposed model is designed for the Arabic language, and it has the ability to represent and retrieve Arabic sentences easily.

The remainder of this paper is structured as follows: Section II presents the related work. The proposed model is presented in Section III. Some examples are illustrated in Section IV. Section V discusses the main challenges of the proposed model. Finally, the conclusion and future work are presented in Section VI.

II. RELATED WORK

Most of the proposed research on semantic representation and parsers is small and domain-oriented [12]. Furthermore, it is oriented toward the English language. Several graph-based
B. Noun Relations

Noun relations are mainly classified into two groups: relations between two nouns, and relations based on noun type. The first group is represented by adding a new edge to connect the two nouns together, such as adjective (Adj), modifier (Mod.) and identifier (Idf.). For example, the sentence "الشمس مشرقة" (The sun is shining) consists of two nouns: "الشمس" (the sun) and its adjective "مشرقة" (shining). Thus, a direct adjective edge is added to connect the two vertices. Figure 2 illustrates an example.

Figure 3: Adding Location Relation.

2) Person. In the Arabic language, person names can be mentioned in the original text in many different forms. They may consist of one or more phrases (e.g. "طه حسين") or a noun phrase (e.g. "مخترع الذرة"). Thus, a new concept vertex called (Person) has been created and added to the semantic graph in order to represent the person name. Figure 4 illustrates an example of representing the sentence "سجل محمد صلاح الهدف" (Mohamad Salah scored the goal).

C. Conjunctions Relation

An additional concept vertex is used to represent conjunction relations such as "و" (and) and "أو" (or). Furthermore, the conjunction options are represented as relation edges that connect the concept conjunction vertex with the original word vertices. For instance, the representation of the sentence "اشترى الطالب قلم ودفتر" (The student bought a pen and a notebook) is illustrated in Figure 6.
vertex. An additional concept vertex is added to represent the location.
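As a sketch only, the construction described above (content words as vertices, labeled semantic edges such as Adj, and extra concept vertices for conjunctions) could be expressed with networkx. The node names, edge labels, and attribute scheme below are illustrative assumptions, not the paper's exact representation:

```python
# Illustrative sketch of the semantic-graph construction described above.
# Edge labels and the "concept" marker are assumptions for demonstration.
import networkx as nx

g = nx.DiGraph()

# "The sun is shining": two nouns connected by a direct adjective (Adj) edge
g.add_edge("الشمس", "مشرقة", relation="Adj")

# "The student bought a pen and a notebook": a concept vertex for the
# conjunction, with relation edges to the conjoined word vertices
g.add_node("AND", kind="concept")
g.add_edge("اشترى", "الطالب", relation="Subj")
g.add_edge("اشترى", "AND", relation="Obj")
g.add_edge("AND", "قلم", relation="option")
g.add_edge("AND", "دفتر", relation="option")

print(nx.is_directed_acyclic_graph(g))  # the model requires a rooted acyclic graph
```

Keeping the graph directed and acyclic matches the rooted-acyclic-graph requirement of the proposed model; the `relation` attribute plays the role of the labeled semantic edges.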
structure, and it has many challenging features. In the Arabic language, many different types of ambiguity affect the understanding of sentence meaning. For instance, the same word could be used for either location or time, such as the word "مغرب", which could be used as a location noun (e.g. "سافرت باتجاه المغرب" (I traveled towards the west)) or as a daytime noun (e.g. "عدت الى المنزل بعد مغرب الشمس" (I went home after sunset)).

Figure 10: Semantic representation of the sentence "قرر الجيش الأمريكي خفض عدد قواته في الباكستان خلال العام المقبل" (The US military has decided to reduce the number of its troops in Pakistan during next year).

Another challenge that may affect the quality of the semantic graph is Named Entity Recognition (NER). The lack of capital letters in the Arabic language makes NER a challenging task. Furthermore, names in Arabic are derived from adjectives. For example, the word "كريم" can be used as a named entity (person name) or as an adjective which means (generous).

Several text processing toolkits have been proposed for the Arabic language in order to perform specific text processing tasks, such as POS tagging, segmentation, dependency parsing, and others. The quality of the used toolkit affects the semantic representation of the Arabic text.

In order to overcome the challenges of Arabic text semantic representation, further preprocessing tasks should be conducted along with further analysis of the Arabic text. A better understanding of the morphological and syntactic features of the Arabic language yields a better semantic representation. On the other hand, using high-quality resources that are dedicated to the Arabic language is more useful than using resources translated from other languages, since Arabic resources consider Arabic features and challenges during processing.

VI. CONCLUSION AND FUTURE WORK

This paper proposed a model for Arabic text semantic representation. The proposed model represents text components (words) and the semantic relations between them as a rooted acyclic graph. The proposed model is dedicated to the Arabic language and considers Arabic language features and challenges. The vertices in the proposed semantic graph consist of the original words in addition to the main concepts. The main concepts include location, person and date-time. The proposed model can be used to represent different types of Arabic sentences, including questions and conjunctions.

In our ongoing research, we are going to utilize the proposed model to enhance different Arabic NLP applications such as textual entailment and question answering. Furthermore, a new dataset that contains a collection of pre-generated graphs could be established and produced.

REFERENCES

[1] P. J. Hayes, "Some Problems and Non-problems in Representation Theory," in Proceedings of the 1st Summer Conference on Artificial Intelligence and Simulation of Behaviour, Amsterdam, The Netherlands, 1974, pp. 63–79.
[2] A. Ali and M. A. Khan, "Selecting predicate logic for knowledge representation by comparative study of knowledge representation schemes," in 2009 International Conference on Emerging Technologies, 2009, pp. 23–28.
[3] A. Ali and M. A. Khan, "Knowledge representation of Urdu text using predicate logic," in 2010 6th International Conference on Emerging Technologies (ICET), 2010, pp. 293–298.
[4] M. Minsky, "A Framework for Representing Knowledge," Massachusetts Institute of Technology, Cambridge, MA, USA, 1974.
[5] M. R. Quillian, "Semantic Networks," in Semantic Information Processing, M. L. Minsky, Ed. MIT Press, 1968.
[6] M. A. Tayal, M. M. Raghuwanshi, and L. G. Malik, "Semantic Representation for Natural Languages," Int. Refereed J. Eng. Sci. (IRJES), vol. 4, no. 10, pp. 01–07, Oct. 2015.
[7] Y. Wilks and D. Fass, "The preference semantics family," Comput. Math. Appl., vol. 23, no. 2, pp. 205–221, 1992.
[8] P. Liang, "Learning Executable Semantic Parsers for Natural Language Understanding," Commun. ACM, vol. 59, no. 9, pp. 68–76, Aug. 2016.
[9] P. Liang and C. Potts, "Bringing Machine Learning and Compositional Semantics Together," Annu. Rev. Linguist., vol. 1, no. 1, pp. 355–376, 2015.
[10] D. Jurafsky and J. H. Martin, Speech and Language Processing, vol. 3. Pearson London, 2014.
[11] B. Haddad, "Semantic Representation of Arabic: a Logical Approach towards Compositionality and
Generalized Arabic Quantifiers," Int. J. Comput. Proc. Orient. Lang., vol. 20, pp. 37–52, 2007.
[12] L. Banarescu et al., "Abstract Meaning Representation for Sembanking," in Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse, Sofia, Bulgaria, 2013, pp. 178–186.
[13] J. Bos, V. Basile, K. Evang, N. J. Venhuizen, and J. Bjerva, "The Groningen Meaning Bank," in Handbook of Linguistic Annotation, Springer, 2017, pp. 463–496.
[14] O. Abend and A. Rappoport, "Universal Conceptual Cognitive Annotation (UCCA)," in Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Sofia, Bulgaria, 2013, pp. 228–238.
[15] M. AL-Smadi, Z. Jaradat, M. AL-Ayyoub, and Y. Jararweh, "Paraphrase identification and semantic text similarity analysis in Arabic news tweets using lexical, syntactic, and semantic features," Inf. Process. Manag., vol. 53, no. 3, pp. 640–652, May 2017.
[16] Z. Kastrati, A. S. Imran, and S. Y. Yayilgan, "The impact of deep learning on document classification using semantically rich representations," Inf. Process. Manag., vol. 56, no. 5, pp. 1618–1632, Sep. 2019.
[17] M. Palmer, D. Gildea, and P. Kingsbury, "The Proposition Bank: An Annotated Corpus of Semantic Roles," Comput. Linguist., vol. 31, no. 1, pp. 71–106, 2005.
[18] H. Uchida, M. Zhu, and T. Della Senta, "UNL: Universal Networking Language, an electronic language for communication, understanding, and collaboration," Tokyo UNU/IAS/UNL Cent., 1996.
[19] S. Alansary, M. Nagi, and N. Adly, "The universal networking language in action in English-Arabic machine translation," in Proceedings of the 9th Egyptian Society of Language Engineering Conference on Language Engineering (ESOLEC 2009), 2009, pp. 23–24.
[20] S. S. Ismail, M. Aref, and I. F. Moawad, "Rich semantic graph: A new semantic text representation approach for Arabic language," in 7th WSEAS European Computing Conference (ECC '13), 2013.
[21] C. Lhioui, A. Zouaghi, and M. Zrigui, "A Rule-based Semantic Frame Annotation of Arabic Speech Turns for Automatic Dialogue Analysis," Procedia Comput. Sci., vol. 117, pp. 46–54, 2017.
[22] C. F. Baker, C. J. Fillmore, and J. B. Lowe, "The Berkeley FrameNet Project," in Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics, 1998.
[23] A. Sharaf and E. Atwell, "Knowledge representation of the Quran through frame semantics: A corpus-based approach," Corpus Linguistics 2009, p. 12, 2009.
[24] G. A. Miller, "WordNet: A Lexical Database for English," Commun. ACM, vol. 38, no. 11, pp. 39–41, Nov. 1995.
Sentiment Analysis for Arabic Language using Attention-Based Simple Recurrent Unit

Saja Al-Dabet
Department of Computer Science
Princess Sumaya University for Technology
Amman, Jordan
saja.aldabet@yahoo.com

Sara Tedmori
Department of Computer Science
Princess Sumaya University for Technology
Amman, Jordan
s.tedmori@psut.edu.jo
Abstract— With the growing number of people who express their opinions on the web, Sentiment Analysis has become an active research field that aims to analyze and classify the sentiment polarity of opinionated reviews. Recently, Deep Learning models have been extensively used for many Natural Language Processing tasks, including Sentiment Analysis. In this paper, the authors propose a Deep Learning model for Arabic sentence-level Sentiment Analysis. The proposed model integrates an emerging variant of Recurrent Neural Networks known as the Simple Recurrent Unit, which is characterized by its light recurrent computations, with an attention mechanism that concentrates on the important parts of an input text. The Simple Recurrent Unit model allows parallel recurrent calculations that enhance the training process in terms of time and accuracy. Experiments were performed to evaluate the performance of the proposed model using the Large Scale Arabic Book Reviews (LABR) dataset. The proposed model obtained state-of-the-art results compared to other Deep Learning models, achieving 94.53% accuracy with faster execution time.

Keywords—Sentiment Analysis, Deep Learning, Natural Language Processing, Recurrent Neural Networks, Simple Recurrent Unit, Attention Mechanism.

I. INTRODUCTION

With the rise of Web 2.0 services, people around the globe have become more willing to express their opinions and share them with others using different platforms such as e-commerce websites, blogs, social media websites, and many others. Such opinions can be exploited by various applications like sales prediction, reputation evaluation, and intention analysis. In recent years, as published opinions continue to play a vital role in customers' purchase decisions, there has been a steady increase in interest in the field of Sentiment Analysis (SA) and its applications. SA is a field of study that aims to analyze and classify people's emotions, evaluations, or opinions as positive, negative, or neutral. SA is divided into three levels: aspect-level, sentence-level, and document-level. Aspect-level SA aims to classify the sentiment polarity for different aspects by considering the discussed entities. The two latter levels, however, are more general: they treat the sentence or the document as expressing a sentiment about a specific entity without considering the discussed aspects of each entity [1].

Although the majority of SA research efforts target languages such as English, SA for the Arabic language has gained special attention in the last few years. The Arabic language is spoken by millions of people in the Arab world. Arabic comes in three main forms: classical Arabic, which is used in the Quran; Modern Standard Arabic (MSA), which is derived from the classical form and used for formal writing and speaking; and colloquial Arabic, a regional dialect that is used for informal speaking and varies by region [2].

There are three main approaches to SA: (1) lexical-based approaches, (2) Machine Learning (ML) based approaches, and (3) hybrid approaches. Lexical-based approaches depend on external lexicons that are used to uncover the sentiment polarity. In ML based approaches, supervised learning techniques are applied. Lastly, hybrid approaches integrate both lexical and ML based approaches [3]. Deep Learning (DL), a subfield of ML, has demonstrated its power and success in a variety of fields, including Natural Language Processing (NLP). Recurrent Neural Networks (RNN), including variants such as the Gated Recurrent Unit (GRU) [4] and Long Short-Term Memory (LSTM) [5], are capable of dealing with a large number of sequence modeling tasks such as language understanding [6], [7], opinion mining [8], and Question Answering (Q&A) [9]. However, RNN models are limited by the timestep dependency: the calculation of each timestep depends on the completion of the previous one, which restricts the processing of long sequences, especially in deep models. This dependency makes the operations slower and less scalable than other DL models like Convolutional Neural Networks (CNN), which allow parallel computations [10], [11]. The Simple Recurrent Unit (SRU) model was proposed as a light recurrent model designed to support parallelism, with a careful parameter initialization scheme. Moreover, the SRU model utilizes highway connections, which improve the training process even in a model with multiple layers. The SRU model has been applied to Q&A, machine translation, and different text classification tasks with comparable results and speed [11].

The advances of DL have reshaped NLP research. DL models have been integrated with attention mechanisms for different tasks, which help the model automatically concentrate on the important words in a sentence. In SA, the words in a sentence do not contribute equally to classifying the sentiment polarity. Consider the following sentence: "The story is full of suspense, worth to read it". Only the words "suspense" and "worth" play an important role in determining the sentiment polarity of this sentence, which is positive. Attention has been widely applied to aspect-level SA [12]–[14].

In this paper, the authors aim to investigate the use of an SRU model with the attention mechanism (Att-SRU model)
The authors of [17] examined several DL architectures based on LSTM and CNN models for Arabic SA, such as a simple LSTM, a simple CNN, a combination of LSTM and CNN, a stacked LSTM, and a combination of two LSTM models with different dropout probabilities and combination methods. The evaluation of these models was based on two publicly available Twitter datasets: the Arabic Sentiment Twitter Dataset (ASTD) [18] and the Arabic Twitter (ArTwitter) dataset [19]. The experiments showed promising results for the combined LSTM model, which outperformed all the tested models.

In another study [20], the authors exploited the advantage of combining CNN and LSTM models. To consider the morphological diversity of the Arabic language, different levels of SA were explored: character level, word level, and character n-grams level. Both the word level and the character n-grams level showed better results than the character level: the used dataset was based on Twitter data, and the character level increased the number of extracted features without any beneficial effect.

The work presented in [21] explored the effect of utilizing various DL models for Arabic SA. The authors investigated four models: Deep Neural Networks (DNN), Deep Belief Networks (DBN), Recursive Auto Encoders (RAE), and a combination of DBN and Deep Auto-Encoder (DAE) models. To train the first three models, lexicon-based features were used, based on the ArSent lexicon [22]. However, the last model was trained using the indices of raw words. According to the reported results, the RAE model achieved the best performance among the investigated models, as it considers parsing order and contextual semantics. Although the RAE model obtained the best results, it suffers from a limited capability of generalizing semantics and modeling the morphological interaction between morphemes.

The same group of authors extended the work in [21] and developed a Recursive DL Model for Opinion Mining in Arabic (AROMA) [23]. This model handled the limitations of RAE by adding a morphological tokenization procedure followed by sentiment and semantic embeddings. Moreover,

In [25], several experiments were conducted using different ML models such as SVM, NB, and Logistic Regression (LR), which were trained using Term Frequency-Inverse Document Frequency (TF-IDF), unigrams, and bigrams [26]. Moreover, DL models were also examined, including DNN and CNN; those models were trained using word frequencies and word2vec [27], respectively. The authors introduced a health dataset collected from Twitter data. The best results were obtained when using the SVM model.

In [28], the authors released pre-trained word embeddings based on a large Arabic corpus. The embeddings were built using the two word2vec architectures: Continuous Bag-of-Words (CBOW) and skip-gram. A set of classifiers was used in the experiments: linear SVM, Random Forest, and Logistic Regression, trained using the generated embeddings. The experiments showed that utilizing word embeddings can slightly enhance performance. The Logistic Regression and linear SVM classifiers outperformed the other classifiers.

In [29], the authors applied a lexicon-based fuzzy approach. Their model is composed of two stages: in the first stage, weights are assigned to the entered text; in the second, fuzzy logic operations are applied to classify the sentiment polarity. The lexicon-based fuzzy approach was compared with a lexicon-based approach and achieved better results.

III. METHODOLOGY

The architecture of the proposed model is shown in Figure 1. It consists of four main modules: the input module, SRU module, attention module, and output module. Given a set of sentences representing book reviews from the LABR dataset, the proposed model aims to classify the sentiment polarities of those sentences as positive or negative.
multiplication. This means that the current state does not have
to wait until the full completion of the previous state . In
this way, the state vectors dimensions become independent of
each other and that facilitates the parallelization process.
B. Simple Recurrent Unit (SRU) Module

After receiving the input's vectors, this module aims to extract the input's sequential features. The proposed model utilizes a gated network similar to LSTM and GRU but with a parallelism scheme. The architecture of SRU includes two main components: light recurrence operations and highway operations. The light recurrence operations involve reading the input vectors and extracting the sequential features by computing the cell state c_t. This could be achieved using the following equations:

f_t = σ(W_f x_t + v_f ⊙ c_{t−1} + b_f)   (1)

c_t = f_t ⊙ c_{t−1} + (1 − f_t) ⊙ (W x_t)   (2)

where f_t represents a forget gate that controls the flow of information and c_t represents the cell state, which is calculated as the adaptive average of the previous cell state c_{t−1} and the current input (W x_t) with respect to f_t. W and W_f refer to parameter matrices, and v_f and b_f refer to parameter vectors learned during training. The way the previous cell state is used makes a substantial difference between the SRU and other recurrent models: a point-wise multiplication operation ⊙ is used rather than matrix multiplication, which removes the dependency between time steps and allows the recurrence to be computed in parallel.

In order to make gradient-based training easier, a highway component [33] is used in SRU. A reset gate r_t is utilized, which integrates the state produced by the light recurrence, c_t, with the current input x_t. When stacking multiple layers of this model, the (1 − r_t) ⊙ x_t term in (4) provides skip connections that permit the gradient to flow through the layers. The following equations illustrate the process:

r_t = σ(W_r x_t + v_r ⊙ c_{t−1} + b_r)   (3)

h_t = r_t ⊙ c_t + (1 − r_t) ⊙ x_t   (4)

C. Attention Module

This module assigns a weight to each hidden vector produced by the SRU module so that the words that contribute most to the sentiment receive more focus:

u_i = tanh(W_w h_i + b_w)   (5)

α_i = exp(u_i^T u_w) / Σ_j exp(u_j^T u_w)   (6)

s = Σ_i α_i h_i   (7)

where the hidden vectors H are fed into a single-layer multi-layer perceptron network to calculate the hidden representation u_i for each hidden vector h_i. Afterward, the significance of each word is calculated as the similarity between the generated vector u_i and a trainable vector u_w, and the output is normalized using a softmax function to form the attention weight α_i. Finally, the sentence representation s is produced as a weighted sum of the attention weights and the words' hidden vectors [34], [35]. This representation is passed to the output module to classify the sentiment of the given sentence.

D. Output Module

After receiving the final representation from the attention module, a sigmoid layer is utilized in order to classify the sentiment of each sentence. The model is trained in a supervised manner by minimizing the cross-entropy error between the classified sentiment polarities and the actual sentiment polarities. In addition, an L2 regularization technique is used to alleviate model overfitting [36]. The loss function is defined as follows:

L(θ) = − Σ_{(x,y)∈D} Σ_{c∈C} y_c log ŷ_c(x; θ) + λ‖θ‖²   (8)
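As a concrete illustration, the light recurrence and highway steps of eqs. (1)-(4) can be sketched in plain Python for a single scalar feature per time step; the parameter values below are illustrative, not taken from the paper:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sru_step(x, c_prev, p):
    """One SRU time step (eqs. (1)-(4)) for a scalar feature.

    The matrices W, W_f, W_r and the vectors v_f, b_f, v_r, b_r
    collapse to scalars here; values in `p` are made up for illustration."""
    f = sigmoid(p["Wf"] * x + p["vf"] * c_prev + p["bf"])  # (1) forget gate
    c = f * c_prev + (1.0 - f) * (p["W"] * x)              # (2) light recurrence
    r = sigmoid(p["Wr"] * x + p["vr"] * c_prev + p["br"])  # (3) reset gate
    h = r * c + (1.0 - r) * x                              # (4) highway output
    return h, c

params = {"W": 0.5, "Wf": 0.3, "vf": 0.2, "bf": 0.0,
          "Wr": 0.4, "vr": 0.1, "br": 0.0}

c = 0.0
outputs = []
for x in [1.0, -0.5, 2.0]:  # a toy input sequence
    h, c = sru_step(x, c, params)
    outputs.append(h)
```

Note that only the cheap element-wise recurrence on c depends on the previous step; the matrix products involving x can be precomputed for the whole sequence, which is what enables the parallelism.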
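The attention computation described in the attention module (an MLP projection, a softmax over similarity scores, and a weighted sum) can be sketched as follows; for clarity the hidden states are scalars, and the parameter values are illustrative rather than learned:

```python
import math

def attention(H, Ww, bw, uw):
    """Attention over hidden states H (scalars here for simplicity):
    u_i = tanh(Ww*h_i + bw), alpha = softmax(u_i * uw), s = sum(alpha_i * h_i)."""
    u = [math.tanh(Ww * h + bw) for h in H]          # hidden representations
    scores = [ui * uw for ui in u]                   # similarity to context vector
    m = max(scores)                                  # shift for numerical stability
    exps = [math.exp(sc - m) for sc in scores]
    total = sum(exps)
    alpha = [e / total for e in exps]                # softmax attention weights
    s = sum(a * h for a, h in zip(alpha, H))         # sentence representation
    return alpha, s

alpha, s = attention([0.2, 1.5, -0.3], Ww=1.0, bw=0.0, uw=2.0)
```

Because tanh is monotonic and the context weight uw is positive, the largest hidden state receives the largest attention weight in this toy run.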
where D refers to the training dataset, C refers to the sentiment polarity classes, y ∈ ℝ^|C| refers to the sentiment class, which is represented as a one-hot vector where 1 marks the true class and 0 the false class, ŷ(x; θ) refers to the estimated sentiment distribution, and λ refers to the regularization weight.

IV. EXPERIMENTS AND RESULTS

To evaluate the performance of the proposed model, the following experiments, detailed in this section, were conducted.

A. Data

The LABR dataset [16] is a book reviews dataset composed of 63,000 reviews collected from the Goodreads website. The reviews are annotated with a rating from 1-5 stars. The utilized version of the dataset is a binary version with two classes: a positive class for 4-5 star ratings and a negative class for 1-2 star ratings. The dataset consists of 42,832 positive reviews and 8,224 negative reviews.

In order to prepare the dataset, each sentence in the dataset was tokenized into a sequence of words. Thereafter, a pre-trained vector for each word was retrieved. The adopted pre-trained vectors were built on a word-level representation trained on a Wikipedia dataset with 300 dimensions.

B. Experimental Setup

The experiments were conducted on a Windows 10 machine with a 64-bit operating system, 16 GB RAM, and an Intel(R) Core(TM) i7 CPU. The development environment was Python 3.6, and the implementation used the TensorFlow open-source machine learning library [37]. For the SRU module, the number of hidden cells was 100. The model was trained for 15 epochs with a learning rate of 0.001, an L2 weight of 0.001 for regularization, dropout with 0.7 probability, a batch size of 128, the Adaptive Moment Estimation (Adam) optimizer [38] for stochastic weight optimization, and sigmoid cross entropy as the loss function.

C. Evaluation Measure

In order to evaluate the proposed model, the accuracy measure was used. The accuracy measure is defined as the number of correctly classified sentiment polarities divided by the total number of sentiment polarities. This measure is calculated as follows:

Accuracy = (TP + TN) / (TP + TN + FP + FN)   (9)

where TP is the number of relevant sentiment polarities that are correctly classified, TN is the number of irrelevant sentiment polarities that are correctly classified, FP is the number of irrelevant sentiment polarities that are incorrectly classified, and FN is the number of relevant sentiment polarities that are incorrectly classified.

D. Results and Discussion

Table 1 presents the experimental results of the proposed model in comparison to other Arabic models from the literature. The Arabic models can be summarized as follows:

- CNN: a CNN model trained on word embeddings generated using the word2vec model. Convolutional filters with different sizes were used for the convolutional operations. To down-sample the extracted features, a max-over-time pooling operation was used. Finally, a sigmoid function was used for classification [15].
- Baseline: a linear SVM classifier trained using N-gram and TF-IDF features [16].
- Random Forest, Linear SVM, Logistic Regression: these classifiers were trained on word embeddings generated from a large Arabic corpus using the word2vec model. The classifiers were used with their default parameter configurations [28].
- Fuzzy Logic: the learning process was divided into two phases: a data pre-processing and feature extraction phase, and a fuzzy control system phase. The model was trained on lexicon-based features [29].

TABLE 1. Experimental Results

Model                      Accuracy   Time/Min
CNN [15]                   89.6%      -
Baseline [16]              75.1%      -
Random Forest [28]         80.05%     -
Linear SVM [28]            81.27%     -
Logistic Regression [28]   81.88%     -
Fuzzy Logic [29]           80.59%     -
GRU                        90.62%     114
SRU                        92.96%     34
GRU + Attention            93.75%     129
SRU + Attention            94.53%     40

The proposed model outperformed the baseline and all the previous models and achieved the best results. The baseline model [16] obtained the worst result, since SVM requires more comprehensive feature engineering to achieve better performance. The utilization of word embeddings in [28] slightly improved the performance of the SVM, Random Forest, and Logistic Regression classifiers, as word embeddings capture semantic features of the text, which can be very helpful for the task at hand. The Fuzzy Logic model [29] achieved a result comparable to the other models, as it was based on lexicon information. The CNN model obtained the best result among the models from the literature. The effectiveness of CNN models for text classification has been studied by many researchers: a CNN is characterized by a hierarchical architecture capable of extracting local-invariant features, which helps in text modeling tasks. However, despite the CNN's capability of extracting local features, recurrent models, which are characterized by their sequential architecture, still achieve better results in text classification tasks. This explains the noteworthy results obtained by the proposed model against the previous models.
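The accuracy measure used in the evaluation reduces to a one-line computation over the confusion counts; the counts below are made up purely for illustration:

```python
def accuracy(tp, tn, fp, fn):
    """Accuracy: correctly classified polarities over all polarities."""
    return (tp + tn) / (tp + tn + fp + fn)

# hypothetical confusion counts for a binary sentiment classifier
acc = accuracy(tp=450, tn=430, fp=70, fn=50)  # 880 correct out of 1000
```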
Our experiments aimed to examine the effect of: (1) using the SRU model for the SA task, and (2) integrating the attention mechanism with the SRU model for the SA task. To evaluate the impact of using the SRU model against other recurrent models, a Gated Recurrent Unit (GRU) was also applied using the same hyperparameter configurations. The reported results in Table 1 show that using SRU cells for this task leads to better performance in terms of both accuracy and time. The simplified design of SRU cells helps improve the training process, where the element-wise multiplication makes training easier and more likely to obtain better results. Figures 2 and 3 show the training accuracy of the SRU model and the GRU model, respectively. It can be noticed that training the SRU model is more robust compared with the GRU model: the SRU does not suffer while training, and its accuracy improves smoothly over the training steps. This demonstrates that SRU cells are practically simpler to train than GRU cells. Moreover, the probability of overfitting with the SRU is lower than that of a GRU model. The reason behind this is the constraints imposed on the recurrent weights, which prevent extensive correlations within the same layer. Furthermore, the parallelization scheme allowed the SRU model to be much faster than the GRU model because of the light recurrence, as each state does not have to wait for the previous one to finish.

Integrating the attention mechanism with the SRU model further improved the results, since the attention mechanism helps the model to focus on the significant words in the sentence rather than give the same level of attention to all the words. Figure 4 shows a visualization example of attention weights. It can be noticed that the positive words "اكثر" and "رائعة" have higher attention weights than the rest of the words in the sentence. This helps the model to focus more on these words and take the final classification decision, which is positive in this example.

Fig. 4. A visualization example of attention weights. Dark red refers to high attention weights and lighter shades refer to lower weights.

V. CONCLUSION AND FUTURE WORK

In this paper, the authors have proposed a DL model to tackle sentence-level SA for the Arabic language. The proposed model investigated the utilization of a variant of the recurrent model called the Simple Recurrent Unit (SRU), which permits parallel recurrent computations, and the integration of this model with the attention mechanism. The proposed model outperformed the Gated Recurrent Unit (GRU) and obtained competitive results compared to other DL models. The obtained results were better in terms of both time and accuracy.
[9] A. Kumar et al., "Ask me anything: Dynamic memory networks for natural language processing," in International Conference on Machine Learning, 2016, pp. 1378–1387.
[10] J. Bradbury, S. Merity, C. Xiong, and R. Socher, "Quasi-recurrent neural networks," arXiv Prepr. arXiv1611.01576, 2016.
[11] T. Lei, Y. Zhang, S. I. Wang, H. Dai, and Y. Artzi, "Simple recurrent units for highly parallelizable recurrence," in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018, pp. 4470–4481.
[12] J. Liu and Y. Zhang, "Attention modeling for targeted sentiment," in Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, 2017, vol. 2, pp. 572–577.
[13] M. Yang, W. Tu, J. Wang, F. Xu, and X. Chen, "Attention Based LSTM for Target Dependent Sentiment Classification," in AAAI, 2017, pp. 5013–5014.
[14] D. Ma, S. Li, X. Zhang, and H. Wang, "Interactive attention networks for aspect-level sentiment classification," arXiv Prepr. arXiv1709.00893, 2017.
[15] A. Dahou, S. Xiong, J. Zhou, M. H. Haddoud, and P. Duan, "Word embeddings and convolutional neural network for arabic sentiment classification," in Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, 2016, pp. 2418–2427.
[16] M. Aly and A. Atiya, "LABR: A large scale arabic book reviews dataset," in Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 2013, vol. 2, pp. 494–498.
[17] S. Al-Azani and E.-S. M. El-Alfy, "Hybrid deep learning for sentiment polarity determination of arabic microblogs," in International Conference on Neural Information Processing, 2017, pp. 491–500.
[18] M. Nabil, M. Aly, and A. Atiya, "ASTD: Arabic sentiment tweets dataset," in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 2015, pp. 2515–2519.
[19] N. A. Abdulla, N. A. Ahmed, M. A. Shehab, and M. Al-Ayyoub, "Arabic sentiment analysis: Lexicon-based and corpus-based," in 2013 IEEE Jordan Conference on Applied Electrical Engineering and Computing Technologies (AEECT), 2013, pp. 1–6.
[20] A. M. Alayba, V. Palade, M. England, and R. Iqbal, "A combined CNN and LSTM model for arabic sentiment analysis," in International Cross-Domain Conference for Machine Learning and Knowledge Extraction, 2018, pp. 179–191.
[21] A. Al Sallab, H. Hajj, G. Badaro, R. Baly, W. El Hajj, and K. B. Shaban, "Deep learning models for sentiment analysis in Arabic," in Proceedings of the Second Workshop on Arabic Natural Language Processing, 2015, pp. 9–17.
[22] G. Badaro, R. Baly, H. Hajj, N. Habash, and W. El-Hajj, "A large scale Arabic sentiment lexicon for Arabic opinion mining," in Proceedings of the EMNLP 2014 Workshop on Arabic Natural Language Processing (ANLP), 2014, pp. 165–173.
[23] A. Al-Sallab, R. Baly, H. Hajj, K. B. Shaban, W. El-Hajj, and G. Badaro, "AROMA: A recursive deep learning model for opinion mining in arabic as a low resource language," ACM Trans. Asian Low-Resource Lang. Inf. Process., vol. 16, no. 4, p. 25, 2017.
[24] R. Baly, H. Hajj, N. Habash, K. B. Shaban, and W. El-Hajj, "A sentiment treebank and morphologically enriched recursive deep models for effective sentiment analysis in arabic," ACM Trans. Asian Low-Resource Lang. Inf. Process., vol. 16, no. 4, p. 23, 2017.
[25] A. M. Alayba, V. Palade, M. England, and R. Iqbal, "Arabic language sentiment analysis on health services," in 2017 1st International Workshop on Arabic Script Analysis and Recognition (ASAR), 2017, pp. 114–118.
[26] C. Manning, P. Raghavan, and H. Schütze, "Introduction to information retrieval," Nat. Lang. Eng., vol. 16, no. 1, pp. 100–103, 2010.
[27] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, "Distributed representations of words and phrases and their compositionality," in Advances in Neural Information Processing Systems, 2013, pp. 3111–3119.
[28] A. A. Altowayan and L. Tao, "Word embeddings for Arabic sentiment analysis," in 2016 IEEE International Conference on Big Data (Big Data), 2016, pp. 3820–3825.
[29] M. Biltawi, W. Etaiwi, S. Tedmori, and A. Shaout, "Fuzzy based Sentiment Classification in the Arabic Language," in Proceedings of SAI Intelligent Systems Conference, 2018, pp. 579–591.
[30] J. R. Firth, "A synopsis of linguistic theory, 1930-1955," Stud. Linguist. Anal., 1957.
[31] Z. S. Harris, "Distributional structure," Word, vol. 10, no. 2–3, pp. 146–162, 1954.
[32] A. B. Soliman, K. Eissa, and S. R. El-Beltagy, "AraVec: A set of arabic word embedding models for use in arabic NLP," Procedia Comput. Sci., vol. 117, pp. 256–265, 2017.
[33] R. K. Srivastava, K. Greff, and J. Schmidhuber, "Training very deep networks," in Advances in Neural Information Processing Systems, 2015, pp. 2377–2385.
[34] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," arXiv Prepr. arXiv1409.0473, 2014.
[35] Z. Yang, D. Yang, C. Dyer, X. He, A. Smola, and E. Hovy, "Hierarchical attention networks for document classification," in Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2016, pp. 1480–1489.
[36] C. Cortes, M. Mohri, and A. Rostamizadeh, "L2 regularization for learning kernels," in Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, 2009, pp. 109–116.
[37] M. Abadi et al., "TensorFlow: a system for large-scale machine learning," in OSDI, 2016, vol. 16, pp. 265–283.
[38] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv Prepr. arXiv1412.6980, 2014.
A novel medical image fusion algorithm
for detail-preserving edge and feature extraction
Fayadh Alenezi
Department of Electrical Engineering, Faculty of Engineering, Jouf University, Sakaka 72388, Saudi Arabia
Fshenezi@Ju.edu.sa
Abstract—By combining two or more medical images into one, image fusion has become an important tool for clinical diagnosis. However, existing fusion methods have also shown significant limitations, such as the loss of information content, weak contrast, noise and lengthy computation times. This paper presents a novel technique for medical image fusion that seeks to preserve and boost detailed information of the source images, while promoting their edges and textural features and suppressing noise. The method is based on a feature-linking, pulse-coupled neural network, followed by a modified Haar wavelet transform that leads to maximum-selection fusion in the transformed domain and high-scale Wiener filtering of the resulting image. The new algorithm is presented, described and evaluated on two sets of images, and its results are compared to those obtained from existing fusion methods. The performance of the newly developed algorithm is shown to be superior over the reference fusion methods in terms of a set of quality metrics based on subjective visual perception criteria, thus confirming its potential benefits to medical diagnosis.

Keywords—Edge extraction, Information Content, Haar transform, FLM PCNN, Wiener filter, Medical Image Fusion.

I. INTRODUCTION

Image fusion aims at combining complementary and redundant information from two or more images. The fused (composite) image has superior qualities compared to any individual input image [1]. Image fusion improves the quality of decision-making and therefore has found applications in medical imaging, military science, biometrics and machine vision [2].

Image fusion methods are divided into spatial and transform domain fusion methods. Spatial domain fusion methods directly handle the pixels of the input images [3]. On the other hand, transform-domain methods operate in an alternative domain in which images are represented via some suitable transformation [3]. Image fusion can also be categorized based on fusion stages, such as pixel-level, feature-level and decision-level fusion [4]. Pixel-level fusion involves generating a composite image based on predetermination of the pixel intensities of the source images [4]. Feature-level fusion is based on extracting salient features from the source images, such as edges or texture [4]. Decision-level fusion entails the pre-extraction of information from the source images, followed by the application of a set of decision rules in order to reach a common interpretation [4].

Although generally helpful, existing image fusion techniques have also had significant drawbacks. For instance, weighted average fusion methods have often produced outputs with reduced contrast [5]; image fusion using controllable cameras depends on camera motion and does not work with still images [6]; and image fusion based on probabilistic techniques entails huge and lengthy computation efforts. A recently developed image fusion method improves the edges and textural information of the fused image by combining Gabor filtering with maximum selection and fuzzy fusion, resulting in an image with low information content [1]. Another recent method uses a Pulse-Coupled Neural Network (PCNN) together with Gabor filtering in order to produce fused images with high information content [7], although the estimation of the PCNN parameters requires a significant computation time.

In this paper, an improvement of the information content of the fused image is sought by preserving and promoting the textural features of the source images. For this purpose, a Feature-Linking Model (FLM) [8] and a Modified Haar Wavelet Transform (MHWT) [9] are combined. The FLM is aimed at enhancing the contrast of the fused image by preserving and boosting detailed information of the source images. The MHWT also strengthens the contrast of the FLM output image before fusion, which is accomplished by a maximum selection rule. The fused image is filtered using high-scale Wiener filtering [8] in order to smooth out the noise in the final image.

The rest of the paper is organized as follows: the proposed algorithm is presented in Section II, and Section III presents simulation results and their discussion. Section IV provides analysis and conclusions.

II. PROPOSED METHOD

A. Overview and Background

The block diagram in Fig. 1 represents the algorithmic steps proposed to achieve the desired fused image. The input images are fed into a lateral-inhibited and excited feature-linking pulse-coupled neural network (FLM-PCNN) in order to boost and preserve key features. The output from the FLM-PCNN is decomposed in the spatial resolution domain using a Discrete Wavelet Transform (DWT). The transformed image is then modified by means of the Haar wavelet transform to increase image contrast and extract edge details and features. All of the wavelet coefficients are then combined using a maximum fusion rule in order to preserve the salient features of the images. The fused image is then filtered using a high-scale Wiener filter to reduce image noise [8] and to optimize the complementary effects of the inverse transformation from the earlier procedure.

Fig. 1: Schematic representation showing the proposed algorithm.
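The maximum fusion rule mentioned in the overview can be sketched as follows for two lists of transform coefficients (the coefficient values are toy data, not from the paper):

```python
def max_fusion(coeffs_a, coeffs_b):
    """Maximum-selection rule: at each position, keep the coefficient
    with the larger magnitude, preserving the stronger salient feature."""
    return [a if abs(a) >= abs(b) else b for a, b in zip(coeffs_a, coeffs_b)]

# toy wavelet coefficients from two source images
fused = max_fusion([0.9, -0.2, 0.1], [0.3, -0.8, 0.05])
```

In the full algorithm this rule would be applied per sub-band to the MHWT coefficients of each source image before the inverse transform.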
The threshold Θ_ij(n) of the neuron can be represented by a leaky integrator. The threshold Θ_ij(n) is given by (8):

Θ_ij(n) = g Θ_ij(n − 1) + h Y_ij(n − 1),   (8)

where Y_ij(n − 1) is the postsynaptic action potential, g is the attenuation time constant and h is a magnitude adjustment.

d) Summary of FLM action on input image

Each pixel of the input images corresponds to one neuron of the network; therefore, a two-dimensional matrix is represented as r × c neurons (r being the number of image rows and c the number of columns). The input image intensity I is normalized according to

S = N · (I − min(I)) / (max(I) − min(I)) + ε,   (9)

where S represents the output of the FLM stage, min(I) returns the minimum value of I, max(I) returns the maximum value of I, and ε is a small positive constant which ensures nonzero pixel values; it has been set to the smallest gray-scale value of the matrix, ε = min(S).

The first multiplying term N in (9) normalizes the pixel value across its local neighborhood: it is formed from the peak-to-mean amplitude of the neurons' filter response to the edge and the mean amplitude, which is used to achieve contrast invariance during normalization, together with a normalization constant set to 0.5. The normalization matrix increases the lateral inhibition, thereby sharpening the visual and feature properties of the images [14]. These sharp-masked images are subsequently transformed using the MHWT.

C. Modified Haar Wavelet Transform

The output of the FLM becomes the input to this stage. The image is read as a matrix, and the MHWT is applied along the rows and columns of the entire image matrix. The process largely relies on Haar filters in order to help extract image features [19]. When the MHWT is applied along the rows and columns of the FLM-enhanced image matrix, a transformed image matrix is obtained, featuring one level of the input image divided into four corners, namely: the upper left corner holds the approximation of the FLM-enhanced image; the lower left corner the vertical details; the upper right corner the horizontal details; and the lower right corner the high-frequency detail component of the FLM-enhanced image, as presented in Fig. 4.

The LL sub-bands, where all approximations take place, consist of the low-frequency components of the image, and they are split at a high level of decomposition. The HL, HH and LH sub-bands are the detail components [9]. The HL sub-bands result from high-pass filtering in the row direction and low-pass filtering on the columns. The HH sub-bands are high-pass filtered in all directions, while the LH sub-bands result from low-pass filtering in the row direction and high-pass filtering on the columns [9].

All the visible details, like edges and lines of the FLM-enhanced image, are assumed perpendicular to the orientation of the high-pass filtering. The proposed MHWT, as shown in Fig. 4, consists of four nodes as opposed to two nodes in the DWT.

The first average sub-signal in the proposed MHWT is estimated by

a = ((f_1 + f_2)/2, (f_3 + f_4)/2, …, (f_{n−1} + f_n)/2),   (10)

where n is the signal length. The signal is denoted by f, where f = (f_1, f_2, f_3, …, f_n). For instance, the mean of the first sub-signal of length n/2 can be approximated as

a_m = (f_{2m−1} + f_{2m}) / 2,   (11)

and the corresponding detail sub-signal at the same level is approximated as

d_m = (f_{2m−1} − f_{2m}) / 2,   (12)

where m = 1, 2, 3, …, n/2.

The maximum coefficients resulting from the MHWT processes corresponding to all source images are selected, and the inverse transform of this selection is subsequently obtained. The resulting image from the inverse transform is the fused image, which is then fed to the next stage.

D. Space-variant Wiener filter

The fused image from the MHWT stage is filtered using a high-scale Wiener filter to optimize the trade-off between noise power and signal power [20]. High-scale Wiener filtering is used, as opposed to a general Wiener filter, in order to solve the problem of invariance across different image regions. High-scale Wiener filtering is achieved by amplifying the magnitude of the fused image pixels so that their energies dominate over that of the noise [8]. The energy of the spectral components of the fused image pixels that is smaller than the noise energy is set to zero, leading to a noise-free image. The proposed Wiener filter output f̂(x) of the input image ℊ(x) is described as follows:

f̂(x) = m(x) + w(x) [ℊ(x) − m(x)],   (13)
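The averaging and detail computations of eqs. (10)-(12) amount to one level of the Haar split; a minimal sketch on a toy 1-D signal:

```python
def haar_level(f):
    """One level of the Haar averaging/detail split (eqs. (10)-(12)):
    a_m = (f_{2m-1} + f_{2m}) / 2 and d_m = (f_{2m-1} - f_{2m}) / 2."""
    a = [(f[i] + f[i + 1]) / 2 for i in range(0, len(f), 2)]  # average sub-signal
    d = [(f[i] - f[i + 1]) / 2 for i in range(0, len(f), 2)]  # detail sub-signal
    return a, d

a, d = haar_level([4.0, 2.0, 5.0, 7.0])
# each pair is recovered exactly: f_{2m-1} = a_m + d_m, f_{2m} = a_m - d_m
```

Applying this split along the rows and then along the columns of the image matrix yields the LL, LH, HL and HH sub-bands discussed above.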
constant, and it ensures that the filter has a low frequency response at high frequencies.

The first term inside the max[•] operator in (14) ensures the filter is dynamic, making it spatially variant. On the other hand, the weight coefficients |ℬ|^γ(x) / |ℬ| in the max[•] operator depend on the spectrum of the fused image, and have values ranging from 0 to 1 depending on the magnitude of the noise variance σ²:

γ = 1 if |ℬ| ≥ σ², and γ = 0 otherwise.   (15)
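Because the weight definition in (14) is only partially legible here, the sketch below uses the classic local-statistics Wiener weight, w = max(var − noise_var, 0) / var, applied per pixel as in eq. (13); this weight form is an assumption, and the numbers are toy values:

```python
def wiener_pixel(g, mean, var, noise_var):
    """Space-variant Wiener update of eq. (13):
    f_hat = m + w * (g - m), with a local-statistics weight
    (assumed form; the paper's eq. (14) is not fully reproduced here)."""
    w = max(var - noise_var, 0.0) / var if var > 0 else 0.0
    return mean + w * (g - mean)

# flat region (local variance below the noise level) collapses to the local mean
flat = wiener_pixel(g=110.0, mean=100.0, var=4.0, noise_var=9.0)
# strong edge (variance well above the noise level) is left almost untouched
edge = wiener_pixel(g=200.0, mean=100.0, var=2500.0, noise_var=9.0)
```

This is what makes the filter space-variant: smooth regions are averaged down to suppress noise, while high-variance regions such as edges pass through nearly unchanged.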
Fig. 5. Inputs and fused image for algorithm test using example 1.
The space-variant Wiener filtered image has more features preserved, which is crucial in medical imaging, where images are typically characterized by poor contrast [22, 23]. Thus, by letting the filter vary from one region to another, there is enough flexibility to expose the appropriate details of the fused image for further operations. This also helps ensure that all of the power spectrum in either un-degraded or noisy images, which is hard to estimate, is also filtered.

III. SIMULATION RESULTS

The proposed algorithm has been implemented using MATLAB R2018b, and then applied to two different sets of images (the input images and resulting images corresponding to Examples 1 and 2 are shown in Fig. 5 and Fig. 6, respectively). The FLM parameters used in these simulations are listed in Table I. The results have been evaluated using subjective visual perception criteria based on a set of performance quality metrics, namely: entropy, which measures the information content of the image [8]; overall cross entropy (OCE), which measures the difference between the input images and the fused image [23]; and average gradient (AVG), which measures the clarity of the fused image [24]. The results are compared with existing medical image fusion methods, namely Shearlets and Human Feature Visibility (SHFV), Contourlet Transform (CT) and Discrete Wavelet Transform (DWT); this comparison is presented in Fig. 7 and Tables II and III. Finally, the graphical representation of the selected performance metrics for the proposed method and the reference algorithms is displayed in Figs. 8, 9, 10, 11, 12 and 13.

TABLE I. LIST OF THE PROPOSED FLM PARAMETER VALUES

Parameter   Value
f           0.015
g           0.975
h           1.95 × 10
d           2.05
ϵ           −0.2
φ           1.05
β           0.0295
α           0.015

TABLE II. PERFORMANCE OF PROPOSED ALGORITHM COMPARED TO EXISTING ALGORITHMS FOR EXAMPLE 1

Algorithm   Entropy   OCE      AVG
Proposed    7.401     0.5806   0.0817
SHFV        7.1961    0.6214   0.0795
CT          6.6424    0.9041   0.0765
DWT         6.5142    0.7274   0.0662

Fig. 6. Inputs and fused image for algorithm test using example 2.

TABLE III. PERFORMANCE OF PROPOSED ALGORITHM COMPARED TO EXISTING ALGORITHMS FOR EXAMPLE 2

Algorithm   Entropy   OCE      AVG
Proposed    7.084     0.7348   0.0701
SHFV        6.9467    0.7654   0.0521
CT          6.8824    0.8843   0.0433
DWT         6.5198    1.1076   0.0419
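Of the metrics above, entropy is the simplest to state precisely: it is the Shannon entropy of the image's normalized gray-level histogram. A minimal sketch:

```python
import math

def image_entropy(hist):
    """Shannon entropy (bits) of a normalized gray-level histogram;
    higher values indicate more information content in the image."""
    return -sum(p * math.log2(p) for p in hist if p > 0)

# a two-level image with equal populations carries exactly 1 bit per pixel
h = image_entropy([0.5, 0.5])
```

An 8-bit image caps this metric at 8 bits, which puts the entropy values near 7 in Tables II and III into perspective.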
Fig. 9. Entropy of proposed algorithm compared to existing algorithms for Example 2.

Fig. 10. AVG of proposed algorithm compared to existing algorithms for Example 1.

Furthermore, the AVG values obtained for Examples 1 and 2, graphically displayed in Fig. 10 and Fig. 11 respectively, expose the superiority of the proposed method with respect to the selected reference algorithms, given that higher AVG values are preferred, since they reflect an increased clarity of the fused image.

IV. CONCLUSION

Medical image fusion is critical in medicine in order to enable correct and accurate clinical diagnosis. Image features such as textures and edges are important in accurate non-invasive treatments. This paper proposes a medical image fusion method based on a combination of the FLM, the MHWT and a space-variant Wiener filter. The algorithm, which is precisely aimed at improving those critical image features, exhibits a remarkable improvement when compared to existing fusion methods. The evaluation has been based on a set of performance metrics, showing that the proposed algorithm outperforms the existing ones despite having low computational complexity. The proposed
method yields images with better edges, information content and contrast. This performance can be attributed to the better edge detection and extraction due to the MHWT, the increased richness in information content thanks to the FLM, and the superior contrast enhancement and smoothing of noise by the space-variant Wiener filter.

Based on this preliminary evaluation, it is possible to conclude that the proposed algorithm can potentially bring significant benefits to the field of medical diagnosis. Nevertheless, a more thorough evaluation considering an increased number of examples and a more extensive set of performance indicators is deemed necessary in order to fully assess the performance of this novel method.

REFERENCES

[1] F. Alenezi and E. Salari, "Medical Image Fusion (MIF) Exploring Textural Information," in 2018 IEEE International Conference on Electro/Information Technology (EIT), Rochester, MI, USA, 2018.
[2] F. Alenezi and E. Salari, "Perceptual Local Contrast Enhancement and Global Variance Minimization of Medical Images for Improved Fusion," International Journal of Imaging Science and Engineering (IJISE), vol. 10, no. 3, pp. 1-10, 2018.
[3] D. K. Sahu and M. P. Parsai, "Different image fusion techniques - a critical review," International Journal of Modern Engineering Research (IJMER), vol. 2, no. 5, pp. 4298-4301, 2012.
[4] S. K. Shah and D. U. Shah, "Comparative study of image fusion techniques based on spatial and transform domain," International Journal of Innovative Research in Science, Engineering and Technology (IJIRSET), vol. 3, no. 6, pp. 10168-10175, 2014.
[5] J. Kong, K. Zheng, J. Zhang and X. Feng, "Multi-focus image fusion using spatial frequency and genetic algorithm," International Journal of Computer Science and Network Security, vol. 8, no. 2, pp. 220-224, 2008.
[6] W. B. Seales and S. Dutta, "Everywhere-in-focus image fusion using controllable cameras," in International Society for Optics and Photonics, 1996.
[7] F. Alenezi and E. Salari, "Novel Technique for Improved Texture and Information Content of Fused Medical Images," in 2018 IEEE International Symposium on Signal Processing and Information Technology (ISSPIT), 2018.
[8] F. Alenezi and E. Salari, "A Novel Image Fusion Method Which Combines Wiener Filtering, Pulsed Chain Neural Networks and Discrete Wavelet Transforms for Medical Imaging Applications," International Journal of Computer Science And Technology, vol. 9, no. 4, pp. 9-15, 2018.
[9] G. Singh, G. Singh and G. S. Aujla, "MHWT - A Modified Haar Wavelet Transformation for Image Fusion," International Journal of Computer Applications, vol. 79, no. 1, pp. 26-31, 2013.
[10] R. Eckhorn, H. J. Reitboeck, M. T. Arndt and P. Dicke, "Feature linking via synchronization among distributed assemblies: Simulations of results from cat visual cortex," Neural Computation, vol. 2, no. 3, pp. 293-307, 1990.
[12] K. Zhan et al., "Feature-linking model for image enhancement," Neural Computation, vol. 28, no. 6, pp. 1072-1100, 2016.
[13] A. Tsofe, H. Spitzer and S. Einav, "Does the Chromatic Mach bands effect exist?," Journal of Vision, vol. 9, no. 6, pp. 20-20, 2009.
[14] F. A. A. Kingdom, "Mach bands explained by response normalization," Frontiers in Human Neuroscience, vol. 8, p. 843, 2014.
[15] F. G. J. Montolio, W. Meems, M. S. A. Janssens, L. Stam and N. M. Jansonius, "Lateral inhibition in the human visual system in patients with glaucoma and healthy subjects: a case-control study," PLoS ONE, vol. 11, no. 3, p. e0151006, 2016.
[16] J. H. Byrne, Introduction to Neurons and Neuronal Networks, 2013.
[17] R. D. Stewart, I. Fermin and M. Opper, "Region growing with pulse-coupled neural networks: an alternative to seeded region growing," IEEE Transactions on Neural Networks, vol. 13, no. 6, pp. 1557-1562, 2002.
[18] T. Brosch and H. Neumann, "Interaction of feedforward and feedback streams in visual cortex in a firing-rate model of columnar computations," Neural Networks, vol. 54, pp. 11-16, 2014.
[19] S. Audithan and R. M. Chandrasekaran, "Document text extraction from document images using Haar discrete wavelet transform," European Journal of Scientific Research, vol. 36, no. 4, pp. 502-512, 2009.
[20] G. Cristobal, P. Schelkens and H. Thienpont, Optical and Digital Image Processing: Fundamentals and Applications, John Wiley & Sons, 2013.
[21] A. Umarani, "Enhancement of coronary artery using image fusion based on discrete wavelet transform," Biomedical Research, vol. 27, no. 4, pp. 1118-1122, 2016.
[22] R. Singh and A. Khare, "Multiscale medical image fusion in wavelet domain," The Scientific World Journal, vol. 2013, pp. 1-11, 2013.
[23] L. Yang, B. L. Guo and W. Ni, "Multimodality medical image fusion based on multiscale geometric analysis of contourlet transform," Neurocomputing, vol. 72, no. 1-3, pp. 203-211, 2008.
[24] Z. Li, Z. Jing, X. Yang and S. Sun, "Color transfer based remote sensing image fusion using non-separable wavelet frame transform," Pattern Recognition Letters, vol. 26, no. 13, pp. 2006-2014, 2005.
[25] T. Schoenauer, S. Atasoy, N. Mehrtash and H. Klar, "NeuroPipe-Chip: A digital neuro-processor for spiking neural networks," IEEE Transactions on Neural Networks, vol. 13, no. 1, pp. 205-213, 2002.
[26] M. Deshmukh and U. Bhosale, "Image fusion and image quality assessment of fused images," International Journal of Image Processing (IJIP), vol. 4, no. 5, p. 484, 2010.
[27] L. Yaroslavsky, Digital Holography and Digital Image Processing: Principles, Methods, Algorithms. New York: Springer Science+Business Media, 2004, p. 323.
[28] N. A. Al-Azzawi, "Medical Image Fusion based on Shearlets and Human Feature Visibility," International Journal of Computer Applications, vol. 125, no. 12, pp. 1-12, 2015.
282
Classification of Short-time Single-lead ECG Recordings Using Deep Residual CNN

Areej Kharshid
Department of Computer Engineering, King Saud University, Riyadh, Saudi Arabia 11543
Areej.kharshid@gmail.com

Ridha Ouni
Department of Computer Engineering, King Saud University, Riyadh, Saudi Arabia 11543
rouni@ksu.edu.sa

Haikel S. Alhichri
Advanced Lab for Intelligent Systems Research (ALISR), Department of Computer Engineering, King Saud University, Riyadh, Saudi Arabia 11543
hhichri@ksu.edu.sa

Yakoub Bazi
Advanced Lab for Intelligent Systems Research (ALISR), Department of Computer Engineering, King Saud University, Riyadh, Saudi Arabia 11543
ybazi@ksu.edu.sa
Abstract— This paper presents a method for the classification of short-time single-lead ECG recordings of variable size. These recordings were published as part of a challenge organized in 2017 by PhysioNet. The goal of the challenge is to classify the ECG recordings into four classes (normal, atrial fibrillation, other abnormalities, and too noisy). The dataset is challenging because of the high inter-class variability and because the class sizes are unbalanced. The proposed method starts by denoising the ECG recordings using bandpass filtering, then detects and corrects inverted signals using our own proposed algorithm. Since the recordings have variable sizes, our proposed solution extracts a large set of features (188) that the literature has shown to be effective in characterizing ECG signals and detecting abnormalities. We then present our own carefully designed residual convolutional neural network (CNN) with 5 hidden layers and use advanced and efficient training techniques to build a deep learning classifier. Finally, the paper presents preliminary results of testing the proposed solution on the challenge dataset and shows its classification capabilities.

Keywords—short-time single-lead ECG recordings, atrial fibrillation detection, deep residual convolutional neural networks (CNN).

I. INTRODUCTION

Early diagnosis of irregular heart rhythm, known as arrhythmia, helps reduce the risk of severe complications such as stroke or heart failure. Atrial fibrillation (AF) is one of the most common heart arrhythmias today, affecting an estimated 1% of the population [1]. It is the leading cause of stroke, so detecting it is important. An electrocardiogram (ECG) is the most important method for AF detection. The ECG records the electrical activity of the heart at rest and provides information about heart rate and rhythm. It can show whether there is enlargement of the heart due to high blood pressure (hypertension) or evidence of a previous heart attack (myocardial infarction).

In 2017, the PhysioNet/Computing in Cardiology challenge asked researchers and practitioners to provide a reliable solution for the screening of AF from short-time single-lead ECG signals acquired with a commercial low-cost hand-held device [2]. These hand-held ECG devices cannot replace the larger, more expensive devices used in hospitals, but they can play a major role in the early detection of AF through long-term daily monitoring [3]. The dataset in the competition is challenging because the class sizes are unbalanced, which is problematic for many classification algorithms. Another difficulty is that each ECG recording has a single label, yet the recordings have variable sizes (from 9 to 60 seconds in length), which again makes them difficult to use directly in raw format as input to many deep classification algorithms.

More than 70 groups participated in the 2017 ECG challenge. For example, the method of Teijeiro et al. [4] extracts morphological and rhythm-related features using an abductive framework for time-series interpretation [5]. The authors then feed these features into two classifiers: one that evaluates the record globally, using aggregated values for each feature, and another that evaluates the record as a sequence, using a recurrent neural network fed with the individual features of each detected heartbeat. Kropf et al. [6] proposed a method that starts by extracting a total of 380 features from both the time and frequency domains. They used these features to train a random forest based classifier (bagged decision trees). Billeci et al. [7] proposed an approach that starts by extracting fifty different features which can be 1) computed on the ECG signal, 2) derived from the RR series, and 3) obtained by merging QRS morphology and rhythm. They then select a subset of thirty discriminating features using the stepwise linear discriminant analysis (LDA) algorithm. After that, a least squares support vector machine (SVM) classifier performs the classification step.

Another top-performing method, proposed by Datta et al. [8], used a two-layer binary cascaded approach where the first binary classifier separates the unlabelled recordings into two intermediate classes ('normal+others' and 'AF+noisy'). Then, each intermediate class is separated into two classes using a second binary classifier in a second layer. This method also relies on a feature extraction step before classification. It extracts more than 150 features including morphological features, prior-art AF features, frequency features, statistical features, and others.

Zabihi et al. [9] propose a hybrid classification approach for ECGs recorded by the AliveCor hand-held devices. It
normal ECG signal is inverted, then a classification algorithm may classify it as abnormal because of the difficulty in detecting P waves. Thus, detecting and correcting inverted signals is an important step to improve classification accuracy. Our algorithm for inverted signal detection is illustrated in Fig. 2. The figure shows two examples of normal signals: one is not flipped (Fig. 2a) while the other is flipped (Fig. 2b). Our algorithm uses a sliding window with a size of 600 samples, or 2 seconds (since the sampling rate is 300 Hz). This window size also guarantees that the sliding window covers at least two heartbeats, since one heartbeat takes 300 samples on average. Inside this window we compute the maximum and minimum values, then we compute the midpoint between the maximum and minimum values.
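The window computation above can be sketched as follows. Note that the excerpt stops right after the midpoint step, so the decision rule below (comparing each window's median to the midpoint, on the idea that a tall upward R peak pulls the midpoint well above the baseline where most samples sit) is our own assumption about the step that follows, not necessarily the paper's exact rule:

```python
import statistics

def is_inverted(signal, win=600):
    """Heuristic inversion check over non-overlapping 600-sample
    windows (2 s at 300 Hz, i.e. at least two heartbeats each)."""
    votes, total = 0, 0
    for start in range(0, max(1, len(signal) - win + 1), win):
        w = signal[start:start + win]
        mid = (max(w) + min(w)) / 2.0  # midpoint of the window extremes
        # Assumed rule: in an upright beat most samples hug a baseline
        # below the midpoint (the R peak is a tall upward spike); in a
        # flipped beat the baseline lies above the midpoint.
        total += 1
        if statistics.median(w) > mid:
            votes += 1
    return total > 0 and votes > total / 2
```

A majority vote across windows makes the decision robust to a single noisy window.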
Fig. 3: Waveform of a typical ECG heartbeat showing the R peak, QRS complex, and other characteristics.
B. Experimental setup

We implement the proposed deep residual CNN in the Keras environment, a high-level neural network application programming interface written in Python. We set the number of epochs to 100 and fix the batch size to 100 samples. Additionally, we set the learning rate of the Adam optimization method to 0.0001. For the exponential decay rates of the moment estimates and for epsilon, we use the default values 0.9, 0.999, and 1e-8, respectively. All experiments are conducted on an HP workstation with an Intel Xeon 2.40 GHz processor, 24.00 GB of RAM, and a GeForce GTX1090 GPU with 11 GB of memory.
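For reference, the three Adam hyperparameters quoted above (0.9, 0.999, and 1e-8) enter the optimizer's update rule as the moment decay rates and the numerical-stability constant. The following is the standard Adam step shown as a minimal scalar sketch, not the Keras internals:

```python
import math

def adam_step(theta, grad, state, lr=1e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter. `state` holds the
    running first/second moment estimates and the step counter."""
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad       # 1st moment
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad ** 2  # 2nd moment
    m_hat = state["m"] / (1 - beta1 ** state["t"])             # bias correction
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return theta - lr * m_hat / (math.sqrt(v_hat) + eps)
```

With a constant gradient, the very first step moves the parameter by approximately the learning rate, which is why the learning rate (0.0001 here) directly bounds the initial step size.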
For performance evaluation, we present the results using the F1 score. Given a confusion matrix as shown in Fig. 6, the F1 scores are computed as presented in (1), (2), (3), and (4), where TP_x, FP_x, and FN_x denote the true positives, false positives, and false negatives for class x:

F_N = 2 TP_N / (2 TP_N + FP_N + FN_N)   (1)
F_A = 2 TP_A / (2 TP_A + FP_A + FN_A)   (2)
F_O = 2 TP_O / (2 TP_O + FP_O + FN_O)   (3)
F_~ = 2 TP_~ / (2 TP_~ + FP_~ + FN_~)   (4)
Fig. 4: Proposed CNN architecture for classification of short-time single-lead ECG records. (a) 5-layer CNN without residual connections; (b) the same CNN with residual connections.

Finally, following the guidelines of the PhysioNet/Computing in Cardiology challenge [2], we compute an overall F1 score using the F1 scores for the N, A, and O classes as follows: F1 = (F_N + F_A + F_O)/3.
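The per-class F1 scores and the challenge's overall score can be computed from the confusion matrix as in the following sketch (class order N, A, O, ~ assumed; only the first three classes count toward the overall score, per the challenge rules cited above):

```python
def challenge_f1(conf):
    """conf[i][j] = number of recordings with reference class i
    predicted as class j, classes ordered (N, A, O, ~).
    Returns the per-class F1 list and the overall score."""
    k = len(conf)
    f1 = []
    for c in range(k):
        tp = conf[c][c]
        fn = sum(conf[c]) - tp                       # row total minus diagonal
        fp = sum(conf[r][c] for r in range(k)) - tp  # column total minus diagonal
        denom = 2 * tp + fp + fn
        f1.append(2 * tp / denom if denom else 0.0)
    overall = sum(f1[:3]) / 3.0  # mean of the N, A, O scores only
    return f1, overall
```

For a perfectly diagonal confusion matrix every per-class F1 is 1.0 and the overall score is 1.0.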
TABLE I. EFFECT OF INVERSION DETECTION ON CLASSIFICATION ACCURACY

                                                     F1 score per class          F1 score
Method                                               N      A      O      ~      Overall
5-layer CNN, without inversion correction            95.42  93.75  84.21  -      87.58
5-layer CNN, with inversion correction               96.71  90.52  86.13  -      91.76
5-layer residual CNN, without inversion correction   93.89  92.63  83.72  -      89.09
5-layer residual CNN, with inversion correction      97.10  94.85  93.33  85.58  95.09

We have also selected three more papers from the challenge because they use deep neural networks in their solutions. In particular, we have selected the works of Xiong et al. [13], Warrick et al. [14], and Andreotti et al. [15]. Thus, in total we have selected 8 methods from the challenge for comparison.

TABLE II. COMPARISON OF CLASSIFICATION ACCURACY WITH STATE-OF-THE-ART

                          F1 score per class          F1 score
Method                    N      A      O      ~      Overall
Top 5 methods
Teijeiro et al. [5]       93.29  95.74  84.62  -      91.22
Kropf et al. [6]          95.50  98.95  92.42  -      95.62
Billeci et al. [7]        92.72  94.62  83.20  -      90.18
Datta et al. [8]          99.66  98.95  98.46  -      99.02
Plesinger [30]            95.30  95.83  85.94  -      92.36
Other deep NN methods
Xiong et al. [13]         92.31  96.91  82.17  -      90.46
Warrick et al. [14]       89.93  89.36  70.07  -      83.12
Andreotti et al. [15]     96.35  84.71  89.05  -      90.03
5-layer CNN [ours]        96.71  90.52  86.13  83.71  91.76
Residual CNN [ours]       97.10  94.85  93.33  85.58  95.09

Our method outperforms 6 of the 8 methods, which is good as a preliminary result. The two methods that beat ours are those of Datta et al. [8], which achieved a score of 99.02, and Kropf et al. [6], which achieved a score of 95.62.

Similar to this work, the method by Datta et al. [8] relies on a feature extraction step where more than 150 features are extracted. However, it uses a two-layer binary cascaded approach where the first binary classifier separates the recordings into two intermediate classes ('normal+others' and 'AF+noisy'). Then, each intermediate class is separated into two classes using a second binary classifier in a second layer. This is likely the reason for the good performance of their method, so we should investigate this cascaded approach in our future work.

As for the second work, by Kropf et al. [6], it again starts by extracting a set of features from each ECG recording. However, they extract a total of 380 features from both the time and frequency domains, a larger set than the 188 features used in our method. For classification they use a random forest based classifier (bagged decision trees). We believe the larger number of extracted features explains the slightly better results they achieved.

IV. CONCLUSION

This paper presented a feature-based deep learning approach to classify rhythms from short-time single-lead ECG recordings of variable size. A general performance evaluation has been performed on the challenging PhysioNet dataset and compared to the most recent works published in this field. Our results show that residual CNNs are more capable of classifying short-time single-lead ECG recordings. Moreover, the proposed method, based on our own algorithm for inverted signal detection and a 5-layer CNN trained from scratch, has shown strong classification capabilities, reaching overall F1 scores of 91.76% and 95.09% with the 5-layer CNN and 5-layer residual CNN approaches, respectively.

Finally, it is challenging to reliably detect AF from a short-time single-lead ECG, and the broad taxonomy of rhythms makes this particularly difficult. However, two alternatives can be followed in future work in order to reliably improve results: increasing the number of extracted features and using a cascaded approach for ECG classification.

ACKNOWLEDGMENT

This work was supported by the Deanship of Scientific Research at King Saud University through the Local Research Group Program under Project RG-1435-055.

REFERENCES

[1] “What is Atrial Fibrillation (AFib or AF)?,” www.heart.org. [Online]. Available: https://www.heart.org/en/health-topics/atrial-fibrillation/what-is-atrial-fibrillation-afib-or-af. [Accessed: 19-Apr-2019].
[2] G. Clifford et al., “AF Classification from a Short Single Lead ECG Recording: the PhysioNet Computing in Cardiology Challenge 2017,” presented at the 2017 Computing in Cardiology Conference, 2017.
[3] K. M. Griffiths, E. N. Clark, B. Devine, and P. W. Macfarlane, “Assessing the accuracy of limited lead recordings for the detection of Atrial Fibrillation,” in Computing in Cardiology 2014, 2014, pp. 405–408.
[4] T. Teijeiro, C. A. García, D. Castro, and P. Félix, “Arrhythmia classification from the abductive interpretation of short single-lead ECG records,” in 2017 Computing in Cardiology (CinC), 2017, pp. 1–4.
[5] T. Teijeiro, P. Félix, J. Presedo, and D. Castro, “Heartbeat Classification Using Abstract Features From the Abductive Interpretation of the ECG,” IEEE Journal of Biomedical and Health Informatics, vol. 22, no. 2, pp. 409–420, Mar. 2018.
[6] M. Kropf, D. Hayn, and G. Schreier, “ECG classification based on time and frequency domain features using random forests,” in 2017 Computing in Cardiology (CinC), 2017, pp. 1–4.
[7] L. Billeci, F. Chiarugi, M. Costi, D. Lombardi, and M. Varanini, “Detection of AF and other rhythms using RR variability and ECG spectral measures,” in 2017 Computing in Cardiology (CinC), 2017, pp. 1–4.
[8] S. Datta et al., “Identifying normal, AF and other abnormal ECG rhythms using a cascaded binary classifier,” in 2017 Computing in Cardiology (CinC), 2017, pp. 1–4.
[9] M. Zabihi, A. B. Rad, A. K. Katsaggelos, S. Kiranyaz, S. Narkilahti, and M. Gabbouj, “Detection of atrial fibrillation in ECG hand-held devices using a random forest classifier,” in 2017 Computing in Cardiology (CinC), 2017, pp. 1–4.
[10] G. Bin, M. Shao, G. Bin, J. Huang, D. Zheng, and S. Wu, “Detection of atrial fibrillation using decision tree ensemble,” in 2017 Computing in Cardiology (CinC), 2017, pp. 1–4.
[11] J. A. Behar, A. A. Rosenberg, Y. Yaniv, and J. Oster, “Rhythm and quality classification from short ECGs recorded using a mobile device,” in 2017 Computing in Cardiology (CinC), 2017, pp. 1–4.
[12] P. Bonizzi, K. Driessens, and J. Karel, “Detection of atrial fibrillation episodes from short single lead recordings by means of
ensemble learning,” in 2017 Computing in Cardiology (CinC), 2017, pp. 1–4.
[13] Z. Xiong, M. K. Stiles, and J. Zhao, “Robust ECG signal classification for detection of atrial fibrillation using a novel neural network,” in 2017 Computing in Cardiology (CinC), 2017, pp. 1–4.
[14] P. A. Warrick and M. N. Homsi, “Cardiac arrhythmia detection from ECG combining convolutional and long short-term memory networks,” in 2017 Computing in Cardiology (CinC), 2017, pp. 1–4.
[15] F. Andreotti, O. Carr, M. A. F. Pimentel, A. Mahdi, and M. D. Vos, “Comparing feature-based classifiers and convolutional neural networks to detect arrhythmia from short segments of ECG,” in 2017 Computing in Cardiology (CinC), 2017, pp. 1–4.
[16] K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
[17] K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
[18] G. M. Friesen, T. C. Jannett, M. A. Jadallah, S. L. Yates, S. R. Quint, and H. T. Nagle, “A comparison of the noise sensitivity of nine QRS detection algorithms,” IEEE Trans. Biomed. Eng., vol. 37, no. 1, pp. 85–98, Jan. 1990.
[19] K. K. Paliwal, “Spectral subband centroid features for speech recognition,” in Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP ’98), 1998, vol. 2, pp. 617–620.
[20] D. Giannoulis and J. D. Reiss, “Parameter Automation in a Dynamic Range Compressor,” 2013.
[21] G. Peeters, B. L. Giordano, P. Susini, N. Misdariis, and S. McAdams, “The Timbre Toolbox: extracting audio descriptors from musical signals,” J. Acoust. Soc. Am., vol. 130, no. 5, pp. 2902–2916, Nov. 2011.
[22] R. Banerjee, R. Vempada, K. M. Mandana, A. D. Choudhury, and A. Pal, “Identifying Coronary Artery Disease from Photoplethysmogram,” in Proceedings of the 2016 ACM International Joint Conference on Pervasive and Ubiquitous Computing: Adjunct, New York, NY, USA, 2016, pp. 1084–1088.
[23] L. Maršánová et al., “ECG features and methods for automatic classification of ventricular premature and ischemic heartbeats: A comprehensive experimental study,” Scientific Reports, vol. 7, no. 1, p. 11239, Sep. 2017.
[24] S. Sarkar, D. Ritscher, and R. Mehra, “A detector for a chronic implantable atrial tachyarrhythmia monitor,” IEEE Trans. Biomed. Eng., vol. 55, no. 3, pp. 1219–1224, Mar. 2008.
[25] R. Alcaraz, D. Abásolo, R. Hornero, and J. J. Rieta, “Optimal parameters study for sample entropy-based atrial fibrillation organization analysis,” Comput. Methods Programs Biomed., vol. 99, no. 1, pp. 124–132, Jul. 2010.
[26] D. E. Lake and J. R. Moorman, “Accurate estimation of entropy in very short physiological time series: the problem of atrial fibrillation detection in implanted ventricular devices,” Am. J. Physiol. Heart Circ. Physiol., vol. 300, no. 1, pp. H319–H325, Jan. 2011.
[27] J. Park, S. Lee, and M. Jeon, “Atrial fibrillation detection by heart rate variability in Poincare plot,” Biomed. Eng. Online, vol. 8, p. 38, Dec. 2009.
[28] S. Bandyopadhyay et al., “An unsupervised learning for robust cardiac feature derivation from PPG signals,” in 2016 38th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), 2016, pp. 740–743.
[29] C. Puri et al., “Classification of normal and abnormal heart sound recordings through robust feature selection,” in 2016 Computing in Cardiology Conference (CinC), 2016, pp. 1125–1128.
[30] F. Plesinger, P. Nejedly, I. Viscor, J. Halamek, and P. Jurak, “Automatic detection of atrial fibrillation and other arrhythmias in holter ECG recordings using rhythm features and neural networks,” in 2017 Computing in Cardiology (CinC), 2017, pp. 1–4.
Identification and Tagging of Malicious Vehicles through License
Plate Recognition
Ahmad Mostafa#, Walid Hussein*, Samir El-Seoud+
# Computer Networks Department, The British University in Egypt, El-Sherouk, Egypt
E-mail: ahmad.mostafa@bue.edu.eg
Abstract— Vehicular Ad-hoc NETworks (VANETs) are becoming a reality in today’s world. These networks are composed of highly dynamic and capable vehicles, and they rely on information that originates from, and is exchanged between, the vehicles themselves. One of the main success factors of this communication is the validity of the communicated data. Hence, malicious vehicles pose a serious threat to VANETs. Once a vehicle is identified as malicious, the main challenge is to keep a centralized ledger of the malicious vehicles within the network. In this paper, an innovative distributed framework is proposed for the identification and tagging of malicious vehicles. This framework is based on Arabic license plate recognition using different image recognition algorithms, with higher accuracy than other common plate recognition approaches, and the identification of a vehicle as malicious or non-malicious propagates through the network. The details of both the vehicle communication framework and the image processing pipeline are presented, and the framework is validated through different implementations and discussion.
Keywords—VANET, Image Processing, Number Plate Recognition, Feature Extraction, Malicious Nodes.
dian filter to reduce the noise. Afterwards, histogram equalization is implemented to remap the pixels of the image and improve its quality. The most likely number plate area is found by comparing the width-to-height ratio of genuine Indian number plates to the same ratio of the plate-like regions found by this strategy. Secondly, the characters of the detected number plate region are segmented using the regionprops function of MATLAB to obtain bounding boxes for each of the characters; regionprops returns the smallest bounding box that contains a character. The third step is applying Optical Character Recognition (OCR) using a template matching or supervised learning approach. Template matching works by pixel-by-pixel correlation of the image and the template for every possible displacement of the template; for each character and digit, from 0 to 9 and from A to Z, a template is stored in the database. However, the published accuracy in relation to the computational time is not suitable for the real-time plate recognition application introduced in this paper.

III. GENERAL ARCHITECTURE

The general architecture is displayed in the flowchart shown in figure 2. The protocol steps can be explained as follows:

1. Each vehicle takes photos of the license plates of surrounding vehicles and starts recognizing the plate numbers and how far away each vehicle is. The method for achieving this has been explained in the previous section.

2. Once a vehicle receives a packet, it needs to identify which packet belongs to which license plate. This is challenging when the license plate is on the far side from the camera and hence no picture is obtainable. Another challenging case is when two vehicles are very close to each other, making it difficult to distinguish which one is the source of the packet. To decide which vehicle is the source of the packet, RSSI measurements are used to estimate the distance of the packet source (the vehicle) from the destination. This distance is also compared with the distance obtained from image recognition, and based on both values the source vehicle of the packet is identified.

3. The packet is analyzed to determine whether it is malicious or not. This can be achieved based on the content of the packet and the data being sent. For example, if the packet is supposed to include traffic data, then this data can be verified using data from other vehicles. The framework introduced in this paper can function with any malicious data analysis algorithm.

4. Once the packet is considered either malicious or benign, the license plate is given a score.

5. The score, along with the license plate, is saved in a ledger, and this ledger is distributed throughout the network.

6. The plate, along with the score, is broadcast with a timestamp of when the original packet was received.

In order to achieve the steps mentioned in the flowchart above, some requirements have to be met. These requirements are:

1. The license plate number has to be recognized by neighboring vehicles instead of being sent by the source vehicle. This is important in order to prevent a malicious vehicle from spoofing its own identity in a reputation-based system.

2. The license plate number has to be appended to the received packet. Hence, a neighboring vehicle must be able to identify and distinguish which of the received packets is associated with which license plate.

3. There has to be a ledger of the identified malicious vehicles that propagates through the network in order to minimize the effect of these vehicles on the network.

4. There has to be a mechanism for vehicles to recover from being tagged as malicious. Moreover, whether a vehicle is malicious or not should be decided by consensus, in order to avoid a benign vehicle being tagged as malicious by an actual malicious vehicle.

It is important to note that a vehicle decides the location of another vehicle using two main approaches:

1. Through image recognition, in which we estimate the distance of the vehicle based on the image analysis.
2. Using RSSI readings from the received packet.

These two values are compared to each other, and since the vehicles move continuously, it is important that each vehicle captures and analyzes images continuously in order to be ready to append the identity of the vehicle to the packet once it is analyzed.

In the following section, we discuss the details of the protocol. We start with the license plate recognition; after that, we discuss the actual protocol, how the messages are tagged, and the different cases.
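Protocol steps 4 to 6 and the requirements above can be sketched as a small ledger structure. The class and method names, the ±1 scoring increment, and the time-to-live value below are illustrative assumptions, not taken from the paper:

```python
import time

class PlateLedger:
    """Sketch of the per-vehicle ledger: each detected plate gets a
    score, stale entries are purged, and entries broadcast by
    neighbours are merged in."""
    def __init__(self, ttl_seconds=60):
        self.ttl = ttl_seconds
        self.entries = {}  # plate -> {"score": float, "ts": float}

    def record(self, plate, malicious, now=None):
        # Step 4: once a packet is judged malicious or benign,
        # the license plate is given a score (illustrative +/-1 here).
        now = time.time() if now is None else now
        e = self.entries.setdefault(plate, {"score": 0.0, "ts": now})
        e["score"] += -1.0 if malicious else 1.0
        e["ts"] = now  # step 6: timestamp of the original packet
        return e["score"]

    def purge(self, now=None):
        # Periodic operation: drop entries older than the threshold,
        # which in the paper depends on local density and velocity.
        now = time.time() if now is None else now
        self.entries = {p: e for p, e in self.entries.items()
                        if now - e["ts"] <= self.ttl}

    def merge(self, broadcast_entries):
        # Step 5: copy entries shared by neighbouring vehicles,
        # keeping the fresher record for each plate.
        for plate, e in broadcast_entries.items():
            mine = self.entries.get(plate)
            if mine is None or e["ts"] > mine["ts"]:
                self.entries[plate] = dict(e)
```

A real deployment would additionally need the consensus and recovery mechanisms of requirement 4, which this sketch omits.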
Fig. 3 Schematic diagram for the enhancement process of the captured plate image.
“0”. The importance of the thresholding process is that it converts the image to a bi-level picture by using an optimal threshold, as described in figure 7.

Fig. 6 Thresholding process.

Accuracy: 79.84% (first technique), 80.8% (second technique), 88% (proposed method).

It is worth mentioning that the first technique applies non-linear support vector machines through a radial basis function. The second technique directly applies template matching on the captured number plate image to reduce the computational time, while the proposed method applies template matching on a version of the number plate image that has been filtered (by a Gaussian filter) and smoothed (by thresholding and a Canny edge detection process). The accuracy represents the fraction of all test images for which the technique provided a correct number plate recognition.
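The pixel-by-pixel template matching described above can be sketched as follows. The tiny 3x3 "templates" are placeholders for the stored 0 to 9 and A to Z character templates, and the pixel-agreement score stands in for the correlation measure:

```python
def match_character(glyph, templates):
    """Return the name of the template that best matches a binarized
    character glyph, by counting pixel-wise agreements. `glyph` and
    every template are same-size grids of 0/1 values."""
    def score(a, b):
        return sum(pa == pb for row_a, row_b in zip(a, b)
                   for pa, pb in zip(row_a, row_b))
    return max(templates, key=lambda name: score(glyph, templates[name]))

# Toy templates standing in for the full 0-9 / A-Z database.
templates = {
    "1": [[0, 1, 0],
          [0, 1, 0],
          [0, 1, 0]],
    "7": [[1, 1, 1],
          [0, 0, 1],
          [0, 1, 0]],
}
```

In the full system this comparison is repeated for every possible displacement of the template over the segmented character region, and the best-scoring character is emitted.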
Figure 9. Different possible scenarios of license plate locations on the transmitting vehicle (Case A and Case B).

B. Plate number detection and vehicle location estimation

The plate number detection happens according to the previous section. However, after detecting the vehicle plate number, it is imperative that the vehicle that captures the image is able to estimate the location of the vehicle in the picture. This involves two elements:

• The orientation of the camera on the vehicle.
• The resolution of the camera, through which we can estimate the distance of the vehicle.

Based on these two elements, it becomes feasible to detect both the distance of the vehicle and its angle.

C. Local Database

When a vehicle detects a license plate, it saves the location of the vehicle, the timestamp at which the location was detected, and the plate number in a local database. Periodic operations take place in the database in order to keep it up to date. These operations are:

• The timestamp is checked regularly. If the timestamp is before a certain time threshold, the entry is purged from the table. This threshold depends on the density and velocity of the vehicles in the vicinity. If the velocity is low and the density of vehicles is high, then the location of the vehicle will be difficult to extrapolate. However, if the velocity is high and the density is low, then it becomes feasible to predict the location of the vehicle. The density of the network can be inferred from the number of packets received at the source directly or being forwarded: if the number of unique packets is high, then the density is high, and vice versa.

• The new entries in the table are communicated to the surrounding vehicles so they can copy the entries.

These two operations ensure that the vehicles collaborate with each other in order to detect the plate numbers.

D. RSSI Measurement

The relative location of the vehicle is estimated from the received packet using the received signal strength (RSSI) from the transmitting vehicle. RSS is the strength of the electromagnetic signal, which attenuates with distance: the further the signal travels, the more it attenuates. This can be demonstrated using the experimental results shown in figure 10. This experiment was done using Tmote Sky sensors in an indoor environment and was repeated 100 times in different indoor environments with different locations relative to walls and reflective surfaces. In these experiments, we used two wireless sensor devices to transmit wireless signals in an indoor environment, in order to simulate the reflection and diffraction of wireless signals off of vehicle metal bodies. As shown in the figure, the strength of the electromagnetic signal attenuates with distance. Although the attenuation is not uniform, it is clear and can be used to estimate the location of the source vehicle.

It is important to note that the vast majority of vehicles are equipped with global positioning systems (GPS), which provide an accurate location of the vehicle. However, the reliance on RSSI for estimating the location of the vehicle is based on the assumption that a malicious vehicle can modify the GPS location appended to the packet. In the case of RSSI localization, on the other hand, the transmitting vehicle is not involved in its own localization, which depends entirely on the receiving vehicle.

Fig. 10 Received Signal Strength attenuation with distance.

VI. FUTURE WORK

The protocol presented in this paper needs further verification through implementation in a test-bed, in order to test different factors such as the speed of plate recognition compared to the speed of vehicle movement. It is important in this protocol that the vehicle plate recognition is performed, and the result appended to the packet being transmitted, before the distance separating the two vehicles becomes larger than the communication range between them. This can be tested by installing a camera on the vehicles and testing the accuracy of the recognition during movement and at speed.

VII. CONCLUSIONS

VANETs are becoming a reality in the technology world, and they have a large impact on both technology and human life. Malicious vehicles pose a serious threat to the security and safety of both the vehicles and the people using them, and the consequences of this threat are dire. In order to overcome and deal with malicious vehicles, we propose a novel framework that includes both a communication protocol and image processing algorithms, in which the vehicle is tagged based on its license plate. Once the license plate is
tagged as malicious or non-malicious, this information is propagated through the network. We discussed the different components of our proposed system, and we presented the feasibility of this framework through actual experiments. Future work will be required to rigorously test this system in actual vehicle networks and with random malicious nodes. However, the framework is a step towards a reliable distributed framework for handling malicious vehicles.
havior in ad hoc networks." In Proceedings of MOBICOM 2000.
15. Studer, Ahren, Elaine Shi, Fan Bai, and Adrian Perrig. "TACKing to-
gether efficient authentication, revocation, and privacy in VANETs."
In 2009 6th Annual IEEE Communications Society Conference on
Sensor, Mesh and Ad Hoc Communications and Networks, pp. 1-9.
IEEE, 2009.
16. Haas, Jason J., Yih-Chun Hu, and Kenneth P. Laberteaux. "Design and
analysis of a lightweight certificate revocation mechanism for
VANET." In Proceedings of the sixth ACM international workshop on
VehiculAr InterNETworking, pp. 89-98. ACM, 2009.
17. Rezgui, Jihene, and Cédryk Doucet. "Detection of malicious vehicles
with demerit and reward level system." In 2017 International Sympo-
sium on Networks, Computers and Communications (ISNCC), pp. 1-
6. IEEE, 2017.
18. Shidore, M. M., and S. P. Narote. "Number plate recognition for indian
vehicles." IJCSNS International Journal of Computer Science and
Network Security , 2011, p: 143-146.
19. A. Puranic, D. K. T., and U. V., “Article: Vehicle number plate recog-
nition system: A literature review and implementation using template
matching,” International Journal of Computer Applications, vol. 134,
no. 1, pp. 12– 16, January 2016, published by Foundation of Computer
Science (FCS), NY, USA.
Cascaded Layered Recurrent Neural Network for
Indoor Localization in Wireless Sensor Networks
Hamza Turabieh
Information Technology Department
CIT College, Taif University
Taif, KSA
h.turabieh@tu.edu.sa

Alaa Sheta
Computer Science Department
Southern Connecticut State University
New Haven, CT 06515, USA
shetaa1@southernct.edu
Abstract—The growth in the use of various smart wireless devices in the last few decades has given rise to indoor localization services (ILS). Indoor localization is defined as the process of locating a user's position in an indoor environment. Indoor device localization has been widely studied due to its popular applications in public settlement planning, health care zones, disaster management, the implementation of location-based services (LBS) and the Internet of Things (IoT). The ILS problem can be formulated as a learning problem utilizing Wi-Fi technology. The measured Wi-Fi signal strength can be used as an indication of the distribution of users over various indoor locations. Developing a classification model with high accuracy can be achieved using a machine learning approach. Artificial Neural Networks are one of the most successful trends in machine learning. In this article, we provide our initial idea of using a Cascaded Layered Recurrent Neural Network (L-RNN) for the classification of user localization in an indoor environment. Several neural network models were trained, and the best attained performance is reported. The experimental results show that the presented L-RNN model is highly accurate for indoor localization and can be utilized for many applications.

Index Terms—Layered Recurrent Neural Network, User Localization, Indoor Environment, Prediction.

I. INTRODUCTION

Sensor node localization is an essential task for numerous emerging applications of wireless sensor networks (WSNs) such as precision agriculture, forest monitoring, home security, smart buildings, health monitoring and many others [1], [2]. Precise estimation of the sensor node location is vital for the effectiveness of location-aware services. In the past few decades, the indoor localization service became one of the hot research topics [3]-[5]. User and device localization has found many applications in areas such as the health sector, disaster management [6], [7], the Internet of Things (IoT) [8], [9], smart cities [10], [11], and smart buildings [12], [13]. Currently, we still do not have a reliable and accurate indoor localization system that can provide an exact location for a person. We cannot, for example, navigate persons at home or in offices using Google Maps. Recently, the proliferation of smart phones and other mobile devices has made indoor localization more feasible for enabling location-based services.

Today, emerging indoor positioning systems are intensively studied due to the increasing demand for universal positioning. Most indoor wireless sensor network localization methods necessitate expensive site surveys to gather fingerprint data for localizing mobile devices. The dynamic nature of fingerprint information in indoor wireless environments makes the problem even more complicated and computationally expensive. In [14], the authors provided a comparison between several deterministic localization methods: Non-Linear Regression (NLR), Iterative Non-Linear Regression (INLR), Least Squares (LS), Random Sample Consensus (RANSAC) and Trilaterate on Minima (ToM). A data set was collected from real environments over a space of size 550 m². The findings show that NLR is the best approach. The wide availability and accessibility of smart phones and wearable devices that adopt wireless communication features have made localizing and pursuing such devices much more accessible, dissimilar to most outdoor GPS navigation systems. In many cases, the fingerprints are repeated owing to the available Access Points (APs) and interference, which duplicates the matched patterns and the user's fingerprint. Thus, improving the classification performance and reducing the computation cost of WiFi indoor localization systems is urgently needed.

Traditional fingerprinting consists of two steps: (i) an offline step, where the fingerprint database is created at an early stage, and (ii) an online step, where the user position is determined based on the Received Signal Strength (RSS). Comparing the current RSS with stored RSS signals to determine the user location is a time-consuming approach and does not work well in case of changes to the building infrastructure [15]. As a result, fingerprint algorithms based on machine learning methods are needed that reduce the computational time by analyzing the RSS database and are not influenced by changes in the building infrastructure.

WiFi indoor localization systems based on machine learning methods are broadly used in the literature. Several machine learning techniques for indoor localization have been proposed, such as Nearest Neighbor (NN) [16], K-Nearest Neighbor (KNN) [17], [18], Artificial Neural Networks (ANNs) [19], Support Vector Regression (SVR) [20] and Deep Neural Networks [21]. In [22], the authors developed a detailed study in a real environment by exploring a number of ANN-based methods such as Radial Basis Function (RBF), Multi-Layer Perceptron (MLP), Recurrent Neural Networks (RNN), Position-
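The two-step fingerprinting procedure described above (an offline database of RSS readings, then an online nearest-match lookup) can be sketched in a few lines. The access-point readings and room labels below are invented for illustration only; this is not the data or the exact method of any cited work:

```python
import numpy as np

# Offline step: fingerprint database of mean RSS readings (dBm) from
# three APs, collected at known reference locations (hypothetical values).
fingerprints = np.array([
    [-40.0, -70.0, -85.0],   # reference location 0
    [-75.0, -42.0, -80.0],   # reference location 1
    [-82.0, -78.0, -45.0],   # reference location 2
])
locations = ["room A", "room B", "room C"]

def locate(rss, k=1):
    """Online step: match a live RSS reading to the k closest fingerprints."""
    dists = np.linalg.norm(fingerprints - np.asarray(rss), axis=1)
    nearest = np.argsort(dists)[:k]
    return locations[nearest[0]]

print(locate([-43.0, -68.0, -88.0]))  # closest fingerprint is location 0
```

A real system replaces this exhaustive comparison with a learned classifier, precisely because scanning the whole RSS database at query time is the costly step the text criticizes.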
Fig. 2. The L-RNN model.
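As a rough, schematic illustration of what a single recurrent layer in such a model computes, the snippet below runs an Elman-style recurrence with random toy weights over a short sequence of RSS features. It does not reproduce the actual L-RNN architecture or weights of Fig. 2:

```python
import numpy as np

def rnn_layer_forward(xs, W_in, W_rec, b):
    """One recurrent layer: h_t = tanh(W_in x_t + W_rec h_{t-1} + b)."""
    h = np.zeros(W_rec.shape[0])
    for x in xs:                      # iterate over time steps
        h = np.tanh(W_in @ x + W_rec @ h + b)
    return h  # final hidden state, fed to the next layer / classifier

rng = np.random.default_rng(0)
xs = rng.normal(size=(5, 3))          # 5 time steps of 3 RSS features each
W_in = rng.normal(size=(4, 3)) * 0.1  # input weights (4 hidden units)
W_rec = rng.normal(size=(4, 4)) * 0.1 # recurrent weights
b = np.zeros(4)
h = rnn_layer_forward(xs, W_in, W_rec, b)
print(h.shape)
```

Stacking several such layers, each consuming the hidden states of the previous one, gives the "layered" structure the figure depicts.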
Fig. 3. Proposed Cascaded L-RNN model for indoor localization.
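The cascade of Fig. 3 feeds the building estimate produced by a first network into a second network that predicts the floor. Schematically, with trivial nearest-centroid "models" standing in for the L-RNN stages and invented toy RSS data:

```python
import numpy as np

def nearest_centroid_fit(X, y):
    """Stand-in for one L-RNN stage: one centroid per class."""
    classes = np.unique(y)
    return classes, np.array([X[y == c].mean(axis=0) for c in classes])

def nearest_centroid_predict(model, X):
    classes, centroids = model
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return classes[d.argmin(axis=1)]

# Toy RSS data (dBm): AP0/AP1 separate the buildings, AP2 the floors.
X = np.array([[-40., -90., -60.], [-42., -88., -30.],
              [-91., -41., -61.], [-89., -43., -29.]])
building = np.array([0, 0, 1, 1])
floor = np.array([0, 1, 0, 1])

# Stage 1: predict the building from RSS alone.
m1 = nearest_centroid_fit(X, building)
b_hat = nearest_centroid_predict(m1, X)

# Stage 2: append the building estimate as an extra input feature,
# exactly the cascading idea of Fig. 3.
X2 = np.column_stack([X, b_hat])
m2 = nearest_centroid_fit(X2, floor)
f_hat = nearest_centroid_predict(m2, X2)
print(b_hat, f_hat)
```

The design consequence is visible even in this sketch: any error in the first stage corrupts an input feature of the second stage, which is why floor accuracy is bounded by building accuracy in the results reported below.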
The performance of L-RNN over the UJIIndoorLoc data set for estimating the building is more accurate than for the floor. The main reason that L-RNN achieves 89.30% on the UJIIndoorLoc testing data set is that the estimated building number is used as an input for the L-RNN model that estimates the floor number; building estimation accuracy is 97.8%, and errors there reduce the accuracy of the floor estimation process. However, the average estimation accuracy for Building and Floor (B&F) is 91.8%. The performance of L-RNN over the Wireless Indoor Localization data set is outstanding for two reasons: (i) the size of the data set is only 2000 samples, which is 93.55% smaller than the UJIIndoorLoc data set, and (ii) the data set does not have missing values. Table IV also shows the statistical results of our proposed approach over 11 runs. It is clear that the performance of the L-RNN approach is stable, based on the standard deviation values.

Fig. 4. L-RNN convergence process.

Figure 4 shows the performance of the L-RNN model in the training process. It is clear that L-RNN converges within 143 iterations. This fast convergence is due to the ability of L-RNN to learn by generating various abstract representations of the data. One more advantage is that the network structure can be expanded deeper to cope with the modeling problem requirements.

A. Comparison
In this section, we provide a comparison between our proposed L-RNN and many methods reported in the literature.
• Table V shows comparison results between our proposed approach and the state-of-the-art methods based on the average accuracy value. It is clear that our proposed method gains the second rank for the UJIIndoorLoc data set.
• Table VI shows comparison results for the Wireless Indoor Localization data set and other methods in the literature. It is clear that our proposed method outperforms all reported results and gains rank number one. A big campus with a large number of buildings and multiple floors will
increase the complexity of an indoor user localization problem. As a result, deep learning algorithms such as L-RNN will be more applicable for such problems compared to the traditional machine learning algorithms.

TABLE V
COMPARISON WITH THE STATE-OF-THE-ART METHODS BASED ON THE AVERAGE ACCURACY VALUES FOR THE UJIINDOORLOC DATA SET.

Rank  Approach               Average accuracy (%)
1     CNN [37]               95.41
2     Cascaded L-RNN         93.55
3     Scalable DNN [38]      92.89
4     SAE + classifier [39]  91.10

TABLE VI
COMPARISON WITH THE STATE-OF-THE-ART METHODS BASED ON THE AVERAGE ACCURACY VALUES FOR THE WIRELESS INDOOR LOCALIZATION DATA SET.

Rank  Approach               Average accuracy (%)
1     Cascaded L-RNN         96.30
2     FPSPGSA-NN [40]        95.16
3     SVM [40]               92.68
4     Naïve Bayes [40]       90.47
5     PSOGSA-NN [40]         83.28
6     GSA-NN [40]            77.53
7     PSO-NN [40]            64.66

V. CONCLUSION AND FUTURE WORKS

In this paper, we proposed a cascaded layered recurrent neural network to predict indoor user localization using Wi-Fi fingerprinting. L-RNN has been examined using two different public data sets. A set of experiments was performed, and the obtained results show that L-RNN works properly with either a small or a massive number of samples. The performance of L-RNN shows high accuracy for the indoor localization problem. Future work will investigate the exact position of indoor users based on the real location inside a floor or room, and will evaluate different machine learning methods such as Convolutional Neural Networks (CNN) and Modular Neural Networks (MNN).

REFERENCES

[1] B. Rashid and M. H. Rehmani, "Applications of wireless sensor networks for urban areas," J. Netw. Comput. Appl., vol. 60, no. C, pp. 192-219, Jan. 2016.
[2] S. R. J. Ramson and D. J. Moni, "Applications of wireless sensor networks — a survey," in 2017 International Conference on Innovations in Electrical, Electronics, Instrumentation and Media Technology (ICEEIMT), Feb 2017, pp. 325-329.
[3] M. Kwak, Y. Park, J. Kim, J. Han, and T. Kwon, "An energy-efficient and lightweight indoor localization system for internet-of-things (iot) environments," Proc. ACM Interact. Mob. Wearable Ubiquitous Technol., vol. 2, no. 1, pp. 17:1-17:28, Mar. 2018. [Online]. Available: http://doi.acm.org/10.1145/3191749
[4] E. Martin, O. Vinyals, G. Friedland, and R. Bajcsy, "Precise indoor localization using smart phones," in Proceedings of the 18th ACM International Conference on Multimedia, ser. MM '10. New York, NY, USA: ACM, 2010, pp. 787-790. [Online]. Available: http://doi.acm.org/10.1145/1873951.1874078
[5] Y. Gu, A. Lo, and I. Niemegeers, "A survey of indoor positioning systems for wireless personal networks," Commun. Surveys Tuts., vol. 11, no. 1, pp. 13-32, Jan. 2009. [Online]. Available: http://dx.doi.org/10.1109/SURV.2009.090103
[6] S. Doeweling, T. Tahiri, P. Sowinski, B. Schmidt, and M. Khalilbeigi, "Support for collaborative situation analysis and planning in crisis management teams using interactive tabletops," in Proceedings of the 2013 ACM International Conference on Interactive Tabletops and Surfaces, ser. ITS '13. New York, NY, USA: ACM, 2013, pp. 273-282. [Online]. Available: http://doi.acm.org/10.1145/2512349.2512823
[7] K. Tran, D. Phung, B. Adams, and S. Venkatesh, "Indoor location prediction using multiple wireless received signal strengths," in Proceedings of the 7th Australasian Data Mining Conference - Volume 87, ser. AusDM '08. Darlinghurst, Australia: Australian Computer Society, Inc., 2008, pp. 187-192. [Online]. Available: http://dl.acm.org/citation.cfm?id=2449288.2449317
[8] S. K. Pandey and M. A. Zaveri, "Localization for collaborative processing in the internet of things framework," in Proceedings of the Second International Conference on IoT in Urban Space, ser. Urb-IoT '16. New York, NY, USA: ACM, 2016, pp. 108-110. [Online]. Available: http://doi.acm.org/10.1145/2962735.2962752
[9] T. Kramp, R. van Kranenburg, and S. Lange, Introduction to the Internet of Things. Berlin, Heidelberg: Springer Berlin Heidelberg, 2013, pp. 1-10. [Online]. Available: https://doi.org/10.1007/978-3-642-40403-0_1
[10] E. Curry, S. Dustdar, Q. Z. Sheng, and A. Sheth, "Smart cities – enabling services and applications," Journal of Internet Services and Applications, vol. 7, no. 1, p. 6, Jun 2016. [Online]. Available: https://doi.org/10.1186/s13174-016-0048-6
[11] A. Ojo, Z. Dzhusupova, and E. Curry, "Exploring the Nature of the Smart Cities Research Landscape," in Smarter as the New Urban Agenda: A Comprehensive View of the 21st Century City, R. Gil-Garcia, T. A. Pardo, and T. Nam, Eds. Springer, 2015. [Online]. Available: http://www.edwardcurry.org/publications/Landscape_Preprint.pdf
[12] A. Filippoupolitis and E. Gelenbe, "An emergency response system for intelligent buildings," in Sustainability in Energy and Buildings, N. M'Sirdi, A. Namaane, R. J. Howlett, and L. C. Jain, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2012, pp. 265-274.
[13] W. Zeiler, R. van Houten, and G. Boxem, "Smart buildings: Intelligent software agents," in Sustainability in Energy and Buildings, R. J. Howlett, L. C. Jain, and S. H. Lee, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2009, pp. 9-17.
[14] A. Rice and R. Harle, "Evaluating lateration-based positioning algorithms for fine-grained tracking," in Proceedings of the 2005 Joint Workshop on Foundations of Mobile Computing, ser. DIALM-POMC '05. New York, NY, USA: ACM, 2005, pp. 54-61. [Online]. Available: http://doi.acm.org/10.1145/1080810.1080820
[15] P. Jiang, Y. Zhang, W. Fu, H. Liu, and X. Su, "Indoor mobile localization based on wi-fi fingerprint's important access point," International Journal of Distributed Sensor Networks, vol. 11, no. 4, p. 429104, 2015. [Online]. Available: https://doi.org/10.1155/2015/429104
[16] C. Li, Z. Qiu, and C. Liu, "An improved weighted k-nearest neighbor algorithm for indoor positioning," Wirel. Pers. Commun., vol. 96, no. 2, pp. 2239-2251, Sep. 2017. [Online]. Available: https://doi.org/10.1007/s11277-017-4295-z
[17] A. Belay Adege, Y. Yayeh, G. Berie, H. Lin, L. Yen, and Y. R. Li, "Indoor localization using k-nearest neighbor and artificial neural network back propagation algorithms," in 2018 27th Wireless and Optical Communication Conference (WOCC), April 2018, pp. 1-2.
[18] M. Y. Umair and K. V. R., "An enhanced k-nearest neighbor algorithm for indoor positioning systems in a wlan," in 2014 IEEE Computers, Communications and IT Applications Conference, Oct 2014, pp. 19-23.
[19] M. V. Moreno-Cano, M. A. Zamora-Izquierdo, J. Santa, and A. F. Skarmeta, "An indoor localization system based on artificial neural networks and particle filters applied to intelligent buildings," Neurocomput., vol. 122, pp. 116-125, Dec. 2013. [Online]. Available: http://dx.doi.org/10.1016/j.neucom.2013.01.045
[20] A. Chriki, H. Touati, and H. Snoussi, "SVM-based indoor localization in wireless sensor networks," in 2017 13th International Wireless Communications and Mobile Computing Conference (IWCMC), June 2017, pp. 1144-1149.
[21] W. Zhang, K. Liu, W. Zhang, Y. Zhang, and J. Gu, "Deep neural networks for wireless localization in indoor and outdoor environments," Neurocomput., vol. 194, no. C, pp. 279-287, Jun. 2016. [Online]. Available: https://doi.org/10.1016/j.neucom.2016.02.055
[22] M. Altini, D. Brunelli, E. Farella, and L. Benini, "Bluetooth indoor localization with multiple neural networks," in IEEE 5th International Symposium on Wireless Pervasive Computing 2010, May 2010, pp. 295-300.
[23] Z. E. Khatab, A. Hajihoseini, and S. A. Ghorashi, "A fingerprint method for indoor localization using autoencoder based deep extreme learning machine," IEEE Sensors Letters, vol. 2, no. 1, pp. 1-4, March 2018.
[24] Z. Wu, Q. Xu, J. Li, C. Fu, Q. Xuan, and Y. Xiang, "Passive indoor localization based on csi and naive bayes classification," IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 48, no. 9, pp. 1566-1577, Sep. 2018.
[25] A. Haider, Y. Wei, S. Liu, and S.-H. Hwang, "Pre- and post-processing algorithms with deep learning classifier for wi-fi fingerprint-based indoor positioning," Electronics, vol. 8, no. 2, 2019. [Online]. Available: http://www.mdpi.com/2079-9292/8/2/195
[26] W. Sun, M. Xue, H. Yu, H. Tang, and A. Lin, "Augmentation of fingerprints for indoor wifi localization based on gaussian process regression," IEEE Transactions on Vehicular Technology, vol. 67, no. 11, pp. 10896-10905, Nov 2018.
[27] J. Torres-Sospedra, R. Montoliu, A. Martínez-Usó, J. P. Avariento, T. J. Arnau, M. Benedito-Bordonau, and J. Huerta, "Ujiindoorloc: A new multi-building and multi-floor database for wlan fingerprint-based indoor localization problems," in 2014 International Conference on Indoor Positioning and Indoor Navigation (IPIN), Oct 2014, pp. 261-270.
[28] T. D. Sanger, "Optimal unsupervised learning in a single-layer linear feedforward neural network," Neural Networks, vol. 2, no. 6, pp. 459-473, 1989. [Online]. Available: http://www.sciencedirect.com/science/article/pii/0893608089900440
[29] S. Elanayar V.T. and Y. C. Shin, "Radial basis function neural network for approximation and estimation of nonlinear stochastic dynamic systems," IEEE Transactions on Neural Networks, vol. 5, no. 4, pp. 594-603, July 1994.
[30] N. R. Pal, J. C. Bezdek, and E. C. Tsao, "Generalized clustering networks and kohonen's self-organizing scheme," IEEE Transactions on Neural Networks, vol. 4, no. 4, pp. 549-557, July 1993.
[31] H. HaddadPajouh, A. Dehghantanha, R. Khayami, and K.-K. R. Choo, "A deep recurrent neural network based approach for internet of things malware threat hunting," Future Generation Computer Systems, vol. 85, pp. 88-96, 2018. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0167739X1732486X
[32] W. Liu, Z. Wang, X. Liu, N. Zeng, Y. Liu, and F. E. Alsaadi, "A survey of deep neural network architectures and their applications," Neurocomputing, vol. 234, pp. 11-26, 2017. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0925231216315533
[33] B. L. Happel and J. M. Murre, "Design and evolution of modular neural network architectures," Neural Networks, vol. 7, no. 6, pp. 985-1004, 1994 (Models of Neurodynamics and Behavior). [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0893608005801558
[34] H. Turabieh, M. Mafarja, and X. Li, "Iterated feature selection algorithms with layered recurrent neural network for software fault prediction," Expert Systems with Applications, vol. 122, pp. 27-42, 2019. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0957417418308030
[35] A. J. Maren, C. T. Harston, and R. M. Pap, Handbook of Neural Computing Applications. Orlando, FL, USA: Academic Press, Inc., 1990.
[36] D. Dua and C. Graff, "UCI machine learning repository," 2019. [Online]. Available: http://archive.ics.uci.edu/ml
[37] J. Jang and S. Hong, "Indoor localization with wifi fingerprinting using convolutional neural network," in 2018 Tenth International Conference on Ubiquitous and Future Networks (ICUFN), July 2018, pp. 753-758.
[38] K. S. Kim, S. Lee, and K. Huang, "A scalable deep neural network architecture for multi-building and multi-floor indoor localization based on wi-fi fingerprinting," Big Data Analytics, vol. 3, no. 1, p. 4, Apr 2018. [Online]. Available: https://doi.org/10.1186/s41044-018-0031-2
[39] M. Nowicki and J. Wietrzykowski, "Low-effort place recognition with wifi fingerprints using deep learning," in Automation 2017, R. Szewczyk, C. Zieliński, and M. Kaliczyńska, Eds. Cham: Springer International Publishing, 2017, pp. 575-584.
[40] J. G. Rohra, B. Perumal, S. J. Narayanan, P. Thakur, and R. B. Bhatt, "User localization in an indoor environment using fuzzy hybrid of particle swarm optimization & gravitational search algorithm with neural networks," in Proceedings of Sixth International Conference on Soft Computing for Problem Solving, K. Deep, J. C. Bansal, K. N. Das, A. K. Lal, H. Garg, A. K. Nagar, and M. Pant, Eds. Singapore: Springer Singapore, 2017, pp. 286-295.
Learning with Dynamic Architectures for Artificial
Neural Networks - Adaptive Batch Size Approach
*
King Hussein School of Computer Science
Princess Sumaya University for Technology
Jordan
Abstract—In this research we explore the performance of the ADANET framework by using a custom search space for an image-classification dataset using TensorFlow libraries in combination with adaptive batch sizes for learning. In one experiment we classified Fashion MNIST data and MNIST data of handwritten digits and obtained favorable results in terms of training time as well as accuracy by alternating learning batch sizes dynamically. Our testing was applied using a simple deep neural network (DNN) and also a convolutional neural network (CNN).

Keywords—Artificial Neural Networks, Convolutional Neural Networks, Batch Size, Convergence, Accuracy, Training, Feed-forward, Two-Layer Feed-Forward Net, Sampling, Tensorflow, ADANet, Stochasticity

I. INTRODUCTION

Artificial Neural Networks are machine learning models inspired by the human brain [12], [13]. They are considered to be among the most powerful structures capable of producing highly accurate learning rates. However, these structures did not gain very high popularity due to their complex designs, long training times, and the machine learning model candidate selection requiring its own domain expertise. But as computational power and specialized deep learning hardware such as TPUs become more readily available, machine learning models will grow larger and ensembles will become more prominent. Neural networks have been applied in different domains such as classification problems, including for images, speech recognition, expert systems, fuzzy logic and control, to name but a few [15], [16].

ADANET [1], [2], a fast, flexible and easy to use AutoML framework newly introduced by Google, is an adaptive structural learning platform for Artificial Neural Networks (ANN) developed to handle both the structure and the weights of the ANN. It is a lightweight TensorFlow [3] based platform for high quality ensemble learning that does not depend on domain expertise. The code, which is based on the AdaNet algorithm [2], is open-source and:
i. supports learning of the ANN structure as an ensemble of subnetworks;
ii. integrates with the existing TensorFlow design and ecosystem;
iii. performs well on novel datasets by offering sensible default search spaces;
iv. with the availability of a flexible API, can utilize expert information when available;
v. utilizes distributed CPU, GPU, and TPU hardware to efficiently accelerate training.

II. PROBLEM STATEMENT

Design and training of Artificial Neural Networks takes a long time to converge and achieve acceptable accuracy. Some of the drawbacks associated with their design include:
Motivated by the popularity of variance-reduced methods that achieve linear convergence rates with small sample sizes, [7] increased sample sizes dynamically in stochastic gradient descent iterations and developed theoretical and empirical methods to counter the prohibitive issues inherent in multiple training passes. They obtained positive performance increments within accuracy thresholds "on an n-sample in 2n, instead of n log n steps". Similar work demonstrating the effectiveness of combining learning rates with dynamic batch sizes was performed by [8] and [9]; also, [10] applied novel, adaptive approaches that control the increases in batch sizes, with application to convex problems and convolutional neural networks (CNN). These approaches, however, did not explore the performance

The Fashion MNIST dataset consists of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image, associated with a label from 10 classes [17]. The second set of tests was conducted against the MNIST Database of Handwritten Digit Images, which offers a collection of handwritten digit images to be used in optical character recognition (OCR) and research in data science and machine learning [18].

V. EXPERIMENTAL RESULTS

The Fashion Modified National Institute of Standards and Technology (MNIST) dataset was fed into TensorFlow using
the Estimator convention; then a neural network was built using different classifiers. The first model was built using a simple deep neural network (DNN) classifier and the second model was built using a convolutional neural network (CNN). The accuracy was evaluated for both classifiers. Figure 5.1a shows accuracy over iterations with batch sizes 10 to 200 for the Fashion MNIST dataset with the DNN classifier for a learning rate of 0.01. Figure 5.1b shows accuracy over iterations with batch sizes 10 to 200 for the Fashion MNIST dataset with the CNN ADANET model for a learning rate of 0.01.

Figures 3 and 4 show the results of the ADANET tests on this dataset with batch sizes of 1000 to 3000.

A. Test Colab Specification
The tests above were conducted with the following Colab specification:
CPU: 1x single-core hyper-threaded (1 core, 2 threads) Xeon processor @ 2.3 GHz (no Turbo Boost), 45 MB cache
RAM: ~12.6 GB available
Disk: ~320 GB available

B. Analysis
It is evident from both sets of results that ADANET accuracy rates are around 4% to 10% higher with CNN models than with simple DNN models for similar batch sizes and iterations. We can see a significant improvement when using a large batch size (200) over a small batch size (10). Furthermore, changing the number of iterations seemingly has little effect on the obtained accuracy rates, which directly affects the time needed to train big datasets.
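The adaptive batch-size idea examined above can be written as a small training-loop skeleton: start with a small batch while gradients are large, then grow the batch to cut gradient noise (variance falls roughly as 1/batch) as training approaches the optimum. The linear-regression task, schedule and constants below are a generic illustration in plain NumPy, not the ADANET configuration used in the experiments:

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))
true_w = np.array([1.0, -2.0, 0.5, 3.0, -1.0])
y = X @ true_w + rng.normal(scale=0.1, size=1000)

w = np.zeros(5)
batch_size, lr = 10, 0.05
for step in range(300):
    idx = rng.integers(0, len(X), size=batch_size)
    grad = X[idx].T @ (X[idx] @ w - y[idx]) / batch_size  # mini-batch MSE gradient
    w -= lr * grad
    # Adaptive schedule: enlarge the batch every 100 steps to reduce
    # gradient noise near the optimum (capped at 200 samples).
    if step % 100 == 99:
        batch_size = min(batch_size * 4, 200)

print(np.round(w, 2))  # close to true_w
```

Swapping the fixed `if step % 100 == 99` trigger for a gradient-variance test gives the fully adaptive variants of [9] and [10].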
VI. LIMITATIONS AND FUTURE WORK

The experimental testing conducted in this research, which adaptively varies the batch sizes used in training artificial neural networks, indicates that training DNN and CNN models in this way has a linear effect on speed with minuscule accuracy degradation overhead. We have alternated between large and small batch sizes as needed without compromising the speed, by using fewer iterations. Our experiments show the same improvement under different classifier-dataset combinations. The proposed procedure could be applied equally well to large and small datasets, with different classifiers and different model architectures.

We presented experimental results demonstrating that our procedure was successful in training the network and performs better than procedures using a static small batch size and adaptive learning rates. Further experiments with other batch sizes and larger datasets will be conducted in the future.

REFERENCES

[1] Charles Weill, Introducing AdaNet: Fast and Flexible AutoML with Learning Guarantees, Cornell University, 2018. https://arxiv.org/abs/1905.00080, accessed March 2019.
[4] Priya Goyal, Piotr Dollár, Ross B. Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677, 2017. URL: http://arxiv.org/abs/1706.02677, accessed May 2019.
[5] Yang You, Igor Gitman, and Boris Ginsburg. Scaling SGD batch size to 32K for ImageNet training. arXiv preprint arXiv:1708.03888, 2017. http://arxiv.org/abs/1708.03888, accessed May 2019.
[8] Reza Harikandeh, Mohamed Osama Ahmed, Alim Virani, Mark Schmidt, Jakub Konečný, and Scott Sallinen. Stop wasting my gradients: Practical SVRG. In Advances in Neural Information Processing Systems, pp. 2251-2259, 2015.
[9] Soham De, Abhay Yadav, David Jacobs, and Tom Goldstein. Big batch SGD: Automated inference using adaptive batch sizes. arXiv preprint arXiv:1610.05792, October 2016.
[10] Lukas Balles, Javier Romero, and Philipp Hennig. Coupling adaptive batch sizes with learning rates. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, pp. 410-419, 2017.
[11] P. G. Maghami and D. W. Sparks, "Design of neural networks for fast convergence and accuracy: dynamics and control," IEEE Transactions on Neural Networks, vol. 11, no. 1, pp. 113-123, Jan. 2000. doi: 10.1109/72.822515, http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=822515&isnumber=17821, accessed May 2019.
[12] S. Haykin, Neural Networks: A Comprehensive Foundation. New York: Macmillan, 1994.
[13] M. I. Elmasry, Ed., VLSI Artificial Neural Networks Engineering. Norwell, MA: Kluwer, 1994.
[14] Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836, 2016.
[18] L. Deng, "The MNIST Database of Handwritten Digit Images for Machine Learning Research [Best of the Web]," IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 141-142, Nov. 2012. doi: 10.1109/MSP.2012.2211477. URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6296535&isnumber=6296521, accessed May 2019.
[19] Richard H. Byrd, Gillian M. Chin, Jorge Nocedal, and Yuchen Wu. Sample size selection in optimization methods for machine learning. Mathematical Programming, 134(1):127-155, 2012.
Hybrid Machine Learning Classifiers to Predict
Student Performance
Hamza Turabieh
Information Technology Department
CIT College, Taif University
Taif, KSA
h.turabieh@tu.edu.sa
Abstract—Recently, machine learning technology has been applied successfully, and extensively, in many domains of our lives. In this paper, we investigate machine learning for educational data mining systems, which focus on developing new approaches to discover meaningful knowledge from stored data. Educational data come from different sources, such as academic records of students, virtual courses, e-learning log files, and so on. Predicting student marks is a challenging problem in the educational sector. We applied a hybrid feature selection algorithm with different machine learning classifiers (i.e., nearest neighbors (kNN), Convolutional Neural Network (CNN), Naïve Bayes (NB), and decision trees (C4.5)) to predict student performance. A feature selection algorithm is used to select the most valuable features; here, a binary genetic algorithm is applied as a wrapper feature selector. A benchmark dataset from the UCI Machine Learning Repository is used, and the obtained results show excellent performance.

Index Terms—Machine learning, Student performance, Feature selection.

I. INTRODUCTION

Educational systems hold complex data that can be mined for hidden knowledge to improve the educational system as a whole [1]. Educational data (e.g., e-learning log files, student marks, admission/registration data, virtual courses, and so on) can be processed with machine learning approaches to find meaningful models. Several researchers adopt different methods (e.g., classification, clustering, and statistics) to mine educational data [2], [3]. Predicting student performance is a challenging problem that educational institutions such as universities, schools, and training centers face every year. Predicting student performance at an early stage encourages educational institutes to find solutions that prevent negative student outcomes [4]. Lecturers can anticipate the performance of their students and find appropriate learning strategies to improve it. Moreover, such predictions can enhance institutional enrolment policies and help students improve their grades.

Machine Learning (ML) methods have been used successfully in several domains, such as healthcare [5], environmental studies [6], industry [7], and educational systems [3]. To date, machine learning in the educational sector continues to attract researchers [8], [9]. Moreover, e-learning and big data in education provide researchers with extremely large datasets that, examined correctly, can help educators and decision-makers improve educational systems.

Educational domains pose a set of challenging problems for machine learning researchers, since education systems hold complex information such as student records, class and schedule information, admission and registration data, and alumni information. The motivation of this paper is to predict student performance from historical data using a hybrid machine learning approach.

This paper aims to investigate the performance of different machine learning classifiers, with and without a feature selection algorithm, in predicting student performance. A binary genetic algorithm is employed for feature selection to reduce the dimensionality of the search space, which improves the overall performance of the classifiers and reduces the computational time.

The rest of this paper is organized as follows: Section II explores related work on machine learning in educational systems. Section III presents the proposed hybrid approach. Section IV presents the experimental datasets used in this paper. Section V shows the experimental results and analysis of the proposed approach. Section VI draws the conclusion and future works.

II. RELATED WORKS

Machine learning for educational systems has been investigated in depth by [1], where five different fields are defined: prediction, discovery within models, extraction of data for human judgment, clustering, and relationship mining. Most previous work on education systems concerns universities or virtual learning [10], and in all of it the data were collected either from surveys or from e-learning systems.

Kapur et al. [11] applied two different machine learning methods (J48 decision tree and random forest) to predict student marks in the education field; the collected data consist of 480 entries related to student enrollment. Márquez-Vera et al. [12] applied different machine learning methods to estimate student dropout on an unbalanced dataset; the authors collected 419 samples from one Mexican high school. Asif et al. [13] investigated several courses and explored how to predict good or poor achievement. Saarela et al. [14] proposed a system to predict the difficulty level of different math questions and predict whether students can
Fitness = E * (1 + β * |R| / |N|)    (1)
TABLE I
PARAMETERS SETTING FOR ANN INTERNAL CLASSIFIER.

Parameter                          Value
Number of neurons, input layer     Number of selected features
Number of neurons, hidden layer    10
Number of neurons, output layer    1
Training sample                    70% of the data
Testing sample                     15% of the data
Validation sample                  15% of the data
Fitness function                   Mean square error
Given:
  nP: base population size.
  nI: number of iterations.
  rC: crossover rate.
  rM: mutation rate.
Generate an initial population of size nP.
Evaluate the initial population according to the fitness function.
While (current iteration ≤ nI)
  // Breed rC × nP new solutions.
  Select two parent solutions from the current population.
  Form offspring solutions via crossover.
  If (rand(0.0, 1.0) < rM)
    Mutate the offspring solutions.
  End If
  Evaluate each child solution according to the fitness function.
  Add the offspring to the population.
  // The population size is now MaxPop = nP × (1 + rC).
  Remove the rC × nP least-fit solutions from the population.
End While
Output the global best solution.
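The pseudocode above, together with the fitness function of Eq. (1), can be sketched in plain Python. This is an illustrative reimplementation, not the paper's MATLAB code: `classifier_error` is a placeholder for training and validating the internal classifier on the selected feature subset, and the value of β is an assumption.

```python
import random

BETA = 0.01  # weight of the feature-count penalty in Eq. (1); illustrative value

def fitness(mask, classifier_error, beta=BETA):
    """Eq. (1): Fitness = E * (1 + beta * |R| / |N|), where E is the
    classification error on the selected features, |R| the number of
    selected features, and |N| the total number of features."""
    n_selected = sum(mask)
    if n_selected == 0:
        return float("inf")  # an empty feature subset is invalid
    error = classifier_error(mask)
    return error * (1 + beta * n_selected / len(mask))

def bga(classifier_error, n_features, nP=20, nI=50, rC=0.8, rM=0.1, seed=0):
    """Binary genetic algorithm as a wrapper feature selector."""
    rng = random.Random(seed)
    # Generate and evaluate the initial population of binary masks.
    pop = [[rng.randint(0, 1) for _ in range(n_features)] for _ in range(nP)]
    scored = [(fitness(m, classifier_error), m) for m in pop]
    for _ in range(nI):
        # Breed rC * nP new solutions via one-point crossover.
        for _ in range(int(rC * nP) // 2):
            p1, p2 = rng.sample([m for _, m in scored], 2)
            cut = rng.randrange(1, n_features)
            children = [p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]]
            for child in children:
                if rng.random() < rM:          # mutate one random bit
                    i = rng.randrange(n_features)
                    child[i] ^= 1
                scored.append((fitness(child, classifier_error), child))
        # Keep the nP fittest solutions (lower fitness is better).
        scored.sort(key=lambda t: t[0])
        scored = scored[:nP]
    return scored[0]  # (fitness, best mask)
```

With a toy error function, e.g. the Hamming distance to a known-relevant mask divided by the number of features, `bga` converges to that mask within a few dozen iterations.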
received. All classifiers are trained and tested using five-fold cross-validation. Readers interested in classification algorithms and their applications are referred to [23], [27]–[29].

IV. EXPERIMENTAL DATA

In this paper, we used a public dataset proposed by Cortez and Silva [30], [31] in 2008. The dataset describes secondary education in Portugal, where secondary schooling lasts three years and is offered by two types of schools: private and public. The grading system ranges between 0 (lowest) and 20 (highest), and each student is evaluated three times during the year. The data were collected during the 2005-2006 academic year from two public schools.

The final grade reflects student performance. The dataset consists of student marks, demographics, school information, etc. It has 649 samples, and each sample has 33 attributes. The dataset covers two distinct student performance cases: (i) Mathematics (mat) and (ii) Portuguese language (por). Table II shows a description of the dataset. The final target is G3 (the final year grade), which has a strong correlation with attributes G2 and G1. More details about the dataset can be found in [30], and the dataset is available at https://archive.ics.uci.edu/ml/datasets/student+performance.

V. EXPERIMENTAL RESULTS AND ANALYSIS

In this research, we evaluate the performance of a binary genetic algorithm with different machine learning classifiers to enhance the prediction of student performance on the Mathematics (mat) dataset. All experiments were run in MATLAB R2014a. Two types of experiments were performed: without feature selection and with feature selection. Table IV shows the parameter settings for the binary genetic algorithm; all settings were carefully selected after preliminary experiments. Each classifier was executed 11 times. Four measurement criteria are used to evaluate the obtained results: accuracy, precision, recall, and F-measure. Eqs. (2), (3), (4), and (5) show how each criterion is evaluated, respectively. All of these equations are calculated from the confusion matrix shown in Table III, where:

1) TP: true positives, where the actual and predicted values are both positive.
2) TN: true negatives, where the actual and predicted values are both negative.
3) FN: false negatives, where the actual class is positive while the predicted value is negative.
4) FP: false positives, where the actual value is negative while the predicted value is positive.

Accuracy = (TP + TN) / (TP + FP + FN + TN)    (2)

Precision = TP / (TP + FP)    (3)

Recall = TP / (TP + FN)    (4)

F-Measure = 2 × (Recall × Precision) / (Recall + Precision)    (5)

Table V shows the obtained results of all classifiers without the feature selection algorithm. The CNN approach clearly outperforms the other classifiers in terms of accuracy, while C4.5 is the worst. The CNN performs well compared with the other approaches owing to its structure: a CNN has many different filters/kernels that convolve over a given input volume, and it learns by creating more abstract representations of the data as the network structure grows deeper. The CNN structure thus extracts features that yield higher accuracy results.
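Eqs. (2)-(5) translate directly into code; a small helper, illustrative rather than taken from the paper's MATLAB implementation:

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute Eqs. (2)-(5) from the confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)          # Eq. (2)
    precision = tp / (tp + fp)                          # Eq. (3)
    recall = tp / (tp + fn)                             # Eq. (4)
    f_measure = 2 * (recall * precision) / (recall + precision)  # Eq. (5)
    return accuracy, precision, recall, f_measure
```

For example, `classification_metrics(40, 50, 5, 5)` gives an accuracy of 0.9 and precision, recall, and F-measure of 8/9 ≈ 0.889 each.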
Fig. 3. A demonstration of Binary Genetic Algorithm for a single iteration [20].
Figure 4 shows the boxplot diagrams for all four classifiers (Best, Worst, Average, and Median); it is clear that the performance of the CNN surpasses the other approaches.

[Figures 4 and 5: boxplot diagrams of classifier accuracy; vertical axis "Accuracy", ranging roughly from 0.75 to 0.94.]

Table VI shows the obtained results after employing the BGA feature selection algorithm. All results improve, except for the NB method, compared with the results reported in Table V: kNN improves by 2%, CNN by 2%, and C4.5 by 3%. Clearly, the feature selection algorithm can reduce the complexity of the dataset and enhance the overall prediction performance. Figure 5 presents the boxplot diagrams for all classifiers with the feature selection algorithm; all methods show stable performance after the dataset size is reduced.

VI. CONCLUSION AND FUTURE WORKS

In this paper, we proposed a hybrid feature selection algorithm with a set of machine learning algorithms to predict student performance. Four different machine learning algorithms were examined: nearest neighbors (kNN), Convolutional Neural Network (CNN), Naïve Bayes (NB), and decision trees (C4.5). A Binary Genetic Algorithm (BGA) is used as a wrapper
TABLE II
DATASET DISTRIBUTION .
TABLE III
THE CONFUSION MATRIX.

                             Predicted Class
                             Class = Yes            Class = No
Actual Class   Class = Yes   True Positive (TP)     False Negative (FN)
               Class = No    False Positive (FP)    True Negative (TN)
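The four counts in Table III can be tallied directly from paired actual/predicted labels. A minimal sketch: the "Yes"/"No" labels follow Table III, while the function name and everything else are illustrative.

```python
def confusion_counts(actual, predicted, positive="Yes"):
    """Tally TP, FN, FP, TN as laid out in Table III."""
    tp = fn = fp = tn = 0
    for a, p in zip(actual, predicted):
        if a == positive:          # row: actual class = Yes
            if p == positive:
                tp += 1            # predicted Yes -> true positive
            else:
                fn += 1            # predicted No  -> false negative
        else:                      # row: actual class = No
            if p == positive:
                fp += 1            # predicted Yes -> false positive
            else:
                tn += 1            # predicted No  -> true negative
    return tp, fn, fp, tn
```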
TABLE IV
THE PARAMETERS SETTING FOR BGA.

TABLE VI
OBTAINED RESULTS WITH FEATURE SELECTION.
REFERENCES

[1] R. Baker and K. Yacef, "The state of educational data mining in 2009: A review and future visions," JEDM, vol. 1, no. 1, pp. 3–17, Jun. 2009.
[2] C. Romero, S. Ventura, and E. García, "Data mining in course management systems: Moodle case study and tutorial," Computers & Education, vol. 51, no. 1, pp. 368–384, 2008. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0360131507000590
[3] E. Fernandes, M. Holanda, M. Victorino, V. Borges, R. Carvalho, and G. V. Erven, "Educational data mining: Predictive analysis of academic performance of public school students in the capital of Brazil," Journal of Business Research, vol. 94, pp. 335–343, 2019. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0148296318300870
[4] L. H. Son and H. Fujita, "Neural-fuzzy with representative sets for prediction of student performance," Applied Intelligence, vol. 49, no. 1, pp. 172–187, Jan. 2019. [Online]. Available: https://doi.org/10.1007/s10489-018-1262-7
[5] C. M. Hatton, L. W. Paton, D. McMillan, J. Cussens, S. Gilbody, and P. A. Tiffin, "Predicting persistent depressive symptoms in older adults: A machine learning approach to personalised mental healthcare," Journal of Affective Disorders, vol. 246, pp. 857–860, 2019. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0165032718319931
[6] E. Fijani, R. Barzegar, R. Deo, E. Tziritis, and K. Skordas, "Design and implementation of a hybrid model based on two-layer decomposition method coupled with extreme learning machines to support real-time environmental monitoring of water quality parameters," Science of The Total Environment, vol. 648, pp. 839–853, 2019. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0048969718331851
[7] D. D. Clercq, D. Jalota, R. Shang, K. Ni, Z. Zhang, A. Khan, Z. Wen, L. Caicedo, and K. Yuan, "Machine learning powered software for accurate prediction of biogas production: A case study on industrial-scale Chinese production data," Journal of Cleaner Production, vol. 218, pp. 390–399, 2019. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S095965261930037X
[8] K. S. Rawat and I. V. Malhan, "A hybrid classification method based on machine learning classifiers to predict performance in educational data mining," in Proceedings of 2nd International Conference on Communication, Computing and Networking, C. R. Krishna, M. Dutta, and R. Kumar, Eds. Singapore: Springer Singapore, 2019, pp. 677–684.
[9] S. Carnell, B. Lok, M. T. James, and J. K. Su, "Predicting student success in communication skills learning scenarios with virtual humans," in Proceedings of the 9th International Conference on Learning Analytics & Knowledge, ser. LAK19. New York, NY, USA: ACM, 2019, pp. 436–440. [Online]. Available: http://doi.acm.org/10.1145/3303772.3303828
[10] P. Ducange, R. Pecori, L. Sarti, and M. Vecchio, "Educational big data mining: How to enhance virtual learning environments," in International Joint Conference SOCO'16-CISIS'16-ICEUTE'16, M. Graña, J. M. López-Guede, O. Etxaniz, Á. Herrero, H. Quintián, and E. Corchado, Eds. Cham: Springer International Publishing, 2017, pp. 681–690.
[11] B. Kapur, N. Ahluwalia, and R. Sathyaraj, "Comparative study on marks prediction using data mining and classification algorithms," Int. J. Adv. Res. Comput. Sci., vol. 8, pp. 632–636, 2017.
[12] C. Márquez-Vera, A. Cano, C. Romero, A. Y. M. Noaman, H. Mousa Fardoun, and S. Ventura, "Early dropout prediction using data mining: a case study with high school students," Expert Systems, vol. 33, no. 1, pp. 107–124, 2016. [Online]. Available: https://onlinelibrary.wiley.com/doi/abs/10.1111/exsy.12135
[13] R. Asif, A. Merceron, S. A. Ali, and N. G. Haider, "Analyzing undergraduate students' performance using educational data mining," Computers & Education, vol. 113, pp. 177–194, 2017. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0360131517301124
[14] M. Saarela and B. Yener, "Predicting math performance from raw large-scale educational assessments data: A machine learning approach," 2016.
[15] J. Xu, Y. Han, D. Marcu, and M. van der Schaar, "Progressive prediction of student performance in college programs," 2017. [Online]. Available: https://aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14234
[16] R. Asif, S. Hina, and S. Haque, "Predicting student academic performance using data mining methods," Int. J. Comput. Sci. Netw. Secur., vol. 17, no. 5, pp. 187–191, 2017.
[17] K. P. Rao, M. C. S. Rao, and B. Ramesh, "Predicting learning behavior of students using classification techniques," International Journal of Computer Applications, vol. 139, no. 7, pp. 15–19, Apr. 2016, published by Foundation of Computer Science (FCS), NY, USA.
[18] C. Anuradha and T. Velmurugan, "A comparative analysis on the evaluation of classification algorithms in the prediction of students performance," Indian Journal of Science and Technology, vol. 8, no. 15, 2015. [Online]. Available: http://www.indjst.org/index.php/indjst/article/view/74555
[19] J. H. Holland, Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to Biology, Control and Artificial Intelligence. Cambridge, MA, USA: MIT Press, 1992.
[20] H. Turabieh, M. Mafarja, and X. Li, "Iterated feature selection algorithms with layered recurrent neural network for software fault prediction," Expert Systems with Applications, vol. 122, pp. 27–42, 2019.
[21] B. Qu, Y. Zhu, Y. Jiao, M. Wu, P. Suganthan, and J. Liang, "A survey on multi-objective evolutionary algorithms for the solution of the environmental/economic dispatch problems," Swarm and Evolutionary Computation, vol. 38, pp. 1–11, 2018. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S2210650216301493
[22] S. Mirjalili, Genetic Algorithm. Cham: Springer International Publishing, 2019, pp. 43–55.
[23] W. Liu, Z. Wang, X. Liu, N. Zeng, Y. Liu, and F. E. Alsaadi, "A survey of deep neural network architectures and their applications," Neurocomputing, vol. 234, pp. 11–26, 2017. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0925231216315533
[24] S. Zhang, X. Li, M. Zong, X. Zhu, and R. Wang, "Efficient kNN classification with different numbers of nearest neighbors," IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 5, pp. 1774–1785, May 2018.
[25] G. Feng, J. Guo, B.-Y. Jing, and T. Sun, "Feature subset selection using naive Bayes for text classification," Pattern Recognition Letters, vol. 65, pp. 109–115, 2015. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0167865515002378
[26] L. A. Breslow and D. W. Aha, "Simplifying decision trees: A survey," Knowl. Eng. Rev., vol. 12, no. 1, pp. 1–40, Jan. 1997. [Online]. Available: http://dx.doi.org/10.1017/S0269888997000015
[27] E. Bauer and R. Kohavi, "An empirical comparison of voting classification algorithms: Bagging, boosting, and variants," Machine Learning, vol. 36, no. 1, pp. 105–139, Jul. 1999. [Online]. Available: https://doi.org/10.1023/A:1007515423169
[28] C. C. Aggarwal and C. Zhai, A Survey of Text Classification Algorithms. Boston, MA: Springer US, 2012, pp. 163–222.
[29] S. Boucheron, O. Bousquet, and G. Lugosi, "Theory of classification: a survey of some recent advances," ESAIM: Probability and Statistics, vol. 9, pp. 323–375, 2005.
[30] P. Cortez and A. Silva, "Using data mining to predict secondary school student performance," in A. Brito and J. Teixeira, Eds., Proceedings of 5th FUture BUsiness TEChnology Conference (FUBUTEC 2008), Porto, Portugal, 2008, pp. 5–12.
[31] D. Dua and C. Graff, "UCI machine learning repository," 2019. [Online]. Available: http://archive.ics.uci.edu/ml
Automated Grading for Handwritten Answer Sheets
using Convolutional Neural Networks
Eman Shaikh, Iman Mohiuddin, Ayisha Manzoor
Department of Computer Engineering,
Prince Mohammad bin Fahd University,
Al Khobar, Saudi Arabia.

Ghazanfar Latif*, Nazeeruddin Mohammad
Department of Computer Science,
Prince Mohammad bin Fahd University,
Al Khobar, Saudi Arabia.
*Email: glatif@pmu.edu.sa
Abstract—Optical Character Recognition (OCR) is an extensive research field in image processing and pattern recognition. Traditional character recognition methods cannot distinguish a character or a word from a scanned image. This paper proposes a system, built from a personal computer, a portable scanner, and an application program, that automatically grades handwritten answer sheets. For handwritten character recognition, the scanned images are fed through a machine learning classifier known as the Convolutional Neural Network (CNN). Two CNN models were proposed and trained on 250 images collected from students at Prince Mohammad Bin Fahd University. The proposed system outputs the final score of the student by comparing each classified answer with the correct answer. The experimental results showed that the proposed system achieved a high testing accuracy of 92.86%. The system can be used by instructors in educational institutions to grade the handwritten answer sheets of students automatically and effectively.

Keywords—Handwritten Numerals Recognition, Convolutional Neural Network, Handwritten Character Recognition, Scanned Document Segmentation

I. INTRODUCTION

In recent years, handwriting recognition has been one of the most engaging and demanding research areas in the sphere of image processing and pattern recognition. Handwriting recognition systems contribute remarkably to the development of automated procedures and enhance the interaction between humans and computerized systems in several operations. Nowadays, various technological approaches in organizations and institutions help reduce the time consumed by grading answer sheets manually, raising accuracy and avoiding the inaccuracies caused by humans. The comparison of answer sheets with their answer keys and the monotonous grading of student answers is a tedious and arduous task that should be automated.

For this purpose, Optical Character Recognition (OCR) is used to transform images of handwritten or typed text, captured with the help of a scanner, into electronic, machine-readable text. Handwriting recognition systems fall predominantly into two categories, namely offline and online recognition. In offline handwriting recognition systems, the handwriting written on paper is captured by a scanner and the completed handwritten text is available as an image, whereas in online recognition systems the characters are entered through an input device. Offline character recognition is more complicated than online recognition, since writing styles may differ from one user to another and considerable noise is introduced while the text is written and the document is scanned [4, 16]. Hence, offline handwriting recognition continues to be an active field of research into innovative procedures that would enhance the accuracy of handwriting recognition systems.

This paper proposes an automated system for grading handwritten answer sheets with the help of Convolutional Neural Networks (CNN). All the answer sheets were scanned separately through a portable scanner, and the scanned images were stored as black-and-white images. After scanning each answer sheet, the scanned images were given as input to a segmentation algorithm; this separates the questions from the answers written in each box. The segmentation procedure divided the images into finer divisions and procured more relevant data. Each segmented character and digit answer was extracted to generate parameters for testing and training. The data obtained formed a handwritten dataset consisting of a few English letters and numerals, which was then used to score the students' answer sheets. The students' answers were recognized using the two proposed CNN architectures.

The remainder of the paper is organized as follows: Section II introduces the literature review, Section III describes the proposed framework, Section IV demonstrates the experimental results, and Section V discusses the conclusion.

II. LITERATURE REVIEW

Handwriting recognition has been a focused sphere of research in the disciplines of pattern recognition and image processing over the past years, and there is indeed broad demand for optical character recognition of handwritten scripts. In this section, an extensive analysis of existing handwriting recognition systems based on various machine learning techniques is presented. Although printed text recognition is considered a solved problem these days, handwritten text recognition remains a demanding task, mainly due to the huge variation in handwriting among people, including the size, orientation, thickness, format, and dimension of each
Step 7: Save each segmented answer to the local drive.
Step 8: End

Table III depicts some properly segmented images from a student's answer sheet (refer to Fig. 2 for the sample template).

TABLE III. SAMPLES OF SEGMENTED ANSWERS

Fig. 2. Sample Answer Sheet

C. Convolutional Neural Network (CNN)

Parameters are an essential part of a Convolutional Neural Network and help optimize the quality of the network. Their role is to avoid overfitting and underfitting of the model for a given dataset. Changing the parameters helps obtain the desired results for a specific problem. This section discusses the different CNN parameters used to design the CNN models.

Firstly, a kernel size larger than 5 x 5 is not used, since a large kernel size results in slower training. Secondly, to minimize the error on the training data, the number of rounds of optimization performed during training is increased. However, this can lead to overfitting of the neural network, which degrades performance during the testing phase. To analyze this, error performance is monitored separately on the testing data as the number of epochs increases. A larger batch size requires more memory, and a very large batch size significantly degrades the quality of the model, as measured by its ability to generalize.

The output produced by the hidden layers is fed into a logistic function such as softmax. The ReLU layer, or activation function, performs the element-wise activation max(0, x), which changes negative values to zeros; it does not alter the size of the volume, since it has no hyperparameters. The softmax layer outputs a probability distribution, that is, the output values sum to 1. In addition, the softmax layer is a soft version of the max-output layer and hence is differentiable and resilient to outliers. Max pooling is the most commonly used type of pooling; it keeps only the most important part of the input volume, the largest element of each region of the rectified feature map. Dropout is a layer that drops a random set of activations by setting them to zero. It forces the network to be redundant, able to provide the right classification for a specific example even if some activations are dropped, and it helps ensure the network does not overfit the training data.

A dense layer applies a non-linear activation function and performs classification on the features extracted by the convolutional layers and downsampled by the pooling layers. Each node in this layer is connected to every node in the preceding layer. Adam is an optimization algorithm used in place of the classical stochastic gradient descent procedure to update the network weights iteratively from the training data; it combines RMSprop and stochastic gradient descent with momentum, using squared gradients to scale the learning rate as RMSprop does. The cross-entropy loss function calculates the error between the predicted value and the true value, and minimizing it helps gain better performance. In the hidden/convolutional layers, the artificial neurons are connected to the neurons of the preceding layers in order to produce an output from a set of weighted inputs. It is necessary to limit the number of hidden layers, because a large number of hidden layers would result in overfitting and enlarged computation.

D. Proposed CNN Models

Fig. 3 illustrates the proposed CNN architecture of Model 1. The input image is of size 64x64 and is passed through a convolution layer of 64 filters with a kernel size of 5x5 and ReLU activation. It is followed by a 2 x 2 max pooling layer that downsamples the image and aids in identifying the most important features, decreasing the size of the image. The image then passes through another convolution layer of 48 filters with kernel size 3 x 3 and ReLU activation. To avoid overfitting, the images then go through 20% regularization in the first dropout layer. The image passes through further convolution, max pooling, ReLU, and dropout layers until the sample data is finally converted into a one-dimensional vector by the flatten layer. The final layers comprise three dense layers of 512, 256, and 12 features. The first two dense layers use ReLU, and the third uses a softmax activation function, which converts the output into a probability distribution.
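The layer operations described above (ReLU, softmax, max pooling, and dropout) can be made concrete in a few lines of NumPy. This is an illustrative sketch of the operations themselves, not the paper's implementation:

```python
import numpy as np

def relu(x):
    # Element-wise max(0, x): negative values become zero, shape unchanged.
    return np.maximum(0, x)

def softmax(x):
    # Outputs a probability distribution: non-negative values summing to 1.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def max_pool_2x2(img):
    # Keep the largest element of each non-overlapping 2x2 region,
    # halving both spatial dimensions.
    h, w = img.shape
    trimmed = img[:h - h % 2, :w - w % 2]
    return trimmed.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def dropout(x, rate, rng):
    # Zero out a random fraction `rate` of activations (training only).
    mask = rng.random(x.shape) >= rate
    return x * mask
```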
The image is then recognized based on its probability distribution value.

Fig. 2 illustrates a student's answer sheet which provides the correct answers to all the questions. After segmentation, the answers to each question are used as a reference to score the students' answer sheets.

Fig. 4. Proposed CNN architecture of Model 2

E. Scoring

The algorithm for scoring each student's answer sheet is as follows:
Input: the filled-in A4 template sheet
Output: the score of the input template
Step 1: Start
Step 2: Load the CNN model
Step 3: Obtain the segmented student's answer sheets
Step 4: Read each segmented student's answer file
Step 5: Compare the student's answers with the true answers
Step 6: Score the answer sheets
Step 7: Display the score
Step 8: End

Testing accuracy and computation time of the two proposed CNN models:

Model #  Batch Size  Epochs  Testing Accuracy  Computation Time (Seconds)
1        50          10      90.120 %          1772.201
1        100         10      86.451 %          1409.258
1        200         10      80.164 %          954.4838
1        50          25      91.916 %          2284.184
1        100         25      92.430 %          3919.011
1        200         25      92.353 %          2258.307
1        50          50      92.866 %          4750.766
1        100         50      92.763 %          4483.056
1        200         50      92.738 %          5266.739
1        50          100     92.840 %          10790.350
1        100         100     92.763 %          10123.380
1        200         100     92.840 %          9349.122
2        50          10      90.4799 %         1803.066
2        100         10      90.4799 %         1363.192
2        200         10      81.011 %          1885.508
2        50          25      92.2504 %         3579.950
2        100         25      92.1991 %         3295.304
2        200         25      90.8648 %         4753.194
2        50          50      92.3274 %         6125.266
2        100         50      92.3531 %         6142.417
2        200         50      92.3531 %         5819.936
2        50          100     92.4301 %         11866.290
2        100         100     92.4044 %         12197.170
2        200         100     92.1735 %         11079.200
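The scoring steps can be sketched as follows; `classify` stands in for the trained CNN model's prediction on a single segmented answer image, and all names are illustrative rather than taken from the paper:

```python
def score_answer_sheet(segments, answer_key, classify):
    """Steps 2-7: classify each segmented answer, compare it with the
    answer key, and return the number of correct answers."""
    score = 0
    for segment, correct in zip(segments, answer_key):
        predicted = classify(segment)   # recognize the handwritten answer
        if predicted == correct:        # Step 5: compare with the true answer
            score += 1
    return score                        # Steps 6-7: the sheet's final score
```

With a stub classifier, e.g. `str.upper` on single-character "segments", the function simply counts matches against the key.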
V. CONCLUSION

Offline handwritten recognition systems based on machine learning algorithms have significant importance in the research field. However, recognition is difficult due to the presence of odd characters and the similarity in shape between multiple characters. This paper proposed a system that was implemented to recognize handwritten characters and then display the student's final score. The system was evaluated on a dataset of 250 answer sheets, and this data was tested using two deep convolutional neural network models. The results attained a high accuracy, with 92.86% testing accuracy. The accuracy of the system was lower than that of the systems mentioned in Section II because the system used its own handwritten data set. In future work, the segmentation algorithm can be improved to attain a higher segmentation accuracy for the images. Moreover, the proposed CNN architecture can also be enhanced to achieve much higher performance and accuracy in displaying the student's score.

REFERENCES
[1] Brown, M. T. (2017). Automated Grading of Handwritten Numerical Answers. In 2018 16th International Conference on Frontiers in Handwriting Recognition (ICFHR) (pp. 279-284). IEEE.
[2] Murray, K. W., & Orii, N. (2012). Automatic essay scoring. IEICE Transactions on Information and Systems, 102(1), 147-155.
[3] Alomran, M., & Chia, D. (2018). Automated Scoring System for Multiple Choice Test with Quick Feedback. International Journal of Information and Education Technology, 8(8).
[4] Cupic, M., Brkic, K., Hrkac, T., Mihajlovic, Z., & Kalafatic, Z. (2014, May). Automatic recognition of handwritten corrections for multiple-choice exam answer sheets. In Information and Communication Technology, Electronics and Microelectronics (MIPRO), 2014 37th International Convention on (pp. 1136-1141). IEEE.
[5] Srihari, S., Collins, J., Srihari, R., Srinivasan, H., Shetty, S., & Brutt-Griffler, J. (2008). Automatic scoring of short handwritten essays in reading comprehension tests. Artificial Intelligence, 172(2-3), 300-324.
[6] Mahana, M., Johns, M., & Apte, A. (2012). Automated essay grading using machine learning. In Document Analysis and Recognition, ICDAR, 10th International Conference (pp. 1206-1210). IEEE.
[7] Saengtongsrikamon, C., Meesad, P., & Sodsee, S. (2009). Scanner-based optical mark recognition. Information Technology Journal, 5(1), 69-73.
[8] Alghazo, J. M., Latif, G., Alzubaidi, L., & Elhassan, A. (2019). Multi-Language Handwritten Digits Recognition based on Novel Structural Features. Journal of Imaging Science and Technology, 63(2), 20502-1.
[9] Alghazo, J. M., Latif, G., Elhassan, A., Alzubaidi, L., Al-Hmouz, A., & Al-Hmouz, R. (2017). An Online Numeral Recognition System Using Improved Structural Features: A Unified Method for Handwritten Arabic and Persian Numerals. Journal of Telecommunication, Electronic and Computer Engineering (JTEC), 9(2-10), 33-40.
[10] Chai, D. (2016, December). Automated marking of printed multiple-choice answer sheets. In 2016 IEEE International Conference on Teaching, Assessment, and Learning for Engineering (TALE) (pp. 145-149). IEEE.
[11] Muangprathub, J., Shichim, O., Jaroensuk, Y., & Kajornkasirat, S. (2018). Automatic Grading of Scanned Multiple-Choice Answer Sheets.
[12] Patole, S., Pawar, A., Patel, A., Panchal, A., & Joshi, R. (2016, March). Automatic system for grading multiple choice questions and feedback analysis. International Journal of Technical Research and Applications, 12(39), 16-19.
[13] Tavana, A. M., Abbasi, M., & Yousefi, A. (2016, September). Optimizing the correction of MCQ test answer sheets using digital image processing. In 2016 Eighth International Conference on Information and Knowledge Technology (IKT) (pp. 139-143). IEEE.
[14] Abbas, A. A. (2009). An automatic system to grade multiple choice questions paper-based exams. Journal of University of Anbar for Pure Science, 3(1), 174-181.
[15] Ciresan, D. C., Meier, U., Gambardella, L. M., & Schmidhuber, J. (2011, September). Convolutional neural network committees for handwritten character classification. In Document Analysis and Recognition (ICDAR), 2011 International Conference on (pp. 1135-1139). IEEE.
[16] Latif, G., Alghazo, J., Alzubaidi, L., Naseer, M. M., & Alghazo, Y. (2018, March). Deep Convolutional Neural Network for Recognition of Unified Multi-Language Handwritten Numerals. In 2018 IEEE 2nd International Workshop on Arabic and Derived Script Analysis and Recognition (ASAR) (pp. 90-95). IEEE.
[17] Singh, N. (2018, February). An Efficient Approach for Handwritten Devanagari Character Recognition based on Artificial Neural Network. In 2018 5th International Conference on Signal Processing and Integrated Networks (SPIN) (pp. 894-897). IEEE.
[18] Kumar, P., Sharma, N., & Rana, A. (2012). Handwritten Character Recognition using Different Kernel based SVM Classifier and MLP Neural Network (A Comparison). International Journal of Computer Applications, 53(11), 413-435.
[19] Rao, Z., Zeng, C., Wu, M., Wang, Z., Zhao, N., Liu, M., & Wan, X. (2018). Research on a handwritten character recognition algorithm based on an extended nonlinear kernel residual network. KSII Transactions on Internet & Information Systems, 12(1), 25-31.
[20] Jeong, S. H., Nam, Y. S., & Kim, H. K. (2003, August). Non-similar candidate removal method for off-line handwritten Korean character recognition. In Document Analysis and Recognition, 2003. Proceedings. Seventh International Conference on (pp. 323-328). IEEE.
[21] Al-Dobais, M. A., Alrasheed, F. A. G., Latif, G., & Alzubaidi, L. (2018, March). Adoptive Thresholding and Geometric Features based Physical Layout Analysis of Scanned Arabic Books. In 2018 IEEE 2nd International Workshop on Arabic and Derived Script Analysis and Recognition (ASAR) (pp. 171-176). IEEE.
Wrapper-based Feature Selection for Imbalanced
Data using Binary Queuing Search Algorithm
Thaer Thaher, IT Dept., At-Tadamun Society, Nablus, Palestine (thaer.thaher@gmail.com)
Majdi Mafarja, CS Dept., Birzeit University, Birzeit, Palestine (mmafarja@birzeit.edu)
Baker Abdalhaq, ICS Dept., An-Najah National University, Nablus, Palestine (baker@najah.edu)
Hamouda Chantar, CS Dept., Sebha University, Sebha, Libya (hamoudak77@gmail.com)
Abstract—The non-uniform distribution of classes (imbalanced data) and the presence of irrelevant and/or redundant information are challenging aspects encountered in most real-world domains. In this paper, we propose an efficient software fault prediction (SFP) model based on a wrapper feature selection (FS) method combined with the Synthetic Minority Oversampling Technique (SMOTE), with the aim of maximizing the prediction accuracy of the learning model. A binary variant of a recent optimization algorithm, the Queuing Search Algorithm (QSA), is introduced as the search strategy in the wrapper FS method. The performance of the proposed model is assessed on 14 real-world benchmarks from the PROMISE repository in terms of three evaluation measures: sensitivity, specificity, and area under the curve (AUC). Experimental results reveal a positive impact of the SMOTE technique in improving prediction performance on highly imbalanced data. Moreover, the binary QSA (BQSA) shows superior efficacy on 64.28% of the datasets compared with other state-of-the-art algorithms in handling the FS problem. The combination of BQSA and SMOTE achieved acceptable AUC results (66.47-87.12%).

Index Terms—Queuing Search Algorithm, Feature Selection, SMOTE, Transfer Function, Software Fault Prediction

I. INTRODUCTION

In data mining, classification techniques are used to categorize data points into predefined labels based on the available features in the dataset. Therefore, the features used for building the classification models have a high influence on the performance of the constructed models. That is to say, if irrelevant or redundant features are present in the dataset, they will mislead the classification model and consequently degrade its performance. Thus, selecting the most informative features becomes a crucial process in order to obtain a high-performance classification model with less computational time. FS is a preprocessing step that aims to build a robust classification model by involving the most informative features and discarding noisy and irrelevant ones. Hence, FS is considered an optimization problem that can be handled in two main stages, namely feature subset generation and feature subset evaluation. A search mechanism is employed to generate feasible subsets of features, while an evaluation technique (learning algorithm) is used to assess the generated subsets and thus guide the search process towards the optimal solution [1].

FS methods are mainly distinguished based on their dependency on the learning algorithm that is used to build the classification model. If the learning algorithm is involved in the selection process, the method is said to follow the wrapper approach; otherwise, the filter approach is followed. The main difference between filters and wrappers is that the filter approach is computationally more efficient than the wrapper approach, but the selected features may not be appropriate for some learning algorithms. In the wrapper approach, however, the selection of features is decided based on the classification accuracy of a machine learning algorithm. This may lead to high computational time, but at the same time higher performance is guaranteed [1].

The FS problem can be defined as the task of finding the subset of features that leads to the best performance of the data mining task. Searching for the optimal subset of features is a hard optimization problem. First, FS is formulated as a multi-objective problem, in which the lowest number of features that satisfies the highest prediction quality is required [1]. Second, datasets with a large number of features (high dimensional) increase the complexity of this problem. Generally speaking, there exist 2^N possible subsets when dealing with a dataset with N features. The search space grows exponentially, and thus complete and random search methods are impractical for such a problem [2]. On the other hand, heuristic search approaches have shown superior performance in tackling various complex problems. These techniques guide the search process towards a high-quality solution within a reasonable time [3].

Human-based algorithms are a class of metaheuristics inspired by human activities. Examples of this category include Teaching Learning Based Optimization (TLBO), Harmony Search (HS), and Passing Vehicle Search (PVS) [4]. Recently, various algorithms have been exploited in wrapper feature selection approaches in order to handle the FS problem. In [5], an HS-based wrapper selection approach was proposed to improve handwritten word recognition. Allam and Nandhini [6] introduced a binary variant of TLBO to be used as a search strategy for FS. Other variants of metaheuristic algorithms have been extensively utilized for the FS problem; examples include the Whale Optimization Algorithm (WOA) [7], the Gravitational Search Algorithm (GSA) [8], and Particle Swarm Optimization [9]. However, QSA has not yet been employed in the area of FS.

Another interesting challenge that may degrade the performance …
… the fluctuation of the service process, which are represented in Eqs. (5) and (6), respectively:

F11 = β × α × (E . |A − Xi|) + (e × A − e × Xi)   (5)

F12 = β × α × (E . |A − Xi|)   (6)

where α is a random number in the interval [-1, 1], E is an Erlang-distributed random vector of size 1 × D, e is an Erlang-distributed random number, | | represents the absolute value, (.) denotes element-by-element multiplication, and β is an adaptive control parameter used to adjust the range of fluctuation, which is computed as in Eq. (7):

β = e^(ln(1/t) × (t/T)^0.5)   (7)

where t is the current iteration and T represents the maximum allowable iterations.

B. Business 2

In the second phase, a portion of customers is selected to utilize the update strategies of this phase. As in Business 1, there are three queues, where the number of customers for each queue is computed by Eq. (2). Initially, the customers are sorted in descending order based on their fitness value (fi); then each customer is given a probability of being handled, as in Eq. (8):

pri = rank(fi)/N   (8)

Hence, the worst agents have a higher chance of being handled than the fittest ones. For each agent, a random number (r) within [0,1] is generated; if the random number is less than pri, then this agent will be updated. In Business 2, the selected agents are updated based on two patterns, as defined in Eq. (9):

Xi_new = Xi + e × (Xr1 − Xr2),   if r < cv
Xi_new = Xi + e × (A − Xr1),     if r ≥ cv     (9)

where e is an Erlang-distributed random number, Xr1 and Xr2 are two randomly selected customers, and A is the leader (staff) for …

IV. THE PROPOSED APPROACH

A. Data pre-processing

In this work, the experiments are applied over 14 real datasets available in the PROMISE software engineering repository. These data are considered to be free of noise and missing values while having imbalanced samples [15]. Therefore, we applied the SMOTE technique with different oversampling ratios to get more balanced datasets.

B. Feature Selection using Binary Queuing Search Algorithm

Metaheuristics are problem-independent, so they can be adapted to handle problems related to various domains [3]. However, two essential design aspects should be considered: the solution representation for the handled problem and the evaluation (fitness) function.

1) Solution representation: FS is recognized as a binary optimization problem. That is, the set of features is encoded as a vector of zeros and ones, in which a specific feature is selected if the value of its corresponding element is set to 1 and ignored if it is set to 0. However, QSA was originally designed to solve problems with a continuous search domain. Therefore, we should employ an efficient binarization method that allows adapting QSA to solve the binary FS problem.

The transfer function (TF) is recognized as a simple, cheap, and efficient operator that has been widely used to map the continuous search space to a binary one [16]. In this strategy, the optimizer works without adjustments, and the obtained solutions are then converted into binary in two steps: 1) the TF is employed to map the real values in R^N into values in the range [0,1], such that each value represents the probability of transforming the corresponding real value into binary; 2) a binarization rule is used to convert the output of the TF into 1 or 0 [17].

[Figure: transfer function curve; axis-tick residue removed.]
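The two-step binarization just described can be sketched as follows. This is a minimal illustration, assuming the common S-shaped (sigmoid) transfer function; the function and variable names are ours, not the paper's:

```python
import math
import random

def s_shaped_tf(x):
    # Step 1: an S-shaped (sigmoid) transfer function maps a real
    # value from the continuous search space into [0, 1].
    return 1.0 / (1.0 + math.exp(-x))

def binarize(position, rng=random.random):
    # Step 2: the binarization rule sets each bit to 1 with the
    # probability produced by the transfer function.
    return [1 if rng() < s_shaped_tf(x) else 0 for x in position]

# A feature is selected when its bit is 1 and ignored when it is 0.
selected_mask = binarize([2.5, -3.0, 0.0])
```

Other S-shaped and V-shaped transfer functions follow the same pattern; only the mapping in step 1 changes.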
where r is a random number restricted to the range [0,1], and Xij(t + 1) is the new binary output.

2) Fitness function: An efficient fitness function is required to guide the search process; thus, each generated subset is given a score that describes its quality. The desired objective of FS is to minimize the number of selected features and maximize the classification performance. These two contradictory criteria were formulated using Eq. (14):

Fitness = α × E + β × (|R| / |N|)   (14)

where E denotes the classification error rate, |R| indicates the number of selected features, and |N| is the number of original features. α and β (the complement of α) are two controlling parameters ∈ [0, 1] which are employed to balance the importance of both criteria.

V. EXPERIMENTAL RESULTS

In this paper, to test the performance of the proposed approaches, a set of well-known benchmark software fault prediction datasets from the PROMISE repository (see Table I) were used. Observing Table I, it can easily be seen that all datasets are imbalanced: the occurrences of positive cases are very low compared to the negative cases.

TABLE I: Description of software fault prediction data sets

Dataset  version  #instances  #defective instances  %defective instances
ant      1.7      745         166                   0.223
camel    1.0      339         13                    0.038
camel    1.2      608         216                   0.355
camel    1.4      872         145                   0.166
camel    1.6      965         188                   0.195
jedit    3.2      272         90                    0.331
jedit    4.0      306         75                    0.245
jedit    4.1      312         79                    0.253
jedit    4.2      367         48                    0.131
jedit    4.3      492         11                    0.022
log4j    1.0      135         34                    0.252
log4j    1.1      109         37                    0.339
log4j    1.2      205         189                   0.922
xalan    2.4      723         110                   0.152

The experiments in this work were conducted in a set of phases. In the first phase, we tried to find the best oversampling percentage (see Table IV); we then compared the performance of the proposed BQSA on the datasets without applying any oversampling technique and after using SMOTE as an oversampling technique, where three measurements (i.e., sensitivity, specificity, and AUC) were used to assess BQSA. In the last phase, two types of experiments were conducted to examine the effectiveness of the proposed method: in the first experiment, the classification outcomes of the KNN classifier were compared with those obtained after applying BQSA for FS, while in the second experiment, BQSA was compared with other wrapper FS methods by implementing four SI algorithms: the Binary Whale Optimization Algorithm (BWOA), Binary Gravitational Search Algorithm (BGSA), Binary Bat Algorithm (BBA), and Binary Ant Lion Optimizer (BALO).

In all experiments, the classification algorithm (KNN) was trained and tested using the n-fold cross-validation method (where n = 10). In this procedure, each dataset was split into 10 parts, such that nine parts were used for training and the remaining part was used for testing purposes. This procedure is repeated k times; thus, each instance of the dataset is given the opportunity to be employed k − 1 times to train the model and one time to validate it. Due to the stochastic behavior of the utilized optimizers, each conducted experiment was repeated 10 times. Hence, an individual algorithm was evaluated 10 × k times for each dataset. By using this mechanism, we can be more confident in the results of the proposed model.

The implementation of the proposed approach was done using MATLAB R2017a, and a wrapper FS model with the KNN classifier (with k = 5 [2]) as the evaluation method was adopted to generate the best feature subset. We used KNN for its simplicity and low computational time compared to other classifiers. It is also a non-parametric learning algorithm that has shown superior results in several previous FS experiments [19], [7]. All experiments were run on an Intel machine with a Core i5 2.2 GHz processor and 4 GB RAM. To be consistent and fair, all optimizers in this work were run using the same common parameters (100 iterations and 30 search agents); these values were selected after conducting extensive experiments. The other specific parameters were selected based on recommended settings in the original papers and related works on FS. The list of parameter values is presented in Table II. Please note that the best obtained results in the reported tables are highlighted in boldface.

TABLE II: The used parameter settings

Fitness function    α = 0.99, β = 0.01
Common parameters   No. iterations = 100, population size = 30, dimension = #features, No. runs = 10
Classification      KNN classifier (K = 5), 10-fold cross validation
GSA                 G0 = 100, α = 20
BBA                 Qmin = 0, Qmax = 2, A (loudness) = 0.5, r (pulse rate) = 0.5
GWO                 a from 2 to 0

A. Evaluation Measurements

The performance of classifiers on a set of test data can be described using a specific table called a confusion matrix (or error matrix). Table III demonstrates a confusion matrix for a binary classifier, in which the instances of a given test set are classified as either positive or negative. Various basic measures, such as accuracy, error rate, sensitivity, and specificity, are calculated from the four outcomes (TP, TN, FP, and FN) of the confusion matrix. Other evaluation measures, such as the Area Under the ROC Curve (AUC), can be derived from the basic measures.

TABLE III: Confusion matrix for binary classification.

                 Predicted positive    Predicted negative
Actual positive  True Positive (TP)    False Negative (FN)
Actual negative  False Positive (FP)   True Negative (TN)

• Sensitivity (true positive rate): The percentage of the positive cases that were predicted as positive.

Sensitivity = TP / (TP + FN)   (15)

• Specificity (true negative rate): The percentage of the negative cases that were predicted as negative.
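The weighted fitness of Eq. (14) is a one-liner in code. The sketch below uses our own names; in the paper's setup the error rate E would come from evaluating the KNN classifier on the candidate subset:

```python
def fitness(error_rate, n_selected, n_total, alpha=0.99):
    # Eq. (14): Fitness = alpha * E + beta * |R|/|N|, with beta = 1 - alpha.
    # Lower is better: it trades classification error against subset size.
    beta = 1.0 - alpha
    return alpha * error_rate + beta * (n_selected / n_total)

# With alpha = 0.99 (Table II), the error term dominates: a subset with
# 10% error using 5 of 20 features scores 0.99*0.10 + 0.01*0.25 = 0.1015.
value = fitness(0.10, 5, 20)
```

The alpha = 0.99 default mirrors the setting reported in Table II; with it, reducing the feature count only breaks ties between subsets of nearly equal error.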
Specificity = TN / (TN + FP)   (16)

• AUC: An efficient evaluation measure based on the trade-off between sensitivity and specificity. It is calculated as follows:

AUC = (Sensitivity + Specificity) / 2   (17)

For imbalanced data, we are interested in having high sensitivity and specificity on the minority and majority classes respectively, so AUC is a more appropriate measure of prediction quality over imbalanced data than the accuracy metric [12].

B. The impact of the oversampling ratio

In order to find the best oversampling percentage, an extensive experiment was conducted with different values of the oversampling percentage (100%, 200%, 300%, and 400%). The value that obtained the best results was used in all subsequent experiments. The average AUC results of this experiment are reported in Table IV. Observing the results, we notice a significant improvement in the AUC values for all re-sampled datasets compared to the original ones. However, one can clearly see that the impact of using a low percentage (i.e., 100% and 200%) is smaller than that of a high percentage (i.e., 300% and 400%). This is due to the highly imbalanced data, in which a sufficient number of minority instances need to be oversampled. Moreover, a further increase in the oversampling ratio (i.e., greater than 300%) seems to have a negative impact on the prediction quality. As per the F-test, increasing the minority class by 300% obtains the best rank, followed by 400%, 200%, 100%, and 0% (the original). Thus, the 300% oversampling percentage was adopted in all experiments.

TABLE IV: Average AUC values of BQSA with SMOTE using different oversampling ratios

Dataset        original  100%   200%   300%   400%
ant-1.7        0.699     0.804  0.815  0.835  0.819
camel-1.0      0.498     0.666  0.727  0.763  0.814
camel-1.2      0.590     0.670  0.687  0.665  0.663
camel-1.4      0.592     0.719  0.777  0.786  0.798
camel-1.6      0.569     0.702  0.757  0.770  0.764
jedit-3.2      0.740     0.795  0.801  0.809  0.799
jedit-4.0      0.703     0.765  0.775  0.795  0.785
jedit-4.1      0.723     0.791  0.822  0.828  0.818
jedit-4.2      0.667     0.813  0.826  0.851  0.857
jedit-4.3      0.508     0.563  0.583  0.787  0.788
log4j-1.0      0.788     0.827  0.873  0.859  0.862
log4j-1.1      0.804     0.819  0.845  0.849  0.835
log4j-1.2      0.683     0.761  0.867  0.871  0.880
xalan-2.4      0.614     0.751  0.799  0.828  0.822
Rank (F-test)  5.00      3.86   2.50   1.64   2.00

C. Evaluation results of BQSA with and without SMOTE

In this section, a deep comparison is conducted between the results based on the original datasets (without oversampling) and those based on the datasets oversampled at 300%. Table V shows the results of the three measurements (i.e., sensitivity, specificity, and AUC) for BQSA, where each measurement was obtained on both the original datasets and the modified ones. Regarding the sensitivity values (which represent the ability to detect the class of interest, i.e., the defective cases), we can see that using the SMOTE technique significantly improves the performance of BQSA for all datasets. On the other hand, there is a remarkable degradation in the specificity values (which represent the ability to detect the normal cases). To balance these two contradictory behaviors, we rely on the AUC values. Inspecting these values, we notice the superior performance of BQSA when using SMOTE for all datasets. This result is expected since the datasets are highly imbalanced, and the results on the original datasets suffer from bias towards the majority class.

TABLE V: The performance of BQSA before and after utilizing SMOTE in terms of sensitivity, specificity, and AUC measures

            Sensitivity         Specificity         AUC
Dataset     original  SMOTE     original  SMOTE     original  SMOTE
ant-1.7     0.5018    0.9218    0.8965    0.7476    0.6991    0.8347
camel-1.0   0.0000    0.5769    0.9960    0.9482    0.4980    0.7625
camel-1.2   0.3745    0.8877    0.8061    0.4416    0.5903    0.6647
camel-1.4   0.2324    0.8548    0.9512    0.7180    0.5918    0.7864
camel-1.6   0.2043    0.8563    0.9332    0.6842    0.5687    0.7702
jedit-3.2   0.6533    0.9228    0.8264    0.6962    0.7399    0.8095
jedit-4.0   0.5107    0.8793    0.8961    0.7100    0.7034    0.7946
jedit-4.1   0.5494    0.9016    0.8957    0.7549    0.7225    0.8283
jedit-4.2   0.3896    0.8786    0.9436    0.8232    0.6666    0.8509
jedit-4.3   0.0182    0.6205    0.9977    0.9530    0.5079    0.7867
log4j-1.0   0.6265    0.9125    0.9495    0.8059    0.7880    0.8592
log4j-1.1   0.6865    0.9257    0.9222    0.7722    0.8044    0.8489
log4j-1.2   0.9852    0.8862    0.3813    0.8563    0.6832    0.8712
xalan-2.4   0.2718    0.8675    0.9553    0.7879    0.6136    0.8277

D. Comparison of BQSA versus other optimizers

In this section, we present a comparison between the BQSA approach and other similar approaches in the literature. To make a fair comparison, all approaches were implemented, and all runs were conducted in the same environment with the same parameter settings. Table VI shows the average AUC results for all FS approaches (i.e., BQSA, BWOA, BGSA, BBA, and BALO), in addition to the results of KNN with no FS on the full over-sampled datasets. It can be seen that BQSA achieved the best results among all approaches on 65% of the datasets and comes in first place according to the F-test, while BWOA comes in second place by achieving the best results on 28% of the datasets. Comparing the results of the FS approaches with those of KNN without FS, it can be concluded that FS as a preprocessing step improves the performance of the learning algorithm by selecting the most informative features. According to the F-test, the BBA algorithm and KNN* without FS come in last place among all approaches.

Figure 2 shows the convergence curves of the FS approaches on some datasets. It is clear that BQSA recorded the best performance among the approaches, with the fastest convergence rate on the presented datasets as well as the lowest fitness values. Only BWOA among the presented approaches competes with BQSA on the xalan-2.4 dataset. However, BBA and BGSA suffer from premature convergence in all cases.
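The measures of Eqs. (15)-(17) follow directly from the confusion-matrix counts of Table III. A minimal sketch:

```python
def sensitivity(tp, fn):
    # Eq. (15): true positive rate, the share of actual positives
    # (defective modules) that were predicted as positive.
    return tp / (tp + fn)

def specificity(tn, fp):
    # Eq. (16): true negative rate, the share of actual negatives
    # (non-defective modules) that were predicted as negative.
    return tn / (tn + fp)

def auc(tp, fn, tn, fp):
    # Eq. (17): the balanced mean of sensitivity and specificity,
    # which is robust to class imbalance, unlike plain accuracy.
    return (sensitivity(tp, fn) + specificity(tn, fp)) / 2.0
```

Note how a trivial majority-class predictor on imbalanced data gets high specificity but zero sensitivity, so its AUC collapses to 0.5; this is exactly the behavior visible in the "original" columns of Table V (e.g., camel-1.0).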
TABLE VI: AUC results of BQSA versus other approaches with SMOTE [KNN*: the classification model without FS]

Dataset        KNN*   BQSA   BWOA   BGSA   BBA     BALO
ant-1.7        0.835  0.835  0.836  0.826  0.8182  0.8340
camel-1.0      0.684  0.763  0.761  0.740  0.6977  0.7589
camel-1.2      0.658  0.665  0.688  0.660  0.6364  0.6669
camel-1.4      0.789  0.786  0.786  0.780  0.7720  0.7837
camel-1.6      0.739  0.770  0.765  0.747  0.7355  0.7642
jedit-3.2      0.751  0.809  0.798  0.794  0.7673  0.8027
jedit-4.0      0.760  0.795  0.793  0.781  0.7696  0.7884
jedit-4.1      0.793  0.828  0.826  0.806  0.7682  0.7988
jedit-4.2      0.786  0.851  0.852  0.835  0.8088  0.8433
jedit-4.3      0.673  0.787  0.782  0.748  0.6903  0.7806
log4j-1.0      0.752  0.859  0.854  0.823  0.8100  0.8492
log4j-1.1      0.727  0.849  0.856  0.825  0.8050  0.8524
log4j-1.2      0.681  0.871  0.867  0.825  0.7681  0.8404
xalan-2.4      0.821  0.828  0.824  0.824  0.8090  0.8186
Rank (F-test)  5.04   1.57   1.86   3.96   5.43    3.14

[Fig. 2: Convergence curves (fitness value versus iteration number) of BQSA, BWOA, BGSA, BBA, and BALO on selected datasets, including (a) camel-1.0 and (b) jedit-4.1; plot residue removed.]

VI. CONCLUSION

… conducted on multiple levels, where the impact of resampling the datasets was studied in the first stage, based on some preliminary results, to select the best resampling ratio. Then, a comparison between BQSA and other similar approaches was conducted. The results have confirmed that the BQSA-based wrapper FS technique combined with SMOTE can be utilized as a promising approach for predicting faults in real-world software projects. Our future directions will be related to investigating new variants of BQSA by utilizing different S-shaped and V-shaped transfer functions.

REFERENCES

[1] V. Kumar and S. Minz, "Feature selection: A literature review," Smart Computing Review, vol. 4, pp. 211-229, 2014.
[2] M. Mafarja and S. Mirjalili, "Whale optimization approaches for wrapper feature selection," Applied Soft Computing, vol. 62, pp. 441-453, 2018.
[3] E.-G. Talbi, Metaheuristics: From Design to Implementation. John Wiley & Sons, 2009, vol. 74.
[4] J. Zhang, M. Xiao, L. Gao, and Q.-K. Pan, "Queuing search algorithm: A novel metaheuristic algorithm for solving engineering optimization problems," Applied Mathematical Modelling, vol. 63, 2018.
[5] S. Das, P. Singh, S. Bhowmik, R. Sarkar, and M. Nasipuri, "A harmony search based wrapper feature selection method for holistic Bangla word recognition," Procedia Computer Science, vol. 89, pp. 395-403, 2016.
[6] M. Allam and M. Nandhini, "Optimal feature selection using binary teaching learning based optimization algorithm," Journal of King Saud University - Computer and Information Sciences, 2018.
[7] M. Mafarja, I. Jaber, S. Ahmed, and T. Thaher, "Whale optimisation
Self-Organizing Maps for Agile Requirements
Prioritization
Amjad Hudaib Fatima Alhaj
King Abdullah II School for Information Technology King Abdullah II School for Information Technology
The University of Jordan The University of Jordan
Amman, Jordan Amman, Jordan
ahudaib@ju.edu.jo fat9170261@fgs.ju.edu.jo
Abstract—In building software systems, decisions at the spec- Properly prioritized requirements enables planing the
ification phase will extremely affect the rest of the system life development tasks reasonably to meet the stockholders
cycle. Well-defined requirements at this phase will increase the expectations. Even in commercial off-the-shelf (COTS)
chance of achieving the ultimate goal of delivering a software that
meets stakeholders needs. Given a limited sources of time and case, planning a software release is a vital factor for the
predefined budget, not all the requirements should be fulfilled product success, this planning process should be influenced
with the same priority. Here comes the need for requirement by prioritizing requirements [7]. RP transfer the project into
prioritization RP techniques. This paper presents a new approach a sequential execution order or releases [8]. Where software
to deal with the dynamic nature of requirements prioritization quality depends on this defined order [9]. This order also
process in agile development. Training a self-organizing map
according to requirement’s predefined features is the main helps to avoid conflicting requirements [10] and remove
process in the proposed approach. The trained map can produce the unnecessary requirements [11]. However, implementing
a set of clusters. A farther rank is given to each requirement requirements following this order will lead to an incremental,
according to map resulting weights. The proposed approach was cumulative and systematic delivery to the client. Of course
implemented using different variables related to requirements this can help to modify the project schedule and discover any
themselves and related to the self- organizing map to show its
ability to prioritize requirements in agile development model. hidden misunderstandings between the organization and the
stockholders before moving forward in upcoming sages of
Index Terms—Requirement Prioritization, Agile, ASD, SOM, the software product SDLC [12].
Self-Organizing Map, Clustering.
I. INTRODUCTION

Dealing with a large number of software requirements while having limited resources (e.g., tight deadlines, budget) can be very confusing for software project managers. Since it is impractical to implement all the requirements at the same time [1], RP helps determine which requirements should be implemented in the early release and which should be set aside for later implementation. The ordering of requirements in the project time-line should be decided precisely, since delaying essential requirements may affect the overall success of a software product [2].

Since RP is one of the vital processes of requirements engineering (RE), which plays a vital role in the success or failure of a software product [3], [4], there is a serious need for efficient RP techniques. As [5] mentions, RP has ranked in empirical RE studies over the years as the most studied RE sub-area after requirements negotiation. A considerable amount of research has been done on different RP techniques, yet there is still a need to fully analyze these techniques to enhance the maturity of this research area [6].

Prioritization of requirements is considered the most challenging task for RE teams [13]. Prioritizing requirements under predefined resources is a complex process, and as the number of stakeholders increases it becomes even more complicated; each stakeholder has a different opinion about the requirements, and a single requirement can be a major priority to one stakeholder and a minor priority to another [14]. A good RP technique keeps track of all the requirement weights that stakeholders assign [15]. Stakeholder involvement should be defined and justified in every RP technique.

Agile development can deal with changing requirements [16]. Sprints, the incremental releases of an agile process, should be planned according to the changing prioritized requirements [17]. RP in agile differs from RP in traditional RE (waterfall, non-agile development) in two main points [18]: first, prioritizing and re-prioritizing happen across the agile iterations. The RP process is applied before each iteration, which leaves enough time before a decision is made and allows new information and results to be taken into consideration. Second, the prioritization process mainly considers the business value/relative benefit [19]: the highest-priority requirements are implemented early to obtain the highest business value and the lowest
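The sprint-by-sprint re-prioritization described above can be sketched as a short loop. The requirement records, field names (`business_value`, `effort`), and the greedy capacity rule below are illustrative assumptions, not taken from the paper:

```python
# Sketch of agile re-prioritization: before each sprint, the remaining
# backlog is re-ranked by business value and the sprint is filled greedily.
# All field names and numbers are illustrative.

def prioritize(backlog):
    """Order the remaining requirements by descending business value."""
    return sorted(backlog, key=lambda r: r["business_value"], reverse=True)

def plan_sprints(backlog, capacity):
    """Fill successive sprints, re-prioritizing before each one."""
    sprints = []
    remaining = list(backlog)
    while remaining:
        remaining = prioritize(remaining)   # RP runs again before each sprint
        sprint, load = [], 0
        for req in list(remaining):
            if load + req["effort"] <= capacity:
                sprint.append(req["name"])
                load += req["effort"]
                remaining.remove(req)
        if not sprint:                      # nothing fits: stop planning
            break
        sprints.append(sprint)
    return sprints

backlog = [
    {"name": "R1", "business_value": 8, "effort": 3},
    {"name": "R2", "business_value": 5, "effort": 2},
    {"name": "R3", "business_value": 9, "effort": 4},
]
print(plan_sprints(backlog, capacity=5))
```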
TABLE I. RELATED FEATURE DESCRIPTION FOR EQUATION 1.

Notation         | Description
ERP(ri)          | The estimated requirement priority calculated for requirement ri
RP(ri)           | The requirement priority given in the requirement specification phase
IM(ri)           | Potential impact
BP(ri)           | Business profit
R(ri)            | Risk
T(ri)            | Time to accomplish
DR               | Number of dependent requirements
n                | Number of requirements
ω1, ω2, ω3, ω4   | Weights related to each variable

Training a SOM using the whole set of data vectors positions each data vector onto the map, so input data are mapped to the most similar node on the SOM. All features of the input data are used to determine similarity. In addition, each node has a weight vector of the same size as the input data features.

Depending on the map size, each map node is linked to some region of the input data; each node is associated with a weight vector. An input data vector can be linked to the map by first finding the node weight vector that is closest to the data vector, and then mapping the data vector to the corresponding map node, its best matching unit (BMU).
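The BMU mapping described above can be sketched without any SOM library; the 2x2 grid of node weight vectors below is toy data standing in for weights learned during SOM training:

```python
# Minimal sketch of mapping a data vector to its best matching unit (BMU)
# on a SOM grid: the node whose weight vector is closest wins.
# The grid values are toy numbers, not trained weights.

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def best_matching_unit(grid, vector):
    """Return the (row, col) of the node whose weight vector is closest."""
    return min(
        ((r, c) for r in range(len(grid)) for c in range(len(grid[0]))),
        key=lambda rc: euclidean(grid[rc[0]][rc[1]], vector),
    )

# Each node holds a weight vector of the same size as the input features.
grid = [
    [(0.1, 0.2), (0.9, 0.8)],
    [(0.2, 0.9), (0.8, 0.1)],
]
print(best_matching_unit(grid, (0.85, 0.75)))   # nearest node's coordinates
```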
TABLE II. RESULTING CLUSTERS OF TRAINING SOM ON DIFFERENT INPUT DATA.

TABLE III. SOM CLUSTERS REGISTERED FOR A DATA SET OF 300 REQUIREMENTS.

Clusters are given an associated rank according to the trained SOM weights. Fig. 1 shows 14 ranked clusters for a requirements set of size 500 using the SOM RP model.

Fig. 2. A heat-map that shows SOM nodes and the weight differences among neighbouring nodes.
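Within each ranked cluster, requirements are ordered by the estimated priority of equation 1. Since the equation is described here only through the notation of Table I, the weighted combination below is a hedged sketch of how such a score could be computed; the exact functional form, the sample feature values, and the weights are assumptions:

```python
# Illustrative sketch of an estimated requirement priority as a weighted
# combination of the Table I factors. This is NOT the paper's Equation 1;
# the form (impact and profit raise the score, risk and time lower it,
# dependency count boosts it) is an assumption based only on the notation.

def estimated_priority(rp, im, bp, r, t, dr, n, w):
    """Combine the Table I factors for one requirement.

    rp: priority from the specification phase
    im, bp, r, t: potential impact, business profit, risk, time to accomplish
    dr: number of dependent requirements; n: total number of requirements
    w: the four weights (w1..w4)
    """
    w1, w2, w3, w4 = w
    score = rp + w1 * im + w2 * bp - w3 * r - w4 * t
    # Requirements that many others depend on get a proportional boost.
    return score * (1 + dr / n)

p = estimated_priority(rp=3, im=0.8, bp=0.9, r=0.2, t=0.5,
                       dr=4, n=50, w=(0.3, 0.3, 0.2, 0.2))
print(round(p, 3))
```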
V. CONCLUSIONS

This paper presents a new approach to deal with the dynamic nature of the requirements prioritization process in agile development. The method depends on a self-organizing map that defines a set of ranked clusters that can be related to successive sprints. Using the proposed work, project managers can gain good insight into the entire project development plan and can adapt the development process to newly added requirements. The trained SOM can support decision making in terms of requirement prioritization, and each resulting cluster can be internally ranked using the trained SOM weights.
Fig. 1. Example of a figure caption.

The map places input data with similar characteristics close to each other. Such a map can be exploited to illustrate the similarity of nodes (neurons): areas where the corresponding color is at the minimum of the heat-map color coding have a low distance between each other and form one cluster. The clusters are separated from each other by boundaries of nodes with high distances between them, as shown in Fig. 2. The resulting clusters can be developed sequentially, one requirements cluster per sprint. Also, for the data points that belong to a single cluster, a further sequential order is obtained by applying equation 1 from the previous section. Using equation 1, each requirement is assigned an estimated priority. Next, the estimated requirement priorities are ranked, resulting in an ordered list of requirements. For example, consider a cluster with four requirements of a data set (n = 50), with feature values as shown in Table IV and weights given by the trained SOM. The estimated RP for each requirement is calculated and a rank is associated with it, as the last column suggests.

REFERENCES

[1] A. Hudaib, R. Masadeh, M. H. Qasem, and A. Alzaqebah, "Requirements prioritization techniques comparison," Modern Applied Science, vol. 12, no. 2, p. 62, 2018.
[2] L. Alawneh, "Requirements prioritization using hierarchical dependencies," in Information Technology-New Generations. Springer, 2018, pp. 459-464.
[3] A. R. Asghar, A. Tabassum, S. N. Bhatti, and S. A. A. Shah, "The impact of analytical assessment of requirements prioritization models: an empirical study," 2017.
[4] Y. V. Singh, B. Kumar, S. Chand, and D. Sharma, "A hybrid approach for requirements prioritization using logarithmic fuzzy trapezoidal approach (LFTA) and artificial neural network (ANN)," in International Conference on Futuristic Trends in Network and Communication Technologies. Springer, 2018, pp. 350-364.
[5] T. Ambreen, N. Ikram, M. Usman, and M. Niazi, "Empirical research in requirements engineering: trends and opportunities," Requirements Engineering, vol. 23, no. 1, pp. 63-95, 2018.
[6] M. Dabbagh, S. P. Lee, and R. M. Parizi, "Functional and non-functional requirements prioritization: empirical evaluation of IPA, AHP-based, and HAM-based approaches," Soft Computing, vol. 20, no. 11, pp. 4497-4520, 2016.
[7] J. R. F. Dos Santos, A. B. Albuquerque, and P. R. Pinheiro, "Requirements prioritization in market-driven software: A survey based on large numbers of stakeholders and requirements," in Quality of Information and Communications Technology (QUATIC), 2016 10th International Conference on the. IEEE, 2016, pp. 67-72.
TABLE IV. EXAMPLE ON RANKING REQUIREMENTS WITHIN A CLUSTER USING SOM WEIGHTS.
[8] A. Alzaqebah, R. Masadeh, and A. Hudaib, "Whale optimization algorithm for requirements prioritization," in Information and Communication Systems (ICICS), 2018 9th International Conference on. IEEE, 2018, pp. 84-89.
[9] M. Yousuf, M. U. Bokhari, and M. Zeyauddin, "An analysis of software requirements prioritization techniques: A detailed survey," in Computing for Sustainable Global Development (INDIACom), 2016 3rd International Conference on. IEEE, 2016, pp. 3966-3970.
[10] R. V. Anand and M. Dinakaran, "WhaleRank: an optimisation based ranking approach for software requirements prioritisation," International Journal of Environment and Waste Management, vol. 21, no. 1, pp. 1-21, 2018.
[11] H. Ahuja, G. Purohit et al., "Understanding requirement prioritization techniques," in Computing, Communication and Automation (ICCCA), 2016 International Conference on. IEEE, 2016, pp. 257-262.
[12] R. Qaddoura, A. Abu-Srhan, M. H. Qasem, and A. Hudaib, "Requirements prioritization techniques review and analysis," in 2017 International Conference on New Trends in Computing Sciences (ICTCS). IEEE, 2017, pp. 258-263.
[13] H. F. Hofmann and F. Lehner, "Requirements engineering as a success factor in software projects," IEEE Software, no. 4, pp. 58-66, 2001.
[14] J. A. Khan, I. U. Rehman, Y. H. Khan, I. J. Khan, and S. Rashid, "Comparison of requirement prioritization techniques to find best prioritization technique," International Journal of Modern Education and Computer Science, vol. 7, no. 11, p. 53, 2015.
[15] M. A. Awais, "Requirements prioritization: challenges and techniques for quality software development," Advances in Computer Science: an International Journal, vol. 5, no. 2, pp. 14-21, 2016.
[16] R. V. Anand and M. Dinakaran, "Popular agile methods in software development: Review and analysis," International Journal of Applied Engineering Research, vol. 11, no. 5, pp. 3433-3437, 2016.
[17] M. Brhel, H. Meth, A. Maedche, and K. Werder, "Exploring principles of user-centered agile software development: A literature review," Information and Software Technology, vol. 61, pp. 163-181, 2015.
[18] Z. Racheva, M. Daneva, K. Sikkel, A. Herrmann, and R. Wieringa, "Do we know enough about requirements prioritization in agile projects: Insights from a case study," in 2010 18th IEEE International Requirements Engineering Conference. IEEE, 2010, pp. 147-156.
[19] K. Wiegers, "First things first: prioritizing requirements," Software Development, vol. 7, no. 9, pp. 48-53, 1999.
[20] R. Popli, N. Chauhan, and H. Sharma, "Prioritising user stories in agile environment," in 2014 International Conference on Issues and Challenges in Intelligent Computing Techniques (ICICT). IEEE, 2014, pp. 515-519.
[21] R. V. Anand and M. Dinakaran, "Handling stakeholder conflict by agile requirement prioritization using apriori technique," Computers & Electrical Engineering, vol. 61, pp. 126-136, 2017.
[22] P. Avesani, S. Ferrari, and A. Susi, "Case-based ranking for decision support systems," in International Conference on Case-Based Reasoning. Springer, 2003, pp. 35-49.
[23] M. S. Rahim, A. Z. M. E. Chowdhury, and S. Das, "RiZe: A proposed requirements prioritization technique for agile development," in 2017 IEEE Region 10 Humanitarian Technology Conference (R10-HTC), Dec 2017, pp. 634-637.
[24] V. R. Basil and A. J. Turner, "Iterative enhancement: A practical technique for software development," IEEE Transactions on Software Engineering, no. 4, pp. 390-396, 1975.
[25] P. Tonella, A. Susi, and F. Palma, "Interactive requirements prioritization using a genetic algorithm," Information and Software Technology, vol. 55, no. 1, pp. 173-187, 2013.
[26] M. Azzolini and L. I. Passoni, "Prioritization of software requirements: a cognitive approach," in Fourth International Workshop on Knowledge Discovery, Knowledge Management and Decision Support. Atlantis Press, 2013.
[27] S. Worner, M. Gevrey, R. Eschen, M. Kenis, D. Paini, S. Singh, M. Watts, and K. Suiter, "Prioritizing the risk of plant pests by clustering methods; self-organising maps, k-means and hierarchical clustering," NeoBiota, vol. 18, p. 83, 2013.
[28] J. Parvizian, H. Tarkesh, S. Farid, and A. Atighehchian, "Project management using self-organizing maps," Industrial Engineering and Management Systems, vol. 5, no. 1, 2006.
[29] V. Chaudhary, R. Bhatia, and A. K. Ahlawat, "A novel self-organizing map (SOM) learning algorithm with nearest and farthest neurons," Alexandria Engineering Journal, vol. 53, no. 4, pp. 827-831, 2014.
[30] F. Comitani, "SimpSOM (simple self-organizing maps)," 2019. [Online]. Available: https://pypi.org/project/SimpSOM/
A Parallel Face Detection Method using Genetic &
CRO Algorithms on Multi-core Platform
Mohammad Khanafsa, Computer Science Department, University of Jordan, Amman, Jordan (mkhanafsa@gmail.com)
Ola Surakhi, Computer Science Department, University of Jordan, Amman, Jordan (ola.surakhi@gmail.com)
Sami Sarhan, Computer Science Department, University of Jordan, Amman, Jordan (samiserh@ju.edu.jo)
Abstract— Face recognition is a well-known biometric method used in many applications for authentication and identification. The original face recognition scheme takes a face image, extracts its features, and stores them as a vector in the database. The saved vector is then compared, feature by feature, with the vector of an input image in order to recognize it. Many methods have been proposed to achieve this and to increase identification accuracy. This paper proposes a new method using two meta-heuristic algorithms, the Genetic and Chemical Reaction Optimization algorithms, both implemented in parallel on a multicore platform. The aim is to increase the accuracy of image matching with a lower error rate and to increase the performance of the system in terms of speedup.

Keywords— Chemical Reaction Optimization algorithm; Face Recognition; Genetic Algorithm; Multi-threaded

Introduction

Face recognition is a biometric technique used in many applications and systems. Because of its importance, many fields such as security, image processing, and psychology pay great attention to it [1-7]. The face recognition process consists of three main steps: face detection, feature extraction, and face recognition, as shown in Figure 1 [8].

Figure 1: Face recognition process steps

Face detection locates the face in an image by determining its position. The features are then extracted from the face and saved in a vector, which is used as a signature that discriminates one individual from another. Last, face recognition is done by comparing the extracted features of the input image with the ones stored in the database; based on the matching rate, the recognition is accepted or rejected.

Each face image consists of more than 80 feature points, one of which is selected as a pivot. The features are extracted by evaluating the distance between the pivot and each of the 80 points in the image and saving them to a vector in the database, which is then used for comparison in order to recognize the individual later. The enrollment phase of the face recognition system consists of processing the image, extracting the features, and saving them in the database.

After the enrollment phase, all users' images are saved in the database to be used for identification. When a user is to be identified, the entered image is compared with the ones stored in the database; this is the matching phase of the face recognition system. The features are extracted from the input image by finding the distance between the pivot and each point in the image, and the extracted features are compared with the ones stored in the database. If the number of matched features is greater than a threshold value, the user is identified. Otherwise the user is not matched and not recognized.

Two factors play an important role in the matching phase: the pivot point and the weight of each area in the face image. Selecting the pivot point correctly leads to extracting a set of features that increases the matching accuracy. Dividing the face image into a set of areas and assigning a weight to each area, such that the weight values sum to one, will also enhance accuracy. Some areas in the face image are clear and can be assigned a high weight; other areas may contain special objects that harm the accuracy and thus should be assigned a low weight.

In [15], two meta-heuristic algorithms are used to achieve this: the Genetic Algorithm (GA) and the Chemical Reaction Optimization (CRO) algorithm. A meta-heuristic algorithm can be used to explore the search space of a problem in order to generate better solutions. As the algorithm runs for a number of iterations, better and better solutions are generated after each iteration until the best solution is reached.

GA and CRO are used in this paper to search for the best point in the image to be selected as the pivot point, to assign a weight value to each area of the image, and to identify the set of features that are not important and may reduce the matching rate. The excluded features are stored in a vector. The selection process is repeated after each iteration to get better results that increase the matching rate and enhance the accuracy of the recognition system. The algorithms are implemented in parallel on a multicore platform to speed up their training and increase efficiency.

The rest of the paper is organized as follows: Section 2 gives an overview of GA and CRO. Section 3 introduces the proposed work. Section 4 shows the experimental results. A detailed discussion of the experimental results is given in Section 5, and Section 6 concludes.

I. BACKGROUND

A. Genetic Algorithm

The Genetic Algorithm [9,10,11] is a meta-heuristic algorithm used to solve large search problems. GA depends on an initial population that consists of a set of individuals. A chromosome can be represented as a vector where each entry is a gene; the chromosome represents the set of variables that need to be updated while the algorithm runs to reach the best
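The pivot-based matching described above can be sketched as follows. The landmark coordinates, area assignments, weights, tolerance, and threshold below are illustrative values, and the per-area scoring rule is an assumption; the paper specifies only that area weights sum to one and that matching is thresholded:

```python
# Sketch of pivot-based matching: features are distances from a pivot to
# each landmark point; each feature belongs to a face area whose weight
# (weights sum to one) scales how much agreement in that area counts.
# All numbers and the scoring rule are illustrative assumptions.

def extract_features(pivot, landmarks):
    """Distance from the pivot to every landmark point."""
    px, py = pivot
    return [((x - px) ** 2 + (y - py) ** 2) ** 0.5 for x, y in landmarks]

def match(stored, probe, area_of, area_weight, tol=0.5, threshold=0.8):
    """Weighted share of features agreeing within tol; True if identified."""
    hits = {a: 0 for a in area_weight}
    total = {a: 0 for a in area_weight}
    for s, p, a in zip(stored, probe, area_of):
        total[a] += 1
        if abs(s - p) <= tol:
            hits[a] += 1
    score = sum(w * hits[a] / total[a]
                for a, w in area_weight.items() if total[a])
    return score >= threshold

landmarks = [(1, 1), (2, 3), (4, 1), (5, 4)]
stored = extract_features((3, 2), landmarks)
probe = [d + 0.1 for d in stored]            # nearly identical face
area_of = ["eyes", "eyes", "mouth", "mouth"]
area_weight = {"eyes": 0.6, "mouth": 0.4}    # weights sum to one
print(match(stored, probe, area_of, area_weight))
```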
The meta-heuristic algorithm always searches for the best solution by repeating the search for a predefined number of iterations, with better and better solutions generated at each iteration. In the proposed work, each iteration generates a new pivot point, new area weights, and a new excluded-features array. The new solutions are compared with the old ones, and the better of them is kept for the next iteration.

The proposed work in this paper uses the Genetic and CRO algorithms, implemented in parallel on a multi-core platform. The output of each algorithm is generated and compared. The steps of each algorithm are shown in detail in the next section.

A. Data set

The data set used in this paper is taken from the XM2VTSDB multi-modal face database project [14], which covers 371 individuals, each with more than one session. The overall number of images is 2360, with 67 features for each one. The features of the collected images differ, as they contain images of females and males of different ages and colors. These images were taken over a period of four months.

B. Parallel-Genetic Face Recognition

The GA consists of different phases; the most important is determining the fitness function, which plays the role of improving the value of the desired parameters to get a better solution. In the proposed algorithm, the fitness function generates the matching-rate value of the image. Based on it, the new solution will have a new pivot point, used to exclude features that have less effect on the accuracy, and new weights for the face areas such that their total is 100%.

The mapping between the GA phases and the proposed method is shown in Table I.

TABLE I. MAPPING BETWEEN GA STEPS AND PROPOSED WORK

Genetic Phase    | Mapped to the proposed idea
Individual       | Pivot point, a weight for each face area, the set of excluded feature points, and the distances between the selected pivot and all feature points
Population       | Set of individuals containing the initial pivot point, the initial weight value of each face area for the first round, and the initial set of excluded features from the first random round
Search Space     | The different solutions found through the different iterations
Fitness Function | Match values for the testing data sets based on the training data for all faces; the best solution has the highest fitness value, meaning the highest match rate for the different images compared with all feature information saved in the database
Crossover        | Generate different values for the pivot, face area weights, and excluded array based on the best solution combined with other solutions
Mutation         | Random variation of a generated solution based on a specific value

The algorithm is parallelized by evaluating the time needed to run each step from the above table; the step that takes the longest time is run multithreaded. After evaluating the time of each step, it was found that the matching step took the longest when running the proposed algorithm sequentially, and thus it is divided into a set of steps that run on multiple threads. Each area of the image runs on a single core, which extracts its features, compares them with the originally extracted features, and generates the set of excluded features that reduce the matching accuracy. Communication between the different cores is needed to exchange information, but since the time needed to run the algorithm sequentially is much larger than the time needed to run it in parallel, the overall communication overhead between cores can be neglected, giving the observed gains in speedup and accuracy.

C. Parallel-CRO Face Recognition

The Chemical Reaction Optimization algorithm is a meta-heuristic algorithm that searches for the best solution in a search space. As mentioned before, it consists of a set of steps. As in any other meta-heuristic algorithm, the important step is determining the fitness function according to the problem. The fitness function in the proposed algorithm evaluates the matching value for the variables pivot, area weights, and excluded features, which influence the matching results. The mapping between the CRO steps and the proposed method is shown in Table II.

TABLE II. MAPPING BETWEEN CRO STEPS AND PROPOSED WORK

CRO concept        | Its meaning in the proposed idea
Molecular Structure | Set of solutions found based on the original solution
Potential Energy    | Values of the important variables: pivot value, exclude-array values, weight values of the different face areas
Kinetic Energy      | Measure of tolerance for accepting a worse solution
Number of Hits      | Total number of iterations used for a specific experiment
Minimum Structure   | Current optimal matching value based on the different variable values
Synthesis, ω1 + ω2 → ω' | Two solutions with two potential energies are combined to select a single solution with the highest potential energy, which refers to the highest match percentage over all faces
Inter-molecular ineffective collision, ω1 + ω2 → ω1' + ω2' | Two solutions with two potential energies produce a solution with the highest potential-energy value by combining parts of both, e.g., the best excluded-array values from one solution with the face-area weights from another, to obtain the highest matching percentage over all faces
Decomposition, ω → ω1 + ω2 | A single solution with a specific potential energy produces two new separate solutions, each with a different potential energy
On-wall ineffective collision, ω → ω' | A single solution is combined with another random solution, each having its own potential energy, to produce a new solution with a potential energy different from the original

The parallel implementation of the algorithm is similar to what was applied for the GA. After evaluating the running times of the CRO face-recognition steps, it was found that the matching step needs the longest time. This step is implemented in parallel by distributing the jobs among the cores in order to reduce the running time and improve the speedup. The results of this implementation show a great enhancement in the overall performance of the system in terms of speedup and accuracy.
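The parallelization of the matching step, distributing candidate faces across workers, can be sketched as below. The paper uses Java threads on a multicore CPU; this sketch uses Python's thread pool instead, and the scoring function and enrolled data are toy stand-ins:

```python
# Sketch of parallelizing the matching step: enrolled faces are scored
# against the probe concurrently and the best-scoring identity is returned.
# The database contents and the simple per-feature tolerance are toy values.
from concurrent.futures import ThreadPoolExecutor

def score(stored, probe, tol=0.5):
    """Fraction of features that agree within a tolerance."""
    hits = sum(abs(s - p) <= tol for s, p in zip(stored, probe))
    return hits / len(stored)

def best_match_parallel(database, probe, workers=4):
    """Score every enrolled face in parallel; return the best identity."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        scores = dict(zip(database,
                          pool.map(lambda name: score(database[name], probe),
                                   database)))
    return max(scores, key=scores.get)

database = {
    "alice": [1.0, 2.0, 3.0],
    "bob":   [4.0, 1.0, 0.5],
}
probe = [1.1, 2.2, 2.9]
print(best_match_parallel(database, probe))
```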
III. EXPERIMENTAL RESULTS

A. Experimental Results Using GA

The GA is implemented in parallel using the Java programming language, on an Intel Core i7-3632QM CPU @ 2.20 GHz with 8 GB of RAM and Windows 7 64-bit. As mentioned before, the step that takes most of the method's running time is the matching step. To show the improvement in the running time of the matching step, the algorithm is first executed sequentially for different data sets, starting from 50 x 10 images (images of 50 persons, each with 10 different samples) up to 371 x 10 images. The times needed to run the matching step in the sequential implementation using GA are shown in Table III.

TABLE V. SEQUENTIAL IMPLEMENTATION TIME FOR CRO FACE RECOGNITION

Input size       | Execution time in M/S | Time needed for matching step | Total matching accuracy
50 persons x 10  | 6000   | 5350   | 99%
100 persons x 10 | 7200   | 6170   | 97%
150 persons x 10 | 42300  | 40700  | 95%
200 persons x 10 | 19500  | 17350  | 94%
250 persons x 10 | 78250  | 74300  | 92%
300 persons x 10 | 121000 | 115000 | 92%
335 persons x 10 | 177000 | 162000 | 91%
371 persons x 10 | 250000 | 197000 | 89%
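The speedup and efficiency figures discussed in this section follow the standard definitions (sequential time over parallel time, and speedup over core count). Using the sequential CRO time for the 50-person set from Table V; the 4-core parallel time below is purely hypothetical, for illustration:

```python
# Standard speedup and parallel-efficiency computation.
# t_seq comes from Table V (50 persons x 10 samples); t_par is an assumed
# 4-core measurement used only to demonstrate the formulas.

def speedup(t_seq, t_par):
    return t_seq / t_par

def efficiency(t_seq, t_par, cores):
    return speedup(t_seq, t_par) / cores

t_seq = 6000   # sequential CRO matching time, from Table V
t_par = 1850   # hypothetical 4-core time (illustrative)
print(round(speedup(t_seq, t_par), 3),
      round(efficiency(t_seq, t_par, 4), 3))
```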
[Continuation of a speedup/efficiency results table; its header falls outside this excerpt]
4-cores | 3.214 | 80.339 | 5.253 | 131.333
6-cores | 3.746 | 62.428 | 2.231 | 37.184
8-cores | 3.261 | 40.760 | 3.322 | 41.526

As mentioned before, the execution time of CRO face recognition using 2 cores is almost double the execution time of GA face recognition, yet the speedup achieved by running GA face recognition in parallel on 2 cores is very close to that achieved by CRO face recognition, even though the execution times differ. That is because the sequential time of GA face recognition is almost half that of CRO face recognition; GA is faster than the CRO algorithm on recognition problems.

The best performance of CRO face recognition was achieved when the algorithm ran on 4 cores, where the speedup and efficiency are best, while in GA face recognition the speedup kept increasing up to 6 cores; after that, the performance decreased, as mentioned before.

V. CONCLUSIONS

This paper proposed a parallel implementation of face recognition using meta-heuristic algorithms. The meta-heuristic algorithm is used to choose the best point in the image as the pivot point, to evaluate the weight of each area in the face image, and to exclude a number of unneeded features. Two algorithms were used, GA and CRO. Both were implemented in parallel using the Java programming language, on an Intel Core i7-3632QM CPU @ 2.20 GHz with 8 GB of RAM and Windows 7 64-bit. Different numbers of cores and different data sets were used for testing. The results show that the proposed method gives better results, with higher accuracy and a lower error rate, than the original face recognition scheme. The parallel implementation enhanced performance by decreasing the running time, and GA shows better performance than CRO in terms of speedup and efficiency.

REFERENCES

[1] N. B. Aoun, M. Mejdoub, and C. B. Amar, "Graph-based approach for human action recognition using spatio-temporal features," J. Vis. Commun. Image Represent., vol. 25, pp. 329-338, 2014.
[2] M. El'Arbi, C. B. Amar, and H. Nicolas, "Video watermarking based on neural networks," in Proceedings of the 2006 IEEE International Conference on Multimedia and Expo, Toronto, ON, Canada, 9-12 July 2006, pp. 1577-1580.
[3] M. El'Arbi, M. Koubaa, M. Charfeddine, and C. B. Amar, "A dynamic video watermarking algorithm in fast motion areas in the wavelet domain," Multimed. Tools Appl., vol. 55, pp. 579-600, 2011.
[4] A. Wali, N. B. Aoun, H. Karray, C. B. Amar, and A. M. Alimi, "A new system for event detection from video surveillance sequences," in Advanced Concepts for Intelligent Vision Systems, Proceedings of the 12th International Conference, ACIVS 2010, Sydney, Australia, 13-16 December 2010; Blanc-Talon, J., Bone, D., Philips, W., Popescu,
[5] D., Scheunders, P., Eds.; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2010; Volume 6475, pp. 110-120.
[6] M. Koubaa, M. Elarbi, C. B. Amar, and H. Nicolas, "Collusion, MPEG4 compression and frame dropping resistant video watermarking," Multimed. Tools Appl., vol. 56, pp. 281-301, 2012.
[7] M. S. Obaidat and N. Boudriga, Security of e-Systems and Computer Networks, Cambridge University Press, 2007.
[8] M. Chihaoui, A. Elkefi, W. Bellil, and C. B. Amar, "A Survey of 2D Face Recognition Techniques," REGIM: Research Groups on Intelligent Machines, University of Sfax, National School of Engineers (ENIS), Sfax 3038, Tunisia.
[9] D. E. Goldberg, Genetic Algorithms in Search, Optimization & Machine Learning, Addison-Wesley, Reading, MA, 1989.
[10] J. H. Holland, "Genetic Algorithms," Scientific American, July 1992.
[11] K. Deb, "An Introduction to Genetic Algorithms," Sadhana, vol. 24, parts 4 and 5.
[12] A. Y. S. Lam and V. O. K. Li, "Chemical-reaction-inspired metaheuristic for optimization," IEEE Trans. Evol. Comput., vol. 14, no. 3, pp. 381-399, 2010.
[13] A. Y. S. Lam and V. O. K. Li, "Chemical reaction optimization: a tutorial," Memetic Computing, vol. 4, 2012, pp. 3-17.
[14] https://personalpages.manchester.ac.uk/staff/timothy.f.cootes/data/xm2vts/xm2vts_markup.html
[15] O. Surakhi, M. Khanafseh, and Y. Jaffal, "An enhanced Biometric-based Face Recognition System using Genetic and CRO Algorithms," submitted.
Heart Disease Detection Using Machine Learning
Majority Voting Ensemble Method
Rahma Atallah, Communications Engineering Department, Princess Sumaya University for Technology, Amman, Jordan (r_rahma@hotmail.com)
Amjed Al-Mousa, Computer Engineering Department, Princess Sumaya University for Technology, Amman, Jordan (a.almousa@psut.edu.jo)
Abstract—This paper presents a majority voting ensemble method that is able to predict the possible presence of heart disease in humans. The prediction is based on simple, affordable medical tests conducted in any local clinic. Moreover, the aim of this project is to add confidence and accuracy to the doctor's diagnosis, since the model is trained using real-life data of healthy and ill patients. The model classifies the patient based on the majority vote of several machine learning models, in order to provide more accurate solutions than a single model alone. This approach produced an accuracy of 90% based on the hard voting ensemble model.

Keywords—Machine learning; Majority Voting ensemble method; heart disease; UCI dataset; classification.

I. INTRODUCTION

In the present era, heart disease rates have dramatically increased to become the leading cause of death among adults in the United States, due to the spread of unhealthy habits [1]. These include a decline in physical activity, as the technology trend moves toward replacing human physical activity, and unhealthy eating habits, both directly linked to an increased risk of heart disease.

Starting with the definition of heart disease: according to [2], the National Heart, Lung, and Blood Institute states that heart disease is a disruption to the heart's normal electrical system and pumping functions, which makes it harder for the heart muscle to pump blood efficiently.

Furthermore, according to the World Health Organization (WHO), 17.9 million people die each year from cardiovascular diseases, which corresponds to 31% of all deaths around the world [3]. This creates the need for an affordable system that can give a preliminary assessment of a patient based on relatively simple medical tests that are affordable to everyone.

To conduct the training and testing of the machine learning model, the Cleveland dataset from the well-known UCI repository was used, since it is an authenticated dataset widely used for training and testing machine learning models [4]. The dataset contains 303 instances and 14 attributes based on well-known factors thought to correlate with the risk of heart disease.

The approach presented in this paper uses the hard voting ensemble method, a technique in which multiple machine learning models are combined and the prediction result is based on the majority vote of all models. This technique is used to improve the overall prediction results, since the combination of models produces a powerful collaborative overall model.

Section II of this paper presents a review of related work; Section III then introduces the details of the dataset, the data preprocessing, and the machine learning techniques used. The results of each model, along with the overall accuracy of the hard voting model, are presented in Section IV. Finally, a conclusion is outlined in Section V.

II. RELATED WORK

In the field of heart disease detection, a variety of techniques regarding data preprocessing and model variation has been used. The work presented in [5] used the same dataset as this paper, but different machine learning models were implemented. Three discrete classifier models were built: a Support Vector Machine (SVM) classifier, the naïve Bayes algorithm, and C4.5. The prediction of heart disease was conducted with each of these models separately and produced a maximum accuracy of 84.12% with the SVM model.

The work in [6] also used the Cleveland heart disease dataset, but the classification models implemented involved only tree algorithms: J48, Logistic Model Tree, and the Random Forest algorithm. A comparison of the three methodologies was conducted, and the highest accuracy achieved was 84%, using the J48 algorithm.

Furthermore, the work in [7] presents a prediction system for coronary artery heart disease using four different datasets, including the Cleveland dataset. The algorithms used for prediction involved only decision tree techniques, namely C4.5 and Fast Decision Tree. At first, the model is trained on each dataset using all features. Then the best features from each dataset are selected and used for training the model. This technique improved the average prediction accuracy across all datasets from 76.3% to 77.5% using C4.5, and from 75.48% to 78.06% using the Fast Decision Tree.

The work in [8] uses data mining techniques, in which the large Cleveland dataset with all 76 attributes is investigated in order to extract hidden and previously unknown patterns. This allows the prediction to utilize the most dominant and effective attributes provided in the dataset. The machine learning algorithm consists of different decision tree methods (J48, Logistic Model Tree, Random Forest). The highest accuracy is obtained from the
Exercise-induced angina | Exang | Discrete | 1 = yes, 0 = no
Heart rate | Thal | Discrete | 3 = normal, 6 = fixed defect, 7 = reversible defect
Diagnosis classes | Target | Discrete | 0 = healthy, 1 = possible heart disease

Figure 1: Heat map of cross-correlation values
Also, a pie chart shown in Figure 2 displays the gender distribution of the instances in the Cleveland dataset. It is clear that the dataset has more males (68%) than females (32%).
A. Stochastic Gradient Descent (SGD) Classifier
Starting off with the first model, a binary classifier that uses the SGD approach was built. The SGD approach picks random instances from the training set and computes the gradient based on each single instance in order to reach the minimum value of the cost function. Then, based on the parameters chosen to minimize the cost function, classification occurs using the simple binary classifier built, which is able to identify whether heart disease is present or not.
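As a minimal sketch of this idea (illustrative only, not the authors' code; the toy data, learning rate, and epoch count are assumptions), an SGD-trained logistic-loss binary classifier can be written as:

```python
import math
import random

def sgd_train(X, y, lr=0.1, epochs=500, seed=0):
    """Train a logistic-loss binary classifier one random instance at a time."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        i = rng.randrange(len(X))                       # pick one random training instance
        z = sum(wj * xj for wj, xj in zip(w, X[i])) + b
        p = 1.0 / (1.0 + math.exp(-z))                  # sigmoid of the decision value
        g = p - y[i]                                    # log-loss gradient term for this instance
        w = [wj - lr * g * xj for wj, xj in zip(w, X[i])]
        b -= lr * g
    return w, b

def sgd_predict(w, b, x):
    """1 if the decision function is positive, else 0."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else 0

# Toy, linearly separable one-feature data standing in for a simple medical test value.
X = [[-2.0], [-1.5], [-1.0], [1.0], [1.5], [2.0]]
y = [0, 0, 0, 1, 1, 1]
w, b = sgd_train(X, y)
```

Because each update uses a single random instance, the cost decreases noisily rather than monotonically, which is the defining trade-off of SGD against full-batch gradient descent.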
Figure 8: Age Distribution for people with heart disease

In addition, the highest correlated continuous attribute (thalach) is plotted against age, as shown in Figure 9, to examine whether there is any relation. It is noticed that for people with heart disease at all age ranges, the heart rate was generally higher than for people with no heart disease. In addition, in both groups the maximum heart rate decreased as age increased, leading to the negative correlation of -0.4 with age shown earlier in Figure 1.

B. K-Nearest Neighbor Classifier
The second model that was built is the K-Nearest Neighbor classifier. The algorithm in this classifier involves finding the distances between the new instance and all of the training instances; then, from a predefined number K, it selects the nearest K data points to the new instance. Finally, classification occurs based on the majority class of the K data points selected. The K number in this project was chosen to be 7 since it produced the best results based on GridSearchCV.

C. Random Forest Classifier
The third model that was built is the Random Forest classifier. This model involves building multiple decision trees and combining them in order to obtain a more accurate and stable prediction. In this project, 1000 trees worked best according to GridSearchCV.
D. Logistic Regression Classifier
The fourth model built was the Logistic Regression classifier. According to [10], the Logistic Regression classifier computes a weighted sum of the input features and outputs the logistic of this result. The logistic is a sigmoid function that outputs a number between 0 and 1. Classification then occurs based on the estimated probability.
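The hard-voting rule that combines the classifiers above reduces to a per-instance majority vote. A minimal sketch (the model outputs below are made up for illustration):

```python
from collections import Counter

def hard_vote(model_predictions):
    """Majority vote across models; model_predictions is a list of
    per-model prediction lists, one label per instance."""
    n_instances = len(model_predictions[0])
    voted = []
    for i in range(n_instances):
        votes = [preds[i] for preds in model_predictions]
        voted.append(Counter(votes).most_common(1)[0][0])  # most frequent label wins
    return voted

# Three hypothetical classifiers voting on four patients (1 = heart disease).
preds_sgd = [1, 0, 1, 0]
preds_knn = [1, 1, 0, 0]
preds_rf  = [0, 1, 1, 0]
ensemble = hard_vote([preds_sgd, preds_knn, preds_rf])
```

With an odd number of binary classifiers there are no ties, which is one practical reason to combine an odd number of models under hard voting.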
Furthermore, the last model that was built was the Logistic Regression classifier. The model was built using the default parameters, and classification occurred on the unseen test set. The accuracy came out to be 87%, and after conducting GridSearchCV the accuracy remained the same, since the default parameters turned out to be the same as the optimized parameters. Figure 13 shows the confusion matrix of this model.
Figure 12: Random Forest classifier confusion matrix

To further investigate the models built, a receiver operating characteristic (ROC) curve was plotted, as shown in Figure 16, for all of the models involved in this project. The ROC curve represents the diagnostic ability of the classifier, and the area under each curve is calculated and displayed in Figure 15. The closer the area value of the ROC curve is to one, the better the diagnostic ability of the model.

Figure 15: ROC curve for all models

Finally, the overall accuracy of this project after conducting the hard voting ensemble method came out to be 90%, which is considered a fairly adequate accuracy that can be further built upon in the future.

VI. CONCLUSION

In conclusion, this paper presented a machine learning ensemble technique that combined multiple machine learning techniques in order to provide a more accurate and robust model for predicting the possibility of having heart disease. The ensemble model achieved 90% accuracy, which exceeds the accuracy of each individual classifier. The model can be used to assist doctors in analyzing patient cases in order to validate their diagnosis and help decrease human error.

REFERENCES
[1] "Heart Disease Facts & Statistics," Centers for Disease Control and Prevention. [Online]. Available: https://www.cdc.gov/heartdisease/facts.htm. [Accessed: 27-Apr-2019].
[2] NHLBI, NIH, Anatomy of the Heart, 2011 [updated 17 November 2011; cited 10 January 2015]. Available: http://www.nhlbi.nih.gov/health/health-topics/topics/hhw/anatomy
[3] "Cardiovascular diseases (CVDs)," World Health Organization, 26-Sep-2018. [Online]. Available: https://www.who.int/cardiovascular_diseases/en/. [Accessed: 27-Apr-2019].
[4] M. Lichman, UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science, 2013.
[5] D. Chaki, A. Das, and M. Zaber, "A comparison of three discrete methods for classification of heart disease data," Bangladesh Journal of Scientific and Industrial Research, vol. 50, no. 4, pp. 293–296, 2015.
[6] R. G. Saboji, "A scalable solution for heart disease prediction using classification mining technique," in 2017 International Conference on Energy, Communication, Data Analytics and Soft Computing (ICECDS), 2017.
[7] R. El-Bialy, M. Salamay, O. Karam, and M. Khalifa, "Feature analysis of coronary artery heart disease data sets," Procedia Computer Science, vol. 65, pp. 459–468, 2015.
[8] J. Patel, D. TejalUpadhyay, and S. Patel, "Heart disease prediction using machine learning and data mining technique," International Journal of Computer Science & Communication, vol. 7, no. 1, pp. 129–137, 2015. DOI: 10.090592/IJCSC.2016.018.
[9] S. Ghosh, "Application of various data mining techniques to classify heart diseases," 2017. [Online]. Available: https://pdfs.semanticscholar.org/dbe6/7e47cb35edc283cebd5cf06dd67faf1ad100.pdf [Accessed 13 Jul. 2019].
[10] A. Géron, Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems. Beijing: O'Reilly, 2018.
Resolving Conflict of Interests in Recommending
Reviewers for Academic Publications Using Link
Prediction Techniques
Sa’ad A. Al-Zboon, Saja Khaled Tawalbeh, Heba Al-Jarrah
Dept. of Computer Science, Jordan University of Science and Technology, Irbid, Jordan
saalzboon16@cit.just.edu.jo, sajatawalbeh91@gmail.com, hebaatta96@gmail.com
Abstract—An honest peer-review process is key for producing high-quality scientific research. However, this process depends on two main factors: (1) the expertise of reviewers in the topic of a submitted paper and (2) the relationships between reviewers and authors. To satisfy the first factor, editors and conference chairs manually select reviewers, whereas to satisfy the second factor and prevent any conflict of interest (CoI) between reviewers and authors, reviewers and authors are asked to declare any CoI manually. Such a solution is tedious for all actors and error-prone. To solve this problem and satisfy those two factors, we have developed a novel framework that (1) recommends expert reviewers and (2) resolves the CoI problem. To develop our framework, we have represented the DBLP citation network dataset as a graph database using Neo4j. Cypher queries are used to select expert reviewers. Various link prediction algorithms, especially the Adamic Adar and the Common Neighbors algorithms, have been utilized to resolve any potential conflict of interest.

Index Terms—Conflict of Interests (CoIs), DBLP, Link Prediction, Adamic Adar, Common Neighbors.

I. INTRODUCTION

The peer-review process for evaluating scientific research is a crucial process for producing high-quality research papers and successfully running academic events. An honest peer-review process relies on two main factors: (1) the expertise of reviewers in the topic of the submitted paper and (2) the relationships between reviewers and authors. Unfortunately, meeting these two factors is not an easy task.
Currently, achieving the first factor depends mainly on the editors and conference chairs to decide who can review what. Regarding the second factor, reviewers and authors need to declare any conflict of interest (CoI) manually. Although achieving those two factors is very important, unfortunately, the current solution is tedious and error-prone.
A conflict of interest, simply, occurs when a reviewer's judgment might be compromised by an existing relationship to an author of a submitted paper. There are many forms of relationships that can lead to a CoI, such as student-supervisor relationships, working at the same affiliation, co-authorship, family relationships, etc. On the other hand, a recommended reviewer of a paper, arguably, should be an active researcher who has some publications on the topic of that paper.
To solve the aforementioned problem and achieve the two main factors easily and efficiently, we have developed a framework that recommends expert reviewers in the topic of a given paper while resolving the CoI problem. To develop our framework, we utilized graph mining techniques [1] to recommend expert reviewers and detect any potential CoI between reviewers and authors.
Graph mining techniques have been used in several domains such as computer networks [2] [3], social networks [4]–[6], co-authorship networks [7] [8], and other fields. These techniques depend on data extraction techniques such as classification and clustering. Relations between people, whether business relationships, friendships, or otherwise, are represented as graphs. A graph is represented as G(V, E), where V is a set of vertices (nodes) and E is a set of edges. In social networks, nodes represent people and edges represent the relations between them. A relationship can be direct, meaning that there is a direct edge between two nodes, or indirect (implicit), meaning that there is a path between two nodes but not a direct edge.
To detect implicit relationships, various link prediction algorithms have been developed. Link prediction algorithms calculate the possibility of two nodes to have a direct edge
represents the number of neighbors for node B, and N(n) represents the frequency of shared neighbors between node A and node B. A value of 0 indicates that two nodes are not close, whereas a higher value indicates that the two nodes are closer.

B. Common Neighbors Algorithm

The Common Neighbors (CN) algorithm measures the link prediction between two nodes based on their shared neighbors [10]. It relies on the fact that, if two strangers have one common friend, those two strangers are more likely to meet in the future than two strangers without a common friend. The CN value is computed using the following equation:

CN(A, B) = |N(A) ∩ N(B)|

where N(A) and N(B) are the neighbors of node A and node B, respectively. This equation calculates the convergence between two nodes. A CN value of 0 indicates that the two nodes are not nearby, whereas a higher value means the two nodes are closer.

IV. METHODOLOGY

In this research, we define a reviewer as an active researcher who has some publications on a given topic during the last 5 years and has no CoI with any author of a given paper. This section describes our approach in more detail. Section IV-A describes the dataset we used in our research. Section IV-B describes our hosting mechanisms for the processed dataset. Finally, Section IV-D describes our framework for finding the candidate reviewers for a given paper.

A. Dataset Preparation

This section describes the dataset we used in our research as well as the pre-processing techniques we applied to clean up the data.
1) Dataset: The DBLP Citation Network dataset [11] has been used in this research. DBLP is a computer science bibliography that provides bibliographic information on major computer science journals and conferences. There are several versions of the DBLP dataset, including DBLP citation network v1, DBLP citation network v4, ACM citation v9, and DBLP citation network v11. In this research, we have used DBLP version 11.
The dataset contains 4,107,340 publications. For each publication, the dataset contains the publication id, authors {name, id, and organization as org}, title, venue {raw, id}, year, number of citations (n_citation), references, publisher, etc. Moreover, the dataset contains 36,624,464 citation relationships. The DBLP dataset is available in different formats such as XML, RDF, and JSON files. In this study, we used the JSON file, in which each line in the dataset file represents a paper.
This dataset has been used in research for many purposes such as data clustering [29], topic modeling analysis [30], conflict of interest [31], and expert finding [32].
2) Data Pre-processing: To increase the performance of our approach and prepare the data for the hosting process, we implemented a Python script to clean the data, removing special characters, removing unneeded features from the dataset, and removing old publications (more than 10 years old).
The dataset contains 4,107,340 publications. However, after the pre-processing step, we end up with 2,219,099 publications with a data size of 1.21 GB. For each publication, the dataset contains publication id, title, authors name, venue raw, year, field of study (fos), and references data schema.
Finally, we used the Latent Dirichlet Allocation (LDA) [13] topic modeling technique to find the topics of all publications in the dataset from their titles and stored them to be used later by our framework (recall Section IV-D).

B. Hosting the Dataset as a Graph Database

After we obtained and preprocessed the dataset (recall Section IV-A), we stored the dataset as a graph database on Neo4j [12]. Neo4j is an open-source graph database implemented using Java and Scala. The Neo4j graph database can be managed using the Cypher query language or the Bolt protocol. Cypher is the Neo4j query language that allows developers to store and retrieve data from a Neo4j graph database. The Bolt protocol is an efficient client/server protocol for database applications.
Unlike relational databases, where data is stored in tables, in Neo4j the data is represented as a graph of nodes and the relationships (links) between them. We utilized the Neo4j graph database for hosting our DBLP dataset. This step needs to be performed offline one time. Table I shows information about the hosted dataset on the Neo4j graph database. The total graph size is 14.07 GB.

Table I
NEO4J GRAPH DATABASE INFORMATION

Authors # 1,596,642
Articles # 2,219,099
Venues # 22,837
Nodes # 4,992,939
Relationships # 36,516,189

Figure 1 depicts the structure of the stored graph in the Neo4j database. The figure shows that there are four labels (Article, Author, Topic, and Venue). An Article has a Topic, is authored by one or more Author, is cited by different Article nodes, and is presented in a Venue.

C. Co-Author Graph Building

Once the graph database has been hosted on Neo4j, we create a co-author relationship model that describes the collaborations between different authors. To build such a graphical model, we relied on the "Article authored by one or more Author" relationships in the stored graph database (see Figure 1). This step also needs to be done one time (an offline process). Each co-author relationship indicates that there is at least one research collaboration between two authors.
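As an illustrative sketch (not the authors' implementation), the Common Neighbors and Adamic-Adar scores described above can be computed directly from an adjacency map of the co-author graph:

```python
import math

def common_neighbors(neighbors, a, b):
    """CN(A, B) = |N(A) ∩ N(B)|: number of shared neighbors."""
    return len(neighbors[a] & neighbors[b])

def adamic_adar(neighbors, a, b):
    """AA(A, B): sum over shared neighbors n of 1 / log |N(n)|."""
    return sum(1.0 / math.log(len(neighbors[n]))
               for n in neighbors[a] & neighbors[b])

# A toy co-author graph: each author maps to the set of its co-authors.
neighbors = {
    "A": {"X", "Y"},
    "B": {"X", "Z"},
    "X": {"A", "B"},
    "Y": {"A"},
    "Z": {"B"},
}
cn = common_neighbors(neighbors, "A", "B")  # A and B share one co-author, X
aa = adamic_adar(neighbors, "A", "B")       # X has two neighbors, so 1 / ln(2)
```

In both measures a score of 0 means no shared neighbors (no implicit tie), while a higher score flags a potential CoI between a candidate reviewer and an author.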
them descending based on their number of publications on that topic. The REVIEWER column presents the name of a candidate reviewer. Number_Of_Publication_In_Topic presents the total number of publications by the reviewer in that topic. Topic_A and Topic_B are examples of the topics of the given paper.
MATCH (reviewer)<-[:AUTHOR]-(paper)
MATCH (paper:TOPIC {HAS_TOPIC: 'Topic_A, Topic_B'})

RETURN reviewer AS REVIEWER,
       COUNT(paper) AS Number_Of_Publication_In_Topic

ORDER BY Number_Of_Publication_In_Topic DESC

LIMIT 10

Query 2. Get Top Active Authors in a Topic
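In plain Python terms, the ranking logic of Query 2 amounts to counting each author's publications on the paper's topics and keeping the top ten. A sketch with made-up records (not the framework itself; the field names mirror the dataset description):

```python
from collections import Counter

def top_reviewers(publications, topics, limit=10):
    """Rank authors by how many of their publications match the given topics."""
    counts = Counter()
    for paper in publications:
        if paper["topic"] in topics:
            for author in paper["authors"]:
                counts[author] += 1
    return counts.most_common(limit)  # descending by publication count

# Hypothetical publication records mirroring the Article/Author/Topic labels.
pubs = [
    {"authors": ["A. Researcher"], "topic": "Topic_A"},
    {"authors": ["A. Researcher", "B. Scholar"], "topic": "Topic_B"},
    {"authors": ["B. Scholar"], "topic": "Topic_C"},
]
ranking = top_reviewers(pubs, {"Topic_A", "Topic_B"})
```

The Cypher version pushes this counting into the graph database, so only the ten candidate rows cross the wire.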
Figure 2. Overview of our CoI Framework

indicates that two reviewers know and communicate with each other.

V. EXPERIMENTAL EVALUATION

To evaluate our approach, we applied our CoI approach to a publication to get the recommended reviewers and exclude the candidate reviewers with a CoI.
For this experiment we chose the publication "Decoupling assessment and serious games to support guided exploratory learning in smart education" [33]. The authors of this paper are Mohammad Al-Smadi, Nicola Capuano, and Christian Guetl. The topic that the Extract Topic step (see Figure 2) obtained from the LDA model according to the title of this paper is "Virtual, Local, Reality, Health, Error, Phase, Education, Dynamics, Medical, Continuous".

Table II
TOP 10 CANDIDATE REVIEWERS OF [33]

Reviewer Name | Number Of Publications In Topic
Giuseppe Riva | 77
Andrea Gaggioli | 59
Mark Billinghurst | 59
Kimon P. Valavanis | 57
Karl Rihaczek | 40
Dieter Schmalstieg | 35
Vinton G. Cerf | 35
Peter Pagel | 34
Gary McGraw | 31
Stefania Serafin | 31

Table III
ADAMIC ADAR PREDICTION VALUES OF THE TOP 10 CANDIDATE REVIEWERS

Reviewer Name | AA Score
Giuseppe Riva | 0.0
Andrea Gaggioli | 0.0
Mark Billinghurst | 0.0
Kimon P. Valavanis | 0.0
Karl Rihaczek | 0.0
Dieter Schmalstieg | 0.0
Vinton G. Cerf | 0.0
Peter Pagel | 0.0
Gary McGraw | 0.0
Stefania Serafin | 0.0

Table IV
COMMON NEIGHBORS PREDICTION VALUES OF THE TOP 10 CANDIDATE REVIEWERS

Reviewer Name | CN Score
Giuseppe Riva | 0.0
Andrea Gaggioli | 0.0
Mark Billinghurst | 0.0
Kimon P. Valavanis | 0.0
Karl Rihaczek | 0.0
Dieter Schmalstieg | 0.0
Vinton G. Cerf | 0.0
Peter Pagel | 0.0
Gary McGraw | 0.0
Stefania Serafin | 0.0
Table V
RECOMMENDED REVIEWERS FOR [33] WITHOUT CoI

Reviewer Name
Giuseppe Riva
Andrea Gaggioli
Mark Billinghurst
Kimon P. Valavanis
Karl Rihaczek
Dieter Schmalstieg
Vinton G. Cerf
Peter Pagel
Gary McGraw
Stefania Serafin

and the Cypher query language was used to retrieve the candidate reviewers. Finally, several link prediction algorithms were utilized to calculate the CoI prediction value of each candidate reviewer. The final list of reviewers is presented with highlights for the reviewers with a CoI.
In the future, we are planning to use machine learning techniques together with link prediction algorithms to solve the CoI problem. Moreover, we plan to use different graph mining techniques to achieve better results.

ACKNOWLEDGEMENTS

This research is partially funded by Jordan University of Science and Technology, Research Grant Number: 20170107.

REFERENCES
[1] L. Tang and H. Liu, "Graph mining applications to social network analysis," in Managing and Mining Graph Data. Springer, 2010, pp. 487–513.
[2] B. Chen, W. Xiao, and B. Parhami, "Internode distance and optimal routing in a class of alternating group networks," IEEE Transactions on Computers, vol. 55, no. 12, pp. 1645–1648, Dec 2006.
[3] M. Ljubojević, A. Bajić, and D. Mijić, "Centralized monitoring of computer networks using zenoss open source platform," in 2018 17th International Symposium INFOTEH-JAHORINA (INFOTEH), March 2018, pp. 1–5.
[4] Z. Lu, Y. E. Sagduyu, and Y. Shi, "Integrating social links into wireless networks: Modeling, routing, analysis, and evaluation," IEEE Transactions on Mobile Computing, vol. 18, no. 1, pp. 111–124, Jan 2019.
[5] L. Zhang, H. Li, C. Zhao, and X. Lei, "Social network information propagation model based on individual behavior," China Communications, vol. 14, no. 7, pp. 1–15, July 2017.
[6] S. H. Sajadi, M. Fazli, and J. Habibi, "The affective evolution of social norms in social networks," IEEE Transactions on Computational Social Systems, vol. 5, no. 3, pp. 727–735, Sep. 2018.
[7] L. Guo, X. Cai, F. Hao, D. Mu, C. Fang, and L. Yang, "Exploiting fine-grained co-authorship for personalized citation recommendation," IEEE Access, vol. 5, pp. 12714–12725, 2017.
[8] M. Kudělka, Z. Horák, V. Snášel, P. Krömer, J. Platoš, and A. Abraham, "Social and swarm aspects of co-authorship network," Logic Journal of the IGPL, vol. 20, no. 3, pp. 634–643, June 2012.
[9] L. A. Adamic and E. Adar, "Friends and neighbors on the web," Social Networks, vol. 25, no. 3, pp. 211–230, 2003.
[10] D. Liben-Nowell and J. Kleinberg, "The link-prediction problem for social networks," Journal of the American Society for Information Science and Technology, vol. 58, no. 7, pp. 1019–1031, 2007.
[11] J. Tang, J. Zhang, L. Yao, J. Li, L. Zhang, and Z. Su, "Arnetminer: extraction and mining of academic social networks," in Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2008, pp. 990–998.
[12] Neo4j Inc., "Neo4j: Graph database," Accessed May, 2019, https://neo4j.com/.
[13] D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent dirichlet allocation," Journal of Machine Learning Research, vol. 3, no. Jan, pp. 993–1022, 2003.
[14] Neo4j Inc., "Cypher query language," Accessed May, 2019, https://neo4j.com/developer/cypher/.
[15] P. M. Chuan, M. Ali, T. D. Khang, N. Dey et al., "Link prediction in co-authorship networks based on hybrid content similarity metric," Applied Intelligence, vol. 48, no. 8, pp. 2470–2486, 2018.
[16] T. Dai, L. Zhu, X. Cai, S. Pan, and S. Yuan, "Explore semantic topics and author communities for citation recommendation in bipartite bibliographic network," Journal of Ambient Intelligence and Humanized Computing, vol. 9, no. 4, pp. 957–975, 2018.
[17] K. Zhou, T. P. Michalak, M. Waniek, T. Rahwan, and Y. Vorobeychik, "Attacking similarity-based link prediction in social networks," in Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems. International Foundation for Autonomous Agents and Multiagent Systems, 2019, pp. 305–313.
[18] R. Ahuja, V. Singhal, and A. Banga, "Using hierarchies in online social networks to determine link prediction," in Soft Computing and Signal Processing. Springer, 2019, pp. 67–76.
[19] E. Bütün and M. Kaya, "A pattern based supervised link prediction in directed complex networks," Physica A: Statistical Mechanics and its Applications, vol. 525, pp. 1136–1145, 2019.
[20] A. M. Fard, E. Bagheri, and K. Wang, "Relationship prediction in dynamic heterogeneous information networks," in European Conference on Information Retrieval. Springer, 2019, pp. 19–34.
[21] H. Cho and Y. Yu, "Link prediction for interdisciplinary collaboration via co-authorship network," Social Network Analysis and Mining, vol. 8, no. 1, p. 25, 2018.
[22] Y. Xiao, H. Huang, F. Zhao, and H. Jin, "Tplp: Two-phase selection link prediction for vertex in graph streams," in Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, 2019, pp. 514–525.
[23] A. Ahmed, M. F. Khan, M. Usman, and K. Saleem, "Analysis of coauthorship network in political science using centrality measures," arXiv preprint arXiv:1902.06692, 2019.
[24] T. Amjad, A. Daud, and N. R. Aljohani, "Ranking authors in academic social networks: a survey," Library Hi Tech, vol. 36, no. 1, pp. 97–128, 2018.
[25] D. Liben-Nowell and J. Kleinberg, "The link-prediction problem for social networks," Journal of the American Society for Information Science and Technology, vol. 58, no. 7, pp. 1019–1031, 2007.
[26] A.-L. Barabási and R. Albert, "Emergence of scaling in random networks," Science, vol. 286, no. 5439, pp. 509–512, 1999.
[27] K. Hu, J. Xiang, W. Yang, X. Xu, and Y. Tang, "Link prediction in complex networks by multi degree preferential-attachment indices," arXiv preprint arXiv:1211.1790, 2012.
[28] T. Zhou, L. Lü, and Y.-C. Zhang, "Predicting missing links via local information," The European Physical Journal B, vol. 71, no. 4, pp. 623–630, 2009.
[29] H. Yin, A. R. Benson, and J. Leskovec, "The local closure coefficient: A new perspective on network clustering," in Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining. ACM, 2019, pp. 303–311.
[30] X. Kong, Y. Shi, S. Yu, J. Liu, and F. Xia, "Academic social networks: Modeling, analysis, mining and applications," Journal of Network and Computer Applications, 2019.
[31] S. Wu, U. L. Hou, S. S. Bhowmick, and W. Gatterbauer, "Pistis: A conflict of interest declaration and detection system for peer review management," in Proceedings of the 2018 International Conference on Management of Data. ACM, 2018, pp. 1713–1716.
[32] C. Shi, Z. Zhang, P. Luo, P. S. Yu, Y. Yue, and B. Wu, "Semantic path based personalized recommendation on weighted heterogeneous information networks," in Proceedings of the 24th ACM International Conference on Information and Knowledge Management. ACM, 2015, pp. 453–462.
[33] A.-S. Mohammad, N. Capuano, and C. Guetl, "Decoupling assessment and serious games to support guided exploratory learning in smart education," Journal of Ambient Intelligence and Humanized Computing, vol. 9, no. 3, pp. 497–511, 2018.
Reconstructing Colored Strip-Shredded Documents
Based on the Hungarian Algorithm
Abstract—One of the common problems in forensic science and investigation science is reconstructing destroyed documents that have been strip-shredded. This work intends to design a strip matching algorithm that matches the edges of the strips in order to reconstruct the original document. The proposed algorithm is divided into three phases. The first is image-based similarity evaluation to produce a score function (which includes building a "distance matrix"). The second phase is an assignment phase that matches the border pixels of one shred's right side to the left side of another shred (using the Hungarian algorithm). The third phase is defining the sequence according to the matched strips in order to merge the shreds and reconstruct the document. The proposed work is compared with a nearest neighbor search algorithm in terms of accuracy and speed. The Hungarian reassembling algorithm scores better accuracy and run time than nearest neighbor reassembling. The proposed approach scored an average accuracy of 96.2% when reassembling an available online benchmark.

Index Terms—Document Reconstruction, Feature Matching, Hungarian Algorithm, Nearest Neighbor Search, Strip-Shredded Documents.

I. INTRODUCTION

Automatic shred reconstruction involves finding a correct spatial arrangement of given shreds in order to reassemble a complete document. This problem is usually handled by historians and forensic investigators [6]. It is used in many domains such as health informatics, insurance claim analysis [1], and the military sector [2]. Also, it can be used in the recovery of documents accidentally lost [1]. Manual reconstruction can be used, in which parts are arranged and analyzed as if solving a puzzle [3]. The huge number of possible shred permutations makes a manual solution inefficient, exhausting, and time consuming. Many methodologies have been followed to provide automated and semi-automated document reconstruction. Whether manual or automated reconstruction is used, the greatest challenge is shred identification and matching.
In general, shredding machines produce three categories of shreds: rectangular strips (spaghetti), cross-cut, and circular [3]. This work aims to design, implement, and test an algorithm to solve the problem of reconstructing shredded documents. This implies finding the correct positioning of n given shreds in order to form the original document. Each shred can be represented as a binary bitmap, and it is assumed that the shreds are placed in the correct orientation. Few studies have worked on the problem of reconstructing strip-cut documents [4]. We intend to improve the strip-cut matching algorithm specifically, and to outperform the sequential "best match" and "minimum distance" search for each shred. Searching for the nearest neighbor match for each strip from both sides has the drawback of being time consuming. Motivated by this drawback, a new approach is proposed to reconstruct strip-shredded text documents by first specifying the problem as an optimization problem and then reformulating it as a maximum bipartite matching problem. The Hungarian algorithm was deployed to find the best match with reduced complexity.
The paper is organized as follows. In Section 2, a brief overview of related work is given. In Section 3, the methodology and how our shredded documents were obtained are explored; the naive algorithm used to solve this problem is described, along with its complexity analysis, and the optimized algorithm and its different phases, which puts both algorithms in a comparative frame in terms of time complexity. In Section 4, the experimental results and a thorough quantitative evaluation of the proposed approach are presented. Finally, Section 5 concludes the paper.

II. RELATED WORK

The problem of reconstructing shredded documents is closely related to the problem of automatically solving jigsaw puzzles. Schauer et al. [5] considered the shredded document as a form of jigsaw puzzle. They specified three types of fragments: the manually torn documents, the cross-cut
Sleit et al. [4] proposed a solution for the reconstruction of Algorithm 1: Hungarians reassembling algorithm
crosscut shredded text documents (RCCSTD) problem based input : sh[] ← shreds
on iterative building of clusters related to shreds. Biesinger output: Ordered sequence of shreds
et al. [7] investigated the same problem with an improved 1 distance[] ← ∞ , counter1 ← 0 , counter2 ← 0
genetic algorithm. 2 while counter1 ≤ numberof strips(sh) do
Butler and Chakraborty [3] proposed the "Deshredder" approach, which provides a visual analysis and makes use of user involvement to direct the reconstruction process. The approach represents shredded pieces as time series and uses nearest-neighbor matching techniques, which enables matching not only the contours of shredded pieces but also their content. Some of the literature deals with reconstructing strip-shredded documents by extracting information from the boundaries of the shredded document strips [2], [3], [12], but these works do not concentrate on the order of the algorithm (its time complexity): they focus on finding a solution rather than on finding a solution with a better run time.

Justino et al. worked on reconstructing hand-shredded documents [10]. Their methodology pre-processes each shred with a polygonal approximation in order to reduce the complexity of the boundaries. The next stage is feature extraction, followed by a matching stage. Hand-shredded documents usually yield shreds with irregular boundaries, which need extra processing before matching; they applied polyline simplification using the Douglas-Peucker (DP) algorithm. The performance of Justino et al.'s methodology degrades as the number of shreds grows, since this affects the polygonal approximation.

III. PROPOSED METHODOLOGY

The result of the shredding process is a set of n shreds Sh = sh0, ..., shn, which also represents the input to the algorithm. This work is divided into three subsequent phases. The first phase applies a similarity score function and results in an n × n distance matrix. The second phase uses the Hungarian method for bipartite matching to find the border matches. The third

Algorithm 1 (continued):
3   while counter2 ≤ numberofstrips(sh) do
4     distance[counter1][counter2] = strip-Distance(counter1, counter2)   (total distance between band edges)
5   end
6   end
7   Hungarians-Reassembling(distance[], sh[]) returns indexes assigning the best right-side match for the left side of every shred
8   sequence[] ← 0
9   indexes = Hungarians-Reassembling(distance[])
10  while i, v in indexes do
11    val = distance[i][v]
12    sequence.append((v, val))
13  end
14  return sequence[]

The preprocessing steps are defined in Algorithm 1 (lines 1-6). The innermost loop calculates the distance between two pixel values as a measure of how similar they are. This process deals with the pixel vector, whatever color model is used. The outer loops ensure that each shred is compared with all other shreds. The result is the sum of distances between the rightmost column of pixels in one shred and the leftmost column of pixels in another; the resulting matrix "distance" holds the sums of distances between the edges of every pair of shreds. The complexity of the preprocessing in Algorithm 1 (lines 1-6) is n² × h, where h is the height of the shreds. Assuming that the height of a document is the number of shreds multiplied by some constant, h = C × n, the run time complexity is approximately O(n³). The input used in this step is a document shredded by a shredding function. This shredding function generates any number of shreds out of a document,
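The two phases just described can be sketched as follows. This is an illustrative sketch only, not the authors' implementation: edge_distance_matrix and hungarian_matches are hypothetical names, shreds are assumed to be grayscale NumPy arrays, and SciPy's linear_sum_assignment stands in for the Hungarian method. The third phase (extracting the final left-to-right sequence from the matching) is omitted.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def edge_distance_matrix(shreds):
    """Phase 1: dist[i][j] = summed pixel distance between the rightmost
    column of shred i and the leftmost column of shred j."""
    n = len(shreds)
    dist = np.full((n, n), np.inf)  # a shred never matches itself
    for i in range(n):
        right = shreds[i][:, -1].astype(float)   # rightmost pixel column
        for j in range(n):
            if i != j:
                left = shreds[j][:, 0].astype(float)  # leftmost pixel column
                dist[i, j] = np.abs(right - left).sum()
    return dist

def hungarian_matches(dist):
    """Phase 2: a one-to-one right-to-left matching minimising the total
    edge distance (the Hungarian / bipartite assignment step)."""
    finite = dist[np.isfinite(dist)]
    cost = np.where(np.isfinite(dist), dist, finite.max() * 10 + 1)
    rows, cols = linear_sum_assignment(cost)
    return dict(zip(rows.tolist(), cols.tolist()))
```

Building the matrix dominates the cost (the O(n² × h) preprocessing above); the assignment step then produces exactly one right-side partner per shred instead of greedily taking row minima.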
Fig. 3. A part of the Algorithm 1 distance matrix for the Fig. 1 document.
TABLE I. EXPERIMENTAL RUN TIME OF THE HUNGARIANS REASSEMBLING ALGORITHM MEASURED IN SECONDS FOR EACH N SHREDS
Fig. 5. The reassembled document using the Hungarians reassembling algorithm.

Fig. 6. The analytical run time complexity against the implementation run time of the Hungarians reassembling algorithm.

Fig. 7. Nearest Neighbor reassembling (NNR) and Hungarians reassembling (HR) run time.
n × n matrix produced by Algorithm 1. In every search (each row), the search space decreases by one. When a minimum is assigned as a match, the entire column is discarded from subsequent minimum searches.

Analyzing the complexity of Algorithm 4, the number of searches for the minimum equals n + (n − 1) + (n − 2) + ... + 1 = Σ_{i=1}^{n} i = n(n + 1)/2 = O(n²). Therefore, adding the first phase of building the distance matrix, which costs O(n³), results in an overall complexity of O(n³). Fig. 8 shows a comparison between the two algorithms, Nearest Neighbor reassembling (NNR) and Hungarians reassembling (HR), in terms of their run time. Since both have the same theoretical run time complexity of O(n³), the shape of a cubic function is evident for both. The difference in the constants dropped in the analysis makes HR faster than NNR.

Accuracy was defined for both algorithms, HR and NNR, by comparing the original document image with the reassembled document, strip by strip. The accuracy function finds the ratio of correctly positioned shreds to the number of all shreds. The accuracy depends on the document image and its color distribution. Hungarians reassembling shows more stable performance than nearest neighbor reassembling. Fig. 9 shows the compared accuracy results after applying both the NNR and HR algorithms with different numbers of shreds (n).

Table II shows both the experimental run time and the accuracy of the implementations of both methods, HR and NNR. In general, it is clear that HR performs better than NNR in both run time and accuracy.

V. CONCLUSION

This paper investigates the power of the Hungarian method and its ability to find the best match in order to provide an algorithmic solution for reassembling colored shredded documents. The algorithm has three phases. The first phase finds image-based similarity and produces a distance matrix. The matrix defines the distances between the left sides and the right
TABLE II. EXPERIMENTAL RUN TIME OF THE HUNGARIANS REASSEMBLING ALGORITHM MEASURED IN SECONDS FOR EACH N SHREDS
Number of shreds(n) HR Run time (sec.) NNR Run time (sec.) HR Accuracy NNR Accuracy
50 3.219 9.0668 0.96 0.96
100 15.5388 40.7209 0.95 0.87
150 28.9084 79.8505 0.92 0.84
200 47.3328 127.8999 0.90 0.84
250 65.5816 173.9221 0.88 0.82
300 120.3616 335.283 0.84 0.76
REFERENCES
Implementation and Comparative Analysis of
Semi-automated surveillance algorithms in real
time using Fast-NCC
Omer Khan, Nayab Saeed, Raheel Muzzammel, Umair Tahir and Omar Azeem
Electrical Engineering Department, University of Lahore, Lahore Pakistan
omerkhan128@gmail.com
Abstract - A chaotic environment and the irregular motion of objects create challenging conditions in the field of computer vision. Advanced target tracking techniques are used to overcome these problems, but few parameters are considered ideal in those scenarios, or those parameters are ignored. In this research, the cross-correlation technique, well known for feature extraction in image processing, is applied to target tracking. Further, normalized cross-correlation and fast normalized cross-correlation are implemented and their results compared. As these techniques require enormous computation, real-time target tracking is a factual challenge, and high-performance embedded hardware is required to implement them. In this research, the "TMS320DM642 evaluation module with TVP video decoders" digital signal processor embedded board was carefully chosen for this purpose. These techniques are implemented on the TMS320DM642 evaluation module, and their results are carefully analyzed in this research.

Keywords — Digital Signal Processor (DSP); Evaluation Module (EVM); External Memory Interface (EMIF); Synchronous Dynamic Random-Access Memory (SDRAM); Field-Programmable Gate Array (FPGA); Universal Asynchronous Receiver-Transmitter (UART); Normalized Cross-Correlation (NCC); Real time Tracking (RTT); Region of Interest (ROI); Phase Alternation Line (PAL); Computational Time (CT); Frames Per Second (FPS)

I. INTRODUCTION

The phenomenon of analyzing video sequences is known as video surveillance. Video surveillance is a demanding and vigorous area in the field of computer vision and has proved vital in data storing and displaying [1]. Video surveillance activities can be categorized into three types: manual video surveillance, semi-autonomous video surveillance, and fully autonomous systems [2]. Automated surveillance systems are required to provide target tracking and feature extraction [3]. Video surveillance by a human operator over a long time is not possible; as a solution to this problem, real-time target tracking algorithms are widely used in surveillance [4], [5]. Image processing and computer vision are areas of recent research that provide automated systems without the interference of a human being [6].

Background subtraction with alpha is used for tracking objects by calculating deviations from the background model [7]. In this technique the background is initialized with the first few frames, and an adaptive coefficient with a large value of alpha leaves a tail mark behind a moving object. The statistical method provides the difference of the whole frame with a reference frame, and the resultant frame is then grouped to create objects, which requires expensive computing [8]; it is not commonly used for real-time target tracking. The temporal differencing method uses a few consecutive frames to extract the moving object, but this technique is not very robust [9]: when an object stops moving, it fails to detect the object. Eigen background subtraction provides motion detection using an eigenspace model [10]. In this method, the dimensionality of the space constructed from sample images is reduced with the help of Principal Component Analysis (PCA); this technique adds the overhead of calculating the principal components. The correspondence-based matching algorithm takes the object of the current frame and the previous frame and then calculates the Euclidean distance [11]; on the basis of the Euclidean distance, the next location of the object is predicted, which increases the chance of a target miss.

In this research, cross-correlation, normalized cross-correlation (NCC) and fast-NCC are selected for a comparative analysis of target tracking algorithms. These algorithms select an optimized target vector and search area vector, which provide less computation and a faster processing rate. In most systems, hardware optimization is required for real-time tracking in automated systems. Dedicated hardware can be designed to perform application-specific tasks, so using redundant hardware would be much more expensive [12]. The "TMS320DM642 evaluation module with TVP video decoders" digital signal processor embedded board is selected for this research to improve the target tracking algorithm in real time [13]. The DSP on the DM642 EVM interfaces to on-board peripherals through the 64-bit wide EMIF or the three 8/16-bit wide video ports [14]. The SDRAM, Flash, FPGA, and UART [15] are each connected to one bus. The EMIF bus is
Figure 1: Proposed Flow Chart

A. Pre Processing

In this research, the system is not designed for specific targets, so pre-processing is applied to reduce noise. If the image sequence has noise, the noise should be removed. Common types of noise found in image sequences include salt-and-pepper noise: the pixels
affected by salt-and-pepper noise have colors or intensities different from those of their surrounding pixels; they are removed by applying a median filter. Another type of noise found in image sequences is Gaussian noise, where every pixel value in the image is changed by a small value. Noise removal methods for Gaussian noise include Gaussian smoothing.

i. Gaussian Filtering

In this algorithm, Gaussian smoothing is applied to the depth video only along the spatial dimensions. Each frame of the depth video is convolved with the Gaussian smoothing filter independently. This preprocessing step removes sharp changes in the video. A 4x4 Gaussian filter is applied in this research.

ii. Contrast Adjustment

In this research, histogram equalization is applied only to the search area and the target area. It is quite effective for local texture enhancement. Performing normalization on the target vector and the search area vector requires the maximum and minimum values of both vectors.

IV. MATHEMATICAL FORMULATION OF CROSS CORRELATION

Template matching is the most important element in this research for providing an efficient and effective target tracking algorithm.

A. Template Matching by Cross-Correlation

Among many techniques, cross-correlation is one of the most commonly used for template matching. Cross-correlation for template matching is motivated by the Euclidean distance. The Euclidean distance between two points is the length of the line segment connecting them; in Cartesian coordinates, if p = (p1, p2, ..., pn) and q = (q1, q2, ..., qn) are two points in Euclidean n-space, then the distance d(p, q) from p to q, or from q to p, is given by the Pythagorean formula:

d(p, q) = d(q, p) = sqrt( Σ_i (q_i − p_i)² )   (1)

where n determines the dimension of the Cartesian coordinates; here there are only the two coordinates of a frame, so the equation reduces to

d(p, q) = d(q, p) = sqrt( (q1 − p1)² + (q2 − p2)² )   (2)

It seems intuitively likely that the convolution output will be highest at places where the image structure matches the mask structure, where large image values get multiplied by large mask values. This idea can be tried by picking out part of our image to use as a mask. The squared Euclidean distance between the search area f and the target vector t can be represented by equation 3:

d²_{f,t}(u, v) = Σ_{x,y} [ f(x, y) − t(x − u, y − v) ]²   (3)

The search area f is the area extracted from the input video. Expanding equation 3 provides us with

d²_{f,t}(u, v) = Σ_{x,y} [ f²(x, y) − 2 f(x, y) t(x − u, y − v) + t²(x − u, y − v) ]   (4)

In equation 4, the term Σ t²(x − u, y − v) is a constant value, as it represents the square of the displacement of the target vector over the search area; as the target vector is constant, it remains the same for every frame. The term Σ f²(x, y) is approximately constant if the search area remains the same most of the time. Considering these two terms constant, the remaining cross-correlation term simplifies to

c(u, v) = Σ_{x,y} f(x, y) t(x − u, y − v)   (5)

This equation measures the similarity between the search area and the target vector. If the energy of the image, Σ f²(x, y), varies with position, matching using equation 5 can fail: for example, the cross-correlation between the target vector and an exactly matching region in the search area may be lower than that between the target vector and another part of the search area because of changes in lighting conditions across the image sequence.

i. Correlation Coefficient

Environmental changes cause amplitude changes in the video sequence, and these variations create challenging situations for target tracking. Normalization of the search area and the target vector is suited to this problem, yielding a cosine-like correlation coefficient.

B. NORMALIZED CROSS-CORRELATION

Plain cross-correlation faces the challenges discussed in the previous sections. To find the target vector in the search area of a two-dimensional frame, the normalized cross-correlation is calculated for every point (u, v) of the search area f against the target vector t displaced by (u, v). The equation below represents the basic equation of the normalized cross-correlation coefficient:

γ(u, v) = Σ_{x,y} [ f(x, y) − f̄_{u,v} ] [ t(x − u, y − v) − t̄ ] / sqrt( Σ_{x,y} [ f(x, y) − f̄_{u,v} ]² · Σ_{x,y} [ t(x − u, y − v) − t̄ ]² )   (6)

Here t̄ and f̄_{u,v} represent the mean values of the target vector and the search area, respectively. They are given by the following equations, with N the number of pixels in the window:

f̄_{u,v} = (1/N) Σ_{x,y} f(x, y)   (7)
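For a single displacement, the coefficient of equation 6 can be sketched as follows (an illustration only; ncc_coefficient is a hypothetical helper operating on NumPy arrays, where f_patch is the search-area window under the displaced target t):

```python
import numpy as np

def ncc_coefficient(f_patch, t):
    """Normalized cross-correlation coefficient of eq. (6) for one
    displacement: both windows are centered by their means, so the result
    is invariant to brightness offset and contrast scaling."""
    fz = f_patch - f_patch.mean()   # f(x,y) - f_bar
    tz = t - t.mean()               # t(x-u,y-v) - t_bar
    denom = np.sqrt((fz ** 2).sum() * (tz ** 2).sum())
    return float((fz * tz).sum() / denom)
```

Because of the centering and normalization, a window that matches the target up to a gain and an offset still scores close to 1, which is exactly what plain cross-correlation (equation 5) fails to guarantee under changing lighting.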
t̄ = (1/N) Σ_{x,y} t(x − u, y − v)   (8)

where the dimensions of the search area and the limits of the target vector determine the summation ranges. For a search window of size M × M and a target feature of size N × N, direct computation requires approximately N²(M − N + 1)² additions and N²(M − N + 1)² multiplications.

C. FAST NORMALIZED CROSS-CORRELATION

A fast calculation of the normalized cross-correlation is obtained by using two sum tables, one over the image function f and one over the search area energy f². The sum tables over the search area are pre-computed integrals. After calculating the sum tables, the arithmetic is efficiently reduced to only three addition/subtraction operations instead of N × N computations. Various approaches can be used to efficiently calculate the denominator (the image variances) of (6); however, this cannot be applied directly to compute the cross-correlation between the search area and the target vector shown in the numerator of (6).

i. Calculation of Numerator and Denominator

The numerator of (6) can be expressed as

c(u, v) = Σ_{x,y} f(x, y) [ t(x − u, y − v) − t̄ ]   (9)

where (9) provides the simplified term for the numerator of the normalized cross-correlation coefficient. Using these sum tables, the mean of the search area from (7) can be calculated very efficiently, independently of the size of the target vector. The double sum over the window needed in (9) can now be represented as

e(u, v) = s(u + N − 1, v + N − 1) − s(u − 1, v + N − 1) − s(u + N − 1, v − 1) + s(u − 1, v − 1)   (10)

It is clear from (10) that only three additions/subtractions are required to calculate the double sum over f(x, y) by evaluating the sum table s(u, v). The sum tables are calculated using recursive equations for the target vector. The basic functions in each overlapping template sub-image are then calculated by thresholding the image and labeling and identifying the boundaries and centers of landmark points or natural speckle patterns on the skin.

V. EXPERIMENTAL PARAMETERS

To establish understanding and gather authentic research results, these video sequences were repeatedly tested through the algorithm. The parameters selected to investigate the efficiency of the implemented algorithm are: average time to process a frame, frame rate, accuracy as a percentage of throughput and error, and power consumption.

D. Average time to process a frame

To calculate the average time to process a frame, a hardware pin is set; when the processing is completed, the pin is reset. This is monitored by a digital logic analyzer, and the time duration is recorded.

E. Frame rate

It is difficult to provide a frame rate for a whole one-minute video sequence, so we take the average number of frames over five seconds.

F. Accuracy

The accuracy of the algorithm can be determined by counting the number of frames in which the target object is found or lost. The frames in which the target is lost can be further categorized in two: target lost and false target located. In this research we considered target found and target lost only. Both are calculated with the following formulas:

Throughput % = (frames with target located / total frames) × 100

Error % = (frames with target lost / total frames) × 100

G. Power consumption

Power consumption is a major constraint in embedded hardware. It is not possible to measure the power consumed by individual units, so in this research the power of the whole system is measured.

VI. RESULTS

Experiments were conducted to provide results for the comparative analysis of cross-correlation techniques for real-time target tracking. The implemented algorithms were tested in different scenarios: four sets of video sequences with a duration of two minutes each; a normal video sequence, a chaotic video sequence, a low-contrast video sequence, and a dark video sequence.

A. Target tracking using Cross-correlation results

Cross-correlation is the basic technique implemented on the hardware to observe the results as described in the above subsections. It is not possible to compare these results with previous research, because the TMS320DM642 evaluation module with TVP video decoders is general-purpose DSP hardware; on this hardware no such tracking algorithm is
implemented. Later in this research, the reasons for the improvements and declines in the values are discussed.

Table 1: Target tracking using Cross-correlation results

Video Sequence           Normal    Chaotic   Low Contrast   Dark
Computational time (ms)  50.78     54.01     55.21          54.38
Frame Rate (FPS)         18.95     16.40     17.87          15.32
Target Located           13        11        10             8
Throughput %             72.22     68.75     62.50          53.33
Error %                  27.78     31.25     37.50          46.67
Power Consumption        20.01     22.62     21.35          22.24

B. Target tracking using NCC results

Normalized cross-correlation is implemented to deal with low-contrast images, and it provides a significant improvement on the low-contrast and dark video sequences. Comparing the two tables, it is clearly visible that all the parameters improve significantly.

Table 2: Target tracking using NCC results

Video Sequence           Normal    Chaotic   Low Contrast   Dark
Computational time (ms)  50.13     53.61     54.11          53.83
Frame Rate (FPS)         19.21     16.53     17.24          15.86
Target Located           15        12        12             10

systems. In this case, the time period is even more important for providing real-time target tracking.

Table 3: Target tracking using Fast-NCC results

Video Sequence           Normal    Chaotic   Low Contrast   Dark
Computational time (ms)  48.71     50.12     50.26          49.54
Frame Rate (FPS)         19.79     17.83     18.66          18.66
Target Located           16        12        14             13
Throughput %             84.21     70.59     77.78          76.47
Error %                  15.59     29.41     22.22          23.53
Power Consumption        17.11     18.74     16.81          17.60

It is visible from the result tables of cross-correlation, normalized cross-correlation and fast-NCC that the results of fast-NCC are considerably better than those of both previous techniques in terms of tracking time, which improves the frame rate as well. Overall, the accuracy of the system is increased, but for the video sequence with a chaotic environment the throughput falls, because during fast computation a false object is detected as the tracked object, which is also counted as target lost. Power consumption is also reduced, but for the chaotic environment power utilization is increased. The results discussed in this section are shown graphically in the charts below for comparison.
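The sum-table evaluation used by fast-NCC (equations 9 and 10 in Section C) can be illustrated with a short sketch. This is illustrative Python, not the authors' DSP implementation; running_sum_table and window_sum are hypothetical names:

```python
import numpy as np

def running_sum_table(f):
    """s(u, v) = f(u, v) + s(u-1, v) + s(u, v-1) - s(u-1, v-1),
    computed for the whole frame with two cumulative sums."""
    return f.cumsum(axis=0).cumsum(axis=1)

def window_sum(s, u, v, N):
    """Double sum of f over the N x N window with corner (u, v), eq. (10):
    at most three additions/subtractions instead of N*N - 1 additions."""
    total = s[u + N - 1, v + N - 1]
    if u > 0:
        total -= s[u - 1, v + N - 1]
    if v > 0:
        total -= s[u + N - 1, v - 1]
    if u > 0 and v > 0:
        total += s[u - 1, v - 1]
    return total
```

The table is built once per frame; every subsequent window sum (one per candidate displacement) then costs constant time, which is where the computational-time gap between Table 1 and Table 3 comes from.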
Figure 4: Comparative Analysis of Frame Rate (FPS)

Figure 5: Comparative Analysis of Throughput (%)

Figure 6: Comparative Analysis of Power Consumption (Watt)

VII. CONCLUSION

This paper presents a comparative analysis of cross-correlation, NCC and fast-NCC for the application of real-time target tracking. The computational time, frame rate, throughput and power consumption results show promising improvements for NCC and fast-NCC as compared to cross-correlation. In this research, the results show that these algorithms work in all the environmental situations considered: chaotic, low-contrast, and dark scenes.

In the future, noise filters can be added to improve the efficiency of this algorithm, and there are hardware optimization algorithms that can be applied to achieve efficient utilization of existing hardware resources. The algorithm applied in this research does not deal with circumstances in which the target object is scaled, tilted, or rotated from its original orientation. Multipoint tracking could also provide a variety of applications in the field of computer vision.

REFERENCES

[1] Ding Zhonglin and Li Li, "Research on a hybrid moving object detection algorithm in video surveillance system," Proceedings of 2011 International Conference on Computer Science and Network Technology, 2011.
[2] R. Venkatesan and A. Balaji Ganesh, "Supervised and Unsupervised Learning Approaches for Tracking Moving Vehicles," Proceedings of the 2014 International Conference on Interdisciplinary Advances in Applied Computing (ICONIAAC '14), 2014.
[3] Purshottam J. Assudani, "Dot pattern feature extraction, selection and matching using LBP, Genetic Algorithm and Euclidean distance," 2012 International Conference on Computing Communication and Applications, 2012.
[4] Seung-Taek Oh, Nak-Hyun Chun, Seung-Young Yoo, Ho-Yeop Lee and Hak-Eun Lee, "A Study on the Target 2D Tracking Analysis Using Digital Image Correlation at Bridge Deck Wind Tunnel Test," IABSE Congress Report, 2012.
[5] C. Wang, X. Chang, Y. Zhang, L. Zhang and X. Chen, "Failure Analysis of Composite Structures Based on Digital Image Correlation Method," 2017 International Conference on Sensing, Diagnostics, Prognostics, and Control (SDPC), Shanghai, 2017, pp. 473-476.
[6] A. J. H. Hii, "Fast normalized cross correlation for motion tracking using basis functions," Computer Methods and Programs in Biomedicine, 2006.
[7] G. Adhikari, S. K. Sahani, M. S. Chauhan and B. K. Das, "Fast real time object tracking based on normalized cross correlation and importance of thresholding segmentation," 2016 International Conference on Recent Trends in Information Technology (ICRTIT), Chennai, 2016, pp. 1-5.
[8] P. Hwang, K. Eom, J. Jung and M. Kim, "A Statistical Approach to Robust Background Subtraction for Urban Traffic Video," 2009 International Workshop on Computer Science and Engineering, Qingdao, 2009, pp. 177-181.
[9] T. Shibahara, T. Aoki, H. Nakajima and K. Kobayashi, "A Sub-Pixel Stereo Correspondence Technique Based on 1D Phase-only Correlation," 2007 IEEE International Conference on Image Processing, San Antonio, TX, 2007, pp. V-221-V-224.
[10] Zheng Yi and Fan Liangzhong, "Moving object detection based on running average background and temporal difference," 2010 IEEE International Conference on Intelligent Systems and Knowledge Engineering, Hangzhou, 2010, pp. 270-272.
[11] T. Cooke, "Eigen-Patch Based Background Subtraction," 2011 International Conference on Digital Image Computing: Techniques and Applications, Noosa, QLD, 2011, pp. 462-467.
[12] N. Amrouche, A. Khenchaf and D. Berkani, "Multiple target tracking using track before detect algorithm," 2017 International Conference on Electromagnetics in Advanced Applications (ICEAA), Verona, 2017, pp. 692-695.
[13] F. E. T. Munsayac, L. M. B. Alonzo, D. E. G. Lindo, R. G. Baldovino and N. T. Bugtai, "Implementation of a normalized coefficient-based template matching algorithm in number system conversion," 2017 IEEE 9th International Conference on Humanoid, Nanotechnology, Information Technology, Communication and Control, Environment and Management (HNICEM), Manila, 2017, pp. 1-4.
[ 14 ] M. V. G. Rao, P. R. Kumar and A. M. Prasad, "Implementation
of real time image processing system with FPGA and DSP,"
2016 International Conference on Microelectronics,
Computing and Communications (MicroCom), Durgapur,
2016, pp. 1-4. doi: 10.1109/MicroCom.2016.7522496
[ 15 ] X. Nguyen, L. Nguyen, T. Bui and H. Huynh, "A real-time
DSP-based hand gesture recognition system," 2012 IEEE
International Symposium on Signal Processing and
Information Technology (ISSPIT), Ho Chi Minh City, 2012,
pp. 000286-000291.
[ 16 ] Y. XiaoPing and L. Jieyun, "Hardware Design of Video
Stabilization System Based on TMS320DM642," 2010 Fourth
International Conference on Genetic and Evolutionary
Computing, Shenzhen, 2010, pp. 86-89.
Adaptive Control of Nonaffine Nonlinear Systems by
Neural state Feedback
M. Bahita
Department of Chemical Engineering, Faculty of Process Engineering, Constantine 3 University, Constantine 25000, Algeria
mbahita@yahoo.fr

K. Belarbi
Ecole nationale polytechnique de Constantine, University of Constantine 3, Constantine 25000, Algeria
kbelarbi@yahoo.com
Abstract— In this paper, a new control method for a class of single input single output nonaffine nonlinear systems is considered using radial basis function (RBF) neural networks (NNs). Firstly, the existence of an ideal implicit feedback linearization control is established based on the implicit function theorem. An online RBF system is introduced to approximate this ideal implicit feedback linearization law. The proposed neural fuzzy adaptive controller ensures that the system output tracks a given bounded reference signal, while closed-loop stability results are provided and guaranteed using Lyapunov theory. The effectiveness of the proposed controller is illustrated through a simulation of a nonaffine nonlinear system.

Keywords—Adaptive control; Nonaffine nonlinear systems; Neural networks; Implicit function theorem.

I. INTRODUCTION

Artificial neural networks have gone through a rapid development and grown past the experimental stage to become implemented in a wide range of engineering applications, for example state estimation, pattern recognition, signal processing, process modeling, process quality control and data reconciliation [1-6].

Neural networks (NNs) are capable of modeling nonlinear systems [7-8]. On the basis of supplied training data, a neural network learns (trains) the relationship between the process input and output. The training sets consist of one or more input data and one or more output data. After the training of the network, a test set of data should be used to verify whether the desired relationship was learned. In practical applications a neural network can be used when the exact model is not known; it is a good example of a 'black-box' technique. With the combination of neural networks and adaptive systems [4-5], the control techniques for most complex systems have been improved. Adaptive control has found extensive applications for plants that are complex and ill-defined [9]. Mathematical models might not be available for many complex systems in practice, and the adaptive control problem of these systems is far from being satisfactorily resolved. Most adaptive controllers involve certain types of function approximators in their learning mechanism.

Fuzzy logic systems and artificial neural networks [10-11] have been widely used as adjustable components in adaptive control. In particular, these systems are introduced to approximate unknown nonlinear functions in nonlinear systems in the form of a linear regression with respect to unknown parameters, so that the well-developed adaptive control techniques can then be applied.

Adaptive NN design methods have been proposed to control affine nonlinear systems. In practice, many physical systems are inherently nonlinear and nonaffine, and their input variables may enter the systems nonlinearly. To solve the control problem for nonaffine nonlinear systems, several works have been proposed [7], [12-13]; for a comprehensive survey, see [14].

Fuzzy logic belongs [10] to a class of knowledge-based systems. The main advantage of fuzzy logic is the possibility of implementing human expert knowledge in the form of linguistic if-then rules, and it provides a mathematical formalism for implementing these rules in the form of a computer program. Fuzzy logic is a rigorous mathematical field offering very interesting solutions for control. Moreover, it offers methods to control nonlinear plants known to be difficult to model, and it can also be used as an estimation technique or approximator in adaptive control, where the parameters are updated during plant operation.

In this work, based on our previous related works [15-16], we use the same fuzzy logic system of Mamdani type to approximate a nonlinear term that appears in the adaptation law of the RBF controller parameters. The radial basis function (RBF) controller is used in a direct neural fuzzy adaptive control structure for a class of single input single output (SISO) unknown and nonaffine nonlinear systems. More specifically, the RBF controller is used online to approximate the unknown implicit feedback linearization
Since ∂v/∂u = 0, the partial derivative of f(x, u) − v with respect to the input u satisfies

∂(f(x, u) − v)/∂u = ∂f(x, u)/∂u > 0   (10)

Thus, based on the implicit function theorem [17], we know that the nonlinear algebraic equation f(x, u) − v = 0 is locally solvable for the input u for each (x, v). Thus, there exists some ideal controller u*(x, v) satisfying the following equality for all (x, v) ∈ Ω_x × R:

f(x, u*(x, v)) − v = 0   (11)

Therefore, if the control input u is chosen as the ideal control law, i.e., u = u*, the closed-loop error dynamics (9) reduce to

ė = A_c e   (12)

Define the following positive Lyapunov function:

V = (1/2) eᵀ P e   (13)

Differentiating V with respect to time, and using (12) and (7), we obtain

V̇ = −(1/2) eᵀ Q e   (14)

We conclude that V̇ is a negative semi-definite function and that the tracking error e(t) and its derivatives e⁽ⁱ⁾(t), i = 1, ..., n − 1, go to zero as t goes to ∞.

However, the implicit function theorem only guarantees the existence of the ideal controller u*(x, v) for system (1); it does not prescribe a technique for constructing it, even if the dynamics of the system are well known. In the following, a neural network of RBF type is used to construct this unknown ideal implicit controller.

III. THE NEURAL NETWORK ADAPTIVE CONTROLLER

The RBF network (as described in [15, 16]) can be considered as a two-layer network with only one hidden layer. The output depends linearly on the weights. More explicitly, the output of an RBF neural network system can be put in the following form:

u_c(x, θ) = θᵀ ξ(x) = Σ_{i=1}^{nr} ξ_i θ_i,  with ξ_i = ψ(‖x − c_i‖₂)   (15)

The most used basis function is the Gaussian function. θᵀ = [θ₁ᵀ θ₂ᵀ ... θ_nᵀ] contains all the adjustable parameters, and ξ(x) is a vector of radial basis functions. It has been proven that (15) can approximate, over a compact set Ω_Z, any smooth function up to a given degree of accuracy [21].

Let u* be the ideal implicit unknown controller that makes the tracking error e = y_m − y as small as possible. The parameter update will be designed so as to minimize the error e_u between u* and the output u_c(x, θ) = θᵀ ξ(x) of the actual RBF neural controller, with

e_u = u* − u_c(x, θ)   (16)

This leads to the cost function

J = min (u* − u_c(x, θ))² / 2   (17)

Based on the gradient descent law, the connection weights of the RBF network controller are adjusted under the following law:

θ̇ = −γ ∂J/∂θ   (18)

with γ > 0 the learning rate, and

∂J/∂θ = −e_u ∂u_c/∂θ   (19)

Using (15) and (19), (18) can be written as

θ̇ = γ e_u ∂u_c/∂θ = γ e_u ξ(z)   (20)

As e_u is unknown, we estimate it by a fuzzy system of Mamdani type with output ê_u, based on the work done in [22]; we then obtain the new law

θ̇ = γ ê_u ξ(z)   (21)

We first note that the update law (21) does not guarantee the boundedness of the weights. In order to ensure boundedness of the weights, we use the so-called e-modification [23]:

θ̇ = γ′ ê_u ξ(z) − γ′ |ê_u| ν₀ θ   (22)

where ν₀ > 0 is a design constant.

Remark 2. To summarize, our adaptive controller shown in Fig. 3 consists of three blocks: an RBF controller, a Mamdani fuzzy estimator of the control error, and the adaptation mechanism.
nonlinear system [12], [14], which is described by the following differential equations:

with r = ‖x − c_i‖₂ and a width σ = 1.8. The following initial
Fig. 5. Tracking error e.

When comparing our results briefly with the works done in [12] and [14], we can observe that the evolution of the states x1 and x2 in our work is the same as in [12], and both are better than the results obtained in [14]. The evolution of the tracking error around zero, as shown in Fig. 5, confirms the obtained results.

V. CONCLUSION

REFERENCES
[1] Xiongbo Wan, Zidong Wang, Min Wu, Xiaohui Liu, "H-infinity State Estimation for Discrete-Time Nonlinear Singularly Perturbed Complex Networks Under the Round-Robin Protocol," vol. 30, no. 2, pp. 415–426, 2019.
[2] Bingrong Xu, Qingshan Liu, Tingwen Huang, "A Discrete-Time Projection Neural Network for Sparse Signal Reconstruction With Application to Face Recognition," vol. 30, no. 1, pp. 151–162, 2019.
[3] Massimiliano Luzi, Maurizio Paschero, Antonello Rizzi, Enrico Maiorino, Fabio Massimo Frattale Mascioli, "A Novel Neural Networks Ensemble Approach for Modeling Electrochemical Cells," vol. 30, no. 2, pp. 343–354, 2019.
[4] Xiucai Huang, Yongduan Song, Junfeng Lai, "Neuro-Adaptive Control With Given Performance Specifications for Strict Feedback Systems Under Full-State Constraints," vol. 30, no. 1, pp. 25–34, 2019.
[5] Yan-Jun Liu, Shu Li, Shaocheng Tong, C. L. Philip Chen, "Adaptive Reinforcement Learning Control Based on Neural Approximation for Nonlinear Discrete-Time Systems With Unknown Nonaffine Dead-Zone Input," vol. 30, no. 1, pp. 295–305, 2019.
[6] B. Roffel and B. Betlem, Process Dynamics and Control: Modeling for Control and Prediction. John Wiley and Sons, 2006.
[7] S. L. Dai, C. Wang, M. Wang, "Dynamic Learning From Adaptive Neural Network Control of a Class of Nonaffine Nonlinear Systems," IEEE Transactions on Neural Networks and Learning Systems, vol. 25, no. 1, pp. 111–123, 2014.
[8] J. L. Tao, Y. Yang, D. H. Wang, C. Guo, "A robust adaptive neural networks controller for maritime dynamic positioning system," Neurocomputing, vol. 110, no. 1, pp. 128–136, 2013.
[9] K. J. Åström and B. Wittenmark, Adaptive Control, 2nd ed. Addison-Wesley, 1995.
[10] G. Feng, "A Survey on Analysis and Design of Model-Based Fuzzy Control Systems," IEEE Trans. Fuzzy Syst., vol. 14, no. 5, pp. 676–697, 2006.
[11] R. E. Precup and H. Hellendoorn, "A survey on industrial applications of fuzzy control," Computers in Industry, vol. 62, pp. 213–226, 2011.
[12] S. Labiod and T. M. Guerra, "Adaptive fuzzy control of a class of SISO nonaffine nonlinear systems," Fuzzy Sets and Systems, vol. 158, no. 10, pp. 1126–1137, 2007.
[13] M. Chen and S. S. Ge, “Direct adaptive neural control for a class of
uncertain nonaffine nonlinear systems based on disturbance observer,”
IEEE Transactions on Cybernetics, vol. 43, no. 4, pp. 1213–1225, 2013.
[14] Chaojiao Sun, Bo Jing, and Zongcheng Liu, “Adaptive Neural Control
of Nonaffine Nonlinear Systems without Differential Condition for
Nonaffine Function,” Hindawi Publishing Corporation, Mathematical
Problems in Engineering., vol. 16, 1, pp. 1-11, 2016.
[15] M. Bahita and K. Belarbi, “On-line Neural Network, Adaptive Control
of a Class of Nonlinear Systems Using Fuzzy Inference Reasoning,”
Rev. Roum. Sci. Techn. - Électrotech. et Énerg., vol. 54, no. 1, pp. 401-410, Bucharest, 2015.
[16] M. Bahita and K. Belarbi, “Radial Basis Function Controller of a Class
of Nonlinear Systems Using Mamdani Type as a Fuzzy Estimator,”
Procedia Engineering, vol. 41, pp. 501 – 509, 2012.
[17] KC Border, “Notes on the Implicit Function Theorem, “ Caltech:
Division of the Humanities and Social Sciences, pp. 1 - 21, 2018.
[18] C. Darken and J. Moody, "Fast adaptive k-means clustering: Some
empirical results,” International Joint conference on Neural Networks,
2, pp. 233-238, 1990.
[19] M. Bahita and K. Belarbi, “Neural Stable Adaptive Control for a Class
of Nonlinear System Without Use of a Supervisory Term in The
Control Law,” Journal of Engineering Science and Technology, Vol. 7,
No. 1, pp. 97 – 118, February 2012.
[20] M. Bahita and K. Belarbi, Fuzzy and Neural Adaptive Control of a Class
of Nonlinear Systems. ISBN: 978-3-8484-8920-6, LAP LAMBERT
Academic Publishing GmbH & Co. KG Heinrich-Böcking-Str. 6-8,
66121, Saarbrücken, Germany, 2012.
[21] T. P Chen and H. Chen, “Approximation capability to functions of
several variables, nonlinear functionals, and operators by radial basis
function neural networks,” IEEE. Trans. Neural Networks, vol. 6, no. 4,
pp. 904-910, 1995.
[22] M. Bahita and K. Belarbi, “ Real-time application of a fuzzy adaptive
control to one level in a three tank system,” Journal of systems and
control engineering, vol. 232 no. 7, pp. 845-856, 2018.
[23] P. A. Ioannou, J. Sun, “ Robust Adaptive Control,” Prentice-Hall, 1996.
Would it be Profitable Enough to Re-adapt
Algorithmic Thinking for Parallelism Paradigm
1st Aimad Eddine Debbi 2nd Abdelhak Farhat Hamida 3rd Haddi Bakhti
dept. of informatics. dept. of electronics dept. electronics
Mohamed Boudhiaf university Ferhat Abbas university Mohamed Boudhiaf university
M’sila, Algeria Setif, Algeria M’sila, Algeria
aimad-eddine.debbi@univ-msila.dz a ferhat h@yahoo.fr ahmed3791@gmail.com
Abstract—Much of the progress in computing system components is devoted today to providing more support for parallelism. This is likely to afford many opportunities for High Performance Computing (HPC) application developers, who are now able to accelerate run-times progressively. Adapting algorithmic writing to the parallelism paradigm is likely to lead to additional improvements in run-times. This paper deals with this matter. We carry out empirical measurements to assess how worthwhile it is to re-adapt algorithmic thinking to the parallel processing context. We provide thorough comparisons of achievable accelerations among a number of different kinds of sorting algorithms. We use a proprietary framework, previously meant to serve as a front-end kernel in an automatic parallelization compiler, and we extend it with interpolation to make performance predictions for large-scale parallelization. Sequential, semi-parallel, and parallel algorithms for the sorting problem are all involved in the empirical tests, considering different distributions of randomized input records. The results allow us to estimate how much the design of specific parallel algorithms could be more profitable than the parallelization of serial programs.

Index Terms—parallelism paradigm; workload characterization; profiling; inherent parallelism assessment; static analysis

I. INTRODUCTION

Almost all of today's computing platforms are parallel systems. They may embed at once several many/multi-core CPUs, a number of GPU clusters, and even many FPGA chips. This affords good opportunities for High Performance Computing (HPC) application developers, but exploiting those innovative architectures and accelerating run-times by parallelization is still a challenging task. Automatic parallelization frameworks [1]–[3] sometimes fail to carry out a total parallelization of sequential programs. They are usually not able to process appropriately some critical code parts containing complicated interdependencies, which makes execution crash. Semi-automatic parallelization tools like CUDA, OpenMP, and OpenACC need a large involvement of end-developers, who have to specify explicitly the parallel parts in programs. Their task is still hard when they deal with sequential programs, except if the algorithms to parallelize contain explicit parallel parts.

In many cases, several algorithms may be valid solutions for a single problem. However, they generally have characteristics that make some of them more appropriate for parallelization than others. The intrinsic parallel potential of an algorithm and its ability to be effortlessly parallelized are the two most important properties that favor its adoption for parallelization. This paper deals with the impact of those two features on parallelization issues. The present investigation aims to bring some clarification about which would be more advantageous: parallelizing serial programs, or seeking to produce parallel algorithms that are not difficult to implement on parallel systems. Such an investigation may be extended to predict the compatibility within the triplet (algorithms, target architectures, parallel programming paradigms).

In the same vein, we suggest a new metric to allow fairer comparisons among parallelization approaches. Instead of using the absolute speedup for comparisons, we propose to consider a relative speedup, given as the ratio of the speedup to the maximum achievable speedup. This maximum achievable speedup is determinable by a profiler we have previously proposed in [4].

We provide in this paper a large set of empirical tests to assess achievable speedups in a number of sorting algorithms. We have chosen to consider three classes of algorithms: sequential algorithms, parallel sorting algorithms, and semi-parallel algorithms for the sorting problem. Tests are extended using regression to obtain results for large-scale problems. That set of instrumentations allows us to appreciate how profitable parallel algorithms are and how much effort is required for their implementation.

II. RELATED WORKS

The present work bears similarities with many studies [5]–[9] on workload characterization. It is even closer to those dealing with the parallelization problem [8], [10]–[13]. We share with them a common objective, in the sense that we aim to attenuate the difficulties stemming from parallelization. Speedup estimation is a central challenge addressed in almost all of those works; in particular, some of them stem from efforts toward automatic parallelization.

Peruse [6] is an LLVM-based profiling tool designed to characterize loop features and help developers recognize the
Fig. 2. Maximum achievable speedup in the bubble sort algorithm for large sizes of records that follow a Poisson distribution.

Fig. 3. Maximum achievable speedup in the bubble sort for several small sizes of records that follow a geometric distribution.

Fig. 4. Maximum achievable speedup in the bubble sort algorithm for large sizes of records that follow a geometric distribution.

Fig. 5. Maximum achievable speedup in the bubble sort algorithm for several small sizes of records that follow a uniform distribution.

Fig. 6. Maximum achievable speedup in the bubble sort algorithm for large sizes of records that follow a uniform distribution.

2) Sort by Insertion: Likewise, inherent parallelism is evaluated in the insertion sort algorithm considering different
listing. 1 : Pseudo-code for quick sort algorithm

Quick(bottom, top) {
    pivot = Make_dichotomous_parts&getpivot();
    Quick(bottom, pivot);
    Quick(pivot + 1, top);
}

The "Quick()" function call can be mapped to concurrent threads. However, the function Make_dichotomous_parts&getpivot() is a sequential fragment; any algorithm is likely to contain a portion of code that must remain sequential. Each thread recursively spawns two threads that handle the function calls Quick(bottom, pivot) and Quick(pivot + 1, top). The results of the tests are given in Fig. 9 and Fig. 10.

The merge sort can be implemented in a recursive form or in a non-recursive form. In listing 2 we give a variant of pseudo-code for the recursive form of the merge sort. Once again, only a dichotomous partitioning is applied, and every thread spawns two threads for handling the recursive calls. In addition, this pseudo-code is given here to

listing. 2 : Pseudo-code for merge sort algorithm in a recursive form

Fig. 8. Maximum achievable speedup in the sort by insertion algorithm considering large sizes of records that follow a geometric distribution.

Fig. 9. Maximum achievable speedup in the quick sort algorithm considering several record sizes that follow a geometric distribution.

V. INTERPRETATIONS

The results of these empirical tests allowed the following observations. The intrinsic parallel potential increases proportionally as the size of the domain rises, following linear or hyperbolic asymptotes obtained by linear or quadratic interpolation.
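To make the thread mapping of listing 1 concrete, the sketch below (ours, not the paper's code) spawns the two recursive calls as concurrent threads; partition() is a plain Lomuto partition standing in for the Make_dichotomous_parts&getpivot() routine.

```python
import threading

def partition(a, lo, hi):
    # Stand-in for Make_dichotomous_parts&getpivot(): the sequential
    # fragment that splits a[lo:hi+1] around a pivot (Lomuto scheme).
    pivot = a[hi]
    i = lo - 1
    for j in range(lo, hi):
        if a[j] <= pivot:
            i += 1
            a[i], a[j] = a[j], a[i]
    a[i + 1], a[hi] = a[hi], a[i + 1]
    return i + 1

def quick(a, lo, hi):
    if lo >= hi:
        return
    p = partition(a, lo, hi)  # sequential part of each call
    # Each call spawns two threads for the recursive halves,
    # mirroring the dichotomous partitioning of listing 1.
    left = threading.Thread(target=quick, args=(a, lo, p - 1))
    right = threading.Thread(target=quick, args=(a, p + 1, hi))
    left.start(); right.start()
    left.join(); right.join()

data = [5, 2, 9, 1, 7, 3]
quick(data, 0, len(data) - 1)  # data is sorted in place
```

Note that in CPython the global interpreter lock prevents true CPU parallelism for such threads, and spawning two threads per call is wasteful for small slices; the sketch only illustrates the dichotomous task decomposition, not an efficient implementation.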
Fig. 11. Maximum achievable speedup in the merge sort algorithm considering several record sizes that follow a geometric distribution.

Fig. 12. Maximum achievable speedup in the merge sort algorithm considering large sizes of records that follow a geometric distribution.

Smax is the maximum achievable speedup indicated in the curve of Fig. 10 for the size N; i.e., we can achieve a ×5.3 speedup when the domain size is 64 records, and a ×39.2 speedup when the domain size is 1024 records.

VI. CONCLUSION

Many algorithms, even though they appear to be sequential in nature, contain at large scale a considerable amount of inherent parallelism, so at large scale we generally have good opportunities to obtain favorable accelerations by scheduling them for parallel implementation. In many cases, however, they may be written in forms that are not easy to implement in a parallel way, i.e., when the algorithms contain deeply nested loops with complex dependencies. Such nested loops are hard to handle for automatic parallelizers performing loop transformations, and they cannot be mapped easily to parallel threads. Applying the concept of "divide to rake" is suitable for parallelization, and parallel algorithms admit the application of this concept. In the quick sort and merge sort algorithms we applied a dichotomous partitioning. Our profiling has shown that we have little to no chance of obtaining good accelerations with quick sort in the form indicated earlier. The merge sort contains a considerable amount of inherent parallelism, and since it is possible to map its functions to threads, it appears to be the most profitable scenario for parallelization. Quick sort should not be considered absolutely poor for parallelization: firstly, enhancing the function "Make_dichotomous_parts&getpivot()" and, secondly, increasing the degree of partitioning may both considerably improve its inherent parallel potential. That matter of improving parallel forms may be the subject of a separate future investigation.

REFERENCES

[1] H. Bae, D. Mustafa, J. W. Lee, Aurangzeb, H. Lin, C. Dave, R. Eigenmann, and S. P. Midkiff, "The Cetus source-to-source compiler infrastructure: Overview and evaluation," Int J Parallel Prog, vol. 41, pp. 753–767, December 2013.
[2] S. Campanoni, T. M. Jones, G. Holloway, G. Y. Wei, and D. Brooks, "Helix: making the extraction of thread-level parallelism mainstream," IEEE Micro, vol. 32, pp. 8–18, 2012.
[3] C. Dave, H. Bae, S. Min, S. Lee, R. Eigenmann, and S. Midkiff, "Cetus: A source-to-source compiler infrastructure for multicores," Computer, vol. 42, no. 12, pp. 36–42, December 2009.
[4] A. E. Debbi and H. Bakhti, "Incremental Banerjee test conditions committing for robust parallelization framework," Turk J Elec Eng Comp Sci, vol. 26, pp. 2595–2604, May 2018.
[5] D. Jeon, S. Garcia, C. Louie, and M. B. Taylor, "Kismet: Parallel speedup estimates for serial programs," in Proceedings of the 2011 ACM International Conference on Object Oriented Programming Systems Languages and Applications, ser. OOPSLA '11. ACM, 2011, pp. 519–536.
[6] S. Kumar, V. Srinivasan, A. Sharifian, N. Sumner, and A. Shriraman, "Peruse and profit: Estimating the accelerability of loops," in Proceedings of the 2016 International Conference on Supercomputing, ser. ICS '16. ACM, 2016, pp. 21:1–21:13.
[7] V. H. F. Oliveira, A. F. A. Furtunato, L. F. Silveira, K. Georgiou, K. Eder, and S. Xavier-de-Souza, "Application speedup characterization: Modeling parallelization overhead and variations of problem size and number of cores," in Companion of the 2018 ACM/SPEC International Conference on Performance Engineering, ser. ICPE '18. ACM, 2018, pp. 43–44.
[8] M. Kumar, "Measuring parallelism in computation-intensive scientific/engineering applications," IEEE Transactions on Computers, vol. 37, no. 9, pp. 1088–1098, September 1988.
[9] A. Ketterlin and P. Clauss, "Profiling data-dependence to assist parallelization: Framework, scope, and optimization," in Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO-45. IEEE Computer Society, 2012, pp. 437–448.
[10] A. Elnashar and S. Aljahadli, "Experimental and theoretical speedup prediction of MPI-based applications," Computer Science and Information Systems, vol. 10, pp. 1247–1267, June 2013.
[11] D. Jeon, S. Garcia, C. Louie, S. Kota Venkata, and M. B. Taylor, "Kremlin: Like gprof, but for parallelization," in Proceedings of the 16th ACM Symposium on Principles and Practice of Parallel Programming, ser. PPoPP '11. ACM, 2011, pp. 293–294.
[12] S. L. Graham, P. B. Kessler, and M. K. McKusick, "Gprof: A call graph execution profiler," in Proceedings of the 1982 SIGPLAN Symposium on Compiler Construction, ser. SIGPLAN '82. ACM, 1982, pp. 120–126.
[13] S. Garcia, D. Jeon, C. M. Louie, and M. B. Taylor, "Kremlin: Rethinking and rebooting gprof for the multicore age," in Proceedings of the 32nd ACM SIGPLAN Conference on Programming Language Design and Implementation, ser. PLDI '11. ACM, 2011, pp. 458–469.
[14] Y. S. Shao, B. Reagen, G. Wei, and D. Brooks, "Aladdin: A pre-RTL, power-performance accelerator simulator enabling large design space exploration of customized architectures," in 2014 ACM/IEEE 41st International Symposium on Computer Architecture (ISCA), 2014, pp. 97–108.
[15] M. A. Kim and S. Edwards, "Computation vs. memory systems: Pinning down accelerator bottlenecks," AMAS-BT: 3rd Workshop on Architectural and Microarchitectural Support for Binary Translation, pp. 86–98, June 2010.
[16] K. Hoste, A. Phansalkar, L. Eeckhout, A. Georges, L. K. John, and K. D. Bosschere, "Performance prediction based on inherent program similarity," Proceedings of the 15th International Conference on Parallel Architectures and Compilation Techniques, pp. 114–122, 2006.
[17] Y. Yang, P. Yu, and Y. Gan, "Experimental study on the five sort algorithms," in 2011 Second International Conference on Mechanic Automation and Control Engineering, July 2011, pp. 1314–1317.
[18] Z. Yildiz, M. Aydin, and G. Yilmaz, “Parallelization of bitonic sort
and radix sort algorithms on many core gpus,” in 2013 International
Conference on Electronics, Computer and Computation (ICECCO),
November 2013, pp. 326–329.
[19] M. H. Durad and M. N. A., "Performance analysis of parallel
sorting algorithms using mpi,” in 2014 12th International Conference
on Frontiers of Information Technology, December 2014, pp. 202–207.
[20] Z. Cheng, K. Qi, L. Jun, and H. Yi-Ran, “Thread-level parallel algorithm
for sorting integer sequence on multi-core computers,” in 2011 Fourth
International Symposium on Parallel Architectures, Algorithms and
Programming, December 2011, pp. 37–41.
Affordable and Portable Realtime Saudi License
Plate Recognition using SoC
Loay Alzubaidi Ghazanfar Latif Jaafar Alghazo
Department of Computer Science, Department of Computer Science, Department of Computer Engineering,
Prince Mohammad bin Fahd Prince Mohammad bin Fahd Prince Mohammad bin Fahd
University,Al Khobar, Saudi Arabia. University,Al Khobar, Saudi Arabia. University,Al Khobar, Saudi Arabia.
lalzubaidi@pmu.edu.sa glatif@pmu.edu.sa jghazo@pmu.edu.sa
Abstract— Stand-alone single-board computers (SoC) have become so inexpensive and yet so powerful that they have paved the way for easily developing fully automated systems. SoC systems are equipped with sensors, cameras, and various embedded systems that allow developing systems that interact with the surrounding environment. Therefore, the task of capturing images of license plates and using Optical Character Recognition (OCR) techniques to recognize the numerals and characters allows for developing an inexpensive License Plate (LP) recognition system. LP systems are important and can be used for various applications, from traffic control and toll payment to parking access. This paper proposes a Raspberry Pi based LP recognition system for the Arabic/English characters and numerals on license plates used in Saudi Arabia. The proposed process comprises the phases of preprocessing, segmentation, feature extraction, and classification. At the end of the preprocessing phase, the characters and numerals are segmented. Pixel distribution and horizontal projection profiles are used in the feature extraction phase for the segmented image. A distance classifier and the k-nearest neighbors classifier are used in the classification phase. The proposed system achieved an accuracy of 90.6%. The advantage of such a system is its low cost and portability, making it affordable and easily deployable in any location.

Keywords— Single Board Computer; Raspberry Pi; Saudi License Plate; Real-time Number Plate Recognition; KNN

I. INTRODUCTION

These days, everything tends to move toward automation. People used to deal with everything manually; for example, people used to open gates manually, which means that users had to stop the vehicle and wait for someone to check their authorization before passing the gate. This process requires at least one man to stand by the gate, check the vehicle, open the gate manually, and then close it. The invention of remote-controlled garage doors had a great impact on making consumers' lives easier: the security person can open and close the gate with the press of a button. As technology improves, the lives of consumers become easier still. Thus, this system aims to have the gate open automatically, without needing a person to spend his whole day standing by to press a button.

The system approaches the same idea in an easy and automated way by recognizing the vehicle's plate number; if it is authorized, the system automatically opens the gate using a low-cost embedded system. One of the biggest advantages of automation is ensuring the quality and consistency of the product, without forgetting the important aspect of security. The system automates the functionality of gate systems by using a unique sign for opening the gates. In other words, each individual vehicle has its own unique plate, which goes through identification and security processes [1].

This research aims at the design and implementation of a plate number recognition system. Unlike a gate opener that uses a remote control in the hand of a human as a third party, the system takes a picture of a detected approaching vehicle, analyses the images, and only opens the gate when a recognized vehicle plate is identified. The main objective of the research is to develop a real-time, fully automated number plate recognition system based on the Raspberry Pi, which is the main component of the system. The system will be able to detect the vehicle, recognize the plate, compare it with the database, and control the gate.

II. BACKGROUND

With the start of the 20th century, the automobile industry boomed and the number of motorized vehicles increased rapidly. From 1890 to 1910 the world witnessed a transition from horses to automobiles. As the number increased, law enforcement officials started facing issues in maintaining vehicle records and tracing vehicles. As a result, the first number plate was introduced by France in 1890, and Germany followed by introducing plates in 1896. In the United States, Massachusetts was the first state to introduce number plates, in 1903, with proper vehicle registration and driver's license registration. The Netherlands became the first country to introduce a national license plate, in 1899, starting with license plate number 1, which had reached 2001 by 1906, as they selected a different way of numbering license plates [2]. Fig. 1 shows some of the initial number plates introduced by different countries.

In 1938, the first oil well was discovered in Saudi Arabia. However, because of World War II in 1939, the Saudi government delayed the development programs and research on the oil industry until 1946. From 1946 to 1950, the Kingdom of Saudi Arabia witnessed a revolution in the oil industry, which raised the country's economy; in this period traffic in Saudi Arabia was on the rise, which led to the development of the licensing plate to register the necessary information regarding automobile owners. The first license plates in Saudi Arabia appeared in 1950-1962, and they differed from one region to another, as shown in Fig. 2. In 1972, license plates were established in the entire country with different types of use (private, bus, taxi, and truck), as shown in Fig. 3. However, in 2007 the design was changed once again, because the license plates were not enough for the demand and population increase, as shown in Fig. 4 [3]. The new version was different from previous ones; the 1996 series was considered the most preferred by the majority of the public.
Fig. 4. New license plate design to meet the increased population demands
the minimum range at which it can give a distance is 2 cm. The beam angle of the generated waves is about 15 degrees. To trigger a measurement, the trig pin is held high for 10 μs; the sensor then emits an 8-cycle burst of ultrasound at 40 kHz, called a "sonic burst" [9], and the echo pin goes high until the pulse hits an object and bounces back to the sensor. The range can be calculated from the time between sending the trigger signal and receiving the echo signal by using equation (1), as shown in Fig. 9:

range = (t_echo × v_sound) / 2    (1)

Fig. 8: HC-SR04 ultrasonic sensor pins description
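Equation (1) can be turned into a small helper. This is a sketch of ours rather than the paper's code: the GPIO timing is abstracted into the measured echo pulse width, and 343 m/s assumes the speed of sound in air at about 20 °C.

```python
def hc_sr04_range_cm(echo_pulse_s, speed_of_sound_m_s=343.0):
    # range = (t_echo * v) / 2: the burst travels to the obstacle and back,
    # so only half of the round-trip distance is the range (equation 1).
    return echo_pulse_s * speed_of_sound_m_s / 2.0 * 100.0

# An echo pulse of about 1.166 ms corresponds to roughly 20 cm
distance_cm = hc_sr04_range_cm(0.001166)
```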
For capturing images of the vehicle, we used the Logitech C310 camera, which needs a supply voltage of 5 V from the USB port in addition to 100 mA of current, giving it, according to Ohm's law, an internal resistance of 50 Ω. The camera captures images at a 5-megapixel resolution and HD video at 1280×720 pixels.

Fig. 10: Servo motor pins and PWM cycle

C. License Plate Recognition

The first license plate recognition method used in the proposed system was based on Tesseract-OCR. Tesseract was originally developed in 1994 by the Hewlett-Packard (HP) Laboratories and was further improved in 1998 to support C++ on Windows [11]. In 2005 HP made Tesseract open source, and from 2006 onward Google has been making changes to further enhance it. Currently the OCR supports recognition of more than 100 languages [12].
K-nearest neighbor (KNN) is a supervised classifier with the ability for instance-based learning [13]. Training samples along with their attributes are used to classify a new object, subsequently determining the nearest neighbor of any instance through the use of various algorithms [14]. Classification in KNN requires analyzing similar groups. KNN works very well with multi-modal classes and is known to be an accurate method. However, in KNN all features are treated equally when computing similarities, which may lead to classification errors, especially when the feature set is small.

Fig. 7: Description of Raspberry Pi 3 Board Components
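The nearest-neighbour voting described above can be sketched in a few lines. The feature vectors and labels below are invented for the example; a real system would use the pixel-distribution and projection-profile features of the segmented characters.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    # Distance to every training sample; all features weighted equally,
    # which is the source of the sensitivity noted in the text.
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]
    # Majority vote among the k nearest neighbours
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical 2-D feature vectors for two character classes
X = np.array([[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9]])
y = ["A", "A", "B", "B"]
label = knn_predict(X, y, np.array([0.15, 0.15]))  # -> "A"
```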
TABLE I. ENGLISH TO ARABIC LETTERS MAPPING

No | Arabic letter | English letter | Description
1  | ا  | A | ***
2  | ب  | B | ***
3  | ح  | J | No English letter is similar to the pronunciation of the letter (ح)
4  | د  | D | ***
5  | ر  | R | ***
6  | س  | S | ***
7  | ص  | X | (S) was already reserved for the letter (س), and (C) is similar to (G)
8  | ط  | T | ***
9  | ع  | E | ***
10 | ق  | G | ***
11 | ك  | K | ***
12 | ل  | L | ***
13 | م  | Z | (M) is similar to (N) and is thus rejected: too wide
14 | ن  | N | ***
15 | ھـ | H | ***
16 | و  | U | (W) is rejected: too wide
17 | ي  | V | (Y) is rejected: too high

Fig. 11: A) Original image, B) gray-scaled image, C) threshold-based binary image, D) image after finding all contours, E) image after finding possible characters, F) image after finding all vectors of matching characters, G) boundary of matching characters of the plate part, H) extracted English-letters part of the plate, I) extracted numbers part of the plate.
With the same concept, KNN is used as the algorithm for character detection. The algorithm first needs to be trained on a certain set of characters; it is then ready to compare what it sees with what it has been trained on. Understanding the concept of KNN is not enough to implement it in a real case, since the input image will not be as clean as the algorithm would like it to be, so a set of image processing steps is needed to prepare the image for extracting the information in it, then to look for suitable matches and assess each one to decide whether it qualifies as a character or not [15]. The process has two main parts: the first is locating the plate in the image, and the second is detecting the characters in the plate itself using KNN. If the first part fails to locate a plate, the whole process fails. Before passing the captured image to Tesseract, preprocessing is done, including converting the color image to grey level, erosion, and dilation [16][17]. Sample results of the extracted plates are shown in Fig. 11 and Fig. 12.

IV. EXPERIMENTAL RESULTS

Table 2 shows a general comparison of the three algorithms used and how accurate their results are. In general, KNN gives the most accurate result, owing to our modification and implementation of it. Based on the achieved results, we can see that the license plate recognition method with the most accurate results is the one based on KNN, which recognized the tested images with an average accuracy of 90%.

TABLE II. PERFORMANCE OF DIFFERENT METHODS

Plate/Method          Tesseract   OpenALPR   KNN
License Plate Set 1   40 %        70 %       90 %
License Plate Set 2   60 %        80 %       92 %
License Plate Set 3   61 %        74 %       95 %
License Plate Set 4   57 %        75 %       88 %
License Plate Set 5   48 %        70 %       88 %
Average               53.2 %      73.8 %     90.6 %
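The preprocessing steps mentioned above (grey-level conversion, then erosion and dilation) can be sketched without an imaging library. OpenCV would normally supply these operations; the 3×3 neighbourhood and the threshold of 128 are arbitrary choices made for the example.

```python
import numpy as np

def to_gray(rgb):
    # Luminance-weighted average of the R, G, B channels
    return rgb @ np.array([0.299, 0.587, 0.114])

def binarize(gray, thresh=128):
    return (gray > thresh).astype(np.uint8)

def erode(img):
    # 3x3 erosion: a pixel survives only if its whole neighbourhood is 1
    out = np.zeros_like(img)
    for i in range(1, img.shape[0] - 1):
        for j in range(1, img.shape[1] - 1):
            out[i, j] = img[i - 1:i + 2, j - 1:j + 2].min()
    return out

def dilate(img):
    # 3x3 dilation: a pixel is set if any neighbour is 1
    out = np.zeros_like(img)
    for i in range(1, img.shape[0] - 1):
        for j in range(1, img.shape[1] - 1):
            out[i, j] = img[i - 1:i + 2, j - 1:j + 2].max()
    return out

rgb = np.full((8, 8, 3), 200, dtype=float)   # bright synthetic "image"
rgb[3:5, 3:5] = 10                           # dark character-like blob
binary = binarize(to_gray(rgb))
cleaned = dilate(erode(binary))              # erosion then dilation (opening)
```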
The OpenALPR-based license plate recognition method recognized the tested images with an average accuracy of 75%, and the Tesseract-OCR based method with an average of 55%. Looking at these results, it was decided to implement the KNN-based license plate recognition method, as it achieved the highest accuracy, 90%. Figure 13 shows the accuracy comparison of the proposed method with the other existing techniques.

Fig. 12: Converting from RGB to gray and then to a binary image of the detected license plate.
Fig. 13: Performance comparison chart of different LP detection methods (accuracy of Tesseract, OpenALPR and KNN across License Plate Sets 1-5).

V. CONCLUSION
Our research aims to create integrated systems that reduce manual labor, discard redundant work, and move towards an automated future. Three different ways to process the license plate, OpenALPR, Tesseract and KNN, are discussed, with the different results of each algorithm singling out KNN for its superior results in terms of license recognition. The ultrasonic sensor measures the distance of the car approaching the gate; when a certain distance is measured, an instruction is sent to the camera to capture a picture of the car's license plate. This image is processed and passed as input to the KNN algorithm, opening the gate if the result is found in the database; otherwise the gate does not open. This system can be integrated into a main gate, substituting the need for security personnel to be stationed there all the time. When a vehicle is verified by the security official, its license plate details are inserted into the database. This information from the database is used to open the gate once the license plate is verified, making it easier for the security personnel to make their rounds and focus on other useful tasks rather than stay at the gate and open it manually all the time.

In future, we will try to improve the algorithm to recognize Arabic letters and numbers. We will also add more training and testing data to improve the results. At the hardware level, we will add an LCD to display important messages to the system users, as well as use different LED lights to indicate that a vehicle is allowed or denied entry.

REFERENCES
[1] Arth, C., Limberger, F., & Bischof, H. (2007, June). Real-time license plate recognition on an embedded DSP-platform. In 2007 IEEE Conference on Computer Vision and Pattern Recognition (pp. 1-8). IEEE.
[2] Kothman, G. S. (1951). U.S. Patent No. D163,328. Washington, DC: U.S. Patent and Trademark Office.
[3] Saudi Arabian Private and Passenger vehicle license plate History: http://www.worldlicenseplates.com/world/AS_SAUD.html
[4] Boiman, O., Shechtman, E., & Irani, M. (2008, June). In defense of nearest-neighbor based image classification. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on (pp. 1-8). IEEE.
[5] Monk, S. (2016). Raspberry Pi cookbook: Software and hardware problems and solutions. O'Reilly Media, Inc.
[6] Richardson, M., & Wallace, S. (2012). Getting started with Raspberry Pi. O'Reilly Media, Inc.
[7] Monk, S. (2015). Programming the Raspberry Pi: getting started with Python. McGraw Hill Professional.
[8] Brahmbhatt, S. (2013). Embedded Computer Vision: Running OpenCV Programs on the Raspberry Pi. In Practical OpenCV (pp. 201-218). Apress.
[9] Carullo, A., & Parvis, M. (2001). An ultrasonic sensor for distance measurement in automotive applications. IEEE Sensors Journal, 1(2), 143-147.
[10] Dote, Y. (1990). Servo motor and motion control using digital signal processors. Prentice-Hall, Inc.
[11] Smith, R. W. (2013, February). History of the Tesseract OCR engine: what worked and what didn't. In IS&T/SPIE Electronic Imaging (pp. 865802-865802). International Society for Optics and Photonics.
[12] Patel, C., Patel, A., & Patel, D. (2012). Optical character recognition by open source OCR tool Tesseract: A case study. International Journal of Computer Applications, 55(10).
[13] Keller, J. M., Gray, M. R., & Givens, J. A. (1985). A fuzzy k-nearest neighbor algorithm. IEEE Transactions on Systems, Man, and Cybernetics, (4), 580-585.
[14] Mitchell, T. M. (1997). Bayesian Learning. In Machine Learning (pp. 154-178). McGraw-Hill.
[15] Kosbatwar, S. P., & Pathan, S. K. (2012). Pattern Association for character recognition by Back-Propagation algorithm using Neural Network approach. International Journal of Computer Science & Engineering Survey, 3(1), 127-134.
[16] Chen, C. W., Luo, J., & Parker, K. J. (1998). Image segmentation via adaptive K-mean clustering and knowledge-based morphological operations with biomedical applications. IEEE Transactions on Image Processing, 7(12), 1673-1683.
[17] Patel, C. I., Patel, R., & Patel, P. (2011). Handwritten Character Recognition using Neural Network. International Journal of Scientific & Engineering Research, 2(5), 1-6.
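The gate-control decision described in the Conclusion (ultrasonic trigger, camera capture, KNN recognition, database lookup, gate actuation) ultimately reduces to a small lookup step. The sketch below illustrates only that decision step; the plate values and helper names are hypothetical, and the sensor, camera and servo calls, which the paper does not list, are omitted.

```python
# Hypothetical authorized-plate database; in the real system these entries
# are inserted by the security official when a vehicle is verified.
AUTHORIZED_PLATES = {"1234 ABC", "5678 XYZ"}

def normalize(plate: str) -> str:
    """Canonicalize an OCR result (case, spacing) before the database lookup."""
    return " ".join(plate.upper().split())

def should_open_gate(recognized_plate: str, db=AUTHORIZED_PLATES) -> bool:
    """Open the gate only when the recognized plate is registered."""
    return normalize(recognized_plate) in db
```

In the full pipeline, this function would run once the ultrasonic sensor reports a vehicle inside the trigger distance and the KNN stage has produced a plate string; its boolean result would drive the gate servo and the planned LED indicators.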
Two Information Systems in Air Transport
It is a Short Journey from Success to Failure
Victor P. Lane, Business School, London South Bank University, London, UK (profviclane@btinternet.com)
Derar Eleyan, Computer Science Department, Palestine Technical University-Kadoorie, Tulkarem, Palestine (d.eleyan@ptuk.edu.ps)
James Snaith, Business School, London South Bank University, London, UK (snaithja@lsbu.ac.uk)
TABLE 1. SUCCESSFUL IT PROJECTS – STANDISH GROUP

Success Factors – 1994:
1. Executive Management Support
2. User Involvement
3. Clear Listing of Requirements
4. Proper Planning
5. Realistic Expectations
6. Smaller Project Milestones
7. Competent Staff
8. Ownership
9. Clear Vision & Objectives
10. Hard-Working, Focused Staff

Success Factors – 2012:
1. Executive Support
2. User Involvement
3. Clear Business Objectives
4. Emotional Maturity
5. Optimizing Scope
6. An Agile Process
7. Project Management Expertise
8. Skilled Resources
9. Execution
10. Tools & Infrastructure

TABLE 2 (continued). CRITICAL FAILURE FACTORS
CF5 Over commitment to completion
CF6 Unable to be impartial
CF7 Political external pressures
CF8 Targets set outside the project
CF16 Staff turnover
CF17 Communication

III. Conceptual Stage:
CF9 Complexity underestimated
CF10 ICT over emphasized
CF11 Lure & trap of leading-edge IT

VI. Implementation into the Organization:
CF18 Poor testing of product
CF19 Poor training of users
CF20 Receding deadlines
These paradigms and guidelines often overlap. Therefore, in this paper, only a small number are used for analysis. They indicate how, where and why failures have occurred. It is astonishing how the same mistakes continually re-occur. It is clear that "Learning from failures" is not straightforward.

A. The Standish Group: Ideas of success and failures
For over 30 years, the Standish Group in the USA, a primary research advisory organization, has written about software project performance, and about success and failure in IT projects. Its reports, entitled "Chaos", try to show methods for the achievement of successful projects; see Table 1 for suggestions of success factors.

B. The Critical Factors Approach
The items listed in Table 2 are the "Critical Failure Factors" which have been found to be associated with success or failure in computing projects [17, 18]. It is recognized that one factor alone may not be critical, but the concatenated effect of several factors brings greater risk and possible failure. These factors are intended to help practitioners identify the true status of a computing project and, in the case of a troubled project, lead to appropriate remedial action.

Some of the observations and recommendations associated with this approach are like those described in the next section: for example, that senior (non-technical) general management should not abdicate their responsibility to 'manage' projects to any internal or external party, and that there is no certain way to avoid a disaster.

This pragmatist's guide to avoiding failures provides several maxims, some of which appear both recognizable and sometimes disparaging.

TABLE 3: THEMES THAT RECUR IN MOST LARGE FAILURES
T1. Over ambitious
T2. Technocrats think they know it all
T3. Computing must be beneficial
T4. Management abdicate responsibility
T5. Credulity - it will turn out alright when needed
T6. Possible conflicts of interest
T7. Custom built product
T8. Concealment of bad news by middle managers
T9. Buck passing
T10. Mistaken belief - litigation will solve problems

This does not detract from the fact that they are extremely useful and, regrettably, too often ignored by practitioners. This approach starts from the premise that the mistakes that occur in computer-based projects are always similar. These recurring themes are encapsulated in Table 3. The approach encourages a pessimistic and cautious view of computing, based on the conviction that (1) many computing projects 'fail', and (2) few if any systems ever come close to being perfect [19]. However, while emphasizing the need for caution, and being disparaging of IT enthusiasm or IT hyperbole, there is a recognition that no enterprise in the 21st century can survive without IT systems &/or without change. This change must be brought
about by harnessing technology, without the lure of the technology causing developers to lose sight of the only real goal, which is to bring improvements and benefits to the business.

III. CASE STUDIES
The two case studies are selected from the air transport industry. At this point in time, following various incidents across the world, some life-critical embedded IS/IT systems in air transport are under scrutiny [6]. The two case studies illustrate some of the problems faced by IT practitioners in the air transport industry.

A. Case Study 1: A New Terminal at Heathrow Airport
The new Heathrow Terminal 5 was designed to be one of the most technologically advanced airport terminals in the world, but the initial opening was indefensible. While the terminal resulted in (1) praise for its use of ISs in the planning and creation of a huge civil engineering project [20], it may well be remembered because of (2) the failure of the new but relatively 'humble and unexciting' baggage system [7, 8]. The overall cost for T5 was £4.3bn [$US 8.5 billion], with £250m [$US 323m] invested in technology and IT systems; see Table 4. The complex systems used 400,000 people-hours for software engineering. The terminal T5 required 180 IT suppliers, 163 IT systems, 546 interfaces, more than 9,000 connected devices, and 2,100 PCs.

Written evidence submitted to the UK Government's Transport Select Committee revealed that a multitude of problems were unearthed in the first days of operation of T5. Unfortunately, the attempts of BA's IT staff to alleviate the failings did not reduce the passenger problems; instead, the initial problems were intensified.

TABLE 4. TERMINAL T5
Cost of Terminal 5: £4.3 billion [$US 8.5 B]
Terminal Area: 251 hectares
Workers on site at one time: 6,000
Glass walls: 30,000 m2
New aircraft stands: 60
Tunnels constructed: 13,000 m

The only airline using T5 was BA, but BA was responsible neither for the airport nor for T5. The responsibility for, and ownership of, the airport and T5 lies with a different private company, BAA; approximately 90% of the ownership is non-British. The CEO of BA, the major airline using Terminal 5, stated that (1) IT difficulties, plus (2) a lack of testing, played a significant part in the malfunctions at T5. However, he suggested that if the issues had been simply IT related, then the airline might have coped. A huge number of non-IT difficulties hit the T5 implementation during its first few days, and these were intensified by the way staff handled them. These non-IT problems included (1) insufficient reservations for car parking for new T5 staff, causing many staff to be missing from their posts, (2) delayed security searches, (3) staff not fully trained, (4) construction of parts of the T5 building being incomplete at the time of the airport opening, and (5) some 10% of the terminal's 275 lifts not being operational.

Each item, in isolation, appears trivial, especially in the context of such a huge undertaking. The overall project was a new technologically advanced airport terminal costing £4.3bn. Prior to the real-time implementation of the baggage IS, there had been 66 trials, using 15,000 people from the public and from stakeholders, and 400,000 bags. Together, these created 50,000 passenger profile trials and all travel scenarios. But non-IT problems, apparently tiny, compounded the IT/IS difficulties, such that on the day of opening the terminal was 'not fit for purpose'; see Table 5. A UK Government report [8] said that the main factors causing the failures were (1) insufficient communication between the terminal user and terminal operator, and (2) poor staff training combined with incomplete systems testing, slightly different from the claim of the CEO of BA. Naturally, many of the faults were in the remit of BAA, the owner of the airport; BAA is approximately 90% owned by non-UK parties. Some of the faults that occurred are detailed below.
• There were difficulties with the LAN facility. Consequently, at some check-in stations, computer-handheld devices were inoperable, and airport staff could not enter baggage data into the baggage IS.
• An initial problem was that the BA loading staff could not sign on to the baggage-reconciliation system, so staff had to reconcile bags manually, causing significant delays.
• In the afternoon of opening day, BA could no longer accept checked baggage. Therefore, at check-in, passengers were told that they could choose between (1) travelling without baggage or (2) re-booking their flight. Unfortunately, passengers already checked in and waiting in the departure lounge were informed that they would be leaving without their bags.
• During the earlier testing of the baggage system, IT testing staff installed 'testing software'. This software was not removed, and it caused problems when it was used with real events. In real use, the T5 baggage system did not receive data about luggage transferring to BA from other airlines. Therefore, these 'unknown' bags were sent for manual sorting in a storage service, i.e., a storage centre outside T5.
• An "incorrect configuration" between ISs stopped the feed of data from the baggage-handling system to the baggage reconciliation system. A week after the original opening, the reconciliation system failed for the whole day. Bags missed their flights because the faulty system told staff that they had not been security screened.
• BA was compelled to cancel flights as it attempted, unsuccessfully, to understand and clear the luggage blockage.

TABLE 5: PROBLEMS IN FIRST 5 DAYS OF OPENING
Number of passenger bags misplaced: 23,000
Flights cancelled: 500
Losses: some £16m [$US 21m]
Some 10 years before the terminal T5 construction, the Denver International Airport in the USA had experienced similar problems with its baggage IS [21]. The Denver baggage system, then the most advanced system in the world, is still known as a notorious example of project failure. It was planned to automate the handling of baggage for the entire Denver airport. The baggage system was found to be extremely complex, and the resultant problems caused the new airport to remain unused for 16 months. The Denver delay added approximately $US 560m to the cost of the airport.

Terminal T5 was more fortunate. At its opening, T5 was the largest free-standing structure in the UK. It had the finest of architects and civil engineers, namely Richard Rogers and Arup with Mott MacDonald, respectively. T5's first passengers arrived at 4.50am on 27 March 2008 on a flight from Hong Kong. It was perfectly successful. The rest of that day, and later days, were chaos. After its less than auspicious opening, the terminal T5 had many misfortunes. Over the first 10 days, 42,000 bags did not fly in the same plane as their owners. The first full schedule from T5 occurred on 8 April 2008, some one and a half weeks after its first opening.

B. Air traffic control system failure – Los Angeles Airport
The Los Angeles International Airport, USA, locally known as the LAX airport, is the primary international airport for the city, and is the world's third-busiest airport based on total movements. The air traffic control system, i.e., the En-Route Automation Modernization (ERAM) system, is fundamental for the safe running of such a large airport. ERAM was developed by the Lockheed Martin Corp and cost $2.4 billion. In April 2014, ERAM was thought to be secure and dependable.

However, on 30 April 2014, a rogue plane entered the flying space [9, 10]. An air traffic controller could see that this unknown plane (1) was going in and out of the Los Angeles control area multiple times, and (2) was higher than normal commercial flights. It did not have a simple point-to-point route like normal commercial flights. Later, it was recognised that the aeroplane was a U-2 spy plane, operating at high altitude, with a complex flight plan. The controller entered its altitude as 60,000 feet. The ERAM system calculated all possible flight paths of the unknown plane to ensure that it was not on a crash route with the commercial planes with known flight paths at lower altitudes. Unfortunately, before all paths could be completed, the process used a large amount of the available memory and interrupted the system's other flight-processing functions, causing a system crash. The system then recycled, attempting to complete the process: a repeating failure. Commercial flights have a relatively small data need, and the rogue plane quickly overran the remainder of the system data memory.

With the ERAM system down, the air traffic controllers in the regional LAX main centre switched to a simple, uncomplicated back-up system. In this way, they could see the commercial planes on their screens. Using phones and paper, they were able to send flight information relating to commercial planes flying in their airspace to other control centres in the region.

This type of incident is not reported fully. Although few details about the system crash have been made public, major features are known. The incident affected a large area of the south west of the USA, from western Arizona to the west coast, and from Mexico to southern Nevada. Fortunately, no accidents or injuries occurred, but hundreds of flights were delayed. The National Air Traffic Controllers Association reported that ERAM was back up and running within an hour, perhaps a good indicator of the strength of the air traffic control system. It appears the problem was caused by a simple lack of memory.

IV. ANALYSIS
A. Heathrow Terminal 5
As is to be expected with a high-profile and well-funded project such as Heathrow Terminal 5, the project management teams followed the sound principles and guidelines described in the publications listed in Section II and outlined in Tables 1, 2 and 3. In addition, significant sums were spent on system testing and staff training, but without successful completion. However, it is often suggested that practitioners exaggerate how much time and money they spend on testing [22]. In the case of Terminal 5, whatever was spent was insufficient, in that the T5 project became a dreadful failure.

There are no major or significant omissions or differences. However, there were some minor differences, which later contributed to the failure.

With respect to project management, i.e., item 7 of Table 1, staff from BA and BAA knew that the overall project was a little late with some parts of the building works. The knock-on effect was that the IS testing started late, without any correction to the completion date; i.e., the late start should have resulted in a delay to the testing completion time, and perhaps also to the T5 opening time or date. At the time, this was thought to be too small a time-change to cause problems.

In Table 2, the critical factors CF18, 19 & 20 emphasise how important testing, training and receding deadlines are, particularly in the final stages of implementation. At this stage, any problems leave little leeway for additional time to address last-minute faults, like those that occurred in T5. In addition, CF5 focusses on over-commitment to completion dates; in the case of T5, commitment to completion dates caused reduced times for testing and training. System testing and staff training are often thought of as unimportant activities; in the case of T5 they were crucial.

In Table 3, items that often occur in IS failures are Themes T6 and T7, namely possible conflicts of interest and the dangers of custom-built computer systems. Both are apparent in the Terminal T5 events. Theme 5, credulity, is also evident, i.e., the wishful thinking that 'when we start' everything will be satisfactory.

The above events emerged in various forms in the T5 situation. However, in the above discussion there is not one item that seems large enough to cause the huge upheaval that occurred at the Terminal T5 opening. It was their combined effects, plus logistics and building incidents, that brought the whole T5 enterprise to a virtual halt.

Finally, it is well known that a 'big-bang' approach can cause problems. There is no explanation as to why BA
selected this approach rather than a gradual phased approach.

B. Air Traffic Control System at Los Angeles Airport
The LAX system degraded slowly before failing completely. Fortunately, the organization had a back-up manual system to take over virtually all traffic control needs. Everything went well. It was certainly a 'failure', but one can only applaud the way in which the back-up system operated.

The organization did not claim it was their 'contingency plan', but it certainly helped the air traffic controllers to work in unexpected and dangerous circumstances. The LAX incident, like the T5 implementation, was a huge bewilderment.

The T5 incidents caused passengers to be inconvenienced, but did not pose any threat to passengers' safety, whereas the LAX incident had the greater potential for danger: it was more likely to endanger the lives of people within the LAX location. LAX could claim that they were well prepared for even this unexpected rogue event.

V. CONCLUSIONS
In Section I, the question "Can we now develop ISs without any major risk of failure?" was posed. The insights that we have from the guidelines addressed in Section II help practitioners. However, Case 1 reminds us that it may not be the computer science or the brilliant IT that causes failure; other simple basic practices related to 'management of change' are at least as important. In Case 1, was it wise to use a 'big-bang' implementation? Was the absolute necessity of good training and of system testing really understood by BA and BAA management? Or was it simply that the date of start took preference over training and testing times?

The LAX case study shows that even the best system can be unable to continue when a rogue incident occurs. It also demonstrates the need for back-up systems. The full events of the LAX incident are not known, so it is unwise to pontificate. Nevertheless, the LAX back-up system averted problems, and possibly fatalities. It also reminds practitioners of the need for contingency planning.

Recently, there have been UK incidents, with banks and with mobile phone companies [1, 2], where these large companies used a 'big-bang' approach that failed. The end-users, like the T5 passengers, were innocent bystanders or victims. Is a 'big-bang' implementation, even if it is more problematic to end-users, coming into fashion? This would appear to be a suitable study for future research.

While new technologies provide us with new business opportunities, the case studies remind us of the dangers of forgetting the lessons we have learned from past experiences, such as:
• the importance of the completion of system testing,
• the importance of good-quality staff training,
• the absolute necessity for cooperation between staff from all organizations involved in the first real-time working of the new system,
• the difficulties of a 'big-bang' implementation; it is better, if possible, to use a phased implementation, and
• the dangers of receding deadlines, leading to attempts to do jobs in less time than originally estimated.

Finally, Case Study 2 highlights the importance of back-up systems and contingency planning. In Case 1, i.e., Terminal T5, if any serious back-up or contingency planning had been in place, or if the system testing and the staff training had been correctly completed [22], then the problems that occurred within T5 might never have occurred.

REFERENCES
[1] M. Field, "O2 network restored after Ericsson software outage left millions of mobile users without 4G data access," The Telegraph, London, 7 December 2018.
[2] J. Jolly, "The TSB bank computer meltdown bill rises to £330m," The Guardian, London, 1 February 2019.
[3] L. McLeod, B. Doolin, and G. MacDonell, "A perspective-based understanding of project success," Project Management J., vol. 43, pp. 68-86, 2012.
[4] R. Sweis, "An investigation of failure in information systems projects: The case of Jordan," J. Management Research, vol. 7, 2015.
[5] B. Shore, "Systematic biases and culture in project failures," Project Management J., vol. 39, pp. 5-16, 2008.
[6] G. Toham and H. Smith, "Investigators believe Ethiopian Boeing 737 Max's anti-stall system activated," The Guardian, London, 29 March 2019.
[7] R. Thomson, "British Airways reveals what went wrong with Terminal 5: The full extent of the IT problems," Computer Weekly On-Line, 14 May 2008.
[8] UK House of Commons - Transport Committee, The opening of Heathrow Terminal 5: Twelfth Report of Session 2007-08, HC 543, London: Stationery Office, 22 October 2008.
[9] J. Hamil, "Los Angeles air traffic meltdown: system simply ran out of memory," The Register Online, 12 May 2014.
[10] A. Scott and J. Menn, "Exclusive: Air traffic system failure caused by computer memory shortage," Reuters, Technology News, 12 May 2014.
[11] V. P. Lane, "Information systems projects - Are failures congenital or acquired?" Current Perspectives in Healthcare Computing, pp. 156-164, March 1999 [Proc. of HC'99, British Computer Society].
[12] V. P. Lane, "The NPfIT in the NHS: £12.7bn - The NHS computer system can still provide joined-up healthcare," The Guardian, London, p. 31, 4 August 2009.
[13] V. P. Lane, J. A. Snaith, and D. C. Lane, "Hospital information systems - Are failures problems of the past?" Invited Paper, Annual Journal of Medical Informatics & Technologies, University of Silesia, vol. 11, pp. 11-22, November 2009.
[14] R. Ibrahim, E. Ayazi, S. Nasrmalek, and S. Nakhat, "An investigation of critical failure factors in IT projects," J. Business and Management, vol. 10, pp. 87-92, 2013.
[15] M. Kateb, R. Swies, B. Obeidat, and M. Maqableh, "An investigation of the critical factors of information system implementation in Jordanian information technology companies," European J. Business and Management, vol. 7, pp. 11-28, 2015.
[16] H. N. Nasir and S. Sahibuddin, "Critical success factors for software projects: A comparative study," Scientific Research and Essays, vol. 6, pp. 2174-2186, 2016.
[17] C. Sauer, Information Systems Project Performance: A Continuing Journey, Warwick Business School, ISM Forum, Warwick University, 2008.
[18] K. T. Yeo, "Critical failure factors in information system projects," Int. J. Project Management, vol. 20, pp. 241-246, 2002.
[19] T. Collins and D. Bicknell, Crash: Ten Easy Ways to Avoid a Computer Disaster, Simon & Schuster, Sydney, Australia, 1997.
[20] A. Davies, D. Gann, and T. Douglas, "Innovation in mega-projects: Systems integration at London Heathrow Terminal 5," Cal. Management Review, vol. 51, pp. 101-125, Winter 2009.
[21] M. Schloh, "Analysis of the Denver International Airport baggage system," Computer Science Department, School of Engineering, California Polytechnic State University, 16 Feb. 1996.
[22] M. Beller, G. Gousios, A. Panichella, and A. Zaidman, "When, how, and why developers (do not) test in their integrated development environments," Proc. 10th Joint Meeting on Foundations of Software Engineering (ESEC/FSE 2015), ACM, New York, USA, pp. 179-190, 2015.
Task Scheduling based on Modified Grey Wolf
Optimizer in Cloud Computing Environment
Abdullah Alzaqebah, Computer Science Department, The World Islamic Sciences and Education University, Amman, Jordan (Abdullah.zaqebah@wise.edu.jo)
Rizik Al-Sayyed, Information Technology Department, King Abdullah II School for Information Technology, University of Jordan, Amman, Jordan (r.alsayyed@ju.edu.jo)
Raja Masadeh, Computer Science Department, The World Islamic Sciences and Education University, Amman, Jordan (raja.masadeh@wise.edu.jo)
Abstract—Task scheduling is considered one of the most critical problems in the cloud computing environment. The main target of task scheduling is to schedule jobs on virtual machines in a way that improves performance. This study employs the Grey Wolf Optimization (GWO) algorithm with a modified fitness function that handles multiple objectives in a single fitness value; makespan and cost are the objectives included in the fitness in order to solve the task scheduling problem. The main target of this technique is to reduce both cost and makespan. The CloudSim tool is used to evaluate the objectives of the proposed method. The simulation results show that the proposed method (Modified Grey Wolf Optimizer, MGWO) performs better than both the traditional Grey Wolf Optimization algorithm (GWO) and the Whale Optimization Algorithm (WOA) with makespan-based fitness, in terms of makespan, cost and degree of imbalance.

Keywords—GWO, MGWO, WOA, Fitness, Makespan, and Cost

I. INTRODUCTION
Due to the availability of big data as well as on-demand operation in cloud computing (CC), the requirements of CC environments have increased in recent years. CC [1, 2] allows clients to access available and suitable resources such as internet applications, storage, and servers [3]. The main role of the cloud service provider is to handle and manage client requests (services) over the Internet [4]. The CC environment presents various services to clients. The most important services are Platform as a Service (PaaS) [5], Infrastructure as a Service (IaaS) [6], Expert as a Service (ExaaS) [7] and Software as a Service (SaaS) [8][9]. Cloud clients have various tasks, and these tasks are implemented and completed at the same time by the available resources in the cloud. The performance of CC can be improved by mapping tasks onto resources in an optimized manner. One of the most critical operations of the cloud is task scheduling, which has a great influence on the entire cloud by impacting the Quality of Service (QoS) [10, 11]. CC task scheduling preserves the balance of the entire system load. Each job demands response time, memory and computing time at different scales. In addition, the CC has distributed resources.

The efficient task scheduling method must minimize the makespan of the application [12]. Therefore, there is a need for algorithms that schedule the cloud tasks of users, optimally assigning tasks to resources and reducing the makespan. However, other criteria also play a role in cloud task scheduling, such as cost and utilization. A multi-objective task scheduling algorithm has to maximize resource utilization and minimize both makespan and cost to optimize scheduling in cloud environments.

Cloud task scheduling is known to be an NP-complete problem [13]; more precisely, the time required to find a solution grows with the problem size [14]. Cloud task scheduling algorithms are categorized into two classes, namely heuristic and meta-heuristic algorithms. Heuristic algorithms follow a problem-specific strategy and cannot be used to answer open problems. On the other hand, meta-heuristic algorithms can be applied to solve a wide range of problems in reasonable time.

Recently, meta-heuristic algorithms have become the most applied techniques for task scheduling because they find optimal or near-optimal solutions in reasonable time. Moreover, they find solutions by employing random choices. The best-known example of a meta-heuristic algorithm is the Genetic Algorithm (GA), which has been adopted by many studies to solve the task scheduling problem (TSP) in several ways. In the literature [15-18], the time required for mapping tasks to resources increases as the number of jobs increases.

In this research, we propose a cloud task scheduling approach based on a multi-objective model and the Grey Wolf Optimization (GWO) algorithm, to minimize both cost and makespan in cloud environments. The CloudSim tool is used to evaluate the proposed technique.

The organization of the paper is as follows: Section II contains the related work, while Section III describes the GWO algorithm in detail. Section IV outlines the suggested work. Simulation results are presented in Section V. Finally, Section VI concludes this research.

II. RELATED WORK
Many researchers have tried to solve cloud task scheduling using different techniques. Most of them employed meta-heuristic algorithms such as GA, ACO, GWO, and WOA in order to solve one of the main problems of the cloud environment, the task scheduling problem (TSP), as well as to find the optimal distribution of available resources. However, there are still open issues in this research area [2, 19].

A novel algorithm based on a neural network (NN) is proposed in [20] in order to classify the task queues that occur on any resource, as well as to grant priorities to a variety of tasks. An NN is an artificial intelligence system which can discover and distinguish a pattern. Also, it can learn by instance and adapt to novel
978-1-7281-2882-5/19/$31.00 ©2019 IEEE
concepts and knowledge. Employing an NN has high potential to optimize the mapping of tasks to virtual machines (VMs) in CC environments.

A few researchers have employed the GWO algorithm to solve the problem. A multi-objective independent cloud task scheduling method based on the mean GWO is presented in [21]. Its primary objectives are to reduce both makespan and power consumption. Simulation results showed that the suggested Mean Grey Wolf Optimization algorithm obtains better results than the traditional GWO and PSO algorithms. The work in [22] employed the GWO method to schedule dependent tasks in CC environments, taking makespan, cost, and resource utilization into consideration. The experimental results showed that the proposed algorithm performs better than the other existing techniques.

Some studies used the Whale Optimization Algorithm (WOA) to solve the TSP. The study of Sharma, M. et al. [23] focused on minimizing both energy consumption and makespan for independent cloud task scheduling. Experiments were performed over varying numbers of tasks and VMs. Simulation results showed that the suggested technique outperforms the Min-min algorithm in terms of makespan and consumed energy. Another cloud task scheduling technique, called W-Scheduler, is based on the WOA and a multi-objective model [24]. The main objectives of W-Scheduler are to reduce makespan and budget cost; in the simulations it outperformed the other compared algorithms. Another multi-objective WOA is proposed in the study of Reddy, G. N. et al. [25] to schedule independent tasks in CC environments, taking energy consumption, makespan, resource utilization, and quality of service into account. Simulation results showed that the suggested algorithm performs better than the existing techniques.

Masadeh, R. et al. [26] proposed a new metaheuristic optimization algorithm called the Vocalization behavior of humpback Whale Optimization Algorithm (VWOA), which mimics the vocalization behavior of humpback whales in nature. The researchers also introduced a cloud task scheduling technique based on the VWOA and a multi-objective model that considers makespan, cost, resource utilization, and energy consumption. The simulation results showed that the proposed technique performs better than the other algorithms.

Many researchers have utilized Ant Colony Optimization (ACO) to solve the TSP in the CC environment. A cloud task scheduling algorithm based on load balancing and the ACO algorithm (LBACO) is proposed in [27]. This algorithm balances the entire system load and, in turn, minimizes makespan. Simulation results showed that the suggested strategy outperforms First-Come-First-Served (FCFS) and the traditional ACO algorithm. Another solution, proposed in the study of Tawfeek, M. A. et al. [28], takes makespan and degree of imbalance into consideration; the experimental results demonstrated that it outperforms the Round-Robin (RR) and FCFS techniques. Dependent task scheduling based on ACO and a two-way ants strategy is introduced in the work of Zhou, Y. et al. [29]. The experimental results demonstrated that the proposed technique can greatly reduce the total execution time needed to find the available cloud resources and significantly improve efficiency.

Some studies employed a Genetic Algorithm (GA) to propose novel cloud scheduling techniques. A new scheduling strategy that assists in appropriate and dynamic resource utilization is proposed in the work of Kumar, P. et al. [30]: an improved GA that combines the Min-Min and Max-Min techniques with the traditional GA. Based on simulation results, the proposed strategy outperformed the traditional GA in terms of makespan. An enhancement of the GA is introduced by Wang, T. et al. [31], which achieves independent task scheduling while minimizing makespan and balancing the entire system load. The experimental results proved that the suggested algorithm can reduce the makespan and balance the system load efficiently.

III. GREY WOLF OPTIMIZATION (GWO) ALGORITHM

The Grey Wolf Optimization (GWO) algorithm is one of the most recent nature-inspired meta-heuristic optimization algorithms, proposed by [32]. It mimics the foraging and hunting behavior of grey wolves. The most distinctive feature of grey wolves is their social hierarchy: they live in packs of 5-12 wolves, and each pack has alpha, beta, delta, and omega members. The alpha is the leader, responsible for making decisions. The beta is a consultant to the leader (alpha) and helps the alpha make decisions. Delta wolves are subordinates that submit to the upper levels (alpha and beta) but dominate the lowest level, the omegas.

The hunting behavior of grey wolves is split into the following stages [32-37]:

• Tracking, chasing and approaching the prey.
• Pursuing, encircling and harassing the prey until it stops moving.
• Attacking the prey.

The mathematical model of the GWO algorithm is as follows:

1- Encircling prey: during the hunt, the grey wolves encircle the prey, which is mathematically modeled by Eq. 1 and Eq. 2:

    D = |C · Xp(t) − X(t)|   (1)

    X(t + 1) = Xp(t) − A · D   (2)

Where t indicates the current iteration, A and C are coefficient vectors, Xp denotes the position vector of the prey, and X represents the position vector of a grey wolf. In addition, A and C are computed using Eq. 3 and Eq. 4:

    A = 2a · r1 − a   (3)
    C = 2 · r2   (4)

Where r1 and r2 represent random vectors in [0, 1] and a is linearly decreased from 2 to 0 [32].

2- Hunting: this phase is guided by the leader alpha and by the consultant beta and delta wolves, which have enough knowledge about the position of the prey. The rest of the wolves should therefore update their locations according to the locations of the best agents, as mathematically modeled by Eqs. 5, 6 and 7:

    Dα = |C1 · Xα − X|,  Dβ = |C2 · Xβ − X|,  Dδ = |C3 · Xδ − X|   (5)

    X1 = Xα − A1 · Dα,  X2 = Xβ − A2 · Dβ,  X3 = Xδ − A3 · Dδ   (6)

    X(t + 1) = (X1 + X2 + X3) / 3   (7)

[...] the broker is to optimize some required parameters such as makespan, cost, resource utilization and energy consumption by assigning the tasks to VMs so as to satisfy the optimization function.

The scheduling process is based on several parameters; the scheduler needs information about the resources during the task execution process. The Resource Information Server (RIS) is responsible for feeding this information to the scheduler by summarizing the data center information, such as the CPUs, memories and all other details of the contained VMs. The scheduler then assigns the tasks to the resources based on this information so as to optimize the given parameters [38].
3- Evaluation of the fitness function: in this paper, two performance metrics, makespan and cost, are included in the fitness function of the MGWO scheduler, which aims to minimize the fitness value; this is the modification of the traditional use of GWO. The fitness function is presented in Eq. 11, where ti represents the i-th task in the task list.

    Fitness(ti) = (Makespan(ti) + Cost(ti))   (11)
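The update rules of Eqs. 1-7 together with an Eq. 11 style fitness can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the task-to-VM decoding, the speed/cost model, and all function and variable names are our own assumptions.

```python
import random

def fitness(position, task_len, vm_speed, vm_cost):
    """Eq. 11 style objective: makespan + cost of a task-to-VM mapping.
    position[i] is a continuous value decoded to a VM index (assumed encoding)."""
    n_vms = len(vm_speed)
    finish = [0.0] * n_vms
    cost = 0.0
    for i, x in enumerate(position):
        vm = int(abs(x)) % n_vms            # decode wolf position to a VM
        t = task_len[i] / vm_speed[vm]      # execution time on that VM
        finish[vm] += t
        cost += t * vm_cost[vm]
    return max(finish) + cost               # makespan + cost

def gwo(obj, dim, n_wolves=10, iters=100, lo=0.0, hi=8.0):
    """Generic GWO loop implementing Eqs. 1-7."""
    wolves = [[random.uniform(lo, hi) for _ in range(dim)]
              for _ in range(n_wolves)]
    for t in range(iters):
        wolves.sort(key=obj)
        # copy the three leaders so in-place updates do not disturb them
        alpha, beta, delta = (list(w) for w in wolves[:3])
        a = 2 - 2 * t / iters               # a decreases linearly from 2 to 0
        for w in wolves:
            for d in range(dim):
                x = 0.0
                for leader in (alpha, beta, delta):
                    r1, r2 = random.random(), random.random()
                    A = 2 * a * r1 - a      # Eq. 3
                    C = 2 * r2              # Eq. 4
                    D = abs(C * leader[d] - w[d])   # Eq. 5
                    x += leader[d] - A * D          # Eq. 6
                w[d] = x / 3                # Eq. 7
    return min(wolves, key=obj)
```

With makespan alone in `obj` this is plain GWO; adding the cost term, as in Eq. 11, gives the multi-objective MGWO behavior described above.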
V. SIMULATION RESULTS

The proposed algorithm is simulated using the CloudSim tool, a Java-based platform. All experiments are run on a personal computer with an Intel Core i7 processor, 16 GB RAM, and the Windows 8.1 operating system. The proposed MGWO differs from GWO in that the core fitness function is modified to consider multiple objectives instead of the single makespan objective. The outcomes of the modified GWO (MGWO) are compared with the original GWO and the existing WOA technique (the WOA being a recently proposed optimizer by Mirjalili (2016)) for various numbers of independent tasks (200, 400, 600, 800 and 1000) and different numbers of VMs (1, 2, 4 and 8), in terms of makespan, cost and degree of imbalance.

The simulation results show minimum cost and total execution time compared with the other selected algorithms. In this simulation, each scenario is executed 10 times and the average is taken. The average makespan of the executed tasks using MGWO, GWO and WOA is illustrated in Fig. 1 - Fig. 4. It is obvious that MGWO performs better than the existing WOA and the traditional GWO in terms of makespan, because using cost together with makespan in the fitness function to evaluate solutions yields a better scheduling process, which directly affects the overall makespan. In addition, when the number of VMs equals one, all algorithms produce the same makespan and cost, since there are no other resources onto which tasks can be scheduled.

The cost represents the execution cost of running an independent task on a particular VM. It depends on the task's length, the VM's storage, and the cost of transmitting the task to that VM. Because the simulation settings are almost the same, the results show no significant difference in terms of the cost metric. Fig. 5 - Fig. 9 show the cost of executing different numbers of tasks on various numbers of VMs.

Fig.2: Makespan of various numbers of tasks when the number of VMs is 2.
Fig.3: Makespan of various numbers of tasks when the number of VMs is 4.
Fig.4: Makespan of various numbers of tasks when the number of VMs is 8.
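The comparison metrics can be computed directly from the per-VM finish times; the following sketch (variable names are ours) shows the makespan and the degree of imbalance DI = (Tmax − Tmin) / Taverage used in these experiments.

```python
def makespan(finish_times):
    """Makespan: completion time of the busiest VM."""
    return max(finish_times)

def degree_of_imbalance(finish_times):
    """DI = (Tmax - Tmin) / Taverage over the VMs' execution times."""
    t_max, t_min = max(finish_times), min(finish_times)
    t_avg = sum(finish_times) / len(finish_times)
    return (t_max - t_min) / t_avg

# Example: three VMs finishing at 10, 7 and 4 time units
# makespan([10, 7, 4]) -> 10
# degree_of_imbalance([10, 7, 4]) -> (10 - 4) / 7
```

A lower DI means the load is spread more evenly over the VMs, which is why it serves as the balancing metric in the comparison.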
Fig.6: Scheduling cost of various numbers of tasks when the number of VMs is 2.
Fig.8: Scheduling cost of various numbers of tasks when the number of VMs is 8.

The Degree of Imbalance (DI) measures the imbalance among the VMs [36] using the following equation:

    DI = (Tmax − Tmin) / Taverage

Where Tmax represents the maximum execution time over the VMs, and Tmin and Taverage denote the minimum and average execution times, respectively. Fig. 9 illustrates the DI experiments, which are performed for different numbers of independent tasks on 8 VMs. It clearly shows that MGWO achieves the lowest degree of imbalance, which means better scheduling balance.

Fig.9: Degree of Imbalance of GWO, MGWO, and WOA on 8 VMs.

VI. CONCLUSION

Various meta-heuristic algorithms have been employed to develop task scheduling methods for the CC environment. In this work, a new task scheduling technique based on GWO (MGWO) is introduced by modifying the fitness function to combine multiple objectives in a single fitness value instead of using the single makespan objective. Independent task scheduling based on both cost and makespan is executed in CloudSim. The performance of the proposed technique is compared with the traditional GWO and WOA. The simulation results show good outcomes in reducing makespan, cost, and degree of imbalance.

REFERENCES
[1] Mell, P., & Grance, T. (2011). The NIST definition of cloud computing.
[2] Joseph, A. D., Katz, R., Konwinski, A., Gunho, L., Patterson, D., & Rabkin, A. (2010). A view of cloud computing. Communications of the ACM, 53(4).
[3] He, H., Xu, G., Pang, S., & Zhao, Z. (2016). AMTS: Adaptive multi-objective task scheduling strategy in cloud computing. China Communications, 13(4), 162-171.
[4] Lin, X., Wang, Y., Xie, Q., & Pedram, M. (2014). Task scheduling with dynamic voltage and frequency scaling for energy minimization in the mobile cloud computing environment. IEEE Transactions on Services Computing, 8(2), 175-186.
[5] Navimipour, N. J., Rahmani, A. M., Navin, A. H., & Hosseinzadeh, M. (2015). Expert Cloud: A Cloud-based framework to share the knowledge and skills of human resources. Computers in Human Behavior, 46, 57-74.
[6] Malawski, M., Juve, G., Deelman, E., & Nabrzyski, J. (2015). Algorithms for cost- and deadline-constrained provisioning for scientific workflow ensembles in IaaS clouds. Future Generation Computer Systems, 48, 1-18.
[7] Navimipour, N. J. (2015). A formal approach for the specification and verification of a trustworthy human resource discovery mechanism in the expert cloud. Expert Systems with Applications, 42(15-16), 6112-6131.
[8] Keshanchi, B., Souri, A., & Navimipour, N. J. (2017). An improved genetic algorithm for task scheduling in the cloud environments using the priority queues: formal verification, simulation, and statistical testing. Journal of Systems and Software, 124, 1-21.
[9] Alkhanak, E. N., Lee, S. P., & Khan, S. U. R. (2015). Cost-aware challenges for workflow scheduling approaches in cloud computing environments: Taxonomy and opportunities. Future Generation Computer Systems, 50, 3-21.
[10] Rimal, B. P., Jukan, A., Katsaros, D., & Goeleven, Y. (2011). Architectural requirements for cloud computing systems: an enterprise cloud approach. Journal of Grid Computing, 9(1), 3-26.
[11] Rimal, B. P., Choi, E., & Lumb, I. (2009, August). A taxonomy and survey of cloud computing systems. In 2009 Fifth International Joint Conference on INC, IMS and IDC (pp. 44-51). IEEE.
[12] Navin, A. H., Navimipour, N. J., Rahmani, A. M., & Hosseinzadeh, M. (2014). Expert grid: new type of grid to manage the human resources and study the effectiveness of its task scheduler. Arabian Journal for Science and Engineering, 39(8), 6175-6188.
[13] Ullman, J. D. (1975). NP-complete scheduling problems. Journal of Computer and System Sciences, 10(3), 384-393.
[14] Xu, Y., Li, K., He, L., & Truong, T. K. (2013). A DAG scheduling scheme on heterogeneous computing systems using double molecular structure-based chemical reaction optimization. Journal of Parallel and Distributed Computing, 73(9), 1306-1322.
[15] Singh, S., & Kalra, M. (2014, November). Scheduling of independent tasks in cloud computing using modified genetic algorithm. In 2014 International Conference on Computational Intelligence and Communication Networks (pp. 565-569). IEEE.
[16] Kaur, K., & Kaur, A. (2015). Optimal Scheduling and Load Balancing in Cloud using Enhanced Genetic Algorithm. International Journal of Computer Applications, 125(11).
[17] Wang, T., Liu, Z., Chen, Y., Xu, Y., & Dai, X. (2014, August). Load balancing task scheduling based on genetic algorithm in cloud computing. In 2014 IEEE 12th International Conference on Dependable, Autonomic and Secure Computing (pp. 146-152). IEEE.
[18] Lakshmi, R. D., & Srinivasu, N. (2016). A dynamic approach to task scheduling in cloud computing using genetic algorithm. Journal of Theoretical & Applied Information Technology, 85(2).
[19] Singh, P., Dutta, M., & Aggarwal, N. (2017). A review of task scheduling based on meta-heuristics approach in cloud computing. Knowledge and Information Systems, 52(1), 1-51.
[20] Maqableh, M., & Karajeh, H. (2014). Job scheduling for cloud computing using neural networks. Communications and Network, 6(3), 191-200.
[21] Natesan, G., & Chokkalingam, A. (2018). Task scheduling in heterogeneous cloud environment using mean grey wolf optimization algorithm. ICT Express, 1-5.
[22] Khalili, A., & Babamir, S. M. (2017). Optimal scheduling workflows in cloud computing environment using Pareto-based Grey Wolf Optimizer. Concurrency and Computation: Practice and Experience, 29(11), 1-11.
[23] Sharma, M., & Garg, R. (2017, December). Energy-aware whale-optimized task scheduler in cloud computing. In 2017 International Conference on Intelligent Sustainable Systems (ICISS) (pp. 121-126). IEEE.
[24] Sreenu, K., & Sreelatha, M. (2017). W-Scheduler: whale optimization for task scheduling in cloud computing. Cluster Computing, 1-12.
[25] Reddy, G. N., & Kumar, S. P. (2017, October). Multi objective task scheduling algorithm for cloud computing using whale optimization technique. In International Conference on Next Generation Computing Technologies (pp. 286-297). Springer, Singapore.
[26] Masadeh, R., Sharieh, A., & Mahafzah, B. A. Humpback Whale Optimization Algorithm Based on Vocal Behavior for Task Scheduling in Cloud Computing.
[27] Li, K., Xu, G., Zhao, G., Dong, Y., & Wang, D. (2011, August). Cloud task scheduling based on load balancing ant colony optimization. In 2011 Sixth Annual ChinaGrid Conference (pp. 3-9). IEEE.
[28] Tawfeek, M. A., El-Sisi, A., Keshk, A. E., & Torkey, F. A. (2013, November). Cloud task scheduling based on ant colony optimization. In 2013 8th International Conference on Computer Engineering & Systems (ICCES) (pp. 64-69). IEEE.
[29] Zhou, Y., & Huang, X. (2013, November). Scheduling workflow in cloud computing based on ant colony optimization algorithm. In 2013 Sixth International Conference on Business Intelligence and Financial Engineering (pp. 57-61). IEEE.
[30] Kumar, P., & Verma, A. (2012). Independent Task Scheduling in Cloud Computing by Improved Genetic Algorithm. International Journal, 2(5).
[31] Wang, T., Liu, Z., Chen, Y., Xu, Y., & Dai, X. (2014, August). Load balancing task scheduling based on genetic algorithm in cloud computing. In 2014 IEEE 12th International Conference on Dependable, Autonomic and Secure Computing (pp. 146-152). IEEE.
[32] Mirjalili, S., Mirjalili, S. M., & Lewis, A. (2014). Grey wolf optimizer. Advances in Engineering Software, 69, 46-61.
[33] Masadeh, R., Alzaqebah, A., Hudaib, A., & Rahman, A. A. (2018). Grey Wolf Algorithm for Requirements Prioritization. Modern Applied Science, 12(2), 54.
[34] Masadeh, R., Hudaib, A., & Alzaqebah, A. (2018). WGW: A hybrid approach based on whale and grey wolf optimization algorithms for requirements prioritization. Advances in Systems Science and Applications, 18(2), 63-83.
[35] Masadeh, R., Sharieh, A., & Sliet, A. (2017). Grey wolf optimization applied to the maximum flow problem. International Journal of Advanced and Applied Sciences, 4(7), 95-100.
[36] Yassien, E., Masadeh, R., Alzaqebah, A., & Shaheen, A. (2017). Grey wolf optimization applied to the 0/1 knapsack problem. International Journal of Computer Applications, 169(5), 11-15.
[37] Alzaqebah, A., & Abu-Shareha, A. A. (2019). Ant Colony System Algorithm with Dynamic Pheromone Updating for 0/1 Knapsack Problem. International Journal of Intelligent Systems and Applications, 11(2), 9.
[38] Menascé, D. A., Saha, D., Porto, S. C. D., Almeida, V. A., & Tripathi, S. K. (1995). Static and dynamic processor scheduling disciplines in heterogeneous parallel architectures. Journal of Parallel and Distributed Computing, 28(1), 1-18.
Causal Path Planning Graph Based on Semantic
Pre-link Computation for Web Service Composition
Moses Olaifa
Department of ICT, Vaal University of Technology
Vanderbijlpark, South Africa
newmosesolaifa@yahoo.com

Tranos Zuva
Department of ICT, Vaal University of Technology
Vanderbijlpark, South Africa
tranosz@vut.ac.za
Abstract—The web has impacted development across different spheres of life by facilitating connection and communication between people and machines, with organizational productivity enhancement. Beyond connection and communication, access to functionalities via the same web for solving business tasks has increased its popularity. The idea of deploying functionalities on the web, termed Service Oriented Computing (SOC), has been a major research area for some time. A key area of research focus in SOC is service composition, which deals with aggregating available services to address complex business processes or produce better functionalities. Due to the explosion in the number of published web services, a need to improve the performance of web service composition arises. One of the key research issues in service composition is providing an efficient discovery approach that contributes to improved web service composition. This work proposes an efficient web service composition framework based on causal path pre-computation.

Index Terms—Service Oriented Computing, web service, service composition

I. INTRODUCTION

The concept of a web service is rooted in Service Oriented Architecture (SOA), a paradigm of Service Oriented Computing (SOC) that deals with the organization and provision of web-deployable software components called web services, which encapsulate different functionalities and business processes, from simple requests to complex business processes [1]. Web services are loosely coupled, self-describing and self-contained applications that can be discovered via their published descriptions and remotely invoked through the internet across different platforms using XML-based standards such as the Simple Object Access Protocol (SOAP) [2] [3]. Self-describing means that a service can describe its operations and parameter requirements so that service brokers can dynamically determine its functionalities and how it can be invoked. Its self-contained characteristic signifies its autonomy and platform-independent nature. In order to use any web service, a service request in the form of a service specification is required. Between the request and the delivery of its result lies a series of tasks, including the search, discovery, invocation and execution of relevant services published by service providers. Publication of a service includes the description of the functional and non-functional components it provides, possibly in machine-understandable formats. Available components are searched using different approaches with respect to any specification defined by the users. In some cases, a single service appropriate for a request may not be found, hence the need to compose a set of services that provides the required output.

Service composition is a major problem in the dynamic and fast-growing web service environment [4] [5] [6]. More specific is the problem of time-efficient service discovery for the web service composition process. While most research works have focused on service composition, these works are based on conventional web discovery processes. Existing approaches underlying conventional web service discovery do not form a suitable basis for time-efficient service composition [17]. Some attempts have been made at integrating the web service discovery and composition processes. However, the lack of well-defined service discovery approaches underpinning the composition approaches persists. Web service composition requires more than conventional service discovery approaches for improved performance in the face of growing numbers of web services. This research work presents a framework for web service composition based on causal path pre-computation over service concepts.

II. RELATED WORK

Different approaches have been proposed to deal with the issues surrounding web service composition [11] [12] [10] [8] [9]. No matter the approach used, central to any composition process is the service discovery process. This is required for identifying the component services contributing to the generation of the final composite service. To realize an appropriate published service for a particular web service request specification, service retrieval through discovery of the particular service is performed. This involves searching service registries, matchmaking of concepts, and ranking and selection of services. The main goal of any of the discovery approaches [13] [14] [15] is the retrieval of appropriate services for service requests. Performance evaluation is based on the ability to retrieve the required service under the assumption that the request will be satisfied by a single appropriate service.

However, there are situations where a single service that satisfies a particular service request may not be available. Aggregation of multiple services is required before such
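The semantic pre-link idea used in this framework can be pictured as an index, built offline, that maps each concept to the services whose inputs it can satisfy, so that composition does not rescan the registry for every step. The sketch below is our own simplification: the registry contents and names are hypothetical, and the paper's semantic matchmaking over concept relations is reduced here to exact concept matching.

```python
from collections import defaultdict

# Hypothetical registry: service name -> (input concepts, output concepts)
registry = {
    "CurrencyLookup": ({"Country"}, {"Currency"}),
    "RateFetch":      ({"Currency"}, {"ExchangeRate"}),
    "TaxTable":       ({"Country"}, {"TaxRate"}),
}

def build_prelink_map(registry):
    """Offline phase: index each input concept to the services that
    consume it, so online composition avoids rescanning the registry."""
    prelink = defaultdict(set)
    for name, (inputs, _outputs) in registry.items():
        for concept in inputs:
            prelink[concept].add(name)
    return prelink

prelink = build_prelink_map(registry)
# sorted(prelink["Country"]) -> ['CurrencyLookup', 'TaxTable']
```

At composition time, the services relevant to a newly produced concept are then a single dictionary lookup rather than a registry-wide search, which is the source of the time savings reported later.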
where s and adj_s are concept vectors s = {in_s, out_s} and adj_s = {adj_sIn, adj_sOut}, respectively. Suppose B is bounded by s0 = {∅, r_in} and sn+1 = {r_out, ∅}; the sum of all requisite services in the boundary (s0, sn+1] generates a predictor set for r = {r_in, r_out}.

We define a predictor set over services (s0, sn+1) = {si | s0 < si < sn+1} and a set of requisite neighboring services adj_s as

    BP = Σ_{adj_si ∈ C} (H(si) | si ∈ C)   (2)

where si, adj_si ∈ C and s0 < adj_si ≤ sn+1.

The predictor network BP models the dependencies between the services and the corresponding requisite services in C, with an ordering that satisfies the requirements of r = {in_r, out_r}. With respect to BP, each adjacent service node adj_si captures the dependency on the set of requisite service nodes. Given a set of requisite services adj_s over services si bounded by r = {r_in, r_out}, it is assumed that BP factors into the composition of composite services and atomic services corresponding to the products of paired service node edges (si−1, si). For any two adjacent service nodes si−1 and si in BP, where si−1 is a parent service node of si, we define the cost of the causal effect path c(si−1, si) as the incident edge cost c(in_si) on service node si. Therefore the cost of a node c(si) is defined as the sum of the costs of all causal effect edges c(in_si) on si:

    c(si) = Σ_{i=1}^{k} c(in_si)   (3)

Let the adjacent service nodes generated from the set of requisite services by the mapping function H(si) for node si be the set {adj_s1, ..., adj_sk}, where adj_si ∈ B. We have also defined the predictor set BP and the edge cost between service node pairs c(s). We can now define the causal directed graph for a Universal Composite Service for the service query r = {r_in, r_out}. In the composition problem, the Universal Composite Service is a causal directed graph described by the ordering of a set of atomic nodes and pair nodes defined over a set of si ∈ C and bounded by the service query s0 = {∅, r_in} and sn+1 = {r_out, ∅}.

The causal directed graph B* for a Universal Composite Service over a set of services in C is a tuple {BP, c(s), H(s)}, where BP is the predictor set over the services in C, H(s) is a mapping function that traverses the corresponding requisite services adj_s ⊂ C for a service si, and c(s) is the cost of traversing an adjacent service node adj_si from a requisite service si.

Algorithm 1 generates a universal composition plan for a service request r = {r_in, r_out}. Lines 1 to 3 initialize the service request input parameters and the expected output parameters. Each request input and output has at least two parameters. Initially, the set of request input parameters req_in is presented for discovery of the input-relevant services from the semantic pre-link map. In the corresponding services set [⟨s1, rel⟩, ⟨s2, rel⟩, ..., ⟨sn, rel⟩], the highest-ranked service is selected for each concept parameter. These services form the initial set of services for the universal composition plan. At each step of the composition, matInp and unmatInp track the set of service input parameters that have been discovered, and the outstanding input and output parameters prepared for the next step in the composition process. For each service discovered in the current step of the composition, the output parameters are compared with the expected parameter set req_out. For all outstanding request output parameters, discovery of input-relevant services is performed until req_out is empty.

Algorithm 1 CompPlan(r_in, r_out)
    req_in = {r_in_i}
    req_out = {r_out_i}
    sel_comp = ∅
    while req_out ≠ ∅ do
        matInp = compos_conc(c, req_inp)
        unmatInp = req_inp \ matInp
        req_in = ∅
        for all ci ∈ matInp do
            si = argmax_ci [⟨s1, rel⟩, ⟨s2, rel⟩, ..., ⟨sn, rel⟩]
            if si ∉ sel_comp then
                sel_comp = sel_comp + si
                comp_req_inp = comp_req_inp + out_si
            else
                continue
            end if
        end for
        for all out_si ∈ comp_req_inp do
            matOut = outMatch(req_out, comp_req_inp)
        end for
        req_out = req_out \ matOut
        req_inp = comp_req_inp \ matOut
        req_inp = unmatInp ∪ req_inp
    end while
    return sel_comp

V. BACKWARD PRUNING

After the generation of the causal directed universal composition, the universal composition is pruned to obtain a minimum predictor set minBP that yields the optimal composition plan. The pruning starts from the goal service nodes and traces back until the initial service nodes are reached. Traversing the graph backwards enables the elimination of service nodes that have no causal effect. Therefore the minimum predictor set that produces the optimal composition plan is given as:

    minBP = {min Σ_{i=n+1}^{0} (H(si) | si ⊗ si−1) ∀ si ∋ CN}   (4)
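The forward expansion of Algorithm 1 and the backward pruning step can be sketched together as follows. This is our own simplified rendering, not the authors' implementation: it uses a hypothetical toy registry, exact concept matching instead of ranked semantic matchmaking, and ignores edge costs.

```python
# Hypothetical registry: service name -> (input concepts, output concepts)
services = {
    "GeoCode": ({"Address"}, {"Coordinates"}),
    "Weather": ({"Coordinates"}, {"Forecast"}),
    "Traffic": ({"Coordinates"}, {"TravelTime"}),
}

def compose(services, provided, goals):
    """Forward pass in the spirit of Algorithm 1: repeatedly select
    services whose inputs are already satisfied until all goal
    concepts are produced (the universal composition plan)."""
    known = set(provided)
    plan = []
    changed = True
    while not set(goals) <= known and changed:
        changed = False
        for name, (inputs, outputs) in services.items():
            if name not in plan and inputs <= known:
                plan.append(name)
                known |= outputs
                changed = True
    return plan

def backward_prune(services, plan, provided, goals):
    """Backward pass: starting from the goal concepts, keep only
    services with a causal effect, tracing their needed inputs back
    to the provided concepts (the optimal composition plan)."""
    needed = set(goals) - set(provided)
    kept = []
    for name in reversed(plan):
        inputs, outputs = services[name]
        if outputs & needed:  # service contributes a needed concept
            kept.append(name)
            needed = (needed - outputs) | (inputs - set(provided))
    return list(reversed(kept))
```

For a request with input {"Address"} and goal {"Forecast"}, the forward pass also picks up "Traffic" because its inputs become available, and the backward pass then removes it, since "TravelTime" has no causal effect on the goal.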
All the services that are not included in the minimum predictor set have no causal effect on the goal service nodes.

VI. EXPERIMENT AND RESULT DISCUSSION

The presented framework is evaluated for scalability and time efficiency. The performance of this approach is compared with composition based on the conventional service discovery approach. For conventional discovery, all component services for the universal composition are realized without the aid of the semantic pre-link map. In order to observe the scalability and time efficiency of the composition process, the number of web services in the service environment is varied between 150 and 4000. The metrics used in the evaluation are as follows:

• Universal composition size: the number of services involved in the universal composition.
• Optimal composition size: the number of services realized after pruning the universal composition plan.
• Universal composition time: the time taken from request processing to the completion of the composition.

At each increase in the number of web services in the environment, the experiment is performed 10 times. Figure 1 shows the processing time required by the different approaches for universal composition generation. The conventional discovery (conv_disc) shows a tremendous increase in processing time as the web services grow in number: as the number of web services increases, more processing time is required for the search and matching of service concepts.

Fig. 1. Universal Composition Time.

Unlike composition based on pre_sem, each service required in the composition process has to be discovered directly from the service environment, which translates to a higher processing time requirement. In the case of pre_sem, discovery of component services is based on the semantic pre-link map, which reduces the required time. Figure 2 shows the number of service nodes involved in the universal composition plan for each of the approaches. In each of the experiments, conv_disc is observed to involve a higher number of services in the composition than pre_sem. This may be because composition based on conventional service discovery is directly exposed to the service descriptions, and subsumption relationships between concepts, rather than exact matching alone, are also considered. This may bloat the number of services involved in the composition unnecessarily.

Fig. 2. Universal Composition Size.

The above results are recorded for similar requests throughout the experiments. However, the results may vary slightly if the requests are changed from one experiment to another. Overall, the searching and reasoning required for every component service needed in the composition under conventional service discovery increases the processing time for composition.

VII. CONCLUSION AND FUTURE WORK

This study presents a novel approach to improve the scalability and time efficiency of web service composition. It combines a semantic pre-link computation phase with a causal directed graph to realize an enhanced discovery approach for web service composition. The semantic pre-computation phase performs a pre-discovery of relevant services according to the different input and output definitions of the web services, which saves a reasonable amount of the time spent during the composition process. Furthermore, the causal directed graph allows for the aggregation of fewer component services, which improves the scalability of the service composition.

This work assumes a largely stationary web service environment. For future research, the pre-processing phase needs to be improved to fully address the dynamic nature of the web service environment. With service creation and deletion occurring at random, this may impact the expected composition time. In addition, component services of existing composite services can be deleted without the composite services being updated. Therefore, more work is required in the area of changing web service environments.

REFERENCES
[1] Mike P. Papazoglou, "Service-oriented computing: Concepts, characteristics and directions." In Proceedings of the Fourth International
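The semantic pre-link computation described above can be illustrated with a minimal sketch. Here service interfaces are plain concept sets and matching is simple set overlap; the actual framework uses semantic reasoning over input and output definitions, and all names below are hypothetical:

```python
from collections import defaultdict

def build_prelink_map(services):
    """services: dict mapping name -> (set of input concepts, set of output concepts).
    Returns directed pre-links: producer -> services that can consume one of its outputs."""
    # Index which services consume each concept.
    consumers = defaultdict(set)
    for name, (inputs, _) in services.items():
        for concept in inputs:
            consumers[concept].add(name)
    # Link every producer to the services that can consume one of its outputs.
    prelink = {}
    for name, (_, outputs) in services.items():
        linked = set()
        for concept in outputs:
            linked |= consumers[concept]
        prelink[name] = linked - {name}
    return prelink
```

Because this map is computed once, ahead of any request, composition can follow the precomputed directed edges instead of searching and reasoning over the whole environment per request, which is the source of the time saving reported above.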
Accelerating Stochastic Gradient Descent using Adaptive Mini-Batch Size

Muayyad Saleh Alsadi, Rawan Ghnemat, and Arafat Awajan
Computer Science Dept., Princess Sumaya University for Tech., Amman, Jordan
muayyad.a@opensooq.com, r.ghnemat@psut.edu.jo, awajan@psut.edu.jo
Abstract—Training Artificial Neural Networks takes a long time to converge and achieve acceptable accuracy. The proposed method alternates between two modes: fast-forward mode and normal mode. The fast-forward mode iterates faster than normal mode by using a smaller number of samples in each mini-batch. Cycling between those two modes in an adaptive way is driven by accuracy change, selectively using the faster mode as long as it gives good results and falling back to normal mode otherwise. This way, training becomes feasible even on commodity CPUs. Our approach was tested on a commodity CPU, obtaining an accuracy of 91% on the Pets-37 dataset in less than an hour and an accuracy of 72% on the Birds-200 dataset in less than two and a half hours.

Index Terms—Artificial Neural Network; Convolutional Neural Networks; Stochastic Gradient Descent; Adaptive Batch Size; Deep Learning

I. INTRODUCTION

A typical basic design of a CNN model starts with an input image of a certain width and height Wi × Hi; in the case of color images, that is a volume of size Wi × Hi × 3. That volume is fed to a sequence of convolution layers of a certain kernel size and depth (number of filters). A pooling layer follows (maximum pooling or average pooling), and the network goes deeper by alternating many convolution and pooling layers. The objective of the design is to form a flat signal with no spatial dimension (width = 1 and height = 1), so that the signal lies along the depth axis, which becomes the signal of the output classes; an example is seen in figure 1, showing the design of LeNet [3]. The objective is achieved by using strides on some layers (pooling or convolutional) to reduce the width and height of the output, or by having a convolutional filter whose kernel size matches its input size.
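The spatial-reduction objective described above can be checked with the standard no-padding output-size formula, out = (in − kernel) / stride + 1. This is a generic sketch, not code from the paper:

```python
def out_size(size, kernel, stride=1):
    """Spatial output size of a conv/pool layer with no padding."""
    return (size - kernel) // stride + 1

def shrink_to_flat(size, layers):
    """Apply a list of (kernel, stride) layer specs and return the final spatial size."""
    for kernel, stride in layers:
        size = out_size(size, kernel, stride)
    return size
```

For example, a 32-pixel-wide input passed through a 5×5 convolution, a 2×2 stride-2 pool, another 5×5 convolution, a second 2×2 stride-2 pool, and a final 5×5 convolution (a LeNet-like stack) ends at width 1: 32 → 28 → 14 → 10 → 5 → 1.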
TABLE I. Overhead of 3×3×128 convolution in Inception almost halved using separable operators.
the batch size in fast-forward mode will result in two times more iterations in the same time period. For example, one can have the normal batch size be 128 and the smaller one be 64. Instead of periodically cycling between those two modes, keep using the fast mode as long as accuracy is increasing, and switch to the slower normal mode when it is not.

Algorithm 1 Adaptive mini-batch size by alternating two batch sizes
1: let acc_old ← 0
2: let batch_size ← small_batch
3: while exit criteria do ◃ e.g., number of iterations
4:   run_training_batch()
5:   let acc_new ← evaluate()
6:   if acc_new > acc_old then
7:     let batch_size ← small_batch
8:   else
9:     let batch_size ← normal_batch
10:  let acc_old ← acc_new

The more generic Algorithm 2 uses a custom criterion to switch modes, and custom hyper-parameters for each mode, such as batch size, learning rate, dropout rate, regularization factors, and number of iterations.

Algorithm 2 Generic adaptive mini-batch size by alternating two configurations
1: initialize normal_batch_size ◃ normal settings
2: initialize normal_learning_rate
3: initialize ff_batch_size ◃ fast-forward settings
4: initialize ff_learning_rate
5: let acc_old ← 0
6: let mode ← normal
7: while exit criteria do ◃ e.g., number of iterations
8:   run_training_batch()
9:   let acc_new ← evaluate()
10:  if ff criteria then
11:    let mode ← ff
12:  else
13:    let mode ← normal
14:  let acc_old ← acc_new

It is reasonable to assume that a given fully utilized machine has a constant throughput (regardless of batch size), which is the rate of items per second (or images per second in our case) it can process. This is not the case when using small batch sizes on special hardware of large capacity like GPUs, but it is the case for commodity CPUs. In other words, the time needed to process a batch is linearly proportional to the number of items in the batch.

Assuming we want a boost factor of n, so that the faster mode is n times faster than the normal mode, its batch size would be normal_batch_size/n, and the number of iterations in each step can be set to normal_iterations/n so that the time spent in each step is the same regardless of the mode. In other words, we will be doing n times more updates in the same period of time.

Since the proposed method has only two modes, the learning rate hyperparameter of each mode can be handpicked and tuned.

One can mix and match settings for each mode depending on the needed boost. Examples of hyper-parameter choices for the two modes:
• Normal batch size and normal learning rate, number of iterations
  – For example, batch size = 64, learning rate = 0.01, iteration count = 100
• Fast-forward hyperparameter examples:
  – 2x setup: halve the batch size, keep the same learning rate, and use 2x the number of iterations
  – 10x setup: 1/10 of the batch size, 1/2 of the learning rate, and 10x the number of iterations
  – etc.

The fast-forward criterion can be defined in multiple ways. The simplest is "the new accuracy is better than the old one": if we are getting better, keep going in fast-forward mode; why use the slower mode if the faster mode is enough to increase the accuracy?

Another criterion can be defined based on the number of iterations with stalled accuracy. For example, if accuracy did not get better after three consecutive iterations, switch to normal mode; otherwise keep going in fast-forward mode.

One is not limited to only two configurations; one can define three or more modes, or even an arbitrary number of modes. The general form would be: if accuracy has stalled for more than a threshold of iterations, adjust each hyper-parameter h like this:
h′ = factor × h

TABLE II. Accuracy over iterations in steps for different batch sizes.

Fig. 4. Top-1 accuracy fine-tuning the Birds-200 dataset with different batch sizes; the x-axis is in hours.
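Putting the alternating-configuration idea and the stall-based criterion together, the training loop might look like the following sketch, where train_step and evaluate are hypothetical stand-ins for the actual training and validation code:

```python
def adaptive_training(train_step, evaluate, steps=100,
                      normal=(128, 0.01), fast=(64, 0.01), patience=3):
    """Alternate between normal and fast-forward (smaller-batch) settings,
    falling back to normal mode after `patience` steps without improvement."""
    batch_size, lr = fast          # start optimistic, in fast-forward mode
    best_acc, stalled = 0.0, 0
    for _ in range(steps):
        train_step(batch_size, lr)
        acc = evaluate()
        if acc > best_acc:         # still improving: keep (or resume) fast mode
            best_acc, stalled = acc, 0
            batch_size, lr = fast
        else:                      # stalled: count, then fall back to normal mode
            stalled += 1
            if stalled >= patience:
                batch_size, lr = normal
    return best_acc
```

With patience = 1 this reduces to the simpler accuracy-improved criterion; larger patience values tolerate the accuracy flapping that small batches can cause before paying for the slower mode.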
Fig. 5. Evaluation accuracy along the time axis for different batch sizes and learning rates (LRs) for the Pets-37 task.

Fig. 6. Comparing accuracy over time for extremely small batch sizes and learning rates for the Pets-37 task.

Fig. 7. Accuracy over time of the Birds-200 task trained using adaptive batch sizes of 4,8 then 8,16.

Fig. 8. Accuracy over time of the Birds-200 task trained using fixed batch sizes of 4 and 8.
uses faster, higher-risk settings as long as it works. One way to achieve faster convergence is to walk with short, fast steps, that is, by doing more frequent updates, even based on a sample that is too small to represent all classes in the task. In the Birds-200 task, sampling two items from each class means a batch size of 400 items. Doing frequent updates based on a batch barely representing 5% of the classes was very effective, as in the experiment shown in figure 4 and table III. Iterating 20 times faster (even with ridiculously under-represented samples) resulted in multiple times better accuracy, for two main reasons. First, the relation between batch time and batch size is linear, while the negative effect on batch accuracy is not. Second, updates are made based on a small fraction of the error delta, the learning rate, which can be as small as 0.001; the next step will also be under-represented, but in a different way, affecting different classes due to the stochastic nature of the training algorithm. Those flapping small-fraction mistakes cancel each other out and are negligible compared to the accumulative move toward the least-error point.

The time consumed to process a batch is linearly proportional to the batch size (assuming the machine is fully utilized); that is, using 8 items per batch is 50× faster than 400 items per batch. On the other hand, the negative side effect on accuracy (if any) of using a smaller batch size is not linear; as long as the accuracy is increasing at a rate less than 50× slower, it is a win. When using 200 items per batch instead of 400, we get double the speed but we do not lose half of the accuracy-increasing rate.

As long as we are increasing the accuracy, there is no need to use slower settings. But if accuracy gets stuck due to a lack of enough samples for the different classes, one might use the slower mode for a few steps, long enough to put SGD in a good initial position to start sliding down the gradient using the faster mode.

One might ask: if smaller batches are good, why not use them all the way? Why use an adaptive batch size based on some criterion? The accuracy-increasing rate is not linear, and after a long while it starts to flatten into a horizontal line, as it is much easier to go from 10% to 15% than from 90% to 95%. When accuracy gets stuck, or worse, starts flapping and decreasing, one needs to activate the normal mode with the slower hyper-parameters to eventually overcome that barrier.

VI. CONCLUSION

In SGD, smaller batch sizes are very effective (even if they are 20× or 50× smaller than the number of classes). They have a linear effect on speed while barely degrading the accuracy-increasing rate; one can exploit this property to fast-forward the "boring" parts of the training process and get good results in hours that used to take days or require specialized hardware. This can be summarized simply as: do a very high-risk initialization, then "Train-Measure-Adapt-Repeat". As long as it is getting better results, keep using the fast-forwarding settings.

REFERENCES
[1] W. Liu, Z. Wang, X. Liu, N. Zeng, Y. Liu, and F. E. Alsaadi, "A survey of deep neural network architectures and their applications," Neurocomputing, vol. 234, pp. 11–26, 2017.
[2] Y. Bengio et al., "Learning deep architectures for AI," Foundations and Trends in Machine Learning, vol. 2, no. 1, pp. 1–127, 2009.
[3] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[4] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[5] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer, "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size," arXiv preprint, 2016. [Online]. Available: http://arxiv.org/abs/1602.07360
[6] K. He and J. Sun, "Convolutional neural networks at constrained time cost," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 5353–5360.
[7] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, "MobileNets: Efficient convolutional neural networks for mobile vision applications," arXiv preprint, 2017. [Online]. Available: http://arxiv.org/abs/1704.04861
[8] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, "Inverted residuals and linear bottlenecks: Mobile networks for classification, detection and segmentation," arXiv preprint, 2018. [Online]. Available: http://arxiv.org/abs/1801.04381
[9] F. Mamalet and C. Garcia, "Simplifying convnets for fast learning," Artificial Neural Networks and Machine Learning – ICANN 2012, pp. 58–65, 2012.
[10] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in International Conference on Machine Learning, ser. ICML'15, 2015, pp. 448–456. [Online]. Available: http://dl.acm.org/citation.cfm?id=3045118.3045167
[11] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1–9.
[12] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the inception architecture for computer vision," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2818–2826.
[13] Y. LeCun, L. Bottou, G. B. Orr, and K.-R. Müller, "Efficient backprop," in Neural Networks: Tricks of the Trade. Springer, 1998.
[14] D. R. Wilson and T. R. Martinez, "The general inefficiency of batch training for gradient descent learning," Neural Networks, vol. 16, no. 10, pp. 1429–1451, 2003.
[15] I. Sutskever, J. Martens, G. Dahl, and G. Hinton, "On the importance of initialization and momentum in deep learning," in International Conference on Machine Learning, 2013, pp. 1139–1147.
[16] J. Duchi, E. Hazan, and Y. Singer, "Adaptive subgradient methods for online learning and stochastic optimization," Journal of Machine Learning Research, vol. 12, no. Jul, pp. 2121–2159, 2011.
[17] D. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[18] M. D. Zeiler, "Adadelta: An adaptive learning rate method," arXiv preprint arXiv:1212.5701, 2012.
[19] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[20] P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He, "Accurate, large minibatch SGD: Training ImageNet in 1 hour," arXiv preprint arXiv:1706.02677, 2017.
[21] C. Finn, P. Abbeel, and S. Levine, "Model-agnostic meta-learning for fast adaptation of deep networks," arXiv preprint, 2017. [Online]. Available: http://arxiv.org/abs/1703.03400
[22] A. Defazio, F. Bach, and S. Lacoste-Julien, "SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives," in Advances in Neural Information Processing Systems, 2014, pp. 1646–1654.
[23] C. Cortes, X. Gonzalvo, V. Kuznetsov, M. Mohri, and S. Yang, “Adanet:
Adaptive structural learning of artificial neural networks,” arXiv preprint,
2016. [Online]. Available: http://arxiv.org/abs/1607.01097
[24] L. N. Smith, “Cyclical learning rates for training neural networks,”
in Applications of Computer Vision (WACV), 2017 IEEE Winter
Conference on. IEEE, 2017, pp. 464–472. [Online]. Available:
http://arxiv.org/abs/1506.01186
[25] I. Loshchilov and F. Hutter, “Sgdr: Stochastic gradient descent
with warm restarts,” arXiv preprint, 2016. [Online]. Available:
http://arxiv.org/abs/1608.03983
[26] L. N. Smith and N. Topin, “Super-convergence: Very fast training of
residual networks using large learning rates,” arXiv preprint, 2017.
[Online]. Available: http://arxiv.org/abs/1708.07120
[27] S. Ruder, “An overview of gradient descent optimiza-
tion algorithms,” arXiv preprint, 2016. [Online]. Available:
http://arxiv.org/abs/1609.04747
[28] H. Zhu, M. Akrout, B. Zheng, A. Pelegris, A. Phanishayee,
B. Schroeder, and G. Pekhimenko, “Tbd: Benchmarking and analyzing
deep neural network training,” arXiv preprint, 2018. [Online]. Available:
http://arxiv.org/abs/1803.06905
[29] C. Coleman, D. Narayanan, D. Kang, T. Zhao, J. Zhang, L. Nardi,
P. Bailis, K. Olukotun, C. Ré, and M. Zaharia, “Dawnbench: An end-
to-end deep learning benchmark and competition,” Training, vol. 100,
no. 101, p. 102, 2017.
[30] S. L. Smith, P.-J. Kindermans, C. Ying, and Q. V. Le, “Don’t decay the
learning rate, increase the batch size,” arXiv preprint, 2017. [Online].
Available: http://arxiv.org/abs/1711.00489
[31] O. M. Parkhi, A. Vedaldi, A. Zisserman, and C. V. Jawahar, “Cats
and dogs,” in 2012 IEEE Conference on Computer Vision and Pattern
Recognition, 2012, pp. 3498–3505.
[32] J. Deng, A. Berg, S. Satheesh, H. Su, A. Khosla, and L. Fei-Fei,
“Imagenet large scale visual recognition competition,” (ILSVRC2012),
2012.
[33] W. Ouyang, X. Wang, C. Zhang, and X. Yang, “Factors in finetun-
ing deep model for object detection with long-tail distribution,” in
Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, 2016, pp. 864–873.
[34] M. Oquab, L. Bottou, I. Laptev, and J. Sivic, “Learning and transferring
mid-level image representations using convolutional neural networks,”
in Proceedings of the IEEE conference on computer vision and pattern
recognition, 2014, pp. 1717–1724.
[35] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie, “The
caltech-ucsd birds-200-2011 dataset,” California Institute of Technology,
Tech. Rep. CNS-TR-2011-001, 2011.
Author Index

A
Ababneh, Mohammad 56
Abdalhaq, Baker 318
Abdel-Nabi, Heba 170
Adefila, Arinola 83
AlArmouty, Batool 178
Al-Asa'd, Muntaha 341
Al-Dabet, Saja 271
Alenezi, Fayadh 277
Al Etaiwi, Wael 251
Al Etaiwi, Wael Mahmoud 265
Al-Fayoumi, Mustafa 45, 56, 74
Alghazo, Jaafar 372
AlHaidar, Shoaa 124
Alhaj, Fatima 324, 347
Al-Haj, Fatima 258
Alharbi, Yaser 189
Alhichri, Haikel S. 283
Alhijawi, Bushra 136
Ali, Mohd Shukuri Mohamad 107
Ali, Nazlena Mohamad 107
Alia, Shahd 113
Al-Jarrah, Heba 341
Alkasassbeh, Mouhamad 51
Alkasassbeh, Mouhammd 27
Al-kasassbeh, Mouhammd 33
AlKhatib, Lina 124
Al-Kasassbeh, Mohammad 62
Al-Lahham, Yaser A.M. 226
Al-Madi, Nailah 152
Almajali, Sufyan 208
Al-Mousa, Amjed 335
Almseidin, Mohammad 33
Al-Naymat, Ghazi 136, 170
Al Omari, Islam 142
Al Omoush, Razan 142
AlOraidh, Aqeela 124
Alsadi, Muayyad Saleh 393
AlSaid, Hawra 124
Al-Sakran, Hasan 189
Al-Sayyed, Rizik 382
AL-Smadi, Mohammad 341
Al Qadi, Leen 238
Alzaqebah, Abdullah 382
Al-Zboon, Sa'ad A. 341
Al-Zewairi, Malek 1
Alzubaidi, Loay 372
Assiri, Basem 8
Atallah, Rahma 335
Awajan, Arafat 136, 170, 208, 231, 244, 265, 393
Awajan, Arafat A. 213, 251
Ayoubi, Eyad 74
Azeem, Omar 353

B
Baghdadi, Ameer 183
Bahita, M. 360
Bakhti, Haddi 366
Bashar, Abul 124
bazi, Yakoub 283
Belarbi, K. 360
Benbrahim, Ghassen 302
Biltawi, Mariam 231

C
Chakraborty, Rajat Subhra 1
Chantar, Hamouda 318
Chefranov, Alexander 20
Clauss, Alexander 101

D
Dafoulas, Georgios 87
Dafoulas, Georgios A. 94
Daoud, Mohammad 158
Debbi, Aimad Eddine 366
Dermol, Valerij 83
DeWinter, Alun 83

E
Eleyan, Derar 377
Elhassan, A. 142
Elhassan, Ammar 302
Elnagar, Ashraf 238
El-Nakla, Darin 119
El-Nakla, Samir 119
El Rifai, Hozayfa 238
El-Seoud, Samir 289
Eshtayah, Mohammad 183

F
Fekry, Ahmed 87
Fraihat, Salam 178

G
Ghnemat, Rawan 302, 393
Giacinto, Giorgio 14

H
Halabi, Dana 244
Hamad, Nagham 20
Hambouz, Ahmed 45
Hamdan, Salam 208
Hamida, Abdelhak Farhat 366
Hammad, Mahmoud 341
Hammo, Bassam 258
Hamtini, Thair 113
Hanna, Samer 39
Haque, Tahreem 1
Hart, Stefan Willi 202
Hawash, Amjad 183
Hudaib, Amjad 324
Hussein, Walid 289

I
Ibrahim, Anas 20
Innab, Haneen 142
Islam, Noman 130
Ismail, Manal 87
Issa, Lana 220

J
Jaber, Hayat 39
Jamous, Naoum 202
Jusoh, Shaidah 220

K
Karaymeh, Ashraf 56
Kazakzeh, Saif 74
Khan, Omer 353
Khanafsa, Mohammad 329
Kharshid, Areej 283
Kovacs, Szilveszter 33
Krishnasamy, Gomathi 68
Kumar, Kamlesh 130

L
Lane, Victor P. 377
Latif, Ghazanfar 312, 372
Lenk, Florian 101

M
Mafarja, Majdi 318
Manna, Abdelrahman 45, 51
Manzoor, Ayisha 312
Masadeh, Raja 382
McNally, Beverley 119
Mohammad, Nazeeruddin 312
Mohiuddin, Iman 312
Morrar, Jalal 183
Mostafa, Ahmad 289
Muslmani, Baraa K. 74
Muzzammel, Raheel 353

N
Naz, Rubina 130
Neilson, David 94

O
Obaid, Safa 238
Obeid, Nadim 136
Olaifa, Moses 388
Ouni, Ridha 283

Q
Qabbaah, Hamzah 164
Qasaimeh, Malik 56, 74
Qureshi, Muhammad Faheem 195

R
Raza, Asad 195
Romman, Ali Abu 195

S
Saeed, Nayab 353
Saeed, Reham 302
Saeed, Umair 130
Sammour, George 164
Santikellur, Pranesh 1
Sarhan, Sami 329
Scalas, Michele 14
Schoop, Eric 101
Serguievskaia, Irina 189
Shaheen, Yousef 45
Shaheen, Yousef Khaled 62
Shaikh, Aftab Ahmed 130
Shaikh, Eman 312
Sharieh, Ahmad 347
Sheta, Alaa 296
Širca, Nada Trunk 83
Sleit, Azzam 347
Snaith, James 377
Suleiman, Dima 213, 251
Sundus, Katrina 258
Surakhi, Ola 329

T
Tahir, Umair 353
Tanveer, Jaweria 130
Tawalbeh, Saja Khaled 341
Tedmori, Sara 45, 231, 271
Thaher, Thaer 318

V
Vanhoof, Koen 164

W
Wimpenny, Katherine 83

Y
Yasen, Mais 152

Z
Trunk, Aleš .............................................................. 83 Zuraiq, AlMaha Abu ................................................ 27
Turabieh, Hamza ........................................... 296,306 Zuva, Tranos ......................................................... 388
402
Organized by:
Technically Co-Sponsored by