
Conference Title
The Third International Conference on Computing Technology and Information Management (ICCTIM2017)

Conference Dates
December 8-10, 2017

Conference Venue
Metropolitan College, Thessaloniki, Greece

ISBN: 978-1-941968-46-8 ©2017 SDIWC

Published by
The Society of Digital Information and Wireless
Communications (SDIWC)
Wilmington, New Castle, DE 19801, USA
www.sdiwc.net
Table of Contents

C-Learning: Learning on Cloud ……………………………………………………………………………………………………. 1

Enterprise System Maturity – Past, Present and Future: A Case Study of Botswana …………………… 6

FAD Platforms: Proprietary Solutions …………………………………………………………………………………………. 15

Mobile Learning: A Case of Study ……………………………………………………………………………………………….. 21

Message Hiding with Pseudo-Random Binary Sequence Utilization …………………………………………... 25

Optimising the Innovation Capacity of Public Research Organisations - An Agent-Based
Simulation Approach ………………………………………………………………………………………………………….. 32

Text Classification Using Time Windows Applied to Stock Exchange …………………………………….……..39

Towards Specification Formalisms for Data Warehousing Requirements Elicitation Techniques … 45

Using Dense Subgraphs to Optimize Ego-centric Aggregate Queries in Graph Databases …………… 59

A Secure Method for the Global Medical Information in Cloud Storage based on the Encryption
and Data Embedding ………………………………………………………………………………………………………………….. 68

Virtual Local Area Network (VLAN): Segmentation and Security ……………………………………………….… 78

Ontology-Based Data Mining Approach for Judo Technical Tactical Analysis ……………………….………90

Celestial Spectra Classification Based on Support Vector Machine …………………………………………….. 99

The Agent-Based Model of The Dynamic Spectrum Access Networks with Network Switching
Mechanism ………………………………………………………………………………………………………………………………… 106

C-Learning: Learning on Cloud

Pasquina Campanella
Department of Computer Science
University of Bari “Aldo Moro”
via Orabona, 4 – 70126 Bari – Italy
pasqua13.cp@libero.it

ABSTRACT

Current trends in the digital age lead us to expect that in the near future access tools will become more contextual. It follows that it no longer makes sense to identify an e-learning system with a single monolithic platform; it should instead be seen as a set of interoperable components and subcomponents that rationally manage the various heterogeneous activities a training process can undergo. In this scenario cloud learning is born, "cloud formation", which combines the ability to draw on distributed resources with contextual information. The problems that arise in the era of distributed computing, in which a workload is fragmented into an arbitrary number of sub-tasks distributed to an unknown number of heterogeneous machines spread around the world, are that there is no absolute certainty that network machines are always available (latency, unpredictable network crashes), so continuous monitoring is essential. In this context, the e-learning platform Docebo Cloud has been studied to analyze its response times.

KEYWORDS

Platform, cloud computing, services, architecture, test.

1 INTRODUCTION

Since the '90s, with Grid Computing and, today, with the evolution of technologies and usage patterns, we have been witnessing the proliferation of interaction between computing systems for computational cooperation, which moves the classical view of ICT towards large datacenters distributed across the territory [3], [6], [9], [22]. The rise of Web 2.0 and of content sharing and publishing services has given users advanced services without having to resort to classical management of local resources [4], [16], [20], [25]. These advances in the performance of digital components have resulted in a huge increase in the scope of IT environments and, consequently, the need to manage them uniformly in a single "cloud" was born [6], [26], [30]. The need for such environments is particularly felt in the exponential growth of network-connected equipment and real-time data streaming processes, as well as in the spread of service-oriented architectures and applications and of collaborative research projects [3], [19]. Cloud architecture has been the best candidate for solving some of the problems generated by large-scale data processing for many computing giants [27], [29]. In this context a new hybrid model of resource utilization offered by computer networks emerged, which was named Cloud Computing [20], [28], [29]. Cloud computing is therefore a new approach to the provision of ICT resources that enables easy, on-demand network access to a configurable pool of computational resources [1], [7], [17], [19]. Although the cloud landscape is still extremely young, in recent years it has become increasingly important in Information and Communication Technology (ICT) and is the new technology that will enable the entire education system to change in the near future, with high-tech e-learning services that bring large economic savings [10], [18], [19]. This prospect leads to the development of C-Learning (Cloud Learning) and CMobile-Learning (Cloud Mobile Learning), where users access data shared in the cloud on request. The expression "cloud learning" can be translated as "cloud formation", pointing to a virtual


space where training data (work documents, training formats, meeting records) can be stored, shared and consulted on a remote server [6], [23]. There are currently several suppliers on the market that have enhanced their data centers by hosting applications in the cloud: from giants such as Amazon, IBM, Google, Microsoft and Sun Microsystems to cloud offerings from smaller companies such as GoGrid [24]. Applications can be accessed through a browser from any device connected to the network (PC, notebook, tablet, cell phone) (Fig. 1). In a cloud computing environment three distinct actors are configured [2], [5], [12], [15], [17]:
- Infrastructure Provider: provides platforms offering services (storage, applications, computing capabilities), generally following the pay-per-use model;
- Service Provider / Cloud User: chooses and configures the services offered by the vendor, implements a service that uses the resources provided by the infrastructure provider, and offers it to the end user;
- Final Client: uses the services configured by the service provider. In certain cases the administrator and the end customer may coincide.

Figure 1. Representation of Cloud connections

Cloud computing, therefore, is a new way of conceiving the supply and use of IT services based on the convergence of three key elements [8], [15], [17], [27]: utility computing; virtualization of computing resources; software as a service.
Cloud computing defines several service delivery models, among which the main ones are [6], [12], [17]:
- Software as a Service (SaaS): the provider's applications are accessible to the consumer from various client devices through a thin client interface (e.g. Gmail, Google Docs, salesforce.com CRM solutions, Zoho Docs);
- Platform as a Service (PaaS): consumers have control over deployed applications and hosting environment configurations (e.g. Google App Engine, Force.com);
- Infrastructure as a Service (IaaS): consumers can, depending on their needs, provision storage, processing and network resources, e.g. Amazon S3 (Simple Storage Service), Amazon EC2 (Elastic Compute Cloud), GoGrid.
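As a minimal sketch of the on-demand, self-service character of the IaaS model just listed, the snippet below provisions a storage resource on Amazon S3 through the boto3 SDK. The bucket name, region and object key are hypothetical placeholders; the example is illustrative and is not part of the platforms studied in this paper.

```python
# Minimal sketch of on-demand IaaS provisioning (assumes AWS
# credentials are already configured; names are placeholders).
import boto3

s3 = boto3.client("s3", region_name="eu-west-1")

# A single API call provisions a new storage resource, paid per use:
# no hardware purchase or local installation is involved.
s3.create_bucket(
    Bucket="example-elearning-content",  # hypothetical bucket name
    CreateBucketConfiguration={"LocationConstraint": "eu-west-1"},
)

# The resource is immediately available, e.g. for course material.
s3.put_object(
    Bucket="example-elearning-content",
    Key="courses/intro-module.zip",  # hypothetical object key
    Body=b"...",
)
```

The same pay-per-use pattern applies to compute (EC2) and the other IaaS offerings named above.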


This article first discusses the Docebo Cloud platform, then the experimental results collected at various learning times, and finally conclusions and future developments.

2 DOCEBO CLOUD PLATFORM

The spread of virtualization and cloud computing technologies, coupled with an increasing need to cut application and system management costs in the IT world, has led to the spread of on-demand IT service delivery policies, allowing the diffusion of new models, or the extension of existing ones, for software distribution and access to software applications [11], [14], [21], [27]. Hence software, platforms and infrastructures are made available as services, and these services can be considered core components for the development of cloud computing [6], [24], [29]. On this basis, the Docebo Cloud platform is studied here: an e-learning platform as a service designed to allow teachers and educators to create and manage online courses with ample interaction possibilities. Docebo was born as an evolution of Spaghetti Learning, an LMS developed in 2003 by the same team of developers [7], [13]. Today it is in widespread use, and the foundation that introduced it has migrated its e-learning platform towards the Cloud / As a Service direction; this came to fruition in the Docebo 7.0 release, accessible directly online "as a service" without any installation. Docebo provides course creators and tutors with a host of tools, such as forum and chat, user and group management, tests and surveys, document archives, and the ability to connect to videoconferencing services such as Teleskill [14], [29].

3 EXPERIMENTAL RESULTS

Many problems in online training depend, in most cases, on a management structure that is unable to adequately handle the didactic-organizational complexity introduced by new teaching technology solutions [14], [16]. On this basis, tests were conducted on the Docebo Cloud e-learning platform, which migrates to cloud / As a Service solutions to allow teachers and educators to create and manage online courses with ample interaction possibilities. Tests were conducted in a community of 150 participants, students and teachers aged 20 to 42, willing to evaluate the impact of the new cloud platform. The Docebo Cloud platform is accessed by connecting to an account, as shown in Fig. 2, and exploring the various learning features, ranging from distance learning, lectures in videoconferencing, archiving and sharing of teaching files, monographic courses, evaluation tests, alerts, and discussion forums.

Figure 2. Docebo Cloud platform access and monitoring

Docebo Cloud, with respect to the continuous monitoring of organizational-educational complexity, the fragmentation of the workload into an arbitrary number of sub-tasks deployed to an unknown number of heterogeneous machines scattered all over the world, and the absence of any absolute assurance that network machines are always available (latency times, unpredictable network crashes), exhibited variable response times in managing the various tasks, as shown in Fig. 2; response times decreased with fewer than 100 users, and depended strongly on the skills of the participants.
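The response-time behaviour described above can be probed, in spirit, with a small script such as the following. It is only a sketch: the tenant URL and endpoint paths are hypothetical placeholders, since the paper does not specify the measurement tooling actually used.

```python
# Sketch of a response-time probe for a cloud LMS (hypothetical
# tenant URL and paths; not the instrumentation used in the study).
import time
import statistics
import requests

BASE = "https://example.docebosaas.com"  # placeholder tenant URL
ENDPOINTS = ["/learn/home", "/learn/course/1", "/learn/forum"]

def sample_latency(path, n=10):
    """Return the mean wall-clock latency of n GET requests, in seconds."""
    times = []
    for _ in range(n):
        t0 = time.perf_counter()
        requests.get(BASE + path, timeout=30)
        times.append(time.perf_counter() - t0)
    return statistics.mean(times)

for path in ENDPOINTS:
    print(f"{path}: {sample_latency(path):.3f} s mean latency")
```

Repeating such probes under different concurrent-user loads would reproduce the kind of comparison reported here.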


4 CONCLUSIONS and FUTURE DEVELOPMENTS

To conclude, C-Learning lays the groundwork to be considered an innovative form of Web 2.0, where users' needs go beyond voice communication and place strong demands on services requiring connectivity and transmission bandwidth, which no longer has to be rigidly fixed by contracts but must be flexible and readily adapted to user requests; and also as the latest form of globalization, bringing a number of end-user benefits, in particular resources that can be requested and obtained on demand. It can be argued to be a model that changes the way resources are delivered, because they are decoupled from technology and "encapsulated" in IT services. Such services are dynamic and flexible and can be used individually or in larger business contexts, allowing optimal use of resources because they are shared among multiple users (multi-tenant model). This article analyzed the didactic-organizational complexity of the Docebo Cloud e-learning platform and reported the variable response times observed in managing various tasks. The experimentation, carried out in a community of students exploring the new content delivery platform, achieved optimal results with fewer than 100 users, depending strongly on the skills of the participants.

REFERENCES

[1] M. M. Alabbadi, "Cloud computing for education and learning: Education and learning as a service (eLaaS)", Proc. of the 14th International Conference on Interactive Collaborative Learning (ICL2011), IEEE, pp. 589-594.

[2] M. Al-Zoube, "E-learning on the Cloud", International Arab Journal of e-Technology, vol. 1, n. 2, 2009, pp. 58-64.

[3] M. Armbrust, A. Fox, R. Griffith, A. D. Joseph, R. Katz, A. Konwinski, G. Lee, D. Patterson, A. Rabkin, I. Stoica, M. Zaharia, "A View of Cloud Computing", Commun. ACM, vol. 53, n. 4, 2010, pp. 50-58.

[4] D. Bein, W. Bein, P. Madiraju, "The Impact of Cloud Computing on Web 2.0", EIJ, vol. 9, n. 1, 2009, pp. 5-12, USA.

[5] P. Campanella, "Cloud Computing: un nuovo paradigma", Atti VIII convegno nazionale Sie-L, Connessi! Scenari di innovazione nella formazione e nella comunicazione, 14-16/09/2011, Ledizioni, Reggio Emilia, Italy.

[6] P. Campanella, "Cloud E-learning: un nuovo binomio", Atti Didamatica 2016 – Innovazione: sfida comune di scuola, università, ricerca e impresa, 30° edizione AICA, 19-21/04/2016, pp. 128-137, Udine, Italy.

[7] P. Campanella, "Method of experimental evaluation of ICT in teaching", Proc. of E-Learn 2011, World Conference on E-Learning in Corporate, Government, Healthcare and Higher Education, organized by AACE, Honolulu, Hawaii, USA, 17-21/10/2011.

[8] D. Chandran, S. Kempegowda, "Hybrid E-learning Platform based on Cloud Architecture Model: A Proposal", Proc. International Conference on Signal and Image Processing (ICSIP), 2010, pp. 534-537, IEEE.

[9] J. Cappos, I. Beschastnikh, A. Krishnamurthy, T. Anderson, "Seattle: a platform for educational cloud computing", ACM SIGCSE Bulletin, vol. 41, n. 1, 2009, pp. 111-115, March 3-7, New York, USA.

[10] T. Ercan, "Effective use of cloud computing in educational institutions", Procedia - Social and Behavioral Sciences, vol. 2, pp. 938-942, 2010.

[11] R. Katz, P. Goldstein, R. Yanosky, "Cloud computing in higher education", Educause, 2010.

[12] A. Khajeh-Hosseini, D. Greenwood, I. Sommerville, "Cloud Migration: A Case Study of Migrating an Enterprise IT System to IaaS", IEEE 3rd International Conference on Cloud Computing, 2010, pp. 450-457.

[13] S. Impedovo, P. Campanella, G. Facchini, G. Pirlo, R. Modugno, L. Sarcinella, "Learning Management Systems: un'analisi comparativa delle piattaforme open-source e proprietarie", Atti Didamatica 2011 - Informatica per la didattica, Torino, 04-06/05/2011.

[14] S. Impedovo, P. Campanella, "Docebo Cloud: apprendimento e nuove tecnologie", Atti Didamatica 2012 - Informatica per la didattica, Taranto, 14-16/05/2012.

[15] G. Lin, D. Fu, J. Zhu, "Cloud computing: IT as a Service", vol. 11, n. 2, 2009, pp. 10-13, ACM.

[16] G. Lin, G. Dasmalchi, J. Zhu, "Cloud Computing and IT as a Service: Opportunities and Challenges", Proc. of the IEEE International Conference on Web Services, 2008.

[17] A. Manzalini, C. Moiso, E. Morandin, "Cloud computing: stato dell'arte e opportunità", Notiziario Tecnico Telecom Italia, n. 2, 2009.

[18] M. Miller, "Cloud Computing: Web-Based Applications that Change the Way You Work and Collaborate Online", 2008.

[19] E. Martins Morgado, "An exploratory essay on Cloud Computing and its Impact on the use of Information and Communication Technologies in Education", in Education in a Technological World: Communicating Current and Emerging Research and Technological Efforts, ed. Méndez-Vilas, Formatex, 2011.

[20] S. Ouf, M. Nasr, Y. Helmy, "An enhanced e-learning ecosystem based on an integration between cloud computing and Web 2.0", Proc. of the 10th IEEE International Symposium on Signal Processing and Information Technology (ISSPIT), pp. 48-55, 15-18/12/2010.

[21] P. Pocatilu, F. Alecu, M. Vetrici, "Using Cloud Computing for E-learning Systems", Proc. 8th WSEAS International Conference on Data Networks, Communications, Computers (DNCOCO'09), Baltimore, USA, 2009, pp. 54-59.

[22] L. Qian, Z. Luo, Y. Du, L. Guo, "Cloud computing: an overview", 2009, pp. 626-631, Springer.

[23] N. M. Rao, C. Sasidhar, V. S. Kumar, "Cloud Computing Through Mobile Learning", International Journal of Advanced Computer Science and Applications (IJACSA), vol. 1, n. 6, 2010, pp. 42-46.

[24] N. Sultan, "Cloud computing for education: a new dawn?", International Journal of Information Management, Elsevier, vol. 30, issue 2, 2010, pp. 109-116.

[25] D. G. Velev, "Challenges and Opportunities of Cloud-based Mobile Learning", IJIET, IACSIT, vol. 4, n. 1, 2014, pp. 49-53.

[26] M. A. Vouk, "Cloud computing: issues, research and implementations", Journal of Computing and Information Technology, vol. 4, 2008, pp. 235-246.

[27] B. Wang, H. Y. Xing, "The application of cloud computing in education informatization", International Conference on Computer Science and Service System, IEEE, pp. 2673-2676, 27-29/06/2011.

[28] S. Yanfei, "Cloud Computing applications in the network learning", in China Outside School Basic Education Edition, 2010.

[29] Q. Zhang, L. Cheng, R. Boutaba, "Cloud Computing: state-of-the-art and research challenges", Journal of Internet Services and Applications, vol. 1, n. 1, 2010, pp. 7-18, Springer.

[30] L. Wang, G. von Laszewski, A. Younge, X. He, "Cloud Computing: a perspective study", vol. 28, n. 2, 2010, pp. 137-146, Springer.


Enterprise System Maturity – Past, Present and Future:
A Case Study of Botswana
Oduronke T. Eyitayo
Department of Computer Science
University of Botswana
Private Bag 00704
Gaborone, Botswana
eyitayoo@mopipi.ub.bw

ABSTRACT

The early stages of computerisation were preoccupied with managing activities such as operations, programming, data collection, etc. This then developed into information systems. Such systems include customer management systems, financial systems, human resource systems, business intelligence systems, asset management systems, waste management systems, document management systems, workflow management systems, and hundreds of other systems. In recent years these systems have been integrated with each other to such an extent that it is often necessary to view them not as hundreds of different systems but as one single system comprising several systems, often referred to as an enterprise-wide information system. Many organisations, however, have encountered difficulties when attempting to adapt enterprise resource planning systems to their business operations. These factors led to the growth of Enterprise Architecture. This paper focuses on the trend in the development of the Botswana enterprise system. In order to trace the Botswana maturity story through the literature, the question the researcher set out to answer was: how have enterprise systems evolved over time, and where are they now? These questions are answered with insights from the academic literature, using online databases. The study uses the Gartner model to analyse the Botswana story. The planning and implementation approaches adopted by the nation explicitly show how the development agenda of a country, if followed through, can impact positively on the country's development.

Keywords: Botswana; Enterprise Resource Planning; Enterprise Architecture; Enterprise Maturity; Gartner Maturity Model

1. INTRODUCTION

The first era of computerisation emerged from the 1960s and was referred to as the data processing era, while the 1970s onwards were referred to as the management information systems (MIS) era [1]. The first era focused on efficient transaction handling and effective resource control, while the second focused on effective problem resolution and support for decision making. The two eras overlapped as data processing continued to mature. Ward and Peppard [1] noted that as more data were stored in computer systems, their use matured into using information to increase the effectiveness of decision making. The legacy of data processing applications was often, at best, a fragmented data resource. This led to organisations re-organising data and applications into integrated data-based systems to enable MIS to be developed.

As the business value of personal computers became clear to enterprises, the next logical step was to link them together. The US Defense Advanced Research Projects Agency's ARPANET initiative in the sixties and early seventies was the dawn of efficient and large-scale networking. In the early nineties it reached global impact with the Internet. As the interconnected networks grew, the power and number of applications using them

increased, and consequently their importance grew. With the evolution of extended computer networks, the computer itself lost its position as the main focus, which shifted instead to the availability and capacity of the networks. Ward and Peppard [1], in their own analysis, described the third era as the strategic information systems era, characterised by improved competitiveness.

An offshoot of the third era is enterprise systems. Today there are information systems for most of the tasks performed in an enterprise: customer management systems, contract management systems, financial systems, human resource systems, business intelligence systems, asset management systems, document management systems, workflow management systems, and hundreds of other systems. In recent years, these systems have been integrated with each other to such an extent that it is often necessary to view them not as hundreds of different systems, but as one single system of systems: an enterprise-wide information system.

As the information systems grew in scope and number, the information they stored, and the information they would benefit from having access to, also increased [2]. This situation led to demands for automated exchange of information between the systems. Technological development responded rapidly to these demands by devising methods for connecting two systems to each other. However, the number of connections between systems has become overwhelming, which has led to complicated systems with new challenges when attempting to modify enterprise resource planning systems to meet business operations. This has led to the growth of Enterprise Architecture.

Enterprise Architecture began gaining traction in the mid-to-late 1980s, after John Zachman published an article describing a framework for information systems architectures in the IBM Systems Journal [3]. The field initially set out to address two problems: system complexity (organisations were spending more and more money building IT systems) and poor business alignment (organisations were finding it more and more difficult to keep those increasingly expensive IT systems aligned with business needs). The cost and complexity of IT systems have increased exponentially, while the chances of deriving real value from those systems have dramatically decreased [4].

Chen et al. [5] noted that enterprise architecture is a "challenging but still confusing" concept. It is defined in many different ways. Engelsman et al. [6] describe Enterprise Architecture (EA) as "a design or a description that makes clear the relationships between products, processes, organisation, information services and technological infrastructure; it is based on a vision and on certain assumptions, principles and preferences; consists of models and underlying principles; provides frameworks and guidelines for the design and realisation of products, processes, organisation, information services, and technological infrastructure." Another definition states that it comprises a collection of simplified representations of the organisation, from different viewpoints and according to the needs of different stakeholders [7]. EA helps describe and manage changes in enterprises so as to enhance their consistency and agility [8].

Enterprise architecture has evolved as a discipline since the 1990s. Early work focused on architecture models, principles, and standards that comprise the content of the enterprise architecture. As companies have gained practical experience

implementing EA concepts, they have become more concerned with enterprise architecture management (EAM) and its success factors [9]. This is the focus of the fifth era.

The fifth era is what Ward and Peppard [1] classified as their own fourth era. Here, sustainability from an IS perspective can be defined as an organisation's ability to continually deliver explicit business value through IS/IT, thus leading to advantage. Winter, Legner and Fischbach [9] recognised that architecture is a key concern of information management today. They noted that there is now general acknowledgment that only continuous and holistic management of the Enterprise Architecture (EA) can ensure the sustainability, agility, and strategic alignment of corporate IT environments.

The purpose of this paper is to look at the Botswana enterprise system, how it has matured over the years, and what a view into its future looks like.

2. EA MATURITY MODELS

There are several models used for measuring maturity levels within an organisation. This section considers three such models.

One of the popular maturity models in Enterprise Architecture is Gartner's maturity model, which distinguishes five levels of maturity within enterprise architecture practices. These five levels are also similar to the five eras discussed earlier. At Level 1, the non-existent phase, enterprise architecture either has not existed until that point or is at its very early stages of implementation. Level 2 is reactive: at this phase enterprise architecture is officially recognized by the organization, but the nature of the practice is one of response to needs rather than pre-planning. At Level 3, enterprise architecture is at the functioning phase. At Level 4, an integrated system is needed in order to deliver value consistently. At the final phase, Level 5, enterprise architecture becomes a natural way of working; the effect trickles down across the organization [10].
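For quick reference, the five levels just described can be encoded as a small lookup table. The sketch below paraphrases the wording of this section, not Gartner's official documentation.

```python
# Illustrative encoding of the five maturity levels described above
# (labels paraphrase this section, not Gartner's official wording).
GARTNER_EA_LEVELS = {
    1: "Non-existent: no EA practice, or only just starting",
    2: "Reactive: EA recognized, but responds to needs ad hoc",
    3: "Functioning: EA operates as a planned practice",
    4: "Integrated: systems integrated to deliver value consistently",
    5: "Ubiquitous: EA is the natural way of working, organization-wide",
}

def describe(level: int) -> str:
    """Return the label for a maturity level, or a fallback string."""
    return GARTNER_EA_LEVELS.get(level, "unknown level")

print(describe(4))
```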


Another model is the one used by Feurer [11], who described enterprise architecture as a continuous journey. He classified the maturity levels into four: the business silo architecture, the standardized technology stage, the optimized core stage, and the business modularity architecture. At the business silo stage, companies primarily use Information Technology (IT) investments to meet local business needs. At the second maturity stage, standardized technologies are used to move from local applications to a shared infrastructure. At the third stage, the optimized core stage, companies view data and applications from an enterprise perspective and so mainly focus on shared data and enterprise systems; IT extracts crucial data from enterprise systems and makes it reusable for business processes and other IT applications. Business processes optimized in the third stage are refined and modularized in the fourth stage. The outcomes of these activities are reusable modules (mostly based on Web service technology) connected to core data and backend processes; IT ensures that these modules can be integrated into business processes.

The third model considered here is the Capability Maturity Model (CMM), a methodology used to develop and refine an organization's software development process. The model describes a five-level evolutionary path of increasingly organized and systematically more mature processes. At the initial level, the organisation does not provide a stable environment for developing and maintaining software. At level 2, the repeatable level, policies for managing software projects, and procedures to implement them, are established. At the defined level, level 3, the standard process for developing and maintaining software across the organisation is documented. At level 4, the managed level, the organisation sets quantitative quality goals for both software products and processes; productivity and quality are measured at this phase. At the optimising level, level 5, the organisation is focused on continuous process improvement [12].

CMM is more focused on the software development process. The Feurer and Gartner models are quite similar; however, the five distinct stages in Gartner make it a more detailed model. In this paper, Gartner being a trusted source of analysis, its model is used as the focus of this study.

3. METHODOLOGY

In order to trace the Botswana maturity story through the literature, the question to be answered was: how have enterprise systems and EA evolved over time, and where are they now? To answer this question with insights from the academic literature, information and background on Information Systems, Information Technology and Enterprise Architecture related to Botswana were searched using Ebscohost (Table 1) and Google Scholar (Table 2).

Table 1: Ebscohost Search (date: 23rd August 2016)

Search terms                             Found
Information System and Botswana            465
IT/ICT and Botswana                      17132
Enterprise System and Botswana              11
Enterprise Architecture and Botswana      1022

Table 2: Google Scholar Search (date: 23rd August 2016)

Search terms                             Found
Information System and Botswana         149000
IT/ICT and Botswana                     153000
Enterprise System and Botswana           32700
Enterprise Architecture and Botswana      6660

4. FINDINGS - THE BOTSWANA STORY

A few works were found that give a good picture of the story and maturity of Information Systems and Enterprise Architecture in Botswana. These are shown in Table 3.

Table 3: Relevant Literature on the Maturity of Information Systems and Enterprise Architecture in Botswana

Author                                        Year   Title
Ayoku Ojedokun and Kgomotso Moahi [13]        2006   Information and Communication Technology (ICT) Systems in Botswana Government Departments
Stephen M. Mutula and Pieter van Brakel [14]  2006   E-readiness of SMEs in the ICT sector in Botswana with respect to information access
Faith-Michael E. Uzoka [15]                   2010   A CMM Assessment of Information Systems Maturity Levels in Botswana
Nugi Nkwe [16]                                2011   State of Information Technology Auditing in Botswana
Nugi Nkwe [17]                                2012   E-Government: Challenges and Opportunities in Botswana
Oduronke Temitope Eyitayo [18]                2014   Enterprise resource planning (ERP) systems – Is Botswana winning? A question on culture effects

According to Ojedokun and Moahi [13], the use of information technology was introduced by government in 1966 in the Office of the Accountant General, for the processing of payroll and accounts using accounting machines, which were later replaced by ICT punched card tabulators in 1969. This corresponds to the first era,


referred to as the data processing era by Ward and Peppard [1]. After this era, a number of critical ICT systems were implemented in Botswana government departments. The details are described using the government's National Development Plans (NDPs). Since independence in 1966, Botswana's development process has been guided by successive National Development Plans. These have provided a medium-term planning and budgeting framework (typically 5-6 years) for capital and recurrent expenditure, and have been a key feature of Botswana's system of development management. The plans outline the government's development priorities for the plan period as well as the policies, programmes and projects required to achieve those priorities [19].

The information system development over the years is shown in Table 4, following the NDP plans.

Table 4: Information Technology Development According to the National Development Plans (NDP phase and what was achieved)

NDP 7 (1991/92-1996/97):
- Voters' registers, a new payroll system and vehicle registration [13].

NDP 8 (1997-2002):
- National Registration System, Payroll System, and Computerised Personnel Management System (CPMS) [13].
- Livestock Identification and Trace-back System (LITS), Automated System for Customs Data (ASYCUDA), Vehicle Registration System [20], [21].
- Department of Supplies, Government Data Network I, Tax Payer Management System I, and Trade Statistics [13].
- Water Utilities Corporation implemented the Master Information Systems Plan (MISP); Teacher Management System (TMS) and the Student Selection System (SSS) [13].
- Automation of some post office counters, and the installation of a Performance Management System by Botswana Post [13].
- The Ministry of Trade, Industry, Wildlife and Tourism (MTIWAT) developed the Ministry's website and automated the office of Wildlife and National Parks [13].

NDP 9 (2003/4-2008/09):
- MTIWAT: automation of the registration of companies and business names, of trade and industrial licenses and of tourism information management; development of document management and workflow management processes; review and development of all Department of Wildlife and National Parks systems [13].
- Ministry of Local Government (MLG): Tribal Land Administration System for land use planning and management, installation of a Human Resources and Payroll package in all the Local Authorities, Financial Management Computer System, Project Management, Document Management System, a website, a database for recording tribal ceremonies, and a Social Benefits System [13].
- Ministry of Works, Transport and Communications (MWTC): development of communications network infrastructure; a comprehensive review and computerisation of the Road Transport Permits system, earmarked to be executed as an integral part of Vehicle Registration and Licensing [13].


- The Ministry of Minerals, Energy and Water Affairs (MMEWA): development of an Energy Information System; expansion of the Water Resources Information System; installation of a computerised supplies system, an inventory management system and a library information system; expansion of its computerised billing system; and the implementation of a Geosciences Information System [13].
- The Ministry of Health: Integrated Health Communication System and software for Princess Marina and Nyangabwe hospitals [13].
- The Customs department: Taxpayer Management System (TMS) [13].

NDP 10 (2009/10-2016/17):
- Growth of e-learning, e-Passport, e-License, e-Health and e-Business [16].
- Computerized Case Management; Computerization of Lobatse High Court; DPSM Computerized Personnel Management System; Computerization of Students Records and Grant Loan Scheme; Computerization of Teaching Service Management; Government Accounting and Budget System; Integrated Patient Management System; National Archives and Records Management Systems; Computerization of Civil and National Registration; Computerization of Labor and Social Security; Local Government Computerization of Human Resource Management; Computerization of Social Benefit and Reconciliation System [17].
- Implementation of ERP systems: among those that have implemented ERP systems are banks, Water Utilities Corporation, Botswana Post Offices, some embassies and the Government of Botswana [18].
- The beginning of the Government Enterprise Architecture [22].

5. DISCUSSION

The different National Development Plans addressed the key process areas of the Gartner maturity model. Prior to NDP 7, Botswana was at the non-existent phase: only a few data processing applications were available. Information systems proper began to appear during the NDP 7 and 8 phases. This is the reactive phase, where the practice of every ministry was to develop systems based on needs rather than on pre-emptive planning and action. During the NDP 9 phase, business outcomes became part of the core value. This phase saw the introduction of systems designed with whole ministries in mind rather than just individual tasks; there now existed what can be called functioning systems, such as billing systems, library systems and the tribal land administration system. This development led to the integration phase in NDP 10, with the development of integrated systems. This is where the nation is presently. This era has brought about systems that work together. As the economy of the country improves, organisations across Botswana are increasing their investments in Enterprise Resource Planning (ERP) technology as a way of helping grow their business. Some institutions that have implemented ERP systems are banks, Water Utilities Corporation, Botswana Post Offices, some embassies, the University of Botswana and the Government of Botswana [18].
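The phase-to-level mapping developed in this discussion, including the NDP 11 projection made below, can be summarized as a small lookup table. The sketch is an illustrative restatement of the argument, not a tool used in the study.

```python
# Summary of the NDP-to-Gartner-level mapping argued in this section
# (an illustrative restatement of the discussion, not the paper's tool).
NDP_TO_GARTNER = {
    "pre-NDP 7": (1, "Non-existent: isolated data processing"),
    "NDP 7-8":   (2, "Reactive: systems built per ministry, on demand"),
    "NDP 9":     (3, "Functioning: ministry-wide systems in routine use"),
    "NDP 10":    (4, "Integrated: ERP and integrated systems"),
    "NDP 11":    (5, "Ubiquitous: projected via the GEA programme"),
}

for phase, (level, note) in NDP_TO_GARTNER.items():
    print(f"{phase}: level {level} - {note}")
```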


Still during NDP 10, in an official gazette released on September 15, 2015 by the Public Procurement and Asset Disposal Board (PPADB), the Botswana government announced that Korea IT Consulting had been selected as the winner of the "Professional Services for the Development of an e-Government Service Oriented Enterprise Architecture" project, initiated by the Ministry of Transport and Communications of Botswana [23]. Integration seems to be a never-ending process, because the enterprise continues to evolve in its ever-changing environment as a result of adaptation to external forces, advances in technology, emerging business models, new regulations and/or optimisation of internal solutions, making what is today a fully integrated system the partly integrated system of tomorrow. In its present state, the enterprise architecture practice is delivering value.

The country is now in a phase of defining integration, standards, etc., and how all this comes together within the various sectors of the nation. This is being done in the development of a Government Enterprise Architecture [24].

NDP 11 (2017-2023) hopes to achieve what is highlighted in the e-Government master plan [25]. The fifteen areas of focus are: Upgrade e-Government Strategy; Advancement of e-Governance; Project Evaluation System; e-Document System; Network Optimization; e-Education System; National Health Information System; Business Activity Support System; e-Procurement System; Local Government Informatization; Civil Affairs Single Portal; Administrative Information Sharing Centre; Job Portal; e-Agriculture; and Government Enterprise Architecture (GEA) [25].

Looking at this well-laid-out government plan, if all goes as planned it can be predicted that the end of NDP 11 should lead to a ubiquitous state (level 5), where enterprise architecture success has a trickle-down effect across the nation. At that point, enterprise architecture becomes a natural way of working and the principles behind the discipline are widely adopted. This is depicted in Figure 1.

6. CONCLUSION

This paper reveals significant progress made by the Botswana Government in attaining maturity. The achievements of the Botswana government have improved service delivery and increased accessibility to government services. One lesson to be learnt from the Botswana government's approach is the incorporation of major ICT projects into its national development plans, the phased implementation of those ICT projects, and the government's commitment to achieving the strategies developed. This approach has enabled a great deal to be achieved. National development plans, among others, particularly serve as major statements of the government's development policies and strategies.

Enterprise plans in many countries are not yet very clear; Botswana, however, has taken a step forward in planning its national architecture. It is therefore recommended that other countries in Africa emulate the example of the Botswana government by embarking on a strategised plan for an integrated enterprise. The planning and implementation approaches adopted (i.e. incorporation into national development plans and phased implementation) are particularly worthy of emulation. They explicitly show how the development agenda of a country, if followed through, can impact positively on the country's development.

Defining a maturity model is not really where a nation will find its value; rather, the knowledge gained about the current position, and the insight derived from the model, give the model its worth. This information can be used to influence development roadmaps for enterprise architecture practices.


Figure 1: Botswana’s Maturity Chart

REFERENCES
[1] J. Ward and J. Peppard, Strategic Planning for Information Systems, 3rd ed. UK: John Wiley, 2002.

[2] P. Johnson, R. Lagerström, and M. Ekstedt, IT Management with Enterprise Architecture. ePub, www.ics.kth.se/MAP.pdf, 2012.

[3] The Open Group, "A Historical Look at Enterprise Architecture with John Zachman," 2015. [Online]. Available: https://blog.opengroup.org/2015/01/23/a-historical-look-at-enterprise-architecture-with-john-zachman/

[4] R. Sessions, "A Comparison of the Top Four Enterprise-Architecture Methodologies," May 2007. [Online]. Available: https://msdn.microsoft.com/en-us/library/bb466232.aspx

[5] D. Chen, G. Doumeingts, and F. Vernadat, "Architectures for enterprise integration and interoperability: Past, present and future," Computers in Industry, vol. 59, pp. 647-659, 2008.

[6] W. Engelsman, D. Quartel, H. Jonkers, and M. van Sinderen, "Extending enterprise architecture modelling with business goals and requirements," Enterprise Information Systems, vol. 5, no. 1, pp. 9-36, 2011.

[7] M. Lankhorst, Enterprise Architecture at Work: Modelling, Communication, and Analysis. New York: Springer-Verlag, 2005.

[8] O. Noran, A Meta-Methodology for Collaborative Networked Organisations. Brisbane: Griffith University, School of Computing and Information Technology, 2005.


[9] R. Winter, C. Legner, and K. Fischbach, "Introduction to the special issue on enterprise architecture management," Information Systems and e-Business Management, vol. 12, no. 1, pp. 1-4, 2014.

[10] B. Burke and M. Blosch, "ITScore Overview for Enterprise Architecture," Gartner, 2015. [Online]. Available: https://www.gartner.com/doc/3092223/itscore-overview-enterprise-architecture

[11] S. Feurer, "Enterprise Architecture - Maturity Stages," 2007. [Online]. Available: https://archive.sap.com/kmuuid2/70edb7ce-5fd8-2910-1583-db2b2cd98298/Enterprise%20Architecture%20-%20Maturity%20Stages.pdf

[12] M. C. Paulk, B. Curtis, M. B. Chrissis, and C. V. Weber, "Capability Maturity Model for Software, Version 1.1," Carnegie Mellon University, Pittsburgh, Pennsylvania, 1993. [Online]. Available: https://www.sei.cmu.edu/reports/93tr024.pdf

[13] A. A. Ojedokun and K. Moahi, "Information and Communication Technology (ICT) Systems in Botswana Government Departments," African Journal of Library, Archival & Information Science, vol. 16, no. 2, pp. 79-88, 2006.

[14] S. M. Mutula and P. van Brakel, "E-readiness of SMEs in the ICT sector in Botswana with respect to information access," The Electronic Library, vol. 24, no. 3, pp. 402-417, 2006.

[15] F. E. Uzoka, "A CMM Assessment of Information Systems Maturity," MIS Review, vol. 16, no. 1, pp. 53-84, 2010.

[16] N. Nkwe, "State of Information Technology Auditing in Botswana," Asian Journal of Finance & Accounting, vol. 3, no. 1, pp. 125-137, 2011.

[17] N. Nkwe, "E-Government: Challenges and Opportunities in Botswana," International Journal of Humanities and Social Science, vol. 2, no. 17, pp. 39-48, 2012.

[18] O. T. Eyitayo, "Enterprise Resource Planning (ERP) Systems – Is Botswana Winning? A Question on Culture Effects," Issues in Informing Science and Information Technology, vol. 11, pp. 47-55, 2014.

[19] A. M. Land, "Strengthening of National Capacities for National Development Strategies and Their Management: An Evaluation of UNDP's Contribution: Country Study – Botswana," 2010.

[20] S. Serero and M. Moreri, "E-government in Botswana: From Digital Divide to E-Economy - Issues and Strategies for Public Policy," July 22-25, 2002.

[21] A. Bose, G. Dick, O. F. Seitei, and S. Obuseng, "Botswana Human Development Report 2002: Contributions of Science and Technology and ICT for Healthy Governance," Draft Outline, UNDP, 2002.

[22] J. Sejabosigo, Daily News, February 2016. [Online]. Available: http://allafrica.com/stories/201602180196.html

[23] Y. Choul-Woong, "Korea IT Consulting Wins E-Government EA Deal from Botswana," Korea IT Times, September 2015. [Online]. Available: http://www.ifg.cc/aktuelles/nachrichten/regionen/163-kr-suedkorea-south-korea/52120-korea-it-consulting-wins-e-government-ea-deal-from-botswana

[24] Government of Botswana, Botswana's National e-Government Strategy 2011-2016. [Online]. Available: http://workspace.unpan.org/sites/internet/documents/unpan048687.pdf

[25] Government Modernization Office, Botswana e-Government Master Plan 2015-2021. [Online]. Available: http://www.cit.co.bw/downloads/e-govenment%20master%20plan%20presentation%20to%2014th%20e-government%20board%20version2.pdf


FAD Platforms: Proprietary Solutions

Pasquina Campanella
Department of Computer Science
University of Bari “Aldo Moro”
via Orabona, 4 – 70126 Bari – Italy
pasqua13.cp@libero.it

ABSTRACT

In the face of the continuous development of the internet, training resources in the network have grown informally. For this reason, the need for tools to extract knowledge from them is constantly widening. Starting from the evolution created by the social web, this article explores the potential of current technology tools, with particular reference to the features of e-learning platforms, focusing on proprietary solutions with regard to content delivery modes, user-based monitoring tests, and evaluation techniques, in order to better manage interactive online courses, which make the web user active in the production process. In this direction lies what is called lifelong learning.

KEYWORDS

Monitoring, platforms, performance, collaborative learning, features.

1 INTRODUCTION

The panorama of FAD (distance learning) platforms has seen continuous evolution over the years. The term "platform" means the technological infrastructure that supports e-learning activities and online course management, integrating teaching modules and evaluations within learning groups [1], [10], [11], [12]. In order to promote the use of more advanced and interactive platforms, an analysis of proprietary solutions is proposed here, as a useful contribution to the development of different forms of collaborative learning, which require new capabilities for the integrated management of the formative components of social networks. The analysis was motivated by the fact that the literature allows only a partially objective evaluation of platforms and of how they support learning processes, considering their peculiarities, needs and problems [1], [10], [11], [14]. Since the early 1990s, models for evaluating learning management systems have been developed [4], [7], such as the Commonwealth of Learning model, which examines features including usability, accessibility, collaborative functionality, documentation, installation, technical support, standards compliance, interoperability, content reusability, and tracking [15]. Below, the different proprietary platforms are analyzed in their respective studies and simulations, together with evaluation sessions and, finally, conclusions and future developments.

2 PROPRIETARY PLATFORMS

Some of the major proprietary platforms are listed below, considered with respect to the sharing, participation and collaboration features of Web 2.0, in particular blogs, feedback, chat, forums, podcasting and wikis (Tab. 1) [2], [4], [9], [13]:

Table 1 - Proprietary platforms

In particular, the study focused on communication between learning objects, activity tracking and the results obtained: on-line questionnaires and forum interventions highlight the different polarities of expression [2], [13].


In delivering the courses, the observations and comparisons made were used to understand the user-friendliness of each online platform's navigation, the use of multimodal lessons, the content, the materials, the verification and evaluation forms used, and the interactive support during the course phases. The results below show a simple link between collaborative teaching and user satisfaction, with a varying influence on the degree of socialization gained. In the analysis below, the duration of courses, course management, the quality of the distributed material, and the quality of the transmitted theoretical lectures and exercises were evaluated by administering the following self-assessment questionnaire (Fig. 1):

Figure 1. Platforms monitoring questionnaire

Centra
Web-based collaborative platform with features such as web conferencing, virtual classroom, web seminars and net meetings [2], [4], [12]. The performance evaluation by 100 students (ages 18 to 30) in the community rated the duration of courses and course management at a 54% average, and the quality of lessons, exercises and tests at a 56% average (Fig. 2). Communication issues improved in videoconferencing.

Figure 2. Monitoring Centra
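The per-criterion percentages reported for Centra and for the platforms that follow can be obtained with a simple aggregation of questionnaire answers. The sketch below assumes each answer is normalized to a 0-100 scale, which is our reading of the figures rather than a stated detail of the study, and the sample data are invented.

```python
# Sketch of the per-criterion aggregation behind the reported
# percentages (assumes answers on a 0-100 scale; data invented).
from statistics import mean

CRITERIA = ["course duration", "course management",
            "material quality", "lessons/exercises quality"]

def aggregate(responses):
    """responses: list of dicts mapping criterion -> score (0-100)."""
    return {c: mean(r[c] for r in responses) for c in CRITERIA}

sample = [
    {"course duration": 50, "course management": 58,
     "material quality": 55, "lessons/exercises quality": 54},
    {"course duration": 58, "course management": 50,
     "material quality": 57, "lessons/exercises quality": 58},
]
print(aggregate(sample))
```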


Elluminate Live
Hybrid platform offering a virtual classroom and e-conferencing; multiplatform on Windows, Linux and Mac OS systems [4], [7], [12]. The performance assessment by 150 students (20 to 30 years old) in the community reported a 30% average for course duration, and distributed material quality, lessons, exercises and tests averaging 40% (Fig. 3). Application sharing issues were worked around by continuing in chat; monitoring was balanced.

Figure 3. Monitoring Elluminate Live

e/pop
E-learning tool for content sharing; multiplatform on Windows and Mac OS [3], [4], [12]. The performance evaluation by a contingent of 100 students (aged between 20 and 30) in the community rated the duration of courses and the quality of distributed material at an average of 35%, and the quality of lessons, exercises and tests at an average of 45% (Fig. 4). Monitoring was balanced across the various tests conducted.

Figure 4. Monitoring e/pop

Groove
E-learning tool for collaborative learning, with synchronous and asynchronous solutions [4], [6], [14]. The performance assessment by 100 students (18 to 30 years old) in the community rated course duration at 35%, distributed material at 45%, and the quality of lessons, exercises and tests at an average of 60% (Fig. 5). Monitoring was balanced across the various tests.

Figure 5. Monitoring Groove

HotConference
E-learning tool for collaborative sharing, with synchronous and asynchronous solutions [5], [7], [12]. The performance evaluation by 150 students (aged 19 to 25) in the community rated course duration at 45%, distributed material at 50%, and the quality of lessons, exercises and tests at an average of 48% (Fig. 6). Test monitoring was well balanced.

Figure 6. Monitoring HotConference

LearnLinc
E-learning tool for collaborative learning, with synchronous and asynchronous solutions [4], [8], [12]. The performance evaluation by 100 degree students (between 20 and 25 years old) in the community rated course duration at 60%, distributed material quality at 40%, and the quality of lessons, exercises and tests at an average of 50% (Fig. 7). Monitoring was balanced across the various tests conducted.

Figure 7. Monitoring LearnLinc


Lotus Learning Space
Groupware platform for online learning, with two modules: Learning Space Core and Learning Space Collaboration [4], [8], [12]. The performance evaluation by a contingent of 100 students (between 20 and 30 years old) in the community rated course duration at 50%, distributed material quality at an average of 45%, and the quality of lessons, exercises and tests at an average of 60% (Fig. 8). The testing carried out was somewhat balanced.

Figure 8. Monitoring Lotus Learning Space

Netlearning
E-learning platform for synchronous and asynchronous collaborative learning [4], [8], [12]. The performance evaluation by a contingent of 100 students (between 25 and 35 years old) in the community averaged 55% for course duration, 50% for distributed material quality, and 56% for the quality of lessons, exercises and tests (Fig. 9). Test monitoring was well balanced.

Figure 9. Monitoring Netlearning

Saba Learning Enterprise
Modular e-learning platform consisting of Saba Publisher and Saba Content, for mixed and customizable learning [4], [6], [12]. The performance assessment by a contingent of 100 students (20 to 30 years old) in the community averaged 60% for course duration, 50% for distributed material quality, and 53% for the quality of lessons, exercises and tests (Fig. 10). Monitoring was almost perfectly balanced across the various tests conducted.

Figure 10. Monitoring Saba Learning Enterprise

T-learn
E-learning platform for collaborative learning [4], [7], [12]. The performance evaluation by a contingent of 100 students (between 20 and 30 years old) in the community averaged 30% for course duration, 45% for distributed material quality, and 53% for the quality of lessons, exercises and tests (Fig. 11). Monitoring was almost perfectly balanced across the various tests conducted.

Figure 11. Monitoring T-learn

WebCT


Standalone e-learning platform for synchronous and asynchronous learning [3], [5], [12]. The performance evaluation by a contingent of 100 students (between 20 and 30 years old) in the community averaged 40% for course duration, 50% for distributed material quality, and 43% for the quality of lessons, exercises and tests (Fig. 12). Monitoring was quite balanced across the various tests conducted.

Figure 12. Monitoring WebCT

WebConference
E-learning platform with e-meeting capability; allows the sharing of applications [2], [4], [12]. The performance assessment by a total of 100 students (aged 25 to 35) in the community averaged 41% for course duration, 45% for distributed material quality, and 53% for the quality of lessons, exercises and tests (Fig. 13). Monitoring across the various tests was fairly balanced.

Figure 13. Monitoring WebConference

The platforms that reported the best results for course duration are Saba Learning Enterprise and LearnLinc; for distributed material quality, WebCT, Saba Learning Enterprise, Netlearning and HotConference; for the quality of the theoretical lectures transmitted and of the exercises performed, Groove and Lotus Learning Space.

3 CONCLUSIONS and FUTURE DEVELOPMENTS

To conclude, the potential of existing proprietary solutions was analyzed. In particular, the study focused on monitoring content and training activities in order to identify those features considered fundamental in the evolution brought about by increasingly innovative technology, which puts the user at the center of the use of new content. In this scenario, social collaborative learning has been considered, which transforms the system from a container of learning material into a knowledge sharing and management tool. Practice trials involving course planning, activity tracking, evaluation reports, forums, discussions, application sharing and communication between learning objects showed a positive trend: the simulations, in terms of traceability and results, bear out the questionnaires given to the trainees in terms of knowledge acquired and of variations in performance among those who participated in flexible learning. In delivering the courses, the observations and comparisons made were used to understand the user-friendliness of the online platforms' navigation, the use of multimodal lessons, the content, the materials, the verification and evaluation forms used, and the interactive support during the course phases. Some shortcomings remain in the interfaces of most platforms and in their lack of flexibility. The analysis has proved decisive, considering the scarcity of in-depth studies in the literature in this field. At the moment, further developments tend to favor interoperability.

REFERENCES


[1] M. Banzato, D. Corcione, Piattaforme per la didattica in rete, TD-Tecnologie Didattiche, n. 33, 2004, pp. 22-31, Edizioni Menabò, Ortona.

[2] P. Campanella, Piattaforme per l'uso integrato di risorse formative nei processi di e-learning, Atti Didamatica 2015 – Studio ergo lavoro – dalla società della conoscenza alla società delle competenze, AICA, 15-17/04/2015, Genova, Italy.

[3] P. Campanella, Piattaforme Proprietarie: Un'analisi metodologica, Atti Didamatica 2015 – Studio ergo lavoro – dalla società della conoscenza alla società delle competenze, AICA, 15-17/04/2015, Genova, Italy.

[4] P. Campanella, Platforms and methods for the integrated use of educational resources in the processes of e-learning, in T. Bastiaens & M. Ebner (eds.), Proc. of the World Conference on Educational Multimedia, Hypermedia and Telecommunications (ED-MEDIA 2011), AACE, Chesapeake, VA, 27/06-01/07/2011, pp. 2375-2384, Lisbona, Portogallo.

[5] P. Campanella, Functional Comparison of the Tools and Commercial Platforms in distance e-learning, Proc. of the IADIS International Conference e-Learning 2011, 20-23/07/2011, Roma, Italy.

[6] P. Campanella, Platforms for use integrated resources formative processes in e-learning, Proc. of the 2nd International Conference on Digital Information Processing, Data Mining, and Wireless Communications (DIPDMWC2015), 16-18/12/2015, Islamic Azad University, UAE branch, pp. 181-186, Dubai.

[7] P. Campanella, Learning Management Systems: A Comparative Analysis of Open-source and Proprietary Platforms, Proc. of the 2nd International Conference on Digital Information Processing, Data Mining, and Wireless Communications (DIPDMWC2015), 16-18/12/2015, Islamic Azad University, UAE branch, pp. 187-192, Dubai.

[8] P. Campanella, A Comparative Assessment of E-learning Platforms, Proc. of the 2nd International Conference on Digital Information Processing, Data Mining, and Wireless Communications (DIPDMWC2015), 16-18/12/2015, Islamic Azad University, UAE branch, pp. 193-198, Dubai.

[9] S. Campanella, G. Dimauro, A. Ferrante, D. Impedovo, S. Impedovo, M. G. Lucchese, R. Modugno, G. Pirlo, L. Sarcinella, E. Stasolla, C. A. Trullo, E-learning platforms in the Italian Universities: The technological solutions at the University of Bari, WSEAS Transactions on Advances in Engineering Education, issue 1, vol. 5, 2008, pp. 12-19.

[10] F. Colace, M. De Santo, M. Vento, Evaluating on-line learning platforms: A case study, Proc. of the 36th Hawaii International Conference on System Sciences (HICSS'03), Hawaii, IEEE Press.

[11] D. Colombo, Formazione a distanza, ambienti e piattaforme telematiche a confronto, 2001.

[12] D. F. Garcia, C. Uria, J. C. Granda, F. J. Suarez, F. Gonzalez, A functional evaluation of the commercial platforms and tools for synchronous distance e-learning, Proc. of the 3rd WSEAS/IASME International Conference on Educational Technologies, Arcachon, France, 2007, pp. 330-335.

[13] S. Impedovo, P. Campanella, G. Facchini, G. Pirlo, R. Modugno, L. Sarcinella, Learning Management Systems: Un'analisi comparativa delle piattaforme open-source e proprietarie, Atti Didamatica 2011 - Informatica per la didattica, AICA, 04-06/05/2011, Torino, Italy.

[14] S. Luciani, Caratteristiche tecniche e funzionalità didattiche delle piattaforme per l'apprendimento on-line, in Formazione e cambiamento, web magazine sulla Formazione, 2004.

[15] M. Pedroni, Dall'interoperabilità delle piattaforme all'integrabilità dei moduli interattivi, Omniacom Editore, pp. 731-735, Atti Didamatica 2004, Ferrara, 06-08/05/2004.

Mobile Learning: A Case of Study

Pasquina Campanella
Department of Computer Science
University of Bari “Aldo Moro”
via Orabona,4 – 70126 Bari – Italy
pasqua13.cp@libero.it

ABSTRACT

Digital media technologies and broadband communications networks are undergoing profound transformations, and new trends in content development for training are emerging. In this scenario, with the rapid introduction of mobile devices, among which the smartphone prevails, it was possible to meet the needs of users with a paradigm that involves the "learning in mobility" process. This article provides a case study for the mobile Oracle i-Learning and Claroline platforms, considering their content delivery and user-based monitoring tests as well as evaluation techniques to better manage interactive online courses that make the user an active web participant in the production process.

KEYWORDS

platforms, questionnaires, learning, monitoring, interoperability.

1 INTRODUCTION

In the last few years, there has been a large-scale distribution of mobile devices such as cell phones, handhelds, pocketPC, ebook, tablet PC, smartphones, TV-phones, iPod, iPad and other portable devices; personal communication devices are becoming suitable for displaying multimedia contents. A new communication tool, a new frontier for e-learning: mobile learning [1], [2], [3], [4], [5], [7], [8], [9], [10], [12], [13], [14], [18], [19], [21], [23], [26], [27], [28]. Users pass from simple consumers to content creators, with content designed, modified, or simply shared [10], [11], [20], [22]. Today there is a strong demand for applications oriented to immediate use and easy assimilation [16], [17]. In this context, a follow-up case study has been launched, together with reported experimental results and conclusions and future developments.

2 CASE STUDY

Within the framework of mobile learning, a study was launched that considers analysis as well as modular integration with Oracle i-Learning and Claroline platform plug-ins on four different mobile operating systems, namely Android, iPhone OS, Symbian and Windows Mobile, in order to promote communication by means of services [2], [16], [17]. The mobile learning area is accessed through a specially created application that brings important features such as viewing content, tested on emulators and real-life devices. The prototype examined is the mobile Oracle i-Learning and Claroline platform, of which the screenshots are shown (Fig.1):

Figure 1. Oracle iLearning - Claroline mobile platforms screenshots


The user interface was very intuitive. The issues raised concern: What didactic and communicative models are most effective? What technological solutions can encourage a broad participation of users? Based on these questions, evaluation sessions were performed, as well as simulations with actual users of the prototypes created. Interest has focused on a particular subset of applications in which the mobile device has been used to increase and improve communication.

3 EXPERIMENTAL RESULTS

During the simulations, the students involved were subjected to observations and then interviewed through questionnaires (Fig.2) to ascertain their views.

Figure 2. Questionnaire on mobile learning technologies

The research was carried out at the Intelligent Systems Lab and involved 80 students aged 19 to 21 in Computer Science, who were asked to comment on how m-learning supported them during the learning phase, as well as on whether it aroused their interest. The students proved to be in favor of the experimentation, even when technical problems prevented them from working properly, although only some of the features were found to be responsive at the collaborative level for access and consultation. Following is a graphical representation of the conducted questionnaire (Fig.3):

Figure 3. Users questionnaire results in mobile learning

The results outlined above, for the first question, found that 71% of students claim that mobile technologies supported them adequately during the learning phase. For 24%, the technologies did little to support the learner during the learning phases, while 5% said they did not receive any support from the use of new technologies. For the second question, 76% claim that mobile devices make the learning phase more interesting; for 21%, their use is of little interest, while 3% report no involvement. Asked what they want to receive from their mobile device, the most common response was to get new information. Asked what a mobile device can offer, 90% of users answered to learn more and learn better. Ultimately, positive results were achieved in terms of satisfaction, acquisition of knowledge and performance variations by those who participated. It is crucial that the learner has access to a flexible learning strategy and that all teaching resources are available at any time and on various types of support, to allow users access to information based on their preferences, attitudes and needs [6], [16], [24], [25]. The data show a wide availability of IT and a significant predisposition to the use of mobile devices. The mobile device learning experience finds that around 95% rated the teaching method positively and about 90% wanted to continue studying through the mobile phone. The aim was to create a flexible learning model, which makes it possible to access information with every


type of device and produce flexible materials usable in different situations [4], [15], [17]. The critical issues are the small size of the screen, which does not allow for a large amount of content but only essential concepts, the difficulties of interoperability between the various devices, and connectivity that has been somewhat fragmented.

4 CONCLUSIONS and FUTURE DEVELOPMENTS

To conclude, the rapid large-scale diffusion of mobile devices such as cell phones, handhelds, pocketPC, ebook, tablet PC, smartphones, TV-phones, iPod, iPad and other portable devices brings new trends in developing content for training, the so-called "learning in mobility". A new frontier for e-learning: mobile learning. In this context, a study was conducted among students on the mobile Oracle and Claroline platforms, tested on four different mobile operating systems (Android, iPhone OS, Symbian, Windows Mobile), considering content delivery, user-based monitoring tests and evaluation in order to better manage interactive online courses, which make the web user active in the production process. The results obtained were positive in terms of satisfaction, acquisition of knowledge and performance variations by those who participated. It is crucial that the learner has access to a flexible learning strategy and that all teaching resources are available at any time and on different types of support. The minor criticisms are related to interoperability between the different devices, and further studies are being carried out. Ultimately, we can say that m-learning is aimed at bridging the emerging needs of digital natives and training outcomes.

REFERENCES

[1] B. Alexander, "Going Nomadic: Mobile Learning in Higher Education", Educause Review, vol. 39, n. 5, 2004, pp. 28-35.

[2] M. Alier, J. Casany, P. Casado, "Mobile extension of a Web based Moodle Virtual Classroom", in P. Cunningham, M. Cunningham (eds.), Expanding the knowledge economy: issues, applications, case studies, vol. 4, 2007, pp. 1169-1176, IOS Press.

[3] S. Al-khamayseh, A. Zmijewska, E. Lawrence, G. Culjak, "Mobile Learning Systems for Digital Natives", Proc. of the 6th IASTED International Conference on Web Based Education, 2007, pp. 252-257, Chamonix, France.

[4] J. Attewell, "Mobile Technologies and Learning: a Technology Update and M-learning project summary", Learning and Skills Development Agency, United Kingdom, 2005.

[5] P. Campanella, "Mobile Learning: New forms of education", Proc. of the 10th International Conference on Emerging e-Learning Technologies and Applications, ICETA 2012, IEEE, 08-09/11/2012, pp. 51-56, Stará Lesná, The High Tatras, Slovakia.

[6] P. Campanella, "Mobile learning application for Android using web service", in P. Resta (ed.), Proc. of Society for Information Technology & Teacher Education International Conference, SITE 2012, AACE, 05/03/2012, pp. 1677-1682, Austin, Texas, USA.

[7] Y. Y. Chan, S. C. Chan, C. H. Leung, A. K. W. Wu, "Mobilp: a mobile learning platform for enhancing lifewide learning", Proc. of the 3rd IEEE International Conference on Advanced Learning Technologies, Athens, Greece, pp. 457-457, 09-11/07/2003.

[8] S. J. Geddes, "Mobile learning in the 21st century: benefit for learners", Knowledge Tree e-Journal, vol. 30, n. 3, 2004, pp. 214-228.

[9] G. Guazzaroni, "Fare esperienze di apprendimento con tecnologie di mobile learning", tratto da giornata di studio sul mobile learning organizzato dal Collaborative Knowledge Building Group (CKBG), Genova, 2010.

[10] J. Herrington, A. Herrington, J. Mantei, I. Olney, B. Ferry, "New technologies, new pedagogies: mobile learning in higher education", Faculty of Education, University of Wollongong, Australia, 2009.

[11] J. Herrington, J. Mantei, A. Herrington, I. W. Olney, B. Ferry, "New technologies, new pedagogies: mobile technologies and new ways of teaching and learning", in R. Atkinson & C. McBeath (eds.), Proc. ASCILITE 2008, pp. 419-427, Melbourne, Australia.

[12] D. Keegan, "Mobile learning - the next generation of learning", Proc. of the 18th Asian Association of Open Universities Annual Conference, Shanghai, China, 28-30/11/2004.

[13] J. Kossen, "Mobile e-learning: when e-learning becomes m-learning", Palmpower Magazine, 2005.


[14] A. Kukulska-Hulme, J. Traxler, "Mobile learning: a handbook for educators and trainers", vol. 8, n. 2, Routledge, London, 2005.

[15] S. Impedovo, P. Campanella, "Mobile recommended system on Android platform", Proc. of the 18th International Conference on Distributed Multimedia Systems, DMS 2012, pp. 33-38, 09-11/08/2012, Eden Roc Renaissance Miami Beach, Florida, USA.

[16] S. Impedovo, P. Campanella, "Mobile Computing: sviluppo applicazione Voip su Symbian OS", Atti Didamatica 2012 - Informatica per la didattica, AICA, Taranto, 14-16/05/2012.

[17] S. Impedovo, P. Campanella, G. Facchini, G. Pirlo, "Mobile Platforms: un'analisi comparativa", Atti VIII Convegno Nazionale SIe-L, "Connessi! Scenari di innovazione nella formazione e nella comunicazione", Ledizioni, pp. 499-504, Reggio Emilia, Italy, 14-16/09/2011.

[18] C. Leung, Y. Chan, "Mobile Learning: A new paradigm in electronic learning", Proc. of the 3rd IEEE International Conference on Advanced Learning Technologies (ICALT'03), 09-11/07/2003, pp. 76-80.

[19] L. Milrad, "Mobile Learning: challenges, perspectives, and reality", in Mobile learning: essays on philosophy, psychology and education, Passagen Verlag, Vienna, Austria, 2003, pp. 151-164.

[20] A. Mulliah, E. Stroulia, "Mobile devices for collaborative learning in practicum courses", International Journal of Mobile Learning and Organisation, vol. 3, n. 1, 2009, pp. 44-59.

[21] M. Pieri, D. Diamantini, "Il mobile learning", ed. Angelo Guerini e Associati, Milano, 2009.

[22] H. Ryu, D. Parsons, "Innovative mobile learning: techniques and technologies", Information Science Reference, IGI Publishing, Hershey, New York, 2009, pp. 21-46.

[23] P. Seppala, H. Alamaki, "Mobile learning in teacher training", Journal of Computer Assisted Learning, vol. 19, 2003, pp. 330-335.

[24] M. Sharples, D. Corlett, O. Westmancott, "The design and implementation of a Mobile Learning resource", Journal of Personal and Ubiquitous Computing, vol. 6, n. 3, 2002, pp. 220-234.

[25] M. Sharples, "The design of personal mobile technologies for lifelong learning", Computers and Education, vol. 34, 2000, pp. 177-193.

[26] J. Traxler, "Defining mobile learning", Proc. IADIS International Conference Mobile Learning 2005, IADIS Press, pp. 261-266, Lisbona, Portogallo.

[27] K. Yordanova, "Mobile Learning and integration of advanced technologies in education", Proc. of the International Conference on Computer Systems and Technologies, CompSysTech'07, pp. 1-5, ACM, 14/06/2007.

[28] B. Zuga, I. Slaidins, A. Kapenieks, A. K. Strazds, "M-learning and mobile knowledge management: similarities and differences", International Journal of Computing & Information Sciences, vol. 4, n. 2, 2006, pp. 58-62.

Message Hiding with Pseudo-Random Binary Sequence Utilization

Gabriel Bugár, Juraj Gazda, Dušan Levický


Faculty of Electrical Engineering and Informatics,
Technical University of Košice, Letná 9, 042 00 Košice, Slovak Republic
gabriel.bugar@tuke.sk

ABSTRACT

This paper presents a novel computational secret message hiding technique, also called a Code-booked steganography technique, based on Code Division Multiple Access (CDMA) technology with regard to the security and perceptibility of the hidden message. This CDMA technique and its variations are mainly exploited in mobile phone communication standards, the world's most widely used cell phone standard. We model the cover object transmission over two interacting social networks using a user-based model in which we apply the fundamental principles of steganography. We show that the imposed features of CDMA techniques agree with the imperatives claimed upon steganography systems, and that this facilitates the creation of an unsuspicious communication channel.

KEYWORDS

Steganography; SSIM; DCT; DWT; Social Network.

1 INTRODUCTION

Steganography provides techniques of hidden communication that are mainly used in countries where the freedom of speech is usually forbidden. The fundamental principle lies in embedding the secret message into camouflage, inconspicuous cover data, where one of the indispensable requirements is to ensure that an unwanted party will not be aware of the existence of the embedded information in a cover object. From that place, its goal is to hide the presence of communication [1], [2]. Nowadays, the most used unsuspicious digital communication cover objects are digital document files, image and audio files, programs or protocols [3]. Mainly the large file size media files are excellent for stenographic transmission because of their large part of redundancy [4]. Conversely, in some situations, the knowledge and possibility of detecting a hidden communication and messages may be needed. This part of science is often called steganalysis, and its primary objective is to break a steganography system and uncover a hidden communication [5]. This steganalysis has also led to the invention of several very sophisticated image stenographic algorithms. Just a brief review of these: the most popular one is the JSteg method, which embeds a secret message in sequential order in the LSBs of the quantized DCT coefficients while skipping 0s and 1s [6]. The next, also prevalent one is the OutGuess method, which disseminates data into the LSBs of quantized DCT coefficients; the coefficient modification is followed by a correction process to ensure consistent distributions of any related pair of the DCTs. Another well-known method, F5, applies the technique of matrix encoding to hold the secret message using the LSBs of the DCTs [7]. The steganography channel expressed in this paper can represent a conventional system by hiding a message in other regular, inconspicuous information - a secret image in a cover image. There, in the system verification process, we tried to improve the algorithm substantiality, and that is why we used the user's social network account information wall as the transport medium. The embedding process relies upon hiding a modified secret image in each area of the DWT transformation coefficients of the quantized DCT coefficients' LSB plane. It means the execution of two separate transformations of a cover object before the embedding of the secret message takes place. The acquisition of the


proposed method based on pseudo-random binary sequences holds a meaningful improvement of the confidence level in steganography systems. Cover object degradation is satisfactorily small; in this manner, the stego image quality distortion is more imperceptible to the human eye. Secondly, adding the cryptographic element to the stenographic algorithm in the process of embedding, represented by the CDMA behavior, also enormously increases the security and imperceptibility of the steganography system.

2 GENERAL SUPPOSITIONS

In this section, we briefly describe the necessary knowledge needed to understand the proposed model. A fundamental of the process of secreting a secret message relies on the differences of the stego object. The stego object represents the original cover object with an embedded hidden message. A function of conformity can express its measure. Hence, the objective conformity measures, which stem from a statistical approach, are based on measuring the distortion between an original cover object and the modified cover object = the stego object.

2.1 Peak Signal to Noise Ratio (PSNR)

The static multi-level images use several metrics of objective quality evaluation. In the field of steganography, those usually measure the distortion between stego and cover objects. One of the most common is the PSNR, based on the mean squared error (MSE). Given a noise-free M×N monochrome image object I and its noisy approximation K, the MSE is defined as [8]:

$$\mathrm{MSE} = \frac{1}{M N}\sum_{i=1}^{M}\sum_{j=1}^{N}\left[I(i,j) - K(i,j)\right]^{2} \qquad (1)$$

$$\mathrm{PSNR} = 20\log_{10}\left(\mathrm{MAX}_I\right) - 10\log_{10}\left(\mathrm{MSE}\right)\ \mathrm{dB} \qquad (2)$$

$$\mathrm{PSNR}_{CSF} = 10\log_{10}\frac{255^{2}}{\frac{1}{N_1 N_2}\sum_{i=1}^{N_1}\sum_{j=1}^{N_2}\left(I^{W}(i,j) - I_{CSF}(i,j)\otimes g\right)^{2}}\ \mathrm{dB} \qquad (3)$$

$$\mathrm{SSIM}(A,B) = \frac{\left(2\mu_A \mu_B + C_1\right)\left(2\,\mathrm{cov}_{AB} + C_2\right)}{\left(\mu_A^{2} + \mu_B^{2} + C_1\right)\left(\sigma_A^{2} + \sigma_B^{2} + C_2\right)} \qquad (4)$$

Here, MAX_I is the maximum possible value of the object.
and modified cover object = the stego object.
2.2 2D - Discrete Cosines Transformation

The Discrete Cosines Transformation (DCT) performs an object transformation from the spatial to the frequency domain. We decided to use the 2D modification of the DCT. The 2D-DCT provides upper computational convergence; therefore, it offers better usability in the field of steganography. The high computational convergence means that by multiplying the horizontally oriented 1D basis functions with a vertically oriented set of the same functions, we make 2D basis functions. Consequently, the 2D-DCT is a direct extension of the 1D case and is given in the following way:

$$F_{u,v} = C_u C_v \sum_{i=0}^{N-1}\sum_{j=0}^{N-1} f(i,j)\cos\!\left[\frac{(2i+1)u\pi}{2N}\right]\cos\!\left[\frac{(2j+1)v\pi}{2N}\right] \qquad (5)$$

$$C_u = \begin{cases}\sqrt{\tfrac{1}{N}} & u = 0\\ \sqrt{\tfrac{2}{N}} & \text{otherwise}\end{cases}\qquad C_v = \begin{cases}\sqrt{\tfrac{1}{N}} & v = 0\\ \sqrt{\tfrac{2}{N}} & \text{otherwise}\end{cases} \qquad (6)$$
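To fix the notation of Eqs. (5)-(6), a direct (unoptimised) Python transcription is sketched below; in practice a library routine (e.g. a type-II DCT) would be used, so the naive double loop is illustrative only.

```python
import numpy as np

def dct2(f):
    # Direct implementation of Eqs. (5)-(6) for an N x N block
    N = f.shape[0]
    F = np.zeros((N, N))
    for u in range(N):
        for v in range(N):
            cu = np.sqrt(1.0 / N) if u == 0 else np.sqrt(2.0 / N)
            cv = np.sqrt(1.0 / N) if v == 0 else np.sqrt(2.0 / N)
            s = 0.0
            for i in range(N):
                for j in range(N):
                    s += f[i, j] * np.cos((2 * i + 1) * u * np.pi / (2 * N)) \
                                 * np.cos((2 * j + 1) * v * np.pi / (2 * N))
            F[u, v] = cu * cv * s
    return F

# Example: transform one 8x8 block, as used in the embedding pipeline
block = np.arange(64, dtype=float).reshape(8, 8)
F = dct2(block)  # F[0, 0] holds the DC (energy-packing) coefficient
```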
2.3 2D - Discrete Haar Wavelet Transformation

The Discrete Wavelet Transform based on the Haar wavelet (HT) is the simplest useful energy compression process, which can effectively serve very useful and fast object decomposition. The Haar transform, like all wavelet transforms, decomposes a discrete object into sublevels {a1|d1} of half its length, where a1 = [a1, a2 … aN/2] represents the approximation (average) coefficients and d1 = [d1, d2 … dN/2] the detailed (difference) coefficients. The first value of the approximation coefficient a1 is computed by taking the average of the first pair of values and then multiplying it by the square root of 2. Application of 2D-HT to an object will retrieve transformation coefficients that are defined as a decomposition of an input object, also known as the approximation component LL and the detailed components LH, HL and HH [9]:

$$LL(m,n) = \tfrac{1}{\sqrt{2}}\left[a(2m-1,n) + a(2m,n)\right] \qquad (7)$$

$$LH(m,n) = \tfrac{1}{\sqrt{2}}\left[a(2m-1,n) - a(2m,n)\right] \qquad (8)$$

$$HL(m,n) = \tfrac{1}{\sqrt{2}}\left[d(2m-1,n) + d(2m,n)\right] \qquad (9)$$

$$HH(m,n) = \tfrac{1}{\sqrt{2}}\left[d(2m-1,n) - d(2m,n)\right] \qquad (10)$$
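The one-level decomposition can be sketched as two 1D Haar passes; the Python code below is a minimal illustration under the reconstruction of Eqs. (7)-(10) given above, with subband names following this section (the exact index convention is an assumption).

```python
import numpy as np

def haar2d_level1(x):
    # One-level 2D-HT sketch of Eqs. (7)-(10): a 1D Haar pass over the
    # rows gives approximations a and details d; a second pass over the
    # columns yields the four subbands, each of half size per axis.
    s = np.sqrt(2.0)
    a = (x[0::2, :] + x[1::2, :]) / s   # row approximations
    d = (x[0::2, :] - x[1::2, :]) / s   # row details
    LL = (a[:, 0::2] + a[:, 1::2]) / s  # approximation subband
    LH = (a[:, 0::2] - a[:, 1::2]) / s  # horizontal details
    HL = (d[:, 0::2] + d[:, 1::2]) / s  # vertical details
    HH = (d[:, 0::2] - d[:, 1::2]) / s  # diagonal details
    return LL, LH, HL, HH

x = np.arange(64, dtype=float).reshape(8, 8)
LL, LH, HL, HH = haar2d_level1(x)   # four 4x4 coefficient blocks
```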
2.4 Spread Spectrum Image Steganography (SSIS)

Code-division multiple access is one of the numerous access techniques where several transmitters can send information simultaneously over a single communication channel. This method stands on the mathematical apparatus of orthogonal pseudorandom number sequences (PNS), which allow multiple accesses on the same communication channel without interference. This is achieved by the orthogonality and the autocorrelation aspect of PNS. The autocorrelation controls the impact of noise on the stenographic communication. The direct spreading technique (DSS) by PNS stimulates the use of properties applicable in steganography. Firstly, PN sequences have an almost subequal quantity of -1 and 1. That is why spread data (secret data) do not directly appear in bulks or other periodical sequences, which could be a stimulus for an intruder to uncover the communication. Secondly, a longer length of PNS provides better correlation results; therefore, a higher number of erroneous/broken bits during the extraction process of the secret message can be admitted [8].
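A toy Python illustration of this direct-spreading idea follows; the sequence count, chip length, seeds and noise level are illustrative assumptions, not parameters of the proposed system.

```python
import numpy as np

def make_pns(k, length, seed=42):
    # k pseudo-random +/-1 sequences; with a shared seed the receiver
    # can regenerate them (this plays the role of the secret key).
    rng = np.random.default_rng(seed)
    return rng.choice([-1.0, 1.0], size=(k, length))

def spread(bits, pns):
    # CDMA-style direct spreading: each bit b_i in {0,1} is mapped to
    # +/-1 and modulates its own PN sequence; the chips are summed.
    symbols = 2.0 * np.asarray(bits) - 1.0
    return symbols @ pns

def despread(chips, pns):
    # Correlation receiver: the sign of the correlation with each PN
    # sequence recovers the bit even under moderate additive noise.
    corr = pns @ chips
    return (corr > 0).astype(int)

bits = [1, 0, 1, 1]
pns = make_pns(k=4, length=64)
chips = spread(bits, pns)
noisy = chips + np.random.default_rng(1).normal(0, 0.5, chips.shape)
print(despread(noisy, pns))  # expected to recover [1 0 1 1] here
```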
3 PROPOSED ALGORITHM IMPLEMENTATION

The main idea of the proposed method utilizes a combination of the DCT and DWT transforms in the process of message embedding. Subliminal channel creation processes grey-scale images as the cover object, and it employs the technique of CDMA for spreading large secret messages into the transform coefficients of 2D-DCT and 2D-DWT during the procedure of embedding. This method works with free hidden messages, which means that hidden message objects could consist of any binary data file. The embedding process starts with electing the proper cover object from the group of appropriate objects. These were prepared for this usage based on the expected steganography properties. The elected one is in the first step decomposed into single color components, where each one is divided into blocks of size eight by eight. Every block is transformed into the frequency domain by force of the DCT transformation. Thus we get matrices of decimal numbers according to the DCT behavior (the energy is compressed to the DC and low spatial frequencies, i.e., the highest numbers). We can say that the energy packing properties of the DCT are excellent. Afterward, we decompose every transform coefficient matrix by the DWT transform. There, in this DWT pass we use just coefficients without fractions; i.e., we use only the integer part of the decimal numbers. DWT based on Haar wavelets has a unique advantage of accrued transform coefficients. According to chapter 2.3: 2D-HT, the DWT provides one block of approximation (HH) and three blocks of detailed coefficients (horizontal LH, vertical HL, and diagonal LL) on each level of decomposition. The main aim was to create a stealth communication, where the secret message will account for a very high level of imperceptibility in the cover object. Therefore, the detail coefficients are the most convenient area for secret message embedding. Also, a critical requirement was the system capacity. There, the amount of embedded data (capacity) and the total distortion of the cover object are considerably and reciprocally related. Sometimes the system capacity is represented as the payload, i.e., it can be expressed as the number of embedded bits per cover image picture element (bit/pixel). Our proposed algorithm uses the 1-level decomposition of the discrete 2D Haar wavelet transformation (2D-HT). In general, the approximation coefficients are not very suitable for modification because they carry the most information content of the whole cover image. Vice versa, detailed coefficients are the most convenient area for doing some operations (for hiding secret data here). We


tried to design the algorithm in the way of minimal modification of the transformation coefficients, as the imperceptibility of the secret message would then be better. We build our work on a fascinating feature of 2D-HT coefficients: there is a high correlation between the transformation coefficients at the same position in different decomposition areas. It means that if a coefficient in LL at a position (i, j) is an even-numbered integer, then the coefficients in LH, HL, and HH at the same location (i, j) are also even-numbered integers. Our next finding during the investigations was that if we change one transform coefficient value in one decomposition area, then the inverse 2D-HT (2D-IHT) changes this value back to the original one. Otherwise, when we make the same change also in the other three areas of decomposition at the same position, then this change will be preserved. Another curious problem is the behavior of 2D-IHT itself. Sometimes, the reconstructed image pixel values may not be integer numbers. After this, it would not be possible to reconstruct and illustrate the image. From these findings, we concluded that there exist only four suitable changes of transform coefficients. Each modification affects a different set of image pixel values after 2D-IHT is performed. Table 1 shows all possible coefficient changes.

Table 1. Transformation coefficients suitable changes.

pixel position | HH           | HL           | LH
f(2m-1,2n-1)   | increase (+) | increase (+) | increase (+)
f(2m-1,2n)     | increase (+) | decrease (–) | decrease (–)
f(2m,2n-1)     | decrease (–) | increase (+) | decrease (–)
f(2m,2n)       | decrease (–) | decrease (–) | increase (+)

The first column determines a set of specific pixel value positions. Variables m and n are pixel indexes, where m = 1, 2, 3… M and n = 1, 2, 3… N, and the image size is intended as M×N. The "increase" label in Table 1 represents a modification of the value by increasing it according to the value that has been modified by the embedding of the secret message in the LL area at the same punctual position. Contrariwise, the "decrease" label in Table 1 represents a modification of the coefficient value by decreasing it according to the value that has been modified by the embedding of the secret message in the LL area at the same position.

3.1 Process of embedding

We designed the proposed method to be usable for all kinds of secret digital messages, i.e., it can be any numeric binary data file. Firstly, the elected cover image is decomposed by 2D-HT. Based on the previous chapter, this transform provides an approximation HH and three detail coefficient areas (horizontal HL, vertical LH, and diagonal LL) on each level of decomposition. It was necessary to accomplish the property of private communication as well as the secret message imperceptibility in the cover image. Therefore, as was explained, the detail coefficients are the most convenient area for secret message embedding. Another, not negligible, requirement is the system capacity. The capacity or payload can be defined as the number of embedded bits per pixel (bit/pixel). The secret message embedding process can be achieved in three embedding approaches, where every approach provides a specific level of robustness. Differences between the approaches lie mainly in the fact that the same data bits are replicated into other coefficients of the detailed areas. In the individual strategies, we try to embed the secret data bit from one area of detail coefficients up to all three areas HL, LH, LL, respectively. Introduction of a higher approach introduces more detail coefficient modifications without contribution to the system payload. However, a higher approach guarantees higher secret message robustness.

We defined a set of applicable transform coefficients C' suitable for the embedding process, which was elected from the detail areas HL, LH, LL (11). It was also needed to prepare and define a set of hidden message bits S before the embedding itself (12).

$$C' = \left\{\, c'_{ij} \mid 1 \le i \le M_c,\ 1 \le j \le N_c \,\right\} \qquad (11)$$


S  si | 1  i  K , si  0,1 (12) The most unusual behavior in here is the case
when no changes are needed to achieve the
The preparation consists of two parts. First secret message bit be embedded. In other
part is dealing with the determination of the words, the secret message bit is embedded
length of Code-word, which determines the without any modification of transform
size of secret data stream segmentation. coefficient.
Analogically, the length of Code word {bi | i = Nevertheless, independently of minimal
1, 2... N} directly determines the number of modifications, we encountered an unpleasant
pseudorandom sequences (PNSK). This problem on the receiver side. The issue we
PNSK are used in the process of spreading, found was related to stego object distortion,
ergo the SSIS approach, shown in Fig. 1. caused by the implementation of multiple
transformations. Inverse operations mislead
decimal numbers into spatial domain and
rounding to integers introduces this issue.
This distortion is responsible for incorrect
extraction of secret message. It was the reason
for PNS implementation. Autocorrelation
attributes of PNS significantly improve
resultful extraction. In general, a more
extended PNS are accounted for uplifting
autocorrelation characteristics, hence better-
Figure 1. Secrete message spreading before
embedding itself. gained results. Moreover, an additional
approach termed Extraction with Error
Correction (EEC) was applied during the
Nevertheless, independently of the minimal modifications, we encountered an unpleasant problem on the receiver side. The issue we found was related to stego object distortion, caused by the implementation of multiple transformations. The inverse operations deliver decimal numbers into the spatial domain, and rounding to integers introduces this issue. This distortion is responsible for the incorrect extraction of the secret message. It was the reason for the PNS implementation. The autocorrelation attributes of PNS significantly improve successful extraction. In general, longer PNS account for uplifting autocorrelation characteristics, and hence better results. Moreover, an additional approach termed Extraction with Error Correction (EEC) was applied during the extraction to improve error correction. The calculation of all possible states ($2^n$) on the output of the defined code word could be held as a reference data cell. However, incrementing the code-word length requires higher computing power, i.e., it is more time-consuming.

4 Results and Ascertainments

The total system capacity of the proposed method is a quarter of the used cover image size in the 1-level decomposition of 2D-HT. The maximum useful payload is 0.25 bit/pixel in the case of using the total capacity, and it also varies depending on the detail coefficients used during the process of embedding and on the applied code-word length. The total number of coefficients elected for embedding determines the capacity in binary form, and it does not vary with a growing number of PNS. Unfortunately, the system capacity decreases with increasing length of PNS. Considerable differences can be observed in the amount of energy added to or removed from the cover object. Subsequently, if more PNS are used in the


process of spreading, then more energy is embedded into the cover object. This unwanted property is measured by the objective quality measures (PSNR, PSNRCSF, SSIM - imperceptibility to human perception).

Table 2. Extraction rate results and PSNR values for different cover objects.

n  | Decomp. area | PSNR [dB] | PSNRCSF [dB] | SSIM   | Extract. [%] | Capacity utiliz. [%]
1  | LL           | 59,654    | 233,267      | 0,9917 | 49,1         | 95,2
1  | LL, HL       | 56,164    | 69,630       | 0,8714 | 65,7         |
1  | LL, HL, HH   | 53,946    | 77,038       | 0,6121 | 78,5         |
2  | LL           | 60,653    | 257,846      | 0,9939 | 74,1         | 46,1
2  | LL, HL       | 57,875    | 72,009       | 0,8745 | 87,0         |
2  | LL, HL, HH   | 54,247    | 82,975       | 0,6178 | 97,9         |
4  | LL           | 61,887    | 278,246      | 0,9951 | 90,1         | 23,8
4  | LL, HL       | 58,367    | 86,451       | 0,8785 | 95,5         |
4  | LL, HL, HH   | 55,947    | 87,462       | 0,6221 | 100          |
8  | LL           | 62,384    | 293,533      | 0,9975 | 95,7         | 11,8
8  | LL, HL       | 59,942    | 95,865       | 0,8812 | 97,9         |
8  | LL, HL, HH   | 56,142    | 93,672       | 0,6274 | 100          |
16 | LL           | 63,174    | 311,520      | 0,9998 | 98,4         | 5,95
16 | LL, HL       | 60,285    | 106,647      | 0,8864 | 100          |
16 | LL, HL, HH   | 57,056    | 98,975       | 0,6376 | 100          |

In the case of LL and n=1 (if one decomposition area is utilized and no code word applied), the reconstruction rate is around 49%. This rate is unacceptable and not applicable for our purposes. However, the stego image visual quality values (PSNR, SSIM) are incredibly satisfying. In the case of using the LL, HL and HH coefficients for embedding, the reconstruction rate increased to around 78%, which is much better but still not enough for our purposes. It was the reason why we implemented code words. In Tab. 2, all combinations of the reached results are listed. It is evident that if the length of the code word increases, then the extraction rate gets better, but at the expense of the amount of payload to be embedded (the amount of secret message bits). Fig. 2 depicts the difference values of PSNR between the applied decomposition areas and the achieved capacity utilization.

Figure 2. Difference values of PSNR and system capacity.

The differences are even more exposed with longer code words, where the better cross-correlation attribute of PNS stands its place.

5 Conclusion

In this paper, we prepared a steganography method that uses the properties of the combination of the transformation domain of 2D-HT, 2D-DCT and the direct spreading technique of CDMA. The objectives and embedding algorithm solved the problem of the secret message reconstruction issue formed in research [4], where the transformation domain of DCT creates a wrong place for secret message embedding. As has been shown, the capacity depends on the length, but no longer on the number, of PNS. Moreover, the error correction algorithm increases the fruitfulness of secret message extraction. Another critically considered benchmark according to steganography is the imperceptibility of cover object degradation to a human observer. These measured values are decreasing with the employment of more decomposition areas of 2D-HT. This handicap is solved by increasing the length of PNS because of the cross-correlation improvement. The autocorrelation allows proper identification of PNS, and thus of the secret message bit.

Acknowledgment

This work was supported by the Slovak Research and Development Agency, project number APVV-15-0055, and by the Scientific Grant Agency of the Ministry of Education, Science, Research and Sport of the Slovak


Republic under the contract No. 1/0075/15


and University Science Park TECHNICOM
for Innovation Applications Supported by
Knowledge Technology, ITMS:
26220220182, supported by the Research &
Development Operational Programme funded
by the ERDF. This project is being co-
financed by the European Union.

REFERENCES
[1] Gowda, S. N. (2016, July). Dual layered secure
algorithm for image steganography. In Applied
and Theoretical Computing and Communication
Technology (iCATccT), 2016 2nd International
Conference on (pp. 22-24). IEEE.

[2] Shih, F. Y. (2017). Digital watermarking and


steganography: fundamentals and techniques. CRC
Press.

[3] Kishor, S. N., Ramaiah, G. K., & Jilani, S. A. K.


(2016, May). A review on steganography through
multimedia. In Research Advances in Integrated
Navigation Systems (RAINS), International
Conference on (pp. 1-6). IEEE.

[4] Bugár, G., Levický, D., Broda, M., & Hajduk, V.


(2016, April). A novel block-based data hiding
scheme in SVD-DCT composite domain.
In Radioelektronika (RADIOELEKTRONIKA),
2016 26th International Conference (pp. 336-339).
IEEE.

[5] Banoci, V., Broda, M., Bugar, G., & Levický, D.


(2014). Universal image steganalytic
method. Radioengineering, 23(4), 1213-1220.

[6] Trivedi, M. C., Sharma, S., & Yadav, V. K. (2016,


March). Analysis of Several Image Steganography
Techniques in Spatial Domain: A Survey.
In Proceedings of the Second International
Conference on Information and Communication
Technology for Competitive Strategies (p. 84).
ACM.

[7] Westfeld, A. (2001). F5—a steganographic


algorithm. In Information hiding (pp. 289-302).
Springer Berlin/Heidelberg.

[8] Bánoci, V., Bugár, G., & Levický, D. (2010,


April). Information hiding using pseudo-random
number sequences with error correction.
In Radioelektronika (RADIOELEKTRONIKA),
2010 20th International Conference (pp. 1-4).
IEEE.

[9] Bugar, G., Banoci, V., Broda, M., Levicky, D., &
Miko, E. (2013, September). Blind steganography
based on 2D Haar transform. In ELMAR, 2013
55th International Symposium (pp. 31-35). IEEE.


Optimising the Innovation Capacity of Public Research Organisations – An Agent-Based Simulation Approach

Davy van Doren


EA European Academy of Technology and Innovation Assessment GmbH
Wilhelmstr. 56, 53474 Bad Neuenahr-Ahrweiler, Germany
davy.vandoren@ea-aw.de

ABSTRACT

Although there is considerable attention for the role of public research organisations in driving innovation, methodologies that can be used to analyse and optimise these roles are limited. An agent-based model could potentially assess ex-post the operation of public research organisations within innovation systems, as well as allow ex-ante analysis of potential intervention effects for optimising their innovation capacity. Such a concept needs to entail two perspectives. From the perspective of innovation, it should represent not only organisational capacities and capabilities required for research and development, but also integrate antecedent and subsequent processes related to the act of innovation. From the perspective of innovators, it should integrate the organisational metabolism that enables innovation, as well as contain agglomerate effects of organisational operation on the overall dynamic of innovation networks. Based on such requirements, some options for designing an agent-based model are suggested.

KEYWORDS

agent-based model, public research organisations, innovation network, innovation policy, organisational learning

1 INTRODUCTION

For the governance of technological trajectories, it has been argued that public research organisations can be important for enabling innovation [1]. As political directions over time have moved towards increased embracement of technology commercialisation, the traditional independent and upstream position of public research has been replaced by one that contains richer and more interactive actor relationships [2][3]. Due to the evolving roles and functions of public research organisations, there is a need to understand the conditions that enable innovation within public research organisations and optimise their effectiveness. One way to assess innovation within public research organisations is through developing a computational model. In the following sections, some general features of such a modelling concept are described. Also, to allow analytical distinction regarding the various dimensions and levels that underlie the innovation process, a theoretical framework is proposed.

2 THEORETICAL FRAMEWORK

2.1 Dimensions of Innovation

In the context of public research organisations, innovation trajectories could be considered as consisting of four dimensions: (1) the production of science, (2) the processing of science to enable innovation, (3) the initiation of innovation, and (4) the transfer or conversion of innovation through entrepreneurship:

• 1 - Science: With innovation defined as a form of applied knowledge, theories on innovation have generally been based on the discovery and management of scientific knowledge by actors [4].
• 2 - Enabling innovation: Previous research has highlighted the need for


organisational antecedents, capacities and installed processes that can enable innovation [5][6][7].
• 3 - Innovation: Organisations require capabilities to actually realise the production of technological inventions, as well as skills to manage acquired technological capital [8][9].
• 4 - Entrepreneurship: Entities require adaptability and flexibility to handle changing operational environments resulting from technological change, for example through driving organisational diversification or managing institutional spin-offs [10][11][12][13].1

2.2 Levels of innovation

De facto representations of these dimensions of innovation are likely to differ between research organisations. To create analytical distinction within and between institutes, three conceptual levels of innovation could be distinguished: (I) within entities, (II) between entities, and (III) outside entities.

I - Within entities – entity level
Much has been written on how organisational structures relate to innovation. For this, it seems helpful to classify entities based on organisational capacity and organisational processes, as well as how these relate to organisational output [14][15]:

• Capacity: It has been shown how knowledge management can be considered from a capacity and capability perspective [16], as well as how firms need to recognise and assimilate external knowledge for survival [17].
• Processes: There has been attention towards optimisation of distinct processes, for example the process of knowledge generation through learning or the communication of information [18][19]. Also, various studies are conducted on how processes could be optimally integrated and adapted to improve system flexibility [20]. Furthermore, research has been conducted at the level of employees, for example how human relations and human resource management influence organisational performance [21][22][23].
• Output: Based on the architecture of capacity and processes, the output of an organisation in terms of innovation is influenced. For example, it has been shown how an organisation's performance in innovation relates to different modes of strategy formulation [24].

II - Between entities - network level
Between entities, research on networks and innovation systems has highlighted the relevance of inter-organisational activity for explaining the dynamic of innovation [25], also in order to create economic regions of specialisation that can induce national and international competitive advantages [26]. Furthermore, innovation and entrepreneurship are known to occur within diverse actor constellations and at different levels of scale [27][28][29].

III - Outside entities - environment level
The dynamic of innovation is also strongly influenced by effects external to entities and entity relations – i.e. variables that are not directly associated to the relationships existing between entities. The environment constitutes conditions that strongly influence the character and context of (national) innovation systems [25][30]. Such conditions are generally outside the direct sphere of organisational control.

Based on the made categorisation of innovation, the following theoretical framework is composed (Figure 1).

1 In recent years, various initiatives have started to elevate the potential of innovation within regional entrepreneurship programs (e.g. REAP - http://reap.mit.edu/).


[Figure 1 depicts the Environment, Network and Entity levels nested, with Science, Enabling Innovation, Innovation and Entrepreneurship blocks inside the entity, each comprising Capacity, Processes and Performance.]

Figure 1. Theoretical framework to analyse innovation within and between organizations

3 METHODOLOGICAL APPROACH

The composed theoretical framework can be operationalised by means of agent-based modelling.

In the context of evolving innovation networks, agent-based models have been designed to simulate actor behaviour, as well as to specify the inter-relationships between new knowledge, knowledge transfer, market selection and reward structures [31]. As these types of simulations allow for ex-ante evaluation and the analysis of non-linear relationships, they have been adapted to contexts ranging from university-industry collaborations to research policy effectiveness [32][33].

There are some requirements to be considered in the development of simulation models. First, for simulation models to be useful in real-world settings, they should be able to represent past states of innovation systems reliably. In addition, they should be able to make potential future projections and allow exploration of successful strategies within the space that is being modelled. As such, models should be flexible enough to explore multiple scenarios and alternative actor-configurations, in order to provide insights into how overall agent fitness can be improved.

To achieve such functionality within models, there is a need for legitimate data-generation and data-management procedures. For this, three methodological features are proposed: (1) multi-level integration through ego-networks, (2) empirical- and simulation-data embedment through scores, and (3) complexity reduction through model modularisation.
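As an illustration only, a deliberately minimal agent-based sketch in Python is given below: agents hold a knowledge stock, exchange knowledge over random network ties, and accumulate innovation output as "fitness". All class names and parameters are hypothetical; a full model would operationalise the capacities, processes and outputs of the framework above.

```python
import random

class ResearchAgent:
    def __init__(self, name):
        self.name = name
        self.knowledge = random.random()   # stock of scientific knowledge
        self.fitness = 0.0                 # cumulative innovation output

    def innovate(self):
        # Innovation output grows with the agent's knowledge stock.
        output = self.knowledge * random.random()
        self.fitness += output
        return output

def transfer(a, b, rate=0.1):
    # Knowledge transfer along a network tie: partial diffusion from
    # the more knowledgeable agent to the less knowledgeable one.
    donor, receiver = (a, b) if a.knowledge > b.knowledge else (b, a)
    receiver.knowledge += rate * (donor.knowledge - receiver.knowledge)

agents = [ResearchAgent(f"PRO-{i}") for i in range(10)]
for step in range(50):
    a, b = random.sample(agents, 2)   # a random collaboration tie
    transfer(a, b)
    for ag in agents:
        ag.innovate()

best = max(agents, key=lambda ag: ag.fitness)
print(best.name, round(best.fitness, 2))
```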
3.1 Methodological feature 1 – Multi-level integration through ego-networks

Multi-level integration through ego-networks relates to what and how data is used in setting up and calibrating the operational settings of the model. Based on principles of evolutionary economics, approaches have often been based on large aggregate datasets to describe innovation diffusion processes in relation to industrial structures [34] or have integrated qualitative observations on historic sector


development [35]. Also, in the tradition of Monte Carlo simulation and NK modelling, developed methodologies have included levels of uncertainty for calculating actor fitness under different scenarios, as well as to obtain parametric probability distributions in the context of nonlinear evolution [36][37][38].

However, a predominant use of network-based data might obscure structures that are more relevant at the micro-level. To introduce a multi-level dynamic in the set-up of models, ego-network data could be used to integrate information regarding the intensity, quality and nature of actor connections. Through analytical embedment of preferential attachment, strength of ties, collaboration intensity and variation, number of co-participations, or the profile of partnerships [39], empirically observed ego-networks of entities could be used in calibrating and validating model output.

3.2 Methodological feature 2 – Connecting empirical- and simulation-data through indicators and scores

The connection of empirical- and simulation-data relates to the observability versus unobservability of various innovation phenomena. This is an important consideration within modelling approaches that aim to integrate both types of phenomena through either collected empirical observations or in silico produced data. The inclusion of both data-types represents a certain duality and challenge in translating multi-level based innovation theory into an operating agent-based model. To align innovation models with innovation theories, it is therefore important to obtain some form of data integration between levels. For this, a system based on indicators and scores is proposed to improve the validation of innovation models.

Indicators
Indicators can enable measurement of the innovation trajectory dimensions (i.e. science, enabling innovation, innovation, entrepreneurship), as well as measurement of the different levels at which innovation occurs (i.e. within entities, between entities, and the environment). Data used for the calculation of indicators can be based on either empirical data (quantitative or qualitative) or simulation data. To come up with a comprehensive set of indicators, previously developed criteria that highlight organisational aspects influencing the ability to innovate were identified as a potential starting basis [40]. Such criteria can then be used as a basis for the identification or formulation of additional indicators (Table 1).

Scores
After indicators have been identified or formulated, scores can be developed that represent a certain overall dynamic or performance of innovation. Scores should be considered as a unit that aggregates multiple indicators and that is constructed as rooted within theoretical logic or stylised facts. As scores allow the integration of both empirical and simulation data, they function as links between the collected empirical- and produced simulation-data of simulations.
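Such a score can be sketched in Python as a weighted aggregation of normalised indicators; the indicator names, bounds and weights below are hypothetical examples for an entity-level "Innovation" score, not values proposed here.

```python
def normalise(value, lo, hi):
    # Map a raw indicator value onto [0, 1] given plausible bounds.
    return max(0.0, min(1.0, (value - lo) / (hi - lo)))

def innovation_score(indicators, weights, bounds):
    # A score aggregates several indicators into one unit; empirical
    # and simulated indicator values can be mixed in the same call.
    total = 0.0
    for key, w in weights.items():
        lo, hi = bounds[key]
        total += w * normalise(indicators[key], lo, hi)
    return total / sum(weights.values())

indicators = {"patents": 12, "rta": 1.4, "engineers": 35}   # hypothetical
weights = {"patents": 0.5, "rta": 0.3, "engineers": 0.2}
bounds = {"patents": (0, 50), "rta": (0, 3), "engineers": (0, 100)}
print(round(innovation_score(indicators, weights, bounds), 3))
```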
3.3 Methodological feature 3 – Modularisation of model architecture

The modularisation of model architecture relates to a need to methodologically deal with system complexity. Popular concepts of evolutionary economics have often been translated into the architecture of models explaining innovation dynamics. Concepts like the bounded rationality of actors or the transformation of produced knowledge have often resulted in complex model architecture. Complex model architecture decreases operational understanding due to multiple parameter interactions and the emergence of feedback- and feedforward-processes. Also, complex model architecture reduces model flexibility, which hampers the adaptation of models towards specific research questions or contexts.

ISBN: 978-1-941968-45-1 ©2017 SDIWC 35


Proceedings of the Third International Conference on Computing Technology and Information Management (ICCTIM2017), Thessaloniki, Greece, 2017

Table 1. Analytical grid for the categorisation of individual indicators and aggregated indicators. For illustrative purposes, some example indicators are provided. Rows give the levels of innovation; columns give the dimensions of innovation (science, enabling innovation, innovation, entrepreneurship).

Within entities – Capacity. Science: Revealed Scientific Advantage (RSA); scientific personnel. Enabling innovation: budget for applied research; science–innovation connectors. Innovation: Revealed Technological Advantage (RTA); engineers. Entrepreneurship: potential for spin-off; entrepreneurship programs.

Within entities – Processes. Science: science-focussed incentives; internal research projects. Enabling innovation: market research / identification of exploitation opportunities. Innovation: innovation-focussed incentives. Entrepreneurship: transfer of IPR; resource allocation / management; identification of licensees.

Within entities – Output. Science: publications (basic research). Enabling innovation: publications (applied research); market assessment. Innovation: patents; technologies. Entrepreneurship: spin-offs.

Between entities – Capacity. Science: basic research collaborations. Enabling innovation: applied research collaborations; organisational embedment of institutes. Innovation: innovation networks. Entrepreneurship: shared infrastructures.

Between entities – Processes. Science: interorganisational coordination; basic research partner projects. Enabling innovation: applied research partner projects. Innovation: contract research. Entrepreneurship: identification of potential clients.

Between entities – Output. Science: co-publications (basic research). Enabling innovation: co-publications (applied research). Innovation: product development. Entrepreneurship: market niche.

Environment. Science: systemic / national need for science. Enabling innovation: systemic / national need for innovation. Innovation: funding and transfer instruments. Entrepreneurship: institutional support for market niches; price of innovation.
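To make the notion of a score concrete, the following minimal Python sketch (ours, not the authors'; the indicator names, weights and normalisation are assumptions) aggregates a few Table 1-style indicators – mixing empirical and simulated values – into a single weighted score:

# Illustrative sketch only: the paper prescribes no concrete aggregation
# formula, so the indicator names, weights and normalisation are assumptions.
def score(indicators, weights):
    """Aggregate several normalised (0..1) indicators into one score."""
    total = sum(weights[name] for name in indicators)
    return sum(weights[name] * value for name, value in indicators.items()) / total

# A score may mix empirical and simulation data, e.g. for "science capacity":
empirical = {"publications_basic": 0.70, "rsa": 0.55}   # from collected data
simulated = {"coauthorship_density": 0.42}              # from model output
weights = {"publications_basic": 0.5, "rsa": 0.3, "coauthorship_density": 0.2}
print(score({**empirical, **simulated}, weights))       # -> 0.599

The weighted-average form is only one option; any aggregation rooted in theoretical logic or stylised facts would serve the same linking role between empirical and simulated data.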
In order for models to explain more complex patterns of innovation but still remain coherent and flexible, models could be designed in a modular fashion [41]. In addition to the modular models being developed in technology development to overcome measurement challenges and understand complex interactions – for example in biotechnology [42], [43] – some attempts have also been made in the context of innovation dynamics [44].²

The construction of modules should depend on several aspects. First, based on the availability of data and the analysis of indicators and scores, modules can be composed based on principal components that relate to the applied analytical framework and align with the chosen scope of analysis. Second, the creation of modules is also likely to depend on end-user requirements. Based on articulated needs regarding what output a model should produce – both in terms of intelligence and interface possibilities – innovation phenomena could be translated into either central or peripheral modules.

² The operational implications of multiple perspectives have also previously been discussed in research; for example, see https://www.imagwiki.nibib.nih.gov/sites/default/files/Ropella,%20Glen05Aug14cah.pdf

CONCLUSION

To understand the conditions that enable innovation within public research organisations, a preliminary methodological concept was proposed that can be applied within the construction of innovation models. This concept – rooted in theories of organisational structure, innovation systems and evolutionary economics – can be operationalised to include both antecedents and impacts of innovation, as well as to integrate processes that occur within and between the involved agents.
For the further development and operationalisation of this concept, some potential methodological features were suggested to improve the simulation of agent-based models.

REFERENCES

[1] W. M. Cohen, R. R. Nelson, and J. P. Walsh, "Links and Impacts: The Influence of Public Research on Industrial R&D," Manage. Sci., vol. 48, no. 1, pp. 1–23, 2002.
[2] E. von Hippel, The Sources of Innovation, vol. 53, no. 9, 2013.
[3] M. Gibbons and R. Johnston, "The roles of science in technological innovation," Res. Policy, vol. 3, no. 3, pp. 220–242, 1974.
[4] J. A. Johannessen, B. Olsen, and J. Olaisen, "Aspects of innovation theory based on knowledge-management," Int. J. Inf. Manage., vol. 19, no. 2, pp. 121–139, 1999.
[5] C. E. Helfat et al., Dynamic Capabilities: Understanding Strategic Change in Organizations. Oxford: Blackwell Publishing, 2007.
[6] J. J. P. Jansen, F. A. J. van den Bosch, and H. W. Volberda, "Managing Potential and Realised Absorptive Capacity: How do Organisational Antecedents Matter?," Acad. Manag., vol. 48, no. 6, p. 16, 2005.
[7] J. Woodhill, "Capacities for institutional innovation: A complexity perspective," IDS Bull., vol. 41, no. 3, pp. 47–59, 2010.
[8] F. E. García-Muiña and E. Pelechano-Barahona, "The complexity of technological capital and legal protection mechanisms," J. Intellect. Cap., vol. 9, no. 1, pp. 86–104, 2008.
[9] H. M. Grimm, "The diffusion of Bayh-Dole to Germany: Did new public policy facilitate university patenting and commercialisation?," Int. J. Entrep. Small Bus., vol. 12, no. 4, pp. 459–478, 2011.
[10] H. Chesbrough and R. S. Rosenbloom, "The role of the business model in capturing value from innovation: evidence from Xerox Corporation's technology spin-off companies," Ind. Corp. Chang., vol. 11, no. 3, pp. 529–555, 2002.
[11] K. Pavitt, M. Robson, and J. Townsend, "Technological Accumulation, Diversification and Organisation in UK Companies, 1945–1983," Manage. Sci., vol. 35, no. 1, pp. 81–99, 1989.
[12] A. Walter, M. Auer, and T. Ritter, "The impact of network capabilities and entrepreneurial orientation on university spin-off performance," J. Bus. Ventur., vol. 21, no. 4, pp. 541–567, 2006.
[13] M. G. Jacobides and S. G. Winter, "Entrepreneurship and firm boundaries: The theory of a firm," J. Manag. Stud., vol. 44, no. 7, pp. 1213–1241, 2007.
[14] R. G. M. Kemp, M. Folkeringa, J. P. J. de Jong, and E. F. M. Wubben, Innovation and Firm Performance, 2003.
[15] J. Birkinshaw, G. Hamel, and M. J. Mol, "Management Innovation," Acad. Manag. Rev., vol. 33, no. 4, pp. 825–845, Oct. 2008.
[16] A. H. Gold, A. Malhotra, and A. H. Segars, "Knowledge Management: An Organizational Capabilities Perspective," J. Manag. Inf. Syst., vol. 18, no. 1, pp. 185–214, 2001.
[17] W. M. Cohen and D. A. Levinthal, "Absorptive Capacity: A New Perspective on Learning and Innovation," Adm. Sci. Q., vol. 35, no. 1, pp. 128–152, 1990.
[18] B. Levitt and J. G. March, "Organizational Learning," Annu. Rev. Sociol., vol. 14, no. 1, pp. 319–338, 1988.
[19] T. J. Allen, Managing the Flow of Technology. Cambridge, MA: MIT Press, p. 320, 1977.
[20] J. E. Ettlie and E. M. Reza, "Organizational Integration and Process Innovation," Acad. Manag., vol. 35, no. 4, pp. 795–827, 1992.
[21] C. Truss, A. Shantz, E. Soane, K. Alfes, and R. Delbridge, "Employee engagement, organisational performance and individual well-being: exploring the evidence, developing the theory," Int. J. Hum. Resour. Manag., vol. 24, no. 14, pp. 2657–2669, 2013.
[22] V. H. Hailey, E. Farndale, and C. Truss, "The HR department's role in organisational performance," Hum. Resour. Manag. J., vol. 15, no. 3, pp. 49–66, 2005.
[23] J.-M. Hiltrop, "The impact of human resource management on organisational performance: Theory and research," Eur. Manag. J., vol. 14, no. 6, pp. 628–637, 1996.
[24] R. E. Miles, C. C. Snow, A. D. Meyer, and H. J. Coleman, "Organizational strategy, structure, and process," Acad. Manag. Rev., vol. 3, no. 3, pp. 546–562, 1978.
[25] B.-Å. Lundvall, National Systems of Innovation: Towards a Theory of Innovation and Interactive Learning. London, 1992.
[26] M. E. Porter, "The Competitive Advantage of Nations (cover story)," Harv. Bus. Rev., vol. 68, no. 2, pp. 73–93, 1990.
[27] W. Vandekerckhove and N. A. Dentchev, "A Network Perspective on Stakeholder Management: Facilitating Entrepreneurs in the Discovery of Opportunities," J. Bus. Ethics, vol. 60, no. 3, pp. 221–232, Sep. 2005.
[28] H. Choi, S.-H. Kim, and J. Lee, "Role of network structure and network effects in diffusion of innovations," Ind. Mark. Manag., vol. 39, no. 1, pp. 170–177, 2010.
[29] F. W. Geels, "Ontologies, socio-technical transitions (to sustainability), and the multi-level perspective," Res. Policy, vol. 39, no. 4, pp. 495–510, 2010.
[30] M. Steiner, Clusters and Regional Specialisation: On Geography, Technology and Networks, no. 8, 1998.
[31] N. Gilbert, A. Pyka, and P. Ahrweiler, "Innovation Networks – A Simulation Approach," J. Artif. Soc. Soc. Simul., vol. 4, no. 3, pp. 1–14, 2001.
[32] P. Ahrweiler, A. Pyka, and N. Gilbert, "A New Model for University-Industry Links in Knowledge-Based Economies," Soc. Sci., pp. 218–235, 2011.
[33] P. Ahrweiler, M. Schilperoord, A. Pyka, and N. Gilbert, "Modelling research policy: Ex-ante evaluation of complex policy instruments," JASSS, vol. 18, no. 4, 2015.
[34] R. R. Nelson and S. G. Winter, An Evolutionary Theory of Economic Change, 1982.
[35] F. Malerba, R. Nelson, L. Orsenigo, and S. Winter, "History-friendly models: An overview of the case of the computer industry," JASSS, vol. 4, no. 3, 2001.
[36] V. N. Kolokoltsov, Nonlinear Markov Processes and Kinetic Equations. Cambridge Tracts in Mathematics, 2010.
[37] R. Y. Rubinstein and D. P. Kroese, Simulation and the Monte Carlo Method. Wiley, p. 377, 2008.
[38] S. A. Kauffman and E. D. Weinberger, "The NK model of rugged fitness landscapes and its application to maturation of the immune response," J. Theor. Biol., vol. 141, no. 2, pp. 211–245, 1989.
[39] A. L. Barabási, H. Jeong, Z. Néda, E. Ravasz, A. Schubert, and T. Vicsek, "Evolution of the social network of scientific collaborations," Phys. A Stat. Mech. Its Appl., vol. 311, no. 3–4, pp. 590–614, 2002.
[40] Helmholtz Gemeinschaft, "BMBF-Förderprojekt Enabling Innovation – Erprobung des Management-Tools," 2014.
[41] D. Scerri, S. Hickmott, A. Drogoul, and L. Padgham, "An Architecture for Modular Distributed Simulation with Agent-Based Models," Proc. 9th Int. Conf. Auton. Agents Multiagent Syst. (AAMAS 2010), pp. 541–548, 2010.
[42] B. K. Petersen, G. E. P. Ropella, and C. A. Hunt, "Toward modular biological models: defining analog modules based on referent physiological mechanisms," BMC Syst. Biol., vol. 8, p. 95, 2014.
[43] G. Sunwoo Park, E. P. Ropella, and C. A. Hu, "PISL: A Large-Scale In Silico Experimental Framework for Agent-Directed Physiological Models," 2005.
[44] S. H. Chen and B. T. Chie, "A functional modularity approach to agent-based modeling of the evolution of technology," in Lecture Notes in Economics and Mathematical Systems, vol. 567, pp. 165–178, 2006.
Text Classification Using Time Windows Applied to Stock Exchange

Pavel Netolický, Jonáš Petrovský, František Dařena, Jan Žižka

Department of Informatics, Faculty of Business and Economics, Mendel University in Brno,


Zemědělská 1, 613 00 Brno, Czech Republic

pavel.netolicky@mendelu.cz, jonas.petrovsky@mendelu.cz, frantisek.darena@mendelu.cz,


jan.zizka@mendelu.cz

ABSTRACT

Each day, a lot of text data is generated. This data comes from various sources and may contain valuable information. In this article, we use text classification to discover whether there is a connection between textual documents (specifically Facebook posts) and changes of the S&P 500 stock index. The index values and documents were divided into time windows according to the direction of the index value changes. In the first experiment, we used a batch processing approach to put the documents from all windows into one data set, and a classification accuracy of 62% was achieved. In the second experiment, we used a data stream approach to divide the documents into twelve data sets created from two neighboring windows, and we achieved an accuracy of 68%. This indicates that the posts which companies write on their Facebook pages are partially related to the performance of the stock index. Taking concept change into account also enables better quantification of this relationship.

KEYWORDS

Machine Learning, Classification, Text Mining, Stock Exchange, Time Windows, Data Streams

1 INTRODUCTION

A huge amount of data is constantly being generated by people and organizations. The speed of data creation is rapidly growing, and we use the term "data stream" for the constant flow of new data [1].

Data streams may be of various data types (text, image, numeric) and come from different application areas (computer network monitoring, scientific experiments, internet search, social networks, etc.). In comparison to batch processing (for which we have all data available at once), data stream processing needs a different approach, because classical approaches are not effective or even feasible [2].

In this article, we will focus on the connection between text documents published on the Internet and movements of stock prices (represented by a composite stock index value). Some research in this area uses structured (quantitative) data to analyze the impact of data on stock prices [3]. Unstructured data (like text) may provide us with complementary information carrying additional hard-to-quantify knowledge [4].

Behavioral finance theory says that emotions may deeply influence the behavior and decision making of individuals as well as of whole human societies [5]. This means that the prices on capital markets are (more or less) influenced by the emotions, moods and opinions of market participants [6]. These attributes are often contained in text documents, and therefore we decided to use text data for our research.

[7] examined the connection between the content of messages posted to a discussion board and movements of the Czech stock index. We will expand this approach further by focusing on the US stock market, using a larger number and another type (Facebook posts) of text data, and treating stock prices and related text documents as data streams divided into time windows, as we suppose that the reasons for stock price changes evolve in time.
2 CURRENTLY USED METHODS

To model the behavior of a stock price in relation to the content of text data, we can use classification: we examine the direction of the change of the stock price to create classes. This approach was used, for example, by [6]. The problem can be seen as text classification – given a text, decide its class (the direction of the price movement). However, we must overcome two problems. The first problem is the definition of classes; [8] used a threshold value of 1% price change for the class determination. The second problem lies in choosing correct features. Many studies used just single words, and this simple unigram bag-of-words model provided good results in [8].

There exists a wide range of supervised learning algorithms that can be used for text classification. An interesting approach is described in [9] – it focuses on sentence-level sentiment analysis of movie reviews. They used cosine normalization, Term Presence, and Smoothed Delta IDF as weighting schemes and the Recursive Neural Tensor Network algorithm to achieve an accuracy of 87.60%. [10] used Naïve Bayes and SVM as algorithms and unigrams, bigrams, unigrams with bigrams, and unigrams with POS (parts-of-speech) tags as features. The bigrams showed a lower accuracy than unigrams – the reason is that the resulting vectors were very sparse. All in all, the type of features used in the bag-of-words model has only a small (at most 2–3%) impact on the accuracy.

3 DATA AND METHODOLOGY

The goal of the work was to examine whether the content of text documents published on the Internet has any connection with stock price movements. We decided to use Facebook posts from company pages as the text data, because they have been a very rarely used data source in this area of research, lots of data are available, and they might bring new interesting insights.

3.1 Stock prices

In our research, the values of the S&P 500 Index were used to represent stock prices. The index values reflect the stock prices of selected blue-chip (large and famous) companies on the US stock market. The historical values of the index were downloaded from the website investing.com. For each trading day, we have a closing (end-of-day) numeric value of the S&P 500 Index available.

3.2 Text data

As the text data, posts from the Facebook pages of the companies in the S&P 500 Index were used. In total, we examined 431 company pages. A company's Facebook page contains a sequence of documents arranged according to their publication time. These short postings are created by the company representatives. Figure 1 shows an example of a post on Intel's page. A post may be commented on by Facebook users; however, the comments were not used in the analysis.

In total, 138,713 Facebook posts published between 1. 1. 2015 and 15. 10. 2016 were used.

Figure 1. Example of a Facebook post
3.3 Classification methodology

We used text classification to predict whether a given document is connected with an upward or downward movement of the S&P 500 Index.

We examined the time series of the S&P 500 Index values between 1. 1. 2015 and 15. 10. 2016 and found time intervals (windows) in which the change (either positive or negative) of the index value between the first and the last day of the interval was at least 5%. In total, 24 such windows were found. In 12 of them the index value grew, and in 12 it declined. The length (number of days) of the time windows varied between 4 and 30. Then each document was, based on the time window in which it was published, assigned a class: 1 (up) for a positive index value change, 2 (down) for a negative one. Figure 2 shows an example of time windows between 1. 1. 2016 and 1. 4. 2016 with the assigned classes.

Figure 2. Classification classes identified in the time series of the stock index values

We decided to perform two types of experiments with different data sets used for the classification. In the first experiment, documents from all 24 windows were put into one data set. The results for this experiment are presented in section "Batch approach".

In the second experiment, we divided the documents into 12 data sets. Each data set consisted of the documents from two neighboring windows: one with an upward movement and one with a downward movement. The windows represented the two classes for the classification. The results for this experiment are presented in section "Neighboring windows approach".

Text pre-processing and conversion

The raw text of each document was processed by a Python script as follows:
1. Remove all whitespace.
2. Lowercase all letters.
3. Tokenize the document – get words (using TreebankWordTokenizer).
4. Filter words – minimal length of three letters, exclude numbers.

The edited text was converted into a structured format using the Python library scikit-learn and its Vectorizer class. Only words that occurred at least 5 times in the whole document collection were included in the resulting vector representation.

The documents were converted to a bag-of-words representation using three different weighting schemes for the term-document matrix [11, p. 21–26]:
• Term Presence (TP): 1 if a term was present in a document, 0 if not.
• Term Frequency (TF): the number of times a term was present in a document.
• TF-IDF: TF (local weight) multiplied by the IDF (global weight).

Classification

The converted data was split into a training (60%) and a testing (40%) set. Each bag-of-words representation was processed by 10 classifiers (with default settings – no parameter optimization was made) in scikit-learn. The classifiers' performance was evaluated by the achieved accuracy (the proportion of correctly classified instances among all examined instances [10, p. 268]) on the test set.
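A minimal sketch of such a pipeline is given below; the original script is not published, so the helper names and toy data here are our assumptions (requires nltk and scikit-learn):

# Sketch of the described pre-processing and classification pipeline.
from nltk.tokenize import TreebankWordTokenizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

_tok = TreebankWordTokenizer()

def tokenize(text):
    # Steps 1-4: collapse whitespace, lowercase, tokenize,
    # keep only words of length >= 3 that are not numbers.
    words = _tok.tokenize(" ".join(text.lower().split()))
    return [w for w in words if len(w) >= 3 and not w.isdigit()]

docs = ["Great quarter for our new products", "Difficult market conditions ahead"]
labels = [1, 2]  # 1 = post from an "up" window, 2 = post from a "down" window

# TF-IDF weighting; on the full collection min_df=5 keeps only words that
# occur at least five times (TF: use_idf=False; TP: CountVectorizer(binary=True)).
vectorizer = TfidfVectorizer(tokenizer=tokenize, min_df=1)
X = vectorizer.fit_transform(docs)

X_train, X_test, y_train, y_test = train_test_split(
    X, labels, train_size=0.6, random_state=0)
model = MultinomialNB().fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))

The same vectorized data can be fed to the other scikit-learn classifiers mentioned in Section 4 simply by swapping the estimator.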
4 RESULTS AND DISCUSSION

One set of text data (Facebook posts) together with the S&P 500 Index values was used to prepare the data for classification. The class-labelled data set was processed using the three weighting schemes (TP, TF, TF-IDF) by 10 classification algorithms. In total, 30 classification results were obtained.

4.1 Batch approach

Tables 1, 2, 3 and 4 show the results for the classification of the data set including all windows.

Table 1. Facebook posts – data statistics.
Total samples: 138,713 | Class 1 samples: 85,286 | Class 2 samples: 53,427 | Number of words: 33,397

Table 2. Facebook posts – classification results.
Accuracy: 0.623 | Precision: 0.605 | Recall: 0.623 | F1 score: 0.614

Table 1 shows the statistics about the data used for the classification. It is obvious that the data set was quite unbalanced (with more documents marked with the index value going up). Table 2 shows the best classification results. The highest accuracy (62%) was achieved with the TF-IDF weighting scheme and the Multinomial Naïve Bayes classification algorithm.

Table 3 tells us that the weighting scheme used was not very important. However, we can see that the highest average accuracy was achieved by TF-IDF.

Table 3. The comparison of average accuracies achieved for each weighting scheme (average accuracy from all experiments).
TF-IDF: 0.598 | TP: 0.587 | TF: 0.586

Table 4 shows for each classifier the average accuracy from all experiments. We can see that the decision tree classifiers "ExtraTreesClassifier" and "RandomForestClassifier" performed the best, with an accuracy of around 61%.

Table 4. The comparison of average accuracies achieved by each classifier.
ExtraTreesClassifier: 0.613 | RandomForestClassifier: 0.609 | MultinomialNB: 0.602 | LogisticRegression: 0.601 | BernoulliNB: 0.598 | LinearSVC: 0.597 | MLPClassifier: 0.596 | DecisionTreeClassifier: 0.564 | NearestCentroid: 0.533

4.2 Neighboring windows approach

Tables 5, 6, 7, and 8 show the results achieved for the 12 data sets consisting of two neighboring windows.

Table 5. Facebook posts – neighboring windows: data statistics.
Data set no. | Class 1 samples | Class 2 samples | Total samples | Balance ratio | Number of words
1 | 477 | 520 | 997 | 0.917 | 691
2 | 1,156 | 719 | 1,875 | 1.608 | 1,523
3 | 4,293 | 953 | 5,246 | 4.505 | 3,627
4 | 1,187 | 2,008 | 3,195 | 0.591 | 2,350
5 | 8,971 | 2,207 | 11,178 | 4.065 | 6,523
6 | 3,052 | 4,802 | 7,854 | 0.636 | 4,995
7 | 3,454 | 3,998 | 7,452 | 0.864 | 4,684
8 | 8,399 | 15,292 | 23,691 | 0.549 | 10,441
9 | 5,198 | 2,211 | 7,409 | 2.351 | 4,646
10 | 1,918 | 2,828 | 4,746 | 0.678 | 3,191
11 | 26,111 | 8,139 | 34,250 | 3.208 | 13,269
12 | 21,070 | 9,750 | 30,820 | 2.161 | 12,116

Table 5 shows the statistics about the data used for the classification. Because the length of the windows was variable, the numbers of documents vary greatly. It is also visible that most of the data sets are imbalanced. This should be taken into account when evaluating the results.
Table 6. Facebook posts – neighboring windows: classification results.
Data set no. | Accuracy | Precision | Recall | F1 score
1 | 0.584 | 0.602 | 0.584 | 0.593
2 | 0.615 | 0.598 | 0.615 | 0.606
3 | 0.808 | 0.653 | 0.808 | 0.722
4 | 0.639 | 0.733 | 0.639 | 0.683
5 | 0.803 | 0.807 | 0.803 | 0.805
6 | 0.646 | 0.653 | 0.646 | 0.650
7 | 0.554 | 0.553 | 0.554 | 0.553
8 | 0.674 | 0.669 | 0.674 | 0.672
9 | 0.721 | 0.727 | 0.721 | 0.724
10 | 0.618 | 0.614 | 0.618 | 0.616
11 | 0.800 | 0.782 | 0.800 | 0.791
12 | 0.698 | 0.702 | 0.698 | 0.700
Average | 0.680 | 0.674 | 0.680 | 0.676

According to Table 6, the average accuracy (as well as the F1 score) was 68%. The best accuracy (as well as F1 score) was achieved for data sets 3 (72%), 5 (80%), and 11 (79%). The reason for this might be that they have a balance ratio of around 4 (with more documents marked with the index value going up).

Table 7. The comparison of average accuracies achieved with different weighting schemes applied to the neighboring windows of the Facebook posts.
Data set no. | TP | TF | TF-IDF
1 | 0.539 | 0.534 | 0.534
2 | 0.555 | 0.556 | 0.570
3 | 0.744 | 0.753 | 0.769
4 | 0.626 | 0.625 | 0.639
5 | 0.735 | 0.736 | 0.765
6 | 0.610 | 0.609 | 0.620
7 | 0.534 | 0.534 | 0.536
8 | 0.633 | 0.629 | 0.646
9 | 0.663 | 0.661 | 0.675
10 | 0.568 | 0.562 | 0.578
11 | 0.754 | 0.747 | 0.766
12 | 0.637 | 0.644 | 0.658
Average | 0.633 | 0.633 | 0.646

From Table 7 it can be seen that the highest average accuracy was provided by the TF-IDF weighting scheme (+1% in comparison to TP and TF).

Table 8. The comparison of average accuracies achieved by different classifiers applied to the neighboring windows of the Facebook posts.
Data set no. | Classifier | Avg. accuracy
1 | NearestCentroid | 0.587
2 | LogisticRegressionCV | 0.578
3 | LogisticRegression | 0.748
4 | LogisticRegressionCV | 0.633
5 | LogisticRegression | 0.784
6 | MultinomialNB | 0.630
7 | ExtraTreesClassifier | 0.553
8 | SGDClassifier | 0.656
9 | MultinomialNB | 0.701
10 | LogisticRegressionCV | 0.595
11 | ExtraTreesClassifier | 0.788
12 | SGDClassifier | 0.673

Table 8 shows the classifier that achieved the highest accuracy for each data set. We can see that Logistic Regression achieved the best result most often (5 times, counting its cross-validated variant). Among the other classifiers, the Multinomial Naïve Bayes classifier, Extra Trees Classifier, and Stochastic Gradient Descent (SGD) Classifier were the most successful twice each, and the Nearest Centroid was the best only once.

5 CONCLUSION

The goal of the work was to examine whether the content of text documents published on the Internet (specifically Facebook posts) has any connection with stock price movements. We used the values of the S&P 500 Index and divided them into 24 time windows with either a growing or a decreasing index value trend. Subsequently, we examined (using the classification accuracy) the connection between the documents' content and the trend of the index value in the time window in which the document was published.

Two types of experiments were performed. In the first one, the documents from all 24 windows were put into one data set and we achieved an accuracy of 62%. The second experiment, in which we divided the documents into 12 data sets formed from two neighboring windows, provided better results – the average accuracy was 68%.
Moreover, for three data sets the accuracy was even higher – 72%, 79% and 80%. This means that classifying data from the neighboring windows brings on average better results than using only one data set. This might be related to the concept drift [12] phenomenon, which requires further investigation for this specific domain.

The achieved accuracy of around 70% tells us that the posts which companies write on their Facebook pages are partially related to the performance of the whole stock index.

It must be noted that we did not optimize the parameters of the classification algorithms used. By doing this, we might achieve a slightly higher accuracy.

This area could be further researched in various directions. Firstly, the analysis may be performed on more types of documents (e.g., newspaper articles). Secondly, the class assignment method may be enriched by using various thresholds of the index value changes (not only 5%). Thirdly, it might be interesting to examine not the whole stock index, but the stock prices of the individual companies instead.

ACKNOWLEDGEMENT

This research was supported by the Czech Science Foundation [grant No. 16-26353S "Sentiment and its Impact on Stock Markets"], the Internal Grant Agency of Mendel University [No. PEF_DP_2017001 "Searching for semantic information and gaining knowledge from text data streams with new machine learning methods"] and the Internal Grant Agency of Mendel University [No. PEF_DP_2017022 "Acquiring, filtering and analyzing of texts for stock markets"].

REFERENCES

[1] Aggarwal, C. C. Data Streams: Models and Algorithms. Springer, 2007.
[2] Gama, J. Knowledge Discovery from Data Streams. CRC Press, 2010.
[3] Petrovský, J., Netolický, P. and Dařena, F. Examining Stock Price Movements on Prague Stock Exchange Using Text Classification. International Journal of New Computer Architectures and their Applications (IJNCAA), vol. 7, no. 1, 2017, pp. 8–13. ISSN 2412-3587.
[4] Groth, S. S. and Muntermann, J. An intraday market risk management approach based on textual analysis. Decision Support Systems, vol. 50, no. 4, March 2011, pp. 680–691.
[5] Kearney, C. and Liu, S. Textual sentiment in finance: A survey of methods and models. International Review of Financial Analysis, vol. 33, May 2014, pp. 171–185.
[6] Bollen, J., Mao, H. and Zeng, X. Twitter mood predicts the stock market. Journal of Computational Science, vol. 2, no. 1, 2011, pp. 1–8.
[7] Kaplanski, G. and Levy, H. Sentiment and stock prices: The case of aviation disasters. Journal of Financial Economics, vol. 95, no. 2, 2010, pp. 174–201.
[8] Lee, H., Surdeanu, M., MacCartney, B. and Jurafsky, D. On the Importance of Text Analysis for Stock Price Prediction. In: LREC, 2014, pp. 1170–1175.
[9] Maas, A. L., Daly, R. E., Pham, P. T. and Huang, D. Learning word vectors for sentiment analysis. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, 2011, vol. 1, pp. 142–150.
[10] Go, A., Bhayani, R. and Huang, L. Twitter sentiment classification using distant supervision. CS224N Project Report, Stanford, 2009, vol. 1, p. 12.
[11] Weiss, S. M., Indurkhya, N. and Zhang, T. Fundamentals of Predictive Text Mining. London: Springer, 2010. ISBN 978-1-84996-225-4.
[12] Lindstrom, P., Delany, S. J. and Mac Namee, B. Handling Concept Drift in Text Data Streams Constrained by High Labelling Cost. Florida Artificial Intelligence Research Society Conference (FLAIRS), Florida, 19–21 May 2010.

Towards Specification Formalisms for Data Warehousing Requirements


Elicitation Techniques
Isaac N. Mbala (1) and John A. van der Poll (2)
(1) School of Computing, Florida, University of South Africa
(2) Graduate School of Business Leadership (SBL), Midrand, University of South Africa
(1) isaacnkongolo64@gmail.com, (2) vdpolja@unisa.ac.za

ABSTRACT

Various studies have demonstrated that data warehouse projects often fail to meet the business objectives and business requirements of the target company. Many of these projects fail to recognize the importance of comprehensive requirements elicitation and subsequent specification phases. The requirements produced during these definition phases are incomplete or incoherent, leading to incorrect specifications. In this paper we present the challenge of data warehousing project failure and propose a requirements engineering methodology for the specification of data warehousing requirements. Our approach leans towards a hybrid-driven technique for requirements analysis for data warehouse systems. Formalisms for requirements elicitation are proposed, followed by formal specification in Z.

KEYWORDS

Data warehouse design, Requirements engineering, Formal methods, Formal specification, Z.

1 INTRODUCTION

Nowadays, data warehousing (DWH) is the technology most used in large industries [1], [2]. The usage of data warehousing involves the creation of a data warehouse (DW) made up of data marts (DMs) or operational databases with business intelligence (BI) embedded in the resultant structure. The focus of this paper is a methodology of requirements engineering (RE) for the specification of requirements used during the phase of requirements analysis for DW systems design. Numerous authors have researched requirements analysis for DW systems, notably [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], and [11] to name but a few, yet rather few research works have attempted to address the requirements engineering aspect. [5], [7] and [12] have defined a data warehouse as "a subject oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decisions". A data warehouse is recognized as one of the most complex information systems, and its maintenance and design are described by numerous complexity coefficients [2], [5], [7]. A data warehouse is thought of as the cornerstone of business intelligence systems [12], [13].

As observed by [2], the objective of the data warehouse is to yield concise analyses in order to assist decision makers and also to increase overall organizational performance. Building a conventional operational system requires taking into consideration the demands on how to automate the operations of the company; to develop a DW system, however, the analytical requirements sustaining the process of decision-making also need to be captured [11]. Data warehouse development and maintenance is a complex and tedious task which requires solving many different types of problems [14].
The design of DW systems differs from the design of the traditional operational systems that provide data to the warehouse, since the objective of data warehouse projects fundamentally relies on supporting the decision-making process of the enterprise in order to facilitate analysis processes [2], [5], [8], and [11].

It is well documented that a prominent reason why many DW projects have failed in the past is not only that they attempted to supply strategic information from operational systems while those operational systems were not intended to provide strategic information [5], but also that the requirements analysis phase was often overlooked [1], [2], and [3] during the design process. For these reasons, [2] and [14] have declared that over 80% of DW projects fail to meet the users' and stakeholders' requirements. The requirements analysis phase can be executed informally, based on simple requirements glossaries instead of formal diagrams, but such an informal (or maybe semi-formal) approach may be inappropriate for a requirements-driven framework that requires more organized and comprehensible techniques [3].

Data warehouse projects are similar in several phases to any software development project and call for a definition of the different activities which ought to be executed, related to demand collection, design and implementation within an operational platform, amongst other activities [1], [14]. Despite the similarity to general software development, the effective development of a DW relies upon the quality of its models (design and specification) [15]. However, the success of the system under development may be strongly affected by the process of discovering the demands of the involved stakeholders and sustaining those demands while transforming and documenting them in a form which is analyzable and communicable. This process is generally known as Requirements Engineering (RE) [11], which is a necessary and vital phase in the software development life cycle (SDLC) [16].

According to [16] and [17], RE is the process which is intended to collect, document, analyze and manage requirements for systems and software products throughout the SDLC. According to [5], RE in the data warehouse arena has acquired increased importance, and it has the goal of identifying the information demands of the decision makers. Researchers are currently attempting to utilize numerous requirements engineering techniques to analyze the specification of data warehouse systems in order to avoid the risk of failure. Several techniques and methods are used in requirements engineering activities, and in this paper we are most interested in formal methods [16], [18], and [19] for analyzing system behavior, risk factors and problems related to implementation [20] during the design of the system.

The use of Formal Methods (FMs) in the construction of reliable software has been controversial for a number of decades. Advocates of such techniques point to the advantages to be gained in constructing provably correct systems, especially in the arena of mission/safety-critical systems, e.g. nuclear power plants and aviation systems. Critics of FMs object to the steep learning curve involved in mastering the underlying discrete mathematics and formal logic needed for the effective use of the methodology.
Yet, the literature suggests that using FMs in the design of data warehouse systems ought to be useful in improving the reliability of such systems and other functionality [19].

Formal methods are mathematical approaches sustained by tools and techniques for the verification of the desired and necessary properties of software or hardware systems. FMs are necessary for the control of quality parameters such as completeness, correctness and consistency, and for the verification of the requirements of a system [13]; they are based on (often discrete) mathematical notations and logic to clearly and accurately express requirements specifications [21].

According to the research work published by [16], formal methods are most likely to be used at the design and verification levels of software development. As observed by [13], formal methods are associated with three techniques, namely formal specification, formal checking (discharging proof obligations), and refinement. Formal specifications aim to provide an unambiguous and coherent complement to natural language descriptions [21], [22] and are rigorously validated and verified, leading to the early detection of specification errors [22]. One of the broadly used formal specification languages amongst many different formal specification languages is Z, selected in this paper owing to its simplicity and wide use in the formal methods arena [22].

This paper is structured as follows. Following our research questions below, we introduce in Section 2 the fundamental concepts of data warehouse systems design by discussing various design approaches. Section 3 presents a Z specification of a data warehouse star schema and Section 4 addresses related work in this area. In Section 5 we address our methodology and finally, conclusions and directions for future work are presented in Section 6.

1.1 Research questions

In this paper we aim to find answers to the following questions:

RQ1: What are the requirements elicitation approaches for data warehouse development?

RQ2: To what extent may formal specification facilitate data warehousing?

RQ3: How may the two (2) prominent elicitation techniques be combined?

2 DATA WAREHOUSE SYSTEMS DESIGN

Building a DW system is unlike building transactional systems with respect to development; not only must the structures of those kinds of source systems be thought of, but cognizance should also be given to the purposes and strategies of the organization [8]. Data warehouse systems have the purpose of supporting the decision-making process of an enterprise. The development of a DW requires that the analytical requirements supporting the decision-making process be captured, and such requirements are usually not easy to extract and specify [11]. A DW is generally defined as the linking of a number of operational databases with the aforementioned intelligence (e.g. decision-making) added to the resultant structure. Since a DM is viewed as a subset of a DW, we view a data mart as being one of the operational databases in the DW.

Subsequently, we formalize a DW as follows:
Link_{i=1}^{n} DB_i, where

(∀i)(∀j)(1 ≤ i, j ≤ n ⦁ i ≠ j ⇒ DB_i ∩ DB_j = ∅)

The above definition assumes that different databases, when correctly normalized, do not contain common elements, except of course for foreign key matches.

The design of data warehouse systems is unlike the design of the transactional systems that provide data to the warehouse [8]. There are two well-known authors in the world of data warehousing, Bill Inmon and Ralph Kimball, advocating complementary, yet different techniques for the design of data warehouses. The technique applied by Bill Inmon is the familiar top-down design, which begins with the Extraction-Transformation-Loading (ETL) process working from external data sources in order to build a data warehouse, whilst Ralph Kimball applies the equally well-established bottom-up technique, which begins with an ETL process for one or more data marts separately. Most proponents of data warehouse design subscribe to either of the two techniques [12].

2.1 Data warehouse systems design approaches

The DW systems design is based on two approaches that are alternatives and inverses of each other, viz. the Data-driven approach, also known as the Supply-driven approach, and the Requirement-driven approach, also known as the Demand-driven approach [1], [3], [7], and [10]. The process of development of a DW starts with the identification and collection of requirements. The design of the multidimensional model is next, followed by testing and maintenance [9]. Developing a data warehouse requires a set of steps to be accomplished throughout the process, namely the requirements analysis phase, conceptual phase, logical phase and physical phase [6], [8]. According to [9], the design stage is the most significant operation in the successful construction of a DW.

The following sections elucidate the context of requirements analysis and conceptual design, which are considered the two main phases within the data warehouse systems design process [8].

2.1.1. Requirements Analysis

Requirements analysis has as its aim detecting which knowledge is useful for decision making, by investigating the users' demands and expectations in user-driven and goal-driven approaches, or by verifying the validity of operational data sources in a data-driven approach [8]. User requirements analysis plays a crucial role in data warehouse systems design. It has a major influence upon the taking of decisions throughout the data warehouse systems implementation [2], [23]. The requirements analysis phase leads the designer to unveil the necessary elements of the multidimensional schema (facts, measures and dimensions) which are claimed to assist future data manipulations and calculations. The multidimensional schema has a significant impact on the success of DW projects [2], [3], and [14].

Several research works have been published on the various approaches used during the requirements analysis phase of DW systems design, leaning on the two techniques mentioned above – the top-down technique and the bottom-up technique. Implementations of these are the Data-driven approach, Goal-driven approach, User-driven approach, and Mixed-driven approach [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], and [11]:
• The Data-driven approach, also known as the supply-driven approach, utilizes the bottom-up technique and yields subject-oriented business data schemas by leaning only on the operational data sources, ignoring business goals and stakeholder needs.

• The Goal-driven approach applies the top-down technique rather than the bottom-up technique. It allows for the generation of information like Key Performance Indicators (KPIs) of principal business areas by relying on business objectives and granted business processes only, and essentially overlooks data sources and user demands.

• The User-driven approach is similar to the goal-driven approach, and it applies the top-down technique. It permits the production of analytical requirements interpreted by the dimensions and measures of each subject, ignoring business goals and data sources.

A user-driven approach starts with a detailed agreement on the needs and expectations of the users, and this brings about numerous advantages, like increased productivity, enhanced work quality, reduced support and training costs, and improved general user satisfaction [24]. User requirements analysis does not prescribe a standard approach on which designers may rely for designing their data warehousing projects [16]. As declared by [6], the data-driven approach yields a conceptual schema through a re-engineering process of the data sources, ignoring the end users' and stakeholders' contribution, whilst the requirements-driven approach aims at generating the conceptual schema by only considering the needs expressed by end users and stakeholders. The correct elicitation of user requirements remains a fine challenge, and many techniques, e.g. the use of JAD (Joint Application Design) sessions [25], have been put forward.

These three primary aforementioned approaches have their merits and demerits. However, in an attempt at overcoming this problem, numerous authors have suggested the mixed-driven approach, which consists of combining two or even all three of the primary approaches (either the user-driven and data-driven approaches, or the goal-driven and data-driven approaches, or a combination of the user-driven, goal-driven and data-driven approaches), as detailed in [2], all aimed at getting a "best result" that will meet users' and stakeholders' demands and expectations. According to [14] and [26], the requirements-driven approach is also called the analysis-driven approach; the supply-driven approach is also called the source-driven approach; and the requirement/supply-driven approach is known as an analysis/source-driven approach, but is also known by the name of a hybrid-driven approach.

The various approaches discussed above are depicted in Figure 1 below.

Figure 1: Complementary top-down (requirements-/demand-/analysis-driven, including user- and goal-driven) and bottom-up (data-/supply-/source-driven) approaches to the DW
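As a toy illustration of how the mixed-driven combination might work in practice (our sketch in Python; the requirement names are hypothetical), candidate requirements gathered top-down can be reconciled with elements discovered bottom-up from the operational sources:

# Sketch only: demand-side requirements (goals, users) are kept when the
# operational data sources can actually support them.
goal_driven = {"monthly_sales_kpi", "regional_rentals"}
user_driven = {"regional_rentals", "fleet_utilisation"}
data_driven = {"regional_rentals", "maintenance_cost", "fleet_utilisation"}

candidate = goal_driven | user_driven        # top-down demand side
supported = candidate & data_driven          # kept: source data exists
unsupported = candidate - data_driven        # needs new data feeds or re-scoping
print(supported, unsupported)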
Table 1 below presents the advantages and disadvantages of the approaches, grouped by technique.

Table 1: Advantages and disadvantages of techniques.

Top-down technique (User-driven, Goal-driven, Demand-driven, Analysis-driven, Requirements-driven approaches):
- Advantages: a coherent dimensional view of data through data marts is provided from the DW; it is easy to reproduce a data mart from a data warehouse.
- Disadvantages: it is not flexible to requirements changes during the implementation; it is highly exposed to the risk of failure.

Bottom-up technique (Data-driven, Source-driven, Supply-driven approaches):
- Advantages: it is less exposed to the risk of failure; it facilitates the return on investment and leads to concrete results in a short time.
- Disadvantages: the data view for each data mart is narrowed; redundant data penetrates each data mart.

The above discussion presents an answer to our RQ1 above.

2.1.1.1. Requirements-driven Approach

The conceptual schema development within the requirements-driven approach is based on business or user requirements. The organizational purposes and demands that the DW systems are expected to support in the process of decision making include the requirements needed for the conceptual schema. The information collected serves as a foundation for the initial data warehouse schema development [14], [26].

Figure 2 below depicts the analysis-driven approach framework with all the steps considered:

Figure 2: Requirement-driven approach framework [25] (steps: identify users; determine analysis demands; define, refine and prioritize goals; model business processes – determine user needs, detail processes for accomplishment of goals, specify services or activities; document requirements specification)

2.1.1.2. Supply-driven Approach

In the supply-driven approach, the development of the conceptual schema leans on the data available in the operational systems. The objective of this approach is to identify multidimensional schemas which may be conveniently implemented over the legacy operational databases (data marts – see above). An exhaustive analysis is made over these databases to elicit the essential elements which can depict facts with attached measures, dimensions and hierarchies, and the discovery of these elements leads to an initial DW schema which can correspond to many different analysis goals [14], [26].

Figure 3 below represents the supply-driven approach framework with all the considered steps.
Figure 3: Supply-driven approach framework [25] (steps: identify operational systems; apply derivation process; document requirements specification)

2.1.1.3. Hybrid-driven approach

The hybrid-driven approach combines the two abovementioned approaches, which can be used in parallel in order to obtain a best design [14], [26]. The operation of requirements mapping takes place while fact tables, measures and dimensions are identified during the decisional modeling and are mapped over entities in the source schema [3].

The hybrid-driven approach framework is illustrated below in Figure 4.

Figure 4: Hybrid-driven approach framework (supply-driven side: identify operational systems, apply derivation process; requirement-driven side: identify users, determine analysis demands; the two sides meet in a matching process, followed by the document requirements specification step)

According to [14] and [26], for all the previously described approaches, showing the different iterations which can be claimed before the final DW schema is developed, the essential step – and the one we concentrate on in this paper – is the document requirements specification step. This enables the documentation of all the information obtained from the previous steps. This step will contain the business requirements and business objectives expressed in more detail, e.g. what operations may be done, who has the right to access the data, what measures and dimensions are represented, etc.

2.1.2. Conceptual design

Although most of the research on the design of DWs concerns their logical and physical models, the essential foundation to build a data warehouse on is an accurate conceptual design that is well-documented and thoroughly fulfills the requirements [27]. We address below some works that discuss the conceptual design of data warehouse systems.

In [28] the authors proposed a hybrid technique whose objective is to combine techniques for defining user requirements for DWs in order to obtain a multidimensional conceptual model, using an ensemble of Query-View-Transformation (QVT) rules based on the multidimensional normal forms for preciseness. In the paper published by [15], an approach for multidimensional star schema validation supported by reparation solutions has been proposed, assisting designers in the detection of constraint violations and suggesting reparation solutions based on a number of mistake-based rules formalized in Prolog.
However, the main purpose of this stage is to develop a conceptual diagram that will meet the functional requirements from the requirements analysis stage (requirement-driven approach) and the data model designed from the legacy operational systems [8], in order to fulfill the users' and stakeholders' demands and expectations. This phase is useful for the representation of the necessary elements in the multidimensional schema after the specification of requirements.

According to [29], a schema is defined by the relation between facts and dimensions. A fact is the subject of analysis or the focus of interest in the process of decision making [27], and dimensions are the different perspectives used for the analysis of facts. A fact contains numerical attributes commonly called measures [29].

Figure 5 below depicts a multidimensional schema for data warehouse systems, following the suggestions in [15].

Figure 5: A multidimensional schema (Fact-table 1..n, holding measures, linked to Dimensions-table 1..n, holding attributes)

Next we present a formal specification of the star-based structure in Figure 5 in Z.

3 A FORMAL SPECIFICATION

As per the Established Strategy for constructing a Z specification [18], we define the basic types of the system. Our state consists of measures and additional attributes, therefore we have as basic types:

[Measure, Attribute]

A Fact table consists of attributes, but given the structure of the star schema [15] we distinguish between measures and ordinary attributes as follows.

Fact_table
  measures : ℙ Measure
  attributes : ℙ Attribute
  measures ∩ attributes = ∅

While the measures of fact tables in a star schema are also attributes, we give them special status to reflect the structure defined in [15].

A Dimension as per Figure 5 consists of a number of fact tables.

Dimension
  dimension : ℙ Fact_table

As per the Established Strategy for constructing a Z specification, a specifier has to define an initial state, whereupon a proof obligation (PO) arises, namely, to show that such an initial state may be realized (i.e. it exists).

Subsequently, the initial state of the star structure is given by:

Init_Dimension
  Dimension′
  dimension′ = ∅

The PO that arises is:

⊢ ∃ Dimension′ ⦁ Init_Dimension

Therefore, ⊢ ∃ dimension′ | dimension′ = ∅, from which it follows that the initial state of the system contains no fact tables. Subsequently, measures = ∅ ∧ attributes = ∅. Therefore the initial state Init_Dimension above can be realized.

The above Z specification provides an answer to our RQ2 above.
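To illustrate how the Established Strategy extends beyond initialisation, a hypothetical operation schema (our sketch, not part of the specification above) for adding a fact table to the star structure could be:

Add_Fact_table
  ΔDimension
  ft? : Fact_table

  ft? ∉ dimension
  dimension′ = dimension ∪ {ft?}

The usual precondition calculation shows the operation applies to any ft? not already in dimension, after which the standard proof obligation of preserving the state invariant may be discharged.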
Therefore, ⊢ ∃ dimension′ | dimension′ = ∅


from which it follows that the initial state of Dim - Cars
the system contains no fact tables. IdCar
Subsequently, measures = ∅ ∧ attributes = ∅. Model
Therefore the initial state Init_Dimension above Type
Brand
can be realized.
The above Z specification provides an
answer to our RQ2 above.
Figure 5 suggests that in the multidimensional schema we can have several facts linked to various dimensions. We choose to design the DW conceptual schema with the star model. In Figure 6 we show how a multidimensional star schema may appear for a specific case.

Figure 6: A multidimensional star schema [12] (a central fact table holding measures, linked to dimension tables 1, 2 and 3, each holding attributes)

Example 1

Consider the example of a car rental company represented in Figure 7, which illustrates the generic multidimensional star schema concepts of Figure 6. An example where a decision maker is interested in analyzing the fact Renting in terms of the Amount measure is described in [15].

Figure 7: A multidimensional star schema (Fact-Renting: IdCar, idAgency, Day, Amount; Dim-Cars: IdCar, Model, Type, Brand; Dim-Date: Year, Month, Week, Day; Dim-Agency: idAgency, Country, City)

4 RELATED WORK

Several works in the literature have addressed requirements analysis for data warehouse systems, and just a few works try to link requirements engineering to DW systems [30]. Requirements analysis is thought of as one of the significant tasks to facilitate the success of a data warehouse project [2]. Below we discuss some work that examines the requirements analysis phase in more detail within the data warehouse systems design arena.

[6] proposed a hybrid approach based on the two main approaches used for data warehouse systems design (the requirements-driven approach and the supply-driven approach), which helps to produce the conceptual
schema based on a graph-oriented representation, using in turn an automatic reengineering of the data sources, based on a set of constraints derived from user requirements.

In [5] the authors introduced various requirements elicitation methodologies by addressing the advantages of each of them; they investigated the role of requirements engineering within DW development and made a comparative study of different requirements elicitation methodologies (GDI model, AGDI model, Use case approach, DFM, CADWA, Tropos) for DW development.

[10] discussed requirements analysis approaches for the design of data warehouse systems and suggested a requirements analysis framework based on business objects for data warehouse systems to identify the analytical requirements, further refining those requirements to map onto the conceptual level of the data warehouse design model, using either a requirements-driven or a supply-driven approach for DW requirements analysis. The author declared that the multidimensional data model at the conceptual level is the primary deliverable of data warehouse requirements analysis.

In [9] the author proposed an object-oriented framework for DW systems conceptual design and used UML (Unified Modeling Language) within the design process of software system development, as the latter has become a standard for object-oriented modelling during the design and analysis phases. He also made a comparative study of various data warehouse design approaches and schemas used, according to different authors.

Various authors, e.g. [14], [28], have discussed different approaches and methodologies used for DW systems design. Following that, we have chosen to use the hybrid approach to create our framework as a goal of this paper.

5 OUR METHODOLOGY

Our methodology is depicted in Figure 8 below. It is organized as a sequence of steps where each step follows in depth the application abstraction level, as the requirements of the project are collected to form the requirements basis. Such requirements are ultimately formally specified, using e.g. Z.

As is evident from the diagram, we base our approach on the hybrid-driven approach as well as the requirements-driven approach, aimed at assisting designers of data warehouse systems to design a conceptual schema that will meet the users' and stakeholders' needs and expectations. Therefore, we are not concerned with all the steps of requirements elicitation for documenting requirements – details may be observed in [14].

The two sets of requirements stem from bottom-up and top-down analysis respectively. Furthermore, for the requirements analysis, requirements are collected from dissimilar users and, on the supply side, from different legacy operational systems. Finally the requirements are combined as shown and discussed below.


Figure 8: Proposed framework. (Supply-driven branch: identify operational systems, then apply the derivation process. Requirement-driven branch: identify users, then determine the analysis demands. The two branches feed a matching process, which produces the documented requirements specification, from which the formal specification and, finally, the conceptual schema are derived.)

The crucial step in Figure 8 above is the matching of the two sets of requirements obtained through the top-down processes and the bottom-up processes. It is plausible that these requirements sets may be non-homogeneous and in different formats; e.g. one may contain structured data (bottom-up data), while the other set may contain unstructured data obtained through incomplete and often inconsistent user requirements.

To merge the two data sets we enhance the SRE (Software Requirements Elicitation) algorithm defined in [25] as follows:

Algorithm 1: Enhanced SRE – Merge structured and unstructured data sets obtained from bottom-up and top-down requirements elicitation.

Begin
  (* Elicit unstructured and structured data through
     requirements- and supply-driven elicitations. *)
  (* Assume m structured data sets *)
  for i := 1 to m step 1 do
    Si := Structured data set i of supply-driven requirements;
  (* Assume n unstructured data sets *)
  for j := 1 to n step 1 do
    Uj := Unstructured data set j of user requirements;
  (* Amalgamate the two data sets into two separate sets S and U.
     ⋃ denotes set-theoretic arbitrary union. *)
  S := ⋃ i=1..m Si ;  U := ⋃ j=1..n Uj
  (* Combine the outcome of the elicitations as per Figure 8 into set C. *)
  C := S ∩ U
End.

Following the construction of set C containing both structured and unstructured data in Algorithm 1, the next step is to construct a requirements definition [31], [17] as per Figure 8. From the definition, a formal specification [22] would be constructed in line with the ideas presented in Section 3. Algorithm 1 provides an answer to our RQ3 above.
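A minimal executable sketch of Algorithm 1 may help; this is our own illustration in Python rather than the authors' artifact, with requirement items represented as plain set elements and the matching step following the algorithm's C := S ∩ U literally:

```python
def enhanced_sre_merge(structured_sets, unstructured_sets):
    """Merge the m supply-driven (structured) and n requirement-driven
    (unstructured) data sets, following Algorithm 1."""
    S = set().union(*structured_sets)    # S := union of the S_i
    U = set().union(*unstructured_sets)  # U := union of the U_j
    return S & U                         # matching process: C := S ∩ U

# Hypothetical example: two structured sets and one unstructured set.
C = enhanced_sre_merge([{"r1", "r2"}, {"r3"}], [{"r2", "r4"}])
print(C)  # {'r2'} -- the requirements matched by both elicitations
```

In practice the matching process would compare requirements by meaning rather than by identity, but the set-theoretic skeleton is the same.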


6 CONCLUSIONS AND FUTURE WORK

In this paper we introduced several works which discuss approaches used for data warehouse systems design. Various authors in the literature review [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11] and [32] have proposed different techniques and/or approaches and frameworks to address the issue of DW systems design. We also investigated the extent to which the use of formal methods for data warehouse systems may alleviate such failures within the design process.

We proposed a requirements engineering methodology, represented by our proposed framework, which is created from one of the existing requirements analysis frameworks for the specification of system requirements, applied to requirements analysis for the design of DW systems at the conceptual level. The use of a formal specification as a last step in the process is proposed.

We anticipate our proposed framework to facilitate removing vagueness or inconsistencies, at least to some extent. Such ambiguity is often present in the natural language specifications used to specify data warehouses.

As for future work, we intend to test our methodology on a case study using the CZT toolset, aimed at facilitating the building of a formal specification.

The combined set C in Algorithm 1 above contains both structured and unstructured data, hence the set is non-homogeneous. Z, however, is a strongly typed language, containing only "homogeneous" sets. Therefore specifying the above processes in Z will require us to devise a mechanism to deal with structuredness as well as unstructuredness in the same set.

REFERENCES

[1] N. Jukic and J. Nicholas. (2010). A Framework for Collecting and Defining Requirements for Data Warehousing Projects. Journal of Computing and Information Technology, 18(4), pp. 377–384. https://doi.org/doi:10.2498

[2] N. H. Z. Abai, J. H. Yahaya and A. Deraman. (2013). User Requirement Analysis in Data Warehouse Design: A Review. Procedia Technology, 11, pp. 801–806. https://doi.org/10.1016/j.protcy.2013.12.261

[3] P. Giorgini, S. Rizzi and M. Garzetti. (2008). GRAnD: A goal-oriented approach to requirement analysis in data warehouses. Decision Support Systems, 45(1), pp. 4–21. https://doi.org/10.1016/j.dss.2006.12.001

[4] D. T. A. Hoang. (2011). Impact Analysis for On-Demand Data Warehousing Evolution. In ADBIS (2) (pp. 280–285). Retrieved from https://pdfs.semanticscholar.org/ae5c/a847a8afc046951e34653fcbd3ade06322cb.pdf

[5] S. Mathur, G. Sharma and A. K. Soni. (2012). Requirement elicitation techniques for data warehouse: review paper. International Journal of Emerging Technology and Advanced Engineering, 2(11), pp. 456–459. Retrieved from https://www.researchgate.net/profile/Girish_Sharma4/publication/259827907_IJETAE_1212_84/links/02e7e52e0dc392211f000000.pdf

[6] F. Di Tria, E. Lefons and F. Tangorra. (2011). GrHyMM: A graph-oriented hybrid multidimensional model. In Lecture Notes in Computer Science (Vol. 6999 LNCS, pp. 86–97). Springer. https://doi.org/10.1007/978-3-642-24574-9_12

[7] M. Golfarelli. (2010). From User Requirements to Conceptual Design in Data Warehouse Design. Data Warehousing Design and Advanced Engineering …, 15. https://doi.org/10.4018/978-1-60566-756-0.ch001

[8] M. El Mohajir and I. Jellouli. (2014). Towards a Framework Incorporating Functional and Non-Functional Requirements for Datawarehouse Conceptual Design. IADIS International Journal on Computer Science and Information Systems, 9(1). Retrieved from http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.640.5590&rep=rep1&type=pdf

[9] R. Jindal. (2012). Comparative Study of Data Warehouse Design Approaches: A Survey. International Journal of Database Management Systems, 4(1), pp. 33–45. https://doi.org/10.5121/ijdms.2012.4104

[10] A. Sarkar. (2012). Data Warehouse Requirements Analysis Framework: Business-Object Based Approach. International Journal of Advanced Computer Science and Applications, 3(1), pp. 25–34. https://doi.org/10.14569/IJACSA.2012.030104

[11] A. Nasiri, E. Zimányi and R. Wrembel. (2015). Requirements Engineering for Data Warehouses. Retrieved from http://code.ulb.ac.be/dbfiles/NasZimWre2015incollection.pdf

[12] T. Oketunji and O. Omodara. (2011). Design of Data Warehouse and Business Intelligence System. Master Thesis. Retrieved from http://www.diva-portal.org/smash/record.jsf?pid=diva2:831050

[13] T. Pandey and S. Srivastava. (2015). Comparative Analysis of Formal Specification Languages Z, VDM and B. International Journal of Current Engineering and Technology, 5(3), pp. 2277–4106. Retrieved from http://inpressco.com/wp-content/uploads/2015/06/Paper1082086-2091.pdf

[14] E. Malinowski and E. Zimányi. (2008). Advanced data warehouse design: from conventional to spatial and temporal applications. Retrieved from https://books.google.co.za/books?id=XPMVW3PtGtEC

[15] A. Salem and H. Ben-Abdallah. (2015). The design of valid multidimensional star schemas assisted by repair solutions. Vietnam Journal of Computer Science, 2(3), pp. 169–179. https://doi.org/10.1007/s40595-015-0041-1

[16] M. Dos Santos Soares and D. S. Cioquetta. (2012). Analysis of techniques for documenting user requirements. In International Conference on Computational Science and Its Applications (pp. 16–28). Springer. Retrieved from http://link.springer.com/chapter/10.1007/978-3-642-31128-4_2

[17] I. Sommerville. (2015). Software Engineering (10th ed.). Pearson.

[18] B. Potter, J. Sinclair and D. Till. (1996). An Introduction to Formal Specification and Z (2nd ed.). Prentice Hall International Series.

[19] J. Q. Zhao. (2007). Formal Design of Data Warehouse and OLAP Systems.

[20] S. A. Han and H. Jamshed. (2016). Analysis of Formal Methods for Specification of E-Commerce Applications, 35(1), pp. 19–28.

[21] S. H. Bakri, H. Harun, A. Alzoubi and R. Ibrahim. (2013). The Formal Specification for the Inventory System Using Z Language. The 4th International Conference on Cloud Computing and Informatics, (64), pp. 419–425.

[22] M. Gulati and M. Singh. (2012). Analysis of Three Formal Methods: Z, B and VDM. International Journal of Engineering, 1(4), pp. 1–5. Retrieved from http://www.ijert.org/browse/june-2012-edition?download=297:analysis-of-three-formal-methods-z-b-and-vdm&start=120

[23] J. Schiefer, B. List and R. Bruckner. (2002). A holistic approach for managing requirements of data warehouse systems, 13. Retrieved from http://aisel.aisnet.org/cgi/viewcontent.cgi?article=1372&context=amcis2002

[24] M. Maguire and N. Bevan. (2002). User
requirements analysis. In Usability (pp. 133–148). Springer. Retrieved from http://link.springer.com/chapter/10.1007/978-0-387-35610-5_9

[25] W. F. Friedrich and J. A. van der Poll. (2007). Towards a Methodology to Elicit Tacit Domain Knowledge from Users. Interdisciplinary Journal of Information, Knowledge and Management (IJIKM), Volume 2, pp. 179–193. ISSN (Print): 1555-1229. URL: www.ijikm.org

[26] E. Zimányi. (2006). Requirements Specification and Conceptual Modeling for Spatial Data Warehouses, 4278. https://doi.org/10.1007/11915072

[27] M. Golfarelli, D. Maio and S. Rizzi. (1998). Conceptual design of data warehouses from E/R schemes. Proceedings of the Thirty-First Hawaii International Conference on System Sciences, 7, pp. 334–343. https://doi.org/10.1109/HICSS.1998.649228

[28] J. N. Mazón, J. Trujillo and J. Lechtenbörger. (2007). Reconciling requirement-driven data warehouses with data sources via multidimensional normal forms. Data and Knowledge Engineering, 63(3), pp. 699–725. https://doi.org/10.1016/j.datak.2007.04.005

[29] A. Vaisman and E. Zimányi. (2014). Data Warehouse Systems. https://doi.org/10.1007/978-3-642-54655-6

[30] F. R. S. Paim, A. E. Carvalho and J. B. de Castro. (2002). Towards a Methodology for Requirements Analysis of Data Warehouse Systems, pp. 146–161.

[31] I. Sommerville. (2004). Software Engineering. Pearson.

[32] W. H. Inmon. (2005). Building the Data Warehouse (4th ed.). Indianapolis, Ind.: Wiley.


Using Dense Subgraphs to Optimize Ego-centric Aggregate Queries in Graph


Databases

Ali Ben Ammar


Higher Institute of Computer Science and Management
Kairouan University
Kairouan, Tunisia
ali.benammar@isigk.rnu.tn

ABSTRACT

In this paper, we present an approach to optimize ego-centric aggregate queries in graph databases by precomputing (materializing) some of their results. Ego-centric aggregate queries allow graph nodes, called consumers, to aggregate events from other nodes, called producers. Our contribution consists of discovering the densest subgraph, which represents the tightly coupled nodes. We consider these nodes to be the most active ones, i.e. they have two main features: a high access frequency and a strong correlation between them. We then precompute the results of the ego-centric aggregate queries that are implemented on the active nodes. Our experimentation has shown that, when the graph database is not voluminous, our approach is able to improve the response time of ego-centric aggregate queries. However, for large graph databases, we should adjust the way active nodes are selected in order to improve the management load of ego-centric aggregate queries.

KEYWORDS

Query Optimization; Ego-centric Aggregate Queries; Graph Databases; Dense Subgraphs.

1 INTRODUCTION

Graph database (GDB) systems are particularly used to store highly interconnected data and to query them. They arose during the last decade to satisfy the requirements of applications with graph-like data structures, such as social networks, telecommunication networks and linked webpages. The early surveys on GDBs and their tools are presented in [1] [2] [3] [4] [5] [6]. Recently, there is a trend towards temporal graph databases [7]. A formal definition of a GDB using graph terminology is provided in [8] as follows. A GDB is considered to be a finite edge-labeled graph, i.e. let ∑ be a finite alphabet and V a countably infinite set of node ids; then a GDB over ∑ is a pair G = (N, E), where N is the set of nodes (a finite subset of V), and E ⊆ N × ∑ × N is the set of edges. That is, we view each edge as a triple (n, a, n′), whose interpretation, of course, is an a-labeled edge from n to n′. Therefore, in the rest of the paper, we will use the terms graph and graph database interchangeably.

In parallel to the development of GDB systems and languages, some techniques have been invented to optimize query response time. A survey of the main techniques is presented in [9]. In this paper, we are interested in the optimization of a class of these queries, called ego-centric aggregate queries.

According to [10], in an ego-centric aggregate query the querier corresponds to a node in the graph and is interested in an aggregate over the current state or the recent history of a local neighborhood of that node in the graph. Some examples of ego-centric aggregate queries are presented in [10], such as trend analysis in social networks, where the goal is to find, for each user, the trends (e.g., popular topics of discussion, news items) in his or her local neighborhood. Similarly, in a phone-call network or an analogous communication network, we may be interested in identifying interesting events or anomalies (e.g., higher than normal communication activity among a group of nodes).
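To fix these definitions, consider the following small sketch (Python, our own illustration; the event set and the counting aggregate are hypothetical): a GDB is a finite set of labeled triples, and an ego-centric aggregate summarizes the events around one node:

```python
# A GDB over alphabet Σ as a finite set of labeled edges E ⊆ N × Σ × N.
E = {
    ("u1", "msg", "u2"), ("u3", "msg", "u2"),
    ("u2", "msg", "u1"), ("u4", "like", "u2"),
}

def ego_centric_count(ego, label, edges):
    """Aggregate, for a consumer node `ego`, the `label`-events
    produced by its in-neighborhood (the producers)."""
    return sum(1 for (n, a, m) in edges if m == ego and a == label)

print(ego_centric_count("u2", "msg", E))  # 2 producers sent messages to u2
```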


In [11], the authors use the terms consumer and producer to present ego-centric aggregate queries. According to [11], an ego-centric query allows a graph node, called the consumer, to aggregate events from other nodes, called producers.

In this paper, we present an approach to optimize ego-centric aggregate queries in GDBs by materializing some of their results. The main contribution of our approach is to reduce the cost of assigning materialization decisions to the results of ego-centric aggregate queries. Firstly, we discover the densest subgraph of the underlying graph; dense subgraphs represent the tightly coupled nodes. Then, we materialize all the results of the ego-centric aggregate queries that are implemented on the nodes of the densest subgraph.

The rest of this paper is structured as follows. The next section discusses related works. Section 3 presents our approach: it contains the details of the node classification and the decision assignment, which constitute the two main steps of our approach, as well as the evaluation of the proposed approach. Section 4 is the conclusion.

2 RELATED WORKS

Ego-centric aggregate queries are a special case of aggregate queries, which are widely treated in relational databases and data warehouses [12] [13] [14] [15], in data streams [16], and in sensor networks and distributed databases [17] [18]. The materialization of query results is the main technique used to optimize aggregate queries in such domains. It allows precomputing and storing their results, called views, to avoid computing/recomputing them whenever they are asked for. The topics of materialization are mainly: What data do we materialize? Where do we materialize it? And when do we update materialized data? Aggregated results are selected based on the following criteria: they serve frequent queries, or they are shared by some queries. The works in [12], [13], [14], [15] have proposed approaches to best select materialized views. The approach proposed in [19] optimizes the update load of aggregate data (materialized views). In the context of data streams, [16], [20] and [21] use aggregate data to share work across different queries with different sliding windows.

Aggregate queries over graphs are fundamentally different and have not been widely studied in previous works. [22] has proposed a language for querying and analyzing GDBs by using aggregates and ranking. [10] and [11] have addressed the management of ego-centric aggregate queries in large graphs. Both [11] and [10] have addressed the issue of when events should be transmitted from the producers to the consumer. There are two possible ways: either at query time, or precomputed on the consumer. The former consists of traversing the producers at each read of the consumer; in [10], this is called a pull task and it corresponds to an on-demand update of the aggregate data. The latter consists of precomputing the aggregate query answer at each new write in the producers; in [10], this is called a push task and it corresponds to an online update. The work in [11] proposes to retrieve events from high-rate producers at query time and to materialize, in aggregation nodes, events that come from low-rate producers. The work in [10] proposes a detailed solution that begins by constructing an aggregation overlay graph and then makes a decision for each node of this graph whether to aggregate events on it (push decision) or not (pull decision). The aggregation overlay graph is constructed to encode the computations to be performed when an update or a query is received; its main advantage is that it allows sharing partial aggregates across different ego-centric aggregate queries. The decisions about materializing events on nodes are made based on the cost of the push and pull tasks. As we have seen, what distinguishes [10] from [11] is the answer to the issue of where materialized data is stored; for this reason, the solution of [10] has integrated intermediate aggregation nodes. These two approaches are the most closely related to our work, which consists of deciding whether the result of an ego-centric aggregate query should be materialized (push decision) or not (pull decision). We have not evoked the
issue of where materialized data should be stored; we have simply supposed that materialized results are stored on the ego node, i.e. the consumer. The main contribution of our approach is to reduce the cost of assigning decisions (pull/push) to ego-centric aggregate queries. Instead of continuously measuring the access and update frequencies of nodes in order to assign decisions, we propose to intervene periodically to identify the most active nodes (producers or consumers), which are the nodes of the densest subgraph. Then, we materialize the result of every ego-centric aggregate query that is contained on an identified active node, i.e. the producers of this query will receive the push decision. To the best of our knowledge, there is no approach that uses dense subgraphs as a way to classify and optimize ego-centric aggregate queries.

3 OPTIMIZING EGO-CENTRIC AGGREGATE QUERIES

In our approach, each ego-centric aggregate query is executed in one of the following two ways. The first way consists of querying the inputs from the neighborhood only when the user requests the ego node. In the second way, inputs are precomputed and kept up to date; then, when the user requests it, the ego-centric aggregate query is executed on the precomputed data, which reduces latency. Consequently, our hybrid approach assigns the push decision to the ego-centric aggregate queries whose results should be precomputed, and the pull decision to the rest of the queries. Therefore, there are two steps in our approach:

1. Classify ego-centric aggregate queries by identifying the nodes whose query results should be precomputed. This task is performed periodically, i.e. at the expiration of a time interval with a predetermined duration; and
2. Assign and apply the update decision. This task exploits the result of the previous classification to optimize the server load. It applies pull or push decisions for executing ego-centric aggregate queries, as explained here above.

These two steps follow an iterative process in order to make it possible to adapt the node classes to recent changes in the graph database. In the rest of this section, we develop these two steps.

3.1 Classification of graph nodes

We distinguish two types of graph nodes: (i) the most active nodes; and (ii) the less active ones. The first set contains the nodes that participate, as sources/targets, in most of the graph edges (which represent events in the real world). The second set contains the nodes with low participation in the graph growth. In our approach, the ego-centric aggregate queries to which we assign the push (online) policy are implemented on a part of the first set, since their access/update frequencies will be high. The ego-centric aggregate queries to which we assign the pull (on-demand) policy belong to the second class of nodes, where accesses/updates are less frequent.

To specify the first set of nodes, we have chosen to look for the tightly coupled nodes of the graph in recent time. We suppose that these nodes represent the main interests of users, i.e. topics that capture popular attention and in which a group of nodes has participated as sources or targets. In the literature, sets of tightly coupled nodes are called dense subgraphs. Dense subgraphs may correspond to emerging stories in social media, hot topics of discussion in a forum, etc. We use dense subgraphs to discover the most active nodes because:

- Dense subgraphs group the nodes that are frequently updated/accessed and that capture most of the interactions.
- Dense subgraphs indicate the trends/interests of users in the current and probably in the future time.

For example, in Figure 1, the densest subgraph is composed of vertices {a, b} and the edges between them. These two nodes attract the major part of the graph edges, whereas {c} is less used in the interactions between nodes. So, if vertex a or b contains an ego-centric
aggregate query, then its result will be precomputed (push decision); otherwise, we apply the pull decision for it.

Figure 1. Example of a graph having a densest subgraph ({a, b} tightly connected; c peripheral).

Dense subgraphs may be identified from directed or undirected underlying graphs. In this approach we consider the directed case, since recent applications of GDBs need directed edges, as in social networks, communication networks, email networks, financial transactions, etc. According to [23], when the underlying graph is directed, the problem of identifying dense subgraphs is formulated as follows:

Let G(V, E) be a directed graph, where V is the set of vertices and E is the set of directed edges between vertices of V. To identify the densest subgraph M(Vm, Tm) of G, we search for the subsets S ⊆ V and T ⊆ V such that:
- All the edges from S to T are included in E, i.e. E(S, T) = {e(i,j) ∈ E, vi ∈ S, vj ∈ T};
- The subgraph composed of E(S, T) has the maximum density over all subgraphs, i.e. max{S,T ⊆ V} d(S, T), where d(S, T) is the density. This density for directed graphs was introduced in [24] as d(S, T) = |E(S, T)| / √(|S| · |T|);
- Tm = E(S, T); Vm = S ⋃ T.

In the literature, the algorithms for identifying dense subgraphs work either with or without overlap. The approach without overlap discovers a set of dense subgraphs such that the intersection between each pair of dense subgraphs is empty [23], while the approach with overlap authorizes intersections between dense subgraphs [25]. In this paper, since our aim is limited to identifying the most active vertices, the overlap of subgraphs is not interesting. So, we use the greedy approximation algorithm of [23] to discover dense subgraphs. This algorithm identifies the densest subgraph through several iterations: in each iteration, it removes the minimum-degree vertex, according to a certain rule.

3.2 Assignment of push/pull decisions

In our approach, we intervene periodically to decide whether the result of an ego-centric aggregate query should be precomputed (push decision) or not (pull decision). In other words, every time a pre-specified interval of time (called the intervention period pi) expires, we search for the densest subgraph to classify the nodes used during the last period pi. The vertices of the densest subgraph issued from the classification task are added to a set N that regroups the vertices of the densest subgraphs of the previous periods. In other words, if Mk is the set of vertices of the densest subgraph of the period pk, then N = ⋃ k=0,…,i Mk. The decision for an ego-centric query qz implemented on a vertex vj is made according to the following rules:

- If vj ∈ N, then the decision for qz is push;
- Else, the decision for qz is pull.

N is incrementally constructed because:

- Identifying the densest subgraph of the whole underlying graph is highly complex; in our approach, only the last increment of the graph is used to search for the dense subgraph;
- New active nodes may arise over time, and we should adapt the update policy of the ego-centric aggregate queries they contain.
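A minimal sketch of this periodic procedure (our own illustration in Python; densest_subgraph_vertices stands in for the greedy discovery step of [23] applied to the last increment):

```python
# Periodic classification and decision assignment (illustrative sketch).
N = set()  # union of the densest-subgraph vertices of all past periods

def end_of_period(increment_edges, densest_subgraph_vertices):
    """At the end of period p_i, classify the nodes of the last graph
    increment and accumulate them: N = M_0 ∪ ... ∪ M_i."""
    M_i = densest_subgraph_vertices(increment_edges)
    N.update(M_i)

def decision(v):
    """Push (precompute) the ego-centric query of vertex v iff v ∈ N."""
    return "push" if v in N else "pull"
```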

Figure 2. Example of an underlying graph with increments: (a) state at t0, (b) state at t1, (c) state at t2.

For example, in Figure 2 we have three states of the underlying graph at three different times. The dotted edges and vertices represent the changes (the increment) to the underlying graph from ti to ti+1, where ti+1 = ti + l and l is a specified time interval, i.e. the duration of pi. Table 1 presents the densest subgraphs of the periods p0, p1, p2 corresponding to the three states of Figure 2, and the evolution of the content of the set N across time. The last three rows of the table present the decisions for three ego-centric aggregate queries q1, q2, q3, implemented respectively on the vertices a, c and f.

Table 1. Example of densest subgraphs and decisions.

Period                           | p0     | p1        | p2
Vertices of the densest subgraph | {a, b} | {a, d}    | {b, f}
N                                | {a, b} | {a, b, d} | {a, b, d, f}
Decision for q1                  | push   | push      | push
Decision for q2                  | pull   | pull      | pull
Decision for q3                  | pull   | pull      | push
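The discovery step behind Table 1 relies on the greedy peeling algorithm of [23]. The following sketch is a simplified undirected variant of it (our simplification; the paper adopts the directed (S, T) formulation): it repeatedly removes a minimum-degree vertex and returns the densest intermediate subgraph encountered:

```python
def greedy_densest_subgraph(nodes, edges):
    """Greedy peeling in the style of [23]: repeatedly remove a
    minimum-degree vertex; keep the subgraph of highest density |E|/|V|."""
    nodes = set(nodes)
    edges = {frozenset(e) for e in edges}        # undirected simplification
    best, best_density = set(nodes), len(edges) / max(len(nodes), 1)
    while nodes:
        degree = {v: 0 for v in nodes}
        for e in edges:
            for v in e:
                degree[v] += 1
        v_min = min(nodes, key=degree.get)       # minimum-degree vertex
        nodes.discard(v_min)
        edges = {e for e in edges if v_min not in e}
        if nodes and len(edges) / len(nodes) > best_density:
            best, best_density = set(nodes), len(edges) / len(nodes)
    return best

# A triangle {a, b, c} plus an isolated node d: the triangle is densest.
print(greedy_densest_subgraph({"a", "b", "c", "d"},
                              [("a", "b"), ("a", "c"), ("b", "c")]))
```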
3.3 Experimentation

We evaluate our proposed approach using the following questions:

1. Have the nodes which are considered the most active ones, and whose query results have been precomputed, really optimized the server load and the response time of ego-centric aggregate queries?
2. How relevant is the choice of the duration of the period used to reclassify nodes and begin a new iteration?
3. Does the size of the dataset have an impact on the result of our approach?

In order to answer these questions, we have evaluated our approach against two datasets: the CollegeMsg temporal network [http://snap.stanford.edu/data/CollegeMsg.html] and the Super User temporal network [http://snap.stanford.edu/data/sx-superuser.html]. The first dataset is comprised of private messages sent on an online social network at the University of California, Irvine. Users could search the network for others and then initiate conversations based on profile information. An edge (u, v, t) means that user u sent a private message to user v at time t. The graph of this dataset contains 1899 nodes and 59835 temporal edges; its time span is 193 days. The second dataset is a temporal network of interactions on the stack exchange web site https://superuser.com. There are three different types of interactions, each represented by a directed edge (u, v, t):

- User u answered user v's question at time t;
- User u commented on user v's question at time t;
- User u commented on user v's answer at time t.

This second dataset is comprised of 194085 nodes and more than 1 million edges (1443339 temporal edges); its time span is 2773 days.

In this experimentation, each vertex vi, which corresponds to a user, is considered to have an ego-centric aggregate query that summarizes the reactions of neighbors to the messages (emails, questions, answers, comments) of vi.

In order to answer the first question, we ran our system on the CollegeMsg dataset described above. We measured the update load, which is the time required to update the results of all the ego-centric aggregate queries, and the average query response time. We split the time span into periods; the duration of each period is 7 days, i.e. one week. Table 2 presents the result of this test, in milliseconds, for the three policies of running queries:

- Push decision for all vertices, called the precomputation policy;
- Pull decision for all vertices, called the on-demand policy;
- Hybrid policy according to the principle of our approach.

From Table 2, we can see that our approach optimizes the update load of the precomputation policy by more than 46% and the query response time of the on-demand policy by more than 15%. It is obvious that the on-demand policy produces the minimum update load, since query results are computed only when the query is asked. The
precomputation policy produces the optimal response time, since the results of all queries are precomputed. We conclude from this experiment that our approach gives the best compromise between an acceptable quality of service (QoS) and a low update load, i.e. we optimize the query response time using a low update load.

Table 2. Results of the first experiment.

Policy                                                                   | Precomputation | On demand | Hybrid
Update load (ms)                                                         | 2369445        | 1065968   | 1271694
Average query response time (ms)                                         | 6.08           | 23.95     | 20.21
Rate of optimization of the precomputation policy load by our approach   | –              | –         | 46.32%
Rate of optimization of the on-demand policy response time by our approach | –            | –         | 15.60%

We explain in Table 3 why this first experiment produced good results. The selectivity of the node classification algorithm at each period pi was 1.5%, i.e. only a small fraction of the nodes of the underlying graph are considered active vertices. However, despite this low selectivity, the active vertices participated in about 37% of the edges of the next period pi+1. This indicates that our classification approach selects the vertices that have the highest impact on the access/update load. Also, the precomputation of the results of the ego-centric aggregate queries implemented on the selected vertices was profitable, i.e. compared to their update frequency, the access frequency of these queries is very high. In addition, the other ego-centric aggregate queries, implemented on the passive vertices, have a low impact on the server load, even if their results weren't precomputed.

Table 3. Selectivity and rate of participation.

                                                          | Of all vertices | Of active vertices
Average number of vertices per period                     | 334             | 5
Rate of active vertices w.r.t. the total number of vertices | 5/334 = 1.5%
Average number of edges (events) per period               | 2209            | 806
Rate of participation of active vertices in the edges     | 806/2209 = 36.49%

In order to answer the second question, we executed our system on the CollegeMsg dataset while varying the period duration from 3 to 150 days. We measured how much our approach is capable of optimizing the total load, i.e. the access and update loads, by calculating the optimization rate. The optimization rate is the percentage of decreased/increased time that results from applying our approach, relative to each of the two other policies. For example, the optimization rate with respect to the precomputation policy is calculated as follows:

(total load of the precomputation policy − total load of our approach) / (total load of the precomputation policy)

Figure 3 presents the results of this experiment.

Figure 3. Optimization rates in the CollegeMsg dataset (optimization rate of the precomputation policy and of the on-demand policy, for period durations of 150, 90, 60, 30, 15, 7 and 3 days; the rates range from about −80% to +20%).
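As an arithmetic check of this rate (our own illustration, applying the same formula to the Table 2 measurements):

  (2369445 − 1271694) / 2369445 ≈ 0.463, i.e. about 46.3% for the update load;
  (23.95 − 20.21) / 23.95 ≈ 0.156, i.e. about 15.6% for the query response time.

These agree with the 46.32% and 15.60% optimization rates reported in Table 2, which were presumably computed from the unrounded measurements.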

From Figure 3 we conclude that, regardless of the period duration, our hybrid approach gives a lower load than the on-demand policy, with the best rate obtained for a duration of 60 days (≃ 8%). However, compared with the precomputation policy, there is an optimization of the total load only for short periods (less than 30 days), with the best rate obtained for a duration of 3 days (≃ 15%). For long periods, the selectivity of our classification algorithm is high; for example, for period durations of 90 and 150 days, the selectivity was 27.45% and 30%, respectively. In addition, the refreshment of the list of selected vertices is slow, i.e. it may not follow the changes in user interests. Consequently, some vertices that are expected to be active consume a lot of update time but are never accessed; for example, the participation of the expected active vertices, for period durations of 90 and 150 days, was 59% and 47%, respectively. Furthermore, the possibility of selecting new active vertices to reduce the total load is delayed. This makes the precomputation policy, which precomputes the results of all queries, more profitable than our hybrid approach for long periods. On the other hand, this failure in classifying vertices did not prevent our approach from producing a profit compared to the on-demand policy. The main conclusion of this experiment is that, to improve over the two policies, the period duration should not exceed 30 days.

In order to answer the third question, we executed our system on the Super User dataset described above. We measured the optimization rate of our approach with respect to the two other policies. Figure 4 presents the results of this experiment.

Figure 4. Optimization rates in the Super User dataset (optimization rate of the precomputation policy and of the on-demand policy, for period durations of 90, 60, 30, 7 and 5 days; the rates range from about −10% to +30%).

What we conclude from Figure 4 is that our hybrid approach never produced a profit with respect to the on-demand policy, for all the period durations of the test: the optimization rate with respect to the on-demand policy was around 0%, and in most cases it was negative. The second conclusion is that the precomputation policy was the worst, and we optimized it by nearly 20% in all cases. The main reason that led to these results is the high variety of users and of their interests in the case of a large dataset. In other words, the correlation between users, which is the basis for selecting active vertices and assigning them the push decision, is low in such large datasets. Moreover, even if there is a correlation between a group of users (in our dataset, users may be correlated because they discuss an issue), it does not last in time. As a result, the precomputed results of the ego-centric aggregate queries (push decision) are seldom or never requested; that is why the precomputation policy was costly, i.e. it precomputes all the results for nothing. However, in the first dataset (CollegeMsg), where users know each other and are in permanent contact, the results of our hybrid approach were good.

Our main conclusion from these experiments is that our approach may be used to optimize ego-centric aggregate queries in graph databases with a limited number of users and highly connected vertices across time. However, our approach is not recommended for large graph databases with less correlated nodes across time.

4 CONCLUSION

In this paper, we have proposed an approach to optimize ego-centric aggregate queries in graph databases. Ego-centric aggregate queries allow a graph node, called the consumer, to aggregate events from other nodes, called producers. The most used technique to optimize such queries is the materialization of their results, either in the consumer node or in the producer ones. We have developed a policy that materializes only the results of the ego-centric aggregate queries which are implemented on active nodes. A node is considered active if it is an element of the densest subgraph. For this reason, we begin by discovering the densest subgraph. We have supposed that the densest subgraph regroups the nodes whose
access/update frequency is high and the correlation between them is strong. The results of our experimentation have demonstrated that, in the case of small graphs, our approach produces a low management load compared to the scenarios of precomputing all query results or computing query results at query time. However, for large graphs, computing query results at query time is the best scenario. The failure of our approach in the case of large graphs comes from the level of correlation between the selected active nodes, which did not remain strong. Consequently, access frequencies decreased and the expected optimization was not realized. Therefore, our future work will focus on improving the way of selecting active nodes so that it remains profitable across time.

REFERENCES

[1] R. Angles and C. Gutierrez, "Survey of graph database models," ACM Comput. Surv., Vol. 40, No. 1, pp. 1–39, 2008.

[2] K. N. Satone, "Modern Graph Databases Models," International Journal of Engineering Research and Applications, 2014.

[3] P. T. Wood, "Query languages for graph databases," SIGMOD Rec., Vol. 41, No. 1, pp. 50–60, 2012.

[4] P. Macko, D. W. Margo and M. I. Seltzer, "Performance introspection of graph databases," in 6th Annual International Systems and Storage Conference, Haifa, Israel, June 30 – July 02, 2013.

[5] P. Jadhav and R. Oberoi, "Comparative Analysis of Graph Database Models using Classification and Clustering by using Weka Tool," International Journal of Advanced Research in Computer Science and Software Engineering, Volume 5, Issue 2, pp. 438–445, February 2015.

[6] H. R. Vyawahare and P. P. Karde, "An Overview on Graph Database Model," International Journal of Innovative Research in Computer and Communication Engineering, Vol. 3, Issue 8, August 2015.

[7] A. Campos, J. Mozzino and A. A. Vaisman, "Towards Temporal Graph Databases," CoRR abs/1604.08568, 2016.

[8] J. Reutter, Graph Patterns: Structure, Query Answering and Applications in Schema Mappings and Formal Language Theory, PhD Dissertation, University of Edinburgh, 2013.

[9] A. Ben Ammar, "Query Optimization Techniques In Graph Databases," CoRR abs/1609.01893, 2016.

[10] J. Mondal and A. Deshpande, "EAGr: Supporting Continuous Ego-centric Aggregate Queries over Large Dynamic Graphs," CoRR abs/1404.6570, 2014.

[11] A. Silberstein, J. Terrace, B. F. Cooper and R. Ramakrishnan, "Feeding Frenzy: Selectively Materializing Users' Event Feeds," in SIGMOD, 2010.

[12] H. Gupta and I. S. Mumick, "Selection of Views to Materialize in a Data Warehouse," IEEE Trans. on Knowl. and Data Eng., 17(1), pp. 24–43, Jan. 2005.

[13] I. Mami, Z. Bellahsene and R. Coletta, "A Declarative Approach to View Selection Modeling," T. Large-Scale Data- and Knowledge-Centered Systems, pp. 115–145, 2013.

[14] P. P. Karde and V. M. Thakare, "Selection & Maintenance of Materialized View and Its Application for Fast Query Processing: A Survey," International Journal of Computer Science & Engineering Survey (IJCSES), Vol. 1, No. 2, November 2010.

[15] Y. D. Choudhari and S. K. Shrivastava, "Cluster Based Approach for Selection of Materialized Views," International Journal of Advanced Research in Computer Science and Software Engineering, Volume 2, Issue 7, July 2012.

[16] S. Krishnamurthy, C. Wu and M. J. Franklin, "On-the-fly sharing for streamed aggregation," in SIGMOD, 2006.

[17] S. Madden, M. J. Franklin, J. M. Hellerstein and W. Hong, "TAG: a Tiny Aggregation service for Ad-Hoc sensor networks," in OSDI, 2002.

[18] A. Silberstein and J. Yang, "Many-to-Many Aggregation for Sensor Networks," in ICDE, 2007.

[19] X. Zhang, L. Yang and D. Wang, "Incremental view maintenance based on data source compensation in data warehouses," International Conference on Computer Application and System Modeling (ICCASM), pp. 287–291, 22–24 Oct. 2010.

[20] S. Wang, E. A. Rundensteiner, S. Ganguly and S. Bhatnagar, "State-Slice: New Paradigm of Multi-query Optimization of Window-based Stream Queries," in VLDB, 2006.

[21] L. Al Moakar, "Class-Based Continuous Query Scheduling in Data Stream Management Systems," PhD thesis, Univ. of Pittsburgh, 2013.

[22] A. Dries and S. Nijssen, "Analyzing graph databases by aggregate queries," MLG@KDD, pp. 37–45, 2010.


[23] M. Charikar, "Greedy approximation algorithms for finding dense components in a graph," APPROX, pp. 84–95, 2000.

[24] R. Kannan and V. Vinay, "Analyzing the Structure of Large Graphs," Manuscript, August 2009.

[25] O. D. Balalau, F. Bonchi, T. H. Hubert Chan, F. Gullo and M. Sozio, "Finding Subgraphs with Maximum Total Density and Limited Overlap," WSDM, pp. 379–388, 2015.


A Secure Method for the Global Medical Information in Cloud Storage based on the
Encryption and Data Embedding
1 Soheil Nezaket, Islamic Azad University, IAU, UAE branch, Dubai, UAE
2 Dr. Mohammad V. Malakooti, Islamic Azad University, IAU, UAE branch, Dubai, UAE
3 Dr. Navid Hashemitaba, Islamic Azad University, IAU, Tehran Central Branch, Tehran, Iran
1 soheilnezakat@yahoo.com, 2malakooti@iau.ae 3 nhtaba@yahoo.com

Abstract:

Cloud computing has become the most popular platform and environment for both data computing and information storage. Big corporations, businesses, and enterprise companies have moved their infrastructures to the cloud for its cheap storage and processing facilities. Although most cloud service providers have highly secure systems for information storage and computing, malicious cloud users or other attackers could access our data on the cloud during transmission or while the information is stored. We suggest a technique that can be used to encrypt information prior to its transmission over the cloud system. Our technique targets global medical information in cloud storage and uses orthogonal transform algorithms for the encryption and data embedding processes. It is fast, reliable, robust, and reversible, and can be used to increase the security and integrity of the information stored on cloud systems.

Keywords: Cloud, Computing, Encryption, Decryption, Hiding, Data Security, Scramble.

Research Objective:

Picture Archiving and Communication Systems (PACS) as well as Hospital Information Systems (HIS) are among the most sophisticated systems used in healthcare, generally in hospitals and health systems. PACS was traditionally associated with the radiology department, where radiographic images were captured for the diagnosis and treatment of patients with broken legs, arms, hands, noses, heads, and so on. In recent years, the invention of advanced technology used in hospital and healthcare departments has helped to incorporate into PACS the images obtained from other departments, such as cardiology, pathology, oncology, and dermatology. Thus, a massive amount of images captured in hospitals and healthcare centers needs to be stored and archived as patient medical files. Nowadays PACS has evolved to capture images and videos with high resolution from hospitals and healthcare centers. The captured information must be compressed as well as encrypted before it is transferred over the internet to highly secured cloud storage facilities, for easy access, low cost, and high security.

HIS is a comprehensive, integrated information system designed to manage all aspects of a hospital's operation, such as medical, financial, administrative and legal issues. It provides a common source of information about the patient's health information and illness history. The system should have the capability to store data in a secure storage facility that can be accessed easily, where information can be retrieved by authorized users through some type of authentication and classification that defines each user's access level. HIS is integrated with most of the medical devices and controls the information processing and handling of huge amounts of data, which should be compressed and secured prior to transmission over the networks.

We have proposed a new method which uses the patient information and applies image segmentation and the DCT, as well as scrambling techniques on the DCT coefficients. We also add IRIS information to increase the level of security and process it through cellular automata to reduce the size of the IRIS information, where data accuracy and confidentiality are very important. Our innovative algorithm is based on information encryption using orthogonal transform matrices for the encryption process, along with a scrambling algorithm, as well as biometric features and IRIS information processed through cellular automata, to obtain a highly secure and reversible algorithm for information compression, encryption, scrambling and IRIS coding.

Introduction:

Cloud computing is an infrastructure in which several computer networks, servers, storage units, software, hardware, services, applications and resources are interconnected to form an information pool that can be accessed over the internet with minimal management effort. Individual users, corporations, and even enterprises can perform data storage and computing on their own private cloud, or using third-party cloud facilities, rapidly, efficiently, with high speed and acceptable security. The most important features of cloud computing are resource sharing with easy access, high reliability, minimum cost, high speed, and acceptable security. Cloud storage is a hot topic nowadays, as data storage capacity requirements are increasing manifold every year, and it has become a reality that all data centers and organizations should consider.

Cloud computing and storage are fast growing technologies which have received high attention from scientists, researchers, enterprises, and industrial communities due to their low cost, easy access, high speed, and reliability, which bring data security and portability. Cloud storage empowers real-time data access over the network: each user, with their pre-specified privileges, can have real-time and on-demand access to the online data pool, which is shared among all cloud users but with different data types, accessibility and performance. Cloud storage emerges as a central data storage idea whose main goal is to avoid data replication and duplication, to provide quick, secure and suitable data exchange, and also to prevent massive paper displacement. Cloud storage speeds up all the basic cloud features: collaboration, agility, scalability, availability. In some cases, cloud storage brings centralized data management in parallel with data security. The most important issue that needs to be considered regarding cloud computing is the uncertainty that exists in its security while the information is transmitted through the cloud facilities or during the computation or storage process [1]. The security risk of storing vital information on public cloud storage or any online facility is high, because the users have no control over the cloud storage, and many people can simultaneously access the cloud and breach its security.

Data movement is a big problem for companies. To mitigate this issue, cloud storage providers assure data owners that they can continue to monitor their data with the same security and privacy [2]. Although cloud computing has solved the storage problem of corporations and enterprises, which have saved millions of dollars using cloud storage, the security issues are still under consideration. Security threats are major challenges for cloud users, and many businesses are not comfortable using cloud storage facilities, preferring their own infrastructure and storage systems. We have proposed a new method that can be used to scramble and encrypt the information prior to the transmission process. Once the information is encrypted, a hacker can no longer retrieve the original data during transmission or while the data is stored on the cloud network.

Cloud storage brings scalability, economies, and flexibility all together, but security in cloud storage still has its own proportional worries. Cloud storage will solve security problems that occurred in cloud computing: problems related to compliance, privacy and legal matters [3]. Cloud storage covers most of the security areas considered for shared data. Studies have shown that most cloud storage providers are aware of the extreme importance of data integrity and security, and they have launched new firewalls and software for their network security as well as for data integrity [4].

Study of the Cryptography:

The cryptography system brings the latest and most modern security protocols. Cryptography can be used in a correct way and in a wrong way, and it may also end up protecting the wrong things. Transferring messages in a secure way is the goal of cryptography. Cryptography has been used to prevent insecurity in such a way that information is encrypted by a predefined algorithm, and those who have knowledge of the algorithm as well as the secret keys can decrypt the information and retrieve the original data from the encrypted message. Once the message is encrypted, its meaning is concealed, and without the key no one can find out what it is; the meaning will be revealed only after the correct recipient accesses it. The encrypted message is obtained through some reversible algorithm, in which the original data can be recovered by the inverse operation through the decryption algorithm. If the algorithm is not reversible, then it cannot be used for the encryption and decryption processes.

The accrued inconsistency between the security and cryptography communities is now over 20 years old. People who work and study in security do not find cryptography tools very useful.
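To illustrate the reversibility requirement in code (a toy sketch of ours, not the paper's algorithm), the following keyed scramble permutes symbol positions using a pseudo-random permutation derived from a shared secret; the receiver regenerates the same permutation from the key and inverts it:

```python
import random

def key_permutation(key, n):
    """Derive a pseudo-random permutation of n positions from the key."""
    perm = list(range(n))
    random.Random(key).shuffle(perm)
    return perm

def scramble(data, key):
    perm = key_permutation(key, len(data))
    return [data[i] for i in perm]

def unscramble(scrambled, key):
    perm = key_permutation(key, len(scrambled))
    original = [None] * len(scrambled)
    for out_pos, in_pos in enumerate(perm):
        original[in_pos] = scrambled[out_pos]   # invert the permutation
    return original

message = list(b"patient record #42")
cipher = scramble(message, key="shared-secret")
assert unscramble(cipher, key="shared-secret") == message  # reversible
```

A permutation alone hides ordering but not symbol frequencies, so in practice it would be combined with transform-domain encryption such as the DCT-based scheme described earlier.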

hiding, or information embedding, in which vital data, such as social security or credit card numbers, are hidden inside the coded information rather than the original message. An optional transform such as the DCT converts the original message into a coded message; once the coded message is obtained, the vital information is inserted into its mid-frequency area. The process is based on a mathematical algorithm that can retrieve the data from the decoded information in a reversible way[5].
The mathematical function used for encryption is called a cryptographic algorithm. The method works with a key, a record that can be a number or a phrase; keys are generally very large numbers. Keys can be stored in encoded form, for example in two files on a hard disk using PGP.

Review on biometrics:
Nowadays the need to identify the users of services has grown into a significant issue, not only for controlling access to a system or service, but also for determining who holds which rights. Biometrics is currently being applied all over the world in various ways. These systems are generally computer-based solutions in which the validation procedure runs on servers or workstations. Biometrics relies on two major categories, physical and behavioral, as described below.

Physical Modalities:
Biometric systems for human identification are based on physical elements such as face shape, fingerprints, retinal scans, and iris scans. DNA information can also be used in high-security areas such as data centers, military sites, or aerospace launch control. We have not used DNA in our research, due to the lack of available information, and rely only on face shape, fingerprints, and the eyes (retinal and iris scans) as the most important elements of the physical modalities.

Face Shape:
Face recognition technology is an advanced technique that measures and matches the unique characteristics of human biometric features for identification and authentication. A digital camera connected to face recognition software captures the detected image and extracts its features, which are then matched against the features stored in the database. In addition, extra features can be obtained by measuring the distance between the eyes and the mouth, as well as the distance between the nose and the mouth. The biometric features collected from the human face can serve a wide range of potential applications.

Fingerprint:
The fingerprint is made up of a pattern of ridges, valleys and furrows on the surface of the fingertip. Fingerprints have been used by law enforcement officers in criminal investigations, to identify people, for more than 100 years. The finger pattern is created in the first stage of fetal growth in the uterus and is unique to each human. Fingerprint information can be obtained using a fingerprint reader or scanner, a biometric device that identifies a person based on the acquisition and recognition of those unique patterns. Fingerprint scanners are the most popular type of biometric security and are used with a variety of systems on the market for general and mass-market usage.

Eyes:
In spite of the reduced size of this organ, it provides two reliable modalities: the retina and the iris.

3.1 Retina Scans
The retina is one of the modalities that provides better performance results; however, the technique is not well accepted because of the invasiveness to the eye during the acquisition process.

3.2 Iris Scans:
Irises are among the most important biometric features that can be used for authentication. The patterns in our irises are unique and can hardly be replicated, so iris authentication is safe and secure. Iris authentication has mostly been used in immigration, international policing, airport security, and criminal justice systems. Iris scanners provide high-resolution images that can identify a person with high reliability and accuracy inside airports, highly secured data centers, and even criminal justice systems. The development of high-resolution cameras and scanners, together with fast and robust software, makes it possible to retrieve high-resolution iris features from a captured image and quickly compare them with the existing iris features in a database. The high-resolution
iris scanner, along with the underlying feature-extraction software, has made this modality cost-effective for commercial applications.

Behavioral modalities:
Behavioral modalities are based on data derived from an action performed by the user.

1- Voice recognition:
Voice recognition is a combination of a physical and a behavioral characteristic. Voice is based on two factors: first, on language and the manner of speaking, which shapes the mouth disposition for modulation, and secondly, on physical traits such as the vocal cords or the mouth itself. Voice recognition is the identification of individuals from the characteristics of their voices, often referred to as voice biometrics. Voice characteristics can be used for both authentication and identification. The features of the voice can be obtained by applying the Discrete Fourier Transform and other feature extraction techniques, and can be saved for speaker identification as well as voice recognition. The acoustic features obtained from voice analysis reflect both anatomy (the size and shape of the throat and mouth) and learned behavioral patterns.

2- Signature:
Signature systems observe only the result of the action, i.e. the signature itself; they do not require the signature to be made at the time of user identification, and can thus be used in forensic biometrics. Signature recognition is a type of behavioral biometric in which users write their signatures on paper or on a digitizing tablet. A signature written on paper can be digitized using a scanner or camera for offline recognition. In contrast, when users write their signature on a digitizing tablet, online recognition can analyse the signature through features such as pressure, the spatial coordinates x and y, azimuth, inclination, and pen position. The most popular signature recognition techniques are dynamic time warping, hidden Markov models, and vector quantization.

3- Gait:
Gait is a person's manner of walking: the style of stepping, or the locomotion achieved through the movement of the human limbs. The variety of gait patterns is characterized by differences in limb movement patterns, and gait can be recognized by overall velocity, forces, kinetic and potential energy cycles, and changes in the contact with the surface. Although this trait is not very distinctive among users, it provides an extra advantage: it can be measured at a distance, avoiding contact with the user. This modality is therefore interesting for surveillance applications.

Security:
High security is needed when dealing with biometrics, because biometric features are bound to an individual's own traits and data. This security must be provided in two phases: the first phase of protection is in storage, where the features are kept, and the second phase is where the information exchange happens.

Global Medical Cloud Storage
Using our proposed image embedding algorithm, one can easily store a patient's personal information at a specific location inside an image. Several data embedding techniques have been proposed by different authors; the one closest to our proposed algorithm hides the secret information in the image data during the encoding process, so that the encoded image is delivered to the decoder together with the embedded secret information[6].
We propose a new approach that can hide data in general images, including personal information, medical information and any type of data that can be converted to numbers (e.g. ASCII codes). In contrast to traditional data hiding techniques, the embedding process of lossless data hiding methods must be invertible, so that the user can completely restore the original image after extracting the embedded secret information[7]. Our proposed model is based on a novel lossless, secure data embedding algorithm in which the vital information is embedded into the personal or radiology image while preserving the quality of the cover image and maintaining the security of the embedded data[8]. In many embedding algorithms, a huge amount of data is embedded into the cover image with high security and high resolution, while the extraction of the embedded data is required to be lossless and robust[9]. We have applied the Hilbert curve, as well as encryption with the iris code, to the original data, and hide the data inside the mid-frequency areas of the DCT coefficients. The general equation
for a 2D (N by M) image DCT is defined by the following equation:

f(u,v) = \left(\frac{2}{N}\right)^{1/2} \left(\frac{2}{M}\right)^{1/2} \sum_{i=0}^{N-1} \sum_{j=0}^{M-1} A(i)\,A(j)\, \cos\!\left[\frac{\pi u}{2N}(2i+1)\right] \cos\!\left[\frac{\pi v}{2M}(2j+1)\right] f(i,j)   (EQ. 1-1)

where f(i,j) is the sampled image point, f(u,v) are the sampled DCT coefficients, and N and M are the image rows and columns, respectively.
GMCS characteristics:
The GMCS (Global Medical Cloud Storage) algorithm consists of two major security levels that hide the information of many patients inside their images (personal and radiology). The first security level is applied to the data itself and contains multiple sub-levels (conversion, encryption, scrambling):
At the beginning, all the data are converted to ASCII codes.
The data are then scrambled (Peano-Hilbert).
The last stage is encryption.
The second security level is at the image processing stage:
The image is transformed by the Discrete Cosine Transform formula.
The image pixels are merged with the data.
Archiving is the last stage.

Once the personal information or the patient's vital data has been obtained from the client file, it is embedded inside the image file using our highly secure algorithm without losing a single bit of the information. In this paper we focus on embedding the information into the LSBs and modifying the DCT values. There are three main concerns, clearly shown in Figure 1:
How can we select the DCT coefficients in which to embed the data?
How can we embed the data in each block of the image by using Hilbert space-filling-curve scrambling and encryption?
In particular, how can we embed the data into the image in a secure manner, and what technique do we use to enhance the efficiency?

Figure 1 (block diagram):
Image path: get the patient's image (personal or medical), resize it to 320*248 pixels if needed, convert it to a 2D array of blue pixels, apply the DCT on each 8*8 block of the blue part, and keep the required digits in the decimal part.
Data path: gather the patient data (medical or personal) into a 2D array, convert the data to 2D ASCII codes, and scramble the data with the Peano-Hilbert curve.
Key path: generate a 512-digit IRIS code from an eye scan, reduce the 512-digit code using MD-5 or cellular automata, and encrypt the data with the generated code.
Merging: add the converted, scrambled and encrypted data to the end of the transformed values, then encrypt all data once again for double security.
Figure 1 - Block Diagram of DCT along with scrambling on the Patient's Information, and IRIS code

A Hilbert curve (also known as a Hilbert space-filling curve) is a continuous fractal space-filling curve first described by the German mathematician David Hilbert in 1891. The Hilbert curve (Figure 2) appears to have useful characteristics of the cells after 4 subdivision steps[10].
Hilbert rules:
L -> +RF-LFL-FR+
R -> -LF+RFR+FL-
(-) Turn right
(+) Turn left
L_{n+1} = +R_n F - L_n F L_n - F R_n +
R_{n+1} = -L_n F + R_n F R_n + F L_n -

Figure 2 - Peano-Hilbert Scrambling
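As a minimal sketch of how the scan order can be generated from these rules, the Python below expands the L-system and walks it with a turtle; the start position and heading, and the mapping of "+"/"-" to left/right turns, are assumptions that match the convention above, not a specification from the paper.

RULES = {"L": "+RF-LFL-FR+", "R": "-LF+RFR+FL-"}

def expand(order):
    # Rewrite the axiom "L" 'order' times using the rules above.
    s = "L"
    for _ in range(order):
        s = "".join(RULES.get(c, c) for c in s)
    return s

def hilbert_order(order):
    """Return the (x, y) cells of a 2^order x 2^order grid in visiting order."""
    x, y = 0, 0
    dx, dy = 1, 0                      # assumed start: heading right from (0, 0)
    cells = [(x, y)]
    for c in expand(order):
        if c == "+":                   # turn left
            dx, dy = -dy, dx
        elif c == "-":                 # turn right
            dx, dy = dy, -dx
        elif c == "F":                 # move one cell forward
            x, y = x + dx, y + dy
            cells.append((x, y))
    return cells                       # "L" and "R" are only rewriting symbols

order3 = hilbert_order(3)              # the paper's 3rd round: an 8x8 matrix
assert sorted(order3) == [(x, y) for x in range(8) for y in range(8)]

The final assertion checks that the 3rd-round curve visits every cell of the 8x8 grid exactly once, which is what makes it usable as a scrambling order for the 8x8 data matrix.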
Authentication and Encryption key:
Encryption uses a 32-digit code that is XORed with the scrambled data matrix, but the encryption requires one more input. The iris code is this third type of input: it is neither image nor text data, and it can simply be fetched from the person. Using an iris scanner, a single, unique 512-digit code is generated. As the 512-digit code is large, there are a couple of choices for compressing it; depending on the advantages and disadvantages of the following methods, we can choose either MD-5 or reversible cellular automata.

MD-5:
MD5 is a widely used cryptographic hash function. It accepts a message as input and generates a fixed-length output, generally shorter than the input message; the output is called a hash value, a fingerprint or a message digest. In our algorithm, MD-5 generates a 32-digit code (the 512 digits are reduced to 32). About 9*10^31 unique codes can be generated with the MD-5 algorithm.

Cellular Automata:
The general use of cellular automata is to expand a rule, but in our model we utilize cellular automata to reduce the size of the 512-digit iris code. The 512-digit iris code has 512*4 = 2048 bits; the goal of using cellular automata is to reduce these 2048 bits to 128 bits, which is a 32-digit code.

Functional GMCS:
The process of global medical cloud storage contains the following steps.

Image processing
First we divide the image into 3 layers: red, green, and blue. The blue layer is selected to be transformed by the DCT. We apply the DCT transform by multiplying each 8X8 block, as shown below; the values of the transformed image are real numbers with 5 to 8 decimal digits.
DCT Matrix:
0.3536  0.4904  0.4619  0.4157  0.3536  0.2778  0.1913  0.0975
0.3536  0.4157  0.1913 -0.0975 -0.3536 -0.4904 -0.4619 -0.2778
0.3536  0.2778 -0.1913 -0.4904 -0.3536  0.0975  0.4619  0.4157
0.3536  0.0975 -0.4619 -0.2778  0.3536  0.4157 -0.1913 -0.4904
0.3536 -0.0975 -0.4619  0.2778  0.3536 -0.4157 -0.1913  0.4904
0.3536 -0.2778 -0.1913  0.4904 -0.3536 -0.0975  0.4619 -0.4157
0.3536 -0.4157  0.1913  0.0975 -0.3536  0.4904 -0.4619  0.2778
0.3536 -0.4904  0.4619 -0.4157  0.3536 -0.2778  0.1913 -0.0975
Table 1: DCT Coefficients
Image Block:
255 255 255 255 255 255 255 255
255 255 255 255 255 255 255 255
255 255 255 255 255 255 255 255
255 255 255 255 255 255 255 255
255 255 255 255 255 255 255 255
255 255 255 255 255 255 255 255
255 255 255 255 255 255 255 255
255 255 255 255 255 255 255 255

As shown below, the values of the transformed image are real numbers containing 5 to 8 decimal digits.

721.24891 721.24891 721.24891 721.24891 721.24891 721.24891 721.24891 721.24891
57.0000   57.0000   57.0000   57.0000   57.0000   57.0000   57.0000   57.0000
56.9999   56.9999   56.9999   56.9999   56.9999   56.9999   56.9999   56.9999
57.0000   57.0000   57.0000   57.0000   57.0000   57.0000   57.0000   57.0000
56.9999   56.9999   56.9999   56.9999   56.9999   56.9999   56.9999   56.9999
57.0000   57.0000   57.0000   57.0000   57.0000   57.0000   57.0000   57.0000
56.9999   56.9999   56.9999   56.9999   56.9999   56.9999   56.9999   56.9999
57.0000   57.0000   57.0000   57.0000   57.0000   57.0000   57.0000   57.0000
Table 2: The Transformed Image Values
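As a minimal numpy sketch of the matrix form of this transform (assuming the standard orthonormal 8-point DCT-II basis, whose transpose is Table 1), the following builds the coefficient matrix C and applies C * B * C^T to an 8x8 block; it illustrates EQ. 1-1 rather than reproducing the paper's exact implementation.

import numpy as np

N = 8
# C[k, n] = a(k) * sqrt(2/N) * cos(pi * (2n + 1) * k / (2N)), a(0) = 1/sqrt(2)
k = np.arange(N).reshape(-1, 1)
n = np.arange(N).reshape(1, -1)
C = np.sqrt(2.0 / N) * np.cos(np.pi * (2 * n + 1) * k / (2 * N))
C[0, :] /= np.sqrt(2.0)            # first basis row scaled: 0.3536 = 1/sqrt(8)

block = np.full((N, N), 255.0)     # the all-255 test block shown above
coeffs = C @ block @ C.T           # 2D DCT: transform rows, then columns
restored = C.T @ coeffs @ C        # inverse transform (C is orthogonal)

assert np.allclose(restored, block)
print(np.round(C.T, 4))            # the transpose of C matches Table 1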
Data Processing
In the second phase the text data are processed. This can be any kind of data related to the patient: personal information and medical records such as lab results and radiology reports, which are part of the Electronic Health Record (EHR). The data are transferred into a 2D matrix in ASCII code format. The matrix dimension depends on the Hilbert-curve round we choose; here we have used Hilbert part L4 (3rd round), so the matrix is 8X8.

S O H E I L S H
A H R Y A R N E
Z A K A T 2 2 0
2 1 9 9 0 0 0 9
8 9 1 2 1 2 3 4
5 6 7 T E H R A
N T E H R A N S
H E M I R A N I
Table 3: Patient's Information
83 79 72 69 73 76 83 72
65 72 82 89 65 82 78 69
90 65 75 65 84 50 50 48
50 49 57 57 48 48 48 57
56 57 49 50 49 50 51 52
53 54 55 84 69 72 82 65
78 84 69 72 82 65 78 83
72 69 77 73 82 65 78 73
Table 4: Patient's Information Converted to ASCII
After transferring the data into the matrix and converting the characters to ASCII codes, the scrambling takes place as part of the security.

83 79 72 65 90 520 49 65
75 57 57 65 89 82 72 69
73 65 82 76 83 72 69 78
50 48 57 48 48 50 84 48
49 52 53 50 51 52 53 53
78 83 56 55 54 65 82 56
55 54 69 72 53 50 49 52
53 58 56 53 78 84 56 55
Table 5: Scrambled ASCII code of Patient's Information
Before merging, we have to encrypt our scrambled data. As mentioned earlier, the encryption key is the 512-digit iris code. We do not store the key, because the key is always carried by the patient (the iris). To reduce the size of the iris code, we can use reversible cellular automata or the MD-5 hash algorithm; with either of them, the 512-digit iris code is reduced to 32 digits. With this 32-digit code we encrypt the data matrix by XORing each ASCII code with a digit of the code.

Merging
The merging step is the most critical step in the Global Medical Cloud Storage: if the matrices are not merged exactly the way they must be, they cannot be unmerged later when the file is retrieved. Here we have two matrices: one is the converted, transformed blue pixel layer with real values, and the other is the data, which has been converted, scrambled and encrypted with the iris code. We merge the encrypted ASCII codes into the decimal part. The picture accuracy discussed earlier is the decision we make here about the values of the transformed pixels: in this paper we keep 5 digits in the decimal part, whereas more digits can be kept to increase the quality of the recovered picture. We add the values of the second matrix at the end of the diagonal values of the first matrix.
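As an illustrative Python sketch of the iris-code reduction and XOR encryption described above: hashlib.md5 yields a 32-hex-digit digest, and each ASCII value is XORed with one key digit. The iris code string is a made-up placeholder, and the cyclic pairing of values with key digits is an assumption, since the paper does not spell out the pairing.

import hashlib
from itertools import cycle

def reduce_iris_code(iris_code_512):
    """Reduce the 512-digit iris code to a 32-digit key (the MD-5 option)."""
    return hashlib.md5(iris_code_512.encode("ascii")).hexdigest()  # 32 hex digits

def xor_encrypt(ascii_values, key32):
    """XOR each ASCII code with successive digits of the 32-digit key."""
    return [v ^ int(d, 16) for v, d in zip(ascii_values, cycle(key32))]

key = reduce_iris_code("1234567890" * 51 + "12")   # hypothetical 512-digit scan
row = [83, 79, 72, 69, 73, 76, 83, 72]             # "SOHEILSH" from Table 4
enc = xor_encrypt(row, key)
dec = xor_encrypt(enc, key)                        # XOR is its own inverse
assert dec == row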
721.24891 721.24891 721.24891 721.24891 721.24891 721.24891 721.24891 721.24891083
57.0000 57.0000 57.0000 57.0000 57.0000 57.0000 57.0000079 57.0000
56.9999 56.9999 56.9999 56.9999 56.9999 56.9999072 56.9999 56.9999
57.0000 57.0000 57.0000 57.0000 57.0000065 57.0000 57.0000 57.0000
56.9999 56.9999 56.9999 56.9999090 56.9999 56.9999 56.9999 56.9999
57.0000 57.0000 57.0000050 57.0000 57.0000 57.0000 57.0000 57.0000
56.9999 56.9999049 56.9999 56.9999 56.9999 56.9999 56.9999 56.9999
57.0000065 57.0000 57.0000 57.0000 57.0000 57.0000 57.0000 57.0000
Table 6 - The Coded Patient's Information after merging the IRIS coding and DCT Scrambling Values
Double compression
The matrix, which has the size of the picture, 320X248, is saved in a text file. The text file is usually bigger than the image, so we compress it using the Microsoft Visual Studio archiving method (*.gz). The format includes a cyclic redundancy check value for detecting data corruption. Compressed GZipStream objects written to a file with an extension of .gz can be decompressed using many common compression tools; however, this class does not inherently provide functionality for adding files to or extracting files from zip archives [Microsoft].
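A Python analogue of this archiving step is sketched below: the paper uses .NET's GZipStream, but Python's gzip module produces the same .gz format, CRC-32 check value included. The file name and the matrix text are illustrative.

import gzip

matrix_text = "\n".join(" ".join(f"{v:.8f}" for v in row)
                        for row in [[721.24891048] * 8] * 8)

with gzip.open("merged_matrix.txt.gz", "wt", encoding="ascii") as f:
    f.write(matrix_text)                     # compress the saved text file

with gzip.open("merged_matrix.txt.gz", "rt", encoding="ascii") as f:
    assert f.read() == matrix_text           # lossless round trip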
Data retrieving method in GMCS
A reversible data hiding technique enables the exact recovery of the original contents upon extraction of the embedded data[9]. The retrieving method is critical, because a minor mistake can ruin the retrieving process. The retrieving process steps exactly backward through the data embedding process. For example, if the diagonal value is 721.24891 and the encrypted ASCII code is 48, the merged value will be 721.24891048. Each row of the data matrix is merged with the diagonal values of one block; i.e., if there are two 8X8 data matrices, then 16 8X8 image blocks are needed, because there are 16 rows and each row must be merged with the diagonal values of one block. The entire process of retrieving data is illustrated in Figure 3.
Figure 3 (block diagram; stages include: archived file, extract the values, un-merge the data, transfer the derived values, apply IDCT, create and save the image, compress the iris code using MD5 or cellular automata, decrypt with the iris code, descramble the data, remove the padding characters, extract the vital values, present the data).
Figure 3 - Data Retrieving Block Diagram
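The merge and un-merge steps can be sketched in Python as follows, using the retrieval example above (721.24891 + code 48 -> 721.24891048). The fixed 3-digit code field and the 5 kept decimal digits are assumptions consistent with that example; floating-point precision makes this a sketch rather than a production implementation.

def merge(coeff, code):
    base = f"{coeff:.5f}"            # keep exactly 5 decimal digits
    return float(base + f"{code:03d}")

def unmerge(merged):
    text = f"{merged:.8f}"           # 5 kept digits plus 3 code digits
    return float(text[:-3]), int(text[-3:])

merged = merge(721.24891, 48)
coeff, code = unmerge(merged)
assert (coeff, code) == (721.24891, 48)
print(merged)                        # 721.24891048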
Conclusions and future works:
Our algorithm is based on information encryption using orthogonal transform matrices for the encryption process, together with a scrambling algorithm, biometric features, and the IRIS information processed through cellular automata, to obtain a highly secure and reversible algorithm for information compression, encryption, scrambling and IRIS
coding. In this paper we proposed new data encryption and compression methods, along with new data embedding algorithms that have not been used before in PACS and HIS. We have examined the feasibility of encryption and data embedding in medical cloud storage, which can provide data portability, compression, and security. We also identified several factors concerning security and compression in the cloud storage around the Hospital Information System. We have shown that our model is robust and secure and that the predetermined goals and objectives have been achieved. Our main goals are the following:
Being able to store all the medical records independently but in an integrated way.
Making the health records accessible from all over the world, without any data loss.
Making the Electronic Health Record (EHR) highly confidential by applying data compression, encryption, and biometric feature extraction along with data hiding. By using biometric information with a 512-digit iris code, we can achieve the highest level of security even if the patient is unconscious.
This study has opened a wide range of research on the electronic health record with security, encryption, and compression, while maintaining portability and accessibility. On deeper reflection, the Global Medical Cloud Storage could be the best worldwide solution, but the model needs to be fitted to different regional requirements, because certain rules and regulations restrict some countries from exchanging patients' information and health records. Health integration and record exchange with a worldwide EHR solution is the next research step, which will empower the GMCS worldwide.
References:
1. S. Ajoudanian, M.R.A., A Novel Data Security Model for Cloud Computing. IACSIT International Journal of Engineering and Technology, 2012, 4.
2. A. Ross, A.K.J., Human Recognition Using Biometrics. Annals of Telecommunications, Feb 2007, 62: p. 11-35.
3. A. Bessani, M.C., B. Quaresma, F. André, Paulo Sousa, DEPSKY: Dependable and Secure Storage in a Cloud-of-Clouds.
4. Moritz Borgmann, T.H., Michael Herfert, Thomas Kunz, Marcel Richter, U.V., Sven Vowe, On the Security of Cloud Storage Services. Fraunhofer Institute for Secure Information Technology, SIT, 2012.
5. Ohyama, S., Niimi, M., Yamawaki, K., and Noda, H., Lossless Data Hiding Using Bit-Depth Embedding for JPEG2000 Compressed Bit-Stream. 2008: p. 151-154.
6. Huang, H.-C., Lai, W.-H., and Chang, F.-C., Content-Adaptive Multi-level Data Embedding for Lossless Data Hiding. 2011: p. 29-32.
7. Wu, J.-H.L. and M.-Y., An Iterative Method for Lossless Data Embedding in BMP Images.
8. Malakooti, M.V., Khederzadeh, M., A Lossless Secure Data Embedding in Image Using DCT and Randomized Key Generator. DICTAP 2012.
9. Naheed, T., I. Usman, and A. Dar, Lossless Data Hiding Using Optimized Interpolation Error Expansion. 2011: p. 281-286.
10. Mishra, S., An Intuitive Method for Hilbert Curve Coding. International Journal of Computing and Corporate Research, 2011, 1.
Virtual Local Area Network (VLAN): Segmentation and Security

Abbas Mehdizadeh (a), Kevin Suinggi (a), Mojtaba Mohammadpoor (b), Harlina Harun (a)
(a) Dept. of Computing, Nilai University, 71800 Putra Nilai, Negeri Sembilan, Malaysia.
(b) Dept. of Computer & Electrical Eng., University of Gonabad, Gonabad, Iran.
mehdiizadeh@ieee.org, kevingizxc@gmail.com, mohammadpur@gonabad.cac.ir, harlina@nilai.edu.my
ABSTRACT

Virtual local area networks (VLANs) have recently become an integral feature of switched LAN solutions from every major LAN equipment vendor. One reason for the attention placed on VLAN functionality is the rapid deployment of LAN switching that commenced two decades ago, and many organizations and companies are moving quickly into networks featuring private-port LAN switching designs. VLANs offer an alternative to routers for broadcast containment, since VLANs allow switches to contain broadcast traffic as well. With switches deployed together with VLANs, each network segment can contain as few as one user, while broadcast domains can be as large as 1,000 users or possibly even more. This paper presents what exactly a VLAN is and how VLAN memberships are implemented in a switched network. Membership in a VLAN can be based on MAC addresses, port members, IP addresses, IP multicast addresses, or a combination of these aspects. VLANs are cost-effective as well as time-effective, can decrease network traffic, and provide extra security. In a VLAN environment with multiple broadcast domains, the network administrator has control over every port and user: a malicious user can no longer simply connect their station to any switch port and sniff the network traffic with a packet sniffer. The administrator controls each port and whatever resources it is permitted to use, and VLANs confine sensitive traffic originating from within an enterprise department to that department itself.

KEYWORDS

Local Area Network, Virtual LAN, Security, Segmentation, VXLAN.

1 INTRODUCTION

The world today relies heavily on technology for daily work, and important information is constantly transferred within companies. The advancement of technology has enhanced the way we transfer information, and the use of the Virtual Local Area Network (VLAN) is more popular now than ever. But what is a VLAN? VLANs are simple, yet they offer a wide variety of capabilities and options to improve the network. VLAN is a technology that divides a physical network into logical ones at Layer 2. Functionally, VLANs allow a network administrator to partition a local network into separate, independent networks. VLANs are implemented in large networks as well as small ones; in large networks, VLANs are sometimes implemented to combine physically separate LAN segments or LANs into one logical LAN.
In this paper, we discuss the segmentation of a VLAN, including why VLANs should be considered even in a smaller network. We go through the segmentation process from the LAN to the VLAN, and the configuration of a switch to form separate LAN segments. It is imperative for a company to secure the network from attackers who try to steal company information. Network security technologies protect the network against the theft and misuse of confidential data and secure it against malicious attacks from viruses and worms. Without a security solution, a company risks unauthorized intrusions, network downtime, service disruption, regulatory noncompliance and even legal action. Companies use VLANs as a way to connect the networks within the company [1][2].
A broad range of topics related to network security has been discussed in

cent researches, and a decent rundown of the scope. It carries an extensive and wide range
system security issue have been given. Great of information resources and services, such as
systems ought to work easily with different the World Wide Web (WWW), electronic mail,
systems, be straightforwardly to clients, give telephony, voice and video over IP [6], and file
remote get to, and keep up crest execution. sharing using peer-to-peer networks [7].
Then again, secure systems ensure private data, As the internet is the network of networks.
keep system execution solid, and underscore We then look into the different type of net-
information respectability. The two measure- works. The LAN interconnects computers and
ments are regularly at chances [3]. devices within a limited area such as a resi-
In this paper, we will discuss more on the secu- dence, campus, school, laboratory, or office.
rity of the VLAN. We will indulge further into On the other hand, there is a Wide Area Net-
the discussion of the different types of attack work (WAN), which covers a larger geographic
such as ARP poisoning and VLAN hopping on distance compared to LAN. A network larger
the VLAN and how we can use different ways than LAN and smaller than WAN could be con-
of preventing such attacks such as the use of sidered as Metropolitan Area Network (MAN),
static ARP entries. In section 2, we will discuss covering an area of a few blocks of a city to the
the related work that focuses on the segmenta- area of an entire city [8].
tion and the security of the VLAN and the ben- If LAN uses physical administration to create
efits of the VLAN. In the next subsection 2.1, a network, VLAN was created by using logical
we will examine the benefits of using VLANs. networks to divide a physical switch and sep-
We will be viewing the benefit in terms of the arate hosts that are not supposed to access to
scalability, cost, ease of use, integrity, virtual each other. A VLAN allows the creation of dif-
work group and security. In subsection 2.2, we ferent networks on a single physical switch at
will be going more in depth about the segmen- the data link layer [9]. To subdivide a network
tation of the VLAN and see how it differs from into VLANs, one configures a network switch
the traditional LAN. In subsection 2.3, we will or router. This allows different departments
examine about VXLAN, how it differs from in the company to have different networks on
VLAN and its benefits. Then in subsection 2.4, a single physical switch. This saves cost as
we will cover more about the security of the only a single physical switch is needed, and it
VLAN where we will examine the network at- can absolutely simplify network implementa-
tacks on a VLAN and the existing methods that tion and design, as it can be configured through
exist for protection against them. In section 3, software rather than hardware. VLANs allow
we will examine about the future works that network administrator to group hosts together
are in stored for the improvement of VLAN us- even if the hosts are not on the same network
age such as VXLAN. Lastly, we will conclude switch. Without VLANs, grouping hosts will
about the VLAN segmentation and security. need to relocate the nodes or rewire the data
links [10]. VLAN allows the flexibility for
2 RELATED WORK changes should there be a need to reconfigure
In this section, we will be focusing on the seg- the network [11]. However, VLANs are still
mentation and the security of the Virtual Local vulnerable to network attacks.
Area Network (VLAN)[4][5]. From protecting user data against the grow-
The Internet, the largest network, is the global ing number of threats to ensuring the conti-
system of interconnected computer networks nuity of the business, IT Security is an es-
that use the Internet protocol suite (TCP/IP) to sential element in any organization IT infras-
connect billions of computers and electronics tructure. As IT professionals being able to
devices worldwide. It consists of millions of benchmark against our peers, assess a threat, or
private, public, academic, business, and gov- just having some understanding of why a secu-
ernment sector networks of local to global rity project is important to the business is key

ISBN: 978-1-941968-45-1 ©2017 SDIWC 79


Proceedings of the Third International Conference on Computing Technology and Information Management (ICCTIM2017), Thessaloniki, Greece, 2017

[12]. Many techniques to address this issue has Virtual Work Group Another benefits of
been discovered. However, these techniques VLAN is to create virtual workgroups. For in-
will require improvement over the years as net- stance, co-workers from different departments
work attackers are getting better in attacking a that is working on a part of big project or pos-
VLAN [1][2]. sibly same project can send message from one
another without having to be in the same de-
2.1 Benefits of VLAN partment. This can help in reducing traffic in
The benefits of VLAN will be dsicussed in this the network.
section [13][14].

Scalability The ability to add, move and 2.2 Segmentation


change networks are attained with low cost and
Larger network is broken into smaller sections
less effort by just configuring a switch port into
by implementing VLANs. This is called as
the segmented VLANs and assigning clients to
segmentation of the network which allows eas-
the different VLANs.
ier management of the network. A LAN is
a combination of computers and devices con-
Security VLAN is able to provide a more se- nected to each other in small area to interoper-
cure environment, as network administrators ate and share resources.
can have control over each port. A malicious
Routers look at the network address of packets
user can no longer just plug their device into
and use different routing protocols to send the
any switch port and sniff or steal the network
packet to its destination efficiently. Switching
traffic using a sniffer software without getting
could reduce the number of nodes by using the
detected. The port and whatever resources it is
same network segment, resulting in lower con-
allowed to use could be controlled by network
gestion on each segment. In switched hubs or
administrator. The sensitive traffic originating
bridges, each node can have its own network
from an enterprise department within itself can
segment, and therefore have access to all of the
be restricted by VLANs.
network bandwidth of the segment. Switching
bridges can look deep into a packet and use
Cost Saving The cost of creating a network protocol information and the like to provide a
or expanding a network can be decreased by level of filtering and prioritization [15][16].
eliminating the need for additional expensive
For the segmentation section, we also have
network equipment like longer cables or extra
divided it into subsections. In subsection
routers. VLANs will provide the ability to use
2.2.1, we have the types of VLAN member-
better bandwidth and resources and therefore
ship where it explains how the VLAN are as-
the network can work more efficiently.
signed. In subsection 2.2.2, we view the types
of VLANs and how they differ from each other
Easy Troubleshooting Network administra- by the specific function they are used for. We
tors are able to observe the activities of the then move to the VLAN operation in subsec-
VLANs more easily. Therefore network prob- tion 2.2.3, this is where we show how the op-
lems can be easily traced and identified and eration work so that each switch port can be
rectified. assigned to a different VLAN. In subsection
2.2.4, we examine how we can identify the
Integrity The network is segmented logi- VLAN and what ports are used. We then go
cally which divide a physical switch and sep- more in depth to view the methods used by
arate hosts that are not supposed to access to VLAN for identification. Lastly for this sec-
each other. This ensure that data is not com- tion, we view the routing traffic between the
promised when handled. VLANs.

2.2.1 VLAN Membership

There are two types of VLAN membership:

Static VLANs: Static VLANs are configured by the network administrator, mainly for security reasons. Since a port is assigned manually to a VLAN, the assignment is always fixed and maintained. This type of membership is easy to set up and configure, but a manual update is required whenever a host changes. Static VLANs are not feasible for a large network that requires frequent updates; in that case, a dynamic solution is suggested.

Dynamic VLANs: VLANs can be assigned automatically using software, based on hardware (MAC) addresses, protocols and applications. For instance, assume that a MAC address has been listed in centralized VLAN software: if the device is attached to an unassigned switch port, the VLAN management database can look up the hardware address and assign and configure the switch port into the correct VLAN. The difficulty of this method is setting up the database initially [17].
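A toy Python illustration of this lookup is sketched below; the MAC-to-VLAN mappings and the fallback to VLAN 1 for unknown hosts are hypothetical choices, not behavior specified by any particular vendor's product.

VLAN_DB = {
    "6c:fc:03:a3:7f:81": 10,   # hypothetical engineering host
    "00:50:43:bf:7c:82": 20,   # hypothetical finance host
}

def assign_port(switch_port, mac):
    # Look up the host's MAC in the centralized database and pick its VLAN.
    vlan = VLAN_DB.get(mac.lower(), 1)   # assumed fallback: default VLAN 1
    print(f"port {switch_port} -> VLAN {vlan}")
    return vlan

assign_port(7, "6C:FC:03:A3:7F:81")      # port 7 -> VLAN 10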
2.2.2 Types of VLANs

In modern networks, there are a number of different types of VLANs. Some of them can be explained and classified based on their traffic classes; the other VLAN types are defined by the particular function they serve.

Data VLAN: A data VLAN is used and configured to carry the traffic generated by users; it would not include a VLAN carrying voice or management traffic, since it is common practice to separate voice and management traffic from data traffic. It is sometimes referred to as a user VLAN. These VLANs are created to separate the network into groups of users or groups of devices.

Default VLAN: When the default configuration is loaded at the initial bootup, all the ports of a switch become part of the default VLAN. These switch ports then belong to the same broadcast domain, which means any device connected to any switch port is allowed to communicate with devices on the other switch ports. VLAN 1 is the default VLAN on Cisco switches.

Native VLAN: A native VLAN is defined for an 802.1Q trunk port, the link between switches that carries traffic associated with more than one VLAN. A trunk port supports traffic coming from many VLANs, as well as traffic that does not come from a VLAN: tagged and untagged traffic, respectively. The 802.1Q trunk port places untagged traffic on the default VLAN 1, which is known as the native VLAN. Native VLANs are specified to maintain backward compatibility with the untagged traffic common in legacy LAN scenarios. It is a best practice to configure the native VLAN as an unused VLAN, different from VLAN 1 and the other VLANs; it can be dedicated as a fixed VLAN that serves as the native VLAN for all trunk ports in the switched domain.

Management VLAN: Any VLAN that is configured to access the management capabilities of a switch is considered a management VLAN, which by default is VLAN 1. By assigning an IP address and subnet mask to the Switch Virtual Interface (SVI), the management VLAN can be created and managed via HTTP, Telnet, SSH, or SNMP. It is not good practice to keep VLAN 1 as the management VLAN, because that is the out-of-the-box configuration of a Cisco switch. In the past, the management VLAN for a 2960 series switch was the only active SVI; on 15.x versions of the Cisco IOS for Catalyst 2960 switches, more than one active SVI is possible. However, having more than one management VLAN, which a switch can theoretically have, gives more opportunity to network attackers. There is also a risk if the native VLAN is the same as the
management VLAN. Therefore, the native VLAN should be distinct from any other VLAN if it is used.

Voice VLANs: To support Voice over IP (VoIP), a separate VLAN is required, which is called a voice VLAN. VoIP traffic needs the following:
• Assured bandwidth to provide acceptable voice quality
• Priority of transmission over other types of network traffic
• Routing capability around congested areas of the network
• Low delay (less than 150 ms) across the network
These requirements must be met to support VoIP. Their configuration is beyond the scope of this paper, but it is useful to briefly discuss how a voice VLAN works between a switch, a computer, and a Cisco IP phone.

2.2.3 VLAN Operation

Figure 1. VLAN Operation.

While each switch port can be associated with a separate VLAN, the ports associated with the same VLAN share broadcasts. Once a device enters the network, it automatically assumes the VLAN membership of the port it is attached to. For a host to be part of any VLAN, it must be given an IP address that belongs to the appropriate subnet.

2.2.4 Identifying VLANs

A port on a switch can be associated with only one VLAN or with all VLANs. A port can be configured manually as an access or trunk port, or the Dynamic Trunking Protocol (DTP) can operate on a per-port basis to set the switch port mode, negotiating with the port on the other end of the link [17]. There are two different types of links in a switched network:
i) Access Ports: An access port normally carries the traffic of only one VLAN. In this case, traffic is both sent and received in native format, without VLAN tagging. Anything arriving on an access port is simply considered to belong to the VLAN assigned to the port. A device connected to an access link is not aware of any VLAN membership; the device just assumes it is part of the same broadcast domain, and it does not recognize the physical network topology. Access-link devices cannot send and receive data to and from devices outside their VLAN unless routing is configured. A switch port can only be either an access port or a trunk port, not both, and an access port can be attached to only one VLAN [17].
ii) Trunk Ports: Trunk ports, on the other hand, are able to carry multiple VLANs at a time. A trunk link is a 100 or 1000 Mbps point-to-point link between two switches, between a switch and a router, or even between a switch and a server, and it carries the traffic of multiple VLANs, from 1 to 4094, at a time. This is a great functionality, because ports can be set up to have a server in two separate broadcast domains at the same time, so users will not have to cross a network layer (Layer 3) device to log in and access it. The other benefit is that trunk links can carry varying amounts of VLAN data across the link [17].

2.2.5 VLAN Identification Method

VLAN identification is how switches keep track of frames as they travel through a switched network; it defines how switches can identify which frames belong to which VLANs when there is more than one trunking method.
i) Inter-Switch Link (ISL): Inter-Switch Link (ISL) is a method that tags VLAN data onto an Ethernet frame. This data tagging permits VLANs to be multiplexed over a trunk through an external encapsulation method (ISL); in effect, it allows the switch to recognize the VLAN membership of a frame over the trunked link. By implementing ISL, multiple switches can be interconnected while VLAN information is maintained as traffic travels between switches on trunk links. ISL operates at Layer 2 by encapsulating a data frame with a new header and Cyclic Redundancy Check (CRC). It is used for Fast Ethernet and Gigabit Ethernet links only. ISL trunking is versatile and can be used on a switch port, a router interface, or a server interface card to trunk a server [17].
ii) IEEE 802.1Q: This is a standard method created by the IEEE for frame tagging. IEEE 802.1Q inserts a field into the frame to identify the VLAN. If trunking is needed between a Cisco switched link and a different brand of switch, 802.1Q must be used for the trunk to work. The basic purpose of the ISL and 802.1Q frame-tagging methods is to provide inter-switch VLAN communication. It should also be noted that any ISL or 802.1Q frame tagging is removed if a frame is forwarded out an access link; tagging is used across trunk links only [17].
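A minimal Python sketch of 802.1Q tagging (not ISL) is shown below: the 4-byte tag is inserted into the Ethernet frame after the source MAC, TPID 0x8100 identifies a tagged frame, and the TCI packs the priority (PCP), DEI and the 12-bit VLAN ID. The example values are illustrative.

import struct

def dot1q_tag(vid, pcp=0, dei=0):
    """Build the 4-byte 802.1Q tag for a given VLAN ID (1..4094)."""
    if not 1 <= vid <= 4094:
        raise ValueError("VLAN ID must be 1..4094")
    tci = (pcp << 13) | (dei << 12) | vid
    return struct.pack("!HH", 0x8100, tci)

def tag_frame(frame, vid):
    """Insert the tag after the destination and source MACs (12 bytes)."""
    return frame[:12] + dot1q_tag(vid) + frame[12:]

print(dot1q_tag(10).hex())   # '8100000a': VLAN 10, default priority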
2.2.6 Routing between VLANs

Routing between VLANs (inter-VLAN routing) is based on forwarding network traffic from one VLAN to another using a router. It allows devices connected to different VLANs to communicate with each other through the router. Nodes in a VLAN stay in their own broadcast domain and can communicate freely within it. VLANs provide network partitioning and traffic separation at Layer 2, the data link layer; therefore, if hosts or any other IP-addressable devices want to communicate between VLANs, a Layer 3 device is needed to provide routing services. Using virtual technology and protocols to segment a network can be useful to control broadcast traffic and implement security boundaries; however, allowing absolutely no access between VLANs is rarely beneficial. To solve this problem, inter-VLAN routing is implemented. At the CCNA level, there are two ways to make this happen:
a) Connect a unique router port to each VLAN
b) Create a router on a stick.

2.3 Virtual Extended Local Area Network (VXLAN)

We have seen that traditional network segmentation is provided by VLANs, standardized under the IEEE 802.1Q group. VLANs provide logical segmentation of Layer 2 boundaries, or broadcast domains. However, due to the inefficient use of available network links, rigid requirements on device placement in the data centre network, and the limited scalability to a maximum of 4094 VLANs, VLANs have become a limiting factor for IT departments and cloud providers as they build large multitenant data centres.
In this section, we discuss the VXLAN standard, which Cisco, in partnership with other leading vendors, proposed to the IETF as a solution to the data centre network challenges posed by traditional VLAN technology. The VXLAN standard provides the elastic workload placement and the higher scalability of Layer 2 segmentation required by today's larger application demands.

2.3.1 VXLAN Benefits

VXLAN is proposed to provide the same Ethernet Layer 2 network services as VLAN does today, but with greater flexibility and extensibility. The following benefits are offered by VXLAN compared to VLAN [18]:
1. Flexible placement of multitenant segments throughout the data centre: VXLAN provides a way to extend Layer 2 segments over the underlying shared network infrastructure, so that tenant workloads can be placed across the physical pods in the data centre.
2. Higher scalability, addressing more Layer 2 segments than VLANs, whose scalability is limited to only 4094 VLANs. VXLAN uses a 24-bit segment ID known as the VXLAN Network Identifier (VNID), which enables up to 16 million VXLAN segments to coexist in the same administrative domain.
3. Better utilization of the network paths available in the underlying infrastructure. VLAN uses the Spanning Tree Protocol for loop prevention, which leaves half of the network links in a network unused by blocking redundant paths. In contrast, VXLAN packets are transferred through the underlying network based on their Layer 3 header and can take complete advantage of Layer 3 routing, Equal-Cost Multi-Path (ECMP) routing, and link aggregation protocols to use all available paths.
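The 24-bit VNID can be illustrated with the Python sketch below, which packs and parses an 8-byte VXLAN header in the RFC 7348 layout (a flags byte with the I bit 0x08 marking a valid VNID, 24 reserved bits, the 24-bit VNID, and a final reserved byte); the example VNID is arbitrary.

import struct

def vxlan_header(vnid):
    if not 0 <= vnid < 2**24:          # 24 bits: up to ~16 million segments
        raise ValueError("VNID must fit in 24 bits")
    return struct.pack("!II", 0x08 << 24, vnid << 8)

def parse_vnid(header):
    flags_word, vnid_word = struct.unpack("!II", header)
    assert flags_word >> 24 == 0x08, "I flag not set"
    return vnid_word >> 8

hdr = vxlan_header(5000)
assert parse_vnid(hdr) == 5000
print(hdr.hex())                        # '0800000000138800'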
2.4 Security

A broad range of topics related to network security has been discussed, and a good summary of the network security problem has been provided. Good networks should operate smoothly with other networks, be transparent to users, provide remote access, and maintain peak performance. On the other hand, secure networks protect confidential information, provide reliability, and preserve data integrity [19]. The two dimensions are often at odds [3].

2.4.1 Network Attacks on a VLAN

With the increasing advancement of technology, the need for network security is also increasing: as technology advances, the range of available hacking tools also widens. These hacking tools are used on different types of networks, including the Wide Area Network and the Local Area Network [20]. While the VLAN is important in maintaining flexible networking [11], it also raises security issues, as corporations keep transferring important data over the VLAN. Network security issues on the VLAN [21][22][23] are very important and should be considered, discussed and analyzed. Here we consider some of the more common network attacks on a VLAN.

Address Resolution Protocol (ARP) Spoofing Attacks:
ARP spoofing, ARP cache poisoning, or ARP poison routing, is a method in which an attacker sends spoofed Address Resolution Protocol (ARP)[24][25] messages onto a local area network. The main aim of the attacker is to associate its own MAC address with the IP address of another node, such as the default gateway, so that any traffic meant for that IP address is sent to the attacker instead [20]. As shown in Figure 3, the attacker sends fake packets, carrying the same IP address as the original host, to the server, claiming to be the genuine host. When the server receives the packets, it considers the MAC address of the attacker to be the intended destination, since the attacker is using the same IP address as the original host, and starts sending the data to the attacker instead; the attacker thus receives data intended for the genuine recipient. ARP spoofing may allow an attacker to intercept data frames on a network, modify the traffic, or stop all traffic [26].

Figure 2. Routing under normal operation.

VLAN Hopping:
VLAN hopping is a computer security exploit in which an attacker attacks networked resources on a VLAN. The basic concept behind all attacks of this kind is for an attacking host on one VLAN to gain access to traffic on other VLANs that is not supposed to be
accessible. In a VLAN hopping attack, the attacker sends tagged packets carrying the VLAN ID of the targeted VLAN; the VLAN ID is then checked, and the packet is passed to the targeted VLAN [20]. In Figure 4, the attacker attacks the VLAN through the trunk of the network: the packet sent by the attacker contains the VLAN ID of the targeted VLAN.

Figure 3. ARP attack example with the malicious user as the man-in-the-middle.

Figure 4. Attacking the VLANs through the trunk of the network.

MAC Flooding:
MAC flooding is a method implemented to compromise the security of network switches. In this attack, the attacker sends many different fake source MAC addresses [20]. The purpose of the attacker is to consume the limited memory set aside in the switch to store the MAC address table. Traffic is then flooded and becomes accessible, as the switch can no longer keep the specific destination addresses in memory; since the traffic is flooded, the attacker can simply watch it and sniff out the information. Figure 5 shows the attacker sending various MAC addresses to the switch with the intention of exhausting the switch's MAC address table.

Figure 5. MAC address flooding.
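A toy Python model of why this works is sketched below: a switch CAM table with limited capacity stops learning new addresses once it is full, so frames for unknown destinations are flooded out of every port, where an attacker can sniff them. The capacity, addresses and learning policy are made up for illustration; real switches also differ in how full tables are handled.

class ToySwitch:
    def __init__(self, capacity=4):
        self.capacity = capacity
        self.cam = {}                        # MAC -> port

    def learn(self, mac, port):
        if mac in self.cam or len(self.cam) < self.capacity:
            self.cam[mac] = port             # normal learning
        # table full: the new address is simply not learned

    def forward(self, dst_mac):
        port = self.cam.get(dst_mac)
        return f"unicast to port {port}" if port else "FLOOD to all ports"

sw = ToySwitch()
sw.learn("aa:aa:aa:aa:aa:01", 1)             # legitimate host
for i in range(100):                         # attacker fills the table
    sw.learn(f"de:ad:be:ef:00:{i:02x}", 7)
sw.learn("bb:bb:bb:bb:bb:02", 2)             # victim arrives too late
print(sw.forward("bb:bb:bb:bb:bb:02"))       # -> FLOOD to all ports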

memory set aside in the switch to store the


MAC address table. Traffic is then flooded The arp command makes a static entry in ARP
and becomes accessible as the switch can no cache, to launch a communication session with
longer keep the specific destination address in the node that has a 192.168.1.17 IP address,
the memory. Since the traffic is flooded, the at- it is not needed to start the process with an
tacker can then simply see the traffic generated ARP request as the target node’s MAC address
and sniff out the information. Figure 5 shows is already known. If a similar ARP entry has
the attacker sending various MAC addresses to not been added to the target node/host, the tar-
the switch with the intention to limit the MAC get host needs to send an ARP request to the
address table of the switch. computer to find out the MAC address. After

ISBN: 978-1-941968-45-1 ©2017 SDIWC 85


Proceedings of the Third International Conference on Computing Technology and Information Management (ICCTIM2017), Thessaloniki, Greece, 2017

After adding the static ARP entry, the ARP cache on the computer may look like the following:

C:\>arp -a

Interface: 192.168.1.137 --- 0x50006
  Internet Address    Physical Address     Type
  192.168.1.17        6c-fc-03-a3-7f-81    static
  192.168.1.254       77-d8-e5-f2-43-6d    dynamic

The connection with the node at 192.168.1.17 stays up until the MAC address of the target computer changes, which could be because of a change of network card, or because some operation changes the MAC address. When this happens, the invalid ARP entry needs to be deleted with an arp -d command, such as arp -d 192.168.1.17.

A Cisco router keeps ARP entries in the cache for four hours (240 minutes), while Windows workstations keep them for a maximum of only ten minutes. This is common on routers because they tend to spend most of their time dealing with the same nodes. A router is normally configured as the default gateway for the devices of the network, which is why it sees the same nodes communicating with it for most of the day; as long as those nodes keep sending data through the router, they will remain in the ARP cache. For a router connected to large network segments, this would result in a rather large ARP listing or ARP table. More of the router's memory is consumed by a large ARP table, so the caching time that Cisco has chosen is the result of weighing the memory consumed by the ARP cache against the need for fresh MAC information. To create a static ARP entry on a router, enter Global Configuration mode, where the arp command looks like this:

#arp 192.168.1.17 6cfc.03a3.7f81 arpa

After entering this command, the ARP cache contains the IP-MAC address pair, which will not age out of the cache. This can be seen by the dash in the Age column. Static ARP entries are not usually associated with an interface like the dynamic entries are.

Router#show arp

Prot.  Address        Age(min)  Hardware Addr    Type
Int.   192.168.1.1    -         0050.43bf.7c82   ARPA
Int.   192.168.1.17   -         6cfc.03a3.7f81   ARPA

If the entry is no longer needed, or if it needs to be changed to something else, the no arp command removes the original entry:

Router(config)#no arp 192.168.1.17

If two nodes communicate with each other constantly throughout the day, static ARP entries would be added. By adding static ARP entries for both systems in each other's ARP cache, some network overhead is reduced, in the form of ARP requests and ARP replies. It is also good for prevention against flooding attacks where the ARP cache is being flooded with random entries. Static ARP could help to find out which entries are allowed and which should be dropped.
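In the same spirit, a small script can scan the cache for suspicious entries, for example several IP addresses claiming the same MAC address, a common symptom of ARP spoofing. This is an illustrative sketch, not from the paper, and the sample output is fabricated for the example.

# Illustrative sketch (not from the paper): parse 'arp -a' style output
# and flag MAC addresses that are claimed by more than one IP address.
import re
from collections import defaultdict

SAMPLE_OUTPUT = """\
  192.168.1.17    6c-fc-03-a3-7f-81   static
  192.168.1.254   77-d8-e5-f2-43-6d   dynamic
  192.168.1.30    77-d8-e5-f2-43-6d   dynamic
"""

def duplicate_macs(arp_output):
    """Map each MAC address to the set of IP addresses claiming it."""
    seen = defaultdict(set)
    for line in arp_output.splitlines():
        match = re.match(r"\s*([\d.]+)\s+([0-9a-f-]+)\s+\w+", line)
        if match:
            ip, mac = match.groups()
            seen[mac].add(ip)
    return {mac: ips for mac, ips in seen.items() if len(ips) > 1}

print(duplicate_macs(SAMPLE_OUTPUT))  # flags 77-d8-e5-f2-43-6d (two IPs)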


Ingress Filtering:
The use of ingress filtering is a method of ensuring that incoming packets are actually from the networks from which they claim to originate. The switch is configured with ingress filtering to accept only the allowed packets. Any router that deploys ingress filtering checks the source IP field of the IP packets it receives, and if a packet does not carry an IP address in the expected IP address block, the packet is dropped. However, addresses can be faked, and ingress filtering will still accept the packets should the attacker use an address that the ingress filter allows [20]. Ingress filtering thus confirms whether inbound packets arriving at a network are from the source they claim to be from before entry (or ingress) is granted.

It takes advantage of the IP-address filtering capability of a router at the network's edge, and if traffic has a high probability of being malicious, it is blocked. At its simplest, ingress filtering involves establishing an access control list. This list contains the IP addresses of permitted source addresses. Conversely, the access control list may also be used to block prohibited source addresses. The following source IP addresses will be blocked by ingress filtering:

• Already-in-use IP addresses, i.e., IP addresses within the internal network. By blocking these as source IPs, an attacker spoofing an internal IP address to take advantage of a poorly written firewall rule can be stopped.
• Private IP addresses. By blocking these addresses, malicious traffic coming in from an improperly configured Internet-based host or an attacker's spoofed address can be prevented.
• Loopback IP addresses. If the loopback address is spoofed, this helps to prevent that type of traffic.
• Multicast addresses [27]. Blocking multicast source addresses could help to prevent undesired multicast traffic that amounts to spam.
• Service or management network IP addresses. The attacker is then not able to use the public Internet to gain unauthorized access to network services running at the network application layer and above.

Traffic from specific regions of the world can be whitelisted by the network admin, and a specific region can be blacklisted so that it cannot access the environment. Some free subscription-based services can be found to create access control lists for network border routers.
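To make the filtering rules concrete, the sketch below implements the block list described above with Python's standard ipaddress module. It is an illustration only, not from the paper, and the specific internal and management ranges are assumptions for the example.

# Illustrative sketch (not from the paper): drop inbound packets whose
# source address falls within one of the blocked ranges listed above.
import ipaddress

BLOCKED_SOURCES = [
    ipaddress.ip_network("192.168.1.0/24"),  # internal addresses seen from outside (assumed)
    ipaddress.ip_network("10.0.0.0/8"),      # private
    ipaddress.ip_network("172.16.0.0/12"),   # private
    ipaddress.ip_network("127.0.0.0/8"),     # loopback
    ipaddress.ip_network("224.0.0.0/4"),     # multicast
    ipaddress.ip_network("192.0.2.0/24"),    # management network (assumed)
]

def ingress_permitted(source_ip):
    """Return True if a packet with this source IP may enter the network."""
    addr = ipaddress.ip_address(source_ip)
    return not any(addr in net for net in BLOCKED_SOURCES)

print(ingress_permitted("8.8.8.8"))       # True: ordinary external source
print(ingress_permitted("192.168.1.17"))  # False: claims an internal address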
3 VXLAN ENHANCEMENTS

VXLAN has a higher scalability to address more Layer 2 segments. VLANs use a 12-bit VLAN ID to address Layer 2 segments, which results in a scalability limit of only 4094 VLANs. VXLAN uses a 24-bit segment ID known as the VXLAN Network Identifier (VNID), which enables up to 16 million VXLAN segments to coexist in the same administrative domain.

We now discuss the VXLAN encapsulation and packet format. VXLAN is a Layer 2 overlay scheme over a Layer 3 network. It uses MAC Address-in-User Datagram Protocol (MAC-in-UDP) encapsulation to provide a means to extend Layer 2 segments across the data centre network. VXLAN is a solution to support a flexible, large-scale multitenant environment over a shared common physical infrastructure. The transport protocol over the physical data centre network is IP plus UDP. VXLAN defines a MAC-in-UDP encapsulation scheme where the original Layer 2 frame has a VXLAN header added and is then placed in a UDP-IP packet. With this MAC-in-UDP encapsulation, VXLAN tunnels a Layer 2 network over a Layer 3 network. VXLAN introduces an 8-byte VXLAN header that consists of a 24-bit VNID and a few reserved bits. The VXLAN header, together with the original Ethernet frame, goes in the UDP payload. The 24-bit VNID is used to identify Layer 2 segments and to maintain Layer 2 isolation between the segments. With all 24 bits in the VNID, VXLAN can support 16 million LAN segments.

VXLAN uses VXLAN Tunnel EndPoint (VTEP) devices to map tenants' end devices to VXLAN segments and to perform VXLAN encapsulation and de-encapsulation. Each VTEP function has two interfaces: one is a switch interface on the local LAN segment to support local endpoint communication through bridging, and the other is an IP interface to the transport IP network. The IP interface has a unique IP address that identifies the VTEP device on the transport IP network, known as the infrastructure VLAN. The VTEP device uses this IP address to encapsulate Ethernet frames and transmits the encapsulated packets to the transport network through the IP interface. A VTEP device also discovers the remote VTEPs for its VXLAN segments and learns remote MAC Address-to-VTEP mappings through its IP interface. The functional components of VTEPs and the logical topology that is created for Layer 2 connectivity across the transport IP network are shown in Figure 6.

The VXLAN segments are independent of the underlying network topology; conversely, the underlying IP network between VTEPs is independent of the VXLAN overlay. It routes the encapsulated packets based on the outer IP address header, which has the initiating VTEP as the source IP address and the terminating VTEP as the destination IP address [28], [29].
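As a concrete illustration of the MAC-in-UDP encapsulation described above, the following sketch packs the 8-byte VXLAN header (an 8-bit flags field, 24 reserved bits, the 24-bit VNID, and 8 further reserved bits) in front of an inner Ethernet frame. It is an illustration written for this text, not taken from the cited vendor documents, and the inner frame is a dummy placeholder.

# Illustrative sketch (not from the cited documents): build the 8-byte
# VXLAN header and prepend it to an original Layer 2 frame; the result
# would travel as the payload of a UDP/IP packet between VTEPs.
import struct

VXLAN_FLAG_VALID_VNI = 0x08  # "I" flag: the VNID field is valid

def vxlan_encapsulate(vnid, inner_frame):
    """Return VXLAN header + original Ethernet frame (the UDP payload)."""
    if not 0 <= vnid < 2 ** 24:
        raise ValueError("VNID must fit in 24 bits (up to 16 million segments)")
    # First 32 bits: flags (8) + reserved (24); next 32 bits: VNID (24) + reserved (8).
    header = struct.pack("!II", VXLAN_FLAG_VALID_VNI << 24, vnid << 8)
    return header + inner_frame

payload = vxlan_encapsulate(5000, b"\x00" * 64)  # dummy 64-byte inner frame
print(len(payload))  # 8-byte header + 64-byte frame = 72 bytes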


Figure 6. The functional components of VTEPs.

4 CONCLUSION

As technology is advancing and improving at a high rate on a daily basis, more methods of managing the networks of these technologies are being developed. Since there are millions of networks all around the globe, one of the special methods to manage these networks is the creation of logical addressing. One way to manage networks is the creation of a physical way of addressing, which is called Local Area Networking (LAN). To address the issue of handling many networks, logical addressing was created, where components only need to be in the same sub-network to interact with each other.

With the wide usage of VLANs, there is a concern about the security of the network, as well as scalability and network management, which have been discussed in this paper. As sensitive data is broadcast on a network, there are several risks and threats to the network. VLANs can minimize this threat by placing on a VLAN only those users with access to the network data. This will reduce the chances of an intruder gaining access. With the implementation of VLANs we can also have control of broadcast domains, set up firewalls, prohibit access, and alert a network manager in case of an attack by an outsider. In this paper, we can conclude that the utilization of virtual local area networks can surely simplify network management and also provide networks with improved security.

REFERENCES

[1] Cisco, "What Is Network Security? - Cisco Systems", 2016.

[2] Cisco, "Solutions - Cisco Systems, What You Need to Know about Network Security", 2016.

[3] Baker, Richard H., "Network Security: How to Plan for it and Achieve it", McGraw-Hill, Inc., 1995.

[4] Catalyst 4500 Series Switch Cisco IOS Software Configuration Guide, 12.2(25)EW, "Understanding and Configuring VLANs [Cisco Catalyst 4500 Series Switches]", 2016.

[5] Surabhi Surendra Tambe, "Understanding Virtual Local Area Networks", International Journal of Engineering Trends and Technology (IJETT), Vol.25 No.4, 174-176, 2015.

[6] Sun, Lingfen, Is-Haka Mkwawa, Emmanuel Jammeh, and Emmanuel Ifeachor, "Guide to voice and video over IP: for fixed and mobile networks", Springer Science & Business Media, 2013.

[7] Tsimonis, G., & Dimitriadis, S., "Brand strategies in social media", Marketing Intelligence & Planning, 32(3), 328-344, 2014.

[8] Van Heddeghem, W., Lambert, S., Lannoo, B., Colle, D., Pickavet, M., & Demeester, P., "Trends in worldwide ICT electricity consumption from 2007 to 2012", Computer Communications, 50, 64-76, 2014.

[9] Altunbasak, Hayriye, and Henry Owen, "An architectural framework for data link layer security with security inter-layering", Proceedings IEEE SoutheastCon, 2007.

[10] Wilkins, S., "Virtual vs. Physical LANs: Device Functionalities", Pearson IT Certification, CCNA Routing and Switching 200-120 Network Simulator, 1st Edition, 2015.

[11] Nishino, H., Nagatomo, Y., Kagawa, T., & Haramaki, T., "A Mobile AR Assistant for Campus Area Network Management", in IEEE 2014 Eighth International Conference on Complex, Intelligent and Software Intensive Systems (CISIS), pp.643-648, 2014.

[12] Shepard, D., "84 Fascinating & Scary IT Security Statistics", 2015 Cyberthreat Defense Report, 2015.
security. REPORT, 2015.


[13] Haq, Ul, Syed Ehtesham, and Suraiyaa Parveen, "Implementation of network architecture, its security and performance analysis of VLAN", International Journal of Advanced Research in Computer Science, 8, no. 7, 2017.

[14] Nguyen, Van-Giang, and Young-Han Kim, "SDN-Based Enterprise and Campus Networks: A Case of VLAN Management", Journal of Information Processing Systems, 12, no. 3, 2016.

[15] Derfler, F.J., Freed, L., Douglas, P., Robbins, L., Adams, S., and Troller, M. (illustrator), "How networks work", Que Corp, 2000.

[16] Henry, Paul David, "Strategic networking: From LAN and WAN to information superhighways", Coriolis Group, 1996.

[17] Pal, G. Prakash, and Gyan Prakash Pal, "Virtual Local Area Network (VLAN)", International Journal of Scientific Research Engineering & Technology (IJSRET), 1: 006-010, 2013.

[18] Kapadia, Shyam, Puto H. Subagio, Yibin Yang, Nilesh Shah, Vipin Jain, and Ashutosh Agrawal, "Implementation of virtual extensible local area network (VXLAN) in top-of-rack switches in a network environment", U.S. Patent 9,565,105, 2017.

[19] Mehdizadeh, A., Abdullah, R. S. A. R., & Hashim, F., "Secure group communication scheme in wireless IPv6 networks: An experimental test-bed", in IEEE International Symposium on Communications and Information Technologies (ISCIT), pp. 724-729, 2012.

[20] Riddell, A., Darrell, A., Drewett, C., Sangster, W., "LAN Security Guide", First Edition, ISBN: 978-0-473-19020-0, Allied Telesis Inc, 2012.

[21] Rouiller, A.S., "VLAN security weaknesses and countermeasures", GIAC Security Essentials Practical, SANS Institute, 2016.

[22] Heron, S., "Ten top threats to VLAN security", Redscan, 2014.

[23] Akram, J., Akram, N., Mamoon, S., Ali, S., & Naseer, N., "Future and Techniques of Implementing Security in VLAN", Journal of Network Communications and Emerging Technologies (JNCET), 7(5), 2017.

[24] Bruschi, Danilo, Alberto Ornaghi, and Emilia Rosti, "S-ARP: a secure address resolution protocol", 19th IEEE Annual Computer Security Applications Conference, 2003.

[25] Cisco, "Configuring the Address Resolution Protocol (ARP)", Cisco Content Services Switch Routing and Bridging Configuration, 2004.

[26] Zargar, S. T., Joshi, J., & Tipper, D., "A survey of defense mechanisms against distributed denial of service (DDoS) flooding attacks", IEEE Communications Surveys & Tutorials, 15(4), 2046-2069, 2013.

[27] Mehdizadeh, A., Abdullah, R. S. A. R., Hashim, F., Ali, B. M., Othman, M., & Khatun, S., "Reliable key management and data delivery method in multicast over wireless IPv6 networks", Wireless Personal Communications, 73(3), 967-991, 2013.

[28] Cisco, "VXLAN Overview: Cisco Nexus 9000 Series Switches", 2016.

[29] Arista, "VXLAN: Scaling Data Center Capacity", 2016.


Ontology-Based Data Mining Approach for Judo Technical Tactical Analysis

Ivo La Puma1 and Fernando Antonio de Castro Giorno1,2


1 Institute of Technological Research, São Paulo, Brazil
2 Pontifical Catholic University, São Paulo, Brazil
ivolapuma@gmail.com and fgiorno@gmail.com

ABSTRACT

Coaches and other judo experts conduct technical tactical analysis of judokas to understand the techniques they use to win fights, and to identify defensive strategies to counter an opponent's actions. Computer systems based on artificial intelligence (AI) techniques are used in technical tactical analysis and in predicting results, injury prevention, talent discovery, and game strategy evaluation in various collective and individual sports. However, there are no studies related to the use of AI in judo. This paper proposes a data mining approach using an ontology for the technical tactical analysis of judo. As a proof of concept, a data mining tool was developed to identify sequential patterns in judo combat actions and assist in strategic decision-making. An ontology of judo fight was also developed and used to model the database. The approach was found to be valid, as the tool yielded the information needed to satisfy the desired performance analysis requirement. As a contribution, it is expected that this paper enables the flourishing of new research or applications through the ontology of judo fight, as well as validates the model of mapping performance analysis requirements and data mining methods used in this study.

KEYWORDS

Data mining, Sequential pattern mining, Ontology, Judo, Technical tactical analysis.

1 INTRODUCTION

Technical tactical analysis or performance analysis describes the method of analysis used to understand how sporting skills are applied, and serves as a source of information for performance enhancement in the sport [1]. From the perspective of sports science, such analysis is carried out with one or a combination of the following aims: finding patterns of performance that may increase the chances of winning a competition; predicting the performances of an athlete or a team; real-time decision-making on the actions/reactions or strategies needed in an event; and identifying sports demands and selecting athletes who can best address the demands [2].

Judo is characterized as an intermittent high-intensity combat sport consisting of a variety of techniques and actions taken during a match [3]. Similar to other sporting modalities, technical tactical analysis is carried out in judo to determine the performance development in the sport. Moreover, the performance of judokas is analyzed to understand how they use their techniques to win a contest [4].

The volume of data produced in the sports area is increasing. However, the potential for analyzing and understanding these large datasets is still below the potential for collecting and storing them. As experts and statisticians fail to explain the relationships between data, approaches based on AI are being used to support them in explaining or predicting events or in making strategic decisions [5]. Several studies have reported the use of data mining based on AI techniques in predicting results, evaluating performance and preventing injuries to athletes, finding talent, and evaluating game strategies in various sports modalities [2], [5], [6]. However, there are no studies related to the use of AI-based computer systems in the technical tactical analysis of judo competitors or in supporting strategic decision-making in the sport.


The main objective of this study is to propose an ontology-based data mining approach for the technical tactical analysis of judo competitors. Furthermore, the following are secondary objectives of this study:
1. Develop an ontology of judo fight, which is inspired by the technical tactical model of the judo fight [7].
2. Build a prototype of a data mining tool for technical tactical judo analysis based on the model of mapping requirements of sports performance analysis and data mining methods [2].

A contribution expected from this paper is to enable the flourishing of new research or applications through the ontology of judo fight. Another contribution is to validate the model of mapping performance analysis requirements and data mining methods used in the approach proposed in this study.

The remainder of the paper is organized as follows. In Section 2, significant studies related to this paper are presented. In Section 3, the ontology of judo fight is introduced. The steps for the ontology-based data mining approach for technical tactical judo analysis are defined in Section 4. The main aspects in building a prototype of the data mining tool for the technical tactical analysis of judo, which serves as a proof of concept, are presented in Section 5. In Section 6, the analysis of the results obtained in the simulations of technical tactical analysis of judo using the prototype is described. Finally, Section 7 presents the final considerations on this study, which highlight its positive and negative aspects, contributions, and suggestions for future work.

2 REVIEW OF RELATED LITERATURE

In this section, the significant studies that influenced the development of the present study are highlighted.

The first study is the technical tactical analysis model of judo fight [7], which uses a statistical method based on Markov processes to calculate combinations of probabilities between each phase of combat. This model characterizes the judo fight in six distinct phases (approach, gripping, attack, defense, ground working, and pause) and uses 46 groups of technical tactical variables to represent combat actions.

This is followed by the model of data mining tools for performance analysis in high-performance sports [2]. This model suggests a categorization of sports according to performance analysis requirements where, through a mapping of data mining methods and techniques, it is possible to combine it with the characteristics of available techniques and suggest the most appropriate approach to build the tool (Table 1). This model was proposed based on the hypothesis that the combination of certain methods and techniques of data mining is more appropriate for certain sports performance analysis requirements.
Table 1. Mapping between sports performance analysis requirements, data mining methods, and data mining technique
characteristics.
Sports performance analysis DM methods DM technique characteristics
requirements Clust. Class. R. Mod. R. Min. Interp. Prec. Flex.
Performance pattern discovery X X High Moderate Moderate
Performance prediction X X Low High High
Real-time decision-making X X Very high High Very low
Demand analysis X X Moderate Moderate Moderate
Note. DM = data mining; Clust. = clustering; Class. = classification; R. Mod. = relationship modelling;
R. Min. = rule mining; Interp. = interpretability; Prec. = precision; Flex. = flexibility.
A cell with “X” indicates that the specific performance analysis is mainly performed using the method in that column.
A blank cell shows that the specific method in that column is not frequently and effectively used for the specific
performance analysis in that row.


Finally, the last study is an ontology-based data mining approach using clustering and association rule mining to identify patterns in schools in India [8]. The approach was composed of ontology development and attribute categorization, data mining, and evaluation. Data from Indian schools consisted of 242 attributes, which makes it difficult for a data mining tool to extract useful knowledge through random grouping. The ontology developed in that study was based on the school data attributes, was used in the application to select the input data of the mining algorithms, and served as a reference in the evaluation and interpretation of results. This approach served as an inspiration for the data mining approach to the technical tactical analysis of judo in the present study.

3 ONTOLOGY OF JUDO FIGHT

The ontology of judo fight was strongly inspired by the judo fight phases (approach, gripping, attack, defense, ground working, and pause) and the 46 combat actions that characterize the technical tactical model of judo fight [7]. The development of this ontology was carried out using the Noy and McGuinness method [9] and the ontology pitfall detection tool (OOPS!) [10]. This ontology was written in OWL using the ontology editor Protégé 5.0.0 [11]. Figure 1 shows the classes of the ontology.

Figure 1. Classes of the ontology of judo fight.

The latest version of the ontology is available at the GitHub repository (https://github.com/ivolapuma/ontology-judo-fight).

4 PROPOSED APPROACH

This section presents all steps comprising the ontology-based data mining approach for the technical tactical analysis of judo.

4.1 Building the Judo Fight Notation Database

Phase 1 deals with the building of the judo fight notation database, which will serve as a data source for the data mining tool in the technical tactical analysis of judo.
Step 1.1. Defining the ontology of judo fight. The ontology of judo fight is recommended, but it is possible to use another ontology.
Step 1.2. Modelling the database. The database must consist of entities and attributes found in the ontology.
Step 1.3. Creating the physical database.
Step 1.4. Loading the judo fight notations in the database.

4.2 Creation of the Data Mining Tool for Technical Tactical Analysis

Phase 2 deals with the creation of the prototype of the data mining tool, which must discover patterns from the fight notation data and allow the technical tactical analysis of a judoka.
Step 2.1. Defining the performance analysis requirements. Based on Table 1, the requirements can be performance pattern discovery, performance prediction, real-time decision-making, or demand analysis.


Step 2.2. Defining the data mining method. The data mining method must obey the mapping of performance analysis requirements and data mining methods (Table 1).
Step 2.3. Defining the data mining technique. The technique must correspond to the values defined by the characteristics of interpretability, precision, and flexibility (Table 1).
Step 2.4. Developing the prototype of the data mining tool. This prototype must use the selected data mining technique to process the judo fight notations and return the discovered and desired patterns to satisfy the performance analysis requirement.

4.3 Simulation of Technical Tactical Analysis of Judo

The simulation of the technical tactical analysis of judo comprises the third and last phase of the proposed approach.
Step 3.1. Discovering patterns from judo fight notations. This step should be performed in simulations using the prototype.
Step 3.2. Performing the technical tactical analysis of judo. The analysis must be based on the patterns discovered by the prototype. Currently, the ontology can be used in interpreting the results.
Step 3.3. Evaluating the characteristics of the data mining technique. The characteristics of interpretability, precision, and flexibility of the technique used should be compared with the values defined for these characteristics (Table 1).

5 PROOF OF CONCEPT

To validate the approach proposed in this study, a proof of concept was performed that involved the fulfillment of the steps defined in each phase. In this section, the most critical aspects defined in the steps of Phases 1 and 2 are highlighted.

The ontology of judo fight, Step 1.1 in Section 3, was defined for this proof of concept. The creation of a model of the database tables according to Step 1.2 was done based on the entities and attributes of this ontology. In Step 1.3, PostgreSQL version 9.6.1 for the x86-64 Windows platform was used to create the database. In Step 1.4, fight notation data were imported directly into the database. These notations came from Professor Emerson Franchini's archive [12] and cover 120 judo matches in official contests organized by the International Judo Federation between the Beijing 2008 and London 2012 Olympic Games, comprising the fighting actions of 23 athletes from among the best of the men's -81 kg category. Table 2 shows the number of fights and actions noted per judoka; the names are withheld. These judo matches were under the rules applicable at that time, and did not consider the modifications introduced in 2017.

Table 2. Number of fights and actions noted by judoka.

Judo fighter             Total fights   Total actions
Brazilian judoka 1       10             753
German judoka 1          10             527
Canadian judoka 1        9              486
Dutch judoka 1           8              555
French judoka 1          7              428
Montenegrin judoka 1     7              369
South Korean judoka 1    6              417
American judoka 1        6              390
Azerbaijani judoka 1     6              368
Emirati judoka 1         6              299
Moroccan judoka 1        6              263
Belgian judoka 1         5              345
Italian judoka 1         5              268
Argentinian judoka 1     5              199
Ukrainian judoka 1       4              170
Kazakhstani judoka 1     4              167
Japanese judoka 1        3              210
Dutch judoka 2           3              153
Slovene judoka 1         3              94
Russian judoka 1         3              88
Croatian judoka 1        2              134
British judoka 1         1              60
South Korean judoka 2    1              54

Following Step 2.1, the performance analysis requirements were aimed at real-time decision-making, because it was intended to extract information from judo matches to support decision-making on the actions/reactions or strategies needed by a judoka to win in a competition. However, the term "real-time"


must be relativized, given that during a fight a judoka would not be able to redefine his strategies from an information system. In Step 2.2, rule mining was the data mining method defined, because the discovery of rules or patterns in the fighting actions of a judoka can fulfill the performance analysis requirement. In Step 2.3, the sequential pattern mining technique [13] was defined, because identifying the most frequent sequences of actions performed by a judoka can accomplish the goal defined in Step 2.1.

Step 2.4 comprised the development of the prototype of the data mining tool based on the sequential pattern mining technique. The CM-SPAM algorithm [14], provided by the SPMF library [15] written in Java, was used in the prototype development. The CM-SPAM algorithm was selected because it has a very satisfactory performance compared to other algorithms [16]. Furthermore, its implementation in SPMF allows the use of optional parameters (such as maximum and minimum pattern size and mandatory items), which are not available in the implementations of other algorithms (such as GSP, PrefixSpan, SPADE, SPAM, and CM-SPADE).

Moreover, to assemble the sequence base to enable execution of the sequential pattern mining algorithm, the following adaptations were considered:
Item. Each fight action performed by the same judoka.
Element. A set of consecutive fighting actions carried out by the same judoka between a referee's start fight command (hajime or yoshi) and a pause fight command (mate, sonomama or soremade). Additionally, if a judo player repeats some action since hajime, that action delimits the set of previous actions and marks the beginning of a new set of actions, until the next referee's command or until another action is repeated. The break in a set of actions is necessary because it is a premise of the algorithm that an element does not contain repeated items.
Sequence. The set of all actions (or elements) performed by the same judoka in the same fight.

The constraint imposed by the algorithm's premise of not allowing an element to contain repeated items can have implications on the results obtained with data mining. Another important consideration concerns the attack actions. In the judo fight notation database model, an attack action is characterized by the applied judo technique, the attack direction and, eventually, the score assessed by the referee. To allow the techniques and directions of attacks to be considered in the process of finding sequential patterns, the set "attack + technique + direction" is considered as a single action. Similarly, an eventual score is also considered as a single action in the element, following the item related to the attack action.

An example of an element could be represented by the sequence of judo fight actions "Left Anteroposterior Approach, Trying to Grasp, Left Sleeve and Right Sleeve, Attack + Deashi-harai + Left, Waza-Ari."

The prototype built in this study was called JudoDataMining and was developed using Java version JavaSE-1.8 (jdk1.8.0_121).

6 RESULTS ANALYSIS

This section highlights the most critical aspects defined in the implementation of Steps 3.1, 3.2, and 3.3 of Phase 3.

6.1 Discovering Patterns from Judo Fight Notations

In Step 3.1, the equipment used to perform the simulations was a Dell Vostro 3500 with an Intel® Core™ i7 M 680 processor @ 2.80GHz and 6.00GB RAM, running the Microsoft Windows 10 Pro 64-bit operating system. The simulations were carried out through a script program that ran the data mining tool for each selected judo fighter with the following steps:


1. Performing a loop by running the tool and initially passing the value 100% in the minimum support parameter. For each repetition, the value of the minimum support parameter is decreased by 10%, until it reaches 30% or the execution time exceeds 2 seconds.
2. Performing the previous loop by running the tool and passing the value 4 in the minimum pattern size parameter. With this parameter, the tool will only return sequential patterns with at least 4 items.
3. Performing the previous loop again by passing the value corresponding to the IPPON item in the required item parameter, instead of passing a value in the minimum pattern size parameter. With this parameter, the tool will only return sequential patterns containing IPPON.

There are some significant numbers in this script program.

The first number is the value 4 defined for the minimum pattern size parameter used in the second loop. This is because a judoka must perform at least 3 actions to obtain a score: an approach action, a grip action, and the attack action. Given that a score is also considered an item in the sequential database, the hypothesis is that any sequential pattern involving the achievement of a score must have at least 4 items.

The second number is the value 10% that is subtracted from the minimum support parameter for each repetition in the loops. This is because the largest number of fights noted for a judoka is 10. Thus, the most adequate minimum value of support would be 10%, which would represent a single pattern sustained in at least one fight.

The third number is the minimum value of 30% for the minimum support parameter. During the running tests in the development phase, the execution time degraded excessively with a support of 20% or less, taking hours to finish in some simulations. The degradation is probably due to the number of fights of each judoka, the average size of the items and elements contained in each sequence, and the number of possible items in the system.

The last significant number is 2 seconds as the limit for the execution time of the tool. This is because in the running tests lasting close to 2 seconds, the number of sequential patterns found was around 300,000. Thus, the limit was imposed so that huge files with much information would not be generated, which would cost too much to analyze.

6.2 Technical Tactical Analysis of Judo

In Step 3.2, the task of a tool user (or a technical tactical analyst) is to extract useful information to make winning strategic decisions from the sequential patterns found. For this task, the tool offers two features. The first resource is a CSV file containing the columns: found sequential pattern, support value, number of elements, and number of fighting actions (example in Table 3). The second feature is also a CSV file, which presents some simple statistics extracted from the file of found sequential patterns, with the columns: item/element, type, size, and number of occurrences (example in Table 4).

Table 3. Sample records from the sequential patterns file of "Brazilian judoka 1".

Sequential pattern                                      Support   Elements   Actions
(Left anteroposterior approach) (Trying to grasp)       3         4          5
(Right anteroposterior approach, Left collar) (Ippon)
(Left anteroposterior approach) (Left sleeve and        3         4          4
right sleeve) (Left collar) (Ippon)
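To make the adaptations of Section 5 concrete, the sketch below segments one judoka's noted actions into elements and encodes a fight in the plain sequence-database format read by SPMF, in which -1 closes an itemset and -2 closes a sequence. This is an illustration, not the authors' JudoDataMining code; the action names and the integer encoding are hypothetical.

# Illustrative sketch (not the authors' JudoDataMining tool): split noted
# fight actions into elements according to the rules of Section 5 and
# write them in the SPMF sequence-database format.
REFEREE_START = {"hajime", "yoshi"}
REFEREE_PAUSE = {"mate", "sonomama", "soremade"}

def to_elements(noted_actions):
    """A pause command closes the current element; a repeated action
    closes it as well and starts the next element with that action."""
    elements, current = [], []
    for action in noted_actions:
        if action in REFEREE_START:
            continue
        if action in REFEREE_PAUSE or action in current:
            if current:
                elements.append(current)
            current = [] if action in REFEREE_PAUSE else [action]
        else:
            current.append(action)
    if current:
        elements.append(current)
    return elements

def to_spmf_line(elements, item_ids):
    """Encode one fight (a sequence of elements) as one SPMF input line."""
    parts = []
    for element in elements:
        parts += [str(item_ids[a]) for a in element] + ["-1"]
    return " ".join(parts + ["-2"])

fight = ["hajime", "left approach", "trying to grasp", "mate",
         "hajime", "left approach", "attack+deashi-harai+left", "waza-ari"]
actions = sorted(set(fight) - REFEREE_START - REFEREE_PAUSE)
ids = {a: i + 1 for i, a in enumerate(actions)}
print(to_spmf_line(to_elements(fight), ids))

A file of such lines could then be mined with CM-SPAM, for example through SPMF's command-line interface (something like java -jar spmf.jar run CM-SPAM input.txt output.txt 0.5 for a 50% minimum support), which corresponds to the support loop described above.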


Table 5 presents some simple statistics on the quantities of sequential patterns found in the proof of concept simulations. On the other hand, Table 6 shows the quantities of patterns found and the execution times from the simulations performed for Brazilian judoka 1.

The analysis of the results supported the conclusion that the sequential pattern mining tool was able to extract useful information for tactical and strategic decision-making in judo.

Table 4. Sample records from the simple statistics file of the sequential patterns of "Brazilian judoka 1".

Item/element                                    Type      Size   Occurrences
Right anteroposterior approach                  Item      1      21
Left collar and right sleeve                    Item      1      9
(Right anteroposterior approach, left collar)   Element   2      8

Table 5. Statistics of the sequential patterns found in the proof of concept simulations.

Number of      Minimum number    Maximum number    Average
simulations    per simulation    per simulation
173            0                 5757309           168815.8

Table 6. Quantities of patterns found and execution times from the simulations performed for "Brazilian judoka 1".

Support   Min. pattern size   Required item   Patterns discovered   Execution time (s)
100%      -                   -               21                    0
90%       -                   -               325                   0.016
80%       -                   -               10626                 0.125
70%       -                   -               1280831               9.656
100%      4                   -               1                     0
90%       4                   -               249                   0
80%       4                   -               10475                 0.109
70%       4                   -               1280453               9.786
100%      -                   Ippon           0                     0
90%       -                   Ippon           0                     0
80%       -                   Ippon           0                     0
70%       -                   Ippon           0                     0
60%       -                   Ippon           0                     0.015
50%       -                   Ippon           0                     0
40%       -                   Ippon           0                     0
30%       -                   Ippon           41                    0.016

6.3 Evaluation of the Characteristics of the Data Mining Technique

In Step 3.3, the evaluation should be made by comparing the values of each characteristic (interpretability, precision, and flexibility) of the technique used by the prototype with the values defined in the model proposed by Ofoghi et al. [2]. As defined in Step 2.3, the sequential pattern mining technique should present the following characteristic values:
Interpretability: Very high, owing to the large amount of response information produced.
Precision: High, owing to the high degree of dependence between the results generated and the important decisions taken.
Flexibility: Very low, owing to the very short time limit to use the results.

6.3.1 Interpretability

In this proof of concept, the results are sequential patterns consisting of a series of consecutive elements (sets of items) representing judo fight actions. Given that these results are not numerical values and do not require interpretation in reading, their interpretation can be considered an easy task. Moreover, a judo specialist is widely familiar with the fighting action names, which further enhances the interpretability. If there is any difficulty or doubt about some fighting action, the ontology of judo fight can be used as support. Thus, it is possible to justify that the interpretability of the sequential pattern mining is very high, because having understood what a sequential pattern represents allows for an interpretation without difficulties.

6.3.2 Precision

In this study, the results are sequential patterns, which represent a series of fight action sets that occur with a certain frequency. Typically, a sequential pattern reveals similar information as a judo coach would find by watching a judo match once or more. Therefore, any combat strategy decision will be made based on the information found by the judo coach.

By regarding the judo fight notations as accurate and reliable, it is possible to consider


that the results will maintain the same level of accuracy and reliability. However, to confirm that accuracy, it would be necessary to carry out in practice the combat strategies defined based on the information extracted from the sequential patterns found. Given that such practice is not part of the scope of this study, the precision of the sequential pattern mining could not be verified to be high.

6.3.3 Flexibility

Given that the performance analysis requirement is real-time decision-making, and the data mining method is rule mining, the flexibility characteristic must be very low (Table 1). However, the term "real-time" must be relativized. Sequential patterns are found on a sequence base assembled from the judo fight notations of a given judoka. In judo, the time limit for using the found sequential patterns is until the beginning of the match, when the strategies have already been decided between the athlete and his coach.

The sequential pattern mining technique does not allow variations in parameters and data format. The minimum support parameter is obligatory. The processed data must be in the form of a sequence database. This proof of concept shows that the judo fight notation data have to be adapted to use the data mining technique.

In this study, the number of sequential patterns found per time unit (an average of 150,000 patterns per second) is highly satisfactory when the support parameter and/or the number of judo fights processed are not very low. On the other hand, the execution time degrades when the number of sequential patterns found is very large. Considering the time it would take for an analyst to work on the results, which can be large depending on the minimum support parameter, sequential pattern mining can support the technical tactical analysis. Thus, the low flexibility of the sequential pattern mining can be justified.

7 CONCLUSION

This paper described a prototype that was developed using a sequential pattern mining technique to find patterns in a judo fight notation database and to extract useful information that would enable decision-making on the actions/reactions or strategies needed to defeat a particular opponent in a contest. The analysis of the simulation results with the prototype shows that the use of the sequential pattern mining technique produces the information needed to satisfy the desired performance analysis requirement. Thus, it is confirmed that the main objective of this study was achieved, as the proof of concept was able to validate the proposed data mining approach.

This study has the following contributions. The first is the novelty of using AI in judo; to the best of our knowledge, no work has been found on that subject during our search. Second, further research on the subject is enabled through the development of an ontology of judo fight and the proposed data mining approach. Finally, this study validated the mapping of performance analysis requirements and data mining methods [2].

A limitation of this study is that when there is a small amount of fight notation data for a judoka, the use of the tool becomes infeasible owing to the degraded execution time and/or the large number of resultant sequential patterns. Another limitation is that it is not possible to verify the level of accuracy and reliability of the sequential patterns found. To do so, it would be necessary to practice the judo strategy decisions based on the information extracted using the tool, which was outside the scope of this study.

In future research, the information extracted from the sequential patterns is expected to be evaluated in terms of making strategic judo combat decisions, thereby verifying the accuracy of the sequential pattern mining. A future study may also apply this approach to the technical tactical analysis of judo using


other data mining techniques or methods. Finally, it is expected that the approach will be adapted to other sports as well.

REFERENCES

[1] A. Lees, "Technique analysis in sports: a critical review," J Sports Sci, vol. 20, issue 10, pp. 813-828, 2002.

[2] B. Ofoghi, J. Zeleznikow, C. MacMahom, and M. Raab, "Data Mining in Elite Sports: A Review and a Framework," Measurement in Physical Education and Exercise Science, vol. 17, issue 3, pp. 171-186, 2013.

[3] G. Marcon, E. Franchini, J.R. Jardim, and T.L. Barros Neto, "Structural analysis of action and time in sports: Judo," J Quant Anal Sports, vol. 6, issue 4, article 10, 2010.

[4] B. Miarka, "Construção, validação e aplicação de um programa computadorizado para análise de ações técnicas e táticas em atletas de Judô: diferenças entre classes, categorias e níveis competitivos" (MSc Thesis, in Portuguese), Universidade de São Paulo, Brazil, 2010.

[5] M. Haghighat, H. Rastegari, and N. Nourafza, "A review of data mining techniques for result prediction in sports," ACSIJ Advances in Computer Science: an International Journal, vol. 2, issue 5, pp. 7-12, 2013.

[6] H. Novatchkov, and A. Baca, "Artificial intelligence in sports on the example of weight training," J Sports Sci and Med, vol. 12, issue 1, pp. 27-37, 2013.

[7] B. Miarka, "Modelagem das interações técnicas e táticas em atletas de Judô: comparações entre categoria, nível competitivo e resultados de combates do Circuito Mundial de Judô e dos Jogos Olímpicos de Londres" (PhD Thesis, in Portuguese), Universidade de São Paulo, Brazil, 2014.

[8] P. Roy, S.S. Sathya, and N. Kumar, "Ontology Assisted Data Mining and Pattern Discovery Approach: A Case Study on Indian School Education System," Adv Nat Appl Sci, vol. 9, issue 6, pp. 555-560, 2015.

[9] N. Noy, and D.L. McGuinness, "Ontology Development 101: A Guide to Creating Your First Ontology," Stanford University, USA, 2002.

[10] M. Poveda-Villalón, A. Gómez-Pérez, and M.C. Suárez-Figueroa, "OOPS! (Ontology Pitfall Scanner!): an on-line tool for ontology evaluation," International Journal on Semantic Web and Information Systems, vol. 10, issue 2, pp. 7-34, 2014.

[11] M. Musen, "The Protégé project: A look back and a look forward," AI Matters, Association of Computing Machinery Specific Interest Group in Artificial Intelligence, vol. 1, issue 4, pp. 4-12, 2015.

[12] Emerson Franchini, B-9119-2012, ResearcherID.com. Available at http://www.researcherid.com/rid/B-9119-2012. Accessed on 26 Sep 2017.

[13] R. Srikant, and R. Agrawal, "Mining sequential patterns: Generalizations and performance improvements," in The International Conference on Extending Database Technology, 5, pp. 1-17, 1996.

[14] P. Fournier-Viger, A. Gomariz, M. Campos, and R. Thomas, "Fast Vertical Mining of Sequential Patterns Using Co-occurrence Information," in Pacific-Asia Conference on Knowledge Discovery and Data Mining, 18, pp. 40-52, 2014.

[15] P. Fournier-Viger, J.C.W. Lin, A. Gomariz, T. Gueniche, A. Soltani, Z. Deng, and H.T. Lam, "The SPMF open-source data mining library version 2," in Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 23, pp. 36-40, 2016.

[16] P. Fournier-Viger, J.C.W. Lin, R.U. Kiran, Y.S. Koh, and R. Thomas, "A Survey of Sequential Pattern Mining," Ubiquitous International - Data Science and Pattern Recognition, vol. 1, issue 1, pp. 54-77, 2017.


Celestial Spectra Classification Based on Support Vector Machine

Jingchang Pan, Gaoyu Jiang, Yude Bu, Zhenping Yi and Xin Tan
School of Mechanical, Electrical & Information Engineering, Shandong University at Weihai,
Weihai 264209, China
pjc@sdu.edu.cn, jgyxyyxy@gmail.com, buyude001@163.com, yizhenping@sdu.edu.cn, tanxin_0911@163.com

ABSTRACT

Spectra classification is essentially a pattern recognition problem, so using SVM (Support Vector Machine) for spectral classification is feasible. In addition, the spectra of special objects can be found through the classification. In this paper, we apply the SVM method to spectra classification through Matlab simulation programs and analyze the results of the experiments. The experimental results show ideal classification effects.

KEYWORDS
Support vector machine, Spectrum, Classification

1 INTRODUCTION
The large number of spectra obtained from sky surveys such as the Sloan Digital Sky Survey (SDSS) and the survey executed by the Large Sky Area Multi-Object Fibre Spectroscopic Telescope (LAMOST) provide us with opportunities to search for peculiar or even unknown types of spectra [1]. How to deal with the massive number of spectra to obtain the necessary scientific information in time has become a very important issue, and the spectral classification of celestial bodies is one of the key tasks. The Support Vector Machine (SVM) is an effective method to solve this problem [2]. In this paper, the Support Vector Machine is discussed, the preprocessed spectral data is classified by this method, and the applicability of the method to spectral classification is verified.

2 SVM
SVM is a relatively new pattern recognition method developed on the basis of statistical learning theory in recent years, having many unique advantages in solving nonlinear and high dimensional pattern recognition problems. Because of its excellent learning performance, this technology has been successfully applied in many fields such as face detection, handwritten numeral recognition, automatic text classification and so on.

2.1 Optimal Hyperplane
The SVM method is derived from the optimal hyperplane in the linearly separable case. As shown in Fig. 1, for two classes in the linearly separable case in two-dimensional space, the circles and squares in the figure represent the two classes of training samples. H is a line for classification that separates the two classes without errors. For one class, straight line H1 is parallel to H and passes through the samples of that class that are closest to line H; for the other class, the same is true of line H2. The distance between them is called the margin. The so-called optimal classification line is one that can not only divide the samples into two


classes correctly, but also maximize the margin. The former guarantees minimal empirical risk (e.g. the training error is 0). It can be seen later that maximizing the margin amounts to minimizing the confidence interval term in the generalization bound, so that the real risk is minimal. When extended to a high dimensional space, the optimal classification line becomes the optimal hyperplane.

Figure 1. Illustration of the optimal hyperplane.

After a series of mathematical derivation, the optimal classification function is obtained:

f(x) = \mathrm{sgn}\{(w^{*} \cdot x) + b^{*}\} = \mathrm{sgn}\{\sum_{i=1}^{n} \alpha_i^{*} y_i (x_i \cdot x) + b^{*}\}    (2-1)

2.2 Generalized Optimal Hyperplane
The optimal hyperplane is discussed on the premise of linear separability. In linearly inseparable cases, a slack term \xi_i \ge 0 can be added:

y_i[(w \cdot x_i) + b] - 1 + \xi_i \ge 0    (2-2)

The problem of the generalized optimal hyperplane can then be reduced to finding the minimum of the following function under the constraints of condition (2-2):

\Phi(w, \xi) = \frac{1}{2}(w \cdot w) + C \sum_{i=1}^{n} \xi_i    (2-3)

where C is a specified constant which controls the degree of punishment for misclassified samples, so that a trade-off between the ratio of misclassified samples and the complexity of the algorithm is achieved.

2.3 Support Vector Machine
The final classification discriminant function (with hyperplane w \cdot x + b = 0) of the optimal and generalized linear classification functions discussed above contains only the inner product (x \cdot x_i) of the classified samples with the support vectors among the training samples. Thus, in order to solve the optimal linear classification problem in a feature space, all we need to know is the inner product operations in this space. If an inner product function K(x, x') is used instead of the dot product in the optimal hyperplane, it is equivalent to transforming the original feature space into a new feature space, and the discriminant function of equation (2-1) is updated:

f(x) = \mathrm{sgn}\{\sum_{i=1}^{n} \alpha_i^{*} y_i K(x_i, x) + b^{*}\}    (2-4)

The other conditions of the algorithm are unchanged. This is what is called the Support Vector Machine.

The basic ideas of support vector machines can be summarized as follows. Firstly, the input space is transformed into a high dimensional space by a nonlinear transformation, which is achieved by defining proper inner product functions. Then, the optimal hyperplane is obtained in this new space.
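For illustration, the decision function of equation (2-4) can be exercised with any SVM implementation. The sketch below, which is not the paper's MATLAB/LIBSVM code, trains a kernel SVM on toy two-class data with scikit-learn; the data are random placeholders.

# Illustrative sketch (not the paper's code): a kernel SVM whose decision
# function has the form of equation (2-4), trained on toy two-class data.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 0.5, (50, 2)), rng.normal(1, 0.5, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

clf = SVC(kernel="rbf", C=1.0)  # C controls the punishment of misclassified samples
clf.fit(X, y)

# Prediction only needs kernel evaluations K(x_i, x) against the support
# vectors, exactly as in equation (2-4).
print(len(clf.support_vectors_), clf.predict([[0.9, 1.1], [-1.2, -0.8]]))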


3 SPECTRA CLASSIFICATION
The automatic identification of celestial spectra must incorporate astronomical knowledge [3]. For astronomers, spectral lines are the most important feature of spectra. The transitions of atoms or molecules between different energy levels in a celestial body absorb or emit spectral lines. Different atoms and molecules have their own specific spectral lines. The distribution of the intensity at different wavelengths can be used to describe the radiation characteristics of celestial bodies.

There are all kinds of heavenly bodies in the universe. In the field of astronomy, the celestial bodies are first divided into normal celestial spectra and emission-line object spectra [4-5]. The spectra of normal celestial bodies include main stars and normal galaxies, while the emission-line object spectra include starburst galaxies, narrow line AGNs, wide line AGNs and quasars. The spectra classification can be expressed as in Fig. 2.

Figure 2. Physical classification of celestial spectra.

4 OVERALL DESIGN OF SPECTRAL CLASSIFICATION MODEL
4.1 General Design
The work of this paper is to use SVM to implement classifier 1, classifier 2 and classifier 3, through which the celestial spectrum is roughly classified. The training data and test data used in this paper are derived from SDSS DR7, and the categories involve stars (STAR), late stars (STAR_LATE), galaxies (GALAXY), quasars (QSO), and high redshift quasars (HIZ_QSO). Among them, galaxies (GALAXY) are divided into normal galaxies and emission-line galaxies [6-8].

4.2 Experimental Environment
Experiments run mainly in MATLAB R2010a, and the computer environment is configured as follows.
Processor: Intel Pentium dual-core T2390 @ 1.86GHz notebook processor
Motherboard: Lenovo 1GT30
Memory: 1GB (Hynix DDR2 667MHz)
Main hard drive: Hitachi HTS542525K9SA00 (250GB)
Operating system: Windows 7 Ultimate 32-bit
The LIBSVM software package developed by Dr. Chih-Jen Lin of National Taiwan University was used for the experiments. It is a simple SVM software package in common use.

4.3 Classifier Design
Training data is from DR7, downloaded from the SDSS website. There are GIF and FITS files of five classes of celestial spectra defined by Sloan, including STAR, STAR_LATE, GALAXY, QSO and HIZ_QSO.


After manual removal of both low-confidence and erroneous spectral files and dividing GALAXY into normal galaxies and emission-line galaxies, a total of 300 FITS files are selected as training data, 50 files from each class. 150 spectral data from STAR, STAR_LATE, and the normal galaxies in GALAXY are used as normal celestial spectra, and data from QSO, HIZ_QSO, and the emission-line spectra in GALAXY are used as emission-line celestial spectra. A total of 100 spectral data from STAR and STAR_LATE are used as stellar spectral data, and the normal galaxy spectra in GALAXY are used as normal galaxy spectral data. A total of 100 spectral data of QSO and HIZ_QSO are used as quasar spectral data, and the emission-line galaxies in GALAXY are used as emission-line galaxy spectral data.

In terms of the FITS files downloaded from DR7, the abscissa of a spectrum is the wavelength, ranging from 3800Å to 9200Å, and the ordinate is the corresponding flux. Because some spectra are missing data at the front of the wavelength range, we use the wavelength range from 3810Å to 9200Å for training.

The two parameters, COEFF0 and COEFF1, can be found in the corresponding FITS file, representing the starting wavelength and the step size, respectively. The dimension n is obtained from the FITS header file, that is, the number of steps, and the laboratory wavelength is obtained by the following formula:

\mathrm{wave}(n) = 10^{COEFF0 + (n-1) \cdot COEFF1}    (4-1)

The corresponding flux is obtained from the table of the main block of the FITS file.
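Formula (4-1) is straightforward to apply in code. The sketch below is an illustration, not the paper's MATLAB code, and the COEFF0/COEFF1 values are typical placeholders rather than values read from a specific spectrum.

# Illustrative sketch (not the paper's code): recover the wavelength grid
# of a DR7 spectrum from the COEFF0/COEFF1 header keywords, per (4-1).
import numpy as np

def wavelength_grid(coeff0, coeff1, n_pixels):
    """Wavelengths in Angstroms for each of the n_pixels flux samples."""
    n = np.arange(n_pixels)  # 0-based index, so the exponent uses n, not n-1
    return 10.0 ** (coeff0 + n * coeff1)

wave = wavelength_grid(coeff0=3.5800, coeff1=1e-4, n_pixels=3850)
print(wave[0], wave[-1])  # roughly 3800 Angstroms ... 9200 Angstroms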
The specific experimental process is as follows.
1) Read the training data used in the experiment so that it can be processed by computers.
2) Generate class labels.
3) Reduce the dimension of the training data.
4) Use the LIBSVM software package to train, select parameters, generate training templates and record the training time.

5 IMPLEMENTATION OF SPECTRAL CLASSIFICATION MODEL
5.1 Description of Test Data
Similar to the training data, the test data is from DR7, downloaded from the SDSS website. There are GIF and FITS files of five classes of celestial spectra defined by Sloan, including STAR, STAR_LATE, GALAXY, QSO and HIZ_QSO. After manual removal of both low-confidence and erroneous spectral files and dividing GALAXY into normal galaxies and emission-line galaxies, a total of 300 FITS files are selected as test data, 50 files from each class. 150 spectral data from STAR, STAR_LATE, and the normal galaxies in GALAXY are used as normal celestial spectra, and data from QSO, HIZ_QSO, and the emission-line spectra in GALAXY are used as emission-line celestial spectra. A total of 100 spectral data from STAR and STAR_LATE are used as stellar spectral data, and the normal galaxy spectra in GALAXY are used as normal galaxy spectral data. A total of 100 spectral data of QSO and HIZ_QSO are used as quasar spectral data, and the emission-line galaxies in GALAXY are used as emission-line galaxy spectral data. The test data and training data are similar in format, but differ in content.

5.2 Testing Process
1) Read the test data used in the experiment so that it can be processed by computers.
2) Generate class labels.
3) Use the LIBSVM package to test and record the final accuracy.
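As an illustration of the whole flow (labels, dimension reduction, training, and accuracy measurement), the following sketch uses Python with scikit-learn equivalents instead of the paper's MATLAB/LIBSVM setup; the arrays are random placeholders standing in for the spectra.

# Illustrative sketch (not the paper's scripts): label generation, PCA
# reduction to 43 dimensions, SVM training, and accuracy measurement.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import SVC

rng = np.random.default_rng(1)
train_X = rng.normal(size=(300, 3850))  # 300 training spectra (placeholder)
train_y = np.repeat(np.arange(6), 50)   # 6 classes, 50 spectra each
test_X = rng.normal(size=(300, 3850))   # 300 test spectra (placeholder)
test_y = np.repeat(np.arange(6), 50)

pca = PCA(n_components=43).fit(train_X)              # reduce the dimension
clf = SVC().fit(pca.transform(train_X), train_y)     # train the classifier

accuracy = clf.score(pca.transform(test_X), test_y)  # record the accuracy
print(f"accuracy: {accuracy:.4%}")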


5.3 Test Results
1) The results of classifier 1 are shown in Table 1.
In the row of dimensionality reduction, E_star_43d is a matrix E obtained from the 50 star spectra, which are reduced to 43 dimensions through the PCA dimension reduction method. Using E_star_43d to reduce the dimension, the accuracy rate is 96.6667% and the training time is 0.1813 s. E_normal_total_43d is a matrix E obtained from the 150 normal celestial spectra, which are reduced to 43 dimensions through the PCA dimension reduction method. According to the same principle, E_celestial_total_43d is obtained from the 150 emission-line celestial spectra, and E_total_43d from all 300 training spectra.

Table 1. Results of classifier 1

Dimensionality reduction | E_star_43d | E_normal_total_43d | E_celestial_total_43d | E_total_43d | No reduction
Accuracy                 | 96.6667%   | 97.3333%           | 95.3333%              | 96.3333%    | 97%
Time of training (s)     | 0.1813     | 0.1818             | 0.1821                | 0.1824      | 0.9378

Table 2. Results of classifier 2

Dimensionality reduction | E_normal_total_43d | E_star_43d | E_star_late_43d | E_normal_galaxy_43d | No reduction
Accuracy                 | 97.3333%           | 99.3333%   | 99.3333%        | 96%                 | 97.3333%
Time of training (s)     | 0.1349             | 0.1068     | 0.1098          | 0.1072              | 0.2800

Table 3. Results of classifier 3

Dimensionality reduction | E_star_43d | E_celestial_total_43d | E_qso_43d | E_hiz_qso_43d | E_celestial_galaxy_43d | No reduction
Accuracy                 | 96%        | 96%                   | 98%       | 93.3333%      | 87.3333%               | 96.6667%
Time of training (s)     | 0.1056     | 0.1033                | 0.1097    | 0.1085        | 0.1053                 | 0.3337

2) The results of classifier 2 are shown in Table 2.
In the row of dimensionality reduction, E_normal_total_43d is a matrix E obtained from all 150 training spectra, which are reduced to 43 dimensions through the PCA dimension reduction method. Using E_normal_total_43d to reduce the dimension, the accuracy rate is 97.3333% and the training time is 0.1349 s. E_star_43d is a matrix E obtained from the 50 star spectra, which are reduced to 43 dimensions through the PCA dimension reduction method. According to the same principle, E_star_late_43d is obtained from the 50 late-type stellar spectra, and E_normal_galaxy_43d from the 50 normal galaxy spectra.
3) The results of classifier 3 are shown in Table 3.
In the row of dimensionality reduction, E_star_43d is a matrix E obtained from the 50 star spectra, which are reduced to 43 dimensions through the PCA dimension reduction method. Using E_star_43d to reduce the dimension, the accuracy rate is 96% and the training time is 0.1056 s. E_celestial_total_43d is a matrix E obtained from all 150 training spectra, which are reduced to 43 dimensions through the PCA dimension reduction method. According to the same principle, E_hiz_qso_43d is obtained from the 50 high-redshift quasar spectra, and E_celestial_galaxy_43d from the 50 emission-line galaxy spectra.

6 SPECTRAL CLASSIFICATION MODEL ANALYSIS

6.1 Analysis of Experimental Results
1) Using support vector machines to classify celestial spectra gives good results.
It can be seen from the above experimental results that support vector machines do an excellent job of classifying celestial spectra. The accuracy is basically above 95%, and the training time is short.
2) The method of dimension reduction affects the experimental results.
As can be seen from the above experimental results, the results obtained by using different data for dimensionality reduction are quite different, and the corresponding training times also differ.
In classifier 1, the result obtained from all normal celestial spectra through the PCA dimension reduction method is the best. In classifier 2, the result obtained from the star or late-type star spectra is the best. In classifier 3, the result obtained from the quasar spectra is the best.
From the above results, we can conclude that, before dividing into two classes, using the matrix E obtained by PCA dimension reduction of one of the classes to reduce the dimension of the other data gives the best classification result.
Moreover, it can be seen from the experimental results that the training time is longer without dimension reduction, and the effect is usually better after reducing the dimension.
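The best-performing setup above, in which the projection matrix E is learned from a single class and then applied to all data, can be written as a short sketch; the array names are illustrative, and scikit-learn's PCA stands in for the implementation used in the paper.

    from sklearn.decomposition import PCA

    def class_projection(X_one_class, X_all, n_dim=43):
        # Learn the projection matrix E from one class only,
        # e.g., the 50 star spectra or the 150 normal celestial spectra.
        pca = PCA(n_components=n_dim).fit(X_one_class)
        # Project every spectrum, whatever its class, through the same E.
        return pca.transform(X_all)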
6.2 Next Work
1) Analyze the misclassified samples and improve the training templates.
Misclassified samples are extracted and analyzed separately to find the causes of the errors, and the training data is improved in order to improve the training templates and obtain better classification results.
2) Make further analysis of dimension reduction.
From the above experimental results, we can see that the process of dimension reduction has a great impact on the results, so further work on dimension reduction may further improve the training templates. The experiment in this paper only reduces the spectra to 43 dimensions, and only uses the PCA dimension reduction method. The spectra can also be reduced to 3 dimensions, 2 dimensions and so on, to see whether the experimental results improve. Other dimension reduction methods, such as kernel entropy component analysis (KECA), can also be used, and the results compared to see if the training templates can be further improved.
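A sketch of this proposed follow-up experiment could reuse the train_and_test helper outlined after Section 5.2; the dimensions swept here are examples, and KECA itself is not implemented in this fragment.

    # Sweep the target dimensionality and record the resulting accuracies;
    # X_train, y_train, X_test, y_test are the data sets described above.
    results = {}
    for n_dim in (2, 3, 10, 43):
        _, accuracy = train_and_test(X_train, y_train, X_test, y_test, n_dim=n_dim)
        results[n_dim] = accuracy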
SUMMARY
The content of this study is to use support vector machines to classify celestial spectra, and the feasibility is verified by experiments.
After briefly introducing the LAMOST project and the FITS file, the principle of SVM and the classification of celestial spectra are introduced.
The process and results of the experiment are introduced in detail, and the corresponding analysis is given.
The main work in this paper is as follows:
1) Study the principle of the support vector machine, understand its operating mechanism and its feasibility in reality, and apply it in practice.
2) Understand the mechanism of celestial spectrum classification, become familiar with the various spectra of celestial bodies and the characteristics of their spectra, and with the corresponding FITS documents.
3) Use support vector machines to classify the celestial spectra. After repeated training, the best training templates are obtained, and the experimental results are then obtained by testing.
The experiments show that using support vector machines to classify celestial spectra is very effective, and that it is feasible in practice.

ACKNOWLEDGEMENT
This work was financially supported by
the National Natural Science Foundation of
China (U1431102).

The Agent-Based Model of the Dynamic Spectrum Access Networks with Network Switching Mechanism

Marcel Vološin1 and Jana Zausinova2

1 Faculty of Electrical Engineering and Informatics, 2 Faculty of Economics
Technical University of Košice, Letná 9, 040 01 Košice, Slovakia
marcel.volosin@student.tuke.sk, jana.zausinova@tuke.sk

ABSTRACT

In this paper, we aim to introduce an initial agent-based model with the ability to capture two major mechanisms, dynamic spectrum access networks and network switching, that will be employed in future 5G communication systems. Dynamic spectrum access networks aim to tackle the traditional drawbacks of long-term frequency licenses, which result mainly in inefficient use of the assigned spectra. The network switching mechanism, on the other hand, allows the user to increase its utility through the ability to choose the operator in real time. In the paper, a two-level agent-based model with a focus on the economic characteristics of the network is proposed. On the first layer, the wholesale market underlying the dynamic spectrum access network is modeled using the MASCEM framework, initially proposed for grid-electricity distribution. The second layer models the retail market, consisting of a set of user equipments (UEs) capable of network switching. The proposed early-stage model confirms the theoretical expectations regarding the price evolution and other technical indicators.

KEYWORDS

agent-based modeling, cognitive networks, electronic markets, MASCEM

1 INTRODUCTION

Spectrum trading allows the owners of certain spectrum licenses to transfer or lease all or part of their rights and obligations under their license to another party [1]. Several countries have implemented spectrum trading, but the trading process is often time-consuming, hence hampering its usage. The UK regulator Ofcom is at the forefront of the spectrum trading arena, allowing spectrum sale and spectrum leasing [2]. The relevance of spectrum leasing gains further importance in conjunction with the concept of software-defined radio (SDR). SDR enhances the efficiency of frequency spectrum utilization through embedded software allowing the terminal to operate in multiple frequency bands using numerous transmission protocols [3]. Spectrum leasing and SDR are the technologies coined throughout the paper as dynamic spectrum access (DSA) strategies.

DSA strategies fundamentally change the traditional telecom model based on "vertical integration", where a single entity delivers the service and maintains the network and the network infrastructure [4]. Initially, the available services were mainly limited to telephony, radio, and television. However, we witness the convergence of these services nowadays; the roles of the service provider and the network owner are separated, and the service providers get access to the network and the end customers through secondary spectrum trading on fair and non-discriminatory conditions [5]. This telecommunication concept is recognized as the open access network.

Most trades today are direct trades between organizations, with the regulator as an intermediary giving the final consent to commit the trades [2]. However, in order to facilitate spectrum trades on a shorter time scale, an organizational unit such as a band manager could be introduced to mediate between traders. Furthermore, organizational units could be introduced to monitor compliance with committed trades and ensure that the spectrum is not misused.
Overall, an ecosystem is required to realize spectrum micro-trading [2] and to be able to analyze the behavior of stakeholders on both the retail and the wholesale market. Therefore, in this paper, an agent-based model of the dynamic spectrum access network capable of capturing the essential network characteristics is proposed.

2 PROPOSED MODEL

To capture the stochastic nature of spectrum trading in a cognitive network, an agent-based model based on MASCEM [6] was implemented in the modeling tool NetLogo. The Multiagent Simulator of Competitive Electricity Markets (MASCEM) is a multiagent platform that can simulate a variety of different market players. The goal of this framework is the ability to simulate not only many types of players but also as many market models as possible, with long-term and short-term decisions implemented and ready to be used. It also includes numerous negotiation mechanisms that can cope with diverse time scales. It was created to simulate electricity markets, which, similarly to DSA networks, consist of both a wholesale and a retail market [7].

Figure 1 Model scheme

There are three different types of agents involved in the proposed model, as Figure 1 shows, each interested in achieving its own goals: an administrator, the owner of the spectrum license, who leases its rights on the wholesale market to the investors, who in turn offer services on the retail market to the end-users willing to use the offered services such as phone calls, file transfers, streaming, etc.

2.1 Wholesale Market

In this stage, the investors acquire chunks of the frequency spectrum from the administrator for prices set via bilateral negotiation. The following equations are used to set the starting offers:

starting_offer_{i+1} = starting_offer_i ± δ_{i+1}    (1)

δ_{i+1} = starting_offer_i × (β + Δ_i / (BW_avail_i × α))    (2)

Δ_i = BW_avail_i − BW_leased_i    (3)

Each new offer starting_offer_{i+1}, calculated by both the investor and the administrator, depends on the starting offer from the previous iteration, the bandwidth the administrator (investor) wanted to rent out (lease), the amount it actually rented out (leased), and the price-shaping parameters α and β. The stakeholders set their limit prices for each round of the negotiation according to:

limit_price_i = starting_offer_i ± ϑ,    (4)

where ϑ represents the limit offer parameter. During the process of negotiation, agents adjust their offers simply via:

offer_{i+1} = offer_i ± ε,    (5)

where ε represents a fixed price-change parameter, which the administrator uses to lower, and the operator to increase, its offer with respect to the preset limit_price_i of the actual round.
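A minimal sketch of one such bilateral negotiation under Eqs. (1)-(5) follows; the settlement rule when the offers cross (taking the midpoint) and the stopping condition are our own illustrative assumptions, not part of the original MASCEM specification.

    def next_starting_offer(start_prev, bw_avail, bw_leased, alpha, beta, raise_price):
        # Eqs. (1)-(3): the shift delta scales with the unleased share of bandwidth.
        delta = start_prev * (beta + (bw_avail - bw_leased) / (bw_avail * alpha))
        return start_prev + delta if raise_price else start_prev - delta

    def negotiate(admin_start, inv_start, theta, eps, max_rounds=1000):
        # Eq. (4): limit prices bound how far each side is willing to move.
        admin_limit = admin_start - theta  # administrator will not sell below this
        inv_limit = inv_start + theta      # investor will not pay above this
        admin_offer, inv_offer = admin_start, inv_start
        for _ in range(max_rounds):
            if inv_offer >= admin_offer:   # offers crossed: a deal is struck
                return (admin_offer + inv_offer) / 2.0  # assumed settlement rule
            # Eq. (5): fixed step eps; the administrator lowers, the investor raises.
            admin_offer = max(admin_offer - eps, admin_limit)
            inv_offer = min(inv_offer + eps, inv_limit)
            if admin_offer == admin_limit and inv_offer == inv_limit:
                break                      # both at their limits: no agreement
        return None

With the Table 1 values ϑ = 0.3 and ε = 0.01, for example, negotiate(1.0, 0.5, 0.3, 0.01) settles at a price of 0.75 once the two offers meet.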
2.2 Retail Market

Due to the different characteristics of the frequency spectrum and electric energy, it is necessary to define suitable rules to be applied when trading takes place on the retail market. The acceptance probability function was adopted from [8] with some slight modifications.
The spectrum obtained via bilateral negotiation is sold to the end-users on the retail market for prices accommodated by the end-users' demand. The end-user is capable of network switching in real time, based on the instantaneous price offered by the operator.

The following equations are used for the channel retail price calculation:

p_i = p_{i,t} + (Ψ_{i,t−1} − 0.5) × μ    (6)

Ψ_{i,t−1} = 1/2,                    if (BW_avail_i = 0) ∧ (S_i = 0)
            0,                      if (BW_avail_i > 0) ∧ (S_i = 0)
            S_i^{idle→conn} / S_i,  if (BW_avail_i > 0) ∧ (S_i > 0)    (7)

where BW_avail_i represents the number of available frequency channels, S_i is the total number of end-user connection attempts towards an operator, and S_i^{idle→conn} denotes the number of successful connections.

Demand on the retail market is created by end-users, who randomly switch their states during the simulation between three different options, IDLE, ACTIVE and CONNECTED, which brings stochasticity to the model. In the IDLE state, as the name suggests, agents are inactive. The transition from IDLE to ACTIVE takes place randomly, with probabilities that change along a Gauss-like curve on the interval 〈0.1; 0.8〉, which emulates the different activity of users during day and night. When ACTIVE, users evaluate each offer using the following equation:

AP_i = 1 − e^{−c(1−p_i)^γ},    (8)

where the acceptance probability (AP) of the best offer represents the user's tendency to make a deal with the winning operator and switch its state to CONNECTED. Parameters c and γ adjust the sensitivity of users towards the price p. Figure 2 shows the acceptance probability as a function of the retail price when the parameters c and γ are set according to Table 1.

Figure 2 Acceptance probability according to retail price

Table 1 Simulation parameters

Parameter   | Value      | Description
N_bts       | 1          | Number of base stations
BW_total    | 200        | Total number of available channels
N_inv       | 2          | Number of operators
N_end-users | 〈300; 500〉 | Number of end-users
α_adm       | −0.7       | Price-shaping parameter of the administrator
β_adm       | 0.02       | Price-shaping parameter of the administrator
α_inv       | −3         | Price-shaping parameter of the operator
β_inv       | 0.1        | Price-shaping parameter of the operator
p_min       | 0.15       | Minimum wholesale price
ε           | 0.01       | Negotiation parameter
P_act       | 〈0.1; 0.8〉 | End-users' activation probability
P_disc      | 1          | End-users' disconnection probability
ϑ           | 0.3        | Limit offer parameter
γ           | 3          | End-users' price sensitivity parameter
δ           | 0.5        | End-users' utility sensitivity parameter
μ           | 0.2        | Price coefficient
c           | 8          | Acceptance probability parameter
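Eqs. (6)-(8) and the three-state life cycle can be sketched as follows, using the Table 1 values c = 8, γ = 3 and μ = 0.2; the transition bookkeeping is an illustrative reading of the description above rather than the exact NetLogo implementation.

    import math
    import random

    C, GAMMA, MU = 8, 3, 0.2  # Table 1: c, gamma and the price coefficient mu

    def acceptance_probability(price):
        # Eq. (8): AP = 1 - exp(-c * (1 - p)^gamma)
        return 1.0 - math.exp(-C * (1.0 - price) ** GAMMA)

    def retail_price(prev_price, psi_prev):
        # Eq. (6): the price drifts upward when more than half of the recent
        # connection attempts succeeded (psi > 0.5), downward otherwise;
        # psi itself follows the case split of Eq. (7).
        return prev_price + (psi_prev - 0.5) * MU

    def end_user_step(state, p_act, best_price, p_disc=1.0):
        # IDLE -> ACTIVE with the day/night-dependent probability p_act;
        # ACTIVE -> CONNECTED with probability AP(best offer);
        # CONNECTED -> IDLE with probability p_disc (1 in Table 1).
        if state == "IDLE":
            return "ACTIVE" if random.random() < p_act else "IDLE"
        if state == "ACTIVE":
            if random.random() < acceptance_probability(best_price):
                return "CONNECTED"
            return "ACTIVE"
        return "IDLE" if random.random() < p_disc else "CONNECTED"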
3 SIMULATION RESULTS

Multiple simulations were performed to test the model behavior in different scenarios.
Figure 3 Real-time behavior of the model

Figure 4 Model behavior in accordance with the number of end-users
Figure 3 shows the real-time data collected from a single simulation during a time interval of 5 days, each consisting of 500 virtual time units, so-called ticks, in NetLogo. In Figure 3 a), the negotiated wholesale prices between the administrator and one of the investors can be observed. As can be seen, the price reacts, with a slight delay, to the varying end-users' activity that follows a Gauss-like curve. This phenomenon was found to be caused by the equations used in MASCEM.

Figure 3 b) shows the frequency channel usage. Investors tend to rent more during peak hours, as expected, and are very successful on the retail market too. The investor's profit is also affected by the end-users' activity, but from Figure 3 c) it is obvious that the highest profits were gained thanks to the delay on the wholesale market. The retail market price, Figure 3 d), does not change its mean value during the simulated days, but its variance does, being higher during the non-peak hours.

Numerous simulations were executed to determine the impact of the overall network load generated by the end-users. Figure 4 illustrates the stability of the model with different numbers of end-users. The results show that an increasing number of end-users in the network results in higher wholesale prices, which also have a higher variance, due to the characteristics of the MASCEM negotiation process and the fact that the activity of end-users changes during the day. The mean retail prices also rise, but unlike the wholesale price, more users result in less volatile, thus more stable, prices.

4 CONCLUSION

In the paper, we described an initial agent-based model of spectrum trading in a cognitive network with variable stochastic demand of the end-users, inspired by MASCEM. In the model, an administrator (the owner of the spectrum) and non-cooperating operators providing the services were also present. Multiple simulations were executed to verify and illustrate the system's behavior in different conditions. We can conclude that the proposed model with bilateral negotiation can capture the vital characteristics of the cognitive network and performs well regarding spectrum trading, as the simulation results suggest. However, it features a noticeable price lag, as can be seen in the plots, which affects the incomes of the operators. This phenomenon, although natural in MASCEM, would require additional effort to overcome when deployed in a real environment.

REFERENCES

1. Ofcom, "Simplifying Spectrum Trading: Spectrum leasing and other market enhancements," 2011, p. 8.
2. P. Grønsund, R. MacKenzie, P.H. Lehne, K. Briggs, O. Grøndalen, P.E. Engelstad, and T. Tjelta, "Towards spectrum micro-trading," Future Network & Mobile Summit (FutureNetw), 2012.
3. H. Arslan, ed., "Cognitive radio, software defined radio, and adaptive wireless systems," Vol. 10, Berlin: Springer, 2007, p. 16.
4. N. Zhang, H. Liang, N. Cheng, Y. Tang, J.W. Mark, and X.S. Shen, "Dynamic spectrum access in multi-channel cognitive radio networks," IEEE Journal on Selected Areas in Communications, 32.11, 2014, pp. 2053-2064.
5. P. Cramton and L. Doyle, "Open access wireless markets," Telecommunications Policy, 2017, pp. 379-390.
6. I. Praça, C. Ramos, Z. Vale, and M. Cordeiro, "MASCEM: A multiagent system that simulates competitive electricity markets," IEEE Intelligent Systems, 18.6, 2003, pp. 54-60.
7. Z. Vale, T. Pinto, I. Praça, and H. Morais, "MASCEM: electricity markets simulation with strategic agents," IEEE Intelligent Systems, 26.2, 2011, pp. 9-17.
8. J. Pastirčák, L. Friga, V. Kováč, J. Gazda, and V. Gazda, "An Agent-Based Economy Model of Real-Time Secondary Market for the Cognitive Radio Networks," Journal of Network and Systems Management, 12.10.2015, pp. 1-17.

ACKNOWLEDGEMENT

This work was supported by the Slovak Research and Development Agency, project number APVV-15-0358.