Dynamic Composable Analytics on Consumer Behaviour
Abstract. Large enterprises and companies often use different tools and systems, distributed across their branches, to run daily business operations. The collected data and logs have significant potential to provide useful information and insights for the company; however, because the raw data is scattered across different platforms, staff may spend considerable time and effort processing it into useful information. This study proposes a framework called the Dynamic Composable Analytic Framework (DCAF), which accepts and composes raw data from different systems or tools and performs analytics on the composed data to identify or predict consumer behavior. The proposed framework performs data reception, data composition, data massaging and data analytics with minimal human interaction. DCAF thus contributes an end-to-end solution for converting raw data into predicted customer behavior information, improving the efficiency of customer analytics.
1 Introduction
Nowadays, we use systems and tools to aid our daily routines and tasks. These tools may keep activity logs or produce data in different formats, from which much helpful information and insight can be extracted. Data analytics can provide valuable information to improve tools, decision-making and productivity. Moreover, organizations that embrace analytics-driven management achieve better business performance. MIT Sloan Management Review, in collaboration with the IBM Institute for Business Value, conducted a study comprising a global survey and in-depth interviews. The survey indicates that top-performing companies use data analytics five times more than low performers [1], affirming that the adoption of data analytics is correlated with good organizational performance. The studies by Ren's team [2] and Sodenkamp's team [3] use data analytics to let users gain business value from system or stakeholder information and improve business performance; Ren's team applied machine learning and big data analytics to data from energy utility companies, gaining insight into millions of customers within a short time. Big data analytics can also improve the success rate of new products: Xu and his team found that it deals better with drastic changes in technology and market requirements than traditional marketing analytics, because it offers a better understanding of stakeholder information [4]. However, these benefits can only be realized if the analytics process completes successfully and useful information is extracted from the data. One essential factor is that the data must be clean and well structured. The steps taken to integrate and compose heterogeneous data from different sources also play a crucial part in increasing the effectiveness of analytics [5]. Meanwhile, few tools on the market help with performing data analytics on heterogeneous data.
A framework called the Dynamic Composable Analytic Framework (DCAF) is proposed to overcome the problems described above. DCAF has three main objectives. Tools available on the market can perform the job for each objective separately, but no single tool meets all three objectives at the same time. The first objective of our proposed work is to accept data coming from different sources and in different forms, such as different structures or formats carrying the same semantic meaning, and to convert the heterogeneous data into an expected file type, easing the subsequent data composition and analytics jobs. The second objective is to discover and compose information in real time, enabling DCAF to accept data input from different sources automatically as it arrives; DCAF also allows a newly discovered data source to communicate with it and supply data. The third objective is to provide a consumer analytics feature on the composed dataset. DCAF is built from several technologies: .NET, Windows Workflow Foundation (WF), Windows Communication Foundation (WCF) and R.NET. There are three core elements in DCAF: (1) a dynamic environment element, which improves the flexibility of the framework and enables it to handle real-time data; (2) a composable element, which improves interoperability and allows the framework to compose information from different data sources; and (3) a consumer analytics element, which provides the consumer analytics feature. The remainder of this paper is organized as follows: Sect. 2 presents the background study and related works, Sect. 3 discusses the structure and three core elements of DCAF, Sect. 4 presents the testing and results, and Sect. 5 concludes with discussion and future work.
2 Related Works
DCAF includes the Knowledge Library (KL), a rule-based library that provides guidance for converting datasets to the correct file type and for data composition. DCAF also contains an analytic module that provides the consumer analytics feature. The analytic module adopts the RFM (recency, frequency, monetary) model introduced by Hughes [7] to identify Customer Lifetime Value (CLV), which can be used to forecast the future profit of a customer and to rank potentially important customers. In the literature we find methods and tools that carry out composition, handle dynamic data or perform consumer analytics; however, no framework offers functionality identical to DCAF's. Because the methods, frameworks and tools currently on the market do not provide the end-to-end solution offered by DCAF, the literature review for this study compares related work against the core elements of DCAF. A summary of the related works is shown in Table 1.
A dynamic analysis allows the data in the analysis process to change constantly and supports real-time analysis. This means the analytic model should be able to take in new data in real time and perform immediate analysis on it. Many events in the world require real-time analysis, for example natural disaster prevention [14] and medical research activities [15]. Handling such a dynamic environment requires specially written or designed libraries, modules or architectures, such as the frameworks proposed by Coetzee [8] and Siriweera [9]. For handling the dynamic environment, DCAF relies on a rule-based class to handle dynamic variables, as Coetzee did, but DCAF's rule-based class, the KL, works together with the Workflow Management System (WFMS). For consumer analytics, the RFM model proposed by Hughes [7] can identify the value of a customer.
Hence, DCAF uses the RFM model to group customers by value, so that prediction can be performed on the important customer group. For composition, the literature offers different methods, such as the web service composition methods of Benatallah and Sheng [10] and Chen [11], and the data composition method of Klukas [12]. Klukas's method is the closer fit for DCAF, because DCAF needs data composition for analytics: merging data and standardizing its semantic meaning. However, Klukas's method focuses on image data, while DCAF focuses on text-based data. Thus DCAF has its own composition method, which relies on a Service-Oriented Architecture (SOA) to obtain data from child services and then uses the WFMS in collaboration with the KL to compose the data. An existing tool on the market, Composable Analytics, can compose data from different sources and present a summary of the data to the user [13]. DCAF offers better flexibility than Composable Analytics, as any registered child service that provides its metadata can call the central service to pass in an input dataset. DCAF also offers a more targeted analytics feature: it provides consumer analytics, whereas Composable Analytics only provides basic analytics with charts.
Incoming child datasets are first converted into the expected file type. Next, the "WorkFlow_Compiler" notices the existence of the new data files and consolidates them into a single compiled data file. After that, "WorkFlow_DataAnalytic" loads the data from the file into the analytic module, which performs the analysis accordingly. The analytic result is displayed in the DCAF application UI once the data analytics job is completed.
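As an illustration of this three-stage flow, the sketch below shows an equivalent pipeline in Python. The paper's workflows are implemented in Windows Workflow Foundation, so everything here, including the folder and function names, is an illustrative assumption rather than DCAF's actual code.

```python
# Minimal sketch of the three-stage DCAF flow described above:
# convert -> compile -> analyze. Names and paths are assumptions.
from pathlib import Path

import pandas as pd

TEMP_DIR = Path("temp")        # converted child datasets land here
COMPOSE_DIR = Path("Compose")  # consolidated dataset is written here


def convert_to_csv(source: Path) -> Path:
    """Stage 1: normalise a child dataset (.xlsx or .tsv) to CSV."""
    if source.suffix == ".xlsx":
        df = pd.read_excel(source)
    elif source.suffix == ".tsv":
        df = pd.read_csv(source, sep="\t")
    else:
        df = pd.read_csv(source)
    TEMP_DIR.mkdir(exist_ok=True)
    target = TEMP_DIR / (source.stem + ".csv")
    df.to_csv(target, index=False)
    return target


def compile_datasets() -> Path:
    """Stage 2: consolidate converted files into one compiled file,
    mirroring the "WorkFlow_Compiler" step."""
    frames = [pd.read_csv(f) for f in TEMP_DIR.glob("*.csv")]
    compiled = pd.concat(frames, ignore_index=True)
    COMPOSE_DIR.mkdir(exist_ok=True)
    out = COMPOSE_DIR / "compiled.csv"
    compiled.to_csv(out, index=False)
    return out


def run_analytics(compiled: Path) -> None:
    """Stage 3: hand the compiled data to the analytic module,
    mirroring "WorkFlow_DataAnalytic"."""
    data = pd.read_csv(compiled)
    print(data.describe())  # stand-in for the RFM analysis below
```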
Algorithm 1 shows the pseudocode for the RFM analysis, which is also the procedure DCAF follows to obtain the RFM scores. The analysis starts by setting the target date: the most recent date or today's date, which is used to compute the Recency dimension. The analytic module then selects the transactions that are no more than one year away from the target date and match the target customer ID. From this subset of transactions, DCAF computes the Recency, Frequency and Monetary dimension values. The Recency value is the difference between the target date and the most recent transaction date in the subset. The Frequency value is the total invoice count, which represents the number of transactions, while the Monetary value is the customer's total spend across all transactions. Once the RFM values are available for all customers, DCAF discretizes each dimension into five equal groups. For the Recency dimension, the first 20% of values are labeled 5, the next 20% are labeled 4, and so on, so customers with more recent transactions score higher. For the Frequency and Monetary dimensions, the first 20% of values are labeled 1, the next 20% are labeled 2, and so on, so customers who visit and purchase more often, or who spend more in total, score higher in the respective dimension. After obtaining each customer's RFM score, DCAF performs customer segmentation according to the pre-set rules shown in Table 3.
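The scoring steps above translate naturally into quintile binning. The sketch below is a minimal Python rendering of Algorithm 1 as described; the column names follow the UCI online retail dataset used in the evaluation and are our assumption, not the paper's pseudocode.

```python
# Sketch of the RFM scoring procedure described in Algorithm 1.
# Column names (CustomerID, InvoiceNo, InvoiceDate, Quantity,
# UnitPrice) follow the UCI online retail schema and are assumptions.
import pandas as pd


def rfm_scores(tx: pd.DataFrame, target_date: pd.Timestamp) -> pd.DataFrame:
    tx = tx.assign(Amount=tx["Quantity"] * tx["UnitPrice"])
    # Keep only transactions within one year of the target date.
    recent = tx[tx["InvoiceDate"] >= target_date - pd.DateOffset(years=1)]

    rfm = recent.groupby("CustomerID").agg(
        Recency=("InvoiceDate", lambda d: (target_date - d.max()).days),
        Frequency=("InvoiceNo", "nunique"),  # number of transactions
        Monetary=("Amount", "sum"),          # total spend
    )

    # Discretize each dimension into five equal groups; ranking first
    # avoids duplicate bin edges. Lower Recency (more recent) scores 5;
    # higher Frequency and Monetary score 5.
    rfm["R"] = pd.qcut(rfm["Recency"].rank(method="first"), 5,
                       labels=[5, 4, 3, 2, 1])
    rfm["F"] = pd.qcut(rfm["Frequency"].rank(method="first"), 5,
                       labels=[1, 2, 3, 4, 5])
    rfm["M"] = pd.qcut(rfm["Monetary"].rank(method="first"), 5,
                       labels=[1, 2, 3, 4, 5])
    return rfm
```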
For the important customer group, DCAF displays the sales amount of the predicted products along with the probability of each product being purchased in the DCAF user interface. The product sales amount reflects the customers' demand for the product, while the purchase probability reflects how often the customer needs the product. Combining both statistics therefore improves the accuracy of predicting products of potential interest.
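The paper does not give an explicit formula for combining the two statistics, so the sketch below simply computes both per product and ranks on them together; the combination rule is our assumption.

```python
# Sketch of combining product sales amount with purchase probability,
# as described above. The exact combination rule is not given in the
# paper; ranking on both statistics here is an illustrative assumption.
import pandas as pd


def interest_ranking(tx: pd.DataFrame) -> pd.DataFrame:
    tx = tx.assign(Amount=tx["Quantity"] * tx["UnitPrice"])
    n_invoices = tx["InvoiceNo"].nunique()
    per_product = tx.groupby("StockCode").agg(
        SalesAmount=("Amount", "sum"),  # demand for the product
        Invoices=("InvoiceNo", "nunique"),
    )
    # Share of transactions containing the product: how often it is needed.
    per_product["PurchaseProb"] = per_product["Invoices"] / n_invoices
    return per_product.sort_values(["PurchaseProb", "SalesAmount"],
                                   ascending=False)
```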
Framework testing has been performed to evaluate DCAF. The testing process is divided into three parts: (1) Workflow Management testing, (2) Knowledge Library testing and (3) Consumer Analytic module testing. Workflow Management testing verifies whether the WFMS works correctly in a dynamic environment; Knowledge Library testing verifies whether the KL gives DCAF the correct instructions for data composition; Consumer Analytic module testing checks whether the proposed method in the analytic module performs the analysis correctly when the framework accepts data dynamically. The checklist questions used in the framework testing are shown in Table 4; DCAF must pass all of them to pass the framework testing.
The testing starts by creating a scenario in which two child services are registered in DCAF: one from Singapore and the other from China. The Singapore service provides its dataset in Excel format (an .xlsx file), while the China service provides its dataset in tab-separated-value format (a .tsv file). Both child datasets contain the columns required for consumer behaviour prediction, but also extra, unwanted columns. DCAF therefore needs to convert the child datasets into CSV format and find the wanted columns in both datasets for consolidation. To find the wanted columns, DCAF matches column names between the datasets, using the column names of the parent dataset as the search criteria to check whether each wanted column is available in a child dataset. DCAF should be able to adapt to the dynamic environment, because it is assumed that the two child services send their datasets to the central service at inconstant times. It is also assumed that both the child services and the central service have correct, error-free metadata files.
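The column-matching step described here can be sketched in a few lines; the function below, with assumed names, keeps only the child columns whose names appear in the parent dataset and reports anything missing.

```python
# Sketch of the column-matching step: the parent dataset's column
# names serve as the search criteria, and only matching columns are
# kept from a child dataset. Function and variable names are assumptions.
import pandas as pd


def match_columns(parent: pd.DataFrame, child: pd.DataFrame) -> pd.DataFrame:
    wanted = [c for c in parent.columns if c in child.columns]
    missing = set(parent.columns) - set(wanted)
    if missing:
        # In DCAF this is resolved via the metadata file; here we
        # simply report the gap.
        print(f"child dataset is missing columns: {sorted(missing)}")
    return child[wanted]  # drop the extra, unwanted columns
```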
Dataset for Testing
DCAF is evaluated using an online retail dataset from http://archive.ics.uci.edu/ml/datasets/online+retail, generated by Chen from the School of Engineering, London South Bank University [21]. The dataset was collected from a UK-based company running its non-store retail business across the UK and Europe, and contains sales transactions from 01/02/2010 to 09/12/2011. More than 4,000 customers contribute to the dataset, which has more than 500,000 rows with 8 attributes per row recording customer purchase history. Apart from the retail dataset from Chen, self-created datasets are also included in the evaluation. The self-created datasets use file types such as comma-separated, tab-separated and Excel files to evaluate DCAF's data composition ability. Although the self-created datasets are randomly generated, they still match the main retail dataset, because their stock codes and unit prices are taken from it. In the framework testing, the central service holds the downloaded online retail dataset, while both child services send self-created datasets to the central service.
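Generating such a matching dataset is straightforward; the sketch below, with assumed column names, samples real (StockCode, UnitPrice) pairs from the main retail dataset and randomizes the remaining fields.

```python
# Sketch of generating a self-created child dataset whose stock codes
# and unit prices come from the main retail dataset, as described
# above. Column names follow the UCI online retail schema and are
# assumptions.
import numpy as np
import pandas as pd


def make_child_dataset(retail: pd.DataFrame, n_rows: int,
                       seed: int = 0) -> pd.DataFrame:
    rng = np.random.default_rng(seed)
    # Sample real (StockCode, UnitPrice) pairs so the child data can
    # be matched back to the main dataset.
    products = retail[["StockCode", "UnitPrice"]].drop_duplicates()
    picks = products.sample(n_rows, replace=True, random_state=seed)
    return pd.DataFrame({
        "StockCode": picks["StockCode"].to_numpy(),
        "UnitPrice": picks["UnitPrice"].to_numpy(),
        "Quantity": rng.integers(1, 20, n_rows),  # random quantities
        "InvoiceDate": pd.Timestamp("2011-01-01")
                       + pd.to_timedelta(rng.integers(0, 300, n_rows), "D"),
        "CustomerID": rng.integers(12346, 18287, n_rows),  # random IDs
    })

# e.g. make_child_dataset(retail, 500).to_csv("child_sg.csv", index=False)
```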
Workflow Management Testing Result. When the testing starts, the child services send in the self-created datasets at inconstant times to create a dynamic environment. These datasets are handled by the WFMS, converted to the CSV file type, and stored at a temporary file location pending composition by the consolidator workflow. To verify the correctness of the WFMS, the temporary folder must therefore be checked: if the WFMS works correctly, it triggers the workflow that converts each child dataset to a CSV file and stores the converted file in the temporary folder. The testing result shows that the WFMS works correctly: before execution the temporary file location is empty, but after the testing starts it contains two CSV files, the datasets from the China and Singapore child services. After performing five rounds of testing and monitoring the file changes in the temporary folder, we found that the WFMS does its job correctly and handles the inconstant dataset input. From this, we verify that DCAF passes the first testing stage.
Knowledge Library Testing Result. The child service datasets are composed correctly only if the KL works without error. To verify whether the child service datasets have been composed into the expected file structure, the user can check the composed file in the "Compose" folder, which holds the composed dataset generated or updated after composing the CSV files from the temporary location. Once the column structure and content of the consolidated CSV file are verified as correct, we can say that the KL provides the correct rules and instructions to the "WorkFlow_Compiler" workflow, since that workflow completes its job without error. To verify that the composed file has the correct column structure, we manually checked it against the columns of all the CSV files in the temporary folder to ensure that it has the required columns. To verify the content of the consolidated file, we checked whether all the required rows of data exist in it. In addition, to verify that the KL works well, we also checked whether the summary and chart in the DCAF analytic result tab reflect the changes correctly. The analytic result tab displayed the results in text and chart form correctly, without application crashes or error messages during the testing. Hence, from these checks and verifications, the KL is verified and passes both checklist questions.
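These manual checks could equally be scripted. A minimal sketch follows, under the assumption that a required-column list is available from the metadata; the row check uses a count comparison as a simple proxy for full row-by-row verification.

```python
# Sketch of automating the two KL checks described above. The paths,
# the required-column list, and the count-based row check are assumptions.
from pathlib import Path

import pandas as pd


def verify_composition(temp_dir: Path, composed_file: Path,
                       required: list[str]) -> bool:
    composed = pd.read_csv(composed_file)

    # Check 1: the composed file has the required column structure.
    if not set(required).issubset(composed.columns):
        return False

    # Check 2: the composed row count equals the total rows of the
    # temporary CSVs (a proxy for checking every row carried over).
    expected = sum(len(pd.read_csv(f)) for f in temp_dir.glob("*.csv"))
    return len(composed) == expected
```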
Consumer Analytic Module Testing. To verify that the analytic module works correctly, we need to ensure the analysis results are displayed correctly in the DCAF data analytics tab. The analytic module should display the dataset summary, such as the number of customers and the oldest and latest transaction dates, in the result tab's "summary" section; the customer value segmentation bar chart in the "plot" section; and the items predicted to interest important customers in the "result" section (as shown in Fig. 4). The tester used native tools to run the same analysis on the same dataset separately, so that the displayed results could be verified. After comparing the results from DCAF with those of the testing tools, the analytic module is verified as working correctly, since both produce the same results.
5 Conclusion
DCAF is a framework that composes data dynamically and then performs consumer analytics on the composed dataset automatically, with minimal human interaction. DCAF is also highly interoperable: other tools and services can communicate with it easily, needing only the main service URL to call it. DCAF has been tested and verified to provide its functions correctly. Its drawback is the reliance on dataset metadata to match columns between different datasets and to perform data composition accordingly. This reduces the efficiency of data composition, because columns may be named differently while carrying the same semantic meaning, and it is prone to human error since the user must key in the metadata manually. Thus, in future work, Natural Language Processing (NLP) will be applied in DCAF to understand the semantic meaning of column names and thereby improve the accuracy and flexibility of data composition. The analytic module will also be developed further so that DCAF's prediction accuracy and performance can be improved.
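As a rough illustration of the direction this future work points at, the sketch below matches differently named columns by plain string similarity. This is only a lightweight stand-in for the NLP-based semantic matching the paper proposes; the function, threshold and examples are all assumptions.

```python
# Illustrative sketch only: matching differently named columns by
# string similarity, a lightweight stand-in for the NLP-based semantic
# matching proposed as future work. All names here are assumptions.
from difflib import SequenceMatcher


def best_match(parent_col: str, child_cols: list[str],
               threshold: float = 0.7) -> str | None:
    scored = [(SequenceMatcher(None, parent_col.lower(), c.lower()).ratio(), c)
              for c in child_cols]
    score, col = max(scored)
    return col if score >= threshold else None

# e.g. best_match("CustomerID", ["cust_id", "customer_id", "price"])
# returns "customer_id".
```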
References
1. LaValle, S., Lesser, E., Shockley, R., Hopkins, M.S., Kruschwitz, N.: Big data, analytics and
the path from insights to value. MIT Sloan Manag. Rev. 52(2), 21 (2011)
2. Ji-fan Ren, S., Fosso Wamba, S., Akter, S., Dubey, R., Childe, S.J.: Modelling quality
dynamics, business value and firm performance in a big data analytics environment. Int.
J. Prod. Res. 55(17), 5011–5026 (2017)
3. Sodenkamp, M., Kozlovskiy, I., Staake, T.: Gaining IS business value through big data
analytics: a case study of the energy sector (2015)
4. Xu, Z., Frankwick, G.L., Ramirez, E.: Effects of big data analytics and traditional marketing
analytics on new product success: a knowledge fusion perspective. J. Bus. Res. 69(5), 1562–
1566 (2015)
5. Jagadish, H.V., Gehrke, J., Labrinidis, A., Papakonstantinou, Y., Patel, J.M., Ramakrishnan,
R., Shahabi, C.: Big data and its technical challenges. Commun. ACM 57(7), 86–94 (2014)
6. Ross-Talbot, S.: Orchestration and choreography: Standards, tools and technologies for
distributed workflows. In: NETTAB Workshop-Workflows Management: New Abilities for
the Biological Information Overflow, Naples, Italy, vol. 1, p. 8, October 2005
7. Hughes, A.M.: Strategic database marketing. McGraw-Hill Pub, New York (2005)
8. Coetzee, P., Jarvis, S.A.: Goal-based composition of scalable hybrid analytics for
heterogeneous architectures. J. Parallel Distrib. Comput. (2016)
9. Siriweera, T.H.A.S., Paik, I., Kumara, B.T., Koswatta, K.R.C.: Intelligent big data analysis
architecture based on automatic service composition. In: 2015 IEEE International Congress
on Big Data (BigData Congress), pp. 276–280. IEEE, June 2015
10. Benatallah, B., Sheng, Q.Z., Dumas, M.: The self-serv environment for web services
composition. IEEE Internet Comput. 7(1), 40–48 (2003)
11. Chen, P.Y., Hwang, S.Y., Lee, C.H.: A dynamic service composition architecture in
supporting reliable web service selection. In: 2013 Fifth International Conference on Service
Science and Innovation (ICSSI). IEEE (2013)
12. Klukas, C., Chen, D., Pape, J.M.: Integrated analysis platform: an open-source information
system for high-throughput plant phenotyping. Plant Physiol. 165(2), 506–518 (2014)
13. Fielder, L.H., Dasey, T.J.: Systems and methods for composable analytics
(No. MIT/LL-CA-1). Massachusetts Institute of Technology, Lincoln Laboratory, Lexington (2014)
14. Sakaki, T., Okazaki, M., Matsuo, Y.: Earthquake shakes Twitter users: real-time event
detection by social sensors. In: Proceedings of the 19th International Conference on World
Wide Web, pp. 851–860. ACM, April 2010
15. Derveaux, S., Vandesompele, J., Hellemans, J.: How to do successful gene expression
analysis using real-time PCR. Methods 50(4), 227–230 (2010)
16. What Is Windows Communication Foundation. (n.d.). https://msdn.microsoft.com/en-us/
library/ms731082(v=vs.110).aspx. Accessed 14 May 2017
17. Weske, M.: Business process management architectures. Business Process Management,
pp. 333–371. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-28616-2_7
18. Milner, M.: A Developer’s Introduction to Windows Workflow Foundation (WF) in .NET 4,
April 2010. https://msdn.microsoft.com/en-us/library/ee342461.aspx. Accessed 13 May
2017
19. Gupta, S., Lehmann, D.R.: Customers as assets. J. Interact. Mark. 17(1), 9–24 (2003)
20. Dunford, R., Su, Q., Tamang, E.: The Pareto principle. Plymouth Stud. Sci. 7(1), 140–148
(2014)
21. Chen, D., Sain, S.L., Guo, K.: Data mining for the online retail industry: a case study of
RFM model-based customer segmentation using data mining. J. Database Mark. Cust.
Strategy Manag. 19(3), 197–208 (2012)