Vol. 16 (2017)
© World Scientific Publishing Company
DOI: 10.1142/S0219622017500286
†
Department of Computer Science, Tsinghua University
Beijing, P. R. China
‡
liguoliang@tsinghua.edu.cn
Zhiguang Shan
State Information Center of China, Beijing, P. R. China
Yong Shi
University of Chinese Academy of Sciences, Beijing, P. R. China
Data is growing faster than ever before and is changing our daily life. However, it is rather
challenging to manage big data [F. H. Cate, The big debate, Science 346 (2014) 810;
J. Manyika, M. Chui, B. Brown, J. Bughin, R. Dobbs, C. Roxburgh and A. H. Byers, Big Data:
The Next Frontier for Innovation, Competition, and Productivity (McKinsey Global Institute,
2011); S. Lohr, The Age of Big Data (New York Times, 2012), p. 11; L. Einav and J. Levin,
Economics in the age of big data, Science 345 (2014) 715; M. J. Khoury and J. P. A. Ioannidis,
Big data meets public health, Science 346 (2014) 1054–1055; V. Marx, Biology: The big
challenges of big data, Nature 498(7453) (2013) 255–260]. In this paper, we propose big
data thinking and modeling techniques from the perspective of the I Ching, a famous theory
of imaginal thinking in China with a 3,000-year history. The I Ching has proven
to be very useful and practical in many domains, e.g., the 36 stratagems.
Firstly, inspired by the three components of the I Ching (image, number and principle), we
propose a new three-cycle way of big data thinking: from data to phenomenon, from phenomenon
to correlation, and from correlation to knowledge. It is a generalization of the fourth
paradigm (from causality to correlation) proposed by Jim Gray.
Secondly, inspired by the three entities of the I Ching (heaven, earth and human), we
propose a new big data modeling method. We use the three entities to represent big data. We
map the 4 Vs of big data (volume, variety, velocity, veracity) to four relations of opposition
and unity in the I Ching, and generate the eight diagrams. By capturing the relationships
between the eight diagrams, we generate the 64 hexagrams, and use the 64 hexagrams to model big
data. We also provide the principle rules to understand the knowledge generated by the model.
Thirdly, we discuss how to utilize our model to describe big-data management tools,
including MapReduce, Spark, and Storm. We also provide a new model for handling distributed
data streams.
‡ Corresponding author.
2 C. Lin et al.
We do think that we provide a new practical way of thinking and modeling for big data.
We also believe that this will open up many new research directions on big data.
1. Introduction
More and more data is being generated in every field, which is changing our
daily life, and we have entered the era of big data.1–7 However, it is rather challenging to
manage big data8–10 because of (i) large volume: the data scale has increased from TB to PB,
and even EB. The data generated in the recent two years is larger than that generated in
the last 1,000 years; (ii) high velocity: the data is generated continuously and
instantly. For example, several thousand tweets are posted every second;
(iii) variety: the data is heterogeneous, including unstructured, structured, and
semi-structured data; and (iv) veracity: the truth is hidden in the big data and it
is rather hard to identify.
Int. J. Info. Tech. Dec. Mak. Downloaded from www.worldscientific.com
by MCMASTER UNIVERSITY on 10/23/17. For personal use only.
To address these challenges, we propose big data thinking and modeling
techniques, which can provide new perspectives for understanding and managing big
data. Different from most existing computing models, which utilize a formal
logical system from the perspectives of dialectical thinking and system thinking, the I
Ching adopts imaginal thinking. Dialectical thinking and system thinking are
used to model the way of thinking of Western people, while imaginal thinking is used
to model the way of thinking of Eastern people, especially Chinese. The former is
deterministic, local, and accurate. It addresses a problem from local to global, like deep
learning, and its advantages include logical reasoning, system analysis, and
quantitative analysis. It emphasizes how to accurately formulate a problem. The latter is
nondeterministic, global, and fuzzy. It addresses a problem globally, and its
advantages are property analysis and design principles. It emphasizes perception.
Obviously, it is hard to accurately address a big-data problem using dialectical
thinking and system thinking, because they may go astray due to the complicated data.
For example, in deep learning, even a minor disturbance in the data (e.g.,
slightly changing the face of a dog toward that of a cat) may generate very different results
(e.g., deep learning will then take it as a cat). On the contrary, imaginal thinking
is naturally suited for designing effective techniques for big data. Note that imaginal
thinking has been widely accepted in many fields. For example, Albert Einstein
said that
"Development of Western science is based on two great achievements: the
invention of the formal logical system (in Euclidean geometry) by the Greek
philosophers, and the discovery of the possibility to find out causal relationships by
systematic experiment (during the Renaissance). In my opinion one has not to be
astonished that the Chinese sages have not made those steps. The astonishing thing is
that those discoveries were made at all."11
Thinking and Modeling for Big Data from the Perspective of the I Ching 3
stratagems must fail in practice. Thus its basic idea is inspired by the I Ching. Note
that Western people utilize a formal logical system from the perspectives of dialectical
thinking and system thinking, while Chinese people, and the I Ching in particular, adopt
imaginal thinking, which emphasizes perception and gives a high-level idea for
addressing hard problems.13
In this paper, we utilize the I Ching to facilitate big data thinking and
modeling. Firstly, inspired by the three components of the I Ching (image, number
and principle), we propose a new three-cycle way of big data thinking: from data to
phenomenon, from phenomenon to correlation, and from correlation to knowledge.
It is a generalization of the fourth paradigm (from causality to correlation)
proposed by Jim Gray.14 Secondly, inspired by the three entities of the I Ching
(heaven, earth and human), we propose a new big data modeling method. We use the
three entities to represent big data. We map the 4 Vs of big data (volume, variety,
velocity, veracity) to four relations of opposition and unity in the I Ching, and thus
generate the eight diagrams. We utilize the 64 hexagrams to capture the relationships
between the eight diagrams. We also provide the principle rules to understand the
knowledge generated by the model. Thirdly, we discuss how to utilize our model to
describe big-data management tools, including MapReduce, Spark, and Storm. We also
provide a new model for handling distributed data streams.
To summarize, we make the following contributions.
(1) We provide a new three-cycle way of big-data thinking inspired by the I Ching
using imaginal thinking: from data to phenomenon, from phenomenon to
correlation, and from correlation to knowledge. To address any big-data
problem, we need to utilize these three steps to design the methodology.
(2) We propose a new big-data modeling framework using the I Ching. We use
heaven, earth and human to model big data. We map the 4 Vs of big data
(volume, variety, velocity, veracity) to four relations of opposition and unity in
the I Ching, and use the eight diagrams to model big data. We utilize the 64
hexagrams to capture the relationships of the four relations and give the principle
rules to understand the knowledge.
(3) We discuss how to utilize our thinking and modeling methods to explain existing
big-data processing platforms, and also provide a new model for processing
distributed streams.
three paradigms are led by humans, and humans need to do experiments, think
out new theories, and provide new models for computing. However, in the big data era,
it is rather hard to do experiments on big data, formulate theories, and provide models from
a large amount of data. Thus, a new way is to automatically identify them from the
big data. In other words, big data processing should be led by the data instead of
by humans.
Although the fourth paradigm provides a new way of data thinking, it only
provides a high-level idea and does not provide any details on how to understand big
data. In traditional data processing, people decide how to process the data. For
example, in databases, people embed a schema into the data and pose SQL queries,
and the database system returns the results. Take search engines
as another example. A search engine first crawls data from the web and indexes it.
Then, given a query, it finds the documents most similar to the query. These
traditional processes share an important common feature: people
design an automatic data processing flow and the machines process the data
following the flow. This corresponds to computational simulation. However, for big data,
it is rather hard to specify the automatic flow because people do not know how to do
it, and may not even know what to do. Thus, the flow should be automatically detected
based on the big data.
To address this problem, in this section, we propose a novel data thinking way to
help users understand and model the big data.
(1) From Data to Phenomenon: It is very important to detect phenomena from the
data. Any data processing operation aims to detect some phenomenon from the
big data. For example, Google predicts influenza activity for more than 25
countries based on users' queries posted to Google. The idea behind Google Flu
Trends (GFT) is that, by monitoring millions of users' health-tracking behaviors
online, the large number of Google search queries gathered can be analyzed to
reveal whether flu-like illness is present in a population. Google Flu Trends
compares these findings to a historic baseline level of influenza activity for the
corresponding region and then reports the activity level as minimal, low,
moderate, high, or intense. These estimates have been generally consistent with
conventional surveillance data collected by health agencies, both nationally and
regionally. Many studies point out that it is very important to
identify phenomena from big data.5,16–18 Thus, we think that identifying
the phenomenon from the big data is the first step.
(2) From Phenomenon to Correlation: Jim Gray in the fourth paradigm points out that
There should be a closed loop in the three cycles. In other words, the knowledge
can in turn benefit the detection of phenomena from big data. To manage big data, we
must utilize this way of data thinking, which can guide users to understand and manage
big data.
are similar to 64 functions. Given some input, e.g., the data, it uses the 64 hexagrams
to tell the phenomenon. The 64 hexagrams can cover every aspect of social science,
natural science and the cosmos.
For example, a very famous saying in China, "bad surroundings make bad
civilians", is an image, which has been verified over the 3,000-year history of China.
Based on the images, we can summarize useful principles and avoid bad images to benefit
mankind and the environment.
Thus, the I Ching implies that it is rather important to get the phenomenon from
the big data, which is consistent with our analysis. It also provides a way of using the 64
hexagrams to detect the phenomenon.
Number: Number is the mathematical system in the I Ching, which is used for
reasoning. It can be understood as the various mathematical models.
The I Ching also uses a binary system, similar to the computer. It uses two numbers,
Yin (0) and Yang (1), which are the two options for each gram in the eight diagrams
and 64 hexagrams. Yin and Yang also reflect the positions of different grams in the
eight diagrams and 64 hexagrams. For example, in the eight diagrams, we have eight
images: 000, 001, 010, 011, 100, 101, 110, and 111. The positions are rather important
in the eight diagrams and 64 hexagrams. Each of the 64 hexagrams contains six layers
of grams, and each gram has two options.
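The binary reading above can be illustrated with a minimal Python sketch. The function name `enumerate_grams` and the bit-string encoding are our own assumptions for illustration; the text only fixes Yin as 0 and Yang as 1 and the counts of eight and 64.

```python
from itertools import product

YIN, YANG = 0, 1  # the two options for each gram, as stated in the text


def enumerate_grams(layers):
    """Return every Yin/Yang pattern with the given number of layers as a bit string."""
    return ["".join(str(bit) for bit in pattern)
            for pattern in product((YIN, YANG), repeat=layers)]


eight_diagrams = enumerate_grams(3)  # three layers, two options each
hexagrams = enumerate_grams(6)       # six layers, two options each

# Stacking any two three-layer diagrams also yields a six-layer hexagram
# (hypothetical encoding: top diagram first, bottom diagram second).
stacked = {top + bottom for top in eight_diagrams for bottom in eight_diagrams}

print(eight_diagrams)
print(len(hexagrams), stacked == set(hexagrams))
```

Under this encoding, the 64 hexagrams are exactly the 8 x 8 ordered pairs of diagrams, which is why position matters: swapping top and bottom generally gives a different hexagram.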
There are many important numbers in the I Ching, including the two forms Yin and
Yang, the four images, the eight diagrams, and the 64 hexagrams. Image and number can be used
together to generate more images from the basic images. This corresponds
to the idea that everything can be generated from the images. In other
words, the heaven gives birth to 1, 1 gives birth to 2, 2 gives birth to 3, and 3 gives
birth to all things of the world. It can also build the connections between different images.
Principle: It is the philosophy in the I Ching model, referring to the law in the
natural world and human social activities. It is used to explain the reasons in various
domains. It can be understood as prediction functions. Based on the images, we can
get important principles. For example, based on the image "heaven maintains vigor
through movement", we can get a principle "a gentleman should constantly strive for
self-perfection". Based on the image "earth's condition is receptive devotion", we can
get a principle "a gentleman should hold the outer world with a broad mind".
These two principles are used as the motto of Tsinghua University.
There are four types of principles. First, the I Ching principle can model
everything and complies with natural science. Second, the I Ching principle can model
the cosmos and society, and thus it complies with cosmology and social science. Third, the I
Ching can model the relationship between diagrams. A famous saying is that "After
a storm comes a calm," which means that the diagrams can be mutually transformed.
Fourth, the gram number and gram position are rather important. A saying is that
"couple hardness with softness and the doctrine of the mean". This discusses the
importance of gram positions.
Principle and number can be used together. For example, we can use different
mathematical models and different functions, and then we can generate different
results.
Relationship Between Image, Number and Principle: Image, number,
and principle are strongly connected and cannot be separated. Image refers to the
phenomenon, and number refers to the way the ancients described the world. Principle
refers to the knowledge or truth extracted based on image and number. A famous
saying is "number implies power tactics/methods, and methods also rely on
number; Yin and Yang can change the principle, and the changing mechanism plays an
important role. However, the mechanism cannot be supposed, otherwise the
mechanism does not work." For example, any circumstance hitting a limit will begin to
change. Change will in turn lead to an unimpeded state, and then lead to continuity.
A Thinking Principle: Simplicity-Variance-Consistency. The I Ching also
has important thinking principles. (1) Simplicity: If we can capture the essential
features, everything is simple and we do not need to utilize many complicated
models. The I Ching uses 64 hexagrams to capture the main features, and thus it
is simple and easy to understand. This corresponds to imaginal thinking.
(2) Variance: Everything will change. The I Ching contains a circle and a curve.
The curve divides the circle into Yin and Yang, and they can change into each other.
The curve denotes that everything can change. This corresponds to dialectical
thinking. (3) Consistency: Although everything changes, the changing mechanism
will not change. The change is the surface phenomenon and the consistency is the
inherent law. This corresponds to system thinking. Thus, the I Ching covers
the three types of thinking. Simplicity-variance-consistency also explains
image, number and principle. Image can deduce number, which in turn deduces the
principles, and the principles are infinite. In other words, we can understand the
content from the external imagery. From the status, we can understand the property.
From the property, we can know the essence.
Summary. Based on the above discussion, we can use the I Ching to facilitate
big data thinking. We use image to capture the phenomenon of big data, number to
capture the correlation, and principle to capture the knowledge. In any big data
There are three dimensions in data processing: the micro dimension, the medium dimension,
and the macro dimension. The micro dimension is used to model the individual, and the macro
dimension is used to model the entirety. The medium dimension is used to model a partial group.
In big data, it is more important to capture the entirety, and thus the macro dimension is more
important. The I Ching is also good at modeling the entirety. It uses heaven, earth, and human to
model everything. This is also a philosophical abstraction model. It has no numerical
form but uses Yin/Yang. The three entities give birth to 1, 1 gives birth to 2, 2 gives
birth to 3, and 3 gives birth to all things of the world.
Note that each of the three entities has two options: heaven: Yin/Yang; earth: hard/
soft; human: kindheartedness/justice. Based on the three dimensions, we can build
the eight diagrams. Thus, we can use the three entities to represent big data. For any
type of data, we can model it as a set of triples, where each triple includes heaven,
earth, and human. In different domains, we can utilize different strategies.
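As a minimal sketch of the triple representation described above, the following Python snippet enumerates all combinations of the three two-option entities. The class name `Triple`, the `code` method, and the assignment of 0/1 to each option are hypothetical choices made for illustration; the paper does not prescribe a concrete encoding.

```python
from dataclasses import dataclass
from itertools import product

# The two options of each entity, as listed in the text.
HEAVEN = ("yin", "yang")
EARTH = ("hard", "soft")
HUMAN = ("kindheartedness", "justice")


@dataclass(frozen=True)
class Triple:
    heaven: str
    earth: str
    human: str

    def code(self):
        """Encode the triple as three bits (assumed: first option = 0, second = 1)."""
        pairs = ((HEAVEN, self.heaven), (EARTH, self.earth), (HUMAN, self.human))
        return "".join(str(options.index(value)) for options, value in pairs)


# Three binary entities give 2 * 2 * 2 = 8 combinations, one per diagram.
ALL_TRIPLES = [Triple(h, e, m) for h, e, m in product(HEAVEN, EARTH, HUMAN)]

for triple in ALL_TRIPLES:
    print(triple.code(), triple)
```

A data set would then be a collection of such triples; how raw records are mapped to the options of each entity is domain specific and is left open here, as in the text.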
changing point. This is also consistent with consistency, variance and simplicity in
the I Ching.
(3) Variety: It is rather hard to handle variety, as the data is heterogeneous,
multi-sourced, and full of noise. So we have two ways: hard or soft. The hard mode
processes each type of data exactly, while the soft mode handles
the variety approximately. So there is a tradeoff between hard and soft for variety. Obviously,
heaven, earth, and human form a good model for variety. Big data contains
unstructured, semi-structured and structured data. We can use "simplicity" to model
the variety. The law of nature is simple and everything can be simplified. Structured
data is well organized and easy to process. However, obtaining
structured data is hard. On the contrary, unstructured data is not well structured,
and processing it is hard, requiring complicated techniques (e.g.,
distributed machine learning). But the law hidden in the data is simple, and the
processing theory and philosophy are also simple. Thus, moving from
processing structured data to processing unstructured data requires finding a new way
of computing that uses all the features of unstructured
data. If we can find such a way, it should also be simple.
(4) Veracity: It is fairly important to know whether the data is true or false.
So there is a tradeoff between true and false for veracity. First, the quality of data is
uneven, and thus its value also differs, and the truth may be affected by erroneous
data. Second, the whole data is rather important for detecting the truth. An example
is a questionnaire on the web. Obviously, the users have different experiences,
different ages, and different backgrounds. If we do not have the whole data, the results
based on the skewed data are unreliable or biased. Third, veracity is similar to
Yin/Yang, which should be stable, and there must exist a law to detect the truth.
Based on the four opposite relationships, centralized/distributed, continuous/
discrete, hard/soft, and true/false, we can build an eight-diagram model as shown in
Fig. 1. In this way, we can capture the main features of big data and use the eight
diagrams to represent the image of big data.
hexagrams. The hierarchical design for computers and networks can also be deduced
from the I Ching.20 Thus, we use the 64 hexagrams to model big data. For each type
of query processing, we need to find an appropriate hexagram to address it.
Principle Rules of Knowledge. From the images/hexagrams, we can get very
important principles. For example, based on the image "heaven maintains vigor
through movement", we can get a principle "a gentleman should constantly strive for
self-perfection." Based on the image "earth's condition is receptive devotion", we
can get a principle "a gentleman should hold the outer world with a broad mind." A
famous image is that "After a storm comes a calm", which means that the diagrams
can be mutually transformed. Note that the gram number and gram position are rather
important. An image is that "couple hardness with softness and the doctrine of the
mean". For example, a very famous saying in China, "bad surroundings make bad
civilians", is an image, which has been verified over the 3,000-year history of China.
Based on the images, we can summarize useful principles and avoid bad images to benefit
mankind and the environment.
effective. However, if the data has strong correlations, e.g., graph data, the framework
is not effective.
The 14th image (called DaYou) in the 64 hexagrams implies the basic idea of
MapReduce. In this image, the top is distributed and the bottom is centralized.
An explanation of this hexagram, shown in Fig. 2, is that "Given the
products in a big cart, if the products are not strongly connected, we can use multiple
small trolleys to load the products without any loss." This implies that we should
utilize a distributed environment to process centralized data. To this end, we
need to partition the data across different nodes (Map), distribute the data to
different nodes (Shuffle), and process the data in each node (Reduce). Thus, the 14th
image provides a good hint for batch computing in the big data era.
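The three phases above can be sketched as a single-process Python word count. This is only an illustration of the conventional Map, Shuffle and Reduce pattern, not code from the paper or from any MapReduce implementation; the function names and the sample input are assumptions.

```python
from collections import defaultdict


def map_phase(partition):
    """Map: turn each line of one partition into (word, 1) pairs."""
    return [(word, 1) for line in partition for word in line.split()]


def shuffle_phase(pairs, num_reducers):
    """Shuffle: route every pair to a reducer bucket chosen by its key."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[hash(key) % num_reducers][key].append(value)
    return buckets


def reduce_phase(bucket):
    """Reduce: aggregate the values collected for each key in one bucket."""
    return {key: sum(values) for key, values in bucket.items()}


def word_count(partitions, num_reducers=2):
    pairs = [pair for partition in partitions for pair in map_phase(partition)]
    result = {}
    for bucket in shuffle_phase(pairs, num_reducers):
        result.update(reduce_phase(bucket))  # keys never overlap across buckets
    return result


# The "big cart" split into two "small trolleys" (hypothetical input).
trolleys = [["big data", "big cart"], ["small trolley", "big data"]]
print(word_count(trolleys))
```

Because each word is counted independently, the partitions can be processed without coordination, which matches the condition in the explanation that the products are not strongly connected.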
4.1.2. Spark
Spark is an in-memory distributed data processing framework for iterative
computing. It aims to improve on MapReduce by avoiding the expensive disk-based
processing and synchronization steps. To address these problems, it iteratively processes
the data using a directed acyclic graph (DAG), which saves synchronization time.
It also provides a lazy processing technique to optimize multiple operations.
The 35th image (called Jin) in the 64 hexagrams implies the basic idea of Spark.
In this image, the top is distributed and the bottom is also distributed. An
explanation is that "Given a cart and multiple persons pulling the cart, if all the persons go
in the same direction, then the cart can be pulled; if some persons pull from the front
and some push from the back, the cart can be easily pulled". That is to say, we should keep
our minds active and think out new ideas. In big data computing, we have two
effective strategies. The first is to trade space for time. That is, we can utilize
memory in place of the disk and thus reduce the disk latency. The second is to
optimize multiple operations using delayed execution (i.e., given multiple
operations, we do not execute them immediately. Instead, we batch them
together, and then execute them in an optimized way when we need the results of the
operations). This implies that we should utilize a distributed environment to process
distributed data in an iterative way, trade space for time, and do optimizations in
the query processing. In addition, the image also implies that to pull the cart, the
persons should be evenly distributed. That is, in Spark, we need to balance the workload
in the system. This also gives a suggestion for improving Spark by enabling
load balancing. Obviously, the 35th image provides a good hint for iterative data
processing.
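The delayed-execution strategy described above can be sketched in plain Python. The class `LazyPipeline` and its methods are hypothetical names introduced for illustration and are not Spark's API; the sketch only shows operations being recorded first and then executed together in one in-memory pass when the result is requested.

```python
class LazyPipeline:
    """Record operations now, run them together later (a toy model of lazy execution)."""

    def __init__(self, data, ops=()):
        self._data = list(data)  # kept in memory instead of on disk
        self._ops = tuple(ops)   # pending operations, not yet executed

    def map(self, fn):
        return LazyPipeline(self._data, self._ops + (("map", fn),))

    def filter(self, predicate):
        return LazyPipeline(self._data, self._ops + (("filter", predicate),))

    def collect(self):
        """Execute all recorded operations in a single pass over the data."""
        results = []
        for item in self._data:
            keep = True
            for kind, fn in self._ops:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                results.append(item)
        return results


calls = []
pipeline = (LazyPipeline(range(5))
            .map(lambda x: calls.append(x) or x * 2)  # records each evaluation
            .filter(lambda x: x > 2))
calls_before_collect = list(calls)  # nothing has run yet
print(pipeline.collect())
```

Fusing the recorded operations into one pass avoids materializing an intermediate list after each step, which is the kind of optimization that batching operations makes possible.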
4.1.3. Storm
Storm is a stream processing framework for streaming computing. Its basic idea is to
use a distributed environment to handle centralized streaming data.
The fifth image (called Xu) in the 64 hexagrams implies the basic idea of Storm.
In this image, the top is serial and the bottom is also serial. An explanation is that
Figure panels: (a) DaYou for MapReduce; (b) Jin for Spark; (c) Xu for Storm; (d) BI for MapReduce.
have full preparation before action. In big data computing, we first need to use a
master node to monitor the requests, that is, to accept and distribute the
requests, and utilize multiple slave nodes to serve them. Usually, we need to
prepare for incoming requests in advance, e.g., by building indexes and routing rules on the queries.
Obviously, the fifth image provides a good hint for stream processing.
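The master/slave arrangement described above can be sketched as follows. This is a toy single-process model, not Storm's API: the class `Master`, the CRC32-based routing rule, and the sample requests are all assumptions chosen to illustrate "accept and distribute" with a routing rule prepared in advance.

```python
import zlib
from collections import defaultdict


class Master:
    """Accept incoming requests and distribute them to slave nodes by a fixed routing rule."""

    def __init__(self, num_slaves):
        self.num_slaves = num_slaves
        self.queues = defaultdict(list)  # slave id -> requests waiting to be served

    def route(self, key):
        """Routing rule prepared before any request arrives: hash the key to a slave id."""
        return zlib.crc32(key.encode("utf-8")) % self.num_slaves

    def accept(self, key, payload):
        slave = self.route(key)
        self.queues[slave].append((key, payload))
        return slave


def serve(queue):
    """A slave node serves its queue; here it simply counts requests per key."""
    counts = defaultdict(int)
    for key, _payload in queue:
        counts[key] += 1
    return dict(counts)


master = Master(num_slaves=3)
stream = [("user1", "query a"), ("user2", "query b"), ("user1", "query c")]
assignments = [master.accept(key, payload) for key, payload in stream]
served = {slave: serve(queue) for slave, queue in master.queues.items()}
print(assignments, served)
```

Routing by key keeps all requests for the same key on the same slave, so each slave can serve its share of the stream without consulting the others.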
the eight diagrams. We utilize the 64 hexagrams to capture the relationships of the
eight diagrams and give the principle rules to understand the knowledge. We discuss
how to utilize our thinking and modeling methods to explain existing big-data
processing platforms, and also provide a new model for processing distributed
streams.
In future work, we aim to utilize more hexagrams to provide big data processing
tools. We do think that we have provided a new practical way of thinking and
modeling for big data. We also believe that this will open up new research directions
on big data.
References