
Real Time Processing of Big Data for predictive analysis

Name of the Candidate :

AJAY ACHARYA

Supervisor:

Dr. NANDINI SIDNAL

Introduction
Moore's law predicted that processors and other IT technologies would deliver faster computing, storage and networking at lower cost, year after year. This helped automate most industries and processes, leading organizations to create terabytes of data every day. Social computing is also fueling the growth of data, leading to the birth of Big Data. The term Big Data is used almost everywhere these days, from news articles to professional magazines, and from tweets to YouTube videos and blog discussions. The term, coined by Roger Magoulas of O'Reilly Media in 2005, refers to a wide range of large data sets that are almost impossible to manage and process using traditional data management tools because of their size and their complexity.

Big Data can be seen in finance and business, where enormous amounts of stock exchange, banking, and online and onsite purchasing data flow through computerized systems every day and are then captured and stored for inventory monitoring and for the analysis of customer and market behavior. It can also be seen in the life sciences, where big data sets such as genome sequencing, clinical data and patient data are analyzed and used to advance breakthroughs in science and research. Other areas of research where Big Data is of central importance include astronomy, oceanography and engineering, among many others. The leap in computational and storage power enables the collection, storage and analysis of these Big Data sets, and companies are introducing innovative technological solutions.

In a broad range of application areas, data is being collected at unprecedented scale. Decisions that previously were based on guesswork, or on painstakingly constructed models of reality, can now be made based on the data itself. Such Big Data analysis now drives nearly every aspect of our modern society, including mobile services, retail, manufacturing, financial services, life sciences, and physical sciences.

Gartner puts the focus not only on size but on three dimensions of data growth, the 3Vs: Volume, Variety and Velocity.

Volume: Enterprises are awash with ever-growing data of all types, easily amassing terabytes, even petabytes, of information (e.g. turning 12 terabytes of Tweets created each day into improved product sentiment analysis; converting 350 billion annual meter readings to better predict power consumption).

Velocity: For time-sensitive processes such as catching fraud, big data flows must be analyzed and used as they stream into the organization in order to maximize the value of the information (e.g. scrutinizing 5 million trade events created each day to identify potential fraud; analyzing 500 million daily call detail records in real time to predict customer churn faster).

Variety: Big data consists of any type of data, structured and unstructured, such as text, sensor data, audio, video, click streams, log files and more. The analysis of combined data types brings new perspectives on problems and situations (e.g. monitoring hundreds of live video feeds from surveillance cameras to target points of interest; exploiting the 80% data growth in images, video and documents to improve customer satisfaction).

If Big Data can be processed in real time, it reveals market trends and user behavior that can be used in various businesses. Current data management technologies do not meet the requirements for managing and efficiently processing these growing terabytes or petabytes of data per day. New concepts and models are emerging to meet these challenges. Figure 1 depicts the typical challenges in big data processing. It shows the various stages of big data processing and the challenges in terms of the 3Vs.

[Figure 1 stages, in order: Data Acquisition, Extraction/Cleaning, Aggregation/Representation, Analysis/Modeling, Interpretation. Challenges spanning the stages: Volume, Velocity, Variety.]

Figure 1: Big data Analysis pipeline
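The stages named in Figure 1 can be illustrated with a minimal single-process sketch. The record format (`symbol,price`), the function names and the threshold below are assumptions made purely for illustration; they are not part of the proposal, and a real deployment would distribute each stage across a cluster.

```python
from typing import Dict, Iterable, Iterator, List, Tuple


def acquire(raw_lines: Iterable[str]) -> Iterator[str]:
    """Data acquisition: hand records over one at a time, as a stream would."""
    for line in raw_lines:
        yield line


def extract_and_clean(records: Iterable[str]) -> Iterator[Tuple[str, float]]:
    """Extraction/cleaning: parse 'symbol,price' and drop malformed records."""
    for record in records:
        parts = record.strip().split(",")
        if len(parts) != 2:
            continue  # wrong number of fields
        symbol, price_text = parts
        try:
            yield symbol.upper(), float(price_text)
        except ValueError:
            continue  # price is not a number


def aggregate(pairs: Iterable[Tuple[str, float]]) -> Dict[str, List[float]]:
    """Aggregation/representation: group prices by symbol."""
    grouped: Dict[str, List[float]] = {}
    for symbol, price in pairs:
        grouped.setdefault(symbol, []).append(price)
    return grouped


def analyze(grouped: Dict[str, List[float]]) -> Dict[str, float]:
    """Analysis/modeling: here simply the mean price per symbol."""
    return {s: sum(p) / len(p) for s, p in grouped.items()}


def interpret(means: Dict[str, float], threshold: float) -> List[str]:
    """Interpretation: report the symbols whose mean exceeds the threshold."""
    return sorted(s for s, m in means.items() if m > threshold)


# Hypothetical sample input, including two malformed records.
SAMPLE = ["abc,10.0", "xyz,5.0", "bad record", "abc,14.0", "xyz,oops"]
flagged = interpret(
    analyze(aggregate(extract_and_clean(acquire(SAMPLE)))), threshold=8.0
)
print(flagged)
```

The first two stages are written as generators so that records flow through them one at a time, which is where the velocity challenge is felt; the aggregation stage is where volume becomes a memory concern.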

Figure 1.2 depicts the popular Data Information Knowledge Wisdom (DIKW) hierarchy (Rowley, 2007). In this hierarchy, data stands at the lowest level and carries the least understanding. Data needs to be processed and condensed into more connected forms in order to be useful for event comprehension and decision making. Information, knowledge and wisdom are these forms of understanding: they capture the relations and patterns that allow a deeper awareness of the process that generated the data, and the principles that can guide future decisions.

[Figure 1.2 axes: Connectedness and Understanding. Levels, from lowest to highest: Data, Information, Knowledge, Wisdom. Transitions between levels: understanding relations, understanding patterns, understanding principles.]

Figure 1.2: Data Information Knowledge Wisdom hierarchy.

As we enter the era of big data, increasing amounts of data, both structured and unstructured, are available for analysis. Social media offer an unprecedented plethora of opportunities for corporations and government agencies to engage immediately, in real time, with millions of people around the world, around the clock. They are sending out messages, experimenting with offers, and strengthening their brands. But, perhaps far more important, effective social media are about listening, understanding, and responding. Consumers can now talk to each other, criticize or recommend products, share feedback with their chosen providers, and generally shape the nature and scope of their preferred brands. But unlike standard, structured data, with its neat rows and columns and tables, fixed fields, and predictable, validated formats, social media present their data in loosely structured formats, making it far more challenging for public- and private-sector organizations to tap into the undeniably large value they hold. Ad hoc interactions and responses are a dead end. Instead, companies need systematized ways to capture, analyze, and respond to the mountains of social media data that are rightfully taking their place alongside traditional structured data.

With big data, analytics is moving from traditional business intelligence methods that use classic sorting with structured data to discovery techniques that utilize raw information. Traditional BI tools use a deductive approach to data, which assumes some understanding of existing patterns and relationships. An analytics model approaches the data based on this knowledge. Because they rely on known patterns and fixed fields, deductive methods work well with structured data. An inductive approach makes no presumptions about patterns or relationships and is more about data discovery. Predictive analytics applies inductive reasoning to big data using sophisticated quantitative methods such as machine learning, neural networks, robotics, computational mathematics, and artificial intelligence to explore all the data and to discover interrelationships and patterns. Inductive methods use algorithms to perform complex calculations specifically designed to run against highly varied or large volumes of data.

The result of applying these techniques to a real-world business problem is a predictive model. The ability to know what algorithms and data to use to test and create the predictive model is part of the science and art of predictive analytics. Predictive analytics is a broad term describing a variety of statistical and analytical techniques used to develop models that predict future events or behaviors.

Problem Identification
Predictive analytics uses historical data to build predictive analytic models. These models are mathematical formulae that use analysis of the past to calculate a value that represents a prediction about the future. Continuous streams of data can help enhance the efficiency and accuracy of predictive models: the historical data that informs a model may, in real time or near real time, be only seconds or minutes old. In the past, predictive models took a long time to build and test before they could exploit such patterns. The higher the volume of data, the greater the opportunity to fine-tune and refine the model. The relevance of a predictive model depends on the accuracy of its key coefficients and factor variables, that is, on the ability to find correlations. In this way, data analytics differs from more traditional data mining. Automated analytics algorithms, such as machine learning, continuously inform the predictive model and enable it to adjust. Each adjustment can increase accuracy as the algorithm continues to process and analyze data. Depending on previous human or machine actions, the system can continue to generate new algorithms as needed, ensuring that the model remains relevant. E. Letouzé describes the main opportunities Big Data offers in the white paper Big Data for Development: Challenges & Opportunities:

Early warning: develop fast responses in times of crisis by detecting anomalies in the usage of digital media

Real-time awareness: design programs and policies with a more fine-grained representation of reality

Real-time feedback: check which policies and programs fail by monitoring them in real time, and use this feedback to make the needed changes

This research proposal attempts to address the gap between these opportunities and what current data management technologies deliver, by exploring real time predictive analytics for Big Data.
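The idea that a model keeps adjusting as each new record arrives can be sketched with online (stochastic gradient) linear regression. This is only an illustration under stated assumptions: the class name, the learning rate and the synthetic stream (generated from y = 2x + 1) are hypothetical and are not taken from the proposal.

```python
from typing import Iterator, List, Tuple


class OnlineLinearModel:
    """Linear model updated one record at a time by stochastic gradient descent."""

    def __init__(self, n_features: int, learning_rate: float = 0.1) -> None:
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict(self, x: List[float]) -> float:
        return self.bias + sum(w * xi for w, xi in zip(self.weights, x))

    def update(self, x: List[float], y: float) -> float:
        """Adjust the coefficients using a single (x, y) observation."""
        error = self.predict(x) - y
        # Gradient of the squared error 0.5 * error**2 for this one record.
        self.weights = [
            w - self.learning_rate * error * xi for w, xi in zip(self.weights, x)
        ]
        self.bias -= self.learning_rate * error
        return error


def synthetic_stream(n_records: int) -> Iterator[Tuple[List[float], float]]:
    """Hypothetical noiseless stream following y = 2x + 1 (an assumption)."""
    for i in range(n_records):
        x = (i % 10) / 10.0
        yield [x], 2.0 * x + 1.0


model = OnlineLinearModel(n_features=1)
for features, target in synthetic_stream(5000):
    model.update(features, target)  # the model is refined with every record

print(model.weights, model.bias)
```

Because each update touches only the current record, the model never needs the full history in memory, which is the property that makes this style of learning a candidate for near real time settings.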

Literature survey

There are several solutions to store and process big data. HDFS and Google File System (GFS) (Ghemawat et al., 2003) are distributed file systems geared towards large batch processing. They are not general-purpose file systems. For example, in HDFS files can only be appended to, not modified, and in GFS a record might get appended more than once (at-least-once semantics). They use large blocks of 64 MB or more, which are replicated for fault tolerance. BigTable (Chang et al., 2006) and HBase are non-relational data stores. They are multidimensional, sparse, sorted maps designed for semi-structured or unstructured data. They provide random, real-time read/write access to large amounts of data. In the big data computation layer we find dataflow paradigms with support for automated parallelization for large scale data intensive computing.
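The description of BigTable and HBase as multidimensional, sparse, sorted maps can be made concrete with a toy in-memory sketch. The class and method names below are hypothetical and this is not the BigTable or HBase API; it only mimics the data model, with cells keyed by row, column and timestamp.

```python
import bisect
from typing import Any, Dict, Iterator, List, Optional, Tuple


class SparseSortedTable:
    """Toy sorted map keyed by (row, column, timestamp); absent cells cost nothing."""

    def __init__(self) -> None:
        # Timestamps are stored negated so the newest version sorts first.
        self._keys: List[Tuple[str, str, float]] = []
        self._values: Dict[Tuple[str, str, float], Any] = {}

    def put(self, row: str, column: str, timestamp: int, value: Any) -> None:
        key = (row, column, -timestamp)
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def get_latest(self, row: str, column: str) -> Optional[Any]:
        """Random read: newest version of one cell, or None if it is absent."""
        i = bisect.bisect_left(self._keys, (row, column, float("-inf")))
        if i < len(self._keys) and self._keys[i][:2] == (row, column):
            return self._values[self._keys[i]]
        return None

    def scan_rows(self, start_row: str, end_row: str) -> Iterator[Tuple[str, str, int, Any]]:
        """Range scan over start_row <= row < end_row, in sorted key order."""
        i = bisect.bisect_left(self._keys, (start_row,))
        while i < len(self._keys) and self._keys[i][0] < end_row:
            row, column, neg_ts = self._keys[i]
            yield row, column, -neg_ts, self._values[self._keys[i]]
            i += 1


# Hypothetical rows and columns, chosen only for illustration.
table = SparseSortedTable()
table.put("com.example/a", "contents:html", 1, "<html>v1</html>")
table.put("com.example/a", "contents:html", 2, "<html>v2</html>")
table.put("com.example/b", "anchor:home", 1, "Example B")
table.put("org.sample/x", "contents:html", 1, "<html>x</html>")

print(table.get_latest("com.example/a", "contents:html"))
print(list(table.scan_rows("com.", "com/")))
```

Keeping the keys sorted is what makes range scans over neighbouring rows cheap, while the dictionary lookup stands in for random access to a single cell.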

MapReduce (Dean and Ghemawat, 2004) is a distributed computing engine developed by Google, while Hadoop is an open source clone. A MapReduce program is specified by a map function and a reduce function; more general dataflow engines specify programs as a Directed Acyclic Graph (DAG) whose vertices are operations and whose edges are data channels. In both cases, the system takes care of scheduling, distribution, communication and execution on a cluster.

Data originates from a wide variety of sources. Radio-Frequency Identification (RFID) tags and Global Positioning System (GPS) receivers are already spread all around us. Sensors like these produce petabytes of data just as a result of their sheer numbers, thus starting the so-called industrial revolution of data (Hellerstein, 2008).

Big data mining

The following papers describe various aspects of big data mining and analysis.

"Scaling Big Data Mining Infrastructure: The Twitter Experience" by Jimmy Lin and Dmitriy Ryaboy (Twitter, Inc.). This paper presents insights about Big Data mining infrastructures and the experience of doing analytics at Twitter. It shows that, given the current state of data mining tools, performing analytics is not straightforward. Most of the time is consumed by preparatory work before data mining methods are applied, and by turning preliminary models into robust solutions.

"Mining Heterogeneous Information Networks: A Structural Analysis Approach" by Yizhou Sun (Northeastern University) and Jiawei Han (University of Illinois at Urbana-Champaign). This paper shows that mining heterogeneous information networks is a new and promising research frontier in Big Data mining research. It considers interconnected, multi-typed data, including typical relational database data, as heterogeneous information networks. These semi-structured heterogeneous information network models leverage the rich semantics of typed nodes and links in a network and can uncover surprisingly rich knowledge from interconnected data.

"Big Graph Mining: Algorithms and Discoveries" by U Kang and Christos Faloutsos (Carnegie Mellon University). This paper presents an overview of mining big graphs, focusing on the use of the Pegasus tool and showing some findings in the Web graph and the Twitter social network. The paper gives inspirational future research directions for big graph mining.

In Big Data mining, there are many open source initiatives. The most popular are the following:

Apache Mahout: scalable machine learning and data mining open source software based mainly on Hadoop. It has implementations of a wide range of machine learning and data mining algorithms: clustering, classification, collaborative filtering and frequent pattern mining.

R: open source programming language and software environment designed for statistical computing and visualization. R was designed by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, beginning in 1993, and is used for statistical analysis of very large data sets.

Real time analytics

Peng et al. describe a system for real time analytics processing with MapReduce. The system modifies stock Hadoop's MapReduce programming model and execution framework, using the Chord model for temporary data and Cassandra for persistent storage. With this system, data stream processing applications can be developed with the familiar MapReduce programming model. Dass and Mahanti describe an efficient algorithm for real-time frequent pattern mining for real-time business intelligence analytics. Weigert et al. describe mining large distributed log data in near real time. They use large distributed graphs that allow a high throughput of updates with very low latency.
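The MapReduce programming model mentioned in this survey, and the idea of reusing it for stream processing, can be sketched in a single process as follows. This is a didactic illustration only: it is neither the Hadoop API nor the system described in the cited work, and the function names and the micro-batching scheme are assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(record: str) -> Iterator[Tuple[str, int]]:
    """Map: emit one (word, 1) pair per word in the record."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all intermediate values by key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: combine the values of one key into a single count."""
    return key, sum(values)


def map_reduce(records: Iterable[str]) -> Dict[str, int]:
    """One complete job over a bounded collection of records."""
    intermediate = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(k, vs) for k, vs in shuffle(intermediate).items())


def process_stream(micro_batches: Iterable[List[str]]) -> Iterator[Dict[str, int]]:
    """Assumed streaming variant: one job per micro-batch, merged into running totals."""
    totals: Dict[str, int] = defaultdict(int)
    for batch in micro_batches:
        for key, count in map_reduce(batch).items():
            totals[key] += count
        yield dict(totals)  # snapshot after each micro-batch


# Hypothetical input: two small batches arriving one after the other.
batches = [["big data", "data stream"], ["stream of data"]]
snapshots = list(process_stream(batches))
print(snapshots[-1])
```

In a real cluster the map and reduce calls run in parallel on different machines and the shuffle moves data over the network; the per-batch latency of that shuffle is one reason plain MapReduce needs modification for real time use.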

Objectives

In this work, we plan to study the issues in big data processing and predictive analysis and to contribute a framework for time-efficient processing of Big Data to perform real time predictive analytics. Since the existing big data processing methodologies differ widely, we will perform a comprehensive study of such methodologies by implementing them on a common infrastructure and comparing their performance by running identical workloads on them. We will create a statistical framework to perform predictive analysis in a clustered, distributed and parallel manner with big logs and extend it to handle real time big data. Finally, a case study will be conducted in which the framework will be implemented to perform real time risk analysis in the financial markets. The issues to be considered in our research are as follows:

1. To comparatively study the existing methodologies/algorithms for processing big data in real time, with latency, degree of heterogeneity, scalability and usage of computing resources as performance parameters.

2. To design and develop a time-efficient model/framework, in clustered, distributed and parallel fashion, for processing logs of Big Data in near real time for predictive analytics.

3. To update the model to perform real time predictive analytics using real time big data.

4. To develop a case study and apply the above to perform risk analysis in financial markets.
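The comparison of methods on identical workloads could be organized around a small measurement harness such as the sketch below. The two candidate methods and the workload are hypothetical stand-ins chosen only to make the example runnable, and only latency, throughput and peak memory are covered; heterogeneity and cluster-level scalability would need a distributed setup.

```python
import statistics
import time
import tracemalloc
from typing import Any, Callable, Dict, List


def measure(method: Callable[[List[int]], Any], workload: List[int],
            repeats: int = 5) -> Dict[str, Any]:
    """Run one method on a fixed workload and summarize its cost."""
    latencies = []
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = method(workload)
        latencies.append(time.perf_counter() - start)

    # Memory is traced in a separate pass so tracing does not distort latency.
    tracemalloc.start()
    method(workload)
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()

    median = statistics.median(latencies)
    return {
        "result": result,
        "median_latency_s": median,
        "records_per_s": len(workload) / median if median > 0 else float("inf"),
        "peak_memory_bytes": peak_bytes,
    }


# Two hypothetical candidate methods that must produce the same answer.
def distinct_by_sorting(records: List[int]) -> int:
    ordered = sorted(records)
    return sum(1 for i, r in enumerate(ordered) if i == 0 or r != ordered[i - 1])


def distinct_by_hashing(records: List[int]) -> int:
    return len(set(records))


WORKLOAD = [i % 1000 for i in range(100_000)]  # identical input for every method
report = {
    method.__name__: measure(method, WORKLOAD)
    for method in (distinct_by_sorting, distinct_by_hashing)
}
for name, stats in report.items():
    print(name, stats["median_latency_s"], stats["peak_memory_bytes"])
```

Repeating the same measurement with workloads of increasing size would give a first, single-machine indication of how each method scales.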

Methodology

Based on the objectives listed above, the following research methodology is proposed.

We will perform a study of the most common data analysis techniques for Big Data processing. These techniques range from typical text processing tasks such as indexing to pattern extraction, machine learning and graph mining. Such algorithms encompass different computational patterns, which might not properly suit the requirements of real time analytics. This analysis will allow us to identify the weaknesses of existing systems and to design a roadmap of contributions to the state of the art. A common infrastructure will be used for a comparative study of the algorithms on parameters such as scalability, latency and usage of computing resources. All the algorithms will be implemented in a suitable language such as C or Python and run under identical workloads for the study.

The next step will be to understand statistical models for predictive analytics. Various techniques, such as linear regression, logistic regression and regression with regularization, will be explored in detail. Based on this study we will design and develop a time-efficient model/framework, in clustered, distributed and parallel fashion, for processing logs of Big Data in near real time for predictive analytics. Statistical machine learning will be applied for analysis in an automated fashion.

The models developed above will be enhanced to support real time big data, arriving either as a data stream or from social networks, and this data will be applied to the statistical models developed earlier. A suitable implementation in R will be used to validate the models. We will use MapReduce as the target paradigm because of its widespread adoption by the scientific community and the open-source availability of its implementations (e.g. Hadoop).

The models and methodologies developed above will be applied to perform risk analysis in the financial markets. A case study will be implemented to demonstrate the models and their applicability to real time risk analysis in the financial markets.
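As an illustration of one of the statistical techniques named above, the following is a minimal sketch of logistic regression with L2 regularization, trained by batch gradient descent. The data set, the hyperparameters and the function names are assumptions chosen for illustration; the proposal itself foresees validating its models in R, whereas this sketch uses Python.

```python
import math
from typing import List, Tuple


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(w: List[float], b: float, x: List[float]) -> float:
    """Estimated probability that x belongs to class 1."""
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))


def train_logistic(X: List[List[float]], y: List[int], l2: float = 0.01,
                   learning_rate: float = 0.1,
                   epochs: int = 2000) -> Tuple[List[float], float]:
    """Minimize mean log-loss plus (l2 / 2) * ||w||^2 by batch gradient descent."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * d
        grad_b = 0.0
        for xi, yi in zip(X, y):
            error = predict_proba(w, b, xi) - yi
            for j in range(d):
                grad_w[j] += error * xi[j]
            grad_b += error
        # The penalty shrinks the weights; the bias is left unregularized.
        w = [wj - learning_rate * (gw / n + l2 * wj) for wj, gw in zip(w, grad_w)]
        b -= learning_rate * grad_b / n
    return w, b


# Hypothetical one-feature data set: the label switches between x = 2 and x = 3.
X = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
y = [0, 0, 0, 1, 1, 1]
w, b = train_logistic(X, y)
print(predict_proba(w, b, [0.5]), predict_proba(w, b, [4.5]))
```

The gradient is a sum over records, so it can be computed as partial sums on separate data partitions and then combined, which is what makes this family of models a natural fit for the MapReduce paradigm chosen as the target.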

Possible outcome
The possible outcomes of the proposed research are:

To evaluate and build a set of algorithms for predictive analysis on Big Data.

To develop methods to improve the performance of the algorithms in order to perform real time processing.

The details of the research conducted will be published in a journal of international repute.

A case study will be conducted where the algorithms developed will be implemented to perform real time risk analysis in the financial markets.

This research is an effort to bring together the fields of predictive analytics and real time big data processing.

References
1. Jennifer Rowley. The wisdom hierarchy: representations of the DIKW hierarchy. Journal of Information Science, 33(2):163-180, April 2007.
2. Joseph M Hellerstein. Programming a Parallel Future. Technical Report UCB/EECS-2008-144, University of California, Berkeley, November 2008. URL http://www.eecs.berkeley.edu/Pubs/TechRpts/2008/EECS-2008-144.html.
3. Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The Google file system. In SOSP '03: 19th ACM Symposium on Operating Systems Principles, pages 29-43. ACM, October 2003.
4. Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C Hsieh, Deborah A Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E Gruber. BigTable: A distributed storage system for structured data. In OSDI '06: 7th Symposium on Operating Systems Design and Implementation, pages 205-218, November 2006.
5. J. Dean and S. Ghemawat. MapReduce: a flexible data processing tool. Communications of the ACM, 53(1):72-77, January 2010.
6. Apache Hadoop, http://hadoop.apache.org.
7. Apache Mahout, http://mahout.apache.org.
8. A. Bifet, G. Holmes, R. Kirkby, and B. Pfahringer. MOA: Massive Online Analysis, http://moa.cms.waikato.ac.nz/. Journal of Machine Learning Research (JMLR), 2010.
9. R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2012. ISBN 3-900051-07-0.
10. E. Letouzé. Big Data for Development: Challenges & Opportunities. UN Global Pulse, May 2012.
11. Gartner. Gartner Says Solving Big Data Challenge Involves More Than Just Managing Volumes of Data, 2011. URL http://www.gartner.com/it/page.jsp?id=1731916.
12. P. Zikopoulos, C. Eaton, D. deRoos, T. Deutsch, and G. Lapis. IBM Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data. McGraw-Hill Companies, Incorporated, 2011.
13. Cheng-Zhang Peng, Ze-Jun Jiang, Xiao-Bin Cai, and Zhi-Ke Zhang. Real-Time Analytics Processing with MapReduce.
14. Rajanish Dass and Ambuj Mahanti. An Efficient Algorithm for Real-Time Frequent Pattern Mining for Real-Time Business Intelligence Analytics.
15. Stefan Weigert, Matti Hiltunen, and Christof Fetzer. Mining large distributed log data in near real time.
16. Agrawal, Divyakant, Sudipto Das, and Amr El Abbadi. "Big data and cloud computing: current state and future opportunities." Proceedings of the 14th International Conference on Extending Database Technology. ACM, 2011.
17. Ananthanarayanan, Rajagopal, et al. "Cloud analytics: Do we really need to reinvent the storage stack." Proceedings of the 1st USENIX Workshop on Hot Topics in Cloud Computing (HOTCLOUD 2009), San Diego, CA, USA. 2009.
18. Fox, Armando, et al. "Above the clouds: A Berkeley view of cloud computing." Dept. Electrical Eng. and Comput. Sciences, University of California, Berkeley, Rep. UCB/EECS 28 (2009).
19. Manovich, Lev. "Trending: the promises and the challenges of big social data." Debates in the digital humanities (2011): 460-75.
20. Shmueli, Galit, and Otto Koppius. "Predictive analytics in information systems research." Robert H. Smith School Research Paper No. RHS (2010): 06-138.
21. Labrinidis, Alexandros, and H. V. Jagadish. "Challenges and opportunities with big data." Proceedings of the VLDB Endowment 5.12 (2012): 2032-2033.
22. Siemens, George, and Phil Long. "Penetrating the fog: Analytics in learning and education." Educause Review 46.5 (2011): 30-32.
23. D. Boyd and K. Crawford. Critical Questions for Big Data. Information, Communication and Society, 15(5):662-679, 2012.
24. Intel. Big Thinkers on Big Data, http://www.intel.com/content/www/us/en/bigdata/big-thinkers-on-big-data.html, 2012.
25. J. Lin. MapReduce is Good Enough? If All You Have is a Hammer, Throw Away Everything That's Not a Nail! CoRR, abs/1209.2191, 2012.
26. N. Marz and J. Warren. Big Data: Principles and best practices of scalable realtime data systems. Manning Publications, 2013.
27. C. Bockermann and H. Blom. The streams Framework. Technical Report 5, TU Dortmund University, 12 2012.
28. U. Fayyad. Big Data Analytics: Applications and Opportunities in On-line Predictive Modeling. http://big-data-mining.org/keynotes/#fayyad, 2012.
29. J. Gama. Knowledge Discovery from Data Streams. Chapman & Hall/CRC Data Mining and Knowledge Discovery. Taylor & Francis Group, 2010.
30. S. M. Weiss and N. Indurkhya. Predictive data mining: a practical guide. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1998.
31. C. C. Aggarwal, editor. Managing and Mining Sensor Data. Advances in Database Systems. Springer, 2013.
