
SYNOPSIS OF PROPOSED WORK

Introduction
Big data "Big Data" is a popular phrase used to describe a massive volume of both structured and unstructured data that is so large that it's difficult to process with traditional database and software techniques. It is typically described by the first three characteristics below sometimes referred to as the three Vs. However, organizations need a fourthvalueto make big data work. Volume: Huge data sets that are orders of magnitude larger than data managed in traditional storage and analytical solutions. Think petabytes and exabytes instead of terabytes. Variety: Heterogeneous, complex, and variable data, which are generated in formats as different as e-mail, social media, video, images, blogs, and sensor dataas well as shadow data such as access journals and Web search histories. Velocity: Data is generated as a constant stream with real-time queries for meaningful information to be served up on demand rather than batched. Value: Meaningful insights that deliver analytics for future trends and patterns from deep, complex analysis based on machine learning, statistical modeling, and graph algorithms. These analytics go beyond the results of traditional business intelligence querying and reporting.

Big Data Analytics
Decisions that previously were based on guesswork, or on painstakingly constructed models of reality, can now be made based on the data itself. Such big data analysis now drives nearly every aspect of modern society, including mobile services, retail, manufacturing, financial services, life sciences, and the physical sciences. Big data analytics is the process of examining large amounts of data of a variety of types (big data) to uncover hidden patterns, unknown correlations, and other useful information. Such information can provide competitive advantages over rival organizations and result in business benefits such as more effective marketing and increased revenue.

There are three types of big data analytics:
- Descriptive (business intelligence and data mining)
- Predictive (forecasting)
- Prescriptive (optimization and simulation)

Descriptive Analytics
Descriptive analytics looks at past performance and understands it by mining historical data for the reasons behind past success or failure. Almost all management reporting, such as sales, marketing, operations, and finance, uses this type of postmortem analysis. Descriptive models can be used, for example, to categorize customers by their product preferences and life stage, or to examine historical electricity usage data to help plan power needs.
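As a minimal illustration of descriptive analytics, the following Python sketch aggregates a small, hypothetical sales table with pandas to summarize past performance by region and product; the column names and figures are invented for the example.

```python
# Minimal descriptive-analytics sketch: summarizing historical sales with pandas.
# The columns (region, product, revenue) are hypothetical stand-ins.
import pandas as pd

sales = pd.DataFrame({
    "region":  ["North", "North", "South", "South", "South"],
    "product": ["A", "B", "A", "A", "B"],
    "revenue": [120.0, 85.0, 200.0, 150.0, 95.0],
})

# Postmortem view: aggregate past performance by region and product.
summary = sales.groupby(["region", "product"])["revenue"].agg(["sum", "mean", "count"])
print(summary)
```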

Predictive Analytics
Predictive analytics is a set of advanced technologies that enable organizations to use data, both stored and real-time, to move from a historical, descriptive view to a forward-looking perspective of what lies ahead. Predictive analytics turns data into valuable, actionable information, using data to determine the probable future outcome of an event or the likelihood of a situation occurring.
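The sketch below shows a typical predictive-analytics step in Python: fitting a classifier on historical data and using it to estimate the probability of a future event. The features and labels here are randomly generated stand-ins, not a real dataset.

```python
# Hedged sketch of predictive analytics: estimate the probability of a
# future event from historical examples. All data below is synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))   # hypothetical features, e.g. usage, tenure, spend
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=500) > 0).astype(int)  # synthetic event label

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression().fit(X_train, y_train)

# Probable future outcome: probability that each held-out case is a positive.
print(model.predict_proba(X_test)[:5, 1])
print("accuracy:", model.score(X_test, y_test))
```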

Prescriptive Analytics
Prescriptive analytics automatically synthesizes big data, mathematical sciences, business rules, and machine learning to make predictions and then suggests decision options to take advantage of the predictions. Prescriptive analytics goes beyond predicting future outcomes by also suggesting actions to benefit from the predictions and showing the decision maker the implications of each decision option.
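To make the optimization component concrete, here is a small Python sketch that uses linear programming to choose the best mix of two actions under constraints. The objective coefficients and constraints are purely illustrative; a real prescriptive system would combine this step with predictions and business rules as described above.

```python
# Hedged sketch of the optimization step in prescriptive analytics:
# choose the best mix of two actions under constraints. Numbers are invented.
from scipy.optimize import linprog

# Maximize 40*x1 + 30*x2 (profit per unit of actions 1 and 2);
# linprog minimizes, so the objective is negated.
c = [-40, -30]
A_ub = [[1, 1],    # total capacity: x1 + x2 <= 100
        [2, 1]]    # labour hours:   2*x1 + x2 <= 160
b_ub = [100, 160]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
print("recommended action mix:", res.x, "expected profit:", -res.fun)
```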

The Apache Hadoop Framework
The Hadoop open-source framework uses a simple programming model to enable distributed processing of large data sets on clusters of computers. The complete technology stack includes common utilities, a distributed file system, analytics and data storage platforms, and an application layer that manages distributed processing, parallel computation, workflow, and configuration management. In addition to offering high availability, the Hadoop framework is more cost-effective for handling large, complex, or unstructured data sets (typical of big data) than conventional approaches, and it offers massive scalability and speed.

Apache MapReduce
MapReduce, the software programming framework in the Hadoop stack, simplifies processing on large data sets and gives programmers a common method for defining and orchestrating complex processing tasks across clusters of computers. MapReduce applications coordinate the processing of tasks on cluster nodes by scheduling jobs, monitoring activity, and re-executing failed tasks. Input and output are stored in the Hadoop Distributed File System (HDFS).
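As an illustration of the programming model, the sketch below shows the canonical word-count example written for Hadoop Streaming, which lets Python scripts act as MapReduce tasks by reading from standard input and writing tab-separated key-value pairs to standard output. The file names are conventional, not mandated.

```python
# mapper.py -- Hadoop Streaming map task: emit (word, 1) for every word.
# Hadoop pipes each HDFS input split to this script on standard input.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
# reducer.py -- Hadoop Streaming reduce task: sum the counts per word.
# The framework delivers mapper output sorted by key, so identical words
# arrive as a contiguous run and a single pass suffices.
import sys

current_word, total = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word != current_word:
        if current_word is not None:
            print(f"{current_word}\t{total}")
        current_word, total = word, 0
    total += int(count)
if current_word is not None:
    print(f"{current_word}\t{total}")
```

A job of this shape would typically be launched via the hadoop-streaming JAR, passing these two scripts as the -mapper and -reducer options and HDFS paths as -input and -output.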

Big Data Analysis Pipeline
The pipeline consists of multiple distinct phases, as shown in Fig. 1.

Data Acquisition and Recording
Big Data does not arise out of a vacuum: it is recorded from some data-generating source. Much of this data is of no interest, and it can be filtered and compressed by orders of magnitude. One challenge is to define these filters in such a way that they do not discard useful information. A second big challenge is to automatically generate the right metadata to describe what data is recorded and how it is recorded and measured. Another important issue is data provenance: recording information about the data at its birth is not useful unless this information can be interpreted and carried along through the data analysis pipeline.

Fig. 1. The Big Data Analysis Pipeline
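Returning to the acquisition phase, below is a minimal Python sketch, under invented assumptions about the record format and filter threshold, of filtering a raw stream while attaching provenance metadata to each record that is kept, so the filtering decision can be interpreted later in the pipeline.

```python
# Hedged sketch of acquisition-time filtering with provenance metadata.
# The record fields and threshold are hypothetical.
import json
import time

def acquire(raw_records, min_value=0.5):
    """Filter a raw stream, wrapping each kept record with metadata that
    describes its source and the filter that admitted it."""
    for i, rec in enumerate(raw_records):
        if rec["value"] < min_value:        # discard low-signal readings
            continue
        yield {
            "data": rec,
            "provenance": {
                "source": rec.get("sensor", "unknown"),
                "filter": f"value >= {min_value}",
                "acquired_at": time.time(),
                "input_index": i,
            },
        }

stream = [{"sensor": "s1", "value": 0.2}, {"sensor": "s2", "value": 0.9}]
for kept in acquire(stream):
    print(json.dumps(kept))
```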

Information Extraction and Cleaning
Frequently, the information collected will not be in a format ready for analysis. Rather, we require an information extraction process that pulls out the required information from the underlying sources and expresses it in a structured form suitable for analysis. Doing this correctly and completely is a continuing technical challenge.
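As a small sketch of such an extraction step, the Python fragment below pulls structured fields out of semi-structured log lines with a regular expression and drops lines that do not parse; the log format is hypothetical.

```python
# Hedged sketch of information extraction: turning semi-structured log text
# into structured records. The log layout below is invented for illustration.
import re

LINE = re.compile(r'(?P<ip>\d+\.\d+\.\d+\.\d+) .* "GET (?P<path>\S+)" (?P<status>\d{3})')

raw = [
    '10.0.0.1 - - [01/Jan/2013] "GET /index.html" 200',
    '10.0.0.2 - - [01/Jan/2013] "GET /cart" 500',
    'malformed line that the extractor must tolerate',
]

rows = []
for line in raw:
    m = LINE.search(line)
    if m:  # cleaning step: skip lines that do not match the expected shape
        rows.append({"ip": m.group("ip"),
                     "path": m.group("path"),
                     "status": int(m.group("status"))})
print(rows)
```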

Data Integration, Aggregation, and Representation
Data integration, aggregation, and representation is considerably more challenging than simply locating, identifying, understanding, and citing data. For effective large-scale analysis, all of this has to happen in a completely automated manner. This requires differences in data structure and semantics to be expressed in forms that are computer-understandable and then robotically resolvable. There is a strong body of work in data integration, aggregation, and representation that can provide some of the answers.

Query Processing, Data Modeling, and Analysis
Methods for querying and mining Big Data are fundamentally different from traditional statistical analysis on small samples. Big Data is often noisy, dynamic, heterogeneous, interrelated, and untrustworthy. Nevertheless, even noisy Big Data can be more valuable than tiny samples, because general statistics obtained from frequent patterns and correlation analysis usually overpower individual fluctuations and often disclose more reliable hidden patterns and knowledge.

Interpretation
Having the ability to analyze Big Data is of limited value if users cannot understand the analysis. Ultimately, a decision-maker, provided with the result of an analysis, has to interpret these results, which usually involves examining all the assumptions made and retracing the analysis. Providing just the bare results is rarely sufficient; rather, one must provide supplementary information that explains how each result was derived, and based upon precisely what inputs. Such supplementary information is called the provenance of the (result) data.
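As a small illustration of the query-processing claim above, that aggregate statistics over large noisy data overpower individual fluctuations, the following Python fragment uses synthetic distributions: any single record is unreliable, but the mean over a million records recovers the underlying value.

```python
# Hedged illustration: individual records are noisy, but aggregate statistics
# computed over many records remain stable. All distributions are synthetic.
import numpy as np

rng = np.random.default_rng(42)
signal = rng.normal(loc=5.0, scale=1.0, size=1_000_000)   # true underlying pattern
noisy = signal + rng.normal(scale=3.0, size=signal.size)  # heavy per-record noise

print("one noisy record:", noisy[0])        # individually unreliable
print("aggregate mean:  ", noisy.mean())    # close to 5.0, noise averages out
print("correlation with signal:", np.corrcoef(signal, noisy)[0, 1])
```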

Literature survey
[1] Sachchidanand Singh and Nirmala Singh, "Big Data Analytics," 2012 International Conference on Communication, Information & Computing Technology (ICCICT), Oct. 19-20, Mumbai, India. A review paper that discusses the concept, characteristics, and need of Big Data, and the different offerings available in the market for exploring unstructured large data. The paper also covers Big Data adoption trends, entry and exit criteria for vendor and product selection, best practices, customer success stories, and the benefits of Big Data analytics.

[2] Xiaoyue Han, Lianhua Tian, Minjoo Yoon, and Minsoo Lee, "A Big Data Model Supporting Information Recommendation in Social Networks," 2012 IEEE International Conference on Cloud and Green Computing (CGC), pp. 810-813. This paper introduces a big data model for recommender systems using social network data. The model incorporates factors related to social networks and can be applied to information recommendation with respect to various social behaviors, which can increase the reliability of the recommended information.

[3] J. S. Kim, M. H. Yang, Y. J. Hwang, S. H. Jeon, K. Y. Kim, I. S. Jung, C. H. Choi, W. S. Cho, and J. H. Na, "Customer Preference Analysis Based on SNS Data," 2012 IEEE International Conference on Cloud and Green Computing (CGC). In this paper, Twitter data (SNS, Social Network Service, data) is collected, stored, and analyzed in a multi-dimensional fashion on top of the Hadoop platform in order to find out what kinds of factors affect customer preference for smartphones. Opinion mining (sentiment analysis) is used to determine the attitude of customers toward smartphones in Korea.

[4] Kapil Bakshi, "Considerations for Big Data: Architecture and Approach," 2012 IEEE International Conference on Cloud and Green Computing (CGC). The main focus of the paper is unstructured data analysis. The various techniques for unstructured data analysis share the common characteristics of scale-out, elasticity, and high availability. MapReduce, in conjunction with the Hadoop Distributed File System (HDFS) and the HBase database, as part of the Apache Hadoop project, is a modern approach to analyzing unstructured data. Hadoop clusters are an effective means of processing massive volumes of data and can be improved with the right architectural approach.

[5] E. Begoli and J. Horey, "Design Principles for Effective Knowledge Discovery from Big Data," 2012 Joint Working IEEE/IFIP Conference on Software Architecture (WICSA) and European Conference on Software Architecture (ECSA). Due to the difficulty of analyzing large datasets, big data presents unique systems engineering and architectural challenges. This paper presents three system design principles that can inform organizations on effective analytic and data collection processes, system organization, and data dissemination practices. The design principles are derived from the authors' development experiences with big data problems at various federal agencies.

[6] A. Mukherjee, J. Datta, R. Jorapur, R. Singhvi, S. Haloi, and W. Akram, "Shared Disk Big Data Analytics with Apache Hadoop," 2012 19th International Conference on High Performance Computing (HiPC), pp. 1-6. This paper debates the need for a massively scalable distributed computing platform for Big Data analytics in traditional businesses. For organizations that do not need horizontal, internet-order scalability in their analytics platform, Big Data analytics can be built on top of a traditional POSIX cluster file system employing a shared storage model. The VERITAS Cluster File System (VxCFS) is compared with the Hadoop Distributed File System (HDFS) using popular MapReduce benchmarks on top of Apache Hadoop. In the experiments, VxCFS not only matched the performance of HDFS but also outperformed it in many cases. This way, enterprises can fulfill their Big Data analytics needs with a traditional, existing shared storage model, without migrating to a different storage model in their data centers.

[7] B. Chandramouli, J. Goldstein, and Songyun Duan, "Temporal Analytics on Big Data for Web Advertising," 2012 IEEE 28th International Conference on Data Engineering (ICDE). "Big Data" in map-reduce (M-R) clusters is often fundamentally temporal in nature, as are many analytics tasks over such data. For instance, display advertising uses Behavioral Targeting (BT) to select ads for users based on prior searches, page views, etc. The paper proposes a novel framework called TiMR (pronounced "timer") that combines a time-oriented data processing system with an M-R framework. It also shows the feasibility of this approach for BT, with new temporal algorithms that exploit new targeting opportunities.

[8] Yu-Neng Fan and Ching-Chin Chern, "An Agent Model for Incremental Rough Set-Based Rule Induction: A Big Data Analysis in Sales Promotion," 2013 46th Hawaii International Conference on System Sciences (HICSS). Rough set (RS)-based rule induction can generate decision rules from a database and has mechanisms to handle noise and uncertainty in data. This technique facilitates managerial decision-making and strategy formulation. However, the process of RS-based rule induction is complex and computationally intensive. The paper proposes an Incremental Rough Set-based Rule Induction Agent (IRSRIA), in which rule induction is based on creating agents for the main modeling processes. In addition, an incremental architecture is designed to address large-scale dynamic database problems. A case study of a home shopping company is used to show the validity and efficiency of the method.

[9] "Predictive Analytics 101: Next-Generation Big Data Intelligence," Intel IT Center, March 2013. http://www.intel.in/content/www/in/en/big-data/big-data-predictive-analytics-overview.html This paper discusses predictive analytics: why it matters, how businesses can operationalize it, and its impact on IT.

[10] "Vision Paper: Distributed Data Mining and Big Data," Intel IT Center, August 2012. http://www.intel.in/content/www/in/en/big-data/distributed-data-mining-paper.html This paper describes Intel's perspective on the analytics of big data and gives a quick overview of emerging technologies, including distributed frameworks such as the Apache Hadoop framework.

[11] "Challenges and Opportunities with Big Data," a community white paper developed by leading researchers across the United States. http://www.cra.org/ccc/files/docs/init/bigdatawhitepaper.pdf This paper describes the multiple distinct phases in the analysis of Big Data, and the opportunities and challenges in each phase of the Big Data Analysis Pipeline.

[12] "Industry standard process for carrying out predictive analytics: CRISP-DM." http://www.red-olive.co.uk/2011/07/industry-standard-process-for-carrying-outpredictive-analytics-crisp-dm/ This paper describes the six phases for carrying out predictive analytics, which can be used as a methodology for the proposed work.

Problem Identification
The massive scale and growth of data in general, and of semi-structured and unstructured data in particular, outstrip the capabilities of traditional storage and analytic solutions, which also do not cope well with the heterogeneity of big data. Organizations may be data-rich, but new analytic processes and technologies are needed to unlock the potential of big data. Big data derives most of its value from the insights it produces when analyzed: finding patterns, deriving meaning, making decisions, and ultimately responding to the world with intelligence. As big data technology continues to evolve, businesses are turning to analytics to help them deepen engagement with customers, optimize processes, and reduce operational costs.

To its proponents, prescriptive analytics is the next evolution in business analytics: an automated system that combines big data, business rules, mathematical models, and machine learning to deliver sage advice in a timely fashion. While predictive analytics helps you model and forecast what might happen in the future, prescriptive analytics helps you decide the best course of action to take given your objectives, requirements, and constraints. It seeks to find the optimal solution given a variety of choices, alternatives, and influences that might affect the outcome. An even more advanced area of prescriptive analytics uses stochastic optimization to also take into consideration the uncertainty that might exist in the data used in the analysis. Prescriptive analytics not only anticipates what will happen and when it will happen, but also why it will happen. Further, prescriptive analytics suggests decision options on how to take advantage of a future opportunity or mitigate a future risk, and shows the implications of each decision option. Prescriptive analytics can continually and automatically take in new data to improve prediction accuracy and provide better decision options. It ingests hybrid data, a combination of structured and unstructured data, together with business rules, to predict what lies ahead and to prescribe how to take advantage of this predicted future without compromising other priorities.

The proposed work is to build a model for Big Data analysis using advanced analytics.

Objectives
The proposed work belongs to the Analysis/Modeling phase of the Big Data Analysis Pipeline shown in Fig. 1. It attempts to achieve the following objectives:

To review the history and evolution of big data and advanced analytics, and to comparatively study the existing analytical techniques for extracting useful information from big data, which can be used to extend traditional predictive models and analytics.

To devise a mechanism/framework for processing and condensing data into more connected forms, so that it is useful for event comprehension and decision making, leading to information by discovering relationships in the data. The problems start right away during data acquisition, when the data tsunami requires us to make decisions, currently in an ad hoc manner, about what data to keep and what to discard, and how to store what we keep reliably with the right metadata. Much data today is not natively in structured format; for example, tweets and blogs are weakly structured pieces of text, while images and video are structured for storage and display, but not for semantic content and search. Transforming such content into a structured format for later analysis is one of the primary objectives.

To design and develop an algorithm for understanding and building patterns in information for predicting new outcomes, based on advanced analytics, in an automated fashion using machine learning, and adopting in-memory computing for high performance.
- Applying advanced analytics for suggesting actions to benefit from the predictions, and showing the decision maker the implications of each decision option, in order to present use cases and case studies that demonstrate the value of advanced analytics.
- In-memory computing: a workload where all the data being processed is stored in computer memory that is directly addressable via the CPU's memory bus. It provides high-speed performance for OLTP and BI workloads by eliminating I/O to storage devices, and is especially beneficial for interactive and iterative BI analytic workloads.

To develop a case study and apply the above to perform real-time monitoring and analytics: in-line fraud detection to reduce financial losses caused by stolen credit cards.
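A hedged sketch of what such in-line scoring could look like follows: a classifier trained offline on labelled (and here entirely synthetic) transaction history scores each incoming transaction before approval. The features, labels, and block threshold are all hypothetical.

```python
# Hedged sketch of in-line fraud scoring, not a production fraud system.
# A model trained offline scores each incoming transaction before approval.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
# Offline training on synthetic labelled history:
# hypothetical features = [amount, distance_from_home, hour_of_day].
X_hist = rng.uniform(0, 1, size=(2000, 3))
y_hist = (X_hist[:, 0] + X_hist[:, 1] > 1.4).astype(int)   # invented fraud label
model = RandomForestClassifier(n_estimators=50, random_state=1).fit(X_hist, y_hist)

def score_transaction(txn, threshold=0.8):
    """Return (fraud_probability, decision) for one incoming transaction."""
    p = model.predict_proba(np.asarray(txn).reshape(1, -1))[0, 1]
    return p, ("BLOCK" if p >= threshold else "APPROVE")

print(score_transaction([0.95, 0.9, 0.2]))   # suspicious pattern
print(score_transaction([0.1, 0.05, 0.5]))   # looks normal
```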

Methodology
CRISP-DM (CRoss Industry Standard Process for Data Mining) helps analysts focus on solving specific business problems with measurable goals. It is a non-proprietary approach developed by a large consortium of organisations such as DaimlerChrysler, Teradata and SPSS, with contributions from over 300 other companies. The overall process is outlined in the diagram below and involves six main phases:

Fig. 2. CRISP-DM

The outermost circle illustrates the iterative and incremental nature of analytics itself: over time the models need to be refreshed to take into account changes in the business environment, and they are further enhanced as greater insight is gained.

The six main phases within CRISP-DM are:
1) Business Understanding involves clarifying the business aims for the project, converting these into an analytics problem definition, and designing a preliminary plan to achieve the objectives.
2) Data Understanding involves the analyst carrying out an initial data collection, becoming familiar with the data collected, and identifying any data quality problems that need to be addressed.
3) Data Preparation involves selecting the specific modeling techniques to be applied and then getting the data into a form in which the modeling can be carried out. The preparation steps are selecting a subset of data for analysis, cleansing the data to address quality issues, and transforming the data into a usable form.
4) Modeling involves applying the selected modeling techniques.
5) Evaluation involves checking whether the model properly achieves the business objectives and using the results to identify whether some important business issue has been missed.
6) Deployment involves generating model reports and releasing the tested model into the organisation's decision-making process.
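To make the iterative structure of these phases concrete, the following Python skeleton maps each phase onto a placeholder function and loops back when evaluation fails; it is an illustration of the process flow only, and every stage body below is an invented stub, not an implementation of any phase.

```python
# Hedged skeleton of the six CRISP-DM phases as callable stages.
# All stage bodies are placeholders; only the control flow is the point.
def business_understanding():  return {"goal": "reduce churn", "target_lift": 0.1}
def data_understanding(goal):  return {"rows": 10_000, "quality_issues": ["nulls"]}
def data_preparation(profile): return [[0.1, 0.2], [0.3, 0.4]]    # cleaned features
def modeling(dataset):         return lambda x: sum(x) > 0.5      # toy stand-in model
def evaluation(model, goal):   return True                        # objectives met?
def deployment(model):         print("model released to decision process")

goal = business_understanding()
while True:                          # outer circle in Fig. 2: iterate until ready
    profile = data_understanding(goal)
    data = data_preparation(profile)
    model = modeling(data)
    if evaluation(model, goal):      # loop back if business objectives are unmet
        break
deployment(model)
```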

Possible outcome
Organizations can use Big Data analytics for decision making, solving business problems, and identifying opportunities, including:
- Optimizing business processes and reducing operational costs
- Engaging more deeply with customers and enhancing the customer experience
- Identifying new product and market opportunities
- Reducing risk by anticipating and mitigating problems before they occur

References
[1] Sachchidanand Singh and Nirmala Singh, "Big Data Analytics," 2012 International Conference on Communication, Information & Computing Technology (ICCICT), Oct. 19-20, Mumbai, India.
[2] Xiaoyue Han, Lianhua Tian, Minjoo Yoon, and Minsoo Lee, "A Big Data Model Supporting Information Recommendation in Social Networks," 2012 IEEE International Conference on Cloud and Green Computing (CGC), pp. 810-813.
[3] J. S. Kim, M. H. Yang, Y. J. Hwang, S. H. Jeon, K. Y. Kim, I. S. Jung, C. H. Choi, W. S. Cho, and J. H. Na, "Customer Preference Analysis Based on SNS Data," 2012 IEEE International Conference on Cloud and Green Computing (CGC).
[4] Kapil Bakshi, "Considerations for Big Data: Architecture and Approach," 2012 IEEE International Conference on Cloud and Green Computing (CGC).
[5] E. Begoli and J. Horey, "Design Principles for Effective Knowledge Discovery from Big Data," 2012 Joint Working IEEE/IFIP Conference on Software Architecture (WICSA) and European Conference on Software Architecture (ECSA).
[6] A. Mukherjee, J. Datta, R. Jorapur, R. Singhvi, S. Haloi, and W. Akram, "Shared Disk Big Data Analytics with Apache Hadoop," 2012 19th International Conference on High Performance Computing (HiPC), pp. 1-6.
[7] B. Chandramouli, J. Goldstein, and Songyun Duan, "Temporal Analytics on Big Data for Web Advertising," 2012 IEEE 28th International Conference on Data Engineering (ICDE).
[8] Yu-Neng Fan and Ching-Chin Chern, "An Agent Model for Incremental Rough Set-Based Rule Induction: A Big Data Analysis in Sales Promotion," 2013 46th Hawaii International Conference on System Sciences (HICSS).
[9] "Predictive Analytics 101: Next-Generation Big Data Intelligence," Intel IT Center, March 2013. http://www.intel.in/content/www/in/en/big-data/big-data-predictive-analytics-overview.html
[10] "Vision Paper: Distributed Data Mining and Big Data," Intel IT Center, August 2012. http://www.intel.in/content/www/in/en/big-data/distributed-data-mining-paper.html
[11] "Challenges and Opportunities with Big Data," a community white paper developed by leading researchers across the United States. http://www.cra.org/ccc/files/docs/init/bigdatawhitepaper.pdf
[12] "Industry standard process for carrying out predictive analytics: CRISP-DM." http://www.red-olive.co.uk/2011/07/industry-standard-process-for-carrying-outpredictive-analytics-crisp-dm/
[13] "Analytics: The real-world use of big data," IBM Global Business Services, Business Analytics and Optimization Executive Report. http://public.dhe.ibm.com/common/ssi/ecm/en/gbe03519usen/GBE03519USEN.PDF
[14] "TDWI Best Practices Report on Big Data Analytics." http://www.greenplum.com/sites/default/files/TDWI_BPReport_Q411_Big_Data_Analytics.pdf
[15] "Advanced Predictive Analytics: Predicting the Outcome!" http://www.ibmbigdatahub.com/blog/advanced-predictive-analytics-predictingoutcome
[16] "Predictive Analytics: Making Little Decisions with Big Data." http://www.information-management.com/news/predictive-analytics-making-littledecisions-with-big-data-10023151-1.html?zkPrintable=1&nopagination=1 (also http://bit.ly/TIkUgf)

Davenport, Thomas H.; Kalakota, Ravi; Taylor, James; Lampa, Mike; Franks, Bill; Shapiro, Jeremy; Cokins, Gary; Way, Robin; King, Joy; Schafer, Lori; Renfrow, Cyndy; and Sittig, Dean, "Predictions for Analytics in 2012," International Institute for Analytics (December 15, 2011).

Bertolucci, Jeff, "Prescriptive Analytics and Data: Next Big Thing?" InformationWeek (April 15, 2013).

Basu, Atanu, "Five Pillars of Prescriptive Analytics Success," Analytics (March/April 2013).

Laney, Douglas and Kart, Lisa, "Emerging Role of the Data Scientist and the Art of Data Science," Gartner (March 20, 2012).

McCormick Northwestern Engineering, "Prescriptive analytics is about enabling smart decisions based on data."

Business Analytics Information Event, I2SDS and Department of Decision Sciences, School of Business, The George Washington University (February 10, 2011).

Farris, Adam, "How Big Data is Changing the Oil & Gas Industry," Analytics (November/December 2012).

Venter, Fritz and Stein, Andrew, "Images & Videos: Really Big Data," Analytics (November/December 2012).

Venter, Fritz and Stein, Andrew, "The Technology Behind Image Analytics," Analytics (November/December 2012).

Horner, Peter and Basu, Atanu, "Analytics and the Future of Healthcare," Analytics (January/February 2012).

Ghosh, Rajib; Basu, Atanu; and Bhaduri, Abhijit, "From Sick Care to Health Care," Analytics (July/August 2011).

Fischer, Eric; Basu, Atanu; Hubele, Joachim; and Levine, Eric, "TV Ads, Wanamaker's Dilemma & Analytics," Analytics (March/April 2011).

Basu, Atanu and Worth, Tim, "Predictive Analytics: Practical Ways to Drive Customer Service, Looking Forward," Analytics (July/August 2010).

Jain, Rajeev; Basu, Atanu; and Levine, Eric, "Putting Major Energy Decisions through their Paces: A Framework for a Better Environment through Analytics," OR/MS Today (December 2010).

Bhaduri, Abhijit and Basu, Atanu, "Predictive Human Resources: Can Math Help Improve HR Mandates in an Organization?" OR/MS Today (October 2010).

Brown, Scott; Basu, Atanu; and Worth, Tim, "Predictive Analytics in Field Service: Practical Ways to Drive Field Service, Looking Forward," Analytics (November/December 2010).

Pease, Andrew, "Bringing Optimization to the Business," SAS Global Forum 2012, Paper 165-2012 (2012).