
MC0088 - Data Mining

(Book ID: B1009)

Q1. What is operational intelligence?

Ans: Operational intelligence (OI) is a form of real-time, dynamic business analytics that delivers visibility and insight into business operations. The purpose of OI is to monitor business activities and to identify and detect situations relating to inefficiencies, opportunities, and threats. OI helps to quantify the following:
- Efficiency of the business activities
- Impact of the IT infrastructure and unexpected events on the business activities
- Execution of the business activities contributing to revenue gains or losses
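As a rough illustration of the real-time situation detection described above (not from the course text; the event structure, metric, and threshold are all hypothetical), a minimal Python sketch of an OI-style monitor might look like this:

import time
from collections import deque

# Hypothetical sketch: detect a "situation" (sustained high latency)
# in a stream of operational events, the way an OI tool might.
LATENCY_THRESHOLD_MS = 500   # assumed service-level limit
WINDOW = 10                  # number of recent events to consider

def monitor(event_stream):
    recent = deque(maxlen=WINDOW)
    for event in event_stream:          # each event: {"ts": ..., "latency_ms": ...}
        recent.append(event["latency_ms"])
        if len(recent) == WINDOW:
            avg = sum(recent) / WINDOW
            if avg > LATENCY_THRESHOLD_MS:
                yield {"alert": "high latency", "avg_ms": avg, "at": event["ts"]}

# Usage with synthetic events:
events = [{"ts": i, "latency_ms": 100 + 60 * i} for i in range(20)]
for alert in monitor(events):
    print(alert)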

Features

Different operational intelligence solutions may use many different technologies and be implemented in different ways. This section lists the common features of an operational intelligence solution:
- Real-time monitoring
- Real-time situation detection
- Real-time dashboards for different user roles

- Correlation of events
- Industry-specific dashboards
- Multidimensional analysis
  o Root cause analysis
  o Time series and trending analysis

Comparison

OI is often linked to or compared with business intelligence (BI) or real-time business intelligence, in the sense that both help make sense of large amounts of information. But there are some basic differences: OI is primarily activity-centric, whereas BI is primarily data-centric. (As with most technologies, each of these could be sub-optimally coerced to perform the other's task.) OI is, by definition, real-time, unlike BI, which is traditionally an after-the-fact, report-based approach to identifying patterns, and unlike real-time BI, which relies on a database as the sole source of events.

Q2. What is Business Intelligence? Explain the components of BI architecture.

Ans: Business intelligence is an environment in which business users receive data that is reliable, consistent, understandable, easily manipulated and timely. With this data, business users are able to conduct analyses that yield an overall understanding of where the business has been, where it is now and where it will be in the near future. Business intelligence serves two main purposes. It monitors the financial and operational health of the organization (reports, alerts, alarms, analysis tools, key performance indicators and dashboards). It also regulates the operation of the organization, providing two-way integration with operational systems and information feedback analysis. There are various definitions given by the experts; some of them are given below: Converting data into knowledge and making it available throughout the organization are the jobs of processes and applications known as business intelligence. BI is a term that encompasses a broad range of analytical software and solutions for gathering, consolidating, analyzing and providing access to information in a way that is supposed to let the users of an enterprise make better business decisions.

Business Intelligence Infrastructure

Business organizations can gain a competitive advantage with a well-designed business intelligence (BI) infrastructure. Think of the BI infrastructure as a set of layers that begins with the operational systems' information and metadata and ends in the delivery of business intelligence to various business user communities. Based on the overall requirements of business intelligence, the data integration layer is required to extract, cleanse and transform data into load files for the information warehouse. This layer begins with transaction-level operational data and metadata about these operational systems. Typically this data integration is done using a relational staging database and flat file extracts from source systems. The product of a good data-staging layer is high-quality data, a reusable infrastructure and metadata supporting both business and technical users.
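As a hedged illustration of the extract-cleanse-transform step just described (not part of the course text; the file names and field layout are hypothetical), a minimal Python sketch of a data-staging step might look like this:

import csv

# Minimal ETL sketch for a data-staging layer: extract rows from a
# hypothetical flat-file export, cleanse them, and transform them into
# a load file for the information warehouse. All names are assumptions.

def extract(path):
    with open(path, newline="") as f:
        yield from csv.DictReader(f)

def cleanse(rows):
    for row in rows:
        amount = row.get("amount", "").strip()
        if not amount:                      # drop rows missing a key field
            continue
        row["amount"] = float(amount)
        row["region"] = row.get("region", "UNKNOWN").upper()
        yield row

def transform_and_load(rows, out_path):
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["txn_id", "region", "amount"])
        writer.writeheader()
        for row in rows:
            writer.writerow({k: row[k] for k in ("txn_id", "region", "amount")})

# Usage (hypothetical files):
# transform_and_load(cleanse(extract("sales_export.csv")), "warehouse_load.csv")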

The information warehouse is usually developed incrementally over time and is architected to include key business variables and business metrics in a structure that meets all business analysis questions required by the business groups.

1. The information warehouse layer consists of relational and/or OLAP cube services that allow business users to gain insight into their areas of responsibility in the organization.
2. Customer intelligence relates to customer, service, sales and marketing information viewed along time periods, location/geography, and product and customer variables.
3. Business decisions that can be supported with customer intelligence range from pricing, forecasting, promotion strategy and competitive analysis to up-sell strategy and customer service resource allocation.
4. Operational intelligence relates to finance, operations, manufacturing, distribution, logistics and human resource information viewed along time periods, location/geography, product, project, supplier, carrier and employee.
5. The most visible layer of the business intelligence infrastructure is the applications layer, which delivers the information to business users.

6. Business intelligence requirements include scheduled report generation and distribution, query and analysis capabilities to pursue special investigations, and graphical analysis permitting trend identification. This layer should enable business users to interact with the information to gain new insight into the underlying business variables to support business decisions.
7. Presenting business intelligence on the Web through a portal is gaining considerable momentum. Portals are usually organized by communities of users, such as suppliers, customers, employees and partners.
8. Portals can reduce the overall infrastructure costs of an organization as well as deliver strong self-service and information access capabilities.
9. Web-based portals are becoming commonplace as a single personalized point of access for key business information.

Q3. Differentiate between database management systems (DBMS) and data mining.

Ans: A database management system (DBMS) is the software that manages data on physical storage devices. Data mining is the process of discovering relationships among the data in a database. The two can be contrasted as follows:

Area             | DBMS                                                | Data mining
Task             | Extraction of detailed and summary data             | Knowledge discovery of hidden patterns and insights
Type of result   | Information                                         | Insight and prediction
Method           | Deduction (ask the question, verify with the data)  | Induction (build the model, apply it to new data, get the result)
Example question | Who purchased mutual funds in the last 3 years?     | Who will buy a mutual fund in the next 6 months, and why?
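As a hedged, illustrative sketch (not from the text; the table, column names and feature data are hypothetical), the deduction/induction contrast can be seen in code: a SQL query answers the retrospective question directly, while a data mining model is trained on past data and then applied to predict future behavior.

import sqlite3

# --- DBMS (deduction): ask a precise question of stored facts ---
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (customer TEXT, product TEXT, year INTEGER)")
conn.executemany("INSERT INTO purchases VALUES (?, ?, ?)", [
    ("alice", "mutual_fund", 2023), ("bob", "stocks", 2022),
    ("carol", "mutual_fund", 2021),
])
buyers = conn.execute(
    "SELECT DISTINCT customer FROM purchases "
    "WHERE product = 'mutual_fund' AND year >= 2022"
).fetchall()
print("Bought mutual funds recently:", buyers)   # retrospective answer

# --- Data mining (induction): learn a pattern, apply it to new data ---
# Toy 1-nearest-neighbour "model" over (age, income) features; entirely
# hypothetical data, just to show the train-then-predict flow.
history = [((25, 30_000), 0), ((45, 90_000), 1), ((50, 110_000), 1), ((30, 40_000), 0)]

def predict_will_buy(customer_features):
    nearest = min(history, key=lambda ex: sum((a - b) ** 2
                  for a, b in zip(ex[0], customer_features)))
    return nearest[1]   # 1 = likely to buy a mutual fund

print("New customer (48, 95_000) likely buyer?", predict_will_buy((48, 95_000)))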

Data mining is concerned with finding hidden relationships present in business data that allow businesses to make predictions for future use. It is the process of data-driven extraction of not-so-obvious but useful information from large databases. The aim of data mining is to extract implicit, previously unknown and potentially useful (or actionable) patterns from data. Data mining comprises many up-to-date techniques, such as classification (decision trees, naive Bayes classifier, k-nearest neighbor, and neural networks), clustering (k-means, hierarchical clustering, and density-based clustering), and association (one-dimensional, multidimensional, multilevel, and constraint-based association).

Data warehousing is defined as a process of centralized data management and retrieval. A data warehouse is a relational database system designed to support very large databases (VLDB) at a significantly higher level of performance and manageability. A data warehouse is an environment, not a product. It is an architectural construct of information that is hard to access or present in traditional operational data stores.

Q4. What is Neural Network? Explain in detail.

Ans: An Artificial Neural Network (ANN) is an information-processing paradigm that is inspired by the way biological nervous systems, such as the brain, process information. The key element of this paradigm is the novel structure of the information-processing system. It is composed of a large number of highly interconnected processing elements (neurons) working in unison to solve specific problems. Neural networks are made up of many artificial neurons. An artificial neuron is an electronically modelled biological neuron. The number of neurons used depends on the task at hand; it could be as few as three or as many as several thousand. There are many different ways of connecting artificial neurons together to create a neural network. There are different types of neural networks, each of which has different strengths particular to its applications. The abilities of different networks can be related to their structure, dynamics and learning methods.
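To make the idea of an artificial neuron concrete, here is a minimal, hedged Python sketch (not from the text; the weights and the choice of sigmoid activation are purely illustrative): each neuron computes a weighted sum of its inputs and passes it through an activation function.

import math

# Minimal sketch of a single artificial neuron: weighted sum of inputs
# plus a bias, passed through a sigmoid activation. Weights here are
# fixed for illustration; a real network would learn them.
def neuron(inputs, weights, bias):
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))        # sigmoid activation

# A tiny network of three such neurons (two hidden, one output),
# matching the text's remark that a network can be as small as three:
def tiny_network(x1, x2):
    h1 = neuron([x1, x2], [0.5, -0.4], 0.1)
    h2 = neuron([x1, x2], [-0.3, 0.8], 0.0)
    return neuron([h1, h2], [1.2, -0.7], 0.2)

print(tiny_network(1.0, 0.0))   # output in (0, 1)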

Q5. What is partition algorithm? Explain with the help of suitable example.

Ans: The partition algorithm is based on the observation that the frequent sets are normally very few in number compared to the set of all item sets. As a result, if we partition the set of transactions into smaller segments such that each segment can be accommodated in main memory, then we can compute the set of frequent sets of each of these partitions. It is assumed that these sets (the sets of local frequent sets) contain a reasonably small number of item sets. Hence, we can read the whole database (the unsegmented one) once, to count the support of the set of all local frequent sets.

The partition algorithm uses two scans of the database to discover all frequent sets. In one scan, it generates a set of all potentially frequent item sets by scanning the database once. This set is a superset of all frequent item sets, i.e., it may contain false positives, but no false negatives are reported. During the second scan, counters for each of these item sets are set up and their actual support is measured in one scan of the database.

The algorithm executes in two phases. In the first phase, the partition algorithm logically divides the database into a number of non-overlapping partitions. The partitions are considered one at a time and all frequent item sets for that partition are generated. Thus, if there are n partitions, Phase I of the algorithm takes n iterations. At the end of Phase I, these frequent item sets are merged to generate a set of all potential frequent item sets. In this step, the local frequent item sets of the same length from all n partitions are combined to generate the global candidate item sets. In Phase II, the actual support for these item sets is counted and the frequent item sets are identified. The algorithm reads the entire database once during Phase I and once during Phase II. The partition sizes are chosen such that each partition can be accommodated in main memory, so that the partitions are read only once in each phase.

A partition P of the database refers to any subset of the transactions contained in the database. Any two partitions are non-overlapping. We define the local support of an item set as the fraction of the transactions in a partition that contain that item set. We define a local frequent item set as an item set whose local support in a partition is at least the user-defined minimum support σ. A local frequent item set may or may not be frequent in the context of the entire database.

Partition Algorithm

P = partition_database(T); n = number of partitions
// Phase I
for i = 1 to n do begin
    read_in_partition(Ti in P)
    L^i = all frequent item sets of Ti, generated with the Apriori method in main memory
end
// Merge Phase
for (k = 2; L_k^i ≠ ∅ for some i = 1, 2, ..., n; k++) do begin
    C_k^G = L_k^1 ∪ L_k^2 ∪ ... ∪ L_k^n
end
// Phase II
for i = 1 to n do begin
    read_in_partition(Ti in P)
    for all candidates c ∈ C^G compute s(c)_Ti
end
L^G = {c ∈ C^G | s(c)_T ≥ σ}
Answer = L^G

The partition algorithm is based on the premise that the size of the global candidate set is considerably smaller than the set of all possible item sets. The intuition behind this is that the size of the global candidate set is bounded by n times the size of the largest of the sets of locally frequent sets. For sufficiently large partition sizes, the number of local frequent item sets is likely to be comparable to the number of frequent item sets generated for the entire database. If the data characteristics are uniform across partitions, then a large number of the item sets generated for individual partitions may be common.
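A hedged Python sketch of the two-phase flow follows (illustrative only; for brevity it finds local frequent sets by brute force rather than by the Apriori method the text names, and the transactions are hypothetical):

from itertools import combinations

# Illustrative sketch of the partition algorithm's two-phase flow.
# Local frequent sets are found by brute force here, not Apriori.
def local_frequent(partition, minsup):
    items = sorted({i for t in partition for i in t})
    frequent = set()
    for k in range(1, len(items) + 1):
        for cand in combinations(items, k):
            support = sum(set(cand) <= t for t in partition) / len(partition)
            if support >= minsup:
                frequent.add(cand)
    return frequent

def partition_algorithm(transactions, n_parts, minsup):
    size = -(-len(transactions) // n_parts)            # ceiling division
    parts = [transactions[i:i + size] for i in range(0, len(transactions), size)]
    # Phase I + merge: union of local frequent sets = global candidates
    candidates = set().union(*(local_frequent(p, minsup) for p in parts))
    # Phase II: one full scan to count actual (global) support
    return {c for c in candidates
            if sum(set(c) <= t for t in transactions) / len(transactions) >= minsup}

# Usage with toy transactions (sets of item ids):
T = [{1, 2}, {1, 3}, {2, 3}, {1, 2, 3}, {2, 4}, {1, 2}]
print(partition_algorithm(T, n_parts=2, minsup=0.5))

Because the global candidate set is the union of all local frequent sets, no genuinely frequent item set can be missed (no false negatives); Phase II only discards false positives.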

Example

Let us take the same database T, given in Example 6.2, and the same σ. Let us partition T, for the sake of illustration, into three partitions T1, T2, and T3, each containing 5 transactions. The first partition T1 contains transactions 1 to 5, T2 contains transactions 6 to 10 and, similarly, T3 contains transactions 11 to 15. We fix the local supports as equal to the given support, that is, 20%. Thus, σ1 = σ2 = σ3 = σ = 20%. Any item set that appears in even one of the transactions of a partition is a local frequent set in that partition (since 1/5 = 20%). The local frequent sets of the T1 partition are the item sets X such that s(X)_T1 ≥ σ1.

L1 := {{1}, {2}, {3}, {4}, {5}, {6}, {7}, {8}, {1,5}, {1,6}, {1,8}, {2,3}, {2,4}, {2,8}, {4,5}, {4,7}, {4,8}, {5,6}, {5,8}, {5,7}, {6,7}, {6,8}, {1,6,8}, {1,5,6}, {1,5,8}, {2,4,8}, {4,5,7}, {5,6,8}, {5,6,7}, {1,5,6,8}}

Similarly,

L2 := {{2}, {3}, {4}, {5}, {6}, {7}, {8}, {9}, {2,3}, {2,4}, {2,6}, {2,7}, {2,9}, {3,4}, {3,5}, {3,7}, {5,7}, {6,7}, {6,9}, {7,9}, {2,3,4}, {2,6,7}, {2,6,9}, {2,7,9}, {3,5,7}, {2,6,7,9}}

L3 := {{1}, {2}, {3}, {4}, {5}, {6}, {7}, {8}, {9}, {1,3}, {1,5}, {1,7}, {2,3}, {2,4}, {2,6}, {2,7}, {2,9}, {3,5}, {3,7}, {3,9}, {4,6}, {4,7}, {5,6}, {5,7}, {5,8}, {6,7}, {6,8}, {1,3,5}, {1,3,7}, {1,5,7}, {2,3,9}, {2,4,6}, {2,4,7}, {3,5,7}, {4,6,7}, {5,6,8}, {1,3,5,7}, {2,4,6,7}}

In Phase II, we have the candidate set C := L1 ∪ L2 ∪ L3:

C := {{1}, {2}, {3}, {4}, {5}, {6}, {7}, {8}, {9}, {1,3}, {1,5}, {1,6}, {1,7}, {1,8}, {2,3}, {2,4}, {2,6}, {2,7}, {2,8}, {2,9}, {3,4}, {3,5}, {3,7}, {3,9}, {4,5}, {4,6}, {4,7}, {4,8}, {5,6}, {5,7}, {5,8}, {6,7}, {6,8}, {6,9}, {7,9}, {1,3,5}, {1,3,7}, {1,5,6}, {1,5,7}, {1,5,8}, {1,6,8}, {2,3,4}, {2,3,9}, {2,4,6}, {2,4,7}, {2,4,8}, {2,6,7}, {2,6,9}, {2,7,9}, {3,5,7}, {4,5,7}, {4,6,7}, {5,6,8}, {5,6,7}, {1,5,6,8}, {2,6,7,9}, {1,3,5,7}, {2,4,6,7}}

Read the database once to compute the global support of the sets in C and get the final set of frequent sets.

Q6. Describe the following with respect to Web Mining:
a. Categories of Web Mining (5)
b. Applications of Web Mining (5)

Ans:

a. Categories of Web Mining

Web mining is broadly defined as the discovery and analysis of useful information from the World Wide Web. Web mining is divided into three categories:
1. Web Content Mining
2. Web Structure Mining
3. Web Usage Mining
All three categories focus on the process of knowledge discovery of implicit, previously unknown and potentially useful information from the web. Each of them focuses on different mining objects of the web.

Content mining is used to search, collate and examine data by search engine algorithms (this is done using Web robots). Structure mining is used to examine the structure of a particular website and to collate and analyze related data. Usage mining is used to examine data related to the client end, such as the profiles of the visitors to the website, the browser used, the specific time and period that the site was being surfed, the specific areas of interest of the visitors, and related data from the form data submitted during web transactions and feedback.

Web Content Mining

Web content mining targets knowledge discovery in which the main objects are the traditional collections of multimedia documents, such as images, video, and audio, which are embedded in or linked to web pages. It differs from classical data mining because web data are mainly semi-structured and/or unstructured, while data mining deals primarily with structured data. Web content mining can be approached from two points of view: the agent-based approach or the database approach. The first approach aims at improving information finding and filtering. The second approach aims at modeling the data on the Web in a more structured form in order to apply standard database querying mechanisms and data mining applications to analyze it.

Web Structure Mining

Web structure mining focuses on analysis of the link structure of the web, and one of its purposes is to identify more preferable documents. The different objects are linked in some way. The intuition is that a hyperlink from document A to document B implies that the author of document A thinks document B contains worthwhile information. Web structure mining helps in discovering similarities between web sites, discovering important sites for a particular topic or discipline, or discovering web communities. The appropriate handling of the links can lead to potential correlations, and thus improve the predictive accuracy of the learned models. The goal of web structure mining is to generate a structural summary about the Web site and Web page. Technically, web content mining mainly focuses on the structure within a document, while web structure mining tries to discover the link structure of the hyperlinks at the inter-document level. Based on the topology of the hyperlinks, web structure mining categorizes web pages and generates information such as the similarity and relationship between different web sites. Web structure mining can also take another direction: discovering the structure of the Web document itself. This type of structure mining can be used to reveal the structure (schema) of web pages, which is useful for navigation and makes it possible to compare and integrate web page schemas. It also facilitates introducing database techniques for accessing information in web pages by providing a reference schema.

Web Usage Mining

Web usage mining focuses on techniques that can predict the behavior of users while they interact with the WWW. Web usage mining, which discovers user navigation patterns from web data, tries to extract useful information from the secondary data derived from the interactions of users while surfing the Web. Web usage mining collects data from Web log records to discover user access patterns of web pages; a minimal log-analysis sketch follows below.
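As a hedged illustration of mining access patterns from Web log records (not from the text; the simplified common-log-style lines and page names are hypothetical), a minimal Python sketch:

import re
from collections import Counter

# Minimal web usage mining sketch: parse simplified common-log-style
# lines and count per-visitor page-to-page transitions, a first step
# toward discovering navigation patterns.
LOG_LINE = re.compile(r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "GET (?P<path>\S+)')

def navigation_pairs(lines):
    last_page = {}                       # ip -> previously requested page
    transitions = Counter()
    for line in lines:
        m = LOG_LINE.match(line)
        if not m:
            continue
        ip, path = m.group("ip"), m.group("path")
        if ip in last_page:
            transitions[(last_page[ip], path)] += 1
        last_page[ip] = path
    return transitions

sample = [
    '1.2.3.4 - - [10/Oct/2024:13:55:36 +0000] "GET /home HTTP/1.1" 200 512',
    '1.2.3.4 - - [10/Oct/2024:13:56:01 +0000] "GET /products HTTP/1.1" 200 1024',
    '5.6.7.8 - - [10/Oct/2024:13:57:12 +0000] "GET /home HTTP/1.1" 200 512',
    '1.2.3.4 - - [10/Oct/2024:13:58:40 +0000] "GET /checkout HTTP/1.1" 200 256',
]
for (src, dst), n in navigation_pairs(sample).most_common():
    print(f"{src} -> {dst}: {n}")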

There are several available research projects and commercial tools that analyze those patterns for different purposes. The insights gained can be utilized in personalization, system improvement, site modification, business intelligence and usage characterization. Often the only information left behind by users visiting a Web site is the path through the pages they have accessed. Most Web information retrieval tools use only the textual information and ignore the link information, which can be very valuable. In general, four kinds of data mining techniques are applied to the web mining domain to discover user navigation patterns:
- Association rule mining
- Sequential pattern mining
- Clustering
- Classification

b. Applications of Web Mining

With the rapid growth of the World Wide Web, Web mining has become a very hot and popular topic in Web research. E-commerce and E-services are claimed to be the killer applications for Web mining, and Web mining now plays an important role in helping E-commerce websites and E-services understand how their sites and services are used and in providing better services for their customers and users. A few applications are:
- E-commerce customer behavior analysis
- E-commerce transaction analysis
- E-commerce website design
- E-banking
- M-commerce
- Web advertisement
- Search engines
- Online auctions

c. Web Mining Software

Open source software for web mining includes RapidMiner, which provides modules for text clustering, text categorization, information extraction, named entity recognition, and sentiment analysis. RapidMiner is used, for example, in applications like automated news filtering for personalized news surveys. It is also used in automated content-based document and e-mail routing, and in sentiment analysis of web blogs and product reviews in internet discussion groups. Information extraction from web pages also utilizes RapidMiner to create mash-ups that combine information from various web services and web pages, and to perform web log mining and web usage mining.

SAS Data Quality Solution provides an enterprise solution for profiling, cleansing, augmenting and integrating data to create consistent, reliable information. With SAS Data Quality Solution you can automatically incorporate data quality into data integration and business intelligence projects to dramatically improve returns on your organization's strategic initiatives.

Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well suited for developing new machine learning schemes. Weka is open source software issued under the GNU General Public License.
