
Data mining & Data warehousing tools:

1. Microsoft Analysis Services
2. SPSS Modeler
3. Carrot2
4. ELKI
5. KNIME
6. SAS
7. Oracle Data Mining

Microsoft Analysis Services: data mining software provided by Microsoft.

Developer(s): Microsoft
Stable release: Microsoft Analysis Services 2008 R2 / December 21, 2010
Operating system: Microsoft Windows
Type: OLAP, Data Mining
License: Microsoft EULA

Microsoft SQL Server Analysis Services is part of Microsoft SQL Server, a database management system. Microsoft has included a number of services in SQL Server related to business intelligence and data warehousing. These services include Integration Services and Analysis Services. Analysis Services includes a group of OLAP and data mining capabilities.

Storage Modes
Microsoft Analysis Services takes a neutral position in the MOLAP vs. ROLAP debate among OLAP products. It allows all flavors of MOLAP, ROLAP and HOLAP to be used within the same model.

Partition storage modes:
- MOLAP (Multidimensional OLAP): both fact data and aggregations are processed, stored, and indexed using a special format optimized for multidimensional data.
- ROLAP (Relational OLAP): both fact data and aggregations remain in the relational data source, eliminating the need for special processing.
- HOLAP (Hybrid OLAP): uses the relational data source to store the fact data, but preprocesses aggregations and indexes, storing these in a special format optimized for multidimensional data.

Dimension storage modes:
- MOLAP: dimension attributes and hierarchies are processed and stored in the special format.
- ROLAP: dimension attributes are not processed and remain in the relational data source.
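The practical difference between these storage modes is *when* aggregations are computed. A minimal plain-Python sketch (illustrative names only, not the Analysis Services implementation): a MOLAP-style store precomputes aggregates at processing time, while a ROLAP-style store aggregates the relational fact rows at query time.

```python
from collections import defaultdict

# Tiny fact table: (product, region, sales_amount)
FACTS = [
    ("bike", "east", 100), ("bike", "west", 150),
    ("car",  "east", 300), ("car",  "west", 250),
]

def process_molap(facts):
    """MOLAP-style: precompute aggregations once, at processing time."""
    agg = defaultdict(int)
    for product, region, amount in facts:
        agg[(product,)] += amount        # rollup by product
        agg[(product, region)] += amount # cell at (product, region)
    return agg

def query_rolap(facts, product):
    """ROLAP-style: leave data relational, aggregate at query time."""
    return sum(amount for p, _, amount in facts if p == product)

cube = process_molap(FACTS)
print(cube[("bike",)])             # precomputed answer: 250
print(query_rolap(FACTS, "bike"))  # computed on demand: 250
```

HOLAP, in this analogy, would keep `FACTS` relational but still build the precomputed `cube` of aggregates.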

Query Languages
Microsoft Analysis Services supports the following query languages.

Data Definition Language (DDL): DDL in Analysis Services is XML-based and supports commands such as <Create>, <Alter>, <Delete>, and <Process>. For data mining model import and export, it also supports PMML.

Data Manipulation Language (DML):
- MDX - for querying OLAP cubes
- LINQ - for querying OLAP cubes from .NET using ADO.NET Entity Framework and Language INtegrated Query (the SSAS Entity Framework Provider[6] is required)
- SQL - a small subset of SQL for querying OLAP cubes and dimensions as if they were tables
- DMX - for querying data mining models

APIs and Object Models


Microsoft Analysis Services supports different sets of APIs and object models for different operations and in different programming environments.

Querying:
- XML for Analysis - the lowest-level API. It can be used from any platform and in any language that supports HTTP and XML.
- OLE DB for OLAP - extension of OLE DB. COM-based and suitable for C/C++ programs on the Windows platform.
- ADOMD - extension of ADO. COM Automation-based and suitable for VB programs on the Windows platform.
- ADOMD.NET - extension of ADO.NET. .NET-based and suitable for managed-code programs on CLR platforms.
- ADO.NET Entity Framework - Entity Framework and LINQ can be used on top of ADOMD.NET (the SSAS Entity Framework Provider is required).
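XML for Analysis is a SOAP-style XML protocol over HTTP, so a request can be assembled in any language. The sketch below only constructs an illustrative Execute request carrying an MDX statement; the cube name is hypothetical and nothing is sent to a server.

```python
# Illustrative only: build an XML for Analysis (XMLA) Execute request
# carrying an MDX statement. The cube name is hypothetical, and the
# request is only constructed here, not sent to any server.
MDX = """SELECT [Measures].[Sales Amount] ON COLUMNS,
       [Product].[Category].Members ON ROWS
FROM [SalesCube]"""

def xmla_execute_body(mdx: str) -> str:
    # Minimal SOAP envelope with the statement embedded in <Command>.
    return f"""<Envelope xmlns="http://schemas.xmlsoap.org/soap/envelope/">
  <Body>
    <Execute xmlns="urn:schemas-microsoft-com:xml-analysis">
      <Command><Statement>{mdx}</Statement></Command>
      <Properties/>
    </Execute>
  </Body>
</Envelope>"""

body = xmla_execute_body(MDX)
print(body)
```

A real client would POST this body to the server's XMLA endpoint and parse the XML response; the higher-level APIs listed above wrap exactly this exchange.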

SPSS Modeler: data mining software provided by IBM SPSS.


SPSS Modeler is a data mining software tool by SPSS Inc., an IBM company. It was originally named SPSS Clementine, and was renamed PASW Modeler by SPSS in 2009. SPSS Inc. was subsequently acquired by IBM.

Developer(s): SPSS Inc., an IBM company
Stable release: 13 (Win / Unix / Linux) / April 2009
Operating system: Windows, Linux, UNIX
Type: Data mining

Carrot2

Developer(s): Carrot Search
Stable release: 3.5.3 / December 6, 2011
Development status: Active
Written in: Java
Operating system: Cross-platform
Type: Text mining and cluster analysis
License: BSD license

Carrot2 is an open source search results clustering engine. It can automatically cluster small collections of documents, e.g. search results or document abstracts, into thematic categories. Apart from two specialized search results clustering algorithms, Carrot2 offers ready-to-use components for fetching search results from various sources. Carrot2 is written in Java and distributed under the BSD license.

Architecture and components


The architecture of Carrot2 is based on processing components arranged into pipelines. The two major groups of processing components in Carrot2 are document sources and clustering algorithms.

Document sources
Document sources provide data for further processing. Typically they fetch search results from an external search engine or a Lucene / Solr index, or load text files from a local disk. Currently, Carrot2 has built-in support for the following document sources:
- Bing Search API
- Google Search API
- Google Desktop
- Lucene index
- OpenSearch
- PubMed
- Solr server
- eTools metasearch engine
- Generic XML files

Other document sources can be integrated based on the code examples provided with the Carrot2 distribution.

Clustering algorithms
Carrot2 offers two specialized document clustering algorithms that place emphasis on the quality of cluster labels:
- Lingo: a clustering algorithm based on singular value decomposition
- STC: suffix tree clustering

Other algorithms can easily be added to Carrot2.
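The core intuition behind STC-style clustering is that documents sharing a common phrase belong in the same cluster, and the phrase itself becomes the cluster label. The toy sketch below illustrates only that intuition (using word bigrams as "phrases"); it is not the actual Carrot2 algorithm, and all names are hypothetical.

```python
from collections import defaultdict

def shared_phrase_clusters(docs, phrase_len=2):
    """Toy illustration of the idea behind suffix tree clustering (STC):
    documents sharing a common phrase (here, a word bigram) fall into the
    same labeled cluster. NOT the real Carrot2 implementation."""
    clusters = defaultdict(set)
    for i, text in enumerate(docs):
        words = text.lower().split()
        for j in range(len(words) - phrase_len + 1):
            phrase = " ".join(words[j:j + phrase_len])
            clusters[phrase].add(i)
    # keep only phrases shared by at least two documents as cluster labels
    return {label: members for label, members in clusters.items()
            if len(members) >= 2}

docs = [
    "data mining with java",
    "data mining in practice",
    "search results clustering engine",
    "open source clustering engine",
]
print(shared_phrase_clusters(docs))
# {'data mining': {0, 1}, 'clustering engine': {2, 3}}
```

The real algorithms add much more (suffix trees for efficiency, base-cluster scoring and merging in STC; term-document matrix decomposition in Lingo), but the label-first orientation is the same.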

APIs
Carrot2 clustering can be called through a number of APIs.

Java API
Being implemented in Java, Carrot2 can be integrated with Java software through its native Java API.

C# / .NET API
Carrot2 provides a native C# API for calling clustering from C# / .NET software without installing a Java runtime. The Carrot2 C# API requires .NET Framework version 3.5 or later.

Other platforms
Other platforms can call Carrot2 clustering through the REST service exposed by the Document Clustering Server. Example integration code is provided for PHP5, C#, Ruby and cURL.

ELKI: a university research project with advanced cluster analysis and outlier detection methods, written in Java.

(Screenshot: ELKI 0.4 visualizing an OPTICS cluster analysis.)

Developer(s): Ludwig Maximilian University of Munich
Stable release: 0.4.0 / September 20, 2011
Written in: Java
Operating system: Microsoft Windows, Linux, Mac OS
Platform: Java platform
Type: Data mining
License: AGPL (since version 0.4.0)

ELKI (Environment for DeveLoping KDD-Applications Supported by Index-Structures) is a knowledge discovery in databases (KDD, "data mining") software framework developed for use in research and teaching by the database systems research unit of Professor Hans-Peter Kriegel at the Ludwig Maximilian University of Munich, Germany. It aims to allow the development and evaluation of advanced data mining algorithms and their interaction with database index structures.

Cluster analysis:
- K-means clustering
- Expectation-maximization algorithm
- Single-linkage clustering
- DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
- OPTICS (Ordering Points To Identify the Clustering Structure), including the extensions OPTICS-OF, DeLi-Clu, HiSC, HiCO and DiSH
- SUBCLU (Density-Connected Subspace Clustering for High-Dimensional Data)
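As a point of reference for the simplest entry in this list, k-means alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its cluster. A minimal 1-D sketch (illustrative only; ELKI's implementations are generalized and index-accelerated):

```python
import random

def kmeans_1d(points, k, iters=20, seed=0):
    """Minimal k-means sketch on 1-D data."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        # assignment step: each point joins its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: abs(p - centroids[c]))
            clusters[nearest].append(p)
        # update step: move each centroid to its cluster mean
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

print(kmeans_1d([1.0, 1.2, 0.8, 10.0, 10.4, 9.6], k=2))
# converges to approximately [1.0, 10.0]
```

DBSCAN and OPTICS differ fundamentally from this: they grow clusters from density-connected neighborhoods instead of fitting a fixed number of centroids, which is why they can find arbitrarily shaped clusters and mark noise points.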

Anomaly detection:
- LOF (Local Outlier Factor)
- OPTICS-OF
- DB-Outlier (Distance-Based Outliers)
- LOCI (Local Correlation Integral)
- LDOF (Local Distance-Based Outlier Factor)
- EM-Outlier

Spatial index structures:
- R-tree
- R*-tree
- M-tree

Evaluation:
- Receiver operating characteristic (ROC curve)
- Scatter plot
- Histogram
- Parallel coordinates

Other:
- Apriori algorithm
- Dynamic time warping
- Principal component analysis
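The distance-based (DB) outlier notion in the anomaly detection list above is the easiest to sketch: a point is flagged when fewer than a fraction p of all points lie within distance d of it. A toy 1-D version (illustrative only, not ELKI's index-accelerated implementation):

```python
def db_outliers(points, d=2.0, p=0.5):
    """Distance-based (DB) outlier sketch: flag a point as an outlier when
    fewer than a fraction p of all points lie within distance d of it.
    Toy 1-D version for illustration."""
    outliers = []
    for x in points:
        neighbors = sum(1 for y in points if abs(x - y) <= d)
        if neighbors / len(points) < p:
            outliers.append(x)
    return outliers

data = [1.0, 1.5, 2.0, 2.5, 3.0, 15.0]
print(db_outliers(data))  # [15.0]
```

LOF refines this global notion into a local one: instead of a single global threshold, each point's density is compared with the densities of its own neighbors, so clusters of different densities can be handled in one dataset.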

KNIME: the Konstanz Information Miner, a user-friendly and comprehensive data analytics framework.

Stable release: 2.5.1 (December 21, 2011)
Operating system: Windows, Linux, Macintosh
License: GNU General Public License

KNIME, the Konstanz Information Miner, is a user-friendly, coherent open source data analytics, reporting and integration platform. KNIME integrates various components for machine learning and data mining through its modular data pipelining concept. The graphical user interface allows the quick and easy assembly of nodes for data preprocessing (ETL: Extraction, Transformation, Loading), modeling, data analysis and visualization. Since 2006, KNIME has been used in pharmaceutical research, but it is also used in other areas such as CRM customer data analysis, business intelligence and financial data analysis.
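The modular pipelining concept can be illustrated in a few lines: each "node" is an independent step with its own configuration, and a workflow is just a chain of nodes where the output table of one feeds the next. All names below are hypothetical; this is a sketch of the idea, not KNIME's actual API.

```python
# Toy illustration of the data-pipelining idea behind KNIME-style tools.
def row_filter(rows, predicate):
    """Node: keep only rows matching the predicate."""
    return [r for r in rows if predicate(r)]

def column_append(rows, name, fn):
    """Node: derive a new column from each row."""
    return [{**r, name: fn(r)} for r in rows]

def run_workflow(rows, nodes):
    for node in nodes:          # output of one node feeds the next
        rows = node(rows)
    return rows

table = [{"price": 10}, {"price": 250}, {"price": 40}]
workflow = [
    lambda t: row_filter(t, lambda r: r["price"] < 100),
    lambda t: column_append(t, "price_with_tax", lambda r: r["price"] * 1.2),
]
print(run_workflow(table, workflow))
```

Because each node only sees a table in and a table out, nodes can be configured, tested and rearranged independently, which is what makes the graphical assembly of workflows practical.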

SAS:
SAS is an integrated system of software products provided by SAS Institute Inc. that enables programmers to perform:
- data retrieval, management, and mining
- report writing and graphics
- statistical analysis
- business planning, forecasting, and decision support
- operations research and project management
- quality improvement
- applications development
- data warehousing (extract, transform, load)
- platform-independent and remote computing

In addition, SAS has many business solutions that enable large-scale software solutions for areas such as IT management, human resource management, financial management, business intelligence, and customer relationship management.

SAS is driven by SAS programs, which define a sequence of operations to be performed on data stored as tables. Although non-programmer graphical user interfaces to SAS exist (such as SAS Enterprise Guide), these GUIs are most often merely a front-end that automates or facilitates the generation of SAS programs. The functionality of SAS components is intended to be accessed via application programming interfaces, in the form of statements and procedures. A SAS program has three major parts:
1. the DATA step
2. procedure steps (effectively, everything that is not enclosed in a DATA step)
3. a macro language

SAS Library Engines and Remote Library Services allow access to data stored in external data structures and on remote computer platforms.

The DATA step section of a SAS program,[1] like other database-oriented fourth-generation programming languages such as SQL or Focus, assumes a default file structure, and automates the process of identifying files to the operating system, opening the input file, reading the next record, opening the output file, writing the next record, and closing the files. This allows the user/programmer to concentrate on the details of working with the data within each record, in effect working almost entirely within an implicit program loop that runs for each record. All other tasks are accomplished by procedures that operate on the data set (SAS terminology for "table") as a whole. Typical tasks include printing or performing statistical analysis, and may only require the user/programmer to identify the data set. Procedures are not restricted to only one behavior and thus allow extensive customization, controlled by mini-languages defined within the procedures.

SAS also has an extensive SQL procedure, allowing SQL programmers to use the system with little additional knowledge. There are macro programming extensions that allow for rationalization of repetitive sections of the program. Proper imperative and procedural programming constructs can be simulated by use of "open code" macros or the Interactive Matrix Language (SAS/IML) component. Macro code in a SAS program, if any, undergoes preprocessing. At run time, DATA steps are compiled and procedures are interpreted and run in the sequence they appear in the SAS program. A SAS program requires the SAS software to run.

Compared to general-purpose programming languages, this structure allows the user/programmer to concentrate less on the technical details of the data and how it is stored, and more on the information contained in the data. This blurs the line between user and programmer, appealing to individuals who fall more into the 'business' or 'research' area and less into the 'information technology' area, since SAS does not enforce (although it recommends) a structured, centralized approach to data and infrastructure management.

SAS runs on IBM mainframes, Unix, Linux, OpenVMS Alpha, and Microsoft Windows. Code is "almost" transparently moved between these environments. Older versions supported PC-DOS, the Apple Macintosh, VMS, VM/CMS, PrimeOS, Data General AOS and OS/2.

SAS consists of a number of components, which organizations license and install separately as required.

- Base SAS - the core of SAS. The Base SAS software manages data, and SAS procedures software analyzes and reports on the data. The SQL procedure allows SQL (Structured Query Language) programming in lieu of DATA step and procedure programming. Library Engines allow transparent access to common data structures such as Oracle, as well as pass-through of SQL to be executed by such data structures.
- The Macro facility - a tool for extending and customizing SAS software programs and reducing overall program verbosity.
- The DATA step debugger - a programming tool that helps find logic problems in DATA step programs.
- The Output Delivery System (ODS) - an extensible system that delivers output in a variety of formats, such as SAS data sets, listing files, RTF, PDF, XML, or HTML.
- The SAS windowing environment - an interactive, graphical user interface used to run and test SAS programs.
- BI Dashboard - a plugin for the Information Delivery Portal. It allows the user to create various graphics that represent a broad range of data, so that a quick glance provides a lot of information without having to look at all the underlying data.
- Data Integration Studio - provides extract, transform, load (ETL) services.
- SAS Enterprise Business Intelligence Server - includes both a suite of business intelligence (BI) tools and a platform to provide uniform access to data. The goal of this product is to compete with the offerings of Business Objects and Cognos.
- Enterprise Computing Offer (ECO) - not to be confused with Enterprise Guide or Enterprise Miner; ECO is a product bundle.

- Enterprise Guide - a Microsoft Windows client application that provides a guided mechanism to use SAS and publish dynamic results throughout an organization in a uniform way. It is marketed as the default interface to SAS for business analysts, statisticians, and programmers. Though Data Integration Studio is the true ETL tool of SAS, Enterprise Guide can be used for the ETL of smaller projects.
- Enterprise Miner - a data mining tool.
- Information Delivery Portal - allows users to set up personalized homepages where they can view automatically generated reports, dashboards, and other SAS data structures.
- Information Map Studio - a client application that helps with building information maps.
- OLAP Cube Studio - a client application that helps with building OLAP cubes.
- SAS Web OLAP Viewer for Java - a web-based application for viewing OLAP cubes and data explorations. (Discontinued as of November 2010.[6])
- SAS Web OLAP Viewer for .NET
- SAS/ACCESS - provides the ability for SAS to transparently share data with non-native data sources.
- SAS/ACCESS for PC Files - allows SAS to transparently share data with personal computer applications including MS Access and Microsoft Office Excel.
- SAS Add-In for Microsoft Office - a component of the SAS Enterprise Business Intelligence Server, designed to provide access to data, analysis, reporting and analytics for non-technical workers (such as business analysts, power users, domain experts and decision makers) via menus and toolbars integrated into Office applications.
- SAS/AF - Applications Facility, a set of application development tools to create customized desktop GUI applications; a library of drag-and-drop widgets is available; widgets and models are fully object-oriented; SCL programs can be attached as needed.
- SAS/SCL - SAS Component Language; allows programmers to create and compile object-oriented programs. Uniquely, SAS allows objects to submit and execute Base SAS and SAS Macro statements.
- SAS/ASSIST - an early point-and-click interface to SAS; has since been superseded by SAS Enterprise Guide and its client-server architecture.
- SAS/C
- SAS/CALC - a discontinued spreadsheet application, which came out in version 6 for mainframes and PCs and did not make it further.
- SAS/CONNECT - provides the ability for SAS sessions on different platforms to communicate with each other.

- SAS/DMI - a programming interface between interactive SAS and ISPF/PDF applications. Obsolete since version 5.
- SAS/EIS - a menu-driven system for developing, running, and maintaining enterprise information systems.
- SAS/ETS - provides econometrics and time series analysis.
- SAS/FSP - allows interaction with data using integrated tools for data entry, computation, query, editing, validation, display, and retrieval.
- SAS/GIS - an interactive desktop geographic information system for mapping applications.
- SAS/GRAPH - although Base SAS includes primitive graphing capabilities, SAS/GRAPH is needed for charting on graphical media.
- SAS/IML - matrix-handling SAS script extensions.
- SAS/INSIGHT - a dynamic tool for data mining; allows examination of univariate distributions, visualization of multivariate data, and model fitting using regression, analysis of variance, and the generalized linear model.
- SAS/Integration Technologies - allows the SAS System to use standard protocols, like LDAP for directory access, CORBA and Microsoft's COM/DCOM for inter-application communication, as well as message-oriented middleware like Microsoft Message Queuing and IBM WebSphere MQ. Also includes the proprietary client-server protocols used by all SAS clients.
- SAS/IntrNet - extends SAS data retrieval and analysis functionality to the Web with a suite of CGI and Java tools.
- SAS/LAB - superseded by SAS Enterprise Guide.
- SAS/OR - operations research.
- SAS/PH-Clinical - a defunct product.
- SAS/QC - Quality Control; provides quality improvement tools.
- SAS/SHARE - a data server that allows multiple users to gain simultaneous access to SAS files.
- SAS/SHARE*NET - discontinued and now part of SAS/SHARE. It allowed a SAS/SHARE data server to be accessed from non-SAS clients, like JDBC- or ODBC-compliant applications.
- SAS/SPECTRAVIEW - allows visual exploration of large amounts of data. Once the system has plotted the data in a 3D space, users can visualize it by creating envelope surfaces, cutting planes, etc., which can be animated depending on a fourth parameter (time, for example).
- SAS/STAT - statistical analysis with a number of procedures, providing statistical information such as analysis of variance, regression, multivariate analysis, and categorical data analysis. Note for example the GLIMMIX procedure.[7]
- SAS/TOOLKIT
- SAS/Warehouse Administrator - superseded in SAS 9 by SAS ETL Server.
- SAS Web Report Studio - part of the SAS Enterprise Business Intelligence Server; provides access to query and reporting capabilities on the Web. Aimed at non-technical users.
- SAS Financial Management - budgeting, planning, financial reporting and consolidation.
- SAS Activity Based Management - cost and revenue modeling.
- SAS Strategy Management (formerly Strategic Performance Management) - collaborative scorecards.
- SAS Scalable Performance Data Server (SPDS) - a distributed data system offering increased performance; a data processing server.
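Returning to the DATA step described earlier: its defining feature is the implicit per-record loop, where the framework opens files and iterates records while the programmer supplies only the per-record logic. A plain-Python sketch of that idea (not SAS syntax; all names are illustrative):

```python
import csv
import io

# Toy illustration of the DATA step's implicit per-record loop: the
# framework opens the input and output, iterates the records, and writes
# each one back out; the "program" is only the per-record logic.
def data_step(in_text, per_record):
    out = io.StringIO()
    reader = csv.DictReader(io.StringIO(in_text))
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames + ["total"])
    writer.writeheader()
    for record in reader:        # the implicit loop over input records
        per_record(record)       # user logic sees one record at a time
        writer.writerow(record)  # framework writes and moves on
    return out.getvalue()

raw = "qty,price\n2,3.5\n4,1.25\n"
result = data_step(
    raw,
    lambda r: r.update(total=float(r["qty"]) * float(r["price"])),
)
print(result)
```

Everything outside that loop (statistics, reporting, SQL) is, in the SAS model, handled by procedures that receive the whole data set at once.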

Oracle Data Mining


Oracle Data Mining (ODM) is an option of Oracle Corporation's Relational Database Management System (RDBMS) Enterprise Edition (EE). It contains several data mining and data analysis algorithms for classification, prediction, regression, associations, feature selection, anomaly detection, feature extraction, and specialized analytics. It provides means for the creation, management and operational deployment of data mining models inside the database environment.

Oracle Data Mining implements a variety of data mining algorithms inside the Oracle relational database. These implementations are integrated right into the Oracle database kernel and operate natively on data stored in the relational database tables. This eliminates the need for extraction or transfer of data into standalone mining/analytic servers. The relational database platform is leveraged to securely manage models and efficiently execute SQL queries on large volumes of data.

The system is organized around a few generic operations providing a general unified interface for data mining functions. These operations include functions to create, apply, test, and manipulate data mining models. Models are created and stored as database objects, and their management is done within the database, similar to tables, views, indexes and other database objects.

In data mining, the process of using a model to derive predictions or descriptions of behavior that is yet to occur is called "scoring". In traditional analytic workbenches, a model built in the analytic engine has to be deployed in a mission-critical system to score new data, or the data has to be moved from relational tables into the analytical workbench; most workbenches offer proprietary scoring interfaces. ODM simplifies model deployment by offering Oracle SQL functions to score data stored right in the database.
This way, the user/application developer can leverage the full power of Oracle SQL: the ability to pipeline and manipulate the results over several levels, and to parallelize and partition data access for performance. Models can be created and managed by one of several means. Oracle Data Miner is a graphical user interface that steps the user through the process of creating, testing, and applying models (e.g. along the lines of the CRISP-DM methodology). Application and tools developers can embed predictive and descriptive mining capabilities using PL/SQL or Java APIs. Business analysts can quickly experiment with, or demonstrate the power of, predictive analytics using the Oracle Spreadsheet Add-In for Predictive Analytics, a dedicated Microsoft Excel adapter interface.

ODM offers a choice of well-known machine learning approaches such as decision trees, naive Bayes, support vector machines, and generalized linear models (GLM) for predictive mining, and association rules, k-means, Orthogonal Partitioning Clustering (see the O-Cluster paper), and non-negative matrix factorization for descriptive mining. A minimum description length based technique to grade the relative importance of input mining attributes for a given problem is also provided. Most Oracle Data Mining functions also allow text mining by accepting text (unstructured data) attributes as input.
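The contrast between extract-then-score and in-database scoring can be sketched with a SQL-callable function: the model is registered inside the database engine and applied row by row in a query, so no data leaves the database. The sketch below uses SQLite as a stand-in and a hypothetical two-coefficient linear "model"; it only illustrates the deployment pattern, not ODM's actual scoring functions.

```python
import sqlite3

# Toy illustration of in-database scoring: register the "model" as a SQL
# function so rows are scored where they live, instead of being extracted
# into a separate analytic engine. SQLite stands in for the database; the
# model is a hypothetical linear scorer, not Oracle Data Mining.
def make_scorer(w_age, w_income):
    return lambda age, income: w_age * age + w_income * income

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (age REAL, income REAL)")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(30, 50000), (60, 10000)])
conn.create_function("predict", 2, make_scorer(1.0, 0.001))
rows = conn.execute("SELECT predict(age, income) FROM customers").fetchall()
print(rows)
```

Because the score is just another SQL expression, it can be filtered, joined, and aggregated in the same query, which is the pipelining advantage the text describes.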

Developer(s): Oracle Corporation
Stable release: 11gR2 / September 2009
Type: Data mining and analytics
License: Proprietary
