Data Mining

Data Warehouse: - A data warehouse is a subject-oriented, integrated, time-variant and
non-volatile collection of data in support of management's decision-making process (as defined
by Bill Inmon).
Data mining: - Data mining is the analysis of data for relationships that have not previously
been discovered.
Association rules: - Association rules are if/then statements that help uncover relationships
between seemingly unrelated data in a transactional database, relational database or other
information repository.
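
An "if X then Y" rule is usually judged by its support (the share of all transactions that
contain both X and Y) and its confidence (the share of transactions containing X that also
contain Y). The following is a minimal sketch of those two measures; the basket data and the
function name rule_metrics are made up for illustration and do not come from any library.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    n_total = len(transactions)
    # Transactions that contain every item of the "if" part.
    n_antecedent = sum(1 for t in transactions if antecedent <= t)
    # Transactions that contain both the "if" and the "then" part.
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= t)
    support = n_both / n_total if n_total else 0.0
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence


# Hypothetical market-basket data (illustration only).
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "eggs"},
]

support, confidence = rule_metrics(transactions, {"bread"}, {"butter"})
print(f"bread -> butter: support={support:.2f}, confidence={confidence:.2f}")
```

Here two of the four baskets contain both items, and two of the three baskets with bread also
contain butter.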
ETL (Extract, Transform, Load): - The process of extracting data from source systems,
transforming it into a consistent format, and loading it into the data warehouse is commonly
called ETL.
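
The flow can be sketched with the Python standard library alone. This is only an illustration
under stated assumptions: the CSV text, the orders table, and the function names extract,
transform, and load are hypothetical, and an in-memory SQLite database stands in for the
warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical source data; the last row has an unusable amount.
SOURCE_CSV = """order_id,customer,amount
1, alice ,10.50
2,BOB,20.00
3,carol,not_a_number
"""


def extract(text):
    """Read raw rows from the source system (here: a CSV string)."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Normalize names, convert types, and drop rows that cannot be parsed."""
    clean = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # skip rows with a non-numeric amount
        clean.append((int(row["order_id"]), row["customer"].strip().title(), amount))
    return clean


def load(rows, conn):
    """Write the transformed rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders (order_id INTEGER, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(SOURCE_CSV)), conn)
loaded = conn.execute(
    "SELECT order_id, customer, amount FROM orders ORDER BY order_id"
).fetchall()
print(loaded)
```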
OLAP (Online Analytical Processing): - The term OLAP usually refers to specialized tools that
make warehouse data easily available.
OLTP (Online Transaction Processing): - Online Transaction Processing (OLTP) systems are
characterized by high throughput, many users, and a mix of DML operations (insert, update, and
delete) and queries.
OLAP server: - An OLAP server is a high-capacity, multi-user data manipulation engine
specifically designed to support and operate on multi-dimensional data structures.
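
To make "multi-dimensional data structures" concrete, the sketch below treats a handful of
sales facts as a tiny cube with three dimensions and rolls it up along chosen dimensions. It
illustrates the idea only and is not how any particular OLAP server stores data; the fact rows,
the dimension names, and the roll_up helper are all hypothetical.

```python
from collections import defaultdict

# Hypothetical fact rows: (year, region, product, sales).
facts = [
    (2023, "East", "Widget", 100),
    (2023, "West", "Widget", 150),
    (2023, "East", "Gadget", 80),
    (2024, "East", "Widget", 120),
]
DIMENSIONS = ("year", "region", "product")


def roll_up(facts, keep):
    """Sum sales over every dimension that is not listed in `keep`."""
    idx = [DIMENSIONS.index(d) for d in keep]
    totals = defaultdict(int)
    for row in facts:
        key = tuple(row[i] for i in idx)
        totals[key] += row[-1]  # the last field is the sales measure
    return dict(totals)


by_year = roll_up(facts, ["year"])
by_year_region = roll_up(facts, ["year", "region"])
print(by_year)
print(by_year_region)
```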
MOLAP (Multidimensional Online Analytical Processing): - The MOLAP storage mode
causes the aggregations of the partition and a copy of its source data to be stored in a
multidimensional structure in Analysis Services when the partition is processed.
ROLAP (Relational Online Analytical Processing): - The ROLAP storage mode causes the
aggregations of the partition to be stored in indexed views in the relational database that was
specified in the partition's data source.
HOLAP (Hybrid Online Analytical Processing): - The HOLAP storage mode combines
attributes of both MOLAP and ROLAP. Like MOLAP, HOLAP causes the aggregations of the
partition to be stored in a multidimensional structure in a SQL Server Analysis Services
instance. Unlike MOLAP, HOLAP does not store a copy of the source data.
Categorical: - An attribute that takes one of a finite number of discrete values. The type
nominal denotes that there is no ordering between the values, such as last names and colors.
The type ordinal denotes that there is an ordering, such as in an attribute taking on the values
low, medium, or high.
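
The distinction matters when values are turned into numbers. One common convention, shown here
as a sketch rather than the only option, is to one-hot encode nominal values (so that no order
is implied) and to map ordinal values to their rank. The helper names one_hot and ordinal_code
are hypothetical.

```python
def one_hot(value, categories):
    """Nominal encoding: one indicator per category, no order implied."""
    return [1 if value == c else 0 for c in categories]


def ordinal_code(value, ordered_levels):
    """Ordinal encoding: the position of the value in the ordered level list."""
    return ordered_levels.index(value)


colors = ["blue", "green", "red"]   # nominal: the listing order is arbitrary
levels = ["low", "medium", "high"]  # ordinal: the listing order is meaningful

encoded_color = one_hot("green", colors)
encoded_level = ordinal_code("high", levels)
print(encoded_color, encoded_level)
```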
Classifier: - A mapping from unlabeled instances to (discrete) classes. Classifiers have a form
(e.g., decision tree) plus an interpretation procedure (including how to handle unknowns, etc.).
Some classifiers also provide probability estimates (scores), which can be thresholded to yield a
discrete class decision, thereby taking into account a utility function.
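
As an illustration of thresholding scores, the sketch below derives a threshold from two
misclassification costs and applies it. It assumes a two-class problem, scores that are
probabilities of the positive class, and zero cost for correct decisions; under those
assumptions, predicting positive has the lower expected cost whenever the score is at least
cost_fp / (cost_fp + cost_fn). The function names and the cost values are hypothetical.

```python
def cost_threshold(cost_false_positive, cost_false_negative):
    """Score above which predicting 'positive' has the lower expected cost."""
    return cost_false_positive / (cost_false_positive + cost_false_negative)


def classify(scores, threshold):
    """Turn probability scores into discrete class labels."""
    return ["positive" if s >= threshold else "negative" for s in scores]


scores = [0.05, 0.30, 0.55, 0.90]  # hypothetical classifier outputs

# Equal costs give the familiar 0.5 cut-off.
default_labels = classify(scores, 0.5)

# If a missed positive is four times as costly as a false alarm,
# the threshold drops and more instances are labeled positive.
threshold = cost_threshold(cost_false_positive=1.0, cost_false_negative=4.0)
cost_sensitive_labels = classify(scores, threshold)

print(default_labels)
print(threshold, cost_sensitive_labels)
```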
Knowledge discovery: - The non-trivial process of identifying valid, novel, potentially useful,
and ultimately understandable patterns in data. This is the definition used in "Advances in
Knowledge Discovery and Data Mining," 1996, by Fayyad, Piatetsky-Shapiro, and Smyth.
Data cleaning/cleansing: - The process of improving the quality of the data by modifying its
form or content, for example by removing or correcting data values that are incorrect. This step
usually precedes the machine learning step, although the knowledge discovery process may
indicate that further cleaning is desired and may suggest ways to improve the quality of the data.
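
A few typical cleaning steps can be sketched as follows: normalizing text, treating implausible
values as missing, and dropping empty or duplicate records. The records, the 0 to 120 age
range, and the function name clean_records are assumptions chosen for illustration; real rules
depend on the data set.

```python
def clean_records(records):
    """Normalize names, blank out implausible ages, drop empty and duplicate rows."""
    cleaned, seen = [], set()
    for rec in records:
        name = rec.get("name", "").strip().title()
        age = rec.get("age")
        if not name:
            continue  # a record without a name is unusable here
        if not isinstance(age, int) or not (0 <= age <= 120):
            age = None  # mark incorrect values as missing instead of guessing
        key = (name, age)
        if key in seen:
            continue  # exact duplicate after normalization
        seen.add(key)
        cleaned.append({"name": name, "age": age})
    return cleaned


# Hypothetical raw input with whitespace, a duplicate, a bad age, and an empty name.
raw = [
    {"name": "  alice ", "age": 34},
    {"name": "Alice", "age": 34},
    {"name": "bob", "age": -5},
    {"name": "", "age": 40},
]

cleaned = clean_records(raw)
print(cleaned)
```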
Apache Hadoop: - An open-source platform that allows for the distributed processing of large
data sets across clusters of computers using a simple programming model.
HBase: - An open-source, distributed, versioned, non-relational database modeled after Google's
Bigtable, as described in the paper "Bigtable: A Distributed Storage System for Structured
Data."
HDFS: - The Hadoop Distributed File System, which splits large files into blocks that are
replicated and distributed across a cluster of commodity hardware for fault tolerance and
parallel processing.
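
The block-and-replica idea can be illustrated with a toy model. This is a deliberate
simplification and not the HDFS API: real HDFS uses much larger blocks (128 MB by default in
recent versions) and a rack-aware placement policy, whereas the sketch below uses a 4-byte
block size and simple round-robin placement. The node names and both function names are
hypothetical.

```python
def split_into_blocks(data: bytes, block_size: int):
    """Cut the data into fixed-size blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, nodes, replication):
    """Assign each block to `replication` distinct nodes, round-robin.

    Assumes replication <= len(nodes); real HDFS placement is rack-aware.
    """
    placement = {}
    for b in range(num_blocks):
        placement[b] = [nodes[(b + r) % len(nodes)] for r in range(replication)]
    return placement


data = b"x" * 10  # stand-in for a large file
blocks = split_into_blocks(data, block_size=4)
nodes = ["node1", "node2", "node3", "node4"]
placement = place_replicas(len(blocks), nodes, replication=3)

for block_id, holders in placement.items():
    print(block_id, len(blocks[block_id]), holders)
```

Because every block lives on several nodes, losing one node does not lose the data.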
Hive: - A data warehouse infrastructure built on top of Hadoop for providing data
summarization, query, and analysis. It allows you to query data using a SQL-like language called
HiveQL (HQL).
Pig: - A high-level platform for creating MapReduce programs that run on Hadoop; its scripting
language is called Pig Latin.
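
For readers unfamiliar with the MapReduce model itself, the sketch below imitates its three
stages (map, shuffle, reduce) in plain Python with a word count. It is not Pig Latin and does
not use any Hadoop API; the function names and input lines are hypothetical, and everything
runs in a single process.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


lines = ["big data big ideas", "data mining"]  # hypothetical input

mapped = [pair for line in lines for pair in map_phase(line)]
word_counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(word_counts)
```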
JSON (JavaScript Object Notation): - A lightweight data-interchange format. It is easy for
humans to read and write, and easy for machines to parse and generate.
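
A short round trip with Python's standard json module shows the format; the record contents
are made up for illustration.

```python
import json

# A hypothetical glossary record as a Python dictionary.
record = {
    "term": "OLAP",
    "expanded": "Online Analytical Processing",
    "tags": ["warehouse", "analysis"],
}

text = json.dumps(record, indent=2)  # serialize to human-readable JSON text
restored = json.loads(text)          # parse the text back into Python objects

print(text)
```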
Jaql (JAQL): - A functional data processing and query language most commonly used for
JSON query processing on big data.
MongoDB: - An open-source document database (the name comes from "humongous") and one of the
most widely used NoSQL databases.
