
Big Data: Big Data refers to humongous volumes of data that cannot be processed effectively with the traditional applications that exist. The processing of Big Data begins with raw data that isn't aggregated and is most often impossible to store in the memory of a single computer.
The definition of Big Data given by Gartner is: "Big data is high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation."
Data analytics is the science of examining raw data with the purpose of drawing conclusions about that information.
Data Analytics involves applying an algorithmic or mechanical process to derive insights, for example, running through a number of data sets to look for meaningful correlations between them.
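As a minimal illustration (not from the original notes), the Python sketch below checks two hypothetical series for a meaningful correlation; the variable names and values are assumptions chosen only to show the mechanics.

```python
# Minimal sketch: look for a meaningful correlation between two data series.
import pandas as pd

# Hypothetical values, for illustration only.
ad_spend = pd.Series([10, 12, 9, 15, 20, 18, 22, 25, 24, 30], name="ad_spend")
revenue  = pd.Series([100, 110, 95, 140, 190, 170, 210, 240, 230, 290], name="revenue")

# A strong Pearson correlation flags a relationship worth investigating further
# (correlation is not causation).
print(ad_spend.corr(revenue))
```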
Data Science is the combination of statistics, mathematics, programming, problem solving, capturing data in ingenious ways, the ability to look at things differently, and the activity of cleansing, preparing, and aligning the data. Dealing with unstructured and structured data, Data Science is a field that comprises everything related to data cleansing, preparation, and analysis.
Why is analytics becoming more important now?
Much more operational data is being created and captured because of the use of technology (structured)
Enterprise software: ERP, CRM, SCM
Much more unstructured data is being captured and stored (social media data): Facebook, Twitter
Much more unstructured data being captured: web transactions, smart objects
Data Analytics vs. Statistical Analysis
Data Analytics
Utilizes data mining techniques
Identifies inexplicable or novel relationships/trends
Seeks to visualize the data to allow the observation of relationships/trends
Statistical Analysis
Utilizes statistical and/or mathematical techniques
Grounded in a theoretical foundation
Seeks to identify a significance level to address hypotheses or research questions (RQs)
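To make the statistical-analysis side concrete, here is a hedged sketch of a two-sample t-test that checks whether an observed difference reaches a chosen significance level; the sample values and alpha = 0.05 are assumptions for illustration.

```python
# Minimal sketch: hypothesis test against a chosen significance level (alpha).
from scipy import stats

group_a = [23, 25, 21, 24, 26, 22, 25]   # hypothetical sample values
group_b = [28, 27, 30, 26, 29, 31, 28]

t_stat, p_value = stats.ttest_ind(group_a, group_b)

alpha = 0.05  # significance level chosen in advance, per the theoretical foundation
if p_value < alpha:
    print(f"p = {p_value:.4f} < {alpha}: reject the null hypothesis")
else:
    print(f"p = {p_value:.4f} >= {alpha}: fail to reject the null hypothesis")
```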
Analytics Landscape
Strategies: Social, Email, Blogs, Video, Mobile; Marketing, Sales - Product Listing, Promotions
Applications: ERP, CRM, Databases, Internal Applications, Customer/Consumer-facing applications
Context: Web, Customers, Products, Business Systems, Processes and Services
Support Systems: CRM, Recommendation Systems
Analytic Tools: Data mining, Statistical analysis, Predictive analysis, Correlation, Regression, Forecasting, Process Modeling, Optimization

4. Database and Data Warehouse
Serve as the foundation of business analytics
Principles of database design and implementation: conceptual, logical and physical modeling
Relational Databases
ETL process (Extraction, Transformation and Loading)
SQL (Structured Query Language)
Data warehouses, Business Intelligence

Dangers in Analytics
Privacy
Security
Drawing decisions on incomplete data
Drawing decisions on inaccurate data
Using only data that supports our gut decisions
Drawing the wrong conclusion from the data (e.g., the stock prices example)

DATA - collected facts and figures
DATABASE - collection of computer files containing data
INFORMATION - comes from analyzing data

BUSINESS ANALYTICS DOMAIN

3. Interval Data
Ordinal data but with constant differences between observations
No true zero point
Ratios are not meaningful
Examples: temperature readings, SAT scores

4. Ratio Data
Continuous values with a natural zero point
Ratios are meaningful
Examples: monthly sales, delivery times

Big Data Issues Affecting Analytics

3. Prescriptive Analytics
Function: make decisions based on data
Common models: linear programming, sensitivity analysis, integer programming, goal programming, nonlinear programming, simulation modeling
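A hedged sketch of the first model on this list, linear programming, using SciPy; the products, profits, and resource limits are invented purely to show the setup.

```python
# Minimal linear programming sketch: choose a production mix that maximizes profit
# subject to resource constraints. All numbers are hypothetical.
from scipy.optimize import linprog

# Maximize 40*x1 + 30*x2  ->  linprog minimizes, so negate the profits.
c = [-40, -30]
A_ub = [[1, 1],    # labour hours used per unit of product 1 and product 2
        [2, 1]]    # machine hours used per unit
b_ub = [40, 60]    # available labour hours and machine hours

result = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
print("optimal plan:", result.x, "profit:", -result.fun)
```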

Database and Data Warehouse


Descriptive analytics
- uses data to understand past and present
Predictive analytics
- analyzes past performance to predict future behavior
Prescriptive analytics
- uses optimization techniques
NoSQL (Not Only SQL)
1. Descriptive Analytics
Function:
describe the main features of organizational data
Common tools:
sampling, mean, mode, median, standard deviation, range, variance, stem and leaf
diagram, histogram, interquartile range, quartiles, and frequency distributions
Displaying results:
graphics/charts, tables, and summary statistics such as single numbers
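A minimal sketch of a few of these descriptive tools in Python; the order counts are assumed values used only for illustration.

```python
# Minimal descriptive-analytics sketch: central tendency and spread of one variable.
import statistics as st

orders = [12, 15, 11, 18, 15, 22, 15, 19, 14, 17]   # hypothetical daily order counts

print("mean:", st.mean(orders))
print("median:", st.median(orders))
print("mode:", st.mode(orders))
print("standard deviation:", st.stdev(orders))
print("range:", max(orders) - min(orders))
print("quartiles:", st.quantiles(orders, n=4))       # Q1, Q2, Q3
```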
2. Predictive Analytics
Function:
draw conclusions and predict future behavior
Common tools:
cluster analysis, association analysis, multiple regression, logistic regression,
decision tree methods, neural networks, text mining and forecasting tools (such as
time series and causal relationships)
Example of Fandango
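A hedged sketch of one predictive tool from the list above, logistic regression with scikit-learn; the features (monthly spend, support tickets) and churn labels are invented for illustration only.

```python
# Minimal predictive-analytics sketch: logistic regression predicting customer churn.
from sklearn.linear_model import LogisticRegression

# Features: [monthly_spend, support_tickets]; label: 1 = customer churned.
X = [[20, 5], [25, 4], [90, 0], [85, 1], [30, 3], [95, 0], [22, 6], [88, 1]]
y = [1, 1, 0, 0, 1, 0, 1, 0]

model = LogisticRegression().fit(X, y)
print(model.predict([[28, 4], [92, 0]]))         # predicted classes for new customers
print(model.predict_proba([[28, 4], [92, 0]]))   # churn probabilities
```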
Data for Business Analytics
Four Types of Data Based on Measurement Scale:
Categorical (nominal) data, Ordinal data, Interval data, Ratio data
1. Categorical (nominal) Data
Data placed in categories according to a specified characteristic. Categories bear
no quantitative relationship to one another
Examples: customer location (America, Europe, Asia), employee classification (manager, supervisor, associate)
2. Ordinal Data
Data that is ranked or ordered according to some relationship with one another
No fixed units of measurement
Examples: college football rankings, survey responses (poor, average, good)
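A small sketch of how nominal and ordinal data might be represented in pandas; the region and survey values are assumptions.

```python
# Minimal sketch: nominal vs. ordinal data in pandas.
import pandas as pd

# Nominal: categories with no quantitative relationship or order.
region = pd.Series(pd.Categorical(["America", "Europe", "Asia", "Asia"]))

# Ordinal: categories with a meaningful order but no fixed unit of measurement.
rating = pd.Series(pd.Categorical(["poor", "good", "average", "good"],
                                  categories=["poor", "average", "good"],
                                  ordered=True))

print(region.value_counts())          # counting is fine for nominal data
print(rating.min(), rating.max())     # order comparisons are valid only for ordinal data
```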

Volume: How much data is really relevant to the problem solution? Cost of processing? So, can
you really afford to store and process all that data?
Velocity:
Much data coming in at high speed
Need for streaming versus block approach to data analysis
So, how to analyze data in-flight and combine with data at-rest
Variety:
A small fraction is in structured formats (relational, XML, etc.)
A fair amount is semi-structured, such as web logs
The rest of the data is unstructured: text, photographs, etc.
So, no single data model can currently handle the diversity
Veracity: cover term for
Accuracy, Precision, Reliability, Integrity
So, what is it that you don't know you don't know about the data?
Value:
How much value is created for each unit of data (whatever it is)?
So, what is the contribution of subsets of the data to the problem solution?

Types of Analytics
Descriptive: A set of techniques for reviewing and examining the data set(s) to
understand the data and analyze business performance.
Diagnostic: A set of techniques for determining what has happened and why
Predictive: A set of techniques that analyze current and historical data to
determine what is most likely to (not) happen
Prescriptive: A set of techniques for computationally developing and analyzing alternatives that can become courses of action (tactical or strategic) and that may discover the unexpected
Decisive: A set of techniques for visualizing information and recommending
courses of action to facilitate human decision-making when presented with a set
of alternatives.
Advanced Analytics: Application of Analytics to Critical Problems
Goals of analytics: Monitoring and Alerting, Sensemaking (Data and Analytics Science), and Cents-Making (Getting to ROI!!)
Klein (2006) theorizes that sensemaking processes are initiated when individuals or organizations recognize a lack of understanding of events
Sensemaking Challenges
Lillian Wu (IBM) has noted that everything is becoming:
Instrumented: We now have the ability to measure, sense and see the exact condition of practically everything.
Interconnected: People, systems and objects can communicate and interact with each other in entirely new ways.
Intelligent: People, systems and objects can respond to changes quickly and accurately, and get better results by predicting and optimizing for future events.
How to deal with ambiguity?
How to deal with too much data?

#1: You will be expected to do something with information
#2: There really is more to know
#3: You will have to know more about knowing
#4: Brain science and decision science are
converging
#5: The environment is changing our brain
#6: Information management is the essence of
leadership
#7: A more connected world means much more
data is available (and accessible)
#8: Math matters (but so do logic and rules)
#9: There are significant downsides to not knowing
#10: Knowing can change the world
THE NEW ANALYTIC PARADIGM

Finding A Needle in a Haystack


Problem: The needle hasn't grown as fast as the haystack!!
Problem: We need new analytics methods to deal with larger, more complex data
and problems!!
Knowledge-Centric Systems (These are the types of systems using advanced
analytics)
User-centric Systems - Systems That Know
Knowledge-driven solutions that connect open information, share open decision-making rules,
deliver open composite services, access and navigate information in context of use, and provide
virtual assistants that manage cases and complete tasks.
Adaptive Systems - Systems That Learn
Knowledge-driven solutions that feature modeling, collaboration, and advanced analytics to
detect patterns, make sense, simulate, predict, learn, take action, and improve performance
with use and scale.
Smart Operations - Systems That Reason
Knowledge-driven solutions that reason like experts, advise as avatars, adapt, are autonomic,
perform autonomously.

Network Forensics
Networks have become exponentially faster. They carry more traffic and more types of data than
ever before. Yet as they get faster, they become more difficult to monitor and analyze.
Network forensics must be:
1.Precise: capture high-speed packets without droppage
2.Scalable: extend to new network technologies and speeds
3.Flexible: adapt to heterogeneous network segments
4.VOIP-Smart: reconstruct & replay VoIP calls; present Call Detail Records (CDR) for each call
5.Continuously available: run 24/7 with adequate storage; support real-time analysis

Finding the Knees


The knee of an algorithm or analytic is the scale value at which the performance begins to
degrade as larger data volumes are processed.
Factors affecting the knee:
- data structure, volume, and variety
- algorithm complexity and implementation, and
- infrastructure implementation.
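A hedged sketch of how the knee might be located empirically: time the same analytic at growing data volumes and watch where per-record cost starts to climb. The quadratic pairwise count is only a stand-in for whatever algorithm is under test.

```python
# Minimal sketch: look for the "knee" by timing an analytic at increasing data sizes.
import random
import time

def pairwise_close_count(points, threshold=0.01):
    # Stand-in analytic with O(n^2) cost, so degradation appears as n grows.
    return sum(1 for i, a in enumerate(points)
                 for b in points[i + 1:] if abs(a - b) < threshold)

for n in [1_000, 2_000, 4_000, 8_000]:
    data = [random.random() for _ in range(n)]
    start = time.perf_counter()
    pairwise_close_count(data)
    elapsed = time.perf_counter() - start
    print(f"n={n:>6}  total={elapsed:.3f}s  per-record={elapsed / n * 1e6:.1f} us")
```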

Finding the Tipping Point


A tipping point is one in which change in a system becomes potentially irreversible and maybe
even unstoppable
May be associated with negative or positive effects
Ex: MySpace was a formidable competitor of Facebook, but once the Facebook membership reached its tipping point people started abandoning MySpace and signing up for Facebook.

ADVANCED ANALYTICS
Crowdsourcing: A New Analytic?
Distribution of problem solving
Moving Analytics To the Edge
Traditional analysis: data is stored and then analyzed
-Usually at some central location or a few distributed locations
-The cost and time to move large amounts of data may render it obsolete or of little worth
Moving analytics to the frontier of the domain:
-(Near) real-time analysis and decisions are required
-Streaming massive amounts of data is expensive, fraught with error: microsecond latency,
millions of events
-We just can't store it all
-Perishability is a key factor
-May only really need the synthesized, aggregate information, not the raw data
-True data-driven analysis
Example: Pushing analytics into cameras for images, full motion video analysis, motion correction, 3D perception, etc.
WEB ANALYTICS
What is it?
- Now: The study of the behavior of web users
- Future: The study of one mechanism for how society makes decisions
- Example (behavior of web users): How many people clicked on "Ebola" (or related terms) in the past 2 months?

EDGE DEVICE ANALYTICS


STREAMING ANALYTICS
Streaming data: analytics must (often) occur in real time, as the data passes through the sensing/collecting device:
- Allows you to identify and examine patterns of interest as the data is being created.
- Can yield instant insight and immediate (re)action.
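A minimal streaming-analytics sketch: a sliding-window average is updated as each reading arrives and an alert is raised in real time; the sensor values, window size, and threshold are assumptions.

```python
# Minimal streaming sketch: per-event sliding-window average with an in-flight alert.
from collections import deque

def stream_monitor(readings, window=5, threshold=80.0):
    recent = deque(maxlen=window)            # keeps only the last `window` events
    for t, value in enumerate(readings):
        recent.append(value)
        avg = sum(recent) / len(recent)      # insight computed as the data is created
        if avg > threshold:
            yield t, avg                     # immediate (re)action point

sensor = [70, 72, 75, 79, 83, 88, 90, 76, 74]   # hypothetical in-flight readings
for t, avg in stream_monitor(sensor):
    print(f"t={t}: window average {avg:.1f} exceeds threshold")
```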

LOCATION ANALYTICS
What is it?
- Augmenting mission-critical, enterprise business systems with complementary content, mapping, and geographic capabilities
- Mapping & Visualization: use maps as the media to visualize data
- Spatial analytics: merging GIS with other types of analytics
- Find spatio-temporal patterns indicative of physical activities or social behavior
- Data/information enrichment: add maps, imagery, demographics, consumer and lifestyle data, environment and weather, social media, etc.
- Ubiquity of GPS on cellphones, cars, wristwatches, laptops, tablets, etc.

Visual analytics: the science of analytical reasoning facilitated by interactive visual interfaces; the formation of visual metaphors in combination with a human information discourse (interaction)
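A small sketch of the visualization half of this definition using matplotlib; the data points are invented, and a real visual-analytics tool would layer interaction (filtering, brushing, drill-down) on top of a plot like this.

```python
# Minimal visualization sketch: a scatter plot as the visual metaphor for a relationship.
import matplotlib.pyplot as plt

ad_spend = [10, 12, 9, 15, 20, 18, 22, 25, 24, 30]    # hypothetical values
revenue  = [100, 110, 95, 140, 190, 170, 210, 240, 230, 290]

plt.scatter(ad_spend, revenue)
plt.xlabel("ad spend")
plt.ylabel("revenue")
plt.title("Visual check of a suspected relationship")
plt.show()
```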

Advanced Analytics:Critical Challenges


A specific scientific or technological innovation that would remove a critical barrier to solving an
important domain problem with a high likelihood of global impact and feasibility.
A tool for focusing investigators working towards overcoming one or more
bottlenecks in a foreseeable path toward a solution to significant domain problems.
Provide scope for engineering ambition to build something that has never been seen
before
Examples:
Modeling Our Planet's Systems: Assessing global warming and determining mitigating actions
Confronting Existential Risk: What is the impact of a dangerous genetically modified pathogen?
Exploring Transhumanism: What is the impact of embedded nanotechnology, genetic therapy, and smart prosthetics?

ANALYTIC CHALLENGES
3. DATA ANALYTICS
What is Context-Aware Mobile Computing?
- Applications that can detect their users' situations and adapt their behavior accordingly
- Software that adapts according to its context!
Active context awareness: an application automatically adapts to discovered context by changing the application's behavior.
Calm technology
Passive context awareness: an application presents the new or updated context to an interested user, or makes the context persistent for the user to retrieve later.

What is context?
Context is any information that can be used to characterize the situation of an entity. An entity
is a person, place, or object that is considered relevant to the interaction between a user and an
application, including the user and the application themselves
Examples: temperature, user preferences, lighting, location, nearby resources (such as printers), history
CONTEXT SENSING
1. Low-level context sensing
Location (GPS, IR, etc.), orientation, time, temperature, pressure
2. High-level context sensing
Activity: machine vision based on camera technology & image processing; combining several low-level contexts (AI)
DEFINING CONTEXT
Computing context: connectivity, communication cost, bandwidth, nearby resources (printers,
displays, PCs)
User context: user profile, location, nearby people, social situation, activity, mood
Physical context: temperature, lighting, noise, traffic conditions
Context-Aware applications use context to:
Present services and information to a user
Examples: The time of day and restaurants near the user
Automatically execute a service for a user
Example: A phone automatically setting a weekly alarm for the user
Tag information to retrieve at a later time
Example: Phone keeps track of recent calls

System infrastructure
Examples:
1. Phone display adjusts the brightness of the display based on the surrounding area: uses a light sensor
2. Schneider trucking trackers: use GPS to track loads; send a notification when a load nears its destination; send emergency notifications when conditions are met

Decouple the context sensing part and the application
Message passing
Centralized architecture: the simplest way; client and server communicate by RPC; scalability problem
Distributed architecture
Wireless communication: wireless cellular networks, wireless LAN, wireless PAN or BAN

1. Population Imbalance: Events of interest occur relatively infrequently in very large datasets.
2. Data analysis:
Feature Selection: Information is distributed in a complex way across many features.
Mitigating False Alarms: Target patterns are ambiguous/unknown; squelch settings are brittle; cannot prevent false positives.
Domain Drift: Target patterns change/morph over time and across operational modes (processing methods become stale).
3. Ethical problems: Should we be required to inform individuals when we use their data?
4. Data annotation: The semantic web is largely unrealized:
Much metadata is lousy
Standardization of models and labels is a major issue
Integrating ontologies and vocabularies is a critical problem
No standards
5. End-to-end systems: Bottlenecks; data conversion, transfer, & loading overheads
6. Data sources: Heterogeneity and incompleteness
7. ROI (Return on Investment) is not always immediately obvious
Results of analytics may be available only after years of following the prescription
Requires long-term effort(s) to develop a sustainable capability
Health: moving from predictive to preventative health care
Crime: predicting potential crime locales and times to preventatively deploy police
Environment: predicting weather, floods, earthquakes, volcanic eruptions earlier
Computer Security: surveying systems to predict potential for attacks

4.DATA WAREHOUSE
WHAT IS A DATA WAREHOUSE?
- A data warehouse is a relational database that is designed for query and analysis rather than
for transaction processing
- A common way of introducing data warehousing is to refer to the characteristics of a data warehouse as set forth by William Inmon: Subject Oriented, Integrated, Nonvolatile, Time Variant
DATA WAREHOUSE PROPERTIES
Subject oriented: Data is categorized and stored by business subject rather than by application.
Integrated: Data warehouses must put data from disparate sources into a consistent format.
Time variant: Data is stored as a series of snapshots, each representing a period of time.
Non volatile: Typically data in the data warehouse is not updated or deleted. This is logical because the purpose of a warehouse is to enable you to analyze what has occurred.

DATA WAREHOUSE ARCHITECTURE

Data Warehouse Architecture (Basic)

Data Warehouse Architecture (with a Staging Area)

Data Warehouse Architecture (with a Staging Area and Data Marts)


Data Warehouse Architecture (Basic)
End users directly access data derived from several source systems through the data warehouse.

Operational Systems are the internal and external core systems that run the day-to-day business
operations. They are accessed through application program interfaces (APIs) and are the source of data
for the data warehouse and operational data store.
External Data is any data outside the normal data collected through an enterprise's internal applications. Generally, external data, such as demographic, credit, competitor, and financial information, is purchased by the enterprise from a vendor of such information.
Data Acquisition is the set of processes that capture, integrate, transform, cleanse, and load source data
into the data warehouse and operational data store

Data Warehouse Architecture (with a Staging Area)


You need to clean and process your operational data before putting it into the warehouse. You can do this programmatically, although most data warehouses use a staging area instead.

Data Warehouse Architecture (with a Staging Area and Data Marts)


You may want to customize your warehouse's architecture for different groups within your organization. You can do this by adding data marts, which are systems designed for a particular line of business.

DATA MART: A Data Mart is a small warehouse designed for a strategic business unit or a department.
Data Mart Advantages: The cost is low; implementation time is shorter; they are controlled locally rather than centrally; they contain less information than the data warehouse and hence have more rapid response; they allow a business unit to build its own DSS without relying on a centralized IS department.
Data Mart Types:

Replicated Data Marts.

Stand-alone Data Marts.

Corporate Information Factory

The Data Warehouse is a subject-oriented, integrated, time-variant, non-volatile collection of data used
to support the strategic decision-making process for the enterprise.
The Operational Data Store is a subject-oriented, integrated, current, volatile collection of data used to support the tactical decision-making process for the enterprise.

Operation and Administration is the set of activities required to ensure smooth daily operations, to ensure
that resources are optimized, and to ensure that growth is managed.
Systems Management is the set of processes for maintaining, versioning, and upgrading the core
technology on which the data, software, and tools operate.
Data Acquisition Management is the set of processes that manage and maintain processes used to
capture source data and its preparation for loading into the data warehouse or operational data store.
Service Management is the set of processes for promoting user satisfaction and productivity within the
Corporate Information Factory. It includes processes that manage and maintain service level agreements,
requests for change, user communications, and the data delivery mechanisms.
Change Management is the set of processes coordinating modifications to the Corporate Information
Factory.

TRANSPORTATION IN DATA WAREHOUSES

Transportation Using Flat Files
The most common method for transporting data is by the transfer of flat files, using mechanisms such as FTP or other remote file system access protocols.
Transportation Through Distributed Operations
Distributed queries, either with or without gateways, can be an effective mechanism for extracting data. These mechanisms also transport the data directly to the target system.
Transportation Using Transportable Tablespaces
Some databases, such as Oracle and DB2, introduced an important mechanism for transporting data: transportable tablespaces. This feature is the fastest way to move large volumes of data between two databases.
DATA WAREHOUSING SCHEMES

Star, Snowflake, Constellation
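A hedged sketch of the star scheme using SQLite from Python: a single fact table keyed to two dimension tables, queried with a typical join-and-aggregate. Table and column names are illustrative assumptions, not a prescribed design.

```python
# Minimal star-schema sketch: one fact table referencing two dimension tables.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_date    (date_id    INTEGER PRIMARY KEY, day  TEXT, month TEXT);
CREATE TABLE fact_sales  (product_id INTEGER, date_id INTEGER, units INTEGER, revenue REAL,
                          FOREIGN KEY(product_id) REFERENCES dim_product(product_id),
                          FOREIGN KEY(date_id)    REFERENCES dim_date(date_id));
INSERT INTO dim_product VALUES (1, 'Cookies', 'Food'), (2, 'Milk', 'Food');
INSERT INTO dim_date    VALUES (1, '2024-01-01', '2024-01'), (2, '2024-01-02', '2024-01');
INSERT INTO fact_sales  VALUES (1, 1, 10, 25.0), (2, 1, 8, 12.0), (1, 2, 7, 17.5);
""")

# Typical warehouse query: aggregate the facts by a dimension attribute.
for row in con.execute("""
        SELECT p.name, SUM(f.revenue)
        FROM fact_sales f JOIN dim_product p ON f.product_id = p.product_id
        GROUP BY p.name"""):
    print(row)
```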

DATA MINING

CIF Data Management is the set of processes that protect the integrity and continuity of the data within
and across the data warehouse and operational data store. It may employ a staging area for cleansing
and synchronizing data.
The Transactional Interface is an easy-to-use and intuitive interface for the end user to access and
manipulate data in the operational data store.
Data Delivery is the set of processes that enables end users and their supporting IT groups to filter, format,
and deliver data to data marts and oper-marts.
The Exploration Warehouse is a data mart whose purpose is to provide a safe haven for exploratory and
ad hoc processing. An exploration warehouse may utilize specialized technologies to provide fast
response times with the ability to access the entire database.
The Data Mining Warehouse includes tasks known as knowledge extraction, data archaeology, data
exploration, data pattern processing and data harvesting.
The OLAP (online analytical processing) Data Mart is aggregated and/or summarized data that is derived
from the data warehouse and tailored to support the multidimensional requirements of a given business
unit or business function.
The Oper-Mart is a subset of data derived from the operational data store, used in tactical analysis and usually stored in a multidimensional manner (star schema or hypercube). They may be created in a temporary manner and dismantled when no longer needed.
The Decision Support Interface is an easy-to-use, intuitive tool to enable end user capabilities such as
exploration, data mining, OLAP, query, and reporting to distill information from data.
Meta Data Management is the set of processes for managing the information needed to promote data
legibility, use, and administration.
Information Feedback is the set of processes that transmit the intelligence gained through usage of the
Corporate Information Factory to appropriate data stores.
Information Workshop is the set of the facilities that optimize use of the Corporate Information Factory
by organizing its capabilities and knowledge, and then assimilating them into the business process.
The Library and Toolbox is the collection of meta data and capabilities that provides information to
effectively use and administer the Corporate Information Factory.

The Workbench is a strategic mechanism for automating the integration of capabilities and knowledge into the business process.

Data mining is the process of extracting interesting patterns (non-trivial, implicit, previously unknown, and potentially useful) from large volumes of data.
WHY DATA MINING?

Data volume too large for classical analysis

Number of records too large (millions or billions)

High dimensional (attributes/features/fields) data (thousands)

Increased opportunity for access: Web navigation, on-line collections
KDD: KNOWLEDGE DISCOVERY IN DATABASES

Data Mining Models and Tasks

DATA MINING PROCESS
Interpret, evaluate and visualize patterns: What's new and interesting? Iterate if needed.
Manage discovered knowledge: Close the loop.

Database Processing vs. Data Mining
Query: database processing uses a well-defined query (SQL); data mining queries are poorly defined, with no specific query language
Data: database processing uses operational data; data mining typically uses non-operational data
Output: database processing output is precise (a subset of the database); data mining output is fuzzy

1. Understand application domain: prior knowledge, user goals
2. Create target dataset: select data, focus on subsets
3. Data cleaning and transformation: remove noise, outliers, missing values; select features, reduce dimensions
4. Apply data mining algorithm: associations, sequences, classification, clustering, etc.

DATA MINING METHODS:


Predictive modeling (classification, regression)
Segmentation (clustering)
Dependency modeling (graphical models, density estimation)
Summarization (associations)
Change and deviation detection
COMPONENTS OF DATA MINING METHODS
Representation: language for patterns/models, expressive power
Evaluation: scoring methods for deciding what is a good fit of model to data
Search: method for enumerating patterns/models
DATA MINING TECHNIQUES:
Association rules: detect sets of attributes that frequently co-occur, and rules among them, e.g. 90% of the people who buy cookies also buy milk (60% of all grocery shoppers buy both); see the sketch after this list
Sequence mining (categorical): discover sequences of events that commonly occur together, e.g. in a set of DNA sequences ACGTC is followed by GTCA after a gap of 9, with 30% probability
CBR or Similarity search: given a database of objects, and a query object, find the object(s)
that are within a user-defined distance of the queried object, or find all pairs within some
distance of each other.
Deviation detection: find the record(s) that is (are) the most different from the other records,
i.e., find all outliers. These may be thrown away as noise or may be the interesting ones.
Classification and regression: assign a new data record to one of several predefined categories
or classes. Regression deals with predicting real-valued fields. Also called supervised learning.
Clustering: partition the dataset into subsets or groups such that elements of a group share a
common set of properties, with high within group similarity and small inter-group similarity. Also
called unsupervised learning.
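A hedged sketch of the association-rule idea from the first item in this list: computing the support and confidence of a "cookies -> milk" rule over a handful of hypothetical baskets.

```python
# Minimal association-rule sketch: support and confidence of "cookies -> milk".
baskets = [
    {"cookies", "milk", "bread"},
    {"cookies", "milk"},
    {"bread", "milk"},
    {"cookies", "eggs"},
    {"cookies", "milk", "eggs"},
]  # hypothetical transactions

both    = sum(1 for b in baskets if {"cookies", "milk"} <= b)
cookies = sum(1 for b in baskets if "cookies" in b)

support    = both / len(baskets)   # fraction of all baskets containing both items
confidence = both / cookies        # fraction of cookie baskets that also contain milk
print(f"support = {support:.0%}, confidence = {confidence:.0%}")
```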
KDD PROCESS
-Learning the application domain: relevant prior knowledge and goals of application
-Creating a target data set: data selection
-Data cleaning and preprocessing (may take 60% of effort!)
-Data reduction and transformation: find useful features, dimensionality/variable reduction, invariant representation
-Choosing functions of data mining (summarization, classification, regression, association, clustering)
-Choosing the mining algorithm(s)
-Data mining: search for patterns of interest
-Pattern evaluation and knowledge presentation: visualization, transformation, removing redundant patterns, etc.
-Use of discovered knowledge

MEASURES OF DATA QUALITY
Accuracy, Completeness, Consistency, Timeliness, Believability, Value added, Interpretability, Accessibility

Why Data Preprocessing?
-Data in the real world is dirty:
incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
noisy: containing errors or outliers
inconsistent: containing discrepancies in codes or names
-No quality data, no quality mining results!
Quality decisions must be based on quality data
Data warehouse needs consistent integration of quality data
Major Tasks in Data Preprocessing
Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
Data integration: integration of multiple databases, data cubes, or files
Data transformation: normalization and aggregation
Data reduction: obtains reduced representation in volume but produces the same or similar analytical results
Data discretization: part of data reduction but with particular importance, especially for numerical data
NOISY DATA

Noise: random error or variance in a measured variable


HOW TO HANDLE NOISY DATA
Binning method: first sort data and partition into (equi-depth) bins, then smooth by bin means, by bin median, by bin boundaries, etc. (see the sketch after this list)
Clustering: detect and remove outliers
Combined computer and human inspection: detect suspicious values and check by human
Regression: smooth by fitting the data into regression functions
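A minimal sketch of the binning method named above: sort, split into equi-depth bins, then smooth by bin means. The price values are hypothetical.

```python
# Minimal sketch: equi-depth binning followed by smoothing by bin means.
def smooth_by_bin_means(values, n_bins=3):
    data = sorted(values)                     # 1. sort the data
    size = len(data) // n_bins                # 2. partition into equi-depth bins
    smoothed = []
    for i in range(n_bins):
        chunk = data[i * size:] if i == n_bins - 1 else data[i * size:(i + 1) * size]
        mean = sum(chunk) / len(chunk)        # 3. replace each value by its bin mean
        smoothed.extend([round(mean, 1)] * len(chunk))
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]   # hypothetical price values
print(smooth_by_bin_means(prices))
# -> [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```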


DATA TRANSFORMATION
Smoothing: remove noise from data
Aggregation: summarization, data cube construction
Generalization: concept hierarchy climbing
Normalization: scaled to fall within a small, specified range
- min-max normalization
- z-score normalization
- normalization by decimal scaling
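A short sketch of two of the normalization methods just listed, min-max and z-score; the raw values are assumptions.

```python
# Minimal sketch: min-max and z-score normalization of one attribute.
import statistics as st

values = [200, 300, 400, 600, 1000]   # hypothetical raw values

# Min-max: rescale into the range [0, 1].
lo, hi = min(values), max(values)
min_max = [(v - lo) / (hi - lo) for v in values]

# Z-score: centre on the mean and scale by the standard deviation.
mean, sd = st.mean(values), st.stdev(values)
z_score = [round((v - mean) / sd, 3) for v in values]

print(min_max)    # [0.0, 0.125, 0.25, 0.5, 1.0]
print(z_score)
```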


DATA MINING TASKS
Classification maps data into predefined groups or classes: supervised learning, pattern recognition, prediction
Regression is used to map a data item to a real-valued prediction variable.
Clustering groups similar data together into clusters: unsupervised learning, segmentation, partitioning
Summarization maps data into subsets with associated simple descriptions: characterization, generalization
Link Analysis uncovers relationships among data: affinity analysis, association rules
Sequential Analysis determines sequential patterns.

DATA MINING TECHNIQUES
Association or link analysis searches all details or transactions from operational systems for patterns with a high probability of repetition
Results in the development of an associative algorithm that correlates one set of events or items with another set of events or items
Another example of link analysis:
Market basket analysis: analysing the products contained in a purchaser's basket and then using an associative rule to compare hundreds of thousands of baskets
Sequencing or time-series analysis: techniques that relate events in time
Prediction of interest rate fluctuations or stock performance based on a series of preceding events
E.g. buying sequence: parents buy promotional toys associated with a particular movie within 2 weeks after renting the movie
-> a flyer campaign for promotional toys should be linked to customer lists created as a result of movie rentals
Association and sequencing tools analyse data to discover rules that identify patterns of behaviour. An association tool will find rules such as:
When people buy diapers they also buy beer 50 percent of the time.
A sequencing technique is very similar to an association technique, but it adds time to the analysis and produces rules such as:
People who have purchased a VCR are three times more likely to purchase a camcorder in the time period two to four months after the VCR was purchased.
Clustering: a technique for creating partitions so that all members of each set are similar according to some metric or set of metrics
E.g., credit card purchase data:
Cluster 1: business-issued gold card, meals charged on weekdays, mean values greater than $250
Cluster 2: personal platinum card, meals charged on weekends, mean value $175, bottle of wine charged more than 65% of the time
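To make the clustering example above concrete, here is a hedged k-means sketch with scikit-learn; the two features (mean charge amount, fraction of weekend charges) and their values are assumptions, not real cardholder data.

```python
# Minimal clustering sketch: k-means over hypothetical credit-card purchase features.
from sklearn.cluster import KMeans

# Per cardholder: [mean charge amount, fraction of charges made on weekends]
X = [[260, 0.10], [270, 0.20], [255, 0.15],    # weekday, higher-value behaviour
     [175, 0.80], [180, 0.75], [170, 0.85]]    # weekend, lower-value behaviour

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)            # cluster membership for each cardholder
print(km.cluster_centers_)   # centroids, analogous to Cluster 1 / Cluster 2 above
```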
