Introduction to Big Data and Data Mining
Informatics Study Program 2021
Anna Baita, M.Kom.
Faculty of Computer Science
• SCPMK 1681901: Students can explain the basic concepts of data mining
• Outline:
• What is Big Data?
• Big Data Characteristics
• Types of Big Data Tools
• Big Data Analytics
• What is Data Mining?
• Applications of Data Mining
The World is Changing
We Can Do Everything Online
Data Never Sleeps
2021
Source: https://www.visualcapitalist.com/from-amazon-to-zoom-what-happens-in-an-internet-minute-in-2021/
Data Explosion
Source: https://www.herox.com/blog/138-the-internet-of-things
What Is BIG DATA??
BIG DATA ??
Big Data can be defined as data whose scale (volume), distribution
(velocity), and diversity (variety) are very large, and/or that is
persistent, so that it requires innovative technical architectures
and analytical methods to gain insights that can deliver new
business value (meaningful information)
McKinsey Global Institute(2011)
BIG DATA
Big data is a term for a collection of data so large or complex
that it can no longer be handled by conventional computer
technology systems
(Hurwitz, et al., 2013).
When Is Data Considered Big Data?
Characteristics of Big Data
Big Data with the 10 Vs
Volume
• Facebook generates 10 TB of new data
every day, Twitter 7 TB
• A Boeing 737 generates 240 terabytes
of flight data during a single flight
from one part of the US to another
• Microsoft now has one million servers,
fewer than Google but more than
Amazon, according to Ballmer (2013).
Data Center Infrastructure
Data Volume Units
Unit             Value
Bit (b)          0 or 1
Byte (B)         8 bits
Kilobyte (KB)    1000^1 bytes
Megabyte (MB)    1000^2 bytes
Gigabyte (GB)    1000^3 bytes
Terabyte (TB)    1000^4 bytes
Petabyte (PB)    1000^5 bytes
Exabyte (EB)     1000^6 bytes
Zettabyte (ZB)   1000^7 bytes
Yottabyte (YB)   1000^8 bytes
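The unit table can be turned into a small converter. The following is a minimal sketch that assumes the decimal (powers of 1000) convention used in the table; the function names `to_bytes` and `humanize` are illustrative and not part of the slides.

```python
# Decimal data-volume units from the table: each unit is 1000x the previous one.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

def to_bytes(value, unit):
    """Convert a value in the given unit to bytes (1 KB = 1000 bytes)."""
    exponent = UNITS.index(unit)
    return value * 1000 ** exponent

def humanize(num_bytes):
    """Express a byte count in the largest unit that keeps the value below 1000."""
    value = float(num_bytes)
    for unit in UNITS:
        if value < 1000 or unit == UNITS[-1]:
            return f"{value:g} {unit}"
        value /= 1000

# The Facebook figure from the Volume slide, expressed in bytes.
print(to_bytes(10, "TB"))
# The Boeing 737 figure, converted from bytes back to a readable unit.
print(humanize(240 * 1000 ** 4))
```

Binary units (powers of 1024) would give slightly different numbers; the sketch follows the table as written.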
Variety
Variety: a collection of many kinds of data: structured, semi-structured,
and unstructured data (the last of which almost certainly dominates).
RDBMS: Structured
Files: Semi-structured
Files, Web Services, NoSQL: Unstructured
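The three kinds of data can be illustrated with Python's standard library. This is a minimal sketch; the sample records (names, city, sentence) are made up for illustration, and CSV and JSON are chosen here only as common examples of structured and semi-structured formats.

```python
import csv
import io
import json

# Structured: fixed columns, as in an RDBMS table or a CSV export.
structured = io.StringIO("id,name,city\n1,Ani,Yogyakarta\n")
rows = list(csv.DictReader(structured))

# Semi-structured: self-describing fields that may differ from record to record.
semi = json.loads('{"id": 2, "name": "Budi", "tags": ["new", "vip"]}')

# Unstructured: free text with no predefined fields; it must be parsed or mined.
unstructured = "Budi said the delivery was late but the product is great."
words = unstructured.lower().split()

print(rows[0]["city"])   # field access by column name
print(semi["tags"])      # nested field that other records may lack
print(len(words))        # only raw tokens are available
```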
Velocity
Velocity refers to the speed at which data enters a system
and must be processed
Veracity
• Veracity refers to how accurate and trustworthy the data is.
• Business processes are prone to errors, depending on the data
• How far can data be trusted, given the reliability of its source?
• How do we manage and process the data to tell which is correct and which is wrong?
Variability
• Variability refers to the fact that Big Data is constantly changing.
• Data arrives continuously from different sources, and the question is how
efficiently noise can be distinguished from important data
Value
• Value: how valuable or meaningful the data is.
• Data has value if the results of processing it help in making better
decisions. This characteristic usually matters most in business.
Types Of Tools Used in Big Data
• Where is processing hosted?
  • Distributed servers/cloud (e.g. Amazon EC2)
• Where is data stored?
  • Distributed storage (e.g. Amazon S3, Google File System (GFS), Hadoop Distributed File System (HDFS),
    Google Cloud Storage)
• What is the programming model?
  • Distributed processing (e.g. MapReduce)
• How is data stored and indexed?
  • High-performance schema-free databases (e.g. MongoDB)
• What operations are performed on data?
  • Analytic/semantic processing
Distributed Storage
NameNode: works as the master and is mainly used for storing the metadata
DataNode: DataNodes work as slaves and are mainly used for storing
the actual data
https://media.geeksforgeeks.org/wp-content/cdn-uploads/20200728155931/Namenode-and-Datanode.png
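The division of labor between NameNode and DataNode can be sketched as a toy in-memory model. This is not the HDFS API: the classes, the 4-character block size, and the round-robin placement are assumptions made for illustration. Real HDFS also replicates blocks and lets clients exchange data with DataNodes directly.

```python
class DataNode:
    """Worker: stores the actual block contents."""
    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def store(self, block_id, data):
        self.blocks[block_id] = data

    def read(self, block_id):
        return self.blocks[block_id]


class NameNode:
    """Master: keeps only metadata (which blocks form a file and where they live)."""
    def __init__(self, datanodes, block_size=4):
        self.datanodes = datanodes
        self.block_size = block_size
        self.metadata = {}  # filename -> list of (block_id, datanode name)

    def write(self, filename, content):
        placements = []
        for start in range(0, len(content), self.block_size):
            index = start // self.block_size
            block_id = f"{filename}#{index}"
            # Toy placement policy: spread blocks round-robin over the workers.
            node = self.datanodes[index % len(self.datanodes)]
            node.store(block_id, content[start:start + self.block_size])
            placements.append((block_id, node.name))
        self.metadata[filename] = placements

    def read(self, filename):
        by_name = {node.name: node for node in self.datanodes}
        return "".join(by_name[name].read(block_id)
                       for block_id, name in self.metadata[filename])


nodes = [DataNode("dn1"), DataNode("dn2")]
nn = NameNode(nodes)
nn.write("log.txt", "big data!")
print(nn.metadata["log.txt"])  # metadata only: block ids and their locations
print(nn.read("log.txt"))      # content reassembled from the DataNodes
```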
Map Reduce
• MapReduce is a programming model designed to process very large
amounts of data by splitting the processing into several tasks that are
independent of one another.
Source: https://www.todaysoftmag.com/images/articles/tsm33/large/a11.png
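The model can be illustrated with the classic word-count example. This is a single-process sketch in plain Python, not a Hadoop job; the function names and the two sample input splits are made up for illustration. In a real cluster each map call and each reduce call could run on a different machine, because they do not depend on one another.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word, 1) for word in document.lower().split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)

# Two independent input splits; each could be mapped on a separate node.
splits = ["big data is big", "data mining finds patterns in data"]
mapped = [pair for split in splits for pair in map_phase(split)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts)
```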
We are drowning in data,
but starving for knowledge
Big Data Analytics
• Big Data analytics is the process of collecting, organizing and analyzing large
sets of data (called Big Data) to discover patterns and other useful information.
• Big Data analytics can help organizations to better understand the information
contained within the data and will also help identify the data that is most
important to the business and future business decisions. Analysts working with
Big Data typically want the knowledge that comes from analyzing the data.
High-Performance Analytics Required:
• To analyze such a large volume of data, Big Data analytics is typically performed
using specialized software tools and applications for predictive analytics, data
mining, text mining, forecasting and data optimization.
• Collectively, these processes are separate but highly integrated functions of
high-performance analytics.
• Using Big Data tools and software enables an organization to process extremely
large volumes of data that a business has collected to determine which data is
relevant and can be analyzed to drive better business decisions in the future.
Data Mining..?
• The discipline that studies methods for extracting
knowledge or discovering patterns from large data
• Extraction from data to knowledge:
1. Data: recorded facts that carry no meaning by themselves
2. Knowledge: patterns, formulas, rules, or models that emerge from the data
• Other names for data mining:
Knowledge Discovery in Databases (KDD)
Knowledge extraction
Pattern analysis
Information harvesting
Business intelligence
Data Mining
• Extracting important information that is implicit and
previously unknown from data (Witten et al., 2011)
• An activity that covers collecting and using historical
data to find regularities, patterns, and relationships
in large data sets (Santosa, 2007)
• Extraction of interesting (non-trivial, implicit,
previously unknown, and potentially useful)
patterns or knowledge from huge amounts of data (Han et al., 2011)
Data Mining..?
Data -> Data Mining Method -> Knowledge
Data - Information - Knowledge
Employee attendance data
NIP    DATE         TIME IN   TIME OUT
1103   02/05/2023   07:20     15:40
1142   02/05/2023   07:45     15:33
1156   02/05/2023   07:51     16:00
1173   02/05/2023   08:00     15:15
1180   02/05/2023   07:01     16:31
1183   02/05/2023   07:49     17:00
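A first step from data to information is a simple derived quantity. The sketch below takes the attendance records from the table and computes the minutes each employee was present; the helper name `minutes_worked` is illustrative, and the calculation assumes clock-in and clock-out fall on the same day.

```python
from datetime import datetime

# Raw records from the attendance table: (NIP, time in, time out).
records = [
    ("1103", "07:20", "15:40"),
    ("1142", "07:45", "15:33"),
    ("1156", "07:51", "16:00"),
    ("1173", "08:00", "15:15"),
    ("1180", "07:01", "16:31"),
    ("1183", "07:49", "17:00"),
]

def minutes_worked(time_in, time_out):
    """Minutes between clock-in and clock-out on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(time_out, fmt) - datetime.strptime(time_in, fmt)
    return int(delta.total_seconds() // 60)

# Information: one derived number per employee instead of raw timestamps.
worked = {nip: minutes_worked(t_in, t_out) for nip, t_in, t_out in records}
longest = max(worked, key=worked.get)
print(worked)
print(longest)
```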
Data- Information-Knowledge
Employee Attendance Monthly Accumulation Information
NIP Present Absent Leave sick late
1103 22
1142 18 2 2
1156 10 1 11
1173 12 5 5
1180 10 12
Data - Information - Knowledge
Employee Weekly Attendance Habit Pattern
                 Monday   Tuesday   Wednesday   Thursday   Friday
Late             7        0         1           0          5
Early Time Out   0        1         1           1          8
Leave            3        0         0           1          4
Absent           1        0         2           0          2
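The pattern in the weekly table can be made explicit by summing the irregularities per day. This is a minimal sketch that uses the counts from the table; treating all four categories as equally weighted is an assumption made for illustration.

```python
days = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"]
pattern = {
    "Late":           [7, 0, 1, 0, 5],
    "Early Time Out": [0, 1, 1, 1, 8],
    "Leave":          [3, 0, 0, 1, 4],
    "Absent":         [1, 0, 2, 0, 2],
}

# Total irregularities per day (column sums of the table).
totals = {day: sum(row[i] for row in pattern.values()) for i, day in enumerate(days)}

# Rank the days from most to least irregular.
ranked = sorted(totals, key=totals.get, reverse=True)
print(totals)
print(ranked[:2])  # the two days that stand out
```

The two top-ranked days are the ones a working-hours policy would most plausibly target.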
Data - Information - Knowledge - Policy
• Policy for setting special working hours on Monday and Friday
• Working hours regulations:
  • Monday start: 10:00 am
  • Friday end: 2:00 pm
• The remaining working hours are compensated on another day
Do you agree..?
Problems of data mining
• Tremendous amount of data
  • Algorithms must be highly scalable to handle data volumes such as terabytes
• High dimensionality of data
  • A micro-array may have tens of thousands of dimensions
• High complexity of data
  • Data streams and sensor data
  • Time-series data, temporal data, sequence data
  • Structure data, graphs, social networks, and multi-linked data
  • Heterogeneous databases and legacy databases
  • Spatial, spatiotemporal, multimedia, text, and Web data
  • Software programs, scientific simulations
• New and sophisticated applications
Data Mining Applications
• Financial Data Analysis
• Retail Industry
• Telecommunication Industry
• Biological Data Analysis
• Other Scientific Applications
• Intrusion Detection
Big Data VS Data Mining
• 1. Focus
  Big Data: mainly focuses on lots of relationships between data
  Data Mining: mainly focuses on lots of details of data
• 2. View
  Big Data: the big picture of data
  Data Mining: a close-up view of data
• 3. Data
  Big Data: expresses the "why" of the data
  Data Mining: expresses the "what" of the data
• 4. Volume
  Big Data: refers to a large number of data sets
  Data Mining: can be used for small data or big data
• 5. Definition
  Big Data: a concept rather than a precise term
  Data Mining: a technique for analyzing data
• 6. Data Types
  Big Data: structured, semi-structured, and unstructured data
  Data Mining: structured data, relational and dimensional databases
• 7. Analysis
  Big Data: mainly data analysis, focused on the prediction and discovery of business factors on a large scale
  Data Mining: mainly statistical analysis, focused on the prediction and discovery of business factors on a small scale
• 8. Results
  Big Data: dashboards and predictive measures
  Data Mining: mainly for strategic decision-making
ANY QUESTION?