
Data Engineer 101

Syarif Hidayatullah
Shiftacademy Tutor
AWS & Google Cloud Certified

Sunday, 18 April 2021


The Concept of Data
Data is a collection of facts that describe
an event that has occurred.

The Concept of Metadata
Metadata is structured information that
describes, explains, locates, or otherwise makes
a piece of information easy to retrieve, use,
or manage.
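
To make the distinction concrete, here is a minimal sketch of a piece of data next to metadata that describes it. The sensor reading and every field name are hypothetical, chosen only for illustration.

```python
import json

# Hypothetical sensor reading: the "data" is the recorded fact itself.
data = {"sensor_id": "S-01", "temperature_c": 31.5}

# Hypothetical metadata: structured information that describes the data
# so that it can be found, used, and managed later.
metadata = {
    "source": "warehouse sensor",
    "format": "json",
    "fields": sorted(data.keys()),
    "size_bytes": len(json.dumps(data).encode("utf-8")),
}

print(metadata["fields"])
```

The metadata says nothing about the temperature itself; it only describes where the reading came from and how it is shaped.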
Data Types

Structured Data
Data with a fixed schema format, such as a table
with columns and rows. Structured data can be
easily processed and mined for insight.
- relational database
- spreadsheet

Semi-structured Data
Data with a defined schema format other than
rows and columns.
- json
- xml
- avro

Unstructured Data
Data with no specific schema; its layout follows
the standard of each file type.
- image
- video
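
The three data types can be sketched with the Python standard library alone. The sample values below are made up for illustration, and the byte string only stands in for a real image file.

```python
import csv
import io
import json

# Structured: fixed schema, every row has the same columns.
csv_text = "id,name,city\n1,Ani,Jakarta\n2,Budi,Bandung\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured: records share a general shape, but may nest or omit fields.
json_text = '[{"id": 1, "tags": ["a", "b"]}, {"id": 2, "profile": {"city": "Bandung"}}]'
records = json.loads(json_text)

# Unstructured: raw bytes whose meaning depends on the file type.
# These are the first four bytes of the PNG file signature.
image_bytes = bytes([0x89, 0x50, 0x4E, 0x47])

structured_columns = list(rows[0].keys())
semi_structured_keys = [sorted(r.keys()) for r in records]

print(structured_columns)
print(semi_structured_keys)
```

Every CSV row exposes the same columns, while the two JSON records carry different keys, which is why semi-structured data usually needs extra handling before it fits into a table.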
Data Storage

Database
● Data is organized in a schema
● Provides functions such as search, replication, indexing, etc.
● Stored as tables

File System
● Data is less organized
● No special functions
● Stored as files
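
The contrast can be sketched by storing the same records both ways. This example uses SQLite (from the Python standard library) as a stand-in for a database; the table, index, and file names are hypothetical.

```python
import os
import sqlite3
import tempfile

# Database: data lives in a table with a schema and can be indexed and searched.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.execute("CREATE INDEX idx_customers_city ON customers (city)")
conn.executemany(
    "INSERT INTO customers (id, name, city) VALUES (?, ?, ?)",
    [(1, "Ani", "Jakarta"), (2, "Budi", "Bandung")],
)
db_result = conn.execute(
    "SELECT name FROM customers WHERE city = ?", ("Bandung",)
).fetchall()
conn.close()

# File system: the same data is just lines in a file; searching means scanning it.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "customers.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("1,Ani,Jakarta\n2,Budi,Bandung\n")
    with open(path, encoding="utf-8") as f:
        file_result = [
            line.split(",")[1] for line in f if line.strip().endswith("Bandung")
        ]

print(db_result, file_result)
```

Both approaches find the same customer, but only the database offers the search and indexing functions as built-in features; with plain files that logic has to be written by hand.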
Data Professions

Data Engineer
• Construct, develop, test, and maintain data architecture
• Data acquisition
• Develop data transformation (ETL / ELT)
• Maintain Data Lake and Data Warehouse
• Mostly engineering

Data Analyst
• Look for insight from historical data
• Answer business questions
• Create reports and dashboards
• Develop data visualization
• Maintain the high-level data warehouse (Data Mart)

Data Scientist
• Build models using machine learning
• Develop features from the data warehouse
• Explore and examine data to find patterns
• Mostly statistics
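
The difference between ETL and ELT mentioned in the Data Engineer list can be sketched as follows. This is only an illustration under assumptions: SQLite stands in for the warehouse, and the order rows, table names, and the `transform` helper are hypothetical.

```python
import sqlite3

# Hypothetical raw rows as they might arrive from a source system.
raw_orders = [("1", " 120.50 ", "jakarta"), ("2", "80", "BANDUNG")]


def transform(row):
    """Clean one raw row: cast types, trim whitespace, normalize city names."""
    order_id, amount, city = row
    return int(order_id), float(amount.strip()), city.strip().title()


conn = sqlite3.connect(":memory:")

# ETL: transform in the pipeline, then load only the cleaned rows.
conn.execute("CREATE TABLE orders_etl (id INTEGER, amount REAL, city TEXT)")
conn.executemany(
    "INSERT INTO orders_etl VALUES (?, ?, ?)", [transform(r) for r in raw_orders]
)

# ELT: load the raw rows first, then transform inside the database with SQL.
conn.execute("CREATE TABLE orders_raw (id TEXT, amount TEXT, city TEXT)")
conn.executemany("INSERT INTO orders_raw VALUES (?, ?, ?)", raw_orders)
conn.execute(
    """
    CREATE TABLE orders_elt AS
    SELECT CAST(id AS INTEGER) AS id,
           CAST(TRIM(amount) AS REAL) AS amount,
           UPPER(SUBSTR(TRIM(city), 1, 1)) || LOWER(SUBSTR(TRIM(city), 2)) AS city
    FROM orders_raw
    """
)

etl_rows = conn.execute("SELECT * FROM orders_etl ORDER BY id").fetchall()
elt_rows = conn.execute("SELECT * FROM orders_elt ORDER BY id").fetchall()
conn.close()

print(etl_rows)
print(elt_rows)
```

Both paths end with the same cleaned table. The practical difference is where the transformation runs and whether the raw rows are kept in the target system.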
Data Professions

Data Governance
• Assist with the implementation of the data strategy and roadmap, and ensure business data consumed across the company is always accurate, complete, secure, and available.
• Work with IT and business partners to prioritize data needs
• Develop, define, and execute the data governance framework
• Develop and implement appropriate data governance and data quality processes
https://hackernoon.com/the-ai-hierarchy-of-needs-18f111fcc007
https://www.dpadvantage.co.uk/wp-content/uploads/2016/09/garbage-in-garbage-out-1.jpg
Source: datacamp
Big Data Architecture Overview
[Figure: architecture diagram. Data sources: local data, survey data, 3rd-party data, IoT data (Industry 4.0). Storage: data lake, data warehouse. Outputs: visualization, reporting, machine learning.]
Big Data Architecture Overview
Data Environment
[Figure: architecture diagram. Data sources: local data, survey data, 3rd-party data, IoT data (Industry 4.0). Storage: data lake, data warehouse. Outputs: visualization, reporting, machine learning. People and systems involved, from sources to consumers: Data Entry, Apps, Transaction, Data Engineer, Business Intelligence, Data Analyst, Data Science, Management, Stakeholder.]
https://www.thorntech.com/wp-content/uploads/2019/01/DataLake-vs-DataWarehouse.png
Data Warehouse Level Design Hierarchy
[Figure: layered pyramid with axes "Access Level" and "Granularity". Layers from top to bottom:]
• Data Mart
• Integration
• Standardization
• Staging / Raw
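
One way to picture how data moves up through these layers is the sketch below. The slide only names the layers, so what each layer does here is an assumption: raw rows are cleaned (standardization), joined with reference data (integration), and then aggregated for business users (data mart). All records, field names, and functions are hypothetical.

```python
# Hypothetical raw records as they land in the staging / raw layer.
staging = [
    {"order_id": "1", "city": " jakarta", "amount": "100"},
    {"order_id": "2", "city": "JAKARTA ", "amount": "50"},
    {"order_id": "3", "city": "bandung", "amount": "70"},
]

# Hypothetical reference data that gets joined in at the integration layer.
city_sales_area = {"Jakarta": "Area A", "Bandung": "Area B"}


def standardize(rows):
    """Standardization: consistent types and formats, still one row per order."""
    return [
        {
            "order_id": int(r["order_id"]),
            "city": r["city"].strip().title(),
            "amount": float(r["amount"]),
        }
        for r in rows
    ]


def integrate(rows, areas):
    """Integration: enrich each order with data from another source."""
    return [{**r, "sales_area": areas[r["city"]]} for r in rows]


def build_mart(rows):
    """Data mart: aggregated, coarse-grained totals per city."""
    totals = {}
    for r in rows:
        totals[r["city"]] = totals.get(r["city"], 0.0) + r["amount"]
    return totals


standardized = standardize(staging)
integrated = integrate(standardized, city_sales_area)
sales_by_city = build_mart(integrated)

print(sales_by_city)
```

In this sketch the granularity gets coarser toward the top: staging keeps one row per order, while the data mart keeps one total per city.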
Big Data
Big data is larger, more complex data sets, especially
from new data sources. These data sets are so
voluminous that traditional data processing software
just can’t manage them. But these massive volumes of
data can be used to address business problems you
wouldn’t have been able to tackle before.
Source: Oracle
Big Data

Volume
Capable of processing petabytes of data.

Velocity
Capable of processing data faster.

Variety
Data comes in new data types: unstructured and semi-structured.
Big Data Platform
Big Data Providers
• On-premise data center
• Cloud provider
Hadoop
Introduction
Local vs Distributed File System
Big Data Platform - Hadoop
• Cluster based
• Uses MapReduce as the basic method to transform data
• Replication to ensure data availability
• Many tools built on top of Hadoop (Spark, Hive, Pig)
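
The MapReduce idea can be sketched in plain Python with a word count. This is not Hadoop code: it runs in a single process and only mimics the map, shuffle, and reduce steps that a cluster would distribute across machines. The function names and input lines are hypothetical.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


lines = ["big data big insight", "data lake data warehouse"]

mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())

print(counts)
```

Because each line is mapped independently and each key is reduced independently, both steps can be spread across many machines in a cluster.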
Hadoop Ecosystem
Case: Netflix Big Data Architecture

Source: Tableau blog


Hadoop Ecosystem
Apache Spark

Apache Spark is a lightning-fast unified analytics engine for big data and machine learning.

Source: Databricks
Hive
The Apache Hive ™ data warehouse software facilitates
reading, writing, and managing large datasets residing
in distributed storage using SQL. Structure can be
projected onto data already in storage.

Source: Databricks
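
The phrase "structure can be projected onto data already in storage" can be sketched without Hive. The example below is plain Python, not HiveQL: it assumes some delimited text already sits in storage and applies a schema only at read time. The schema, the `read_with_schema` helper, and the sample rows are hypothetical.

```python
import csv
import io

# Raw text assumed to be already sitting in storage, with no header row.
stored_text = "1,Ani,31.5\n2,Budi,28.0\n"

# Hypothetical schema declared later, at read time: (column name, type caster).
schema = [("id", int), ("name", str), ("score", float)]


def read_with_schema(text, schema):
    """Project column names and types onto raw delimited rows while reading."""
    for raw in csv.reader(io.StringIO(text)):
        yield {name: cast(value) for (name, cast), value in zip(schema, raw)}


rows = list(read_with_schema(stored_text, schema))

# Roughly what a query such as "SELECT name FROM t WHERE score > 30" computes.
high_scores = [r["name"] for r in rows if r["score"] > 30]

print(high_scores)
```

The stored text never changes; only the reader decides how to interpret it, so the same data can be read again later with a different schema.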
Next: Cloud Big Data
• Easy to deploy
• Pay as you use
• Free tier!
I’m a newbie, how do I start?
Get yourself ready!
• Linux basics
• SQL basics
• Python for analytics
• Cloud 101
