Hadoop Weekend
• Certified by Kimball University in the USA, where he attended classes taught
personally by Ralph Kimball, one of the leading gurus of data warehousing;
training also completed at TDWI, the world's largest Data Warehouse research
organization.
• VOLUME
• VELOCITY
• VARIETY
• VERACITY
• VALUE
• Big Data refers to data sets that can no longer be easily managed or
analyzed with the data tools, methods, or architectures available
until now.
• And then?
• In the year 2012 …
• Web log
• Click stream
• Sensor data
• Email
• Call center voice logs
• Images/video
• RFID data
• Location and geographic data
• Data acquired on the market
• Machine Learning
• Sentiments
• Text Processing
• Image Processing
• Video Analytics
• Log Parsing
• Collaborative Filtering
• Context Search
• Email & Content
• Machine Learning:
– automated learning from data
• Sentiments:
– sentiment analysis
• Text Processing:
– processing of text data
• Log Parsing
• Collaborative Filtering
• Context Search
• Email & Content
• Load
• Structure
• Response
• Complex Workload
• Economics (return on investment)
HDFS (redundant, reliable storage)
• To work with Big Data, we believe the best path is to know the tools
that are used
• Have a mixed profile: technical and business
• Know Business Intelligence and Data Warehousing
• Understand the company's processes
• Know statistics and mathematics
• Developer
• Administrator
1. When the NameNode starts, it reads the fsimage and edits files from disk.
2. The transactions in edits are merged with fsimage, and edits is emptied.
3. For every block, the client requests that the NameNode provide a new block id
and a list of destination DataNodes.
4. The client writes the block directly to the first DataNode in the list.
5. The first DataNode pipelines the replication to the next DataNode in the list.
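The write pipeline in steps 3–5 can be modeled with a toy Python sketch. The class and method names here are illustrative only, not the real HDFS client API:

```python
# Toy model of the HDFS block write pipeline: the client asks the NameNode for
# a block id and a list of target DataNodes, writes the block to the first
# DataNode, and each DataNode forwards the block to the next one in the list.

class NameNode:
    def __init__(self, datanodes, replication=3):
        self.datanodes = datanodes
        self.replication = replication
        self.next_block_id = 0

    def allocate_block(self):
        """Hand out a new block id and the destination DataNodes."""
        block_id = self.next_block_id
        self.next_block_id += 1
        return block_id, self.datanodes[: self.replication]

class DataNode:
    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def write(self, block_id, data, pipeline):
        self.blocks[block_id] = data      # store the replica locally
        if pipeline:                      # pipeline replication onward
            pipeline[0].write(block_id, data, pipeline[1:])

datanodes = [DataNode(f"dn{i}") for i in range(4)]
nn = NameNode(datanodes)
block_id, targets = nn.allocate_block()
targets[0].write(block_id, b"block-bytes", targets[1:])
# each of the 3 target DataNodes now holds a replica of the block
```

The pipelined forwarding is why the client only pays the cost of one transfer while still getting three replicas.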
CETAX - All Rights Reserved
Exercise 2
Understanding HDFS and Storage Blocks
[Diagram: MapReduce shuffle, reduce side — each input split feeds a Mapper;
mapper output is collected in an in-memory buffer on the NodeManager and
spilled to disk; spill files are merged into a single file; the merged mapper
output becomes the Reducer input; the Reducer writes its output to HDFS]
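The shuffle flow above can be imitated with a minimal plain-Python word count (a sketch, not Hadoop code): map emits key/value pairs, the shuffle groups values by key the way merged spill files group mapper output, and the reducer aggregates each group.

```python
from collections import defaultdict

def mapper(line):
    """Map step: emit (word, 1) for every word in the line."""
    for word in line.split():
        yield word, 1

def shuffle(mapped):
    """Shuffle step: group all values by key, like the merged spill files."""
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    return groups

def reducer(key, values):
    """Reduce step: aggregate the values for one key."""
    return key, sum(values)

lines = ["big data big", "data lake"]
mapped = [kv for line in lines for kv in mapper(line)]
result = dict(reducer(k, v) for k, v in shuffle(mapped).items())
# result == {"big": 2, "data": 2, "lake": 1}
```

In real Hadoop the grouping happens across machines, but the map → group-by-key → reduce contract is the same.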
[Diagram: YARN application submission — 1. the client submits an application
to the ResourceManager; 2. the ASM (ApplicationsManager), alongside the
Scheduler in the ResourceManager, finds an appropriate NodeManager and
launches the ApplicationMaster container; MapReduce and other application
containers run on the NodeManagers]
• NFS gateway
• Hadoop WebHDFS
• hadoop fs -put
• Vendor connectors, Hue Explorer
• POSIX utility commands such as ls, mv, cp, touch, cat, mkdir are also supported
• Full list of commands: hadoop fs
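For scripted loads, `hadoop fs -put` can be driven from Python via subprocess. This sketch only builds the command list and assumes a Hadoop client on the PATH; the helper name `hdfs_put` is ours, not part of any Hadoop API:

```python
import subprocess

def hdfs_put(local_path, hdfs_path, run=False):
    """Build (and optionally run) a `hadoop fs -put` command line."""
    cmd = ["hadoop", "fs", "-put", local_path, hdfs_path]
    if run:
        # Requires the hadoop client binary on the PATH.
        subprocess.run(cmd, check=True)
    return cmd

cmd = hdfs_put("access.log", "/data/logs/access.log")
# cmd == ["hadoop", "fs", "-put", "access.log", "/data/logs/access.log"]
```

The same pattern works for the other `hadoop fs` subcommands (ls, cat, mkdir, …).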
[Diagram: data sources — weblog, sensor, operational/MPP, mobile — flowing
into Hadoop (NameNode, ResourceManager)]
1. User issues a SQL query
2. Hive parses and plans the query
3. The query is converted to a MapReduce or Tez job and executed on Hadoop,
with data-local processing
• Hive component
• Glue between Pig & Hive
– Schema visibility to
Pig Scripts & MapReduce
• REST API to
– Access Hive schemas
– Submit DDL
– Launch Hive queries
– Launch Pig jobs
– Launch MapReduce jobs
– Notifications to message broker
http://hortonworks.com/hdp/addons
– Up to 100x performance improvement
Attributes:
• sc.appName: Spark application name
• sc.master: Spark Master (local, yarn-client, etc)
• sc.version: Version of Spark being used
Functions:
• sc.parallelize(): create an RDD from local data
• sc.textFile(): create RDD from a text file in HDFS
• sc.stop(): stop the spark context
rdd = sc.parallelize([1, 2, 3, 4, 5])
rdd.first()  # 5
rdd.saveAsTextFile("myfile")
rdd.filter(lambda x: x % 2 == 0).collect()  # [2, 4]
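For intuition, the same filter semantics can be seen in plain Python, without Spark:

```python
# The RDD filter keeps the elements for which the lambda returns True,
# and collect() returns them as a list; plain Python behaves the same way.
data = [1, 2, 3, 4, 5]
evens = list(filter(lambda x: x % 2 == 0, data))
# evens == [2, 4]
```

The difference in Spark is that the filter is evaluated lazily and in parallel across partitions; only collect() materializes the result.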
hc.sql("use demo")
df1 = (hc.table("crimes")
       .select("year", "month", "day", "category")
       .filter("year > 2014").head(5))

hc.sql("use demo")
df1 = hc.sql("""
    SELECT year, month, day, category
    FROM crimes
    WHERE year > 2014""").head(5)
df1.first()
Row(age=23, cid=u'104', name=u'Bob', state=u'nc')
df1.take(2)
[Row(age=45, cid=u'104', name=u'Ram', state=u'fl'),
 Row(age=15, cid=u'102', name=u'Bob', state=u'ny')]
df1.distinct().show()
df1.drop_duplicates(["name"]).show()
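A plain-Python sketch (not the Spark implementation) of what distinct() versus drop_duplicates(["name"]) keep: distinct removes fully identical rows, while drop_duplicates keeps the first row seen for each name. The sample names mirror the rows above:

```python
rows = [
    {"name": "Ram", "age": 45},
    {"name": "Bob", "age": 15},
    {"name": "Bob", "age": 15},   # exact duplicate -> removed by distinct
    {"name": "Bob", "age": 23},   # same name, different age -> kept by distinct
]

def distinct(rows):
    """Keep one copy of each fully identical row."""
    seen, out = set(), []
    for r in rows:
        key = tuple(sorted(r.items()))
        if key not in seen:
            seen.add(key)
            out.append(r)
    return out

def drop_duplicates(rows, cols):
    """Keep the first row seen for each combination of the given columns."""
    seen, out = set(), []
    for r in rows:
        key = tuple(r[c] for c in cols)
        if key not in seen:
            seen.add(key)
            out.append(r)
    return out

len(distinct(rows))                   # 3 rows survive distinct
len(drop_duplicates(rows, ["name"]))  # 2 rows: first Ram, first Bob
```

Note that, like Spark, drop_duplicates keeps the first occurrence, so Bob's surviving row has age 15, not 23.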
• Questions?
• Be sure to visit our website and sign up for promotions and
job openings: www.cetax.com.br
THANK YOU VERY MUCH!