
Flume

Problem with HDFS and streaming data

In a regular POSIX-style filesystem, if you open a file and write data, the data
exists on disk even before the file is closed.
In HDFS, the file exists only as a directory entry; it shows zero length until the
file is closed.
This means that if data is written to a file for an extended period without
closing it, a network disconnect with the client will leave you with nothing
but an empty file for all your efforts.
Sources, channels and sinks
Things to remember

A source writes events to one or more channels.


A channel is the holding area as events are passed from a source to a sink.
A sink receives events from one channel only.
An agent can have many channels.
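As a quick sketch of these rules (the channel and sink names here are illustrative,
not from the original), one source can feed two channels, while each sink drains
exactly one of them:

agent.sources=s1
agent.channels=c1 c2
agent.sinks=k1 k2
agent.sources.s1.channels=c1 c2
agent.sinks.k1.channel=c1
agent.sinks.k2.channel=c2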
Flume events

The basic payload of data transported by Flume is called an event. An event is


composed of zero or more headers and a body.
The headers are key/value pairs that can be used to make routing decisions
or carry other structured information.
The body is an array of bytes that contains the actual payload. If your input
consists of tailed log files, the array is most likely a UTF-8-encoded string
containing a line of text.
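As an illustration of headers in practice (the key and value below are made-up
examples), the static interceptor covered in the next section can stamp every
event with a fixed key/value header that a channel selector can later route on:

agent.sources.s1.interceptors=i1
agent.sources.s1.interceptors.i1.type=static
agent.sources.s1.interceptors.i1.key=datacenter
agent.sources.s1.interceptors.i1.value=NYC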
Interceptors, channel selectors, and sink
processors

An interceptor is a point in your data flow where you can inspect and alter
Flume events. You can chain zero or more interceptors after a source creates
an event.
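A minimal chain using two of the built-in interceptors (timestamp and host,
which add a timestamp header and a host header to each event) might look like
this:

agent.sources.s1.interceptors=i1 i2
agent.sources.s1.interceptors.i1.type=timestamp
agent.sources.s1.interceptors.i2.type=host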
Channel selectors are responsible for how data moves from a source to one
or more channels. Flume comes packaged with two channel selectors that
cover most use cases you might have.
A sink processor is the mechanism by which you can create failover paths for
your sinks or load balance events across multiple sinks from a channel.
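A sketch of a failover configuration (the group and sink names are assumptions):
events go to k1 while it is healthy, and fail over to the lower-priority k2
when it is not:

agent.sinkgroups=g1
agent.sinkgroups.g1.sinks=k1 k2
agent.sinkgroups.g1.processor.type=failover
agent.sinkgroups.g1.processor.priority.k1=10
agent.sinkgroups.g1.processor.priority.k2=5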
Installing Flume

Download link can be found at flume.apache.org


Only the binary distribution archive is needed to run Flume; download the
source distribution only if you want to build Flume yourself.
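Installation itself is just unpacking the archive (the version number below is
an example; use whichever release you downloaded):

tar -xzf apache-flume-1.9.0-bin.tar.gz
cd apache-flume-1.9.0-bin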
Most important part

Flume configuration file:


Each agent is configured starting with three parameters:
agent.sources=<list of sources>
agent.channels=<list of channels>
agent.sinks=<list of sinks>
Example
agent.sources=s1
agent.channels=c1
agent.sinks=k1

agent.sources.s1.type=netcat
agent.sources.s1.channels=c1
agent.sources.s1.bind=0.0.0.0
agent.sources.s1.port=12345

agent.channels.c1.type=memory

agent.sinks.k1.type=logger
agent.sinks.k1.channel=c1
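The memory channel defaults to a fairly small capacity, so in practice you
usually size it explicitly (the values below are illustrative):

agent.channels.c1.capacity=10000
agent.channels.c1.transactionCapacity=100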
Starting the agent

Use command:
./bin/flume-ng agent -n agent -c conf -f conf/hw.conf
-Dflume.root.logger=INFO,console

Overrides root logger in conf/log4j.properties


If we don't, output goes to logs/flume.log
Use -Dflume.log.dir=<path> -Dflume.log.file=<name>.log to change this
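Putting the overrides together (the directory and file names here are
examples):

./bin/flume-ng agent -n agent -c conf -f conf/hw.conf -Dflume.log.dir=/var/log/flume -Dflume.log.file=agent.log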

To test the example, connect with netcat:
% nc localhost 12345
Hello World
OK
Different flows

Multi-hop flow: Events may travel through more than one agent before reaching
their final destination (see the sketch after this list).


Fan-out flow: Data flows from one source to multiple channels.
Replicating: Data is replicated to all configured channels.
Multiplexing: Data is sent only to the channel(s) named in the
header of the event.
Fan-in flow: Data flows from many sources into one channel.
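A sketch of a multi-hop (and fan-in) layout: a first-tier agent forwards
events over Avro to a second agent acting as collector (the hostname and port
are assumptions):

agent1.sinks=k1
agent1.sinks.k1.type=avro
agent1.sinks.k1.hostname=collector.example.com
agent1.sinks.k1.port=4141
agent1.sinks.k1.channel=c1

agent2.sources=r1
agent2.sources.r1.type=avro
agent2.sources.r1.bind=0.0.0.0
agent2.sources.r1.port=4141
agent2.sources.r1.channels=c1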
Using SQL source

agent.sources=s1
agent.channels=c1
agent.sources.s1.type=org.apache.flume.source.SQLSource
agent.sources.s1.connection.url=jdbc:location
agent.sources.s1.user=root
agent.sources.s1.password=pwd
agent.sources.s1.table=wlslog
agent.sources.s1.database=test
agent.sources.s1.columns.to.select=*
agent.sources.s1.incremental.column.name=id
agent.sources.s1.incremental.value=0
agent.sources.s1.run.query.delay=10000
agent.sources.s1.status.file.path=path/to/flume
agent.sources.s1.status.file.name=name.status
agent.sources.s1.channels=c1
JMS Source

agent.sources.s1.type=jms
agent.sources.s1.channels=c1
agent.sources.s1.initialContextFactory=org.apache.activemq.jndi.ActiveMQInitialContextFactory
agent.sources.s1.connectionFactory=registered JNDI name
agent.sources.s1.providerURL=network connection URL
agent.sources.s1.userName=jms_bot_user
agent.sources.s1.passwordFile=/path/to/password.txt
agent.sources.s1.destinationName=my_cool_data
agent.sources.s1.destinationType=queue or topic
agent.sources.s1.messageSelector=Age > 10
agent.sources.s1.errorThreshold=10 (default)
agent.sources.s1.batchSize=100 (default)
agent.sources.s1.pollTimeout=1000 (default)
agent.sources.s1.converter.type=class name implementing
org.apache.flume.source.jms.JMSMessageConverter
Channel selectors
Replicating:
agent.sources.s1.channels=c1 c2 c3
agent.sources.s1.selector.type=replicating
agent.sources.s1.selector.optional=c2 c3

Multiplexing:
agent.sources.s1.selector.type=multiplexing
agent.sources.s1.selector.header=port

Let's assume we used the Multiport Syslog TCP source to listen on four
ports (11111, 22222, 33333, and 44444) with a portHeader setting
of port:
agent.sources.s1.selector.default=c2
agent.sources.s1.selector.mapping.11111=c1 c2
agent.sources.s1.selector.mapping.44444=c2
agent.sources.s1.selector.optional.44444=c3
HDFS Sink

agent.channels=c1
agent.sinks=HDFS
agent.sinks.HDFS.type=hdfs
agent.sinks.HDFS.hdfs.path=hdfs://10.0.2.15:8020/flume/mysql
agent.sinks.HDFS.hdfs.fileType=DataStream
agent.sinks.HDFS.channel=c1
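Given the open-file problem described at the start, you will usually also set
roll policies so files are closed and made visible regularly (the values below
are illustrative; rollInterval is in seconds, rollSize in bytes, rollCount in
events):

agent.sinks.HDFS.hdfs.rollInterval=30
agent.sinks.HDFS.hdfs.rollSize=1048576
agent.sinks.HDFS.hdfs.rollCount=10000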
