
Flume

Problem with HDFS and streaming data

In a regular POSIX-style filesystem, if you open a file and write data, the data
exists on disk even before the file is closed.
In HDFS, the file exists only as a directory entry; it shows zero length until the
file is closed.
This means that if data is written to a file for an extended period without
closing it, a network disconnect with the client will leave you with nothing
but an empty file for all your efforts.
Sources, channels and sinks
Things to remember

A source writes events to one or more channels.


A channel is the holding area as events are passed from a source to a sink.
A sink receives events from one channel only.
An agent can have many channels.
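As a quick sketch of these rules (the channel and sink names here are illustrative,
not from the original), one source can feed two channels, while each sink drains
exactly one of them:

agent.sources=s1
agent.channels=c1 c2
agent.sinks=k1 k2
agent.sources.s1.channels=c1 c2
agent.sinks.k1.channel=c1
agent.sinks.k2.channel=c2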
Flume events

The basic payload of data transported by Flume is called an event. An event is


composed of zero or more headers and a body.
The headers are key/value pairs that can be used to make routing decisions
or carry other structured information.
The body is an array of bytes that contains the actual payload. If your input
consists of tailed log files, the array is most likely a UTF-8-encoded string
containing a line of text.
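As an illustration of headers in practice (the key and value below are made-up
examples), the static interceptor covered in the next section can stamp every
event with a fixed key/value header that a channel selector can later route on:

agent.sources.s1.interceptors=i1
agent.sources.s1.interceptors.i1.type=static
agent.sources.s1.interceptors.i1.key=datacenter
agent.sources.s1.interceptors.i1.value=NYC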
Interceptors, channel selectors, and sink
processors

An interceptor is a point in your data flow where you can inspect and alter
Flume events. You can chain zero or more interceptors after a source creates
an event.
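A minimal chain using two of the built-in interceptors (timestamp and host,
which add a timestamp header and a host header to each event) might look like
this:

agent.sources.s1.interceptors=i1 i2
agent.sources.s1.interceptors.i1.type=timestamp
agent.sources.s1.interceptors.i2.type=host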
Channel selectors are responsible for how data moves from a source to one
or more channels. Flume comes packaged with two channel selectors that
cover most use cases you might have.
A sink processor is the mechanism by which you can create failover paths for
your sinks or load balance events across multiple sinks from a channel.
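A sketch of a failover configuration (the group and sink names are assumptions):
events go to k1 while it is healthy, and fail over to the lower-priority k2
when it is not:

agent.sinkgroups=g1
agent.sinkgroups.g1.sinks=k1 k2
agent.sinkgroups.g1.processor.type=failover
agent.sinkgroups.g1.processor.priority.k1=10
agent.sinkgroups.g1.processor.priority.k2=5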
Installing Flume

Download link can be found at flume.apache.org


Only the binary distribution archive is needed to run Flume; download the
source distribution only if you want to build Flume yourself.
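Installation itself is just unpacking the archive (the version number below is
an example; use whichever release you downloaded):

tar -xzf apache-flume-1.9.0-bin.tar.gz
cd apache-flume-1.9.0-bin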
Most important part

Flume configuration file:


Each agent is configured starting with three parameters:
agent.sources=<list of sources>
agent.channels=<list of channels>
agent.sinks=<list of sinks>
Example
agent.sources=s1
agent.channels=c1
agent.sinks=k1

agent.sources.s1.type=netcat
agent.sources.s1.channels=c1
agent.sources.s1.bind=0.0.0.0
agent.sources.s1.port=12345

agent.channels.c1.type=memory

agent.sinks.k1.type=logger
agent.sinks.k1.channel=c1
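The memory channel defaults to a fairly small capacity, so in practice you
usually size it explicitly (the values below are illustrative):

agent.channels.c1.capacity=10000
agent.channels.c1.transactionCapacity=100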
Starting the agent

Use command:
./bin/flume-ng agent -n agent -c conf -f conf/hw.conf
-Dflume.root.logger=INFO,console

Overrides root logger in conf/log4j.properties


If we don't, output goes to logs/flume.log
Use -Dflume.log.dir=<path> -Dflume.log.file=<name>.log to change this
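Putting the overrides together (the directory and file names here are
examples):

./bin/flume-ng agent -n agent -c conf -f conf/hw.conf -Dflume.log.dir=/var/log/flume -Dflume.log.file=agent.log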

To test the example, connect with netcat:
% nc localhost 12345
Hello World
OK
Different flows

Multi-hop flow: Events may travel through more than one agent before reaching
their final destination (see the sketch after this list).


Fan-out flow: Data flows from one source to multiple channels.
Replicating: Data is replicated to all configured channels.
Multiplexing: Data is sent only to the channel(s) named in the
header of the event.
Fan-in flow: Data flows from many sources into one channel.
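A sketch of a multi-hop (and fan-in) layout: a first-tier agent forwards
events over Avro to a second agent acting as collector (the hostname and port
are assumptions):

agent1.sinks=k1
agent1.sinks.k1.type=avro
agent1.sinks.k1.hostname=collector.example.com
agent1.sinks.k1.port=4141
agent1.sinks.k1.channel=c1

agent2.sources=r1
agent2.sources.r1.type=avro
agent2.sources.r1.bind=0.0.0.0
agent2.sources.r1.port=4141
agent2.sources.r1.channels=c1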
Using SQL source

agent.sources=s1
agent.channels=c1
agent.sources.s1.type=org.apache.flume.source.SQLSource
agent.sources.s1.connection.url=jdbc:location
agent.sources.s1.user=root
agent.sources.s1.password=pwd
agent.sources.s1.table=wlslog
agent.sources.s1.database=test
agent.sources.s1.columns.to.select=*
agent.sources.s1.incremental.column.name=id
agent.sources.s1.incremental.value=0
agent.sources.s1.run.query.delay=10000
agent.sources.s1.status.file.path=path/to/flume
agent.sources.s1.status.file.name=name.status
agent.sources.s1.channels=c1
JMS Source

agent.sources.s1.type=jms
agent.sources.s1.channels=c1
agent.sources.s1.initialContextFactory=org.apache.activemq.jndi.ActiveMQInitialContextFactory
agent.sources.s1.connectionFactory=registered JNDI name
agent.sources.s1.providerURL=network connection URL
agent.sources.s1.userName=jms_bot_user
agent.sources.s1.passwordFile=/path/to/password.txt
agent.sources.s1.destinationName=my_cool_data
agent.sources.s1.destinationType=queue or topic
agent.sources.s1.messageSelector=Age > 10
agent.sources.s1.errorThreshold=10 (default)
agent.sources.s1.batchSize=100 (default)
agent.sources.s1.pollTimeout=1000 (default)
agent.sources.s1.converter.type=class name implementing
org.apache.flume.source.jms.JMSMessageConverter
Channel selectors
Replicating:
agent.sources.s1.channels=c1 c2 c3
agent.sources.s1.selector.type=replicating
agent.sources.s1.selector.optional=c2 c3

Multiplexing:
agent.sources.s1.selector.type=multiplexing
agent.sources.s1.selector.header=port

Let's assume we used the Multiport Syslog TCP source to listen on four
ports (11111, 22222, 33333, and 44444) with a portHeader setting
of port:
agent.sources.s1.selector.default=c2
agent.sources.s1.selector.mapping.11111=c1 c2
agent.sources.s1.selector.mapping.44444=c2
agent.sources.s1.selector.optional.44444=c3
HDFS Sink

agent.channels=c1
agent.sinks=HDFS
agent.sinks.HDFS.type=hdfs
agent.sinks.HDFS.hdfs.path=hdfs://10.0.2.15:8020/flume/mysql
agent.sinks.HDFS.hdfs.fileType=DataStream
agent.sinks.HDFS.channel=c1
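Given the open-file problem described at the start, you will usually also set
roll policies so files are closed and made visible regularly (the values below
are illustrative; rollInterval is in seconds, rollSize in bytes, rollCount in
events):

agent.sinks.HDFS.hdfs.rollInterval=30
agent.sinks.HDFS.hdfs.rollSize=1048576
agent.sinks.HDFS.hdfs.rollCount=10000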
