In a regular POSIX-style filesystem, if you open a file and write data, that data
is visible on the disk before the file is closed.
In HDFS, the file exists only as a directory entry; it shows zero length until the
file is closed.
This means that if data is written to a file for an extended period without
closing it, a network disconnect with the client will leave you with nothing
but an empty file for all your efforts.
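One common way to limit this exposure is to have the HDFS sink close (roll) its files regularly. The fragment below is a minimal sketch, assuming an agent named agent with an HDFS sink named HDFS; the values are illustrative, not recommendations.

```properties
# Close the current file and start a new one every 5 minutes ...
agent.sinks.HDFS.hdfs.rollInterval=300
# ... or once it reaches 128 MB (134217728 bytes), whichever comes first.
agent.sinks.HDFS.hdfs.rollSize=134217728
# 0 disables rolling based on the number of events written.
agent.sinks.HDFS.hdfs.rollCount=0
# Close a file that has received no writes for 60 seconds.
agent.sinks.HDFS.hdfs.idleTimeout=60
```

Shorter intervals reduce how much data sits in an unclosed file, at the cost of more and smaller files in HDFS.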
Sources, channels and sinks
Things to remember
An interceptor is a point in your data flow where you can inspect and alter
Flume events. You can chain zero or more interceptors after a source creates
an event.
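A chain of interceptors might be configured roughly as follows. This is a sketch that assumes a source named s1 on an agent named agent; the interceptor names i1 and i2 and the datacenter/dc1 header are made up for illustration.

```properties
# Interceptors run in the order listed: i1 first, then i2.
agent.sources.s1.interceptors=i1 i2
# i1 adds a "timestamp" header holding the time the event was processed.
agent.sources.s1.interceptors.i1.type=timestamp
# i2 adds a fixed header to every event (hypothetical key and value).
agent.sources.s1.interceptors.i2.type=static
agent.sources.s1.interceptors.i2.key=datacenter
agent.sources.s1.interceptors.i2.value=dc1
```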
Channel selectors are responsible for how data moves from a source to one
or more channels. Flume comes packaged with two channel selectors that
cover most use cases you might have.
A sink processor is the mechanism by which you can create failover paths for
your sinks or load balance events across multiple sinks from a channel.
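As an illustration, a failover path between two sinks could be declared with a sink group like the one below. The names g1, k1 and k2 are assumptions, and the definitions of the two sinks themselves (type, channel) are omitted.

```properties
agent.sinks=k1 k2
agent.sinkgroups=g1
agent.sinkgroups.g1.sinks=k1 k2
# Failover: the available sink with the highest priority value is used.
agent.sinkgroups.g1.processor.type=failover
agent.sinkgroups.g1.processor.priority.k1=10
agent.sinkgroups.g1.processor.priority.k2=5
# Upper limit, in milliseconds, on the backoff applied to a failed sink.
agent.sinkgroups.g1.processor.maxpenalty=10000
```

For load balancing instead of failover, the processor type would be load_balance, with processor.selector set to round_robin or random.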
Installing Flume
agent.sources=s1
agent.channels=c1
agent.sinks=k1
agent.sources.s1.type=netcat
agent.sources.s1.channels=c1
agent.sources.s1.bind=0.0.0.0
agent.sources.s1.port=12345
agent.channels.c1.type=memory
agent.sinks.k1.type=logger
agent.sinks.k1.channel=c1
Starting the agent
Use command:
./bin/flume-ng agent -n agent -c conf -f conf/hw.conf -Dflume.root.logger=INFO,console
For example:
% nc localhost 12345
Hello World
OK
Different flows
agent.sources=s1
agent.channels=c1
agent.sources.s1.type=org.apache.flume.source.SQLSource
agent.sources.s1.connectionurl=jdbc:location
agent.sources.s1.user=root
agent.sources.s1.password=pwd
agent.sources.s1.table=wlslog
agent.sources.s1.database=test
agent.sources.s1.columns.to.select=*
agent.sources.s1.incremental.column.name=id
agent.sources.s1.incremental.value=0
agent.sources.s1.run.query.delay=10000
agent.sources.s1.status.file.path=path/to/flume
agent.sources.s1.status.file.name=name.status
agent.sources.s1.channels=c1
JMS Source
agent.sources.s1.type=jms
agent.sources.s1.channels=c1
initialContextFactory=org.apache.activemq.jndi.ActiveMQInitialContextFactory
connectionFactory=registered jndi name
providerURL=network connection
userName=jms_bot_user
passwordFile=/path/to/password.txt
destinationName=my_cool_data
destinationType=queue/topic
messageSelector="Age > 10"
errorThreshold=Default(10)
batchSize=Default(100)
pollTimeout=Default(1000)
converter.type=name of a class implementing org.apache.flume.source.jms.JMSMessageConverter
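Put together as an agent configuration, the required parameters above might look like the following sketch. The connection factory name and broker URL are hypothetical placeholders; the remaining parameters are left at their defaults.

```properties
agent.sources=s1
agent.channels=c1
agent.sources.s1.type=jms
agent.sources.s1.channels=c1
agent.sources.s1.initialContextFactory=org.apache.activemq.jndi.ActiveMQInitialContextFactory
# Hypothetical JNDI name and broker address; replace with your own.
agent.sources.s1.connectionFactory=ConnectionFactory
agent.sources.s1.providerURL=tcp://mqserver:61616
agent.sources.s1.destinationName=my_cool_data
agent.sources.s1.destinationType=QUEUE
agent.sources.s1.userName=jms_bot_user
agent.sources.s1.passwordFile=/path/to/password.txt
```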
Channel selectors
Replicating:
agent.sources.s1.channels=c1 c2 c3
agent.sources.s1.selector.type=replicating
agent.sources.s1.selector.optional=c2 c3
Multiplexing:
agent.sources.s1.selector.type=multiplexing
agent.sources.s1.selector.header=port
Let's assume we used the Multiport Syslog TCP source to listen on four
ports (11111, 22222, 33333, and 44444) with a portHeader setting
of port:
agent.sources.s1.selector.default=c2
agent.sources.s1.selector.mapping.11111=c1 c2
agent.sources.s1.selector.mapping.44444=c2
agent.sources.s1.selector.optional.44444=c3
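The source side of this scenario could be sketched as follows, assuming the same source name s1 and a bind address of 0.0.0.0 (the bind address is an assumption, not part of the example above).

```properties
agent.sources.s1.type=multiport_syslogtcp
agent.sources.s1.host=0.0.0.0
# Space-separated list of ports to listen on.
agent.sources.s1.ports=11111 22222 33333 44444
# Each event gets a "port" header naming the port it arrived on,
# which is the header the multiplexing selector reads.
agent.sources.s1.portHeader=port
agent.sources.s1.channels=c1 c2 c3
```

With the mappings above, events arriving on port 11111 go to c1 and c2, events on 44444 go to c2 and optionally to c3, and events on 22222 or 33333 have no mapping and fall through to the default channel c2.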
HDFS Sink
agent.channels=c1
agent.sinks=HDFS
agent.sinks.HDFS.type=hdfs
agent.sinks.HDFS.hdfs.path=hdfs://10.0.2.15:8020/flume/mysql
agent.sinks.HDFS.hdfs.fileType=DataStream
agent.sinks.HDFS.channel=c1