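Introduction
In brief, Flume is a distributed, real-time data collection system that can monitor many kinds of data sources and deliver the data to many kinds of destinations. Common sources include log files, sockets, and Kafka; common destinations include HDFS, HBase, files, Kafka, and the logger.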
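Flume, provided by Cloudera, is a highly available, highly reliable, distributed system for collecting, aggregating, and transporting massive volumes of log data in real time. Flume lets you customize the data senders inside a logging system in order to collect data; it can also apply simple processing to the data and write it to a variety of (customizable) data receivers.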
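Flume currently exists in two version lines: the Flume 0.9X releases are collectively called Flume-og, and the Flume 1.X releases are collectively called Flume-ng.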
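Architecture
Flume agent
A Flume Event is defined as a unit of data flow that has a byte payload and an optional set of string attributes.
A Flume Agent is a (JVM) process that hosts the components through which events flow from an external source to the next destination (hop). To collect data from a given data source, you start one flume-ng process, and that process is one Flume agent. Each Flume agent handles the processing of one event flow. An agent contains three kinds of components (Source, Channel, Sink) that move events from an external data source to the next destination (a message queue, a data center, a file system, and so on).
Flume Source: consumes external data in a given format (Avro, TCP, text, and so on) and brings it into Flume (ingestion).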
Flume Channel: buffers events until a Flume Sink consumes them (staging).
Flume Sink: removes events from the Flume Channel and writes them to an external destination (Kafka, HDFS, and so on) (output).
Consolidation
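To make data hop across multiple Flume agents, connect the sink of the previous agent to the source of the next agent, using the avro type on both ends; the sink points to the hostname (or IP address) and port of the source.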
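As a rough sketch of such a hop, the two fragments below pair an avro sink on one agent with an avro source on the next. The agent names (agent1, agent2), the component names, the host collector.example.com, and port 4545 are hypothetical placeholders rather than values from this article. Each fragment would live in its own agent's configuration file, and each agent still needs its own channel declared.

```properties
# Upstream agent: an avro sink that forwards events to the next hop.
# All names, the hostname and the port are illustrative assumptions.
agent1.sinks = avro-forward-sink
agent1.sinks.avro-forward-sink.type = avro
agent1.sinks.avro-forward-sink.hostname = collector.example.com
agent1.sinks.avro-forward-sink.port = 4545
agent1.sinks.avro-forward-sink.channel = memory-channel

# Downstream agent: an avro source listening on the port the sink targets.
agent2.sources = avro-collector-source
agent2.sources.avro-collector-source.type = avro
agent2.sources.avro-collector-source.bind = 0.0.0.0
agent2.sources.avro-collector-source.port = 4545
agent2.sources.avro-collector-source.channels = memory-channel
```

The only values that must agree across the two agents are the sink's hostname and port and the address and port the source binds to.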
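A very common scenario in log collection is a large number of log-producing clients sending data to a few consumer agents that are attached to the storage subsystem. For example, logs collected from hundreds of web servers are sent to a dozen or so agents that write to an HDFS cluster.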
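One possible shape for a consolidating agent in this scenario is sketched below: a single avro source receives events from every upstream web-server agent and an HDFS sink writes them out. The agent name collector, the component names, port 4545, and the channel sizes are assumptions made for illustration; the HDFS path and file type simply reuse the values from the spooldir example later in this article.

```properties
# collector.conf: a consolidating agent (all names here are hypothetical)
collector.sources = avro-in
collector.channels = memory-channel
collector.sinks = hdfs-sink

# Every upstream web-server agent points its avro sink at this host and port.
collector.sources.avro-in.type = avro
collector.sources.avro-in.bind = 0.0.0.0
collector.sources.avro-in.port = 4545
collector.sources.avro-in.channels = memory-channel

# In-memory buffer between the avro source and the HDFS sink.
collector.channels.memory-channel.type = memory
collector.channels.memory-channel.capacity = 10000
collector.channels.memory-channel.transactionCapacity = 1000

# Write the merged stream to HDFS as plain data files.
collector.sinks.hdfs-sink.type = hdfs
collector.sinks.hdfs-sink.hdfs.path = /loudacre/weblogs
collector.sinks.hdfs-sink.hdfs.fileType = DataStream
collector.sinks.hdfs-sink.channel = memory-channel
```

Because many upstream sinks can connect to the same avro source, the web-server agents can all share one configuration, and only the small set of collectors needs access to HDFS.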
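Configuration files
- The --conf parameter specifies the general configuration directory, which covers logging configuration, environment variables, JVM options, and so on.
- The --conf-file parameter specifies the configuration file dedicated to one particular Flume agent. Two examples follow, each with its own configuration file.
The figure below shows an example of an agent-specific configuration file.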
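In text form, the general shape of such a file can be sketched as follows. The angle-bracket names are placeholders to replace with your own, not literal values, and the agent name used as the prefix is expected to match the value passed to --name when the agent is started.

```properties
# <agent>, <source>, <channel> and <sink> are placeholders, not literal names.
<agent>.sources = <source>
<agent>.channels = <channel>
<agent>.sinks = <sink>

# A source can feed several channels, hence the plural key "channels".
<agent>.sources.<source>.type = <source type>
<agent>.sources.<source>.channels = <channel>

# A sink drains exactly one channel, hence the singular key "channel".
<agent>.sinks.<sink>.type = <sink type>
<agent>.sinks.<sink>.channel = <channel>

<agent>.channels.<channel>.type = <channel type>
```

Both complete examples in the next section follow this pattern, with agent1 as the agent name.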
Flume agent examples
Collecting local data and writing it to HDFS
conf-file configuration
[training@ localhost flume]$ cat stubs/spooldir.conf
# spooldir.conf: A Spooling Directory Source
agent1.sources = webserver-log-source
agent1.sinks = hdfs-sink
agent1.channels = memory-channel

#webserver-log-source
agent1.sources.webserver-log-source.type=spooldir
agent1.sources.webserver-log-source.spoolDir=/flume/weblogs_spooldir
agent1.sources.webserver-log-source.channels=memory-channel

#hdfs-sink
agent1.sinks.hdfs-sink.type=hdfs
agent1.sinks.hdfs-sink.hdfs.path=/loudacre/weblogs
agent1.sinks.hdfs-sink.channel=memory-channel
agent1.sinks.hdfs-sink.hdfs.rollInterval = 0
agent1.sinks.hdfs-sink.hdfs.rollSize = 524288
agent1.sinks.hdfs-sink.hdfs.rollCount = 0
agent1.sinks.hdfs-sink.hdfs.fileType = DataStream

#memory-channel
agent1.channels.memory-channel.type=memory
agent1.channels.memory-channel.capacity=10000
agent1.channels.memory-channel.transactionCapacity=10000
flume-ng start command
[training@ localhost flume]$ flume-ng agent --conf /etc/flume-ng/conf --conf-file spooldir.conf --name agent1 -Dflume.root.logger=INFO,console
Monitoring a socket port with netcat and printing with the logger
conf-file configuration
[training@ localhost flume]$ cat stubs/bonus_netcat.conf
# bonus_netcat.conf: A netcat source
agent1.sources=netcat-source
agent1.sinks=logger-sink
agent1.channels=memory-channel

#netcat-source
agent1.sources.netcat-source.type=netcat
agent1.sources.netcat-source.bind = localhost
agent1.sources.netcat-source.port = 12345
agent1.sources.netcat-source.channels = memory-channel

#logger-sink
agent1.sinks.logger-sink.type = logger
agent1.sinks.logger-sink.channel = memory-channel

#memory-channel
agent1.channels.memory-channel.type = memory
flume-ng start command
[training@ localhost flume]$ flume-ng agent --conf /etc/flume-ng/conf --conf-file bonus_netcat.conf --name agent1 -Dflume.root.logger=INFO,console
flume-ng help
[training@ localhost flume]$ flume-ng help
Usage: /usr/lib/flume-ng/bin/flume-ng [options]...

commands:
  help                  display this help text
  agent                 run a Flume agent
  avro-client           run an avro Flume client
  version               show Flume version info

global options:
  --conf,-c             use configs in directory
  --classpath,-C        append to the classpath
  --dryrun,-d           do not actually start Flume, just print the command
  --plugins-path        colon-separated list of plugins.d directories. See the
                        plugins.d section in the user guide for more details.
                        Default: $FLUME_HOME/plugins.d
  -Dproperty=value      sets a Java system property value
  -Xproperty=value      sets a Java -X option

agent options:
  --name,-n             the name of this agent (required)
  --conf-file,-f        specify a config file (required if -z missing)
  --zkConnString,-z     specify the ZooKeeper connection to use (required if -f missing)
  --zkBasePath,-p       specify the base path in ZooKeeper for agent configs
  --no-reload-conf      do not reload config file if changed
  --help,-h             display help text

avro-client options:
  --rpcProps,-P         RPC client properties file with server connection params
  --host,-H             hostname to which events will be sent
  --port,-p             port of the avro source
  --dirname             directory to stream to avro source
  --filename,-F         text file to stream to avro source (default: std input)
  --headerFile,-R       File containing event headers as key/value pairs on each new line
  --help,-h             display help text

  Either --rpcProps or both --host and --port must be specified.

Note that if directory is specified, then it is always included first
in the classpath.