Big Data Notes

Tools

Installing ELK

wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-5.1.1.zip

unzip elasticsearch-5.1.1.zip

wget https://artifacts.elastic.co/downloads/kibana/kibana-5.1.1-darwin-x86_64.tar.gz

tar zxvf kibana-5.1.1-darwin-x86_64.tar.gz

wget https://artifacts.elastic.co/downloads/logstash/logstash-5.1.1.tar.gz

tar zxvf logstash-5.1.1.tar.gz

Installing X-Pack

Installing X-Pack into Elasticsearch

Change into the Elasticsearch directory and run:

./bin/elasticsearch-plugin install x-pack

During installation you will see the prompt "plugin requires additional permissions". Answer y to continue.

Start Elasticsearch:

./bin/elasticsearch

Installing X-Pack into Kibana

Change into the Kibana directory and run:

./bin/kibana-plugin install x-pack

Start Kibana:

./bin/kibana

Open http://localhost:5601/ and log in with the default X-Pack credentials:

Username: elastic

Password: changeme

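File encoding

Convert every file to UTF-8 before importing. First determine each file's current encoding. Two tools can do this, file and enca; run both, because either one can occasionally guess wrong.

Install enca: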

brew install enca

Run both tools on the same file:

file csdn.com.txt &&enca csdn.com.txt
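
Once the encoding is known, the conversion itself can be scripted. The following is a minimal sketch, not part of the original notes: the helper name to_utf8 and the candidate list ("utf-8", then "gb18030") are assumptions chosen for illustration, so substitute whatever file and enca actually report.

```python
def to_utf8(raw, candidates=("utf-8", "gb18030")):
    """Decode raw bytes with the first candidate encoding that succeeds.

    Returns (encoding_used, utf8_bytes). The candidate list is an
    assumption; replace it with the encodings reported by file/enca.
    """
    for encoding in candidates:
        try:
            text = raw.decode(encoding)
        except UnicodeDecodeError:
            continue  # this candidate cannot decode the input, try the next
        return encoding, text.encode("utf-8")
    raise ValueError("none of the candidate encodings could decode the input")


# Synthetic input: a short GB18030-encoded string.
encoding, converted = to_utf8("用户名".encode("gb18030"))
print(encoding, converted.decode("utf-8"))
```

Trying strict UTF-8 first matters: GB18030 accepts many byte sequences, so putting it first could silently mis-decode files that are already UTF-8.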


Configuring Logstash

input {
  file {
    # Path of the files to watch
    path => "/Users/sword/bigdata/form_src/*.txt"
    # To watch several targets, pass an array instead
    #path => ["E:/software/logstash-1.5.4/logstash-1.5.4/data/*","F:/test.txt"]
    # Where to start reading a newly discovered file; the default is "end"
    start_position => "beginning"
    # Note: this file also matches the *.txt glob above;
    # consider keeping it outside the watched directory
    sincedb_path => "/Users/sword/bigdata/form_src/sincedb_log.txt"
  }
}

filter {
  grok {
    match => {
      "message" => "'%{DATA:email}',\s'%{DATA:password}',\s'%{DATA:username}','%{DATA:from}'\r"
    }
  }

  mutate {
    remove_field => ["host", "path", "tags", "message"]
  }

  fingerprint {
    # Concatenate all source fields before hashing
    concatenate_sources => true
    source => ["username", "email", "password"]
    target => "[@metadata][generated_id]"
    # The filter raises an error if no key is set
    key => "znmfLov5KlNUeh2z"
    # Hash with MD5 so that identical records get the same ID
    # and overwrite each other instead of being duplicated
    method => "MD5"
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "base_data"
    #workers => 5
    document_id => "%{[@metadata][generated_id]}"
    # Identifies the data source
    document_type => "csdn.com"
    user => "elastic"
    password => "changeme"
  }
  #stdout {
  #  codec => rubydebug {
  #    metadata => true
  #  }
  #}
}
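
Before running the import, it can help to check the grok pattern and the ID scheme on a synthetic line. The sketch below is an approximation in Python, not Logstash itself: DATA is assumed to behave like a lazy ".*?", the "from" capture is renamed "source" to avoid the Python keyword, and record_id only illustrates the idea of a keyed MD5 over the three fields. It is not claimed to reproduce the exact IDs that the fingerprint filter generates. The sample line and the key are made up.

```python
import hashlib
import hmac
import re

# Approximate regex form of the grok pattern (DATA ~ lazy ".*?").
# Grok matches anywhere in the line, so search() is used below.
LINE_RE = re.compile(
    r"'(?P<email>.*?)',\s'(?P<password>.*?)',\s'(?P<username>.*?)','(?P<source>.*?)'\r"
)


def parse_line(line):
    """Return the captured fields as a dict, or None if the line does not match."""
    match = LINE_RE.search(line)
    return match.groupdict() if match else None


def record_id(fields, key=b"example-key"):
    """Deterministic ID from username, email and password (HMAC-MD5, hex)."""
    joined = "|".join(fields[name] for name in ("username", "email", "password"))
    return hmac.new(key, joined.encode("utf-8"), hashlib.md5).hexdigest()


# Synthetic CRLF-terminated line in the layout the pattern expects.
sample = "'alice@example.com', 'secret', 'alice','example.com'\r\n"
fields = parse_line(sample)
print(fields)
print(record_id(fields))
```

Because the ID depends only on the field values and the key, importing the same record twice yields the same document ID, which is what lets Elasticsearch overwrite rather than duplicate it.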

Run Logstash to import the data:

./bin/logstash -f ./bigdata.conf


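Field plan

  1. username (user name)
  2. email (email address)
  3. password (password)
  4. salt (salt value)
  5. nickname (nickname)
  6. qq (QQ number)
  7. mobile (mobile phone number)
  8. telno (landline number)
  9. idno (national ID number)
  10. realname (real name)
  11. address (home address)
  12. ip (IP address)
  13. date (time the record was created)
  14. other (other data)
  15. from (data source)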

References

Logstash Best Practices (Chinese) http://udn.yyuap.com/doc/logstash-best-practice-cn/
ELKstack Chinese Guide http://kibana.logstash.es/content/
Elasticsearch: The Definitive Guide (Chinese) http://es.xiaoleilu.com/