之前我曾经讲过一个简单的在 ELK 中山寨 Splunk 的『显示来源』功能的办法。这次我们玩个更有难度的、当然依然只是山寨式功能的新东西:『事件模式』功能。
Splunk 6.2 推出的这个功能,会基于当前搜索语句的结果集做模式探测,根据精度调整,做成不同数量的聚类。然后给每个聚类分组内,提取出一个关键词(个别情况下也有零个或多个的)。也就是通过机器学习的手段,探测你的日志可能有什么模式,其最具识别性的关键内容是什么。
这个页面如果用 SPL 表示,就是:index=_internal | cluster t=0.8 lableonly=true | findkeywords labelfield=cluster_label | sort - percentInputGroup
我们目前当然在 ES 里是没法做聚类分析什么的了。不过在日志场景下,也不是没有近似的办法。
其实如何山寨模式分组,Splunk 也有类似 SPL 命令做出了示范。这个命令叫 typelearner
。
这个命令的大致意思是:把日志里的英文单词、数字、空格等字符都隐藏掉,剩下各种标点符号,就代表一种日志类型。简单的处理方式就是:
cat samplelog.cisco.asa |sed 's/[0-9a-zA-Z]*//g' | sed 's/[[:space:]]/_/g'
然后将这个纯标点符号的字符串,存为事件的一个字段,我们沿袭 Splunk 的叫法: punct 。
这样,我们只要简单的对 punct 字段做 terms aggregation
就可以获取模式分组了。
然后查找关键词。什么叫关键词呢?就是要能让本分组跟其他分组有显著差异的一个词。这个显然不能再用 terms aggregation 了。否则出来的是最多的词,而不是最有差异性的词。ES 对这个也提供了现成的聚合方式:significant_terms aggregation
。
然后这里有另一个问题:一般我们都是在 not_analyzed
字段上做聚合统计的。现在显然并没有具体哪个字段来提供单个字段值做聚合!我们需要用的就是分词的日志原文内容。
所以这块我们需要对原文字段的 mapping 做出特殊定义:
"message": {
"type": "text",
"fielddata": true,
"index_options": "docs",
"norms": false
},
即重新放开 fielddata —— ES 5.0 里,text 类型字段已经默认关闭 fielddata 了。
至于内存的问题,或者交给 Circuit Breaker 来控制;或者自己通过请求中的 terminate_after
参数预先控制。
就模式发现这个功能来说,通过 terminate_after
参数预定义控制应该是个不错的思路。因为本来就是一个不确定的猜测,加太大的数据量来做这事儿,没多少性价比。
所以我们最终发出的请求是这样:
#!/bin/bash
curl -XPOST 'http://localhost:9200/logstash-2016.07.18/logs/_search?pretty&terminate_after=30000&size=0' -d '
{
"aggs": {
"group": {
"terms": {
"field": "punct"
},
"aggs": {
"keyword": {
"significant_terms": {
"size": 1,
"field": "message"
},
"aggs": {
"hit": {
"top_hits": {
"_source": {
"include": [ "message" ]
},
"size":1
}
}
}
}
}
}
}
}
'
我们可以看到请求结果如下:
{
"took" : 2179,
"timed_out" : false,
"terminated_early" : false,
"_shards" : {
"total" : 5,
"successful" : 2,
"failed" : 3,
"failures" : [ {
"shard" : 0,
"index" : "logstash-2016.07.18",
"node" : "L0qQ1ZcyQGmj7Ge7ZlCmYg",
"reason" : {
"type" : "circuit_breaking_exception",
"reason" : "[request] Data too large, data for [<reused_arrays>] would be larger than limit of [415550668/396.2mb]",
"bytes_wanted" : 415762160,
"bytes_limit" : 415550668
}
} ]
},
"hits" : {
"total" : 371095,
"max_score" : 0.0,
"hits" : [ ]
},
"aggregations" : {
"group" : {
"doc_count_error_upper_bound" : 72,
"sum_other_doc_count" : 93355,
"buckets" : [ {
"key" : "--_::._+____-_=,_=,_=,_=.,_=,_=",
"doc_count" : 98100,
"keyword" : {
"doc_count" : 98100,
"buckets" : [ {
"key" : "cpu_seconds",
"doc_count" : 98100,
"score" : 2.2037623779471813,
"bg_count" : 115831,
"hit" : {
"hits" : {
"total" : 98100,
"max_score" : 1.0,
"hits" : [ {
"_index" : "logstash-2016.07.18",
"_type" : "logs",
"_id" : "AVX-RMJbLjo3PexoUujh",
"_score" : 1.0,
"_source" : {
"message" : "07-15-2016 14:17:33.776 +0800 INFO Metrics - group=pipeline, name=indexerpipe, processor=index_thruput, cpu_seconds=0.000000, executes=111, cumulative_hits=161675"
}
} ]
}
}
} ]
}
}, {
"key" : "--_::._+____-_=,_=,_=,_=,_=,_=,_=",
"doc_count" : 87058,
"keyword" : {
"doc_count" : 87058,
"buckets" : [ {
"key" : "largest_size",
"doc_count" : 75663,
"score" : 2.835574761742766,
"bg_count" : 75663,
"hit" : {
"hits" : {
"total" : 75663,
"max_score" : 1.0,
"hits" : [ {
"_index" : "logstash-2016.07.18",
"_type" : "logs",
"_id" : "AVX-RMJbLjo3PexoUuj9",
"_score" : 1.0,
"_source" : {
"message" : "07-15-2016 14:17:02.780 +0800 INFO Metrics - group=queue, name=nullqueue, max_size_kb=500, current_size_kb=0, current_size=0, largest_size=1, smallest_size=0"
}
} ]
}
}
} ]
}
}, {
"key" : "--_::._+____-_=,_=\"\",_=.,_=.,_=.,_=,_=.,_=",
"doc_count" : 26317,
"keyword" : {
"doc_count" : 26317,
"buckets" : [ {
"key" : "max_age",
"doc_count" : 26317,
"score" : 7.224805514306611,
"bg_count" : 45119,
"hit" : {
"hits" : {
"total" : 26317,
"max_score" : 1.0,
"hits" : [ {
"_index" : "logstash-2016.07.18",
"_type" : "logs",
"_id" : "AVX-RMJbLjo3PexoUukH",
"_score" : 1.0,
"_source" : {
"message" : "07-15-2016 14:17:02.780 +0800 INFO Metrics - group=per_sourcetype_thruput, series=\"scheduler\", kbps=0.014869, eps=0.032258, kb=0.460938, ev=1, avg_age=0.000000, max_age=0"
}
} ]
}
}
} ]
}
}, {
"key" : "--_::._+____-_=,_=\"//////.\",_=.,_=.,_=.,_=,_=.,_=",
"doc_count" : 13063,
"keyword" : {
"doc_count" : 13063,
"buckets" : [ {
"key" : "log",
"doc_count" : 13063,
"score" : 27.241628614916287,
"bg_count" : 13140,
"hit" : {
"hits" : {
"total" : 13063,
"max_score" : 1.0,
"hits" : [ {
"_index" : "logstash-2016.07.18",
"_type" : "logs",
"_id" : "AVX-RMKILjo3PexoUulQ",
"_score" : 1.0,
"_source" : {
"message" : "07-15-2016 14:16:31.780 +0800 INFO Metrics - group=per_source_thruput, series=\"/applications/splunk/var/log/splunk/metrics.log\", kbps=0.326188, eps=2.032164, kb=10.112305, ev=63, avg_age=0.968254, max_age=1"
}
} ]
}
}
} ]
}
}, {
"key" : "--_::._+____-_=,_=,_=.,_=.,_=.,_=.,_=.,_=.",
"doc_count" : 11603,
"keyword" : {
"doc_count" : 11603,
"buckets" : [ {
"key" : "average_kbps",
"doc_count" : 11603,
"score" : 20.38013481592441,
"bg_count" : 17357,
"hit" : {
"hits" : {
"total" : 11603,
"max_score" : 1.0,
"hits" : [ {
"_index" : "logstash-2016.07.18",
"_type" : "logs",
"_id" : "AVX-RMKILjo3PexoUulA",
"_score" : 1.0,
"_source" : {
"message" : "07-15-2016 14:16:31.781 +0800 INFO Metrics - group=thruput, name=index_thruput, instantaneous_kbps=0.875684, instantaneous_eps=2.032165, average_kbps=0.340430, total_k_processed=33138.000000, kb=27.147461, ev=63.000000"
}
} ]
}
}
} ]
}
}, {
"key" : "--_::._+____-_=,_=,_=,_=,_=",
"doc_count" : 11417,
"keyword" : {
"doc_count" : 11417,
"buckets" : [ {
"key" : "qwork_units",
"doc_count" : 11417,
"score" : 31.50372251905054,
"bg_count" : 11417,
"hit" : {
"hits" : {
"total" : 11417,
"max_score" : 1.0,
"hits" : [ {
"_index" : "logstash-2016.07.18",
"_type" : "logs",
"_id" : "AVX-RMLOLjo3PexoUunn",
"_score" : 1.0,
"_source" : {
"message" : "07-15-2016 14:15:29.777 +0800 INFO Metrics - group=tpool, name=indexertpool, qsize=0, workers=2, qwork_units=0"
}
} ]
}
}
} ]
}
}, {
"key" : "--_::._+____-_=,_=,_=---,_=.,_=,_=",
"doc_count" : 11350,
"keyword" : {
"doc_count" : 11350,
"buckets" : [ {
"key" : "generic",
"doc_count" : 11350,
"score" : 31.69559471365639,
"bg_count" : 11350,
"hit" : {
"hits" : {
"total" : 11350,
"max_score" : 1.0,
"hits" : [ {
"_index" : "logstash-2016.07.18",
"_type" : "logs",
"_id" : "AVX-RMJbLjo3PexoUukk",
"_score" : 1.0,
"_source" : {
"message" : "07-15-2016 14:17:02.779 +0800 INFO Metrics - group=pipeline, name=indexerpipe, processor=syslog-output-generic-processor, cpu_seconds=0.000000, executes=104, cumulative_hits=161564"
}
} ]
}
}
} ]
}
}, {
"key" : "--_::._+____-_=,_=,_=,_=.",
"doc_count" : 7135,
"keyword" : {
"doc_count" : 7135,
"buckets" : [ {
"key" : "search_health_metrics",
"doc_count" : 7135,
"score" : 51.010511562718996,
"bg_count" : 7135,
"hit" : {
"hits" : {
"total" : 7135,
"max_score" : 1.0,
"hits" : [ {
"_index" : "logstash-2016.07.18",
"_type" : "logs",
"_id" : "AVX-RMJbLjo3PexoUujq",
"_score" : 1.0,
"_source" : {
"message" : "07-15-2016 14:17:33.776 +0800 INFO Metrics - group=search_health_metrics, name=bundle_directory_reaper, bundle_dir_reaper_max_ms=1, bundle_dir_reaper_mean_ms=1.000000"
}
} ]
}
}
} ]
}
}, {
"key" : "--_::._+____-_=,_=,_=,_=.,_=,_=,_=,_=",
"doc_count" : 5849,
"keyword" : {
"doc_count" : 5849,
"buckets" : [ {
"key" : "search_queue_metrics",
"doc_count" : 5849,
"score" : 62.445888186014706,
"bg_count" : 5849,
"hit" : {
"hits" : {
"total" : 5849,
"max_score" : 1.0,
"hits" : [ {
"_index" : "logstash-2016.07.18",
"_type" : "logs",
"_id" : "AVX-RMKILjo3PexoUulx",
"_score" : 1.0,
"_source" : {
"message" : "07-15-2016 14:16:31.777 +0800 INFO Metrics - group=search_concurrency, name=search_queue_metrics, enqueue_seaches_count=0, avg_time_spent_in_queue=0.000000, max_time_spent_in_queue=0, current_queue_size=0, largest_queue_size=0, min_queue_size=0"
}
} ]
}
}
} ]
}
}, {
"key" : "--_::._+____-_=,_=,_=,_=,_=,_=,_=,_=,_=,_=,_=,_=,_",
"doc_count" : 5848,
"keyword" : {
"doc_count" : 5848,
"buckets" : [ {
"key" : "max_ready",
"doc_count" : 5848,
"score" : 62.45673734610123,
"bg_count" : 5848,
"hit" : {
"hits" : {
"total" : 5848,
"max_score" : 1.0,
"hits" : [ {
"_index" : "logstash-2016.07.18",
"_type" : "logs",
"_id" : "AVX-RMJbLjo3PexoUuk5",
"_score" : 1.0,
"_source" : {
"message" : "07-15-2016 14:17:02.776 +0800 INFO Metrics - group=searchscheduler, dispatched=1, skipped=0, total_lag=1, max_ready=0, max_pending=0, max_lag=1, window_max_lag=0, window_total_lag=0, max_running=0, actions_triggered=0, completed=1, total_runtime=0.189, max_runtime=0.189"
}
} ]
}
}
} ]
}
} ]
}
}
}
响应体中可以看到因为 terminate_after
设得还是过大,所以还没到中止条数就被 kill 了。实际只扫描了 370173 条数据。那么我们下次就可以把 terminate_after
调成 10000 得了。
然后就是 significant_terms
返回的关键词们。跟之前 splunk 的截图相比,我们可以发现,不是完全一样的效果,但是还是有部分关键词是一致的。比如 smallest_size
, total_k_processed
, search_health_metrics
, var
, workers
等。
可以说,作为一个山寨品,这个做法是行得通的~