Trying Out the SIREn Plugin

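SIREn is a Lucene-based solution optimized specifically for nested-object data. Its website is http://siren.solutions. SIREn does not ship as a complete standalone product; it exists as a plugin for Solr or Elasticsearch. The SIREn homepage describes it as "truly schemaless" with "high performance nested query". As I have written in an earlier blog post, Elasticsearch's schemaless behavior has limits: within a single index, the mapping of a field must be unique and consistent. Otherwise either the write fails or the search misbehaves.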
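
To make that limitation concrete, here is a toy Python model. It is not Elasticsearch code: the `ToyIndex` class, the `MappingConflict` exception and the strict "reject on mismatch" rule are all illustrative assumptions, and it only handles scalar values. Real Elasticsearch may instead coerce some values and misbehave only at search time.

```python
class MappingConflict(Exception):
    """Raised when a value does not fit the type already recorded for a field."""


class ToyIndex:
    """Toy model of one index: the first value seen for a field fixes its type."""

    def __init__(self):
        self.mapping = {}  # field name -> type name
        self.docs = {}     # document id -> document

    @staticmethod
    def type_name(value):
        # bool must be tested before int: in Python, bool is a subclass of int.
        if isinstance(value, bool):
            return "boolean"
        if isinstance(value, int):
            return "long"
        if isinstance(value, float):
            return "double"
        if isinstance(value, str):
            return "string"
        raise TypeError(f"unsupported value: {value!r}")

    def index(self, doc_id, doc):
        # Validate every field first so a rejected document changes nothing.
        new_fields = {}
        for field, value in doc.items():
            seen = self.type_name(value)
            expected = self.mapping.get(field, seen)
            if expected != seen:
                raise MappingConflict(f"{field}: mapped as {expected}, got {seen}")
            new_fields[field] = seen
        self.mapping.update(new_fields)
        self.docs[doc_id] = doc


idx = ToyIndex()
idx.index(1, {"Accessible24Hours": False})    # fixes the field as boolean
try:
    idx.index(3, {"Accessible24Hours": 123})  # a long no longer fits
except MappingConflict as err:
    print(err)
```

In this model the type of a field is a property of the whole index, not of a single document, which is the restriction the rest of this post tries to get around.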

So let's give SIREn a try. First, download and run it:

# wget http://siren.solutions/download/siren-elasticsearch-1.4-bin.zip
# unzip siren-elasticsearch-1.4-bin.zip
# cd siren-elasticsearch-1.4-bin
# ./example/bin/elasticsearch

Then we try to write a few documents whose mappings conflict:

# curl -XDELETE "http://localhost:9200/nepr"
# curl -XPOST "http://localhost:9200/nepr"
# curl -XPUT "http://localhost:9200/nepr/chargepoint/_mapping" -d '
{
    "chargepoint" : {
        "properties" : {
            "_siren_source" : {
                "analyzer" : "concise",
                "postings_format" : "Siren10AFor",
                "store" : "no",
                "type" : "string"
            }
        },
        "_siren" : {}
    }
}'
# curl -XPUT "http://localhost:9200/nepr/chargepoint/1" -d '
{
    "ChargeDeviceName": "1c Design Limited, Glasgow (1)",
    "Accessible24Hours": false
}'
# curl -XPUT "http://localhost:9200/nepr/chargepoint/2" -d '
{
    "ChargeDeviceName": "2c Design Limited, Glasgow (2)",
    "Accessible24Hours": "true"
}'
# curl -XPUT "http://localhost:9200/nepr/chargepoint/3" -d '
{
    "ChargeDeviceName": "3c Design Limited, Glasgow (3)",
    "Accessible24Hours": 123
}'
# curl -XPUT "http://localhost:9200/nepr/chargepoint/4" -d '
{
    "ChargeDeviceName": "4c Design Limited, Glasgow (4)",
    "Accessible24Hours": [123, 234, 345, 456]
}'

OK, all four documents were written successfully.

Next, we use plain Elasticsearch query syntax to try to fetch the documents whose value is greater than 100:

# curl -XPOST "http://localhost:9200/nepr/_search?q=Accessible24Hours:>100"
{"took":16,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":0,"max_score":null,"hits":[]}}

As you can see, the search returns nothing.

Now fetch them with SIREn's tree query syntax:

# curl -XPOST "http://localhost:9200/nepr/_search" -d '
{
  "query": {
    "tree" : {
      "node" : {
        "attribute" : "Accessible24Hours",
        "query" : "xsd:long([100 TO *])"
      }
    }
  }
}'
{"took":29,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":2,"max_score":4.0,"hits":[{"_index":"nepr","_type":"chargepoint","_id":"4","_score":4.0,"_source":
{
    "ChargeDeviceName": "4c Design Limited, Glasgow (4)",
    "Accessible24Hours": [123, 234, 345, 456]
}},{"_index":"nepr","_type":"chargepoint","_id":"3","_score":1.0,"_source":
{
    "ChargeDeviceName": "3c Design Limited, Glasgow (3)",
    "Accessible24Hours": 123
}}]}}

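Yes, this time we get the data: the two documents holding numeric values in that range (_id 3 and _id 4) are both returned.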
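
The difference between the two queries can be sketched in Python as a per-value type check. This is a simplified model, not SIREn's implementation: the helper names `as_longs` and `match_long_range` are hypothetical, and the lower bound is assumed to be inclusive, as square brackets are in Lucene range syntax.

```python
def as_longs(value):
    """Return the integer values held by a JSON value; other types yield nothing."""
    items = value if isinstance(value, list) else [value]
    # bool is excluded explicitly: in Python, True and False are also ints.
    return [v for v in items if isinstance(v, int) and not isinstance(v, bool)]


def match_long_range(docs, attribute, lower):
    """IDs of documents where any integer under `attribute` is >= lower."""
    return [
        doc_id
        for doc_id, doc in docs.items()
        if any(v >= lower for v in as_longs(doc.get(attribute)))
    ]


# The four documents written earlier, reduced to the field of interest.
docs = {
    1: {"Accessible24Hours": False},
    2: {"Accessible24Hours": "true"},
    3: {"Accessible24Hours": 123},
    4: {"Accessible24Hours": [123, 234, 345, 456]},
}

print(match_long_range(docs, "Accessible24Hours", 100))  # [3, 4]
```

Each value is judged by its own type, so the boolean and string values in documents 1 and 2 are simply skipped rather than causing a conflict.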

Let's try something slightly more complex:

# curl -XPOST "http://localhost:9200/nepr/_search" -d '
{
  "query": {
    "tree" : {
      "node" : {
        "attribute" : "Accessible24Hours",
        "range" : [2,3],
        "query" : "xsd:long([10 TO *])"
      }
    }
  },
  "aggs": {
    "1": {
      "terms": {
        "field": "ChargeDeviceName"
      }
    }
  }
}'

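This adds a range option. SIREn treats every array as nested by default, so its elements are ordered. The option means: run the search only against the nodes at positions 2 through 3 of the array. The search result now becomes: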

{"took":9,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":1,"max_score":2.0,"hits":[{"_index":"nepr","_type":"chargepoint","_id":"4","_score":2.0,"_source":
{
    "ChargeDeviceName": "4c Design Limited, Glasgow (4)",
    "Accessible24Hours": [123,234,345,456]
}}]},"aggregations":{"1":{"buckets":[{"key":"4","doc_count":1},{"key":"4c","doc_count":1},{"key":"design","doc_count":1},{"key":"glasgow","doc_count":1},{"key":"limited","doc_count":1}]}}}

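As you can see, the Accessible24Hours field of the document with _id 3 holds only a single value, so there is nothing at positions 2 to 3 to compare against, and the document is no longer matched.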
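
The effect of the range option can be modelled in a few lines of Python. This is a sketch under stated assumptions, not SIREn's code: the function name `match_with_range` is hypothetical, positions are taken to be 1-based and inclusive as in the description above (SIREn's actual node numbering may differ), and a scalar is treated as a one-element array.

```python
def match_with_range(docs, attribute, lower, first, last):
    """IDs of documents with an integer >= lower among positions first..last.

    Positions are 1-based and inclusive here, which is an assumption.
    """
    matched = []
    for doc_id, doc in docs.items():
        value = doc.get(attribute)
        items = value if isinstance(value, list) else [value]
        window = items[first - 1:last]  # empty when the array is too short
        if any(
            isinstance(v, int) and not isinstance(v, bool) and v >= lower
            for v in window
        ):
            matched.append(doc_id)
    return matched


docs = {
    3: {"Accessible24Hours": 123},
    4: {"Accessible24Hours": [123, 234, 345, 456]},
}

print(match_with_range(docs, "Accessible24Hours", 10, 2, 3))  # [4]
```

Document 3 drops out under either a 0-based or a 1-based reading, because its single value never falls inside the selected window.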

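The awkward part of SIREn at the moment is that it only reimplements the query side on top of ES. Aggregations work the old way and still require consistent types, which is also why the conflicting records were removed from SIREn's sample data files.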

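With Solr, the SIREn plugin defines just two fields: a UUID and a JSON field. All data handled by SIREn is stored in the JSON field (similar to the _siren_source field in the ES plugin), which makes the whole JSON document schemaless. The SIREn Solr plugin also implements nested facet support, which is worth trying as well.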
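
The two-field layout can be pictured with a small Python sketch. The field names "id" and "json" and the function `to_two_field_record` are hypothetical; the names the SIREn Solr plugin actually uses may differ.

```python
import json
import uuid


def to_two_field_record(doc):
    """Wrap an arbitrary JSON-like document into a fixed two-field record."""
    return {
        "id": str(uuid.uuid4()),  # hypothetical name for the UUID field
        "json": json.dumps(doc),  # hypothetical name for the JSON field
    }


record = to_two_field_record({"Accessible24Hours": [123, 234, 345, 456]})
print(sorted(record))  # ['id', 'json']
```

Because the outer schema never changes, documents with conflicting inner types can sit side by side; all type handling is left to whatever indexes the JSON field.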

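In short, the SIREn extension handles, in its own tree-structured way, one field that looks like an extra field to ES or Solr, and leaves the processing of the original fields untouched. For ES this has several consequences:

Other fields still have their data types detected and mappings generated, so writes can still run into problems.
Aggregations still go through the ES implementation. Documents filtered by number are then checked as boolean (the type recorded in the mapping) at aggregation time, so the aggregation request fails outright instead of being computed.
The data is indexed a second time in tree form, which more than doubles the storage expansion. In a real test, about 30MB of data grew to 50MB in a default ES setup, but to 120MB with the SIREn plugin enabled!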
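
The storage figures above work out as follows. The snippet only does arithmetic on the three sizes quoted in the text; the helper name `expansion_ratio` is made up for illustration.

```python
def expansion_ratio(raw_mb, indexed_mb):
    """Size on disk after indexing, relative to the raw input size."""
    return indexed_mb / raw_mb


raw, es_default, with_siren = 30, 50, 120  # sizes in MB, taken from the text

print(round(expansion_ratio(raw, es_default), 2))  # 1.67
print(round(expansion_ratio(raw, with_siren), 2))  # 4.0
print(round(with_siren / es_default, 2))           # 2.4
```

So the same data takes about 2.4 times as much space with the plugin enabled as it does in a default ES setup.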