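What slop means
slop is the number of position moves that the terms of a query string (the search text) need before they line up with a document.
Term positions
When a string is analyzed, the analyzer does not just return a list of terms; it also returns the position, or order, of each term within the original string.
For example: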
POST /_analyze
{
  "analyzer": "standard",
  "text": "区块链比特币"
}
Result:
{
  "tokens" : [
    { "token" : "区", "start_offset" : 0, "end_offset" : 1, "type" : "<IDEOGRAPHIC>", "position" : 0 },
    { "token" : "块", "start_offset" : 1, "end_offset" : 2, "type" : "<IDEOGRAPHIC>", "position" : 1 },
    { "token" : "链", "start_offset" : 2, "end_offset" : 3, "type" : "<IDEOGRAPHIC>", "position" : 2 },
    { "token" : "比", "start_offset" : 3, "end_offset" : 4, "type" : "<IDEOGRAPHIC>", "position" : 3 },
    { "token" : "特", "start_offset" : 4, "end_offset" : 5, "type" : "<IDEOGRAPHIC>", "position" : 4 },
    { "token" : "币", "start_offset" : 5, "end_offset" : 6, "type" : "<IDEOGRAPHIC>", "position" : 5 }
  ]
}
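To make the position bookkeeping concrete, here is a minimal Python sketch of a toy tokenizer. It is not the Elasticsearch standard analyzer. It only assumes what the output above shows: each CJK ideograph becomes its own token, and positions increase by one per token. The function name `toy_analyze` is hypothetical.

```python
def toy_analyze(text):
    """Toy stand-in for the analyzer: one token per CJK ideograph.

    Assumption: characters in the basic CJK Unified Ideographs range
    (U+4E00..U+9FFF) become single-character tokens; everything else
    (punctuation, spaces) is dropped without consuming a position.
    """
    tokens = []
    position = 0
    for offset, ch in enumerate(text):
        if "\u4e00" <= ch <= "\u9fff":
            tokens.append({
                "token": ch,
                "start_offset": offset,
                "end_offset": offset + 1,
                "position": position,
            })
            position += 1
    return tokens


# Print one token per line for the sample text used above.
for tok in toy_analyze("区块链比特币"):
    print(tok)
```

Note that offsets count characters in the original string, while positions count tokens. The two diverge as soon as the text contains a character that produces no token, such as a comma.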
Example
Suppose we have a theme field that stores "区块链,新能源,比特币,军工,医疗保健,医药". Running that text through the standard analyzer gives the following:
POST /_analyze
{
  "analyzer": "standard",
  "text": "区块链,新能源,比特币,军工,医疗保健,医药"
}
Result:
{
  "tokens" : [
    { "token" : "区", "start_offset" : 0, "end_offset" : 1, "type" : "<IDEOGRAPHIC>", "position" : 0 },
    { "token" : "块", "start_offset" : 1, "end_offset" : 2, "type" : "<IDEOGRAPHIC>", "position" : 1 },
    { "token" : "链", "start_offset" : 2, "end_offset" : 3, "type" : "<IDEOGRAPHIC>", "position" : 2 },
    { "token" : "新", "start_offset" : 4, "end_offset" : 5, "type" : "<IDEOGRAPHIC>", "position" : 3 },
    { "token" : "能", "start_offset" : 5, "end_offset" : 6, "type" : "<IDEOGRAPHIC>", "position" : 4 },
    { "token" : "源", "start_offset" : 6, "end_offset" : 7, "type" : "<IDEOGRAPHIC>", "position" : 5 },
    { "token" : "比", "start_offset" : 8, "end_offset" : 9, "type" : "<IDEOGRAPHIC>", "position" : 6 },
    { "token" : "特", "start_offset" : 9, "end_offset" : 10, "type" : "<IDEOGRAPHIC>", "position" : 7 },
    { "token" : "币", "start_offset" : 10, "end_offset" : 11, "type" : "<IDEOGRAPHIC>", "position" : 8 },
    { "token" : "军", "start_offset" : 12, "end_offset" : 13, "type" : "<IDEOGRAPHIC>", "position" : 9 },
    { "token" : "工", "start_offset" : 13, "end_offset" : 14, "type" : "<IDEOGRAPHIC>", "position" : 10 },
    { "token" : "医", "start_offset" : 15, "end_offset" : 16, "type" : "<IDEOGRAPHIC>", "position" : 11 },
    { "token" : "疗", "start_offset" : 16, "end_offset" : 17, "type" : "<IDEOGRAPHIC>", "position" : 12 },
    { "token" : "保", "start_offset" : 17, "end_offset" : 18, "type" : "<IDEOGRAPHIC>", "position" : 13 },
    { "token" : "健", "start_offset" : 18, "end_offset" : 19, "type" : "<IDEOGRAPHIC>", "position" : 14 },
    { "token" : "医", "start_offset" : 20, "end_offset" : 21, "type" : "<IDEOGRAPHIC>", "position" : 15 },
    { "token" : "药", "start_offset" : 21, "end_offset" : 22, "type" : "<IDEOGRAPHIC>", "position" : 16 }
  ]
}
We query the field with the following command:
GET my_index/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match_phrase": {
            "theme": {
              "query": "区块链比特币",
              "slop": 0
            }
          }
        }
      ]
    }
  },
  "from": 0,
  "size": 20
}
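The query returns no results.
Cause analysis
Like the match query, the match_phrase query first analyzes the query string to produce a list of terms. It then searches for all of those terms, but keeps only the documents that contain every term, with the terms at adjacent positions and in the same order as in the query.
The standard analyzer splits "区块链比特币" into single characters, as the first result above shows.
It likewise splits "区块链,新能源,比特币,军工,医疗保健,医药" into single characters, as the second result above shows.
Every other pair of neighboring query terms is also adjacent in the document. The exception is "链" at position 2 and "比" at position 6: their positions differ by 4, whereas adjacent terms differ by exactly 1. Because slop is set to 0, the analyzed terms must be strictly adjacent with no moves allowed, so nothing is found. Following this analysis, we raise slop step by step and find that the document is first returned at slop 3. In other words, three moves are needed: moving from position 2 three times reaches position 5, which is right next to position 6, and the phrase matches.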
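The arithmetic above can be checked with a small Python sketch. It is a simplified model and not Lucene's actual implementation: it assumes one token per CJK ideograph (as in the `_analyze` output above), ignores repeated terms inside the query, and measures the "moves" as the spread between how far each term sits from where the query expects it. The names `positions_by_char` and `min_slop` are hypothetical.

```python
from itertools import product


def positions_by_char(text):
    """Map each CJK character to the token positions where it occurs.

    Assumption: one token per CJK ideograph; punctuation consumes no position.
    """
    index = {}
    position = 0
    for ch in text:
        if "\u4e00" <= ch <= "\u9fff":
            index.setdefault(ch, []).append(position)
            position += 1
    return index


def min_slop(query, document):
    """Smallest slop at which this simplified model reports a match.

    Returns None if the query has no terms or a term is missing from the document.
    """
    doc_index = positions_by_char(document)
    query_terms = [ch for ch in query if "\u4e00" <= ch <= "\u9fff"]
    if not query_terms:
        return None
    candidates = []
    for term in query_terms:
        if term not in doc_index:
            return None
        candidates.append(doc_index[term])
    best = None
    # Try every way of picking one document position per query term.
    for combo in product(*candidates):
        # Shift of each term relative to its position in the query.
        shifts = [doc_pos - query_pos for query_pos, doc_pos in enumerate(combo)]
        distance = max(shifts) - min(shifts)
        if best is None or distance < best:
            best = distance
    return best


theme = "区块链,新能源,比特币,军工,医疗保健,医药"
print(min_slop("区块链比特币", theme))
```

In this model "区", "块" and "链" have a shift of 0 and "比", "特" and "币" have a shift of 3, so the spread is 3, which agrees with the slop value found by experiment. The brute-force search over all position combinations is only suitable for short texts like this one.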
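Summary
1. Position information can be stored in the inverted index. Position-aware queries such as match_phrase use it to match only documents that contain the terms in the correct order, with no other terms in between. The slop parameter adds some flexibility to phrase matching: it tells the match_phrase query how far apart terms may be while the document is still treated as a match. "How far apart" means how many times you need to move a term to make the query and the document line up.
2. slop is not simply the number of moves that makes the terms of a query string match a document. It is the maximum number of moves the terms of a query string may make while trying to match a document.
3. In a slop search, the closer together the terms are in a document, the higher the relevance score.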
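Points 2 and 3 can be illustrated with a short Python sketch. Both functions are hypothetical helpers, not Elasticsearch APIs. The weight `1 / (1 + distance)` is an assumption based on the form Lucene has used for sloppy phrase frequency; the final relevance score depends on other factors as well, so treat the numbers as an illustration only.

```python
def phrase_matches(distance, slop):
    """slop is an upper bound: a match needing at most `slop` moves is accepted."""
    return distance <= slop


def sloppy_weight(distance):
    """Illustrative proximity weight: fewer moves means a larger contribution.

    Assumption: 1 / (1 + distance); real scoring combines this with other factors.
    """
    return 1.0 / (1.0 + distance)


# The example in this article needs 3 moves.
for slop in range(0, 6):
    print("slop", slop, "matches:", phrase_matches(3, slop))

# Closer terms (smaller distance) get a larger weight.
for distance in (0, 1, 3):
    print("distance", distance, "weight:", round(sloppy_weight(distance), 3))
```

Setting slop higher than needed does not change whether the example document matches. It only admits additional, looser matches, which this weighting ranks lower.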
- Permalink: https://www.phpmianshi.com/?id=248
- When reposting, please credit: posted by admin on PHP面试网