Elasticsearch 6.0 Synonym Configuration Guide

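Preparation

Recently I had a small requirement:

Given two passages of text, find the synonyms of the words that overlap heavily between them.

Because the requirement is fairly simple (which is to say, not all that important), I wanted to save some effort, and I have been tinkering with Elasticsearch lately anyway. So I searched Google for related articles, and this post borrows heavily from the experience of those who came before.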

Elasticsearch

Just follow the official Elasticsearch documentation to install it; it is fairly simple.

Tokenization

Since everything here is built on Elasticsearch, we can use the corresponding Chinese tokenizer plugin directly: elasticsearch-analysis-ik

Synonym dictionary

Again, since this is built on Elasticsearch, the first thing to read is the relevant ES documentation: Synonym Token Filter

Two synonym formats are supported: Solr, WordNet.

In other words, ES currently supports two synonym dictionary formats. Here we use only the Solr format.

肯定,确认,认可,认定
承认,确认,认可,肯定,认同,认账
眼色,眼神
眼光,目光,眼神,眼波,秋波
视力,目力,眼力,眼神

Note that the commas here must be English (half-width, ASCII) commas, not full-width Chinese ones.
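To make the format concrete, here is a minimal Python sketch (not part of the original setup) that parses Solr-style lines, rejects full-width commas, and builds a word-to-synonyms lookup. The names `parse_groups` and `build_lookup` are hypothetical, and the merging logic only approximates the behaviour visible in the `_analyze` output later in this post; it is not Elasticsearch's actual implementation.

```python
from collections import defaultdict

# The five sample lines from this post, in Solr format.
SAMPLE = """\
肯定,确认,认可,认定
承认,确认,认可,肯定,认同,认账
眼色,眼神
眼光,目光,眼神,眼波,秋波
视力,目力,眼力,眼神
"""


def parse_groups(text):
    """Split Solr-style lines into word lists, rejecting full-width commas."""
    groups = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and '#' comments are skipped
        if "\uff0c" in line:
            raise ValueError(f"line {lineno}: full-width comma found, use ','")
        groups.append([w.strip() for w in line.split(",") if w.strip()])
    return groups


def build_lookup(groups):
    """Map each word to the other words of every group it appears in."""
    lookup = defaultdict(list)
    for group in groups:
        for word in group:
            for other in group:
                if other != word and other not in lookup[word]:
                    lookup[word].append(other)
    return dict(lookup)


lookup = build_lookup(parse_groups(SAMPLE))
print(lookup["眼神"])
```

A word that appears on several lines, such as 眼神, picks up the synonyms from all of those lines, which is why a single token can expand into many.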

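For the dictionary itself, I used the HIT (Harbin Institute of Technology) Tongyici Cilin synonym thesaurus (哈工大同义词词林): download link

If you use it too, you will need to convert it to the Solr synonym format first.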
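The conversion can be sketched roughly as follows. This assumes the commonly distributed extended edition of Cilin, where each line is a category code ending in `=`, `#` or `@` followed by space-separated words, and where only `=` lines are true synonym groups. Check your copy of the file before relying on this; the function name `cilin_to_solr` and the category codes in the sample are made up for illustration.

```python
def cilin_to_solr(lines):
    """Convert Cilin-style lines to Solr-style comma-separated synonym lines.

    Assumed input layout (verify against your file): "<code> word1 word2 ...",
    where a code ending in '=' marks a group of true synonyms.
    """
    out = []
    for line in lines:
        parts = line.split()
        if len(parts) < 3:
            continue  # need a code plus at least two words
        code, words = parts[0], parts[1:]
        if not code.endswith("="):
            continue  # skip related-word ('#') and single-entry ('@') lines
        unique = list(dict.fromkeys(words))  # drop duplicates, keep order
        if len(unique) >= 2:
            out.append(",".join(unique))  # ASCII comma, as ES expects
    return out


# Hypothetical sample lines; the codes are placeholders, not real Cilin codes.
sample = [
    "Xx01A01= 眼色 眼神",
    "Xx01A02# 眼光 视力",
    "Xx01A03@ 秋波",
]
print("\n".join(cilin_to_solr(sample)))
```

Whether to keep the `#` (related but not equivalent) groups is a judgment call; dropping them keeps the expansion tighter.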

Of course, the dictionary ultimately needs to be tuned for your own use case.

That about wraps up the preparation. Next comes the configuration.

Synonym configuration

First, create an analysis folder under your ES config directory, then move synonyms.txt into it.
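If you prefer to script this step, a small Python sketch might look like the following. It copies rather than moves the file, and the config directory is something you pass in yourself; `install_synonyms` is a hypothetical helper name. The demo runs against a throwaway temporary directory, not a real Elasticsearch install.

```python
import shutil
import tempfile
from pathlib import Path


def install_synonyms(src, es_config_dir):
    """Copy a synonyms file to <es_config_dir>/analysis/synonyms.txt."""
    target_dir = Path(es_config_dir) / "analysis"
    target_dir.mkdir(parents=True, exist_ok=True)  # create analysis/ if missing
    target = target_dir / "synonyms.txt"
    shutil.copyfile(src, target)
    return target


# Demo in a temporary directory so nothing real is touched.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "synonyms.txt"
    src.write_text("眼色,眼神\n", encoding="utf-8")
    installed = install_synonyms(src, Path(tmp) / "config")
    print(installed.relative_to(tmp))
```

The `synonyms_path` value used in the index settings is resolved relative to the config directory, which is why the file has to live there.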

Create the index

curl -X PUT \
  http://127.0.0.1:9200/syno \
  -H 'content-type: application/json' \
  -d '{
    "index": {
      "analysis": {
        "analyzer": {
          "ik_syno": {
            "type": "custom",
            "tokenizer": "ik_max_word",
            "filter": ["my_stop", "my_synonym"],
            "char_filter": ["my_char_filter"]
          },
          "ik_syno_smart": {
            "type": "custom",
            "tokenizer": "ik_smart",
            "filter": ["my_stop", "my_synonym"],
            "char_filter": ["my_char_filter"]
          }
        },
        "filter": {
          "my_stop": {
            "type": "stop",
            "stopwords": [" "]
          },
          "my_synonym": {
            "type": "synonym",
            "synonyms_path": "analysis/synonyms.txt"
          }
        },
        "char_filter": {
          "my_char_filter": {
            "type": "mapping",
            "mappings": ["| => |"]
          }
        }
      }
    }
  }'

Create the mapping

curl -X PUT \
  http://127.0.0.1:9200/syno/_mapping/default \
  -H 'content-type: application/json' \
  -d '{
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "ik_syno",
        "search_analyzer": "ik_syno_smart"
      }
    }
  }'

Get the synonyms

curl -X POST \
  'http://127.0.0.1:9200/syno/_analyze?pretty=true' \
  -H 'content-type: application/json' \
  -d '{"text":"确认过眼神","analyzer":"ik_syno_smart"}'

Response:

{
"tokens": [
{
"token": "确认",
"start_offset": 0,
"end_offset": 2,
"type": "CN_WORD",
"position": 0
},
{
"token": "肯定",
"start_offset": 0,
"end_offset": 2,
"type": "SYNONYM",
"position": 0
},
{
"token": "认可",
"start_offset": 0,
"end_offset": 2,
"type": "SYNONYM",
"position": 0
},
{
"token": "认定",
"start_offset": 0,
"end_offset": 2,
"type": "SYNONYM",
"position": 0
},
{
"token": "承认",
"start_offset": 0,
"end_offset": 2,
"type": "SYNONYM",
"position": 0
},
{
"token": "认同",
"start_offset": 0,
"end_offset": 2,
"type": "SYNONYM",
"position": 0
},
{
"token": "认账",
"start_offset": 0,
"end_offset": 2,
"type": "SYNONYM",
"position": 0
},
{
"token": "过",
"start_offset": 2,
"end_offset": 3,
"type": "CN_CHAR",
"position": 1
},
{
"token": "眼神",
"start_offset": 3,
"end_offset": 5,
"type": "CN_WORD",
"position": 2
},
{
"token": "眼色",
"start_offset": 3,
"end_offset": 5,
"type": "SYNONYM",
"position": 2
},
{
"token": "眼光",
"start_offset": 3,
"end_offset": 5,
"type": "SYNONYM",
"position": 2
},
{
"token": "目光",
"start_offset": 3,
"end_offset": 5,
"type": "SYNONYM",
"position": 2
},
{
"token": "眼波",
"start_offset": 3,
"end_offset": 5,
"type": "SYNONYM",
"position": 2
},
{
"token": "秋波",
"start_offset": 3,
"end_offset": 5,
"type": "SYNONYM",
"position": 2
},
{
"token": "视力",
"start_offset": 3,
"end_offset": 5,
"type": "SYNONYM",
"position": 2
},
{
"token": "目力",
"start_offset": 3,
"end_offset": 5,
"type": "SYNONYM",
"position": 2
},
{
"token": "眼力",
"start_offset": 3,
"end_offset": 5,
"type": "SYNONYM",
"position": 2
}
]
}
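To turn a response like this into something easier to consume, one option is to group tokens by `position`. The sketch below assumes, as in the ik_smart output above, that each position holds exactly one original token plus zero or more tokens of type `SYNONYM`; `synonyms_by_word` is a hypothetical helper, and the sample is a hand-abridged copy of the response above.

```python
def synonyms_by_word(response):
    """Map each original token to the SYNONYM tokens that share its position."""
    originals = {}  # position -> original token
    synonyms = {}   # position -> list of synonym tokens
    for tok in response["tokens"]:
        pos = tok["position"]
        if tok["type"] == "SYNONYM":
            synonyms.setdefault(pos, []).append(tok["token"])
        else:
            originals[pos] = tok["token"]
    return {originals[p]: synonyms.get(p, []) for p in sorted(originals)}


# Abridged version of the _analyze response shown above.
sample = {
    "tokens": [
        {"token": "确认", "start_offset": 0, "end_offset": 2, "type": "CN_WORD", "position": 0},
        {"token": "肯定", "start_offset": 0, "end_offset": 2, "type": "SYNONYM", "position": 0},
        {"token": "过", "start_offset": 2, "end_offset": 3, "type": "CN_CHAR", "position": 1},
        {"token": "眼神", "start_offset": 3, "end_offset": 5, "type": "CN_WORD", "position": 2},
        {"token": "眼色", "start_offset": 3, "end_offset": 5, "type": "SYNONYM", "position": 2},
    ]
}

grouped = synonyms_by_word(sample)
print(grouped)
```

With ik_max_word the tokenizer emits more, overlapping tokens, so the one-original-token-per-position assumption is worth re-checking before reusing this there.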

That is all there is to it: a simple synonym service built on Elasticsearch 6.0 + elasticsearch-analysis-ik + the HIT Tongyici Cilin thesaurus.
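To tie this back to the original requirement, here is one possible way to combine two `_analyze` responses, one per passage. It treats "overlapping" as "the same original token occurs in both passages", which is my own reading of the requirement; `shared_words_with_synonyms` and the `min_count` parameter are hypothetical, and the two sample responses are hand-built for illustration, not real server output.

```python
from collections import Counter


def original_tokens(response):
    """Tokens produced by the tokenizer itself, excluding injected synonyms."""
    return [t["token"] for t in response["tokens"] if t["type"] != "SYNONYM"]


def shared_words_with_synonyms(resp_a, resp_b, min_count=1):
    """Return {word: set of synonyms} for words present in both responses."""
    count_a = Counter(original_tokens(resp_a))
    count_b = Counter(original_tokens(resp_b))
    shared = {w for w in count_a
              if count_a[w] >= min_count and count_b[w] >= min_count}
    result = {w: set() for w in shared}
    for resp in (resp_a, resp_b):
        # Which original word sits at each position in this response.
        word_at = {t["position"]: t["token"]
                   for t in resp["tokens"] if t["type"] != "SYNONYM"}
        for t in resp["tokens"]:
            word = word_at.get(t["position"])
            if t["type"] == "SYNONYM" and word in shared:
                result[word].add(t["token"])
    return result


# Hand-built, trimmed responses (only the fields the sketch reads).
resp_a = {"tokens": [
    {"token": "确认", "type": "CN_WORD", "position": 0},
    {"token": "肯定", "type": "SYNONYM", "position": 0},
    {"token": "眼神", "type": "CN_WORD", "position": 1},
    {"token": "眼色", "type": "SYNONYM", "position": 1},
]}
resp_b = {"tokens": [
    {"token": "眼神", "type": "CN_WORD", "position": 0},
    {"token": "眼色", "type": "SYNONYM", "position": 0},
    {"token": "目光", "type": "SYNONYM", "position": 0},
    {"token": "过", "type": "CN_CHAR", "position": 1},
]}

overlap = shared_words_with_synonyms(resp_a, resp_b)
print(overlap)
```

Raising `min_count` is one crude way to approximate "heavily overlapping"; a real implementation would probably also want to drop stop words and single characters.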

Reference

https://github.com/medcl/elasticsearch-analysis-ik/issues/93
