<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
<title>Zhisheng的博客</title>
<subtitle>放码过来!</subtitle>
<link href="/atom.xml" rel="self"/>
<link href="http://yoursite.com/"/>
<updated>2017-09-23T07:32:26.494Z</updated>
<id>http://yoursite.com/</id>
<author>
<name>Zhisheng Tian</name>
</author>
<generator uri="http://hexo.io/">Hexo</generator>
<entry>
<title>Google Guava 缓存实现接口的限流</title>
<link href="http://yoursite.com/2017/09/23/Guava-limit/"/>
<id>http://yoursite.com/2017/09/23/Guava-limit/</id>
<published>2017-09-23T07:30:59.463Z</published>
<updated>2017-09-23T07:32:26.494Z</updated>
<content type="html"><![CDATA[<p><img src="http://ohfk1r827.bkt.clouddn.com/cb2.jpeg-1" alt=""><br><a id="more"></a></p>
<h3 id="项目背景"><a href="#项目背景" class="headerlink" title="项目背景"></a>项目背景</h3><p>最近项目中需要进行接口保护,防止高并发的情况把系统搞崩,因此需要对一个查询接口进行限流,主要的目的就是限制单位时间内请求此查询的次数,例如 1000 次,来保护接口。<br>参考了 <a href="http://www.jianshu.com/p/0d7ca597ebd2" target="_blank" rel="external">开涛的博客聊聊高并发系统限流特技</a> ,学习了其中利用 Google Guava 缓存实现限流的技巧,在网上也查到了很多关于 Google Guava 缓存的博客,学到了好多,推荐一个博客文章:<a href="http://ifeve.com/google-guava-cachesexplained/" target="_blank" rel="external">http://ifeve.com/google-guava-cachesexplained/</a>, 关于 Google Guava 缓存的更多细节或者技术,这篇文章讲的很详细;<br>这里我们并不是用缓存来优化查询,而是利用缓存,存储一个计数器,然后用这个计数器来实现限流。</p>
<h3 id="效果实验"><a href="#效果实验" class="headerlink" title="效果实验"></a>效果实验</h3><figure class="highlight java"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div></pre></td><td class="code"><pre><div class="line"><span class="keyword">static</span> LoadingCache<Long, AtomicLong> count = CacheBuilder.newBuilder().expireAfterWrite(<span class="number">1</span>, TimeUnit.SECONDS).build(<span class="keyword">new</span> CacheLoader<Long, AtomicLong>() {</div><div class="line"> <span class="meta">@Override</span></div><div class="line"> <span class="function"><span class="keyword">public</span> AtomicLong <span class="title">load</span><span class="params">(Long o)</span> <span class="keyword">throws</span> Exception </span>{</div><div class="line"> <span class="comment">//System.out.println("Load call!");</span></div><div class="line"> <span class="keyword">return</span> <span class="keyword">new</span> AtomicLong(<span class="number">0L</span>);</div><div class="line"> }</div><div class="line"> });</div></pre></td></tr></table></figure>
<p>上面,我们通过 CacheBuilder 来新建一个 LoadingCache 缓存对象 count,然后设置其有效时间为 1 秒,即每 1 秒钟刷新一次;缓存中,key 为一个 long 型的时间戳类型,value 是一个计数器,使用原子性的 AtomicLong 保证自增和自减操作的原子性, 每次查询缓存时如果不能命中,即查询的时间戳不在缓存中,则重新加载缓存,执行 load 将当前的时间戳的计数值初始化为 0。这样对于每一秒的时间戳,能计算这一秒内执行的次数,从而达到限流的目的;<br>这是要执行的一个 getCounter 方法:</p>
<figure class="highlight java"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div></pre></td><td class="code"><pre><div class="line"><span class="keyword">public</span> <span class="class"><span class="keyword">class</span> <span class="title">Counter</span> </span>{</div><div class="line"> <span class="keyword">static</span> <span class="keyword">int</span> counter = <span class="number">0</span>;</div><div class="line"> <span class="function"><span class="keyword">public</span> <span class="keyword">static</span> <span class="keyword">int</span> <span class="title">getCounter</span><span class="params">()</span> <span class="keyword">throws</span> Exception</span>{</div><div class="line"> <span class="keyword">return</span> counter++;</div><div class="line"> }</div><div class="line">}</div></pre></td></tr></table></figure>
<p>现在我们创建多个线程来执行这个方法:</p>
<figure class="highlight java"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div><div class="line">14</div><div class="line">15</div><div class="line">16</div><div class="line">17</div><div class="line">18</div><div class="line">19</div><div class="line">20</div><div class="line">21</div><div class="line">22</div></pre></td><td class="code"><pre><div class="line"><span class="keyword">public</span> <span class="class"><span class="keyword">class</span> <span class="title">Test</span> </span>{</div><div class="line"></div><div class="line"> <span class="function"><span class="keyword">public</span> <span class="keyword">static</span> <span class="keyword">void</span> <span class="title">main</span><span class="params">(String args[])</span> <span class="keyword">throws</span> Exception</span></div><div class="line"> {</div><div class="line"> <span class="keyword">for</span>(<span class="keyword">int</span> i = <span class="number">0</span>;i<<span class="number">100</span>;i++)</div><div class="line"> {</div><div class="line"> <span class="keyword">new</span> Thread(){</div><div class="line"> <span class="meta">@Override</span></div><div class="line"> <span class="function"><span class="keyword">public</span> <span class="keyword">void</span> <span class="title">run</span><span class="params">()</span> </span>{</div><div class="line"> <span class="keyword">try</span> {</div><div class="line"> System.out.println(Counter.getCounter());</div><div class="line"> }</div><div class="line"> <span class="keyword">catch</span> (Exception e)</div><div class="line"> {</div><div class="line"> e.printStackTrace();</div><div class="line"> }</div><div class="line"> 
}</div><div class="line"> }.start();</div><div class="line"> }</div><div class="line"></div><div class="line"> }</div><div class="line">}</div></pre></td></tr></table></figure>
<p>这样执行的话,执行结果很简单,就是很快地执行这个 for 循环,迅速打印 0 到 99 这 100 个数,不再贴出。<br>这里的 for 循环启动 100 个线程的时间是很快的,那么现在我们要限制每秒只能有 10 个线程来执行 getCounter() 方法,该怎么办呢?上面讲的限流方法就派上用场了:</p>
<figure class="highlight java"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div><div class="line">14</div><div class="line">15</div><div class="line">16</div><div class="line">17</div><div class="line">18</div><div class="line">19</div><div class="line">20</div><div class="line">21</div><div class="line">22</div></pre></td><td class="code"><pre><div class="line"><span class="keyword">public</span> <span class="class"><span class="keyword">class</span> <span class="title">Counter</span> </span>{</div><div class="line"> <span class="keyword">static</span> LoadingCache<Long, AtomicLong> count = CacheBuilder.newBuilder().expireAfterWrite(<span class="number">1</span>, TimeUnit.SECONDS).build(<span class="keyword">new</span> CacheLoader<Long, AtomicLong>() {</div><div class="line"> <span class="meta">@Override</span></div><div class="line"> <span class="function"><span class="keyword">public</span> AtomicLong <span class="title">load</span><span class="params">(Long o)</span> <span class="keyword">throws</span> Exception </span>{</div><div class="line"> System.out.println(<span class="string">"Load call!"</span>);</div><div class="line"> <span class="keyword">return</span> <span class="keyword">new</span> AtomicLong(<span class="number">0L</span>);</div><div class="line"> }</div><div class="line"> });</div><div class="line"> <span class="keyword">static</span> <span class="keyword">long</span> limits = <span class="number">10</span>;</div><div class="line"> <span class="keyword">static</span> <span class="keyword">int</span> counter = <span class="number">0</span>;</div><div class="line"> <span class="function"><span class="keyword">public</span> <span 
class="keyword">static</span> <span class="keyword">synchronized</span> <span class="keyword">int</span> <span class="title">getCounter</span><span class="params">()</span> <span class="keyword">throws</span> Exception</span>{</div><div class="line"> <span class="keyword">while</span> (<span class="keyword">true</span>)</div><div class="line"> {</div><div class="line"> <span class="comment">//获取当前的时间戳作为key</span></div><div class="line"> Long currentSeconds = System.currentTimeMillis() / <span class="number">1000</span>;</div><div class="line"> <span class="keyword">if</span> (count.get(currentSeconds).getAndIncrement() > limits) {</div><div class="line"> <span class="keyword">continue</span>;</div><div class="line"> }</div><div class="line"> <span class="keyword">return</span> counter++;</div><div class="line"> }</div><div class="line"> }</div><div class="line">}</div></pre></td></tr></table></figure>
<p>这样一来,就可以限制每秒的执行数了。对于每个线程,获取当前时间戳,如果当前时间(当前这 1 秒)内已有超过 10 个线程执行过,那么这个线程就一直在这里循环,直到下一秒或者更靠后的时间,缓存重新加载,执行 load,将新的时间戳的计数值重新置为 0。<br>执行结果:<br><img src="http://img.blog.csdn.net/20160620150358906" alt=""><br>每秒执行 11 个(因为计数从 0 开始),每一秒之后,load 方法会执行一次;</p>
<figure class="highlight java"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div><div class="line">14</div><div class="line">15</div><div class="line">16</div><div class="line">17</div><div class="line">18</div><div class="line">19</div><div class="line">20</div><div class="line">21</div><div class="line">22</div><div class="line">23</div><div class="line">24</div><div class="line">25</div></pre></td><td class="code"><pre><div class="line">为了更加直观,我们可以让每个<span class="keyword">for</span>循环sleep一段时间:</div><div class="line"></div><div class="line"><span class="keyword">public</span> <span class="class"><span class="keyword">class</span> <span class="title">Test</span> </span>{</div><div class="line"></div><div class="line"> <span class="function"><span class="keyword">public</span> <span class="keyword">static</span> <span class="keyword">void</span> <span class="title">main</span><span class="params">(String args[])</span> <span class="keyword">throws</span> Exception</span></div><div class="line"> {</div><div class="line"> <span class="keyword">for</span>(<span class="keyword">int</span> i = <span class="number">0</span>;i<<span class="number">100</span>;i++)</div><div class="line"> {</div><div class="line"> <span class="keyword">new</span> Thread(){</div><div class="line"> <span class="meta">@Override</span></div><div class="line"> <span class="function"><span class="keyword">public</span> <span class="keyword">void</span> <span class="title">run</span><span class="params">()</span> </span>{</div><div class="line"> <span class="keyword">try</span> {</div><div class="line"> System.out.println(Counter.getCounter());</div><div class="line"> }</div><div 
class="line"> <span class="keyword">catch</span> (Exception e)</div><div class="line"> {</div><div class="line"> e.printStackTrace();</div><div class="line"> }</div><div class="line"> }</div><div class="line"> }.start();</div><div class="line"> Thread.sleep(<span class="number">100</span>);</div><div class="line"> }</div><div class="line"></div><div class="line"> }</div><div class="line">}</div></pre></td></tr></table></figure>
<p>在上述这样的情况下,一个线程如果发现当前这一秒内执行的线程数已超过 limit 值,就会一直在 while 循环中空转,这样会浪费大量的资源。我们在做限流的时候,如果出现这种情况,可以<strong>不进行 while 循环</strong>,而是直接抛出异常或者返回,来拒绝这次执行(查询),这样便可以节省资源。</p>
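顺着这个思路,下面给出一个“快速失败”版本的示意实现。为了让示例自包含,这里用 JDK 自带的 ConcurrentHashMap 代替 Guava 缓存(这是演示用的假设写法,并非原文代码):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// 示意:超过限流阈值时不再 while 空转,而是直接返回 false(快速失败),
// 由调用方决定是抛异常还是拒绝本次请求。
public class FailFastCounter {
    private final long limit;
    // 以"当前秒"的时间戳为 key,value 是该秒内的请求计数器
    private final ConcurrentHashMap<Long, AtomicLong> counts = new ConcurrentHashMap<>();

    public FailFastCounter(long limit) {
        this.limit = limit;
    }

    // 返回 true 表示放行,false 表示本秒配额已用完
    public boolean tryAcquire() {
        long currentSeconds = System.currentTimeMillis() / 1000;
        AtomicLong counter = counts.computeIfAbsent(currentSeconds, k -> new AtomicLong(0));
        return counter.incrementAndGet() <= limit;
    }
}
```

注意这个示意版本不会清理过期的 key,长时间运行会占用内存;Guava 缓存的 expireAfterWrite(1, TimeUnit.SECONDS) 正好替我们做了过期清理,这也是原文选用它的一个原因。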
<h3 id="最后"><a href="#最后" class="headerlink" title="最后"></a>最后</h3><p>本篇文章地址: <a href="http://www.54tianzhisheng.cn/2017/09/23/Guava-limit/" target="_blank" rel="external">http://www.54tianzhisheng.cn/2017/09/23/Guava-limit/</a></p>
]]></content>
<summary type="html">
<p><img src="http://ohfk1r827.bkt.clouddn.com/cb2.jpeg-1" alt=""><br>
</summary>
<category term="Guava" scheme="http://yoursite.com/tags/Guava/"/>
</entry>
<entry>
<title>面试过阿里等互联网大公司,我知道了这些套路</title>
<link href="http://yoursite.com/2017/09/17/Interview-summary/"/>
<id>http://yoursite.com/2017/09/17/Interview-summary/</id>
<published>2017-09-16T17:28:43.398Z</published>
<updated>2017-09-17T04:49:13.086Z</updated>
<content type="html"><![CDATA[<p><img src="http://ohfk1r827.bkt.clouddn.com/shanghai.jpeg-1" alt=""></p>
<a id="more"></a>
<h3 id="前面感谢一波"><a href="#前面感谢一波" class="headerlink" title="前面感谢一波"></a>前面感谢一波</h3><p>因为看到掘金在做秋招求职征文大赛,赞助商也有牛客网,自己前段时间也稍微写了篇博客总结我的大学生活,<a href="http://www.54tianzhisheng.cn/2017/08/26/recommend-books/" target="_blank" rel="external">那些年我看过的书 —— 致敬我的大学生活 —— Say Good Bye !</a> 博客中稍微简单的介绍了下自己的求职,重点是推荐了下我自己看过的那些书籍,对我帮助真的很大。</p>
<p>如今借这么个机会,回馈掘金和牛客网。想想自己这一年在掘金也写过不少文章,从 0 个粉丝到如今被 11047 人(截至写此篇文章时)关注,有点小激动,竟然这么多粉,也不知道真正活跃的用户有多少。不管怎样,这一年在掘金还是收获很多的,不仅可以阅读到很多大神的文章,学习新的知识,而且还遇到了好几个不错的哥们,如今平常也有和他们交流,比如:<a href="https://juejin.im/user/5904c637b123db3ee479d923" target="_blank" rel="external">芋道源码</a> 老哥人就很不错,在上海还和老哥见过面,吃过饭,平常对我帮助也很大,会推荐一些很有用的书籍给我看。欢迎大家关注他的博客:<a href="http://vip.iocoder.cn/" target="_blank" rel="external">芋道源码的博客</a>,里面有好几个系列的源码分析文章呢。至于牛客网,我就更是老用户了,印象中好像是大一的时候注册的,那时有空就会去上面刷几道基础题、写写题解,坚持了好久,如今早已是红名了(其实是水出来的,哈哈)。在牛客网遇到的大神也超多,好多朋友几乎都是通过牛客网认识的。早些时候大家一起在一个群里讨论问题,别提那场面了,震惊,我等弱渣瑟瑟发抖。感谢叶神,左神,牛妹!</p>
<p>说着说着,好像偏题了。<img src="http://ohfk1r827.bkt.clouddn.com/201612031957235746.gif" alt=""></p>
<p>正式进入话题吧!</p>
<h3 id="正文开始"><a href="#正文开始" class="headerlink" title="正文开始"></a>正文开始</h3><p>本篇秋招求职征文主要分享如下几方面:<strong>招聘职位需求套路</strong> 、<strong>招聘面试的套路</strong>、<strong>简历撰写套路</strong>、<strong>简历投递套路</strong> 、<strong>找工作经历</strong> 、<strong>自己面试面经</strong> 、<strong>实习感悟</strong>、<strong>书籍推荐</strong> 、<strong>优秀网站推荐</strong> 、<strong>优秀博客推荐</strong> 、<strong>求职资料放送</strong>。</p>
<h3 id="招聘职位需求套路"><a href="#招聘职位需求套路" class="headerlink" title="招聘职位需求套路"></a>招聘职位需求套路</h3><p>摘举下几个公司的招聘需求:(from lagou)</p>
<p>1、Java开发校招生( 有赞 )</p>
<blockquote>
<p>职位诱惑:<br>福利好待遇佳,技术氛围浓,有大牛带成长快<br>职位描述:</p>
<p>有赞2018校招官方网申地址(请在官网投递,勿直接在Lagou上投递):<br><a href="https://job.youzan.com/campus" target="_blank" rel="external">https://job.youzan.com/campus</a><br>岗位职责</p>
<ol>
<li>我们拥有世界级的 SaaS 电商解决方案,每天处理几百万订单、几亿条消息,并且量级不断攀升;</li>
<li>我们开放了有赞云,连接了数十万开发者,大大提升了 SaaS 对商家产生的价值;</li>
<li>我们正在新零售的潮流中激流勇进、开疆拓土,用产品技术撬动巨大的市场;</li>
<li>而你的工作,就是参与这些大流量系统的研发,哪怕提升1%的性能和稳定性都将是激动人心的时刻。</li>
</ol>
<p>岗位要求</p>
<ol>
<li>2018届本科及以上学历应届毕业生,计算机或者软件工程相关专业;</li>
<li>具备扎实的计算机基础知识,至少熟练使用一门主流开发语言;</li>
<li>积极参与开发实践,如果拥有引以为豪的项目经历则加分;</li>
<li>热衷数据结构与算法,如果一不小心在 ACM 赛场摘过金,夺过银则加分;</li>
<li>能在 Linux 上写任何脚本,比王者荣耀上手还快则加分;</li>
<li>快速学习新鲜事物,自我驱动追求卓越,积极应对问题和变化。</li>
</ol>
</blockquote>
<p>2、京东居家生活事业部-汽车用品招聘实习生(2018届)</p>
<blockquote>
<p>职位诱惑:<br>京东商城</p>
<p>职位描述:<br>京东商城-汽车用品部门招聘实习生</p>
<p>我们需要这样的你:</p>
<ul>
<li>2018届毕业生(本科或硕士均可)</li>
<li>学习能力强</li>
<li>担当、抗压、接受变化</li>
<li>能长期实习(优秀者有转正机会)</li>
<li>需要一个大的平台来展示和发挥自己的能力</li>
</ul>
<p>你将收获:</p>
<ul>
<li>重新认识快速成长的自己</li>
<li>一份世界500强的实习经历</li>
<li>一群优秀的伙伴</li>
</ul>
</blockquote>
<p>3、爱奇艺 Java 实习生 - 游戏事业部</p>
<blockquote>
<p>要求:至少 6 个月以上每周三天以上实习。</p>
<ol>
<li>本科以上学历,计算机、软件工程相关专业;</li>
<li>基础扎实,熟悉 Java 编程,熟悉 Spring、MyBatis 等框架优先;</li>
<li>熟悉 SQL 语句,熟练使用 MySQL 数据库;</li>
<li>良好的沟通、表达、协调能力,富有激情,学习能力强;</li>
<li>有 GitHub 账号或者技术博客优先;</li>
<li>热爱游戏行业优先。</li>
</ol>
</blockquote>
<p>这里随便找了三个。从招聘需求里看,好多公司并不会把要求写得很清楚,有的也不会说明要求的技术栈,这对我们这种新人来说其实有点不好:没有明确的目标去复习。还有就是一些加分项,其实也是有点帮助的。比如有些招聘上写有优秀博客和 GitHub 者优先,这两点我们其实可以在大学慢慢积累出来,对面试确实有帮助,我好些面试机会都是靠这两个拿到的。还有一个套路就是,别光信招聘需求,进去面试可能就不问你这些方面的问题了,那些公司几乎都是这么个套路:<strong>面试造火箭,入职拧螺丝</strong>!进公司之前可能需要你懂很多东西,但是进去之后还只是专门做一方面的东西。不管怎样,如果你有机会进大公司的话(而且适合去),还是去大公司吧,出来大厂光环不少。</p>
<ol>
<li>认真耐心地拧螺丝钉,说不定有机会去造大火箭——正规大公司的节奏。</li>
<li>短时间把螺丝拧出花,说不定有机会造小火箭——上升中创业公司的节奏。</li>
</ol>
<h3 id="招聘面试的套路"><a href="#招聘面试的套路" class="headerlink" title="招聘面试的套路"></a>招聘面试的套路</h3><p>参考:<a href="https://mp.weixin.qq.com/s/qRwDowetBkJqpeMeAZsIpA" target="_blank" rel="external">https://mp.weixin.qq.com/s/qRwDowetBkJqpeMeAZsIpA</a> 一个在掘金上认识的老哥,在京东工作,写的不错,干脆分享下。大家可以去看他的博客,<a href="http://mindwind.me/" target="_blank" rel="external">http://mindwind.me/</a> 当时我求职的时候通过作者博客也学到不少东西。</p>
<p>一次集中的扩招需求,有点像每年一度的晋升评审,都需要对大量的候选人进行定级评审,因为每一个新招聘的人员都会对其有一个定级的过程。</p>
<p>维度:</p>
<blockquote>
<ul>
<li>通用能力:考察其沟通表达、学习成长等</li>
<li>专业知识:考察其知识的掌握、深度、广度等</li>
<li>专业能力:考察其技能应用的能力和结果</li>
<li>工作业绩:考察其工作成果、产出、创新点等</li>
<li>价值观:考察其认知、理解、行为等</li>
</ul>
</blockquote>
<p>整个面试过程会包括下面几个部分:</p>
<p><strong>自我介绍</strong><br>一开始的简短自我介绍,考察点在于对自我的总结、归纳和认知能力。观察其表达的逻辑性和清晰性,有个整体印象。</p>
<p><strong>项目经历</strong><br>一般我不会专门问一些比较死的专业技术点之类的知识,都是套在候选人的项目经历和过往经验中穿插。通过其描述,来判断其掌握知识点的范围和深度,以及在实际的案例中如何运用这些知识与技能解决真正的问题的。</p>
<p>所以,不会有所谓的题库。每一个我决定面试的候选人,都是提前细读其简历,提炼场景和发掘需要问的问题,相当于面试前有个二三十分钟的备课过程,组织好面试时的交互过程与场景,以顺利达到我想要了解的点。</p>
<p><strong>团队合作</strong><br>通常还会问候选人其所在团队中的角色,他们的工作模式、协作方式,并给出一些真实的场景化案例观察其应对的反应。评价一下关于他周围的同事、下属或领导,了解他在团队中的自我定位。这里的考察点是沟通协作方面的通用能力。</p>
<p><strong>学习成长</strong><br>这个维度考察的关键点包括:成长潜力、职业生涯规划的清晰度。人与人之间成长速度的关键差距,我自己观察得出的结论在于:自驱力。而路径的清晰性,也是产生自驱的一个源动力,否则可能会感觉迷茫,而陷于困顿。</p>
<p><strong>文化匹配</strong><br>这算是价值观的一部分吧。其实,这是最难考核的,我没有什么好方法,基本靠感觉。曾经有过好几次碰到经历和技能都不错的人,但总是感觉哪里不对,但又着急要人,就放进来了。但最终感觉是对的,合作很快就结束了,人也走了。</p>
<p><strong>综合评价</strong><br>总结点评候选人的优势、劣势并进行技术定级,定级也没有绝对标准,而是相对的。我一般就是和周围觉得差不多级别的人的平均水准比较下,大概就会有一个技术级别的判断。</p>
<p><strong>套路</strong></p>
<p>招聘面试,其实是一个对人的筛选,而筛选的本质是匹配 —— 匹配人与职位。第一,你得非常清楚地理解,这个职位需要什么样属性的人。第二,确定你的候选人是否拥有这个职位要求的必须属性。那么,首先回答第一个问题,一般的职位需要什么样的属性?</p>
<p>属性,又可以进一步拆解为三个层次。第一层次是「技能(Skills)」,技能是你习得的一种工具,就像程序员会用某种语言和框架来编写某类应用程序。第二层次是「能力(Abilities)」,能力是你运用工具的思考和行为方式,用同样的语言和框架编写同样程序的程序员能力可以差别很大。而第三层次是「价值观(Values)」,价值观是一个人根深蒂固的信念以及驱动行为的原因与动力所在。</p>
<h3 id="简历撰写套路"><a href="#简历撰写套路" class="headerlink" title="简历撰写套路"></a>简历撰写套路</h3><p>参考:<a href="https://mp.weixin.qq.com/s/3f8hGAQ-auLdkxkQ8XG3CQ" target="_blank" rel="external">https://mp.weixin.qq.com/s/3f8hGAQ-auLdkxkQ8XG3CQ</a></p>
<p>简历,是如此重要,它是获得一份满意工作的敲门砖,但不同的简历敲门的声响可不同。</p>
<p>但很多时候简历给人的感觉也似乎微不足道,因为没有人会真正细致的去读一份简历。而仅仅是快速的浏览一遍,就几乎同时对一个候选人形成了一种要么强烈,要么无感的印象。现实中的真实情况是,你的简历只有十几二十秒的时间窗口机会会被浏览到,然后就决定了能否进入下一步。</p>
<p>要让面试官看了你的简历后:知道你做过什么?看看技能、经历与岗位需求的匹配度,然后再问问你是谁?你通过简历散发出来的味道是什么感觉,我愿意和这样的人一起共事么?</p>
<p>一份简历的最少必要内容包括:</p>
<blockquote>
<ul>
<li><p>个人信息</p>
</li>
<li><ul>
<li>姓名</li>
<li>年龄</li>
<li>手机</li>
<li>邮箱</li>
</ul>
</li>
<li><p>教育经历</p>
</li>
<li><ul>
<li>博士(硕士、本科) 有多个全部写出来,最高学历写在上面</li>
</ul>
</li>
<li><p>工作经历(最匹配职位需求的,挑选出来的 TOP3 的项目)</p>
</li>
<li><ul>
<li><p>项目1</p>
</li>
<li><ul>
<li>项目背景上下文(场景、问题)</li>
<li>你在其中的角色(职责、发挥的作用、结果度量)</li>
<li>与此项经历有关的知识与技能(技术栈)</li>
</ul>
</li>
<li><p>项目2</p>
</li>
<li><p>项目3</p>
</li>
</ul>
</li>
<li><p>附加信息</p>
</li>
<li><ul>
<li>博客:持续有内容,不碎碎念</li>
<li>开源:GitHub 持续 commit</li>
<li>社区:有一定专业影响力的</li>
<li>书籍:用心写的</li>
<li>演讲:行业大会级别的</li>
<li>专利:凑数的就算了</li>
<li>论文:学术界比较有影响力的</li>
<li>爱好:真正的兴趣点</li>
</ul>
</li>
</ul>
</blockquote>
<p>对于我们学生,缺乏工作经历,那就写写独特的学习或实习经历。同学们大家都共有的经历就不要随便写上去凑数了。对于学生,看重的是通用能力,学习能力,适应能力以及对工作的态度和热情。如果没有区分度高的经历,那么有作品也是很好的。比如将你的做的网站部署出来,把地址写在简历上。</p>
<p>关于技术栈部分的技术术语,很多程序员不太注意。比如,把 Java 写成 java 或 JAVA,Java 已是一个专有品牌名词,大小写要完全符合,这一点和 iOS 类似(i 小写,OS 大写)。另外,像 HTML,CSS 则全部大写,因为这是多个单词的缩写。一些小小的细节就能读出你的专业性和散发出来的味道。最后,技术术语不是罗列得多就好,不是真正熟练的技能,不要轻易写进简历。因为这将给你自己挖坑。你可以将你自己擅长的或者很熟的知识点写进去,有时想着重就加粗或者打个括号,这样可以挖坑给面试官,让他去问你熟悉的(前提要确保你真的能讲清楚,我试过这个方法很有效的)。</p>
<p>然后就是简历格式了,最好是 PDF 了,Word 在不同的电脑上的打开效果可能不一样,格式可能会变,况且有些人的电脑不一定装了 Word,不过我喜欢用 Markdown 写简历,简洁,适合程序员,然后把 Markdown 转换成 PDF 出来。</p>
<h3 id="简历投递套路"><a href="#简历投递套路" class="headerlink" title="简历投递套路"></a>简历投递套路</h3><p><strong>内推</strong></p>
<p>有内推通道尽量走内推通道,不知道方便多少,而且成功几率也很大!找熟人,找学长学姐吧!牛客网讨论区很多内推帖子,可以去找找。不过今年的好多公司的内推通道都不咋管用了,套路越来越多了。记得去年好多公司内推都是免笔试,直接进入面试阶段,今年直接变成内推免简历筛选,进入笔试。因为现在的内推越来越不靠谱,直接面试的话,会增加公司的面试成本,干脆笔试再筛选一部分人。</p>
<p><strong>拉勾网</strong></p>
<p>拉勾上还是算不错的。</p>
<p><strong>Boss 直聘</strong></p>
<p>虽说前段时间出现了程序员找工作进入传销最后导致死亡的惨事发生,但是里面总比智联招聘和前程无忧靠谱点。因为智联招聘和前程无忧几乎被广告党和培训机构给占领了。</p>
<p><strong>脉脉</strong></p>
<p>里面招应届生和实习生比较少,但是也有,可以试试。</p>
<p>总之,简历投递给公司之前,请确认下这家公司到底咋样,先去百度了解下,别被坑了,每个平台都有一些居心不良的广告党等着你上钩,千万别上当!!!</p>
<h3 id="找工作经历"><a href="#找工作经历" class="headerlink" title="找工作经历"></a>找工作经历</h3><p>这段经历,算是自己很难忘记的经历吧,既辛酸又充实的日子!也很感谢自己在这段时间的系统复习,感觉把自己的基础知识再次聚集在一起了,自己的能力在这一段时间提升得也很快。后面有机会的话我也想写一系列的相关文章,为后来准备工作(面试)的同学提供一些帮助。</p>
<p>自己在找工作的这段时间面过的公司也有几家大厂,但是结果都不是很好,对我有很大的压力,当时心里真的感觉:“自己真的有这么差?”为什么一直被拒,当时很怀疑自己的能力,自己也总结了原因。一是面试的时候自己准备得还不够充分,虽说脑子里对这些基础有点印象,但是面试时稍一紧张就描述得不怎么清楚了,导致面试官觉得你可能广度够了,深度还不够(这是阿里面试官电话面试说的);二是自己的表达能力还有所欠缺,不能把自己想表达的东西说清楚,这是我后面要加强的地方;三是我的学校问题。</p>
<p>在面了几家公司失败后,终于有家公司要我了,我也确定在这家公司了。很幸运,刚出来就有一个很好(很负责)的架构师带我,这周就给了我一个很牛逼的项目看,里面新东西很多,说吃透了这个项目,以后绝对可以拿出去吹(一脸正经.jpg)。找工作期间,自己也经常收集一些博客并保存下来,方便下次系统复习;还在牛客网整理了很多面经,每天看几篇,了解面试一般问什么问题、都有啥套路。其实面经看多了就会发现,面试考的题目几乎都差不多,区别不是很大。目前我的找工作经历就简短介绍到这里了,如果感兴趣的话,可以加群:528776268,期待志同道合的你。</p>
<h3 id="自己面试面经"><a href="#自己面试面经" class="headerlink" title="自己面试面经"></a>自己面试面经</h3><h4 id="亚信"><a href="#亚信" class="headerlink" title="亚信"></a>亚信</h4><p>地址:<a href="http://www.54tianzhisheng.cn/2017/08/04/yaxin/" target="_blank" rel="external">http://www.54tianzhisheng.cn/2017/08/04/yaxin/</a></p>
<p>1)自我介绍(说到一个亮点:长期坚持写博客,面试官觉得这个习惯很好,算加分项吧)</p>
<p>2)看到简历项目中用到 Solr,详细的问了下 Solr(自己介绍了下 Solr 的使用场景和建立索引等东西)</p>
<p>3)项目里面写了一个 “ 敏感词和 JS 标签过滤防 XSS 攻击”,面试官让我讲了下这个 XSS 攻击,并且是怎样实现的</p>
<p>4)项目里写了支持 Markdown,问是不是自己写的解析代码,(回答不是,自己引用的是 GitHub上的一个开源项目解析的)</p>
<p>5)想问我前端的知识,我回复到:自己偏后端开发,前端只是了解,然后面试官就不问了</p>
<p>6)问我考不考研?</p>
<p>7)觉得杭州怎么样?是打算就呆在杭州还是把杭州作为一个跳板?</p>
<p>8)有啥小目标?以后是打算继续技术方向,还是先技术后管理(还开玩笑的说:是不是赚他几个亿,当时我笑了笑)</p>
<p>9)有啥兴趣爱好?</p>
<p><strong>总结</strong>:面试问的问题不算多,主要是通过简历上项目所涉及的东西提问的,如果自己不太会的切记不要写上去。面试主要考察你回答问题来判断你的逻辑是否很清楚。</p>
<h4 id="爱奇艺"><a href="#爱奇艺" class="headerlink" title="爱奇艺"></a>爱奇艺</h4><p>地址:<a href="http://www.54tianzhisheng.cn/2017/08/04/iqiyi/" target="_blank" rel="external">http://www.54tianzhisheng.cn/2017/08/04/iqiyi/</a></p>
<h5 id="笔试(半个小时)"><a href="#笔试(半个小时)" class="headerlink" title="笔试(半个小时)"></a>笔试(半个小时)</h5><p>题目:(记得一些)</p>
<p>1、重载重写的区别?</p>
<p>2、转发和重定向的区别?</p>
<p>3、画下 HashMap 的结构图?HashMap 、 HashTable 和 ConcurrentHashMap 的区别?</p>
<p>4、statement 和 preparedstatement 区别?</p>
<p>5、JSP 中用 &lt;c:value&gt; 标签取值与直接取值的区别?会有什么安全问题?</p>
<p>6、实现一个线程安全的单例模式</p>
<p>7、一个写 sql 语句的题目</p>
<p>8、自己实现一个 List,(主要实现 add等常用方法)</p>
<p>9、Spring 中 IOC 和 AOP 的理解?</p>
<p>10、两个对象的 hashcode 相同,是否对象相同?equal() 相同呢?</p>
<p>11、@RequestBody 和 @ResponseBody 区别?</p>
<p>12、JVM 一个错误,什么情况下会发生?</p>
<p>13、常用的 Linux 命令?</p>
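补充一下笔试第 6 题(线程安全的单例模式)的一种常见答案:静态内部类(Holder)写法,利用 JVM 类初始化的线程安全保证,无需显式加锁(仅作示意):

```java
// 示意:静态内部类(Holder)单例。
// Holder 类在第一次调用 getInstance() 时才被加载和初始化,
// 类初始化过程由 JVM 保证线程安全,因此天然是懒加载且线程安全的。
public class Singleton {
    private Singleton() {}

    private static class Holder {
        static final Singleton INSTANCE = new Singleton();
    }

    public static Singleton getInstance() {
        return Holder.INSTANCE;
    }
}
```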
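笔试第 10 题(hashcode 与 equals)也有个现成的例子可以说明:hashCode 相同不代表 equals 相同(哈希冲突是允许的);反过来,按 Object 的契约,equals 相同则 hashCode 必须相同:

```java
// 示意:"Aa" 和 "BB" 的 hashCode 相同(都是 2112),但 equals 为 false
public class HashCodeDemo {
    public static void main(String[] args) {
        System.out.println("Aa".hashCode() == "BB".hashCode()); // true,哈希冲突
        System.out.println("Aa".equals("BB"));                  // false,内容并不相同
    }
}
```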
<h5 id="第一轮面试(80-分钟)"><a href="#第一轮面试(80-分钟)" class="headerlink" title="第一轮面试(80 分钟)"></a>第一轮面试(80 分钟)</h5><p>1、自我介绍</p>
<p>2、介绍你最熟悉的一个项目</p>
<p>3、讲下这个 XSS 攻击</p>
<p>4、HashMap 的结构?HashMap 、 HashTable 和 ConcurrentHashMap 的区别?</p>
<p>5、HashMap 中怎么解决冲突的?(要我详细讲下)</p>
<p>6、ConcurrentHashMap 和 HashTable 中线程安全的区别?为啥建议用 ConcurrentHashMap ?能把 ConcurrentHashMap 里面的实现详细的讲下吗?</p>
<p>7、Session 和 Cookie 的区别?</p>
<p>8、你项目中登录是怎样做的,用的 Cookie 和 Session?</p>
<p>9、讲讲你对 Spring 中的 IOC 和 AOP 的理解?</p>
<p>10、问了好几个注解的作用?</p>
<p>11、statement 和 preparedstatement 区别?</p>
<p>12、$ 和 # 的区别?以及这两个在哪些地方用?</p>
<p>13、前面项目介绍了数据是爬虫爬取过来的,那你讲讲你的爬虫是多线程的吧?</p>
<p>14、讲讲 Python 中的多线程和 Java 中的多线程区别?</p>
<p>15、自己刚好前几天在看线程池,立马就把面试官带到我熟悉的线程池,和面试官讲了下 JDK 自带的四种线程池、ThreadPoolExecutor 类中的最重要的构造器里面的七个参数,然后再讲了下线程任务进入线程池和核心线程数、缓冲队列、最大线程数量比较。</p>
<p>16、线程同步,你了解哪几种方式?</p>
<p>17、讲下 synchronized?</p>
<p>18、讲下 ReentrantLock 可重入锁?什么是可重入锁?为什么要设计可重入锁?</p>
<p>19、讲下 volatile 吧?它是怎样保证可见性的?</p>
<p>20、volatile 为什么不保证原子性?举个例子</p>
<p>21、Atomic 怎么设计的?(没看过源码,当时回答错了,后来才发现里面全部用 final 修饰的属性和方法)</p>
<p>22、问几个前端的标签吧?(问了一个不会,直接说明我偏后端,前端只是了解,后面就不问了)</p>
<p>23、SpringBoot 的了解?</p>
<p>24、Linux 常用命令?</p>
<p>25、JVM 里的几个问题?</p>
<p>26、事务的特性?</p>
<p>27、隔离级别?</p>
<p>28、网络状态码?以 2、3、4、5 开头的代表什么意思。</p>
<p>29、并发和并行的区别?</p>
<p>30、你有什么问题想问我的?</p>
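上面第 15 题提到的 ThreadPoolExecutor 构造器的七个参数,可以对应到下面这段示意代码(参数取值仅作演示):任务提交时先与核心线程数比较,核心线程满了进缓冲队列,队列满了再创建线程直到最大线程数,仍然放不下就走拒绝策略。

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// 示意:ThreadPoolExecutor 构造器的七个参数逐一对应
public class PoolDemo {
    public static ThreadPoolExecutor newPool() {
        return new ThreadPoolExecutor(
                2,                                     // 1. corePoolSize:核心线程数
                4,                                     // 2. maximumPoolSize:最大线程数
                60L,                                   // 3. keepAliveTime:空闲线程存活时间
                TimeUnit.SECONDS,                      // 4. unit:存活时间的单位
                new ArrayBlockingQueue<>(10),          // 5. workQueue:缓冲队列
                Executors.defaultThreadFactory(),      // 6. threadFactory:线程工厂
                new ThreadPoolExecutor.AbortPolicy()); // 7. handler:拒绝策略
    }
}
```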
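第 20 题(volatile 为什么不保证原子性)可以用下面的小例子演示:volatile 只保证可见性,而 count++ 是“读-改-写”三步的复合操作,不是原子的,多线程并发自增会丢失更新;AtomicInteger 的 CAS 自增则不会。丢失多少取决于线程调度,结果不确定,这里仅作演示:

```java
import java.util.concurrent.atomic.AtomicInteger;

// 示意:volatile 自增会丢失更新,AtomicInteger 自增不会
public class VolatileDemo {
    static volatile int volatileCount = 0;
    static AtomicInteger atomicCount = new AtomicInteger(0);

    public static void main(String[] args) throws InterruptedException {
        Runnable task = () -> {
            for (int i = 0; i < 100_000; i++) {
                volatileCount++;               // 非原子:读-改-写,可能丢失更新
                atomicCount.incrementAndGet(); // 原子自增
            }
        };
        Thread t1 = new Thread(task);
        Thread t2 = new Thread(task);
        t1.start(); t2.start();
        t1.join(); t2.join();
        // atomicCount 一定是 200000,volatileCount 往往小于 200000
        System.out.println("volatile: " + volatileCount + ", atomic: " + atomicCount.get());
    }
}
```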
<p>一面面完后,面试官和我说这份试卷是用来考 1~3 年开发工作经验的人的,让我准备一下接下来的二面。</p>
<h5 id="第二轮面试(半个小时)"><a href="#第二轮面试(半个小时)" class="headerlink" title="第二轮面试(半个小时)"></a>第二轮面试(半个小时)</h5><p>1、一上来就问怎么简历名字都没有,我指了简历第一行的我的名字,还特意大写了,然后就问学校是不是在上海,我回答在南昌(感觉被鄙视了一波,后面我在回答问题的时候面试官就一直在玩手机,估计后面对我的印象就不是很好了)</p>
<p>2、自我介绍</p>
<p>3、说一说数据库建表吧(从范式讲)</p>
<p>4、讲讲多态?(这个我答出来了,可是面试官竟然说不是这样吧,可能面试官没听请,后面还说我是不是平时写多态比较少,感觉这个也让面试官对我印象减分)</p>
<p>5、将两个数转换(不借助第三个参数)</p>
<p>6、手写个插入排序吧(写完了和面试官讲了下执行流程)</p>
<p>7、讲讲你对 Spring 中的 IOC 和 AOP 的理解?</p>
<p>8、问了几个常用的 Linux 命令?</p>
<p>9、也问到多线程?和一面一样把自己最近看的线程池也讲了一遍</p>
<p>10、学 Java 多久了?</p>
<p>11、你有什么想问的?</p>
<h5 id="总结:"><a href="#总结:" class="headerlink" title="总结:"></a>总结:</h5><p>面试题目大概就是这么多了,有些问题自己也忘记了,面试题目顺序不一定是按照上面所写的。再次感谢爱奇艺的第一面面试官,要不是他帮忙内推,我可能还没有机会收到面试邀请。自己接到爱奇艺面试邀请电话是星期一晚上快 7 点钟,之后加了面试官微信,约好了星期四面试(准备时间较短,之前没系统地复习过)。星期四一大早(5 点就起床了)收拾了下,去等公交车,转了两次车,然后再坐地铁去爱奇艺公司,路上总共花费四个多小时。总的来说,这次面试准备的时间不是很充裕,个人觉得准备得不是很好。通过这次面试,发现面试还是比较注重基础和深度的,我也知道了自己的一些弱处,知道还需要在哪里加强,面试技巧上也要掌握些,为后面的其他公司继续做好充足的准备。加油!!!</p>
<h4 id="阿里"><a href="#阿里" class="headerlink" title="阿里"></a>阿里</h4><p>地址:<a href="http://www.54tianzhisheng.cn/2017/08/04/alibaba/" target="_blank" rel="external">http://www.54tianzhisheng.cn/2017/08/04/alibaba/</a></p>
<p>(菜鸟网络部门)(49 分钟)</p>
<p>2017.08.02 晚上9点21打电话过来,预约明天什么时候有空面试,约好第二天下午两点。</p>
<p>2017.08.03 下午两点10分打过来了。</p>
<p>说看了我的<a href="http://www.54tianzhisheng.cn/" target="_blank" rel="external">博客</a>和 <a href="https://github.com/zhisheng17" target="_blank" rel="external">GitHub</a>,觉得我学的还行,知识广度都还不错,但是还是要问问具体情况,为什么没看到你春招的记录,什么原因没投阿里?非得说一个原因,那就是:我自己太菜了,不敢投。</p>
<p>1、先自我介绍</p>
<p>2、什么是多态?哪里体现了多态的概念?</p>
<p>3、HashMap 源码分析,把里面的东西问了个遍?最后问是不是线程安全?引出 ConcurrentHashMap</p>
<p>4、ConcurrentHashMap 源码分析</p>
<p>5、类加载,双亲委托机制</p>
<p>6、Java内存模型(一开始说的不是他想要的,主要想问我堆和栈的细节)</p>
<p>7、垃圾回收算法</p>
<p>8、线程池,自己之前看过,所以说的比较多,最后面试官说了句:看你对线程池了解还是很深了</p>
<p>9、事务的四种特性</p>
<p>10、什么是死锁?</p>
<p>11、乐观锁和悲观锁的策略</p>
<p>12、高可用网站的设计(有什么技术实现)</p>
<p>13、低耦合高内聚</p>
<p>14、设计模式了解不?你用过哪几种,为什么用,单例模式帮我们做什么东西?有什么好处?</p>
<p>15、你参与什么项目中成长比较快?学到了什么东西,以前是没有学过的?</p>
<p>16、项目中遇到的最大困难是怎样的?是怎么解决的?</p>
<p>17、智力题(两根不均匀的香,点一头烧完要一个小时,怎么确定15分钟)</p>
<p>18、你有什么问题想要问我的?</p>
<p>19、问了菜鸟网络他们部门主要做什么?</p>
<p>20、对我这次面试做个评价:看了你<a href="http://www.54tianzhisheng.cn/" target="_blank" rel="external">博客</a>和 <a href="https://github.com/zhisheng17" target="_blank" rel="external">GitHub</a>,知道你对学习的热情还是很高的,花了不少功夫,后面有通知!</p>
<p><strong>总结</strong>:面试总的来说,第一次电话面试,感觉好紧张,好多问题自己会点,但是其中的细节没弄清楚,自己准备的也不够充分。面试官很友好,看到我紧张,也安慰我说不要紧,不管以后出去面试啥的,不需要紧张,公司问的问题可能很广,你只需要把你知道的说出来就行,不会的直接说不会就行。之前一直不敢投阿里,因为自己准备的完全不够充分,但是在朋友磊哥的帮助下,还是试了下,不管结果怎么样,经历过总比没有的好。</p>
<p>后面说有通知,结果并没有,只看到官网的投递按钮变灰了。在掘金上一个朋友(我隔壁学校的),当时看我挂了,说要不要让和他一起租房的邻居再内推下淘宝,我想想还是算了,自己目前能力真的有限,达不到进阿里的要求!不过还是要感谢那个哥们,人真的超级好,虽然我们未曾谋面,但是有机会的话,我一定会请你吃饭的。</p>
<h4 id="哔哩哔哩"><a href="#哔哩哔哩" class="headerlink" title="哔哩哔哩"></a>哔哩哔哩</h4><p>首先直接根据简历项目开问,自我介绍都没有。</p>
<p>1、登录从前端到后端整个过程描述一遍?越详细越好。说到密码加密、网络传输、后台验证用户名和密码、Cookie 设置等。面试官具体问我密码加密是前台还是后台加密,我说了在后台加密。面试官说,那你做这个项目有什么意思?密码传输都是明文的,HTTP 默认是明文传输。当时被面试官带进前台加密还是后台加密的沟里去了,没想到用 HTTPS,后来回去的路上查了些资料才知道。面试过程中他很想让我说前台加密,但是前台加密算法的代码就摆在那里,很容易就给破解了吧,他也没给点提示说 HTTPS,我只好投降。</p>
<p>2、写一个查询的 sql 语句</p>
<p>3、线程同步的方法?synchronized、volatile(面试官好像觉得 volatile 不可以做到同步,我和他说了半天 volatile 的原理,他竟然不认同,我开始怀疑他的实力了)、ThreadLocal、Atomic。</p>
<p>说到这些了,我当时竟然没把他带进我给他挖的坑里去(线程池,之前好好研究过呢,可惜了)</p>
<p>4、Spring IOC 和 AOP 的理解?叫我写 AOP 的代码,我没写</p>
<p>5、JDK 动态代理和 Cglib 代理区别?</p>
<p>6、你觉得项目里面哪些技术比较好?我指了两个,然后他也没有问下去。</p>
<p>7、解释下 XSS 攻击</p>
<p>8、Spring 和 SpringBoot 的区别?</p>
<p>9、JVM 垃圾回收算法?分代中为什么要分三层?</p>
<p>10、OOM 是什么?什么情况会发生?</p>
<p>11、你觉得你有啥优点?</p>
<p>然后就叫我等一会,一会有人事来通知我,结果过了一会人事叫我可以回去等通知了。</p>
<p><strong>总结</strong>:到公司的时候已经一点多钟了,面试直接在一个很多人的地方(吃饭的地方)进行,周围还有人在吃饭,场景有点尴尬。面试过程感觉很随意,想到什么问题就问什么,完全没有衔接,问到的有些地方感觉面试官自己都不清楚,还怀疑我所说的。另外就是问题比较刁钻,总体技术也就那样吧!</p>
<h4 id="目前所在公司"><a href="#目前所在公司" class="headerlink" title="目前所在公司"></a>目前所在公司</h4><p>当时是我现在的老大(架构师)面的,先是电话面试过一次,问的问题也比较难,不过最后还是觉得我基础还是不错的。最后叫我去公司面试下,来到公司面试问的问题那就更难了,几乎好多都回答不出来,但是简单的说了下思路,最后再叫主任面试了下,问的问题就很简单了,最后就是 HR 面了,主要说了下工资问题和什么时候能报道!这几次面试的问题当时由于时间比较紧,也没去整理,现在也记不清楚了!目前自己已经工作了快一个月了,给的项目也完全是新东西,对我的挑战也很大,有时自己也确实不怎么知道,不过我老大很耐心的教我,对我也很不错,这也是我打算留在这里的原因,碰到个好老大不易!必须好好珍惜!</p>
<h3 id="实习感悟"><a href="#实习感悟" class="headerlink" title="实习感悟"></a>Reflections on the internship</h3><p>I joined as a Java development intern on the architecture and operations team, and I have been here almost a month. Honestly, only after starting did I realize how busy a workday really is; this very essay was written late on a weekend night. Right after joining I was handed a Consul project covering service registration, discovery, and health checks, touching Consul, Docker, Nginx, Lua, Elasticsearch, and a few lightweight frameworks, nearly all of it new to me, so it genuinely took time to absorb. While optimizing it and fixing its bugs, my boss walked me through the design several times, which is the only reason I made progress; I will keep optimizing it (and may end up owning the whole project).</p><p>In Shanghai, I live a fair distance from the office: the commute takes nearly an hour each way, which eats much of the day and adds to the feeling of being busy. Working hours are flexible with no clock-in, and although the company doesn't mandate overtime, I rarely leave on time; I want to learn as much as possible during the internship. This stretch is also the most critical: a single problem can take ages to crack and sometimes can't be cracked at all, and at times I feel I can't do anything right and start doubting myself. In truth it exposes a gap in my stack; none of this overlaps with the SSM and Spring Boot work I knew, and there is little business-logic code to write, so it feels painful, nothing like the flow of my earlier projects. Then again, maybe my earlier projects were just too simple.</p>
<p>A passage from Karen's essay on Juejin stuck with me:</p>
<blockquote>
<p><strong>A company doesn't really expect a newcomer to create much value right away. New people need time to grow, and during that growth small problems are inevitable; that is the road most of us walk. The colleagues you see around you have simply been in the field longer and have more experience. Treat them as targets, not as the standard you must meet today, or the pressure will be too much.</strong></p>
</blockquote>
<p>That passage speaks to exactly where I am now. <strong>Push on, get through this stage, and don't give up easily!</strong></p>
<h3 id="书籍推荐"><a href="#书籍推荐" class="headerlink" title="书籍推荐"></a>Recommended books</h3><p>In university I never cared much for games, so I'd like to think I didn't waste the years. The books below helped me a great deal with blogging and with the job hunt later on. If you're already an expert, skip this section; if you're still in university and, like me, would rather not sink your time into games, take a look. For discussion, leave a comment or join the QQ group listed at the end of this post.</p>
<h4 id="Java"><a href="#Java" class="headerlink" title="Java"></a>Java</h4><p>1. 《Java 核心技术》 (Core Java), Volumes I and II: among the better introductory books.</p>
<p>2. 《疯狂 Java 讲义》: a very thick book, heavy on fundamentals.</p>
<p>3. 《Java 并发编程的艺术》 by Fang Tengfei, Wei Peng, and Cheng Xiaoming. Fang Tengfei founded the Concurrent Programming website; its articles are genuinely good, and reading widely there pays off.</p>
<p>4. 《Java 多线程编程核心技术》 by Gao Hongyan: a solid introduction to multithreaded programming. I wrote up reading notes, <a href="http://www.54tianzhisheng.cn/2017/06/04/Java-Thread/" target="_blank" rel="external">notes and summary for 《Java 多线程编程核心技术》</a>, if you'd rather not read the whole book.</p>
<p>5. 《Java 并发编程实战》 (Java Concurrency in Practice): hard going, but an excellent book. If you digest all three concurrency books above, your concurrency fundamentals are probably in good shape; reading the thread-pool source afterwards is even better. Or see my post <a href="http://www.54tianzhisheng.cn/2017/07/29/ThreadPool/" target="_blank" rel="external">an exploration of Java thread pools</a>; big-company interviews almost always ask about thread pools, and that post covers roughly what they ask.</p>
<p>6. 《Effective Java》, 2nd edition (Chinese translation): the standard intermediate Java book; many interview questions come straight out of it.</p>
<p>7. 《深入理解 Java 虚拟机 —— JVM高级特性与最佳实践》, 2nd edition: probably the clearest JVM book published in China. I have only read it once so far and will keep working through it. Big companies almost always test JVM knowledge, Alibaba especially, so if you're aiming there, buy this book.</p>
<p>8. 《深入分析Java Web技术内幕》 (revised edition) by Xu Lingbo: very broad, with every chapter covering a different topic; the author, an Alibaba veteran, clearly knows his material.</p>
<p>9. 《大型网站系统与 Java 中间件实践》 by Zeng Xianjie: the author, a former Taobao technical director, witnessed Taobao's growth. The content is excellent and helps you think from a higher vantage point.</p>
<p>10. 《大型网站技术架构 —— 核心原理与案例分析》 by Li Zhihui: best read alongside the previous book; together they raise the level of your thinking.</p>
<p>11. 《疯狂Java:突破程序员基本功的16课》 by Li Gang: very attentive to Java's fine details and quite deep, though riddled with typos. See my reading notes: <a href="http://www.54tianzhisheng.cn/2017/05/31/Java-16-lession/" target="_blank" rel="external">notes on 《疯狂 Java 突破程序员基本功的 16 课》</a></p>
<p>12. 《Spring 实战》 (Spring in Action): the Spring starter book.</p>
<p>13. 《Spring 揭秘》 by Wang Fuqiang: an outstanding book. Published in 2009 with a Douban rating of 9.0, it explains Spring's IoC and AOP with great clarity and traces Spring's whole lineage. Strongly recommended if you want to learn Spring. The author is a very capable senior architect; I once had the good fortune of talking with him after my blog post <a href="http://www.54tianzhisheng.cn/2017/03/27/Pyspider%E6%A1%86%E6%9E%B6%20%E2%80%94%E2%80%94%20Python%E7%88%AC%E8%99%AB%E5%AE%9E%E6%88%98%E4%B9%8B%E7%88%AC%E5%8F%96%20V2EX%20%E7%BD%91%E7%AB%99%E5%B8%96%E5%AD%90/" target="_blank" rel="external">Pyspider框架 —— Python爬虫实战之爬取 V2EX 网站帖子</a> caught his eye; he even invited me to intern with him, a near miss. He also wrote 《Spring Boot 揭秘》.</p>
<p>14. 《Spring 技术内幕》: walks through the Spring source and its internal mechanisms; worthwhile in my view.</p>
<p>15. Spring's official English documentation: excellent; read the English wherever you can.</p>
<p>16. 《跟开涛学 Spring 3》 and 《跟开涛学 Spring MVC》: by a JD.com expert; much respect.</p>
<p>17. 《看透springMvc源代码分析与实践》: one of the better treatments of the Spring MVC source.</p>
<p>My notes on it:</p>
<p><a href="http://www.54tianzhisheng.cn/2017/07/09/servlet/" target="_blank" rel="external">1. Servlets explained through the source</a></p>
<p><a href="http://www.54tianzhisheng.cn/2017/07/14/Spring-MVC01/" target="_blank" rel="external">2. Reading the Spring MVC source: web fundamentals</a></p>
<p><a href="http://www.54tianzhisheng.cn/2017/07/14/Spring-MVC02/" target="_blank" rel="external">3. Reading the Spring MVC source: a bird's-eye view of Spring MVC</a></p>
<p><a href="http://www.54tianzhisheng.cn/2017/07/21/Spring-MVC03/" target="_blank" rel="external">4. Reading the Spring MVC source: Spring MVC components</a></p>
<p>18. 《Spring Boot 实战》 (Spring Boot in Action)</p>
<p>19. The official Spring Boot Reference Guide: most Spring Boot blog posts online are little more than restatements of it.</p>
<p>20. 《JavaEE开发的颠覆者: Spring Boot实战》</p>
<p>21. For MyBatis, the official documentation is best, and it is available in Chinese.</p>
<p>I have also written a few articles that helped many people get started:</p>
<p>1. <a href="http://www.54tianzhisheng.cn/2017/03/28/%E9%80%9A%E8%BF%87%E9%A1%B9%E7%9B%AE%E9%80%90%E6%AD%A5%E6%B7%B1%E5%85%A5%E4%BA%86%E8%A7%A3Mybatis(%E4%B8%80)/" target="_blank" rel="external">Getting to know MyBatis through a project, part 1</a></p>
<p>2. <a href="http://www.54tianzhisheng.cn/2017/03/28/%E9%80%9A%E8%BF%87%E9%A1%B9%E7%9B%AE%E9%80%90%E6%AD%A5%E6%B7%B1%E5%85%A5%E4%BA%86%E8%A7%A3Mybatis(%E4%BA%8C)/" target="_blank" rel="external">Getting to know MyBatis through a project, part 2</a></p>
<p>3. <a href="http://www.54tianzhisheng.cn/2017/03/28/%E9%80%9A%E8%BF%87%E9%A1%B9%E7%9B%AE%E9%80%90%E6%AD%A5%E6%B7%B1%E5%85%A5%E4%BA%86%E8%A7%A3Mybatis(%E4%B8%89)/" target="_blank" rel="external">Getting to know MyBatis through a project, part 3</a></p>
<p>4. <a href="http://www.54tianzhisheng.cn/2017/03/28/%E9%80%9A%E8%BF%87%E9%A1%B9%E7%9B%AE%E9%80%90%E6%AD%A5%E6%B7%B1%E5%85%A5%E4%BA%86%E8%A7%A3Mybatis(%E5%9B%9B)/" target="_blank" rel="external">Getting to know MyBatis through a project, part 4</a></p>
<p>22. 《深入理解 Java 内存模型》 by Cheng Xiaoming: I think every Java programmer should understand the Java memory model. I read the e-book edition; it's short, but it explains reordering, sequential consistency, volatile, locks, final, and more with great clarity.</p>
<h4 id="Linux"><a href="#Linux" class="headerlink" title="Linux"></a>Linux</h4><p>《鸟哥的Linux私房菜:基础学习篇》 (3rd ed.)</p>
<p>《鸟哥的Linux私房菜:服务器架设篇》 (3rd ed.), also by "Bird Brother" (Niao Ge)</p>
<h4 id="计算机网络"><a href="#计算机网络" class="headerlink" title="计算机网络"></a>Computer networks</h4><p>《计算机网络》 (6th ed.), Xie Xiren</p>
<p>《计算机网络:自顶向下方法》 (Computer Networking: A Top-Down Approach)</p>
<h4 id="计算机系统"><a href="#计算机系统" class="headerlink" title="计算机系统"></a>Computer systems</h4><p>《代码揭秘:从C/C++的角度探秘计算机系统》, Zuo Fei</p>
<p>《深入理解计算机系统》 (Computer Systems: A Programmer's Perspective)</p>
<p>《计算机科学导论》, Forouzan</p>
<h4 id="数据库"><a href="#数据库" class="headerlink" title="数据库"></a>Databases</h4><p>《高性能MySQL》 (High Performance MySQL)</p>
<p>《MySQL技术内幕:InnoDB存储引擎》</p>
<h4 id="Python"><a href="#Python" class="headerlink" title="Python"></a>Python</h4><p>Python's syntax is simple and quick to pick up, though I haven't used it in a long while and have forgotten most of it. I learned it from Liao Xuefeng's Python tutorial.</p>
<p>I also wrote several crawler posts in Python, some building on the work of others. My thanks to those who planted the trees!</p>
<h4 id="工具"><a href="#工具" class="headerlink" title="工具"></a>Tools</h4><p>Git: Liao Xuefeng's Git tutorial</p>
<p>IDEA: <a href="https://github.com/judasn/IntelliJ-IDEA-Tutorial" target="_blank" rel="external">IntelliJ IDEA 简体中文专题教程</a></p>
<p>Maven: 《Maven实战》</p>
<h4 id="其他"><a href="#其他" class="headerlink" title="其他"></a>Other</h4><p>《如何高效学习》, Scott Young: on how to learn effectively.</p>
<p>《软技能:代码之外的生存指南》 (Soft Skills): a programmer needs more than code.</p>
<p>《提问的智慧》 (Chinese translation)</p>
<p><a href="https://github.com/ryanhanwu/How-To-Ask-Questions-The-Smart-Way" target="_blank" rel="external">《How-To-Ask-Questions-The-Smart-Way》</a>: as a programmer you must learn to ask questions well, or nobody will bother answering you.</p>
<h3 id="优秀网站推荐"><a href="#优秀网站推荐" class="headerlink" title="优秀网站推荐"></a>Recommended sites</h3><p>1. GitHub: needs no introduction.</p>
<p>2. InfoQ: consistently good articles.</p>
<p>3. CSDN: I often read the blog experts there; plenty of heavyweights. Mine: <a href="http://blog.csdn.net/tzs_1041218129" target="_blank" rel="external">zhisheng</a></p>
<p>4. Zhihu: follow the experts and watch them talk shop.</p>
<p>5. Juejin: I write a column there myself, with over ten thousand followers. Mine: <a href="https://juejin.im/user/57510b82128fe10056ca70fc" target="_blank" rel="external">zhisheng</a></p>
<p>6. The Concurrent Programming website: introduced earlier.</p>
<p>7. developerWorks: also has strong articles.</p>
<p>8. Cnblogs: plenty of experts there too, though I've never blogged on it myself.</p>
<p>9. WeChat public accounts: I follow many; some write genuinely good pieces that I read regularly.</p>
<p>10. Nowcoder: a great place to grind written-test questions, full of strong people. I miss the lectures by Ye and Zuo, and the ever-helpful "niu mei".</p>
<h3 id="优秀博客推荐"><a href="#优秀博客推荐" class="headerlink" title="优秀博客推荐"></a>Recommended blogs</h3><p><a href="https://www.liaoxuefeng.com/" target="_blank" rel="external">廖雪峰</a> (Liao Xuefeng): where I learned my Git and Python basics</p>
<p><a href="http://www.ruanyifeng.com/blog/" target="_blank" rel="external">阮一峰的网络日志</a> (Ruan Yifeng)</p>
<p><a href="https://coolshell.cn/" target="_blank" rel="external">酷壳-陈皓</a> (CoolShell, Chen Hao)</p>
<p><a href="https://www.zhihu.com/people/rednaxelafx/answers" target="_blank" rel="external">RednaxelaFX</a>: "R大", formidably knowledgeable</p>
<p><a href="http://calvin1978.blogcn.com/" target="_blank" rel="external">江南白衣</a>: a seasoned veteran</p>
<p><a href="http://stormzhang.com/" target="_blank" rel="external">stormzhang</a>: runs an excellent WeChat public account</p>
<p><a href="http://lovestblog.cn/" target="_blank" rel="external">你假笨</a>: works on the JVM at Alibaba; very strong</p>
<p><a href="http://www.jianshu.com/u/90ab66c248e6" target="_blank" rel="external">占小狼</a></p>
<p><a href="http://www.bysocket.com/" target="_blank" rel="external">泥瓦匠BYSocket</a></p>
<p><a href="http://cuiqingcai.com/" target="_blank" rel="external">崔庆才</a> (Cui Qingcai): many articles on Python crawlers</p>
<p><a href="http://www.ityouknow.com/" target="_blank" rel="external">纯洁的微笑</a>: a great Spring Boot series, and a writing style I enjoy in his other posts too</p>
<p><a href="http://blog.didispace.com/" target="_blank" rel="external">程序猿DD</a></p>
<p><a href="http://itmuch.com/" target="_blank" rel="external">周立</a> (Zhou Li)</p>
<p><a href="http://vip.iocoder.cn/" target="_blank" rel="external">芋道源码的博客</a>: many source-code analysis series</p>
<p><a href="http://www.54tianzhisheng.cn/" target="_blank" rel="external">zhisheng</a>: shamelessly, my own blog</p>
<h3 id="求职资料放送"><a href="#求职资料放送" class="headerlink" title="求职资料放送"></a>Job-hunting materials giveaway</h3><p>While preparing for the job hunt I systematically reviewed what I had learned in university and collected many excellent blog posts along the way, plus some study videos and hands-on project videos. Because the collection holds a lot of my own effort, I have only ever shared it with a few classmates. Taking this essay as the occasion, I would like to pass it on to whoever needs it: perhaps you are grinding through a hard job search; perhaps you are a first- or second-year who wants a goal to strive for instead of drifting through university; perhaps you have worked a year or two and feel your fundamentals are still weak. When you ask for it, please introduce yourself briefly. Contact details are at the very bottom of this post.</p>
<h3 id="最后"><a href="#最后" class="headerlink" title="最后"></a>Finally</h3><p>One parting line: <strong>the harder you work, the luckier you get. May you become an expert soon!</strong></p>
<p>You can find me here:</p>
<ul>
<li>blog: <a href="http://www.54tianzhisheng.cn/" target="_blank" rel="external">http://www.54tianzhisheng.cn/</a></li>
<li>GitHub: <a href="https://github.com/zhisheng17" target="_blank" rel="external">https://github.com/zhisheng17</a></li>
<li>QQ group: 528776268</li>
</ul>
<hr>
]]></content>
<summary type="html">
<p><img src="http://ohfk1r827.bkt.clouddn.com/shanghai.jpeg-1" alt=""></p>
</summary>
<category term="面经" scheme="http://yoursite.com/tags/%E9%9D%A2%E7%BB%8F/"/>
</entry>
<entry>
<title>Installing a Lua development environment and luafilesystem on Linux</title>
<link href="http://yoursite.com/2017/09/15/linux-lua-lfs-install/"/>
<id>http://yoursite.com/2017/09/15/linux-lua-lfs-install/</id>
<published>2017-09-15T15:17:26.540Z</published>
<updated>2017-09-15T15:21:45.473Z</updated>
<content type="html"><![CDATA[<p>In the words of the Fire Cloud Evil God: of all martial arts under heaven, none is unbreakable, except speed. Speed is Nginx's signature skill, and it is Lua's forte too, so the pairing has a genetic advantage where speed is concerned.</p>
<p><img src="http://www.lua.org/images/lua.gif" alt=""><br><a id="more"></a><br>I've been wrestling with this setup lately, so I'm writing it down while it's fresh, as insurance against falling into the same pits again.</p>
<p>Installation:</p>
<h3 id="1-先安装-lua-的相关依赖"><a href="#1-先安装-lua-的相关依赖" class="headerlink" title="1.先安装 lua 的相关依赖"></a>1. Install Lua's prerequisites</h3><p>Set up a C build environment.<br>Because the gcc package depends on binutils and cpp, and make is needed for any build, nine packages cover the installation; nine commands and you're done:</p>
<figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div></pre></td><td class="code"><pre><div class="line"># fix for "gcc: command not found":</div><div class="line">yum install cpp</div><div class="line">yum install binutils</div><div class="line">yum install glibc</div><div class="line">yum install glibc-kernheaders</div><div class="line">yum install glibc-common</div><div class="line">yum install glibc-devel</div><div class="line">yum install gcc</div><div class="line">yum install make</div><div class="line">yum install readline-devel</div></pre></td></tr></table></figure>
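<p>The nine installs above can also run as a single yum transaction. The sketch below only assembles and prints the command (package names copied from the list; drop the echo to actually run it, which requires root on a yum-based system):</p>

```shell
# Collapse the nine installs into one transaction; printed here, not executed.
pkgs="cpp binutils glibc glibc-kernheaders glibc-common glibc-devel gcc make readline-devel"
echo "yum -y install $pkgs"
```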
<h3 id="2-安装-lua5-1-5"><a href="#2-安装-lua5-1-5" class="headerlink" title="2.安装 lua5.1.5"></a>2. Install Lua 5.1.5</h3><p>Download: <a href="http://www.lua.org/ftp/" target="_blank" rel="external">http://www.lua.org/ftp/</a></p>
<figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div><div class="line">14</div><div class="line">15</div><div class="line">16</div><div class="line">17</div><div class="line">18</div><div class="line">19</div><div class="line">20</div></pre></td><td class="code"><pre><div class="line">tar -zxvf lua-5.1.5.tar.gz</div><div class="line">cd lua-5.1.5</div><div class="line">vi Makefile</div><div class="line"># set INSTALL_TOP= /usr/local/lua</div><div class="line">make linux</div><div class="line">make test</div><div class="line">make install</div><div class="line">rm -rf /usr/bin/lua</div><div class="line">ln -s /usr/local/lua/bin/lua /usr/bin/lua</div><div class="line">ln -s /usr/local/lua/share/lua /usr/share/lua</div><div class="line"></div><div class="line"># set the environment variables:</div><div class="line">vim /etc/profile</div><div class="line"></div><div class="line"># append:</div><div class="line">export LUA_HOME=/usr/local/lua</div><div class="line">export PATH=$PATH:$LUA_HOME/bin</div><div class="line"></div><div class="line"># apply them:</div><div class="line">source /etc/profile</div></pre></td></tr></table></figure>
<h3 id="3、安装-luarocks"><a href="#3、安装-luarocks" class="headerlink" title="3、安装 luarocks"></a>3. Install LuaRocks</h3><p>LuaRocks is a package manager for Lua, itself written in Lua, offering a command-line way to manage Lua dependencies and install third-party Lua packages.</p>
<p>Project: <a href="https://github.com/luarocks/luarocks" target="_blank" rel="external">https://github.com/luarocks/luarocks</a></p>
<figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div><div class="line">14</div><div class="line">15</div><div class="line">16</div></pre></td><td class="code"><pre><div class="line"># luarocks-2.2.1 worked on my machine; luarocks-2.4.2 ran into problems</div><div class="line"></div><div class="line">wget http://luarocks.org/releases/luarocks-2.2.1.tar.gz</div><div class="line"></div><div class="line">tar -zxvf luarocks-2.2.1.tar.gz</div><div class="line"></div><div class="line">cd luarocks-2.2.1</div><div class="line"></div><div class="line">./configure --with-lua=/usr/local --with-lua-include=/usr/local/lua/include</div><div class="line"></div><div class="line"># set the environment variables:</div><div class="line"></div><div class="line">export LUA_LUAROCKS_PATH=/usr/local/luarocks-2.2.1</div><div class="line">export PATH=$PATH:$LUA_LUAROCKS_PATH</div><div class="line"></div><div class="line">make && make install</div></pre></td></tr></table></figure>
<h3 id="4、安装-luafilesystem"><a href="#4、安装-luafilesystem" class="headerlink" title="4、安装 luafilesystem"></a>4. Install luafilesystem</h3><p>luafilesystem (lfs) is a cross-platform library for file-system access from Lua, supporting Lua 5.1 and 5.2. LuaRocks must be installed before you can install lfs. My task happened to need exactly this module.</p>
<p>Project: <a href="https://github.com/keplerproject/luafilesystem" target="_blank" rel="external">https://github.com/keplerproject/luafilesystem</a></p>
<p>Docs: <a href="http://keplerproject.github.io/luafilesystem/index.html" target="_blank" rel="external">http://keplerproject.github.io/luafilesystem/index.html</a></p>
<figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">luarocks install luafilesystem</div></pre></td></tr></table></figure>
<h3 id="5、测试"><a href="#5、测试" class="headerlink" title="5、测试"></a>5. Test</h3><p>Check that Lua installed correctly:</p>
<p><code>lua -v</code></p>
<p>Result:</p>
<figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">Lua 5.1.5 Copyright (C) 1994-2012 Lua.org, PUC-Rio</div></pre></td></tr></table></figure>
<p>Check that luafilesystem installed correctly:</p>
<p>a.lua</p>
<figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div></pre></td><td class="code"><pre><div class="line">local lfs = require"lfs"</div><div class="line"></div><div class="line">function Rreturn(filePath)</div><div class="line"> local time = os.date("%a, %d %b %Y %X GMT", lfs.attributes(filePath).modification)</div><div class="line"> -- print the file's modification time</div><div class="line"> print(time)</div><div class="line">end</div><div class="line"></div><div class="line">Rreturn("/opt/lua/a.txt")</div></pre></td></tr></table></figure>
<p>a.txt</p>
<figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div></pre></td><td class="code"><pre><div class="line">a</div><div class="line">b</div><div class="line">c</div></pre></td></tr></table></figure>
<p>Run it:</p>
<figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">lua a.lua</div></pre></td></tr></table></figure>
<p>Result:</p>
<figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">Tue, 12 Sep 2017 18:43:13 GMT</div></pre></td></tr></table></figure>
<p>If the timestamp prints, the installation succeeded.</p>
<hr>
<p>The steps above are for Linux; Windows is even simpler, except that for luafilesystem you download lfs.dll yourself and drop it into Lua's install directory. It's straightforward, so I won't go into detail here.</p>
<h3 id="出现过的错误:"><a href="#出现过的错误:" class="headerlink" title="出现过的错误:"></a>Errors encountered</h3><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div></pre></td><td class="code"><pre><div class="line">[root@n1 lua-5.1.5]# make linux test</div><div class="line">cd src && make linux</div><div class="line">make[1]: Entering directory `/opt/lua-5.1.5/src'</div><div class="line">make all MYCFLAGS=-DLUA_USE_LINUX MYLIBS="-Wl,-E -ldl -lreadline -lhistory -lncurses"</div><div class="line">make[2]: Entering directory `/opt/lua-5.1.5/src'</div><div class="line">gcc -O2 -Wall -DLUA_USE_LINUX -c -o lapi.o lapi.c</div><div class="line">make[2]: gcc: command not found</div><div class="line">make[2]: *** [lapi.o] Error 127</div><div class="line">make[2]: Leaving directory `/opt/lua-5.1.5/src'</div><div class="line">make[1]: *** [linux] Error 2</div><div class="line">make[1]: Leaving directory `/opt/lua-5.1.5/src'</div><div class="line">make: *** [linux] Error 2</div></pre></td></tr></table></figure>
<p><strong>Cause</strong>: the prerequisite packages listed at the start were never installed.</p>
]]></content>
<summary type="html">
<p>In the words of the Fire Cloud Evil God: of all martial arts under heaven, none is unbreakable, except speed. Speed is Nginx's signature skill, and it is Lua's forte too, so the pairing has a genetic advantage where speed is concerned.</p>
<p><img src="http://www.lua.org/images/lua.gif" alt=""><br>
</summary>
<category term="lua" scheme="http://yoursite.com/tags/lua/"/>
</entry>
<entry>
<title>A beginner's guide to building an Elasticsearch full-text search cluster</title>
<link href="http://yoursite.com/2017/09/09/Elasticsearch-install/"/>
<id>http://yoursite.com/2017/09/09/Elasticsearch-install/</id>
<published>2017-09-09T03:56:42.374Z</published>
<updated>2017-09-09T04:59:12.775Z</updated>
<content type="html"><![CDATA[<h3 id="介绍"><a href="#介绍" class="headerlink" title="介绍"></a>Introduction</h3><p>Elasticsearch is a search server built on Lucene. It provides a distributed, multi-tenant full-text search engine behind a RESTful web interface. Written in Java and released as open source under the Apache License, it is one of today's most popular enterprise search engines: designed for the cloud, it offers near-real-time search and aims to be stable, reliable, fast, and easy to install. Wikipedia, Stack Overflow, and GitHub all use it.</p>
<p>Starting from scratch, this post explains how to build your own full-text search engine with Elasticsearch. Every step is spelled out in detail; follow along and you'll get there.<br><a id="more"></a></p>
<h3 id="环境"><a href="#环境" class="headerlink" title="环境"></a>Environment</h3><p>1. VMware</p>
<p>2. CentOS 6.6</p>
<p>3. Elasticsearch 5.5.2</p>
<p>4. JDK 1.8</p>
<p>I'll skip installing VMware and installing CentOS inside it; the default configuration is fine, though it's best to give each VM generous memory (2 GB recommended).</p>
<p>Run dhclient to obtain an IP address automatically; check the assigned address with ip addr or ifconfig, which list the NIC and loopback interfaces.</p>
<p>Give each Linux VM a fixed IP (otherwise you'll find yourself rerunning dhclient after every reboot):</p>
<figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">vim /etc/sysconfig/network-scripts/ifcfg-eth0</div></pre></td></tr></table></figure>
<p>Change:</p>
<figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div></pre></td><td class="code"><pre><div class="line">onboot=yes</div><div class="line">bootproto=static</div></pre></td></tr></table></figure>
<p>Add (optional):</p>
<figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div></pre></td><td class="code"><pre><div class="line">IPADDR=192.168.1.113   # the NIC's IP address</div><div class="line">GATEWAY=192.168.1.1</div><div class="line">NETMASK=255.255.255.0</div></pre></td></tr></table></figure>
<p>Once that's set, restart the network service: <code>service network restart</code></p>
<p>Reference for changing the IP address: <a href="http://jingyan.baidu.com/article/e4d08ffdd417660fd3f60d70.html" target="_blank" rel="external">http://jingyan.baidu.com/article/e4d08ffdd417660fd3f60d70.html</a></p>
<p>With the base environment ready, here are the installation steps:</p>
<h3 id="安装-JDK-1-8"><a href="#安装-JDK-1-8" class="headerlink" title="安装 JDK 1.8"></a>Install JDK 1.8</h3><p>First remove the bundled OpenJDK. Find it:</p>
<figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">rpm -qa | grep java</div></pre></td></tr></table></figure>
<p>Remove it:</p>
<figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div></pre></td><td class="code"><pre><div class="line">yum -y remove java-1.7.0-openjdk-1.7.0.65-2.5.1.2.el6_5.x86_64</div><div class="line">yum -y remove java-1.6.0-openjdk-1.6.0.0-11.1.13.4.el6.x86_64</div></pre></td></tr></table></figure>
<p><strong>Unpack the JDK archive:</strong></p>
<p>JDK 1.8 download:<br><a href="http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html" target="_blank" rel="external">http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html</a></p>
<p>After unpacking, you only need to configure the environment variables.</p>
<p>1. Create a java folder under /usr/local:</p>
<figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div></pre></td><td class="code"><pre><div class="line">cd /usr/local/    # enter the directory</div><div class="line">mkdir java        # create the java directory</div></pre></td></tr></table></figure>
<p>2. Copy the archive into the java directory, then unpack it there:</p>
<figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div></pre></td><td class="code"><pre><div class="line">cp /usr/jdk-8u144-linux-x64.tar.gz /usr/local/java/    # copy into the java directory (match your own file name)</div><div class="line">tar -zxvf jdk-8u144-linux-x64.tar.gz    # unpack into the current (java) directory</div></pre></td></tr></table></figure>
<p>3. Unpacking leaves a jdk1.8.0_144 directory under java. Now configure the environment variables:<br>edit the profile file under /etc/</p>
<figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div></pre></td><td class="code"><pre><div class="line">vi /etc/profile    # open profile for editing</div><div class="line"></div><div class="line"># append the following at the end (adjust the paths to your own layout!)</div><div class="line"></div><div class="line"> JAVA_HOME=/usr/local/java/jdk1.8.0_144</div><div class="line"> CLASSPATH=$JAVA_HOME/lib/</div><div class="line"> PATH=$PATH:$JAVA_HOME/bin</div><div class="line"> export PATH JAVA_HOME CLASSPATH</div></pre></td></tr></table></figure>
<p>Save and quit the file.</p>
<p>Apply the changes: <code>source /etc/profile</code></p>
<p>Run <code>java -version</code> in the console and check for output like:</p>
<figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div></pre></td><td class="code"><pre><div class="line">java version "1.8.0_144"</div><div class="line"> Java(TM) SE Runtime Environment (build 1.8.0_144-b01)</div><div class="line"> Java HotSpot(TM) Client VM (build 25.144-b01, mixed mode)</div></pre></td></tr></table></figure>
<p>If you see output like that, the JDK is installed.</p>
<hr>
<h3 id="安装-Maven"><a href="#安装-Maven" class="headerlink" title="安装 Maven"></a>Install Maven</h3><p>Maven will probably be needed later, so install it now.</p>
<p>1. Download Maven:</p>
<figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">wget http://mirrors.hust.edu.cn/apache/maven/maven-3/3.2.5/binaries/apache-maven-3.2.5-bin.tar.gz</div></pre></td></tr></table></figure>
<p>2. Unpack it into /usr/local:</p>
<figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">tar -zxvf apache-maven-3.2.5-bin.tar.gz</div></pre></td></tr></table></figure>
<p>3. Apply your organization's configuration:</p>
<p>Swap in the settings.xml your company provides and adjust the local repository location; the default is ${user.home}/.m2/repository.</p>
<p>4. Configure the environment by appending two lines to /etc/profile:</p>
<figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div></pre></td><td class="code"><pre><div class="line">export MAVEN_HOME=/usr/local/apache-maven-3.2.5</div><div class="line">export PATH=${PATH}:${MAVEN_HOME}/bin</div></pre></td></tr></table></figure>
<p>5. Test:</p>
<figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div></pre></td><td class="code"><pre><div class="line">[root@localhost ~]# mvn -v</div><div class="line">Apache Maven 3.2.5 (12a6b3acb947671f09b81f49094c53f426d8cea1; 2014-12-14T09:29:23-08:00)</div><div class="line">Maven home: /usr/local/apache-maven-3.2.5</div></pre></td></tr></table></figure>
<p>The three VMs in VMware have the IP addresses:</p>
<figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div></pre></td><td class="code"><pre><div class="line">192.168.153.133</div><div class="line">192.168.153.134</div><div class="line">192.168.153.132</div></pre></td></tr></table></figure>
<h3 id="配置-hosts"><a href="#配置-hosts" class="headerlink" title="配置 hosts"></a>Configure hosts</h3><p>In /etc/hosts, map each IP to its node name (name resolution):</p>
<figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">vim /etc/hosts</div></pre></td></tr></table></figure>
<p>Add:</p>
<figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div></pre></td><td class="code"><pre><div class="line">192.168.153.133 es1</div><div class="line">192.168.153.134 es2</div><div class="line">192.168.153.132 es3</div></pre></td></tr></table></figure>
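<p>A small sketch of making those hosts additions idempotent, so a provisioning script can be rerun without duplicating entries. It uses a scratch file here purely for illustration; on a real node you would point HOSTS_FILE at /etc/hosts:</p>

```shell
# Append each node mapping only if that exact line is not already present.
HOSTS_FILE=./hosts.demo
printf '127.0.0.1 localhost\n' > "$HOSTS_FILE"
for entry in "192.168.153.133 es1" "192.168.153.134 es2" "192.168.153.132 es3"; do
    grep -qxF "$entry" "$HOSTS_FILE" || echo "$entry" >> "$HOSTS_FILE"
done
cat "$HOSTS_FILE"
```

Running the loop a second time leaves the file unchanged, which is the point of the grep guard.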
<h3 id="设置-SSH-免密码登录"><a href="#设置-SSH-免密码登录" class="headerlink" title="设置 SSH 免密码登录"></a>Set up passwordless SSH</h3><p>Install the expect command: yum -y install expect</p>
<p>Unzip ssh_p2p.zip into any directory (the archive can be found online):</p>
<figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">unzip ssh_p2p.zip</div></pre></td></tr></table></figure>
<p>Edit the ip values in the resource file:</p>
<figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">vim /ssh_p2p/deploy_data/resource    # each node's IP, username, and password; "free" means every node can log in to every other without a password</div></pre></td></tr></table></figure>
<figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div></pre></td><td class="code"><pre><div class="line"># set each VM's IP address, username, and password</div><div class="line">address=(</div><div class="line">"192.168.153.133,root,123456,free"</div><div class="line">"192.168.153.134,root,123456,free"</div><div class="line">"192.168.153.132,root,123456,free"</div><div class="line">)</div></pre></td></tr></table></figure>
<p>Make start.sh executable:</p>
<figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">chmod u+x start.sh</div></pre></td></tr></table></figure>
<p>Run it:</p>
<figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">./start.sh</div></pre></td></tr></table></figure>
<p>Test:</p>
<p>ssh &lt;ip address&gt; (check that login works without a password)</p>
<h3 id="安装-ElasticSearch"><a href="#安装-ElasticSearch" class="headerlink" title="安装 ElasticSearch"></a>Install Elasticsearch</h3><p>Download: <a href="https://www.elastic.co/downloads/elasticsearch" target="_blank" rel="external">https://www.elastic.co/downloads/elasticsearch</a></p>
<figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div></pre></td><td class="code"><pre><div class="line">wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-5.5.2.tar.gz</div><div class="line">cd /usr/local</div><div class="line">tar -zxvf elasticsearch-5.5.2.tar.gz</div></pre></td></tr></table></figure>
<p><code>su tzs</code> switches to the tzs user (by default, Elasticsearch refuses to run as root).</p>
<p><code>sh /usr/local/elasticsearch/bin/elasticsearch -d</code> (-d starts it in the background)</p>
<p>Test inside the VM: curl <a href="http://localhost:9200/" target="_blank" rel="external">http://localhost:9200/</a></p>
<p><img src="http://ohfk1r827.bkt.clouddn.com/test.jpg-1" alt="test"></p>
<p>Output like the screenshot above means the install succeeded.</p>
<p>By default, Elasticsearch's REST API on port 9200 does not bind to an external IP address, so the host machine cannot reach the service in the VM; it can only be reached from inside via <a href="http://localhost:9200" target="_blank" rel="external">http://localhost:9200</a>. To change that, edit /usr/local/elasticsearch/config/elasticsearch.yml and add these two lines:</p>
<figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div></pre></td><td class="code"><pre><div class="line">network.bind_host: 0.0.0.0</div><div class="line">network.publish_host: _nonloopback:ipv4</div></pre></td></tr></table></figure>
<p>Alternatively, uncomment network.host and http.port and set network.host to the machine's external IP. Then restart Elasticsearch.</p>
<p>To stop it: run <code>ps -ef | grep elasticsearch</code>, find the process, and kill it.</p>
<p>If the service still can't be reached from outside, the firewall settings are the likely cause (stop the firewall with <code>service iptables stop</code>).</p>
<p>Edit the config file: <code>vim config/elasticsearch.yml</code></p>
<p>cluster.name: my-app (nodes sharing the same cluster name form one cluster)</p>
<p>node.name: es1 (the node's name; it must match the name configured in hosts above)</p>
<p>path.data: /data/elasticsearch/data (the data path; create it if missing with <code>mkdir -p /data/elasticsearch/{data,logs}</code> and give the run user ownership: <code>chown tzs /data/elasticsearch/{data,logs} -R</code>)<br>path.logs: /data/elasticsearch/logs (the log path; same treatment as above)<br>network.host: 0.0.0.0 (allow external access; this can also be the node's own IP address)<br>http.port: 9200 (the listening port)<br>discovery.zen.ping.unicast.hosts: ["192.168.153.133", "192.168.153.134", "192.168.153.132"] (the IP addresses of all nodes)</p>
<p>Remember to also add the following (needed later by the head plugin, not strictly required yet):<br>http.cors.enabled: true<br>http.cors.allow-origin: "*"</p>
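<p>Pulled together, a per-node elasticsearch.yml for this three-node cluster might look like the fragment below. This is a sketch for node es1 (swap node.name on the other two machines); the names and paths are the ones used above:</p>

```yaml
# /usr/local/elasticsearch/config/elasticsearch.yml (node es1)
cluster.name: my-app
node.name: es1
path.data: /data/elasticsearch/data
path.logs: /data/elasticsearch/logs
network.host: 0.0.0.0
http.port: 9200
discovery.zen.ping.unicast.hosts: ["192.168.153.133", "192.168.153.134", "192.168.153.132"]
# needed later by the head plugin
http.cors.enabled: true
http.cors.allow-origin: "*"
```

Restart each node after editing; the nodes then discover one another through the unicast host list.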
<p>Finally, from a browser on the host the result looks like this:</p>
<p><img src="http://ohfk1r827.bkt.clouddn.com/test-on-bro.jpg-1" alt="test-on-bro"></p>
<h3 id="安装-IK-中文分词"><a href="#安装-IK-中文分词" class="headerlink" title="安装 IK 中文分词"></a>Install the IK Chinese analyzer</h3><p>You can build it from source with Maven, or, to save the trouble, download a prebuilt release:</p>
<p><a href="https://github.com/medcl/elasticsearch-analysis-ik/releases/tag/v5.5.2" target="_blank" rel="external">https://github.com/medcl/elasticsearch-analysis-ik/releases/tag/v5.5.2</a></p>
<p>Take care to download the release matching your Elasticsearch version, and place it under the plugins directory.</p>
<p>Unzip it:</p>
<p><code>unzip elasticsearch-analysis-ik-5.5.2.zip</code></p>
<p>Create an ik directory under Elasticsearch's plugins directory:</p>
<p><code>mkdir ik</code></p>
<p>Copy the unzipped files into the ik directory:</p>
<p><code>cp -r elasticsearch/* ik</code></p>
<p>Remove the leftovers:</p>
<figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div></pre></td><td class="code"><pre><div class="line">rm -rf elasticsearch</div><div class="line">rm -rf elasticsearch-analysis-ik-5.5.2.zip</div></pre></td></tr></table></figure>
<h4 id="IK-带有两个分词器"><a href="#IK-带有两个分词器" class="headerlink" title="IK 带有两个分词器"></a>IK 带有两个分词器</h4><p><strong>ik_max_word</strong> :会将文本做最细粒度的拆分;尽可能多的拆分出词语</p>
<p><strong>ik_smart</strong>: performs the coarsest-grained split; terms already emitted will not be reused by other terms.</p>
<p>After installing the IK Chinese analyzer (it is not the only one; for others, see my article <a href="http://www.54tianzhisheng.cn/2017/09/07/Elasticsearch-analyzers/" target="_blank" rel="external">A comparison of Elasticsearch's default analyzers and Chinese analyzers and how to use them</a>), the difference in test output is as follows:</p>
<h5 id="ik-max-word"><a href="#ik-max-word" class="headerlink" title="ik_max_word"></a>ik_max_word</h5><p>curl -XGET '<a href="http://192.168.153.134:9200/_analyze?pretty&analyzer=ik_max_word" target="_blank" rel="external">http://192.168.153.134:9200/_analyze?pretty&analyzer=ik_max_word</a>' -d '联想是全球最大的笔记本厂商'</p>
<figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div><div class="line">14</div><div class="line">15</div><div class="line">16</div><div class="line">17</div><div class="line">18</div><div class="line">19</div><div class="line">20</div><div class="line">21</div><div class="line">22</div><div class="line">23</div><div class="line">24</div><div class="line">25</div><div class="line">26</div><div class="line">27</div><div class="line">28</div><div class="line">29</div><div class="line">30</div><div class="line">31</div><div class="line">32</div><div class="line">33</div><div class="line">34</div><div class="line">35</div><div class="line">36</div><div class="line">37</div><div class="line">38</div><div class="line">39</div><div class="line">40</div><div class="line">41</div><div class="line">42</div><div class="line">43</div><div class="line">44</div><div class="line">45</div><div class="line">46</div><div class="line">47</div><div class="line">48</div><div class="line">49</div><div class="line">50</div><div class="line">51</div><div class="line">52</div><div class="line">53</div><div class="line">54</div><div class="line">55</div><div class="line">56</div><div class="line">57</div><div class="line">58</div><div class="line">59</div><div class="line">60</div><div class="line">61</div><div class="line">62</div><div class="line">63</div><div class="line">64</div><div class="line">65</div><div class="line">66</div><div class="line">67</div></pre></td><td class="code"><pre><div class="line">{</div><div class="line"> "tokens" : [</div><div class="line"> {</div><div class="line"> "token" : "联想",</div><div class="line"> "start_offset" : 
0,</div><div class="line"> "end_offset" : 2,</div><div class="line"> "type" : "CN_WORD",</div><div class="line"> "position" : 0</div><div class="line"> },</div><div class="line"> {</div><div class="line"> "token" : "是",</div><div class="line"> "start_offset" : 2,</div><div class="line"> "end_offset" : 3,</div><div class="line"> "type" : "CN_CHAR",</div><div class="line"> "position" : 1</div><div class="line"> },</div><div class="line"> {</div><div class="line"> "token" : "全球",</div><div class="line"> "start_offset" : 3,</div><div class="line"> "end_offset" : 5,</div><div class="line"> "type" : "CN_WORD",</div><div class="line"> "position" : 2</div><div class="line"> },</div><div class="line"> {</div><div class="line"> "token" : "最大",</div><div class="line"> "start_offset" : 5,</div><div class="line"> "end_offset" : 7,</div><div class="line"> "type" : "CN_WORD",</div><div class="line"> "position" : 3</div><div class="line"> },</div><div class="line"> {</div><div class="line"> "token" : "的",</div><div class="line"> "start_offset" : 7,</div><div class="line"> "end_offset" : 8,</div><div class="line"> "type" : "CN_CHAR",</div><div class="line"> "position" : 4</div><div class="line"> },</div><div class="line"> {</div><div class="line"> "token" : "笔记本",</div><div class="line"> "start_offset" : 8,</div><div class="line"> "end_offset" : 11,</div><div class="line"> "type" : "CN_WORD",</div><div class="line"> "position" : 5</div><div class="line"> },</div><div class="line"> {</div><div class="line"> "token" : "笔记",</div><div class="line"> "start_offset" : 8,</div><div class="line"> "end_offset" : 10,</div><div class="line"> "type" : "CN_WORD",</div><div class="line"> "position" : 6</div><div class="line"> },</div><div class="line"> {</div><div class="line"> "token" : "本厂",</div><div class="line"> "start_offset" : 10,</div><div class="line"> "end_offset" : 12,</div><div class="line"> "type" : "CN_WORD",</div><div class="line"> "position" : 7</div><div class="line"> 
},</div><div class="line"> {</div><div class="line"> "token" : "厂商",</div><div class="line"> "start_offset" : 11,</div><div class="line"> "end_offset" : 13,</div><div class="line"> "type" : "CN_WORD",</div><div class="line"> "position" : 8</div><div class="line"> }</div><div class="line"> ]</div><div class="line">}</div></pre></td></tr></table></figure>
<h5 id="ik-smart"><a href="#ik-smart" class="headerlink" title="ik_smart"></a>ik_smart</h5><p>curl -XGET '<a href="http://localhost:9200/_analyze?pretty&analyzer=ik_smart" target="_blank" rel="external">http://localhost:9200/_analyze?pretty&analyzer=ik_smart</a>' -d '联想是全球最大的笔记本厂商'</p>
<figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div><div class="line">14</div><div class="line">15</div><div class="line">16</div><div class="line">17</div><div class="line">18</div><div class="line">19</div><div class="line">20</div><div class="line">21</div><div class="line">22</div><div class="line">23</div><div class="line">24</div><div class="line">25</div><div class="line">26</div><div class="line">27</div><div class="line">28</div><div class="line">29</div><div class="line">30</div><div class="line">31</div><div class="line">32</div><div class="line">33</div><div class="line">34</div><div class="line">35</div><div class="line">36</div><div class="line">37</div><div class="line">38</div><div class="line">39</div><div class="line">40</div><div class="line">41</div><div class="line">42</div><div class="line">43</div><div class="line">44</div><div class="line">45</div><div class="line">46</div><div class="line">47</div><div class="line">48</div><div class="line">49</div><div class="line">50</div><div class="line">51</div><div class="line">52</div><div class="line">53</div></pre></td><td class="code"><pre><div class="line">{</div><div class="line"> "tokens" : [</div><div class="line"> {</div><div class="line"> "token" : "联想",</div><div class="line"> "start_offset" : 0,</div><div class="line"> "end_offset" : 2,</div><div class="line"> "type" : "CN_WORD",</div><div class="line"> "position" : 0</div><div class="line"> },</div><div class="line"> {</div><div class="line"> "token" : "是",</div><div class="line"> "start_offset" : 2,</div><div class="line"> "end_offset" : 3,</div><div class="line"> "type" : "CN_CHAR",</div><div 
class="line"> "position" : 1</div><div class="line"> },</div><div class="line"> {</div><div class="line"> "token" : "全球",</div><div class="line"> "start_offset" : 3,</div><div class="line"> "end_offset" : 5,</div><div class="line"> "type" : "CN_WORD",</div><div class="line"> "position" : 2</div><div class="line"> },</div><div class="line"> {</div><div class="line"> "token" : "最大",</div><div class="line"> "start_offset" : 5,</div><div class="line"> "end_offset" : 7,</div><div class="line"> "type" : "CN_WORD",</div><div class="line"> "position" : 3</div><div class="line"> },</div><div class="line"> {</div><div class="line"> "token" : "的",</div><div class="line"> "start_offset" : 7,</div><div class="line"> "end_offset" : 8,</div><div class="line"> "type" : "CN_CHAR",</div><div class="line"> "position" : 4</div><div class="line"> },</div><div class="line"> {</div><div class="line"> "token" : "笔记本",</div><div class="line"> "start_offset" : 8,</div><div class="line"> "end_offset" : 11,</div><div class="line"> "type" : "CN_WORD",</div><div class="line"> "position" : 5</div><div class="line"> },</div><div class="line"> {</div><div class="line"> "token" : "厂商",</div><div class="line"> "start_offset" : 11,</div><div class="line"> "end_offset" : 13,</div><div class="line"> "type" : "CN_WORD",</div><div class="line"> "position" : 6</div><div class="line"> }</div><div class="line"> ]</div><div class="line">}</div></pre></td></tr></table></figure>
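<p>The two token lists above can be compared programmatically. A minimal Python sketch (the responses are abridged by hand from the output above) that extracts the token strings from an _analyze response and diffs the two modes:</p>

```python
def tokens_of(analyze_response):
    """Extract the token strings from an Elasticsearch _analyze response."""
    return [t["token"] for t in analyze_response["tokens"]]

# Abridged responses from the ik_max_word / ik_smart calls above.
max_word = {"tokens": [{"token": w} for w in
            ["联想", "是", "全球", "最大", "的", "笔记本", "笔记", "本厂", "厂商"]]}
smart = {"tokens": [{"token": w} for w in
         ["联想", "是", "全球", "最大", "的", "笔记本", "厂商"]]}

# ik_max_word emits every term ik_smart does, plus finer-grained extras.
extras = set(tokens_of(max_word)) - set(tokens_of(smart))
print(sorted(extras))  # the extra fine-grained terms ik_max_word produced
```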
<h3 id="安装-head-插件"><a href="#安装-head-插件" class="headerlink" title="安装 head 插件"></a>Installing the head plugin</h3><p>elasticsearch-head is a cluster management tool for Elasticsearch. It is a standalone web application written entirely in HTML5 that you can integrate into ES as a plugin.</p>
<p>It looks like this (images from the web):</p>
<p><img src="http://img.my.csdn.net/uploads/201211/17/1353133910_8134.jpg" alt=""></p>
<p><img src="http://img.my.csdn.net/uploads/201211/17/1353133911_9624.jpg" alt=""></p>
<p><img src="http://img.my.csdn.net/uploads/201211/17/1353134135_7264.jpg" alt=""></p>
<p><img src="http://img.my.csdn.net/uploads/201211/17/1353134135_5729.jpg" alt=""></p>
<p><img src="http://img.my.csdn.net/uploads/201211/17/1353133911_8912.jpg" alt=""></p>
<h4 id="安装-git"><a href="#安装-git" class="headerlink" title="安装 git"></a>Installing git</h4><figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div></pre></td><td class="code"><pre><div class="line">yum remove git</div><div class="line">yum install git</div><div class="line">git clone git://github.com/mobz/elasticsearch-head.git # pull the head plugin locally, or download the zip from GitHub instead</div></pre></td></tr></table></figure>
<h4 id="安装nodejs"><a href="#安装nodejs" class="headerlink" title="安装nodejs"></a>Installing Node.js</h4><p>First download node-v8.4.0-linux-x64.tar.xz from the official site.</p>
<figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div></pre></td><td class="code"><pre><div class="line">tar -Jxv -f node-v8.4.0-linux-x64.tar.xz</div><div class="line">mv node-v8.4.0-linux-x64 node</div></pre></td></tr></table></figure>
<p>Set the environment variables:</p>
<figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">vim /etc/profile</div></pre></td></tr></table></figure>
<p>Append:</p>
<figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div></pre></td><td class="code"><pre><div class="line">export NODE_HOME=/opt/node</div><div class="line">export PATH=$PATH:$NODE_HOME/bin</div><div class="line">export NODE_PATH=$NODE_HOME/lib/node_modules</div></pre></td></tr></table></figure>
<p>Apply the profile changes (this step is important and easy to forget):</p>
<figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">source /etc/profile</div></pre></td></tr></table></figure>
<p>Check that node is now available globally:</p>
<figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">node -v</div></pre></td></tr></table></figure>
<p>Then:</p>
<figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div></pre></td><td class="code"><pre><div class="line">mv elasticsearch-head head</div><div class="line">cd head/</div><div class="line">npm install -g grunt-cli</div><div class="line">npm install</div><div class="line">grunt server</div></pre></td></tr></table></figure>
<p>Then add the following to the Elasticsearch config file:</p>
<figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div></pre></td><td class="code"><pre><div class="line">http.cors.enabled: true</div><div class="line">http.cors.allow-origin: "*"</div></pre></td></tr></table></figure>
<p>Open <code>http://192.168.153.133:9100/</code> in a browser to see the result.</p>
<h3 id="遇到问题"><a href="#遇到问题" class="headerlink" title="遇到问题"></a>Problems encountered</h3><p>I stepped into every pitfall along the way; recording them here so I do not fall in again.</p>
<p><strong>1. ERROR Could not register mbeans java.security.AccessControlException: access denied ("javax.management.MBeanTrustPermission" "register")</strong></p>
<p>Change the owner of the elasticsearch directory to the current user:</p>
<p><code>sudo chown -R noroot:noroot elasticsearch</code></p>
<p>This happens because Elasticsearch needs to read and write its config files. The elsearch user created earlier has no read/write permission on the config directory, so the error persists. The fix is to switch to an administrator account and grant the permission:</p>
<p><code>sudo -i</code></p>
<p><code>chmod -R 775 config</code></p>
<p><strong>2. [WARN ][o.e.b.ElasticsearchUncaughtExceptionHandler] [] uncaught exception in thread [main]</strong><br><strong>org.elasticsearch.bootstrap.StartupException: java.lang.RuntimeException: can not run elasticsearch as root</strong></p>
<p>By default, Elasticsearch refuses to start as the root user.</p>
<p>Option 1: allow root with <code>-Des.insecure.allow.root=true</code>.</p>
<p>Edit /usr/local/elasticsearch-2.4.0/bin/elasticsearch and add:</p>
<p><code>ES_JAVA_OPTS="-Des.insecure.allow.root=true"</code></p>
<p>Or pass it when launching: <code>sh /usr/local/elasticsearch-2.4.0/bin/elasticsearch -d -Des.insecure.allow.root=true</code></p>
<p>Note: running as root in production is a security risk and is not recommended.</p>
<p>Option 2: create a dedicated user:</p>
<figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div></pre></td><td class="code"><pre><div class="line">useradd elastic</div><div class="line">chown -R elastic:elastic elasticsearch-2.4.0</div><div class="line">su elastic</div><div class="line">sh /usr/local/elasticsearch-2.4.0/bin/elasticsearch -d</div></pre></td></tr></table></figure>
<p><strong>3. UnsupportedOperationException: seccomp unavailable: requires kernel 3.5+ with CONFIG_SECCOMP and CONFIG_SECCOMP_FILTER compiled in</strong></p>
<p>This is only a warning; it does not occur on newer Linux versions.</p>
<p><strong>4. ERROR: [4] bootstrap checks failed</strong><br><strong>[1]: max file descriptors [4096] for elasticsearch process is too low, increase to at least [65536]</strong></p>
<p>Cause: the per-user limit on open file descriptors is too low, so Elasticsearch cannot create the local files it needs.</p>
<p>Fix: switch to root and edit the limits.conf file, adding entries like the following:</p>
<p><code>vim /etc/security/limits.conf</code></p>
<p>Add:</p>
<figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div></pre></td><td class="code"><pre><div class="line">* soft nofile 65536</div><div class="line">* hard nofile 131072</div><div class="line">* soft nproc 2048</div><div class="line">* hard nproc 4096</div></pre></td></tr></table></figure>
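<p>Once the new limits are in place (log out and back in first), you can verify them from inside a process. A small sketch using Python's standard resource module (Linux/Unix only; a generic check, not part of the original guide):</p>

```python
import resource

# Soft and hard limits on open file descriptors for the current process;
# after the limits.conf change above, soft should be at least 65536.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(soft, hard)
```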
<p><strong>[2]: max number of threads [1024] for user [tzs] is too low, increase to at least [2048]</strong></p>
<p>Cause: the per-user limit on the number of threads is too low, so local threads cannot be created.</p>
<p>Fix: switch to root and edit the 90-nproc.conf file in the limits.d directory:</p>
<p><code>vim /etc/security/limits.d/90-nproc.conf</code></p>
<p>Find the line:</p>
<p><code>* soft nproc 1024</code></p>
<p>and change it to:</p>
<p><code>* soft nproc 2048</code></p>
<p><strong>[3]: max virtual memory areas vm.max_map_count [65530] is too low, increase to at least [262144]</strong></p>
<p>Cause: the maximum number of virtual memory areas is too low.</p>
<p>As the root user, run:</p>
<p><code>sysctl -w vm.max_map_count=262144</code></p>
<p>Or edit the /etc/sysctl.conf file, add a "vm.max_map_count" entry, and apply it with:<br><code>sysctl -p</code></p>
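<p>To confirm the kernel picked up the change, the value can be read back from /proc. A minimal sketch (assumes Linux; returns None elsewhere):</p>

```python
def read_max_map_count(path="/proc/sys/vm/max_map_count"):
    """Return the current vm.max_map_count, or None if the file is unreadable."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None

count = read_max_map_count()
print(count)  # should be at least 262144 after the sysctl change above
```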
<p><strong>[4]: system call filters failed to install; check the logs and fix your configuration or disable system call filters at your own risk</strong></p>
<p>Cause: CentOS 6 does not support SecComp, while ES 5.4.1 defaults bootstrap.system_call_filter to true and checks for it; the failed check keeps ES from starting.<br>Details: <a href="https://github.com/elastic/elasticsearch/issues/22899" target="_blank" rel="external">https://github.com/elastic/elasticsearch/issues/22899</a></p>
<p>Fix: in elasticsearch.yml add a bootstrap.system_call_filter setting and set it to false; note it must go below the Memory settings:<br>bootstrap.memory_lock: false<br>bootstrap.system_call_filter: false</p>
<p><strong>5. java.lang.IllegalArgumentException: property [elasticsearch.version] is missing for plugin [head]</strong></p>
<p>Add to the Elasticsearch config file:</p>
<figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div></pre></td><td class="code"><pre><div class="line">http.cors.enabled: true</div><div class="line">http.cors.allow-origin: "*"</div></pre></td></tr></table></figure>
<h3 id="最后"><a href="#最后" class="headerlink" title="最后"></a>Final words</h3><p>The whole setup was done entirely by hand, which is no small effort. If you had to install many machines, could you automate it with a script? Worth thinking about. First published at: <a href="http://www.54tianzhisheng.cn/2017/09/09/Elasticsearch-install/" target="_blank" rel="external">http://www.54tianzhisheng.cn/2017/09/09/Elasticsearch-install/</a>; please credit the source when reposting, thanks!</p>
]]></content>
<summary type="html">
<h3 id="介绍"><a href="#介绍" class="headerlink" title="介绍"></a>Introduction</h3><p>ElasticSearch is a Lucene-based search server. It provides a distributed, multi-tenant full-text search engine with a RESTful web interface. Elasticsearch is developed in Java, released as open source under the terms of the Apache License, and is a popular enterprise search engine. Designed for use in the cloud, it achieves real-time search and is stable, reliable, fast, and easy to install and use. Wikipedia, Stack Overflow, and GitHub all use it.</p>
<p>Starting from scratch, this article explains how to build your own full-text search engine with Elasticsearch. Every step comes with a detailed explanation; follow along and you will learn it.<br>
</summary>
<category term="Elasticsearch" scheme="http://yoursite.com/tags/Elasticsearch/"/>
</entry>
<entry>
<title>A Comparison of Elasticsearch's Default Analyzers and Chinese Analyzers and How to Use Them</title>
<link href="http://yoursite.com/2017/09/07/Elasticsearch-analyzers/"/>
<id>http://yoursite.com/2017/09/07/Elasticsearch-analyzers/</id>
<published>2017-09-07T01:50:46.000Z</published>
<updated>2017-09-08T14:52:31.627Z</updated>
<content type="html"><![CDATA[<p>Introduction: ElasticSearch is a Lucene-based search server. It provides a distributed, multi-tenant full-text search engine with a RESTful web interface. Elasticsearch is developed in Java, released as open source under the terms of the Apache License, and is a popular enterprise search engine. Designed for use in the cloud, it achieves real-time search and is stable, reliable, fast, and easy to install and use.</p>
<p>Elasticsearch ships with many built-in analyzers. Below is a comparison of the default analyzers and the commonly used Chinese analyzers.<br><a id="more"></a></p>
<h2 id="系统默认分词器:"><a href="#系统默认分词器:" class="headerlink" title="系统默认分词器:"></a>Built-in analyzers:</h2><h3 id="1、standard-分词器"><a href="#1、standard-分词器" class="headerlink" title="1、standard 分词器"></a>1. The standard analyzer</h3><p><a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-standard-analyzer.html" target="_blank" rel="external">https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-standard-analyzer.html</a></p>
<p>How to use: <a href="http://www.yiibai.com/lucene/lucene_standardanalyzer.html" target="_blank" rel="external">http://www.yiibai.com/lucene/lucene_standardanalyzer.html</a></p>
<p>Its English handling matches StopAnalyzer; Chinese is handled by splitting into single characters. It lowercases tokens and removes stop words and punctuation.</p>
<figure class="highlight java"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div></pre></td><td class="code"><pre><div class="line"><span class="comment">/**StandardAnalyzer分析器*/</span></div><div class="line"><span class="function"><span class="keyword">public</span> <span class="keyword">void</span> <span class="title">standardAnalyzer</span><span class="params">(String msg)</span></span>{</div><div class="line"> StandardAnalyzer analyzer = <span class="keyword">new</span> StandardAnalyzer(Version.LUCENE_36);</div><div class="line"> <span class="keyword">this</span>.getTokens(analyzer, msg);</div><div class="line">}</div></pre></td></tr></table></figure>
<h3 id="2、simple-分词器"><a href="#2、simple-分词器" class="headerlink" title="2、simple 分词器"></a>2. The simple analyzer</h3><p><a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-simple-analyzer.html" target="_blank" rel="external">https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-simple-analyzer.html</a></p>
<p>How to use: <a href="http://www.yiibai.com/lucene/lucene_simpleanalyzer.html" target="_blank" rel="external">http://www.yiibai.com/lucene/lucene_simpleanalyzer.html</a></p>
<p>More capable than WhitespaceAnalyzer: it splits text on non-letter characters and then lowercases the tokens. This analyzer also discards numeric tokens.</p>
<figure class="highlight java"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div></pre></td><td class="code"><pre><div class="line"><span class="comment">/**SimpleAnalyzer分析器*/</span></div><div class="line"> <span class="function"><span class="keyword">public</span> <span class="keyword">void</span> <span class="title">simpleAnalyzer</span><span class="params">(String msg)</span></span>{</div><div class="line"> SimpleAnalyzer analyzer = <span class="keyword">new</span> SimpleAnalyzer(Version.LUCENE_36);</div><div class="line"> <span class="keyword">this</span>.getTokens(analyzer, msg);</div><div class="line"> }</div></pre></td></tr></table></figure>
<h3 id="3、Whitespace-分词器"><a href="#3、Whitespace-分词器" class="headerlink" title="3、Whitespace 分词器"></a>3. The whitespace analyzer</h3><p><a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-whitespace-analyzer.html" target="_blank" rel="external">https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-whitespace-analyzer.html</a></p>
<p>How to use: <a href="http://www.yiibai.com/lucene/lucene_whitespaceanalyzer.html" target="_blank" rel="external">http://www.yiibai.com/lucene/lucene_whitespaceanalyzer.html</a></p>
<p>It only splits on whitespace: it does not lowercase characters, does not support Chinese, and applies no other normalization to the generated tokens.</p>
<figure class="highlight java"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div></pre></td><td class="code"><pre><div class="line"><span class="comment">/**WhitespaceAnalyzer分析器*/</span></div><div class="line"> <span class="function"><span class="keyword">public</span> <span class="keyword">void</span> <span class="title">whitespaceAnalyzer</span><span class="params">(String msg)</span></span>{</div><div class="line"> WhitespaceAnalyzer analyzer = <span class="keyword">new</span> WhitespaceAnalyzer(Version.LUCENE_36);</div><div class="line"> <span class="keyword">this</span>.getTokens(analyzer, msg);</div><div class="line"> }</div></pre></td></tr></table></figure>
<h3 id="4、Stop-分词器"><a href="#4、Stop-分词器" class="headerlink" title="4、Stop 分词器"></a>4. The stop analyzer</h3><p><a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-stop-analyzer.html" target="_blank" rel="external">https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-stop-analyzer.html</a></p>
<p>How to use: <a href="http://www.yiibai.com/lucene/lucene_stopanalyzer.html" target="_blank" rel="external">http://www.yiibai.com/lucene/lucene_stopanalyzer.html</a></p>
<p>StopAnalyzer goes beyond SimpleAnalyzer by also removing common English words (such as "the" and "a"); you can also configure your own stop words as needed. It does not support Chinese.</p>
<figure class="highlight java"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div></pre></td><td class="code"><pre><div class="line"><span class="comment">/**StopAnalyzer分析器*/</span></div><div class="line"> <span class="function"><span class="keyword">public</span> <span class="keyword">void</span> <span class="title">stopAnalyzer</span><span class="params">(String msg)</span></span>{</div><div class="line"> StopAnalyzer analyzer = <span class="keyword">new</span> StopAnalyzer(Version.LUCENE_36);</div><div class="line"> <span class="keyword">this</span>.getTokens(analyzer, msg);</div><div class="line"> }</div></pre></td></tr></table></figure>
<h3 id="5、keyword-分词器"><a href="#5、keyword-分词器" class="headerlink" title="5、keyword 分词器"></a>5. The keyword analyzer</h3><p>KeywordAnalyzer treats the entire input as a single token, which makes it convenient to index and search special text types such as postal codes and addresses.</p>
<h3 id="6、pattern-分词器"><a href="#6、pattern-分词器" class="headerlink" title="6、pattern 分词器"></a>6. The pattern analyzer</h3><p><a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-pattern-analyzer.html" target="_blank" rel="external">https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-pattern-analyzer.html</a></p>
<p>A pattern analyzer splits text into "terms" (what remains after the token filters run) using a regular expression. It accepts the following settings:</p>
<table>
<thead>
<tr>
<th>Setting</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>lowercase</td>
<td>Whether terms are lowercased. Defaults to true.</td>
</tr>
<tr>
<td>pattern</td>
<td>The regular-expression pattern; defaults to \W+.</td>
</tr>
<tr>
<td>flags</td>
<td>Regular-expression flags.</td>
</tr>
<tr>
<td>stopwords</td>
<td>A list of stop words used to initialize the stop filter. Defaults to an empty list.</td>
</tr>
</tbody>
</table>
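<p>The defaults above (split on \W+, lowercase the terms) are easy to mimic outside Elasticsearch, which helps when reasoning about what a pattern analyzer will produce. A rough Python equivalent (not the actual ES implementation):</p>

```python
import re

def pattern_analyze(text, pattern=r"\W+", lowercase=True):
    """Mimic the pattern analyzer's defaults: split on the regex, lowercase terms."""
    terms = [t for t in re.split(pattern, text) if t]
    return [t.lower() for t in terms] if lowercase else terms

# Underscore is a word character, so "baz_qux" stays a single term.
print(pattern_analyze("Foo-Bar baz_qux 42"))  # ['foo', 'bar', 'baz_qux', '42']
```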
<h3 id="7、language-分词器"><a href="#7、language-分词器" class="headerlink" title="7、language 分词器"></a>7. The language analyzers</h3><p><a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-lang-analyzer.html" target="_blank" rel="external">https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-lang-analyzer.html</a></p>
<p>A set of analyzers for language-specific text (arabic, armenian, basque, brazilian, bulgarian, catalan, cjk, czech, danish, dutch, english, finnish, french, galician, german, greek, hindi, hungarian, indonesian, irish, italian, latvian, lithuanian, norwegian, persian, portuguese, romanian, russian, sorani, spanish, swedish, turkish, thai). Unfortunately there is no Chinese, so they are not considered here.</p>
<h3 id="8、snowball-分词器"><a href="#8、snowball-分词器" class="headerlink" title="8、snowball 分词器"></a>8. The snowball analyzer</h3><p>A snowball analyzer is built from a standard tokenizer plus four filters: standard, lowercase, stop, and snowball.</p>
<p>The snowball analyzer is generally not recommended in Lucene.</p>
<h3 id="9、Custom-分词器"><a href="#9、Custom-分词器" class="headerlink" title="9、Custom 分词器"></a>9. Custom analyzers</h3><p>A user-defined analyzer. It takes one tokenizer, zero or more token filters, and zero or more char filters. A custom analyzer's name must not start with "_".</p>
<p>The following settings can be set for a custom analyzer:</p>
<table>
<thead>
<tr>
<th>Setting</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>tokenizer</td>
<td>A built-in or registered tokenizer.</td>
</tr>
<tr>
<td>filter</td>
<td>Built-in or registered token filters.</td>
</tr>
<tr>
<td>char_filter</td>
<td>Built-in or registered character filters.</td>
</tr>
<tr>
<td>position_increment_gap</td>
<td>The position gap applied between values of a multi-valued field, limiting how far apart positional queries may match; defaults to 100.</td>
</tr>
</tbody>
</table>
<p>A custom analyzer template:</p>
<figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div><div class="line">14</div><div class="line">15</div><div class="line">16</div><div class="line">17</div><div class="line">18</div><div class="line">19</div><div class="line">20</div><div class="line">21</div><div class="line">22</div><div class="line">23</div><div class="line">24</div><div class="line">25</div><div class="line">26</div></pre></td><td class="code"><pre><div class="line">index :</div><div class="line"> analysis :</div><div class="line"> analyzer :</div><div class="line"> myAnalyzer2 :</div><div class="line"> type : custom</div><div class="line"> tokenizer : myTokenizer1</div><div class="line"> filter : [myTokenFilter1, myTokenFilter2]</div><div class="line"> char_filter : [my_html]</div><div class="line"> position_increment_gap: 256</div><div class="line"> tokenizer :</div><div class="line"> myTokenizer1 :</div><div class="line"> type : standard</div><div class="line"> max_token_length : 900</div><div class="line"> filter :</div><div class="line"> myTokenFilter1 :</div><div class="line"> type : stop</div><div class="line"> stopwords : [stop1, stop2, stop3, stop4]</div><div class="line"> myTokenFilter2 :</div><div class="line"> type : length</div><div class="line"> min : 0</div><div class="line"> max : 2000</div><div class="line"> char_filter :</div><div class="line"> my_html :</div><div class="line"> type : html_strip</div><div class="line"> escaped_tags : [xxx, yyy]</div><div class="line"> read_ahead : 1024</div></pre></td></tr></table></figure>
<h3 id="10、fingerprint-分词器"><a href="#10、fingerprint-分词器" class="headerlink" title="10、fingerprint 分词器"></a>10. The fingerprint analyzer</h3><p><a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-fingerprint-analyzer.html" target="_blank" rel="external">https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-fingerprint-analyzer.html</a></p>
<hr>
<h2 id="中文分词器:"><a href="#中文分词器:" class="headerlink" title="中文分词器:"></a>Chinese analyzers:</h2><h3 id="1、ik-analyzer"><a href="#1、ik-analyzer" class="headerlink" title="1、ik-analyzer"></a>1. ik-analyzer</h3><p><a href="https://github.com/wks/ik-analyzer" target="_blank" rel="external">https://github.com/wks/ik-analyzer</a></p>
<p>IKAnalyzer is an open-source, lightweight Chinese word-segmentation toolkit written in Java.</p>
<p>It uses a distinctive "forward iterative finest-granularity segmentation" algorithm, supports both fine-grained and maximum-word-length modes, and processes about 830,000 characters per second (1600 KB/s).</p>
<p>It uses a multi-subprocessor analysis model that handles English letters, digits, and Chinese vocabulary, and is compatible with Korean and Japanese characters.</p>
<p>Optimized dictionary storage gives a smaller memory footprint, and user dictionaries can be extended.</p>
<p>IKQueryParser, a query analyzer optimized for Lucene full-text search (strongly recommended by the author), introduces a simple search expression syntax and uses ambiguity analysis to optimize the permutations of query keywords, greatly improving Lucene's hit rate.</p>
<p>Maven usage:</p>
<figure class="highlight xml"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div></pre></td><td class="code"><pre><div class="line"><span class="tag"><<span class="name">dependency</span>></span></div><div class="line"> <span class="tag"><<span class="name">groupId</span>></span>org.wltea.ik-analyzer<span class="tag"></<span class="name">groupId</span>></span></div><div class="line"> <span class="tag"><<span class="name">artifactId</span>></span>ik-analyzer<span class="tag"></<span class="name">artifactId</span>></span></div><div class="line"> <span class="tag"><<span class="name">version</span>></span>3.2.8<span class="tag"></<span class="name">version</span>></span></div><div class="line"><span class="tag"></<span class="name">dependency</span>></span></div></pre></td></tr></table></figure>
<p>Before IK Analyzer joined the Maven Central Repository, you had to install it manually, either into your local repository or onto your own Maven repository server.</p>
<p>To build, package, and install it into the local Maven repository:<br><code>mvn install -Dmaven.test.skip=true</code></p>
<h4 id="Elasticsearch添加中文分词"><a href="#Elasticsearch添加中文分词" class="headerlink" title="Elasticsearch添加中文分词"></a>Adding Chinese analysis to Elasticsearch</h4><h5 id="安装IK分词插件"><a href="#安装IK分词插件" class="headerlink" title="安装IK分词插件"></a>Installing the IK plugin</h5><p><a href="https://github.com/medcl/elasticsearch-analysis-ik" target="_blank" rel="external">https://github.com/medcl/elasticsearch-analysis-ik</a></p>
<p>Enter the elasticsearch-analysis-ik-master directory.</p>
<p>For more installation details, see these posts:</p>
<p>1. <a href="http://blog.csdn.net/dingzfang/article/details/42776693" target="_blank" rel="external">Adding Chinese analysis to Elasticsearch</a>: <a href="http://blog.csdn.net/dingzfang/article/details/42776693" target="_blank" rel="external">http://blog.csdn.net/dingzfang/article/details/42776693</a></p>
<p>2. <a href="http://www.cnblogs.com/xing901022/p/5910139.html" target="_blank" rel="external">How to install Chinese analyzers (IK + pinyin) in Elasticsearch</a>: <a href="http://www.cnblogs.com/xing901022/p/5910139.html" target="_blank" rel="external">http://www.cnblogs.com/xing901022/p/5910139.html</a></p>
<p>3. <a href="http://blog.csdn.net/jam00/article/details/52983056" target="_blank" rel="external">Configuring and using the IK Chinese analyzer in Elasticsearch</a>: <a href="http://blog.csdn.net/jam00/article/details/52983056" target="_blank" rel="external">http://blog.csdn.net/jam00/article/details/52983056</a></p>
<h4 id="ik-带有两个分词器"><a href="#ik-带有两个分词器" class="headerlink" title="ik 带有两个分词器"></a>ik ships with two analyzers</h4><p><strong>ik_max_word</strong>: splits text at the finest granularity, producing as many terms as possible.</p>
<p><strong>ik_smart</strong>: performs the coarsest-grained split; terms already emitted will not be reused by other terms.</p>
<p>The difference:</p>
<figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div><div class="line">14</div><div class="line">15</div><div class="line">16</div><div class="line">17</div><div class="line">18</div><div class="line">19</div><div class="line">20</div><div class="line">21</div><div class="line">22</div><div class="line">23</div><div class="line">24</div><div class="line">25</div><div class="line">26</div><div class="line">27</div><div class="line">28</div><div class="line">29</div><div class="line">30</div><div class="line">31</div><div class="line">32</div><div class="line">33</div><div class="line">34</div><div class="line">35</div><div class="line">36</div><div class="line">37</div><div class="line">38</div><div class="line">39</div><div class="line">40</div><div class="line">41</div><div class="line">42</div><div class="line">43</div><div class="line">44</div><div class="line">45</div><div class="line">46</div><div class="line">47</div><div class="line">48</div><div class="line">49</div><div class="line">50</div><div class="line">51</div><div class="line">52</div><div class="line">53</div><div class="line">54</div><div class="line">55</div><div class="line">56</div><div class="line">57</div><div class="line">58</div><div class="line">59</div><div class="line">60</div><div class="line">61</div><div class="line">62</div><div class="line">63</div><div class="line">64</div><div class="line">65</div><div class="line">66</div><div class="line">67</div><div class="line">68</div><div class="line">69</div><div class="line">70</div><div class="line">71</div><div class="line">72</div><div class="line">73</div><div class="line">74</div><div 
class="line">75</div><div class="line">76</div><div class="line">77</div><div class="line">78</div><div class="line">79</div><div class="line">80</div><div class="line">81</div><div class="line">82</div><div class="line">83</div><div class="line">84</div><div class="line">85</div><div class="line">86</div><div class="line">87</div><div class="line">88</div><div class="line">89</div><div class="line">90</div><div class="line">91</div><div class="line">92</div><div class="line">93</div><div class="line">94</div><div class="line">95</div><div class="line">96</div><div class="line">97</div><div class="line">98</div><div class="line">99</div><div class="line">100</div><div class="line">101</div><div class="line">102</div><div class="line">103</div><div class="line">104</div><div class="line">105</div><div class="line">106</div><div class="line">107</div><div class="line">108</div><div class="line">109</div><div class="line">110</div><div class="line">111</div><div class="line">112</div><div class="line">113</div><div class="line">114</div><div class="line">115</div><div class="line">116</div><div class="line">117</div><div class="line">118</div><div class="line">119</div><div class="line">120</div><div class="line">121</div><div class="line">122</div><div class="line">123</div><div class="line">124</div><div class="line">125</div><div class="line">126</div><div class="line">127</div><div class="line">128</div><div class="line">129</div><div class="line">130</div><div class="line">131</div><div class="line">132</div><div class="line">133</div></pre></td><td class="code"><pre><div class="line"># ik_max_word</div><div class="line"></div><div class="line">curl -XGET 'http://localhost:9200/_analyze?pretty&analyzer=ik_max_word' -d '联想是全球最大的笔记本厂商'</div><div class="line">#返回</div><div class="line"></div><div class="line">{</div><div class="line"> "tokens" : [</div><div class="line"> {</div><div class="line"> "token" : "联想",</div><div class="line"> "start_offset" : 0,</div><div 
class="line"> "end_offset" : 2,</div><div class="line"> "type" : "CN_WORD",</div><div class="line"> "position" : 0</div><div class="line"> },</div><div class="line"> {</div><div class="line"> "token" : "是",</div><div class="line"> "start_offset" : 2,</div><div class="line"> "end_offset" : 3,</div><div class="line"> "type" : "CN_CHAR",</div><div class="line"> "position" : 1</div><div class="line"> },</div><div class="line"> {</div><div class="line"> "token" : "全球",</div><div class="line"> "start_offset" : 3,</div><div class="line"> "end_offset" : 5,</div><div class="line"> "type" : "CN_WORD",</div><div class="line"> "position" : 2</div><div class="line"> },</div><div class="line"> {</div><div class="line"> "token" : "最大",</div><div class="line"> "start_offset" : 5,</div><div class="line"> "end_offset" : 7,</div><div class="line"> "type" : "CN_WORD",</div><div class="line"> "position" : 3</div><div class="line"> },</div><div class="line"> {</div><div class="line"> "token" : "的",</div><div class="line"> "start_offset" : 7,</div><div class="line"> "end_offset" : 8,</div><div class="line"> "type" : "CN_CHAR",</div><div class="line"> "position" : 4</div><div class="line"> },</div><div class="line"> {</div><div class="line"> "token" : "笔记本",</div><div class="line"> "start_offset" : 8,</div><div class="line"> "end_offset" : 11,</div><div class="line"> "type" : "CN_WORD",</div><div class="line"> "position" : 5</div><div class="line"> },</div><div class="line"> {</div><div class="line"> "token" : "笔记",</div><div class="line"> "start_offset" : 8,</div><div class="line"> "end_offset" : 10,</div><div class="line"> "type" : "CN_WORD",</div><div class="line"> "position" : 6</div><div class="line"> },</div><div class="line"> {</div><div class="line"> "token" : "本厂",</div><div class="line"> "start_offset" : 10,</div><div class="line"> "end_offset" : 12,</div><div class="line"> "type" : "CN_WORD",</div><div class="line"> "position" : 7</div><div class="line"> },</div><div 
class="line"> {</div><div class="line"> "token" : "厂商",</div><div class="line"> "start_offset" : 11,</div><div class="line"> "end_offset" : 13,</div><div class="line"> "type" : "CN_WORD",</div><div class="line"> "position" : 8</div><div class="line"> }</div><div class="line"> ]</div><div class="line">}</div><div class="line"></div><div class="line"></div><div class="line"># ik_smart</div><div class="line"></div><div class="line">curl -XGET 'http://localhost:9200/_analyze?pretty&analyzer=ik_smart' -d '联想是全球最大的笔记本厂商'</div><div class="line"></div><div class="line"># 返回</div><div class="line"></div><div class="line">{</div><div class="line"> "tokens" : [</div><div class="line"> {</div><div class="line"> "token" : "联想",</div><div class="line"> "start_offset" : 0,</div><div class="line"> "end_offset" : 2,</div><div class="line"> "type" : "CN_WORD",</div><div class="line"> "position" : 0</div><div class="line"> },</div><div class="line"> {</div><div class="line"> "token" : "是",</div><div class="line"> "start_offset" : 2,</div><div class="line"> "end_offset" : 3,</div><div class="line"> "type" : "CN_CHAR",</div><div class="line"> "position" : 1</div><div class="line"> },</div><div class="line"> {</div><div class="line"> "token" : "全球",</div><div class="line"> "start_offset" : 3,</div><div class="line"> "end_offset" : 5,</div><div class="line"> "type" : "CN_WORD",</div><div class="line"> "position" : 2</div><div class="line"> },</div><div class="line"> {</div><div class="line"> "token" : "最大",</div><div class="line"> "start_offset" : 5,</div><div class="line"> "end_offset" : 7,</div><div class="line"> "type" : "CN_WORD",</div><div class="line"> "position" : 3</div><div class="line"> },</div><div class="line"> {</div><div class="line"> "token" : "的",</div><div class="line"> "start_offset" : 7,</div><div class="line"> "end_offset" : 8,</div><div class="line"> "type" : "CN_CHAR",</div><div class="line"> "position" : 4</div><div class="line"> },</div><div class="line"> 
{</div><div class="line"> "token" : "笔记本",</div><div class="line"> "start_offset" : 8,</div><div class="line"> "end_offset" : 11,</div><div class="line"> "type" : "CN_WORD",</div><div class="line"> "position" : 5</div><div class="line"> },</div><div class="line"> {</div><div class="line"> "token" : "厂商",</div><div class="line"> "start_offset" : 11,</div><div class="line"> "end_offset" : 13,</div><div class="line"> "type" : "CN_WORD",</div><div class="line"> "position" : 6</div><div class="line"> }</div><div class="line"> ]</div><div class="line">}</div></pre></td></tr></table></figure>
<p>Now let's create an index that uses ik.<br>Create an index named iktest, define an analyzer called ik that uses the ik_max_word tokenizer, and create an article type with a subject field that is analyzed with ik_max_word:</p>
<figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div><div class="line">14</div><div class="line">15</div><div class="line">16</div><div class="line">17</div><div class="line">18</div><div class="line">19</div><div class="line">20</div><div class="line">21</div><div class="line">22</div></pre></td><td class="code"><pre><div class="line">curl -XPUT 'http://localhost:9200/iktest?pretty' -d '{</div><div class="line"> "settings" : {</div><div class="line"> "analysis" : {</div><div class="line"> "analyzer" : {</div><div class="line"> "ik" : {</div><div class="line"> "tokenizer" : "ik_max_word"</div><div class="line"> }</div><div class="line"> }</div><div class="line"> }</div><div class="line"> },</div><div class="line"> "mappings" : {</div><div class="line"> "article" : {</div><div class="line"> "dynamic" : true,</div><div class="line"> "properties" : {</div><div class="line"> "subject" : {</div><div class="line"> "type" : "string",</div><div class="line"> "analyzer" : "ik_max_word"</div><div class="line"> }</div><div class="line"> }</div><div class="line"> }</div><div class="line"> }</div><div class="line">}'</div></pre></td></tr></table></figure>
<p>Bulk-index a few documents. Here I set the _id metadata explicitly to make the results easier to read; the subject values are a few news headlines picked at random:</p>
<figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div></pre></td><td class="code"><pre><div class="line">curl -XPOST http://localhost:9200/iktest/article/_bulk?pretty -d '</div><div class="line">{ "index" : { "_id" : "1" } }</div><div class="line">{"subject" : "“闺蜜”崔顺实被韩检方传唤 韩总统府促彻查真相" }</div><div class="line">{ "index" : { "_id" : "2" } }</div><div class="line">{"subject" : "韩举行“护国训练” 青瓦台:决不许国家安全出问题" }</div><div class="line">{ "index" : { "_id" : "3" } }</div><div class="line">{"subject" : "媒体称FBI已经取得搜查令 检视希拉里电邮" }</div><div class="line">{ "index" : { "_id" : "4" } }</div><div class="line">{"subject" : "村上春树获安徒生奖 演讲中谈及欧洲排外问题" }</div><div class="line">{ "index" : { "_id" : "5" } }</div><div class="line">{"subject" : "希拉里团队炮轰FBI 参院民主党领袖批其“违法”" }</div><div class="line">'</div></pre></td></tr></table></figure>
<p>Search for “希拉里和韩国” (“Hillary and South Korea”):</p>
<figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div><div class="line">14</div><div class="line">15</div><div class="line">16</div><div class="line">17</div><div class="line">18</div><div class="line">19</div><div class="line">20</div><div class="line">21</div><div class="line">22</div><div class="line">23</div><div class="line">24</div><div class="line">25</div><div class="line">26</div><div class="line">27</div><div class="line">28</div><div class="line">29</div><div class="line">30</div><div class="line">31</div><div class="line">32</div><div class="line">33</div><div class="line">34</div><div class="line">35</div><div class="line">36</div><div class="line">37</div><div class="line">38</div><div class="line">39</div><div class="line">40</div><div class="line">41</div><div class="line">42</div><div class="line">43</div><div class="line">44</div><div class="line">45</div><div class="line">46</div><div class="line">47</div><div class="line">48</div><div class="line">49</div><div class="line">50</div><div class="line">51</div><div class="line">52</div><div class="line">53</div><div class="line">54</div><div class="line">55</div><div class="line">56</div><div class="line">57</div><div class="line">58</div><div class="line">59</div><div class="line">60</div><div class="line">61</div><div class="line">62</div><div class="line">63</div><div class="line">64</div><div class="line">65</div><div class="line">66</div><div class="line">67</div><div class="line">68</div><div class="line">69</div><div class="line">70</div><div class="line">71</div></pre></td><td class="code"><pre><div class="line">curl -XPOST 
http://localhost:9200/iktest/article/_search?pretty -d'</div><div class="line">{</div><div class="line"> "query" : { "match" : { "subject" : "希拉里和韩国" }},</div><div class="line"> "highlight" : {</div><div class="line"> "pre_tags" : ["<font color='red'>"],</div><div class="line"> "post_tags" : ["</font>"],</div><div class="line"> "fields" : {</div><div class="line"> "subject" : {}</div><div class="line"> }</div><div class="line"> }</div><div class="line">}</div><div class="line">'</div><div class="line"># response</div><div class="line">{</div><div class="line"> "took" : 113,</div><div class="line"> "timed_out" : false,</div><div class="line"> "_shards" : {</div><div class="line"> "total" : 5,</div><div class="line"> "successful" : 5,</div><div class="line"> "failed" : 0</div><div class="line"> },</div><div class="line"> "hits" : {</div><div class="line"> "total" : 4,</div><div class="line"> "max_score" : 0.034062363,</div><div class="line"> "hits" : [ {</div><div class="line"> "_index" : "iktest",</div><div class="line"> "_type" : "article",</div><div class="line"> "_id" : "2",</div><div class="line"> "_score" : 0.034062363,</div><div class="line"> "_source" : {</div><div class="line"> "subject" : "韩举行“护国训练” 青瓦台:决不许国家安全出问题"</div><div class="line"> },</div><div class="line"> "highlight" : {</div><div class="line"> "subject" : [ "<font color=red>韩</font>举行“护<font color=red>国</font>训练” 青瓦台:决不许国家安全出问题" ]</div><div class="line"> }</div><div class="line"> }, {</div><div class="line"> "_index" : "iktest",</div><div class="line"> "_type" : "article",</div><div class="line"> "_id" : "3",</div><div class="line"> "_score" : 0.0076681254,</div><div class="line"> "_source" : {</div><div class="line"> "subject" : "媒体称FBI已经取得搜查令 检视希拉里电邮"</div><div class="line"> },</div><div class="line"> "highlight" : {</div><div class="line"> "subject" : [ "媒体称FBI已经取得搜查令 检视<font color=red>希拉里</font>电邮" ]</div><div class="line"> }</div><div class="line"> }, {</div><div class="line"> "_index" : 
"iktest",</div><div class="line"> "_type" : "article",</div><div class="line"> "_id" : "5",</div><div class="line"> "_score" : 0.006709609,</div><div class="line"> "_source" : {</div><div class="line"> "subject" : "希拉里团队炮轰FBI 参院民主党领袖批其“违法”"</div><div class="line"> },</div><div class="line"> "highlight" : {</div><div class="line"> "subject" : [ "<font color=red>希拉里</font>团队炮轰FBI 参院民主党领袖批其“违法”" ]</div><div class="line"> }</div><div class="line"> }, {</div><div class="line"> "_index" : "iktest",</div><div class="line"> "_type" : "article",</div><div class="line"> "_id" : "1",</div><div class="line"> "_score" : 0.0021509775,</div><div class="line"> "_source" : {</div><div class="line"> "subject" : "“闺蜜”崔顺实被韩检方传唤 韩总统府促彻查真相"</div><div class="line"> },</div><div class="line"> "highlight" : {</div><div class="line"> "subject" : [ "“闺蜜”崔顺实被<font color=red>韩</font>检方传唤 <font color=red>韩</font>总统府促彻查真相" ]</div><div class="line"> }</div><div class="line"> } ]</div><div class="line"> }</div><div class="line">}</div></pre></td></tr></table></figure>
<p>The highlight property is used here; rendered as HTML, the matched characters and words are shown in red. For an exact, non-analyzed filter search, simply change match to term.</p>
<h4 id="热词更新配置"><a href="#热词更新配置" class="headerlink" title="热词更新配置"></a>Hot-word update configuration</h4><p>Internet slang changes by the day. How do we get newly coined hot words (or domain-specific terms) into our search in near real time?</p>
<p>First, test with ik:</p>
<figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div><div class="line">14</div><div class="line">15</div><div class="line">16</div><div class="line">17</div><div class="line">18</div><div class="line">19</div><div class="line">20</div><div class="line">21</div><div class="line">22</div><div class="line">23</div><div class="line">24</div><div class="line">25</div><div class="line">26</div><div class="line">27</div><div class="line">28</div><div class="line">29</div><div class="line">30</div><div class="line">31</div><div class="line">32</div><div class="line">33</div><div class="line">34</div><div class="line">35</div><div class="line">36</div><div class="line">37</div></pre></td><td class="code"><pre><div class="line">curl -XGET 'http://localhost:9200/_analyze?pretty&analyzer=ik_max_word' -d '</div><div class="line">成龙原名陈港生</div><div class="line">'</div><div class="line">#返回</div><div class="line">{</div><div class="line"> "tokens" : [ {</div><div class="line"> "token" : "成龙",</div><div class="line"> "start_offset" : 1,</div><div class="line"> "end_offset" : 3,</div><div class="line"> "type" : "CN_WORD",</div><div class="line"> "position" : 0</div><div class="line"> }, {</div><div class="line"> "token" : "原名",</div><div class="line"> "start_offset" : 3,</div><div class="line"> "end_offset" : 5,</div><div class="line"> "type" : "CN_WORD",</div><div class="line"> "position" : 1</div><div class="line"> }, {</div><div class="line"> "token" : "陈",</div><div class="line"> "start_offset" : 5,</div><div class="line"> "end_offset" : 6,</div><div class="line"> "type" : "CN_CHAR",</div><div class="line"> "position" : 2</div><div 
class="line"> }, {</div><div class="line"> "token" : "港",</div><div class="line"> "start_offset" : 6,</div><div class="line"> "end_offset" : 7,</div><div class="line"> "type" : "CN_WORD",</div><div class="line"> "position" : 3</div><div class="line"> }, {</div><div class="line"> "token" : "生",</div><div class="line"> "start_offset" : 7,</div><div class="line"> "end_offset" : 8,</div><div class="line"> "type" : "CN_CHAR",</div><div class="line"> "position" : 4</div><div class="line"> } ]</div><div class="line">}</div></pre></td></tr></table></figure>
<p>ik's main dictionary does not contain “陈港生” (Jackie Chan's birth name), so it was split apart.<br>Let's configure it now.</p>
<p>Edit the IK configuration file: ES directory/plugins/ik/config/ik/IKAnalyzer.cfg.xml</p>
<p>Change it as follows:</p>
<figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div></pre></td><td class="code"><pre><div class="line"><?xml version="1.0" encoding="UTF-8"?></div><div class="line"><!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd"></div><div class="line"><properties></div><div class="line"> <comment>IK Analyzer 扩展配置</comment></div><div class="line"> <!--用户可以在这里配置自己的扩展字典 --></div><div class="line"> <entry key="ext_dict">custom/mydict.dic;custom/single_word_low_freq.dic</entry></div><div class="line"> <!--用户可以在这里配置自己的扩展停止词字典--></div><div class="line"> <entry key="ext_stopwords">custom/ext_stopword.dic</entry></div><div class="line"> <!--用户可以在这里配置远程扩展字典 --></div><div class="line"> <entry key="remote_ext_dict">http://192.168.1.136/hotWords.php</entry></div><div class="line"> <!--用户可以在这里配置远程扩展停止词字典--></div><div class="line"> <!-- <entry key="remote_ext_stopwords">words_location</entry> --></div><div class="line"></properties></div></pre></td></tr></table></figure>
<p>I use the remote extension dictionary here because it can be updated by another program without restarting ES, which is very convenient. The local custom mydict.dic dictionary works too: one word per line, just add your own.</p>
<p>Since it is a remote dictionary, it must be a reachable URL; it can be a dynamic page or a txt file, but the response must be UTF-8 encoded.</p>
<p>The content of hotWords.php:</p>
<figure class="highlight php"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div></pre></td><td class="code"><pre><div class="line">$s = <span class="string"><<<'EOF'</span></div><div class="line">陈港生</div><div class="line">元楼</div><div class="line">蓝瘦</div><div class="line">EOF;</div><div class="line">header(<span class="string">'Last-Modified: '</span>.gmdate(<span class="string">'D, d M Y H:i:s'</span>, time()).<span class="string">' GMT'</span>, <span class="keyword">true</span>, <span class="number">200</span>);</div><div class="line">header(<span class="string">'ETag: "5816f349-19"'</span>);</div><div class="line"><span class="keyword">echo</span> $s;</div></pre></td></tr></table></figure>
<p>ik honors two response headers, Last-Modified and ETag; if either one changes, an update is triggered. ik fetches the dictionary once a minute.<br>Restart Elasticsearch and check the startup log: the three new words have been loaded.</p>
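<p>The hotWords.php script above can be reproduced in any language; all that matters is that the endpoint returns one word per line in UTF-8 and sends stable Last-Modified and ETag headers. Below is a minimal Python sketch of such an endpoint (the word list, handler name, and URL path are invented for illustration):</p>

```python
import hashlib
import http.server
import threading
import urllib.request
from email.utils import formatdate

# Hot words served to ik; one word per line, UTF-8 encoded.
HOT_WORDS = "陈港生\n元楼\n蓝瘦\n"

class HotWordsHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        body = HOT_WORDS.encode("utf-8")
        # ik re-fetches the dictionary when Last-Modified or ETag changes.
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Last-Modified", formatdate(usegmt=True))
        self.send_header("ETag", '"%s"' % hashlib.md5(body).hexdigest())
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet

if __name__ == "__main__":
    server = http.server.HTTPServer(("127.0.0.1", 0), HotWordsHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    url = "http://127.0.0.1:%d/hotWords.txt" % server.server_address[1]
    resp = urllib.request.urlopen(url)
    print(resp.headers["ETag"] is not None)          # True
    print(resp.read().decode("utf-8").splitlines())  # ['陈港生', '元楼', '蓝瘦']
    server.shutdown()
```

<p>Because the ETag here is a hash of the body, editing the word list automatically changes the header and triggers ik's minute-interval refresh.</p>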
<p>Run the earlier request again and the response shows that the ik analyzer now matches “陈港生” as a single word. In the same way, proper nouns specific to our company (e.g. 永辉, 永辉超市, 永辉云创, 云创 …) can be added to the dictionary by hand.</p>
<h3 id="2、结巴中文分词"><a href="#2、结巴中文分词" class="headerlink" title="2、结巴中文分词"></a>2. jieba Chinese word segmentation</h3><h4 id="特点:"><a href="#特点:" class="headerlink" title="特点:"></a>Features</h4><p>1. Supports three segmentation modes:</p>
<ul>
<li><p>Precise mode: tries to split the sentence as accurately as possible; suited to text analysis;</p>
</li>
<li><p>Full mode: scans out every word in the sentence that can form a dictionary word; very fast, but cannot resolve ambiguity;</p>
</li>
<li><p>Search-engine mode: on top of precise mode, re-splits long words to improve recall; suited to search-engine indexing.</p>
</li>
</ul>
<p>2. Supports Traditional Chinese segmentation</p>
<p>3. Supports custom dictionaries</p>
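<p>The three modes are easy to picture with a toy dictionary. The sketch below is not jieba itself, just a minimal pure-Python illustration of what each mode does; precise mode is approximated here with forward maximum matching, and the dictionary and function names are invented:</p>

```python
# Toy dictionary; real jieba ships a large frequency dictionary.
DICT = {"中国", "科学", "学院", "科学院", "中国科学院", "计算", "计算所"}
MAX_LEN = max(len(w) for w in DICT)

def precise(sentence):
    """Rough stand-in for precise mode: forward maximum matching."""
    tokens, i = [], 0
    while i < len(sentence):
        # Try the longest dictionary word first; fall back to one character.
        for j in range(min(len(sentence), i + MAX_LEN), i, -1):
            if sentence[i:j] in DICT or j == i + 1:
                tokens.append(sentence[i:j])
                i = j
                break
    return tokens

def full(sentence):
    """Full mode: every dictionary word found anywhere in the sentence."""
    return [sentence[i:j]
            for i in range(len(sentence))
            for j in range(i + 2, min(len(sentence), i + MAX_LEN) + 1)
            if sentence[i:j] in DICT]

def for_search(sentence):
    """Search-engine mode: precise mode, plus sub-words of long tokens."""
    tokens = []
    for tok in precise(sentence):
        if len(tok) > 2:  # re-split long words to improve recall
            tokens.extend(w for w in full(tok) if w != tok)
        tokens.append(tok)
    return tokens

if __name__ == "__main__":
    print(precise("中国科学院计算所"))  # ['中国科学院', '计算所']
    print(for_search("中国科学院计算所"))
```

<p>Precise mode keeps 中国科学院 as one token, while search-engine mode additionally emits the sub-words 中国, 科学, 科学院, 学院 and 计算, so shorter queries can still hit the document.</p>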
<h3 id="3、THULAC"><a href="#3、THULAC" class="headerlink" title="3、THULAC"></a>3. THULAC</h3><p>THULAC (THU Lexical Analyzer for Chinese) is a Chinese lexical analysis toolkit developed by the Natural Language Processing and Social Humanities Computing Lab at Tsinghua University, providing Chinese word segmentation and part-of-speech tagging. THULAC has the following features:</p>
<p>Strong capability. Trained on what is currently the world's largest manually segmented and POS-tagged Chinese corpus (about 58 million characters), the model has strong tagging power.</p>
<p>High accuracy. On the standard Chinese Treebank (CTB5) dataset the toolkit reaches an F1 of 97.3% for segmentation and 92.9% for POS tagging, on par with the best methods reported on that dataset.</p>
<p>Reasonably fast. Joint segmentation and POS tagging runs at 300 KB/s, about 150,000 characters per second; segmentation alone reaches 1.3 MB/s.</p>
<p>The Chinese segmentation tool thulac4j has been released:</p>
<p>1. Normalized the segmentation dictionary and removed some useless entries;</p>
<p>2. Rewrote the construction algorithm of the DAT (double-array trie), shrinking the generated DAT by about 8% and saving memory;</p>
<p>3. Optimized the segmentation algorithm for higher throughput.</p>
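<p>The double-array trie mentioned in point 2 is a compact array encoding of an ordinary trie. The operation it accelerates, finding every dictionary word that starts at a given position, can be sketched with a plain nested-dict trie (illustrative only, not thulac4j's actual data structure; all names here are invented):</p>

```python
END = "\0"  # sentinel key marking that a word ends at this node

def build_trie(words):
    """Build a plain nested-dict trie from a word list."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root

def words_at(trie, sentence, start):
    """All dictionary words beginning at position `start` of `sentence`."""
    found, node = [], trie
    for j in range(start, len(sentence)):
        node = node.get(sentence[j])
        if node is None:
            break  # no dictionary word continues with this character
        if END in node:
            found.append(sentence[start:j + 1])
    return found

if __name__ == "__main__":
    trie = build_trie(["自然", "自然语言", "语言", "处理"])
    print(words_at(trie, "自然语言处理", 0))  # ['自然', '自然语言']
    print(words_at(trie, "自然语言处理", 2))  # ['语言']
```

<p>A double-array trie stores the same transition table in two flat integer arrays instead of nested hash maps, which is what makes the 8% size reduction above worth engineering for.</p>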
<figure class="highlight xml"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div></pre></td><td class="code"><pre><div class="line"><span class="tag"><<span class="name">dependency</span>></span></div><div class="line"> <span class="tag"><<span class="name">groupId</span>></span>io.github.yizhiru<span class="tag"></<span class="name">groupId</span>></span></div><div class="line"> <span class="tag"><<span class="name">artifactId</span>></span>thulac4j<span class="tag"></<span class="name">artifactId</span>></span></div><div class="line"> <span class="tag"><<span class="name">version</span>></span>${thulac4j.version}<span class="tag"></<span class="name">version</span>></span></div><div class="line"><span class="tag"></<span class="name">dependency</span>></span></div></pre></td></tr></table></figure>
<p><a href="http://www.cnblogs.com/en-heng/p/6526598.html" target="_blank" rel="external">http://www.cnblogs.com/en-heng/p/6526598.html</a></p>
<p>thulac4j supports two segmentation modes:</p>
<p>SegOnly mode: segmentation only, without POS tagging;</p>
<p>SegPos mode: segmentation with POS tagging.</p>
<figure class="highlight java"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div></pre></td><td class="code"><pre><div class="line"><span class="comment">// SegOnly mode</span></div><div class="line">String sentence = <span class="string">"滔滔的流水,向着波士顿湾无声逝去"</span>;</div><div class="line">SegOnly seg = <span class="keyword">new</span> SegOnly(<span class="string">"models/seg_only.bin"</span>);</div><div class="line">System.out.println(seg.segment(sentence));</div><div class="line"><span class="comment">// [滔滔, 的, 流水, ,, 向着, 波士顿湾, 无声, 逝去]</span></div><div class="line"></div><div class="line"><span class="comment">// SegPos mode</span></div><div class="line">SegPos pos = <span class="keyword">new</span> SegPos(<span class="string">"models/seg_pos.bin"</span>);</div><div class="line">System.out.println(pos.segment(sentence));</div><div class="line"><span class="comment">//[滔滔/a, 的/u, 流水/n, ,/w, 向着/p, 波士顿湾/ns, 无声/v, 逝去/v]</span></div></pre></td></tr></table></figure>
<h3 id="4、NLPIR"><a href="#4、NLPIR" class="headerlink" title="4、NLPIR"></a>4. NLPIR</h3><p>NLPIR, from the Institute of Computing Technology, Chinese Academy of Sciences: <a href="http://ictclas.nlpir.org/nlpir/" target="_blank" rel="external">http://ictclas.nlpir.org/nlpir/</a> (Chinese text can be analyzed online directly)</p>
<p>Download: <a href="https://github.com/NLPIR-team/NLPIR" target="_blank" rel="external">https://github.com/NLPIR-team/NLPIR</a></p>
<p>A simple Java tutorial for the CAS segmentation system (NLPIR): <a href="http://www.cnblogs.com/wukongjiuwo/p/4092480.html" target="_blank" rel="external">http://www.cnblogs.com/wukongjiuwo/p/4092480.html</a></p>
<h3 id="5、ansj分词器"><a href="#5、ansj分词器" class="headerlink" title="5、ansj分词器"></a>5. ansj segmenter</h3><p><a href="https://github.com/NLPchina/ansj_seg" target="_blank" rel="external">https://github.com/NLPchina/ansj_seg</a></p>
<p>A Java implementation of Chinese word segmentation based on n-gram + CRF + HMM.</p>
<p>Segmentation speed reaches about 2 million characters per second (tested on a MacBook Air), with accuracy above 96%.</p>
<p>It currently implements Chinese word segmentation and Chinese name recognition.</p>
<p>It also offers user-defined dictionaries, keyword extraction, automatic summarization, and keyword tagging.<br>It can be applied to natural language processing tasks and suits projects with high demands on segmentation quality.</p>
<p>Maven dependency:</p>
<figure class="highlight xml"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div></pre></td><td class="code"><pre><div class="line"><span class="tag"><<span class="name">dependency</span>></span></div><div class="line"> <span class="tag"><<span class="name">groupId</span>></span>org.ansj<span class="tag"></<span class="name">groupId</span>></span></div><div class="line"> <span class="tag"><<span class="name">artifactId</span>></span>ansj_seg<span class="tag"></<span class="name">artifactId</span>></span></div><div class="line"> <span class="tag"><<span class="name">version</span>></span>5.1.1<span class="tag"></<span class="name">version</span>></span></div><div class="line"><span class="tag"></<span class="name">dependency</span>></span></div></pre></td></tr></table></figure>
<p><strong>Demo</strong></p>
<figure class="highlight java"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div></pre></td><td class="code"><pre><div class="line">String str = <span class="string">"欢迎使用ansj_seg,(ansj中文分词)在这里如果你遇到什么问题都可以联系我.我一定尽我所能.帮助大家.ansj_seg更快,更准,更自由!"</span> ;</div><div class="line"> System.out.println(ToAnalysis.parse(str));</div><div class="line"></div><div class="line"> 欢迎/v,使用/v,ansj/en,_,seg/en,,,(,ansj/en,中文/nz,分词/n,),在/p,这里/r,如果/c,你/r,遇到/v,什么/r,问题/n,都/d,可以/v,联系/v,我/r,./m,我/r,一定/d,尽我所能/l,./m,帮助/v,大家/r,./m,ansj/en,_,seg/en,更快/d,,,更/d,准/a,,,更/d,自由/a,!</div></pre></td></tr></table></figure>
<h3 id="6、哈工大的LTP"><a href="#6、哈工大的LTP" class="headerlink" title="6、哈工大的LTP"></a>6. HIT's LTP</h3><p><a href="https://github.com/HIT-SCIR/ltp" target="_blank" rel="external">https://github.com/HIT-SCIR/ltp</a></p>
<p>LTP defines an XML-based representation for language-processing results and, on top of it, provides a rich and efficient set of bottom-up Chinese processing modules (six core technologies covering lexical, syntactic, and semantic analysis), an application interface based on dynamic link libraries (DLL), visualization tools, and access as a web service.</p>
<p>For LTP usage, see: <a href="http://ltp.readthedocs.io/zh_CN/latest/" target="_blank" rel="external">http://ltp.readthedocs.io/zh_CN/latest/</a></p>
<h3 id="7、庖丁解牛"><a href="#7、庖丁解牛" class="headerlink" title="7、庖丁解牛"></a>7. Paoding (庖丁解牛)</h3><p>Download: <a href="http://pan.baidu.com/s/1eQ88SZS" target="_blank" rel="external">http://pan.baidu.com/s/1eQ88SZS</a></p>
<p>Usage takes the following steps:</p>
<ol>
<li><p>Configure the dic files:<br>Edit the paoding-dic-home.properties file inside paoding-analysis.jar, uncomment “#paoding.dic.home=dic”, and set it to the local path of your dic files, e.g. /home/hadoop/work/paoding-analysis-2.0.4-beta/dic</p>
</li>
<li><p>Import the jars into your project:<br>Add paoding-analysis.jar, commons-logging.jar, lucene-analyzers-2.2.0.jar, and lucene-core-2.2.0.jar to the project; you can then use Paoding's Chinese segmentation in code, for example:</p>
</li>
</ol>
<figure class="highlight java"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div></pre></td><td class="code"><pre><div class="line">Analyzer analyzer = <span class="keyword">new</span> PaodingAnalyzer(); <span class="comment">// create an analyzer</span></div><div class="line">String text = <span class="string">"庖丁系统是个完全基于lucene的中文分词系统,它就是重新建了一个analyzer,叫做PaodingAnalyzer,这个analyer的核心任务就是生成一个可以切词TokenStream。"</span>; <span class="comment">// text to segment</span></div><div class="line">TokenStream tokenStream = analyzer.tokenStream(text, <span class="keyword">new</span> StringReader(text)); <span class="comment">// output stream of tokens</span></div><div class="line"><span class="keyword">try</span> {</div><div class="line"> Token t;</div><div class="line"> <span class="keyword">while</span> ((t = tokenStream.next()) != <span class="keyword">null</span>)</div><div class="line"> {</div><div class="line"> System.out.println(t); <span class="comment">// print each token</span></div><div class="line"> }</div><div class="line">} <span class="keyword">catch</span> (IOException e) {</div><div class="line"> e.printStackTrace();</div><div class="line">}</div></pre></td></tr></table></figure>
<h3 id="8、sogo在线分词"><a href="#8、sogo在线分词" class="headerlink" title="8、sogo在线分词"></a>8. Sogou online segmentation</h3><p>Sogou's online segmentation uses character-based tagging, mainly a linear-chain CRF model; the POS-tagging module is based on a structured linear model.</p>
<p>Try it online at:<br><a href="http://www.sogou.com/labs/webservice/" target="_blank" rel="external">http://www.sogou.com/labs/webservice/</a></p>
<h3 id="9、word分词"><a href="#9、word分词" class="headerlink" title="9、word分词"></a>9. word segmenter</h3><p>Repository: <a href="https://github.com/ysc/word" target="_blank" rel="external">https://github.com/ysc/word</a></p>
<p>word is a distributed Chinese segmentation component implemented in Java. It provides multiple dictionary-based segmentation algorithms and uses an n-gram model to resolve ambiguity. It accurately recognizes English and numbers as well as quantity expressions such as dates and times, and it recognizes out-of-vocabulary words such as person, place, and organization names. Behavior can be changed through custom configuration files; it supports user dictionaries with automatic change detection, large-scale distributed deployment, flexible choice among segmentation algorithms, and a refine feature for fine-grained control of results, plus word-frequency statistics and POS, synonym, antonym, and pinyin tagging. It offers 10 segmentation algorithms and 10 text-similarity algorithms, and integrates seamlessly with Lucene, Solr, Elasticsearch, and Luke. Note: word 1.3 requires JDK 1.8.</p>
<p>Maven dependency:</p>
<figure class="highlight xml"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div></pre></td><td class="code"><pre><div class="line"><span class="tag"><<span class="name">dependencies</span>></span></div><div class="line"> <span class="tag"><<span class="name">dependency</span>></span></div><div class="line"> <span class="tag"><<span class="name">groupId</span>></span>org.apdplat<span class="tag"></<span class="name">groupId</span>></span></div><div class="line"> <span class="tag"><<span class="name">artifactId</span>></span>word<span class="tag"></<span class="name">artifactId</span>></span></div><div class="line"> <span class="tag"><<span class="name">version</span>></span>1.3<span class="tag"></<span class="name">version</span>></span></div><div class="line"> <span class="tag"></<span class="name">dependency</span>></span></div><div class="line"><span class="tag"></<span class="name">dependencies</span>></span></div></pre></td></tr></table></figure>
<p>Elasticsearch plugin:</p>
<figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div><div class="line">14</div><div class="line">15</div><div class="line">16</div><div class="line">17</div><div class="line">18</div><div class="line">19</div><div class="line">20</div><div class="line">21</div><div class="line">22</div><div class="line">23</div><div class="line">24</div><div class="line">25</div><div class="line">26</div><div class="line">27</div><div class="line">28</div><div class="line">29</div><div class="line">30</div><div class="line">31</div><div class="line">32</div><div class="line">33</div><div class="line">34</div><div class="line">35</div><div class="line">36</div><div class="line">37</div><div class="line">38</div><div class="line">39</div><div class="line">40</div><div class="line">41</div><div class="line">42</div><div class="line">43</div><div class="line">44</div><div class="line">45</div></pre></td><td class="code"><pre><div class="line">1. Open a terminal and change to the Elasticsearch bin directory</div><div class="line">cd elasticsearch-2.1.1/bin</div><div class="line"></div><div class="line">2. Run the plugin script to install the word segmentation plugin:</div><div class="line">./plugin install http://apdplat.org/word/archive/v1.4.zip</div><div class="line"></div><div class="line">Notes on installation:</div><div class="line"> If you see:</div><div class="line"> ERROR: failed to download</div><div class="line"> or</div><div class="line"> Failed to install word, reason: failed to download</div><div class="line"> or</div><div class="line"> ERROR: incorrect hash (SHA1)</div><div class="line"> rerun the command; if it still fails, try a few more times</div><div class="line"></div><div class="line">For Elasticsearch 1.x releases, use the following command instead:</div><div class="line">./plugin -u http://apdplat.org/word/archive/v1.3.1.zip -i word</div><div class="line"></div><div class="line">3. Edit elasticsearch-2.1.1/config/elasticsearch.yml and add the following settings:</div><div class="line">index.analysis.analyzer.default.type : "word"</div><div class="line">index.analysis.tokenizer.default.type : "word"</div><div class="line"></div><div class="line">4. Start Elasticsearch and test the result by visiting this URL in Chrome:</div><div class="line">http://localhost:9200/_analyze?analyzer=word&text=杨尚川是APDPlat应用级产品开发平台的作者</div><div class="line"></div><div class="line">5. Custom configuration</div><div class="line">Edit the config file elasticsearch-2.1.1/plugins/word/word.local.conf</div><div class="line"></div><div class="line">6. Choosing a segmentation algorithm</div><div class="line">Edit elasticsearch-2.1.1/config/elasticsearch.yml and add the following settings:</div><div class="line">index.analysis.analyzer.default.segAlgorithm : "ReverseMinimumMatching"</div><div class="line">index.analysis.tokenizer.default.segAlgorithm : "ReverseMinimumMatching"</div><div class="line"></div><div class="line">Valid values for segAlgorithm:</div><div class="line">Forward maximum matching: MaximumMatching</div><div class="line">Reverse maximum matching: ReverseMaximumMatching</div><div class="line">Forward minimum matching: MinimumMatching</div><div class="line">Reverse minimum matching: ReverseMinimumMatching</div><div class="line">Bidirectional maximum matching: BidirectionalMaximumMatching</div><div class="line">Bidirectional minimum matching: BidirectionalMinimumMatching</div><div class="line">Bidirectional maximum-minimum matching: BidirectionalMaximumMinimumMatching</div><div class="line">Full segmentation: FullSegmentation</div><div class="line">Minimal word count: MinimalWordCount</div><div class="line">Max n-gram score: MaxNgramScore</div><div class="line">If unspecified, the default is bidirectional maximum matching: BidirectionalMaximumMatching</div></pre></td></tr></table></figure>
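<p>The matching algorithms listed above are easiest to understand from a toy implementation. The sketch below is a minimal illustration with a made-up four-word dictionary, not the word plugin's actual code; it contrasts MaximumMatching and ReverseMaximumMatching on the classic ambiguous string 研究生命起源 ("research the origin of life"), where the two scan directions disagree:</p>

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;
import java.util.Set;

public class MaxMatch {
    // Toy dictionary for illustration only; a real segmenter loads a large lexicon.
    static final Set<String> DICT = Set.of("研究", "研究生", "生命", "起源");
    static final int MAX_LEN = 3; // longest word in the toy dictionary

    // Forward maximum matching: greedily take the longest dictionary word from the left.
    static List<String> forward(String s) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            int len = Math.min(MAX_LEN, s.length() - i);
            while (len > 1 && !DICT.contains(s.substring(i, i + len))) len--;
            out.add(s.substring(i, i + len)); // len == 1 is the single-character fallback
            i += len;
        }
        return out;
    }

    // Reverse maximum matching: same greedy idea, scanning from the right.
    static List<String> reverse(String s) {
        LinkedList<String> out = new LinkedList<>();
        int i = s.length();
        while (i > 0) {
            int len = Math.min(MAX_LEN, i);
            while (len > 1 && !DICT.contains(s.substring(i - len, i))) len--;
            out.addFirst(s.substring(i - len, i));
            i -= len;
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(forward("研究生命起源")); // [研究生, 命, 起源]
        System.out.println(reverse("研究生命起源")); // [研究, 生命, 起源]
    }
}
```

<p>Here reverse matching happens to produce the intended reading (研究 / 生命 / 起源, "research / life / origin") while forward matching wrongly grabs 研究生 ("graduate student"); this kind of asymmetry is why reverse and bidirectional variants are common defaults for Chinese.</p>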
<h3 id="10、jcseg分词器"><a href="#10、jcseg分词器" class="headerlink" title="10、jcseg分词器"></a>10. jcseg segmenter</h3><p><a href="https://code.google.com/archive/p/jcseg/" target="_blank" rel="external">https://code.google.com/archive/p/jcseg/</a></p>
<h3 id="11、stanford分词器"><a href="#11、stanford分词器" class="headerlink" title="11、stanford分词器"></a>11. Stanford segmenter</h3><p>An open-source word segmentation tool from Stanford University; it now supports Chinese.</p>
<p>First, download Stanford Word Segmenter version 3.5.2 from 【1】, take the data folder inside it, and put it in the Maven project's src/main/resources directory.</p>
<p>Then add the Maven dependencies:</p>
<figure class="highlight xml"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div><div class="line">14</div><div class="line">15</div><div class="line">16</div><div class="line">17</div><div class="line">18</div><div class="line">19</div><div class="line">20</div><div class="line">21</div><div class="line">22</div><div class="line">23</div><div class="line">24</div></pre></td><td class="code"><pre><div class="line"><span class="tag"><<span class="name">properties</span>></span></div><div class="line"> <span class="tag"><<span class="name">java.version</span>></span>1.8<span class="tag"></<span class="name">java.version</span>></span></div><div class="line"> <span class="tag"><<span class="name">project.build.sourceEncoding</span>></span>UTF-8<span class="tag"></<span class="name">project.build.sourceEncoding</span>></span></div><div class="line"> <span class="tag"><<span class="name">corenlp.version</span>></span>3.6.0<span class="tag"></<span class="name">corenlp.version</span>></span></div><div class="line"> <span class="tag"></<span class="name">properties</span>></span></div><div class="line"> <span class="tag"><<span class="name">dependencies</span>></span></div><div class="line"> <span class="tag"><<span class="name">dependency</span>></span></div><div class="line"> <span class="tag"><<span class="name">groupId</span>></span>edu.stanford.nlp<span class="tag"></<span class="name">groupId</span>></span></div><div class="line"> <span class="tag"><<span class="name">artifactId</span>></span>stanford-corenlp<span class="tag"></<span class="name">artifactId</span>></span></div><div class="line"> <span class="tag"><<span 
class="name">version</span>></span>${corenlp.version}<span class="tag"></<span class="name">version</span>></span></div><div class="line"> <span class="tag"></<span class="name">dependency</span>></span></div><div class="line"> <span class="tag"><<span class="name">dependency</span>></span></div><div class="line"> <span class="tag"><<span class="name">groupId</span>></span>edu.stanford.nlp<span class="tag"></<span class="name">groupId</span>></span></div><div class="line"> <span class="tag"><<span class="name">artifactId</span>></span>stanford-corenlp<span class="tag"></<span class="name">artifactId</span>></span></div><div class="line"> <span class="tag"><<span class="name">version</span>></span>${corenlp.version}<span class="tag"></<span class="name">version</span>></span></div><div class="line"> <span class="tag"><<span class="name">classifier</span>></span>models<span class="tag"></<span class="name">classifier</span>></span></div><div class="line"> <span class="tag"></<span class="name">dependency</span>></span></div><div class="line"> <span class="tag"><<span class="name">dependency</span>></span></div><div class="line"> <span class="tag"><<span class="name">groupId</span>></span>edu.stanford.nlp<span class="tag"></<span class="name">groupId</span>></span></div><div class="line"> <span class="tag"><<span class="name">artifactId</span>></span>stanford-corenlp<span class="tag"></<span class="name">artifactId</span>></span></div><div class="line"> <span class="tag"><<span class="name">version</span>></span>${corenlp.version}<span class="tag"></<span class="name">version</span>></span></div><div class="line"> <span class="tag"><<span class="name">classifier</span>></span>models-chinese<span class="tag"></<span class="name">classifier</span>></span></div><div class="line"> <span class="tag"></<span class="name">dependency</span>></span></div><div class="line"> <span class="tag"></<span class="name">dependencies</span>></span></div></pre></td></tr></table></figure>
<p>Test it:</p>
<figure class="highlight java"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div><div class="line">14</div><div class="line">15</div><div class="line">16</div><div class="line">17</div><div class="line">18</div><div class="line">19</div><div class="line">20</div><div class="line">21</div><div class="line">22</div><div class="line">23</div><div class="line">24</div><div class="line">25</div><div class="line">26</div><div class="line">27</div><div class="line">28</div><div class="line">29</div><div class="line">30</div><div class="line">31</div><div class="line">32</div><div class="line">33</div><div class="line">34</div><div class="line">35</div><div class="line">36</div><div class="line">37</div><div class="line">38</div><div class="line">39</div><div class="line">40</div><div class="line">41</div><div class="line">42</div><div class="line">43</div></pre></td><td class="code"><pre><div class="line"><span class="keyword">import</span> java.util.Properties;</div><div class="line"></div><div class="line"><span class="keyword">import</span> edu.stanford.nlp.ie.crf.CRFClassifier;</div><div class="line"></div><div class="line"><span class="keyword">public</span> <span class="class"><span class="keyword">class</span> <span class="title">CoreNLPSegment</span> </span>{</div><div class="line"></div><div class="line"> <span class="keyword">private</span> <span class="keyword">static</span> CoreNLPSegment instance;</div><div class="line"> <span class="keyword">private</span> CRFClassifier classifier;</div><div class="line"></div><div class="line"> <span class="function"><span class="keyword">private</span> <span class="title">CoreNLPSegment</span><span 
class="params">()</span></span>{</div><div class="line"> Properties props = <span class="keyword">new</span> Properties();</div><div class="line"> props.setProperty(<span class="string">"sighanCorporaDict"</span>, <span class="string">"data"</span>);</div><div class="line"> props.setProperty(<span class="string">"serDictionary"</span>, <span class="string">"data/dict-chris6.ser.gz"</span>);</div><div class="line"> props.setProperty(<span class="string">"inputEncoding"</span>, <span class="string">"UTF-8"</span>);</div><div class="line"> props.setProperty(<span class="string">"sighanPostProcessing"</span>, <span class="string">"true"</span>);</div><div class="line"> classifier = <span class="keyword">new</span> CRFClassifier(props);</div><div class="line"> classifier.loadClassifierNoExceptions(<span class="string">"data/ctb.gz"</span>, props);</div><div class="line"> classifier.flags.setProperties(props);</div><div class="line"> }</div><div class="line"></div><div class="line"> <span class="function"><span class="keyword">public</span> <span class="keyword">static</span> CoreNLPSegment <span class="title">getInstance</span><span class="params">()</span> </span>{</div><div class="line"> <span class="keyword">if</span> (instance == <span class="keyword">null</span>) {</div><div class="line"> instance = <span class="keyword">new</span> CoreNLPSegment();</div><div class="line"> }</div><div class="line"></div><div class="line"> <span class="keyword">return</span> instance;</div><div class="line"> }</div><div class="line"></div><div class="line"> <span class="keyword">public</span> String[] doSegment(String data) {</div><div class="line"> <span class="comment">// pass a typed array: a plain toArray() returns Object[] and the String[] cast fails</span></div><div class="line"> <span class="keyword">return</span> (String[]) classifier.segmentString(data).toArray(<span class="keyword">new</span> String[0]);</div><div class="line"> }</div><div class="line"></div><div class="line"> <span class="function"><span class="keyword">public</span> <span class="keyword">static</span> <span class="keyword">void</span> <span class="title">main</span><span 
class="params">(String[] args)</span> </span>{</div><div class="line"></div><div class="line"> String sentence = <span class="string">"他和我在学校里常打桌球。"</span>;</div><div class="line"> String ret[] = CoreNLPSegment.getInstance().doSegment(sentence);</div><div class="line"> <span class="keyword">for</span> (String str : ret) {</div><div class="line"> System.out.println(str);</div><div class="line"> }</div><div class="line"></div><div class="line"> }</div><div class="line"></div><div class="line">}</div></pre></td></tr></table></figure>
<p><strong>Blog posts</strong>:</p>
<p><a href="https://blog.sectong.com/blog/corenlp_segment.html" target="_blank" rel="external">https://blog.sectong.com/blog/corenlp_segment.html</a></p>
<p><a href="http://blog.csdn.net/lightty/article/details/51766602" target="_blank" rel="external">http://blog.csdn.net/lightty/article/details/51766602</a></p>
<h3 id="12、Smartcn"><a href="#12、Smartcn" class="headerlink" title="12、Smartcn"></a>12. Smartcn</h3><p>Smartcn is an open-source Chinese word segmentation system under the Apache 2.0 license, written in Java and adapted from the ICTCLAS segmenter developed at the Institute of Computing Technology, Chinese Academy of Sciences. Long ago, when a Chinese segmentation contribution first appeared in Lucene, a quick scan of the .class file names was enough to tell it was yet another ICTCLAS derivative.</p>
<p><a href="http://lucene.apache.org/core/5_1_0/analyzers-smartcn/org/apache/lucene/analysis/cn/smart/SmartChineseAnalyzer.html" target="_blank" rel="external">http://lucene.apache.org/core/5_1_0/analyzers-smartcn/org/apache/lucene/analysis/cn/smart/SmartChineseAnalyzer.html</a></p>
<h3 id="13、pinyin-分词器"><a href="#13、pinyin-分词器" class="headerlink" title="13、pinyin 分词器"></a>13. pinyin analyzer</h3><p>The pinyin analyzer lets users type pinyin and still find the matching keywords. For example, in an e-commerce search, typing <code>yonghui</code> matches <code>永辉</code>. That makes for a very pleasant search experience.</p>
<p>The pinyin analyzer is installed the same way as IK. Download: <a href="https://github.com/medcl/elasticsearch-analysis-pinyin" target="_blank" rel="external">https://github.com/medcl/elasticsearch-analysis-pinyin</a></p>
<p>For the available parameters, see the README on GitHub.</p>
<p>As of version 1.8, this analyzer provides two tokenization rules:</p>
<ul>
<li><p>pinyin, which simply converts Chinese characters to full pinyin;</p>
</li>
<li><p>pinyin_first_letter, which extracts the first letter of each character's pinyin.</p>
</li>
</ul>
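<p>A minimal sketch of the difference between the two rules, using a hypothetical three-character lookup table (the real plugin consults a complete pinyin dictionary, so this is an illustration of the idea only):</p>

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class PinyinDemo {
    // Hypothetical mini lookup table; the actual plugin ships a full pinyin mapping.
    static final Map<Character, String> PINYIN = Map.of('刘', "liu", '德', "de", '华', "hua");

    // "pinyin" rule: convert each character to its full pinyin syllable.
    static List<String> fullPinyin(String s) {
        List<String> out = new ArrayList<>();
        for (char c : s.toCharArray()) out.add(PINYIN.getOrDefault(c, String.valueOf(c)));
        return out;
    }

    // "pinyin_first_letter" rule: keep only the first letter of each syllable.
    static String firstLetters(String s) {
        StringBuilder sb = new StringBuilder();
        for (String py : fullPinyin(s)) sb.append(py.charAt(0));
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(fullPinyin("刘德华"));   // [liu, de, hua]
        System.out.println(firstLetters("刘德华")); // ldh
    }
}
```

<p>For 刘德华 this yields the tokens liu/de/hua and the first-letter token ldh, which is exactly what the analyzer output in the test step below shows.</p>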
<p>Usage:</p>
<p>1. Create an index with a custom pinyin analyzer</p>
<figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div><div class="line">14</div><div class="line">15</div><div class="line">16</div><div class="line">17</div><div class="line">18</div><div class="line">19</div><div class="line">20</div><div class="line">21</div><div class="line">22</div><div class="line">23</div></pre></td><td class="code"><pre><div class="line">curl -XPUT http://localhost:9200/medcl/ -d'</div><div class="line">{</div><div class="line"> "index" : {</div><div class="line"> "analysis" : {</div><div class="line"> "analyzer" : {</div><div class="line"> "pinyin_analyzer" : {</div><div class="line"> "tokenizer" : "my_pinyin"</div><div class="line"> }</div><div class="line"> },</div><div class="line"> "tokenizer" : {</div><div class="line"> "my_pinyin" : {</div><div class="line"> "type" : "pinyin",</div><div class="line"> "keep_separate_first_letter" : false,</div><div class="line"> "keep_full_pinyin" : true,</div><div class="line"> "keep_original" : true,</div><div class="line"> "limit_first_letter_length" : 16,</div><div class="line"> "lowercase" : true,</div><div class="line"> "remove_duplicated_term" : true</div><div class="line"> }</div><div class="line"> }</div><div class="line"> }</div><div class="line"> }</div><div class="line">}'</div></pre></td></tr></table></figure>
<p>2. Test the analyzer on a Chinese name such as 刘德华</p>
<figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">http://localhost:9200/medcl/_analyze?text=%e5%88%98%e5%be%b7%e5%8d%8e&analyzer=pinyin_analyzer</div></pre></td></tr></table></figure>
<figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div><div class="line">14</div><div class="line">15</div><div class="line">16</div><div class="line">17</div><div class="line">18</div><div class="line">19</div><div class="line">20</div><div class="line">21</div><div class="line">22</div><div class="line">23</div><div class="line">24</div><div class="line">25</div><div class="line">26</div><div class="line">27</div><div class="line">28</div><div class="line">29</div><div class="line">30</div><div class="line">31</div><div class="line">32</div><div class="line">33</div><div class="line">34</div><div class="line">35</div><div class="line">36</div><div class="line">37</div><div class="line">38</div><div class="line">39</div></pre></td><td class="code"><pre><div class="line">{</div><div class="line"> "tokens" : [</div><div class="line"> {</div><div class="line"> "token" : "liu",</div><div class="line"> "start_offset" : 0,</div><div class="line"> "end_offset" : 1,</div><div class="line"> "type" : "word",</div><div class="line"> "position" : 0</div><div class="line"> },</div><div class="line"> {</div><div class="line"> "token" : "de",</div><div class="line"> "start_offset" : 1,</div><div class="line"> "end_offset" : 2,</div><div class="line"> "type" : "word",</div><div class="line"> "position" : 1</div><div class="line"> },</div><div class="line"> {</div><div class="line"> "token" : "hua",</div><div class="line"> "start_offset" : 2,</div><div class="line"> "end_offset" : 3,</div><div class="line"> "type" : "word",</div><div class="line"> "position" : 2</div><div class="line"> },</div><div class="line"> {</div><div class="line"> "token" 
: "刘德华",</div><div class="line"> "start_offset" : 0,</div><div class="line"> "end_offset" : 3,</div><div class="line"> "type" : "word",</div><div class="line"> "position" : 3</div><div class="line"> },</div><div class="line"> {</div><div class="line"> "token" : "ldh",</div><div class="line"> "start_offset" : 0,</div><div class="line"> "end_offset" : 3,</div><div class="line"> "type" : "word",</div><div class="line"> "position" : 4</div><div class="line"> }</div><div class="line"> ]</div><div class="line">}</div></pre></td></tr></table></figure>
<p>3. Create the mapping</p>
<figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div><div class="line">14</div><div class="line">15</div><div class="line">16</div><div class="line">17</div><div class="line">18</div><div class="line">19</div></pre></td><td class="code"><pre><div class="line">curl -XPOST http://localhost:9200/medcl/folks/_mapping -d'</div><div class="line">{</div><div class="line"> "folks": {</div><div class="line"> "properties": {</div><div class="line"> "name": {</div><div class="line"> "type": "keyword",</div><div class="line"> "fields": {</div><div class="line"> "pinyin": {</div><div class="line"> "type": "text",</div><div class="line"> "store": "no",</div><div class="line"> "term_vector": "with_offsets",</div><div class="line"> "analyzer": "pinyin_analyzer",</div><div class="line"> "boost": 10</div><div class="line"> }</div><div class="line"> }</div><div class="line"> }</div><div class="line"> }</div><div class="line"> }</div><div class="line">}'</div></pre></td></tr></table></figure>
<p>4. Indexing</p>
<figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">curl -XPOST http://localhost:9200/medcl/folks/andy -d'{"name":"刘德华"}'</div></pre></td></tr></table></figure>
<p>5. Let’s search</p>
<figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div></pre></td><td class="code"><pre><div class="line">http://localhost:9200/medcl/folks/_search?q=name:%E5%88%98%E5%BE%B7%E5%8D%8E</div><div class="line">curl http://localhost:9200/medcl/folks/_search?q=name.pinyin:%e5%88%98%e5%be%b7</div><div class="line">curl http://localhost:9200/medcl/folks/_search?q=name.pinyin:liu</div><div class="line">curl http://localhost:9200/medcl/folks/_search?q=name.pinyin:ldh</div><div class="line">curl http://localhost:9200/medcl/folks/_search?q=name.pinyin:de+hua</div></pre></td></tr></table></figure>
<p>6. Using the pinyin TokenFilter</p>
<figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div><div class="line">14</div><div class="line">15</div><div class="line">16</div><div class="line">17</div><div class="line">18</div><div class="line">19</div><div class="line">20</div><div class="line">21</div><div class="line">22</div><div class="line">23</div><div class="line">24</div><div class="line">25</div><div class="line">26</div><div class="line">27</div></pre></td><td class="code"><pre><div class="line"></div><div class="line">curl -XPUT http://localhost:9200/medcl1/ -d'</div><div class="line">{</div><div class="line"> "index" : {</div><div class="line"> "analysis" : {</div><div class="line"> "analyzer" : {</div><div class="line"> "user_name_analyzer" : {</div><div class="line"> "tokenizer" : "whitespace",</div><div class="line"> "filter" : "pinyin_first_letter_and_full_pinyin_filter"</div><div class="line"> }</div><div class="line"> },</div><div class="line"> "filter" : {</div><div class="line"> "pinyin_first_letter_and_full_pinyin_filter" : {</div><div class="line"> "type" : "pinyin",</div><div class="line"> "keep_first_letter" : true,</div><div class="line"> "keep_full_pinyin" : false,</div><div class="line"> "keep_none_chinese" : true,</div><div class="line"> "keep_original" : false,</div><div class="line"> "limit_first_letter_length" : 16,</div><div class="line"> "lowercase" : true,</div><div class="line"> "trim_whitespace" : true,</div><div class="line"> "keep_none_chinese_in_first_letter" : true</div><div class="line"> }</div><div class="line"> }</div><div class="line"> }</div><div class="line"> }</div><div class="line">}'</div></pre></td></tr></table></figure>
<p>Token test: 刘德华 张学友 郭富城 黎明 四大天王</p>
<figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">curl -XGET http://localhost:9200/medcl1/_analyze?text=%e5%88%98%e5%be%b7%e5%8d%8e+%e5%bc%a0%e5%ad%a6%e5%8f%8b+%e9%83%ad%e5%af%8c%e5%9f%8e+%e9%bb%8e%e6%98%8e+%e5%9b%9b%e5%a4%a7%e5%a4%a9%e7%8e%8b&analyzer=user_name_analyzer</div></pre></td></tr></table></figure>
<figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div><div class="line">14</div><div class="line">15</div><div class="line">16</div><div class="line">17</div><div class="line">18</div><div class="line">19</div><div class="line">20</div><div class="line">21</div><div class="line">22</div><div class="line">23</div><div class="line">24</div><div class="line">25</div><div class="line">26</div><div class="line">27</div><div class="line">28</div><div class="line">29</div><div class="line">30</div><div class="line">31</div><div class="line">32</div><div class="line">33</div><div class="line">34</div><div class="line">35</div><div class="line">36</div><div class="line">37</div><div class="line">38</div><div class="line">39</div><div class="line">40</div></pre></td><td class="code"><pre><div class="line"></div><div class="line">{</div><div class="line"> "tokens" : [</div><div class="line"> {</div><div class="line"> "token" : "ldh",</div><div class="line"> "start_offset" : 0,</div><div class="line"> "end_offset" : 3,</div><div class="line"> "type" : "word",</div><div class="line"> "position" : 0</div><div class="line"> },</div><div class="line"> {</div><div class="line"> "token" : "zxy",</div><div class="line"> "start_offset" : 4,</div><div class="line"> "end_offset" : 7,</div><div class="line"> "type" : "word",</div><div class="line"> "position" : 1</div><div class="line"> },</div><div class="line"> {</div><div class="line"> "token" : "gfc",</div><div class="line"> "start_offset" : 8,</div><div class="line"> "end_offset" : 11,</div><div class="line"> "type" : "word",</div><div class="line"> "position" : 2</div><div class="line"> 
},</div><div class="line"> {</div><div class="line"> "token" : "lm",</div><div class="line"> "start_offset" : 12,</div><div class="line"> "end_offset" : 14,</div><div class="line"> "type" : "word",</div><div class="line"> "position" : 3</div><div class="line"> },</div><div class="line"> {</div><div class="line"> "token" : "sdtw",</div><div class="line"> "start_offset" : 15,</div><div class="line"> "end_offset" : 19,</div><div class="line"> "type" : "word",</div><div class="line"> "position" : 4</div><div class="line"> }</div><div class="line"> ]</div><div class="line">}</div></pre></td></tr></table></figure>
<p>7. Using it in phrase queries</p>
<p>(1)</p>
<figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div><div class="line">14</div><div class="line">15</div><div class="line">16</div><div class="line">17</div><div class="line">18</div><div class="line">19</div><div class="line">20</div><div class="line">21</div><div class="line">22</div><div class="line">23</div><div class="line">24</div><div class="line">25</div><div class="line">26</div><div class="line">27</div><div class="line">28</div><div class="line">29</div><div class="line">30</div></pre></td><td class="code"><pre><div class="line"></div><div class="line">PUT /medcl/</div><div class="line"> {</div><div class="line"> "index" : {</div><div class="line"> "analysis" : {</div><div class="line"> "analyzer" : {</div><div class="line"> "pinyin_analyzer" : {</div><div class="line"> "tokenizer" : "my_pinyin"</div><div class="line"> }</div><div class="line"> },</div><div class="line"> "tokenizer" : {</div><div class="line"> "my_pinyin" : {</div><div class="line"> "type" : "pinyin",</div><div class="line"> "keep_first_letter":false,</div><div class="line"> "keep_separate_first_letter" : false,</div><div class="line"> "keep_full_pinyin" : true,</div><div class="line"> "keep_original" : false,</div><div class="line"> "limit_first_letter_length" : 16,</div><div class="line"> "lowercase" : true</div><div class="line"> }</div><div class="line"> }</div><div class="line"> }</div><div class="line"> }</div><div class="line"> }</div><div class="line"> GET /medcl/folks/_search</div><div class="line"> {</div><div class="line"> "query": {"match_phrase": {</div><div class="line"> "name.pinyin": "刘德华"</div><div class="line"> }}</div><div 
class="line"> }</div></pre></td></tr></table></figure>
<p>(2)</p>
<figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div><div class="line">14</div><div class="line">15</div><div class="line">16</div><div class="line">17</div><div class="line">18</div><div class="line">19</div><div class="line">20</div><div class="line">21</div><div class="line">22</div><div class="line">23</div><div class="line">24</div><div class="line">25</div><div class="line">26</div><div class="line">27</div><div class="line">28</div><div class="line">29</div><div class="line">30</div><div class="line">31</div><div class="line">32</div><div class="line">33</div><div class="line">34</div><div class="line">35</div><div class="line">36</div><div class="line">37</div><div class="line">38</div><div class="line">39</div><div class="line">40</div><div class="line">41</div><div class="line">42</div><div class="line">43</div><div class="line">44</div><div class="line">45</div><div class="line">46</div><div class="line">47</div></pre></td><td class="code"><pre><div class="line">PUT /medcl/</div><div class="line"> {</div><div class="line"> "index" : {</div><div class="line"> "analysis" : {</div><div class="line"> "analyzer" : {</div><div class="line"> "pinyin_analyzer" : {</div><div class="line"> "tokenizer" : "my_pinyin"</div><div class="line"> }</div><div class="line"> },</div><div class="line"> "tokenizer" : {</div><div class="line"> "my_pinyin" : {</div><div class="line"> "type" : "pinyin",</div><div class="line"> "keep_first_letter":false,</div><div class="line"> "keep_separate_first_letter" : true,</div><div class="line"> "keep_full_pinyin" : false,</div><div class="line"> "keep_original" : false,</div><div class="line"> 
"limit_first_letter_length" : 16,</div><div class="line"> "lowercase" : true</div><div class="line"> }</div><div class="line"> }</div><div class="line"> }</div><div class="line"> }</div><div class="line"> }</div><div class="line"></div><div class="line"> POST /medcl/folks/andy</div><div class="line"> {"name":"刘德华"}</div><div class="line"></div><div class="line"> GET /medcl/folks/_search</div><div class="line"> {</div><div class="line"> "query": {"match_phrase": {</div><div class="line"> "name.pinyin": "刘德h"</div><div class="line"> }}</div><div class="line"> }</div><div class="line"></div><div class="line"> GET /medcl/folks/_search</div><div class="line"> {</div><div class="line"> "query": {"match_phrase": {</div><div class="line"> "name.pinyin": "刘dh"</div><div class="line"> }}</div><div class="line"> }</div><div class="line"></div><div class="line"> GET /medcl/folks/_search</div><div class="line"> {</div><div class="line"> "query": {"match_phrase": {</div><div class="line"> "name.pinyin": "dh"</div><div class="line"> }}</div><div class="line"> }</div></pre></td></tr></table></figure>
<h3 id="14、Mmseg-分词器"><a href="#14、Mmseg-分词器" class="headerlink" title="14、Mmseg 分词器"></a>14. Mmseg analyzer</h3><p>Mmseg also has Elasticsearch support.</p>
<p>Download: <a href="https://github.com/medcl/elasticsearch-analysis-mmseg/releases" target="_blank" rel="external">https://github.com/medcl/elasticsearch-analysis-mmseg/releases</a> (pick the release matching your Elasticsearch version)</p>
<p>How to use it:</p>
<p>1. Create the index:</p>
<figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">curl -XPUT http://localhost:9200/index</div></pre></td></tr></table></figure>
<p>2. Create the mapping:</p>
<figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div></pre></td><td class="code"><pre><div class="line">curl -XPOST http://localhost:9200/index/fulltext/_mapping -d'</div><div class="line">{</div><div class="line"> "properties": {</div><div class="line"> "content": {</div><div class="line"> "type": "text",</div><div class="line"> "term_vector": "with_positions_offsets",</div><div class="line"> "analyzer": "mmseg_maxword",</div><div class="line"> "search_analyzer": "mmseg_maxword"</div><div class="line"> }</div><div class="line"> }</div><div class="line"></div><div class="line">}'</div></pre></td></tr></table></figure>
<p>3. Index some documents</p>
<figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div><div class="line">14</div><div class="line">15</div></pre></td><td class="code"><pre><div class="line">curl -XPOST http://localhost:9200/index/fulltext/1 -d'</div><div class="line">{"content":"美国留给伊拉克的是个烂摊子吗"}</div><div class="line">'</div><div class="line"></div><div class="line">curl -XPOST http://localhost:9200/index/fulltext/2 -d'</div><div class="line">{"content":"公安部:各地校车将享最高路权"}</div><div class="line">'</div><div class="line"></div><div class="line">curl -XPOST http://localhost:9200/index/fulltext/3 -d'</div><div class="line">{"content":"中韩渔警冲突调查:韩警平均每天扣1艘中国渔船"}</div><div class="line">'</div><div class="line"></div><div class="line">curl -XPOST http://localhost:9200/index/fulltext/4 -d'</div><div class="line">{"content":"中国驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"}</div><div class="line">'</div></pre></td></tr></table></figure>
<p>4. Query with highlighting</p>
<figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div></pre></td><td class="code"><pre><div class="line">curl -XPOST http://localhost:9200/index/fulltext/_search -d'</div><div class="line">{</div><div class="line"> "query" : { "term" : { "content" : "中国" }},</div><div class="line"> "highlight" : {</div><div class="line"> "pre_tags" : ["<tag1>", "<tag2>"],</div><div class="line"> "post_tags" : ["</tag1>", "</tag2>"],</div><div class="line"> "fields" : {</div><div class="line"> "content" : {}</div><div class="line"> }</div><div class="line"> }</div><div class="line">}</div><div class="line">'</div></pre></td></tr></table></figure>
<p>5. Result:</p>
<figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div><div class="line">14</div><div class="line">15</div><div class="line">16</div><div class="line">17</div><div class="line">18</div><div class="line">19</div><div class="line">20</div><div class="line">21</div><div class="line">22</div><div class="line">23</div><div class="line">24</div><div class="line">25</div><div class="line">26</div><div class="line">27</div><div class="line">28</div><div class="line">29</div><div class="line">30</div><div class="line">31</div><div class="line">32</div><div class="line">33</div><div class="line">34</div><div class="line">35</div><div class="line">36</div><div class="line">37</div><div class="line">38</div><div class="line">39</div><div class="line">40</div><div class="line">41</div><div class="line">42</div><div class="line">43</div></pre></td><td class="code"><pre><div class="line">{</div><div class="line"> "took": 14,</div><div class="line"> "timed_out": false,</div><div class="line"> "_shards": {</div><div class="line"> "total": 5,</div><div class="line"> "successful": 5,</div><div class="line"> "failed": 0</div><div class="line"> },</div><div class="line"> "hits": {</div><div class="line"> "total": 2,</div><div class="line"> "max_score": 2,</div><div class="line"> "hits": [</div><div class="line"> {</div><div class="line"> "_index": "index",</div><div class="line"> "_type": "fulltext",</div><div class="line"> "_id": "4",</div><div class="line"> "_score": 2,</div><div class="line"> "_source": {</div><div class="line"> "content": "中国驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"</div><div class="line"> },</div><div class="line"> "highlight": {</div><div 
class="line"> "content": [</div><div class="line"> "<tag1>中国</tag1>驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首 "</div><div class="line"> ]</div><div class="line"> }</div><div class="line"> },</div><div class="line"> {</div><div class="line"> "_index": "index",</div><div class="line"> "_type": "fulltext",</div><div class="line"> "_id": "3",</div><div class="line"> "_score": 2,</div><div class="line"> "_source": {</div><div class="line"> "content": "中韩渔警冲突调查:韩警平均每天扣1艘中国渔船"</div><div class="line"> },</div><div class="line"> "highlight": {</div><div class="line"> "content": [</div><div class="line"> "均每天扣1艘<tag1>中国</tag1>渔船 "</div><div class="line"> ]</div><div class="line"> }</div><div class="line"> }</div><div class="line"> ]</div><div class="line"> }</div><div class="line">}</div></pre></td></tr></table></figure>
<p>Reference:</p>
<p>Adding Chinese word segmentation to Elasticsearch: <a href="http://blog.csdn.net/dingzfang/article/details/42776693" target="_blank" rel="external">http://blog.csdn.net/dingzfang/article/details/42776693</a></p>
<h3 id="15、bosonnlp-(玻森数据中文分析器)"><a href="#15、bosonnlp-(玻森数据中文分析器)" class="headerlink" title="15、bosonnlp (玻森数据中文分析器)"></a>15. bosonnlp (BosonNLP Chinese analyzer)</h3><p>Download: <a href="https://github.com/bosondata/elasticsearch-analysis-bosonnlp" target="_blank" rel="external">https://github.com/bosondata/elasticsearch-analysis-bosonnlp</a></p>
<p>Usage:</p>
<p>Before running ElasticSearch, edit elasticsearch.yml in the config folder to define the BosonNLP Chinese analyzer, filling in your BosonNLP API_TOKEN and the URL of the BosonNLP tagging API, i.e. append the following at the end of the file:</p>
<figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div><div class="line">14</div><div class="line">15</div><div class="line">16</div><div class="line">17</div><div class="line">18</div><div class="line">19</div><div class="line">20</div><div class="line">21</div><div class="line">22</div></pre></td><td class="code"><pre><div class="line">index:</div><div class="line"> analysis:</div><div class="line"> analyzer:</div><div class="line"> bosonnlp:</div><div class="line"> type: bosonnlp</div><div class="line"> API_URL: http://api.bosonnlp.com/tag/analysis</div><div class="line"> # You MUST give the API_TOKEN value, otherwise it doesn't work</div><div class="line"> API_TOKEN: *PUT YOUR API TOKEN HERE*</div><div class="line"> # Please uncomment if you want to specify ANY ONE of the following</div><div class="line"> # arguments, otherwise the DEFAULT value will be used, i.e.,</div><div class="line"> # space_mode is 0,</div><div class="line"> # oov_level is 3,</div><div class="line"> # t2s is 0,</div><div class="line"> # special_char_conv is 0.</div><div class="line"> # More details can be found in bosonnlp docs:</div><div class="line"> # http://docs.bosonnlp.com/tag.html</div><div class="line"> #</div><div class="line"> #</div><div class="line"> # space_mode: put your value here(range from 0-3)</div><div class="line"> # oov_level: put your value here(range from 0-4)</div><div class="line"> # t2s: put your value here(range from 0-1)</div><div class="line"> # special_char_conv: put your value here(range from 0-1)</div></pre></td></tr></table></figure>
<p>Note:</p>
<p>You must fill in API_URL with the given tagging endpoint, and replace <em>PUT YOUR API TOKEN HERE</em> under API_TOKEN with the BosonNLP API_TOKEN you were given; otherwise the BosonNLP Chinese analyzer cannot be used. The API_TOKEN is obtained when you register a BosonNLP account.</p>
<p>If other analyzers are already configured in the file, simply add the bosonnlp analyzer under analyzer as shown above.</p>
<p>If you run multiple nodes that all need the BosonNLP plugin, the plugin must be installed, and the yaml file configured as above, on every node.</p>
<p>In addition, BosonNLP segmentation exposes four parameters (space_mode, oov_level, t2s, special_char_conv) to satisfy different segmentation needs. If the default values are acceptable, no changes are required; otherwise, uncomment the corresponding parameters and assign them values.</p>
<p><strong>Test:</strong></p>
<p>Create an index</p>
<figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">curl -XPUT 'localhost:9200/test'</div></pre></td></tr></table></figure>
<p>Verify that the analyzer is configured correctly</p>
<figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div></pre></td><td class="code"><pre><div class="line">curl -XGET 'localhost:9200/test/_analyze?analyzer=bosonnlp&pretty' -d '这是玻森数据分词的测试'</div></pre></td></tr></table></figure>
<p>Result</p>
<figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div><div class="line">14</div><div class="line">15</div><div class="line">16</div><div class="line">17</div><div class="line">18</div><div class="line">19</div><div class="line">20</div><div class="line">21</div><div class="line">22</div><div class="line">23</div><div class="line">24</div><div class="line">25</div><div class="line">26</div><div class="line">27</div><div class="line">28</div><div class="line">29</div><div class="line">30</div><div class="line">31</div><div class="line">32</div><div class="line">33</div><div class="line">34</div><div class="line">35</div><div class="line">36</div><div class="line">37</div><div class="line">38</div><div class="line">39</div><div class="line">40</div><div class="line">41</div><div class="line">42</div><div class="line">43</div><div class="line">44</div><div class="line">45</div></pre></td><td class="code"><pre><div class="line">{</div><div class="line"> "tokens" : [ {</div><div class="line"> "token" : "这",</div><div class="line"> "start_offset" : 0,</div><div class="line"> "end_offset" : 1,</div><div class="line"> "type" : "word",</div><div class="line"> "position" : 0</div><div class="line"> }, {</div><div class="line"> "token" : "是",</div><div class="line"> "start_offset" : 1,</div><div class="line"> "end_offset" : 2,</div><div class="line"> "type" : "word",</div><div class="line"> "position" : 1</div><div class="line"> }, {</div><div class="line"> "token" : "玻森",</div><div class="line"> "start_offset" : 2,</div><div class="line"> "end_offset" : 4,</div><div class="line"> "type" : "word",</div><div class="line"> "position" : 
2</div><div class="line"> }, {</div><div class="line"> "token" : "数据",</div><div class="line"> "start_offset" : 4,</div><div class="line"> "end_offset" : 6,</div><div class="line"> "type" : "word",</div><div class="line"> "position" : 3</div><div class="line"> }, {</div><div class="line"> "token" : "分词",</div><div class="line"> "start_offset" : 6,</div><div class="line"> "end_offset" : 8,</div><div class="line"> "type" : "word",</div><div class="line"> "position" : 4</div><div class="line"> }, {</div><div class="line"> "token" : "的",</div><div class="line"> "start_offset" : 8,</div><div class="line"> "end_offset" : 9,</div><div class="line"> "type" : "word",</div><div class="line"> "position" : 5</div><div class="line"> }, {</div><div class="line"> "token" : "测试",</div><div class="line"> "start_offset" : 9,</div><div class="line"> "end_offset" : 11,</div><div class="line"> "type" : "word",</div><div class="line"> "position" : 6</div><div class="line"> } ]</div><div class="line">}</div></pre></td></tr></table></figure>
<p>Configuring a token filter</p>
<p>The stock BosonNLP analyzer has no built-in token filters. If you need to filter tokens, you can combine the BosonNLP tokenizer with the token filters provided by ES to build a custom analyzer.</p>
<p>Steps</p>
<p>Configuring the custom analyzer takes three steps:</p>
<p>Add the BosonNLP tokenizer<br>In elasticsearch.yml, add a tokenizer section under analysis and configure the BosonNLP tokenizer in it:</p>
<figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div><div class="line">14</div><div class="line">15</div><div class="line">16</div><div class="line">17</div><div class="line">18</div><div class="line">19</div><div class="line">20</div><div class="line">21</div><div class="line">22</div><div class="line">23</div><div class="line">24</div></pre></td><td class="code"><pre><div class="line">index:</div><div class="line"> analysis:</div><div class="line"> analyzer:</div><div class="line"> ...</div><div class="line"> tokenizer:</div><div class="line"> bosonnlp:</div><div class="line"> type: bosonnlp</div><div class="line"> API_URL: http://api.bosonnlp.com/tag/analysis</div><div class="line"> # You MUST give the API_TOKEN value, otherwise it doesn't work</div><div class="line"> API_TOKEN: *PUT YOUR API TOKEN HERE*</div><div class="line"> # Please uncomment if you want to specify ANY ONE of the following</div><div class="line"> # arguments, otherwise the DEFAULT value will be used, i.e.,</div><div class="line"> # space_mode is 0,</div><div class="line"> # oov_level is 3,</div><div class="line"> # t2s is 0,</div><div class="line"> # special_char_conv is 0.</div><div class="line"> # More details can be found in bosonnlp docs:</div><div class="line"> # http://docs.bosonnlp.com/tag.html</div><div class="line"> #</div><div class="line"> #</div><div class="line"> # space_mode: put your value here(range from 0-3)</div><div class="line"> # oov_level: put your value here(range from 0-4)</div><div class="line"> # t2s: put your value here(range from 0-1)</div><div class="line"> # special_char_conv: put your value here(range from 0-1)</div></pre></td></tr></table></figure>
<p>Add a token filter</p>
<p>In elasticsearch.yml, add a filter section under analysis and configure the filters you need (the example below uses the lowercase filter):</p>
<figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div></pre></td><td class="code"><pre><div class="line">index:</div><div class="line"> analysis:</div><div class="line"> analyzer:</div><div class="line"> ...</div><div class="line"> tokenizer:</div><div class="line"> ...</div><div class="line"> filter:</div><div class="line"> lowercase:</div><div class="line"> type: lowercase</div></pre></td></tr></table></figure>
<p>Add the custom analyzer</p>
<p>In elasticsearch.yml, add the custom analyzer under the analyzer section of analysis (in the example below, the custom analyzer is named filter_bosonnlp):</p>
<figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div></pre></td><td class="code"><pre><div class="line">index:</div><div class="line"> analysis:</div><div class="line"> analyzer:</div><div class="line"> ...</div><div class="line"> filter_bosonnlp:</div><div class="line"> type: custom</div><div class="line"> tokenizer: bosonnlp</div><div class="line"> filter: [lowercase]</div></pre></td></tr></table></figure>
<hr>
<h2 id="自定义分词器"><a href="#自定义分词器" class="headerlink" title="自定义分词器"></a>Custom analyzers</h2><p>Although Elasticsearch ships with a number of analyzers out of the box, its real power on the analysis side is that you can create your own custom analyzer, combining character filters, tokenizers, and token filters in a configuration suited to your particular data.</p>
<p><strong>Character filters</strong>:</p>
<p>Character filters are used to tidy up a string before it is tokenized. For example, if our text is in HTML format, it will contain tags like <code><p></code> or <code><div></code> that we don't want indexed. We can use the html_strip character filter to remove all HTML tags and to convert HTML entities, e.g. turning <code>&Aacute;</code> into the corresponding Unicode character Á.</p>
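<p>Conceptually, a character filter is just a string-to-string transformation applied before tokenization. The sketch below imitates html_strip and a mapping char filter in plain Python; the function names are illustrative assumptions, not Elasticsearch's actual implementation.</p>

```python
import re
from html import unescape

def html_strip(text):
    """Remove HTML tags, then decode entities, mimicking the html_strip char filter."""
    return unescape(re.sub(r"<[^>]+>", "", text))

def mapping_filter(text, mappings):
    """Apply plain substring replacements, mimicking a mapping char filter."""
    for src, dst in mappings.items():
        text = text.replace(src, dst)
    return text

print(mapping_filter(html_strip("<p>Caf&eacute; &amp; bar</p>"), {"&": "and"}))
# → Café and bar
```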
<p>An analyzer may have zero or more character filters.</p>
<p><strong>Tokenizers</strong>:</p>
<p>An analyzer must have exactly one tokenizer. The tokenizer breaks the string into individual terms or tokens. The standard tokenizer, used by the standard analyzer, splits a string into individual terms on word boundaries and removes most punctuation, but other tokenizers with different behaviors exist.</p>
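<p>To make the word-boundary idea concrete, here is a toy tokenizer sketch (an assumption for illustration only, not the real standard tokenizer, which follows the Unicode text segmentation rules):</p>

```python
import re

def standard_like_tokenize(text):
    """Split on runs of word characters, dropping punctuation,
    roughly like the standard tokenizer splits on word boundaries."""
    return re.findall(r"\w+", text)

print(standard_like_tokenize("Quick brown-fox, jumps!"))
# → ['Quick', 'brown', 'fox', 'jumps']
```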
<p><strong>Token filters</strong>:</p>
<p>After tokenization, the resulting token stream is passed through the specified token filters, in the specified order.</p>
<p>Token filters may change, add, or remove tokens. We have already mentioned the lowercase and stop token filters, but Elasticsearch offers many more to choose from. Stemming filters reduce words to their stems. The ascii_folding filter removes diacritics, turning a word like “très” into “tres”. The ngram and edge_ngram token filters can produce tokens suitable for partial matching or autocomplete.</p>
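<p>A token filter is simply a function from one token stream to another. The sketch below (hypothetical helper names, not ES code) chains a lowercase filter into an edge_ngram-style filter, showing why the output suits autocomplete:</p>

```python
def lowercase_filter(tokens):
    """Lowercase every token, like the lowercase token filter."""
    return [t.lower() for t in tokens]

def edge_ngram_filter(tokens, min_gram=2, max_gram=4):
    """Emit leading-edge n-grams of each token,
    the idea behind edge_ngram autocomplete tokens."""
    grams = []
    for token in tokens:
        for n in range(min_gram, min(max_gram, len(token)) + 1):
            grams.append(token[:n])
    return grams

print(edge_ngram_filter(lowercase_filter(["Quick"])))
# → ['qu', 'qui', 'quic']
```

A prefix query such as "qui" then matches one of the stored grams directly, which is what makes this encoding useful for search-as-you-type.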
<h3 id="创建一个自定义分析器"><a href="#创建一个自定义分析器" class="headerlink" title="创建一个自定义分析器"></a>Creating a custom analyzer</h3><p>We can configure character filters, tokenizers, and token filters in their respective sections under analysis:</p>
<figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div></pre></td><td class="code"><pre><div class="line">PUT /my_index</div><div class="line">{</div><div class="line"> "settings": {</div><div class="line"> "analysis": {</div><div class="line"> "char_filter": { ... custom character filters ... },</div><div class="line"> "tokenizer": { ... custom tokenizers ... },</div><div class="line"> "filter": { ... custom token filters ... },</div><div class="line"> "analyzer": { ... custom analyzers ... }</div><div class="line"> }</div><div class="line"> }</div><div class="line">}</div></pre></td></tr></table></figure>
<p>This analyzer will do the following:</p>
<p>1. Strip out HTML with the html_strip character filter.</p>
<p>2. Replace & characters with " and ", using a custom mapping character filter:</p>
<figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div></pre></td><td class="code"><pre><div class="line">"char_filter": {</div><div class="line"> "&_to_and": {</div><div class="line"> "type": "mapping",</div><div class="line"> "mappings": [ "&=> and "]</div><div class="line"> }</div><div class="line">}</div></pre></td></tr></table></figure>
<p>3. Tokenize with the standard tokenizer.</p>
<p>4. Lowercase the terms with the lowercase token filter.</p>
<p>5. Remove words that appear in our custom stopword list, using a custom stop token filter:</p>
<figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div></pre></td><td class="code"><pre><div class="line">"filter": {</div><div class="line"> "my_stopwords": {</div><div class="line"> "type": "stop",</div><div class="line"> "stopwords": [ "the", "a" ]</div><div class="line"> }</div><div class="line">}</div></pre></td></tr></table></figure>
<p>Our analyzer definition now combines the predefined tokenizer and filters with the custom filters we configured above:</p>
<figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div></pre></td><td class="code"><pre><div class="line">"analyzer": {</div><div class="line"> "my_analyzer": {</div><div class="line"> "type": "custom",</div><div class="line"> "char_filter": [ "html_strip", "&_to_and" ],</div><div class="line"> "tokenizer": "standard",</div><div class="line"> "filter": [ "lowercase", "my_stopwords" ]</div><div class="line"> }</div><div class="line">}</div></pre></td></tr></table></figure>
<p>Putting it all together, the complete create-index request looks like this:</p>
<figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div><div class="line">10</div><div class="line">11</div><div class="line">12</div><div class="line">13</div><div class="line">14</div><div class="line">15</div><div class="line">16</div><div class="line">17</div><div class="line">18</div><div class="line">19</div><div class="line">20</div><div class="line">21</div><div class="line">22</div><div class="line">23</div></pre></td><td class="code"><pre><div class="line"></div><div class="line">PUT /my_index</div><div class="line">{</div><div class="line"> "settings": {</div><div class="line"> "analysis": {</div><div class="line"> "char_filter": {</div><div class="line"> "&_to_and": {</div><div class="line"> "type": "mapping",</div><div class="line"> "mappings": [ "&=> and "]</div><div class="line"> }},</div><div class="line"> "filter": {</div><div class="line"> "my_stopwords": {</div><div class="line"> "type": "stop",</div><div class="line"> "stopwords": [ "the", "a" ]</div><div class="line"> }},</div><div class="line"> "analyzer": {</div><div class="line"> "my_analyzer": {</div><div class="line"> "type": "custom",</div><div class="line"> "char_filter": [ "html_strip", "&_to_and" ],</div><div class="line"> "tokenizer": "standard",</div><div class="line"> "filter": [ "lowercase", "my_stopwords" ]</div><div class="line"> }}</div><div class="line">}}}</div></pre></td></tr></table></figure>
<p>Once the index is created, use the analyze API to test the new analyzer:</p>
<figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div></pre></td><td class="code"><pre><div class="line">GET /my_index/_analyze?analyzer=my_analyzer</div><div class="line">The quick & brown fox</div></pre></td></tr></table></figure>
<p>The abbreviated results below show that our analyzer is working correctly:</p>
<figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div></pre></td><td class="code"><pre><div class="line">{</div><div class="line"> "tokens" : [</div><div class="line"> { "token" : "quick", "position" : 2 },</div><div class="line"> { "token" : "and", "position" : 3 },</div><div class="line"> { "token" : "brown", "position" : 4 },</div><div class="line"> { "token" : "fox", "position" : 5 }</div><div class="line"> ]</div><div class="line">}</div></pre></td></tr></table></figure>
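<p>The whole pipeline (html_strip, the &_to_and mapping, the standard tokenizer, lowercase, and my_stopwords) can be imitated in a few lines of plain Python. This is only a simulation under the assumptions of this example, not Elasticsearch itself, but it reproduces the same tokens for the test sentence:</p>

```python
import re
from html import unescape

STOPWORDS = {"the", "a"}  # the my_stopwords list from the example

def my_analyzer(text):
    # Char filters: strip HTML tags / decode entities, then map "&" to " and ".
    text = unescape(re.sub(r"<[^>]+>", "", text))
    text = text.replace("&", " and ")
    # Tokenizer: split on word boundaries, standard-tokenizer style.
    tokens = re.findall(r"\w+", text)
    # Token filters: lowercase, then drop custom stopwords.
    return [t.lower() for t in tokens if t.lower() not in STOPWORDS]

print(my_analyzer("The quick & brown fox"))
# → ['quick', 'and', 'brown', 'fox']
```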
<p>This analyzer is of little use unless we tell Elasticsearch where to apply it. We can apply it to a string field like this:</p>
<figure class="highlight plain"><table><tr><td class="gutter"><pre><div class="line">1</div><div class="line">2</div><div class="line">3</div><div class="line">4</div><div class="line">5</div><div class="line">6</div><div class="line">7</div><div class="line">8</div><div class="line">9</div></pre></td><td class="code"><pre><div class="line">PUT /my_index/_mapping/my_type</div><div class="line">{</div><div class="line"> "properties": {</div><div class="line"> "title": {</div><div class="line"> "type": "string",</div><div class="line"> "analyzer": "my_analyzer"</div><div class="line"> }</div><div class="line"> }</div><div class="line">}</div></pre></td></tr></table></figure>
<h3 id="最后"><a href="#最后" class="headerlink" title="最后"></a>Closing note</h3><p>This article was compiled with reference to material found online; if anything is incorrect, corrections are very welcome!</p>
]]></content>
<summary type="html">
<p>Introduction: ElasticSearch is a search server built on Lucene. It provides a distributed, multi-tenant full-text search engine with a RESTful web interface. Elasticsearch is developed in Java and released as open source under the Apache License; it is a popular enterprise search engine. Designed for the cloud, it delivers real-time search and is stable, reliable, fast, and easy to install and use.</p>
<p>Elasticsearch ships with many built-in analyzers. Below we compare the default analyzers with the commonly used Chinese analyzers.<br>
</summary>
<category term="Elasticsearch" scheme="http://yoursite.com/tags/Elasticsearch/"/>
</entry>
<entry>
<title>The Books I Read Over the Years - A Tribute to My University Life - Say Good Bye!</title>
<link href="http://yoursite.com/2017/08/26/recommend-books/"/>
<id>http://yoursite.com/2017/08/26/recommend-books/</id>
<published>2017-08-26T05:38:47.267Z</published>
<updated>2017-08-28T13:59:39.113Z</updated>
<content type="html"><![CDATA[<p><img src="http://ohfk1r827.bkt.clouddn.com/learn-2004897_960_720.png-1" alt=""><br><a id="more"></a></p>
<h3 id="开头"><a href="#开头" class="headerlink" title="开头"></a>Opening</h3><p>On 2017-08-21 I officially started my job, and I have now been working for a week. This week was my onboarding transition period, and it taught me the gap between school life and working life; in short, I'd better get used to it quickly. Below I describe my job-hunting experience and the books I read in university, as a kind of reading list to point the way for those of you who are still feeling lost, and as a tribute to my three years of university life! It also motivates me to go further during my senior-year internship at the company!</p>
<h3 id="找工作经历"><a href="#找工作经历" class="headerlink" title="找工作经历"></a>Job-hunting experience</h3><p>This is an experience I will find hard to forget: bitter yet fulfilling days! I am grateful for the systematic review I did during that period; it felt like it pulled my fundamentals back together, and my skills improved quickly. If I get the chance later, I would like to write a series of related articles to offer some help to students preparing for job interviews. During my search I interviewed at several big companies, but the results were not good, which put a lot of pressure on me. At the time I genuinely felt, "Am I really that bad?", wondered why I kept being rejected, and seriously doubted my own ability. I did analyze the reasons. First, I was not sufficiently prepared for the interviews: I had some impression of the fundamentals in my head, but once I got a little nervous I could not describe them clearly, leaving the interviewer thinking I might have breadth but not depth (this is what an Alibaba interviewer said in a phone interview). Second, my ability to express myself still falls short; I could not articulate what I wanted to say, which is something I need to work on. Third, there was the matter of my school. After failing several interviews, a company finally made me an offer, and I accepted it. I have been very lucky: fresh out of school, a very good (and very responsible) architect is mentoring me, and this week he handed me an impressive project to study (although I do not yet have any idea how to change the code in it). It is full of new things, and he says that once I have digested this project I will really have something to show off (straight-faced). That is my job-hunting experience in brief; if you are interested, you can join the QQ group 528776268 to discuss it with me.</p>