Contents
hanlp
Suppressing TensorFlow warnings
Whitelist:
Example code:
Test case:
jieba segmentation:
Common stopword lists
paddle lac
hanlp
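It can recognize place names, person names, and mobile phone numbers.
It also performs intelligent word segmentation and recognizes the entries listed in a whitelist.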
https://github.com/hankcs/HanLP/blob/doc-zh/plugins/hanlp_demo/hanlp_demo/zh/ner_mtl.ipynb
hanlp_common
pip install pynvml
pip install hanlp_downloader
pip install hanlp[full] -U
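This automatically installs TensorFlow. If the TensorFlow download times out, download the TensorFlow wheel yourself,
then install it and rerun the hanlp install: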
pip install tensorflow_intel-2.16.1-cp310-cp310-win_amd64.whl --user
pip install hanlp[full] -U
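After the install steps above, it can help to confirm that the packages are actually importable before trying to load a model. The following is a minimal sketch using only the standard library. The helper name `check_packages` is hypothetical, and the names it checks are simply the import names of the packages installed above.

```python
import importlib.util


def check_packages(names):
    """Return {name: True/False} depending on whether each top-level package can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == '__main__':
    # Import names of the packages installed above
    for name, found in check_packages(['hanlp', 'tensorflow', 'pynvml']).items():
        print(name, 'OK' if found else 'MISSING')
```

A package reported as MISSING here usually means the corresponding pip step did not finish, for example because of the download timeout mentioned above.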
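Suppressing TensorFlow warnings
import os
# Set this before importing tensorflow or hanlp; '2' hides INFO and WARNING messages
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'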
Whitelist:
whitelist.txt
Example (one entry per line, in the form entity:label; 歌曲 means "song"):
科目三:歌曲
一分一秒:歌曲
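The whitelist format above (one `entity:label` pair per line) can be parsed a little more defensively than a bare `split(':')`. This is a minimal sketch, not part of HanLP. The function name `load_whitelist` is hypothetical, and it assumes the separator is an ASCII colon as in the example.

```python
def load_whitelist(path):
    """Parse 'entity:label' lines into a dict, skipping blank or malformed lines."""
    whitelist = {}
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            # Ignore empty lines and lines without a separator
            if not line or ':' not in line:
                continue
            # Split on the first colon only, so labels may contain colons
            entity, label = line.split(':', 1)
            entity, label = entity.strip(), label.strip()
            if entity and label:
                whitelist[entity] = label
    return whitelist
```

The returned dict has the same shape as the one assigned to `ner.dict_whitelist` in the example code, so it could be used in place of the inline loop.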
Example code:
# -*- coding:utf-8 -*-
import os
# Silence TensorFlow INFO/WARNING logs; must be set before hanlp is imported
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
import hanlp
from hanlp.components.mtl.tasks.ner.tag_ner import TaggingNamedEntityRecognition

# Load the pretrained multi-task model
HanLP = hanlp.load(hanlp.pretrained.mtl.CLOSE_TOK_POS_NER_SRL_DEP_SDP_CON_ERNIE_GRAM_ZH)
ner: TaggingNamedEntityRecognition = HanLP['ner/msra']

# Read the whitelist: one "entity:label" entry per line
ner_dict_whitelist = {}
with open('whitelist.txt', 'r', encoding='utf-8') as file:
    for line in file:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        key, value = line.split(':', 1)
        ner_dict_whitelist[key] = value
ner.dict_whitelist = ner_dict_whitelist

doc = HanLP('给我买一双红色的袜子 小华', tasks='ner/msra')
# doc.pretty_print()
for word in doc['tok/fine']:
    print("word", word)
for name in doc['ner/msra']:
    print("name", name)
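The example above prints each recognized entity as it comes. For downstream use it is often handier to group them by label. The sketch below assumes each item in `doc['ner/msra']` is a tuple of the form (text, label, begin, end); the helper name `group_entities` is hypothetical, and the sample list is hand-written in that assumed shape, not real model output.

```python
from collections import defaultdict


def group_entities(entities):
    """Group (text, label, begin, end) tuples by label, keeping first-seen order."""
    grouped = defaultdict(list)
    for text, label, *_ in entities:
        # Keep each entity text once per label
        if text not in grouped[label]:
            grouped[label].append(text)
    return dict(grouped)


if __name__ == '__main__':
    # Hand-written sample in the assumed tuple shape, NOT real model output
    sample = [('小华', 'PERSON', 5, 6), ('科目三', '歌曲', 0, 1)]
    print(group_entities(sample))
```

With the real pipeline, the call would be `group_entities(doc['ner/msra'])`, provided the output really has the assumed tuple shape.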
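Test case:
给我买一双红色的袜子 小华 ("Buy me a pair of red socks, Xiaohua")
红色 (red), 袜子 (socks), and 小华 (Xiaohua) were recognized; 一双 (a pair) was not.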
jieba segmentation:
Common stopword lists
https://github.com/goto456/stopwords
pip install jieba
import jieba


# Segment the text and print the words that survive stopword filtering
def stripdata(Text):
    # jieba enables the HMM (hidden Markov model) by default for Chinese word segmentation
    seg_list = jieba.cut(Text)  # segment
    # Join the tokens, then remove stopwords
    line = "/".join(seg_list)
    wordlist = stripword(line)
    print('Segmentation result')
    for value in wordlist:
        print(value)
    # print("\nKeywords:\n" + "\n".join(wordlist))


# Stopword filtering
def stripword(seg):
    wordlist = []
    # Load the stopword list
    with open('cn_stopwords.txt', 'r', encoding='utf-8') as stop:
        stopword = set(stop.read().split("\n"))
    # Walk the segmented tokens
    for key in seg.split('/'):
        key = key.strip()
        # Drop stopwords, single characters, and duplicates
        if key not in stopword and len(key) > 1 and key not in wordlist:
            wordlist.append(key)
    # Optional: change to True to save the filtered words to key_word.txt
    if False:
        print("Stopwords removed:\n")
        with open('key_word.txt', 'w', encoding='utf-8') as keyword:
            for key in wordlist:
                keyword.write(key + "\n")
    return wordlist


if __name__ == '__main__':
    stripdata('给我买一双红色的袜子 小华')
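The filtering above works on a "/"-joined string. The same rule (drop stopwords, single characters, and duplicates) can also be applied directly to a token list, which avoids trouble if a token itself contains "/". This is a minimal sketch that does not need jieba. The name `filter_tokens` and the `min_len` parameter are hypothetical, and the token list and stopword set are hand-written stand-ins for jieba output and cn_stopwords.txt.

```python
def filter_tokens(tokens, stopwords, min_len=2):
    """Drop stopwords, tokens shorter than min_len, and duplicates (first occurrence wins)."""
    stopwords = set(stopwords)
    seen = set()
    kept = []
    for token in tokens:
        token = token.strip()
        if len(token) < min_len or token in stopwords or token in seen:
            continue
        seen.add(token)
        kept.append(token)
    return kept


if __name__ == '__main__':
    # Hand-written token list standing in for jieba output
    tokens = ['给', '我', '买', '一双', '红色', '的', '袜子', ' ', '小华', '红色']
    # Hypothetical stopword set, not the contents of cn_stopwords.txt
    print(filter_tokens(tokens, stopwords={'的', '一双'}))
```

With jieba installed, `filter_tokens(jieba.cut(text), stopword)` would take the place of the join and split round trip.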
paddle lac
https://github.com/baidu/LAC