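Translation of the official README for the fastText Python module on GitHub

Reference: fastText/python/README.md at main · facebookresearch/fastText (github.com)

Introduction to the fastText module

fastText is a library for efficient learning of word representations and sentence classification. This document describes how to use fastText in Python.

Requirements

fastText builds on modern Mac OS and Linux distributions. Since it uses C++11 features, it requires a compiler with good C++11 support. You will need Python (version 2.7 or ≥ 3.4), NumPy & SciPy, and pybind11.

Installation

To install the latest release, run: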

$ pip install fasttext

Or, to get the latest development version of fasttext, you can install it from our GitHub repository:

$ git clone https://github.com/facebookresearch/fastText.git
$ cd fastText
$ sudo pip install .
$ # or :
$ sudo python setup.py install

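Usage overview

Word representation model

To learn word vectors, as described in the reference linked from the original README, we can use the fasttext.train_unsupervised function like this: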

import fasttext

# Skipgram model :
model = fasttext.train_unsupervised('data.txt', model='skipgram')

# or, cbow model :
model = fasttext.train_unsupervised('data.txt', model='cbow')

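where data.txt is a training file containing UTF-8 encoded text. The returned model object represents your learned model, and you can use it to retrieve information.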

print(model.words)   # list of words in dictionary
print(model['king']) # get the vector of the word 'king'

Saving and loading a model object

You can save a trained model object by calling the function save_model.

model.save_model("model_filename.bin")

and load it back later with the function load_model:

model = fasttext.load_model("model_filename.bin")

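Text classification model

To train a text classifier using the method described in the reference linked from the original README, we can use the fasttext.train_supervised function like this: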

import fasttext

model = fasttext.train_supervised('data.train.txt')

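where data.train.txt is a text file containing one training sentence per line along with its labels. By default, we assume that labels are words prefixed by the string __label__. Once the model is trained, we can retrieve the list of words and labels: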

print(model.words)
print(model.labels)
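
The training-file format described above (one sentence per line, labels prefixed with __label__) can be sketched in plain Python. The helper names format_example and parse_example are hypothetical and not part of fastText, and putting the labels at the start of the line is only a common convention assumed here.

```python
# Illustrative helpers; these names are hypothetical, not fastText API.
LABEL_PREFIX = "__label__"  # default prefix, see the `label` training parameter


def format_example(labels, text):
    """Build one training line: prefixed labels followed by the sentence."""
    prefixed = " ".join(LABEL_PREFIX + label for label in labels)
    return prefixed + " " + text


def parse_example(line):
    """Split a training line back into (labels, text)."""
    tokens = line.split()
    labels = [t[len(LABEL_PREFIX):] for t in tokens if t.startswith(LABEL_PREFIX)]
    words = [t for t in tokens if not t.startswith(LABEL_PREFIX)]
    return labels, " ".join(words)


line = format_example(
    ["baking", "equipment"],
    "Which baking dish is best to bake a banana bread ?",
)
print(line)
print(parse_example(line))
```

Writing one such line per example to a text file gives input in the shape that train_supervised expects.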

To evaluate the model by computing the precision at 1 (P@1) and the recall on a test set, we use the test function:

def print_results(N, p, r):
    print("N\t" + str(N))
    print("P@{}\t{:.3f}".format(1, p))
    print("R@{}\t{:.3f}".format(1, r))

print_results(*model.test('test.txt'))
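
To make the two numbers concrete, here is a minimal sketch of how precision and recall at k can be computed from ranked predictions. It assumes the usual definitions (correct predictions divided by predictions made, and correct predictions divided by true labels); the function name and the sample data are hypothetical, and this is not fastText's own implementation.

```python
def precision_recall_at_k(examples, k=1):
    """Compute precision@k and recall@k.

    examples: list of (true_labels, ranked_predictions) pairs, where
    ranked_predictions is ordered from most to least probable.
    """
    n_correct = n_predicted = n_true = 0
    for true_labels, ranked in examples:
        top_k = ranked[:k]
        n_correct += sum(1 for label in top_k if label in true_labels)
        n_predicted += len(top_k)
        n_true += len(true_labels)
    return n_correct / n_predicted, n_correct / n_true


# Hypothetical test set: three sentences, the second one has two true labels.
examples = [
    ({"baking"}, ["baking", "bread"]),
    ({"knives", "dishwasher"}, ["knives", "baking"]),
    ({"bread"}, ["baking", "bread"]),
]
p, r = precision_recall_at_k(examples, k=1)
print("P@1\t{:.3f}".format(p))
print("R@1\t{:.3f}".format(r))
```

With multi-label data the recall at 1 is bounded by the share of labels a single prediction per sentence can cover, which is why the two values can differ.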

We can also predict labels for a specific text:

model.predict("Which baking dish is best to bake a banana bread ?")

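By default, predict returns only one label: the one with the highest probability. You can also predict more than one label by specifying the parameter k: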

model.predict("Which baking dish is best to bake a banana bread ?", k=3)

If you want to predict labels for more than one sentence, you can pass an array of strings:

model.predict(["Which baking dish is best to bake a banana bread ?", "Why not put knives in the dishwasher?"], k=3)
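
The effect of k can be illustrated with a small stand-alone sketch that picks the k most probable labels from a score table. The function top_k_labels and the probabilities are hypothetical, and this only mimics the idea of returning labels together with their probabilities; it is not how fastText computes predictions.

```python
def top_k_labels(scores, k=1):
    """Return the k highest-scoring labels and their scores, best first."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]
    labels = tuple(label for label, _ in ranked)
    probs = tuple(prob for _, prob in ranked)
    return labels, probs


# Made-up probabilities for a single sentence.
scores = {
    "__label__baking": 0.7,
    "__label__bread": 0.2,
    "__label__equipment": 0.1,
}
print(top_k_labels(scores))       # only the most probable label
print(top_k_labels(scores, k=2))  # the two most probable labels
```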

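Of course, you can also save a model to a file and load it back, just as with the word representation model.

Compressing model files with quantization

When you want to save a supervised model file, fastText can compress it to produce a much smaller model file at the cost of only a small loss in performance.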

# with the previously trained `model` object, call :
model.quantize(input='data.train.txt', retrain=True)

# then display results and save the new model :
print_results(*model.test(valid_data))  # valid_data: path to a held-out validation file
model.save_model("model_filename.ftz")

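model_filename.ftz will be much smaller than model_filename.bin.

IMPORTANT: Preprocessing data / encoding conventions

In general, it is important to preprocess your data properly. In particular, the example scripts in the root folder do this.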

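fastText assumes UTF-8 encoded text. All text must be unicode for Python2 and str for Python3. The passed text is encoded as UTF-8 by pybind11 before being passed to the fastText C++ library. This means it is important to use UTF-8 encoded text when building a model. On Unix-like systems you can convert text using iconv.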
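
As an alternative to iconv, the same conversion can be sketched with the Python standard library. The function name convert_to_utf8, the file names, and the Latin-1 source encoding are assumptions made for this example only.

```python
import os
import tempfile


def convert_to_utf8(src_path, dst_path, src_encoding):
    """Re-encode a text file as UTF-8, decoding it with the given source encoding."""
    with open(src_path, "r", encoding=src_encoding, newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            dst.write(line)


# Demo on a temporary Latin-1 file (hypothetical data).
with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, "data.latin1.txt")
    dst_path = os.path.join(tmp, "data.txt")
    with open(src_path, "w", encoding="latin-1", newline="") as f:
        f.write("café crème\n")
    convert_to_utf8(src_path, dst_path, "latin-1")
    with open(dst_path, "rb") as f:
        converted = f.read()

print(converted.decode("utf-8"))
```

The source encoding has to be known in advance; decoding with the wrong encoding silently produces wrong characters rather than an error in many cases.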

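fastText tokenizes (splits text into pieces) based on the following ASCII characters (bytes). In particular, it is not aware of UTF-8 whitespace. We advise users to convert UTF-8 whitespace / word boundaries into one of the following symbols as appropriate.

space
tab
vertical tab
carriage return
formfeed
the null character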
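
One way to follow this advice is to map non-ASCII whitespace to a plain ASCII space before training. The sketch below relies on Python's str.isspace to decide what counts as whitespace; the function name normalize_whitespace is hypothetical, and which characters should be treated as word boundaries for a given language remains the user's decision.

```python
def normalize_whitespace(text):
    """Replace non-ASCII Unicode whitespace with an ASCII space.

    ASCII characters, including newline and tab, are left untouched.
    """
    return "".join(
        " " if (ch.isspace() and ord(ch) > 127) else ch
        for ch in text
    )


# U+00A0 is a no-break space, U+3000 is an ideographic space.
sample = "banana\u00a0bread\u3000recipe"
print(normalize_whitespace(sample).split(" "))
```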

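The newline character is used to delimit lines of text. In particular, the EOS token is appended to a line of text when a newline character is encountered. The only exception is when the number of tokens exceeds the MAX_LINE_SIZE constant defined in the Dictionary header. This means that text that is not separated by newlines, such as the fil9 dataset, is broken into chunks of MAX_LINE_SIZE tokens, and the EOS token is not appended.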
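
The line-splitting behaviour described above can be approximated with a simplified sketch. It assumes that the EOS token is the string "</s>", uses a tiny hypothetical limit in place of MAX_LINE_SIZE, and does not reproduce the exact boundary handling of the C++ code.

```python
import re

EOS = "</s>"  # assumed EOS token string


def read_lines(text, max_line_size):
    """Yield lists of tokens.

    EOS is appended only when a newline ends the line; a chunk cut off
    because it reached max_line_size tokens gets no EOS.
    """
    tokens = []
    # Match either a newline or a run of characters that are not separators.
    for piece in re.findall(r"\n|[^ \t\v\f\r\x00\n]+", text):
        if piece == "\n":
            tokens.append(EOS)
            yield tokens
            tokens = []
        else:
            tokens.append(piece)
            if len(tokens) == max_line_size:
                yield tokens
                tokens = []
    if tokens:
        yield tokens


# The first line ends with a newline; the second one is too long for the limit.
for line in read_lines("a b\nc d e f g", max_line_size=3):
    print(line)
```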

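The length of a token is the number of UTF-8 characters, determined by examining the leading two bits of each byte to identify the subsequent bytes of a multi-byte sequence. Knowing this is especially important when choosing the minimum and maximum length of subwords. Furthermore, the EOS token (as specified in the Dictionary header) is treated as a single character and is not broken into subwords.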
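
The counting rule can be sketched as follows: in UTF-8, continuation bytes have the bit pattern 10xxxxxx, so every byte whose leading two bits are not 10 starts a new character. The function name utf8_char_count is hypothetical.

```python
def utf8_char_count(token):
    """Count UTF-8 characters by skipping continuation bytes (10xxxxxx)."""
    data = token.encode("utf-8")
    return sum(1 for byte in data if (byte & 0xC0) != 0x80)


# Byte length and character length differ for non-ASCII tokens.
for token in ["king", "na\u00efve", "\u65e5\u672c\u8a9e"]:
    print(token, len(token.encode("utf-8")), utf8_char_count(token))
```

This is why minn and maxn should be chosen in terms of characters rather than bytes, for example when working with CJK text where each character takes three bytes.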

API

train_unsupervised parameters

    input             # training file path (required)
    model             # unsupervised fasttext model {cbow, skipgram} [skipgram]
    lr                # learning rate [0.05]
    dim               # size of word vectors [100]
    ws                # size of the context window [5]
    epoch             # number of epochs [5]
    minCount          # minimal number of word occurrences [5]
    minn              # min length of char ngram [3]
    maxn              # max length of char ngram [6]
    neg               # number of negatives sampled [5]
    wordNgrams        # max length of word ngram [1]
    loss              # loss function {ns, hs, softmax, ova} [ns]
    bucket            # number of buckets [2000000]
    thread            # number of threads [number of cpus]
    lrUpdateRate      # change the rate of updates for the learning rate [100]
    t                 # sampling threshold [0.0001]
    verbose           # verbose [2]

train_supervised parameters

    input             # training file path (required)
    lr                # learning rate [0.1]
    dim               # size of word vectors [100]
    ws                # size of the context window [5]
    epoch             # number of epochs [5]
    minCount          # minimal number of word occurrences [1]
    minCountLabel     # minimal number of label occurrences [1]
    minn              # min length of char ngram [0]
    maxn              # max length of char ngram [0]
    neg               # number of negatives sampled [5]
    wordNgrams        # max length of word ngram [1]
    loss              # loss function {ns, hs, softmax, ova} [softmax]
    bucket            # number of buckets [2000000]
    thread            # number of threads [number of cpus]
    lrUpdateRate      # change the rate of updates for the learning rate [100]
    t                 # sampling threshold [0.0001]
    label             # label prefix ['__label__']
    verbose           # verbose [2]
    pretrainedVectors # pretrained word vectors (.vec file) for supervised learning []

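Model object

The train_supervised, train_unsupervised and load_model functions return an instance of the _FastText class, which we generally call the model object.

This object exposes the training arguments as properties: lr, dim, ws, epoch, minCount, minCountLabel, minn, maxn, neg, wordNgrams, loss, bucket, thread, lrUpdateRate, t, label, verbose, pretrainedVectors. So model.wordNgrams gives the maximum length of word ngrams used to train this model.

In addition, the object exposes several functions: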

    get_dimension           # Get the dimension (size) of a lookup vector (hidden layer).
                            # This is equivalent to `dim` property.
    get_input_vector        # Given an index, get the corresponding vector of the Input Matrix.
    get_input_matrix        # Get a copy of the full input matrix of a Model.
    get_labels              # Get the entire list of labels of the dictionary
                            # This is equivalent to `labels` property.
    get_line                # Split a line of text into words and labels.
    get_output_matrix       # Get a copy of the full output matrix of a Model.
    get_sentence_vector     # Given a string, get a single vector representation. This function
                            # assumes to be given a single line of text. We split words on
                            # whitespace (space, newline, tab, vertical tab) and the control
                            # characters carriage return, formfeed and the null character.
    get_subword_id          # Given a subword, return the index (within input matrix) it hashes to.
    get_subwords            # Given a word, get the subwords and their indices.
    get_word_id             # Given a word, get the word id within the dictionary.
    get_word_vector         # Get the vector representation of word.
    get_words               # Get the entire list of words of the dictionary
                            # This is equivalent to `words` property.
    is_quantized            # whether the model has been quantized
    predict                 # Given a string, get a list of labels and a list of corresponding probabilities.
    quantize                # Quantize the model, reducing the size of the model and its memory footprint.
    save_model              # Save the model to the given path
    test                    # Evaluate supervised model using file given by path
    test_label              # Return the precision and recall score for each label.

The properties words and labels return the words and labels from the dictionary:

model.words         # equivalent to model.get_words()
model.labels        # equivalent to model.get_labels()

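The object overrides the __getitem__ and __contains__ functions in order to return the representation of a word and to check whether a word is in the vocabulary: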

model['king']       # equivalent to model.get_word_vector('king')
'king' in model     # equivalent to `'king' in model.get_words()`
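
To show what overriding these two functions means, here is a toy, dictionary-backed stand-in. The class ToyModel and its vectors are entirely hypothetical; it only mirrors the two equivalences listed above and is not the real _FastText class.

```python
class ToyModel:
    """Toy stand-in for a model object; not the real _FastText class."""

    def __init__(self, vectors):
        self._vectors = dict(vectors)

    def get_words(self):
        """Return the vocabulary as a list."""
        return list(self._vectors)

    def get_word_vector(self, word):
        """Return the stored vector for a word."""
        return self._vectors[word]

    def __getitem__(self, word):
        # model[word] delegates to get_word_vector(word)
        return self.get_word_vector(word)

    def __contains__(self, word):
        # `word in model` delegates to a vocabulary lookup
        return word in self.get_words()


model = ToyModel({"king": [0.1, 0.2], "queen": [0.3, 0.4]})
print(model["king"])
print("king" in model)
print("jester" in model)
```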