UserWarning: Label not :NUMBER: is present in all training examples

2018-07-04 07:52:52

I'm doing multilabel classification, trying to predict the correct tags for each document. Here is my code:

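from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import cross_val_predict
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

mlb = MultiLabelBinarizer()
X = dataframe['body'].values
y = mlb.fit_transform(dataframe['tag'].values)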

classifier = Pipeline([
    ('vectorizer', CountVectorizer(lowercase=True, 
                                   stop_words='english', 
                                   max_df = 0.8, 
                                   min_df = 10)),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC()))])

predicted = cross_val_predict(classifier, X, y)

When I run my code, I get multiple warnings:

UserWarning: Label not :NUMBER: is present in all training examples.

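When I print out the predicted and true labels, roughly half of all documents have an empty set of predicted labels.

Why is this happening? Is it related to the warnings printed while training is running? How can I avoid these empty predictions?

EDIT01: This also happens when using estimators other than LinearSVC().

I tried RandomForestClassifier() and it gives empty predictions as well. Strangely, when I use cross_val_predict(classifier, X, y, method='predict_proba') to predict a probability for each label instead of a binary 0/1 decision, every prediction set always contains at least one label with probability > 0 for the given document. So I don't understand why that label is not chosen by the binary decision. Or is the binary decision evaluated differently from the probabilities?

EDIT02: I found an old post where the OP was dealing with a similar problem. Is this the same case?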

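Why is this happening? Is it related to the warnings printed while training is running?

The issue is most likely that some tags occur in only a few documents (see the linked thread for details). When you split the dataset into train and test sets to validate the model, some tags may end up missing from the training data. Let train_indices be an array with the indices of the training samples. If a particular tag (with index k) does not occur in the training sample, all the elements in the k-th column of the indicator matrix y[train_indices] are zeros.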
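To make the all-zero column concrete, here is a minimal sketch that lists the label columns absent from a training split. The helper name missing_label_columns and the small matrix y_demo are hypothetical and only serve as an illustration; they are not part of the original code.

```python
import numpy as np

def missing_label_columns(y, train_indices):
    """Return the indices of label columns that are all zero in the training rows."""
    y_train = np.asarray(y)[train_indices]
    # A column that sums to zero means the label never occurs in the training split.
    return np.flatnonzero(y_train.sum(axis=0) == 0)

# Hypothetical indicator matrix: 4 documents, 3 labels.
y_demo = np.array([[1, 0, 0],
                   [1, 0, 0],
                   [0, 1, 0],
                   [0, 0, 1]])
train_demo = np.array([0, 1, 2])  # document 3 is held out for testing

missing = missing_label_columns(y_demo, train_demo)
print(missing)  # label 2 only occurs in the held-out document
```

Running the same check on y and train_indices from a real split should point at the labels named in the warnings.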

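How can I avoid these empty predictions?

In the scenario above, the classifier cannot reliably predict the k-th tag for the test documents (more on this in the next paragraph). You therefore cannot trust the predictions made by clf.predict, and you need to implement the prediction function yourself, for example by using the decision values returned by clf.decision_function, as suggested in the linked answer.

So I don't understand why that label is not chosen by the binary decision. Or is the binary decision evaluated differently from the probabilities?

In datasets with many tags, most tags occur very rarely. If such rare tags are fed to a binary classifier (i.e. a classifier that makes a 0-1 prediction), the classifier will very likely pick 0 for all tags on all documents.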
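The difference between binary decisions and scores can be sketched with plain NumPy. The score matrix below is made up for illustration, and the cutoff of 0 is an assumption that mirrors how a one-vs-rest predict step is generally understood to turn decision values into 0/1 labels (for probabilities the analogous cutoff would be 0.5).

```python
import numpy as np

# Hypothetical per-label decision scores: 3 documents, 4 labels.
scores = np.array([[-0.9, -0.2, -0.7, -0.8],
                   [ 0.3, -0.5, -0.6, -0.4],
                   [-0.1, -0.3, -0.2, -0.6]])

# Binary decision: a label is kept only if its score clears the cutoff (assumed to be 0).
binary = (scores > 0).astype(int)

# Documents for which no label clears the cutoff get an empty prediction.
empty_rows = np.flatnonzero(binary.sum(axis=1) == 0)

# Ranking by score always yields a best label, even when every score is negative.
top1 = scores.argmax(axis=1)

print(binary)
print(empty_rows)
print(top1)
```

A label can therefore be the most likely one for a document and still be rejected by the binary decision, which is why the probabilities never look empty while the 0/1 predictions do.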

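I found an old post where the OP was dealing with a similar problem. Is this the same case?

Yes, absolutely. That person was facing exactly the same problem as you, and their code is very similar to yours.

Demo

To explain the issue further, I have put together a simple toy example using mock data.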

import numpy as np
import pandas as pd

# Mock questions mapped to their tags
Q = {'What does the "yield" keyword do in Python?': ['python'],
     'What is a metaclass in Python?': ['oop'],
     'How do I check whether a file exists using Python?': ['python'],
     'How to make a chain of function decorators?': ['python', 'decorator'],
     'Using i and j as variables in Matlab': ['matlab', 'naming-conventions'],
     'MATLAB: get variable type': ['matlab'],
     'Why is MATLAB so fast in matrix multiplication?': ['performance'],
     'Is MATLAB OOP slow or am I doing something wrong?': ['matlab-oop'],
    }
# list() is needed in Python 3, where keys() and values() return views
dataframe = pd.DataFrame({'body': list(Q.keys()), 'tag': list(Q.values())})

mlb = MultiLabelBinarizer()
X = dataframe['body'].values 
y = mlb.fit_transform(dataframe['tag'].values)

classifier = Pipeline([
    ('vectorizer', CountVectorizer(lowercase=True, 
                                   stop_words='english', 
                                   max_df=0.8, 
                                   min_df=1)),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC()))])

Note that I have set min_df=1 because my dataset is much smaller than yours. When I run the following statement:

predicted = cross_val_predict(classifier, X, y)

I get a bunch of warnings:

C:...multiclass.py:76: UserWarning: Label not 4 is present in all training examples.
  str(classes[c]))
C:...multiclass.py:76: UserWarning: Label not 0 is present in all training examples.
  str(classes[c]))
C:...multiclass.py:76: UserWarning: Label not 3 is present in all training examples.
  str(classes[c]))
C:...multiclass.py:76: UserWarning: Label not 5 is present in all training examples.
  str(classes[c]))
C:...multiclass.py:76: UserWarning: Label not 2 is present in all training examples.
  str(classes[c]))

and the following prediction:

In [5]: np.set_printoptions(precision=2, threshold=1000)    

In [6]: predicted
Out[6]: 
array([[0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0],
       [0, 1, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0]])

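The rows whose entries are all 0 indicate that no tag was predicted for the corresponding document.

Workaround

For the sake of analysis, let us validate the model manually rather than through cross_val_predict.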

import warnings
from sklearn.model_selection import ShuffleSplit

rs = ShuffleSplit(n_splits=1, test_size=.5, random_state=0)
# split() returns a generator; take its first (and only) split
train_indices, test_indices = next(rs.split(X))

with warnings.catch_warnings(record=True) as received_warnings:
    warnings.simplefilter("always")
    X_train, y_train = X[train_indices], y[train_indices]
    X_test, y_test = X[test_indices], y[test_indices]
    classifier.fit(X_train, y_train)
    predicted_test = classifier.predict(X_test)
    for w in received_warnings:
        print(w.message)

When the snippet above is executed, two warnings are issued (I used a context manager to make sure the warnings are caught):

Label not 2 is present in all training examples.
Label not 4 is present in all training examples.

This is consistent with the fact that the tags with indices 2 and 4 are missing from the training samples:

In [40]: y_train
Out[40]: 
array([[0, 0, 0, 0, 0, 1, 0],
       [0, 1, 0, 0, 0, 0, 0],
       [0, 1, 0, 1, 0, 0, 0],
       [1, 0, 0, 0, 0, 0, 1]])

For some documents the prediction is empty (namely the documents corresponding to the all-zero rows of predicted_test):

In [42]: predicted_test
Out[42]: 
array([[0, 1, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0],
       [0, 1, 0, 1, 0, 0, 0]])

To overcome this issue, you could implement your own prediction function like this:

def get_best_tags(clf, X, lb, n_tags=3):
    decfun = clf.decision_function(X)
    best_tags = np.argsort(decfun)[:, :-(n_tags+1): -1]
    return lb.classes_[best_tags]

By doing so, each document is always assigned the n_tags tags with the highest confidence scores:

In [59]: mlb.inverse_transform(predicted_test)
Out[59]: [('matlab',), (), (), ('matlab', 'naming-conventions')]

In [60]: get_best_tags(classifier, X_test, mlb)
Out[60]: 
array([['matlab', 'oop', 'matlab-oop'],
       ['oop', 'matlab-oop', 'matlab'],
       ['oop', 'matlab-oop', 'matlab'],
       ['matlab', 'naming-conventions', 'oop']], dtype=object)
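If always returning a fixed number of tags is too aggressive, one possible variant (not part of the original answer) keeps the regular binary predictions and only fills in the rows that came out empty. The helper name fill_empty_predictions and the demo arrays are hypothetical; this is a sketch under the assumption that the scores come from decision_function or predict_proba.

```python
import numpy as np

def fill_empty_predictions(binary_pred, scores):
    """Set the highest-scoring label to 1 in every row that has no predicted label."""
    filled = np.array(binary_pred, copy=True)
    scores = np.asarray(scores)
    empty = filled.sum(axis=1) == 0            # rows without any predicted label
    best = np.argmax(scores[empty], axis=1)    # best label for each empty row
    filled[np.flatnonzero(empty), best] = 1
    return filled

# Hypothetical predictions and scores: 2 documents, 3 labels.
pred_demo = np.array([[0, 1, 0],
                      [0, 0, 0]])
scores_demo = np.array([[-1.0, 0.5, -2.0],
                        [-0.3, -0.9, -0.1]])

filled_demo = fill_empty_predictions(pred_demo, scores_demo)
print(filled_demo)
```

With the fitted pipeline from the demo, the two inputs would be classifier.predict(X_test) and classifier.decision_function(X_test); the result can then be passed to mlb.inverse_transform to get tag names.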

I had the same error too. I then used LabelEncoder() instead of MultiLabelBinarizer() to encode the labels.

from sklearn import preprocessing
le = preprocessing.LabelEncoder()
y = le.fit_transform(Labels)

I no longer get this error.
