Multi-label classification to predict topic tags of technical articles from LinkedInfo.co

This snippet predicts topic tags based on the text of an article. Each article can have one or more tags (usually at least one), and the tags are not mutually exclusive, so this is a multi-label classification problem. It differs from multi-class classification, where the classes are mutually exclusive, i.e., each item belongs to one and only one class.
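To make the label representation concrete, here is a minimal sketch with three made-up articles and hypothetical tags: each article's tag set becomes one binary row of an indicator matrix, which is exactly the encoding that MultiLabelBinarizer (used below) produces.

from sklearn.preprocessing import MultiLabelBinarizer

# Three hypothetical articles, each carrying a *set* of tags
toy_tags = [['python', 'web'], ['python'], ['git']]
mlb_toy = MultiLabelBinarizer()
Y_toy = mlb_toy.fit_transform(toy_tags)
print(mlb_toy.classes_)  # ['git' 'python' 'web']
print(Y_toy)
# [[0 1 1]
#  [0 1 0]
#  [1 0 0]]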

In this snippet, we will use OneVsRestClassifier (One-Vs-the-Rest) in scikit-learn to handle the multi-label classification. The article data will be retrieved from LinkedInfo.co via its Web API. The methods in this snippet are credited to Working With Text Data - scikit-learn and this post.


Preprocessing the data and exploring the method

dataset.df_tags fetches the dataset from LinkedInfo.co. It calls the Web API of LinkedInfo.co to retrieve the article list, and then downloads and extracts the full text of each article based on the article's URL. The tags of each article are encoded using MultiLabelBinarizer in scikit-learn. The implementation can be found in dataset.py. We set the parameter content_length_threshold to 100 to screen out articles whose description or full text is shorter than 100 characters.

import dataset

ds = dataset.df_tags(content_length_threshold=100)

The dataset contained 3353 articles at the time the data was retrieved. It is returned as an object with the following attributes:

  • ds.data: pandas.DataFrame with columns title, description and fulltext

  • ds.target: the binarized encoding of the tagIDs

  • ds.target_names: the list of tagIDs

  • ds.target_decoded: a list of lists containing the tagIDs of each article

>> ds.data.head()
description fulltext title
0 Both HTTP 1.x and HTTP/2 rely on lower level c… [Stressgrid]()\n\n__\n\n[]( "home")\n\n * [… Achieving 100k connections per second with Elixir
1 At Phusion we run a simple multithreaded HTTP … [![Hongli Lai](images/avatar-b64f1ad5.png)]( What causes Ruby memory bloat?
2 Have you ever wanted to contribute to a projec… [ ![Real Python](/static/real-python-logo.ab1a… Managing Multiple Python Versions With pyenv
3 安卓在版本Pie中第一次引入了ART优化配置文件,这个新特性利用发送到Play Cloud的… 安卓在版本Pie中第一次引入了[ART优化配置文件](https://youtu.be/Yi... ART云配置文件,提高安卓应用的性能
4 I work at Red Hat on GCC, the GNU Compiler Col… [ ![Red Hat\nLogo](https://developers.redhat.c... Usability improvements in GCC 9
>> ds.target[:5]
array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])
>> ds.target_names[:5]
array(['academia', 'access-control', 'activemq', 'aes', 'agile'],
      dtype=object)
>> ds.target_decoded[:5]
[['concurrency', 'elixir'],
 ['ruby'],
 ['python', 'virtualenv'],
 ['android'],
 ['gcc']]

The following snippet shows the actual process of building the above dataset, reading from a local file.

import json
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

infos_file = 'data/infos/infos_0_3353_fulltext.json'
with open(infos_file, 'r') as f:
    infos = json.load(f)

content_length_threshold = 100

data_lst = []
tags_lst = []
for info in infos['content']:
    if len(info['fulltext']) < content_length_threshold:
        continue
    if len(info['description']) < content_length_threshold:
        continue
    data_lst.append({'title': info['title'],
                     'description': info['description'],
                     'fulltext': info['fulltext']})
    tags_lst.append([tag['tagID'] for tag in info['tags']])

df_data = pd.DataFrame(data_lst)
df_tags = pd.DataFrame(tags_lst)

# fit and transform the binarizer
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(tags_lst)
Y.shape
(3221, 560)

Now that we've transformed the target (tags), we still cannot run the algorithms directly on the text data, so we have to process and transform the text into vectors as well. To do this, we will use TfidfVectorizer to preprocess, tokenize, filter stop words, and transform the text data. TfidfVectorizer implements tf-idf (Term Frequency-Inverse Document Frequency) weighting to reflect how important a word is to a document in a collection of documents.
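For reference, here is a minimal sketch of the weighting scheme TfidfVectorizer applies with its default settings (smooth_idf=True, sublinear_tf=False); the term counts below are made up for illustration.

import numpy as np

def tfidf_weight(tf, df, n_docs):
    # scikit-learn's smoothed idf: idf(t) = ln((1 + n) / (1 + df(t))) + 1
    idf = np.log((1 + n_docs) / (1 + df)) + 1
    # tfidf(t, d) = tf(t, d) * idf(t); each document row is then L2-normalized
    return tf * idf

# e.g., a term occurring twice in a document and present in 10 of 3221 docs:
print(tfidf_weight(tf=2, df=10, n_docs=3221))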

from sklearn.feature_extraction.text import TfidfVectorizer

# Use the default parameters for now, use_idf=True in default
vectorizer = TfidfVectorizer()
# Use the short descriptions for now for faster processing
X = vectorizer.fit_transform(df_data.description)
X.shape
(3221, 35506)
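X is a sparse matrix with one row per article and one column per vocabulary term. A quick way to peek at what the columns mean (get_feature_names_out requires scikit-learn >= 1.0; older versions named it get_feature_names):

# Map column indices back to vocabulary terms
terms = vectorizer.get_feature_names_out()
print(terms[:5])

# Non-zero tf-idf weights of the first document, as term -> weight
row = X[0]
print({terms[j]: round(row[0, j], 3) for j in row.nonzero()[1][:5]})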

As mentioned in the beginning, this is a multi-label classification problem, and we will use OneVsRestClassifier to tackle it. First we will use an SVM (Support Vector Machine) with a linear kernel, implemented as LinearSVC in scikit-learn, to do the classification.
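Conceptually, One-vs-Rest trains one independent binary classifier per tag and stacks their predictions. A minimal hand-rolled sketch of the idea (the real OneVsRestClassifier also handles sparse targets, parallelism, and other edge cases):

import numpy as np
from sklearn.base import clone
from sklearn.svm import LinearSVC

def ovr_fit(X, Y):
    # One binary LinearSVC per label column of the indicator matrix Y.
    # Assumes every label occurs at least once in the training data;
    # otherwise fitting that column would fail with a single-class error.
    return [clone(LinearSVC()).fit(X, Y[:, j]) for j in range(Y.shape[1])]

def ovr_predict(estimators, X):
    # Stack the per-label binary predictions back into an indicator matrix
    return np.stack([est.predict(X) for est in estimators], axis=1)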

from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split

# Use default parameters, and train and test with a small set of samples.
clf = OneVsRestClassifier(LinearSVC())

from sklearn.utils import resample

X_sample, Y_sample = resample(
    X, Y, n_samples=1000, replace=False, random_state=7)


X_sample_train, X_sample_test, Y_sample_train, Y_sample_test = train_test_split(
    X_sample, Y_sample, test_size=0.01, random_state=42)

clf.fit(X_sample_train, Y_sample_train)
Y_sample_pred = clf.predict(X_sample_test)

# Inverse transform the vectors back to tags
pred_transformed = mlb.inverse_transform(Y_sample_pred)
test_transformed = mlb.inverse_transform(Y_sample_test)

for (t, p) in zip(test_transformed, pred_transformed):
    print(f'tags: {t} predicted as: {p}')
tags: ('javascript',) predicted as: ('javascript',)
tags: ('erasure-code', 'storage') predicted as: ()
tags: ('mysql', 'network') predicted as: ()
tags: ('token',) predicted as: ()
tags: ('flask', 'python', 'web') predicted as: ()
tags: ('refactoring',) predicted as: ()
tags: ('emacs',) predicted as: ()
tags: ('async', 'javascript', 'promises') predicted as: ('async', 'javascript')
tags: ('neural-networks',) predicted as: ()
tags: ('kubernetes',) predicted as: ('kubernetes',)

Though not very satisfying, this classifier predicted a few tags correctly. Next we'll search for the best parameters for the classifier and train with the full text of the articles.

Searching for the best model parameters for SVM with linear kernel

The estimators TfidfVectorizer and LinearSVC both have many parameters that could be tuned for better performance. We'll use GridSearchCV to search for the best parameters, with the help of Pipeline.

from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV


# Split the dataset into training and test set, and use fulltext of articles:
X_train, X_test, Y_train, Y_test = train_test_split(
    df_data.fulltext, Y, test_size=0.5, random_state=42)

# Build vectorizer classifier pipeline
clf = Pipeline([
    ('vect', TfidfVectorizer()),
    ('clf', OneVsRestClassifier(LinearSVC())),
])

# Grid search parameters. I minimized the parameter set based on previous
# experience to speed up processing.
# Note: integer values of max_df/min_df are absolute document counts, while
# float values are proportions of the documents.
# The combination of penalty='l1' and loss='squared_hinge' is not supported
# when dual=True, hence dual is fixed to False below.
parameters = {
    'vect__ngram_range': [(1, 2), (1, 3), (1, 4)],
    'vect__max_df': [1, 0.9, 0.8, 0.7],
    'vect__min_df': [1, 0.9, 0.8, 0.7, 0],
    'vect__use_idf': [True, False],
    'clf__estimator__penalty': ['l1', 'l2'],
    'clf__estimator__C': [1, 10, 100, 1000],
    'clf__estimator__dual': [False],
}

gs_clf = GridSearchCV(clf, parameters, cv=5, n_jobs=-1)
gs_clf.fit(X_train, Y_train)
from datetime import datetime
from sklearn import metrics


# Predict the outcome on the testing set in a variable named y_predicted
Y_predicted = gs_clf.predict(X_test)

print(metrics.classification_report(Y_test, Y_predicted))
print(gs_clf.best_params_)
print(gs_clf.best_score_)

# Export some of the result columns (matching the table shown below)
cols = [
    'rank_test_score',
    'mean_test_score',
    'mean_fit_time',
    'param_vect__max_df',
    'param_vect__ngram_range',
    'param_vect__use_idf',
    'param_clf__estimator__penalty',
    'param_clf__estimator__C',
]
df_result = pd.DataFrame(gs_clf.cv_results_)
df_result = df_result.sort_values(by='rank_test_score')
df_result = df_result[cols]

timestamp = datetime.now().strftime('%Y-%m-%d_%H-%M-%S')
df_result.to_html(
    f'data/results/gridcv_results_{timestamp}_linearSVC.html')

Below are the top-5 performing parameter combinations, with selected columns.

     rank_test_score  mean_test_score  mean_fit_time  param_vect__max_df  param_vect__ngram_range  param_vect__use_idf  param_clf__estimator__penalty  param_clf__estimator__C
64                 1         0.140811      96.127405                 0.8                   (1, 4)                 True                             l1                       10
70                 2         0.140215     103.252332                 0.7                   (1, 4)                 True                             l1                       10
58                 2         0.140215      98.990952                 0.9                   (1, 4)                 True                             l1                       10
154                2         0.140215    1690.433151                 0.9                   (1, 4)                 True                             l1                     1000
68                 5         0.139618      70.778621                 0.7                   (1, 3)                 True                             l1                       10

Training and testing with the best parameters

Based on the grid search results, the following parameters, combined with the defaults, gave the best performance. Now let's see how the model performs.

X_train, X_test, Y_train, Y_test = train_test_split(
    df_data, Y, test_size=0.2, random_state=42)

clf = Pipeline([
    ('vect', TfidfVectorizer(use_idf=True,
                             max_df=0.8, ngram_range=(1, 4))),
    ('clf', OneVsRestClassifier(LinearSVC(penalty='l1', C=10, dual=False))),
])

clf.fit(X_train.fulltext, Y_train)


Y_pred = clf.predict(X_test.fulltext)

# Inverse transform the vectors back to tags
pred_transformed = mlb.inverse_transform(Y_pred)
test_transformed = mlb.inverse_transform(Y_test)

for (title, t, p) in zip(X_test.title, test_transformed, pred_transformed):
    print(f'Article title: {title} \n'
          f'Manual tags:  {t} \n'
          f'predicted as: {p}\n')

Below is a portion of the list showing the manually assigned tags and the predicted tags. We can see that more frequent and more popular tags usually have a better chance of being correctly predicted. Personally, I find the predictions satisfying compared with tagging the articles manually. However, there's much room for improvement.

Article title: Will PWAs Replace Native Mobile Apps?
Manual tags:  ('pwa',)
predicted as: ('pwa',)

Article title: 基于Consul的分布式信号量实现
Manual tags:  ('consul', 'distributed-system')
predicted as: ('microservices', 'multithreading')

Article title: commit 和 branch 理解深入
Manual tags:  ('git',)
predicted as: ('git',)

Article title: Existential types in Scala
Manual tags:  ('scala',)
predicted as: ('scala',)

Article title: Calling back into Python from llvmlite-JITed code
Manual tags:  ('jit', 'python')
predicted as: ('compiler', 'python')

Article title: Writing a Simple Linux Kernel Module
Manual tags:  ('kernel', 'linux')
predicted as: ('linux',)

Article title: Semantic segmentation with OpenCV and deep learning
Manual tags:  ('deep-learning', 'opencv')
predicted as: ('deep-learning', 'image-classification', 'opencv')

Article title: Transducers: Efficient Data Processing Pipelines in JavaScript
Manual tags:  ('javascript',)
predicted as: ('javascript',)

Article title: C++之stl::string写时拷贝导致的问题
Manual tags:  ('cpp',)
predicted as: ('functional-programming',)

Article title: WebSocket 浅析
Manual tags:  ('websocket',)
predicted as: ('websocket',)

Article title: You shouldn’t name your variables after their types for the same reason you wouldn’t name your pets “dog” or “cat”
Manual tags:  ('golang',)
predicted as: ('golang',)

Article title: Introduction to Data Visualization using Python
Manual tags:  ('data-visualization', 'python')
predicted as: ('data-visualization', 'matplotlib', 'python')

Article title: How JavaScript works: A comparison with WebAssembly + why in certain cases it’s better to use it over JavaScript
Manual tags:  ('javascript', 'webassembly')
predicted as: ('javascript', 'webassembly')

Article title: Parsing logs 230x faster with Rust
Manual tags:  ('log', 'rust')
predicted as: ('rust',)

Article title: Troubleshooting Memory Issues in Java Applications
Manual tags:  ('java', 'memory')
predicted as: ('java',)

Article title: How to use Docker for Node.js development
Manual tags:  ('docker', 'node.js')
predicted as: ('docker',)

A glance at the different evaluation metrics

Now let's have a look at evaluation metrics for the prediction performance. Evaluating multi-label classification is very different from evaluating binary classification. There are quite a few evaluation methods for different situations in the model evaluation part of scikit-learn's documentation. We will look at the ones that suit this problem.

We can start with the accuracy_score function in the metrics module. As mentioned in the scikit-learn documentation, in multi-label classification the subset accuracy of a sample is 1.0 only when its entire set of predicted labels strictly matches the true label set. The equation is as simple as this:

    accuracy(y, y_pred) = (1 / n_samples) * sum_i 1(y_pred_i == y_i)

where 1(x) is the indicator function and y_i, y_pred_i are the true and predicted label sets of sample i.

from sklearn import metrics

metrics.accuracy_score(Y_test, Y_pred)
0.26356589147286824
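Equivalently, the subset accuracy can be computed by hand from the indicator matrices; a row counts as correct only if all 560 label positions match:

# Fraction of samples whose predicted label set matches the true set exactly
print((Y_pred == Y_test).all(axis=1).mean())  # same value as accuracy_score above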

The score is somewhat low. But we should note that for this problem, an inexact match of the labels is acceptable in many cases, e.g., an article about Go's interfaces predicted with the single label golang while it was manually labeled with both golang and interface. So in my opinion, accuracy_score is not a good evaluation metric for this problem.

Now let's look at the classification_report, which presents the averaged precision, recall, and F1-score.

print(metrics.classification_report(Y_test, Y_pred))

              precision    recall  f1-score   support

   micro avg       0.74      0.42      0.54      1186
   macro avg       0.17      0.13      0.14      1186
weighted avg       0.60      0.42      0.48      1186

Let's look at the micro row. Why? Let me quote scikit-learn's documentation:

"micro" gives each sample-class pair an equal contribution to the overall metric (except as a result of sample-weight). Rather than summing the metric per class, this sums the dividends and divisors that make up the per-class metrics to calculate an overall quotient. Micro-averaging may be preferred in multilabel settings, including multiclass classification where a majority class is to be ignored.

Here we're more interested in the micro-averaged precision, which is 0.74. As mentioned, for this problem and for me, it's more important not to predict a label that should be negative for an article. Some of the labels of an article, e.g., the label interface for the article just mentioned, are less important. So I'm OK with a low recall, which measures how well the model recovers all the manually assigned labels.
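To make "micro" concrete: it pools true/false positives over all sample-label pairs before dividing, instead of averaging per-label scores. A minimal hand computation that should reproduce the micro-averaged precision and recall above:

import numpy as np

tp = np.logical_and(Y_pred == 1, Y_test == 1).sum()  # true positives over all pairs
fp = np.logical_and(Y_pred == 1, Y_test == 0).sum()  # false positives
fn = np.logical_and(Y_pred == 0, Y_test == 1).sum()  # false negatives
print(tp / (tp + fp))  # micro-averaged precision, ~0.74
print(tp / (tp + fn))  # micro-averaged recall, ~0.42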

However, there's much room for improvement.

  • Many of the labels appear only a few times, or even just once. These labels could be filtered out, or oversampled with text augmentation, to mitigate their impact on model performance (a minimal filtering sketch follows this list).

  • The train-test split should be controlled by methods like stratified sampling, so that all labels appear in both sets with similar proportions. But again, this problem is unlikely to be solved for now since there aren't enough samples.

  • Another problem to think about is that the training samples are not labeled consistently, i.e., taking the same example, among the articles about Go's interfaces, some are labeled with golang + interface while others are labeled with only golang.
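For the first point, here is a minimal sketch of dropping rare labels before training; min_count=5 is a hypothetical threshold, not a tuned value:

import numpy as np

min_count = 5                      # hypothetical threshold
label_counts = Y.sum(axis=0)       # number of articles carrying each tag
keep = label_counts >= min_count
Y_filtered = Y[:, keep]            # drop the rare tag columns
kept_tag_names = mlb.classes_[keep]
print(Y.shape, '->', Y_filtered.shape)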
