
Comparing the performance of a modern NLP model, BERT, with a classical approach, TF-IDF, for document classification, using simple, easy-to-understand code.

We compare the performance of the modern, state-of-the-art BERT model against one of the previously popular methods, TF-IDF, for document classification. The performance is evaluated on the well-known 20 Newsgroups dataset.

Mausam Gaurav
Written on
Sep 24, 2021



In this article, we first cover the traditional TF-IDF approach to document classification. TF-IDF stands for term frequency-inverse document frequency. It is a numerical statistic often used as a weighting factor for the words in a document, and as a proxy for how important a word is to a document relative to the other documents in a corpus. Thereafter, we use the modern BERT approach to classify the same documents, fine-tuning a pre-trained BERT model for this purpose.
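To make the idea of a TF-IDF weight concrete, here is a minimal sketch that computes the textbook variant, (term count / document length) * log(N / document frequency), on a toy corpus. The three documents and the helper name `tf_idf` are assumptions made up for illustration. scikit-learn's TfidfVectorizer uses a smoothed idf and normalizes each document vector by default, so its numbers would differ from the ones printed here.

```python
import math
from collections import Counter

# Toy corpus (hypothetical documents, not taken from the 20 Newsgroups data)
docs = [
    "the car has a fast engine",
    "the engine of the car is loud",
    "the senate passed the bill",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term
df = Counter(term for tokens in tokenized for term in set(tokens))


def tf_idf(tokens):
    """Textbook TF-IDF: (count / document length) * log(N / df)."""
    counts = Counter(tokens)
    return {
        term: (count / len(tokens)) * math.log(n_docs / df[term])
        for term, count in counts.items()
    }


weights = tf_idf(tokenized[0])
for term, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{term:8s} {weight:.3f}")
```

A word such as "the", which occurs in every document, gets a weight of zero, while words that occur in only one document get the highest weights.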

We use Google Colab to run our code. However, you can also run the code locally on your machine if you wish.

First, we make sure that the relevant Python libraries are present in our environment. For TF-IDF we use the 'sklearn' library (scikit-learn), and for BERT we use the 'transformers' library. Colab already ships with 'sklearn' and with 'pytorch', a key dependency of 'transformers'. The remaining libraries still need to be installed with pip.

Install missing libraries

! pip install transformers datasets plotly nltk

TF-IDF approach

Load datasets

Next, we load the 20 Newsgroups dataset using 'sklearn'. On Colab we can use the 'fetch_20newsgroups' function to download and load the dataset. When working locally, we can also download the dataset manually and then load it with the 'load_files' function.

# For automatically downloading the dataset and loading it to memory
from sklearn.datasets import fetch_20newsgroups
data_train = fetch_20newsgroups(subset='train', categories=None, shuffle=True, random_state=42)
data_test = fetch_20newsgroups(subset='test', categories=None, shuffle=True, random_state=42)

# # For manually downloading the dataset and loading the downloaded dataset in to memory (change the paths as relevant)
# from sklearn.datasets import load_files
# data_train = load_files(container_path=r'D:\Projects\datasets\20news\20news-bydate-train', encoding='latin', shuffle=True, random_state=42)
# data_test = load_files(container_path=r'D:\Projects\datasets\20news\20news-bydate-test', encoding='latin', shuffle=True, random_state=42)

We can inspect the loaded datasets as below.

# Inspect dataset contents
print(data_train.keys())
print(data_test.keys())

# Output
# dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR'])
# dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR'])

Both the train and test datasets contain the data (the text documents), the filenames of these documents, the target_names (the document labels as text), and the target (the document labels as integers).

We can see the list of all the unique labels (20 in total - the dataset is aptly called 20 newsgroup dataset) as below:

# View all dataset categories

# Output
# ['sci.electronics',
#  'talk.politics.misc',
#  'comp.sys.mac.hardware',
#  '',
#  'talk.politics.guns',
#  '',
#  '',
#  'alt.atheism',
#  '',
#  '',
#  '',
#  '',
#  'talk.religion.misc',
#  'talk.politics.mideast',
#  'sci.crypt',
#  '',
#  '',
#  '',
#  '',
#  'soc.religion.christian']

Create TF-IDF vectors of the documents

We now need to create vector representations of all the documents in the training and test datasets using the TfidfVectorizer class from 'sklearn'. We fit the vectorizer on the training dataset with the fit_transform method, which first builds the features (the vocabulary) from all training documents and then transforms the training samples into vectors over these features. We can do that as below.

# Tf-idf vectorizer. Create features based on training data samples and then convert training samples into vector representations of these features.
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words='english')
X_train = vectorizer.fit_transform(data_train['data'])

After this step, the training data is converted to a sparse matrix with one row vector per document, i.e. a matrix representation of the entire corpus of training documents. We can inspect the matrix as below.

print("n_samples: %d, n_features: %d" % X_train.shape)

# Output
# n_samples: 11314, n_features: 129791
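The vectorizer above was created with sublinear_tf=True and max_df=0.5. As a rough illustration of what these two options do, the sketch below applies them by hand to a toy corpus: terms that occur in more than half of the documents are dropped, and raw counts are dampened to 1 + ln(count). The corpus and the helper name `sublinear_tf` are assumptions for illustration only, and the sketch leaves out the idf weighting, stop-word removal and normalization that the real vectorizer also performs.

```python
import math
from collections import Counter

# Hypothetical toy corpus, for illustration only
docs = [
    "eggs eggs eggs eggs spam",
    "spam eggs bacon",
    "spam toast",
    "toast jam",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# max_df=0.5: ignore terms that occur in more than 50% of the documents
df = Counter(term for tokens in tokenized for term in set(tokens))
vocabulary = sorted(term for term, count in df.items() if count / n_docs <= 0.5)
print(vocabulary)  # 'spam' occurs in 3 of 4 documents and is dropped


def sublinear_tf(count):
    """sublinear_tf=True: replace a raw count with 1 + ln(count)."""
    return 1.0 + math.log(count)


# Dampened term frequencies of the first document, restricted to the vocabulary
raw_counts = Counter(tokenized[0])
damped = {
    term: sublinear_tf(count)
    for term, count in raw_counts.items()
    if term in vocabulary
}
print(damped)
```

With dampening, a word that occurs four times in a document counts for roughly 2.4 times as much as a word that occurs once, rather than four times as much.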

If you want to visualize the features, you could do something like the below. Note that the features are a long list of words, so we are only looking at a particular slice of the features.


# Output
# ['asx', 'asya', 'asylum', 'asymetrix', 'asymmetric', 'asymmetries', 'asymptotically', 'async', 'asynch', 'asynchronicity']

We can visualize the Tf-idf weights for the first sample document as below.

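import numpy as np
from pprint import pprint

# Densify only the first row; converting the whole matrix with
# X_train.toarray() would need several gigabytes of memory
X_train_sample1_vector = X_train[0].toarray().ravel()

# get_feature_names() was removed in scikit-learn 1.2;
# get_feature_names_out() is its replacement
features_array = vectorizer.get_feature_names_out().astype(str)

relevant_features_sample1 = features_array[X_train_sample1_vector > 0]
print('Features with non-zero tf-idf weights for sample 1.')
pprint(relevant_features_sample1)

print('Tf-idf weights for these features for sample 1.')
relevant_weights_sample1 = X_train_sample1_vector[X_train_sample1_vector > 0]
pprint(relevant_weights_sample1)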

# Output
# Features with non-zero tf-idf weights for sample 1.
# array(['15', '60s', '70s', 'addition', 'body', 'bricklin', 'brought',
#        'bumper', 'called', 'car', 'college', 'day', 'door', 'doors',
#        'early', 'engine', 'enlighten', 'funky', 'history', 'host', 'il',
#        'info', 'know', 'late', 'lerxst', 'looked', 'looking', 'mail',
#        'maryland', 'model', 'neighborhood', 'nntp', 'park', 'posting',
#        'production', 'rac3', 'really', 'rest', 'saw', 'separate', 'small',
#        'specs', 'sports', 'tellme', 'thanks', 'thing', 'umd',
#        'university', 'wam', 'wondering', 'years'], dtype='<U180')

# Tf-idf weights for these features for sample 1.
# array([0.07761318, 0.17349466, 0.1700526 , 0.11686055, 0.10682953,
#        0.20483876, 0.11464293, 0.16277056, 0.08467486, 0.24391653,
#        0.09561454, 0.08115624, 0.12274527, 0.14221568, 0.10359526,
#        0.11906511, 0.16439042, 0.19103829, 0.10224976, 0.04341232,
#        0.11993249, 0.08963448, 0.05246633, 0.11298951, 0.36712898,
#        0.11075174, 0.08467486, 0.07341243, 0.13403789, 0.10839785,
#        0.16125528, 0.04371988, 0.12149832, 0.04224856, 0.13269588,
#        0.19693883, 0.06945008, 0.10014748, 0.10372149, 0.12096469,
#        0.09223978, 0.13291429, 0.13247949, 0.21683229, 0.06754479,
#        0.07616042, 0.21982819, 0.04545303, 0.26946658, 0.10886588,
#        0.07429919])

Now that we have analyzed the document vectors of the training data a little, we need to convert the test documents to vectors as well. Note that this time we don't want to use the fit_transform method but only the transform method, because the test documents must be represented in terms of the features learned from the training data. In the real world we may only have the training data: we build the model with it and expect the model to work on unseen test data. We can do this as below.

# Create vectors for the test dataset using only the features present in the training dataset
X_test = vectorizer.transform(data_test['data'])

Like before, we can analyze the test document vector matrix as below. 

print("n_samples: %d, n_features: %d" % X_test.shape)

# Output
# n_samples: 7532, n_features: 129791
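The difference between fitting on the training data and only transforming the test data can be sketched in plain Python. This is a simplified, count-based stand-in, not scikit-learn's implementation; the function names `fit_vocabulary` and `transform` and the toy documents are assumptions for illustration. The point it shows is that the test vectors have exactly the training columns, and that words never seen during training are ignored.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into runs of word characters."""
    return re.findall(r"\w+", text.lower())


def fit_vocabulary(train_docs):
    """'fit' step: learn a term -> column index mapping from training docs only."""
    terms = sorted({term for doc in train_docs for term in tokenize(doc)})
    return {term: index for index, term in enumerate(terms)}


def transform(docs, vocabulary):
    """'transform' step: count known terms; unknown terms are ignored."""
    rows = []
    for doc in docs:
        row = [0] * len(vocabulary)
        for term, count in Counter(tokenize(doc)).items():
            if term in vocabulary:
                row[vocabulary[term]] = count
        rows.append(row)
    return rows


train_docs = ["the engine is loud", "the bill passed"]
test_docs = ["the engine stalled on the highway"]

vocabulary = fit_vocabulary(train_docs)
X_train_toy = transform(train_docs, vocabulary)
X_test_toy = transform(test_docs, vocabulary)

print(vocabulary)
print(X_test_toy)  # 'stalled', 'on' and 'highway' have no column
```

This is also why the training and test matrices in the article report the same n_features value.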

Before training any model on the training data, we need one more thing: the train and test labels. We extract them as below.

# training and test labels
y_train, y_test = data_train['target'], data_test['target']

At this point, we can optionally reduce the number of features. We currently have 129,791 features, and some of them (tokens such as '03ii' or '03i') don't carry any obvious meaning. Reducing the feature set makes training more efficient. Tokens like these tend to be rare and only weakly associated with any class, so dropping them should have little effect on model accuracy. We arbitrarily set 50,000 as the number of features (words) to keep, and select the best 50,000 based on their chi-square association score with the target variable, using the 'sklearn' SelectKBest class as below.

from sklearn.feature_selection import SelectKBest, chi2

# Token strings indexed by feature column
# (get_feature_names() was removed in scikit-learn 1.2)
feature_names = vectorizer.get_feature_names_out()

# Feature reduction: keep the 50,000 features with the highest chi2 score
ch2 = SelectKBest(chi2, k=50000)
X_train = ch2.fit_transform(X_train, y_train)
X_test = ch2.transform(X_test)

# Keep only the names of the selected features
feature_names = feature_names[ch2.get_support(indices=True)]
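To give a feel for what chi-square feature selection measures, here is a small pure-Python sketch: for each feature it compares the observed total per class with the total expected if the feature were independent of the class, and then keeps the k highest-scoring columns. The function names `chi2_scores` and `select_k_best` and the toy matrix are assumptions for illustration. It is written in the spirit of what SelectKBest with chi2 does and should not be read as scikit-learn's actual implementation.

```python
def chi2_scores(X, y):
    """Chi-square score per feature column for non-negative features X and labels y."""
    n_samples, n_features = len(X), len(X[0])
    classes = sorted(set(y))

    # Observed: summed feature value per (class, feature)
    observed = {c: [0.0] * n_features for c in classes}
    for row, label in zip(X, y):
        for j, value in enumerate(row):
            observed[label][j] += value

    feature_totals = [sum(row[j] for row in X) for j in range(n_features)]
    class_prob = {c: sum(1 for label in y if label == c) / n_samples for c in classes}

    scores = []
    for j in range(n_features):
        score = 0.0
        for c in classes:
            # Expected total if feature j were independent of the class
            expected = class_prob[c] * feature_totals[j]
            if expected > 0:
                score += (observed[c][j] - expected) ** 2 / expected
        scores.append(score)
    return scores


def select_k_best(scores, k):
    """Return the indices of the k highest-scoring columns, in column order."""
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


# Toy term counts: column 0 occurs only in class 0, column 1 only in class 1,
# column 2 is spread evenly and therefore uninformative
X_toy = [[3, 0, 1], [2, 0, 1], [0, 4, 1], [0, 3, 1]]
y_toy = [0, 0, 1, 1]

scores = chi2_scores(X_toy, y_toy)
keep = select_k_best(scores, k=2)
X_reduced = [[row[j] for j in keep] for row in X_toy]

print(scores)
print(keep)
print(X_reduced)
```

The evenly spread column scores zero and is the one that gets dropped, which mirrors the intuition behind removing uninformative tokens.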

Building the model

For training, we choose a linear support vector classifier (LinearSVC) and train it on the training dataset as below. The choice of model is arbitrary for this demonstration. Ideally, you would start with a set of different models and choose the one with the best performance.

from time import time
# Training our model

from sklearn.svm import LinearSVC

clf = LinearSVC(penalty="l2", random_state=123)

t0 = time()
clf.fit(X_train, y_train)
train_time = time() - t0
print("train time: %0.3fs" % train_time)

Once the model has been trained, we can evaluate its accuracy on the test set as below. Our performance metric here is accuracy, but we could choose other metrics as well, such as precision or recall.

# Testing our model

from sklearn import metrics

t0 = time()
pred = clf.predict(X_test)
test_time = time() - t0
print("test time:  %0.3fs" % test_time)

score = metrics.accuracy_score(y_test, pred)
print("accuracy:   %0.3f" % score)

# Output
# test time:  0.026s
# accuracy:   0.860
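Accuracy is a single number for the whole test set, whereas precision and recall are computed per class. The sketch below shows how these metrics relate, using a handful of made-up labels. The function name `per_class_metrics` and the toy labels are assumptions for illustration; in practice 'sklearn' offers ready-made helpers such as metrics.classification_report.

```python
from collections import Counter


def per_class_metrics(y_true, y_pred):
    """Return {label: (precision, recall)} for every label that occurs."""
    labels = sorted(set(y_true) | set(y_pred))
    true_pos = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    pred_count = Counter(y_pred)
    true_count = Counter(y_true)
    report = {}
    for label in labels:
        # Precision: share of predictions of this label that were correct
        precision = true_pos[label] / pred_count[label] if pred_count[label] else 0.0
        # Recall: share of true members of this label that were found
        recall = true_pos[label] / true_count[label] if true_count[label] else 0.0
        report[label] = (precision, recall)
    return report


# Hypothetical labels for six documents in three classes
y_true_toy = [0, 0, 0, 1, 1, 2]
y_pred_toy = [0, 0, 1, 1, 1, 0]

accuracy = sum(t == p for t, p in zip(y_true_toy, y_pred_toy)) / len(y_true_toy)
report = per_class_metrics(y_true_toy, y_pred_toy)

print(f"accuracy: {accuracy:.3f}")
for label, (precision, recall) in report.items():
    print(f"class {label}: precision={precision:.3f} recall={recall:.3f}")
```

In this toy case the overall accuracy looks reasonable even though class 2 is never predicted correctly, which is the kind of detail per-class metrics reveal.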

As we can see, our model with TF-IDF did a good job here: we achieved an accuracy of 86%.

BERT approach

Before going any further with BERT, we need to understand that the base version of BERT has a limit of 512 tokens per input. BERT converts the input text into tokens. As a first approximation a token is a word, although BERT's WordPiece tokenizer splits rarer words into several sub-word tokens, so the token count is usually somewhat higher than the word count. To check whether BERT can be expected to work with our 20 Newsgroups data, we can look at the distribution of the number of words per document as below. Note that this approach is approximate: the regex tokenizer treats every run of characters matching '\w+' as a word, which does not always coincide with real word boundaries. However, it comes very close to the true number of words.

# Collate all documents
docs_all = data_train['data'] + data_test['data']

# Find number of words per document in the dataset
from nltk.tokenize import RegexpTokenizer
import statistics
tokenizer = RegexpTokenizer(r'\w+')
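n_words = [len(tokenizer.tokenize(doc)) for doc in docs_all]

# Minimum, maximum and mean number of words per document
print(min(n_words))
print(max(n_words))
print(statistics.mean(n_words))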

# Output
# 18
# 39841
# 317.65207471081396

As we can see, the minimum word count in a document is 18 and the maximum is 39,841. However, the mean is ~318 words per ...
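As a rough sketch of how one might check documents against the 512-token limit, the snippet below counts '\w+' words with the standard library and reports the share of documents above the limit, plus a crude word-level truncation. The toy documents and the names `MAX_TOKENS`, `word_count` and `truncate_words` are assumptions for illustration. Words are only a proxy here: BERT's own tokenizer produces sub-word tokens and adds special tokens, so real token counts will be higher.

```python
import re
import statistics

MAX_TOKENS = 512  # token limit of the base BERT model mentioned above


def word_count(text):
    """Approximate the number of words as the number of \\w+ matches."""
    return len(re.findall(r"\w+", text))


def truncate_words(text, limit=MAX_TOKENS):
    """Keep only the first `limit` words (a crude stand-in for tokenizer truncation)."""
    return " ".join(re.findall(r"\w+", text)[:limit])


# Hypothetical documents of 10, 800 and 300 words
docs_toy = ["short message " * 5, "long document " * 400, "medium " * 300]

counts = [word_count(doc) for doc in docs_toy]
share_over = sum(1 for n in counts if n > MAX_TOKENS) / len(counts)

print(counts)
print(f"mean words per document: {statistics.mean(counts):.1f}")
print(f"share of documents over {MAX_TOKENS} words: {share_over:.2f}")
print(word_count(truncate_words(docs_toy[1])))
```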
