In this article, we will first cover the traditional TF-IDF approach to document classification. TF-IDF stands for term frequency-inverse document frequency. It is a numerical statistic often used as a weighting factor for words in a document, and serves as a proxy for how important a word in a document is relative to all the other words in a corpus of documents. After that, we will classify the same documents using the modern BERT approach, fine-tuning a pre-trained BERT model for the purpose.
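To make the intuition concrete, here is a minimal sketch on a hypothetical three-document toy corpus (the corpus and variable names are ours, purely for illustration): terms that occur in fewer documents receive a higher inverse document frequency, and hence contribute more to a document's TF-IDF vector.

# A minimal sketch of the TF-IDF intuition on a toy corpus (illustrative only)
from sklearn.feature_extraction.text import TfidfVectorizer

toy_corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cats and the dogs are pets",
]
toy_vec = TfidfVectorizer()
toy_vec.fit(toy_corpus)

# Terms that occur in fewer documents receive a higher idf
# than terms like 'the' that appear in every document
for word, idx in sorted(toy_vec.vocabulary_.items()):
    print(f"{word}: idf = {toy_vec.idf_[idx]:.3f}")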
We will use Google Colab for running our code. However, you could also run the code locally on your machine if you wish.
First, we ensure that the relevant Python libraries are present in our Python environment. For TF-IDF, we will use the 'sklearn' library, and for BERT we will use the 'transformers' library. Colab already has 'sklearn' installed, as well as 'pytorch', a key dependency of the 'transformers' library. However, we still need to install the following with pip.
Install missing libraries
! pip install transformers datasets plotly nltk
Next, we need to load the 20 Newsgroups dataset using 'sklearn'. On Colab we can use the 'fetch_20newsgroups' method to download and load the dataset. If running locally, we can also download the dataset manually and then use the 'load_files' method to load it.
# For automatically downloading the dataset and loading it into memory
from sklearn.datasets import fetch_20newsgroups

data_train = fetch_20newsgroups(subset='train', categories=None, shuffle=True, random_state=42)
data_test = fetch_20newsgroups(subset='test', categories=None, shuffle=True, random_state=42)

# For manually downloading the dataset and loading the downloaded dataset
# into memory (change the paths as relevant)
# from sklearn.datasets import load_files
# data_train = load_files(container_path=r'D:\Projects\datasets\20news\20news-bydate-train', encoding='latin', shuffle=True, random_state=42)
# data_test = load_files(container_path=r'D:\Projects\datasets\20news\20news-bydate-test', encoding='latin', shuffle=True, random_state=42)
We can inspect the loaded datasets as below.
# Inspect dataset contents
print(data_train.keys())
print(data_test.keys())

# Output
# dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR'])
# dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR'])
You will see that both the train and test datasets consist of the data (i.e. the text documents), the filenames of these text documents, the target_names (the document labels as text), and the target (the document labels as numbers). We can peek at a single sample and its label as sketched below.
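For instance, here is a small sketch that prints the beginning of the first training document along with its numeric and text labels, using the fields listed above:

# Peek at the first training document and its label
print(data_train.data[0][:200])        # first 200 characters of the document text
label_id = data_train.target[0]        # numeric label
print(label_id, data_train.target_names[label_id])  # numeric label and its text name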
We can see the list of all the unique labels (20 in total; the dataset is aptly called the 20 Newsgroups dataset) as below:
# View all dataset categories
list(set(data_train['target_names']))

# Output
# ['sci.electronics',
#  'talk.politics.misc',
#  'comp.sys.mac.hardware',
#  'comp.sys.ibm.pc.hardware',
#  'talk.politics.guns',
#  'comp.windows.x',
#  'sci.med',
#  'alt.atheism',
#  'comp.os.ms-windows.misc',
#  'comp.graphics',
#  'rec.sport.hockey',
#  'sci.space',
#  'talk.religion.misc',
#  'talk.politics.mideast',
#  'sci.crypt',
#  'rec.motorcycles',
#  'rec.sport.baseball',
#  'rec.autos',
#  'misc.forsale',
#  'soc.religion.christian']
Create TF-IDF vectors of the documents
We now need to create vector representations of all the documents in the training and test datasets using the TfidfVectorizer object from 'sklearn'. We fit the vectorizer on the training dataset using the fit_transform method, which first creates features based on all training documents and then transforms the training samples into vector representations over those features. We can do that as below.
# Tf-idf vectorizer. Create features based on training data samples and then
# convert training samples into vector representations of these features.
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words='english')
X_train = vectorizer.fit_transform(data_train.data)
After this step, the training data is converted into a sparse matrix of vectors, i.e. a matrix representation of the entire corpus of training documents. We can inspect the shape of this matrix as below.
print("n_samples: %d, n_features: %d" % X_train.shape) # Output # n_samples: 11314, n_features: 129791
If you want to inspect the features, you can do something like the below. Note that the features form a long list of words, so we are only looking at a particular slice of them.
# Note: get_feature_names() was removed in newer scikit-learn versions;
# use get_feature_names_out() instead
print(list(vectorizer.get_feature_names_out()[30000:30010]))

# Output
# ['asx', 'asya', 'asylum', 'asymetrix', 'asymmetric', 'asymmetries', 'asymptotically', 'async', 'asynch', 'asynchronicity']
We can visualize the Tf-idf weights for the first sample document as below.
import numpy as np
from pprint import pprint

X_train_array = X_train.toarray()
X_train_sample1_vector = X_train_array[0]  # vector for the first sample document
features_array = np.array(vectorizer.get_feature_names_out())

relevant_features_sample1 = features_array[X_train_sample1_vector > 0]
print('Features with non-zero tf-idf weights for sample 1.')
pprint(relevant_features_sample1)
print('\n')

print('Tf-idf weights for these features for sample 1.')
relevant_weights_sample1 = X_train_sample1_vector[X_train_sample1_vector > 0]
pprint(relevant_weights_sample1)

# Output
# Features with non-zero tf-idf weights for sample 1.
# array(['15', '60s', '70s', 'addition', 'body', 'bricklin', 'brought',
#        'bumper', 'called', 'car', 'college', 'day', 'door', 'doors',
#        'early', 'engine', 'enlighten', 'funky', 'history', 'host', 'il',
#        'info', 'know', 'late', 'lerxst', 'looked', 'looking', 'mail',
#        'maryland', 'model', 'neighborhood', 'nntp', 'park', 'posting',
#        'production', 'rac3', 'really', 'rest', 'saw', 'separate', 'small',
#        'specs', 'sports', 'tellme', 'thanks', 'thing', 'umd',
#        'university', 'wam', 'wondering', 'years'], dtype='<U180')
#
# Tf-idf weights for these features for sample 1.
# array([0.07761318, 0.17349466, 0.1700526 , 0.11686055, 0.10682953,
#        0.20483876, 0.11464293, 0.16277056, 0.08467486, 0.24391653,
#        0.09561454, 0.08115624, 0.12274527, 0.14221568, 0.10359526,
#        0.11906511, 0.16439042, 0.19103829, 0.10224976, 0.04341232,
#        0.11993249, 0.08963448, 0.05246633, 0.11298951, 0.36712898,
#        0.11075174, 0.08467486, 0.07341243, 0.13403789, 0.10839785,
#        0.16125528, 0.04371988, 0.12149832, 0.04224856, 0.13269588,
#        0.19693883, 0.06945008, 0.10014748, 0.10372149, 0.12096469,
#        0.09223978, 0.13291429, 0.13247949, 0.21683229, 0.06754479,
#        0.07616042, 0.21982819, 0.04545303, 0.26946658, 0.10886588,
#        0.07429919])
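To see which words matter most for this document, we can also sort the non-zero weights in descending order (a short follow-on sketch using the variables defined above):

# Top 10 features for sample 1, by descending tf-idf weight
top_order = np.argsort(relevant_weights_sample1)[::-1][:10]
for feature, weight in zip(relevant_features_sample1[top_order],
                           relevant_weights_sample1[top_order]):
    print(f"{feature}: {weight:.4f}")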
Now that we have analyzed the document vectors of the training data a little, we need to convert the test documents to vectors as well. Note that this time around we don't use the fit_transform method but just the transform method: we want to convert the test documents into vector representations over the training features. This is because, in the real world, we typically only have the training data available; we build the model with the training data and expect it to work on unseen test data. We can do this as below.
# Create vectors for the test dataset by utilizing only the features present in the training dataset
X_test = vectorizer.transform(data_test.data)
Like before, we can analyze the test document vector matrix as below.
print("n_samples: %d, n_features: %d" % X_test.shape) # Output # n_samples: 7532, n_features: 129791
Before training any model on the training data, we need one more thing: the train and test labels. These can be extracted as below.
# Training and test labels
y_train, y_test = data_train.target, data_test.target
At this point, we can optionally choose to reduce the number of features. We currently have 129,791 features, a number of which (words such as '03ii', '03i', etc.) don't make any sense. We can therefore reduce the feature set for better training efficiency. Note that the tf-idf weights of such irrelevant features are near zero in most documents, so reducing the number of features should not noticeably impact model accuracy, while it will definitely improve training efficiency. We arbitrarily set 50,000 as the number of features (the words we want to focus on). We can select the best 50,000 features based on their chi-square association score with the target variable, using 'sklearn''s SelectKBest, as below.
from sklearn.feature_selection import SelectKBest, chi2

# Mapping from integer feature index to original token string
feature_names = list(vectorizer.get_feature_names_out())

# Feature reduction: keep the k best features based on chi2 score
ch2 = SelectKBest(chi2, k=50000)
X_train = ch2.fit_transform(X_train, y_train)
X_test = ch2.transform(X_test)

if feature_names:
    # Keep only the selected feature names
    feature_names = [feature_names[i] for i in ch2.get_support(indices=True)]
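We can sanity-check that both matrices now have exactly 50,000 columns (a quick verification sketch; the row counts are the sample counts we saw earlier):

# Verify the reduced feature dimensionality
print("train shape:", X_train.shape)  # expected: (11314, 50000)
print("test shape:", X_test.shape)    # expected: (7532, 50000)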
Building the model
For training, we choose a linear support vector classifier (LinearSVC) and train it on the training dataset as below. The model choice is arbitrary for this demonstration; in an ideal situation, you should try a set of different models and choose the one with the best performance (a quick comparison sketch follows the training code below).
from time import time

# Training our model
from sklearn.svm import LinearSVC

clf = LinearSVC(penalty="l2", random_state=123)
t0 = time()
clf.fit(X_train, y_train)
train_time = time() - t0
print("train time: %0.3fs" % train_time)
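For example, a minimal sketch of such a comparison might loop over a few candidate classifiers (the candidates here are our own illustrative choices, not prescribed by anything above) and report their accuracy on the test set:

# A minimal model-comparison sketch (candidate models chosen for illustration)
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

candidates = {
    "LinearSVC": LinearSVC(random_state=123),
    "LogisticRegression": LogisticRegression(max_iter=1000, random_state=123),
    "MultinomialNB": MultinomialNB(),
}

for name, model in candidates.items():
    model.fit(X_train, y_train)
    acc = metrics.accuracy_score(y_test, model.predict(X_test))
    print("%s accuracy: %0.3f" % (name, acc))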
Once the model has been trained, we can validate its accuracy as below. Note that our performance metric here is accuracy, but we could choose other metrics as well, such as precision, recall, etc. (see the sketch after the code below).
# Testing our model
from sklearn import metrics

t0 = time()
pred = clf.predict(X_test)
test_time = time() - t0
print("test time: %0.3fs" % test_time)

score = metrics.accuracy_score(y_test, pred)
print("accuracy: %0.3f" % score)

# Output
# test time: 0.026s
# accuracy: 0.860
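If we wanted the per-class precision and recall mentioned above, scikit-learn's classification_report returns them in one call (a short sketch using the predictions just computed):

# Per-class precision, recall, and F1 score for the same predictions
print(metrics.classification_report(y_test, pred, target_names=data_test.target_names))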
So, as we can see, our model with TF-IDF did a good job here, achieving an accuracy of 86%.
Before going any further with BERT, we need to understand that the base version of BERT has a limit of 512 tokens. BERT converts the input text into tokens, where a token is roughly a word or a piece of a word (BERT uses a WordPiece subword vocabulary, so rare words are split into multiple tokens). So, just to confirm whether our BERT model can be expected to work with the 20 Newsgroups data, we can check the distribution of the number of words per document in our dataset as below. Note that this approach is approximate, since the regex tokenizer will miss character patterns that don't match '\w+'; still, it comes very close to the correct word counts.
# Collate all documents
docs_all = data_train['data'] + data_test['data']

# Find the number of words per document in the dataset
from nltk.tokenize import RegexpTokenizer
import statistics

tokenizer = RegexpTokenizer(r'\w+')
n_words = [len(tokenizer.tokenize(doc)) for doc in docs_all]
print(min(n_words))
print(max(n_words))
print(statistics.mean(n_words))

# Output
# 18
# 39841
# 317.65207471081396
So, as we can see, the minimum word count in a document is 18 and the maximum is 39,841. However, the mean is ~318 words per document. For exact counts we could also run the BERT tokenizer itself, as sketched below.
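Since BERT's limit is in (subword) tokens rather than words, a more precise check is to run a BERT tokenizer directly. Here is a small sketch, assuming the 'bert-base-uncased' checkpoint (any BERT variant's tokenizer would behave similarly), counting tokens for just a handful of documents to keep it fast:

# Count actual BERT (WordPiece) tokens for a few sample documents
from transformers import AutoTokenizer

bert_tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

for doc in docs_all[:5]:
    # add_special_tokens=True also counts the [CLS] and [SEP] tokens
    n_tokens = len(bert_tokenizer.encode(doc, add_special_tokens=True))
    print("tokens: %d (over 512-token limit: %s)" % (n_tokens, n_tokens > 512))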