Text classification using TensorFlow
Hi guys,
In this article, you will learn how to train your own text classification model from scratch using TensorFlow in just a few lines of code.
A brief introduction to text classification
Text classification is a branch of natural language processing that assigns a piece of text to one of several predefined groups based on its content, for instance classifying news articles as sports, business, music, and so on.
What will you learn?
- One hot encoding
- Word Embedding
- Neural network with an embedding layer
- Evaluating and testing the trained model
The concepts above are fundamentals that you are expected to understand when doing natural language processing with TensorFlow, and you can apply them to many NLP projects, so I recommend reading to the end to really grasp them.
Building a sentiment analyzer as we learn
We are going to build a simple TensorFlow model that classifies user reviews as either positive or negative by generalizing from the training data.
ML libraries we need
Apart from TensorFlow itself, we also need a few other Python libraries and tools to develop our model, and this article assumes you have them installed on your machine.
Quick installation
If you don't have those libraries installed, here is a quick installation guide with pip:
pip install numpy
pip install tensorflow
pip install matplotlib
Now that everything is installed, we are ready to get our hands dirty and begin building our model.
Getting started
First of all, we need to import the libraries we just installed:
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
Dataset
A dataset can come in various formats (CSV, JSON, SQL), but in this article we are going to use a plain list of sample customer review messages, as shown below:
data_x = [
'good', 'well done', 'nice', 'Excellent',
'Bad', 'OOps I hate it deadly', 'embrassing',
'A piece of shit']
Likewise, we can store our labels as a 1D NumPy array of 0s and 1s, where 1 stands for a positive review and 0 stands for a negative review, arranged in the same order as the training data (data_x), as shown below:
data_x = [
'good', 'well done', 'nice', 'Excellent',
'Bad', 'OOps I hate it deadly', 'embrassing',
'A piece of shit']
label_x = np.array([1,1,1,1, 0,0,0,0])
Data Engineering - One Hot encoding
Machines only understand numbers, and that doesn't change when it comes to training on textual data. To be able to train on it, we need a numerical representation of our text dataset, and that's where one-hot encoding comes into play.
TensorFlow provides a built-in function for this, and you can learn more about it by visiting the one hot encoding docs. Note that it returns one integer index per word (picked by hashing the word into the range 1 to n - 1) rather than a vector of 0s and 1s. Here is how you put that into code:
data_x = [
'good', 'well done', 'nice', 'Excellent',
'Bad', 'OOps I hate it deadly', 'embrassing',
'A piece of shit']
label_x = np.array([1,1,1,1, 0,0,0,0])
one_hot_x = [tf.keras.preprocessing.text.one_hot(d, 50) for d in data_x]
print(one_hot_x)
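Here is the output (your numbers may differ between runs, because one_hot relies on Python's built-in hash, which is not stable across sessions):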
[[21], [9, 34], [24], [20], [28], [41, 26, 9, 17, 26], [36], [9, 41]]
With just a one-line list comprehension, we now have a numerical representation of our text dataset.
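To see roughly what happens inside that call, here is a minimal sketch in plain Python. It is not the Keras implementation: the tokenizer is simplified to lowercasing and splitting on whitespace, and md5 is used in place of Python's hash so the result is repeatable. The names tokenize and hashed_indices are made up for this illustration.

```python
import hashlib

def tokenize(text):
    """Lowercase and split on whitespace (a simplification of the real tokenizer)."""
    return text.lower().split()

def hashed_indices(text, n):
    """Map each word to an integer in [1, n - 1]; 0 stays free for padding."""
    indices = []
    for word in tokenize(text):
        digest = hashlib.md5(word.encode("utf-8")).hexdigest()
        indices.append(int(digest, 16) % (n - 1) + 1)
    return indices

reviews = ['good', 'well done', 'OOps I hate it deadly']
encoded = [hashed_indices(review, 50) for review in reviews]
print(encoded)
```

Each review yields one index per word, and two different words can land on the same index when the range (50 here) is small compared to the vocabulary.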
Data Engineering - Padding
If you look carefully, you will notice that the result is a list of arrays of different sizes, because the individual training samples have different lengths.
That's not good: our training samples need to have equal length before we can train on them, which is why we need padding to bring them all to one standard length.
Padding extends arrays shorter than the standard length by appending 0s, and trims the extra elements from arrays that are longer (by default, pad_sequences drops them from the start of the sequence).
Given the nature of our dataset, let's set the standard length to four (4) for our training data. The maxlen parameter sets the standard length, and padding='post' appends the 0s at the end. Here is how you put that into code:
data_x = [
'good', 'well done', 'nice', 'Excellent',
'Bad', 'OOps I hate it deadly', 'embrassing',
'A piece of shit']
label_x = np.array([1,1,1,1, 0,0,0,0])
# one hot encoding
one_hot_x = [tf.keras.preprocessing.text.one_hot(d, 50) for d in data_x]
# padding
padded_x = tf.keras.preprocessing.sequence.pad_sequences(one_hot_x, maxlen=4, padding = 'post')
print(padded_x)
Your output is going to look like this;
array([[21, 0, 0, 0], [ 9, 34, 0, 0], [24, 0, 0, 0], [20, 0, 0, 0],[28, 0, 0, 0], [26, 9, 17, 26], [36, 0, 0, 0],[ 9, 41, 0, 0]], dtype=int32)
As we can see, our training data is now engineered and ready for training.
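If you want to see the padding logic without TensorFlow, the sketch below mimics it in plain Python. It assumes the same settings as above (zeros appended at the end, extra items dropped from the start) and a positive maxlen. The function name pad_post is made up for this example.

```python
def pad_post(sequences, maxlen, value=0):
    """Pad at the end and truncate from the start of each sequence."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]                           # keep the last maxlen items
        padded.append(seq + [value] * (maxlen - len(seq)))  # fill the rest with zeros
    return padded

encoded = [[21], [9, 34], [41, 26, 9, 17, 26]]
print(pad_post(encoded, maxlen=4))
# [[21, 0, 0, 0], [9, 34, 0, 0], [26, 9, 17, 26]]
```

Notice how the five-word review loses its first index (41), matching the padded output shown earlier.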
Building a Model
I'm assuming you know the TensorFlow basics and are familiar with sequential models. Everything is going to be as usual, with the exception of an Embedding layer.
Why Embedding Layer?
The data we have engineered is just arrays of integers, and it is hard to tell how similar two words are by comparing those numbers. That's why we need an Embedding layer, which turns each integer into a dense vector of fixed size, so that we can compute how the vectors relate to each other.
The Embedding layer takes three main parameters:
- input_dim (size of the vocabulary, i.e. the largest word index plus one)
- output_dim (size of the corresponding dense vectors)
- input_length (standard length of the input sequences)
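Conceptually, an embedding layer is a lookup table with input_dim rows and output_dim columns. The sketch below shows that idea in plain Python. The random values only stand in for weights that training would adjust, and the names table and embed are made up for this illustration.

```python
import random

random.seed(0)  # only to make the illustration repeatable

input_dim, output_dim = 50, 4

# One row per possible word index, filled with small random values.
table = [[random.uniform(-0.05, 0.05) for _ in range(output_dim)]
         for _ in range(input_dim)]

def embed(sequence):
    """Replace every integer index with its row from the table."""
    return [table[index] for index in sequence]

vectors = embed([9, 34, 0, 0])          # a padded review of length 4
print(len(vectors), len(vectors[0]))    # 4 4
```

A review of 4 indices becomes 4 vectors of 4 values each, and the same index always maps to the same vector.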
Here is an example:
sample_data = np.array([[1], [4]], dtype='int32')
emb_layer = tf.keras.layers.Embedding(50, 4)  # input_length left out: each sample here holds a single index
print(emb_layer(sample_data))
Here is how your output will look:
tf.Tensor(
[[[-0.04779602 -0.01631527 0.01087242 0.00247218]]
[[-0.03402965 0.02020274 0.02596027 -0.00916996]]], shape=(2, 1, 4), dtype=float32)
Now instead of a bunch of meaningless integers, we have a vector representation like this for our data, and that's what the embedding layer does. Now let's put it into our project:
model = tf.keras.models.Sequential([
tf.keras.layers.Embedding(50, 8, input_length=4),
tf.keras.layers.Flatten(),
tf.keras.layers.Dense(1, activation='sigmoid')
])
Above is the complete architecture of our text classification model. Flatten() reshapes each sample's 4 x 8 embedding output into a single vector of 32 values, and the last Dense layer is the deciding node of our classification model, which has the final say on whether a review is positive or negative.
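To make the last two layers less magical, here is a rough plain-Python sketch of what they compute for a single review. The embedding values and weights are placeholders invented for the illustration, not trained values, and the helper names flatten and dense_sigmoid are made up.

```python
import math

def flatten(matrix):
    """Concatenate the rows of a 2D list into one flat list."""
    return [value for row in matrix for value in row]

def dense_sigmoid(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias, squashed into (0, 1)."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical embedding output for one review: 4 positions x 8 values.
embedded = [[0.01 * (row + col) for col in range(8)] for row in range(4)]
flat = flatten(embedded)
weights = [0.0] * len(flat)                 # placeholder weights, not trained
print(len(flat))                            # 32
print(dense_sigmoid(flat, weights, 0.0))    # 0.5, since sigmoid(0) = 0.5
```

With all weights at zero the node is undecided (0.5). Training moves the weights so that positive reviews push the output towards 1 and negative ones towards 0.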
Now that we have initialized our model, we finish configuring it by specifying the optimizer to use and the loss to minimize during training:
model.compile(optimizer='adam', loss='binary_crossentropy',
metrics=['accuracy'])
model.summary()
Output
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding (Embedding) (None, 4, 8) 400
_________________________________________________________________
flatten (Flatten) (None, 32) 0
_________________________________________________________________
dense (Dense) (None, 1) 33
=================================================================
Total params: 433
Trainable params: 433
Non-trainable params: 0
_________________________________________________________________
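If you are wondering where the numbers in the Param # column come from, the arithmetic below reproduces them. The variable names are made up for this illustration.

```python
vocab_size, embedding_dim, sequence_length = 50, 8, 4

embedding_params = vocab_size * embedding_dim       # one 8-value vector per word index
flatten_units = sequence_length * embedding_dim     # Flatten has no weights of its own
dense_params = flatten_units * 1 + 1                # one weight per input plus one bias

print(embedding_params, flatten_units, dense_params)   # 400 32 33
print(embedding_params + dense_params)                 # 433
```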
Training the model
Once we finish configuring our model, we can begin training it. Since our dataset is tiny, we don't really need many epochs, but let's fit for 1000 epochs and visualize the learning curve:
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
data_x = [
'good', 'well done', 'nice', 'Excellent',
'Bad', 'OOps I hate it deadly', 'embrassing',
'A piece of shit']
label_x = np.array([1,1,1,1, 0,0,0,0])
# one hot encoding
one_hot_x = [tf.keras.preprocessing.text.one_hot(d, 50) for d in data_x]
# padding
padded_x = tf.keras.preprocessing.sequence.pad_sequences(one_hot_x, maxlen=4, padding = 'post')
# Architecting our Model
model = tf.keras.models.Sequential([
tf.keras.layers.Embedding(50, 8, input_length=4),
tf.keras.layers.Flatten(),
tf.keras.layers.Dense(1, activation='sigmoid')
])
# specifying training params
model.compile(optimizer='adam', loss='binary_crossentropy',
metrics=['accuracy'])
history = model.fit(padded_x, label_x, epochs=1000,
batch_size=2, verbose=0)
# plotting training graph
plt.plot(history.history['loss'])
plt.xlabel('epoch')
plt.ylabel('loss')
plt.show()
The output of the training graph is going to look as shown below;
This looks pretty good, showing that our training was able to minimize the loss effectively, and our model is ready for testing.
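The loss on that curve is binary cross-entropy. As a rough sketch of what is being minimized, here it is in plain Python. The probabilities are invented for the illustration and are not outputs of our model. The clipping constant eps is my own choice for this sketch.

```python
import math

def binary_crossentropy(labels, probabilities, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all samples."""
    total = 0.0
    for y, p in zip(labels, probabilities):
        p = min(max(p, eps), 1.0 - eps)   # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)

labels = [1, 1, 0, 0]
undecided = binary_crossentropy(labels, [0.5, 0.5, 0.5, 0.5])
confident = binary_crossentropy(labels, [0.9, 0.8, 0.2, 0.1])
print(undecided)   # about 0.693 (ln 2)
print(confident)   # smaller, because the predictions agree with the labels
```

The closer the predicted probabilities get to the true labels, the lower the loss, which is exactly the downward trend you want to see while training.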
Evaluating the model
Let's create a simple function to predict new phrases using the model we have just created. It won't be that smart, since our dataset was really small.
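def predict(word):
    # encode and pad the input exactly as we did for the training data
    one_hot_word = [tf.keras.preprocessing.text.one_hot(word, 50)]
    pad_word = tf.keras.preprocessing.sequence.pad_sequences(one_hot_word, maxlen=4, padding='post')
    result = model.predict(pad_word)
    # 0.1 is the cutoff used here; 0.5 is the more common choice for a sigmoid output
    if result[0][0] > 0.1:
        print('you look positive')
    else:
        print('damn you\'re negative')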
Let's test the predict function with different inputs:
>>> predict('this tutorial is cool')
you look positive
>>> predict('This tutorial is bad as me ')
damn you're negative
Our model classified both of these reviews correctly, which suggests it learnt something, although with only eight training samples you shouldn't read too much into it.