Training a Convolutional Neural Network for Image Classification

Image classification is one of the main applications of convolutional neural networks. If you are interested in the topic, like me, you may have consulted many theoretical resources and tutorials, which can be cumbersome since there is a lot of information and many concepts to digest. As a curious developer you may want to get your hands into code as soon as possible; if that is your case, you're in the right place.

In the following post, I'm going to train a convolutional neural network to classify brain tumor images, while briefly going over some key concepts.

TL;DR: If you're eager to check the code, this is the notebook.

Prerequisites

Before starting, be sure you have trained machine learning models before and that you are proficient in Python. We are going to use TensorFlow with Keras, plus the Kaggle API to download the dataset.

Dataset preprocessing

You may already know that one of the most important steps in any classification task, no matter the technique (classical statistical approaches, machine learning, neural networks), is to gather enough quality data. Luckily, there are a lot of open source datasets, and Kaggle hosts one with labeled brain tumor images.

The dataset has four classes distributed as:

  • Glioma tumor: 901 samples.
  • Pituitary tumor: 844 samples.
  • Meningioma tumor: 913 samples.
  • Normal (no tumor): 438 samples.

You can download the dataset into your work environment using the following Kaggle API command, but first you need to obtain a Kaggle API key.

!kaggle datasets download -d thomasdubail/brain-tumors-256x256

In a previous post, I explained the need for three separate datasets: training, validation and testing, so the first step will be to build them. I divided the whole dataset using these percentages:

  • Training: 66%
  • Validation: 14%
  • Testing: 20%
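If you're wondering how to move the files, here is a minimal splitting sketch, not the notebook's exact code; the folder names source_dir and new_base_dir are assumptions about your environment, and it assumes the Kaggle archive was extracted with one subfolder per class.

# Copy the images into train/validation/test folders, keeping one subfolder per class
import os, random, shutil
from pathlib import Path

source_dir = Path("brain-tumors-256x256/Data")   # hypothetical extraction path
new_base_dir = Path("brain_tumors_split")
splits = {"train": 0.66, "validation": 0.14, "test": 0.20}

for class_name in os.listdir(source_dir):
    files = os.listdir(source_dir / class_name)
    random.shuffle(files)
    start = 0
    for split_name, fraction in splits.items():
        end = start + int(len(files) * fraction)
        # The last split takes whatever is left after rounding
        chunk = files[start:] if split_name == "test" else files[start:end]
        destination = new_base_dir / split_name / class_name
        destination.mkdir(parents=True, exist_ok=True)
        for file_name in chunk:
            shutil.copyfile(source_dir / class_name / file_name, destination / file_name)
        start = end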

I separated the images into folders following the previous distribution. Then, you have to transform the images into a representation TensorFlow can understand: tensors. We can easily do this with some Keras magic. The function image_dataset_from_directory reads images from a directory and returns a dataset; the name of each folder is interpreted as the class of the samples it contains, so, after moving the images to the proper folders, my directory structure looks like this:
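(The layout below is the general idea rather than a literal listing; the root folder is wherever new_base_dir points in your environment.)

new_base_dir/
├── train/
│   ├── glioma_tumor/
│   ├── meningioma_tumor/
│   ├── normal/
│   └── pituitary_tumor/
├── validation/
│   └── (the same four class folders)
└── test/
    └── (the same four class folders)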

With the following code I'm passing the root folder for each dataset to the function. In this case, train_dataset will be composed of 66% of the images, and each sample will have the class name corresponding to the directory that contains it (meningioma_tumor, pituitary_tumor, glioma_tumor, normal).

# Use the image_dataset_from_directory to create the 3 datasets
from tensorflow.keras.utils import image_dataset_from_directory

train_dataset = image_dataset_from_directory(
    new_base_dir / "train",
    image_size=(180, 180),
    batch_size=32)
validation_dataset = image_dataset_from_directory(
    new_base_dir / "validation",
    image_size=(180, 180),
    batch_size=32)
test_dataset = image_dataset_from_directory(
    new_base_dir / "test",
    image_size=(180, 180),
    batch_size=32)

Building a baseline model

A basic step when training a model is to first build a baseline: something not too fancy but capable of completing the work, that is, a model that learns how to classify the samples, even if it overfits. This first step lets you know whether the problem is solvable.

# Create an initial model
from tensorflow import keras
from tensorflow.keras import layers

inputs = keras.Input(shape=(180, 180, 3))
x = layers.Rescaling(1./255)(inputs)
x = layers.Conv2D(filters=32, kernel_size=3, activation="relu")(x)
x = layers.MaxPooling2D(pool_size=2)(x)
x = layers.Conv2D(filters=64, kernel_size=3, activation="relu")(x)
x = layers.MaxPooling2D(pool_size=2)(x)
x = layers.Conv2D(filters=128, kernel_size=3, activation="relu")(x)
x = layers.MaxPooling2D(pool_size=2)(x)
x = layers.Conv2D(filters=256, kernel_size=3, activation="relu")(x)
x = layers.MaxPooling2D(pool_size=2)(x)
x = layers.Conv2D(filters=256, kernel_size=3, activation="relu")(x)
x = layers.Flatten()(x)
outputs = layers.Dense(4, activation="softmax")(x)
base_model = keras.Model(inputs=inputs, outputs=outputs)

# Compile the model before training (integer labels, so sparse_categorical_crossentropy)
base_model.compile(loss="sparse_categorical_crossentropy",
                   optimizer="rmsprop",
                   metrics=["accuracy"])

A Convolutional Neural Network (informally described) is a stack of convolutional and max pooling layers. The convolutional layers learn patterns over the images in an incremental way: the first layers learn simple patterns like edges or colors, and deeper layers learn higher-level patterns such as tires, ears, eyes, etc. The max pooling layers downsample the feature maps so the network focuses on the strongest patterns and discards noise. As a rule of thumb, in a multi-class problem, the activation function on the last layer should be softmax. Now let's train the baseline model.

# Configure the callbacks to:
# - Stop training early if the validation accuracy stops improving
# - Save the best model according to the validation loss

callbacks = [
    keras.callbacks.EarlyStopping(
        monitor="val_accuracy",
        patience=10,
    ),
    keras.callbacks.ModelCheckpoint(
        filepath="base_model",
        save_best_only=True,
        monitor="val_loss")
]
history = base_model.fit(
    train_dataset,
    epochs=100,
    validation_data=validation_dataset,
    callbacks=callbacks)

I'm using keras.callbacks.ModelCheckpoint to save the best model by monitoring the validation loss, which measures how far the predictions are from the correct values on the validation dataset. I'm also using the keras.callbacks.EarlyStopping callback to stop the training once the model has gone 10 epochs (check the patience parameter) without improving the validation accuracy, that is, once the model has stopped learning.

It's a good idea to plot the accuracy and the loss values on the training and validation datasets, since they give us information about the model's overfitting behavior and the capacity of the CNN to solve the problem. Let's check the graphics:
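The plots themselves are rendered in the notebook; here is a minimal sketch of how these curves can be drawn from the History object returned by fit():

# Plot training and validation accuracy and loss from the History object
import matplotlib.pyplot as plt

accuracy = history.history["accuracy"]
val_accuracy = history.history["val_accuracy"]
loss = history.history["loss"]
val_loss = history.history["val_loss"]
epochs = range(1, len(accuracy) + 1)

plt.plot(epochs, accuracy, "bo", label="Training accuracy")
plt.plot(epochs, val_accuracy, "b", label="Validation accuracy")
plt.title("Training and validation accuracy")
plt.legend()
plt.figure()
plt.plot(epochs, loss, "bo", label="Training loss")
plt.plot(epochs, val_loss, "b", label="Validation loss")
plt.title("Training and validation loss")
plt.legend()
plt.show()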

We can see the training accuracy went up to almost 100%, which means our model is capable of solving the problem. On the other hand, the validation accuracy didn't surpass 70%, and the training and validation loss curves gradually separate: whilst the training loss decreases, the validation loss increases. This is the classic symptom of overfitting. Our model will not generalize, and in a real scenario it will not be accurate when classifying new tumor images. Let's check the accuracy of our model on the test dataset.
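A minimal sketch of that evaluation, assuming the best checkpoint was saved to the "base_model" path used by ModelCheckpoint above:

# Reload the best checkpoint and evaluate it on the test dataset
test_model = keras.models.load_model("base_model")
test_loss, test_accuracy = test_model.evaluate(test_dataset)
print(f"Test accuracy: {test_accuracy:.3f}")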

We achieved 75%. But hey, cheer up, this is not bad news: we completed the first task. Now that we know the problem is solvable, we need to improve our model; we need to beat the baseline.

Beating the baseline model

A great way to improve a model is to obtain more data; take this approach whenever it is feasible. Nevertheless, this is a limitation in some cases, and for this problem we are not able to easily get more brain tumor images. Fortunately, we have some other tools to improve our model: data augmentation and transfer learning.

Adding data augmentation

We can't get new brain tumor images, but we can synthetically create new ones from the existing samples; that's what data augmentation means: taking the existing samples and modifying them to feed our model. Again we'll use the tools provided by Keras, some preprocessing layers that make slight changes to the images through transformations such as zooming, rotation, contrast changes or cropping, among others. These new "modified" images will be new samples for our model. You can check all the data augmentation possibilities here.

The following code shows a sample of images preprocessing using the Keras layers. First let’s download an image.

import tensorflow as tf
from tensorflow.keras import layers 
import urllib.request
import PIL.Image

img_url = 'https://upload.wikimedia.org/wikipedia/commons/1/11/Iron_Maiden_in_Bercy_4.jpg'
urllib.request.urlretrieve(img_url, "sample.png")
img = PIL.Image.open("sample.png")

Now, to process the image, we need to convert it into a tensor. Let's do this step first and display the result:

import matplotlib.pyplot as plt

# Load the image, convert it into an array and then into a tensor
img = tf.keras.utils.load_img("sample.png")
input_arr = tf.keras.utils.img_to_array(img)
img_tensor = tf.convert_to_tensor(input_arr)
_ = plt.imshow(img_tensor.numpy().astype('uint8'))

Now we are going to create a stack of layers to preprocess the image. This can easily be done using the Sequential API; then we pass the image through the layers and plot the result.

from tensorflow import keras

data_augmentation = keras.Sequential(
    [
        layers.RandomFlip("horizontal"),
        layers.RandomRotation(0.4),
        layers.RandomZoom(0.2),
    ]
)

result = data_augmentation(img_tensor)
_ = plt.imshow(result.numpy().astype('uint8'))

Transfer Learning

Another great technique to improve your model is transfer learning. A cool feature of deep learning is the ability to generalize and learn high-level patterns from large datasets, patterns that may be reused in another problem (with a different dataset!). For example, in an image classification problem, a model trained with a big enough dataset may have learned visual patterns that are generic and therefore portable. Fortunately, Keras offers a set of models pretrained on thousands of images; you can check the catalog here.

I used the VGG16 pretrained model from Keras. This model has learned patterns about images, but it's not trained to classify tumor images; that's why we reuse only a subset of its layers, specifically the ones before the classification layers, called the convolutional base (a.k.a. conv base), and we train new final layers, that is, the classifier. Let's see the code:

# Re-import the conv base for experimenting with different frozen layers
conv_base  = keras.applications.vgg16.VGG16(
    weights="imagenet",
    include_top=False)
conv_base.trainable = False

The include_top=False argument tells Keras to load the model without the final classification layers (only the conv base), and conv_base.trainable = False tells Keras not to update the weights of the conv base. If we skip this step we'll end up losing the patterns learned by VGG16, because it will be retrained on the new set of tumor images.

Putting it all together

Now we can configure a new model to beat the baseline using what we have explored until now: a pretrained conv base with visual patterns learned from a larger dataset, a data augmentation stage to extend the tumor images dataset, and a final classifier layer:

from tensorflow.keras import layers

data_augmentation = keras.Sequential(
    [
        layers.RandomFlip("horizontal"),
        layers.RandomRotation(0.1),
        layers.RandomZoom(0.2),
        layers.RandomContrast(0.5)
    ]
)

inputs = keras.Input(shape=(180, 180, 3))
x = data_augmentation(inputs)
x = keras.applications.vgg16.preprocess_input(x)
x = conv_base(x)
x = layers.Flatten()(x)
x = layers.Dense(256)(x)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(4, activation="softmax")(x)
model_with_conv_base = keras.Model(inputs, outputs)

model_with_conv_base.compile(loss="sparse_categorical_crossentropy",
              optimizer="rmsprop",
              metrics=["accuracy"])

The preprocess_input call is necessary to format the inputs the way VGG16 expects. There's also a Dropout layer, which is a regularization technique: the dropout layer randomly inhibits the output of some neurons to prevent "conspiracies" inside the network. By "dropping out" the output value of neurons, the next layers must adapt their weights to handle the representation by themselves (without the help of the dropped neurons). This regularization technique is used to reduce overfitting. Let's check the result of the full approach.
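Training it mirrors what we did for the baseline; here is a minimal sketch (the callback settings below are assumptions, not necessarily the notebook's exact values):

# Train the model with the conv base, keeping the best checkpoint
callbacks = [
    keras.callbacks.EarlyStopping(monitor="val_accuracy", patience=10),
    keras.callbacks.ModelCheckpoint(
        filepath="model_with_conv_base",
        save_best_only=True,
        monitor="val_loss"),
]
history = model_with_conv_base.fit(
    train_dataset,
    epochs=100,
    validation_data=validation_dataset,
    callbacks=callbacks)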

With this new model, the validation accuracy surpassed 70% and the loss curves decreased together. This means we're controlling the overfitting. Finally, let's see the accuracy on the test dataset:

We improved on the baseline model, going from 75% to 83% accuracy on the test dataset.

Further steps for your experimentation

Finally, I encourage you to improve the model by exploring other approaches such as:

  • Create your own network topology combining Convolutional and MaxPooling layers, change the number of filters in the convolutional layers and track the results.
  • Add and test other combinations of augmentation layers for the data augmentation approach.
  • Experiment with the transfer learning using other pretrained models from Keras.

Another technique you may try is fine-tuning. In the transfer learning section I stated you should "freeze" the conv base layers to avoid updating their weights and losing the previous training effort, but it is not necessary to freeze all the layers. Fine-tuning proposes unfreezing some of the final layers in the conv base, so it learns new patterns without losing the generalization gained beforehand (a rough sketch of the idea is shown below). Here is the documentation for the approach. I recommend reading the 8th chapter of Deep Learning with Python by François Chollet, where I learned all these approaches. Happy coding!
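To make the fine-tuning idea concrete, here is a minimal sketch; the number of unfrozen layers and the learning rate are assumptions, not values from the notebook.

# Unfreeze only the last few layers of the conv base
conv_base.trainable = True
for layer in conv_base.layers[:-4]:
    layer.trainable = False

# Recompile with a low learning rate so the pretrained weights are only nudged
model_with_conv_base.compile(
    loss="sparse_categorical_crossentropy",
    optimizer=keras.optimizers.RMSprop(learning_rate=1e-5),
    metrics=["accuracy"])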

Validating your Deep Learning Model

Most probably you are aware of the training and testing stages in machine learning and deep learning, right? You get a dataset composed of samples, where each sample contains features related to the problem and a label or target value. You want your model to learn how to predict the target value from the features of a sample. So you divide your dataset into a training and a testing set. You use your training set to teach the model how to predict, and then you evaluate what the model has learned by using the testing set. If you are reading this post, it's because you are aware there's a third set, the validation set. Why do you need one? What best practices should you follow to build one? Keep reading.

Why do you need a Validation Set

Before I continue, I want you to have this scenario in mind: let's say you're a math teacher, and you teach your students how to solve some problems. Later, at the end of the semester, you need to evaluate your students to be sure they learned how to solve problems, so you write a test and ask them to work on the exact same problems you taught them how to solve!

Evaluating your students on the exact same problems you taught them is not really useful.

Will this evaluation be useful to be sure your students really learned how to solve the problems? Maybe they just memorized the problems you taught them in class; what will happen when they face new problems they haven't seen before? This analogy is useful to understand the training (the lessons in class) of your models (the students). You need to know that your model is capable of solving what it is supposed to, and you should evaluate your model with data samples it hasn't seen during training. The ability of the model to perform on unseen data is called generalization.

That's why you need a testing set separate from the training set: to evaluate the performance of your model. Then why do you need a validation set? Well, that's because in a deep learning model (and in ML in general) you have a set of hyperparameters to tune and architectural decisions to make in order to get the best model for your problem. So you iteratively follow these steps:

  1. Train your model using training data.
  2. Validate your model with the validation data.
  3. Tune your model by changing hyperparameters and changing your network architecture.
  4. Repeat steps 1 to 3 until you check your model is able to generalize.
  5. Test your model with the testing data.

In the first iterations, your model may be underfitting, which means it's not able to beat a baseline. In the next iterations, if you tune properly, your model will increase its performance until it starts overfitting on the validation set. Once you get there, you must fight this issue with feature engineering or regularization techniques; you know you're on the right track by comparing the training and validation loss curves.

Now that you know why validation is important, let's check which techniques to use to build the validation set.

How to build your Validation Set

Two popular approaches are commonly used: holdout and K-Fold. Let's see them in detail.

Holdout Validation

With holdout, you separate a portion of the training set to be used for validation. It's a common heuristic to shuffle your data to achieve representativeness in training and validation, which means having samples of all the labels the model should learn to classify in both sets. It's even a good idea to have samples with key features that, according to your business knowledge, you know beforehand the model should learn. If your problem is not classification but regression, you should try to have representativeness across the whole range of values your model needs to learn to predict.

Holdout Validation.

This approach works well when your training set is large enough to tolerate being reduced (to carve out the validation portion) without affecting the model's learning. There is another tradeoff: since the model is always validated on the same portion of data, you may want to validate with different and more representative data. When you face either of these two issues, you may want to use K-Fold validation.
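As a rough illustration, a holdout split can be as simple as this sketch; train_x and train_y are assumed to be NumPy arrays of features and labels, and the 20% figure is just an example.

import numpy as np

# Shuffle the indices to help representativeness, then carve out the validation slice
indices = np.random.permutation(len(train_x))
num_validation = int(0.2 * len(train_x))

val_x = train_x[indices[:num_validation]]
val_y = train_y[indices[:num_validation]]
partial_train_x = train_x[indices[num_validation:]]
partial_train_y = train_y[indices[num_validation:]]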

K-Fold Validation

When your training set is not big enough to set aside a portion for validation, you should use K-Fold. In this approach, you split your training set into subsets (folds) of a predefined size. The process steps are:

  1. Split your training set into K folds.
  2. Select the first fold to be your validation set, use the rest to be your training set.
  3. Instantiate a new model.
  4. Train and validate your model.
  5. Save the metrics.
  6. Repeat steps 2 to 5 with the next fold.

At the end of the process, calculate the average of your metrics (dividing by the number of folds) to get the validation and training metrics, then proceed to tune your model and repeat the same process until you are sure your model is generalizing. As a final step, you should use the whole original training set to run the training stage on your tuned model.

K-Fold Validation.

Data Redundancy

As you have seen, validation is a vital step for tuning your model. Besides the representativeness we mentioned earlier, you should be aware of data redundancy as an issue you may face. Data redundancy occurs when some of the samples present in the validation set are also present in the training set (the teacher analogy I described before). This, of course, will give you better results on validation, but these are fake metrics that give you the impression your model doesn't need to be tuned, and hence its performance will degrade in the testing stage.

Hands On Code

I have created a Google Colab notebook where you can check, through examples, the theory I just explained. Let's see some important parts of the notebook. I'm using the IMDB Keras dataset, so the problem is binary classification.

Define the initial model

I started by defining a model that, heuristically, is not going to perform well on the problem: three dense layers of 20 units each, plus a sigmoid output layer:

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Dense(20, activation="relu"),
    layers.Dense(20, activation="relu"),
    layers.Dense(20, activation="relu"),
    layers.Dense(1, activation="sigmoid")
])
model.compile(optimizer="rmsprop",
              loss="binary_crossentropy",
              metrics=["accuracy"])

I trained this model for 20 epochs using a batch size of 512; note how I'm using the validation_data parameter to pass the validation set.

history = model.fit(training_data_x,
                    training_data_y,
                    epochs=20,
                    batch_size=512,
                    validation_data=(validation_data_x, validation_data_y))

Now, after the training, I plotted the loss values for the training and validation stages. The result shows how the training loss decays while the validation loss drops at first but then grows, which means our model is overfitting:

Validation and training loss curves separate, which means our model is overfitting.

Tuning the model

To improve the performance, I tuned the model by using only two dense layers of 16 units and adding dropout and L2 regularization. I also reduced the validation set size to have more samples for training.

from tensorflow.keras import regularizers
model = keras.Sequential([
    layers.Dense(16, kernel_regularizer=regularizers.l2(0.002),activation="relu"),
    layers.Dropout(0.2),
    layers.Dense(16, kernel_regularizer=regularizers.l2(0.002),activation="relu"),
    layers.Dropout(0.2),
    layers.Dense(1, activation="sigmoid")
])
model.compile(optimizer="rmsprop",
              loss="binary_crossentropy",
              metrics=["accuracy"])

After training, the plot shows how we are controlling the overfitting:

Validation loss values are not increasing; we have controlled the overfitting.

K-Fold Validation

Since one of the premises for using K-Fold validation is a small dataset, I reduced the number of samples from 25000 to only 5000. Using only 20% of the original data, I chose 4 folds, so in each iteration I'm training the model with 3750 samples and validating with 1250 samples, using the already tuned model from the previous step. These are the loss curves:

Loss values curves for K Fold Validation.

According to the plot, the model is not overfitting. Finally, in the testing stage, the model reached an accuracy of 0.86.
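For reference, the fold loop behind those curves can be sketched like this; small_train_x, small_train_y and build_tuned_model are hypothetical names for the reduced 5000-sample arrays and a helper that returns a fresh compiled copy of the tuned model.

import numpy as np

k = 4
num_samples = len(small_train_x) // k
fold_histories = []

for fold in range(k):
    # Use the current fold for validation and the rest for training
    val_x = small_train_x[fold * num_samples:(fold + 1) * num_samples]
    val_y = small_train_y[fold * num_samples:(fold + 1) * num_samples]
    train_x = np.concatenate([small_train_x[:fold * num_samples],
                              small_train_x[(fold + 1) * num_samples:]])
    train_y = np.concatenate([small_train_y[:fold * num_samples],
                              small_train_y[(fold + 1) * num_samples:]])

    model = build_tuned_model()   # instantiate a new model for each fold
    history = model.fit(train_x, train_y,
                        epochs=20, batch_size=512,
                        validation_data=(val_x, val_y))
    fold_histories.append(history.history)

# Average the validation loss across folds, epoch by epoch
avg_val_loss = np.mean([h["val_loss"] for h in fold_histories], axis=0)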

Exploring Data Redundancy

To test the effect of data redundancy, I used the original dataset of 25000 samples. I separated 7500 samples to be the validation set and replaced 2500 of them with duplicates taken from the training dataset, so the validation dataset has 33% of its data leaked from the training set. I used the original model without the tuning, and the validation loss reached 0.7.

Same untuned model with data redundancy.

Even though we know the model overfits, the effect of data redundancy obscures this and gives us the impression that we have a tuned model.
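For illustration, a leaky validation set like the one described above could be assembled roughly like this, reusing the array names from the earlier snippets and assuming they are NumPy arrays.

import numpy as np

# Keep 5000 genuinely unseen samples and add 2500 samples copied from training
leaked_idx = np.random.choice(len(training_data_x), size=2500, replace=False)
redundant_val_x = np.concatenate([validation_data_x[:5000], training_data_x[leaked_idx]])
redundant_val_y = np.concatenate([validation_data_y[:5000], training_data_y[leaked_idx]])
# 2500 / 7500 = 33% of the validation samples are leaked from the training set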

Validating your model is an essential step for making the necessary tuning. In real life you may train your model with huge datasets, which consume machine resources and time, so be sure you have validated and tuned your model before the full training stage. Remember to be a good teacher and not show the answers to your student, so you can be sure it will perform as you expect in real life. Happy coding!

How to use a trained machine learning model using Python

Now that you trained your model…

Many times I have faced a funny (or sad?) situation. When I started researching Machine Learning (using Python), I learned a lot about the packages and methods for supervised and unsupervised learning and, as may be common for some of you, most of that knowledge came from online tutorials. All these tutorials had a similar structure, something like this:

Ta-da! At the end of the tutorial you got a trained model, ready to be used with unseen data, because of course what we all really want is to use a model on new data. But… the tutorials never explained that part.

So, how can we use a trained model to predict unseen values, and how can we put this model to work in real scenarios? Keep reading.

Taking a shortcut

Let's go through the steps to train a model very quickly. For the purpose of this exercise, we will train a model which classifies numbers into two classes: 50 or lower, and greater than 50:

from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
import random
import pandas as pd

# Create a dataframe of random numbers between 1 and 100
# Create a label column assigning 1 to values above 50 and 0 to everything else
data_df = pd.DataFrame(data=[random.randrange(1,100) for _ in range(100)], columns=["value"])
data_df['label'] = data_df['value'].apply(lambda x: 1 if x > 50 else 0)

# Divide training and test sets
train_df, test_df = train_test_split(data_df, test_size=0.3, random_state=0)

train_data = train_df['value']
train_target = train_df['label']

test_data = test_df['value']
test_target = test_df['label']

classifier = RandomForestClassifier()
classifier.fit(train_data.values.reshape(-1,1),train_target)

# Predictions over the test data
predictions = classifier.predict(test_data.values.reshape(-1,1))

# Print the metrics
print(confusion_matrix(test_target,predictions))
print(classification_report(test_target,predictions))
print(accuracy_score(test_target, predictions))

Unsurprisingly, such a simple dataset gives perfect score metrics. Now, this is the step where all the tutorials ended: a trained model. The first thing I always thought at this point was: how do I save this model for future use?

Saving and loading a trained model

Meet pickle, a Python module which serialises and de-serialises objects to/from a byte stream. Following the previous script, by adding these two lines we can save our model:

import pickle
pickle.dump(classifier, open('model.sav','wb'))

That's all you need to save your trained model. Now you'll want to load it into another script for further use; again we use pickle. In the next script we load the model and use it to predict labels for unseen data:

import pickle

model = pickle.load(open('model.sav','rb'))
model.predict([[-1],[1000]])

The output is 0 for the -1 sample and 1 for 1000, a perfect prediction!

Using a trained model on an API

A trained Machine Learning model is a tool, but the moment you use it on real business cases is when all the effort to collect and prepare the data and tune the hyperparameters makes sense; here is where the ROI becomes clear and the model's usefulness makes the stakeholders really happy to have invested in ML. There are many ways to use your models in scripts, the most common being batch processes that assign labels to production data.

However, using your models only in batch processes restricts your possibilities with ML; to unlock its real power you need to be able to use your models in real time, think of recommender systems for example. Using the same approach to load a trained model, plus a web framework like Flask, you can expose your model through a defined API and apply it to real scenarios.

A use case of a trained model in batch: the user receives data processed by the model in a previous time window.
Using a trained model in real time: the user triggers the prediction process. Do you see the advantage?
#!flask/bin/python
from flask import Flask, request
import pickle

model = pickle.load(open('model.sav','rb'))

app = Flask(__name__)

@app.route('/model/prediction',methods=['POST'])
def predict():
    item_key = 'value'

    if not request.json or item_key not in request.json:
        return 'Empty payload', 400

    value = request.json[item_key]

    return str(model.predict([[value]])[0])


if __name__ == '__main__':
    app.run(host='0.0.0.0',port=8082)

The previous code creates an API which runs on port 8082; I selected that number to avoid conflicts with other web servers you may be running on your machine. It exposes a single endpoint at /model/prediction responding to the POST method. It receives a JSON payload, looks for an attribute named value, predicts the label for "value" using our trained model, and returns the answer. I saved the previous script as flask_model.py, so all I need to do to run the API is: python3 flask_model.py. You can test your API using curl:

curl -X POST http://127.0.0.1:8082/model/prediction -H "Content-Type: application/json" -d '{"value":120}'

The previous request returns 1, as 120 > 50. And that's all! By using pickle to save and load your models, and Flask to expose them through an API, you are able to take your ML initiatives to production. There are other fancy approaches to achieve the same using paid services like GCP or AWS, but if you want to keep it free this is a solid start. Happy coding!

How to estimate text similarity with Python

Also available in Spanish

Did Melania Trump plagiarise Michelle Obama’s speech?

In 2016, during the Republican National Convention, Melania Trump gave a speech to support Donald Trump's campaign; as soon as the convention concluded, Twitter users noted similarities between some lines pronounced by Mrs Trump and a speech given by Michelle Obama eight years earlier at the Democratic National Convention. Of course, Melania and her husband were criticised, and the campaign team defended them, arguing the speech was written from notes and real-life experiences.

How did the Twitter users note the similarities? On one side, some lines were exactly the same in both speeches; on the other hand, as said in this article from USA Today:

It’s not entirely a verbatim match, but the two sections bear considerable similarity in wording, construction and themes.

If you were to automate the process of detecting those similarities, what approach would you take? A first technique would be to compare both texts word by word, but this will not scale well; consider the complexity of comparing all the possible sequences of consecutive words from one text against the other. Fortunately, NLP gives us a cleverer solution.

What are we going to do?

There is a core NLP task called text similarity, which solves the problem we stated: how do you compare texts without resorting to a naïve and inefficient approach? To do so, you need to transform the texts into a common representation and then define a metric to compare them.

In the following sections you will see the mathematical concepts behind the approach, a code example explained in detail so you can repeat the process yourself, and the answer to the original question: did Melania plagiarise or not?

Text Similarity Concepts

TF-IDF

Straight to the point: each text is transformed into a vector, and the words are then called features. Each position in the vector represents a feature, and the value at that position depends on the method you use. One way to do it is to count how many times the word appears in the text, divide it by the total count of terms in the document and assign this value to the vector for that feature; this is called Term Frequency or TF.

TF(t, d) = n_{t,d} / Σ_k n_{k,d}

Term frequency, where t is a term and n_{t,d} is the number of times the term appears in document d. The denominator is the count of all the terms in the document.

Term frequency alone may give relevance to common words present in the document, but these are not necessarily important; they may be stopwords. Stopwords are words that do not add meaning to a text, like articles, pronouns or modal verbs: I, you, the, that, would, could… and so on.

To know how important a word is in a particular document, Inverse Document Frequency or IDF is used. IDF measures the relevance of a term by counting how many documents in the corpus contain it.

IDF(t) = log(N / df_t)

Inverse document frequency.

In IDF, N represents the number of documents in the corpus, whilst df_t represents the number of documents containing the term t. If all the documents in the corpus contain a term t, then N/df_t will be equal to 1, and log(1) = 0, which means the term is not representative since, emphasising again, it appears in all documents.

Term frequency–inverse document frequency or TF-IDF combines the two previous metrics: if a word is present in a document but also appears in all the other documents of the corpus, it's not a representative word and TF-IDF gives it a low weight. Conversely, if a word appears many times in a document and only in that document, TF-IDF gives it a high weight.

TF-IDF(t, d) = TF(t, d) × IDF(t)

Term frequency–inverse document frequency.

The TF-IDF values are calculated for each feature (word) and assigned to the vector.

Cosine Similarity

Having the texts in the vector representation, it’s time to compare them, so how do you compare vectors?

It's easy to turn texts into vectors in Python; let's see an example:

from sklearn.feature_extraction.text import TfidfVectorizer

phrase_one = 'This is Sparta'
phrase_two = 'This is New York'
vectorizer = TfidfVectorizer ()
X = vectorizer.fit_transform([phrase_one,phrase_two])

vectorizer.get_feature_names()
['is', 'new', 'sparta', 'this', 'york']
X.toarray()
array([[0.50154891, 0.        , 0.70490949, 0.50154891, 0.        ],
       [0.40993715, 0.57615236, 0.        , 0.40993715, 0.57615236]])

This code snippet shows two texts, “This is Sparta” and “This is New York“. Our vocabulary has five words: “This“, “is“, “Sparta“, “New” and “York“.

The vectorizer.get_feature_names() line shows the vocabulary. X.toarray() shows both texts as vectors, with the TF-IDF value for each feature. Note how, for the first vector, the second and fifth positions have a value of zero; those positions correspond to the words "new" and "york", which are not in the first text. In the same way, the third position of the second vector is zero; that position corresponds to "sparta", which is not present in the second text. But how do you compare the two vectors?

By using the dot product it's possible to find the angle between two vectors; this is the concept of cosine similarity. Having the texts as vectors and calculating the angle between them, it's possible to measure how close those vectors are and, hence, how similar the texts are. An angle of zero means the texts are exactly equal. As you remember from your high school classes, the cosine of zero is 1.

cos(θ) = (A · B) / (||A|| ||B||)

The cosine of the angle between two vectors gives a similarity measure.
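As a quick sketch of the idea, here is the cosine similarity between the two TF-IDF vectors from the snippet above, computed by hand with NumPy:

import numpy as np

# Unpack the two rows of the TF-IDF matrix and apply the cosine formula
a, b = X.toarray()
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(cosine)  # 1.0 means identical direction, 0.0 means no weighted terms in common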

Finding the similarity between texts with Python

First, we load the NLTK and Sklearn packages, define a list with the punctuation symbols that will be removed from the text, and also a list of English stopwords.

from string import punctuation
from nltk.corpus import stopwords
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer

language_stopwords = stopwords.words('english')
non_words = list(punctuation)

Let's define three functions: one to remove the stopwords from the text, one to remove punctuation, and a last one which receives a filename as a parameter, reads the file, converts the whole string to lowercase and calls the other two functions to return a preprocessed string.

def remove_stop_words(dirty_text):
    cleaned_text = ''
    for word in dirty_text.split():
        if word in language_stopwords or word in non_words:
            continue
        else:
            cleaned_text += word + ' '
    return cleaned_text

def remove_punctuation(dirty_string):
    for word in non_words:
        dirty_string = dirty_string.replace(word, '')
    return dirty_string

def process_file(file_name):
    file_content = open(file_name, "r").read()
    # All to lower case
    file_content = file_content.lower()
    # Remove punctuation and english stopwords
    file_content = remove_punctuation(file_content)
    file_content = remove_stop_words(file_content)
    return file_content

Now, let's call the process_file function to load the files with the texts you want to compare. For my example, I'm using the content of three of my previous blog entries.

nlp_article = process_file("nlp.txt")
sentiment_analysis_article = process_file("sentiment_analysis.txt")
java_certification_article = process_file("java_cert.txt")

Once you have the preprocessed texts, it's time to do the data science magic: we will use TF-IDF to convert each text into a vector representation, and cosine similarity to compare these vectors.

#TF-IDF
vectorizer = TfidfVectorizer ()
X = vectorizer.fit_transform([nlp_article,sentiment_analysis_article,java_certification_article])
similarity_matrix = cosine_similarity(X,X)

The output of the similarity matrix is:

[[1.         0.217227   0.05744137]
 [0.217227   1.         0.04773379]
 [0.05744137 0.04773379 1.        ]]

First, note the diagonal of 1s: this is the similarity of each document with itself. The value 0.217227 is the similarity between the NLP and the Sentiment Analysis posts, the value 0.05744137 is the similarity between the NLP and Java certification posts, and finally the value 0.04773379 represents the similarity between the Sentiment Analysis and the Java certification posts. As the NLP and the Sentiment Analysis posts cover related topics, their similarity is greater than the one they hold with the Java certification post.

Similarity between Melania Trump and Michelle Obama speeches

With the same tools, you could calculate the similarity between both speeches. I took the texts from this article, and ran the same script. This is the similarity matrix output:

[[1.         0.29814417]
 [0.29814417 1.        ]]

If you skipped the technical explanation and jumped directly here to know the result, let me give you a summary: using an NLP technique, I estimated the similarity of two blog posts on related topics written by me. Then, using the same method, I estimated the similarity between Melania's and Michelle's speeches.

Now, let's make some analysis here. The similarity between two blog posts written by the same author (me) about related topics (NLP and Sentiment Analysis) was 0.217227. The similarity between Melania's and Michelle's speeches was 0.29814417. In other words, two speeches from two different people belonging to opposite political parties are more similar than two blog posts on related topics from the same author. I leave the final conclusion to you. The full code and the text files are on my Github repo.

Happy analysis!

NLP, artificial intelligence applied to language

Also available in Spanish

We live in years of accelerating technological advances, years of automation, where human tasks are being fully performed by machines or supported by them. With these advances, many terms have gained popularity: words like Big Data, Machine Learning, Artificial Intelligence, Deep Learning, Neural Networks and a long list of others. Such terms are popular because of how their application to everyday problems eases our lives. Today I want to bring you a term that may be known to some of you and new to others: Natural Language Processing or NLP.

What is Natural Language Processing?

In short, NLP is an area of computer science that seeks to analyse, model and understand human language. Hmm, an easy task, isn't it? Have you ever thought about how to model human language? Or how to take the way we interpret language in our brains, turn it into explicit rules and write those rules as code that can be read by a machine? Years ago this would have seemed like science fiction, but nowadays NLP surrounds us every day.

A common phrase related to NLP is "Hey Siri"; of course, you may own an Android instead, but you also communicate with your Google assistant by speaking and giving instructions which are interpreted on your device. Even if you don't ask serious questions to your cellphone and just chit-chat for fun, the question remains: how do these digital assistants work? How does NLP work? The first step to understanding this is to understand how language is structured.

Back to school

Language is like a huge set of lego pieces which can be combined to create awesome structures; these lego pieces are composed of tinier pieces. Also, there are rules to combine them, and these rules scale up from the tiniest pieces to the medium, big and huge structures you build. The smallest kind of pieces are the characters (A-Z); in speech, these correspond to phonemes. As you know, phonemes alone are meaningless, so you start to combine them to build bigger blocks.

The next kind of pieces are the morphemes: the minimum combinations of phonemes that carry meaning. You may identify morphemes as words but, even though all words are morphemes, not all morphemes are words; that's the case with prefixes and suffixes.

The composition of morphemes to build a new one. Prefixes and suffixes are not words by themselves, but they are morphemes.

Lexemes are variations of morphemes that share a common meaning and, in general, a common root; for example, "find", "finds", "found" and "finding" share the common root "FIND". This concept is particularly important in text analysis, since a text may have many lexemes that, in the end, refer to a common meaning and a common context. Being able to trace lexemes back to their root or lemma is called lemmatization, and it eases the analysis by keeping only the meaningful unit of each word.
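As a small illustrative sketch (not part of the original analysis), NLTK's WordNet lemmatizer can trace those variations back to their lemma; it requires downloading the wordnet corpus first:

import nltk
nltk.download("wordnet")
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
for word in ["finds", "found", "finding"]:
    print(lemmatizer.lemmatize(word, pos="v"))  # prints "find" for each variation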

By combining lexemes and morphemes, you assemble phrases and sentences. But there are rules for combining words; you just don't put them in random order. A common well-formed sentence may have a noun, a verb and prepositions binding them, as in: "Javier plays the guitar at night". The set of rules governing word order is called syntax.

Above the phrases and sentences is where the beauty lies: with those blocks people create magnificent buildings, the books, poems and songs you love. At this level, a context starts to exist, and the language structure exhibits a deeper meaning. We want machines to process that context and understand that meaning.

What are the NLP people doing?

The most popular area of research is text classification; the intention behind it is to assign one label (or more) to a text. A common use of text classification is spam detection; companies like Gmail or Outlook use it. Another great use is customer service support: having to check thousands of complaints from customers is not practical, as most of these comments are not clear about the complaint, and text classification helps filter the information that leads to action. The process of applying text classification follows a common machine learning training workflow: you start from a dataset of texts, assign labels to each text sample, divide the dataset into training and test sets, train the model (having previously chosen a method that fits the problem) and then use your model to classify unseen data.

Common process to obtain and use a model for text classification.
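To give a flavour of that process, here is a tiny sketch with scikit-learn (toy data made up for the example, not from the post): TF-IDF features feeding a naïve Bayes classifier.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy labeled dataset: spam vs. legitimate messages
texts = ["win a free prize now", "meeting rescheduled to monday",
         "free money click here", "project update attached"]
labels = ["spam", "ham", "spam", "ham"]

model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(texts, labels)
print(model.predict(["claim your free prize"]))  # likely ['spam']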

I started the definition of NLP by giving the example of Siri and the Google assistant; you must be aware of others like Cortana or Alexa. These are the focus of the conversational agents area. This area (and in general all NLP areas) intersects in common use cases with information extraction, where the objective is to identify the entities involved in a text. For example, in a phrase like "The president went to the Congress in Bogotá, to defend himself against corruption charges", in order to understand the meaning, an algorithm needs to extract the words "president", "Congress", "Bogotá" and "corruption"; these words are known as entities, and you can identify them easily as they take the form of nouns in sentences. From the text that lies between entities, an algorithm can infer relationships: "The president went to the Congress"; entities and the relationships binding them form a context. A conversational agent could use this context to answer user queries. This is close to another area, information retrieval, which deals with how a machine can understand human questions and retrieve the information that answers them, and is (of course) used in search engines. Conversational agents make use of information extraction and retrieval to chat with the user.

With the widespread use of these NLP areas, more and more applications are born: calendar event detection, plagiarism detection, speech recognition, spell checking, sentiment analysis, translation between languages; the list keeps growing as research continues.

How does NLP work?

There are three common approaches to NLP; the first one is heuristics. With heuristics, the rules to understand the language are handcrafted; the approach works for MVP applications, but it is sensitive to concept drift and the accuracy suffers as the application scales. To work with heuristics a domain expert is required, a drawback if you think of it as an added dependency; one way to tackle this is to use knowledge bases, dictionaries and thesauri from the web, resources that are maintained by the community and free (in some cases). A common tool for analysis with this approach is regular expressions or regex; think about the extraction of user names in social network posts with a regex like:

"@.*"

A second popular approach to NLP is the use of machine learning: given datasets of text, a model is trained to work on the desired task. Some of the most common techniques are naïve Bayes, support vector machines and conditional random fields. Conditional random fields or CRFs have gained popularity by outperforming Hidden Markov Models, giving relevance to the order of the words in the text and the context they form. CRFs have been used successfully in entity extraction tasks.

Finally, deep learning with neural networks is the third approach, leveraging the ability of neural networks to work with unstructured data.

Where can I start?

Personally, I have worked with the heuristic and machine learning approaches, using Python as the programming language. Its versatility across object-oriented, functional and structured paradigms makes it a great option, and it also counts on a full ecosystem of packages for data science. Some of the tools you may use are:

  • Pandas: This will be your best friend when working with data. You will probably handle csv or json files; with pandas you can load these files and work with them in a matrix-like structure. It allows you to run queries over the data, transformations, dimensionality reduction, mapping, filtering, among others. https://pandas.pydata.org/
  • Numpy: The tool for working with algebra and n-dimensional data structures. As you advance, you will see that words are converted into vectors and arrays, and this package eases the work with such structures. https://numpy.org/
  • Sklearn: This is the package for machine learning. It gives you a great set of classification algorithms, clustering methods for unsupervised learning, data preprocessing, random splitting of training and testing sets, evaluation functions and much more; this is the package you should master for ML tasks. https://scikit-learn.org/stable/
  • NLTK: Last but not least, the go-to package for NLP. It gives you classification, tokenization, stemming, parsing, stopword lists and many other tools to work in the areas we talked about. https://www.nltk.org/

To read more about this topic, you could go to:

Happy research!