Image classification is one of the main applications of convolutional neural networks. If you are interested in the topic, like me, you may have consulted many theoretical resources and tutorials, and this can be cumbersome because there's a lot of information and many concepts to digest. As a curious developer, you may want to get your hands into code as soon as possible; if that's your case, you're in the right place.
In the following post, I'm going to train a convolutional neural network to classify brain tumor images, with a short brief of some key concepts along the way.
TL;DR If you’re eager to check the code this is the notebook.
Prerequisites
Before starting, be sure you have trained machine learning models before and that you are proficient in Python. We are going to use the following tools (the ones that appear throughout the post):
- TensorFlow and Keras
- The Kaggle API (to download the dataset)
- Matplotlib and Pillow (to inspect images)
Dataset preprocessing
You may already know that one of the most important steps in any classification task, no matter the technique (classical statistical approaches, machine learning, neural networks), is to gather enough quality data. Luckily, there are a lot of open source datasets, and Kaggle hosts one with classified brain tumor images.
The dataset has four classes distributed as:
- Glioma tumor: 901 samples.
- Pituitary tumor: 844 samples.
- Meningioma tumor: 913 samples.
- Normal (no tumor): 438 samples.
You can download the dataset into your work environment using the following Kaggle API command, but first you need to obtain a Kaggle key.
!kaggle datasets download -d thomasdubail/brain-tumors-256x256
In a previous post, I stated the necessity of having three separate datasets: training, validation, and testing, so the first step will be to build them. I divided the whole dataset with these percentages:
- Training: 66%
- Validation: 14%
- Testing: 20%
I separated the images into folders following the previous distribution. Then, you have to transform the images into a representation TensorFlow can understand: tensors. We can easily do this with some Keras magic. The function image_dataset_from_directory reads images from a directory and returns a dataset; the name of each folder is interpreted as the class of the samples it contains. So, after moving the images to the proper folders, my directory structure looks like this:
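If you want to automate the split, here's a minimal sketch. The source folder names come from the dataset's classes; `new_base_dir` and the exact paths are illustrative, so adapt them to your environment:

```python
import random
import shutil
from pathlib import Path

def split_class(src_dir, dest_root, class_name,
                fractions=(0.66, 0.14, 0.20), seed=42):
    """Copy one class's images into train/validation/test subfolders."""
    files = sorted(Path(src_dir).glob("*"))
    random.Random(seed).shuffle(files)
    n_train = int(len(files) * fractions[0])
    n_val = int(len(files) * fractions[1])
    splits = {
        "train": files[:n_train],
        "validation": files[n_train:n_train + n_val],
        "test": files[n_train + n_val:],
    }
    for split_name, split_files in splits.items():
        target = Path(dest_root) / split_name / class_name
        target.mkdir(parents=True, exist_ok=True)
        for f in split_files:
            shutil.copy(f, target / f.name)
    return {k: len(v) for k, v in splits.items()}

# e.g. split_class("data/glioma_tumor", "new_base_dir", "glioma_tumor")
```

Running it once per class (glioma_tumor, pituitary_tumor, meningioma_tumor, normal) produces the 66/14/20 layout described above.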
With the following code I pass the root folder of each dataset to the function. In this case, train_dataset will be composed of 66% of the images, and each sample will be labeled with the class name of the directory that contains it (meningioma_tumor, pituitary_tumor, glioma_tumor, normal).
# Use image_dataset_from_directory to create the 3 datasets
from tensorflow.keras.utils import image_dataset_from_directory

train_dataset = image_dataset_from_directory(
    new_base_dir / "train",
    image_size=(180, 180),
    batch_size=32)
validation_dataset = image_dataset_from_directory(
    new_base_dir / "validation",
    image_size=(180, 180),
    batch_size=32)
test_dataset = image_dataset_from_directory(
    new_base_dir / "test",
    image_size=(180, 180),
    batch_size=32)
Building a baseline model
A basic step when training a model is to first build a baseline: something not too fancy but capable of completing the job, that is, a model that learns how to classify the samples, no matter if it overfits. This first step lets you know whether the problem is solvable.
# Create an initial model
from tensorflow import keras
from tensorflow.keras import layers

inputs = keras.Input(shape=(180, 180, 3))
x = layers.Rescaling(1./255)(inputs)
x = layers.Conv2D(filters=32, kernel_size=3, activation="relu")(x)
x = layers.MaxPooling2D(pool_size=2)(x)
x = layers.Conv2D(filters=64, kernel_size=3, activation="relu")(x)
x = layers.MaxPooling2D(pool_size=2)(x)
x = layers.Conv2D(filters=128, kernel_size=3, activation="relu")(x)
x = layers.MaxPooling2D(pool_size=2)(x)
x = layers.Conv2D(filters=256, kernel_size=3, activation="relu")(x)
x = layers.MaxPooling2D(pool_size=2)(x)
x = layers.Conv2D(filters=256, kernel_size=3, activation="relu")(x)
x = layers.Flatten()(x)
outputs = layers.Dense(4, activation="softmax")(x)
base_model = keras.Model(inputs=inputs, outputs=outputs)
# Compile before training; the labels are integers, hence the sparse loss
base_model.compile(loss="sparse_categorical_crossentropy",
                   optimizer="rmsprop",
                   metrics=["accuracy"])
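As a quick sanity check, you can trace the spatial size of the feature maps by hand: a kernel-3 convolution without padding shrinks each side by 2, and each pooling halves it (rounding down).

```python
# Trace the feature-map side length through the stack above
size = 180
for _ in range(4):           # four Conv2D + MaxPooling2D pairs
    size = size - 2          # Conv2D, kernel_size=3, "valid" padding
    size = size // 2         # MaxPooling2D, pool_size=2
size = size - 2              # the final Conv2D
print(size)  # 7, so Flatten() yields 7 * 7 * 256 = 12544 features
```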
A convolutional neural network (informally described) is a stack of convolutional and max pooling layers. The convolutional layers learn patterns over the images in an incremental way: the first layers learn simple patterns like edges or colors, and deeper layers learn higher-level patterns such as tires, ears, or eyes. The max pooling layers downsample the feature maps, keeping the strongest activations, so the network focuses on the most salient patterns and discards noise. As a rule of thumb, in a multi-class problem, the activation function of the last layer must be softmax. Now let's train the baseline model.
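Softmax itself is easy to sketch in plain Python: it turns the four raw scores of the last layer into probabilities that sum to 1 (a toy version, not the TensorFlow implementation):

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability, then normalize
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1, -1.0])  # raw scores for the 4 classes
print(sum(probs))  # 1.0 (up to floating point)
```

The class with the highest score gets the highest probability, which is why we read the prediction with an argmax over the output.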
# Configure the callbacks to:
# - Stop training when validation accuracy stops improving
# - Save the best model by validation loss
callbacks = [
    keras.callbacks.EarlyStopping(
        monitor="val_accuracy",
        patience=10,
    ),
    keras.callbacks.ModelCheckpoint(
        filepath="base_model",
        save_best_only=True,
        monitor="val_loss")
]
history = base_model.fit(
    train_dataset,
    epochs=100,
    validation_data=validation_dataset,
    callbacks=callbacks)
I'm using keras.callbacks.ModelCheckpoint to save the best model by monitoring the validation loss, which measures how far the predictions are from the correct values on the validation dataset. I'm also using the keras.callbacks.EarlyStopping callback to stop the training once the model has gone 10 epochs (check the patience parameter) without improving the validation accuracy, that is, once the model has stopped learning.
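The patience logic behind EarlyStopping can be sketched in plain Python (a simplification of what the callback does; the real one also supports min_delta and weight restoration):

```python
def stopping_epoch(accuracy_history, patience):
    """Return the epoch where training would stop, or None if it never does."""
    best = float("-inf")
    wait = 0
    for epoch, value in enumerate(accuracy_history):
        if value > best:
            best, wait = value, 0   # improvement: reset the counter
        else:
            wait += 1               # no improvement this epoch
            if wait >= patience:
                return epoch
    return None

print(stopping_epoch([0.50, 0.62, 0.61, 0.62, 0.61], patience=2))  # 3
```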
It's a good idea to plot the accuracy and the loss on the training and validation datasets, since they give us information about the model's overfitting behavior and the capacity of the CNN to solve the problem. Let's check the graphics:
We can see the training accuracy went up to almost 100%, which means our model is capable of solving the problem. On the other hand, the validation accuracy didn't surpass 70%; also, the training and validation loss curves gradually separate: while the training loss decreases, the validation loss increases. This is the classic symptom of overfitting. Our model will not generalize, and in a real scenario it will not be accurate when classifying new tumor images. Let's check the accuracy of our model on the test dataset.
We achieved 75%. But hey, cheer up, this is not bad news: we completed the first task. Now that we know the problem is solvable, we need to improve our model; we need to beat the baseline.
Beating the baseline model
A great approach to improving a model is to obtain more data; take this approach whenever it's feasible. Nevertheless, this is a limitation in some cases, and in this problem we are not able to easily get more brain tumor images. Fortunately, we have some other tools to improve our model: data augmentation and transfer learning.
Adding data augmentation
We can't get new brain tumor images, but we can synthetically create new ones from the existing samples. That's what data augmentation means: taking the existing samples and modifying them before feeding them to our model. Again we'll use the tools provided by Keras: preprocessing layers that make slight changes to the images through transformations such as zooming, rotation, contrast changes, or cropping, among others. These "modified" images will be new samples for our model. You can check all the data augmentation possibilities here.
The following code shows a sample of images preprocessing using the Keras layers. First let’s download an image.
import tensorflow as tf
from tensorflow.keras import layers
import urllib.request
import PIL.Image

img_url = 'https://upload.wikimedia.org/wikipedia/commons/1/11/Iron_Maiden_in_Bercy_4.jpg'
urllib.request.urlretrieve(img_url, "sample.png")
img = PIL.Image.open("sample.png")
Now, to process the image, we need to convert it into a tensor. Let's do this step first and display the result:
import matplotlib.pyplot as plt

# Convert the PIL image into a tensor
input_arr = tf.keras.utils.img_to_array(img)
img_tensor = tf.convert_to_tensor(input_arr)
_ = plt.imshow(img_tensor.numpy().astype('uint8'))
Now we are going to create a stack of layers to preprocess the image. This can easily be done using the Sequential API; then we pass the image through the layers and display the result.
from tensorflow import keras

data_augmentation = keras.Sequential(
    [
        layers.RandomFlip("horizontal"),
        layers.RandomRotation(0.4),
        layers.RandomZoom(0.2),
    ]
)
result = data_augmentation(img_tensor)
_ = plt.imshow(result.numpy().astype('uint8'))
Transfer Learning
Another great technique to improve your model is transfer learning. A cool feature of deep learning is the ability to generalize and to learn high-level patterns from larger datasets that can be reused in another problem (with a different dataset!). That is, in an image classification problem, a model trained on a big enough dataset may have learned visual patterns that are generic and therefore portable. Fortunately, Keras has a set of models pretrained on huge image datasets; you can check the catalog here.
I used the VGG16 pretrained model from Keras. This model has learned patterns about images, but it's not trained to classify tumor images; that's why we reuse only a subset of its layers, specifically the ones before the classification layers, called the convolutional base (a.k.a. conv base). We then need to train the final layers, that is, the classifier. Let's see the code:
# Re-import the conv base for experimenting with different frozen layers
conv_base = keras.applications.vgg16.VGG16(
    weights="imagenet",
    include_top=False)
conv_base.trainable = False
include_top=False tells Keras to load the model without the final classification layers (only the conv base); conv_base.trainable = False tells Keras not to update the weights of the conv base. If we skip this step, we'll end up losing the patterns learned by VGG16, because it would be retrained on the new set of tumor images.
Putting it all together
Now we can configure a new model to beat the baseline using what we have explored until now: a pretrained conv base with visual patterns learned over a larger dataset, a data augmentation approach to extend the tumor images dataset, and a final classifier layer:
from tensorflow.keras import layers

data_augmentation = keras.Sequential(
    [
        layers.RandomFlip("horizontal"),
        layers.RandomRotation(0.1),
        layers.RandomZoom(0.2),
        layers.RandomContrast(0.5)
    ]
)

inputs = keras.Input(shape=(180, 180, 3))
x = data_augmentation(inputs)
x = keras.applications.vgg16.preprocess_input(x)
x = conv_base(x)
x = layers.Flatten()(x)
x = layers.Dense(256)(x)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(4, activation="softmax")(x)
model_with_conv_base = keras.Model(inputs, outputs)
model_with_conv_base.compile(loss="sparse_categorical_crossentropy",
                             optimizer="rmsprop",
                             metrics=["accuracy"])
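To get a feel for what we're actually training, we can count the parameters of the new head by hand (assuming VGG16's five pooling stages reduce the 180x180 input to a 5x5x512 feature map):

```python
# Parameters of the classifier head added on top of the frozen conv base
flatten_units = 5 * 5 * 512                 # 12800 features out of Flatten
dense_params = flatten_units * 256 + 256    # Dense(256): weights + biases
output_params = 256 * 4 + 4                 # Dense(4): weights + biases
print(dense_params + output_params)  # 3278084 trainable parameters
```

Only these ~3.3M parameters get updated during training; the ~14.7M weights of the frozen conv base stay untouched.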
The preprocess_input call is necessary to format the inputs in the way expected by VGG16. There's also a Dropout layer, which is a regularization technique: it randomly inhibits the output of some neurons to prevent "conspiracies" inside the network. By "dropping out" the output of some neurons, the next layers must adapt their weights to handle the representation by themselves (without the help of the dropped neurons). This regularization technique is used to reduce overfitting. Let's check the result of the full approach.
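The training-time behavior of dropout can be sketched as a toy in plain Python (frameworks use "inverted" dropout, scaling the surviving activations so the expected sum is unchanged):

```python
import random

def dropout(values, rate, rng):
    # Zero each unit with probability `rate`; scale survivors by 1/(1-rate)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]

rng = random.Random(0)
activations = [1.0] * 1000
dropped = dropout(activations, rate=0.5, rng=rng)
print(sum(1 for v in dropped if v == 0.0))  # roughly half are zeroed
print(sum(dropped))                         # close to the original sum, 1000
```

At inference time the layer does nothing; the scaling during training is what keeps the two regimes consistent.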
With this new model, the validation accuracy surpassed 70% and the loss curves decreased together. This means we're controlling the overfitting. Finally, let's see the accuracy on the test dataset:
We improved on the base model, going from 75% to 83% accuracy on the test dataset.
Further steps for your experimentation
Finally, I encourage you to improve the model by exploring other approaches, such as:
- Create your own network topology combining Convolutional and MaxPooling layers, change the number of filters in the convolutional layers and track the results.
- Add and test other combinations of augmentation layers for the data augmentation approach.
- Experiment with the transfer learning using other pretrained models from Keras.
Another technique you may try is fine-tuning. In the transfer learning section I stated you should "freeze" the conv base layers to avoid updating the weights and losing the previous training effort, but it's not necessary to freeze all of them. Fine-tuning proposes unfreezing some of the final layers of the conv base, so it learns new patterns without losing the generalization gained beforehand. Here is the documentation for the approach. I recommend reading the 8th chapter of Deep Learning with Python by François Chollet, where I learned all these approaches. Happy coding!