How to estimate text similarity with Python

Available in Spanish

Did Melania Trump plagiarise Michelle Obama’s speech?

In 2016, during the Republican National Convention, Melania Trump gave a speech to support Donald Trump's campaign. As soon as the convention concluded, Twitter users noted similarities between some lines pronounced by Mrs Trump and a speech Michelle Obama had given eight years earlier at the Democratic National Convention. Of course, Melania and her husband were criticised, and the campaign team defended them, arguing the speech was written from notes and real-life experiences.


How did Twitter users notice the similarities? On one hand, some lines were exactly the same in both speeches; on the other hand, as this article from USA Today put it:

It’s not entirely a verbatim match, but the two sections bear considerable similarity in wording, construction and themes.

If you were to automate the process of detecting those similarities, what approach would you take? A first technique would be to compare both texts word by word, but this does not scale well; consider the complexity of comparing every possible sequence of consecutive words from one text against the other. Fortunately, NLP gives us a clever solution.

What are we going to do?

There is a core NLP task called text similarity, which solves the problem we stated: how do you compare texts without resorting to a naïve and inefficient approach? To do so, you need to transform the texts into a common representation and then define a metric to compare them.

In the following sections you will see: the mathematical concepts behind the approach, the code example explained in detail so you can repeat the process by yourself, and the answer to the original question: did Melania plagiarise or not?

Text Similarity Concepts

TF-IDF

Straight to the point: the text is transformed into a vector, and its words are then called features. Each position in the vector represents a feature, and the value at that position depends on the method you use. One way to do it is to count how many times the word appears in the text, divide it by the total count of terms in the document and assign this value to the vector for that feature; this is called Term Frequency or TF.

tf(t, d) = n_{t,d} / Σ_{t′ ∈ d} n_{t′,d}
Term frequency, where t is a term and n_{t,d} is the number of times the term appears in document d. The denominator is the count of all the terms in the document.

Term frequency alone may give relevance to common words in the document that are not necessarily important; they may be stopwords. Stopwords are words that do not add meaning to a text, like articles, pronouns or modal verbs: I, you, the, that, would, could … and so on.

To know how important a word is in a particular document, Inverse Document Frequency or IDF is used. IDF measures how distinctive a term is by counting how many documents in the corpus contain it.

idf(t) = log(N / df_t)
Inverse document frequency.

In IDF, N represents the number of documents in the corpus, whilst df_t represents the number of documents containing a term t. If all the documents in the corpus contain the term t, then N/df_t is equal to 1, and log(1) = 0, which means the term is not representative because, emphasising again, it appears in all documents.

Term frequency–inverse document frequency or TF-IDF combines the two previous metrics: if a word is present in a document but also appears in all the other documents of the corpus, it is not a representative word and TF-IDF gives it a low weight. Conversely, if a word appears many times in a document and only in that document, TF-IDF gives it a high weight.

tf-idf(t, d) = tf(t, d) × idf(t)
Term frequency–inverse document frequency.

The TF-IDF values are calculated for each feature (word) and assigned to the vector.
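
To make the formulas concrete, here is a minimal sketch (not part of the original post) that computes TF, IDF and TF-IDF by hand for a toy two-document corpus. Note that scikit-learn's TfidfVectorizer, used below, applies a smoothed IDF and L2-normalises the vectors, so its values will differ from this naïve calculation.

from collections import Counter
from math import log

# Toy corpus: each document is a list of tokens.
corpus = [
    "this is sparta".split(),
    "this is new york".split(),
]

def tf(term, document):
    # Times the term appears in the document, divided by the document length.
    return Counter(document)[term] / len(document)

def idf(term, corpus):
    # log of (number of documents / number of documents containing the term).
    document_frequency = sum(1 for document in corpus if term in document)
    return log(len(corpus) / document_frequency)

def tf_idf(term, document, corpus):
    return tf(term, document) * idf(term, corpus)

print(tf_idf("sparta", corpus[0], corpus))  # ~0.23: frequent here, absent elsewhere
print(tf_idf("this", corpus[0], corpus))    # 0.0: appears in every document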

Cosine Similarity

Having the texts in the vector representation, it’s time to compare them, so how do you compare vectors?

It's easy to model text as vectors in Python; let's see an example:

from sklearn.feature_extraction.text import TfidfVectorizer

phrase_one = 'This is Sparta'
phrase_two = 'This is New York'
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform([phrase_one,phrase_two])

vectorizer.get_feature_names()
['is', 'new', 'sparta', 'this', 'york']
X.toarray()
array([[0.50154891, 0.        , 0.70490949, 0.50154891, 0.        ],
       [0.40993715, 0.57615236, 0.        , 0.40993715, 0.57615236]])

This code snippet shows two texts, “This is Sparta” and “This is New York“. Our vocabulary has five words: “This“, “is“, “Sparta“, “New” and “York“.

The vectorizer.get_feature_names() line shows the vocabulary. X.toarray() shows both texts as vectors, with the TF-IDF value for each feature. Note how, for the first vector, the second and fifth positions have a value of zero; those positions correspond to the words "new" and "york", which are not in the first text. In the same way, the third position of the second vector is zero; that position corresponds to "sparta", which is not present in the second text. But how do you compare the two vectors?

By using the dot product it's possible to find the angle between vectors; this is the concept of cosine similarity. Having the texts as vectors and calculating the angle between them, it's possible to measure how close those vectors are and, hence, how similar the texts are. An angle of zero means the texts are as similar as they can be. As you remember from your high school classes, the cosine of zero is 1.

The cosine of the angle between two vectors gives a similarity measure.
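
As a quick sanity check, here is a minimal sketch that reproduces the comparison with a plain dot product, reusing the two TF-IDF vectors printed earlier. Since sklearn returns them already L2-normalised, dividing by the norms is redundant here, but it is kept to show the general formula.

import numpy as np

# The two TF-IDF vectors produced above for 'This is Sparta' and 'This is New York'.
vector_one = np.array([0.50154891, 0.0, 0.70490949, 0.50154891, 0.0])
vector_two = np.array([0.40993715, 0.57615236, 0.0, 0.40993715, 0.57615236])

# Cosine similarity: dot product divided by the product of the norms.
cosine = np.dot(vector_one, vector_two) / (np.linalg.norm(vector_one) * np.linalg.norm(vector_two))
print(cosine)  # ~0.41: the phrases share 'this' and 'is' but nothing else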

Finding the similarity between texts with Python

First, we load the NLTK and sklearn packages and define a list of punctuation symbols that will be removed from the text, along with a list of English stopwords.

from string import punctuation
from nltk.corpus import stopwords
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer

# NLTK's stopword list needs a one-time download: nltk.download('stopwords')
language_stopwords = stopwords.words('english')
non_words = list(punctuation)

Let's define three functions: one to remove the stopwords from the text, one to remove punctuation, and a last one which receives a filename as a parameter, reads the file, converts the whole string to lowercase and calls the other two functions to return a preprocessed string.

def remove_stop_words(dirty_text):
    cleaned_text = ''
    for word in dirty_text.split():
        if word in language_stopwords or word in non_words:
            continue
        else:
            cleaned_text += word + ' '
    return cleaned_text

def remove_punctuation(dirty_string):
    for word in non_words:
        dirty_string = dirty_string.replace(word, '')
    return dirty_string

def process_file(file_name):
    with open(file_name, "r") as text_file:
        file_content = text_file.read()
    # All to lower case
    file_content = file_content.lower()
    # Remove punctuation and English stopwords
    file_content = remove_punctuation(file_content)
    file_content = remove_stop_words(file_content)
    return file_content

Now, let's call the process_file function to load the files with the texts you want to compare. For my example, I'm using the content of three of my previous blog entries.

nlp_article = process_file("nlp.txt")
sentiment_analysis_article = process_file("sentiment_analysis.txt")
java_certification_article = process_file("java_cert.txt")

Once you have the preprocessed text, it's time to do the data science magic: we will use TF-IDF to convert each text into a vector representation, and cosine similarity to compare those vectors.

#TF-IDF
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform([nlp_article,sentiment_analysis_article,java_certification_article])
similarity_matrix = cosine_similarity(X,X)

The output of the similarity matrix is:

[[1.         0.217227   0.05744137]
 [0.217227   1.         0.04773379]
 [0.05744137 0.04773379 1.        ]]

First, note the diagonal of '1's: this is the similarity of each document with itself. The value 0.217227 is the similarity between the NLP and the Sentiment Analysis posts. The value 0.05744137 is the similarity between the NLP and Java certification posts. Finally, the value 0.04773379 represents the similarity between the Sentiment Analysis and the Java certification posts. As the NLP and the sentiment analysis posts have related topics, their similarity is greater than the one they hold with the Java certification post.

Similarity between Melania Trump and Michelle Obama speeches

With the same tools, you can calculate the similarity between both speeches. I took the texts from this article and ran the same script. This is the similarity matrix output:

[[1.         0.29814417]
 [0.29814417 1.        ]]

If you skipped the technical explanation and jumped directly here to know the result, let me give you a summary: using an NLP technique, I estimated the similarity of two blog posts with common topics written by me. Then, using the same method, I estimated the similarity between Melania's and Michelle's speeches.

Now, let's make some analysis here. Calculating the similarity between two blog posts written by the same author (me), about related topics (NLP and Sentiment Analysis), gave 0.217227. The similarity between Melania's and Michelle's speeches was 0.29814417. In conclusion, two speeches from two different people belonging to opposite political parties are more similar than two blog posts on related topics from the same author. I leave the final conclusion to you. The full code and the text files are on my Github repo.

Happy analysis!

How to estimate document similarity with Python

English version

Did Melania Trump plagiarise Michelle Obama's speech?

In 2016, during the Republican Party convention, Melania Trump gave a speech supporting the campaign of her husband, Donald; as soon as the convention ended, Twitter users noticed similarities between some lines of Mrs Trump's speech and another delivered by Michelle Obama eight years earlier at the Democratic Party convention. Criticism of the Trumps was immediate, and the campaign team mounted its defence with the argument that the speech was a reflection of Melania's real-life experiences.


How did Twitter users notice the similarities? On one hand, some lines were exactly the same; on the other, as an article from USA Today put it:

It’s not entirely a verbatim match, but the two sections bear considerable similarity in wording, construction and themes.

If you wanted to automate the process of detecting similarities, what approach would you take? One possible solution would be to compare both texts word by word, but that would not scale efficiently; consider the complexity of comparing every possible phrase of consecutive words from one document against the other. Fortunately, NLP gives us an elegant solution.

How do we achieve it?

There is an NLP technique called text similarity that takes care of the problem described: how do you compare texts efficiently? To achieve this, the documents need to be transformed into a common representation, and a metric must be defined to compare them.

In the following sections you will find: the mathematical concepts behind the process, the code explained in detail so you can reuse it, and finally the answer to the initial question: did Melania commit plagiarism or not?

Text similarity concepts

TF-IDF

The text is converted into a vector. The words are called features. Each position in the vector represents a feature, and the value at that position depends on the method used. One way to calculate it is to count how many times the word appears in the text, divide that by the total number of terms in the document and assign the result to the feature's position in the vector; this is called Term Frequency or TF.

tf(t, d) = n_{t,d} / Σ_{t′ ∈ d} n_{t′,d}
Term frequency, where t is a term and n_{t,d} is the number of times the term appears in document d. The denominator is the total number of terms in the document.

Term frequency on its own gives relevance to words that are common in the document but not necessarily important; they may be stopwords. Stopwords are words that add no meaning to a text, such as articles, pronouns or prepositions (in Spanish: yo, tú, el, la, a, ante, de, desde, and so on).

To know whether a word is important in a particular document, Inverse Document Frequency or IDF is used. IDF finds the importance of a term by counting how many documents in the corpus contain it.

idf(t) = log(N / df_t)
Inverse document frequency.

In IDF, N represents the number of documents in the corpus, while df_t represents the number of documents containing the term t. If every document in the corpus contains the term t, then N/df_t equals 1, and log(1) = 0, which means the term is not representative because, to emphasise it once more, it appears in all documents.

Term frequency–inverse document frequency or TF-IDF combines the two previous metrics: if a word appears in a document but also appears in the rest of the documents of the corpus, it is not representative and TF-IDF gives it a low value. Conversely, if the word appears many times in one document and only in that document, TF-IDF gives it a high value.

tf-idf(t, d) = tf(t, d) × idf(t)
Term frequency–inverse document frequency.

The TF-IDF values are calculated for each word and assigned to the corresponding positions of the vector.

Cosine similarity

Once the texts are in a vector representation, how can they be compared?

First, let's look at an example of converting text into vectors in Python:

from sklearn.feature_extraction.text import TfidfVectorizer

phrase_one = 'This is Sparta'
phrase_two = 'This is New York'
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform([phrase_one,phrase_two])

vectorizer.get_feature_names()
['is', 'new', 'sparta', 'this', 'york']
X.toarray()
array([[0.50154891, 0.        , 0.70490949, 0.50154891, 0.        ],
       [0.40993715, 0.57615236, 0.        , 0.40993715, 0.57615236]])

This code shows two texts, "This is Sparta" and "This is New York". The vocabulary consists of five words: "This", "is", "Sparta", "New" and "York".

The line vectorizer.get_feature_names() shows the vocabulary. The line X.toarray() shows both texts as vectors, with the value assigned by TF-IDF to each word. Note how, for the first vector, the second and fifth positions have a value of zero; those positions correspond to the words "new" and "york", which do not appear in the first text. In the same way, the third position of the second vector is also zero; that position corresponds to the word "sparta", which does not appear in the second text.

And how do we compare the vectors? Using the dot product it is possible to find the angle between two vectors; this is the concept known as cosine similarity. Given two vectors and the angle between them, it is possible to measure how close they are and, therefore, how similar the two texts are. An angle of zero means the texts are as similar as they can be. If you remember your high school classes, the cosine of zero is 1.

The cosine of the angle between two vectors gives the similarity measure.

Finding the similarity between texts with Python

First, the NLTK and sklearn packages are loaded, and a list of punctuation symbols and the stopwords to be removed from the text are defined.

from string import punctuation
from nltk.corpus import stopwords
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer

language_stopwords = stopwords.words('english')
non_words = list(punctuation)

Three functions are defined: one to remove the stopwords, another to remove the punctuation symbols, and a third that receives the path of a text file as a parameter, reads the file, converts the text to lowercase and calls the other two functions to return the preprocessed text.

def remove_stop_words(dirty_text):
    cleaned_text = ''
    for word in dirty_text.split():
        if word in language_stopwords or word in non_words:
            continue
        else:
            cleaned_text += word + ' '
    return cleaned_text

def remove_punctuation(dirty_string):
    for word in non_words:
        dirty_string = dirty_string.replace(word, '')
    return dirty_string

def process_file(file_name):
    with open(file_name, "r") as text_file:
        file_content = text_file.read()
    # All to lower case
    file_content = file_content.lower()
    # Remove punctuation and English stopwords
    file_content = remove_punctuation(file_content)
    file_content = remove_stop_words(file_content)
    return file_content

The process_file function is called to load the files to analyse; in this example, I'm using the content of three previous posts from my blog.

nlp_article = process_file("nlp.txt")
sentiment_analysis_article = process_file("sentiment_analysis.txt")
java_certification_article = process_file("java_cert.txt")

Once the text has been preprocessed, TF-IDF is applied and then cosine similarity.

#TF-IDF
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform([nlp_article,sentiment_analysis_article,java_certification_article])
similarity_matrix = cosine_similarity(X,X)

The output is the following similarity matrix:

[[1.         0.217227   0.05744137]
 [0.217227   1.         0.04773379]
 [0.05744137 0.04773379 1.        ]]

Note the diagonal of ones in the matrix; each one represents the similarity of a document with itself. The value 0.217227 is the similarity between the NLP and sentiment analysis posts. The value 0.05744137 is the similarity between the NLP post and the Java certification post. Finally, the value 0.04773379 represents the similarity between the sentiment analysis post and the Java certification post. Since the NLP and sentiment analysis posts cover related topics, their similarity measure is higher than the one each of them has with the Java certification post.

Similarity between Melania Trump's speech and Michelle Obama's

The texts of the speeches can be found in this article; using the same process, the following similarity matrix is obtained:

[[1.         0.29814417]
 [0.29814417 1.        ]]

If you skipped the technical explanation and came straight to this point looking for the result, here is a summary: using an NLP technique, I estimated the similarity of two posts from this blog that share common topics and were written by me. Then, using the same method, I estimated the similarity between Melania's and Michelle's speeches.

Now, let's apply some analysis. When calculating the similarity between two posts written by the same author (me) on related topics (NLP and sentiment analysis), the result was 0.217227. The similarity between Melania's and Michelle's speeches was 0.29814417. This means that two speeches written by different people belonging to opposite political movements are more similar than my two posts. I leave the final conclusion to you. You can find the full code and the text files in my Github repo.

Good luck with your analysis!

Sentiment Analysis or NLP in practice

From my previous post, you got a short briefing about NLP: what it is and why it's an important research area in the AI community. In this post I want to take you to an NLP branch with growing popularity: sentiment analysis.

What's sentiment analysis? Let's say you are running for president of one of the most powerful countries on Earth; you are going to a debate on national TV against your direct contender, and you desperately need to win it. There will be a lot of sensitive topics, like immigration and health care policies; one false step will make you lose the presidency. Wouldn't it be great to know what people think of your speech as soon as possible? Wouldn't it be nice to gather these insights on what people think and adjust your speech? With sentiment analysis, this is possible.

You will need a stream of text data with the opinions of the electors and a trained model to classify the opinions. The data, of course, is available right now on social networks like Twitter, the place where you and I publish our opinions about … well, about everything. The trained model … that’s the tricky part.

Sentiment analysis in action

No, really, what is sentiment analysis?

The previous example is not entirely fiction, as you may suppose, but analysing a huge stream of data to gather insights and change a debate stance in real time is difficult, as you may face computing restrictions. Nevertheless, the example gives you a clue about what sentiment analysis is.

Also known as opinion mining, sentiment analysis intends to extract information from texts in natural language; precisely, what someone (or something) referred to in the text thinks about someone (or something) also referred to in the same text. What's an opinion? What's a sentiment? The difference is subtle, but let's say an opinion is a concrete view, whereas a sentiment is related to a feeling: "He's worried about the republicans winning the white house" denotes a sentiment; "I think the democrats will ruin the country" is an opinion.

It may seem trivial to detect the opinion (or sentiment) in a text, but let me give you some examples of why it's not so simple. An opinion may be direct, "I hate hip hop lyrics", or indirect, "This new electric car saves me a lot of money"; note the specific naming of the feeling in the first phrase. The opinion may also be explicit, "I love the guitar sound on metal music" (also a direct opinion), or implicit, "With right wing politics I would have freedom to buy guns". It's important to keep an eye on the context, which can make the opinion difficult to understand: in the implicit example, is the opinion holder showing support or rejection? Quite difficult to know, as the person who speaks may be fine with people having guns.

To detect the sentiment in a text, there is a subarea of sentiment analysis with growing popularity: polarity classification. Given a text, polarity classification assigns a label according to the sentiment expressed by an entity towards an aspect of another entity referred to in the text. Three common tags are used: negative, positive and neutral. Positive and negative sentiments are the valuable ones in the analysis; the subjectivity is where the opinion lies, and they form the basis for taking decisions, for example:

“Most electors think it’s a good idea to build a wall to stop illegal immigration”

“As a religious man, I hate the idea of gay adoption”

Texts with neutral polarity are of the kind "I don't know which candidate is the best" and have some value for detecting indecision.
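
To make polarity classification concrete, here is a minimal sketch using NLTK's pretrained VADER analyser; this is a lexicon-based model chosen just for illustration, not the trained classifier mentioned above. It needs a one-time nltk.download('vader_lexicon'), and the thresholds on the compound score follow a commonly used convention rather than a universal rule.

from nltk.sentiment.vader import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()

opinions = [
    "Most electors think it's a good idea to build a wall to stop illegal immigration",
    "As a religious man, I hate the idea of gay adoption",
    "I don't know which candidate is the best",
]

for text in opinions:
    scores = sia.polarity_scores(text)  # {'neg': ..., 'neu': ..., 'pos': ..., 'compound': ...}
    # Common convention: compound >= 0.05 is positive, <= -0.05 negative, otherwise neutral.
    if scores["compound"] >= 0.05:
        label = "positive"
    elif scores["compound"] <= -0.05:
        label = "negative"
    else:
        label = "neutral"
    print(label, round(scores["compound"], 3), text)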

Just polarity classification?

No. The definition of sentiment analysis fits the polarity classification task, and that's why it is the core task, but there are other areas. Have you thought about phrases like "There are no girls at this party, what a great way to enjoy a Saturday night!", or "I just started to read your book; now I'm sure I won't have any problem sleeping at night"? The first one is irony, as it expresses the opposite of what the speaker is trying to say; the second one is sarcasm, as it wraps a criticism aimed at a particular victim. Sentiment analysis handles this kind of text in an area called sarcasm and irony detection.

With the popularity of social networks, it's easier to tailor a campaign to increase the sales of a product; on the other side, the marketplaces selling products allow customer reviews. It's common to create campaigns to spread fake information, and to create fake reviews, with the goal of distorting the audience's opinion. This activity is known as opinion spam, and there's an area called opinion spam detection to handle it.

As you have seen from the previous examples, I have talked about the sentiment in the texts, but it may be valuable to extract the opinion holder: who is having the sentiment? The electors? The politicians? The author of the text? Or is the author talking about a third person? This is an important analysis for the decision maker, and the area for this task is entity opinion holder extraction.

Sentiment Analysis in real life

There are two main areas that come to my mind that take advantage of sentiment analysis: politics and e-commerce. As I pointed out at the beginning, knowing what people think on social networks is highly valuable in politics; remember how, in the past, Big Data has been used in the race for power (Cambridge Analytica and the Obama campaign). Twitter is a special social network in the political arena: day by day the trending topics include economics and social issues, with a huge number of users interacting and leaving their opinions. Reflect on the power of understanding what all these users think; contemplate the opportunity to cluster these users into supporters and detractors, and to tailor campaigns to retain your supporters and convince the undecided.

Invest in social media campaigns addressed to the right target

In e-commerce, you have probably seen the review section for products, where customers rate items from 1 to 5 stars (in most cases) and then write what they think about the item. It's fairly trivial to know whether a customer likes or dislikes a product according to the star rating; what's difficult is to know what exactly they value about the product, or what they hate. In a 5-star review you may find phrases like "I love it", but you may also find "I love how the battery of this cell phone lasts 8 hours"; similarly, in a 1-star review, the text may be "Your product sucks" or "I don't like that the game takes so much time to load". In both scenarios, the specific reviews give you the parts needed for sentiment analysis: an opinion holder, an entity, an aspect of that entity and the sentiment towards that aspect expressed by the opinion holder. Imagine the opportunity to process all the reviews automatically, to detect what people like and what people hate, and to take this knowledge to your product team. You could improve your NPS and hopefully your revenue.

Why go deep

Some years ago the term Big Data was coined, and it's now a common concern in all technology areas. If you think about the amount of data produced on a single social network such as Twitter about a single topic, you will be aware of the need to create decision support systems that manage these huge data waves to produce knowledge. That's why I deliberately left a word out of the sentiment analysis definition: automatic. The final goal of sentiment analysis is to produce automatic tools for detection. If you combine the opportunity to gather data on the Internet with the knowledge to build tools that detect sentiments, you may create solutions that give your company a decisive advantage, foresee the political future of your country, or tell you which comic book to buy and not regret it.

If you want to put your hands on real code for sentiment analysis, check my repo.

NLP, artificial intelligence applied to human language

English version

We live in years of technological progress, years of automation, where human tasks are being carried out by machines or supported by them. With these advances, many terms have gained popularity: Big Data, machine learning, artificial intelligence, deep learning, neural networks, among others. These terms are popular because, beyond being a fad, they have proven their usefulness on real problems. Today I want to bring up a new term: Natural Language Processing, or NLP.

What is NLP?

NLP is an area of computer science that seeks to analyse, model and understand human language, quite a difficult task if we think about it for a moment. Have you considered how our brain interprets language? How we are able to communicate? And if we had that clear, how could we turn that tacit knowledge into explicit rules that could be programmed into a machine? What once seemed pure science fiction now surrounds us every day, and we will see why.

If you have ever said the words "Hola Siri", you have already seen the benefits of NLP first-hand; of course, you may prefer another brand such as Android, Microsoft or Amazon; whether you address the Google assistant, Alexa or Cortana, you are giving instructions that are interpreted by a machine. So how does NLP work? To understand that, we must understand how human language is structured.

Back to school

Language is like an enormous set of pieces that we can combine to create beautiful, gigantic structures. To build these structures, we combine the smallest pieces to create blocks of a slightly larger size. The combination of pieces follows certain rules, and those rules depend on whether we are working with the smallest pieces or with the ones we have already built from them. The smallest pieces are called phonemes, which in practice are the letters of the alphabet (A-Z). Letters on their own have no meaning, so we start combining them.

The next blocks are called morphemes; these are the minimal combinations of phonemes that carry a meaning. We commonly call them words; however, not all morphemes are considered words, as is the case with prefixes (aero, anti, macro, infra) and suffixes (arquía, ito, ita, filia).

Composition of morphemes to build a new one. Prefixes and suffixes are not words, but they are morphemes.

Lexemes are variations of morphemes that share a common meaning and, in general, a common root; for example, "encontrar", "encontrando", "encontrado" and "encontraste" share the root "encontrar" ("to find"). This concept is particularly important in NLP stages, since the text to analyse usually contains different lexemes that converge on the same meaning and build the same context. The process of extracting the common root of the lexemes (the lemma) is called lemmatization; working with the root of the lexemes condenses the text and eases the analysis.

By combining lexemes and morphemes, we obtain phrases and sentences; there are certain rules for these combinations, because we do not write in random order. A common example of a well-formed sentence usually has a noun, a verb and prepositions joining them, as in: "Javier toca la guitarra en las noches" ("Javier plays the guitar at night"). The set of laws governing the order of words is called syntax.

From the combination of phrases and sentences come the great creations we love: books, poems, songs and so on. At this level a context exists and the structure reflects a meaning. This context is what we want machines to process and understand.

What is being done in NLP?

The most popular research area is text classification, whose goal is to assign one or more labels to a text. A common use is spam detection, applied by companies like Gmail or Outlook. Another use case is customer service support: these teams must process thousands of requests and complaints, which is not efficient at scale given the need for human interaction; moreover, many complaints contain information of no value (they only reflect dissatisfaction but not its cause); text classification helps filter the information that can lead to action. The process of applying text classification is similar to training any machine learning model: you start with a data set (text in this case), assign labels to each sample, split the set into training and test sets, choose a training algorithm appropriate for the problem and finally train the model. After validating the model, it is used to classify new data.

Process for obtaining a text classification model.

I started the definition of NLP with the example of Siri, Google assistant, Cortana and Alexa; these belong to the area of conversational agents. In general, all NLP areas have information extraction in common; its goal is to identify the entities mentioned in the text. For example, in a sentence like "El presidente fue al congreso en Bogotá para defenderse de los cargos de corrupción" ("The president went to Congress in Bogotá to defend himself against corruption charges"), to understand the meaning an algorithm would need to extract words such as "presidente", "congreso", "Bogotá" and "corrupción"; these words are known as entities, and they can be identified easily since they usually take the form of nouns. From the text that lies between the entities it is possible to extract the relationships: "The president went to Congress"; entities and relationships form a context. A conversational agent uses the context to answer user queries; interpreting queries involves another NLP area, information retrieval, which, besides interpreting the query, searches documents for the solutions closest to that query; information retrieval is of course used in search engines. Conversational agents rely on the areas mentioned above to resolve users' requests.

As the application of NLP becomes more common, new use cases emerge: detecting calendar events in email, plagiarism detection, speech recognition, spell checking, sentiment analysis, translation; the list grows as research advances.

How to work with NLP?

There are three approaches to working with NLP; the first is the use of heuristics. With heuristics, the rules for understanding language are created by hand; the approach is useful for application MVPs, but it is sensitive to context drift (that is, users stop talking the way they did when the rules were created) and it loses accuracy as the application scales. Working with heuristics requires experts in the problem domain, which can be a disadvantage if we see it as one more dependency of the system; to compensate for the absence of an expert, or as an additional heuristic, knowledge bases, dictionaries and thesauri available on the web are often used; these resources are maintained by the community (in some cases) and freely accessible. A common tool for analysis under this approach is regular expressions; for example, the following expression could be used to extract usernames when analysing social media posts:

"@.*"

A popular approach in NLP is machine learning: given text data sets, a model is trained for the desired task; the most common techniques are naïve Bayes, support vector machines and conditional random fields. Conditional random fields, or CRF, have gained popularity by outperforming Markov chain models, as they give relevance to the order in which words appear and the context they form. CRF has been used successfully for entity extraction.

Finally, deep learning with the use of neural networks is the third approach, taking advantage of the ability of neural networks to work with data that has no apparent structure.

Where can I start?

Personally, I have worked with heuristics and machine learning. The programming language I recommend is Python, given its versatility for object-oriented, functional and structured programming; it also has a great ecosystem for data science. The tools you will surely need are:

  • Pandas: Your best friend for manipulating data. With pandas it is quite easy to load csv and json files, which you will surely have to interact with, and to work with the data in a matrix structure. It allows you to run queries over the matrix, transformations, dimensionality reduction, value mapping and filtering, among others. https://pandas.pydata.org/
  • Numpy: The tool for working with algebra and n-dimensional data structures. As you advance in text analysis, you will see the need to convert words and documents into vectors; this package makes it easy to work with those structures. https://numpy.org/
  • Sklearn: The library for machine learning; it has a large number of classification algorithms, clustering methods for unsupervised learning, data preprocessing, random generation of training and test sets, and evaluation functions, among other things. Mastering this tool is the foundation not only for NLP but for anything related to machine learning. https://scikit-learn.org/stable/
  • NLTK: Finally, the ad hoc package for NLP; it provides classification, tokenization, lemmatization, parsing, stopword dictionaries and, in general, a complete set of tools for language analysis. https://www.nltk.org/

To read more about the topic, I recommend:

Happy research!

NLP, artificial intelligence applied to language

Available in Spanish

We live in years of accelerating technological advance, years of automation; human tasks are being fully performed by machines or supported by them. With these advances, many terms have gained popularity: Big Data, Machine Learning, Artificial Intelligence, Deep Learning, Neural Networks and a long list of others. Such terms are popular because their application to common problems eases our lives. Today I want to bring you a term that may be known to some of you and new to others: Natural Language Processing, or NLP.

What is Natural Language Processing?

In short, NLP is an area of computer science that seeks to analyse, model and understand human language; not an easy task, is it? Have you ever thought about how to model human language? Or how to take the way our brains interpret language, turn it into explicit rules and write those rules as code to be read by a machine? Years ago this would have seemed like science fiction, but nowadays NLP surrounds us every day.

A common phrase related to NLP would be "Hey Siri"; of course, you may own an Android device instead, but you also communicate with your Google assistant by speaking and giving instructions that are interpreted on your device. Even if you don't ask your cellphone serious questions and just chit-chat for fun, the question remains: how do these digital assistants work? How does NLP work? The first step to understanding this is to understand how language is structured.

Back to school

Language is like a huge set of Lego pieces which can be combined to create awesome structures; these Lego pieces are composed of tinier pieces. Also, there are rules to combine them, and these rules scale up from the tiniest pieces to the medium, the big and the huge structures you build. The smallest kind of pieces are the characters (A-Z); we call these phonemes. As you know, phonemes alone are meaningless, so you start combining them to build bigger blocks.

The next kind of pieces are the morphemes; these are the minimal combinations of phonemes that carry a meaning. You may identify morphemes as words, but even though all words are morphemes, not all morphemes are words; that's the case of prefixes and suffixes.

The composition of morphemes to build a new one. Prefixes and suffixes are not words by themselves, but they are morphemes.

Lexemes are variations of morphemes that share a common meaning and, in general, a common root; for example, "find", "finds", "found" and "finding" share the common root "FIND". This concept is particularly important in text analysis, since a text may have many lexemes that, in the end, refer to a common meaning and a common context. Being able to trace the lexemes back to their root, or lemma, is called lemmatization, and it eases the analysis by keeping the meaningful unit of each word.
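
As a small illustration of lemmatization, here is a minimal sketch using NLTK's WordNet lemmatizer (my own example, not from the original text); it needs a one-time nltk.download('wordnet').

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# Treating each variation as a verb traces it back to the common lemma 'find'.
for word in ["find", "finds", "found", "finding"]:
    print(word, "->", lemmatizer.lemmatize(word, pos="v"))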

By combining lexemes and morphemes, you assemble phrases and sentences. But there are rules for combining words; you don't just put them in random order. A common well-formed sentence may have a noun, a verb and prepositions binding them, as in: "Javier plays the guitar at night". The set of laws governing word order is called syntax.

Above phrases and sentences is where the beauty lies: with those blocks people create magnificent buildings, the books, poems and songs you love. At this level a context starts to exist, and the language structure exhibits a deeper meaning. We want machines to process that context and understand that meaning.

What are the NLP people doing?

The most popular area of research is text classification; the intention behind it is to assign one or more labels to a text. A common use of text classification is spam detection; companies like Gmail or Outlook use it. Another great use lies in customer service support: having to check thousands of complaints from customers is not practical, as most of these comments are not clear about the complaint; text classification helps filter the information that leads to action. The process of applying text classification follows the usual machine learning model training: you start from a data set of texts, assign labels to each text sample, divide the data set into training and test sets, train the model (having previously chosen a method that fits the problem) and then use your model to classify unseen data.

Common process to obtain and use a model for text classification.
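
To make that process concrete, here is a minimal sketch with scikit-learn; the tiny spam/ham data set is invented purely for illustration, and a real application would need far more labelled samples and a proper evaluation.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = [
    "win a free prize now", "limited offer click here", "cheap pills online",
    "meeting moved to monday", "please review the attached report", "lunch with the team at noon",
]
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

# Split the labelled samples into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.33, random_state=42)

# TF-IDF features feeding a naive Bayes classifier, one method that fits this kind of problem.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(X_train, y_train)

print(model.score(X_test, y_test))               # accuracy on the held-out test set
print(model.predict(["free offer, click now"]))  # classify unseen data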

I started the definition of NLP by giving the example of Siri and Google assistant; you must be aware of others like Cortana or Alexa. These are the focus of the conversational agents area. This area (and, in general, all NLP areas) intersects in common use cases with information extraction, whose objective is to identify the entities involved in the text. For example, in a phrase like "The president went to the Congress in Bogotá, to defend himself against corruption charges", in order to understand the meaning, an algorithm needs to extract the words "president", "Congress", "Bogotá" and "corruption"; these words are known as entities, and you can identify them easily as they take the form of nouns in sentences. From the text that lies between entities, an algorithm can infer relationships: "The president went to the Congress"; entities and the relationships binding them form a context. A conversational agent can use this context to answer user queries, which brings us close to another area, information retrieval: how a machine can understand human questions and retrieve the information that answers them; it is (of course) used in search engines. Conversational agents make use of information extraction and retrieval to chat with the user.

As the use of NLP becomes more common, more and more applications are born: calendar event detection, plagiarism detection, speech recognition, spell checking, sentiment analysis, translation from language to language; the list will grow as the research continues.

How does NLP work?

There are three common approaches to working with NLP; the first one is heuristics. With heuristics, the rules to understand the language are handcrafted; the approach works for MVP applications, but it is sensitive to concept drift and the accuracy suffers as the application scales. To work with heuristics a domain expert is required, a drawback if you think about it as an added dependency; a way to tackle this is to use knowledge bases, dictionaries and thesauri from the web, resources maintained by the community and free (in some cases). A common tool in the analysis with this approach is the regular expression, or regex; think about the extraction of user names in social network posts with a regex like:

"@.*"

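As a quick sketch of the heuristic approach, here is that regex applied with Python's re module; the pattern "@\w+" is my own tighter variant that stops at the end of each user name instead of consuming the rest of the line.

import re

post = "Totally agree with @jane_doe about the debate, cc @john"

print(re.findall(r"@.*", post))   # ['@jane_doe about the debate, cc @john']
print(re.findall(r"@\w+", post))  # ['@jane_doe', '@john']
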
A popular approach to NLP is the use of machine learning: by having datasets of text, a model is trained to work on the desired task; some of the most common techniques are naïve Bayes, support vector machines and conditional random fields. Conditional random fields, or CRF, have gained popularity by outperforming Hidden Markov Models, giving relevance to the order of the words in the text and the context they form. CRF have been used successfully in entity extraction tasks.

Finally, deep learning with neural networks is the third approach, leveraging the ability of neural networks to work with unstructured data.

Where can I start?

Personally, I have worked with the heuristic and machine learning approaches, and I used Python as the programming language: its versatility to work with object-oriented, functional and structured paradigms makes it a great option, and it also counts on a full ecosystem of packages for data science. Some of the tools you may use are:

  • Pandas: This will be your best friend working with data; you will probably handle csv or json files, and with pandas you can load those files and work with them in a matrix structure. It allows you to run queries over the matrix, transformations, dimensionality reduction, mapping and filtering, among others. https://pandas.pydata.org/
  • Numpy: This is the tool for working with algebra and n-dimensional data structures; as you advance you will see that words are converted into vectors and arrays, and this package eases the work with such structures. https://numpy.org/
  • Sklearn: This is the package for machine learning; it gives you a great set of classification algorithms, clustering methods for unsupervised learning, data preprocessing, random generation of training and testing splits, evaluation functions and many more. This is the package you should master for the ML tasks. https://scikit-learn.org/stable/
  • NLTK: Last but not least, the ad hoc package for NLP; it gives you classification, tokenization, stemming, parsing, stop word dictionaries and many other tools to work with the areas we talked about. https://www.nltk.org/

To read more about this topic, you could go to:

Happy research!