NLP, artificial intelligence applied to human language


English version

We live in years of technological progress, years of automation, in which human tasks are being carried out by machines or supported by them. With these advances, many terms have gained popularity: Big Data, machine learning, artificial intelligence, deep learning, neural networks, among others. These terms are popular because, beyond being a fad, they have proven their worth on real problems. Today I want to bring up a new term: Natural Language Processing, or NLP.

What is NLP?

NLP is an area of computer science that seeks to analyze, model and understand human language, a rather difficult task if we stop to think about it. Have you ever considered how our brain interprets language? How we are able to communicate? And even if we had that clear, how could we turn that tacit knowledge into explicit rules that could be programmed into a machine? What once seemed like pure science fiction now surrounds us every day, and we will see why.

If you have ever said the words "Hey Siri", you have already seen the benefits of NLP first hand; of course, you may prefer another brand such as Android, Microsoft or Amazon; whether you address the Google Assistant, Alexa or Cortana, you are giving instructions that are interpreted by a machine. So how does NLP work? To understand that, we must first understand how human language is structured.

Back to school

Language is like an enormous set of pieces that we can combine to create beautiful, gigantic structures. To build these structures, we combine the smallest pieces to create slightly larger blocks. Combining pieces follows certain rules, and those rules depend on whether we are working with the smallest pieces or with the ones we have already built from them. The smallest pieces are called phonemes, which in practice we represent with the letters of the alphabet (A-Z). Letters on their own carry no meaning, so we start combining them.

The next blocks are called morphemes: the smallest combination of phonemes that carries meaning. We commonly call them words; however, not all morphemes are words, as is the case with prefixes (aero, anti, macro, infra) and suffixes (arquía, ito, ita, filia).

NLP
Composition of morphemes to build a new one. Prefixes and suffixes are not words, but they are morphemes.

Lexemes are variations of morphemes that share a common meaning and, in general, a common root; for example, "encontrar", "encontrando", "encontrado" and "encontraste" all share the root "encontrar" (to find). This concept is particularly important in NLP pipelines, since the text to be analyzed usually contains different lexemes that converge on the same meaning and build the same context. The process of reducing lexemes to their common root (or lemma) is called lemmatization; working with the root of the lexemes condenses the text and makes the analysis easier.
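
As a minimal sketch of lemmatization, the snippet below uses NLTK's WordNetLemmatizer (one of the tools recommended at the end of this post) to map a few English lexemes back to their lemma; the word list is purely illustrative and the WordNet corpus has to be downloaded once.

    # Minimal lemmatization sketch with NLTK; the word list is illustrative.
    import nltk
    from nltk.stem import WordNetLemmatizer

    nltk.download("wordnet")  # one-time download of the WordNet corpus

    lemmatizer = WordNetLemmatizer()
    for word in ["finds", "found", "finding"]:
        # pos="v" tells the lemmatizer to treat each token as a verb
        print(word, "->", lemmatizer.lemmatize(word, pos="v"))
    # finds -> find, found -> find, finding -> find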

By combining lexemes and morphemes we obtain phrases and sentences; there are rules for these combinations, since we do not write in random order. A typical well-formed sentence has a noun, a verb and prepositions binding them, as in "Javier plays the guitar at night". The set of rules that governs word order is called syntax.

From the combination of phrases and sentences come the great creations we love: books, poems, songs and more. At this level a context exists and the structure conveys meaning. That context is what we want machines to process and understand.

What is being done in NLP?

The most popular research area is text classification, whose goal is to assign one or more labels to a text. A common use is spam detection, employed by companies such as Gmail or Outlook. Another use case is customer service support: these teams must process thousands of requests and complaints, which is not efficient at scale given the need for human interaction, and many complaints carry information of little value (they express dissatisfaction but not its cause); text classification helps filter out the information that can lead to action. The process of applying text classification is similar to training any machine learning model: you start with a data set (text in this case), assign a label to each sample, split the set into training and test subsets, choose a training algorithm suited to the problem and finally train the model. After validating the model, you use it to classify new data.

NLP (1)
Process for obtaining a text classification model.
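
As a rough sketch of that workflow, scikit-learn's TfidfVectorizer and a naïve Bayes classifier can be chained as shown below; the texts and labels are made up for illustration and stand in for real complaints.

    # Toy text-classification sketch; the data set is made up for illustration.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline
    from sklearn.metrics import accuracy_score

    texts = [
        "the package never arrived", "great service, thank you",
        "I was charged twice", "fast delivery and good support",
        "nobody answers my emails", "the refund was processed quickly",
    ]
    labels = ["complaint", "praise", "complaint", "praise", "complaint", "praise"]

    # Split the data set into training and test subsets, as described above
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.33, random_state=42)

    # Vectorize the text and train a naïve Bayes classifier
    model = make_pipeline(TfidfVectorizer(), MultinomialNB())
    model.fit(X_train, y_train)

    print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
    print(model.predict(["where is my package?"]))  # classify new, unseen data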

I began the definition of NLP with the example of Siri, Google Assistant, Cortana and Alexa; these belong to the area of conversational agents. In general, all NLP areas have information extraction in common, whose goal is to identify the entities mentioned in a text. For example, in a sentence such as "The president went to Congress in Bogotá to defend himself against corruption charges", to understand the meaning an algorithm would need to extract words such as "president", "Congress", "Bogotá" and "corruption"; these words are known as entities, and they are relatively easy to identify because they usually take the form of nouns. From the text that lies between entities it is possible to extract relationships, such as "the president went to Congress"; entities and relationships together form a context. A conversational agent uses that context to answer user queries; interpreting queries involves another NLP area, information retrieval, which, besides interpreting the query, searches documents for the answer(s) closest to it; information retrieval is, of course, used in search engines. Conversational agents build on all of these areas to resolve user requests.

As NLP applications become more common, new use cases keep emerging: detecting calendar events in email, plagiarism detection, speech recognition, spell checking, sentiment analysis, machine translation; the list grows as research advances.

How can you work with NLP?

There are three approaches to working with NLP. The first is the use of heuristics: the rules for understanding the language are handcrafted. This approach is useful for an application's MVP, but it is sensitive to context drift (that is, users stop talking the way they did when the rules were written) and it loses accuracy as the application scales. Working with heuristics requires experts in the problem domain, which can be a drawback if we see it as one more dependency of the system; to compensate for the absence of an expert, or as an additional heuristic, knowledge bases, dictionaries and thesauri available on the web are often used; these resources are maintained by the community (in some cases) and freely accessible. A common analysis tool in this approach is the regular expression; for example, the following expression could be used to extract user names when analyzing social media posts:

"@.*"

Another popular approach to NLP is machine learning: given text data sets, a model is trained on the desired task; the most common techniques are naïve Bayes, support vector machines and conditional random fields. Conditional random fields (CRF) have gained popularity by outperforming hidden Markov models, since they give weight to the order in which words appear and the context they form. CRF has been used successfully for entity extraction.

Finally, deep learning with neural networks is the third approach, leveraging the ability of neural networks to work with data that has no apparent structure.

Where can I start?

Personally, I have worked with heuristics and machine learning. The programming language I recommend is Python, given its versatility for object-oriented, functional and structured programming, plus a great ecosystem for data science. The tools you will most likely need are:

  • Pandas: your best friend for manipulating data. With pandas it is quite easy to load the CSV and JSON files you will surely have to deal with, and to work with the data in a matrix-like structure. It supports queries over that matrix, transformations, dimensionality reduction, value mapping and filtering, among other operations. https://pandas.pydata.org/
  • Numpy: the tool for working with algebra and n-dimensional data structures. As you progress in text analysis you will need to turn words and documents into vectors, and this package makes working with those structures easy. https://numpy.org/
  • Sklearn: the library for machine learning. It offers a large number of classification algorithms, clustering methods for unsupervised learning, data preprocessing, random splitting into training and test sets, and evaluation functions, among other things. Mastering this tool is the foundation not only for NLP but for anything related to machine learning. https://scikit-learn.org/stable/
  • NLTK: finally, the ad hoc package for NLP. It provides classification, tokenization, lemmatization, parsing, stop-word dictionaries and, in general, a complete toolkit for language analysis. https://www.nltk.org/

To read more on the topic, I recommend:

Happy research!

NLP, artificial intelligence applied to language


Available in Spanish

We live in years of accelerating technological progress, years of automation, where human tasks are being performed entirely by machines or supported by them. With these advances, many terms have gained popularity: Big Data, Machine Learning, Artificial Intelligence, Deep Learning, Neural Networks and a long list of others. Such terms are popular because their application to everyday problems makes our lives easier. Today I want to bring you a term that may be known to some of you and new to others: Natural Language Processing, or NLP.

What is Natural Language Processing?

In short, NLP is an area of computer science that seeks to analyse, model and understand human language. An easy task, isn't it? Have you ever thought about how to model human language? Or how we could take the way our brains interpret language, turn it into explicit rules and write those rules as code to be run by a machine? Years ago this would have seemed like science fiction, but nowadays NLP surrounds us every day.

A phrase commonly associated with NLP is "Hey Siri"; of course, you may own an Android device instead, but you will still talk to your Google Assistant and give it instructions that are interpreted on your device. Even if you don't ask your phone serious questions and just chit-chat for fun, the question remains: how do these digital assistants work? How does NLP work? The first step to understanding this is to understand how language is structured.

Back to school

Language is like a huge set of Lego pieces that can be combined to create awesome structures; these pieces are themselves made of smaller pieces. There are also rules for combining them, and those rules scale up from the tiniest pieces to the medium, large and huge structures you build. The smallest pieces are the characters (A-Z), which we call phonemes. As you know, phonemes alone are meaningless, so you start combining them to build bigger blocks.

The next kind of piece is the morpheme: the smallest combination of phonemes that carries meaning. You may identify morphemes with words, but while every word is made of morphemes, not all morphemes are words; that is the case of prefixes and suffixes.

NLP
The composition of morphemes to build a new one. Prefixes and suffixes are not words by themselves, but they are morphemes.

Lexemes are variations of morphemes that share a common meaning and, in general, a common root; for example, "find", "finds", "found" and "finding" share the root "FIND". This concept is particularly important in text analysis, since a text may contain many lexemes that, in the end, refer to a common meaning and a common context. Tracing lexemes back to their root, or lemma, is called lemmatization, and it eases the analysis by keeping only the meaningful unit of each word.

By combining lexemes and morphemes, you assemble phrases and sentences. But there are rules for combining words; you don't just put them in random order. A typical well-formed sentence has a noun, a verb and prepositions binding them, as in "Javier plays the guitar at night". The set of laws governing word order is called syntax.

Above phrases and sentences is where the beauty lies: with those blocks people create magnificent buildings, the books, poems and songs you love. At this level a context starts to exist, and the language structure exhibits a deeper meaning. We want machines to process that context and understand that meaning.

What are the NLP people doing?

The most popular area of research is text classification, whose goal is to assign one or more labels to a text. A common use of text classification is spam detection, employed by companies like Gmail or Outlook. Another great use lies in customer service support: having to check thousands of complaints from customers is not practical, and many of these comments are not even clear about what the complaint is; text classification helps filter the information that leads to action. Applying text classification follows the usual machine learning training process: you start from a data set of texts, assign a label to each text sample, divide the data set into training and test sets, train the model (after choosing a method that fits the problem) and then use your model to classify unseen data.

NLP (1)
Common process to obtain and use a model for text classification.

I started the definition of NLP with the example of Siri and Google Assistant; you are probably aware of others like Cortana or Alexa. These are the focus of the conversational agents area. This area (and in general every NLP area) shares common use cases with information extraction, whose objective is to identify the entities involved in a text. For example, in a phrase like "The president went to the Congress in Bogotá to defend himself against corruption charges", in order to understand the meaning an algorithm needs to extract the words "president", "Congress", "Bogotá" and "corruption"; these words are known as entities, and they can be identified fairly easily since they usually take the form of nouns. From the text that lies between entities, an algorithm can infer relationships, such as "the president went to the Congress"; entities and the relationships binding them form a context. A conversational agent can use this context to answer user queries, which brings in another area, information retrieval: how a machine can understand human questions and retrieve the information that answers them; it is (of course) used in search engines. Conversational agents make use of information extraction and retrieval to chat with the user.
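
A very rough sketch of that "entities look like nouns" idea, using NLTK's tokenizer and part-of-speech tagger on the sentence above (proper named-entity recognition would go further, for example with a CRF model, as discussed below):

    # Naive entity-candidate extraction: keep the tokens tagged as nouns.
    import nltk

    # One-time downloads of the tokenizer and tagger models
    nltk.download("punkt")
    nltk.download("averaged_perceptron_tagger")

    sentence = ("The president went to the Congress in Bogota "
                "to defend himself against corruption charges")

    tokens = nltk.word_tokenize(sentence)
    tagged = nltk.pos_tag(tokens)  # [('The', 'DT'), ('president', 'NN'), ...]

    # NN* tags cover common and proper nouns, singular and plural
    candidates = [word for word, tag in tagged if tag.startswith("NN")]
    print(candidates)  # e.g. ['president', 'Congress', 'Bogota', 'corruption', 'charges']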

As the use of NLP spreads, more and more applications are born: calendar event detection, plagiarism detection, speech recognition, spell checking, sentiment analysis, translation between languages; the list keeps growing as research continues.

How does NLP work?

There are three common approaches to working with NLP; the first one is heuristics. With heuristics, the rules to understand the language are handcrafted; the approach works for an MVP application, but it is sensitive to concept drift and its accuracy suffers as the application scales. Working with heuristics requires a domain expert, a drawback if you think about it as an added dependency; the way to tackle this is to use knowledge bases, dictionaries and thesauri from the web, resources that are maintained by the community and free (in some cases). A common tool in this approach is the regular expression, or regex; think about extracting user names from social network posts with a regex like:

"@.*"

A popular approach to NLP is the use of machine learning; given data sets of text, a model is trained on the desired task. Some of the most common techniques are naïve Bayes, support vector machines and conditional random fields. Conditional random fields, or CRF, have gained popularity by outperforming hidden Markov models, giving relevance to the order of the words in the text and the context they form. CRF has been used successfully in entity extraction tasks.

Finally, deep learning with neural networks is the third approach, leveraging the ability of neural networks to work with unstructured data.

Where can I start?

Personally, I have worked with the heuristic and machine learning approaches, using Python as the programming language; its versatility across the object-oriented, functional and structured paradigms makes it a great option, and it comes with a full ecosystem of packages for data science. Some of the tools you may use are listed below (a small example combining them follows the list):

  • Pandas: this will be your best friend when working with data. You will probably have to handle CSV or JSON files; with pandas you can load these files and work with them in a matrix structure. It allows queries over the matrix, transformations, dimensionality reduction, mapping and filtering, among others.
    https://pandas.pydata.org/
  • Numpy: the tool for working with algebra and n-dimensional data structures. As you advance, you will see that words are converted into vectors and arrays, and this package eases the work with such structures. https://numpy.org/
  • Sklearn: the package for machine learning. It gives you a great set of classification algorithms, clustering methods for unsupervised learning, data preprocessing, random splitting into training and test sets, evaluation functions and much more; this is the package to master for ML tasks. https://scikit-learn.org/stable/
  • NLTK: last but not least, the ad hoc package for NLP. It gives you classification, tokenization, stemming, parsing, stop-word dictionaries and many other tools for the areas we talked about. https://www.nltk.org/
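
As a tiny, hypothetical sketch of those packages working together (the sample complaints and the column name are made up), the snippet loads text into a pandas DataFrame, removes NLTK stop words and builds a numpy term-count matrix with scikit-learn:

    # Tiny end-to-end sketch combining pandas, NLTK, scikit-learn and numpy.
    import pandas as pd
    import nltk
    from nltk.corpus import stopwords
    from sklearn.feature_extraction.text import CountVectorizer

    nltk.download("stopwords")  # one-time download

    # Made-up data; in practice you would load it with pd.read_csv(...)
    df = pd.DataFrame({"text": ["the package never arrived",
                                "I was charged twice for the order"]})

    # Build a document-term matrix, dropping English stop words
    vectorizer = CountVectorizer(stop_words=stopwords.words("english"))
    matrix = vectorizer.fit_transform(df["text"]).toarray()  # numpy array

    print(vectorizer.get_feature_names_out())
    print(matrix)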

To read more about this topic, you could go to:

Happy research!

How do Netflix or Spotify know what you like? – A briefing on Recommender Systems

Have you ever wanted to know how Netflix, Spotify or other interactive platforms recommend products to you (include Amazon here, a pioneer)? Well, I have recently been studying this topic; it is an area called Recommender Systems, which tries to tackle a problem known as the Long Tail.

The Long Tail problem refers to having a great number of products to offer to a customer and the task of finding the one they are looking for. A common approach would be to recommend popular items, but is that okay all the time?

Long Tail Example
Your tastes may be way different from the masses. Don't you know what "Despacito" is?

Indeed, it is not always accurate to recommend the most popular items, and the success of an online selling system depends on this accuracy; as a user, you will lose confidence in a system that doesn't know you and gives you items you don't want. So how do Amazon, Netflix and Spotify (among others) recommend items? I'm going to explain two simple yet really powerful approaches that handle this task.

Collaborative Filtering

Collaborative filtering states that, given a user A who loves items 1 and 2, and a user B who loves item 1, the recommender system must recommend item 2 to user B. That is a really simple logic if you think about it, yet really effective. That's how you start watching LOTR movies on Netflix and receive Star Wars suggestions on your dashboard.

Collaborative Filtering Visual
An unseen item is recommended to a user who shares tastes with another.

Collaborative filtering builds a matrix of user rows vs. item columns, where each position holds the rating given by a user to an item. Then the distance between users, based on the ratings they have given to items, is calculated with a metric; the most used measures are Jaccard, Pearson and cosine similarity.
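
A minimal sketch of that user-user logic, with a made-up rating matrix and cosine similarity computed in plain numpy (real systems would use sparse matrices and a proper library):

    # User-based collaborative filtering sketch with a made-up rating matrix.
    import numpy as np

    # Rows: users A and B; columns: items 1, 2, 3 (0 = not rated)
    ratings = np.array([
        [5.0, 4.0, 0.0],   # user A loves items 1 and 2
        [5.0, 0.0, 0.0],   # user B loves item 1, has not rated item 2
    ])

    def cosine(u, v):
        # Cosine similarity between two rating vectors
        return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

    print("similarity(A, B) =", round(cosine(ratings[0], ratings[1]), 2))

    # Recommend to B the items A rated highly that B has not rated yet
    unseen_by_b = np.where((ratings[1] == 0) & (ratings[0] >= 4))[0]
    print("recommend items:", unseen_by_b + 1)  # -> item 2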

Having a system with a small number of items and users makes it feasible to search over all the data, but this is not going to scale if (like Netflix or Spotify) you have millions of users and items. In such cases you define a limit on the number of neighbours to search.

A cool feature that comes with this approach is the boost in "serendipity", which in the recommender systems context is the ability of the system to predict items that the customer will like but that are totally new to them (like a heavy metal fan receiving a jazz recommendation, for example).

Nevertheless, there is an "Achilles heel" in this approach, called cold start: the difficulty of recommending items to a new user. It makes sense if you think about it: how are you going to recommend something to a user who has not rated any item, when you don't know what they like? Even if the customer has rated a few items, it will be hard for the algorithm to find similar users. The same happens when a new item is created: no one has rated it, so it is invisible.

Content based recommendation

Collaborative filtering doesn't take any item feature into account, but content-based recommendation does. In this approach, each item is mapped to a set of values that represent it; the most common representation is through keywords. With this representation the system can find similar items using information retrieval techniques such as TF-IDF.

Keywords for movies
Example of movies described as keywords. LOTR and Captain America are similar because both have heroes; Dunkirk and Captain America share the "war" theme.

In practice, the system finds which items the current customer has liked (or rated highly), then finds the n items most similar to those, and presents them.

This solution avoids the cold start problem, since you only need one rated item from the new customer to start recommending. But content-based recommendation has its own drawback: there is no serendipity here, as the system only recommends "more of the same".
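
A minimal content-based sketch under those assumptions (the keyword descriptions are made up to mirror the figure above), using scikit-learn's TfidfVectorizer and cosine similarity:

    # Content-based similarity sketch: movies described by made-up keywords.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    movies = {
        "LOTR": "heroes fantasy rings battle",
        "Captain America": "heroes war shield battle",
        "Dunkirk": "war soldiers beach history",
    }

    titles = list(movies.keys())
    tfidf = TfidfVectorizer().fit_transform(movies.values())
    sim = cosine_similarity(tfidf)  # pairwise similarity between all movies

    # If the customer liked Captain America, rank the other movies by similarity
    liked = titles.index("Captain America")
    ranking = sorted(((sim[liked][i], t) for i, t in enumerate(titles) if i != liked),
                     reverse=True)
    print(ranking)  # both LOTR and Dunkirk share keywords with Captain America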

Wanna Try?

Real recommender systems work as hybrids, combining several approaches like the two above. This is a cool new area with a lot of research to do and plenty of interesting literature; I encourage you to keep reading and go deep into this topic.

If you want a practical example, here I leave you one of each approach: a movie collaborative recommender in Java and a movie content recommender in Python. Feel free to modify the code to your needs. Happy coding!