Simple Natural Language Processing

Wednesday, July 13, 2016

Usually I post on my Medium blog, but I thought of posting this here because it's going to be a long post and a boring read for many.

I have been interested in machine learning for a long time, and the recent hype got me interested in it all over again and made me want to learn more. I made a very simple neural network that can play tic-tac-toe a few months back, and I thought about playing with image recognition, but later I got a bit interested in Natural Language Processing, or NLP for short.

There are various methods out there, but I was interested in a way to pick out keywords from a sentence, especially from a question, so that the keyword can be used to search for an answer.

The Problem

When we look at most questions, what I felt was that only a few words are key to answering them: the keywords. If we can pick out that keyword, finding an answer becomes simple; it's all about keyword spotting.

Let's say you have a search engine or a website that gives information about places and locations depending on the queries of a user, and the questions are like this:
  • where can i buy avocado?
  • where can i find cocoa butter?

So in question number one the keyword is 'avocado', and in question number two the keyword is 'cocoa butter'. We don't need the words 'can', 'i', 'buy', or 'find' to find an answer to the question being asked; we only need the words 'avocado' and 'cocoa butter'. That was what I thought was the simplest way to solve the problem: get the keyword, solve the question.

The Solution

The first step was to find a set of questions to train the algorithm on, and those are readily available on sites like Twitter and Facebook. After that, because we need to train the algorithm, we manually find the keyword of each question and feed each question and its keyword to the algorithm.

The solution I came up with was giving a weight to every word of the question. At the start, each word has a weight of 1. The algorithm scans the sentence looking for the keyword we supplied in the training set; when it finds that keyword, we add 0.001 (or some value like that) to it and deduct 0.001 from all the other words, and we save these words and their assigned weights in the algorithm's vocabulary.

For example, in the first question, 'where can i buy avocado?', the algorithm looks for the word 'avocado' and adds 0.001 to its current weight, and we deduct 0.001 from all the other words, leaving us with just one word with a higher weight than the rest.

Then the algorithm moves on to the second question, 'where can i find cocoa butter?'. By the time we come to the second question, words like 'where' and 'can' are already in the vocabulary with assigned weights (0.999). Here too the keyword is provided just like before, so a further 0.001 is deducted from the weights of words like 'where', 'can', and 'i', and whatever words are not yet in the vocabulary are added along with their weights.

This process is repeated many, many times over many questions, until we are left with very low weights for the not-so-important words and high weights for the important words (the keywords).
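
To make that concrete, here's a rough Python sketch of the training loop described above; the names, the data format, and the exact 0.001 step are just my illustration:

```python
# A rough sketch of the training loop. Training data is a list of
# (question, keyword) pairs that were labelled by hand.

STEP = 0.001          # amount added to the keyword, deducted from the rest
DEFAULT_WEIGHT = 1.0  # every word enters the vocabulary with a weight of 1

def train(vocabulary, training_set):
    """Update the word weights in place from (question, keyword) pairs."""
    for question, keyword in training_set:
        for word in question.lower().rstrip('?').split():
            weight = vocabulary.get(word, DEFAULT_WEIGHT)
            if word == keyword:
                vocabulary[word] = weight + STEP  # reward the keyword
            else:
                vocabulary[word] = weight - STEP  # punish every other word

vocabulary = {}
train(vocabulary, [
    ("where can i buy avocado?", "avocado"),
    # note the single-word label: the 'cocoa butter' phrase problem
    # comes up again under The Glitch below
    ("where can i find cocoa butter?", "butter"),
])
# after these two questions (rounded): 'where', 'can', 'i' end at 0.998,
# 'buy', 'find', 'cocoa' at 0.999, and 'avocado', 'butter' at 1.001
```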

So What Happens Next?

So after that we give the algorithm a new question that is not in the training set, like 'where can i buy bacon?'. Here the algorithm comes across words it already knows, like 'where', 'can', 'i', and 'buy'; these words were assigned low weights during training, so it knows they are not important.

And then it comes across a new word, 'bacon'. This word is given the default weight of 1 and added to the vocabulary, and the word with the highest weight is returned as the keyword. Here that is 'bacon', because it is a new word with a weight of 1 while all the other words have weights below 1.

So when the algorithm comes across a query similar to one in the training set, like 'where can i buy avocado?', the highest-weighted word is 'avocado', and it is returned as the keyword. The weights of the keyword and the other words are then updated just as during training.
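
Here's a rough sketch of that lookup step, continuing the illustrative code from above; the hand-written weights are made-up values of the kind the training loop would produce:

```python
STEP = 0.001
DEFAULT_WEIGHT = 1.0

def extract_keyword(vocabulary, query):
    """Return the highest-weighted word of the query and update the weights."""
    words = query.lower().rstrip('?').split()
    # unknown words fall back to the default weight of 1, so a brand-new
    # word automatically outranks the worn-down weights of known words
    keyword = max(words, key=lambda w: vocabulary.get(w, DEFAULT_WEIGHT))
    for word in words:
        weight = vocabulary.get(word, DEFAULT_WEIGHT)
        vocabulary[word] = weight + STEP if word == keyword else weight - STEP
    return keyword

vocabulary = {'where': 0.998, 'can': 0.998, 'i': 0.998, 'buy': 0.999}
print(extract_keyword(vocabulary, "where can i buy bacon?"))  # -> bacon
```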

So in time the not-so-useful words like 'where' will get lower and lower weights, while the keywords will get higher ones. The more queries are processed, the better the algorithm becomes.

The Glitch

This approach was good, but I came across some issues:
  • Because the algorithm checks every word of a query against the vocabulary for its weight, it is like brute forcing and gets less efficient as the vocabulary grows.
  • If the search query contains only one word, that word will automatically have the highest weight and will be returned as the keyword, whatever it is.
  • If more than one word that is not in the vocabulary is present, they will all have equally high weights and will all be returned as keywords, giving false results.
  • The algorithm can only pick out one word and can't pick up related words such as 'Pizza Hut': if the search query contains 'Pizza Hut', it will only return 'Pizza', not 'Pizza Hut', because the algorithm is only good at picking out single keywords, not phrases. (There's a short demo of the last two issues after this list.)
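
To make the last two issues concrete, here's the same kind of lookup run on a 'Pizza Hut' query, using made-up weights of the sort the training loop would leave behind:

```python
vocabulary = {'where': 0.998, 'can': 0.998, 'i': 0.998,
              'buy': 0.999, 'find': 0.999}
words = "where can i find pizza hut".split()
scores = {w: vocabulary.get(w, 1.0) for w in words}
print(scores['pizza'], scores['hut'])  # 1.0 1.0 -- an unbreakable tie
print(max(words, key=scores.get))      # 'pizza' -- only one word can win,
                                       # so 'pizza hut' is never picked out
```
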
So it was obvious that this method is not very effective, and I thought of looking for a more efficient keyword-spotting method in NLP, which I will hopefully write about in a new post.

I know there are existing NLP tools and methods, but I was interested in making one from the ground up for fun. So what do you think? Feel free to share your thoughts.
