Kumar Shridhar
Co-chief AI Scientist

Generative Model Chatbots

Generative bots are the smartest chatbot models available today. But are they the solution to every problem?

To learn more about the types of chatbots and why chatbots based on generative models are the smartest ones today, see my previous post:

Rule based bots vs AI bots (medium.com)

Sequence to Sequence Models:

Sequence to sequence models were first introduced in the paper Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation by Kyunghyun Cho et al. in 2014.

The proposed neural network architecture, known as the RNN Encoder–Decoder, consists of two recurrent neural networks (RNNs) that act as an encoder and decoder pair. The encoder maps a variable-length source sequence to a fixed-length vector, and the decoder maps the vector representation back to a variable-length target sequence. The two networks are trained jointly to maximize the conditional probability of the target sequence given a source sequence. [2]

Image source: Deep Learning for chatbots part 1
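The training objective described above can be written out as follows. This is a sketch in my own notation: N training pairs (x_n, y_n), model parameters θ, and c for the fixed-length vector produced by the encoder. The symbols are assumptions chosen for illustration.

```latex
\max_{\theta} \; \frac{1}{N} \sum_{n=1}^{N} \log p_{\theta}\left(\mathbf{y}_n \mid \mathbf{x}_n\right),
\qquad
p_{\theta}\left(\mathbf{y} \mid \mathbf{x}\right)
  = \prod_{t=1}^{T'} p_{\theta}\left(y_t \mid y_1, \dots, y_{t-1}, \mathbf{c}\right)
```

Here T' is the length of the target sequence, and the decoder predicts each target word from the words generated so far together with the encoder summary c.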

THE MODEL

An RNN, or Recurrent Neural Network, is a neural network whose output depends not only on the current input but also on the series of inputs given in the past. Because the output is influenced by past inputs, RNNs are very effective in Natural Language Processing, where the next word depends not only on the previous word but on a series of words before it.

Depending on the scenario, RNNs can be used to deal with a variety of tasks. Some of them are shown below.
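To make the dependence on past inputs concrete, here is a minimal sketch of a vanilla RNN in plain Python. It is not the code used for this bot: the weight values, the sizes, and the names `rnn_step` and `rnn_forward` are all made up for illustration. Two sequences that differ only in their first input end in different hidden states, even though their last inputs are identical.

```python
import math

def rnn_step(x, h_prev, W_xh, W_hh, b_h):
    """One vanilla RNN step: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h)."""
    h = []
    for i in range(len(b_h)):
        total = b_h[i]
        total += sum(w * xj for w, xj in zip(W_xh[i], x))
        total += sum(w * hj for w, hj in zip(W_hh[i], h_prev))
        h.append(math.tanh(total))
    return h

def rnn_forward(inputs, W_xh, W_hh, b_h):
    """Run the RNN over a whole sequence and return the final hidden state."""
    h = [0.0] * len(b_h)
    for x in inputs:
        h = rnn_step(x, h, W_xh, W_hh, b_h)
    return h

# Hypothetical, hand-picked weights (hidden size 2, input size 2).
W_xh = [[0.5, -0.3], [0.1, 0.8]]
W_hh = [[0.6, 0.2], [-0.4, 0.7]]
b_h = [0.0, 0.0]

seq_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
seq_b = [[0.0, 1.0], [0.0, 1.0], [1.0, 1.0]]  # differs only in the first input

final_a = rnn_forward(seq_a, W_xh, W_hh, b_h)
final_b = rnn_forward(seq_b, W_xh, W_hh, b_h)
print(final_a)
print(final_b)
```

The hidden state is the only thing carried from one step to the next, so everything the network knows about earlier words has to be squeezed into it.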

Image source: The Unreasonable Effectiveness of Recurrent Neural Networks

If you want to know more about RNNs and their use cases, see Recurrent Neural Networks Tutorial, Part 1 - Introduction to RNNs (www.wildml.com).

However, vanilla RNNs suffer from the vanishing gradient problem, which paved the way for LSTMs. LSTMs handle the problem well thanks to their memory cells and gates.

LSTMs are a type of RNN that handles the long-term dependency problem well thanks to gates. The gates let each cell decide what information to keep from previous steps and what to update. Read more about LSTMs here:

Understanding LSTM Networks -- colah's blog (colah.github.io)
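The role of the gates can be illustrated with a minimal LSTM step in plain Python. This is a sketch under assumptions, not the implementation used for this bot: the parameter layout, the hand-picked biases, and the names `lstm_step` and `affine` are hypothetical. With the forget gate held open and the input gate held closed, the cell keeps its stored value across many steps. With the forget gate closed, the value is erased in one step.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, b, v):
    """Return W v + b for a list-of-lists matrix W."""
    return [sum(w * vj for w, vj in zip(row, v)) + bi for row, bi in zip(W, b)]

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM step. `params` maps a gate name to (W, b) acting on [h_prev, x]."""
    z = h_prev + x  # list concatenation of previous hidden state and input
    f = [sigmoid(a) for a in affine(*params["forget"], z)]      # what to keep
    i = [sigmoid(a) for a in affine(*params["input"], z)]       # what to write
    o = [sigmoid(a) for a in affine(*params["output"], z)]      # what to expose
    g = [math.tanh(a) for a in affine(*params["candidate"], z)]  # new content
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c

def make_params(forget_bias, input_bias):
    """Hypothetical 1-unit cell whose gates are controlled by their biases only."""
    return {
        "forget": ([[0.0, 0.0]], [forget_bias]),
        "input": ([[0.0, 0.0]], [input_bias]),
        "output": ([[0.0, 0.0]], [10.0]),
        "candidate": ([[0.0, 1.0]], [0.0]),
    }

# Forget gate open, input gate closed: the cell state is preserved.
h, c = [0.0], [0.8]
for _ in range(20):
    h, c = lstm_step([1.0], h, c, make_params(forget_bias=10.0, input_bias=-10.0))
remembered = c[0]

# Forget gate closed: the stored value is wiped after a single step.
_, c_erased = lstm_step([1.0], [0.0], [0.8], make_params(forget_bias=-10.0, input_bias=-10.0))
forgotten = c_erased[0]

print(remembered, forgotten)
```

In a trained network the gate values depend on the input and the previous hidden state rather than on fixed biases, which is how the model learns when to remember and when to update.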

The sequence to sequence model uses two LSTM networks, one for encoding and one for decoding. I used three LSTM layers of size 512 each. Some other parameters that I used were:

Batch size: 128
Learning rate: I experimented with a few learning rates. I first chose a low learning rate, and the model did not perform very well. I then started with a large learning rate and a decay factor, and the performance improved.
Vocabulary size: 50,000 words for both the encoding and decoding models
Checkpoints saved after every 500 iterations.

DATASET USED

For training the model, two files were created, one acting as input and the other as output. The input file contained all the questions, or the first part of each conversation, and the output file contained the respective answers, or the other part of the conversation. Each file had a vocabulary of 50,000 words, and any word outside the vocabulary was marked with an UNK (unknown) tag.

Example lines from the files:

INPUT: Hi!
OUTPUT: Hi!
INPUT: How are you?
OUTPUT: I am good. Thanks for asking.
INPUT: You’re asking me out. That’s so cute. What’s your name again?
OUTPUT: Forget it.
INPUT: No, no, it’s my fault — we didn’t have a proper introduction
OUTPUT: Cameron.
INPUT: The thing is, Cameron — I’m at the mercy of a particularly hideous breed of loser. My sister. I can’t date until she does.
OUTPUT: Seems like she could get a date easy enough…
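The vocabulary cut-off and UNK tagging described above can be sketched as follows. The whitespace tokenization and the helper names `build_vocab` and `replace_oov` are assumptions for illustration. The post does not say how the text was actually tokenized, and a real run would use a limit of 50,000 rather than 3.

```python
from collections import Counter

UNK = "UNK"

def build_vocab(lines, max_size):
    """Keep the `max_size` most frequent whitespace-separated words."""
    counts = Counter(word for line in lines for word in line.split())
    return {word for word, _ in counts.most_common(max_size)}

def replace_oov(line, vocab):
    """Replace every out-of-vocabulary word with the UNK tag."""
    return " ".join(word if word in vocab else UNK for word in line.split())

# Tiny hypothetical corpus with a vocabulary limit of 3 words.
corpus = ["how are you", "how are things", "who are you"]
vocab = build_vocab(corpus, max_size=3)
tagged = replace_oov("who are you today", vocab)
print(tagged)  # UNK are you UNK
```

The same routine would be applied to the input and the output file, each with its own vocabulary.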

The following datasets were used for training the model:

1. Reddit conversation datasets for September to December 2016, which include all the comments on Reddit, their replies, the replies to those replies, and so on. The dataset was cleaned so that only questions that received a reply were kept; all other replies and sub-comments were removed. Each question was saved in the input file and its answer in the output file.

Directory Contents (files.pushshift.io)

The link above provides the Reddit datasets in JSON format for all years. If you use a lot of the datasets, please consider making a donation (https://pushshift.io/donations/) to support the creators.

2. Cornell Movie-Dialogs Corpus, which contains a large metadata-rich collection of fictional conversations extracted from raw movie scripts. It has 220,579 conversational exchanges between 10,292 pairs of movie characters.

Cornell Movie-Dialogs Corpus (www.cs.cornell.edu)

3. Twitter feed data collected with a Twitter scraper, preprocessed and arranged in a question-answer (conversational) format.

Marsan-Ma/twitter_scraper (github.com)

4. Other datasets include the Ubuntu Dialogue Corpus v1.0 and Washington University Law datasets, which were cleaned and modified accordingly.
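For the Reddit data (item 1 above), the cleaning step can be sketched roughly like this. It is one possible reading of the description, not the script used for this bot. It assumes one JSON object per line with the fields `id`, `parent_id`, and `body`, where a `parent_id` starting with `t3_` points at the submission and one starting with `t1_` points at another comment. It keeps each top-level comment together with its first direct reply and drops everything else.

```python
import json

def extract_pairs(json_lines):
    """Return (question, answer) pairs: top-level comments with their first reply."""
    comments = {}
    for raw in json_lines:
        comment = json.loads(raw)
        comments[comment["id"]] = comment

    pairs = []
    answered = set()
    for comment in comments.values():
        parent_ref = comment["parent_id"]
        if not parent_ref.startswith("t1_"):
            continue  # replies to the submission itself are candidate questions
        parent = comments.get(parent_ref[3:])
        if parent is None or parent["id"] in answered:
            continue  # parent missing from the dump, or it already has an answer
        if not parent["parent_id"].startswith("t3_"):
            continue  # parent is itself a reply, so this is a sub-comment
        answered.add(parent["id"])
        pairs.append((parent["body"], comment["body"]))
    return pairs

# Hypothetical sample records in the assumed format.
sample = [
    '{"id": "a1", "parent_id": "t3_post", "body": "How do I install it?"}',
    '{"id": "a2", "parent_id": "t1_a1", "body": "Use the package manager."}',
    '{"id": "a3", "parent_id": "t1_a1", "body": "Or build from source."}',
    '{"id": "a4", "parent_id": "t1_a2", "body": "Thanks!"}',
    '{"id": "a5", "parent_id": "t3_post", "body": "Nobody answered this one."}',
]
pairs = extract_pairs(sample)
print(pairs)
```

The first element of each pair would go to the input file and the second to the output file. A full monthly dump is far too large to hold in a dictionary like this, so a real pipeline would need to process it in chunks or use a database.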

TRAINING

The model was trained on an NVIDIA Titan X 12 GB GPU for 3 days.

Number of iterations: 90,000
Learning rate at iteration 90,000: 0.0003 (starting learning rate: 0.15 with a learning rate decay factor of 0.9)
Perplexity: 2.74 (starting perplexity close to 400)
Checkpoints saved after every 500 iterations
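The numbers above can be sanity-checked with a little arithmetic. The sketch below assumes the learning rate is multiplied by 0.9 each time a decay is applied, and that perplexity is the exponential of the average per-word cross-entropy loss in nats. Neither the decay trigger nor the perplexity formula is stated in this post, so both are assumptions.

```python
import math

start_lr = 0.15
decay = 0.9
final_lr = 0.0003

# Solve start_lr * decay**k = final_lr for k.
num_decays = math.log(final_lr / start_lr) / math.log(decay)

def lr_after(k):
    """Learning rate after k multiplicative decays."""
    return start_lr * decay ** k

def loss_from_perplexity(perplexity):
    """Cross-entropy loss (in nats) implied by a perplexity, assuming ppl = exp(loss)."""
    return math.log(perplexity)

print(round(num_decays))            # about 59 decays
print(lr_after(59))                 # roughly 0.0003
print(loss_from_perplexity(400.0))  # about 5.99 at the start
print(loss_from_perplexity(2.74))   # about 1.01 at the end
```

So the reported final learning rate corresponds to roughly 59 decays over the 90,000 iterations, and the drop in perplexity from about 400 to 2.74 corresponds to the loss falling from about 6 to about 1.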

PERFORMANCE

The model performed well on a few common sentences but did not perform very well on uncommon ones, although it tried to generate some answers.

The bot name and icon in browser

Human: Hi!

Bot: Hi!
Human: How are you?
Bot: Why are you asking?
Human: Who are you?
Bot: Why are you doing this?
Human: I have a problem with my Ubuntu computer.
Bot: What is that?
Human: I donot know.
Bot: So it all worked fine.
Human: are you a human ?
Bot: No , not real .
Human: are you a computer?
Bot: Certainly , Doctor .
Human: I am not a doctor.
Bot: I don ‘ t believe it .
Human: Am I a doctor?
Bot: sure but a can always be a.

From the above sentences, we can see that the bot performs reasonably well on small talk and generates answers based on the questions asked. It also appears to take the previous context into account when generating the next sentence, which should not be possible because the model has no memory to store previous contexts. However, because the model was trained on conversations, it has learned the usage of pronouns reasonably well and incorporates them into its answers in a way that mirrors the questions asked.

Human: who is the president?

Bot: Nice, me.
Human: Will AI destroy humanity?
Bot: to find what happened.
Human: When will the world end
Bot: You mean last night would you?
Human: get lost!
Bot: You got it !
Human: can you teach me something?
Bot: what do I do to install ?
Human: you are not making sense
Bot: yeah , I know , : ) : ) )

When the model is asked a random question, it generates random answers that sometimes make complete sense and sometimes do not. (Maybe the bot’s level of intelligence is way higher than ours and we find it difficult to understand.) Also worth noting is that the bot has learned to use some smileys, which, surprisingly, go well with the emotions it wants to convey. It may have just learned the position of the smileys: a smiling face :) generally follows a positive reply, and the bot reproduces exactly that without learning the context. Or it might have associated the smiley with certain words and attaches it to those words, which is a very good sign.

Human: If I have a blue box, what color is my box?

Bot: What did you say ? Are you all right ?
Human: are you drunk?
Bot: I ‘ m him .
Human: tell me a good name for my son?
Bot: ah , i have windows , so it ‘ s still good.
Human: tell me the name of your company
Bot: It ‘ s a named horror de de de de .
Human: who is your daddy?
Bot: The name, then I must kill it ?
Human: are you gay?
Bot: yes, thanks :)

In the above sentences, the bot replies with answers that make sense syntactically but lack logic. For the question ‘are you drunk’, for example, the bot treats it as similar to ‘are you a computer’ and replies with an affirmative answer. Structurally this is consistent: the word ‘drunk’ might be new to the bot (not in the vocabulary), and the bot associated it more closely with ‘computer’ than with ‘human’. Similarly, for the question ‘tell me the name of your company’, the bot generates a fictitious name, ‘horror de de de de’, which is a very good sign for a generative model. It understands the question to some extent, and it came up with a great name, I would say.

EXPERIMENTATION

Reversing the word order of the input sentences and feeding them to the encoder last to first proved very useful in machine translation. The likely reason is that the first words of the original sentence are fed to the encoder last and so end up closer to the first words produced by the decoder, which makes learning easier. I tried the same for the chatbot and could not see a significant difference. This might be because I am training on a very large corpus, which makes the results difficult to evaluate. But after training both ways and asking similar questions, I could not see much difference.

I tried more LSTM layers and cells (4 LSTM layers with 1024 cells), and the training time increased so much that I stopped the model after 30k iterations and switched back to 3 LSTM layers with 512 cells. More LSTM layers might have performed better, but there was a tradeoff between time and performance, and I stuck with 3 layers for the moment. I will definitely try more LSTM layers and share the results in the future.

I tried different learning rates and finally settled on a high learning rate at the start (0.15) with a learning rate decay of 0.9 after that. Many other learning rates can be tried.

I kept my vocabulary limited to 50k words, which covers the most frequently used words in English. It can be increased or decreased based on the use case. Data cleanup and preprocessing can likewise be adapted to the use case.
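The input reversal experiment from this section amounts to a one-line change in preprocessing. The sketch below is illustrative only: the whitespace tokenization and the name `make_training_pair` are assumptions. Only the encoder input is reversed, and the decoder target keeps its natural order.

```python
def make_training_pair(question, answer, reverse_input=True):
    """Tokenize a question/answer pair, optionally reversing the encoder input."""
    source = question.split()
    if reverse_input:
        source = source[::-1]  # feed the encoder last word first
    target = answer.split()    # the decoder target is never reversed
    return source, target

reversed_pair = make_training_pair("how are you", "i am good")
plain_pair = make_training_pair("how are you", "i am good", reverse_input=False)
print(reversed_pair)
print(plain_pair)
```

Training one model with `reverse_input=True` and one with `reverse_input=False` on the same data gives the two variants compared above.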

FUTURE SCOPE

In the future, the model can be integrated with memory networks such as end-to-end memory networks, or with an attention mechanism, where the model learns to focus on certain words while generating answers.
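To give a feel for what focusing on certain words means, here is a minimal sketch of dot-product attention in plain Python. It is one common variant chosen for illustration, not something implemented in this bot, and the vectors and the name `attend` are made up. The decoder state is compared with every encoder state, the scores are turned into weights with a softmax, and the weighted average becomes the context used for the next word.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(decoder_state, encoder_states):
    """Return attention weights and the weighted sum of the encoder states."""
    scores = [sum(d * e for d, e in zip(decoder_state, enc)) for enc in encoder_states]
    weights = softmax(scores)
    dim = len(decoder_state)
    context = [sum(w * enc[k] for w, enc in zip(weights, encoder_states)) for k in range(dim)]
    return weights, context

# Hypothetical encoder states for a three-word input and one decoder state.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
decoder_state = [2.0, 0.0]

weights, context = attend(decoder_state, encoder_states)
print(weights)
print(context)
```

In this toy example the first input word gets the largest weight because its encoder state points in the same direction as the decoder state.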

REFERENCES

Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation
End-To-End Memory Networks
Memory Networks
Sequence to Sequence Learning with Neural Networks
Deep Learning for Chatbots, Part 1 - Introduction
Understanding LSTM Networks
Tensorflow Recurrent Neural Network
The Unreasonable Effectiveness of Recurrent Neural Networks