RCE Endeavors 😅

October 6, 2021

Making a Discord Chat Bot using Markov Chains (2/2)

Filed under: Programming — admin @ 12:35 PM

Continuing on from the previous post, this post will describe the steps involved in creating a Discord bot to write randomly generated chat to a server. Before continuing with this post, make sure to set up a bot application and add it to your server.

There are four main steps needed to get up and running:

  • Get the corpus of chat text
  • Normalize the data
  • Build a Markov chain from the data
  • Connect to the server and listen for commands

Getting the chat text

Unlike the previous post, where we used an existing corpus of text, we now need to generate one ourselves. This text will come from historical chat messages for a particular server channel. We can simply grab this data and write the raw messages out to a file. To do the heavy lifting of interacting with Discord, we will rely on Discord.py. This library will also be used to build the bot functionality later on as well.

The library provides a set of client events that we can hook in to. For this, we need to implement a handler for the on_ready event. When this event is triggered, our chat scraper will get the channel information via get_channel, and then get the message history from the channel.history property. Once we have the message history, we just iterate over each message and write the message content out to a file. Put into code, it looks like the following:

async def on_ready(self):
    print("Logged on as {}!".format(self.user))

    channel = self.get_channel(self.channel_id)
    if channel is None:
        print("Could not find channel {}".format(self.channel_id))
        sys.exit(-1)

    print("Found channel #{} for id {}. Reading messages...".format(channel.name, self.channel_id))
    with open(self.output_path , "w+", encoding="utf-8") as output_file:
        async for message in channel.history(limit=None, after=self.last_date):
            output = message.content + "\n"
            output_file.write(output)

That’s it for getting the chat text. After running the chat scraper against a channel, there should be an output file containing the raw message content. The next step is to clean the data up a bit and normalize it.

Normalizing the data

As described in the previous post, we want to add a normalization step in our data processing. This is done with the goal of improving the quality of sentences that the Markov chain will generate. By standardizing on capitalization, filtering out punctuation, ignoring non-printable text, and filtering out data that isn’t considered chat, the Markov chain is better able to model sentence structure. To perform the normalization, a new script is created which takes an input file containing raw chat data, and an output path to write the normalized data to. The normalization is kept pretty basic: we read in a line of text, strip out unwanted attributes (non-alphanumeric characters, extra whitespace, URLs), and write the normalized line to the output file.

def normalize_text(line):
    if not line:
        return ""

    trimmed_line = " ".join(line.split())
    split_line = ""
    for word in trimmed_line.split(" "):
        if validators.url(word):
            continue

        split_line += word.lower() + " "

    if not split_line or split_line.isspace():
        return ""

    pattern = re.compile(r"[^A-Za-z ]+", re.UNICODE)
    normalized_line = pattern.sub("", split_line)

    return normalized_line

def main(args):
    print("Reading input file {}. Writing output to {}.".format(args.inputfile, args.outputfile));

    with open(args.outputfile, "w+", encoding="utf-8") as output_file:
        with open(args.inputfile, encoding="utf-8") as input_file:
            for input_line in input_file:
                output_line = normalize_text(input_line.rstrip())
                if output_line: 
                    output_file.write(output_line)

    print("Finished processing.")

After running the script, we have an output file with better data. It is certainly by no means perfect: normalizing chat messages is a very challenging problem. Even after stripping out some “non-chat” portions of the message, there are still issues with the data: a message may contain typos, a message may not be a complete sentence, a message can be non-nonsensical, i.e. a user spamming random keys, and a whole host of other issues. These issues are acknowledged, but hand-waved away in this post, since the purpose is to create a fun and simple chat bot, and not something that tries to generate the most realistic sentences possible.

Putting everything together into a bot

At this point, we should have enough knowledge to put everything together. We have the background information for how Markov chains work and how to create them, we have a (kind of) normalized data set to work with, and we have the appropriate library available to interact with Discord. We have to now glue these things together: we will create a program that takes in an input file containing chat data, and the bot token. This program will read the chat data and build the Markov chain for it — the code from the previous article will be re-used here. The bot will then connect to Discord and wait for on_message events. Once an event is received, the bot will generate a random sentence and send it to the channel. Putting this into code, you get the following:

import argparse
import collections
import discord
import random
import re

class MarkovBot(discord.Client):

    def __init__(self, args):
        discord.Client.__init__(self)

        self.token = args.token

        with open(args.inputfile) as input_file:
            text = input_file.read()

        self.markov_table = self.create_markov(text)
        self.word_list = list(self.markov_table.keys())

    def generate_sentence(self, markov_table, seed_word, num_words):
        if seed_word in markov_table:
            sentence = seed_word.capitalize()
            for i in range(0, num_words):
                next_word = random.choice(markov_table[seed_word])
                seed_word = next_word

                sentence += " " + seed_word

            return sentence
        else:
            print("Word {} not found in table.".format(seed_word))

    def create_markov(self, normalized_text):
        words = normalized_text.split()
        markov_table = collections.defaultdict(list)
        for current, next in zip(words, words[1:]):
            markov_table[current].append(next)

        return markov_table

    def run(self):
        super().run(self.token)

    async def on_ready(self):
        print("Logged on as {}!".format(self.user))

    async def on_message(self, message):
        if message.author == self.user:
            return

        response = ""
        if message.content == "!talk":
            response = self.generate_sentence(self.markov_table, random.choice(self.word_list), random.randint(7, 25))

        if response:
            await message.channel.send(response)

def main(args):
    client = MarkovBot(args)
    client.run()

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    optional = parser._action_groups.pop()

    required = parser.add_argument_group("required arguments")
    required.add_argument("-t", "--token", required=True)
    required.add_argument("-i", "--inputfile", required=True)

    parser._action_groups.append(optional) 
    args = parser.parse_args()

    main(args)

Usage

There are three scripts that need to be executed in order to get the bot running. These scripts are developed throughout this series and are also available on Github.

python .\chat_scraper.py -c <Channel ID> -t <Bot Token> -o <Raw Data Filename>
python .\text_normalizer.py -i <Raw Data Filename> -o <Normalized Data Filename>
python .\markov_bot_simple.py -t <Bot Token> -i <Normalized Data Filename>

Examples

Razer black plague for that whole thing the scene down with us

Betting on youtube. thats one hand into play occasionally.

Happening. im getting a fukkin animal not sure

Radio show. i completely forgot about it was brought portal ingredients. i dont like gme momentum has blue screen combo.

No Comments »

No comments yet.

RSS feed for comments on this post. TrackBack URL

Leave a comment

 

Powered by WordPress