Week 11 - Text Analysis & Generation

Markov Chain Text Analysis with primitive Python

Introduction to Markov Chains

A Markov chain is a way to predict what comes next in a sequence based only on the current state, without considering how you got there.

Think of it like this: if you have a bunch of text, you look at each word and see what words usually follow it. For example, after “I want,” you might often see “to eat” or “to sleep.”

When you want to generate new text, you start with a word and pick the next word based on what usually follows it. You keep doing this to create a sentence that sounds somewhat natural, but it might not always make total sense.

So, basically, a Markov chain helps create random text that resembles real conversation by using patterns in word sequences.
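Before any real code, here's the whole idea in miniature. The successor map below is made up for illustration, not learned from a text:

import random

# Hypothetical mapping from each word to the words that have followed it
successors = {'i': ['want'], 'want': ['to'], 'to': ['eat', 'eat', 'sleep']}

word = 'i'
for _ in range(3):
    word = random.choice(successors[word])  # 'eat' appears twice, so it's twice as likely as 'sleep'
    print(word, end=' ')                    # e.g., "want to eat"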

  • Markov Chains are VERY primitive examples of one part of LLMs.

Large language models do something loosely analogous: they also predict the next word from the words that came before, but instead of conditioning on just the previous word, they use attention layers to weigh the entire preceding context. By analyzing the patterns in a text, LLMs learn to generate new text that mimics the original style, vocabulary, and even sentence structure.

  • Today, we will be taking small steps toward building our own SIMPLE Markov chain generator.

Unique Words

  • Counting Unique Words in a text file, though not with any finesse.

    filename = 'files/hyde.txt'  # the text file we analyze throughout this lecture
    
    unique_words = {}
    for line in open(filename):
        for word in line.split():
            unique_words[word] = 1  # dict keys double as a cheap set of distinct words
    
    print(len(unique_words))
    

  • Relevance to Markov Objective: This is a basic step in text analysis, allowing us to understand the vocabulary richness of a text.
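One crude way to put a number on that richness (a sketch of my own, not part of the lecture code) is the type-token ratio: distinct words divided by total words.

total = 0
unique_words = {}
for line in open(filename):
    for word in line.split():
        unique_words[word] = 1
        total += 1

print(len(unique_words) / total)  # closer to 1.0 means a more varied vocabulary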

Dealing with Punctuation, an Intro to Data Wrangling

import unicodedata  # unicodedata.category tells us a character's class; punctuation categories start with 'P'

def get_punctuation(filename):
    punc_marks = {}
    with open(filename) as fp:
        for line in fp:
            for char in line:
                category = unicodedata.category(char)
                if category.startswith('P'):
                    punc_marks[char] = 1  # collect every distinct punctuation mark in the file
    return ''.join(punc_marks)

punctuation = get_punctuation('files/hyde.txt')

def split_line(line):
    return line.replace('—', ' ').split()  # treat em dashes as spaces, then split on whitespace

def clean_word(word, punctuation):
    return word.strip(punctuation).lower()  # strip surrounding punctuation and lowercase
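
A quick sanity check of the helpers (assuming these particular marks occur in hyde.txt, so they made it into punctuation):

print(split_line('man—about'))                # ['man', 'about']
print(clean_word('“Utterly!”', punctuation))  # utterly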

  • Relevance to Markov Objective: Cleaning the text ensures more accurate word counting and analysis than the naive version above!

Word Frequencies

word_counter = {}
for line in open(filename):
    for word in split_line(line):
        word = clean_word(word, punctuation)
        word_counter[word] = word_counter.get(word, 0) + 1  # get() defaults to 0 the first time we see a word

print_most_common(word_counter)  # this function is defined below!

  • Relevance to Markov Objective: Understanding word frequencies helps identify important words and themes in a text.

Optional Parameters

def print_most_common(word_counter, num=5):  # num is optional: if you pass no argument, it defaults to 5
    items = sorted(word_counter.items(), key=lambda x: x[1], reverse=True)  # sort by count, largest first
    for word, freq in items[:num]:
        print(freq, word, sep='\t')
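
Calling it both ways shows what the default buys us:

print_most_common(word_counter)      # no second argument: prints the top 5
print_most_common(word_counter, 10)  # explicit argument overrides the default: top 10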

  • Relevance to Markov Objective: Optional parameters make functions more flexible and reusable, though they have no direct relevance to Markov text analysis.

Dictionary Subtraction

  • Here we want to remove the words that aren’t considered valid words by our language standards.
    def subtract(d1, d2):
        res = {}
        for key in d1:
            if key not in d2:   # keep only the keys from d1 that don't appear in d2
                res[key] = d1[key]
        return res
    
    # OR, as a dictionary comprehension:
    
    def subtract_eff(d1, d2):
        return {key: d1[key] for key in d1 if key not in d2}
    
    diff = subtract(word_counter, valid_words)  # valid_words is loaded from a dictionary file (see the sketch after this list)
    print_most_common(diff)
    
  • Relevance: This technique can be used for spell-checking, identifying uncommon words, or finding unique vocabulary.
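The call above assumes valid_words already exists. Here's a minimal sketch of loading it, assuming a plain word list with one word per line (the path files/words.txt is hypothetical):

valid_words = {}
for line in open('files/words.txt'):    # hypothetical path to a word-list file
    valid_words[line.strip().lower()] = 1   # dict keys double as a set again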

Random Numbers

import random

words = list(word_counter)             # the distinct words
weights = list(word_counter.values())  # i.e., the frequency of each word
random_words = random.choices(words, weights=weights, k=6)  # k=6 means select 6 words
print(' '.join(random_words))  # print the "sentence"

  • Relevance to Markov Objective: Only relevant to show that random text generation, even when weighted by word frequency, sounds far more like gibberish than the Markov output we're building toward.

Bigrams

  • A bigram is a sequence of two consecutive words or elements in a text or dataset, often used for analyzing relationships and patterns in language.
    window = []          # sliding window holding the two most recent words
    bigram_counter = {}
    
    def count_bigram(bigram):
        key = tuple(bigram)  # lists can't be dict keys, so convert to a tuple
        bigram_counter[key] = bigram_counter.get(key, 0) + 1
    
    def process_word(word):
        window.append(word)
    
        if len(window) == 2:
            count_bigram(window)
            window.pop(0)    # slide the window forward by dropping the oldest word
    
    for line in open(filename):
        for word in split_line(line):
            word = clean_word(word, punctuation)
            process_word(word)
    
    print_most_common(bigram_counter)
    
  • Relevance to Markov Objective: Analyzing bigrams helps us understand word associations and patterns.

Markov Analysis

window = []          # reset the window before a fresh pass over the text
successor_map = {}

def add_bigram(bigram):
    first, second = bigram
    successor_map.setdefault(first, []).append(second)  # map each word to the list of words that follow it

def process_word_bigram(word):
    window.append(word)

    if len(window) == 2:
        add_bigram(window)
        window.pop(0)

for line in open(filename):
    for word in split_line(line):
        word = clean_word(word, punctuation)
        process_word_bigram(word)

print(successor_map['going'])  # every word that follows 'going' in the text, repeats included

  • Relevance: Markov analysis is the foundation for generating text by predicting the next word based on the current word or sequence of words.

Generating Text

word = 'going'  # seed the chain with a starting word
for i in range(10):
    successors = successor_map[word]   # all the words that followed the current word
    word = random.choice(successors)   # pick one at random; repeats make a word more likely
    print(word, end=' ')

  • Relevance: This is the final goal of the lecture: generating text that mimics the original style. It's still gibberish, however!
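As a closing sketch of my own (not from the lecture), the same loop wrapped in a function that seeds the chain explicitly and stops cleanly if it reaches a word with no recorded successors:

import random

def generate_text(successor_map, seed, length=10):
    word = seed
    output = [word]
    for _ in range(length - 1):
        successors = successor_map.get(word)
        if not successors:            # dead end: this word was never followed by anything
            break
        word = random.choice(successors)
        output.append(word)
    return ' '.join(output)

print(generate_text(successor_map, 'going'))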