# Week 11 - Text Analysis & Generation
## Markov Chain Text Analysis with primitive Python
### Introduction to Markov Chains
A Markov chain is a way to predict what comes next in a sequence based on the current state, without considering the past.
Think of it like this: if you have a bunch of text, you look at each word and see what words usually follow it. For example, after “I want,” you might often see “to eat” or “to sleep.”
When you want to generate new text, you start with a word and pick the next word based on what usually follows it. You keep doing this to create a sentence that sounds somewhat natural, but it might not always make total sense.
So, basically, a Markov chain helps create random text that resembles real conversation by using patterns in word sequences.
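For example, after seeing only the phrases "I want to eat" and "I want to sleep", the follower table would look something like this toy sketch:

```python
# Followers observed after each word in a tiny two-phrase corpus.
followers = {
    'I': ['want', 'want'],
    'want': ['to', 'to'],
    'to': ['eat', 'sleep'],
}
# Starting from 'I' and repeatedly picking a random follower, we generate
# "I want to eat" or "I want to sleep" with equal probability.
```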
- Markov Chains are VERY primitive examples of one part of LLMs.
Language models use a loosely similar idea to predict the next word in a sequence, based on the previous words and the attention given to them. By analyzing the patterns in a text, LLMs can learn to generate new text that mimics the original style, vocabulary, and even sentence structure.
- Today, we will be taking small steps toward building our own SIMPLE Markov chain generator.
### Unique Words
- Counting Unique Words in a text file, though not with any finesse.
```python
unique_words = {}

for line in open(filename):
    for word in line.split():
        unique_words[word] = 1

print(len(unique_words))
```
- Relevance to Markov Objective: This is a basic step in text analysis, allowing us to understand the vocabulary richness of a text.
### Dealing with Punctuation, an Intro to Data Wrangling
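One way this cleaning step might look, sketched here with helper names (`split_line`, `clean_word`) chosen to match the bigram code further below:

```python
import string

def split_line(line):
    """Treat dashes as word separators, then split on whitespace."""
    return line.replace('-', ' ').split()

def clean_word(word):
    """Strip punctuation and whitespace from both ends and lowercase."""
    return word.strip(string.punctuation + string.whitespace).lower()

unique_words = {}
for line in open(filename):
    for word in split_line(line):
        unique_words[clean_word(word)] = 1

print(len(unique_words))
```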
- Relevance to Markov Objective: Cleaning the text gives us more accurate word counts and analysis than the naive approach above!
### Word Frequencies
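Counting frequencies is only a small change to the unique-words loop; a minimal sketch, reusing `split_line` and `clean_word` from above:

```python
word_counter = {}

for line in open(filename):
    for word in split_line(line):
        word = clean_word(word)
        # get() supplies 0 the first time we see a word.
        word_counter[word] = word_counter.get(word, 0) + 1
```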
### Optional Parameters
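Judging from the later calls to `print_most_common`, this section likely defined a printing helper whose count is an optional parameter; the name `num` and the default of 10 below are assumptions:

```python
def print_most_common(counter, num=10):
    """Print the num most frequent keys, most frequent first."""
    items = sorted(counter.items(), key=lambda pair: pair[1], reverse=True)
    for key, count in items[:num]:
        print(count, key)

print_most_common(word_counter)         # top 10 by default
print_most_common(word_counter, num=3)  # override the optional parameter
```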
### Dictionary Subtraction
- Here we subtract one dictionary from another: removing every word that appears in a list of valid words leaves us with the words that aren't considered valid by our language standards.
```python
def subtract(d1, d2):
    res = {}
    for key in d1:
        if key not in d2:
            res[key] = d1[key]
    return res

# OR, the same thing as a dictionary comprehension:
def subtract_eff(d1, d2):
    return {key: d1[key] for key in d1 if key not in d2}

diff = subtract(word_counter, valid_words)  # valid_words would be loaded from a dictionary file
print_most_common(diff)
```
- Relevance: This technique can be used for spell-checking, identifying uncommon words, or finding unique vocabulary.
### Random Numbers
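A quick sketch of the built-in `random` module, which the text generator at the end relies on:

```python
import random

print(random.random())        # float in the range [0.0, 1.0)
print(random.randint(1, 6))   # integer from 1 to 6, inclusive
print(random.choice(['heads', 'tails']))  # uniform pick from a sequence
```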
### Bigrams
- A bigram is a sequence of two consecutive words or elements in a text or dataset, often used for analyzing relationships and patterns in language.
```python
window = []

def count_bigram(bigram):
    key = tuple(bigram)
    bigram_counter[key] = bigram_counter.get(key, 0) + 1

def process_word(word):
    window.append(word)
    if len(window) == 2:
        count_bigram(window)
        window.pop(0)

bigram_counter = {}

for line in open(filename):
    for word in split_line(line):
        word = clean_word(word)
        process_word(word)

print_most_common(bigram_counter)
```
- Relevance to Markov Objective: Analyzing bigrams helps us understand word associations and patterns.
### Markov Analysis
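One way to do the analysis is to extend the bigram pattern: instead of counting pairs, map each word to the list of words seen to follow it. A sketch; the names `successor_map`, `add_successor`, and `process_word_pair` are illustrative, not canonical:

```python
successor_map = {}
window = []

def add_successor(bigram):
    # Record that the second word of the pair was seen following the first.
    first, second = bigram
    if first not in successor_map:
        successor_map[first] = []
    successor_map[first].append(second)

def process_word_pair(word):
    window.append(word)
    if len(window) == 2:
        add_successor(window)
        window.pop(0)

for line in open(filename):
    for word in split_line(line):
        process_word_pair(clean_word(word))
```

Because repeated followers stay in the list, `random.choice` over `successor_map[word]` automatically weights the next word by how often it actually followed `word` in the text.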
### Generating Text
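Given a `successor_map` like the one sketched above, generation is just repeated random choice; a minimal sketch:

```python
import random

word = random.choice(list(successor_map))  # random starting word
for _ in range(50):
    print(word, end=' ')
    successors = successor_map.get(word)
    if not successors:  # dead end: this word never appeared mid-text
        break
    word = random.choice(successors)
print()
```

Each next word is drawn from what actually followed the current word in the source, so the output sounds locally plausible even when whole sentences don't quite make sense.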