In this notebook, we'll analyze the rhyme distributions in Shakespeare's sonnets. We'll start by extracting the sonnets from a Project Gutenberg webpage, then use nltk to analyze the sonnets' rhymes. We'll also use pandas to make some simple bar graphs (because who really has the time to work through the matplotlib api?).
This is supplementary material to my talk Words, Words, Words: Reading Shakespeare with Python
Download this notebook here
import requests
from lxml import etree
import nltk
import string
import pandas as pd
# include this line to generate graphs in the body of the notebook
%matplotlib inline
def pull_sonnets():
    # fetch the Project Gutenberg HTML for the sonnets
    sonnets_html = requests.get('http://www.gutenberg.org/files/1041/1041-h/1041-h.htm').content
    html = etree.HTML(sonnets_html)
    # each sonnet lives in a <p class="poem"> element
    poems_elements = html.xpath("//p[@class='poem']")
    clean_sonnets = []
    for element in poems_elements:
        # join the element's text fragments into one newline-separated string,
        # dropping any blank fragments
        clean_sonnet = "\n".join([text.strip() for text in element.itertext() if text.strip()])
        clean_sonnets.append(clean_sonnet)
    return clean_sonnets
sonnets = pull_sonnets()
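As a quick sanity check (this cell is just illustrative, not part of the analysis), we can count what we pulled and peek at the first poem. Shakespeare published 154 sonnets, so we'd expect a count in that neighborhood, though the exact number depends on how the Gutenberg page marks up its poem paragraphs.
# how many poems did the scrape find? (154 sonnets were published)
print(len(sonnets))
# the opening lines of the first poem
print("\n".join(sonnets[0].split("\n")[:4]))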
def last_word(line):
    """
    Takes a line and returns the last word in the line.
    """
    # split the line into words
    words = line.split()
    # take the last word
    last = words[-1]
    # remove leading/trailing punctuation
    # and make lowercase
    last = last.strip(string.punctuation).lower()
    return last
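As a quick check (again, just illustrative), the opening line of Sonnet 18 ends in "day?", which last_word should reduce to "day":
# punctuation is stripped and the word is lowercased
last_word("Shall I compare thee to a summer's day?")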
def get_rhymes(sonnets):
    rhymes = []
    for sonnet in sonnets:
        lines = sonnet.split('\n')
        # skip any poem that isn't a standard 14-line sonnet
        if len(lines) != 14:
            continue
        # since we know the sonnets share the same rhyme scheme
        # (abab cdcd efef gg), we can "hard code" it here:
        # within each quatrain, line i rhymes with line i + 2
        for index in range(12):
            if index % 4 in [0, 1]:
                pair = (last_word(lines[index]), last_word(lines[index + 2]))
                rhymes.append(pair)
        # the closing couplet rhymes with itself
        rhymes.append((last_word(lines[12]), last_word(lines[13])))
    return rhymes
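To make the hard-coded scheme concrete, here's a small sketch (not part of the analysis) that replays the loop's condition and shows which zero-based line indices end up paired:
# the quatrain loop pairs these line indices:
# (0, 2), (1, 3), (4, 6), (5, 7), (8, 10), (9, 11)
print([(i, i + 2) for i in range(12) if i % 4 in [0, 1]])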
Here we're going to use nltk's FreqDist class to create a frequency distribution of the rhymes in the sonnets. So, at the end, we'll get a mapping of each rhyme to the number of times it occurs.
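If FreqDist is unfamiliar, it behaves much like Python's collections.Counter: hand it an iterable of samples and it counts them. A toy sketch on made-up pairs (not the sonnet data):
toy_fd = nltk.FreqDist([("a", "b"), ("a", "b"), ("c", "d")])
toy_fd.most_common()  # [(('a', 'b'), 2), (('c', 'd'), 1)]
Now for the real rhyme pairs: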
rhymes = get_rhymes(sonnets)
fd = nltk.FreqDist(rhymes)
for rhyme, freq in fd.most_common(10):
    print(rhyme, freq)
df = pd.DataFrame(fd.most_common(10))
df.columns = ["rhyme", "frequency"]
# reverse the row order so the most common rhyme ends up at the top of the chart
df.sort_index(ascending=False).plot(
    kind='barh',
    x='rhyme',
    title="Most Common Rhymes in the Sonnets",
)
Here we're going to build something a little more nuanced. This time, we want to know: for a given word, what is the frequency distribution of the words that rhyme with it? To answer that question, we use a ConditionalFreqDist.
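A ConditionalFreqDist is built from (condition, sample) pairs and keeps a separate FreqDist for each condition. A toy sketch on made-up pairs (not the sonnet data):
toy_cfd = nltk.ConditionalFreqDist([("day", "may"), ("day", "gay"), ("day", "may")])
toy_cfd["day"].most_common()  # [('may', 2), ('gay', 1)]
Now back to the rhyme data: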
# rhyming is symmetric, so add the reverse of each pair:
# if (a, b) is in the list, (b, a) should be too
rhymes = rhymes + [tuple(reversed(rhyme)) for rhyme in rhymes]
cfd = nltk.ConditionalFreqDist(rhymes)
for word, freq in cfd["thee"].most_common():
    print(word, freq)
# note that cfd["thee"] is itself a frequency distribution:
type(cfd['thee'])
# plot this distribution using pandas
df2 = pd.DataFrame(cfd['thee'].most_common(), columns=['word', 'frequency'])
# reverse the row order so the most frequent word ends up at the top of the chart
df2.sort_index(ascending=False).plot(
    kind='barh', x='word', title='Words Most Commonly Rhymed with "thee" in the Sonnets'
)