In this notebook, we'll analyze the rhyme distributions in Shakespeare's sonnets. We'll start by extracting the sonnets from a Project Gutenberg webpage, then use nltk to analyze the sonnets' rhymes. We'll also use pandas to make some simple bar graphs (because who really has the time to work through the matplotlib API?).
This is supplementary material to my talk "Words, Words, Words: Reading Shakespeare with Python"
Download this notebook here
import requests
from lxml import etree
import nltk
import string
import pandas as pd

# include this line to generate graphs in the body of the notebook
%matplotlib inline
def pull_sonnets():
    sonnets_html = requests.get('http://www.gutenberg.org/files/1041/1041-h/1041-h.htm').content
    html = etree.HTML(sonnets_html)
    poem_elements = html.xpath("//p[@class='poem']")
    clean_sonnets = []
    for element in poem_elements:
        clean_sonnet = "\n".join([text.strip() for text in element.itertext() if text.strip()])
        clean_sonnets.append(clean_sonnet)
    return clean_sonnets
sonnets = pull_sonnets()
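The heavy lifting above is lxml's itertext(), which walks every run of text inside an element. It behaves like the stdlib xml.etree.ElementTree method of the same name, so here's a minimal sketch of the idea on a toy snippet (the markup below is made up for illustration, not Gutenberg's actual HTML):

```python
import xml.etree.ElementTree as ET

# a toy poem element: the <br/> tag splits the text into several runs,
# much as the line breaks in the Gutenberg markup do
poem = ET.fromstring(
    '<p class="poem">Shall I compare thee<br/> to a summer\'s day?</p>'
)

# itertext() yields every run of text inside the element, in order;
# we strip whitespace, drop empty runs, and rejoin with newlines
lines = [text.strip() for text in poem.itertext() if text.strip()]
clean = "\n".join(lines)
print(clean)
```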
def last_word(line):
    """Takes a line and returns the last word in the line."""
    # split the line into words
    words = line.split()
    # take the last word
    last = words[-1]
    # remove leading/trailing punctuation and make lowercase
    last = last.strip(string.punctuation).lower()
    return last
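For example, on the opening line of Sonnet 18 (repeating the definition here so the snippet stands alone):

```python
import string

def last_word(line):
    """Takes a line and returns the last word in the line."""
    words = line.split()
    last = words[-1]
    # strip trailing punctuation like '?' or ':' and lowercase
    return last.strip(string.punctuation).lower()

print(last_word("Shall I compare thee to a summer's day?"))  # day
```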
def get_rhymes(sonnets):
    rhymes = []
    for sonnet in sonnets:
        lines = sonnet.split('\n')
        if len(lines) != 14:
            continue
        # since we know sonnets have the same rhyme scheme
        # (abab cdcd efef gg) we can "hard code" it here
        for index in xrange(12):
            if index % 4 in [0, 1]:
                pair = (last_word(lines[index]), last_word(lines[index + 2]))
                rhymes.append(pair)
        # the closing couplet (gg)
        rhymes.append((last_word(lines[12]), last_word(lines[13])))
    return rhymes
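As a quick sanity check on the hard-coded scheme (not part of the original notebook), we can list which line indices end up paired:

```python
# indices paired by the abab cdcd efef part of the scheme:
# line i rhymes with line i + 2 whenever i % 4 is 0 or 1
pairs = [(i, i + 2) for i in range(12) if i % 4 in (0, 1)]
# plus the closing couplet, lines 12 and 13 (gg)
pairs.append((12, 13))
print(pairs)
# [(0, 2), (1, 3), (4, 6), (5, 7), (8, 10), (9, 11), (12, 13)]
```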
Here we're going to use nltk's FreqDist class to create a frequency distribution of the rhymes in the sonnets. At the end, we'll get a mapping of each rhyme to the number of times it occurs.
rhymes = get_rhymes(sonnets)
fd = nltk.FreqDist(rhymes)
for rhyme, freq in fd.most_common(10):
    print rhyme, freq
(u'thee', u'me') 14
(u'me', u'thee') 9
(u'thee', u'be') 8
(u'days', u'praise') 6
(u'heart', u'art') 5
(u'heart', u'part') 5
(u'love', u'prove') 5
(u'prove', u'love') 4
(u'face', u'disgrace') 4
(u'eyes', u'lies') 4
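A FreqDist is essentially a counter keyed on hashable items, much like the stdlib collections.Counter; here's the same idea sketched with Counter and a few made-up pairs (not the real data):

```python
from collections import Counter

# tuples are hashable, so each (word, word) rhyme pair
# gets its own tally, just as with nltk.FreqDist
toy_rhymes = [('thee', 'me'), ('day', 'may'), ('thee', 'me')]
fd = Counter(toy_rhymes)
print(fd.most_common(1))  # [(('thee', 'me'), 2)]
```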
df = pd.DataFrame(fd.most_common(10))
df.columns = ["rhyme", "frequency"]
df.sort(ascending=False).plot(
    kind='barh',
    x='rhyme',
    title="Most Common Rhymes in the Sonnets",
)
<matplotlib.axes._subplots.AxesSubplot at 0x1078a3410>
Here we're going to build something a little more nuanced. We want to know this: for a given word, what is the frequency distribution of the words that rhyme with it?
To answer that question, we use nltk's ConditionalFreqDist class. Since a rhyme works in both directions, we first extend the list with the reverse of each pair.
rhymes = rhymes + [tuple(reversed(rhyme)) for rhyme in rhymes]
cfd = nltk.ConditionalFreqDist(rhymes)

for word, freq in cfd["thee"].most_common():
    print word, freq
me 23
be 12
see 5
thee 2
melancholy 1
free 1
posterity 1
usury 1
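Under the hood, ConditionalFreqDist takes (condition, sample) pairs and keeps one frequency distribution per condition (here, the first word of each pair). With stdlib pieces, the same idea looks roughly like this (toy data, not the real counts):

```python
from collections import Counter, defaultdict

toy_rhymes = [('thee', 'me'), ('thee', 'be'), ('thee', 'me'), ('day', 'may')]

# group the second word of each pair under the first,
# mimicking nltk.ConditionalFreqDist(toy_rhymes)
cfd = defaultdict(Counter)
for condition, sample in toy_rhymes:
    cfd[condition][sample] += 1

print(cfd['thee'].most_common())  # [('me', 2), ('be', 1)]
```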
# note that cfd["thee"] is itself a frequency distribution:
type(cfd['thee'])
# plot this distribution using pandas
df2 = pd.DataFrame(cfd['thee'].most_common(), columns=['word', 'frequency'])
df2.sort(ascending=False).plot(
    kind='barh',
    x='word',
    title='Words Most Commonly Rhymed with "thee" in the Sonnets',
)
<matplotlib.axes._subplots.AxesSubplot at 0x1078fa450>