Programming A to Z - Assignments #6 and #7

More text processing.


Since it’s getting to be that time in the semester when I feel I should be focusing on final projects, I didn’t spend too much time on the context-free grammar and Bayesian classification assignments. My writeups for both are below.

I’ve studied context-free grammars in the past (ahh, undergrad memories of Ling120; syllabus here, though not the one I had), so I have a good sense of how they work. I made a few quick modifications to the grammar file used by Adam’s ContextFilter class to handle certain conjunctions, adverbs/adverbial phrases, and prepositional phrases. I also made some modifications to support quotations, but they aren’t particularly refined: I couldn’t come up with a simple solution to the nesting problem in the example below that didn’t require duplicating many of the other rules:

this smug dichotomy thought “ that smug walrus foregrounds this time that yelled ” the seagull that spins this important sea said “ that restaurant that sneezes has come ” “ ”

Thinking about ways to solve this problem highlighted how quickly CFGs can become complex. When, for example, do you want a period or comma at the end of a quotation? When is a single period or comma sufficient for multiple quotations? How do you resolve those situations programmatically?
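To give a sense of where the duplication comes from, here is a rough sketch of the quotation rules. The notation is illustrative BNF, the non-terminal names are my own, and the exact syntax in my test.grammar file differs; the point is that keeping a quoted sentence from containing a further quotation means giving it its own copies of S, NP, and VP with the quoting production left out:

    S -> NP VP
    VP -> V
    VP -> V-say " S-quoted "
    S-quoted -> NP-quoted VP-quoted
    NP-quoted -> Det N
    VP-quoted -> V

Multiply that duplication across all the adjective, adverb, and prepositional-phrase rules and the grammar roughly doubles in size, which is why my quotation support stayed rough.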

My test.grammar file is online here, and you can run it with Adam’s Java classes, which are zipped here. I recommend testing only a few of my rules at a time and commenting out the others; otherwise you might get sentences like this:

the important trombone yelled “ wow, the blue amoeba said ” oh, this blue thing interprets this seagull that said “ wow, the trombone sneezes ” “ but this amoeba said ” this suburb that quickly whines in that sea that daydreams by that corsage that quickly or habitually computes this dichotomy slowly yet habitually vocalizes and damn, that luxurious restaurant habitually prefers this seagull “ but that suburb interprets the time yet damn, that sea said ” that restaurant tediously slobbers “ yet wow, that seagull quickly yet slowly foregrounds the restaurant or the boiling hot time spins this bald restaurant of this trombone but that amoeba computes this smug restaurant but the seagull that quickly or slowly prefers this sea yelled ” oh, the thing that spins this restaurant tediously yet quietly whines “ or wow, this amoeba coughs through the important sea and oh, that time tediously yet slowly coughs yet oh, that trombone habitually computes that luxurious suburb or wow, that thing that said ” my, that amoeba that said “ my, that time tediously or slowly sneezes ” quietly yet tediously foregrounds that trombone that has come “ said ” my, the restaurant that habitually but quietly prefers the dichotomy that said “ damn, this boiling hot trombone quietly slobbers by the trombone that quietly coughs ” said “ oh, the amoeba habitually coughs ” “ yet damn, this seagull that spins the seagull that sneezes for the important trombone quietly coughs but my, this sea that slowly yet habitually has come foregrounds this dichotomy and that blue thing tediously slobbers ”

For the assignment on Bayesian classification, I combined Adam’s BayesClassifier.java with the previous Markov chain examples to use n-grams instead of words as the tokens for analysis. BayesNGramClassifier.java can be found here as a .txt file, and you can download all of the required files here. Note that you might have to increase the amount of memory available to Java to run the analysis with higher values of n; try something like java -Xmx1024m BayesNGramClassifier 10 shakespeare.txt twain.txt austen.txt sonnets.txt if you’re having trouble.
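The main change from Adam’s word-level version is the tokenizer: rather than splitting each line on whitespace, it slides a window of n characters across the text and treats each window as a token. A minimal sketch of that idea is below; the class and method names here are my own and don’t necessarily match what’s in BayesNGramClassifier.java.

    import java.util.ArrayList;
    import java.util.List;

    public class NGramTokenizer {

        // Collect every character n-gram in the text by sliding a
        // window of length n forward one character at a time.
        public static List<String> ngrams(String text, int n) {
            List<String> tokens = new ArrayList<String>();
            for (int i = 0; i + n <= text.length(); i++) {
                tokens.add(text.substring(i, i + n));
            }
            return tokens;
        }

        public static void main(String[] args) {
            // "shall" with n = 3 -> [sha, hal, all]
            System.out.println(ngrams("shall", 3));
        }
    }

The scoring itself (accumulating log probabilities per category, as far as I can tell from Adam’s code) is left alone; only the tokens change.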

I compared sonnets.txt to shakespeare.txt, twain.txt, and austen.txt, as in the example, using various values of n for the analysis. The data is below, with the word-level analysis in the first row. Note that higher numbers (i.e. those closer to zero) indicate a greater degree of similarity.

n     shakespeare.txt       twain.txt             austen.txt
word  -59886.12916378722    -64716.741899235174   -66448.68994538345
1     -311.94997977326625   -348.2797252624347    -351.8612295074756
2     -6688.356420467105    -6824.843204592283    -7076.488251510615
3     -46332.8806624305     -47629.58502376338    -49557.906858505885
4     -155190.04376334642   -161815.95665896614   -167839.50470553883
5     -350322.9494161118    -369897.08857782563   -379600.90797560615
6     -581892.4161591302    -620871.7848829604    -629557.118086935
7     -798094.5896325088    -851043.4785550251    -856926.3903304675
8     -977428.4318098201    -1033391.2297240103   -1037851.0025613104
9     -1106125.9701775634   -1153251.0919529926   -1159479.8816597122
10    -1184654.361656962    -1218599.6808817217   -1227484.9929278728
11    -1221770.2880299168   -1242286.351341775    -1255024.535274092
12    -1228641.7908902294   -1238848.5031651617   -1254404.827728626
13    -1214247.043351669    -1217403.6233457213   -1235480.9184978919
14    -1187489.0276571538   -1186476.2556523846   -1205959.6178494398
15    -1153511.2780243065   -1150529.1594209142   -1170746.8132369826
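My rough reading of these numbers (assuming Adam’s classifier follows the standard naive Bayes recipe; I haven’t worked through every line of it) is that each score is a sum of log probabilities of the sonnets.txt tokens under a category’s model:

    score(category) = sum over tokens of log P(token | category)

Every probability is less than one, so every term is negative; a total closer to zero therefore means the tokens were on average more likely under that model. That would also explain why the magnitudes balloon as n grows: long character n-grams are individually much rarer than words or short n-grams, so each one contributes a more negative term.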

When the n-grams reach 14 characters (which is very long, considering the average length of an English word), the analysis finally breaks down and no longer correctly classifies sonnets.txt as most similar to shakespeare.txt. Some values of n certainly perform better than others, but I’d need to delve further into the mathematics behind these numbers to do a more detailed analysis.