assignments

Programming A to Z – Assignment #5 Markov v vokraM

The class preceding our fifth assignment was on Markov chains, which involve statistical models of text that can be used to predict what character (or word, or some other unit) would be likely to follow a preceeding series of n characters (or words or other units). (An interesting paper of such an algorithm from the assigned reading can be found here.) Our assignment, with a few suggestions, was simply to:

Modify, augment, or replace the in-class Markov chain example.

As presented by Adam, the MarkovFilter example looked at each series of, say, three characters to build a model of what the next character is most likely to be, and then used this model to generate new lines of text by starting with an initial series of three characters used by the actual text, choosing the next character based on the first three characters, choosing the character after that based on the new most recent three characters (the latter two from the first set of three, and the new fourth character), and so on.

It occurred to me that perhaps it was unnecessary that the algorithm examine the text as we read it, from left to right. Instead, I wanted to rewrite MarkovFilter.java to work backwards – starting with the last three characters of a line and working backwards instead of forwards, looking at the set of three characters at the (temporary) beginning of the line, choosing a new first character for the line from a modified statistical model about which character is most likely to precede them, and repeating until the line is of the desired length. VokramFilter.java represents this reversed Markov filter, and the zip file of all the classes is here.

The text this generated seemed approximately as similar to English as that generated using the forward looking method, and I considered for a while how to be sure this is the case. I suspected that the operation performed by backward-looking Vokram analysis was equivalent to reversing the text that was input to a forward-looking Markov algorithm and then reversing the result, and it seemed like that operation would do the same thing as the Markov algorithm on its own. Yet I couldn’t quite work out a more thorough proof of that intuition, and will see if Adam has any insights. I considered doing a comparative analysis of several large texts using both MarkovFilter and VokramFilter (by comparing the Markov analysis of a text generated from applying VokramFilter to an original text to a Markov analysis of that actual original text), but didn’t have a chance.

A2Z
ITP
assignments

Programming A to Z – Assignment #4 Concordance Sorting

The class prior to the fourth assignment covered concordances, data structures for word counts, and other related topics. I completed one of the more challenging suggested alternative tasks for the assignment:

Investigate Java’s Collections class. See if you can figure out how to use Collections.sort() to sort the output of ConcordanceFilter.java—first in alphabetical order, then ordered by word count. (See the official Sun tutorial.)

The Java documentation was relatively clear about what I needed to do using the Collections and Comparator classes to get it working, and Google answered any remaining questions I had about syntax. There are a few files that I edited to run it (including AlphabeticComparator and a WordCountComparator classes), and you can download a zip file of the assignment here. When run, ConcordanceFilter.java will search for a word within a text and output each line on which that word occurred, then output those lines again in alphabetical order, and then output those lines a third time with the line with the fewest words first. For example:

$ java ConcordanceFilter place <lovecraft_dreams.txt
All contexts
remote place beyond the horizon, showing the ruin and antiquity of the city,
over a bridge to a place where the houses grew thinner and thinner. And it was
Contexts sorted alphabeticallydsfadsfd
over a bridge to a place where the houses grew thinner and thinner. And it was
remote place beyond the horizon, showing the ruin and antiquity of the city,
Contexts sorted by word count
remote place beyond the horizon, showing the ruin and antiquity of the city,
over a bridge to a place where the houses grew thinner and thinner. And it was

A2Z
ITP
assignments

Programming A to Z – Assignment #3 URLFinder.java

The third assignment was posted here, and was to

Make a program that creatively transforms or performs analysis on a text using regular expressions. The program should take its input from the keyboard and send its to the screen (or redirect to/from a file). Your program might (a) filter lines from the input, based on whether they match a pattern; (b) match and display certain portions of each line; (c) replace certain portions of each line with new text; or (d) any combination of the above.

Sample ideas: Replace all words in a text of a certain length with a random word; find telephone numbers or e-mail addresses in a text; locate words within a word list that will have a certain score in Scrabble; etc.

Bonus challenge 1: Use one or more features of regular expression syntax that we didn’t discuss in class. Reference here.

Bonus challenge 2: Use one or more features of the Pattern or Matcher class that we didn’t discuss in class. Of particular interest: regex flags (CASE_INSENSITIVE, MULTILINE), “back references” in replaceAll. Matcher class reference here.

I’m planning on doing a much larger project involving analysis of links on Twitter, and I decided to do a very tiny piece of that project for this assignment. I used the XML results from a Twitter search as my input and used a regular expression to look for URLs in the individual tweets. I stored the URLs and the number of times each of them occured in a hashmap, and then printed that information at the end of the analysis.

Usage of Java’s HashMap, Set, and Iterator classes came back to me quickly, and the only tricky part was the regular expression. I ended up using
     <title>.*(http://)(\\S+)(.*)(</title>)+
The content of each message posted to Twitter is enclosed in a <title></title> tag, and including that in the regular expression insures that we don’t capture data that are not part of any message but still contain URLs. I require that tag at the beginning of the line and then look for any number of characters before the beginning of the URL, as represented by .*. Then I get all of the characters up until the first white space with (\\S+), any characters that happen to be after the end of the URL with (.*), and then finally the closing </title> tag, with a + to require at least one occurrence because I know it must be present. The .java file is here, and the compiled .class file is here. You’ll need to add Adam’s a2z.jar file to your classpath, so be sure to get that too if you want to recompile it.

The New York Times did a visualization of tweets during the Superbowl last week, and it was widely circulated on Twitter. A search for “http superbowl nyt” returns a a list of tweets in which most people are sharing links to that visualization, and the results of that search make suitable example input. One specific tinyurl link to the visualization is shared several times, and it demonstrates that the code is functional. The input file is here, and the output file is here.

A2Z
ITP
assignments

Programming A to Z – Assignment #2 Repunctuate.java

The second assignment was posted here, and was to

Create a program (using, e.g., the tools presented in class) that behaves like a UNIX text processing program (such as cat, grep, tr, etc.). Your program should take text as input (any text, or a particular text of your choosing) and output a version of the text that has been filtered and/or munged. Your program should use at least one method of Java’s String class that we didn’t discuss in class.
Be creative, insightful, or intentionally banal. Optional: Use the program that you created in tandem with another UNIX command line utility.

Expanding on/explicitly exacerbating the problem of punctuation I had last week with rearranging the couplets (when the couplets were reordered, you’d often get two lines ending with commas and then two lines ending with periods, and it distracted from the semantic munging I had intended), I wrote a quick little Java program to randomly replace marks of punctuation in the input file. It extends Adam’s TextFilter library, so it works like the command line tools we used last week. I kept certain characters (such as parenthesis and quotation marks) the same because I wanted to keep the text readable while making more subtle changes to the intonation and flow .

The .java file is here, and the compiled .class file is here. The original text of Robert Frost’s ‘Stopping By Woods On A Snowy Evening’ can be found here, and the repunctuated text can be found here. You’ll need to add Adam’s a2z.jar file to your classpath, so be sure to get that too if you want to recompile it.

The Repunctuate.java program also works nicely with the various command line utilities from last week. For example
grep , <frost.txt | java Repunctuate >output.txt
will first filter for only those lines in frost.txt with a comma, and will then repunctuate them and save the output.

A2Z
ITP
assignments

Programming A to Z – Assignment #1 Coupled Couplets

The first assignment was posted here, and was to

Use a combination of the UNIX commands discussed in class (along with any other commands that you discover) to compose a text. Your “source code” for this assignment will simply consist of what you executed on the command line. Indicate what kind of source text the “program” expects, and give an example of what text it generates. Use man to discover command line options that you might not have known about (grep -i is a good one).

I decided to work off of the sonnets.txt file that Adam Parrish (the instructor) had provided as a resource – it’s just a long list of (Shakespearean?) sonnets, with the ending couplets notable indented by two spaces. I decided to extract only these couplets and then reorder the lines slighty so that the AABBCCDDetc rhyme scheme was changed to ABABCDCDetc. It wasn’t going to be a fascinating work of ‘procedural poetics,’ but it seemed like an interesting challenge that would teach me a little more about the command line.

I had to Google around a lot and looked at the man pages to figure out how to make it work – it was somewhat frustrating dealing with UNIX syntax. The code with detailed comments is here, and the code without the comments is here (if you want to run it yourself, use the latter – it wasn’t working with the comments, and I need to ask Adam about it on Tuesday).

The output of the file when sonnets.txt (above) is used as the input is here. It should accept any input file with the couplets indented by two spaces and the other lines not indented, regardless of whether or not they are true sonnets. If there are an odd number of couplets, it will ignore the last one.

(I realize that the punctuation at the end of the lines gets somewhat messed up. This is fixable – I could put commas at the end of one line and periods at the end of the next – but probably isn’t worth the effort.)

A2Z
ITP
assignments