Monthly Archives: February 2009

Programming A to Z – Assignment #3 URLFinder.java

The third assignment was posted here, and was to

Make a program that creatively transforms or performs analysis on a text using regular expressions. The program should take its input from the keyboard and send its output to the screen (or redirect to/from a file). Your program might (a) filter lines from the input, based on whether they match a pattern; (b) match and display certain portions of each line; (c) replace certain portions of each line with new text; or (d) any combination of the above.

Sample ideas: Replace all words in a text of a certain length with a random word; find telephone numbers or e-mail addresses in a text; locate words within a word list that will have a certain score in Scrabble; etc.

Bonus challenge 1: Use one or more features of regular expression syntax that we didn’t discuss in class. Reference here.

Bonus challenge 2: Use one or more features of the Pattern or Matcher class that we didn’t discuss in class. Of particular interest: regex flags (CASE_INSENSITIVE, MULTILINE), “back references” in replaceAll. Matcher class reference here.

I’m planning on doing a much larger project involving analysis of links on Twitter, and I decided to do a very tiny piece of that project for this assignment. I used the XML results from a Twitter search as my input and used a regular expression to look for URLs in the individual tweets. I stored each URL and the number of times it occurred in a HashMap, and then printed that information at the end of the analysis.
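The counting step described above can be sketched roughly as follows. This is not the code from URLFinder.java: the class name UrlTallySketch, the method tally, and the example.com URLs are all made up for illustration, and it uses a LinkedHashMap (a HashMap subclass) only so that the printed order is predictable.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: names are illustrative and not taken from URLFinder.java.
class UrlTallySketch {

    // Count how many times each URL string appears in the input list.
    static Map<String, Integer> tally(List<String> urls) {
        // LinkedHashMap preserves first-seen order, so the report is stable.
        Map<String, Integer> counts = new LinkedHashMap<String, Integer>();
        for (String url : urls) {
            Integer seen = counts.get(url);
            counts.put(url, seen == null ? 1 : seen + 1);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Placeholder URLs, not taken from the real search results.
        List<String> found = Arrays.asList(
            "http://example.com/a", "http://example.com/b", "http://example.com/a");
        // Print one "count <tab> URL" line per distinct URL.
        for (Map.Entry<String, Integer> entry : tally(found).entrySet()) {
            System.out.println(entry.getValue() + "\t" + entry.getKey());
        }
    }
}
```

In a real run the list would be filled with whatever URLs the regular expression pulls out of the search results, one entry per occurrence.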

Usage of Java’s HashMap, Set, and Iterator classes came back to me quickly, and the only tricky part was the regular expression. I ended up using
     <title>.*(http://)(\\S+)(.*)(</title>)+
The content of each message posted to Twitter is enclosed in a <title></title> tag, and including that in the regular expression ensures that we don’t capture data that are not part of any message but still contain URLs. The pattern starts with the opening tag and then allows any number of characters before the beginning of the URL, as represented by .*. Then I get all of the characters up until the first white space with (\\S+), any characters that happen to be after the end of the URL with (.*), and then finally the closing </title> tag, which I know must be present (the trailing + allows repeats, though the group on its own already requires one occurrence). The .java file is here, and the compiled .class file is here. You’ll need to add Adam’s a2z.jar file to your classpath, so be sure to get that too if you want to recompile it.
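To make the pattern concrete, here is a minimal sketch that applies it to a single line with java.util.regex. The class name TitleUrlSketch, the method extractUrl, and the sample tweet text are assumptions for illustration; the original program is built on a2z.jar and may be structured quite differently.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: only the pattern string comes from the post.
class TitleUrlSketch {

    // Same expression as above, written as a Java string literal.
    static final Pattern TITLE_URL =
        Pattern.compile("<title>.*(http://)(\\S+)(.*)(</title>)+");

    // Returns the URL captured from the line, or null when nothing matches.
    static String extractUrl(String line) {
        Matcher m = TITLE_URL.matcher(line);
        if (!m.find()) {
            return null;
        }
        // Group 1 is the scheme and group 2 is the rest of the URL.
        return m.group(1) + m.group(2);
    }

    public static void main(String[] args) {
        // Made-up tweet text with a placeholder link.
        String line = "<title>Great viz http://tinyurl.com/example via NYT</title>";
        System.out.println(extractUrl(line)); // prints http://tinyurl.com/example
    }
}
```

One thing to keep in mind: because the leading .* is greedy, a title containing two links yields only the last one.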

The New York Times did a visualization of tweets during the Super Bowl last week, and it was widely circulated on Twitter. A search for “http superbowl nyt” returns a list of tweets in which most people are sharing links to that visualization, and the results of that search make suitable example input. One specific tinyurl link to the visualization is shared several times, and it demonstrates that the code is functional. The input file is here, and the output file is here.

Programming A to Z – Assignment #2 Repunctuate.java

The second assignment was posted here, and was to

Create a program (using, e.g., the tools presented in class) that behaves like a UNIX text processing program (such as cat, grep, tr, etc.). Your program should take text as input (any text, or a particular text of your choosing) and output a version of the text that has been filtered and/or munged. Your program should use at least one method of Java’s String class that we didn’t discuss in class.
Be creative, insightful, or intentionally banal. Optional: Use the program that you created in tandem with another UNIX command line utility.

Expanding on/explicitly exacerbating the problem of punctuation I had last week with rearranging the couplets (when the couplets were reordered, you’d often get two lines ending with commas and then two lines ending with periods, and it distracted from the semantic munging I had intended), I wrote a quick little Java program to randomly replace marks of punctuation in the input file. It extends Adam’s TextFilter library, so it works like the command line tools we used last week. I kept certain characters (such as parentheses and quotation marks) the same because I wanted to keep the text readable while making more subtle changes to the intonation and flow.
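The idea can be sketched without the TextFilter library, reading standard input directly. Everything here is an assumption rather than the contents of Repunctuate.java: the class name RepunctuateSketch, the method repunctuate, and especially the set of marks that get swapped (. , ; : ! ?), since the post only says that parentheses and quotation marks are left alone.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.Random;

// Hypothetical sketch: it does not use a2z.jar or the TextFilter class.
class RepunctuateSketch {

    // Assumed set of swappable marks; parentheses and quotes are not in it.
    static final String MARKS = ".,;:!?";

    // Replace every mark from MARKS with a randomly chosen mark from MARKS.
    static String repunctuate(String line, Random random) {
        StringBuilder out = new StringBuilder(line.length());
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (MARKS.indexOf(c) >= 0) {
                out.append(MARKS.charAt(random.nextInt(MARKS.length())));
            } else {
                out.append(c); // letters, spaces, parentheses, quotes pass through
            }
        }
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        // Read lines from standard input and write the result to standard output.
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        Random random = new Random();
        String line;
        while ((line = in.readLine()) != null) {
            System.out.println(repunctuate(line, random));
        }
    }
}
```

Because it reads standard input and writes standard output, a sketch like this can sit in a pipeline the same way the other command line tools do.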

The .java file is here, and the compiled .class file is here. The original text of Robert Frost’s ‘Stopping By Woods On A Snowy Evening’ can be found here, and the repunctuated text can be found here. You’ll need to add Adam’s a2z.jar file to your classpath, so be sure to get that too if you want to recompile it.

The Repunctuate.java program also works nicely with the various command line utilities from last week. For example,
grep , <frost.txt | java Repunctuate >output.txt
will first filter for only those lines in frost.txt with a comma, and will then repunctuate them and save the output.