Counting links on Twitter.
The third assignment was posted here, and was to
Make a program that creatively transforms or performs analysis on a text using regular expressions. The program should take its input from the keyboard and send its output to the screen (or redirect to/from a file). Your program might (a) filter lines from the input, based on whether they match a pattern; (b) match and display certain portions of each line; (c) replace certain portions of each line with new text; or (d) any combination of the above.
Sample ideas: Replace all words in a text of a certain length with a random word; find telephone numbers or e-mail addresses in a text; locate words within a word list that will have a certain score in Scrabble; etc.
Bonus challenge 1: Use one or more features of regular expression syntax that we didn’t discuss in class. Reference here.
Bonus challenge 2: Use one or more features of the Pattern or Matcher class that we didn’t discuss in class. Of particular interest: regex flags (MULTILINE), “back references” in replaceAll. Matcher class reference here.
I’m planning on doing a much larger project involving analysis of links on Twitter, and I decided to do a very tiny piece of that project for this assignment. I used the XML results from a Twitter search as my input and used a regular expression to look for URLs in the individual tweets. I stored each URL and the number of times it occurred in a HashMap, and then printed that information at the end of the analysis.
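The counting step might look something like the sketch below. This is not the code from the assignment: the class name UrlTally, the method tally, and the example.com URLs are all made up for illustration, and it uses Map.merge (Java 8 and later) rather than an explicit Iterator.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class, not the assignment's actual code.
class UrlTally {
    // Count how many times each URL appears in the list.
    static Map<String, Integer> tally(List<String> urls) {
        Map<String, Integer> counts = new HashMap<>();
        for (String url : urls) {
            // Insert 1 for a new URL, otherwise add 1 to the stored count.
            counts.merge(url, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Placeholder URLs standing in for links pulled out of tweets.
        List<String> urls = Arrays.asList(
            "http://example.com/viz",
            "http://example.com/other",
            "http://example.com/viz");
        // Print each URL with its count once all input has been read.
        for (Map.Entry<String, Integer> entry : tally(urls).entrySet()) {
            System.out.println(entry.getValue() + "\t" + entry.getKey());
        }
    }
}
```

A HashMap does not keep its entries in any particular order, so the printed lines can come out in any order.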
Usage of Java’s HashMap, Set, and Iterator classes came back to me quickly, and the only tricky part was the regular expression. I ended up using
The content of each message posted to Twitter is enclosed in a <title></title> tag, and including that in the regular expression ensures that we don’t capture data that are not part of any message but still contain URLs. I require that tag at the beginning of the line and then look for any number of characters before the beginning of the URL, as represented by .*. Then I get all of the characters up until the first white space with (\\S+), any characters that happen to be after the end of the URL with (.*), and then finally the closing </title> tag, with a + to require at least one occurrence because I know it must be present. The .java file is here, and the compiled .class file is here. You’ll need to add Adam’s a2z.jar file to your classpath, so be sure to get that too if you want to recompile it.
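To make the pieces of that description concrete, here is a rough sketch of a pattern built from them. It is a reconstruction under assumptions, not the expression from the .java file: the http:// prefix inside the capture group, the class name TitleUrlFinder, the method extractUrl, and the sample lines are all hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; the pattern is assembled from the prose description.
class TitleUrlFinder {
    // ^<title>      the opening tag must start the line
    // .*            any characters before the URL
    // (http://\S+)  assumed URL form: "http://" plus non-whitespace characters
    // (.*)          anything that follows the URL
    // </title>      the closing tag
    static final Pattern TITLE_URL =
        Pattern.compile("^<title>.*(http://\\S+)(.*)</title>");

    // Return the URL found in a <title> line, or null if there is none.
    static String extractUrl(String line) {
        Matcher m = TITLE_URL.matcher(line);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Made-up input lines in the shape of search-result XML.
        String[] lines = {
            "<title>Great viz http://example.com/viz via a friend</title>",
            "<title>No link in this one</title>",
            "<link>http://example.com/not-a-tweet</link>"
        };
        for (String line : lines) {
            System.out.println(extractUrl(line));
        }
    }
}
```

Two things to keep in mind with a pattern of this shape. Because the leading .* is greedy, a line containing several URLs yields the last one. And a + written directly after </title> quantifies only the final > character, not the whole tag; the sketch leaves it out, since the literal tag already has to be present for the match to succeed.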
The New York Times did a visualization of tweets during the Super Bowl last week, and it was widely circulated on Twitter. A search for “http superbowl nyt” returns a list of tweets in which most people are sharing links to that visualization, and the results of that search make suitable example input. One specific tinyurl link to the visualization is shared several times, and it demonstrates that the code is functional. The input file is here, and the output file is here.