Category Archives: A2Z

Programming A to Z – Delvicious, Initial Implementation Details

For my midterm project for Programming A to Z I decided to start working on a Delicious / Google Custom Search Engine mashup that I’ve been wanting to make for a few months. It will be called Delvicious, and a complete description of the project can be found on its project page. This post will primarily be about the initial implementation details and the progress I’ve made so far, but will conclude with some future plans.

I started off by looking at various options for getting a user’s bookmarks on the Delicious Tools page. I decided to use RSS feeds for the very first version, but those are limited to 100 bookmarks and I knew that I’d have to switch to something else later. I spent a long time familiarizing myself with Google’s Custom Search Engine tools – there are a lot of options for customizing the sites available to the custom searches. I ultimately decided that I needed the power and flexibility of a self-hosted XML file of annotations that would contain the URL’s of the sites to be searched. In addition, this seemed like a good project to start learning a programming language called Python.

I dusted off an old Delicious account, memento85, and added a few random bookmarks. I made a Python script that retrieved the most recent 100 bookmarks for a user as a JSON object and wrote the url’s from those bookmarks to a properly formatted annotations XML file. It took some trial and error, but Python was generally painless and you can see the script that I used here as a txt file. I then set it to run a few times an hour as a cron job on my server, and this made sure that my annotations XML file would be updated when my bookmarks changed. (Note that changes to the XML are not immediately reflected in the CSE, but this is ok – people can remember the sites that they’ve been to in the last few hours on their own).

Once that was working I set up a second CSE to use another Delicious account, lehrblogger, that had many more of my bookmarks imported. The annotations file made by bookmarksearch.py for this account looked like this. Adding this xml as the annotations feed in the CSE results in the following functional custom search engine – try it out below or go to the search’s homepage.

But why, exactly, is such a thing especially useful? Let’s say that I am looking for a specific site that I am sure that I had bookmarked a while ago and want to find again. I know it has something to do with SMS, but can’t be sure of any other keywords. If I do a search for ‘sms’ on my Delicious account, I get only one result. It is returned by the search because I tagged this result with ‘sms’, but perhaps it is not the site I was looking for and I am still certain that the other site is in my bookmarks. I could use the Custom Search Engine to search the full text of these same bookmarks, and this returns these four results, the one found by Delicious and three others. If I had been looking for, say, the first result, it would have been very difficult/tedious to find with only the tools offered by Delicious.

After that initial part of the project was both working and useful I started to think about ways in which it could be expanded. The Google CSE supports refinements, or categories of search results, which allow a user to quickly filter for results of a given topic. I thought there was a nice parallel between refinements and Delicious’ tags, and it seemed like a good next step to use the tags as refinements by pairing them in the annotations XML file with their respective URLs.

This feature also requires, however, that the main file that defines the CSE list all of the refinements. Google does provide an API for easily modifying this file, but a user must be authenticated with Google as the owner of CSE. I needed the updates to happen regularly as part of a cron job, and it would not work for each user to need to authenticate (i.e. type in her Google username and password) each time the CSE was updated. Even if I found a way to use authentication data as part of the cron job, I was concerned about storing that sort of sensitive information on my own server.

Thus it made sense to make a much larger jump forward in the project than I intended so early on: I decided to rebuild the application to run on Google’s App Engine. App Engine is a scalable hosting/infrastructure system on which to build rich web applications, and it offers substantial free bandwidth and CPU time as well as a competitively priced payment plan for larger/more popular applications.

App Engine uses Python and is well documented, so I dove in with the Hello World example. A good first step seemed to be to get the annotations XML file populated by the bookmarks returned by Delicious API call made by App Engine, and next I needed a way to serve that file at a persistent URL for the CSE to use. These things were more challenging than I expected – I had difficulty authenticating with Delicious, parsing the XML (as opposed to JSON) data that came back, and finding a way to serve those URLs as a properly formatted XML file. I initially looked for a way to write the URLs to a static file, but eventually found a detailed tutorial on writing blogging software for the App Engine, and I was able to adapt the RSS publishing portion of that example for my purposes.

The annotations XML files were now being published to URLs such as http://delv-icio-us.appspot.com/annotations/memento85 but currently it only works for that one user and you can actually put whatever you want after “annotations/”. Once I had App Engine making a call to the Delicious API and serving the resulting bookmark URLs in an annotations XML file, it was easy to set up a new custom search engine, this time for the handful of bookmarks of memento85.

Because the process of getting the above working involved so much trial-and-error, and because I intend to continue developing the project into a more complex application, I set up a GitHub project for the App Engine portion of Delvicious. There are many, many things left to do before the project is complete. I will need to:

  • understand Google’s Datastore so that I can store information about the Google accounts of users and the Delicious accounts paired with them.
  • also store the bookmark data for each user in the Datastore – too many API calls are required to fetch the entire list of bookmarks each time the XML is served, and it makes much more sense to store the bookmarks again and update them as new bookmarks are added. This also saves the need for the cron job – I can simply fetch new bookmarks whenever the CSE requests the annotations XML.
  • design the various pages of the application, including signup and account management.
  • develop the search page – ideally I can present the user with a single search box that will use both the built-in Delicious search and the Custom Search Engine and present the results side-by-side.

I’m excited about implementing these features and hope to continue this project for the remainder of the course. Delvicious will be a good opportunity to learn Python, familiarize myself with building web applications using the App Engine, and create a mashup that people might find truly useful.

Programming A to Z – Assignment #5 Markov v vokraM

The class preceding our fifth assignment was on Markov chains, which involve statistical models of text that can be used to predict what character (or word, or some other unit) would be likely to follow a preceeding series of n characters (or words or other units). (An interesting paper of such an algorithm from the assigned reading can be found here.) Our assignment, with a few suggestions, was simply to:

Modify, augment, or replace the in-class Markov chain example.

As presented by Adam, the MarkovFilter example looked at each series of, say, three characters to build a model of what the next character is most likely to be, and then used this model to generate new lines of text by starting with an initial series of three characters used by the actual text, choosing the next character based on the first three characters, choosing the character after that based on the new most recent three characters (the latter two from the first set of three, and the new fourth character), and so on.

It occurred to me that perhaps it was unnecessary that the algorithm examine the text as we read it, from left to right. Instead, I wanted to rewrite MarkovFilter.java to work backwards – starting with the last three characters of a line and working backwards instead of forwards, looking at the set of three characters at the (temporary) beginning of the line, choosing a new first character for the line from a modified statistical model about which character is most likely to precede them, and repeating until the line is of the desired length. VokramFilter.java represents this reversed Markov filter, and the zip file of all the classes is here.

The text this generated seemed approximately as similar to English as that generated using the forward looking method, and I considered for a while how to be sure this is the case. I suspected that the operation performed by backward-looking Vokram analysis was equivalent to reversing the text that was input to a forward-looking Markov algorithm and then reversing the result, and it seemed like that operation would do the same thing as the Markov algorithm on its own. Yet I couldn’t quite work out a more thorough proof of that intuition, and will see if Adam has any insights. I considered doing a comparative analysis of several large texts using both MarkovFilter and VokramFilter (by comparing the Markov analysis of a text generated from applying VokramFilter to an original text to a Markov analysis of that actual original text), but didn’t have a chance.

Programming A to Z – Assignment #4 Concordance Sorting

The class prior to the fourth assignment covered concordances, data structures for word counts, and other related topics. I completed one of the more challenging suggested alternative tasks for the assignment:

Investigate Java’s Collections class. See if you can figure out how to use Collections.sort() to sort the output of ConcordanceFilter.java—first in alphabetical order, then ordered by word count. (See the official Sun tutorial.)

The Java documentation was relatively clear about what I needed to do using the Collections and Comparator classes to get it working, and Google answered any remaining questions I had about syntax. There are a few files that I edited to run it (including AlphabeticComparator and a WordCountComparator classes), and you can download a zip file of the assignment here. When run, ConcordanceFilter.java will search for a word within a text and output each line on which that word occurred, then output those lines again in alphabetical order, and then output those lines a third time with the line with the fewest words first. For example:

$ java ConcordanceFilter place <lovecraft_dreams.txt
All contexts
remote place beyond the horizon, showing the ruin and antiquity of the city,
over a bridge to a place where the houses grew thinner and thinner. And it was
Contexts sorted alphabeticallydsfadsfd
over a bridge to a place where the houses grew thinner and thinner. And it was
remote place beyond the horizon, showing the ruin and antiquity of the city,
Contexts sorted by word count
remote place beyond the horizon, showing the ruin and antiquity of the city,
over a bridge to a place where the houses grew thinner and thinner. And it was

Programming A to Z – Assignment #3 URLFinder.java

The third assignment was posted here, and was to

Make a program that creatively transforms or performs analysis on a text using regular expressions. The program should take its input from the keyboard and send its to the screen (or redirect to/from a file). Your program might (a) filter lines from the input, based on whether they match a pattern; (b) match and display certain portions of each line; (c) replace certain portions of each line with new text; or (d) any combination of the above.

Sample ideas: Replace all words in a text of a certain length with a random word; find telephone numbers or e-mail addresses in a text; locate words within a word list that will have a certain score in Scrabble; etc.

Bonus challenge 1: Use one or more features of regular expression syntax that we didn’t discuss in class. Reference here.

Bonus challenge 2: Use one or more features of the Pattern or Matcher class that we didn’t discuss in class. Of particular interest: regex flags (CASE_INSENSITIVE, MULTILINE), “back references” in replaceAll. Matcher class reference here.

I’m planning on doing a much larger project involving analysis of links on Twitter, and I decided to do a very tiny piece of that project for this assignment. I used the XML results from a Twitter search as my input and used a regular expression to look for URLs in the individual tweets. I stored the URLs and the number of times each of them occured in a hashmap, and then printed that information at the end of the analysis.

Usage of Java’s HashMap, Set, and Iterator classes came back to me quickly, and the only tricky part was the regular expression. I ended up using
     <title>.*(http://)(\\S+)(.*)(</title>)+
The content of each message posted to Twitter is enclosed in a <title></title> tag, and including that in the regular expression insures that we don’t capture data that are not part of any message but still contain URLs. I require that tag at the beginning of the line and then look for any number of characters before the beginning of the URL, as represented by .*. Then I get all of the characters up until the first white space with (\\S+), any characters that happen to be after the end of the URL with (.*), and then finally the closing </title> tag, with a + to require at least one occurrence because I know it must be present. The .java file is here, and the compiled .class file is here. You’ll need to add Adam’s a2z.jar file to your classpath, so be sure to get that too if you want to recompile it.

The New York Times did a visualization of tweets during the Superbowl last week, and it was widely circulated on Twitter. A search for “http superbowl nyt” returns a a list of tweets in which most people are sharing links to that visualization, and the results of that search make suitable example input. One specific tinyurl link to the visualization is shared several times, and it demonstrates that the code is functional. The input file is here, and the output file is here.

Programming A to Z – Assignment #2 Repunctuate.java

The second assignment was posted here, and was to

Create a program (using, e.g., the tools presented in class) that behaves like a UNIX text processing program (such as cat, grep, tr, etc.). Your program should take text as input (any text, or a particular text of your choosing) and output a version of the text that has been filtered and/or munged. Your program should use at least one method of Java’s String class that we didn’t discuss in class.
Be creative, insightful, or intentionally banal. Optional: Use the program that you created in tandem with another UNIX command line utility.

Expanding on/explicitly exacerbating the problem of punctuation I had last week with rearranging the couplets (when the couplets were reordered, you’d often get two lines ending with commas and then two lines ending with periods, and it distracted from the semantic munging I had intended), I wrote a quick little Java program to randomly replace marks of punctuation in the input file. It extends Adam’s TextFilter library, so it works like the command line tools we used last week. I kept certain characters (such as parenthesis and quotation marks) the same because I wanted to keep the text readable while making more subtle changes to the intonation and flow .

The .java file is here, and the compiled .class file is here. The original text of Robert Frost’s ‘Stopping By Woods On A Snowy Evening’ can be found here, and the repunctuated text can be found here. You’ll need to add Adam’s a2z.jar file to your classpath, so be sure to get that too if you want to recompile it.

The Repunctuate.java program also works nicely with the various command line utilities from last week. For example
grep , <frost.txt | java Repunctuate >output.txt
will first filter for only those lines in frost.txt with a comma, and will then repunctuate them and save the output.