March 2009

Programming A to Z – Delvicious, Django and Assignment #8

project page here with overview and previous posts

Since I had already worked with XML and web services for my midterm project, I decided to take this week’s assignment for Programming A to Z as an opportunity to continue work on that project. I thought that a good next step would be to use App Engine’s Datastore to store the necessary information about all of a user’s bookmarks, and I would again access that data initially as an XML document obtained from the Delicious API.

I had already begun to learn the Python web framework Django for the Textonic project for Design for UNICEF, and it seemed like it would provide a rich set of useful tools for this project as well. The Datastore does not work, however, like the relational databases with which I am more familiar. Django relies by default on a relational database and is thus incompatible with Google’s Datastore out-of-the-box, but there are two software projects that aim to reconcile these differences. Google’s App Engine Helper (tutorial, code) seemed less well- and less actively-developed than app-engine-patch (tutorial, code), so I decided to go with the latter.

Django is quite powerful and gives you a lot of functionality for free, but when I tried to branch out from the various tutorials I ran into the somewhat strange challenge of figuring out exactly what it did for me and what it didn’t. It took me quite a while to get the hang of working with Django on App Engine, so I didn’t have time to actually store the XML in a reasonable set of Models in the database. I have, however, gotten the storage of Delicious logins to work, which is a good first step. The updated code is available on GitHub, and I should be able to make additional improvements soon.
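As a rough illustration of the step that is still missing, here is a minimal sketch of turning a Delicious-style XML document into plain records that could later be mapped onto Datastore Models. The XML layout (a posts element containing post elements with href, description, tag and time attributes), the sample data, and the Bookmark class are all assumptions made for this example; none of it is the code from the GitHub repository.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List

# Made-up sample, shaped the way a Delicious posts response is assumed to look.
SAMPLE_XML = """<posts user="memento85">
  <post href="http://example.com/a" description="Example A"
        tag="sms mobile" time="2009-03-01T12:00:00Z"/>
  <post href="http://example.com/b" description="Example B"
        tag="" time="2009-03-02T08:30:00Z"/>
</posts>"""


@dataclass
class Bookmark:
    """Hypothetical record; each field could become a Model property."""
    user: str
    url: str
    title: str
    tags: List[str]
    saved_at: str


def parse_bookmarks(xml_text: str) -> List[Bookmark]:
    """Parse the XML text and return one Bookmark per <post> element."""
    root = ET.fromstring(xml_text)
    user = root.get("user", "")
    return [
        Bookmark(
            user=user,
            url=post.get("href", ""),
            title=post.get("description", ""),
            tags=post.get("tag", "").split(),  # tags assumed space-separated
            saved_at=post.get("time", ""),
        )
        for post in root.iter("post")
    ]


bookmarks = parse_bookmarks(SAMPLE_XML)
for bookmark in bookmarks:
    print(bookmark.url, bookmark.tags)
```

Keeping the parsing separate from the storage layer means the same records could be saved through Django models, through the Datastore API directly, or just inspected in a test.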

A2Z
ITP
assignments
projects

Little Computers – Meetapp (Initial Progress)

previously: project description

I’ve made some initial progress on organizing the interface for my Meetup.com iPhone Application. I have a tab bar at the bottom, table views with navigation bars for each tab, and an array of strings populating one of the table views with sample text. I started off with David’s UITableView tutorial, but ran into a series of problems when I tried to integrate it into Apple’s Tab Bar Application project template. I eventually gave up on using Interface Builder and decided to do the entire thing programmatically using this excellent tutorial that I found online. That worked without any trouble, and I was able to modify the example to serve as the basis of my application.

I have uploaded what I have so far to a repository on GitHub – more frequent updates will be committed there, but I’ll also post here at major milestones.

A side note – Joe Hewitt, the developer of the Facebook iPhone app, recently open-sourced several of the iPhone libraries that he used as the ‘Three20 Project’. They look like they might be useful, and the post certainly deserves a link and a thank you.

ITP
Little Computers
projects

Programming A to Z – Assignments #6 and #7

Since it’s getting to be that time in the semester when I feel that I should be focusing on final projects, I didn’t spend too much time on the context-free grammar and Bayesian classification assignments. My writeups for both are below.

I’ve studied context-free grammars in the past (ahh, undergrad memories of Ling120; syllabus here, but not the one I had), so I have a good sense of how they work. I made a few quick modifications to the grammar file used by Adam’s ContextFilter class to handle certain conjunctions, adverbs/adverbial phrases, and prepositional phrases. I also made some modifications to support quotations, but they aren’t particularly refined – I couldn’t come up with a simple solution to the nesting problem in the example below that didn’t involve duplicating many of the other rules:

this smug dichotomy thought ” that smug walrus foregrounds this time that yelled ” the seagull that spins this important sea said ” that restaurant that sneezes has come ” ” ”

Thinking about ways to solve this problem highlighted how CFGs can get rather complex rather quickly. When, for example, do you want a period/comma at the end of the quotation? When is one period/comma sufficient for multiple quotations? How do you resolve those situations programmatically?
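One way to keep nested quotations from running away without duplicating rules is to track the nesting depth outside the grammar, in the code that expands it. The sketch below does that for a toy grammar. The grammar, the MAX_QUOTE_DEPTH limit and the expand function are all hypothetical and have nothing to do with Adam's Java classes or my test.grammar file, and the punctuation questions above are left unanswered.

```python
import random

# Hypothetical toy grammar: each symbol maps to a list of productions.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["the", "N"], ["that", "N"]],
    "VP": [["V"], ["said", "QUOTE"]],
    "QUOTE": [['"', "S", '"']],
    "N": [["seagull"], ["walrus"], ["trombone"]],
    "V": [["sneezes"], ["coughs"]],
}

MAX_QUOTE_DEPTH = 2  # assumed limit on how deeply quotations may nest


def expand(symbol, rng, depth=0):
    """Expand a symbol into a list of words, limiting quotation nesting."""
    if symbol not in GRAMMAR:
        return [symbol]  # terminal: emit the word itself
    options = GRAMMAR[symbol]
    if symbol == "VP" and depth >= MAX_QUOTE_DEPTH:
        # At the limit, drop every production that would open another quote.
        options = [opt for opt in options if "QUOTE" not in opt]
    production = rng.choice(options)
    # Everything inside a QUOTE is one level deeper than the QUOTE itself.
    next_depth = depth + 1 if symbol == "QUOTE" else depth
    words = []
    for part in production:
        words.extend(expand(part, rng, next_depth))
    return words


print(" ".join(expand("S", random.Random(7))))
```

The tradeoff is that the result is no longer a pure context-free grammar: the depth counter lives in the generator, which is exactly the kind of extra machinery the grammar file alone can't express without repeating rules.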

My test.grammar file is online here, and you can run it with Adam’s Java classes that are zipped here. I recommend you only test a few of my rules at a time and comment out the others – otherwise you might get sentences like this:

the important trombone yelled ” wow, the blue amoeba said ” oh, this blue thing interprets this seagull that said ” wow, the trombone sneezes ” ” but this amoeba said ” this suburb that quickly whines in that sea that daydreams by that corsage that quickly or habitually computes this dichotomy slowly yet habitually vocalizes and damn, that luxurious restaurant habitually prefers this seagull ” but that suburb interprets the time yet damn, that sea said ” that restaurant tediously slobbers ” yet wow, that seagull quickly yet slowly foregrounds the restaurant or the boiling hot time spins this bald restaurant of this trombone but that amoeba computes this smug restaurant but the seagull that quickly or slowly prefers this sea yelled ” oh, the thing that spins this restaurant tediously yet quietly whines ” or wow, this amoeba coughs through the important sea and oh, that time tediously yet slowly coughs yet oh, that trombone habitually computes that luxurious suburb or wow, that thing that said ” my, that amoeba that said ” my, that time tediously or slowly sneezes ” quietly yet tediously foregrounds that trombone that has come ” said ” my, the restaurant that habitually but quietly prefers the dichotomy that said ” damn, this boiling hot trombone quietly slobbers by the trombone that quietly coughs ” said ” oh, the amoeba habitually coughs ” ” yet damn, this seagull that spins the seagull that sneezes for the important trombone quietly coughs but my, this sea that slowly yet habitually has come foregrounds this dichotomy and that blue thing tediously slobbers “

 
 
For the assignment on Bayesian classification I combined Adam’s BayesClassifier.java with the previous Markov chain examples to use n-grams instead of words as the tokens for analysis. BayesNGramClassifier.java can be found here as a .txt file, and you can download all of the required files here. Note you might have to increase the amount of memory available to Java to run the analysis with higher values for n. Try something like java -Xmx1024m BayesNGramClassifier 10 shakespeare.txt twain.txt austen.txt if you're having trouble.

I compared sonnets.txt to shakespeare.txt, twain.txt and austen.txt as in the example using various values of n for the analysis. The data is below, with the word-level analysis first. Note that higher numbers (i.e. those closer to zero) indicate a greater degree of similarity.

n     shakespeare.txt      twain.txt            austen.txt
word  -59886.12916378722   -64716.741899235174  -66448.68994538345
1     -311.94997977326625  -348.2797252624347   -351.8612295074756
2     -6688.356420467105   -6824.843204592283   -7076.488251510615
3     -46332.8806624305    -47629.58502376338   -49557.906858505885
4     -155190.04376334642  -161815.95665896614  -167839.50470553883
5     -350322.9494161118   -369897.08857782563  -379600.90797560615
6     -581892.4161591302   -620871.7848829604   -629557.118086935
7     -798094.5896325088   -851043.4785550251   -856926.3903304675
8     -977428.4318098201   -1033391.2297240103  -1037851.0025613104
9     -1106125.9701775634  -1153251.0919529926  -1159479.8816597122
10    -1184654.361656962   -1218599.6808817217  -1227484.9929278728
11    -1221770.2880299168  -1242286.351341775   -1255024.535274092
12    -1228641.7908902294  -1238848.5031651617  -1254404.827728626
13    -1214247.043351669   -1217403.6233457213  -1235480.9184978919
14    -1187489.0276571538  -1186476.2556523846  -1205959.6178494398
15    -1153511.2780243065  -1150529.1594209142  -1170746.8132369826

When the n-grams become 14 characters long (which is very long, considering the average length of English words) the analysis finally starts to break down: for n = 14 and n = 15 the score for twain.txt edges slightly above the score for shakespeare.txt, so sonnets.txt is no longer classified as being most similar to shakespeare.txt. Some values of n certainly perform better than others, but I'd need to delve further into the mathematics of how these numbers are calculated in order to do more detailed analysis.
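For readers who want to experiment without the Java files, here is a minimal Python sketch of the general idea: count character n-grams per category, then score a sample by summing log probabilities with add-one smoothing. It is not a port of BayesClassifier.java or BayesNGramClassifier.java. The smoothing scheme, the function names and the tiny made-up corpora are assumptions for illustration. Because each score is a sum of logarithms of probabilities below one, it is always negative, and the category whose score is closest to zero wins.

```python
import math
from collections import Counter


def char_ngrams(text, n):
    """Return every overlapping character n-gram in the text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def train(text, n):
    """Count how often each n-gram occurs in a category's training text."""
    return Counter(char_ngrams(text, n))


def log_score(sample, counts, n):
    """Sum the log probabilities of the sample's n-grams (add-one smoothing)."""
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot reserved for unseen n-grams
    score = 0.0
    for gram in char_ngrams(sample, n):
        score += math.log((counts[gram] + 1) / (total + vocab))
    return score


# Tiny made-up corpora, standing in for the much larger text files.
CORPORA = {
    "verse": "shall i compare thee to a summers day thou art more lovely",
    "river": "the river boat drifted down the muddy mississippi all night",
}

N = 3
models = {name: train(text, N) for name, text in CORPORA.items()}
sample = "thou art more lovely"
scores = {name: log_score(sample, model, N) for name, model in models.items()}
best = max(scores, key=scores.get)  # highest score = closest to zero
print(scores, best)
```

With longer n-grams almost every n-gram in the sample is unseen in every category, so the comparison ends up depending mostly on the smoothing term and the size of each training text. That may be part of what is happening at n = 14 in the table above, although I haven't checked it against the Java code.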

A2Z
ITP
assignments

Design for UNICEF – Detextive*

cross-posted at textonic.org
previously: Mobile Tech for Social Change Barcamp, Design for UNICEF – RapidSMS and Mechanical Turk

*We might not actually use this name, but I like it and am going to enjoy it, at least for now.

Our RapidSMS/Mechanical Turk project is moving forward. Last week we met with the software development team at UNICEF that built RapidSMS, re-focused our efforts on creating a tool to process incoming SMS messages with Mechanical Turk, and divided out the tasks before our next meeting. I thought about what specific features we would need to provide to administrators of the system in the field for them to be able to set up and configure the system to work with RapidSMS. I made a few slides to present the ideas to our group, and the deck is below.

(I made both this presentation and the previous Meetapp presentation with 280 Slides, a web-based presentation editor made by a startup called 280 North. Give it a try – I find it great for sharing presentations, and I prefer it to working with Google Docs.)

Design For UNICEF
ITP
projects

Programming A to Z – Delvicious, Initial Implementation Details

For my midterm project for Programming A to Z I decided to start working on a Delicious / Google Custom Search Engine mashup that I’ve been wanting to make for a few months. It will be called Delvicious, and a complete description of the project can be found on its project page. This post will primarily be about the initial implementation details and the progress I’ve made so far, but will conclude with some future plans.

I started off by looking at various options for getting a user’s bookmarks on the Delicious Tools page. I decided to use RSS feeds for the very first version, but those are limited to 100 bookmarks and I knew that I’d have to switch to something else later. I spent a long time familiarizing myself with Google’s Custom Search Engine tools – there are a lot of options for customizing the sites available to the custom searches. I ultimately decided that I needed the power and flexibility of a self-hosted XML file of annotations that would contain the URLs of the sites to be searched. In addition, this seemed like a good project with which to start learning Python.

I dusted off an old Delicious account, memento85, and added a few random bookmarks. I wrote a Python script that retrieved the most recent 100 bookmarks for a user as a JSON object and wrote the URLs from those bookmarks to a properly formatted annotations XML file. It took some trial and error, but Python was generally painless, and you can see the script that I used here as a .txt file. I then set it to run a few times an hour as a cron job on my server, which made sure that my annotations XML file would be updated when my bookmarks changed. (Note that changes to the XML are not immediately reflected in the CSE, but this is ok – people can remember the sites that they’ve been to in the last few hours on their own.)
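The core of a script like that can be sketched in a few lines. This is not the script linked above. It assumes the JSON feed is a list of objects with the bookmark URL under a "u" key, and that the annotations file consists of Annotation elements with an about attribute and a Label child naming the search engine. The sample data and the CSE_LABEL value are made up, and the network fetch and file write are left out so that the example runs on its own.

```python
import json
import xml.etree.ElementTree as ET

# Made-up sample in the assumed shape of the Delicious JSON feed.
SAMPLE_JSON = """[
  {"u": "http://example.com/a", "d": "Example A", "t": ["sms"]},
  {"u": "http://example.org/b", "d": "Example B", "t": []}
]"""

CSE_LABEL = "_cse_exampleid"  # hypothetical label identifying the search engine


def build_annotations(bookmarks_json, label):
    """Return annotations XML with one Annotation per bookmarked URL."""
    root = ET.Element("Annotations")
    for item in json.loads(bookmarks_json):
        annotation = ET.SubElement(root, "Annotation", about=item["u"])
        ET.SubElement(annotation, "Label", name=label)
    return ET.tostring(root, encoding="unicode")


annotations_xml = build_annotations(SAMPLE_JSON, CSE_LABEL)
print(annotations_xml)
```

Building the file with ElementTree rather than string concatenation means that characters such as ampersands in bookmark URLs are escaped automatically.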

Once that was working I set up a second CSE to use another Delicious account, lehrblogger, that had many more of my bookmarks imported. The annotations file made by bookmarksearch.py for this account looked like this. Adding this XML as the annotations feed in the CSE results in the following functional custom search engine – try it out below or go to the search’s homepage.

But why, exactly, is such a thing especially useful? Let’s say that I am looking for a specific site that I am sure I bookmarked a while ago and want to find again. I know it has something to do with SMS, but can’t be sure of any other keywords. If I do a search for ‘sms’ on my Delicious account, I get only one result. It is returned by the search because I tagged it with ‘sms’, but perhaps it is not the site I was looking for, and I am still certain that the other site is in my bookmarks. I could use the Custom Search Engine to search the full text of these same bookmarks, and this returns four results: the one found by Delicious and three others. If I had been looking for, say, the first result, it would have been very difficult/tedious to find with only the tools offered by Delicious.

After that initial part of the project was both working and useful I started to think about ways in which it could be expanded. The Google CSE supports refinements, or categories of search results, which allow a user to quickly filter for results of a given topic. I thought there was a nice parallel between refinements and Delicious’ tags, and it seemed like a good next step to use the tags as refinements by pairing them in the annotations XML file with their respective URLs.
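Pairing tags with URLs could look something like the sketch below, which adds one extra Label per tag to each annotation and collects the distinct tags along the way. The element names and the idea that a refinement is just an additional Label on the annotation are assumptions based on my reading of the CSE documentation. The function name, the label value and the sample data are hypothetical.

```python
import xml.etree.ElementTree as ET


def annotate_with_tags(bookmarks, cse_label):
    """Build annotations XML where each URL also carries its tags as labels.

    bookmarks is a list of (url, tags) pairs. Returns the XML string and the
    sorted list of distinct tags that were used as labels.
    """
    root = ET.Element("Annotations")
    refinements = set()
    for url, tags in bookmarks:
        annotation = ET.SubElement(root, "Annotation", about=url)
        ET.SubElement(annotation, "Label", name=cse_label)
        for tag in tags:
            ET.SubElement(annotation, "Label", name=tag)
            refinements.add(tag)
    return ET.tostring(root, encoding="unicode"), sorted(refinements)


sample_bookmarks = [
    ("http://example.com/a", ["sms", "mobile"]),
    ("http://example.org/b", ["sms"]),
]
tagged_xml, tag_list = annotate_with_tags(sample_bookmarks, "_cse_exampleid")
print(tag_list)
```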

This feature also requires, however, that the main file that defines the CSE list all of the refinements. Google does provide an API for easily modifying this file, but a user must be authenticated with Google as the owner of the CSE. I needed the updates to happen regularly as part of a cron job, and it would not work for each user to need to authenticate (i.e. type in her Google username and password) each time the CSE was updated. Even if I found a way to use authentication data as part of the cron job, I was concerned about storing that sort of sensitive information on my own server.

Thus it made sense to make a much larger jump forward in the project than I intended so early on: I decided to rebuild the application to run on Google’s App Engine. App Engine is a scalable hosting/infrastructure system on which to build rich web applications, and it offers substantial free bandwidth and CPU time as well as a competitively priced payment plan for larger/more popular applications.

App Engine uses Python and is well documented, so I dove in with the Hello World example. A good first step seemed to be to get the annotations XML file populated by the bookmarks returned by a Delicious API call made from App Engine, and next I needed a way to serve that file at a persistent URL for the CSE to use. These things were more challenging than I expected – I had difficulty authenticating with Delicious, parsing the XML (as opposed to JSON) data that came back, and finding a way to serve those URLs as a properly formatted XML file. I initially looked for a way to write the URLs to a static file, but eventually found a detailed tutorial on writing blogging software for App Engine, and I was able to adapt the RSS publishing portion of that example for my purposes.
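The "serve generated XML at a persistent URL" part can be sketched with nothing but the standard library. The example below is a plain WSGI application rather than App Engine's own framework, and it is not the code from the project. The in-memory BOOKMARKS dictionary, the label and the helper names are stand-ins invented for this example. Unlike the behavior described here, this sketch answers with a 404 for users it doesn't know about.

```python
import xml.etree.ElementTree as ET
from wsgiref.util import setup_testing_defaults

# Hypothetical in-memory stand-in for bookmark URLs fetched from Delicious.
BOOKMARKS = {"memento85": ["http://example.com/a", "http://example.org/b"]}
PREFIX = "/annotations/"


def render_annotations(urls, label="_cse_exampleid"):
    """Render a list of URLs as annotations XML (assumed element names)."""
    root = ET.Element("Annotations")
    for url in urls:
        annotation = ET.SubElement(root, "Annotation", about=url)
        ET.SubElement(annotation, "Label", name=label)
    return ET.tostring(root, encoding="unicode")


def app(environ, start_response):
    """WSGI app: /annotations/<user> returns that user's annotations XML."""
    path = environ.get("PATH_INFO", "")
    user = path[len(PREFIX):] if path.startswith(PREFIX) else ""
    if user not in BOOKMARKS:
        start_response("404 Not Found", [("Content-Type", "text/plain")])
        return [b"unknown user"]
    body = render_annotations(BOOKMARKS[user]).encode("utf-8")
    headers = [
        ("Content-Type", "text/xml; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]


def call_app(path):
    """Call the app directly, without a server, and return (status, body)."""
    environ = {}
    setup_testing_defaults(environ)
    environ["PATH_INFO"] = path
    seen = {}

    def start_response(status, headers):
        seen["status"] = status

    body = b"".join(app(environ, start_response))
    return seen["status"], body


print(call_app("/annotations/memento85")[0])
```

To try it in a browser, the same app object can be handed to wsgiref.simple_server.make_server and served locally.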

The annotations XML files were now being published to URLs such as http://delv-icio-us.appspot.com/annotations/memento85, although currently this only works for that one user, and you can actually put whatever you want after “annotations/”. Once I had App Engine making a call to the Delicious API and serving the resulting bookmark URLs in an annotations XML file, it was easy to set up a new custom search engine, this time for the handful of bookmarks of memento85.

Because the process of getting the above working involved so much trial-and-error, and because I intend to continue developing the project into a more complex application, I set up a GitHub project for the App Engine portion of Delvicious. There are many, many things left to do before the project is complete. I will need to:

  • understand Google’s Datastore so that I can store information about the Google accounts of users and the Delicious accounts paired with them.
  • also store the bookmark data for each user in the Datastore – too many API calls are required to fetch the entire list of bookmarks each time the XML is served, and it makes much more sense to store the bookmarks once and update them as new bookmarks are added. This also removes the need for the cron job – I can simply fetch new bookmarks whenever the CSE requests the annotations XML.
  • design the various pages of the application, including signup and account management.
  • develop the search page – ideally I can present the user with a single search box that will use both the built-in Delicious search and the Custom Search Engine and present the results side-by-side.
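
The second item in the list, storing bookmarks once and only adding new ones, might eventually look something like the sketch below. A plain dictionary keyed by URL stands in for the Datastore, and the record layout, the function names and the assumption that timestamps are ISO 8601 strings in UTC (so that they sort correctly as text) are all hypothetical.

```python
from typing import Dict, List


def merge_new_bookmarks(stored: Dict[str, dict], fetched: List[dict]) -> List[str]:
    """Add fetched records that are not stored yet; return the URLs added."""
    added = []
    for record in fetched:
        url = record["url"]
        if url not in stored:
            stored[url] = record
            added.append(url)
    return added


def latest_timestamp(stored: Dict[str, dict]) -> str:
    """Return the newest stored timestamp, or an empty string if none."""
    return max((record["time"] for record in stored.values()), default="")


# Hypothetical stand-in for what is already saved for one user.
stored = {
    "http://example.com/a": {
        "url": "http://example.com/a",
        "time": "2009-03-01T12:00:00Z",
    },
}
# Hypothetical result of a later fetch: one known and one new bookmark.
fetched = [
    {"url": "http://example.com/b", "time": "2009-03-05T09:15:00Z"},
    {"url": "http://example.com/a", "time": "2009-03-01T12:00:00Z"},
]

added = merge_new_bookmarks(stored, fetched)
print(added, latest_timestamp(stored))
```

The newest stored timestamp could then be used to decide whether a fresh fetch is needed at all before the annotations XML is served.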

I’m excited about implementing these features and hope to continue this project for the remainder of the course. Delvicious will be a good opportunity to learn Python, familiarize myself with building web applications using the App Engine, and create a mashup that people might find truly useful.

A2Z
ITP
projects
web ideas