April 2009

Programming A to Z – A Delvicious Milestone

project page here with overview and previous posts

I’ve had to focus on other projects for the past couple of weeks, but I finally got to turn my attention back to Delvicious. It will now:

  • keep track of Delicious login information for a particular Google account
  • fetch those bookmarks using the Delicious API and store them in the GAE Datastore
  • serve the bookmarks back as an XML file to a user-specific URL

There’s enough functionality to make it worth visiting the appspot page here. Sign in with Google, enter your Delicious credentials, and fetch your bookmarks – you should see them displayed in a list. You could then make a custom search engine, go to the advanced tab in the left sidebar, and add a URL of the form delv-icio-us.appspot.com/delvicious/annotations/[your_delicious_username].xml as an annotations feed. Your bookmarks should then start appearing in your search results. Test out the new version (which searches memento85’s small number of Delicious bookmarks) below –

I need to clean up the interface navigation, but the real next step is to dive into the CSE API documentation and automate the creation of the search engine so that one is automatically paired with a Google user’s Delicious account.

A2Z
ITP
projects

On the Plethora of Social Websites

The problem with the plethora of social websites is not that people have to keep track of and maintain multiple profiles/identities – we do that in real life too (acting differently in different situations, talking about different things with different people, etc) and we’re pretty good at it. The problem is instead that we have to keep track of multiple profiles/identities for every person with whom we interact in more than one context. It is this social/cognitive task which we find so disorienting.

Note the inaugural use of the ‘thesis’ category! It’s not due for over a year, but this is the task, or at least an early formulation of the task, that I want to tackle with my final ITP project.

ITP
Wanderli.st
thesis
web ideas

Programming A to Z – Assignment #9 Evolvocabulary

The last of the weekly assignments was relatively open ended –

Acquire some text. Visualize it. Source and methodology are up to you, but be prepared to justify your choices.

I decided to use my papers from college as my source text. I copied and pasted the contents of the papers into plain text files, hoping to see how my vocabulary evolved through time (hence the project name… not the most clever one I’ve come up with, but it will do for a weekly assignment). (Note that I didn’t include the writing I did for certain technical Linguistics, Computer Science, and Physics classes. I also didn’t include papers that were group-authored.)

In week four of the class we had looked at how to represent word counts as a hashmap of words to the number of times that they occurred in a text (see the WordCount class in the notes). WordCount extends TextFilter, however, and TextFilter is built to only be able to read data from a single file. I thought about combining the files into one or trying to use multiple TextFilters, but it seemed easier and more elegant to start from scratch.

Scala seemed like it would be well-suited to this sort of problem, and I was eager to find a use for it since it had been a couple months since I last worked on TwiTerra. My aforementioned friend Jorge pointed me towards a class written to parse Ruby log files; that code, which uses a Scala community library called Scalax, served as a useful starting point.

You can see the full source code for the assignment here, but I’ve pasted a particularly interesting function below. There’s some “deep Scalax magic” going on here (as Jorge says), which I’ll explain -

  def readAllFiles(fileNames: List[String]): Map[String, Map[String, Int]] = {
    val wordsInFiles: List[(String, String)] = for {
      filename <- fileNames
      line <- filename.toFile.readLines()
      word <- line.split("\\W+|\\d+")
    } yield (word.toLowerCase.trim, filename)
 
    val emptyMap: Map[String, Map[String, Int]] = Map.empty.withDefaultValue(Map.empty.withDefaultValue(0))
 
    wordsInFiles.foldLeft(emptyMap) { case (map, (word, filename)) => 
        map.update(word, map.apply(word).update(filename, map.apply(word).apply(filename) + 1))
      //map.update(word, map.apply(word).update(filename, map(word)(filename) + 1))
      //map.update(word, map.apply(word)(filename) = map(word)(filename) + 1)
      //map.update(word, map(word)(filename) = map(word)(filename) + 1)
      //map.update(word, map(word)(filename) += 1)
      //map(word) = (map(word)(filename) += 1)
    }
  }

The function takes a list of the names of the files mentioned above, and returns a map that has each word mapped to another map, and each of those maps has the names of the files in which that word occurred mapped to the number of times that word occurred in that file. There are three parts of the function:

  1. The first part of the function goes through all the files in that list, then through each line in each file, and then through each token in each line (splitting on runs of non-word characters or digits). It lowercases each token and puts it in a list as a pair with the file in which it occurred. Thus wordsInFiles is a long list of words and file names, with an entry for every word on every line of every file.
  2. The function then initializes an empty map (emptyMap) whose default value for any missing word is another empty map, which in turn defaults to 0 for any missing file name. This eliminates a lot of hassle later on checking to see if words/file names are in the map – we can just assume they are there and trust it to use the default values if they aren’t.
  3. Finally, it operates on each pair in the wordsInFiles list and updates the map accordingly. foldLeft is explained thoroughly on the Ruby log file example linked above, but I’ll go through this specific case. It starts off with the emptyMap, goes through each pair in the wordsInFiles list, and performs a function on the pair of that map paired with that pair from the list ((map, (word, filename))) to fold that list pair into the map. The result of that fold is a map that is then used in the fold of the next item in the list, and this process continues for each (word, filename) pair in the wordsInFiles list.

    The function performed on each item in the list is not as complicated as it looks, and note that each of the commented lines is equivalent to the uncommented one – I left them to show the progressive application of syntactic sugar (which I won’t go into here). The purpose of this next line is to increment the count of the number of occurrences of the current word in the current file.

    The outermost map.update finds the word in the map and replaces the map with which it is associated with a new one. This new map needs to be an updated version of the previous map, which we retrieve with map.apply(word). We want to update only one of the values corresponding to the file names in that word’s map of filenames to occurrence-counts, so we need to get the previous count (using two map.apply’s to get to the value of a key in a map within a map) and increment it before the resulting map is sent to the update.

… “deep Scalax magic.”
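
For anyone reading this on a more recent Scala release, here is a minimal sketch of the same three steps. It is not the code from the assignment: it assumes Scala 2.13 or later, drops Scalax, and takes in-memory (name, text) pairs instead of reading files. It also uses updated and getOrElse in place of the older update call and the default-valued map. The names WordCountsSketch and countWords are hypothetical, made up for this illustration.

```scala
object WordCountsSketch {
  // word -> (source name -> number of occurrences)
  type Counts = Map[String, Map[String, Int]]

  // Hypothetical helper: takes (name, text) pairs rather than file names,
  // so the sketch runs without any file I/O or the Scalax library.
  def countWords(texts: List[(String, String)]): Counts = {
    // Step 1: one (word, name) pair for every token of every line.
    val wordsInFiles: List[(String, String)] = for {
      (name, text) <- texts
      line <- text.split("\n").toList
      word <- line.split("\\W+|\\d+").toList
      if word.nonEmpty // split can yield empty tokens; skip them
    } yield (word.toLowerCase, name)

    // Step 2: start from an empty map; getOrElse below plays the role
    // of the default values in the original function.
    val empty: Counts = Map.empty

    // Step 3: fold each pair into the map, incrementing one nested count.
    wordsInFiles.foldLeft(empty) { case (counts, (word, name)) =>
      val perFile = counts.getOrElse(word, Map.empty[String, Int])
      counts.updated(word, perFile.updated(name, perFile.getOrElse(name, 0) + 1))
    }
  }
}
```

Calling WordCountsSketch.countWords(List("a.txt" -> "The cat. The hat", "b.txt" -> "the end")) should give a map in which "the" is associated with Map("a.txt" -> 2, "b.txt" -> 1). Using getOrElse rather than withDefaultValue is a design choice for the sketch: defaults set with withDefaultValue are easy to lose when a map is transformed, while getOrElse keeps the fallback visible at the point of use.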

I’ve saved the results of the visualization in the images below. The program didn’t quite create the effect that I had intended – that of showing how my vocabulary evolved over time – but it did give some sense of the topics about which I was thinking and the words I used to describe them. I thought about eliminating common words, but it seemed like it would be hard to make those decisions in a non-arbitrary manner. Click each image for a larger version in which the differences are more apparent. I recommend opening the images in tabs and cycling through them with shortcut keys so that it’s easier to make quick comparisons.

A2Z
ITP
assignments