A2Z

Programming A to Z – Delvicious Alpha

project page here with overview and previous posts

http://delv-icio-us.appspot.com/

A very alpha version of Delvicious is live at the above link! You can currently sign in with your Delicious account (or use one of the ones I posted towards the bottom of the GitHub page), tell it to fetch all of your bookmarks (making a copy of the URLs in Google’s datastore), and then perform searches of those pages using a Google Custom Search Engine that is created on the fly* from XML files that are automatically generated and served by App Engine. I’ve had a little trouble getting the CSE’s to work the first time a user tries to use the search box, and I suspect that the search engines are not immediately ready for use after their (implicit) creation (which I think happens when the user first tries to do a search). One of the first of many next steps is to find a way to initialize that search engine automatically and inform the user (via email) when it is ready. If it’s not working for you, wait a little while and try again later.

There’s a lot of other functionality I want to add, and other next steps include:

  • automatic updating of a user’s Delicious bookmarks, modifying only the necessary entries in the Datastore to reflect the changes on Delicious
  • leveraging Delicious tags to improve search results – both by using them as refinements, and by being sure to return results that have the tag you are searching for (I’d love to have support for a “tag:tagname” operator in the search box)
  • an option to leverage the number of other users on Delicious who have bookmarked a site to improve search results, since presumably heavily bookmarked pages are more useful
  • Delicious results optionally displayed side-by-side with the Google results, both for comparison, and to make sure that all useful results are found (currently Delicious finds things that Google doesn’t) (I might use the Beautiful Soup HTML parser to scrape the results from the Delicious search pages since they aren’t exposed by the API)
  • improve authentication with Delicious
  • explore ways to support the searching of private bookmarks (but I’m not sure this is possible, since Google needs public XML files to create the search engines)
  • do the graphic design for the various pages

I will continue to work on the project over the summer, and I hope to launch an initial version within the next month or two.

* I did this using Linked Custom Search Engines, and it’s worth noting that this was not the original plan. I had intended to ask users to authenticate with both Google and Delicious, and use the Google authentication for the creation of and interaction with CSE’s using HTTP requests. CSE’s only support Client Login authentication, however, which is intended for desktop/mobile applications and requires users to manually type in their usernames and passwords. I knew that I would not be able to convince users to give me their unencrypted Google login information (nor did I really want to possess it), so, without the ability to authenticate Google users on a google.com Google sign-in page, I was going to need an alternate solution.

A2Z
ITP
projects

Programming A to Z – A Delvicious Milestone

project page here with overview and previous posts

I’ve had to focus on other projects for the past couple of weeks, but I finally got to turn my attention back to Delvicious. It will now:

  • keep track of Delicious login information for a particular Google account
  • fetch those bookmarks using the Delicious API and store them in the GAE Datastore
  • serve the bookmarks back as an XML file to a user-specific URL

There’s enough functionality to make it worth visiting the appspot page here. Sign in with Google, enter your Delicious credentials, and fetch your bookmarks – you should see them displayed in a list. You could then make a custom search engine, go to the advanced tab on the left side bar, and add a URL of the form delv-icio-us.appspot.com/delvicious/annotations/[your_delicious_username].xml as an annotations feed. Your bookmarks should then start appearing in your search results. Test out the new version (that searches memento85’s small number of Delicious bookmarks) below –

I need to clean up the interface navigation, but the real next step is to dive into the CSE API documentation and automate the creation of the search engine so that one is automatically paired with a Google users Delicious account.

A2Z
ITP
projects

Programming A to Z – Assignment #9 Evolvocabulary

The last of the weekly assignments was relatively open ended –

Acquire some text. Visualize it. Source and methodology are up to you, but be prepared to justify your choices.

I decided to use my papers from college as my source text. I copy pasted the contents of the papers into plain text files, and had hoped to see how my vocabulary evolved through time (hence the project name… not the most clever one I’ve come up with, but it will do for a weekly assignment). (Note I didn’t include the writing I did for certain technical Linguistics, Computer Science, and Physics classes. I also didn’t include papers that were group-authored.)

In week four of the class we had looked at how to represent word counts as a hashmap of words to the number of times that they occurred in a text (see the WordCount class in the notes). WordCount extends TextFilter, however, and TextFilter is built to only be able to read data from a single file. I thought about combining the files into one or trying to use multiple TextFilters, but it seemed easier and more elegant to start from scratch.

Scala seemed like it would be well-suited to this sort of problem, and I was eager to find a use for it since it had been a couple months since I last worked on TwiTerra. My aforementioned friend Jorge pointed me towards a class written to parse Ruby log files; that code, which uses a Scala community library called Scalax, served as a useful starting point.

You can see the full source code for the assignment here, but I’ve pasted a particularly interesting function below. There’s some “deep Scalax magic” going on here (as Jorge says), which I’ll explain -

  def readAllFiles(fileNames: List[String]): Map[String, Map[String, Int]] = {
    val wordsInFiles: List[(String, String)] = for {
      filename <- fileNames
      line <- filename.toFile.readLines()
      word <- line.split("\\W+|\\d+")
    } yield (word.toLowerCase.trim, filename)
 
    val emptyMap: Map[String, Map[String, Int]] = Map.empty.withDefaultValue(Map.empty.withDefaultValue(0))
 
    wordsInFiles.foldLeft(emptyMap) { case (map, (word, filename)) => 
        map.update(word, map.apply(word).update(filename, map.apply(word).apply(filename) + 1))
      //map.update(word, map.apply(word).update(filename, map(word)(filename) + 1))
      //map.update(word, map.apply(word)(filename) = map(word)(filename) + 1)
      //map.update(word, map(word)(filename) = map(word)(filename) + 1)
      //map.update(word, map(word)(filename) += 1)
      //map(word) = (map(word)(filename) += 1)
    }
  }

The function takes a list of the names of the files mentioned above, and returns a map that has each word mapped to another map, and each of those maps has the names of the files in which that word occurred mapped to the number of times that word occurred in that file. There are three parts of the function:

  1. The first part of the function goes through all the files in that list, and then through each line in that file, and then through each token in that line (delimiting tokens by non-alphabetic characters), and puts each of those words in a list as a pair with the file in which they occurred. Thus wordsInFiles is a long list of words and file names, with an entry for every word on every line of every file.
  2. The function then initializes an empty map (emptyMap) with default values for the words and file names, and 0 for the word counts. This eliminates the need for a lot of hassle later on checking to see if words/file names are in the map – we can just assume they are there and trust it to use the default values if they aren’t.
  3. Finally, it operates on each pair in the wordsInFiles list and updates the map accordingly. foldLeft is explained thoroughly on the Ruby log file example linked above, but I’ll go through this specific case. It starts off with the emptyMap, goes through each pair in the wordsInFiles list, and performs a function on the pair of that map paired with that pair from the list ((map, (word, filename))) to fold that list pair into the map. The result of that fold is a map that is then used in the fold of the next item in the list, and this process continues for each (word, filename) pair in the wordsInFiles list.

    The function performed on each item in the list is not as complicated as it looks, and note that each of the commented lines is equivalent to the uncommented one – I left them to show the progressive application of syntactic sugar (which I won’t go into here). The purpose of this next line is to increment the count of the number of occurrences of the current word in the current file.

    The outermost map.update finds the word in the map and replaces the map with which it is associated with a new one. This new map needs to be an updated version of the previous map, which we retrieve with map.apply(word). We want to update only one of the values corresponding to the file names in that words’ map of filenames to occurrence-counts, so we need to get the previous count (using two map.apply’s to get to the value of a key in a map within a map) and increment it before the resulting map is sent to the update.

… “deep Scalax magic.”

I’ve saved the results of the visualization in the below images. The program didn’t quite create the effect that I had intended – that of showing how my vocabulary evolved over time – but I did give some sense of the topics about which I was thinking and the words I used to describe them. I thought about eliminating common words, but it seemed like it would be hard to make those decisions in a non-arbitrary manner. Click each image for a larger version in which the differences are more visibly apparent, and I recommend opening the below images in tabs and cycling through them with shortcut keys so that it’s easier to make quick comparisons.

A2Z
ITP
assignments

Programming A to Z – Delvicious, Django and Assignment #8

project page here with overview and previous posts

Since I had already worked with XML and web services for my midterm project, I decided to take this week’s assignment for Programming A to Z as an opportunity to continue work on that project. I thought that a good next step would be to use App Engine’s Datastore to store the necessary information about all of a user’s bookmarks, and I would again access that data initially as an XML document obtained from the Delicious API.

I had already begun to learn the Python web framework Django for the Textonic project for Design for UNICEF, and it seemed like it would provide a rich set of useful tools for this project as well. The Datastore does not work, however, like the relational databases with which I am more familiar. Django relies by default on a relational database and is thus incompatible with Google’s Datastore out-of-the-box, but there are two software projects that aim to reconcile these differences. Google’s App Engine Helper (tutorial, code) seemed less well- and less actively-developed than app-engine-patch (tutorial, code), so I decided to go with the latter.

Django is quite powerful and gives you a lot of functionality for free, but when trying to branch out from various tutorials I encountered the somewhat strange challenge of figuring out exactly what it did for you and what it didn’t. It took me a quite a while to get the hang of working with Django on App Engine, so I didn’t have time to actually get the XML stored in a reasonable set of Models in the database. I have, however, gotten successful storage of Delicious logins to work, which is a good first step. The updated code is available on GitHub, and I should be able to make additional improvements soon.

A2Z
ITP
assignments
projects

Programming A to Z – Assignments #6 and #7

Since it’s getting to be that time in the semester when I feel that I should be focusing on final projects, I didn’t spend too much time on the context free grammar and Bayesian classificiation assignments. My writeups for both are below.

I’ve studied context free grammars in the past (ahh undergrad memories of Ling120 (syllabus here, but not the one I had)), so I have a good sense of how they work. I made a few quick modifications to the grammar file used by Adam’s ContextFilter class to handle certain conjunctions, adverbs/adverbial phrases, and prepositional phrases. I also made some modifications to support quotations, but they aren’t particularly refined – I couldn’t come up with a simple solution to the nesting problem in the below example that didn’t involve duplication of many of the other rules:

this smug dichotomy thought ” that smug walrus foregrounds this time that yelled ” the seagull that spins this important sea said ” that restaurant that sneezes has come ” ” ”

Thinking about ways to solve this problem highlighted how CFGs can get rather complex rather quickly. When, for example, do you want a period/comma at the end of the quotation? When is one period/comma sufficient for multiple quotations? How do you resolve those situations programmatically?

My test.grammar file is online here, and you can run it with Adam’s Java classes that are zipped here. I recommend you only test a few of my rules at a time and comment out the others – otherwise you might get sentences like this:

the important trombone yelled ” wow, the blue amoeba said ” oh, this blue thing interprets this seagull that said ” wow, the trombone sneezes ” ” but this amoeba said ” this suburb that quickly whines in that sea that daydreams by that corsage that quickly or habitually computes this dichotomy slowly yet habitually vocalizes and damn, that luxurious restaurant habitually prefers this seagull ” but that suburb interprets the time yet damn, that sea said ” that restaurant tediously slobbers ” yet wow, that seagull quickly yet slowly foregrounds the restaurant or the boiling hot time spins this bald restaurant of this trombone but that amoeba computes this smug restaurant but the seagull that quickly or slowly prefers this sea yelled ” oh, the thing that spins this restaurant tediously yet quietly whines ” or wow, this amoeba coughs through the important sea and oh, that time tediously yet slowly coughs yet oh, that trombone habitually computes that luxurious suburb or wow, that thing that said ” my, that amoeba that said ” my, that time tediously or slowly sneezes ” quietly yet tediously foregrounds that trombone that has come ” said ” my, the restaurant that habitually but quietly prefers the dichotomy that said ” damn, this boiling hot trombone quietly slobbers by the trombone that quietly coughs ” said ” oh, the amoeba habitually coughs ” ” yet damn, this seagull that spins the seagull that sneezes for the important trombone quietly coughs but my, this sea that slowly yet habitually has come foregrounds this dichotomy and that blue thing tediously slobbers “

 
 
For the assignment on Bayesian classification I combined Adam’s BayesClassifier.java with the previous Markov chain examples to use n-grams instead of words as the tokens for analysis. BayesNGramClassifier.java can be found here as a .txt file, and you can download all of the required files here. Note you might have to increase the amount of memory available to Java to run the analysis with higher values for n. Try something like java -Xmx1024m BayesNGramClassifier 10 shakespeare.txt twain.txt austen.txt if you're having trouble.

I compared sonnets.txt to shakespeare.txt, twain.txt and austen.txt as in the example using various values of n for the analysis. The data is below, with the word-level analysis first. Note that higher numbers (i.e. those closer to zero) indicate a greater degree of similarity.

n shakespeare.txt twain.txt austen.txt
word -59886.12916378722 -64716.741899235174 -66448.68994538345
1 -311.94997977326625 -348.2797252624347 -351.8612295074756
2 -6688.356420467105 -6824.843204592283 -7076.488251510615
3 -46332.8806624305 -47629.58502376338 -49557.906858505885
4 -155190.04376334642 -161815.95665896614 -167839.50470553883
5 -350322.9494161118 -369897.08857782563 -379600.90797560615
6 -581892.4161591302 -620871.7848829604 -629557.118086935
7 -798094.5896325088 -851043.4785550251 -856926.3903304675
8 -977428.4318098201 -1033391.2297240103 -1037851.0025613104
9 -1106125.9701775634 -1153251.0919529926 -1159479.8816597122
10 -1184654.361656962 -1218599.6808817217 -1227484.9929278728
11 -1221770.2880299168 -1242286.351341775 -1255024.535274092
12 -1228641.7908902294 -1238848.5031651617 -1254404.827728626
13 -1214247.043351669 -1217403.6233457213 -1235480.9184978919
14 -1187489.0276571538 -1186476.2556523846 -1205959.6178494398
15 -1153511.2780243065 -1150529.1594209142 -1170746.8132369826

When the n-grams become 14 characters long (which is very long, considering the average length of English words) the analysis finally starts to break down, and it no longer correctly classifies sonnets.txt as being most similar to shakespeare.txt. Some values of n certainly perform better than others, but I'd need to delve further into the mathematics of how these numbers are calculated in order to do more detailed analysis.

A2Z
ITP
assignments