September 19th 2004 anselm@hook.org
Last rev Sep 23 2004 - continuing to add comments to Javascript section.
No source code is published yet - this is still a work in progress.
Today we're going to build a social content engine for organizing and sharing content with our friends.
The service we build will let you:
Our work will be modelled on newly emerging services including del.icio.us, Flickr and Webjay. The code itself will be a rewrite of a fairly small subset of Thingster and BooksWeLike, which I've been developing (and learning to understand the implications of) over the last 6 months or so. If you haven't used delicious in particular, you need to stop reading this, go there, make an account and play with it for a while.
Users use these services to organize their own content for later recollection. But since the services are public, other users can peek into the collective space, and discover similar items, topics or persons.
In this project we're going to look for opportunities to stress the 'synthesis' aspect of social discovery; to escape from the pattern of curated collections managed and presented by one person. If there is time it would be fun to play with generating statistics and views on participants and their recommendations as well.
The components that we need to write to deliver our service will include:
One of the specific things we're going to build into our service is a 'tags' mechanism as popularized by delicious. Users will be able to publish tags to categorize items of interest and other users will be able to pivot on those tags to discover items of like interest.
We are going to push RDF quite hard. We will write a lightweight persistent and embeddable RDF triple store in Java - possibly being the first people to do so. This will be the cornerstone of our application and represents significant value even beyond this particular project. We'll also seek to use official RDF vocabularies as much as possible. We want to have something that is not only functional for our own use but that can interact with the rich ecology of the web - publishing data via RSS or RDF/XML to a wide variety of other services.
We are also going to push Javascript quite a bit to express the client side interface. Again we will seek to build fairly powerful components that will have significant reuse for other projects.
Overall the pattern of the finished project is to build an XML driven web-service built on top of industrial strength concepts that can be re-used for almost any conceivable knowledge management application.
To accomplish all this you will need these third party pieces:
The results should be quite fun to drive and fairly industrial. Let's take a look at some of the ideas next.
Here we're just going to muck about with casual observations about what it means to have a 'social content' system. Ideally we'd like to end up with a laundry list of constraints that can guide our choices.
There's an old saw that goes "actions speak louder than words". A car can have its left signal flashing but be travelling blindly down the road not turning at all... Or oncoming traffic may suddenly and mysteriously slow down suggesting the presence of a fine officer of the law doing his part to help keep a community orderly - or even just a kid crossing the street without illuminating the crosswalk signal.
In vehicular traffic drivers wheel and race, making moment to moment decisions on the basis of each other's inputs; signalling to each other in a variety of both intentional and unintentional ways. As a participant you end up creating a mental model of the things around you, the situational landscape, and the best navigation choices.
On the net there is a potential for similar behavior.
If we could just watch what people "do" instead of what they "say" we might find that the quality of knowledge we're getting from them is actually higher.
People on the net do of course signal to each other with a variety of intentional and explicit mechanisms. There are countless blogging services, craigslist, vanilla websites, listservs, email, wikis, p2p networks, irc, sms and on and on.
But that space has started falling over. There is incessant spam, and almost everything has become saturated with 'adwords by google'. The language and phrasing of traditional content has steered sharply towards maximizing ad revenue. The intentional signals are polluted and noise-ridden.
Watching flocks of humans pinwheel about has up until now been the domain of web portals. Now we're seeing this become more democratic as new p2p psychographic behaviour tracking services such as A9 and Ask Jeeves are rolled out.
The newer services that are emerging seem to have few parallels to existing services. Wikipedia of course does offer social benefit but it has content organized and massaged by hand. Orkut, Friendster, Multiply, LinkedIn are social but don't have any particular organizational utility; there is no personal activity that others observe - most behavior is explicit. CraigsList and Meetup and Upcoming do provide community but the signalling is all explicit again.
In automating the synthesis of many people's observations there is perhaps an immediacy, a lower latency between oneself and one's peers. Perhaps this satisfies an instinctive need for a sense of connectedness. The best I can say is that delicious seems more 'human' than say 'google news' or many of the other sites I look at on a daily basis.
Can we get anything specific from all this? Here's a grab bag of constraints:
As we find ourselves employing capricious aesthetics to arbitrate between technology choices we can bolster this list.
One thing that we do know is our service is simply a web-site.
We don't have to think much about "what kind" of web-site yet. And in fact we'd prefer not to. We'd like to pluck away all the orthogonal pieces and erase them from consideration as early as possible.
Since this "serving web content" is a well defined goal we can at least take it off our list. This will reduce the total number of things that we have to think about.
Overall then we're looking at some kickoff code that goes something like this:
public static void main(String[] args) {
    server = new jetty server
    context = new jetty context
    session = new subclassed instance of a jetty handler
}
Here we're not bothering to package up the system as a servlet. We want this to be easily accessible to the debugger and we're basically in a hurry overall. We want to build the whole project in less time than our boredom threshold. Considering how to package something as a servlet will multiply the total number of considerations in this project and create spurious complexity.
The Jetty Documentation tells us that we need to subclass a Jetty Resource Handler to do actual work. In this case we invent a 'Session' concept that will be responsible for replying to user requests as per our application. In broad strokes this will look like this:
public class Session extends org.mortbay.jetty.ResourceHandler {
handle_event() {
if the request is for a vanilla web page then just return it
if the request is a database query then pass it off to some kind of query handler we are about to detail out.
return query results as an xml graph
}
}
Our Session Handler above will be shallow. We're going to push most of the work off to an XML query handler layer.
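To make the shape of that dispatch a little more concrete, here is a rough sketch with Jetty left out entirely. It only classifies a request path and returns a placeholder reply. The '/query' prefix and the names 'RequestRouter', 'Kind', 'classify' and 'handle' are all made up for illustration; the real Session would subclass the Jetty handler and do actual work.

```java
// Hypothetical sketch: decide what kind of request we have, without Jetty.
class RequestRouter {
    enum Kind { STATIC_PAGE, QUERY }

    // Assumption: database queries arrive under a "/query" path prefix.
    static final String QUERY_PREFIX = "/query";

    static Kind classify(String path) {
        if (path != null
                && (path.equals(QUERY_PREFIX) || path.startsWith(QUERY_PREFIX + "/"))) {
            return Kind.QUERY;
        }
        return Kind.STATIC_PAGE;
    }

    // Stand-in for handle_event(): returns a description of what would be sent back.
    static String handle(String path) {
        switch (classify(path)) {
            case QUERY:
                return "<results/>";      // a real handler would return query results as XML
            default:
                return "static:" + path;  // a real handler would stream the file itself
        }
    }

    public static void main(String[] args) {
        System.out.println(handle("/index.html"));
        System.out.println(handle("/query/anselm/rdf"));
    }
}
```

Keeping the classification in one small function is what lets the handler stay shallow: everything interesting happens behind the query branch.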
One complexity that we have to keep in mind is that multiple response handlers can be active at the same time so we'll have to remember to put semaphores or synchronized blocks around any code that isn't thread-safe. This will require a careful audit of the project when it is done.
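As a small illustration of the kind of guard we mean, the sketch below protects a shared list with synchronized methods and then hammers it from several threads. The class 'SharedPosts' and its 'demo' method are hypothetical and only stand in for whatever shared state the real handlers touch; it is written in present-day Java (generics and all) for brevity.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: shared state guarded by synchronized methods.
class SharedPosts {
    private final List<String> posts = new ArrayList<String>();

    // Only one thread at a time can be inside a synchronized method of this object.
    public synchronized void add(String post) {
        posts.add(post);
    }

    public synchronized int count() {
        return posts.size();
    }

    // Simulates several response handlers writing at once and returns the final count.
    static int demo(int threads, final int perThread) {
        final SharedPosts store = new SharedPosts();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(new Runnable() {
                public void run() {
                    for (int j = 0; j < perThread; j++) {
                        store.add("post");
                    }
                }
            });
            workers[i].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return store.count();
    }

    public static void main(String[] args) {
        System.out.println(demo(4, 1000));
    }
}
```

With the synchronized keyword in place, four threads adding 1000 posts each always leave 4000 posts in the list. Remove it and the unguarded ArrayList can lose writes or throw.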
Now that we can "start up our app" we need to pick another piece to do. Our choices are the query layer or the user interface. But it really does seem like we are going to have to do a bit of real work now and deal with our actual persistent datastore. Since we have a main() entry point we should be able to do quick tests of anything we now write.
The actual code for the above should be in the tarball at the end of this project.
The first piece of real work is to write a lightweight RDF Triple Store. This section will get the most discussion in fact; there are many details here.
Again here we don't have to think much about "what kind" of application is going to use the triple store. In a sense we're making a decision that will enforce design a priori - because of previous experiences I've had with RDF and influences I've gotten from other people who have used RDF quite successfully.
RDF is a perpetually emerging grammar for expressing the relationships between objects. It will be the cornerstone of this project and just about every other project that we walk through. We're actually going to use RDF/XML - one way of expressing RDF.
One of the things we need to do is to load up RDF content off disk. Although we're interested in writing a datastore we're actually not that terribly interested in writing an RDF parser. And excellent ones already exist. To load content into our RDF database we'll use Jena's RDF parser called ARP:
Another thing we need to do is to store stuff. We are not really keen on writing a BTree on Disk or some other storage system. Java 1.4 does support NIO - memory mapped IO and it is somewhat appealing to write our own system based on that - but this takes us out of Java 1.2 land and breaks a design constraint. Also there are some rather bizarre systems such as Prevayler which offer transparent persistence but I'm just not sure about the idea of inhaling hundreds of thousands of RDF triples every time we start up - regardless of performance. In this case we're going to go with PERST - which is a very nice datastore written by some crazy Russian guy:
Side Note:
We could in fact avoid writing our own triple store if we used Jena or Kowari:
http://www.hpl.hp.com/semweb/jena2.htm
http://www.kowari.org
And in fact we could just grab an open source blogging tool off the shelf:
We're not going to go with the completely off-the-shelf solutions in this project because:
As far as I know nobody else has written an embeddable persistent Java based RDF Triple store yet. As soon as one comes out we can chuck all of this code out the window - but to achieve our learning and portability constraints we are (for now) forced to use a solution that we write ourselves.
Another big question - possibly the biggest question of this entire project - is what is the best mapping between RDF (say in an XML file) and RDF in memory.
There are a number of excellent W3C sponsored articles on RDF mappings to RDBMS. (In this case we're looking for a mapping from RDF to an OODB - but the ideas are the same). These articles:
http://www.w3.org/2001/sw/Europe/reports/scalable_rdbms_mapping_report/
and
http://www.w3.org/2001/sw/Europe/reports/rdf_scalable_storage_report/
talk about some of the data-type requirements and implementation issues that an RDF Store might have for example. Some of the completely reasonable considerations they cite are:
We're actually going to respectfully ignore quite a bit of this good advice - but it is worth reading.
Our RDF database is going to have only a single kind of persisted object - an RDF triple. Where an RDF Triple consists of a:
{ Subject, Predicate, Value }
Each of these parts can be represented in Java:
In Java our simple triple container would look like this:
public class Triple extends Persistent {
public String sub;
public String pred;
public String val;
}
Side Note:
Even if we're not going to be formal we should be at least aware of the weaknesses of both the data model and the representation of that data model being used here:
Another way to store RDF triples would be to bind all triples associated with a given subject as a single Subject node. Doing this in Java would look like so:
public class Reference extends Persistent {
public String subject;
public Hashtable values = new Hashtable();
}
Although we're not doing it this way - this second way does have a subtle advantage. It would allow a query engine to operate across disjoint database back ends. For example you might have a spatial database and a vanilla subject-sorted keyword index and you might want to return some features from each. Since each reference is fully self contained you could easily emit a stream of blended features - without having to duplicate those features into each database. This is a significant benefit - but again something we're not doing.
Yet another way to do this would be to use an IDL to generate your java objects from an OWL definition. This is completely insane but I can see cases where people might do it:
public class MyRDFPerson {
public String uri;
public int age;
public float height;
}
We are going to use the first approach. However we will wrap the triples inside of a Reference class, in the spirit of the second approach above, so that from the outside you won't really care about the implementation that much. In fact it will be very easy to swap implementations - even as far as switching to Jena or directly backing your persistence requirements with PostgreSQL.
Here is what that Reference class is going to look like:
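public abstract class Reference {
    String uri;
    // Look up a value stored under this predicate for this subject.
    public abstract String get(String predicate);
    // Store a value under this predicate, optionally alongside existing values.
    public abstract String set(String predicate, String value, boolean allowDuplicates);
}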
The rules we'd think of normally associating with set() would say that duplicate predicates are not allowed per subject. In a Java class for example you can't say "int myvalue; int myvalue;". But in RDF this method can explicitly allow a given predicate to be declared more than once if allowDuplicates is true. You'd typically however want an rdf:Bag. Let's say that for example you wanted to associate several tags with a given subject - you'd want to declare a child bag that belongs to that subject and have that child cite all of the tags in question.
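Here is one way the allowDuplicates behaviour could play out, as a purely in-memory sketch. 'MemoryReference' and 'getAll' are hypothetical names, nothing is persisted, and the choice to have set() hand back the previous value is an assumption on my part (the signature above doesn't say what the returned String is). It uses present-day Java collections for brevity.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory stand-in for the Reference wrapper; nothing is persisted.
class MemoryReference {
    final String uri;
    private final Map<String, List<String>> values = new LinkedHashMap<String, List<String>>();

    MemoryReference(String uri) {
        this.uri = uri;
    }

    // Returns the first value for the predicate, or null if none is set.
    public String get(String predicate) {
        List<String> found = values.get(predicate);
        return (found == null || found.isEmpty()) ? null : found.get(0);
    }

    // Returns a copy of every value for the predicate (empty if none).
    public List<String> getAll(String predicate) {
        List<String> found = values.get(predicate);
        return found == null ? new ArrayList<String>() : new ArrayList<String>(found);
    }

    // Assumption: returns the previous first value (or null), in the style of Map.put.
    public String set(String predicate, String value, boolean allowDuplicates) {
        String previous = get(predicate);
        List<String> found = values.get(predicate);
        if (found == null) {
            found = new ArrayList<String>();
            values.put(predicate, found);
        }
        if (!allowDuplicates) {
            found.clear(); // a single-valued predicate is simply overwritten
        }
        found.add(value);
        return previous;
    }
}
```

Setting 'title' twice with allowDuplicates false leaves only the second title, while setting 'tag' twice with allowDuplicates true keeps both tags. An rdf:Bag would be the more formal way to hold the tags.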
At this stage we have a concept of a 'Reference'. This acts as a bag for predicates and values associated with a given Subject.
What we need now is actual persistence and a way to manufacture and store handles on our Reference objects. Basically now we're going to just glue all of the pieces into one huge blob called 'Database'.
So this is where we call upon PERST to do the heavy lifting for us:
import org.garret.perst.*;
class DatabaseRoot extends Persistent {
FieldIndex subs;
FieldIndex preds;
FieldIndex vals;
}
This incantation declares 3 persistent field indexes using PERST. Now when we commit triples into the database we commit them to all 3 indexes. And to query for any triple we can query any of the indexes.
PERST supports range queries, exact queries, and "subject starts with" string queries. Queries can be done in forward or reverse index order.
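To show the idea of "commit to all three indexes, query any of them" without dragging in PERST, here is an in-memory sketch that uses sorted maps as stand-ins for the field indexes. 'TinyTripleStore' and its method names are hypothetical, and this is not the PERST API, just the same pattern in plain (present-day) Java.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory stand-in for the three PERST field indexes.
class TinyTripleStore {
    static class Triple {
        final String sub, pred, val;
        Triple(String sub, String pred, String val) {
            this.sub = sub;
            this.pred = pred;
            this.val = val;
        }
    }

    private final TreeMap<String, List<Triple>> subs = new TreeMap<String, List<Triple>>();
    private final TreeMap<String, List<Triple>> preds = new TreeMap<String, List<Triple>>();
    private final TreeMap<String, List<Triple>> vals = new TreeMap<String, List<Triple>>();

    // Every triple is committed to all three indexes.
    void add(String sub, String pred, String val) {
        Triple t = new Triple(sub, pred, val);
        put(subs, sub, t);
        put(preds, pred, t);
        put(vals, val, t);
    }

    private static void put(TreeMap<String, List<Triple>> index, String key, Triple t) {
        List<Triple> bucket = index.get(key);
        if (bucket == null) {
            bucket = new ArrayList<Triple>();
            index.put(key, bucket);
        }
        bucket.add(t);
    }

    private static List<Triple> exact(TreeMap<String, List<Triple>> index, String key) {
        List<Triple> bucket = index.get(key);
        return bucket == null ? new ArrayList<Triple>() : new ArrayList<Triple>(bucket);
    }

    // Exact queries against the predicate and value indexes.
    List<Triple> byPredicate(String pred) { return exact(preds, pred); }
    List<Triple> byValue(String val) { return exact(vals, val); }

    // "Subject starts with" query: walk the sorted index from the prefix onward
    // and stop at the first key that no longer shares the prefix.
    List<Triple> bySubjectPrefix(String prefix) {
        List<Triple> out = new ArrayList<Triple>();
        for (Map.Entry<String, List<Triple>> e : subs.tailMap(prefix, true).entrySet()) {
            if (!e.getKey().startsWith(prefix)) {
                break;
            }
            out.addAll(e.getValue());
        }
        return out;
    }
}
```

The prefix walk works because all keys sharing a prefix sit next to each other in a sorted index, which is the same property that makes a "starts with" query cheap in a real on-disk index.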
For our needs this will suffice. For example:
However to do more complex queries such as say find all things that are within a certain value of predicate "geo:long" and predicate "geo:lat" we have to issue multiple queries and do explicit joins by hand. Technically speaking however one can actually avoid fully explicit joins (where one has a full copy of each set) by using java code to iterate through the second set with the first set in hand. (In the particular case where we are doing something that looks like a spatial query - we could use the spatial indexing that PERST provides).
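The hand-rolled join might look something like the sketch below: one range query on "geo:long", then a walk through the "geo:lat" range keeping only the subjects we already hold. 'GeoJoin' and its methods are hypothetical, the index is an in-memory sorted map rather than PERST, and values are stored as doubles purely to keep the example short.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical sketch of a two-predicate join done by hand over sorted indexes.
class GeoJoin {
    // predicate -> (numeric value -> subjects carrying that value)
    private final Map<String, TreeMap<Double, Set<String>>> index =
            new HashMap<String, TreeMap<Double, Set<String>>>();

    void add(String subject, String predicate, double value) {
        TreeMap<Double, Set<String>> byValue = index.get(predicate);
        if (byValue == null) {
            byValue = new TreeMap<Double, Set<String>>();
            index.put(predicate, byValue);
        }
        Set<String> subjects = byValue.get(value);
        if (subjects == null) {
            subjects = new HashSet<String>();
            byValue.put(value, subjects);
        }
        subjects.add(subject);
    }

    // Range query on one predicate: subjects with low <= value <= high.
    Set<String> range(String predicate, double low, double high) {
        Set<String> out = new HashSet<String>();
        TreeMap<Double, Set<String>> byValue = index.get(predicate);
        if (byValue == null) {
            return out;
        }
        for (Set<String> subjects : byValue.subMap(low, true, high, true).values()) {
            out.addAll(subjects);
        }
        return out;
    }

    // Join by hand: hold the longitude matches, then iterate the latitude
    // matches and keep only subjects present in the first set.
    Set<String> within(double minLong, double maxLong, double minLat, double maxLat) {
        Set<String> longs = range("geo:long", minLong, maxLong);
        Set<String> out = new HashSet<String>();
        TreeMap<Double, Set<String>> lats = index.get("geo:lat");
        if (lats == null) {
            return out;
        }
        for (Set<String> subjects : lats.subMap(minLat, true, maxLat, true).values()) {
            for (String subject : subjects) {
                if (longs.contains(subject)) {
                    out.add(subject);
                }
            }
        }
        return out;
    }
}
```

Only the first set is ever materialized in full; the second is consumed as it streams past, which is the "first set in hand" trick described above.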
There's one more piece on top of all this that we need to add. We need some concept of an overall "database" that can yield instances of References that the application logic can then manipulate. That database layer will wrap PERST completely; making it invisible to the outside world and will look something like this:
interface Database {
public Reference get(String key);
}
With a little bit of glue this layer is basically done. Please refer to the associated tar-ball for the exact details.
Now we've done most of the hard stuff. We just have to think about the user experience and build out some UI. Actually that will also be quite a bit of work - but hard in a different way - as we wander a thicket of possible UI choices next.
Side Note:
A lot of people wonder if RDF is really any kind of improvement over other ways of expressing objects. People often complain that RDF/XML is overly verbose and not human editable for example. And people do wonder if the same content couldn't be packaged under some other schema altogether. Here are some of my thoughts as a first-time-user from playing with RDF over the last few months:
At this point we have a way to serve content, and we have a way to store content. Now we have to consider exactly how the user is going to interact with content.
Here is where we move into the thinking that specializes the design away from being any generic web driven database application.
We do know that effectively we're building a CMS - it understands what a user is, what posts are, how to perform various useful queries and enforces a permissions policy such that users cannot overwrite each others space. The kinds of concepts we're needing to manage include:
We also have a list of constraints from our earlier design talk.
One thing we do know is that there will be users and user accounts. Presumably users have preferences as well.
As well users will make 'posts'.
These roles seem fairly clear. We can use FOAF to define people. And for posts we just define some RDF predicates in a vocabulary to capture basic post data. In fact we don't even have to do any work - we can just use RSS as is with <title> <link> and <description> being perfectly adequate.
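To give a feel for how little a post needs, here is a sketch that writes one out using just those three RSS elements. 'PostXml', 'escape' and 'item' are hypothetical names, and a real feed would wrap the items in the usual RSS channel; the only thing the sketch really insists on is escaping user text before it goes into XML.

```java
// Hypothetical sketch: serialize one post with the three RSS elements.
class PostXml {
    // Escape the characters that would otherwise break the XML.
    static String escape(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }

    static String item(String title, String link, String description) {
        return "<item>"
                + "<title>" + escape(title) + "</title>"
                + "<link>" + escape(link) + "</link>"
                + "<description>" + escape(description) + "</description>"
                + "</item>";
    }

    public static void main(String[] args) {
        System.out.println(item("Tom & Jerry", "http://example.org/?a=1&b=2", "x < y"));
    }
}
```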
Tags are a new concept here and get a little bit more discussion.
Let's cite a few things that tags do:
Note that there isn't any particularly deep reasoning as to why we're using tags - it's just an easy, convenient, brief and memorable concept for users.
At the same time there is quite a bit of formal discussion on voluntary categorization, prototype theory and the like. You can read some of the literature in cognitive psychology for more discussion of these topics - in particular Eleanor Rosch and George Lakoff. But at the same time it's probably best to think of tags as a simple colloquial concept and not to read too much into them.
Here is a bit of a ramble about some of the thinking however. One essay that I like to drag out even now is:
I like to use the made-up phrase 'platypus effect' to capture a bit of the ideas expressed by Antero Taivalsaari:
At the time I was puzzled by finding ways to categorize knowledge - wanting to build all kinds of complicated virtual file systems and the like. (I sometimes wonder if Ma Bell didn't invent C++ and OOP abstraction because of their problem domain - dealing with millions of identical phone records. If Ma Bell had been say a games developer instead they might have encouraged something that dealt better with lots of heterogeneous types.)
But Del.icio.us tags pretty much demonstrated that this was actually trivial - and that thinking about this too much is basically just a waste of time.
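Just to underline how trivial it can be, here is a sketch of tags as nothing more than two maps: one from tag to items and one from item to tags. Pivoting on a tag is then a lookup in the first map followed by lookups in the second. 'TagIndex', 'itemsFor' and 'relatedTags' are hypothetical names, and this is an in-memory illustration rather than the way the triple store will hold tags.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical sketch: tags as a pair of sorted maps.
class TagIndex {
    private final Map<String, Set<String>> itemsByTag = new TreeMap<String, Set<String>>();
    private final Map<String, Set<String>> tagsByItem = new TreeMap<String, Set<String>>();

    void tag(String item, String tag) {
        bucket(itemsByTag, tag).add(item);
        bucket(tagsByItem, item).add(tag);
    }

    private static Set<String> bucket(Map<String, Set<String>> map, String key) {
        Set<String> set = map.get(key);
        if (set == null) {
            set = new TreeSet<String>();
            map.put(key, set);
        }
        return set;
    }

    // All items carrying this tag, in sorted order.
    Set<String> itemsFor(String tag) {
        Set<String> items = itemsByTag.get(tag);
        return items == null ? new TreeSet<String>() : new TreeSet<String>(items);
    }

    // Pivot: every other tag that shares at least one item with this tag.
    Set<String> relatedTags(String tag) {
        Set<String> related = new TreeSet<String>();
        for (String item : itemsFor(tag)) {
            related.addAll(tagsByItem.get(item));
        }
        related.remove(tag);
        return related;
    }
}
```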
The URL presents a very small text space within which a number of not completely orthogonal concepts are being 'crumpled'. We are effectively trying to represent a set of slightly irrational 'human shaped' ideas within a few dozen bytes. The URL space should be:
Del.icio.us uses an especially nice pattern where the url path represents a kind of 'sum of children streams'.
We're going to do something similar where the URL is broken up like this:
Effectively the url is broken into:
[ domain ] / [ username ] / [ tag ] [ ?styles ]
Each parent folder sums up all of the content of all children folders. It's an intuitive and useful metaphor. It even works with hierarchical tags.
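A sketch of pulling those pieces out of a request might look like this. 'StreamPath', 'parse' and 'covers' are hypothetical names, the domain is assumed to have been stripped off already, and the styles part is kept as a raw query string since we haven't decided what goes in it yet.

```java
// Hypothetical sketch: split "/username/tag?styles" into its parts.
class StreamPath {
    final String username; // null means the site-wide stream
    final String tag;      // null when no tag is given; may contain '/' for hierarchical tags
    final String styles;   // raw query string, or null

    StreamPath(String username, String tag, String styles) {
        this.username = username;
        this.tag = tag;
        this.styles = styles;
    }

    static StreamPath parse(String pathAndQuery) {
        String path = pathAndQuery;
        String styles = null;
        int q = pathAndQuery.indexOf('?');
        if (q >= 0) {
            path = pathAndQuery.substring(0, q);
            styles = pathAndQuery.substring(q + 1);
        }
        // Trim leading and trailing slashes so "/anselm/" and "/anselm" match.
        while (path.startsWith("/")) {
            path = path.substring(1);
        }
        while (path.endsWith("/")) {
            path = path.substring(0, path.length() - 1);
        }
        if (path.length() == 0) {
            return new StreamPath(null, null, styles);
        }
        int slash = path.indexOf('/');
        if (slash < 0) {
            return new StreamPath(path, null, styles);
        }
        return new StreamPath(path.substring(0, slash), path.substring(slash + 1), styles);
    }

    // A parent stream sums up its children: "/anselm" covers "/anselm/rdf".
    boolean covers(StreamPath child) {
        if (username == null) {
            return true;
        }
        if (!username.equals(child.username)) {
            return false;
        }
        if (tag == null) {
            return true;
        }
        return child.tag != null
                && (child.tag.equals(tag) || child.tag.startsWith(tag + "/"));
    }
}
```

So "/anselm/rdf/stores?style=rss" parses to user 'anselm', tag 'rdf/stores' and styles 'style=rss' (the 'style' parameter name is invented here), and both "/anselm" and "/anselm/rdf" cover it while "/anselm/java" does not.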
An alternative pattern could be to do [username].[domainname]/[user tag path]. This is problematic simply for DNS management issues and because it ruins the opportunity to use the domain name space for other kinds of more appropriate overloading and precedence order. It is (arguably) more clear to humans to say "portland.craigslist.org/anselm" than to say "anselm.craigslist.org/portland" for example. So we won't do this.
Using a streams concept helps us work in RDF. There are some nice things we can do in the database layer for indexing and discovering collections of facts under a given stream or stream with a wildcard path.
Streams do create some worries and considerations however:
Now we're done thinking about the way the user "sees" the system.
We're not actually being terribly innovative here - just emulating patterns that work. Hopefully though we get to play a bit more later on once these foundations are in place.
Since we have a model of user interaction - with streams and tags and all that stuff - we need to figure out how we're going to drive that interaction. We have to make a bridge between the user and the database engine.
We're going to want a query layer that can be directly queried by the client application. This is not RDQL (although it could use RDQL or another query language) but is tailored towards our specific application. It also imposes a security wall so that users cannot pollute other users' content.
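The security wall can start out very small. The sketch below assumes that a subject belongs to whichever user is named by its first path segment, which is an assumption layered on the URL scheme rather than something the real code is known to do. 'WriteGuard', 'mayWrite' and 'checkWrite' are hypothetical names.

```java
// Hypothetical sketch: refuse writes outside the user's own space.
class WriteGuard {
    // Assumption: "/anselm/..." belongs to the user "anselm".
    static boolean mayWrite(String user, String subject) {
        if (user == null || user.length() == 0 || subject == null) {
            return false;
        }
        String home = "/" + user;
        // Require an exact match or a following slash so "anselm" cannot write "/anselmo".
        return subject.equals(home) || subject.startsWith(home + "/");
    }

    // The query layer would call this before committing any triple.
    static void checkWrite(String user, String subject) {
        if (!mayWrite(user, subject)) {
            throw new SecurityException(user + " may not write " + subject);
        }
    }

    public static void main(String[] args) {
        System.out.println(mayWrite("anselm", "/anselm/rdf/1"));
        System.out.println(mayWrite("jo", "/anselm/rdf/1"));
    }
}
```

Putting the check in the query layer, rather than in the client, means that no page a user receives can carry a command that reaches past it.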
Basically we just want a laundry list of the kinds of capabilities we need and then we can pluck out commonalities and implement something simple that translates these high level requests into actual indexed query lookups of our RDF database.
Typical queries are probably:
The discussion of the actual implementation of the query engine is probably too much detail for here. I'll let you look at the code to see the specifics of how these queries were implemented based on this set of use cases.
This javascript stuff all sounds terribly mundane but actually it's quite liberating - it means you as an engineer can get more stuff off your shoulders and get other people to deal with it. That means much more leverage, more people stirring the pot and more help overall.
What's happening is that a few web services now are starting to use Javascript - and that's a pattern we're going to use.
Google's Gmail and Amazon's A9 service are good examples of this.
Historically most web services manufactured the user interface on the server side using Mason, ASP, JSP or other such grammars. These solutions are actually quite difficult for designers to work with and they create a security liability in that the pages can express commands that permeate the security wall between the client and server state.
A cool thing about Javascript is that we're able to ask the server to ship us pure XML and then we (or our lackeys) can do the layout of that ourselves. We can even have long complicated dialogues with the server - asking small questions about users or state and making decisions based on that. We could let a user try to create a 'shared discussion group' and then advise the user if their group was made or if the name was already taken for example.
We're able to use the same patterns we would in an ordinary not-split-over-a-network application.
Here's a general laundry list of the reasons Javascript is appealing:
There are some drawbacks to using Javascript:
In the way we're going to use Javascript there are also a few seemingly bizarre design choices. We're going to simply have a single html document on the server that we're going to send to the client over and over. This single document will change its appearance based on the current URL that the client is on. And what this means is that we have to 'round-trip' form parameters back to the client document for it to do work.
In a sense we are shipping an 'application' to the client - and even though HTML is too stupid to know it - that application persists between pages and doesn't have to introduce any new pages.
The Javascript application delivers the UI. That UI consists of pieces like this:
There are going to be many UI pieces - but we can build them as we come across them. It doesn't require a lot of pre-planning.
The amusing thing about a Javascript based application is that the HTML is treated as just a launching point. There is almost no HTML at all:
<html>
<body>
<script type="text/javascript">
deal_with_entire_pages_content();
</script>
</body>
</html>
All of the work is done from javascript. It doesn't even make sense to draw header or footer banners in HTML unless they are absolutely universally constant.
The client application sits inside of our javascript code and more or less just fulfills the list of UI pages that we want to have. It's largely a sequence of functions that we pick between. We look at the user's current URL and the current parameters and then execute the appropriate subroutine to draw that page.
In the case of this application a fly-over of the code at 10,000 feet might look something like this:
All of these aspirations are going to be pinned on a small library of Javascript functions. We're going to write some XML utilities, some layout utilities and a few other bits and pieces. Overall the library will be something like this:
I don't have time to actually walk through exact code in this discussion. You'll have to refer to the tarball for now. Later I may add more comments to this.
Here is the tarball. [ Well it's not up yet but it will be in a week or so when I have a chance to finish it ].
These services are fun to build from a kind of mad scientist perspective. The tools we have today to architect these large scale social systems are so powerful and so easy to use that it can be as little as a few days work to unleash an entirely new social application on an unsuspecting public.
If you're going to use this starting point professionally then there are other considerations not covered here; such as finding ways to aggregate and/or federate content so that you can take advantage of laws of utility and avoid walled garden effects. As well if you are deploying a commercial service based on this code you may want to support some wiki-like concepts so that users can entirely customize their own experience.
What could you do with this?
You could make your own Craigs List such as discussed by Jo: http://frot.org/geo/craigslist.html
Your own personal knowledge tracking system - for tracking your habits or even your finances.
Effectively this becomes a big bucket that you can pour stuff into. If used personally it could become a hugely powerful tool for long term stuff organization and management; from tracking habits, health, phone numbers and other such often lost things to post-organizing existing collections of duplicate archives and the like. You could attach an aggregator to this and do say brute force geo-location of news-articles and project them onto a globe; and then do peer based review of those articles or additional decoration of facts from people who are on the ground in that area...
Really the sky is the limit.
In fact I originally started down this path with the hopes of writing a video game. The idea of managing users, managing content and doing it all in a high performance way came out of the kinds of demands that a large scale locative-media multi-player experience would have. I ended up recognizing that even building this foundation was a chore in itself and made just doing an RDF based CMS the first goal.
The thing to do is to think about where all of these services are going over the next 10 years. Clearly many of them are going to go away - and clearly others will have to find ways to federate and share their knowledge.
Hope you had fun.
Please send me comments if you liked this essay to anselm@hook.org. I'm also looking for ways to improve my understanding of this space so I'd like to hear advice about better or more rigorous ways to build an RDF database and to do embeddable persistence overall.
I'd like to thank Tangra, Maciej, Joshua, Brad Degraf, Dan Brickley and especially Jo Walsh for getting me interested in RDF in the first place. All mistakes are my own and many insights belong to these people.
- a