« Solving the Wrong Problems | Main | Narrative & Time Management »
Sunday
28Jun2009

Scalable Trust Metrics in the Google Appengine Data Store

This is a dump of my thoughts on how to use Google App Engine for the back end of Trustmap. I’m no expert on data and this may not make much sense if you’re not me.

From a data store perspective, a user’s immediate trustmap looks like:

class Trustmap(db.Model):
user = db.KeyProperty()
context = db.StringProperty()
users = db.ListProperty(db.Key)

This allows a query that shows who user trusts in context:

>>> tm = Trustmap.all().filter('user =', user).filter('context =', context).get()
>>> tm.users
[... list of users...]

This allows people to tag other people, much like you can use delicious to tag URLs. However, as with delicious, this query simply returns the information put into the system. There’s no added value.

Our aim with @trustmap is to provide that added value by calculating trust metrics for each supplied context. The idea being that if I trust user for context then I also implicitly trust who user trusts for context, albeit less than I trust user himself.

Take the following network:

       G       H      
\ /
A
/ \
B C
/ \ / \
D E F

A trusts B in a specific context and B trusts D and E. So according to the trust metric approach above, A can be said to implicitly trust D and E because A trusts B.

Now, our challenge is to store this information on Google Appengine, following the appengine paradigm of super fast reads. What are the issues?

Appengine Restrictions

Appengine is great because it’s real, live and functioning and provides a scalable data store that doesn’t slow down when it gets full of rows. It also provides memcache, a great little webapp framework and automated deployment.

However, it comes with a whole bunch of caveats and restrictions. Notably:

  1. requests need to be completed within a specific time (atm 30 secs)
  2. requests need to, on average, not consume anything like 30 seconds worth of CPU
  3. max of 1MB data in any one db.put operation
  4. transactions have to be within entity groups
  5. limited background processing
  6. limited remote API
  7. pure python only

Also, when appengine apps scale towards relatively high usage, db operations error more frequently (or so I’m told here).

The Naive Approach

The naive approach is to simply iterate through trustmaps when querying. Silly pseudocode being:

zero_degrees = []
one_degree = []
two_degrees = []
...
zero_degrees = trustmap
for user in zero_degrees:
one_degree += user.trustmap
for user in one_degree:
two_degrees += user.trustmap
...
return zero_degrees + one_degree + two_degrees

This has the benefit of being simple and working with just the Trustmap model class from above. However, it’s very quickly preclusively expensive and slow to calculate. If a user has, on average 1000 people in their trustmap for a context and we calculate 2 degrees of separation, we get 1,000 * 1,000 * 1,000 = 1,000,000,000 calculations.

Keeping a Lid on the Calculations

There are two factors that influence the number of calculations: the number of people in a trustmap for a context and the degrees of separation. One way of keeping the numbers down (tx to David Pinto for the advice) is to prescribe a small maximum for the former, say max 5 people per context. This would make the max calcs for 2 degrees of separation to 5 * 5 * 5 = 125.

However, whilst an intriguing ‘creative constraint’ I have the feeling people might want to trust more people than that for important tags. Say you’re a casting manager keeping track of a lot of dancers, you might want to trust more than 5 people for #chorus. Or a marketeer might want to trust a long list for distribution. Conclusion: setting a max per context is a good limiter but needs to be unobtrusively higher than 5. Say 30 or 100 or 500.

The second weapon is limiting the degrees of separation. We want to add value but we need to avoid calculations. How would just one degree of separation work out? Say with a maximum of 99 users per context? That’s a maximum of 10,000 trustmaps affected by one change. Can we handle that?

Computed Trustmap

Let’s assume that, instead of building the results on query, we’re going to build them when data is written. One approach is to flag up which trustmaps need to be recalculated in response to an add (i.e.: when a user trusts) or remove (when a user untrusts) event.

Let’s imagine we have a new class::

class ComputedTrustmap(db.Model):
user = db.KeyProperty()
context = db.StringProperty()
users = db.ListProperty(db.Key)

Imagine that the left hand side of the network diagramme above doesn’t exist and that we’re at the point when A decides to trust B for a specific context. What changes?

  1. A’s trustmap
  2. A’s computed trustmap
  3. G & H’s computed trustmaps

A’s trustmap changes because A has added a user to it. A’s computed trustmap has changed because A now trusts B’s users for context and G and H’s change because they trust A and thus should have A’s trustmap in their computed trustmaps.

Calculate What?

Having worked out which trustmaps / computed trustmaps are affected, we have two routes. The first would be to recalculate the affected trustmaps, along the lines of the ‘naive’ logic above. However, this seems like overkill because, after all, the majority of those trustmaps is going to be just the same.

The second approach is to make the specific changes to the trustmaps required by the event. Let’s look at what those would be in this case:

  1. we need to add B to A’s trustmap
  2. we need to add B’s trustmap to A’s computed trustmap
  3. we need to add B to G & H’s computed trustmaps

How would this work in practice? Say we have an initial request handler (or a synchronous function called by the request handler). The first operation, adding B to A’s trustmap is trivial. How about the second operation, adding B’s trustmap to A’s computer trustmap?

The complexity of the second operation comes from the need to sort a computed trustmap. The logic is as follows. If you query a users’ trustmap for a context, the first results should be the people that user trusts directly (from their direct trustmap), followed by the people that user trusts indirectly (from the computed trustmap). However the people from the computed trustmap should also be ordered so that someone trusted indirectly twice (i.e.: by two people in the direct trustmap) should come before someone trusted once, and so on.

Now, we’re going to sort the computed trustmap by key_name, as this allows for efficient batching. So one solution might be to have multiple ComputedTrustmap entities for a user and a context that have different priorities. That way, when adding (or removing) a user from the computed trustmap, the pseudocode logic would be:

>>> query = ComputedTrustmap.all().filter('user =', user).filter('context =', context)
>>> ctm = query.filter('users =', trusted_user).get()
>>> if ctm: # move trusted_user from this entity to the entity with the next highest priority
>>> else: # add trusted_user to the entity with the lowest priority

(The key [ahem] to efficient batched querying would then be to put the priority in the entity key, after the user and context ala:

keyname = u'%s/%s/%s' % (user, context, priority)

A query like:

ComputedTrustmap.all().filter('user =', user).filter('context =', context).order('__key__')

Would then return a list of matching entities, in order of priority, each with a list of indirectly trusted users.)

Now, the routine, set out in four lines of pseudocode above, for adding (or removing) a user from the computed trustmap adds a bit of faff but is still limited by the maximum number of people per context. In contrast, the issue with the third operation, adding B to G & H’s computed trustmaps, is that whilst we can limit the maximum number of people a user can trust for a context, we can’t limit the number of people who trust a user.

For example, how many users would trust Seth Godin for a relevant context? So when Seth occasionally makes a change, thousands or more people may require their computed trustmaps to be updated. This means that whilst the second operation could perhaps have been done within the scope of an initial request handler, the third operator has to have the capacity to be iterated through by a background process.

We’ll need to store a snapshot of the users who need to have B added to their computed trustmaps. So we get a class like:

class DirtyUsers(dm.model):
users = db.ListProperty(db.Key)
context = db.StringProperty()
action = db.StringProperty() # 'add' | 'remove'
user = db.KeyProperty()
...

We then iterate in a worker to:

  1. action(user) to computed trustmap (following the logic outlined above)
  2. updated / delete the DirtyUsers entity to keep track of the computed trustmaps remaining to update

Ok, so far so good, we’ve got a mechanism of making just the relevant changes in response to an event. However, this approach introduces a further problem: the possibility of the trustmaps getting out of sync.

Sync

The problem is that, because we only ever change what needs to be changed, if, for some reason, a change isn’t made, the trustmap and thus all future calculations based upon it will be erroneous.

Let’s look at the specifics. We have two sets of ‘puts’. The first:

  1. add B to A’s trustmap
  2. add B’s trustmap to A’s computed trustmap
  3. store DirtyUsers entity for context, action and user

The second as outlined above:

  1. action(user) to computed trustmap (following the logic outlined above)
  2. updated / delete the DirtyUsers entity to keep track of the computed trustmaps remaining to update

Now we can, if we want, use a transaction for the first set. The DirtyUsers entity, and A’s Computed Trustmap can all have A’s Trustmap entity as parent and thus belong to the same entity group. However, whether we do that or not, we can’t update the DirtyUsers entity in the same transaction as the ComputedTrustmap entities in the worker iterations.

However, we do have a ‘get out of jail’ mechanism for these computations: if it goes wrong, re-calculate everything. If we store the original action (A trusts B for context) on the DirtyUsers entity we can then have an occasional background process that scans for very old unprocessed DirtyUsers and, if they seem suspicious, delete them and file all the affected trustmaps for re-calculation.

Thoughts?

Reader Comments

There are no comments for this journal entry. To create a new comment, use the form below.

PostPost a New Comment

Enter your information below to add a new comment.

My response is on my own website »
Author Email (optional):
Author URL (optional):
Post:
 
Some HTML allowed: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <code> <em> <i> <strike> <strong>