By Vincent Driessen
on Friday, October 03, 2014

Yesterday, a friend asked me how I would solve a certain problem he was facing. He did have a working solution, but felt like he could make it more generally applicable. Not shying away at a good challenge, I decided to take it and see how I would solve it. In this blog post you can read about my solution.

The Problem

Consider this JSON document. Return the same JSON document, but with the points list sorted by stop time in descending order.

{
  "timestamp": 1412282459,
  "res": [
    {
      "group": "1",
      "catlist": [
        {
          "cat": "1",
          "start": "none",
          "stop": "none",
          "points": [
              {"point": "1", "start": "13.00", "stop": "13.35"},
              {"point": "2", "start": "11.00", "stop": "14.35"}
          ]
        }
      ]
    }
  ]
}

Note:
It's kept short for brevity: the actual document contained many more items in each of the nested lists (so multiple groups, categories, and points), but this snippet covers the overall structure.

Analysis

This is an incredibly common task I'm sure any programmer with a sufficiently long career has encountered in one shape or form.

The simple, ad hoc, solution to the problem above is:

def sort_points(d):
    for group in d['res']:
        for cat in group['catlist']:
            cat['points'] = sorted(cat['points'], reverse=True, key=lambda p: p['stop'])

Some downsides in arbitrary order:

  • Not reusable. This function can only deal with dicts of the exact same structure. It's required to know the key names and the type of data living at each nesting level. Any other dict will make it choke;
  • Brittle. The algorithm itself needs to be changed when the document gets nested in another document, or when its structure should change. Nest it in another dictionary, and you need an extra for loop in there. Pass it just a category, and you need to take out a for loop;
  • Lacks abstraction. The algorithm mixes traversing the dict with modifying it—these should be two separate things.
  • Changes the dict in-place. There's no need to rely on using mutable data structures here. It should work for dicts you cannot change, too.

Solution

How do we solve this more elegantly?

The data itself isn't interesting in the slightest. This is a problem of structure only. Let's try to break out the pieces that are specific for this problem and see if we can factor out a generic piece, taking the specific parts as arguments.

The three key tasks we need to perform:

  1. traverse the entire structure recursively;
  2. determine if we've arrived at a given designated path;
  3. change the nested structure at that location (using given function).

Note that these steps already show our function params. We need a way of specifying the "path" to drill down into, and specify what transformation to apply to those targeted elements.


Traversal

First and foremost, we need a way of traversing arbitrarily nested structures. Given that this is only JSON data (so limited to strings, numbers, lists and dicts), we can start with a recursive function that will walk the structure and produce a new output which will essentially be a deep copy of the input:

def traverse(obj):
    if isinstance(obj, dict):
        return {k: traverse(v) for k, v in obj.items()}
    elif isinstance(obj, list):
        return [traverse(elem) for elem in obj]
    else:
        return obj  # no container, just values (str, int, float)

Here are some examples:

>>> traverse(3)
3
>>> traverse('ohai')
'ohai'
>>> traverse(['ohai', {'name': 'Vincent', 'age': 32}])
['ohai', {'name': 'Vincent', 'age': 32}]

Path Matching

To specify the target path into the structure, we need to come up with a syntax that can express those, a mini-language. Here's an example:

'res[].catlist[].points'

This notation clearly specifies the steps to take to drill down into the structure to arrive at any nested element. Note that each step is explicit: it's either a step into a dict key (the string), or into a list (the empty list). This convenient string notation is of course just sugar for the following:

['res', [], 'catlist', [], 'points']

How can we use this structure? Let's take the traverse() function and change it to keep a record of the paths it's traversing along as it traverses:

def traverse(obj, path=None):
    if path is None:
        path = []

    if isinstance(obj, dict):
        return {k: traverse(v, path + [k]) for k, v in obj.items()}
    elif isinstance(obj, list):
        return [traverse(elem, path + [[]]) for elem in obj]
    else:
        return obj

But we're not doing anyting with the path we're tracking this way yet.


Applying an Action

Now we need to use that path and in the interesting case perform an action. One way is to add that to the traverse() function, but it would do more than just traversing that way. Let's update the traverse function with a callback argument that will get called for every node in the structure (every leaf, dict entry, or list item). This would fully decouple traversing the structure from modifying it.

def traverse(obj, path=None, callback=None):
    if path is None:
        path = []

    if isinstance(obj, dict):
        value = {k: traverse(v, path + [k], callback)
                 for k, v in obj.items()}
    elif isinstance(obj, list):
        value = [traverse(elem, path + [[]], callback)
                 for elem in obj]
    else:
        value = obj

    if callback is None:  # if a callback is provided, call it to get the new value
        return value
    else:
        return callback(path, value)

Now the traverse() function is really generic and can be used to replace any node in the structure. We can now implement our traverse_modify() function that will look for a specific node, and update it. In this example, the transformer() function is our callback that will be invoked on every node in the structure. If the current path matches the target path, it will perform the action.

def traverse_modify(obj, target_path, action):
    target_path = to_path(target_path)  # converts 'foo.bar' to ['foo', 'bar']

    # This will get called for every path/value in the structure
    def transformer(path, value):
        if path == target_path:
            return action(value)
        else:
            return value

    return traverse(obj, callback=transformer)

Back to the Problem

There we have it: a generic, reusable function that abstracted out each individual interaction: traversal, path matching, and action. To solve our original problem with it, now simply call:

from operator import itemgetter

def sort_points(points):
    """Will sort a list of points."""
    return sorted(points, reverse=True, key=itemgetter('stop'))

traverse_modify(doc, 'res[].catlist[].points', sort_points)

Here's the gist to the complete solution.

Go Wild

This blog post was meant to provide some insight into a typical software engineering approach to problem solving. Breaking down a problem into pieces, and abstracting out the essence is perhaps the most joyful thing to do as an engineer.

The constructed traverse_modify() function could be written in many different ways if you would like to give it more power. Examples are supporting the following path specs:

qux[].(foo|bar)[]  # match both foo or bar
results[0..4]      # only apply action to first 4 results
foo.bar.*          # match any key value
foo.bar.**.qux     # match 0 or more levels of keys
...

Another extension would be to allow it to take multiple target-path/action combinations and apply them all in a single traversal.

All of that is left as an exercise to the reader though, to keep this post clear and concise.

Summary

The fantastic thing is that, picking a good abstraction for the traverse() function opens a door to an entire world of possibilities to modify arbitrary Python object structures. In the end, you get more than you initially set out for.

If you build something cool based on this however, please let me know :)

If you want to get in touch, I'm @nvie on Twitter.