Yesterday, a friend asked me how I would solve a certain problem he was facing. He did have a working solution, but felt like he could make it more generally applicable. Not shying away at a good challenge, I decided to take it and see how I would solve it. In this blog post you can read about my solution.
The Problem ¶
Consider this JSON document. Return the same JSON document, but with the points list sorted by stop time in descending order.
{ "timestamp": 1412282459, "res": [ { "group": "1", "catlist": [ { "cat": "1", "start": "none", "stop": "none", "points": [ {"point": "1", "start": "13.00", "stop": "13.35"}, {"point": "2", "start": "11.00", "stop": "14.35"} ] } ] } ] }
Note:
It's kept short for brevity: the actual document contained many more items in each of the nested lists (so multiple groups, categories, and points), but this snippet covers the overall structure.
Analysis ¶
This is an incredibly common task I'm sure any programmer with a sufficiently long career has encountered in one shape or form.
The simple, ad hoc, solution to the problem above is:
def sort_points(d): for group in d['res']: for cat in group['catlist']: cat['points'] = sorted(cat['points'], reverse=True, key=lambda p: p['stop'])
Some downsides in arbitrary order:
- Not reusable. This function can only deal with dicts of the exact same structure. It's required to know the key names and the type of data living at each nesting level. Any other dict will make it choke;
- Brittle. The algorithm itself needs to be changed when the document gets
nested in another document, or when its structure should change. Nest it in
another dictionary, and you need an extra
for
loop in there. Pass it just a category, and you need to take out afor
loop; - Lacks abstraction. The algorithm mixes traversing the dict with modifying it—these should be two separate things.
- Changes the dict in-place. There's no need to rely on using mutable data structures here. It should work for dicts you cannot change, too.
Solution ¶
How do we solve this more elegantly?
The data itself isn't interesting in the slightest. This is a problem of structure only. Let's try to break out the pieces that are specific for this problem and see if we can factor out a generic piece, taking the specific parts as arguments.
The three key tasks we need to perform:
- traverse the entire structure recursively;
- determine if we've arrived at a given designated path;
- change the nested structure at that location (using given function).
Note that these steps already show our function params. We need a way of specifying the "path" to drill down into, and specify what transformation to apply to those targeted elements.
Traversal ¶
First and foremost, we need a way of traversing arbitrarily nested structures. Given that this is only JSON data (so limited to strings, numbers, lists and dicts), we can start with a recursive function that will walk the structure and produce a new output which will essentially be a deep copy of the input:
def traverse(obj): if isinstance(obj, dict): return {k: traverse(v) for k, v in obj.items()} elif isinstance(obj, list): return [traverse(elem) for elem in obj] else: return obj # no container, just values (str, int, float)
Here are some examples:
>>> traverse(3) 3 >>> traverse('ohai') 'ohai' >>> traverse(['ohai', {'name': 'Vincent', 'age': 32}]) ['ohai', {'name': 'Vincent', 'age': 32}]
Path Matching ¶
To specify the target path into the structure, we need to come up with a syntax that can express those, a mini-language. Here's an example:
'res[].catlist[].points'
This notation clearly specifies the steps to take to drill down into the structure to arrive at any nested element. Note that each step is explicit: it's either a step into a dict key (the string), or into a list (the empty list). This convenient string notation is of course just sugar for the following:
['res', [], 'catlist', [], 'points']
How can we use this structure? Let's take the traverse()
function and change
it to keep a record of the paths it's traversing along as it traverses:
def traverse(obj, path=None): if path is None: path = [] if isinstance(obj, dict): return {k: traverse(v, path + [k]) for k, v in obj.items()} elif isinstance(obj, list): return [traverse(elem, path + [[]]) for elem in obj] else: return obj
But we're not doing anyting with the path
we're tracking this way yet.
Applying an Action ¶
Now we need to use that path and in the interesting case perform an action.
One way is to add that to the traverse()
function, but it would do more than
just traversing that way. Let's update the traverse function with a callback
argument that will get called for every node in the structure (every leaf, dict
entry, or list item). This would fully decouple traversing the structure from
modifying it.
def traverse(obj, path=None, callback=None): if path is None: path = [] if isinstance(obj, dict): value = {k: traverse(v, path + [k], callback) for k, v in obj.items()} elif isinstance(obj, list): value = [traverse(elem, path + [[]], callback) for elem in obj] else: value = obj if callback is None: # if a callback is provided, call it to get the new value return value else: return callback(path, value)
Now the traverse()
function is really generic and can be used to replace any
node in the structure. We can now implement our traverse_modify()
function
that will look for a specific node, and update it. In this example, the
transformer()
function is our callback that will be invoked on every node in
the structure. If the current path matches the target path, it will perform
the action.
def traverse_modify(obj, target_path, action): target_path = to_path(target_path) # converts 'foo.bar' to ['foo', 'bar'] # This will get called for every path/value in the structure def transformer(path, value): if path == target_path: return action(value) else: return value return traverse(obj, callback=transformer)
Back to the Problem ¶
There we have it: a generic, reusable function that abstracted out each individual interaction: traversal, path matching, and action. To solve our original problem with it, now simply call:
from operator import itemgetter def sort_points(points): """Will sort a list of points.""" return sorted(points, reverse=True, key=itemgetter('stop')) traverse_modify(doc, 'res[].catlist[].points', sort_points)
Here's the gist to the complete solution.
Go Wild ¶
This blog post was meant to provide some insight into a typical software engineering approach to problem solving. Breaking down a problem into pieces, and abstracting out the essence is perhaps the most joyful thing to do as an engineer.
The constructed traverse_modify()
function could be written in many different
ways if you would like to give it more power. Examples are supporting the
following path specs:
qux[].(foo|bar)[] # match both foo or bar results[0..4] # only apply action to first 4 results foo.bar.* # match any key value foo.bar.**.qux # match 0 or more levels of keys ...
Another extension would be to allow it to take multiple target-path/action combinations and apply them all in a single traversal.
All of that is left as an exercise to the reader though, to keep this post clear and concise.
Summary ¶
The fantastic thing is that, picking a good abstraction for the traverse()
function opens a door to an entire world of possibilities to modify arbitrary
Python object structures. In the end, you get more than you initially set out
for.
If you build something cool based on this however, please let me know :)