Sunday, January 11, 2015

Pipeline Tools vs Monolithic Tools

This weekend I took some time to implement a report that I'd been lazily avoiding. Without the report in place, it is easy to call out our local improvements, but very hard to track our actual progress against a fixed baseline, say one year ago. As engineers we want to be able to report not just local improvements but also our overall trend. While building a pipeline to collect the data I needed and massage it into the format I wanted, I got a chance to reflect on the power of pipeline tool-chains and how they differ from monolithic tools. I personally prefer pipelines, but I often fall into the monolithic tool trap, in all of its variations. Here I've laid out some guidelines and rules that I find helpful and that keep me productive.

Spotting Monolithic Tools

So what is a monolithic tool and what does it look like? That depends on the application and what you plan on doing. Generally, though, it is any tool which executes multiple transformations to create a final output, and which may make additional internal choices about which transformations to apply. If my tool has to download from a web server, parse some HTML files, produce an object model, read through the object model doing some filtering, and then finally output some high-level graphs and analysis, I'm probably working with a monolithic tool. It isn't designed to be used in a pipeline; instead it comes pre-equipped with everything it needs.

Also, when anything changes in the pipeline, a monolithic tool generally requires starting from scratch. In my case, I probably have to download from the web server again. I may also not know when I need to run the tool again, so for efficiency the tool needs complicated logic to determine whether its output is up to date. How fragile, right? Yet every day we deal with these problems. In fact, monolithic tools are the norm, and we love them because they give us push-button results.

So why complain about monolithic tools? They prevent us from realizing short-term efficiencies, since producing a monolithic tool is very expensive in terms of engineering effort. How many times have we shied away from writing something simple because it was only 80% correct? Worse, how often are we thwarted in producing what I'll call the "money" transformation because there is too much dirtiness in the input, such that we can't even implement the "money" transformation with any confidence it will work? It turns out there are some great techniques for moving past the ugliness and getting to your money transformation if you are willing to think about pipelining and incremental improvement. Let's investigate that.

Avoiding the Monolithic Tools Curse

To avoid the monolithic tools curse we have to first break our problems down into smaller sub-problems and find ways to exploit caching for efficiency. As an example, I do a lot of build-over-build analysis for my product. There are literally hundreds if not thousands of builds a day across many branches of code, and those builds in turn have various characteristics, such as the flavor of optimizations they used, the change description lists going into them, whether they were built on a dev machine or in a build lab, etc.

So to start, I have to collect builds and characteristics. Then sort and filter them. Once I have that information I need to pick interesting builds and start the process of collecting information. In my case I'll utilize the private symbols to get as much information as possible.

But let's halt there for a moment. Symbols are complicated. If you've ever used MSDIA then you'll know what I'm talking about. It can take many hundreds of lines of code to figure things out. Also, it takes a while to dump a private symbol database that is hundreds of megabytes. I could write a monolithic tool to load such a database, provide a GUI, and allow querying using a SQL-like language, but then I'd be jumping the shark. Let's introduce more, simpler tools into the equation instead.

So, we run a tool like cvdump, which knows how to process symbols and produce textual output. Thankfully that lets me avoid the interfaces, COM, and C++ required to do the same, and it instantly gives me access to pattern matching tools in other languages. Even better, we can automate this as another tool. While we have a daemon downloading and indexing builds, we have another watching for symbols to come in and turning them into textual output.
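The watcher can stay tiny. A minimal sketch in Python (the file extensions and the `cvdump` invocation here are assumptions for illustration — adapt them to your symbol store):

```python
import subprocess
from pathlib import Path

def pending_dumps(symbol_dir):
    """Return symbol files that lack a cached textual dump, or whose
    cached dump is older than the symbols themselves."""
    pending = []
    for pdb in Path(symbol_dir).glob("*.pdb"):
        txt = pdb.with_suffix(".txt")
        if not txt.exists() or txt.stat().st_mtime < pdb.stat().st_mtime:
            pending.append(pdb)
    return pending

def dump_symbols(pdb):
    """Run cvdump and cache its textual output next to the symbol file,
    so every later tool works on plain text instead of MSDIA."""
    text = subprocess.run(["cvdump", str(pdb)],
                          capture_output=True, text=True, check=True).stdout
    pdb.with_suffix(".txt").write_text(text)
```

A daemon is then just a loop calling `pending_dumps` and `dump_symbols` on whatever shows up.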

Let's add more tools. How about a tool that detects all of the types in a build and outputs them in name-sorted order? Seems like that could be useful. How about another tool that finds all of the functions and spits out their sizes? We can keep going, adding more tools, emitting more intermediate files for yet further tools to examine. Those tools might be for compliance, security, code refactoring, or just simple reporting. With this we can start defining rules to follow.
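The function-size tool, for example, is just pattern matching over the cached text. A sketch — note the record format below is a simplified stand-in, not cvdump's actual output, so the regular expression is an assumption you'd tune against the real dump:

```python
import re

# Hypothetical one-line-per-function record; real cvdump output differs.
FUNC_RE = re.compile(r"^FUNC\s+size=(\d+)\s+name=(\S+)", re.MULTILINE)

def function_sizes(dump_text):
    """Extract (name, size) pairs from a textual symbol dump,
    largest functions first."""
    funcs = [(name, int(size)) for size, name in FUNC_RE.findall(dump_text)]
    return sorted(funcs, key=lambda f: (-f[1], f[0]))
```

Its output becomes another cached intermediate file for the next tool in line.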

Rule #1: Stop working on a tool once it can provide a meaningful transformation and cache its output for further tools to run.

We should also use existing tools as often as possible, even if they don't provide an immediately consumable format. In other words, their cached output may require another small transform to be useful. In the case of cvdump, the tool outputs information in a structured format, but not a regular enough format that you can logically process the records directly from the file, so we create additional tools for pulling out interesting information such as types, functions, etc.

Rule #2: Prefer existing tools which provide a "good enough" transform over a new tool which provides a "perfect" transform.

One thing that might not be obvious is the need to cache intermediate results. In a language like PowerShell you might be used to pipelines where the result of one command is fed into another, object by object, in a very efficient way. That makes the commands fast. However, we often work in a world where we can't get the data fast. Sometimes it is better to cache than to pipe. The more intermediate results we keep, the easier it is to restart from any point in the process without repeating the earlier, costly stages. This provides our final rule for this section.

Rule #3: Cache intermediate results so your tools can be easily debugged and re-run at any stage in your pipeline.

Case Study: Transforming your Code

I do a lot of code transformation and I often run into a problem: if I am going to transform all of the code, the number of permutations of complexity goes up significantly. For instance, let's say your coding guidelines require function signatures to appear one parameter per line. Detecting a full function signature then becomes quite a complex process. You have to process multiple lines of input using more complicated regular expressions or tokenization. Once you have the inputs you want, you need to perform the necessary transformation, and finally you need to format the output and properly replace the existing code with the new code, which may not even be the same number of lines.

How would I approach this problem? I get asked this a lot, so I've thought about it a lot. What I would do is run the code through a series of normalization processes, each one designed to fix what I'll call a defect, but feel free to call it a complexity if you want ;-) For instance, the entire problem domain would be easier if I had the following set of components:
  1. A source analyzer that can find and lift functions out of the file.
  2. A function analyzer that can transform the lifted functions (perhaps even doing some of them by hand).
  3. A function printer that can turn my updated function back into a form compliant with our coding guidelines.
  4. Finally, a source patcher that uses information from the original run, perhaps even information that got updated (such as source file, start line, end line, function), to re-emit the updated code.
That might seem complex, but each of the individual tools will probably be useful to you many times over. In my case, a simple tool which reads in a file and writes it back out with several lines removed, and another which does the reverse by inserting lines, could be the tools of choice for the source patcher. A full AST editing suite would be nice, but maybe it just isn't an option. Imagine the language I'm writing is a small DSL of my own design. I'll probably never build an AST for such a thing, as that would be a waste of time.
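The source patcher really can be that simple. A sketch of the core splice, assuming the earlier stages recorded 1-based start and end lines for each lifted function:

```python
def patch_source(lines, start, end, replacement):
    """Replace lines[start..end] (1-based, inclusive) with the replacement
    lines, which may be a different length than the span they replace."""
    return lines[:start - 1] + list(replacement) + lines[end:]
```

Applying patches bottom-up (highest start line first) keeps earlier line numbers valid as the file grows or shrinks.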

Rule #4: KISS - Keep It Simple Stupid - Use the simplest tools possible. It will be tempting to reach for increasingly complicated tools, but avoid that unless absolutely necessary.

What happens when you run into what I'll call a local deviation in your inputs? An example in my codebase was that sometimes people would check in bad line endings. This would in turn confuse diff tools, code editors, and just about any other tool that wanted to process the file line by line. The solution was a character analyzer and a fix-up pass to remove the problem from the entire code base. This was a precondition of doing later work. Don't be scared to fix up local deviations when you find them.
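The fix-up pass for my case would amount to a few lines. A sketch that normalizes CRLF and stray CR endings to LF (whether LF is the right target is an assumption — pick whichever convention your codebase uses):

```python
def normalize_line_endings(data: bytes) -> bytes:
    """Rewrite CRLF and stray CR line endings as LF so that every
    downstream line-oriented tool sees consistent input."""
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
```

Run once over the whole tree, commit, and every later tool in the pipeline gets to assume clean input.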

Rule #5: Favor incremental, simplifying transformations if they block you from easily implementing the final "money" transformation.

Anything that goes beyond 5 rules is asking to be simplified, so I'll stop there. This weekend I was able to use the above 5 rules while implementing a report on binary sizes over time. Specifically, rule #3 allowed me to tweak and rerun the analysis phases in seconds, rather than spending half an hour recollecting the intermediate data; that collection only had to happen once. There were also many "errors", such as missing binaries and incomplete submission data. My process used rule #5 to apply early transformations which filtered these out, so that the final script which produced a CSV was only about 20 lines of Perl. Finally, Excel was used to produce the graph.
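To give a flavor of that last stage: the original was Perl, but the same filter-then-emit step looks like this in Python. The field names are illustrative, not the real report's schema:

```python
import csv
import io

def sizes_to_csv(records):
    """Drop incomplete build records, then emit build/binary/size rows.
    Field names here are hypothetical stand-ins for the real data."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["build", "binary", "size"])
    for rec in records:
        # Rule #5 in miniature: filter dirty records before the
        # "money" transformation instead of handling them downstream.
        if rec.get("binary") and rec.get("size") is not None:
            writer.writerow([rec["build"], rec["binary"], rec["size"]])
    return buf.getvalue()
```

Because the dirty records were filtered in an earlier cached stage, the emit step stays trivially small.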
