Saturday, January 31, 2015

The Wounded Workforce

I've been debating whether or not to write this for some time. It's hugely controversial. I've finally decided to finish and publish this piece because of the amount of feedback I get from the other developers around me.

The Promise

The great promise of technology is that it is self-reinforcing. Every technological advancement is, in theory, supposed to allow us to make the next great advancement with ever increasing speed. This holds only insofar as there is zero friction in advancing and applying each new level of technology. But we all know there is friction, and it comes in the form of monetary spend, profit, and availability of resources. Since technology in different areas improves at different rates, there are also stalls. A prime example rampant in the tech industry today is the relatively slow improvement in battery technology compared to the rapid advances in computing and displays that want to take advantage of more power and longer battery life.

So, barring the technological impediments themselves, what is stopping us from realizing the promise?

The Balance

Companies try to balance revenue, profit and overhead. Equipment spend is an example of overhead. So is a morale budget. As are your payroll and employee related expenses. Staring at these on paper you'd think that they directly impact your profit and that lowering these overheads would increase it. The correlation is not quite that direct. Without these overheads you likely wouldn't have any revenue, so eliminating overheads can also reduce profits. So companies break all of these down into variable and fixed categories, generally put a big fat "fixed" on the equipment, morale, and payroll, and call it a day. When times get tough they can cinch the waistband, and after some periods of improvement they can start to relax it again, though rarely does one give back operations overhead once one has learned to deal without it.

While I ran a small company of my own for many years, I have no real expertise in how to achieve this balance. If anything, my experience running my own company forced me to spend a large amount of my overhead on enabling technologies. In fact, a much larger portion of my budget went into technology spend than any company would ever consider. So how is the balance impacting my ability to do my job, and how are we wounding our own workforce?

The Wounded Workforce

Employees use company provided technology such as laptops, desktop machines, cell phones, pagers, tablets and A/V equipment to create new products and technology that in turn create revenue. During this creation we often build up debt (longer builds, bigger and more complex Excel spreadsheets, more data in the database) in whatever we are working on. We also observe a general depreciation in the function and value of the equipment being used. Basically, once there is a spend on technology, an employee will tend to see a direct improvement in their ability to do their job.

There is an upper limit on the employee's own ability to realize that improvement (if a user is not currently exceeding the limitations of their devices, a new device with fewer limitations won't help).

Then over time, due to wear and tear on the equipment itself and changes in the processes and work environment the productivity will slowly drop.

So far this all sounds great, so what is the problem? Well, the promise stated that technology would improve at an ever more rapid rate and that new types of disruptive technologies would increase that even further. We are there, and we are experiencing this state directly. Company spends are fixed and based on long refresh cycles. Improvements such as solid state drives or the yearly Intel processor refresh revolutionize the consumer industry over the course of a couple of years while leaving companies on their older hardware for upwards of five. Let me provide some direct examples.

  • Building Software - OS reboot times, IDE load times, and source build times are all dramatically improved by SSDs. New companies and start-ups began with that technology. Most business class machines offer only the smallest possible SSDs (128-160 GB), for the system drive alone, at hugely inflated prices.
  • Visual Workflow - 24" monitors hit an under-$300 budget price about 6 years ago. Larger monitors in the 27" and 30" ranges are starting to see those prices. Yet I still see a lot of installations of much smaller monitors. Healthcare is a great example here, because they consistently have UIs which exceed the sub-20", low resolution screens they have available. Weight is not the factor, since the new monitors are lower power and lower weight even at increased sizes.
  • CPU Workloads - So what about that one guy in the office with the million line Excel sheet that takes 4 minutes to open? Every year new opportunities come out to make his or her life easier with a new machine, potentially sacrificing raw speed for more cores, or whatever it might be that will make that workflow better. Going back to building software: in a large build it is parallelism that needs improvement, not raw single core throughput. So an 8 core machine at 2.6 GHz might trounce the standard 4 core 3.6 GHz machine that the company tends to buy.
So now we've homogenized the entire spend to get every employee a "good enough" machine. Probably not the best machine for them, but the best machine we can find within budget for a multitude of workflows. Workflows that span job roles as disparate as IT worker, accountant, software engineer, product manager... You get the point. Each of these users has dramatically different requirements and very rarely gets a choice or say in how to fulfill them.

Unless you work in a start-up, or your company has progressively discovered that your productivity matters and that its hardware spend on you is a multiplier for productivity and revenue, you are likely working with out of date equipment.


So you are working on out of date machines, it's costing your company, and it's costing you even more. After all, you were at work for an extra hour working around the fact that your machine is holding you back from getting your job done. But it's okay, because EVERYONE is being EQUALLY CRIPPLED. And that is how we rationalize it.

But don't rationalize it. Let's do a simple calculation instead. Let's imagine that you are build constrained. I'm using this example because I was build constrained. First I looked at the build itself, and at the cost of 2 weeks of my life I made it better, by about 10%, for everyone. Not bad. But then I took another step and bought myself SSDs, at a cost of a few hundred bucks. My build speeds just about doubled. While I don't regret my improvement efforts, I have to point out that it probably cost the company a lot of money for me to spend my time on that. A lot of missed opportunity in other things I could have been fixing. Also note that the SSD got me far more win; I would have had to spend many more weeks on a potentially impossible task to get the same results.

So I put in, say, $500 to make myself more effective. I wasn't equally crippled anymore. And I'm pretty certain my improved productivity made back the $500 and then some. But I have a long history of doing this. I started with monitors, gigantic 30" monitors, and that was just the start. I still use them, even after nearly 8 years, so it was a good buy. I also drop in memory and graphics cards, and I'm on the fence as to whether or not I should build an actual dev machine from scratch. Why? Because I want to get more done. I want to produce more amazing things. The technology that I work on every single day can make that happen and I want to use it to a multiplying effect.
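To make the payback concrete, here is a back-of-envelope sketch of that $500 spend. Every number other than the $500 is an illustrative assumption (build counts, time saved, and the value of an engineer-hour will vary wildly from shop to shop):

```javascript
// Back-of-envelope payback estimate for a self-funded upgrade.
// All inputs except the $500 are illustrative assumptions.
const upgradeCost = 500;         // dollars spent on SSDs
const buildsPerDay = 10;         // incremental builds per day
const minutesSavedPerBuild = 5;  // build time roughly halved
const hourlyValue = 60;          // value of one engineer-hour, in dollars

const hoursSavedPerWeek = (buildsPerDay * minutesSavedPerBuild * 5) / 60;
const dollarsSavedPerWeek = hoursSavedPerWeek * hourlyValue;
const weeksToBreakEven = upgradeCost / dollarsSavedPerWeek;

console.log(hoursSavedPerWeek.toFixed(1));  // ~4.2 hours saved per week
console.log(weeksToBreakEven.toFixed(1));   // pays for itself in ~2.0 weeks
```

Under these assumptions the drives pay for themselves in about two weeks; even if every input is off by half, the spend still pays off inside a quarter.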

And this is why many companies are embracing BYOD. The benefits are seemingly obvious. Think about the fact that you bought your own cell phone and your own plan and then connected it to your work email account. How much productivity do you generate for your company there? I'm betting quite a bit, and yet you are willing to pay for that because you also get to leave work when you want and go in less at odd hours of the night to resolve issues. It's become easier. I think cell phones were the big kick in the pants for BYOD, since the benefits of your own device, always available to you and catered to you, are most obvious there.

For machines it is less so, since it tends to be a largish spend. A couple of thousand dollars. And there is an issue, when you leave the company, of who owns that stuff and what you should do with the information on the hard drives, etc. That is an ethics problem though, and employees should be prepared to wipe their machines if they use them for work. As a technology problem, your company spends a meager amount getting you a "decent" machine, and the reality is that if you tossed in just a few hundred bucks more you could have an "awesome" machine. I want my machine(s) to be awesome. And if I already have one that I bought at my previous job, I'll just use it.

I really do hope that more individuals realize their full potential and look at machine spend out of their own pocket as an improvement for themselves rather than a failing of their company. Further, I hope more companies start to offer more than the standard options. I for one would love to be offered a BYOD plan with full transparency around my allotment. It would at least alleviate the potential of me complaining about my crappy machine, because, well, I bought it myself...

Monday, January 26, 2015

Code, Dead Code and Undead Code

So my job recently has been helping to build a "New Rendering Engine" for this awesome piece of software now known as Codename Project Spartan (read about it on the IE Blog or in a more editorialized form on Smashing Magazine). Even if you don't read the articles and just scan the pictures, you'll see this awesome concept. A component called MSHTML with lots of things that people like: compatibility (yay, my old stuff still works), legacy support (yep, stuff still works), and intranets (where most of the world's data is actually stored).

But then, like magic, we fly into the future (I'm assuming a slightly lighter color of blue means THE FUTURE) and we get some new things like interoperability (Ermagod, I wrote my site in one browser and it worked in another one, did I win the Internets?) and a single default engine (framework authors rejoice, no more multiple document mode IE).

Working on this cool new thing has enlightened me (or rather reinforced in me) a sense of how costly and interesting various types of code can be. While I can't go into details on the "New Rendering Engine" I can talk a bit about my observations while doing some refactoring and the various code smells that I came across along the way.


Code

Both of the Microsoft web platforms have a lot of code. We measure it in the millions of lines (each). Code is generally useful and enables all of your key end user scenarios. Most code is well tested, giving you high confidence that it works as it was designed to work. People use the features provided by the code and give satisfactory feedback, maybe a few nitpicks here and there, but overall the customer is happy.

Code is what we all try to write. It's the piece of software engineering we are all proud of.

Dead Code

And then there is dead code. When you shine a bright enough light on your code base you can find it. It looks quite odd and sometimes it doesn't even compile. Maybe it is behind an #ifdef, maybe it is a source file left out of the build system; whatever it might be, this code is truly dead. But dead code doesn't stop there. The linker, in its omniscience, can figure out whether or not something is being used and can omit it. Primarily you'll find a lot of unused template expansions getting evicted, but sometimes you get the Golden Ticket of all dead code that leads to many unused functions and member variables.

The cost of dead code is not that high if you think about it. First, you pay the cost of compiling it to some extent. You pay the cost to index it for search, and people find it when they do searches. Maybe it confuses new people on the team. Sometimes you even find people editing that junk without realizing just how dead it is (actually I find that a lot; it's fun to let them know, too, just before you delete the whole block). If there are member variables then perhaps it is costly at run-time, since you are allocating space for members that are never used. You may even zero them out or initialize them, paying other costs. Still, these are easy to find because the linker gives you some hints.

So I said the cost was not that high, but all of those things above, in millions of lines of code must waste some time? Absolutely. But when we talk about wastes of time we have to focus instead on the relative cost. There is a more expensive type of code than dead code.

Undead Code

I never had a strong term for it, but after seeing what it is capable of I think it deserves to be placed in the ranks of all the other undead horrors in the world. Undead code is bits of code lurking in your source that don't contribute to the positive experience of your end users. Code you've written, but provided no logical way to access. The compiler happily keeps it because it is so entangled in other conditions that it can't be provably removed. Later someone comes along and, from 30 calls away, passes the right parameter combination and causes your code to rise from the dead and either crash and burn (in the best case) or provide years of buggy experiences to your users (in the worst case).

The costs of undead code are leaps and bounds beyond dead code. Why? Well, dead code has a big foam finger pointing at it if you are willing to look, so you can do something about it. The dead code might already be ifdef'ed out, so you aren't paying the price of compiling it or even fixing build breaks as the code around it changes.

Undead code harbors far more dead data members. Oftentimes entire classes, v-tables, virtual calls, and lots of other really expensive stuff are held ransom by a few undead lines. Those lines are also insanely hard to remove. You have to prove that the conditions of their existence can't be met by the other code. Sometimes undead code creates a nice circular reference and you have to chase through many layers to find the loop and eventually take a snip somewhere that lets you remove the entire thing.

I don't like to say that code has bugs. It really doesn't. It has capabilities and we really don't like all of them. Some of them are quite undesirable especially the crashing ones. So often this undead code creates a capability and depending on how long that code has been out there, sometimes people even start depending on those capabilities. Let's go so far as to say that some code is so complex that it can evidence emergent behavior, mostly due to this concept of undead code.

Bring on the Flame Throwers

As with all undead things, you want to burn them with fire. So break out your flame throwers and start making the rounds.

Probably the best flame thrower you'll find is code coverage. Thankfully VS has always supplied us with an excellent set of code instrumentation tools for figuring out our code coverage. You fire up your app, run it under the scenario, and then see if your code even gets a chance to run. Just like all other forms of exercise, exercising your code will help trim the fat and find better ways of dealing with error conditions or force you to think about how to make your code more testable. Strangely it also makes you write more tests if you ever want the numbers to go up. If there were ever a cure for turning undead code into deleted code it would be called code coverage.

Remember that dead code from before? Well, undead code likes to hide around dead code to mask its own stench. My favorite so far was an entire block of code behind an existence check for a variable. Basically something like this:
if (_pSuperCoolThing != nullptr) { ... Do tons of cool stuff ... }
Now that looks pretty legit. I mean, super cool thing, do tons of cool stuff. So I asked the linker to tell me about dead code, and what did I find? I found that the function EnsureSuperCoolThing was dead. Wait a minute? Really? Oh, BTW, CSuperCoolThing's constructor and destructor were also being thrown out... Oh no, super cool thing, what are you doing, why are you not enabling my tons of cool stuff?

So once I realized the connection, you can imagine that I threw a party, revived everything and talked about how I spent hours writing a cool new feature... No, of course not. That tons of cool stuff had no tests, it hadn't run in many years, yes YEARS, and it was not something I wanted to stamp my name on. I removed it. Dead code led to the removal of a bunch of undead code. Total number of edits to the code since being dead, you might ask? Well, 4. Ouch. But it's okay, because this also tells you how important it is to regularly run dead code analysis on your code and really clean it up. Make the time to do it.

There are also a few less technical, more manual methods for finding undead code. A great one that a lot of companies use, especially in the web space, is simple telemetry. If you are logging your API, page, widget or component usage, then you can make data driven decisions on when to remove a potentially costly module. I mean, if nobody is using it, then other customers are paying the cost in disk footprint and unused capabilities. If there is a security hole or other type of bug, then it might even be a customer experience detractor. The only righteous thing to do is remove the code.

So if you, like me, hate dealing with unused, untested, undead code, then hopefully this gives you some ideas or ammunition to unleash the flame throwers in your code base. If you do, make sure you put in processes to keep the undead hordes at bay and decrease the likelihood of new infections through proper code coverage and testing.

No variable is infinitely better than an always true variable

When you are trying to simplify an architecture or remove dead code one of the juiciest finds you can come across is some variable or parameter which is always a fixed value. It is just crying out, "Remove Me", and I'm more than willing to answer. But it is never quite as easy as you think. Remember, it is still a variable which means its value, while clearly and logically fixed in the code in your window, might just have some way of being, well, variable.

Obscure Variables

So in my code base the most obscure types of variables are global boolean variables which can generally be set from the registry. The code that reads the registry can in turn be configured by anyone, and I may or may not allow the value to be changed based on the name of the executable running. Sure, it has defaults, but they CAN be changed. But wait, who would change them, right? I mean, they COULD change them, but I own the code, I own the application, I don't set the registry, so they must be fixed... Right?? Think again. Someone, somewhere found your hidden gem and polished it for themselves.

There are other ways for variables to be obscure though. Maybe you marked them protected and someone decided to override and modify those values. Those are pretty easy to find, since the code is usually nearby. Unit tests can also do pretty much whatever they want. They can refuse to link in your version of the variable and instead link in their own, set to whatever value they want. The number of places you have to look in order to really verify your variable IS fixed is starting to become quite staggering.

No Variable is Better

So that variable was put into the code for a reason. Maybe it was paranoia. Maybe it was for a long forgotten feature that you never quite finished. Whatever the reason, that variable is going to ruin your life one day.

How do I know? Today I spent 8 hours tracking down how setting a bunch of if statements to true for a series of fixed variables in my code base could lead to a terrible, single block leak. And by single block, I mean exactly that: 16 bytes. 16 bytes in 8 different tests out of thousands of passing, non-leaking tests. The key, in this case, was a registry controlled variable with some defaults. I had missed that there was a configuration where a specific test host could carry the value false and get into the behavior I had just removed. It cost me, big, on a Sunday.

The Moral

No variable is infinitely better than an always true variable. Or a variable that is always one. Or a particular literal string. No matter how the variable is fixed, unless it is a locally scoped constant of some sort, you probably want to remove it from the code. You want to forget about all of those potential else clauses, and you want to save others from shooting themselves in the foot by thinking your variable is in some way, well, variable. It's not. So don't make it one.
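As a minimal sketch of the moral (hypothetical names, JavaScript for brevity): a flag that every real caller sets to true drags a whole untested else branch along with it, and deleting the variable deletes the branch.

```javascript
// Hypothetical "before": opts.useNewLayout was added years ago and no
// caller ever passes false, so the else branch is undead code.
function renderBefore(items, opts) {
  if (opts.useNewLayout) {
    return items.map(i => `<li>${i}</li>`).join('');
  } else {
    return items.join(', '); // unreachable in practice, untested, still maintained
  }
}

// "After": the always-true variable is gone, and so is the else clause.
function renderAfter(items) {
  return items.map(i => `<li>${i}</li>`).join('');
}
```

The two functions behave identically for every caller that exists, but only the second one is honest about it.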

Sunday, January 11, 2015

Pipeline Tools vs Monolithic Tools

This weekend I took some time to implement a report that I've been lazily avoiding. Without the report in place, it is easy to call out our local improvements, but it is very hard to track our actual improvements against a fixed baseline, say one year ago. As engineers we want to be able to report both local improvements and our overall trend. It was while creating a pipeline for collecting the data I needed and massaging it into the format I wanted that I got a chance to reflect on the power of, and differences between, pipeline tool-chains and monolithic tools. I prefer pipelines personally, but I often fall into the monolithic tool trap, in all of its variations. Here I've laid out some guidelines and rules that I find helpful and that keep me productive.

Spotting Monolithic Tools

So what is a monolithic tool and what does it look like? It depends on the application and what you plan on doing; however, it is generally any tool which executes multiple transformations to create a final output, and where there may be additional internal choices made by the tool on which transformations are chosen. If my tool has to download from a web server, parse some HTML files, produce an object model, read through the object model doing some filtering, and then finally output some high level graphs and analysis, I'm probably working with a monolithic tool. It isn't designed to be used in a pipeline; instead it comes pre-equipped with everything it needs.

Also, when anything changes in the pipeline, it generally requires starting from scratch. In my case, I probably have to download from the web server again. I may also not know when I need to run the tool again, and so, for efficiency, the tool needs complicated logic to determine whether its output is up-to-date or not. How fragile, right? But every day we deal with these problems. In fact monolithic tools are the norm, and we love them since they deliver push-button results.

So why complain about monolithic tools? They prevent us from realizing short term efficiencies, since producing a monolithic tool is very expensive in terms of engineering effort. How many times have we shied away from writing something simple because it was only 80% correct? Worse, how often are we thwarted in producing what I'll call the "money" transformation because there is too much dirtiness in the input, such that we can't implement the "money" transformation with a high degree of confidence that it will work? It turns out there are some great techniques for moving past the ugliness and getting to your money transformation if you are willing to think about pipelining and incremental improvement. Let's investigate.

Avoiding the Monolithic Tools Curse

To avoid the monolithic tools curse we have to break our problems down into smaller sub-problems and find ways to exploit caching for efficiency. As an example, I do a lot of build over build analysis for my product. There are literally hundreds if not thousands of builds a day across many branches of code, and those builds in turn have various characteristics, such as the flavor of optimizations they used, the change description lists going into them, whether they were built on a dev machine or in a build lab, etc.

So to start, I have to collect builds and their characteristics, then sort and filter them. Once I have that information I need to pick interesting builds and start collecting deeper information. In my case I'll utilize the private symbols to get as much information as possible.

But let's halt there for a moment. Symbols are complicated. If you've ever used MSDIA then you'll know what I'm talking about. It can take many hundreds of lines of code to figure things out. Also, it takes a while to dump a private symbol database that is hundreds of megabytes. I could write a monolithic tool to load such a database, provide a GUI, and allow querying and asking questions using a SQL-like language, but then I'd be jumping the shark. Let's introduce more, simpler tools into the equation instead.

So we run a tool like cvdump, which knows how to process symbols and produce textual output. Thankfully that lets me avoid the interfaces, COM and C++ required to do the same, and instantly gives me access to pattern matching tools in other languages. Even better, we can automate this as another tool. While we have a daemon downloading and indexing builds, we have another watching for symbols to come in and turning them into textual output.

Let's add more tools. How about a tool that detects all of the types in a build and outputs them in name sorted order? Seems like that could be useful. How about another tool that finds all of the functions and spits out their sizes? We can keep going, adding more tools, putting out more intermediate files for yet further tools to examine. Those tools might be for compliance, security, code refactoring or just simple reporting. With this we can start defining rules to follow.

Rule #1: Stop working on a tool once it can provide a meaningful transformation and cache its output for further tools to run.

We should also use existing tools as often as possible, even if they don't provide an immediately consumable format. In other words, their cached output may require that we do another, small transform for it to be useful. In the case of cvdump, the tool outputs information in a structured format, but not one regular enough that you can logically process the records directly from the file, so we create additional tools for pulling out interesting information such as types, functions, etc.

Rule #2: Prefer existing tools which provide a "good enough" transform over a new tool which provides a "perfect" transform.

One thing that might not be obvious is the need to cache intermediate results. In a language like PowerShell you might be used to pipeline productions where the result of one command is fed into another, object by object, in a very efficient way. This makes the commands fast. However, we often work in a world where we can't get the data fast. Sometimes it is better to cache than to pipe. The more we keep the intermediate results, the easier it is to start from any point in the process without having to repeat earlier, costly portions. This provides our final rule for this section.

Rule #3: Cache intermediate results so your tools can be easily debugged and re-run at any stage in your pipeline.

Case Study: Transforming your Code

I do a lot of code transformation and I often run into a problem. If I am going to transform all of the code, the number of permutations of complexity goes up significantly. For instance, let's say your coding guidelines require function signatures to appear one parameter per line. Detecting a full function signature then becomes quite a complex process. You have to process multiple lines of input using more complicated regular expressions or tokenization. Once you have the inputs you want, you need to perform the necessary transformation, and finally you need to format the output and properly replace the existing code with the new code, which may or may not even be the same number of lines.

How would I approach this problem? I get asked this a lot, so I think about it a lot. What I would do is run the code through a series of normalization processes, each one designed to fix what I'll call a defect, though feel free to call it a complexity if you want ;-) For instance, the entire problem domain would be easier if I had the following set of components:
  1. A source analyzer that can find and lift functions out of the file.
  2. A function analyzer that can transform the lifted functions (perhaps even some of them being done by hand)
  3. A function printer that can turn my updated function back into a form compliant with our coding guidelines.
  4. Finally a source patcher that uses information from the original run, perhaps even information that got updated (such as source file, start line, end line, function) to reemit the updated code.
That might seem complex, but each of the individual tools will probably be useful to you many times over. In my case, a pair of tools, one which reads in a file and writes it out with several lines removed, and one which does the reverse, could be the tools of choice for the source patcher. A full AST editing suite would be nice, but maybe it just isn't an option. Imagine the language I'm writing is a small DSL of my own design; I'll probably never build an AST for such a thing, as that would be a waste of time.
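A sketch of component 4, the source patcher, in the keep-it-simple spirit (the helper name and the 1-based, inclusive line-number convention are my own invention): it splices replacement lines into a file using the start/end lines recorded by an earlier stage.

```javascript
// Minimal line-based source patcher: replace lines startLine..endLine
// (1-based, inclusive) of the original text with the replacement lines.
function patchLines(source, startLine, endLine, replacement) {
  const lines = source.split('\n');
  return lines
    .slice(0, startLine - 1)       // everything before the patched region
    .concat(replacement)           // the reformatted code
    .concat(lines.slice(endLine))  // everything after it
    .join('\n');
}

// Reformat a flattened signature to one parameter per line.
const original = 'void Foo(\nint a, int b)\n{\n}';
const updated = patchLines(original, 1, 2,
  ['void Foo(', '    int a,', '    int b)']);
```

Ten lines of code, no AST, and it composes with whatever function analyzer and printer produced the replacement lines.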

Rule #4: KISS - Keep It Simple Stupid - Use the simplest tools possible. It will be tempting to reach for increasingly complicated tools, but avoid that unless absolutely necessary.

What happens when you run into what I'll call a local deviation in your inputs? An example in my codebase was that sometimes people would put in bad line endings. This would in turn confuse diff tools, code editors and just about any other tool that wanted to process line by line. The solution was a character analyzer and fix-up to remove this from the entire code base. This was a pre-condition of doing later work. Don't be scared to fix up local deviations when you find them.
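The fix-up for that particular deviation can be tiny. A sketch, assuming the bad endings were stray CR and CRLF sequences:

```javascript
// Normalize stray CR and CRLF sequences to plain LF so that every
// downstream line-oriented tool in the pipeline sees the same thing.
function normalizeLineEndings(text) {
  return text.replace(/\r\n?/g, '\n');
}

console.log(JSON.stringify(normalizeLineEndings('a\r\nb\rc\nd'))); // "a\nb\nc\nd"
```

Run once over the whole code base as a pre-condition, and every later tool gets to assume clean input.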

Rule #5: Favor incremental, simplifying transformations if they block you from easily implementing the final "money" transformation.

Anything that goes beyond 5 rules is asking to be simplified, so I'll stop there. This weekend I was able to use the above 5 rules while implementing a report on binary sizes over time. Specifically, rule #3 allowed me to tweak and rerun the analysis phases in seconds, rather than spending half an hour re-collecting the intermediate data; that only had to be done once. There were also many "errors", such as missing binaries and incomplete submission data. My process used rule #5 to make early transformations which filtered these out, so that the final script which produced a CSV was only about 20 lines of Perl. Excel was finally used to produce the graph.

Saturday, January 10, 2015

Using PEG.js to lex WebIDL, parsing to come later ;-)

Last weekend I spent some time with Node.js and Cheerio in order to parse HTML pages and scrape some content. Sometimes you don't need to scrape; you have the precise content and you simply need it in a form that is more malleable to the task at hand. A simple example might be a file which you need to read line by line (this is so common that many APIs have a readLine() function already available). Other times you have XML, CSV or JSON. All of these are well known formats, but they are mainly data interchange formats used to communicate between two separate entities. By themselves they tend to lack human readability and structure, and may still prove hard to manipulate.

So how do we solve that? Well, we introduce DSLs or Domain Specific Languages. To the human, these contain tons of relevant detail, are easy to write, and easy to understand. They can often be modeled using natural language and remove restrictions on fixed structures, syntaxes and keywords. To create such a DSL we define its grammar, using another type of DSL ;-), and ideally the grammar will generate some processing code for the inputs, transform that into some more malleable output and our problem is solved. Humans are happy, programmers are happy and the computer is happy (unless you've errantly created a parser with an infinite loop, in which case your computer is simply very hot).


That first step, creating the grammar and using some library to generate a parser seems like it could be hard if we don't already have a library in mind. Thankfully we do. In my case I'm going to use PEG.js which you can play with using an online grammar compiler. If you do use the online compiler, save your work often, as you can generate bad intermediate parsers which will hang your browser and cause you to lose work.

Why did I choose this library? Well, I've been following the project for a while and it has stood the test of time. It also uses a technique called PEG, or Parsing Expression Grammar, which is a way to define a recursive descent parser. Recursive descent parsers are very easy for humans to wrap their heads around since it is how we might write such a parser by hand. However, limitations such as selecting the first match, rather than the longest match, of a given token sequence require more care when defining the order of the productions.


Installation is simple and I'll again be using Node.js for these samples. The following install command should do:
npm install pegjs -g
I use the -g so that the pegjs command can be used on the command line to convert my grammar into a parser before I run the rest of it through Node.js. Running that command on a grammar file will produce a new output file of the same name with a .js extension; WebIDL.pegjs, for example, would produce WebIDL.js. The latter file can then be included by your Node.js based tokenizer. Let's put all of this together with a code sample which shows me loading the produced parser, loading a sample WebIDL file, and then outputting the result in JSON format.

Lexer vs Parser

We are now set up and we have a simple tokenization test script. Now we need to figure out how far our parser is going to go. When parsing there are generally two phases of construction. The first phase identifies lexical tokens from the underlying characters in the stream. For instance, a series of digits might become an integer. From this lexical analysis we can create a one-dimensional stream of tokens that by themselves have no semantic meaning. For WebIDL this construct works well, as there are 7 lexical tokens defined in the grammar. These tokens can then be analyzed with parser rules to determine if the tokens themselves are in the right order. For instance, a parser rule might define a list of integers as such:
list_of_integers: INTEGER (',' INTEGER)*
The production uses lexer rules in upper-case like INTEGER and combines that with further information that says I want at least one INTEGER and then 0 or more groups of a comma followed by another INTEGER. Technically even the ',' should be another lexer rule, perhaps COMMA, but in WebIDL it will instead be classified into the general token OTHER.
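To see the separation of phases, here's a hand-rolled JavaScript sketch of just the parser half of list_of_integers, assuming the lexer has already produced tokens shaped like { type, value } (that shape is my assumption, not part of the WebIDL spec):

```javascript
// Checks the token stream against list_of_integers: INTEGER (',' INTEGER)*
// Note the parser never looks at raw characters, only at lexed tokens.
function matchListOfIntegers(tokens) {
  if (tokens.length === 0 || tokens[0].type !== 'INTEGER') return false;
  for (var i = 1; i < tokens.length; i += 2) {
    // Each group is an OTHER token holding ',' followed by an INTEGER.
    if (tokens[i].type !== 'OTHER' || tokens[i].value !== ',') return false;
    if (i + 1 >= tokens.length || tokens[i + 1].type !== 'INTEGER') return false;
  }
  return true;
}
```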

Grammar To Tokens

I'm excited and I don't want this to get too long so I'm going to jump right into the grammar conversion. I won't go into depth on the more complicated rules, but I will show how to take some of the simpler constructs in the WebIDL document and turn them into proper tokens. Let's start with OTHER which is really simple.

Since OTHER started as a simple regular expression, we just move that over into the PEG format. We also give the characters a label and a production body. Each matching group can be given a function body that will be executed on the results. Normally this produces a character, or an array, or even a complicated structure which you can then run some code on to produce something else. In this case the value is a single character, so we pass that value to a static function on token that returns a fully instantiated token with a type, a string value (raw) and a strongly typed value.
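The gist with my actual rule isn't shown here, but based on the description a PEG.js version of OTHER might be sketched like this (Token.other is an assumed helper, and the character class is taken from the WebIDL document's regular expression):

```pegjs
// Sketch: match any single character not covered by the other token
// rules and hand it to an assumed Token.other helper.
OTHER
  = value:[^\t\n\r 0-9A-Za-z] { return Token.other(value); }
```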

Let's look at something a little bit harder, the STRING token. This rule matches the quote characters and an array of 0 or more non-quote characters. In this case our production only names the array of characters, since we can take the quote character matches for granted. The final production with the two helpers required to make the token looks like the following:

Notice how we've used the flatten function to convert an array of characters into a single string. This is required because PEG.js has stored each match separately for us so we can do very complicated analysis and construction. In our case it is more of a hindrance so we do a simple flatten since the string is sufficient. Also notice that to keep the original string sequence we have both a value which is the string without quotes and we also keep the entire matched sequence of characters so we can re-emit the file. This is important depending on the usage of your code. If you are writing code that manipulates the file directly then this is necessary, but if you are simply producing a new output you can remove the string field and save the overhead since it won't be needed.
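As a sketch of what those two helpers might look like in plain JavaScript (the names and token shape are assumptions, not the exact code from my gist):

```javascript
// flatten joins the per-character matches PEG.js hands back into one string.
function flatten(chars) {
  return chars.join('');
}

// makeStringToken keeps both the unquoted value and the full raw text,
// so the original file can be re-emitted if needed.
function makeStringToken(chars) {
  var value = flatten(chars);
  return { type: 'STRING', value: value, raw: '"' + value + '"' };
}
```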

I want to do one more lexer rule for you here, the INTEGER rule, since the final production doesn't look very much like the initial regular expression provided in the document. PEG.js uses a forward slash character '/' to represent the alternation group. And in this context, to be exceptionally clear, it will try to process the first item in the group, and if that succeeds, regardless of match length, that will be the alternative it proceeds to process the input with. If it later fails in a nested context, it will roll back and try the next alternative. This can lead to some insane run-times, but techniques like memoization can keep this from being a concern. For my example I noticed that the regular expression was trying to parse decimal, hex and octal values, so I broke this out to simplify my matching and token construction.

In addition to how I split the rules out, this also shows how we might use optional matches. Here opt:'-'? represents a token which may or may not be present, and it can then be tested for in our production's function; if we do see a negative sign, we simply flip the sign of the current token. Also note that the variables in the INTEGER lexer rule have actually become the tokens that we passed back from DECINT, HEXINT, and OCTINT. When you later compose parser rules, the values you get will be lexical tokens such as these, and you can work with your results in an object-oriented way, no longer concerning yourself with the raw characters in the original stream.
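A hedged reconstruction of the split-out rule might look like the following (Token.integer and flatten are assumed helpers, and the character classes are simplified from the WebIDL regular expression). Note that HEXINT must be tried first, or the '0x' prefix could be half-consumed as a leading zero:

```pegjs
// Sketch: decimal, hex and octal each get their own rule, and INTEGER
// composes them with an optional sign.
INTEGER
  = opt:'-'? num:(HEXINT / OCTINT / DECINT) {
      if (opt === '-') { num.value = -num.value; }
      return num;
    }

HEXINT
  = '0x' digits:[0-9A-Fa-f]+ { return Token.integer(parseInt(flatten(digits), 16)); }

OCTINT
  = '0' digits:[0-7]+ { return Token.integer(parseInt(flatten(digits), 8)); }

DECINT
  = digits:[0-9]+ { return Token.integer(parseInt(flatten(digits), 10)); }
```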

Let's finish putting this together. The last bit you need is a start rule. By default the top-level rule is your start rule, so we are going to define one called 'start' and another called 'tokens'. The 'start' rule is a parser rule: it takes a list of tokens using tokens+. tokens is in turn our alternation group for all 7 of our lexical rules. And note, the ordering is IMPORTANT. Notice that FLOAT must precede INTEGER; otherwise a float like 1.5 could lex as the token sequence INTEGER OTHER INTEGER. With other parser technologies the longest possible matched token would be taken by default, but in PEG the first possible match is taken regardless of length. With a few unit tests this is pretty trivial to detect, and had you hand-written the parser you likely would have concerned yourself with similar ordering issues.
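Putting that together, a sketch of the start rule and its ordered alternation (the rule names are assumed to mirror the 7 WebIDL lexical tokens) might be:

```pegjs
// 'start' is the default start rule; 'tokens' is the ordered alternation
// over the lexical rules. FLOAT must precede INTEGER or a float like 1.5
// would lex as INTEGER OTHER INTEGER.
start
  = tokens+

tokens
  = FLOAT
  / INTEGER
  / IDENTIFIER
  / STRING
  / WHITESPACE
  / COMMENT
  / OTHER
```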


Monday, January 5, 2015

Explorations in Node: Using the Request and Cheerio Modules

Given the speed of growth of the web there is a limitless amount of data available and most of it is transmitted in clear text for you to process as you see fit. Traditionally I would use a language like Perl or C# since both come equipped with many web friendly request classes, but today I decided to fumble through a bit of Node.js instead.

Investigating the Problem Space

After a couple of web searches I found that the request module is a pretty popular way to get the initial content. This effectively replaces concepts like LWP::Simple or System.Net.HttpWebRequest. Next, I knew that I would want to parse the content, and while regular expressions are powerful and would have been my solution of choice previously, I instead wanted to use whatever the cool kids were using. It turns out there are a lot of examples for the jsdom module, but a pretty decent Stack Overflow answer enumerated about 15 options, and through it I found out that the cheerio module is a jQuery for Node.js.

Being that I'm a browser developer I've had a ton of experience with jQuery, mainly debugging why it isn't working and creating patches to make it work. That made cheerio seem like the perfect approach to the problem, assuming it was able to process the content in the expected way. Cheerio also installed and worked on a raw Windows machine without providing a Python path which further increased my confidence that it was a good, stand-alone module that wouldn't give me a bunch of setup grief. Time to write some code.

Implementing the Solution

To implement my simple scraper I grabbed some documentation pages that my wife has been spending time with lately. Might as well process something that might have some use in the future. These are Android documentation pages, most likely automatically generated, meaning they have a fairly regular structure. This is one of the keys to processing web pages: a regular structure of fixed identifiers and classes allows for simple jQuery CSS based queries to find whatever you might need. You start with structural nodes, and then process the individual content within. An example would be something as simple as the following...
// Processes all descendants of an item with id pubmethods
// that have a classname of jd-linkcol (the selector call below is a
// sketch of the query shape, not my exact code)
$('#pubmethods .jd-linkcol').each(function (i, el) {
  console.log($(el).text());
});
You can use ANY CSS query though. So you can select based on attributes and their values, presence or non-presence using the :not() selector syntax, and of course you can have multiple ancestor predicates just like in your style sheets. For a web developer something like XPath or a standard regular expression just wouldn't be nearly as familiar, maybe more powerful, but certainly not as easy to use. This has always made me love the querySelector API in the browser as well.

There were some interesting challenges though. It turns out cheerio is not a compliant HTML 5 parser. It doesn't know how to handle the various insertion modes and it fails at managing the open element stack and active formatting elements. For this reason you may find that malformed documents require you to be more precise. Swapping the find method for the children method can help when things nest when they shouldn't. This is equivalent to using the child selector (>) which also didn't work as expected in cheerio when used with the find method.

With that I'll point you at the code. The gist has a revision history. You'll notice in revision 2, shared below, that I added a mechanism for testing multiple URLs and also made portions of the code more robust to different page structures. I wish that the diff between revisions 1 and 2 had been a clean diff, but when viewed on GitHub it looks like I deleted all of the content and completely replaced it. Looking at version 1, then version 2, though, might provide you some additional insights. There are also many additional parsing strategies that I didn't discuss that might be interesting to you.


Sunday, January 4, 2015

Explorations in Node: A simple symbol server proxy

Being a native code developer working on a project as large as Windows and Internet Explorer, I find that I spend a lot of my time in the debugger working with symbols. Generally the debugger is really good about fetching only the symbols it needs and limiting its queries, but other times it fails miserably and really wants that ntdll pdb.

My core job is as a performance developer, which means in addition to debugging I'm also processing a lot of ETW traces, or ETLs, using a tool called WPA, the Windows Performance Analyzer. For various reasons it turns out that symbol loading in this tool, especially when working with monolithic binaries and private symbols, can take quite some time. And we aren't the only ones to have the problem, as it appears even Chrome has monolithic binaries and symbols, per a series of posts from a former colleague, Bruce Dawson.

I could claim that this impacts me greatly, but it doesn't impact me nearly as much as it does my boss who started the original project to fix the problem. You can see Rico's solution over on his blog. He wrote his solution in C# using the built in networking facilities of the .NET Framework. Even with a solution partially in hand, there is opportunity to learn from this experience, so let's begin.

Interrogating the Problem Space

When Rico first talked to me about this problem he was in the middle of defeating the many dragons in the .NET Framework's networking stack in order to get the experience that he wanted. He was already a few hours in and was "almost done", so changing directions, for him, at this point would have been a waste of time. Instead of resetting we just walked through the problem space and asked if there were any more immediate solutions. I proposed Eric Lawrence's Fiddler, since I had written numerous filters in that tool before and it is "pretty easy" to do, but probably not as easy to package.

My next thought was: isn't Node.js designed for this whole request interception problem? It's also very good at setting up lightweight servers, which meant it might be good for sharing the solution with the entire team (if someone ran the server, then not everyone would have to run their own proxy). After interrogating the problem space it was pretty obvious that Rico's solution had a lot of value and that it would be expensive to explore the other options. It looked like we had a winner, but I still wanted to use this as an opportunity to learn more Node.js.

Implementing the Solution

It turns out that a proxy connection is basically a starter tutorial for using Node.js, so when I sat down in the evening to give it a try, 80% of the code I needed was sitting out there on at least 30 different blogs. I read through a few and then hacked away on a simple proxy. My overall contribution was just enough code to configure the server with some basic variables, like where the remote symbol server should be and which symbols you want downloaded. After a bit of testing I verified that when launching WPA I was able to filter down to just the symbols I cared about, and that under the debugger I was able to remove lengthy delays as well. An overall success!!

Reflections on Learning

Something that I thought would take me many hours was actually done in about 15 minutes. This really showcased the power of Node.js, but it also showcased something else about learning. Previously I noted that unless taking the time to learn is clearly going to cause you to fail or miss a deadline, you should favor something new over applying things you are already intimately familiar with. You really need to find the fun and the opportunity to grow in your solution. For me, this reintroduced me to Node.js after having shelved it for quite some time. Feel free to take my final code, learn from it and compare it to Rico's solution as well.


Saturday, January 3, 2015

Improvement through Self Reflection, Learning and Gamification

Of all my years in software engineering, the past year has been the most rewarding for me. If I were writing this a year ago, I would have said the same thing. And the year before that as well. By continuously going through iterations of self-reflection, forming corrective experiments, evaluating results and finally locking in new behaviors, I'm able to provide myself opportunities to level up, sometimes multiple times per year. The same self-reflection helps me avoid relapsing into prior, less effective behaviors. By tracking data about my own performance and getting feedback from others, I'm able to quantify how effectively the entire process is running.

Up to now, my process has primarily been my own. It's ad hoc, constantly changing and highly adapted to my own personality. It certainly isn't a process for everyone. More recently I have taken the time to share the basic building blocks of my approach with a few people that I mentor. The results of these mentoring sessions have served as yet another compounding tool in my own arsenal of self-improvement. By openly sharing my personal techniques with others, we've collaboratively been able to improve as a group, generalize the techniques and find other more powerful ways of applying them.

This blog will be my attempt at open-sourcing, if you will, the results of my mentoring sessions. I will take one more step and also open-source, as much as possible, my divergent approach to learning. So in addition to general software processes, posts on gamification of the home and workplace, and techniques for overcoming challenges in the software engineering process, you'll also get to peek inside my mind as I delve into tons of new (at least to me) technologies. This will make the content here extremely varied and not for everyone, but hopefully you find at least some of it useful enough to hang around through what will be another of my experiments.

With that in mind, let me leave you with something useful: a short note on each of the techniques from this post's title.

Everyone I work with tends to agree they use time inefficiently. There are many techniques for fixing this problem, and one I like more than the others is known as GTD, or Getting Things Done. We can start simpler though, and so I give them the following advice: for one day, write down start and end times for all of your tasks in a notebook, including interruptions. For me this tends to be no more than 2 sheets of paper. At the end of the day, categorize your time and ask yourself where there are opportunities to trim the fat. The overhead of this technique is maybe a few minutes of your time, and I can almost guarantee that you'll find an entire wasted hour if you are honest with yourself!

Another common theme with my mentees is that they have a hard time balancing learning new things with applying existing knowledge. An example might be a process you do at work that takes maybe 30-40 minutes to do manually and that you don't do very often. To automate it you'd have to learn more deeply how something works, probably spending 2 hours on that learning. What do you do? Under time pressure, how do you balance learning with applying? My response is to always favor learning unless that time is going to cause you to fail in some unrecoverable way. And when you are done, share your results with others. You may find that your 2 hours just saved 100 people 30-40 minutes each. In my line of work I quantify this by counting how many software engineers I just created by freeing up hours of other people's time.

The final theme I'll share today is that we are generally terrible about rewarding ourselves for work done. If you go to work for an 8 hour day, finish 8 hours of work, then head home, how do you feel? Many people feel exhausted. Asked what they accomplished, they'll say they accomplished what they were "supposed to do" or "expected to do". They can probably point out 50 things they didn't do or still have to do as well. This is a problem. We are delaying our gratification until an epic event in the future, such as "Shipped IE 11" or "Feature Landed in Main". Gamification teaches us to structure rewards in smaller increments, about 5 minutes apart if possible. For now, my advice on this subject is to learn to celebrate the small wins. In the future I'll talk about how I and others gamify work and head home at the end of the day feeling energized instead of exhausted.

Hopefully I've piqued your interest enough to stick around for future installments! If you are extra inspired and start applying the above techniques feel free to toss me an email and let me know how it works out for you.