Sunday, August 2, 2015

3 Days of Data Science

Last week was Microsoft's OneWeek and during that week we run a special event called the OneHack. This is a 3 day hackathon which is just like any other in which groups of engineers come together to build things that they think can have an impact at Microsoft. A hack can be anything, you aren't told what to hack on, it is your decision. Who you hack with, your decision. Its great. It is 3 days of time for you to prove a point and show your team, your organization and Microsoft that there is a better way to do something!

I decided that my hack would center around metrics that I could collect with a DOM profiler. Not a sampling DOM profiler like those that ship in the dev tools, but a full blown, call attributed profiler capable of telling you absolutely if a given site calls a given method. This project played in really well with the new culture of using data science to prioritize work and gain insights about the project.

The amount of data was going to be large, very large, as we had multiple methods available to us to scale out the collection. At the same time, the number of datasets which we could key back to the profiling data was also going to be quite large.

Thankfully, I was able to procure a small but amazingly skilled team of people ranging from data scientists, to visualization experts, to automation experts. Everyone was able to self direct while identifying the next potential insight and then working towards integrating the data to make the insight viable.

The rest of this article will describe the process insights I gained throughout the process. Process insights that can hopefully help you in future endeavors of a similar sort.

Even Data Must be Tested

Humans are amazingly good at spotting things that defy the general pattern of what they are seeing. For this reason, every time we were working with data we kept hearing someone say, "Something doesn't look right" or "Does it make sense that this value would be 0?" or me at one point saying "How is my min greater than my max?"

We were collecting gigabytes of data, sometimes recollecting it, and after collection every strip of data had to go through a transformation process. For us this was using the WPA or Windows Performance Analyzer tools to decode some events that were emitted by the profiler. Once decoded, those events had to be split from their files, grouped, and joined back together again under our chosen key, which was consequently the URL of the website which had executed the script. During this process there were so many things that could go wrong, the most likely of which was the CSV processing getting choked up on an invalid CSV. This caused us to think about simple things that we could do in order to validate data. For instance:

  • Every profile output will have exactly 8 columns of data, not 1, not 14, but 8. So simply ensuring that the columns of data were there was a huge step in finding out when the CSV format was failing.
  • Every profile output will have a string in column 2 and the rest of the columns will be numeric. Parsing each column to numeric and validating it was another good fix.

Once we had the site to API data. We then had to join or aggregate all of the similar site data together. This is where I failed miserably. In my haste I summed the min and max columns rather than computing, you know, actual min and max. This led to some craziness, like max being greater than total in some cases. So then we came up with some additional tests.

  • The max should always be greater than or equal to the min.
  • If there is only a single count, make sure min, max and total are equal.
  • Test for negatives and overflows in your data. When working with gigabytes you are often surprised by overflow of common 32-bit ranges. We consequently used 64-bit and didn't experience this, yay!

You might ask, if you are just jumping in, how would you spot these issues? As noted, humans rock at spotting bad data. You can generally glance at a screen and spot it. But it isn't all that easy. So my recommendation is draw the data into Excel or your favorite graphing library and graph it. At this point a lot of things will now visually fall out. A single graph is worth a thousand rows of data!

Graphs can even be a part of your testing methodology if you are willing to have your transformation process simply export graphs. We were using both a combination of Perl, C#, Node and io.js to do all of our data processing, so depending on what language you were in graphing was more or less easy.

Intermediate Data is Still Data

When working on the data you start with very simple things. Some of the things we started with were:
  • Profiling data for thousands of websites and their iframe content resulting in more than 100k unique URLs calling a couple of thousand DOM APIs.
  • Profiling data for our check-in test suite keyed by test name.
  • A test suite file containing the test name along with more meta-data, like test owner.
  • A description of every browser's API surface area gleaned by crawling their DOMs.
  • A complete description of the EdgeHTML API surface area gleaned through our FastDOM Compiler (consequently the same compiler that emits the instrumentation).

And there are things we didn't even know about until we got into the hack.
  • For each test owner, which team are they on? Aggregating by team is a far more powerful visualization than by owner when you are in a large organization like Microsoft Edge.
  • A description of every browser API from the public specifications.
  • Unions, intersections and differences between different browser's API surface area.
  • Tons of filters for things where APIs were defined in different places on different browsers. Example: webkitRequestAnimationframe and requestAnimationFrame should be merged as they are "aliases"

Given all of these fairly complete data sources, we kept finding that to generate new insights, we needed different combinations. We often had the combinations available, but you had to load a monstrous data source to get the values. Most data sources were in CSV format, but some were in XML and others in JSON. Different people needed data in different formats and so often having it in CSV was okay, but having it in JSON was better and so a conversion was needed.

Often when converting between formats or doing joins you lose data as well. Having these intermediates before the data is lost can be useful to others. Let me provide a couple of examples.

DOM Properties have either a getter or a getter/setter pair. Our profiler was capable of emitting telemetry separately for these two entry points. This meant that most data was keyed based on the DOM API name, however, our original data was keyed based on the API name AND the entry point type. We needed the profiler data to join with several other pieces of data and be emitted as an intermediate before we were able to collapse the key identities down to just the API name. This allowed us, on our final pages, to list for a given API, the separated performance characteristics of the getter and setter.

When aggregating URL down to domain which is necessary for properly tabulating a group of website pages into a single consistent view, you end up losing many URLs. For example, an about:blank page should have the same URL as its parent, but it is also an about:blank page. If you want to backtrack the data to the page to investigate manually, you need the original URL. By building most of our intermediates with URL in mind and having only the final output convert URL down to domain, you are able to always recover data. In this case, adding a column to your data for domain, but keeping everything else allows you to build the aggregate group by domain. I'll get more into this later since this hits on another important concept.

Given this, when you are working with lose data, I highly recommend keeping all intermediate outputs that you produce. Also, document them all and which data they pulled in, which scripts were used, and what all of the output files were. If you can, write a script that will re-run all of the conversion stages on the original data and produce your current state. We consistently found the following issues when working as a group...

  • Someone was working on a data file we could no longer produce. Perhaps a prior version.
  • Someone needed an intermediate that we had deleted.
  • Someone needed an intermediate + a conversion that we had lost the script for.

All of these slowed us down and could have been easily avoided.

Schema, Schema, Schema, and a SCHEMA

At some point it became obvious we had too much data for everyone to be able to remember everything that was available. In the beginning I was the data oracle since I had previously worked the most with this data. However, in the end we had everyone producing NEW data and so my days as oracle were numbered. It becomes critically important to have schemas available the longer your data munging exercise continues.

A schema is NOT costly. The schema for a CSV file is a header column. At first, we didn't have header columns. Mostly due to the fact that when I write Perl I don't want to have to cut the first line ;-) I'm lazy apparently. But this lead to massive confusion. When you have a CSV filled with 7 straights columns of numbers you start to question what they mean. The cost of a header column to a Perl script is one additional line of code, the cost of not putting a header column for humans is hours of explaining what the columns mean. Hang it up if you ever accidentally transpose the data ;-)

A schema is like a design document or diagram. It is an abstraction. It allows others to extend the design without your input because you've already explained your portion of the design. With a schema you will gain the ability to create new cuts of the data and insights without having to dig deeply into the data itself and you can test new ideas without having to write code.

The basis of the previous point is that once you have a schema you can compose your data together in new ways. This can allow you to see how all of your data is interrelated and leads to the use of existing data for creating new diagrams and insights. When you don't have a schema, you find yourself recomposing data from baseline sources because in previous iterations you didn't groom the data appropriately. Let me provide an example from the URL vs domain case.

URL,Member,Calls,get HTMLElement::innerHTML,1,get HTMLElement::innerHTML,1

domain,Member,Calls,get HTMLElement::innerHTML,2

At some point we needed the domain mapping to reduce our reporting matrix to around 1000 data points. However, later we needed the URL as well when building the web pages which presented this data. That way you could understand which pages on were using the value, since just going to was unlikely to induce the property get. Had we been thinking about this in terms of a proper schema, we would have simply added a column to the data that would allow us to dynamically aggregate.

URL,domain,Member,Calls,,get HTMLElement::innerHTML,1,,get HTMLElement::innerHTML,1

Then we could generate a view on top of this such as (select domain,Member,SUM(Calls) group by domain,member). If that is too costly to compute each time, we can always schedule a task to recompute it.

This insight alone would have advanced our project by another 2 days worth of our existing effort had we employed it from the very beginning. While sad that we were unable to employ it in time, there is always a next time. In fact, as we seek to make our efforts permanent this type of insight will go a long way towards improving how we push data through our pipeline.

Data is NOT Hierarchical, Until you Display It

I produced so much data in JSON throughout the hack. We started with 2 MB JSON files, then 14 MB JSON files, then 150 MB JSON files. Why? This was our hierarchical data stores and it "enabled" the code which generated the pages to step through increasingly more detailed information by simply walking from the parent into the more detailed children. Effectively I was treating all of my data as hierarchical because my output in display was hierarchical.

Why is this a bad thing? It makes filtering, sorting and other operations a total mess. My data could be have been a much larger table with a lot of redundant data. Then I could make all of my decisions at a high level, only once filtering and sorting were done would I feed all of this into the page generation routines. The opposite is that I have to filter at EVERY level in the hierarchy instead. Or at least I have to filter at every level where the data exists that I'm trying to filter on. Let's take a simple JSON where we have levels for interfaces, the members they contain, then information about the member. I want my output to be filtered by the "owner" of the member. Here is the JSON.

If I filter on "Jim", then I expect that I should first see interface HTMLElement, then the member className. I should be able to reasonably know that I have to display all of that data. To do that, I probably loop through all of the interface key names, then pass the interface down to another function which loops through the member key names, and finally pass those into another function which will "finally" filter out "Jim" and decide to show something. However, at this level, I've lost key information, unless I've enclosed around it all, such that I don't know the interface and potentially the member that I'm inside of. I've not encoded the key into the data itself, I've encoded the key into the containing structure instead.

How do I fix it? Well, encode the keys as I've described so that the members gain enough information. That would be a good start. But what happens then when I have two properties on the same interface. How do I know to only emit the interface once? And how do I know to end the interface? Okay, things are getting nasty fast. It turns out you end up first walking the entire data structure and memorizing enough information to then go back through and only show those things that belong to Jim. You end up replicating a portion of the hierarchy through the filter process, such that you can walk the "filtered" view and use the "original" view to get any missing data. This is not a great approach. Its error prone and the code gets ugly. I know, I've got tens of thousands of lines of it now :-(

A simpler approach is to simply produce the flat data file. Do the filtering on that, then create the hierarchy on the client as necessary. This approach allows for so much flexibility since it also allows me to send the filter to the server if  don't want to retrieve the entire dataset to produce the view. Here is a tabular view of the same data and a filtering/aggregation of it that produces what I want to be able to create Jim's view of the data.

This is another example of a realization we were not able to take advantage of. We had people in R, others in PowerBI, some in Excel, one guy in C#, and me in Node.js and we were all using all of these flat files in different ways. Because our important data was in hierarchical format, it wasn't accessible to many of the tools. Focusing more on the schema and focusing on flat data tables would have empowered the entire group. This leads me to a final realization about our data.

SQL is Probably Good Enough

At no point when I was looking at the data did I think, man, I should put this into a SQL Server really quick and just do some views. I don't know why. I'm an old school SQL Server guy. I've worked on, for their time, top tier websites. Usually we powered those with huge SQL instances.

SQL and some simple stored procedures could have done most of the aggregation that I was doing offline in scripts. It wold have given me the schema information that I was after. I could have made quick filters and aggregates in a language built for providing me such information. Had I just put my data in SQL I probably would have been way better off. Maybe it would have limited some of the other members of the team, but I'm betting my data exports to them would have been of much higher quality and would have more than made up for the hours we spent grovelling huge data files on the disk.

Further, it would have enabled the scale that we are undoubtedly going to be tackling next. Now, it would have required a permanent instance be up for people to connect to, but that is an operations problem and is really easy to solve. Right now our data is on some network share. Its going to be hard to make that an operations problem. Further, how do we "add data" over time or "version data" or even "back it up". I made a couple of cab archives along the way, but that is literally all we have from the hack.

Now, it was a hack, so I can't fault myself for these mistakes but I can learn from it. Once your data outgrows Excel, its time for SQL. Tons of tools know how to talk to SQL and it would have sped up our iteration speeds and improved our schema generation throughout the hack. Now I know ;-)


This is one arc of my experience during the hack. The data arc. This is where I rediscovered many things I had forgotten about building large data pipelines. There were many more experience arcs as well, especially around the insights that we had. More importantly, when we shared our insights we got a lot of really positive feedback from the team. We even decided to create a new alias for openly sharing data sets more broadly. This will allow others to hopefully find problems, improve on, create new insights from, and hopefully accept our data.

I'll end with a variation of an info-graphic that our data science guy shared with us. His had 8 or so stages of data grief, but the classic graphic apparently has 5 and they definitely seem appropriate given my experiences.