Peter Gillard-Moss

Transient opinion made permanent.

Monitorama 2013

I spent the back end of the week attending Monitorama EU 2013 in Berlin. It was an enjoyable, well organized affair. The talks were generally of high quality, and those I didn't find engaging others had called out as some of the best of the day, which suggests a good spread.

The conference was centred on one common goal: to make monitoring better. A community is definitely forming. One which is generally polite and respectful, with a focus on co-operation, open source, open platforms and moving the industry forward regardless of differences (devs, ops, dev-ops, marketeers). It was one of the few events - in both professional and external pursuits - where sponsors genuinely support and enable a cause, rather than the usual cynical sponsorship we have become used to: railroading, restricting and blackmailing the consumer. Instead there was more a symbiotic relationship between participants, organizers and sponsors. All had a common goal and vision and enabled a great event. As a participant I wanted to see the sponsors' products because they were genuinely of interest and provided education. As a sponsor it's a great way to get your message to people. As an organizer it enables the event, and its message, to be realized. I think that is a rare thing these days.

As I reflected upon the conference and its common threads on my bus journey to the airport, it occurred to me how immature and lost we are as an industry on the topic. One co-participant said to me, "I came expecting someone to tell me it was all easy and I was doing it wrong. But everyone is struggling with this stuff." It was a true observation. No talk dared suggest a "silver bullet" or evangelise an approach. Talk after talk focussed on the struggles and the questions; although there were some answers, many of the big questions were left unanswered or addressed with dreams or speculation.

Though it is not quite that hopeless. Further reflection enlightened me as I realized that, although it is an accurate picture, it is still one that tells a story of great progress. Monitoring has evolved a long way. As a community, the monitoring enthusiasts have solved a large number of monitoring problems, the biggest of which is how to even get data and where to put it. Collecting data, storing it, processing it: the last few years have solved these problems (though the conference demonstrated that there is still room for innovation in these areas). Where we are all lost is what to do with this data. Speaker after speaker admitted that it is beyond them and reached out for help to those more skilled, knowledgeable and smarter in these areas. Arms held out to a data science and statistics community that is all too disconnected from the community that needs it. This is the great challenge.

Alerts and surfacing problems came up again and again as a problem unsolved due to the limited skill set of the software engineer. And with that came another admission: our clumsy attempts at using naive school-level statistics - predominantly normal distributions and percentiles - are misguided, perhaps even dangerous (well, relative to the sense of danger in business software rather than aeroplanes, as speaker after speaker informed us). We've learnt a tool and some basic maths, but it does not apply to our domain and never will. A classic case of a hammer making everything look like a nail. Our domains are unpredictable, with complex models that will not and cannot fit the mathematical models of the predictable, steadily rhythmic metrics of the factory line. The result is unacceptable numbers of false positives, and true positives missed, lost, or even ignored in all the noise of the false positives. We need better models. And with that comes many challenges.
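To make the point concrete, here is a minimal sketch (with hypothetical data, not from any of the talks) of how the classic "three sigma" alert rule misfires: on a heavy-tailed latency distribution, which is typical of real service metrics, far more points breach the threshold than the normal distribution would predict, so the alert fires on perfectly healthy traffic.

```python
import random
import statistics

random.seed(42)

# Hypothetical latency series in ms: exponentially distributed, i.e.
# heavy-tailed, like most real service metrics - not a bell curve.
latencies = [random.expovariate(1 / 100) for _ in range(1000)]

mean = statistics.mean(latencies)
stdev = statistics.stdev(latencies)
threshold = mean + 3 * stdev  # the classic "three sigma" alert rule

# A true normal distribution would put roughly 0.1% of points above
# this threshold (about 1 in 1000); a heavy-tailed one puts far more
# there, producing a steady stream of false-positive alerts.
breaches = sum(1 for x in latencies if x > threshold)
print(f"{breaches} of {len(latencies)} points breach the 3-sigma threshold")
```

For an exponential distribution the expected breach rate above mean plus three standard deviations is about 1.8%, roughly an order of magnitude more alerts than the normal model promises, which is exactly the noise problem the speakers described.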

Some speakers entertained suspicions that as an industry we are at the forefront. The evidence is convincing. Not only are we pioneers in our industry, but pioneers in monitoring across the spectrum. Other industries flounder and fall as we do, even when lives depend on it. And other industries fall back on the most faithful algorithms of all, despite their great flaws: those held secret by the human brain.

For the human brain is the best pattern matcher, the best instrument for sorting the signal from the noise, and runs the best algorithms for detecting genuine or even potential problems. Yet staring at screens is a mundane, unskilled activity, disruptive and demoralising. It conjures images of the Simpsons episode where Homer is employed to monitor the reactor plant, with disastrous consequences. This, of course, creates a hankering in the technologist. Where there is a mundane activity performed by a human, a well-engineered automated solution offers permanent relief. Yet, as already mentioned, adequate algorithms are beyond us and current attempts hinder rather than help.

At the end of the conference I felt that alerts were good intentions leading to hell. To paraphrase Gogol, the road seems straight and well lit, yet we have all wandered off course and are scrabbling around in the darkness. A darkness caused by biblical swarms of alerts. Some speakers suggested turning them off altogether and relying on humans, because that's all you could rely on. Others suggested minimizing them, or prioritizing them, or collating and aggregating, or other meta strategies. It was an area on which I remained wholly unconvinced. Nobody stood up and dared suggest that they had an alerting strategy that worked.

Given that monitoring is traditionally an operational concern, there was a refreshing absence of developer bashing. It seems there was common agreement that upstream developers need to treat monitoring as a first-class concern and create applications that are well monitored. Downstream, there was broad agreement that operations need to work to provide the services and tools that allow developers to easily integrate with monitoring platforms. And ultimately, for success, there must be communication and collaboration.

At the end of the two-day event I concluded that this is a four-part problem: one of technology, people, analytics and usability. These parts all need different skills and different communities. Monitoring is also not just about the CPU usage of servers. As monitoring grows, I hope that next year most of the talks will be from data scientists, usability experts and business people telling stories beyond the CPU gauge and disk space alert.