At MIGS 2015, I went to see Alexandre Denault’s talk “You can’t fix it if you don’t know what’s broken”. The title just screamed user experience to me.
Alexandre started by explaining that he has seen people try to fix problems without understanding them, and it’s not pretty. The first step in problem solving is determining what’s broken, then fixing it. Often, finding what’s broken is actually harder than fixing it. Then it dawned on me that Alexandre was discussing system administration, and I realised how much UX shares with Ops.
In this article, I revisit Alexandre’s talk from the perspective of user experience, and look at how we share methods and tools with sysadmins to solve different types of problems that sometimes come together.
How ops monitoring works
In Ops as in UX, we use the scientific method to solve problems. We ask questions, formulate hypotheses, test them, analyse the results, and finally make actionable recommendations to improve the system.
For example, a question both of us would ask is “Why are players leaving my game after level 5?”
We build hypotheses and test them. At first, we don’t have a clue, so we look at everything that could go wrong. Progressively, we narrow down the problem space, and eventually find the solution.
In Ops, this is done through different types of tracking and tools. Alexandre calls this a Golem: a tracking and visualisation tool coupled with a rigorous methodology. The tracking is very specific: one probe is built to monitor one aspect of the problem. The probe returns a status, which can be OK, a warning, or an error, and which is written to the logs.
Information from the probes is accessed at different levels of detail. General graphs give the big picture. They let us visually identify points of interest, which tell us which logs to check, rather than checking everything. Logs are a pain to search through, so you want to make sure you’re at least looking at the right log and searching for the right thing in it.
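As a rough illustration of the “one probe, one aspect, one status” idea, here is a minimal Python sketch. The names (`Status`, `check_threshold`, `run_probe`), the thresholds and the `golem` logger are all hypothetical; the talk did not show any code.

```python
import logging
from enum import Enum


class Status(Enum):
    OK = "ok"
    WARNING = "warning"
    ERROR = "error"


def check_threshold(value, warn_at, error_at):
    """Map one measured value to a status (thresholds are illustrative)."""
    if value >= error_at:
        return Status.ERROR
    if value >= warn_at:
        return Status.WARNING
    return Status.OK


def run_probe(name, measure, warn_at, error_at):
    """Run one probe: take one measurement, log its status, return it."""
    value = measure()
    status = check_threshold(value, warn_at, error_at)
    level = {
        Status.OK: logging.INFO,
        Status.WARNING: logging.WARNING,
        Status.ERROR: logging.ERROR,
    }[status]
    # "golem" is a made-up logger name for this sketch.
    logging.getLogger("golem").log(
        level, "probe=%s value=%s status=%s", name, value, status.value
    )
    return status
```

A probe would then be called with whatever measurement it is responsible for, for example `run_probe("db_connections", count_connections, 70, 90)`, where `count_connections` is a function you would have to supply.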
Like the dashboard in a car, the dashboard graphs and information don’t tell you exactly what the problem is, only where to look for it. An engine light in your car doesn’t tell you what’s wrong with the engine, but you know where to start looking. The dashboard provides a summary view where information is consolidated in one place, so that alerts needing attention are easy to supervise.
When an issue is spotted, the rigorous method kicks in. You have to look for things that should be there but aren’t, or that shouldn’t be there but are. Are any values off? Is the wrong amount of something happening? Methodical attention to detail is key: never skip a thing.
Now, let’s look more closely at the ops side of the talk, and at the many technical details and stories Alexandre shared with us.
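To make the “engine light” idea concrete, here is a small Python sketch of a summary view that keeps only the worst status per area. The area names, the tuple layout and the function names are assumptions made for this example, not something described in the talk.

```python
# Higher number = more severe.
SEVERITY = {"ok": 0, "warning": 1, "error": 2}


def summarize(probe_results):
    """Keep the worst status seen for each area.

    probe_results is an iterable of (area, probe_name, status) tuples.
    """
    summary = {}
    for area, _probe_name, status in probe_results:
        current = summary.get(area, "ok")
        summary[area] = max(current, status, key=SEVERITY.__getitem__)
    return summary


def needs_attention(summary):
    """List the areas whose light is on, in alphabetical order."""
    return sorted(area for area, status in summary.items() if status != "ok")


# Hypothetical probe results.
results = [
    ("database", "disk_space", "ok"),
    ("database", "connections", "warning"),
    ("web", "login_journey", "ok"),
]
```

Here `needs_attention(summarize(results))` points at `database` without saying what is wrong with it: that is the cue to open the relevant logs.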
Which Golem to choose?
At Ludia, Alexandre works with Elasticsearch, Kibana, collectd, Graphite, Grafana, Nagios, Sensu… There are many options for setting up a Golem. There are also Golems for rent, like New Relic, Splunk, Loggly, VictorOps and OpsGenie. They can help you set up something fast and generic.
Alexandre prefers building his own Golem, claiming it’s faster and more to the point. A Golem is built iteratively: spot a problem, figure it out, then add a probe to monitor whether the problem occurs again, so that next time it will be faster to diagnose and fix.
Sometimes the Golem will help you identify and fix issues. Sometimes the Golem will teach you to make a better Golem. Once in a while, your tools won’t be sufficient: you can either stumble around or get more information by adding more probes.
How and when to improve your Golem
Need a bigger database?
Once, Alexandre’s monitoring showed “0 users connected”. The simplest explanation turned out to be the right one: the probe was broken. Another time, the monitor told him “database is full”. Well, no need to look deep into that one: the database was indeed full and he needed a bigger database.
The silent killer
A typical case where more probes may be needed is the “silent killer”: SQL stopped running, and you need to restart it. It’s pretty easy to fix, but it could happen again since you didn’t figure out the cause. Is it a memory leak? Is it the OOM killer that killed a process to keep your machine from crashing?
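One way to answer the “was it the OOM killer?” question is a probe that scans kernel log lines for kill messages. The Python sketch below assumes the kernel writes something like `Killed process <pid> (<name>)`; the exact wording varies between kernel versions, so check your own logs before relying on this pattern. The function name and the sample lines are illustrative and do not come from the talk.

```python
import re

# Assumed message shape, e.g. "Out of memory: Killed process 1234 (mysqld)".
# Kernel versions differ, so treat this pattern as a starting point only.
OOM_KILL = re.compile(r"Killed process (\d+) \(([^)]+)\)")


def find_oom_kills(log_lines, process_name=None):
    """Return (pid, name) for each kill found, optionally filtered by name."""
    kills = []
    for line in log_lines:
        match = OOM_KILL.search(line)
        if match is None:
            continue
        pid, name = int(match.group(1)), match.group(2)
        if process_name is None or name == process_name:
            kills.append((pid, name))
    return kills


# Illustrative log lines, made up for this example.
sample = [
    "kernel: Out of memory: Killed process 1234 (mysqld) total-vm:4096kB",
    "kernel: eth0: link up",
]
```

If `find_oom_kills(sample, "mysqld")` returns a match, the restart is explained and the next question becomes why memory ran out in the first place.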
Sometimes, your probes aren’t good enough to spot the problem. Alexandre recalled one time when the community reported that players were being disconnected every 5 seconds. The monitors did not notice it because they did not collect enough data: MySQL was near its limit, but the pattern was not regular enough in the data to be spotted. He needed to update the Golem.
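The talk did not detail exactly why the monitors missed this, but one common way for it to happen is a probe that only looks at averages. The Python sketch below uses made-up numbers (usage as a percentage of a limit, with a short spike every 5 seconds) to show an average-based alert staying silent while a peak-based one fires.

```python
from statistics import mean


def summarize_window(samples):
    """Reduce one window of samples to its average and its peak."""
    return {"mean": mean(samples), "max": max(samples)}


def fires(value, threshold=90):
    """Hypothetical alert rule: fire when the value reaches the threshold."""
    return value >= threshold


# Made-up data: 60 one-second samples, 60% usage with a spike to 100%
# every 5 seconds.
samples = [100 if second % 5 == 0 else 60 for second in range(60)]
window = summarize_window(samples)

alert_on_mean = fires(window["mean"])  # the average hides the spikes
alert_on_max = fires(window["max"])    # the peak exposes them
```

With these numbers the average is 68% and the peak is 100%, so only the second alert fires. Updating the Golem in this scenario would mean recording the peak, or sampling more often, not just adding another graph.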
Working with marketing
Other times, the issue is invisible to your Golem, but someone else’s, like the analytics department’s, might have gathered the data that makes sense of it. For example, say transactions are down by 25% since update x: your dashboard shows nothing, transactions are indeed lower, but the logs are OK. It’s usually a good idea to go sit next to analytics and check their graphs.
Golems can also show differences between platforms (Android but not Apple) that need to be interpreted properly. On Android, a transactions-per-country view showed that France and Germany were down, but not the UK. So if there was no problem with the European servers… it was the euro! The euro symbol wasn’t displayed properly and crashed the game, even though the community had not complained.
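The per-country view boils down to comparing each segment before and after the update, then looking at what the affected segments have in common. Here is a minimal Python sketch with invented figures; the country codes, counts and the 25% threshold are placeholders, not data from the talk.

```python
def relative_change(before, after):
    """Relative change per key, e.g. -0.5 for a 50% drop."""
    return {
        key: (after.get(key, 0) - count) / count
        for key, count in before.items()
        if count  # skip segments with no baseline
    }


def flag_drops(before, after, threshold=-0.25):
    """Segments whose change is at or below the threshold, sorted by name."""
    changes = relative_change(before, after)
    return sorted(key for key, change in changes.items() if change <= threshold)


# Invented daily transaction counts, before and after a release.
before = {"FR": 400, "DE": 500, "UK": 450}
after = {"FR": 200, "DE": 240, "UK": 445}
```

With these figures `flag_drops(before, after)` returns France and Germany but not the UK. The code only narrows the search; noticing that the two flagged countries share a currency is still up to whoever reads the result.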
So what to put in your Golem, and when to stop?
The basics to look at are simulations of user journeys (like logging in), disk space, hardware, machine load, network, and OS and disk stats… Then come application-level statistics (devs usually provide measurements for ops). However…
If you monitor too much data, you won’t be able to use it! Probes have a maintenance cost. Log entries require storage space and efficient search. The number of logs increases the complexity of the information to display.
Build something, then add the information you used to fix whatever broke most recently. Setting up a Golem is an iterative process, and things get better with time.
Of course, the same goes for game analytics and the study of player behavior: start with a couple of things to track, and increase the complexity progressively as you find interesting insights to refine, without wasting resources on measures that won’t really tell you anything.