Should I go with open source or commercial monitoring for my growing cloud application? This question is as old as open source software itself. But with the rapid growth of cloud-native applications, it has become more important than ever. When you’re building your SaaS product and your engineering team has a tight budget, turning to an open source monitoring tool deployed in-house may look like a great option.
However, as you scale to keep up with user growth, this choice may negatively impact your engineering teams, your business, and your customers. For many reasons, commercial monitoring is usually a better solution as a SaaS business grows, and many monitoring issues can be avoided altogether if a hosted commercial solution is chosen at the start.
In this blog post, I will introduce five key questions you need to ask yourself before adopting an open source monitoring tool for your teams to keep an eye on your service. To illustrate the importance of these questions in the real world, I will describe the actual experiences of Boxever, a very successful SaaS provider of a leading customer personalization platform.
Today, the Boxever Intelligence Cloud application processes billions of events weekly from 700 million customer profiles. As Boxever grew their platform to handle this enormous scale, their engineering team started to run into performance issues with their open source tooling.
Initially, they adopted Prometheus, an open source monitoring system built around a time-series database, to monitor their cloud service. At first, while Boxever engineers were analyzing only a limited number of metrics, Prometheus was adequate. But as the Boxever business grew, they soon realized they needed a more scalable offering. Below are the five questions they eventually asked themselves. Asking yourself the same questions now can help you avoid future monitoring roadblocks.
1. How important is high availability of my monitoring platform?
Availability and reliability of your monitoring directly impact a SaaS business’s bottom line and customer satisfaction. Most businesses can’t afford to have their monitoring platform down when they need it most.
One of the biggest reasons Boxever switched from Prometheus to Wavefront by VMware is that Prometheus lacks support for a real high availability configuration. Prometheus handles high availability through its federation concept, which is based on a complex, layered architecture similar to a pyramid. Once the limits of one “pyramid” instance are reached, you need a second one. If one of the instances goes away, so does the monitoring data it holds, which is a big concern for any serious site reliability team.
With Wavefront, Boxever engineers can meet the high availability requirements of their customers, as redundancy and a high availability architecture are built into the Wavefront cloud at multiple levels.
“You want to avoid operating purely reactively. You want to avoid customers reporting issues before you see them. Wavefront is helping us be proactive and correct service and infrastructure issues before customers are impacted.”
– Anders Holm, Sr. Systems Engineer, Boxever
2. Should I invest my finite engineering resources in building and maintaining a monitoring tool vs. my core SaaS business?
For a SaaS enterprise, it’s all about delivering a service your customers love. And for developers, it’s fulfilling to work on meaningful projects that directly grow business value. A focus on business value and growth was important to Boxever as well. As their engineering team started to add more metrics to Prometheus, they saw latency issues when their metrics were loading. When they analyzed what it would take to scale the tooling, they found they would have to invest six months of an engineer’s time to re-architect the Prometheus implementation for scale, and then invest even more time to maintain it.
They quickly realized they had to make a conscious decision to focus their engineering resources on scaling their own business and on innovation, rather than on building monitoring solutions. So they turned to Wavefront, a cloud-hosted, metrics-driven analytics service where enterprise-class scalability and high availability are already built into the service. Since then, Boxever’s teams have been able to add new metrics flows seamlessly, with no impact on the performance and reliability of their monitoring.
3. How much time and how many resources should maintaining my monitoring platform require?
For an engineering team supporting a rapidly growing cloud service and infrastructure, this question is particularly important. Is your team big enough to fill the gaps left by open source monitoring blind spots? Going back to the previous Boxever scenario, they initially found Prometheus was good enough. But they soon realized that, under the increased load, their Prometheus instance would fail and fall over, and they then had to reconfigure it to get it running again.
In addition, you want to avoid a situation where a Prometheus instance doesn’t recover after a crash, which can leave you without monitoring. After switching to Wavefront, the Boxever team no longer needed to spend time maintaining a complex system implementation. They now just deploy configurations specifying which metrics to grab, restart the Wavefront (Telegraf) metric collection agent, and they’re done.
4. Will my monitoring platform scale immediately when I need it, without having to pay for that scale prematurely?
Let’s assume your SaaS business is successful and starts to grow. The important question you now need to ask is: will my tools scale? As Boxever started to grow rapidly, so did their metrics volumes, as their engineers added more and more necessary telemetry. They began to question why they should invest further in a bigger and more complex Prometheus infrastructure. They didn’t want to lose focus on innovation velocity or on continuing to scale their successful business.
Consider the scale-limiting scenario: when one Prometheus node gets too busy to deal with all the metrics, you can either move it to a bigger box with more horsepower, or spin up another instance of Prometheus. With a second instance, you then have the arduous task of configuring which instance of Prometheus gets which metrics from which services on your network. Then you have to maintain all of that. To cut to the chase, it gets messy very quickly.
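To make that assignment problem concrete, here is a minimal sketch in Python of one possible way to split scrape targets across several Prometheus instances by hashing each target’s address. The function names, target addresses, and instance count are hypothetical and are not taken from Boxever’s setup. The sketch only illustrates the kind of bookkeeping you take on once a single instance is no longer enough.

```python
import hashlib
from collections import defaultdict


def assign_shard(target: str, shard_count: int) -> int:
    """Map a scrape target to a shard index using a stable hash.

    A stable hash (MD5 here) is used instead of Python's built-in hash(),
    which can differ between interpreter runs for strings.
    """
    digest = hashlib.md5(target.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def plan_shards(targets, shard_count):
    """Group targets by the Prometheus instance (shard) that should scrape them."""
    plan = defaultdict(list)
    for target in targets:
        plan[assign_shard(target, shard_count)].append(target)
    return dict(plan)


# Hypothetical service endpoints, for illustration only.
targets = [f"service-{i}.internal:9100" for i in range(12)]

for shard, members in sorted(plan_shards(targets, 2).items()):
    print(f"prometheus-{shard}: {len(members)} targets")
```

Note that with a simple modulo scheme like this one, changing the number of instances moves many targets to a different instance, which is one reason the setup becomes harder to maintain as it grows.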
On top of all that, if you want to provide a relatively responsive UX for your teams, with the metric visualizations and customizations that everybody wants, you have to add a third node, which pulls the latest metrics data from the other two nodes that are dedicated to gathering all the data. You now need three nodes to run Prometheus, so you’re tripling your costs from where you started. At that total cost of ownership (TCO), it’s not worth the effort, the time, and, ultimately, the distraction that comes with all that complexity.
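The node arithmetic behind that tripling can be sketched as follows. This is a rough illustration that assumes every node costs the same and ignores engineering time. The function names are hypothetical.

```python
def prometheus_node_count(collector_nodes: int, query_nodes: int = 1) -> int:
    """Total nodes: collectors that gather metrics plus nodes that serve queries."""
    return collector_nodes + query_nodes


def relative_cost(collector_nodes: int, query_nodes: int = 1, baseline_nodes: int = 1) -> float:
    """Infrastructure cost relative to the starting setup.

    Assumption: every node costs the same, so cost scales with node count.
    """
    return prometheus_node_count(collector_nodes, query_nodes) / baseline_nodes


# Two collector nodes plus one query node, compared with a single starting node.
print(relative_cost(collector_nodes=2))  # 3.0
```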
For Boxever, it became crystal clear that a large Prometheus infrastructure would keep increasing in complexity and TCO as they grew, without incremental functional benefits. So they decided to move to the Wavefront metrics-driven analytics service. Now the only thing they need to “worry about” is finding the URL to get their graphs and posing their queries, and off they go with dashboards, drill-downs, and alerts.
5. How quickly do I need my engineering teams to adopt and use my monitoring platform?
As your digital enterprise grows, so does your engineering team. How quickly can new engineers ramp up, including their adoption of monitoring tools for visibility? One reason the Boxever team originally selected Prometheus was that the Prometheus query language appeared to give them lots of flexibility. Initially, it did, but it came at the high cost of a long learning curve and long onboarding time for new team members. Then, even once new engineers started using the tool, adding and pushing new metrics, they ran into scaling limitations. Their Prometheus setup was unable to handle even a relatively light query load: queries became painfully slow to run, timing out and crashing their browsers.
When evaluating Wavefront, the Boxever engineering team was pleasantly surprised to see how easy the Wavefront Query Language was to use, and how much faster its queries completed, with no foreseeable limitation in scale. The Wavefront Query Language brought not only a significantly shorter learning curve but also a clear increase in power and flexibility.
The answers to these five questions led Boxever engineers to ditch their open source monitoring platform and move to Wavefront.
“I think we’re already in monitoring nirvana because we’re high from all the happiness of how well Wavefront works for us.”
– Anders Holm, Sr. Systems Engineer, Boxever
On the surface, deciding whether to go open source or commercial for metrics monitoring is not easy. But as you review and ponder these five questions, you can start to gather the right information to make a smarter decision. With an awareness of the experiences at Boxever, you can see some of the pitfalls that emerge over time when using open source tooling in-house. Instead, you can bypass the complexity and headaches by choosing a hosted commercial service like Wavefront by VMware. Ask the right questions, and you’re more likely to get the right answers. You’re likely to get there even faster with Wavefront by VMware.