Eric Nielsen (@ericnpro) and I (@davidj7494) just interviewed Jad El-Zein, Technology Director in VMware’s Cloud Management Business Unit, Office of the CTO. The SoundCloud version of the podcast can be found here. Jad’s IT Operations background includes a tenure of more than a decade with Lockheed Martin, where he first interacted with VMware as a customer. Jad has spent most of the last decade at VMware focused on multi-cloud management and data center automation. Our interview centered on how machine learning is revolutionizing the technologies we use to manage infrastructure and applications.
Before we jumped deep into IT Operations and ML, we spent a bit of time discussing what machine learning is and is not. Jad did a great job of simplifying a complex topic and led us through a short tutorial on how machine learning works in general. We discussed some core concepts related to machine learning: supervised learning, which provides the learning system with a predefined view of the world; unsupervised learning, where the system creates its own view of the world based on what it observes; and reinforcement learning, where the system takes one or both of these ideas and then applies actions to a target in order to achieve a defined goal. Rinse and repeat.
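To make the three ideas a little more concrete, here is a minimal Python sketch. It is not something Jad walked through on the podcast, and it is far simpler than any real ML system. The function names, the latency numbers, and the "ok"/"slow" labels are all made up for illustration.

```python
import random

# Supervised: labeled examples give the system a predefined view of the world.
def train_threshold(samples):
    """samples: list of (latency_ms, label) pairs, label is 'ok' or 'slow'.
    Returns the midpoint between the two class means as a decision threshold."""
    ok = [x for x, label in samples if label == "ok"]
    slow = [x for x, label in samples if label == "slow"]
    return (sum(ok) / len(ok) + sum(slow) / len(slow)) / 2

# Unsupervised: no labels; the system groups what it observes on its own.
def two_means(values, iterations=10):
    """Tiny 1-D k-means with k=2. Returns the two cluster centers (low, high)."""
    lo, hi = min(values), max(values)
    for _ in range(iterations):
        near_lo = [v for v in values if abs(v - lo) <= abs(v - hi)]
        near_hi = [v for v in values if abs(v - lo) > abs(v - hi)]
        lo = sum(near_lo) / len(near_lo)
        if near_hi:  # empty only when every value is identical
            hi = sum(near_hi) / len(near_hi)
    return lo, hi

# Reinforcement: act, observe the reward, adjust, and repeat.
def reinforce(actions, reward_fn, rounds=200, epsilon=0.1, seed=0):
    """Epsilon-greedy loop. Tries every action once, then mostly repeats the
    action with the best average reward while occasionally exploring."""
    rng = random.Random(seed)
    totals = {a: 0.0 for a in actions}
    counts = {a: 0 for a in actions}

    def average(action):
        return totals[action] / counts[action]

    def take(action):
        totals[action] += reward_fn(action)
        counts[action] += 1

    for action in actions:  # initial pass so every action has an estimate
        take(action)
    for _ in range(rounds):
        if rng.random() < epsilon:
            take(rng.choice(actions))        # explore
        else:
            take(max(actions, key=average))  # exploit the best known action
    return max(actions, key=average)
```

Calling `reinforce(["small_cache", "medium_cache", "large_cache"], reward_fn)` with a reward function that scores each action returns the action with the highest average reward seen so far, which is the "rinse and repeat" loop in miniature.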
Machine learning meets IT Ops
We then turned our attention to the specific case of IT Operations Management. Historically, tuning technology to produce improved system performance meant interpreting tons of data from a myriad of monitoring systems and then engaging in what I would characterize as a “live fire” exercise. A system admin would make one or a few coarse-grained adjustments to the configuration of the system in the hope of improving performance. They then crossed their fingers and prayed the changes they made didn’t result in a slew of negative cross-system interactions that, instead of improving performance, actually degraded it. If that negative outcome happened, they would then scramble to recover by rolling back one or more of the recently made adjustments. This was a world where making changes to achieve a performance benefit often entailed more risk than could be justified by the anticipated reward.
As system complexity has grown, so has the challenge of tuning systems manually. This is especially true as cloud and micro-services have become ever larger parts of what IT Operations teams have to manage. Today, there is far too much data to assess in anything like near real time. There are also far too many potential component-level interactions for a human mind to reasonably predict what will happen if a configuration is changed. This increased level of complexity is driving the need to adopt new approaches that leverage machine learning. ML is viewed as a great vehicle for analyzing massive amounts of data and then taking fine-grained actions that can improve system performance while minimizing the risk of unintended consequences.
VMware vRealize AI Cloud
In discussing the application of machine learning to vRealize technologies, Jad explained that when VMware makes recommendations on the best way to configure a VMware Software Defined Data Center (SDDC), those recommendations are based on what we know about what works on average for most environments. Increasingly, supervised learning plays a significant part in developing this kind of systems knowledge. But every organization’s environment is unique, so once an SDDC is deployed, we also need to leverage unsupervised learning to learn about things that a newly deployed SDDC hasn’t seen before.
Reinforcement learning then comes into play as the system takes into account what it knew before the deployment, plus what it has learned since the deployment, to recommend or automatically make changes in the environment in pursuit of a specific goal. That initial goal is focused on improved system performance. Once an action has been taken, the system learns from what actually happened and feeds that back into its next decision.
Jad also helped Eric and me understand the concept of a “digital twin,” which is used to continuously simulate the behavior of the actual environment that is being managed. This twin can simulate millions of adjustments in near real time, taking into account cross-component interactions as part of the simulation. The ML platform can then recommend or automatically implement only those actions that will holistically improve overall system performance.
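The digital twin idea can be sketched in a few lines of Python. This is only an illustration under loud assumptions: `simulate_latency` is a made-up formula standing in for a twin, and the `cache_mb` and `queue_depth` settings are hypothetical, not actual vSAN or vRealize parameters. The point is that candidate changes are tested against the simulation, not the live environment, and only a change that improves on the current baseline is ever recommended.

```python
from itertools import product

def simulate_latency(config):
    """Hypothetical stand-in for a digital twin: predicts latency in ms.
    The cache * queue term models a cross-component interaction, so two
    changes that each help alone can hurt when combined."""
    cache, queue = config["cache_mb"], config["queue_depth"]
    return 100 - 0.05 * cache - 0.5 * queue + 0.001 * cache * queue

def recommend(current, cache_options, queue_options):
    """Simulate every candidate configuration and return the best one.
    Returns (None, baseline, baseline) when nothing beats the current setup."""
    baseline = simulate_latency(current)
    best, best_latency = None, baseline
    for cache, queue in product(cache_options, queue_options):
        candidate = {"cache_mb": cache, "queue_depth": queue}
        predicted = simulate_latency(candidate)
        if predicted < best_latency:  # keep only genuine improvements
            best, best_latency = candidate, predicted
    return best, baseline, best_latency

current = {"cache_mb": 256, "queue_depth": 16}
best, baseline, predicted = recommend(current, [256, 512, 1024], [16, 32, 64])
```

In this toy model, raising the cache size alone or the queue depth alone lowers predicted latency, but raising both to their maximum is almost as slow as the starting point. That is the kind of cross-component interaction a human tuning by hand would struggle to predict.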
Before we wrapped up our discussion, we got specific about how VMware is integrating machine learning into the vRealize portfolio of capabilities. This effort was first introduced more than two years ago at VMworld. Today, what was once a kernel of pre-release technology has blossomed into a powerful machine learning platform that is available as a SaaS offering called vRealize AI Cloud. From a capability standpoint, the initial set of releases of vRealize AI Cloud has been focused on training the system to pursue the goal of improving the performance of vSAN deployments.
In addition, the development team has focused on creating something it calls the “explainability UI”. Most people aren’t completely comfortable with a black box running their systems, so the explainability UI makes it drop-dead easy to understand the changes that vRealize AI Cloud is making to help the system run better. It communicates this information to an end user in a way that lets them also understand how much each of the implemented changes is contributing to the overall performance improvement.
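As a rough illustration of what "how much each change contributes" can mean, the Python sketch below applies a list of changes one at a time against a made-up performance model and records the improvement from each step. This is an assumption-laden toy, not a description of how the explainability UI actually computes its numbers: `predicted_latency`, the setting names, and the coefficients are all hypothetical.

```python
def predicted_latency(settings):
    """Hypothetical performance model in ms; lower is better."""
    return (120.0
            - 0.02 * settings["cache_mb"]
            - 0.25 * settings["queue_depth"]
            - 5.0 * settings["compression"])

def explain(baseline, changes):
    """Apply changes in order and attribute the latency reduction to each.
    Returns a list of (setting, ms_saved, percent_of_total_saving)."""
    settings = dict(baseline)  # copy so the caller's baseline is untouched
    before = predicted_latency(settings)
    steps = []
    for name, value in changes:
        settings[name] = value
        after = predicted_latency(settings)
        steps.append((name, before - after))
        before = after
    total = sum(saved for _, saved in steps)
    return [(name, saved, 100 * saved / total if total else 0.0)
            for name, saved in steps]

baseline = {"cache_mb": 256, "queue_depth": 16, "compression": 0}
changes = [("cache_mb", 512), ("queue_depth", 32), ("compression", 1)]
report = explain(baseline, changes)
for name, saved, share in report:
    print(f"{name}: {saved:.2f} ms saved ({share:.1f}% of the improvement)")
```

One caveat on this design: step-by-step attribution depends on the order in which changes are applied whenever settings interact, so a real system would likely need a more careful method.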
Listen to the podcast
As always, Eric and I had a great time on the podcast. Jad was a fun guest and is super knowledgeable about both IT Operations and the application of machine learning to this particular area.
You can listen to the podcast on YouTube, SoundCloud or Spotify. If you want to check out other podcasts related to multi-cloud architecture and how VMware technologies can help address challenges in this area, check out the SoundCloud playlist on this topic.
Want to know more about VMware’s unique approach to multi-cloud architecture? Get the definitive guide.