When the need for a new computing environment arises, there are many choices out there to fill that need, ranging from the original, tried-and-true, build-it-yourself method, to the newest as-a-service solution, VMware Cloud on AWS. In between lie options such as an out-of-the-box, hyper-converged infrastructure package, as well as a number of managed solutions.
VMware Cloud on AWS promises to be the ultimate in simplicity. It’s essentially a zero-hardware, zero-configuration, fully managed, fully scalable solution that runs the same platform most enterprises have been running for years – vSphere.
But what exactly do you give up by not building out your own installation? Since VMware Cloud on AWS is designed to be able to connect and integrate easily with your existing on-premises vSphere environment, it was important that our customer success team had a way to simulate what our customers will be doing. We also wanted to experience what it’s like to use VMware Cloud on AWS as a customer. This resulted in the opportunity to build an on-premises lab running vSphere, which we could use to connect to VMware Cloud on AWS.
With all the expertise and platforms available to us, we figured it should be a pretty simple process. We decided to document the experience of setting up a new environment, as a historical reminder of what things used to be like before we could go online and procure an entire Software-Defined Data Center (SDDC) in a few clicks. It was also an opportunity to see some of the challenges that could be encountered when setting up a new on-premises environment – and how many of those issues would no longer be factors with VMware Cloud on AWS.
As we progressed through the build, it turned out that plenty could go wrong, even with experienced data center experts working on it. Read on for some of the things that caused us frustration in our process of building out an SDDC from scratch, and see if you’re ready to give some of them up.
Part 1 – Planning
The first thing we worked on was some basic design decisions. The hardware was a fixed configuration from a managed hosting provider, which made a number of the decisions for us. Instead, we focused on the software design: How many clusters? What kind of network design? Which versions of vSphere?
Our hosts had two 10Gbps NICs, and since we had only four hosts for this environment and planned on using vSAN, we limited ourselves to a single cluster with a single distributed vSwitch. That would be fine for our needs, although it required a fully collapsed NSX design – which meant we had to be sure to add our vCenter and PSC to the NSX exclusion list, so that a future operational error in a firewall rule couldn't block access to vCenter (a mistake that's hard to undo, since vCenter is the primary management interface).
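As a back-of-the-envelope sanity check on that four-host, single-cluster decision, the standard vSAN mirroring rules can be sketched in a few lines of Python. The function names are ours, purely illustrative – this is just the arithmetic, not any vSphere API:

```python
def vsan_min_hosts(ftt: int) -> int:
    """Minimum hosts for vSAN with RAID-1 mirroring:
    FTT+1 data copies plus FTT witness components, each on its own host."""
    return 2 * ftt + 1

def usable_capacity_tb(raw_tb: float, ftt: int) -> float:
    """Approximate usable capacity with mirroring: each object is stored
    FTT+1 times. (Ignores slack-space and other overhead recommendations.)"""
    return raw_tb / (ftt + 1)

# With 4 hosts we can tolerate one host failure (FTT=1) and still have a
# spare host available for rebuilds, at the cost of half the raw capacity.
print(vsan_min_hosts(1))            # 3
print(usable_capacity_tb(40.0, 1))  # 20.0
```

With FTT=1 the fourth host isn't strictly required, but it gives vSAN somewhere to rebuild components after a failure – one reason we were comfortable with a single cluster.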
At least, since it was a hosted hardware solution, we didn’t have to wait for hardware to be delivered, racked, and cabled in our data center.
We decided to use the latest versions of the vSphere and NSX software – but we knew it would be a good idea to spend some time with the compatibility and interoperability guides and matrices to make sure everything was supported. For the most part we were OK, but the RAID controller on our hosts wasn't listed for ESXi 6.5 with vSAN. Since this was a lab, and the adapter was listed as supported for ESXi 6.5, as well as for vSAN on ESXi 6.0U3 (just not vSAN 6.6), we decided to go ahead with it anyway – something we couldn't have done in production. That would have meant ordering new hardware, with the associated approvals, costs and delays – and nobody wants to go back to the well for more money a second time.
Once we had our design figured out, all we had to do was throw on some ISO images, install everything, put in some basic config and we should be ready to start running. That turned out to be a little optimistic.
Part 2 – Building
The first pain we felt was how long it took to get all the large ISO images downloaded and put into a location where we could use them. It wasn’t difficult, but it was surprising how slow the network paths could be accessing some environments – especially with 9GB and 5GB images to transfer.
At first, things were going smoothly. We got ESXi installed on our hosts, and were getting ready to get a vCenter going, when we figured we should really update the hosts' firmware to be consistent. Managing firmware versions to ensure they're matched to the driver versions is one of the most critical pieces of building a stable data center. So after a few hours of research, and time lost downloading the wrong 10GB ISO image, we had a bootable image to update the servers from. Just mount it using the remote management software, boot, and let it run.
I’ll come back in the morning. It should be done by then.
Good thing that, by then, we had built a jumpbox in the environment. A few hours uploading the images there and – WOW – that virtual drive worked SO much faster. Where was that high-speed LAN when we needed it?
Then, we just had to push out the vSphere updates – it’s so great how easy it is to update ESXi hosts with the built-in VMware Update Manager in the latest VCSA. Wait, why is this one host not coming back up? It can’t find its network card?
That turned into half a day of troubleshooting why the driver wasn't picking up the network card on one particular host, when the same process and versions had worked fine on all the others. Finally, we cut our losses and just re-installed ESXi, which turned out to be a 30-minute solution to a six-hour problem.
Part 3 – Configuring
All that was left to do was set up our vSAN storage, so we were thinking we must be through the worst.
Oh, wait! That's funny – two of our hosts weren't showing any disks attached. A call to the hosting provider's support got that fixed: they swapped out the incorrect hosts that had been assigned to us. Of course, then we had to update the firmware and install ESXi again – we sure got lots of practice with that!
Finally we could see disks on all our hosts as we expected. But vSAN wasn't so interested in them. We found a KB article saying we needed to configure each disk as its own RAID0 virtual disk, with some recommended cache settings. Back into the BIOS setup. Only a few hundred clicks… and a couple of reboots… Did you ever notice how long it takes a physical host to reboot? All those fancy management tools, different device BIOSes and options that make them effective servers sure add to the boot time. It provided many great opportunities to grab a coffee – which got us thinking about the nearest coffee place to every data center any of us had ever been in. It's just something you need to know.
Part 4 – Running it
By this point, the environment was pretty close to being ready – we had our hosts set up, vCenter mostly configured, core network services built. Just some final touches left to do: move VMs from the temporary storage and networking we used to build them, onto the newly configured vSAN storage and distributed vSwitch. Which, fortunately, worked pretty smoothly. The one time we made a small error, vCenter was kind enough to roll back the config and let us do it properly. Back in the day that would have been a long process to rebuild the networking. Progress is great!
Then, all of a sudden, we started getting an “ERROR 400 cannot connect to vCenter.” It had been working fine just a second earlier. Did the migration cause it? First instinct was to try rebooting things, just in case. That didn't help.
Instead, we started looking through the vCenter logs, and found what appeared to be a time sync problem. Knowing how critical time sync is, we had set up NTP for everything. But, would you look at that – our NTP must not have been working right, because the time on our PSC was way off from the vCenter. So we manually set the time correctly, and we were back in business. For an hour. Why wouldn't the time stay in sync? It turned out our network was blocking the domain controller from reaching an upstream NTP server, which caused it to give up after a while and fall back to syncing with its hardware clock – which wasn't so good.
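The drift we were chasing is exactly what the NTP clock-offset calculation (RFC 5905) is designed to expose. A minimal sketch of that arithmetic in Python – the variable names are ours, and a real client would of course exchange these timestamps with the server over the wire:

```python
def ntp_offset(t0: float, t1: float, t2: float, t3: float) -> float:
    """Clock offset per RFC 5905:
    t0 = client transmit, t1 = server receive,
    t2 = server transmit, t3 = client receive (all in seconds).
    A positive result means the client clock is behind the server."""
    return ((t1 - t0) + (t2 - t3)) / 2.0

def ntp_delay(t0: float, t1: float, t2: float, t3: float) -> float:
    """Round-trip network delay, excluding server processing time."""
    return (t3 - t0) - (t2 - t1)

# Example: client clock running 5 seconds slow, 100 ms round trip.
print(ntp_offset(100.00, 105.05, 105.05, 100.10))  # 5.0
print(ntp_delay(100.00, 105.05, 105.05, 100.10))   # 0.1 (approximately)
```

Because the offset averages the two one-way measurements, it stays accurate even when the path delay is asymmetric-ish – which is why a host that can't reach any NTP server at all (like our domain controller) drifts so badly.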
After a few hours of bliss, we got the next alarm: “vSAN Alarm – disk health”. Just a few hours in, and already a disk had failed! We should have seen the warnings – the same disk had been showing up in a “foreign” state earlier, which caused us some challenges. Guess that wasn't just an accident. But since we were using a managed hosting solution, we just opened a ticket to get the disk changed. Fortunately, we got great service, and they had it replaced within a few hours. Then we had to go through the process of configuring the new disk in the BIOS, adding it back to vSAN and letting it rebuild – which would have been easier if vSAN would actually list it. Good thing it only took us a few hours to figure out that it wasn't a new disk: it already had partitions on it, which prevented vSAN from claiming it until we erased them.
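The rule that tripped us up – vSAN only claims completely empty disks – is simple enough to capture as a little triage function. A purely illustrative Python sketch; the data shape and field names are our own invention, not a vSphere API:

```python
def vsan_disk_action(disk: dict) -> str:
    """Decide what a candidate vSAN capacity disk needs before it can be claimed.
    `disk` is an illustrative dict, e.g.:
      {"name": "naa.5000...", "partitions": 2, "raid_state": "foreign"}"""
    if disk.get("raid_state") == "foreign":
        # Controller sees someone else's old RAID metadata:
        # clear the foreign configuration in the BIOS first.
        return "clear foreign config"
    if disk.get("partitions", 0) > 0:
        # vSAN refuses disks with existing partitions: wipe them first.
        return "erase partitions"
    return "eligible for vSAN"

print(vsan_disk_action({"name": "naa.1", "partitions": 2, "raid_state": "ok"}))
# erase partitions
```

Our "replacement" disk would have failed both checks in turn – which is why it never showed up as claimable until we erased it.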
Would you forgive us if we found it funny when we went to install NSX, and one of the hosts was limited to 1500-byte packets, instead of the 1600-byte MTU NSX requires for VXLAN? I hope so, because it's the truth!
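That 1600-byte requirement comes from VXLAN's encapsulation overhead: the inner Ethernet frame is wrapped in outer IP, UDP and VXLAN headers, all of which count against the physical link's MTU. The arithmetic, sketched in Python (an IPv4 outer header is assumed; NSX's 1600 leaves headroom above the strict minimum):

```python
INNER_ETH = 14   # the encapsulated frame carries its own Ethernet header
OUTER_IPV4 = 20  # outer IPv4 header, no options
OUTER_UDP = 8    # outer UDP header
VXLAN_HDR = 8    # VXLAN header (per RFC 7348)

def vxlan_min_mtu(inner_mtu: int = 1500) -> int:
    """Smallest physical MTU that carries a full inner frame unfragmented."""
    return inner_mtu + INNER_ETH + OUTER_IPV4 + OUTER_UDP + VXLAN_HDR

print(vxlan_min_mtu())  # 1550 – NSX asks for 1600 to leave some headroom
```

So a host stuck at 1500 silently drops (or fragments) any near-full-size encapsulated frame – exactly the kind of partial breakage that's miserable to troubleshoot after the fact.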
We eventually got everything up and running as we wanted it; everything worked as it should. But, given how many systems are involved – and that they’re often managed by different teams – there are bound to be some things that need some special attention. And we all know hardware failures happen. What really stood out, though, is how it put the simplicity of building a VMware Cloud on AWS SDDC into perspective. Giving up some of the challenges of building out a data center from scratch sounds like something we’d all be willing to try. We’ll show you that process in a future article – but suffice it to say that we’ll be able to include plenty of pictures and still keep the article shorter than this one was.