vSAN Stretched Clusters

Recently I had a design session with a customer looking to set up a somewhat unusual DR scenario. They are a healthcare organization with two central datacenters located in the same town, about 10 miles apart.

Additionally, they had hundreds of clinics and offices scattered across the state with no DR plans in place. A quick solution was to set up vSphere Replication between these external locations and the primary DC at the HQ, but the customer wanted an additional layer of protection.

The proposal was to set up a vSAN stretched cluster between the primary and secondary DCs. Using nested fault domains, each DC would protect data locally at FTT=2 while using FTT=1 between the DCs.

This would give the customer a single target, the primary DC, for all of the external sites' replication, with that data protected locally at FTT=2. Because the stretched cluster also mirrors the data to the secondary DC, where it is again protected locally at FTT=2, the customer gets resiliency both within each site and between the sites.

This was only possible because of the low latency between the primary and secondary sites, so if you are considering something like this, please keep the vSAN stretched cluster requirements in mind:

Data Site to Data Site Network Latency

Data site to data site network refers to the communication between non-witness sites, in other words, sites that run virtual machines and hold virtual machine data. Latency or RTT (Round Trip Time) between sites hosting virtual machine objects should not be greater than 5msec (< 2.5msec one-way).

Finally, the customer elected to host the witness node at a non-production test facility in another city. Luckily, the networking and site latency requirements for the witness are much more lenient:

Data Site to Witness Network Latency

This refers to the communication between the non-witness sites and the witness site.

In most vSAN Stretched Cluster configurations, latency or RTT (Round Trip Time) between sites hosting VM objects and the witness node should not be greater than 200msec (100msec one-way).
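If you want to verify what latency you are actually getting before committing to a design like this, a quick check is to ping across the vSAN vmkernel interfaces from one of the ESXi hosts. A minimal sketch, assuming vmk1 is the vmkernel port carrying vSAN traffic and the two IP addresses are placeholders for a host at the remote data site and for the witness:

    # Measure RTT from this host's vSAN vmkernel interface to the other data site.
    # The average should stay comfortably under 5 ms.
    vmkping -I vmk1 -c 20 192.168.20.11

    # Measure RTT to the witness site; here the average just needs to stay under 200 ms.
    vmkping -I vmk1 -c 20 10.50.0.15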


In my next post I’ll go over some interesting behavior we discovered with vSphere Replication with regard to FTT settings and vSAN.


WannaCry and Disabling SMB1

If you've been living under a rock, you may not know anything about WannaCry. If that's the case, you may already be in trouble.

But if you're on top of things, then you know one of the major recommendations, beyond patching your systems (which you should be doing anyway), is to disable SMB1 across your environment.

Beyond breaking many things like printers, scanners, and folder shares for legacy applications, disabling SMB1 will also break AD authentication from the vCenter Server Appliance (VCSA). It's key to note that the Windows installation of vCenter is not affected by this.

VMware has a good KB article around this which calls out the requirements for SMB1:

https://kb.vmware.com/kb/2134063

Luckily, there is a workaround you can apply to the VCSA to enable SMB2 for authentication. The individual steps are below, followed by the same commands collected into a single snippet:

  • SSH into the VCSA
  • Enable the bash shell:
    • shell.set --enabled true
  • Enter the bash shell:
    • shell
  • Set the Smb2Enabled flag in Likewise's config:
    • /opt/likewise/bin/lwregshell set_value '[HKEY_THIS_MACHINE\Services\lwio\Parameters\Drivers\rdr]' Smb2Enabled 1
  • You can verify the value with the following command:
    • /opt/likewise/bin/lwregshell list_values '[HKEY_THIS_MACHINE\Services\lwio\Parameters\Drivers\rdr]'
  • Then restart Likewise:
    • /opt/likewise/bin/lwsm restart lwio
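For convenience, here are the same three Likewise commands collected into one block you can paste into the VCSA bash shell once you have enabled and entered it:

    # Run from the VCSA bash shell (after shell.set --enabled true and shell in appliancesh).
    # Enable SMB2 in the Likewise lwio rdr driver configuration:
    /opt/likewise/bin/lwregshell set_value '[HKEY_THIS_MACHINE\Services\lwio\Parameters\Drivers\rdr]' Smb2Enabled 1

    # Confirm the value was written:
    /opt/likewise/bin/lwregshell list_values '[HKEY_THIS_MACHINE\Services\lwio\Parameters\Drivers\rdr]'

    # Restart the lwio service so the change takes effect:
    /opt/likewise/bin/lwsm restart lwio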


Once you've made these changes, your VCSA will once again be able to authenticate against AD using SMB2.

The Security Field Today

Reading the headlines over the last few weeks, you would think the IT world is coming to an end.

And in some ways, things are scarier now than they have been for some time. IT departments are now reaping the consequences of decisions not to patch their environments or to stay on legacy operating systems.

At the end of the day, the release of WannaCry and its new variants, as well as the new 'Hera' and 'Athena' tools which compromise ALL Windows operating systems, is a wake-up call for all of IT. We must now assume that our systems are vulnerable at all times. Assume that a 0-day vulnerability is in the wild and targeting you.

How can you protect yourself in this environment? There are several ways, some cost-prohibitive and some operationally prohibitive:

  • Place every device behind a dedicated firewall (physical, virtual, or some combination)
  • Use PVLANs to isolate every virtual machine (this doesn't help as much with physical devices)
  • Create thousands of VLANs and thousands of ACLs on network devices

Creating thousands of VLANs simply does not scale once you have more than roughly 4,000 devices, since the 12-bit VLAN ID space tops out at 4,094 VLANs. Beyond that, creating and managing the ACLs necessary for this approach is operationally unfeasible.

Isolating every device behind a dedicated firewall is cost prohibitive and operationally prohibitive as well.

Really, only VMware NSX is positioned today to cope with this environment. By placing a stateful firewall at every vNIC, you can massively reduce the threat scope within your environment. Moving to a zero-trust stance means only specifically allowed traffic to VMs is valid; everything else is blocked. Combine this with identity-based firewalling and you shrink even that limited scope down to specific allowed users.

I won't post additional material because there are literally thousands of posts and pages on this, but if you have not looked at NSX, you really need to. It's expensive, no doubt, but today it's really the only option given the frankly terrifying security landscape we're all now part of.

vSAN 6.6 HomeLab Heads Up!!

After upgrading my homelab to vSAN 6.6, I noticed that my memory usage suddenly jumped to around 2.5 times normal consumption. Wondering if I had missed something, I went back through the release notes with more focus and tracked down this tidbit:

  • Object management improvements (LSOM File System) – Reduce compute overhead by using more memory. Optimize destaging by reducing cache/CPU thrashing.

There it is in plain English: vSAN 6.6 improves performance by utilizing more memory. For most folks in true production environments a few extra GB won't be noticed, but if your lab only has 64GB of RAM, anything extra gets noticed quickly.


Based on some calculations I've been able to run, it looks like RAM consumption roughly doubles compared to what it was before. The original formula for calculating RAM usage can be found in this VMware KB:

https://kb.vmware.com/kb/2113954
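To give a sense of what goes into that number, here is a rough sketch of how the estimate in that KB is structured: a fixed per-host base, plus a per-disk-group cost that scales with the size of the cache device, plus a small per-capacity-device cost. The constants below are made-up placeholders purely to show the shape of the calculation; take the real values (and the updated 6.6 ones once published) from the KB above.

    # Hypothetical sketch of the per-host vSAN memory estimate from KB 2113954.
    # All constants are placeholders; substitute the values from the KB.
    vsan_mem_estimate_mb() {
        local base_mb=$1            # fixed per-host overhead
        local dg_count=$2           # number of disk groups on the host
        local dg_base_mb=$3         # fixed overhead per disk group
        local cache_mb_per_gb=$4    # overhead per GB of cache device (hybrid and all-flash differ)
        local cache_gb=$5           # size of the cache device in GB
        local cap_disks=$6          # number of capacity devices on the host
        local cap_disk_mb=$7        # fixed overhead per capacity device
        echo $(( base_mb + dg_count * (dg_base_mb + cache_mb_per_gb * cache_gb) + cap_disks * cap_disk_mb ))
    }

    # Example with made-up constants: 1 disk group, 200 GB cache device, 2 capacity disks.
    vsan_mem_estimate_mb 5500 1 650 8 200 2 70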

I spoke to a few folks internally at VMware, and indeed the numbers in that KB will shortly be updated to reflect the new behavior. One thing to note based on my home tests: enabling or disabling deduplication, compression, and the new encryption feature had no impact on the memory consumption.

Make sure to plan accordingly when thinking about upgrading to vSAN 6.6.

vCloud Director SP 8.20 New Features – Edge Deployment

vCloud Director has supported the control of vCNS, and more recently NSX, for some time. The mainstay of this was the deployment and management of Edge appliances. vCNS and NSX edges have always shared the same basic functionality, but until very recently vCloud Director was unable to take full advantage of the newer NSX edge features. Instead it worked with NSX in compatibility mode, deploying either older-style vCNS edges or new NSX edges with limited functionality.

With vCD 8.20 the deployment of edges has not changed; you'll still get the same basic deployment options, though you will notice a few new options and some additional text compared to previous Edge deployments.

Once the edge is deployed, vCD can control and manage it as before. What's new is the ability to upgrade the edge to advanced functionality. The same edge appliance remains deployed, but it now communicates with vCD using the current generation of NSX APIs, opening up a new set of features and functions.

To upgrade the Edge to this new functionality, right-click the edge and select 'Convert to Advanced Gateway.' In my testing this has not caused any disruption to existing services, but use caution when performing this upgrade.

Managing the edge still requires you to right-click the edge and select 'Edge Gateway Services', but now a new HTML5 control menu will open.

The new UI will look familiar if you’ve read my post on the new distributed firewall controls but let’s take a look at the basic edge configuration.

Clicking on the ‘Edge Settings’ will bring up the deployment settings on the edge:

One nice thing VMware has added is sanity checking of input data. Previously, when working with Edges there were no data checks until you attempted to submit a change. This often left you attempting the change multiple times until you figured out where your error was and corrected it.

I purposely input some bad data so you can get an idea of how this looks and works. 

Hopefully you see how the new edge functionality works from a basic deployment standpoint. We’ll continue with the Edge Logical Firewall in our next post.

 

vCloud Director SP 8.20 New Features – Distributed Firewall

With the release of vCloud Director 8.20, VMware has stepped up its game and offered a truly impressive set of features.

vCloud Director has been around for some time, so I won't dive into the existing feature set; instead I'll work on a series of posts going over each new feature and how it works. First up: the Distributed Firewall.

Up until this point, the NSX Distributed Firewall has only been accessible via the API or the vSphere WebClient. In a multi-tenant environment, giving tenants access to the WebClient was not possible given the lack of Role Based Access Control (RBAC) within the NSX UI. This forced Service Providers to build custom UIs making back-end API calls, time consuming to say the least.
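For context, this is roughly what one of those raw back-end calls looks like against the NSX-v API. The manager hostname and credentials below are placeholders for your own environment:

    # Pull the current Distributed Firewall configuration directly from NSX Manager.
    # Returns the full DFW rule set as XML, organized into sections.
    curl -k -u 'admin:VMware1!' \
        https://nsxmgr.lab.local/api/4.0/firewall/globalroot-0/config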

Now vCloud Director has taken the hard work out of the equation by providing a fully multi-tenant HTML5 interface, built on VMware's Clarity design standard, with full RBAC baked in. Let's take a look.

The DFW UI is accessed via the OrgVDC, as seen below. The key thing to remember is that by having the DFW managed under each OrgVDC, you are able to give each VDC control over its own firewall. Combined with vCloud's existing RBAC, this gives a huge amount of granular control.

Selecting ‘Manage Firewall’ will load the HTML5 UI in a separate window. Initially the DFW is disabled for the OrgVDC and will need to be enabled:

The UI will look very familiar to anyone who has used the WebClient UI with a few caveats.

The functionality of the DFW is exactly the same as via the WebClient, including the creation of IP and MAC sets; however, there are three key exceptions:

  1. Unable to interact with Service Composer
  2. Unable to create Object Groupings (more on this later)
  3. Unable to create new service groupings

The last bit of new information is the 'Applied To' section. As vCloud is fully multi-tenant, this part of the NSX UI must be as well: you are only able to apply DFW rules to objects under the control of your OrgVDCs. Your options for applying rules are:

  1. Edges
  2. OrgVDC Networks
  3. Virtual Machines
  4. OrgVDCs

Now let’s take a moment and create a rule so you can see how it looks within vSphere.

Clicking the '+' button brings up a new rule above our default rule. For the sake of this post, I'm simply creating a duplicate Allow Any/Any rule.

After labeling the rule, I select the Applied To field and select my OrgVDC:

As with the WebClient you must commit all rules and save them before they are applied. The discard option is also present in case you need to roll back:

Finally we have the completed live rule:

Looking into the vSphere WebClient we see the following entry:

A unique section is created for each OrgVDC with a live DFW; the name of the section corresponds to the UUID of our OrgVDC. As we chose the OrgVDC as our target, a grouping object was automatically created encompassing all VMs within our OrgVDC, with a label also matching the OrgVDC UUID.

Looking in the hosts and cluster section we’ll see that it nicely matches up as well:


As you can see, a fully multi-tenant HTML5 UI allowing per-OrgVDC control of the distributed firewall is a huge step forward. Customers can now self-manage micro-segmentation within their multi-tenant vCloud environment in a secure and scalable fashion, greatly enhancing the value of vCloud Director and NSX.


Fun with vRA7 and vCloud Director

One of the VMware technologies I simply could not wrap my mind around was vRealize Automation (formerly vCAC). I decided to buckle down, get it up and running in my lab, and finally try to tackle the suite.

I had no problem getting the vRA 7 appliance to deploy, which speaks volumes compared to the original vCAC 5 deployment I did almost two years ago (go VMware). One place I kept stumbling, however, was that I couldn't get vCloud Director to register as an endpoint and collect any data. I went over everything I could think of before I finally noticed this in the log:


Reading into the error, I noticed vRA was attempting to reach vCloud Director at https://vcd-lab.justavmwblog.com/api/api/versions. Notice the second '/api', which was clearly our issue. Suspecting this was not a vRA problem, I took a look at the public addresses section of vCloud, and look what I discovered:

Notice the API URL has an extra /api appended to the end of the FQDN. To be honest, I am not sure how that got there, but after removing the /api I was left with the following:

Upon applying the settings, vRA was able to connect without issue using the following endpoint configuration:

So the moral of the story is to always check the logs and see what API endpoint vRA is trying to use.
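If you want to sanity-check the address yourself before pointing vRA at it, the versions endpoint vRA calls first is unauthenticated, so you can hit it with curl from any machine (the hostname below is just my lab's):

    # A correctly configured public API address returns an XML list of supported
    # API versions; a doubled /api/api/versions path returns an error instead.
    curl -k https://vcd-lab.justavmwblog.com/api/versions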


HTML5 Client not auto-refreshing

Ran into an interesting situation today, and it looks like I was not alone. In my homelab I was using the HTML5 client and noticed that it was not auto-refreshing as it is supposed to. I poked around a bit to make sure there wasn't anything wrong with my system, but it turns out it was something much simpler.

It turns out AdBlock Plus was the culprit. As soon as I disabled it for the *.justavmwblog.com domain, everything worked great. A few hours later I ran into a user on /r/vmware with the same issue; as soon as he disabled AdBlock Plus, everything worked properly.

I hit up the PM for the HTML5 client, and it was something they had not tested with before. I think this was a bit of an oversight, and they are going to make sure to add some checks going forward. That being said, as long as you trust your VMware infrastructure URLs and trust VMware not to serve any ads at any point, I think you can safely whitelist the URLs and never run into this issue again.


Intel NUC and vSAN Outage

As I've been slowly adding components to my NUC lab, it has become painfully clear that the single gigabit interface on each of my NUCs leads to problems. This is exacerbated by running hybrid vSAN as the shared storage system, since it can be a bit of a bandwidth hog.

So I needed a way to keep vSAN from gobbling up all the bandwidth on my hosts and causing downtime. I accomplished this using the network resource pools (part of Network I/O Control) within vSphere. Specifically, I set reservations for virtual machine, management, and vMotion traffic as well as for vSAN. The key, however, was to set a limit on the bandwidth vSAN could use, which keeps resyncs or vMotions triggered by maintenance mode from flooding my single interface and causing downtime.

vSAN Goodness and Lessons Learned

I've been working on my homelab for the last few days and have had a hard time getting vSAN working. In the end it was a combination of misconfigurations in my lab and me trying to be 'creative.' Moral of the story: don't be creative.

So after I got everything online, my 2-year-old thought it would be fun to press the bright blue LED power button on one of my NUCs. Thankfully I had FTT set to 1 on my cluster, so the other two NUCs were able to handle the failure and I never lost my VMs (yay vSAN).

Once I turned the third NUC back on, however, it kept showing as degraded no matter what I did. It turns out the way to fix this was to change the FTT policy to 0 and apply the new storage policy to all VMs in the environment.

Syncing the VM data took a while, but once it completed I was able to change the FTT policy back to 1 and apply it to all the VMs again.

I had roughly 200GB of data to resync, but esxi01 now showed as 'reconfiguring' rather than 'degraded', and I was able to watch all the VMs syncing via the vSAN health monitor (or from the CLI, as sketched below).
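If you prefer the command line, the Ruby vSphere Console (RVC) bundled with vCenter also has a resync dashboard you can watch while this happens. A rough sketch, where the vCenter address, datacenter, and cluster names are placeholders for your own inventory:

    # Connect RVC to vCenter (run from the VCSA shell, or any machine with RVC installed).
    rvc administrator@vsphere.local@vcsa.lab.local

    # Inside RVC, point the dashboard at the cluster to see which objects are still
    # resyncing and how much data is left to move.
    vsan.resync_dashboard /vcsa.lab.local/Datacenter/computers/Cluster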

Now, if I had a fourth node in the cluster this would not have happened, but alas I can only support three at the moment, so this is a compromise I can live with.