During my discussions with multiple customers over the last few months, one major point kept coming up: choosing between a Distributed and a Simple installation architecture. More often than not, I have seen customers choose a deployment model not based on careful consideration but on general belief. This blog tries to shed light on the considerations that should be weighed before deciding between a Distributed and a Simple Architecture. Any Solution Architect or Virtualization Architect should find it useful when discussing this point.
The general belief:
I have met enough customers to safely say that more than 90% of the time a customer asks us for a Distributed Architecture, irrespective of the size of the environment or the total uptime required from the solution. I agree that it undoubtedly has its own benefits, such as removing a single point of failure. But the question is: should we always choose a Distributed Architecture simply because we can, or for the stated reason, or are there other careful considerations involved?
My point is: it depends, and you may be surprised how many times I would recommend a simple install over a distributed environment.
Before you dismiss this as baseless, let us look at the reasons.
First point of consideration:
In most cases, the management components do not directly affect the running workload, so a downtime of the management components should not have a direct impact on your existing environment. During that time you will not be able to manage the environment or deploy anything new, but this in no way impacts the current SLA of the already running workload. This is fine in most cases; however, if someone runs a public cloud and the main management portal goes down, it obviously has a bigger business impact and should be avoided. This fact is more prominent in VMware environments. For example, let's consider the following situations.
- vCenter or PSC or both go down – Existing VMs keep running; no new deployment or management is possible. With vRA on top of vCenter, no new deployment at the cloud level is possible either, but the cloud portal still works. Features like HA and distributed networking all continue working. Remember, this is the main management component of a VMware virtualization environment.
- vRealize Automation components go down – The cloud portal is not available, but existing workloads keep running. You can still SSH or RDP to the VMs hosted in the cloud. End-user operations are not hampered.
Here I am considering only these two cases, as they are responsible for building the virtualized and cloud environments respectively. From the above, I can safely assume that the availability of the management components is important but does not immediately affect the already running workload.
Second point of consideration:
Next, let’s explore both these deployment Architectures and their general effect more closely.
Distributed Architecture: First, let's check the implications of a Distributed Architecture more closely.
Most of the time, this architecture is chosen for the following two reasons:
- To remove a single point of failure (increase availability)
- To support a larger environment (if a single node can support, say, 10,000 elements then 2 nodes will support 20,000 elements – load balancing)
The second point often does not apply: in very few cases will you find a customer exceeding the technical limits of a product.
For example, how many times have you actually seen a single vCenter Server supporting 1000 ESXi hosts and 15,000 powered-off VMs in production? Or, for that matter, a single vCenter appliance taking care of 10,000 powered-on VMs? I am yet to see one. Have you ever seen a single ESXi host running 1024 VMs or 4096 vCPUs? Have you ever seen a customer actually touching or nearing the technical limits of a VMware product? I doubt it, and would love to see one.
Of course, if you do have an environment that big, then a Distributed Architecture is definitely THE WAY for you.
Coming back to the point: it seems the majority of the time a Distributed Architecture is chosen to remove a single point of failure and thus increase availability.
So let's consider a fully distributed architecture for a cloud environment built on vRealize Automation and see the effects it has on the environment.
For a fully distributed architecture of vRealize Automation, we need the following number of components:
- Deployed vRA appliance – 2+
- IaaS web server – 2+
- IaaS Manager Service Server – 2+
- IaaS DEM Server – 1+
- Agent Server – 1+
The number beside the component denotes the minimum number of nodes required for Distributed Architecture.
So a minimum of 8 servers is required for vRealize Automation alone (for HA of the DEM and Agent roles you need more nodes, or you must overlap roles). For the database you also need the following:
- MSSQL Server in HA mode – 2+
On top of that, if you consider a distributed vCenter environment, then you have the following requirements:
- PSC – 2+
- vCenter – 2+
So a total of 14+ VMs. Of course, I am stating the extreme case here; in all probability an actual production environment will have fewer servers with overlapping roles. But if you have a really big environment, then this is the number.
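To make the tally concrete, here is a quick sketch adding up the minimum node counts listed above (the names are labels for this illustration, and the counts are the stated minimums, not product requirements beyond what the post lists):

```python
# Minimum node counts from the lists above for a fully distributed
# vRealize Automation + vCenter deployment (illustrative labels).
vra_nodes = {
    "vRA appliance": 2,
    "IaaS web server": 2,
    "IaaS Manager Service server": 2,
    "IaaS DEM server": 1,
    "Agent server": 1,
}
other_nodes = {
    "MSSQL Server (HA pair)": 2,
    "PSC": 2,
    "vCenter": 2,
}

vra_only = sum(vra_nodes.values())
total = vra_only + sum(other_nodes.values())
print(vra_only)  # 8  -> the "minimum of 8 servers" for vRA alone
print(total)     # 14 -> the "total of 14+ VMs"
```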
All these components will have a Load Balancer in front. Architecturally, the vCenter environment is fronted by its own LB tier, and the vRealize Automation environment follows the same pattern.
The direct effect:
The placement of the Load Balancer in this architecture has a significant effect on the environment. Let's consider a physical load balancer in a traditional environment, i.e. somewhere upstream after the firewall (at least 2 or 3 hops away from the host on which the VM resides).
Now let's check how a normal user request for a VM is handled. The request comes to the front Load Balancer (LB) and, based on its decision, goes to one of the vRealize Automation appliances. From there it goes back out to the LB and comes in to an IaaS web server. The request then goes out to the LB again, a Manager Service server is chosen, and finally it goes to a DEM. The same story applies when the VM creation request goes out to vCenter: it reaches the LB to choose a PSC and then a vCenter node.
Considering all these round trips to the LB, think how many extra hops take place simply because of the nature of the deployment architecture, and what those extra hops do to the overall response time.
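As a back-of-the-envelope sketch (the tier list follows the request flow above, but the per-hop latency is an assumed number for illustration, not a measurement), the extra round trips can be counted like this:

```python
# Each tier behind the LB adds two extra hops: out to the LB, then
# back in to the node the LB selects. Tier names follow the request
# flow described above; the latency figure is an assumption.
TIERS_BEHIND_LB = ["vRA appliance", "IaaS web", "IaaS Manager", "PSC", "vCenter"]
HOPS_PER_TIER = 2

def extra_lb_cost(per_hop_latency_ms=1.5):
    """Extra hops and assumed added latency per provisioning request."""
    hops = HOPS_PER_TIER * len(TIERS_BEHIND_LB)
    return hops, hops * per_hop_latency_ms

hops, added_ms = extra_lb_cost()
print(hops, added_ms)  # 10 extra hops, 15.0 ms added (at the assumed latency)
```

A simple single-node deployment avoids all of these detours, which is exactly the performance difference discussed below.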
Simple deployment architecture:
Now let's consider the effects of a simple deployment architecture. For our discussion, let's assume the number of supported elements is well within the capability of a single node. In all probability, a simple architecture will have a single node for every component of the solution.
So now a request does not have to make all those round trips to the LB. For obvious reasons, the response time should be lower than in a fully distributed architecture, so you get better performance.
But the flip side is that you now have a single point of failure. So let's consider the different availability options to increase the overall uptime of a simple single-node deployment architecture.
- The first line of defense is the underlying High Availability of the hypervisor with the VM monitoring option. Typically, a physical node failure is sensed within 12 seconds and a restart takes place by 15 seconds. For the sake of discussion, let's assume the OS and the application inside the VM come up within 5 minutes. Considering a node failover happens once a month, total downtime is 5 minutes out of 43,200 minutes (a 30-day month). That means you get an uptime of 99.988%. The same applies to a hung VM or hung application, as we are monitoring at the VM level as well.
- The second line of defense is snapshots: if the OS or application gets corrupted, we simply revert to a snapshot. First, let's assume an external database is used; then there is not much state in the original VM, so recovering from a snapshot is sufficient, and say it requires 20 seconds. Total uptime is now 99.999%. But if an internal database is used, then simply reverting to an earlier snapshot is not enough: we need to revert to the snapshot to recover the OS and then restore the database from backup (which means we need a regular database backup mechanism). This takes more time, say 10 minutes, in which case your uptime is 99.977%.
- The third line of defense is backups. If everything gets corrupted, you need to restore the entire appliance from a backup, which, say, takes 30 minutes. In this case you get 99.931% uptime in a month.
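The uptime figures in the three scenarios above follow from one formula over a 30-day month; a quick calculation reproduces them:

```python
# Monthly uptime percentage for a given downtime, over a 30-day month.
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes

def monthly_uptime(downtime_minutes):
    """Uptime % rounded to three decimals, as quoted in the scenarios."""
    return round(100 * (1 - downtime_minutes / MINUTES_PER_MONTH), 3)

print(monthly_uptime(5))        # 99.988 - HA restart, 5 min recovery
print(monthly_uptime(20 / 60))  # 99.999 - snapshot revert, external DB
print(monthly_uptime(10))       # 99.977 - snapshot + internal DB restore
print(monthly_uptime(30))       # 99.931 - full restore from backup
```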
So the final choice comes down to required uptime. If the business can sustain 99.931% uptime for the management components (the worst-case scenario) and the total supported elements are well within the product limits, then I will certainly suggest a simple install, for the following reasons:
- Simpler to manage
- Simpler to update
- Will perform better (compared to a fully distributed environment)
- Better response time
- Less complex
In the end I would say: do not choose a fully distributed architecture simply because you can. Consider all the points above. Choosing a simple single-node deployment architecture is not so bad after all.
Another point to note: if I do need to build a fully distributed environment, I would prefer a virtual load balancer like NSX Edge, which sits much closer to the VMs than a physical one in a traditional architecture, thus reducing round-trip time.
I am simplifying an already complex topic, and the final answer is: it all depends. Every environment and requirement is different and there is no single rule to follow, but do not discard a simple deployment architecture because of the "so-called" reasons. Consider it seriously; it may suit your environment far better than a distributed architecture. Until then, happy designing, and let me know your viewpoints.
Note: The above discussion is from a virtualization/cloud perspective. It does not apply to a traditional physical datacenter, where recovery time for a physical server failure is much higher and you cannot ensure the same SLA.
The post Virtualization: Make an informed decision: Distributed vs Simple Architecture appeared first on VMware Cloud Management.