Apr 03, 2017
 

At the VMworld 2016 Barcelona keynote, CTO Ray O’Farrell proudly presented the performance improvements in vCenter 6.5. He showed the following slide:

Slide from Ray O’Farrell’s keynote at VMworld 2016 Barcelona, showing 2x improvement in scale from 6.0 to 6.5 and 6x improvement in throughput from 5.5 to 6.5.

As a senior performance engineer who focuses on vCenter, and as one of the presenters of VMworld Session INF8108 (listed in the top-right corner of the slide above), I have received a number of questions regarding the “6x” and “2x scale” labels in the slide above. This blog is an attempt to explain these numbers by describing (at a high level) the performance improvements for vCenter in 6.5. I will focus specifically on the vCenter Appliance in this post.

6x and 2x

Let’s start by explaining the “6x” and the “2x” from the keynote slide.

  1. 6x: We measure performance in operations per second, where operations include powering on VMs, clones, VMotions, etc. More details are presented below under “Benchmarking Details.” The “6x” refers to a sixfold increase in operations per second from vSphere 5.5 to 6.5:
    1. In 5.5, vCenter was capable of approximately 10 operations per second in our testbed.
    2. In 6.0, vCenter was capable of approximately 30 operations per second in our testbed.
    3. In 6.5 vCenter can now perform more than 60 operations per second in our testbed. With faster hardware, vCenter can achieve over 70 operations per second in our testbed.
  2. 2x: The 2x improvement refers to a change in the supported limits for vCenter. The number of hosts has doubled, and the number of VMs has more than doubled:
    1. The supported limits for a single instance of vCenter 6.0 are 1000 hosts, 15000 registered VMs, and 10000 powered-on VMs.
    2. The supported limits for a single instance of vCenter 6.5 are 2000 hosts, 35000 registered VMs, and 25000 powered-on VMs.

Not only are the supported limits higher in 6.5, but the resources required to support such a limit are dramatically reduced.

What does this mean to you as a customer?

The numbers above represent what we have measured in our labs. Clearly, configurations will vary from customer to customer, and observed improvements will differ. In this section, I will give some examples of what we have observed to illustrate the sorts of gains a customer may experience.

PowerOn VM

Before powering on a VM, DRS must collect some information and determine a host for the VM. In addition, both the vCenter server and the ESX host must exchange some information to confirm that the powerOn has succeeded and must record the latest configuration of the VM. By a series of optimizations in DRS related to choosing hosts, and by a large number of code optimizations to reduce CPU usage and reduce critical section time, we have seen improvements of up to 3x for individual powerOns in a DRS cluster. We give an example in the figure below, in which we show the powerOn latency (normalized to the vSphere 6.0 latency, lower is better).

Example powerOn latency for 6.0 vs. 6.5, normalized to 6.0. Lower is better. 6.5 outperforms 6.0. The gains are due primarily to improvements in DRS and general code optimizations.

The benefits are likely to be most prominent in large clusters (i.e., 64 hosts and 8000 VMs in a cluster), although all cluster sizes will benefit from these optimizations.
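
If you would like a rough feel for powerOn latency in your own environment, a simple wall-clock measurement of a single powerOn issued through vCenter is enough for a before/after comparison. The sketch below assumes the open-source govc CLI is installed; the vCenter address, credentials, and VM name are placeholders, and this is only a quick approximation, not how vcbench measures latency.

# Placeholder connection details for a lab vCenter; adjust to your environment.
export GOVC_URL='https://vcenter.example.com/sdk'
export GOVC_USERNAME='administrator@vsphere.local'
export GOVC_PASSWORD='secret'
export GOVC_INSECURE=1    # skip TLS verification in a lab

# Wall-clock latency of a single powerOn task issued through vCenter (and DRS).
time govc vm.power -on test-vm-01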

Clone VM

Prior to cloning a VM, vCenter does a series of calls to check compatibility of the VM on the destination host, and it also validates the input parameters to the clone. The bulk of the latency for a clone is typically in the disk subsystem of the vSphere hosts. For our testing, we use small VMs (as described below) to allow us to focus on the vCenter portion of latency. In our tests, due to efficiency improvements in the compatibility checks and in the validation steps, we see up to 30% improvement in clone latency, as seen in the figure below, which depicts normalized clone latency for one of our tests.

Example clone VM latency for 6.0 vs. 6.5, normalized to 6.0. Lower is better. 6.5 outperforms 6.0. The gains are in part due to code improvements where we determine which VMs can run on which hosts.

These gains will be most pronounced when the inventory is large (several thousand VMs) or when the VMs to be cloned are small (i.e., < 16GB). For larger VMs, the latency to copy the VM over the network and the latency to write the VM to disk will dominate over the vCenter latency.

VMotion VM

For a VMotion of a large VM, the costs of pre-copying memory pages and then transferring dirty pages typically dominate. With small VMs (4GB or less), the costs imposed by vCenter are similar to those in the clone operation: checking compatibility of the VM with the new host, whether it be the datastore, the network, or the processor. In our tests, we see approximately 15% improvement in VMotion latency, as shown here:

Example VMotion latency for 6.0 vs. 6.5, normalized to 6.0. Lower is better. 6.5 is slightly better than 6.0. The gains are due in part to general code optimizations in the vCenter server.

As with clone, the bulk of these improvements is from a large number of code optimizations to improve CPU and memory efficiency in vCenter. Similar to clone, the improvements are most pronounced with large numbers of VMs or when the VMs are less than 4GB.

Reconfigure VM

Our reconfiguration operation changes the memory share settings for a VM. This requires a communication with the vSphere host followed by updates to the vCenter database to store new settings. While there have been improvements along each of these code paths, the overall latency is similar from 6.0 to 6.5, as shown in the figure below.

Example latency for VM reconfigure task for 6.0 vs. 6.5, normalized to 6.0. Lower is better. The performance is approximately the same from 6.0 to 6.5 (the difference is within experimental error).

Note that the slight increase in 6.5 is within the experimental error for our setup, so for this particular test, the reconfigure operation is basically the same from 6.0 to 6.5.

The previous data were for a sampling of operations, but our efficiency improvements should result in speedups for most operations, whether called through the UI or through our APIs.

Resource Usage and Restart Time of vCenter Server

In addition to the sorts of gains shown above, the improvements from 6.0 to 6.5 have also dramatically reduced the resource usage of the vCenter server. These improvements are described in more detail below, and we give one example here. For an environment in our lab consisting of a single vCenter server managing 64 hosts and 8,000 VMs, the overall vCenter server resource usage dropped from 27GB down to 14GB. The drop is primarily due to the removal of the Inventory Service and optimizations in the core vpxd process of vCenter (especially with respect to DRS).

In our labs, the optimizations described below have also reduced the restart time of vCenter (the time from when the machine hosting vCenter is booted until vCenter can accept API or UI requests). The impact depends on the extensions installed and the amount of data to be loaded at startup by the web client (in the case of accepting UI requests), but we have seen improvements greater than 3x in our labs, and anecdotal evidence from the field suggests larger improvements.
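
If you want to reproduce a rough version of this restart-time measurement in your own lab, you can reboot the appliance and poll until the UI endpoint answers. The loop below is a minimal sketch using curl and a placeholder hostname; it simply records how many seconds pass before the vSphere Web Client URL starts returning a successful response.

# Start timing right after the appliance reboot is initiated.
start=$(date +%s)
# Poll the Web Client URL (placeholder hostname); -k skips certificate checks in a lab.
until curl -k -s -o /dev/null --fail https://vcenter.example.com/vsphere-client/; do
  sleep 5
done
echo "vCenter UI reachable after $(( $(date +%s) - start )) seconds"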

Brief Deep Dive into Improvements

The previous section has shown the types of improvements one might expect over different kinds of operations in vCenter. In this section, we briefly describe some of the code changes that resulted in these improvements.

“Rocks” and “Pebbles”

The changes from 6.0 to 6.5 can be divided into large, architectural-type changes (so-called “Rocks” because of the large size of the changes) and a large number of smaller optimizations (so-called “Pebbles” because the changes themselves are smaller).

Rocks

There are three main “Rocks” that have led to performance improvements from 6.0 to 6.5:

  1. Removal of Inventory Service
  2. Significant optimizations to CPU and memory usage in DRS, specifically with respect to snapshotting inventory for compatibility checks and initial placement of VMs upon powerOn.
  3. Change from SLES11 (6.0) to PhotonOS (6.5).

Inventory Service. The Inventory Service was originally added to vCenter in the 5.0 release in order to provide a caching layer for inventory data. Clients of vCenter (like the web client) could then retrieve data from the Inventory Service instead of going to the vpxd process within vCenter. Second- and third-party solutions (e.g., vROps) could also store data in the Inventory Service so that the web client could easily retrieve it. The Inventory Service was implemented in Java and was backed by an embedded database. While this approach had some benefits with respect to reducing load on vCenter, the cost of maintaining this cache was far higher than its benefits. In particular, in the largest supported setups of vCenter, the memory cost of this service was nearly 16GB, and could be even larger in customer deployments. Maintaining the embedded database also required significant disk IO (nearly doubling the overall IO in vCenter) and CPU. In 6.5, we have removed the Inventory Service and instead employ a new design that retrieves inventory data efficiently and directly from vpxd. With the significant improvements to the vpxd process, this approach is much faster than using the Inventory Service. Moreover, it saves nearly 16GB in our largest setups. Finally, removing the Inventory Service also leads to faster restart times for the vCenter server, since the Inventory Service no longer has to synchronize its data with the core vpxd process of vCenter server before vCenter has finished starting up. In our test labs, the restart times (the time from reboot until vCenter can accept client requests) improved by up to 3x, from a few minutes down to around one minute.

DRS. Our performance results suggested that DRS adds some overhead when computing initial placement and ongoing placement of VMs. When doing this computation, DRS needs to retrieve the current state of the inventory. A significant effort was undertaken in 6.5 to reduce this overhead. The sizes of the inventory snapshots were reduced, and the overhead of taking such a snapshot was dramatically reduced. One additional source of overhead is the compatibility checks required to determine whether a VM is able to run on a given host. This code was dramatically simplified while still preserving the appropriate load-balancing capabilities of DRS.
The combination of simplifying DRS and removing Inventory Service resulted in significant resource usage reductions. To give a concrete example, in our labs, to support the maximum supported inventory of a 6.0 setup (1000 hosts and 15000 registered VMs) required approximately 27GB, while the same size inventory required only 14GB in 6.5.

PhotonOS. The final “Rock” that I will describe is the change from SLES11 to PhotonOS. PhotonOS uses a much more recent version of the Linux Kernel (4.4 vs. 3.0 for SLES11). With much newer libraries, and with a slimmer set of default modules installed in the base image, PhotonOS has proven to be a more efficient guest OS for the vCenter Server Appliance. In addition to these changes, we have also tuned some settings that have given us performance improvements in our labs (for example, changing some of the default NUMA parameters and ensuring that we are using the pre-emptive kernel).
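
If you are curious which kernel your own appliance is running, you can check from the appliance shell. The commands below are a small sketch and assume shell access to the vCenter Server Appliance; whether /proc/config.gz is present depends on the kernel build.

# Show the running kernel version (PhotonOS in 6.5 vs. SLES11 in 6.0).
uname -r
# Inspect the preemption model compiled into the kernel, if /proc/config.gz is available.
zcat /proc/config.gz | grep -i preempt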

Pebbles

The “Pebbles” are really an accumulation of thousands of smaller changes that together improve CPU usage, memory usage, and database utilization. Three examples of such “Pebbles” are as follows:

  1. Code optimizations
  2. Locking improvements
  3. Database improvements

Code optimizations. Some of the code optimizations above include low-level optimizations like replacing strings with integers or refactoring code to significantly reduce the number of mallocs and frees. The vast majority of cycles used by the vpxd process are typically spent in malloc or in string manipulations (for example, serializing data responses from hosts). By reducing these overheads, we have significantly reduced the CPU and memory resources used to manage our largest test setups.

Locking improvements. Some of the locking improvements include reducing the length of critical sections and also restructuring code to enable us to remove some coarse-grained locks. For example, we have isolated situations in which an operation may have required consistent state for a cluster, its hosts, and all of its VMs, and reduced the scope so that only VM-level or host-level locks are required. These optimizations require careful reasoning about the code, but ultimately significantly improve concurrency. An additional set of improvements involved simplifying the locking primitives themselves so that they are faster to acquire and release. These sorts of changes also improve concurrency. Improving concurrency not only improves performance, but it also better enables us to take advantage of newer hardware with more cores: without such improvements, software would be a bottleneck, and the extra cores would otherwise be idle.

Database improvements. The vCenter server stores configuration and statistics data in the database. Any changes to the VM, host, or cluster configuration that occur as a result of an operation (for example, powering on a VM) must be persisted to the database. We have made an active effort to reduce the amount of data that must be stored in the database (for example, storing it on the host instead). By reducing this data, we reduce the network traffic between vCenter server and the hosts, because less data is transferred, and we also reduce disk traffic by the database.

A side benefit of using the vCenter server appliance is that the database (Postgres) is embedded in the appliance. As a result, the latency between the vpxd service and the database is minimized, resulting in performance improvements relative to using a remote database (as is typically used in vCenter Windows installations). This improvement can be 10% or more in environments with lots of operations being performed.

Benchmarking Details

Our benchmark results are based on our vcbench workload generator. A more complete description of vcbench is given in VMware vCenter Server Performance and Best Practices, but briefly, vcbench consists of a Java client that sends management operations to vCenter server. Operations include (but are not limited to) powering on VMs, cloning VMs, migrating VMs, VMotioning VMs, reconfiguring VMs, registering VMs, and snapshotting VMs. The Java client opens up tens to hundreds of concurrent sessions to the vCenter server and issues tasks on each of these sessions. A graphical depiction is given in the “VCBench Architecture” slide, above.

The performance of vcbench is typically given in terms of throughput, for example, operations per second. This number represents the number of management operations that vCenter can complete per second. To compute this value, we run vcbench for a specified amount of time (for example, several hours) and then measure how many operations have completed. We then divide by the runtime of the test. For example, 70 operations per second is 4200 operations per minute, or more than 250,000 operations in an hour. We run anywhere from 32 to 512 concurrent sessions connected to vCenter.
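
The arithmetic is straightforward. As a quick illustration of how a throughput number is derived from a run, here is a small sketch using the figures above:

# Throughput = completed operations / runtime of the test.
ops=252000            # operations completed during the run
runtime_seconds=3600  # one-hour run
echo "$(( ops / runtime_seconds )) operations per second"   # -> 70
echo "$(( 70 * 60 )) operations per minute"                 # -> 4200
echo "$(( 70 * 3600 )) operations per hour"                 # -> 252000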

The throughput measured by vcbench is dependent on the types of operations in the workload mix. We have tried to model our workload mix based on the frequency of operations in customer setups. In such setups, often power operations and provisioning operations (e.g., cloning VMs) are prevalent.

Finally, the throughput measured by vcbench also depends on hardware and other vCenter settings. For example, in our “benchmarking” runs, we run with level 1 statistics. We also do performance measurements with higher statistics levels, but our baseline measurements use level 1. In addition, we use SSDs to ensure that the disk is not a bottleneck, and we make sure to have sufficient CPU and memory so that they are not resource-constrained. By removing hardware as a constraint, we are able to find and fix bottlenecks in our software. Our benchmarking runs also typically do not have extensions like vROps or NSX connected to vCenter. We do additional runs with these extensions installed so that we can understand their impact and provide guidance to customers, but they are not part of the base performance reports.

Conclusion

vCenter 6.5 can support 2x the inventory size as vCenter 6.0. Moreover, vCenter 6.5 provides dramatically higher throughput than 6.0, and can manage the same environment size with less CPU and memory. In this note, I have tried to give some details regarding the source of these improvements. The gains are due to a number of significant architectural improvements (like removing the Inventory Service caching layer) as well as a great deal of low-level code optimizations (for example, reducing memory allocations and shortening critical sections). I have also provided some details about our benchmarking methodology as well as the hardware and software configuration.

Acknowledgments

The vCenter improvements described in this blog are the results of thousands of person-hours from vCenter developers, performance engineers, and others throughout VMware. My heartfelt thanks goes out to them for making this happen.

The post vCenter 6.5 Performance: what does 6x mean? appeared first on VMware VROOM! Blog.

Updated Version of the Deployment Guide for Hadoop on VMware vSphere

Jan 31, 2017
 

The new Deployment Guide for Virtualizing Hadoop on VMware vSphere describes the technical choices for running Hadoop and Spark-based applications in virtual machines on vSphere. Innovative technologies and design approaches are appearing regularly in the big data market; the pace of innovation certainly has not slowed down!

A prime example of this innovation is the rapid growth in Spark adoption for serious enterprise work over the past year or so, overtaking MapReduce as the dominant way of building big data applications. Spark holds out the promise of faster application execution times and easier APIs for building your application. A lot of innovation work is now going into optimizing the streaming of large quantities of data into Spark, with an eye to the large data feeds that will appear from connected cars and other devices in the near future. This new version of the VMware Deployment Guide for Hadoop on vSphere brings the information up to date with developments in the Spark and YARN (“Yet Another Resource Negotiator”) areas.

The YARN technology is the general name for the updated job scheduling and resource management functions that have now become mainstream in Hadoop deployments. The older MapReduce-centric style, once the central resource management scheduler in Hadoop, is now relegated to just another programming framework. MapReduce is still used for Extract-Transform-Load (ETL) jobs, running in batch mode on a common resource management and scheduling platform (YARN) – but now, to a large extent, MapReduce is no longer the dominant paradigm for building applications. Spark is seen as much more suited to interactive queries and applications. Spark also runs as an example of another application framework on YARN, and that combination is popular in enterprises today – and so it is the focus of much of our testing currently, as you will see. Spark runs in standalone mode outside of the YARN resource manager context too, but that option is out of scope for the current Deployment Guide, as we see that less often within enterprises today. Of course, that may change in the future.

The previous (2013) version of the Hadoop Deployment Guide for vSphere described the Hadoop 1.0 concepts (TaskTracker, JobTracker, etc.) as they are mapped into virtual machines. That earlier version also contained a wide set of technical choices for the core architecture decisions you need to make. In the new version, the concepts in modern big data such as Spark and YARN are described in a virtualization context.

In the new version, we brought the main design approaches down to two or three (for example choosing DAS or NAS in the storage area) and we extracted the more complicated designs and tool discussions from it, so as to make it more readable and more focused on getting you started. The ideas described here will scale up to hundreds of nodes if you so choose, so they can be used in the large scale too, if you are going that way. That is shown in the medium-size and large scale example deployments that are given in the guide.

You can think of this blog article as a quick shortcut to information in the Deployment Guide.

The main choices to be made at an early stage in considering the deployment of Hadoop on vSphere are given below.

These discussion points (apart from the VM sizing and placement ones) are not unique to virtualization and they apply equally in native systems too:

  1. Having identified how much data our new systems will manage, an early question is what type of storage to use. The Deployment Guide explores the use of Direct-Attached Storage (DAS), an external form of storage for HDFS, or a combination of the two;
  2. Whether to use an external storage mechanism (e.g. Isilon NAS) that removes the management of the HDFS data from the now “compute-only” nodes or virtual machines
  3. What Hadoop software/services to place into different types of virtual machines
  4. How to size and map the correct number of virtual machines onto the right number of host vSphere servers.
  5. How to configure your networking so that the load that Hadoop occasionally places on it can be handled well.
  6. How to handle and recover from failures and assure availability of your Hadoop clusters.

 

The set of questions related to data storage comes down to a core decision between dispersing your data out across multiple servers or retaining it on one central device. There are advantages to each of these.

The dispersed storage model (Option 1 above) allows you to use commodity servers and storage devices, but it means you have to manage it all using your own tools. If a drive or storage device fails in this scheme, then it is the system administrator’s task to find it, fix it, and restore it into the cluster. The centralized model ensures that all of your data is protected in one place – and it may cut down on your overall storage needs. This reduction is due to avoiding the replication factor that applies with DAS-based HDFS. It can also make the data easier to manage from an ingestion and multi-protocol point of view. The Deployment Guide shows that both of these models will work fine with vSphere, using somewhat different architectures.

One other variant in storage is to use All-Flash storage on the servers in a similar fashion to DAS. This approach allows us to consider using Virtual SAN for hosting the entire Hadoop cluster, where earlier hybrid storage lent itself better to hosting the Hadoop Master nodes on the Virtual SAN-controlled storage. This All-Flash design for Hadoop on vSphere with VSAN is documented in a separate white paper from Intel and VMware.

 

Virtual Machine Placement

When making decisions about the placement of virtual machines onto servers, users have a distinct advantage in vSphere deployments. In many public clouds, we typically do not know about the server hardware configuration and the storage setup that our virtual machines will be deployed on; that anonymity is where the flexibility of the public cloud comes from. Correct VM placement onto host servers and storage is very important for Hadoop/Spark, however, as VM sizing and subsequent placement can have a profound influence on your application’s performance. That phenomenon is shown in the varied performance work that VMware has carried out on virtualized Hadoop – most recently in the testing of Spark and Machine Learning workloads on vSphere in particular. An example of the results from that work is given here.

Other topics that are discussed in the Hadoop Deployment Guide are: system availability, networking, and big data best practices. There is also a set of example deployments at the small, medium and large-sized levels for Hadoop clusters. These are all in use either at VMware or at other organizations. You can start out with a small Hadoop cluster on vSphere and expand it upwards over time into the hundreds of servers, if needed.

 

There is a significant set of technical reference material also contained in the References section of the Hadoop on vSphere Deployment Guide that helps you delve into the deeper details on any of the topics covered in the guide. You can take one of the models described in the main text of the guide, or in the references section as your starting point for deployment and follow the guidelines from there. Using your Hadoop vendor’s deployment tool is recommended for your cluster, whether it be your first one or one among many that you deploy. We find that users want more than one version of their Hadoop distribution running at one time (and sometimes want multiple distributions as well). Virtualization is the way to go to achieve that more easily, with separate sets of virtual machines supporting the different versions.

We hope you enjoy the new Hadoop Deployment Guide material! For more information, you can always go to the main Big Data page.

The post Updated Version of the Deployment Guide for Hadoop on VMware vSphere appeared first on VMware vSphere Blog.

New Product Walkthrough – Hybrid vSphere SSL Certificate Replacement

Jan 17, 2017
 

The VMware Certificate Authority (VMCA) was first introduced in vSphere 6.0 to improve the lifecycle management of SSL Certificates. This post will explain a little bit about the VMCA and its capabilities while also making a recommendation on how to deploy certificates in your environment. Finally, a new click-by-click walkthrough has been created to serve as a guide as you are planning the certificate replacement process.

VMCA Overview

Over time, certificates within a vSphere environment have become much more important. Certificates ensure that communication between services, solutions, and users is secure and that systems are who we think they are. By default, VMCA acts as a root certificate authority. Certificates are issued that chain to VMCA, where the root certificate of VMCA is self-signed as it is the end of the chain. These VMCA-signed certificates generate those thumbprint and browser security warnings you may be used to seeing because they are not trusted by the client computers by default.

The VMCA acts as a central point in which certificates can be deployed to a vSphere environment without having to manually create Certificate Signing Requests (CSRs) or to manually install the certificates once they are minted. The VMCA, working in conjunction with its new purpose-built certificate store called the VMware Endpoint Certificate Store (VECS), has made managing certificates much easier than in prior vSphere releases.

As shown in the graphic below, the VMCA operates within the Platform Services Controller (PSC). Depending on the topology of your installation, you can choose to deploy a vCenter Server with an embedded PSC or utilize separate external PSCs. The VMCA then issues certificates to any vCenter Servers and associated ESXi hosts that are registered to it. Many of the certificates issued by the VMCA are for internal service-to-service communication within vCenter Server. These services, also called Solution Users, use the certificates to authenticate to one another. As vSphere Users and Administrators, we do not interact directly with these services and therefore these certificates are less impactful to our overall certificate strategy. Note that a vCenter Server has four Solution Users while a PSC has one. A vCenter Server with an embedded PSC has four Solution Users as well.

In vSphere 6.0 we also added a reverse proxy to vCenter Server so that when we do need to communicate with vCenter Server services, that communication is all done via port 443 and secured by the Machine SSL certificate of the vCenter Server. The Machine SSL certificate becomes the primary way in which users secure communications with vCenter Server and the PSC. Remember those annoying web browser certificate warnings when accessing the vSphere Web Client? Those are caused by an untrusted (and perhaps self-signed) Machine SSL certificate.
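
You can see exactly which Machine SSL certificate vCenter is presenting by querying port 443 directly. The one-liner below is a sketch using a placeholder hostname:

# Show the subject, issuer, and validity dates of the Machine SSL certificate on port 443.
echo | openssl s_client -connect vcenter.example.com:443 -servername vcenter.example.com 2>/dev/null | openssl x509 -noout -subject -issuer -dates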

The real value of the VMCA is in the automation of replacing and renewing certificates without having to manually generate CSRs, mint certificates, then manually install those certificates. If you’ve replaced certificates in a vSphere 5.x (or prior) environment then you know the challenges and time commitment involved in that process prior to the VMCA. The VMCA allows us to drastically reduce the overhead of the certificate lifecycle. I should note that use of the VMCA is not required. The VMCA can essentially be bypassed and custom certificates can be requested and installed for each of the different vSphere components, however, this comes with a higher operational cost. Additionally, it may introduce more opportunity for misconfiguration which could lead to a lower standard of security. Tread wisely. Next, let’s take a look at some different operational models for the VMCA along with a recommendation on the best approach.

The Subordinate CA Approach

One of the operational models of the VMCA is to act as a Subordinate (or Intermediate) Certificate Authority. Initially, with the release of vSphere 6.0 and the VMCA, this was a rather attractive option for customers. As a sub CA to an already established Certificate Authority in an environment, the VMCA could issue certificates to vCenter Server and ESXi hosts that would be inherently trusted, easily getting rid of those pesky self-signed certificate errors. However, over time it became very apparent that the risk of this model has outweighed the benefit. From a security perspective, by having a Subordinate CA, a rogue administrator with full access to the PSC could mint fully valid certificates that are trusted all the way up to the organization’s Root CA. In talking with our customers, many of whom operate in a highly security-conscious manner, we have found that this type of risk is a deal breaker for the security teams in those organizations.

The Full Custom Approach

The Subordinate CA approach sounded like a great win for operational simplicity but its downfall was the security risk. On the other end of the spectrum we have the Full Custom approach where every certificate within the vSphere environment is replaced by a unique custom certificate minted by a Root CA. This approach is, in theory, the most secure but as previously mentioned, it introduces a lot more complexity and opportunity for misconfiguration, thereby impacting security negatively. It has a high operational cost in order to gain higher security which means generating a CSR for each vCenter Server and PSC VM, each Solution User, and each ESX host. This could be hundreds or thousands of CSRs to generate and certificates to manage. Once that’s all done then you must worry about renewing all those certs or replacing revoked certificates. This is definitely a tradeoff in simplicity and time in order to gain more security.

The Rise of the Hybrid Approach

The question now becomes, “How can we take advantage of the Certificate Lifecycle benefits of the VMCA (and VECS), mitigate the risk of a subordinate CA, and reduce the overall time and effort it takes to manage all of this?” And thus, a hybrid model was born. A few short months after vSphere 6.0 was released, Mike Foley wrote about a new approach in a post titled, “Custom certificate on the outside, VMware CA (VMCA) on the inside – Replacing vCenter 6.0’s SSL Certificate.” With this “hybrid” approach, custom certificates are used for the Machine SSL certificates of the Platform Services Controller and vCenter Server VMs and then the VMCA is left to manage the Solution Users and ESXi host certificates.

This method of certificate lifecycle management does not use the VMCA as a subordinate CA. It lets the VMCA function as an independent CA and issue the internal Solution User and ESXi host certificates. Meanwhile, custom certificates from an external CA will adhere to the controls of the Enterprise PKI policies. Put these two pieces together and this hybrid approach reduces the work of certificate lifecycle management for Operations while increasing security with the custom certificates. This model even meets strict auditing standards such as with the IRS.
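
To make the hybrid approach concrete, the only CSRs you generate by hand are those for the Machine SSL certificates. The snippet below is a hedged sketch of generating one such CSR with openssl; the filenames and subject fields are placeholders, and in practice you would follow the VMware documentation (the certificate-manager utility on the appliance can also drive the replacement).

# Generate a private key and CSR for one vCenter Server Machine SSL certificate.
# All names are placeholders; submit the resulting CSR to your enterprise CA.
openssl req -new -newkey rsa:2048 -nodes -keyout vcenter01_machine_ssl.key -out vcenter01_machine_ssl.csr -subj "/C=US/ST=CA/L=Palo Alto/O=Example Corp/CN=vcenter01.example.com"
# Add the vCenter FQDN as a subjectAltName via an openssl config file if your PKI requires it.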

Let’s look at an example. Consider a vSphere 6.x environment that contains 4 Platform Services Controllers and 6 vCenter Servers across 2 sites with 50 hosts per vCenter Server. Let’s look at replacing certificates in this environment while comparing and contrasting the Subordinate CA, Full Custom, and Hybrid approaches we discussed earlier.

First, if we were to use the Subordinate CA approach we would want each PSC in the SSO Domain to also be a Subordinate CA. While not a requirement, this ensures consistency across the environment and will make life easier if there is ever a need to repoint a vCenter Server from one PSC to another. Given that each PSC will be a Subordinate CA, we need to generate a CSR for each of those PSC Sub CAs and submit to the Root CA. Once that is completed and the VMCAs are fitted with their new signing certificates, the VMCAs can then issue Solution User, Machine SSL, and Host certificates. So, in this environment we only have to manually manage 4 CSRs to get 4 certificates. Not bad. But remember, most security teams will forbid this type of deployment because of the risks involved.

Next, let’s go to the Full Custom approach. Recall that this method uses custom certificates for everything. So, we need to generate CSRs for the Solution Users, Machine SSL, and Hosts. This adds up to 338 CSRs to generate the required certificates. Whoa, that’s going to take some time. Not only that, but when it comes to renewal time you get to do this all over again not to mention that certificates could get revoked, hosts could be replaced, and other operations that would require a new certificate. You should be able to see that this causes the most management overhead but it is the most secure way of deploying certificates. There are some environments that may require this approach but for a “normal” production environment this should not be required.

Last, the Hybrid approach mashes components of the previous two together to get the best of both worlds. We still need to manually generate a handful of CSRs for the Machine SSL certificates of each PSC and vCenter Server, which gives us 10 certificates to install and manage over time. And by letting the VMCA do its thing, we gain operational benefits as we grow our datacenter. We don’t have to mint a new CSR for every new ESXi host we add into a cluster. We just add it and let the VMCA do its thing. The same is true for Solution Users.
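
The totals quoted above fall straight out of the inventory counts. Here is the arithmetic as a small sketch, using the example environment’s numbers:

# Example environment: 4 PSCs, 6 vCenter Servers, 50 hosts per vCenter Server.
psc=4; vc=6; hosts_per_vc=50
# Subordinate CA: one signing CSR per PSC.
echo "Subordinate CA CSRs: $psc"                                                        # -> 4
# Full Custom: Machine SSL + Solution Users (4 per vCenter, 1 per PSC) + every host.
echo "Full Custom CSRs: $(( (psc + vc) + (vc * 4 + psc * 1) + (vc * hosts_per_vc) ))"   # -> 338
# Hybrid: only the Machine SSL certificates of each PSC and vCenter Server.
echo "Hybrid CSRs: $(( psc + vc ))"                                                     # -> 10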

Below is a table that captures the totals for each of the methods we’ve discussed.

Conclusion

What we have found in talking to these customers that are embracing the hybrid approach is that security teams are most concerned with securing the control plane of the administrators with certificates issued by the security team via their enterprise PKI. The hybrid approach addresses that for securing access to vSphere by replacing the Machine SSL certificate. Per best practices, access to ESXi management should be limited in nature and only done on an isolated network. To address administrative access to functions like the ESXi UI (introduced in 5.5 U3 and 6.0 U2), the VMCA CA certificate can be exported and added to the Trusted Root Certification Authorities container in an Active Directory group policy.

As you can see, the Hybrid approach is the best of both worlds. It addresses the security needs of the Security Team by protecting access to vCenter Server while it also addresses the operational needs of the IT team.

Walkthrough

To see exactly how to implement the Hybrid approach check out the new Product Walkthrough titled, “SSL Certificate Replacement – Hybrid Mode“.

Special thanks to Mike Foley for his help and feedback on this post.

The post New Product Walkthrough – Hybrid vSphere SSL Certificate Replacement appeared first on VMware vSphere Blog.

Jan 09, 2017
 

Here is our Top 20 vCenter articles list for December 2016. This list is ranked by the number of times a VMware Support Request was resolved by following the steps in a published Knowledge Base article.

  1. Build numbers and versions of VMware vCenter Server
  2. Uploading diagnostic information for VMware through the Secure FTP portal
  3. Downloading, licensing, and using VMware products
  4. Using the VMware Knowledge Base
  5. Support Contracts FAQs
  6. Collecting diagnostic information for VMware vCenter Server 4.x, 5.x and 6.x
  7. How to consolidate snapshots in vSphere 5.x/6.x
  8. Investigating virtual machine file locks on ESXi
  9. Troubleshooting an ESXi/ESX host in non responding state
  10. Resetting the VMware vCenter Server 5.x Inventory Service database
  11. Licensing VMware vCenter Site Recovery Manager
  12. vSphere handling of LUNs detected as snapshot LUNs
  13. Update sequence for vSphere 6.0 and its compatible VMware products
  14. How to repoint and re-register vCenter Server 5.1 / 5.5 and components
  15. vmware-dataservice-sca and vsphere-client status change from green to yellow
  16. How to register/add a VM to the Inventory in vCenter Server
  17. Upgrading to vCenter Server 6.0 Update 2a fails on VCSServiceManager with error code ‘1603’
  18. Unable to grow or expand a VMFS volume or datastore
  19. VMware End User License Agreements
  20. Back up and restore vCenter Server Appliance/vCenter Server 6.0 vPostgres database

The post Top 20 vCenter Server articles for December 2016 appeared first on Support Insider.

Dec 19, 2016
 

This video demonstrates how to purge old data from the SQL Server database used by vCenter Server. You would need to perform this task if your vCenter Server database is full.

When the vCenter Server database is full:

  • You cannot log in to vCenter Server
  • VMware VirtualCenter Server service may start and stop immediately.

To resolve this issue, we need to manually purge or truncate the vCenter Server database. Details of how to do this, and the script to truncate the database, are documented in the KB article: Purging old data from the database used by vCenter Server (1025914)

The post Purging old data from the vCenter Server database appeared first on Support Insider.

vCenter Server 6.5 High Availability Performance and Best Practices

Nov 23, 2016
 

High availability services are important in any platform, and vCenter Server is no exception. As the main administrative and management tool of vSphere, it is a critical element that requires HA. vCenter Server HA (aka VCHA) delivers protection against software and hardware failures with excellent performance for common customer scenarios, as shown in this paper.

We thoroughly tested VCHA with a benchmark that simulates common vCenter Server activities in both regular and worst case scenarios. The result is solid data and a comprehensive performance characterization in terms of:

  • Performance of VCHA failover/recovery time objective (RTO)
  • Performance of enabling VCHA
  • VCHA overhead
  • Performance impact of vCenter Server statistics level
  • Performance impact of a private network
  • External Platform Services Controller (PSC) vs Embedded PSC

In addition to the performance study results, the paper describes the VCHA architecture and includes some useful performance best practices for getting the most from VCHA.

For the full paper, see VMware vCenter Server High Availability Performance and Best Practices.

The post vCenter Server 6.5 High Availability Performance and Best Practices appeared first on VMware VROOM! Blog.

How to backup and restore the embedded vCenter Server 6.0 vPostgres database

Nov 21, 2016
 

This video demonstrates how to backup and restore an embedded vCenter Server 6.0 vPostgres database. Backing up your database protects the data stored in your database. Of course, restoring a backup is an essential part of that function.

This follows up on our recent blog & video: How to backup and restore the embedded vCenter Server Appliance 6.0 vPostgres database

Note: This video is only supported for backup and restore of the vPostgres database to the same vCenter Server. Use of image-based backup and restore is the only solution supported for performing a full, secondary appliance restore.

The post How to backup and restore the embedded vCenter Server 6.0 vPostgres database appeared first on Support Insider.

How to backup and restore the embedded vCenter Server Appliance 6.0 vPostgres database

Nov 14, 2016
 

This video demonstrates how to backup and restore an embedded vCenter Server Appliance 6.0 vPostgres database. Backing up your database protects the data stored in your database. Of course, restoring a backup is an essential part of that function.

Note: This video is only supported for backup and restore of the vPostgres database to the same vCenter Server Appliance. Use of image-based backup and restore is the only solution supported for performing a full, secondary appliance restore.

The post How to backup and restore the embedded vCenter Server Appliance 6.0 vPostgres database appeared first on Support Insider.

Oct 05, 2016
 

Here is our Top 20 vCenter articles list for September 2016. This list is ranked by the number of times a VMware Support Request was resolved by following the steps in a published Knowledge Base article.

  1. Licensing VMware vCenter Site Recovery Manager
  2. Resetting the VMware vCenter Server 5.x Inventory Service database
  3. Upgrading to vCenter Server 6.0 best practices
  4. vSphere handling of LUNs detected as snapshot LUNs
  5. How to repoint and re-register vCenter Server 5.1 / 5.5 and components
  6. ESXi 5.5 Update 3b and later hosts are not manageable after an upgrade
  7. Unmanaged workload is detected on datastore running SIOC
  8. vmware-dataservice-sca and vsphere-client status change from green to yellow
  9. How to register/add a VM to the Inventory in vCenter Server
  10. Downloading, licensing, and using VMware products
  11. Update sequence for vSphere 6.0 and its compatible VMware products
  12. Upgrading to vCenter Server 5.5 best practices
  13. Making a VMware feature request
  14. Enhanced vMotion Compatibility (EVC) processor support
  15. Update sequence for vSphere 5.5 and its compatible VMware products
  16. ESXi host disconnects intermittently when heartbeats are not received by vCenter Server
  17. Cannot remove or disable unwanted plug-ins from vCenter Server and vCenter Server Appliance
  18. Back up and restore vCenter Server Appliance/vCenter Server 6.0 vPostgres database
  19. vCenter Server 6.0 requirements for installation
  20. “Deprecated VMFS volume(s) found on the host” error in ESXi hosts

The post Top 20 vCenter Server articles for September 2016 appeared first on Support Insider.

Sep 06, 2016
 

Here is our Top 20 vCenter articles list for August 2016. This list is ranked by the number of times a VMware Support Request was resolved by following the steps in a published Knowledge Base article.

  1. Investigating virtual machine file locks on ESXi/ESX
  2. Using the VMware Knowledge Base
  3. Uploading diagnostic information for VMware through the Secure FTP portal
  4. Correlating build numbers and versions of VMware products
  5. Licensing VMware vCenter Site Recovery Manager
  6. Permanent Device Loss (PDL) and All-Paths-Down (APD) in vSphere 5.x and 6.x
  7. Resetting the VMware vCenter Server 5.x Inventory Service database
  8. Downloading, licensing, and using VMware products
  9. Build numbers and versions of VMware vCenter Server
  10. How to repoint and re-register vCenter Server 5.1 / 5.5 and components
  11. vSphere handling of LUNs detected as snapshot LUNs
  12. Upgrading to vCenter Server 6.0 best practices
  13. How to consolidate snapshots in vSphere 5.x/6.0
  14. ESXi 5.5 Update 3b and later hosts are not manageable after an upgrade
  15. Collecting diagnostic information for VMware vCenter Server 4.x, 5.x and 6.0
  16. How to enable EVC in vCenter Server
  17. Upgrading to vCenter Server 5.5 best practices
  18. VMware End User License Agreements
  19. “Failed to verify the SSL certificate for one or more vCenter Server Systems” error in the vSphere Web Client
  20. VMware vCenter Server 5.x fails to start with the error: Failed to add LDAP entry

The post Top 20 vCenter Server articles for August 2016 appeared first on Support Insider.

Aug 01, 2016
 

Here is our Top 20 vCenter articles list for July 2016. This list is ranked by the number of times a VMware Support Request was resolved by following the steps in a published Knowledge Base article.

  1. Uploading diagnostic information for VMware using FTP
  2. Downloading, licensing, and using VMware products
  3. Licensing VMware vCenter Site Recovery Manager
  4. Collecting diagnostic information for VMware vCenter Server 4.x, 5.x and 6.0
  5. Using the VMware Knowledge Base
  6. Best practices for upgrading to vCenter Server 6.0
  7. ESXi hosts are no longer manageable after an upgrade
  8. Permanent Device Loss (PDL) and All-Paths-Down (APD) in vSphere 5.x and 6.x
  9. Consolidating snapshots in vSphere 5.x/6.0
  10. Diagnosing an ESXi/ESX host that is disconnected or not responding in VMware vCenter Server
  11. How to unlock and reset the vCenter SSO administrator password
  12. Resetting the VMware vCenter Server 5.x Inventory Service database
  13. Correlating build numbers and versions of VMware products
  14. Back up and restore vCenter Server Appliance/vCenter Server 6.0 vPostgres database
  15. Build numbers and versions of VMware vCenter Server
  16. Re-pointing and re-registering VMware vCenter Server 5.1 / 5.5 and components
  17. “Deprecated VMFS volume(s) found on the host” error in ESXi hosts
  18. vmware-dataservice-sca and vsphere-client status change from green to yellow
  19. Investigating virtual machine file locks on ESXi/ESX
  20. VMware End User License Agreements

The post Top 20 vCenter Server articles for July 2016 appeared first on Support Insider.

Jul 05, 2016
 

Here is our Top 20 vCenter articles list for June 2016. This list is ranked by the number of times a VMware Support Request was resolved by following the steps in a published Knowledge Base article.

  1. Purging old data from the database used by VMware vCenter Server
  2. ESXi 5.5 Update 3b and later hosts are no longer manageable after upgrade
  3. Resetting the VMware vCenter Server and vCenter Server Appliance 6.0 Inventory Service database
  4. Unlocking and resetting the VMware vCenter Single Sign-On administrator password
  5. Permanent Device Loss (PDL) and All-Paths-Down (APD) in vSphere 5.x and 6.x
  6. Upgrading to vCenter Server 6.0 best practices
  7. Correlating build numbers and versions of VMware products
  8. Update sequence for vSphere 6.0 and its compatible VMware products
  9. Stopping, starting, or restarting VMware vCenter Server services
  10. In vCenter Server 6.0, the vmware-dataservice-sca and vsphere-client status change from green to yellow continually
  11. Enabling EVC on a cluster when vCenter Server is running in a virtual machine
  12. The vpxd process becomes unresponsive after upgrading to VMware vCenter Server 5.5
  13. Migrating the vCenter Server database from SQL Express to full SQL Server
  14. Reducing the size of the vCenter Server database when the rollup scripts take a long time to run
  15. Consolidating snapshots in vSphere 5.x/6.0
  16. Back up and restore vCenter Server Appliance/vCenter Server 6.0 vPostgres database
  17. Diagnosing an ESXi/ESX host that is disconnected or not responding in VMware vCenter Server
  18. Build numbers and versions of VMware vCenter Server
  19. Increasing the size of a virtual disk
  20. Determining where growth is occurring in the VMware vCenter Server database

The post Top 20 vCenter Server articles for June 2016 appeared first on Support Insider.


Getting Comfortable with vPostgres and the vCenter Server Appliance – Part 2

May 20, 2016
 

In Part 1 of this blog series I talked about vPostgres, some of its features, and why it’s the database platform of choice for the vCenter Server Appliance. In this post, Part 2, I’ll dig a bit deeper into the gears of vPostgres. We’ll take a look at some of the key configuration settings and why they are important to the vCenter Server Appliance.

Before digging into the vPostgres configuration I do want to make a quick point. This blog post is educational in nature and aimed at allowing vSphere Administrators to be more comfortable with the vPostgres database and the vCenter Server Appliance. In normal circumstances the vPostgres configuration should not be modified in any way. If the vPostgres configuration is modified it could lead to undesirable results and a lack of support from VMware GSS. But I do think it is important to understand how VMware is tuning the vPostgres configuration from a vanilla PostgreSQL deployment in order to increase the comfort level with vPostgres.

vPostgres Configuration Files

Figure 1: Listing of contents of the vPostgres directory on the vCSA

Now that the housekeeping is out of the way, let’s get to it! The vPostgres configuration files are located in /storage/db/vpostgres/ as seen in Figure 1 above. The main config file is postgresql.conf and holds all the normal configuration you’d expect for a database – log locations & rotation settings, memory tuning, and the autovacuum settings just to name a few. There are several ways you can take a look at this configuration file. One option would be to use a text editor such as vi or Vim (both are natively available on the vCenter Server Appliance). If you are looking for a specific setting within the file such as the autovacuum configuration you could use the command:

less postgresql.conf | grep autovacuum

Figure 2: Example of viewing parts of the postgresql.conf file

You could also use a utility like WinSCP to download the file to your workstation and then use your text editor of choice. I would recommend Notepad++ and avoid the built-in Windows editors (Notepad & WordPad), as they can sometimes add extraneous hidden characters or just don’t deal with the formatting of a *nix-formatted text file very well. Depending on your familiarity with Linux, you can probably do this dozens of other ways. Feel free to leave a comment on this post if you have a favorite method that I haven’t covered.

vPostgres Logging

As you might expect, once you get the postgresql.conf file opened up you can see that there are quite a few configuration settings. To get detailed information on any setting in particular you can refer to the PostgreSQL 9.3 documentation. The first settings I want to call attention to are several settings that are related to logging. We set the location, naming convention, rotation, and several other logging configurations. You’ll note that we set the log directory with:

log_directory = 'pg_log'

The pg_log directory is located in the same directory as the conf files – /storage/db/vpostgres/. But pg_log is actually a symlink to /var/log/vmware/vpostgres, which is in turn a symlink to /storage/log/vmware. If we take a look in the pg_log directory we can see the log files and verify that the rotation is working correctly. We can also verify that the naming convention set in the conf file (log_filename = 'postgresql-%d.log' by default) matches. By default we rotate the logs to a new file every day (24 hours), and the name of that file is postgresql-%d, where %d is the day of the month. So, for example, if today is May 4th then today’s log file will be postgresql-4.log.

Figure 3: vPostgres log rotation
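
You can verify the same behavior from the appliance shell. The commands below are a small sketch using the paths mentioned above:

# Resolve the symlink chain for the pg_log log directory.
readlink -f /storage/db/vpostgres/pg_log
# List the rotated daily log files.
ls -l /storage/db/vpostgres/pg_log/postgresql-*.log
# Show the logging-related settings currently in the config file.
grep -E '^(log_directory|log_filename|log_truncate_on_rotation|logging_collector)' /storage/db/vpostgres/postgresql.conf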

There is also a setting called log_truncate_on_rotation which, if enabled, tells PostgreSQL to overwrite log files with the same name. Since we enable this setting on the vCenter Server Appliance, we’ll see that the June 4th log will have the same name as the May 4th log (postgresql-4.log) and it will be truncated (cleared). This results in a fresh log file for each day of the month. One caveat is that not all months have 31 days, right? So the postgresql-31.log file may stick around for a while; you’ll need to be aware of that and of the file’s timestamp if you’re using the logs to troubleshoot.

One additional logging parameter is logging_collector, which is off by default in PostgreSQL. We enable this parameter on the vCenter Server Appliance to catch log messages sent to stderr by the top-level vPostgres process (“postgres”). For example, for the vCSA, this means that log messages generated by vPostgres backend and extension processes, commands invoked by vPostgres, and any shared libraries that vPostgres might use will all get captured in the log files through stderr. Since these types of messages would normally be missed by syslog, having the logging collector turned off could result in additional time to troubleshoot an issue. One note is that the logging collector is meant to always capture the logs. Therefore, it is possible that if the vCenter Server Appliance experiences high load, the logging collector could block other processes since it will take priority. However, this should not occur during normal operations and it is highly advisable to keep this setting enabled.

vPostgres Checkpoints

Moving on from the log settings we have checkpoints. Checkpoints, as defined in the PostgreSQL documentation are “points in the sequence of transactions at which it is guaranteed that the heap and index data files have been updated with all information written before that checkpoint”. So, in other words, we write a checkpoint to the Write Ahead Log (WAL) every so often to show that the preceding transactions have been flushed to disk. In the event of a crash, the system looks at the most recent checkpoint to determine from where it needs to start replaying transactions.

In PostgreSQL, checkpoints can be triggered via three different methods. The first method is by issuing a command via the pgsql command line interface. The second way a checkpoint is triggered is time-based which is every 5 minutes by default. The last way a checkpoint can be triggered is by volume. By default, PostgreSQL performs a checkpoint after 48 MBs of data. For vPostgres & vCenter Server, we felt that this volume would create far too many checkpoints and create unnecessary I/O (load) on the vCenter Server Appliance. Therefore, we’ve stretched that out to 90% of the disk allocated for the pg_xlog partition. This is the VMDK where the WAL resides and by default is 5 GB in size. This reduces the I/O of the checkpointing operations. Note that there is tradeoff – a modest increase in recovery.

I’ve put together the following diagram to help illustrate the checkpoint concept for vPostgres. The below diagram represents the Write Ahead Log and shows how much of the log would be replayed in the event of some sort of issue. You can see that only the portion of the file since the last checkpoint needs to be replayed.

Figure 4: WAL and Checkpoints

As mentioned earlier, each time a checkpoint is created in the WAL all of the preceding transactions are flushed to disk. As you might imagine, when this data is written there is the potential for quite a bit of I/O to occur on the disks where the PostgreSQL database is stored. The PostgreSQL engineers thought of this and have a parameter to help mitigate I/O storms called checkpoint_completion_target. By default, this is set to 0.5 which means that the load created by flushing the transactions to disk during the checkpoint process is spread out across 2.5 minutes (or half a checkpoint cycle). In order to further protect vCenter Server customers from this potential I/O spike, we’ve changed the checkpoint_completion_target parameter to 0.9 to further spread out the I/O.

vPostgresHealth

In the final part of this blog post I want to cover something that is added by VMware to help show the current health and status of vPostgres. This is a simple process called the health_status_worker and it writes a file to the /etc/vmware-sca/health/ directory called vmware-postgres-health-status.xml. Currently, this XML file is consumed via the vSphere Web Client and is used to show the status of the vmware-vpostgres service in the Nodes view under Administration > System Configuration in the vSphere Web Client.

Figure 5: Showing the health of vPostgres in the vSphere Web Client

While simple, this could lead to some additional capabilities down the road to make it easier to monitor the health of the vPostgres service.

That concludes this second part of the blog series on getting comfortable with vPostgres and the vCenter Server Appliance. We focused on the configuration and reviewed some of the important changes vPostgres has over a vanilla PostgreSQL installation. In the next post we’ll take a look at some of the tools that are available to monitor and manage the vCenter Server Appliance and vPostgres.

Acknowledgement

I just want to thank Michael Paquier & Nikhil Deshpande for continuing to help out on this subject matter. Thank you!

The post Getting Comfortable with vPostgres and the vCenter Server Appliance – Part 2 appeared first on VMware vSphere Blog.

Getting Comfortable with vPostgres and the vCenter Server Appliance – Part 2

May 20 2016
 

In Part 1 of this blog series I talked about vPostgres, some of its features, and why it’s the database platform of choice for the vCenter Server Appliance. In this post, Part 2, I’ll dig a bit deeper into the gears of vPostgres. We’ll take a look at some of the key configuration settings and why they are important to the vCenter Server Appliance.

Before digging into the vPostgres configuration I do want to make a quick point. This blog post is educational in nature and aimed at helping vSphere Administrators become more comfortable with the vPostgres database and the vCenter Server Appliance. Under normal circumstances the vPostgres configuration should not be modified in any way; doing so could lead to undesirable results and a loss of support from VMware GSS. That said, I do think it is important to understand how VMware tunes the vPostgres configuration relative to a vanilla PostgreSQL deployment in order to increase your comfort level with vPostgres.

vPostgres Configuration Files

Figure 1: Listing of contents of the vPostgres directory on the vCSA

Now that the housekeeping is out of the way, let’s get to it! The vPostgres configuration files are located in /storage/db/vpostgres/, as seen in Figure 1 above. The main config file is postgresql.conf, which holds all the normal configuration you’d expect for a database – log locations & rotation settings, memory tuning, and the autovacuum settings, just to name a few. There are several ways you can take a look at this configuration file. One option is to use a text editor such as vi or Vim (both are natively available on the vCenter Server Appliance). If you are looking for a specific setting within the file, such as the autovacuum configuration, you could use the command:

less postgresql.conf | grep autovacuum

Figure 2: Example of viewing parts of the postgresql.conf file

You could also use a utility like WinSCP to download the file to your workstation and then use your text editor of choice. I would recommend Notepad++ and avoiding the built-in Windows editors (Notepad & WordPad), as they can add extraneous hidden characters or simply don’t handle a *nix-formatted text file very well. There are certainly other approaches, and depending on your familiarity with Linux you can probably come up with a dozen more (a couple of alternatives are sketched below). Feel free to leave a comment on this post if you have a favorite method that I haven’t covered.
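
If you want to stay on the appliance itself, here is a minimal sketch of a couple of slightly more direct ways to pull a specific setting out of the file. The grep and less lines only assume the config path shown in Figure 1; the psql line additionally assumes the location of the bundled client, so treat that path as an assumption and adjust for your build:

# Show every autovacuum-related line, with line numbers for reference
grep -n autovacuum /storage/db/vpostgres/postgresql.conf

# Or page through the whole file and search interactively with /autovacuum
less /storage/db/vpostgres/postgresql.conf

# If the bundled psql client is available (this path is an assumption and may
# vary by vCSA build) and local authentication allows it, you can also ask the
# running server for the effective value:
# /opt/vmware/vpostgres/current/bin/psql -U postgres -c 'SHOW autovacuum;'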

vPostgres Logging

As you might expect, once you get the postgresql.conf file opened up you can see that there are quite a few configuration settings. For detailed information on any particular setting, refer to the PostgreSQL 9.3 documentation. The first settings I want to call attention to are those related to logging: we set the location, naming convention, rotation, and several other logging options. You’ll note that we set the log directory with:

log_directory = 'pg_log'

The pg_log directory is located in the same directory as the conf files – /storage/db/vpostgres/. However, pg_log is actually a symlink to /var/log/vmware/vpostgres, which is in turn a symlink to /storage/log/vmware. If we take a look in the pg_log directory we can see the log files, verify that rotation is working correctly, and confirm that the naming convention set in the conf file (log_filename = 'postgresql-%d.log' by default) matches. By default we rotate to a new log file every day (24 hours), and the name of that file is postgresql-%d, where %d is the day of the month. So, for example, if today is May 4th then today’s log file will be postgresql-4.log.

Figure 3: vPostgres log rotation

There is also a setting called log_truncate_on_rotation which, if enabled, tells PostgreSQL to overwrite log files with the same name. Since we enable this setting on the vCenter Server Appliance, the June 4th log will have the same name as the May 4th log (postgresql-4.log) and will be truncated (cleared), resulting in a fresh log file for each day of the month. One caveat: not all months have 31 days, so a postgresql-31.log file may stick around for a while. Keep an eye on the file’s timestamp if you’re using the logs to troubleshoot.
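
A quick way to confirm the directory layout and rotation behavior described above is to follow the symlink chain and list the logs by modification time. A minimal sketch, using only the paths mentioned above:

# Resolve the pg_log symlink chain to its final destination
readlink -f /storage/db/vpostgres/pg_log

# List the day-of-month log files, newest first, so a stale
# postgresql-31.log left over from a longer month is easy to spot
ls -lt /storage/db/vpostgres/pg_log/postgresql-*.log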

One additional logging parameter is logging_collector, which is off by default in PostgreSQL. We enable this parameter on the vCenter Server Appliance to catch log messages sent to stderr by the top-level vPostgres process (“postgres”). For the vCSA, this means that log messages generated by vPostgres backend and extension processes, by commands invoked by vPostgres, and by any shared libraries that vPostgres might use are all captured in the log files through stderr. These types of messages would normally be missed by syslog, so leaving the logging collector turned off could add time to troubleshooting an issue. One note: the logging collector is designed to always capture the logs, so if the vCenter Server Appliance experiences very high load it is possible for the logging collector to block other processes, since it takes priority. However, this should not occur during normal operations, and it is highly advisable to keep this setting enabled.
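
To see how these logging options are set on the appliance, a simple filter over the config file is enough. This sketch assumes the settings appear uncommented in postgresql.conf; log_rotation_age is the standard PostgreSQL parameter behind the 24-hour rotation mentioned earlier:

# Show the logging-related settings discussed above
grep -E 'log_directory|log_filename|log_rotation_age|log_truncate_on_rotation|logging_collector' \
    /storage/db/vpostgres/postgresql.conf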

vPostgres Checkpoints

Moving on from the log settings, we have checkpoints. Checkpoints, as defined in the PostgreSQL documentation, are “points in the sequence of transactions at which it is guaranteed that the heap and index data files have been updated with all information written before that checkpoint”. In other words, we write a checkpoint to the Write Ahead Log (WAL) every so often to show that the preceding transactions have been flushed to disk. In the event of a crash, the system looks at the most recent checkpoint to determine from where it needs to start replaying transactions.

In PostgreSQL, checkpoints can be triggered in three different ways. The first is by manually issuing a command (CHECKPOINT) via the psql command-line interface. The second is time-based, which by default is every 5 minutes. The last is by volume: by default, PostgreSQL performs a checkpoint after 48 MB of WAL data. For vPostgres and vCenter Server, we felt that this volume would create far too many checkpoints and unnecessary I/O (load) on the vCenter Server Appliance, so we’ve stretched that out to 90% of the disk allocated for the pg_xlog partition. This is the VMDK where the WAL resides, and it is 5 GB in size by default. This reduces the I/O of the checkpointing operations. Note that there is a tradeoff – a modest increase in recovery time.
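
For reference, stock PostgreSQL 9.3 expresses the volume trigger as checkpoint_segments, a count of 16 MB WAL segments, and the 48 MB default above is simply the default of 3 segments times 16 MB. A minimal sketch for checking how the appliance has the time and volume triggers set (the exact values on your vCSA may differ):

# Inspect the time- and volume-based checkpoint triggers on the appliance
grep -E 'checkpoint_(segments|timeout)' /storage/db/vpostgres/postgresql.conf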

I’ve put together the following diagram to help illustrate the checkpoint concept for vPostgres. It represents the Write Ahead Log and shows how much of the log would be replayed in the event of some sort of issue: only the portion of the file since the last checkpoint needs to be replayed.

Figure 4: WAL and Checkpoints

As mentioned earlier, each time a checkpoint is created in the WAL, all of the preceding transactions are flushed to disk. As you might imagine, when this data is written there is the potential for quite a bit of I/O on the disks where the PostgreSQL database is stored. The PostgreSQL engineers thought of this and provide a parameter to help mitigate I/O storms called checkpoint_completion_target. By default this is set to 0.5, which means that the load created by flushing transactions to disk during the checkpoint process is spread out across 2.5 minutes (half a checkpoint cycle). To further protect vCenter Server customers from this potential I/O spike, we’ve changed checkpoint_completion_target to 0.9 to spread the I/O out even more.
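
To put some numbers on that, assuming the default 5-minute checkpoint interval mentioned above, a completion target of 0.5 smooths the checkpoint I/O over roughly 2.5 minutes, while 0.9 smooths it over roughly 4.5 minutes. The sketch below restates that arithmetic and shows how to read the current value from the config file:

# checkpoint_completion_target spreads checkpoint writes across a fraction
# of the checkpoint interval (assuming the default 5-minute timeout):
#   0.5 * 5 min = 2.5 min of smoothed I/O   (stock PostgreSQL default)
#   0.9 * 5 min = 4.5 min of smoothed I/O   (vCenter Server Appliance setting)
grep checkpoint_completion_target /storage/db/vpostgres/postgresql.conf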

vPostgres Health

In the final part of this blog post I want to cover something VMware adds to show the current health and status of vPostgres. This is a simple process called the health_status_worker, and it writes a file named vmware-postgres-health-status.xml to the /etc/vmware-sca/health/ directory. Currently, this XML file is consumed by the vSphere Web Client and is used to show the status of the vmware-vpostgres service in the Nodes view under Administration > System Configuration.

Figure 5: Showing the health of vPostgres in the vSphere Web Client
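
Because the full path to the health file is given above, you can also peek at it directly from a shell on the appliance; a minimal sketch:

# Dump the health status XML that the vSphere Web Client consumes
cat /etc/vmware-sca/health/vmware-postgres-health-status.xml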

While simple, this could lead to some additional capabilities down the road to make it easier to monitor the health of the vPostgres service.

That concludes this second part of the blog series on getting comfortable with vPostgres and the vCenter Server Appliance. We focused on the configuration and reviewed some of the important ways vPostgres differs from a vanilla PostgreSQL installation. In the next post we’ll take a look at some of the tools that are available to monitor and manage the vCenter Server Appliance and vPostgres.

Acknowledgement

I just want to thank Michael Paquier & Nikhil Deshpande for continuing to help out on this subject matter. Thank you!

The post Getting Comfortable with vPostgres and the vCenter Server Appliance – Part 2 appeared first on VMware vSphere Blog.