I’ve recently been involved in performance testing a SharePoint 2013 farm. This has led to some discoveries about what you can use vCops for when you do performance testing, including which metrics you should look at.
The setup we used for testing was what Microsoft calls a Medium Farm topology. It consists of 2 web frontends, 2 application servers for search and index, and 1 database server. In front of the web frontends we’ve placed a Citrix Netscaler for load balancing, SSL offload and such. Each server runs Windows 2012, SharePoint is 2013 and Microsoft SQL is 2012. The SharePoint site has a few web parts on the front page that were built for the site.
- Web servers: 16 vCPUs, 16 GB RAM
- Application servers: 8 vCPUs, 8 GB RAM
- Database server: 8 vCPUs, 16 GB RAM
All these ran on a single vSphere 5.0 host with 4 CPUs of 8 cores each, giving us 32 cores, and 196 GB of RAM. So at this point we were overcommitting on CPU somewhat, with 56 vCPUs provisioned. This was not the final destination host for these servers; on the final hosts we wouldn’t overcommit. However, we were certain that if we saw bottlenecks here, they would also be present at the final destination for the farm.
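To make the overcommit concrete, here is a minimal Python sketch of the arithmetic, using the VM sizes and host core count listed above. The function name `overcommit_ratio` is just something I made up for illustration.

```python
def overcommit_ratio(vcpus_per_vm, physical_cores):
    """Return provisioned vCPUs divided by physical cores on the host."""
    return sum(vcpus_per_vm) / physical_cores


# 2 web servers (16 vCPUs each), 2 application servers (8 each), 1 database server (8)
farm = [16, 16, 8, 8, 8]
host_cores = 4 * 8  # 4 CPUs with 8 cores each

ratio = overcommit_ratio(farm, host_cores)
print(f"{sum(farm)} vCPUs on {host_cores} cores gives {ratio:.2f}:1")
```

Anything above 1:1 means the VMs can, in the worst case, ask for more cores than the host physically has.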
Inside Windows we could see the CPU maxing out, to a point where system processes were complaining, and we saw load times of over 30 seconds for the front page.
The first thing that happened was that someone looked inside Windows and said more CPU was needed. I, however, was looking at vCenter performance and only seeing a max of roughly 40% CPU usage. Looking at vCops, I saw usage maxing out at 35.63% but a demand of 54.52%.
I interpreted that result as the VM requesting more CPU than VMware was giving it. However, it was decided to try giving 8 more vCPUs to the web servers instead of scaling them down to 8 vCPUs.
Running the test again with 24 vCPU web servers yielded the exact same result: load times of more than 30 seconds and CPU inside Windows maxing out. Looking at vCops we saw this:
An even lower % Usage and Demand. At this point I suspected we were overcommitting too much, so I went back into vCops and pulled out the %READY counters as well, for both Feb 20 and 21.
What that told us was that at 16 vCPUs %READY maxed out at around 14%, which is pretty bad. Adding 8 more vCPUs made that jump to 28.75%, meaning that close to 1/3 of the time the machine was ready to run but couldn’t get access to a CPU on the host. One funny thing in the first picture is that you can tell when the 8 vCPUs were added, since that moved the “idle” %READY from around 6% to 14.87%.
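vCops shows %READY directly, but the vCenter performance charts report CPU ready as a summation in milliseconds. If you want to compare the two, the commonly cited conversion divides the summation by the length of the sample interval. Below is a small sketch of that, assuming the 20-second interval of the real-time charts; the sample values are made up to match the percentages above and are not taken from our test.

```python
def ready_percent(ready_ms, interval_s=20):
    """Convert a CPU ready summation (ms) to a percentage of the sample interval.

    interval_s=20 assumes vCenter's real-time chart interval; other chart
    rollups use longer intervals and need a different value.
    """
    return ready_ms * 100 / (interval_s * 1000)


# Hypothetical summation values, picked to land on the figures seen in vCops
for ms in (2800, 5750):
    print(f"{ms} ms ready in 20 s is {ready_percent(ms):.2f}% ready")
```

Note that a summation taken for a whole VM adds up all of its vCPUs, so how you read the result depends on whether you look at a single vCPU or the VM as a whole.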
The following Monday we scaled the machines down from 24 to 8 vCPUs and ran the test again. Load times were still at around 30 seconds, so we didn’t really solve the problem, but looking at vCops we saw a completely different picture:
The graph shows %READY dropping from roughly 12% to around 1%, and the funny thing here is that while we’re running the test, %READY drops even further.
At this point we decided to give Michael Monberg from VMware a call to ask what to look for in vCops. He showed us this neat trick: when you are looking at performance metrics for a given VM and have some graphs from the VM showing, like the ones above, you can do this:
Click on the + sign near the health tree
You then see the host, the VM and the datastore. If you single-click on the host, you get a new Metric Selector
This one is for the host, so by expanding CPU Usage I could select the Core Utilization metric and add it to the graph page. That gave us graphs from the VM and the host at the same time. Here are the graphs from Feb 20, 21 and 24 with the new metric added:
Feb 20, 16 vCPUs
Feb 21, 24 vCPUs
From that we could tell that with 16 and 24 vCPUs the host was totally maxed out on its physical cores, whereas with 8 vCPUs we only used around 66%. So while the VM metric only showed 35% CPU usage, the host was maxed out, and thus adding the 8 extra vCPUs had no positive effect on the VM.
When we ran the test with 8 vCPUs and got the exact same results, we actually weren’t crossing NUMA nodes, which is what Microsoft recommends. See my blog post on NUMA and vNUMA here
This blog post was written to show some of the nice things you can do with vCenter Operations, and I hope you found it useful.
*disclaimer* This post is based on my findings so you might experience different stats, but the general guidelines are sound to my knowledge *disclaimer*
NUMA, or Non-Uniform Memory Access, has been around in Intel processors since 2008, when they introduced their Nehalem processors.
NUMA and vNUMA
Since vSphere 5.0, VMware has supported exposing the NUMA topology of the underlying processors to the guest OS. It’s automatically enabled if you create a VM that has more than 8 vCPUs; the only requirement is that your VM is at hardware version 8. You can manually edit advanced settings so that the NUMA topology is exposed even if you have a lower number of vCPUs. However, it is strongly recommended that you clearly understand how that impacts your VM before you do that.
CPU hot-add and vNUMA
One thing you have to be aware of when thinking about deploying machines that cross NUMA boundaries is that if you enable CPU hot-add on your guest, VMware hides the NUMA information from the guest OS, meaning that the guest OS can’t smartly place applications and memory on the same NUMA node. So for performance-intensive systems I would recommend that you turn off the CPU hot-add feature. Memory hot-add, however, still works.
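The rules above can be summed up in a few lines of Python. This is only a sketch of the default behaviour as I’ve described it in this post; it ignores the advanced settings that let you override the vCPU threshold, and the function name `vnuma_exposed` is made up for illustration.

```python
def vnuma_exposed(vcpus, hardware_version, cpu_hot_add):
    """Return True if the guest would see a virtual NUMA topology by default.

    Encodes the three rules from the text: CPU hot-add hides NUMA from the
    guest, and otherwise vNUMA needs more than 8 vCPUs and hardware version 8.
    Manual advanced-setting overrides are deliberately not modelled.
    """
    if cpu_hot_add:
        return False
    return vcpus > 8 and hardware_version >= 8


print(vnuma_exposed(16, 8, cpu_hot_add=False))  # large VM, hot-add off
print(vnuma_exposed(16, 8, cpu_hot_add=True))   # same VM with CPU hot-add on
print(vnuma_exposed(8, 8, cpu_hot_add=False))   # at or below the 8 vCPU threshold
```

The second call is the case to watch out for: the VM is big enough to span NUMA nodes, but the guest is never told about it.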
Images show Coreinfo run on a 4 CPU machine with 4 cores before and after CPU hot-add was enabled.
Crossing a NUMA boundary
NUMA is a memory thing, but you can cross a NUMA boundary in more than one way. If you have a 4-way machine with 8-core CPUs and 128 GB of memory, that gives a NUMA boundary at 8 vCPUs and 32 GB of RAM, meaning that if you create a VM with more than 8 vCPUs OR more than 32 GB of RAM, that VM might have to access memory from another CPU.
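As a rough sketch, the boundary calculation looks like this in Python. It assumes one NUMA node per physical socket and memory spread evenly across the sockets, which is not a given on real hardware; the function names are my own.

```python
def numa_node_size(sockets, cores_per_socket, host_memory_gb):
    """Return (cores, memory in GB) of one NUMA node.

    Assumes one node per socket and memory distributed evenly across sockets.
    """
    return cores_per_socket, host_memory_gb / sockets


def crosses_numa_boundary(vm_vcpus, vm_memory_gb, sockets, cores_per_socket, host_memory_gb):
    """Return True if the VM is too big for a single NUMA node on CPU OR memory."""
    node_cores, node_memory_gb = numa_node_size(sockets, cores_per_socket, host_memory_gb)
    return vm_vcpus > node_cores or vm_memory_gb > node_memory_gb


# The 4-way, 8-core, 128 GB host from the example above
print(numa_node_size(4, 8, 128))
print(crosses_numa_boundary(8, 32, 4, 8, 128))   # fits exactly in one node
print(crosses_numa_boundary(16, 16, 4, 8, 128))  # too many vCPUs
print(crosses_numa_boundary(4, 48, 4, 8, 128))   # too much memory
```

The last case is the one that is easy to miss: a small 4 vCPU VM still crosses the boundary if it is given more memory than one node holds.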
If you cross a NUMA boundary there is a penalty, which can be shown in part by Coreinfo.
This example is a 4-way guest with 8 cores on each socket for a total of 32 cores, but it runs on a 2×10 core system, so 12 of the cores run on hyper-threads. It shows the penalty you pay for accessing memory that’s located on another node, and it has deemed that accessing local memory is the fastest. I have, however, seen some systems where Node 0 to Node 0 was slower than Node 1 to Node 1.
In normal environments that might not really be a problem, as access to storage or networking is way slower than accessing memory on another CPU. But in high-performance clusters this is something you should consider when building your physical infrastructure. I’ve spoken to one customer who had seen up to a 35% degradation in overall performance from crossing a NUMA boundary, and because of that they only recommend VMs that can fit into 1 physical CPU and its memory.
Understand your hardware Configurations
One thing you really need to pay attention to is your hardware configuration. Setting socket and core configs in VMware that don’t match your physical hardware configuration can decrease performance a lot. Mark Achtemichuk has a nice blog post which shows how much performance can differ depending on the settings you select in vSphere for Number of virtual sockets and Number of cores per socket.
On top of that, your hardware vendor might not be aware that the config they’re selling you is crossing NUMA boundaries. For example, I’ve bought 2-CPU blades from Dell that were put on 4-way motherboards so I could add more CPUs later. But then, to keep the cost down, Dell used 4 GB memory modules and distributed them evenly across all the banks on the motherboard, meaning that half my memory can now only be accessed with a penalty. If memory performance is needed in your environment, then the cost of using 8 GB memory modules might be worth it.
Microsoft and NUMA support
Since the Enterprise and Datacenter editions of Windows 2003, Windows Server has been NUMA-aware, meaning the OS can schedule its threads so that processes that access the same areas of memory are put onto the same NUMA node. This helps reduce the penalty seen when crossing a NUMA boundary. Microsoft SQL Server has had NUMA support since 2005, but it seems that it wasn’t fully supported until 2008 R2. For SQL Server it means that each database engine is started on its own NUMA node. Even if there is only 1 database engine, it will attempt to start that engine on the second node. This is because the OS allocates memory on the first node, so node 0 has less memory available to other applications than the other nodes.
For SharePoint farms, Microsoft’s best practice actually advises you NOT to cross NUMA boundaries, but to adopt a scale-out approach instead. Of course that means selling more SharePoint licenses, but make sure you test how your performance is impacted when crossing a NUMA boundary on a SharePoint farm.