Saturday, April 4, 2015

SCOM: Optimizing the SQL Server Page Life Expectancy monitor

Something I come across a lot in the field when talking to people about the SQL Server Management Pack for SCOM is people claiming "The Page Life Expectancy monitor doesn't work" or "it's too noisy". In a lot of those cases the SCOM Administrators or even the DBA's are either about to or have just disabled the monitor. In my opinion that is the wrong decision and the "easy way out". In this post I will explain how I optimize this monitor for an environment.

Debunk the "it's broken" myth
Firstly lets talk about the usual reaction. Most people I talk to believe this monitor is broken. This is typically because this monitor will alert and then after some time resolve the alert (the intended behavior of a Monitor). The reason they reach this conclusion is because of the frequency at which this monitor will alert, but that is not because it is broken but because it has not yet been tuned for your environment.

I should also point at this time that just because you are receiving an alert for Page Life Expectancy doesn't mean that:
  • You have Memory Pressure
  • There are Buffer Pool issues
  • That your server needs more memory
These are just some other myths that should be debunked. I'm not saying it isn't possible to use the Page Life Expectancy to qualify those situations, but there is more to it than that, so you shouldn't just use this counter alone to reach those conclusions.

Page Life Expectancy counter
The accuracy and way we must look at this counter has changed over the years as a result of improvements in hardware (e.g. NUMA), virtualization of infrastructure, reduced cost of memory allowing for very large physical machines.

Paul Randal has a very detailed write up about this counter and it is well worth reading

Lets look at factors we need to consider with this counter and hardware configurations.

NUMA hardware
For NUMA hardware the Page Life Expectancy counter “SQLServer:Buffer Manager” is not always as accurate because it is a aggregated view of all Buffer Nodes. Paul’s blog describes this so I won’t go into the details, but essentially it can be more accurate to monitor the counter on the “SQLServer:Buffer Node” object rather than the consolidated object when using NUMA hardware.

Non-NUMA hardware

For Non-NUMA hardware, the “SQLServer:Buffer Manager” is no longer an aggregated value, so this is generally what will be targeted for monitoring. As it is impossible to create a single collection rule in SCOM to account for all possible scenarios, this is also the counter which has been chosen, largely due to legacy but also because in a world of increasing acceptance for virtualised environments then this is still the typical counter used in most scenarios.

One of the main challenges is that the “suggested” threshold that most DBA’s apply, and monitoring systems use, including but not limited to the SCOM SQL Management Pack, is based on generalized hardware specifications and usage patterns, unfortunately this does not necessarily taken into account the advancements of hardware and increases in memory allocations. It is also impossible to have one threshold that covers all the various hardware configurations and application usage patterns we have these days.

The reason that this is so important is that we need to look at and tune the threshold being monitored based on the hardware specifications and usage patterns specific to the individual environment. For example monitoring this value on a Virtual Machine with 4 Gb of RAM is likely to have a vastly different pattern to a physical machine with NUMA hardware and 96 Gb of memory. The same can be said when comparing a transactional application (OLTP) verses a dataset intensive application (OLAP) applications.

Application behavior
The other input to this counter is the fact that application behavior can have a significant impact on it's value. For example if the application suddenly imports a large volume of data, this has to pass through the buffer pool and if your buffer pool is already full (which ideally it should be), then the database engine will need to drop the clean pages from the buffer pool to free up room for this new data. This process will have an impact on the Page Life Expectancy value and it should take a dip. The following graph shows an application usage pattern where there are different datasets consistently being read into the buffer pool and the engine having to make room for these by dropping clean buffers first, thus reducing the life expectancy of a single page. This behavior on it's own may not be impacting on end user experience and it is not always viable to add sufficient memory to a server to host the entire database in memory (nor should you want to without considering some of the new alternative features). Although the monitor could alert depending on the threshold in place, knowing the application usage pattern allows you to determine if this is expected or not.

Optimizing the Page Life Expectancy counter for your environment
The typical way I approach this monitor is to leave the threshold value as default but when it alerts in the console I use the following process to optimize it for the environment. It is important to note that there is no "one size fits all" or "magic wand" you can use here if you want to monitor this correctly.

This is the process I typically follow.
  1. Receive the alert or run a report on servers which have alerted for low Page Life Expectancy
  2. Open the Performance View in SCOM for the affected server and select the Page Life Expectancy counter
  3. Review the behavior over an extended duration. There is no point looking at just a few hours here, you need to understand the server/application behavior over a week, month, or longer to effectively decide on a more suitable threshold for this counter.
  4. Review the hardware configuration of the affected server to determine if there is a suitable threshold based on the total memory allocated to SQL (see Paul Randal’s blog for an equation)
  5. Adjust the monitoring as required using one or more of the following approaches
    1. Depending on the typical hardware configuration in your environment you may choose to lower or increase the threshold of the base monitor, and then use groups to manage the exceptions such as the following points. I typically choose to leave the threshold as is and use 2 groups to manage the exceptions of either high or low baselines, particularly in an environment of a mixture of hardware configurations (VMs vs Physical and small Memory vs large Memory allocations)
    2. If the threshold needs to be lowered for a selection of servers, then I typically create a group to hold the specific servers and a suitable override for that group
      E.g. Group "SQL Servers Page Life Expectancy Low" with threshold of 100.
    3. If the threshold needs to be increased for a selection of servers, then again create a group to hold that specific server list and a suitable override for that group.
      E.g. Group "SQL Servers Page Life Expectancy High" with threshold of 900.
  6. Continue to monitor the server behavior and ensure the changes are appropriate to the application usage pattern.
  7. Routinely review the thresholds of any overrides and applicable group membership for any groups to ensure they are still applicable.

When evaluating the behavior of the server, it is also important to review the supporting counters like the Batch Requests/sec, Buffer Cache Hit Ratio, Page Reads/sec, Page Lookups/sec, Total Server Memory, Target Server Memory, and related counters to determine if there is in fact an issue with the configuration of SQL Server or just typical application behavior for that server.

Legal Stuff: As always the contents of this blog is provided “as-is”. The information, opinions and views expressed are those of the author and do not necessarily state or reflect those of any other company with affiliation to the products discussed. This includes any URLs or Tools. The author does not accept any responsibility from the use of the information or tools mentioned within this blog, and recommends adequate evaluation against your own requirements to measure suitability.

No comments:

Post a Comment