Oct 12, 2012

Garbage Collector tuning for services with large heaps

I've been working on a Thrift-based Java service that uses a large heap. Most of the heap is occupied by a fairly static 7GB dataset that is reloaded every 24 hours. Real-time updates to the data occur throughout the day, but they are low in both volume and size, and are incorporated into the next day's 'full' dataset. The rest of the heap is made up of very short-lived objects needed to represent and process client requests to the service. Ideally the service should have a response latency on the order of milliseconds, and it met this requirement easily on its initial deployment using just the Concurrent Mark Sweep (CMS) garbage collector.

More recently the traffic has increased by an order of magnitude. Under these conditions the default garbage collector configuration couldn't cope, and 'stop the world' pauses lasting many seconds were killing performance. We analysed the garbage collector's operation with the following JVM options:
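On a HotSpot JVM of this vintage these take the standard form below (the log file path is illustrative):

  -verbose:gc
  -XX:+PrintGCDetails
  -XX:+PrintGCTimeStamps
  -Xloggc:/var/log/myservice-gc.log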


We found that the garbage collector was collecting across the large old generation. However, in the expected usage scenario the old generation should ideally contain only the long-lived static dataset, plus perhaps the real-time data updates delivered throughout the day - nothing should need to be collected from it. The GC logs showed that many objects were frequently tenured into the old generation only to be garbage collected later. So where were all of these objects coming from? The most likely source was the groups of 'short-lived' objects created per request - if these were being promoted into the old generation then it could grow quickly, requiring frequent long collections. So how can you prevent this from happening?
  1. Increase the maximum tenuring age to increase the window of opportunity for young objects to be collected before they reach the old generation.
  2. Increase the size of the new generation to lengthen the interval between new space collection cycles which in turn can make it less likely that young objects will be promoted to the old generation.

Increasing the maximum tenure threshold

On each young generation GC cycle, the garbage collector promotes surviving objects from the eden space into a survivor space. There are two survivor spaces, and the collector alternates between them on each cycle. Each object in a survivor space carries a tenure counter recording the number of GC cycles it has survived there. Every time the survivor space is collected, all objects that remain in the space have their counter incremented. Once an object's counter reaches the maximum tenuring threshold, it is promoted to the old generation.

By keeping objects in the smaller survivor spaces for longer, we give any truly short-lived objects time to die and be collected long before they hit the maximum tenure count. The default maximum tenuring threshold is 4, but it can be configured as high as 15 like so:
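On HotSpot this is a single flag:

  -XX:MaxTenuringThreshold=15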


Increasing the size of the new generation

Ideally we want all short-lived objects to be collected early on, from the new generation. However, if the new space is small then there will be many frequent small collections. With a high object-allocation rate, eden fills up quickly - and if requests are still in progress, many objects will enter the survivor spaces because they are still 'in use'. Because the collections are frequent, the tenure counters on the objects in the survivor spaces increment quickly. Should these objects still be 'in use' by the time they reach the maximum tenuring threshold, they'll enter the old generation - exactly what we want to avoid. To slow this process down we can increase the size of the new space like so:
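On HotSpot the new generation can be sized with -Xmn, which sets both its initial and maximum size; for the 3GB allocation discussed below:

  -Xmn3g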


When increasing the new space we were actually more concerned with increasing eden space. If the time it takes to fill eden (and thus trigger a collection) is significantly greater than the average request duration, then we could hope that the majority of newly created objects will be 'dead' by the time a collection starts, and hence immediately collectable from eden. Thanks to the increased tenuring threshold, those that do survive eden will then have to make it through 15 more GC cycles in a survivor space before being promoted to the old generation. An arbitrary new-generation size of 3GB was chosen as a starting point.

Monitoring the garbage collector

With these settings in place we wanted to get a good grasp of how they affected the garbage collection profile of the application. First, we enabled some extra GC logging to provide insight into the tenuring of objects:
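The tenuring distributions quoted below come from one extra HotSpot flag:

  -XX:+PrintTenuringDistribution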


We also started an instance of VisualVM on the same machine and installed the VisualGC plugin. This plugin provides excellent insight into what's going on inside the garbage collector in real time and can really aid understanding - especially when compared with the rather opaque GC log formats.

VisualGC plugin for VisualVM
The VisualGC screenshot above shows a snapshot of the GC as the service is load tested. The 7GB dataset is the big green block labelled 'old'. The new-generation collections (the dark brown saw-tooth pattern labelled 'eden') are regular, as the load - and thus the object throughput - is fairly constant. It actually takes quite a while for eden to fill up, so we could likely decrease the new size from 3GB without any ill effect. The fact that the survivor spaces (S0, S1, and Histogram) are so empty shows that nearly all per-request objects are disposed of from eden and never make it to the survivor spaces. The 'histogram' section shows that the objects that do make it into a survivor space fall away rapidly with each cycle, so that very little is tenured into the old generation. This is great because the old space is now effectively static and the GC has no cause to perform a large collection on it. However, we were curious to see which objects were making it into the survivor spaces, so we turned to the tenuring distributions in the GC logs (sample edited for brevity):

  2293.760: [GC 2293.760: [ParNew
  Desired survivor size 161054720 bytes, new threshold 15 (max 15)
  - age   1:    8344528 bytes,    8344528 total
  - age   2:    2035920 bytes,   10380448 total
  - age   3:    4351424 bytes,   14731872 total
  - age  15:    1165424 bytes,   ...

What this log entry shows is that around 1MB of data is making it through to the maximum tenuring age. We certainly expected some objects to occasionally reach this point - such as those representing updates to the dataset - but looking across multiple log entries it was evident that the rate of object promotion to the old space was consistent - perhaps something was leaking? It would be extremely convenient if there were some way of asking the GC 'which objects do you have in tenure bucket 15?', because then we'd know exactly what was being promoted - unfortunately there is no way of doing so.

Luckily, we did have a 'good guess'. The service makes use of a monitoring component that uses data buckets to aggregate application metrics over time - each bucket typically holds a minute's worth of data and is valid for up to ten minutes. It seemed likely that these buckets would stay alive long enough to be promoted to the old generation and - in the worst case - fill up the old space. We decided to rewrite this component to use fixed-length queues that are allocated once, on construction. When testing the service with this change the GC logs changed dramatically:
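The rewritten component isn't shown here, but a minimal sketch of the allocate-once, fixed-length approach might look like the following (class and method names are illustrative, not the actual implementation):

```java
// Hypothetical sketch: a fixed-length ring of per-minute counters.
// All storage is allocated once in the constructor, so recording a
// metric in steady state creates no garbage at all.
public class MinuteBuckets {
    private final long[] counts;    // one counter per minute-wide slot
    private final long[] minuteIds; // which minute each slot currently holds
    private final int slots;

    public MinuteBuckets(int slots) {
        this.slots = slots;
        this.counts = new long[slots];
        this.minuteIds = new long[slots];
    }

    // Record one event at the given timestamp (milliseconds).
    public void record(long timeMillis) {
        long minute = timeMillis / 60_000;
        int idx = (int) (minute % slots);
        if (minuteIds[idx] != minute) { // slot has expired: reuse it in place
            minuteIds[idx] = minute;
            counts[idx] = 0;
        }
        counts[idx]++;
    }

    // Read the count for the minute containing timeMillis (0 if expired).
    public long count(long timeMillis) {
        long minute = timeMillis / 60_000;
        int idx = (int) (minute % slots);
        return minuteIds[idx] == minute ? counts[idx] : 0;
    }
}
```

Because the slots are reused in place rather than replaced, nothing in the component lives just long enough to be tenured - in contrast to buckets that are created fresh each minute and survive for ten.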

  5755.168: [GC 5755.168: [ParNew
  Desired survivor size 161054720 bytes, new threshold 15 (max 15)
  - age   1:    9638328 bytes,    9638328 total
  - age   2:      79056 bytes,    9717384 total
  - age   3:         24 bytes,    9717408 total
  - age   4:         56 bytes,    9717464 total
  - age  14:       1024 bytes,    9724088 total
  - age  15:         32 bytes,    ...

This log shows an extremely low rate of promotion from the new space to the old.

So, we were able to use VisualGC and the GC debug logs to gain insight into the garbage collection process. This helped us tune the application so that long pauses were effectively eradicated and latency targets could be reliably met. Of course, a full GC will happen eventually, but our architecture conveniently allows this to be on our terms (when running under normal conditions). Different architectures may require alternative garbage collection strategies, but I recommend profiling with VisualGC early on and then using the GC logs to provide greater detail.

My thanks to my colleague James for pairing up with me on this problem!
