JEP 331: Low-Overhead Heap Profiling

Summary

Provide a low-overhead way of sampling Java heap allocations, accessible via JVMTI.

Goals

Provide a way to get information about Java object heap allocations from the JVM that:

  • Is low-overhead enough to be enabled by default continuously,
  • Is accessible via a well-defined, programmatic interface,
  • Can sample all allocations (i.e., is not limited to allocations that are in one particular heap region or that were allocated in one particular way),
  • Can be defined in an implementation-independent way (i.e., without relying on any particular GC algorithm or VM implementation), and
  • Can give information about both live and dead Java objects.

Motivation

There is a deep need for users to understand the contents of their heaps. Poor heap management can lead to problems such as heap exhaustion and GC thrashing. As a result, a number of tools have been developed to allow users to introspect into their heaps, such as the Java Flight Recorder, jmap, YourKit, and VisualVM tools.

One piece of information that is lacking from most of the existing tooling is the call site for particular allocations. Heap dumps and heap histograms do not contain this information. This information can be critical to debugging memory issues, because it tells developers the exact location in their code where particular (and particularly bad) allocations occurred.

There are currently two ways of getting this information out of HotSpot:

  • First, you can instrument all of the allocations in your application using a bytecode rewriter such as the Allocation Instrumenter. You can then have the instrumentation take a stack trace (when you want one).

  • Second, you can use Java Flight Recorder, which takes a stack trace on TLAB refills and when allocating directly into the old generation. The downsides of this are that a) it is tied to a particular allocation implementation (TLABs), and misses allocations that don’t meet that pattern; b) it doesn’t allow the user to customize the sampling interval; and c) it only logs allocations, so you cannot distinguish between live and dead objects.

This proposal mitigates these problems by providing an extensible JVMTI interface that allows the user to define the sampling interval and returns a set of live stack traces.

Description

New JVMTI event and method

The user-facing API for the heap sampling feature proposed here consists of an extension to JVMTI that allows for heap profiling. It relies on an event notification system that would provide a callback such as:

void JNICALL
SampledObjectAlloc(jvmtiEnv *jvmti_env,
                   JNIEnv *jni_env,
                   jthread thread,
                   jobject object,
                   jclass object_klass,
                   jlong size)

where:

  • thread is the thread allocating the jobject,
  • object is the reference to the sampled jobject,
  • object_klass is the class for the jobject, and
  • size is the size of the allocation.

The new API also includes a single new JVMTI method:

jvmtiError SetHeapSamplingInterval(jvmtiEnv* env, jint sampling_interval)

where sampling_interval is the average number of bytes allocated between samples. The specification of the method is:

  • If non-zero, the sampling interval is updated, and callbacks will subsequently be sent to the user at an average rate of one sample per sampling_interval allocated bytes
    • For example, if the user wants a sample every megabyte, sampling_interval would be 1024 * 1024.
  • If zero is passed to the method, the sampler samples every allocation once the new interval takes effect, which might require a certain number of further allocations

Note that the sampling interval is not precise. Each time a sample occurs, the number of bytes before the next sample is chosen pseudo-randomly with the given average interval. This is to avoid sampling bias; for example, if the same allocations happen every 512KB, a 512KB sampling interval will always sample the same allocations. Therefore, though the sampling interval will not always be the selected interval, after a large number of samples, it will tend towards it.

Use-case example

To enable this, a user would use the usual event notification call to:

jvmti->SetEventNotificationMode(jvmti, JVMTI_ENABLE, JVMTI_EVENT_SAMPLED_OBJECT_ALLOC, NULL)

The event will be sent when the allocation is initialized and set up correctly, so slightly after the actual code performs the allocation. By default, the average sampling interval is 512KB.

The minimum required to enable the sampling event system is to call SetEventNotificationMode with JVMTI_ENABLE and the event type JVMTI_EVENT_SAMPLED_OBJECT_ALLOC. To modify the sampling interval, the user calls the SetHeapSamplingInterval method.

To disable the system,

jvmti->SetEventNotificationMode(jvmti, JVMTI_DISABLE, JVMTI_EVENT_SAMPLED_OBJECT_ALLOC, NULL)

disables the event notifications and disables the sampler automatically.

Calling the sampler again via SetEventNotificationMode will re-enable the sampler with whichever sampling interval is currently set (either the 512KB default or the last value passed by a user via SetHeapSamplingInterval).

New capability

To protect the new feature and make it optional for VM implementations, a new capability named can_generate_sampled_object_alloc_events is introduced into the jvmtiCapabilities.

Global / thread level sampling

Using the notification system provides a direct means to send events only for specific threads. This is done via SetEventNotificationMode, providing the thread to be modified in its event_thread parameter (passing NULL applies the change to all threads).

A full example

The following section provides code snippets to illustrate the sampler's API. First, the capability and the event notification are enabled:

jvmtiEventCallbacks callbacks;
memset(&callbacks, 0, sizeof(callbacks));
callbacks.SampledObjectAlloc = &SampledObjectAlloc;

jvmtiCapabilities caps;
memset(&caps, 0, sizeof(caps));
caps.can_generate_sampled_object_alloc_events = 1;
if (JVMTI_ERROR_NONE != (*jvmti)->AddCapabilities(jvmti, &caps)) {
  return JNI_ERR;
}

if (JVMTI_ERROR_NONE != (*jvmti)->SetEventNotificationMode(jvmti, JVMTI_ENABLE,
                                                           JVMTI_EVENT_SAMPLED_OBJECT_ALLOC, NULL)) {
  return JNI_ERR;
}

if (JVMTI_ERROR_NONE != (*jvmti)->SetEventCallbacks(jvmti, &callbacks, sizeof(jvmtiEventCallbacks))) {
  return JNI_ERR;
}

// Set the sampling interval to 1MB.
if (JVMTI_ERROR_NONE != (*jvmti)->SetHeapSamplingInterval(jvmti, 1024 * 1024)) {
  return JNI_ERR;
}

To disable the sampler (disables events and the sampler):

if (JVMTI_ERROR_NONE != (*jvmti)->SetEventNotificationMode(jvmti, JVMTI_DISABLE,
                                                           JVMTI_EVENT_SAMPLED_OBJECT_ALLOC, NULL)) {
  return JNI_ERR;
}

To re-enable the sampler with the 1024 * 1024 byte sampling interval, a simple call to enable the event is required:

if (JVMTI_ERROR_NONE != (*jvmti)->SetEventNotificationMode(jvmti, JVMTI_ENABLE,
                                                           JVMTI_EVENT_SAMPLED_OBJECT_ALLOC, NULL)) {
  return JNI_ERR;
}

User storage of sampled allocations

When an event is generated, the callback can capture a stack trace using the JVMTI GetStackTrace method. The jobject reference obtained by the callback can also be wrapped into a JNI weak reference to help determine when the object has been garbage collected. This approach allows the user to gather data on which objects were sampled, as well as which are still considered live, which can be a good means of understanding the job's behavior.

For example, something like this could be done:

extern "C" JNIEXPORT void JNICALL SampledObjectAlloc(jvmtiEnv *env,
                                                     JNIEnv *jni,
                                                     jthread thread,
                                                     jobject object,
                                                     jclass klass,
                                                     jlong size) {
  jvmtiFrameInfo frames[32];
  jint frame_count;
  jvmtiError err;

  err = global_jvmti->GetStackTrace(NULL, 0, 32, frames, &frame_count);
  if (err == JVMTI_ERROR_NONE && frame_count >= 1) {
    jweak ref = jni->NewWeakGlobalRef(object);
    internal_storage.add(jni, ref, size, thread, frames, frame_count);
  }
}

where internal_storage is a data structure that stores the sampled objects, determines whether any garbage-collected samples need to be cleaned up, and so on. The internals of that implementation are usage-specific, and out of scope of this JEP.

The sampling interval can be used as a means to mitigate profiling overhead. With a sampling interval of 512KB, the overhead should be low enough that a user could reasonably leave the system on by default.

Implementation details

The current prototype and implementation prove the feasibility of the approach. The implementation contains five parts:

  1. Architecture-dependent changes due to a change of a field name in the ThreadLocalAllocationBuffer (TLAB) structure. These changes are minimal as they are just name changes.
  2. The TLAB structure is augmented with a new allocation_end pointer, to complement the existing end pointer. If the sampling is disabled, the two pointers are always equal and the code performs as before. If the sampling is enabled, end is modified to be where the next sample point is requested. Then, any fast path will "think" the TLAB is full at that point and go down the slow path, which is explained in (3).
  3. The gc/shared/collectedHeap code is changed due to its usage as an entry point to the allocation slow path. When a TLAB is considered full (because allocation has passed the end pointer), the code enters collectedHeap and tries to allocate a new TLAB. At this point, the TLAB is set back to its original size and an allocation is attempted. If the allocation succeeds, the code samples the allocation, and then returns. If it does not, it means allocation has reached the end of the TLAB, and a new TLAB is needed. The code path continues its normal allocation of a new TLAB and determines if that allocation requires a sample. If the allocation is considered too big for the TLAB, the system samples the allocation as well, thus covering in TLAB and out of TLAB allocations for sampling.
  4. When a sample is requested, there is a collector object set on the stack in a place safe for sending the information to the native agent. The collector keeps track of sampled allocations and, at destruction of its own frame, sends a callback to the agent. This mechanism ensures the object is initialized correctly.
  5. If a JVMTI agent has registered a callback for the SampledObjectAlloc event, the event will be triggered and it will obtain sampled allocations. An example implementation can be found in the libHeapMonitorTest.c file, which is used for JTreg testing.

Alternatives

There are multiple alternatives to the system presented in this JEP; the introduction already presented two. Flight Recorder provides an interesting alternative, but the approach proposed here has several advantages over it. First, JFR does not allow the sampling interval to be configured or provide a callback. Next, JFR's use of a buffer system can lead to lost allocations when the buffer is exhausted. Finally, the JFR event system does not provide a means to track objects that have been garbage collected, so it cannot be used to provide information about both live and garbage-collected objects.

Another alternative is bytecode instrumentation using ASM. Its overhead makes it prohibitive and not a workable solution.

This JEP adds a new feature into JVMTI, which is an important API/framework for various development and monitoring tools. With it, a JVMTI agent can use a low overhead heap profiling API along with the rest of the JVMTI functionality, which provides great flexibility to the tools. For instance, it is up to the agent to decide if a stack trace needs to be collected at each event point.

Testing

There are 16 tests in the JTreg framework for this feature that test: turning on/off with multiple threads, multiple threads allocating at the same time, testing if the data is being sampled at the right interval, and if the gathered stacks reflect the correct program information.

Risks and Assumptions

There are no performance penalties or risks with the feature disabled. A user who does not enable the system will not perceive a performance difference.

However, there is a potential performance/memory penalty with the feature enabled. In the initial prototype implementation, the overhead was minimal (<2%); that version used a more heavyweight mechanism that modified JIT'd code. The final version presented here piggy-backs on the TLAB code and should not experience that regression.

Current evaluation of the DaCapo benchmark puts the overhead at:

  • 0% when the feature is disabled

  • 1% when the feature is enabled at the default 512KB interval, but no callback action is performed (i.e., the SampledObjectAlloc callback is empty but registered with the JVM).

  • 3% overhead with a sampling callback that naively stores the data (using the implementation in the tests).