JEP 270: Reserved Stack Areas for Critical Sections

Summary

Reserve extra space on thread stacks for use by critical sections, so that they can complete even when stack overflows occur.

Goals

Provide a mechanism to mitigate the risk of deadlocks that arise when a StackOverflowError thrown in a critical section corrupts critical data such as java.util.concurrent locks (for example, ReentrantLock).
The solution must be mostly JVM-based in order not to require modifications to java.util.concurrent algorithms or published interfaces, or existing library and application code.
The solution must not be limited to the ReentrantLock case, and should be applicable to any critical section in privileged code.

Non-Goals

The solution doesn't aim to provide robustness against stack overflows to non-privileged code.
The solution doesn't aim to avoid StackOverflowErrors, but rather to mitigate the risk that such an error is thrown inside a critical section and thereby corrupts some data structures.
The proposed solution is a trade-off between solving some well-known corruption cases while preserving performance, with reasonable resource cost and relatively low complexity.

Motivation

StackOverflowError is an asynchronous exception that can be thrown by the Java Virtual Machine whenever the computation in a thread requires a larger stack than is permitted (JVM spec §2.5.2 and §2.5.6). The Java Language Specification permits a StackOverflowError to be thrown synchronously by method invocation (JLS §11.1.3). The HotSpot VM uses this property to implement a "stack-banging" mechanism on method entry.

The stack-banging mechanism is a clean way to report that a stack overflow has occurred while preserving the JVM's integrity, but it doesn't provide a safe way for the application to recover from this situation. A stack overflow could occur in the middle of a sequence of modifications which, if not complete, could leave a data structure in an inconsistent state.

For instance, when a StackOverflowError is thrown in a critical section of the java.util.concurrent.locks.ReentrantLock class, the lock status can be left in an inconsistent state, leading to potential deadlocks. The ReentrantLock class uses an instance of AbstractQueuedSynchronizer to implement its critical section. The implementation of its lock() method is:

final void lock() {
    if (compareAndSetState(0, 1))
        setExclusiveOwnerThread(Thread.currentThread());
    else
        acquire(1);
}

The method tries to change the status word with an atomic operation. If the modification is successful then the owner is set by invoking a setter method, otherwise the slow path is invoked. The problem is that if a StackOverflowError is thrown after the status word has been changed and before the owner has been effectively set then the lock becomes unusable: Its status word indicates it is locked but no owner has been set, so no thread can unlock it. Because stack-size checks are performed at method-invocation time (in HotSpot, at least), a StackOverflowError can be thrown either when Thread.currentThread() is invoked or when setExclusiveOwnerThread() is invoked. In either case it leads to a corruption of the ReentrantLock instance, and all threads trying to acquire this lock will be blocked forever.
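The failure mode can be reproduced deterministically in a toy model. The sketch below is not the JDK's code: ToyLock and ToyLockDemo are hypothetical classes that mimic the two-step fast path, and the StackOverflowError is thrown by hand at the point where a method-entry stack check could fire.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical toy lock that mimics the two-step fast path shown above.
// It is not the JDK implementation.
class ToyLock {
    private final AtomicInteger state = new AtomicInteger(0); // 0 = free, 1 = held
    private Thread owner;                                      // set only after the CAS
    boolean failOnSetOwner;                                    // hook: simulate the overflow

    void lock() {
        if (state.compareAndSet(0, 1)) {
            setOwner(Thread.currentThread());
        } else {
            throw new IllegalStateException("slow path omitted in this sketch");
        }
    }

    private void setOwner(Thread t) {
        // Stand-in for the stack check performed on method entry.
        if (failOnSetOwner) {
            throw new StackOverflowError("simulated");
        }
        owner = t;
    }

    boolean tryUnlock() {
        // Only the recorded owner may release the lock.
        if (owner != Thread.currentThread()) {
            return false;
        }
        owner = null;
        state.set(0);
        return true;
    }

    boolean isHeld()   { return state.get() == 1; }
    boolean hasOwner() { return owner != null; }
}

class ToyLockDemo {
    // Returns true if the lock ends up held with no owner recorded.
    static boolean corruptLock(ToyLock lock) {
        lock.failOnSetOwner = true;
        try {
            lock.lock();
        } catch (StackOverflowError e) {
            // The caller "recovers", but the lock state is already torn.
        }
        return lock.isHeld() && !lock.hasOwner();
    }

    public static void main(String[] args) {
        ToyLock lock = new ToyLock();
        System.out.println("corrupted: " + corruptLock(lock));
        System.out.println("can unlock: " + lock.tryUnlock());
    }
}
```

Running main prints `corrupted: true` followed by `can unlock: false`: the status word says the lock is held, yet no thread is able to release it.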

This particular problem caused some serious issues in JDK 7 because parallel class loading was implemented using a ConcurrentHashMap and, at that time, the ConcurrentHashMap code used ReentrantLock instances. If a ReentrantLock instance was corrupted because of a StackOverflowError then the class-loading mechanism itself could deadlock. (This happened in stress tests (JDK-7011862), but could also happen in the field.)

The implementation of the ConcurrentHashMap class was completely changed in June 2013. The new implementation uses synchronized statements rather than ReentrantLock instances, so JDK 8 and later releases are not subject to class-loading deadlock due to corrupted ReentrantLocks. However, any code using ReentrantLock can still be impacted and cause deadlock. Such issues have already been reported on the concurrency-interest@cs.oswego.edu mailing list.

The problem is not limited to the ReentrantLock class.

Java applications or libraries often rely on the consistency of data structures to work properly. Any modification of those data structures is a critical section: Before the execution of the critical section the data structures are consistent, and after its execution the data structures are consistent too. During its execution, however, the data structure could go through transient inconsistent states.

If a critical section is made of a single Java method containing no other method invocation, the current stack overflow mechanism works well: Either the available stack is sufficient and the method executes without trouble, or it is not sufficient and so a StackOverflowError is thrown before the first bytecode of the method is executed.

The problem occurs when a critical section is made of several methods, for instance a method A which invokes a method B. The available stack can be sufficient to let method A start its execution. Method A starts to modify a data structure and then invokes method B, but the remaining stack is not sufficient to execute B, causing a StackOverflowError to be thrown. Because method B and the remainder of method A have not been executed, the consistency of the data structure might have been compromised.

Description

The main idea of the proposed solution is to reserve some space on the execution stack for critical sections, to allow them to complete their execution where regular code would have been interrupted by a stack overflow. The assumption is that critical sections are relatively small and do not require enormous space on the execution stack to complete successfully. The goal is not to rescue a faulty thread which hits its stack limit, but rather to preserve shared data structures that could be corrupted if the StackOverflowError is thrown in a critical section.

The main mechanism will be implemented in the JVM. The only modification required in the Java source code is the annotation that must be used to identify the critical sections. This annotation, currently named jdk.internal.vm.annotation.ReservedStackAccess, is a runtime method annotation that can be used by any class of privileged code (see paragraph below about the accessibility of this annotation).

In order to prevent the corruption of shared data structures, the JVM will try to delay the throwing of a StackOverflowError until the thread in question has exited all of its critical sections. Each Java thread has a new zone defined in its execution stack, called the reserved zone. This zone can be used only if the Java thread has a method annotated with jdk.internal.vm.annotation.ReservedStackAccess in its current call stack. When a stack overflow condition is detected by the JVM, and the thread has an annotated method in its call stack, the JVM grants temporary access to the reserved zone until no more annotated methods are present in the call stack. When access to the reserved zone is revoked, a delayed StackOverflowError is thrown. If the thread has no annotated method in its call stack when the stack overflow condition is detected then the StackOverflowError is thrown immediately (this is current JVM behavior).

Note that the reserved stack space is usable by annotated methods but also by methods invoked, directly or transitively, from them. The nesting of annotated methods is naturally supported, but there's a single shared reserved zone per thread; that is, the invocation of an annotated method does not add a new reserved zone. The sizing of the reserved zone must be done according to the worst case of all annotated critical sections.

By default, the jdk.internal.vm.annotation.ReservedStackAccess annotation is applicable only to privileged code (code loaded by the bootstrap or the extension class loader). Both privileged code and non-privileged code can be annotated with this annotation but by default the JVM will ignore it for non-privileged code. The rationale behind this default policy is that the reserved stack space for critical sections is a shared resource among all critical sections. If any arbitrary code is able to use this space then it is not a reserved space anymore, and this would defeat the whole solution. A JVM flag is available, even in product builds, to relax this policy and allow any code to be able to benefit from this feature.
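As a rough illustration of how a critical section is marked and how the default policy filters it, consider the sketch below. Since jdk.internal.vm.annotation.ReservedStackAccess is not accessible to ordinary application code, the example declares a hypothetical stand-in annotation, StandInReservedStackAccess; the classes AnnotatedCounter and AnnotationPolicy are also invented for illustration. The privilege test is simplified to "loaded by the bootstrap loader" and ignores the extension loader.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;

// Hypothetical stand-in for jdk.internal.vm.annotation.ReservedStackAccess.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface StandInReservedStackAccess {}

class AnnotatedCounter {
    private int value;
    private int updates;

    // The two-field update is marked as a single critical section.
    @StandInReservedStackAccess
    void increment() {
        value++;
        recordUpdate(); // covered too: it is invoked from an annotated method
    }

    private void recordUpdate() { updates++; }

    boolean isConsistent() { return value == updates; }
}

class AnnotationPolicy {
    // True if a method with this name carries the stand-in annotation.
    static boolean isAnnotated(Class<?> c, String methodName) {
        for (Method m : c.getDeclaredMethods()) {
            if (m.getName().equals(methodName)
                    && m.isAnnotationPresent(StandInReservedStackAccess.class)) {
                return true;
            }
        }
        return false;
    }

    // Models the default policy: honor the annotation only for classes loaded
    // by the bootstrap loader (getClassLoader() == null), unless the
    // restriction is relaxed. Simplification: the extension loader is ignored.
    static boolean honored(Class<?> c, String methodName, boolean restrictToPrivileged) {
        boolean privileged = c.getClassLoader() == null;
        return isAnnotated(c, methodName) && (privileged || !restrictToPrivileged);
    }

    public static void main(String[] args) {
        System.out.println("annotated: "
                + isAnnotated(AnnotatedCounter.class, "increment"));
        System.out.println("honored by default: "
                + honored(AnnotatedCounter.class, "increment", true));
        System.out.println("honored when relaxed: "
                + honored(AnnotatedCounter.class, "increment", false));
    }
}
```

In this model an application class can carry the annotation, but it only takes effect once the restriction is lifted, which mirrors the role of the JVM flag mentioned above.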

Implementation

In the HotSpot VM, each Java thread has two zones defined at the end of its execution stack: the yellow zone and the red zone. Both memory areas are protected against all accesses.

If, during its execution, a thread tries to use the memory in the yellow zone, a protection fault is triggered, the protection of the yellow zone is temporarily removed, and a StackOverflowError is created and thrown. Before unwinding the thread execution stack to propagate the StackOverflowError, the protection of the yellow zone is restored.

If the thread tries to use the memory in its red zone, the JVM immediately branches to JVM error-reporting code, leading to the generation of an error report and a crash dump of the JVM process.

The new zone defined by the proposed solution is placed just before the yellow zone. This reserved zone will behave like regular stack space if the thread has a ReservedStackAccess-annotated method in its call stack, and like the yellow zone otherwise.

During the setup of the execution stack of a Java thread, the reserved zone is protected the same way as the yellow and the red zones. If, during its execution, the thread hits its reserved zone, a SIGSEGV signal is generated and the signal handler applies the following algorithm:

If the address of the fault is in the red zone, generate a JVM error report and a crash dump.
If the address of the fault is in the yellow zone, create and throw a StackOverflowError.
If the address of the fault is in the reserved zone, perform a stack walk to check if there's a method annotated with jdk.internal.vm.annotation.ReservedStackAccess on the call stack. If not, create and throw a StackOverflowError. If an annotated method is found, remove the protection of the reserved zone and store in the C++ Thread object the stack pointer of the outermost activation (frame) related to an annotated method.
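The three cases above can be sketched as a small decision function. This is a toy model in Java rather than HotSpot's C++ signal handler: the class StackZones, its field names, and the use of plain long values as addresses are all assumptions made for illustration, and the stack walk is reduced to a boolean argument.

```java
// Toy model of the fault-handling decision; not HotSpot code.
class StackZones {
    enum Action { CRASH_WITH_ERROR_REPORT, THROW_STACK_OVERFLOW_ERROR, GRANT_RESERVED_ZONE }

    // Stacks grow downward, so the red zone occupies the lowest addresses,
    // followed by the yellow zone and then the reserved zone.
    final long redBase, yellowBase, reservedBase, reservedEnd;

    StackZones(long stackEnd, long redSize, long yellowSize, long reservedSize) {
        redBase = stackEnd;
        yellowBase = redBase + redSize;
        reservedBase = yellowBase + yellowSize;
        reservedEnd = reservedBase + reservedSize;
    }

    // annotatedMethodOnStack stands in for the result of the stack walk.
    Action onFault(long addr, boolean annotatedMethodOnStack) {
        if (addr >= redBase && addr < yellowBase) {
            return Action.CRASH_WITH_ERROR_REPORT;
        }
        if (addr >= yellowBase && addr < reservedBase) {
            return Action.THROW_STACK_OVERFLOW_ERROR;
        }
        if (addr >= reservedBase && addr < reservedEnd) {
            return annotatedMethodOnStack
                    ? Action.GRANT_RESERVED_ZONE
                    : Action.THROW_STACK_OVERFLOW_ERROR;
        }
        throw new IllegalArgumentException("address is outside the guard zones");
    }

    public static void main(String[] args) {
        // Illustrative layout: 4K pages, 1 red, 2 yellow, 1 reserved.
        StackZones z = new StackZones(0x10000L, 0x1000L, 0x2000L, 0x1000L);
        System.out.println(z.onFault(0x13800L, true));
        System.out.println(z.onFault(0x13800L, false));
    }
}
```

The zone sizes used in main are arbitrary; real sizes depend on the platform and on JVM configuration.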

If the protection of the reserved zone has been removed to allow a critical section to complete its execution, the protection must be restored and the delayed StackOverflowError thrown as soon as the thread exits the critical section. The HotSpot interpreter has been modified to check if the registered outermost annotated method is being exited. The check is performed on every frame-activation removal by comparing the value of the stack pointer being restored with the value stored in the C++ Thread object. If the restored stack pointer is above the stored value (stacks grow downward), a call to the runtime is performed to change the memory protection and reset the stack pointer value in the Thread object before jumping to the StackOverflowError generation code. The two compilers have been modified to perform the same check on method exit, but only for ReservedStackAccess annotated methods or methods with annotated methods in-lined in their compiled code.

When an exception is thrown, the control flow doesn't go through the regular method-exit code, so there's a possibility that the protection of the reserved zone will not be restored correctly if the exception is propagated above the annotated method. To prevent this situation, the protection of the reserved zone is restored and the stack pointer value stored in the C++ Thread object is reset each time an exception starts being propagated. In this scenario, the delayed StackOverflowError is not thrown. The rationale is that the thrown exception is more important than the delayed StackOverflowError because it indicates a cause and a point where normal execution has been interrupted.
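The bookkeeping described in the last two paragraphs can be modeled in a few lines. The sketch below is a simplified Java model, not the interpreter or compiler code: ReservedZoneState is a hypothetical class, stack pointers are plain long values, and a boolean flag replaces whatever sentinel value the real Thread object uses when no access has been granted.

```java
// Toy model of the per-thread state used to re-protect the reserved zone.
class ReservedZoneState {
    private boolean zoneProtected = true;
    private long activationSp;  // SP of the outermost annotated frame, valid when unprotected

    // Called by the fault handler when access to the reserved zone is granted.
    void grant(long outermostAnnotatedFrameSp) {
        zoneProtected = false;
        activationSp = outermostAnnotatedFrameSp;
    }

    // Called when a frame is removed and the stack pointer is restored.
    // Returns true if the delayed StackOverflowError must be thrown now.
    boolean onFrameExit(long restoredSp) {
        // Fast path: nothing was granted, or we are still inside the critical
        // section (stacks grow downward, so "inside" means a lower or equal SP).
        if (zoneProtected || restoredSp <= activationSp) {
            return false;
        }
        reprotect();
        return true;
    }

    // Called when any exception starts propagating: the zone is re-protected,
    // but the delayed StackOverflowError is dropped.
    void onExceptionPropagation() {
        reprotect();
    }

    private void reprotect() {
        zoneProtected = true;
        activationSp = 0;
    }

    boolean isProtected() { return zoneProtected; }

    public static void main(String[] args) {
        ReservedZoneState s = new ReservedZoneState();
        s.grant(0x1000L);
        System.out.println("inner frame exit throws: " + s.onFrameExit(0x0F00L));
        System.out.println("outer frame exit throws: " + s.onFrameExit(0x1100L));
    }
}
```

The model makes the asymmetry explicit: a normal exit from the outermost annotated frame both re-protects the zone and raises the delayed error, while exception propagation only re-protects it.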

Throwing a StackOverflowError is the Java way to notify the application that a thread reached its stack limits. However, exceptions and errors are sometimes caught by Java code and the notification is lost or not handled correctly, which can make the investigation of the issue really hard. To ease troubleshooting of stack overflow errors in the presence of a reserved stack area, the JVM provides two other notifications when access to the reserved stack area is granted: One is a warning printed by the JVM (on the same stream as all other JVM messages), and the second is a JFR event. Note that even if the delayed StackOverflowError is not thrown because another exception has been thrown in a critical section, the JVM warning and the JFR event are generated and are available for troubleshooting.

The reserved-stack feature is controlled by two JVM flags, one to configure the size of the reserved zone (all threads use the same size), and one to allow non-privileged code to use the feature. Setting the size of the reserved zone to zero disables the feature entirely. When disabled, interpreted code and compiled code do not perform the check on method exit.
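One way to inspect these settings on a running JVM is through the HotSpot diagnostic MXBean. The sketch below assumes that the two flags are named StackReservedPages and RestrictReservedStack; those names come from the HotSpot sources as best recalled, not from this document, so treat them as assumptions. The helper class ReservedStackFlags is hypothetical and returns "unavailable" when the running JVM does not expose a flag.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

class ReservedStackFlags {
    // Returns the current value of a HotSpot flag, or "unavailable" if this
    // JVM does not expose it (older release, other VM, or unknown name).
    static String flagValue(String name) {
        try {
            HotSpotDiagnosticMXBean bean =
                    ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            return bean.getVMOption(name).getValue();
        } catch (RuntimeException | LinkageError e) {
            return "unavailable";
        }
    }

    public static void main(String[] args) {
        // Flag names are assumptions; see the note above.
        System.out.println("StackReservedPages = " + flagValue("StackReservedPages"));
        System.out.println("RestrictReservedStack = " + flagValue("RestrictReservedStack"));
    }
}
```

The values printed depend on the platform and on the options the JVM was started with, so no particular output should be expected.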

Memory cost of this solution: For each thread the cost is the virtual memory of its reserved zone, as part of its stack space. The option to implement the reserved zone in a different memory area, as an alternate stack, has been considered. It would, however, significantly increase the complexity of any stack-walking code, so this option has been rejected.

Performance cost: measurements done with JSR-166 tests on ReentrantLocks didn't show any significant impact on performance on x86 platforms.

Performance

Here's how this solution could impact performance.

The most costly operation in this solution is the stack walking performed when looking for an annotated method in the call stack. This operation is performed only when the JVM has detected a potential stack overflow. Without this fix, the JVM would throw a StackOverflowError. So even if the operation is relatively costly, it is better than the current behavior since it will prevent data corruption. The most frequently executed part of this solution is the check performed when an annotated method exits, to determine whether the protection of the reserved zone has to be re-enabled. The performance-critical version of this check is in the compiler. The current implementation adds the following code sequence to the compiled code of an annotated method:

0x00007f98fcef5809: cmp    rsp,QWORD PTR [r15+0x298]
0x00007f98fcef5810: jle    0x00007f98fcef583c
0x00007f98fcef5816: mov    rdi,r15
0x00007f98fcef5819: test   esp,0xf
0x00007f98fcef581f: je     0x00007f98fcef5837
0x00007f98fcef5825: sub    rsp,0x8
0x00007f98fcef5829: call   0x00007f9910f62670  ;   {runtime_call}
0x00007f98fcef582e: add    rsp,0x8
0x00007f98fcef5832: jmp    0x00007f98fcef583c
0x00007f98fcef5837: call   0x00007f9910f62670  ;   {runtime_call}

This code is for the x86_64 platform. In fast cases (no need to re-enable protection of the reserved zone) it adds two instructions including a small jump. The version for x86_32 is bigger because it doesn't have the address of the Thread object always available in a register. The feature is also implemented for Solaris/SPARC.

Open issues

The default size of the reserved zone is still an open issue. This size will depend on the longest critical zone in JDK code that uses the ReservedStackAccess annotation and will also depend on the platform architecture. We could also consider different defaults depending upon whether the JVM is running on a high-end server or in a virtual-memory-constrained environment.

To mitigate the sizing issue a debug/troubleshooting feature has been added. This feature is enabled by default on debug builds and available as a diagnostic JVM option in product builds. When activated, it is run when the JVM is about to throw a StackOverflowError: It walks the call stack and if one or more methods annotated with the ReservedStackAccess annotation are found, their names are printed with a warning message on the JVM standard output. The name of the JVM flag controlling this feature is PrintReservedStackAccessOnStackOverflow.

The default size of the reserved area is one page (4K) and experiments have shown that this is sufficient to cover the critical sections of java.util.concurrent locks that have been annotated so far.

The reserved stack area is not fully supported on Windows platforms. During the development of the feature on Windows, a bug was found in the way the stack's special zones are controlled (JDK-8067946). This bug prevents the JVM from granting access to the reserved stack area. As a consequence, when a stack overflow condition is detected on Windows, and an annotated method is on the call stack, the JVM warning is printed, the JFR event is fired, and a StackOverflowError is thrown immediately. There's no change in the behavior of the JVM for the application. However, the JVM warning and the JFR event can help troubleshooting, indicating that a potentially-harmful situation occurred.

Alternatives

Several alternative approaches have been considered and some of them have been implemented and tested. Here's a list of those approaches.

Language-based solutions:

try/catch/finally constructs: They don't solve anything, since there's no guarantee that the finally clause will not trigger a stack overflow too.
New constructs such as:
```
new CriticalSection(
        () -> {
            // do critical section code
        }).enter();
```
This construct might require significant work in javac and the JVM, and its usage is likely to have high impact on performance compared to the reserved stack area, even when not run in a stack-overflow condition.

Code-transformation solutions:

Avoid method calls (because stack overflow checks are performed at method invocation time) by forcing the JIT to inline all called methods: Inlining could require the loading and initialization of classes not used by the application, forcing inlining could conflict with compiler rules (code size, inlining depth), and inlining is not applicable to all code patterns (e.g., reflection).
Code refactoring to avoid method calls at source level: Refactoring would require the modification of already-complex code (java.util.concurrent), and this kind of refactoring would break encapsulation.

Stack-based solutions:

Extended stack banging: Bang the stack further before entering a critical section: This solution has a performance cost, even when not in a stack-overflow condition, and it is hard to maintain with nested critical sections.
Extensible stacks: Build stacks from several non-contiguous memory chunks, adding a new chunk when a stack overflow is detected: This solution adds significant complexity to the JVM to manage non-contiguous stacks (including all the logic currently based on pointer comparisons in stack management); it could also require us to copy/move some section of the stack, and it puts more pressure on the memory-allocation backend due to fragmentation issues.

Testing

This change comes with a reliable unit test able to reproduce the java.util.concurrent.locks.ReentrantLock corruption caused by a stack overflow.

Dependencies

The reserved stack area relies on the "yellow pages" mechanism. This mechanism is currently partly broken on Windows (JDK-8067946), so the reserved stack area is not fully supported on this platform.