Volatile
2022-04-27

Introduction#

In Java, the volatile keyword is used to mark variables, and memory barriers are the underlying mechanism that implements volatile. Both concepts are fundamental in Java, providing developers with effective mechanisms to guarantee data consistency and program correctness in multi-threaded environments. Compared with other approaches such as synchronized or explicit locking, volatile is lightweight and simple, offering better performance and allowing developers to focus on the actual problem rather than on locking concerns.

The combination of volatile and memory barriers provides developers with an efficient way to address concurrency issues, helping them write correct and efficient code when dealing with complex multi-threaded scenarios. In the following sections, I will document my understanding of volatile and memory barriers based on the source code of JDK17u.

Visibility and Reordering#

Visibility problems occur when one thread modifies the value of a shared variable, but other threads are unable to immediately see the updated value. This phenomenon happens mainly due to various cache levels within the computer system (e.g., L1, L2 caches) and compiler optimizations.

Each CPU has its own cache, and when a thread operates on a variable, it is actually being done within the CPU cache rather than directly on the main memory. Without appropriate synchronization measures, other threads may only see the old value of the variable in main memory, not the latest value in the CPU cache—leading to visibility problems.

Moreover, to increase execution efficiency, the Java Memory Model (JMM) allows the compiler to reorder instructions, which might affect data visibility between threads. To solve visibility issues, Java provides various mechanisms, including the volatile keyword, synchronized keyword, lock mechanisms, and atomic classes in the java.util.concurrent package. These mechanisms help developers guarantee data visibility and ensure program correctness in concurrent programming scenarios.
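As a minimal sketch of the visibility problem (my own example, not taken from the JDK sources): with running declared volatile, the reader thread is guaranteed to observe the main thread's update; remove the volatile modifier and, on some platforms, the loop may spin forever on a stale cached value.

```java
public class VisibilityDemo {
    // Without volatile, the reader thread may keep using a cached value of
    // `running` and never observe the update made by the main thread.
    private static volatile boolean running = true;

    public static void main(String[] args) throws InterruptedException {
        Thread reader = new Thread(() -> {
            while (running) {
                // busy-wait until the flag change becomes visible
            }
            System.out.println("reader observed running = false");
        });
        reader.start();

        Thread.sleep(100);
        running = false;       // volatile write: becomes visible to the reader
        reader.join(1000);     // with volatile, the reader terminates promptly
        System.out.println("reader alive: " + reader.isAlive());
    }
}
```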

Memory Barriers#

In most modern processor architectures, inserting a memory barrier (also called a memory fence) affects how the processor reads and writes data, ensuring that all memory operations issued before the barrier complete before any subsequent memory operations begin. This avoids problems caused by instruction reordering.

Memory barriers do not cause the CPU to flush every read or write to memory immediately. In practice, when the processor encounters a memory barrier instruction, it ensures that all memory operations (read and/or write) before the instruction have completed before any operation after the instruction begins. This guarantees that all memory operations prior to the memory barrier are visible to all subsequent memory operations.

In the Java language, the volatile keyword inserts memory barriers during the compilation of machine instructions (assembly instructions). For a write operation to a volatile variable, the JMM inserts a write memory barrier, ensuring that all previous read/write operations are completed before the write operation, thus making it visible to all threads. For a read operation to a volatile variable, the JMM inserts a read memory barrier, ensuring that all previous write operations are completed before the read operation, so the latest data is visible.

In practice, not every read/write operation needs a memory barrier. Memory barriers are mainly emitted around volatile variable accesses, as well as around lock operations in specific circumstances, and where each barrier is placed depends on the required semantics. In summary:

  • A barrier before every volatile write ensures that all preceding ordinary writes are committed first, so that they become visible together with the volatile write.
  • A barrier after every volatile write guarantees that the volatile write cannot be reordered with any subsequent read or write operation.
  • A barrier after every volatile read prevents subsequent read operations from being reordered before the volatile read.
  • A barrier after every volatile read likewise prevents subsequent write operations from being reordered before the volatile read.

Summarizing, and cross-referencing the JMM specification, the memory barrier insertion rules for volatile variables are as follows:

  1. Insert a StoreStore barrier before every volatile write operation.
  2. Insert a StoreLoad barrier after every volatile write operation.
  3. Insert a LoadLoad barrier after every volatile read operation.
  4. Insert a LoadStore barrier after every volatile read operation.
public class MemoryBarrierExample {
    private volatile boolean flag = false;

    public void write() {
        // ...
        flag = true;  // volatile write
        // StoreLoad barrier inserted here
    }

    public void read() {
        if (flag) {  // volatile read
            // LoadLoad and LoadStore barriers inserted here
            // ...
        }
    }
}

In this example, flag is a volatile variable. In the write() method, a write memory barrier is inserted after the line flag = true;, while in the read() method, read memory barriers are inserted after the load of flag in if (flag).

It is important to note that memory barriers are not part of Java source code. They are implicitly inserted by the Java Memory Model when the code is compiled into machine instructions, and they are invisible to us while writing Java code. The above example is merely to illustrate the relationship between volatile read/write operations and memory barriers.
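These barriers are what give volatile its happens-before guarantee: everything written before a volatile write is visible to any thread that subsequently reads the volatile variable. A minimal sketch of this "safe publication" pattern (my own example):

```java
public class SafePublication {
    private int data = 0;                  // plain field
    private volatile boolean ready = false;

    public void writer() {
        data = 42;        // plain write
        ready = true;     // volatile write: the barrier before it keeps
    }                     // `data = 42` from being reordered past it

    public int reader() {
        if (ready) {      // volatile read: the barriers after it keep the
            return data;  // `data` load below; if ready is true, data is 42
        }
        return -1;        // value not yet published
    }

    public static void main(String[] args) {
        SafePublication p = new SafePublication();
        p.writer();
        System.out.println(p.reader());  // prints 42
    }
}
```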

Low-Level Implementation#

In JDK17u, the parsing entry for volatile is located in the /src/hotspot/share/interpreter/zero/bytecodeInterpreter.cpp file, in the ARRAY_STOREFROM64 macro. You can use the javap -v command to inspect the bytecode of any code that reads or writes a volatile field:
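For reference, a class along these lines could produce the listing that follows (a hypothetical reconstruction; all that matters is the volatile static int field named counter):

```java
class Main {
    // hypothetical field matching the `Field counter:I` entry in the bytecode
    private static volatile int counter = 0;

    public static void main(String[] args) {
        counter = 1;                   // compiles to putstatic on a volatile field
        System.out.println(counter);   // getstatic of System.out, then the field
    }
}
```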

class Main {
    // more...
    public static void main(java.lang.String[]);
    // more...
        1: putstatic     #7   // Field counter:I
        4: getstatic     #13  // Field java/lang/System.out:Ljava/io/PrintStream;
    // more...
}

It is worth tracing the putfield and putstatic bytecode operations back into the JDK source:

  1. The following code mainly deals with the putfield and putstatic operations in Java bytecode, which are used to set an object's instance field values and static field values respectively:

    // set instance field values and static field values of an object
    CASE(_putfield):
    CASE(_putstatic):
    	{
    		// ...
    		// store the result
    		int field_offset = cache->f2_as_index();
    		// the field is declared volatile
    		if (cache->is_volatile()) {
    			// switch...case...
    			OrderAccess::storeload();
    		} else {
    			// more actions...
    		}
    	}

    The OrderAccess::storeload() in this code is the implementation of inserting memory barriers in the Java HotSpot VM. It ensures that all memory write operations before the barrier are considered to have occurred before any memory read operations after the barrier. In other words, after performing a volatile write operation, all subsequent memory read operations will be able to see the result of this write operation, which guarantees the visibility of the volatile field.

  2. In JDK17u, the cache->is_volatile() method was moved to src/hotspot/share/oops/cpCache.hpp:

    // Accessors ...
    int field_index() const          { assert(is_field_entry(), ""); return (_flags & field_index_mask); }
    int parameter_size() const       { assert(is_method_entry(), ""); return (_flags & parameter_size_mask); }
    // Here is defined the function to determine whether the variable is volatile modified
    bool is_volatile() const         { return (_flags & (1 << is_volatile_shift)) != 0; }
    bool is_final() const            { return (_flags & (1 << is_final_shift)) != 0; }
    bool is_forced_virtual() const   { return (_flags & (1 << is_forced_virtual_shift)) != 0; }
    bool is_vfinal() const           { return (_flags & (1 << is_vfinal_shift)) != 0; }
    // Accessors ...
  3. The remaining switch/case checks are delegated to the void oopDesc::release_byte_field_put(int offset, jbyte value) function in src/hotspot/share/oops/oop.cpp. From here, the source code of JDK17u diverges significantly from that of JDK 1.8:

  • In JDK 1.8, the function OrderAccess::release_store in hotspot/src/share/vm/runtime/orderAccess.hpp is called.
  • In JDK17u, the function HeapAccess<MO_RELEASE>::store_at in src/hotspot/share/oops/access.hpp is called.

    Although the source code has changed in these two versions, their ultimate goal is the same: to provide developers with the same memory ordering guarantees.
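The MO_RELEASE store that HotSpot performs internally has a direct Java-level analog in the VarHandle API (available since JDK 9): setRelease performs a release store and getAcquire an acquire load. A minimal sketch (my own example, not from the JDK sources):

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

public class ReleaseStoreDemo {
    static int data;
    static final VarHandle DATA;

    static {
        try {
            DATA = MethodHandles.lookup()
                    .findStaticVarHandle(ReleaseStoreDemo.class, "data", int.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    public static void main(String[] args) {
        DATA.setRelease(7);                           // release store (MO_RELEASE)
        System.out.println((int) DATA.getAcquire());  // acquire load
    }
}
```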

OrderAccess#

Although the implementation differs between versions, let's return to OrderAccess (src/hotspot/share/runtime/orderAccess.hpp). A long documentation comment there (roughly lines 31 to 235) lays out the Java HotSpot VM's memory access ordering model, introducing the memory barrier operations used to enforce memory access order in multithreaded environments and to prevent reordering.

According to the official explanation, here are the four memory barrier operations with more detailed and rigorous definitions:

  1. LoadLoad: Ensures that Load1 completes before Load2 and all subsequent load operations. Load operations before Load1 cannot be moved past Load2 and all subsequent loads.
  2. StoreStore: Ensures that Store1 completes before Store2 and all subsequent store operations. Store operations before Store1 cannot be moved past Store2 and all subsequent stores.
  3. LoadStore: Ensures that Load1 completes before Store2 and all subsequent store operations. Load operations before Load1 cannot be moved past Store2 and all subsequent stores.
  4. StoreLoad: Ensures that Store1 completes before Load2 and all subsequent load operations. Store operations before Store1 cannot be moved past Load2 and all subsequent loads.
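StoreLoad is the one constraint x86 does not enforce on its own, and its absence is observable from Java. In the classic store-load litmus test below (my own sketch), if x and y were plain fields, the outcome where both threads read 0 would be allowed, because each CPU may reorder its store with the following load; declaring the fields volatile inserts the StoreLoad barrier and forbids that outcome.

```java
public class StoreLoadLitmus {
    // With plain int fields, r[0] == 0 && r[1] == 0 is a legal outcome.
    // volatile forces a StoreLoad barrier after each write, ruling it out.
    static volatile int x = 0, y = 0;

    public static void main(String[] args) throws InterruptedException {
        final int[] r = new int[2];
        Thread t1 = new Thread(() -> { x = 1; r[0] = y; });
        Thread t2 = new Thread(() -> { y = 1; r[1] = x; });
        t1.start(); t2.start();
        t1.join();  t2.join();
        // With volatile, at least one thread must see the other's store.
        System.out.println(!(r[0] == 0 && r[1] == 0));  // prints true
    }
}
```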

Two additional memory barrier operations are acquire and release. They are commonly paired as release-store and load-acquire operations when handling data shared between threads.

The fence operation is a bidirectional memory barrier. It ensures the order of memory operations before and after the fence, meaning memory accesses before the fence will not be reordered with memory accesses after the fence.

Below are implementations of memory barrier operations on several major architectures (such as x86, sparc TSO, ppc) and their relationship with C++ volatile semantics and compiler barriers.

                      Constraint    x86            sparc TSO          ppc
--------------------------------------------------------------------------
fence                 LoadStore |   lock           membar #StoreLoad  sync
                      StoreStore |  addl 0,(sp)
                      LoadLoad  |
                      StoreLoad

release               LoadStore |                                     lwsync
                      StoreStore

acquire               LoadLoad  |                                     lwsync
                      LoadStore

release_store                       <store>        <store>            lwsync
                                                                      <store>

release_store_fence                 xchg           <store>            lwsync
                                                   membar #StoreLoad  <store>
                                                                      sync

load_acquire                        <load>         <load>             <load>
                                                                      lwsync

This document also particularly emphasizes that the execution order of the constructor and destructor of mutex-related objects (MutexLocker) is critical for the proper operation of the entire VM. Specifically, it assumes that constructors execute in the order of fence, lock, and acquire, while destructors execute in the order of release and unlock. If these implementations change, it will cause significant issues in many parts of the code.

Finally, an instruction_fence operation is defined, which ensures that all instructions after the instruction fence are fetched from the cache or memory only after the fence has completed. In summary, this section describes the importance and usage of memory barriers in multithreaded memory access and their specific implementations across different hardware architectures.

Assembly Level#

Many existing references mention "viewing the assembly implementation of volatile," but they tend to stop at the instruction lock addl $0x0, (%rsp). Readers interested in the deeper assembly-level details are welcome to keep exploring with the author.

Returning to the Java HotSpot VM source code, using the Intel x86 architecture as an example, the following two functions can be found in the src/hotspot/cpu/x86/assembler_x86.cpp file:

// Emits the `lock` prefix
void Assembler::lock() {
	// the `lock` prefix is encoded as 0xF0
	emit_int8((unsigned char)0xF0);
}

// Emits the `addl` instruction
void Assembler::addl(Address dst, int32_t imm32) {
	InstructionMark im(this);
	prefix(dst);
	// `add` with a 32-bit immediate uses opcode 0x81
	emit_arith_operand(0x81, rax, dst, imm32);
}

There are multiple overloaded Assembler::addl functions in the Java HotSpot VM; only one of them is shown here. The lock prefix modifies another instruction so that it executes atomically, and combining it with addl forms a LOCK ADDL instruction. In the membar implementation, the combination looks like this:

void Assembler::membar(Membar_mask_bits order_constraint) {
	if (order_constraint & StoreLoad) {
		int offset = -VM_Version::L1_line_size();
		if (offset < -128) {
			offset = -128;
		}
		lock();
		addl(Address(rsp, offset), 0);  // Assert the lock# signal here
	}
}

This generates a LOCK ADDL instruction. On x86, the LOCK prefix turns the ADDL into an atomic read-modify-write that acts as a full memory barrier, which in particular prevents store-load reordering.
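Java itself exposes these barriers through the static fence methods of java.lang.invoke.VarHandle (JDK 9+): fullFence() corresponds to the lock addl/mfence-style full barrier generated above, while acquireFence(), releaseFence(), loadLoadFence(), and storeStoreFence() map to the weaker constraints. A minimal sketch (my own example):

```java
import java.lang.invoke.VarHandle;

public class FenceDemo {
    static int data;
    static boolean ready;

    public static void main(String[] args) {
        data = 42;
        VarHandle.releaseFence();   // LoadStore|StoreStore: keeps `data = 42` above
        ready = true;
        VarHandle.fullFence();      // full barrier, like lock addl / mfence

        boolean r = ready;
        VarHandle.acquireFence();   // LoadLoad|LoadStore: keeps the `data` read below
        System.out.println(r ? data : -1);
    }
}
```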

Other Memory Barrier Functions#

If readers continue exploring the source code, they will find three other memory barrier-related functions: lfence, mfence, and sfence:

// Emit lfence instruction
void Assembler::lfence() {
	emit_int24(0x0F, (unsigned char)0xAE, (unsigned char)0xE8);
}
// Emit mfence instruction
void Assembler::mfence() {
	NOT_LP64(assert(VM_Version::supports_sse2(), "unsupported");)
	emit_int24(0x0F, (unsigned char)0xAE, (unsigned char)0xF0);
}
// Emit sfence instruction
void Assembler::sfence() {
	NOT_LP64(assert(VM_Version::supports_sse2(), "unsupported");)
	emit_int24(0x0F, (unsigned char)0xAE, (unsigned char)0xF8);
}

These three functions generate memory barrier instructions in the Intel x86 architecture: LFENCE, MFENCE, and SFENCE. The specific functions of these three instructions are as follows:

  1. lfence: This function generates the LFENCE instruction. LFENCE is a Load Fence, which is a type of memory barrier that ensures all read operations before LFENCE are completed before the LFENCE instruction itself completes. In other words, it prevents load operations before LFENCE from being reordered to after LFENCE.
  2. mfence: This function generates the MFENCE instruction. MFENCE is a Memory Fence, a stronger memory barrier that ensures all read and write operations before MFENCE are completed before the MFENCE instruction itself completes. In other words, it prevents load and store operations before MFENCE from being reordered to after MFENCE.
  3. sfence: This function generates the SFENCE instruction. SFENCE is a Store Fence, which ensures that all write operations before SFENCE are completed before the SFENCE instruction itself completes. In other words, it prevents store operations before SFENCE from being reordered to after SFENCE.

Readers should note that these three functions are not what implements volatile. For volatile read operations, the x86 memory model already prohibits load-load and load-store reordering, so no additional memory barrier is required. For Java's volatile write operations, the HotSpot JVM typically emits a lock addl instruction as a StoreLoad barrier, because the x86 memory model only allows store-load reordering, and the lock addl instruction prevents exactly that reordering.

The lfence, mfence, and sfence functions correspond to instructions that are used in more specific scenarios on the x86 architecture, such as processor optimizations and particular memory access patterns. These may be used when interacting with certain types of devices or employing advanced concurrency programming techniques. However, in Java’s volatile semantics, these instructions are generally not directly used.


Author: Moritz Arena