Principle of implementation of atomic operations

1. Introduction

Atomic means “the smallest particle that cannot be further divided”, and atomic operation means “an operation or series of operations that cannot be interrupted”. Implementing atomic operations on a multiprocessor becomes a bit complicated. In this article, let’s talk about how atomic operations are implemented on Intel processors and in Java.

2. Definition of Terms

Compare and Swap CAS operations require two values to be entered, an old value (the value before the desired operation) and a new value, during which the old value is compared to the new value if it has not changed, and not exchanged if it.

CPU pipeline works like an assembly line in industrial production, in which the CPU consists of 5~6 circuit units with different functions to form an instruction processing pipeline, then an X86 instruction is divided into 5~6 steps and then executed by these circuit units separately, so that an instruction can be completed in one CPU clock cycle, thus increasing the CPU computing speed. Memory order violation is usually caused by a false share.

Memory order violation is usually caused by false sharing, which is when multiple CPUs modify different parts of the same cache line at the same time and cause one of the CPUs to invalidate the operation.

3. How processors implement atomic operations

32-bit IA-32 processors use atomic operations between multiple processors based on locking the cache or bus locking.

3.1 Processor automatically guarantees atomicity of basic memory operations

First, the processor automatically guarantees atomicity for basic memory operations. The processor guarantees that reading or writing a byte from or to system memory is atomic, meaning that when one processor reads a byte, no other processor can access the byte’s memory address. Pentium 6 and newer processors automatically guarantee atomicity for single-processor 16/32/64-bit operations on the same cache line, but the processor does not automatically guarantee atomicity for complex memory operations such as accesses across bus widths, across multiple cache lines, and across page tables. However, the processor provides both bus locking and cache locking mechanisms to guarantee the atomicity of complex memory operations.

3.2 Ensuring atomicity with bus locking

The first mechanism is to guarantee atomicity through bus locking. If multiple processors read and rewrite a shared variable at the same time (i++ is a classic read and rewrite operation), then the shared variable will be operated by multiple processors at the same time, so the read and rewrite operations are not atomic and the value of the shared variable will not be the same as expected after the operation, for example: if i=1, we perform two i++ operations and we expect the result to be 3, but it may be 2. The result is 2.

Shared Variable Operation Procedure

The reason for this is that it is possible for multiple processors to read variable i from their respective caches at the same time, add one to each of them, and then write them to system memory. To ensure that the read and write operations are atomic, CPU1 must ensure that when CPU2 reads and writes a shared variable, CPU2 cannot operate the cache that has the shared variable’s memory address.

The processor uses bus locks to solve this problem. A bus lock uses a LOCK# signal provided by the processor. When a processor outputs this signal on the bus, requests from other processors will be blocked, and the processor will have exclusive access to the shared memory.

3.3 Guaranteed Atomicity with Cache Locking

The second mechanism is to guarantee atomicity through cache locking. However, bus locking locks the communication between the CPU and the memory, which prevents other processors from manipulating data at other memory addresses during the locking period, so the overhead of bus locking is high.

Since frequently used memory is cached in the processor’s L1, L2, and L3 caches, atomic operations can be performed directly in the processor’s internal cache without the need to declare a bus lock, and in the Pentium 6 and more recent processors “cache locking” can be used to achieve complex atomicity. “Cache locking” means that if the memory area cached in the processor’s cache line is locked during a LOCK operation, when it performs a lock operation to write back to memory, the processor does not assert the LOCK# signal on the bus, but instead modifies the internal memory address and allows its cache coherency mechanism to ensure atomicity of the operation, since The cache coherency mechanism prevents simultaneous modification of data in memory areas that are cached by more than two processors, and invalidates cache lines when other processors write back data in cache lines that have been locked.

However, there are two cases where the processor does not use cache locking. The first case is that the processor invokes bus locking when the data for an operation cannot be cached inside the processor, or when the data for an operation spans multiple cache lines. The second case is that some processors do not support cache locking. For Inter486 and Pentium processors, bus locking is invoked even if the locked memory area is in the processor’s cache line.

Both of these mechanisms can be implemented by the Inter processor by providing a number of instructions with the LOCK prefix. For example, bit test and modify instructions BTS, BTR, BTC, swap instructions XADD, CMPXCHG and some other operand and logic instructions such as ADD, OR, etc., the memory area operated by these instructions is locked, so that other processors cannot access it at the same time.

4. How to implement atomic operations in JAVA

Atomic operations can be implemented in java by means of locks and cyclic CAS.

4.1 Implementing Atomic Operations with Cyclic CAS

The CAS operation in the JVM is implemented using the CMPXCHG instruction provided by the processor mentioned in the previous section. The basic idea behind the spin CAS implementation is to loop through CAS operations until they succeed. The following code implements a CAS-based thread-safe counter method safeCount and a non-thread-safe counter count.

public class Counter {
   private AtomicInteger atomicI = new AtomicInteger(0);
   private int i = 0;
   public static void main(String[] args) {
       final Counter cas = new Counter();
       List<Thread> ts = new ArrayList<Thread>(600);
       long start = System.currentTimeMillis();
       for (int j = 0; j < 100; j++) {
           Thread t = new Thread(new Runnable() {
               @Override
               public void run() {
                   for (int i = 0; i < 10000; i++) {
                       cas.count();
                       cas.safeCount();
                   }
               }
           });
           ts.add(t);
       }

       for (Thread t : ts) {
           t.start();
       }
      // Wait for all threads to finish executing
       for (Thread t : ts) {
           try {
               t.join();
           } catch (InterruptedException e) {
               e.printStackTrace();
           }
       }

       System.out.println(cas.i);
       System.out.println(cas.atomicI.get());
       System.out.println(System.currentTimeMillis() - start);

   }

   /**
    * Implementing thread-safe counters with CAS
    */
   private void safeCount() {
       for (;;) {
           int i = atomicI.get();
           boolean suc = atomicI.compareAndSet(i, ++i);
           if (suc) {
               break;
           }
       }
   }
   /**
    * Thread unsafe counter
    */
   private void count() {
       i++;
   }
}

There are some concurrency frameworks in the java concurrency package that use spin CAS to implement atomic operations, such as the Xfer method of the LinkedTransferQueue class.CAS is an efficient solution for atomic operations, but CAS still has three major problems: the ABA problem, the long loop time and the high overhead, and the fact that it can only guarantee atomic operations on a shared variable.

  • The ABA problem. The solution to the ABA problem is to use version numbers. If you append the version number in front of the variable and add one to it every time the variable is updated, then A-B-A will become 1A-2B-3A. Since Java 1.5, the atomic package of JDK provides a class AtomicStampedReference to solve the ABA problem. The compareAndSet method of this class first checks whether the current reference is equal to the expected reference and whether the current flag is equal to the expected flag, and if they are both equal, then the reference and the value of the flag are set to the given updated value in an atomic way.
 public boolean compareAndSet (V expectedReference,// Expected reference
                               V newReference,// updated reference
                               int expectedStamp, // expected flag
                               int newStamp) // updated flag
  • long loop time overhead. Spin CAS can cause very high execution overhead to the CPU if it is unsuccessful for a long time. If the JVM can support the pause instruction provided by the processor, then the efficiency will be improved. The pause instruction has two roles, first it can delay the execution of the pipeline instruction (de-pipeline), so that the CPU does not consume too much execution resources, the delay time depends on the specific implementation version, in some processors the delay time is zero. Secondly, it can avoid CPU pipeline flush due to memory order violation when exiting the loop, thus improving CPU execution efficiency.

  • Only atomic operations on a shared variable are guaranteed. When operating on one shared variable, we can use the loop CAS method to guarantee atomic operation, but when operating on multiple shared variables, the loop CAS cannot guarantee the atomicity of the operation, so we can use locks, or there is a tricky way to combine multiple shared variables into one shared variable. For example, if you have two shared variables i=2,j=a, merge ij=2a and use CAS to operate on ij. Since Java 1.5, the JDK has provided the AtomicReference class to ensure atomicity between reference objects, so you can put multiple variables in one object to perform CAS operations.

4.2 Implementing atomic operations using the lock mechanism

The JVM internally implements a variety of locking mechanisms, including biased locks, lightweight locks, and mutually exclusive locks, but interestingly, except for biased locks, the JVM implements locks in a cyclic CAS fashion, where a thread uses a cyclic CAS to obtain a lock when it wants to enter a synchronous block, and a cyclic CAS when it exits a synchronous block.

Well, I believe that seeing this you have further knowledge of atomic operations, and subsequently this site will also publish some articles on concurrent programming, which I hope will be helpful to you.