| <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" |
| "http://www.w3.org/TR/html4/strict.dtd"> |
| <!-- Material used from: HTML 4.01 specs: http://www.w3.org/TR/html401/ --> |
| <html> |
| <head> |
| <META http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"> |
| <title><atomic> design</title> |
| <link type="text/css" rel="stylesheet" href="menu.css"> |
| <link type="text/css" rel="stylesheet" href="content.css"> |
| </head> |
| |
| <body> |
| <div id="menu"> |
| <div> |
| <a href="https://llvm.org/">LLVM Home</a> |
| </div> |
| |
| <div class="submenu"> |
| <label>libc++ Info</label> |
| <a href="/index.html">About</a> |
| </div> |
| |
| <div class="submenu"> |
| <label>Quick Links</label> |
| <a href="https://lists.llvm.org/mailman/listinfo/cfe-dev">cfe-dev</a> |
| <a href="https://lists.llvm.org/mailman/listinfo/cfe-commits">cfe-commits</a> |
| <a href="https://bugs.llvm.org/">Bug Reports</a> |
| <a href="https://github.com/llvm/llvm-project/tree/master/libcxx/">Browse Sources</a> |
| </div> |
| </div> |
| |
| <div id="content"> |
| <!--*********************************************************************--> |
| <h1><atomic> design</h1> |
| <!--*********************************************************************--> |
| |
| <p> |
The <tt><atomic></tt> header is one of the headers most closely coupled to
the compiler. Ideally, invoking any function from <tt><atomic></tt> should
result in highly optimized assembly being inserted directly into your
application ... assembly that is not otherwise representable by higher-level
C or C++ expressions. The design of the libc++ <tt><atomic></tt> header
started with this goal in mind. A secondary, but still very important, goal is
that the compiler should have to do minimal work to facilitate the
implementation of <tt><atomic></tt>. Without this second goal, the libc++
<tt><atomic></tt> header would, practically speaking, be doomed to be a
barely supported, second-class citizen on almost every platform.
| </p> |
| |
| <p>Goals:</p> |
| |
| <blockquote><ul> |
| <li>Optimal code generation for atomic operations</li> |
| <li>Minimal effort for the compiler to achieve goal 1 on any given platform</li> |
<li>Conformance to the C++0x draft standard</li>
| </ul></blockquote> |
| |
| <p> |
The purpose of this document is to tell compiler writers what they need to do
to enable a high-performance libc++ <tt><atomic></tt> with minimal effort.
| </p> |
| |
| <h2>The minimal work that must be done for a conforming <tt><atomic></tt></h2> |
| |
| <p> |
| The only "atomic" operations that must actually be lock free in |
| <tt><atomic></tt> are represented by the following compiler intrinsics: |
| </p> |
| |
| <blockquote><pre> |
__atomic_flag__
__atomic_exchange_seq_cst(__atomic_flag__ volatile* obj, __atomic_flag__ desr)
{
    unique_lock<mutex> _(some_mutex);  // reference semantics; a real intrinsic is lock-free
    __atomic_flag__ result = *obj;     // read the old value ...
    *obj = desr;                       // ... and store the new one, atomically
    return result;
}

void
__atomic_store_seq_cst(__atomic_flag__ volatile* obj, __atomic_flag__ desr)
{
    unique_lock<mutex> _(some_mutex);  // reference semantics; a real intrinsic is lock-free
    *obj = desr;
}
| </pre></blockquote> |
| |
| <p> |
| Where: |
| </p> |
| |
| <blockquote><ul> |
| <li> |
If <tt>__has_feature(__atomic_flag)</tt> evaluates to 1 in the preprocessor,
then the compiler must define <tt>__atomic_flag__</tt> (e.g., as a typedef to
<tt>int</tt>).
| </li> |
| <li> |
If <tt>__has_feature(__atomic_flag)</tt> evaluates to 0 in the preprocessor,
then the library defines <tt>__atomic_flag__</tt> as a typedef to <tt>bool</tt>
(see the sketch after this list).
| </li> |
| <li> |
| <p> |
| To communicate that the above intrinsics are available, the compiler must |
| arrange for <tt>__has_feature</tt> to return 1 when fed the intrinsic name |
| appended with an '_' and the mangled type name of <tt>__atomic_flag__</tt>. |
| </p> |
| <p> |
For example, if <tt>__atomic_flag__</tt> is <tt>unsigned int</tt>:
| </p> |
| <blockquote><pre> |
| __has_feature(__atomic_flag) == 1 |
| __has_feature(__atomic_exchange_seq_cst_j) == 1 |
| __has_feature(__atomic_store_seq_cst_j) == 1 |
| |
| typedef unsigned int __atomic_flag__; |
| |
| unsigned int __atomic_exchange_seq_cst(unsigned int volatile*, unsigned int) |
| { |
| // ... |
| } |
| |
| void __atomic_store_seq_cst(unsigned int volatile*, unsigned int) |
| { |
| // ... |
| } |
| </pre></blockquote> |
| </li> |
| </ul></blockquote> |
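<p>
For example, the library's side of this handshake might look like the
following sketch (hypothetical; the real libc++ internals differ):
</p>

<blockquote><pre>
#if __has_feature(__atomic_flag)
    // The compiler defines __atomic_flag__ and supplies the two intrinsics.
#else
    typedef bool __atomic_flag__;  // library-provided fallback type
#endif
</pre></blockquote>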
| |
| <p> |
That's it! Once a compiler does the above, you've got a fully conforming
(though with sub-par performance) <tt><atomic></tt> header!
| </p> |
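
<p>
To see why these two operations suffice, here is a sketch (simplified and
hypothetical; the real class has more members) of how the library can build
<tt>atomic_flag</tt> on top of them:
</p>

<blockquote><pre>
struct atomic_flag
{
    __atomic_flag__ __flag_;

    bool test_and_set() volatile   // returns the previous value
        { return __atomic_exchange_seq_cst(&__flag_, __atomic_flag__(true)); }
    void clear() volatile
        { __atomic_store_seq_cst(&__flag_, __atomic_flag__(false)); }
};
</pre></blockquote>

<p>
Every other operation can fall back to a lock, as the library implementations
in this document do; only these two operations must be genuinely lock-free.
</p>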
| |
| <h2>Recommended work for a higher performance <tt><atomic></tt></h2> |
| |
| <p> |
It would be good if the above intrinsics worked with all integral types plus
<tt>void*</tt>. Because this may not be possible to do in a lock-free manner for
all integral types on all platforms, a compiler must communicate each type that
an intrinsic works with. For example, if <tt>__atomic_exchange_seq_cst</tt>
works for all types except <tt>long long</tt> and <tt>unsigned long long</tt>,
then:
| </p> |
| |
| <blockquote><pre> |
| __has_feature(__atomic_exchange_seq_cst_b) == 1 // bool |
| __has_feature(__atomic_exchange_seq_cst_c) == 1 // char |
| __has_feature(__atomic_exchange_seq_cst_a) == 1 // signed char |
| __has_feature(__atomic_exchange_seq_cst_h) == 1 // unsigned char |
| __has_feature(__atomic_exchange_seq_cst_Ds) == 1 // char16_t |
| __has_feature(__atomic_exchange_seq_cst_Di) == 1 // char32_t |
| __has_feature(__atomic_exchange_seq_cst_w) == 1 // wchar_t |
| __has_feature(__atomic_exchange_seq_cst_s) == 1 // short |
| __has_feature(__atomic_exchange_seq_cst_t) == 1 // unsigned short |
| __has_feature(__atomic_exchange_seq_cst_i) == 1 // int |
| __has_feature(__atomic_exchange_seq_cst_j) == 1 // unsigned int |
| __has_feature(__atomic_exchange_seq_cst_l) == 1 // long |
| __has_feature(__atomic_exchange_seq_cst_m) == 1 // unsigned long |
| __has_feature(__atomic_exchange_seq_cst_Pv) == 1 // void* |
| </pre></blockquote> |
| |
| <p> |
| Note that only the <tt>__has_feature</tt> flag is decorated with the argument |
| type. The name of the compiler intrinsic is not decorated, but instead works |
| like a C++ overloaded function. |
| </p> |
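
<p>
For example (a sketch): the exchange intrinsic for <tt>long</tt> and
<tt>unsigned long</tt> is declared as two ordinary overloads of one undecorated
name, while each overload is announced under its own decorated feature name:
</p>

<blockquote><pre>
// One undecorated, overloaded intrinsic name ...
long          __atomic_exchange_seq_cst(long volatile*, long);
unsigned long __atomic_exchange_seq_cst(unsigned long volatile*, unsigned long);

// ... each announced by its own decorated feature test:
//     __has_feature(__atomic_exchange_seq_cst_l)   // long
//     __has_feature(__atomic_exchange_seq_cst_m)   // unsigned long
</pre></blockquote>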
| |
| <p> |
| Additionally there are other intrinsics besides |
| <tt>__atomic_exchange_seq_cst</tt> and <tt>__atomic_store_seq_cst</tt>. They |
| are optional. But if the compiler can generate faster code than provided by the |
| library, then clients will benefit from the compiler writer's expertise and |
| knowledge of the targeted platform. |
| </p> |
| |
| <p> |
Below is the complete list of <i>sequentially consistent</i> intrinsics and
their library implementations. Template syntax is used to indicate the desired
overloading for integral and <tt>void*</tt> types. The template does not
represent a requirement that the intrinsic operate on <em>any</em> type!
| </p> |
| |
| <blockquote><pre> |
| T is one of: bool, char, signed char, unsigned char, short, unsigned short, |
| int, unsigned int, long, unsigned long, |
| long long, unsigned long long, char16_t, char32_t, wchar_t, void* |
| |
| template <class T> |
| T |
| __atomic_load_seq_cst(T const volatile* obj) |
| { |
| unique_lock<mutex> _(some_mutex); |
| return *obj; |
| } |
| |
| template <class T> |
| void |
| __atomic_store_seq_cst(T volatile* obj, T desr) |
| { |
| unique_lock<mutex> _(some_mutex); |
| *obj = desr; |
| } |
| |
| template <class T> |
| T |
| __atomic_exchange_seq_cst(T volatile* obj, T desr) |
| { |
| unique_lock<mutex> _(some_mutex); |
| T r = *obj; |
| *obj = desr; |
| return r; |
| } |
| |
| template <class T> |
| bool |
| __atomic_compare_exchange_strong_seq_cst_seq_cst(T volatile* obj, T* exp, T desr) |
| { |
| unique_lock<mutex> _(some_mutex); |
    if (std::memcmp(const_cast<T*>(obj), exp, sizeof(T)) == 0)  // bytewise compare
    {
        std::memcpy(const_cast<T*>(obj), &desr, sizeof(T));     // match: store the desired value
        return true;
    }
    std::memcpy(exp, const_cast<T*>(obj), sizeof(T));           // mismatch: refresh *exp
| return false; |
| } |
| |
| template <class T> |
| bool |
| __atomic_compare_exchange_weak_seq_cst_seq_cst(T volatile* obj, T* exp, T desr) |
| { |
| unique_lock<mutex> _(some_mutex); |
| if (std::memcmp(const_cast<T*>(obj), exp, sizeof(T)) == 0) |
| { |
| std::memcpy(const_cast<T*>(obj), &desr, sizeof(T)); |
| return true; |
| } |
| std::memcpy(exp, const_cast<T*>(obj), sizeof(T)); |
| return false; |
| } |
| |
| T is one of: char, signed char, unsigned char, short, unsigned short, |
| int, unsigned int, long, unsigned long, |
| long long, unsigned long long, char16_t, char32_t, wchar_t |
| |
| template <class T> |
| T |
| __atomic_fetch_add_seq_cst(T volatile* obj, T operand) |
| { |
| unique_lock<mutex> _(some_mutex); |
| T r = *obj; |
| *obj += operand; |
| return r; |
| } |
| |
| template <class T> |
| T |
| __atomic_fetch_sub_seq_cst(T volatile* obj, T operand) |
| { |
| unique_lock<mutex> _(some_mutex); |
| T r = *obj; |
| *obj -= operand; |
| return r; |
| } |
| |
| template <class T> |
| T |
| __atomic_fetch_and_seq_cst(T volatile* obj, T operand) |
| { |
| unique_lock<mutex> _(some_mutex); |
| T r = *obj; |
| *obj &= operand; |
| return r; |
| } |
| |
| template <class T> |
| T |
| __atomic_fetch_or_seq_cst(T volatile* obj, T operand) |
| { |
| unique_lock<mutex> _(some_mutex); |
| T r = *obj; |
| *obj |= operand; |
| return r; |
| } |
| |
| template <class T> |
| T |
| __atomic_fetch_xor_seq_cst(T volatile* obj, T operand) |
| { |
| unique_lock<mutex> _(some_mutex); |
| T r = *obj; |
| *obj ^= operand; |
| return r; |
| } |
| |
| void* |
| __atomic_fetch_add_seq_cst(void* volatile* obj, ptrdiff_t operand) |
| { |
| unique_lock<mutex> _(some_mutex); |
| void* r = *obj; |
    (char*&)(*obj) += operand;  // void* arithmetic is done in bytes
| return r; |
| } |
| |
| void* |
| __atomic_fetch_sub_seq_cst(void* volatile* obj, ptrdiff_t operand) |
| { |
| unique_lock<mutex> _(some_mutex); |
| void* r = *obj; |
    (char*&)(*obj) -= operand;  // void* arithmetic is done in bytes
| return r; |
| } |
| |
| void __atomic_thread_fence_seq_cst() |
| { |
| unique_lock<mutex> _(some_mutex); |
| } |
| |
| void __atomic_signal_fence_seq_cst() |
| { |
| unique_lock<mutex> _(some_mutex); |
| } |
| </pre></blockquote> |
| |
| <p> |
One should consult the (currently draft)
<a href="https://wg21.link/n3126">C++ standard</a>
for the detailed definitions of these operations. For example,
<tt>__atomic_compare_exchange_weak_seq_cst_seq_cst</tt> is allowed to fail
spuriously, while <tt>__atomic_compare_exchange_strong_seq_cst_seq_cst</tt> is
not.
| </p> |
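
<p>
The practical consequence of spurious failure is that the weak form is intended
to be used in a loop. A sketch, assuming an <tt>int</tt> counter and the
intrinsic names above:
</p>

<blockquote><pre>
int volatile counter = 0;

void increment()
{
    int expected = __atomic_load_seq_cst(&counter);
    // On failure (even a spurious one), expected is refreshed with the
    // current value of counter, so the loop simply tries again.
    while (!__atomic_compare_exchange_weak_seq_cst_seq_cst(&counter, &expected,
                                                           expected + 1))
        ;
}
</pre></blockquote>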
| |
| <p> |
| If on your platform the lock-free definition of |
| <tt>__atomic_compare_exchange_weak_seq_cst_seq_cst</tt> would be the same as |
| <tt>__atomic_compare_exchange_strong_seq_cst_seq_cst</tt>, you may omit the |
| <tt>__atomic_compare_exchange_weak_seq_cst_seq_cst</tt> intrinsic without a |
| performance cost. The library will prefer your implementation of |
| <tt>__atomic_compare_exchange_strong_seq_cst_seq_cst</tt> over its own |
| definition for implementing |
| <tt>__atomic_compare_exchange_weak_seq_cst_seq_cst</tt>. That is, the library |
| will arrange for <tt>__atomic_compare_exchange_weak_seq_cst_seq_cst</tt> to call |
| <tt>__atomic_compare_exchange_strong_seq_cst_seq_cst</tt> if you supply an |
| intrinsic for the strong version but not the weak. |
| </p> |
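
<p>
A sketch of that arrangement for <tt>int</tt> (decorated <tt>_i</tt>); the
helper name is hypothetical:
</p>

<blockquote><pre>
bool
__choose_cas_weak(int volatile* obj, int* exp, int desr)
{
#if __has_feature(__atomic_compare_exchange_weak_seq_cst_seq_cst_i)
    return __atomic_compare_exchange_weak_seq_cst_seq_cst(obj, exp, desr);
#elif __has_feature(__atomic_compare_exchange_strong_seq_cst_seq_cst_i)
    // A strong compare-exchange never fails spuriously, so it is also a
    // correct weak compare-exchange.
    return __atomic_compare_exchange_strong_seq_cst_seq_cst(obj, exp, desr);
#else
    // Mutex-based fallback, analogous to the library implementation above.
    unique_lock<mutex> _(some_mutex);
    if (*obj == *exp)
    {
        *obj = desr;
        return true;
    }
    *exp = *obj;
    return false;
#endif
}
</pre></blockquote>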
| |
| <h2>Taking advantage of weaker memory synchronization</h2> |
| |
| <p> |
So far, all of the intrinsics presented require a <em>sequentially
consistent</em> memory ordering. That is, no loads or stores may move across
the operation (just as if the library had locked that internal mutex). But
<tt><atomic></tt> supports weaker memory ordering operations. In all,
there are six memory orderings (listed here from strongest to weakest):
| </p> |
| |
| <blockquote><pre> |
| memory_order_seq_cst |
| memory_order_acq_rel |
| memory_order_release |
| memory_order_acquire |
| memory_order_consume |
| memory_order_relaxed |
| </pre></blockquote> |
| |
| <p> |
(See the
<a href="https://wg21.link/n3126">C++ standard</a>
for the detailed definitions of each of these orderings.)
| </p> |
| |
| <p> |
On some platforms, the compiler vendor can offer some or even all of the above
intrinsics at one or more weaker levels of memory synchronization. This might,
for example, allow the compiler to avoid issuing an <tt>mfence</tt> instruction
on x86.
| </p> |
| |
| <p> |
If the compiler does not offer a given operation at a given memory ordering
level, the library will automatically attempt to call the operation with the
next stronger memory ordering. This continues up to <tt>seq_cst</tt>, and if
that doesn't exist either, the library takes over and does the job with a
<tt>mutex</tt>. This is a compile-time search-and-selection operation. At run
time, the application will see only the few inlined assembly instructions of
the selected intrinsic.
| </p> |
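
<p>
For example, the selection for an <tt>acquire</tt> load of <tt>int</tt> might
look like this sketch (the helper name is hypothetical):
</p>

<blockquote><pre>
int
__choose_load_acquire(int const volatile* obj)
{
#if __has_feature(__atomic_load_acquire_i)
    return __atomic_load_acquire(obj);    // the exact ordering requested
#elif __has_feature(__atomic_load_seq_cst_i)
    return __atomic_load_seq_cst(obj);    // the next stronger ordering
#else
    unique_lock<mutex> _(some_mutex);     // last resort: the library mutex
    return *obj;
#endif
}
</pre></blockquote>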
| |
| <p> |
Each intrinsic name is suffixed with the 7-letter name of the memory ordering
it addresses. For example, a <tt>load</tt> with <tt>relaxed</tt> ordering is
defined by:
| </p> |
| |
| <blockquote><pre> |
| T __atomic_load_relaxed(const volatile T* obj); |
| </pre></blockquote> |
| |
| <p> |
| And announced with: |
| </p> |
| |
| <blockquote><pre> |
| __has_feature(__atomic_load_relaxed_b) == 1 // bool |
| __has_feature(__atomic_load_relaxed_c) == 1 // char |
| __has_feature(__atomic_load_relaxed_a) == 1 // signed char |
| ... |
| </pre></blockquote> |
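
<p>
To illustrate why these weaker operations matter (a sketch, assuming
<tt>int</tt> variables and the naming scheme just described): a
<tt>release</tt> store paired with an <tt>acquire</tt> load is enough to
publish data, and on x86 both compile to plain <tt>mov</tt> instructions with
no <tt>mfence</tt>:
</p>

<blockquote><pre>
int volatile data;
int volatile ready = 0;

void publisher()
{
    data = 42;
    __atomic_store_release(&ready, 1);          // prior writes may not sink below this
}

void consumer()
{
    while (__atomic_load_acquire(&ready) == 0)  // later reads may not hoist above this
        ;
    // data is guaranteed to be 42 here
}
</pre></blockquote>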
| |
| <p> |
The <tt>__atomic_compare_exchange_strong(weak)</tt> intrinsics are parameterized
on two memory orderings: the first applies when the operation returns
<tt>true</tt>, and the second applies when the operation returns
<tt>false</tt>.
| </p> |
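
<p>
Both orderings appear in the intrinsic name, success ordering first. For
example (a sketch):
</p>

<blockquote><pre>
// acquire on success, relaxed on failure; announced per type, e.g.
//     __has_feature(__atomic_compare_exchange_strong_acquire_relaxed_i)
template <class T>
bool
__atomic_compare_exchange_strong_acquire_relaxed(T volatile* obj, T* exp, T desr);
</pre></blockquote>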
| |
| <p> |
Not every memory ordering is appropriate for every operation. <tt>exchange</tt>
and the <tt>fetch_<i>op</i></tt> operations support all six. But <tt>load</tt>
supports only <tt>relaxed</tt>, <tt>consume</tt>, <tt>acquire</tt>, and
<tt>seq_cst</tt>, and <tt>store</tt> supports only <tt>relaxed</tt>,
<tt>release</tt>, and <tt>seq_cst</tt>. The <tt>compare_exchange</tt>
operations support the following 16 combinations out of the possible 36:
| </p> |
| |
| <blockquote><pre> |
| relaxed_relaxed |
| consume_relaxed |
| consume_consume |
| acquire_relaxed |
| acquire_consume |
| acquire_acquire |
| release_relaxed |
| release_consume |
| release_acquire |
| acq_rel_relaxed |
| acq_rel_consume |
| acq_rel_acquire |
| seq_cst_relaxed |
| seq_cst_consume |
| seq_cst_acquire |
| seq_cst_seq_cst |
| </pre></blockquote> |
| |
| <p> |
Again, the compiler supplies intrinsics only for the strongest orderings where
it can make a difference. The library takes care of calling the weakest
supplied intrinsic that is at least as strong as what the customer asked for.
| </p> |
| |
| </div> |
| </body> |
| </html> |