Use-after-free bug in Maged M. Michael and Michael L. Scott's non-blocking concurrent queue algorithm

Note: This post has been automatically imported from my old blog. Formatting may be incorrect.

Note: I've gotten a response from one of the authors of the paper; see the update at the end of the post.

Google results for lock-free queue algorithms frequently point to Simple, Fast, and Practical Non-Blocking and Blocking Concurrent Queue Algorithms (pdf), whose pseudocode is available online. When I read the algorithm, I was surprised to discover a use-after-free bug not mentioned anywhere I could find!

Essentially, the problem is that enqueue can only dereference tail.ptr after loading Tail into tail, while dequeue frees Head.ptr. If the queue is empty when line E5 executes, then tail == Head, since Tail == Head whenever the queue is empty. If other threads then perform a single enqueue followed by a single dequeue before line E6 runs, tail.ptr will already have been freed: dequeue frees head.ptr, head == Head, and tail == Head. Line E6 then dereferences the freed node.
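
Concretely, here is the failing interleaving, sketched with the names from the pseudocode ("dummy" stands for the sentinel node the queue starts with):

[code language="cpp"]
// Sketch of the failing interleaving. The queue starts empty, so
// Head.ptr == Tail.ptr == dummy (the sentinel node).
//
// Thread A (slow enqueue)          Thread B
// -----------------------          --------
// E5: tail = Tail
//     (tail.ptr == dummy)
//                                  enqueue(x): links a node after dummy,
//                                              swings Tail to it
//                                  dequeue(&x): swings Head past dummy,
//                                               frees dummy at D19
// E6: next = tail.ptr->next
//     ^ tail.ptr still points at dummy, which has just been freed
[/code]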

I wanted to be sure I was reading this right, so I implemented the algorithm with one modification to enqueue: after loading Tail into tail at line E5, a thread marked as "slow" first notifies a condition variable that it has loaded tail and then blocks on that condition variable. The test program, using that modified implementation, consists of two threads. The first thread waits for a notification that the second has loaded tail, enqueues and then dequeues an item, notifies the second that it can continue, joins with the second thread, and exits. The second thread marks itself slow, enqueues a single item, and exits.

Here's the whole program in C++11, with comments for anything deviating from a straightforward translation of the pseudocode. Lines which correspond to numbered lines in the pseudocode are also marked:

[code language="cpp"]
#include <condition_variable>
#include <mutex>
#include <atomic>
#include <thread>

// Am I the slow thread?
thread_local auto is_slow_thread = bool{false};

// Is the slow thread waiting yet?
auto slow_thread_waiting = bool{false};

// Has the node been freed yet?
auto node_freed = bool{false};

// Used for inter-thread signalling
std::condition_variable cond{};
// Mutex for cond
std::mutex cond_mutex{};

// Implementation of the Michael-Scott algorithm
template <typename T> class queue_t {
    struct node_t;
    // Explicitly aligned due to https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65147: gcc should automatically align std::atomic<pointer_t> on 16-byte boundary but doesn't (until 5.1)
    struct alignas(16) pointer_t {
        node_t* ptr;
        unsigned int count;
        // A zero-initialized pointer_t
        // I'm pretty sure we don't actually need to initialize count to 0 here given how these are used, but it can't hurt.
        pointer_t() noexcept : ptr{nullptr}, count{0} {}
        // A pointer_t pointing to a specific node
        pointer_t(node_t* ptr) : ptr{ptr}, count{0} {}
        // A pointer_t pointing to a specific node with a specific count
        pointer_t(node_t* ptr, unsigned int count) : ptr{ptr}, count{count} {}
        // bitwise-compare two pointer_ts
        bool operator ==(const pointer_t & other) const {
            return ptr == other.ptr && count == other.count;
        }
    };
    struct node_t {
        T value;
        // We're going to do atomic ops on next
        std::atomic<pointer_t> next;
        // A dummy node, next is initialized with a zero-initialized ptr
        node_t() : next{pointer_t{}} {}
        // A node filled with a given value, next is initialized with a zero-initialized ptr
        node_t(T value) : value(value), next{pointer_t{}} {}
    };

    // We're going to do atomic ops on Head
    std::atomic<pointer_t> Head;
    // We're going to do atomic ops on Tail
    std::atomic<pointer_t> Tail;

public:
    queue_t() : Head{new node_t{}}, Tail{Head.load().ptr} {}

    void enqueue(T value) {
        // Node is initialized in ctor, so three lines in one
        auto node = new node_t{value}; // E1, E2, E3
        decltype(Tail.load()) tail;
        while (true) { // E4
            tail = Tail.load(); // E5
            // If we're the slow thread, we wait until the node we just loaded is freed.
            if (is_slow_thread) {
                {
                    std::lock_guard<std::mutex> lock{cond_mutex};
                    slow_thread_waiting = true;
                }
                // Let the main thread know we're waiting
                cond.notify_one();
                auto lock = std::unique_lock<std::mutex>{cond_mutex};
                // Wait until the main thread tells us the node is freed.
                cond.wait(lock, []{ return node_freed; });
            }
            // Use-after-free here in slow thread!
            auto next = tail.ptr->next.load(); // E6
            if (tail == Tail.load()) { // E7
                if (!next.ptr) { // E8
                    if (tail.ptr->next.compare_exchange_weak(next, pointer_t{node, next.count + 1})) { // E9
                        break; // E10
                    } // E11
                } else { // E12
                    Tail.compare_exchange_weak(tail, pointer_t{next.ptr, tail.count + 1}); // E13
                } // E14
            } // E15
        } // E16

        Tail.compare_exchange_weak(tail, pointer_t{node, tail.count + 1}); // E17
    }

    bool dequeue(T* pvalue) {
        decltype(Head.load()) head;
        while (true) { // D1
            head = Head.load(); // D2
            auto tail = Tail.load(); // D3
            auto next = head.ptr->next.load(); // D4
            if (head == Head.load()) { // D5
                if (head.ptr == tail.ptr) { // D6
                    if (!next.ptr) { // D7
                        return false; // D8
                    } // D9
                    Tail.compare_exchange_weak(tail, pointer_t{next.ptr, tail.count + 1}); // D10
                } else { // D11
                    *pvalue = next.ptr->value; // D12
                    if (Head.compare_exchange_weak(head, pointer_t{next.ptr, head.count + 1})) { // D13
                        break; // D14
                    } // D15
                } // D16
            } // D17
        } // D18
        delete head.ptr; // D19
        return true; // D20
    }
};

// Empty struct to fill our queue with
struct empty {};

// Our queue
queue_t<empty> queue{};

// The slow thread
void slow_thread() {
    // Set that we're the slow thread
    is_slow_thread = true;
    // Enqueue something
    queue.enqueue(empty{});
}

// The main thread
int main() {
    // Launch the slow thread
    auto slow = std::thread{slow_thread};
    {
        auto lock = std::unique_lock<std::mutex>{cond_mutex};
        // Wait until the slow thread is waiting
        cond.wait(lock, []{ return slow_thread_waiting; });
    }
    // Enqueue something
    queue.enqueue(empty{});
    empty ref;
    // Dequeue something
    queue.dequeue(&ref);
    {
        std::lock_guard<std::mutex> lock{cond_mutex};
        node_freed = true;
    }
    // Tell the slow thread we've freed the node
    cond.notify_one();
    // Wait for the slow thread to finish
    slow.join();
    return 0;
}

[/code]

I compiled this with gcc 4.8.4, using the following command line (the -mcx16 is needed to do atomic ops on pointer_t, which is larger than one word):

g++ -std=c++11 michael-scott.cc -o michael-scott -mcx16 -lpthread -ggdb -O0

Finally, here's the output from running it under valgrind 3.10.1 (note the line "Address 0x5c0b058 is 24 bytes inside a block of size 32 free'd"):

$ valgrind ./michael-scott
==12588== Memcheck, a memory error detector
==12588== Copyright (C) 2002-2013, and GNU GPL'd, by Julian Seward et al.
==12588== Using Valgrind-3.10.1 and LibVEX; rerun with -h for copyright info
==12588== Command: ./michael-scott
==12588==
==12588== Thread 2:
==12588== Invalid read of size 8
==12588== at 0x401CEE: std::atomic<queue_t<empty>::pointer_t>::load(std::memory_order) const (atomic:209)
==12588== by 0x4017B4: queue_t<empty>::enqueue(empty) (michael-scott.cc:76)
==12588== by 0x4011BB: slow_thread() (michael-scott.cc:127)
==12588== by 0x402EB8: void std::_Bind_simple<void (*())()>::_M_invoke<>(std::_Index_tuple<>) (functional:1732)
==12588== by 0x402E12: std::_Bind_simple<void (*())()>::operator()() (functional:1720)
==12588== by 0x402DAB: std::thread::_Impl<std::_Bind_simple<void (*())()> >::_M_run() (thread:115)
==12588== by 0x50FF7AF: ??? (in /nix/store/6vz6baw7wc26bp2c2i2lpip1z9yvcw0c-gcc-4.8.4/lib/libstdc++.so.6.0.19)
==12588== by 0x4E3A483: start_thread (in /nix/store/6k9z1sfl7kghmagwd205k3i81pbcw57s-glibc-2.21/lib/libpthread-2.21.so)
==12588== by 0x595204C: clone (in /nix/store/6k9z1sfl7kghmagwd205k3i81pbcw57s-glibc-2.21/lib/libc-2.21.so)
==12588== Address 0x5c0b058 is 24 bytes inside a block of size 32 free'd
==12588== at 0x4C290B1: operator delete(void*) (in /nix/store/rgxnhg1wpqzvjslyzk47z3www5clfc0l-valgrind-3.10.1/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==12588== by 0x401BE6: queue_t<empty>::dequeue(empty*) (michael-scott.cc:111)
==12588== by 0x40123A: main (michael-scott.cc:143)
==12588==
==12588==
==12588== HEAP SUMMARY:
==12588== in use at exit: 64 bytes in 2 blocks
==12588== total heap usage: 5 allocs, 3 frees, 456 bytes allocated
==12588==
==12588== LEAK SUMMARY:
==12588== definitely lost: 0 bytes in 0 blocks
==12588== indirectly lost: 0 bytes in 0 blocks
==12588== possibly lost: 0 bytes in 0 blocks
==12588== still reachable: 64 bytes in 2 blocks
==12588== suppressed: 0 bytes in 0 blocks
==12588== Rerun with --leak-check=full to see details of leaked memory
==12588==
==12588== For counts of detected and suppressed errors, rerun with: -v
==12588== ERROR SUMMARY: 2 errors from 1 contexts (suppressed: 1 from 1)

So, there you have it: Don't use an unmodified Michael-Scott queue when doing manual memory management!

Update 1

I heard back from one of the authors of the paper. At the time the paper was written, releasing memory for general use (or back to the OS) was acknowledged to be an open problem. The free in the algorithm is meant to represent putting the node back onto a locally-maintained, special-purpose free list, not the partner to malloc. This is mentioned briefly in the paper ("We use Treiber's simple and efficient non-blocking stack algorithm to implement a non-blocking free list."), but I didn't realize that implied the memory would never be used for anything else. Under those circumstances, the algorithm works.
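
For illustration, here's a rough sketch (mine, not from the paper) of the kind of free list that's meant: a Treiber stack that "freed" nodes are pushed onto instead of being deleted, so their memory is only ever reused for more queue nodes. The free_list_t name and the free_next member are assumptions I've made up for this example, and a real implementation would also want the counted-pointer trick from pointer_t above to avoid ABA problems when popping.

[code language="cpp"]
#include <atomic>

// A minimal node type for the sketch; free_next is an extra link used only
// by the free list (an assumption of mine, not part of the paper's node).
struct fl_node_t {
    int value = 0;              // payload placeholder
    fl_node_t* free_next = nullptr;
};

// Treiber-stack free list: "freeing" a node pushes it here, "allocating"
// pops one if available. Nodes never go back to the general allocator, so
// a stale pointer into one of them still points at a queue node.
class free_list_t {
    std::atomic<fl_node_t*> top{nullptr};
public:
    // The algorithm's free(node) becomes a push onto this stack.
    void push(fl_node_t* node) {
        auto old_top = top.load();
        do {
            node->free_next = old_top;
        } while (!top.compare_exchange_weak(old_top, node));
    }

    // The algorithm's node allocation first tries to pop from here, falling
    // back to the real allocator only when the stack is empty.
    // NB: without counted pointers this pop is ABA-prone; the paper's version
    // uses the same {ptr, count} pairs as the queue itself.
    fl_node_t* pop() {
        auto old_top = top.load();
        while (old_top &&
               !top.compare_exchange_weak(old_top, old_top->free_next)) {
        }
        return old_top; // nullptr means the caller should allocate a new node
    }
};
[/code]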

He also pointed me to his more recent paper on hazard pointers (pdf), which provides a general means of truly releasing memory for general use in lock-free (or even wait-free) algorithms, and shows a modification of the Michael-Scott queue that's safe to use with general-purpose memory allocation and deallocation.
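
Roughly, the fix works like this: before dereferencing tail.ptr at E6, the enqueuing thread publishes that pointer in a per-thread hazard slot and then re-checks Tail; dequeue retires head.ptr instead of deleting it, and a retired node is only actually deleted once no thread's hazard slot holds it. Here's a very rough sketch of how the E5/E6 window in the queue above would change; this is my own simplification, and hazard_slot_for_this_thread() and retire_node() are hypothetical helpers standing in for the machinery described in the hazard-pointers paper:

[code language="cpp"]
// Sketch only: how enqueue's E5/E6 window changes with a hazard pointer.
// hazard_slot_for_this_thread() and retire_node() are hypothetical helpers.
auto& hp = hazard_slot_for_this_thread(); // a std::atomic<node_t*> owned by this thread

pointer_t tail;
while (true) {
    tail = Tail.load();          // E5
    hp.store(tail.ptr);          // publish: "I'm about to dereference this node"
    if (Tail.load() == tail) {   // re-check: if Tail is unchanged, the node
        break;                   //   can't have been retired and reclaimed yet
    }
}
auto next = tail.ptr->next.load(); // E6: safe now -- the node can't be freed
                                   // while our hazard slot still points at it
// ... rest of enqueue as before ...
hp.store(nullptr);                 // done with the node

// On the dequeue side, D19 becomes:
//     retire_node(head.ptr);  // deleted later, once no hazard slot holds it
// instead of:
//     delete head.ptr;
[/code]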