
Implement lock-free queue for scheduler message queue #9710

Closed
wants to merge 11 commits into from

Conversation

toffaletti
Contributor

I implemented the lock-free intrusive MPSC queue algorithm by Dmitry Vyukov (http://www.1024cores.net/home/lock-free-algorithms/queues/intrusive-mpsc-node-based-queue) as a potential replacement for std::rt::message_queue::MessageQueue. make check-stage2 passes, but I've done little else to validate this change, and it's definitely a change that shouldn't be taken lightly. I included the license Dmitry asked me to use when I implemented a C++11 version of this queue a while ago; if the license is an issue, I suggest reaching out to him. I believe he works at Google now and recently rewrote Go's scheduler.
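For reference, here is a minimal non-intrusive sketch of Vyukov's algorithm in modern Rust. The PR itself predates today's std::sync::atomic API, so the names and orderings below are illustrative, not the PR's code; producers swap the head pointer, and the single consumer follows next pointers from the tail past a stub node.

```rust
use std::ptr;
use std::sync::atomic::{AtomicPtr, Ordering};

struct Node<T> {
    next: AtomicPtr<Node<T>>,
    value: Option<T>, // None for the stub node
}

pub struct Queue<T> {
    head: AtomicPtr<Node<T>>, // producers push here
    tail: *mut Node<T>,       // the single consumer pops here
}

impl<T> Queue<T> {
    pub fn new() -> Queue<T> {
        let stub = Box::into_raw(Box::new(Node {
            next: AtomicPtr::new(ptr::null_mut()),
            value: None,
        }));
        Queue { head: AtomicPtr::new(stub), tail: stub }
    }

    pub fn push(&self, value: T) {
        let node = Box::into_raw(Box::new(Node {
            next: AtomicPtr::new(ptr::null_mut()),
            value: Some(value),
        }));
        // Swap in the new head; the previous head then links to it.
        let prev = self.head.swap(node, Ordering::AcqRel);
        unsafe { (*prev).next.store(node, Ordering::Release) }
    }

    // &mut self encodes "single consumer" at the type level.
    pub fn pop(&mut self) -> Option<T> {
        unsafe {
            let tail = self.tail;
            let next = (*tail).next.load(Ordering::Acquire);
            if next.is_null() {
                return None; // empty (or a push is mid-flight)
            }
            self.tail = next;
            drop(Box::from_raw(tail)); // free the old stub/consumed node
            (*next).value.take()
        }
    }
}
// (A full implementation would also free any remaining nodes in Drop.)
```

Note the transient state between a producer's swap and its next-pointer store: the consumer briefly sees an empty queue, which is the algorithm's well-known caveat.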

@toffaletti
Contributor Author

check-stage1 failed in an rt::comm test; I'm not sure which one, though, since the console output is all interleaved.

@toffaletti
Contributor Author

I've narrowed it down to the rt::io tests:

$ ./x86_64-unknown-linux-gnu/stage1/test/stdtest-x86_64-unknown-linux-gnu rt::io

running 81 tests
test rt::io::buffered::test::test_buffered_writer ... ok
task '<unnamed>' failed at 'Unhandled condition: read_error: rt::io::IoError{kind: OtherIoError, desc: "Placeholder error. You shouldn't be seeing this", detail: None}', /home/jason/rust/src/libstd/condition.rs:131
test rt::io::buffered::test::test_buffered_reader ... ok
test rt::io::extensions::test::bytes_0_bytes ... ok
test rt::io::buffered::test::test_buffered_stream ... ok
test rt::io::extensions::test::bytes_eof ... ok
test rt::io::extensions::test::push_bytes ... task 'ok<unnamed>
' failed at 'Unhandled condition: read_error: rt::io::IoError{kind: OtherIoError, desc: "Placeholder error. You shouldn't be seeing this", detail: None}', /home/jason/rust/src/libstd/condition.rs:131
test rt::io::extensions::test::bytes_error ... ok
test rt::io::extensions::test::push_bytes_eof ... ok
test rt::io::extensions::test::push_bytes_error ... ok
test rt::io::extensions::test::push_bytes_partial ... ok
test rt::io::extensions::test::read_byte_0_bytes ... ok
test rt::io::extensions::test::read_byte_eof ... ok
test rt::io::extensions::test::read_byte ... ok
test rt::io::extensions::test::read_bytes ... ok
test rt::io::extensions::test::read_byte_error ... ok
test rt::io::extensions::test::read_bytes_eof ... ok
test rt::io::extensions::test::read_bytes_partial ... ok
test rt::io::extensions::test::read_to_end ... ok
test rt::io::extensions::test::push_bytes_fail_reset_len ... ok
test rt::io::extensions::test::test_read_be_int_n ... ok
test rt::io::extensions::test::test_read_f32 ... ok
test rt::io::extensions::test::test_read_write_be ... ok
test rt::io::extensions::test::read_to_end_error ... ok
test rt::io::extensions::test::test_read_write_f32 ... ok
test rt::io::extensions::test::test_read_write_le_mem ... ok
test rt::io::flate::test::smoke_test ... ignored
test rt::io::mem::test::test_buf_reader ... ok
test rt::io::mem::test::test_mem_reader ... ok
test rt::io::mem::test::test_mem_writer ... ok
test rt::io::mem::test::test_with_mem_writer ... ok
test rt::io::net::ip::test::ipv6_addr_to_str ... ok
test rt::io::net::ip::test::test_from_str_ipv4 ... ok
test rt::io::net::ip::test::test_from_str_ipv4_in_ipv6 ... ok
test rt::io::net::ip::test::test_from_str_ipv6 ... ok
test rt::io::net::ip::test::test_from_str_socket_addr ... ok
test rt::io::net::tcp::test::bind_error ... ignored
failed in non-task context at 'assertion failed: handle.is_not_null()', /home/jason/rust/src/libstd/rt/uv/mod.rs:105
failed in non-task context at 'assertion failed: handle.is_not_null()', /home/jason/rust/src/libstd/rt/uv/mod.rs:105
failed in non-task context at 'assertion failed: handle.is_not_null()', /home/jason/rust/src/libstd/rt/uv/mod.rs:105
failed in non-task context at 'assertion failed: handle.is_not_null()', /home/jason/rust/src/libstd/rt/uv/mod.rs:105
failed in non-task context at 'assertion failed: handle.is_not_null()', /home/jason/rust/src/libstd/rt/uv/mod.rs:105
Aborted

@toffaletti
Contributor Author

valgrind told me the abort came from running out of file descriptors. I raised the fd limit, and now I get:

$ ./x86_64-unknown-linux-gnu/stage1/test/stdtest-x86_64-unknown-linux-gnu rt::io

running 81 tests
task '<unnamed>' failed at 'Unhandled condition: read_error: rt::io::IoError{kind: OtherIoError, desc: "Placeholder error. You shouldn't be seeing this", detail: None}', /home/jason/rust/src/libstd/condition.rs:131
test rt::io::buffered::test::test_buffered_reader ... ok
test rt::io::buffered::test::test_buffered_stream ... ok
test rt::io::buffered::test::test_buffered_writer ... ok
test rt::io::extensions::test::bytes_0_bytes ... ok
test rt::io::extensions::test::bytes_eof ... ok
task '<unnamed>' failed at 'Unhandled condition: read_error: rt::io::IoError{kind: OtherIoError, desc: "Placeholder error. You shouldn't be seeing this", detail: None}', /home/jason/rust/src/libstd/condition.rs:131
test rt::io::extensions::test::push_bytes ... ok
test rt::io::extensions::test::bytes_error ... ok
test rt::io::extensions::test::push_bytes_eof ... ok
test rt::io::extensions::test::push_bytes_partial ... ok
test rt::io::extensions::test::read_byte ... ok
test rt::io::extensions::test::push_bytes_error ... ok
test rt::io::extensions::test::read_byte_0_bytes ... ok
test rt::io::extensions::test::read_byte_eof ... ok
test rt::io::extensions::test::read_byte_error ... ok
test rt::io::extensions::test::read_bytes_eof ... ok
test rt::io::extensions::test::read_bytes ... ok
test rt::io::extensions::test::read_bytes_partial ... ok
test rt::io::extensions::test::push_bytes_fail_reset_len ... ok
test rt::io::extensions::test::read_to_end ... ok
test rt::io::extensions::test::test_read_be_int_n ... ok
test rt::io::extensions::test::test_read_f32 ... ok
test rt::io::extensions::test::read_to_end_error ... ok
test rt::io::extensions::test::test_read_write_be ... ok
test rt::io::extensions::test::test_read_write_f32 ... ok
test rt::io::extensions::test::test_read_write_le_mem ... ok
test rt::io::flate::test::smoke_test ... ignored
test rt::io::mem::test::test_buf_reader ... ok
test rt::io::mem::test::test_mem_reader ... ok
test rt::io::mem::test::test_mem_writer ... ok
test rt::io::mem::test::test_with_mem_writer ... ok
test rt::io::net::ip::test::ipv6_addr_to_str ... ok
test rt::io::net::ip::test::test_from_str_ipv4 ... ok
test rt::io::net::ip::test::test_from_str_ipv6 ... ok
test rt::io::net::ip::test::test_from_str_ipv4_in_ipv6 ... ok
test rt::io::net::tcp::test::bind_error ... ignored
test rt::io::net::ip::test::test_from_str_socket_addr ... ok
task '<unnamed>' failed at 'assertion failed: !dir.exists()', /home/jason/rust/src/libstd/rt/io/file.rs:930


Instead of the poems I had hoped for, there came only a shuddering blackness
and ineffable loneliness; and I saw at last a fearful truth which no one had
ever dared to breathe before — the unwhisperable secret of secrets — The fact
that this city of stone and stridor is not a sentient perpetuation of Old New
York as London is of Old London and Paris of Old Paris, but that it is in fact
quite dead, its sprawling body imperfectly embalmed and infested with queer
animate things which have nothing to do with it as it was in life.

fatal runtime error: assertion failed: exit_status
Aborted

@toffaletti
Contributor Author

So the above failure was caused by the earlier fd-limit failures not cleaning up directories in tmp. I had to rm these two directories:

rm -rf tmp/before_and_after_dir
rm -rf tmp/di_readdir/

and now all tests pass.

test result: ok. 1194 passed; 0 failed; 91 ignored; 0 measured

I'd still be very wary of this change, though.

@brson
Contributor

brson commented Oct 6, 2013

Thanks! I'm out of town at the moment but will give this a close look this week.

@toffaletti
Contributor Author

I found SleeperList, which looks like a copy of the MessageQueue code, so I tried replacing it with mpsc_queue::Queue; but then I get crashes in the AtomicPtr::get refcount != 0 check. I haven't been able to track down which AtomicPtr is causing this yet, which is troubling, because I can't explain why or how this change would expose what looks like a bug in AtomicPtr. I suspect the bug is in my code rather than in AtomicPtr, but this sure is a head-scratcher.

compile_and_link: x86_64-unknown-linux-gnu/stage1/lib/rustc/x86_64-unknown-linux-gnu/lib/libstd.so
Warning: removing previous 'libstd-*.so' libraries: x86_64-unknown-linux-gnu/stage1/lib/rustc/x86_64-unknown-linux-gnu/lib//libstd-6c65cf4b443341b1-0.9-pre.so
failed in non-task context at 'assertion failed: (*self.data).count.load(Relaxed) > 0', /home/jason/rust/src/libstd/unstable/sync.rs:97
Segmentation fault

@alexcrichton
Member

cc #6837 and #6838

I've found that breaking on rust_begin_unwind is useful for generating stack traces in gdb, although the segfault itself would probably also give you a stack trace.

@alexcrichton
Member

I haven't given this a whole lot of thought, but there are two high-level points I was wondering about:

  • I think that the SleeperList is a multiple-consumer queue, meaning that the queue implemented here probably isn't suitable for that purpose.
  • Along those lines, it's certainly unsafe to use this queue with more than one consumer, so I was curious whether we could leverage the Rust type system to enforce this safety. Right now nothing guarantees a single consumer in safe code, so safe code can run into problems when using this queue.
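One way to encode the single-consumer restriction in the type system is to split the queue into a cloneable producer endpoint and a non-cloneable consumer endpoint. A sketch with a stand-in internal queue (the endpoint split is the point here, not the queue internals; Pusher/Popper are illustrative names):

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Mutex};

// Stand-in shared queue; a real version would be the lock-free structure.
struct Shared<T> {
    inner: Mutex<VecDeque<T>>,
}

pub struct Pusher<T>(Arc<Shared<T>>); // Clone: any number of producers
pub struct Popper<T>(Arc<Shared<T>>); // deliberately NOT Clone: one consumer

impl<T> Clone for Pusher<T> {
    fn clone(&self) -> Self {
        Pusher(Arc::clone(&self.0))
    }
}

impl<T> Pusher<T> {
    pub fn push(&self, v: T) {
        self.0.inner.lock().unwrap().push_back(v);
    }
}

impl<T> Popper<T> {
    // &mut self: even the one consumer can't pop from two places at once.
    pub fn pop(&mut self) -> Option<T> {
        self.0.inner.lock().unwrap().pop_front()
    }
}

pub fn queue<T>() -> (Pusher<T>, Popper<T>) {
    let shared = Arc::new(Shared { inner: Mutex::new(VecDeque::new()) });
    (Pusher(Arc::clone(&shared)), Popper(shared))
}
```

Because Popper implements neither Clone nor Copy, safe code cannot create a second consumer; the multi-producer side remains freely cloneable.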

@toffaletti
Contributor Author

@alexcrichton I was wondering whether SleeperList needed to be multi-consumer and whether that might be the issue; if so, it explains the crashes. I will implement another lock-free queue for SleeperList. I've also thought about using the type system to enforce the thread-safety properties of the queue. I'm thinking of something similar to comm::stream(), which would create a Pusher that is cloneable and sendable and a Popper that isn't. I'll probably need some help with the intricacies of the type system.

@toffaletti
Contributor Author

Tests are now passing with the SleeperList using the multiple-producer multiple-consumer bounded queue. I used this algorithm: http://www.1024cores.net/home/lock-free-algorithms/queues/bounded-mpmc-queue. The downside is that the queue's capacity is bounded to powers of 2; however, it is very fast and easy to understand. I wasn't sure what I could safely do inside the scheduler when a push to the sleeper list failed because the queue was full, so for now I just put an assert! in and made the queue fairly large. Usually there would be a spin loop with back-off yielding.
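For context, the core of the bounded MPMC algorithm is a per-slot sequence number that producers and consumers compare against their position counters. A compilable sketch in modern Rust (simplified from Vyukov's description, ignoring counter wraparound, and not the PR's code):

```rust
use std::cell::UnsafeCell;
use std::sync::atomic::{AtomicUsize, Ordering};

struct Slot<T> {
    seq: AtomicUsize,
    value: UnsafeCell<Option<T>>,
}

pub struct Bounded<T> {
    buffer: Vec<Slot<T>>,
    mask: usize, // capacity - 1; works because capacity is a power of two
    enqueue_pos: AtomicUsize,
    dequeue_pos: AtomicUsize,
}

unsafe impl<T: Send> Sync for Bounded<T> {}

impl<T> Bounded<T> {
    pub fn with_capacity(capacity: usize) -> Bounded<T> {
        let cap = capacity.next_power_of_two();
        let buffer = (0..cap)
            .map(|i| Slot { seq: AtomicUsize::new(i), value: UnsafeCell::new(None) })
            .collect();
        Bounded { buffer, mask: cap - 1, enqueue_pos: AtomicUsize::new(0), dequeue_pos: AtomicUsize::new(0) }
    }

    /// Returns false when full (the point where the PR asserts).
    pub fn push(&self, value: T) -> bool {
        let mut pos = self.enqueue_pos.load(Ordering::Relaxed);
        loop {
            let slot = &self.buffer[pos & self.mask];
            let seq = slot.seq.load(Ordering::Acquire);
            if seq == pos {
                // Slot is free at our position; try to claim it.
                match self.enqueue_pos.compare_exchange_weak(pos, pos + 1, Ordering::Relaxed, Ordering::Relaxed) {
                    Ok(_) => {
                        unsafe { *slot.value.get() = Some(value) }
                        slot.seq.store(pos + 1, Ordering::Release);
                        return true;
                    }
                    Err(p) => pos = p,
                }
            } else if seq < pos {
                return false; // full
            } else {
                pos = self.enqueue_pos.load(Ordering::Relaxed); // lost a race; reload
            }
        }
    }

    pub fn pop(&self) -> Option<T> {
        let mut pos = self.dequeue_pos.load(Ordering::Relaxed);
        loop {
            let slot = &self.buffer[pos & self.mask];
            let seq = slot.seq.load(Ordering::Acquire);
            if seq == pos + 1 {
                match self.dequeue_pos.compare_exchange_weak(pos, pos + 1, Ordering::Relaxed, Ordering::Relaxed) {
                    Ok(_) => {
                        let value = unsafe { (*slot.value.get()).take() };
                        // Mark the slot free for the producer one lap ahead.
                        slot.seq.store(pos + self.mask + 1, Ordering::Release);
                        return value;
                    }
                    Err(p) => pos = p,
                }
            } else if seq < pos + 1 {
                return None; // empty
            } else {
                pos = self.dequeue_pos.load(Ordering::Relaxed);
            }
        }
    }
}
```

The sequence-number scheme is why the capacity must be a power of two: the mask turns the monotonically increasing position counters into buffer indices.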

@brson
Contributor

brson commented Oct 16, 2013

Forgive me for taking so long on this. It looks very impressive, but I'm wary of hastily merging concurrency code these days.

I think it's not too big a problem for the SleeperList to have a bound, at least for now. At the moment the number of work stealing schedulers is fixed at startup, putting a bound on the size of the SleeperList. In the future that number will be able to increase dynamically, but I still don't think that having a reasonable limit on the number of schedulers is a major problem.

This does convert the SleeperList from a stack to a queue. I'm not sure if this is good or bad. I chose a stack originally with the vague idea that it would be better for cache reasons to keep using threads that were most recently active, but it may not matter at all.

This also makes every push to the scheduler message queue allocate, so there's going to be a performance tradeoff here. It will probably scale a lot better when message queues are contended, but be slower to push/pop.

A minor nit, but I would prefer that the mpsc queue be encapsulated in the scheduler-specific MessageQueue type. Partly just for the naming, but also to make it easier to swap out MessageQueue implementations.

Because of the special license this won't pass make tidy. License exceptions are listed in src/etc/licenseck.py. This appears to be a standard BSD license, which we can accept.

Have you measured any difference in performance with this implementation?

I'm going to run some tests locally here.

@brson
Contributor

brson commented Oct 16, 2013

It would be appropriate to put links to the original C source directly in the Rust source. Can you do so?

@toffaletti
Contributor Author

I agree, this merge should not be taken lightly. I have run the tests through valgrind's helgrind and drd tools without noticing any problems; however, the output is rather noisy even without this patch, so I could be missing things. I'd feel better if Rust had ThreadSanitizer (http://clang.llvm.org/docs/ThreadSanitizer.html) support and/or were helgrind-clean.

I will look into implementing a lock-free stack for SleeperList; I missed that detail.

The allocation per queue push is a concern. It would be possible to use a bounded free-list to reduce it, but that would complicate the code, and I'd first want to see measurements showing the allocation is actually an issue, because jemalloc is hard to beat.

I like encapsulating MessageQueue; perhaps a runtime flag that swaps between implementations could be a good way to go forward with merging this. That minimizes risk and makes testing easier than rebuilding all of Rust from this branch.

I have tried, but so far failed, to find a good measure of performance for this; benchmarking is hard. It would be fairly trivial to come up with some micro-benchmarks, but I was hoping to see a difference in real-world applications. I tried running the stage1 tests and benchmarks, and also apachebench at various concurrency levels against the rust-http server, hoping to see some improvement, but I didn't. For the tests and benchmarks, I suspect the scheduler overhead was too small to matter. As for rust-http, I believe I talked to you on IRC about the issues with libuv and doing I/O on multiple threads; the rust-http server really only utilizes a single core when I/O-bound because of libuv. Perhaps Servo would be a good test?

@toffaletti
Contributor Author

I'll update this branch when I get a chance, but until then the code also exists in stand-alone form at https://github.com/toffaletti/rust-code with links to the original algorithms, if that helps. On second thought, the code I'm linking to is more C-style pseudo-code; I'll try to dig up some actual compilable code (I believe some was posted on the mailing list) if you like.

https://github.com/toffaletti/rust-code/blob/master/mpmc_bounded_queue.rs#L29
https://github.com/toffaletti/rust-code/blob/master/mpsc_queue.rs#L30

I'll also explain the padding in a comment in the code, but I wanted to bring up an issue with it here. The padding is there to prevent false sharing; for it to work, the padding has to be at least the size of the CPU cache line. I hard-coded it at 64 bytes because I'm not aware of a good way to determine the actual value at compile time or runtime. I've seen references to some older server-class CPUs using 128-byte cache lines. I have no idea what ARM uses, but I think most modern desktop/server CPUs use 64.
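In today's Rust the same effect can be expressed with an alignment attribute instead of hand-counted pad bytes. A sketch assuming the same 64-byte line the PR does (repr(align) did not exist in 2013):

```rust
use std::sync::atomic::AtomicUsize;

// Pads and aligns its contents to a presumed 64-byte cache line.
#[repr(align(64))]
pub struct CachePadded<T>(pub T);

// The producer and consumer counters land on separate cache lines,
// so writes to one don't invalidate the line holding the other.
pub struct State {
    pub enqueue_pos: CachePadded<AtomicUsize>,
    pub dequeue_pos: CachePadded<AtomicUsize>,
}
```

This sidesteps the "64 minus the size of the padded object" arithmetic entirely, since the compiler rounds each field up to the alignment.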

@brson
Contributor

brson commented Oct 17, 2013

I think microbenchmarks are probably the best bet for measuring any results at this point. As you say, most of our real world code is going to swamp the scheduler itself with additional inefficiencies; we probably don't have anything that would actually be impacted by the scalability improvements this should bring (nor do I even have a massively multi-core machine to test on right now).

Last time I was tuning the scheduler I used this stupid benchmark and was just optimizing profiles to remove as much of the scheduler overhead as possible. I was also using mutrace to measure lock contention, which this change should reduce to zero.

@brson
Contributor

brson commented Oct 17, 2013

@toffaletti I think you shouldn't worry about the lock-free stack. There's no evidence a stack is better here.

@brson
Contributor

brson commented Oct 17, 2013

This does turn the casual_pop operations into normal atomic pops. In my last round of optimization these were pretty important for reducing lock contention. I imagine that's not nearly as important now that there are no locks, but it may still be an optimization to peek at the end of the queue non-atomically before deciding to pop. If we're going to leave the operations as they are for now, then it probably deserves an XXX comment indicating that we should reinvestigate later.

@brson
Contributor

brson commented Oct 17, 2013

Oh, the biggest source of contention is actually probably the WorkQueue - the work-stealing deque, the replacement of which is the topic of #4877.

@toffaletti
Contributor Author

I'll take a shot at the WorkQueue after I get some benchmark and testing in place to measure these changes. I think it'll be about 2-3 weeks before I'll have something more to show.

@toffaletti
Contributor Author

My first attempt at implementing the chase-lev deque is pretty ugly.

https://github.com/toffaletti/rust-code/blob/master/chase_lev_deque.rs

Interestingly, I believe I found a bug in the take function of the C11 implementation in http://www.di.ens.fr/~zappa/readings/ppopp13.pdf. They used unsigned ints for top/bottom, while the original algorithm used signed ints. The overflow caused by bottom = bottom - 1 when bottom is 0 makes the check top <= bottom pass, so take returns a value even when the queue is supposed to be empty. I suspect they never noticed this bug because they were storing ints and weren't distinguishing between 0 as a valid return value and 0 as "empty".
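The underflow is easy to demonstrate. A sketch of the empty check in take with both index types (these helper functions are illustrative, not the paper's code):

```rust
// With unsigned indices (as in the C11 implementation from the paper),
// `bottom - 1` wraps to u64::MAX when bottom == 0, so the
// "non-empty" check passes even though the deque is empty.
pub fn take_sees_element_unsigned(top: u64, bottom: u64) -> bool {
    let b = bottom.wrapping_sub(1); // the C code's implicit wraparound
    top <= b
}

// With signed indices (as in the original Chase-Lev algorithm),
// `bottom - 1` is -1 and the empty deque is detected correctly.
pub fn take_sees_element_signed(top: i64, bottom: i64) -> bool {
    let b = bottom - 1;
    top <= b
}
```

With top = bottom = 0 (an empty deque), the unsigned version spuriously reports an element while the signed version does not, which is exactly the divergence described above.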

I also implemented the deque in C++11 (https://github.com/toffaletti/libten/blob/ws_sched/include/ten/work_deque.hh) and plugged it into a task scheduler which has a decent test suite. I haven't seen any crashes yet, but helgrind/drd do report some races. I probably have a bug in the memory ordering for array initialization and/or array growth.

@brson
Contributor

brson commented Oct 24, 2013

I've not forgotten about this PR. I have stress-tested it here quite a bit without any failures, so I'm feeling pretty good about it, but I have yet to read the code for the data structures closely, which I still want to do.

2u
} else {
// use next power of 2 as capacity
2f64.pow(&((capacity as f64).log2().ceil())) as uint
Contributor

This can use uint::next_power_of_two.
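For reference, in modern Rust uint::next_power_of_two survives as the inherent method usize::next_power_of_two, so the float round-trip could be written as something like:

```rust
// Capacity helper: at least 2, rounded up to a power of two.
pub fn capacity_for(requested: usize) -> usize {
    if requested < 2 {
        2
    } else {
        requested.next_power_of_two()
    }
}
```

Besides being shorter, the integer version avoids any risk of floating-point rounding near large powers of two.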

@brson
Contributor

brson commented Oct 26, 2013

Reopened as #10080

@brson brson closed this Oct 26, 2013
@brson
Contributor

brson commented Oct 26, 2013

@toffaletti You may also be interested in #8568

bors added a commit that referenced this pull request Oct 28, 2013
}

struct State<T> {
pad0: [u8, ..64],


These can be less than 64 bytes; you can save a bit of memory by using 64 minus the size of the padded object.

Contributor Author

Rust has no sizeof that works at compile time.


Yeah, I hit that too with my padding. Since it's hard-coded, though, you can safely change the size to 56, knowing that buffer and mask will fill the remaining space. I admit it's a pretty minor issue; I wish my first comment on your patch had been something more helpful than this :)

Contributor Author

That won't work on 32-bit machines, though.

Member

@toffaletti you could do something with #[cfg(target_word_size = "32")] and #[cfg(target_word_size = "64")] to conditionally choose a definition of the structs.
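A sketch of that cfg approach: target_word_size was the 2013 attribute, and modern Rust spells it target_pointer_width; the pad sizes assume the 64-byte line discussed above and a word-sized field following the pad.

```rust
// One pad definition per word size, so pad + a word-sized counter
// together fill one presumed 64-byte cache line.
#[cfg(target_pointer_width = "64")]
pub struct Pad(pub [u8; 56]); // 64 - size_of::<u64>()

#[cfg(target_pointer_width = "32")]
pub struct Pad(pub [u8; 60]); // 64 - size_of::<u32>()
```

Only one definition is compiled per target, so the rest of the code can use Pad uniformly.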

Contributor Author

I'd also have to conditionally define the constructor. I chose to keep the code simple instead of saving the 16-32 bytes per CPU core.


You're right, I should have said 60 bytes. I forgot about the possibility of the cache line boundary falling between buffer and mask.
