
Implement lock-free queue for scheduler message queue #9710

Closed
wants to merge 11 commits into from

Conversation

toffaletti
Contributor

I implemented the lock-free intrusive MPSC queue algorithm by Dmitry Vyukov (http://www.1024cores.net/home/lock-free-algorithms/queues/intrusive-mpsc-node-based-queue) as a potential replacement for std::rt::message_queue::MessageQueue. make check-stage2 passes, but I've done little else to validate this change, and it's definitely a change that shouldn't be taken lightly. I included the license Dmitry asked me to use when I implemented a C++11 version of this queue a while ago; if the license is an issue, I suggest reaching out to him. I believe he works at Google now and recently rewrote Go's scheduler.
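For reference, here is a minimal non-intrusive sketch of Vyukov's algorithm in modern Rust. The PR itself predates today's std::sync::atomic API, so the names and orderings below are illustrative, not the PR's code; producers swap the head pointer, and the single consumer follows next pointers from the tail past a stub node.

```rust
use std::ptr;
use std::sync::atomic::{AtomicPtr, Ordering};

struct Node<T> {
    next: AtomicPtr<Node<T>>,
    value: Option<T>, // None for the stub node
}

pub struct Queue<T> {
    head: AtomicPtr<Node<T>>, // producers push here
    tail: *mut Node<T>,       // the single consumer pops here
}

impl<T> Queue<T> {
    pub fn new() -> Queue<T> {
        let stub = Box::into_raw(Box::new(Node {
            next: AtomicPtr::new(ptr::null_mut()),
            value: None,
        }));
        Queue { head: AtomicPtr::new(stub), tail: stub }
    }

    pub fn push(&self, value: T) {
        let node = Box::into_raw(Box::new(Node {
            next: AtomicPtr::new(ptr::null_mut()),
            value: Some(value),
        }));
        // Swap in the new head; the previous head then links to it.
        let prev = self.head.swap(node, Ordering::AcqRel);
        unsafe { (*prev).next.store(node, Ordering::Release) }
    }

    // &mut self encodes "single consumer" at the type level.
    pub fn pop(&mut self) -> Option<T> {
        unsafe {
            let tail = self.tail;
            let next = (*tail).next.load(Ordering::Acquire);
            if next.is_null() {
                return None; // empty (or a push is mid-flight)
            }
            self.tail = next;
            drop(Box::from_raw(tail)); // free the old stub/consumed node
            (*next).value.take()
        }
    }
}
// (A full implementation would also free any remaining nodes in Drop.)
```

Note the transient state between a producer's swap and its next-pointer store: the consumer briefly sees an empty queue, which is the algorithm's well-known caveat.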

@toffaletti
Contributor Author

check-stage1 failed in an rt::comm test; I'm not sure which one, though, since the console output is all interleaved.

@toffaletti
Contributor Author

I've narrowed it down to the rt::io tests:

$ ./x86_64-unknown-linux-gnu/stage1/test/stdtest-x86_64-unknown-linux-gnu rt::io

running 81 tests
test rt::io::buffered::test::test_buffered_writer ... ok
task '<unnamed>' failed at 'Unhandled condition: read_error: rt::io::IoError{kind: OtherIoError, desc: "Placeholder error. You shouldn't be seeing this", detail: None}', /home/jason/rust/src/libstd/condition.rs:131
test rt::io::buffered::test::test_buffered_reader ... ok
test rt::io::extensions::test::bytes_0_bytes ... ok
test rt::io::buffered::test::test_buffered_stream ... ok
test rt::io::extensions::test::bytes_eof ... ok
test rt::io::extensions::test::push_bytes ... task 'ok<unnamed>
' failed at 'Unhandled condition: read_error: rt::io::IoError{kind: OtherIoError, desc: "Placeholder error. You shouldn't be seeing this", detail: None}', /home/jason/rust/src/libstd/condition.rs:131
test rt::io::extensions::test::bytes_error ... ok
test rt::io::extensions::test::push_bytes_eof ... ok
test rt::io::extensions::test::push_bytes_error ... ok
test rt::io::extensions::test::push_bytes_partial ... ok
test rt::io::extensions::test::read_byte_0_bytes ... ok
test rt::io::extensions::test::read_byte_eof ... ok
test rt::io::extensions::test::read_byte ... ok
test rt::io::extensions::test::read_bytes ... ok
test rt::io::extensions::test::read_byte_error ... ok
test rt::io::extensions::test::read_bytes_eof ... ok
test rt::io::extensions::test::read_bytes_partial ... ok
test rt::io::extensions::test::read_to_end ... ok
test rt::io::extensions::test::push_bytes_fail_reset_len ... ok
test rt::io::extensions::test::test_read_be_int_n ... ok
test rt::io::extensions::test::test_read_f32 ... ok
test rt::io::extensions::test::test_read_write_be ... ok
test rt::io::extensions::test::read_to_end_error ... ok
test rt::io::extensions::test::test_read_write_f32 ... ok
test rt::io::extensions::test::test_read_write_le_mem ... ok
test rt::io::flate::test::smoke_test ... ignored
test rt::io::mem::test::test_buf_reader ... ok
test rt::io::mem::test::test_mem_reader ... ok
test rt::io::mem::test::test_mem_writer ... ok
test rt::io::mem::test::test_with_mem_writer ... ok
test rt::io::net::ip::test::ipv6_addr_to_str ... ok
test rt::io::net::ip::test::test_from_str_ipv4 ... ok
test rt::io::net::ip::test::test_from_str_ipv4_in_ipv6 ... ok
test rt::io::net::ip::test::test_from_str_ipv6 ... ok
test rt::io::net::ip::test::test_from_str_socket_addr ... ok
test rt::io::net::tcp::test::bind_error ... ignored
failed in non-task context at 'assertion failed: handle.is_not_null()', /home/jason/rust/src/libstd/rt/uv/mod.rs:105
failed in non-task context at 'assertion failed: handle.is_not_null()', /home/jason/rust/src/libstd/rt/uv/mod.rs:105
failed in non-task context at 'assertion failed: handle.is_not_null()', /home/jason/rust/src/libstd/rt/uv/mod.rs:105
failed in non-task context at 'assertion failed: handle.is_not_null()', /home/jason/rust/src/libstd/rt/uv/mod.rs:105
failed in non-task context at 'assertion failed: handle.is_not_null()', /home/jason/rust/src/libstd/rt/uv/mod.rs:105
Aborted

@toffaletti
Contributor Author

valgrind told me the abort came from running out of file descriptors. I raised the fd limit, and now I get:

$ ./x86_64-unknown-linux-gnu/stage1/test/stdtest-x86_64-unknown-linux-gnu rt::io

running 81 tests
task '<unnamed>' failed at 'Unhandled condition: read_error: rt::io::IoError{kind: OtherIoError, desc: "Placeholder error. You shouldn't be seeing this", detail: None}', /home/jason/rust/src/libstd/condition.rs:131
test rt::io::buffered::test::test_buffered_reader ... ok
test rt::io::buffered::test::test_buffered_stream ... ok
test rt::io::buffered::test::test_buffered_writer ... ok
test rt::io::extensions::test::bytes_0_bytes ... ok
test rt::io::extensions::test::bytes_eof ... ok
task '<unnamed>' failed at 'Unhandled condition: read_error: rt::io::IoError{kind: OtherIoError, desc: "Placeholder error. You shouldn't be seeing this", detail: None}', /home/jason/rust/src/libstd/condition.rs:131
test rt::io::extensions::test::push_bytes ... ok
test rt::io::extensions::test::bytes_error ... ok
test rt::io::extensions::test::push_bytes_eof ... ok
test rt::io::extensions::test::push_bytes_partial ... ok
test rt::io::extensions::test::read_byte ... ok
test rt::io::extensions::test::push_bytes_error ... ok
test rt::io::extensions::test::read_byte_0_bytes ... ok
test rt::io::extensions::test::read_byte_eof ... ok
test rt::io::extensions::test::read_byte_error ... ok
test rt::io::extensions::test::read_bytes_eof ... ok
test rt::io::extensions::test::read_bytes ... ok
test rt::io::extensions::test::read_bytes_partial ... ok
test rt::io::extensions::test::push_bytes_fail_reset_len ... ok
test rt::io::extensions::test::read_to_end ... ok
test rt::io::extensions::test::test_read_be_int_n ... ok
test rt::io::extensions::test::test_read_f32 ... ok
test rt::io::extensions::test::read_to_end_error ... ok
test rt::io::extensions::test::test_read_write_be ... ok
test rt::io::extensions::test::test_read_write_f32 ... ok
test rt::io::extensions::test::test_read_write_le_mem ... ok
test rt::io::flate::test::smoke_test ... ignored
test rt::io::mem::test::test_buf_reader ... ok
test rt::io::mem::test::test_mem_reader ... ok
test rt::io::mem::test::test_mem_writer ... ok
test rt::io::mem::test::test_with_mem_writer ... ok
test rt::io::net::ip::test::ipv6_addr_to_str ... ok
test rt::io::net::ip::test::test_from_str_ipv4 ... ok
test rt::io::net::ip::test::test_from_str_ipv6 ... ok
test rt::io::net::ip::test::test_from_str_ipv4_in_ipv6 ... ok
test rt::io::net::tcp::test::bind_error ... ignored
test rt::io::net::ip::test::test_from_str_socket_addr ... ok
task '<unnamed>' failed at 'assertion failed: !dir.exists()', /home/jason/rust/src/libstd/rt/io/file.rs:930


Instead of the poems I had hoped for, there came only a shuddering blackness
and ineffable loneliness; and I saw at last a fearful truth which no one had
ever dared to breathe before — the unwhisperable secret of secrets — The fact
that this city of stone and stridor is not a sentient perpetuation of Old New
York as London is of Old London and Paris of Old Paris, but that it is in fact
quite dead, its sprawling body imperfectly embalmed and infested with queer
animate things which have nothing to do with it as it was in life.

fatal runtime error: assertion failed: exit_status
Aborted

@toffaletti
Contributor Author

So the above failure was caused by the earlier fd-limit failures not cleaning up directories in tmp. I had to rm these two directories:

rm -rf tmp/before_and_after_dir
rm -rf tmp/di_readdir/

and now all tests pass.

test result: ok. 1194 passed; 0 failed; 91 ignored; 0 measured

I'd still be very wary of this change, though.

@brson
Contributor

brson commented Oct 6, 2013

Thanks! I'm out of town at the moment but will give this a close look this week.

@toffaletti
Contributor Author

I found SleeperList, which looks like a copy of the MessageQueue code, so I tried replacing it with mpsc_queue::Queue; but then I get crashes in the AtomicPtr::get refcount != 0 check. I haven't been able to track down which AtomicPtr is causing this yet, which is troubling, because I can't explain why or how this change would expose what looks like a bug in AtomicPtr. I suspect the bug is in my code rather than in AtomicPtr, but this sure is a head-scratcher.

compile_and_link: x86_64-unknown-linux-gnu/stage1/lib/rustc/x86_64-unknown-linux-gnu/lib/libstd.so
Warning: removing previous 'libstd-*.so' libraries: x86_64-unknown-linux-gnu/stage1/lib/rustc/x86_64-unknown-linux-gnu/lib//libstd-6c65cf4b443341b1-0.9-pre.so
failed in non-task context at 'assertion failed: (*self.data).count.load(Relaxed) > 0', /home/jason/rust/src/libstd/unstable/sync.rs:97
Segmentation fault

@alexcrichton
Member

cc #6837 and #6838

I've found that breaking on rust_begin_unwind is useful for generating stack traces in gdb, although the segfault itself would probably also give you a stack trace.

@alexcrichton
Member

I haven't given this a whole lot of thought, but there are two high-level points I was wondering about:

  • I think that the SleeperList is a multiple-consumer queue, meaning that the queue implemented here probably isn't suitable for that purpose.
  • Along those lines, it's certainly unsafe to use this queue with more than one consumer, so I was curious whether we could leverage the Rust type system to enforce this safety. Right now nothing guarantees a single consumer in safe code, so safe code can run into problems when using this queue.
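One way to encode the single-consumer restriction in the type system is to split the queue into a cloneable producer endpoint and a non-cloneable consumer endpoint. A sketch with a stand-in internal queue (the endpoint split is the point here, not the queue internals; Pusher/Popper are illustrative names):

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Mutex};

// Stand-in shared queue; a real version would be the lock-free structure.
struct Shared<T> {
    inner: Mutex<VecDeque<T>>,
}

pub struct Pusher<T>(Arc<Shared<T>>); // Clone: any number of producers
pub struct Popper<T>(Arc<Shared<T>>); // deliberately NOT Clone: one consumer

impl<T> Clone for Pusher<T> {
    fn clone(&self) -> Self {
        Pusher(Arc::clone(&self.0))
    }
}

impl<T> Pusher<T> {
    pub fn push(&self, v: T) {
        self.0.inner.lock().unwrap().push_back(v);
    }
}

impl<T> Popper<T> {
    // &mut self: even the one consumer can't pop from two places at once.
    pub fn pop(&mut self) -> Option<T> {
        self.0.inner.lock().unwrap().pop_front()
    }
}

pub fn queue<T>() -> (Pusher<T>, Popper<T>) {
    let shared = Arc::new(Shared { inner: Mutex::new(VecDeque::new()) });
    (Pusher(Arc::clone(&shared)), Popper(shared))
}
```

Because Popper implements neither Clone nor Copy, safe code cannot create a second consumer; the multi-producer side remains freely cloneable.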

@toffaletti
Contributor Author

@alexcrichton I was wondering whether SleeperList needed to be multi-consumer and whether that might be the issue; if so, it explains the crashes. I will implement another lock-free queue for SleeperList. I've also thought about using the type system to enforce the thread-safety properties of the queue. I'm thinking of something similar to comm::stream(), which would create a Pusher that is cloneable and sendable and a Popper that isn't. I'll probably need some help with the intricacies of the type system.

@toffaletti
Contributor Author

Tests are now passing with the SleeperList using the multiple-producer multiple-consumer bounded queue. I used this algorithm: http://www.1024cores.net/home/lock-free-algorithms/queues/bounded-mpmc-queue. The downside is that the queue's capacity is bounded to powers of 2; however, it is very fast and easy to understand. I wasn't sure what I could safely do inside the scheduler when a push to the sleeper list failed because the queue was full, so for now I just put an assert! in and made the queue fairly large. Usually there would be a spin loop with back-off yielding.
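For context, the core of the bounded MPMC algorithm is a per-slot sequence number that producers and consumers compare against their position counters. A compilable sketch in modern Rust (simplified from Vyukov's description, ignoring counter wraparound, and not the PR's code):

```rust
use std::cell::UnsafeCell;
use std::sync::atomic::{AtomicUsize, Ordering};

struct Slot<T> {
    seq: AtomicUsize,
    value: UnsafeCell<Option<T>>,
}

pub struct Bounded<T> {
    buffer: Vec<Slot<T>>,
    mask: usize, // capacity - 1; works because capacity is a power of two
    enqueue_pos: AtomicUsize,
    dequeue_pos: AtomicUsize,
}

unsafe impl<T: Send> Sync for Bounded<T> {}

impl<T> Bounded<T> {
    pub fn with_capacity(capacity: usize) -> Bounded<T> {
        let cap = capacity.next_power_of_two();
        let buffer = (0..cap)
            .map(|i| Slot { seq: AtomicUsize::new(i), value: UnsafeCell::new(None) })
            .collect();
        Bounded { buffer, mask: cap - 1, enqueue_pos: AtomicUsize::new(0), dequeue_pos: AtomicUsize::new(0) }
    }

    /// Returns false when full (the point where the PR asserts).
    pub fn push(&self, value: T) -> bool {
        let mut pos = self.enqueue_pos.load(Ordering::Relaxed);
        loop {
            let slot = &self.buffer[pos & self.mask];
            let seq = slot.seq.load(Ordering::Acquire);
            if seq == pos {
                // Slot is free at our position; try to claim it.
                match self.enqueue_pos.compare_exchange_weak(pos, pos + 1, Ordering::Relaxed, Ordering::Relaxed) {
                    Ok(_) => {
                        unsafe { *slot.value.get() = Some(value) }
                        slot.seq.store(pos + 1, Ordering::Release);
                        return true;
                    }
                    Err(p) => pos = p,
                }
            } else if seq < pos {
                return false; // full
            } else {
                pos = self.enqueue_pos.load(Ordering::Relaxed); // lost a race; reload
            }
        }
    }

    pub fn pop(&self) -> Option<T> {
        let mut pos = self.dequeue_pos.load(Ordering::Relaxed);
        loop {
            let slot = &self.buffer[pos & self.mask];
            let seq = slot.seq.load(Ordering::Acquire);
            if seq == pos + 1 {
                match self.dequeue_pos.compare_exchange_weak(pos, pos + 1, Ordering::Relaxed, Ordering::Relaxed) {
                    Ok(_) => {
                        let value = unsafe { (*slot.value.get()).take() };
                        // Mark the slot free for the producer one lap ahead.
                        slot.seq.store(pos + self.mask + 1, Ordering::Release);
                        return value;
                    }
                    Err(p) => pos = p,
                }
            } else if seq < pos + 1 {
                return None; // empty
            } else {
                pos = self.dequeue_pos.load(Ordering::Relaxed);
            }
        }
    }
}
```

The sequence-number scheme is why the capacity must be a power of two: the mask turns the monotonically increasing position counters into buffer indices.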

@brson
Contributor

brson commented Oct 16, 2013

Forgive me for taking so long on this. It looks very impressive, but I'm wary of hastily merging concurrency code these days.

I think it's not too big a problem for the SleeperList to have a bound, at least for now. At the moment the number of work stealing schedulers is fixed at startup, putting a bound on the size of the SleeperList. In the future that number will be able to increase dynamically, but I still don't think that having a reasonable limit on the number of schedulers is a major problem.

This does convert the SleeperList from a stack to a queue. I'm not sure if this is good or bad. I chose a stack originally with the vague idea that it would be better for cache reasons to keep using threads that were most recently active, but it may not matter at all.

This also makes every push to the scheduler message queue allocate, so there's going to be a performance tradeoff here. It will probably scale a lot better when message queues are contended, but be slower to push/pop.

A minor nit, but I would prefer that the mpsc queue be encapsulated in the scheduler-specific MessageQueue type. Partly just for the naming, but also to make it easier to swap out MessageQueue implementations.

Because of the special license this won't pass make tidy. License exceptions are listed in src/etc/licenseck.py. This appears to be a standard BSD license, which we can accept.

Have you measured any difference in performance with this implementation?

I'm going to run some tests locally here.

@brson
Contributor

brson commented Oct 16, 2013

It would be appropriate to put links to the original C source directly in the Rust source. Can you do so?

@toffaletti
Contributor Author

I agree, this merge should not be taken lightly. I have run the tests through valgrind's helgrind and drd tools without noticing any problems; however, the output is rather noisy even without this patch, so I could be missing things. I'd feel better if Rust had ThreadSanitizer (http://clang.llvm.org/docs/ThreadSanitizer.html) support and/or were helgrind-clean.

I will look into implementing a lock-free stack for SleeperList; I missed that detail.

The allocation per queue push is a concern. It would be possible to use a bounded free-list to reduce it, but that would complicate the code, and I'd first want to see measurements showing the allocation is actually an issue, because jemalloc is hard to beat.

I like encapsulating MessageQueue; perhaps a runtime flag that swaps between implementations could be a good way to go forward with merging this. That minimizes risk and makes testing easier than rebuilding all of Rust from this branch.

I have tried, but so far failed, to find a good measure of performance for this; benchmarking is hard. It would be fairly trivial to come up with some micro-benchmarks, but I was hoping to see a difference in real-world applications. I tried running the stage1 tests and benchmarks, and also apachebench at various concurrency levels against the rust-http server, hoping to see some improvement, but I didn't. For the tests and benchmarks, I suspect the scheduler overhead was too small to matter. As for rust-http, I believe I talked to you on IRC about the issues with libuv and doing I/O on multiple threads; the rust-http server really only utilizes a single core when I/O-bound because of libuv. Perhaps Servo would be a good test?

@toffaletti
Contributor Author

I'll update this branch when I get a chance, but until then the code also exists in stand-alone form at https://github.com/toffaletti/rust-code with links to the original algorithms, if that helps. On second thought, the code I'm linking to is more C-style pseudo-code; I'll try to dig up some actual compilable code (I believe some was posted on the mailing list) if you like.

https://github.com/toffaletti/rust-code/blob/master/mpmc_bounded_queue.rs#L29
https://github.com/toffaletti/rust-code/blob/master/mpsc_queue.rs#L30

I'll also explain the padding in a comment in the code, but I wanted to bring up an issue with it here. The padding is there to prevent false sharing; for it to work, the padding has to be at least the size of the CPU cache line. I hard-coded it at 64 bytes because I'm not aware of a good way to determine the actual value at compile time or runtime. I've seen references to some older server-class CPUs using 128-byte cache lines. I have no idea what ARM uses, but I think most modern desktop/server CPUs use 64.
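In today's Rust the same effect can be expressed with an alignment attribute instead of hand-counted pad bytes. A sketch assuming the same 64-byte line the PR does (repr(align) did not exist in 2013):

```rust
use std::sync::atomic::AtomicUsize;

// Pads and aligns its contents to a presumed 64-byte cache line.
#[repr(align(64))]
pub struct CachePadded<T>(pub T);

// The producer and consumer counters land on separate cache lines,
// so writes to one don't invalidate the line holding the other.
pub struct State {
    pub enqueue_pos: CachePadded<AtomicUsize>,
    pub dequeue_pos: CachePadded<AtomicUsize>,
}
```

This sidesteps the "64 minus the size of the padded object" arithmetic entirely, since the compiler rounds each field up to the alignment.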

@brson
Contributor

brson commented Oct 17, 2013

I think microbenchmarks are probably the best bet for measuring any results at this point. As you say, most of our real world code is going to swamp the scheduler itself with additional inefficiencies; we probably don't have anything that would actually be impacted by the scalability improvements this should bring (nor do I even have a massively multi-core machine to test on right now).

Last time I was tuning the scheduler I used this stupid benchmark and was just optimizing profiles to remove as much of the scheduler overhead as possible. I was also using mutrace to measure lock contention, which this change should reduce to zero.

@brson
Contributor

brson commented Oct 17, 2013

@toffaletti I think you shouldn't worry about the lock-free stack. There's no evidence a stack is better here.

@brson
Contributor

brson commented Oct 17, 2013

This does turn the casual_pop operations into normal atomic pops. In my last round of optimization these were pretty important for reducing lock contention. I imagine that's not nearly as important now that there are no locks, but it may still be an optimization to peek at the end of the queue non-atomically before deciding to pop. If we're going to leave the operations as they are for now, then it probably deserves an XXX comment indicating that we should reinvestigate later.

@brson
Contributor

brson commented Oct 17, 2013

Oh, the biggest source of contention is actually probably the WorkQueue - the work-stealing deque, the replacement of which is the topic of #4877.

@toffaletti
Contributor Author

I'll take a shot at the WorkQueue after I get some benchmark and testing in place to measure these changes. I think it'll be about 2-3 weeks before I'll have something more to show.

@toffaletti
Contributor Author

My first attempt at implementing the chase-lev deque is pretty ugly.

https://github.com/toffaletti/rust-code/blob/master/chase_lev_deque.rs

Interestingly, I believe I found a bug in the take function of the C11 implementation in http://www.di.ens.fr/~zappa/readings/ppopp13.pdf. They used unsigned ints for top/bottom, while the original algorithm used signed ints. The overflow caused by bottom = bottom - 1 when bottom is 0 makes the check top <= bottom pass, so take returns a value even when the queue is supposed to be empty. I suspect they never noticed this bug because they were storing ints and weren't distinguishing between 0 as a valid return value and 0 as "empty".
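The underflow is easy to demonstrate. A sketch of the empty check in take with both index types (these helper functions are illustrative, not the paper's code):

```rust
// With unsigned indices (as in the C11 implementation from the paper),
// `bottom - 1` wraps to u64::MAX when bottom == 0, so the
// "non-empty" check passes even though the deque is empty.
pub fn take_sees_element_unsigned(top: u64, bottom: u64) -> bool {
    let b = bottom.wrapping_sub(1); // the C code's implicit wraparound
    top <= b
}

// With signed indices (as in the original Chase-Lev algorithm),
// `bottom - 1` is -1 and the empty deque is detected correctly.
pub fn take_sees_element_signed(top: i64, bottom: i64) -> bool {
    let b = bottom - 1;
    top <= b
}
```

With top = bottom = 0 (an empty deque), the unsigned version spuriously reports an element while the signed version does not, which is exactly the divergence described above.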

I also implemented the deque in C++11 (https://github.com/toffaletti/libten/blob/ws_sched/include/ten/work_deque.hh) and plugged it into a task scheduler which has a decent test suite. I haven't seen any crashes yet, but helgrind/drd do report some races. I probably have a bug in the memory ordering for array initialization and/or array growth.

@brson
Contributor

brson commented Oct 24, 2013

I've not forgotten about this PR. I have stress-tested it here quite a bit without any failures, so I'm feeling pretty good about it, but I have yet to read the code for the data structures closely, which I still want to do.

2u
} else {
// use next power of 2 as capacity
2f64.pow(&((capacity as f64).log2().ceil())) as uint
Contributor

This can use uint::next_power_of_two.
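For reference, in modern Rust uint::next_power_of_two survives as the inherent method usize::next_power_of_two, so the float round-trip could be written as something like:

```rust
// Capacity helper: at least 2, rounded up to a power of two.
pub fn capacity_for(requested: usize) -> usize {
    if requested < 2 {
        2
    } else {
        requested.next_power_of_two()
    }
}
```

Besides being shorter, the integer version avoids any risk of floating-point rounding near large powers of two.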

@brson
Contributor

brson commented Oct 26, 2013

Reopened as #10080

@brson brson closed this Oct 26, 2013
@brson
Contributor

brson commented Oct 26, 2013

@toffaletti You may also be interested in #8568

bors added a commit that referenced this pull request Oct 28, 2013
}

struct State<T> {
pad0: [u8, ..64],


These can be less than 64 bytes; you can save a bit of memory by using 64 minus the size of the padded object.

Contributor Author

Rust has no sizeof that works at compile time.


Yeah, I hit that too with my padding. Since it's hard-coded, though, you can safely change the size to 56, knowing that buffer and mask will fill the remaining space. I admit it's a pretty minor issue; I wish my first comment on your patch had been something more helpful than this :)

Contributor Author

That won't work on 32-bit machines, though.

Member

@toffaletti you could do something with #[cfg(target_word_size = "32")] and #[cfg(target_word_size = "64")] to conditionally choose a definition of the structs.
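A sketch of that cfg approach: target_word_size was the 2013 attribute, and modern Rust spells it target_pointer_width; the pad sizes assume the 64-byte line discussed above and a word-sized field following the pad.

```rust
// One pad definition per word size, so pad + a word-sized counter
// together fill one presumed 64-byte cache line.
#[cfg(target_pointer_width = "64")]
pub struct Pad(pub [u8; 56]); // 64 - size_of::<u64>()

#[cfg(target_pointer_width = "32")]
pub struct Pad(pub [u8; 60]); // 64 - size_of::<u32>()
```

Only one definition is compiled per target, so the rest of the code can use Pad uniformly.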

Contributor Author

I'd also have to conditionally define the constructor. I chose to keep the code simple instead of saving the 16-32 bytes per CPU core.


You're right, I should have said 60 bytes. I forgot about the possibility of the cache line boundary falling between buffer and mask.
