Design goal, algorithmic description and performance tests.

Signed-off-by: Mathieu Desnoyers
Cc: Linus Torvalds
Cc: "H. Peter Anvin"
Cc: Jeremy Fitzhardinge
Cc: Andrew Morton
Cc: Ingo Molnar
Cc: "Paul E. McKenney"
Cc: Peter Zijlstra
Cc: Joe Perches
Cc: Wei Weng
---
 Documentation/psrwlock.txt |  440 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 440 insertions(+)

Index: linux-2.6-lttng/Documentation/psrwlock.txt
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-lttng/Documentation/psrwlock.txt	2008-09-08 20:29:13.000000000 -0400
@@ -0,0 +1,440 @@
+		 Priority Sifting Reader-Writer Locks
+		    Design and Performance tests
+
+			Mathieu Desnoyers, 2008
+
+
+****** Design Goal ******
+
+The main design goal is to lessen the impact of rwlocks on the irq and softirq
+latency of the system.
+
+A typical case leading to long interrupt latencies:
+
+- rwlock shared between
+  - Rare updates in thread context
+  - Frequent slow reads in thread context (task list iteration)
+  - Fast interrupt handler reads
+
+The slow write must therefore disable interrupts around the write lock, and
+thus adds to the global interrupt latency; the worst case is the duration of
+the slow read.
+
+
+****** Description of the psrwlock algorithm ******
+
+The writer fast path uses a single bit to indicate that a fast path writer is
+active (UC_WRITER). In the uncontended case, the writer upgrades its priority
+to the priority of the highest-priority reader (e.g. by disabling interrupts)
+and performs a cmpxchg which sets the UC_WRITER bit atomically only if there
+is no other writer nor reader in its critical section. If there is contention
+from either a reader or a writer, the writer falls back to the slow path.
+
+The writer slow path first sets the UC_SLOW_WRITER bit and increments the WS
+(writers in slow path) count (the two operations are made atomic by using the
+WS_COUNT_MUTEX bit as a mutex), and then subscribes to the preemptable lock,
+which locks out the preemptable reader threads. It waits for all the
+preemptable reader threads in the slow path to exit their critical section,
+and then for all the "in-flight" fast path readers to exit theirs. Then, it
+upgrades its priority by disabling preemption and does the same
+(subscription, wait for slow path readers, wait for fast path readers) for
+non-preemptable readers. The same is then done for bottom half and interrupt
+contexts. Once all the reader contexts have been excluded, the writer takes
+the slow path mutex and accesses the data structure.
+
+In its unlock, the writer detects whether it was a fast or a slow path writer
+by checking the UC_WRITER bit. A fast path writer only has to clear the
+UC_WRITER bit and bring its priority back to its original state (e.g.
+re-enabling interrupts). The slow path writer must unlock the mutex and lower
+its priority stage by stage (re-enabling interrupts, bh, preemption),
+unsubscribing from the corresponding context at each stage. Then, it checks
+whether it is the last writer in the slow path by decrementing and testing the
+WS (writers in slow path) count. If it is the last writer, it clears the
+UC_SLOW_WRITER bit (count and bit are made atomic by the WS_COUNT_MUTEX bit
+used as a mutex).
+
+The reader performs an atomic cmpxchg to check whether there is any contention
+on the lock (other readers or writers in their critical section). If not, it
+increments the reader count. If there are other active readers, this first
+cmpxchg fails, but a second cmpxchg is attempted at the beginning of the slow
+path if there is no writer contending for the lock. A cmpxchg is used to take
+the reader lock instead of a simple addition because we cannot afford to hold
+the read fast-path lock, even for a short period of time, from a lower
+priority execution context while a writer in a higher priority execution
+context is waiting for the lock. Doing so would result in priority inversion.
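+
+For illustration only, here is a minimal sketch of the uncontended fast paths
+described above. It assumes a hypothetical word layout (UC_WRITER and
+UC_SLOW_WRITER in the low bits, the fast path reader count in the upper bits,
+the UC_READER_UNIT name and the struct/function names are invented here) and
+relies on the kernel's cmpxchg(); it is not the actual psrwlock
+implementation, which tracks additional per-context state not shown here.
+
+	#define UC_WRITER	0x01UL	/* fast path writer active */
+	#define UC_SLOW_WRITER	0x02UL	/* at least one writer in the slow path */
+	#define UC_READER_UNIT	0x10UL	/* one fast path reader */
+
+	struct psrwlock_sketch {
+		unsigned long uc;	/* "uncontended" word */
+	};
+
+	/*
+	 * Uncontended writer fast path: succeeds only if no reader and no
+	 * writer is present. The caller has already raised its priority
+	 * (e.g. local_irq_disable()) to the highest reader context.
+	 */
+	static inline int write_trylock_fast(struct psrwlock_sketch *l)
+	{
+		return cmpxchg(&l->uc, 0UL, UC_WRITER) == 0UL;
+	}
+
+	/*
+	 * Uncontended reader fast path: expects no reader and no writer at
+	 * all, so the cmpxchg cannot transiently hold the lock from a low
+	 * priority context while a higher priority writer is waiting.
+	 */
+	static inline int read_trylock_fast(struct psrwlock_sketch *l)
+	{
+		return cmpxchg(&l->uc, 0UL, UC_READER_UNIT) == 0UL;
+	}
+
+	/*
+	 * Second attempt at the beginning of the reader slow path: other
+	 * readers are allowed, but any writer sends us to the full slow path.
+	 */
+	static inline int read_trylock_slow_begin(struct psrwlock_sketch *l)
+	{
+		unsigned long old = l->uc;
+
+		if (old & (UC_WRITER | UC_SLOW_WRITER))
+			return 0;	/* writer contention: full slow path */
+		return cmpxchg(&l->uc, old, old + UC_READER_UNIT) == old;
+	}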
+
+The reader slow path waits for the slow path writer subscription count in its
+particular context to become 0. When it does, the reader atomically increments
+the slow path reader count for this context. Then, it waits for all the fast
+path writers to exit their critical section and increments the fast path
+reader count. Before returning from the reader lock primitive, it decrements
+the slow path reader count for the context it subscribed to, so that it then
+behaves exactly as a fast path reader.
+
+The unlock primitive is then exactly the same for fast and slow path readers:
+they only have to decrement the fast path reader count.
+
+WS_WQ_MUTEX protects the waitqueue and the UC_WQ_ACTIVE changes. WS_WQ_MUTEX
+must be taken with interrupts off. The ordering of operations dealing with
+preemptable threads requires that any code path which can cause a thread to
+be added to the wait queue ends with a check of the UC_WQ_ACTIVE flag, which
+leads to a wake-up if there are pending threads in the wait queue. Also, any
+point where a thread can be woken up from the wait queue must be followed by
+a UC_WQ_ACTIVE check. Given that the UC_WQ_ACTIVE flag is tested without
+taking WS_WQ_MUTEX, we must make sure that threads added to the waitqueue
+first set the UC_WQ_ACTIVE flag and then re-test the condition which led them
+to be put to sleep.
+
+Upon unlock, the following sequence is done:
+- Atomically unlock and return the lock value, which contains the UC_WQ_ACTIVE
+  bit status at the moment the unlock is done.
+- If UC_WQ_ACTIVE is set:
+  - take WS_WQ_MUTEX,
+  - wake a thread,
+  - release WS_WQ_MUTEX.
+
+When a thread is ready to be added to the wait queue:
+- The last busy-looping iteration fails.
+- Take WS_WQ_MUTEX.
+- Set the UC_WQ_ACTIVE bit if the list is about to pass from inactive to
+  active.
+- Check again for the failed condition, since its status may have changed
+  since the busy-loop failed. If the condition now succeeds, return to
+  busy-looping after putting the UC_WQ_ACTIVE bit back to its original state
+  and releasing WS_WQ_MUTEX.
+- Add the current thread to the wait queue and change its state.
+- Release WS_WQ_MUTEX.
+
+Upon wakeup:
+- Take WS_WQ_MUTEX.
+- Set the state to running and remove the thread from the wait queue.
+- Clear UC_WQ_ACTIVE if the list has become inactive.
+- Release WS_WQ_MUTEX.
+
+A sequence similar to the unlock must be done when a trylock fails.
+
+
+****** Performance tests ******
+
+The test module is available at:
+
+http://ltt.polymtl.ca/svn/trunk/tests/kernel/test-psrwlock.c
+
+Test machine: dual quad-core Xeon 2.0GHz E5405
+
+
+**** Latency ****
+
+This section presents a detailed breakdown of the preemption, softirq and
+interrupt latency generated by the psrwlock. The "High contention" section
+compares the "irqsoff latency tracer" results between standard Linux kernel
+rwlocks and psrwlocks (tests done on wbias-rwlock v8).
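+
+As a reference for how the per-operation figures below can be read, here is a
+minimal sketch of a cycle-accurate delay measurement calibrated against the
+average get_cycles() overhead, which is how the results below are calibrated.
+The helper names are hypothetical, the psrwlock_sketch type and
+read_trylock_fast() come from the illustrative sketch earlier in this
+document, and this is not the code of the test module linked above.
+
+	#include <linux/timex.h>	/* get_cycles(), cycles_t */
+
+	static cycles_t calib;		/* average get_cycles() overhead */
+
+	static void calibrate_get_cycles(void)
+	{
+		cycles_t t1, t2, total = 0;
+		int i;
+
+		for (i = 0; i < 1024; i++) {
+			t1 = get_cycles();
+			t2 = get_cycles();
+			total += t2 - t1;
+		}
+		calib = total / 1024;	/* e.g. ~75 cycles on this machine */
+	}
+
+	/*
+	 * Delay, in cycles, of one read lock, with the average get_cycles()
+	 * overhead subtracted ("results calibrated on avg").
+	 */
+	static cycles_t measure_read_lock_delay(struct psrwlock_sketch *l)
+	{
+		cycles_t t1, t2;
+
+		t1 = get_cycles();
+		while (!read_trylock_fast(l))
+			cpu_relax();	/* contended: retry */
+		t2 = get_cycles();
+		return t2 - t1 - calib;
+	}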
+ +get_cycles takes [min,avg,max] 72,75,78 cycles, results calibrated on avg + +** Single writer test, no contention ** +SINGLE_WRITER_TEST_DURATION 10s + +IRQ latency for cpu 6 disabled 99490 times, [min,avg,max] 471,485,1527 cycles +SoftIRQ latency for cpu 6 disabled 99490 times, [min,avg,max] 693,704,3969 cycles +Preemption latency for cpu 6 disabled 99490 times, [min,avg,max] 909,917,4593 cycles + + +** Single trylock writer test, no contention ** +SINGLE_WRITER_TEST_DURATION 10s + +IRQ latency for cpu 2 disabled 10036 times, [min,avg,max] 393,396,849 cycles +SoftIRQ latency for cpu 2 disabled 10036 times, [min,avg,max] 609,614,1317 cycles +Preemption latency for cpu 2 disabled 10036 times, [min,avg,max] 825,826,1971 cycles + + +** Single reader test, no contention ** +SINGLE_READER_TEST_DURATION 10s + +Preemption latency for cpu 2 disabled 31596702 times, [min,avg,max] 502,508,54256 cycles + + +** Multiple readers test, no contention (4 readers, busy-loop) ** +MULTIPLE_READERS_TEST_DURATION 10s +NR_READERS 4 + +Preemption latency for cpu 1 disabled 9302974 times, [min,avg,max] 502,2039,88060 cycles +Preemption latency for cpu 3 disabled 9270742 times, [min,avg,max] 508,2045,61342 cycles +Preemption latency for cpu 6 disabled 13331943 times, [min,avg,max] 508,1387,309088 cycles +Preemption latency for cpu 7 disabled 4781453 times, [min,avg,max] 508,4092,230752 cycles + + +** High contention test ** +TEST_DURATION 60s +NR_WRITERS 2 +NR_TRYLOCK_WRITERS 1 +NR_READERS 4 +NR_TRYLOCK_READERS 1 +WRITER_DELAY 100us +TRYLOCK_WRITER_DELAY 1000us +TRYLOCK_WRITERS_FAIL_ITER 100 +THREAD_READER_DELAY 0 /* busy loop */ +INTERRUPT_READER_DELAY 100ms + +Standard Linux rwlock + +irqsoff latency trace v1.1.5 on 2.6.27-rc3-trace +-------------------------------------------------------------------- + latency: 2902 us, #3/3, CPU#5 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:8) + ----------------- + | task: wbiasrwlock_wri-4984 (uid:0 nice:-5 policy:0 rt_prio:0) + ----------------- + => started at: _write_lock_irq + => ended at: _write_unlock_irq + +# _------=> CPU# +# / _-----=> irqs-off +# | / _----=> need-resched +# || / _---=> hardirq/softirq +# ||| / _--=> preempt-depth +# |||| / +# ||||| delay +# cmd pid ||||| time | caller +# \ / ||||| \ | / +wbiasrwl-4984 5d..1 0us!: _write_lock_irq (0) +wbiasrwl-4984 5d..2 2902us : _write_unlock_irq (0) +wbiasrwl-4984 5d..3 2903us : trace_hardirqs_on (_write_unlock_irq) + + +Writer-biased rwlock, same test routine + +irqsoff latency trace v1.1.5 on 2.6.27-rc3-trace +-------------------------------------------------------------------- + latency: 33 us, #3/3, CPU#7 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:8) + ----------------- + | task: events/7-27 (uid:0 nice:-5 policy:0 rt_prio:0) + ----------------- + => started at: _spin_lock_irqsave + => ended at: _spin_unlock_irqrestore + +# _------=> CPU# +# / _-----=> irqs-off +# | / _----=> need-resched +# || / _---=> hardirq/softirq +# ||| / _--=> preempt-depth +# |||| / +# ||||| delay +# cmd pid ||||| time | caller +# \ / ||||| \ | / +events/7-27 7d... 0us+: _spin_lock_irqsave (0) +events/7-27 7d..1 33us : _spin_unlock_irqrestore (0) +events/7-27 7d..2 33us : trace_hardirqs_on (_spin_unlock_irqrestore) + +(latency unrelated to the tests, therefore irq latency <= 33us) + +wbias rwlock instrumentation (below) shows that interrupt latency has been 14176 +cycles, for a total of 7us. 
+
+Detailed psrwlock latency breakdown:
+
+IRQ latency for cpu 0 disabled 1086419 times, [min,avg,max] 316,2833,14176 cycles
+IRQ latency for cpu 1 disabled 1099517 times, [min,avg,max] 316,1820,8254 cycles
+IRQ latency for cpu 3 disabled 159088 times, [min,avg,max] 316,1409,5632 cycles
+IRQ latency for cpu 4 disabled 161 times, [min,avg,max] 340,1882,5206 cycles
+SoftIRQ latency for cpu 0 disabled 1086419 times, [min,avg,max] 2212,5350,166402 cycles
+SoftIRQ latency for cpu 1 disabled 1099517 times, [min,avg,max] 2230,4265,138988 cycles
+SoftIRQ latency for cpu 3 disabled 159088 times, [min,avg,max] 2212,3319,14992 cycles
+SoftIRQ latency for cpu 4 disabled 161 times, [min,avg,max] 2266,3802,7138 cycles
+Preemption latency for cpu 3 disabled 59855 times, [min,avg,max] 5266,15706,53494 cycles
+Preemption latency for cpu 4 disabled 72 times, [min,avg,max] 5728,14132,28042 cycles
+Preemption latency for cpu 5 disabled 55586612 times, [min,avg,max] 196,2080,126526 cycles
+
+Note: preemptable critical sections were implemented after the previous
+latency tests. It should be noted that the worst latency obtained with the
+wbias rwlock comes from the busy-loop on the wait queue protection mutex
+(100us):
+
+IRQ latency for cpu 3 disabled 2822178 times, [min,avg,max] 256,8892,209926 cycles
+disable : [] rwlock_wait+0x265/0x2c0
+
+
+
+**** Lock contention delays ****
+
+The number of cycles required to take the lock is benchmarked for each
+context.
+
+
+** get_cycles calibration **
+get_cycles takes [min,avg,max] 72,75,78 cycles, results calibrated on avg
+
+
+** Single writer test, no contention **
+
+* Writer-biased rwlocks v13
+
+writer_thread/0 iterations : 100274, lock delay [min,avg,max] 27,33,249 cycles
+writer_thread/0 iterations : 100274, unlock delay [min,avg,max] 27,30,10407 cycles
+
+* Standard rwlock, kernel 2.6.27-rc3
+
+writer_thread/0 iterations : 100322, lock delay [min,avg,max] 37,40,4537 cycles
+writer_thread/0 iterations : 100322, unlock delay [min,avg,max] 37,40,25435 cycles
+
+
+** Single preemptable reader test, no contention **
+
+The writer-biased rwlock uncontended fast path is twice as fast for both lock
+and unlock. Note that the wbias rwlock supports preemptable readers, while
+standard rwlocks disable preemption.
+
+* Writer-biased rwlocks v13
+
+preader_thread/0 iterations : 33856510, lock delay [min,avg,max] 27,29,34035 cycles
+preader_thread/0 iterations : 33856510, unlock delay [min,avg,max] 15,20,34701 cycles
+
+* Standard rwlock, kernel 2.6.27-rc3
+
+N/A : preemption must be disabled with standard rwlocks.
+
+
+** Single non-preemptable reader test, no contention **
+
+The wbias rwlock read path is still twice as fast as the standard rwlock,
+even for reads done in non-preemptable context.
+
+* Writer-biased rwlocks v13
+
+npreader_thread/0 iterations : 33461225, lock delay [min,avg,max] 27,30,16329 cycles
+npreader_thread/0 iterations : 33461225, unlock delay [min,avg,max] 15,19,21657 cycles
+
+* Standard rwlock, kernel 2.6.27-rc3
+
+npreader_thread/0 iterations : 31639225, lock delay [min,avg,max] 37,39,127111 cycles
+npreader_thread/0 iterations : 31639225, unlock delay [min,avg,max] 37,42,215587 cycles
+
+
+** Multiple p(reemptable)/n(on-)p(reemptable) readers test, no contention **
+
+This contended case, where multiple readers try to access the data structure
+in a loop without any writer, shows that the standard rwlock average is a
+little bit better than the wbias rwlock. This can be explained by the fact
+that the wbias rwlock cmpxchg operation, which is used to keep the count of
+active readers, may fail if there is contention and must therefore be
+retried. The fast path actually expects the number of readers to be 0, which
+isn't the case here.
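+
+The retry behaviour can be sketched as follows, building on the hypothetical
+trylock sketch given earlier (the read_lock_readers_only() name is invented
+here): with concurrent readers, the expected "no reader, no writer" value is
+usually wrong, so the fast path cmpxchg fails and the second cmpxchg has to
+re-read the count and may itself be retried, whereas the uncontended standard
+rwlock read path is, on x86, essentially a single unconditional atomic
+decrement.
+
+	/*
+	 * Reader-reader contention path: loop on cmpxchg until this reader's
+	 * count increment is published. Each failed attempt is the extra
+	 * cost visible in the "multiple readers" averages below.
+	 */
+	static void read_lock_readers_only(struct psrwlock_sketch *l)
+	{
+		unsigned long old, res;
+
+		if (read_trylock_fast(l))	/* expects uc == 0 */
+			return;
+		do {
+			old = l->uc;	/* re-read: other readers moved it */
+			/* a contending writer would send us to the full slow path instead */
+			res = cmpxchg(&l->uc, old, old + UC_READER_UNIT);
+		} while (res != old);
+	}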
+
+* Writer-biased rwlocks v13
+
+npreader_thread/0 iterations : 16885001, lock delay [min,avg,max] 27,425,40239 cycles
+npreader_thread/0 iterations : 16885001, unlock delay [min,avg,max] 15,220,18153 cycles
+npreader_thread/1 iterations : 16832690, lock delay [min,avg,max] 33,433,26841 cycles
+npreader_thread/1 iterations : 16832690, unlock delay [min,avg,max] 15,219,22329 cycles
+preader_thread/0 iterations : 17185174, lock delay [min,avg,max] 27,438,31437 cycles
+preader_thread/0 iterations : 17185174, unlock delay [min,avg,max] 15,211,30465 cycles
+preader_thread/1 iterations : 17293655, lock delay [min,avg,max] 27,435,53301 cycles
+preader_thread/1 iterations : 17293655, unlock delay [min,avg,max] 15,209,63921 cycles
+
+* Standard rwlock, kernel 2.6.27-rc3
+
+npreader_thread/0 iterations : 19248438, lock delay [min,avg,max] 37,273,364459 cycles
+npreader_thread/0 iterations : 19248438, unlock delay [min,avg,max] 43,216,272539 cycles
+npreader_thread/1 iterations : 19251717, lock delay [min,avg,max] 37,242,365719 cycles
+npreader_thread/1 iterations : 19251717, unlock delay [min,avg,max] 43,249,162847 cycles
+preader_thread/0 iterations : 19557931, lock delay [min,avg,max] 37,250,334921 cycles
+preader_thread/0 iterations : 19557931, unlock delay [min,avg,max] 37,245,266377 cycles
+preader_thread/1 iterations : 19671318, lock delay [min,avg,max] 37,258,390913 cycles
+preader_thread/1 iterations : 19671318, unlock delay [min,avg,max] 37,234,604507 cycles
+
+
+
+** High contention test **
+
+In the high contention test:
+
+TEST_DURATION 60s
+NR_WRITERS 2
+NR_TRYLOCK_WRITERS 1
+NR_PREEMPTABLE_READERS 2
+NR_NON_PREEMPTABLE_READERS 2
+NR_TRYLOCK_READERS 1
+WRITER_DELAY 100us
+TRYLOCK_WRITER_DELAY 1000us
+TRYLOCK_WRITERS_FAIL_ITER 100
+THREAD_READER_DELAY 0 /* busy loop */
+INTERRUPT_READER_DELAY 100ms
+
+
+* Preemptable writers
+
+* Writer-biased rwlocks v13
+
+writer_thread/0 iterations : 537678, lock delay [min,avg,max] 123,14021,8580813 cycles
+writer_thread/0 iterations : 537678, unlock delay [min,avg,max] 387,9070,1450053 cycles
+writer_thread/1 iterations : 536944, lock delay [min,avg,max] 123,13179,8687331 cycles
+writer_thread/1 iterations : 536944, unlock delay [min,avg,max] 363,10430,1400835 cycles
+
+* Standard rwlock, kernel 2.6.27-rc3
+
+writer_thread/0 iterations : 222797, lock delay [min,avg,max] 127,336611,4710367 cycles
+writer_thread/0 iterations : 222797, unlock delay [min,avg,max] 151,2009,714115 cycles
+writer_thread/1 iterations : 6845, lock delay [min,avg,max] 139,17271138,352848961 cycles
+writer_thread/1 iterations : 6845, unlock delay [min,avg,max] 217,93935,1991509 cycles
+
+
+* Non-preemptable readers
+
+* Writer-biased rwlocks v13
+
+npreader_thread/0 iterations : 64652609, lock delay [min,avg,max] 27,828,67497 cycles
+npreader_thread/0 iterations : 64652609, unlock delay [min,avg,max] 15,485,202773 cycles
+npreader_thread/1 iterations : 65143310, lock delay [min,avg,max] 27,817,64569 cycles
+npreader_thread/1 iterations : 65143310, unlock delay [min,avg,max] 15,484,133611 cycles
+
+* Standard rwlock, kernel 2.6.27-rc3
+
+npreader_thread/0 iterations : 68298472, lock delay [min,avg,max] 37,640,733423 cycles
+npreader_thread/0 iterations : 68298472, unlock delay [min,avg,max] 37,565,672241 cycles
+npreader_thread/1 iterations : 70331311, lock delay [min,avg,max] 37,603,393925 cycles
+npreader_thread/1 iterations : 70331311, unlock delay [min,avg,max] 37,558,373477 cycles
+
+
+* Preemptable readers
+
+* Writer-biased rwlocks v13
+
+preader_thread/0 iterations : 38484022, lock delay [min,avg,max] 27,2207,89363619 cycles
+preader_thread/0 iterations : 38484022, unlock delay [min,avg,max] 15,392,1965315 cycles
+preader_thread/1 iterations : 44661191, lock delay [min,avg,max] 27,1672,8708253 cycles
+preader_thread/1 iterations : 44661191, unlock delay [min,avg,max] 15,495,142119 cycles
+
+* Standard rwlock, kernel 2.6.27-rc3
+
+N/A : preemption must be disabled with standard rwlocks.
+
+
+* Interrupt context readers
+
+* Writer-biased rwlocks v13 (note: the highest unlock delays (32us) are
+  caused by the wakeup of the wait queue done at the exit of the critical
+  section when the waitqueue is active)
+
+interrupt readers on CPU 0, lock delay [min,avg,max] 135,1603,28119 cycles
+interrupt readers on CPU 0, unlock delay [min,avg,max] 9,1712,35355 cycles
+interrupt readers on CPU 1, lock delay [min,avg,max] 39,1756,18285 cycles
+interrupt readers on CPU 1, unlock delay [min,avg,max] 9,2628,58257 cycles
+interrupt readers on CPU 2, lock delay [min,avg,max] 129,1450,16533 cycles
+interrupt readers on CPU 2, unlock delay [min,avg,max] 27,2354,49647 cycles
+interrupt readers on CPU 3, lock delay [min,avg,max] 75,1758,27051 cycles
+interrupt readers on CPU 3, unlock delay [min,avg,max] 9,2446,63603 cycles
+interrupt readers on CPU 4, lock delay [min,avg,max] 159,1707,27903 cycles
+interrupt readers on CPU 4, unlock delay [min,avg,max] 9,1822,39957 cycles
+interrupt readers on CPU 6, lock delay [min,avg,max] 105,1635,24489 cycles
+interrupt readers on CPU 6, unlock delay [min,avg,max] 9,2390,36771 cycles
+interrupt readers on CPU 7, lock delay [min,avg,max] 135,1614,22995 cycles
+interrupt readers on CPU 7, unlock delay [min,avg,max] 9,2052,43479 cycles
+
+* Standard rwlock, kernel 2.6.27-rc3 (note: these numbers seem good, but they
+  do not take interrupt latency into account. See the interrupt latency tests
+  above for a discussion of this issue)
+
+interrupt readers on CPU 0, lock delay [min,avg,max] 55,573,4417 cycles
+interrupt readers on CPU 0, unlock delay [min,avg,max] 43,529,1591 cycles
+interrupt readers on CPU 1, lock delay [min,avg,max] 139,591,5731 cycles
+interrupt readers on CPU 1, unlock delay [min,avg,max] 31,534,2395 cycles
+interrupt readers on CPU 2, lock delay [min,avg,max] 127,671,6043 cycles
+interrupt readers on CPU 2, unlock delay [min,avg,max] 37,401,1567 cycles
+interrupt readers on CPU 3, lock delay [min,avg,max] 151,676,5569 cycles
+interrupt readers on CPU 3, unlock delay [min,avg,max] 127,536,2797 cycles
+interrupt readers on CPU 5, lock delay [min,avg,max] 127,531,15397 cycles
+interrupt readers on CPU 5, unlock delay [min,avg,max] 31,323,1747 cycles
+interrupt readers on CPU 6, lock delay [min,avg,max] 121,548,29125 cycles
+interrupt readers on CPU 6, unlock delay [min,avg,max] 31,435,2089 cycles
+interrupt readers on CPU 7, lock delay [min,avg,max] 37,613,5485 cycles
+interrupt readers on CPU 7, unlock delay [min,avg,max] 49,541,1645 cycles
--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68