Date: Tue, 20 Oct 2009 22:02:00 +0300
From: Török Edwin <edwin@...mav.net>
To: Peter Zijlstra <peterz@...radead.org>
CC: David Howells <dhowells@...hat.com>, Ingo Molnar <mingo@...e.hu>,
	Linux Kernel <linux-kernel@...r.kernel.org>, aCaB <acab@...mav.net>,
	Nick Piggin <npiggin@...e.de>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	Thomas Gleixner <tglx@...utronix.de>
Subject: Re: Mutex vs semaphores scheduler bug

On 2009-10-17 18:32, Peter Zijlstra wrote:
> On Fri, 2009-10-16 at 00:44 +0100, David Howells wrote:
>> Peter Zijlstra <peterz@...radead.org> wrote:
>>
>>> The problem appears to be that rwsem doesn't allow lock-stealing
>>
>> With good reason. rwsems can be read or write locked for a long time - so if
>> readers can jump the queue on read-locked rwsems, then writer starvation is a
>> real possibility. I carefully implemented it so that it is a strict FIFO to
>> avoid certain problems I was having.
>
> Well, it kinda sucks that rwsem is slower than a mutex.
>
> What about allowing writer stealing when the next contending task is a
> writer?

Also, can the scheduler be improved so that it doesn't unnecessarily wake up
processes that can't take the semaphore yet?

AFAICT (in Linux 2.6.31.3) this is what happens:

  semaphore fails to get acquired
    -> rwsem.c:rwsem_down_failed_common
    -> if nobody holds the semaphore, wake up the process(es) that can acquire it
    -> loop calling schedule() until woken

The thread holding the semaphore then releases it (no wakeups!), and the
scheduler will schedule the next process. (A condensed sketch of this slow
path is included below.)

There are several things that are suboptimal here IMHO:

1. The task calls schedule(), but does that prevent the scheduler from
   scheduling it again while waiter->task is still non-NULL? If not, there are
   useless wakeups: the task gets scheduled, sees that waiter->task is still
   set, and calls schedule() again.

2. The process at the front of the queue is woken only when there is
   contention, which adds unnecessary latency for the waiter at the front of
   the queue.

3. Due to 2), a task on another CPU waiting for the semaphore is only woken
   when schedule() is called, or when another thread tries to acquire the lock
   and fails.

Suggested solution: schedule() should know that scheduling this process is
useless (it won't get the lock), so it shouldn't attempt to run it (perhaps by
setting a flag in the task state). When a thread releases a semaphore, it
should also try to wake the other threads that are waiting on it (if any).
With both of these changes the scheduler will know that when it schedules the
task, the task will be able to acquire the semaphore (it was woken by
wake_up_process() in rwsem_do_wake, which would also clear the do-not-wake
flag). The thread will also be scheduled as soon as another thread (on another
CPU) releases the semaphore, instead of waiting for its timeslice to elapse.
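For reference, the slow path described above looks roughly like this. The
snippet is a condensed paraphrase (from memory, not the verbatim source) of
the body of rwsem_down_failed_common() in lib/rwsem.c as of 2.6.31, with the
reader/writer bookkeeping details omitted; the only thing that ends the
schedule() loop is __rwsem_do_wake() clearing waiter->task and calling
wake_up_process():

	/* condensed paraphrase of rwsem_down_failed_common(), not verbatim */
	struct task_struct *tsk = current;
	signed long count;

	set_task_state(tsk, TASK_UNINTERRUPTIBLE);

	/* queue ourselves on the semaphore's wait list */
	spin_lock_irq(&sem->wait_lock);
	waiter->task = tsk;
	get_task_struct(tsk);
	list_add_tail(&waiter->list, &sem->wait_list);

	/* adjust the count: we are now waiting rather than actively locking */
	count = rwsem_atomic_update(adjustment, sem);

	/* the front of the queue is woken here only when nobody
	 * currently holds the lock */
	if (!(count & RWSEM_ACTIVE_MASK))
		sem = __rwsem_do_wake(sem, 0);

	spin_unlock_irq(&sem->wait_lock);

	/* wait to be granted the lock: __rwsem_do_wake() clears waiter->task
	 * and calls wake_up_process(); until then, keep calling schedule() */
	for (;;) {
		if (!waiter->task)
			break;
		schedule();
		set_task_state(tsk, TASK_UNINTERRUPTIBLE);
	}

	tsk->state = TASK_RUNNING;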
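A minimal, purely illustrative sketch of the release-side half of the
suggestion (wake the front waiter when the lock is released, instead of
leaving it to be woken by a later contended acquire). rwsem_release_wake_front()
is an invented name, struct rwsem_waiter is private to lib/rwsem.c, the
count/handoff accounting is omitted, and the proposed do-not-schedule hint
would need separate scheduler support that is not shown:

	/* illustrative sketch only, not existing kernel code */
	static void rwsem_release_wake_front(struct rw_semaphore *sem)
	{
		struct rwsem_waiter *waiter;
		struct task_struct *tsk;
		unsigned long flags;

		spin_lock_irqsave(&sem->wait_lock, flags);
		if (!list_empty(&sem->wait_list)) {
			/* hand the lock to the first queued waiter
			 * (count/handoff accounting omitted) */
			waiter = list_first_entry(&sem->wait_list,
						  struct rwsem_waiter, list);
			tsk = waiter->task;
			list_del(&waiter->list);
			smp_mb();
			waiter->task = NULL;	/* lets it leave its schedule() loop */
			wake_up_process(tsk);	/* ...and run without further delay */
			put_task_struct(tsk);
		}
		spin_unlock_irqrestore(&sem->wait_lock, flags);
	}

The body deliberately mirrors what __rwsem_do_wake() does when a contended
acquirer finds the lock free; the point here is to also get that wakeup at
release time, so the front waiter waits neither for another acquirer to fail
nor for its next scheduling opportunity.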
Here is a snippet of an ftrace sched_switch trace:

 <...>-27515   [000] 44338.386867: 27515:120:R   + [000] 27522:120:D <...>
 <...>-27515   [000] 44338.386868: 27515:120:R   + [000] 27511:120:D <...>
 <...>-27515   [000] 44338.386869: 27515:120:R   + [001] 27523:120:D <...>
 <...>-27515   [000] 44338.386875: 27515:120:R   + [003] 27512:120:D <...>
 <...>-27515   [000] 44338.386877: 27515:120:R   + [000] 27517:120:S <...>
 <...>-27515   [000] 44338.386905: 27515:120:D ==> [000] 27517:120:R <...>
 <...>-27517   [000] 44338.386907: 27517:120:S ==> [000] 27522:120:R <...>
 <...>-27522   [000] 44338.386915: 27522:120:D ==> [000] 27511:120:R <...>
 <...>-27511   [000] 44338.386917: 27511:120:R   + [001] 27523:120:D <...>
 <...>-27511   [000] 44338.386918: 27511:120:D ==> [000]     0:140:R <idle>
 <idle>-0      [000] 44338.386973:     0:140:R ==> [000] 27522:120:R <...>

If I interpret this correctly, 27515 wakes a bunch of readers (27511, 27523,
27512, 27517), then it switches to 27517. However, 27517 immediately schedules
again (!) after 2µs, handing the CPU to 27522, which runs for 8µs and then
schedules to 27511; that runs for a short while and finally schedules to idle.
All this scheduling ping-pong looks like a waste: none of the tasks that were
woken up did useful work.

Another situation is this scheduler ping-pong with md4_raid10: notice how
27961 wakes up md4_raid10 and switches to it, md4_raid10 schedules back to
27961, which then schedules to idle (2µs later). md4_raid10 could have gone
directly to idle:

 md4_raid10-982   [000] 44746.579830:   982:115:S ==> [000] 27961:120:R lt-clamd
 lt-clamd-27964   [003] 44746.579892: 27964:120:R   + [001] 27966:120:D lt-clamd
 lt-clamd-27964   [003] 44746.580065: 27964:120:D ==> [003]     0:140:R <idle>
 lt-clamd-27961   [000] 44746.580069: 27961:120:R   + [003] 27964:120:D lt-clamd
 <idle>-0         [003] 44746.580073:     0:140:R ==> [003] 27964:120:R lt-clamd
 lt-clamd-27961   [000] 44746.580912: 27961:120:R   + [002] 27960:120:S lt-clamd
 lt-clamd-27961   [000] 44746.580945: 27961:120:D   + [000]   982:115:S md4_raid10
 lt-clamd-27961   [000] 44746.580946: 27961:120:D ==> [000]   982:115:R md4_raid10
 md4_raid10-982   [000] 44746.580948:   982:115:S ==> [000] 27961:120:D lt-clamd
 lt-clamd-27961   [000] 44746.580950: 27961:120:D ==> [000]     0:140:R <idle>
 lt-clamd-27964   [003] 44746.581054: 27964:120:R   + [000] 27961:120:D lt-clamd
 <idle>-0         [000] 44746.581064:     0:140:R ==> [000] 27961:120:R lt-clamd
 lt-clamd-27964   [003] 44746.581092: 27964:120:R   + [002] 27960:120:S lt-clamd
 lt-clamd-27964   [003] 44746.581125: 27964:120:D   + [000]   982:115:S md4_raid10
 lt-clamd-27964   [003] 44746.581128: 27964:120:D ==> [003] 27965:120:R lt-clamd
 lt-clamd-27961   [000] 44746.581128: 27961:120:R ==> [000]   982:115:R md4_raid10

Best regards,
--Edwin