Message-Id: <32AF3D8D-536F-4B9F-908B-C292CB9D6AEE@oracle.com>
Date:	Wed, 3 Jan 2007 10:50:06 -0800
From:	Zach Brown <zach.brown@...cle.com>
To:	"Chen, Kenneth W" <kenneth.w.chen@...el.com>
Cc:	"'Andrew Morton'" <akpm@...l.org>, <linux-aio@...ck.org>,
	<linux-kernel@...r.kernel.org>,
	"'Benjamin LaHaise'" <bcrl@...ck.org>, <suparna@...ibm.com>
Subject: Re: [patch] aio: add per task aio wait event condition


On Jan 2, 2007, at 10:36 PM, Chen, Kenneth W wrote:

> Zach Brown wrote on Tuesday, January 02, 2007 6:06 PM
>>> In the example you
>>> gave earlier, task with min_nr of 2 will be woken up after 4  
>>> completed
>>> events.
>>
>> I only gave 2 ios/events in that example.
>>
>> Does that clear up the confusion?
>
> It occurs to me that people might not be aware how peculiar the
> current io_getevents wakeup scheme is, to the extent of erratic
> behavior.

Yeah.  I have to admit that I hadn't fully grasped the behaviour of  
the interface when we started this thread.  I realized just how  
insane it was as I was finishing off last night.

> wakes up on event _4_.  If someone asked me to describe the
> algorithm of the io_getevents wake-up scheme in the presence of
> multiple waiters, I'd call it erratic and non-deterministic.

I agree completely.  This leads me to assume that essentially *no  
one* has multiple waiters on a context who care about min_nr.  That's  
good news because we can consider it the slow path for now :)

> So I can categorize my patchset as a bug fix instead of a
> performance patch ;-)  Let's be serious, this ought to be fixed
> one way or the other.

Yeah.  How about this:

Start with a cleanup of the event copying loop:

- prefault the user event destination with fault_in_pages_writeable()
- copy multiple events with _inatomic
- only copy events if min_nr are available
- back off and retry if _inatomic faults
- go back to sleep if too few events are available
- return once a copy doesn't fault

(I guess one could fall back to prefaulting only if the _inatomic
copy faults.  I don't know whether most event buffers fault or not;
I'd guess not.)
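
Roughly, the loop I have in mind looks like this.  It's only a
sketch: aio_events_avail(), aio_peek_events(), aio_consume_events()
and aio_wait_for_events() are made-up names for the ring plumbing in
fs/aio.c, and peeking before consuming is just one way to avoid
losing events if the copy faults.

/*
 * Sketch only: the aio_*_events() helpers are invented stand-ins
 * for the real ring access in fs/aio.c.
 */
static long read_events(struct kioctx *ctx, long min_nr, long nr,
			struct io_event __user *uevents)
{
	struct io_event evs[8];		/* small on-stack batch */
	long copied = 0;

	/* prefault the destination so the _inatomic copy usually succeeds */
	fault_in_pages_writeable((char __user *)uevents,
				 nr * sizeof(struct io_event));

	while (copied < nr) {
		long avail, batch, left;

		avail = aio_events_avail(ctx);
		if (copied + avail < min_nr) {
			/* too few events: go back to sleep */
			aio_wait_for_events(ctx, min_nr - copied);
			continue;
		}
		if (!avail)
			break;			/* min_nr met, ring drained */

		batch = min_t(long, avail, ARRAY_SIZE(evs));
		batch = min(batch, nr - copied);
		aio_peek_events(ctx, evs, batch);	/* don't consume yet */

		/* copy a batch without sleeping */
		pagefault_disable();
		left = __copy_to_user_inatomic(uevents + copied, evs,
					batch * sizeof(struct io_event));
		pagefault_enable();

		if (left) {
			/* back off: fault the destination in and retry */
			fault_in_pages_writeable((char __user *)(uevents + copied),
					batch * sizeof(struct io_event));
			continue;
		}

		aio_consume_events(ctx, batch);	/* copy landed, advance the ring */
		copied += batch;

		if (copied >= min_nr)
			break;		/* min_nr met and the copy didn't fault */
	}
	return copied;
}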

This would get rid of the nonsense of consuming events (possibly  
depriving other concurrent waiters) without returning them to  
userspace in a timely fashion.  We get rid of a lot of the per-event- 
delivery overhead that the current loop suffers from.

The controversial part here happens when min_nr is larger than the  
ring size.  In that case I think we should consider min_nr to be  
equal to the ring size.  We'll return fewer events than userspace  
asked for, but it'll still be a big batch.
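
In code the clamp itself is trivial, something like this (assuming
ring_info.nr is the ring capacity field in fs/aio.c):

	/* clamp min_nr to the ring capacity */
	if (min_nr > (long)ctx->ring_info.nr)
		min_nr = ctx->ring_info.nr;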

Does io_getevents() returning a positive count that's less than
min_nr raise alarm bells for anyone?  It already happens in the
timeout case, so arguably callers already handle it.
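
For reference, a caller that copes with short returns looks roughly
like this; a userspace sketch with libaio, helper name invented:

#include <libaio.h>

/*
 * Gather at least 'wanted' events, tolerating the short returns
 * that timeout-aware callers already deal with.
 */
static int collect_events(io_context_t ctx, long wanted, long cap,
			  struct io_event *events)
{
	long got = 0;

	while (got < wanted) {
		int ret = io_getevents(ctx, wanted - got, cap - got,
				       events + got, NULL);
		if (ret < 0)
			return ret;	/* negative errno */
		got += ret;
	}
	return got;
}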

Now, on the waking and sleeping side:

- min_nr 0 and 1 as exclusive waiters in ctx->wait
- maintain an rbtree of waiters, keyed by min_nr, for min_nr > 1
- always wake ctx->wait if it's pending
- wake rb_first of the tree if nr_in_ring >= node->min_nr

The rbtree nodes would hold a task_struct pointer, and the waker
would use wake_up_process().  The tree would be maintained under the
ctx lock.  It'll be expensive to maintain with lots of concurrent
min_nr > 1 waiters, but we're declaring that the slow path.
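
Something like this, where struct aio_waiter and the ctx->waiters
root are invented names for the sake of illustration:

/* Sketch: struct aio_waiter and ctx->waiters are invented names. */
struct aio_waiter {
	struct rb_node		node;		/* keyed by min_nr */
	long			min_nr;
	struct task_struct	*task;
};

/* Called from the completion path with ctx->ctx_lock held. */
static void aio_wake_waiters(struct kioctx *ctx, long nr_in_ring)
{
	struct rb_node *first;

	/* always wake ctx->wait if it's pending (min_nr 0 and 1 waiters) */
	if (waitqueue_active(&ctx->wait))
		wake_up(&ctx->wait);

	/* wake the smallest-min_nr tree waiter once it can be satisfied */
	first = rb_first(&ctx->waiters);
	if (first) {
		struct aio_waiter *w = rb_entry(first, struct aio_waiter, node);

		if (nr_in_ring >= w->min_nr)
			wake_up_process(w->task);
	}
}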

The fast path of a single waiter with > 1 min_nr becomes a pointer  
indirection through the root node of the rb tree.  It should be just  
like the current list_head use in the wait queue.  If the wait queue  
head and rbtree root node share a cacheline in the context then  
checking them both (and almost always only finding one of them  
populated) shouldn't cost the completion path much.
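
In struct kioctx that's just keeping the two fields next to each
other, sketched here with the invented 'waiters' root from above:

struct kioctx {
	/* ... existing fields ... */

	/* kept adjacent so the completion path can usually check both
	 * with a single cacheline fetch */
	wait_queue_head_t	wait;		/* min_nr 0 and 1 waiters */
	struct rb_root		waiters;	/* min_nr > 1 waiters     */

	/* ... */
};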

I don't have a strong opinion about using exclusive wake-ups or not,
because I doubt we ever have concurrent waiters in the field.  I'm
leaning towards avoiding the thundering herd, even if that risks the
event delivery stream waiting behind a woken waiter who might be
blocking while prefaulting their destination event buffer.  In that
case io_getevents() itself should wake the next waiter before it
sleeps or returns.  I could easily be convinced otherwise.

The wait queue lets us support lots of min_nr == 1 waiters without
the rbtree maintenance overhead.  They're not as self-evidently
broken in the current code as multiple min_nr > 1 waiters, so they
may well be in use; who knows.  Either way, it's easy.

How's this all sound?  I can throw this together if you'd prefer it  
that way.  Though I suspect you'll want to keep hacking on this :).

Finally, let's get a test case for multiple min_nr > 1 waiters into  
autotest.  I'd use our trivial example where the min_nr = 2 waiter  
fires off 3 ios for the concurrent min_nr = 3 waiter to ensure that  
things don't get stuck.
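
Something along these lines, as a userspace sketch with libaio
(build with -laio -lpthread; error handling omitted, file name made
up).  A broken wake-up scheme shows up as a hang that autotest would
time out on:

#include <libaio.h>
#include <pthread.h>
#include <fcntl.h>
#include <unistd.h>

static io_context_t ctx;	/* zero-initialized, as io_setup() wants */

/* Concurrent waiter that needs min_nr = 3 events before it may return. */
static void *min3_waiter(void *arg)
{
	struct io_event events[3];

	io_getevents(ctx, 3, 3, events, NULL);
	return NULL;
}

int main(void)
{
	pthread_t t;
	struct iocb cb[5], *cbs[5];
	struct io_event events[2];
	char buf[512];
	int fd, i;

	fd = open("aio-min_nr-test", O_RDWR | O_CREAT | O_TRUNC, 0600);
	io_setup(8, &ctx);

	for (i = 0; i < 5; i++) {
		io_prep_pwrite(&cb[i], fd, buf, sizeof(buf),
			       (long long)i * sizeof(buf));
		cbs[i] = &cb[i];
	}

	/* start the min_nr = 3 waiter, then act as the min_nr = 2 waiter */
	pthread_create(&t, NULL, min3_waiter, NULL);

	io_submit(ctx, 2, cbs);			/* 2 ios for ourselves... */
	io_getevents(ctx, 2, 2, events, NULL);

	io_submit(ctx, 3, cbs + 2);		/* ...then 3 for the other waiter */

	/* a broken wake-up scheme shows up as a hang right here */
	pthread_join(t, NULL);
	io_destroy(ctx);
	close(fd);
	return 0;
}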

- z
