[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <54E0B857.30902@gmail.com>
Date: Sun, 15 Feb 2015 16:16:39 +0100
From: "Michael Kerrisk (man-pages)" <mtk.manpages@...il.com>
To: Omar Sandoval <osandov@...ndov.com>, Fam Zheng <famz@...hat.com>
CC: mtk.manpages@...il.com, linux-kernel@...r.kernel.org,
Thomas Gleixner <tglx@...utronix.de>,
Ingo Molnar <mingo@...hat.com>,
"H. Peter Anvin" <hpa@...or.com>, x86@...nel.org,
Alexander Viro <viro@...iv.linux.org.uk>,
Andrew Morton <akpm@...ux-foundation.org>,
Kees Cook <keescook@...omium.org>,
Andy Lutomirski <luto@...capital.net>,
David Herrmann <dh.herrmann@...il.com>,
Alexei Starovoitov <ast@...mgrid.com>,
Miklos Szeredi <mszeredi@...e.cz>,
David Drysdale <drysdale@...gle.com>,
Oleg Nesterov <oleg@...hat.com>,
"David S. Miller" <davem@...emloft.net>,
Vivek Goyal <vgoyal@...hat.com>,
Mike Frysinger <vapier@...too.org>,
Theodore Ts'o <tytso@....edu>,
Heiko Carstens <heiko.carstens@...ibm.com>,
Rasmus Villemoes <linux@...musvillemoes.dk>,
Rashika Kheria <rashika.kheria@...il.com>,
Hugh Dickins <hughd@...gle.com>,
Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
Peter Zijlstra <peterz@...radead.org>,
linux-fsdevel@...r.kernel.org, linux-api@...r.kernel.org,
Josh Triplett <josh@...htriplett.org>,
Paolo Bonzini <pbonzini@...hat.com>
Subject: Re: [PATCH RFC v3 0/7] epoll: Introduce new syscalls, epoll_ctl_batch
and epoll_pwait1
On 02/13/2015 10:53 AM, Omar Sandoval wrote:
> On Fri, Feb 13, 2015 at 05:03:56PM +0800, Fam Zheng wrote:
>> Hi all,
>>
>> This is the updated series for the new epoll system calls, with the cover
>> letter rewritten which includes some more explanation. Comments are very
>> welcome!
>>
>> Original Motivation
>> ===================
>>
>> QEMU, and probably many more select/poll based applications, will consider
>> epoll as an alternative, when its event loop needs to handle a big number of
>> fds. However, there are currently two concerns with epoll which prevents the
>> switching from ppoll to epoll.
>>
>> The major one is the timeout precision.
>>
>> For example in QEMU, the main loop takes care of calling callbacks at a
>> specific timeout - the QEMU timer API. The timeout value in ppoll depends on
>> the next firing timer. epoll_pwait's millisecond timeout is so coarse that
>> rounding up the timeout will hurt performance badly.
>>
>> The minor one is the number of system call to update fd set.
>>
>> While epoll can handle a large number of fds quickly, it still requires one
>> epoll_ctl per fd update, compared to the one-shot call to select/poll with an
>> fd array. This may as well make epoll inferior to ppoll in the cases where a
>> small, but frequently changing set of fds are polled by the event loop.
>>
>> This series introduces two new epoll APIs to address them respectively. The
>> idea of epoll_ctl_batch is suggested by Andy Lutomirski in [1], who also
>> suggested clockid as a parameter in epoll_pwait1.
>>
>> Discussion
>> ==========
>>
>> [Note: This is the question part regarding the interface contract of
>> epoll_ctl_batch. If you don't have the context of what is epoll_ctl_batch yet,
>> please skip this part and probably start with the man page style documentation.
>> You can resume to this section later.]
>>
>> [Thanks to Omar Sandoval <osandov@...ndov.com>, who pointed out this in
>> reviewing v1]
>>
>> We try to report status for each command in epoll_ctl_batch, by writting to
>> user provided command array (pointed to cmds). The tricky thing in the
>> implementation is that, copying the results back to userspace comes last, after
>> the commands are executed. At this point, if the copy_to_user fails, the
>> effects are done and no return - or if we add some mechanism to revert it, the
>> code will be too complicated and slow.
>>
>> In above corner case, the return value of epoll_ctl_batch is smaller than
>> ncmds, which assures our caller that last N commands failed, where N = ncmds -
>> ret. But they'll also find that cmd.result is not changed to error code.
>>
>> I suppose this is not a big deal, because 1) it should happen very rarely. 2)
>> user does know the actual number of commands that succeed.
>>
>> So, do we leave it as is? Or is there any way to improve?
>>
>> One tiny improvement (not a complete fix) in my mind is a testing copy_to_user
>> before we execute the commands. If it succeeds, it's even less likely the last
>> copy_to_user could fail, so that we can even probably assert it won't. The
>> testing 'copy_to_user(cmds, &kcmds, ...)' will not hurt functionality if do it
>> right after 'copy_from_user(&kcmds, cmds, ...)'. But I'm not sure about the
>> performance impact, especially when @ncmds is big.
>>
> I don't think this is the right thing to do, since, for example, another
> thread could unmap the memory region holding buffer between the "check"
> copy_to_user and the actual one.
>
> The two alternatives that I see are:
>
> 1. If copy_to_user fails, ignore it and return the number of commands
> that succeeded (i.e., what the code in your patch does).
> 2. If copy_to_user fails, return -EFAULT, regardless of how many
> commands succeeded.
>
> The problem with 1 is that it could potentially mask bugs in a user
> program. You could imagine a buggy program that passes a read-only
> buffer to epoll_ctl_batch and never finds out about it because they
> don't get the error. (Then, when there's a real error in one of the
> epoll_ctl_cmds, but .result is 0, someone will be very confused.)
>
> So I think 2 is the better option. Sure, the user will have no idea how
> many commands were executed, but when would EFAULT on this system call
> be part of normal operation, anyways? You'll find the memory bug, fix
> it, and rest easy knowing that nothing is silently failing behind your
> back.
What Omar says makes sense to me too. Best to have the user get a clear
error indication for this case.
Cheers,
Michael
--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists