Date: Tue, 8 Nov 2022 10:28:37 -0700
From: Jens Axboe <axboe@...nel.dk>
To: Stefan Hajnoczi <stefanha@...hat.com>
Cc: linux-kernel@...r.kernel.org, netdev@...r.kernel.org
Subject: Re: [PATCHSET v3 0/5] Add support for epoll min_wait

On 11/8/22 10:24 AM, Stefan Hajnoczi wrote:
> On Tue, Nov 08, 2022 at 09:15:23AM -0700, Jens Axboe wrote:
>> On 11/8/22 9:10 AM, Stefan Hajnoczi wrote:
>>> On Tue, Nov 08, 2022 at 07:09:30AM -0700, Jens Axboe wrote:
>>>> On 11/8/22 7:00 AM, Stefan Hajnoczi wrote:
>>>>> On Mon, Nov 07, 2022 at 02:38:52PM -0700, Jens Axboe wrote:
>>>>>> On 11/7/22 1:56 PM, Stefan Hajnoczi wrote:
>>>>>>> Hi Jens,
>>>>>>> NICs and storage controllers have interrupt mitigation/coalescing
>>>>>>> mechanisms that are similar.
>>>>>>
>>>>>> Yep
>>>>>>
>>>>>>> NVMe has an Aggregation Time (timeout) and an Aggregation Threshold
>>>>>>> (counter) value. When a completion occurs, the device waits until the
>>>>>>> timeout or until the completion counter value is reached.
>>>>>>>
>>>>>>> If I've read the code correctly, min_wait is computed at the beginning
>>>>>>> of epoll_wait(2). NVMe's Aggregation Time is computed from the first
>>>>>>> completion.
>>>>>>>
>>>>>>> It makes me wonder which approach is more useful for applications. With
>>>>>>> the Aggregation Time approach applications can control how much extra
>>>>>>> latency is added. What do you think about that approach?
>>>>>>
>>>>>> We only tested the current approach, which is time noted from entry, not
>>>>>> from when the first event arrives. I suspect the nvme approach is better
>>>>>> suited to the hw side; the epoll timeout helps ensure that we batch
>>>>>> within xx usec rather than xx usec + whatever the delay until the first
>>>>>> one arrives. Which is why it's handled that way currently. That gives
>>>>>> you a fixed batch latency.
>>>>>
>>>>> min_wait is fine when the goal is just maximizing throughput without any
>>>>> latency targets.
>>>>
>>>> That's not true at all, I think you're in different time scales than
>>>> this would be used for.
>>>>
>>>>> The min_wait approach makes it hard to set a useful upper bound on
>>>>> latency because unlucky requests that complete early experience much
>>>>> more latency than requests that complete later.
>>>>
>>>> As mentioned in the cover letter or the main patch, this is most useful
>>>> for the medium load kind of scenarios. For high load, the min_wait time
>>>> ends up not mattering because you will hit maxevents first anyway. For
>>>> the testing that we did, the target was 200-300 usec, and 200 usec was
>>>> used for the actual test. Depending on the kind of traffic the
>>>> server is serving, that's usually not much of a concern. From your
>>>> reply, I'm guessing you're thinking of much higher min_wait numbers. I
>>>> don't think those would make sense. If your rate of arrival is low
>>>> enough that min_wait needs to be high to make a difference, then the
>>>> load is low enough anyway that it doesn't matter. Hence I'd argue that
>>>> it is indeed NOT hard to set a useful upper bound on latency, because
>>>> that is very much what min_wait is.
>>>>
>>>> I'm happy to argue the merits of one approach over another, but keep in
>>>> mind that this particular approach was not pulled out of thin air AND it
>>>> has actually been tested and verified successfully on a production
>>>> workload. This isn't a hypothetical benchmark kind of setup.
>>>
>>> Fair enough. I just wanted to make sure the syscall interface that gets
>>> merged is as useful as possible.
>>
>> That is indeed the main discussion as far as I'm concerned - syscall,
>> ctl, or both? At this point I'm inclined to just push forward with the
>> ctl addition. A new syscall can always be added, and if we do, then it'd
>> be nice to make one that will work going forward so we don't have to
>> keep adding epoll_wait variants...
>
> epoll_wait3() would be consistent with how maxevents and timeout work.
> It does not suffer from extra ctl syscall overhead when applications
> need to change min_wait.
>
> The way the current patches add min_wait into epoll_ctl() seems hacky to
> me. struct epoll_event was meant for file descriptor event entries. It
> won't necessarily be large enough for future extensions (luckily
> min_wait only needs a uint64_t value). It's turning epoll_ctl() into an
> ioctl()/setsockopt()-style interface, which is bad for anything that
> needs to understand syscalls, like seccomp. A properly typed
> epoll_wait3() seems cleaner to me.

The ctl method is definitely a bit of an oddball. I've highlighted why
I went that way in earlier emails, but in summary:

- Makes it easy to adopt, just adding two lines at init time.

- Moves detection of availability to init time as well, rather than
  the fast path.

I don't think anyone would want to change the wait often; it's
something you'd set at init time. If you do want to change values
often for some reason, then obviously a syscall parameter would be a
lot better.

epoll_pwait3() would be vastly different than the other ones, simply
because epoll_pwait2() is already using the maximum number of syscall
arguments. We'd need to add an epoll syscall struct at that point,
probably with flags telling us if the sigmask or timeout is actually
valid.

This is not to say I don't think we should add a syscall interface,
just some of the arguments pro and con from having actually looked at
it.

-- 
Jens Axboe
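
[Editorial note: to make the entry-based vs. first-completion semantics
discussed above concrete, here is a worked comparison using the 200 usec
figure from the thread and a hypothetical first event arriving 150 usec
after the syscall:

- min_wait measured from entry (the patchset's approach): the wakeup
  fires at t = 200 usec after epoll_wait() is called. The batch window
  is fixed at 200 usec regardless of when the first event lands, so the
  unlucky first event waits at most 200 usec extra.

- NVMe-style Aggregation Time measured from the first completion: the
  wakeup fires at t = 150 + 200 = 350 usec. The window floats with the
  first arrival, which is the "xx usec + whatever the delay until the
  first one arrives" behavior Jens describes.]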
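[Editorial note: a minimal sketch of the two-line init-time setup
described above. It assumes the EPOLL_CTL_MIN_WAIT op and the
microseconds-in-event-data encoding from the proposed series; since the
patchset was never merged, the define below mirrors the patches rather
than any released kernel header:

/*
 * Sketch of the init-time setup Jens describes. EPOLL_CTL_MIN_WAIT is
 * hypothetical here: it comes from the proposed patches and never
 * landed in a mainline kernel.
 */
#include <stdio.h>
#include <sys/epoll.h>

#ifndef EPOLL_CTL_MIN_WAIT
#define EPOLL_CTL_MIN_WAIT	4	/* value from the proposed series */
#endif

int main(void)
{
	int epfd = epoll_create1(0);

	if (epfd < 0) {
		perror("epoll_create1");
		return 1;
	}

	/*
	 * Batch wakeups for at least 200 usec, the value from the
	 * production test mentioned in the thread. A failure here
	 * doubles as the init-time availability check Jens mentions.
	 */
	struct epoll_event ev = { .data.u64 = 200 };

	if (epoll_ctl(epfd, EPOLL_CTL_MIN_WAIT, -1, &ev) < 0)
		perror("EPOLL_CTL_MIN_WAIT not supported");

	/* ... add fds with EPOLL_CTL_ADD and call epoll_wait() as usual ... */
	return 0;
}

Changing the value later means another epoll_ctl() round trip, which is
exactly the per-change overhead Stefan objects to.]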
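[Editorial note: for the struct-based syscall Jens alludes to at the
end: epoll_pwait2() already consumes the six available syscall
arguments, so a further variant would need an extensible argument
struct. The sketch below is entirely hypothetical; every name and flag
is invented for illustration, and no such syscall exists:

#include <stdint.h>

/*
 * Hypothetical extensible argument block for an epoll_wait3()-style
 * syscall; flags indicate which optional fields are valid, as Jens
 * suggests for the sigmask and timeout.
 */
#define EPOLL_ARGS_TIMEOUT	(1U << 0)	/* ts is valid */
#define EPOLL_ARGS_SIGMASK	(1U << 1)	/* sigmask is valid */
#define EPOLL_ARGS_MIN_WAIT	(1U << 2)	/* min_wait_usec is valid */

struct epoll_wait_args {
	uint32_t flags;		/* EPOLL_ARGS_* bits */
	uint32_t min_wait_usec;	/* typed field, not an epoll_event reuse */
	uint64_t ts;		/* user pointer to struct timespec, or 0 */
	uint64_t sigmask;	/* user pointer to sigset_t, or 0 */
	uint64_t sigsetsize;
};

The kernel could then grow the struct over time and validate its size
from userspace the way clone3(2) and openat2(2) do, which also keeps
the interface visible to seccomp as a properly typed syscall.]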