linux-kernel - Re: [PATCHSET v3 0/5] Add support for epoll min

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <Y2qQuiZvuML14wX0@fedora>
Date:   Tue, 8 Nov 2022 12:24:10 -0500
From:   Stefan Hajnoczi <stefanha@...hat.com>
To:     Jens Axboe <axboe@...nel.dk>
Cc:     linux-kernel@...r.kernel.org, netdev@...r.kernel.org
Subject: Re: [PATCHSET v3 0/5] Add support for epoll min_wait

On Tue, Nov 08, 2022 at 09:15:23AM -0700, Jens Axboe wrote:
> On 11/8/22 9:10 AM, Stefan Hajnoczi wrote:
> > On Tue, Nov 08, 2022 at 07:09:30AM -0700, Jens Axboe wrote:
> >> On 11/8/22 7:00 AM, Stefan Hajnoczi wrote:
> >>> On Mon, Nov 07, 2022 at 02:38:52PM -0700, Jens Axboe wrote:
> >>>> On 11/7/22 1:56 PM, Stefan Hajnoczi wrote:
> >>>>> Hi Jens,
> >>>>> NICs and storage controllers have interrupt mitigation/coalescing
> >>>>> mechanisms that are similar.
> >>>>
> >>>> Yep
> >>>>
> >>>>> NVMe has an Aggregation Time (timeout) and an Aggregation Threshold
> >>>>> (counter) value. When a completion occurs, the device waits until the
> >>>>> timeout or until the completion counter value is reached.
> >>>>>
> >>>>> If I've read the code correctly, min_wait is computed at the beginning
> >>>>> of epoll_wait(2). NVMe's Aggregation Time is computed from the first
> >>>>> completion.
> >>>>>
> >>>>> It makes me wonder which approach is more useful for applications. With
> >>>>> the Aggregation Time approach applications can control how much extra
> >>>>> latency is added. What do you think about that approach?
> >>>>
> >>>> We only tested the current approach, which is time noted from entry, not
> >>>> from when the first event arrives. I suspect the nvme approach is better
> >>>> suited to the hw side, the epoll timeout helps ensure that we batch
> >>>> within xx usec rather than xx usec + whatever the delay until the first
> >>>> one arrives. Which is why it's handled that way currently. That gives
> >>>> you a fixed batch latency.
> >>>
> >>> min_wait is fine when the goal is just maximizing throughput without any
> >>> latency targets.
> >>
> >> That's not true at all, I think you're in different time scales than
> >> this would be used for.
> >>
> >>> The min_wait approach makes it hard to set a useful upper bound on
> >>> latency because unlucky requests that complete early experience much
> >>> more latency than requests that complete later.
> >>
> >> As mentioned in the cover letter or the main patch, this is most useful
> >> for the medium load kind of scenarios. For high load, the min_wait time
> >> ends up not mattering because you will hit maxevents first anyway. For
> >> the testing that we did, the target was 2-300 usec, and 200 usec was
> >> used for the actual test. Depending on what the kind of traffic the
> >> server is serving, that's usually not much of a concern. From your
> >> reply, I'm guessing you're thinking of much higher min_wait numbers. I
> >> don't think those would make sense. If your rate of arrival is low
> >> enough that min_wait needs to be high to make a difference, then the
> >> load is low enough anyway that it doesn't matter. Hence I'd argue that
> >> it is indeed NOT hard to set a useful upper bound on latency, because
> >> that is very much what min_wait is.
> >>
> >> I'm happy to argue merits of one approach over another, but keep in mind
> >> that this particular approach was not pulled out of thin air AND it has
> >> actually been tested and verified successfully on a production workload.
> >> This isn't a hypothetical benchmark kind of setup.
> > 
> > Fair enough. I just wanted to make sure the syscall interface that gets
> > merged is as useful as possible.
> 
> That is indeed the main discussion as far as I'm concerned - syscall,
> ctl, or both? At this point I'm inclined to just push forward with the
> ctl addition. A new syscall can always be added, and if we do, then it'd
> be nice to make one that will work going forward so we don't have to
> keep adding epoll_wait variants...

epoll_wait3() would be consistent with how maxevents and timeout work.
It does not suffer from extra ctl syscall overhead when applications
need to change min_wait.

The way the current patches add min_wait into epoll_ctl() seems hacky to
me. struct epoll_event was meant for file descriptor event entries. It
won't necessarily be large enough for future extensions (luckily
min_wait only needs a uint64_t value). It's turning epoll_ctl() into an
ioctl()/setsockopt()-style interface, which is bad for anything that
needs to understand syscalls, like seccomp. A properly typed
epoll_wait3() seems cleaner to me.

Stefan

Download attachment "signature.asc" of type "application/pgp-signature" (489 bytes)