Date:   Tue, 8 Nov 2022 10:28:37 -0700
From:   Jens Axboe <axboe@...nel.dk>
To:     Stefan Hajnoczi <stefanha@...hat.com>
Cc:     linux-kernel@...r.kernel.org, netdev@...r.kernel.org
Subject: Re: [PATCHSET v3 0/5] Add support for epoll min_wait

On 11/8/22 10:24 AM, Stefan Hajnoczi wrote:
> On Tue, Nov 08, 2022 at 09:15:23AM -0700, Jens Axboe wrote:
>> On 11/8/22 9:10 AM, Stefan Hajnoczi wrote:
>>> On Tue, Nov 08, 2022 at 07:09:30AM -0700, Jens Axboe wrote:
>>>> On 11/8/22 7:00 AM, Stefan Hajnoczi wrote:
>>>>> On Mon, Nov 07, 2022 at 02:38:52PM -0700, Jens Axboe wrote:
>>>>>> On 11/7/22 1:56 PM, Stefan Hajnoczi wrote:
>>>>>>> Hi Jens,
>>>>>>> NICs and storage controllers have interrupt mitigation/coalescing
>>>>>>> mechanisms that are similar.
>>>>>>
>>>>>> Yep
>>>>>>
>>>>>>> NVMe has an Aggregation Time (timeout) and an Aggregation Threshold
>>>>>>> (counter) value. When a completion occurs, the device waits until the
>>>>>>> timeout or until the completion counter value is reached.
>>>>>>>
>>>>>>> If I've read the code correctly, min_wait is computed at the beginning
>>>>>>> of epoll_wait(2). NVMe's Aggregation Time is computed from the first
>>>>>>> completion.
>>>>>>>
>>>>>>> It makes me wonder which approach is more useful for applications. With
>>>>>>> the Aggregation Time approach applications can control how much extra
>>>>>>> latency is added. What do you think about that approach?
>>>>>>
>>>>>> We only tested the current approach, where time is measured from entry,
>>>>>> not from when the first event arrives. I suspect the nvme approach is
>>>>>> better suited to the hw side; the epoll timeout helps ensure that we
>>>>>> batch within xx usec rather than xx usec + whatever the delay until the
>>>>>> first one arrives. Which is why it's handled that way currently. That
>>>>>> gives you a fixed batch latency.
>>>>>
>>>>> min_wait is fine when the goal is just maximizing throughput without any
>>>>> latency targets.
>>>>
>>>> That's not true at all; I think you're assuming different time scales than
>>>> this would be used for.
>>>>
>>>>> The min_wait approach makes it hard to set a useful upper bound on
>>>>> latency because unlucky requests that complete early experience much
>>>>> more latency than requests that complete later.
>>>>
>>>> As mentioned in the cover letter or the main patch, this is most useful
>>>> for the medium load kind of scenarios. For high load, the min_wait time
>>>> ends up not mattering because you will hit maxevents first anyway. For
>>>> the testing that we did, the target was 200-300 usec, and 200 usec was
>>>> used for the actual test. Depending on the kind of traffic the server
>>>> is serving, that's usually not much of a concern. From your
>>>> reply, I'm guessing you're thinking of much higher min_wait numbers. I
>>>> don't think those would make sense. If your rate of arrival is low
>>>> enough that min_wait needs to be high to make a difference, then the
>>>> load is low enough anyway that it doesn't matter. Hence I'd argue that
>>>> it is indeed NOT hard to set a useful upper bound on latency, because
>>>> that is very much what min_wait is.
>>>>
>>>> I'm happy to argue merits of one approach over another, but keep in mind
>>>> that this particular approach was not pulled out of thin air AND it has
>>>> actually been tested and verified successfully on a production workload.
>>>> This isn't a hypothetical benchmark kind of setup.
>>>
>>> Fair enough. I just wanted to make sure the syscall interface that gets
>>> merged is as useful as possible.
>>
>> That is indeed the main discussion as far as I'm concerned - syscall,
>> ctl, or both? At this point I'm inclined to just push forward with the
>> ctl addition. A new syscall can always be added, and if we do, then it'd
>> be nice to make one that will work going forward so we don't have to
>> keep adding epoll_wait variants...
> 
> epoll_wait3() would be consistent with how maxevents and timeout work.
> It does not suffer from extra ctl syscall overhead when applications
> need to change min_wait.
> 
> The way the current patches add min_wait into epoll_ctl() seems hacky to
> me. struct epoll_event was meant for file descriptor event entries. It
> won't necessarily be large enough for future extensions (luckily
> min_wait only needs a uint64_t value). It's turning epoll_ctl() into an
> ioctl()/setsockopt()-style interface, which is bad for anything that
> needs to understand syscalls, like seccomp. A properly typed
> epoll_wait3() seems cleaner to me.

The ctl method is definitely a bit of an oddball. I've highlighted why
I went that way in earlier emails, but in summary:

- Makes it easy to adopt, just adding two lines at init time (roughly as
  in the sketch further down).

- Moves detection of availability to init time as well, rather than
  the fast path.

I don't think anyone would want to change the wait value often; it's
something you'd set at init time. If you do need to change it frequently
for some reason, then obviously a syscall parameter would be a lot
better.
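
To make the "two lines at init time" point concrete, here's a rough
sketch of what that setup could look like. This is illustrative only: it
assumes the op is named EPOLL_CTL_MIN_WAIT and that the delay (in usec)
is carried in the event's data field; the exact op name, fd argument, and
units are whatever the patches define and may differ.

#include <sys/epoll.h>

#ifndef EPOLL_CTL_MIN_WAIT
#define EPOLL_CTL_MIN_WAIT 4	/* placeholder value for this sketch only */
#endif

static void setup_min_wait(int epfd, unsigned int usec)
{
	/* delay carried in the event data; fd is unused here, so pass -1 */
	struct epoll_event ev = { .data.u64 = usec };

	epoll_ctl(epfd, EPOLL_CTL_MIN_WAIT, -1, &ev);
}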

epoll_pwait3() would be vastly different from the other ones, simply
because epoll_pwait2() is already using the maximum number of args.
We'd need to add an epoll syscall struct at that point, probably
with flags telling us whether the sigmask or timeout is actually valid.
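
Purely as a sketch of what such an argument struct might look like
(nothing like this exists today; the struct and field names below are
invented for illustration):

#include <linux/types.h>

/*
 * Hypothetical extensible wait-argument struct for a future
 * epoll_wait3()-style syscall; flags would say which fields are valid.
 */
struct epoll_wait_params {
	__u64	flags;		/* e.g. bits marking timeout/sigmask valid (made up) */
	__u64	min_wait_usec;	/* minimum batching window before returning */
	__u64	timeout_nsec;	/* overall wait timeout, if flagged valid */
	__u64	sigmask;	/* user pointer to sigset_t, if flagged valid */
	__u64	sigsetsize;
};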

This is not to say I don't think we should add a syscall interface;
these are just some of the arguments pro and con from having actually
looked at it.

-- 
Jens Axboe

