netdev - Re: [GIT PULL] Add support for epoll min wait time

Message-ID: <581a6498-efcd-b57d-02b6-4237559c72e6@kernel.dk>
Date:   Sat, 10 Dec 2022 19:31:06 -0700
From:   Jens Axboe <axboe@...nel.dk>
To:     Linus Torvalds <torvalds@...ux-foundation.org>
Cc:     netdev <netdev@...r.kernel.org>,
        "linux-fsdevel@...r.kernel.org" <linux-fsdevel@...r.kernel.org>
Subject: Re: [GIT PULL] Add support for epoll min wait time

On 12/10/22 7:20 PM, Jens Axboe wrote:
> On 12/10/22 6:58 PM, Jens Axboe wrote:
>> On 12/10/22 11:51 AM, Linus Torvalds wrote:
>>> On Sat, Dec 10, 2022 at 7:36 AM Jens Axboe <axboe@...nel.dk> wrote:
>>>>
>>>> This adds an epoll_ctl method for setting the minimum wait time for
>>>> retrieving events.
>>>
>>> So this is something very close to what the TTY layer has had forever,
>>> and is useful (well... *was* useful) for pretty much the same reason.
>>>
>>> However, let's learn from successful past interfaces: the tty layer
>>> doesn't have just VTIME, it has VMIN too.
>>>
>>> And I think they very much go hand in hand: you want to wait for at
>>> least VMIN events, or for at most VTIME after the last event.
>>
>> It has been suggested before too. A more modern example is how IRQ
>> coalescing works on e.g. NVMe or NICs. Those generally are of the
>> nature of "wait for X time, or until Y events are available". We can
>> certainly do something like that here too; it's just a matter of
>> adding a minevents value and passing both in together.
>>
>> I'll add that, really should be trivial, and resend later in the merge
>> window once we're happy with that.
> 
> Took a quick look, and it's not that trivial. The problem is that you
> have to wake the task to reap events anyway; whether minevents has been
> reached cannot be checked at wakeup time. And then you lose the nice
> benefit of reducing the context switch rate, which was a good chunk of
> the win here...
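The "at least X events, or at most Y time" policy discussed above can be modelled in a few lines. This is a user-space simulation only, not kernel code: `coalesce` and its parameters are hypothetical names, arrival times are abstract time units, and the model counts events as if checking the count were free, which is exactly the assumption that does not hold inside epoll. With `restart_on_event=True` the timer is re-armed on every arrival, as with termios when both VMIN and VTIME are non-zero; otherwise it runs from the first event, which is closer to NIC-style interrupt coalescing.

```python
from typing import List, Tuple


def coalesce(arrivals: List[float], min_events: int, max_wait: float,
             restart_on_event: bool = False) -> Tuple[float, int]:
    """Simulate one wait under a "min events or max time" policy.

    Hypothetical model, not kernel code. `arrivals` must be sorted and
    non-empty; the timer is armed when the first event arrives.
    Returns (time the waiter returns, number of events delivered).
    """
    deadline = arrivals[0] + max_wait
    count = 0
    for t in arrivals:
        if t > deadline:
            break                    # timer fired before this event arrived
        count += 1
        if count >= min_events:
            return t, count          # enough events: return immediately
        if restart_on_event:
            deadline = t + max_wait  # termios-style inter-event timer
    return deadline, count           # timer expired with fewer events
```

For example, with arrivals at 0, 4 and 8, `min_events=3` and `max_wait=5`, the fixed timer returns at 5 with two events, while the re-armed timer keeps waiting and returns at 8 with three.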

One approximation we could make: once we've done that first reap of
events and gotten N events (where N could be zero), the number of
wakeups after that is a rough approximation of the number of events
that have arrived. We already use this to break out of min_wait if we
think we'll exceed maxevents. We could use the same metric to estimate
whether we've hit minevents as well. It would not be guaranteed to be
accurate, but it's probably good enough. Even if we didn't quite hit
minevents, we'd return rather than go through another sleep and wakeup
cycle.
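The wakeup-count estimate described above can be illustrated with a small simulation. This is a hypothetical model, not the eventpoll code: `wait_with_estimate` is an invented name, and the per-wakeup event counts are made-up inputs that the real waiter cannot observe. Each wakeup after the first reap is counted as one event, and the waiter stops sleeping once that estimate reaches minevents (or maxevents).

```python
from typing import List, Tuple


def wait_with_estimate(first_reap: int, wakeups: List[int],
                       min_events: int, max_events: int) -> Tuple[int, int]:
    """Simulate the proposed wakeup-count estimate (hypothetical model).

    first_reap: events N found by the initial reap (may be zero); assumed
                to be below min_events, otherwise no sleep would happen.
    wakeups: for each later wakeup, how many events really became ready;
             this is made-up input, since the waiter cannot see it.
    Returns (wakeups consumed, events actually reaped on return).
    """
    estimate = first_reap   # what the waiter believes is ready
    actual = first_reap     # what a real reap would find
    consumed = 0
    for new_events in wakeups:
        consumed += 1
        estimate += 1           # count each wakeup as one event
        actual += new_events
        # Existing behaviour: stop if we think we'll exceed maxevents.
        # Proposed addition: also stop once the estimate reaches minevents.
        if estimate >= max_events or estimate >= min_events:
            break
    # Either the estimate hit a limit or the min_wait timer ran out.
    return consumed, min(actual, max_events)
```

With `first_reap=0`, `wakeups=[4, 1]` and `min_events=2`, the model takes two wakeups and reaps 5 events, one wakeup more than strictly needed; with `wakeups=[0, 1, 1]` and `min_events=3`, it returns after three wakeups holding only 2 events. Either way it stops instead of sleeping again, which is the trade-off the estimate accepts.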

-- 
Jens Axboe