Message-ID: <CA+FuTSdEKsN_47RtW6pOWEnrKkewuDBdsv_qAhR1EyXUr3obrg@mail.gmail.com>
Date: Wed, 2 Nov 2022 19:51:18 -0400
From: Willem de Bruijn <willemdebruijn.kernel@...il.com>
To: Jens Axboe <axboe@...nel.dk>
Cc: Willem de Bruijn <willemdebruijn.kernel@...il.com>,
linux-kernel@...r.kernel.org, netdev@...r.kernel.org
Subject: Re: [PATCHSET v3 0/5] Add support for epoll min_wait
On Wed, Nov 2, 2022 at 7:42 PM Jens Axboe <axboe@...nel.dk> wrote:
>
> On 11/2/22 5:09 PM, Willem de Bruijn wrote:
> > On Wed, Nov 2, 2022 at 1:54 PM Jens Axboe <axboe@...nel.dk> wrote:
> >>
> >> On 11/2/22 11:46 AM, Willem de Bruijn wrote:
> >>> On Sun, Oct 30, 2022 at 6:02 PM Jens Axboe <axboe@...nel.dk> wrote:
> >>>>
> >>>> Hi,
> >>>>
> >>>> tldr - we saw a 6-7% CPU reduction with this patch. See patch 6 for
> >>>> full numbers.
> >>>>
> >>>> This adds support for EPOLL_CTL_MIN_WAIT, which allows setting a minimum
> >>>> time that epoll_wait() should wait for events on a given epoll context.
> >>>> Some justification and numbers are in patch 6, patches 1-5 are really
> >>>> just prep patches or cleanups.
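
As a sketch of what setup could look like under this proposal (the opcode
value and the usec-in-event.data encoding are my reading of the patchset,
not a released API; epfd and do_userspace_batching are placeholder names):

	#include <sys/epoll.h>

	#ifndef EPOLL_CTL_MIN_WAIT
	#define EPOLL_CTL_MIN_WAIT 4	/* from the patchset, not in uapi headers */
	#endif

	/* fd is -1: this targets the epoll context, not a watched fd.
	 * The minimum wait time, in usec, rides in the data field.
	 */
	struct epoll_event ev = { .data = 200 };

	if (epoll_ctl(epfd, EPOLL_CTL_MIN_WAIT, -1, &ev) < 0)
		/* older kernel: keep batching in userspace instead */
		do_userspace_batching = true;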
> >>>>
> >>>> Sending this out to get some input on the API, basically. This is
> >>>> obviously a per-context type of operation in this patchset, which isn't
> >>>> necessarily ideal for every use case. Questions to be debated:
> >>>>
> >>>> 1) Would we want this to be available through epoll_wait() directly?
> >>>> That would allow this to be done on a per-epoll_wait() basis, rather
> >>>> than being tied to the specific context.
> >>>>
> >>>> 2) If the answer to #1 is yes, would we still want EPOLL_CTL_MIN_WAIT?
> >>>>
> >>>> I think there are pros and cons to both, and perhaps the answer to both is
> >>>> "yes". There are some benefits to doing this at epoll setup time, for
> >>>> example - it nicely isolates it to that part rather than needing to be
> >>>> done dynamically every time epoll_wait() is called. This also helps the
> >>>> application code, as it can turn off any busyness tracking based on
> >>>> whether the setup accepted EPOLL_CTL_MIN_WAIT or not.
> >>>>
> >>>> Anyway, tossing this out there as it yielded quite good results in some
> >>>> initial testing; we're running more of it. Sending out a v3 now since
> >>>> someone reported that nonblock issue, which is annoying. Hoping to get
> >>>> some more discussion this time around, or at least some...
> >>>
> >>> My main question is whether the cycle gains justify the code
> >>> complexity and runtime cost in all other epoll paths.
> >>>
> >>> Syscall overhead is quite dependent on architecture and things like KPTI.
> >>
> >> Definitely interested in experiences from other folks, but what other
> >> runtime costs do you see compared to the baseline?
> >
> > Nothing specific. Possible cost from added branches and moving local
> > variables into structs with possibly cold cachelines.
> >
> >>> Indeed, I was also wondering whether an extra timeout arg to
> >>> epoll_wait would give the same feature with less side effects. Then no
> >>> need for that new ctrl API.
> >>
> >> That was my main question in this posting - what's the best api? The
> >> current one, epoll_wait() addition, or both? The nice thing about the
> >> current one is that it's easy to integrate into existing use cases, as
> >> the decision to do batching on the userspace side or by utilizing this
> >> feature can be kept in the setup path. If you do epoll_wait() and get
> >> -1/EINVAL or a false success on older kernels, then that's either a loss
> >> because you think it worked, or the fast path needs to check for this
> >> specifically every time you call epoll_wait() rather than just at
> >> init/setup time.
> >>
> >> But this is very much the question I already posed and wanted to
> >> discuss...
> >
> > I see the value in being able to detect whether the feature is present.
> >
> > But a pure epoll_wait implementation seems a lot simpler to me, and
> > more elegant: timeout is an argument to epoll_wait already.
> >
> > A new epoll_wait variant would have to be a new system call, so it
> > would be easy to infer support for the feature.
>
> Right, but it'd still mean that you'd need to check this in the fast
> path in the app vs being able to do it at init time.
A process could call the new syscall with timeout 0 at init time to
learn whether the feature is supported.
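
Something like this, as a sketch (epoll_pwait3 is hypothetical here; no
syscall number exists, and the argument order is only illustrative):

	#include <errno.h>
	#include <stdbool.h>
	#include <sys/epoll.h>
	#include <sys/syscall.h>
	#include <time.h>
	#include <unistd.h>

	/* __NR_epoll_pwait3 is a placeholder; nothing is assigned today. */
	static bool have_min_wait(int epfd)
	{
		struct epoll_event ev;
		struct timespec zero = { 0, 0 };

		/* a zeroed timeout returns immediately, events ready or not */
		if (syscall(__NR_epoll_pwait3, epfd, &ev, 1, &zero, NULL,
			    &zero) < 0 && errno == ENOSYS)
			return false;	/* kernel predates the feature */
		return true;
	}
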
> Might there be
> merit to doing both? From the conversion that we tried, the CTL variant
> definitely made things easier to port. The new syscall would enable
> per-call delays, however. There might be some merit to that, though I do
> think that max_events + min_time is how you'd control batching, and
> that's suitably set in the context itself for most use cases.
I'm surprised a CTL variant is easier to port. An epoll_pwait3 with an
extra argument only needs to pass that argument to do_epoll_wait.
FWIW, when adding nsec resolution I initially opted for an init-based
approach, passing a new flag to epoll_create1. Feedback then was that
it was odd to have one syscall affect the behavior of another. The
final version just added a new epoll_pwait2 with timespec.
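
For comparison, a sketch of the two shapes (epoll_pwait2 is real, merged
in Linux 5.11; epoll_pwait3 is hypothetical, with the extra argument and
its position purely illustrative):

	/* merged in Linux 5.11: nanosecond-resolution timeout via timespec */
	int epoll_pwait2(int epfd, struct epoll_event *events, int maxevents,
			 const struct timespec *timeout, const sigset_t *sigmask);

	/* hypothetical: one extra argument carrying the minimum wait */
	int epoll_pwait3(int epfd, struct epoll_event *events, int maxevents,
			 const struct timespec *timeout, const sigset_t *sigmask,
			 const struct timespec *min_wait);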