Message-ID: <75c8f5fe-6d5f-32a9-1417-818246126789@kernel.dk>
Date: Tue, 8 Nov 2022 09:15:23 -0700
From: Jens Axboe <axboe@...nel.dk>
To: Stefan Hajnoczi <stefanha@...hat.com>
Cc: linux-kernel@...r.kernel.org, netdev@...r.kernel.org
Subject: Re: [PATCHSET v3 0/5] Add support for epoll min_wait

On 11/8/22 9:10 AM, Stefan Hajnoczi wrote:
> On Tue, Nov 08, 2022 at 07:09:30AM -0700, Jens Axboe wrote:
>> On 11/8/22 7:00 AM, Stefan Hajnoczi wrote:
>>> On Mon, Nov 07, 2022 at 02:38:52PM -0700, Jens Axboe wrote:
>>>> On 11/7/22 1:56 PM, Stefan Hajnoczi wrote:
>>>>> Hi Jens,
>>>>> NICs and storage controllers have interrupt mitigation/coalescing
>>>>> mechanisms that are similar.
>>>>
>>>> Yep
>>>>
>>>>> NVMe has an Aggregation Time (timeout) and an Aggregation Threshold
>>>>> (counter) value. When a completion occurs, the device delays the
>>>>> interrupt until the timeout expires or the completion counter
>>>>> reaches the threshold.
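
For reference, NVMe exposes this via the Set Features command with the
Interrupt Coalescing feature (Feature ID 0x08), which packs the
threshold and the time into one dword. A quick illustrative sketch of
how that value is built, based on my reading of the spec (double-check
before relying on it):

#include <stdint.h>

/*
 * NVMe Interrupt Coalescing (Feature ID 0x08):
 *   bits  7:0  Aggregation Threshold (THR), 0's based completion count
 *   bits 15:8  Aggregation Time (TIME), in 100 usec increments
 */
static inline uint32_t nvme_irq_coalesce_dword(uint8_t completions,
					       uint8_t time_100us)
{
	return (uint32_t)(completions - 1) | ((uint32_t)time_100us << 8);
}

/*
 * e.g. interrupt after 8 completions or 200 usec, whichever comes
 * first: nvme_irq_coalesce_dword(8, 2) == 0x0207, which could then be
 * handed to Set Features, e.g. with nvme-cli:
 *   nvme set-feature /dev/nvme0 -f 8 -v 0x0207
 */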
>>>>>
>>>>> If I've read the code correctly, the min_wait window starts at
>>>>> entry to epoll_wait(2), while NVMe's Aggregation Time is measured
>>>>> from the first completion.
>>>>>
>>>>> It makes me wonder which approach is more useful for applications. With
>>>>> the Aggregation Time approach applications can control how much extra
>>>>> latency is added. What do you think about that approach?
>>>>
>>>> We only tested the current approach, where the time is measured from
>>>> entry, not from when the first event arrives. I suspect the nvme
>>>> approach is better suited to the hw side; the epoll timeout helps
>>>> ensure that we batch within xx usec rather than xx usec + whatever
>>>> the delay is until the first one arrives. That's why it's handled
>>>> that way currently, as it gives you a fixed batch latency.
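
To make the difference concrete, here's a rough userspace approximation
of the entry-based window (illustrative only, the helper names are made
up, and epoll_wait(2)'s millisecond granularity makes it much coarser
than doing it in the kernel):

#include <stdint.h>
#include <sys/epoll.h>
#include <time.h>

static int64_t now_us(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec * 1000000LL + ts.tv_nsec / 1000;
}

/*
 * Entry-based batching window: the added latency is bounded by
 * min_wait_us because the clock starts at entry, not at the first
 * event. An NVMe-style window would instead set 'deadline' after the
 * first epoll_wait() below returns.
 */
static int epoll_wait_entry_window(int epfd, struct epoll_event *evs,
				   int maxevents, int timeout_ms,
				   int64_t min_wait_us)
{
	int64_t deadline = now_us() + min_wait_us;	/* starts at entry */
	int n;

	/* wait for the first batch, honoring the caller's full timeout */
	n = epoll_wait(epfd, evs, maxevents, timeout_ms);
	if (n <= 0)
		return n;

	/* keep collecting events until the entry-based window closes */
	while (n < maxevents) {
		int64_t left_ms = (deadline - now_us()) / 1000;
		int r;

		if (left_ms <= 0)
			break;
		r = epoll_wait(epfd, evs + n, maxevents - n, (int)left_ms);
		if (r <= 0)
			break;
		n += r;
	}
	return n;
}

Doing this in the kernel avoids the extra syscalls and gets usec
granularity, which is the point of the patchset.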
>>>
>>> min_wait is fine when the goal is just maximizing throughput without any
>>> latency targets.
>>
>> That's not true at all; I think you're thinking in different time
>> scales than the ones this would be used for.
>>
>>> The min_wait approach makes it hard to set a useful upper bound on
>>> latency because unlucky requests that complete early experience much
>>> more latency than requests that complete later.
>>
>> As mentioned in the cover letter or the main patch, this is most useful
>> for the medium load kind of scenarios. For high load, the min_wait time
>> ends up not mattering because you will hit maxevents first anyway. For
>> the testing that we did, the target was 200-300 usec, and 200 usec was
>> used for the actual test. Depending on what kind of traffic the server
>> is serving, that's usually not much of a concern.
>> reply, I'm guessing you're thinking of much higher min_wait numbers. I
>> don't think those would make sense. If your rate of arrival is low
>> enough that min_wait needs to be high to make a difference, then the
>> load is low enough anyway that it doesn't matter. Hence I'd argue that
>> it is indeed NOT hard to set a useful upper bound on latency, because
>> that is very much what min_wait is.
>>
>> I'm happy to argue the merits of one approach over another, but keep in
>> mind that this particular approach was not pulled out of thin air AND it has
>> actually been tested and verified successfully on a production workload.
>> This isn't a hypothetical benchmark kind of setup.
>
> Fair enough. I just wanted to make sure the syscall interface that gets
> merged is as useful as possible.

That is indeed the main discussion as far as I'm concerned - syscall,
ctl, or both? At this point I'm inclined to just push forward with the
ctl addition. A new syscall can always be added later, and if we do add
one, it'd be nice to make it one that works going forward so we don't
have to keep adding epoll_wait variants...
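
For the ctl route, usage with this series would look roughly like the
below. EPOLL_CTL_MIN_WAIT and the convention of passing the usec value
in the fd slot are taken from this patchset and aren't in mainline
headers, so treat it as a sketch:

#include <stdio.h>
#include <sys/epoll.h>

#ifndef EPOLL_CTL_MIN_WAIT
#define EPOLL_CTL_MIN_WAIT 4	/* from this series, not in mainline */
#endif

int main(void)
{
	struct epoll_event events[128];
	int epfd = epoll_create1(0);
	int n;

	if (epfd < 0)
		return 1;

	/*
	 * Ask epoll_wait() on this context to batch events for at least
	 * 200 usec; per the series, the time is passed in the 'fd' slot.
	 */
	if (epoll_ctl(epfd, EPOLL_CTL_MIN_WAIT, 200, NULL) < 0)
		perror("EPOLL_CTL_MIN_WAIT (kernel without this series?)");

	/*
	 * Returns once 128 events are ready or at least 200 usec have
	 * passed with one or more events pending; under high load
	 * maxevents is hit first and min_wait adds no latency.
	 */
	n = epoll_wait(epfd, events, 128, 1000);
	printf("got %d events\n", n);
	return 0;
}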
--
Jens Axboe