linux-kernel - Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)


Message-ID: <cf813f92-9806-4449-b099-1bb2bd492b3c@arm.com>
Date: Tue, 12 Mar 2024 09:45:57 +0000
From: Luis Machado <luis.machado@....com>
To: "Michael S. Tsirkin" <mst@...hat.com>,
 Tobias Huschle <huschle@...ux.ibm.com>
Cc: Jason Wang <jasowang@...hat.com>, Abel Wu <wuyun.abel@...edance.com>,
 Peter Zijlstra <peterz@...radead.org>,
 Linux Kernel <linux-kernel@...r.kernel.org>, kvm@...r.kernel.org,
 virtualization@...ts.linux.dev, netdev@...r.kernel.org, nd <nd@....com>
Subject: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add
 lag based placement)

On 3/11/24 17:05, Michael S. Tsirkin wrote:
> On Thu, Feb 01, 2024 at 12:47:39PM +0100, Tobias Huschle wrote:
>> On Thu, Feb 01, 2024 at 03:08:07AM -0500, Michael S. Tsirkin wrote:
>>> On Thu, Feb 01, 2024 at 08:38:43AM +0100, Tobias Huschle wrote:
>>>> On Sun, Jan 21, 2024 at 01:44:32PM -0500, Michael S. Tsirkin wrote:
>>>>> On Mon, Jan 08, 2024 at 02:13:25PM +0100, Tobias Huschle wrote:
>>>>>> On Thu, Dec 14, 2023 at 02:14:59AM -0500, Michael S. Tsirkin wrote:
>>>>
>>>> -------- Summary --------
>>>>
>>>> In my opinion (I have no vhost experience), the way to go would be either
>>>> replacing the cond_resched with a hard schedule or setting the
>>>> need_resched flag within vhost if a data transfer was successfully
>>>> initiated. It will be necessary to check whether this causes problems with
>>>> other workloads/benchmarks.
>>>
>>> Yes but conceptually I am still in the dark on whether the fact that
>>> periodically invoking cond_resched is no longer sufficient to be nice to
>>> others is a bug, or intentional.  So you feel it is intentional?
>>
>> I would assume that cond_resched is still a valid concept.
>> But, in this particular scenario we have the following problem:
>>
>> So far (with CFS) we had:
>> 1. vhost initiates data transfer
>> 2. kworker is woken up
>> 3. CFS gives priority to woken up task and schedules it
>> 4. kworker runs
>>
>> Now (with EEVDF) we have:
>> 0. In some cases, kworker has accumulated negative lag 
>> 1. vhost initiates data transfer
>> 2. kworker is woken up
>> -3a. EEVDF does not schedule kworker if it has negative lag
>> -4a. vhost continues running, kworker on same CPU starves
>> --
>> -3b. EEVDF schedules kworker if it has positive or no lag
>> -4b. kworker runs
>>
>> In the 3a/4a case, the kworker is given no chance to set the
>> necessary flag. The flag can only be set by another CPU now.
>> The schedule of the kworker was not caused by cond_resched, but
>> rather by the wakeup path of the scheduler.
>>
>> cond_resched works successfully once the load balancer (I suppose) 
>> decides to migrate the vhost off to another CPU. In that case, the
>> load balancer on another CPU sets that flag and we are good.
>> That then eventually allows the scheduler to pick kworker, but very
>> late.
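
For readers following along, the two wakeup paths quoted above can be sketched as a toy model. This is not kernel code: the `Task` class, the two policy functions, `run_vhost_loop` and its `external_resched_at` parameter are all hypothetical names made up for illustration, and the kworker's lag is held constant across iterations, which the real scheduler does not do. The only point is to show why a cond_resched-style check never yields when the wakeup path does not set the flag.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Task:
    name: str
    lag: float  # > 0: owed CPU time, < 0: has run ahead of its fair share


def cfs_wakeup_preempts(woken: Task) -> bool:
    # Toy stand-in for step 3 of the CFS list: the woken task is favoured.
    return True


def eevdf_wakeup_preempts(woken: Task) -> bool:
    # Toy stand-in for steps 3a/3b: only a task with positive or no lag
    # is scheduled on wakeup.
    return woken.lag >= 0


def run_vhost_loop(
    kworker: Task,
    wakeup_preempts: Callable[[Task], bool],
    max_iters: int = 5,
    external_resched_at: Optional[int] = None,
) -> Optional[int]:
    """Return the iteration at which vhost yields to kworker, or None."""
    need_resched = False
    for i in range(max_iters):
        # Steps 1 and 2: vhost initiates a transfer and kworker is woken.
        if wakeup_preempts(kworker):
            need_resched = True
        # Assumption: some other CPU (e.g. the load balancer) sets the flag.
        if external_resched_at is not None and i == external_resched_at:
            need_resched = True
        # cond_resched-like check: yield only if the flag is already set.
        if need_resched:
            return i
    return None  # kworker starved for the whole observed window


if __name__ == "__main__":
    behind = Task("kworker", lag=-1.0)
    print(run_vhost_loop(behind, cfs_wakeup_preempts))
    print(run_vhost_loop(behind, eevdf_wakeup_preempts))
    print(run_vhost_loop(behind, eevdf_wakeup_preempts, external_resched_at=3))
```

In this model the negative-lag kworker only gets to run once the flag is set from outside, which matches the "very late" behaviour described in the quoted text.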
> 
> Are we going anywhere with this btw?
> 
>

I think Tobias had a couple of other threads related to this, with other potential fixes:

https://lore.kernel.org/lkml/20240228161018.14253-1-huschle@linux.ibm.com/

https://lore.kernel.org/lkml/20240228161023.14310-1-huschle@linux.ibm.com/