netdev - Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)


Message-ID: <ZfQH7b3ZBwqwV3G3@DESKTOP-2CCOB1S.>
Date: Fri, 15 Mar 2024 09:33:49 +0100
From: Tobias Huschle <huschle@...ux.ibm.com>
To: "Michael S. Tsirkin" <mst@...hat.com>
Cc: Luis Machado <luis.machado@....com>, Jason Wang <jasowang@...hat.com>,
        Abel Wu <wuyun.abel@...edance.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Linux Kernel <linux-kernel@...r.kernel.org>, kvm@...r.kernel.org,
        virtualization@...ts.linux.dev, netdev@...r.kernel.org,
        nd <nd@....com>
Subject: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add
 lag based placement)

On Thu, Mar 14, 2024 at 11:09:25AM -0400, Michael S. Tsirkin wrote:
> 
> Thanks a lot! To clarify it is not that I am opposed to changing vhost.
> I would like however for some documentation to exist saying that if you
> do abc then call API xyz. Then I hope we can feel a bit safer that
> future scheduler changes will not break vhost (though as usual, nothing
> is for sure).  Right now we are going by the documentation and that says
> cond_resched so we do that.
> 
> -- 
> MST
> 

Here I'd like to add that we have two different problems:

1. cond_resched not working as expected
   This appears to me to be a bug in the scheduler, which lets the cgroup
   that the vhost is running in loop endlessly. In EEVDF terms, the cgroup
   is allowed to surpass its own deadline without consequences. One of my RFCs
   mentioned above addresses this issue (I'm not happy with the
   implementation yet). The issue only appears in that specific scenario, so
   it's a corner case rather than a general issue.
   However, this fix will still allow the vhost to run until it reaches its
   deadline, which is one full time slice. This brings the max delays down
   from 300+ms to whatever the time slice is, which is not enough to fix the
   regression.

2. vhost relying on kworker being scheduled on wake up
   This is the bigger issue for the regression. There are rare cases where
   the vhost runs only for a very short amount of time before it wakes up
   the kworker. At the same time, the kworker takes longer than usual to
   complete its work, and longer than the vhost ran before it. We are
   talking four-digit to low five-digit nanosecond values.
   With those two being the only tasks on the CPU, the scheduler now assumes
   that the kworker wants to unfairly consume more than the vhost and denies
   it being scheduled on wakeup.
   In the regular cases, the kworker is faster than the vhost, so the
   scheduler assumes that the kworker needs help, which benefits the
   scenario we are looking at.
   Unfortunately, in the bad case this means that cond_resched cannot work
   as well as before for this particular case.
   So, let's assume that problem 1 from above is fixed. It will still take
   one full time slice until the scheduler sets the need_resched flag,
   because that only happens once the vhost surpasses its deadline. Before
   that point, the scheduler cannot know that the kworker should actually
   run. The kworker is unable to communicate that itself, since it's not
   getting scheduled, and there is no external entity that could intervene.
   Hence my argument that cond_resched still works as expected. The
   crucial part is that the wakeup behavior has changed, which is why I'm
   a bit reluctant to propose a documentation change on cond_resched.
   I could see proposing a doc change saying that cond_resched should not be
   used if a task heavily relies on a woken-up task being scheduled.
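
To illustrate the wakeup behavior from problem 2, here is a toy model. It is
not kernel code and makes several simplifying assumptions: two equal-weight
tasks that start a round with equal vruntime, lag computed as the average
vruntime minus the task's own vruntime, and "scheduled on wakeup" treated as
lag >= 0. The function names and the 3 ms time slice are hypothetical and only
serve the illustration.

```python
def kworker_lag_at_sleep(vhost_ran_ns, kworker_ran_ns):
    """Toy lag of the kworker when it goes back to sleep, in ns.

    Assumes both tasks have equal weight and started the round with equal
    vruntime, so each task's vruntime is simply the time it ran.
    """
    avg_vruntime = (vhost_ran_ns + kworker_ran_ns) / 2
    return avg_vruntime - kworker_ran_ns


def scheduled_on_wakeup(saved_lag_ns):
    """Simplification: a task woken with non-negative lag may run at once."""
    return saved_lag_ns >= 0


def worst_case_wait_ns(saved_lag_ns, time_slice_ns):
    """Upper bound on the kworker's wait, assuming problem 1 is fixed.

    If the kworker is denied on wakeup, the vhost keeps running until it
    reaches its deadline, i.e. for up to one full time slice.
    """
    return 0 if scheduled_on_wakeup(saved_lag_ns) else time_slice_ns


if __name__ == "__main__":
    TIME_SLICE_NS = 3_000_000  # assumed value, for illustration only

    # Regular case: the kworker is faster than the vhost.
    regular = kworker_lag_at_sleep(vhost_ran_ns=20_000, kworker_ran_ns=5_000)
    # Bad case: short vhost run, kworker takes longer than usual.
    bad = kworker_lag_at_sleep(vhost_ran_ns=2_000, kworker_ran_ns=15_000)

    for label, lag in (("regular", regular), ("bad", bad)):
        print(label, lag, scheduled_on_wakeup(lag),
              worst_case_wait_ns(lag, TIME_SLICE_NS))
```

In this model the kworker ends the regular round with positive lag and would
run right away on its next wakeup, while in the bad round it ends with
negative lag and has to wait for up to one time slice, even though
cond_resched itself behaves as documented.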