Message-ID: <ZfQH7b3ZBwqwV3G3@DESKTOP-2CCOB1S.>
Date: Fri, 15 Mar 2024 09:33:49 +0100
From: Tobias Huschle <huschle@...ux.ibm.com>
To: "Michael S. Tsirkin" <mst@...hat.com>
Cc: Luis Machado <luis.machado@....com>, Jason Wang <jasowang@...hat.com>,
Abel Wu <wuyun.abel@...edance.com>,
Peter Zijlstra <peterz@...radead.org>,
Linux Kernel <linux-kernel@...r.kernel.org>, kvm@...r.kernel.org,
virtualization@...ts.linux.dev, netdev@...r.kernel.org,
nd <nd@....com>
Subject: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add
lag based placement)
On Thu, Mar 14, 2024 at 11:09:25AM -0400, Michael S. Tsirkin wrote:
>
> Thanks a lot! To clarify it is not that I am opposed to changing vhost.
> I would like however for some documentation to exist saying that if you
> do abc then call API xyz. Then I hope we can feel a bit safer that
> future scheduler changes will not break vhost (though as usual, nothing
> is for sure). Right now we are going by the documentation and that says
> cond_resched so we do that.
>
> --
> MST
>
Here I'd like to add that we have two different problems:
1. cond_resched not working as expected
This appears to me to be a scheduler bug where the cgroup that vhost
runs in is allowed to loop endlessly. In EEVDF terms, the cgroup can
surpass its own deadline without consequence. One of my RFCs mentioned
above addresses this issue (I'm not yet happy with the implementation).
It only appears in this specific scenario, so it is a corner case
rather than a general problem.
But even with that fix, vhost is still allowed to run up to its
deadline, which is one full time slice. That brings the maximum delays
down from 300+ ms to whatever the time slice is, which is not enough to
fix the regression.
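To illustrate why the fix for problem 1 only bounds the delay at one
slice, here is a toy model of the EEVDF deadline check. This is a
simplification I wrote for illustration, not the actual code in
kernel/sched/fair.c; the 3 ms slice is an assumed value.

```python
def entity_deadline(vruntime_ns, slice_ns):
    """EEVDF: a task's virtual deadline is its vruntime plus its slice."""
    return vruntime_ns + slice_ns

def should_resched(vruntime_ns, deadline_ns):
    """A looping task is only flagged once it passes its deadline."""
    return vruntime_ns >= deadline_ns

# A vhost-like task with an (assumed) 3 ms slice can run for the whole
# slice before need_resched is set, so the kworker's wakeup delay is
# bounded by one slice instead of 300+ ms -- better, but still too long.
slice_ns = 3_000_000
deadline = entity_deadline(0, slice_ns)
print(should_resched(slice_ns - 1, deadline))  # False: still within slice
print(should_resched(slice_ns, deadline))      # True: slice consumed
```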
2. vhost relying on kworker being scheduled on wake up
This is the bigger issue for the regression. There are rare cases where
vhost runs only for a very short time before waking the kworker, while
the kworker simultaneously takes longer than usual to complete its
work, longer than vhost ran before it. We are talking four-digit to low
five-digit nanosecond values.
With those two being the only tasks on the CPU, the scheduler now
assumes that the kworker wants to unfairly consume more CPU time than
vhost and denies scheduling it on wakeup.
In the regular case, the kworker is faster than vhost, so the scheduler
assumes the kworker needs help, which benefits the scenario we are
looking at.
In the bad case, this unfortunately means that cond_resched cannot work
as well as before for this particular case.
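The wakeup denial in the bad case can be sketched with a toy model of
lag-based placement (the behavior bisected to 86bfbb7ce4f6). This is my
simplified reading, assuming lag is the distance from the queue's
average vruntime; the real placement code differs in detail.

```python
def place_on_wakeup(avg_vruntime, lag):
    """On wakeup a task is placed at avg_vruntime - lag: negative lag
    (it recently ran 'too much') pushes it behind the queue average."""
    return avg_vruntime - lag

def runs_immediately(task_vruntime, avg_vruntime):
    """EEVDF eligibility: only tasks at or before the average run now."""
    return task_vruntime <= avg_vruntime

# Regular case: the kworker ran less than vhost -> positive lag,
# placed ahead of the average, scheduled right away on wakeup.
print(runs_immediately(place_on_wakeup(1000, +200), 1000))  # True

# Bad case: the kworker ran a few microseconds longer than vhost ->
# negative lag, placed behind the average, wakeup scheduling denied
# even though vhost will just keep looping.
print(runs_immediately(place_on_wakeup(1000, -200), 1000))  # False
```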
So, let's assume that problem 1 from above is fixed. It will still take
one full time slice for the scheduler to set the need_resched flag,
because only then does vhost surpass its deadline. Before that point
the scheduler cannot know that the kworker should actually run; the
kworker cannot communicate this itself since it is not being scheduled,
and there is no external entity that could intervene.
Hence my argument that cond_resched still works as expected. The
crucial part is that the wakeup behavior has changed, which is why I'm
a bit reluctant to propose a documentation change for cond_resched.
I could see proposing a doc change stating that cond_resched should not
be used if a task heavily relies on a woken-up task being scheduled.