linux-kernel - Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [day] [month] [year] [list]

Message-ID: <ZZv09bLJvA5M/kc7@DESKTOP-2CCOB1S.>
Date: Mon, 8 Jan 2024 14:13:25 +0100
From: Tobias Huschle <huschle@...ux.ibm.com>
To: "Michael S. Tsirkin" <mst@...hat.com>
Cc: Jason Wang <jasowang@...hat.com>, Abel Wu <wuyun.abel@...edance.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Linux Kernel <linux-kernel@...r.kernel.org>, kvm@...r.kernel.org,
        virtualization@...ts.linux.dev, netdev@...r.kernel.org
Subject: Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6
 sched/fair: Add lag based placement)

On Thu, Dec 14, 2023 at 02:14:59AM -0500, Michael S. Tsirkin wrote:
> 
> Peter, would appreciate feedback on this. When is cond_resched()
> insufficient to give up the CPU? Should Documentation/kernel-hacking/hacking.rst
> be updated to require schedule() instead?
> 

Happy new year everybody!

I'd like to bring this thread back to life. To reiterate:

- The introduction of the EEVDF scheduler revealed a performance
  regression in a uperf testcase of ~50%.
- Tracing the scheduler showed that it takes decisions which are
  in line with its design.
- The traces showed as well, that a vhost instance might run
  excessively long on its CPU in some circumstance. Those cause
  the performance regression as they cause delay times of 100+ms
  for a kworker which drives the actual network processing.
- Before EEVDF, the vhost would always be scheduled off its CPU
  in favor of the kworker, as the kworker was being woken up and
  the former scheduler was giving more priority to the woken up
  task. With EEVDF, the kworker, as a long running process, is
  able to accumulate negative lag, which causes EEVDF to not
  prefer it on its wake up, leaving the vhost running.
- If the kworker is not scheduled when being woken up, the vhost
  continues looping until it is migrated off the CPU.
- The vhost offers to be scheduled off the CPU by calling 
  cond_resched(), but, the the need_resched flag is not set,
  therefore cond_resched() does nothing.

To solve this, I see the following options 
  (might not be a complete nor a correct list)
- Along with the wakeup of the kworker, need_resched needs to
  be set, such that cond_resched() triggers a reschedule.
- The vhost calls schedule() instead of cond_resched() to give up
  the CPU. This would of course be a significantly stricter
  approach and might limit the performance of vhost in other cases.
- Preventing the kworker from accumulating negative lag as it is
  mostly not runnable and if it runs, it only runs for a very short
  time frame. This might clash with the overall concept of EEVDF.
- On cond_resched(), verify if the consumed runtime of the caller
  is outweighing the negative lag of another process (e.g. the 
  kworker) and schedule the other process. Introduces overhead
  to cond_resched.

I would be curious on feedback on those ideas and interested in
alternative approaches.