netdev - Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [day] [month] [year] [list]

Message-ID: <ZXmJYxLfLGBtuQ3L@DESKTOP-2CCOB1S.>
Date: Wed, 13 Dec 2023 11:37:23 +0100
From: Tobias Huschle <huschle@...ux.ibm.com>
To: "Michael S. Tsirkin" <mst@...hat.com>
Cc: Jason Wang <jasowang@...hat.com>, Abel Wu <wuyun.abel@...edance.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Linux Kernel <linux-kernel@...r.kernel.org>, kvm@...r.kernel.org,
        virtualization@...ts.linux.dev, netdev@...r.kernel.org
Subject: Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6
 sched/fair: Add lag based placement)

On Tue, Dec 12, 2023 at 11:15:01AM -0500, Michael S. Tsirkin wrote:
> On Tue, Dec 12, 2023 at 11:00:12AM +0800, Jason Wang wrote:
> > On Tue, Dec 12, 2023 at 12:54 AM Michael S. Tsirkin <mst@...hat.com> wrote:

We played around with the suggestions and some other ideas.
I would like to share some initial results.

We tried the following:

1. Call uncondtional schedule in the vhost_worker function
2. Change the HZ value from 100 to 1000
3. Reverting 05bfb338fa8d vhost: Fix livepatch timeouts in vhost_worker()
4. Adding a cond_resched to translate_desc
5. Reducing VHOST_NET_WEIGHT to 25% of its original value

Please find the diffs below.

Summary:

Option 1 is very very hacky but resolved the regression.
Option 2 reduces the regression by ~20%.
Options 3-5 do not help unfortunately.

Potential explanation:

While the vhost is executing, the need_resched flag is not set (observable
in the traces). Therefore cond_resched and alike will do nothing. vhost
will continue executing until the need_resched flag is set by an external
party, e.g. by a request to migrate the vhost.

Calling schedule unconditionally forces the scheduler to re-evaluate all 
tasks and their vruntime/deadline/vlag values. The scheduler comes to the
correct conclusion, that the kworker should be executed and from there it
is smooth sailing. I will have to verify that sequence by collecting more
traces, but this seems rather plausible.
This hack might of course introduce all kinds of side effects but might
provide an indicator that this is the actual problem.
The big question would be how to solve this conceptually, and, first
things first, whether you think this is a viable hypothesis.

Increasing the HZ value helps most likely because the other CPUs take 
scheduling/load balancing decisions more often as well and therefore
trigger the migration faster.

Bringing down VHOST_NET_WEIGHT even more might also help to shorten the
vhost loop. But I have no intuition how low we can/should go here.

We also changed vq_err to print error messages, but did not encounter any.

Diffs:
--------------------------------------------------------------------------

1. Call uncondtional schedule in the vhost_worker function

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index e0c181ad17e3..16d73fd28831 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -414,6 +414,7 @@ static bool vhost_worker(void *data)
                }
        }

+       schedule();
        return !!node;
 }

--------------------------------------------------------------------------

2. Change the HZ value from 100 to 1000

--> config change 

--------------------------------------------------------------------------

3. Reverting 05bfb338fa8d vhost: Fix livepatch timeouts in vhost_worker()

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index e0c181ad17e3..d519d598ebb9 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -410,7 +410,8 @@ static bool vhost_worker(void *data)
                        kcov_remote_start_common(worker->kcov_handle);
                        work->fn(work);
                        kcov_remote_stop();
-                       cond_resched();
+                       if (need_resched())
+                               schedule();
                }
        }

--------------------------------------------------------------------------

4. Adding a cond_resched to translate_desc

I just picked some location.

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index e0c181ad17e3..f885dd29cbd1 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -2367,6 +2367,7 @@ static int translate_desc(struct vhost_virtqueue *vq, u64 addr, u32 len,
                s += size;
                addr += size;
                ++ret;
+               cond_resched();
        }

        if (ret == -EAGAIN)

--------------------------------------------------------------------------

5. Reducing VHOST_NET_WEIGHT to 25% of its original value

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index f2ed7167c848..2c6966ea6229 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -42,7 +42,7 @@ MODULE_PARM_DESC(experimental_zcopytx, "Enable Zero Copy TX;"

 /* Max number of bytes transferred before requeueing the job.
  * Using this limit prevents one virtqueue from starving others. */
-#define VHOST_NET_WEIGHT 0x80000
+#define VHOST_NET_WEIGHT 0x20000

 /* Max number of packets transferred before requeueing the job.
  * Using this limit prevents one virtqueue from starving others with small