[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <2c7509e3-6db0-461e-991b-026553157dbe@bytedance.com>
Date: Sat, 18 Nov 2023 13:14:32 +0800
From: Abel Wu <wuyun.abel@...edance.com>
To: Peter Zijlstra <peterz@...radead.org>,
Tobias Huschle <huschle@...ux.ibm.com>
Cc: Linux Kernel <linux-kernel@...r.kernel.org>, kvm@...r.kernel.org,
virtualization@...ts.linux.dev, netdev@...r.kernel.org,
mst@...hat.com, jasowang@...hat.com
Subject: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair:
Add lag based placement)
On 11/17/23 5:23 PM, Peter Zijlstra Wrote:
>
> Your email is pretty badly mangled by wrapping, please try and
> reconfigure your MUA, esp. the trace and debug output is unreadable.
>
> On Thu, Nov 16, 2023 at 07:58:18PM +0100, Tobias Huschle wrote:
>
>> The base scenario are two KVM guests running on an s390 LPAR. One guest
>> hosts the uperf server, one the uperf client.
>> With EEVDF we observe a regression of ~50% for a strburst test.
>> For a more detailed description of the setup see the section TEST SUMMARY at
>> the bottom.
>
> Well, that's not good :/
>
>> Short summary:
>> The mentioned kworker has been scheduled to CPU 14 before the tracing was
>> enabled.
>> A vhost process is migrated onto CPU 14.
>> The vruntimes of kworker and vhost differ significantly (86642125805 vs
>> 4242563284 -> factor 20)
>
> So bear with me, I know absolutely nothing about virt stuff. I suspect
> there's cgroups involved because shiny or something.
>
> kworkers are typically not in cgroups and are part of the root cgroup,
> but what's a vhost and where does it live?
>
> Also, what are their weights / nice values?
>
>> The vhost process wants to wake up the kworker, therefore the kworker is
>> placed onto the runqueue again and set to runnable.
>> The vhost process continues to execute, waking up other vhost processes on
>> other CPUs.
>>
>> So far this behavior is not different to what we see on pre-EEVDF kernels.
>>
>> On timestamp 576.162767, the vhost process triggers the last wake up of
>> another vhost on another CPU.
>> Until timestamp 576.171155, we see no other activity. Now, the vhost process
>> ends its time slice.
>> Then, vhost gets re-assigned new time slices 4 times and gets then migrated
>> off to CPU 15.
>
> So why does this vhost stay on the CPU if it doesn't have anything to
> do? (I've not tried to make sense of the trace, that's just too
> painful).
>
>> This does not occur with older kernels.
>> The kworker has to wait for the migration to happen in order to be able to
>> execute again.
>> This is due to the fact, that the vruntime of the kworker is significantly
>> larger than the one of vhost.
>
> That's, weird. Can you add a trace_printk() to update_entity_lag() and
> have it print out the lag, limit and vlag (post clamping) values? And
> also in place_entity() for the reverse process, lag pre and post scaling
> or something.
>
> After confirming both tasks are indeed in the same cgroup ofcourse,
> because if they're not, vruntime will be meaningless to compare and we
> should look elsewhere.
>
> Also, what HZ and what preemption mode are you running? If kworker is
> somehow vastly over-shooting it's slice -- keeps running way past the
> avg_vruntime, then it will build up a giant lag and you get what you
> describe, next time it wakes up it gets placed far to the right (exactly
> where it was when it 'finally' went to sleep, relatively speaking).
>
>> We found some options which sound plausible but we are not sure if they are
>> valid or not:
>>
>> 1. The wake up path has a dependency on the vruntime metrics that now delays
>> the execution of the kworker.
>> 2. The previous commit af4cf40470c2 (sched/fair: Add cfs_rq::avg_vruntime)
>> which updates the way cfs_rq->min_vruntime and
>> cfs_rq->avg_runtime are set might have introduced an issue which is
>> uncovered with the commit mentioned above.
>
> Suppose you have a few tasks (of equal weight) on you virtual timeline
> like so:
>
> ---------+---+---+---+---+------
> ^ ^
> | `avg_vruntime
> `-min_vruntime
>
> Then the above would be more or less the relative placements of these
> values. avg_vruntime is the weighted average of the various vruntimes
> and is therefore always in the 'middle' of the tasks, and not somewhere
> out-there.
>
> min_vruntime is a monotonically increasing 'minimum' that's left-ish on
> the tree (there's a few cases where a new task can be placed left of
> min_vruntime and its no longer actuall the minimum, but whatever).
>
> These values should be relatively close to one another, depending
> ofcourse on the spread of the tasks. So I don't think this is causing
> trouble.
>
> Anyway, the big difference with lag based placement is that where
> previously tasks (that do not migrate) retain their old vruntime and on
> placing they get pulled forward to at least min_vruntime, so a task that
> wildly overshoots, but then doesn't run for significant time can still
> be overtaken and then when placed again be 'okay'.
>
> Now OTOH, with lag-based placement, we strictly preserve their relative
> offset vs avg_vruntime. So if they were *far* too the right when they go
> to sleep, they will again be there on placement.
Hi Peter, I'm a little confused here. As we adopt placement strategy #1
when PLACE_LAG is enabled, the lag of that entity needs to be preserved.
Given that the weight doesn't change, we have:
vl' = vl
But in fact it is scaled on placement:
vl' = vl * W/(W + w)
Does this intended? And to illustrate my understanding of strategy #1:
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 07f555857698..a24ef8b297ed 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5131,7 +5131,7 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
*
* EEVDF: placement strategy #1 / #2
*/
- if (sched_feat(PLACE_LAG) && cfs_rq->nr_running) {
+ if (sched_feat(PLACE_LAG) && cfs_rq->nr_running && se->vlag) {
struct sched_entity *curr = cfs_rq->curr;
unsigned long load;
@@ -5150,7 +5150,10 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
* To avoid the 'w_i' term all over the place, we only track
* the virtual lag:
*
- * vl_i = V - v_i <=> v_i = V - vl_i
+ * vl_i = V' - v_i <=> v_i = V' - vl_i
+ *
+ * Where V' is the new weighted average after placing this
+ * entity, and v_i is its newly assigned vruntime.
*
* And we take V to be the weighted average of all v:
*
@@ -5162,41 +5165,17 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
* vl_i is given by:
*
* V' = (\Sum w_j*v_j + w_i*v_i) / (W + w_i)
- * = (W*V + w_i*(V - vl_i)) / (W + w_i)
- * = (W*V + w_i*V - w_i*vl_i) / (W + w_i)
- * = (V*(W + w_i) - w_i*l) / (W + w_i)
- * = V - w_i*vl_i / (W + w_i)
- *
- * And the actual lag after adding an entity with vl_i is:
- *
- * vl'_i = V' - v_i
- * = V - w_i*vl_i / (W + w_i) - (V - vl_i)
- * = vl_i - w_i*vl_i / (W + w_i)
- *
- * Which is strictly less than vl_i. So in order to preserve lag
- * we should inflate the lag before placement such that the
- * effective lag after placement comes out right.
- *
- * As such, invert the above relation for vl'_i to get the vl_i
- * we need to use such that the lag after placement is the lag
- * we computed before dequeue.
+ * = (W*V + w_i*(V' - vl_i)) / (W + w_i)
+ * = V - w_i*vl_i / W
*
- * vl'_i = vl_i - w_i*vl_i / (W + w_i)
- * = ((W + w_i)*vl_i - w_i*vl_i) / (W + w_i)
- *
- * (W + w_i)*vl'_i = (W + w_i)*vl_i - w_i*vl_i
- * = W*vl_i
- *
- * vl_i = (W + w_i)*vl'_i / W
*/
load = cfs_rq->avg_load;
if (curr && curr->on_rq)
load += scale_load_down(curr->load.weight);
-
- lag *= load + scale_load_down(se->load.weight);
if (WARN_ON_ONCE(!load))
load = 1;
- lag = div_s64(lag, load);
+
+ vruntime -= div_s64(lag * scale_load_down(se->load.weight), load);
}
se->vruntime = vruntime - lag;
Powered by blists - more mailing lists