Date:   Wed, 19 Jul 2023 13:18:26 +0800
From:   Aaron Lu <aaron.lu@...el.com>
To:     Vincent Guittot <vincent.guittot@...aro.org>
CC:     Peter Zijlstra <peterz@...radead.org>,
        Ingo Molnar <mingo@...hat.com>,
        Juri Lelli <juri.lelli@...hat.com>,
        Daniel Jordan <daniel.m.jordan@...cle.com>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        Steven Rostedt <rostedt@...dmis.org>,
        Ben Segall <bsegall@...gle.com>,
        "Mel Gorman" <mgorman@...e.de>,
        Daniel Bristot de Oliveira <bristot@...hat.com>,
        Valentin Schneider <vschneid@...hat.com>,
        Tim Chen <tim.c.chen@...el.com>,
        Nitin Tekchandani <nitin.tekchandani@...el.com>,
        Yu Chen <yu.c.chen@...el.com>,
        Waiman Long <longman@...hat.com>,
        <linux-kernel@...r.kernel.org>
Subject: Re: [RFC PATCH 3/4] sched/fair: delay update_tg_load_avg() for
 cfs_rq's removed load

On Tue, Jul 18, 2023 at 06:01:51PM +0200, Vincent Guittot wrote:
> On Tue, 18 Jul 2023 at 15:41, Aaron Lu <aaron.lu@...el.com> wrote:
> >
> > When a workload involves many wake-time task migrations, tg->load_avg
> > can be heavily contended among CPUs because every migration involves
> > removing the task's load from its src cfs_rq and attaching that load to
> > its new cfs_rq. Both the removal and the attach require an update to
> > tg->load_avg as well as propagating the change up the hierarchy.
> >
> > E.g. when running postgres_sysbench on a 2-socket/112-core/224-CPU Intel
> > Sapphire Rapids, during a 5s window there are 14 million wakeups and
> > 11 million migrations. Since the workload can trigger many wakeups and
> > migrations, the accesses (both read and write) to tg->load_avg can be
> > unbounded. For that workload, the profile shows update_cfs_group()
> > costs ~13% and update_load_avg() costs ~10%. With
> > netperf/nr_client=nr_cpu/UDP_RR, there are 21 million wakeups and
> > 14 million migrations; update_cfs_group() costs ~25% and
> > update_load_avg() costs ~16%.
> >
> > This patch is an attempt to reduce the cost of accessing tg->load_avg.
> 
> Here you mention tg->load_avg, which is updated with update_tg_load_avg(),
> 
> but your patch below modifies the local updates of the cfs_rq's util_avg,
> runnable_avg and load_avg, which need to be kept up to date.

Yes, since the patch delays propagating the removed load, the upper
cfs_rqs' *_avg can be updated later than with the current code.

> You would be better off delaying or rate limiting the calls to
> update_tg_load_avg() or calc_group_shares()/update_cfs_group(), which
> access tg->load_avg and are the culprit, instead of modifying other
> places.

Thanks for the suggestion and I think it makes perfect sense.

I tried the below to rate limit the update of tg->load_avg to at most
once per ms in update_tg_load_avg():

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a80a73909dc2..e48fd0e6982d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3665,6 +3665,7 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
 static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
 {
 	long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
+	u64 now = cfs_rq_clock_pelt(cfs_rq);
 
 	/*
 	 * No need to update load_avg for root_task_group as it is not used.
@@ -3672,9 +3673,11 @@ static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
 	if (cfs_rq->tg == &root_task_group)
 		return;
 
-	if (abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
+	if ((now - cfs_rq->last_update_tg_load_avg > 1000000) &&
+	    abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
 		atomic_long_add(delta, &cfs_rq->tg->load_avg);
 		cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
+		cfs_rq->last_update_tg_load_avg = now;
 	}
 }
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 14dfaafb3a8f..b5201d44d820 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -594,6 +594,7 @@ struct cfs_rq {
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	unsigned long		tg_load_avg_contrib;
+	u64			last_update_tg_load_avg;
 	long			propagate;
 	long			prop_runnable_sum;
 
From some quick tests using postgres_sysbench and netperf/UDP_RR on SPR,
this alone already delivers good results. For postgres_sysbench, the cost
of the two functions dropped to 1%-2% each; for netperf/UDP_RR, they
dropped to ~2% and ~4%. Together with some more rate limiting on the read
side, I think the end result will be even better. Does the above look
reasonable to you?

Alternatively, I can remove some call sites of update_tg_load_avg() like
you suggested below and only call update_tg_load_avg() when the cfs_rq
has decayed (really just decayed, not when it has removed load pending or
load propagated from its children). That would give a similar result to
the above (roughly once per ms). A rough sketch of what I mean is below.
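
Sketch (illustrative only, not against any particular tree; the
attach/detach branches of update_load_avg() are omitted for brevity):

static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
	u64 now = cfs_rq_clock_pelt(cfs_rq);
	int decayed, propagated;

	if (se->avg.last_update_time && !(flags & SKIP_AGE_LOAD))
		__update_load_avg_se(now, cfs_rq, se);

	/* Track real PELT decay separately from load propagation. */
	decayed    = update_cfs_rq_load_avg(now, cfs_rq);
	propagated = propagate_entity_load_avg(se);

	if (decayed || propagated)
		cfs_rq_util_change(cfs_rq, 0);

	/*
	 * Only a genuine decay (at most roughly once per ms) pushes the
	 * delta into tg->load_avg; removed or propagated load alone no
	 * longer triggers the update.
	 */
	if (decayed && (flags & UPDATE_TG))
		update_tg_load_avg(cfs_rq);
}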

> 
> Have you tried removing update_cfs_group() from enqueue/dequeue and
> only letting the tick update the shares periodically?

patch4 kind of did that :-)

> 
> Have you tried making update_tg_load_avg() return early? The change of
> the cfs_rq's load_avg is written into tg->load_avg only when the change
> is bigger than 1.5%. Maybe increase it to 6%?

Yes, increasing the delta threshold is also a good way to limit the
updates to tg->load_avg. Will try that too; a quick sketch of what I
mean is below.
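
If I read the current check correctly, the 1.5% comes from the /64 in
update_tg_load_avg() shown above (1/64 ~= 1.56%), so going to ~6% would
just be (untested):

	/* 1/16 ~= 6.25%, up from 1/64 ~= 1.56% */
	if (abs(delta) > cfs_rq->tg_load_avg_contrib / 16) {
		atomic_long_add(delta, &cfs_rq->tg->load_avg);
		cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
	}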

> 
> Or, as with update_cfs_group(), remove it from attach/detach entity and
> let the periodic update propagate the change.
> 
> But please don't touch the local updates of *_avg.

OK I see.

Thanks again for the comments, they are very helpful.
