linux-kernel - Re: [Resend PATCH] psi : calc cfs task memstall time more precisely

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <YYGV1TxsZXzGXFmx@cmpxchg.org>
Date:   Tue, 2 Nov 2021 15:47:33 -0400
From:   Johannes Weiner <hannes@...xchg.org>
To:     Huangzhaoyang <huangzhaoyang@...il.com>
Cc:     Andrew Morton <akpm@...ux-foundation.org>,
        Michal Hocko <mhocko@...nel.org>,
        Vladimir Davydov <vdavydov.dev@...il.com>,
        Zhaoyang Huang <zhaoyang.huang@...soc.com>, linux-mm@...ck.org,
        linux-kernel@...r.kernel.org, Peter Zijlstra <peterz@...radead.org>
Subject: Re: [Resend PATCH] psi : calc cfs task memstall time more precisely

CC peterz as well for rt and timekeeping magic

On Fri, Oct 15, 2021 at 02:16:52PM +0800, Huangzhaoyang wrote:
> From: Zhaoyang Huang <zhaoyang.huang@...soc.com>
> 
> In an EAS enabled system, there are two scenarios discordant to current design,
> 
> 1. workload used to be heavy uneven among cores for sake of scheduler policy.
> RT task usually preempts CFS task in little core.
> 2. CFS task's memstall time is counted as simple as exit - entry so far, which
> ignore the preempted time by RT, DL and Irqs.
> 
> With these two constraints, the percpu nonidle time would be mainly consumed by
> none CFS tasks and couldn't be averaged. Eliminating them by calc the time growth
> via the proportion of cfs_rq's utilization on the whole rq.
> 
> eg.
> Here is the scenario which this commit want to fix, that is the rt and irq consume
> some utilization of the whole rq. This scenario could be typical in a core
> which is assigned to deal with all irqs. Furthermore, the rt task used to run on
> little core under EAS.
> 
> Binder:305_3-314    [002] d..1   257.880195: psi_memtime_fixup: original:30616,adjusted:25951,se:89,cfs:353,rt:139,dl:0,irq:18
> droid.phone-1525    [001] d..1   265.145492: psi_memtime_fixup: original:61616,adjusted:53492,se:55,cfs:225,rt:121,dl:0,irq:15
>
> Signed-off-by: Zhaoyang Huang <zhaoyang.huang@...soc.com>
> ---
>  kernel/sched/psi.c | 20 +++++++++++++++++++-
>  1 file changed, 19 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> index cc25a3c..754a836 100644
> --- a/kernel/sched/psi.c
> +++ b/kernel/sched/psi.c
> @@ -182,6 +182,8 @@ struct psi_group psi_system = {
>  
>  static void psi_avgs_work(struct work_struct *work);
>  
> +static unsigned long psi_memtime_fixup(u32 growth);
> +
>  static void group_init(struct psi_group *group)
>  {
>  	int cpu;
> @@ -492,6 +494,21 @@ static u64 window_update(struct psi_window *win, u64 now, u64 value)
>  	return growth;
>  }
>  
> +static unsigned long psi_memtime_fixup(u32 growth)
> +{
> +	struct rq *rq = task_rq(current);
> +	unsigned long growth_fixed = (unsigned long)growth;
> +
> +	if (!(current->policy == SCHED_NORMAL || current->policy == SCHED_BATCH))
> +		return growth_fixed;
> +
> +	if (current->in_memstall)
> +		growth_fixed = div64_ul((1024 - rq->avg_rt.util_avg - rq->avg_dl.util_avg
> +					- rq->avg_irq.util_avg + 1) * growth, 1024);
> +
> +	return growth_fixed;
> +}
> +
>  static void init_triggers(struct psi_group *group, u64 now)
>  {
>  	struct psi_trigger *t;
> @@ -658,6 +675,7 @@ static void record_times(struct psi_group_cpu *groupc, u64 now)
>  	}
>  
>  	if (groupc->state_mask & (1 << PSI_MEM_SOME)) {
> +		delta = psi_memtime_fixup(delta);

Ok, so we want to deduct IRQ and RT preemption time from the memstall
period of an active reclaimer, since it's technically not stalled on
memory during this time but on CPU.

However, we do NOT want to deduct IRQ and RT time from memstalls that
are sleeping on refaults swapins, since they are not affected by what
is going on on the CPU.

Does util_avg capture that difference? I'm not confident it does - but
correct me if I'm wrong. We need length of time during which and IRQ
or an RT task preempted the old rq->curr, not absolute irq/rt length.

(Btw, such preemption periods, in addition to being deducted from
memory stalls, should probably also be added to CPU contention stalls,
to make CPU pressure reporting more accurate as well.)