Message-ID: <b6bf22e9-0d01-4e6b-bd06-4eaef73e76ce@linux.dev>
Date: Thu, 29 Aug 2024 16:27:06 +0800
From: Chengming Zhou <chengming.zhou@...ux.dev>
To: Johannes Weiner <hannes@...xchg.org>, bugzilla-daemon@...nel.org
Cc: Suren Baghdasaryan <surenb@...gle.com>,
Peter Zijlstra <peterz@...radead.org>, Shakeel Butt <shakeelb@...gle.com>,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH] sched: psi: fix bogus pressure spikes from aggregation
race
On 2024/8/27 20:18, Johannes Weiner wrote:
> Sending a proper patch. Peter, can you please take this?
>
> ---
>
> Brandon reports sporadic, nonsensical spikes in cumulative pressure
> time (total=) when reading cpu.pressure at a high rate. This is due to
> a race condition between reader aggregation and tasks changing states.
>
> While it affects all states and all resources captured by PSI, in
> practice it most likely triggers with CPU pressure, since scheduling
> events are so frequent compared to other resource events.
>
> The race context is the live snooping of ongoing stalls during a
> pressure read. The read aggregates per-cpu records for stalls that
> have concluded, but will also incorporate, ad hoc, the duration of any
> active state that hasn't been recorded yet. This is important to get
> timely measurements of ongoing stalls. Those ad-hoc samples are
> calculated on-the-fly up to the current time on that CPU; since the
> stall hasn't concluded, it's expected that this is the minimum amount
> of stall time that will enter the per-cpu records once it does.
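[ For reference, the live snooping on the read side looks roughly like
  this (simplified from get_recent_times() in kernel/sched/psi.c; the
  NONIDLE/CPU special cases are omitted): ]

	do {
		seq = read_seqcount_begin(&groupc->seq);
		now = cpu_clock(cpu);	/* the reader's own clock read */
		memcpy(times, groupc->times, sizeof(times));
		state_mask = groupc->state_mask;
		state_start = groupc->state_start;
	} while (read_seqcount_retry(&groupc->seq, seq));

	for (s = 0; s < NR_PSI_STATES; s++) {
		/* Ad-hoc sample for a stall that hasn't concluded yet */
		if (state_mask & (1 << s))
			times[s] += now - state_start;

		/* Delta vs. the last snapshot this aggregator remembered */
		delta = times[s] - groupc->times_prev[aggregator][s];
		groupc->times_prev[aggregator][s] = times[s];
	}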
>
> The problem is that the path that concludes the state uses a CPU clock
> read that is not synchronized against aggregators; the clock is read
> outside of the seqlock protection. This allows aggregators to race and
> snoop a stall with a longer duration than will actually be recorded.
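[ I.e. the problematic interleaving, before this patch:

	writer (state change)              reader (aggregator)
	---------------------              -------------------
	now = cpu_clock(cpu);  /* t0 */
	                                   read_seqcount_begin()
	                                   now = cpu_clock(cpu);  /* t1 > t0 */
	                                   snoop live stall up to t1
	                                   read_seqcount_retry() -> no retry
	write_seqcount_begin()
	record_times(groupc, now)  /* records the stall only up to t0 */
	write_seqcount_end()

  The aggregator's snapshot now exceeds what the per-cpu record will
  ever contain for that stall. ]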
>
> With the recorded stall time being less than the last snapshot
> remembered by the aggregator, a subsequent sample will underflow and
> observe a bogus delta value, resulting in an erratic jump in pressure.
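[ Concretely: say the aggregator snooped 105us of live stall at t1, but
  the stall then concludes and record_times() adds only 100us, because
  the writer's clock was read at the earlier t0. The next read computes

	delta = times[s] - groupc->times_prev[aggregator][s];  /* 100 - 105 */

  in unsigned arithmetic, so the delta wraps around to a huge positive
  value and shows up as an erratic jump in total=. ]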
>
> Fix this by moving the clock read of the state change into the seqlock
> protection. This ensures no aggregation can snoop live stalls past the
> time that's recorded when the state concludes.
>
> Reported-by: Brandon Duffany <brandon@...ldbuddy.io>
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=219194
> Fixes: df77430639c9 ("psi: Reduce calls to sched_clock() in psi")
> Cc: stable@...r.kernel.org
> Signed-off-by: Johannes Weiner <hannes@...xchg.org>
Good catch!
Reviewed-by: Chengming Zhou <chengming.zhou@...ux.dev>
Maybe another solution would be to detect on the read side that this
race has happened, and then return a delta of 0, right?
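Something like this in get_recent_times(), I mean (illustrative only,
assuming the real delta is always small enough to treat as signed):

	delta = times[s] - groupc->times_prev[aggregator][s];
	if ((s32)delta < 0)	/* raced with a concluding state change */
		delta = 0;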
Thanks.
> ---
> kernel/sched/psi.c | 26 ++++++++++++--------------
> 1 file changed, 12 insertions(+), 14 deletions(-)
>
> diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> index 020d58967d4e..84dad1511d1e 100644
> --- a/kernel/sched/psi.c
> +++ b/kernel/sched/psi.c
> @@ -769,12 +769,13 @@ static void record_times(struct psi_group_cpu *groupc, u64 now)
> }
>
> static void psi_group_change(struct psi_group *group, int cpu,
> - unsigned int clear, unsigned int set, u64 now,
> + unsigned int clear, unsigned int set,
> bool wake_clock)
> {
> struct psi_group_cpu *groupc;
> unsigned int t, m;
> u32 state_mask;
> + u64 now;
>
> lockdep_assert_rq_held(cpu_rq(cpu));
> groupc = per_cpu_ptr(group->pcpu, cpu);
> @@ -789,6 +790,7 @@ static void psi_group_change(struct psi_group *group, int cpu,
> * SOME and FULL time these may have resulted in.
> */
> write_seqcount_begin(&groupc->seq);
> + now = cpu_clock(cpu);
>
> /*
> * Start with TSK_ONCPU, which doesn't have a corresponding
> @@ -899,18 +901,15 @@ void psi_task_change(struct task_struct *task, int clear, int set)
> {
> int cpu = task_cpu(task);
> struct psi_group *group;
> - u64 now;
>
> if (!task->pid)
> return;
>
> psi_flags_change(task, clear, set);
>
> - now = cpu_clock(cpu);
> -
> group = task_psi_group(task);
> do {
> - psi_group_change(group, cpu, clear, set, now, true);
> + psi_group_change(group, cpu, clear, set, true);
> } while ((group = group->parent));
> }
>
> @@ -919,7 +918,6 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
> {
> struct psi_group *group, *common = NULL;
> int cpu = task_cpu(prev);
> - u64 now = cpu_clock(cpu);
>
> if (next->pid) {
> psi_flags_change(next, 0, TSK_ONCPU);
> @@ -936,7 +934,7 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
> break;
> }
>
> - psi_group_change(group, cpu, 0, TSK_ONCPU, now, true);
> + psi_group_change(group, cpu, 0, TSK_ONCPU, true);
> } while ((group = group->parent));
> }
>
> @@ -974,7 +972,7 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
> do {
> if (group == common)
> break;
> - psi_group_change(group, cpu, clear, set, now, wake_clock);
> + psi_group_change(group, cpu, clear, set, wake_clock);
> } while ((group = group->parent));
>
> /*
> @@ -986,7 +984,7 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
> if ((prev->psi_flags ^ next->psi_flags) & ~TSK_ONCPU) {
> clear &= ~TSK_ONCPU;
> for (; group; group = group->parent)
> - psi_group_change(group, cpu, clear, set, now, wake_clock);
> + psi_group_change(group, cpu, clear, set, wake_clock);
> }
> }
> }
> @@ -997,8 +995,8 @@ void psi_account_irqtime(struct rq *rq, struct task_struct *curr, struct task_st
> int cpu = task_cpu(curr);
> struct psi_group *group;
> struct psi_group_cpu *groupc;
> - u64 now, irq;
> s64 delta;
> + u64 irq;
>
> if (static_branch_likely(&psi_disabled))
> return;
> @@ -1011,7 +1009,6 @@ void psi_account_irqtime(struct rq *rq, struct task_struct *curr, struct task_st
> if (prev && task_psi_group(prev) == group)
> return;
>
> - now = cpu_clock(cpu);
> irq = irq_time_read(cpu);
> delta = (s64)(irq - rq->psi_irq_time);
> if (delta < 0)
> @@ -1019,12 +1016,15 @@ void psi_account_irqtime(struct rq *rq, struct task_struct *curr, struct task_st
> rq->psi_irq_time = irq;
>
> do {
> + u64 now;
> +
> if (!group->enabled)
> continue;
>
> groupc = per_cpu_ptr(group->pcpu, cpu);
>
> write_seqcount_begin(&groupc->seq);
> + now = cpu_clock(cpu);
>
> record_times(groupc, now);
> groupc->times[PSI_IRQ_FULL] += delta;
> @@ -1223,11 +1223,9 @@ void psi_cgroup_restart(struct psi_group *group)
> for_each_possible_cpu(cpu) {
> struct rq *rq = cpu_rq(cpu);
> struct rq_flags rf;
> - u64 now;
>
> rq_lock_irq(rq, &rf);
> - now = cpu_clock(cpu);
> - psi_group_change(group, cpu, 0, 0, now, true);
> + psi_group_change(group, cpu, 0, 0, true);
> rq_unlock_irq(rq, &rf);
> }
> }