Message-ID: <20241101102842.GW14555@noisy.programming.kicks-ass.net>
Date: Fri, 1 Nov 2024 11:28:42 +0100
From: Peter Zijlstra <peterz@...radead.org>
To: Yafang Shao <laoar.shao@...il.com>
Cc: mingo@...hat.com, juri.lelli@...hat.com, vincent.guittot@...aro.org,
dietmar.eggemann@....com, rostedt@...dmis.org, bsegall@...gle.com,
mgorman@...e.de, vschneid@...hat.com, hannes@...xchg.org,
surenb@...gle.com, cgroups@...r.kernel.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH v4 4/4] sched: Fix cgroup irq accounting for CONFIG_IRQ_TIME_ACCOUNTING
On Fri, Nov 01, 2024 at 11:17:50AM +0800, Yafang Shao wrote:
> After enabling CONFIG_IRQ_TIME_ACCOUNTING to monitor IRQ pressure in our
> container environment, we observed several noticeable behavioral changes.
>
> One of our IRQ-heavy services, such as Redis, reported a significant
> reduction in CPU usage after upgrading to the new kernel with
> CONFIG_IRQ_TIME_ACCOUNTING enabled. However, despite adding more threads
> to handle an increased workload, the CPU usage could not be raised. In
> other words, even though the container’s CPU usage appeared low, it was
> unable to process more workloads to utilize additional CPU resources, which
> caused issues.
> We can verify the CPU usage of the test cgroup using cpuacct.stat. The
> output shows:
>
> system: 53
> user: 2
>
> The CPU usage of the cgroup is relatively low at around 55%, but this usage
> doesn't increase, even with more netperf tasks. The reason is that CPU0 is
> at 100% utilization, as confirmed by mpstat:
>
> 02:56:22 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
> 02:56:23 PM 0 0.99 0.00 55.45 0.00 0.99 42.57 0.00 0.00 0.00 0.00
>
> 02:56:23 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
> 02:56:24 PM 0 2.00 0.00 55.00 0.00 0.00 43.00 0.00 0.00 0.00 0.00
>
> It is clear that the %soft is not accounted into the cgroup of the
> interrupted task. This behavior is unexpected. We should account for IRQ
> time to the cgroup to reflect the pressure the group is under.
>
> After a thorough analysis, I discovered that this change in behavior is due
> to commit 305e6835e055 ("sched: Do not account irq time to current task"),
> which altered whether IRQ time should be charged to the interrupted task.
> While I agree that a task should not be penalized by random interrupts, the
> task itself cannot progress while interrupted. Therefore, the interrupted
> time should be reported to the user.
>
> The system metric in cpuacct.stat is crucial in indicating whether a
> container is under heavy system pressure, including IRQ/softirq activity.
> Hence, IRQ/softirq time should be accounted for in the cpuacct system
> usage, which also applies to cgroup2’s rstat.
>
> This patch reintroduces IRQ/softirq accounting to cgroups.
How!? What does it actually do?
> Signed-off-by: Yafang Shao <laoar.shao@...il.com>
> Cc: Johannes Weiner <hannes@...xchg.org>
> ---
> kernel/sched/core.c | 33 +++++++++++++++++++++++++++++++--
> kernel/sched/psi.c | 14 +++-----------
> kernel/sched/stats.h | 7 ++++---
> 3 files changed, 38 insertions(+), 16 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 06a06f0897c3..5ed2c5c8c911 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -5579,6 +5579,35 @@ __setup("resched_latency_warn_ms=", setup_resched_latency_warn_ms);
> static inline u64 cpu_resched_latency(struct rq *rq) { return 0; }
> #endif /* CONFIG_SCHED_DEBUG */
>
> +#ifdef CONFIG_IRQ_TIME_ACCOUNTING
> +static void account_irqtime(struct rq *rq, struct task_struct *curr,
> + struct task_struct *prev)
> +{
> + int cpu = smp_processor_id();
> + s64 delta;
> + u64 irq;
> +
> + if (!static_branch_likely(&sched_clock_irqtime))
> + return;
> +
> + irq = irq_time_read(cpu);
> + delta = (s64)(irq - rq->psi_irq_time);
At this point the variable is no longer exclusive to PSI and should
probably be renamed.
> + if (delta < 0)
> + return;
> +
> + rq->psi_irq_time = irq;
> + psi_account_irqtime(rq, curr, prev, delta);
> + cgroup_account_cputime(curr, delta);
> + /* We account both softirq and irq into softirq */
> + cgroup_account_cputime_field(curr, CPUTIME_SOFTIRQ, delta);
This seems wrong.. we have CPUTIME_IRQ.
> +}
In fact, much of this seems like it's going about things sideways.
Why can't you just add the cgroup_account_*() garbage to
irqtime_account_irq()? That is where it's still split out into softirq
and irq.
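
To make the difference concrete, here is a minimal userspace sketch, not
kernel code. Every name in it (demo_cputime, demo_cgroup_stat,
demo_account_irq, demo_account_lumped) is hypothetical and only stands in
for the CPUTIME_IRQ / CPUTIME_SOFTIRQ buckets discussed above. It assumes
the caller still knows which kind of time it is charging, which is the
property the split-out accounting point has and the combined delta lacks.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the CPUTIME_IRQ / CPUTIME_SOFTIRQ indices. */
enum demo_cputime { DEMO_IRQ, DEMO_SOFTIRQ, DEMO_NR };

/* Hypothetical per-cgroup counters, in nanoseconds. */
struct demo_cgroup_stat {
	uint64_t time[DEMO_NR];
};

/* Charge while the time is still known to be hardirq or softirq. */
static void demo_account_irq(struct demo_cgroup_stat *cg,
			     enum demo_cputime kind, uint64_t delta_ns)
{
	assert(kind < DEMO_NR);
	cg->time[kind] += delta_ns;
}

/* What the quoted hunk does: the combined delta lands in one bucket. */
static void demo_account_lumped(struct demo_cgroup_stat *cg, uint64_t delta_ns)
{
	cg->time[DEMO_SOFTIRQ] += delta_ns;
}
```

With the split variant, 100ns of hardirq and 4300ns of softirq stay in
separate counters; with the lumped variant the hardirq counter stays at
zero and the softirq counter reads 4400ns.
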
But the much bigger question is -- how can you be sure that this
interrupt is in fact for the cgroup you're attributing it to? Could be
for an entirely different cgroup.
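
The attribution concern can be illustrated with another small userspace
sketch. It is a toy model under a loud assumption: each event carries an
owner_cgroup field naming the cgroup whose traffic caused the interrupt
work, which is exactly the information the accounting path does not have.
All names here are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_NR_CGROUPS 2

/* Hypothetical record of irq/softirq work observed on one CPU. */
struct demo_irq_event {
	int running_cgroup;  /* cgroup of the interrupted task */
	int owner_cgroup;    /* assumed known here; unknown in practice */
	uint64_t delta_ns;
};

/* Charge every event to the interrupted task's cgroup. */
static void demo_charge_current(const struct demo_irq_event *ev, int n,
				uint64_t charged[DEMO_NR_CGROUPS])
{
	for (int i = 0; i < n; i++) {
		assert(ev[i].running_cgroup >= 0 &&
		       ev[i].running_cgroup < DEMO_NR_CGROUPS);
		charged[ev[i].running_cgroup] += ev[i].delta_ns;
	}
}

/* Sum the time that landed on a cgroup other than the one that caused it. */
static uint64_t demo_misattributed(const struct demo_irq_event *ev, int n)
{
	uint64_t wrong = 0;

	for (int i = 0; i < n; i++)
		if (ev[i].running_cgroup != ev[i].owner_cgroup)
			wrong += ev[i].delta_ns;
	return wrong;
}
```

For events {0, 0, 300}, {0, 1, 700} and {1, 1, 200}, cgroup 0 is charged
1000ns even though 700ns of that work was done on behalf of cgroup 1.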