linux-kernel - Re: [PATCH v8 4/4] sched: Fix cgroup irq time for CONFIG_IRQ_TIME

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <z2s55zx724rsytuyppikxxnqrxt23ojzoovdpkrk3yc4nwqmc7@of7dq2vj7oi3>
Date: Thu, 9 Jan 2025 11:46:58 +0100
From: Michal Koutný <mkoutny@...e.com>
To: Yafang Shao <laoar.shao@...il.com>
Cc: peterz@...radead.org, mingo@...hat.com, hannes@...xchg.org, 
	juri.lelli@...hat.com, vincent.guittot@...aro.org, dietmar.eggemann@....com, 
	rostedt@...dmis.org, bsegall@...gle.com, mgorman@...e.de, vschneid@...hat.com, 
	surenb@...gle.com, linux-kernel@...r.kernel.org, cgroups@...r.kernel.org, 
	lkp@...el.com
Subject: Re: [PATCH v8 4/4] sched: Fix cgroup irq time for
 CONFIG_IRQ_TIME_ACCOUNTING

Hello Yafang.

I consider the runtimization you did in the first three patches
sensible, however, this fourth patch is a hard sell.

On Fri, Jan 03, 2025 at 10:24:09AM +0800, Yafang Shao <laoar.shao@...il.com> wrote:
> However, despite adding more threads to handle an increased workload,
> the CPU usage could not be raised. 

(Is that behavior same in both CONFIG_IRQ_TIME_ACCOUNTING and
!CONFIG_IRQ_TIME_ACCOUNTING cases?)

> In other words, even though the container’s CPU usage appeared low, it
> was unable to process more workloads to utilize additional CPU
> resources, which caused issues.

Hm, I think this would be worth documenting in the context of
CONFIG_IRQ_TIME_ACCOUNTING and irq.pressure.

> The CPU usage of the cgroup is relatively low at around 55%, but this usage
> doesn't increase, even with more netperf tasks. The reason is that CPU0 is
> at 100% utilization, as confirmed by mpstat:
> 
>   02:56:22 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
>   02:56:23 PM    0    0.99    0.00   55.45    0.00    0.99   42.57    0.00    0.00    0.00    0.00
> 
>   02:56:23 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
>   02:56:24 PM    0    2.00    0.00   55.00    0.00    0.00   43.00    0.00    0.00    0.00    0.00
> 
> It is clear that the %soft is excluded in the cgroup of the interrupted
> task. This behavior is unexpected. We should include IRQ time in the
> cgroup to reflect the pressure the group is under.

What is irq.pressure shown in this case?

> The system metric in cpuacct.stat is crucial in indicating whether a
> container is under heavy system pressure, including IRQ/softirq activity.
> Hence, IRQ/softirq time should be included in the cpuacct system usage,
> which also applies to cgroup2’s rstat.

But this only works for you where cgroup's workload induces IRQ on
itself but generally it'd be confusing (showing some sys time that
originates out of the cgroup). irq.pressure covers this universally (or
it should).

On Fri, Jan 03, 2025 at 10:24:05AM +0800, Yafang Shao <laoar.shao@...il.com> wrote:
> The load balancer is malfunctioning due to the exclusion of IRQ time from
> CPU utilization calculations. What's worse, there is no effective way to
> add the irq time back into the CPU utilization based on current
> available metrics. Therefore, we have to change the kernel code.

That's IMO what irq.pressure (PSI) should be useful for. Adjusting
cgroup's CPU usage with irq.pressue (perhaps not as simple as
multiplication, Johannes may step in here) should tell you info for load
balancer.

Regards,
Michal

Download attachment "signature.asc" of type "application/pgp-signature" (229 bytes)