Message-ID: <CALOAHbC54QZ6ZrRBHHKKz8F79C1J8fYcA1q59iwotuBBKtFGmA@mail.gmail.com>
Date: Sun, 17 Nov 2024 10:56:21 +0800
From: Yafang Shao <laoar.shao@...il.com>
To: Michal Koutný <mkoutny@...e.com>
Cc: mingo@...hat.com, peterz@...radead.org, juri.lelli@...hat.com,
vincent.guittot@...aro.org, dietmar.eggemann@....com, rostedt@...dmis.org,
bsegall@...gle.com, mgorman@...e.de, vschneid@...hat.com, hannes@...xchg.org,
surenb@...gle.com, cgroups@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v5 0/4] sched: Fix missing irq time when
CONFIG_IRQ_TIME_ACCOUNTING is enabled
On Fri, Nov 15, 2024 at 9:41 PM Michal Koutný <mkoutny@...e.com> wrote:
>
> Hello Yafang.
>
> On Fri, Nov 08, 2024 at 09:29:00PM GMT, Yafang Shao <laoar.shao@...il.com> wrote:
> > After enabling CONFIG_IRQ_TIME_ACCOUNTING to track IRQ pressure in our
> > container environment, we encountered several user-visible behavioral
> > changes:
> >
> > - Interrupted IRQ/softirq time is excluded in the cpuacct cgroup
> >
> > This breaks userspace applications that rely on CPU usage data from
> > cgroups to monitor CPU pressure. This patchset resolves the issue by
> > ensuring that IRQ/softirq time is included in the cgroup of the
> > interrupted tasks.
> >
> > - getrusage(2) does not include time interrupted by IRQ/softirq
> >
> > Some services use getrusage(2) to check if workloads are experiencing CPU
> > pressure. Since IRQ/softirq time is no longer included in task runtime,
> > getrusage(2) can no longer reflect the CPU pressure caused by heavy
> > interrupts.
>
> I understand that IRQ/softirq time is difficult to attribute to an
> "accountable" entity and it's technically simplest to attribute it
> everyone/noone, i.e. to root cgroup (or through a global stat w/out
> cgroups).
This issue is not about deciding which entity an IRQ/softirq event should be
charged to. Rather, it is about reflecting the interrupted runtime of a task
or a cgroup. I might be misunderstanding the distinction between *charge* and
*account* (or perhaps there is no difference), but PATCH #4 captures exactly
what I mean: while IRQ/softirq time should not be charged to the interrupted
task, it is crucial to have a metric that reflects this interrupted runtime
in CPU utilization.
The purpose of this patchset is to address this issue, conceptually
represented as:
|<----Runtime---->|<----Interrupted time---->|<----Runtime---->|<---Sleep-->|
Without reflecting the *interrupted time* in CPU utilization, a gap—or
hole—is created:
|<----Runtime---->|<----HOLE---->|<----Runtime---->|<---Sleep-->|
This gap will misleadingly appear as sleep time to the user:
|<----Runtime---->|<----Sleep---->|<----Runtime---->|<---Sleep-->|
As a result, users may interpret this gap as underutilized CPU time and
deploy more workload to raise CPU utilization. That effort is futile: the
observed runtime cannot increase, because the interrupted time is simply not
reported anywhere.
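To make the effect concrete, here is a toy sketch (not kernel code; the
numbers and the helper function are purely illustrative) of how excluding
interrupted time makes a busy window look underutilized:

```python
# Toy illustration: how excluding interrupted time from a task's accounted
# runtime makes the CPU appear underutilized. All values are hypothetical
# microsecond counts over a 1-second observation window.

def utilization(runtime_us, interrupted_us, wall_us, include_irq):
    """Fraction of wall time the task appears to occupy the CPU."""
    busy = runtime_us + (interrupted_us if include_irq else 0)
    return busy / wall_us

wall = 1_000_000        # 1 s observation window
runtime = 600_000       # time the task actually ran
interrupted = 300_000   # time stolen by IRQ/softirq while the task was current

print(utilization(runtime, interrupted, wall, include_irq=False))  # 0.6
print(utilization(runtime, interrupted, wall, include_irq=True))   # 0.9
```

With the metric excluded, the same window reads as 60% busy instead of 90%,
and the missing 30% looks like sleep.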
>
> > This patchset addresses the first issue, which is relatively
> > straightforward. Once this solution is accepted, I will address the second
> > issue in a follow-up patchset.
>
> Is the first issue about cpuacct data or irq.pressure?
The data in question is from cpu.stat. Below is the cpu.stat file for cgroup2:
$ cat cpu.stat
usage_usec 0
user_usec 0
system_usec 0                 <<<< We should reflect the interrupted time here.
core_sched.force_idle_usec 0
nr_periods 0
nr_throttled 0
throttled_usec 0
nr_bursts 0
burst_usec 0
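For reference, a minimal sketch of how a monitoring tool might parse this
file (the parser is illustrative; only the keys shown in the listing above
are real cpu.stat fields):

```python
# Minimal sketch: parsing cgroup v2 cpu.stat (space-separated key/value
# pairs, one per line) into a dict of integer microsecond counters.

def parse_cpu_stat(text):
    stats = {}
    for line in text.splitlines():
        key, _, value = line.partition(" ")
        if value:
            stats[key] = int(value)
    return stats

sample = "usage_usec 1200\nuser_usec 800\nsystem_usec 400\n"
print(parse_cpu_stat(sample)["usage_usec"])  # 1200
```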
>
> It sounds kind of both and I noticed the docs for irq.pressure is
> lacking in Documentation/accounting/psi.rst. When you're touching this,
> could you please add a paragraph or sentence explaining what does this
> value represent?
I believe I have explained this clearly in the comments above.
However, if anything remains unclear, please feel free to ask for
further clarification.
>
> (Also, there is same change both for cpuacct and
> cgroup_base_stat_cputime_show(), right?)
>
> > ----------------
> > | Load Balancer|
> > ----------------
> > / | | \
> > / | | \
> > Server1 Server2 Server3 ... ServerN
> >
> > Although the load balancer's algorithm is complex, it follows some core
> > principles:
> >
> > - When server CPU utilization increases, it adds more servers and deploys
> > additional instances to meet SLA requirements.
> > - When server CPU utilization decreases, it scales down by decommissioning
> > servers and reducing the number of instances to save on costs.
>
> A server here references to a whole node (whole kernel) or to a cgroup
> (i.e. more servers on top of one kernel)?
It is, in fact, a cgroup. These cgroups may be deployed across
different servers.
>
> > The load balancer is malfunctioning due to the exclusion of IRQ time from
> > CPU utilization calculations.
>
> Could this be fixed by subtracting (global) IRQ time from (presumed
> total) system capacity that the balancer uses for its decisions? (i.e.
> without exact per-cgroup breakdown of IRQ time)
The issue here is that the global IRQ time may include interrupted time from
tasks outside the target cgroup. As a result, I don't believe a reliable
solution is possible without modifying the kernel.
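As a toy illustration (hypothetical numbers, not kernel code) of why the
global figure is not enough:

```python
# Toy sketch, in microseconds: why subtracting the *global* IRQ time cannot
# recover one cgroup's interrupted time. The global figure mixes interrupted
# time from every cgroup on the machine.

irq_per_cgroup = {"target": 300_000, "other": 700_000}  # true breakdown, invisible to userspace
global_irq = sum(irq_per_cgroup.values())               # the only number userspace can see

target_runtime = 600_000

# Naive correction: attribute all global IRQ time to the target cgroup.
naive_total = target_runtime + global_irq               # overshoots by "other"'s share
true_total = target_runtime + irq_per_cgroup["target"]  # what a per-cgroup metric would give

print(naive_total, true_total)  # 1600000 900000
```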
--
Regards
Yafang