linux-kernel - Re: [PATCH v1 1/4] cpuidle: teo: Add polling flag check to early return path

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <c976eae7-56f8-4b7b-821a-1ec4291b21dd@arm.com>
Date: Fri, 10 Jan 2025 14:52:09 +0000
From: Christian Loehle <christian.loehle@....com>
To: "Rafael J. Wysocki" <rafael@...nel.org>
Cc: "Rafael J. Wysocki" <rjw@...ysocki.net>,
 Linux PM <linux-pm@...r.kernel.org>,
 Aboorva Devarajan <aboorvad@...ux.ibm.com>,
 LKML <linux-kernel@...r.kernel.org>,
 Daniel Lezcano <daniel.lezcano@...aro.org>,
 Artem Bityutskiy <artem.bityutskiy@...ux.intel.com>
Subject: Re: [PATCH v1 1/4] cpuidle: teo: Add polling flag check to early
 return path

On 1/10/25 13:34, Rafael J. Wysocki wrote:
> On Fri, Jan 10, 2025 at 2:16 PM Christian Loehle
> <christian.loehle@....com> wrote:
>>
>> On 1/10/25 12:53, Rafael J. Wysocki wrote:
>>> From: Rafael J. Wysocki <rafael.j.wysocki@...el.com>
>>>
>>> After commit 6da8f9ba5a87 ("cpuidle: teo: Skip tick_nohz_get_sleep_length()
>>> call in some cases") the teo governor behaves a bit differently on
>>> systems where idle state 0 is a "polling" state (that is, it is not
>>> really an idle state, but a loop continuously executed by the CPU).
>>> Namely, on such systems it skips the tick_nohz_get_sleep_length() call
>>> if the target residency of the current candidate idle state is small
>>> enough.
>>>
>>> However, if state 0 itself was to be returned, it would be returned
>>> right away without calling tick_nohz_get_sleep_length() even on systems
>>> where it was not a "polling" state until commit 4b20b07ce72f ("cpuidle:
>>> teo: Don't count non-existent intercepts") that attempted to fix this
>>> problem.
>>>
>>> Unfortunately, commit 4b20b07ce72f has made the governor always call
>>> tick_nohz_get_sleep_length() when about to return state 0 early, even
>>> if that state is a "polling" one, which is inconsistent and defeats
>>> the purpose of commit 6da8f9ba5a87 in that case.
>>>
>>> Address this by adding a CPUIDLE_FLAG_POLLING check to the path where
>>> state 0 is returned early to prevent tick_nohz_get_sleep_length() from
>>> being called if it is a "polling" state.
>>>
>>> Fixes: 4b20b07ce72f ("cpuidle: teo: Don't count non-existent intercepts")
>>> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@...el.com>
>>> ---
>>>  drivers/cpuidle/governors/teo.c |    3 ++-
>>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>>
>>> --- a/drivers/cpuidle/governors/teo.c
>>> +++ b/drivers/cpuidle/governors/teo.c
>>> @@ -422,7 +422,8 @@
>>>                       first_suitable_idx = i;
>>>               }
>>>       }
>>> -     if (!idx && prev_intercept_idx) {
>>> +     if (!idx && prev_intercept_idx &&
>>> +         !(drv->states[0].flags & CPUIDLE_FLAG_POLLING)) {
>>>               /*
>>>                * We have to query the sleep length here otherwise we don't
>>>                * know after wakeup if our guess was correct.
>>>
>>>
>>>
>>
>> But then you do run into the issue of intercepts not being detected if
>> state0 is the right choice, don't you?
> 
> That's true, but then on systems with a "polling" state 0 you still
> have this problem if the state returned early is not state 0.  Say C1
> on x86.> 
> The point here is that the behavior needs to be consistent, one way or another.

Yes, gotcha. Why not be consistent 'in the other way' then?

> 
> The exact point of commit 6da8f9ba5a87 was to avoid calling
> tick_nohz_get_sleep_length() in some cases when the state to be
> returned is shallow enough and obviously that includes a "polling"
> state 0, possibly at the cost of being somewhat inaccurate in
> prediction.

Somewhat inaccurate meaning not making any prediction?
cpu_data->sleep_length_ns = KTIME_MAX;

How much is the harm for calling tick_nohz_get_sleep_length() when
polling anyway?
I know tick_nohz_get_sleep_length() is the majority of the usual
cpuidle entry path, but for many scenarios where state0 is appropriate
that should be pretty fast, no?

> 
> Then you're seeing this intercept accumulation for state 0 when there
> are only 2 states in the table (or all of the other states are much
> higher target residency than state 0).
> 
> Commit 4b20b07ce72f effectively caused tick_nohz_get_sleep_length() to
> be called every time on systems without a "polling" state 0, which was
> fair enough, but it also affected the other systems, which wasn't.
> 
>> This would then enable intercept-detection only for <50% of the time,
>> another option is to not allow intercepts selecting a polling state, but
>> there were recent complaints about this exact behavior from Aboorva (+TO).
>> They don't have a low-latency non-polling state.
>>
>> https://lore.kernel.org/lkml/20240809073120.250974-1-aboorvad@linux.ibm.com/
> 
> If they don't have a "polling" state 0, they won't be affected by this
> patch and after commit 4b20b07ce72f, they'll always call
> tick_nohz_get_sleep_length(), so the current governor behavior is
> generally unsuitable for them.

They do though.
commit 5ddcc03a07ae ("powerpc/cpuidle: Set CPUIDLE_FLAG_POLLING for snooze state")
So they have a polling 'snooze' and a relatively high latency (hundreds usecs)
non-polling state and no deeper state.
So if they don't query sleep length on snooze on a (1us)-interrupt-wakeup heavy
workload they will get 50% state0 and 50% state1 (because intercepts recovered
due to not querying sleep length).

> 
> I have an idea how to change it to be more accurate in prediction, but
> we'll see how it goes.  Stay tuned.

I will.

Regards,
Christian