linux-kernel - Re: [PATCH] cpuidle: riscv-sbi: Stop using non-retentive suspend

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <57114829-c205-dece-abdb-1100947286d4@sholland.org>
Date:   Tue, 22 Nov 2022 23:11:23 -0600
From:   Samuel Holland <samuel@...lland.org>
To:     Palmer Dabbelt <palmer@...osinc.com>, conor.dooley@...rochip.com
Cc:     apatel@...tanamicro.com, anup@...infault.org,
        Conor Dooley <conor@...nel.org>, rafael@...nel.org,
        daniel.lezcano@...aro.org,
        Paul Walmsley <paul.walmsley@...ive.com>,
        aou@...s.berkeley.edu, linux-pm@...r.kernel.org,
        linux-riscv@...ts.infradead.org, linux-kernel@...r.kernel.org,
        linux@...osinc.com
Subject: Re: [PATCH] cpuidle: riscv-sbi: Stop using non-retentive suspend

Hi Palmer,

On 11/22/22 09:28, Palmer Dabbelt wrote:
> On Tue, 22 Nov 2022 03:19:43 PST (-0800), conor.dooley@...rochip.com wrote:
>> On Tue, Nov 22, 2022 at 11:06:15AM +0530, Anup Patel wrote:
>>> On Tue, Nov 22, 2022 at 10:46 AM Palmer Dabbelt <palmer@...osinc.com>
>>> wrote:
>>> >
>>> > On Mon, 21 Nov 2022 19:45:07 PST (-0800), anup@...infault.org wrote:
>>> > > On Tue, Nov 22, 2022 at 2:27 AM Palmer Dabbelt
>>> <palmer@...osinc.com> wrote:
>>> > >>
>>> > >> From: Palmer Dabbelt <palmer@...osinc.com>
>>> > >>
>>> > >> As per [1], whether or not the core can wake up from non-retentive
>>> > >> suspend is a platform-specific detail.  We don't have any way to
>>> encode
>>> > >> that, so just stop using them until we've sorted that out.
>>> > >>
>>> > >> Link:
>>> https://github.com/riscv-non-isa/riscv-sbi-doc/issues/98#issuecomment-1288564687
>>> > >> Fixes: 6abf32f1d9c5 ("cpuidle: Add RISC-V SBI CPU idle driver")
>>> > >> Signed-off-by: Palmer Dabbelt <palmer@...osinc.com>
>>> > >
>>> > > This is just unnecessary maintenance churn and it's not the
>>> > > right way to go. Better to fix this the right way instead of having
>>> > > a temporary fix.
>>> > >
>>> > > I had already sent-out a patch series 5 months back to describe
>>> > > this in DT:
>>> > >
>>> https://lore.kernel.org/lkml/20220727114302.302201-1-apatel@ventanamicro.com/
>>> > >
>>> > > No one has commented/suggested anything (except Samuel
>>> > > Holland and Sudeep Holla).
>>> >
>>> > I see some comments from Krzysztof here
>>> >
>>> <https://lore.kernel.org/lkml/7a0477a0-9f0f-87d6-4070-30321745f4cc@linaro.org/>
>>> > as well.  Looks like everyone is pointing out that having our CPU
>>> nodes
>>> > encode timers is a bad idea, my guess is that they're probably right.
>>>
>>> Adding a separate timer DT node, creates a new set of compatibility
>>> issues for existing platforms. I am fine updating my series to have
>>> separate timer DT node but do we want to go in this direction ?
>>
>> I don't really follow. How is there a compatibility issue created by
>> adding a new node that is not added for a new property? Both will
>> require changes to the device tree. (You need not reply here, I am going
>> to review the other thread, it's been on my todo list for too long. Been
>> caught up with non-coherent stuff & our sw release cycle..)
>>
>>> Even if ARM has a separate timer DT node, the timers are still part
>>> of the CPU. It depends on how we see the DT bindings aligning with
>>> actual HW.
>>>
>>> >
>>> > > Please review this series. I can quickly address comments to
>>> > > make this available for Linux-6.2. Until this series is merged,
>>> > > the affected platforms can simply remove non-retentive suspend
>>> > > states from their DT.
>>> >
>>> > That leaves us with a dependency between kernel versions and DT
>>> > bindings: kernels with the current driver will result in broken
>>> systems
>>> > with the non-retentive suspend states in the DT they boot with when
>>> > those states can't wake up the CPU.
>>
>> Can someone point me at a (non D1 or virt) system that has suspend
>> states in the DT that would need fixing?
>>
>>> This is not a new problem we are facing. Even in the ARM world,
>>> the DT bindings grew organically over time based on newer platform
>>> requirements.
>>>
>>> Now that we have a platform which does not want the time
>>> C3STOP feature, we need to first come-up with DT bindings
>>> to support this platform instead of temporarily disabling
>>> features which don't work on this platform.
>>
>> It's the opposite surely? It should be "now that we have a platform that
>> *does want* the C3STOP feature", right?
> 
> IMO we just shouldn't be turning on C3STOP at all.  Sure it's making the
> problem here go away, but it's trying to emulate a bunch of Intel timer
> features we don't have on RISC-V and ending up in a bunch of fallback
> paths.

While the comment in include/linux/clockchips.h and the name of the flag
imply that C3STOP is Intel-specific, it really isn't. The flag is used
on ARM, MIPS, and PowerPC as well.

However we do it, the end goal here is making tick_broadcast_enter()
return nonzero (when called from inside cpuidle_enter_state()), thus
inhibiting Linux from using idle states with the "local-timer-stop"
property set in the DT.

Looking down inside tick_broadcast_oneshot_control(), it appears
CLOCK_EVT_FEAT_C3STOP really is required to make this happen.

What are you referring to by "fallback paths"?

We could add another CLOCK_EVT_FEAT_??? flag, but how should it differ
from CLOCK_EVT_FEAT_C3STOP?

> If we need some workaround in the timer subsystem to sort out the
> non-retentive states then we can sort that out, but my guess is that
> vendors are just going to 3

(your message got cut off here)

>>> > > With all due respect, NACK to this patch from my side.
>>
>> As Samuel pointed out that the D1 doesn't actually use the timer in
>> question, I think we are okay here?
> 
> IIUC it just should use the sunxi timer driver, but that requires some
> priority changes (and presumably breaks

(and here too)

D1 currently uses sun4i_tick (rating 350) over riscv_timer_clockevent
(rating 100). The ratings are fine as is.

> That said, I guess I'm confused about what's actually broken here.  My
> understanding of the problem is: The D1's firmware disables some
> interrupts during non-retentive suspend, which results in SBI timers
> failing to wake up the core.

The D1's hardware cannot deliver the RISC-V architectural timer
interrupt while the C906 core is powered off. It cannot deliver the
RISC-V architectural external interrupt either, but the SoC provides a
side channel for all of the PLIC inputs, so they _can_ wake up the CPU.
So the net result is that the CLINT is the only peripheral unable to
wake the CPU.

> The patch to add C3STOP makes the core
> come back from sleep, but that results in a bunch of other timer-related
> issues.

No, C3STOP _inhibits_ Linux from entering idle states that both:
 1) rely on that clockevent device to wake the local CPU, and
 2) cause that clockevent device to stop working, as signified by the
    CPUIDLE_FLAG_TIMER_STOP flag, which is set by the local-timer-stop
    property in the idle state DT node.

That means entering the idle state is allowed if Linux can solve that
first condition by finding another clockevent device on some other CPU
to wake the local CPU with an IPI. That is the purpose of the broadcast
timer mechanism.

In the case of the D1, since it is single core, there is no other CPU to
broadcast a timer event. So if riscv_timer_clockevent is the only
clockevent device, then tick_broadcast_enter() should return nonzero,
and find_deepest_state() will pick a retentive idle state instead.

This is the already-existing mechanism for only entering idle states we
can reliably wake up from. :)

> So IMO we should revert the C3STOP patch as it's causing regressions

I agree that clearly something is going wrong in the Linux code to cause
problems on SMP systems like PolarFire. I do not really understand the
broadcast timer internals, but what I do know is:
 1) By setting CLOCK_EVT_FEAT_C3STOP, we inhibit the RISC-V timer from
    being used as the broadcast timer (tick_check_broadcast_device()),
    and IIUC fall back to kernel/time/tick-broadcast-hrtimer.c. Maybe
    something goes wrong there?
 2) The broadcast timer wakes up CPUs via IPIs. Maybe something goes
    wrong with IPI delivery? (This raises the question of if we need
    another DT property for receiving IPIs in non-retentive suspend.)

But neither of these should affect behavior when not using broadcast timers.

> (ie, old workloads break in order to make new ones work).  Seems like
> folks mostly agree on that one?

Well, for the D1 specifically, there is no new workload that the C3STOP
patch makes work. Non-retentive suspend worked there already, as I have
explained. The patch was always about being compliant to the spec.

> I also think we should stop entering non-retentive suspend until we can
> sort out how reliably wake up from it, as the SBI makes that a
> platform-specific detail.  If the answer there is "non-retentive suspend
> is fine on the D1 as long as we don't use the SBI timers" then that
> seems fine, we just need some way to describe that in Linux -- that
> doesn't fix other platforms and other interrupts, but at least it's a
> start.

We need some way to describe the situation from the SBI implementation
to Linux.

Non-retentive suspend is fine on the D1 as long as either one of these
conditions is met:
 1) we don't use the SBI timers, or
 2) the SBI timer implementation does not use the CLINT

And it is up to the SBI implementation which timer hardware it uses, so
the SBI implementation needs to patch this information in to the DT at
runtime.

Regards,
Samuel