Message-ID: <bec1e1f4-e188-415d-bbb2-1854b7e8cf1d@rivosinc.com>
Date: Fri, 14 Nov 2025 14:36:16 +0100
From: Clément Léger <cleger@...osinc.com>
To: 张展鹏 <zhangzhanpeng.jasper@...edance.com>
Cc: Paul Walmsley <paul.walmsley@...ive.com>,
 Palmer Dabbelt <palmer@...belt.com>,
 "linux-riscv@...ts.infradead.org" <linux-riscv@...ts.infradead.org>,
 "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
 "linux-arm-kernel@...ts.infradead.org"
 <linux-arm-kernel@...ts.infradead.org>,
 Himanshu Chauhan <hchauhan@...tanamicro.com>,
 Anup Patel <apatel@...tanamicro.com>, 路旭
 <luxu.kernel@...edance.com>, Atish Patra <atishp@...shpatra.org>,
 Björn Töpel <bjorn@...osinc.com>,
 崔运辉 <cuiyunhui@...edance.com>,
 元竹 <yuanzhu@...edance.com>
Subject: Re: How to Avoid Starving the Kernel When Using SSE



Hi Zhanpeng,

On 11/14/25 11:24, 张展鹏 wrote:
> Hi Clément,
> 
> Lately, I've been thinking about how to avoid starving the kernel when
> using SSE:
> SSE is driven by M-mode irqs, such as the M-mode PMU irq for perf sampling
> and the M-mode IPI for inter-hart injection, while the kernel is driven by
> S-mode irqs. The kernel can therefore starve under a flood of M-mode irqs,
> and when using SSE the kernel itself may cause such a flood, either
> deliberately or inadvertently:
> 
>    1.  Malicious SSE handler: the kernel may deliberately register a bad
>    SSE handler that triggers a new inter-hart SSE request via ecall,
>    causing an endless loop of SSE and rendering the kernel unresponsive.
>    In this case, the only thing the SBI can do is prevent SSE from
>    nesting in `sbi_trap_handler` and ensure that SSE events are executed
>    in priority order (see the sketch below).
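
For reference, a minimal sketch of the nesting guard described in the
quoted item, assuming a hypothetical per-hart flag inside OpenSBI (the
structure and helper below are illustrative, not existing OpenSBI code):

#include <stdbool.h>

/* Hypothetical per-hart SSE state; not an actual OpenSBI structure. */
struct sse_hart_state {
	bool event_in_flight;	/* set at injection, cleared at completion */
};

/* Refuse to nest: at most one SSE event in flight per hart. */
static bool sse_may_inject(struct sse_hart_state *state)
{
	if (state->event_in_flight)
		return false;

	state->event_in_flight = true;
	return true;
}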

That seems quite convoluted. Anyone who can load a module can do worse
than crashing the kernel :)

>    2.  Perf sampling: the kernel may inadvertently choose a bad parameter
>    for perf, causing PMU irqs to fire too frequently. Back-to-back PMU
>    irqs leave the system with no time to respond to S-mode irqs.

This concern, however, is valid!

> 
>  Hence, I think we should improve the SSE framework so that the kernel
> cannot be starved so easily.
> 
> Here is a case study of perf sampling:
> When using PMU-SSE for perf sampling, the kernel may hang and become
> unresponsive due to the PMU-SSE loop. Once perf sampling via PMU-SSE
> starts, the kernel may fail to respond to `Ctrl+C` or fail to exit after
> `sleep 1` completes (these are the two most commonly used time-based
> sampling methods in perf).
> 
> By default, perf uses a relatively high sampling frequency, namely
> `perf_event_max_sample_rate`, and will adjust it on demand if sampling
> takes too much time. If this frequency/period goes beyond what the system
> can handle, SSE events run back-to-back and the system gets stuck in an
> endless loop of "SSE → PMU interrupt → SSE". The kernel is then starved
> (if you print the `sepc` at each SSE completion, you will find it never
> changes, indicating that the kernel is stuck), and it can never escape
> this PMU-SSE loop, because it can neither respond to Ctrl+C interrupts
> nor adjust the sampling frequency.
> 
> Current solution: the key to this problem is that every time
> `sse_complete` finishes, a new PMU irq is already pending. We then resume
> kernel execution via mret, and the system immediately traps back into
> SSE.
> 
> The PMU-SSE-perf processing flow includes the following steps:
> `sse_inject` (mret to the SSE handler), `pmu_stop` (clear the PMU pending
> bit), `pmu_start` (set a new value for the PMU counter), and
> `sse_complete` (resume execution at the point where the kernel was
> interrupted). The kernel traps right after `sse_complete` because a new
> PMU irq is generated between `pmu_start` and `sse_complete`, as the
> sketch below illustrates.
> 
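A minimal sketch of the quoted flow with the race window marked; the
function names simply mirror the steps above and are not an actual
kernel or SBI API:

/* Stand-ins for the four steps named in the quoted paragraph. */
void pmu_stop(void);		/* clear the PMU pending bit          */
void pmu_start(void);		/* reprogram the overflowed counter   */
void do_perf_sample(void);	/* perf sampling work                 */
void sse_complete(void);	/* resume the interrupted kernel code */

void pmu_sse_flow(void)
{
	pmu_stop();
	do_perf_sample();
	pmu_start();	/* from here on, a new overflow can fire...   */
	sse_complete();	/* ...so if one fires in this window, the
			 * mret lands on a pending M-mode irq and the
			 * hart traps straight back into SSE, with
			 * sepc never advancing. */
}
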
> In order to address this issue, we propose delaying the restart of the
> overflowed PMU counter during PMU-SSE. When the kernel issues an ecall to
> restart the overflowed PMU counters, the SBI can check whether this is
> SSE-driven PMU handling. If so, it temporarily modifies the mhpmevent CSR
> to stop counting kernel events. In this process, M-mode events are always
> inhibited, and no U-mode code executes during `pmu_sbi_ovf_handler`, so
> only the counting of kernel (S-mode) events needs to be inhibited, as in
> the sketch below.
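
A minimal sketch of the quoted proposal, assuming the Sscofpmf
mhpmevent bit layout (OF=63, MINH=62, SINH=61); csr_read_num() and
csr_write_num() are real OpenSBI helpers, but the function itself is
illustrative only:

#include <stdint.h>

#define MHPMEVENT_SINH	(1ULL << 61)	/* inhibit S-mode counting */

/*
 * Illustrative only: set SINH on the overflowed counter's event
 * selector so kernel (S-mode) cycles stop being counted until the
 * saved value is restored around sse_complete.
 */
static void pmu_inhibit_smode(unsigned int ctr_idx, uint64_t *saved)
{
	uint64_t evt = csr_read_num(CSR_MHPMEVENT3 + ctr_idx);

	*saved = evt;
	csr_write_num(CSR_MHPMEVENT3 + ctr_idx, evt | MHPMEVENT_SINH);
}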

I'd rather let the kernel control PMU SSE event delivery by masking the
event at the end of the SSE handler and re-enabling it later, along the
lines of the sketch below. Additionally, since that solution lives in the
SBI itself, there is no guarantee that every SBI implementation will
actually get it right.
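
A rough sketch of that alternative; sse_event_mask()/sse_event_unmask(),
handle_pmu_overflow() and the event ID are hypothetical names, while
irq_work is the real kernel mechanism used to defer the unmask:

#include <linux/irq_work.h>

/* Hypothetical helpers, not an existing kernel API. */
extern void handle_pmu_overflow(void);
extern void sse_event_mask(unsigned int evt);
extern void sse_event_unmask(unsigned int evt);
#define SSE_EVENT_LOCAL_PMU	0x10000	/* hypothetical event ID */

static void pmu_sse_unmask(struct irq_work *work)
{
	/* Runs from regular kernel context, i.e. only once the kernel
	 * has had a chance to execute again. */
	sse_event_unmask(SSE_EVENT_LOCAL_PMU);
}

static struct irq_work pmu_sse_unmask_work = IRQ_WORK_INIT(pmu_sse_unmask);

static void pmu_sse_handler(void)
{
	handle_pmu_overflow();			/* perf sampling as usual */
	sse_event_mask(SSE_EVENT_LOCAL_PMU);	/* block re-delivery on
						 * this hart for now */
	irq_work_queue(&pmu_sse_unmask_work);	/* unmask later */
}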

What seems odd is that the perf_sample_event_took() call at the end of
each PMU event handler should already allow the perf subsystem to
throttle the rate; see the sketch below. I'll take another look at that
part to make sure it works as expected and that we aren't missing any
bits.
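
For context, the pattern looks like what x86's NMI handler does: time
the overflow handler and report it via perf_sample_event_took(), the
real kernel hook that lets the perf core lower the sampling rate when
handlers take too long. A sketch of the riscv side, with the actual
handler body elided:

#include <linux/interrupt.h>
#include <linux/perf_event.h>
#include <linux/sched/clock.h>

static irqreturn_t pmu_sbi_ovf_handler(int irq, void *dev)
{
	u64 start_clock = sched_clock();

	/* ... existing overflow handling / perf_event_overflow() ... */

	/* Tell the perf core how long sampling took so it can lower
	 * the effective perf_event_max_sample_rate when we are slow. */
	perf_sample_event_took(sched_clock() - start_clock);

	return IRQ_HANDLED;
}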

Thanks,

Clément

> 
> In this way, we can ensure that `pmu_sbi_ovf_handler` will not be
> re-entered by a new PMU-SSE event, while minimizing changes to the perf
> logic. The price is that we give up sampling a small portion of kernel
> code (from `pmu_ctr_start` to the end of `pmu_sbi_ovf_handler`), and we
> probably need a new parameter in `pmu_ctr_start`.
> 
> Looking forward to your suggestions. Thanks!
> 
> Best regards,
> Zhanpeng Zhang
> 

