Message-ID: <4a24df59-a209-4f2f-9465-f98482e83952@rivosinc.com>
Date: Fri, 14 Nov 2025 16:41:28 +0100
From: Clément Léger <cleger@...osinc.com>
To: 张展鹏 <zhangzhanpeng.jasper@...edance.com>
Cc: Paul Walmsley <paul.walmsley@...ive.com>,
 Palmer Dabbelt <palmer@...belt.com>,
 "linux-riscv@...ts.infradead.org" <linux-riscv@...ts.infradead.org>,
 "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
 "linux-arm-kernel@...ts.infradead.org"
 <linux-arm-kernel@...ts.infradead.org>,
 Himanshu Chauhan <hchauhan@...tanamicro.com>,
 Anup Patel <apatel@...tanamicro.com>, 路旭
 <luxu.kernel@...edance.com>, Atish Patra <atishp@...shpatra.org>,
 Björn Töpel <bjorn@...osinc.com>,
 崔运辉 <cuiyunhui@...edance.com>,
 元竹 <yuanzhu@...edance.com>
Subject: Re: How to Avoid Starving the Kernel When Using SSE



On 11/14/25 14:36, Clément Léger wrote:
> 
> 
> Hi Zhanpeng,
> 
> On 11/14/25 11:24, 张展鹏 wrote:
>> Hi Clément,
>>
>> Lately, I've been thinking about how to avoid starving the kernel when
>> using SSE:
>> SSE is driven by M-mode irqs, such as the M-mode PMU irq for perf sampling
>> and the M-mode IPI for inter-hart injection. The kernel, on the other hand,
>> is driven by S-mode irqs, so it can be starved by a flood of M-mode irqs,
>> and the kernel may cause such flooding of M-mode irqs when using SSE,
>> either deliberately or inadvertently:
>>
>>    1.  Malicious SSE handler: The kernel may deliberately register a bad
>>    SSE handler that triggers a new inter-hart SSE request via ecall. This
>>    causes an endless loop of SSE, rendering the kernel unresponsive. In
>>    this case, the only thing the SBI can do is prevent SSE nesting in
>>    `sbi_trap_handler` and ensure that SSE events are executed in priority
>>    order.
> 
> That seems quite convoluted. Anyone who can load a module can do worse
> than crashing the kernel :)
> 
>>    2.  Perf sampling: The kernel may inadvertently choose a bad parameter
>>    for perf, causing PMU irqs to occur too frequently. Continuous PMU irqs
>>    leave the system with no time to respond to S-mode irqs.
> 
> This concern, however, is valid!
> 
>>
>>  Hence, I think we should improve the SSE framework so that the kernel
>> cannot be starved so easily.
>>
>> Here is a case study of perf sampling:
>> When using PMU-SSE for perf sampling, the kernel may hang and become
>> unresponsive due to the PMU-SSE loop. Once we start processing perf
>> sampling via PMU-SSE, the kernel may fail to respond to `Ctrl+C` or fail
>> to exit once the `sleep 1` timer expires (these are the two most commonly
>> used time-based sampling methods in perf).
>>
>> By default, perf uses a relatively high sampling frequency, namely
>> `perf_event_max_sample_rate`, and will adjust it on demand if sampling
>> takes too much time. If this frequency/period goes beyond what the system
>> can handle, SSE events end up running back-to-back, and the system gets
>> stuck in an endless loop of "SSE → PMU interrupt → SSE". The kernel is
>> then starved (at this point, if you print the `sepc` at SSE completion,
>> you will find that it remains unchanged each time, indicating that the
>> kernel is stuck), and it can never escape from this PMU-SSE loop, because
>> it can neither respond to Ctrl+C interrupts nor adjust the sampling
>> frequency.
>>
>> Current solution: The key to this problem is that, by the time we finish
>> sse_complete, a new PMU irq is already pending. We then resume kernel
>> execution via mret, and the system immediately traps back into SSE.
>>
>> The PMU-SSE-Perf processing flow includes the following steps: `sse_inject`
>> (mret to the SSE handler), `pmu_stop` (clear the PMU pending bit),
>> `pmu_start` (set a new value for the PMU counter), and `sse_complete`
>> (resume execution at the point where the kernel was interrupted). The
>> kernel traps right after `sse_complete` because a new PMU irq is generated
>> between `pmu_start` and `sse_complete`.
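
For reference, a rough C sketch of that timeline (the stubs below simply
mirror the steps described above; they are illustrative placeholders, not
existing SBI or kernel APIs):

/* Illustrative stubs; the real steps live in the SBI and the SSE handler. */
static void sse_inject(void)   { /* M-mode: mret into the S-mode SSE handler */ }
static void pmu_stop(void)     { /* clear the PMU overflow/pending bit */ }
static void pmu_start(void)    { /* re-arm the counter with a new sample period */ }
static void sse_complete(void) { /* resume the interrupted S-mode context */ }

/*
 * One PMU-SSE sampling round. If pmu_start() programs a period shorter
 * than the time left in the handler, a new overflow is already pending
 * when sse_complete() runs, so the mret traps straight back into SSE
 * and sepc never advances.
 */
static void pmu_sse_round_sketch(void)
{
	sse_inject();
	pmu_stop();
	/* perf_event_overflow() runs here and records the sample */
	pmu_start();
	sse_complete();
}
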
>>
>> In order to address this issue, we propose delaying the restart of the
>> overflowed PMU counters during PMU-SSE. When the kernel issues the ecall
>> to restart the overflowed PMU counters, the SBI can check whether this is
>> SSE-driven PMU handling. If so, it temporarily modifies the mhpmevent CSR
>> to stop counting kernel events. In this process, M-mode events are always
>> inhibited, and no U-mode code runs during `pmu_sbi_ovf_handler`, so we
>> only need to inhibit the counting of kernel events.
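
If I read the proposal correctly, on the SBI side this would amount to
something like the minimal sketch below, assuming the Sscofpmf inhibit-bit
layout and OpenSBI-style csr_read_num()/csr_write_num() helpers from
<sbi/riscv_asm.h> and <sbi/riscv_encoding.h>; pmu_ctr_start_inhibit_smode()
is a hypothetical helper, not an existing OpenSBI function:

/* Sscofpmf bits in mhpmeventN (RV64 layout). */
#define MHPMEVENT_OF    (1ULL << 63)
#define MHPMEVENT_MINH  (1ULL << 62)
#define MHPMEVENT_SINH  (1ULL << 61)

/*
 * Hypothetical SBI-side step: when the counter restart comes from an
 * SSE-driven overflow handler, also set SINH so the counter does not
 * count S-mode (kernel) cycles until SSE completion clears it again.
 * ctr_idx is the offset from counter 3.
 */
static void pmu_ctr_start_inhibit_smode(int ctr_idx)
{
	unsigned long ev = csr_read_num(CSR_MHPMEVENT3 + ctr_idx);

	csr_write_num(CSR_MHPMEVENT3 + ctr_idx, ev | MHPMEVENT_SINH);
	/* ... then program mhpmcounterN and clear the pending OF bit ... */
}
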
> 
> I'd rather let the kernel control PMU SSE event delivery by masking it at
> the end of the SSE handler and re-enabling it later. Additionally, since
> that solution lives in the SBI itself, there is no guarantee that all SBI
> implementations will actually do it correctly.
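
Roughly, the kernel-side shape would be something like the sketch below;
handle_pmu_overflow() and sse_pmu_event_mask() are placeholders for whatever
the SSE series ends up exposing, not existing kernel APIs:

/* Placeholders only; not existing kernel APIs. */
static void handle_pmu_overflow(void) { /* rvpmu_ovf_handler() in the series */ }
static void sse_pmu_event_mask(void)  { /* stop PMU SSE delivery on this hart */ }

/*
 * Finish the overflow handling, then mask further PMU SSE injection
 * before completing the event; a later point (counter restart or
 * un-throttle) would unmask it, so a pending overflow cannot re-enter
 * SSE back-to-back.
 */
static void pmu_sse_event_handler_sketch(void)
{
	handle_pmu_overflow();
	sse_pmu_event_mask();
}
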
> 
> What seems odd is that the perf_sample_event_took() call at the end of the
> PMU event handler should already allow the perf subsystem to throttle the
> rate. I'll take another look at that part to make sure it works as
> expected and that we aren't missing anything.
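
For context, a simplified illustration of the kind of feedback
perf_sample_event_took() is meant to provide; this is not the kernel's
actual implementation, just the idea that when samples keep exceeding their
CPU-time budget, the maximum sample rate gets lowered:

static unsigned long max_sample_rate_hz = 100000;  /* cf. perf_event_max_sample_rate */
static unsigned long allowed_ns_per_sample = 5000; /* budget from the CPU-time limit */
static unsigned long avg_sample_ns;

static void sample_took_sketch(unsigned long sample_ns)
{
	/* crude running average of how long one sample takes */
	avg_sample_ns = (avg_sample_ns * 7 + sample_ns) / 8;

	/* throttle: halve the allowed sampling rate when over budget */
	if (avg_sample_ns > allowed_ns_per_sample && max_sample_rate_hz > 1000)
		max_sample_rate_hz /= 2;
}
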

Hey Zhanpeng,

Could you try applying this quick'n'dirty patch on top of the SSE series
and check whether it still hangs?

diff --git a/drivers/perf/riscv_pmu_dev.c b/drivers/perf/riscv_pmu_dev.c
index 7dec9c2afa9b..0fb8749c476f 100644
--- a/drivers/perf/riscv_pmu_dev.c
+++ b/drivers/perf/riscv_pmu_dev.c
@@ -1326,6 +1326,7 @@ static irqreturn_t rvpmu_ovf_handler(struct cpu_hw_events *cpu_hw_evt,
        int lidx, hidx, fidx;
        struct riscv_pmu *pmu;
        struct perf_event *event;
+       int ev_overflow = 0;
        u64 overflow;
        u64 overflowed_ctrs = 0;
        u64 start_clock = sched_clock();
@@ -1423,13 +1424,15 @@ static irqreturn_t rvpmu_ovf_handler(struct cpu_hw_events *cpu_hw_evt,
                         * TODO: We will need to stop the guest counters once
                         * virtualization support is added.
                         */
-                       perf_event_overflow(event, &data, regs);
+                       ev_overflow |= perf_event_overflow(event, &data, regs);
                }
                /* Reset the state as we are going to start the counter after the loop */
                hw_evt->state = 0;
        }

-       rvpmu_start_overflow_mask(pmu, overflowed_ctrs);
+       if (!ev_overflow || !from_sse)
+               rvpmu_start_overflow_mask(pmu, overflowed_ctrs);
+
        perf_sample_event_took(sched_clock() - start_clock);

        return IRQ_HANDLED;

> 
> Thanks,
> 
> Clément
> 
>>
>> In this way, we can ensure that `pmu_sbi_ovf_handler` will not be
>> re-entered by a new PMU-SSE event, while minimizing modifications to the
>> perf logic. The price is that we give up sampling a small portion of
>> kernel code (from `pmu_ctr_start` to the end of `pmu_sbi_ovf_handler`),
>> and we probably need a new parameter for `pmu_ctr_start`.
>>
>> Looking forward to your suggestions. Thanks!
>>
>> Best regards,
>> Zhanpeng Zhang
>>
> 

