Message-ID: <CAEEQ3wmXGJz=aoWLWeGQBTb4w7=FJnVOsELgKz1Gs9NoChM01g@mail.gmail.com>
Date: Thu, 20 Nov 2025 19:55:37 +0800
From: yunhui cui <cuiyunhui@...edance.com>
To: Atish Patra <atish.patra@...ux.dev>
Cc: Clément Léger <cleger@...osinc.com>, 
	Zhanpeng Zhang <zhangzhanpeng.jasper@...edance.com>, 
	Paul Walmsley <paul.walmsley@...ive.com>, Palmer Dabbelt <palmer@...belt.com>, 
	"linux-riscv@...ts.infradead.org" <linux-riscv@...ts.infradead.org>, 
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>, 
	"linux-arm-kernel@...ts.infradead.org" <linux-arm-kernel@...ts.infradead.org>, 
	Himanshu Chauhan <hchauhan@...tanamicro.com>, Anup Patel <apatel@...tanamicro.com>, 
	路旭 <luxu.kernel@...edance.com>, 
	Atish Patra <atishp@...shpatra.org>, Björn Töpel <bjorn@...osinc.com>, 
	元竹 <yuanzhu@...edance.com>
Subject: Re: [External] Re: How to Avoid Starving the Kernel When Using SSE

Hi Atish,

On Thu, Nov 20, 2025 at 5:09 PM Atish Patra <atish.patra@...ux.dev> wrote:
>
> On 11/18/25 12:51 AM, Clément Léger wrote:
> >
> >
> > On 11/18/25 09:46, Zhanpeng Zhang wrote:
> >> Hi Clément,
> >>
> >> It seems that your patch is based on the PMU-ctr-delegation version. However, I'm using PMU-SBI and the v8 SSE extension,
> >> so I made the corresponding modifications in `pmu_sbi_ovf_handler`, mirroring yours in `rvpmu_ovf_handler`.
> >
> > Hi Zhanpeng,
> >
> > Indeed, my modifications were based on Atish's rvpmu series.
> >
> >> This indeed prevents the kernel from hanging during perf sampling, and the sampling results look good. But judging from my debug results, this approach has an obvious problem: most `pmu_start` operations are now triggered by `event_sched_in` rather than by do_sse. That is to say, `pmu_start` is delayed until `event_sched_in`.
> >
> > That's the expected result; the point is that the IRQ rate should be
> > throttled. Depending on how slow your platform is, overflow might kick
> > in faster, leading to more throttling and thus a delayed counter start.
> > I do think this is the solution that should be implemented, since the
> > overflow return value from perf_event_overflow() is meant exactly for
> > that (i.e., throttling IRQs).
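
For context, the throttling referred to here lives in perf core. Below is a
condensed paraphrase of the throttle decision in
__perf_event_account_interrupt() in kernel/events/core.c (not a verbatim
copy; details vary by kernel version):

	/*
	 * Condensed paraphrase of perf core's throttle decision: when an
	 * event overflows more than max_samples_per_tick times between
	 * ticks, perf stops sampling it and returns 1, which reaches the
	 * arch overflow handler via perf_event_overflow().
	 */
	if (throttle && hwc->interrupts++ >= max_samples_per_tick) {
		hwc->interrupts = MAX_INTERRUPTS;
		perf_log_throttle(event, 0);
		/*
		 * Re-arm the tick so the event gets unthrottled later,
		 * even on a nohz_full CPU.
		 */
		tick_dep_set_cpu(smp_processor_id(),
				 TICK_DEP_BIT_PERF_EVENTS);
		ret = 1;
	}

Note the tick_dep_set_cpu() call: it keeps the unthrottling tick alive even
on CONFIG_NO_HZ_FULL CPUs, which bears on the tickless question below.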
> >
>
> While that works fine for the events that already overflowed, what
> happens to a counter that was about to overflow but was stopped because
> the SSE event handler was called?
>
> In that case, you may miss those events until the process is scheduled
> again. Depending on the frequency of the event, that may or may not be
> acceptable.

What if the system is tickless, i.e. CONFIG_NO_HZ_FULL is enabled?

>
> We also need to think about whether we need to handle such scenarios for
> other SSE events. For example, continuously triggered correctable RAS
> errors can end up in the same situation.
>
> Should we provide a generic sysfs mechanism for users to disable this
> SSE event? By default, the SSE event would not be throttled, but a user
> could opt in to throttling by writing to the sysfs file.

Other architectures such as x86 and arm64 do not need this option, so
RISC-V should not need it either.
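
For reference, this is roughly how arm64 consumes the throttle indication in
its overflow handler, with no extra knob (paraphrased from
armv8pmu_handle_irq() in drivers/perf/arm_pmuv3.c):

	/*
	 * Paraphrase of arm64's overflow loop: when perf core asks for
	 * throttling, the event is simply stopped here; perf core
	 * restarts it from the tick / the next sched-in.
	 */
	if (perf_event_overflow(event, &data, regs))
		cpu_pmu->disable(event);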


Thanks,
Yunhui

>
>
> > Clément
> >
> >> Actually, I'm also planning to try Atish's PMU-ctr-delegation extension; such a delegation optimization seems likely to help alleviate the hanging problem. But I think that is no longer the same issue, as it requires comprehensive modifications to the hardware/QEMU, kernel, SBI, and perf itself.
> >>
> >> Regards,
> >> Zhanpeng
> >>> From: "Clément Léger"<cleger@...osinc.com>
> >>> Date:  Fri, Nov 14, 2025, 23:41
> >>> Subject:  [External] Re: How to Avoid Starving the Kernel When Using SSE
> >>> To: "张展鹏"<zhangzhanpeng.jasper@...edance.com>
> >>> Cc: "Paul Walmsley"<paul.walmsley@...ive.com>, "Palmer Dabbelt"<palmer@...belt.com>, "linux-riscv@...ts.infradead.org"<linux-riscv@...ts.infradead.org>, "linux-kernel@...r.kernel.org"<linux-kernel@...r.kernel.org>, "linux-arm-kernel@...ts.infradead.org"<linux-arm-kernel@...ts.infradead.org>, "Himanshu Chauhan"<hchauhan@...tanamicro.com>, "Anup Patel"<apatel@...tanamicro.com>, "路旭"<luxu.kernel@...edance.com>, "Atish Patra"<atishp@...shpatra.org>, "Björn Töpel"<bjorn@...osinc.com>, "崔运辉"<cuiyunhui@...edance.com>, "元竹"<yuanzhu@...edance.com>
> >>> On 11/14/25 14:36, Clément Léger wrote:
> >>>>
> >>>> Hi Zhanpeng,
> >>>>
> >>>> On 11/14/25 11:24, 张展鹏 wrote:
> >>>>> Hi Clément,
> >>>>>
> >>>>> Lately, I've been thinking about how to avoid starving the kernel
> >>>>> when using SSE:
> >>>>> SSE is driven by M-mode IRQs, such as the M-mode PMU IRQ for perf
> >>>>> sampling and the M-mode IPI for inter-hart injection, while the
> >>>>> kernel is driven by S-mode IRQs. The kernel may therefore starve
> >>>>> when there is a flood of M-mode IRQs, and the kernel itself can
> >>>>> cause such a flood when using SSE, either deliberately or
> >>>>> inadvertently:
> >>>>>
> >>>>>     1.  Malicious SSE handler: The kernel may deliberately register
> >>>>>     a bad SSE handler which triggers a new inter-hart SSE request
> >>>>>     via ecall. This causes an endless loop of SSE, rendering the
> >>>>>     kernel unresponsive. In this case, the only thing SBI can do is
> >>>>>     prevent SSE nesting in `sbi_trap_handler` and ensure that SSE
> >>>>>     events are executed in priority order.
> >>>>
> >>>> That seems quite convoluted. Anyone who can load a module can do worse
> >>>> than crashing the kernel :)
> >>>>
> >>>>>     2.  Perf sampling: The kernel may inadvertently choose a bad
> >>>>>     parameter for perf, causing PMU IRQs to fire too frequently.
> >>>>>     Continuous PMU IRQs leave the system with no time to respond to
> >>>>>     S-mode IRQs.
> >>>>
> >>>> This concern, however, is valid!
> >>>>>
> >>>>> Hence, I think we should improve the SSE framework so that it cannot
> >>>>> starve the kernel this easily.
> >>>>>
> >>>>> Here is a case study of perf sampling:
> >>>>> When using PMU-SSE for perf sampling, the kernel may hang and become
> >>>>> unresponsive due to the PMU-SSE loop. Once we start to process perf
> >>>>> sampling using PMU-SSE, the kernel may fail to respond to `Ctrl+C`
> >>>>> or fail to exit after the timer of `sleep 1` expires (these are the
> >>>>> two most commonly used time-based sampling methods in perf).
> >>>>>
> >>>>> By default, perf uses a relatively high sampling frequency, namely
> >>>>> `perf_event_max_sample_rate`, and adjusts it on demand if sampling
> >>>>> takes too much time. If this frequency/period goes beyond what the
> >>>>> system can handle, SSE events run back to back and the system gets
> >>>>> stuck in an endless loop of "SSE → PMU interrupt → SSE". The kernel
> >>>>> is then starved (at this point, if you print the `sepc` at SSE
> >>>>> completion, you will find that it remains unchanged each time,
> >>>>> indicating that the kernel is stuck), and it can never escape from
> >>>>> this PMU-SSE loop, because it can neither respond to Ctrl+C
> >>>>> interrupts nor adjust the sampling frequency.
> >>>>>
> >>>>> Current solution: The key to this problem is that every time we
> >>>>> finish sse_complete, a new PMU IRQ is already pending. We then
> >>>>> resume kernel execution via mret, and the system immediately traps
> >>>>> back into SSE.
> >>>>>
> >>>>> The PMU-SSE perf processing flow includes the following steps:
> >>>>> `sse_inject` (mret to the SSE handler), `pmu_stop` (clear the PMU
> >>>>> pending bit), `pmu_start` (set a new value for the PMU counter), and
> >>>>> `sse_complete` (resume execution at the point where the kernel was
> >>>>> interrupted). The reason the kernel traps right after `sse_complete`
> >>>>> is that a new PMU IRQ is generated between `pmu_start` and
> >>>>> `sse_complete`.
> >>>>>
> >>>>> To address this issue, we propose delaying the restart of the
> >>>>> overflowed PMU counter during PMU-SSE. When the kernel triggers an
> >>>>> ecall to restart the overflowed PMU counters, SBI can check whether
> >>>>> this is SSE-driven PMU handling. If so, we temporarily modify the
> >>>>> mhpmevent CSR to stop counting kernel events. In this process,
> >>>>> M-mode events are always inhibited, and U-mode code is not executed
> >>>>> during `pmu_sbi_ovf_handler`, so we only need to inhibit the
> >>>>> counting of kernel (S-mode) events.
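
For reference, the counting-inhibit bits this proposal relies on come from
the Sscofpmf extension (in mhpmeventN on RV64: OF = bit 63, MINH = bit 62,
SINH = bit 61, UINH = bit 60; OpenSBI defines these in riscv_encoding.h). A
minimal M-mode sketch under those assumptions; the sse_pmu_* helper names
are hypothetical:

	/* Sscofpmf inhibit bits in mhpmeventN (RV64). */
	#define MHPMEVENT_OF	(1ULL << 63)	/* overflow status bit */
	#define MHPMEVENT_MINH	(1ULL << 62)	/* inhibit M-mode counting */
	#define MHPMEVENT_SINH	(1ULL << 61)	/* inhibit S-mode counting */

	/* Called on the counter-start ecall when the PMU IRQ is SSE-driven. */
	static void sse_pmu_inhibit_smode(unsigned int ctr_idx)
	{
		unsigned long val = csr_read_num(CSR_MHPMEVENT3 + ctr_idx);

		/* Stop counting S-mode (kernel) events until sse_complete. */
		csr_write_num(CSR_MHPMEVENT3 + ctr_idx, val | MHPMEVENT_SINH);
	}

	/* Called from the sse_complete path to resume S-mode counting. */
	static void sse_pmu_uninhibit_smode(unsigned int ctr_idx)
	{
		unsigned long val = csr_read_num(CSR_MHPMEVENT3 + ctr_idx);

		csr_write_num(CSR_MHPMEVENT3 + ctr_idx, val & ~MHPMEVENT_SINH);
	}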
> >>>>
> >>>> I'd rather let the kernel control the PMU SSE event delivery by
> >>>> masking it at the end of the SSE handler and re-enabling it later.
> >>>> Additionally, with that solution living in the SBI itself, there is
> >>>> no guarantee that every SBI implementation will actually do it
> >>>> correctly.
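
A minimal sketch of that kernel-side masking, assuming the SBI SSE
extension's enable/disable calls (EID "SSE" = 0x535345); the FID values and
the PMU event ID below follow my reading of the SSE proposal and are
assumptions, not the series' confirmed API:

	#define SBI_EXT_SSE		0x535345
	#define SBI_SSE_EVENT_ENABLE	4	/* assumed FID */
	#define SBI_SSE_EVENT_DISABLE	5	/* assumed FID */
	#define SSE_EVENT_LOCAL_PMU_OVF	0x1	/* placeholder event ID */

	/* End of the SSE handler: stop further PMU SSE injection. */
	static void sse_pmu_mask(void)
	{
		sbi_ecall(SBI_EXT_SSE, SBI_SSE_EVENT_DISABLE,
			  SSE_EVENT_LOCAL_PMU_OVF, 0, 0, 0, 0, 0);
	}

	/* Re-enable once perf unthrottles / the event is scheduled back in. */
	static void sse_pmu_unmask(void)
	{
		sbi_ecall(SBI_EXT_SSE, SBI_SSE_EVENT_ENABLE,
			  SSE_EVENT_LOCAL_PMU_OVF, 0, 0, 0, 0, 0);
	}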
>
> I agree. The solution has to be implemented within the Linux kernel
> rather than in the SBI specification, as it is a problem of the SSE
> event handler and its user (in this case, an incorrect sampling rate
> being used).
>
> >>>>
> >>>> What seems odd is that the perf_sample_event_took() call at the end
> >>>> of the PMU event handler should already allow the perf subsystem to
> >>>> throttle the rate. I'll take another look at that part to make sure
> >>>> it works as expected and that we aren't missing any bits.
> >>>
> >>> Hey Zhanpeng,
> >>>
> >>> Could you try to apply this quick'n'dirty patch on top of the SSE series
> >>> and check whether it still hangs?
> >>>
> >>> diff --git a/drivers/perf/riscv_pmu_dev.c b/drivers/perf/riscv_pmu_dev.c
> >>> index 7dec9c2afa9b..0fb8749c476f 100644
> >>> --- a/drivers/perf/riscv_pmu_dev.c
> >>> +++ b/drivers/perf/riscv_pmu_dev.c
> >>> @@ -1326,6 +1326,7 @@ static irqreturn_t rvpmu_ovf_handler(struct cpu_hw_events *cpu_hw_evt,
> >>>         int lidx, hidx, fidx;
> >>>         struct riscv_pmu *pmu;
> >>>         struct perf_event *event;
> >>> +       int ev_overflow = 0;
> >>>         u64 overflow;
> >>>         u64 overflowed_ctrs = 0;
> >>>         u64 start_clock = sched_clock();
> >>> @@ -1423,13 +1424,15 @@ static irqreturn_t rvpmu_ovf_handler(struct cpu_hw_events *cpu_hw_evt,
> >>>                          * TODO: We will need to stop the guest counters once
> >>>                          * virtualization support is added.
> >>>                          */
> >>> -                       perf_event_overflow(event, &data, regs);
> >>> +                       ev_overflow |= perf_event_overflow(event, &data, regs);
> >>>                 }
> >>>                 /* Reset the state as we are going to start the counter after the loop */
> >>>                 hw_evt->state = 0;
> >>>         }
> >>>
> >>> -       rvpmu_start_overflow_mask(pmu, overflowed_ctrs);
> >>> +       if (!ev_overflow || !from_sse)
> >>> +               rvpmu_start_overflow_mask(pmu, overflowed_ctrs);
> >>> +
> >>>         perf_sample_event_took(sched_clock() - start_clock);
> >>>
> >>>         return IRQ_HANDLED;
> >>>
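
The effect of the second hunk: if any event requested throttling (a nonzero
return from perf_event_overflow()) and the interrupt was delivered through
SSE, the overflowed counters are left stopped; perf core restarts them later
from event_sched_in / unthrottling, which breaks the SSE → PMU IRQ → SSE
loop. (`from_sse` is presumably a flag the SSE series passes into the
handler; it is not shown in this hunk.)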
> >>>>
> >>>> Thanks,
> >>>>
> >>>> Clément
> >>>>
> >>>>>
> >>>>> In this way, we can ensure that `pmu_sbi_ovf_handler` will not be
> >>>>> re-entered by a new PMU-SSE, while minimizing modifications to the
> >>>>> perf logic. The price is that we give up sampling a small portion of
> >>>>> kernel code (from `pmu_ctr_start` to the end of
> >>>>> `pmu_sbi_ovf_handler`), and we probably need a new parameter in
> >>>>> `pmu_ctr_start`.
> >>>>>
> >>>>> Looking forward to your suggestions. Thanks!
> >>>>>
> >>>>> Best regards,
> >>>>> Zhanpeng Zhang
