Message-ID: <D298A3F1-43A5-4FD5-B198-906364BF4B79@fb.com>
Date: Tue, 10 May 2022 23:57:04 +0000
From: Song Liu <songliubraving@...com>
To: Josh Poimboeuf <jpoimboe@...nel.org>
CC: Rik van Riel <riel@...com>, "song@...nel.org" <song@...nel.org>,
"joe.lawrence@...hat.com" <joe.lawrence@...hat.com>,
"peterz@...radead.org" <peterz@...radead.org>,
"mingo@...hat.com" <mingo@...hat.com>,
"vincent.guittot@...aro.org" <vincent.guittot@...aro.org>,
"pmladek@...e.com" <pmladek@...e.com>,
"live-patching@...r.kernel.org" <live-patching@...r.kernel.org>,
Kernel Team <Kernel-team@...com>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"jpoimboe@...hat.com" <jpoimboe@...hat.com>
Subject: Re: [RFC] sched,livepatch: call klp_try_switch_task in __cond_resched
> On May 10, 2022, at 4:04 PM, Josh Poimboeuf <jpoimboe@...nel.org> wrote:
>
> On Tue, May 10, 2022 at 07:45:49PM +0000, Song Liu wrote:
>>>> A KLP transition preempt notifier would help those
>>>> kernel threads transition to the new KLP version at
>>>> any time they reschedule.
>>>
>>> ... unless cond_resched() is a no-op due to CONFIG_PREEMPT?
>>
>> Based on my understanding (and a few other folks we chatted with),
>> a kernel thread can legally run for an extended time, as long as it
>> calls cond_resched() at a reasonable frequency. Therefore, I
>> think we should be able to patch such thread easily, unless it
>> calls cond_resched() with a being-patched function on the stack,
>> of course.
>
> But again, with CONFIG_PREEMPT, that doesn't work.
>
>> OTOH, Petr's mindset of allowing many minutes for the patch
>> transition is new to me. I need to think more about it.
>> Josh, what's your opinion on this? IIUC, kpatch is designed to
>> only wait up to 60 seconds (no option to override the timeout).
>
> I wouldn't be necessarily opposed to changing the kpatch timeout to
> something bigger, or eliminating it altogether in favor of a WARN()
> after x minutes.
>
>>>> How much it will help is hard to predict, but I should
>>>> be able to get results from a fairly large sample size
>>>> of systems within a few weeks :)
>>>
>>> As Peter said, keep in mind that we will need to fix other cases beyond
>>> Facebook, i.e., CONFIG_PREEMPT combined with non-x86 arches which don't
>>> have ORC so they can't reliably unwind from an IRQ.
>>
>> I think livepatch transition may fail in different cases, and we
>> don't need to address all of them in one shot. Fixing some cases
>> is an improvement as long as we don't slow down other cases. I
>> understand that adding tiny overhead to __cond_resched() may end
>> up as a visible regression. But maybe adding it to
>> preempt_schedule_common() is light enough?
>>
>> Did I miss/misunderstand something?
>
> If it's a real bug, we should fix it everywhere, not just for Facebook.
> Otherwise CONFIG_PREEMPT and/or non-x86 arches become second-class
> citizens.
I think "is it a real bug?" is the top question for me. So maybe we
should take a step back.

The behavior we see is: a busy kernel thread blocks the klp
transition for more than a minute, but the transition eventually
succeeds after fewer than 10 retries on most systems. The kernel
thread is well-behaved: it calls cond_resched() at a reasonable
frequency, so this is not a deadlock.

If I understand Petr correctly, this behavior is expected, and thus
is not a bug or an issue for the livepatch subsystem. This differs
from our original expectation, but if this is what we agree on, we
will look into ways to accommodate long patch-transition wait times
in our automation.

OTOH, if we think this is suboptimal behavior (not necessarily a
bug, IIUC), we should consider improving it. I personally don't
think shipping an improvement for one configuration makes the other
configurations second-class citizens. And it is possible that a
PREEMPT kernel does NOT even have this suboptimal behavior, right?
(I really don't know.)

So we come back to the same question: is this a bug (or suboptimal
behavior that is worth fixing)? If so, we are open to any solution
that would also help PREEMPT and/or non-x86 arches.

Lastly, maybe a really naive question: does the following also help
PREEMPT=y configurations?

Thanks,
Song
diff --git c/kernel/sched/core.c w/kernel/sched/core.c
index a7302b7b65f2..cc9b1c9c02ba 100644
--- c/kernel/sched/core.c
+++ w/kernel/sched/core.c
@@ -5226,6 +5226,9 @@ void __sched schedule_preempt_disabled(void)
 
 static void __sched notrace preempt_schedule_common(void)
 {
+	if (unlikely(klp_patch_pending(current)))
+		klp_try_switch_task(current);
+
 	do {
 		/*
 		 * Because the function tracer can trace preempt_count_sub()