Message-ID: <Z6Xn61Bjm8pMd8zX@pathway.suse.cz>
Date: Fri, 7 Feb 2025 12:00:59 +0100
From: Petr Mladek <pmladek@...e.com>
To: Yafang Shao <laoar.shao@...il.com>
Cc: Miroslav Benes <mbenes@...e.cz>, jpoimboe@...nel.org, jikos@...nel.org,
joe.lawrence@...hat.com, live-patching@...r.kernel.org,
linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH 0/2] livepatch: Add support for hybrid mode
On Wed 2025-02-05 14:16:42, Yafang Shao wrote:
> On Tue, Feb 4, 2025 at 9:05 PM Petr Mladek <pmladek@...e.com> wrote:
> >
> > On Mon 2025-02-03 17:44:52, Yafang Shao wrote:
> > > On Fri, Jan 31, 2025 at 9:18 PM Miroslav Benes <mbenes@...e.cz> wrote:
> > > >
> > > > > >
> > > > > > + What exactly is meant by frequent replacements (busy loop?, once a minute?)
> > > > >
> > > > > The script:
> > > > >
> > > > > #!/bin/bash
> > > > > while true; do
> > > > >     yum install -y ./kernel-livepatch-6.1.12-0.x86_64.rpm
> > > > >     ./apply_livepatch_61.sh # it will sleep 5s
> > > > >     yum erase -y kernel-livepatch-6.1.12-0.x86_64
> > > > >     yum install -y ./kernel-livepatch-6.1.6-0.x86_64.rpm
> > > > >     ./apply_livepatch_61.sh # it will sleep 5s
> > > > > done
> > > >
> > > > A live patch application is a slowpath. It is expected not to run
> > > > frequently (in a relative sense).
> > >
> > > The frequency isn’t the main concern here; _scalability_ is the key issue.
> > > Running livepatches once per day (a relatively low frequency) across all of our
> > > production servers (hundreds of thousands) isn’t feasible. Instead, we need to
> > > periodically run tests on a subset of test servers.
> >
> > I am confused. The original problem was a system crash when
> > livepatching do_exit() function, see
> > https://lore.kernel.org/r/CALOAHbA9WHPjeZKUcUkwULagQjTMfqAdAg+akqPzbZ7Byc=qrw@mail.gmail.com
>
> Why do you view this patchset as a solution to the original problem?
You discovered the hardlockup warnings when trying to reproduce the
original problem. At least, you mentioned this at
https://lore.kernel.org/r/20250122085146.41553-1-laoar.shao@gmail.com
And the hybrid mode would allow livepatching do_exit() only once and
never touching it again. It is not the right solution, but it would
be a workaround.
> > The rcu watchdog warning was first mentioned in this patchset.
> > Do you see rcu watchdog warning in production or just
> > with this artificial test, please?
>
> So, we shouldn’t run any artificial tests on the livepatch, correct?
> What exactly is the issue with these test cases?
Some tests might be too artificial. They might find problems which
do not exist in practice.
Disclaimer: I am not saying that this is the case. You actually show
later in this mail that the hardlockup warning happens
even in production.
Anyway, if an artificial test finds a potential problem and the fix is
complicated, then we need to decide whether it is worth it.
It does not make sense to complicate the code too much when
the fixed problem does not happen in practice:
  + Too complicated code might introduce regressions which are
    more serious than the original problem.
  + Too complicated code increases the maintenance cost. It becomes
    harder to add new features or fix bugs.
> > > > If you stress it like this, it is quite
> > > > expected that it will have an impact. Especially on a large busy system.
> > >
> > > It seems you agree that the current atomic-replace process lacks scalability.
> > > When deploying a livepatch across a large fleet of servers, it’s impossible to
> > > ensure that the servers are idle, as their workloads are constantly varying and
> > > are not deterministic.
> >
> > Do you see the scalability problem in production, please?
>
> Yes, the livepatch transition was stalled.
Good to know.
>
> > And could you prove that it was caused by livepatching, please?
>
> When the livepatch transition is stalled, running `kpatch list` will
> display the stalled information.
OK.
> > > The challenges are very different when managing 1K servers versus 1M servers.
> > > Similarly, the issues differ significantly between patching a single
> > > function and
> > > patching 100 functions, especially when some of those functions are critical.
> > > That’s what scalability is all about.
> > >
> > > Since we transitioned from the old livepatch mode to the new
> > > atomic-replace mode,
> >
> > What do you mean with the old livepatch mode, please?
>
> $ kpatch-build -R
I am not familiar with kpatch-build. OK, I see the following at
https://github.com/dynup/kpatch/blob/master/kpatch-build/kpatch-build
echo " -R, --non-replace Disable replace patch (replace is on by default)" >&2
> >
> > Did you allow to install more livepatches in parallel?
>
> No.
I guess that there is a misunderstanding. I am sorry, my wording was
not clear enough.
By "installing" more livepatches in parallel I meant having more
livepatches _enabled_ in parallel. It is possible only when you do not
use the atomic replace.
In other words, if you use "kpatch-build -R" then you can have
more livepatches enabled in parallel.
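To illustrate the difference on the kernel side: a livepatch module
selects the atomic-replace behavior via the .replace flag of
struct klp_patch. Below is a minimal sketch along the lines of
samples/livepatch/livepatch-sample.c; the patched function and its
replacement are made-up placeholders and the snippet is not
compile-tested:

#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/livepatch.h>

/* Hypothetical replacement for a made-up function "dummy_function". */
static int livepatch_dummy_function(void)
{
        return 0;
}

static struct klp_func funcs[] = {
        {
                .old_name = "dummy_function",   /* placeholder name */
                .new_func = livepatch_dummy_function,
        }, { }
};

static struct klp_object objs[] = {
        {
                /* name == NULL means patching vmlinux */
                .funcs = funcs,
        }, { }
};

static struct klp_patch patch = {
        .mod = THIS_MODULE,
        .objs = objs,
        /*
         * .replace = true turns this into an atomic-replace patch that
         * disables all previously enabled livepatches when its
         * transition finishes. Leaving it false (the default, which is
         * what "kpatch-build -R" effectively produces) allows several
         * livepatches to stay enabled in parallel.
         */
        .replace = false,
};

static int livepatch_init(void)
{
        return klp_enable_patch(&patch);
}

static void livepatch_exit(void)
{
}

module_init(livepatch_init);
module_exit(livepatch_exit);
MODULE_LICENSE("GPL");
MODULE_INFO(livepatch, "Y");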
> > What was the motivation to switch to the atomic replace, please?
>
> This is the default behavior of kpatch [1] after upgrading to a new version.
>
> [1]. https://github.com/dynup/kpatch/tree/master
OK. I wonder if the atomic replace simplified some actions for you.
I ask because the proposed "hybrid" model is very similar to the "old"
model which did not use the atomic replace.
What are the advantages of the "hybrid" model over the "old" model, please?
> > > our SREs have consistently reported that one or more servers become
> > > stalled during
> > > the upgrade (replacement).
> >
> > What is SRE, please?
>
> From the wikipedia : https://en.wikipedia.org/wiki/Site_reliability_engineering
Good to know.
> > Could you please show some log from a production system?
>
> When the SREs initially reported that the livepatch transition was
> stalled, I simply advised them to try again. However, after
> experiencing these crashes, I dug deeper into the livepatch code and
> realized that scalability is a concern. As a result, periodically
> replacing an old livepatch triggers RCU warnings on our production
> servers.
>
> [Wed Feb 5 10:56:10 2025] livepatch: enabling patch 'livepatch_61_release6'
> [Wed Feb 5 10:56:10 2025] livepatch: 'livepatch_61_release6':
> starting patching transition
> [Wed Feb 5 10:56:24 2025] rcu_tasks_wait_gp: rcu_tasks grace period
> 1126113 is 10078 jiffies old.
> [Wed Feb 5 10:56:38 2025] rcu_tasks_wait_gp: rcu_tasks grace period
> 1126117 is 10199 jiffies old.
> [Wed Feb 5 10:56:52 2025] rcu_tasks_wait_gp: rcu_tasks grace period
> 1126121 is 10047 jiffies old.
> [Wed Feb 5 10:56:57 2025] livepatch: 'livepatch_61_release6': patching complete
I guess that this happens primarily because there are many processes
running in kernel code:
  + many processes => long task list
  + in kernel code => need to check the stack
I wondered how much of it is caused by livepatching do_exit(), which
takes tasklist_lock. The idea was:
  + The processes calling do_exit() are blocked at
    write_lock_irq(&tasklist_lock) when
    klp_try_complete_transition() takes the tasklist_lock.
  + These processes can't be transitioned because do_exit()
    is on the stack => more klp_try_complete_transition()
    rounds are needed.
  => livepatching do_exit() reduces the chance of
     klp_try_complete_transition() success.
Well, it should not be that big a problem because the next
klp_try_complete_transition() round should be much faster.
It will skip already transitioned processes quickly; see the
sketch below.
To sum up: I think that livepatching do_exit() should not be the main
problem for the stall.
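For completeness, the retry logic I am describing looks roughly like
the sketch below. It is a simplified version of
kernel/livepatch/transition.c written from memory; the helper names
and details are approximations, not verbatim kernel code:

/* One round of the transition, rescheduled until all tasks are switched. */
static bool try_switch_task(struct task_struct *task)
{
        /* Tasks switched in an earlier round are skipped cheaply. */
        if (task->patch_state == klp_target_state)
                return true;

        /*
         * Otherwise check the task's stack. If any to-be-(un)patched
         * function, e.g. do_exit(), is found there, the task cannot
         * be switched in this round.
         */
        return !klp_check_stack(task);
}

static void try_complete_transition(void)
{
        struct task_struct *g, *task;
        bool complete = true;

        /* The whole walk over the task list runs under tasklist_lock. */
        read_lock(&tasklist_lock);
        for_each_process_thread(g, task)
                if (!try_switch_task(task))
                        complete = false;
        read_unlock(&tasklist_lock);

        if (!complete) {
                /* Some tasks still block the transition, retry later. */
                schedule_delayed_work(&klp_transition_work, HZ);
                return;
        }

        klp_complete_transition();
}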
> PS: You might argue again about the frequency. If you believe this is
> just a frequency issue, please suggest a suitable frequency.
I do not know. The livepatch transition might block some processes.
It is a kind of stress for the system, similar to other
housekeeping operations.
It depends on the load and on the number and type of livepatched
functions. It might take some time until the system recovers
from the stress and the system load drops back to normal.
If you create the stress (livepatch transition) too frequently
and the system does not get a chance to recover in between the
stress situations, then the bad effects might accumulate
and become much worse.
I have no idea whether that is the case here. The rule of thumb would be:
  + If you see the hardlockup warning _only_ when running the stress
    test "while true; do apply_livepatch; done", then
    the problem might be rather theoretical.
  + If you see the hardlockup warning on production systems where
    you apply a livepatch only occasionally (once per day),
    then the problem is real and we should fix it.
> > > > > > > Other potential risks may also arise
> > > > > > > due to inconsistencies or race conditions during transitions.
> > > > > >
> > > > > > What inconsistencies and race conditions you have in mind, please?
> > > > >
> > > > > I have explained it at
> > > > > https://lore.kernel.org/live-patching/Z5DHQG4geRsuIflc@pathway.suse.cz/T/#m5058583fa64d95ef7ac9525a6a8af8ca865bf354
> > > > >
> > > > > klp_ftrace_handler()
> > > > >     if (unlikely(func->transition)) {
> > > > >         WARN_ON_ONCE(patch_state == KLP_UNDEFINED);
> > > > >     }
> > > > >
> > > > > Why is WARN_ON_ONCE() placed here? What issues have we encountered in the past
> > > > > that led to the decision to add this warning?
> > > >
> > > > A safety measure for something which really should not happen.
> > >
> > > Unfortunately, this issue occurs during my stress tests.
> >
> > I am confused. Do you see the above WARN_ON_ONCE() during your
> > stress test? Could you please provide a log?
>
> Could you please read my reply seriously?
This is pretty harsh and offensive after so many details I have already
provided!
It is easy to miss a detail in a flood of long mails. Also, I am
working on many other things in parallel.
> https://lore.kernel.org/live-patching/Z5DHQG4geRsuIflc@pathway.suse.cz/T/#m5058583fa64d95ef7ac9525a6a8af8ca865bf354
Ah, I had missed that you triggered this exact WARNING. That is great.
It confirms the theory about the race in do_exit(): the transition
finishes early because the processes in do_exit() are no longer
visible in the task list.
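To spell out the suspected ordering (a rough sketch from memory of the
exit path, not real code; the exact call chain depends on the kernel
version):

void do_exit(long code)
{
        /* ... */
        exit_notify(tsk, group_dead);
        /*
         * For autoreaped tasks, exit_notify() ends up calling
         * release_task() -> __unhash_process(), which removes the task
         * from the task list. From this point on, the transition's
         * for_each_process_thread() walk no longer sees the task, so
         * the transition can be declared complete (or a new one can be
         * initialized without setting this task's patch_state) while
         * do_exit() is still running here and may still call patched
         * functions.
         */
        /* ... */
        do_task_dead();
}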
> > > > > > The main advantage of the atomic replace is to simplify the maintenance
> > > > > > and debugging.
> >
> > If you have problems with the atomic replace then you might stop using
> > it completely and just install more livepatches in parallel.
>
> Why do we need to install livepatches in parallel if atomic replace is disabled?
> We only need to install the additional new livepatch. Parallel
> installation is only necessary at boot time.
This is a misunderstanding. By "installed" livepatches in parallel I
mean "enabled" livepatches in parallel, i.e. without atomic replace.
If you have problems with atomic replace, you might stop using it.
Honestly, I do not see that big an advantage of the hybrid model
over the non-atomic-replace model.
That said, I think that the hybrid mode will not prevent the
hardlockup warning. It seems that you have reproduced the hardlockup
even with a relatively simple livepatch, see
https://lore.kernel.org/all/CALOAHbBZc6ORGzXwBRwe+rD2=YGf1jub5TEr989_GpK54P2o1A@mail.gmail.com/
IMHO, we should rather detect and break the stall in
klp_try_complete_transition(). I mean going the way explored in
the thread
https://lore.kernel.org/all/20250122085146.41553-1-laoar.shao@gmail.com/
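Just to make "detect and break the stall" a bit more concrete, a
purely illustrative direction could look like the fragment below,
extending the simplified try_complete_transition() sketch earlier in
this mail. This is not what the patches in the referenced thread do;
the counter, threshold, and message are made up:

/* Illustrative only: report a transition that keeps failing for too long. */
static unsigned int stall_rounds;        /* hypothetical counter */
#define STALL_REPORT_THRESHOLD 60        /* arbitrary: ~1 minute at HZ delay */

        if (!complete) {
                if (++stall_rounds >= STALL_REPORT_THRESHOLD)
                        pr_warn("livepatch: transition to '%s' appears stalled, "
                                "consider the sysfs 'force' knob as a last resort\n",
                                klp_transition_patch->mod->name);
                /* Retry later as before. */
                schedule_delayed_work(&klp_transition_work, HZ);
                return;
        }
        stall_rounds = 0;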
Best Regards,
Petr