Message-ID: <CALOAHbBktE_jYd5zSzvmbo_K7PkFDXrykTqV1-ZDQju64EYPyg@mail.gmail.com>
Date: Wed, 5 Feb 2025 14:16:42 +0800
From: Yafang Shao <laoar.shao@...il.com>
To: Petr Mladek <pmladek@...e.com>
Cc: Miroslav Benes <mbenes@...e.cz>, jpoimboe@...nel.org, jikos@...nel.org, 
	joe.lawrence@...hat.com, live-patching@...r.kernel.org, 
	linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH 0/2] livepatch: Add support for hybrid mode

On Tue, Feb 4, 2025 at 9:05 PM Petr Mladek <pmladek@...e.com> wrote:
>
> On Mon 2025-02-03 17:44:52, Yafang Shao wrote:
> > On Fri, Jan 31, 2025 at 9:18 PM Miroslav Benes <mbenes@...e.cz> wrote:
> > >
> > > > >
> > > > >   + What exactly is meant by frequent replacements (busy loop?, once a minute?)
> > > >
> > > > The script:
> > > >
> > > > #!/bin/bash
> > > > while true; do
> > > >         yum install -y ./kernel-livepatch-6.1.12-0.x86_64.rpm
> > > >         ./apply_livepatch_61.sh # it will sleep 5s
> > > >         yum erase -y kernel-livepatch-6.1.12-0.x86_64
> > > >         yum install -y ./kernel-livepatch-6.1.6-0.x86_64.rpm
> > > >         ./apply_livepatch_61.sh  # it will sleep 5s
> > > > done
> > >
> > > A live patch application is a slowpath. It is expected not to run
> > > frequently (in a relative sense).
> >
> > The frequency isn’t the main concern here; _scalability_ is the key issue.
> > Running livepatches once per day (a relatively low frequency) across all of our
> > production servers (hundreds of thousands) isn’t feasible. Instead, we need to
> > periodically run tests on a subset of test servers.
>
> I am confused. The original problem was a system crash when
> livepatching do_exit() function, see
> https://lore.kernel.org/r/CALOAHbA9WHPjeZKUcUkwULagQjTMfqAdAg+akqPzbZ7Byc=qrw@mail.gmail.com

Why do you view this patchset as a solution to the original problem?

>
> The rcu watchdog warning was first mentioned in this patchset.
> Do you see rcu watchdog warning in production or just
> with this artificial test, please?

So, we shouldn’t run any artificial tests on the livepatch, correct?
What exactly is the issue with these test cases?

>
>
> > > If you stress it like this, it is quite
> > > expected that it will have an impact. Especially on a large busy system.
> >
> > It seems you agree that the current atomic-replace process lacks scalability.
> > When deploying a livepatch across a large fleet of servers, it’s impossible to
> > ensure that the servers are idle, as their workloads are constantly varying and
> > are not deterministic.
>
> Do you see the scalability problem in production, please?

Yes, the livepatch transition was stalled.


> And could you prove that it was caused by livepatching, please?

When the livepatch transition is stalled, running `kpatch list` shows the
stalled transition.
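
For reference, the same state can also be checked directly through the
standard livepatch sysfs interface (a minimal sketch; the patch names on a
given system will of course differ):

#!/bin/bash
# Print the state of every loaded livepatch.  A "transition" value of 1
# that never goes back to 0 is what a stalled transition looks like.
for p in /sys/kernel/livepatch/*/; do
        printf '%s: enabled=%s transition=%s\n' \
                "$(basename "$p")" "$(cat "${p}enabled")" "$(cat "${p}transition")"
done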

>
>
> > The challenges are very different when managing 1K servers versus 1M servers.
> > Similarly, the issues differ significantly between patching a single
> > function and
> > patching 100 functions, especially when some of those functions are critical.
> > That’s what scalability is all about.
> >
> > Since we transitioned from the old livepatch mode to the new
> > atomic-replace mode,
>
> What do you mean by the old livepatch mode, please?

$ kpatch-build -R
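
In other words, -R (--non-replace) builds a non-cumulative livepatch, so each
patch module stacks on top of the ones already loaded instead of replacing
them. Roughly, with made-up patch and module names:

# Old (non-replace) mode: each fix is built and loaded as its own module,
# and the modules stack (names below are hypothetical).
kpatch-build -R fix-a.patch
kpatch load livepatch-fix-a.ko
kpatch-build -R fix-b.patch
kpatch load livepatch-fix-b.ko     # fix-a stays loaded as well

# Default (atomic replace) mode: one cumulative module replaces whatever
# was loaded before in a single transition.
kpatch load livepatch-cumulative-v2.ko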

>
> Did you allow installing more livepatches in parallel?

No.

> What was the motivation to switch to the atomic replace, please?

This is the default behavior of kpatch [1] after upgrading to a new version.

[1].  https://github.com/dynup/kpatch/tree/master

>
> > our SREs have consistently reported that one or more servers become
> > stalled during
> > the upgrade (replacement).
>
> What is SRE, please?

From Wikipedia: https://en.wikipedia.org/wiki/Site_reliability_engineering

> Could you please show some log from a production system?

When the SREs initially reported that the livepatch transition was
stalled, I simply advised them to try again. However, after
experiencing these crashes, I dug deeper into the livepatch code and
realized that scalability is a concern. It turns out that periodically
replacing an old livepatch triggers RCU warnings on our production
servers:

[Wed Feb  5 10:56:10 2025] livepatch: enabling patch 'livepatch_61_release6'
[Wed Feb  5 10:56:10 2025] livepatch: 'livepatch_61_release6': starting patching transition
[Wed Feb  5 10:56:24 2025] rcu_tasks_wait_gp: rcu_tasks grace period 1126113 is 10078 jiffies old.
[Wed Feb  5 10:56:38 2025] rcu_tasks_wait_gp: rcu_tasks grace period 1126117 is 10199 jiffies old.
[Wed Feb  5 10:56:52 2025] rcu_tasks_wait_gp: rcu_tasks grace period 1126121 is 10047 jiffies old.
[Wed Feb  5 10:56:57 2025] livepatch: 'livepatch_61_release6': patching complete

PS: You might argue again about the frequency. If you believe this is
just a frequency issue, please suggest a suitable frequency.

>
>
> > > > > > Other potential risks may also arise
> > > > > >   due to inconsistencies or race conditions during transitions.
> > > > >
> > > > > What inconsistencies and race conditions you have in mind, please?
> > > >
> > > > I have explained it at
> > > > https://lore.kernel.org/live-patching/Z5DHQG4geRsuIflc@pathway.suse.cz/T/#m5058583fa64d95ef7ac9525a6a8af8ca865bf354
> > > >
> > > >  klp_ftrace_handler
> > > >       if (unlikely(func->transition)) {
> > > >           WARN_ON_ONCE(patch_state == KLP_UNDEFINED);
> > > >   }
> > > >
> > > > Why is WARN_ON_ONCE() placed here? What issues have we encountered in the past
> > > > that led to the decision to add this warning?
> > >
> > > A safety measure for something which really should not happen.
> >
> > Unfortunately, this issue occurs during my stress tests.
>
> I am confused. Do you see the above WARN_ON_ONCE() during your
> stress test? Could you please provide a log?

Could you please read my reply carefully?
https://lore.kernel.org/live-patching/Z5DHQG4geRsuIflc@pathway.suse.cz/T/#m5058583fa64d95ef7ac9525a6a8af8ca865bf354

>
> > > > > The main advantage of the atomic replace is to simplify the maintenance
> > > > > and debugging.
> > > >
> > > > Is it worth the high overhead on production servers?
> > >
> > > Yes, because the overhead once a live patch is applied is negligible.
> >
> > If you’re managing a large fleet of servers, this issue is far from negligible.
> >
> > >
> > > > Can you provide examples of companies that use atomic replacement at
> > > > scale in their production environments?
> > >
> > > At least SUSE uses it as a solution for its customers. Not many problems
> > > have been reported since we started ~10 years ago.
> >
> > Perhaps we’re running different workloads.
> > Going back to the original purpose of livepatching: is it designed to address
> > security vulnerabilities, or to deploy new features?
>
> We (SUSE) use livepatches only for fixing CVEs and serious bugs.
>
>
> > If it’s the latter, then there’s definitely a lot of room for improvement.
>
> You might be right. I am just not sure whether the hybrid mode would
> be the right solution.
>
> If you have problems with the atomic replace then you might stop using
> it completely and just install more livepatches in parallel.

Why would we need to install livepatches in parallel if atomic replace is
disabled? We only need to load the additional new livepatch on top of what
is already applied; installing several livepatches at once is only
necessary at boot time.
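
To illustrate with hypothetical module names:

# Without atomic replace, shipping one more fix is just loading one more
# module; the livepatches that are already applied are not touched.
kpatch load livepatch-fix-c.ko

# And reverting that single fix later only involves that one module.
kpatch unload livepatch-fix-c.ko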

>
>
> My view:
>
> More livepatches installed in parallel are more prone to

I’m confused as to why you consider this a parallel installation issue.

> inconsistencies. A good example is the thread about introducing
> stack order sysfs interface, see
> https://lore.kernel.org/all/AAD198C9-210E-4E31-8FD7-270C39A974A8@gmail.com/
>
> The atomic replace helps to keep the livepatched functions consistent.
>
> The hybrid model would allow installing more livepatches in parallel except
> that one livepatch could be replaced atomically. It would create even
> more scenarios than allowing all livepatches in parallel.
>
> What would be the rules, please?
>
> Which functionality will be livepatched by the atomic replace, please?
>
> Which functionality will be handled by the extra non-replaceable
> livepatches, please?
>
> How would you keep the livepatches consistent, please?
>
> How would you manage dependencies between livepatches, please?
>
> What is the advantage of the hybrid model over allowing
> all livepatches in parallel, please?

I can’t answer your questions if you insist on framing this as a
parallel installation issue.

--
Regards

Yafang
