Message-ID: <20250207023116.wx4i3n7ks3q2hfpu@jpoimboe>
Date: Thu, 6 Feb 2025 18:31:16 -0800
From: Josh Poimboeuf <jpoimboe@...nel.org>
To: Yafang Shao <laoar.shao@...il.com>
Cc: jikos@...nel.org, mbenes@...e.cz, pmladek@...e.com,
joe.lawrence@...hat.com, live-patching@...r.kernel.org,
linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH 2/2] livepatch: Implement livepatch hybrid mode

On Mon, Jan 27, 2025 at 02:35:26PM +0800, Yafang Shao wrote:
> The atomic replace livepatch mechanism was introduced to handle scenarios
> where we want to unload a specific livepatch without unloading others.
> However, its current implementation has significant shortcomings, making
> it less than ideal in practice. Below are the key downsides:
>
> - It is expensive
>
> During testing with frequent replacements of an old livepatch, random RCU
> warnings were observed:
>
> [19578271.779605] rcu_tasks_wait_gp: rcu_tasks grace period 642409 is 10024 jiffies old.
> [19578390.073790] rcu_tasks_wait_gp: rcu_tasks grace period 642417 is 10185 jiffies old.
> [19578423.034065] rcu_tasks_wait_gp: rcu_tasks grace period 642421 is 10150 jiffies old.
> [19578564.144591] rcu_tasks_wait_gp: rcu_tasks grace period 642449 is 10174 jiffies old.
> [19578601.064614] rcu_tasks_wait_gp: rcu_tasks grace period 642453 is 10168 jiffies old.
> [19578663.920123] rcu_tasks_wait_gp: rcu_tasks grace period 642469 is 10167 jiffies old.
> [19578872.990496] rcu_tasks_wait_gp: rcu_tasks grace period 642529 is 10215 jiffies old.
> [19578903.190292] rcu_tasks_wait_gp: rcu_tasks grace period 642529 is 40415 jiffies old.
> [19579017.965500] rcu_tasks_wait_gp: rcu_tasks grace period 642577 is 10174 jiffies old.
> [19579033.981425] rcu_tasks_wait_gp: rcu_tasks grace period 642581 is 10143 jiffies old.
> [19579153.092599] rcu_tasks_wait_gp: rcu_tasks grace period 642625 is 10188 jiffies old.
>
> This indicates that atomic replacement can cause performance issues,
> particularly with RCU synchronization under frequent use.

Why does this happen?

> - Potential Risks During Replacement
>
> One known issue involves replacing livepatched versions of critical
> functions such as do_exit(). During the replacement process, a panic
> might occur, as highlighted in [0]. Other potential risks may also arise
> due to inconsistencies or race conditions during transitions.

That needs to be fixed.

> - Temporary Loss of Patching
>
> During the replacement process, the old patch is set to a NOP (no-operation)
> before the new patch is fully applied. This creates a window where the
> function temporarily reverts to its original, unpatched state. If the old
> patch fixed a critical issue (e.g., one that prevented a system panic), the
> system could become vulnerable to that issue during the transition.

Are you saying that atomic replace is not atomic? If so, this sounds
like another bug.

--
Josh