linux-kernel - Re: [PATCH 2/3] livepatch/rcu: Warn when system consistency is broken in RCU code

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20170508165108.d3vd4h6ffa25bfui@treble>
Date:   Mon, 8 May 2017 11:51:08 -0500
From:   Josh Poimboeuf <jpoimboe@...hat.com>
To:     Petr Mladek <pmladek@...e.com>
Cc:     Jessica Yu <jeyu@...hat.com>, Jiri Kosina <jikos@...nel.org>,
        Miroslav Benes <mbenes@...e.cz>,
        Steven Rostedt <rostedt@...dmis.org>,
        "Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>,
        live-patching@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH 2/3] livepatch/rcu: Warn when system consistency is
 broken in RCU code

On Thu, May 04, 2017 at 12:55:15PM +0200, Petr Mladek wrote:
> RCU is not watching inside some RCU code. As a result, the livepatch
> ftrace handler might see ops->func_stack and some other flags in
> a wrong state. Then a livepatch might make the system unstable.
> 
> Note that there might be serious consequences only when the livepatch
> modifies semantic of functions used by some parts of RCU infrastructure.
> We are safe when none of the patched functions is used by this RCU code.
> Also everything is good when the functions might be changed one by one
> (using the immediate flag). See Documentation/livepatch/livepatch.txt
> for more details about the consistency model.
> 
> Fortunately, the sensitive parts of the RCU code are annotated
> and could be detected. This patch adds a warning when an insecure
> usage is detected.
> 
> Unfortunately, the ftrace handler might be called when the problematic
> patch has already been removed from ops->func stack. In this case,
> it is not able to read the immediate flag. It makes the check
> unreliable. We rather avoid it and report the problem even when
> the system stability is not affected.
> 
> It would be possible to add some more complex logic to avoid
> warnings when RCU infrastructure is modified using immediate
> patches. But let's keep it simple until real life experience
> forces us to do the opposite.
> 
> This patch is inspired by similar problems solved in stack
> tracer, see
> https://lkml.kernel.org/r/20170412115304.3077dbc8@gandalf.local.home
> 
> Signed-off-by: Petr Mladek <pmladek@...e.com>
> ---
>  Documentation/livepatch/livepatch.txt | 15 +++++++++++++++
>  kernel/livepatch/patch.c              |  8 ++++++++
>  2 files changed, 23 insertions(+)
> 
> diff --git a/Documentation/livepatch/livepatch.txt b/Documentation/livepatch/livepatch.txt
> index ecdb18104ab0..92619f7d86fd 100644
> --- a/Documentation/livepatch/livepatch.txt
> +++ b/Documentation/livepatch/livepatch.txt
> @@ -476,6 +476,21 @@ The current Livepatch implementation has several limitations:
>      by "notrace".
>  
>  
> +  + Limited patching of RCU infrastructure
> +
> +    The livepatch ftrace handler uses RCU for handling the stack of patches.
> +    Also the ftrace framework uses RCU to detect when the handlers are not
> +    longer in use.
> +
> +    The directly used RCU functions could not be patched. Otherwise,
> +    there would be an infinite recursion.
> +
> +    Some other RCU functions can not be patched safely because the RCU
> +    framework might be in an inconsistent state. This context is annotated
> +    and warning is printed. In theory, it might be safe to modify such
> +    functions using immediate patches. But this is hard to detect properly
> +    in the ftrace handler, so the warning is always printed.
> +
>    + Livepatch works reliably only when the dynamic ftrace is located at
>      the very beginning of the function.
> diff --git a/kernel/livepatch/patch.c b/kernel/livepatch/patch.c
> index 4c4fbe409008..ffdf5fa8005b 100644
> --- a/kernel/livepatch/patch.c
> +++ b/kernel/livepatch/patch.c
> @@ -62,6 +62,14 @@ static void notrace klp_ftrace_handler(unsigned long ip,
>  	/* RCU may not be watching, make it see us. */
>  	rcu_irq_enter_irqson();
>  
> +	/*
> +	 * RCU still might not see us if we patch a function inside
> +	 * the RCU infrastructure. Then we might see wrong state of
> +	 * func->stack and other flags.
> +	 */
> +	if (unlikely(!rcu_is_watching()))
> +		WARN_ONCE(1, "Livepatch modified a function that can not be handled a safe way.!");
> +
>  	rcu_read_lock();
>  
>  	func = list_first_or_null_rcu(&ops->func_stack, struct klp_func,

So the ftrace stack tracer seems to do this check a little differently.
It calls rcu_irq_enter_disabled() first, and then calls rcu_irq_enter().
Any reason we're not doing it that way?

The warning would be more helpful if it printed a little more
information, like the ip and the parent_ip.  And it should probably
mention that RCU is broken.

Also I wonder if we can constrain the warning somehow.  I think the
warning only applies if a patch is in progress, right?  In that case, if
RCU is broken, would it be feasible to mutex_trylock() the klp mutex to
try to ensure that no patches are being applied while the ftrace handler
is running?  Then it wouldn't matter if RCU were broken because the func
stack wouldn't be changing anyway.  Then it could only warn if it failed
to get the mutex.

Stepping back a bit, the documentation and comments describe patches to
functions inside the RCU infrastructure.  As far as I can tell, only a
single function would be affected: rcu_dynticks_eqs_enter().  Because
it's the only function called after the "Breaks tracing momentarily"
comment in rcu_eqs_enter_common().  Any reason why we couldn't just
annotate rcu_dynticks_eqs_enter() with notrace?

Stepping back even further, if I'm understanding this issue correctly,
this warning can also affect patches to functions which are called from
NMI context.  If the NMI occurs in the part of rcu_eqs_enter_common()
where rcu_irq_enter() doesn't work, then RCU won't work here, right?  If
so, that worries me because there are a lot of functions which can be
called (and patched) from NMI context.

I wonder if there's some way to solve this by changing RCU code, but I'm
not familiar enough with RCU to have any ideas there.

Another idea would be to figure out a way to stop using RCU in
klp_ftrace_handler() altogether.

-- 
Josh