linux-kernel - Re: [RFC][PATCH v2 09/11] context_tracking,livepatch: Dont disturb NOHZ

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <YV1aYaHEynjSAUuI@alley>
Date:   Wed, 6 Oct 2021 10:12:17 +0200
From:   Petr Mladek <pmladek@...e.com>
To:     Peter Zijlstra <peterz@...radead.org>
Cc:     gor@...ux.ibm.com, jpoimboe@...hat.com, jikos@...nel.org,
        mbenes@...e.cz, mingo@...nel.org, linux-kernel@...r.kernel.org,
        joe.lawrence@...hat.com, fweisbec@...il.com, tglx@...utronix.de,
        hca@...ux.ibm.com, svens@...ux.ibm.com, sumanthk@...ux.ibm.com,
        live-patching@...r.kernel.org, paulmck@...nel.org,
        rostedt@...dmis.org, x86@...nel.org
Subject: Re: [RFC][PATCH v2 09/11] context_tracking,livepatch: Dont disturb
 NOHZ_FULL

On Wed 2021-09-29 17:17:32, Peter Zijlstra wrote:
> Using the new context_tracking infrastructure, avoid disturbing
> userspace tasks when context tracking is enabled.
> 
> When context_tracking_set_cpu_work() returns true, we have the
> guarantee that klp_update_patch_state() is called from noinstr code
> before it runs normal kernel code. This covers
> syscall/exceptions/interrupts and NMI entry.

This patch touches the most tricky (lockless) parts of the livepatch code.
I always have to refresh my head about all the dependencies.

Sigh, I guess that the livepatch code looks over complicated to you.

The main problem is that we want to migrate tasks only when they
are not inside any livepatched function. It allows to do semantic
changes which is needed by some sort of critical security fixes.


> --- a/kernel/context_tracking.c
> +++ b/kernel/context_tracking.c
> @@ -55,15 +56,13 @@ static noinstr void ct_exit_user_work(struct
>  {
>  	unsigned int work = arch_atomic_read(&ct->work);
>  
> -#if 0
> -	if (work & CT_WORK_n) {
> +	if (work & CT_WORK_KLP) {
>  		/* NMI happens here and must still do/finish CT_WORK_n */
> -		do_work_n();
> +		__klp_update_patch_state(current);
>  
>  		smp_mb__before_atomic();
> -		arch_atomic_andnot(CT_WORK_n, &ct->work);
> +		arch_atomic_andnot(CT_WORK_KLP, &ct->work);
>  	}
> -#endif
>  
>  	smp_mb__before_atomic();
>  	arch_atomic_andnot(CT_SEQ_WORK, &ct->seq);
> --- a/kernel/livepatch/transition.c
> +++ b/kernel/livepatch/transition.c
> @@ -153,6 +154,11 @@ void klp_cancel_transition(void)
>  	klp_complete_transition();
>  }
>  
> +noinstr void __klp_update_patch_state(struct task_struct *task)
> +{
> +	task->patch_state = READ_ONCE(klp_target_state);
> +}
> +
>  /*
>   * Switch the patched state of the task to the set of functions in the target
>   * patch state.
> @@ -180,8 +186,10 @@ void klp_update_patch_state(struct task_
>  	 *    of func->transition, if klp_ftrace_handler() is called later on
>  	 *    the same CPU.  See __klp_disable_patch().
>  	 */
> -	if (test_and_clear_tsk_thread_flag(task, TIF_PATCH_PENDING))
> +	if (test_tsk_thread_flag(task, TIF_PATCH_PENDING)) {

This would require smp_rmb() here. It will make sure that we will
read the right @klp_target_state when TIF_PATCH_PENDING is set.

, where @klp_target_state is set in klp_init_transition()
  and TIF_PATCH_PENDING is set in klp_start_transition()

There are actually two related smp_wmp() barriers between these two
assignments in:

	1st in klp_init_transition()
	2nd in __klp_enable_patch()

One would be enough for klp_update_patch_state(). But we need
both for klp_ftrace_handler(), see the smp_rmb() there.
In particular, they synchronize:

   + ops->func_stack vs.
   + func->transition vs.
   + current->patch_state


>  		task->patch_state = READ_ONCE(klp_target_state);

Note that smp_wmb() is not needed here because
klp_complete_transition() calls klp_synchronize_transition()
aka synchronize_rcu() before clearing klp_target_state.
This is why the original code worked.


> +		clear_tsk_thread_flag(task, TIF_PATCH_PENDING);
> +	}
>  
>  	preempt_enable_notrace();
>  }
> @@ -270,15 +278,30 @@ static int klp_check_and_switch_task(str
>  {
>  	int ret;
>  
> -	if (task_curr(task))
> +	if (task_curr(task)) {
> +		/*
> +		 * This only succeeds when the task is in NOHZ_FULL user
> +		 * mode, the true return value guarantees any kernel entry
> +		 * will call klp_update_patch_state().
> +		 *
> +		 * XXX: ideally we'd simply return 0 here and leave
> +		 * TIF_PATCH_PENDING alone, to be fixed up by
> +		 * klp_update_patch_state(), except livepatching goes wobbly
> +		 * with 'pending' TIF bits on.
> +		 */
> +		if (context_tracking_set_cpu_work(task_cpu(task), CT_WORK_KLP))
> +			goto clear;

If I get it correctly, this will clear TIF_PATCH_PENDING immediately
but task->patch_state = READ_ONCE(klp_target_state) will be
done later by ct_exit_user_work().

This is a bit problematic:

  1. The global @klp_target_state is set to KLP_UNDEFINED when all
     processes have TIF_PATCH_PENDING is cleared. This is actually
     still fine because func->transition is cleared as well.
     As a result, current->patch_state is ignored in klp_ftrace_handler.

  2. The real problem happens when another livepatch is enabled.
     The global @klp_target_state is set to new value and
     func->transition is set again. In this case, the delayed
     ct_exit_user_work() might assign wrong value that might
     really be used by klp_ftrace_handler().


IMHO, the original solution from v1 was better. We only needed to
be careful when updating task->patch_state and clearing
TIF_PATCH_PENDING to avoid the race.

The following might work:

static int klp_check_and_switch_task(struct task_struct *task, void *arg)
{
	int ret;

	/*
	 * Stack is reliable only when the task is not running on any CPU,
	 * except for the task running this code.
	 */
	if (task_curr(task) && task != current) {
		/*
		 * This only succeeds when the task is in NOHZ_FULL user
		 * mode. Such a task might be migrated immediately. We
		 * only need to be careful to set task->patch_state before
		 * clearing TIF_PATCH_PENDING so that the task migrates
		 * itself when entring kernel in the meatime.
		 */
		if (is_ct_user(task)) {
			klp_update_patch_state(task);
			return 0;
		}

		return -EBUSY;
	}

	ret = klp_check_stack(task, arg);
	if (ret)
		return ret;

	/*
	 * The task neither is running on any CPU and nor it can get
	 * running. As a result, the ordering is not important and
	 * barrier is not needed.
	 */
	task->patch_state = klp_target_state;
	clear_tsk_thread_flag(task, TIF_PATCH_PENDING);

	return 0;
}

, where is_ct_user(task) would return true when task is running in
CONTEXT_USER. If I get the context_tracking API correctly then
it might be implemeted the following way:


#ifdef CONFIG_CONTEXT_TRACKING

/*
 * XXX: The value is reliable depending the context where this is called.
 * At least migrating between CPUs should get prevented.
 */
static __always_inline bool is_ct_user(struct task_struct *task)
{
	int seq;

	if (!context_tracking_enabled())
		return false;

	seq = __context_tracking_cpu_seq(task_cpu(task));
	return __context_tracking_seq_in_user(seq);
}

#else

static __always_inline bool is_ct_user(struct task_struct *task)
{
	return false;
}

#endif /* CONFIG_CONTEXT_TRACKING */

Best Regards,
Petr

>  		return -EBUSY;
> +	}
>  
>  	ret = klp_check_stack(task, arg);
>  	if (ret)
>  		return ret;
>  
> -	clear_tsk_thread_flag(task, TIF_PATCH_PENDING);
>  	task->patch_state = klp_target_state;
> +clear:
> +	clear_tsk_thread_flag(task, TIF_PATCH_PENDING);
>  	return 0;
>  }