Message-ID: <20231108093645.GL8262@noisy.programming.kicks-ass.net>
Date:   Wed, 8 Nov 2023 10:36:45 +0100
From:   Peter Zijlstra <peterz@...radead.org>
To:     Ankur Arora <ankur.a.arora@...cle.com>
Cc:     linux-kernel@...r.kernel.org, tglx@...utronix.de,
        torvalds@...ux-foundation.org, paulmck@...nel.org,
        linux-mm@...ck.org, x86@...nel.org, akpm@...ux-foundation.org,
        luto@...nel.org, bp@...en8.de, dave.hansen@...ux.intel.com,
        hpa@...or.com, mingo@...hat.com, juri.lelli@...hat.com,
        vincent.guittot@...aro.org, willy@...radead.org, mgorman@...e.de,
        jon.grimm@....com, bharata@....com, raghavendra.kt@....com,
        boris.ostrovsky@...cle.com, konrad.wilk@...cle.com,
        jgross@...e.com, andrew.cooper3@...rix.com, mingo@...nel.org,
        bristot@...nel.org, mathieu.desnoyers@...icios.com,
        geert@...ux-m68k.org, glaubitz@...sik.fu-berlin.de,
        anton.ivanov@...bridgegreys.com, mattst88@...il.com,
        krypton@...ich-teichert.org, rostedt@...dmis.org,
        David.Laight@...lab.com, richard@....at, mjguzik@...il.com
Subject: Re: [RFC PATCH 41/86] sched: handle resched policy in resched_curr()

On Tue, Nov 07, 2023 at 01:57:27PM -0800, Ankur Arora wrote:

> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1027,13 +1027,13 @@ void wake_up_q(struct wake_q_head *head)
>  }
>  
>  /*
> - * resched_curr - mark rq's current task 'to be rescheduled now'.
> + * __resched_curr - mark rq's current task 'to be rescheduled'.
>   *
> - * On UP this means the setting of the need_resched flag, on SMP it
> - * might also involve a cross-CPU call to trigger the scheduler on
> - * the target CPU.
> + * On UP this means the setting of the need_resched flag, on SMP, for
> + * eager resched it might also involve a cross-CPU call to trigger
> + * the scheduler on the target CPU.
>   */
> -void resched_curr(struct rq *rq)
> +void __resched_curr(struct rq *rq, resched_t rs)
>  {
>  	struct task_struct *curr = rq->curr;
>  	int cpu;
> @@ -1046,17 +1046,77 @@ void resched_curr(struct rq *rq)
>  	cpu = cpu_of(rq);
>  
>  	if (cpu == smp_processor_id()) {
> -		set_tsk_need_resched(curr, RESCHED_eager);
> -		set_preempt_need_resched();
> +		set_tsk_need_resched(curr, rs);
> +		if (rs == RESCHED_eager)
> +			set_preempt_need_resched();
>  		return;
>  	}
>  
> -	if (set_nr_and_not_polling(curr, RESCHED_eager))
> -		smp_send_reschedule(cpu);
> -	else
> +	if (set_nr_and_not_polling(curr, rs)) {
> +		if (rs == RESCHED_eager)
> +			smp_send_reschedule(cpu);

I think you just broke things.

Not all idle threads have POLLING support, in which case you need that
IPI to wake them up, even if it's LAZY.

> +	} else if (rs == RESCHED_eager)
>  		trace_sched_wake_idle_without_ipi(cpu);
>  }
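
Concretely, something along these lines would keep that wakeup. This is
an untested sketch against the helpers in this patch, and is_idle_task()
is just one way of expressing "non-polling idle still needs the kick":

	if (set_nr_and_not_polling(curr, rs)) {
		/*
		 * Target was not polling on need_resched, so setting
		 * the flag alone is not enough: an idle CPU without
		 * TIF_POLLING_NRFLAG only notices it via the IPI.
		 */
		if (rs == RESCHED_eager || is_idle_task(curr))
			smp_send_reschedule(cpu);
	} else if (rs == RESCHED_eager) {
		trace_sched_wake_idle_without_ipi(cpu);
	}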



>  
> +/*
> + * resched_curr - mark rq's current task 'to be rescheduled' eagerly
> + * or lazily according to the current policy.
> + *
> + * Always schedule eagerly, if:
> + *
> + *  - running under full preemption
> + *
> + *  - idle: when not polling (or if we don't have TIF_POLLING_NRFLAG)
> + *    force TIF_NEED_RESCHED to be set and send a resched IPI.
> + *    (the polling case has already set TIF_NEED_RESCHED via
> + *     set_nr_if_polling()).
> + *
> + *  - in userspace: run to completion semantics are only for kernel tasks
> + *
> + * Otherwise (regardless of priority), run to completion.
> + */
> +void resched_curr(struct rq *rq)
> +{
> +	resched_t rs = RESCHED_lazy;
> +	int context;
> +
> +	if (IS_ENABLED(CONFIG_PREEMPT) ||
> +	    (rq->curr->sched_class == &idle_sched_class)) {
> +		rs = RESCHED_eager;
> +		goto resched;
> +	}
> +
> +	/*
> +	 * We might race with the target CPU while checking its ct_state:
> +	 *
> +	 * 1. The task might have just entered the kernel, but has not yet
> +	 * called user_exit(). We will see stale state (CONTEXT_USER) and
> +	 * send an unnecessary resched-IPI.
> +	 *
> +	 * 2. The user task is through with exit_to_user_mode_loop() but has
> +	 * not yet called user_enter().
> +	 *
> +	 * We'll see the thread's state as CONTEXT_KERNEL and will try to
> +	 * schedule it lazily. There's obviously nothing that will handle
> +	 * this need-resched bit until the thread enters the kernel next.
> +	 *
> +	 * The scheduler will still do tick accounting, but a potentially
> +	 * higher priority task waited to be scheduled for a user tick,
> +	 * instead of execution time in the kernel.
> +	 */
> +	context = ct_state_cpu(cpu_of(rq));
> +	if ((context == CONTEXT_USER) ||
> +	    (context == CONTEXT_GUEST)) {
> +
> +		rs = RESCHED_eager;
> +		goto resched;
> +	}

As I said, this simply cannot be. You must not rely on the remote CPU
being in any particular state. Also, it's inherently racy: you could
observe CONTEXT_USER and then the task enters CONTEXT_KERNEL.
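
Both directions of that window are described in the comment above; the
second one is the damaging case. As an illustrative interleaving (names
as in this patch, not real code):

	CPU A: resched_curr(rq)		CPU B: rq->curr (target)
	-----------------------	 	------------------------
	context = ct_state_cpu(B);
	  /* observes CONTEXT_KERNEL */
					exit_to_user_mode_loop() done
					user_enter(); /* CONTEXT_USER */
	rs = RESCHED_lazy;
	__resched_curr(rq, rs);
	  /* lazy bit set, no IPI; nothing
	     acts on it until B next enters
	     the kernel */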

> +
> +resched:
> +	__resched_curr(rq, rs);
> +}
> +
>  void resched_cpu(int cpu)
>  {
>  	struct rq *rq = cpu_rq(cpu);
