Message-ID: <Ygo3/puhZFpuX91x@linutronix.de>
Date:   Mon, 14 Feb 2022 12:07:42 +0100
From:   Sebastian Andrzej Siewior <bigeasy@...utronix.de>
To:     "Jason A. Donenfeld" <Jason@...c4.com>
Cc:     linux-kernel@...r.kernel.org, Thomas Gleixner <tglx@...utronix.de>,
        Peter Zijlstra <peterz@...radead.org>,
        Theodore Ts'o <tytso@....edu>,
        Jonathan Neuschäfer <j.neuschaefer@....net>,
        Sultan Alsawaf <sultan@...neltoast.com>,
        Dominik Brodowski <linux@...inikbrodowski.net>
Subject: Re: [PATCH] random: set fast pool count to zero in cpuhp teardown

On 2022-02-13 22:53:43 [+0100], Jason A. Donenfeld wrote:
> diff --git a/drivers/char/random.c b/drivers/char/random.c
> index 805a924b1f5f..e177d806db1d 100644
> --- a/drivers/char/random.c
> +++ b/drivers/char/random.c
> @@ -1216,6 +1216,19 @@ static void fast_mix(u32 pool[4])
>  
>  static DEFINE_PER_CPU(struct fast_pool, irq_randomness);
>  
> +int random_offline_cpu(unsigned int cpu)
> +{
> +	/*
> +	 * Set the count to zero when offlining the CPU for two
> +	 * reasons: 1) so that all new accumulated irqs are fresh
> +	 * when it comes back online, and 2) so that its worker is
> +	 * permitted to schedule again when it comes back online,
> +	 * since the MIX_INFLIGHT flag will be cleared.
> +	 */
> +	per_cpu_ptr(&irq_randomness, cpu)->count = 0;

You registered that handler between
   CPUHP_AP_ONLINE_IDLE … CPUHP_AP_ACTIVE

That means:
- Interrupts are still active, so you could get a worker scheduled
  after the handler has run (it takes 64 interrupts; see the
  paraphrased snippet below).
- That handler will be invoked on the CPU that is going offline.
- It will be invoked before the worker is unbound from the outgoing
  CPU. Even once unbound at CPUHP_AP_WORKQUEUE_ONLINE, the worker may
  remain on that CPU until the scheduler pushes it away.
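
For reference, the scheduling side in add_interrupt_randomness() looks
roughly like this (paraphrased from memory, not a verbatim quote):

	/* paraphrased sketch, not the actual random.c code */
	if (fast_pool->count & FAST_POOL_MIX_INFLIGHT)
		return;			/* a worker is already in flight */
	if (fast_pool->count < 64 &&
	    !time_after(jiffies, fast_pool->last + HZ))
		return;			/* < 64 events and window not expired */
	fast_pool->count |= FAST_POOL_MIX_INFLIGHT;
	queue_work_on(raw_smp_processor_id(), system_highpri_wq,
		      &fast_pool->mix);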

You could:
- keep it at this position, but then:
  - this_cpu_ptr() could be used, and
  - move the reset to the online phase so it is reset for the CPU
    coming up (from this point on the way down, you could still get
    >= 64 interrupts before the CPU stops serving interrupts).

- move it to CPUHP_OFFLINE … CPUHP_BRINGUP_CPU. That is invoked on
  another CPU once the outgoing CPU is dead / before it comes up.
  - in that case the function can remain as-is, but we have "rollback"
    (a sketch of this variant follows below).
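
A sketch of that latter variant (the dynamic prepare state and the
init wrapper are illustrative guesses, not from your patch):

	/* Prepare-section callbacks run on a control CPU, so the
	 * existing per_cpu_ptr(&irq_randomness, cpu)-based handler
	 * could stay as-is. */
	static int __init random_prepare_init(void)
	{
		return cpuhp_setup_state(CPUHP_BP_PREPARE_DYN,
					 "random:prepare",
					 NULL, random_offline_cpu);
	}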

One slight complication:
After the CPUHP_AP_WORKQUEUE_ONLINE state (going down), the worker is
tainted: it may be moved to another CPU, but it may also remain on the
(correct) CPU and run there. Moving the HP handler to an earlier point
has the advantage that you know for sure where the worker runs.

We also have this "rollback": the CPU goes down, then one step fails
and it rolls back online. That means if you have your handler at an
earlier point, you don't notice the rollback and the worker may have
skipped its turn. (Also, I didn't check, but I *assume* that after
CPUHP_AP_WORKQUEUE_ONLINE there is no pool bound to the CPU and any new
worker will be moved to an unbound pool.)

The more I think about it, moving the state left and right, I think
that having a CPU-up handler after CPUHP_AP_WORKQUEUE_ONLINE (like you
have now, but on the way up, not down) doing:

	fast_pool = this_cpu_ptr(&irq_randomness);
	local_irq_disable();
	/* Only reset if a worker was left in flight across the hotplug. */
	if (fast_pool->count & FAST_POOL_MIX_INFLIGHT) {
		fast_pool->count = 0;
		fast_pool->last = jiffies;
	}
	local_irq_enable();

should work just fine. This covers:
- Should a worker be scheduled while the CPU is going down and do its
  job, we are good.

- Should a worker be scheduled while the CPU is going down, be moved
  to another CPU and skip its work, then we have FAST_POOL_MIX_INFLIGHT
  set and nothing pending. This gets reset once the CPU comes back
  online. This also covers the possible rollback scenarios.

- Should a CPU have already contributed entropy, then the HP handler
  is not going to reset it.

- Should a CPU have already contributed entropy and scheduled a
  worker, then we reset the FAST_POOL_MIX_INFLIGHT bit. However, ->last
  is also reset, so no new worker is scheduled due to the time check
  and because count == 0. So either the worker runs (and does the same
  reset) _or_ we wait another 64 interrupts + a jiffy (in case the
  worker was scheduled but did nothing because it ran on another CPU).
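
For completeness, a sketch of how such an up-handler could be wired up
(the dynamic state and all names here are my assumption, not taken
from your patch):

	static int random_online_cpu(unsigned int cpu)
	{
		struct fast_pool *fast_pool = this_cpu_ptr(&irq_randomness);

		local_irq_disable();
		if (fast_pool->count & FAST_POOL_MIX_INFLIGHT) {
			fast_pool->count = 0;
			fast_pool->last = jiffies;
		}
		local_irq_enable();
		return 0;
	}

	/* CPUHP_AP_ONLINE_DYN sits after CPUHP_AP_WORKQUEUE_ONLINE, so
	 * the startup callback runs on the CPU coming up. */
	static int __init random_cpuhp_init(void)
	{
		return cpuhp_setup_state(CPUHP_AP_ONLINE_DYN,
					 "random:online",
					 random_online_cpu, NULL);
	}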

> +	return 0;
> +}
> +

Sebastian
