Message-ID: <5603004A.20801@gmail.com>
Date: Wed, 23 Sep 2015 15:40:58 -0400
From: Austin S Hemmelgarn <ahferroin7@...il.com>
To: Andi Kleen <andi@...stfloor.org>, tytso@....edu
Cc: linux-kernel@...r.kernel.org, kirill.shutemov@...ux.intel.com,
herbert@...dor.apana.org.au, Andi Kleen <ak@...ux.intel.com>
Subject: Re: [PATCH 1/3] Make /dev/urandom scalable
On 2015-09-22 19:16, Andi Kleen wrote:
> From: Andi Kleen <ak@...ux.intel.com>
>
> We had a case where a 4 socket system spent >80% of its total CPU time
> contending on the global urandom nonblocking pool spinlock. While the
> application could probably have used its own PRNG, it may have valid
> reasons to use the best possible key for different session keys.
>
> The application still ran acceptably under 2S, but just fell over
> the locking cliff on 4S.
>
> Implementation
> ==============
>
> The non blocking pool is used widely these days, from every execve() (to
> set up AT_RANDOM for ld.so randomization), to getrandom(2) and to frequent
> /dev/urandom users in user space. Clearly having such a popular resource
> under a global lock is a bad thing.
>
> This patch changes the random driver to use distributed per NUMA node
> nonblocking pools. The basic structure is not changed: entropy is
> first fed into the input pool and later from there distributed
> round-robin into the blocking and non blocking pools. This patch extends
> this to use a dedicated non blocking pool for each node, and distribute
> evenly from the input pool into these distributed pools, in
> addition to the blocking pool.
>
> Then every urandom/getrandom user fetches data from its node local
> pool. At boot time when users may be still waiting for the non
> blocking pool initialization we use the node 0 non blocking pool,
> to avoid the need for different wake up queues.
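To make the lookup described above concrete, here is a minimal userspace sketch, not the patch's code. All names (struct nb_pool, struct nb_pools, select_nb_pool(), MAX_NODES) are hypothetical, and I am assuming the node 0 fallback is keyed on the local pool not being initialized yet.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_NODES 8	/* hypothetical upper bound for this sketch */

/* Hypothetical stand-in for one per-node non blocking pool. */
struct nb_pool {
	size_t node;		/* NUMA node this pool belongs to */
	bool initialized;	/* has the pool been seeded yet? */
};

struct nb_pools {
	struct nb_pool pool[MAX_NODES];
	size_t nr_nodes;	/* 1 on single node systems */
};

/*
 * Pick the pool a reader running on `node` should draw from.
 * Assumption: until the local pool is initialized (early boot),
 * or if the node id is out of range, fall back to node 0.
 */
static struct nb_pool *select_nb_pool(struct nb_pools *p, size_t node)
{
	assert(p->nr_nodes > 0 && p->nr_nodes <= MAX_NODES);

	if (node >= p->nr_nodes || !p->pool[node].initialized)
		return &p->pool[0];
	return &p->pool[node];
}
```

With nr_nodes == 1 this always returns pool 0, which matches the "nothing changes" case for single node systems.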
>
> For single node systems (like the vast majority of non server systems)
> nothing changes. There is still only a single non blocking pool.
>
> The different per-node pools also start with different start
> states and diverge more and more over time, as they are
> fed different input data. So "replay" attacks are
> difficult after some time.
I really like this idea, as it both makes getting random numbers on busy
servers faster, and makes replay attacks more difficult.
>
> Without hardware random number seed support the start states
> (until enough real entropy is collected) are not very random, but
> that's not worse than before.
>
> Since we still have a global input pool there are no problems
> with load balancing entropy data between nodes. Any node that never
> runs any interrupts would still get the same amount of entropy as
> other nodes.
>
> Entropy is fed preferably to nodes that need it more using
> the existing 75% threshold.
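For anyone following along, a rough userspace sketch of how I read "round-robin, but prefer pools below the 75% threshold". This is not the patch's code: struct pool_state, needs_entropy() and pick_target() are hypothetical names, and falling back to plain round-robin when every pool is above the threshold is my assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bookkeeping for one output pool (blocking or per-node). */
struct pool_state {
	unsigned int entropy_bits;	/* current entropy estimate */
	unsigned int capacity_bits;	/* pool size in bits */
};

/* True when the pool holds less than 75% of its capacity. */
static int needs_entropy(const struct pool_state *p)
{
	/* 4 * entropy < 3 * capacity  <=>  entropy < 0.75 * capacity */
	return 4ULL * p->entropy_bits < 3ULL * p->capacity_bits;
}

/*
 * Choose which of the n pools receives the next transfer from the
 * input pool. Scans round-robin starting at *cursor and takes the
 * first pool below the threshold. If none is below it, the pool at
 * the cursor is used so the rotation still advances.
 */
static size_t pick_target(const struct pool_state *pools, size_t n,
			  size_t *cursor)
{
	size_t start, i;

	assert(n > 0);
	start = *cursor % n;

	for (i = 0; i < n; i++) {
		size_t idx = (start + i) % n;

		if (needs_entropy(&pools[idx])) {
			*cursor = (idx + 1) % n;
			return idx;
		}
	}
	*cursor = (start + 1) % n;
	return start;
}
```

The caller keeps one cursor for the whole set of pools, so a node that never drains its pool is simply skipped until it drops below the threshold again.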
>
> For saving/restoring /dev/urandom, there is currently no mechanism
> to access the non local node pool (short of setting task affinity).
> This implies that currently the standard init/exit random save/restore
> scripts would only save node 0. On restore all pools are updated.
> So the entropy of the non-0 nodes gets lost over reboot. That seems
> acceptable to me for now (fixing this would need a new separate
> save/restore interface)
I agree that this is acceptable. It wouldn't be hard for someone who
wants to just modify the script to set its own task affinity and
loop through the nodes (although that might get confusing with
hot-plugged/hot-removed nodes).
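Something along these lines is what I have in mind, as an untested sketch rather than a drop-in for any distribution's script. It assumes the usual sysfs layout (/sys/devices/system/node/nodeN/cpulist) and that a read from /dev/urandom is served from the pool of the node the task is running on; parse_cpulist() and read_node_urandom() are names I made up.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Parse a sysfs cpulist such as "0-3,8-11" into a cpu_set_t.
 * Returns 0 on success, -1 on malformed input.
 */
static int parse_cpulist(const char *s, cpu_set_t *set)
{
	assert(s != NULL && set != NULL);
	CPU_ZERO(set);

	while (*s && *s != '\n') {
		char *end;
		long lo = strtol(s, &end, 10);
		long hi, c;

		if (end == s || lo < 0)
			return -1;
		hi = lo;
		if (*end == '-') {		/* range "lo-hi" */
			s = end + 1;
			hi = strtol(s, &end, 10);
			if (end == s || hi < lo)
				return -1;
		}
		if (hi >= CPU_SETSIZE)
			return -1;
		for (c = lo; c <= hi; c++)
			CPU_SET((int)c, set);

		s = end;
		if (*s == ',')
			s++;
		else if (*s && *s != '\n')
			return -1;
	}
	return 0;
}

/*
 * Pin the calling thread to the CPUs of `node`, then read `len`
 * bytes from /dev/urandom. Returns 0 on success, -1 on any error
 * (including a node with no CPUs, or one that has been removed).
 */
static int read_node_urandom(int node, unsigned char *buf, size_t len)
{
	char path[64], line[4096];
	cpu_set_t set;
	size_t got;
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/devices/system/node/node%d/cpulist", node);
	f = fopen(path, "r");
	if (!f)
		return -1;
	if (!fgets(line, sizeof(line), f)) {
		fclose(f);
		return -1;
	}
	fclose(f);

	if (parse_cpulist(line, &set) != 0 || CPU_COUNT(&set) == 0)
		return -1;
	if (sched_setaffinity(0, sizeof(set), &set) != 0)
		return -1;

	f = fopen("/dev/urandom", "rb");
	if (!f)
		return -1;
	got = fread(buf, 1, len, f);
	fclose(f);
	return got == len ? 0 : -1;
}
```

A save script would call read_node_urandom() once per node directory it finds under /sys/devices/system/node and write each buffer to its own seed file. Treating a failed node as "skip it" is probably the simplest way to cope with hot-removal.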
>
> Scalability
> ===========
>
> I tested the patch with a simple will-it-scale test banging
> on get_random() in parallel on more and more CPUs. Of course
> that is not a realistic scenario, as real programs should
> do some work between getting random numbers. But it's a worst
> case for the random scalability.
>
> On a 4S Xeon v3 system _without_ the patchkit the benchmark
> maxes out when using all the threads on one node. After
> that it quickly settles to about half the throughput of
> one node with 2-4 nodes.
>
> (all throughput factors, bigger is better)
> Without patchkit:
>
> 1 node: 1x
> 2 nodes: 0.75x
> 3 nodes: 0.55x
> 4 nodes: 0.42x
>
> With the patchkit applied:
>
> 1 node: 1x
> 2 nodes: 2x
> 3 nodes: 3.4x
> 4 nodes: 6x
>
> So it's not quite linear scalability, but 6x maximum throughput
> is already a lot better.
>
> A node can still have a large number of CPUs: on my test system 36
> logical software threads (18C * 2T). In principle it may make
> sense to split it up further. Per logical CPU would be clearly
> overkill. But that would also add more pressure on the input
> pools. For now per node seems like an acceptable compromise.
I'd almost say that making the partitioning level configurable at build
time might be useful. I can see possible value in being able to at
least partition down to physical cores (so, shared between HyperThreads
on Intel processors, and between Compute Module cores on AMD
processors), as that could potentially help people running large numbers
of simulations in parallel.
Personally, I'm the type who would be willing to take the performance
hit to do it per logical CPU just for the fact that it would make replay
attacks more difficult, but I'm probably part of a very small minority
in that case.
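To illustrate what I mean by a configurable partitioning level, a purely hypothetical sketch (no such option exists in the patch, and enum pool_granularity, struct cpu_topo and pool_index() are invented for this example). In the kernel the granularity would presumably be fixed by a build-time option rather than passed as an argument.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical partitioning levels for the non blocking pools. */
enum pool_granularity {
	POOL_PER_NODE,	/* what the patch does */
	POOL_PER_CORE,	/* shared between sibling threads of one core */
	POOL_PER_CPU,	/* one pool per logical CPU */
};

/* Hypothetical topology description of the CPU a reader runs on. */
struct cpu_topo {
	unsigned int node;	/* NUMA node id */
	unsigned int core;	/* system-wide physical core id */
	unsigned int cpu;	/* logical CPU id */
};

/* Map a CPU to the index of the pool it should read from. */
static unsigned int pool_index(enum pool_granularity g,
			       const struct cpu_topo *t)
{
	assert(t != NULL);

	switch (g) {
	case POOL_PER_CORE:
		return t->core;
	case POOL_PER_CPU:
		return t->cpu;
	case POOL_PER_NODE:
	default:
		return t->node;
	}
}
```

The finer the level, the more pools the single input pool has to feed, which is the pressure on the input pool mentioned above.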
>
> /dev/random still uses a single global lock. For now that seems
> acceptable as it normally cannot be used for real high volume
> accesses anyways.
>
> The input pool also still uses a global lock. The existing per CPU
> fast pool and "give up when busy" mechanisms seem to scale well enough
> even on larger systems.
>