linux-kernel - Re: [PATCH v7 1/3] mm: Shuffle initial free memory to improve memory-side-cache utilization

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <CAPcyv4gkSBW5Te0RZLrkxzufyVq56-7pHu__YfffBiWhoqg7Yw@mail.gmail.com>
Date:   Thu, 10 Jan 2019 13:29:27 -0800
From:   Dan Williams <dan.j.williams@...el.com>
To:     Mel Gorman <mgorman@...e.de>
Cc:     Andrew Morton <akpm@...ux-foundation.org>,
        Michal Hocko <mhocko@...e.com>,
        Kees Cook <keescook@...omium.org>,
        Dave Hansen <dave.hansen@...ux.intel.com>,
        Mike Rapoport <rppt@...ux.ibm.com>,
        Keith Busch <keith.busch@...el.com>,
        Linux MM <linux-mm@...ck.org>,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH v7 1/3] mm: Shuffle initial free memory to improve
 memory-side-cache utilization

On Thu, Jan 10, 2019 at 2:57 AM Mel Gorman <mgorman@...e.de> wrote:
>
> On Mon, Jan 07, 2019 at 03:21:10PM -0800, Dan Williams wrote:
> > Randomization of the page allocator improves the average utilization of
> > a direct-mapped memory-side-cache. Memory side caching is a platform
> > capability that Linux has been previously exposed to in HPC
> > (high-performance computing) environments on specialty platforms. In
> > that instance it was a smaller pool of high-bandwidth-memory relative to
> > higher-capacity / lower-bandwidth DRAM. Now, this capability is going to
> > be found on general purpose server platforms where DRAM is a cache in
> > front of higher latency persistent memory [1].
> >
>
> So I glanced through the series and while I won't nak it, I'm not a
> major fan either so I won't ack it either.

Thanks for taking a look, some more comments / advocacy below...
because I'm not sure what Andrew will do with a "meh" response
compared to an ack.

> While there are merits to
> randomisation in terms of cache coloring, it may not be robust. IIRC, the
> main strength of randomisation vs being smart was "it's simple and usually
> doesn't fall apart completely". In particular I'd worry that compaction
> will undo all the randomisation work by moving related pages into the same
> direct-mapped lines. Furthermore, the runtime list management of "randomly
> place and head or tail of list" will have variable and non-deterministic
> outcomes and may also be undone by either high-order merging or compaction.

It's a fair point. To date we have not been able to measure the
average performance degrading over time (pages becoming more ordered)
but that said I think it would take more resources and time than I
have available for that trend to present. If it did present that would
only speak to a need to be more aggressive on the runtime
re-randomization. I think there's a case to be made to start simple
and only get more aggressive with evidence.

Note that higher order merging is not a current concern since the
implementation is already randomizing on MAX_ORDER sized pages. Since
memory side caches are so large there's no worry about a 4MB
randomization boundary.

However, for the (unproven) security use case where folks want to
experiment with randomizing on smaller granularity, they should be
wary of this (/me nudges Kees).

> As bad as it is, an ideal world would have a proper cache-coloring
> allocation algorithm but they previously failed as the runtime overhead
> exceeded the actual benefit, particularly as fully associative caches
> became more popular and there was no universal "one solution fits all". One
> hatchet job around it may be to have per-task free-lists that put free
> pages into buckets with the obvious caveat that those lists would need
> draining and secondary locking. A caveat of that is that there may need
> to be arch and/or driver hooks to detect how the colors are managed which
> could also turn into a mess.

We (Dave, I and others that took a look at this) started here, and the
"mess" looked daunting compared to randomization. Also a mess without
much more incremental benefit.

We also settled on a numa_emulation based approach for the cases where
an administrator knows they have a workload that can fit in the
cache... more on that below:

> The big plus of the series is that it's relatively simple and appears to
> be isolated enough that it only has an impact when the necessary hardware
> in place. It will deal with some cases but I'm not sure it'll survive
> long-term, particularly if HPC continues to report in the field that
> reboots are necessary to reshufffle the lists (taken from your linked
> documents). That workaround of running STREAM before a job starts and
> rebooting the machine if the performance SLAs are not met is horrid.

That workaround is horrid, and we have a separate solution for it
merged in commit cc9aec03e58f "x86/numa_emulation: Introduce uniform
split capability". When an administrator knows in advance that a
workload will fit in cache they can use this capability to run the
workload in a numa node that is guaranteed to not have cache conflicts
with itself.

Whereas randomization benefits the general cache-overcommit case. The
uniform numa split case addresses those niche users that can manually
time schedule jobs with different working set sizes... without needing
to reboot.