linux-kernel - Re: [PATCH] mm/mincore: allow for making sys

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <5c3e7de6.1c69fb81.4aebb.3fec@mx.google.com>
Date:   Tue, 15 Jan 2019 16:42:08 -0800
From:   Josh Snyder <joshs@...flix.com>
To:     Linus Torvalds <torvalds@...ux-foundation.org>
Cc:     Dominique Martinet <asmadeus@...ewreck.org>,
        Dave Chinner <david@...morbit.com>,
        Jiri Kosina <jikos@...nel.org>,
        Matthew Wilcox <willy@...radead.org>,
        Jann Horn <jannh@...gle.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Greg KH <gregkh@...uxfoundation.org>,
        Peter Zijlstra <peterz@...radead.org>,
        Michal Hocko <mhocko@...e.com>, Linux-MM <linux-mm@...ck.org>,
        kernel list <linux-kernel@...r.kernel.org>,
        Linux API <linux-api@...r.kernel.org>
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged

Linus Torvalds wrote on Thu, Jan 10, 2019:
> So right now, I consider the mincore change to be a "try to probe the
> state of mincore users", and we haven't really gotten a lot of
> information back yet.

<raises hand>

For Netflix, losing accurate information from the mincore syscall would
lengthen database cluster maintenance operations from days to months.  We
rely on cross-process mincore to migrate the contents of a page cache from
machine to machine, and across reboots.

To do this, I wrote and maintain happycache [1], a page cache dumper/loader
tool. It is quite similar in architecture to pgfincore, except that it is
agnostic to workload. The gist of happycache's operation is "produce a dump
of residence status for each page, do some operation, then reload exactly
the same pages which were present before." happycache is entirely dependent
on accurate reporting of the in-core status of file-backed pages, as
accessed by another process.

We primarily use happycache with Cassandra, which (like Postgres +
pgfincore) relies heavily on OS page cache to reduce disk accesses. Because
our workloads never experience a cold page cache, we are able to provision
hardware for a peak utilization level that is far lower than the
hypothetical "every query is a cache miss" peak.

A database warmed by happycache can be ready for service in seconds
(bounded only by the performance of the drives and the I/O subsystem), with
no period of in-service degradation. By contrast, putting a database in
service without a page cache entails a potentially unbounded period of
degradation (at Netflix, the time to populate a single node's cache via
natural cache misses varies by workload from hours to weeks). If a single
node upgrade were to take weeks, then upgrading an entire cluster would
take months. Since we want to apply security upgrades (and other things) on
a somewhat tighter schedule, we would have to develop more complex
solutions to provide the same functionality already provided by mincore.

At the bottom line, happycache is designed to benignly exploit the same
information leak documented in the paper [2]. I think it makes perfect
sense to remove cross-process mincore functionality from unprivileged
users, but not to remove it entirely.

Josh Snyder
Netflix Cloud Database Engineering

[1] https://github.com/hashbrowncipher/happycache
[2] https://arxiv.org/abs/1901.01161