linux-kernel - Re: [RFC v2 PATCH 4/4] mm: pre zero out free pages to speed up page allocation for __GFP

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20210105092037.GY13207@dhcp22.suse.cz>
Date:   Tue, 5 Jan 2021 10:20:37 +0100
From:   Michal Hocko <mhocko@...e.com>
To:     Dave Hansen <dave.hansen@...el.com>
Cc:     David Hildenbrand <david@...hat.com>,
        Matthew Wilcox <willy@...radead.org>,
        Alexander Duyck <alexander.h.duyck@...ux.intel.com>,
        Mel Gorman <mgorman@...hsingularity.net>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Andrea Arcangeli <aarcange@...hat.com>,
        Dan Williams <dan.j.williams@...el.com>,
        "Michael S. Tsirkin" <mst@...hat.com>,
        Jason Wang <jasowang@...hat.com>,
        Liang Li <liliangleo@...iglobal.com>, linux-mm@...ck.org,
        linux-kernel@...r.kernel.org,
        virtualization@...ts.linux-foundation.org
Subject: Re: [RFC v2 PATCH 4/4] mm: pre zero out free pages to speed up page
 allocation for __GFP_ZERO

On Mon 04-01-21 15:00:31, Dave Hansen wrote:
> On 1/4/21 12:11 PM, David Hildenbrand wrote:
> >> Yeah, it certainly can't be the default, but it *is* useful for
> >> thing where we know that there are no cache benefits to zeroing
> >> close to where the memory is allocated.
> >> 
> >> The trick is opting into it somehow, either in a process or a VMA.
> >> 
> > The patch set is mostly trying to optimize starting a new process. So
> > process/vma doesn‘t really work.
> 
> Let's say you have a system-wide tunable that says: pre-zero pages and
> keep 10GB of them around.  Then, you opt-in a process to being allowed
> to dip into that pool with a process-wide flag or an madvise() call.
> You could even have the flag be inherited across execve() if you wanted
> to have helper apps be able to set the policy and access the pool like
> how numactl works.

While possible, it sounds quite heavy weight to me. Page allocator would
have to somehow maintain those pre-zeroed pages. This pool will also
become a very scarce resource very soon because everybody just want to
run faster. So this would open many more interesting questions.

A global knob with all or nothing sounds like an easier to use and
maintain solution to me.

> Dan makes a very good point about using filesystems for this, though.
> It wouldn't be rocket science to set up a special tmpfs mount just for
> VM memory and pre-zero it from userspace.  For qemu, you'd need to teach
> the management layer to hand out zeroed files via mem-path=.

Agreed. That would be an interesting option.

> Heck, if
> you taught MADV_FREE how to handle tmpfs, you could even pre-zero *and*
> get the memory back quickly if those files ended up over-sized somehow.

We can probably allow MADV_FREE on shmem but that would require an
exclusively mapped page. Shared case is really tricky because of silent
data corruption in uncoordinated userspace.
-- 
Michal Hocko
SUSE Labs