Message-ID: <20200917200638.GM8409@ziepe.ca>
Date:   Thu, 17 Sep 2020 17:06:38 -0300
From:   Jason Gunthorpe <jgg@...pe.ca>
To:     Linus Torvalds <torvalds@...ux-foundation.org>
Cc:     Peter Xu <peterx@...hat.com>, John Hubbard <jhubbard@...dia.com>,
        Leon Romanovsky <leonro@...dia.com>,
        Linux-MM <linux-mm@...ck.org>,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
        "Maya B . Gokhale" <gokhale2@...l.gov>,
        Yang Shi <yang.shi@...ux.alibaba.com>,
        Marty Mcfadden <mcfadden8@...l.gov>,
        Kirill Shutemov <kirill@...temov.name>,
        Oleg Nesterov <oleg@...hat.com>, Jann Horn <jannh@...gle.com>,
        Jan Kara <jack@...e.cz>, Kirill Tkhai <ktkhai@...tuozzo.com>,
        Andrea Arcangeli <aarcange@...hat.com>,
        Christoph Hellwig <hch@....de>,
        Andrew Morton <akpm@...ux-foundation.org>
Subject: Re: [PATCH 1/4] mm: Trial do_wp_page() simplification

On Thu, Sep 17, 2020 at 12:42:11PM -0700, Linus Torvalds wrote:

> Because the whole "do page pinning without MADV_DONTFORK and then fork
> the area" is I feel a very very invalid load. It sure as hell isn't
> something we should care about performance for, and in fact it is
> something we should very well warn for exactly to let people know
> "this process is doing bad things".

It is easy for things like io_uring that can just allocate the queue
memory they care about and MADV_DONTFORK it.
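
A minimal sketch of that easy case, assuming the application owns the
allocation. The helper name alloc_dontfork_buffer() is hypothetical;
mmap(), madvise() and MADV_DONTFORK are the standard Linux interfaces:

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE   /* expose MAP_ANONYMOUS and MADV_DONTFORK */
#endif
#include <stddef.h>
#include <sys/mman.h>

/*
 * Hypothetical helper: allocate an anonymous, page-aligned buffer and
 * mark it MADV_DONTFORK so that a child created by fork() does not
 * inherit the mapping. Returns NULL on failure.
 */
static void *alloc_dontfork_buffer(size_t len)
{
    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return NULL;

    /* The whole mapping is ours, so no existing vma needs to be split. */
    if (madvise(buf, len, MADV_DONTFORK) != 0) {
        munmap(buf, len);
        return NULL;
    }
    return buf;
}
```

Because the buffer is a fresh mapping of its own, the advice covers
exactly one vma and nothing else in the process is affected.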

Other things work more like O_DIRECT - the data they are working on is
arbitrary app memory, not controlled in any way.

In RDMA we have this ugly scheme where we automatically call
MADV_DONTFORK on the virtual address and hope it doesn't explode. It
is very hard to call MADV_DONTFORK if you don't control the
allocation. Don't want to break huge pages, have to hope really really
hard that a fork doesn't need that memory. Hope you don't run out of
vmas because it causes a vma split. So ugly. So much overhead.
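
To make the awkwardness concrete, here is a rough sketch of applying
MADV_DONTFORK to memory the caller did not allocate. The helper name
dontfork_range() is hypothetical and this is not the actual RDMA
userspace code, only an illustration of the page rounding involved:

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE   /* expose MADV_DONTFORK */
#endif
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: widen [addr, addr + len) to page boundaries and
 * apply MADV_DONTFORK to the result. madvise() requires a page-aligned
 * start, so neighbouring data sharing the first or last page is hidden
 * from children too. Returns 0 on success, -1 on error.
 */
static int dontfork_range(void *addr, size_t len)
{
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0 || len == 0)
        return -1;

    uintptr_t mask  = (uintptr_t)page - 1;
    uintptr_t start = (uintptr_t)addr & ~mask;               /* round down */
    uintptr_t end   = ((uintptr_t)addr + len + mask) & ~mask; /* round up  */

    /* If the range covers only part of a vma, the kernel splits it. */
    return madvise((void *)start, end - start, MADV_DONTFORK);
}
```

Every call on a sub-range of a larger mapping can add vmas, which is
where the split and overhead concerns above come from.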

Considering almost anything can do a fork() - we've seen app authors
become confused. They say stuff is busted, support folks ask if they
use fork, author says no. Investigation later shows some hidden
library did system() or whatever.

In this case the tests that found this failed because they were
written in Python and buried in there was some subprocess.call().

I would prefer the kernel consider it a valid workload with the
semantics the sketch patch shows.

> Is there possibly something else we can filter on than just
> GUP_PIN_COUNTING_BIAS? Because it could be as simple as just marking
> the vma itself and saying "this vma has had a page pinning event done
> on it".

We'd have to give up pin_user_pages_fast() to do that as we can't fast
walk and get vmas?

Hmm, there are many users. I remember that the hfi1 folks really
wanted the fast version for some reason..

> Because if we only start copying the page *iff* the vma is marked by
> that "this vma had page pinning" _and_ the page count is bigger than
> GUP_PIN_COUNTING_BIAS, then I think we can rest pretty easily knowing
> that we aren't going to hit some regular old-fashioned UNIX server
> cases with a lot of forks..

Agree
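
As a simplified user-space model of that two-part filter (not kernel
code): the struct and field names below are hypothetical, and the value
of GUP_PIN_COUNTING_BIAS is assumed to be 1U << 10 as in
include/linux/mm.h.

```c
#include <stdbool.h>

/* Assumed to match the kernel's definition in include/linux/mm.h. */
#define GUP_PIN_COUNTING_BIAS (1U << 10)

/* Hypothetical stand-ins for the kernel's vma and page structures. */
struct model_vma {
    bool had_pin;            /* "this vma had a page pinning event" */
};

struct model_page {
    unsigned int refcount;   /* page reference count */
};

/*
 * Copy the page at fork time only if the vma was marked as pinned
 * AND the page count is bigger than GUP_PIN_COUNTING_BIAS. Ordinary
 * fork-heavy workloads never set had_pin, so they never reach the
 * refcount comparison.
 */
static bool should_copy_at_fork(const struct model_vma *vma,
                                const struct model_page *page)
{
    return vma->had_pin && page->refcount > GUP_PIN_COUNTING_BIAS;
}
```

The vma flag is the cheap first check; the refcount test alone could
misfire on a page that simply has a very large number of references.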

Given that this is a user-visible regression and it is nearly rc6, what
do you prefer for next steps?

Sorting this out for fork, especially if it includes the vma change, is
probably more than a week's work.

Revert this patch and try again next cycle?

Thanks,
Jason