linux-kernel - Re: [PATCH 0/4] get_user_pages*() and RDMA: first steps

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20181003160836.GF24030@quack2.suse.cz>
Date:   Wed, 3 Oct 2018 18:08:36 +0200
From:   Jan Kara <jack@...e.cz>
To:     Jerome Glisse <jglisse@...hat.com>
Cc:     John Hubbard <jhubbard@...dia.com>, john.hubbard@...il.com,
        Matthew Wilcox <willy@...radead.org>,
        Michal Hocko <mhocko@...nel.org>,
        Christopher Lameter <cl@...ux.com>,
        Jason Gunthorpe <jgg@...pe.ca>,
        Dan Williams <dan.j.williams@...el.com>,
        Jan Kara <jack@...e.cz>, Al Viro <viro@...iv.linux.org.uk>,
        linux-mm@...ck.org, LKML <linux-kernel@...r.kernel.org>,
        linux-rdma <linux-rdma@...r.kernel.org>,
        linux-fsdevel@...r.kernel.org,
        Christian Benvenuti <benve@...co.com>,
        Dennis Dalessandro <dennis.dalessandro@...el.com>,
        Doug Ledford <dledford@...hat.com>,
        Mike Marciniszyn <mike.marciniszyn@...el.com>
Subject: Re: [PATCH 0/4] get_user_pages*() and RDMA: first steps

On Sat 29-09-18 04:46:09, Jerome Glisse wrote:
> On Fri, Sep 28, 2018 at 07:28:16PM -0700, John Hubbard wrote:
> > Actually, the latest direction on that discussion was toward periodically
> > writing back, even while under RDMA, via bounce buffers:
> > 
> >   https://lkml.kernel.org/r/20180710082100.mkdwngdv5kkrcz6n@quack2.suse.cz
> > 
> > I still think that's viable. Of course, there are other things besides 
> > writeback (see below) that might also lead to waiting.
> 
> Write back under bounce buffer is fine, when looking back at links you
> provided the solution that was discuss was blocking in page_mkclean()
> which is horrible in my point of view.

Yeah, after looking into it for some time, we figured that waiting for page
pins in page_mkclean() isn't really going to fly due to deadlocks. So we
came up with the bounce buffers idea which should solve that nicely.

> > > With the solution put forward here you can potentialy wait _forever_ for
> > > the driver that holds a pin to drop it. This was the point i was trying to
> > > get accross during LSF/MM. 
> > 
> > I agree that just blocking indefinitely is generally unacceptable for kernel
> > code, but we can probably avoid it for many cases (bounce buffers), and
> > if we think it is really appropriate (file system unmounting, maybe?) then
> > maybe tolerate it in some rare cases.  
> > 
> > >You can not fix broken hardware that decided to
> > > use GUP to do a feature they can't reliably do because their hardware is
> > > not capable to behave.
> > > 
> > > Because code is easier here is what i was meaning:
> > > 
> > > https://cgit.freedesktop.org/~glisse/linux/commit/?h=gup&id=a5dbc0fe7e71d347067579f13579df372ec48389
> > > https://cgit.freedesktop.org/~glisse/linux/commit/?h=gup&id=01677bc039c791a16d5f82b3ef84917d62fac826
> > > 
> > 
> > While that may work sometimes, I don't think it is reliable enough to trust for
> > identifying pages that have been gup-pinned. There's just too much overloading of
> > other mechanisms going on there, and if we pile on top with this constraint of "if you
> > have +3 refcounts, and this particular combination of page counts and mapcounts, then
> > you're definitely a long-term pinned page", I think users will find a lot of corner
> > cases for us that break that assumption. 
> 
> So the mapcount == refcount (modulo extra reference for mapping and
> private) should holds, here are the case when it does not:
>     - page being migrated
>     - page being isolated from LRU
>     - mempolicy changes against the page
>     - page cache lookup
>     - some file system activities
>     - i likely miss couples here i am doing that from memory
> 
> What matter is that all of the above are transitory, the extra reference
> only last for as long as it takes for the action to finish (migration,
> mempolicy change, ...).
> 
> So skipping those false positive page while reclaiming likely make sense,
> the blocking free buffer maybe not.

Well, as John wrote, these page refcount are fragile (and actually
filesystem dependent as some filesystems hold page reference from their
page->private data and some don't). So I think we really need a new
reliable mechanism for tracking page references from GUP. And John works
towards that.

								Honza
-- 
Jan Kara <jack@...e.com>
SUSE Labs, CR