[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <20190417020345.GB3330@redhat.com>
Date: Tue, 16 Apr 2019 22:03:46 -0400
From: Jerome Glisse <jglisse@...hat.com>
To: Boaz Harrosh <boaz@...xistor.com>
Cc: Boaz Harrosh <openosd@...il.com>,
Dan Williams <dan.j.williams@...el.com>,
Kent Overstreet <kent.overstreet@...il.com>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
linux-fsdevel <linux-fsdevel@...r.kernel.org>,
linux-block@...r.kernel.org, Linux MM <linux-mm@...ck.org>,
John Hubbard <jhubbard@...dia.com>, Jan Kara <jack@...e.cz>,
Alexander Viro <viro@...iv.linux.org.uk>,
Johannes Thumshirn <jthumshirn@...e.de>,
Christoph Hellwig <hch@....de>, Jens Axboe <axboe@...nel.dk>,
Ming Lei <ming.lei@...hat.com>, Jason Gunthorpe <jgg@...pe.ca>,
Matthew Wilcox <willy@...radead.org>,
Steve French <sfrench@...ba.org>, linux-cifs@...r.kernel.org,
Yan Zheng <zyan@...hat.com>, Sage Weil <sage@...hat.com>,
Ilya Dryomov <idryomov@...il.com>,
Alex Elder <elder@...nel.org>, ceph-devel@...r.kernel.org,
Eric Van Hensbergen <ericvh@...il.com>,
Latchesar Ionkov <lucho@...kov.net>,
Mike Marshall <hubcap@...ibond.com>,
Martin Brandenburg <martin@...ibond.com>,
Dominique Martinet <asmadeus@...ewreck.org>,
v9fs-developer@...ts.sourceforge.net, Coly Li <colyli@...e.de>,
linux-bcache@...r.kernel.org,
Ernesto A. Fernández
<ernesto.mnd.fernandez@...il.com>
Subject: Re: [PATCH v1 00/15] Keep track of GUPed pages in fs and block
On Wed, Apr 17, 2019 at 04:11:03AM +0300, Boaz Harrosh wrote:
> On 17/04/19 02:16, Jerome Glisse wrote:
> > On Wed, Apr 17, 2019 at 01:09:22AM +0300, Boaz Harrosh wrote:
> >> On 16/04/19 22:57, Jerome Glisse wrote:
> >> <>
> >>>
> >>> A very long thread on this:
> >>>
> >>> https://lkml.org/lkml/2018/12/3/1128
> >>>
> >>> especialy all the reply to this first one
> >>>
> >>> There is also:
> >>>
> >>> https://lkml.org/lkml/2019/3/26/1395
> >>> https://lwn.net/Articles/753027/
> >>>
> >>
> >> OK I have re-read this patchset and a little bit of the threads above (not all)
> >>
> >> As I understand the long term plan is to keep two separate ref-counts one
> >> for GUP-ref and one for the regular page-state/ownership ref.
> >> Currently looking at page-ref we do not know if we have a GUP currently held.
> >> With the new plan we can (Still not sure what's the full plan with this new info)
> >>
> >> But if you make it such as the first GUP-ref also takes a page_ref and the
> >> last GUp-dec also does put_page. Then the all of these becomes a matter of
> >> matching every call to get_user_pages or iov_iter_get_pages() with a new
> >> put_user_pages or iov_iter_put_pages().
> >>
> >> Then if much below us an LLD takes a get_page() say an skb below the iscsi
> >> driver, and so on. We do not care and we keep doing a put_page because we know
> >> the GUP-ref holds the page for us.
> >>
> >> The current block layer is transparent to any page-ref it does not take any
> >> nor put_page any. It is only the higher users that have done GUP that take care of that.
> >>
> >> The patterns I see are:
> >>
> >> iov_iter_get_pages()
> >>
> >> IO(sync)
> >>
> >> for(numpages)
> >> put_page()
> >>
> >> Or
> >>
> >> iov_iter_get_pages()
> >>
> >> IO (async)
> >> -> foo_end_io()
> >> put_page
> >>
> >> (Same with get_user_pages)
> >> (IO need not be block layer. It can be networking and so on like in NFS or CIFS
> >> and so on)
> >
> > They are also other code that pass around bio_vec and the code that
> > fill it is disconnected from the code that release the page and they
> > can mix and match GUP and non GUP AFAICT.
> >
> > On fs side they are also code that fill either bio or bio_vec and
> > use some extra mechanism other than bio_end to submit io through
> > workqueue and then release pages (cifs for instance). Again i believe
> > they can mix and match GUP and non GUP (i have not spotted something
> > obvious indicating otherwise).
> >
>
> But what I meant is why do we care at all? block layer does not inc page nor put
> page in any of bio or bio_vec. It is agnostic to the page-refs.
>
> Users register an end_io and know if pages are getted or not.
> So the balanced put is up to the user.
>
> >>
> >> The first pattern is easy just add the proper new api for
> >> it, so for every iov_iter_get_pages() you have an iov_iter_put_pages() and remove
> >> lots of cooked up for loops. Also the all iov_iter_get_pages_use_gup() just drops.
> >> (Same at get_user_pages sites use put_user_pages)
> >
> > Yes this patchset already convert some of this first pattern.
> >
>
> Right!
>
> >> The second pattern is a bit harder because it is possible that the foo_end_io()
> >> is currently used for GUP as well as none-GUP cases. this is easy to fix. But the
> >> even harder case is if the same foo_end_io() call has some pages GUPed and some not
> >> in the same call.
> >>
> >> staring at this patchset and the call sites I did not see any such places. Do you know
> >> of any?
> >> (We can always force such mixed-case users to always GUP-ref the pages and code
> >> foo_end_io() to GUP-dec)
> >
> > I believe direct-io.c is such example thought in that case i believe it
> > can only be the ZERO_PAGE so this might easily detectable. They are also
> > lot of fs functions taking an iterator and then using iov_iter_get_pages*()
> > to fill a bio. AFAICT those functions can be call with pipe iterator or
> > iovec iterator and probably also with other iterator type. But it is all
> > common code afterward (the bi_end_io function is the same no matter the
> > iterator).
> >
> > Thought that can probably be solve that way:
> >
> > From:
> > foo_bi_end_io(struct bio *bio) {
> > ...
> > for (i = 0; i < npages; ++i) {
> > put_page(pages[i]);
> > }
> > }
> >
> > To:
> > foo_bi_end_io_common(struct bio *bio) {
> > ...
> > }
> >
> > foo_bi_end_io_normal(struct bio *bio)
> > foo_bi_end_io_common(bio);
> > for (i = 0; i < npages; ++i) {
> > put_page(pages[i]);
> > }
> > }
> >
> > foo_bi_end_io_gup(struct bio *bio)
> > foo_bi_end_io_common(bio);
> > for (i = 0; i < npages; ++i) {
> > put_user_page(pages[i]);
> > }
> > }
> >
>
> Yes or when foo_bi_end_io_common is more complicated, then just make it
>
> foo_bi_end_io_common(struct bio *bio, bool gup) {
> ...
> }
>
> foo_bi_end_io_normal(struct bio *bio)
> foo_bi_end_io_common(bio, false);
> }
>
> foo_bi_end_io_gup(struct bio *bio)
> foo_bi_end_io_common(bio, true);
> }
>
> Less risky coding of code we do not know?
Yes whatever is more appropriate for eahc end_io func.
>
> > Then when filling in the bio i either pick foo_bi_end_io_normal() or
> > foo_bi_end_io_gup(). I am assuming that bio with different bi_end_io
> > function never get merge.
> >
>
> Exactly
>
> > The issue is that some bio_add_page*() call site are disconnected
> > from where the bio is allocated and initialized (and also where the
> > bi_end_io function is set). This make it quite hard to ascertain
> > that GUPed page and non GUP page can not co-exist in same bio.
> >
>
> Two questions if they always do a put_page at end IO. Who takes a page_ref
> in the not GUP case? page-cache? VFS? a mechanical get_page?
It depends, some get page to page-cache (find_get_page), other just
get_page on a page that is coming from unknown origin (ie it is not
clear from within the function where the page is coming from), other
allocate a page and transfer the page reference to the bio ie the
page will get free once the bio end function is call through as it
call call put_page on page within the bio.
So there is not one simple story here. Hence why it looks harder to
me (still likely do-able).
> > Also in some cases it is not clear that the same iter is use to
> > fill the same bio ie it might be possible that some code path fill
> > the same bio from different iterator (and thus some pages might
> > be coming from GUP and other not).
> >
>
> This one is hard to believe for me.
> one iter may produce multiple iter_get_pages() and many more bios.
> But the opposite?
>
> I know, never say never. Do you know of a specific example?
> I would like to stare at it.
No i do not know of any such example but reading through a lot of code
i could not convince myself of that fact. I agree with your feeling on
the matter i just don't have proof. I can spend more time digging all
code path and ascertaining thing but this will inevitably be a snapshot
in time and something that people might break in the future.
>
> > It would certainly seems to require more careful review from the
> > maintainers of such fs. I tend to believe that putting the burden
> > on the reviewer is a harder sell :)
> >
>
> I think a couple carefully put WARN_ONs in the PUT path can
> detect any leakage of refs. And help debug these cases.
Yes we do plan on catching things like underflow (easy one to catch)
or put_page that bring the refcount just below the bias value ... and
in anycase the only harm that can come of this should be memory leak.
So i believe that even if we miss some weird corner case we should be
able to catch it eventualy and nothing harmful should ever happen,
famous last word i will regret when i get burnt at the stake for missing
that one case ;)
>
> > From quick glance:
> > - nilfs segment thing
> > - direct-io same bio accumulate pages over multiple call but
> > it should always be from same iterator and thus either always
> > be from GUP or non GUP. Also the ZERO_PAGE case should be easy
> > to catch.
>
> Yes. Or we can always take a GUP-ref on the ZERO_PAGE as well
>
> > - fs/nfs/blocklayout/blocklayout.c
>
> This one is an example of "please do not touch" if you look at the code
> it currently does not do any put page at all. Though yes it does bio_add_page.
>
> The pages are GETed and PUTed in nfs/direct.c and reference are balanced there.
>
> this is the trivial case of for every iov_iter_get_pages[_alloc]() there is
> a new defined iov_iter_put_pages[_alloc]()
>
> So this is an example of extra not needed code changes in your approach
>
> > - gfs2 log buffer, that should never be page from GUP but i could
> > not ascertain that easily from quick review
>
> Same as NFS maybe? didn't look.
>
> >
> > This is not extensive, i was just grepping for bio_add_page() and
> > they are 2 other variant to check and i tended to discard places
> > where bio is allocated in same function as bio_add_page() but this
> > might not be a valid assumption either. Some bio might be allocated
> > and only if there is no default bio already and then set as default
> > bio which might be use latter on with different iterator.
> >
>
> I think we do not care at all about any of the bio_add_page() or bio_alloc
> places. All we care about is the call to iov_iter_get_pages* and where in the
> code these puts are balanced.
>
> If we need to split the endio case at those sights then we can do as above.
> Or in the worse case when pages are really mixed. Always take a GUP ref also
> on the not GUP case.
> (I would like to see where this happens)
>
> >>
> >> So with a very careful coding I think you need not touch the block / scatter-list layers
> >> nor any LLD drivers. The only code affected is the code around the get_user_pages and friends.
> >> Changing the API will surface all those.
> >> (IE. introduce a new API, convert one by one, Remove old API)
> >>
> >> Am I smoking?
> >
> > No, i thought about it seemed more dangerous and harder to get right
> > because some code add page in one place and setup bio in another. I
> > can dig some more on that front but this still leave the non-bio user
> > of bio_vec and those IIRC also suffer from same disconnect issue.
> >
>
> Again I should not care about bio_vec. I only need to trace the balancing of the
> ref taken in GUP call sight. Let me help you in those places it is not obvious to
> you.
>
> >>
> >> BTW: Are you aware of the users of iov_iter_get_pages_alloc() Do they need fixing too?
> >
> > Yeah and that patchset should address those already, i do not think
> > i missed any.
> >
>
> I could not find a patch for nfs/direct.c where a put_page is called
> to balance the iov_iter_get_pages_alloc(). Which takes care of for example of
> the blocklayout.c pages state
>
> So I think the deep Audit needs to be for iov_iter_get_pages and get_user_pages
> and the balancing of that. And the all of bio_alloc and bio_add_page should stay
> agnostic to any pege-refs taking/putting
Will try to get started on that and see if i hit any roadblock. I will
report once i get my feet wet, or at least before i drown ;)
Cheers,
Jérôme
Powered by blists - more mailing lists