lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <20190417020345.GB3330@redhat.com>
Date:   Tue, 16 Apr 2019 22:03:46 -0400
From:   Jerome Glisse <jglisse@...hat.com>
To:     Boaz Harrosh <boaz@...xistor.com>
Cc:     Boaz Harrosh <openosd@...il.com>,
        Dan Williams <dan.j.williams@...el.com>,
        Kent Overstreet <kent.overstreet@...il.com>,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
        linux-fsdevel <linux-fsdevel@...r.kernel.org>,
        linux-block@...r.kernel.org, Linux MM <linux-mm@...ck.org>,
        John Hubbard <jhubbard@...dia.com>, Jan Kara <jack@...e.cz>,
        Alexander Viro <viro@...iv.linux.org.uk>,
        Johannes Thumshirn <jthumshirn@...e.de>,
        Christoph Hellwig <hch@....de>, Jens Axboe <axboe@...nel.dk>,
        Ming Lei <ming.lei@...hat.com>, Jason Gunthorpe <jgg@...pe.ca>,
        Matthew Wilcox <willy@...radead.org>,
        Steve French <sfrench@...ba.org>, linux-cifs@...r.kernel.org,
        Yan Zheng <zyan@...hat.com>, Sage Weil <sage@...hat.com>,
        Ilya Dryomov <idryomov@...il.com>,
        Alex Elder <elder@...nel.org>, ceph-devel@...r.kernel.org,
        Eric Van Hensbergen <ericvh@...il.com>,
        Latchesar Ionkov <lucho@...kov.net>,
        Mike Marshall <hubcap@...ibond.com>,
        Martin Brandenburg <martin@...ibond.com>,
        Dominique Martinet <asmadeus@...ewreck.org>,
        v9fs-developer@...ts.sourceforge.net, Coly Li <colyli@...e.de>,
        linux-bcache@...r.kernel.org,
        Ernesto A. Fernández 
        <ernesto.mnd.fernandez@...il.com>
Subject: Re: [PATCH v1 00/15] Keep track of GUPed pages in fs and block

On Wed, Apr 17, 2019 at 04:11:03AM +0300, Boaz Harrosh wrote:
> On 17/04/19 02:16, Jerome Glisse wrote:
> > On Wed, Apr 17, 2019 at 01:09:22AM +0300, Boaz Harrosh wrote:
> >> On 16/04/19 22:57, Jerome Glisse wrote:
> >> <>
> >>>
> >>> A very long thread on this:
> >>>
> >>> https://lkml.org/lkml/2018/12/3/1128
> >>>
> >>> especialy all the reply to this first one
> >>>
> >>> There is also:
> >>>
> >>> https://lkml.org/lkml/2019/3/26/1395
> >>> https://lwn.net/Articles/753027/
> >>>
> >>
> >> OK I have re-read this patchset and a little bit of the threads above (not all)
> >>
> >> As I understand the long term plan is to keep two separate ref-counts one
> >> for GUP-ref and one for the regular page-state/ownership ref.
> >> Currently looking at page-ref we do not know if we have a GUP currently held.
> >> With the new plan we can (Still not sure what's the full plan with this new info)
> >>
> >> But if you make it such as the first GUP-ref also takes a page_ref and the
> >> last GUp-dec also does put_page. Then the all of these becomes a matter of
> >> matching every call to get_user_pages or iov_iter_get_pages() with a new
> >> put_user_pages or iov_iter_put_pages().
> >>
> >> Then if much below us an LLD takes a get_page() say an skb below the iscsi
> >> driver, and so on. We do not care and we keep doing a put_page because we know
> >> the GUP-ref holds the page for us.
> >>
> >> The current block layer is transparent to any page-ref it does not take any
> >> nor put_page any. It is only the higher users that have done GUP that take care of that.
> >>
> >> The patterns I see are:
> >>
> >>   iov_iter_get_pages()
> >>
> >> 	IO(sync)
> >>
> >>   for(numpages)
> >> 	put_page()
> >>
> >> Or
> >>
> >>   iov_iter_get_pages()
> >>
> >> 	IO (async)
> >> 		->	foo_end_io()
> >> 				put_page
> >>
> >> (Same with get_user_pages)
> >> (IO need not be block layer. It can be networking and so on like in NFS or CIFS
> >>  and so on)
> > 
> > They are also other code that pass around bio_vec and the code that
> > fill it is disconnected from the code that release the page and they
> > can mix and match GUP and non GUP AFAICT.
> > 
> > On fs side they are also code that fill either bio or bio_vec and
> > use some extra mechanism other than bio_end to submit io through
> > workqueue and then release pages (cifs for instance). Again i believe
> > they can mix and match GUP and non GUP (i have not spotted something
> > obvious indicating otherwise).
> > 
> 
> But what I meant is why do we care at all? block layer does not inc page nor put
> page in any of bio or bio_vec. It is agnostic to the page-refs.
> 
> Users register an end_io and know if pages are getted or not.
> So the balanced put is up to the user.
> 
> >>
> >> The first pattern is easy just add the proper new api for
> >> it, so for every iov_iter_get_pages() you have an iov_iter_put_pages() and remove
> >> lots of cooked up for loops. Also the all iov_iter_get_pages_use_gup() just drops.
> >> (Same at get_user_pages sites use put_user_pages)
> > 
> > Yes this patchset already convert some of this first pattern.
> > 
> 
> Right!
> 
> >> The second pattern is a bit harder because it is possible that the foo_end_io()
> >> is currently used for GUP as well as none-GUP cases. this is easy to fix. But the
> >> even harder case is if the same foo_end_io() call has some pages GUPed and some not
> >> in the same call.
> >>
> >> staring at this patchset and the call sites I did not see any such places. Do you know
> >> of any?
> >> (We can always force such mixed-case users to always GUP-ref the pages and code
> >>  foo_end_io() to GUP-dec)
> > 
> > I believe direct-io.c is such example thought in that case i believe it
> > can only be the ZERO_PAGE so this might easily detectable. They are also
> > lot of fs functions taking an iterator and then using iov_iter_get_pages*()
> > to fill a bio. AFAICT those functions can be call with pipe iterator or
> > iovec iterator and probably also with other iterator type. But it is all
> > common code afterward (the bi_end_io function is the same no matter the
> > iterator).
> > 
> > Thought that can probably be solve that way:
> > 
> > From:
> >     foo_bi_end_io(struct bio *bio) {
> >         ...
> >         for (i = 0; i < npages; ++i) {
> >             put_page(pages[i]);
> >         }
> >     }
> > 
> > To:
> >     foo_bi_end_io_common(struct bio *bio) {
> >         ...
> >     }
> > 
> >     foo_bi_end_io_normal(struct bio *bio)
> >         foo_bi_end_io_common(bio);
> >         for (i = 0; i < npages; ++i) {
> >             put_page(pages[i]);
> >         }
> >     }
> > 
> >     foo_bi_end_io_gup(struct bio *bio)
> >         foo_bi_end_io_common(bio);
> >         for (i = 0; i < npages; ++i) {
> >             put_user_page(pages[i]);
> >         }
> >     }
> > 
> 
> Yes or when foo_bi_end_io_common is more complicated, then just make it
> 
>      foo_bi_end_io_common(struct bio *bio, bool gup) {
>          ...
>      }
> 
>      foo_bi_end_io_normal(struct bio *bio)
> 	foo_bi_end_io_common(bio, false);
>      }
>  
>      foo_bi_end_io_gup(struct bio *bio)
> 	foo_bi_end_io_common(bio, true);
>      }
> 
> Less risky coding of code we do not know?

Yes whatever is more appropriate for eahc end_io func.

> 
> > Then when filling in the bio i either pick foo_bi_end_io_normal() or
> > foo_bi_end_io_gup(). I am assuming that bio with different bi_end_io
> > function never get merge.
> > 
> 
> Exactly
> 
> > The issue is that some bio_add_page*() call site are disconnected
> > from where the bio is allocated and initialized (and also where the
> > bi_end_io function is set). This make it quite hard to ascertain
> > that GUPed page and non GUP page can not co-exist in same bio.
> > 
> 
> Two questions if they always do a put_page at end IO. Who takes a page_ref
> in the not GUP case? page-cache? VFS? a mechanical get_page?

It depends, some get page to page-cache (find_get_page), other just
get_page on a page that is coming from unknown origin (ie it is not
clear from within the function where the page is coming from), other
allocate a page and transfer the page reference to the bio ie the
page will get free once the bio end function is call through as it
call call put_page on page within the bio.

So there is not one simple story here. Hence why it looks harder to
me (still likely do-able).

> > Also in some cases it is not clear that the same iter is use to
> > fill the same bio ie it might be possible that some code path fill
> > the same bio from different iterator (and thus some pages might
> > be coming from GUP and other not).
> > 
> 
> This one is hard to believe for me. 
> one iter may produce multiple iter_get_pages() and many more bios.
> But the opposite?
> 
> I know, never say never. Do you know of a specific example?
> I would like to stare at it.

No i do not know of any such example but reading through a lot of code
i could not convince myself of that fact. I agree with your feeling on
the matter i just don't have proof. I can spend more time digging all
code path and ascertaining thing but this will inevitably be a snapshot
in time and something that people might break in the future.

> 
> > It would certainly seems to require more careful review from the
> > maintainers of such fs. I tend to believe that putting the burden
> > on the reviewer is a harder sell :)
> > 
> 
> I think a couple carefully put WARN_ONs in the PUT path can
> detect any leakage of refs. And help debug these cases.

Yes we do plan on catching things like underflow (easy one to catch)
or put_page that bring the refcount just below the bias value ... and
in anycase the only harm that can come of this should be memory leak.

So i believe that even if we miss some weird corner case we should be
able to catch it eventualy and nothing harmful should ever happen,
famous last word i will regret when i get burnt at the stake for missing
that one case ;)

> 
> > From quick glance:
> >    - nilfs segment thing
> >    - direct-io same bio accumulate pages over multiple call but
> >      it should always be from same iterator and thus either always
> >      be from GUP or non GUP. Also the ZERO_PAGE case should be easy
> >      to catch.
> 
> Yes. Or we can always take a GUP-ref on the ZERO_PAGE as well
> 
> >    - fs/nfs/blocklayout/blocklayout.c
> 
> This one is an example of "please do not touch" if you look at the code
> it currently does not do any put page at all. Though yes it does bio_add_page.
> 
> The pages are GETed and PUTed in nfs/direct.c and reference are balanced there.
> 
> this is the trivial case of for every iov_iter_get_pages[_alloc]() there is
> a new defined iov_iter_put_pages[_alloc]()
> 
> So this is an example of extra not needed code changes in your approach
> 
> >    - gfs2 log buffer, that should never be page from GUP but i could
> >      not ascertain that easily from quick review
> 
> 	Same as NFS maybe? didn't look.
> 
> > 
> > This is not extensive, i was just grepping for bio_add_page() and
> > they are 2 other variant to check and i tended to discard places
> > where bio is allocated in same function as bio_add_page() but this
> > might not be a valid assumption either. Some bio might be allocated
> > and only if there is no default bio already and then set as default
> > bio which might be use latter on with different iterator.
> > 
> 
> I think we do not care at all about any of the bio_add_page() or bio_alloc
> places. All we care about is the call to iov_iter_get_pages* and where in the
> code these puts are balanced.
> 
> If we need to split the endio case at those sights then we can do as above.
> Or in the worse case when pages are really mixed. Always take a GUP  ref also
> on the not GUP case. 
> (I would like to see where this happens)
> 
> >>
> >> So with a very careful coding I think you need not touch the block / scatter-list layers
> >> nor any LLD drivers. The only code affected is the code around the get_user_pages and friends.
> >> Changing the API will surface all those.
> >> (IE. introduce a new API, convert one by one, Remove old API)
> >>
> >> Am I smoking?
> > 
> > No, i thought about it seemed more dangerous and harder to get right
> > because some code add page in one place and setup bio in another. I
> > can dig some more on that front but this still leave the non-bio user
> > of bio_vec and those IIRC also suffer from same disconnect issue.
> > 
> 
> Again I should not care about bio_vec. I only need to trace the balancing of the
> ref taken in GUP call sight. Let me help you in those places it is not obvious to
> you.
> 
> >>
> >> BTW: Are you aware of the users of iov_iter_get_pages_alloc() Do they need fixing too?
> > 
> > Yeah and that patchset should address those already, i do not think
> > i missed any.
> > 
> 
> I could not find a patch for nfs/direct.c where a put_page is called
> to balance the iov_iter_get_pages_alloc(). Which takes care of for example of
> the blocklayout.c pages state
> 
> So I think the deep Audit needs to be for iov_iter_get_pages and get_user_pages
> and the balancing of that. And the all of bio_alloc and bio_add_page should stay
> agnostic to any pege-refs taking/putting

Will try to get started on that and see if i hit any roadblock. I will
report once i get my feet wet, or at least before i drown ;)

Cheers,
Jérôme

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ