lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20130321113047.GA31599@redhat.com>
Date:	Thu, 21 Mar 2013 13:30:47 +0200
From:	"Michael S. Tsirkin" <mst@...hat.com>
To:	Roland Dreier <roland@...nel.org>
Cc:	"Michael R. Hines" <mrhines@...ux.vnet.ibm.com>,
	Sean Hefty <sean.hefty@...el.com>,
	Hal Rosenstock <hal.rosenstock@...il.com>,
	Yishai Hadas <yishaih@...lanox.com>,
	Christoph Lameter <cl@...ux.com>,
	"linux-rdma@...r.kernel.org" <linux-rdma@...r.kernel.org>,
	LKML <linux-kernel@...r.kernel.org>, qemu-devel@...gnu.org
Subject: Re: [PATCH] rdma: don't make pages writeable if not requiested

On Thu, Mar 21, 2013 at 11:32:30AM +0200, Michael S. Tsirkin wrote:
> On Wed, Mar 20, 2013 at 11:55:54PM -0700, Roland Dreier wrote:
> > On Wed, Mar 20, 2013 at 11:18 PM, Michael S. Tsirkin <mst@...hat.com> wrote:
> > > core/umem.c seems to get the arguments to get_user_pages
> > > in the reverse order: it sets writeable flag and
> > > breaks COW for MAP_SHARED if and only if hardware needs to
> > > write the page.
> > >
> > > This breaks memory overcommit for users such as KVM:
> > > each time we try to register a page to send it to remote, this
> > > breaks COW.  It seems that for applications that only have
> > > REMOTE_READ permission, there is no reason to break COW at all.
> > 
> > I proposed a similar (but not exactly the same, see below) patch a
> > while ago: https://lkml.org/lkml/2012/1/26/7 but read the thread,
> > especially https://lkml.org/lkml/2012/2/6/265
> > 
> > I think this change will break the case where userspace tries to
> > register an MR with read-only permission, but intends locally through
> > the CPU to write to the memory.  If the memory registration is done
> > while the memory is mapped read-only but has VM_MAYWRITE, then
> > userspace gets into trouble when COW happens.  In the case you're
> > describing (although I'm not sure where in KVM we're talking about
> > using RDMA), what happens if you register memory with only REMOTE_READ
> > and then COW is triggered because of a local write?  (I'm assuming you
> > don't want remote access to continue to get the old contents of the
> > page)
> 
> I read that, and the above. It looks like everyone was doing tricks
> like "register page, then modify it, then let remote read it"
> and for some reason assumed it's ok to write into page locally from CPU
> even if LOCAL_WRITE is not set.  I don't see why don't applications set
> LOCAL_WRITE if they are going to write to memory locally, but assuming
> they don't, we can't just break them.
> 
> So what we need is a new "no I really do not intend to write into this
> memory" flag that avoids doing tricks in the kernel and treats the
> page normally, just pins it so hardware can read it.
> 
> 
> > I have to confess that I still haven't had a chance to implement the
> > proposed FOLL_FOLLOW solution to all of this.
> 
> See a much easier to implement proposal at the bottom.
> 
> > > If the page that is COW has lots of copies, this makes the user process
> > > quickly exceed the cgroups memory limit.  This makes RDMA mostly useless
> > > for virtualization, thus the stable tag.
> > 
> > The actual problem description here is a bit too terse for me to
> > understand.  How do we end up with lots of copies of a COW page?
> 
> Reading the links above, rdma breaks COW intentionally.
> 
> Imagine a page with lots of instances in the process page map.
> For example a zero page, but not only that: we rely on KSM heavily
> to deduplicate pages for multiple VMs.
> There are gigabytes of these in each of the multiple VMs
> running on a host.
> 
> What we are using RDMA for is VM migration so we careful not to change
> this memory: when we do allow memory to change we are careful
> to track what was changed, reregister and resend the data.
> 
> But at the moment, each time we register a virtual address referencing
> this page, infiniband assumes we might want to change the page so it
> does get_user_pages with writeable flag, forcing a copy.
> Amount of used RAM explodes.
> 
> >  Why
> > is RDMA registering the memory any more  special than having everyone
> > who maps this page actually writing to it and triggering COW?
> > 
> > >                 ret = get_user_pages(current, current->mm, cur_base,
> > >                                      min_t(unsigned long, npages,
> > >                                            PAGE_SIZE / sizeof (struct page *)),
> > > -                                    1, !umem->writable, page_list, vma_list);
> > > +                                    !umem->writable, 1, page_list, vma_list);
> > 
> > The first two parameters in this line being changed are "write" and "force".
> > 
> > I think if we do change this, then we need to pass umem->writable (as
> > opposed to !umem->writable) for the "write" parameter.
> 
> Ugh. Sure enough. Let's agree on the direction before I respin the
> patch though.
> 
> > Not sure
> > whether "force" makes sense or not.
> > 
> >  - R.
> 
> If you don't force write on read-only mappings you don't, but
> it seems harmless for read-only gup. Still, no need to change
> what's not broken.
> 
> Please comment on the below (completely untested, and needs userspace
> patch too, but just to give you the idea)
> 
> --->
> 
> rdma: add IB_ACCESS_APP_READONLY 

Or we can call it IB_ACCESS_GIFT - this is a bit like SPLICE_F_GIFT
semantics.



> At the moment any attempt to register memory for RDMA breaks
> COW, which hurts hosts overcomitted for memory.
> But if the application knows it won't write into the MR after
> registration, we can save (sometimes a lot of) memory
> by telling the kernel not to bother breaking COW for us.
> 
> If the application does change memory registered with this flag, it can
> re-register afterwards, and resend the data on the wire.
> 
> Signed-off-by: Michael S. Tsirkin <mst@...hat.com>
> 
> ---
> 
> diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
> index 5929598..635b57a 100644
> --- a/drivers/infiniband/core/umem.c
> +++ b/drivers/infiniband/core/umem.c
> @@ -152,7 +152,9 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
>  		ret = get_user_pages(current, current->mm, cur_base,
>  				     min_t(unsigned long, npages,
>  					   PAGE_SIZE / sizeof (struct page *)),
> -				     !umem->writable, 1, page_list, vma_list);
> +				     umem->writable ||
> +				     !(access & IB_ACCESS_APP_READONLY),
> +				     !umem->writable, page_list, vma_list);
>  
>  		if (ret < 0)
>  			goto out;
> diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
> index 98cc4b2..3a3ba1b 100644
> --- a/include/rdma/ib_verbs.h
> +++ b/include/rdma/ib_verbs.h
> @@ -871,7 +871,8 @@ enum ib_access_flags {
>  	IB_ACCESS_REMOTE_READ	= (1<<2),
>  	IB_ACCESS_REMOTE_ATOMIC	= (1<<3),
>  	IB_ACCESS_MW_BIND	= (1<<4),
> -	IB_ZERO_BASED		= (1<<5)
> +	IB_ZERO_BASED		= (1<<5),
> +	IB_ACCESS_APP_READONLY	= (1<<6) /* User promises not to change the data */
>  };
>  
>  struct ib_phys_buf {
> 
> -- 
> MST
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ