Message-ID: <20211202170032.GN5112@ziepe.ca>
Date: Thu, 2 Dec 2021 13:00:32 -0400
From: Jason Gunthorpe <jgg@...pe.ca>
To: Matthew Wilcox <willy@...radead.org>
Cc: Leon Romanovsky <leon@...nel.org>,
Andrew Morton <akpm@...ux-foundation.org>,
Bixuan Cui <cuibixuan@...ux.alibaba.com>, linux-mm@...ck.org,
linux-kernel@...r.kernel.org, torvalds@...ux-foundation.org,
w@....eu, keescook@...omium.org
Subject: Re: [PATCH -next] mm: delete oversized WARN_ON() in kvmalloc() calls
On Thu, Dec 02, 2021 at 03:29:47PM +0000, Matthew Wilcox wrote:
> On Thu, Dec 02, 2021 at 05:23:42PM +0200, Leon Romanovsky wrote:
> > The problem is that this WARN_ON() is triggered by the users.
>
> ... or the problem is that you don't do a sanity check between the user
> and the MM system. I mean, that's what this conversation is about --
> is it a bug to be asking for this much memory in the first place?
>
> > At least in the RDMA world, users can provide huge sizes and they expect
> > to get plain -ENOMEM and not dump stack, because it happens indirectly
> > to them.
> >
> > In our case, these two kvcalloc() generates WARN_ON().
> >
> > umem_odp->pfn_list = kvcalloc(
> > npfns, sizeof(*umem_odp->pfn_list), GFP_KERNEL);
>
> Does it really make sense for the user to specify 2^31 PFNs in a single
> call? I mean, that's 8TB of memory. Should RDMA put its own limit
> in here, or should it rely on kvmalloc returning -ENOMEM?
I wrote this - I don't think RDMA should put a limit here. What
limit would it use anyhow?
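
To make that concrete, a cap would end up looking something like this
(untested sketch - ODP_MAX_PFNS is made up, there is no principled
value to put behind it):

	/* Hypothetical driver-side cap, name invented for illustration.
	 * Any number we pick here is arbitrary and will be wrong for
	 * someone's DAX workload:
	 */
	if (npfns > ODP_MAX_PFNS)
		return -EINVAL;

	umem_odp->pfn_list = kvcalloc(npfns, sizeof(*umem_odp->pfn_list),
				      GFP_KERNEL);
	if (!umem_odp->pfn_list)
		return -ENOMEM;
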
I'm pretty sure database people are already using low TBs here. It is
not absurd when you have DAX, and the biggest user of ODP is with DAX.

If anything we might get to a point in a few years where 2^31 is too
small and this has to become a better data structure :\

Maybe an xarray, and I should figure out how to use the multi-order
stuff to optimize huge pages?
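
Something like this is the direction I mean (very rough, untested, the
pfn_xa field is invented here as a stand-in for pfn_list - the point is
one multi-order entry per huge page instead of 512 separate pfn slots):

	/* Store a single multi-order entry covering the whole huge
	 * page. XA_STATE_ORDER/xas_store/xas_nomem are the existing
	 * xarray multi-order API; pfn_xa and page_index are made up
	 * for this sketch.
	 */
	XA_STATE_ORDER(xas, &umem_odp->pfn_xa, page_index, HPAGE_PMD_ORDER);

	do {
		xas_lock(&xas);
		xas_store(&xas, xa_mk_value(pfn));
		xas_unlock(&xas);
	} while (xas_nomem(&xas, GFP_KERNEL));
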

I'd actually really like to get rid of it, just haven't figured out
how. The only purpose is to call set_page_dirty(), and in many cases
the pfn should still be in the mm's page table. We also store another
copy of the PFN in the NIC's mapping. Surely one of those two could do
the job instead?
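
At the end of the day the list only exists so that unmap can do
roughly this (simplified, not the exact code):

	/* Walk the hmm pfns we stored earlier and dirty anything
	 * that was mapped writable before it goes away:
	 */
	for (idx = 0; idx < npfns; idx++) {
		unsigned long pfn = umem_odp->pfn_list[idx];

		if (!(pfn & HMM_PFN_VALID) || !(pfn & HMM_PFN_WRITE))
			continue;

		set_page_dirty_lock(hmm_pfn_to_page(pfn));
	}
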
Jason