linux-kernel - Re: [PATCH 0/3] have pooled sunrpc services make more intelligent allocations


Message-ID: <20080604075318.66fece9c@barsoom.rdu.redhat.com>
Date:	Wed, 4 Jun 2008 07:53:18 -0400
From:	Jeff Layton <jlayton@...hat.com>
To:	Tom Tucker <tom@...ngridcomputing.com>
Cc:	linux-kernel@...r.kernel.org, linux-nfs@...r.kernel.org,
	bfields@...ldses.org
Subject: Re: [PATCH 0/3] have pooled sunrpc services make more intelligent
 allocations

On Tue, 03 Jun 2008 13:37:25 -0500
Tom Tucker <tom@...ngridcomputing.com> wrote:

> 
> On Tue, 2008-06-03 at 13:42 -0400, Jeff Layton wrote:
> > On Tue, 03 Jun 2008 11:53:42 -0500
> > Tom Tucker <tom@...ngridcomputing.com> wrote:
> > 
> > > Jeff:
> > > 
> > > This brings up an interesting issue with the RDMA transport and
> > > RDMA_READ. RDMA_READ is submitted as part of fetching an RPC from the
> > > client (e.g. NFS_WRITE). The xpo_recvfrom function doesn't block waiting
> > > for the RDMA_READ to complete, but rather queues the RPC for subsequent
> > > processing when the I/O completes and returns 0. 
> > > 
> > > I can use these new services to allocate CPU local pages for this I/O.
> > > So far, so good. However, when the I/O completes, and the transport is
> > > rescheduled for subsequent RPC completion processing, the pool/CPU that
> > > is elected doesn't have any affinity for the CPU on which the I/O was
> > > initially submitted. I think this means that the svc_process/reply steps
> > > may occur on a CPU far away from the memory in which the data resides.
> > > 
> > > Am I making sense here? If so, any thoughts on what could/should be
> > > done?
> > > 
> > > Thanks,
> > > Tom
> > > 
> > 
> > I confess I didn't think hard about the RDMA case here (and haven't
> > been paying as much attention as I probably should to the design of
> > it). So take my thoughts with a large chunk of salt...
> > 
> > On a NUMA box, the pages have to live _somewhere_ and some CPUs will be
> > closer to them than others. If we're concerned about making sure that
> > the post-RDMA_READ processing is done on a CPU close to the memory,
> > then we don't have much choice but to try to make sure that this
> > processing is only done on CPUs that are close to that memory.
> > 
> > Assuming that this post-processing is done by nfsd, I suppose we'd need
> > to tag the post-RDMA_READ RPC with a poolid or something and make sure
> > that only nfsds running on CPUs close to the memory pick it up. Perhaps
> > there could be a per-pool queue for these RPCs or something...
> > 
> > Either way, the big question is whether that will be a net win or loss
> > for throughput. i.e. are we better off waiting for the right nfsd to
> > become available or allowing the first nfsd that becomes available to
> > make the crosscalls needed to do the RPC? It's hard to say...
> 
> Not only that, but it would lead to more disorder in the RPC processing
> which might kill write-behind.
> 

Oof, yeah...good point...

Another option might be to keep the nfsd that issued the RDMA_READ idle
for a short time in the expectation that the RDMA_READ reply will come
in soon. With a large enough pool of nfsds I'd think that wouldn't
cause too much of a problem. That might be easier to implement anyway,
though we'd still have to think about how best to make sure that we
dispatch the RDMA_READ reply to the right nfsd (or at least to the
right svc pool).
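
To make the "dispatch to the right svc pool" idea a bit more concrete,
here is a rough userspace sketch of a per-pool queue for RPCs whose
RDMA_READ has completed. None of this is existing sunrpc code: the names
deferred_rpc, pool_queue, deferred_enqueue, deferred_dequeue and NPOOLS
are all made up for illustration, and locking is left out entirely. It
only shows the tagging step: the RPC remembers the pool that submitted
the I/O, and only threads in that pool pick it up again.

```c
#include <assert.h>
#include <stddef.h>

#define NPOOLS 4	/* hypothetical: e.g. one pool per NUMA node */

/* Hypothetical stand-in for an RPC with an RDMA_READ outstanding. */
struct deferred_rpc {
	unsigned int pool_id;		/* pool that submitted the RDMA_READ */
	int xid;			/* illustrative request identifier */
	struct deferred_rpc *next;
};

/* Hypothetical per-pool FIFO of RPCs whose I/O has completed. */
struct pool_queue {
	struct deferred_rpc *head;
	struct deferred_rpc *tail;
};

static struct pool_queue pool_queues[NPOOLS];

/* I/O completion path: queue the RPC on the pool that submitted it. */
static void deferred_enqueue(struct deferred_rpc *rpc)
{
	struct pool_queue *q;

	assert(rpc->pool_id < NPOOLS);
	q = &pool_queues[rpc->pool_id];

	rpc->next = NULL;
	if (q->tail)
		q->tail->next = rpc;
	else
		q->head = rpc;
	q->tail = rpc;
}

/* Service thread path: a thread only takes work tagged for its own pool. */
static struct deferred_rpc *deferred_dequeue(unsigned int pool_id)
{
	struct pool_queue *q;
	struct deferred_rpc *rpc;

	assert(pool_id < NPOOLS);
	q = &pool_queues[pool_id];

	rpc = q->head;
	if (!rpc)
		return NULL;

	q->head = rpc->next;
	if (!q->head)
		q->tail = NULL;
	rpc->next = NULL;
	return rpc;
}
```

The FIFO keeps completions in order within a pool, but it does nothing
about ordering across pools, so the write-behind concern above still
applies to a scheme like this.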

> > 
> > In the near term, I doubt this patchset will harm the RDMA case. 
> 
> Agreed. 
> 
> > After
> > all, the distribution of memory allocations is pretty lumpy now. On
> > a NUMA box with RDMA you're probably doing a lot of crosscalls with
> > the current code.
> 
> Probably no worse than the sockets transport since the skbufs aren't
> necessarily allocated on the CPU calling svc_recv.
> 

Right, it's certainly no worse than the current situation for the
non-RDMA case.

-- 
Jeff Layton <jlayton@...hat.com>