Message-ID: <48EA549E.1050001@opengridcomputing.com>
Date: Mon, 06 Oct 2008 13:10:38 -0500
From: Tom Tucker <tom@...ngridcomputing.com>
To: Roland Dreier <rdreier@...co.com>
CC: ericvh@...il.com, linux-kernel@...r.kernel.org,
v9fs-devel@...r.kernel.org, rminnich@...cer.ca.sandia.gov,
lionkov@...l.gov
Subject: Re: [PATCH 01/03] 9prdma: RDMA Transport Support for 9P

Roland Dreier wrote:
> very cool... neat that it only takes 1000 lines of code to do this too.
>
> a few quick comments from a cursory read:
>
> > This file implements the RDMA transport provider for 9P. It allows
> > mounts to be performed over iWARP and IB capable network interfaces
> > and uses the OpenFabrics API to perform I/O.
>
> I don't like this "openfabrics API" terminology -- the RDMA API is just
> one part of the kernel API that you're using, and I'm not sure it's
> worth calling out specially. ie just delete that third line from the
> changelog.
>
Ok.
> > + atomic_t next_tag;
>
> this seems to be a write-only field... delete it?
This is leftover garbage. Nice catch.
>
> > + * @wc_op: Mellanox's broken HW doesn't provide the original WR op
> > + * when the CQE completes in error. This forces apps to keep track of
> > + * the op themselves. Yes, it's a Pet Peeve of mine ;-)
>
> there's nothing broken about this behavior -- the IB spec very
> explicitly calls out that the opcode field is undefined for completions
> with error status.
>
Fair enough...
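For the archive, the pattern the comment is describing is just recording
the posted op in the per-WR context hung off wr_id and using that on
error, since wc->opcode is only defined for successful completions.
Roughly (a sketch only; the context layout and handle_error() are
illustrative, not the exact code in the patch):

	/*
	 * Sketch: remember what we posted, because wc->opcode is
	 * undefined when wc->status != IB_WC_SUCCESS.
	 */
	struct p9_rdma_context {
		enum ib_wc_opcode wc_op;  /* IB_WC_SEND or IB_WC_RECV */
		void *priv;               /* request pointer, DMA state, ... */
	};

	static void cq_event_handler(struct ib_wc *wc)
	{
		struct p9_rdma_context *c =
			(struct p9_rdma_context *)(unsigned long)wc->wr_id;

		if (wc->status != IB_WC_SUCCESS) {
			/* use c->wc_op, not wc->opcode, to tell send from recv */
			handle_error(c, c->wc_op);
			return;
		}
		/* normal path: wc->opcode is valid here */
	}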
> > +find_req(struct p9_trans_rdma *rdma, u16 tag)
>
> > + return found ? req : 0;
>
> using 0 instead of NULL here... probably worth running all this through
> sparse to see what else it flags.
>
will do.
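For the archive: "make C=2 net/9p/" (or wherever the transport file ends
up) runs sparse over that directory, and the line above just becomes:

	/* find_req(): return NULL, not 0, for a pointer */
	return found ? req : NULL;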
> > + if (atomic_inc_return(&rdma->sq_count) >= rdma->sq_depth)
> > + wait_event_interruptible
> > + (rdma->send_wait,
> > + (atomic_read(&rdma->sq_count) < rdma->sq_depth));
>
> this worries me a bit... for one thing you use wait_event_interruptible()
> without checking the return value, so it's entirely possible that this
> gets woken up before the condition actually becomes true. And to handle
> that correctly, you'd need extra code to decrement the sq_count and wake
> up other waiters if it was interrupted, etc.
>
Yes, I think this needs to be wait_event.
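I.e. (sketch only, same fields as in the posted code):

	/* Sketch: wait uninterruptibly for an SQ slot, so a signal can't
	 * wake us before the condition is actually true. */
	if (atomic_inc_return(&rdma->sq_count) >= rdma->sq_depth)
		wait_event(rdma->send_wait,
			   (atomic_read(&rdma->sq_count) < rdma->sq_depth));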
> I know counting semaphores are out of favor in the kernel but this seems
> to be a good place to use real semaphores, rather than concocting your
> own in a much more complicated way.
>
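For the record, the semaphore variant would look roughly like this
(sketch only; sq_sem is a hypothetical new field replacing
sq_count/send_wait):

	/* Sketch: a counting semaphore initialized to the SQ depth does
	 * the flow control, including the interrupted-wait bookkeeping.
	 * Needs <linux/semaphore.h>. */
	sema_init(&rdma->sq_sem, rdma->sq_depth);	/* at QP setup */

	/* before ib_post_send(): */
	if (down_interruptible(&rdma->sq_sem))
		return -EINTR;

	/* in the send completion handler: */
	up(&rdma->sq_sem);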
> > + wait_for_completion_interruptible(&rdma->cm_done);
>
> similarly there are lots of places you do this interruptible wait but
> then don't check if you were interrupted.
>
Well, the only place I do that is when the connection is being
established. And actually, I don't (didn't?) really care why it stopped
waiting; I only care about the state of the connection, and if it's not
what I need it to be, I error out. I suppose there is a theoretical race
here where the user cancels the wait with Ctrl-C but the connection
keeps going because the transport always gets back fast enough; that
would eventually fail in rdma_rpc.
I'll add the code to check the return value so that it's obvious.
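Something like (sketch; the state name below is illustrative, not
necessarily what the patch calls it):

	/* Sketch: fail cleanly if the wait was interrupted instead of
	 * falling through to the connection-state check. */
	err = wait_for_completion_interruptible(&rdma->cm_done);
	if (err)		/* -ERESTARTSYS: interrupted by a signal */
		goto error;
	if (rdma->state != P9_RDMA_CONNECTED)	/* illustrative name */
		goto error;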
> > + if (0 == (devattr.device_cap_flags & IB_DEVICE_LOCAL_DMA_LKEY)) {
> > + rdma->dma_mr = ib_get_dma_mr(rdma->pd, IB_ACCESS_LOCAL_WRITE);
> > + if (IS_ERR(rdma->dma_mr))
> > + goto error;
> > + rdma->lkey = rdma->dma_mr->lkey;
> > + } else
> > + rdma->lkey = rdma->cm_id->device->local_dma_lkey;
>
> seems to me this could be written more naturally and avoid the "if
> (0==(test))" construction (which I always have to reread to parse) if
> you reverse the test:
>
> if (devattr.device_cap_flags & IB_DEVICE_LOCAL_DMA_LKEY) {
> rdma->lkey = rdma->cm_id->device->local_dma_lkey;
> } else {
> rdma->dma_mr = ib_get_dma_mr(rdma->pd, IB_ACCESS_LOCAL_WRITE);
> if (IS_ERR(rdma->dma_mr))
> goto error;
> rdma->lkey = rdma->dma_mr->lkey;
> }
>
agree. thanks.
> The only place I see a send request posted has:
>
> > + wr.opcode = IB_WR_SEND;
>
> so I guess your protocol is only send/receives (ie no RDMA in the
> pedantic sense, just two-sided operations). I'm wondering what the
> motivation for all this is: is using send/receive on IB/iWARP a big
> performance win over non-offloaded transport? Even compared to TCP on a
> NIC with LSO and (software) LRO?
>
For IB it avoids IPoIB, which I think is a big win. For iWARP, it's not
so clear. There is also the potential for RDMA ops in the future.
> Is your protocol documented anywhere? It seems npfs has support for the
> same transport.
>
When using SEND/RECV, there isn't a protocol other than 9P itself. Once
we add RDMA ops, a protocol will be needed to exchange RKEYs, etc. But
maybe I don't understand the question?
> Finally, you create your QP with:
>
> > + qp_attr.sq_sig_type = IB_SIGNAL_REQ_WR;
>
> but you unconditionally do:
>
> > + wr.send_flags = IB_SEND_SIGNALED;
>
> not a big deal but you could save one line of code and a correspondingly
> tiny amount of .text by just making the send queue signal all requests.
>
ok.
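I.e. (sketch):

	/* Sketch: ask for a completion on every send at QP creation ... */
	qp_attr.sq_sig_type = IB_SIGNAL_ALL_WR;

	/* ... and drop the per-WR flag when posting: */
	wr.send_flags = 0;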
Thanks for the review,
Tom
> - R.