Message-ID: <48EA549E.1050001@opengridcomputing.com>
Date: Mon, 06 Oct 2008 13:10:38 -0500
From: Tom Tucker <tom@...ngridcomputing.com>
To: Roland Dreier <rdreier@...co.com>
CC: ericvh@...il.com, linux-kernel@...r.kernel.org,
v9fs-devel@...r.kernel.org, rminnich@...cer.ca.sandia.gov,
lionkov@...l.gov
Subject: Re: [PATCH 01/03] 9prdma: RDMA Transport Support for 9P

Roland Dreier wrote:
> very cool... neat that it only takes 1000 lines of code to do this too.
>
> a few quick comments from a cursory read:
>
> > This file implements the RDMA transport provider for 9P. It allows
> > mounts to be performed over iWARP and IB capable network interfaces
> > and uses the OpenFabrics API to perform I/O.
>
> I don't like this "openfabrics API" terminology -- the RDMA API is just
> one part of the kernel API that you're using, and I'm not sure it's
> worth calling out specially. ie just delete that third line from the
> changelog.
>
Ok.
> > + atomic_t next_tag;
>
> this seems to be a write-only field... delete it?
This is leftover garbage. Nice catch.
>
> > + * @wc_op: Mellanox's broken HW doesn't provide the original WR op
> > + * when the CQE completes in error. This forces apps to keep track of
> > + * the op themselves. Yes, it's a Pet Peeve of mine ;-)
>
> there's nothing broken about this behavior -- the IB spec very
> explicitly calls out that the opcode field is undefined for completions
> with error status.
>
Fair enough...
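For the archive, the pattern the comment is describing is just recording
the posted op in the per-WR context hung off wr_id and using that on
error, since wc->opcode is only defined for successful completions.
Roughly (a sketch only; the context layout and handle_error() are
illustrative, not the exact code in the patch):

	/*
	 * Sketch: remember what we posted, because wc->opcode is
	 * undefined when wc->status != IB_WC_SUCCESS.
	 */
	struct p9_rdma_context {
		enum ib_wc_opcode wc_op;  /* IB_WC_SEND or IB_WC_RECV */
		void *priv;               /* request pointer, DMA state, ... */
	};

	static void cq_event_handler(struct ib_wc *wc)
	{
		struct p9_rdma_context *c =
			(struct p9_rdma_context *)(unsigned long)wc->wr_id;

		if (wc->status != IB_WC_SUCCESS) {
			/* use c->wc_op, not wc->opcode, to tell send from recv */
			handle_error(c, c->wc_op);
			return;
		}
		/* normal path: wc->opcode is valid here */
	}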
> > +find_req(struct p9_trans_rdma *rdma, u16 tag)
>
> > + return found ? req : 0;
>
> using 0 instead of NULL here... probably worth running all this through
> sparse to see what else it flags.
>
will do.
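For the archive: "make C=2 net/9p/" (or wherever the transport file ends
up) runs sparse over that directory, and the line above just becomes:

	/* find_req(): return NULL, not 0, for a pointer */
	return found ? req : NULL;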
> > + if (atomic_inc_return(&rdma->sq_count) >= rdma->sq_depth)
> > + wait_event_interruptible
> > + (rdma->send_wait,
> > + (atomic_read(&rdma->sq_count) < rdma->sq_depth));
>
> this worries me a bit... for one thing you use wait_event_interruptible()
> without checking the return value, so it's entirely possible that this
> gets woken up before the condition actually becomes true. And to handle
> that correctly, you'd need extra code to decrement the sq_count and wake
> up other waiters if it was interrupted, etc.
>
Yes, I think this needs to be wait_event.
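I.e. (sketch only, same fields as in the posted code):

	/* Sketch: wait uninterruptibly for an SQ slot, so a signal can't
	 * wake us before the condition is actually true. */
	if (atomic_inc_return(&rdma->sq_count) >= rdma->sq_depth)
		wait_event(rdma->send_wait,
			   (atomic_read(&rdma->sq_count) < rdma->sq_depth));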
> I know counting semaphores are out of favor in the kernel but this seems
> to be a good place to use real semaphores, rather than concocting your
> own in a much more complicated way.
>
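For the record, the semaphore variant would look roughly like this
(sketch only; sq_sem is a hypothetical new field replacing
sq_count/send_wait):

	/* Sketch: a counting semaphore initialized to the SQ depth does
	 * the flow control, including the interrupted-wait bookkeeping.
	 * Needs <linux/semaphore.h>. */
	sema_init(&rdma->sq_sem, rdma->sq_depth);	/* at QP setup */

	/* before ib_post_send(): */
	if (down_interruptible(&rdma->sq_sem))
		return -EINTR;

	/* in the send completion handler: */
	up(&rdma->sq_sem);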
> > + wait_for_completion_interruptible(&rdma->cm_done);
>
> similarly there are lots of places you do this interruptible wait but
> then don't check if you were interrupted.
>
Well, the only place I do that is when the connection is being
established. And actually, I don't (didn't?) really care why it stopped
waiting; I only care about the state of the connection, and if it's not
what I need it to be, I error out. I suppose there is a theoretical race
here where the user cancels the wait with Ctrl-C but the connection
keeps going because the transport always gets back fast enough; that
would eventually fail in rdma_rpc.
I'll add the code to check the return value so that it's obvious.
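Something like (sketch; the state name below is illustrative, not
necessarily what the patch calls it):

	/* Sketch: fail cleanly if the wait was interrupted instead of
	 * falling through to the connection-state check. */
	err = wait_for_completion_interruptible(&rdma->cm_done);
	if (err)		/* -ERESTARTSYS: interrupted by a signal */
		goto error;
	if (rdma->state != P9_RDMA_CONNECTED)	/* illustrative name */
		goto error;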
> > + if (0 == (devattr.device_cap_flags & IB_DEVICE_LOCAL_DMA_LKEY)) {
> > + rdma->dma_mr = ib_get_dma_mr(rdma->pd, IB_ACCESS_LOCAL_WRITE);
> > + if (IS_ERR(rdma->dma_mr))
> > + goto error;
> > + rdma->lkey = rdma->dma_mr->lkey;
> > + } else
> > + rdma->lkey = rdma->cm_id->device->local_dma_lkey;
>
> seems to me this could be written more naturally and avoid the "if
> (0==(test))" construction (which I always have to reread to parse) if
> you reverse the test:
>
> if (devattr.device_cap_flags & IB_DEVICE_LOCAL_DMA_LKEY) {
> rdma->lkey = rdma->cm_id->device->local_dma_lkey;
> } else {
> rdma->dma_mr = ib_get_dma_mr(rdma->pd, IB_ACCESS_LOCAL_WRITE);
> if (IS_ERR(rdma->dma_mr))
> goto error;
> rdma->lkey = rdma->dma_mr->lkey;
> }
>
agree. thanks.
> The only place I see a send request posted has:
>
> > + wr.opcode = IB_WR_SEND;
>
> so I guess your protocol is only send/receives (ie no RDMA in the
> pedantic sense, just two-sided operations). I'm wondering what the
> motivation for all this is: is using send/receive on IB/iWARP a big
> performance win over non-offloaded transport? Even compared to TCP on a
> NIC with LSO and (software) LRO?
>
For IB it avoids IPoIB, which I think is a big win. For iWARP, it's not
so clear. There is also the potential for RDMA ops in the future.
> Is your protocol documented anywhere? It seems npfs has support for the
> same transport.
>
When using SEND/RECV, there isn't a protocol other than 9P itself. Once
we add RDMA ops, a protocol will be needed to exchange RKEYs, etc. But
maybe I don't understand the question?
> Finally, you create your QP with:
>
> > + qp_attr.sq_sig_type = IB_SIGNAL_REQ_WR;
>
> but you unconditionally do:
>
> > + wr.send_flags = IB_SEND_SIGNALED;
>
> not a big deal but you could save one line of code and a correspondingly
> tiny amount of .text by just making the send queue signal all requests.
>
ok.
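I.e. (sketch):

	/* Sketch: ask for a completion on every send at QP creation ... */
	qp_attr.sq_sig_type = IB_SIGNAL_ALL_WR;

	/* ... and drop the per-WR flag when posting: */
	wr.send_flags = 0;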
Thanks for the review,
Tom
> - R.