linux-kernel - Re: [PATCH 01/03] 9prdma: RDMA Transport Support for 9P

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Date:	Wed, 01 Oct 2008 22:11:52 -0700
From:	Roland Dreier <rdreier@...co.com>
To:	Tom Tucker <tom@...ngridcomputing.com>
Cc:	ericvh@...il.com, linux-kernel@...r.kernel.org,
	v9fs-devel@...r.kernel.org, rminnich@...cer.ca.sandia.gov,
	lionkov@...l.gov
Subject: Re: [PATCH 01/03] 9prdma: RDMA Transport Support for 9P

very cool... neat that it only takes 1000 lines of code to do this too.

a few quick comments from a cursory read:

 > This file implements the RDMA transport provider for 9P. It allows
 > mounts to be performed over iWARP and IB capable network interfaces
 > and uses the OpenFabrics API to perform I/O.

I don't like this "openfabrics API" terminology -- the RDMA API is just
one part of the kernel API that you're using, and I'm not sure it's
worth calling out specially.  ie just delete that third line from the
changelog.

 > +	atomic_t next_tag;

this seems to be a write-only field... delete it?

 > + * @wc_op: Mellanox's broken HW doesn't provide the original WR op
 > + * when the CQE completes in error. This forces apps to keep track of
 > + * the op themselves. Yes, it's a Pet Peeve of mine ;-)

there's nothing broken about this behavior -- the IB spec very
explicitly calls out that the opcode field is undefined for completions
with error status.

 > +find_req(struct p9_trans_rdma *rdma, u16 tag)

 > +	return found ? req : 0;

using 0 instead of NULL here... probably worth running all this through
sparse to see what else if flags.

 > +	if (atomic_inc_return(&rdma->sq_count) >= rdma->sq_depth)
 > +		wait_event_interruptible
 > +			(rdma->send_wait,
 > +			 (atomic_read(&rdma->sq_count) < rdma->sq_depth));

this worries me a bit... for one thing you use wait_event_interruptible()
without checking the return value, so it's entirely possible that this
gets woken up before the condition actually becomes true.  And to handle
that correctly, you'd need extra code to decrement the sq_count and wake
up other waiters if it was interrupted, etc.

I know counting semaphores are out of favor in the kernel but this seems
to be a good place to use real semaphores, rather than concocting your
own in a much more complicated way.

 > +	wait_for_completion_interruptible(&rdma->cm_done);

similarly there are lots of places you do this interruptible wait but
then don't check if you were interrupted.

 > +	if (0 == (devattr.device_cap_flags & IB_DEVICE_LOCAL_DMA_LKEY)) {
 > +		rdma->dma_mr = ib_get_dma_mr(rdma->pd, IB_ACCESS_LOCAL_WRITE);
 > +		if (IS_ERR(rdma->dma_mr))
 > +			goto error;
 > +		rdma->lkey = rdma->dma_mr->lkey;
 > +	} else
 > +		rdma->lkey = rdma->cm_id->device->local_dma_lkey;

seems to me this could be written more naturally and avoid the "if
(0==(test))" construction (which I always have to reread to parse) if
you reverse the test:

	if (devattr.device_cap_flags & IB_DEVICE_LOCAL_DMA_LKEY) {
		rdma->lkey = rdma->cm_id->device->local_dma_lkey;
	} else {
		rdma->dma_mr = ib_get_dma_mr(rdma->pd, IB_ACCESS_LOCAL_WRITE);
		if (IS_ERR(rdma->dma_mr))
			goto error;
		rdma->lkey = rdma->dma_mr->lkey;
	}

The only place I see a send request posted has:

 > +	wr.opcode = IB_WR_SEND;

so I guess your protocol is only send/receives (ie no RDMA in the
pedantic sense, just two-sided operations).  I'm wondering what the
motivation for all this is: is using send/receive on IB/iWARP a big
performance win over non-offloaded transport?  Even compared to TCP on a
NIC with LSO and (software) LRO?

Is your protocol documented anywhere?  It seems npfs has support for the
same transport.

Finally, you create your QP with:

 > +	qp_attr.sq_sig_type = IB_SIGNAL_REQ_WR;

but you unconditionally do:

 > +	wr.send_flags = IB_SEND_SIGNALED;

not a big deal but you could save one line of code and a correspondingly
tiny amount of .text by just making the send queue signal all requests.

 - R.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/