[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20100407180929.GB26186@redhat.com>
Date: Wed, 7 Apr 2010 21:09:30 +0300
From: "Michael S. Tsirkin" <mst@...hat.com>
To: David Stevens <dlstevens@...ibm.com>
Cc: kvm@...r.kernel.org, kvm-owner@...r.kernel.org,
netdev@...r.kernel.org, rusty@...tcorp.com.au,
virtualization@...ts.osdl.org
Subject: Re: [PATCH v3] Add Mergeable receive buffer support to vhost_net
On Wed, Apr 07, 2010 at 10:37:17AM -0700, David Stevens wrote:
> >
> > Thanks!
> > There's some whitespace damage, are you sending with your new
> > sendmail setup? It seems to have worked for qemu patches ...
>
> Yes, I saw some line wraps in what I received, but I checked
> the original draft to be sure and they weren't there. Possibly from
> the relay; Sigh.
>
>
> > > @@ -167,8 +166,15 @@ static void handle_tx(struct vhost_net *
> > > /* TODO: Check specific error and bomb out unless ENOBUFS? */
> > > err = sock->ops->sendmsg(NULL, sock, &msg, len);
> > > if (unlikely(err < 0)) {
> > > - vhost_discard_vq_desc(vq);
> > > - tx_poll_start(net, sock);
> > > + if (err == -EAGAIN) {
> > > + vhost_discard_desc(vq, 1);
> > > + tx_poll_start(net, sock);
> > > + } else {
> > > + vq_err(vq, "sendmsg: errno %d\n", -err);
> > > + /* drop packet; do not discard/resend */
> > > + vhost_add_used_and_signal(&net->dev, vq, head,
> > > + 0);
> >
> > vhost does not currently has a consistent error handling strategy:
> > if we drop packets, need to think which other errors should cause
> > packet drops. I prefer to just call vq_err for now,
> > and have us look at handling segfaults etc in a consistent way
> > separately.
>
> I had to add this to avoid an infinite loop when I wrote a bad
> packet on the socket. I agree error handling needs a better look,
> but retrying a bad packet continuously while dumping in the log
> is what it was doing when I hit an error before this code. Isn't
> this better, at least until a second look?
>
Hmm, what do you mean 'continuously'? Don't we only try again
on next kick?
> > > +}
> > > +
> >
> > I wonder whether it makes sense to check
> > skb_queue_empty(&sk->sk_receive_queue)
> > outside the lock, to reduce the cost of this call
> > on an empty queue (we know that it happens at least once
> > each time we exit the loop on rx)?
>
> I was looking at alternatives to adding the lock in the
> first place, but I found I couldn't measure a difference in the
> cost with and without the lock.
>
>
> > > + int retries = 0;
> > > size_t len, total_len = 0;
> > > - int err;
> > > + int err, headcount, datalen;
> > > size_t hdr_size;
> > > struct socket *sock = rcu_dereference(vq->private_data);
> > > if (!sock || skb_queue_empty(&sock->sk->sk_receive_queue))
> > > @@ -222,31 +242,25 @@ static void handle_rx(struct vhost_net *
> > > vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) ?
> > > vq->log : NULL;
> > >
> > > - for (;;) {
> > > - head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
> > > - ARRAY_SIZE(vq->iov),
> > > - &out, &in,
> > > - vq_log, &log);
> > > + while ((datalen = vhost_head_len(sock->sk))) {
> > > + headcount = vhost_get_desc_n(vq, vq->heads, datalen, &in,
> > > + vq_log, &log);
> >
> > This looks like a bug, I think we need to pass
> > datalen + header size to vhost_get_desc_n.
> > Not sure how we know the header size that backend will use though.
> > Maybe just look at our features.
>
> Yes; we have hdr_size, so I can add it here. It'll be 0 for
> the cases where the backend and guest both have vnet header (either
> the regular or larger mergeable buffers one), but should be added
> in for future raw socket support.
So hdr_size is the wrong thing to add then.
We need to add non-zero value for tap now.
> >
> > > /* OK, now we need to know about added descriptors. */
> > > - if (head == vq->num) {
> > > - if (unlikely(vhost_enable_notify(vq))) {
> > > + if (!headcount) {
> > > + if (retries == 0 && unlikely(vhost_enable_notify(vq))) {
> > > /* They have slipped one in as we were
> > > * doing that: check again. */
> > > vhost_disable_notify(vq);
> > > + retries++;
> > > continue;
> > > }
> >
> > Hmm. The reason we have the code at all, as the comment says, is because
> > guest could have added more buffers between the time we read last index
> > and the time we enabled notification. So if we just break like this
> > the race still exists. We could remember the
> > last head value we observed, and have vhost_enable_notify check
> > against this value?
>
> This is to prevent a spin loop in the case where we have some
> buffers available, but not enough for the current packet (ie, this
> is the replacement code for the "rxmaxheadcount" business). If they
> actually added something new, retrying once should see it, but what
> vhost_enable_notify() returns non-zero on is not "new buffers" but
> rather "not empty". In the case mentioned, we aren't empty, so
> vhost_enable_notify() returns nonzero every time, but the guest hasn't
> given us enough buffers to proceed, so we continuously retry; this
> code breaks the spin loop until we've really got new buffers from
> the guest.
What about the race I point out above? It seems to potentially
cause a deadlock.
> >
> > Need to think about it.
> >
> > Another concern here is that on retries vhost_get_desc_n
> > is doing extra work, rescanning the same descriptor
> > again and again. Not sure how common this is, might be
> > worthwhile to add a TODO to consider this at least.
>
> I had a printk in there to test the code and with the
> retries counter, it happens when we fill the ring (once,
> because of the retries checks), and then proceeds as
> desired when the guest gives us more buffers. Without the
> check, it spews until we hear from the guest.
I don't understand whether what you write above means 'yes this happens
and is worth optimizing for' or 'this happens very rarely
and is not worth optimizing for'.
> > > @@ -261,14 +275,33 @@ static void handle_rx(struct vhost_net *
> > > len, MSG_DONTWAIT | MSG_TRUNC);
> > > /* TODO: Check specific error and bomb out unless EAGAIN? */
> > > if (err < 0) {
> >
> > I think we need to compare err and datalen and drop packet on mismatch
> as well.
> > The check err > len won't be needed then.
>
> OK, but this is the original code, unchanged by my patch. :-)
Right. original code does not know the datalen.
> > > - vhost_discard_vq_desc(vq);
> > > + vhost_discard_desc(vq, headcount);
> > > break;
> > > }
> > > /* TODO: Should check and handle checksum. */
> > > + if (vhost_has_feature(&net->dev, VIRTIO_NET_F_MRG_RXBUF)) {
> > > + struct virtio_net_hdr_mrg_rxbuf *vhdr =
> > > + (struct virtio_net_hdr_mrg_rxbuf *)
> > > + vq->iov[0].iov_base;
> > > + /* add num_buffers */
> > > + if (vhost_has_feature(&net->dev,
> > > + VHOST_NET_F_VIRTIO_NET_HDR))
> > > + hdr.num_buffers = headcount;
> >
> > Why is the above necessary?
>
> This adds mergeable buffer support for the raw case.
So my suggestion is to have a copy of the header
and then same code will fill in the field for
both raw and tap.
> >
> > > + else if (vq->iov[0].iov_len < sizeof(*vhdr)) {
> >
> > I think length check is already done when we copy the header. No?
>
> This sets num_buffers for the tap case (where hdr_size is 0).
Ah, right. So let's just check the total length is > sizeof(*vhdr).
> We don't copy any headers at all for this case, but we need to
> add num_buffers at offset 10 in the user buffer already read by
> recvmsg().
>
> >
> > > + vq_err(vq, "tiny buffers < %d unsupported",
> > > + vq->iov[0].iov_len);
> > > + vhost_discard_desc(vq, headcount);
> > > + break;
> >
> > Problem here is that recvmsg might modify iov.
> > This is why I think we need to use vq->hdr here.
>
> rcvmsg() can never modify the iovec;
Look at af_packet code for an example where it does.
The pointer is non-const, it's OK to modify.
tap used to modify iovec as well, the fact it doesn't
now is a side-effect of reusing same code for
recvmsg and aio read.
> it's the standard socket call.
It's an internal API that is used to implement the call.
iovec it gets is a kernel side copy of what user passed in.
> To use vq->hdr, we'd have to copy
> it in from user space, modify it, and then copy it back
> in to user space, on every packet. In the normal tap case,
> hdr_size is 0 and we read it directly from the socket to
> user space. It is already correct for mergeable buffers,
> except for the num_buffers value added here.
I don't understand what you write above, sorry.
We have iov, all we need is store the first address+length
in the hdr field.
This should be close to free and does not involve any copies
from/to userspace. All the branching and special-casing
is possibly more expensive.
> >
> > > + } else if (put_user(headcount, &vhdr->num_buffers)) {
> >
> > The above put_user writes out a 32 bit value, right?
> > This seems wrong.
>
> I'll recheck this -- I'll make the type of "headcount" be
> the type of "num_buffers" in the MRXB vnet header, if it isn't
> already.
No, I was confused Sorry about the noise.
> >
> > How about using
> > memcpy_toiovecend(vq->hdr, &headcount,
> > offsetof(struct virtio_net_hdr_mrg_rxbuf, num_buffers),
> > sizeof headcount);
> >
> > this way we also do not make any assumptions about layout.
>
> Because this overwrites the valid vnet header we got from
> the tap socket with a local copy that isn't correct.
It does? I think that this only writes out 2 bytes at an offset.
> For this to
> work, we first need to copy_from_user() the vnet header from the
> socket into vq->hdr (on every packet).
copy_from_user from vnet header to vq->hdr does not
seem to make any sense and is not what I suggested.
> It doesn't assume anything here-- it requires (and checks)
> that the user didn't give us a <12 byte buffer, which I think is
> reasonable.
> That's about the size of the descriptor for the buffer,
> and I'd call that a broken guest.
I think it's better not to assume this. virtio spec does
mentions layout assumptions as legacy limitations.
Guest can post descriptor consisting of 2 buffers:
1. 10 bytes. 2. 2 bytes. and I don't see a reason
not to support this unless this makes code simpler.
In this case code is more complex.
> The cost of supporting those
> tiny buffers in the general case would be a copy_from_user() and
> copy_to_user() of the vnet_hdr on every packet, which I think is
> not worth it (and especially since I don't expect any guest is
> ever going to give us a <12 byte buffer in the first place).
You keep saying this, but I do not see any need for extra
copy_to_user. Just use memcpy_toiovecend instead of put_user.
> > > @@ -560,9 +593,14 @@ done:
> > >
> > > static int vhost_net_set_features(struct vhost_net *n, u64 features)
> > > {
> > > - size_t hdr_size = features & (1 << VHOST_NET_F_VIRTIO_NET_HDR) ?
> > > - sizeof(struct virtio_net_hdr) : 0;
> > > + size_t hdr_size = 0;
> > > int i;
> > > +
> > > + if (features & (1 << VHOST_NET_F_VIRTIO_NET_HDR)) {
> > > + hdr_size = sizeof(struct virtio_net_hdr);
> > > + if (features & (1 << VIRTIO_NET_F_MRG_RXBUF))
> > > + hdr_size = sizeof(struct virtio_net_hdr_mrg_rxbuf);
> >
> > My personal style for this would be:
> > if (!(features & (1 << VHOST_NET_F_VIRTIO_NET_HDR)))
> > hdr_size = 0
> > else if (!(features & (1 << VIRTIO_NET_F_MRG_RXBUF)))
> > hdr_size = sizeof(virtio_net_hdr);
> > else
> > hdr_size = sizeof(struct virtio_net_hdr_mrg_rxbuf);
> >
> > which results in more symmetry and less nesting.
>
> OK.
>
> >
> > > mutex_lock(&n->dev.mutex);
> > > if ((features & (1 << VHOST_F_LOG_ALL)) &&
> > > !vhost_log_access_ok(&n->dev)) {
> > > diff -ruNp net-next-p0/drivers/vhost/vhost.c
> > > net-next-v3/drivers/vhost/vhost.c
> > > --- net-next-p0/drivers/vhost/vhost.c 2010-03-22 12:04:38.000000000
> > > -0700
> > > +++ net-next-v3/drivers/vhost/vhost.c 2010-04-06 12:57:51.000000000
> > > -0700
> > > @@ -856,6 +856,47 @@ static unsigned get_indirect(struct vhos
> > > return 0;
> > > }
> > >
> > > +/* This is a multi-buffer version of vhost_get_vq_desc
> > > + * @vq - the relevant virtqueue
> > > + * datalen - data length we'll be reading
> > > + * @iovcount - returned count of io vectors we fill
> > > + * @log - vhost log
> > > + * @log_num - log offset
> > > + * returns number of buffer heads allocated, 0 on error
> >
> > This is unusual. Let's return a negative error code on error.
>
> This is like malloc -- "0" is never a valid return value, and
> we don't have actual error values to return and handle; they generate
> vq_err() messages within the code itself. In all cases,
> it is the count of buffers we allocated (which is 0 when we fail to
> allocate). So, I don't agree with this, but can do it if you insist.
malloc returns NULL, not zero.
standard error handling approaches are:
- NULL on error
- ERR_PTR on error
- >= 0 for success, -errno for error
- true for success false for error
And of course
- some custom strategy so you need to read documentation to figure out
how to use a function
which is where this one gets classified.
So I am not saying 0 won't work, just that standard stuff is better.
> >
> > > + */
> > > +int vhost_get_desc_n(struct vhost_virtqueue *vq, struct
> vring_used_elem
> > > *heads,
> > > + int datalen, int *iovcount, struct vhost_log *log,
> > > + unsigned int *log_num)
> > > +{
> > > + int out, in;
> > > + int seg = 0; /* iov index */
> > > + int hc = 0; /* head count */
> > > +
> > > + while (datalen > 0) {
> > > + if (hc >= VHOST_NET_MAX_SG)
> > > + goto err;
> > > + heads[hc].id = vhost_get_desc(vq->dev, vq, vq->iov+seg,
> > > + ARRAY_SIZE(vq->iov)-seg, &out,
> > > + &in, log, log_num);
> > > + if (heads[hc].id == vq->num)
> > > + goto err;
> > > + if (out || in <= 0) {
> > > + vq_err(vq, "unexpected descriptor format for RX: "
> > > + "out %d, in %d\n", out, in);
> > > + goto err;
> > > + }
> > > + heads[hc].len = iov_length(vq->iov+seg, in);
> > > + datalen -= heads[hc].len;
> >
> > This signed/unsigned mix makes me nervuous.
> > Let's make datalen unsigned, add unsigned total_len, and
> > while (datalen < total_len).
> >
> > > + hc++;
> > > + seg += in;
> > > + }
> > > + *iovcount = seg;
> > > + return hc;
> > > +err:
> > > + vhost_discard_desc(vq, hc);
> > > + return 0;
> > > +}
> > > +
> > > /* This looks in the virtqueue and for the first available buffer,
> and
> > > converts
> > > * it to an iovec for convenient access. Since descriptors consist
> of
> > > some
> > > * number of output then some number of input descriptors, it's
> > > actually two
> > > @@ -863,7 +904,7 @@ static unsigned get_indirect(struct vhos
> > > *
> > > * This function returns the descriptor number found, or vq->num
> (which
> > > * is never a valid descriptor number) if none was found. */
> > > -unsigned vhost_get_vq_desc(struct vhost_dev *dev, struct
> > > vhost_virtqueue *vq,
> > > +unsigned vhost_get_desc(struct vhost_dev *dev, struct vhost_virtqueue
> > > *vq,
> > > struct iovec iov[], unsigned int iov_size,
> > > unsigned int *out_num, unsigned int *in_num,
> > > struct vhost_log *log, unsigned int *log_num)
> > > @@ -981,31 +1022,42 @@ unsigned vhost_get_vq_desc(struct vhost_
> > > }
> > >
> > > /* Reverse the effect of vhost_get_vq_desc. Useful for error
> handling.
> > > */
> > > -void vhost_discard_vq_desc(struct vhost_virtqueue *vq)
> > > +void vhost_discard_desc(struct vhost_virtqueue *vq, int n)
> > > {
> > > - vq->last_avail_idx--;
> > > + vq->last_avail_idx -= n;
> > > }
> > >
> > > /* After we've used one of their buffers, we tell them about it.
> We'll
> > > then
> > > * want to notify the guest, using eventfd. */
> > > -int vhost_add_used(struct vhost_virtqueue *vq, unsigned int head, int
> > > len)
> > > +int vhost_add_used(struct vhost_virtqueue *vq, struct vring_used_elem
> > > *heads,
> > > + int count)
> >
> > I think we are better off with vhost_add_used and vhost_add_used_n:
> > the version with _n has a lot of extra complexity, and tx always
> > adds them 1 by one.
>
> The external code only calls vhost_add_used_and_signal(), so I
> split that. I can add a single variant of vhost_add_used() too, but
> I was trying to avoid adding duplicate code and external interface.
> I don't actually agree with splitting these; isn't it better to have
> them going through the same code path whether it's single or multiple?
> A bug in one would show up as an intermittent failure, and a change
> to one requires changing both.
It's a small function so I am not really worried about maintainance.
> I don't think the multiple should just
> do the single repeatedly, either, since the multiple variant now can
> do the work in one or two copy_to_user()'s, without a loop;
Yes.
> so splitting
> these just makes a special case for single to carry in code maintenance
> with no benefit, I think.
I donnu, I have a feeling all these loops and memory accesses
on data path won't be free, and for tx this just adds overhead.
> >
> > > {
> > > struct vring_used_elem *used;
> > > + int start, n;
> > > +
> > > + if (count <= 0)
> > > + return -EINVAL;
> > >
> > > - /* The virtqueue contains a ring of used buffers. Get a pointer
> to
> > > the
> > > - * next entry in that used ring. */
> > > - used = &vq->used->ring[vq->last_used_idx % vq->num];
> > > - if (put_user(head, &used->id)) {
> > > - vq_err(vq, "Failed to write used id");
> > > + start = vq->last_used_idx % vq->num;
> > > + if (vq->num - start < count)
> > > + n = vq->num - start;
> > > + else
> > > + n = count;
> >
> > use min?
>
> Sure.
>
> >
> > > + used = vq->used->ring + start;
> > > + if (copy_to_user(used, heads, sizeof(heads[0])*n)) {
> > > + vq_err(vq, "Failed to write used");
> > > return -EFAULT;
> > > }
> > > - if (put_user(len, &used->len)) {
> > > - vq_err(vq, "Failed to write used len");
> > > - return -EFAULT;
> > > + if (n < count) { /* wrapped the ring */
> > > + used = vq->used->ring;
> > > + if (copy_to_user(used, heads+n, sizeof(heads[0])*(count-n))) {
> > > + vq_err(vq, "Failed to write used");
> > > + return -EFAULT;
> > > + }
> > > }
> > > /* Make sure buffer is written before we update index. */
> > > smp_wmb();
> > > - if (put_user(vq->last_used_idx + 1, &vq->used->idx)) {
> > > + if (put_user(vq->last_used_idx+count, &vq->used->idx)) {
> >
> > I am a bit confused ... will this write a 32 or 16 bit value?
> > count is 32 bit ... Maybe we are better off with
> > u16 idx = vq->last_used_idx + count
> > put_user(idx, &vq->used->idx)
> > vq->last_used_idx = idx
>
> How about a cast?:
>
> if (put_user((u16)(vq->last_used_idx+count), &vq->used->idx)) {
>
>
> +-DLS
No, I was confused. Code is ok (just whitespace damaged).
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Powered by blists - more mailing lists