[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20100308074533.GA7482@redhat.com>
Date: Mon, 8 Mar 2010 09:45:33 +0200
From: "Michael S. Tsirkin" <mst@...hat.com>
To: David Stevens <dlstevens@...ibm.com>
Cc: kvm@...r.kernel.org, netdev@...r.kernel.org, rusty@...tcorp.com.au,
virtualization@...ts.osdl.org
Subject: Re: [RFC][ PATCH 1/3] vhost-net: support multiple buffer heads in
receiver
On Sun, Mar 07, 2010 at 04:34:32PM -0800, David Stevens wrote:
> "Michael S. Tsirkin" <mst@...hat.com> wrote on 03/07/2010 07:31:30 AM:
>
> > On Tue, Mar 02, 2010 at 05:20:15PM -0700, David Stevens wrote:
> > > This patch generalizes buffer handling functions to
> NULL, NULL);
> > > + head.iov_base = (void *)vhost_get_vq_desc(&net->dev,
> vq,
> > > + vq->iov, ARRAY_SIZE(vq->iov), &out, &in, NULL,
>
> > > NULL);
> >
> > Should type for head be changed so that we do not need to cast?
> >
> > I also prefer aligning descendants on the opening (.
>
> Yes, that's probably better; the indentation with the cast would
> still wrap a lot, but I'll see what I can do here.
>
>
> > > err = sock->ops->sendmsg(NULL, sock, &msg, len);
> > > if (unlikely(err < 0)) {
> > > - vhost_discard_vq_desc(vq);
> > > + vhost_discard(vq, 1);
> >
> > Isn't the original name a bit more descriptive?
>
> During development, I had both and I generally like
> shorter names, but I can change it back.
>
> > > +static int skb_head_len(struct sk_buff_head *skq)
> > > +{
> > > + struct sk_buff *head;
> > > +
> > > + head = skb_peek(skq);
> > > + if (head)
> > > + return head->len;
> > > + return 0;
> > > +}
> > > +
> >
> > This is done without locking, which I think can crash
> > if skb is consumed after we peek it but before we read the
> > length.
>
> This thread is the only legitimate consumer, right? But
> qemu has the file descriptor and I guess we shouldn't trust
> that it won't give it to someone else; it'd break vhost, but
> a crash would be inappropriate, of course. I'd like to avoid
> the lock, but I have another idea here, so will investigate.
>
> >
> >
> > > /* Expects to be always run from workqueue - which acts as
> > > * read-size critical section for our kind of RCU. */
> > > static void handle_rx(struct vhost_net *net)
> > > {
> > > struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_RX];
> > > - unsigned head, out, in, log, s;
> > > + unsigned in, log, s;
> > > struct vhost_log *vq_log;
> > > struct msghdr msg = {
> > > .msg_name = NULL,
> > > @@ -204,10 +213,11 @@
> > > };
> > >
> > > size_t len, total_len = 0;
> > > - int err;
> > > + int err, headcount, datalen;
> > > size_t hdr_size;
> > > struct socket *sock = rcu_dereference(vq->private_data);
> > > - if (!sock || skb_queue_empty(&sock->sk->sk_receive_queue))
> > > +
> > > + if (!sock || !skb_head_len(&sock->sk->sk_receive_queue))
> > > return;
> > >
> >
> > Isn't this equivalent? Do you expect zero len skbs in socket?
> > If yes, maybe we should discard these, not stop processing ...
>
> A zero return means "no skb's". They are equivalent; I
> changed this call just to make it identical to the loop check,
> but I don't have a strong attachment to this.
>
>
> > > vq_log = unlikely(vhost_has_feature(&net->dev,
> VHOST_F_LOG_ALL)) ?
> > > vq->log : NULL;
> > >
> > > - for (;;) {
> > > - head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
> > > - ARRAY_SIZE(vq->iov),
> > > - &out, &in,
> > > - vq_log, &log);
> > > + while ((datalen = skb_head_len(&sock->sk->sk_receive_queue)))
> {
> >
> > This peeks in the queue to figure out how much data we have.
> > It's a neat trick, but it does introduce new failure modes
> > where an skb is consumed after we did the peek:
> > - new skb could be shorter, we should return the unconsumed part
> > - new skb could be longer, this will drop a packet.
> > maybe this last is not an issue as the race should be rare in practice
>
> As before, this loop is the only legitimate consumer of skb's; if
> anyone else is reading the socket, it's broken. But since the file
> descriptor is available to qemu, it's probably trusting qemu too much.
> I don't believe there is a race at all on a working system; the head
> can't change until we read it after this test, but I'll see if I can
> come up with something better here. Closing the fd for qemu so vhost
> has exclusive access might do it, but otherwise we could just get a
> max-sized packet worth of buffers and then return them after the read.
> That'd force us to wait with small packets when the ring is nearly
> full, where we don't have to now, but I expect locking to read the head
> length will hurt performance in all cases. Will try these ideas out.k
>
> >
> > > + headcount = vhost_get_heads(vq, datalen, &in, vq_log,
> > > &log);
> > > /* OK, now we need to know about added descriptors. */
> > > - if (head == vq->num) {
> > > + if (!headcount) {
> > > if (unlikely(vhost_enable_notify(vq))) {
> > > /* They have slipped one in as we were
> > > * doing that: check again. */
> > > @@ -235,13 +242,6 @@
> > > * they refilled. */
> > > break;
> > > }
> > > - /* We don't need to be notified again. */
> >
> > you find this comment unhelpful?
>
> This code is changed in the later patches; the comment was
> removed with that, but I got it in the wrong patch on the split.
> I guess the comment is ok to stay, anyway, but notification may
> require multiple buffers to be available; I had fixed that here, but
> the final patch pushes that into the notify code, so yes, this can
> go back in.
>
> > > - if (out) {
> > > - vq_err(vq, "Unexpected descriptor format for
> RX: "
> > > - "out %d, int %d\n",
> > > - out, in);
> > > - break;
> > > - }
> >
> >
> > we still need this check, don't we?
>
> It's done in vhost_get_heads(); "out" isn't visible in handle_rx()
> anymore.
>
> > > +unsigned vhost_get_heads(struct vhost_virtqueue *vq, int datalen, int
>
> > > *iovcount,
> > > + struct vhost_log *log, unsigned int *log_num)
> >
> > Could you please document this function's parameters? It's not obvious
> > what iovcount, log, log_num are.
>
> Yes. To answer the question, they are the same as vhost_get_vq_desc(),
> except "iovcount" replaces "in" (and "out" is not needed in the caller).
>
> > > +{
> > > + struct iovec *heads = vq->heads;
> > > + int out, in;
> > > + int hc = 0;
> > > +
> > > + while (datalen > 0) {
> > > + if (hc >= VHOST_NET_MAX_SG) {
> > > + vhost_discard(vq, hc);
> > > + return 0;
> > > + }
> > > + heads[hc].iov_base = (void
> *)vhost_get_vq_desc(vq->dev,
> > > vq,
> > > + vq->iov, ARRAY_SIZE(vq->iov), &out, &in, log,
> > > log_num);
> > > + if (heads[hc].iov_base == (void *)vq->num) {
> > > + vhost_discard(vq, hc);
> > > + return 0;
> > > + }
> > > + if (out || in <= 0) {
> > > + vq_err(vq, "unexpected descriptor format for
> RX: "
> > > + "out %d, in %d\n", out, in);
> > > + vhost_discard(vq, hc);
> > > + return 0;
> > > + }
> > > + heads[hc].iov_len = iov_length(vq->iov, in);
> > > + hc++;
> > > + datalen -= heads[hc].iov_len;
> > > + }
> > > + *iovcount = in;
> >
> >
> > Only the last value?
>
> In this part of the split, it only goes through once; this is
> changed to the accumulated value "seg" in a later patch. So, split
> artifact.
>
> {
> > > struct vring_used_elem *used;
> > > + int i;
> > >
> > > - /* The virtqueue contains a ring of used buffers. Get a
> pointer
> > > to the
> > > - * next entry in that used ring. */
> > > - used = &vq->used->ring[vq->last_used_idx % vq->num];
> > > - if (put_user(head, &used->id)) {
> > > - vq_err(vq, "Failed to write used id");
> > > - return -EFAULT;
> > > - }
> > > - if (put_user(len, &used->len)) {
> > > - vq_err(vq, "Failed to write used len");
> > > - return -EFAULT;
> > > + for (i=0; i<count; i++, vq->last_used_idx++) {
> >
> > whitespace damage: I prefer space around =, <.
> > I also use ++i, etc in this driver, better be consistent?
> > Also for clarity, I prefer to put vq->last_used_idx inside the loop.
>
> OK.
> >
> > > + used = &vq->used->ring[vq->last_used_idx % vq->num];
> > > + if (put_user((unsigned)heads[i].iov_base, &used->id))
> {
> > > + vq_err(vq, "Failed to write used id");
> > > + return -EFAULT;
> > > + }
> > > + if (put_user(heads[i].iov_len, &used->len)) {
> > > + vq_err(vq, "Failed to write used len");
> > > + return -EFAULT;
> > > + }
> >
> > If this fails, last_used_idx will still be incremented, which I think is
> wrong.
>
> True, but if these fail, aren't we dead? I don't think we can
> recover
> from an EFAULT in any of these; I didn't test those paths from the
> original,
> but I think we need to bail out entirely for these cases, right?
>
> +-DLS
Yes, we stop on error, but host can fix uo the vq and redo a kick.
That's why there's errorfd.
--
MST
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Powered by blists - more mailing lists