Message-Id: <200911061529.17500.rusty@rustcorp.com.au>
Date: Fri, 6 Nov 2009 15:29:17 +1030
From: Rusty Russell <rusty@...tcorp.com.au>
To: "Michael S. Tsirkin" <mst@...hat.com>
Cc: netdev@...r.kernel.org, virtualization@...ts.linux-foundation.org,
kvm@...r.kernel.org, linux-kernel@...r.kernel.org, mingo@...e.hu,
linux-mm@...ck.org, akpm@...ux-foundation.org, hpa@...or.com,
gregory.haskins@...il.com, s.hetze@...ux-ag.com,
Daniel Walker <dwalker@...o99.com>,
Eric Dumazet <eric.dumazet@...il.com>,
"Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>
Subject: Re: [PATCHv8 3/3] vhost_net: a kernel-level virtio server
On Thu, 5 Nov 2009 02:27:24 am Michael S. Tsirkin wrote:
> What it is: vhost net is a character device that can be used to reduce
> the number of system calls involved in virtio networking.
Hi Michael,
Now that everyone else has finally kicked all the tires and it seems to pass,
I've done a fairly complete review. Generally, it's really nice; just one
bug and a few minor suggestions for polishing.
> +/* Caller must have TX VQ lock */
> +static void tx_poll_stop(struct vhost_net *net)
> +{
> + if (likely(net->tx_poll_state != VHOST_NET_POLL_STARTED))
> + return;
likely? Really?
> + for (;;) {
> + head = vhost_get_vq_desc(&net->dev, vq, vq->iov, &out, &in,
> + NULL, NULL);
Danger! You need an arg to vhost_get_vq_desc to tell it the max desc size
you can handle. Otherwise, it's only limited by ring size, and a malicious
guest can overflow you here, and below:
> + /* Skip header. TODO: support TSO. */
> + s = move_iovec_hdr(vq->iov, vq->hdr, hdr_size, out);
...
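To make the overflow concern concrete, here is a rough userspace sketch of the kind of bound I mean. Everything in it (struct desc, gather_chain, iov_max) is made up for illustration and is not the vhost API. The only point is that the caller passes in how many iovec entries it can hold, and the walk fails rather than writing past them, however long the guest makes the chain.

```c
#include <stddef.h>
#include <sys/uio.h>

/* Hypothetical, simplified descriptor: not the real struct vring_desc. */
struct desc {
	void *addr;
	size_t len;
	int has_next;		/* stand-in for VRING_DESC_F_NEXT */
	unsigned next;		/* index of the next descriptor in the chain */
};

/*
 * Walk a descriptor chain starting at 'head', filling at most 'iov_max'
 * entries of 'iov'. Returns the number of entries used, or -1 if the chain
 * is longer than the caller can handle or points outside the ring.
 */
static int gather_chain(const struct desc *ring, unsigned ring_size,
			unsigned head, struct iovec *iov, unsigned iov_max)
{
	unsigned i = head, used = 0;

	for (;;) {
		if (i >= ring_size)
			return -1;	/* guest gave us a bogus index */
		if (used == iov_max)
			return -1;	/* chain exceeds what the caller can hold */
		iov[used].iov_base = ring[i].addr;
		iov[used].iov_len = ring[i].len;
		used++;
		if (!ring[i].has_next)
			return (int)used;
		i = ring[i].next;
	}
}
```

Because the capacity check runs before every store, a chain that loops back on itself also terminates with an error instead of spinning or overflowing.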
> +
> + use_mm(net->dev.mm);
> + mutex_lock(&vq->mutex);
> + vhost_no_notify(vq);
I prefer a name like "vhost_disable_notify()".
> + /* OK, now we need to know about added descriptors. */
> + if (head == vq->num && vhost_notify(vq))
> + /* They could have slipped one in as we were doing that:
> + * check again. */
> + continue;
> + /* Nothing new? Wait for eventfd to tell us they refilled. */
> + if (head == vq->num)
> + break;
> + /* We don't need to be notified again. */
> + vhost_no_notify(vq);
Similarly, vhost_notify() would be better named vhost_enable_notify(). The
current name is particularly misleading since it doesn't actually notify
anything!
In particular, this code would be neater as:
	if (head == vq->num) {
		if (vhost_enable_notify(vq)) {
			/* Try again, they could have slipped one in. */
			continue;
		}
		/* Nothing more to do. */
		break;
	}
	vhost_disable_notify(vq);
Now, AFAICT vhost_notify()/enable_notify() would be better rewritten to
return true only when there's more pending. Saves a loop around here most
of the time. Also, the vhost_no_notify/vhost_disable_notify() can be moved
out of the loop entirely. (It could be under an if (unlikely(enabled)), not
sure if it's worth it).
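Roughly what I have in mind, as a toy userspace model. The struct and the toy_* helpers are invented for this sketch and are not the vhost code; a real version also needs a memory barrier between setting the flag and re-reading the ring. Here the enable call returns true only when something is actually pending, and the disable happens once before the loop, then again only in the rare case where enabling raced with a new buffer.

```c
#include <stdbool.h>

/* Hypothetical toy queue: not the real struct vhost_virtqueue. */
struct toy_vq {
	unsigned avail_idx;		/* how far the guest has published */
	unsigned last_avail_idx;	/* how far we have consumed */
	bool notify_enabled;
};

static void toy_disable_notify(struct toy_vq *vq)
{
	vq->notify_enabled = false;
}

/* Re-enable notification; return true only if work arrived meanwhile. */
static bool toy_enable_notify(struct toy_vq *vq)
{
	vq->notify_enabled = true;
	/* Real code needs a memory barrier here before re-reading the ring. */
	return vq->avail_idx != vq->last_avail_idx;
}

/* Drain the queue; returns how many buffers were handled. */
static unsigned toy_handle(struct toy_vq *vq)
{
	unsigned handled = 0;

	toy_disable_notify(vq);		/* once, outside the loop */
	for (;;) {
		if (vq->last_avail_idx == vq->avail_idx) {
			if (toy_enable_notify(vq)) {
				/* They slipped one in: keep going. */
				toy_disable_notify(vq);
				continue;
			}
			break;		/* nothing more to do */
		}
		vq->last_avail_idx++;
		handled++;
	}
	return handled;
}
```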
> + len = err;
> + err = memcpy_toiovec(vq->hdr, (unsigned char *)&hdr, hdr_size);
That unsigned char * arg to memcpy_toiovec is annoying. A patch might be
nice, separate from this effort.
> +static int vhost_net_open(struct inode *inode, struct file *f)
> +{
> + struct vhost_net *n = kzalloc(sizeof *n, GFP_KERNEL);
> + int r;
> + if (!n)
> + return -ENOMEM;
> + f->private_data = n;
> + n->vqs[VHOST_NET_VQ_TX].handle_kick = handle_tx_kick;
> + n->vqs[VHOST_NET_VQ_RX].handle_kick = handle_rx_kick;
I have a personal dislike of calloc for structures. In userspace, it's
because valgrind can't spot uninitialized fields. These days a similar
argument applies in the kernel, because we have KMEMCHECK now. If someone
adds a field to the struct and forgets to initialize it, we can spot it.
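A userspace illustration of the style I prefer, with a made-up struct toy_net and helpers (none of this is from the patch). The allocation is left unzeroed and each field is assigned by hand, so a field that someone adds later and forgets to set remains uninitialized, which is exactly what valgrind (or KMEMCHECK in the kernel) can flag.

```c
#include <stdlib.h>

/* Hypothetical structure, for illustration only. */
struct toy_net {
	int tx_poll_state;
	void (*handle_kick[2])(void);
};

static void toy_tx_kick(void) { }
static void toy_rx_kick(void) { }

static struct toy_net *toy_net_new(void)
{
	struct toy_net *n = malloc(sizeof *n);

	if (!n)
		return NULL;
	/* Every field is set by hand; a forgotten one stays uninitialized
	 * and is therefore visible to the memory checker. */
	n->tx_poll_state = 0;
	n->handle_kick[0] = toy_tx_kick;
	n->handle_kick[1] = toy_rx_kick;
	return n;
}
```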
> +static void vhost_net_enable_vq(struct vhost_net *n, int index)
> +{
> + struct socket *sock = n->vqs[index].private_data;
OK, I can't help but think that presenting the vqs as an array doesn't buy
us very much, especially if you change vhost_dev_init to take NULL-terminated
varargs. I think readability would improve. It means passing a vq around
rather than an index.
Not completely sure it'll be a win tho.
> +static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
> +{
> + struct socket *sock, *oldsock = NULL;
...
> + sock = get_socket(fd);
> + if (IS_ERR(sock)) {
> + r = PTR_ERR(sock);
> + goto done;
> + }
> +
> + /* start polling new socket */
> + oldsock = vq->private_data;
...
> +done:
> + mutex_unlock(&n->dev.mutex);
> + if (oldsock) {
> + vhost_net_flush_vq(n, index);
> + fput(oldsock->file);
I dislike this style; I prefer multiple different goto points, one for when
oldsock is set, and one for when it's not.
That way, gcc warns us about uninitialized variables if we get it wrong.
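A sketch of the shape I mean, using invented toy_* types rather than the vhost ones, and assuming the caller already holds a reference on the new socket (as get_socket() would provide). The old socket pointer is deliberately not initialized to NULL: the early error path jumps to a label that never touches it, and the paths that have read it go through a separate label.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins; none of these are vhost types or functions. */
struct toy_sock { int refs; };
struct toy_dev { int locked; struct toy_sock *sock; };

static void toy_put(struct toy_sock *s) { s->refs--; }

/* The caller is assumed to already hold a reference on 'newsock'. */
static int toy_set_backend(struct toy_dev *d, struct toy_sock *newsock)
{
	struct toy_sock *oldsock;	/* deliberately left uninitialized */
	int err;

	d->locked = 1;
	if (!newsock) {
		err = -EBADF;
		goto unlock;		/* oldsock is not set on this path */
	}

	oldsock = d->sock;
	err = 0;
	if (oldsock == newsock)
		goto put_old;		/* same backend: drop the extra ref */
	d->sock = newsock;

put_old:
	d->locked = 0;
	if (oldsock)
		toy_put(oldsock);
	return err;

unlock:
	d->locked = 0;
	return err;
}
```

If someone later adds a jump to put_old from before the assignment, the compiler has a chance to warn about the uninitialized use, which the "= NULL" initializer would have hidden.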
> +static long vhost_net_reset_owner(struct vhost_net *n)
> +{
> + struct socket *tx_sock = NULL;
> + struct socket *rx_sock = NULL;
> + long r;
This should be called "err", since that's what it is.
> +static void vhost_net_set_features(struct vhost_net *n, u64 features)
> +{
> + size_t hdr_size = features & (1 << VHOST_NET_F_VIRTIO_NET_HDR) ?
> + sizeof(struct virtio_net_hdr) : 0;
> + int i;
> + mutex_lock(&n->dev.mutex);
> + n->dev.acked_features = features;
Why is this called "acked_features"? Not just "features"? I expected
to see code which exposed these back to userspace, and didn't.
> + case VHOST_GET_FEATURES:
> + features = VHOST_FEATURES;
> + return put_user(features, featurep);
> + case VHOST_ACK_FEATURES:
> + r = get_user(features, featurep);
> + /* No features for now */
> + if (r < 0)
> + return r;
> + if (features & ~VHOST_FEATURES)
> + return -EOPNOTSUPP;
> + vhost_net_set_features(n, features);
OK, from the userspace POV it's "get features" then "ack features". But
I think "VHOST_SET_FEATURES" is more consistent, despite this usage.
> + switch (ioctl) {
> + case VHOST_SET_VRING_NUM:
I haven't looked at your userspace implementation, but does a generic
VHOST_SET_VRING_STATE & VHOST_GET_VRING_STATE with a struct make more
sense? It'd be simpler here, but not sure if it'd be simpler to use?
(Not the fd-setting ioctls of course)
> + case VHOST_SET_VRING_LOG:
> + r = copy_from_user(&a, argp, sizeof a);
> + if (r < 0)
> + break;
> + if (a.padding) {
> + r = -EOPNOTSUPP;
> + break;
> + }
> + if (a.user_addr == VHOST_VRING_LOG_DISABLE) {
> + vq->log_used = false;
> + break;
> + }
> + if (a.user_addr & (sizeof *vq->used->ring - 1)) {
> + r = -EINVAL;
> + break;
> + }
> + vq->log_used = true;
> + vq->log_addr = a.user_addr;
> + break;
For future reference, this is *exactly* the kind of thing which would have
been nice as a followup patch. Easy to separate, easy to review, not critical
to the core.
> +/* TODO: This is really inefficient. We need something like get_user()
> + * (instruction directly accesses the data, with an exception table entry
> + * returning -EFAULT). See Documentation/x86/exception-tables.txt.
> + */
> +static int set_bit_to_user(int nr, void __user *addr)
> +{
I guess we won't be dealing with many contiguous pages, otherwise we could
get a cheap speedup making this set_bits_to_user(int nr, int num_bits...).
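For what it's worth, the range-setting part is simple. This is a plain userspace sketch on an ordinary bitmap, with a made-up name (toy_set_bits) and no user-memory access or fault handling at all, so it only shows the arithmetic: one masked OR per word touched rather than one operation per bit.

```c
#include <limits.h>
#include <stddef.h>

#define TOY_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Set 'num_bits' consecutive bits starting at bit 'nr'. */
static void toy_set_bits(unsigned long *map, size_t nr, size_t num_bits)
{
	while (num_bits) {
		size_t word = nr / TOY_BITS_PER_LONG;
		size_t off = nr % TOY_BITS_PER_LONG;
		size_t chunk = TOY_BITS_PER_LONG - off;
		unsigned long mask;

		if (chunk > num_bits)
			chunk = num_bits;
		/* 'chunk' low bits set; avoid shifting by the full word width. */
		mask = chunk == TOY_BITS_PER_LONG ? ~0UL : (1UL << chunk) - 1;
		map[word] |= mask << off;
		nr += chunk;
		num_bits -= chunk;
	}
}
```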
> +/* Each buffer in the virtqueues is actually a chain of descriptors. This
> + * function returns the next descriptor in the chain,
> + * or -1 if we're at the end. */
> +static unsigned next_desc(struct vring_desc *desc)
> +{
> + unsigned int next;
> +
> + /* If this descriptor says it doesn't chain, we're done. */
> + if (!(desc->flags & VRING_DESC_F_NEXT))
> + return -1;
Hmm, prefer s/-1/-1U/ in comment, here, and below. Clarifies a bit.
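A tiny illustration with made-up toy_* names (not the vring definitions). Since the function returns unsigned, the end-of-chain value the caller sees is really the largest unsigned value, and writing -1U in both the comment and the return makes that explicit.

```c
#define TOY_F_NEXT 1u	/* hypothetical stand-in for VRING_DESC_F_NEXT */

/* Hypothetical, simplified descriptor. */
struct toy_desc {
	unsigned flags;
	unsigned next;
};

/* Returns the next descriptor in the chain, or -1U if we're at the end. */
static unsigned toy_next_desc(const struct toy_desc *desc)
{
	/* If this descriptor says it doesn't chain, we're done. */
	if (!(desc->flags & TOY_F_NEXT))
		return -1U;
	return desc->next;
}
```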
> +/* After we've used one of their buffers, we tell them about it. We'll then
> + * want to send them an interrupt, using vq->call. */
This comment has too much cut & paste:
... want to notify the guest, using the eventfd */
> +/* This actually sends the interrupt for this virtqueue */
> +void vhost_trigger_irq(struct vhost_dev *dev, struct vhost_virtqueue *vq)
> +{
Rename to vhost_notify_eventfd() or something, and fix comments?
> +enum {
> + VHOST_NET_MAX_SG = MAX_SKB_FRAGS + 2,
+2? Believable, but is it correct?
> +/* Poll a file (eventfd or socket) */
> +/* Note: there's nothing vhost specific about this structure. */
> +struct vhost_poll {
This comment really helped while reading the code. Kudos!
Thanks!
Rusty.