[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20090819130417.GB3080@redhat.com>
Date: Wed, 19 Aug 2009 16:04:17 +0300
From: "Michael S. Tsirkin" <mst@...hat.com>
To: Arnd Bergmann <arnd@...db.de>
Cc: virtualization@...ts.linux-foundation.org, netdev@...r.kernel.org,
kvm@...r.kernel.org, linux-kernel@...r.kernel.org, mingo@...e.hu,
linux-mm@...ck.org, akpm@...ux-foundation.org, hpa@...or.com,
gregory.haskins@...il.com, Or Gerlitz <ogerlitz@...taire.com>
Subject: Re: [PATCHv3 2/2] vhost_net: a kernel-level virtio server
On Wed, Aug 19, 2009 at 11:04:50AM +0200, Arnd Bergmann wrote:
> On Sunday 16 August 2009, Michael S. Tsirkin wrote:
> > On Fri, Aug 14, 2009 at 01:40:36PM +0200, Arnd Bergmann wrote:
> > >
> > > * most of the transports are sockets, tap uses a character device.
> > > This could be dealt with by having both a struct socket * in
> > > struct vhost_net *and* a struct file *, or by always keeping the
> > > struct file and calling vfs_readv/vfs_writev for the data transport
> > > in both cases.
> >
> > I am concerned that character devices might have weird side effects with
> > read/write operations and that calling them from kernel thread the way I
> > do might have security implications. Can't point at anything specific
> > though at the moment.
>
> I understand your feelings about passing a chardev fd into your driver
> and I agree that we need to be very careful if we want to allow it.
>
> Maybe we could instead extend the 'splice' system call to work on a
> vhost_net file descriptor. If we do that, we can put the access back
> into a user thread (or two) that stays in splice indefinetely
An issue with exposing internal threading model to userspace
in this way is that we lose control of e.g. CPU locality -
and it is very hard for userspace to get it right.
> to
> avoid some of the implications of kernel threads like the missing
> ability to handle transfer errors in user space.
Are you talking about TCP here?
Transfer errors are typically asynchronous - possibly eventfd
as I expose for vhost net is sufficient there.
> > I wonder - can we expose the underlying socket used by tap, or will that
> > create complex lifetime issues?
>
> I think this could get more messy in the long run than calling vfs_readv
> on a random fd. It would mean deep internal knowledge of the tap driver
> in vhost_net, which I really would prefer to avoid.
No, what I had in mind is adding a GET_SOCKET ioctl to tap.
vhost would then just use the socket.
> > > * Each transport has a slightly different header, we have
> > > - raw ethernet frames (raw, udp multicast, tap)
> > > - 32-bit length + raw frames, possibly fragmented (tcp)
> > > - 80-bit header + raw frames, possibly fragmented (tap with vnet_hdr)
> > > To handle these three cases, we need either different ioctl numbers
> > > so that vhost_net can choose the right one, or a flags field in
> > > VHOST_NET_SET_SOCKET, like
> > >
> > > #define VHOST_NET_RAW 1
> > > #define VHOST_NET_LEN_HDR 2
> > > #define VHOST_NET_VNET_HDR 4
> > >
> > > struct vhost_net_socket {
> > > unsigned int flags;
> > > int fd;
> > > };
> > > #define VHOST_NET_SET_SOCKET _IOW(VHOST_VIRTIO, 0x30, struct vhost_net_socket)
> >
> > It seems we can query the socket to find out the type,
>
> yes, I understand that you can do that, but I still think that decision
> should be left to user space. Adding a length header for TCP streams but
> not for UDP is something that we would normally want to do, but IMHO
> vhost_net should not need to know about this.
>
> > or use the features ioctl.
>
> Right, I had forgotten about that one. It's probably equivalent
> to the flags I suggested, except that one allows you to set features
> after starting the communication, while the other one prevents
> you from doing that.
>
> > > Qemu could then automatically try to use vhost_net, if it's available
> > > in the kernel, or just fall back on software vlan otherwise.
> > > Does that make sense?
> >
> > I agree, long term it should be enabled automatically when possible.
>
> So how about making the qemu command line interface an extension to
> what Or Gerlitz has done for the raw packet sockets?
>
> Arnd <><
Not sure I see the connection, but I have not thought about qemu
side of things too much yet - trying to get kernel bits in place
first so that there's a stable ABI to work with.
--
MST
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Powered by blists - more mailing lists