Message-id: <461C6360.1060908@qumranet.com>
Date:	Wed, 11 Apr 2007 07:26:08 +0300
From:	Avi Kivity <avi@...ranet.com>
To:	Rusty Russell <rusty@...tcorp.com.au>
Cc:	Ingo Molnar <mingo@...e.hu>, kvm-devel@...ts.sourceforge.net,
	netdev <netdev@...r.kernel.org>
Subject: Re: [kvm-devel] QEMU PIC indirection patch for in-kernel APIC work

Rusty Russell wrote:
> On Mon, 2007-04-09 at 16:38 +0300, Avi Kivity wrote:
>   
>> Moreover, some things just don't lend themselves to a userspace 
>> abstraction.  If we want to expose tso (tcp segmentation offload), we 
>> can easily do so with a kernel driver since the kernel interfaces are 
>> all tso aware.  Tacking on tso awareness to tun/tap is doable, but at 
>> the very least weird.
>>     
>
> It is kinda weird, yes, but it certainly makes sense.  All the arguments
> for tso apply in triplicate to userspace packet sends...
>
>   

Well, write() with a large buffer is a sort of tso device.  The problem
is tso breaks through several layers (like I'm advocating in the other
thread :), pushing tcp functionality into ethernet.  Well, we've seen worse.
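
To make the "write() with a large buffer" comparison concrete, here is a minimal sketch of the segmentation step that tso hands off to the device: one large buffer goes in, MSS-sized chunks come out. The names tso_segment and segment_fn are hypothetical, and the per-segment header rewriting (sequence numbers, checksums) that real hardware also does is left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback: receives one wire-sized segment of the payload. */
typedef void (*segment_fn)(const unsigned char *seg, size_t len, void *ctx);

/*
 * Split one large send buffer into chunks of at most mss bytes.
 * Returns the number of segments produced. emit may be NULL if the
 * caller only wants the count.
 */
size_t tso_segment(const unsigned char *buf, size_t len, size_t mss,
                   segment_fn emit, void *ctx)
{
    size_t off = 0;
    size_t count = 0;

    assert(mss > 0);    /* a zero MSS would never make progress */

    while (off < len) {
        size_t chunk = len - off;

        if (chunk > mss)
            chunk = mss;
        if (emit != NULL)
            emit(buf + off, chunk, ctx);
        off += chunk;
        count++;
    }
    return count;
}
```

With a 4000-byte buffer and an MSS of 1460, this yields three segments (1460, 1460 and 1080 bytes), which is the work the sender avoids doing itself when the device is tso aware.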


>>> We're dealing with the tun/tap device here, not a socket.
>>>       
>> Hmm.  tun actually has aio_write implemented, but it seems synchronous.  
>> So does the read path.
>>
>> If these are made truly asynchronous, and the write path is made in 
>> addition copyless, then we might have something workable.  I still 
>> cringe at having a pagetable walk in order to deliver a 1500-byte packet.
>>     
>
> Right, now we're talking!
>
> However, it's not clear to me why creating an skb which references a kvm
> guest's memory doesn't need a pagetable walk, but a packet in (other)
> userspace memory does?
>   

Currently guest pages are stashed in a kernel array, as well as being
mmap()ed into user space.

That's not a very strong argument though, as I'd like to be able to map
userspace memory into the guest, or map address_spaces to the guest, or
something, so accessing guest physical memory will become more expensive
in time.

> My conviction which started this discussion is that if we can offer an
> efficient interface for kvm, we should be able to offer an efficient
> interface for any (other) userspace.
>   

Fully agreed.  It's mostly a question of who and when.  Designing and
implementing this interface is going to be difficult, require deep
knowledge of Linux networking, and consume a lot of time.

> As to async, I'm not *so* worried about that for the moment, although it
> would probably be nicer to fail than to block.  Otherwise we could
> simply set an skb destructor to wake us up.
>   

Nope.  Being async is critical for copyless networking:

- in the transmit path, you need to stop the sender (guest) from touching
the memory until it's on the wire.  Without async, this means 100% of
packets sent will block.
- in the receive path, you could separate receive notification from the
single copy that must be done (like poll() + read()), but to make use of
dma engines you need to provide the destination address beforehand.
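
The poll() + read() split mentioned for the receive path can be sketched as follows. This is only an illustration using standard POSIX calls; the function name notify_then_copy and its return convention are assumptions made for the example. Note that the destination buffer is only handed over after the notification, which is exactly why this pattern cannot feed a dma engine.

```c
#include <assert.h>
#include <poll.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Wait for a readiness notification, then do the single copy.
 * Returns the number of bytes read (0 means end of file),
 * -1 on error, or -2 if nothing arrived within timeout_ms.
 */
ssize_t notify_then_copy(int fd, void *buf, size_t len, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int ready;

    assert(buf != NULL);

    /* Step 1: notification only, no buffer is involved yet. */
    ready = poll(&pfd, 1, timeout_ms);
    if (ready < 0)
        return -1;
    if (ready == 0)
        return -2;

    /* Step 2: the copy, into a buffer supplied after the fact. */
    return read(fd, buf, len);
}
```

A truly async receive would instead post the buffer first and get a completion later, so the destination is known before the packet arrives.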

> I think the first step is to see how much worse a decent userspace net
> driver is compared with the current in-kernel one.
>   

A userspace net interface needs to provide the following:

- true async operations
- multiple packets per operation (for interrupt mitigation) (like
lio_listio)
- scatter/gather packets (iovecs)
- configurable wakeup (by packet count/timeout) for queue management
- hacks (tso)

Most of these can be provided by a combination of the pending aio work,
the pending aio/fd integration, and the not-so-pending tap aio work.  As
the first two are available as patches and the third is limited to the
tap device, it is not unreasonable to try it out.  Maybe it will turn
out not to be as difficult as I predicted just a few lines above.
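
As a rough illustration of two items from the list above, the sketch below shows scatter/gather submission with writev() and a wakeup rule based on packet count or timeout. It uses a pipe as a stand-in for the tap fd; send_gathered, should_wake and demo_roundtrip are hypothetical names, and nothing here is asynchronous or batched, which is precisely what the pending aio work would have to add.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send a header and a payload held in separate buffers in one call. */
ssize_t send_gathered(int fd, const void *hdr, size_t hdr_len,
                      const void *payload, size_t payload_len)
{
    struct iovec iov[2];

    assert(hdr != NULL && payload != NULL);

    /* writev() only reads the buffers; the casts drop const because
       struct iovec is shared with readv(). */
    iov[0].iov_base = (void *)hdr;
    iov[0].iov_len = hdr_len;
    iov[1].iov_base = (void *)payload;
    iov[1].iov_len = payload_len;

    return writev(fd, iov, 2);
}

/*
 * Hypothetical wakeup policy: notify the consumer once enough packets
 * are queued, or once the oldest one has waited long enough.
 */
int should_wake(unsigned pending, unsigned max_pending,
                unsigned long waited_us, unsigned long timeout_us)
{
    if (pending == 0)
        return 0;
    return pending >= max_pending || waited_us >= timeout_us;
}

/* Round trip over a pipe (standing in for a tap fd). 0 on success. */
int demo_roundtrip(void)
{
    int p[2];
    char out[16];
    ssize_t n;
    int ok;

    if (pipe(p) != 0)
        return -1;

    n = send_gathered(p[1], "HDR:", 4, "payload", 7);
    ok = n == 11 &&
         read(p[0], out, sizeof out) == 11 &&
         memcmp(out, "HDR:payload", 11) == 0;

    close(p[0]);
    close(p[1]);
    return ok ? 0 : -1;
}
```

The iovec form lets the header and the data stay where they already are, and the wakeup rule is the knob that trades latency against interrupt (or notification) rate.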

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

