Date:	Sun, 05 Apr 2009 11:06:12 +0300
From:	Avi Kivity <avi@...hat.com>
To:	Rusty Russell <rusty@...tcorp.com.au>
CC:	Anthony Liguori <anthony@...emonkey.ws>,
	Gregory Haskins <ghaskins@...ell.com>,
	linux-kernel@...r.kernel.org, agraf@...e.de, pmullaney@...ell.com,
	pmorreale@...ell.com, netdev@...r.kernel.org, kvm@...r.kernel.org
Subject: Re: [RFC PATCH 00/17] virtual-bus

Rusty Russell wrote:
> On Thursday 02 April 2009 02:40:29 Anthony Liguori wrote:
>   
>> Rusty Russell wrote:
>>     
>>> As you point out, 350-450 is possible, which is still bad, and it's at least
>>> partially caused by the exit to userspace and two system calls.  If virtio_net
>>> had a backend in the kernel, we'd be able to compare numbers properly.
>>>       
>> I doubt the userspace exit is the problem.  On a modern system, it takes 
>> about 1us to do a light-weight exit and about 2us to do a heavy-weight 
>> exit.  A transition to userspace is only about ~150ns, the bulk of the 
>> additional heavy-weight exit cost is from vcpu_put() within KVM.
>>     
>
> Just to inject some facts, servicing a ping via tap (ie host->guest then
> guest->host response) takes 26 system calls from one qemu thread, 7 from
> another (see strace below). Judging by those futex calls, multiple context
> switches, too.
>   

Interesting stuff.  Even if amortized over half a ring's worth of 
packets, that's quite a lot.

Two threads are involved (we complete on the iothread, since we don't 
know which vcpu will end up processing the interrupt, if any).

>
> Pid 10260:
> 12:37:40.245785 select(17, [4 6 8 14 16], [], [], {0, 996000}) = 1 (in [6], left {0, 992000}) <0.003995>
>   

We should switch to epoll, which has lower wait costs.  Unfortunately 
computing the relative timeout still requires reading the clock.
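
For illustration only, a minimal sketch of what the wait side could 
look like with epoll; the fd list, array sizes and error handling are 
invented for the example, and qemu's real main loop is more involved:

#include <stdio.h>
#include <stdlib.h>
#include <sys/epoll.h>

/* Sketch only: register the fds once, then wait.  epoll_wait() scales
 * with the number of ready fds rather than with the highest fd number,
 * which is where the lower wait cost comes from. */
static int setup_epoll(const int *fds, int nfds)
{
    int epfd = epoll_create1(0);
    int i;

    if (epfd < 0) {
        perror("epoll_create1");
        exit(1);
    }
    for (i = 0; i < nfds; i++) {
        struct epoll_event ev = { .events = EPOLLIN, .data.fd = fds[i] };

        if (epoll_ctl(epfd, EPOLL_CTL_ADD, fds[i], &ev) < 0) {
            perror("epoll_ctl");
            exit(1);
        }
    }
    return epfd;
}

static int wait_once(int epfd, int timeout_ms)
{
    struct epoll_event events[16];
    int i, n;

    /* The timeout is still relative (milliseconds), so computing it
     * still costs a clock_gettime(). */
    n = epoll_wait(epfd, events, 16, timeout_ms);
    for (i = 0; i < n; i++)
        printf("fd %d ready\n", events[i].data.fd);
    return n;
}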

> 12:37:40.250226 read(6, "\0\0\0\0\0\0\0\0\0\0RT\0\0224V*\211\24\210`\304\10\0E\0"..., 69632) = 108 <0.000051>
> 12:37:40.250462 write(1, "tap read: 108 bytes\n", 20) = 20 <0.000197>
>   

I hope this write() to stdout is a debugging addition of yours.

> 12:37:40.250800 ioctl(7, 0x4008ae61, 0x7fff8cafb3a0) = 0 <0.000223>
> 12:37:40.251149 read(6, 0x115c6ac, 69632) = -1 EAGAIN (Resource temporarily unavailable) <0.000019>
>   

This wouldn't be necessary with io_getevents().
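
For illustration only, roughly what batched completion retrieval looks 
like with libaio's io_getevents(): one call hands back everything that 
has finished, so there is no trailing read() whose only job is to 
return EAGAIN.  Whether the tap fd could really be driven through 
Linux AIO is a separate question, and the names below are invented:

#include <libaio.h>
#include <stdio.h>

#define MAX_EVENTS 8

/* Sketch only: collect every completed request in one call instead of
 * re-reading the fd until it returns EAGAIN. */
static int drain_completions(io_context_t ctx)
{
    struct io_event events[MAX_EVENTS];
    int i, n;

    /* min_nr = 0: don't block, just return whatever has already
     * completed (possibly nothing). */
    n = io_getevents(ctx, 0, MAX_EVENTS, events, NULL);
    for (i = 0; i < n; i++)
        printf("request %p completed with %ld bytes\n",
               (void *)events[i].obj, (long)events[i].res);
    return n;
}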

> 12:37:40.251292 write(1, "tap read: -1 bytes\n", 19) = 19 <0.000085>
>   

...

> 12:37:40.251488 clock_gettime(CLOCK_MONOTONIC, {1554, 633304282}) = 0 <0.000020>
> 12:37:40.251604 clock_gettime(CLOCK_MONOTONIC, {1554, 633413793}) = 0 <0.000019>
>   

Great.

> 12:37:40.251717 futex(0xb81360, 0x81 /* FUTEX_??? */, 1) = 1 <0.001222>
> 12:37:40.253037 select(17, [4 6 8 14 16], [], [], {1, 0}) = 1 (in [16], left {1, 0}) <0.000026>
> 12:37:40.253196 read(16, "\16\0\0\0\0\0\0\0\376\377\377\377\0\0\0\0\0\0\0\0\0\0\0"..., 128) = 128 <0.000022>
> 12:37:40.253324 rt_sigaction(SIGALRM, NULL, {0x406d50, ~[KILL STOP RTMIN RT_1], SA_RESTORER, 0x7f1a842430f0}, 8) = 0 <0.000018>
> 12:37:40.253477 write(5, "\0", 1)       = 1 <0.000022>
>   

The write is to wake someone up.  Who?

> 12:37:40.253585 read(16, 0x7fff8cb09440, 128) = -1 EAGAIN (Resource temporarily unavailable) <0.000020>
>   

Draining the signalfd...
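
For reference, 128 bytes is sizeof(struct signalfd_siginfo), and a 
non-blocking signalfd is drained by reading until EAGAIN, which is 
exactly what the pair of reads on fd 16 is doing.  A minimal sketch, 
with the fd handling invented:

#include <errno.h>
#include <stdio.h>
#include <sys/signalfd.h>
#include <unistd.h>

/* Sketch only: drain a non-blocking signalfd until it returns EAGAIN. */
static void drain_signalfd(int sfd)
{
    struct signalfd_siginfo si;     /* 128 bytes, matching the reads above */
    ssize_t n;

    while ((n = read(sfd, &si, sizeof(si))) == sizeof(si))
        printf("signal %u delivered\n", si.ssi_signo);

    if (n < 0 && errno != EAGAIN)
        perror("read signalfd");
}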

> 12:37:40.253687 clock_gettime(CLOCK_MONOTONIC, {1554, 635496181}) = 0 <0.000019>
> 12:37:40.253798 writev(6, [{"\0\0\0\0\0\0\0\0\0\0", 10}, {"*\211\24\210`\304RT\0\0224V\10\0E\0\0T\255\262\0\0@\1G"..., 98}], 2) = 108 <0.000062>
> 12:37:40.253993 ioctl(7, 0x4008ae61, 0x7fff8caff460) = 0 <0.000161>
>   

Injecting the interrupt.
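
0x4008ae61 decodes to KVM_IRQ_LINE.  A minimal sketch of what the 
iothread is doing at this point; the vm fd and GSI are made up for the 
example:

#include <linux/kvm.h>
#include <stdio.h>
#include <sys/ioctl.h>

/* Sketch only: raise (or lower) a guest interrupt line from userspace. */
static int set_irq_line(int vm_fd, unsigned int gsi, int level)
{
    struct kvm_irq_level irq = {
        .irq   = gsi,       /* which line */
        .level = level,     /* 1 = assert, 0 = deassert */
    };

    if (ioctl(vm_fd, KVM_IRQ_LINE, &irq) < 0) {
        perror("KVM_IRQ_LINE");
        return -1;
    }
    return 0;
}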

> 12:37:40.254263 clock_gettime(CLOCK_MONOTONIC, {1554, 636077540}) = 0 <0.000019>
> 12:37:40.254380 futex(0xb81360, 0x81 /* FUTEX_??? */, 1) = 1 <0.000394>
> 12:37:40.254861 select(17, [4 6 8 14 16], [], [], {1, 0}) = 1 (in [4], left {1, 0}) <0.000022>
> 12:37:40.255001 read(4, "\0", 512)      = 1 <0.000021>
>   

Great, the write() was to wake ourselves up.
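
This is the usual self-pipe trick: a single byte written into a pipe 
whose read end sits in the select() set forces the loop back around so 
it can recompute its timeouts, and the two read(4, ...) calls above 
drain it again.  A minimal sketch, names invented:

#include <fcntl.h>
#include <unistd.h>

static int wakeup_pipe[2];      /* [0] = read end, [1] = write end */

/* Sketch only: create the pipe and make both ends non-blocking. */
static int init_wakeup(void)
{
    if (pipe(wakeup_pipe) < 0)
        return -1;
    fcntl(wakeup_pipe[0], F_SETFL,
          fcntl(wakeup_pipe[0], F_GETFL) | O_NONBLOCK);
    fcntl(wakeup_pipe[1], F_SETFL,
          fcntl(wakeup_pipe[1], F_GETFL) | O_NONBLOCK);
    return 0;
}

/* Called from any thread (or a signal handler) to kick the main loop. */
static void kick_mainloop(void)
{
    char byte = 0;

    (void)write(wakeup_pipe[1], &byte, 1);   /* a full pipe is fine too */
}

/* Called by the main loop once select() reports the read end ready. */
static void clear_wakeup(void)
{
    char buf[64];

    while (read(wakeup_pipe[0], buf, sizeof(buf)) > 0)
        ;       /* drain until EAGAIN */
}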

> 12:37:40.255109 read(4, 0x7fff8cb092d0, 512) = -1 EAGAIN (Resource temporarily unavailable) <0.000018>
> 12:37:40.255211 clock_gettime(CLOCK_MONOTONIC, {1554, 637020677}) = 0 <0.000019>
> 12:37:40.255314 clock_gettime(CLOCK_MONOTONIC, {1554, 637123483}) = 0 <0.000019>
> 12:37:40.255416 timer_gettime(0, {it_interval={0, 0}, it_value={0, 0}}) = 0 <0.000018>
> 12:37:40.255524 timer_settime(0, 0, {it_interval={0, 0}, it_value={0, 14000000}}, NULL) = 0 <0.000021>
> 12:37:40.255635 clock_gettime(CLOCK_MONOTONIC, {1554, 637443915}) = 0 <0.000019>
> 12:37:40.255739 clock_gettime(CLOCK_MONOTONIC, {1554, 637547001}) = 0 <0.000018>
> 12:37:40.255847 select(17, [4 6 8 14 16], [], [], {1, 0}) = 1 (in [16], left {0, 988000}) <0.014303>
>
>   

This is the vcpu thread:

> Pid 10262:
> 12:37:40.252531 clock_gettime(CLOCK_MONOTONIC, {1554, 634339051}) = 0 <0.000018>
> 12:37:40.252631 timer_gettime(0, {it_interval={0, 0}, it_value={0, 17549811}}) = 0 <0.000021>
> 12:37:40.252750 timer_settime(0, 0, {it_interval={0, 0}, it_value={0, 250000}}, NULL) = 0 <0.000024>
> 12:37:40.252868 ioctl(11, 0xae80, 0)    = 0 <0.001171>
> 12:37:40.254128 futex(0xb81360, 0x80 /* FUTEX_??? */, 2) = 0 <0.000270>
> 12:37:40.254490 ioctl(7, 0x4008ae61, 0x4134bee0) = 0 <0.000019>
> 12:37:40.254598 futex(0xb81360, 0x81 /* FUTEX_??? */, 1) = 0 <0.000017>
> 12:37:40.254693 ioctl(11, 0xae80 <unfinished ...>
>   

Looks like the interrupt from the iothread was injected and delivered 
before the iothread could give up the mutex, so we needed to wait here.
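
0xae80 decodes to KVM_RUN.  Roughly the shape of the loop the vcpu 
thread runs: the guest executes with the global mutex dropped and the 
exit is handled with it held, which is where the contended futex comes 
from.  Names below are invented for the sketch:

#include <linux/kvm.h>
#include <pthread.h>
#include <stdio.h>
#include <sys/ioctl.h>

static pthread_mutex_t global_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Sketch only: assumes the caller enters with the global mutex held.
 * The guest runs with the mutex dropped; the exit is handled with it
 * reacquired.  If the iothread holds the mutex (e.g. while injecting
 * an interrupt), the lock shows up as a contended futex, as in the
 * trace above. */
static void vcpu_loop(int vcpu_fd, struct kvm_run *run)
{
    for (;;) {
        pthread_mutex_unlock(&global_mutex);      /* guest runs unlocked */
        if (ioctl(vcpu_fd, KVM_RUN, 0) < 0) {
            pthread_mutex_lock(&global_mutex);
            perror("KVM_RUN");
            break;
        }
        pthread_mutex_lock(&global_mutex);        /* may block on the iothread */

        switch (run->exit_reason) {
        case KVM_EXIT_IO:
        case KVM_EXIT_MMIO:
            /* emulate the access while holding the mutex ... */
            break;
        default:
            break;
        }
    }
}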


-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
