linux-kernel - Re: [RFC PATCH 0/3] generic hypercall support

Message-ID: <4A0860D7.6010708@codemonkey.ws>
Date:	Mon, 11 May 2009 12:31:03 -0500
From:	Anthony Liguori <anthony@...emonkey.ws>
To:	Gregory Haskins <gregory.haskins@...il.com>
CC:	Gregory Haskins <ghaskins@...ell.com>, Avi Kivity <avi@...hat.com>,
	Chris Wright <chrisw@...s-sol.org>,
	linux-kernel@...r.kernel.org, kvm@...r.kernel.org,
	Hollis Blanchard <hollisb@...ibm.com>
Subject: Re: [RFC PATCH 0/3] generic hypercall support

Gregory Haskins wrote:
> I specifically generalized my statement above because #1 I assume
> everyone here is smart enough to convert that nice round unit into the
> relevant figure.  And #2, there are multiple potential latency sources
> at play which we need to factor in when looking at the big picture.  For
> instance, the difference between PF exit, and an IO exit (2.58us on x86,
> to be precise).  Or whether you need to take a heavy-weight exit.  Or a
> context switch (to qemu, then the kernel, back to qemu, and back to the
> vcpu).  Or acquire a mutex.  Or get head-of-lined on the VGA model's IO.
> I know you wish that this whole discussion would just go away, but these
> little "300ns here, 1600ns there" really add up in aggregate despite
> your dismissive attitude towards them.  And it doesn't take much to
> affect the results in a measurable way.  As stated, each 1us costs ~4%. 
> My motivation is to reduce as many of these sources as possible.
>
> So, yes, the delta from PIO to HC is 350ns.  Yes, this is a ~1.4%
> improvement.  So what?  It's still an improvement.  If that improvement
> were for free, would you object?  And we all know that this change isn't
> "free" because we have to change some code (+128/-0, to be exact).  But
> what is it specifically you are objecting to in the first place?  Adding
> hypercall support as a pv_ops primitive isn't exactly hard or complex,
> or even very much code.
>   

Where does 25us come from?  The numbers you post below are 33us and 
66us.  This is part of what's frustrating me in this thread.  Things are 
way too theoretical.  Saying that "if packet latency was 25us, then it 
would be a 1.4% improvement" is close to misleading.  The numbers you've 
posted are also measuring on-box speeds.  What really matters are 
off-box latencies and that's just going to exaggerate.

IIUC, if you switched vbus to using PIO today, you would go from 66us to 
65.65us, which you'd round to 66us for on-box latencies.  Even if you 
didn't round, it's a 0.5% improvement in latency.
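
To make that arithmetic explicit, here is a minimal sketch in Python.  The 
350ns PIO-to-HC delta and the 25us, 33us, and 66us round-trip figures all 
come from this thread; the function name is just an illustrative choice.

```python
def improvement_pct(saved_ns: float, rtt_us: float) -> float:
    """Percentage of one round trip removed by saving `saved_ns` nanoseconds."""
    return saved_ns / (rtt_us * 1000.0) * 100.0


# PIO vs. hypercall delta quoted in this thread.
PIO_TO_HC_DELTA_NS = 350

# Round-trip times discussed in this thread (the 25us one is the assumed figure).
cases = [("assumed 25us", 25.0), ("bare metal 33us", 33.0), ("venet 66us", 66.0)]

for label, rtt_us in cases:
    print(f"{label}: {improvement_pct(PIO_TO_HC_DELTA_NS, rtt_us):.2f}%")
```

The same 350ns is 1.4% of a 25us round trip but only about 0.5% of the 66us 
that was actually measured, so the baseline chosen matters a lot.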

Adding hypercall support as a pv_ops primitive is adding a fair bit of 
complexity.  You need a hypercall fd mechanism to plumb this down to 
userspace; otherwise, you can't support migration from an in-kernel 
backend to a non in-kernel backend.  You need some way to allocate 
hypercalls to particular devices which, so far, has been completely 
ignored.  I've 
already mentioned why hypercalls are also unfortunate from a guest 
perspective.  They require kernel patching and this is almost certainly 
going to break at least Vista as a guest.  Certainly Windows 7.

So it's not at all fair to trivialize the complexity introduced here.  
I'm simply asking for justification to introduce this complexity.  I 
don't see why this is unfair for me to ask.

>> As a more general observation, we need numbers to justify an
>> optimization, not to justify not including an optimization.
>>
>> In other words, the burden is on you to present a scenario where this
>> optimization would result in a measurable improvement in a real world
>> work load.
>>     
>
> I have already done this.  You seem to have chosen to ignore my
> statements and results, but if you insist on rehashing:
>
> I started this project by analyzing system traces and finding some of
> the various bottlenecks in comparison to a native host.  Throughput was
> already pretty decent, but latency was pretty bad (and recently got
> *really* bad, but I know you already have a handle on what's causing
> that).  I digress...one of the conclusions of the research was that I
> wanted to focus on building an IO subsystem designed to minimize the
> quantity of exits, minimize the cost of each exit, and shorten the
> end-to-end signaling path to achieve optimal performance.  I also wanted
> to build a system that was extensible enough to work with a variety of
> client types, on a variety of architectures, etc, so we would only need
> to solve these problems "once".  The end result was vbus, and the first
> working example was venet.  The measured performance data of this work
> was as follows:
>
> 802.x network, 9000 byte MTU,  2 8-core x86_64s connected back to back
> with Chelsio T3 10GE via crossover.
>
> Bare metal       : tput = 9717Mb/s, round-trip = 30396pps (33us rtt)
> Virtio-net (PCI) : tput = 4578Mb/s, round-trip =   249pps (4016us rtt)
> Venet (VBUS)     : tput = 5802Mb/s, round-trip = 15127pps (66us rtt)
>
> For more details:  http://lkml.org/lkml/2009/4/21/408
>   
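
As a sanity check on the figures quoted above, the rtt values follow from 
the round-trip rates as 1/pps.  A minimal sketch, assuming the "round-trip" 
column is round trips per second (the helper name is illustrative only):

```python
def rtt_us(round_trips_per_sec: float) -> float:
    """Convert a round-trip rate (per second) into microseconds per round trip."""
    return 1_000_000.0 / round_trips_per_sec


# Round-trip rates as quoted in the results above.
quoted_pps = {
    "Bare metal": 30396,
    "Virtio-net (PCI)": 249,
    "Venet (VBUS)": 15127,
}

for name, pps in quoted_pps.items():
    print(f"{name}: {rtt_us(pps):.0f}us rtt")
```

Under that assumption the three rates work out to roughly 33us, 4016us, and 
66us, which matches the rtt column as posted.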

Sending out a massive infrastructure change that does things wildly 
differently from how they're done today without any indication of why 
those changes were necessary is disruptive.

If you could characterize all of the changes that vbus makes that are 
different from virtio, demonstrating at each stage why the change 
mattered and what benefit it brought, then we'd be having a completely 
different discussion.  I have no problem throwing away virtio today if 
there's something else better.

That's not what you've done though.  You wrote a bunch of code without 
understanding why virtio does things the way it does and then dropped it 
all on the list.  This isn't necessarily a bad exercise, but there's a 
ton of work necessary to determine which of the things vbus does differently 
actually matter.  I'm not saying that you shouldn't have done vbus, but 
I'm saying there's a bunch of analysis work that you haven't done that 
needs to be done before we start making any changes in upstream code.

I've been trying to argue, and to demonstrate, why I don't think 
hypercalls are an important part of vbus from a performance 
perspective.  The frustration I have with this series is that you seem 
unwilling to compromise on any aspect of the vbus design.  I understand 
you've made your decisions in vbus for some reasons and you think the 
way you've done things is better, but that's not enough.  We have 
virtio today; it provides greater 
functionality than vbus does, it supports multiple guest types, and it's 
gotten quite a lot of testing.  It has its warts, but most things that 
have been around for some time do.

> Now I know you have been quick in the past to dismiss my efforts, and to
> claim you can get the same results without needing the various tricks
> and optimizations I uncovered.  But quite frankly, until you post some
> patches for community review and comparison (as I have done), it's just
> meaningless talk.

I can just as easily say that until you post a full series that covers 
all of the functionality that virtio has today, vbus is just meaningless 
talk.  But I'm trying not to be dismissive in all of this because I do 
want to see you contribute to the KVM paravirtual IO infrastructure.  
Clearly, you have useful ideas.

We can't just go rewriting things without a clear understanding of why 
something's better.  What's missing is a detailed analysis of what 
virtio-net does today and what vbus does so that it's possible to draw 
some conclusions.

For instance, this could look like:

For a single packet delivery:

150ns are spent from PIO operation
320ns are spent in heavy-weight exit handler
150ns are spent transitioning to userspace
5us are spent contending on qemu_mutex
30us are spent copying data in tun/tap driver
40us are spent waiting for RX
...

For vbus, it would look like:

130ns are spent from HC instruction
100ns are spent signaling TX thread
...
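
To show what I'd do with such a breakdown, here is a minimal Python sketch 
that totals the stages and ranks them by share.  The numbers are only the 
made-up placeholders from the virtio list above, not measurements, and the 
list is incomplete (the "..." entries are left out); the helper name is 
illustrative.

```python
# Placeholder per-stage costs (ns) copied from the illustrative list above.
# These are NOT measurements, and the elided "..." stages are not included.
virtio_example_ns = [
    ("PIO operation", 150),
    ("heavy-weight exit handler", 320),
    ("transition to userspace", 150),
    ("contending on qemu_mutex", 5_000),
    ("copying data in tun/tap driver", 30_000),
    ("waiting for RX", 40_000),
]


def summarize(stages):
    """Return (total_ns, rows), with rows sorted by cost, largest first."""
    total = sum(ns for _, ns in stages)
    rows = [(name, ns, 100.0 * ns / total) for name, ns in stages]
    rows.sort(key=lambda row: row[1], reverse=True)
    return total, rows


total_ns, rows = summarize(virtio_example_ns)
print(f"total: {total_ns}ns")
for name, ns, share in rows:
    print(f"{share:6.2f}%  {ns:>6}ns  {name}")
```

With a table like this built from real traces, it becomes obvious which 
stages are worth attacking first and which ones are lost in the noise.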

But single packet delivery is just one part of the puzzle.  Bulk 
transfers are also important.  CPU consumption is important.  How we 
address things like live migration, non-privileged user initialization, 
and userspace plumbing are all also important.

Right now, the whole discussion around this series is wildly speculative 
and, quite frankly, counterproductive.  A few RTT benchmarks are not 
sufficient to make any kind of forward progress here.  I certainly like 
rewriting things as much as anyone else, but you need a substantial 
amount of justification for it that so far hasn't been presented.

Do you understand what my concerns are and why I don't want to just 
switch to a new large infrastructure?

Do you feel like you understand what sort of data I'm looking for to 
justify the changes vbus is proposing to make?  Is this something you're 
willing to do?  IMHO, this is a prerequisite for any sort of merge 
consideration.  The analysis of the virtio-net side of things is just as 
important as the vbus side of things.

I've tried to explain this to you a number of times now and so far it 
doesn't seem like I've been successful.  If it isn't clear, please let 
me know.

Regards,

Anthony Liguori