linux-kernel - Re: [ANNOUNCE] Native Linux KVM tool

Date:	Fri, 08 Apr 2011 09:00:43 -0500
From:	Anthony Liguori <anthony@...emonkey.ws>
To:	Pekka Enberg <penberg@...nel.org>
CC:	Ingo Molnar <mingo@...e.hu>, Avi Kivity <avi@...hat.com>,
	linux-kernel@...r.kernel.org, aarcange@...hat.com,
	mtosatti@...hat.com, kvm@...r.kernel.org, joro@...tes.org,
	penberg@...helsinki.fi, asias.hejun@...il.com, gorcunov@...il.com
Subject: Re: [ANNOUNCE] Native Linux KVM tool

On 04/08/2011 12:14 AM, Pekka Enberg wrote:
> Hey, feel free to help out! ;-)
>
> I don't agree that a working 2500 LOC program is 'repeating the same
> architectural mistakes' as QEMU. I hope you realize that we've gotten
> here with just three part-time hackers working from their proverbial
> basements. So what you call mistakes, we call features for the sake of
> simplicity.

And by all means, it's a good accomplishment.

But the mistakes I'm referring to aren't missing bits of code.  It's 
that the current code makes really bad assumptions.

An example is ioport_ops.  This maps directly to 
ioport_{read,write}_table in QEMU.  Then you use ioport__register() to 
register entries in this table, similar to register_ioport_{read,write}() 
in QEMU.

The use of a struct is a small improvement but the fundamental design is 
flawed because it models a view of hardware where all devices are 
directly connected to the CPU.  This is not how hardware works at all.

On the PC QEMU tries to emulate, a PIO operation flows from the CPU to 
the i440fx.  The i440fx will do the first level of decoding treating the 
PCI host controller ports specially and then posting any I/Os in the PCI 
port range to the PCI bus.  If no device selects these ports, or the 
ports fall into the non-PCI range, the I/O request is then posted to the 
PIIX3.

The PIIX3 will handle a good chunk of the I/O requests (via its Super 
I/O chipset) and the remainder will be posted to the ISA bus.  One or 
more ISA devices may then react to these posted I/O operations.

Really, having a flat table doesn't make sense.  You should just send 
everything to an i440fx directly.  Then the i440fx should decode what it 
can, and send it to the next level, and so forth.

You can get 90% of the way to a working device model without modelling 
this type of flow, but you hit a wall pretty quickly as it's not unusual 
for PCI controllers to manipulate I/O requests in some fashion 
(particularly on non-x86 platforms).  If you treat everything as 
directly attached to the CPU, it's impossible to model this.
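
A minimal sketch of the layered decode described above, as opposed to a 
flat table.  Every name here (pio_node, pio_dispatch, the three node 
variables) is hypothetical and comes from neither QEMU nor the kvm tool.  
The port ranges are the conventional PC ones and are only illustrative, 
and the PCI bus level is folded into a simple linear chain for brevity:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical node in a decode chain; not a QEMU or kvm tool type. */
struct pio_node {
    const char *name;
    /* Returns true if this level claims the access. */
    bool (*decode)(struct pio_node *self, uint16_t port,
                   uint32_t *val, bool is_write);
    struct pio_node *next;  /* where unclaimed accesses are posted */
};

/* Host bridge: claims only the PCI configuration ports (0xcf8-0xcff). */
static bool i440fx_decode(struct pio_node *self, uint16_t port,
                          uint32_t *val, bool is_write)
{
    (void)self; (void)val; (void)is_write;
    return port >= 0xcf8 && port <= 0xcff;
}

/* South bridge: claims what it implements itself, here just the PIT. */
static bool piix3_decode(struct pio_node *self, uint16_t port,
                         uint32_t *val, bool is_write)
{
    (void)self; (void)val; (void)is_write;
    return port >= 0x40 && port <= 0x43;
}

/* ISA bus: a single serial port at the usual COM1 range. */
static bool isa_decode(struct pio_node *self, uint16_t port,
                       uint32_t *val, bool is_write)
{
    (void)self; (void)val; (void)is_write;
    return port >= 0x3f8 && port <= 0x3ff;
}

struct pio_node isa_bus = { "isa",    isa_decode,    NULL     };
struct pio_node piix3   = { "piix3",  piix3_decode,  &isa_bus };
struct pio_node i440fx  = { "i440fx", i440fx_decode, &piix3   };

/*
 * Each level decodes what it can and forwards the rest.  Returns the
 * node that claimed the access, or NULL if nothing did.
 */
struct pio_node *pio_dispatch(struct pio_node *root, uint16_t port,
                              uint32_t *val, bool is_write)
{
    assert(val != NULL);
    for (struct pio_node *n = root; n != NULL; n = n->next) {
        if (n->decode(n, port, val, is_write))
            return n;
    }
    if (!is_write)
        *val = 0xffffffffu;  /* assumption: unclaimed reads return all ones */
    return NULL;
}
```

The CPU side only ever calls pio_dispatch(&i440fx, ...).  Because each 
level sees the request before passing it on, a controller that needs to 
rewrite or swallow an access has a place to do so, which a flat table 
does not give you.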

Likewise, the same flow is true in the opposite direction.  You use 
guest_flat_to_host() which assumes a linear mapping of guest memory to 
host memory.  We used to do that too in QEMU (phys_ram_base + X).  It 
took a long time to get rid of that assumption in QEMU.

There are multiple problems with this sort of assumption.  The first is 
that you treat all devices as being directly attached to the memory 
controller.  As with I/O instruction dispatch, this is not the case, and 
there are many PCI controllers that will munge these accesses (think 
IOMMU, for instance).  The second is you assume that you're not doing 
I/O to device memory, but this does happen in practice.  The 
cpu_physical_memory_rw() API is careful to support cases where you're 
writing data to I/O memory.
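
To illustrate the second point, here is a rough sketch of a write path 
that goes through a lookup instead of assuming a linear mapping.  This 
is not the cpu_physical_memory_rw() implementation; mem_region, 
guest_mem_write and toy_dev_write are hypothetical names, and the 
region list is assumed to be small enough for a linear search:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical region descriptor; not a QEMU or kvm tool type. */
struct mem_region {
    uint64_t gpa;    /* guest-physical base address */
    uint64_t size;
    uint8_t *host;   /* backing memory for RAM, NULL for device memory */
    void (*mmio_write)(uint64_t offset, const uint8_t *buf, size_t len);
};

/* Toy device used for illustration: counts the bytes written to it. */
size_t toy_dev_bytes;

void toy_dev_write(uint64_t offset, const uint8_t *buf, size_t len)
{
    (void)offset; (void)buf;
    toy_dev_bytes += len;
}

static struct mem_region *find_region(struct mem_region *r, size_t n,
                                      uint64_t gpa)
{
    for (size_t i = 0; i < n; i++) {
        if (gpa >= r[i].gpa && gpa - r[i].gpa < r[i].size)
            return &r[i];
    }
    return NULL;
}

/*
 * Write len bytes at guest-physical address gpa, splitting the access at
 * region boundaries.  Returns false on hitting an unmapped address; a
 * prefix of the data may already have been written in that case.
 */
bool guest_mem_write(struct mem_region *regions, size_t nregions,
                     uint64_t gpa, const void *buf, size_t len)
{
    const uint8_t *src = buf;

    assert(buf != NULL || len == 0);
    while (len > 0) {
        struct mem_region *r = find_region(regions, nregions, gpa);
        if (r == NULL)
            return false;

        uint64_t off = gpa - r->gpa;
        uint64_t room = r->size - off;
        size_t chunk = len < room ? len : (size_t)room;

        if (r->host != NULL)
            memcpy(r->host + off, src, chunk);   /* plain RAM */
        else
            r->mmio_write(off, src, chunk);      /* device memory */

        gpa += chunk;
        src += chunk;
        len -= chunk;
    }
    return true;
}
```

A device model that calls guest_mem_write() never learns whether the 
target was RAM or another device's memory, so a DMA into device memory 
works without the caller special-casing it.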

The other big problem here is that if you have open access to guest 
memory like this, you cannot easily track dirty information.  Userspace 
accesses to guest memory will not result in KVM updating the guest dirty 
bitmap.  You can add another API to explicitly set dirty bits (and 
that's exactly what we did a few years ago) but then you'll get 
extremely subtle bugs in migration if you're missing a dirty update 
somewhere.  This is exactly how our API evolved in QEMU.
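
One way to avoid the missed-update problem is to make the write and the 
dirty marking a single call, so a device model cannot do one without 
the other.  The following is only a sketch under assumptions: 
guest_ram, guest_ram_write and guest_page_is_dirty are hypothetical 
names, pages are assumed to be 4 KiB, and merging this userspace bitmap 
with whatever KVM reports for guest-side writes is left out:

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define GUEST_PAGE_SHIFT 12  /* assumption: 4 KiB guest pages */
#define BITS_PER_LONG    (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical guest RAM descriptor; not a QEMU or kvm tool type. */
struct guest_ram {
    uint8_t *host;         /* host mapping of guest RAM */
    uint64_t size;         /* size in bytes */
    unsigned long *dirty;  /* one bit per guest page */
};

/* Set the dirty bit of every page touched by [gpa, gpa + len). */
static void mark_dirty(struct guest_ram *ram, uint64_t gpa, size_t len)
{
    uint64_t first = gpa >> GUEST_PAGE_SHIFT;
    uint64_t last = (gpa + len - 1) >> GUEST_PAGE_SHIFT;

    for (uint64_t pfn = first; pfn <= last; pfn++)
        ram->dirty[pfn / BITS_PER_LONG] |= 1UL << (pfn % BITS_PER_LONG);
}

/*
 * The only sanctioned way for userspace to write guest RAM: copy the
 * data and record the touched pages in one step.  Returns 0 on success
 * and -1 if the range falls outside guest RAM.
 */
int guest_ram_write(struct guest_ram *ram, uint64_t gpa,
                    const void *buf, size_t len)
{
    assert(ram != NULL);
    if (len == 0)
        return 0;
    if (gpa >= ram->size || len > ram->size - gpa)
        return -1;

    memcpy(ram->host + gpa, buf, len);
    mark_dirty(ram, gpa, len);
    return 0;
}

/* Returns 1 if the given guest page frame has been written, else 0. */
int guest_page_is_dirty(const struct guest_ram *ram, uint64_t pfn)
{
    return (ram->dirty[pfn / BITS_PER_LONG] >> (pfn % BITS_PER_LONG)) & 1;
}
```

A two-byte write at offset 4095 straddles a page boundary and marks 
pages 0 and 1 dirty.  Handing out a raw host pointer instead, as 
guest_flat_to_host() does, leaves that bookkeeping to every caller.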

As I said earlier, there are very good reasons we do the things we do in 
QEMU.  We're a large code base and there's far too much of the code base 
that no one cares about enough but that users are happy with.  It's far 
too hard to make broad sweeping changes right now (although that's 
something we're trying to improve).

But I'd strongly suggest taking some of the advice being offered here.  
Don't ignore the hard problems to start out with because as the code 
base grows, it'll become more difficult to fix those.  That's not to say 
that you need to implement migration tomorrow, but at least keep the 
constraints in mind and make sure that you're designing interfaces that 
let you do things like keep an updated dirty bitmap when you do memory 
accesses in userspace.

> I also don't agree with this sentiment that unless we have SMP,
> migration, yadda yadda yadda, now, it's impossible to change that in
> the future. It ignores the fact that this is exactly how the Linux
> kernel evolved

Over the course of 20 years.  By my count, we still have another decade 
of refactoring before I can get on top of my ivory tower and call every 
other project terrible.

>   and the fact that we're aggressively trying to keep the
> code size as small and tidy as possible so that changing things is as
> easy as possible.
>
> I've looked at QEMU sources over the years and especially over the
> past year and I think you might be way too familiar with its inner
> workings to see how complex (even the core code) has become for
> someone who isn't familiar with it.

I have no doubts about the complexity of QEMU.  But the 'goo' factor is 
not due to complexity, it's due to the fact that there's a lot of code 
that basically needs to be removed.  But removing features from an 
existing project is never a popular thing to do, particularly when they 
work well enough for a lot of people.

Regards,

Anthony Liguori