linux-kernel - RE: [PATCH kernel v5 5/5] virtio-balloon: tell host vm's unused page info

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <F2CBF3009FA73547804AE4C663CAB28E3A12F831@shsmsx102.ccr.corp.intel.com>
Date:   Tue, 6 Dec 2016 04:47:27 +0000
From:   "Li, Liang Z" <liang.z.li@...el.com>
To:     "Hansen, Dave" <dave.hansen@...el.com>,
        "kvm@...r.kernel.org" <kvm@...r.kernel.org>
CC:     "virtualization@...ts.linux-foundation.org" 
        <virtualization@...ts.linux-foundation.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "linux-mm@...ck.org" <linux-mm@...ck.org>,
        "virtio-dev@...ts.oasis-open.org" <virtio-dev@...ts.oasis-open.org>,
        "qemu-devel@...gnu.org" <qemu-devel@...gnu.org>,
        "quintela@...hat.com" <quintela@...hat.com>,
        "dgilbert@...hat.com" <dgilbert@...hat.com>,
        "mst@...hat.com" <mst@...hat.com>,
        "jasowang@...hat.com" <jasowang@...hat.com>,
        "kirill.shutemov@...ux.intel.com" <kirill.shutemov@...ux.intel.com>,
        "akpm@...ux-foundation.org" <akpm@...ux-foundation.org>,
        "mhocko@...e.com" <mhocko@...e.com>,
        "pbonzini@...hat.com" <pbonzini@...hat.com>,
        Mel Gorman <mgorman@...hsingularity.net>,
        Cornelia Huck <cornelia.huck@...ibm.com>,
        "Amit Shah" <amit.shah@...hat.com>
Subject: RE: [PATCH kernel v5 5/5] virtio-balloon: tell host vm's unused
 page info

> >>> +	mutex_lock(&vb->balloon_lock);
> >>> +
> >>> +	for (order = MAX_ORDER - 1; order >= 0; order--) {
> >>
> >> I scratched my head for a bit on this one.  Why are you walking over
> >> orders,
> >> *then* zones.  I *think* you're doing it because you can efficiently
> >> fill the bitmaps at a given order for all zones, then move to a new
> >> bitmap.  But, it would be interesting to document this.
> >
> > Yes, use the order is somewhat strange, but it's helpful to keep the API simple.
> > Do you think it's acceptable?
> 
> Yeah, it's fine.  Just comment it, please.
> 
Good!

> >>> +		if (ret == -ENOSPC) {
> >>> +			void *new_resp_data;
> >>> +
> >>> +			new_resp_data = kmalloc(2 * vb->resp_buf_size,
> >>> +						GFP_KERNEL);
> >>> +			if (new_resp_data) {
> >>> +				kfree(vb->resp_data);
> >>> +				vb->resp_data = new_resp_data;
> >>> +				vb->resp_buf_size *= 2;
> >>
> >> What happens to the data in ->resp_data at this point?  Doesn't this
> >> just throw it away?
> >
> > Yes, so we should make sure the data in resp_data is not inuse.
> 
> But doesn't it have valid data that we just collected and haven't told the
> hypervisor about yet?  Aren't we throwing away good data that cost us
> something to collect?

Indeed.  Some filled data may exist for the previous zone. Should we
change the API to 
'int get_unused_pages(unsigned long *unused_pages, unsigned long size,
		int order, unsigned long *pos, struct zone *zone)' ?

then we can use the 'zone' to record the zone to retry and not discard the
filled data.

> >> ...
> >>> +struct page_info_item {
> >>> +	__le64 start_pfn : 52; /* start pfn for the bitmap */
> >>> +	__le64 page_shift : 6; /* page shift width, in bytes */
> 
> What does a page_shift "in bytes" mean? :)

Obviously, you know. :o
I will try to make it clear.

> 
> >>> +	__le64 bmap_len : 6;  /* bitmap length, in bytes */ };
> >>
> >> Is 'bmap_len' too short?  a 64-byte buffer is a bit tiny.  Right?
> >
> > Currently, we just use the 8 bytes and 0 bytes bitmap, should we support
> more than 64 bytes?
> 
> It just means that with this format, you end up wasting at least ~1/8th of the
> space with metadata.  That's a bit unfortunate, but I guess it's not fatal.
> 
> I'd definitely call it out in the patch description and make sure other folks take
> a look at it.

OK.

> 
> There's a somewhat easy fix, but that would make the qemu implementation
> more complicated: You could just have bmap_len==0x3f imply that there's
> another field that contains an extended bitmap length for when you need long
> bitmaps.
> 
> But, as you note, there's no need for it, so it's a matter of trading the extra
> complexity versus the desire to not habing to change the ABI again for longer
> (hopefully).
> 

Your suggestion still works without changing the current code, just reserve
 ' bmap_len==0x3f' for future extension, and it's not used by the current code.

> >>> +static int  mark_unused_pages(struct zone *zone,
> >>> +		unsigned long *unused_pages, unsigned long size,
> >>> +		int order, unsigned long *pos)
> >>> +{
> >>> +	unsigned long pfn, flags;
> >>> +	unsigned int t;
> >>> +	struct list_head *curr;
> >>> +	struct page_info_item *info;
> >>> +
> >>> +	if (zone_is_empty(zone))
> >>> +		return 0;
> >>> +
> >>> +	spin_lock_irqsave(&zone->lock, flags);
> >>> +
> >>> +	if (*pos + zone->free_area[order].nr_free > size)
> >>> +		return -ENOSPC;
> >>
> >> Urg, so this won't partially fill?  So, what the nr_free pages limit
> >> where we no longer fit in the kmalloc()'d buffer where this simply won't
> work?
> >
> > Yes.  My initial implementation is partially fill, it's better for the worst case.
> > I thought the above code is more efficient for most case ...
> > Do you think partially fill the bitmap is better?
> 
> Could you please answer the question I asked?
> 

For your question:
-------------------------------------------------------------------------------------------------------
>So, what the nr_free pages limit where we no longer fit in the kmalloc()'d buffer
> where this simply won't work?
------------------------------------------------------------------------------------------------------
No, if the buffer is not big enough to save 'nr_free'  pages, get_unused_pages() will return
'-ENOSPC', and the following code will try to allocate a 2x times size buffer for retrying,
until the proper size buffer is allocated. The current order will not be skipped unless the
buffer allocation failed.

> Because if you don't get this right, it could mean that there are system that
> simply *fail* here.