linux-kernel - Re: [RFC][Patch v8 6/7] KVM: Enables the kernel to isolate and report free pages

Message-ID: <CAKgT0UcEFi7Rq9+EnLNNm61782QYRT6XZQDgbWFt2S8eMLf=qA@mail.gmail.com>
Date:   Fri, 8 Feb 2019 09:58:47 -0800
From:   Alexander Duyck <alexander.duyck@...il.com>
To:     Nitesh Narayan Lal <nitesh@...hat.com>
Cc:     "Michael S. Tsirkin" <mst@...hat.com>,
        kvm list <kvm@...r.kernel.org>,
        LKML <linux-kernel@...r.kernel.org>,
        Paolo Bonzini <pbonzini@...hat.com>, lcapitulino@...hat.com,
        pagupta@...hat.com, wei.w.wang@...el.com,
        Yang Zhang <yang.zhang.wz@...il.com>,
        Rik van Riel <riel@...riel.com>, david@...hat.com,
        dodgen@...gle.com, Konrad Rzeszutek Wilk <konrad.wilk@...cle.com>,
        dhildenb@...hat.com, Andrea Arcangeli <aarcange@...hat.com>
Subject: Re: [RFC][Patch v8 6/7] KVM: Enables the kernel to isolate and report
 free pages

On Thu, Feb 7, 2019 at 12:50 PM Nitesh Narayan Lal <nitesh@...hat.com> wrote:
>
>
> On 2/7/19 12:43 PM, Alexander Duyck wrote:
> > On Tue, Feb 5, 2019 at 3:21 PM Michael S. Tsirkin <mst@...hat.com> wrote:
> >> On Tue, Feb 05, 2019 at 04:54:03PM -0500, Nitesh Narayan Lal wrote:
> >>> On 2/5/19 3:45 PM, Michael S. Tsirkin wrote:
> >>>> On Mon, Feb 04, 2019 at 03:18:53PM -0500, Nitesh Narayan Lal wrote:
> >>>>> This patch enables the kernel to scan the per cpu array and
> >>>>> compress it by removing the repetitive/re-allocated pages.
> >>>>> Once the per cpu array is completely filled with pages in the
> >>>>> buddy it wakes up the kernel per cpu thread which re-scans the
> >>>>> entire per cpu array by acquiring a zone lock corresponding to
> >>>>> the page which is being scanned. If the page is still free and
> >>>>> present in the buddy it tries to isolate the page and adds it
> >>>>> to another per cpu array.
> >>>>>
> >>>>> Once this scanning process is complete and if there are any
> >>>>> isolated pages added to the new per cpu array kernel thread
> >>>>> invokes hyperlist_ready().
> >>>>>
> >>>>> In hyperlist_ready() a hypercall is made to report these pages to
> >>>>> the host using the virtio-balloon framework. In order to do so
> >>>>> another virtqueue 'hinting_vq' is added to the balloon framework.
> >>>>> As the host frees all the reported pages, the kernel thread returns
> >>>>> them back to the buddy.
> >>>>>
> >>>>> Signed-off-by: Nitesh Narayan Lal <nitesh@...hat.com>
> >>>> This looks kind of like what early iterations of Wei's patches did.
> >>>>
> >>>> But this has lots of issues, for example you might end up with
> >>>> a hypercall per a 4K page.
> >>>> So in the end, he switched over to just reporting only
> >>>> MAX_ORDER - 1 pages.
> >>> You mean that I should only capture/attempt to isolate pages with order
> >>> MAX_ORDER - 1?
> >>>> Would that be a good idea for you too?
> >>> Will it help if we have a threshold value based on the amount of memory
> >>> captured instead of the number of entries/pages in the array?
> >> This is what Wei's patches do at least.
> > So in the solution I had posted I was looking more at
> > HUGETLB_PAGE_ORDER and above as the size of pages to provide the hints
> > on [1]. The advantage to doing that is that you can also avoid
> > fragmenting huge pages which in turn can cause what looks like a
> > memory leak as the memory subsystem attempts to reassemble huge
> > pages[2]. In my mind a 2MB page makes good sense in terms of the size
> > of things to be performing hints on as anything smaller than that is
> > going to just end up being a bunch of extra work and end up causing a
> > bunch of fragmentation.
> As per my opinion, in any implementation which page size to store before
> reporting depends on the allocation pattern of the workload running in
> the guest.

I suggest you take a look at item 2 that I had called out in the
previous email. There are known issues with providing hints smaller
than a THP using MADV_DONTNEED or MADV_FREE. Specifically, you end up
breaking up a higher order transparent huge page and backfilling a few
holes with other pages. The memory allocation subsystem then attempts
to reassemble the larger huge page, and the application exhibits
behavior similar to a memory leak even though it is not actually
allocating memory, since it is sitting on fragments of huge pages.

Also, while I am thinking of it, I haven't noticed anywhere that you
are handling the case of a device assigned to the guest. That seems
like a spot where we are going to have to stop hinting as well, doesn't
it? Otherwise we would need to redo the memory mapping of the guest in
the IOMMU every time a page is evicted and replaced.

> I am also planning to try Michael's suggestion of using MAX_ORDER - 1.
> However I am still thinking about a workload which I can use to test its
> effectiveness.

You might want to look at doing something like min(MAX_ORDER - 1,
HUGETLB_PAGE_ORDER). I know for x86 a 2MB page is the upper limit for
THP, which is the page size most likely to be used with the guest.
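
A minimal user-space sketch of that threshold, for illustration only. It
assumes x86-64 with 4K base pages, MAX_ORDER = 11 and a 2MB PMD
(PMD_SHIFT = 21); those constants are hard-coded here rather than taken
from kernel headers, and the helpers hint_min_order(), hint_bytes(),
should_hint() and hints_needed() are hypothetical names that do not
appear in the patch set.

```c
#include <assert.h>

/* Assumed values for x86-64 with 4K base pages (not taken from the patch). */
#define PAGE_SHIFT          12
#define PMD_SHIFT           21
#define MAX_ORDER           11                       /* buddy orders 0..10 */
#define HUGETLB_PAGE_ORDER  (PMD_SHIFT - PAGE_SHIFT) /* 9, i.e. 2MB */

/* Hypothetical helper: smallest free-block order worth hinting on. */
static unsigned int hint_min_order(void)
{
	unsigned int a = MAX_ORDER - 1;
	unsigned int b = HUGETLB_PAGE_ORDER;

	return a < b ? a : b;
}

/* Bytes covered by a single hint of the given order. */
static unsigned long long hint_bytes(unsigned int order)
{
	assert(order < MAX_ORDER);
	return 1ULL << (PAGE_SHIFT + order);
}

/* Hypothetical filter: skip blocks small enough to split a huge page. */
static int should_hint(unsigned int order)
{
	return order >= hint_min_order();
}

/* Number of hints needed to report `bytes` of free memory at one order. */
static unsigned long long hints_needed(unsigned long long bytes,
				       unsigned int order)
{
	unsigned long long sz = hint_bytes(order);

	return (bytes + sz - 1) / sz;
}
```

With these assumed values the threshold works out to order 9, so each
hint covers 2MB. Reporting 1GB of free memory then takes 512 hints,
versus 262144 if every 4K page were reported on its own.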

> >
> > The only issue with limiting things on an arbitrary boundary like that
> > is that you have to hook into the buddy allocator to catch the cases
> > where a page has been merged up into that range.
> I don't think, I understood your comment completely. In any case, we
> have to rely on the buddy for merging the pages.
> >
> > [1] https://lkml.org/lkml/2019/2/4/903
> > [2] https://blog.digitalocean.com/transparent-huge-pages-and-alternative-memory-allocators/
> --
> Regards
> Nitesh
>