linux-kernel - Re: [PATCH v1 00/11] KVM: s390: pv: implement lazy destroy

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20210518183442.6e078ea1@ibm-vm>
Date:   Tue, 18 May 2021 18:34:42 +0200
From:   Claudio Imbrenda <imbrenda@...ux.ibm.com>
To:     Christian Borntraeger <borntraeger@...ibm.com>
Cc:     Cornelia Huck <cohuck@...hat.com>, kvm@...r.kernel.org,
        frankja@...ux.ibm.com, thuth@...hat.com, pasic@...ux.ibm.com,
        david@...hat.com, linux-s390@...r.kernel.org,
        linux-kernel@...r.kernel.org
Subject: Re: [PATCH v1 00/11] KVM: s390: pv: implement lazy destroy

On Tue, 18 May 2021 18:20:22 +0200
Christian Borntraeger <borntraeger@...ibm.com> wrote:

> On 18.05.21 18:13, Claudio Imbrenda wrote:
> > On Tue, 18 May 2021 17:45:18 +0200
> > Christian Borntraeger <borntraeger@...ibm.com> wrote:
> >   
> >> On 18.05.21 17:36, Claudio Imbrenda wrote:  
> >>> On Tue, 18 May 2021 17:05:37 +0200
> >>> Cornelia Huck <cohuck@...hat.com> wrote:
> >>>      
> >>>> On Mon, 17 May 2021 22:07:47 +0200
> >>>> Claudio Imbrenda <imbrenda@...ux.ibm.com> wrote:
> >>>>     
> >>>>> Previously, when a protected VM was rebooted or when it was shut
> >>>>> down, its memory was made unprotected, and then the protected VM
> >>>>> itself was destroyed. Looping over the whole address space can
> >>>>> take some time, considering the overhead of the various
> >>>>> Ultravisor Calls (UVCs).  This means that a reboot or a shutdown
> >>>>> would take a potentially long amount of time, depending on the
> >>>>> amount of used memory.
> >>>>>
> >>>>> This patchseries implements a deferred destroy mechanism for
> >>>>> protected guests. When a protected guest is destroyed, its
> >>>>> memory is cleared in background, allowing the guest to restart
> >>>>> or terminate significantly faster than before.
> >>>>>
> >>>>> There are 2 possibilities when a protected VM is torn down:
> >>>>> * it still has an address space associated (reboot case)
> >>>>> * it does not have an address space anymore (shutdown case)
> >>>>>
> >>>>> For the reboot case, the reference count of the mm is increased,
> >>>>> and then a background thread is started to clean up. Once the
> >>>>> thread went through the whole address space, the protected VM is
> >>>>> actually destroyed.
> >>>>>
> >>>>> For the shutdown case, a list of pages to be destroyed is formed
> >>>>> when the mm is torn down. Instead of just unmapping the pages
> >>>>> when the address space is being torn down, they are also set
> >>>>> aside. Later when KVM cleans up the VM, a thread is started to
> >>>>> clean up the pages from the list.  
> >>>>
> >>>> Just to make sure, 'clean up' includes doing uv calls?  
> >>>
> >>> yes
> >>>      
> >>>>>
> >>>>> This means that the same address space can have memory belonging
> >>>>> to more than one protected guest, although only one will be
> >>>>> running, the others will in fact not even have any CPUs.  
> >>>>
> >>>> Are those set-aside-but-not-yet-cleaned-up pages still possibly
> >>>> accessible in any way? I would assume that they only belong to
> >>>> the 
> >>>
> >>> in case of reboot: yes, they are still in the address space of the
> >>> guest, and can be swapped if needed
> >>>      
> >>>> 'zombie' guests, and any new or rebooted guest is a new entity
> >>>> that needs to get new pages?  
> >>>
> >>> the rebooted guest (normal or secure) will re-use the same pages
> >>> of the old guest (before or after cleanup, which is the reason of
> >>> patches 3 and 4)
> >>>
> >>> the KVM guest is not affected in case of reboot, so the userspace
> >>> address space is not touched.
> >>>      
> >>>> Can too many not-yet-cleaned-up pages lead to a (temporary)
> >>>> memory exhaustion?  
> >>>
> >>> in case of reboot, not much; the pages were in use are still in
> >>> use after the reboot, and they can be swapped.
> >>>
> >>> in case of a shutdown, yes, because the pages are really taken
> >>> aside and cleared/destroyed in background. they cannot be
> >>> swapped. they are freed immediately as they are processed, to try
> >>> to mitigate memory exhaustion scenarios.
> >>>
> >>> in the end, this patchseries is a tradeoff between speed and
> >>> memory consumption. the memory needs to be cleared up at some
> >>> point, and that requires time.
> >>>
> >>> in cases where this might be an issue, I introduced a new KVM flag
> >>> to disable lazy destroy (patch 10)  
> >>
> >> Maybe we could piggy-back on the OOM-kill notifier and then fall
> >> back to synchronous freeing for some pages?  
> > 
> > I'm not sure I follow
> > 
> > once the pages have been set aside, it's too late
> > 
> > while the pages are being set aside, every now and then some memory
> > needs to be allocated. the allocation is atomic, not allowed to use
> > emergency reserves, and can fail without warning. if the allocation
> > fails, we clean up one page and continue, without setting aside
> > anything (patch 9)
> > 
> > so if the system is low on memory, the lazy destroy should not make
> > the situation too much worse.
> > 
> > the only issue here is starting a normal process in the host (maybe
> > a non secure guest) that uses a lot of memory very quickly, right
> > after a large secure guest has terminated.  
> 
> I think page cache page allocations do not need to be atomic.
> In that case the kernel might stil l decide to trigger the oom
> killer. We can let it notify ourselves free 256 pages synchronously
> and avoid the oom kill. Have a look at the virtio-balloon
> virtio_balloon_oom_notify

the issue is that once the pages have been set aside, it's too late.
the OOM notifier would only be useful if we get notified of the OOM
situation _while_ setting aside the pages.

unless you mean that the notifier should simply wait until the thread
has done (some of) its work?