linux-kernel - Re: [PATCH v1 00/11] KVM: s390: pv: implement lazy destroy

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <896be0fd-5d5d-5998-8cb0-4ac8637412ac@de.ibm.com>
Date:   Tue, 18 May 2021 18:20:22 +0200
From:   Christian Borntraeger <borntraeger@...ibm.com>
To:     Claudio Imbrenda <imbrenda@...ux.ibm.com>
Cc:     Cornelia Huck <cohuck@...hat.com>, kvm@...r.kernel.org,
        frankja@...ux.ibm.com, thuth@...hat.com, pasic@...ux.ibm.com,
        david@...hat.com, linux-s390@...r.kernel.org,
        linux-kernel@...r.kernel.org
Subject: Re: [PATCH v1 00/11] KVM: s390: pv: implement lazy destroy



On 18.05.21 18:13, Claudio Imbrenda wrote:
> On Tue, 18 May 2021 17:45:18 +0200
> Christian Borntraeger <borntraeger@...ibm.com> wrote:
> 
>> On 18.05.21 17:36, Claudio Imbrenda wrote:
>>> On Tue, 18 May 2021 17:05:37 +0200
>>> Cornelia Huck <cohuck@...hat.com> wrote:
>>>    
>>>> On Mon, 17 May 2021 22:07:47 +0200
>>>> Claudio Imbrenda <imbrenda@...ux.ibm.com> wrote:
>>>>   
>>>>> Previously, when a protected VM was rebooted or when it was shut
>>>>> down, its memory was made unprotected, and then the protected VM
>>>>> itself was destroyed. Looping over the whole address space can
>>>>> take some time, considering the overhead of the various
>>>>> Ultravisor Calls (UVCs).  This means that a reboot or a shutdown
>>>>> would take a potentially long amount of time, depending on the
>>>>> amount of used memory.
>>>>>
>>>>> This patchseries implements a deferred destroy mechanism for
>>>>> protected guests. When a protected guest is destroyed, its memory
>>>>> is cleared in background, allowing the guest to restart or
>>>>> terminate significantly faster than before.
>>>>>
>>>>> There are 2 possibilities when a protected VM is torn down:
>>>>> * it still has an address space associated (reboot case)
>>>>> * it does not have an address space anymore (shutdown case)
>>>>>
>>>>> For the reboot case, the reference count of the mm is increased,
>>>>> and then a background thread is started to clean up. Once the
>>>>> thread went through the whole address space, the protected VM is
>>>>> actually destroyed.
>>>>>
>>>>> For the shutdown case, a list of pages to be destroyed is formed
>>>>> when the mm is torn down. Instead of just unmapping the pages when
>>>>> the address space is being torn down, they are also set aside.
>>>>> Later when KVM cleans up the VM, a thread is started to clean up
>>>>> the pages from the list.
>>>>
>>>> Just to make sure, 'clean up' includes doing uv calls?
>>>
>>> yes
>>>    
>>>>>
>>>>> This means that the same address space can have memory belonging
>>>>> to more than one protected guest, although only one will be
>>>>> running, the others will in fact not even have any CPUs.
>>>>
>>>> Are those set-aside-but-not-yet-cleaned-up pages still possibly
>>>> accessible in any way? I would assume that they only belong to the
>>>>   
>>>
>>> in case of reboot: yes, they are still in the address space of the
>>> guest, and can be swapped if needed
>>>    
>>>> 'zombie' guests, and any new or rebooted guest is a new entity that
>>>> needs to get new pages?
>>>
>>> the rebooted guest (normal or secure) will re-use the same pages of
>>> the old guest (before or after cleanup, which is the reason of
>>> patches 3 and 4)
>>>
>>> the KVM guest is not affected in case of reboot, so the userspace
>>> address space is not touched.
>>>    
>>>> Can too many not-yet-cleaned-up pages lead to a (temporary) memory
>>>> exhaustion?
>>>
>>> in case of reboot, not much; the pages were in use are still in use
>>> after the reboot, and they can be swapped.
>>>
>>> in case of a shutdown, yes, because the pages are really taken aside
>>> and cleared/destroyed in background. they cannot be swapped. they
>>> are freed immediately as they are processed, to try to mitigate
>>> memory exhaustion scenarios.
>>>
>>> in the end, this patchseries is a tradeoff between speed and memory
>>> consumption. the memory needs to be cleared up at some point, and
>>> that requires time.
>>>
>>> in cases where this might be an issue, I introduced a new KVM flag
>>> to disable lazy destroy (patch 10)
>>
>> Maybe we could piggy-back on the OOM-kill notifier and then fall back
>> to synchronous freeing for some pages?
> 
> I'm not sure I follow
> 
> once the pages have been set aside, it's too late
> 
> while the pages are being set aside, every now and then some memory
> needs to be allocated. the allocation is atomic, not allowed to use
> emergency reserves, and can fail without warning. if the allocation
> fails, we clean up one page and continue, without setting aside
> anything (patch 9)
> 
> so if the system is low on memory, the lazy destroy should not make the
> situation too much worse.
> 
> the only issue here is starting a normal process in the host (maybe
> a non secure guest) that uses a lot of memory very quickly, right after
> a large secure guest has terminated.

I think page cache page allocations do not need to be atomic.
In that case the kernel might stil l decide to trigger the oom killer. We can
let it notify ourselves free 256 pages synchronously and avoid the oom kill.
Have a look at the virtio-balloon virtio_balloon_oom_notify