Message-ID: <1150fd9c-25c8-c91c-4ca2-2c587e3fbb43@suse.com>
Date: Tue, 13 Jun 2023 08:45:31 +0200
From: Juergen Gross <jgross@...e.com>
To: Demi Marie Obenour <demi@...isiblethingslab.com>,
Stefano Stabellini <sstabellini@...nel.org>,
Oleksandr Tyshchenko <oleksandr_tyshchenko@...m.com>,
Jan Beulich <JBeulich@...e.com>,
Konrad Rzeszutek Wilk <konrad.wilk@...cle.com>
Cc: Xen developer discussion <xen-devel@...ts.xenproject.org>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
stable@...r.kernel.org
Subject: Re: [PATCH] xen: speed up grant-table reclaim
On 12.06.23 22:09, Demi Marie Obenour wrote:
> On Mon, Jun 12, 2023 at 08:27:59AM +0200, Juergen Gross wrote:
>> On 10.06.23 17:32, Demi Marie Obenour wrote:
>>> When a grant entry is still in use by the remote domain, Linux must put
>>> it on a deferred list.
>>
>> This lacks quite a bit of context.
>>
>> The main problem is related to the grant not having been unmapped after
>> the end of a request, but the side granting the access is assuming this
>> should be the case.
>
> The GUI agent has relied on deferred grant reclaim for as long as it has
> existed. One could argue that doing so means that the agent is misusing
> gntalloc, but this is not documented anywhere. A better fix would be to
> use IOCTL_GNTDEV_SET_UNMAP_NOTIFY in the GUI daemon.
I don't think this is a gntalloc-specific problem. You could create the same
problem with a kernel implementation.
And yes, using IOCTL_GNTDEV_SET_UNMAP_NOTIFY might be a good idea.
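Roughly, the daemon would request the notification right after mapping
(untested sketch; the helper name is made up, error handling is omitted,
and the event channel is assumed to have been set up separately):

  #include <stdint.h>
  #include <sys/ioctl.h>
  #include <xen/gntdev.h>

  /*
   * 'index' is the offset returned by IOCTL_GNTDEV_MAP_GRANT_REF (the
   * same one passed to mmap()). With UNMAP_NOTIFY_SEND_EVENT, gntdev
   * signals 'port' once the last mapping of the granted page goes away,
   * so the granting side knows it can reclaim instead of deferring.
   */
  static int request_unmap_notify(int gntdev_fd, uint64_t index,
                                  uint32_t port)
  {
          struct ioctl_gntdev_unmap_notify notify = {
                  .index = index,
                  .action = UNMAP_NOTIFY_SEND_EVENT,
                  .event_channel_port = port,
          };

          return ioctl(gntdev_fd, IOCTL_GNTDEV_SET_UNMAP_NOTIFY, &notify);
  }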
>> In general this means that the two sides implementing the protocol don't
>> agree on how it should work, or that the protocol itself has a flaw.
>
> What would a better solution be? This is going to be particularly
> tricky with Wayland, as the wl_shm protocol makes absolutely no
> guarantees that compositors will promptly release the mapping and
> provides no way whatsoever for Wayland clients to know when this has
> happened. Relying on an LD_PRELOAD hack is not sustainable.
>
>>> Normally, this list is very short, because
>>> the PV network and block protocols expect the backend to unmap the grant
>>> first.
>>
>> Normally the list is just empty. Only in very rare cases, such as premature
>> PV frontend module unloading, are deferred grant reclaims expected.
>
> In the case of a system using only properly-designed PV protocols
> implemented in kernel mode I agree. However, both libxenvchan and the
> Qubes GUI protocol are implemented in user mode and this means that if
> the frontend process (the one that uses gntalloc) crashes, deferred
> grant reclaims will occur.
This would be the user space variant of frontend driver unloading.
> Worse, it is possible for the domain to use
> the grant in a PV protocol. If the PV backend driver then maps and
> unmaps the grant and then tells the frontend driver to reclaim it, but
> the backend userspace process (the one using gntdev) maps it before the
> frontend reclaims it, the frontend will think the backend is trying
> to exploit XSA-396 and freeze the connection.
Let me sum up how I understand the above statement:
1. Frontend F in DomA grants DomB access via grant X for PV-device Y.
2. DomB backend B for PV-device Y maps the page using grant X and uses it.
3. DomB backend B unmaps grant X and signals end of usage to DomA frontend F.
4. A userspace process in DomB maps grant X.
5. DomA frontend F tries to reclaim grant X and fails due to the DomB
   userspace mapping.
So why would a DomB userspace process map grant X? This would imply a malicious
process in DomB. That might be possible, but I don't see how such an attack
relates to your problem: it could happen with any malicious userspace
program with root access in a domain running a PV backend.
>
>>> However, Qubes OS's GUI protocol is subject to the constraints
>>> of the X Window System, and as such winds up with the frontend unmapping
>>> the window first. As a result, the list can grow very large, resulting
>>> in a massive memory leak and eventual VM freeze.
>>
>> I do understand that it is difficult to change the protocol and/or
>> behavior after the fact, or that performance considerations are in the
>> way of doing so.
>
> Would the correct fix be to use IOCTL_GNTDEV_SET_UNMAP_NOTIFY? That
> would require that the agent either create a new event channel for each
> window or maintain a pool of event channels, but that should be doable.
I think this sounds like a promising idea.
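For either variant the agent needs one interdomain event channel per
in-flight window. A very rough sketch of the allocation side via
/dev/xen/evtchn (untested; the helper name is made up):

  #include <sys/ioctl.h>
  #include <xen/evtchn.h>

  /*
   * 'evtchn_fd' is an open fd for /dev/xen/evtchn. Allocates an unbound
   * event channel that 'remote_domain' (the domain running the daemon)
   * can bind to; the ioctl returns the local port number. The channel
   * only stays allocated while the fd is open.
   */
  static int alloc_notify_port(int evtchn_fd, unsigned int remote_domain)
  {
          struct ioctl_evtchn_bind_unbound_port op = {
                  .remote_domain = remote_domain,
          };

          return ioctl(evtchn_fd, IOCTL_EVTCHN_BIND_UNBOUND_PORT, &op);
  }

The daemon would then bind the port via IOCTL_EVTCHN_BIND_INTERDOMAIN and
pass its local port to IOCTL_GNTDEV_SET_UNMAP_NOTIFY.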
> This still does not solve the problem of the frontend exiting
> unexpectedly, though.
Such cases need to be handled via deferred reclaim. You could add a flag
to struct deferred_entry for entries queued via gntalloc_vma_close(),
indicating an expected deferred reclaim that should not lead to error
messages.
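Something along these lines (illustrative only; the new field name is made
up, and I'm citing the current struct layout from memory):

  struct deferred_entry {
          struct list_head list;
          grant_ref_t ref;
          uint16_t warn_delay;
          struct page *page;
          bool expected;  /* reclaim anticipated, e.g. gntalloc teardown */
  };

gnttab_handle_deferred() could then skip the "still pending" message (or
downgrade it to pr_debug()) for entries with expected set.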
>
>>> To partially solve this problem, make the number of entries that the VM
>>> will attempt to free at each iteration tunable. The default is still
>>> 10, but it can be overridden at compile-time (via Kconfig), boot-time
>>> (via a kernel command-line option), or runtime (via sysfs).
>>
>> Is there really a need to have another Kconfig option for this? AFAICS
>> only QubesOS is affected by the problem you are trying to solve. I don't
>> see why you can't use the command-line option or sysfs node to set the
>> higher reclaim batch size.
>
> Fair. In practice, Qubes OS will need to use the sysfs node, since
> the other two do not work with in-VM kernels.
>
>>> Fixes: 569ca5b3f94c ("xen/gnttab: add deferred freeing logic")
>>
>> I don't think this "Fixes:" tag is appropriate. The mentioned commit didn't
>> have a bug. You are adding new functionality on top of it.
>
> I’ll drop the "Fixes:" tag, but I will keep the "Cc: stable@...r.kernel.org"
> as I believe this patch meets the following criterion for stable
> backport (from Documentation/process/stable-kernel-rules.rst):
>
> Serious issues as reported by a user of a distribution kernel may also
> be considered if they fix a notable performance or interactivity issue.
>
Sure, this seems appropriate.
>>> Cc: stable@...r.kernel.org
>>> Signed-off-by: Demi Marie Obenour <demi@...isiblethingslab.com>
>>> ---
>>> drivers/xen/Kconfig | 12 ++++++++++++
>>> drivers/xen/grant-table.c | 40 ++++++++++++++++++++++++++++-----------
>>> 2 files changed, 41 insertions(+), 11 deletions(-)
>>>
>>> diff --git a/drivers/xen/Kconfig b/drivers/xen/Kconfig
>>> index d5d7c402b65112b8592ba10bd3fd1732c26b771e..8f96e1359eb102d6420775b66e7805004a4ce9fe 100644
>>> --- a/drivers/xen/Kconfig
>>> +++ b/drivers/xen/Kconfig
>>> @@ -65,6 +65,18 @@ config XEN_MEMORY_HOTPLUG_LIMIT
>>>  	  This value is used to allocate enough space in internal
>>>  	  tables needed for physical memory administration.
>>> 
>>> +config XEN_GRANTS_RECLAIM_PER_ITERATION
>>> +	int "Default number of grant entries to reclaim per iteration"
>>> +	default 10
>>> +	range 10 4294967295
>>> +	help
>>> +	  This sets the default value for the grant_table.free_per_iteration
>>> +	  kernel command line option, which sets the number of grants that
>>> +	  Linux will try to reclaim at once. The default is 10, but
>>> +	  workloads that make heavy use of gntalloc will likely want to
>>> +	  increase this. The current value can be accessed and/or modified
>>> +	  via /sys/module/grant_table/parameters/free_per_iteration.
>>> +
>>
>> Apart from the fact that I don't like adding a new Kconfig option, the help
>> text doesn't tell the whole story. Heavy use of gntalloc isn't to blame, but
>> rather the fact that your PV-device implementation relies on the reclaim
>> functionality. TBH, someone not familiar with grant reclaim will have no
>> chance of selecting a sensible value based on your help text.
>
> That’s a good point. Qubes OS will need to use the sysfs value anyway
> in order to support in-VM kernels, so the Kconfig option is not really
> useful.
Nice.
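With the module parameter being writable at runtime, a single write in the
VM should then be all that's needed, e.g. (the value is just an example):

  echo 100 > /sys/module/grant_table/parameters/free_per_iteration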
>
>>>  config XEN_SCRUB_PAGES_DEFAULT
>>>  	bool "Scrub pages before returning them to system by default"
>>>  	depends on XEN_BALLOON
>>> diff --git a/drivers/xen/grant-table.c b/drivers/xen/grant-table.c
>>> index e1ec725c2819d4d5dede063eb00d86a6d52944c0..fa666aa6abc3e786dddc94f895641505ec0b23d8 100644
>>> --- a/drivers/xen/grant-table.c
>>> +++ b/drivers/xen/grant-table.c
>>> @@ -498,14 +498,20 @@ static LIST_HEAD(deferred_list);
>>>  static void gnttab_handle_deferred(struct timer_list *);
>>>  static DEFINE_TIMER(deferred_timer, gnttab_handle_deferred);
>>> 
>>> +static atomic64_t deferred_count;
>>> +static atomic64_t leaked_count;
>>> +static unsigned int free_per_iteration = CONFIG_XEN_GRANTS_RECLAIM_PER_ITERATION;
>>> +
>>>  static void gnttab_handle_deferred(struct timer_list *unused)
>>>  {
>>> -	unsigned int nr = 10;
>>> +	unsigned int nr = READ_ONCE(free_per_iteration);
>>> +	const bool ignore_limit = nr == 0;
>>>  	struct deferred_entry *first = NULL;
>>>  	unsigned long flags;
>>> +	size_t freed = 0;
>>> 
>>>  	spin_lock_irqsave(&gnttab_list_lock, flags);
>>> -	while (nr--) {
>>> +	while ((ignore_limit || nr--) && !list_empty(&deferred_list)) {
>>>  		struct deferred_entry *entry
>>>  			= list_first_entry(&deferred_list,
>>>  					   struct deferred_entry, list);
>>> @@ -515,10 +521,13 @@ static void gnttab_handle_deferred(struct timer_list *unused)
>>>  		list_del(&entry->list);
>>>  		spin_unlock_irqrestore(&gnttab_list_lock, flags);
>>>  		if (_gnttab_end_foreign_access_ref(entry->ref)) {
>>> +			uint64_t ret = atomic64_sub_return(1, &deferred_count);
>>>  			put_free_entry(entry->ref);
>>> -			pr_debug("freeing g.e. %#x (pfn %#lx)\n",
>>> -				 entry->ref, page_to_pfn(entry->page));
>>> +			pr_debug("freeing g.e. %#x (pfn %#lx), %llu remaining\n",
>>> +				 entry->ref, page_to_pfn(entry->page),
>>> +				 (unsigned long long)ret);
>>
>> I'd prefer not to issue lots of prints (be they debug or info ones) if the
>> reclaim is expected to happen with a specific PV-device.
>>
>> My preferred solution would be a per-device flag, but at least you should
>> switch to pr_*_ratelimited(). The same applies further down.
>
> What do you mean by “per-device flag”? These grants are allocated by
> userspace using gntalloc, so there is no PV device on which the flag
> could be set.
In this case the flag should relate to the file pointer used for
gntalloc_open().
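I.e. something like this (rough sketch against drivers/xen/gntalloc.c; the
new field is hypothetical and the existing layout is cited from memory):

  struct gntalloc_file_private_data {
          struct list_head list;
          uint64_t index;
          bool expect_deferred;   /* e.g. set via a new gntalloc ioctl */
  };

gntalloc_vma_close() / __del_gref() would then need to plumb the flag
through gnttab_end_foreign_access() into the deferred entry sketched
above.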
Juergen