Message-ID: <79d2fe19-f2fa-7df6-3833-81bd0b171fa1@suse.com>
Date: Mon, 12 Jun 2023 08:45:22 +0200
From: Juergen Gross <jgross@...e.com>
To: Demi Marie Obenour <demi@...isiblethingslab.com>,
Stefano Stabellini <sstabellini@...nel.org>,
Oleksandr Tyshchenko <oleksandr_tyshchenko@...m.com>,
Jan Beulich <JBeulich@...e.com>,
Konrad Rzeszutek Wilk <konrad.wilk@...cle.com>
Cc: Xen developer discussion <xen-devel@...ts.xenproject.org>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
stable@...r.kernel.org
Subject: Re: [PATCH] xen: speed up grant-table reclaim
On 12.06.23 08:27, Juergen Gross wrote:
> On 10.06.23 17:32, Demi Marie Obenour wrote:
>> When a grant entry is still in use by the remote domain, Linux must put
>> it on a deferred list.
>
> This lacks quite some context.
>
> The main problem is that the grant is not unmapped at the end of a request,
> while the side granting the access assumes that it will be.
>
> In general this means that the two sides implementing the protocol don't
> agree on how it should work, or that the protocol itself has a flaw.
>
>> Normally, this list is very short, because
>> the PV network and block protocols expect the backend to unmap the grant
>> first.
>
> Normally the list is just empty. Only in very rare cases, like premature
> PV frontend module unloading, is it expected to see deferred grant
> reclaims.
>
>> However, Qubes OS's GUI protocol is subject to the constraints
>> of the X Window System, and as such winds up with the frontend unmapping
>> the window first. As a result, the list can grow very large, resulting
>> in a massive memory leak and eventual VM freeze.
>
> I do understand that it is difficult to change the protocol and/or
> behavior after the fact, or that performance considerations stand in the
> way of doing so.
>
>> To partially solve this problem, make the number of entries that the VM
>> will attempt to free at each iteration tunable. The default is still
>> 10, but it can be overridden at compile-time (via Kconfig), boot-time
>> (via a kernel command-line option), or runtime (via sysfs).
>
> Is there really a need to have another Kconfig option for this? AFAICS
> only QubesOS is affected by the problem you are trying to solve. I don't
> see why you can't use the command-line option or the sysfs node to set a
> higher reclaim batch size.
>
>>
>> Fixes: 569ca5b3f94c ("xen/gnttab: add deferred freeing logic")
>
> I don't think this "Fixes:" tag is appropriate. The mentioned commit didn't
> have a bug. You are adding new functionality on top of it.
>
>> Cc: stable@...r.kernel.org
>> Signed-off-by: Demi Marie Obenour <demi@...isiblethingslab.com>
>> ---
>> drivers/xen/Kconfig | 12 ++++++++++++
>> drivers/xen/grant-table.c | 40 ++++++++++++++++++++++++++++-----------
>> 2 files changed, 41 insertions(+), 11 deletions(-)
>>
>> diff --git a/drivers/xen/Kconfig b/drivers/xen/Kconfig
>> index d5d7c402b65112b8592ba10bd3fd1732c26b771e..8f96e1359eb102d6420775b66e7805004a4ce9fe 100644
>> --- a/drivers/xen/Kconfig
>> +++ b/drivers/xen/Kconfig
>> @@ -65,6 +65,18 @@ config XEN_MEMORY_HOTPLUG_LIMIT
>> This value is used to allocate enough space in internal
>> tables needed for physical memory administration.
>> +config XEN_GRANTS_RECLAIM_PER_ITERATION
>> + int "Default number of grant entries to reclaim per iteration"
>> + default 10
>> + range 10 4294967295
>> + help
>> + This sets the default value for the grant_table.free_per_iteration
>> + kernel command line option, which sets the number of grants that
>> + Linux will try to reclaim at once. The default is 10, but
>> + workloads that make heavy use of gntalloc will likely want to
>> + increase this. The current value can be accessed and/or modified
>> + via /sys/module/grant_table/parameters/free_per_iteration.
>> +
>
> Apart from the fact that I don't like adding a new Kconfig option, the help
> text doesn't tell the whole story. Heavy use of gntalloc isn't to blame, but
> rather the fact that your PV-device implementation relies on the reclaim
> functionality. TBH, someone not familiar with grant reclaim will have no
> chance of selecting a sensible value based on your help text.
>
>> config XEN_SCRUB_PAGES_DEFAULT
>> bool "Scrub pages before returning them to system by default"
>> depends on XEN_BALLOON
>> diff --git a/drivers/xen/grant-table.c b/drivers/xen/grant-table.c
>> index e1ec725c2819d4d5dede063eb00d86a6d52944c0..fa666aa6abc3e786dddc94f895641505ec0b23d8 100644
>> --- a/drivers/xen/grant-table.c
>> +++ b/drivers/xen/grant-table.c
>> @@ -498,14 +498,20 @@ static LIST_HEAD(deferred_list);
>> static void gnttab_handle_deferred(struct timer_list *);
>> static DEFINE_TIMER(deferred_timer, gnttab_handle_deferred);
>> +static atomic64_t deferred_count;
>> +static atomic64_t leaked_count;
>> +static unsigned int free_per_iteration = CONFIG_XEN_GRANTS_RECLAIM_PER_ITERATION;
>> +
>> static void gnttab_handle_deferred(struct timer_list *unused)
>> {
>> - unsigned int nr = 10;
>> + unsigned int nr = READ_ONCE(free_per_iteration);
Two additional remarks:
I wouldn't be opposed to raising the default number of reclaims per
iteration to a higher value, e.g. 100.
Another option would be to move the reclaim handling into a workqueue and
try to reclaim as many grants as possible in one go, with a cond_resched()
call every e.g. 100 reclaims. This would remove the need for a reclaim
upper bound.
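Very roughly, and only to illustrate the idea (untested, the reclaim helper
is made up, locking is omitted; this would sit next to the existing
deferred_list in grant-table.c):

static void gnttab_reclaim_work_fn(struct work_struct *work);
static DECLARE_DELAYED_WORK(reclaim_work, gnttab_reclaim_work_fn);

static void gnttab_reclaim_work_fn(struct work_struct *work)
{
        unsigned int done = 0;

        /* Reclaim whatever can be reclaimed right now, no upper bound. */
        while (gnttab_try_reclaim_one()) {      /* hypothetical helper */
                if (++done % 100 == 0)
                        cond_resched();         /* fine in workqueue context */
        }

        /* Anything still mapped by the other end: try again later. */
        if (!list_empty(&deferred_list))
                schedule_delayed_work(&reclaim_work, HZ);
}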
Juergen