Message-ID: <da8288f7-f6c6-00da-fcbc-5c70c42359bc@linux.ibm.com>
Date:   Tue, 20 Apr 2021 20:55:12 +0530
From:   Pratik Sampat <psampat@...ux.ibm.com>
To:     Dennis Zhou <dennis@...nel.org>
Cc:     Roman Gushchin <guro@...com>, linux-mm@...ck.org,
        linux-kernel@...r.kernel.org
Subject: Re: [PATCH v4 0/4] percpu: partial chunk depopulation

Hello Dennis,


On 20/04/21 8:09 pm, Dennis Zhou wrote:
> On Tue, Apr 20, 2021 at 04:37:02PM +0530, Pratik Sampat wrote:
>> On 20/04/21 4:27 am, Dennis Zhou wrote:
>>> On Mon, Apr 19, 2021 at 10:50:43PM +0000, Dennis Zhou wrote:
>>>> Hello,
>>>>
>>>> This series is a continuation of Roman's series in [1]. It aims to address
>>>> chunks holding onto free pages by adding a reclaim process to the percpu
>>>> balance work item.
>>>>
>>>> The main difference is that nr_empty_pop_pages is now managed at isolation
>>>> time instead of being intermixed. This helps with deciding which chunks to
>>>> free, instead of having to interleave returning chunks to active duty.
>>>>
>>>> The allocation priority is as follows:
>>>>     1) appropriate chunk slot increasing until fit
>>>>     2) sidelined chunks
>>>>     3) full free chunks
>>>>
>>>> The last slot, to_depopulate, is never used for allocations.
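
As an aside, here is a minimal userspace sketch of that scan order; the slot
layout, chunk model, and find_chunk() below are illustrative stand-ins, not
the actual mm/percpu.c data structures:

#include <stdio.h>

/* Illustrative only: models the allocation priority, not the kernel code. */
enum { NR_SIZE_SLOTS = 5 };                /* hypothetical size-class slots */
enum {
	SLOT_SIDELINED = NR_SIZE_SLOTS,    /* 2) sidelined chunks */
	SLOT_FREE,                         /* 3) fully free chunks */
	SLOT_TO_DEPOPULATE,                /* last slot, never allocated from */
	NR_SLOTS,
};

struct chunk { int free_bytes; };

static struct chunk *slots[NR_SLOTS];      /* one chunk per slot, for brevity */

static struct chunk *find_chunk(int size, int start_slot)
{
	int s;

	/* 1) walk the size-class slots upward until one fits ... */
	for (s = start_slot; s < NR_SIZE_SLOTS; s++)
		if (slots[s] && slots[s]->free_bytes >= size)
			return slots[s];
	/* 2) ... then try a sidelined chunk, returning it to duty ... */
	if (slots[SLOT_SIDELINED] && slots[SLOT_SIDELINED]->free_bytes >= size)
		return slots[SLOT_SIDELINED];
	/* 3) ... then a fully free chunk; SLOT_TO_DEPOPULATE is skipped. */
	if (slots[SLOT_FREE])
		return slots[SLOT_FREE];
	return NULL;                       /* caller would create a new chunk */
}

int main(void)
{
	struct chunk sidelined = { 4096 }, free_chunk = { 65536 };

	slots[SLOT_SIDELINED] = &sidelined;
	slots[SLOT_FREE] = &free_chunk;
	/* no size-class slot fits, so the sidelined chunk is picked */
	printf("picked chunk with %d free bytes\n",
	       find_chunk(1024, 0)->free_bytes);
	return 0;
}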
>>>>
>>>> A big thanks to Roman for initiating the work and being available for
>>>> iterating on these ideas.
>>>>
>>>> This patchset contains the following 4 patches:
>>>>     0001-percpu-factor-out-pcpu_check_block_hint.patch
>>>>     0002-percpu-use-pcpu_free_slot-instead-of-pcpu_nr_slots-1.patch
>>>>     0003-percpu-implement-partial-chunk-depopulation.patch
>>>>     0004-percpu-use-reclaim-threshold-instead-of-running-for-.patch
>>>>
>>>> 0001 and 0002 are cleanups. 0003 implements partial chunk depopulation,
>>>> originally from Roman. 0004 adds a reclaim threshold so we do not need to
>>>> schedule reclaim for every page freed.
>>>>
>>>> This series is on top of percpu$for-5.14 67c2669d69fb.
>>>>
>>>> diffstats below:
>>>>
>>>> Dennis Zhou (2):
>>>>     percpu: use pcpu_free_slot instead of pcpu_nr_slots - 1
>>>>     percpu: use reclaim threshold instead of running for every page
>>>>
>>>> Roman Gushchin (2):
>>>>     percpu: factor out pcpu_check_block_hint()
>>>>     percpu: implement partial chunk depopulation
>>>>
>>>>    mm/percpu-internal.h |   5 +
>>>>    mm/percpu-km.c       |   5 +
>>>>    mm/percpu-stats.c    |  20 ++--
>>>>    mm/percpu-vm.c       |  30 ++++++
>>>>    mm/percpu.c          | 252 ++++++++++++++++++++++++++++++++++++++-----
>>>>    5 files changed, 278 insertions(+), 34 deletions(-)
>>>>
>>>> Thanks,
>>>> Dennis
>>> Hello Pratik,
>>>
>>> Do you mind testing this series again on POWER9? The base is available
>>> here:
>>> https://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu.git/log/?h=for-5.14
>>>
>>> Thanks,
>>> Dennis
>> Hello Dennis, I have tested this patchset on POWER9.
>>
>> I have tried variations of the percpu_test with top-level and nested cgroup
>> creation, as the test with 1000:10 didn't show any benefits.
> This is most likely because 1 surviving allocation in every 11 still pins
> every page, while 1 in 50 does not. Can you try the patch below on top? I
> think it may show slightly better perf as well. If it doesn't, I'll just
> drop it.

I did try it out, although my test only varied the number of inner cgroup
folders; it didn't seem to show any benefit over the previous test without the
patch in terms of being able to spawn fewer memory cgroup folders and still
see partial memory depopulation.

>> The following example shows more consistent benefits with the de-allocation
>> strategy.
>> Outer: 1000
>> Inner: 50
>> # ./percpu_test.sh
>> Percpu:             6912 kB
>> Percpu:           532736 kB
>> Percpu:           278784 kB
>>
>> I believe it could be a result of bulk freeing within "free_unref_page_commit",
>> where pages are only freed if pcp->count >= pcp->high. As POWER has a larger
>> page size, it would end up creating a smaller number of pages, but with the
>> effects of fragmentation.
> This is unrelated to per cpu pages in slab/slub. Percpu is a separate
> memory allocator.
>
You're right, I was actually referencing incorrect code.

>> Having said that, the patchset and its behavior do look good to me.
> Thanks, can I throw the following on the appropriate patches? In the
> future it's good to be explicit about this because some prefer to credit
> different emails.
>
> Tested-by: Pratik Sampat <psampat@...ux.ibm.com>

Sure thing, please feel free to add a Tested-by wherever you feel appropriate.
I'll be more explicit about them in the future. Thanks!

> Thanks,
> Dennis
>
> The following may do a little better on power9:
> ---
>  From a1464c4d5900cca68fd95b935178d72bb74837d5 Mon Sep 17 00:00:00 2001
> From: Dennis Zhou <dennis@...nel.org>
> Date: Tue, 20 Apr 2021 14:25:20 +0000
> Subject: [PATCH] percpu: convert free page float to bytes
>
> The percpu memory allocator keeps around a minimum number of free pages
> to ensure we can satisfy atomic allocations. However, we've always kept
> this number in terms of pages. On certain architectures like arm and
> powerpc, the default page size could be 64k instead of 4k. So, start
> with a target number of free bytes and then convert to pages.
>
> Signed-off-by: Dennis Zhou <dennis@...nel.org>
> ---
>   mm/percpu.c | 8 +++++---
>   1 file changed, 5 insertions(+), 3 deletions(-)
>
> diff --git a/mm/percpu.c b/mm/percpu.c
> index ba13e683d022..287fe3091244 100644
> --- a/mm/percpu.c
> +++ b/mm/percpu.c
> @@ -80,6 +80,7 @@
>   #include <linux/mutex.h>
>   #include <linux/percpu.h>
>   #include <linux/pfn.h>
> +#include <linux/sizes.h>
>   #include <linux/slab.h>
>   #include <linux/spinlock.h>
>   #include <linux/vmalloc.h>
> @@ -107,11 +108,12 @@
>   /* chunks in slots below this are subject to being sidelined on failed alloc */
>   #define PCPU_SLOT_FAIL_THRESHOLD	3
>   
> -#define PCPU_EMPTY_POP_PAGES_LOW	2
> -#define PCPU_EMPTY_POP_PAGES_HIGH	4
> +#define PCPU_EMPTY_POP_PAGES_LOW	(max_t(int, (SZ_8K) / PAGE_SIZE, 1))
> +#define PCPU_EMPTY_POP_PAGES_HIGH	(max_t(int, (SZ_16K) / PAGE_SIZE, \
> +					       PCPU_EMPTY_POP_PAGES_LOW + 1))
>   
>   /* only schedule reclaim if there are at least N empty pop pages sidelined */
> -#define PCPU_EMPTY_POP_RECLAIM_THRESHOLD	4
> +#define PCPU_EMPTY_POP_RECLAIM_THRESHOLD	(PCPU_EMPTY_POP_PAGES_HIGH)
>   
>   #ifdef CONFIG_SMP
>   /* default addr <-> pcpu_ptr mapping, override in asm/percpu.h if necessary */

Thanks,
Pratik
