Message-ID: <ZtDFQHGHMq6TfbKA@pc636>
Date: Thu, 29 Aug 2024 21:00:16 +0200
From: Uladzislau Rezki <urezki@...il.com>
To: Adrian Huang <adrianhuang0701@...il.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>,
Uladzislau Rezki <urezki@...il.com>,
Christoph Hellwig <hch@...radead.org>, linux-mm@...ck.org,
linux-kernel@...r.kernel.org, Adrian Huang <ahuang12@...ovo.com>
Subject: Re: [PATCH 1/1] mm: vmalloc: Optimize vmap_lazy_nr arithmetic when
purging each vmap_area
On Thu, Aug 29, 2024 at 09:06:33PM +0800, Adrian Huang wrote:
> From: Adrian Huang <ahuang12@...ovo.com>
>
> When running the vmalloc stress test on a 448-core system, the average
> latency of purge_vmap_node() is observed to be about 2 seconds, as
> measured with the eBPF/bcc 'funclatency.py' tool [1].
>
> # /your-git-repo/bcc/tools/funclatency.py -u purge_vmap_node & pid1=$! && sleep 8 && modprobe test_vmalloc nr_threads=$(nproc) run_test_mask=0x7; kill -SIGINT $pid1
>
> usecs : count distribution
> 0 -> 1 : 0 | |
> 2 -> 3 : 29 | |
> 4 -> 7 : 19 | |
> 8 -> 15 : 56 | |
> 16 -> 31 : 483 |**** |
> 32 -> 63 : 1548 |************ |
> 64 -> 127 : 2634 |********************* |
> 128 -> 255 : 2535 |********************* |
> 256 -> 511 : 1776 |************** |
> 512 -> 1023 : 1015 |******** |
> 1024 -> 2047 : 573 |**** |
> 2048 -> 4095 : 488 |**** |
> 4096 -> 8191 : 1091 |********* |
> 8192 -> 16383 : 3078 |************************* |
> 16384 -> 32767 : 4821 |****************************************|
> 32768 -> 65535 : 3318 |*************************** |
> 65536 -> 131071 : 1718 |************** |
> 131072 -> 262143 : 2220 |****************** |
> 262144 -> 524287 : 1147 |********* |
> 524288 -> 1048575 : 1179 |********* |
> 1048576 -> 2097151 : 822 |****** |
> 2097152 -> 4194303 : 906 |******* |
> 4194304 -> 8388607 : 2148 |***************** |
> 8388608 -> 16777215 : 4497 |************************************* |
> 16777216 -> 33554431 : 289 |** |
>
> avg = 2041714 usecs, total: 78381401772 usecs, count: 38390
>
> The worst cases fall into the 16-33 second range, so a soft lockup is
> triggered [2].
>
> [Root Cause]
> 1) Each purge_list is long. The following shows the number of
> vmap_area structures purged per vmap_node.
>
> crash> p vmap_nodes
> vmap_nodes = $27 = (struct vmap_node *) 0xff2de5a900100000
> crash> vmap_node 0xff2de5a900100000 128 | grep nr_purged
> nr_purged = 663070
> ...
> nr_purged = 821670
> nr_purged = 692214
> nr_purged = 726808
> ...
>
> 2) atomic_long_sub() employs the 'lock' prefix to make the update
> atomic when purging each vmap_area. However, the loop iterates over
> more than 600,000 vmap_areas (see 'nr_purged' above), so the locked
> instruction is executed once per purged vmap_area.
>
> Here is objdump output:
>
> $ objdump -D vmlinux
> ffffffff813e8c80 <purge_vmap_node>:
> ...
> ffffffff813e8d70: f0 48 29 2d 68 0c bb lock sub %rbp,0x2bb0c68(%rip)
> ...
>
> Quote from "Instruction tables" pdf file [3]:
> Instructions with a LOCK prefix have a long latency that depends on
> cache organization and possibly RAM speed. If there are multiple
> processors or cores or direct memory access (DMA) devices, then all
> locked instructions will lock a cache line for exclusive access,
> which may involve RAM access. A LOCK prefix typically costs more
> than a hundred clock cycles, even on single-processor systems.
>
> That's why the latency of purge_vmap_node() increases dramatically on
> a many-core system: one core is busy purging each vmap_area of the
> *long* purge_list and executing atomic_long_sub() for each of them,
> while the other cores free vmalloc allocations and execute
> atomic_long_add_return() in free_vmap_area_noflush(), so all of them
> contend on the same vmap_lazy_nr cache line.
>
> [Solution]
> Accumulate the number of purged pages in a local variable, and execute
> a single atomic_long_sub() once the traversal of the purge_list is
> done. The experimental results show a latency improvement of about
> 99%.
>
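> To see the cost difference outside the kernel, here is a minimal
> user-space sketch (not kernel code; the thread count, iteration count
> and timing helper are arbitrary illustration values). It compares a
> locked subtraction per item against a single batched subtraction,
> while other threads keep the counter's cache line contended:
>
> <snip>
> /* Build: gcc -O0 -pthread batch_demo.c -o batch_demo
>  * (-O0 so the compiler does not fold the local accumulation loop.)
>  */
> #include <pthread.h>
> #include <stdatomic.h>
> #include <stdio.h>
> #include <time.h>
>
> #define NR_ADDERS 8       /* stand-in for cores doing vmalloc frees */
> #define NR_ITEMS  600000  /* roughly the nr_purged magnitude above */
>
> static atomic_long counter;
> static atomic_int done;
>
> /* Simulates other cores hitting the shared counter with locked adds. */
> static void *adder(void *arg)
> {
> 	(void)arg;
> 	while (!atomic_load(&done))
> 		atomic_fetch_add(&counter, 1);
> 	return NULL;
> }
>
> static long elapsed_ns(struct timespec *a, struct timespec *b)
> {
> 	return (b->tv_sec - a->tv_sec) * 1000000000L +
> 	       (b->tv_nsec - a->tv_nsec);
> }
>
> int main(void)
> {
> 	pthread_t tid[NR_ADDERS];
> 	struct timespec t0, t1;
> 	long i, local = 0;
>
> 	for (i = 0; i < NR_ADDERS; i++)
> 		pthread_create(&tid[i], NULL, adder, NULL);
>
> 	/* Pattern 1: one locked instruction per purged item. */
> 	clock_gettime(CLOCK_MONOTONIC, &t0);
> 	for (i = 0; i < NR_ITEMS; i++)
> 		atomic_fetch_sub(&counter, 1);
> 	clock_gettime(CLOCK_MONOTONIC, &t1);
> 	printf("per-item atomics: %ld ns\n", elapsed_ns(&t0, &t1));
>
> 	/* Pattern 2: accumulate locally, one locked instruction total. */
> 	clock_gettime(CLOCK_MONOTONIC, &t0);
> 	for (i = 0; i < NR_ITEMS; i++)
> 		local += 1;
> 	atomic_fetch_sub(&counter, local);
> 	clock_gettime(CLOCK_MONOTONIC, &t1);
> 	printf("batched atomic:   %ld ns\n", elapsed_ns(&t0, &t1));
>
> 	atomic_store(&done, 1);
> 	for (i = 0; i < NR_ADDERS; i++)
> 		pthread_join(tid[i], NULL);
> 	return 0;
> }
> <snip>
>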
> [Experiment Result]
> 1) System Configuration: Three servers (with HT enabled) are tested.
> * 72-core server: 3rd Gen Intel Xeon Scalable Processor*1
> * 192-core server: 5th Gen Intel Xeon Scalable Processor*2
> * 448-core server: AMD Zen 4 Processor*2
>
> 2) Kernel Config
> * CONFIG_KASAN is disabled
>
> 3) The data in the "w/o patch" and "w/ patch" columns
> * Unit: microseconds (us)
> * Each value is the average of 3 measurements
>
> System w/o patch (us) w/ patch (us) Improvement (%)
> --------------- -------------- ------------- -------------
> 72-core server 2194 14 99.36%
> 192-core server 143799 1139 99.21%
> 448-core server 1992122 6883 99.65%
>
> [1] https://github.com/iovisor/bcc/blob/master/tools/funclatency.py
> [2] https://gist.github.com/AdrianHuang/37c15f67b45407b83c2d32f918656c12
> [3] https://www.agner.org/optimize/instruction_tables.pdf
>
> Signed-off-by: Adrian Huang <ahuang12@...ovo.com>
> ---
> mm/vmalloc.c | 5 ++++-
> 1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 3f9b6bd707d2..607697c81e60 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -2210,6 +2210,7 @@ static void purge_vmap_node(struct work_struct *work)
> {
> struct vmap_node *vn = container_of(work,
> struct vmap_node, purge_work);
> + unsigned long nr_purged_pages = 0;
> struct vmap_area *va, *n_va;
> LIST_HEAD(local_list);
>
> @@ -2224,7 +2225,7 @@ static void purge_vmap_node(struct work_struct *work)
>
> list_del_init(&va->list);
>
> - atomic_long_sub(nr, &vmap_lazy_nr);
> + nr_purged_pages += nr;
> vn->nr_purged++;
>
> if (is_vn_id_valid(vn_id) && !vn->skip_populate)
> @@ -2235,6 +2236,8 @@ static void purge_vmap_node(struct work_struct *work)
> list_add(&va->list, &local_list);
> }
>
> + atomic_long_sub(nr_purged_pages, &vmap_lazy_nr);
> +
> reclaim_list_global(&local_list);
> }
>
> --
> 2.34.1
>
I see the point and it looks good to me.

Reviewed-by: Uladzislau Rezki (Sony) <urezki@...il.com>

Thank you for improving this. There is one more spot which I detected
earlier:
<snip>
static void free_vmap_area_noflush(struct vmap_area *va)
{
	unsigned long nr_lazy_max = lazy_max_pages();
	unsigned long va_start = va->va_start;
	unsigned int vn_id = decode_vn_id(va->flags);
	struct vmap_node *vn;
	unsigned long nr_lazy;

	if (WARN_ON_ONCE(!list_empty(&va->list)))
		return;

	nr_lazy = atomic_long_add_return((va->va_end - va->va_start) >>
				PAGE_SHIFT, &vmap_lazy_nr);
	...
<snip>
atomic_long_add_return() might also introduce high contention. We could
optimize it by splitting it into lighter atomic operations. Can you
check it on your 448-core system?
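One way to read "lighter atomics" is to move the hot update onto a
per-CPU slot, for example with a percpu_counter. Below is a rough,
untested sketch of that idea only: vmap_lazy_nr_pcp and the purge
trigger are placeholders, and percpu_counter reads are approximate,
which a "should we purge yet?" heuristic should be able to tolerate.

<snip>
/* Needs <linux/percpu_counter.h>; set up with percpu_counter_init(). */
static struct percpu_counter vmap_lazy_nr_pcp;

static void free_vmap_area_noflush(struct vmap_area *va)
{
	unsigned long nr_lazy_max = lazy_max_pages();
	unsigned long nr_lazy;
	...
	/* Usually a this-CPU update only; no contended cache line. */
	percpu_counter_add(&vmap_lazy_nr_pcp,
			   (va->va_end - va->va_start) >> PAGE_SHIFT);

	/* An approximate sum is fine for a purge-trigger heuristic. */
	nr_lazy = percpu_counter_read_positive(&vmap_lazy_nr_pcp);
	if (nr_lazy > nr_lazy_max)
		/* placeholder: kick the purge path as today */
		...
}
<snip>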

Thanks!
--
Uladzislau Rezki