Date:   Mon, 27 Feb 2023 17:08:30 +0200
From:   Mike Rapoport <rppt@...nel.org>
To:     Qi Zheng <zhengqi.arch@...edance.com>
Cc:     Andrew Morton <akpm@...ux-foundation.org>, tkhai@...ru,
        hannes@...xchg.org, shakeelb@...gle.com, mhocko@...nel.org,
        roman.gushchin@...ux.dev, muchun.song@...ux.dev, david@...hat.com,
        shy828301@...il.com, sultan@...neltoast.com, dave@...olabs.net,
        penguin-kernel@...ove.sakura.ne.jp, paulmck@...nel.org,
        linux-mm@...ck.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v3 0/8] make slab shrink lockless

Hi,

On Mon, Feb 27, 2023 at 09:31:51PM +0800, Qi Zheng wrote:
> 
> 
> On 2023/2/27 03:51, Andrew Morton wrote:
> > On Sun, 26 Feb 2023 22:46:47 +0800 Qi Zheng <zhengqi.arch@...edance.com> wrote:
> > 
> > > Hi all,
> > > 
> > > This patch series aims to make slab shrink lockless.
> > 
> > What an awesome changelog.
> > 
> > > 2. Survey
> > > =========
> > 
> > Especially this part.
> > 
> > Looking through all the prior efforts and at this patchset I am not
> > immediately seeing any statements about the overall effect upon
> > real-world workloads.  For a good example, does this patchset
> > measurably improve throughput or energy consumption on your servers?
> 
> Hi Andrew,
> 
> I re-tested with the following physical machines:
> 
> Architecture:        x86_64
> CPU(s):              96
> On-line CPU(s) list: 0-95
> Model name:          Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz
> 
> I found that the explanation I gave in the cover letter for the hotspot
> is wrong. The down_read_trylock() hotspot is not caused by the trylock
> failing, but simply by the atomic operation (cmpxchg) itself, which
> leads to a significant reduction in IPC (insn per cycle).

... 
 
> Then we can use the following perf command to view hotspots:
> 
> perf top -U -F 999
> 
> 1) Before applying this patchset:
> 
>   32.31%  [kernel]           [k] down_read_trylock
>   19.40%  [kernel]           [k] pv_native_safe_halt
>   16.24%  [kernel]           [k] up_read
>   15.70%  [kernel]           [k] shrink_slab
>    4.69%  [kernel]           [k] _find_next_bit
>    2.62%  [kernel]           [k] shrink_node
>    1.78%  [kernel]           [k] shrink_lruvec
>    0.76%  [kernel]           [k] do_shrink_slab
> 
> 2) After applying this patchset:
> 
>   27.83%  [kernel]           [k] _find_next_bit
>   16.97%  [kernel]           [k] shrink_slab
>   15.82%  [kernel]           [k] pv_native_safe_halt
>    9.58%  [kernel]           [k] shrink_node
>    8.31%  [kernel]           [k] shrink_lruvec
>    5.64%  [kernel]           [k] do_shrink_slab
>    3.88%  [kernel]           [k] mem_cgroup_iter
> 
> 2. At the same time, we use the following perf command to capture IPC
> information:
> 
> perf stat -e cycles,instructions -G test -a --repeat 5 -- sleep 10
> 
> 1) Before applying this patchset:
> 
>  Performance counter stats for 'system wide' (5 runs):
> 
>       454187219766      cycles                    test                    ( +-  1.84% )
>        78896433101      instructions              test #    0.17  insn per cycle           ( +-  0.44% )
> 
>         10.0020430 +- 0.0000366 seconds time elapsed  ( +-  0.00% )
> 
> 2) After applying this patchset:
> 
>  Performance counter stats for 'system wide' (5 runs):
> 
>       841954709443      cycles                    test                    ( +- 15.80% )  (98.69%)
>       527258677936      instructions              test #    0.63  insn per cycle           ( +- 15.11% )  (98.68%)
> 
>           10.01064 +- 0.00831 seconds time elapsed  ( +-  0.08% )
> 
> We can see that IPC drops very seriously when calling
> down_read_trylock() at high frequency. After using SRCU,
> the IPC is at a normal level.

The results you present do show an improvement in IPC for an artificial test
script. But it would be more interesting to see how real-world workloads
benefit from your changes.
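
For reference, the "insn per cycle" figures in the perf stat output quoted
above are just the instruction count divided by the cycle count. A minimal
sketch that redoes the arithmetic from the raw counters follows; the names
BEFORE, AFTER and ipc() are made up for illustration and are not part of perf
or of the patchset.

```python
# Raw counter values copied from the perf stat output quoted above.
BEFORE = {"cycles": 454187219766, "instructions": 78896433101}
AFTER = {"cycles": 841954709443, "instructions": 527258677936}


def ipc(counters):
    """Instructions per cycle: retired instructions divided by elapsed cycles."""
    return counters["instructions"] / counters["cycles"]


ipc_before = ipc(BEFORE)
ipc_after = ipc(AFTER)

# Both values are printed with two decimals, matching perf's own formatting.
print(f"before: {ipc_before:.2f} insn per cycle")
print(f"after:  {ipc_after:.2f} insn per cycle")

# How many more instructions were retired in the same 10 second window.
insn_ratio = AFTER["instructions"] / BEFORE["instructions"]
print(f"instructions retired: {insn_ratio:.1f}x")
```

This only confirms that the quoted numbers are self-consistent for the test
cgroup. It says nothing about how a real workload would behave.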
 
> Thanks,
> Qi

-- 
Sincerely yours,
Mike.
