Message-ID: <CALvZod44ZLA8U=ormvuKZhJ1vCJf8qOHMRSouih4E-oaLihV=Q@mail.gmail.com>
Date:   Thu, 3 Dec 2020 14:11:49 -0800
From:   Shakeel Butt <shakeelb@...gle.com>
To:     Mike Kravetz <mike.kravetz@...cle.com>
Cc:     Linux MM <linux-mm@...ck.org>, LKML <linux-kernel@...r.kernel.org>,
        Cgroups <cgroups@...r.kernel.org>,
        Mina Almasry <almasrymina@...gle.com>,
        David Rientjes <rientjes@...gle.com>,
        Greg Thelen <gthelen@...gle.com>,
        Sandipan Das <sandipan@...ux.ibm.com>,
        Shuah Khan <shuah@...nel.org>,
        Adrian Moreno <amorenoz@...hat.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        stable@...r.kernel.org
Subject: Re: [PATCH] hugetlb_cgroup: fix offline of hugetlb cgroup with reservations

On Thu, Dec 3, 2020 at 2:04 PM Mike Kravetz <mike.kravetz@...cle.com> wrote:
>
> Adrian Moreno was running a Kubernetes 1.19 + containerd/docker workload
> using hugetlbfs.  In this environment the issue is reproduced by:
> 1 - Start a simple pod that uses the recently added HugePages medium
>     feature (pod yaml attached)
> 2 - Start a DPDK app. It doesn't need to run successfully (as in transfer
>     packets) nor interact with real hardware. It seems just initializing
>     the EAL layer (which handles hugepage reservation and locking) is
>     enough to trigger the issue
> 3 - Delete the Pod (or let it "Complete").
>
> This would result in a kworker thread going into a tight loop (top output):
>  1425 root      20   0       0      0      0 R  99.7   0.0   5:22.45
> kworker/28:7+cgroup_destroy
>
> 'perf top -g' reports:
> -   63.28%     0.01%  [kernel]                    [k] worker_thread
>    - 49.97% worker_thread
>       - 52.64% process_one_work
>          - 62.08% css_killed_work_fn
>             - hugetlb_cgroup_css_offline
>                  41.52% _raw_spin_lock
>                - 2.82% _cond_resched
>                     rcu_all_qs
>                  2.66% PageHuge
>       - 0.57% schedule
>          - 0.57% __schedule
>
> We are spinning in the do-while loop in hugetlb_cgroup_css_offline.
> Worse yet, we are holding the master cgroup lock (cgroup_mutex) while
> infinitely spinning.  Little else can be done on the system as the
> cgroup_mutex cannot be acquired.
>
> Do note that the issue can be reproduced by simply offlining a hugetlb
> cgroup containing pages with reservation counts.
>
> The loop in hugetlb_cgroup_css_offline is moving page counts from the
> cgroup being offlined to the parent cgroup.  This is done for each hstate,
> and is repeated until hugetlb_cgroup_have_usage returns false.  The routine
> moving counts (hugetlb_cgroup_move_parent) is only moving 'usage' counts.
> The routine hugetlb_cgroup_have_usage is checking for both 'usage' and
> 'reservation' counts.  What to do with reservation counts when
> reparenting was discussed here:
>
> https://lore.kernel.org/linux-kselftest/CAHS8izMFAYTgxym-Hzb_JmkTK1N_S9tGN71uS6MFV+R7swYu5A@mail.gmail.com/
>
> The decision was made to leave a zombie cgroup for cgroups with
> reservation counts.  Unfortunately, the code checking reservation counts
> was incorrectly added to hugetlb_cgroup_have_usage.
>
> To fix the issue, simply remove the check for reservation counts.  While
> fixing this issue, a related bug in hugetlb_cgroup_css_offline was noticed.
> The hstate index is not reinitialized each time through the do-while loop.
> Fix this as well.
>
> Fixes: 1adc4d419aa2 ("hugetlb_cgroup: add interface for charge/uncharge hugetlb reservations")
> Cc: <stable@...r.kernel.org>
> Reported-by: Adrian Moreno <amorenoz@...hat.com>
> Tested-by: Adrian Moreno <amorenoz@...hat.com>
> Signed-off-by: Mike Kravetz <mike.kravetz@...cle.com>

Reviewed-by: Shakeel Butt <shakeelb@...gle.com>
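
For readers of the archive, the loop behaviour described in the quoted
commit message can be illustrated with a small user-space model.  This is
only a sketch and not the kernel code: the struct, the *_sketch function
names, NR_HSTATES and MAX_PASSES are all hypothetical, and the pass cap
exists only so that the pre-fix variant terminates here instead of
spinning forever.

```c
#include <stdbool.h>

#define NR_HSTATES 2     /* hypothetical number of huge page sizes */
#define MAX_PASSES 1000  /* cap standing in for "spins forever" */

/* Hypothetical per-cgroup counters, one slot per hstate. */
struct counters {
	long usage[NR_HSTATES];  /* page charges: moved to the parent */
	long rsvd[NR_HSTATES];   /* reservation charges: left in place */
};

/*
 * Model of the "does this cgroup still have usage?" check.
 * check_rsvd == true mimics the pre-fix behaviour, where reservation
 * counts were also considered.
 */
static bool have_usage_sketch(const struct counters *c, bool check_rsvd)
{
	for (int idx = 0; idx < NR_HSTATES; idx++) {
		if (c->usage[idx] > 0)
			return true;
		if (check_rsvd && c->rsvd[idx] > 0)
			return true;
	}
	return false;
}

/* Model of reparenting: only 'usage' is moved, never 'rsvd'. */
static void move_usage_to_parent_sketch(struct counters *child,
					struct counters *parent, int idx)
{
	parent->usage[idx] += child->usage[idx];
	child->usage[idx] = 0;
}

/*
 * Model of the offline loop.  Returns the number of passes taken, or -1
 * if MAX_PASSES was reached (the stand-in for the infinite loop).
 */
static int offline_sketch(struct counters *child, struct counters *parent,
			  bool check_rsvd)
{
	int passes = 0;

	do {
		if (passes == MAX_PASSES)
			return -1;
		/* The hstate index restarts at 0 on every pass. */
		for (int idx = 0; idx < NR_HSTATES; idx++)
			move_usage_to_parent_sketch(child, parent, idx);
		passes++;
	} while (have_usage_sketch(child, check_rsvd));

	return passes;
}
```

In this model, a child with usage {4, 2} and rsvd {1, 0} never satisfies
the exit condition when check_rsvd is true, because nothing in the loop
ever reduces rsvd, so offline_sketch() returns -1.  With check_rsvd set to
false the loop exits after one pass, and the reservation count simply
stays behind on the offlined (zombie) cgroup.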
