linux-kernel - Re: [PATCH 0/10] psi: pressure stall information for CPU, memory, and IO v2


Message-ID: <20180717112515.GE7193@dhcp22.suse.cz>
Date:   Tue, 17 Jul 2018 13:25:15 +0200
From:   Michal Hocko <mhocko@...nel.org>
To:     Daniel Drake <drake@...lessm.com>
Cc:     hannes@...xchg.org, linux-kernel@...r.kernel.org,
        linux-mm@...ck.org, cgroups@...r.kernel.org, linux@...lessm.com,
        linux-block@...r.kernel.org, Ingo Molnar <mingo@...hat.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Andrew Morton <akpm@...uxfoundation.org>,
        Tejun Heo <tj@...nel.org>,
        Balbir Singh <bsingharora@...il.com>,
        Mike Galbraith <efault@....de>,
        Oliver Yang <yangoliver@...com>,
        Shakeel Butt <shakeelb@...gle.com>,
        xxx xxx <x.qendo@...il.com>,
        Taras Kondratiuk <takondra@...co.com>,
        Daniel Walker <danielwa@...co.com>,
        Vinayak Menon <vinmenon@...eaurora.org>,
        Ruslan Ruslichenko <rruslich@...co.com>, kernel-team@...com
Subject: Re: [PATCH 0/10] psi: pressure stall information for CPU, memory,
 and IO v2

On Mon 16-07-18 10:57:45, Daniel Drake wrote:
> Hi Johannes,
> 
> Thanks for your work on psi! 
> 
> We have also been investigating the "thrashing problem" on our Endless
> desktop OS. We have seen that systems can easily get into a state where the
> UI becomes unresponsive to input, and the mouse cursor becomes extremely
> slow or stuck when the system is running out of memory. We are working with
> a full GNOME desktop environment on systems with only 2GB RAM, and
> sometimes no real swap (although zram-swap helps mitigate the problem to
> some extent).
> 
> My analysis so far indicates that when the system is low on memory and hits
> this condition, the system is spending much of the time under
> __alloc_pages_direct_reclaim. "perf trace -F" shows many many page faults
> in executable code while this is going on. I believe the kernel is
> swapping out executable code in order to satisfy memory allocation
> requests, but then that swapped-out code is needed a moment later so it
> gets swapped in again via the page fault handler, and all this activity
> severely starves the system from being able to respond to user input.
> 
> I appreciate the kernel's attempt to keep processes alive, but in the
> desktop case we see that the system rarely recovers from this situation,
> so you have to hard shutdown. In this case we view it as desirable that
> the OOM killer would step in (it is not doing so because direct reclaim
> is not actually failing).

Yes, this is really unfortunate. One thing that could help would be to
consider a thrashing level during reclaim (get_scan_count) and simply
skip LRUs which are constantly refaulting pages back. We already have
the infrastructure for that. We just need to plumb it in.
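
To make "constantly refaulting" concrete, here is a rough userspace
sketch that estimates a refault rate from two /proc/vmstat samples. It
only observes the condition; it is not the get_scan_count change
itself. It assumes the kernel exports workingset_refault counters in
/proc/vmstat (newer kernels split this into _anon and _file variants,
which the prefix match below also sums). The function names are
hypothetical, chosen for illustration.

```python
def parse_refaults(vmstat_text):
    """Sum every workingset_refault* counter in /proc/vmstat-style text."""
    total = 0
    for line in vmstat_text.splitlines():
        parts = line.split()
        # Each /proc/vmstat line is "<name> <value>"; skip anything else.
        if len(parts) == 2 and parts[0].startswith("workingset_refault"):
            total += int(parts[1])
    return total


def refault_rate(before_text, after_text, interval_s):
    """Refaults per second between two samples taken interval_s apart."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    delta = parse_refaults(after_text) - parse_refaults(before_text)
    return delta / interval_s


def read_vmstat(path="/proc/vmstat"):
    """Return the raw contents of /proc/vmstat (Linux only)."""
    with open(path) as f:
        return f.read()
```

Calling read_vmstat() twice, a few seconds apart, and passing both
results to refault_rate() gives a number that should climb sharply
while the desktop is stuck. What rate counts as thrashing is workload
dependent and would have to be found by experiment.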
-- 
Michal Hocko
SUSE Labs