[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20180717112515.GE7193@dhcp22.suse.cz>
Date: Tue, 17 Jul 2018 13:25:15 +0200
From: Michal Hocko <mhocko@...nel.org>
To: Daniel Drake <drake@...lessm.com>
Cc: hannes@...xchg.org, linux-kernel@...r.kernel.org,
linux-mm@...ck.org, cgroups@...r.kernel.org, linux@...lessm.com,
linux-block@...r.kernel.org, Ingo Molnar <mingo@...hat.com>,
Peter Zijlstra <peterz@...radead.org>,
Andrew Morton <akpm@...uxfoundation.org>,
Tejun Heo <tj@...nel.org>,
Balbir Singh <bsingharora@...il.com>,
Mike Galbraith <efault@....de>,
Oliver Yang <yangoliver@...com>,
Shakeel Butt <shakeelb@...gle.com>,
xxx xxx <x.qendo@...il.com>,
Taras Kondratiuk <takondra@...co.com>,
Daniel Walker <danielwa@...co.com>,
Vinayak Menon <vinmenon@...eaurora.org>,
Ruslan Ruslichenko <rruslich@...co.com>, kernel-team@...com
Subject: Re: [PATCH 0/10] psi: pressure stall information for CPU, memory,
and IO v2
On Mon 16-07-18 10:57:45, Daniel Drake wrote:
> Hi Johannes,
>
> Thanks for your work on psi!
>
> We have also been investigating the "thrashing problem" on our Endless
> desktop OS. We have seen that systems can easily get into a state where the
> UI becomes unresponsive to input, and the mouse cursor becomes extremely
> slow or stuck when the system is running out of memory. We are working with
> a full GNOME desktop environment on systems with only 2GB RAM, and
> sometimes no real swap (although zram-swap helps mitigate the problem to
> some extent).
>
> My analysis so far indicates that when the system is low on memory and hits
> this condition, the system is spending much of the time under
> __alloc_pages_direct_reclaim. "perf trace -F" shows many many page faults
> in executable code while this is going on. I believe the kernel is
> swapping out executable code in order to satisfy memory allocation
> requests, but then that swapped-out code is needed a moment later so it
> gets swapped in again via the page fault handler, and all this activity
> severely starves the system from being able to respond to user input.
>
> I appreciate the kernel's attempt to keep processes alive, but in the
> desktop case we see that the system rarely recovers from this situation,
> so you have to hard shutdown. In this case we view it as desirable that
> the OOM killer would step in (it is not doing so because direct reclaim
> is not actually failing).
Yes this is really unfortunate. One thing that could help would be to
consider a trashing level during the reclaim (get_scan_count) to simply
forget about LRUs which are constantly refaulting pages back. We already
have the infrastructure for that. We just need to plumb it in.
--
Michal Hocko
SUSE Labs
Powered by blists - more mailing lists