Date:	Fri, 11 Mar 2016 18:00:23 +0100
From:	Michal Hocko <mhocko@...nel.org>
To:	Tetsuo Handa <penguin-kernel@...ove.SAKURA.ne.jp>
Cc:	akpm@...ux-foundation.org, torvalds@...ux-foundation.org,
	hannes@...xchg.org, mgorman@...e.de, rientjes@...gle.com,
	hillf.zj@...baba-inc.com, kamezawa.hiroyu@...fujitsu.com,
	linux-mm@...ck.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH 0/3] OOM detection rework v4

On Sat 12-03-16 01:49:26, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Fri 11-03-16 22:32:02, Tetsuo Handa wrote:
> > > Michal Hocko wrote:
> > > > On Fri 11-03-16 19:45:29, Tetsuo Handa wrote:
> > > > > (Posting as a reply to this thread.)
> > > > 
> > > > I really do not see how this is related to this thread.
> > > 
> > > All allocating tasks are looping at
> > > 
> > >                         /*
> > >                          * If we didn't make any progress and have a lot of
> > >                          * dirty + writeback pages then we should wait for
> > >                          * an IO to complete to slow down the reclaim and
> > >                          * prevent from pre mature OOM
> > >                          */
> > >                         if (!did_some_progress && 2*(writeback + dirty) > reclaimable) {
> > >                                 congestion_wait(BLK_RW_ASYNC, HZ/10);
> > >                                 return true;
> > >                         }
> > > 
> > > in should_reclaim_retry().
> > > 
> > > should_reclaim_retry() was added by the OOM detection rework, wasn't it?
> > 
> > What happens without this patch applied? In other words, it all smells
> > like the IO got stuck somewhere and the direct reclaim cannot perform it,
> > so we have to wait for the flushers to make progress for us. Are those
> > stuck? Is the IO making any progress at all, or is it just too slow and
> > would it actually finish? Wouldn't we just wait somewhere else in the
> > direct reclaim path instead?
> 
> As of next-20160311, CPU usage becomes 0% when this problem occurs.
> 
> If I remove
> 
>   mm-use-watermak-checks-for-__gfp_repeat-high-order-allocations-checkpatch-fixes
>   mm: use watermark checks for __GFP_REPEAT high order allocations
>   mm: throttle on IO only when there are too many dirty and writeback pages
>   mm-oom-rework-oom-detection-checkpatch-fixes
>   mm, oom: rework oom detection
> 
> then CPU usage becomes 60% and most of the allocating tasks
> are looping at
> 
>         /*
>          * Acquire the oom lock.  If that fails, somebody else is
>          * making progress for us.
>          */
>         if (!mutex_trylock(&oom_lock)) {
>                 *did_some_progress = 1;
>                 schedule_timeout_uninterruptible(1);
>                 return NULL;
>         }
> 
> in __alloc_pages_may_oom() (i.e. an OOM livelock because the OOM reaper is disabled).
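The quoted back-off path can be modeled in user space with POSIX threads. This is a minimal sketch, not kernel code: `oom_lock` here is a hypothetical pthread mutex standing in for the kernel mutex, `try_oom_lock()` is a hypothetical helper, and the one-jiffy sleep is approximated as 1 ms (assuming HZ=1000). It shows why a task that loses the trylock reports progress and sleeps, so that its caller retries instead of failing the allocation.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical user-space stand-in for the kernel's oom_lock. */
static pthread_mutex_t oom_lock = PTHREAD_MUTEX_INITIALIZER;

/*
 * Models the quoted pattern: if another holder owns the lock, claim
 * progress, sleep for roughly one tick and let the caller retry.
 * Returns true only if this caller acquired the lock and may go on
 * to select an OOM victim (not modeled here).
 */
static bool try_oom_lock(int *did_some_progress)
{
    if (pthread_mutex_trylock(&oom_lock) != 0) {
        /* Somebody else is "making progress for us". */
        *did_some_progress = 1;
        /* Stand-in for schedule_timeout_uninterruptible(1): ~1 ms, assuming HZ=1000. */
        struct timespec ts = { .tv_sec = 0, .tv_nsec = 1000000L };
        nanosleep(&ts, NULL);
        return false;
    }
    return true;
}
```

If the lock holder never finishes (for example because the OOM victim cannot exit), every other caller keeps cycling through this short sleep and retry, which matches the busy looping described above.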

OK, that would suggest that the OOM rework patches are not really
related. They just turned the livelock into a sleep, which is good in
general IMHO. We even know that the IO is most probably the problem,
because the throttling condition 2*(writeback + dirty) > reclaimable
means that more than half of the reclaimable memory is either dirty or
under writeback. That is where you should be looking: why is the IO not
making any progress, or making such slow progress?
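To make the arithmetic of the throttling check explicit: `2*(writeback + dirty) > reclaimable` holds exactly when more than half of the reclaimable pages are dirty or under writeback. The sketch below is a hypothetical user-space model, assuming plain page counts as inputs; `should_throttle_on_io()` is an illustrative name and not the signature of the kernel's `should_reclaim_retry()`, which reads its counters internally.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical model of the quoted condition in should_reclaim_retry().
 * All counts are in pages. Returns true when the allocator would call
 * congestion_wait() and retry instead of declaring OOM.
 */
static bool should_throttle_on_io(bool did_some_progress,
                                  unsigned long writeback,
                                  unsigned long dirty,
                                  unsigned long reclaimable)
{
    /*
     * 2*(writeback + dirty) > reclaimable is the same as
     * (writeback + dirty) > reclaimable / 2 (exact division), i.e.
     * more than half of the reclaimable pages are dirty or under writeback.
     */
    return !did_some_progress && 2 * (writeback + dirty) > reclaimable;
}
```

For example, with 1000 reclaimable pages, 250 dirty plus 250 under writeback gives 2*500 = 1000, which is not greater than 1000, so there is no throttling; one more dirty page (2*501 = 1002) tips it over. If reclaim made any progress, the check does not throttle at all.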

-- 
Michal Hocko
SUSE Labs
