Date:   Tue, 6 Dec 2016 09:32:01 +0100
From:   Donald Buczek <buczek@...gen.mpg.de>
To:     Michal Hocko <mhocko@...nel.org>
Cc:     Paul Menzel <pmenzel@...gen.mpg.de>, dvteam@...gen.mpg.de,
        linux-mm@...ck.org, linux-kernel@...r.kernel.org,
        Josh Triplett <josh@...htriplett.org>,
        "Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>,
        Peter Zijlstra <peterz@...radead.org>
Subject: Re: INFO: rcu_sched detected stalls on CPUs/tasks with `kswapd` and
 `mem_cgroup_shrink_node`

On 12/02/16 10:14, Donald Buczek wrote:
> On 11/30/16 12:43, Donald Buczek wrote:
>> On 11/30/16 12:09, Michal Hocko wrote:
>>> [CCing Paul]
>>>
>>> On Wed 30-11-16 11:28:34, Donald Buczek wrote:
>>> [...]
>>>> shrink_active_list gets and releases the spinlock and calls
>>>> cond_resched(). This should give other tasks a chance to run. Just as
>>>> an experiment, I'm trying
>>>>
>>>> --- a/mm/vmscan.c
>>>> +++ b/mm/vmscan.c
>>>> @@ -1921,7 +1921,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
>>>>          spin_unlock_irq(&pgdat->lru_lock);
>>>>
>>>>          while (!list_empty(&l_hold)) {
>>>> -               cond_resched();
>>>> +               cond_resched_rcu_qs();
>>>>                  page = lru_to_page(&l_hold);
>>>>                  list_del(&page->lru);
>>>>
>>>> and haven't hit an rcu_sched warning in more than 21 hours of uptime
>>>> so far. We'll see.
>>> This is really interesting! Is it possible that the RCU stall detector
>>> is somehow confused?
>>
>> Wait... 21 hours is not yet a test result.
>
> For the record: We haven't had any stall warnings after 2 days and 20
> hours now, so I'm quite confident that my patch above fixed the
> problem for v4.8.0. On previous boots, the RCU warnings started after
> 37, 0.2, 1, 2, and 0.8 hours of uptime.
>
> Now I've applied this patch to the latest stable release (v4.8.11) on
> another backup machine, which suffered even more RCU stalls.
>
> Donald
>
>> [...]

For the record: After 3 days and 21 hours we got an RCU stall warning
again [1]. So my patch didn't fix it.
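
To put the timeline in perspective, here is a small sketch of the
arithmetic using only the figures quoted in this thread. The variable
names are made up for illustration and are not part of any test setup:

```python
# Uptime (in hours) at which RCU stall warnings started on earlier,
# unpatched boots, as quoted in this thread.
unpatched_onsets_h = [37, 0.2, 1, 2, 0.8]

# Stall-free uptime at the two earlier status reports on the patched kernel.
first_report_h = 21            # ">21 hours"
second_report_h = 2 * 24 + 20  # "2 days and 20 hours"

# Uptime at which the patched v4.8.0 kernel finally showed a stall.
patched_onset_h = 3 * 24 + 21  # "3 days and 21 hours"

longest_unpatched_h = max(unpatched_onsets_h)

# The first report was still below the longest unpatched onset; the second
# was above it, yet a stall followed later anyway.
print(first_report_h < longest_unpatched_h)
print(second_report_h > longest_unpatched_h)
print(patched_onset_h)
```

With so few samples, even a stall-free run well beyond the longest earlier
onset was apparently not long enough to call the problem fixed.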

We are now trying "[PATCH] mm, vmscan: add cond_resched into
shrink_node_memcg" from Michal Hocko [2] on top of v4.8.12 on both servers.

[1] https://owww.molgen.mpg.de/~buczek/321322/2016-12-06.dmesg.txt
[2] https://marc.info/?i=20161202095841.16648-1-mhocko%40kernel.org

-- 
Donald Buczek
buczek@...gen.mpg.de
Tel: +49 30 8413 1433
