Date:   Tue, 5 May 2020 09:25:06 -0700
From:   Eric Dumazet <eric.dumazet@...il.com>
To:     SeongJae Park <sjpark@...zon.com>,
        Eric Dumazet <edumazet@...gle.com>
Cc:     Eric Dumazet <eric.dumazet@...il.com>,
        David Miller <davem@...emloft.net>,
        Al Viro <viro@...iv.linux.org.uk>,
        Jakub Kicinski <kuba@...nel.org>,
        Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
        sj38.park@...il.com, netdev <netdev@...r.kernel.org>,
        LKML <linux-kernel@...r.kernel.org>,
        SeongJae Park <sjpark@...zon.de>, snu@...zon.com,
        amit@...nel.org, stable@...r.kernel.org,
        Paul McKenney <paulmck@...nel.org>
Subject: Re: [PATCH net v2 0/2] Revert the 'socket_alloc' life cycle change



On 5/5/20 9:13 AM, SeongJae Park wrote:
> On Tue, 5 May 2020 09:00:44 -0700 Eric Dumazet <edumazet@...gle.com> wrote:
> 
>> On Tue, May 5, 2020 at 8:47 AM SeongJae Park <sjpark@...zon.com> wrote:
>>>
>>> On Tue, 5 May 2020 08:20:50 -0700 Eric Dumazet <eric.dumazet@...il.com> wrote:
>>>
>>>>
>>>>
>>>> On 5/5/20 8:07 AM, SeongJae Park wrote:
>>>>> On Tue, 5 May 2020 07:53:39 -0700 Eric Dumazet <edumazet@...gle.com> wrote:
>>>>>
>>>>
>>>>>> Why do we have 10,000,000 objects around ? Could this be because of
>>>>>> some RCU problem ?
>>>>>
>>>>> Mainly because of a long RCU grace period, as you guess.  I have no idea how
>>>>> the grace period became so long in this case.
>>>>>
>>>>> As my test machine was a virtual machine instance, I guess a problem like
>>>>> RCU reader preemption[1] might have affected this.
>>>>>
>>>>> [1] https://www.usenix.org/system/files/conference/atc17/atc17-prasad.pdf
>>>>>
>>>>>>
>>>>>> Once Al's patches are reverted, do you have 10,000,000 sock_alloc around ?
>>>>>
>>>>> Yes, both the old kernel that predates Al's patches and the recent kernel
>>>>> with Al's patches reverted didn't reproduce the problem.
>>>>>
>>>>
>>>> I repeat my question : Do you have 10,000,000 (smaller) objects kept in slab caches ?
>>>>
>>>> TCP sockets use the (very complex, error prone) SLAB_TYPESAFE_BY_RCU, but not the struct socket_wq
>>>> object that was allocated in sock_alloc_inode() before Al's patches.
>>>>
>>>> These objects should be visible in kmalloc-64 kmem cache.
>>>
>>> Not exactly 10,000,000, as that is only the highest possible number, but I
>>> was able to observe a clear exponential increase in the number of objects
>>> using slabtop.  Before the start of the problematic workload, the number of
>>> objects in 'kmalloc-64' was 5760, but I was able to observe the number increase
>>> to 1,136,576.
>>>
>>>            OBJS  ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
>>> before:    5760    5088  88%    0.06K     90       64       360K kmalloc-64
>>> after:  1136576 1136576 100%    0.06K  17759       64     71036K kmalloc-64
>>>
>>
>> Great, thanks.
>>
>> How recent is the kernel you are running for your experiment ?
> 
> It's based on 5.4.35.
> 
>>
>> Let's make sure the bug is not in RCU.
> 
> One thing I can currently say is that the grace period does pass eventually.  I
> modified the benchmark to repeat not 10,000 times but only 5,000 times, so that
> the test runs without OOM but with easily observable memory pressure.  As soon
> as the benchmark finishes, the memory is freed.
> 
> If you need more tests, please let me know.
> 

I would ask for Paul's opinion on this issue, because we have many objects
being freed after RCU grace periods.

If the RCU subsystem cannot keep up, I guess other workloads will also suffer.

Sure, we can revert patches here and there trying to work around the issue,
but for objects allocated from process context, we should not have these problems.
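
The slabtop figures quoted above can be sanity-checked with a short calculation. The sketch below is only illustrative: the SlabRow type and the helper names are hypothetical, the numbers are copied from the two quoted kmalloc-64 rows, and the 4 KiB slab size is an assumption inferred from CACHE SIZE divided by SLABS rather than something stated in the thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SlabRow:
    """One slabtop row (hypothetical helper type, fields named after the columns)."""
    objs: int
    active: int
    slabs: int
    objs_per_slab: int
    cache_kib: int


# Assumption: each kmalloc-64 slab occupies one 4 KiB page.
SLAB_KIB = 4

# Values copied from the quoted 'before' and 'after' rows.
before = SlabRow(objs=5760, active=5088, slabs=90, objs_per_slab=64, cache_kib=360)
after = SlabRow(objs=1136576, active=1136576, slabs=17759, objs_per_slab=64, cache_kib=71036)


def is_consistent(row: SlabRow) -> bool:
    """Check that OBJS = SLABS * OBJ/SLAB and CACHE SIZE = SLABS * slab size."""
    return (row.slabs * row.objs_per_slab == row.objs
            and row.slabs * SLAB_KIB == row.cache_kib)


def growth_factor(old: SlabRow, new: SlabRow) -> float:
    """How many times more objects the cache holds after the workload."""
    return new.objs / old.objs


def extra_kib(old: SlabRow, new: SlabRow) -> int:
    """Additional memory held by the cache, in KiB."""
    return new.cache_kib - old.cache_kib


if __name__ == "__main__":
    print("rows consistent:", is_consistent(before) and is_consistent(after))
    print("object growth: %.1fx" % growth_factor(before, after))
    print("extra memory: %d KiB" % extra_kib(before, after))
```

Under these assumptions both rows are internally consistent, and the kmalloc-64 cache alone grows by roughly 69 MiB during the shortened 5,000-iteration run.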