lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Wed, 6 May 2020 07:33:41 -0700
From:   Eric Dumazet <edumazet@...gle.com>
To:     SeongJae Park <sjpark@...zon.com>
Cc:     "Paul E. McKenney" <paulmck@...nel.org>,
        Eric Dumazet <eric.dumazet@...il.com>,
        David Miller <davem@...emloft.net>,
        Al Viro <viro@...iv.linux.org.uk>,
        Jakub Kicinski <kuba@...nel.org>,
        Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
        sj38.park@...il.com, netdev <netdev@...r.kernel.org>,
        LKML <linux-kernel@...r.kernel.org>,
        SeongJae Park <sjpark@...zon.de>, snu@...zon.com,
        amit@...nel.org, stable@...r.kernel.org
Subject: Re: Re: Re: Re: Re: [PATCH net v2 0/2] Revert the 'socket_alloc' life
 cycle change

On Wed, May 6, 2020 at 5:59 AM SeongJae Park <sjpark@...zon.com> wrote:
>
> TL; DR: It was not kernel's fault, but the benchmark program.
>
> So, the problem is reproducible using the lebench[1] only.  I carefully read
> it's code again.
>
> Before running the problem occurred "poll big" sub test, lebench executes
> "context switch" sub test.  For the test, it sets the cpu affinity[2] and
> process priority[3] of itself to '0' and '-20', respectively.  However, it
> doesn't restore the values to original value even after the "context switch" is
> finished.  For the reason, "select big" sub test also run binded on CPU 0 and
> has lowest nice value.  Therefore, it can disturb the RCU callback thread for
> the CPU 0, which processes the deferred deallocations of the sockets, and as a
> result it triggers the OOM.
>
> We confirmed the problem disappears by offloading the RCU callbacks from the
> CPU 0 using rcu_nocbs=0 boot parameter or simply restoring the affinity and/or
> priority.
>
> Someone _might_ still argue that this is kernel problem because the problem
> didn't occur on the old kernels prior to the Al's patches.  However, setting
> the affinity and priority was available because the program received the
> permission.  Therefore, it would be reasonable to blame the system
> administrators rather than the kernel.
>
> So, please ignore this patchset, apology for making confuse.  If you still has
> some doubts or need more tests, please let me know.
>
> [1] https://github.com/LinuxPerfStudy/LEBench
> [2] https://github.com/LinuxPerfStudy/LEBench/blob/master/TEST_DIR/OS_Eval.c#L820
> [3] https://github.com/LinuxPerfStudy/LEBench/blob/master/TEST_DIR/OS_Eval.c#L822
>
>
> Thanks,
> SeongJae Park

No harm done, thanks for running more tests and root-causing the issue !

Powered by blists - more mailing lists