Message-ID: <CANn89iL2ZRa9CtypuZXL_+aGQmiqxP9q7eutozJ6G8b=QWjZKw@mail.gmail.com>
Date: Wed, 6 May 2020 07:33:41 -0700
From: Eric Dumazet <edumazet@...gle.com>
To: SeongJae Park <sjpark@...zon.com>
Cc: "Paul E. McKenney" <paulmck@...nel.org>,
Eric Dumazet <eric.dumazet@...il.com>,
David Miller <davem@...emloft.net>,
Al Viro <viro@...iv.linux.org.uk>,
Jakub Kicinski <kuba@...nel.org>,
Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
sj38.park@...il.com, netdev <netdev@...r.kernel.org>,
LKML <linux-kernel@...r.kernel.org>,
SeongJae Park <sjpark@...zon.de>, snu@...zon.com,
amit@...nel.org, stable@...r.kernel.org
Subject: Re: Re: Re: Re: Re: [PATCH net v2 0/2] Revert the 'socket_alloc' life
cycle change
On Wed, May 6, 2020 at 5:59 AM SeongJae Park <sjpark@...zon.com> wrote:
>
> TL;DR: It was not the kernel's fault, but the benchmark program's.
>
> So, the problem is reproducible only with lebench[1]. I carefully read
> its code again.
>
> Before running the "poll big" sub test, where the problem occurred, lebench
> executes the "context switch" sub test. For that test, it sets its own CPU
> affinity[2] and process priority[3] to '0' and '-20', respectively. However,
> it doesn't restore the original values after the "context switch" sub test
> has finished. For this reason, the "select big" sub test also runs bound to
> CPU 0 with the lowest nice value. Therefore, it can disturb the RCU callback
> thread for CPU 0, which processes the deferred deallocations of the sockets,
> and as a result it triggers the OOM.
>
> We confirmed that the problem disappears when the RCU callbacks are offloaded
> from CPU 0 using the rcu_nocbs=0 boot parameter, or when the affinity and/or
> priority are simply restored.
>
> Someone _might_ still argue that this is a kernel problem because the problem
> didn't occur on the old kernels prior to Al's patches. However, setting the
> affinity and priority was possible only because the program was given the
> permission to do so. Therefore, it would be reasonable to blame the system
> administrators rather than the kernel.
>
> So, please ignore this patchset; apologies for the confusion. If you still
> have any doubts or need more tests, please let me know.
>
> [1] https://github.com/LinuxPerfStudy/LEBench
> [2] https://github.com/LinuxPerfStudy/LEBench/blob/master/TEST_DIR/OS_Eval.c#L820
> [3] https://github.com/LinuxPerfStudy/LEBench/blob/master/TEST_DIR/OS_Eval.c#L822
>
>
> Thanks,
> SeongJae Park
No harm done, thanks for running more tests and root-causing the issue!
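
For readers following the root cause above: a benchmark sub test that pins
itself to a CPU and lowers its nice value can avoid leaking those settings
into later sub tests by saving and restoring them. The following is only a
minimal sketch of that idea, not code from LEBench; struct sched_state,
save_sched_state(), restore_sched_state() and run_pinned() are hypothetical
names introduced here for illustration.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sched.h>
#include <stddef.h>
#include <sys/resource.h>

/* Scheduling attributes a sub test might change (hypothetical helper type). */
struct sched_state {
    cpu_set_t affinity;
    int nice_value;
};

/* Record the calling process's CPU affinity and nice value. 0 on success. */
static int save_sched_state(struct sched_state *st)
{
    assert(st != NULL);
    if (sched_getaffinity(0, sizeof(st->affinity), &st->affinity) != 0)
        return -1;
    errno = 0; /* getpriority() may legitimately return -1, so check errno */
    st->nice_value = getpriority(PRIO_PROCESS, 0);
    if (st->nice_value == -1 && errno != 0)
        return -1;
    return 0;
}

/* Put back what save_sched_state() recorded. 0 on success. */
static int restore_sched_state(const struct sched_state *st)
{
    int ret = 0;

    assert(st != NULL);
    if (sched_setaffinity(0, sizeof(st->affinity), &st->affinity) != 0)
        ret = -1;
    if (setpriority(PRIO_PROCESS, 0, st->nice_value) != 0)
        ret = -1;
    return ret;
}

/*
 * Run fn pinned to one CPU at nice -20, then undo both changes so that
 * later sub tests do not inherit them. Pinning and renicing are best
 * effort in this sketch: an unprivileged process usually cannot lower
 * its nice value, and the requested CPU may not be available.
 */
static int run_pinned(int cpu, void (*fn)(void))
{
    struct sched_state saved;
    cpu_set_t one;

    if (save_sched_state(&saved) != 0)
        return -1;

    CPU_ZERO(&one);
    CPU_SET(cpu, &one);
    (void)sched_setaffinity(0, sizeof(one), &one);
    (void)setpriority(PRIO_PROCESS, 0, -20);

    if (fn != NULL)
        fn();

    return restore_sched_state(&saved);
}
```

A sub test would then be invoked as run_pinned(0, some_sub_test), where
some_sub_test is a placeholder for the benchmark's own function, instead of
changing the affinity and priority inline.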