Message-ID: <aQ3G44fqJG378pM1@rli9-mobl>
Date: Fri, 7 Nov 2025 18:16:03 +0800
From: Philip Li <philip.li@...el.com>
To: Peter Zijlstra <peterz@...radead.org>
CC: "Chen, Yu C" <yu.c.chen@...el.com>, kernel test robot
<oliver.sang@...el.com>, Fernand Sieber <sieberf@...zon.com>,
<oe-lkp@...ts.linux.dev>, <lkp@...el.com>, <linux-kernel@...r.kernel.org>,
<x86@...nel.org>, <aubrey.li@...ux.intel.com>
Subject: Re: [tip:sched/core] [sched/fair] 79104becf4:
BUG:kernel_NULL_pointer_dereference,address
On Wed, Nov 05, 2025 at 08:06:32PM +0800, Philip Li wrote:
> On Wed, Nov 05, 2025 at 12:00:26PM +0100, Peter Zijlstra wrote:
> > On Tue, Oct 28, 2025 at 10:30:08AM +0800, Chen, Yu C wrote:
> > > On 10/27/2025 10:09 PM, Peter Zijlstra wrote:
> > > > On Mon, Oct 27, 2025 at 03:07:18PM +0100, Peter Zijlstra wrote:
> > > > > On Mon, Oct 27, 2025 at 02:55:16PM +0100, Peter Zijlstra wrote:
> > > > >
> > > > > > > May I know if you are using the kernel config 0day attached?
> > > > > > > I found that the config 0day attached
> > > > > > > (https://download.01.org/0day-ci/archive/20251021/202510211205.1e0f5223-lkp@intel.com/config-6.18.0-rc1-00001-g79104becf42b)
> > > > > > > has
> > > > > > > CONFIG_IA32_EMULATION=y
> > > > > > > CONFIG_IA32_EMULATION_DEFAULT_DISABLED=y
> > > > >
> > > > > Yep, deleting that entry makes it all work.
> > > >
> > > > 'work' might be overstating, it boots and starts trinity, which then
> > > > promptly (as in a handful of seconds) triggers OOM and dies. Not
> > > > actually reproducing the NULL deref I was looking for.
> > >
> > > Change the following line in job-script
> > > export memory='16G'
> > > to
> > > export memory='64G'
> > > ?
> >
> > Yes, that seems to help.
> >
> > > I had a try and can reproduce the NULL except at first run:
> >
> > Took me two runs, but yes, I can see it now.
> >
> > Anyway, this is two bugs in the robot, can we please fix all this to not
> > happen again?
>
> Got it, I will dig into the details to understand the difference between local
> reproduction and the internal cluster run. The image, kconfig, and memory
> are exactly the same for the actual robot run and the provided reproduce instructions,
> since the attachment is reproduced from the job execution. I haven't found the
> cause yet; I will get back to this asap and provide an update.
>
> >
> > - .config has 32bit disabled while robot provides 32bit images. Clearly
> > the actual robot runs 64bit images and the reproduction should
> > provide those too.
An update: this one is resolved. The cluster run sets ia32_emulation=on on the
kernel cmdline, and that setting was missing from the reproduce steps.
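
For anyone reproducing locally, a minimal sketch of a check that a kernel
command line carries this setting. The function name is hypothetical, and it
only matches the exact spelling ia32_emulation=on used above, not other
boolean spellings the kernel may accept.

```shell
# Hypothetical helper: succeed if the given kernel command line string
# contains the exact parameter ia32_emulation=on as a separate word.
ia32_emulation_enabled() {
    case " $1 " in
        *" ia32_emulation=on "*) return 0 ;;
    esac
    return 1
}

# On the booted test machine, the running kernel can be checked with:
#   ia32_emulation_enabled "$(cat /proc/cmdline)" && echo "32-bit enabled"
```

With CONFIG_IA32_EMULATION_DEFAULT_DISABLED=y in the config, a failing check
would suggest that the 32bit image cannot run its binaries.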
> >
> > - job description is inaccurate in the amount of memory required.
Got it. The cluster run with 16G has a 40% rate (in about 20 runs). I have now
increased the memory to 32G, which should reduce the chance of OOM in local
reproduction.
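
For an already downloaded job-script, a minimal sketch of adjusting the memory
line by hand. It assumes the line has the form export memory='16G' as quoted
earlier in the thread; the function name is hypothetical.

```shell
# Hypothetical helper: rewrite the "export memory='...'" line of a job-script.
# Usage: set_job_memory <job-script path> <size, e.g. 32G>
set_job_memory() {
    script=$1
    size=$2
    # Write to a temp file first (avoids relying on sed -i), then copy the
    # result back so the permissions of the original file are kept.
    sed "s/^\([[:space:]]*export memory=\)'[^']*'/\1'${size}'/" "$script" > "$script.tmp" &&
        cat "$script.tmp" > "$script" &&
        rm -f "$script.tmp"
}

# Example: set_job_memory ./job-script 32G
```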
> >
> > The reproduction steps must exactly match what the real robot runs, not
> > something else.
Sorry for the wrong reproduce steps; we should be more careful to keep them
consistent with the actual robot run. Thanks again to Peter and Yu.
> >