Message-ID: <ZOs5j93aAmZhrA/G@casper.infradead.org>
Date: Sun, 27 Aug 2023 12:54:55 +0100
From: Matthew Wilcox <willy@...radead.org>
To: dianlujitao <dianlujitao@...il.com>
Cc: Bagas Sanjaya <bagasdotme@...il.com>, Chris Mason <clm@...com>,
Josef Bacik <josef@...icpanda.com>,
David Sterba <dsterba@...e.com>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
Linux btrfs <linux-btrfs@...r.kernel.org>,
Linux Filesystem Development <linux-fsdevel@...r.kernel.org>
Subject: Re: Fwd: kernel bug when performing heavy IO operations
On Sun, Aug 27, 2023 at 12:34:54PM +0800, dianlujitao wrote:
>
> On 2023/8/27 11:45, Matthew Wilcox wrote:
> > On Sun, Aug 27, 2023 at 10:20:51AM +0700, Bagas Sanjaya wrote:
> > > > When the IO load is heavy (compiling AOSP in my case), the kernel may crash, and the only way to recover is a hard reset. The logs look as follows:
> > > >
> > > > 8月 25 13:52:23 arch-pc kernel: BUG: Bad page map in process tmux: client pte:8000000462500025 pmd:b99c98067
> > > > 8月 25 13:52:23 arch-pc kernel: page:00000000460fa108 refcount:4 mapcount:-256 mapping:00000000612a1864 index:0x16 pfn:0x462500
> > > > 8月 25 13:52:23 arch-pc kernel: memcg:ffff8a1056ed0000
> > > > 8月 25 13:52:23 arch-pc kernel: aops:btrfs_aops [btrfs] ino:9c4635 dentry name:"locale-archive"
> > > > 8月 25 13:52:23 arch-pc kernel: flags: 0x2ffff5800002056(referenced|uptodate|lru|workingset|private|node=0|zone=2|lastcpupid=0xffff)
> > > > 8月 25 13:52:23 arch-pc kernel: page_type: 0xfffffeff(offline)
> > This is interesting. PG_offline is set.
> >
> > $ git grep SetPageOffline
> > arch/powerpc/platforms/powernv/memtrace.c: __SetPageOffline(pfn_to_page(pfn));
> > drivers/hv/hv_balloon.c: __SetPageOffline(pg);
> > drivers/hv/hv_balloon.c: __SetPageOffline(pg + j);
> > drivers/misc/vmw_balloon.c: __SetPageOffline(page + i);
> > drivers/virtio/virtio_mem.c: __SetPageOffline(page);
> > drivers/xen/balloon.c: __SetPageOffline(page);
> > include/linux/balloon_compaction.h: __SetPageOffline(page);
> > include/linux/balloon_compaction.h: __SetPageOffline(page);
> >
> > But there's no indication that this kernel is running under a
> > hypervisor:
> >
> > > > 8月 25 13:52:23 arch-pc kernel: Hardware name: JGINYUE X99-8D3/2.5G Server/X99-8D3/2.5G Server, BIOS 5.11 06/30/2022
> Yes, I'm running on bare metal hardware.
> > So I'd agree with Artem, this looks like bad RAM.
> >
> I ran memtest86+ 6.20 for a cycle and it passed. However, could an OOM
> trigger the bug? e.g., kernel bug fired before the OOM killer has a
> chance to start? Just a guess because the last log entry in journalctl
> before "BUG" is an hour earlier.
The problem is that OOM doesn't SetPageOffline. The only things that
do are hypervisor guest drivers. So we've got a random bit being
cleared, and either that's a stray write which happens to land in
the struct page in question, or it's bad hardware. Since it's a
single bit that's being cleared, bad hardware is the most likely
explanation, but it's not impossible for there to be a bug that's
doing this. The problem is that it could be almost anything ...
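
The single-bit observation can be checked against the values in the quoted
log. This is a minimal sketch, not kernel code. It assumes that a page with
no type set has page_type 0xffffffff, that PG_offline is the 0x100 bit
(which is what the "offline" decoding in the log suggests for this kernel),
and that the printed mapcount is the same 32-bit field read as a signed
integer plus one. The helper name to_signed32 is made up for illustration.

```python
# Values taken from the quoted log line "page_type: 0xfffffeff(offline)".
PAGE_TYPE_LOGGED = 0xFFFFFEFF

# Assumptions (not stated in the thread): an untyped page reads as all ones,
# and PG_offline is bit 0x100 in kernels of this era.
PAGE_TYPE_UNSET = 0xFFFFFFFF
PG_OFFLINE = 0x00000100


def to_signed32(value: int) -> int:
    """Interpret a 32-bit unsigned value as a two's-complement signed int."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value & 0x80000000 else value


# Which bits differ between the untyped value and the logged value?
flipped = PAGE_TYPE_UNSET ^ PAGE_TYPE_LOGGED

# A power of two has exactly one bit set, so this is a single-bit difference.
single_bit = flipped != 0 and (flipped & (flipped - 1)) == 0

# page_type shares storage with the map count; the dump prints count + 1.
mapcount = to_signed32(PAGE_TYPE_LOGGED) + 1

print(f"flipped bits: {flipped:#x}")
print(f"single bit:   {single_bit}")
print(f"is offline:   {flipped == PG_OFFLINE}")
print(f"mapcount:     {mapcount}")
```

Under those assumptions, the logged page_type differs from the untyped value
in exactly one bit, and the same field read as a map count gives the
"mapcount:-256" shown in the log.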