Message-ID: <20150814083818.GB6956@hori1.linux.bs1.fc.nec.co.jp>
Date:	Fri, 14 Aug 2015 08:38:18 +0000
From:	Naoya Horiguchi <n-horiguchi@...jp.nec.com>
To:	Wanpeng Li <wanpeng.li@...mail.com>
CC:	Andrew Morton <akpm@...ux-foundation.org>,
	"linux-mm@...ck.org" <linux-mm@...ck.org>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH] mm/hwpoison: fix race between soft_offline_page and
 unpoison_memory

On Fri, Aug 14, 2015 at 03:59:21PM +0800, Wanpeng Li wrote:
> On 8/14/15 3:54 PM, Wanpeng Li wrote:
> >[...]
> >>OK, then let me rethink how to handle the race in unpoison_memory().
> >>
> >>Currently properly contained/hwpoisoned pages should have page refcount 1
> >>(when the memory error hits LRU pages or hugetlb pages) or refcount 0
> >>(when the memory error hits the buddy page.) And current unpoison_memory()
> >>implicitly assumes this because otherwise the unpoisoned page has no place
> >>to go and it's just leaked.
> >>So to avoid the kernel panic, adding prechecks of refcount and mapcount
> >>to restrict unpoisoning to pages that can actually be unpoisoned looks OK to me.
> >>The page under soft offlining always has refcount >=2 and/or mapcount > 0,
> >>so such pages should be filtered out.
> >>
> >>Here's a patch. In my testing (running soft offline stress testing while
> >>repeatedly unpoisoning in the background), the reported (or a similar) bug doesn't happen.
> >>Can I have your comments?
> >As page_action() prints out, the page may still be referenced by some users,
> >but PageHWPoison has already been set. So you will leak many poisoned pages.
> >
>
> Anyway, the bug is still there.
>
> [  944.387559] BUG: Bad page state in process expr  pfn:591e3
> [  944.393053] page:ffffea00016478c0 count:-1 mapcount:0 mapping: (null) index:0x2
> [  944.401147] flags: 0x1fffff80000000()
> [  944.404819] page dumped because: nonzero _count

Hmm, no luck :(

To investigate further, I'd like to test exactly the same kernel as yours, so
could you share the kernel info (.config, base kernel, and which patches
you applied)? Or could you push your tree somewhere like GitHub?
# if you like, sending it to me privately is fine.

I think that I tested v4.2-rc6 + <your recent 7 hwpoison patches> +
"mm/hwpoison: fix race between soft_offline_page and unpoison_memory",
but I hit some conflicts when applying your patches for some reason,
so we might be testing different kernels.

Mine is here:
  https://github.com/Naoya-Horiguchi/linux v4.2-rc6/fix_race_soft_offline_unpoison

Thanks,
Naoya Horiguchi