Message-ID: <772a2c59-7616-4ec7-9050-17d3abf0b6eb@collabora.com>
Date: Fri, 12 Jan 2024 11:16:32 +0500
From: Muhammad Usama Anjum <usama.anjum@...labora.com>
To: Jiaqi Yan <jiaqiyan@...gle.com>,
Sidhartha Kumar <sidhartha.kumar@...cle.com>
Cc: Muhammad Usama Anjum <usama.anjum@...labora.com>, linmiaohe@...wei.com,
mike.kravetz@...cle.com, naoya.horiguchi@....com, akpm@...ux-foundation.org,
songmuchun@...edance.com, shy828301@...il.com, linux-mm@...ck.org,
linux-kernel@...r.kernel.org, jthoughton@...gle.com,
"kernel@...labora.com" <kernel@...labora.com>,
"Matthew Wilcox (Oracle)" <willy@...radead.org>,
Linux Regressions <regressions@...ts.linux.dev>
Subject: Re: [PATCH v4 4/4] selftests/mm: add tests for HWPOISON hugetlbfs read
On 1/10/24 3:15 PM, Muhammad Usama Anjum wrote:
> On 1/10/24 11:49 AM, Muhammad Usama Anjum wrote:
>> On 1/6/24 2:13 AM, Jiaqi Yan wrote:
>>> On Thu, Jan 4, 2024 at 10:27 PM Muhammad Usama Anjum
>>> <usama.anjum@...labora.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>> I'm trying to convert this test to TAP as I think the failures sometimes go
>>>> unnoticed on CI systems if we only depend on the return value of the
>>>> application. I've enabled the following configurations which aren't already
>>>> present in tools/testing/selftests/mm/config:
>>>> CONFIG_MEMORY_FAILURE=y
>>>> CONFIG_HWPOISON_INJECT=m
>>>>
>>>> I'll send a patch to add these configs later. Right now I'm trying to
>>>> investigate the failure when we are trying to inject the poison page by
>>>> madvise(MADV_HWPOISON). I'm getting EBUSY (Device or resource busy) every
>>>> single time. The test fails as it doesn't expect the hugetlb memory to be
>>>> busy. I'm not sure if the poison handling code has issues or the test
>>>> isn't robust enough.
>>>>
>>>> ./hugetlb-read-hwpoison
>>>> Write/read chunk size=0x800
>>>> ... HugeTLB read regression test...
>>>> ... ... expect to read 0x200000 bytes of data in total
>>>> ... ... actually read 0x200000 bytes of data in total
>>>> ... HugeTLB read regression test...TEST_PASSED
>>>> ... HugeTLB read HWPOISON test...
>>>> [ 9.280854] Injecting memory failure for pfn 0x102f01 at process virtual address 0x7f28ec101000
>>>> [ 9.282029] Memory failure: 0x102f01: huge page still referenced by 511 users
>>>> [ 9.282987] Memory failure: 0x102f01: recovery action for huge page: Failed
>>>> ... !!! MADV_HWPOISON failed: Device or resource busy
>>>> ... HugeTLB read HWPOISON test...TEST_FAILED
>>>>
>>>> I'm testing on v6.7-rc8. Not sure if this was working previously or not.
>>>
>>> Thanks for reporting this, Usama!
>>>
>>> I am also able to repro MADV_HWPOISON failure at "501a06fe8e4c
>>> (akpm/mm-stable, mm-stable) zswap: memcontrol: implement zswap
>>> writeback disabling."
>>>
>>> Then I checked out the earliest commit "ba91e7e5d15a (HEAD -> Base)
>>> selftests/mm: add tests for HWPOISON hugetlbfs read". The
>>> MADV_HWPOISON injection works and the test passes:
>>>
>>> ... HugeTLB read HWPOISON test...
>>> ... ... expect to read 0x101000 bytes of data in total
>>> ... !!! read failed: Input/output error
>>> ... ... actually read 0x101000 bytes of data in total
>>> ... HugeTLB read HWPOISON test...TEST_PASSED
>>> ... HugeTLB seek then read HWPOISON test...
>>> ... ... init val=4 with offset=0x102000
>>> ... ... expect to read 0xfe000 bytes of data in total
>>> ... ... actually read 0xfe000 bytes of data in total
>>> ... HugeTLB seek then read HWPOISON test...TEST_PASSED
>>> ...
>>>
>>> [ 2109.209225] Injecting memory failure for pfn 0x3190d01 at process virtual address 0x7f75e3101000
>>> [ 2109.209438] Memory failure: 0x3190d01: recovery action for huge page: Recovered
>>> ...
>>>
>>> I think something in between broke MADV_HWPOISON on hugetlbfs, and we
>>> should be able to figure it out via bisection (and of course by
>>> reading delta commits between them, probably related to page
>>> refcount).
>> Thank you for this information.
>>
>>>
>>> That being said, I will be on vacation from tomorrow until the end of
>>> next week. So I will get back to this after next weekend. Meanwhile if
>>> you want to go ahead and bisect the problematic commit, that will be
>>> very much appreciated.
>> I'll try to bisect and post here if I find something.
> Found the culprit commit by bisection:
>
> a08c7193e4f18dc8508f2d07d0de2c5b94cb39a3
> mm/filemap: remove hugetlb special casing in filemap.c
#regzbot title: hugetlbfs hwpoison handling
#regzbot introduced: a08c7193e4f1
#regzbot monitor:
https://lore.kernel.org/all/20240111191655.295530-1-sidhartha.kumar@oracle.com
>
> hugetlb-read-hwpoison started failing from this patch. I've added the
> author of this patch to this bug report.
>
>>
>>>
>>> Thanks,
>>> Jiaqi
>>>
>>>
>>>>
>>>> Regards,
>>>> Usama
>>>>
>
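
For the TAP conversion mentioned at the top of this thread, a minimal sketch
of what per-test reporting might look like is below. It is only an
illustration, not the actual selftest patch: tap_format_result() and
hwpoison_errno_hint() are hypothetical helpers, and a real conversion would
more likely use the ksft_* helpers from tools/testing/selftests/kselftest.h.
The EBUSY hint mirrors the "still referenced" kernel log quoted above; the
EPERM case follows madvise(2).

```c
#include <errno.h>
#include <stdio.h>

/*
 * Hypothetical helper: format one TAP result line ("ok N name" or
 * "not ok N name") into buf. Returns what snprintf() returns.
 */
int tap_format_result(char *buf, size_t len, unsigned int num,
                      const char *name, int passed)
{
	return snprintf(buf, len, "%s %u %s\n",
			passed ? "ok" : "not ok", num, name);
}

/*
 * Hypothetical helper: turn the errno left by a failed
 * madvise(addr, len, MADV_HWPOISON) call into a short diagnostic.
 * Only the cases discussed in this thread or documented in
 * madvise(2) are covered; anything else is reported as unexpected.
 */
const char *hwpoison_errno_hint(int err)
{
	switch (err) {
	case 0:
		return "poison injected";
	case EBUSY:
		return "page still referenced; recovery failed";
	case EPERM:
		return "caller lacks CAP_SYS_ADMIN";
	default:
		return "unexpected error";
	}
}
```

With something like this, the failing run above could be reported as, for
example, "not ok 2 HugeTLB read HWPOISON test", which a CI parser can catch
even when the exit code of the binary is ignored. The test number here is an
assumption made for the example.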
--
BR,
Muhammad Usama Anjum