[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <d5227f37-4e44-169e-c54b-587c335514c1@oracle.com>
Date: Thu, 16 May 2019 17:01:46 -0700
From: Jane Chu <jane.chu@...cle.com>
To: Dan Williams <dan.j.williams@...el.com>
Cc: linux-nvdimm <linux-nvdimm@...ts.01.org>,
Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
"Rafael J. Wysocki" <rafael@...nel.org>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
Linux MM <linux-mm@...ck.org>,
Jérôme Glisse <jglisse@...hat.com>,
Bjorn Helgaas <bhelgaas@...gle.com>,
Andrew Morton <akpm@...ux-foundation.org>,
Christoph Hellwig <hch@....de>
Subject: Re: [PATCH v2 0/6] mm/devm_memremap_pages: Fix page release race
On 5/16/2019 2:51 PM, Dan Williams wrote:
> On Thu, May 16, 2019 at 9:45 AM Jane Chu <jane.chu@...cle.com> wrote:
>> Hi,
>>
>> I'm able to reproduce the panic below by running two sets of ndctl
>> commands that actually serve legitimate purpose in parallel (unlike
>> the brute force experiment earlier), each set in a indefinite loop.
>> This time it takes about an hour to panic. But I gather the cause
>> is probably the same: I've overlapped ndctl commands on the same
>> region.
>>
>> Could we add a check in nd_ioctl(), such that if there is
>> an ongoing ndctl command on a region, subsequent ndctl request
>> will fail immediately with something to the effect of EAGAIN?
>> The rationale being that kernel should protect itself against
>> user mistakes.
> We do already have locking in the driver to prevent configuration
> collisions. The problem looks to be broken assumptions about running
> the device unregistration path in a separate thread outside the lock.
> I suspect it may be incorrect assumptions about the userspace
> visibility of the device relative to teardown actions. To be clear
> this isn't the nd_ioctl() path this is the sysfs path.
I see, thanks!
>
>> Also, sensing the subject fix is for a different problem, and has been
>> verified, I'm happy to see it in upstream, so we have a better
>> code base to digger deeper in terms of how the destructive ndctl
>> commands interacts to typical mission critical applications, include
>> but not limited to rdma.
> Right, the crash signature you are seeing looks unrelated to the issue
> being address in these patches which is device-teardown racing active
> page pins. I'll start the investigation on the crash signature, but
> again I don't think it reads on this fix series.
Agreed on investigating the crash as separate issue, looking forward
to see this patchset in upstream.
Thanks!
-jane
Powered by blists - more mailing lists