[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <5A324351.8000004@huawei.com>
Date: Thu, 14 Dec 2017 17:24:33 +0800
From: Jason Yan <yanaijie@...wei.com>
To: Greg KH <gregkh@...uxfoundation.org>
CC: <martin.petersen@...cle.com>, <jejb@...ux.vnet.ibm.com>,
<linux-scsi@...r.kernel.org>, <linux-kernel@...r.kernel.org>,
Bart Van Assche <bart.vanassche@....com>,
"Ewan D . Milne" <emilne@...hat.com>,
"Christoph Hellwig" <hch@....de>
Subject: Re: [PATCH] driver core: Make it safe to use get_device() if the
reference count is zero
On 2017/12/14 16:56, Greg KH wrote:
> On Thu, Dec 14, 2017 at 04:27:56PM +0800, Jason Yan wrote:
>>
>>
>> On 2017/12/14 16:10, Greg KH wrote:
>>> On Thu, Dec 14, 2017 at 03:56:46PM +0800, Jason Yan wrote:
>>>>
>>>> On 2017/12/14 15:42, Greg KH wrote:
>>>>> On Thu, Dec 14, 2017 at 11:39:36AM +0800, Jason Yan wrote:
>>>>>> Some driviers may have the chance to increase a reference count that
>>>>>> has dropped to zero when using get_device() because of their design.
>>>>> Then those drivers are broken :)
>>>>>
>>>>>> We have met such a issue with scsi:
>>>>>> https://www.spinics.net/lists/linux-scsi/msg115295.html
>>>>>>
>>>>>> The scsi core will keep the scsi device object in the host list after
>>>>>> it has been deleted and the iterator can still find it. All of the
>>>>>> places where need iterating have to check the state of the scsi device
>>>>>> and this makes a lot of code redundancy and complexity.
>>>>>>
>>>>>> Provide a safe mechanism in get_device() by using
>>>>>> kobject_get_unless_zero().
>>>>>>
>>>>>> Suggested-by: Bart Van Assche <bart.vanassche@....com>
>>>>>> Signed-off-by: Jason Yan <yanaijie@...wei.com>
>>>>>> CC: Greg Kroah-Hartman <gregkh@...uxfoundation.org>
>>>>>> CC: Bart Van Assche <bart.vanassche@....com>
>>>>>> CC: Ewan D. Milne <emilne@...hat.com>
>>>>>> CC: James E.J. Bottomley <jejb@...ux.vnet.ibm.com>
>>>>>> CC: Christoph Hellwig <hch@....de>
>>>>>> ---
>>>>>> drivers/base/core.c | 2 +-
>>>>>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/drivers/base/core.c b/drivers/base/core.c
>>>>>> index 12ebd05..cc74810 100644
>>>>>> --- a/drivers/base/core.c
>>>>>> +++ b/drivers/base/core.c
>>>>>> @@ -1916,7 +1916,7 @@ EXPORT_SYMBOL_GPL(device_register);
>>>>>> */
>>>>>> struct device *get_device(struct device *dev)
>>>>>> {
>>>>>> - return dev ? kobj_to_dev(kobject_get(&dev->kobj)) : NULL;
>>>>>> + return dev && kobject_get_unless_zero(&dev->kobj) ? dev : NULL;
>>>>> I really don't want to do this, the bus the device is on should prevent
>>>>> this from happening.
>>>>>
>>>>> Also, once that reference count drops to zero, the memory should be
>>>>> freed, so you really have a stale pointer here, and this code would fail
>>>>> if you had slab debugging enabled anyway.
>>>>
>>>> Actually I don't want this either. But the design of scsi core will leave
>>>> the scsi device on the host list after it is deleted, and it can be
>>>> found later and the refcount have a very big chance to increase from
>>>> zero again. And after a lot of discussion it seems that the scsi layer
>>>> is difficult to change the situation in the near future.
>>>
>>> Keeping a 'struct device' reference counted chunk of memory on a list
>>> that has a different lifetime rule from that device itself, is crazy.
>>>
>>
>> The lifetime rule is the same. That device itself will delete from the
>> host list in the destructor, scsi_device_dev_release_usercontext(), and
>> the memory will be freed after that. That's why this issue came out.
>>
>>> And yes, I remember how all of this came about, but I really don't have
>>> the time to work on it myself...
>>>
>>>>> So I don't even think this fixes the issue you think it fixes :)
>>>>
>>>> This issue is very easy to reproduce on my machine and I have tested the
>>>> patch and it really fixes the issue.
>>>
>>> Even with slab debugging enabled? If so, what is keeping that memory
>>> from being freed once the reference count drops to 0?
>>>
>>
>> As above, the memory is not freed yet when we increasing the refconunt from
>> zero, so it's nothing to do with slab debugging enabled or not. If
>> we want to free it, we have to grab host lock first to delete it from
>> the list, so if we are grabing the host lock, we can increase the
>> refcount safely from 0 to 1.
>
> Wait, what? Once that refcount goes to 0, the release callback will be
> called, and the memory had better be freed. Any pointer you might still
> have to the structure is now invalid. The fact that you are somehow
> still keeping that pointer around is not ok, and slab debugging should
> have caused the memory to be overwritten and garbage would result if you
> tried to make this call again.
>
I don't know why scsi have this design. Anyone who is familiar with the
history of this design can explain this?
> thanks,
>
> greg k-h
>
> .
>
Powered by blists - more mailing lists