linux-kernel - Re: Re: [Xen-devel] [PATCH v10 2/4] xen/blkback: Squeeze page pools if a memory pressure is detected

Message-ID: <20191217083032.19400-1-sjpark@amazon.com>
Date:   Tue, 17 Dec 2019 09:30:32 +0100
From:   SeongJae Park <sjpark@...zon.com>
To:     Jürgen Groß <jgross@...e.com>
CC:     SeongJae Park <sjpark@...zon.com>, <axboe@...nel.dk>,
        <linux-block@...r.kernel.org>, <konrad.wilk@...cle.com>,
        <pdurrant@...zon.com>, <linux-kernel@...r.kernel.org>,
        SeongJae Park <sj38.park@...il.com>,
        <xen-devel@...ts.xenproject.org>, <roger.pau@...rix.com>
Subject: Re: Re: [Xen-devel] [PATCH v10 2/4] xen/blkback: Squeeze page pools if a memory pressure is detected

On Tue, 17 Dec 2019 09:16:47 +0100 "Jürgen Groß" <jgross@...e.com> wrote:

> On 17.12.19 08:59, SeongJae Park wrote:
> > On Tue, 17 Dec 2019 07:23:12 +0100 "Jürgen Groß" <jgross@...e.com> wrote:
> > 
> >> On 16.12.19 20:48, SeongJae Park wrote:
> >>> On Mon, 16 Dec 2019 17:23:44 +0100, Jürgen Groß wrote:
> >>>
> >>>> On 16.12.19 17:15, SeongJae Park wrote:
> >>>>> On Mon, 16 Dec 2019 15:37:20 +0100 SeongJae Park <sjpark@...zon.com> wrote:
> >>>>>
> >>>>>> On Mon, 16 Dec 2019 13:45:25 +0100 SeongJae Park <sjpark@...zon.com> wrote:
> >>>>>>
> >>>>>>> From: SeongJae Park <sjpark@...zon.de>
> >>>>>>>
> >>>>> [...]
> >>>>>>> --- a/drivers/block/xen-blkback/xenbus.c
> >>>>>>> +++ b/drivers/block/xen-blkback/xenbus.c
> >>>>>>> @@ -824,6 +824,24 @@ static void frontend_changed(struct xenbus_device *dev,
> >>>>>>>     }
> >>>>>>>     
> >>>>>>>     
> >>>>>>> +/* Once a memory pressure is detected, squeeze free page pools for a while. */
> >>>>>>> +static unsigned int buffer_squeeze_duration_ms = 10;
> >>>>>>> +module_param_named(buffer_squeeze_duration_ms,
> >>>>>>> +		buffer_squeeze_duration_ms, int, 0644);
> >>>>>>> +MODULE_PARM_DESC(buffer_squeeze_duration_ms,
> >>>>>>> +"Duration in ms to squeeze pages buffer when a memory pressure is detected");
> >>>>>>> +
> >>>>>>> +/*
> >>>>>>> + * Callback received when the memory pressure is detected.
> >>>>>>> + */
> >>>>>>> +static void reclaim_memory(struct xenbus_device *dev)
> >>>>>>> +{
> >>>>>>> +	struct backend_info *be = dev_get_drvdata(&dev->dev);
> >>>>>>> +
> >>>>>>> +	be->blkif->buffer_squeeze_end = jiffies +
> >>>>>>> +		msecs_to_jiffies(buffer_squeeze_duration_ms);
> >>>>>>
> >>>>>> This callback might race with 'xen_blkbk_probe()'.  The race could result in
> >>>>>> __NULL dereferencing__, as 'xen_blkbk_probe()' sets '->blkif' after it links
> >>>>>> 'be' to the 'dev'.  Please _don't merge_ this patch now!
> >>>>>>
> >>>>>> I will do more testing and share the results.  Meanwhile, if you have any
> >>>>>> opinions, please let me know.
> >>>
> >>> I reduced system memory and attached a bunch of devices in a short time so
> >>> that memory pressure occurs while device attachments are ongoing.  Under
> >>> these circumstances, I was able to see the race.
> >>>
> >>>>>
> >>>>> Not only '->blkif', but 'be' itself could also be NULL.  As similar
> >>>>> concurrency issues could exist in other drivers in their own ways, I suggest
> >>>>> changing the reclaim callback ('->reclaim_memory') to be called for each
> >>>>> driver instead of each device.  Then, each driver would be able to deal with
> >>>>> its concurrency issues by itself.
> >>>>
> >>>> Hmm, I don't like that. This would need to be changed back in case we
> >>>> add per-guest quota.
> >>>
> >>> Extending this callback in that way would still not be too hard.  We could use
> >>> the argument to the callback.  I would keep the argument of the callback as
> >>> 'struct device *', and will add a comment saying that a 'NULL' value of the
> >>> argument means every device.  As an example, xenbus would pass a
> >>> NULL-terminated array of pointers to the devices that need to free their
> >>> resources.
> >>>
> >>> After seeing this race, I am now also thinking it could be better to delegate
> >>> detailed control of each device to its driver, as some drivers have
> >>> complicated and unique relationships with their devices.
> >>>
> >>>>
> >>>> Wouldn't a get_device() before calling the callback and a put_device()
> >>>> afterwards avoid that problem?
> >>>
> >>> I didn't use the reference count manipulation operations because other similar
> >>> parts also didn't.  But, if there is no implicit reference count guarantee, it
> >>> seems those operations are indeed necessary.
> >>>
> >>> That said, as get/put operations only adjust the reference count, those will
> >>> not make the callback wait until the linking of the 'backend' and 'blkif' to
> >>> the device (xen_blkbk_probe()) is finished.  Thus, the race could still happen.
> >>> Or, am I missing something?
> >>
> >> No, I think we need a xenbus lock per device which will need to be
> >> taken in xen_blkbk_probe(), xenbus_dev_remove() and while calling the
> >> callback.
> > 
> > I also agree that locking should be used in the end.  But, as each driver
> > manages its devices and resources in its own way, it could have its own unique
> > race conditions.  And each unique race condition might have its own most
> > efficient way to synchronize it.  Therefore, I think the synchronization should
> > be done by each driver, not by xenbus, and thus we should make the callback be
> > called per-driver.
> 
> xenbus controls creation and removal of devices, so applying locking
> at xenbus level is the right thing to do in order to avoid races with
> device removal.
> 
> In case a backend has further synchronization requirements those have to
> be handled at backend level, of course.
> 
> In the end you'll need the xenbus level locking anyway in order to avoid
> a race when the last backend specific device is just being removed when
> the callback is about to be called for that device. Or you'd need to
> call try_module_get() before calling into each backend...

Agreed.  Thank you for your kind explanation of your concerns.
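
For illustration only, the per-device locking discussed above could be modeled
roughly as below.  This is a userspace sketch using pthreads, not kernel code
and not the eventual patch.  Every name in it ('model_device', 'reclaim_lock',
'model_probe()', 'model_remove()', 'model_reclaim()') is hypothetical, and
time is passed in as plain ticks instead of jiffies.  The point it shows is
that probe, remove and the callback invocation all take the same per-device
lock, so the callback never sees a half-linked or already unlinked backend.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for xen_blkif, backend_info and xenbus_device. */
struct model_blkif {
	unsigned long buffer_squeeze_end;	/* deadline in abstract ticks */
};

struct model_backend {
	struct model_blkif *blkif;
};

struct model_device {
	pthread_mutex_t reclaim_lock;	/* hypothetical per-device lock */
	struct model_backend *drvdata;	/* NULL until probe has finished */
};

void model_device_init(struct model_device *dev)
{
	pthread_mutex_init(&dev->reclaim_lock, NULL);
	dev->drvdata = NULL;
}

/* Models probe: link 'blkif' to 'be' and 'be' to the device under the lock. */
void model_probe(struct model_device *dev, struct model_backend *be,
		 struct model_blkif *blkif)
{
	assert(dev && be && blkif);
	pthread_mutex_lock(&dev->reclaim_lock);
	be->blkif = blkif;
	dev->drvdata = be;
	pthread_mutex_unlock(&dev->reclaim_lock);
}

/* Models remove: unlink the backend under the same lock. */
void model_remove(struct model_device *dev)
{
	pthread_mutex_lock(&dev->reclaim_lock);
	dev->drvdata = NULL;
	pthread_mutex_unlock(&dev->reclaim_lock);
}

/*
 * Models the reclaim callback invocation.  Returns true if the squeeze
 * deadline was set, false if the device is not (or no longer) fully linked.
 */
bool model_reclaim(struct model_device *dev, unsigned long now,
		   unsigned long duration)
{
	bool squeezed = false;

	pthread_mutex_lock(&dev->reclaim_lock);
	if (dev->drvdata && dev->drvdata->blkif) {
		dev->drvdata->blkif->buffer_squeeze_end = now + duration;
		squeezed = true;
	}
	pthread_mutex_unlock(&dev->reclaim_lock);
	return squeezed;
}
```

In this model, calling model_reclaim() before model_probe() or after
model_remove() simply returns false instead of dereferencing NULL.  Any
further backend-specific synchronization would still have to live in the
backend itself, as noted above.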


Thanks,
SeongJae Park

> 
> 
> Juergen