linux-kernel - Re: Re: [Xen-devel] [PATCH v10 2/4] xen/blkback: Squeeze page pools if a memory pressure is detected

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20191217113915.GS11756@Air-de-Roger>
Date:   Tue, 17 Dec 2019 12:39:15 +0100
From:   Roger Pau Monné <roger.pau@...rix.com>
To:     SeongJae Park <sj38.park@...il.com>
CC:     <jgross@...e.com>, <axboe@...nel.dk>, <konrad.wilk@...cle.com>,
        <linux-block@...r.kernel.org>, <xen-devel@...ts.xenproject.org>,
        <pdurrant@...zon.com>, <linux-kernel@...r.kernel.org>
Subject: Re: Re: [Xen-devel] [PATCH v10 2/4] xen/blkback: Squeeze page pools
 if a memory pressure is detected

On Mon, Dec 16, 2019 at 08:48:03PM +0100, SeongJae Park wrote:
> On on, 16 Dec 2019 17:23:44 +0100, Jürgen Groß wrote:
> 
> > On 16.12.19 17:15, SeongJae Park wrote:
> > > On Mon, 16 Dec 2019 15:37:20 +0100 SeongJae Park <sjpark@...zon.com> wrote:
> > > 
> > >> On Mon, 16 Dec 2019 13:45:25 +0100 SeongJae Park <sjpark@...zon.com> wrote:
> > >>
> > >>> From: SeongJae Park <sjpark@...zon.de>
> > >>>
> > > [...]
> > >>> --- a/drivers/block/xen-blkback/xenbus.c
> > >>> +++ b/drivers/block/xen-blkback/xenbus.c
> > >>> @@ -824,6 +824,24 @@ static void frontend_changed(struct xenbus_device *dev,
> > >>>   }
> > >>>   
> > >>>   
> > >>> +/* Once a memory pressure is detected, squeeze free page pools for a while. */
> > >>> +static unsigned int buffer_squeeze_duration_ms = 10;
> > >>> +module_param_named(buffer_squeeze_duration_ms,
> > >>> +		buffer_squeeze_duration_ms, int, 0644);
> > >>> +MODULE_PARM_DESC(buffer_squeeze_duration_ms,
> > >>> +"Duration in ms to squeeze pages buffer when a memory pressure is detected");
> > >>> +
> > >>> +/*
> > >>> + * Callback received when the memory pressure is detected.
> > >>> + */
> > >>> +static void reclaim_memory(struct xenbus_device *dev)
> > >>> +{
> > >>> +	struct backend_info *be = dev_get_drvdata(&dev->dev);
> > >>> +
> > >>> +	be->blkif->buffer_squeeze_end = jiffies +
> > >>> +		msecs_to_jiffies(buffer_squeeze_duration_ms);
> > >>
> > >> This callback might race with 'xen_blkbk_probe()'.  The race could result in
> > >> __NULL dereferencing__, as 'xen_blkbk_probe()' sets '->blkif' after it links
> > >> 'be' to the 'dev'.  Please _don't merge_ this patch now!
> > >>
> > >> I will do more test and share results.  Meanwhile, if you have any opinion,
> > >> please let me know.
> 
> I reduced system memory and attached bunch of devices in short time so that
> memory pressure occurs while device attachments are ongoing.  Under this
> circumstance, I was able to see the race.
> 
> > > 
> > > Not only '->blkif', but 'be' itself also coule be a NULL.  As similar
> > > concurrency issues could be in other drivers in their way, I suggest to change
> > > the reclaim callback ('->reclaim_memory') to be called for each driver instead
> > > of each device.  Then, each driver could be able to deal with its concurrency
> > > issues by itself.
> > 
> > Hmm, I don't like that. This would need to be changed back in case we
> > add per-guest quota.
> 
> Extending this callback in that way would be still not too hard.  We could use
> the argument to the callback.  I would keep the argument of the callback to
> 'struct device *' as is, and will add a comment saying 'NULL' value of the
> argument means every devices.  As an example, xenbus would pass NULL-ending
> array of the device pointers that need to free its resources.
> 
> After seeing this race, I am now also thinking it could be better to delegate
> detailed control of each device to its driver, as some drivers have some
> complicated and unique relation with its devices.
> 
> > 
> > Wouldn't a get_device() before calling the callback and a put_device()
> > afterwards avoid that problem?
> 
> I didn't used the reference count manipulation operations because other similar
> parts also didn't.  But, if there is no implicit reference count guarantee, it
> seems those operations are indeed necessary.
> 
> That said, as get/put operations only adjust the reference count, those will
> not make the callback to wait until the linking of the 'backend' and 'blkif' to
> the device (xen_blkbk_probe()) is finished.  Thus, the race could still happen.
> Or, am I missing something?

I would expect the device is not added to the list of backend devices
until the probe hook has finished with a non-error return code. Ie:
bus_for_each_dev should _not_ iterate over devices for which the probe
function hasn't been run to competition without errors.

The same way I would expect the remove hook to first remove the device
from the list of backend devices and then run the remove hook.

blkback uses an ad-hoc reference counting mechanism, but if the above
assumptions are true I think it would be enough to take an extra
reference in xen_blkbk_probe and drop it in xen_blkbk_remove.

Additionally it might be interesting to switch the ad-hoc reference
counting to use get_device/put_device (in a separate patch), but I'm
not sure how feasible that is.

Roger.