linux-kernel - Re: [PATCH BUGFIX] block, bfq: access and cache blkg data only when safe

Message-Id: <C9B2B7D3-1E92-4E22-80FA-8A606643B536@linaro.org>
Date:   Wed, 24 May 2017 17:43:18 +0100
From:   Paolo Valente <paolo.valente@...aro.org>
To:     Tejun Heo <tj@...nel.org>
Cc:     Jens Axboe <axboe@...nel.dk>, linux-block@...r.kernel.org,
        Linux-Kernal <linux-kernel@...r.kernel.org>,
        Ulf Hansson <ulf.hansson@...aro.org>,
        Linus Walleij <linus.walleij@...aro.org>, broonie@...nel.org
Subject: Re: [PATCH BUGFIX] block, bfq: access and cache blkg data only when safe


> On 24 May 2017, at 15:50, Tejun Heo <tj@...nel.org> wrote:
> 
> Hello, Paolo.
> 
> On Wed, May 24, 2017 at 12:53:26PM +0100, Paolo Valente wrote:
>> Exactly, but even after all blkgs, as well as the cfq_group and pd, are
>> gone, the children cfq_queues of the gone cfq_group continue to point
>> to nonexistent objects, until new cfq_set_requests are executed for
>> those cfq_queues.  To try to make this statement clearer, here is the
>> critical sequence for a cfq_queue, say cfqq, belonging to a cfq_group,
>> say cfqg:
>> 
>> 1 cfq_set_request for a request rq of cfqq
>> 2 removal of (the process associated with cfqq) from cfqg
>> 3 destruction of the blkg that cfqg is associated with
>> 4 destruction of the blkcg the above blkg belongs to
>> 5 destruction of the pd pointed to by cfqg, and of cfqg itself
>> !!!-> from now on cfqq->cfqg is a dangling reference <-!!!
>> 6 execution of cfq functions, different from cfq_set_request, on cfqq
>> 	. cfq_insert, cfq_dispatch, cfq_completed_rq, ...
>> 7 execution of a new cfq_set_request for cfqq
>> -> now cfqq->cfqg is again a sane pointer <-
>> 
>> Every function executed at step 6 sees a dangling reference for
>> cfqq->cfqg.
>> 
>> My fix for caching data doesn't solve this more serious problem.
>> 
>> Where have I been mistaken?
> 
> Hmmm... cfq_set_request() invokes cfqg_get() which increases refcnt on
> the blkg, which should pin everything down till the request is done,

Yes, I missed that step, sorry. Still ...

> so none of the above objects can be destroyed before the request is
> done.
> 

... the issue just seems to move to a more subtle position: cfq is OK,
because it protects itself with the rq lock, but blk-mq schedulers don't.
So, the race that leads to the (real) crashes people have reported may
actually be:
1 blkg_lookup is executed on a blkg that is being destroyed: the
  scheduler gets a copy of the content of the blkg, but the RCU
  mechanism doesn't prevent destruction from going on
2 blkg_get is executed on the copy of the original blkg
3 subsequent scheduler operations involving that stale blkg lead to
  the dangling-pointer accesses we have already discussed
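
To make step 2 concrete, here is a minimal user-space sketch of the
difference between an unconditional "get" and a "get only if still
alive". It is not kernel code: struct obj, obj_get, obj_tryget and
obj_put are hypothetical names standing in for a refcounted object such
as a blkg, and the sketch assumes that a refcount of zero means
destruction has already started.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a refcounted object such as a blkg. */
struct obj {
	atomic_int refcnt;	/* assumption: 0 means destruction has started */
};

/* Unconditional get: safe only if the caller already holds a reference. */
static void obj_get(struct obj *o)
{
	atomic_fetch_add(&o->refcnt, 1);
}

/* Conditional get: succeeds only while the object is still alive. */
static bool obj_tryget(struct obj *o)
{
	int old = atomic_load(&o->refcnt);

	while (old > 0) {
		/* On failure, old is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&o->refcnt, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference; returns true when the caller dropped the last one. */
static bool obj_put(struct obj *o)
{
	int old = atomic_fetch_sub(&o->refcnt, 1);

	assert(old > 0);
	return old == 1;
}
```

In this model, a lookup that returns a pointer to an object whose count
has already reached zero followed by obj_get silently "revives" an
object that is on its way to being freed, which is the situation
hypothesized in steps 1-2 above; obj_tryget would instead report
failure and let the caller fall back to something else.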

Could you patiently tell me whether I'm still wrong?

Thanks,
Paolo

> Thanks.
> 
> -- 
> tejun