Message-ID: <20170407130600.GA5035@redhat.com>
Date:   Fri, 7 Apr 2017 15:06:00 +0200
From:   Andrea Arcangeli <aarcange@...hat.com>
To:     Chris Wilson <chris@...is-wilson.co.uk>,
        Martin Kepplinger <martink@...teo.de>,
        Thorsten Leemhuis <regressions@...mhuis.info>,
        daniel.vetter@...el.com, Dave Airlie <airlied@...il.com>,
        intel-gfx@...ts.freedesktop.org, linux-kernel@...r.kernel.org,
        dri-devel@...ts.freedesktop.org
Subject: Re: [PATCH 2/5] i915: flush gem obj freeing workqueues to add
 accuracy to the i915 shrinker

On Fri, Apr 07, 2017 at 11:02:11AM +0100, Chris Wilson wrote:
> On Fri, Apr 07, 2017 at 01:23:44AM +0200, Andrea Arcangeli wrote:
> > Waiting a RCU grace period only guarantees the work gets queued, but
> > until after the queued workqueue returns, there's no guarantee the
> > memory was actually freed. So flush the work to provide better
> > guarantees to the reclaim code in addition of waiting a RCU grace
> > period to pass.
> 
> We are not allowed to call flush_work() from the shrinker, the workqueue
> doesn't have and can't have the right reclaim flags.

I figured the flush_work had to be conditional on "unlock" being true
too in the i915 shrinker (not only the synchronize_rcu_expedited()),
and I already fixed that bit. I didn't think it would be a problem to
wait for the workqueue as long as reclaim didn't recurse on the
struct_mutex (it is a problem if unlock is false, of course, as we
would be back to square one). I didn't get further hangs, and I assume
I've been running through a couple of synchronize_rcu_expedited() and
flush_work() calls (I should add dynamic tracing to be sure).

Also note, I didn't get any lockdep warning when I reproduced the
workqueue hang in 4.11-rc5, so at least as far as lockdep is concerned
there's no problem calling synchronize_rcu_expedited(), and it didn't
notice we were holding the struct_mutex while waiting for the new
workqueue to run.

Also note, recursing on the lock (the unlock-false case) is something
nothing else does. I'm not sure the risk is worth it, or whether you
shouldn't just call mutex_trylock in the shrinker instead of
mutex_trylock_recursive. Recursing on the lock internally in the same
context is one thing, but recursing through the whole reclaim path is
much harder to argue is safe.

You could start dropping objects and wiping vmas and such in the
middle of some kmalloc/alloc_pages that doesn't expect it, and then
crash for other reasons. So this reclaim-recursion model of the
shrinker is quite unique and quite challenging to prove safe if you
keep using mutex_trylock_recursive in i915_gem_shrinker_scan.

Lock recursion in all other places could be dropped without runtime
downsides; the only place where mutex_trylock_recursive makes a design
difference and makes sense to be used is i915_gem_shrinker_scan. The
rest are implementation issues, not fundamental shrinker design, and
it would be nice if all those other mutex_trylock_recursive calls were
removed, leaving only the one in i915_gem_shrinker_scan (or dropping
it from i915_gem_shrinker_scan as well).

mutex_trylock_recursive() should also be patched to use
READ_ONCE(__mutex_owner(lock)), because currently it breaks C: the
plain read of the owner is a data race, and the compiler is free to
reload or tear it.

In fact, in the whole kernel, i915 and msm drm are the only two users
of that function.

Another thing is what value to return from i915_gem_shrinker_scan when
unlock is false and we can't possibly wait for the memory to be freed,
let alone for an RCU grace period. For various reasons I think it's
safer to return the current "free" count, even though we could also
return "0" in that case. There are different tradeoffs: returning
"free" is less likely to trigger an early OOM, as the VM thinks it's
still making progress and will in fact get more free memory shortly.
Returning SHRINK_STOP would also be an option; it would make reclaim
insist more on the other slabs, so it would be more reliable at
freeing memory in a timely way, but more at risk of an early OOM. I
think returning "free" is the better tradeoff of the two, but I
suggest adding a comment, as it's not exactly obvious which is better.
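As a kernel-style sketch of that tradeoff (not standalone-compilable;
the variable names and the elided reclaim logic are illustrative):

```c
static unsigned long
i915_gem_shrinker_scan(struct shrinker *shrinker, struct shrink_control *sc)
{
	unsigned long freed = 0;
	bool unlock;
	/* ... reclaim what we can; unlock records whether we took
	 * struct_mutex ourselves ... */
	if (!unlock)
		/* We could not wait for the deferred freeing (no flush_work,
		 * no RCU grace period). Returning the count we did free keeps
		 * the VM believing progress is being made, which is less
		 * likely to trigger an early OOM than SHRINK_STOP, at the
		 * cost of insisting less on the other slabs. */
		return freed;
	/* ... flush the freeing workqueues, drop the lock ... */
	return freed;
}
```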

Thanks,
Andrea
