linux-kernel - Re: [BUG] fs/super: a possible sleep-in-atomic bug in put

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20171008211357.GV21978@ZenIV.linux.org.uk>
Date:   Sun, 8 Oct 2017 22:13:57 +0100
From:   Al Viro <viro@...IV.linux.org.uk>
To:     Vladimir Davydov <vdavydov.dev@...il.com>
Cc:     Michal Hocko <mhocko@...nel.org>,
        Jia-Ju Bai <baijiaju1990@....com>, torbjorn.lindh@...ta.se,
        linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [BUG] fs/super: a possible sleep-in-atomic bug in put_super

On Sun, Oct 08, 2017 at 06:47:46PM +0300, Vladimir Davydov wrote:
> On Sun, Oct 08, 2017 at 03:03:32AM +0100, Al Viro wrote:
> > On Sun, Oct 08, 2017 at 01:56:08AM +0100, Al Viro wrote:
> > 
> > > What's more, we need to be careful about resize vs. drain.  Right now it's
> > > on list_lrus_mutex, but if we drop that around actual resize of an individual
> > > list_lru, we'll need something else.  Would there be any problem if we
> > > took memcg_cache_ids_sem shared in memcg_offline_kmem()?
> > > 
> > > The first problem is not fatal - we can e.g. use the sign of the field used
> > > to store the number of ->memcg_lrus elements (i.e. stashed value of
> > > memcg_nr_cache_ids at allocation or last resize) to indicate that actual
> > > freeing is left for resizer...
> > 
> > Ugh.  That spinlock would have to be held over too much work, or bounced back
> > and forth a lot on memcg shutdowns ;-/  Gets especially nasty if we want
> > list_lru_destroy() callable from rcu callbacks.  Oh, well...
> > 
> > I still suspect that locking there is too heavy, but it looks like I don't have
> > a better replacement.
> > 
> > What are the realistic numbers of memcg on a big system?
> 
> Several thousand. I guess we could turn list_lrus_mutex into a spin lock
> by making resize/drain procedures handle list_lru destruction as you
> suggested above, but list_lru_destroy() would still have to iterate over
> all elements of list_lru_node->memcg_lrus array to free per-memcg
> objects, which is too heavy to be performed under sb_lock IMHO.

Hmm...  Some observations:
	* struct list_lru_one is fairly small and it doesn't pack well - you
get 25% overhead on it.  Dedicated kmem_cache might've be worth doing.
	* in addition to list_lru_one (all allocated one by one), you have
an array of pointers to them.  And that array never shrinks.  What's more,
the objects pointed to are only 3 times bigger than pointers...
	* how hot is memcg destruction codepath?  The only real argument
in favour of list instead of hlist is list_splice(); if that's not that
critical, hlist would do just fine.  And that drives the size of
struct list_lru_one down to two words, at which point getting rid of
that array of pointers becomes rather attractive.  Amount of cacheline
bouncing might be an issue, but then it might be an overall win; can't
tell without experiments.  It certainly would've simplified the things
a whole lot, including the rollback on allocation failure during resize,
etc.  And list_lru_destroy() would get several orders of magnitude cheaper...
	* coallocating ->list with ->node[] is almost certain worth doing -
AFAICS, it's a clear optimization, no matter whether we do anything else
or not.  Loops by list_lrus would be better off without fetching lru->node
for every entry.  _And_ the objects containing list_lru wouldn't be
reachable via list_lrus.

As for fs/super.c side...  IMO destroy_super() ought to WARN_ON when
it sees non-NULL ->node.  Let alloc_super()/sget_userns() do those
list_lru_destroy() directly for instances that get killed before
becoming reachable via shared data structures; everything else must
go through deactivate_locked_super().  The only reason for list_lru_destroy()
in destroy_super() is the use of the latter for disposal of never-seen-by-anyone
struct super_block instances.  Come to think of that, something like this
might be a good approach:

diff --git a/fs/super.c b/fs/super.c
index 166c4ee0d0ed..8ca15415351a 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -154,21 +154,19 @@ static void destroy_super_rcu(struct rcu_head *head)
 	schedule_work(&s->destroy_work);
 }
 
-/**
- *	destroy_super	-	frees a superblock
- *	@s: superblock to free
- *
- *	Frees a superblock.
- */
-static void destroy_super(struct super_block *s)
+/* Free a superblock that has never been seen by anyone */
+static void destroy_unused_super(struct super_block *s)
 {
+	if (!s)
+		return;
+	up_write(&s->s_umount);
 	list_lru_destroy(&s->s_dentry_lru);
 	list_lru_destroy(&s->s_inode_lru);
 	security_sb_free(s);
-	WARN_ON(!list_empty(&s->s_mounts));
 	put_user_ns(s->s_user_ns);
 	kfree(s->s_subtype);
-	call_rcu(&s->rcu, destroy_super_rcu);
+	/* no delays needed */
+	destroy_super_work(&s->destroy_work);
 }
 
 /**
@@ -256,7 +254,7 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags,
 	return s;
 
 fail:
-	destroy_super(s);
+	destroy_unused_super(s);
 	return NULL;
 }
 
@@ -265,11 +263,17 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags,
 /*
  * Drop a superblock's refcount.  The caller must hold sb_lock.
  */
-static void __put_super(struct super_block *sb)
+static void __put_super(struct super_block *s)
 {
-	if (!--sb->s_count) {
-		list_del_init(&sb->s_list);
-		destroy_super(sb);
+	if (!--s->s_count) {
+		list_del_init(&s->s_list);
+		WARN_ON(s->s_dentry_lru.node);
+		WARN_ON(s->s_inode_lru.node);
+		WARN_ON(!list_empty(&s->s_mounts));
+		security_sb_free(s);
+		put_user_ns(s->s_user_ns);
+		kfree(s->s_subtype);
+		call_rcu(&s->rcu, destroy_super_rcu);
 	}
 }
 
@@ -484,19 +488,12 @@ struct super_block *sget_userns(struct file_system_type *type,
 				continue;
 			if (user_ns != old->s_user_ns) {
 				spin_unlock(&sb_lock);
-				if (s) {
-					up_write(&s->s_umount);
-					destroy_super(s);
-				}
+				destroy_unused_super(s);
 				return ERR_PTR(-EBUSY);
 			}
 			if (!grab_super(old))
 				goto retry;
-			if (s) {
-				up_write(&s->s_umount);
-				destroy_super(s);
-				s = NULL;
-			}
+			destroy_unused_super(s);
 			return old;
 		}
 	}
@@ -511,8 +508,7 @@ struct super_block *sget_userns(struct file_system_type *type,
 	err = set(s, data);
 	if (err) {
 		spin_unlock(&sb_lock);
-		up_write(&s->s_umount);
-		destroy_super(s);
+		destroy_unused_super(s);
 		return ERR_PTR(err);
 	}
 	s->s_type = type;