netdev - Re: [net/sched] Question: Locks for clearing ERR_PTR() value from idrinfo->action

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20240614010548.71803-1-kuniyu@amazon.com>
Date: Thu, 13 Jun 2024 18:05:48 -0700
From: Kuniyuki Iwashima <kuniyu@...zon.com>
To: <penguin-kernel@...ove.sakura.ne.jp>
CC: <davem@...emloft.net>, <edumazet@...gle.com>, <jhs@...atatu.com>,
	<jiri@...nulli.us>, <kuba@...nel.org>, <linux-kernel@...r.kernel.org>,
	<netdev@...r.kernel.org>, <pabeni@...hat.com>, <xiyou.wangcong@...il.com>,
	<kuniyu@...zon.com>
Subject: Re: [net/sched] Question: Locks for clearing ERR_PTR() value from idrinfo->action_idr ?

From: Tetsuo Handa <penguin-kernel@...ove.SAKURA.ne.jp>
Date: Fri, 14 Jun 2024 09:58:48 +0900
> Hello.
> 
> syzbot is reporting hung task problems involving rtnl_muxex. A debug printk()
> patch added to linux-next-20240611 suggested that many of them are caused by
> an infinite busy loop inside tcf_idr_check_alloc().

I think the fix is:
https://lore.kernel.org/netdev/20240613071021.471432-1-druth@chromium.org/


> 
> ----------
> again:
> 		rcu_read_lock();
> 		p = idr_find(&idrinfo->action_idr, *index);
> 
> 		if (IS_ERR(p)) {
> 			/* This means that another process allocated
> 			 * index but did not assign the pointer yet.
> 			 */
> 			rcu_read_unlock();
> 			goto again;
> 		}
> ----------
> 
> Since there is no sleep (e.g. cond_resched()/schedule_timeout_uninterruptible(1))
> before "goto again;", once idr_find() returns an IS_ERR() value, all of that CPU's
> computation resource is wasted forever with rtnl_mutex held (and anybody else who
> tries to hold rtnl_mutex at rtnl_lock() is reported as hung task, resulting in
> various hung task reports waiting for rtnl_mutex at rtnl_lock()).
> 
> Therefore, I tried to add a sleep before "goto again;", but I can't know whether
> a sleep added to linux-next-20240612 solves the hung task problem because syzbot
> currently cannot test linux-next kernels due to some different problem.
> 
> Therefore, I'm posting a question here before syzbot can resume testing of
> linux-next kernels. As far as I can see, the ERR_PTR(-EBUSY) assigned at
> 
> 	mutex_lock(&idrinfo->lock);
> 	ret = idr_alloc_u32(&idrinfo->action_idr, ERR_PTR(-EBUSY), index, max,
> 			    GFP_KERNEL);
> 	mutex_unlock(&idrinfo->lock);
> 
> in tcf_idr_check_alloc() is cleared by either
> 
> 	mutex_lock(&idrinfo->lock);
> 	/* Remove ERR_PTR(-EBUSY) allocated by tcf_idr_check_alloc */
> 	WARN_ON(!IS_ERR(idr_remove(&idrinfo->action_idr, index)));
> 	mutex_unlock(&idrinfo->lock);
> 
> in tcf_idr_cleanup() or
> 
> 	mutex_lock(&idrinfo->lock);
> 	/* Replace ERR_PTR(-EBUSY) allocated by tcf_idr_check_alloc */
> 	idr_replace(&idrinfo->action_idr, a, a->tcfa_index);
> 	mutex_unlock(&idrinfo->lock);
> 
> in tcf_idr_insert_many().
> 
> But is there a possibility that rtnl_mutex is released between
> tcf_idr_check_alloc() and tcf_idr_{cleanup,insert_many}() ? If yes,
> adding a sleep before "goto again;" won't be sufficient. But if no,
> how can
> 
> 	/* This means that another process allocated
> 	 * index but did not assign the pointer yet.
> 	 */
> 
> happen (because both setting ERR_PTR(-EBUSY) and replacing with an !IS_ERR()
> value are done without temporarily releasing rtnl_mutex) ?
> 
> Is there a possibility that tcf_idr_check_alloc() is called without holding
> rtnl_mutex? If yes, adding a sleep before "goto again;" would help. But if no,
> is this a sign that some path forgot to call tcf_idr_{cleanup,insert_many}() ?