Date:   Fri, 15 Jun 2018 08:45:07 -0700
From:   Martin KaFai Lau <kafai@...com>
To:     John Fastabend <john.fastabend@...il.com>
CC:     <ast@...nel.org>, <daniel@...earbox.net>, <netdev@...r.kernel.org>
Subject: Re: [bpf PATCH v2 3/6] bpf: sockhash fix omitted bucket lock in
 sock_close

On Fri, Jun 15, 2018 at 08:23:14AM -0700, John Fastabend wrote:
> On 06/14/2018 10:41 PM, Martin KaFai Lau wrote:
> > On Thu, Jun 14, 2018 at 09:44:57AM -0700, John Fastabend wrote:
> >> First, in tcp_close, reduce the scope of sk_callback_lock(): the lock
> >> is only needed to protect smap_release_sock(); the ingress and cork
> >> lists are protected by the sock lock. Holding the lock over a wider
> >> scope is harmless, but may confuse a reader into inferring it is in
> >> fact needed.
> >>
> >> Next, in sock_hash_delete_elem() the pattern is as follows,
> >>
> >>   sock_hash_delete_elem()
> >>      [...]
> >>      spin_lock(bucket_lock)
> >>      l = lookup_elem_raw()
> >>      if (l)
> >>         hlist_del_rcu()
> >>         write_lock(sk_callback_lock)
> >>          .... destroy psock ...
> >>         write_unlock(sk_callback_lock)
> >>      spin_unlock(bucket_lock)
> >>
> >> The ordering is necessary because we only know the {p}sock after
> >> dereferencing the hash table, which we can't do unless we hold the
> >> bucket lock. Once we have the bucket lock and the psock element, it
> >> is deleted from the hashmap to ensure any other path doing a lookup
> >> will fail. Finally, the refcnt is decremented and, if zero, the
> >> psock is destroyed.
> >>
> >> In parallel with the above (or with free'ing the map), a TCP close
> >> event may trigger tcp_close(), which at the moment omits the bucket
> >> lock altogether (oops!). The flow looks like this,
> >>
> >>   bpf_tcp_close()
> >>      [...]
> >>      write_lock(sk_callback_lock)
> >>      for each psock->maps // list of maps this sock is part of
> >>          hlist_del_rcu(ref_hash_node);
> >>          .... destroy psock ...
> >>      write_unlock(sk_callback_lock)
> >>
> >> Obviously, and demonstrated by syzbot, this is broken because
> >> we can have multiple threads deleting entries via hlist_del_rcu().
> >>
> >> To fix this we might be tempted to wrap the hlist operation in a
> >> bucket lock, but that would create a lock inversion problem: to
> >> follow the locking rules, the maps list needs the sk_callback_lock,
> >> yet we need the bucket lock to do the hlist_del_rcu. To resolve the
> >> lock inversion, note that when bpf_tcp_close is called no updates
> >> can happen in parallel, due to the ESTABLISHED state check in the
> >> update logic, so pop the head of the list repeatedly and drop the
> >> reference until none are left. If a delete happens in parallel from
> >> the BPF API, that is OK as well, because it takes a similar path:
> >> look up the sock in the map/hash, delete it from the map/hash, and
> >> dec the refcnt. We check for this case before destroying the psock,
> >> to ensure we don't have two threads tearing down the same psock. The
> >> new logic is as follows,
> >>
> >>   bpf_tcp_close()
> >>   e = psock_map_pop(psock->maps) // done with sk_callback_lock
> >>   bucket_lock() // lock hash list bucket
> >>   l = lookup_elem_raw(head, hash, key, key_size);
> >>   if (l) {
> >>      // only get here if the element was not already removed
> >>      hlist_del_rcu()
> >>      ... destroy psock...
> >>   }
> >>   bucket_unlock()
> >>
> >> And finally, for all the above to work, add the missing
> >> sk_callback_lock around smap_list_remove() in
> >> sock_hash_ctx_update_elem(). Otherwise delete and update may corrupt
> >> the maps list.
> >>
> >> (As an aside, the sk_callback_lock serves two purposes. The first is
> >>  to protect updates to the sock callbacks sk_data_ready,
> >>  sk_write_space, etc. The second is to protect the psock 'maps'
> >>  list, which is used (as shown above) to delete all map/hash
> >>  references to a sock when the sock is closed.)
> >>
> >> (If we did not have the ESTABLISHED state guarantee from tcp_close,
> >>  then we could not ensure completion, because updates could happen
> >>  forever and pin the thread in the delete loop.)
> >>
> >> Reported-by: syzbot+0ce137753c78f7b6acc1@...kaller.appspotmail.com
> >> Fixes: 81110384441a ("bpf: sockmap, add hash map support")
> >> Signed-off-by: John Fastabend <john.fastabend@...il.com>
> >> ---
> >>  0 files changed
> >>
> 
> ^^^^ Will fix this "0 files changed" as well.
> 
> >>  	struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
> >> @@ -2114,10 +2169,13 @@ static void sock_hash_free(struct bpf_map *map)
> >>  	 */
> >>  	rcu_read_lock();
> >>  	for (i = 0; i < htab->n_buckets; i++) {
> >> -		struct hlist_head *head = select_bucket(htab, i);
> >> +		struct bucket *b = __select_bucket(htab, i);
> >> +		struct hlist_head *head;
> >>  		struct hlist_node *n;
> >>  		struct htab_elem *l;
> >>  
> >> +		raw_spin_lock_bh(&b->lock);
> > There is a synchronize_rcu() at the beginning of sock_hash_free().
> > Taking the bucket's lock of the free-ing map at this point is a bit
> > suspicious.  What may still access the map after synchronize_rcu()?
> > 
> 
> tcp_close() may be called while the map is being freed. The sync_rcu will
> only sync the BPF side.
Thanks for the explanation.

hmm....so the bpf_tcp_close(), which runs under rcu_read_lock(), may still
have a reference to this hash.  Should it wait for another RCU grace period
before freeing the htab->buckets and htab?
