netdev - Re: [PATCH v2 bpf-next 1/8] tcp: seq_file: Avoid skipping sk during tcp_seek_last


Message-ID: <20210722214256.ncuz6k5bjt4vgru6@kafai-mbp.dhcp.thefacebook.com>
Date:   Thu, 22 Jul 2021 14:42:56 -0700
From:   Martin KaFai Lau <kafai@...com>
To:     Kuniyuki Iwashima <kuniyu@...zon.co.jp>
CC:     <ast@...nel.org>, <bpf@...r.kernel.org>, <daniel@...earbox.net>,
        <edumazet@...gle.com>, <kernel-team@...com>,
        <ncardwell@...gle.com>, <netdev@...r.kernel.org>,
        <ycheng@...gle.com>, <yhs@...com>
Subject: Re: [PATCH v2 bpf-next 1/8] tcp: seq_file: Avoid skipping sk during
 tcp_seek_last_pos

On Fri, Jul 23, 2021 at 12:08:10AM +0900, Kuniyuki Iwashima wrote:
> From:   Kuniyuki Iwashima <kuniyu@...zon.co.jp>
> Date:   Thu, 22 Jul 2021 23:16:37 +0900
> > From:   Martin KaFai Lau <kafai@...com>
> > Date:   Thu, 1 Jul 2021 13:05:41 -0700
> > > st->bucket stores the current bucket number.
> > > st->offset stores the offset within this bucket that is the sk to be
> > > seq_show().  Thus, st->offset only makes sense within the same
> > > st->bucket.
> > > 
> > > These two variables are an optimization for the common no-lseek case.
> > > When resuming the seq_file iteration (i.e. seq_start()),
> > > tcp_seek_last_pos() tries to continue from the st->offset
> > > at bucket st->bucket.
> > > 
> > > However, it is possible that the bucket pointed to by st->bucket
> > > has changed and st->offset may end up skipping the whole st->bucket
> > > without finding a sk.  In this case, tcp_seek_last_pos() currently
> > > continues to satisfy the offset condition in the next (and incorrect)
> > > bucket.  Instead, regardless of the offset value, the first sk of the
> > > next bucket should be returned.  Thus, "bucket == st->bucket" check is
> > > added to tcp_seek_last_pos().
> > > 
> > > The chance of hitting this is small and the issue is a decade old,
> > > so targeting for the next tree.
> > 
> > Multiple read()s or lseek()+read() can call tcp_seek_last_pos().
> > 
> > IIUC, the problem happens when the sockets placed before the last shown
> > socket in the list are closed between some read()s or lseek() and read().
> > 
> > I think there is still a case where bucket is valid but offset is invalid:
> > 
> >   listening_hash[1] -> sk1 -> sk2 -> sk3 -> nulls
> >   listening_hash[2] -> sk4 -> sk5 -> nulls
> > 
> >   read(/proc/net/tcp)
> >     end up with sk2
> > 
> >   close(sk1)
> > 
> >   listening_hash[1] -> sk2 -> sk3 -> nulls
> >   listening_hash[2] -> sk4 -> sk5 -> nulls
> > 
> >   read(/proc/net/tcp) (resume)
> >     offset = 2
> > 
> >     listening_get_first() returns sk2
> > 
> >     while (offset--)
> >       1st loop listening_get_next() returns sk3 (bucket == st->bucket)
> >       2nd loop listening_get_next() returns sk4 (bucket != st->bucket)
> > 
> >     show() starts from sk4
> > 
> >     only sk3 is skipped, but it should be shown.
> 
> Sorry, this example is wrong.
> We can handle this properly by testing bucket != st->bucket.
> 
> In the case below, we cannot check if the offset is valid or not by testing
> the bucket.
> 
>   listening_hash[1] -> sk1 -> sk2 -> sk3 -> sk4 -> nulls
> 
>   read(/proc/net/tcp)
>     end up with sk2
> 
>   close(sk1)
> 
>   listening_hash[1] -> sk2 -> sk3 -> sk4 -> nulls
> 
>   read(/proc/net/tcp) (resume)
>     offset = 2
> 
>     listening_get_first() returns sk2
> 
>     while (offset--)
>       1st loop listening_get_next() returns sk3 (bucket == st->bucket)
>       2nd loop listening_get_next() returns sk4 (bucket == st->bucket)
> 
>     show() starts from sk4
> 
>     only sk3 is skipped, but it should be shown.
> 
> 
> > 
> > In listening_get_next(), we can check if we passed through sk2, but this
> > does not work well if sk2 itself is closed... then there is no way to
> > check whether the offset is valid or not.
> > 
> > Handling this may be too much though. What do you think?
There will be cases that miss a sk after the bucket lock is released
(and things then change).  For example, another case could be that
sk_new is added to the head of the bucket, although that could
arguably be treated as a legit miss since "cat /proc/net/tcp" is
already in progress.

The chance of hitting the m->buf limit while that bucket also changes should
be slim.  If there is a use case where lhash2 (already hashed by port+addr)
still has a large bucket (e.g. many SO_REUSEPORT sockets), that would be a
better problem to solve first.  IMO, remembering sk2 just to fix
"cat /proc/net/tcp" is not worth it.
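
To make the two cases in this thread concrete, here is a minimal user-space
sketch of the resume logic.  It is not the kernel code: struct table,
struct pos, sk_at(), seek_last_pos() as written here, and the check_bucket
flag are all hypothetical stand-ins, and a bucket is modelled as a plain
NULL-terminated array of names.  check_bucket = 1 is meant to mimic the
"bucket == st->bucket" test added by the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NBUCKETS 2
#define MAXSK    8

/* Toy stand-in for the listening hash: each bucket is a NULL-terminated
 * array of socket names.  Not the kernel's data structure. */
struct table {
	const char *bucket[NBUCKETS][MAXSK];
};

struct pos {
	int b;	/* bucket index */
	int i;	/* index within the bucket */
};

/* Return the socket at *p, moving on to later buckets when the current
 * one is exhausted.  Returns NULL once every bucket has been walked. */
static const char *sk_at(const struct table *t, struct pos *p)
{
	while (p->b < NBUCKETS && t->bucket[p->b][p->i] == NULL) {
		p->b++;
		p->i = 0;
	}
	return p->b < NBUCKETS ? t->bucket[p->b][p->i] : NULL;
}

/* Resume from (st_bucket, st_offset): start at the head of st_bucket and
 * step forward st_offset times.  With check_bucket set, stop stepping as
 * soon as the walk has left st_bucket. */
static const char *seek_last_pos(const struct table *t, int st_bucket,
				 int st_offset, int check_bucket)
{
	struct pos p = { st_bucket, 0 };
	const char *sk = sk_at(t, &p);
	int offset = st_offset;

	while (offset-- && sk) {
		if (check_bucket && p.b != st_bucket)
			break;
		p.i++;
		sk = sk_at(t, &p);
	}
	return sk;
}

static void run_examples(void)
{
	/* Saved state in both cases: bucket 0, offset 2 (sk1, sk2 shown). */

	/* sk1 and sk2 were closed, so bucket 0 shrank to a single entry. */
	struct table shrunk = { { { "sk3" }, { "sk4", "sk5" } } };
	/* Without the check, the leftover offset also eats sk4 in bucket 1. */
	assert(strcmp(seek_last_pos(&shrunk, 0, 2, 0), "sk5") == 0);
	/* With the check, the walk stops at the first sk of bucket 1. */
	assert(strcmp(seek_last_pos(&shrunk, 0, 2, 1), "sk4") == 0);

	/* Only sk1 was closed; the stale offset stays inside bucket 0. */
	struct table same = { { { "sk2", "sk3", "sk4" } } };
	/* The bucket check cannot help here: sk3 is skipped either way. */
	assert(strcmp(seek_last_pos(&same, 0, 2, 0), "sk4") == 0);
	assert(strcmp(seek_last_pos(&same, 0, 2, 1), "sk4") == 0);
}
```

Calling run_examples() from a main() exercises both scenarios.  In this toy
model the bucket test only limits the damage to the bucket whose contents
changed; it does not recover a sk that the stale offset steps over inside
st->bucket, which is the case discussed above.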

Thanks for the review!