Message-ID: <20061005105536.GA4838@2ka.mipt.ru>
Date:	Thu, 5 Oct 2006 14:55:37 +0400
From:	Evgeniy Polyakov <johnpol@....mipt.ru>
To:	Eric Dumazet <dada1@...mosbay.com>
Cc:	Ulrich Drepper <drepper@...il.com>,
	lkml <linux-kernel@...r.kernel.org>,
	David Miller <davem@...emloft.net>,
	Ulrich Drepper <drepper@...hat.com>,
	Andrew Morton <akpm@...l.org>, netdev <netdev@...r.kernel.org>,
	Zach Brown <zach.brown@...cle.com>,
	Christoph Hellwig <hch@...radead.org>,
	Chase Venters <chase.venters@...entec.com>,
	Johann Borck <johann.borck@...sedata.com>
Subject: Re: [take19 1/4] kevent: Core files.

On Thu, Oct 05, 2006 at 12:45:03PM +0200, Eric Dumazet (dada1@...mosbay.com) wrote:
> On Thursday 05 October 2006 12:21, Evgeniy Polyakov wrote:
> > On Thu, Oct 05, 2006 at 11:56:24AM +0200, Eric Dumazet (dada1@...mosbay.com) wrote:
> > > I may be wrong, but what is currently missing for me is:
> > >
> > > - No hardcoded limit on the max number of events. (A process that can
> > > open XXX.XXX files should be allowed to open a kevent queue with at
> > > least XXX.XXX events.) Right now it is not clear what happens if the
> > > current limit is reached.
> >
> > This would force overflows in the fixed-size memory-mapped buffer.
> > If we remove the memory-mapped buffer, or allow overflows (and thus
> > skipped entries), kevent can easily scale to those limits (tested
> > with xx.xxx events, though).
> 
> What is missing or not obvious is: if events are skipped because of
> overflows, what happens? Do connections get stuck forever? Do we hope
> that everything will restore itself? Is the kernel able to SIGNAL this
> problem to user land?
 
The existing code does not overflow by design, but it can consume a lot
of memory. I was talking about the case when there is some limit on the
number of entries put into the mapped buffer.
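
If such a limit existed, one way to make an overflow visible to user
land, instead of silently skipping entries, would be a dropped counter
in a shared ring header. A minimal sketch only; the names here are
hypothetical, not from the posted patches:

#include <errno.h>

struct ring_hdr {
	unsigned int kidx;    /* kernel producer index */
	unsigned int uidx;    /* user consumer index */
	unsigned int dropped; /* entries skipped on overflow */
};

/* Refuse to overwrite unread entries; count the drop so user land can
 * see the overflow through the shared mapping instead of losing events
 * silently. Returns the slot to fill, or -EAGAIN if the ring is full. */
static int ring_reserve(struct ring_hdr *h, unsigned int size)
{
	if (h->kidx - h->uidx >= size) {
		h->dropped++;
		return -EAGAIN;
	}
	return (int)(h->kidx++ % size);
}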

> > > - In order to avoid touching the whole ring buffer, it might be good
> > > to be able to reset the indexes to the beginning when the ring buffer
> > > is empty. (So if user land is responsive enough to consume events,
> > > only the first pages of the mapping would be used: that saves L1/L2
> > > CPU caches.)
> >
> > And what happens when there are 3 empty slots at the beginning and we
> > need to put 4 ready events there?
> 
> Re-read what I said: when the ring buffer is empty.
> 
> When the ring buffer is empty, the kernel can reset the index right
> before adding XX new events. You read "3 events consumed"; I said: when
> the whole ring buffer is empty, because all previous events were
> consumed by user land, then we can reset the indexes to 0.

It is the same.
What if the ring buffer was grown up to 3 entries, is now empty, and we
need to put 4 entries there? Grow it again?
It can be done easily, but it looks like a workaround, not a solution.
And it is highly unlikely that, in a situation where there are a lot of
events, the ring can ever be empty.
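
For reference, the reset Eric describes is cheap on the producer side.
A minimal sketch, reusing the hypothetical ring_hdr above (a threaded
consumer would need the locking he mentions around this check):

/* If the consumer has caught up (ring completely empty), restart both
 * indexes at 0 before queueing new events, so a responsive consumer
 * keeps touching only the first pages of the mapping (better L1/L2
 * cache behaviour). */
static void ring_maybe_reset(struct ring_hdr *h)
{
	if (h->kidx == h->uidx)
		h->kidx = h->uidx = 0;
}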

> >
> > > A plus would be:
> > >
> > > - A working/usable mmap ring buffer implementation, but I think it's
> > > not mandatory. System calls are not that expensive, especially if you
> > > can batch XX events per syscall (like epoll). The nice thing with a
> > > ring buffer is that we touch fewer cache lines than, say, epoll,
> > > which has a lot of linked structures.
> > >
> > > About mmap, I think you might want a hybrid thing:
> > >
> > > One writable page where user land can write its index (and hold one
> > > or more futexes shared with the kernel), with appropriate thread
> > > locking in case multiple threads want to dequeue events. In the fast
> > > path, no syscalls are needed to maintain this user index.
> > >
> > > XXX read-only pages (for user, but r/w for kernel), where the kernel
> > > writes its own index, and the events of course.
> >
> > The problem is in those xxx pages - how many can we eat per kevent
> > descriptor? It is pinned memory, so a DoS is possible.
> > If xxx above is not enough to store all events, we will have
> > yet-another-broken behaviour, like the rt-signal queue overflow.
> >
> 
> Re-read: I have a process that has the right to open XXX.XXX handles,
> allocating XXX.XXX tcp sockets, dentries, file structures, inodes,
> epoll events; it's obviously already a DoS risk, but one controlled by
> 'ulimit -n'.
> 
> Allocating XXX.XXX * (32 or 64) bytes is a win if I can zap the epoll
> structures (currently more than 256 bytes per event).
> 
> epoll structures are pinned too... what's wrong with that?
> 
> # egrep "filp|poll|TCP|dentries|sock_inode" /proc/slabinfo |cut -c1-50
> tw_sock_TCP         1302   2200    192   20    1 :
> request_sock_TCP    2046   4260    128   30    1 :
> TCP               151509 196910   1472    5    2 :
> eventpoll_pwq     146718 199439     72   53    1 :
> eventpoll_epi     146718 199360    192   20    1 :
> sock_inode_cache  149182 197940    640    6    1 :
> filp              149537 202515    256   15    1 :
> 
> If you want to protect against DoS, just use ulimit -n 100
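
As a rough check on Eric's "more than 256 bytes" figure, using the
slabinfo output above (object size in bytes is the fourth column) and
assuming one eventpoll_epi plus one eventpoll_pwq per registered event,
with the 64-byte ring entry from his estimate:

    eventpoll_epi (192) + eventpoll_pwq (72) ~= 264 bytes per event
    146718 active events * 264 bytes ~= 38.7 MB pinned by epoll
    146718 ring slots    *  64 bytes ~=  9.4 MB for a mapped event ring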

epoll() does not have mmap.
The problem is not how many events can be put into the kernel, but how
many of them can be put into the mapped buffer.
There is no problem if mmap is turned off.
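
For completeness, Eric's hybrid layout above maps onto two regions
roughly like this. A sketch only, with hypothetical names and a
stand-in event record; the actual take19 patches lay the mapping out
differently:

struct ring_event {		/* stand-in for the real ukevent record */
	unsigned int id;
	unsigned int ret_flags;
};

/* Page 0: writable by user land, read by the kernel. The consumer
 * index lives here, so the fast path needs no syscall; the futex word
 * lets consumers sleep until the kernel wakes them after queueing. */
struct user_page {
	unsigned int uidx;	/* consumer index, written by user land */
	unsigned int futex;	/* kernel wakes waiters here */
};

/* Pages 1..XXX: read-only for user land, written by the kernel - the
 * producer index followed by the event records. These are exactly the
 * pinned pages the DoS question above is about. */
struct kernel_pages {
	unsigned int kidx;		/* producer index */
	struct ring_event events[];	/* fills the remaining pages */
};

User land would advance uidx after consuming events and wait on the
futex word when uidx == kidx; how many of those XXX pages the kernel
may pin per descriptor is the open accounting question.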

> Eric

-- 
	Evgeniy Polyakov
