Message-Id: <200901260228.54098.pisa@cmp.felk.cvut.cz>
Date: Mon, 26 Jan 2009 02:28:53 +0100
From: Pavel Pisa <pisa@....felk.cvut.cz>
To: Davide Libenzi <davidel@...ilserver.org>
Cc: Linux Kernel Mailing List <linux-kernel@...r.kernel.org>
Subject: Re: Unexpected cascaded epoll behavior - my mistake or kernel bug
On Saturday 24 January 2009 22:58:45 Davide Libenzi wrote:
> On Fri, 23 Jan 2009, Pavel Pisa wrote:
> > Hello Davide and all others,
> >
> > I have got around to implementing yet another event library and I see
> > something strange when a level triggered epoll_wait() monitors another
> > level triggered epoll set. The top level epoll_wait() returns information
> > about one pending event. The event points correctly to the monitored
> > second level epoll fd, but epoll_wait() on this fd with a zero or small
> > timeout returns no events, and the top level epoll reporting is not reset
> > at this point, so the program enters a busy loop. If there are events,
> > they are processed correctly. More interesting is that the ill behavior
> > is prevented if I change the number of debug writes into an unrelated fd
> > or if the application is slowed down by strace to standard output.
> > Fortunately, strace to a file does not hide the ill behavior, so the
> > strange behavior of my code or the kernel can be documented in the
> > attached straces. Both traces document the cause of the busy loop for a
> > case where an epoll set with fd #3 is created and that set monitors
> > fd #0. When a character arrives at fd #0, a new epoll set with fd #4 is
> > created and fd #3 is added into this new top level set.
> >
> > epoll_create(8) = 3
> > epoll_ctl(3, EPOLL_CTL_ADD, 0, {EPOLLIN, {u32=6353264, u64=6353264}}) = 0
> > epoll_wait(3, {{EPOLLIN, {u32=6353264, u64=6353264}}}, 8, 20000) = 1
> > read(0, "E"..., 1) = 1
> > ...
> > epoll_create(8) = 4
> > epoll_ctl(4, EPOLL_CTL_ADD, 3, {EPOLLIN, {u32=6353664, u64=6353664}}) = 0
> > epoll_wait(4, {{EPOLLIN, {u32=6353664, u64=6353664}}}, 8, 19998) = 1
> > epoll_wait(3, {{EPOLLIN, {u32=6353264, u64=6353264}}}, 8, 0) = 1
> > read(0, "\n"..., 1) = 1
> > epoll_wait(4, {{EPOLLIN, {u32=6353664, u64=6353664}}}, 8, 19998) = 1
> > epoll_wait(3, {}, 8, 0) = 0
> > epoll_wait(4, {{EPOLLIN, {u32=6353664, u64=6353664}}}, 8, 19998) = 1
> > epoll_wait(3, {}, 8, 0) = 0
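
(For reference, the strace above corresponds roughly to the minimal
standalone sketch below. The ep_inner/ep_outer names and the fixed fd
numbers are only illustrative; this is not code from my library.)

#include <sys/epoll.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        struct epoll_event ev, out;
        int ep_inner, ep_outer, n;
        char c;

        /* inner set, watches stdin (fd 0), level triggered */
        ep_inner = epoll_create(8);
        ev.events = EPOLLIN;
        ev.data.fd = 0;
        epoll_ctl(ep_inner, EPOLL_CTL_ADD, 0, &ev);

        /* outer set, watches the inner epoll fd itself */
        ep_outer = epoll_create(8);
        ev.events = EPOLLIN;
        ev.data.fd = ep_inner;
        epoll_ctl(ep_outer, EPOLL_CTL_ADD, ep_inner, &ev);

        for (;;) {
                /* the outer set reports the inner set as readable ... */
                if (epoll_wait(ep_outer, &out, 1, 20000) <= 0)
                        break;
                /* ... but the inner wait with a zero timeout can return
                   0 events, which is the busy loop shown in the strace */
                n = epoll_wait(ep_inner, &out, 1, 0);
                printf("inner epoll_wait returned %d\n", n);
                if (n > 0)
                        read(0, &c, 1);
        }
        return 0;
}
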
> >
> > You may ask why this terrible clam starts writing new event library
> > code and why he needs to cascade event handling mechanisms. Let me
> > defend myself a little. If you would rather not beat me over it, you
> > can skip the next section.
>
> Pavel, can you give the patch below a spin? It's working fine in my two
> boxes, but it'd be great if you could confirm it too.
> Attached there's also a simple test program to check epoll behavior,
> especially the one related to recursion.
>
Hello Davide,
thanks for the fast reply and the effort.
I have tested with both a patched and an unpatched 2.6.28.2 kernel.
The outputs are attached. The patched kernel passes all your tests.
But I have bad news. My library code no longer falls into the busy loop
with your patch applied, but my FIFO tests do not work even in the
single level epoll scenario. The strace output is attached as well.
I will try to do more testing tomorrow. I returned from a weekend trip
this evening and do not have much time for deeper analysis.
But it looks like level triggered write events are not delivered
by epoll at all. The pipe is not filled with any character and the
specified timeout expires, with the following messages reported as failures:
@ pipe #X evptrig wr timeout
@ pipe #X evptrig rd timeout
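
To be clearer about what I mean by write level sensitive events: the
FIFO test waits for level triggered EPOLLOUT on the pipe write end
before writing, roughly as in the hypothetical sketch below (this is
not the actual library code, just the idea):

#include <sys/epoll.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        struct epoll_event ev, out;
        int pfd[2], ep, n;

        pipe(pfd);

        /* level triggered EPOLLOUT on the pipe write end: as long as
           the pipe buffer has room, epoll_wait() should report it */
        ep = epoll_create(8);
        ev.events = EPOLLOUT;
        ev.data.fd = pfd[1];
        epoll_ctl(ep, EPOLL_CTL_ADD, pfd[1], &ev);

        n = epoll_wait(ep, &out, 1, 1000);
        if (n == 1 && (out.events & EPOLLOUT))
                write(pfd[1], "x", 1);               /* expected path */
        else
                printf("no EPOLLOUT, n = %d\n", n);  /* what the FIFO test hits */
        return 0;
}

On the patched kernel it looks like the timeout branch is hit every
time, so the pipes never receive any characters.
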
If you want to test the code on your box, download the library from
http://cmp.felk.cvut.cz/~pisa/ulan/ul_evpoll-090123.tar.gz
A simple "make" in the top directory should work.
Then you can check the correct behavior with the default sysvpoll
event monitoring. Run the program "_compiled/bin-utils/ul_evpchk".
If you press "p" once or several times and then Enter, you
should see that fd 0 receives your characters:
<5>@ fd #0 evptrig 1 count 1
'p'
Then there should be 10 burst series of 50 pipe writes and reads.
Each series is separated by a short timed delay.
The printout of the last few reads and writes looks like this:
<5>@ pipe #3 rd evptrig 1 count wr 499 rd 498
<5>@ pipe #4 wr evptrig 2 count wr 500 rd 499
<5>***** ul_evpoll_dispatch returned 1 *****
<5>***** ul_evpoll_dispatch called *****
<5>@ pipe #3 rd evptrig 1 count wr 500 rd 499
<5>@ pipe #3 rd final count reached
<4>@ pipe rd #3 wr #4 destroyed
<3>@ pipe active count reached 0 in 20131
The important part is the last line, which confirms that all FIFOs
have reached the final count and that there are 0 active pipes left.
The last number reports the time in ms spent before all pipes
finished. There are some distracting prints about timers 1, 2 and 3
which test the timer support in parallel.
If you would like to get rid of these, you can comment out
or disable the timer lines in
ulevloop/ul_evpchk.c
- if(1) {
+ if(0) {
add_my_timer(2000, 0);
add_my_timer(3000, 20);
add_my_timer(1400, 10);
}
You can then check the code with the epoll event monitoring
mechanism:
_compiled/bin-utils/ul_evpchk lnxepoll
It works the same way (correctly) without the patch.
You can try the glib support as well - surprisingly, with the parameter "glib".
You can start with "lnxepoll" again and then press "C"+Enter
to cascade epoll into the glib main loop. Again, this worked correctly in the past.
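
(The glib cascading is nothing more than treating the epoll fd as an
ordinary readable fd in the glib main loop. The hypothetical sketch
below shows the idea; it is not the actual ul_evpoll glue code.)

/* build: gcc cascade.c $(pkg-config --cflags --libs glib-2.0) */
#include <glib.h>
#include <sys/epoll.h>
#include <unistd.h>

static int ep;          /* inner epoll set driven from the glib loop */

static gboolean ep_ready(GIOChannel *src, GIOCondition cond, gpointer data)
{
        struct epoll_event evs[8];
        int i, n;
        char c;

        /* glib reports the epoll fd as readable, so drain the pending
           events with a zero timeout and dispatch them */
        n = epoll_wait(ep, evs, 8, 0);
        for (i = 0; i < n; i++)
                read(evs[i].data.fd, &c, 1);    /* placeholder dispatch */
        return TRUE;                            /* keep the watch alive */
}

int main(void)
{
        struct epoll_event ev;
        GMainLoop *loop = g_main_loop_new(NULL, FALSE);

        /* the inner epoll set watches stdin, level triggered */
        ep = epoll_create(8);
        ev.events = EPOLLIN;
        ev.data.fd = 0;
        epoll_ctl(ep, EPOLL_CTL_ADD, 0, &ev);

        /* register the epoll fd itself as a readable source in glib */
        g_io_add_watch(g_io_channel_unix_new(ep), G_IO_IN, ep_ready, NULL);
        g_main_loop_run(loop);
        return 0;
}
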
If you cascade with "E"+Enter, then cascading of epoll into epoll
is requested, and this failed in the default setup. If the output
was redirected, it worked correctly in some situations.
The other tests can be invoked by other letters:
"t" .. run 1000 active FDs in parallel
"T" .. run 10000 active FDs in parallel
"l" .. create 1000 inactive FDs as load for epoll
"L" .. create 8000 inactive FDs
All of these were working with all mechanisms before the patching,
even with cascading, except for the epoll-on-epoll cascade.
These huge tests switch to a low debug level, so only errors
and the final reports are seen after the test letter.
The glib based test for 10000 FDs takes horribly long, so I have
not found the patience to run it in full.
Here are some results for different tests and mechanisms.
Each test is invoked by the key sequence shown on its first line. The first
number is the number of active FDs (500 events scheduled for each), the
second is the number of inactive FDs, the third is the mechanism (cascaded
means epoll cascaded on glib), and the last number is the time in ms.
All tests have been run on a Core 2 Duo T7300 @ 2.00GHz.
t
1000 0 sysvpoll 1369
1000 0 epoll 1570
1000 0 glib 3046
1000 0 cascaded 1783
tttttttttt
10000 0 sysvpoll 17774
10000 0 epoll 20750
10000 0 glib 445341
10000 0 cascaded 23098
tltltltltltltltltltl
10000 10000 sysvpoll 18924
10000 10000 epoll 20748
10000 10000 glib ?
10000 10000 cascaded 23370
llltllltll
2000 8000 sysvpoll 4014
2000 8000 epoll 3205
2000 8000 glib 87357
2000 8000 cascaded 3718
Thanks again for the testing, and if you have some time and interest,
you can try my kernel torture on your boxes,
Pavel
View attachment "epoll_test-2.6.28-orig.out" of type "text/plain" (3482 bytes)
View attachment "epoll_test-2.6.28-patched.out" of type "text/plain" (3442 bytes)
View attachment "ul_evpchk-lnxepoll-2.6.28.2-patched.strace" of type "text/plain" (11013 bytes)